A fast, accurate and mathematically rigorous method of training algorithms for reinforcement learning tasks has been developed by KAUST researchers. Improving on current best-practice optimization methods, the new approach has broad applications, from training robots to walk, to developing video games that can outsmart players.
Similar to the learning process in animals and humans, reinforcement learning works by providing reward feedback when an algorithm is close to achieving a specified goal. The closer the algorithm gets to performing a task correctly, the more it is rewarded.
While many machine learning methods, such as stochastic gradient descent, use derivative information to plot the fastest path to convergence, this information is not available in reinforcement learning. This makes it more challenging to train algorithms.
As a result, reinforcement learning methods tend to rely on guesswork. “Much of the work on algorithms in this area is heuristic; there is no mathematical guarantee that they will work,” says Peter Richtárik, computer scientist at KAUST’s Visual Computing Center.
To tackle this issue, Richtárik developed a derivative-free algorithm that was able to reach an optimal value using just one parameter [1]. Known as simplified direct search (SDS), the algorithm provided a quicker and more reliable alternative to the standard direct search method.
“It was the best approach at the time, but we have since improved it,” says Richtárik.
Working with computer scientist El Houcine Bergou, Richtárik introduced randomness into the SDS algorithm, streamlining the process further. The modified approach, called the stochastic three points method, compares three candidate points at each iteration and proceeds with the one that gives the best objective value [2].
“This method is extremely simple to use in practice,” says Bergou. “Its analysis is also simple compared with state-of-the-art methods.”
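The three-point idea can be sketched in a few lines: at each iteration, draw a random direction, evaluate the objective at the current point and at a small step forward and backward along that direction, and keep whichever of the three is best. The function names, step-size schedule and quadratic test problem below are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def stochastic_three_points(f, x0, alpha=0.5, iterations=1000, seed=0):
    """Derivative-free minimization with the stochastic three points
    (STP) scheme: at each step, sample a random direction s and keep
    the best of x, x + step*s, and x - step*s."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(iterations):
        s = rng.standard_normal(x.shape)
        s /= np.linalg.norm(s)              # unit-length random direction
        step = alpha / np.sqrt(k + 1)       # decaying step size (an assumption)
        candidates = [x, x + step * s, x - step * s]
        x = min(candidates, key=f)          # keep the best of the three points
    return x

# Example: minimize a simple quadratic using only function evaluations.
x_min = stochastic_three_points(lambda x: float(np.sum(x**2)), x0=[3.0, -2.0])
```

Because the current point is always among the candidates, the objective value never increases, which is what makes the method's analysis comparatively simple.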
Developing the process further with Ph.D. student Adel Bibi, Richtárik and Bergou added importance sampling into the mix, which enables the method to sample the most informative search directions with higher probability.
When the team used this approach to teach simulated humanoid, ant and cheetah robots to walk in a reinforcement learning scenario, they found that their method outperformed leading standard practices [3].
“The proposed algorithm enjoys theoretical convergence rates that are the best we know of in the derivative-free optimization literature,” says Bibi.
The researchers also applied momentum, another technique that fast tracks convergence, to the stochastic three points method with importance sampling to control the movements of the virtual robots [4].
While momentum is most often used in deep learning, this was the first time it had been applied in derivative-free optimization.
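The intuition can be illustrated by letting the accepted displacement carry over into the next iteration as a velocity, so successful directions build up speed. This is a simplified heavy-ball sketch under assumed parameters, not the exact momentum update analyzed in the paper.

```python
import numpy as np

def stp_with_momentum(f, x0, alpha=0.3, beta=0.5, iterations=1000, seed=0):
    """Illustrative heavy-ball variant of STP: the accepted displacement
    is remembered as a velocity v that is partially carried into the
    next step's candidate moves (a sketch, not the paper's exact rule)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for k in range(iterations):
        s = rng.standard_normal(x.shape)
        s /= np.linalg.norm(s)               # unit-length random direction
        step = alpha / np.sqrt(k + 1)
        # Candidate displacements: stay put, or move +/- along s plus momentum.
        moves = [np.zeros_like(x), beta * v + step * s, beta * v - step * s]
        best = min(moves, key=lambda d: f(x + d))
        x, v = x + best, best                # update position and velocity
    return x

x_min = stp_with_momentum(lambda x: float(np.sum(x**2)), x0=[2.0, -1.0])
```

Keeping the "stay put" candidate preserves the monotone decrease of the objective, while the momentum term lets consecutive improvements compound rather than restart from scratch each iteration.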
“We did the analysis and showed that it works mathematically,” says Richtárik. “It’s better than any other robot training approaches that have been proposed.”
The next step for Richtárik and his colleagues is to apply these new optimization methods to other reinforcement learning tasks, such as training real robots and drones.
“We like doing stuff where we can have guarantees and it works in practice,” says Richtárik. “That’s really the jackpot.”
1. Konečný, J. & Richtárik, P. Simple complexity analysis of simplified direct search. arXiv:1410.0390 (2014).
2. Bergou, E. H., Gorbunov, E. & Richtárik, P. Stochastic three points method for unconstrained smooth minimization. arXiv:1902.03591 (2019).
3. Bibi, A., Bergou, E., Sener, O., Ghanem, B. & Richtárik, P. A stochastic derivative-free optimization method with importance sampling: Theory and learning to control. arXiv:1902.01272 (2019).
4. Gorbunov, E., Bibi, A., Sener, O., Bergou, E. & Richtárik, P. A stochastic derivative-free optimization method with momentum. arXiv:1905.13278 (2019).