
An Introduction to Reinforcement Learning

  • The Codess
  • 2 hours ago
  • 5 min read

Reinforcement learning is probably the subcategory of machine learning you are most familiar with. It is exactly what it sounds like: an agent (the computer) learns how to interact with an environment by picking actions that result in rewards, a.k.a. reinforcements. The agent's main goal is to collect the highest possible total reward. The agent doesn't know in advance what the best behavior is, so it learns through trial and error.


Does this sound familiar? Maybe the word “reinforcement” brings up images of dog training. If I tell my dog to sit and he does, he will get a treat. If not, he receives no treats. The dog’s objective is to get the most treats (naturally). Indeed, this is where the idea for reinforcement learning for computers stemmed from: animal behavioral research. 


I think people are shocked when I tell them most of the foundational science behind machine learning was published in the 1950s-80s, but it's true! Alan Turing (yes, that Turing!) proposed a "pleasure-pain system" in which "when a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent" (Turing, 1948). The earliest maze-traversing machine may have been built in 1933 by Thomas Ross; it recorded a maze path through a series of switch settings.


Reinforcement learning is extremely popular. It is how labs teach their robot dogs and humanoid robots to move around an obstacle course. It is also how AI systems were taught to compete at board games like chess and Go (AlphaGo being the famous example). It's easy to measure the success of these algorithms, since there is a clearly defined goal and success is quantifiable as the total reward received.


But back to our computer. Let’s call him Ralph. Ralph has no understanding of treats; he has no feelings of taste, hunger, or joy. Therefore, we cannot simply offer him a biscuit and expect him to know what to do. Instead, mathematicians did what they do and came up with a set of formulas to encourage our robot to do things. 


We have the Bellman equation, which relates the current state to future states to calculate a value function. Let's relate this to Ralph, our robot. Let's say we want Ralph to complete a maze. Once Ralph reaches the end position, he will receive a treat, a.k.a. a reward, r (+1). Ralph can choose actions, a, of moving left, right, or straight. Ralph's state, s, is just his current position within the maze. Here is what the equation looks like:


V𝜋(s) = Σa∈A 𝜋(a|s) Σs'∈S P(s'|s,a) [R(s,a) + 𝛾V𝜋(s')]

Now, let's break it down in words. The value function of a state under a policy, V𝜋(s), is saying: "How many treats does Ralph get now, plus how many treats does Ralph expect to get later, assuming he keeps following the same strategy (his policy, 𝜋)?"


  • The first summation, Σa∈A 𝜋(a|s), is read in words as: "The probability of taking an action, a, given the current state, s, summed over every action in the set of all possible actions". In our example, it tells us what Ralph wants to do based on where he is. If he is facing a wall, there is a very low probability he is going to move forward, and a high probability he's going to move left or right. If the maze exit is to his left, there is a very high probability Ralph is going to choose to move left.

  • The next summation sums over the next possible states, s'. If Ralph is facing a wall, the next possible states are to the left or right of his current position. The transition probability, P(s'|s,a), says: given the selected action, what is the likelihood Ralph will move from his current position to that next position? For example, if Ralph chooses to move forward but a wall is in front of him, he is not likely to end up forward.

  • The final part, [R(s,a) + 𝛾V𝜋(s')], says: "Okay, let's add the reward Ralph got from the current state and action to the treats he expects to get if he keeps moving the same way". So if he got no treats, but can see the exit, this will encourage Ralph to keep moving the same way. If Ralph hits a wall and can't see the exit, Ralph will want to try a new path because he is discouraged. The discount factor, 𝛾, tells Ralph how much to value future rewards. In this case, we want 𝛾 to be high because Ralph will only get a treat if he makes it all the way to the end.
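
To make the formula concrete, here is a minimal sketch of evaluating V𝜋(s) by sweeping the Bellman equation until the values stop changing. The five-square "corridor" standing in for the maze, the 80/20 policy, and the +1 treat at the exit are all made-up numbers for illustration, not anything from a real library.

```python
# Illustrative policy evaluation on a tiny 1-D "maze": states 0..4, exit at 4.
# All numbers (layout, policy, rewards) are invented for this example.

GAMMA = 0.9                      # discount factor: how much Ralph values future treats
STATES = [0, 1, 2, 3, 4]         # positions in the corridor; 4 is the exit (terminal)
ACTIONS = [-1, +1]               # move left or right

def step(s, a):
    """Deterministic model: next state and reward for taking action a in state s."""
    s_next = min(max(s + a, 0), 4)        # walls at both ends: bumping keeps you in place
    reward = 1.0 if s_next == 4 else 0.0  # +1 treat only for reaching the exit
    return s_next, reward

def policy(a, s):
    """A fixed (not very clever) policy: 80% chance of moving right, 20% left."""
    return 0.8 if a == +1 else 0.2

# Sweep the Bellman equation V(s) = sum_a pi(a|s) * [R(s,a) + gamma * V(s')]
# until the values stop changing. (The transition probabilities P(s'|s,a) are
# all 1 here because this toy corridor is deterministic.)
V = {s: 0.0 for s in STATES}
for _ in range(100):
    delta = 0.0
    for s in STATES[:-1]:                 # state 4 is terminal, its value stays 0
        new_v = 0.0
        for a in ACTIONS:
            s_next, r = step(s, a)
            new_v += policy(a, s) * (r + GAMMA * V[s_next])
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break

print({s: round(v, 3) for s, v in V.items()})   # values rise as Ralph nears the exit
```

The printed values grow as the state gets closer to the exit, which is exactly the "more expected treats ahead" intuition from the bullets above.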


So, now we have a mathematical way to "motivate" our robot to act the way we want. Further, we have different types of algorithms depending on how much information the agent has about its environment. For example, when you look at a maze on a piece of paper and trace out the clear path before marking it with your pen, the entire "environment" is accessible to you, so you rely more on planning your path. Algorithms that already have a description of their environment are called "model-based" algorithms. These methods tend to learn faster because they can plan ahead, but the classical planning versions only work on discrete environments: the states have to be countable, whole-numbered things. That works well on checkerboard grids where you can be on square 1, 2, 3, 4, etc., but not on problems where positions take fractional values.
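
As a rough illustration of what planning with a known model looks like, here is a sketch of value iteration on the same made-up five-square corridor. Because the transition and reward model is assumed to be known up front, the best values (and the plan) can be computed without Ralph ever walking the maze; every number here is an assumption for the example.

```python
# Model-based planning sketch: value iteration on a made-up corridor.
# The step() model is given to the algorithm, so it can plan without acting.

GAMMA = 0.9
STATES = [0, 1, 2, 3, 4]   # 4 is the exit (terminal)
ACTIONS = [-1, +1]

def step(s, a):
    s_next = min(max(s + a, 0), 4)
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward

V = {s: 0.0 for s in STATES}
for _ in range(100):
    delta = 0.0
    for s in STATES[:-1]:
        # Back up using the BEST action instead of averaging over a fixed policy.
        best = max(r + GAMMA * V[s_next]
                   for s_next, r in (step(s, a) for a in ACTIONS))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-8:
        break

# The plan is read straight off the values: in each state, pick the action
# whose backed-up value is highest.
plan = {s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
        for s in STATES[:-1]}
print(V, plan)   # the plan is "always move right", as you'd expect
```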


On the other hand, if the robot has no information about the environment, as if it were blindfolded and placed in front of the maze, it is considered "model-free". These algorithms rely on trial and error, since there is no prior knowledge of the environment. Many of these methods are referred to as temporal-difference (TD) methods because they learn by nudging their current prediction of future rewards toward a prediction made one step later, a trick called bootstrapping.
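
Here is a rough sketch of that bootstrapping idea in code (the TD(0) update for state values). Ralph never sees inside the environment function; he only samples it by acting. The corridor layout, the random exploration, and the step size are all illustrative assumptions.

```python
import random

# Model-free / temporal-difference sketch: Ralph has no model, so he updates
# his value estimate from experience, nudging V(s) toward the one-step
# bootstrapped target  r + gamma * V(s').  All numbers are illustrative.

GAMMA, ALPHA = 0.9, 0.1          # discount factor and learning rate (step size)
V = {s: 0.0 for s in range(5)}   # states 0..4; 4 is the exit

def env_step(s, a):
    """Stand-in environment: Ralph can't see inside this, he only samples it."""
    s_next = min(max(s + a, 0), 4)
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    while s != 4:
        a = random.choice([-1, +1])          # explore by trial and error
        s_next, r = env_step(s, a)
        target = r + GAMMA * V[s_next]       # prediction of future treats (bootstrap)
        V[s] += ALPHA * (target - V[s])      # move the current prediction toward it
        s = s_next

print({s: round(v, 2) for s, v in V.items()})
```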


The model-free methods can be broken down further into off-policy and on-policy algorithms. Policy refers to the plan the robot has to reach its goal. In our Ralph example, his policy could look like "I am going to go straight until I hit a wall. If I hit a wall, I'm always going to go right. I'll repeat until I reach the exit." An on-policy Ralph only learns about the plan he is actually following; he doesn't evaluate the options he didn't take. So, Ralph will continue to go right until he realizes he has gone in a circle, at which point he will change his plan to alternating left and right. An off-policy Ralph doesn't have to alter his behavior right away. He goes right, but learns that left would have been the better option after hitting a wall, and remembers it as the better option even as he continues in a circle. So, where an on-policy Ralph might write his path on the maze in pencil so he can go back and erase it, an off-policy Ralph writes two paths in pen at the same time: one all scribbly from backtracking and one a straight, orderly line from start to finish.
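
In code, the on-policy vs. off-policy distinction comes down to which action's value feeds the bootstrapped target. The sketch below shows the two classic update rules side by side (SARSA for on-policy, Q-learning for off-policy); the function names, numbers, and the tiny usage example are all illustrative, not a full training loop.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy (SARSA): learn from the action Ralph actually takes next."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy (Q-learning): learn about the best available next action,
    even if Ralph's current plan makes him do something else."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Tiny usage example with made-up transitions from the corridor maze:
Q = defaultdict(float)
sarsa_update(Q, s=3, a=+1, r=1.0, s_next=4, a_next=+1)        # reaching the exit
q_learning_update(Q, s=2, a=+1, r=0.0, s_next=3, actions=[-1, +1])
print(dict(Q))   # the value of the exit-reaching step propagates backward
```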


These are the fundamental workings behind reinforcement learning. These algorithms are fun to work with because their foundation in human and animal psychology makes them easy to grasp. It is also easy to see whether the model is "performing well," since there is a clear goal in mind. This lends itself well to robotics and to the 'Computer' player you face in video games. As models receive more training experience, they become more capable (think hard mode vs. easy mode). It is also why, in games whose AI keeps learning, the 'Computer' player gets harder to beat with every new match: the computer 'remembers' your strategy to better defeat you.


