  • Adwith Malpe

Diving into Reinforcement Learning

For my capstone project, I found that the best way to give my DeepRacer autonomy was to develop and deploy a machine learning algorithm that lets the car learn from its mistakes and rewards it for every correct decision it makes.

Before I explain what machine learning and reinforcement learning are, I want to say a little about AWS DeepRacer. AWS DeepRacer is a 1/18 scale, fully autonomous race car that can run both in a virtual environment and in the physical world. To give the DeepRacer its autonomy, machine learning engineers typically learn reinforcement learning algorithms and implement them in the vehicle's software.

Reinforcement learning is one of three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Reinforcement learning works by establishing an agent that interacts with an environment and is positively or negatively rewarded based on the actions it takes. Supervised learning is example-driven training: the model is given labeled data, meaning inputs paired with known outputs, and is trained to predict the output for new inputs. Unsupervised learning is inference-based training on unlabeled data, where no outputs are known in advance; the model is trained to identify related structures or similar patterns within the input data. When working with an RL model in a simulated environment, I have to build, train, and evaluate the model, then repeat the process until it has been optimized. Once the model performs as well as I can get it to, I can deploy it to the car and test it in the physical environment.

At a high level, the process I am following to achieve autonomy consists of these steps:

Create Model => Configuration => Reward function and Hyperparameters => Train => Evaluate => Deploy to Car

Before simulating anything, I created my model in the AWS DeepRacer console to set up the environment and establish a proper configuration. From there, I was able to train and evaluate my model in the simulator and then determine whether it was ready to be deployed to the car. I am still improving my model, so I will show a demo of how it works later on. For now, here is an image of one of my simulations:

In my model, the agent (the DeepRacer) continuously learns how to avoid oncoming objects through trial and error as it interacts with the track environment. An agent is the entity that chooses a behavior, known as an action, based on its experience of interacting with its environment. In my model, the action is the vehicle's steering choice: the direction the DeepRacer decides to veer when it encounters an object within a set distance. Actions can be discrete or continuous. With discrete actions, the agent picks from a fixed, finite set of choices, such as a few predefined steering directions, so the action the DeepRacer takes here is a discrete action. With continuous actions, the agent can pick any value within a range, such as a smoothly varying speed. The action causes the environment to change from its original state to a new state. In my model, the original state is the racer moving straight along the track with an obstacle ahead; steering around the obstacle brings it to a new state. A partial state is one where the agent can only see a small part of the environment; in this case, the racer can only see the obstacle in front of it. An absolute state is one where the agent knows its position with respect to the entire track.
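To make the state, action, and new-state cycle above more concrete, here is a minimal sketch in Python. It is not the DeepRacer simulator or my actual model: the `ToyTrack` class, the three-lane layout, the steering angles, and the sign convention are all hypothetical choices made just for illustration.

```python
# Hypothetical discrete action space: each entry is one steering choice (degrees).
# Assumed convention for this sketch: positive steers left, negative steers right.
ACTIONS = [30.0, 0.0, -30.0]


class ToyTrack:
    """Hypothetical 3-lane track (lane 0 = left, lane 2 = right) with one obstacle."""

    def __init__(self, start_lane=1, obstacle_lane=1):
        self.lane = start_lane
        self.obstacle_lane = obstacle_lane

    def observe(self):
        # Partial state: the car only sees whether an obstacle is directly ahead.
        return {"obstacle_ahead": self.lane == self.obstacle_lane}

    def step(self, action_index):
        # Apply the chosen discrete action and move to the new state.
        steering = ACTIONS[action_index]
        if steering > 0:
            self.lane -= 1  # veer left
        elif steering < 0:
            self.lane += 1  # veer right
        on_track = 0 <= self.lane <= 2
        avoided = on_track and self.lane != self.obstacle_lane
        # Reward values follow the 1.0 / 1e-3 scheme used in this post.
        reward = 1.0 if avoided else 1e-3
        return self.observe(), reward


env = ToyTrack()
state = env.observe()             # original state: obstacle straight ahead
next_state, reward = env.step(0)  # action: steer left
print(state, next_state, reward)
```

Choosing index 1 (go straight) in the same starting state would leave the car in the obstacle's lane, so the environment would hand back the small reward instead.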

Once the DeepRacer (agent) enters a new state after completing an action, it receives a scalar value known as the reward. In this case, every time the DeepRacer successfully avoids an obstacle while staying within the track, the programmed reward function returns a reward. In my case, the agent receives 1.0 for a fully correct action and a very small reward of 1e-3 if it does not avoid the obstacle. The reward function is responsible for rewarding the agent whenever the conditions for a correct action evaluate to true. This tells the DeepRacer which actions earn the most reward, so it takes the right action more often. As the total reward it collects grows, it learns to repeat the actions that earned it. This is an iterative process: the more trials the agent goes through, the better it understands which actions earn the most reward in the long run.
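The reward scheme described above could be sketched roughly like this. AWS DeepRacer reward functions are written as a Python function named `reward_function` that takes a `params` dictionary; which keys to check is my assumption for this sketch (`all_wheels_on_track` and `is_crashed`), and it is a simplified illustration rather than the exact function from my repository.

```python
def reward_function(params):
    """Return 1.0 for staying on track without crashing, otherwise 1e-3."""
    # Assumed inputs for this sketch: whether all wheels are inside the
    # track borders, and whether the car has hit an object.
    on_track = params["all_wheels_on_track"]
    crashed = params.get("is_crashed", False)

    if on_track and not crashed:
        return 1.0   # correct action: obstacle avoided, still on the track
    return 1e-3      # very small reward for everything else


# Example calls with hand-written inputs (not real simulator output).
good = reward_function({"all_wheels_on_track": True, "is_crashed": False})
bad = reward_function({"all_wheels_on_track": True, "is_crashed": True})
print(good, bad)
```

Returning a tiny positive value instead of zero for the bad case matches the 1e-3 mentioned above; the large gap between the two values is what makes the good actions stand out during training.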

These are the basics of how my model works. In my next post, I will discuss hyperparameters in more detail and how they must be adjusted to optimize this reinforcement learning model.

Here is a link to my GitHub profile with my project repository:
