PPO_Lander_Implementation

By Matt Stammers

18/09/2023

After spending three months learning reinforcement learning, it is time to write out a full PPO implementation for the lunar-lander environment.

This is based on Costa Huang's excellent tutorial PPO_Tutorial

Starting Out

Step 1

First, I made sure that everything was working correctly using CartPole-v1 and 100,000 training steps.

This gave me the following result:

and a score of 257.70 +/- 91.25, with the following very jumpy training:

Tuning Up (or down!)

Step 2

To fix the jumpy training I reduced the learning rate and the clip coefficient while increasing the total number of timesteps to 1 million. I kept the decay on the learning rate as it was. As expected, training took quite a bit longer this time, but after about 300,000 timesteps the model was clearly better, although still jumpy. By about 700,000 timesteps the learning rate had decayed to a very low value, yet performance was still quite inconsistent, although it did sometimes manage a perfect score. Given that the focus was on other environments, I chose to move on at this point with an imperfect CartPole implementation. You can see the final deployed, but far from perfect, model here (a policy-only algorithm would have outperformed this!):
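The very low learning rate late in training is what a decaying schedule would produce. The sketch below assumes a linear decay towards zero over the run, which is a common choice in PPO implementations but is an assumption about this code; the function name and the numbers in the demo are illustrative only.

```python
def annealed_lr(update: int, num_updates: int, initial_lr: float) -> float:
    """Linearly decay the learning rate from initial_lr towards zero.

    `update` is 1-indexed, so the first update uses the full initial_lr.
    (Assumed schedule, not confirmed against the original code.)
    """
    frac = 1.0 - (update - 1.0) / num_updates
    return frac * initial_lr


# Hypothetical settings: 1,000,000 timesteps, 512 transitions per update.
total_timesteps = 1_000_000
batch_size = 512
num_updates = total_timesteps // batch_size

for timestep in (0, 300_000, 700_000):
    update = timestep // batch_size + 1
    print(timestep, annealed_lr(update, num_updates, 2.5e-4))
```

Under a schedule like this, roughly 70% of the way through training the optimiser is working with about 30% of the starting learning rate, so late improvements come slowly.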

CartPole-TestDeployment

Another Environment

Step 3

Now, before going back to lunar-lander, I want to try to make this work on the MountainCar-v0 environment, because I have never tried this environment before and will need to think carefully to make it work.

Initially this one absolutely sucked, but I gradually increased the learning rate, and at 2.5e-3, after about 150,000 timesteps, I started getting some better-than-baseline (-200) returns. I increased the number of training steps and started getting some very good results by 500,000 steps, but they were inconsistent, so I continued the training to 1 million. By this point it didn't seem to be getting any better, so I stopped it.

It wasn't too bad: I got -124 +/- 35.38, which is not amazing but also not terrible, especially given that I have never even looked at this particular environment before.
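Scores written as mean +/- spread, like the one above, are typically the mean and standard deviation of the returns from a set of evaluation episodes. A minimal sketch using only the standard library follows; the helper name and the returns in the demo are made up for illustration and are not results from these runs.

```python
import statistics


def summarise_returns(episode_returns):
    """Return (mean, population standard deviation) of episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std


# Made-up evaluation returns, purely for illustration.
example_returns = [-110.0, -160.0, -95.0, -131.0]
mean, std = summarise_returns(example_returns)
print(f"{mean:.2f} +/- {std:.2f}")
```

A large spread relative to the mean is another sign of the inconsistency described above: some episodes go well and others do not.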

Here is the video of it during training:

and here is the link to the final result:

https://huggingface.co/MattStammers/ppo-MountainCar-v0-fullcoded

LunarLander

Step 4

Now it's time to take on LunarLander. This is a more complex environment, so I have increased the batch sizes, reduced the learning rate and started with 100,000 training steps. Let's see how we go!

OK, so sadly all on board have perished:

Now let's try to increase the timesteps to a million and see if we get a better result:

This time is marginally better, but they still don't land. If we look at the training, we can see that it is very lumpy:

I could at this point remove the random seeding in order to debug, but I don't really want to do that. Instead I doubled the batch size to 2048 and made the network bigger, doubling the nodes from 64 to 128.
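The batch size determines how much experience is collected before each policy update, and therefore how many updates fit into a fixed timestep budget. The sketch below assumes the common layout where the batch is the number of parallel environments multiplied by the rollout steps per environment; that relationship, the helper name and the example values are assumptions, not taken from this code.

```python
def rollout_budget(total_timesteps: int, num_envs: int, num_steps: int):
    """Return (batch_size, num_updates) for a given rollout configuration.

    Assumes batch_size = num_envs * num_steps, as in many PPO implementations.
    """
    batch_size = num_envs * num_steps
    num_updates = total_timesteps // batch_size
    return batch_size, num_updates


# Hypothetical configurations that both give a batch of 2048 transitions.
print(rollout_budget(1_000_000, num_envs=4, num_steps=512))
print(rollout_budget(2_000_000, num_envs=16, num_steps=128))
```

Doubling the batch size halves the number of updates for the same number of timesteps, which is one reason larger batches often need more total timesteps.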

However, I noticed while doing this that losses/entropy dropped off really quickly, which is a poor price to pay for stability, so I terminated the run and tried the same architecture with a smaller learning rate of 2.5e-5 rather than 2.5e-4. This kept the entropy much higher and also lowered the loss (purple run = 4).

This also gave me better overall losses/value_loss values. Training was still a bit spiky but seemed to be heading in the right direction: by 500,000 timesteps I was starting to get a few positive scores, but the final result was still not good.
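To make the entropy observation concrete: the entropy of a discrete policy is highest when all actions are equally likely and falls towards zero as the policy commits to a single action. The sketch below computes it for two made-up action distributions over four actions (the size of LunarLander's discrete action space); the probabilities are illustrative, not logged values.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform_policy = [0.25, 0.25, 0.25, 0.25]  # still exploring
peaked_policy = [0.97, 0.01, 0.01, 0.01]   # almost deterministic

print(categorical_entropy(uniform_policy))
print(categorical_entropy(peaked_policy))
```

An entropy curve that collapses early means the policy has stopped exploring before it has found a good landing strategy, which is why a smaller learning rate that keeps entropy higher can help.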

Next I wondered whether increasing the number of environments running in parallel might improve training, so I increased these from 4 to 16, now with 2 million timesteps (run 5). I also increased the clip-coefficient to make sure the policy updates properly (run 6), but this actually made things worse.
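For reference, the clip coefficient limits how far a single update can move the policy away from the one that collected the data. Below is a minimal per-sample sketch of the standard PPO clipped surrogate loss in plain Python; the function name and the example values are illustrative, and a real implementation would apply this to whole batches of tensors.

```python
def clipped_surrogate_loss(ratio: float, advantage: float, clip_coef: float) -> float:
    """Per-sample PPO policy loss (to be minimised).

    `ratio` is pi_new(a|s) / pi_old(a|s) for the sampled action.
    """
    unclipped = -advantage * ratio
    clipped_ratio = max(1.0 - clip_coef, min(ratio, 1.0 + clip_coef))
    clipped = -advantage * clipped_ratio
    # Taking the larger loss is the pessimistic choice between the two.
    return max(unclipped, clipped)


# Same sample, two hypothetical clip coefficients.
print(clipped_surrogate_loss(ratio=1.5, advantage=1.0, clip_coef=0.2))
print(clipped_surrogate_loss(ratio=1.5, advantage=1.0, clip_coef=0.4))
```

With a larger clip coefficient the same sample is allowed to contribute a bigger change, so raising it makes updates more aggressive, which may explain why run 6 became less stable rather than more.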

For run 7 I increased the learning rate again, back to 2.5e-4, and halved the clip rate. This seemed to help, so I then reduced the step-size back to 1024. This time, at 500,000 steps, I started getting some consistently positive results and the training times started to jump accordingly. I was reading this thesis while doing it: https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf which really helped me understand the PPO architecture and the clip function. This ultimately led to me being able to unravel the puzzle in 8 attempts. The lander doesn't always land, but there is at least a chance of survival this time ;D:

Here is the training dashboard:

and here is the final huggingface repo:

HuggingFace_Repo

I am very grateful to huggingface for setting me up well to explore this further and plan to repay the favour by using them to host many of our open source models.
