online-appendix – Frans A. Oliehoek

On this page, we show some videos of our experimental results in two different environments, Myopic Breakout and Traffic Control.

Myopic Breakout

The InfluenceNet model (PPO-InfluenceNet) is able to learn the “tunnel” strategy, where it creates an opening on the left (or right) side and plays the ball in there to score a lot of points:

The feedforward network with no internal memory performs considerably worse than the InfluenceNet model:

Traffic Control

The Traffic Control task was modified as follows:

The size of the observable region was slightly reduced and the delay between the moment an action is taken and the time the lights switch was increased to 6 seconds. During these 6 seconds the green light turns yellow.
The speed penalty was removed and there is only a negative reward of -0.1 for every car that is stopped at a traffic light.

As shown in the video below, a memoryless agent can only switch the lights when a car enters the local region. With the new changes, this means that the light turns green too late and the cars have to stop:

On the other hand, the InfluenceNet agent is able to anticipate that a car will be entering the local region and thus switch the lights just in time for the cars to continue without stopping:

Category: online-appendix

Video Demo with ALA submission ‘Influence-Based Abstraction in Deep Reinforcement Learning’

Myopic Breakout

Traffic Control