Student Projects

I am very happy to supervise BSc / MSc student projects if they relate to my research interests. I generally co-supervise together with my postdocs and PhD students, so to get an idea about possible topics, check out my team members.


Prospective and starting students: please have a look at my guidelines and resources.

What background / skills do I need for a project in ILDM?

If you are interested in working on ILDM, consider following these courses:

strongly recommended:

  • Artificial Intelligence Techniques
  • Algorithms for Intelligent Decision Making
  • Deep Learning
  • Machine Learning 1


  • Machine Learning 2
  • Information Theory
  • Advanced Algorithms
  • Evolutionary Algorithms
  • Seminar Research Methodology for Data Science

I select students based on their motivation to work in my field, and like to see some evidence that they will be able to actually do the research they would like to do. E.g.,

  • If you want to pick up an empirical question, I would want to be convinced that you have the required coding skills.
  • If you want to do something theoretical, I would like to see a first analysis that you have done.

In particular, if you want to do something with deep (reinforcement) learning, you will need to show me evidence of a successfully implemented deep learning project (e.g., in TensorFlow or PyTorch).

How to approach me?

When contacting me about a project, please include:

  • evidence that you have sufficient interest and skills in the required background (e.g., a grade list).
  • a single page with 1 or 2 possible research ideas that you would be interested in.

If I think there could be a match, I will usually suggest talking to one of my team members to further explore.

What if I want to do a project with a company?

In general, the university allows students to do their graduation project at a company. For me to engage in such an arrangement, there needs to be some value in it. The rule of thumb is that the topic must be interesting to me, and there should still be some actual scientific value, but I am open to discussion.

Except for the companies listed below, you would need to find the company yourself.

Currently, I welcome students that are interested in exploring a thesis project at:

  • TenneT, on RL for energy grid control, see below.
  • TNO, The Hague, in the area of intelligent traffic and transportation.
  • NS Reizigers, see description below.

Open projects

NetHack Challenge

I am very interested in hearing from students who want to work on NetHack and a possible challenge.

Learning to run a power network with multi-objective reinforcement learning (With TenneT)

Power grids are essential for the transport of electricity from producers to consumers and hence represent a backbone of the energy transition. Behind the scenes, a lot of decisions need to be made to keep the power grid running smoothly. For example, how can overload on certain lines be prevented? Uncertain factors play a big role here; for example, the demand might fluctuate during the day. More importantly, a growing share of renewable energy means that there can be big fluctuations in where and how much electricity is produced. The network should also be resilient to failures of lines in the grid, such as those caused by adverse weather or cyberattacks.

All this means that keeping the electricity grid running smoothly is a complex sequential decision problem [1, 5]. In this project, we will investigate how reinforcement learning (RL) can be used to help solve this problem. More specifically, we will develop multi-objective RL (MORL) algorithms for power grid control since real-world operators often need to satisfy several opposing objectives (e.g. reduce the loading of the lines vs reduce the complexity of the actions). We will compare MORL approaches against single-objective RL baselines such as [2-3]. At this first stage, we want to find the strengths and weaknesses of the MORL approaches.
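To make the trade-off between opposing objectives concrete, here is a minimal sketch of two standard multi-objective ingredients: a Pareto-dominance filter and linear scalarization. The candidate policies and their numbers (maximum line loading, number of topology actions) are purely hypothetical illustrations, not results from this project or from any simulator.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and strictly better somewhere.

    All objectives are assumed to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of all candidates that no other candidate dominates."""
    return {
        name
        for name, costs in candidates.items()
        if not any(dominates(other, costs) for other in candidates.values())
    }


def best_scalarized(candidates, weights):
    """Pick the candidate that minimizes the weighted sum of its costs."""
    return min(
        candidates,
        key=lambda name: sum(w * c for w, c in zip(weights, candidates[name])),
    )


# Hypothetical policies: (maximum line loading, number of topology actions).
candidates = {
    "A": (0.95, 1),
    "B": (0.80, 3),
    "C": (0.85, 5),  # dominated by B: higher loading and more actions
    "D": (0.70, 8),
}

front = pareto_front(candidates)
safety_first = best_scalarized(candidates, weights=(10.0, 0.1))
simplicity_first = best_scalarized(candidates, weights=(1.0, 0.1))
print(sorted(front), safety_first, simplicity_first)
```

A single-objective baseline corresponds to fixing one weight vector in advance, whereas a multi-objective method aims to recover the whole front so that an operator can choose the trade-off afterwards.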

At possible subsequent stages, either additional techniques can be included (e.g. graph neural networks) or we can push the system further. This could mean tackling additional challenges, such as robustness to failures in the network, dealing with an increasing share of renewables over time, and/or being able to express confidence in the network’s decisions. The exact direction will be decided jointly by the student and supervisors, where we will of course try to shape the project to reflect your own research interests. Each of the scenarios above can be tested on existing simulators that were developed for on-line challenges [4-8]. The research project could result in a submission to a future similar challenge.

An internship at TenneT (the Dutch Transmission System Operator) is also possible.

[1] Viebahn et al. (2022). Potential and challenges of AI-powered decision support for short-term system operations. CIGRE Paris Session.
[2] Subramanian, M., Viebahn, J., Tindemans, S. H., Donnot, B., & Marot, A. (2020). Exploring grid topology reconfiguration using a simple deep reinforcement learning approach. arXiv preprint arXiv:2011.13465.
[3] Lehna, M., Viebahn, J., Marot, A., Tomforde, S., Scholz, C. (2023). Managing power grids through topology actions: A comparative study between advanced rule-based and reinforcement learning agents. Energy and AI,
[4] Marot, A., Donnot, B., Dulac-Arnold, G., Kelly, A., O’Sullivan, A., Viebahn, J., & Romero, C. (2021). Learning to run a Power Network Challenge: a Retrospective Analysis. arXiv preprint arXiv:2103.03104.
[5] Kelly, A., O’Sullivan, A., de Mars, P., & Marot, A. (2020). Reinforcement learning for electricity network operation. arXiv preprint arXiv:2003.07339.

Probabilistic Programs for Learning in POMDPs

Reasoning about uncertainty is a core intellectual ability of any (artificially) intelligent agent. Imagine a hunter that is dropped in an unknown hunting ground. The hunter needs to quickly learn about the different types of prey, their typical locations and movement patterns in order to survive. However, this itself is difficult due to partial observability: the hunter can only observe animals when it is near them!

Problems in which not everything is observable are often framed as partially observable Markov decision processes (POMDPs). POMDPs are a powerful framework but can be difficult to scale to large problems. One way to scale them is by embedding more structure into the problem through specific domain knowledge. For instance, anticipating the movements of prey becomes easier if we know that it will not suddenly stop, although it can change direction.
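As a rough illustration of how such domain knowledge enters the agent's reasoning, the sketch below implements the standard discrete belief update b'(s') ∝ O(o | s') Σ_s T(s' | s) b(s). The five-cell ring world, the hunter's position, and the "prey never stays put" transition model are hypothetical choices made for this example only.

```python
def belief_update(belief, transition, obs_likelihood):
    """One step of a discrete Bayes filter.

    belief[s]          : current probability that the prey is in cell s
    transition[s][s2]  : probability that the prey moves from s to s2
    obs_likelihood[s2] : probability of the received observation if the prey is in s2
    """
    n = len(belief)
    # Prediction: push the belief through the movement model.
    predicted = [sum(transition[s][s2] * belief[s] for s in range(n)) for s2 in range(n)]
    # Correction: weight by how well each cell explains the observation.
    unnormalized = [obs_likelihood[s2] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in unnormalized]


n_cells = 5
hunter_cell = 0

# Assumed domain knowledge: the prey never stops, it steps left or right on a ring.
transition = [[0.0] * n_cells for _ in range(n_cells)]
for s in range(n_cells):
    transition[s][(s - 1) % n_cells] = 0.5
    transition[s][(s + 1) % n_cells] = 0.5

# The hunter only sees its own cell; "nothing seen" rules that cell out.
nothing_seen = [0.0 if s == hunter_cell else 1.0 for s in range(n_cells)]

belief = [1.0 / n_cells] * n_cells
for _ in range(2):
    belief = belief_update(belief, transition, nothing_seen)
print(belief)
```

After two empty observations, the cells far from the hunter end up more likely than its direct neighbours, because a prey that cannot stand still and was not in the hunter's cell is less likely to be next to it now.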

Probabilistic programs offer a very versatile way of imposing structure on various probabilistic models, like POMDPs. In this project, we will explore how far this can enable us to scale up learning in challenging partially observable dynamic settings.

(together with Sebastijan Dumančić)

Learning to solve the train unit shunting problem with disturbances

Looking for a student who is interested in doing a project with NS Reizigers.

The Train Unit Shunting Problem (TUSP) is a difficult sequential decision-making problem faced by Dutch Railways (NS). At a service site, several train units need to be inspected, cleaned, repaired, and parked during the night. In addition, train units of a specific type need to depart at a specific time.

At NS, a local search heuristic that evaluates all components of the Train Unit Shunting Problem simultaneously was developed to create shunt plans. The heuristic attempts to find a feasible shunt plan by iteratively applying several search operators to move through the search space. In every iteration, the search operators are shuffled into a random order, which is the order in which they are evaluated. Starting with the first search operator, the heuristic evaluates the set of candidate solutions that can be reached through that operator. A candidate solution is immediately accepted as the new solution for the next iteration if it is an improvement over the current solution. If the candidate solution is worse, it is accepted with a certain probability that depends on the difference in solution quality and the progress of the search process. This probabilistic technique is called simulated annealing.
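The loop described above can be sketched as follows. This is a generic illustration under stated assumptions, not the NS implementation: the geometric cooling schedule, the acceptance probability exp(-delta / T), and the toy problem (sorting a tuple, with the number of inversions as cost) are all choices made for this example.

```python
import math
import random


def local_search(initial, operators, cost, iterations=1000,
                 t_start=1.0, t_end=0.01, seed=0):
    """Local search with shuffled operators and simulated-annealing acceptance."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for it in range(iterations):
        # Assumed schedule: temperature decays geometrically with search progress.
        temperature = t_start * (t_end / t_start) ** (it / max(1, iterations - 1))
        order = list(operators)
        rng.shuffle(order)  # fresh random operator order in every iteration
        accepted = False
        for operator in order:
            for candidate in operator(current):
                delta = cost(candidate) - current_cost
                # Improvements are accepted immediately; other candidates are
                # accepted with probability exp(-delta / temperature).
                if delta < 0 or rng.random() < math.exp(-delta / temperature):
                    current, current_cost = candidate, current_cost + delta
                    accepted = True
                    break
            if accepted:
                break
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost


# Toy stand-ins for shunt-plan operators (hypothetical).
def swap_adjacent(perm):
    for i in range(len(perm) - 1):
        yield perm[:i] + (perm[i + 1], perm[i]) + perm[i + 2:]


def move_to_front(perm):
    for i in range(1, len(perm)):
        yield (perm[i],) + perm[:i] + perm[i + 1:]


def inversions(perm):
    """Number of out-of-order pairs; zero means the tuple is sorted."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])


start = (4, 2, 5, 1, 3)
best, best_cost = local_search(start, [swap_adjacent, move_to_front], inversions)
print(best, best_cost)
```

Early in the search the temperature is high, so worse candidates are accepted often and the search can escape local optima; towards the end it behaves almost greedily.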

One challenging aspect of train unit shunting is that the problem specification frequently changes slightly at the last minute. For example, a train might be delayed, causing the trains to arrive in a different order than expected. A facility might suddenly be unavailable, or some extra tasks might be added to a train due to technical problems. When disturbances occur, making a completely new plan is not desirable. Making a new plan might take a lot of time, and even if a new plan can be made, it would require changing the anticipated operations of all train drivers, maintenance workers, etc. Instead, the existing plan is typically adapted in such a way that the adapted plan is expected to be not too far from the original plan.

In this project, we would like to explore making this approach more robust. For instance, it could be possible to apply (multi-agent) deep reinforcement learning to learn to guide the search operators of local search, to find the minimal number of changes (number of operators) applied to the original plan such that the adapted plan is feasible. Another direction could be to try to learn models of the (probability of) disturbances, and investigate ways to incorporate these in the fitness function directly.

Deep representation learning for RL

It is not always easy to understand what and how deep RL agents actually learn, and how this relates to their performance. One way to gain a better understanding is to think about what such agents should ideally learn. When we speak of what a deep RL agent learns, we mean the internal representation that a neural network forms of the environment, that is, the activation patterns that arise in each hidden network layer as the result of feeding (histories of) observations to the network. Among the desirable properties of such an internal representation are that it should make only necessary distinctions between (histories of) observations, allow the agent to learn to act optimally, and enable generalization to new irrelevant feature values and modified dynamics. A representation that has these properties is one in which the distances between states are proportional to a bisimulation metric, which measures how “behaviorally different” states are. Some interesting questions to explore are:

  • Do different representation learning methods for RL (e.g. autoencoder-based, augmentation-based contrastive coding) cause such a representation to be learned, and if so, how quickly?
  • How do the representations learned by different model-free methods compare to such a representation, both during and at the end of training?
  • How can we formulate a scalable auxiliary loss that pushes networks to form such a representation?

It is also important to consider how we can scalably compute how similar the learned representations are to this bisimulation-based representation, to allow evaluations on commonly used benchmarks such as MuJoCo.
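For intuition about what a bisimulation metric measures, the following sketch computes one by fixed-point iteration in a tiny tabular MDP. It assumes deterministic transitions, so that the distance between next-state distributions reduces to the distance between the two successor states, and uses the recursion d(s, t) = max_a ( |r(s, a) - r(t, a)| + γ d(s'_a, t'_a) ). The three-state MDP is hypothetical, and this exact computation does not scale to benchmarks with large or continuous state spaces.

```python
def bisimulation_metric(rewards, next_state, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Fixed-point iteration for a bisimulation metric in a deterministic MDP.

    rewards[s][a]    : reward for taking action a in state s
    next_state[s][a] : successor state for taking action a in state s
    """
    n_states = len(rewards)
    n_actions = len(rewards[0])
    d = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(max_iters):
        new_d = [
            [
                max(
                    abs(rewards[s][a] - rewards[t][a])
                    + gamma * d[next_state[s][a]][next_state[t][a]]
                    for a in range(n_actions)
                )
                for t in range(n_states)
            ]
            for s in range(n_states)
        ]
        change = max(
            abs(new_d[s][t] - d[s][t])
            for s in range(n_states)
            for t in range(n_states)
        )
        d = new_d
        if change < tol:
            break
    return d


# Hypothetical MDP: states 0 and 1 differ only in an irrelevant feature,
# state 2 is an absorbing state without reward.
rewards = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
next_state = [[2, 1], [2, 0], [2, 2]]

d = bisimulation_metric(rewards, next_state)
print(d[0][1], d[0][2])
```

States 0 and 1 end up at distance zero, so an ideal representation would map them to the same point, while both are at a positive distance from state 2.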

Intelligent Traffic Control

I am quite interested in applying AI techniques in the open source ‘SUMO’ traffic simulator. Examples of such projects would be:

  • Reinforcement Learning for large-scale Urban Traffic Light Control
  • Assessing the Impact of Improved Sensing for RL-Based Urban Traffic Light Control

Such projects would require good coding skills and a decent background in reinforcement learning or MDPs.

Meta Reasoning in Planning by Meta Learning

Anytime planning algorithms like MCTS enjoy many empirical successes in large sequential decision making problems. However, a fundamental limitation of these algorithms is that they require a massive amount of compute to work. Fortunately, it has been shown that in many games and real-world applications, complex reasoning is not needed for every decision in a task to achieve good performance. To reduce such unnecessary reasoning and bring down the overall computational complexity, we need a process for understanding the consequences of performing reasoning in a certain situation, which we call meta reasoning. Unfortunately, since meta reasoning needs to reason about reasoning, it generally has a higher complexity than solving the task itself. In this project, we will investigate a different path: meta learning how to do meta reasoning in planning. Conceptually, we will try to set up a meta agent on top of a planning agent that decides how much computation we spend on every decision, with the goal of achieving high task performance with minimal computation. We will investigate whether we can use meta learning techniques to construct a meta agent by repeatedly playing games or interacting with environments. The interesting part of this project is to see how well the meta agent can generalize from past experience to a new situation. See the research questions below.

Some interesting research questions:

  • what architecture and learning algorithms should we use for the meta agent?
  • what information during decision making should we feed into the meta agent for it to perform well and learn fast?
  • how quickly can the meta agent generalize from past experience to a new situation?
  • can we add useful heuristics and inductive biases to the meta agent for it to learn faster?

Requirements of the project:

  • Good coding skills in C++ or Python (required)
  • Knowledge of Monte Carlo Tree Search (preferred)
  • Knowledge of Deep Reinforcement Learning (required)
  • Experience with deep learning libraries: PyTorch (preferred) or TensorFlow

This project will be co-supervised by Jinke He (a PhD student) and me. For more information, please contact

Your topic

If you are interested, or have your own ideas for a project relating to reinforcement learning, multiagent learning, or adversarial learning, please contact me.