Publications


Difference Rewards Policy Gradients

Jacopo Castellini, Sam Devlin, Frans A. Oliehoek, and Rahul Savani. Difference Rewards Policy Gradients. In Proceedings of the AAMAS Workshop on Adaptive Learning Agents (ALA), May 2021. Best paper award.

Abstract

Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. A key challenge, however, that is not addressed by many of these methods is multi-agent credit assignment: assessing an agent's contribution to the overall performance, which is crucial for learning good policies. We propose a novel algorithm called Dr.Reinforce that explicitly tackles this by combining difference rewards with policy gradients to allow for learning decentralized policies when the reward function is known. By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function as done by Counterfactual Multiagent Policy Gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr.Reinforce that learns a reward network that is used to estimate the difference rewards.
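
The idea of differencing a known reward function can be illustrated with a minimal sketch. It assumes one common form of difference reward, in which an agent's baseline is the expected team reward when its own action is resampled from its current policy while the other agents' actions stay fixed. The abstract does not spell out the exact form used by Dr.Reinforce, so this choice, the function name `difference_reward`, and the two-agent `toy_reward` are all hypothetical and serve only as illustration.

```python
from typing import Callable, Sequence, Tuple

JointAction = Tuple[int, ...]


def difference_reward(
    reward_fn: Callable[[object, JointAction], float],
    state: object,
    joint_action: JointAction,
    agent: int,
    policy_probs: Sequence[float],
) -> float:
    """Team reward minus a counterfactual baseline for one agent.

    Assumed form (not taken from the abstract): the baseline is the
    expected reward when `agent` resamples its action from
    `policy_probs` and all other agents keep their actions.
    """
    actual = reward_fn(state, joint_action)
    baseline = 0.0
    for alt_action, prob in enumerate(policy_probs):
        # Swap in the alternative action for this agent only.
        counterfactual = (
            joint_action[:agent] + (alt_action,) + joint_action[agent + 1:]
        )
        baseline += prob * reward_fn(state, counterfactual)
    return actual - baseline


def toy_reward(state: object, joint_action: JointAction) -> float:
    # Hypothetical known team reward: 1 when both agents pick the same action.
    return 1.0 if joint_action[0] == joint_action[1] else 0.0


# Agent 0 matched agent 1; its policy picks action 1 with probability 0.75.
d_match = difference_reward(toy_reward, None, (1, 1), agent=0, policy_probs=[0.25, 0.75])
# Agent 0 failed to match agent 1 under the same policy.
d_miss = difference_reward(toy_reward, None, (0, 1), agent=0, policy_probs=[0.25, 0.75])
```

In a REINFORCE-style update, a per-agent signal of this kind would stand in for the shared team reward when weighting each agent's log-probability gradient. Because it is computed from the reward function itself, no Q-function has to be learned.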

BibTeX Entry

@inproceedings{Castellini21ALA,
    author =    {Castellini, Jacopo and 
                 Devlin, Sam and
                 Oliehoek, Frans A. and 
                 Savani, Rahul},
    title =     {Difference Rewards Policy Gradients},
    booktitle = {Proceedings of the AAMAS Workshop on Adaptive Learning Agents (ALA)},
    year =      2021,
    month =     may,
    note =      {\textbf{Best paper award.}},
    abstract = {
    Policy gradient methods have become one of the most popular classes of
    algorithms for multi-agent reinforcement learning. A key challenge,
    however, that is not addressed by many of these methods is multi-agent credit
    assignment: assessing an agent's contribution to the overall performance,
    which is crucial for learning good policies. We propose a novel algorithm
    called Dr.Reinforce that explicitly tackles this by combining difference
    rewards with policy gradients to allow for learning decentralized policies
    when the reward function is known. By differencing the reward function
    directly, Dr.Reinforce avoids difficulties associated with learning the
    Q-function as done by Counterfactual Multiagent Policy Gradients (COMA), a
    state-of-the-art difference rewards method. For applications where the
    reward function is unknown, we show the effectiveness of a version of
    Dr.Reinforce that learns a reward network that is used to estimate the
    difference rewards.        
    }
}
