A Survey of Temporal Credit Assignment in Deep Reinforcement Learning
The Credit Assignment Problem (CAP) refers to the longstanding challenge of RL agents to associate actions with their long-term consequences. Solving the CAP is a crucial step towards the successful deployment of RL in the real world since most decision problems provide feedback that is noisy, delayed, and with little or no information about the causes. These conditions make it hard to distinguish serendipitous outcomes from those caused by informed decision-making. However, the mathematical nature of credit and the CAP remains poorly understood and defined. In this survey, we review the state of the art of Temporal Credit Assignment (CA) in deep RL. We propose a unifying formalism for credit that enables equitable comparisons of state-of-the-art algorithms and improves our understanding of the trade-offs between the various methods. We cast the CAP as the problem of learning the influence of an action over an outcome from a finite amount of experience. We discuss the challenges posed by delayed effects, transpositions, and a lack of action influence, and analyse how existing methods aim to address them. Finally, we survey the protocols to evaluate a credit assignment method and suggest ways to diagnose the sources of struggle for different credit assignment methods. Overall, this survey provides an overview of the field for new-entry practitioners and researchers, offers a coherent perspective for scholars looking to expedite the starting stages of a new study on the CAP, and suggests potential directions for future research.
1 Introduction
RL is poised to impact many real-world problems that require sequential decision making, such as strategy (Silver et al., 2016, 2018; Schrittwieser et al., 2020; Anthony et al., 2020; Vinyals et al., 2019; Perolat et al., 2022) and arcade video games (Mnih et al., 2013, 2015; Badia et al., 2020; Wurman et al., 2022), climate control (Wang and Hong, 2020), energy management (Gao, 2014), car driving (Filos et al., 2020) and stratospheric balloon navigation (Bellemare et al., 2020), designing circuits (Mirhoseini et al., 2020), cybersecurity (Nguyen and Reddi, 2021), robotics (Kormushev et al., 2013), or physics (Degrave et al., 2022). One fundamental mechanism allowing RL agents to succeed in these scenarios is their ability to evaluate the influence of their actions over outcomes – e.g., a win, a loss, a particular event, a payoff. Often, these outcomes are consequences of isolated decisions taken in a very remote past: actions can have long-term effects. The problem of learning to associate actions with distant, future outcomes is known as the temporal Credit Assignment Problem (CAP): to distribute the credit of success among the multitude of decisions involved (Minsky, 1961). Overall, the influence that an action has on an outcome represents knowledge in the form of associations between actions and outcomes (Sutton et al., 2011; Zhang et al., 2020). These associations constitute the scaffolding that agents can use to deduce, reason, improve and act to address decision-making problems and ultimately improve their data efficiency.
Solving the CAP is paramount since most decision problems have two important characteristics: they take a long time to complete, and they seldom provide immediate feedback, but often with delay and little insight as to which actions caused it. These conditions produce environments where the feedback signal is weak, noisy, or deceiving, and the ability to separate serendipitous outcomes from those caused by informed decision-making becomes a hard challenge. Furthermore, as these environments grow in complexity with the aim to scale to real-world tasks (Rahmandad et al., 2009; Luoma et al., 2017), the actions taken by an agent affect an increasingly vanishing part of the outcome. In these conditions, it becomes challenging to learn value functions that accurately represent the influence of an action, and to distinguish and order the relative long-term values of different actions. In fact, canonical Deep Reinforcement Learning (Deep RL) solutions to control are often brittle to the hyperparameter choice (Henderson et al., 2018), unable to generalise zero-shot to different tasks (Kirk et al., 2023), prone to overfitting (Behzadan and Hsu, 2019; Wang et al., 2022), and sample-inefficient (Ye et al., 2021; Kapturowski et al., 2023). Overall, building a solid foundation of knowledge that can unlock solutions to complex problems beyond those already solved calls for better CA techniques (Mesnard et al., 2021).
In the current state of RL, action values are a key proxy for action influence. Values actualise a return by synthesising statistics of the future into properties of the present. Recently, the advent of Deep RL (Arulkumaran et al., 2017) granted access to new avenues to express credit through values, either by using memory (Goyal et al., 2019; Hung et al., 2019), associative memory (Hung et al., 2019; Ferret et al., 2021a; Raposo et al., 2021), counterfactuals (Mesnard et al., 2021), planning (Edwards et al., 2018; Goyal et al., 2019; van Hasselt et al., 2021) or meta-learning (Xu et al., 2018; Houthooft et al., 2018; Oh et al., 2020; Xu et al., 2020; Zahavy et al., 2020). Research on the CAP is now fervent, with a rapidly growing corpus of works.
Motivation.
Despite its central role, there is little discussion on the precise mathematical nature of credit. While these proxies are sufficient to unlock solutions to complex tasks, it remains unclear where to draw the line between a generic measure of action influence and credit . Existing works focus on partial aspects or sub-problems (Hung et al.,, 2019 ; Arjona-Medina et al.,, 2019 ; Arumugam et al.,, 2021 ) and not all works refer to the CAP explicitly in their text (Andrychowicz et al.,, 2017 ; Nota et al.,, 2021 ; Goyal et al.,, 2019 ) , despite their findings providing relevant contributions to address the problem. The resulting literature is fragmented and lacks a space to connect recent works and put their efforts in perspective for the future. The field still holds open questions:
What is the credit of an action? How is it different from an action value ? And what is the CAP ? What in words, and what in mathematics?
How do agents learn to assign credit? What are the main methods in the literature and how can they be organised?
How can we evaluate whether a method is improving on a challenge? How can we monitor advancements?
Here, we propose potential answers to these questions and set out to realign the fundamental issue raised by Minsky, ( 1961 ) to the Deep RL framework. Our main goal is to provide an overview of the field to new-entry practitioners and researchers, and, for scholars looking to develop the field further, to put the heterogeneous set of works into a comprehensive, coherent perspective. Lastly, we aim to reconnect works whose findings are relevant for CAP , but that do not refer to it directly. To the best of our knowledge, the work by Ferret, ( 2022 , Chapter 4) is the only effort in this direction, and the literature offers no explicit surveys on the temporal CA problem in Deep RL .
The survey focuses on temporal CA in single-agent Deep RL, and on the problems of (i) quantifying the influence of an action mathematically and formalising a mathematical objective for the CA problem, (ii) defining its challenges and categorising the existing methods to learn the quantities above, and (iii) defining a suitable evaluation protocol to monitor the advancement of the field. We do not discuss structural CA in Deep Neural Networks (DNNs), that is, the problem of assigning credit or blame to individual parameters of a DNN (Schmidhuber, 2015; Balduzzi et al., 2015). We also do not discuss CA in multi-agent RL, that is, ascertaining which agents are responsible for creating good reinforcement signals (Chang et al., 2003; Foerster et al., 2018). When credit (assignment) is used without any preceding adjective, we always refer to temporal credit (assignment). In particular, with the adjective temporal we refer to the fact that “each ultimate success is associated with a vast number of internal decisions” (Minsky, 1961) and that these decisions, together with states and rewards, are arranged to form a temporal sequence.
The survey focuses on Deep RL . In surveying existing formalisms and methods, we only look at the Deep RL literature, and when proposing new ones, we tailor them to Deep RL theories and applications. We exclude from the review methods specifically designed to solve decision problems with linear or tabular RL , as they do not bode well for scaling to complex problems.
We address Q1. , Q2. and Q3. in the three major sections of the manuscript. Respectively:
Section 4 addresses Q1. , proposing a definition of credit and the CAP and providing a survey of action influence measures.
Section 5 and Section 6 address Q2. , respectively discussing the key challenges to solving the CAP and the existing methods to assign credit.
Section 7 answers Q3. , reviewing the problem setup, the metrics and the evaluation protocols to monitor advancements in the field.
For each question, we contribute by: (a) systematising existing works into a simpler, coherent space; (b) discussing it, and (c) synthesising our perspective into a unifying formalism. Table 1 outlines the suggested reading flow according to the type of reader.
2 Related Work
Three existing works stand out for proposing a better understanding of the CAP explicitly. Ferret, ( 2022 , Chapter 4) designs a conceptual framework to unify and study credit assignment methods. The chapter proposes a general formalism for a range of credit assignment functions and discusses their characteristics and general desiderata. Unlike Ferret, ( 2022 , Chapter 4) , we survey potential formalisms for a mathematical definition of credit (Section 4 ); in view of the new formalism, we propose an alternative view of the methods to assign credit (Section 6 ), and an evaluation protocol to measure future advancements in the field. Arumugam et al., ( 2021 ) analyses the CAP from an information theoretic perspective. The work focuses on the notion of information sparsity to clarify the role of credit in solving sparse reward problems in RL . Despite the work questioning what credit is mathematically, it does not survey existing material, and it does not provide a framework that can unify existing approaches to represent credit under a single formalism. Harutyunyan et al., ( 2019 ) propose a principled method to measure the credit of an action. However, the study does not aim to survey existing methods to measure credit, the methods to assign credit, and the methods to evaluate a credit assignment method, and does not aim to organise them into a cohesive synthesis.
The literature also offers surveys on related topics. We discuss them in Appendix A to preserve the fluidity of the manuscript.
As a result, none of these works position the CAP in a single space that enables thorough discussion, assessment and critique. Instead, we propose a formalism that unifies the existing quantities that represent the influence of an action (Section 4). Based on this, we can analyse the advantages and limitations of existing measures of action influence. The resulting framework provides a way to gather the variety of existing methods that learn these quantities from experience (Section 6), and to monitor the advancements in solving the CAP.
3 Notation and Background
Here we introduce the notation and background that we will follow in the rest of the paper.
We use calligraphic characters to denote sets and the corresponding lowercase letters to denote their elements, for example, $x \in \mathcal{X}$. For a measurable space $(\mathcal{X}, \Sigma)$, we denote the set of probability measures over $\mathcal{X}$ with $\Delta(\mathcal{X})$. We use an uppercase letter $X$ to indicate a random variable, and the notation $\mathbb{P}_X$ to denote its distribution over the sample set $\mathcal{X}$, for example, $\mathbb{P}_X: \mathcal{X} \rightarrow \Delta(\mathcal{X})$. When we mention a random event $X$ (for example, a random action) we refer to a random draw of a specific value $x \in \mathcal{X}$ from its distribution $\mathbb{P}_X$, and we write $X \sim \mathbb{P}_X$. When a distribution is clear from the context, we omit it from the subscript and write $\mathbb{P}(X)$ instead of $\mathbb{P}_X(X)$. We use $\mathbbm{1}_{\mathcal{Y}}(x)$ for the indicator function that maps an element $x \in \mathcal{X}$ to $1$ if $x \in \mathcal{Y} \subset \mathcal{X}$ and to $0$ otherwise. We use $\mathbb{R}$ to denote the set of real numbers and $\mathbb{B} = \{0, 1\}$ to denote the Boolean domain. We use $\ell_{\infty}(x) = \lVert x \rVert_{\infty} = \sup_i |x_i|$ to denote the $\ell$-infinity norm of a vector $x$ with components $x_i$.
We write the Kullback-Leibler divergence between two discrete probability distributions $\mathbb{P}_P(X)$ and $\mathbb{P}_Q(X)$ with sample space $\mathcal{X}$ as $D_{KL}(\mathbb{P}_P(X) \,||\, \mathbb{P}_Q(X)) = \sum_{x \in \mathcal{X}} \left[ \mathbb{P}_P(x) \log\left( \mathbb{P}_P(x) / \mathbb{P}_Q(x) \right) \right]$.
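As a concrete illustration of this notation, a minimal Python sketch of the divergence between two discrete distributions given as arrays of probabilities might look as follows (the function name and the guard constant are assumptions for exposition):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions over the same sample space."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    # Terms with p(x) = 0 contribute 0 by convention; eps guards against log(0).
    return float(np.sum(np.where(p > 0, p * np.log(p / (q + eps)), 0.0)))

# Example: divergence between two distributions over a binary sample space.
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.368
```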
Reinforcement Learning.
We consider the problem of learning by interacting with an environment. A program (the agent) interacts with an environment by making decisions (actions). The action is the agent's interface with the environment. Before each action, the agent may observe part of the environment and use this observation to choose a suitable action. The action changes the state of the environment. After each action, the agent may perceive a feedback signal (the reward). The goal of the agent is to learn a rule of behaviour (the policy) that maximises the expected sum of rewards.
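As a minimal illustration of this interaction protocol, the following Python sketch rolls out a single episode; the `env` and `policy` objects and their methods are illustrative assumptions rather than a specific library API.

```python
def run_episode(env, policy, max_steps=1000):
    """Roll out one episode and return the undiscounted sum of rewards."""
    state = env.reset()                          # the agent observes (part of) the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent acts according to its policy
        state, reward, done = env.step(action)   # the action changes the environment state
        total_reward += reward                   # the agent perceives a feedback signal
        if done:
            break
    return total_reward
```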
The probability of the next state $S_{t+1}$ being $s'$ depends on a state-transition distribution: $\mathbb{P}_\mu(S_{t+1} = s' \mid S_t = s, A_t = a) = \mu(s' \mid s, a), \; \forall s, s' \in \mathcal{S}$. We refer to $S_t$ as the random state at time $t$. The probability of the action $a$ depends on the agent's policy, which is a stationary mapping $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$ from a state to a probability distribution over actions.
We refer to the sequence of transitions $(s_t, a_t, r_t, s_{t+1})$ as a trajectory or episode $d = \{s_t, a_t, r_t : 0 \leq t \leq T\}$, where $T$ is the horizon of the episode.
We mostly consider episodic settings where the probability of ending in an absorbing state in finite time is $1$, resulting in a random horizon $T$. We consider discrete action spaces $\mathcal{A} = \{a_i : 1 \leq i \leq n\}$ only.
We refer to the return random variable $Z_t$ as the sum of discounted rewards from time $t$ to the end of the episode, $Z_t = \sum_{k=t}^{T} \gamma^{k-t} R(S_k, A_k)$. The control objective of an RL problem is to find a policy $\pi^*$ that maximises the expected return.
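A standard statement of this objective, consistent with the definitions above, is:

$$\pi^{*} \in \operatorname*{arg\,max}_{\pi}\; \mathbb{E}_{\pi,\mu}\left[ Z_{0} \right] = \operatorname*{arg\,max}_{\pi}\; \mathbb{E}_{\pi,\mu}\left[ \sum_{t=0}^{T} \gamma^{t} R(S_{t}, A_{t}) \right].$$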
Partially-Observable MDPs (POMDPs).
In a POMDP, the agent does not observe the environment state directly; instead, it acts on the basis of the history $h_t = \{O_0\} \cup \{A_k, R_k, O_{k+1} : 0 < k < t-1\} \in (\mathcal{O} \times \mathcal{A} \times \mathcal{R})^t = \mathcal{H}$.
Generalised Policy Iteration (GPI).
where $\hat{v}_k$ denotes the value approximation at iteration $k$, $A_t \sim \mathbb{P}_\pi(\cdot \mid S_t)$, and $S_{t+1} \sim \mathbb{P}_{\pi,\mu}(\cdot \mid S_t, A_t)$. The Bellman operator is a $\gamma$-contraction in the $\ell_\infty$ and the $\ell_2$ norms, and its fixed point is the value of the policy $\pi$. Hence, successive applications of the Bellman operator improve prediction accuracy because the current value estimate gets closer to the true value of the policy. We refer to PE as the prediction objective (Sutton and Barto, 2018). Policy improvement maps a policy $\pi$ to an improved policy, typically one that is greedy with respect to the current value estimate.
For any number $k \in [1, +\infty]$ of evaluation steps per iteration, and under mild assumptions, Modified Policy Iteration (MPI) converges to an optimal policy (Puterman, 2014).
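As an illustration of the scheme just described, here is a minimal tabular Python sketch that alternates approximate policy evaluation (repeated application of the Bellman operator) with greedy policy improvement. It is an illustrative sketch only: the tensor shapes, hyperparameters and function names are assumptions, and the survey itself targets deep function approximation rather than this tabular setting.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, sweeps=100):
    """Approximate v^pi by repeated application of the Bellman operator."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(sweeps):
        # v(s) <- sum_a pi(a|s) [ R(s, a) + gamma * sum_s' P(s'|s, a) v(s') ]
        q = R + gamma * np.einsum("sap,p->sa", P, v)
        v = np.einsum("sa,sa->s", pi, q)
    return v

def policy_improvement(P, R, v, gamma=0.99):
    """Return the policy that acts greedily with respect to the current value estimate."""
    q = R + gamma * np.einsum("sap,p->sa", P, v)
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), np.argmax(q, axis=1)] = 1.0
    return pi

def generalised_policy_iteration(P, R, gamma=0.99, iterations=20):
    """P: (S, A, S') transition tensor, R: (S, A) reward matrix."""
    n_states, n_actions, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from a uniform policy
    for _ in range(iterations):
        v = policy_evaluation(P, R, pi, gamma)
        pi = policy_improvement(P, R, v, gamma)
    return pi, v
```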
4 Quantifying action influences
We start by answering Q1., which aims to address the problem of what to measure when referring to credit. Since Minsky (1961) raised the Credit Assignment Problem (CAP), a multitude of works have paraphrased his words:
“ The problem of how to incorporate knowledge ” and “ given an outcome, how relevant were past decisions? ” (Harutyunyan et al.,, 2019 ) ,
“ Is concerned with identifying the contribution of past actions on observed future outcomes ” (Arumugam et al.,, 2021 ) ,
“ The problem of measuring an action’s influence on future rewards ” (Mesnard et al.,, 2021 ) ,
“ An agent must assign credit or blame for the rewards it obtains to past states and actions ” (Chelu et al.,, 2022 ) ,
“ The challenge of matching observed outcomes in the future to decisions made in the past ” (Venuto et al.,, 2022 ) ,
“ Given an observed outcome, how much did previous actions contribute to its realization? ” (Ferret,, 2022 , Chapter 4.1) .
These descriptions converge to Minsky's original question and show agreement in the literature on an informal notion of credit. In this introduction, we reflect on the different metrics that exist in the literature to quantify it. We generalise the idea of action value, which often only refers to q-values, to that of action influence, which describes a broader range of metrics used to quantify the credit of an action. While we do not provide a definitive answer on what credit should be, we review how different works in the existing RL literature have characterised it. We now start by developing an intuition of the notion of credit.
Consider Figure 1, inspired by both Figure 1 of Harutyunyan et al. (2019) and the umbrella problem in Osband et al. (2020). The action taken at $x_0$ determines the return of the episode by itself. From the point of view of control, any policy that always takes $a'$ in $x_0$ (i.e., $\pi^* \in \Pi^*: \pi^*(a' \mid x_0) = 1$), and then any other action afterwards, is an optimal policy. From the CAP point of view, some optimal actions, namely those after the first one, do not actually contribute to optimal returns. Indeed, alternative actions still produce optimal returns and contribute equally to each other to achieve the goal, so their credit is equal. We can see that credit not only identifies optimal actions but also informs us of how necessary they are to achieve an outcome of interest.
From the example, we can deduce that credit evaluates actions for their potential to influence an outcome. The resulting CAP is the problem of estimating the influence of an action over an outcome from experimental data and describes a pure association between them.
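To make this intuition concrete, here is a minimal Python sketch of an umbrella-style chain task in the spirit of Figure 1, where only the first action determines the final return and every later action is irrelevant; the environment, its length and its reward scheme are illustrative assumptions, not the exact task of Osband et al. (2020).

```python
import random

class UmbrellaChain:
    """Only the action taken at the first step determines the return."""
    def __init__(self, length=10):
        self.length = length

    def reset(self):
        self.t = 0
        self.first_action = None
        return self.t

    def step(self, action):
        if self.t == 0:
            self.first_action = action           # the only decision that matters
        self.t += 1
        done = self.t >= self.length
        # The reward is delayed to the last step and depends on the first action only.
        reward = float(self.first_action == 1) if done else 0.0
        return self.t, reward, done

# Any policy taking action 1 at t=0 is optimal, regardless of all later actions:
env = UmbrellaChain()
state, ret, done = env.reset(), 0.0, False
while not done:
    action = 1 if state == 0 else random.choice([0, 1])
    state, reward, done = env.step(action)
    ret += reward
print(ret)  # 1.0
```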
Why solve the CAP?
Action evaluation is a cornerstone of RL. In fact, solving a control problem often involves running a GPI scheme. Here, the influence of an action drives learning, for it suggests a possible direction to improve the policy. For example, the action value plays that role in Equation (3). It follows that the quality of the measure of influence fundamentally impacts the quality of the policy improvement. Low-quality evaluations can lead the policy to diverge from the optimal one, hinder learning, and slow down progress (Sutton and Barto, 2018; van Hasselt et al., 2018). On the contrary, high-quality evaluations provide accurate, robust and reliable signals that foster convergence, sample-efficiency and low variance. While simple evaluations are enough for specialised experiments, the real world is a complex blend of multiple, sometimes hierarchical tasks. In these cases, the optimal value changes from one task to another, and these simple evaluations do not bode well to adapt to general problem solving. Yet, the causal structure that underlies the real world is shared among all tasks, and the modularity of its causal mechanisms is often a valuable property to incorporate. In these conditions, learning to assign credit in one environment becomes a lever to assign credit in another (Ferret et al., 2021a), and ultimately makes learning faster, more accurate and more efficient. For these reasons, and because an optimal policy only requires discovering one single optimal trajectory, credit stores knowledge beyond that expressed by optimal behaviours alone, and solving the control problem is not sufficient to solve the CAP, with the former being an underspecification of the latter.
4.1 Are all action values, credit?
These proxies are sufficient to select optimal actions and unlock solutions to complex tasks (Silver et al.,, 2018 ; Wang et al., 2016b, ; Kapturowski et al.,, 2019 ; Badia et al.,, 2020 ; Ferret et al., 2021b, ) . However, while these works explicitly refer to the action influence as a measure of credit, the term is not formally defined and it remains unclear where to draw the line between credit and other quantities. Key questions arise: What is the difference between these quantities and credit? Do they actually represent credit as originally formulated by Minsky, ( 1961 ) ? If so, under what conditions do they do? Without a clear definition of what to measure, we do not have an appropriate quantity to target when designing an algorithm to solve the CAP . More importantly, we do not have an appropriate quantity to use as a single source of truth and term of reference to measure the accuracy of other metrics of action influence, and how well they approximate credit. To fill this gap, we proceed as follows:
Section 4.2 formalises what is a goal or an outcome : what we evaluate the action for;
Section 4.3 unifies existing functions under the same formalism;
Section 4.4 formalises the CAP following this definition.
Section 4.5 analyses how different works interpreted and quantified action influences and reviews them.
Section 4.6 distils the properties of existing measures of action influence.
We suggest that the reader only interested in the final formalism skip directly to Section 4.4, and come back to the intermediate sections to understand the motivation behind it.
4.2 What is a goal?
Because credit measures the influence of an action upon achieving a certain goal, to define credit formally we must be able to describe goals formally; without a clear understanding of what constitutes a goal, an agent cannot construct a learning signal to evaluate its actions. Goal is a synonym for purpose, which we can informally describe as a performance to meet or a prescription to follow. Defining a goal rigorously allows us to make the relationship between the action and the goal explicit (Ferret, 2022, Chapter 4) and enables the agent to decompose complex behaviour into elementary ones in a compositional (Sutton et al., 1999; Bacon et al., 2017), and possibly hierarchical way (Flet-Berliac, 2019; Pateria et al., 2021; Hafner et al., 2022). This idea is at the foundation of many CA methods (Sutton et al., 1999, 2011; Schaul et al., 2015a; Andrychowicz et al., 2017; Harutyunyan et al., 2019; Bacon et al., 2017; Smith et al., 2018; Riemer et al., 2018; Bagaria and Konidaris, 2019; Harutyunyan et al., 2018; Klissarov and Precup, 2021). We proceed with a formal definition of goals in the next paragraph, and review how these goals are represented in seminal works on CA in the one after. This will lay the foundation for a unifying notion of credit later in Section 4.3.
Defining goals.
To define goals formally we adopt the reward hypothesis , which posits:
That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward). (Sutton,, 2004 ) .
Here, the goal is defined as the behaviour that results from the process of maximising the return. The reward hypothesis has been further advanced by later studies (Abel et al.,, 2021 ; Pitis,, 2019 ; Shakerinava and Ravanbakhsh,, 2022 ; Bowling et al.,, 2023 ) . In the following text, we employ the goal definition in Bowling et al., ( 2023 ) , which we report hereafter:
Definition 1 (Goal).
Subjective and objective goals.
The Markov Reward Theorem holds both in the case of the preferences being defined internally by the agent itself – this is the case of intrinsic motivation (Piaget et al., 1952; Chentanez et al., 2004; Barto et al., 2004; Singh et al., 2009; Barto, 2013; Colas et al., 2022) – and in the case where they originate from an external entity, such as an agent designer. In the first case, the agent doing the maximising is the same as the one holding the ordering over policies, and we refer to the corresponding goal as a subjective goal. In the second case, an agent designer or an unknown, non-observable entity holds the ordering, and a separate learning agent is the one pursuing the optimisation process. We refer to a goal as an objective goal in this latter case. These settings usually correspond to the distinction between goals and sub-goals in the literature (Liu et al., 2022).
In other words, this way of defining goals or outcomes corresponds to defining a task to solve, which in turn can be expressed through a reward function with the characteristics described above. Vice-versa, the reward function can encode a task. When credit is assigned with respect to a particular goal or outcome, it then evaluates the influence of an action to solving a particular task. As discussed above, this is key to decompose and recompose complex behaviours and the definition aligns with that of other disciplines, such as psychology where a goal …is a cognitive representation of something that is possible in the future (Elliot and Fryer,, 2008 ) or philosophy, where representations do not merely read the world as it is but they express preferences over something that is possible in the future (Hoffman,, 2016 ; Prakash et al.,, 2021 ; Le Lan et al.,, 2022 ) .
Representing goals and outcomes.
However, expressing the relation between actions and goals explicitly, that is, when the function that returns the credit of an action has a goal as an input, raises the problem of how to represent a goal for computational purposes. This is important because, among the CA methods that define goals explicitly (Sutton et al., 2011; Schaul et al., 2015a; Andrychowicz et al., 2017; Rauber et al., 2019; Harutyunyan et al., 2019; Tang and Kucukelbir, 2021; Arulkumaran et al., 2022; Chen et al., 2021), not many use the rigour of a general-purpose definition of goal such as that in Bowling et al. (2023). In these works, the goal-representation space, which we denote as $\psi \in \Psi$, is arbitrarily chosen to represent specific features of a trajectory. It denotes an object, rather than a performance or a prescription to meet. For example, a goal representation $\psi$ can be a state (Sutton et al., 2011; Andrychowicz et al., 2017), with $\psi \in \Psi = \mathcal{S}$. It can be a specific observation (Nair et al., 2018), with $\psi \in \Psi = \mathcal{O}$. Alternatively, it can be an abstract feature vector (Mesnard et al., 2021) that reports on some characteristics of a history, in which case $\psi \in \Psi = \mathbb{R}^d$, where $d$ is the dimensionality of the vector. A goal can even be represented by a natural language instruction (Luketina et al., 2019), with $\psi \in \Psi = \mathbb{R}^d$ being the embedding of that piece of text. A goal can be represented by a scalar $\psi \in \Psi = \mathbb{R}$ (Chen et al., 2021) that indicates a specific return to achieve, or even a full command (Schmidhuber, 2019), that is, a return to achieve in a specific window of time.
While these representations are all useful heuristics, they lack formal rigour and leave space for ambiguities. For example, saying that the goal is a state might mean that visiting the state at the end of the trajectory is the goal, or that visiting it in the middle of the trajectory is. This is often not formally defined, and it is not always clear what reward function corresponds to a specific representation of a goal. In the following text, when surveying a method or a metric that specifies a goal, we refer to the specific goal representation used in the work and make an effort to detail the reward function that underpins that goal representation.
4.3 What is an assignment?
Having established a formalism for goals and outcomes, we are now ready to describe credit formally, and we proceed with a formalism that unifies the existing measures of action influence. We first describe a generic function that generalises most CA measures, and then proceed to formalise the CAP. Overall, this formulation provides a term of reference for the quantities described in Section 4.5. We now formalise an assignment:
Definition 2 (Assignment).
Consider an action $a \in \mathcal{A}$, a goal $g \in \mathcal{G}$, and a context $c \in \mathcal{C}$ representing some experience data. We use the term assignment function, or simply assignment, to denote a function $\mathcal{K}$ that maps a context, an action and an outcome to a quantity $y \in \mathcal{Y}$, which we refer to as the influence of the action.
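In the notation above, and consistently with the parameterised form used later in Definition 3, one plausible reading of the assignment referred to as Equation (4) in the remainder of the text is the mapping:

$$K: \mathcal{C} \times \mathcal{A} \times \mathcal{G} \rightarrow \mathcal{Y}, \qquad y = K(c, a, g).$$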
Here, a context $c \in \mathcal{C}$ represents some input data and can be arbitrarily chosen depending on the assignment in question. A context must hold information about the present, for example, the current state or the current observation; it may contain information about the past, for example, the sequence of past decisions that occurred until now for a POMDP; and, to evaluate the current action, it must contain information about what future actions will be taken in-potentia, for example by specifying a policy to follow when $a \in \mathcal{A}$ is not taken, or a fixed trajectory, in which case the current action is evaluated in hindsight (Andrychowicz et al., 2017). We provide further details on contexts in Appendix B.
In the general case, the action influence is a random variable $Y \in \mathcal{Y} \subset \mathbb{R}^d$. This is the case, for example, of the action-value distribution (Bellemare et al., 2017) as described in Equation (10), where the action influence is defined over the full distribution of returns. However, most methods extract some scalar measure of the full influence distribution, such as expectations (Watkins, 1989), and the action influence becomes a scalar $y \in \mathbb{R}$. In the following text, we mostly consider scalar forms of the influence, $\mathcal{Y} = \mathbb{R}$, as these represent the majority of the existing formulations.
In practice, an assignment provides a single mathematical form to talk about the multitude of ways to quantify action influence that are used in the literature. It takes an action $a \in \mathcal{A}$, some contextual data $c \in \mathcal{C}$ and a goal $g \in \mathcal{G}$, and maps them to some measure of action influence. While maintaining the same mathematical form, different assignments can return different values of action influence and steer the improvement in different directions.
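As an illustration of this shared form, the following Python sketch frames two familiar measures of action influence, q-values and the advantage, as instances of the same assignment interface $K(c, a, g) \rightarrow y$; the function and argument names are assumptions for exposition only.

```python
from typing import Callable, Dict, Tuple

# An assignment maps (context, action, goal) to a scalar measure of action influence.
Assignment = Callable[[object, int, object], float]

def q_value_assignment(q: Dict[Tuple[object, int], float]) -> Assignment:
    """Influence of `a` in context `c` measured by its expected return (the goal is implicit)."""
    def k(c, a, g=None):
        return q[(c, a)]
    return k

def advantage_assignment(q: Dict[Tuple[object, int], float],
                         policy_probs: Callable[[object, int], float],
                         n_actions: int) -> Assignment:
    """Influence measured relative to the policy-weighted value of the context."""
    def k(c, a, g=None):
        v = sum(policy_probs(c, b) * q[(c, b)] for b in range(n_actions))
        return q[(c, a)] - v
    return k
```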
Equation (4) also resembles the General Value Function (GVF) (Sutton et al., 2011), where the influence $y = q^\pi(s, a, g)$ is the expected return of the policy $\pi$ when taking action $a$ in state $s$, with respect to a goal $g$. However, in GVFs: (i) $y$ is an action value and does not generalise to other forms of action influence; (ii) the goal is an MDP state $g \in \mathcal{S}$ and does not generalise to our notion of goals in Section 4.2; and (iii) the function only considers forward predictions and does not generalise to evaluating an action in hindsight (Andrychowicz et al., 2017). Table 2 contains further details on the comparison and further specifies the relationship between the most common functions and their corresponding assignment.
4.4 The credit assignment problem
The generality of the assignment formalism reflects the great heterogeneity of action influence metrics, which we review later in Section 4.5 . This heterogeneity shows that, even if most studies agree on an intuitive notion of credit, they diverge in practice on how to quantify credit mathematically. Having unified the existing assignments in the previous section, we now proceed to formalise the CAP analogously. This allows us to put the existing methods into a coherent perspective as a guarantee for a fair comparison, and to maintain the heterogeneity of the existing measures of action influence.
We cast the CAP as the problem of approximating a measure of action influence from experience. We assume standard model-free, Deep RL settings and consider an assignment represented as a neural network $k: \mathcal{C} \times \mathcal{A} \times \mathcal{G} \times \Phi \rightarrow \mathbb{R}$ with parameters $\phi \in \Phi = \mathbb{R}^n$ that can be used to approximate the credit of the actions. This usually represents the critic, or the value function, of an RL algorithm. In addition, we admit a stochastic function to represent the policy, also in the form of a neural network $f: \mathcal{S} \times \Theta \rightarrow \Delta(\mathcal{A})$, with parameter set $\theta \in \Theta = \mathbb{R}^m$. We assume that $n \ll |\mathcal{S}| \times |\mathcal{A}|$ and $m \ll |\mathcal{S}| \times |\mathcal{A}|$, and note that subsets of parameters are often shared between the two functions.
Definition 3 (The credit assignment problem).
Consider an MDP $\mathcal{M}$, a goal $g \in \mathcal{G}$, and a set of experience $c \in \mathcal{C}$. Consider an arbitrary assignment $K \in \mathcal{K}$ as described in Equation (4). Given a parameterised function $\widetilde{K}: \mathcal{C} \times \mathcal{A} \times \mathcal{G} \times \Phi \rightarrow \mathbb{R}$ with parameter set $\phi \in \Phi \subset \mathbb{R}^n$, we refer to the Credit Assignment Problem as the problem of finding the set of parameters $\phi \in \Phi$ such that the approximation matches the chosen assignment on the experience and goals of interest.
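One plausible explicit statement of this condition, consistent with the surrounding text, is:

$$\widetilde{K}(c, a, g, \phi) = K(c, a, g), \qquad \forall\, c \in \mathcal{C},\; a \in \mathcal{A},\; g \in \mathcal{G}.$$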
Different choices of action influence have a great impact on the hardness of the problem. In particular, there is a trade-off between:
how effective the chosen measure of influence is to inform the direction of the policy improvement,
how easy it is to learn that function from experience.
For example, using causal influence (Janzing et al., 2013) as a measure of action influence makes the CAP hard to solve in practice. The reason is that discovering causal mechanisms from associations alone is notoriously challenging (Pearl, 2009; Bareinboim et al., 2022), and pure causal relationships are rarely observed in nature (Pearl et al., 2000) outside of specific experimental conditions. However, causal knowledge is reliable, robust to changes in the experience collected, and effective, and causal mechanisms can be invariant to changes in the goal. On the contrary, q-values are easier to learn as they represent a measure of statistical correlation between state-actions and outcomes, but their knowledge is limited to the bare minimum necessary to solve a control problem. Which quantity to use in each specific instance or each specific problem is still a subject of investigation in the literature. Ideally, we should aim for the most effective measure of influence that can be learned with the least amount of experience.
4.5 Existing assignment functions
We now survey the most important assignment functions from the literature and their corresponding measures of action influence. The following list is not exhaustive, but rather representative of the limitations of existing credit formalisms. For brevity, and without loss of generality, we omit functions that do not explicitly evaluate actions (for example, state values), but we note that it is still possible to reinterpret an assignment to a state as an assignment to the set of actions that led to that state.
State-action values
(Shannon,, 1950 ; Schultz,, 1967 ; Michie,, 1963 ; Watkins,, 1989 ) are a hallmark of RL , and are described by the following expression:
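In the notation of Section 3, and consistent with the surrounding text, the state-action value of a policy $\pi$ is the expected return conditioned on the current state and action:

$$q^{\pi}(s, a) = \mathbb{E}_{\pi,\mu}\left[ Z_{t} \mid S_{t} = s, A_{t} = a \right].$$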
Here, the context $c$ is a state $s \in \mathcal{S}$ in the case of MDPs, or a history $h \in \mathcal{H}$ for a POMDP. The $q$-function quantifies the credit of an action by the expected return of the action in the context. $q$-values are among the simplest ways to quantify credit and offer a basic mechanism to solve control problems. However, while $q$-functions offer solid theoretical guarantees in tabular RL, they can be unstable in Deep RL. When paired with bootstrapping and off-policy learning, $q$-values are well known to diverge from the optimal solution (Sutton and Barto, 2018). van Hasselt et al. (2018) provide empirical evidence of the phenomenon, investigating the relationship between divergence and performance, and how different variables affect divergence. In particular, the work showed that the Deep Q-Network (DQN) (Mnih et al., 2015) is not guaranteed to converge to the optimal $q$-function. The divergence rate on both evaluation and control problems increases depending on specific mechanisms, such as the amount of bootstrapping or the amount of prioritisation of updates (Schaul et al., 2015b). An additional problem arises in GPI schemes used to solve control problems. While during evaluation the policy is fixed, here the policy continuously changes, and it becomes more challenging to track the target of the update while converging to it, as the change of policy makes the problem appear non-stationary from the point of view of the value estimation. This is because the policy changes, but there is no signal that informs the policy evaluation about the change. To mitigate the issue, many methods either use a fixed network as an evaluation target (Mnih et al., 2015), perform Polyak averaging of the target network (Haarnoja et al., 2018), or clip the gradient update to a maximum cap (Schulman et al., 2017). To further support the idea, theoretical and empirical evidence (Bellemare et al., 2016) shows that the $q$-function is inconsistent: for any suboptimal action $a$, the optimal value function $q^*(s, a)$ describes the value of a non-stationary policy, which selects a different action $\pi^*(s)$ (rather than $a$) at each visit of $s$. The inconsistency of $q$-values for suboptimal actions has also been shown empirically. Schaul et al. (2022) measured the per-state policy change $W(\pi, \pi' \mid s) = \sum_{a \in \mathcal{A}} |\pi(a \mid s) - \pi'(a \mid s)|$ for several Atari 2600 games of the Arcade Learning Environment (ALE) (Bellemare et al., 2013), and showed that the action gap (the difference in value between the best and the second-best action) undergoes brutal changes despite the agent maintaining a constant value of expected returns.
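To ground the stabilisation mechanisms mentioned above (a frozen target network and Polyak averaging), the following PyTorch-style sketch shows a one-step temporal-difference update for a q-network. It is a simplified illustration under assumed batch conventions and hyperparameters, not the exact algorithm of any cited work.

```python
import torch
import torch.nn as nn

def td_update(q_net: nn.Module, target_net: nn.Module, optimiser, batch,
              gamma: float = 0.99, tau: float = 0.005):
    """One TD(0) update towards a frozen-target bootstrap, followed by Polyak averaging."""
    s, a, r, s_next, done = batch  # tensors; `a` holds int64 action indices, `done` is 0/1 floats
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # q(s, a)
    with torch.no_grad():
        # Bootstrap from the target network to keep the regression target stable.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    # Polyak averaging slowly tracks the online network with the target network.
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
    return loss.item()
```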
In practice, Deep RL algorithms often use $q$-targets to approximate the $q$-value, for example, $n$-step targets (Sutton and Barto, 2018, Chapter 7) or $\lambda$-returns (Watkins, 1989; Jaakkola et al., 1993; Sutton and Barto, 2018, Chapter 12). However, we consider them as methods, rather than quantities to measure credit, since the $q$-value is the quantity to which the function approximator converges. For this reason we discuss them in Section 6.1.
The advantage function
(Baird, 1999) measures, in a given state, the difference between the q-value of an action and the value of its state:
Here, the context $c$ is the same as in Equation (6). Because $v^\pi(s) = \sum_{a \in \mathcal{A}} q(s, a) \mathbb{P}_\pi(a)$ and $A^\pi(s, a) = q^\pi(s, a) - \mathbb{E}_\pi[q^\pi(s, a)]$, the advantage function measures action influence by the amount an action is better than average, that is, whether $A^\pi(s, a) > 0$. As also shown in Bellemare et al. (2016), using the advantage to quantify credit can increase the action gap, the value difference between the optimal and the second-best action. Empirical evidence has shown the consistent benefits of advantage over q-values (Baird, 1999; Wang et al., 2016b; Bellemare et al., 2016; Schulman et al., 2016), most probably due to its regularisation effects (Vieillard et al., 2020b; Vieillard et al., 2020a; Ferret et al., 2021a). On the other hand, when estimated directly, and not by composing state and state-action values, for example in Pan et al. (2022), the advantage does not permit bootstrapping. This is because the advantage lacks an absolute measure of action influence, and only maintains one that is relative to the other possible actions. Overall, in canonical benchmarks for both evaluation (Wang et al., 2016b) and control (Bellemare et al., 2013), the advantage has been shown to improve over $q$-values (Wang et al., 2016b). In particular, policy evaluation experiences faster convergence in large action spaces because the state value $v^\pi(s)$ can hold information that is shared between multiple actions. For control, it improves the score over several Atari 2600 games compared to both double $q$-learning (van Hasselt et al., 2016) and prioritised experience replay (Schaul et al., 2015b).
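As an illustration of how the advantage is typically composed from a state value and a relative, mean-subtracted term in Deep RL, here is a minimal dueling-style network sketch in PyTorch; the layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """q(s, a) = v(s) + A(s, a) - mean_a A(s, a): the advantage term is relative by construction."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)               # v(s)
        self.advantage_head = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        z = self.torso(state)
        v = self.value_head(z)                       # shape (B, 1)
        adv = self.advantage_head(z)                 # shape (B, n_actions)
        # Subtracting the mean advantage recovers q-values while sharing v(s) across actions.
        return v + adv - adv.mean(dim=1, keepdim=True)
```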
General Value Functions (GVFs)
(Sutton et al., 2011; Schaul et al., 2015a) are a set of q-value functions that predict returns with respect to multiple reward functions:
where $R$ is a pseudo-reward function and $\mathcal{R}$ is an arbitrary, pre-defined set of reward functions. Notice that we omit the pseudo-termination and pseudo-discounting terms that appear in their original formulation (Sutton et al., 2011) to maintain the focus on credit assignment. The context $c$ is the same as for $q$-values and the advantage, and the goal that the pseudo-reward represents is to reach a specific state $g = s \in \mathcal{S}$. When first introduced (Sutton et al., 2011), the idea of GVFs stemmed from the observation that canonical value functions are limited to a single task at a time. Solving a new task would require learning a value function ex-novo. By maintaining multiple assignment functions at the same time, one for each goal, GVFs can instantly quantify the influence of an action with respect to multiple goals simultaneously. However, while GVFs maintain multiple assignments, the goal is still not an explicit input of the value function. Instead, it is left implicit, and each assignment serves the ultimate goal of maximising a different pseudo-reward function (Sutton et al., 2011).
Universal Value Function Approximators (UVFAs)
(Schaul et al., 2015a) scale GVFs to Deep RL and advance their idea further by conflating these multiple assignment functions into a single one, represented as a deep neural network. Here, unlike for state-action values and GVFs, the goal is an explicit input of the assignment:
The action influence here is measured with respect to a goal explicitly. This makes it possible to leverage the generalisation capacity of deep neural networks and to generalise not only over the space of states but also over that of goals.
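A minimal sketch of a UVFA-style network, in which the goal representation is an explicit input next to the state so that a single set of parameters generalises over both; the dimensions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class UVFA(nn.Module):
    """q(s, a, g): a single network conditioned on both the state and a goal embedding."""
    def __init__(self, state_dim: int, goal_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Concatenate state and goal so credit is measured with respect to the goal explicitly.
        return self.net(torch.cat([state, goal], dim=-1))
```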
Distributional values
(Jaquette, 1973; Sobel, 1982; White, 1988; Bellemare et al., 2017) consider the full return distribution $Z_t$ instead of the expected value:
Distributional advantage
(Arumugam et al.,, 2021 ) proposes a probabilistic equivalent of the expected advantage:
and borrows the properties of both distributional values and the expected advantage. Intuitively, Equation ( 11 ) measures how much the value distribution for a given state-action pair changes relative to the distribution for the particular state only, marginalising over all actions. The relationship between the two distributions can then be interpreted as the distributional analogue of Equation ( 7 ), where the two quantities appear in their expectation instead. The biggest drawback of this measure of action influence is that it is only treated in theory and there is no empirical evidence that supports distributional advantage as a useful proxy for credit in practice.
Hindsight advantage
(Harutyunyan et al.,, 2019 ) stems from conditioning the action influence on future states or returns. The return-conditional hindsight advantage function can be written as follows:
Here $A^\pi(s, a, z)$ denotes the return-conditional advantage and $\mathbb{P}_D(a_t \mid S_t = s, Z_t = z)$ is the return-conditional hindsight distribution that describes the probability of taking action $a$ in $s$, given that we observe return $z$ at the end of the episode. The context $c$ is the same as for $q$-values and the advantage, and the goal is a specific value of the return, $Z = z$. The idea of hindsight, initially presented in Andrychowicz et al. (2017), is that even if the trajectory does not provide useful information for the main goal, it can be revisited as if the goal were the outcome just achieved. Hindsight advantage brings this idea to the extreme: rather than evaluating only for a pre-defined set of goals as in Andrychowicz et al. (2017), it evaluates for every experienced state or return. Here, the action influence is quantified by the proportion of return determined by the ratio in Equation (12). To develop an intuition of it, if the action $a$ leads to the return $z$ with probability $1$, such that $\mathbb{P}_D(A_t = a \mid S_t = s, Z_t = z) = 1$, but the behaviour policy $\pi$ takes $a$ with probability $0$, the credit of the action $a$ is $0$. There also exists a state-conditional formulation rather than a return-conditional one; we refer to Harutyunyan et al. (2019) for details on it to keep the description concise.
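Hindsight formulations require an estimate of the return-conditional hindsight distribution $\mathbb{P}_D(a \mid s, z)$ learned from logged experience. The sketch below fits such an estimator as a classifier over (state, return) pairs; it is an illustrative sketch only and does not reproduce the exact training procedure of Harutyunyan et al. (2019).

```python
import torch
import torch.nn as nn

class HindsightDistribution(nn.Module):
    """Estimates P(A_t = a | S_t = s, Z_t = z): which action was taken, given the observed return."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, ret: torch.Tensor):
        # Condition on the return observed at the end of the episode (hindsight information).
        logits = self.net(torch.cat([state, ret.unsqueeze(-1)], dim=-1))
        return torch.distributions.Categorical(logits=logits)

def hindsight_loss(model, states, actions, returns):
    """Maximum-likelihood fit of the hindsight distribution on logged (s, a, z) triples."""
    return -model(states, returns).log_prob(actions).mean()
```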
Future-conditional advantage
(Mesnard et al.,, 2021 ) generalises hindsight advantage to use an arbitrary property of the future:
Counterfactual advantage
(Mesnard et al., 2021) proposes a specific choice of $F$ such that $F$ is independent of the current action. This produces a future-conditional advantage that factorises the influence of an action into two components: the contribution deriving from the intervention itself (the action) and the luck, represented by all the components not under the control of the agent at time $t$, such as fortuitous outcomes of the state-transition dynamics, exogenous reward noise, or future actions. The form is the same as in Equation (13), with the additional condition that $A_t \perp F_t$, so that $\mathbb{E}_U[D_{KL}(\mathbb{P}(A_t \mid S_t = s) \,||\, \mathbb{P}(A_t \mid S_t = s, F_t = f))] = 0$. The main intuition behind the counterfactual advantage is the following: while computing counterfactuals requires access to a model of the environment, in model-free settings we can still compute all the relevant information $u$ that does not depend on this model. Once learned, a model of $u$ can then represent a valid baseline to compute counterfactuals in a model-free way. To stay within the scope of this section, we detail how to learn this quantity in Section 6.4.
Posterior value functions
(Nota et al., 2021) reflect on partial observability and propose a characterisation of the hindsight advantage bespoke to POMDPs. The intuition behind Posterior Value Functions (PVFs) is that the evaluated action accounts only for a small portion of the variance of returns. The majority of the variance is often due to the part of the trajectory that has yet to happen. For this reason, incorporating into the baseline information that is usually unavailable after the time when the action was taken can have a greater impact in reducing the variance of the policy-gradient estimator. PVFs focus on the variance of a future-conditional baseline (Mesnard et al., 2021) caused by the partial observability. Nota et al. (2021) factorise a state $s$ into an observable component $o$ and a non-observable one $u$, and formalise the PVF as follows:
The corresponding hindsight state-action value is then $q_t^{\pi}(h_t, a) = R(s_t, a_t) + v_t^{\pi}(h_t)$.
Policy-conditioned values
(Harb et al.,, 2020 ; Faccio et al.,, 2021 ) are value functions that include the policy as an input. For example, a policy-conditioned state-action value has the form:
where $\pi_i \in \Pi$ denotes the policy from which to sample the actions that follow $A_t$. Here, the context $c$ is the union of the current MDP state and the policy $\pi$, and the goal is to maximise the MDP return, which is unknown to the agent. The main difference with state-action values is that, all else being equal, $q(s, \pi, a, g)$ produces different values instantly when $\pi$ varies, since $\pi$ is now an explicit input, whereas $q^\pi(s, a, g)$ requires a full PE procedure instead. Using the policy as an input raises the problem of representing a policy in a way that can be fed to a neural network. Harb et al. (2020) and Faccio et al. (2021) propose two methods to represent a policy. To keep our attention on the CAP, we refer to their works for further details on possible ways to represent a policy (Harb et al., 2020; Faccio et al., 2021). Here we limit ourselves to conveying that the problem of representing a policy has already been raised in the literature.
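A minimal sketch of a policy-conditioned value function, using a naive flattened-parameter representation of the policy as input; this is only one of several representation choices discussed by Harb et al. (2020) and Faccio et al. (2021), and all names and sizes here are assumptions.

```python
import torch
import torch.nn as nn

def flatten_policy(policy: nn.Module) -> torch.Tensor:
    """A naive policy representation: the concatenation of all its parameters."""
    return torch.cat([p.detach().flatten() for p in policy.parameters()])

class PolicyConditionedQ(nn.Module):
    """q(s, pi, a): the value changes instantly when the policy input changes."""
    def __init__(self, state_dim: int, policy_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + policy_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, policy_repr: torch.Tensor) -> torch.Tensor:
        policy_repr = policy_repr.expand(state.shape[0], -1)  # broadcast over the batch
        return self.net(torch.cat([state, policy_repr], dim=-1))
```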