Goal-conditioned reinforcement learning learns to reach goals instead of optimizing hand-crafted rewards. Despite its popularity, the community often categorizes goal-conditioned reinforcement learning as a special case of reinforcement learning. In this post, we aim to build a direct conversion from any reward-maximization reinforcement learning problem to a goal-conditioned reinforcement learning problem, and to draw connections with the stochastic shortest path framework. Our conversion provides a new perspective on the reinforcement learning problem: maximizing rewards is equivalent to reaching some goals.
Reinforcement learning (RL) has achieved great success in solving many decision-making problems, from outperforming human-level control on challenging games
When solving a navigation task, researchers will probably consider the problem differently. One researcher might describe it as “maximize cumulative reward.” Another might say “reach the destination as quickly as possible.” A third might say “maximize the probability of reaching the destination.” These sound like different problems, but they are closely related.
In standard reinforcement learning (RL), the objective is to maximize cumulative reward. In the stochastic shortest path (SSP) view, the objective is to reach a terminal state with minimum total cost, often interpreted as minimizing path length
Two sides of this triangle are already well understood. Prior work shows that any RL problem can be converted into an equivalent SSP problem
Can any reward maximization problem be converted into a goal-reaching problem?
In this post we show that the answer to this question is YES: any reward-maximization problem (MDP) can be converted into an equivalent goal-reaching problem (GCMDP). Our construction augments the original MDP with two absorbing states and pushes the reward function into the transition dynamics. We show that, under this construction, maximizing discounted return in the original MDP is exactly equivalent to maximizing the probability of reaching one of the absorbing states in the augmented GCMDP. Combined with the known RL–SSP equivalence, this yields a clean relationship.
In this section, we will review the three building blocks, RL, SSP, and GCRL, formally, providing the foundations to formulate the conversions between them.
We consider a Markov decision process
We first review the standard RL objective and the definition of successor measures. The goal of RL is to learn a policy \(\pi: \mathcal{S} \to \Delta(\mathcal{A})\) that maximizes the expected discounted return
\[\begin{align} \max_{\pi} J_{\gamma}(\pi), \quad J_{\gamma}(\pi) = (1 - \gamma) \mathbb{E}_{\tau \sim \pi(\tau)} \left[ \sum_{t = 0}^{\infty} \gamma^t r(s_t, a_t) \right], \label{eq:rl-obj} \end{align}\]where \(\tau\) is a trajectory sampled by the policy \(\pi\). Alternatively, we can swap the discounted sum over rewards into a discounted sum over states and use it to describe the expected discounted return. Namely, the discounted sum over states is called the discounted state occupancy measure
where \(p^{\pi}_t(s)\) is the probability measure of reaching state \(s\) at exact time step \(t\) under policy \(\pi\):
\[\begin{align} p^{\pi}_t(s) = \int_{\mathcal{S} \times \mathcal{A} \times \cdots \times \mathcal{S} \times \mathcal{A}} p_0(s_0) \prod_{k = 0}^{t - 1} \pi(a_k \mid s_k) p(s_{k + 1} \mid s_k, a_k) ds_0 da_0 \cdots ds_{t - 1} da_{t - 1}, \quad s_t = s. \label{eq:time-dependent-succ-measure} \end{align}\]We note that the probability measure of visiting state \(s\) at the beginning is \(p^{\pi}_0(s) \triangleq p_0(s)\).
This alternative definition implies that maximizing the expected discounted return is equivalent to maximizing the expected reward under the successor measure.
We now review the formulation of goal-conditioned RL problems. Goal-conditioned RL
Prior work usually defined the goal-conditioned reward function as a delta measure centered at the goal \(r_{\text{GC}}(s, g) = \delta(s \mid g)\)
Intuitively, this objective indicates that solving GCRL is equivalent to finding the goal-conditioned policy that maximizes the probability measure of reaching desired goals when commanded towards those goals. Note that the GCRL problem is also equivalent to a multi-task RL problem
Until now, we have not discussed what happens after the agent reaches the goal. Since GCRL problems typically have infinite horizons and the agent can still move after reaching the goal, the optimal behavior is not only reaching the goal but also staying at the goal as long as possible. We next discuss a special type of GCMDP that simplifies agents’ behaviors after reaching the goal, namely, the stochastic shortest path problem.
The third building block for our discussion is the stochastic shortest path problem. We will review the formal definition of the SSP problem and discuss its connection with GCRL. The stochastic shortest path problem is defined as
Definition 1. Given a goal $g \in \mathcal{G}$, a SSP policy $\pi_{\text{SSP}}: \mathcal{S} \times \mathcal{G} \to \Delta(\mathcal{A})$ is proper, if there exists a finite time step $t < \infty$ such that the policy reaches the goal $g$ when commanded towards the same goal $g$, i.e., $p^{\pi_{\text{SSP}}}_t(g \mid g) > 0$.
Therefore, a mild assumption is to assume all policies of interest are proper. Using the definition of a proper policy, the SSP problem can be adapted from a GCRL problem with the following modifications:
A negative cost for not reaching the goal, e.g., \(r_{\text{SSP}}(s, g) = - \mathbb{1}(s \neq g)\).
A goal-conditioned transition probability measures \(p_{\text{SSP}}: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \to \Delta(\mathcal{S})\): for any transition \((s, a, s')\) and any goal \(g\),
\[\begin{align} p_{\text{SSP}}(s' \mid s, a, g) = \begin{cases} p(s' \mid s, a) & \text{if} \, s \neq g \\ \\ \delta(s' \mid g) & \text{otherwise} \end{cases}. \label{eq:ssp-transition} \end{align}\]Intuitively, the agent will always stay at the goal after reaching it.
Finite horizon \((\gamma = 1)\) with proper policies. For any proper policy, the SSP objective is to minimize the sum of costs (maximize the sum of rewards) before hitting the goal:
\[\begin{align} \max_{\pi_\text{SSP}: \: \pi_\text{SSP} \text{ is proper}} J_{\text{SSP}}(\pi_{\text{SSP}}), \quad J_{\text{SSP}}(\pi_{\text{SSP}}) = \mathbb{E}_{\tau \sim \pi_{\text{SSP}}(\tau)} \left[ \sum_{t = 0}^{\infty} r_{\text{SSP}}(s_t, g) \right]. \end{align}\]This objective is finite because of the proper policy assumption.
As indicated by the goal-conditioned transitions in Eq.\(~\ref{eq:ssp-transition}\), the agent will always stay at the desired goals once it reaches them. Therefore, the optimal behavior is to reach the goal as quickly as possible. We include an example comparing the optimal behaviors of GCRL and SSP in the figure below.
Remark. Although we show that SSP can be interpreted as a special case of GCRL (Eq.\(~\ref{eq:ssp-transition}\)), we consider them separately in our discussion. The reason is twofold. First, the standard GCRL formulation does not assume goal-conditioned transition probability measures as in Eq.\(~\ref{eq:ssp-transition}\). Second, the GCRL problem does not assume every Markovian policy is proper as in Definition 1. We note that the optimal GCRL policy is proper.
Prior work has already shown that (See Sec. 2.3 of Bertsekas and Tsitsiklis (1996)
Can we convert any RL problem into a GCRL problem?
We will construct a conversion next.
Building upon the three building blocks, we now show that any RL problem can be converted into a GCRL problem. The goal of our construction is to preserve policy optimality: the optimal policy in a standard MDP is equivalent to the optimal goal-conditioned policy for reaching some goal in the resulting GCMDP. We start the construction by augmenting the state space and the transition probability measure of a standard MDP \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, p_0, \gamma)\).
The augmented state space additionally includes two terminal states, a success state \(g_+\) and a failure state \(g_-\): \(\mathcal{S}_{\text{aug}} = \mathcal{S} \cup \{ g_+, g_- \}\).
The augmented probability measures of transiting into states in the original state space are proportional to the discount factor; The augmented probability measures of transiting into the two terminal states are proportional to the reward in the original MDP: for any state \(s \in \mathcal{S}_{\text{aug}}\), action \(a \in \mathcal{A}\), and next state \(s' \in \mathcal{S}_{\text{aug}}\),
\[\begin{align} p_{\text{aug}}(s' \mid s, a) = \begin{cases} \gamma p(s' \mid s, a) & \text{if } s \in \mathcal{S} \text{ and } s' \in \mathcal{S} \\ (1 - \gamma) r(s, a) & \text{if } s \in \mathcal{S} \text{ and } s' = g_+ \\ (1 - \gamma) (1 - r(s, a)) & \text{if } s \in \mathcal{S} \text{ and } s' = g_- \\ \delta(s' \mid s) & \text{if } s \in \{ g_+, g_- \} \end{cases}. \label{eq:aug-transition-prob-measure} \end{align}\]These augmentations allow us to define the goal state as a singleton \(\mathcal{G}_{\text{aug}} = \{ g_+ \}\) with goal prior \(p_{\mathcal{G}_{\text{aug}}}(g) = \delta(g \mid g_+)\), choose the reward function as the delta measure \(r_\text{aug}(s, g) = \delta(s \mid g)\). We will use a separate discount factor \(\gamma_{\text{aug}} \in [0, 1]\) for the new augmented GCMDP. Taken together, the augmented GCMDP shares the same action space and initial state measure as the original MDP: \(\mathcal{M}_{\text{aug}} = (\mathcal{S}_{\text{aug}}, \mathcal{A}, \mathcal{G}_{\text{aug}}, p_{\text{aug}}, p_0, \gamma_{\text{aug}}, p_{\mathcal{G}_{\text{aug}}})\). Below, we provide a didactic example showing the conversion from a standard MDP to its augmented GCMDP.
Didactic example. River Swim MDP
Our construction provides a generic conversion from any MDP to its corresponding GCMDP. However, there is one important question remaining:
Does the augmented GCMDP preserve policy optimality?
In the context of the River Swim MDP, the question becomes: is the optimal goal-reaching policy for state $g_+$ in the augmented GCMDP equivalent to the policy that keeps moving right in the original MDP?
Fortunately, the answer is YES and we provide a formal proof for the general statement in the next section.
In this section, we will show that the optimal goal-conditioned policy in the augmented GCMDP is equivalent to the optimal policy in the original reward-maximizing MDP. We start by examining the GCRL objective of the augmented GCMDP. Given that the augmented reward function is a delta measure \(r_\text{aug}(s, g) = \delta(s \mid g)\) and the goal space only contains \(g_+\), the GCRL objective of the augmented GCMDP (Eq.\(~\ref{eq:gcrl-obj}\)) for the goal-conditioned policy \(\pi_{\text{aug}}(a \mid s, g)\) reduces to
\[\begin{align} J_{\text{GCRL}}(\pi_{\text{aug}}) = \mathbb{E}_{g \sim \delta(g \mid g_+)}[p^{\pi_{\text{aug}}}_{\gamma_{\text{aug}}}( g \mid g)] = p^{\pi_{\text{aug}}}_{\gamma_{\text{aug}}}( g_+ \mid g_+), \end{align}\]which is the success measure of reaching the single success state
We are now ready to prove that the optimal goal-conditioned policy \(\pi_{\text{aug}}^{\star}\) in the augmented GCMDP corresponds to the optimal policy \(\pi^{\star}\) in the original MDP. The main idea is to relate the GCRL objective to the RL objective, showing that they are equivalent to each other. We will first introduce the hitting time and then use it to compute the successor measure.
Definition 2. Given a goal-conditioned policy $\pi_{\text{aug}}: \mathcal{S} \to \Delta(\mathcal{A})$ and any state $s \in \mathcal{S}_{\text{aug}}$, the hitting time $H^{\pi_{\text{aug}}}(s)$ indicates the first time step that the agent reaches the state $s$, \begin{align} H^{\pi_{\text{aug}}}(s) = \inf \left\{ h \in \mathbb{N}: s_h = s \text{ following } \pi_{\text{aug}} \right\}. \end{align}
We can relate the probability mass of the hitting time of the success state $g_+$ to the success measure in the original MDP.
Lemma 1. Given the goal-conditioned policy $\pi_{\text{aug}}$, the probability mass of the hitting time of the success state $g_+$ satisfies, for any $h \in \mathbb{N}$, \begin{align} p \left( H^{\pi_{\text{aug}}}(g_+) = h \right) = \begin{cases} 0 & h = 0 \\ (1 - \gamma) \gamma^{h - 1} \mathbb{E}_{s \sim p^{\pi_{\text{aug}}}_{h - 1}(s), a \sim \pi_{\text{aug}}(a \mid s) } \left[ r(s, a) \right] & h = 1, 2, \cdots. \end{cases} \end{align}
Eventually, using this lemma, we can compute the successor measure of the goal-conditioned policy \(\pi_{\text{aug}}: \mathcal{S} \to \Delta(\mathcal{A})\) in the augmented GCMDP.
Theorem 1. The successor measure of the goal-conditioned policy $\pi_{\text{aug}}$ at the success state $g_+$ in the augmented GCMDP is equivalent to the RL objective in the original MDP, i.e., \begin{align} p^{\pi_{\text{aug}}}_{\gamma_{\text{aug}}}(g_+) = J_{\text{GCRL}}(\pi_{\text{aug}}) = \frac{(1 - \gamma) \gamma_{\text{aug}} }{1 - \gamma \gamma_{\text{aug}}} J_{\gamma \gamma_{\text{aug}}}(\pi_{\text{aug}}). \end{align} Therefore, when $\gamma_{\text{aug}} = 1$, the augmented GCMDP preserves policy optimality: \begin{align} \pi_{\text{aug}}^{\star} = \text{argmax} \, J_{\text{GCRL}}(\pi_{\text{aug}}) = \text{argmax} \, J_{\gamma}(\pi) = \pi^{\star}. \end{align}
Given the goal-conditioned policy $\pi_{\text{aug}}$, by definition of the successor measure (Eq.$~\ref{eq:succ-measure}$), we have \begin{align*} p^{\pi_{\text{aug}}}_{\gamma_{\text{aug}}}(g_+) &= (1 - \gamma_{\text{aug}}) \sum_{t = 0}^{\infty} \gamma_{\text{aug}}^t p^{\pi_{\text{aug}}}_{t}(g_+) \\ &\overset{(a)}{=} (1 - \gamma_{\text{aug}}) \sum_{t = 0}^{\infty} \gamma_{\text{aug}}^t p \left( H^{\pi_{\text{aug}}}(g_+) \leq t \right) \\ &\overset{(b)}{=} (1 - \gamma_{\text{aug}}) \sum_{t = 0}^{\infty} \gamma_{\text{aug}}^t \cdot \sum_{h = 0}^{t} p \left( H^{\pi_{\text{aug}}}(g_+) = h \right) \\ &\overset{(c)}{=} (1 - \gamma_{\text{aug}}) \sum_{h = 0}^{\infty} p \left( H^{\pi_{\text{aug}}}(g_+) = h \right) \cdot \sum_{t = h}^{\infty} \gamma^t_{\text{aug}} \\ &= \sum_{h = 0}^{\infty} \gamma_{\text{aug}}^h p \left( H^{\pi_{\text{aug}}}(g_+) = h \right) \\ &\overset{(d)}{=} \sum_{h = 1}^{\infty} \gamma_{\text{aug}}^h \left( (1 - \gamma) \gamma^{h - 1} \mathbb{E}_{s \sim p^{\pi_{\text{aug}}}_{h - 1}(s), a \sim \pi_{\text{aug}}(a \mid s) } \left[ r(s, a) \right] \right) \\ &= (1 - \gamma) \gamma_{\text{aug}} \sum_{h = 1}^{\infty} (\gamma \gamma_{\text{aug}})^{h - 1} \mathbb{E}_{s \sim p^{\pi_{\text{aug}}}_{h - 1}(s), a \sim \pi_{\text{aug}}(a \mid s) } \left[ r(s, a) \right] \\ &\overset{(e)}{=} \frac{(1 - \gamma) \gamma_{\text{aug}}}{1 - \gamma \gamma_{\text{aug}}} \cdot (1 - \gamma \gamma_{\text{aug}}) \sum_{h = 0}^{\infty} (\gamma \gamma_{\text{aug}})^{h - 1} \mathbb{E}_{s \sim p^{\pi_{\text{aug}}}_{h}(s), a \sim \pi_{\text{aug}}(a \mid s) } \left[ r(s, a) \right] \\ &\overset{(f)}{=} \frac{(1 - \gamma) \gamma_{\text{aug}} }{1 - \gamma \gamma_{\text{aug}}} \mathbb{E}_{s \sim p_{\gamma \gamma_{\text{aug}}}^{\pi_{\text{aug}}}(s), a \sim \pi_{\text{aug}}(a \mid s)}\left[ r(s, a) \right] \\ &= \frac{(1 - \gamma) \gamma_{\text{aug}} }{1 - \gamma \gamma_{\text{aug}}} J_{\gamma \gamma_{\text{aug}}}(\pi_{\text{aug}}) \end{align*} where in (a) we apply the observation that reaching the success state $g_+$ at exact time step $t$ in the augmented GCMDP suggests that the hitting time must be less than $t$, in (b) we use the additive axiom of probability for disjoint events, in (c) we swap the sum over time step $t$ and the sum over hitting time $h$ by grouping the terms using different hitting time, in (d) we plug in the definition of the hitting time (Lemma 1), in (e) we apply change of variable, and in (f) we plug in the definition of the successor measure (Eq.$~\ref{eq:succ-measure}$).
Therefore, when setting $\gamma_{\text{aug}} = 1$, we have $p^{\pi_{\text{aug}}}_{\gamma_{\text{aug}} = 1}(g_+) = J_{\gamma}(\pi_{\text{aug}})$. Thus, maximizing the successor measure in the augmented GCMDP is equivalent to maximizing the RL objective in the original MDP, i.e., \begin{align*} \pi_{\text{aug}}^{\star} = \text{argmax} \, J_{\text{GCRL}}(\pi_{\text{aug}}) = \text{argmax} \, J_{\gamma}(\pi) = \pi^{\star}. \end{align*}
Remark. When setting \(\gamma_{\text{aug}} = 1\), one intriguing observation is that solving the augmented GCMDP is similar to solving an SSP problem as well.
Taken together, our conversion provides a new perspective to understand the RL problem: maximizing rewards is equivalent to reaching some goals. Therefore, in theory, we can use any GCRL algorithms to solve a reward-maximizing RL problem. We next mention some practical caveats of our conversion from an RL problem to a GCRL problem.
Hindsight relabeling
In practice, solving the augmented GCMDP might not be easier than solving the original MDP directly (e.g., using TD learning methods). The reasons are twofold. First, the main component of our construction is to augment the transition probability measure using the reward function in the original MDP. After augmentation, the rewards go into the stochasticity of the transition, resulting in much higher variance when interacting with the environment. Second, for dense rewards that provide supervision for RL algorithms at each time step, those supervisions have been deferred into the two additional states $g_+$ and $g_-$ (a sparse reward problem), inducing much more challenging exploration.
We have shown the conversion from RL problems to the corresponding GCRL problems, but is this conversion actually useful? Can we use algorithms for one problem to solve another problem in practice? We next answer these questions by running GCRL algorithms to maximize rewards in a canonical discrete MDP called Cliff Walking.
The Cliff Walking MDP is adapted from Example 6.6 from Sutton and Barto (1998)
We select four different state-of-the-art GCRL and SSP algorithms to solve the augmented GCMDP, comparing against the standard Q-learning algorithm in the original MDP.
GCQL is the goal-conditioned variant of Q-learning with the indicator reward function \(r_{\text{aug}}(s, g) = \delta(s \mid g) = \mathbb{1}(s = g)\).
GCQL w/ HER applies the hindsight relabeling
CRL
QRL
To prevent confounding errors from data collection and speed up learning, we collect a dataset with \(100\text{K}\) transitions in the original MDP for training Q-learning, and also collect another dataset with \(100\text{K}\) transitions in the GCMDP for training the aforementioned GCRL and SSP algorithms. For evaluation, we compare the average success rates for reaching the goal in the original MDP over \(100\) trajectories, reporting means and standard deviations over $4$ random seeds.
The results in the figure above suggest that Q-learning quickly converges to $100\%$ success rate, while GCRL and SSP methods struggle to match its performance. Although CRL and QRL typically estimate a dense success measure or a dense distance function, they still suffer from the high variance in environmental transitions. This observation is consistent with the caveat that solving the augmented GCMDP is not necessarily easier than solving the original MDP, due to high transition variance and challenging exploration. We also observe that GCQL achieves a success rate similar to its variant with HER, indicating that hindsight relabeling might not speed up learning since $g_+$ and $g_-$ are not in the original state space.
In this blog post, we share new ideas about converting a standard RL problem into a GCRL problem. These ideas are motivated by prior connections between the RL and the SSP and, in turn, complementarily bridge all three building blocks. Although our preliminary experiments do not show a positive sign for using GCRL algorithms to solve any reward-maximization RL problems in practice, we believe our constructions are still of interest to the broader community and raise many intriguing research directions.
Prior work has used unique properties for the (deterministic) shortest path problem, such as the triangle inequality
Another key question is how to make our conversions practical? Perhaps there are alternative constructions that convert an MDP into a GCMDP, enabling efficient application of modern GCRL algorithms. Perhaps we need to develop powerful GCRL algorithms to harness these conversions.
Importantly, our conversions also bring up an open-ended question: After all, RL, SSP, and GCRL are different frameworks describing some mechanisms (blinds). So what are the underlying physical rules that actually govern the world (the elephant)?
Answering these questions not only introduces new interpretations to the RL problems but also motivates the development of efficient and scalable RL algorithms. Taken together, we are excited to see more progress towards general-purpose reinforcement learning agents that can actually understand the world.
PLACEHOLDER FOR ACADEMIC ATTRIBUTION
BibTeX citation
PLACEHOLDER FOR BIBTEX