*
Continuing from my last post (about Knowledge-Based RL), this topic also revolves around using "human knowledge"
to increase RL performance.
*

For a quick reminder, in Knowledge-Based RL (KBRL), we leverage human knowledge to enhance an agent's performance. There are numerous ways to do this, but research most commonly uses logical reasoning (e.g. Logical Neural Networks vs. Markov Networks, First Order Logic, Graphs, etc.) to represent human knowledge and the relationships between various entities, and an RL component then exploits that information.

From this definition, KBRL uses *facts*, or at least *commonsense*, to achieve greater feats. Generally,
these two ideas can be *explainable* (a word that one of my labmates really dislikes lol -> how do you define explainability?).
Then, **what about information that has low explainability**? For example, I really like vanilla ice cream over chocolate
ice cream. Why? I just *prefer* it.

This concept forms the basis for **Preference-Based Reinforcement Learning** (PBRL). In essence, we know that humans
prefer A over B, even when there's no clear-cut explanation. Following suit, we apply this to creating better policies.

In this blog, I'll briefly explain Preference-Based RL and go over the following:

- Intro to PBRL & technical explanation
- Current PBRL advancements
- PBRL for Language Models (!!)

㍙ Intro to PBRL & technical explanation

ᕗ So what is PBRL and in what cases might PBRL be advantageous compared to normal RL? ᕙ

To preface, Christiano et al. [6] explain how defining a great reward function may be challenging in some tasks. The authors state:

"For example, suppose that we wanted to use reinforcement learning to train a robot to clean a table or scramble an egg. It's not clear how to construct a suitable reward function, which will need to be a function of the robot's sensors. We could try to design a simple reward function that approximately captures the intended behavior, but this will often result in behavior that optimizes our reward function without actually satisfying our preferences."

Imagine it's your first time learning to cook. One way we can learn is through *imitation*, which inspires *Imitation
Learning* (that's a blog for another day). Another way we can learn is through feedback, i.e., we can perform some
actions, and a teacher critiques us. This is the essence of PBRL. The agent performs and a human critic evaluates whether
its performance is suitable. We can then **learn a reward function from these human preferences**, and a policy trained
on it can potentially outperform policies that exploit shortcuts in a hand-designed reward.

ᕗ Great, so how do we accomplish this technically? ᕙ

*To my knowledge, most PBRL work follows this method.*

*tl;dr version.* First, we must have at least two models: (1) the agent \(\pi_\theta\) and
(2) the reward function estimate \(\hat{r}_\psi\), which estimates the human's *unexplainable*
reward function. We train \(\pi_\theta\) on the rewards \(r_t = \hat{r}_\psi(o_t, a_t)\).
A human compares pairs of trajectory segments \((\sigma^1, \sigma^2)\) collected by the agent, where
\(\sigma = ((o_0, a_0), (o_1, a_1), ..., (o_n, a_n))\), and reports whether they prefer one over the other (or if they are equal).
The reward function estimate is updated so that it captures the human's preferences more accurately.
We can then use this model to supply rewards to the agent.

*in-depth version.* Since the human gives feedback on their preference regarding
trajectory pairs, we must create a method that intuitively shows whether \(\hat{r}_\psi\) also follows the same
preference. We use the following equation (1) to showcase the reward
estimate's preference:

\( \begin{array}{c} \hat{P}[\sigma^1 > \sigma^2] = \dfrac{\text{exp} (\sum\limits_{t=0}^n \hat{r}(o^1_t, a^1_t))} {\text{exp} (\sum\limits_{t=0}^n \hat{r}(o^1_t, a^1_t)) + \text{exp} (\sum\limits_{t=0}^n \hat{r}(o^2_t, a^2_t))} \end{array} \)

This essentially captures the likelihood that the reward function estimate prefers \(\sigma^1\) over \(\sigma^2\).
The numerator is the exponential of the total reward from trajectory 1, and the denominator is the sum
of the exponential of the total reward from trajectory 1 and the exponential of the total
reward from trajectory 2 (i.e., a softmax over the two returns). *As an example, say in \(\sigma^1\) the agent received a
total reward of 10 while in \(\sigma^2\) the agent received a total reward of 5. Then, the likelihood that
\(\hat{r}_\psi\) prefers \(\sigma^1\) over \(\sigma^2\) would be \(\frac{\text{exp}(10)}{\text{exp}(10) +
\text{exp}(5)} = \frac{e^{10}}{e^{10} + e^5} \approx 0.99\).*
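To make equation (1) concrete, here is a minimal sketch in plain Python. The function name `preference_probability` and the per-step reward lists are my own illustrative choices (not from [6]); the only trick added is subtracting the larger return before exponentiating, which keeps the ratio identical while avoiding overflow.

```python
import math

def preference_probability(rewards_1, rewards_2):
    """Estimate P[sigma^1 > sigma^2] as in equation (1).

    rewards_1 / rewards_2 are the per-step reward estimates for each
    segment (hypothetical inputs for illustration).
    """
    return_1 = sum(rewards_1)
    return_2 = sum(rewards_2)
    # Shift by the max return for numerical stability; the ratio is unchanged.
    shift = max(return_1, return_2)
    e1 = math.exp(return_1 - shift)
    e2 = math.exp(return_2 - shift)
    return e1 / (e1 + e2)

# The example from the text: total rewards of 10 and 5.
p = preference_probability([4.0, 6.0], [2.0, 3.0])
print(round(p, 4))  # 0.9933
```

Two segments with the same total reward give exactly 0.5, which matches the intuition that the estimate has no preference between them.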

Now, we use these estimated likelihood preferences and update the model based on the actual labels following this equation (2):

\( \begin{array}{c} \mathcal{L}(\psi) = - \sum\limits_{(\sigma^1, \sigma^2, y)\in \mathcal{D}} \Big[ (1-y) \ln \hat{P}[\sigma^1 > \sigma^2] + y \ln \hat{P}[\sigma^1 < \sigma^2] \Big] \end{array} \)

where \(\mathcal{D}\) is the collected human-preference dataset consisting of tuples \((\sigma^1, \sigma^2, y)\) with \(y \in \{0, 1, 0.5\}\) indicating the critic's preference: from equation (2), \(y=0\) means \(\sigma^1\) is preferred, \(y=1\) means \(\sigma^2\) is preferred, and \(y=0.5\) means the two are equally preferred. Intuitively, we update the reward function estimate so that its likelihood preference matches these labels. Equation (2) is a cross-entropy loss, so we can update the reward function estimate in a supervised learning manner.
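As a rough sketch of equation (2), the loss can be computed from segment returns alone. The names `log_preference` and `preference_loss`, and the choice to store each data point as `(return_1, return_2, y)`, are assumptions made for illustration; a real implementation would get the returns from a neural network \(\hat{r}_\psi\).

```python
import math

def log_preference(return_a, return_b):
    """ln P[a > b] under equation (1), computed with a stable log-sum-exp."""
    shift = max(return_a, return_b)
    log_denominator = shift + math.log(
        math.exp(return_a - shift) + math.exp(return_b - shift)
    )
    return return_a - log_denominator

def preference_loss(dataset):
    """Equation (2): y=0 means segment 1 preferred, y=1 segment 2, y=0.5 a tie."""
    total = 0.0
    for return_1, return_2, y in dataset:
        total -= (1.0 - y) * log_preference(return_1, return_2)
        total -= y * log_preference(return_2, return_1)
    return total

# Estimated returns that agree with the labels give a small loss...
agreeing = [(10.0, 5.0, 0.0), (2.0, 3.0, 1.0), (1.0, 1.0, 0.5)]
# ...while the same returns with flipped labels give a much larger one.
disagreeing = [(10.0, 5.0, 1.0), (2.0, 3.0, 0.0), (1.0, 1.0, 0.5)]

loss_agree = preference_loss(agreeing)
loss_disagree = preference_loss(disagreeing)
print(loss_agree < loss_disagree)  # True
```

Note that a tie (\(y=0.5\)) contributes \(\ln 2\) even when the two estimated returns are equal, so the loss never reaches zero on a dataset containing ties.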

*Note. Some papers diverge slightly here and there, for example, Kim et al. [4] use
the expectation rather than the summation for equation (2).*

ᕗ Finally, we've succeeded in creating a model that estimates the human's preference. This enables the policy to receive rewards following a complex function that may not be easy to define. ᕙ
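Putting equations (1) and (2) together, here is a toy end-to-end sketch of fitting a reward estimate from preference labels. Everything specific here is an assumption for illustration: the reward is linear in a made-up 2-d feature vector (real PBRL work uses a neural network), the "human" is a synthetic oracle that always prefers the segment with the higher true return, and the optimizer is plain full-batch gradient descent.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(w, phi):
    return sum(wi * xi for wi, xi in zip(w, phi))

def segment_features(rng, length=5):
    """Hypothetical segment summary: per-step 2-d features summed over time,
    so a linear reward gives segment return = w . phi."""
    steps = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(length)]
    return [sum(s[0] for s in steps), sum(s[1] for s in steps)]

def fit_reward_weights(dataset, steps=200, lr=0.5):
    """Minimize the mean of the equation (2) loss by gradient descent."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for phi_1, phi_2, y in dataset:
            diff = [a - b for a, b in zip(phi_1, phi_2)]
            p_1 = sigmoid(dot(w, diff))   # equation (1) for two segments
            coeff = p_1 - (1.0 - y)       # d(loss)/d(return_1 - return_2)
            for i in range(2):
                grad[i] += coeff * diff[i] / len(dataset)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

rng = random.Random(0)
true_w = [1.0, -1.0]  # the hidden "human" reward (assumed for the demo)
dataset = []
for _ in range(200):
    phi_1, phi_2 = segment_features(rng), segment_features(rng)
    y = 0.0 if dot(true_w, phi_1) > dot(true_w, phi_2) else 1.0
    dataset.append((phi_1, phi_2, y))

w = fit_reward_weights(dataset)
agreement = sum(
    (dot(w, p1) > dot(w, p2)) == (y == 0.0) for p1, p2, y in dataset
) / len(dataset)
print(w, agreement)
```

The learned weights only need to rank segments the way the oracle does, so their scale is not meaningful; what matters is that the learned reward agrees with the preference labels.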

㍚ Current PBRL advancements

That being said, the results from [6] contradict the outcome we hoped for. While
PBRL and traditional RL go head-to-head on the easier OpenAI Gym tasks, in the more complex Atari games
the PBRL-trained agent **cannot (on average) outperform the traditional RL agent.** Furthermore, even in cases
where the PBRL agent successfully outperformed the traditionally trained agent, the authors use a tremendous number of
queries to the human critic: around 5.5k human labels, or in **total roughly 5 hrs of human labor** (as stated in the
footnotes of page 11).

ᕗ So, what advancements have been made to improve PBRL performance to inch towards our original goal, that is, an agent that learns how to scramble some eggs? ᕙ

*Imitation Learning + PBRL.*
The next iteration of this work, by many of the same authors as [6], is Ibarz et al. [7]. Ibarz et al. argue that:
(1) if the agent is trained only by preference, it may struggle to cover enough of the state space, because
the critic ends up comparing pairs of poorly performing trajectories, which makes for uninformative comparisons, and
(2) training solely by preference is inefficient time-wise.

Hence, the authors decide to pretrain an agent using Imitation Learning and
fine-tune the model using human preferences. While the results of [7] also may not be the best, Ibarz et al. find that
labels output by a synthetic oracle aid agent performance significantly. In other words, if the human critic is
an expert in the task (and not just some random person hired to participate in research), we can potentially train
an agent that develops the optimal policy. *So, then in a task where there is no synthetic
oracle and any human is an expert, we can create a super AI! Totally not foreshadowing for a future section...*

*PBRL benchmark.* Lee et al. [1] develop a benchmark for PBRL methods. Specifically, they simulate
human preferences but add irrationality to these simulations. While the paper goes in-depth
in comparing various features, one thing that I'd like to mention specifically is section 3.2, or the following
equation (3):

\( \begin{array}{c} \hat{P}[\sigma^1 > \sigma^2] = \dfrac{\text{exp} (\beta\sum\limits_{t=0}^n \hat{r}(o^1_t, a^1_t))} {\text{exp} (\beta\sum\limits_{t=0}^n \hat{r}(o^1_t, a^1_t)) + \text{exp} (\beta\sum\limits_{t=0}^n \hat{r}(o^2_t, a^2_t))} \end{array} \)

Note that equation (3) is almost identical to
equation (1); the only difference stems from the new hyperparameter \(\beta \in [0, \infty)\). The authors in
[1] mention that sometimes, human preferences don't stem from rational decision-making (lol). Hence, it's a wise decision to
incorporate a parameter controlling (ir)rationality in these simulated decisions to better capture authentic human preferences.
The authors play around with \(\beta\); as \(\beta \rightarrow 0\), the preferences approach a uniform
distribution (a coin flip between the two segments). On the other hand, as \(\beta \rightarrow \infty\), the preferences become more *rational*
(as any rational human being should prefer the trajectory with the higher total reward) and deterministic.
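A small sketch of how \(\beta\) in equation (3) changes a simulated critic. The function names and the way a label is sampled are my own illustrative assumptions rather than the B-Pref code; the probability itself is just equation (3) rewritten as a logistic function of \(\beta\) times the return difference.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def teacher_probability(return_1, return_2, beta):
    """Equation (3): probability the simulated critic prefers segment 1."""
    return sigmoid(beta * (return_1 - return_2))

def sample_label(return_1, return_2, beta, rng):
    """Sample y using the convention y=0 -> segment 1 preferred."""
    p_1 = teacher_probability(return_1, return_2, beta)
    return 0 if rng.random() < p_1 else 1

for beta in (0.0, 0.1, 1.0, 100.0):
    print(beta, round(teacher_probability(10.0, 5.0, beta), 4))
# beta=0.0   -> 0.5     (coin flip)
# beta=0.1   -> 0.6225
# beta=1.0   -> 0.9933  (same as equation (1))
# beta=100.0 -> 1.0     (effectively deterministic)

rng = random.Random(0)
labels = [sample_label(10.0, 5.0, 0.0, rng) for _ in range(1000)]
```

With \(\beta = 0\) the sampled labels are split roughly evenly between the two segments even though one has twice the total reward, which is exactly the "irrational" critic the benchmark wants to stress-test against.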

*PBRL + Exploration vs. Exploitation.* Lastly, I'll go over [3]. To preface,
*Exploration vs. Exploitation* is a classic problem in reinforcement learning. The idea is that (most probably)
there doesn't exist an optimal (and feasible) strategy that perfectly balances exploration (i.e., examining
the entire state space) and exploitation (i.e., taking greedy actions known to succeed). Liang et al. summarize
that previous research observes better performance when the agent explores the state space based on its history:

"Thrun (1992) showed that exploration methods that utilize the agent's history has been shown to perform much better than random exploration. Hence, a common setup is to include an intrinsic reward as an exploration bonus. The intrinsic reward can be defined by Count-Based methods which keep count of previously visited states and rewards the agents for visiting new states Bellemare et al. (2016); Tang et al. (2017); Ostrovski et al. (2017)."

We previously mentioned how an agent might encounter challenges with exploration when we use end-to-end PBRL. Hence, the authors encourage better exploration by incorporating an intrinsic reward, intuitively based on how confident the human feedback is. The new reward function follows equation (4):

\( \begin{array}{c} r := \hat{r}_{\psi}(o_t, a_t) + \beta_t \cdot \hat{r}_{\text{std}}(o_t, a_t) \end{array} \)

where \(\beta_t=\beta_0(1-\rho)^t\) is a weight on the intrinsic reward that decays at rate \(\rho\) as training progresses. The added feature in equation (4) is \(\hat{r}_{\text{std}}(o_t, a_t)\), which stands for the standard deviation across an ensemble of reward functions \(\{\hat{r}_{\psi_i}\}^N_{i=1}\). Where the human feedback carries low confidence, the variance between the reward functions in the ensemble is higher. Therefore, we nudge the agent into exploring state-action pairs where the critic has less confidence in the expected return. The weight \(\beta_t\) serves a similar purpose to \(\epsilon\) in \(\epsilon\)-greedy methods, where over time the agent should exploit more often than explore.
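Here is a minimal sketch of equation (4). It assumes that \(\hat{r}_\psi(o_t, a_t)\) is taken as the mean of the ensemble's predictions and that the population standard deviation is used for \(\hat{r}_{\text{std}}\); both are my assumptions for illustration, as are the function names and the example numbers.

```python
import statistics

def beta_schedule(beta_0, rho, t):
    """beta_t = beta_0 * (1 - rho)^t, the decaying exploration weight."""
    return beta_0 * (1.0 - rho) ** t

def exploration_reward(ensemble_predictions, beta_0, rho, t):
    """Equation (4) for one (o_t, a_t), given each ensemble member's prediction."""
    r_hat = statistics.fmean(ensemble_predictions)   # assumed: ensemble mean
    r_std = statistics.pstdev(ensemble_predictions)  # ensemble disagreement
    return r_hat + beta_schedule(beta_0, rho, t) * r_std

# All ensemble members agree: no exploration bonus.
print(exploration_reward([1.0, 1.0, 1.0], beta_0=0.5, rho=0.1, t=0))  # 1.0

# Members disagree (same mean): a bonus early in training...
early = exploration_reward([0.0, 2.0], beta_0=0.5, rho=0.1, t=0)
# ...that has decayed away later on.
late = exploration_reward([0.0, 2.0], beta_0=0.5, rho=0.1, t=100)
print(early, late > 1.0, late < early)  # 1.5 True True
```

Because the bonus only depends on disagreement, a state-action pair that every ensemble member scores identically gets no extra reward no matter how early in training it is visited.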

*
Note. [2,4] are other papers co-advised by Kimin Lee and Pieter Abbeel that I'd highly recommend checking out.
*

㍛ PBRL for Language Models

ᕗ We've now seen a basic survey of current PBRL research. Adding on though, one of the most exciting implementations of PBRL is in the training of Large Language Models! ᕙ

Generally, LLMs may be trained in an unsupervised learning manner on numerous (huge!) datasets. For example, GPT-3 [5] uses a crazy amount of data to develop a model with commonsense knowledge:

\( \begin{array}{c c c c} \hline & \text{Quantity} & \text{Weight in} & \text{Epochs elapsed when} \\ \text{Dataset} & \text{(tokens)} & \text{training mix} & \text{training for 300B tokens} \\ \hline \text{Common Crawl (filtered)} & \text{410 billion} & 60\% & 0.44 \\ \text{WebText2} & \text{19 billion} & 22\% & 2.9 \\ \text{Books1} & \text{12 billion} & 8\% & 1.9 \\ \text{Books2} & \text{55 billion} & 8\% & 0.43 \\ \text{Wikipedia} & \text{3 billion} & 3\% & 3.4 \\ \hline \end{array} \)

Alongside this huge dataset, the LLM might get **fine-tuned through human-in-the-loop RL**! This is where PBRL comes into
play. In [8], Wu et al. take on the task of summarizing entire books using LLMs, and fine-tune pre-trained models (in their
case they use GPT-3) through PBRL. The authors compare fine-tuning GPT-3 with PBRL vs. Behavior Cloning on expert
demonstrations and observe that PBRL pulls ahead as the amount of data increases.

The authors also explain how researchers must consider *scalable oversight*, i.e. developing an effective
training signal as tasks scale up. For a task such as summarizing books, it may be challenging for researchers to
design a straightforward training signal that lets the LLM perform effectively. **Potentially, the only path to
providing a beneficial training signal is to use PBRL.** This provides valuable insight into fine-tuning other LLMs
(or just large models) for a variety of tasks that we aim to solve in the future (such as scrambling an egg!). While
PBRL requires a laborious amount of human-labeled data, the benefit it may provide for model performance (alongside the fact that this data is orders of magnitude
smaller than the pretraining datasets used) brings excitement to each step taken closer towards AGI.

ᐍ References

[1] Lee et al. B-Pref: Benchmarking Preference-Based Reinforcement Learning. NeurIPS 2021.

[2] Park et al. SURF: Semi-Supervised Reward Learning with Data Augmentation for Feedback-Efficient Preference-Based Reinforcement Learning. ICLR 2022.

[3] Liang et al. Reward Uncertainty for Exploration in Preference-Based Reinforcement Learning. ICLR 2022.

[4] Kim et al. Preference Transformer: Modeling Human Preferences Using Transformers for RL. ICLR 2023.

[5] Brown et al. Language Models are Few-Shot Learners. OpenAI 2020.

[6] Christiano et al. Deep Reinforcement Learning from Human Preferences. NIPS 2017.

[7] Ibarz et al. Reward learning from human preferences and demonstrations in Atari. NIPS 2018.

[8] Wu et al. Recursively Summarizing Books with Human Feedback. OpenAI 2021.