
Improving on RLHF with Language Feedback


In November of 2022, the “State of AI Report” released its fifth survey of what its authors and reviewers considered the most exciting developments in AI. With over 110 pages of content, including summaries and extensive citations, it is one of the largest industry reports to date. Looking at the insights, we can tell that 2022 was an explosive year in the development of human-centric AI, and we can draw some conclusions about the directions AI research is heading. One section that caught our attention at Label Studio was the two pages devoted to recent research in “Reinforcement Learning from Human Feedback” (RLHF). RLHF first emerged in 2017 as a technique for using human feedback to improve machine learning models, and we've recently seen a resurgence of its use as an essential training method for building large generative models like ChatGPT.

2022 will be remembered as the year generative models like ChatGPT captured the internet's imagination. But while users were astounded by the flood of generative models, researchers were busy refining methods that leverage the power of expressive human feedback to make fine-tuning even more effective. Going into 2023, we see this as part of an emerging trend of crafting ML models that improve through direct, expressive human feedback, and the techniques for doing so are only getting more powerful.

In the rest of this article, we'll explore the basics of RLHF, how it was used in building models like ChatGPT, and the direction new research is taking human-guided training and refinement.

How does RLHF work at a high level?

RLHF starts with a foundation model trained with unsupervised learning. Building this model is typically an expensive and time-consuming step, carried out over a massive amount of automatically collected data. The resulting model provides a robust foundation, capturing an immense amount of knowledge. With this raw material in hand, one can take the next step in the RLHF process: building a reward model that takes a collection of inputs and returns scalar values representing human preferences.
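
As a minimal sketch of that idea, assuming a small Hugging Face encoder with a scalar value head (the model choice and first-token pooling are illustrative, not what OpenAI used), a reward model might look like this:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """A pretrained encoder plus a linear head that outputs one scalar per input."""
    def __init__(self, base_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, **kwargs):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]                       # first-token ([CLS]) representation
        return self.value_head(pooled).squeeze(-1)  # one scalar preference score per sequence

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
reward_model = RewardModel()
batch = tokenizer(["Prompt text ... candidate response text"], return_tensors="pt")
score = reward_model(**batch)  # shape (1,): higher means "more preferred"
```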

To train this reward model, a collection of prompts is fed into the original model, creating a new dataset of input/output pairs. Humans then score and rank these pairs according to a system designed to smooth out the bias and noise inherent in human evaluation (Elo ratings, the same technique used to rank competitive chess players, are a popular choice). These scored and ranked input/output pairs become the training set for the reward model.
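
One common way to turn those ranked pairs into a training signal is a pairwise loss that pushes the preferred output's score above the rejected one's. A small sketch, assuming the scores come from a reward model like the one above:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_preferred, score_rejected):
    # -log(sigmoid(r_preferred - r_rejected)): minimized when the reward model
    # scores the human-preferred output higher than the rejected one.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy scores for two comparisons; in practice these come from the reward model.
loss = pairwise_ranking_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9]))
```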

This reward model is then used with a Reinforcement Learning (RL) algorithm to fine-tune the original language model. There are many options for fine-tuning methods, with Proximal Policy Optimization (PPO) being a popular and well-understood choice for very large models.
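
Implementing full PPO is beyond a short example (libraries such as Hugging Face's TRL provide it for transformer models), but the heart of the fine-tuning objective commonly used in RLHF can be sketched as the reward model's score minus a KL penalty that keeps the fine-tuned policy close to the original model; `beta` below is an illustrative hyperparameter:

```python
import torch

def shaped_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    # Approximate KL between the policy being fine-tuned and the frozen
    # original model, summed over the tokens of the generated response.
    kl = (policy_logprobs - reference_logprobs).sum(dim=-1)
    # Higher when the reward model likes the output AND the policy stays
    # close to the original pretrained model.
    return reward_score - beta * kl
```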

GPT-3, RLHF, and ChatGPT

Building large generative models relies on unsupervised learning over automatically collected, massive data sets. For example, GPT-3 was trained with data from “Common Crawl,” “WebText,” and other sources. When we talk about the scale of these data sets, we truly mean scale: the sources contain petabytes of data, which is needed to tune GPT-3's 175 billion parameters. The enormous training sets and the large parameter space give GPT-3 its expressive power as a generalized language model.

However, this sort of unsupervised learning comes at a cost. As OpenAI stated in the original GPT-3 paper, “internet-trained models have internet-scale biases.” Anyone who has worked with a publicly sourced language dataset of even modest size knows that such data is full of bias and harmful speech, and those biases become baked into models trained on it. To combat them, OpenAI turned to RLHF to create a new, evolved model: InstructGPT.

To train InstructGPT, OpenAI selected a group of 40 human annotators who were “sensitive to the preferences of different demographic groups, and were good at identifying outputs that were potentially harmful” to provide training feedback. The reinforcement learning dataset was generated by passing a collection of predefined inputs through GPT-3. The annotation team then ranked the resulting prompt/output pairs, and this ranked set was used to fine-tune GPT-3 into the InstructGPT model.
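
To get a feel for how such rankings become reward-model training data, here is a hedged sketch that expands a single ranking into pairwise comparisons; the helper and field names are illustrative, not OpenAI's actual data format:

```python
from itertools import combinations

def ranking_to_pairs(prompt, outputs_best_to_worst):
    # Every higher-ranked output is treated as "preferred" over every output
    # ranked below it, yielding K-choose-2 comparisons from one ranking of K outputs.
    return [{"prompt": prompt, "preferred": better, "rejected": worse}
            for better, worse in combinations(outputs_best_to_worst, 2)]

pairs = ranking_to_pairs("Explain RLHF in one sentence.",
                         ["Clear, accurate answer", "Vague answer", "Off-topic answer"])
# 3 ranked outputs -> 3 pairwise comparisons for the pairwise loss shown earlier
```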

Researchers found that InstructGPT models significantly outperformed the GPT-3 baseline on metrics related to response-quality preference, truthfulness, toxicity, and generalization, at a cost that was “modest relative to pretraining.” Overall, InstructGPT required about 0.15% of the computational training resources of GPT-3.

InstructGPT then served as the foundation for ChatGPT. The success of ChatGPT, a large generative model with wide public availability, depended on unsupervised training on large data sets combined with supervised training from human feedback. Where previous attempts at publicly available chatbots had failed, OpenAI made impressive improvements by designing ChatGPT to interact in a conversational way that allows it to “answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.”

This conversational approach to human/machine interaction gives a taste of the future of these large models. It isn't just a new method for interacting with them; it is an increasingly important mechanism for refining and continuing to improve them. OpenAI states in its ChatGPT interface that "our goal is to get external feedback in order to improve our systems and make them safer." This kind of expressive human feedback will be an essential component of how future large generative models are trained, going beyond the improvements from RLHF.

Language Feedback Training

Although RLHF is quite powerful, it has limitations. Scoring input/output pairs is noisy, and the rankings must be abstracted in a way that smooths out the biases of different annotators. Researchers at New York University describe these limitations, stating that “an RLHF preference model provides limited learning signal, compared to the full expressiveness of language that humans use.” To work around them, the research team developed a new refinement method, Language Feedback (LF), which is already showing remarkable training results.

LF is similar to RLHF in that a model is fed prompts and the resulting pairs are evaluated by a human annotation team. But where previous feedback approaches had annotators rank responses for reinforcement learning, an LF annotator gives long-form, natural-language feedback describing possible improvements to the output. For example, say the model is prompted to “summarize the book Moby Dick.” If the model returned the summary “Moby Dick is about Ahab’s quest for vengeance against a dolphin,” a human annotator might respond with “the summary should say that Moby Dick is about Ahab’s quest for vengeance against a whale.”

This feedback, along with the original prompt and output, is then used to generate several new refinements from the original model. The refinements with the highest similarity to the human feedback are used to fine-tune the original model.
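
One simple way to implement that selection step is to embed the written feedback and each candidate refinement and keep the most similar candidate. The sketch below assumes an off-the-shelf sentence-embedding model and cosine similarity for scoring; these are illustrative choices, not necessarily the exact method from the paper:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def select_refinement(feedback, refinements):
    # Embed the human feedback and each candidate refinement, then keep the
    # refinement whose embedding is most similar to the feedback.
    feedback_emb = embedder.encode(feedback, convert_to_tensor=True)
    candidate_embs = embedder.encode(refinements, convert_to_tensor=True)
    scores = util.cos_sim(feedback_emb, candidate_embs)[0]
    return refinements[int(scores.argmax())]

best = select_refinement(
    "The summary should say Moby Dick is about Ahab's quest for vengeance against a whale.",
    ["Moby Dick follows Ahab's quest for vengeance against a whale.",
     "Moby Dick is a novel about the sea.",
     "Ahab hunts a dolphin across the oceans."],
)
# `best` (here, the first candidate) would then be added to the set used to
# fine-tune the original model.
```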

Using LF training, researchers demonstrated the power that small datasets built from expressive feedback can have. The State of AI Report noted that “using only 100 samples of human-written feedback fine-tunes a GPT-3 model to roughly human-level summarization ability.”

The model’s performance improved with minimal training, and the cumbersome process of ranking prompt-text pairs was eliminated. The authors reiterated the power of this process, stating, “language feedback is a natural form of communicating with models, which may make it easier for many people to provide informative, high-quality feedback.”

These improvements bring us back to the potential of ChatGPT’s human-centric interface. Its interactive, chat-based approach allows every user to provide rich feedback – feedback that serves as high-quality, targeted training data for the next generation of generative AI.

We are at the beginning of a new stage of development in large, generative models. While these models may come with tremendous risks associated with training from large collections of mechanically collected data, the research also shows that applying a human signal to these models not only improves model quality but also reduces harm.

One of the most exciting aspects of this research is that even as generative models become more powerful through automated data collection and unsupervised learning, thoughtfully applying a human signal consistently results in a better model. This work underscores the role that open data labeling platforms like Label Studio will play in creating machine learning models built with integrity, safety, and bias reduction in mind.
