Human Preferences collection for RLHF


This project helps you bring your LLM closer to ChatGPT-level quality by collecting comparison data that establishes human preferences among the responses generated by the supervised model.

By ranking multiple responses by quality, you can train a reward model that captures human preferences. This reward model plays a crucial role in reinforcement learning, where it is used to optimize the fine-tuned foundation model.
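
As a rough illustration of how ranked pairs can drive a reward model, the sketch below computes a pairwise logistic (Bradley-Terry style) loss on two scalar scores. This particular loss is an assumption made for illustration, since the text above does not prescribe one, and `pairwise_preference_loss` is a hypothetical helper name.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when the order is wrong.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal scores mean the model is undecided between the two responses.
undecided = pairwise_preference_loss(0.0, 0.0)
# A correct ranking yields a lower loss than an inverted one.
correct = pairwise_preference_loss(2.0, -1.0)
inverted = pairwise_preference_loss(-1.0, 2.0)
```

In a real training loop the two scores would come from the reward model evaluated on the preferred and the rejected response for the same prompt.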

How to collect the dataset

The dataset for RLHF consists of two parts:

  1. Input prompts.
  2. Alternative generated responses for each prompt.

To simplify the task for the human labeler, it is recommended to have 2 responses per prompt to select from.

Start with an initial set of prompts and responses, where each item is a JSON object with the following structure:

[{
    "prompt": "The quick brown fox...",
    "answer1": "jumps over the lazy dog.",
    "answer2": "bags few lynx."
}, ...]

Collect examples either by writing them manually or by using your baseline model to generate multiple alternative hypotheses.

Once you have collected examples in a dataset.json file, create a project and upload the dataset to Label Studio.
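
The collection step can be sketched as follows. `generate_response` is a hypothetical stand-in for a call to your baseline model, and `build_tasks` and `save_tasks` are illustrative helper names; only the `prompt`, `answer1` and `answer2` keys and the dataset.json file name come from the text above.

```python
import json
import os
import tempfile
from itertools import count
from typing import Callable, Dict, List


def build_tasks(prompts: List[str],
                generate_response: Callable[[str], str]) -> List[Dict[str, str]]:
    """Create one task per prompt with two candidate responses."""
    tasks = []
    for prompt in prompts:
        tasks.append({
            "prompt": prompt,
            # The model is called twice; with sampling enabled the two
            # responses will usually differ.
            "answer1": generate_response(prompt),
            "answer2": generate_response(prompt),
        })
    return tasks


def save_tasks(tasks: List[Dict[str, str]], path: str) -> None:
    """Write the task list as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tasks, f, ensure_ascii=False, indent=2)


# Placeholder generator, used only to keep the example runnable.
_draft_ids = count(1)


def fake_generate(prompt: str) -> str:
    return f"draft {next(_draft_ids)}: {prompt}"


tasks = build_tasks(["The quick brown fox..."], fake_generate)
# Written to a temporary directory here; use your own path in practice.
out_path = os.path.join(tempfile.mkdtemp(), "dataset.json")
save_tasks(tasks, out_path)
```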

Starting your labeling project

  1. Create a new project in Label Studio
  2. Go to Settings > Labeling Interface > Browse Templates > Generative AI > Human Preference collection for RLHF
  3. Save the project

Import the dataset

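You can import the dataset with input prompts into Label Studio using the Python SDK. With the PROJECT_ID of the project you’ve just created, run the following code:

from label_studio_sdk import Client

# Connect to your Label Studio instance
ls = Client(url='<YOUR-LABEL-STUDIO-URL>', api_key='<YOUR-API_KEY>')

# PROJECT_ID is the numeric ID of the project created above
project = ls.get_project(id=PROJECT_ID)
# Upload the tasks from the dataset file
project.import_tasks('dataset.json')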

Then you can start annotating the dataset by selecting the preferred response for each prompt.

How to configure the labeling interface

The Human Preference collection for RLHF template includes the following labeling interface in XML format:

<View className="root">
  <Style>
    .root {
      box-sizing: border-box;
      margin: 0;
      padding: 0;
      font-family: 'Roboto', sans-serif;
      line-height: 1.6;
      background-color: #f0f0f0;
    }

    .container {
      margin: 0 auto;
      padding: 20px;
      background-color: #ffffff;
      border-radius: 5px;
      box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);
    }

    .prompt {
      padding: 20px;
      background-color: #0084ff;
      color: #ffffff;
      border-radius: 5px;
      margin-bottom: 20px;
      box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
    }

    .answers {
      display: flex;
      justify-content: space-between;
      flex-wrap: wrap;
      gap: 20px;
    }

    .answer-box {
      flex-basis: 49%;
      padding: 20px;
      background-color: rgba(44, 62, 80, 0.9);
      color: #ffffff;
      border-radius: 5px;
      box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
    }

    .answer-box p {
      word-wrap: break-word;
    }

    .answer-box:hover {
      background-color: rgba(52, 73, 94, 0.9);
      cursor: pointer;
      transition: all 0.3s ease;
    }

    .lsf-richtext__line:hover {
      background: unset;
    }

    .answer-box .lsf-object {
      padding: 20px
    }
  </Style>
  <View className="container">
    <View className="prompt">
      <Text name="prompt" value="$prompt" />
    </View>
    <View className="answers">
      <Pairwise name="comparison" toName="answer1,answer2"
                selectionStyle="background-color: #27ae60; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.2); border: 2px solid #2ecc71; cursor: pointer; transition: all 0.3s ease;" />
      <View className="answer-box">
        <Text name="answer1" value="$answer1" />
      </View>
      <View className="answer-box">
        <Text name="answer2" value="$answer2" />
      </View>
    </View>
  </View>
</View>

<!--{ "data" : {
  "prompt": "What are the key benefits of using Reinforcement Learning from Human Feedback (RLHF) for dataset collection in the context of Large Language Model (LLM) generation?",
  "answer1": "Reinforcement Learning from Human Feedback (RLHF) for dataset collection in Large Language Model (LLM) generation provides key benefits such as improved model performance through direct optimization, better alignment with human values by incorporating human feedback, and the ability to iteratively refine the model based on user interactions, resulting in a more user-friendly and efficient language model.",
  "answer2": "Using Reinforcement Learning from Human Feedback (RLHF) for dataset collection in Large Language Model (LLM) generation offers advantages such as enhanced model capabilities by optimizing for desired outcomes, greater adaptability to human preferences through the inclusion of human feedback, and the opportunity to continuously improve the model based on user experiences, ultimately leading to a more effective and responsive language model."
}}
-->

The <Style> section defines a custom UI design for the labeling interface, and the nested <View> tags provide the layout. This example uses a simple layout with a prompt and two answer boxes. The <Pairwise> tag defines the pairwise comparison between the answers shown in the <Text> tags. The displayed text comes from the prompt, answer1 and answer2 fields of each task, which the <Text> tags reference as $prompt, $answer1 and $answer2 in their value attribute.

Additionally, you can modify the "prompt", "answer1" and "answer2" values in the XML comment section to preview how the interface looks with your data.
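
To see how the `$` variables tie the configuration to the task data, the sketch below parses a trimmed-down copy of the configuration with Python's standard `xml.etree.ElementTree` and collects every `value` attribute that starts with `$`. The helper name `referenced_variables` is hypothetical, and the check is only a local sanity test under these assumptions, not something Label Studio requires.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the template: styling removed, structure kept.
CONFIG = """
<View className="root">
  <View className="prompt">
    <Text name="prompt" value="$prompt" />
  </View>
  <View className="answers">
    <Pairwise name="comparison" toName="answer1,answer2" />
    <Text name="answer1" value="$answer1" />
    <Text name="answer2" value="$answer2" />
  </View>
</View>
"""


def referenced_variables(config_xml: str) -> set:
    """Collect task-data keys referenced as value="$key" in the config."""
    root = ET.fromstring(config_xml.strip())
    return {
        element.attrib["value"][1:]  # drop the leading "$"
        for element in root.iter()
        if element.attrib.get("value", "").startswith("$")
    }


task = {
    "prompt": "The quick brown fox...",
    "answer1": "jumps over the lazy dog.",
    "answer2": "bags few lynx.",
}
# Keys the config expects but the task does not provide.
missing = referenced_variables(CONFIG) - set(task)
```

An empty `missing` set means every variable used in the interface has a matching key in the task.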

Export the dataset

Depending on the complexity of your problem, you typically need hundreds to thousands of labeled tasks to fine-tune your LLM.

After you’ve labeled enough tasks, you can export the dataset in the following raw Label Studio JSON format:

[
  {
    "id": 1,
    "data": {
      "prompt": "Generate a Python function that takes a list of integers as input and returns the sum of all even numbers in the list."
    },
    "annotations": [
      {
        "id": 1,
        "created_at": "2021-03-03T14:00:00.000000Z",
        "result": [
          {
            "from_name": "instruction",
            "to_name": "prompt",
            "type": "textarea",
            "value": {
              "text": [
                "def sum_even_numbers(numbers):\n    return sum([n for n in numbers if n % 2 == 0])"
              ]
            }
          }
        ],
        // other fields
      }
    ]
  }
]

The above represents the list of tasks with annotations. Each task has a data.prompt field with the input prompt, and each "annotations" item contains a response under the result.value.text field. You can create more than one annotation per task.
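
A minimal sketch of reading such an export, assuming the structure described above: a list of tasks, each with `data.prompt` and a list of `annotations` whose `result` items carry `value.text`. `extract_pairs` is a hypothetical helper, and the inline `exported` list stands in for a file you would normally load with `json.load`.

```python
from typing import Dict, Iterator, List, Tuple


def extract_pairs(tasks: List[Dict]) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, text) for every text found in every annotation."""
    for task in tasks:
        prompt = task["data"]["prompt"]
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                # Result items without a "text" value are skipped.
                for text in item.get("value", {}).get("text", []):
                    yield prompt, text


# Shortened stand-in for an exported file.
exported = [{
    "id": 1,
    "data": {"prompt": "Generate a Python function..."},
    "annotations": [{
        "id": 1,
        "result": [{
            "from_name": "instruction",
            "to_name": "prompt",
            "type": "textarea",
            "value": {"text": ["def sum_even_numbers(numbers): ..."]},
        }],
    }],
}]
pairs = list(extract_pairs(exported))
```

Because a task can hold several annotations, the same prompt may appear in more than one pair.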

Alternatively, you can download the same data in CSV format:

"Generate...","def sum..."

How to fine-tune the model

Your generated examples can be used to fine-tune open-source LLMs such as GPT-2, T5, Falcon, LLaMA, etc. You can check the complete list of models on the HuggingFace LLM leaderboard, then download and fine-tune a model from the Model Hub. Alternatively, fine-tuning services are available: OpenAI, Cohere, AI21 Studio, MosaicML, Google Cloud AI Platform, AzureML, etc.
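
Before labeled comparisons are handed to a training library or service, they usually need to be flattened into one record per comparison. The sketch below does that under an explicit assumption: each comparison is summarized by a `selected` value of "left" or "right", where "left" refers to `answer1`. That field and its values are assumptions made for illustration and should be checked against your own export; `to_preference_record` and the `chosen`/`rejected` key names are hypothetical as well.

```python
from typing import Dict


def to_preference_record(prompt: str, answer1: str, answer2: str,
                         selected: str) -> Dict[str, str]:
    """Turn one labeled comparison into a chosen/rejected record.

    `selected` is assumed to be "left" (answer1 preferred) or
    "right" (answer2 preferred).
    """
    if selected not in ("left", "right"):
        raise ValueError(f"unexpected selection: {selected!r}")
    if selected == "left":
        chosen, rejected = answer1, answer2
    else:
        chosen, rejected = answer2, answer1
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = to_preference_record(
    "The quick brown fox...",
    "jumps over the lazy dog.",
    "bags few lynx.",
    selected="left",
)
```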