Fine-Tuning Llama 3: Adapting LLMs for Specialized Domains
In this hands-on webinar, Michaela Kaplan walks through how to fine-tune Llama 3 for medical Q&A using Label Studio and the medchat dataset. You'll learn how to integrate LLMs into your annotation pipeline, generate synthetic data, evaluate outputs, and iterate with human-in-the-loop workflows—all from your local environment.
Transcript
Nate: Cool. All right, let's go ahead and get this thing rolling. Welcome, everybody, to today’s HumanSignal webinar. We're really excited to have you here—and even more excited about today’s topic.
Just a couple of announcements before we get started. First, we are recording this webinar and will share the recording afterward so you can rewatch or catch anything you missed. This will be a hands-on session with code examples.
We’ll also have a Q&A at the end. Michaela will have about 15–20 minutes to answer your questions, depending on the presentation length. Just use the Q&A widget at the bottom of your Zoom screen to submit them.
All right, let’s introduce our speaker and topic: “Fine-Tuning Llama 3: Adapting LLMs for Specialized Domains.” Definitely a hot topic right now. Michaela Kaplan is our machine learning evangelist here at HumanSignal. Before joining us, she worked at SiriusXM as a data scientist, so she brings real-world experience. With that, I’ll turn it over to Michaela. Thanks again for joining.
Michaela: Thanks, Nate. And yes, to confirm, this session is being recorded.
Welcome, everyone! Today, we're going to walk through how to fine-tune Llama 3. Let's start with the basics: What is fine-tuning, and why do it?
Fine-tuning is the process of adapting a pre-trained model to a specific task or domain. You usually start with an off-the-shelf LLM like Llama 3 or GPT-4. You try prompt engineering or retrieval-augmented generation (RAG), and when that’s not enough—or when you're ready to go deeper—you fine-tune.
Reasons to fine-tune include preventing hallucinations, ensuring factual correctness, and capturing domain-specific knowledge that isn’t publicly available—like proprietary company data or protected medical information.
The fine-tuning process is cyclical. Today, we’ll work with medical Q&A using the medchat dataset. We’ll first observe Llama 3’s baseline performance, fine-tune on medchat, generate more data using midal, and then fine-tune again. And of course, we keep a human in the loop throughout.
Let’s dive in. I’m running Label Studio locally—the open source version. I’ve got a fresh instance with no projects yet, and I’ll be walking through everything live.
We’re also using a Jupyter notebook, which we’ll share in the blog post after this webinar. Some initial setup cells aren’t shown because they contain a private API key.
First, we load the medchat dataset into our notebook and print a sample Q&A pair: “What is one method of administering this drug?” with an answer about IV infusion methods.
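For reference, a minimal sketch of that loading step is below. The Hugging Face Hub identifier and the "question"/"answer" column names are assumptions for illustration, not the notebook's exact cells.

```python
# Minimal sketch of loading the medchat data. The Hub id and column names
# are placeholders; point them at wherever your copy of the dataset lives.
from datasets import load_dataset

dataset = load_dataset("ngram/medchat-qa", split="train")  # hypothetical Hub id
sample = dataset[0]
print("Q:", sample["question"])
print("A:", sample["answer"])
```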
Next, we create a new Label Studio project via the SDK. We configure the labeling interface with a simple XML-like labeling config and set up our ML backend using the LLM interactive template. We're using Ollama for this, with Llama 3 as the model.
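A rough sketch of that project setup with the Label Studio Python SDK's older Client interface is below; the URL, the API key placeholder, and the simplified Q&A config are illustrative rather than the notebook's verbatim code.

```python
# Sketch of project creation via the Label Studio SDK (older Client API).
# The labeling config below is a simplified Q&A layout for illustration.
from label_studio_sdk import Client

LABEL_STUDIO_URL = "http://localhost:8080"  # assumed local instance
API_KEY = "<your-label-studio-api-key>"     # kept out of the shared notebook

label_config = """
<View>
  <Text name="text" value="$text"/>
  <TextArea name="prompt" toName="text" editable="true" rows="2"/>
  <TextArea name="answer" toName="text" editable="true" rows="4"/>
</View>
"""

ls = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
project = ls.start_project(title="MedChat Baseline Q&A", label_config=label_config)
```

The LLM interactive ML backend itself runs as a separate service, which is what the Docker Compose step below covers.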
I’ve already launched the Ollama instance and the Docker Compose ML backend. We use host.docker.internal to wire everything together, so the ML backend running inside Docker can reach the Ollama server on the host machine.
Now we jump to Label Studio, refresh, and connect our model via settings. I’ll call it "llama3", provide the backend URL, and turn on interactive predictions.
Let’s add the medchat training tasks. You’ll now see a list of questions and answers populated in the data manager.
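Programmatically, that import is a single SDK call; the task fields here mirror the assumed column names from the loading sketch above.

```python
# Import the medchat questions as Label Studio tasks. Keeping the reference
# answer in the task data makes side-by-side review easier; field names assumed.
tasks = [{"text": row["question"], "answer": row["answer"]} for row in dataset]
project.import_tasks(tasks)
```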
Click into a sample task and you’ll see the model processing a prompt like “Answer the question: {text}.” The {text} placeholder injects the question text from the task data.
After a few seconds, the model responds. This is where human review comes in. You decide if the answer makes sense or needs to be corrected.
Label Studio also supports batch predictions. I selected three tasks and hit "Retrieve Predictions." Some outputs were solid, others not so much. Again, a reminder: human validation is key.
Now that we’ve seen baseline performance, let’s fine-tune. We'll use the medchat data directly for our first round. The notebook installs required packages, sets up a fast tokenizer and LoRA adapters, formats the data, tokenizes it, and launches training.
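The full notebook will be shared with the blog post; in the meantime, here is a hedged sketch of that training cell using transformers and peft, building on the `dataset` loaded earlier. The model id, prompt format, and hyperparameters are illustrative, the gated Llama 3 weights require Hugging Face access, and the notebook's actual stack may differ.

```python
# Sketch of the LoRA fine-tuning cell. Model id, prompt format, and
# hyperparameters are illustrative only.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"   # gated; requires HF access
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

# Attach LoRA adapters so only a small set of extra weights is trained.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def to_text(example):
    # "question"/"answer" column names are the same assumption as above.
    return {"text": f"Answer the question: {example['question']}\n"
                    f"{example['answer']}{tokenizer.eos_token}"}

train_ds = dataset.map(to_text)
train_ds = train_ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                        remove_columns=train_ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama3-medchat-lora",
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=4,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           logging_steps=10),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("llama3-medchat-lora")
```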
Training takes about an hour. When it's done, we save the model as a GGUF file and register it in Ollama. After updating Docker Compose to use the new model name, we restart the service.
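Roughly, that export step looks like the following. The GGUF conversion itself happens outside the notebook (for example with llama.cpp's conversion script), so the file paths and model tag here are assumptions.

```python
# Sketch of merging the LoRA weights and registering the result with Ollama.
# Converting the merged checkpoint to GGUF is done outside this cell;
# the .gguf path and the "llama3-medchat" tag are assumed names.
import pathlib
import subprocess

merged = model.merge_and_unload()                 # fold LoRA weights into the base model
merged.save_pretrained("llama3-medchat-merged")
tokenizer.save_pretrained("llama3-medchat-merged")

pathlib.Path("Modelfile").write_text("FROM ./llama3-medchat.gguf\n")
subprocess.run(["ollama", "create", "llama3-medchat", "-f", "Modelfile"], check=True)
```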
But medchat is small, so we’ll expand it by generating more Q&A pairs using midal. We’ll do this in two steps: first generate questions, then answers.
We create a new Label Studio project for question generation, connect the tuned model, and use a pre-written prompt to guide it. You’ll see that it sometimes outputs answers too—something to keep in mind.
Once we have a few solid generated questions, we move to a new answer generation project. We format the outputs, import them as tasks, and ask the model to generate answers.
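As a rough sketch, moving reviewed questions from the question-generation project into the answer-generation project can be done with the SDK's export; the project names and export parsing below are illustrative, not the notebook's exact code.

```python
# Sketch: pull reviewed question-generation tasks and feed each question into a
# fresh answer-generation project. `question_project` is the SDK handle for the
# question-generation project created above; names and parsing are illustrative.
qgen_tasks = question_project.export_tasks()      # JSON export of annotated tasks

answer_project = ls.start_project(title="MedChat Synthetic Answers",
                                  label_config=label_config)

answer_tasks = []
for task in qgen_tasks:
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            if result.get("type") == "textarea":
                # Each reviewed question becomes its own task.
                for question in result["value"]["text"]:
                    answer_tasks.append({"text": question})

answer_project.import_tasks(answer_tasks)
```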
After validating these generated Q&A pairs with human review, we prepare them for another round of fine-tuning using the same notebook structure.
That’s the core loop: evaluate the model, generate more data, fine-tune again. And always keep a human in the loop to guard against errors and hallucinations.
Nate: Awesome. Thanks, Michaela. That was fantastic. Let’s jump into some of the questions we received.
Audience Question: What should I do if the model prediction is wrong?
Michaela: Great question. You can edit the proposed answer directly in Label Studio. Once updated and submitted, the system logs the change so it’s auditable. You can also add checkboxes or tags to mark if a prediction was “correct” or “needs review.”
Audience Question: What hardware are you using?
Michaela: I’m on Apple Silicon. The Ollama setup lets you run LLMs locally without stressing too much about your specs. You don’t need to worry about RAM or GPU the same way you would otherwise.
Audience Question: Can I configure the backend to use a custom provider?
Michaela: Absolutely. LLM Interactive can be set up for Ollama, OpenAI, Azure, or even your own custom backend. It’s flexible. There’s documentation and videos on how to do this, or feel free to reach out.
Audience Question: How should I evaluate the quality of generated questions and answers?
Michaela: You can look for things like relevance, clarity, and whether the question is actually answerable from the source text. Prompting the model to cite its source or reasoning can help. But ultimately, human evaluation is critical.
Audience Question: How do you ensure data privacy and security?
Michaela: First, Label Studio doesn’t store your data; it stays in your infrastructure. If you don’t expose the model or data externally, you maintain full control. Fine-tuning a local model like Llama is especially helpful here.
Audience Question: What quality control techniques do you recommend?
Michaela: Start with human review. In the Enterprise version, you can set up reviewer workflows. You can also use embeddings to measure semantic similarity or compute token overlap. LLM-as-a-judge or jury systems are another great option—basically using one model to evaluate another.
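For the embedding idea specifically, here is a tiny sketch using sentence-transformers; the model name and threshold are placeholders, not recommendations.

```python
# Sketch: score a generated answer against a reviewed reference answer and
# route low-similarity pairs back for human review.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model

def semantic_similarity(generated: str, reference: str) -> float:
    embeddings = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

score = semantic_similarity(
    "The drug is given as an IV infusion over 30 minutes.",
    "Administer the drug by intravenous infusion.",
)
needs_review = score < 0.7   # below an (arbitrary) threshold, send back to annotators
```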
Audience Question: What’s the difference between auto annotations and interactive predictions?
Michaela: If interactive predictions are off, your frontend inputs don’t get passed to the backend model. Turn it on if you want to provide dynamic prompts, like we're doing here.
Audience Question: How much data do I need to fine-tune?
Michaela: It depends on your domain. A few hundred examples might be enough to start, especially with LoRA. Just make sure you evaluate your model on a separate test set.
Audience Question: Can I do unsupervised fine-tuning on raw data first?
Michaela: Definitely! That’s a great way to inject domain knowledge before doing supervised fine-tuning.
Audience Question: Will synthetic data ever be better than human-labeled data?
Michaela: I think synthetic data is useful and scalable, but I wouldn’t fully trust it yet. Models still hallucinate and reflect their training biases. Keeping a human in the loop is essential, both for accuracy and for responsible AI practices.
Audience Question: Any recommendations for using Ollama with image recognition?
Michaela: I’m not sure Ollama supports image tasks out of the box, but other frameworks do. You’d follow a similar pattern—fine-tune on paired image and text data and evaluate carefully.
Audience Question: Can I enforce structured output formats?
Michaela: Yes! OpenAI recently released structured output support using Pydantic. We published a blog post about this—definitely check that out if you want to enforce output schemas.
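For reference, a minimal sketch of that pattern with the OpenAI Python SDK and Pydantic; the model name is illustrative, and an OPENAI_API_KEY is assumed in the environment.

```python
# Sketch of enforcing an output schema with OpenAI structured outputs + Pydantic.
from openai import OpenAI
from pydantic import BaseModel

class MedicalQA(BaseModel):
    question: str
    answer: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",   # illustrative model choice
    messages=[{"role": "user", "content": "Write one medical Q&A pair about IV infusions."}],
    response_format=MedicalQA,
)
qa = completion.choices[0].message.parsed   # a validated MedicalQA instance
print(qa.question, "->", qa.answer)
```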
Nate: All right, we’re out of time. Michaela, thank you again. Thanks to everyone who joined, asked questions, and followed along. We’ll send out the recording, the notebook, and a written walkthrough shortly. See you at the next webinar!
Michaela: Thanks, everyone!