
How Mind Moves and HumanSignal Brought Trust to AI in Healthcare

Community

The Challenge: Hallucinations, Health, and High Stakes

Large language models can sound smart, but in medicine, sounding smart isn’t enough. At one of the most trusted institutions in health research, a team was piloting a new GenAI assistant aimed at delivering accurate, evidence-based responses to health questions. The system pulled from NIH-vetted sources like PubMed Central and MedlinePlus, but even with a retrieval-augmented generation (RAG) pipeline in place, one critical question remained:

Can we trust the model’s outputs, especially when the stakes are health-related?

That question isn’t theoretical. From policy alignment to health literacy to factual accuracy, every sentence generated needed scrutiny. The team needed a rigorous evaluation framework that could balance expert insight with scalable workflows.

The Solution: A Human-in-the-Loop Blueprint with Mind Moves and Label Studio

That’s where Mind Moves came in, bringing human-centered design and organizational strategy to help create an AI evaluation process that was not only effective but also scalable and sustainable across teams. In addition to leading the evaluation framework, Mind Moves developed the RAG pipeline that the evaluation system connects to, enabling end-to-end performance assessment. In partnership with HumanSignal’s Label Studio, they designed a six-phase human-in-the-loop workflow to assess the AI assistant’s outputs across dimensions such as:

  • Interpretability (was the meaning clear?)
  • Readability (was it accessible for users?)
  • Accuracy and Evidence Support (was it factually correct and properly cited?)
  • Alignment with NIH standards (did it stay within the guardrails?)
  • Fact‑checking (did each sentence align with its cited reference chunks?)
  • Reference response creation (was a reference‑grounded answer generated?)
  • Reference response selection (was the strongest reference‑supported option chosen?)
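
To make the workflow concrete, here is a minimal sketch, not the team's actual template, of how a project covering two of these dimensions might be set up with a Label Studio labeling config and the legacy label-studio-sdk client; the server URL, API key, field names, and choice labels are assumptions for illustration.

    from label_studio_sdk import Client  # legacy SDK client interface

    # Hypothetical labeling config: each task shows a question and the
    # assistant's answer, then asks for two of the evaluation dimensions.
    LABEL_CONFIG = """
    <View>
      <Text name="question" value="$question"/>
      <Text name="answer" value="$answer"/>
      <Header value="Interpretability: was the meaning clear?"/>
      <Choices name="interpretability" toName="answer" choice="single">
        <Choice value="Clear"/>
        <Choice value="Partially clear"/>
        <Choice value="Unclear"/>
      </Choices>
      <Header value="Evidence support: was the answer properly cited?"/>
      <Rating name="evidence_support" toName="answer" maxRating="5"/>
    </View>
    """

    ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
    project = ls.start_project(
        title="Phase 1: Interpretability and Evidence Support",
        label_config=LABEL_CONFIG,
    )
    # Each task pairs a biomedical question with the assistant's answer.
    project.import_tasks([
        {"question": "What are common symptoms of iron deficiency?",
         "answer": "Common symptoms include fatigue, pallor, and shortness of breath."},
    ])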

Instead of relying on spreadsheets or ad hoc review tools, the team used Label Studio to orchestrate a complex annotation project that spanned:

  • 4 reviewer groups, consisting of 20 annotators
  • 100 biomedical questions, both expert and non-expert
  • 6 structured projects, each building on the last
  • 20,000+ annotation tasks, with varying levels of cognitive difficulty

Each group worked within a dedicated Label Studio workspace, labeling the same set of questions in sequence. This allowed for layered insights: sentence-level judgments, template-driven evaluations, and comparative analysis over time.

“What we set out to do would have been impossible in Excel or JavaScript. Label Studio gave us a clean, customizable hub to control both the scope and quality of our evaluation and annotation work. The benchmark dataset our annotators produced now lets us use an LLM‑as‑a‑judge framework for large‑scale experimentation and evaluation.” — Dr. James Shanahan, Mind Moves

A Hybrid Evaluation Pipeline

Mind Moves also helped develop an interoperable dual-pipeline system: one for RAG-based response generation, and one for evaluation.

  • The generation pipeline used secure, serverless infrastructure to retrieve and synthesize data from NIH-vetted sources.
  • The evaluation pipeline incorporated both human annotation (via Label Studio) and LLM-as-a-judge evaluations using GPT-4 and Claude 3.
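
As an illustration of the LLM-as-a-judge half of the evaluation pipeline, the sketch below scores one answer against its retrieved source passages with the OpenAI chat completions API; the rubric wording and response schema are assumptions rather than the team's actual prompts.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_PROMPT = """You are evaluating a health answer against NIH-vetted source passages.
    Rate each dimension from 1 (poor) to 5 (excellent) and return JSON with the keys
    interpretability, readability, evidence_support, nih_alignment.

    Question: {question}
    Answer: {answer}
    Source passages: {sources}
    """

    def judge(question: str, answer: str, sources: str) -> dict:
        """Ask the judge model to score one generated answer."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=question, answer=answer, sources=sources)}],
            temperature=0,
        )
        # The prompt asks for JSON; a real pipeline would validate before parsing.
        return json.loads(response.choices[0].message.content)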

By exporting annotation data from Label Studio in structured formats (CSV/JSON), the team could compare human and LLM evaluations head-to-head, revealing important gaps. Notably, human reviewers were stricter and more conservative, especially on evidence support and clarity, underscoring the essential role of domain expertise.
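
A minimal sketch of that head-to-head comparison, assuming hypothetical file and column names for the Label Studio export and the judge scores:

    import pandas as pd

    # One row per annotated task; column names here are assumptions.
    human = pd.read_csv("label_studio_export.csv")
    llm = pd.read_csv("llm_judge_scores.csv")

    merged = human.merge(llm, on="task_id", suffixes=("_human", "_llm"))

    for dim in ["evidence_support", "interpretability"]:
        agreement = (merged[f"{dim}_human"] == merged[f"{dim}_llm"]).mean()
        # A positive mean gap means the LLM judge is more lenient than humans.
        gap = (merged[f"{dim}_llm"] - merged[f"{dim}_human"]).mean()
        print(f"{dim}: exact agreement {agreement:.0%}, LLM-minus-human gap {gap:+.2f}")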

Early Results and Lessons

The pilot yielded a number of early-stage insights:

  • Interpretability rated high: most outputs were clear, but not all.
  • Readability was acceptable: answers were accessible at or below an 8th-grade reading level to support health literacy (one way to check this is sketched after this list).
  • Alignment with NIH standards scored extremely high, affirming the strength of the prompt engineering in areas like not providing medical advice.
  • Overall acceptance rate landed at 50%: a strong starting point for an early-stage GenAI system in a high-risk domain.
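
One way to check the readability target mentioned above is a standard grade-level formula such as Flesch-Kincaid; the sketch below uses the third-party textstat package and is an illustration, not necessarily the metric the team applied.

    import textstat

    answer = ("Iron deficiency can make you feel tired. "
              "Eating foods rich in iron, like beans and spinach, can help.")
    # Flesch-Kincaid maps sentence length and syllable counts to a U.S. grade level.
    grade = textstat.flesch_kincaid_grade(answer)
    print(f"Flesch-Kincaid grade level: {grade:.1f}")  # target: at or below 8.0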

Beyond the metrics, the project gave hands-on AI evaluation experience to the 78% of annotators who were first-timers, creating a cadre of reviewers who will return to their own work better equipped to assess LLM outputs. That deepened internal capacity and built trust from the ground up.

Why It Matters

HumanSignal and Mind Moves developed a repeatable, human-centered workflow that scales across annotators, safeguards against hallucinations, and upholds medical integrity.

This wasn’t just a tech problem. It was an institutional challenge, and the team met it with clarity and care.

“In a world where trust in health information is fragile, we knew we needed a system that reflected human judgment, not just machine outputs. This process helped us build that trust.” — Nicole Sroka, CEO of Mind Moves

Final Thoughts

Evaluating AI in critical domains requires more than benchmarks and dashboards; it demands human judgment, structured workflows, and tools that support continuous improvement. With Label Studio, Mind Moves advanced toward building AI systems that not only deliver answers but also earn trust.

