How to Review Braintrust Traces with Label Studio
0. Label Studio Requirements
This tutorial uses ReactCode templates, a feature available in Label Studio Enterprise only. ReactCode allows you to build fully custom React-based annotation interfaces — in this case, a 3-panel trace review UI. We recommend connecting with our team to request a trial or to enable them in your account.
After section 2, you will need a running Label Studio Enterprise instance and an API key from your account settings.
1. Installation & Setup
First, install the required dependencies:
!pip -q install requests label-studio-sdk python-dotenv braintrust braintrust-api braintrust-langchain langchain langchain-anthropic anthropic langgraph
Environment Configuration
Create a .env file in the repository root (or the same directory as this notebook) with the following variables:
# Label Studio Enterprise
LABEL_STUDIO_HOST=http://localhost:8080 # or your LS Enterprise instance URL
LABEL_STUDIO_API_KEY=your_label_studio_api_key
# Braintrust
BRAINTRUST_API_KEY=your_braintrust_api_key
BRAINTRUST_PROJECT=your_project_name # project name for tracing and fetching
# Anthropic (only needed for Section 3a sample trace generation)
ANTHROPIC_API_KEY=your_anthropic_api_key
Braintrust Setup: Visit Braintrust Documentation to create an account, get your API key, and set up a project for tracing.
Label Studio Setup: Visit Label Studio Documentation for installation instructions and how to generate an API token from your account settings.
import os
from dotenv import load_dotenv
# Load .env from current directory or repository root
load_dotenv(override=True)
load_dotenv(os.path.join(os.path.dirname(os.getcwd()), '.env'), override=True)
# Label Studio Enterprise
LABEL_STUDIO_HOST = os.getenv('LABEL_STUDIO_HOST', 'http://localhost:8080')
LABEL_STUDIO_API_KEY = os.getenv('LABEL_STUDIO_API_KEY', '')
# Braintrust
BRAINTRUST_API_KEY = os.getenv('BRAINTRUST_API_KEY', '')
BRAINTRUST_PROJECT = os.getenv('BRAINTRUST_PROJECT', '')
# Anthropic (only needed for sample trace generation in Section 3a)
ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY', '')
print('LABEL_STUDIO_HOST:', LABEL_STUDIO_HOST)
print('BRAINTRUST_PROJECT:', BRAINTRUST_PROJECT or '(not set)')
print('Has LABEL_STUDIO_API_KEY?', bool(LABEL_STUDIO_API_KEY))
print('Has BRAINTRUST_API_KEY?', bool(BRAINTRUST_API_KEY))
print('Has ANTHROPIC_API_KEY?', bool(ANTHROPIC_API_KEY))
Setup: The Evaluation Pipeline
This tutorial connects Braintrust’s engineering-centric observability tooling with Label Studio’s expert evaluation interface:
Step 1: Trace Collection in Braintrust
- Braintrust captures detailed LLM traces including inputs, outputs, tool calls, and timing
- Engineer-centric interface for technical debugging and iteration
- BTQL-powered queries let you filter and inspect any subset of traces
Step 2: Expert Evaluation in Label Studio
- Import traces from Braintrust into Label Studio as structured annotation tasks
- Domain experts evaluate each turn using the custom ReactCode UI
- Collaborative workflow: multiple SMEs can annotate the same traces
- Structured output feeds directly into quality reports, prompt improvements, and LLM-as-a-judge pipelines
2. Label Studio ReactCode Config
Skip the setup — clone the project directly
The pre-configured project below includes the full 3-panel ReactCode annotation interface ready to use. Click the button to clone it into your Label Studio Enterprise account and jump straight to importing your Braintrust traces in Section 4.
If you prefer to configure the project programmatically, follow the rest of this section.
This tutorial uses a ReactCode label configuration — a Label Studio Enterprise feature that lets you embed a custom React component as your annotation interface.
The UI has three panels:
| Panel | Purpose |
|---|---|
| Turns (left) | Scrollable list of all turns. Filter by role, search by content. Each card shows role, tool badges, latency, and verdict once annotated. |
| Turn Details (center) | Full content, tool call inputs/outputs, token usage, latency, and Claude’s extended thinking (when present). |
| Annotation (right) | Structured form for evaluating each turn — see annotation model below. |
Annotation model — what you capture per turn:
- Verdict — Pass or Fail
- Issue tags — taxonomy across 5 categories: Accuracy & Faithfulness, Tool & Retrieval, Reasoning & Planning, Response Quality, Safety & Compliance
- Severity — Critical / Major / Minor / Suggestion
- Expected behavior — free text: what should the agent have done instead?
- Comments — any additional notes
A trace-level verdict (Pass / Fail / Mixed) in the bottom bar captures overall conversation quality, independent of individual turn verdicts.
# ReactCode 3-panel trace annotation config (self-contained — no external files needed)
# The full ~40KB React component is inlined as _TEMPLATE_JS (see notebook for complete code).
_TEMPLATE_JS = r"""function TraceAnnotator({ React, addRegion, regions, data }) {
// 736-line React component defining the 3-panel trace review UI.
// Panels: Turns list (left) | Turn details (center) | Annotation form (right)
// Bottom bar: turn statistics + trace-level verdict (Pass / Fail / Mixed)
// ... see notebook for the full implementation ...
}"""
LABEL_CONFIG_XML = (
'<View>\n'
' <ReactCode style="height: 95vh" name="trace" toName="trace"'
' outputs=\'{"trace_id":"string","turn_id":"string","turn_role":"string",'
'"verdict":"string","failure_modes":"array","severity":"string",'
'"expected_behavior":"string","comments":"string"}\'>\n'
' <![CDATA[\n '
) + _TEMPLATE_JS + (
'\n ]]>\n'
' </ReactCode>\n'
'</View>'
)
print(LABEL_CONFIG_XML[:300] + '\n...')
3. Generate Sample Traces (Optional)
If you already have traces in Braintrust, skip this section — set GENERATE_TRACES = False and go directly to Section 4.
Otherwise, this cell creates a ReAct agent with multiple tools and runs 4 multi-turn conversations using Claude with extended thinking to produce realistic traces in your Braintrust project. Requires ANTHROPIC_API_KEY.
Extended thinking lets Claude reason through complex, ambiguous problems step-by-step before responding. The thinking content is captured in the trace and visible in the Label Studio UI.
GENERATE_TRACES = True # Set to False if you already have traces in Braintrust
if GENERATE_TRACES:
import braintrust
from braintrust_langchain import BraintrustCallbackHandler, set_global_handler
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage
from langchain_anthropic import ChatAnthropic
from langchain.agents import create_agent
if not ANTHROPIC_API_KEY:
raise RuntimeError('ANTHROPIC_API_KEY is required. Set it in your .env or set GENERATE_TRACES=False.')
if not BRAINTRUST_PROJECT:
raise RuntimeError('BRAINTRUST_PROJECT is required. Set it in your .env file.')
# Initialize Braintrust logger + LangChain callback handler
braintrust.init_logger(project=BRAINTRUST_PROJECT, api_key=BRAINTRUST_API_KEY)
handler = BraintrustCallbackHandler()
set_global_handler(handler)
@tool
def calculator(expression: str) -> str:
"""Evaluate a math expression."""
try:
return str(eval(expression))
except Exception as e:
return f"Error: {e}"
@tool
def search_knowledge_base(query: str) -> str:
"""Search an internal knowledge base for company policies, products, or procedures."""
kb = {
"refund": "Refund policy: Full refund within 30 days. After 30 days, store credit only. Damaged items: full refund at any time with photo evidence.",
"shipping": "Standard (5-7 days, free over $50), Express (2-3 days, $12.99), Overnight ($24.99).",
"warranty": "1-year limited warranty. 2-year extended warranty available for $29.99.",
"pricing": "Base $99/mo (10 users), Pro $249/mo (50 users), Enterprise custom. Annual billing saves 20%.",
}
results = [v for k, v in kb.items() if k in query.lower()]
return results[0] if results else f"No results found for: {query}"
@tool
def get_weather(city: str) -> str:
"""Get current weather for a city."""
weather_data = {
"new york": "New York: 72°F, Partly Cloudy, Humidity 65%, Wind 8 mph SW",
"london": "London: 58°F, Overcast, Humidity 80%, Wind 12 mph W",
"tokyo": "Tokyo: 82°F, Clear, Humidity 55%, Wind 5 mph NE",
"paris": "Paris: 63°F, Light Rain, Humidity 75%, Wind 10 mph NW",
}
return weather_data.get(city.lower(), f"Weather data not available for {city}")
# Claude with extended thinking — produces richer traces that surface the model's reasoning
llm = ChatAnthropic(
model='claude-sonnet-4-5-20250929',
max_tokens=16000,
thinking={'type': 'enabled', 'budget_tokens': 5000},
)
agent = create_agent(llm, [calculator, search_knowledge_base, get_weather])
# 4 multi-turn conversations designed to elicit extended thinking:
# conflicting policies, ambiguous constraints, and multi-step reasoning
conversations = [
["I bought a product 37 days ago with a manufacturing defect and an extended warranty. What are all my options?",
"The item costs $289. Can I use store credit toward a new extended warranty while keeping the original warranty claim open?"],
["We have 60 employees — 40 need full access, 20 need read-only. How do we minimize cost?",
"If we commit to annual billing and add 15 more full-access users next quarter, what's our 12-month total?"],
["I'm planning a 20-person client retreat. Compare Tokyo, London, and New York on weather and logistics.",
"12 attendees are in New York, 8 in London. Re-evaluate the three options for minimal travel disruption."],
["I ordered 3 items for $180 with express shipping. One arrived damaged — I need a replacement urgently.",
"If I return the damaged item and pay for express shipping on the replacement, what's my net out-of-pocket?"],
]
for i, conv_messages in enumerate(conversations, 1):
print(f"\n--- Conversation {i} ---")
@braintrust.traced(name=f"conversation_{i}")
def run_conversation(messages):
chat_history = []
for msg_text in messages:
print(f" User: {msg_text[:80]}...")
chat_history.append(HumanMessage(content=msg_text))
result = agent.invoke({'messages': chat_history})
chat_history = result['messages']
reply = result['messages'][-1].content
if isinstance(reply, list):
reply = ' '.join(b.get('text', '') for b in reply if isinstance(b, dict) and b.get('type') == 'text')
print(f" Assistant: {str(reply)[:100]}...")
return reply
run_conversation(conv_messages)
braintrust.flush()
print(f'\n✓ Generated {len(conversations)} traces. Proceed to Section 4.')
else:
print('Skipped trace generation. Proceed to Section 4.')
4. Braintrust API Client
Fetches traces (spans) from Braintrust using the REST API with BTQL queries. Spans are grouped by root_span_id into traces.
import requests
from braintrust_api import Braintrust
from typing import Any, Dict, List, Optional
class BraintrustClient:
"""Fetches traces from Braintrust via the REST API + BTQL."""
API_URL = 'https://api.braintrust.dev'
def __init__(self, api_key: str, project_name: str):
self.api_key = api_key
self.project_name = project_name
self.headers = {
'Authorization': f'Bearer {api_key}',
'Content-Type': 'application/json',
}
self.project_id = self._resolve_project_id()
def _resolve_project_id(self) -> str:
"""Look up the project ID by name."""
client = Braintrust(api_key=self.api_key)
for project in client.projects.list():
if project.name == self.project_name:
return project.id
raise ValueError(f'Project "{self.project_name}" not found in Braintrust.')
def _btql(self, query: str) -> List[Dict[str, Any]]:
"""Execute a BTQL query and return results."""
r = requests.post(
f'{self.API_URL}/btql',
headers=self.headers,
json={'query': query, 'fmt': 'json'},
timeout=60,
)
r.raise_for_status()
return r.json().get('data', [])
def list_traces(self, limit: int = 20) -> List[Dict[str, Any]]:
"""Fetch recent traces (top-level spans) from the project."""
query = f"""SELECT id, span_id, root_span_id, input, output, metadata, metrics,
scores, created, span_attributes, error
FROM project_logs('{self.project_id}')
WHERE span_id = root_span_id
ORDER BY created DESC
LIMIT {limit}"""
return self._btql(query)
def get_trace_spans(self, root_span_id: str) -> List[Dict[str, Any]]:
"""Fetch all spans belonging to a trace (by root_span_id)."""
query = f"""SELECT id, span_id, root_span_id, span_parents, input, output,
metadata, metrics, scores, created, span_attributes, error
FROM project_logs('{self.project_id}')
WHERE root_span_id = '{root_span_id}'
ORDER BY created ASC
LIMIT 200"""
return self._btql(query)
if not BRAINTRUST_API_KEY:
raise RuntimeError('Missing BRAINTRUST_API_KEY — set it in your .env file.')
bt = BraintrustClient(BRAINTRUST_API_KEY, BRAINTRUST_PROJECT)
print(f'Braintrust client ready — project: {BRAINTRUST_PROJECT} (ID: {bt.project_id})')
5. Normalize Braintrust Traces → Unified Schema
Braintrust stores traces as a flat list of spans with typed span_attributes (llm, tool, task, function). This cell extracts the relevant spans and maps them into a flat sequence of turns — the same schema used by all three platform integrations — so the ReactCode UI doesn’t need to know which platform the trace came from.
Each turn carries: role, content, tool_name, tool_input, tool_calls, model, usage (token counts), duration_ms, and thinking (Claude extended thinking blocks, when present).
import json as _json
def _to_str(x):
if x is None: return ''
if isinstance(x, str): return x
try: return _json.dumps(x, indent=2, default=str)
except: return str(x)
def _extract_content(obj):
if obj is None: return ''
if isinstance(obj, str): return obj
if isinstance(obj, dict):
for key in ('content', 'text', 'input', 'output', 'result'):
if isinstance(obj.get(key), str) and obj[key].strip():
return obj[key]
return _to_str(obj)
if isinstance(obj, list):
parts = [_extract_content(item) for item in obj if _extract_content(item).strip()]
return '\n'.join(parts) if parts else _to_str(obj)
return str(obj)
def _duration_ms_from_metrics(metrics):
"""Compute duration from Braintrust metrics.start / metrics.end (Unix seconds)."""
if not isinstance(metrics, dict): return None
start, end = metrics.get('start'), metrics.get('end')
if start is not None and end is not None:
try: return int((float(end) - float(start)) * 1000)
except: pass
return None
def _split_thinking(content):
"""Split Anthropic extended-thinking content blocks into (text, thinking)."""
if isinstance(content, str): return content, None
if isinstance(content, list):
text_parts, thinking_parts = [], []
for block in content:
if isinstance(block, dict):
if block.get('type') == 'thinking': thinking_parts.append(block.get('thinking', ''))
elif block.get('type') == 'text': text_parts.append(block.get('text', ''))
elif isinstance(block, str): text_parts.append(block)
return '\n\n'.join(text_parts), '\n\n'.join(thinking_parts) or None
return str(content) if content else '', None
def normalize_braintrust_trace(root_span, all_spans):
"""Convert Braintrust spans into the unified trace schema.
Span types:
- type=llm → extract user messages from input + assistant response from output
- type=tool → extract tool execution as a tool turn
- type=task, type=function → skip (structural wrappers)
"""
trace_id = root_span.get('span_id') or root_span.get('id', '')
spans_sorted = sorted(all_spans, key=lambda s: s.get('created') or '')
turns = []
turn_counter = 0
seen_user_messages = set()
def add_turn(role, content, **kwargs):
nonlocal turn_counter
if not content or not content.strip(): return
turn = {'turn_id': f'turn_{turn_counter}', 'role': role, 'content': content.strip(),
'timestamp': kwargs.get('timestamp', '')}
for k in ('model', 'usage', 'tool_calls', 'tool_name', 'tool_input', 'duration_ms', 'thinking'):
if kwargs.get(k) is not None: turn[k] = kwargs[k]
turns.append(turn)
turn_counter += 1
for span in spans_sorted:
attrs = span.get('span_attributes') or {}
stype = (attrs.get('type') or '').lower()
ts = span.get('created') or ''
duration = _duration_ms_from_metrics(span.get('metrics'))
inp, out = span.get('input'), span.get('output')
span_metrics = span.get('metrics') or {}
if stype == 'llm':
messages = inp
if isinstance(messages, list) and messages and isinstance(messages[0], list):
messages = messages[0]
if isinstance(messages, list):
for msg in messages:
if isinstance(msg, dict) and msg.get('role') in ('user', 'human'):
content = msg.get('content', '')
if isinstance(content, list):
content = ' '.join(p.get('text', '') if isinstance(p, dict) else str(p) for p in content)
if content and content.strip():
msg_key = content[:200]
if msg_key not in seen_user_messages:
seen_user_messages.add(msg_key)
add_turn('user', content, timestamp=ts)
raw_content, tool_calls = '', []
if isinstance(out, dict):
gens = out.get('generations')
if isinstance(gens, list) and gens and isinstance(gens[0], list) and gens[0]:
gen = gens[0][0]
if isinstance(gen, dict):
message = gen.get('message') or gen
raw_content = message.get('content', '') or ''
for tc in (message.get('additional_kwargs') or {}).get('tool_calls') or []:
if isinstance(tc, dict):
func = tc.get('function') or {}
tool_calls.append({'tool_name': func.get('name') or 'unknown',
'input': _to_str(func.get('arguments') or ''),
'call_id': tc.get('id', '')})
assistant_content, thinking = _split_thinking(raw_content)
usage = None
if span_metrics.get('prompt_tokens') or span_metrics.get('completion_tokens'):
usage = {'input_tokens': span_metrics.get('prompt_tokens', 0),
'output_tokens': span_metrics.get('completion_tokens', 0)}
if assistant_content and assistant_content.strip():
add_turn('assistant', assistant_content, timestamp=ts,
model=attrs.get('name') or '', usage=usage,
tool_calls=tool_calls if tool_calls else None,
duration_ms=duration, thinking=thinking)
elif stype == 'tool':
tool_name = attrs.get('name') or 'unknown'
tool_output = (out.get('content', '') or _extract_content(out)) if isinstance(out, dict) else _extract_content(out)
if tool_output:
add_turn('tool', tool_output, timestamp=ts, tool_name=tool_name,
tool_input=_to_str(inp) if inp else '', duration_ms=duration)
if not turns:
if root_input := _extract_content(root_span.get('input')):
add_turn('user', root_input, timestamp=root_span.get('created', ''))
if root_output := _extract_content(root_span.get('output')):
add_turn('assistant', root_output, timestamp=root_span.get('created', ''))
return {
'trace_id': str(trace_id),
'session_id': str(trace_id),
'metadata': {
'name': (root_span.get('span_attributes') or {}).get('name') or root_span.get('id', ''),
'source': 'braintrust',
'tags': root_span.get('tags') or [],
'start_time': root_span.get('created') or '',
'scores': root_span.get('scores') or {},
},
'turns': turns,
}
print('✓ Normalization functions defined')
6. Fetch, Normalize, and Import into Label Studio
Fetches traces from Braintrust, normalizes them, creates a Label Studio project with the ReactCode config, and imports the tasks.
from label_studio_sdk import LabelStudio
from label_studio_sdk.core.request_options import RequestOptions
from typing import Any, Dict, List
_REQUEST_OPTS = RequestOptions(timeout_in_seconds=120)
def create_project(ls_host: str, api_key: str, title: str, label_config: str) -> int:
client = LabelStudio(base_url=ls_host, api_key=api_key)
project = client.projects.create(title=title, label_config=label_config, request_options=_REQUEST_OPTS)
return int(project.id)
def import_tasks(ls_host: str, api_key: str, project_id: int, tasks: List[Dict[str, Any]]) -> Any:
client = LabelStudio(base_url=ls_host, api_key=api_key)
return client.projects.import_tasks(id=project_id, request=tasks, return_task_ids=True)
if not LABEL_STUDIO_API_KEY:
raise RuntimeError('Missing LABEL_STUDIO_API_KEY — set it in your .env file.')
# 1) Fetch traces from Braintrust
root_spans = bt.list_traces(limit=20)
if not root_spans:
raise RuntimeError('No traces returned. Run Section 3 to generate sample traces.')
print(f'Fetched {len(root_spans)} root spans from Braintrust')
# 2) Normalize — only include traces with child spans
tasks: List[Dict[str, Any]] = []
skipped = 0
for root in root_spans:
root_span_id = root.get('span_id') or root.get('root_span_id')
if not root_span_id:
continue
all_spans = bt.get_trace_spans(root_span_id)
if len(all_spans) <= 1:
skipped += 1
continue
normalized = normalize_braintrust_trace(root, all_spans)
if normalized['turns']:
tasks.append({'data': normalized})
print(f" + Trace {root_span_id[:12]}... -> {len(normalized['turns'])} turns "
f"({sum(1 for t in normalized['turns'] if t['role']=='user')} user, "
f"{sum(1 for t in normalized['turns'] if t['role']=='assistant')} assistant, "
f"{sum(1 for t in normalized['turns'] if t['role']=='tool')} tool)")
if skipped:
print(f' (skipped {skipped} traces without child spans)')
print(f'\nPrepared {len(tasks)} tasks for import')
# 3) Create project and import
project_id = create_project(
ls_host=LABEL_STUDIO_HOST,
api_key=LABEL_STUDIO_API_KEY,
title=f'Braintrust Trace Review ({BRAINTRUST_PROJECT})',
label_config=LABEL_CONFIG_XML,
)
print(f'Created project: {project_id}')
resp = import_tasks(LABEL_STUDIO_HOST, LABEL_STUDIO_API_KEY, project_id, tasks)
print(f'Imported {len(tasks)} tasks')
print(f'\nDone! Open your project: {LABEL_STUDIO_HOST.rstrip("/")}/projects/{project_id}')
What’s Next
- Start annotating: Open the project link above and click through traces in the ReactCode UI
- Share with SMEs: Invite domain experts to your Label Studio project for collaborative evaluation
- Incremental sync: Re-run sections 4–6 periodically to pull new traces
- Export annotations: Use the Label Studio SDK or REST API to pull structured annotations for downstream analysis or fine-tuning
- Custom taxonomy: Edit the
_TEMPLATE_JSvariable in the label config cell to add failure modes specific to your domain - LangSmith / Langfuse: See companion tutorials for other observability platforms
Summary
This tutorial demonstrated the complete workflow from Braintrust traces to expert evaluation:
- ✓ Set up environment with Braintrust and Label Studio Enterprise
- ✓ Defined a ReactCode-based 3-panel annotation UI (Enterprise feature)
- ✓ Ran a multi-tool ReAct agent with Claude extended thinking to generate realistic traces
- ✓ Fetched traces from Braintrust via REST API + BTQL
- ✓ Normalized Braintrust spans into a unified trace schema
- ✓ Created a Label Studio project and imported traces as annotation tasks
Key Takeaway
Braintrust excels at trace storage and BTQL-powered querying during development. Label Studio Enterprise provides the collaborative, expert-driven evaluation framework — with the ReactCode interface giving domain experts an intuitive turn-by-turn review experience. The two tools complement each other throughout the AI development lifecycle.