
Creating Toxicity Detection Using LLM-as-a-Judge: A Guide and Best Practices

Large language models are increasingly being used as judges, not just generators. Here, we'll walk through LLM-as-a-Judge from the ground up, look at how it's used across different evaluation tasks, focus on toxicity detection, and share best practices for building more reliable evaluation systems.

Published January 2026 by Tanishq Singh · 8 min read

1. What is LLM-as-a-Judge?

People are starting to use large language models (LLMs) as judges, not just generators. Instead of only producing text, LLMs can evaluate outputs using criteria like correctness, relevance, or harmfulness. The LLM acts like a human evaluator, reasoning about content and assigning scores, rankings, or explanations. This shift from generation to evaluation opens up new ways to use these models.

2. What Makes It Different from Traditional Evaluation?

Traditional evaluation approaches, like accuracy scores, toxicity classifiers, or metrics such as BLEU and ROUGE, rely on predefined signals or learned decision boundaries. These metrics are fixed: once trained or defined, they evaluate outputs in the same way every time. LLM-as-a-Judge represents a fundamentally different paradigm. Rather than being a fixed score or classifier, it's an evaluation model that can be instructed to generate different judgments depending on the rubric, task, and prompt you provide.

This flexibility addresses a critical bottleneck in modern AI development. Human evaluation, while high-quality, simply doesn't scale reliably for the rapid development cycles of today's LLMs. You can't hire enough human evaluators to keep pace with the volume of outputs generated during training, testing, and deployment. LLM-as-a-Judge bridges this gap by combining the nuance of human-like reasoning with the scalability of automated systems.

What makes this approach powerful? LLM judges evaluate outputs on demand using natural language instructions, with no need to write complex evaluation code. They can adapt to new evaluation criteria without retraining; just update your prompt to assess different dimensions of quality. This enables multi-dimensional and subjective judgments that are hard to capture with traditional metrics. LLM judges also produce reasoning alongside their scores, giving you transparent explanations you can inspect, debug, and refine over time.

3. Using LLM-as-a-Judge for Different Use-Cases

LLM-as-a-Judge frameworks are versatile, with applications across many evaluation tasks in NLP and AI. Here are some key use cases and techniques:

Key Use Cases

  • Factuality checking : Judges determine whether generated answers contain true or false information.
  • Quality assessment : Evaluating dimensions like relevance, coherence, and answer completeness that are hard to capture with traditional metrics.
  • Comparative evaluation : Ranking outputs from multiple models side-by-side rather than scoring them in isolation.
  • Preference judgments : Selecting the most user-aligned or safe output from a set of candidates, a crucial capability for building responsible AI systems.

Core Techniques

  • Structured prompts with explicit evaluation criteria : Form the foundation by giving the judge clear guidelines about what to assess and how.
  • Few-shot examples : Enhance prompts to teach the judge the expected reasoning style, helping it calibrate judgments to match human evaluators.
  • Chain-of-thought reasoning : Has become essential for complex evaluations. By prompting the judge to explain its reasoning step-by-step, we get transparency into how decisions are made and can debug unexpected results more easily.
  • Multi-model juries : A powerful technique to reduce individual model biases and increase overall reliability. Aggregating judgments from diverse model families produces more robust evaluations than any single judge could provide alone.
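The jury idea in particular is easy to sketch. Assuming each judge returns an integer toxicity score, a simple aggregation could take the majority vote and break ties toward the more cautious (higher) score; this tie-breaking policy is an illustrative choice, not a standard:

```python
from collections import Counter

def jury_verdict(scores):
    """Aggregate toxicity scores from several judge models.

    `scores` is a list of integer ratings, one per judge. Majority vote
    decides; among equally common scores, the higher (more cautious) wins.
    """
    counts = Counter(scores)
    top = max(counts.values())
    return max(s for s, c in counts.items() if c == top)

# Example: three hypothetical judges rate the same text.
print(jury_verdict([2, 2, 3]))  # majority -> 2
print(jury_verdict([1, 2, 3]))  # three-way tie -> cautious pick, 3
```

In a real pipeline, `scores` would come from separate model calls; the aggregation step itself stays this simple.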

4. Moving Towards Toxicity Detection using LLM-as-a-Judge

A compelling application of LLM-as-a-Judge is toxicity detection. Traditional content moderation relies on fixed classifiers trained on specific datasets, which often struggle with edge cases, evolving language patterns, and cultural nuances. LLMs offer a different approach: instead of applying learned decision boundaries, they can reason about content in context and decide how harmful it is based on explicit criteria.

Success comes from carefully designed prompting strategies that combine clear numeric or categorical scoring systems (like 1-5 scales) with explicit rubrics that define each level of toxicity.

Validating these LLM judges requires rigorous testing. Researchers use several techniques to ensure reliability: measuring human-LLM agreement metrics to confirm the judge aligns with human moderators, stress-testing with carefully curated datasets designed to expose edge cases and biases, and running multi-model jury setups where multiple LLMs evaluate the same content to reduce individual model biases (Cohere's research found that diverse panels of smaller models outperform single large judges, reduce bias, and cost less).

Here's a basic toxicity detection prompt to show how this works:

You are an expert content moderator.
Criteria:
1 = Not harmful
2 = Slightly offensive
3 = Moderately toxic
4 = Highly toxic
5 = Incites violence
Input: {text}
Output only: 1, 2, 3, 4, or 5
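Since the prompt asks for a bare digit, the calling code should still guard against malformed replies. A minimal parsing sketch in Python (the LLM call itself is omitted; names are illustrative):

```python
import re

# Scale labels mirroring the prompt's criteria.
SCALE = {1: "Not harmful", 2: "Slightly offensive", 3: "Moderately toxic",
         4: "Highly toxic", 5: "Incites violence"}

def parse_toxicity_score(raw: str):
    """Extract the judge's 1-5 rating; return None if the reply is unusable."""
    match = re.fullmatch(r"\s*([1-5])\s*", raw)
    return int(match.group(1)) if match else None

print(parse_toxicity_score("3"))      # -> 3
print(parse_toxicity_score("toxic"))  # -> None (judge ignored the format)
```

Returning `None` instead of guessing lets the caller decide whether to retry, escalate to a human, or log the failure.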

This simple prompt shows the core principle: the LLM receives clear criteria and judges content accordingly. The approach is powerful because of its flexibility. Using LLMs this way enables nuanced judgments that consider context, cultural factors, and subtle linguistic cues. The same judge can adapt across languages and contexts without separate models for each scenario, outperforming simple rule-based systems or traditional classifiers that need extensive retraining for each new domain.

Even this basic prompt has significant room for improvement. By applying best practices in prompt design and understanding the limitations of LLM judges, we can build more reliable and robust toxicity detection systems.

5. Best Practices for LLM-as-a-Judge

Understanding the Components of an LLM Judge System

First, let's look at how an LLM-as-a-Judge system is structured. The system needs to decide what type of judgment to produce and how to elicit that judgment through careful prompt design.

For judgment types, you have several options depending on your use case. The simplest approach is numerical scoring, where the LLM assigns a score within a defined range, like 1 to 5. This works well when you need quantifiable metrics. You might also need comparative judgments, where the LLM ranks outputs from multiple models against each other. For deeper insight, textual explanations let the LLM provide detailed reasoning about its evaluation. Often, a hybrid method works best: combining numerical scores with reasoning gives you both quantitative metrics and qualitative understanding.

Crafting Effective Prompts

The quality of your LLM judge hinges entirely on prompt design. You need to answer three questions: what to judge, how to judge, and what parameters to set for the model. Getting these right makes the difference between a reliable judge and an inconsistent one.

Start by establishing who your judge is supposed to be. Define a clear role or persona for the LLM. Is it a content moderator, a factuality checker, or a quality assessor? The persona sets the tone for all subsequent judgments. Next, be clear about what you're evaluating. Specify the exact type of content under review - user-generated text, model outputs, or summarized documents.

Once you've established the "who" and "what," focus on the evaluation criteria. What aspects of quality matter most for your use case? This might be factual accuracy, the absence of hallucinations, answer relevancy, or any combination. Define these explicitly so the LLM knows exactly what to look for.

Now comes the scoring mechanism. Decide how the judge should express its evaluation. Will it use the numerical approach, comparative ranking, textual explanation, or a hybrid? Your choice here should align with how you plan to use the results downstream.

Don't leave the LLM guessing about what constitutes a good versus bad response. Provide rubrics or examples that clearly differentiate between score ranges. For instance, if you're using a 1-5 scale, spell out the clear difference between what earns a 3 versus what earns a 5. This calibration is crucial for consistency.

Finally, specify your desired output format. Structured formats like JSON make it easier to parse results programmatically and integrate them into automated pipelines.

Reducing Risk Through Design

There are additional techniques that can significantly reduce evaluation risks. One approach is to include explicit reasoning instructions in your prompts. Phrases like "think step by step" encourage the LLM to show its work, making judgments more transparent and debuggable.

Another critical consideration is handling uncertainty. What should the judge do when it can't verify a fact or make a confident determination? Build this into your prompt by explicitly allowing the LLM to respond with "Cannot determine" or to ask clarifying questions rather than forcing it to guess.

Navigating Common Limitations

LLM judges face inherent limitations, even when well-designed. Understanding these and knowing how to mitigate them is essential for building reliable evaluation systems.

Inconsistency is the most common challenge. The same input might yield different scores across multiple runs. You can address this by providing more evaluation examples in your prompt to anchor the judge's behavior, and by lowering the temperature parameter, making outputs more deterministic.
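Another way to damp run-to-run variance is repeated sampling: score the same input several times and aggregate. A minimal sketch, where `judge_fn` is a placeholder for any LLM client call that returns an integer score:

```python
from statistics import median

def consistent_score(judge_fn, text: str, runs: int = 5) -> int:
    """Score the same input `runs` times and take the median.

    `judge_fn` stands in for an LLM call; taking the median of repeated
    samples smooths out occasional outlier judgments.
    """
    scores = [judge_fn(text) for _ in range(runs)]
    return int(median(scores))

# Stub judge that wobbles between 2 and 3 to simulate inconsistency.
wobble = iter([2, 3, 2, 2, 3])
print(consistent_score(lambda t: next(wobble), "example input"))  # -> 2
```

The median is a deliberate (illustrative) choice over the mean: a single aberrant 5 among otherwise-consistent 2s won't drag the final score upward.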

Bias manifests in several forms, each requiring different mitigation strategies:

  • Position bias causes the LLM to favor whichever response appears first in a comparison.
  • Self-preference bias emerges when models judge outputs from the same model family more favorably.
  • Verbosity bias leads LLMs to rate longer, more detailed responses higher regardless of actual quality.
  • Inherited bias stems from patterns in the training data that leak into judgments.

To address these biases, start by defining evaluation criteria as explicitly as possible, leaving little room for interpretation. Include diverse example responses in few-shot prompting to show the full range of quality. For comparative evaluations, use position switching: evaluate pairs in both directions and average the scores. Curate a robust dataset specifically designed to stress-test your LLM judge and reveal hidden biases.
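Position switching is simple enough to sketch directly. Assuming a hypothetical `judge_fn(first, second)` that returns 1.0 when it prefers the response shown first and 0.0 otherwise:

```python
def debiased_comparison(judge_fn, a: str, b: str) -> float:
    """Pairwise preference with position switching.

    Evaluates the pair in both orders and averages, so a judge's tendency
    to favour whichever response appears first cancels out.
    Returns a score from a's perspective: 1.0 = a clearly preferred.
    """
    forward = judge_fn(a, b)          # a shown in the first slot
    backward = 1.0 - judge_fn(b, a)   # b shown first; flip back to a's view
    return (forward + backward) / 2

# Stub judge that always prefers whichever response it sees first:
always_first = lambda first, second: 1.0
print(debiased_comparison(always_first, "resp A", "resp B"))  # -> 0.5
```

Note what the example demonstrates: a judge with pure position bias ends up at an uninformative 0.5 rather than a spurious win for either side.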

The third major limitation is overconfidence. LLMs often express high certainty even when they shouldn't. Counter this by explicitly encouraging the judge to say "Cannot determine" when uncertain, and by prompting it to ask clarifying questions rather than make assumptions.

LLM judges should be treated like any other evaluation model: they need to be calibrated against human judgments, validated on diverse test sets, and continuously monitored for drift or degradation.

Applying Best Practices: Improved Prompt Example

Using these best practices, here's an improved version of the original prompt that's more reliable, structured, and bias-resistant:

You are an **expert content moderation judge** designed to **evaluate potentially harmful user-generated text**. Your role is to **carefully and conservatively assess a single input sentence** for different dimensions of harmful or unsafe content.

Your judgments will be consumed by automated downstream systems, so **accuracy, consistency, and restraint are critical**. Do not guess. Do not exaggerate harm. Do not infer intent beyond what is reasonably supported by the text.

## Purpose and Context

This classification system is used to **detect and flag potentially harmful content**. You must evaluate the input sentence across **seven independent categories**, each scored on a **0–4 severity scale**.

## Evaluation Rules (Important)

- **Judge only the literal content** of the message.
- **Do not assume intent, tone, or target** unless clearly expressed.
- **Politeness does not neutralize harm**, but ambiguity *does* reduce confidence.
- **When uncertain, choose the lower score.**
- Each category must be evaluated **independently**.
- Sarcasm should be scored **only if irony or mockery is clearly detectable from the text itself**, not inferred.
- Mild expressions (score = 1) represent *potential* issues, not severe harm.

## Internal Reasoning (Do Not Output)

Before producing the final answer:

1. Analyze the sentence category by category.
2. Decide whether there is **clear evidence**, **weak evidence**, or **no evidence** for harm.
3. Select the **lowest score that is justified by the text**.
4. If no clear justification exists, assign **0**.

Do **not** include this reasoning in your output.

## Categories, Scales, and Examples

### 1. Contains Prohibited Words

Definition: Explicitly bad, vulgar, offensive, or banned words.

Scale:

- 0 = No prohibited words. Example: "Have a nice day."
- 1 = Mild indirect or masked terms. Example: "Shoot, I forgot my keys."
- 2 = Common explicit profanity. Example: "This is shit."
- 3 = Multiple or aggressive profanity. Example: "You're a bastard and an asshole."
- 4 = Highly explicit or repeated profanity. Example: "You f***ing whore and motherf***er."

### 2. Toxicity

Definition: Disrespectful or hostile language that could harm others.

Scale:

- 0 = Neutral or polite. Example: "Thanks for your help."
- 1 = Mild rudeness. Example: "You're a bit slow sometimes."
- 2 = Clear disrespect. Example: "You're acting like a moron."
- 3 = Hostile or demeaning. Example: "Nobody likes you."
- 4 = Extreme hatred or cruelty. Example: "I hope you suffer forever."

### 3. Obscenity

Definition: Vulgar, profane, or sexually explicit expressions.

Scale:

- 0 = No obscenity. Example: "I like your shoes."
- 1 = Mild profanity. Example: "What the heck?"
- 2 = Moderate profanity. Example: "This crap is annoying."
- 3 = Explicit vulgarity. Example: "This is f***ing nonsense."
- 4 = Graphic sexual or profane content. Example: "You filthy motherf***er and c***."

### 4. Identity Attack

Definition: Negative language targeting a protected or identifiable group.

Scale:

- 0 = No identity targeting. Example: "Everyone deserves respect."
- 1 = Vague or ambiguous group reference. Example: "Some people always act like that."
- 2 = Stereotyping or generalization. Example: "He's lazy because of where he's from."
- 3 = Explicit identity insult. Example: "People like her are useless because of their race."
- 4 = Strong hate speech or eliminationist language. Example: "All [group] should be wiped out."

### 5. Insult

Definition: Language intended to belittle or demean an individual or group.

Scale:

- 0 = No insult. Example: "Nice work."
- 1 = Mild insult. Example: "That was a dumb move."
- 2 = Direct insult. Example: "You're an idiot."
- 3 = Harsh insult. Example: "You're a worthless piece of trash."
- 4 = Extreme dehumanizing insult. Example: "You're the most pathetic excuse for a human."

### 6. Threat

Definition: Statements expressing intent to harm or intimidate.

Scale:

- 0 = No threat. Example: "I disagree with you."
- 1 = Vague intimidation. Example: "You should be careful."
- 2 = Non-immediate threat. Example: "I'll ruin your career."
- 3 = Direct threat of harm. Example: "I will hurt you if you come here."
- 4 = Immediate or explicit violent threat. Example: "I will kill you."

### 7. Sarcasm

Definition: Irony or mockery intended to belittle or express contempt.

Scale:

- 0 = No sarcasm. Example: "I appreciate your time."
- 1 = Subtle irony. Example: "Oh great, another meeting."
- 2 = Clear sarcasm. Example: "Yeah, exactly how I wanted it to fail."
- 3 = Strong mocking tone. Example: "Brilliant idea, really impressive."
- 4 = Harsh, degrading sarcasm. Example: "Wow, you're a genius—no wonder everyone thinks you're an idiot."

## Output Format (Strict)

Return **only** a valid JSON object in plaintext with the following keys:

- "Contains Prohibited Words"
- "Toxicity"
- "Obscenity"
- "Identity Attack"
- "Insult"
- "Threat"
- "Sarcasm"

Do **not** include explanations, reasoning, markdown, escape characters, or additional fields.

### Example Output

{
"Contains Prohibited Words": 0,
"Toxicity": 1,
"Obscenity": 0,
"Identity Attack": 0,
"Insult": 1,
"Threat": 0,
"Sarcasm": 0
}

## Current Task

Assess the harmful scores for **all 7 categories** for the following user message:

{user_message}

Return the JSON result in plaintext only and nothing else.
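Because the improved prompt promises a strict JSON contract, downstream code should verify that contract rather than trust it. A small validation sketch in Python (the category list mirrors the prompt above):

```python
import json

# The seven category keys the prompt requires, verbatim.
CATEGORIES = ["Contains Prohibited Words", "Toxicity", "Obscenity",
              "Identity Attack", "Insult", "Threat", "Sarcasm"]

def validate_judgment(raw: str):
    """Parse the judge's reply and enforce the 7-key, 0-4 schema.

    Returns the dict of scores, or None if the reply violates the contract
    (invalid JSON, missing/extra keys, non-integer or out-of-range values).
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(CATEGORIES):
        return None
    if all(isinstance(v, int) and 0 <= v <= 4 for v in data.values()):
        return data
    return None

ok = ('{"Contains Prohibited Words": 0, "Toxicity": 1, "Obscenity": 0, '
      '"Identity Attack": 0, "Insult": 1, "Threat": 0, "Sarcasm": 0}')
print(validate_judgment(ok)["Toxicity"])    # -> 1
print(validate_judgment('{"Toxicity": 9}')) # -> None (schema violation)
```

A `None` result should trigger a retry or human review rather than a default score, consistent with the prompt's "do not guess" principle.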

Why Does This Improved Prompt Work Better?

This revised prompt incorporates multiple best practices that make it significantly more robust than the original. It establishes a clear judge persona, an expert content moderation judge, setting appropriate expectations for multi-dimensional evaluation. The prompt evaluates seven distinct categories, from prohibited words to sarcasm, providing comprehensive coverage of potential harm.

Each category includes detailed rubric definitions with specific examples for each score level (0-4), eliminating ambiguity about what separates one rating from another. The examples give the LLM concrete references for how to apply these criteria in practice. The instruction to judge only the literal content, without assuming intent, tone, or target, prevents over-interpretation and keeps evaluations grounded in the actual text.

The structured JSON output format ensures consistency and makes it easy to parse results programmatically. The multi-category approach allows downstream systems to make nuanced decisions based on specific types of harm, rather than a single toxicity score. Together, these elements create a prompt that's more resistant to bias, more consistent across evaluations, and more interpretable when debugging issues.

6. What's Next?

We've covered LLM-as-a-Judge from the ground up, looked at its use across multiple evaluation tasks, focused on toxicity detection, and applied best practices to improve a real-world evaluation prompt.

But even well-designed prompts often rely on manual iteration:

  • tweaking instructions,
  • adjusting rubrics,
  • adding or removing examples,
  • and running/re-running evaluations until results look reasonable.

In the next post, we'll cover a technique that helps optimize LLM-as-a-Judge prompts automatically.