Releases: confident-ai/deepeval
Metrics for AI agents, multi-turn synthetic data generation, and more!
Full support for agentic evals :)
If you're building agents, DeepEval can now analyze and give you metric scores based on the trace of your LLM app.
1. Task Completion
Evaluate whether an agent actually completes the intended task, not just whether its final output "looks correct."
Captures:
- Goal completion
- Intermediate step correctness
- Error recovery
- Procedural accuracy
Docs: https://deepeval.com/docs/metrics-task-completion
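For a concrete picture, here is a minimal sketch of running it against a single test case; the tool call, outputs, and threshold below are illustrative placeholders, and the metric can also score full traces via component-level evals:
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Illustrative test case: the input, output, and tool call are placeholders
test_case = LLMTestCase(
    input="Plan a 3-day trip to Paris.",
    actual_output="Here is your 3-day Paris itinerary: ...",
    tools_called=[ToolCall(name="itinerary_generator", input_parameters={"city": "Paris", "days": 3})],
)

metric = TaskCompletionMetric(threshold=0.7)  # example threshold
metric.measure(test_case)
print(metric.score, metric.reason)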
2. Tool Correctness
Evaluates whether tools were invoked correctly, meaningfully, and in the right order.
Captures:
- Correct tool usage
- Correct argument formatting
- Avoiding hallucinated tools
- Using tools only when needed
Docs: https://deepeval.com/docs/metrics-tool-correctness
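A minimal sketch of how this is scored, comparing the tools your agent actually called against the tools you expected (the weather tool below is a placeholder):
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What's the weather in Berlin?",
    actual_output="It is currently 18°C and sunny in Berlin.",
    tools_called=[ToolCall(name="get_weather")],    # tools the agent actually invoked
    expected_tools=[ToolCall(name="get_weather")],  # tools you expected it to invoke
)

metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score, metric.reason)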
3. Argument Correctness
Evaluates whether the agent's arguments to tools are valid, structured, and aligned with the task.
Captures:
- Correct parameter selection
- Type/format adherence
- Logical argument formation
- Avoiding semantically incorrect inputs
Docs: https://deepeval.com/docs/metrics-argument-correctness
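A rough sketch of what this looks like; the ArgumentCorrectnessMetric class name is assumed from the docs link above, and the currency tool is a placeholder:
from deepeval.metrics import ArgumentCorrectnessMetric  # class name assumed from the docs link above
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Convert 100 USD to EUR.",
    actual_output="100 USD is roughly 92 EUR.",
    tools_called=[
        ToolCall(
            name="convert_currency",
            input_parameters={"amount": 100, "from_currency": "USD", "to_currency": "EUR"},  # illustrative arguments
        )
    ],
)

metric = ArgumentCorrectnessMetric()
metric.measure(test_case)
print(metric.score, metric.reason)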
4. Step Efficiency
Measures how efficiently an agent completes a task, rewarding fewer unnecessary steps and penalizing detours.
Captures:
- Optimality of step count
- Redundant tool calls
- Unnecessary loops
- Waffling behavior
Docs: https://deepeval.com/docs/metrics-step-efficiency
5. Plan Adherence
Evaluates how well the agent follows a predefined or self-generated plan.
Captures:
- Alignment to planned steps
- Deviations and detours
- Fidelity to strategy
- Execution according to intent
Docs: https://deepeval.com/docs/metrics-plan-adherence
6. Plan Quality
Evaluates the quality of the plan itself when the agent generates one.
Captures:
- Clarity
- Completeness
- Achievability
- Logical ordering of steps
Docs: https://deepeval.com/docs/metrics-plan-quality
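Metrics 4-6 are scored off the full trace of your agent, so one way to wire them up is to attach them to a traced root component, following the component-level evals pattern. The class names and the @observe usage below are assumptions based on the docs links above, not a definitive API:
from deepeval.tracing import observe
from deepeval.metrics import (  # class names assumed from the docs links above
    PlanAdherenceMetric,
    PlanQualityMetric,
    StepEfficiencyMetric,
)

# Attach trace-based metrics to the root span of a hypothetical agent function
@observe(metrics=[StepEfficiencyMetric(), PlanAdherenceMetric(), PlanQualityMetric()])
async def my_agent(user_input: str) -> str:
    # ... planning, tool calls, and generation happen here and are captured in the trace ...
    return "final answer"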
New: Multi-Turn Synthetic Goldens Generation
Synthetic data generation now supports multi-turn goldens instead of just single-turn.
You can now generate:
- Multi-turn conversational scenarios
- Scenario + Expected Outcome pairs
- Turn-by-turn dialogue structure
- Goldens instantly compatible with the Conversation Simulator
- Direct pipeline: Generate → Simulate → Evaluate
Perfect for building large-scale synthetic datasets for support agents, sales agents, research assistants, workflow agents, and any multi-step conversational system.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
conversational_goldens = synthesizer.generate_conversational_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)

Docs here (click on the "multi-turn" tab): https://deepeval.com/docs/synthesizer-generate-from-docs
New Interfaces, Reduce ETL Code by ~50%!
Less Code to Load Data In and Out of DeepEval's Ecosystem :)
If you're using any of the features below, you'll likely see a 50% reduction in code required, especially around ETL for formatting things in and out of DeepEval's ecosystem. This includes:
Arena-GEval
The first LLM-arena-as-a-judge metric now runs a blinded experiment and randomly swaps contestant positions for a fair verdict on which LLM output is better.
Docs: https://deepeval.com/docs/metrics-arena-g-eval
You can now run component-level evals by simply running a for loop over your dataset of goldens.
Run your loop -> call your agent once per golden -> get your evaluation results. No more forcing outputs into non-test-case-friendly formats; DeepEval finds your LLM traces automatically and runs evals on them.
import asyncio

from somewhere import your_async_llm_app  # Replace with your async LLM app
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[Golden(input="...")])

for golden in dataset.evals_iterator():
    # Create a task to invoke your async LLM app
    task = asyncio.create_task(your_async_llm_app(golden.input))
    dataset.evaluate(task)

Docs: https://deepeval.com/docs/evaluation-component-level-llm-evals
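For reference, here is a sketch of what your_async_llm_app might look like on the other side of that loop, with a metric attached to a traced component; the decorator and update_current_span usage follow DeepEval's tracing docs as we understand them, and the metric choice is illustrative:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metrics=[AnswerRelevancyMetric()])  # metric choice is illustrative
async def your_async_llm_app(user_input: str) -> str:
    # Call your model / tools here; this stub just echoes the input
    output = f"Answer to: {user_input}"
    # Attach a test case for this component so DeepEval can score the span
    update_current_span(test_case=LLMTestCase(input=user_input, actual_output=output))
    return output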
Conversation simulator is now based on goldens.
Previously you had to define a list of user intentions and profile items, with a ton of other configs to juggle. Now you define a list of goldens as a standardized set of scenarios to generate turns for.
from deepeval.dataset import ConversationalGolden
from deepeval.simulator import ConversationSimulator
from deepeval.test_case import Turn

# Create ConversationalGolden
conversation_golden = ConversationalGolden(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
)

# Define chatbot callback
async def chatbot_callback(input):
    return Turn(role="assistant", content=f"Chatbot response to: {input}")

# Run simulation
simulator = ConversationSimulator(model_callback=chatbot_callback)
conversational_test_cases = simulator.simulate(goldens=[conversation_golden])
print(conversational_test_cases)

Docs: https://deepeval.com/docs/conversation-simulator
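From there, the simulated test cases can be passed straight into evaluate with any multi-turn metric; the metric below is just one example:
from deepeval import evaluate
from deepeval.metrics import ConversationCompletenessMetric  # example multi-turn metric

# conversational_test_cases comes from the simulator example above
evaluate(test_cases=conversational_test_cases, metrics=[ConversationCompletenessMetric()])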
We also updated our docs, with more improvements to come.
Renewed datasets: single- vs multi-turn
New Features
DeepEval's 3.2.6 release focuses on single- vs multi-turn use cases in datasets!
Support for Single-Turn and Multi-Turn Datasets
- Single-turn datasets: Simple input → output pairs for one-off prompt testing.
- Multi-turn datasets: Full conversation flows with alternating user/assistant turns. Perfect for simulating real chat interactions.
DeepEval now automatically detects whether a dataset is single-turn or multi-turn based on structure and routes to the appropriate evaluation logic.
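In practice, that means both of the following are valid datasets; this sketch assumes, per the auto-detection described above, that the same goldens argument accepts either golden type:
from deepeval.dataset import ConversationalGolden, EvaluationDataset, Golden

# Single-turn dataset: plain input/expected output goldens
single_turn_dataset = EvaluationDataset(goldens=[
    Golden(input="What is your refund policy?", expected_output="Refunds are available within 30 days."),
])

# Multi-turn dataset: conversational goldens described by a scenario and expected outcome
multi_turn_dataset = EvaluationDataset(goldens=[
    ConversationalGolden(
        scenario="A customer asks about a late delivery and requests a refund.",
        expected_outcome="The agent apologizes and issues the refund.",
    ),
])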
Conversational Goldens
Introduced a new concept: conversational goldens, which contain a scenario (and optionally an expected_outcome), but not fields like input and expected_output as in single-turn use cases.
Improvements
- Smarter dataset evaluation routing: Whether single-turn or multi-turn, DeepEval figures it out and builds test cases accordingly.
- Improved multi-turn context preservation: Each conversational turn is maintained during evaluation, giving more accurate multi-turn metrics.
This release is setting the stage for future multi-turn use cases.
New Arena GEval Metric, for Pairwise Comparisons
An LLM-Arena-style metric is here
In DeepEval's latest release, we are introducing ArenaGEval, the first metric to compare test cases against each other and choose the best-performing one based on your custom criteria.
It looks something like this:
from deepeval import evaluate
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval
a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)

New Multimodal Metrics, with Platform Support
In DeepEval's latest release, we are introducing multimodal G-Eval, plus 7+ multimodal metrics!
Previously we had great support for single-turn text evaluation in the form of LLMTestCases, but now we're adding MLLMTestCase, which accepts images:
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage
from deepeval import evaluate
m_test_case = MLLMTestCase(
    input=["Show me how to fold an airplane"],
    actual_output=[
        "1. Take the sheet of paper and fold it lengthwise",
        MLLMImage(url="./paper_plane_1", local=True),
        "2. Unfold the paper. Fold the top left and right corners towards the center.",
        MLLMImage(url="./paper_plane_2", local=True),
    ],
)

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the text and images in the actual output are coherent with each other.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[m_test_case], metrics=[text_image_coherence])

Docs here: https://deepeval.com/docs/multimodal-metrics-g-eval
PS. This also includes platform support

New Conversational Evaluation, LiteLLM Integration
In DeepEval's latest release, we are introducing a slight change in how a conversation is evaluated.
Previously we treated a conversation as a list of LLMTestCases, which might not necessarily be the case. Now a conversational test case is made up of a list of Turns instead, which follows OpenAI's standard messages format:
from deepeval.test_case import Turn
turns = [Turn(role="user", content="...")]

Docs here: https://deepeval.com/docs/evaluation-test-cases#conversational-test-case
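A full multi-turn test case is then just a list of those turns, for example:
from deepeval.test_case import ConversationalTestCase, Turn

convo_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="I'd like to move my flight to next Monday."),
        Turn(role="assistant", content="Done! Your flight has been rebooked for Monday."),
    ]
)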
New Loading Bars, And Cloud Storage
Added new loading bars for component-level evals, and the deepeval view command to see results on Confident AI.
LLM Evals - v3.0
DeepEval v3.0: Evaluate Any LLM Workflow, Anywhere
We're excited to introduce DeepEval v3.0, a major milestone that transforms how you evaluate LLM applications, from complex multi-step agents to simple prompt chains. This release brings component-level granularity, production-ready observability, and simulation tools to empower devs building modern AI systems.
Component-Level Evaluation for Agentic Workflows
You can now apply DeepEval metrics to any step of your LLM workflow (tools, memories, retrievers, generators) and monitor them in both development and production.
- Evaluate individual function calls, not just final outputs
- Works with any framework or custom agent logic
- Real-time evaluation in production using observe()
- Track sub-component performance over time
Learn more →
Conversation Simulation
Automatically simulate realistic multi-turn conversations to test your chatbots and agents.
- Define model goals and user behavior
- Generate labeled conversations at scale
- Use DeepEval metrics to assess response quality
- Customize turn count, persona types, and more
Generate Goldens from Goldens
Bootstrapping eval datasets just got easier. Now you can exponentially expand your test cases using LLM-generated variants of existing goldens.
- Transform goldens into many meaningful test cases
- Preserve structure while diversifying content
- Control tone, complexity, length, and more
Read the guide →
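A minimal sketch of this, assuming the generate_goldens_from_goldens method described in the guide; the seed golden is a placeholder:
from deepeval.dataset import Golden
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
# Expand a handful of hand-written goldens into many LLM-generated variants
expanded_goldens = synthesizer.generate_goldens_from_goldens(
    goldens=[Golden(input="How do I reset my password?")],
)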
Red Teaming Moved to DeepTeam
All red teaming functionality now lives in its own focused project: DeepTeam. DeepTeam is built for LLM security: adversarial testing, attack generation, and vulnerability discovery.
Install or Upgrade

pip install deepeval --upgrade

Why v3.0 Matters
DeepEval v3.0 is more than an evaluation framework; it's a foundation for LLM observability. Whether you're debugging agents, simulating conversations, or continuously monitoring production performance, DeepEval now meets you wherever your LLM logic runs.
Ready to explore?
Full docs at deepeval.com →
G-Eval Rubric
Rubric Available for G-Eval
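A rough sketch of what a rubric-based G-Eval can look like; the Rubric import path and fields are based on the G-Eval docs as we recall them and should be treated as assumptions:
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric  # import path assumed
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    # Rubric buckets constrain scoring: each score range maps to an expected outcome
    rubric=[
        Rubric(score_range=(0, 3), expected_outcome="Mostly incorrect or irrelevant."),
        Rubric(score_range=(4, 7), expected_outcome="Partially correct with minor errors."),
        Rubric(score_range=(8, 10), expected_outcome="Fully correct and complete."),
    ],
)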
Cleanup Tracing, Component Evals, Etc.
In this release we've cleaned up some dependencies to separate out dev packages, and added more verbose tracing logs for debugging.