
Confident AI QuickStart

caution

Are you following best LLM evaluation practices? Without a serious evaluation workflow, your testing results aren't really valid, and you might be wasting a lot of time iterating on the wrong things.

Confident AI is the LLM evaluation platform for DeepEval. It is native to DeepEval and was designed for teams building LLM applications to maximize their performance and safeguard against unsatisfactory LLM outputs. While DeepEval's open-source metrics are great for running evaluations, there is much more to building a robust LLM evaluation workflow than collecting metric scores.

If you're serious about LLM evaluation, Confident AI is for you.


Apart from running the actual evaluations, you'll need a way to:

  • Curate a robust testing dataset
  • Perform LLM benchmark analysis
  • Tailor evaluation metrics to your opinions
  • Improve your testing dataset over time

Confident AI supports this by offering an opinionated, centralized platform to manage everything mentioned above, which means more accurate, informative, and faster insights, and makes it easier to identify performance gaps and figure out how to improve your LLM system.

DID YOU KNOW?
$\textbf{Great LLM Evaluation} = \textbf{Quality of Dataset} \times \textbf{Quality of Metrics}$

Why Confident AI?

If your team has ever tried building its own LLM evaluation pipeline, here is the list of problems it has likely encountered (and it's a long list):

  • Dataset Curation Is Fragmented And Annoying

    • Your team often juggles tools like Google Sheets or Notion to curate and update datasets, leading to constant back-and-forth between engineers and domain-expert annotators.
    • There is no "source of truth", since datasets aren't in sync with the codebase you run evaluations from.
  • Evaluation Results Are (Still) More Vibe Checks Than Experimentation

    • You basically just look at failing test cases, but they don’t provide actionable insights, and sharing them with your team is hard.
    • It’s impossible to compare benchmarks side-by-side to understand how changes impact performance for each unit test, making it more guesswork than experimentation.
  • Testing Data Are Static With No Easy Way To Keep Them Updated

    • Your LLM application's needs and priorities evolve in production, but your datasets don’t.
    • Figuring out how to query and incorporate real-world interactions into evaluation datasets is tedious and error-prone.
  • Building A/B Testing Infrastructure Is Hard And Current Tools Don't Cut It

    • Setting up A/B testing for prompts/models to route traffic between versions is easy, but figuring out which version performed better, and in which areas, is hard.
    • Tools like PostHog or Mixpanel give user-level analytics, while other LLM observability tools focus too much on cost and latency; none of these tell you anything about end output quality.
  • Human Feedback Doesn't Lead to Improvements

    • Teams spend time collecting feedback from end-users or internal reviewers, but there’s no clear path to integrate it back into datasets.
    • A lot of manual effort is needed to make good use of feedback, and too often that effort ends up wasted.
  • There's No End To Manual Human Intervention

    • Teams rely on human reviewers to gatekeep LLM outputs before they reach users in production, but the process is ad hoc, unstructured, and never-ending.
    • No automation to focus reviewers on high-risk areas or repetitive tasks.

Confident AI solves all of these LLM evaluation problems so you can stop going around in circles.

Installation

Go to the root directory of your project and create a virtual environment (if you don't already have one). In the CLI, run:

python3 -m venv venv
source venv/bin/activate

In your newly created virtual environment, run:

pip install -U deepeval
note

We always recommend keeping deepeval updated to its latest version when using Confident AI.

Login to Confident AI

Everything in deepeval is automatically integrated with Confident AI, including any custom metrics you've built on deepeval. To start using Confident AI with deepeval, simply log in via the CLI:

deepeval login

Follow the instructions displayed in the CLI (create an account, get your Confident API key, and paste it into the CLI), and you're good to go.

DID YOU KNOW?

You can also log in directly in Python if you already have a Confident API key:

import deepeval

deepeval.login_with_confident_api_key("your-confident-api-key")

Or, via the CLI:

deepeval login --confident-api-key "your-confident-api-key"

Set Up Your Evaluation Model

info

You can also use ANY custom LLM of your choice, although we DON'T recommend it for this quickstart guide, since some custom models are especially prone to outputting invalid JSON.

You'll need to set your OPENAI_API_KEY as an environment variable before running an evaluation, since the metrics we'll be using are LLM-evaluated metrics.

export OPENAI_API_KEY=<your-openai-api-key>

Alternatively, if you're working in a notebook environment (Jupyter or Colab), set your OPENAI_API_KEY in a cell:

%env OPENAI_API_KEY=<your-openai-api-key>

Please do not include quotation marks when setting your OPENAI_API_KEY in a notebook environment, as this is invalid syntax.
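If you'd rather set the key from within Python itself (for example at the top of a notebook), here's a minimal sketch, assuming you load the key from somewhere secure rather than hardcoding it:

import os

# Applies only to the current Python process
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"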

note

You can also run evaluations on Confident AI using our models, but that's a more advanced topic for later on in this documentation.

Run Your First Evaluation

Now that you're logged in, create a Python file, for example experiment_llm.py. We'll be evaluating a medical chatbot in this quickstart guide, but it can be any other LLM system you're building.

experiment_llm.py
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# See above for contents of fake data
fake_data = [...]

# Create a list of LLMTestCase
test_cases = []
for fake_datum in fake_data:
    test_case = LLMTestCase(
        input=fake_datum["input"],
        actual_output=fake_datum["actual_output"],
        retrieval_context=fake_datum["retrieval_context"]
    )
    test_cases.append(test_case)

# Define metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.5)
faithfulness = FaithfulnessMetric(threshold=0.5)

# Run evaluation
evaluate(test_cases=test_cases, metrics=[answer_relevancy, faithfulness])
note

In reality, you'll want to generate actual_outputs and retrieval_contexts at evaluation time. By keeping only a static dataset of inputs and regenerating the rest of each LLMTestCase on every run, you'll be able to experiment with which of two versions of your LLM application performs better.
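Here's a minimal sketch of that pattern, where retrieve_chunks and generate_answer are hypothetical stand-ins for your own application's retriever and generation logic:

from deepeval.test_case import LLMTestCase

# Hypothetical helpers from your own application -- replace with your real code
from my_medical_chatbot import retrieve_chunks, generate_answer

# Only the inputs stay static; everything else is generated at evaluation time
inputs = [datum["input"] for datum in fake_data]

test_cases = []
for user_input in inputs:
    retrieval_context = retrieve_chunks(user_input)                  # your retriever
    actual_output = generate_answer(user_input, retrieval_context)   # your LLM app
    test_cases.append(
        LLMTestCase(
            input=user_input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )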

Finally, run experiment_llm.py to have Confident AI benchmark your LLM application for you:

python experiment_llm.py

Congratulations 🎉! You just ran your first evaluation, which created a test run on Confident AI. Before diving into the platform, let's break down what happened.

  • We looped through the fake_data dataset, and created a list of LLMTestCases.
  • The LLMTestCase variable input mimics a user input, while actual_output is a placeholder for what your application is supposed to output for this input.
  • The LLMTestCase variable retrieval_context contains the retrieved context from your knowledge base, and AnswerRelevancyMetric(threshold=0.5) and FaithfulnessMetric(threshold=0.5) are default metrics provided by deepeval for evaluating your LLM output's relevancy and faithfulness based on the provided retrieval context.
  • All metric scores range from 0 to 1, and threshold=0.5 ultimately determines whether each metric has passed or not. A test case only passes if all of its metrics pass (see the sketch after this list for how a single metric's threshold works).
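As a quick sketch of how a threshold plays out for a single metric, using deepeval's standalone metric API (the test case fields below are placeholder strings for illustration only):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(threshold=0.5)
test_case = LLMTestCase(
    input="Do I need a prescription for ibuprofen?",
    actual_output="No, ibuprofen is available over the counter at standard doses.",
    retrieval_context=["Ibuprofen is sold over the counter in doses up to 400mg."],
)

metric.measure(test_case)
print(metric.score)            # a float between 0 and 1
print(metric.is_successful())  # True only if score >= threshold (0.5 here)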

🚨 But unfortunately, not all test cases have passed. Can you click on View Test Case Details on the failing test case to figure out why?

info

Failing test cases represent areas for improvement; by improving your LLM application through various means, such as writing better prompts, using better tool calls, or even fine-tuning a custom model, you'll be able to turn failing test cases into passing ones.

In the next section, we'll learn how to create and use a dataset on Confident AI so we can move away from fake_data.