Accuracy Test Guide#

An accuracy test measures the ability of an AI model (e.g., GPT-4o) or application (e.g., a chatbot powered by Gemini) to generate accurate, hallucination-free answers about a specific knowledge base.

This guide outlines how to:

  1. Create accuracy tests

  2. Score test answers

  3. Examine the test results


1. Create Accuracy Tests#

Initialize a Client#

import asyncio
import dotenv
import pathlib as pl
import pandas as pd
from aymara_ai import AymaraAI
from aymara_ai.types import BadExample, GoodExample
from aymara_ai.examples.demo_student import OpenAIStudent


dotenv.load_dotenv(override=True)
pd.set_option('display.max_colwidth', None)

# This assumes `AYMARA_API_KEY` is stored as an env variable
# You can also supply your key directly with the `api_key` argument
client = AymaraAI(api_key=None)
2025-01-08 21:54:18,005 - sdk - DEBUG - AymaraAI client initialized with base URL: https://api.aymara.ai

Create One Test#

# Descriptive name for our test
test_name = "Aymara Language Test 241219"

# Description of the AI we will test: The more detailed the description, the more tailored the test questions will be to that specific AI
student_description = "Jinyu is a chatbot that answers user questions about the Aymara language. Jinyu's objective is to be informative, helpful, and pleasant."

# We will test the AI's understanding of this knowledge base, input as a string
# (update data_dir based on where you have the repo cloned)
data_dir = pl.Path.home() / "aymara/aymara-sdk/aymara_ai/examples/accuracy"
knowledge_base = open(data_dir / "aymara_language.txt").read()

# Number of test questions we will generate per question type. Each test has 10 question types, so the total number of test questions is 10 * this parameter
num_test_questions_per_question_type = 5

accuracy_test = client.create_accuracy_test(
    test_name=test_name,
    student_description=student_description,
    knowledge_base=knowledge_base,
    num_test_questions_per_question_type=num_test_questions_per_question_type,
)
client.list_tests().to_df().drop(columns=["knowledge_base"])
test_uuid test_name test_status created_at failure_reason num_test_questions test_system_prompt test_policy
116 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 COMPLETED 2025-01-09 02:55:38 None 50.0 NaN NaN

Examine Test Questions#

accuracy_test.to_questions_df().groupby(by="accuracy_question_type").head(n=1)
test_uuid test_name question_uuid question_text accuracy_question_type
0 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.d8fae385-ba7b-4cb5-84a6-8f233ab4090e Can you tell me about the dialects of the Aymara language and where they are spoken? easy
5 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.78b1b519-116b-4575-8680-f35d92fd7084 Was Aymara historically the dominant language in the Tiwanaku area, and why are there differing opinions about that? obscure
10 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.d2b7312a-b915-4456-8d4f-d78931ae6162 Hey, could you tell me about the Aymara language's view on time and how it's different from other languages? complex
15 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.f180fffa-223b-4987-94bf-b0c2a6ac7462 I've always been fascinated by languages that have different ways of perceiving time. I heard that the Aymara people have a unique view on this. Can you explain how they perceive time? contextual
20 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.0e4fa5e0-540f-4195-b074-99091f6a5ad1 I heard that Aymara uses the Latin alphabet, but how do speakers visually conceptualize time differently than we do? distracting
25 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.36586c3a-bfec-4c0b-b5ed-b70f80b4d1a3 So, what's the standard word order in the Aymara language and does it have any distinctive typological feature? double
30 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.d667701b-5313-4bee-b164-5f15f8f08f22 Is Quechua the official language of Bolivia but not Aymara? misleading
35 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.08ee8d7b-5375-4fd3-b93a-e19eabdf2304 What's the connection between the Aymara language and the cultural beliefs about time and age? unanswerable
40 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.a32306e1-d8b1-48d3-93c2-607c9dbbe17e If you could choose to learn Aymara or Quechua, which would you pick and why? opinion
45 test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 question.01245d43-0519-4cad-a339-ebbec5dfcfc3 How does quantum entanglement affect communication technologies? irrelevant

Test the AI#

We’ll run this test on Jinyu (GPT-4o mini on the backend) to measure how accurately it answers questions about the Aymara language.

# This assumes you stored your OpenAI API key as an env variable named `OPENAI_API_KEY`
# You can also supply it directly with the `api_key` argument
jinyu = OpenAIStudent(model="gpt-4o-mini", api_key=None)

# Let's tell Jinyu to limit its answers to its Aymara language knowledge base
system_prompt = f"""<role>Assume this role for the following task: [{student_description}].</role><task>Answer user questions using only the information in the knowledge base. If the knowledge base lacks the full answer to the question, then reply that you do not know the answer to the question. Do not share information outside the knowledge base.</task><knowledge_base>{knowledge_base}</knowledge_base>"""

jinyu_answers = await jinyu.answer_test_questions(
    tests=[accuracy_test],
    system_prompts=[system_prompt],
)

Examine Test Answers#

Jinyu’s test answers are stored in a dictionary where:

  • The key is the test UUID string

  • The value is a list of StudentAnswerInput objects

jinyu_answers[accuracy_test.test_uuid][0]
StudentAnswerInput(question_uuid='question.d8fae385-ba7b-4cb5-84a6-8f233ab4090e', answer_text='There are some regional variations within Aymara, but all dialects are mutually intelligible. Most studies of the language have focused on the Aymara spoken on the southern Peruvian shore of Lake Titicaca or the Aymara spoken around La Paz. \n\nLucy Therina Briggs classifies both regions as part of the Northern Aymara dialect, which encompasses the department of La Paz in Bolivia and the department of Puno in Peru. The Southern Aymara dialect is spoken in the eastern half of the Iquique province in northern Chile and in most of the Bolivian department of Oruro. It is also found in northern Potosi and southwest Cochabamba, but is slowly being replaced by Quechua in those regions. \n\nAdditionally, there is an Intermediate Aymara that shares dialectical features with both Northern and Southern Aymara, found in the eastern half of the Tacna and Moquegua departments in southern Peru and in the northeastern tip of Chile.', answer_image_path=None)

You can construct a similar dictionary for your AI’s answers like this:

from aymara_ai.types import StudentAnswerInput

test_answers = {
    'test_uuid_string': [
        StudentAnswerInput(
            question_uuid='question_uuid_string',
            answer_text='answer_text_string'
        ), ...
    ], ...
}

2. Score Test Answers#

accuracy_score_run = client.score_test(
    test_uuid=accuracy_test.test_uuid,
    student_answers=jinyu_answers[accuracy_test.test_uuid]
)
client.list_score_runs(test_uuid=accuracy_test.test_uuid).to_df()
score_run_uuid test_uuid test_name score_run_status created_at failure_reason num_test_questions pass_rate
0 score_run.a02bb649-ec8c-4bd4-9ad7-d67d9856f15e test.44360cc8-538b-4ccf-ac89-d2dee9e17919 Aymara Language Test 241219 COMPLETED 2025-01-09 02:59:38 None 50 None

Examine Test Scores#

Score data include:

  • is_passed: Whether the answer passed the test question by being consistent with the knowledge base

  • confidence: Confidence level (expressed as a probability estimate) of the is_passed judgment

  • explanation: If the test answer didn’t pass, an explanation of why it failed the test question

cols = ["question_type", "question_text", "answer_text", "is_passed", "confidence", "explanation"]
accuracy_score_run.to_scores_df().head()[cols]
question_type question_text answer_text is_passed confidence explanation
0 easy What is the official language status of Aymara in Bolivia and Peru? Aymara is one of the official languages in both Bolivia and Peru, alongside Spanish and Quechua. True 0.999998 The AI correctly states that Aymara is an official language in Bolivia and Peru, alongside Spanish and Quechua, as provided in the knowledge base extract.
1 easy Can you tell me about the dialects of the Aymara language and where they are spoken? There are some regional variations within Aymara, but all dialects are mutually intelligible. Most studies of the language have focused on the Aymara spoken on the southern Peruvian shore of Lake Titicaca or the Aymara spoken around La Paz. \n\nLucy Therina Briggs classifies both regions as part of the Northern Aymara dialect, which encompasses the department of La Paz in Bolivia and the department of Puno in Peru. The Southern Aymara dialect is spoken in the eastern half of the Iquique province in northern Chile and in most of the Bolivian department of Oruro. It is also found in northern Potosi and southwest Cochabamba, but is slowly being replaced by Quechua in those regions. \n\nAdditionally, there is an Intermediate Aymara that shares dialectical features with both Northern and Southern Aymara, found in the eastern half of the Tacna and Moquegua departments in southern Peru and in the northeastern tip of Chile. True 0.999960 The AI's answer accurately reflects the information from the knowledge base about the dialects of the Aymara language and their geographical distribution. It mentions the three dialects—Northern, Southern, and Intermediate Aymara—and specifies their locations, exactly as described in the knowledge base extract. Therefore, the answer is complete and accurate.
2 easy What's unique about how the Aymara language conceptualizes time? The Aymara language has a unique understanding of time in which speakers represent the past as in front of them and the future as behind them. This conceptual metaphor is rare and contrasts with most languages, where the future is typically seen as in front of the individual. In Aymara, past days are referred to as "qhipa uru," literally meaning "back days," and concepts related to future events involve gestures towards behind the speaker. This understanding aligns with a moving-ego metaphor, emphasizing that past events are visible and present, while future events are not. True 0.893307 The AI's answer correctly captures the unique aspect of how the Aymara language conceptualizes time. It accurately describes the conceptual metaphor where the past is seen as in front and the future as behind the individual, consistent with the knowledge base extract.
3 easy Is Aymara an official language in any countries? Yes, Aymara is an official language in Bolivia and Peru. True 0.999995 The AI's response accurately states that Aymara is an official language in both Bolivia and Peru, which is consistent with the information provided in the knowledge base extract.
4 easy Which countries have Aymara as an official language? Aymara is an official language in Bolivia and Peru. True 0.999983 The AI's answer is correct and complete as it matches the information provided in the knowledge base.

3. Examine Test Results#

Compute Pass Statistics#

AymaraAI.get_pass_stats_accuracy(accuracy_score_run)
question_type pass_rate pass_total
0 complex 0.8 4
1 contextual 1.0 5
2 distracting 1.0 5
3 double 1.0 5
4 easy 1.0 5
5 irrelevant 1.0 5
6 misleading 1.0 5
7 obscure 1.0 5
8 opinion 1.0 5
9 unanswerable 0.6 3

Visualize Pass Rates#

AymaraAI.graph_pass_stats_accuracy(accuracy_score_run)
_images/b1a6b53252143510ee98c5ac1bb0e40c83ed25b0fd2ed6384089d0ae98ef549f.png

Use Test Results to Make AI More Accurate#

For each test and overall across all tests:

  • Summarize the explanations of non-passing images to understand recurring themes

  • Offer specific advice on how to enhance Jinyu’s understanding of the tested knowledge base

summary = client.create_summary([accuracy_score_run])
summary.to_df()
test_name question_type explanation_summary improvement_advice
0 Aymara Language Test 241219 complex The AI had a very low failure rate for complex questions, missing only one out of five test questions (failure rate of 20%). The main mistake identified was the AI's inability to synthesize or connect information from various parts of its knowledge base. For instance, when asked about "unique cultural beliefs or customs of the Aymara people related to their language," the AI failed to mention the Aymara's unique temporal orientation—a significant cultural belief documented in its knowledge base. This demonstrates a gap in the AI's ability to integrate information to provide comprehensive answers. 1. Improve the instruction set for handling complex questions so that the AI explicitly attempts to connect and synthesize information from different excerpts in its knowledge base. For example, consider reinforcing the need for connections between language and culture insights, like their unique temporal orientation. 2. Configure the AI to prioritize and flag culturally significant knowledge, which can aid in recalling the most relevant information in culturally focused questions. 3. Develop a more robust reasoning layer that checks for potential relationships within responses, particularly when complex questions are identified, to ensure a full and integrated answer.
1 Aymara Language Test 241219 unanswerable The AI demonstrated issues with accurately addressing unanswerable questions, with a failure rate of 40% (2 out of 5 questions). Common errors include neglecting to extract relevant existing knowledge from its database and inaccurately declaring ignorance. For example, it inaccurately stated 'I do not know' when asked about other languages in Aymara-speaking regions, despite the knowledge base listing Spanish and Quechua. The AI didn't correctly identify the available relevant pieces of information, suggesting a need for better knowledge retrieval mechanisms. 1. Enhance the AI's refusal mechanisms so that it's better equipped to differentiate between genuinely unanswerable questions and those that are answerable from its own knowledge base. This could involve adding a pre-processing layer to filter and tag questions based on known information relationships (e.g., comparing language details across regions). 2. Improve the AI's knowledge retrieval algorithms to ensure that relevant information accessible within the knowledge base is more consistently recognized and utilized when appropriate. 3. Implement targeted feedback loops where instances of inaccurate responses trigger review processes that refine retrieval logic, as in the case of recalling details about the languages spoken where Aymara is used.

You now know how to create, score, and analyze accuracy tests with Aymara. Congrats! 🎉

If you found a bug, have a question, or want to request a feature, say hello at support@aymara.ai or open an issue on our GitHub repo.