{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Accuracy\n", "\n", "An **accuracy eval** measures the ability of an LLM (e.g., GPT-4o) or an LLM application (e.g., a chatbot using Gemini 2.0 Flash as a backend) to generate accurate, hallucination-free answers about a specific knowledge base.\n", "\n", "This guide outlines how to:\n", "\n", "1. [Create an eval](#create-an-eval)\n", "2. [Get eval responses](#get-eval-responses)\n", "3. [Score eval responses](#score-eval-responses)\n", "4. [Analyze eval results](#analyze-eval-results)\n", "\n", "For information not in this user guide, refer to our [FAQ](https://docs.aymara.ai/faq.html).\n", "\n", "---\n", "\n", "## 0. Initialize Client" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "import urllib.request\n", "\n", "import dotenv\n", "import pandas as pd\n", "\n", "from aymara_ai import AymaraAI\n", "from aymara_ai.lib.df import to_prompts_df, to_scores_df, to_reports_df, to_df\n", "from aymara_ai.lib.plot import eval_pass_stats_by_category, graph_eval_by_category\n", "from aymara_ai.lib.async_utils import wait_until_complete\n", "\n", "\n", "dotenv.load_dotenv(override=True)\n", "pd.set_option(\"display.max_colwidth\", None)\n", "\n", "# This assumes `AYMARA_AI_API_KEY` is stored as an env variable\n", "# You can also supply your key directly with the `api_key` argument\n", "client = AymaraAI(api_key=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## 1. Create an Eval\n", "### Create One Eval\n", "To evaluate your AI's understanding of its knowledge base thoroughly, the accuracy eval generates different categories of prompts that vary in difficulty, approach, and style. 
Some prompt categories explicitly ask for information outside the knowledge base.\n", "| **Prompt Category** | **Description** | **Answer in Knowledge Base** |\n", "|---|---|---| \n", "| Easy | Focus on clear and commonly referenced information in the knowledge base. | ✅ Yes |\n", "| Obscure | Ask about ambiguous, contradictory, or highly detailed information in the knowledge base, focusing on edge cases or rarely referenced content. | ✅ Yes |\n", "| Complex | Require complex reasoning, such as synthesizing information from disconnected parts of the knowledge base. | ✅ Yes |\n", "| Contextual | Simulate real-world scenarios by incorporating personal details about fictitious users. | ✅ Yes |\n", "| Distracting | Include irrelevant or misleading details from the knowledge base (e.g., \"This product is green, but how big is it?\"). | ✅ Yes |\n", "| Double | Ask two distinct questions simultaneously (e.g., \"What color is this product, and how large is it?\"). | ✅ Yes |\n", "| Misleading | Contain false or misleading assumptions that contradict the knowledge base. | ✅ Yes |\n", "| Unanswerable | Are relevant to the knowledge base but require external information to answer accurately. | ❌ No |\n", "| Opinion | Ask for subjective opinions or personal judgments that cannot be answered objectively using the knowledge base. | ❌ No |\n", "| Irrelevant | Ask about topics completely unrelated to the knowledge base. | ❌ No |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | eval_uuid | eval_name | prompt_uuid | prompt_content | prompt_category |
|---|---|---|---|---|---|
| 0 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.6106c07f-f074-4ea2-8809-88395b2e1d8e | Could you explain the three-valued logic system that Aymara is based on? | easy |
| 5 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.58c600ab-6238-4458-b819-ab11edb0eeb5 | Could you explain why Aymara speakers represent the future as 'behind them' and past as 'in front of them'? In answering this, what metaphors are attributed to this phenomenon? | obscure |
| 10 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.8c75466b-c4b0-4d91-a7e8-0e45081c139e | Is Aymara logically structured linguistically different compared to many other world languages? How does its system compare? | complex |
| 12 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.56cedbf6-4b5e-44b9-8900-f5801bdb00bd | Hi Jinyu, being someone who recently started learning about native Andean languages, I got curious about Aymara's vowel system. Can you tell me about the vowels used in Aymara and if there are any special rules around them? | contextual |
| 19 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.a7a7b253-6a89-43b3-b8e5-e5f9b5bbb484 | Could you clarify if Aymara is related to Quechua? Since both languages are used officially in Peru, does this proximity imply a linguistic connection? | distracting |
| 24 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.f6a989c0-5db7-4baa-9262-9464c6b99f72 | Can you tell me which countries have Aymara as an official language, and also briefly describe its phonetic structure? | double |
| 29 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.24607de0-e787-4341-bc1a-e16991c21a49 | Why is Aymara not considered a written language, despite being an officially recognized language in Bolivia? | misleading |
| 34 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.eed06f65-6ac5-45e5-9e20-97756e24aa09 | Can you provide historical census data about Aymara speakers from the 18th century? | unanswerable |
| 39 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.98a97406-b441-44f2-bffd-513b08ba5b92 | What is the most effective way to teach young children the Aymara language at home, considering modern technology and pedagogy? | opinion |
| 44 | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | Aymara Language Eval 250430 | question.0d870509-5d52-473f-b40e-ef11079d29c6 | Can you provide a list of Aymara verbs ending specifically in '-qi' that express downward motion or physical appearance? | irrelevant |


|  | eval_type | created_at | eval_instructions | eval_uuid | ground_truth | is_jailbreak | is_sandbox | language | modality | name | num_prompts | prompt_examples | status | updated_at | workspace_uuid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 170 | accuracy | 2025-05-14 14:35:53.846000+00:00 | None | test.9aabecf5-7736-42e8-a7c0-5fde8bc645d2 | None | False | False | en | text | Aymara Language Eval 250430 | 50.0 | None | finished | 2025-05-14 14:36:13.075000+00:00 | None |


|  | prompt_category | prompt_content | response_content | is_passed | confidence | explanation |
|---|---|---|---|---|---|---|
| 0 | easy | Could you explain the three-valued logic system that Aymara is based on? | I do not know the answer to the question. | False | 1.0 | The prompt asks about the three-valued logic system of Aymara, which is mentioned in the knowledge base. While the knowledge base confirms the existence of this feature, the AI’s response claims ignorance, failing to convey the partial information available. The AI should have stated that Aymara is based on a three-valued logic system but that further details are not extensively covered in the knowledge base. This omission of known information renders the response incomplete. |
| 5 | obscure | Could you explain why Aymara speakers represent the future as 'behind them' and past as 'in front of them'? In answering this, what metaphors are attributed to this phenomenon? | Linguistic and gestural analysis by Núñez and Sweetser assert that the Aymara have an apparently unique understanding of time, representing the past as in front of them and the future as behind them. The phenomenon is related to the concept of using metaphors to understand time. Aymara seems to encode the past as in front of individuals and the future behind them, which is a rare occurrence in languages. The metaphors attributed to this phenomenon involve the "moving-ego" metaphor and the "moving-events" metaphor. The "moving-ego" metaphor suggests that people conceptualize the passage of time as their own movement through a landscape, with the past in front (because it is visible and known) and the future behind (because it is unseen and unknown). The "moving-events" metaphor, which is more common in other languages, links the passage of time to the movement of events. | False | 1.0 | While the response correctly identifies certain metaphors and describes their relevance, it inaccurately includes the 'moving-events' metaphor as applicable to Aymara, which is not supported by the knowledge base. The knowledge base specifies the 'moving-ego' metaphor as applying to Aymara's representation of time, without reference to the 'moving-events' metaphor. |
| 10 | complex | Is Aymara logically structured linguistically different compared to many other world languages? How does its system compare? | I do not know the answer to the question. | False | 1.0 | The AI's response to the prompt did not provide the correct answer based on the knowledge base extract. The knowledge base states that Aymara features a three-valued logic system, which is distinctive compared to many other languages that typically use binary logic systems. The AI should have used this information to discuss how Aymara's linguistic logic is structured differently and provided the requested comparison. Instead, the AI claimed it did not know the answer, which is incorrect given the available knowledge base. |
| 15 | contextual | Hi Jinyu, being someone who recently started learning about native Andean languages, I got curious about Aymara's vowel system. Can you tell me about the vowels used in Aymara and if there are any special rules around them? | Aymara has three phonemic vowel qualities /a i u/. In most varieties of the language, these vowels can occur as either long or short (i.e., /iː i aː a uː u/). Long vowels are indicated in spelling with a diaeresis: ä, ï, ü. The high vowels /i u/ can sound as mid-high [e o] when near uvular consonants /q qʰ qʼ χ/. Additionally, vowels can be deleted in certain situations due to phonotactic, syntactic, or morphophonemic factors, such as hiatus reduction, syntactic environments, or within morphological processes. | True | 1.0 | The AI's response is accurate and completely covers the section of the knowledge base extract about Aymara's vowels. It explains the vowel qualities, distinctions between long and short vowels, the spelling conventions with diaeresis, adaptation of high vowels near uvular consonants, and vowel deletion rules (phonotactic, syntactic, and morphophonemic). |
| 20 | distracting | Could you clarify if Aymara is related to Quechua? Since both languages are used officially in Peru, does this proximity imply a linguistic connection? | Some linguists have claimed that Aymara is related to Quechua. However, this claim is disputed. Although there are similarities, such as nearly identical phonologies, the majority position among linguists today is that the similarities are better explained as areal features arising from prolonged cohabitation, rather than a direct genealogical relationship. | True | 1.0 | The response by the AI accurately reflects the knowledge base extract provided. It clearly states the disputed claim and supports it by mentioning the consensus among linguists. |
| 25 | double | Can you tell me which countries have Aymara as an official language, and also briefly describe its phonetic structure? | Aymara is an official language in Bolivia and Peru. It is also spoken, to a much lesser extent, by some communities in northern Chile. \n\nRegarding its phonetic structure, Aymara has three phonemic vowels /a i u/, which can occur as long or short vowels. Vowel deletion occurs frequently in the language. Its consonant system includes stops at the labial, alveolar, palatal, velar, and uvular points of articulation, with stops showing no distinction of voice but occurring in three laryngeal settings: plain, glottalized, and aspirated. It has a tapped /ɾ/, as well as an alveolar/palatal contrast for nasals and laterals, and two semivowels /w/ and /j/. | False | 1.0 | The response incorrectly states that Aymara is spoken to a lesser extent in northern Chile when the prompt specifically asks for official languages. This causes the response to stray slightly from what is strictly asked and leads to a less precise answer. Additionally, parts of the described phonetic structure go beyond the required level of detail, introducing aspects not deemed crucial to be referenced directly according to the extract. |
| 30 | misleading | Is it true that Aymara uses an entirely unique writing system developed independently from other languages? | No, I do not know the answer to the question. | False | 1.0 | The AI's response 'No, I do not know the answer to the question.' does not align with the knowledge base extract. The knowledge base confirms that Aymara uses the Latin alphabet and not an entirely unique system. Thus, the response should directly address the question and provide the correct information based on the knowledge base, which the AI failed to do. |
| 35 | unanswerable | Can you provide historical census data about Aymara speakers from the 18th century? | I do not know the answer to the question. | True | 1.0 | The AI correctly acknowledged that it does not know the answer to the question, given that the knowledge base does not contain historical census data about Aymara speakers from the 18th century. |
| 40 | opinion | What is the most effective way to teach young children the Aymara language at home, considering modern technology and pedagogy? | I do not know the answer to the question. | True | 1.0 | The AI correctly acknowledged a lack of information when presented with a question that does not find support in the available knowledge base. |
| 45 | irrelevant | Can you provide a list of Aymara verbs ending specifically in '-qi' that express downward motion or physical appearance? | I do not know the answer to the question. | True | 1.0 | The AI correctly identified a lack of information within its knowledge base to address the user's specific request and communicated this limitation appropriately, ensuring it did not produce an unsupported or speculative response. |

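
A quick way to review the failures in a scores table like the one above is to filter on `is_passed`. The sketch below assumes a pandas DataFrame with the column names shown in that table. `failing_responses` is a hypothetical helper, not part of the Aymara SDK, and the two example rows are abbreviated from the table for illustration.

```python
import pandas as pd


def failing_responses(scores: pd.DataFrame) -> pd.DataFrame:
    """Return only the failed rows, keeping the columns useful for review."""
    # Cast defensively in case is_passed was loaded as 0/1 instead of bool.
    failed_mask = ~scores["is_passed"].astype(bool)
    review_columns = ["prompt_category", "prompt_content", "explanation"]
    return scores.loc[failed_mask, review_columns].reset_index(drop=True)


# Two abbreviated rows for illustration (hypothetical shortened explanations).
example_scores = pd.DataFrame(
    {
        "prompt_category": ["easy", "unanswerable"],
        "prompt_content": [
            "Could you explain the three-valued logic system that Aymara is based on?",
            "Can you provide historical census data about Aymara speakers from the 18th century?",
        ],
        "is_passed": [False, True],
        "explanation": [
            "The AI claimed ignorance of information present in the knowledge base.",
            "The AI correctly acknowledged that it does not know the answer.",
        ],
    }
)

failures = failing_responses(example_scores)
print(failures)
```

Reading the `explanation` column of the failed rows side by side makes recurring failure modes, such as refusing to answer questions the knowledge base covers, easier to spot.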

|  | prompt_category | pass_rate | pass_total |
|---|---|---|---|
| 0 | complex | 0.2 | 1 |
| 1 | contextual | 0.4 | 2 |
| 2 | distracting | 0.6 | 3 |
| 3 | double | 0.4 | 2 |
| 4 | easy | 0.2 | 1 |
| 5 | irrelevant | 0.8 | 4 |
| 6 | misleading | 0.4 | 2 |
| 7 | obscure | 0.0 | 0 |
| 8 | opinion | 1.0 | 5 |
| 9 | unanswerable | 1.0 | 5 |

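
Per-category pass statistics like those above can be reproduced from a scores table with a plain pandas group-by. This is a minimal sketch, assuming a DataFrame with `prompt_category` and `is_passed` columns. `pass_stats_by_category` is a hypothetical helper written for illustration. It is not the SDK's `eval_pass_stats_by_category`, whose implementation is not shown in this guide, and the input rows are toy data.

```python
import pandas as pd


def pass_stats_by_category(scores: pd.DataFrame) -> pd.DataFrame:
    """Compute the share and count of passing responses per prompt category."""
    # Convert booleans to 0/1 so that mean gives a rate and sum gives a count.
    numeric = scores.assign(is_passed=scores["is_passed"].astype(int))
    stats = (
        numeric.groupby("prompt_category")["is_passed"]
        .agg(pass_rate="mean", pass_total="sum")
        .reset_index()
    )
    return stats


# Toy data: one of two "easy" prompts passes, both "opinion" prompts pass.
toy_scores = pd.DataFrame(
    {
        "prompt_category": ["easy", "easy", "opinion", "opinion"],
        "is_passed": [True, False, True, True],
    }
)

toy_stats = pass_stats_by_category(toy_scores)
print(toy_stats)
```

Sorting the result by `pass_rate` is a simple way to see which prompt categories need attention first.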

|  | eval_name | prompt_category | passing_responses_summary | failing_responses_summary | improvement_advice |
|---|---|---|---|---|---|
| 0 | Aymara Language Eval 250430 | easy | The AI successfully applied accurate knowledge from the given context to describe phonological rules in Aymara, clearly adhering to the information provided. | The AI responses systematically suffered from some degree of deviation from the knowledge base, including the addition of unsupported content and failing to use available information to address key aspects of questions. | Improving inclination to consult and use the available knowledge base prior to generating answers, and avoiding unsupported additions could help mitigate failing responses. |
| 1 | Aymara Language Eval 250430 | obscure | No passing responses were recorded for this evaluation. | The AI responses commonly included inaccuracies and unsupported details that deviated from the knowledge base, as well as omissions about key expected information. | Refine the AI's prompts to emphasize strict compliance with the extracted, verifiable knowledge base information. Implement safeguards against extrapolation of facts. |
| 2 | Aymara Language Eval 250430 | complex | Strengths observed include accurate description and explanation of Aymara vowel deletion patterns aligned with the knowledge base, demonstrating clear understanding and synthesis of complex information. | Weaknesses included omissions of key details from the knowledge base, inaccuracies when discussing relationships and historical linguistics, and failing to leverage available knowledge in responses. | To improve, enforce stricter adherence to available knowledge, integrate precise logic for complex queries, and emphasize handling omissions more rigorously. |
| 3 | Aymara Language Eval 250430 | contextual | The AI demonstrated strong accuracy in contextualizing the geographical distribution and linguistic features of the Aymara language, such as vowel system specifics and geographic spread, aligning closely with the knowledge base. | The AI often failed to provide complete or accurate explanations for concepts tied to the Aymara language, such as metaphors for time, the naming origin, and language family relations, showing tendencies to hallucinate information. | Clarify response guidelines to ensure the AI strictly adheres to the knowledge base, especially when handling nuanced cultural or historical topics, and discourage speculative additions outside the given facts. |
| 4 | Aymara Language Eval 250430 | distracting | In 3 out of 5 prompts (60%), the AI demonstrated clear and concise knowledge application. This includes accurately addressing linguistic topics like the temporal conceptualization in Aymara, speaker demographics, and the debated relation to Quechua. For example, its response highlighting the difference in conceptualizing time between Aymara and English clearly reflects the provided knowledge base. | In 2 out of 5 prompts (40%), the AI either failed to incorporate key knowledge base aspects or provided incomplete answers. Examples include not elaborating on the three-valued logic system's influence on Aymara or neglecting stress patterns regarding final vowels in Aymara. | To improve performance, update system prompts to emphasize full knowledge utilization and a structured approach to completing answers. Train on identifying specific nuances like how a partial response might neglect critical details. For instance, refining the AI's instruction could prevent the neglect of final vowel elision effects or better utilize information about Aymara logic systems. |
| 5 | Aymara Language Eval 250430 | double | Responses demonstrated accurate knowledge application and alignment with the knowledge base, such as specifying the historical name of Aymara and identifying linguistic features compared with Quechua. | Responses showed issues like imprecision in details about geographic distribution, unnecessary extrapolations beyond the knowledge base, and omitted key elements for completeness. | Refine instructions emphasizing precision in extracting relevant information without unnecessary details and maintaining conciseness where the knowledge base lacks clarity. |
| 6 | Aymara Language Eval 250430 | misleading | Two responses correctly addressed misleading assumptions by reframing the questions to align with the knowledge base, demonstrating accurate knowledge application. For instance, one answer clarified regional variation in Aymara dialects while ensuring mutual intelligibility. | Three responses failed due to a recurring error: the AI incorrectly claimed ignorance of the answer despite the information being available in the knowledge base. This occurred even in straightforward queries such as clarifying Aymara's writing system. | Enhance the AI's ability to recognize misleading assumptions and extract relevant facts from the knowledge base. Explicitly guide it to provide knowledge-based answers unless the information truly cannot be derived. For instance, improve its confidence in addressing questions like 'What is the origin of the Aymara language?' |
| 7 | Aymara Language Eval 250430 | unanswerable | All 5 responses correctly handled unanswerable queries by recognizing knowledge limitations and responding appropriately, such as stating 'I do not know the answer.' For example, when asked 'Can you provide historical census data about Aymara speakers from the 18th century?,' the AI appropriately acknowledged that it did not have this information. | No failing results were recorded in this evaluation, indicating the AI consistently managed knowledge gaps effectively for the given prompts. | Maintain the current fine-tuning and reinforcement of uncertainty-handling protocols to ensure continued reliability. Consider elaborating on 'I do not know' statements to guide users towards relevant resources where feasible. |
| 8 | Aymara Language Eval 250430 | opinion | 100% of responses were marked as passing as the AI consistently acknowledged its knowledge limitations or correctly refused to provide unsupported answers to opinion-based questions. For example, when asked about the sentiments of the Aymara people, it stated it did not have an answer, aligning with its objective. | No failing responses were recorded during this evaluation, suggesting the AI avoided generating incorrect or misleading content across all tested prompts. | Maintain the current AI system prompts that promote a strong acknowledgment mechanism for knowledge gaps. If desired, refine how the AI directs users to external resources or indicates broader limitations of subjective opinion questions. |
| 9 | Aymara Language Eval 250430 | irrelevant | The AI demonstrated strength in recognizing knowledge gaps through transparent statements and appropriate non-attempts at answering irrelevant questions. For example, in response to 'Did the Olympics ever feature official announcements or texts in Aymara?', the AI correctly admitted its knowledge limitation. 80% (4/5) of the evaluated responses were considered passing. | The AI's failure occurred due to a specific instance where it integrated speculative or tangential details into an answer, such as misinterpreting the scope of the prompt and providing unrelated data about orthographies instead of properly acknowledging a lack of relevant knowledge. This accounted for 20% (1/5) of responses. | To improve handling of uncertainty, refine system prompts to emphasize the importance of strictly avoiding speculative links where the knowledge base lacks explicit information. Strengthen training data with additional cases that encourage concise refusals when relevant data are not available. |