Testing of LLM models — A challenging frontier

7 min readFeb 13, 2024

Software testers bring a wealth of experience in the testing domain, including UI, API, and mobile applications. However, they are consistently confronted with the ongoing challenge of testing emerging technologies. One such challenge involves testing applications driven by Generative AI and Large Language Models (LLMs).

The principal challenge encountered in testing applications powered by Large Language Models (LLMs) arises from the non-deterministic nature of their output results. Unlike conventional applications where anticipated outcomes can be reliably predicted, LLM-based applications generate diverse responses even when presented with identical inputs.

Traditionally, testers have not extensively considered testing costs in their routine activities. Running regression tests, conducting exploratory testing, or performing routine sanity checks multiple times typically incurs minimal cost implications. However, testing non-open source LLM-based applications presents a distinctive cost scenario. The testing cost is directly proportional to the number of tokens employed. In simpler terms, an escalation in query volume results in higher testing costs. This unique cost structure introduces complexity to the overall testing process for LLM-based applications.

Testing Scope and Approach:

The testing scope of an LLM-based application is intricately linked to the efficacy of the underlying model, and the quality of this model plays a pivotal role in achieving optimal results. Models can be broadly categorized into two types:

Proprietary/Generic Models: The first type is proprietary, represented by models like OpenAI’s GPT-3.5 . These models operate as black boxes for users but demonstrate exceptional performance in handling intricate tasks.
Model Training and Fine-Tuning: This category involves open-source models, providing users with the potential to train them using customized data in addition to the existing language model.

Some of the Key Test metrics for Large Language Model (LLM) model testing are crucial measures employed to evaluate the performance, accuracy, and effectiveness of such models. These metrics provide a comprehensive understanding of how well the LLM performs in various aspects. Here are some key test metrics commonly used for LLM model testing:

Perplexity (PPL): Perplexity measures how well the language model predicts a sample or sequence of words. A lower perplexity indicates better performance as it signifies that the model is more certain and accurate in predicting the next word in a sequence.
BLEU (Bilingual Evaluation Understudy): BLEU assesses the quality of machine-generated text by comparing it to a reference or human-generated text. BLEU calculates the precision and recall of n-grams (consecutive sequences of n words) and provides a score ranging from 0 to 1. Higher BLEU scores indicate better agreement with human-generated references.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE evaluates the quality of summaries or generated text by comparing it against a set of reference summaries. ROUGE measures precision, recall, and F1 score for overlapping n-grams between the generated and reference text. It is particularly useful for evaluating the informativeness of the generated content.
Word Embedding Metrics (e.g., Word2Vec, GloVe): Measures the semantic similarity between words based on their contextual embeddings in a high-dimensional space. Word embedding metrics assess how well the LLM captures semantic relationships between words. This is crucial for understanding the model’s ability to represent and generate meaningful content.
Contextual Embedding Metrics (e.g., BERTScore): Evaluate the quality of generated text by considering contextual embeddings, especially useful for assessing sentence and document-level coherence. Contextual embedding metrics account for the contextual understanding of words within sentences, providing a more nuanced evaluation of the LLM’s performance.

When employing the above metrics for LLM model testing, it’s essential to tailor the evaluation based on the specific use case, target audience, and application requirements. Additionally, considering a combination of these metrics provides a more comprehensive assessment of the LLM’s overall performance.

Testing a Large Language Model (LLM) involves the utilization of well-established metrics such as Perplexity (PPL), BLEU, and ROUGE, commonly employed for the assessment of traditional language models. However, the direct applicability of these metrics may be limited when evaluating applications deployed in production that leverage LLM as third-party services. In many scenarios, testers are not specifically tasked with the direct evaluation of the language model itself; instead, the focus is on ensuring that the application aligns with and fulfills our specified requirements.

Some of the major testing types to be performed on validating applications leverage LLM are:

Functional testing:

To assess the correctness of LLM-based applications, the test cases should encompass fundamental use cases tailored to the application, while also taking into account potential user behaviors. In simpler terms, test cases should reflect what users intend to accomplish. Customized applications often include embedding domain-specific terminologies and knowledge, within an existing baseline LLM. The test cases should primarily address domain-specific scenarios. Below are some of the Keep areas to verify for check application is meeting its functional requirement :

Based on Generative AI’s non-deterministic feature, we can’t do the exact match for the test results. However, we need to measure the accuracy of the response which can be done using different evaluator patterns some of which include below LangChain evaluators:

String Evaluators: These evaluators examine the forecasted string based on a given input, typically involving a comparison with a reference string.
Trajectory Evaluators: These are employed to assess the complete course of agent actions.
Comparison Evaluators: These evaluators are specifically designed for comparing predictions generated in two separate runs using a common input.

Accuracy:

Objective: Evaluate the correctness and accuracy of LLM responses despite the non-deterministic nature of Generative AI.
Challenge: Lack of exact match due to non-deterministic features.
Importance: Crucial to ensure reliable and trustworthy responses.
Approach: Utilize evaluators such as LangChain, which provide patterns for measuring correctness and accuracy. However, we need to measure the correctness and accuracy of the response which can be done using different evaluator patterns some of which include below LangChain evaluators:
String Evaluators: These evaluators examine the forecasted string based on a given input, typically involving a comparison with a reference string.
Trajectory Evaluators: These are employed to assess the complete course of agent actions.
Comparison Evaluators: These evaluators are specifically designed for comparing predictions generated in two separate runs using a common input.

Factual Correctness:

Objective: Ensure that information generated by LLM applications aligns with real-world facts.
Challenge: Hallucination, a weakness of LLMs, can result in inaccurate information.
Importance: Critical, especially in customized applications where information must be sourced from a reliable domain knowledge base to mitigate the impact of hallucination.
Approach: Rigorous validation against known factual information, leveraging external databases or reliable sources to verify accuracy.

Semantic and Format Correctness of Responses:

Objective: Confirm that the output from LLM applications adheres to the correct format, demonstrating proper grammar, spelling, and syntax.
Importance: Crucial for user comprehension and overall user experience.
Challenges: Ensuring that responses are not only syntactically correct but also semantically meaningful.
Approach: Employing automated tools for grammar and syntax checking, alongside manual review to assess semantic correctness. Feedback from end-users can also be valuable in refining responses.

Completeness of Responses:

Objective: Ensure that LLM-generated responses include all necessary and essential content, leaving no significant information missing.
Importance: Essential to deliver comprehensive and valuable information to users.
Challenges: Balancing completeness without overwhelming users with excessive information.
Approach: Defining criteria for essential content, employing test cases that cover a range of scenarios, and iterative refinement based on user feedback.

Readability of Responses:

Objective: Confirm that LLM-generated responses are logically and linguistically coherent, easily understood, and adhere to the expected format.
Importance: Critical for user satisfaction and effective communication.
Challenges: Assessing tone of response, context awareness, and overall coherence in responses.
Approach: Utilizing readability metrics, conducting user surveys for feedback on comprehension and tone, and refining the model based on readability assessments.

Non-functional testing:

Non-functional testing is essential for validating and verifying the aspects of LLM models that go beyond their core functionality. It ensures that these models are not only accurate and effective but also performant, reliable, secure, and compliant with relevant standards and regulations. Some of the key non-functional testing which we need to focus on are:

Performance Testing:

Performance evaluations for applications leveraging Large Language Models (LLM) comprise two key facets:

Processing Speed:

Definition: The time taken by a language model to generate a response, contingent on the efficiency of the underlying infrastructure supporting the language model.
Significance: This metric gauges the model’s computational efficiency, influencing its real-time responsiveness.
Considerations: Assessing and optimizing the infrastructure is pivotal to enhancing processing speed.

Response Speed:

Definition: The duration from user input to receiving the model-generated response. This includes processing speed along with considerations for network latency and potential delays.
Significance: Reflects the end-to-end experience for users, encompassing not just model efficiency but also external factors affecting response time.
Considerations: Addressing network latency and identifying potential bottlenecks contribute to improving overall response speed.

Moreover, Scalability serves as a critical metric, evaluating an application’s capacity to handle increased traffic and interactions. Similar to performance and load testing for conventional applications, ensuring effective functionality under specific workloads — such as heightened user engagement and increased data volumes — is imperative. Scalability testing is integral to identifying potential limitations and ensuring the application’s robust performance in dynamic and evolving usage scenarios.

Security Testing:

The robustness of LLM-based applications ensures not only their resistance against security attacks but also guarantees the preservation of data privacy and system integrity. Security controls should be reviewed to ensure the robustness and security of applications powered by LLM. Compliance with legal and regulatory requirements, ethics and privacy standards, and fair management of harmful biases is essential for ensuring the integrity of LLM-based applications.

Conclusion:

Testing Apps based on Language Model (LLM) models is essential, paralleling the testing procedures applied to tradtional software. The approach while not replacing benchmarks, complements them to provide a comprehensive evaluation. The benefits of testing are highlighted in contrast to the challenges posed. Benchmarking generation tasks with multiple right answers is arduous, and relying solely on benchmarks for such tasks might not instill confidence. Getting human conclusions on model output is labor-intensive and becomes less useful when iterating on the model.

Ultimately, we advocates that the time invested in testing LLM models and their behavior is a good investment as it not only addresses immediate issues but also contributes to long-term model improvement and robustness.