The popularity of AI apps continues to grow. Last year, global users spent around $726 million on AI general assistants, according to Statista. This category has become the highest-grossing among AI apps, ahead of core models in second place and graphics generators and editors in third.
Software development companies like Belitsoft offer AI software development services for businesses across various domains. In this article, Belitsoft’s experts discuss the types of tests for AI applications, along with tools, techniques, and useful tips.
Types of AI Testing
AI applications include a wide variety of solutions, such as facial recognition tools, recommendation systems, clinical diagnosis assistants for doctors, AI-powered security threat prevention, and others. These solutions need different types of testing, as they have different functionality. Usually, QA engineers combine traditional testing with AI-specific methods.
- Functional Testing: It is a set of tests to check the core functionality. Testers verify if the AI algorithms produce the expected results and if the whole app works in accordance with its functional requirements.
- Usability Testing: It ensures that the app is convenient to use. For a conversational AI assistant, testers assess whether the language sounds natural, how the app behaves in unfamiliar contexts, how it handles errors, and whether it understands users with different accents.
- Integration Testing: AI functionality is usually part of a larger architecture. That is why it is important to see how well it integrates with other components, datasets, etc.
- Performance Testing: These tests check the AI model’s overall performance using measures such as response time and throughput. They help QA experts understand whether the model needs improvement and how it behaves under various conditions (see the sketch after this list).
- API Testing: It involves testing the interactions between APIs and AI components: verifying the endpoints, the input and output data formats, and the response structure.
- Security Testing: This type of testing makes sure that data stays safe during AI processing and that system information is not at risk of leakage.
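To illustrate the performance checks above, here is a minimal sketch of a latency and throughput measurement against a model endpoint. The endpoint URL, payload, and threshold are placeholders for illustration only, not part of any specific product.

```python
import statistics
import time

import requests

ENDPOINT = "http://localhost:8000/predict"  # hypothetical inference endpoint
PAYLOAD = {"text": "sample input"}          # placeholder request body
RUNS = 50

latencies = []
for _ in range(RUNS):
    start = time.perf_counter()
    response = requests.post(ENDPOINT, json=PAYLOAD, timeout=10)
    latencies.append(time.perf_counter() - start)
    assert response.status_code == 200

p95 = statistics.quantiles(latencies, n=20)[-1]  # approximate 95th percentile
throughput = RUNS / sum(latencies)               # sequential requests per second

print(f"p95 latency: {p95:.3f}s, throughput: {throughput:.1f} req/s")
assert p95 < 2.0  # example threshold; tune to the project's requirements
```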
Tools for AI Testing
Partnering with a tech firm means they will cover AI application testing with all necessary types of tests and relevant tools and methods. However, if you decide to conduct testing independently, you can benefit from some of the helpful frameworks and libraries:
- TensorFlow Extended is an open-source platform. It helps developers build, deploy, and maintain machine learning (ML) pipelines. TensorFlow Model Analysis (TFMA) and TensorFlow Data Validation (TFDV) are the tools that evaluate ML models and help ensure data quality. They identify biases, search for anomalies in the datasets, and conduct other checks (a data-validation sketch follows this list).
- PyTorch is an open-source ML framework. Developers use it to build, train, and test deep learning models. Methods like k-fold cross-validation and empirical evaluation help assess the performance, accuracy, and reliability of the output before deployment (see the cross-validation sketch after this list).
- DeepMind Lab is another open-source tool. It is available on GitHub. Testers use it as a 3D learning environment to test AI agents, particularly in reinforcement learning scenarios.
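As an illustration of the TFDV checks mentioned above, here is a minimal sketch that infers a schema from training data and flags anomalies in a new batch of data. The CSV file names are placeholders.

```python
import tensorflow_data_validation as tfdv

# Paths are placeholders for the project's own training and serving data.
train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
schema = tfdv.infer_schema(statistics=train_stats)

new_stats = tfdv.generate_statistics_from_csv(data_location="new_batch.csv")
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)

# Drift, missing features, or out-of-range values show up here.
if anomalies.anomaly_info:
    for feature, info in anomalies.anomaly_info.items():
        print(f"{feature}: {info.description}")
else:
    print("No anomalies detected.")
```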
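And here is a minimal sketch of k-fold evaluation for a PyTorch model, using scikit-learn’s KFold for the splits. The tiny linear model and random data stand in for a real model and dataset.

```python
import torch
from sklearn.model_selection import KFold

# Placeholder data: 200 samples, 10 features, binary labels.
X = torch.randn(200, 10)
y = torch.randint(0, 2, (200,)).float()

accuracies = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = torch.nn.Sequential(torch.nn.Linear(10, 1))  # stand-in model
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = torch.nn.BCEWithLogitsLoss()

    for _ in range(20):  # short training loop for illustration
        optimizer.zero_grad()
        loss = loss_fn(model(X[train_idx]).squeeze(1), y[train_idx])
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        preds = (model(X[val_idx]).squeeze(1) > 0).float()
        accuracies.append((preds == y[val_idx]).float().mean().item())

print(f"Mean accuracy across folds: {sum(accuracies) / len(accuracies):.3f}")
```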
AI Testing Peculiarities
AI-powered apps require testing both their AI components and the integration of the AI part with the non-AI parts of the system. When deciding on a testing strategy, take the following issues into consideration:
- The necessity for sustainable testing. Released non-AI apps require additional testing rounds only when changes or updates are made. AI apps should be tested regularly as their training data is updated, to prevent misleading output and hallucinations. Robust solutions must be able to generate quality responses even in new contexts.
- Addressing non-deterministic behavior. AI apps may generate new output every time you make the same request. That is why, with AI apps, QA is more than just bug fixing. It also involves predicting responses and constraining them so that they meet the quality requirements.
- Data sourcing. Collecting and preparing data for AI applications demands the expertise of testing specialists. Depending on the app, the data may consist of numbers, text, images, sounds, etc.
Techniques for Testing AI Applications
Testing non-deterministic output
With apps that generate images, testers check both what the large language model (LLM) returns and whether the app renders it correctly. For example, testers take a screenshot of the image generated by an AI app and compare it with the “golden master”, i.e., a benchmark. They decide whether the differences between the two images are acceptable.
Another way of using golden master testing is comparing the generated output with the result that was previously acknowledged as good. QA specialists export an object as structured data (SVG, XML) and compare it with a master. This method is convenient when there is a need to extract and test specific coordinates in a canvas API.
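A minimal sketch of a pixel-level golden master comparison with Pillow, assuming a stored benchmark image and a fresh screenshot; the file names and the difference threshold are placeholders.

```python
from PIL import Image, ImageChops, ImageStat

# File names are placeholders for the stored benchmark and the new screenshot.
golden = Image.open("golden_master.png").convert("RGB")
candidate = Image.open("generated_screenshot.png").convert("RGB")
assert golden.size == candidate.size, "images must match in size"

# Pixel-wise difference between the two renders.
diff = ImageChops.difference(golden, candidate)
mean_diff = sum(ImageStat.Stat(diff).mean) / 3  # average difference per channel

# The threshold defines how much visual drift is still acceptable.
THRESHOLD = 5.0
if mean_diff <= THRESHOLD:
    print(f"OK: mean pixel difference {mean_diff:.2f} is within tolerance")
else:
    raise AssertionError(f"Render drifted from golden master: {mean_diff:.2f}")
```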
Allow/deny lists contain words, phrases, and images that testers use to check the prompt filters that keep the system from generating unsafe content.
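A minimal sketch of a deny-list check; generate_response() is a hypothetical wrapper around the application’s prompt filter and model call, and the list entries and refusal marker are placeholders.

```python
DENY_LIST = ["how to build a weapon", "credit card numbers"]  # placeholder entries
REFUSAL_MARKER = "i can't help with that"                     # placeholder refusal text


def generate_response(prompt: str) -> str:
    """Hypothetical wrapper around the app's prompt filter and model call."""
    raise NotImplementedError("wire this to the application under test")


def test_deny_list_is_enforced():
    for prompt in DENY_LIST:
        reply = generate_response(prompt)
        # The filter should refuse rather than produce the requested content.
        assert REFUSAL_MARKER in reply.lower(), f"Filter missed: {prompt!r}"
```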
Using AI oracles to assess the output
For users, the most important thing about an AI app is the quality of its output. Previously, testers relied on exact calculations to check whether an app returned relevant results. However, that is not enough for testing an LLM’s output accuracy, as the model often receives subjective and complex prompts. For example, asking a chatbot to create “elegant presentation visuals for a corporate event” is inherently subjective: different people understand “elegant” differently depending on their taste or cultural context.
To test such functionality, QA engineers might apply an additional method: multiple-choice quizzes. They feed the generated output back to an LLM, which analyzes it and answers a multiple-choice question about it. If the output meets the requirements, the model should be able to pick the single correct answer.
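A minimal sketch of such a multiple-choice oracle, assuming the OpenAI Python client as the judge; the model name, quiz question, and expected answer are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

generated_output = "..."  # the app output under test (placeholder)

quiz = (
    "Here is a generated presentation slide description:\n"
    f"{generated_output}\n\n"
    "Which statement is true?\n"
    "A) The slide contains a title and exactly three bullet points.\n"
    "B) The slide contains no title.\n"
    "C) The slide contains more than five bullet points.\n"
    "Answer with a single letter."
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder judge model
    messages=[{"role": "user", "content": quiz}],
    temperature=0,        # keep the judge as deterministic as possible
)

# "A" is the expected answer if the generated slide meets the requirement.
assert reply.choices[0].message.content.strip().upper().startswith("A")
```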
Self-criticism testing lets the LLM assess its own performance. Testers give the app some input and, after receiving the output, send the same input back together with the response and ask the model to evaluate the result.
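A corresponding self-criticism sketch, again assuming the OpenAI Python client; the prompt, rubric, and passing score are placeholders.

```python
from openai import OpenAI

client = OpenAI()

original_prompt = "Summarize the quarterly report in three sentences."  # placeholder
model_output = "..."  # the response previously returned by the app under test

critique = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder judge model
    messages=[{
        "role": "user",
        "content": (
            f"Prompt: {original_prompt}\n"
            f"Response: {model_output}\n"
            "Rate how well the response fulfils the prompt on a scale of 1-5. "
            "Reply with the number only."
        ),
    }],
    temperature=0,
)

score = int(critique.choices[0].message.content.strip())
assert score >= 4, f"Self-evaluation score too low: {score}"
```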
Addressing AI output randomness
Seeds are values that help to control the randomness of the LLM’s output. Developers use seeds to achieve consistency across multiple test runs. They set the seed to a specific numeric value, which makes the model produce the same output for the same input on every run and helps with testing and debugging.
Another way to control randomness is by lowering the temperature, an LLM hyperparameter. Setting it to a low value makes responses less likely to change and reduces the model’s hallucinations.
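A minimal sketch combining both knobs, assuming the OpenAI Python client, whose chat completions endpoint accepts seed and temperature parameters (seed-based reproducibility is best-effort rather than guaranteed); the model name and prompt are placeholders.

```python
from openai import OpenAI

client = OpenAI()


def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        seed=42,              # fixed seed for (best-effort) repeatable sampling
        temperature=0,        # low temperature keeps responses stable
    )
    return response.choices[0].message.content


# With a fixed seed and zero temperature, repeated runs should rarely differ.
first = ask("List three risks of deploying an untested ML model.")
second = ask("List three risks of deploying an untested ML model.")
print("Outputs match:", first == second)
```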
Tips on Cost Management
Testing AI apps is rather costly and has to be performed on a regular basis. That is why it makes sense to limit the number of tests to what is sufficient to keep the app performing flawlessly. Here are a few ideas on how to keep the budget in check:
- As mentioned above, AI apps usually include functionality that testers can check with traditional tests. Run those tests frequently, with every code update, to catch issues early. Tests that check the AI model itself can run less often (e.g., nightly or weekly), since they are slower and more expensive and usually need rerunning only when the model changes.
- Do not run all the tests with every check-in. Running a random sample of around 20% of the tests is usually enough (a sampling sketch follows this list). This figure is not a universal standard: QA professionals should adjust it to the specific project based on historical defect rates and risks. Over time, the rotating samples will cover the whole functionality of the app.
- When something changes in the application, select the tests that cover the functionality that might be affected by those changes. If you find errors, expand the test suite and continue checking.
- Do not roll out changes to all end users immediately. Use A/B testing to see how the changes affect the system’s performance.
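A minimal sketch of the random-sampling idea from the list above, assuming a pytest-based suite under a tests/ directory; the 20% ratio and the directory name are placeholders to adjust per project.

```python
import random
import sys
from pathlib import Path

import pytest

SAMPLE_RATIO = 0.2  # run roughly 20% of the test files per check-in (placeholder)

test_files = sorted(str(p) for p in Path("tests").rglob("test_*.py"))
sample_size = max(1, int(len(test_files) * SAMPLE_RATIO))
selected = random.sample(test_files, sample_size)

print(f"Running {sample_size} of {len(test_files)} test files:", *selected, sep="\n  ")
sys.exit(pytest.main(selected))
```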
Conclusion
Software development firms like Belitsoft offer the following testing services:
- Manual testing of the edge cases that are challenging to automate
- Automated testing with scalable frameworks, repeatable patterns, and maintainable code
- Outsourced services for companies that need immediate assistance
- Consulting on QA issues, such as flaky test suites or internal refactoring processes
- Full QA process support
Robust test automation allows businesses to achieve greater release confidence, faster onboarding for new engineers, and a significant reduction in manual testing effort—sometimes exceeding 70%. This leads to stable high-volume workflows and minimizes production rollbacks.
About the Author:
Dmitry Baraishuk is a partner and Chief Innovation Officer at the software development company Belitsoft (a Noventiq company). He has been leading a department specializing in custom software development for 20 years. The department has delivered hundreds of successful projects in areas such as healthcare and finance IT consulting, AI software development, application modernization, cloud migration, data analytics implementation, and more for US-based startups and enterprises.