More than half of Russian companies use AI in their operations. To keep up with market changes, these systems must be regularly improved and retrained on fresh data. However, it can be hard to recognize when a model no longer performs with the desired accuracy and needs an update. Moreover, assessing AI reliability requires significant resources: hundreds of megabytes of data, dozens of work hours, and seasoned professionals. Many seemingly promising solutions fail empirical tests because their customers lack the expertise to objectively evaluate their quality.
AI product testing, however, can be facilitated with specialized software known as digital testing grounds. Such programs run virtual tests on a model, simulating various operational settings in order to assess its accuracy, selectivity, stability, and other properties critical for a comprehensive analysis. Though dozens of such systems, including domestic ones, have appeared on the market in recent years, their certification and availability for small and medium businesses remain a challenge. It also does not help that most existing testing grounds are narrowly specialized and tailored to the needs of one particular field, for instance, medicine.
To that end, ITMO developers created Polioks – a digital testing ground for AI systems that offers broader functionality than its alternatives. The program evaluates an AI system’s efficiency against several criteria simultaneously and compares it with similar solutions on the market. Another advantage is its ease of use: even an untrained user can launch the application and analyze the resulting report.
The application works as follows: first, specialists manually or automatically collect AI testing scenarios on the platform, specifying their applied tasks, operating conditions, and expected accuracy. Then the built-in AI system generates synthetic data to test the model, followed by a series of pre-set automatic tests. The final stage is the analysis of the test results with an ML model and traditional statistical methods to draw objective conclusions about the effectiveness of the new AI technology.
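The pipeline described above can be illustrated with a minimal sketch. The scenario format, the synthetic-data generator, and the toy model here are illustrative assumptions, not Polioks internals:

```python
import random

def generate_synthetic_data(scenario, n=200, seed=0):
    """Toy stand-in for the platform's synthetic-data generator.

    Draws feature values around the scenario's operating point and
    flips labels with probability = scenario difficulty to simulate
    noisy operating conditions.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(scenario["operating_point"], 1.0)
        true_label = 1 if x > scenario["operating_point"] else 0
        label = true_label if rng.random() > scenario["difficulty"] else 1 - true_label
        data.append((x, label))
    return data

def run_tests(model, scenarios):
    """Run every pre-set scenario and report accuracy per scenario."""
    report = {}
    for name, scenario in scenarios.items():
        data = generate_synthetic_data(scenario)
        correct = sum(1 for x, y in data if model(x, scenario) == y)
        report[name] = correct / len(data)
    return report

# A trivial threshold "model" standing in for the system under test
def model(x, scenario):
    return 1 if x > scenario["operating_point"] else 0

# Each scenario bundles operating conditions and expected accuracy
scenarios = {
    "nominal":  {"operating_point": 0.0, "difficulty": 0.05, "expected_accuracy": 0.9},
    "degraded": {"operating_point": 0.0, "difficulty": 0.30, "expected_accuracy": 0.6},
}

report = run_tests(model, scenarios)
# Final stage: compare measured accuracy against expectations
verdict = {name: report[name] >= scenarios[name]["expected_accuracy"]
           for name in scenarios}
```

In a real pipeline the verdict stage would use an ML model and statistical tests rather than a single threshold comparison, but the scenario-generate-test-analyze structure is the same.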
“What we came up with is a convenient and user-friendly tool that doesn't require users to do any coding or download anything; instead, all you need to do is upload your data and model file into the system. Moreover, our solution combines the best practices for AI performance analysis that simulate “bad” scenarios and evaluate a system’s performance in those conditions. The software also allows for testing in extreme conditions. We increase distortion or decrease the amount of input data to the level where the system starts to show results below the quality metrics. It essentially gives us an analysis of the model's applicability limits. Additionally, the program allows specialists to compare their models with similar ones from open-source libraries or those created with ML on the testing ground (e.g., with the framework Fedot – another project by ITMO developers). This is a critical criterion for model assessment,” highlights Sergey Ivanov, a researcher at ITMO’s Research Center “Strong AI in Industry.”

Sergey Ivanov. Photo by Dmitry Grigoryev / ITMO NEWS
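The stress-testing idea Ivanov describes, increasing input distortion until a model falls below its quality metric, can be sketched roughly as follows. The noise model, threshold, and toy classifier are assumptions for illustration:

```python
import random

def accuracy_under_noise(model, data, noise_std, seed=0):
    """Evaluate accuracy after adding Gaussian noise to each input."""
    rng = random.Random(seed)
    correct = sum(
        1 for x, y in data
        if model(x + rng.gauss(0.0, noise_std)) == y
    )
    return correct / len(data)

def find_applicability_limit(model, data, threshold=0.8, step=0.1, max_noise=5.0):
    """Increase distortion until accuracy drops below the threshold.

    Returns the largest tested noise level at which the model still
    meets the quality metric, i.e. an estimate of its applicability
    limit.
    """
    noise = 0.0
    last_ok = None
    while noise <= max_noise:
        if accuracy_under_noise(model, data, noise) >= threshold:
            last_ok = noise
        else:
            break
        noise += step
    return last_ok

# Toy setup: a threshold classifier on clearly separated data
data = [(x / 10.0, 1 if x >= 0 else 0) for x in range(-50, 50)]
model = lambda x: 1 if x >= 0 else 0

limit = find_applicability_limit(model, data, threshold=0.9)
```

The same loop structure applies to the other distortion Ivanov mentions, shrinking the amount of input data, by subsampling `data` instead of adding noise.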
As a general rule, the testing of AI systems rarely includes more than two or three accuracy indicators, whereas Polioks offers a detailed result: a text report with diagrams, graphs, and other visuals reviewing a system’s performance. The materials contain dozens of accuracy indicators calculated under different conditions, the principles of the model’s operation, and the numerical values of the GOST-specified characteristics needed for system certification. This data can be used not only to assess the efficacy of new models but also to optimize further training of existing ones. The digital testing ground will help users conduct regular virtual tests to confirm the declared characteristics of their AI system and, if necessary, contact the developers for an update.
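A report covering many indicators rather than one or two might be assembled along these lines. The metric set below is a small illustrative subset, not the GOST-specified list:

```python
def classification_report(y_true, y_pred):
    """Compute a dictionary of accuracy indicators from one test run.

    All of these are derived from the confusion matrix; a full report
    would repeat them across many operating conditions.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0  # selectivity
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Example: predictions from one test run on eight samples
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
report = classification_report(y_true, y_pred)
```

Repeating this computation under each distortion level and scenario is what turns a handful of numbers into the dozens of condition-specific indicators the report contains.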
As of now, the system is mostly tailored to tabular data and time series; it also employs specialized methods to assess machine vision models, which allows it to account for such complex factors as fine-tuning and uncertainty in applications. Polioks makes AI system testing faster while maintaining the necessary accuracy levels.
“In the long run, we’re going to adjust our software for working with language models, which are now major players in AI development. Currently, they are evaluated using a set of standard tests, which can’t always indicate their efficiency in real-world applications. Our interest also lies in the quality assessment of large language models for code generation. For now, however, our prime goal is to receive a quality certificate for our software to then integrate it into industry and businesses,” adds Sergey Ivanov.