To carry out any chemical study, whether it is the development of a new medicine or a new material, scientists must collect and organize data from thousands of academic papers. AI can simplify this process, but it still struggles with field-specific terms, different types of data (text, tables, or graphs), and the context of chemical research. For AI to improve in this field, researchers need high-quality datasets collected manually from scientific papers.

As a solution, experts from ITMO’s Center for Artificial Intelligence in Chemistry presented ChemX, a massive collection of ten datasets designed to test and advance automated extraction systems for chemical research. To build it, the team manually extracted and cross-checked data from over 1,500 peer-reviewed papers; their analysis showed an error rate of less than 4%. This was confirmed as follows: the authors randomly selected 20% of the data and verified whether it matched the original sources. The project is entirely open-source: all datasets, a detailed manual, and code for experiments are available on Hugging Face and GitHub.
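The spot check described above can be illustrated with a short sketch. This is not the authors' code: the function name, the record fields (`extracted`, `source`), and the toy data are all hypothetical, and the only details taken from the article are the idea of drawing a random 20% sample and comparing it against the original sources.

```python
import random


def spot_check_error_rate(records, matches_source, fraction=0.2, seed=0):
    """Estimate a dataset's error rate from a random sample of its records.

    records        -- list of extracted records (hypothetical structure)
    matches_source -- callable returning True if a record agrees with its source
    fraction       -- share of records to verify (0.2 mirrors the 20% in the article)
    seed           -- fixed seed so the same sample can be re-drawn
    """
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)
    sample_size = max(1, round(len(records) * fraction))
    sample = rng.sample(records, sample_size)  # sampling without replacement
    errors = sum(1 for record in sample if not matches_source(record))
    return errors / sample_size, sample_size


# Toy data (assumption): 1,000 records, every 50th one deliberately wrong.
records = [
    {"id": i, "extracted": i if i % 50 else -1, "source": i}
    for i in range(1000)
]

rate, checked = spot_check_error_rate(
    records, lambda r: r["extracted"] == r["source"]
)
print(f"checked {checked} records, estimated error rate {rate:.1%}")
```

In a real check the comparison would be done by an expert reading the source paper rather than by an equality test, and the estimate would vary with the sample drawn.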

ChemX is a multimodal database, which means it contains various types of data extracted from texts, tables, diagrams, and graphs. The datasets cover a range of fields related to nanomaterials and small molecules: cytotoxicity and antibacterial activity of nanoparticles, properties of nanozymes and magnetic materials, biological activity of benzimidazole and oxazolidinone antibiotics, thermodynamic properties of chelate complexes, photostability of drug molecules and their cocrystals, as well as corneal permeability of ophthalmic drugs. 

“AI is used today to perform a range of chemical tasks, which makes it vital to have a high-quality, reliable data source. We hand-assembled ten specialized datasets, conducted expert testing, and used the datasets to analyze existing methods for extracting data from research papers. The results showed that the available solutions still fall short of what we need in practice and require further development,” notes Anastasia Vepreva, an author of the paper and an engineer at ITMO’s Center for Artificial Intelligence in Chemistry.

Anastasia Vepreva. Photo by ITMO's Center for Artificial Intelligence in Chemistry

Additionally, the developers used their datasets to train and test contemporary LLMs (GPT-4o) and AI agents on automated data extraction. So far, these methods lack precision when extracting numerical parameters and complex structures, such as molecules written in the Simplified Molecular Input Line Entry System (SMILES) notation. Meanwhile, multiagent systems such as nanoMINER, which was developed by the center’s experts for data processing in nanomaterials and nanozymes, perform well in their own field but cannot be extended to other tasks.

The team plans to advance their system in the future.

“Our goal is to develop a more universal, intelligent solution to extract and analyze multimodal scientific data. We strive to create multiagent systems that can efficiently adapt to different datasets within one research area and integrate specialized tools for identifying chemical structures into AI pipelines. This will allow us to improve the accuracy rate when dealing with complex molecular representations and pave the way for applied and reliable automation systems in chemistry,” says Julia Razlivina, an author of the paper and an engineer at ITMO’s Center for Artificial Intelligence in Chemistry.

Julia Razlivina. Photo by Artur Ruslanovich

The study is supported by the national program Priority 2030.