Turning large volumes of corporate documents into databases is only possible with tools that can recognize both text and its structure. This is typically done with algorithms that process documents and identify their elements, such as headers and paragraphs. However, existing tools have their drawbacks: for instance, the optical character recognition engine Tesseract doesn’t detect text structure, while LLM-based solutions (such as those from OpenAI) struggle with longer documents and “get lost” in their context and structure. Moreover, sending corporate documents to third-party services carries the risk of data leaks.
A team from ITMO’s Institute for Artificial Intelligence has developed a tool that avoids the downsides of these popular solutions. Their new data processing library, DocuMentor, detects and extracts the hierarchical structure of a document with high accuracy, identifying elements such as headers, tables, images, and formulas. The library handles popular document formats, such as PDF, DOCX, and Markdown, and the developers plan to expand this list in the future.
The library transforms documents into machine-readable JSON files (a text-based data format derived from JavaScript) that contain information on document structure: headers, paragraphs, tables, and other elements. Such “marked-up” documents can be used by search systems, for instance those behind corporate chat assistants.
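To make the idea of a structured JSON document more concrete, here is a minimal Python sketch. It is not DocuMentor's actual output or API: the field names (`type`, `level`, `text`, `children`) and the helper `flatten` are hypothetical and chosen only for illustration. The sketch shows how a hierarchy of headers and nested elements might be turned into chunks that a search system could index, with each chunk labeled by the headers it sits under.

```python
import json

# Hypothetical structure: these field names are illustrative assumptions,
# not DocuMentor's real schema.
document = {
    "type": "document",
    "children": [
        {
            "type": "header", "level": 1, "text": "Annual report",
            "children": [
                {"type": "paragraph", "text": "Revenue grew in 2024."},
                {
                    "type": "header", "level": 2, "text": "Risks",
                    "children": [
                        {"type": "table", "text": "Risk | Impact"},
                    ],
                },
            ],
        }
    ],
}


def flatten(node, path=()):
    """Yield (heading_path, element_type, text) for every non-header element."""
    if node.get("type") == "header":
        # Headers extend the path that labels everything nested under them.
        path = path + (node["text"],)
    elif "text" in node:
        yield " > ".join(path), node["type"], node["text"]
    for child in node.get("children", []):
        yield from flatten(child, path)


chunks = list(flatten(document))
print(json.dumps(chunks, indent=2))
```

Keeping the header path alongside each chunk is one plausible way for a chat assistant to tell which section a retrieved paragraph or table came from.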
At the core of the library is dots OCR, a vision-language model (VLM) for optical character recognition. The team enhanced the solution with tools for automated collection and structuring of DOCX files, as well as for processing the text layer in PDFs. They also added algorithms that improve recognition quality at every stage: extracting individual elements within the document, recognizing headers of different levels, identifying styles and font sizes, and correcting errors made by dots OCR when identifying the document structure.
The team then compared DocuMentor’s accuracy in document processing and structure analysis with that of the popular alternatives Dedoc and Marker. DocuMentor has an error rate of 1.3% in character recognition and 2.5% in word recognition. That is 6-10 times fewer mistakes than the alternatives when analyzing texts and 2-6 times fewer when analyzing scanned documents in PDF format. Moreover, DocuMentor identifies the locations of elements in PDF files with high accuracy: 98% for regular text-based PDFs and 94% for scanned documents.
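The article does not say exactly how these error rates were measured. A common convention, assumed here, is the character error rate (CER) and word error rate (WER): the edit distance between the recognized text and a reference transcript, divided by the length of the reference. The Python sketch below illustrates that convention; the function names and sample strings are illustrative and not taken from the DocuMentor evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "the quick brown fox"
hypothesis = "the quick brovn fox"  # one misrecognized character
print(f"CER = {cer(reference, hypothesis):.3f}, WER = {wer(reference, hypothesis):.3f}")
```

In this toy example a single wrong character out of 19 spoils one of four words, which is why word-level error rates usually come out higher than character-level ones, as they do in the figures reported above.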
The library can be integrated into any product for document structure recognition or analysis. For instance, the developers plan to integrate it into the ProAGI multiagent software development system as one of the frameworks its agents can use to process PDF files.
“One advantage of our library is that this is the first time that we’ve built a step-by-step algorithm for extracting the largest amount of information on a document’s structure with a minimal number of errors. Our library is of interest not only to researchers but to companies that can use it for in-house document processing. Understanding document structure is a key factor in developing search systems and creating databases that’ll help integrate AI into the workflow,” shares Mikhail Kovalchuk, one of the developers and an engineer at ITMO’s Institute for Artificial Intelligence.
The library is available under the open-source BSD 3-Clause license.
