How do plagiarism detection tools work?

Dmitriy: Most modern text-plagiarism detection systems are built around dedicated algorithms that can detect copied content, retrieve its source, and estimate how much of the text was reused. Special databases and evaluation systems are used to test new plagiarism detection algorithms and determine which of them perform best.
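To make this concrete, here is a minimal sketch of the basic idea behind such algorithms, assuming a toy word n-gram comparison rather than any specific production system; all names and the example texts are illustrative assumptions. It flags copied content, picks the likeliest source, and estimates how much of the suspicious text was reused.

```python
# Minimal sketch: compare word n-gram "fingerprints" of a suspicious text
# against candidate sources, pick the likeliest source, and estimate the
# share of reused text. Illustrative only, not any particular system.

def ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def check_plagiarism(suspicious, sources, n=5):
    susp = ngrams(suspicious, n)
    best_source, best_overlap = None, set()
    for name, src_text in sources.items():
        overlap = susp & ngrams(src_text, n)
        if len(overlap) > len(best_overlap):
            best_source, best_overlap = name, overlap
    reused_share = len(best_overlap) / max(len(susp), 1)
    return best_source, reused_share

sources = {
    "doc_a": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc_b": "completely unrelated text about weather patterns in the north",
}
suspicious = "he wrote that the quick brown fox jumps over the lazy dog yesterday"
print(check_plagiarism(suspicious, sources))  # likeliest source and reused share
```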

For example, the PAN benchmark divides the task of plagiarism detection into two subtasks: source retrieval and text alignment. When algorithms are tested on it, the evaluation system assesses how accurately they identify reused passages between documents.
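As a rough illustration of what such an evaluation measures, the sketch below scores a text-alignment result with character-level precision and recall against gold annotations. Real PAN-style metrics are more involved (for instance, they also penalize fragmented detections), so the spans and function names here are assumptions made for the example only.

```python
# Rough sketch of scoring a text-alignment result: reused passages are
# treated as character spans and compared to gold annotations with
# precision and recall over characters. Illustrative assumption only.

def covered_chars(spans):
    chars = set()
    for start, end in spans:          # half-open [start, end) character spans
        chars.update(range(start, end))
    return chars

def precision_recall(detected_spans, gold_spans):
    detected, gold = covered_chars(detected_spans), covered_chars(gold_spans)
    overlap = len(detected & gold)
    precision = overlap / len(detected) if detected else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall

gold = [(100, 300), (500, 650)]        # annotated reused passages
detected = [(120, 310), (505, 640)]    # what an algorithm reported
print(precision_recall(detected, gold))
```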

How did you realize that evaluation systems needed improving?

Anton: Before we started developing evaluation systems, we worked a lot with plagiarism detection algorithms themselves. It all started with the Hack the Plagiarizer! Hackathon, which was held at the AINL Conference at ITMO University in 2017. At the hackathon, we were given two weeks to solve the text alignment problem: you are given two documents, one of which contains a number of passages copied from the other, and the task is to find all of these passages (sentences or paragraphs).
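As a hedged sketch of this setup (not the solution built at the hackathon), one could split both documents into sentences and report pairs whose word overlap exceeds a threshold; the threshold and all names below are illustrative assumptions.

```python
# Bare-bones text alignment sketch: report sentence pairs from the
# suspicious and source documents whose word-level Jaccard overlap
# exceeds a threshold. Illustrative only.

import re

def sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def align(suspicious, source, threshold=0.6):
    pairs = []
    for i, s in enumerate(sentences(suspicious)):
        for j, t in enumerate(sentences(source)):
            if jaccard(s, t) >= threshold:
                pairs.append((i, j, s))   # sentence i reuses source sentence j
    return pairs

source = "Plagiarism detection is an active research area. Many systems rely on n-gram matching. Evaluation is done on shared corpora."
suspicious = "The weather was fine. Many systems rely on n-gram matching. We went home."
print(align(suspicious, source))
```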

We quickly came up with a working algorithm, which inspired us to take part in the evaluation track of the Dialogue Conference, where similar tasks were offered. Success was not long in coming, and we ended up publishing a paper at Dialogue 2018.

We decided not to rest on our oars and tried our hand at more prestigious international competitions in this field. Held since 2013, these competitions feature several datasets that differ in size and in the complexity of the reused passages. After the first series of experiments, we noticed that we achieved our best result on the most difficult dataset, but we didn't pay it much attention. Instead, we decided to focus on writing one more article. However, as often happens with significant discoveries, when it was all done and ready for publication, we realized it was all wrong. We owed our good results not to our algorithm but to an imperfection in the evaluation system, whose metrics were far from ideal, especially for the most complex datasets. Anyone could exploit this imperfection and get an excellent score.
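The exact imperfection isn't detailed here, but as a purely illustrative assumption of how a flaw of this kind can be exploited, consider an evaluation that rewards character-level recall without penalizing imprecise detections strongly enough: a degenerate strategy that marks the whole document as reused gets perfect recall for free.

```python
# Illustrative assumption of the general failure mode, not the specific
# flaw found by the team: a recall-only score lets a degenerate
# "mark everything as reused" strategy look competitive.

def char_recall(detected_spans, gold_spans):
    detected = {c for s, e in detected_spans for c in range(s, e)}
    gold = {c for s, e in gold_spans for c in range(s, e)}
    return len(detected & gold) / len(gold) if gold else 0.0

doc_length = 10_000
gold = [(1200, 1500), (4000, 4300)]          # hard-to-find reused passages

honest_detector = [(1250, 1480)]              # finds part of one passage
mark_everything = [(0, doc_length)]           # degenerate strategy

print(char_recall(honest_detector, gold))     # ~0.38
print(char_recall(mark_everything, gold))     # 1.0 -- perfect recall for free
```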

We used the remaining 48 hours before the deadline to modify the metrics, evaluate the quality of the existing state-of-the-art algorithms, and rewrite the text of our article. During those two days, we got only about five hours of sleep, but it was worth it: after we had submitted the article to the ACL conference and caught up on sleep, we rechecked all the modifications with a clear head, and everything held up.

What are your plans for the future?

Anton: First of all, we’d like to describe the results of our work in more detail. In addition to that, we would like to evaluate more algorithms using our new system.

In general, I would now like to continue my research in fields such as interpretable machine learning and natural language processing. Neural networks are still "black boxes" for many researchers: people know that they work but don't understand how, or what modifications are needed to move the technology forward. In fields such as banking and medicine, this is a major obstacle to further development.

The replication crisis is another global problem in modern science, and it is aggravated by the growing complexity of machine learning models: the more complex a model is, the harder it is to reproduce its architecture from the description given in a scientific article.

However, with sufficiently universal and reliable methods, interpreting a neural network's predictions could become as accessible a research tool as statistical significance tests. Such methods are being actively developed now, and I would like to contribute to this field during my PhD studies.