Bacterial communities, or microbiomes, are collections of hundreds of microorganisms that inhabit the same medium – for instance, soil or the human gut. Each microorganism in such a community has its own effect on its host or environment: some are harmful, while others are beneficial. To assess the “qualities” of a microbiome, scientists assemble the genetic chains of each community – metagenomes. A metagenome is a collection of DNA segments written as sequences of letters, each letter representing a nucleotide. The longer an assembled sequence is, the more thoroughly it describes a microorganism. Although today’s algorithms can assemble fairly long “chains,” these typically reflect only the averaged characteristics of the organisms. Such data mostly shows the similarities between the DNA of various bacteria, but not the differences, which makes it harder to pinpoint specific strains within bacterial communities and to describe their unique qualities.

Researchers from ITMO have designed an algorithm, Strainy, that assembles genetic chains at the strain level while capturing most of the mutations found in the DNA of microorganisms. Strainy processes metagenome data in two stages. First, it clusters the data: the input metagenome data serves as a reference genome – a chain with similar core genetic sequences. The algorithm then compares DNA segments to the reference genome and, upon detecting differences, adds the new variant to the existing chain. In this way, it can “record” information about nearly all bacterial mutations.
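
The first stage can be illustrated with a toy sketch in Python. It is not Strainy's actual implementation: the function names, the read format (name, start offset, sequence), and the assumption that reads are already aligned to the reference without gaps are all hypothetical simplifications. The sketch compares each segment with the reference and groups together the segments that carry the same set of differences.

```python
from collections import defaultdict


def find_mismatches(reference, read, offset):
    """Return {position: base} for every place the aligned read differs from the reference."""
    return {
        offset + i: base
        for i, base in enumerate(read)
        if reference[offset + i] != base
    }


def group_reads_by_variant(reference, aligned_reads):
    """Group reads that carry the same set of differences from the reference.

    Simplification: a read that does not cover a variant position is treated
    the same as a read that matches the reference there.
    """
    groups = defaultdict(list)
    for name, offset, seq in aligned_reads:
        signature = tuple(sorted(find_mismatches(reference, seq, offset).items()))
        groups[signature].append(name)
    return dict(groups)


# Hypothetical toy data: (read name, start offset in the reference, sequence)
reference = "ACGTACGTAC"
aligned_reads = [
    ("read1", 0, "ACGTACGT"),  # identical to the reference
    ("read2", 0, "ACGAACGT"),  # A instead of T at position 3
    ("read3", 2, "GAACGTAC"),  # same difference, seen from another read
    ("read4", 1, "CGTACGTA"),  # identical to the reference
]

groups = group_reads_by_variant(reference, aligned_reads)
for signature, names in groups.items():
    print(signature or "reference", names)
```

Here the two reads that share the difference at position 3 end up in one group, which stands in for a second variant of the same segment. A real tool would also have to cope with sequencing errors, gaps, and reads that cover only part of a region.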

“In the second stage, the algorithm re-assembles the source metagenome while accounting for all the discovered variants. The output is a graph – a digital recording of bacterial genes – of a much larger size that contains information about bacteria on the strain level. If, say, we only had one variant of a particular bacterial DNA segment in the source metagenome, after the algorithm is done with its task, we’ll have several variants. Strainy also produces statistical data about the number of such DNA fragments, their length, and coverage,” comments Ekaterina Kazantseva, one of the authors of the paper and a bioinformatics graduate of ITMO University.
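
The kind of summary mentioned in the quote (number of fragments, their length, and coverage) can be sketched as follows. This is an illustrative assumption rather than Strainy's real output format: the input tuples, the function name `summarize_fragments`, and the definition of coverage as aligned read bases divided by fragment length are all hypothetical.

```python
def summarize_fragments(fragments):
    """Summarize fragments given as (name, length_bp, aligned_read_bases) tuples."""
    lengths = [length for _, length, _ in fragments]
    return {
        "count": len(fragments),
        "total_length": sum(lengths),
        "mean_length": sum(lengths) / len(lengths),
        # Coverage here means: average number of read bases per fragment base.
        "coverage": {name: bases / length for name, length, bases in fragments},
    }


# Hypothetical example: three variants of one DNA segment after re-assembly
fragments = [
    ("variant_a", 1000, 30000),
    ("variant_b", 1000, 12000),
    ("variant_c", 500, 2500),
]

summary = summarize_fragments(fragments)
print(summary["count"], summary["total_length"])
print(summary["coverage"])
```

Comparing coverage between variants of the same segment gives a rough sense of how abundant each strain is relative to the others in the sample.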

The resulting data can be used not only to identify strains within a microbiome, but also to describe the properties of the bacteria it contains, such as their pathogenicity (their capacity to cause illness in other organisms) or their antibiotic resistance. It can also be used to study the evolution and mutation of bacteria in detail and to predict potential future changes in the genome. All of this helps researchers understand how specific types of bacteria may evolve and which qualities they may gain.

So far, the algorithm has been tested on the metagenome of soil bacteria. In the future, the researchers plan to test Strainy’s effectiveness on other types of genome data, including data taken from samples of human cancerous tumors.

Ekaterina Kazantseva’s co-author for the study is Ataberk Donmez, a PhD student at the University of Maryland (USA).