Though the internet may seem like a chaotic, ever-expanding network, scientists found patterns in the way its websites and pages interact with one another as far back as the 1990s.

To describe it as a single system, researchers turned to the webgraph, which captures the direct links – hyperlinks, to be exact – between websites, regardless of their content, design, or theme.

A webgraph is typically depicted as a set of dots joined by lines or arrows. Each vertex (dot) is a website, while the lines, called edges, are the hyperlinks connecting their pages. There could never be a single graph of the entire World Wide Web, given that the number of existing websites and pages runs into the tens of billions; Wikipedia alone consists of around 10 million pages. Nevertheless, such a model, albeit abstract, presents the internet as a natural object with its own laws that can be studied.
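
To make this concrete, here is a minimal sketch (in Python) of how a webgraph can be represented: a dictionary mapping each website to the list of websites it links to. The site names are made up purely for illustration.

```python
# A tiny webgraph: vertices are websites, directed edges are hyperlinks.
# The site names below are invented for illustration.
webgraph = {
    "site-a.example": ["site-b.example", "site-c.example"],  # A links to B and C
    "site-b.example": ["site-a.example"],                    # B links back to A
    "site-c.example": [],                                    # C links to no one
}

num_vertices = len(webgraph)                                # number of websites
num_edges = sum(len(links) for links in webgraph.values())  # number of hyperlinks
print(num_vertices, num_edges)  # 3 vertices, 3 edges
```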

Rule #1 – scale-free

The first such study of the internet was conducted by physicists Albert-László Barabási and Réka Albert. In 1999, they proposed the scale-free network model, which accurately describes the emergence of many naturally occurring networks, such as social communications, economic transactions, biological processes, and, most importantly, the internet. The concept assumes that despite the constant emergence of new sites and the “fading” of old ones, the webgraph itself remains stable, with the number of its vertices and connections growing proportionally by a roughly constant factor of about 10.
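
To see what “scale-free” means in practice, here is a small, purely illustrative simulation of the Barabási-Albert growth process (the parameters – 10,000 sites, two links per newcomer – are arbitrary choices, not figures from the original paper). Each new site attaches its links to existing sites with probability proportional to how many links those sites already have, and a handful of hubs end up with far more connections than a typical site.

```python
import random

def barabasi_albert(num_sites: int, m: int = 2, seed: int = 0) -> list[int]:
    """Grow a network one site at a time; return every site's final degree."""
    rng = random.Random(seed)
    degrees = [0] * num_sites
    tickets = []              # every edge endpoint goes in here, so drawing a random
                              # ticket favours sites that already have many links
    targets = list(range(m))  # the first newcomer links to the seed sites
    for new_site in range(m, num_sites):
        for target in targets:
            degrees[new_site] += 1
            degrees[target] += 1
            tickets.extend([new_site, target])
        # pick m distinct existing sites for the next newcomer,
        # with probability proportional to their current degree
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(tickets))
        targets = list(chosen)
    return degrees

degrees = barabasi_albert(10_000)
print("largest hub:", max(degrees))                         # typically a few hundred links
print("typical site:", sorted(degrees)[len(degrees) // 2])  # stays in the single digits
```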

Rule #2 – six degrees of separation

This gives rise to the so-called concept of six degrees of separation – the idea that any two people in the world, regardless of their status or place of residence, are connected through a chain of no more than six social links. The same applies to websites: a visitor can get from one website to another – say, from ITMO.NEWS to the website of the Moscow Zoo – in no more than six clicks on hyperlinks.
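
As a rough illustration of the “six clicks” claim, the distance between two sites in a webgraph can be measured with a breadth-first search that counts the fewest hyperlinks needed to get from one site to another. The toy graph and site names below are invented, not real link data.

```python
from collections import deque

def clicks_between(webgraph: dict, start: str, goal: str) -> int | None:
    """Return the minimum number of hyperlink clicks from start to goal."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        site, clicks = queue.popleft()
        if site == goal:
            return clicks
        for neighbour in webgraph.get(site, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, clicks + 1))
    return None  # the goal cannot be reached from the start at all

toy_web = {
    "news.example": ["university.example", "blog.example"],
    "university.example": ["zoo.example"],
    "blog.example": ["news.example"],
    "zoo.example": [],
}
print(clicks_between(toy_web, "news.example", "zoo.example"))  # 2 clicks
```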

Credit: VisualGeneration / photogenica.ru

Rule #3 – money makes money

Albert-László Barabási and Réka Albert, who formulated this rule, called it preferential attachment; in essence, the more cited a website is, the more likely it is to receive new links.

Imagine how the internet came to be: at first, there was only one website, which could only refer to itself; a little while later, a second site appeared – it could link to the first one and to itself; then came a third website that could refer to the first two plus itself, and so on. This thought experiment shows that the “original progenitor” gains more citations with each iteration. In graph theory, the number of links pointing to a vertex is called its in-degree. In this example, counting only the links that come from other websites, the first website has an in-degree of 2 (both the second and the third link to it), the second one – 1 (only the third links to it), and the third one – 0 (no other website refers to it).
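
Here is the same thought experiment written out as a tiny script, counting only the links that come from other websites:

```python
# Each key is a site, each value is the list of *other* sites it links to
# (self-links are left out of the count).
links = {
    "site1": [],                  # the first site has no one else to link to
    "site2": ["site1"],           # the second links to the first
    "site3": ["site1", "site2"],  # the third links to the first two
}

# In-degree: how many other sites link to each site.
in_degree = {site: 0 for site in links}
for source, targets in links.items():
    for target in targets:
        in_degree[target] += 1

print(in_degree)  # {'site1': 2, 'site2': 1, 'site3': 0}
```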

Credit: VisualGeneration / photogenica.ru

Math formulas vs. spammers

These rules and formulas have been embraced by search engine developers: Yandex Research, for one, has been running a joint study with the Moscow Institute of Physics and Technology (MIPT) that, among other things, is meant to help fight spammers and improve search algorithms.

On the agenda are spam sites that inflate one another’s citation counts through a dense network of hyperlinks, artificially boosting their rankings in search engines.

The researchers from Yandex and MIPT examine suspicious sites using webgraphs and the Barabási-Albert model. Here is how it works: first, they calculate the degrees of the vertices in a network of sites connected by hyperlinks (that is, how many times each site is cited), and then they count the number of edges those sites share with one another. If the actual number of edges greatly exceeds the expected one (derived from the model’s formulas), the sites are flagged as spam.
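
Below is a rough, hypothetical sketch of such a density check; it is not the actual Yandex-MIPT method. Since the study’s exact formulas aren’t given in the lecture, the “expected” edge count here uses a generic stand-in from random-graph theory: two sites with degrees d_i and d_j are linked with probability of roughly d_i·d_j divided by twice the total number of edges in the graph.

```python
def looks_like_spam(group, degrees, edges_within_group, total_edges, threshold=5.0):
    """Flag a group of sites whose internal link count far exceeds expectation.

    The expected count below is a simplified stand-in (random-graph null model),
    not the formula from the Yandex-MIPT study.
    """
    group = list(group)
    expected = 0.0
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            expected += degrees[group[i]] * degrees[group[j]] / (2 * total_edges)
    return edges_within_group > threshold * expected

# Toy numbers: five low-degree sites that nonetheless share 10 links among themselves.
degrees = {"a": 3, "b": 2, "c": 3, "d": 2, "e": 2}
print(looks_like_spam(["a", "b", "c", "d", "e"], degrees,
                      edges_within_group=10, total_edges=1_000_000))  # True
```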

Just like any law of physics, these formulas, although discovered back in 1999 when there were no more than 20 million websites, remain applicable today, when the number of pages online exceeds tens of billions. This shows that, despite being artificial and complex, the internet is still a part of nature that can be described, calculated, and structured.

You can watch the whole lecture here (in Russian).