Testing offers at least some assurance that you won't have to deal with a whole wave of problems after you change the code. One way to do it is unit testing: you run functions on different kinds of input and check what happens.
Unfortunately, we don't really use it at VKontakte, even though everyone knows that it should be done. Most teams just do code review, a mechanism that helps detect common mistakes and slips in the code. Then there is integration testing (testing of different modules as a group -- Ed.). Integration testing may well be the best way to check that everything is all right.
Some believe that programmers at VKontakte don't test their code and that we don't have a QA department. That's not true. We do have such a department, and the people who work there are great at it. Still, you have to understand that no department and no testing system comes anywhere close to real live people, who can break anything. No matter how hard you try, some bug will always go unnoticed. Don't get your hopes up: something will definitely go wrong.
So, how do you find out about it? Especially in a service of VK's scale?
Sometimes the scale works in your favor: if you fix everything really fast, most users won't even notice. Speed is what matters most. To find mistakes and bugs, we use a variety of statistics. We have lots of charts that reflect different performance metrics for each section of the site. If we see a spike on a normally stable chart, a breakdown is very likely. If we see spikes on several charts at the same time, we have serious problems. Comparing charts is also important: it helps us understand what's happening and avoid false alarms. We build dashboards from the most indicative charts, and specialists analyze them.
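A minimal sketch of this idea in Python, assuming each chart is available as a list of recent samples. The function names, the three-sigma threshold, and the severity labels are illustrative assumptions, not a description of VK's actual monitoring.

```python
from statistics import mean, stdev


def is_spike(history, latest, threshold=3.0):
    """Return True if `latest` is more than `threshold` standard
    deviations away from the mean of `history` (a stable series)."""
    if len(history) < 2:
        return False  # not enough data to judge stability
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat series: any change is a spike
    return abs(latest - mu) / sigma > threshold


def classify(metrics, threshold=3.0):
    """`metrics` maps a metric name to its samples, newest last.
    One spiking metric is a warning; several at once are critical."""
    spiking = sorted(
        name
        for name, samples in metrics.items()
        if is_spike(samples[:-1], samples[-1], threshold)
    )
    if not spiking:
        severity = "ok"
    elif len(spiking) == 1:
        severity = "warning"
    else:
        severity = "critical"
    return severity, spiking
```

Comparing several metrics before raising the alarm is what keeps a single noisy chart from paging the whole team.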
We also keep track of technical support statistics. After all, our users rely on support a lot. Usually we receive about sixty requests in five minutes, such as "why did you block my account" or "please delete my naked photos from this fellow's page". When real problems arise, however, we receive up to 400 requests in the same period of time.
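The support signal can be sketched as a sliding-window counter. The five-minute window and the baseline of about sixty requests come from the talk; the class name, the method names, and the alarm factor of 3 are assumptions made for illustration.

```python
from collections import deque


class SupportRateMonitor:
    """Counts support requests in a sliding time window and flags
    the window as alarming when it far exceeds the usual baseline."""

    def __init__(self, window_seconds=300, baseline=60, alarm_factor=3.0):
        self.window_seconds = window_seconds
        self.baseline = baseline          # typical requests per window
        self.alarm_factor = alarm_factor  # hypothetical multiplier
        self._timestamps = deque()

    def _evict(self, now):
        # Drop requests that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()

    def record(self, timestamp):
        """Register one support request (timestamps must be non-decreasing)."""
        self._timestamps.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        self._evict(now)
        return len(self._timestamps)

    def is_alarm(self, now):
        return self.count(now) > self.baseline * self.alarm_factor
```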
There is also an unconventional way to analyze the situation: we track the popular "вкживи" (VKlive) hashtag on Twitter.
It's widely known that VK uses its own custom databases that are optimized for particular tasks. One advantage of this approach is that we have unified messaging protocols. This gives us a single entry point to all databases, so we can block any or all requests if necessary.
Once we've learned that something went wrong, we have to localize the bug. How do we do it? We collect all errors in a single place. We collect logs as well, to use them as an aid. This really helps when analyzing a situation.
We also plot commits (saved changes to the code) on a chart. When we see a spike that coincides with lots of commits, those commits most likely led to the problem. If you don't have to check too many lines of code, you'll find the mistake fast. We also use a special chat where the list of commits is posted as a message at each deploy. That helps a lot, too.
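A rough sketch of how commits could be lined up against a spike, assuming each deploy is recorded with a timestamp and its list of commits. The `Deploy` type, the function names, the 30-minute lookback, and the message format are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Deploy:
    timestamp: float  # seconds since some epoch
    commits: list     # commit identifiers included in this deploy


def suspects(deploys, spike_time, lookback_seconds=1800):
    """Return commits from deploys that landed shortly before the spike,
    most recent deploy first."""
    recent = [
        d for d in deploys
        if spike_time - lookback_seconds <= d.timestamp <= spike_time
    ]
    recent.sort(key=lambda d: d.timestamp, reverse=True)
    return [commit for d in recent for commit in d.commits]


def deploy_message(deploy):
    """Format the chat message posted at each deploy."""
    lines = [f"Deploy at {deploy.timestamp:.0f}: {len(deploy.commits)} commit(s)"]
    lines += [f"  - {commit}" for commit in deploy.commits]
    return "\n".join(lines)
```

The shorter the list returned by `suspects`, the fewer lines of code have to be checked, which is the point made above.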
Communication is one of the most important things in a team. When a problem arises, you need to quickly gather a small group of people who know the most about it. No one in our company can know everything about the whole service, which is why it's really important to gather a competent team as soon as possible. We have a special chat for this purpose as well: when you see a message there, it means that something has gone seriously wrong. Wrong enough to pull you away from whatever you are doing: you just have to read the messages and react if there's something you can help with.
Sometimes it's necessary to test something in production, because there is simply no other way to do it. Some functions have to be tested in real-life conditions; you just can't create a test for them.
So, what do we do? How do we do it so that users won't notice? We really like simple solutions at our company, so we use the "if and random" approach. Let's say we have the new code and the old code. Each time we receive a request, we "flip a coin" and use one of them to process it. This lets us limit the test group, and the approach works great. We use something similar for infrastructure changes. To avoid crashing all servers, only a limited number of users get access to such changes. If anything goes wrong, only a few servers will crash. That's bad, but not as bad as it could be.
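The "if and random" approach fits in a few lines. This is a sketch, not VK's code: the handler arguments, the `rollout` parameter, and its 5% default are assumptions chosen for the example.

```python
import random


def handle_request(request, old_handler, new_handler, rollout=0.05, rng=random):
    """Flip a weighted coin per request: with probability `rollout`
    the new code processes it, otherwise the old code does."""
    if rng.random() < rollout:
        return new_handler(request)
    return old_handler(request)


def old_code(request):
    return ("old", request)


def new_code(request):
    return ("new", request)
```

Setting `rollout` to 0 sends every request to the old code, so the same switch doubles as a quick way to turn the experiment off.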
In a Highload project (one where the load on the service is high -- Ed.), all the bad things that can happen will happen. It's just probability theory working against you. You have about a hundred billion requests per second, so even if the probability of some problem happening is 1% or even lower, it will still happen. I believe this is one of the main rules you have to deal with when working on a Highload project.
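The arithmetic behind this rule can be checked directly. In the sketch below, the function names are made up for the example, and requests are assumed to fail independently with the same probability, which is a simplification.

```python
import math


def expected_failures(requests, p):
    """Average number of failing requests if each fails with probability p."""
    return requests * p


def prob_at_least_one(requests, p):
    """Probability that at least one of `requests` independent requests
    fails: 1 - (1 - p)^n, computed in a numerically stable way."""
    return -math.expm1(requests * math.log1p(-p))
```

Even a bug that hits one request in a billion becomes a near certainty long before the request count reaches the figure quoted above: the chance of seeing it at least once is `1 - (1 - p)^n`, which approaches 1 as soon as `n` is much larger than `1 / p`.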
To sum up, here is some advice. First of all, always remember that people make mistakes much more often than machines, so automate everything that can be automated. Humans generate bugs everywhere and all the time, so the less you rely on people, the better.
Use statistics as much as you can. Use any indicators to detect bugs, even the most unconventional ones. And don't forget quick rollbacks: they will save you both time and trouble. Just roll the system back and study the problem to find a solution. If you do it fast, most users won't even notice anything.
Work as a team if you want to solve your problem, and the main advice is: don't panic.
Vyacheslav Shebanov's lecture was part of the IT Hardcore international conference held on October 8, 2016.