Pokémon and Headaches: What is Big Data?
Is Big Data really that big? And if it's big, big compared to what? When and how does data become so big that it starts to frighten programmers? These are only a few of the questions about Big Data that can trouble novice coders and humanities-minded people alike. It may even seem that Big Data is everywhere and that there's no way to deal with it. Vladimir Krasilnik, a developer for Yandex.Market, brought some clarity to these questions at the "Joker-2016.Student Edition" conference.
How big is Big?
Some say that Big Data is something you'll have to "dig into". Some even believe that it has something to do with philosophy. The one thing everyone agrees on is that Big Data is a problem. To clear it all up, I will start with an example, and then we'll work out a general definition.
Everyone knows Twitter. How many tweets a day do you think go through its system? According to Internet Live Stats, around 500 million short messages. Is that a lot or not? Is this Big Data?
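To get a feel for that number, it helps to convert it into an average rate. The snippet below is just back-of-the-envelope arithmetic on the 500 million figure quoted above. The function name is made up for illustration, and real traffic is bursty, so peaks will be well above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rate_per_second(events_per_day: int) -> float:
    """Average number of events per second, assuming an even spread over the day."""
    return events_per_day / SECONDS_PER_DAY


tweets_per_day = 500_000_000  # the figure quoted in the text
rate = average_rate_per_second(tweets_per_day)
print(f"{rate:,.0f} tweets per second on average")  # prints "5,787 tweets per second on average"
```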
Let's say you've decided to open your own Web analytics business, and you get an order: some company wants to know today's most popular hash-tags on Twitter. How do you do it? You can take each tweet, "disassemble" it into hash-tags, and put them into a database, keeping a separate counter for each hash-tag. But that won't work: a single database will struggle to keep up with 500 million tweets a day. So, what's next? Surely you could buy more servers and build a network. But that won't work either: the counters will receive updates from many sources at once, and no single counter will be able to record them correctly or update fast enough. To straighten things out, we can route each particular hash-tag to a particular server. This way, there will be less confusion, and that is a solution of sorts. Still, the servers must be maintained, and this solution adapts poorly to related tasks, such as breaking the statistics down by country or adding data from other social networks.
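The "particular hash-tag to a particular server" idea can be sketched in a few lines. This is a toy, single-process illustration, not how Twitter or anyone else actually does it: the "servers" are just in-memory counters, and the names `NUM_SHARDS`, `shard_for`, `count_hashtags` and `top_hashtags` are assumptions made up for this example.

```python
import re
import zlib
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")
NUM_SHARDS = 4  # hypothetical number of counting servers


def extract_hashtags(tweet):
    """'Disassemble' a tweet into lower-cased hash-tags (without the '#')."""
    return [tag.lower() for tag in HASHTAG_RE.findall(tweet)]


def shard_for(tag, num_shards=NUM_SHARDS):
    """Pick the server responsible for a tag.

    crc32 is used because it gives the same value in every process,
    unlike the built-in hash() for strings.
    """
    return zlib.crc32(tag.encode("utf-8")) % num_shards


def count_hashtags(tweets, num_shards=NUM_SHARDS):
    """Route every hash-tag to its shard and count it there."""
    shards = [Counter() for _ in range(num_shards)]
    for tweet in tweets:
        for tag in extract_hashtags(tweet):
            shards[shard_for(tag, num_shards)][tag] += 1
    return shards


def top_hashtags(shards, n):
    """Merge the per-shard counters; each tag lives on exactly one shard."""
    merged = Counter()
    for shard in shards:
        merged.update(shard)
    return merged.most_common(n)


sample = ["I love #Python and #bigdata", "#python rocks", "#BigData #python"]
print(top_hashtags(count_hashtags(sample), 2))  # [('python', 3), ('bigdata', 2)]
```

Because a given hash-tag always lands on the same shard, no two servers ever compete to update the same counter. The weakness described above is also visible: the routing key is baked in, so counting by country would need a different layout.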
We are all human, and we make mistakes. In the end, you'll fail the task, get no money, and go bankrupt. But after all this trouble, you'll be a step closer to a smarter solution: one that not only solves the problem of popular hash-tags but also adapts to related tasks. That solution is the lambda architecture. All the data is processed in three layers (batch, speed, and serving), then sorted and categorized. One could say this is a common way of working with Big Data.
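To make the three layers a bit more concrete, here is a deliberately tiny sketch for the hash-tag counter. It assumes the usual split into a batch layer that recomputes a view from the full stored data, a speed layer that counts only what arrived since the last batch run, and a serving layer that merges the two at query time. The class and method names are hypothetical, and a real system would spread these layers across many machines.

```python
from collections import Counter


class LambdaHashtagCounter:
    """Toy, single-process illustration of the three lambda layers."""

    def __init__(self):
        self.master_log = []            # append-only record of every hash-tag seen
        self.batch_view = Counter()     # batch layer: counts computed from the full log
        self.realtime_view = Counter()  # speed layer: counts since the last batch run

    def ingest(self, tag):
        """New data goes to the master log and to the speed layer."""
        self.master_log.append(tag)
        self.realtime_view[tag] += 1

    def run_batch(self):
        """Batch layer: recompute the view from scratch over all stored data."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view.clear()  # the batch view now covers everything

    def query(self, n):
        """Serving layer: merge the batch and real-time views to answer a query."""
        merged = self.batch_view + self.realtime_view
        return merged.most_common(n)


counter = LambdaHashtagCounter()
for tag in ["python", "bigdata", "python"]:
    counter.ingest(tag)
counter.run_batch()          # slow, thorough pass over everything stored so far
counter.ingest("bigdata")    # arrives after the batch run
counter.ingest("bigdata")
print(counter.query(1))      # [('bigdata', 3)]
```

Keeping the raw log is what makes the design adaptable: a new question, such as counts per country, means writing a new batch computation over the same stored data instead of rebuilding the counters.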
Where does Big Data come from?
Big Data is always a result of redundancy: there is so much information that there ends up being too much data. To count hash-tags using the lambda architecture, we would have to store not only the counters but the whole tweets as well. Counting can create even more data through replication, since we have to store the same data in different data centers so as not to lose it.
Big Data originates from commerce. There is a fun example. Once, the infuriated father of a 15-year-old girl came to a supermarket to complain that it had offered her coupons for diapers. Later he had to apologize himself: it turned out his daughter was pregnant. How did the store know? It stored information on all purchases. After the girl bought certain items, the store expected her to buy something from related categories. When a store offers you something based on what you've already bought, that's Big Data as well.
Big Data means headaches.
Big Data has some distinct technological features. If it's Big Data we're dealing with, there will be hyper-converged infrastructure and the ability to run different kinds of queries. Big Data also calls for high accuracy and secure data input. Serious analytical tools are needed to process it, where "serious" means that they can take the whole data set and produce a result for a query. Also, Big Data technologies are generally so complex that their developers name them after Pokémon and create fun logos for them: a yellow elephant, for instance.
Big Data can also be defined by several essential features. Volume is one of them. Diversity is another: almost anything can be stored. Validity: to know the popular hash-tags, you need every tweet that was actually posted. Speed: the data flows are insane, and while you process one portion, three more arrive. Value: you can work with huge amounts of data, but if the results have no value, that's not Big Data. For example, you have to see the tweet saying that your neighbor is buying the beer today before there's none left. Big Data is also inconsistent: if you are late, what you know has no value. And Big Data is something one has to visualize, with charts, maps, anything.
But there is a definition I like most: Big Data means headaches. If your project works fine without you paying much attention to it — that's not Big Data. Working with Big Data means constant data flows that never end, solutions that never work, and the constant need to think up new ones. So, Big Data is constant adaptation.