Ten years ago, Big Data was still a tool used by specialists in finance, bioinformatics and a handful of other narrow fields; now it has entered business and is applied in many different areas of human activity - from analyzing socioeconomic processes in cities to marketing and recruitment.
Facing a constant increase in the volume of data, companies are looking for ways to speed up data processing and optimize data storage. For instance, the classical approach of batch processing is increasingly being replaced by stream processing: specialists all over the world now make use of mechanisms that process data as it arrives and deliver results straight away.
Not so long ago, for example, Facebook guaranteed that newly loaded content would be refreshed every two minutes; the situation has since changed, and Twitter has already shortened this time to mere seconds.
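To make the batch-versus-stream contrast concrete, here is a minimal, purely illustrative Python sketch (not tied to any particular framework or to the systems discussed below): the batch function has to wait for the whole dataset, while the streaming one yields an updated result for every incoming reading.

```python
from typing import Iterable, Iterator

def batch_average(readings: Iterable[float]) -> float:
    """Batch style: wait until all the data has been collected, then compute once."""
    data = list(readings)              # the whole dataset must arrive first
    return sum(data) / len(data)

def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream style: update the result as each reading arrives."""
    total, count = 0.0, 0
    for value in readings:
        total += value
        count += 1
        yield total / count            # a fresh result is available immediately

readings = [2.0, 4.0, 6.0, 8.0]        # e.g. sensor readings arriving one by one
print(batch_average(readings))         # 5.0, but only after everything is in
for running_avg in streaming_average(readings):
    print(running_avg)                 # 2.0, 3.0, 4.0, 5.0 as data arrives
```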
There are many different types of data, so naturally there is a whole range of approaches to storing and processing it: SQL and NoSQL databases (PostgreSQL and Cassandra, for instance), general-purpose storage systems (HDFS) and storage systems for specialized data (such as SciDB, which focuses on scientific data). Yet, in order to provide sufficient flexibility and coverage, all of these solutions rely on similar interfaces and ways of storing information, regardless of its specific nature.
What is data semantics, and how can it help shorten processing time?
ITMO scientists have also been working on stream processing technologies for quite a while. Yet, unlike companies whose solutions focus on optimizing data processing, the university's specialists decided to approach the problem in a different way.
"A year ago, we decided to study the approaches that differ from regular Big Data optimization solutions. Surely, optimizing data processing is necessary and relevant, yet what interested us were the problems we are about to face in the nearest future. We’ve tried to devise ways to process data more effectively, with greater output - in terms of processing time, usability, comprehensiveness, and focused on the very nature of data, its semantics," explains Denis Nanosov, one of the research's authors.
Denis Nanosov
So, what is data semantics? In essence, it is the combination of the data's structure (attributes, data types, object types), the relationships between data objects (article-author, author-position and the like), and the particulars of its presentation format (the formats the data is stored in). In practice, this means that before putting data into storage and keeping it there, specialists try to understand what the data is actually about.
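As a purely hypothetical illustration of such a description - the class and field names below are invented for this article and are not part of Exarch - the three aspects of semantics could be captured in a small structure like this:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DatasetSemantics:
    """Hypothetical description of what a dataset is 'about' before it is stored."""
    attributes: Dict[str, str]                # structure: attribute name -> data type
    relationships: List[Tuple[str, str]]      # links between object types
    storage_format: str                       # how the data is physically represented

# Illustrative example: hourly metocean records (the format is assumed for illustration only)
metocean = DatasetSemantics(
    attributes={"latitude": "float", "longitude": "float",
                "timestamp": "datetime", "water_level": "float"},
    relationships=[("location", "time_series"), ("time_series", "measurement")],
    storage_format="NetCDF",
)
print(metocean.attributes["water_level"])     # 'float'
```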
Specialists from the High-Performance Computing Department built on this principle in research that resulted in the Exarch modular data repository. Its main goals are to provide fast access to data, distribute data effectively among storage nodes, and make it easy to extend the system so it can store data in any format. The research results and the new system's core were presented at the recent ICCBDC 2017 conference in London. The authors note that using data semantics not only shortens processing time, but also speeds up the processes involved in organizing storage and reduces the required storage volume.
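To illustrate what "modular" could mean in practice - a speculative sketch, not Exarch's actual API - a repository core can stay format-agnostic and delegate reading and writing to pluggable format drivers, so supporting a new format only requires registering a new driver:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class FormatDriver(ABC):
    """Hypothetical interface a storage module implements for one data format."""
    @abstractmethod
    def read(self, raw: bytes) -> Any: ...
    @abstractmethod
    def write(self, obj: Any) -> bytes: ...

class CsvDriver(FormatDriver):
    """Toy CSV driver; values come back as strings in this simplified example."""
    def read(self, raw: bytes) -> Any:
        return [line.split(",") for line in raw.decode().splitlines()]
    def write(self, obj: Any) -> bytes:
        return "\n".join(",".join(map(str, row)) for row in obj).encode()

class Repository:
    """Core that knows nothing about formats; drivers carry the format-specific logic."""
    def __init__(self) -> None:
        self._drivers: Dict[str, FormatDriver] = {}
        self._store: Dict[str, bytes] = {}
    def register(self, fmt: str, driver: FormatDriver) -> None:
        self._drivers[fmt] = driver
    def put(self, key: str, fmt: str, obj: Any) -> None:
        self._store[key] = self._drivers[fmt].write(obj)
    def get(self, key: str, fmt: str) -> Any:
        return self._drivers[fmt].read(self._store[key])

repo = Repository()
repo.register("csv", CsvDriver())
repo.put("levels", "csv", [[70.0, 30.0, 1.2], [71.0, 31.0, 0.9]])
print(repo.get("levels", "csv"))   # [['70.0', '30.0', '1.2'], ['71.0', '31.0', '0.9']]
```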
The High-Performance Computing Department
How does it work?
Let's say you have to process a large amount of metocean data, which is essential for modeling and forecasting meteorological conditions at a particular location on Earth. Specialists from ITMO's eScience Research Institute faced a similar task when they developed a decision support system for preventing wind-driven floods in St. Petersburg. The project involved modeling and forecasting the development of wind-driven floods using a system of hydrodynamic and probabilistic models that account for the main factors affecting a storm surge's life cycle: atmospheric pressure, ice cover, wind waves and various hydrodynamic processes.
But is it possible to expand the project's geography and conduct a similar study covering a larger territory? Specialists of the eScience Research Institute are currently working on projects that will make it possible to model weather conditions anywhere on Earth, meaning they now face the problem of storing, processing and analyzing even more data.
Video: decision support system for predicting wind-driven floods in St. Petersburg.
"When performing such an analysis, the speed with which we can access particular data becomes critical. Processing speed is also very important. A simple example: we had to process 50-years worth of data from some 100,000 places in the Arctic Region, and to derive 20-40 parameters for a certain place, the classical solution took several hours - all because of the ill-considered traditional data organization and the lack of optimization mechanisms. That is already too long, and we then had to process these parameters, as well," shares Denis Nasonov.
Unlike the classical analysis, in which specialists work with files that each contain a certain portion of information, the scientists decided to analyze how the files themselves are organized. Say the data the specialists derive is represented as table fields. For the Arctic Region, the water level is represented by a particular field, and the information in it is stored for one hour at a time. Yet to build a comprehensive data storage model, one has to analyze different scenarios, including those in which one has to process not just individual fields but data for hundreds of thousands of locations over some 50 years.
Predicting weather using Big Data. Credit: vizrt.com
With such scenarios in mind, the research's authors decided to reorganize the data around a new entity: a point (a set of geographical coordinates) tracked over time. This removed the need to load, open and close millions of files in search of particular parameters for certain locations scattered across different table fields. The approach reduced the number of requests needed to retrieve the desired information by tens of times and sped up the process as a whole. As a result, an analysis that would have taken several months was done in days.
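A minimal Python sketch of the idea (a deliberate simplification using in-memory dictionaries rather than Exarch's actual storage layout): in the classical layout every time step has to be touched to assemble one point's history, while in the reorganized layout the whole time series for a point is retrieved with a single lookup.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (latitude, longitude)

# Classical layout: one "file" per time step, each holding values for every point.
# Extracting one point's history means touching every time step.
def history_classical(files: Dict[int, Dict[Point, float]], point: Point) -> List[float]:
    return [snapshot[point] for _, snapshot in sorted(files.items()) if point in snapshot]

# Reorganized layout: data is grouped by point, with the whole time series in one place.
def reorganize(files: Dict[int, Dict[Point, float]]) -> Dict[Point, List[float]]:
    by_point: Dict[Point, List[float]] = defaultdict(list)
    for _, snapshot in sorted(files.items()):
        for pt, value in snapshot.items():
            by_point[pt].append(value)
    return by_point

# Toy data: two hourly snapshots of water levels at two Arctic points
files = {
    0: {(70.0, 30.0): 1.2, (71.0, 31.0): 0.9},
    1: {(70.0, 30.0): 1.4, (71.0, 31.0): 1.0},
}
by_point = reorganize(files)
print(history_classical(files, (70.0, 30.0)))  # [1.2, 1.4] - scans every time step
print(by_point[(70.0, 30.0)])                  # [1.2, 1.4] - one lookup per point
```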
Having compared these results with those of existing solutions, the scientists found that their approach to data reorganization also delivers roughly a twofold gain even on simple tasks.
Alexander Visheratin
"We compared our system with the work of commonly used solutions - HDFS, Cassandra, and got a diagram: where they spend 100 seconds, our solution spends 40, so we observe a two times improvement in download time and even better indications in what has to do with reading. Exarch showed great results: depending on the information quantity and some other factors, it can processed information times or even tens of times faster," comments Alexander Visheratin, the article's main author and engineer for the eScience Research Institute.
Further Development
After the strong results with meteorological data, the next step will be to adapt the mechanism to other types of complex data - for instance, data from social networks and the Scopus database. The scientists also plan to make the system's core public by posting it on GitHub next year.
Predicting weather using Big Data. Credit: esri.com
"Everyone will have the opportunity to make use of Exarch - from serious scientific organizations who experience the lack of computational capacities and time to different startups who want to save resources by using the new solution - instead of spending lots of money on increasing one's computational capacities, they may prefer to reorganize their data in a more effective manner," concludes Denis Nasonov.