What is your advice for those who have only just started getting involved in this field and aren’t sure where to begin?

In fact, there are now many ways to become a Data Science specialist. First of all, anyone with self-discipline and motivation can learn everything on their own. There are open courses, some of them in Russian. Most are in English, of course, and they are enough to understand the basic algorithms and theory. Of the Russian courses, I can recommend Introduction to Machine Learning (Coursera, Higher School of Economics, led by Konstantin Vorontsov) and the Machine Learning and Data Analysis specialization (six courses on Coursera by the Moscow Institute of Physics and Technology and Yandex). As for foreign courses, I believe the most useful are Machine Learning (Coursera, led by Andrew Ng), Neural Networks for Machine Learning (Coursera, led by Geoffrey Hinton), and the so-called nanodegrees on the Udacity platform created in collaboration with major companies: Data Analyst (in collaboration with Facebook), Machine Learning Engineer (in collaboration with), Deep Learning.

To go deeper and understand how it works in practice, one can take part in data analysis contests. The largest platform for that is kaggle.com https://www.kaggle.com/competitions. It hosts contests in different fields; for instance, the current one is a task from Sberbank of Russia that has to do with the real estate market. Its main goal is to develop algorithms that forecast real estate prices based on housing data and macroeconomic patterns. Many companies hold similar contests, and participating in them is a great way to test your skills and learn something new. What's more, taking part in contests, working through one’s own projects and completing courses are good ways to substantiate one's skills when applying for a job.

Of course, there are also educational programs at different universities. One of them is the Yandex School of Data Analysis, which gives a good basic knowledge of the topic. The School has branches in Moscow, St. Petersburg, Yekaterinburg and Novosibirsk.

What basic competencies does a Data Science specialist need?

Andrei Sozykin's lecture

At the basic level, that would be a good command of mathematics, including applied statistics and probability theory; programming is also definitely a must. At the moment, Data Science specialists mostly work with the Python programming language and its data analysis libraries. What's more, a specialist has to be well-versed in modern machine learning methods. That means not only deep learning, which is just one of the methods, but also methods such as gradient boosting, which are actively used now as well. So specialists have to know all the relevant methods and be able to use machine learning libraries.
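
As an editorial illustration of the gradient boosting idea mentioned above, here is a minimal sketch in plain Python. It is not taken from the interview and is not how production libraries implement the method: the data points, the function names (fit_stump, fit_boosting, predict, mse) and the settings (50 rounds, learning rate 0.1) are all assumptions chosen for the example. Each round fits a one-split "stump" to the current errors and adds a small fraction of it to the running prediction.

```python
# Minimal gradient boosting sketch for regression with squared loss.
# Every name and number below is illustrative, not from the interview.

def fit_stump(xs, residuals):
    """Pick the threshold on xs that best splits the residuals into two means."""
    best = None
    # Skip the largest value so both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm

def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)          # start from the mean of the targets
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: a noisy step from about 1 to about 3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

baseline = (sum(ys) / len(ys), 0.1, [])   # a "model" that always predicts the mean
model = fit_boosting(xs, ys)
print("baseline MSE:", mse(baseline, xs, ys))
print("boosted MSE: ", mse(model, xs, ys))
```

In practice one would reach for an existing machine learning library rather than code like this; the sketch only shows why many weak models added together can fit data that a single weak model cannot.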

How long does it take to learn the basics and start applying them?

Basically, one can learn the theory in about a year. That is enough to gain the skills needed to complete tasks and work with data, provided there is a specialist who can advise you on the workings of the particular field. Yet companies now need specialists who can not only solve tasks on an engineering level, but also put the resulting data to use. Companies have lots of data but don't know how to use it, and that is a very pressing problem which can only be solved by a specialist who is also well-versed in the particular area.

Ideally, a Data Science specialist has to be able to do lots of things: have a good grasp of programming and mathematics, and also know a particular field well, be it the banking sector, telecommunications, medicine, or any other. So in essence, a Data Science specialist has to have multidisciplinary knowledge. Of course, these competencies can belong to a team rather than a single person: say one person is really good at programming, another is an expert in mathematics, and a third knows a lot about the banking sector; together, they can produce great results. At our university, we aim to arrange the groups so that the students are able to work well as teams.

You focus your research on a particular field, Big Data analysis in medicine. Please tell us more about your current projects.

We have two projects on image analysis in medicine: analyzing images of the heart (namely MRI results) and images of skin cancer. What is this for? Nowadays, physicians spend several hours processing MRI results, as these contain whole sets of images. Using those, a highly qualified physician has to identify the affected areas. In other words, the physician examines lots of images (which can contain noise and the like) and defines the "regions of interest" based on this data. This is a very difficult task that takes a lot of time. Our program, like others in this field, helps define these regions on separate images and also creates 3D models of the patient's heart using as many images as possible. All of that helps to automate the physicians' work and lets them do it faster and with better precision.

In the near future, we plan to start modeling different diseases. For instance, we are currently researching different ways to treat arrhythmia. In some cases, arrhythmia is caused by a spiral wave in the left ventricle, which makes the heart contract at a very fast pace. If this lasts for a long time, it can cause great harm and even death. Yet there are ways to stop the wave. Of course, we can't conduct experiments on living people, for both ethical and technical reasons. But we can conduct computational experiments using supercomputers. We study different patterns of the spiral wave's propagation, as well as treatment methods that can stop it and thus prevent harm to the patient's health.

This research is conducted by our students, who use the so-called project approach in their education. In Yekaterinburg, there's a branch of the Yandex School of Data Analysis. Some of our students completed courses there and then stayed at our university for their PhD programs, as they want to learn a particular subject field. They study medical images, and this work lets them further advance their skills by applying them to real scientific tasks.

A painting processed by a neural network. Credit: lamcdn.net

Speaking of machine learning methods, in which fields have they produced the best results?

To my mind, the greatest achievement in the field of neural networks and their application is the development of driverless cars, both from the technical point of view and in terms of progress in general. Other fields that have progressed recently include CAT (computer-aided translation) systems and various chatbots. Fintech continues to develop rapidly and is in great demand among banks and other financial entities.

Finally, a very important field is everything that has to do with the automation of routine intellectual labor. This can be widely applied in accounting and law, where such things as bookkeeping operations and audits can be automated. There are already companies that offer such services, including in Russia. For instance, the Knopka company uses machine learning and neural networks to automate bookkeeping operations.

In this case, everything is quite easy. As e-document management has already been introduced, a neural network can receive incoming documents and identify them. Then it does the work a human accountant would do. Of course, a human accountant still has to check its work, but thanks to automation, companies need fewer such specialists, as they no longer have to do routine work manually.

What tasks and challenges do scientists in the Data Science field face now?

From the technical point of view, scientists will have to increase these systems' precision. This will involve collecting and analyzing more data. For instance, that will make it possible to improve the operation of driverless cars even in really rare situations.

Though neural networks already have enough data for typical situations, they sometimes can't cope with something extraordinary, something they've never seen. A famous example is an accident in which a driverless car was misled by the color of the vehicle ahead of it. The color was almost the same as that of the sky, and the neural network simply didn't recognize the trailer as part of the vehicle. Yet the network will now avoid such mistakes in all the cars it controls. A human's experience is more limited; humans also often repeat the same mistakes or can get into an accident because they feel unwell. The network, on the other hand, never gets tired, doesn't drink, and doesn't feel sick or gloomy.

Driverless car. Credit: svopi.ru

Does that make it much more reliable than a human?

Still, there are many ethical and legal concerns here. We still can't say how much we can trust neural networks, especially when it comes to fields that have to do with people's health and lives, such as the aforementioned medical diagnostics. On the one hand, neural networks can find complex patterns in medical data and offer solutions that no living person can. On the other, we still don't know how they do it. For us, neural networks are something like black boxes. We don't understand what happens inside or why exactly they offer particular solutions. In medicine, that is unacceptable, as in this field we have to know precisely what we do and why. That is why the question of using neural networks where human lives are concerned is still open.

Will it be possible to completely understand how neural networks work in the near future?

Scientists constantly make such attempts, but so far the results are not good. Sometimes even a small change in the data can make the network give a completely different result. For instance, it correctly recognizes a car in an image; then several pixels are changed, and it suddenly believes the image shows a cat. Yet humans still see a car. And no one knows why the network reacts to those changes in this particular way.
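
As an editorial illustration of how small pixel changes can flip a prediction, here is a toy sketch in plain Python. It is a deliberately simplified stand-in, not a neural network and not taken from the interview: the "classifier" is a single linear score, and the weights, pixel values, image size and epsilon are all invented for the example. The point it shows is that many tiny per-pixel shifts, each chosen in the direction that lowers the score, can add up to a change of label.

```python
# Toy linear "classifier": a positive score means "car", otherwise "cat".
# Weights, pixels and epsilon are hypothetical values chosen for illustration.

def score(weights, bias, pixels):
    return sum(w * p for w, p in zip(weights, pixels)) + bias

def label(weights, bias, pixels):
    return "car" if score(weights, bias, pixels) > 0 else "cat"

def sign(v):
    return (v > 0) - (v < 0)

def perturb(weights, pixels, epsilon):
    """Shift each pixel by at most epsilon in the direction that lowers the score.

    For a linear score, the gradient with respect to the pixels is just the
    weight vector, so stepping against sign(w) is the most damaging small step.
    """
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]

n = 100  # a tiny 10x10 "image", flattened
weights = [0.5 if i % 2 == 0 else -0.5 for i in range(n)]
bias = 0.0
pixels = [0.52 if i % 2 == 0 else 0.48 for i in range(n)]  # brightness on a 0..1 scale

epsilon = 0.03  # maximum change allowed per pixel
changed = perturb(weights, pixels, epsilon)

print("original:", label(weights, bias, pixels))
print("changed: ", label(weights, bias, changed))
print("largest pixel change:", max(abs(a - b) for a, b in zip(pixels, changed)))
```

Real networks are nonlinear and far larger, so this sketch explains only one part of the effect; it does not settle the open question raised above of why networks react the way they do.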

Yet what we can achieve in the near future, such as analyzing larger amounts of data and developing more complex machine learning algorithms, is what matters most, and it will allow us to do in several hours what now takes us a month.

The Open Seminar on Deep Learning, or How to Become a Data Scientist took place in St. Petersburg on May 16. The event was organized by ITMO University and the NVIDIA Company; its participants discussed how one becomes a specialist in Big Data analysis and demonstrated a solution to a classical image recognition task using neural networks. Andrei Sozykin, head of the High-performance Computer Technologies Department of Ural Federal University, was one of the event’s speakers. Mr. Sozykin does research in the field of medical image analysis with the use of deep learning, and runs a YouTube channel where he posts lectures on Deep Learning technologies.