You and your colleagues have recently published an article on voice authentication and bank fraud. Could you tell us about it?
Banks are now actively adopting voice identification for user convenience. However, convenience in this case comes at the cost of security. You’ve probably heard recent stories about criminals using fake voice recordings to deceive security systems and gain access to accounts, make transactions, take out loans, or transfer money to others. In our article, we analyzed synthesized voices, cybercrime and online fraud, and methods for recognizing generated speech.
Is it actually difficult to make a believable facsimile of someone’s voice?
If we have a high-quality recording, it isn’t hard to synthesize a new audio track, even with completely different text. Synthesis systems based on deep neural networks are now especially widespread. Anyone can already find them online, and, if necessary, any competent Python programmer can retrain these networks or adapt their code in the blink of an eye. The result is a natural-sounding voice that can fool an automated system, and that creates a significant problem in banking.
Is this done by hacking the bank’s databases for the records of customers’ voices?
No, and besides, that would be quite difficult: such systems are well protected by multi-level security that is hard to bypass. Usually, criminals use phishing: a “bank employee” calls you, strikes up a conversation, and records it.
How long should the conversation last? Is it enough to simply answer the call and say “hello”? Or does it take a longer conversation?
Of course, a simple “hello” isn’t enough. But if you answer five or six questions, criminals will have data on the amplitude of your voice, its tone and timbre, the semantics of your speech, and your use of syllables and words. All of this is recorded, analyzed, and used to generate a new audio track that can deceive the bank’s system.
What measures are now being taken to fight this kind of fraud?
Specialists are working on systems that can distinguish a generated voice from a real one. They believe that a person’s speech can’t be perfectly recreated: you can always spot a difference, at least for now.
Yet such methods aren’t universal. These systems are still in their early days, so most banks rely on multi-factor authentication that includes not only voice but other criteria as well.
What can ITMO offer?
We’re working on a related project, developing systems for identifying generated audio tracks.
How do they work?
As I said, synthesized tracks are not perfect, and we know that. If you look closely at the waveform of a generated audio track, it won’t be as smooth as a live recording: there will be bursts, and the semantic connections will differ. These are the telltale signs of robotic speech.
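The intuition above, that a generated waveform is less smooth than a live one, can be sketched with a toy roughness score. This is purely an illustrative assumption, not the team’s actual algorithm: the function name, the synthetic signals, and the second-difference metric are all invented for the example.

```python
import math

def roughness(samples):
    """Mean absolute second difference: a crude 'smoothness' score.

    Sudden bursts (discontinuities) in a waveform raise this score,
    while a smooth, natural-looking signal keeps it low.
    """
    if len(samples) < 3:
        return 0.0
    diffs = [samples[i + 1] - 2 * samples[i] + samples[i - 1]
             for i in range(1, len(samples) - 1)]
    return sum(abs(d) for d in diffs) / len(diffs)

# A smooth 440 Hz sine wave vs. the same wave with artificial bursts
# (hypothetical stand-ins for a live recording and a generated track).
smooth = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
bursty = [s + (0.5 if i % 200 == 0 else 0.0) for i, s in enumerate(smooth)]

print(roughness(bursty) > roughness(smooth))  # bursts raise the score
```

A real detector would of course use far richer features than a single score, but the idea of looking for statistical irregularities in the signal is the same.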
The biggest challenge lies in discretization. We record speech as an analog signal, but the computer converts this signal into binary code for processing and storage, and this conversion introduces distortion. The higher the sampling rate, the smoother the synthesizer’s speech, but the data volume grows and the track becomes harder to generate. Some defects can only be detected by a person with a perfect ear or by a well-trained program, which is what we’re working on.
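The distortion introduced by discretization can be demonstrated with a minimal sketch, assuming a simple uniform quantizer: representing each sample with fewer bits forces it onto a coarser grid, and the resulting error is exactly the kind of defect a detector can look for. The functions here are hypothetical, written only for this illustration.

```python
import math

def quantize(x, bits):
    """Round x in [-1, 1] to the nearest of 2**bits uniform levels."""
    levels = 2 ** bits - 1
    return round((x + 1) / 2 * levels) / levels * 2 - 1

def max_error(signal, bits):
    """Largest per-sample distortion introduced by quantization."""
    return max(abs(s - quantize(s, bits)) for s in signal)

# One period of a sine wave as a stand-in for an analog speech signal.
signal = [math.sin(2 * math.pi * t / 100) for t in range(100)]

# Coarser quantization (fewer bits) leaves a larger distortion.
print(max_error(signal, 8) > max_error(signal, 16))
```

The same trade-off the speaker describes for sampling rate applies here: more bits (or a higher sampling rate) mean smaller artifacts but more data to store and generate.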
How is the work on this project going?
It’s coming to an end. We’ve proposed the algorithms, and now we need to implement them in a working program for testing and evaluation. We plan to do all this by the summer. After that, we want to improve the system: so far, its accuracy is 85%, but we won’t stop there.
Are there any industrial or funding partners interested in your work?
Yes, but it’s too early to share any details.