Connected devices are proliferating at a rapid rate, and this growth means that we’re only just beginning to scratch the surface of potential use cases for Internet of Things (IoT) technology. IoT has quickly moved beyond basic internet-connected gadgets and wearables to more sophisticated interactive features like voice processing, which in turn has led to a significant rise in voice-activated devices such as smart speakers.
Thirty-two percent of surveyed consumers reported owning a smart speaker in August 2018, compared with 28 percent in January of that year, according to new research by Adobe Analytics. The adoption rate of voice assistant technology has overtaken even that of smartphones and tablets – in fact, some predict that as many as 225 million smart speakers will be in homes worldwide by 2020. But at what risk?
Smart Speakers are Booming
As we watch this smart speaker base continue to grow, we have to consider the potential security ramifications these devices bring into people’s homes. One lesser-known threat we can expect to see is skill squatting, and it is likely to develop into a legitimate cybersecurity problem.
Voice assistant-powered devices rely on ‘skills,’ or combinations of verbal commands that instruct the assistant to perform a task. When a user gives a verbal command through a phrase or statement, the device registers the command and determines which skill the user would like to activate. From turning on the lights in your living room to adding an item to your grocery list – or even buying those groceries – for every command you give, there’s a skill attached to that task.
Every smart assistant has the ability to get even smarter with small software applets that allow it to run processes automatically. These applets will look for a statement and then act upon it by running a number of linked skills; for example, while standing in the kitchen, you can instruct your smart speaker to ‘play some dinner music,’ which locates the music and then activates the closest speaker. But in order to execute the command, the device has to accurately interpret the user’s words before connecting the command with the specific skill the user would like to activate.
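The interpret-then-route step described above can be sketched in a few lines. This is a minimal illustration and not how any real assistant is implemented: the `SKILLS` registry, the handler strings and the `route` function are all hypothetical names invented for the example, and real platforms use far more forgiving matching than an exact phrase lookup.

```python
from typing import Callable, Dict, Optional

# Hypothetical skill registry: invocation phrase -> handler.
# Both the phrases and the handlers are made up for illustration.
SKILLS: Dict[str, Callable[[], str]] = {
    "play some dinner music": lambda: "playing dinner playlist on kitchen speaker",
    "turn on the lights": lambda: "living room lights on",
}


def normalize(transcript: str) -> str:
    """Lowercase the transcript and collapse runs of whitespace."""
    return " ".join(transcript.lower().split())


def route(transcript: str) -> Optional[str]:
    """Run the skill whose phrase matches the transcript, or return None."""
    handler = SKILLS.get(normalize(transcript))
    return handler() if handler else None


if __name__ == "__main__":
    print(route("Play some dinner music"))
    print(route("open the garage"))
```

The point of the sketch is that routing depends entirely on the transcript: whatever text the speech recognizer produces is what selects the skill, whether or not it matches what the user actually said.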
This past September, Amazon reported that developers had built and launched more than 50,000 Alexa skills with over 3,500 brands contributing to this library. It’s likely that even more developers have begun using toolkits to build new skills in the four months since this report.
There are no limitations in sight for voice processing technology, which is both exciting and unsettling. You might even consider this space to be a bit like the Wild West – there’s unfettered room for innovation, but few safeguards and little user understanding of the risks associated with this technology.
Did you Mean ‘Deer’ or ‘Dear’?
What happens when a smart speaker connects its user to the wrong skill? Usually just user frustration followed by aggressive repetition of the intended command. But there can also be much more sinister outcomes.
Voice processing technology does not always interpret commands correctly. In the case of homophones or audibly unclear commands, mistakes are likely to be made. After testing 537,000 audio transcriptions of single-word speech samples against Amazon’s Alexa platform, a team of researchers at the University of Illinois at Urbana-Champaign (UIUC) found 27 predictable errors. Some of these were homophones – think ‘sale’ and ‘sail’ – but some had different phonetic structures, like ‘coal’ and ‘call’ or ‘dime’ and ‘time.’
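One way to picture what makes an error “predictable” is to tally how often each spoken word comes back as some other word. The sketch below is an assumption-laden illustration and not the UIUC team’s method: the sample pairs are invented, and the rule that an error counts as predictable when the same wrong word comes back in more than half the samples is a threshold chosen only for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def predictable_errors(
    samples: Iterable[Tuple[str, str]], threshold: float = 0.5
) -> Dict[str, str]:
    """Map each spoken word to the wrong word it is most often heard as.

    A word is reported only if its most common transcription differs from
    the spoken word and accounts for more than `threshold` of its samples.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for spoken, heard in samples:
        counts[spoken][heard] += 1

    errors: Dict[str, str] = {}
    for spoken, heard_counts in counts.items():
        total = sum(heard_counts.values())
        heard, n = heard_counts.most_common(1)[0]
        if heard != spoken and n / total > threshold:
            errors[spoken] = heard
    return errors


# Invented (spoken, heard) pairs, for illustration only.
SAMPLES = [
    ("sail", "sale"), ("sail", "sale"), ("sail", "sail"),
    ("dime", "time"), ("dime", "time"), ("dime", "time"),
    ("lights", "lights"), ("lights", "lights"), ("lights", "likes"),
]

if __name__ == "__main__":
    print(predictable_errors(SAMPLES))
```

A word that is occasionally misheard in different ways, like the made-up “lights” samples here, does not qualify. A word that is consistently misheard as the same wrong word does, and that consistency is what an attacker can plan around.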
All of this potential for error exposes users to the risk of activating skills they did not intend to – and therefore opens up a new avenue for cybercriminals to exploit. Bad actors can develop skills that prey on predictable errors in hopes of redirecting commands to malicious skills designed to do things like grant access to password information or a home network, or even transmit recordings to a third party. This is known as skill squatting.
Take the ‘coal’ and ‘call’ example – a hacker might know that “call mom” is a common phrase spoken to a smart speaker. They can then develop a fraudulent skill that is triggered by someone speaking “coal mom,” an entirely different phrase that is unlikely to be given as a legitimate command but which the smart speaker could easily confuse with the intended one. The speaker would then run the hacker’s malicious command before linking to and running the correct skill – all while the user is entirely unaware that this has happened.
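The “call mom” scenario can be simulated with plain strings to show why the user notices nothing. Everything here is hypothetical: the `MISHEARD` table stands in for a recognizer’s systematic mistake, the registries and action labels are invented, and no real assistant API is involved.

```python
from typing import Dict, List

# Hypothetical stand-in for a systematic recognition error:
# phrase the user speaks -> phrase the recognizer reports.
MISHEARD: Dict[str, str] = {"call mom": "coal mom"}


def transcribe(spoken: str) -> str:
    """Return what the simulated recognizer hears for a spoken phrase."""
    return MISHEARD.get(spoken, spoken)


def run(spoken: str, registry: Dict[str, List[str]]) -> List[str]:
    """Return the actions of the skill selected by the heard phrase."""
    heard = transcribe(spoken)
    return registry.get(heard, ["no matching skill"])


# Registry containing only the legitimate skill.
LEGITIMATE: Dict[str, List[str]] = {"call mom": ["dial mom"]}

# Same registry after an attacker registers a skill on the misheard phrase.
# The squatted skill performs its own step, then hands off to the real task.
SQUATTED: Dict[str, List[str]] = {
    **LEGITIMATE,
    "coal mom": ["send data to attacker (simulated)", "dial mom"],
}

if __name__ == "__main__":
    print(run("call mom", LEGITIMATE))
    print(run("call mom", SQUATTED))
```

Without the squatted entry, the misheard command simply fails and the user repeats it. With the squatted entry, the final action is still the call the user asked for, so the extra step in front of it is easy to miss.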
Unfortunately for consumers, the UIUC research team was able to successfully squat 25 of the 27 predictable errors at least once – a 93 percent success rate.
Weaponized for Attacks
Although these attacks have not yet been found in the wild, the real-world repercussions are all too easy to imagine. We know from experience – and now research – that speech recognition systems make mistakes that could give cybercriminals access to a user’s home network. By activating a squatted skill, an unsuspecting user could allow a malicious actor to extract information about their account, home network and even passwords before running the requested command. Because these devices typically operate quickly and without screens, the squatted skill would be activated so fast that the user would not notice. As with other attacks, cybercriminals can capitalize on human behavior and predictable errors to hijack intended commands and route users to malicious skills.
So far, there has been no incident of this nature on the scale or magnitude of WannaCry or Meltdown/Spectre to point to as a warning, but as with all innovations, there will be breakdowns in speech/voice processing technology. Both cybersecurity professionals and consumers need to get serious about how to secure these devices. Just think about the nearly 50 percent of Americans who now own smart speakers – that’s a lot of vulnerable users for cybercriminals to target.