In a given year, an estimated 13 to 20 million people worldwide are mechanically ventilated so that they can get enough oxygen into their blood while they suffer from respiratory failure. While this mechanical ventilation — air being forced into and out of the patients’ lungs by a machine — can be life-saving, in severe cases it requires a tube to be inserted 24/7 into the patient’s mouth or directly into the throat via surgery. In either case, the patient loses the ability to speak and, perhaps unsurprisingly, patients report this as the most distressing aspect of being mechanically ventilated.
Well, thanks to cutting-edge machine learning research being carried out in Germany, A.I. is now providing speech for these speechless patients. The research, led by Dr. Arne Peine of the University of Aachen, involves a machine vision model that predicts what the patient is trying to say based on their lip movements — it’s a lip-reading algorithm. Even though this research is in its infancy, it works quite well, with only a 6.3% error rate — dramatically increasing the capacity for these mechanically-ventilated patients to communicate with healthcare workers and loved ones.
In order to train their model, the researchers collected their own dataset consisting of 7,000 short videos of patients uttering several hundred different sentences in German and in English. This relatively small dataset was effective because, while lip-reading can be tricky in ordinary social settings where people move their heads around, patients on a mechanical ventilator can’t do that: their lips stay fixed in one place, in full view of the camera.
As for the machine learning model itself, it involves two distinct stages, each consisting of a relatively simple deep learning architecture. (Deep learning models are made up of several layers of so-called artificial neurons that loosely mimic the way biological brain cells work.) The first stage — called the Audio Feature Estimator — takes in several consecutive frames of video of the lips moving while the patient is speaking. These video frames pass through convolutional neural network layers, which are specialized for identifying spatial patterns in images, and then through layers called gated recurrent units, which are specialized for data that occur in a sequence over time. The Audio Feature Estimator then outputs a prediction of the sound the patient is trying to utter with their lips. In short, the first stage takes in video of lips moving and outputs a prediction of the audio that would accompany that lip movement.
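If it helps to see that in code, here is a minimal Keras sketch of what an Audio Feature Estimator along these lines could look like. To be clear, the frame count, mouth-crop size, layer widths, and number of audio features below are my own illustrative assumptions for the sketch, not the configuration the Aachen team actually used.

```python
# A minimal sketch of a lip-to-audio-feature model (first stage).
# All sizes below are illustrative assumptions, not the paper's settings.
from tensorflow.keras import layers, models

NUM_FRAMES = 30              # assumed number of consecutive video frames per clip
FRAME_H, FRAME_W = 64, 96    # assumed size of the cropped mouth region
NUM_AUDIO_FEATURES = 80      # assumed audio-feature dimension (e.g., mel bins)

def build_audio_feature_estimator():
    frames = layers.Input(shape=(NUM_FRAMES, FRAME_H, FRAME_W, 1))

    # Convolutional layers identify spatial patterns (lip shapes) in each frame.
    x = layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu", padding="same"))(frames)
    x = layers.TimeDistributed(layers.MaxPooling2D())(x)
    x = layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu", padding="same"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D())(x)
    x = layers.TimeDistributed(layers.Flatten())(x)

    # A gated recurrent unit layer models how those lip shapes evolve over time.
    x = layers.GRU(256, return_sequences=True)(x)

    # Output one vector of predicted audio features per video frame.
    audio_features = layers.TimeDistributed(layers.Dense(NUM_AUDIO_FEATURES))(x)
    return models.Model(frames, audio_features, name="audio_feature_estimator")

model = build_audio_feature_estimator()
model.summary()
```

The TimeDistributed wrapper simply applies the same convolutional layers to every frame; the gated recurrent unit then ties those per-frame features together into one sequence.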
The second stage of the model — called the Speech-to-Text stage — takes the audio features output by the first stage and — again using a relatively straightforward deep learning architecture consisting of a convolutional neural network layer and a gated recurrent unit layer — outputs a prediction of the words the patient is trying to say. Those words can then be routed through standard off-the-shelf text-to-speech algorithms, allowing mechanically-ventilated patients to speak in real time, just by moving their lips!
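And here is a matching Keras sketch of what that second, Speech-to-Text stage could look like. Again, the layer widths and the character vocabulary size are assumptions on my part, and a sequence loss such as CTC is just one common way a per-time-step character output like this might be trained; the paper's exact setup may differ.

```python
# A minimal sketch of the second, Speech-to-Text stage.
# Sizes and the character vocabulary are illustrative assumptions.
from tensorflow.keras import layers, models

NUM_AUDIO_FEATURES = 80   # must match the first stage's output dimension
VOCAB_SIZE = 40           # assumed character set (letters, space, blank, etc.)

def build_speech_to_text():
    audio_features = layers.Input(shape=(None, NUM_AUDIO_FEATURES))

    # A 1-D convolutional layer picks up local patterns across neighbouring
    # time steps of the predicted audio features.
    x = layers.Conv1D(128, 5, activation="relu", padding="same")(audio_features)

    # A gated recurrent unit layer accumulates context across the whole utterance.
    x = layers.GRU(256, return_sequences=True)(x)

    # Per-time-step character probabilities, which a decoder turns into words.
    char_probs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
    return models.Model(audio_features, char_probs, name="speech_to_text")

stt_model = build_speech_to_text()
stt_model.summary()
```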
Having now prototyped their approach, the next step for these German clinical researchers is to expand their training dataset and explore more recent deep learning architectures, such as those used for visual style transfer, in order to push the 6.3% error rate even lower and to broaden the range of speechless patients their approach is effective for. If you’d like to dig deeper into the research, check out the full paper, called Two-stage visual speech recognition for intensive care patients.
I loved this practical, socially beneficial application of A.I. and I hope you enjoyed hearing about it too. I encourage you to tag me in social-media posts on LinkedIn or Twitter to suggest super-cool new A.I. applications for future Five-Minute Friday episodes. The idea for today’s episode, for example, came from a LinkedIn comment I was tagged in by the brilliant A.I. product manager Alice Desthuilliers, with whom I’ve had the great pleasure of working regularly at my machine learning company Nebula.
If you don’t already know deep learning well — including the convolutional neural network layers and gated recurrent units mentioned in today’s episode — you can learn all about these layer types and how they fit into deep neural networks from my book Deep Learning Illustrated, which is available in seven languages. Alternatively, the book is available in digital format within the O’Reilly learning platform, where you can also find a video-tutorial version of my deep learning content called Deep Learning with TensorFlow, Keras, and PyTorch. Many employers and universities provide access to O’Reilly; however, if you don’t already have access, you can grab a free 30-day trial of the platform using our special code SDSPOD23.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.