Speech Decoding: Where and How
Our brain-AI interface should hear what you want to say, without you actually having to say it. What could that mean, in practice?
At e184, we’re working towards a world where you can communicate with AI at the speed of thought. That means mastering the art of brain decoding: learning how to take signals gathered from the brain’s magnetic field and understand what you’re really thinking. And since so much of our communication happens through the spoken word, it means we need to be able to decode speech.
We’re just starting our journey. We will be talking with experts to hammer out our approach, and hiring talented researchers who can make our dream a reality. At the same time, we’re laying groundwork for our device in other ways. We need to think about how we build our technology: its shape and layout, what regions of the brain we need to measure and with what resolution. And we want to start thinking about data, about the measurements needed to train a machine learning model for the decoding tasks we need.
Focusing on speech, both of those questions (what hardware to build and what data to collect) share a core: what, exactly, will we decode? Speech isn’t a single phenomenon. It’s a process, one with different levels, from our earliest ideas of what we want to say to the concrete movements we make when saying it. As we plan, we are investigating which of these levels gives us the best path to a brain-AI interface.
The Articulatory Level
At the most concrete level, we could try to measure articulation. When we speak, we move muscles in the tongue, lips, and jaw, and exercise control over the larynx. A decoder could try to represent those movements directly, inferring their kinematics, or could aim at phonemes, the most basic sounds we use to distinguish words.
Aiming for articulatory data would mean focusing on the brain areas most directly related to the muscle movements of speech. The precentral gyrus sends movement signals to the spinal cord, and its ventral region contains the orofacial primary motor cortex, which controls core functionality for speech, including loudness. The adjacent premotor cortex is involved in planning movement, and its signals correspond to more complex movement patterns. Successful approaches that target articulatory data generally measure signals from these regions. This can involve looking for analogues to the signals observed from muscles in electromyographic measurements, or examining the high-gamma brain waves that have been observed to correlate well with the articulatory movements being decoded.
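To make the idea concrete, here is a minimal sketch of what an articulatory-level decoder might look like: high-gamma band power extracted from each channel and regressed onto articulatory kinematics. The data shapes, filter band, and choice of ridge regression are illustrative assumptions, not a description of any particular published system.

```python
# Minimal sketch of an articulatory-level decoder: high-gamma band power
# regressed onto articulatory kinematics. Shapes and parameters are placeholders.
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.linear_model import Ridge

def high_gamma_power(signals, fs, low=70.0, high=150.0):
    """Band-pass each channel in the high-gamma range and square to get power."""
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, signals, axis=-1)
    return filtered ** 2  # instantaneous power per channel and sample

fs = 1000
neural = np.random.randn(64, 10 * fs)      # placeholder: 64 channels, 10 s of recording
kinematics = np.random.randn(10 * fs, 3)   # placeholder traces: lip aperture, jaw height, tongue tip

features = high_gamma_power(neural, fs).T  # (samples, channels)
decoder = Ridge(alpha=1.0).fit(features, kinematics)
predicted_kinematics = decoder.predict(features)  # (samples, 3)
```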
Targeting the articulatory level is intuitive for a device for paralyzed patients, who may still send the nerve signals intended to produce speech even though no speech results. It seems less useful for our goal of a device that healthy users can operate silently. It may also be more vulnerable to muscle artifacts, especially in a non-implanted device, since current approaches rely on features with high frequency and high spatial specificity. That may make the articulatory level inherently limited to the implanted devices currently used to target it, and thus less desirable from our perspective.
The Acoustic Level
Another approach would be to target the sound itself, mapping brain states to a waveform or spectrogram representation. This involves targeting the representation of sound in the auditory cortex: the superior temporal gyrus is involved in processing higher-level features of sound (turning phonemes into linguistic data), while Heschl’s gyrus appears to play an important role in imagined speech, particularly dialogue with oneself.
Some research studies acoustic representations in order to understand how the brain processes auditory information in general; other work, building on successes in decoding imagined speech, sees the potential for a brain-computer interface. Typically focused on paralyzed patients, interfaces along those lines create synthesized speech, for example with a vocoder representation or by using a self-supervised approach to identify key auditory features.
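As a rough illustration, an acoustic-level decoder might map per-frame neural features onto mel-spectrogram frames and hand the result to a vocoder. The sketch below makes that pipeline explicit; the shapes, the ridge regression, and the placeholder data are assumptions for illustration only.

```python
# Minimal sketch of an acoustic-level decoder: neural features regressed onto
# mel-spectrogram frames, which a separate vocoder would render as audio.
import numpy as np
from sklearn.linear_model import Ridge

n_frames, n_channels, n_mels = 500, 64, 80
neural_features = np.random.randn(n_frames, n_channels)  # placeholder per-frame neural features
mel_target = np.random.randn(n_frames, n_mels)           # placeholder mel spectrogram of spoken audio

# Fit a frame-wise mapping from brain activity to acoustic features.
acoustic_decoder = Ridge(alpha=10.0).fit(neural_features, mel_target)
mel_predicted = acoustic_decoder.predict(neural_features)  # (frames, mel bins)

# A vocoder (e.g. Griffin-Lim or a neural vocoder) would then convert
# mel_predicted into a waveform; that step is omitted here.
```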
As we prioritize enabling users to compose text and communicate with AI, the specific vocal waveform is less relevant for our use case than it would be for paralyzed patients, who may want to speak in something approximating their own voice. It is also harder to evaluate and model, being a very high-dimensional representation of speech data.
The Linguistic Levels
Moving up in abstraction from movement and sound, we can ask about the actual words someone intends to convey (the lexical level), or work with parts of words like syllables or even individual phonetic elements (the sublexical level). Speech at these levels is spread more widely in the brain, and different approaches have targeted areas of the sensorimotor cortex and the perisylvian cortex.
These approaches tend to use language-model features, like word logits or n-grams, to structure the data, and aim to produce likelihoods for classifying phonetic elements. The end result, unlike at the previous two levels, can be interpreted directly as text, which is an appealing trait for our goal of a brain-AI interface.
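A minimal sketch of the sublexical version of this idea appears below: a classifier turns windows of neural features into phoneme likelihoods, and a toy bigram model stands in for the language model that would rescore candidate sequences. The phoneme inventory, feature shapes, and transition scores are all placeholder assumptions.

```python
# Minimal sketch of a sublexical (phoneme-level) decoder with a toy language model.
import numpy as np
from sklearn.linear_model import LogisticRegression

phonemes = ["AA", "B", "K", "T", "SIL"]        # toy phoneme inventory
n_windows, n_features = 200, 128
X = np.random.randn(n_windows, n_features)     # placeholder neural feature windows
y = np.arange(n_windows) % len(phonemes)       # placeholder labels covering all phonemes

clf = LogisticRegression(max_iter=1000).fit(X, y)
log_likelihoods = clf.predict_log_proba(X)     # (windows, phonemes)

# Toy bigram log-probabilities over phoneme transitions (uniform here);
# a real system would use an n-gram or neural language model with beam search.
bigram = np.full((len(phonemes), len(phonemes)), np.log(1.0 / len(phonemes)))

def greedy_decode(log_likes, bigram_scores):
    """Pick the phoneme maximizing classifier score plus transition score at each step."""
    sequence = [int(np.argmax(log_likes[0]))]
    for frame in log_likes[1:]:
        prev = sequence[-1]
        sequence.append(int(np.argmax(frame + bigram_scores[prev])))
    return [phonemes[i] for i in sequence]

decoded = greedy_decode(log_likelihoods, bigram)
```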
While linguistic approaches line up well with our goals, they have so far been challenging targets. The most successful systems still depend heavily on particular conditions: they have trouble transferring between subjects, or they work only for a limited vocabulary or a particular task structure or protocol. This has been an issue especially for non-implanted approaches.
The Semantic and Sentence Levels
Finally, one could imagine targeting speech at its most abstract, at the level of meaning. This could mean trying to reproduce whole sentences, or even broader concepts. Neural data at this level is more widely spread through the brain, and studies have found useful signals in large-scale networks like the default mode network, the frontotemporal language-selective network, and the visual network. Targeting this level would thus require fairly broad brain coverage.
Approaches at this level can train on data gathered while subjects read stories, and classify the input with computational tools that capture semantic content, such as contextual language-model embeddings and semantic vectors. Some work at this level results in a ranking of candidate sentences, while other work yields a paraphrased text.
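To illustrate the candidate-ranking variant, the sketch below maps brain responses into a sentence-embedding space and ranks candidate sentences by cosine similarity. The random placeholder data stands in for real recordings and for contextual language-model embeddings; the linear mapping is an assumption chosen for simplicity.

```python
# Minimal sketch of semantic-level decoding as candidate ranking.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_trials, n_voxels, embed_dim = 300, 1000, 256

brain_data = rng.standard_normal((n_trials, n_voxels))             # placeholder brain responses during reading
sentence_embeddings = rng.standard_normal((n_trials, embed_dim))   # placeholder LM embeddings of the read sentences

# Learn a linear map from brain activity into the embedding space.
brain_to_embedding = Ridge(alpha=100.0).fit(brain_data, sentence_embeddings)

def rank_candidates(brain_response, candidate_embeddings):
    """Rank candidate sentences by cosine similarity to the predicted embedding."""
    predicted = brain_to_embedding.predict(brain_response[None, :])[0]
    sims = candidate_embeddings @ predicted / (
        np.linalg.norm(candidate_embeddings, axis=1) * np.linalg.norm(predicted) + 1e-8
    )
    return np.argsort(-sims)  # indices of candidates, best first

candidates = rng.standard_normal((10, embed_dim))  # embeddings of 10 candidate sentences
best_first = rank_candidates(brain_data[0], candidates)
```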
Such a paraphrase would be more relevant to our goals, but would come with clear downsides for users, who might feel that their words are not being accurately represented. It also raises privacy concerns, as a device trained purely to pick up meaning could pick up meanings that the user does not intend to communicate.
Conclusions
Our analysis suggests that the articulatory and acoustic levels, while important for medical devices, are less well suited to our goals. Some mixture of linguistic and semantic content, instead, seems ideal. Data at the semantic level could patch over the difficulties that purely linguistic-level approaches have in non-implanted modalities, while lexical or sublexical data is needed to make sure that the device accurately conveys what the user actually intends to say.
With that said, at this stage we shouldn’t rule anything out. We want to build a versatile device, one that can target a wide variety of brain regions. We don’t just want to reproduce the existing successes of implants in a non-implanted modality. While that would be a great success in itself, the field is already showing some progress in that direction. What we want, instead, is a platform to build on, one that gets us closer to our ultimate goal of extending human capabilities, fluidly incorporating AI into our own cognition to safeguard our voice in the future. Our best path at the present time may still be a familiar one, growing out of current-day successes…but it could also be something radically new.
As we investigate the way forward, we especially want to discuss different modeling approaches: different sources of data, different embeddings, different targets and architectures. Soon we will be reaching out to expert contacts, getting multiple opinions to help us find the best way to begin. If you’d like to join that conversation, or to work with us on other frontier BCI applications, contact Peter Zhegin at p@e184.com.
And if you’ve got your own thoughts on the topic more broadly, we’d love to hear them. What levels should we prioritize for speech decoding, and what methods should we use? Let us know in the comments!



