Work on automatic speech recognition exhibits the characteristic AI pattern of early success in restricted domains followed by failure to scale. Dreyfus argues that the failure arises because the systems of the time treat speech as the classification of fixed sound patterns in a feature space, whereas human hearing is guided by Gestalt perception, in which the overall meaning of an utterance determines how individual sounds and phonemes are perceived.
By Hubert L. Dreyfus, from What Computers Can't Do
Key Arguments
- Dreyfus reports Oettinger’s review: 'There was considerable initial success in building apparatus that would eke out a sequence of discrete phonemes out of the continuous speech waveform.'
- Oettinger summarizes the feature‑space method: 'Usually the trick is something like identifying a number of features, treating these as if they were coordinates in some hyperspace, then passing planes that cordon off, if you will, different blocks of this space. If your speech event falls somewhere within one of these blocks you say that it must have been that sound and you recognize it.'
- Oettinger notes the scaling breakdown: 'This game was fairly successful in the range of twenty to a hundred or so distinct things, but after that, these blocks become so small and clustered so close together that you no longer can achieve any reliable sort of separation. Everything goes to pot.'
- Oettinger’s phenomenological remark suggests that phonemes are retrospective constructs: 'Perhaps . . . in perception as well as in conscious scholarly analysis, the phoneme comes after the fact, namely . . . it is constructed, if at all, as a consequence of perception not as a step in the process of perception itself.'
- Oettinger is 'driven' to a Gestalt view: 'maybe there is some kind of Gestalt perception going on, that here you are listening to me, and somehow the meaning of what I'm saying comes through to you all of a piece. And it is only a posteriori, and if you really give a damn, that you stop and say, "Now, here was a sentence and the words in it were of such and such type, and maybe here was a noun and here was a vowel and that vowel was this phoneme and the sentence is declarative, etc."'
- Dreyfus interprets this as evidence that 'the total meaning of a sentence (or a melody or a perceptual object) determines the value to be assigned to the individual elements,' directly contradicting the AI assumption that recognition proceeds from low‑level elements up.
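The feature-space method and its scaling breakdown, as Oettinger describes them, can be illustrated with a small sketch. It is not a reconstruction of any system Oettinger reviewed. It assumes a two-dimensional feature space, class prototypes placed on an even grid in the unit square, Gaussian noise of fixed size on each utterance, and a nearest-prototype rule, whose decision regions are blocks bounded by flat planes. The names make_prototypes, classify, and accuracy, and all numeric settings, are hypothetical choices made for illustration.

```python
import random


def make_prototypes(side):
    """Place side*side class prototypes on an even grid in the unit square."""
    step = 1.0 / side
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(side) for j in range(side)]


def classify(point, prototypes):
    """Return the index of the prototype nearest to the point.

    The regions this rule carves out are blocks bounded by flat planes
    (the perpendicular bisectors between neighbouring prototypes).
    """
    x, y = point
    return min(
        range(len(prototypes)),
        key=lambda k: (x - prototypes[k][0]) ** 2 + (y - prototypes[k][1]) ** 2,
    )


def accuracy(side, noise=0.05, trials=1000, seed=0):
    """Fraction of noisy samples assigned to the class that produced them.

    Assumption: every sample is its class prototype plus Gaussian noise
    whose size does not shrink as more classes are packed into the space.
    """
    rng = random.Random(seed)
    prototypes = make_prototypes(side)
    correct = 0
    for _ in range(trials):
        label = rng.randrange(len(prototypes))
        px, py = prototypes[label]
        sample = (rng.gauss(px, noise), rng.gauss(py, noise))
        correct += classify(sample, prototypes) == label
    return correct / trials


if __name__ == "__main__":
    for side in (2, 4, 8, 16):
        print(f"{side * side:4d} classes: accuracy {accuracy(side):.2f}")
```

Under these assumptions, the same fixed space is divided into ever smaller blocks while the variability of each sound stays the same, so the measured accuracy falls as the number of classes grows. The sketch illustrates only the crowding effect; it says nothing about whether a meaning-first account of perception is correct.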
Source Quotes
His analysis of speech recognition work is worth reproducing in detail, both because this pattern recognition problem is important in itself and because this work exhibits the early success and subsequent failure to generalize which we have come to recognize as typical of artificial intelligence research. There was considerable initial success in building apparatus that would eke out a sequence of discrete phonemes out of the continuous speech waveform. While phonemic analysis has been dominant in that area, numerous other approaches to this decoding problem have also been followed.
All have shared this initial degree of success and yet all, so far, have proved to be incapable of significant expansion beyond the recognition of the speech of a very few distinct individuals and the recognition of a very few distinct sound patterns whether they be phonemes or words or whatever. All is well as long as you are willing to have a fairly restricted universe of speakers, or sounds, or of both. Within these limitations you can play some very good tricks.
There are now lots of machines, some experimental, some not so experimental, that will recognize somewhere between 20 and 100 distinct sound patterns, some of them quite elaborate. Usually the trick is something like identifying a number of features, treating these as if they were coordinates in some hyperspace, then passing planes that cordon off, if you will, different blocks of this space. If your speech event falls somewhere within one of these blocks you say that it must have been that sound and you recognize it.
This game was fairly successful in the range of twenty to a hundred or so distinct things, but after that, these blocks become so small and clustered so close together that you no longer can achieve any reliable sort of separation. Everything goes to pot.[4] This leads Oettinger to a very phenomenological observation: Perhaps . . . in perception as well as in conscious scholarly analysis, the phoneme comes after the fact, namely . . . it is constructed, if at all, as a consequence of perception not as a step in the process of perception itself.[5] This would mean that the total meaning of a sentence (or a melody or a perceptual object) determines the value to be assigned to the individual elements.
Oettinger goes on reluctantly to draw this conclusion: This drives me to the unpopular and possibly unfruitful notion that maybe there is some kind of Gestalt perception going on, that here you are listening to me, and somehow the meaning of what I'm saying comes through to you all of a piece. And it is only a posteriori, and if you really give a damn, that you stop and say, "Now, here was a sentence and the words in it were of such and such type, and maybe here was a noun and here was a vowel and that vowel was this phoneme and the sentence is declarative, etc."[6] Phenomenologists, not committed to breaking down the pattern so that it can be recognized by a digital computer, while less appalled, are no less fascinated by the gestalt character of perception.
Key Concepts
- There was considerable initial success in building apparatus that would eke out a sequence of discrete phonemes out of the continuous speech waveform.
- All is well as long as you are willing to have a fairly restricted universe of speakers, or sounds, or of both.
- identifying a number of features, treating these as if they were coordinates in some hyperspace, then passing planes that cordon off, if you will, different blocks of this space.
- This game was fairly successful in the range of twenty to a hundred or so distinct things, but after that, these blocks become so small and clustered so close together that you no longer can achieve any reliable sort of separation. Everything goes to pot.
- the phoneme comes after the fact, namely . . . it is constructed, if at all, as a consequence of perception not as a step in the process of perception itself.
- maybe there is some kind of Gestalt perception going on, that here you are listening to me, and somehow the meaning of what I'm saying comes through to you all of a piece.
- the total meaning of a sentence (or a melody or a perceptual object) determines the value to be assigned to the individual elements.
Context
Dreyfus uses Oettinger’s survey of speech recognition research to exemplify AI’s limited, feature‑based pattern recognition and to support a Gestalt, meaning‑first model of human perception.