Copyright 1998 ACM ASSETS '98. Appears in Proceedings of the Third International ACM SIGCAPH Conference on Assistive Technologies, April 15-17, 1998, Marina del Rey, CA, USA.
Harriet J. Fell, College of Computer Science, Northeastern University, Boston, Massachusetts 02115, USA. Tel: +1-617-373-2198, fell@ccs.neu.edu
Joel MacAuslan and Karen Chenausky, Speech Technology and Applied Research, Lexington, Massachusetts 02173, USA. Tel: +1-781-863-0310, starcorp@ix.netcom.com
Linda J. Ferrier, Department of Speech-Language Pathology and Audiology, Northeastern University, Boston, Massachusetts 02115, USA. Tel: +1-617-373-5754, l.ferrier@nunet.neu.edu
Our human judge and EVA showed better agreement (93%) in counting the number of utterances than the trained phoneticians in the Oller and Lynch study [18]. Of the 411 commonly recognized utterances, EVA and our human judge agreed on 80% of the categorizations by duration and on 87% of the categorizations by frequency.
In the current phase of our project, we are developing tools to classify babbles in Oller's "Canonical Syllable Stage" (6 to 11 months): well-formed syllables and reduplicated sequences of such syllables. There are several aspects of these vocalizations that we plan to analyze automatically:
The Liu-Stevens Landmark Detection Program was developed as part of an adult-speech recognition system for analysis of continuous speech [11]; that system is founded on Stevens' acoustic model of speech production [23]. Central to this theory are landmarks: points in an utterance around which listeners extract information about the underlying distinctive features. They mark perceptual foci and articulatory targets. The most common landmarks are acoustically abrupt and are associated with consonantal segments, e.g., stop closures and releases. Such consonantal closures are mastered by infants when they produce syllabic utterances. Liu's program was tested on a low-noise (signal-to-noise ratio = 30 dB) database, LAFF, of four adult speakers speaking 20 syntactically correct sentences. Error rates were determined for three types of landmarks -- glottis, sonorant, and burst. The overall error rate was 15%, with the best performance for glottis landmarks (5%).
The Liu-Stevens program first sends speech input through a general processing stage in which a spectrogram is computed and divided into six frequency bands. Then coarse- and fine-processing passes are executed. In each pass, an energy waveform is constructed in each of the six bands, the time derivative of the energy is computed, and peaks in the derivative are detected. Localized peaks in time are found by matching peaks from the coarse- and fine-processing passes. These peaks represent times of abrupt spectral change in the six bands.
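The coarse/fine two-pass scheme just described can be sketched as follows. The function names, the peak-picking rule, and the matching tolerance are our illustrative assumptions, not Liu's actual parameters.

```python
# Sketch of per-band energy, energy-derivative peak detection, and
# coarse-to-fine peak localization, per the description above.
import numpy as np

def band_energy(spectrogram, freqs, lo, hi):
    """Sum spectrogram energy over one frequency band at each frame."""
    mask = (freqs >= lo) & (freqs < hi)
    return spectrogram[mask, :].sum(axis=0)

def derivative_peaks(energy_db, threshold):
    """Frames where the energy time-derivative peaks above a threshold."""
    d = np.diff(energy_db)
    return [t for t in range(1, len(d) - 1)
            if d[t] > threshold and d[t] >= d[t - 1] and d[t] >= d[t + 1]]

def match_passes(coarse_peaks, fine_peaks, tol):
    """Localize each coarse-pass peak with the nearest fine-pass peak."""
    matched = []
    for c in coarse_peaks:
        near = [f for f in fine_peaks if abs(f - c) <= tol]
        if near:
            matched.append(min(near, key=lambda f: abs(f - c)))
    return matched
```

The matched peaks then mark candidate times of abrupt spectral change in each band, which the type-specific stage examines further.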
In type-specific processing, the localized peaks direct processing to find three types of landmarks. These three types are glottis (+g/-g), sonorant (+s/-s), and burst (+b/-b) landmarks.
The first change was to adjust the boundaries of the six frequency bands to better capture abrupt changes in F0, F2, and F3 (F1 is unused). Using ranges for formants in infant vocalizations cited in the literature [10, 4, 9], we set the bounds on the six frequency bands as shown in the table below. Additionally, we created a seventh band, composed of the union of bands three through six, for future work in detecting burst landmarks. (Bursts are not detected in these infant recordings as reliably as they had been in Liu's adult recordings.) See Table 1.
Table 1. Frequency-band bounds for adult and infant vocalizations.

Band | Bounds for an Adult Male | Purpose | Bounds for an Infant |
1 | 0-400Hz (F0 ~ 150Hz) | To capture F0 | 150-600Hz (F0 ~ 400Hz) |
| (F1 ~ 500Hz) | Ignore F1 | (F1 ~ 1000Hz) |
2 | 800-1500Hz | For intervocalic consonantal segments; a zero is introduced in this range: "Bands 2 and 3 overlap in the hope that one of these bands will capture a spectral prominence." [12] At a sonorant consonantal closure, spectral prominences above F1 show a marked, abrupt decrease in energy. | 1200-2500Hz |
3 | 1200-2000Hz (F2 ~ 1500Hz) | Onsets and offsets of aspiration and frication will lie in at least one of bands 3-6. | 1800-3000Hz (F2 ~ 3000Hz) |
4 | 2000-3500Hz (F3 ~ 2500Hz) |   | 3000-4000Hz |
5 | 3500-5000Hz (F3 ~ 5000Hz) |   | 4000-6000Hz |
6 | 5000-8000Hz | Spans the remaining frequencies up to 8000Hz. | 6000-8000Hz |
7 |   | A threshold might be used on this band to detect +b/-b landmarks. | 1800-8000Hz |
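The infant band bounds of Table 1 can be collected into a small lookup structure for the band-splitting stage. This is a sketch; the structure and function names are ours, not from the EVA source.

```python
# Infant frequency-band bounds from Table 1 (Hz).  Band 7 is the union
# of bands 3-6, reserved for future burst (+b/-b) detection.
INFANT_BANDS = {
    1: (150, 600),
    2: (1200, 2500),
    3: (1800, 3000),
    4: (3000, 4000),
    5: (4000, 6000),
    6: (6000, 8000),
    7: (1800, 8000),
}

def bands_for_frequency(hz):
    """Return the list of bands (1-7) whose bounds contain a frequency."""
    return [b for b, (lo, hi) in INFANT_BANDS.items() if lo <= hz < hi]
```

Note that bands 2 and 3 overlap by design (1800-2500Hz), so a spectral prominence in that range registers in both.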
We use a high-pass filter (cutoff: 150Hz) on our source files before applying the landmark program. This lowers the interference from ambient noise.
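A 150Hz high-pass pre-filter like the one described above can be sketched with SciPy. The filter order and design (a 4th-order Butterworth applied zero-phase) are our assumptions; the original study does not specify them.

```python
# Zero-phase high-pass pre-filter with a 150Hz cutoff, to reduce
# ambient (low-frequency) noise before landmark detection.
import numpy as np
from scipy.signal import butter, filtfilt

def highpass_150(signal, sample_rate):
    """High-pass filter the signal at 150Hz (assumed 4th-order design)."""
    nyquist = sample_rate / 2.0
    b, a = butter(4, 150.0 / nyquist, btype="highpass")
    return filtfilt(b, a, signal)
```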
We were not satisfied with the initial marking of voicing (+g/ -g) done by Liu's program on our digitized samples. This algorithm works by looking only at the energy in Band 1. It assumes that high energy in this band indicates the presence of voicing. Our infant vocalizations usually exhibited lower energy than the adult male samples in the LAFF and TIMIT databases used by Liu. The ESPS get_f0 function [6] (which measures periodicity to calculate F0 in voiced regions) appeared to be more reliable at finding the voiced parts of the infant signals. We integrated this with Liu's program by multiplying Liu's coarse-pass Band 1 energy by the "probability of voicing" returned by get_f0 (a 0/1 value).
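The integration step amounts to gating Band 1 energy by the voicing decision, frame by frame. In this sketch, `voicing_prob` stands in for the 0/1 "probability of voicing" that get_f0 returns per frame; the array names are illustrative assumptions.

```python
# Gate Liu's coarse-pass Band 1 energy by get_f0's voicing decision,
# so that aperiodic noise with energy in Band 1 is not mistaken for voicing.
import numpy as np

def gated_band1_energy(band1_energy, voicing_prob):
    """Zero out Band 1 energy in frames judged unvoiced (prob = 0)."""
    band1_energy = np.asarray(band1_energy, dtype=float)
    voicing_prob = np.asarray(voicing_prob, dtype=float)
    return band1_energy * voicing_prob
```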
Experimentation with get_f0 resulted in settings of Min F0 = 150Hz and Max F0 = 1200Hz. (An earlier study [7] of four infants found an average F0 between 290Hz and 320Hz.) Utterances with fundamental frequency in this range were handled appropriately by the landmark software. We did not attempt to handle squeals, i.e., vocalizations with F0 > 1200Hz. The lower threshold of 150Hz was sufficient to filter out noise. This left sounds that might be classified as growls but that were nonetheless suitable for analysis by the landmark program.
The program applies a variety of rules to check and possibly modify the initial +g/-g settings. Adult rules and infant rules differ for two reasons. First, in data samples of an adult male reading single sentences, no pauses were expected, so the original code inserted a +g/-g pair into any -g/+g interval of duration greater than 150ms. In a ten-second sequence of infant babbles, there are likely to be pauses of at least 150ms, so we adopted 350ms for this insertion rule. Second, we adjusted thresholds related to vocalic energy levels to accommodate the faint babbles uttered by some of our infants.
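The adjusted insertion rule can be sketched as follows. The representation of landmarks as (time, label) pairs and the placement of the inserted pair at the interval midpoint are our assumptions; only the 350ms threshold and the +g/-g insertion come from the text.

```python
# Insert a +g/-g pair into any -g ... +g interval longer than a
# threshold (350ms for infant babble, vs. 150ms in the adult rules).

def insert_into_long_gaps(landmarks, max_gap_s=0.350):
    """landmarks: time-sorted list of (time_s, label), labels '+g'/'-g'."""
    out = []
    for i, (t, label) in enumerate(landmarks):
        out.append((t, label))
        if label == "-g" and i + 1 < len(landmarks):
            t_next, next_label = landmarks[i + 1]
            if next_label == "+g" and t_next - t > max_gap_s:
                # Assumed placement: pair inserted at the gap midpoint.
                mid = (t + t_next) / 2.0
                out.append((mid, "+g"))
                out.append((mid, "-g"))
    return out
```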
We started with an inter-judge reliability study to assure the consistency of independent hand-marking of spectrograms by the judges. We then conducted a small study comparing the results of EVA to the landmarks agreed on by the judges.
Ages (in months) at which each subject was recorded:
Subject | Sex | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
TW | boy | x |   | x | x |   | x |   |
EK | boy | x |   | x | x |   | x | x |
NZ | boy |   | x | x |   | x | x |   |
JD | boy |   |   |   | x | x | x | x |
LS | girl |   | x | x |   | x | x |   |
Parents filled out an Infant State Form to confirm that their infant had followed a normal schedule and was in an appropriate mood for recording. Parents who reported that their infant was in an atypical state were asked to reschedule. The infants were then recorded interacting with their parents in a sound-proof booth. Digital recordings were made using a high-quality lavaliere microphone with a wireless transmitter and receiver. The following equipment was used for recording:
To remove parents' utterances, non-speech sounds, and background noise, the tapes were screened to mark the locations and phonetic content of the reduplicated babbles. We digitized only those sections containing clear babbling, omitting sections obscured by extraneous noise or voices. The resulting files were high-pass filtered to remove noise below 150Hz. All files contained recordings of syllables that consisted of, at minimum, a vowel nucleus (a vocant) and, optionally, an oral or glottal constriction (a closant).
7.6.1 Inter-Judge Reliability
The criterion for inter-judge agreement on perceptual hand-marking of landmarks was set at 95%. We defined inter-judge reliability as
(number of landmarks agreed upon by both judges) / (number of agreements + number of disagreements)
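The reliability measure defined above is a simple ratio; as a one-line sketch:

```python
# Inter-judge reliability: agreements / (agreements + disagreements).

def inter_judge_reliability(n_agreed, n_disagreed):
    return n_agreed / (n_agreed + n_disagreed)
```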
Inter-judge reliability on the existence, type, and placement of landmarks was assessed by having the two trained phoneticians independently mark 15 arbitrarily selected, digitized samples not in the training set. These samples ranged in length from 3 to 10 seconds. Each judge used headphones to listen to the infant utterances, and viewed both the time waveform and a wide-band spectrogram on-screen. The judges worked independently, entering their judgments into separate files. At no point did they refer to analyses carried out by the EVA program.
Since the longest file was 10 seconds long, the worst screen resolution was approximately 10 ms per pixel. (Waveforms were typically scaled to the screen width.) The accuracy was limited primarily by screen resolution and each judge's ability to manipulate the computer's mouse.
Between them, the two judges marked a total of 204 landmarks (see Figure 1 for an example). Of these, 198 were marked by both judges at nearly the same position; one judge marked 4 landmarks that the other did not, and the other marked 2 that the first did not. The two judges agreed on presence and position, to within 50ms, on 185 of the 204 landmarks (91%), and to within 100ms, on 194 of the 204 landmarks (95%). We felt that this agreement met our original training criterion of 95% inter-judge reliability on perceptual hand-marking of landmarks. All utterances marked after the completion of this test were analyzed by one judge; problematic utterances were marked and agreement jointly negotiated with the other judge.
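The within-tolerance agreement counts above can be computed by pairing the two judges' landmark times. This is a sketch; the greedy nearest-first matching strategy is our assumption, since the paper does not specify a matching procedure.

```python
# Count one-to-one pairs of landmark times (in seconds) from two
# judges that fall within a tolerance, pairing nearest marks first.

def count_agreements(times_a, times_b, tol_s):
    pairs = sorted((abs(a - b), i, j)
                   for i, a in enumerate(times_a)
                   for j, b in enumerate(times_b))
    used_a, used_b, n = set(), set(), 0
    for dist, i, j in pairs:
        if dist <= tol_s and i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            n += 1
    return n
```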
7.6.2 EVA-to-Human-Judge Comparison
To test EVA-human reliability, the following rules were used to match EVA's landmark vocabulary with that of the human judges. Landmarks that were further than 200ms apart were assumed to represent different events.
For the purposes of the first test, we considered as valid only those landmarks hand-marked and agreed upon to within 100ms by both human judges. As shown in the table, we did not count h# (glottal closure) marks at the start of voiced regions because these are not (yet) in EVA's repertoire.
Human Judges | EVA |
h# (at voicing start) hv hv | (do not match/count) +g -g |
or | |
h# (within voicing) hv hv | +s (no correspondence) -g or -s (if both marked, use -g, do not count -s) |
hv (no preceding h#) hv | +g -g |
em (only one instance) em hv hv | +g (no correspondence) -s -g |
em em (no hv/hv following) | +g -g |
ay (initial glide) | (ignore region at this point) |
sh (only one instance) hv hv | +s (no correspondence) -g |
In this test, we considered the human consensus as the standard. H denotes the total number of validly hand-marked landmarks. We counted three kinds of errors for EVA. A deletion is a missed landmark, i.e., one where EVA detected no landmark in the vicinity of the hand-labeled landmark; the deletion rate is D/H, where D is the number of deletions.
An insertion is a false landmark, i.e., one detected by EVA but not by the judges; the insertion rate is I/H, where I is the number of insertions.
A shift is a landmark, found by EVA and matched by our correspondence rules, that is more than 100ms from the corresponding hand-labeled landmark; the shift rate is S/H, where S is the number of shifts.
The total error rate is then (D + I + S)/H.
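These rates are straightforward to compute; as a sketch, assuming (consistent with the definition of H above) that each rate is taken relative to H:

```python
# Deletion, insertion, and shift rates, each relative to the number H
# of validly hand-marked landmarks, plus their total.

def error_rates(H, deletions, insertions, shifts):
    return {
        "deletion": deletions / H,
        "insertion": insertions / H,
        "shift": shifts / H,
        "total": (deletions + insertions + shifts) / H,
    }
```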
7.6.3 First Comparison Experiment
As a preliminary test of EVA's agreement with human landmark detection, we ran EVA on the 15 digitized samples that were marked by both judges. Together, these samples contained 128 validly marked landmarks (agreed on by both judges). EVA had 9 insertions and 3 deletions. Of the 125 landmarks found by EVA that corresponded to the valid human landmarks, only 1 was more than 100ms from either judge's mark. The error rates were therefore 3/128 (2.3%) deletions, 9/128 (7.0%) insertions, and 1/128 (0.8%) shifts, for a total error rate of 13/128 (10.2%).
The insertions were a +s/ -s pair 12 ms apart, three +g/ -g pairs 32, 35, and 37ms apart, and an unmatched -g. The deletions were all at the end of one digitized sample and just after a region that was not counted due to disagreement between the judges.
The table below shows the statistics on the 124 landmarks that were agreed on by both judges and corresponded to EVA landmarks (within 100ms):
Distance | EVA to judge 1 | EVA to judge 2 | inter-judge |
Average | 15.4ms | 13.7ms | 15.0ms |
Standard Deviation | 16.1ms | 16.3ms | 16.3ms |
Maximum | 93ms | 93ms | 90ms |
number > 50ms | 5 | 6 | 6 |
percent > 50ms | 4.0% | 4.8% | 4.8% |
number > 70ms | 4 | 4 | 2 |
percent > 70ms | 3.2% | 3.2% | 1.6% |
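The statistics in the table above are simple summaries of the EVA-to-judge distances; a sketch:

```python
# Summary statistics over a list of |EVA - judge| landmark distances
# in milliseconds, as reported in the comparison tables.
import numpy as np

def distance_stats(distances_ms):
    d = np.asarray(distances_ms, dtype=float)
    return {
        "average": d.mean(),
        "std": d.std(),
        "median": float(np.median(d)),
        "max": d.max(),
        "n_over_50": int((d > 50).sum()),
        "n_over_70": int((d > 70).sum()),
    }
```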
EVA counted 67 syllables in these samples while the judges found 64. In places where one syllable boundary was missing, 0.5 syllable was counted. EVA deleted 1.5 syllables found by the judges and inserted 4 syllables, the +s/-s and +g/-g pairs described above.
7.6.4 Second Comparison Experiment
In this test, the digitized samples were hand-labeled by only one phonetician and compared to EVA's performance. (Based on the previous two-judge comparison, perhaps 2-3% of the differences should be attributed to the judges, not to EVA.) The 11 samples were all of subject LS, five recorded at 7 months and six recorded at 8 months. The phonetician picked these samples to test EVA because they contained relatively few phonemes not yet in EVA's repertoire, e.g., glides. The 7-month samples, however, were known to contain substantial amounts of vocal fry, which might interfere with EVA's evaluation.
In the 11 digitized samples, 150 landmarks were marked by the human judge. EVA detected 2 landmarks that the judge did not and omitted 17 that the judge noted. Of the remaining 135 landmarks found by EVA that paired with the valid human landmarks, only 2 were more than 100ms from their corresponding marks. The rates of disagreement (which may or may not be EVA errors) were:
The table below shows the statistics on the 133 landmarks that corresponded to EVA landmarks to within 100ms (i.e. all but the two shift errors).
Distance between Landmarks | EVA-to-Judge |
Average | 19.2ms |
Standard Deviation | 19.7ms |
Median | 13.1ms |
Maximum | 86.5ms |
EVA detected one +s/-s pair not found by the phonetician (judge); a second phonetician deemed that this pair was unclear. EVA omitted 19 landmarks labeled by the human judge. Of these, 12 landmarks (6 syllables) came from the 7-month recordings and represented syllables with substantial vocal fry; 4 landmarks (2 syllables) were found in a single digitized sample in a region of low vocal energy; the others may have been erroneous deletions by EVA. In all cases, the second phonetician concluded that the disagreements were within the bounds of typical inter-judge (human) differences.
We plan to expand EVA to include categorization of syllables according to initial sounds (whether there is an initial consonant and, if so, whether it is a stop, fricative, nasal, or liquid) and to use this to develop an inventory of phones in infants' repertoires at each age level. The data will be used to establish an initial normative database on the pre-speech development of typically developing infants.
This work was sponsored in part by NIH grant #R41-HD34686.
[2] Bayley, N. 1969. Manual for the Bayley scales of infant development. New York: The Psychological Corporation.
[3] Bickley, C. 1989. Acoustic evidence for the development of speech, RLE Technical Report No. 548, 111-124.
[4] de Boysson-Bardies, B., Halle, P., Sagart, L., and Durand, C., 1989. A crosslinguistic investigation of vowel formants in babbling. J. Child Lang., 16, 1-17.
[5] Capute, A. J., Palmer, F. B., Shapiro, B. K., Wachtel, R. C., & Accardo, P. J. 1981. Early lan-guage development: Clinical application of the language and auditory milestone scale. In R.E. Stark (Ed.), Language Behavior in Infancy and Early Childhood. New York.
[6] ESPS with Waves, 1993. Entropic Research Laboratory, Inc. AT&T Bell Laboratories
[7] Fell, H.J., Ferrier, L.J., Schneider, D., & Mooraj, Z. 1996. EVA, an early vocalization analyzer: an empirical validity study of computer categorization. Proceedings of Assets '96, 57-61.
[9] Kent, R. D. & Murray, A. D. 1991. Acoustic features of infant vocalic utterances at 3, 6 and 9 months. In R. J. Baken & R. G. Daniloff (Eds.), Readings in Clinical Spectrography of Speech (pp. 402-414). New Jersey: Singular Publishing Group and Kay Elemetrics.
[10] Kent, R. D. & Read, C. 1992. The Acoustic Analysis of Speech, San Diego: Singular Publishing Group, Inc.
[11] Liu, S. 1994. Landmark detection of distinctive feature-based speech recognition. Journal of the Acoustical Society of America, 96, 5, Part 2, 3227.
[12] Liu, S. 1995. Landmark Detection for Distinctive Feature-based Speech Recognition. Ph.D. Thesis. Cambridge, MA: Mass. Inst. Tech.
[13] Locke, J. L. 1989. Babbling and early speech: Continuity and individual differences. First Language, 9, 191-206.
[14] Locke, J. L. 1993. The child's path to spoken language. Cambridge, MA: Harvard Univ. Press.
[15] Menyuk, P., Liebergott, J., Shultz, M., Chesnick, M, & Ferrier, L.J. 1991. Patterns of Early Language Development in Premature and Full Term Infants. J. Speech and Hearing Res. 34, 1.
[16] Menyuk, P., Liebergott, J., & Shultz, M. 1995. Early Patterns of Language Development in Full Term and Premature Infants. Lawrence Erlbaum Associates: Hillsdale, N.J.
[17] Oller, D.K. 1980. The emergence of sounds of speech in infancy. In G. H. Yeni-Komshian, J.F. Kavanagh, & C.A. Ferguson (eds.), Child Phonology Volume 1: Production . New York: Academic Press.
[18] Oller, D.K. & Lynch, M.P. 1992. Infant utterances and innovations in infraphonology: Toward a broader theory of development and disorders. In Ferguson, C.A., Menn, L., & Stoel-Gammon, C. (Eds.), Phonological Development: Models, Research, Implications (pp. 509-536). Timonium, Maryland: York Press.
[19] Oller, D. K. & Seibert, J. M. 1988. Babbling of prelinguistic mentally retarded children. American Journal on Mental Retardation, 92, 369-375.
[20] Proctor, A. 1989. Stages of normal noncry vocal development in infancy: A protocol for assessment. Topics in Language Disorders, 10 (1), 26-42
[21] Stark, R.E. 1981. Stages of Development in the first year of life, In Child Phonology: Vol. I, G. H. Yeni-Komshian, C. A. Ferguson, J. Kavanagh (Eds.), New York: Academic Press.
[22] Stevens, K.N. 1992. Lexical access from features. Speech Communication Group Working Papers, Volume VIII, Research Laboratory of Electronics, MIT, 119-144.
[23] Stevens, K.N., Manuel, S., Shattuck-Hufnagel, S., & Liu, S. 1992. Implementation of a model for lexical access based on features. Proc. Int'l. Conf. Spoken Language Processing, Banff, Alberta, 1, 499-502.