A Study of Speech Recognition based on Inner Structure

by Isi Shun
< Message from the author >


The study of speech science has prospered to some extent, and its applications are useful to us.
However, I feel that we still lack an understanding of the principle of human speech recognition.
What is needed is not to process superficial features, but to discover the hidden mechanism behind the diversity of the speech waveform, which takes many shapes yet carries one common meaning.
So I made this web page in the hope of encouraging the study and development of a proper speech recognition principle.



< Speech Recognition Based On Inner Structure >


The human speech waveform varies with the individual's vocal cords, the speaker's condition, and the surrounding environment. There is no fixed reference value.
A generating mechanism, or structure, lies behind the observed speech signal. Let us call it the inner structure.
The nature of the speech signal is diversity: many forms, but one common phonetic meaning.
Inner structures are grouped by the movement conditions of the vocal organs, such as the range of mouth opening and the range of tongue movement, and the generative mechanism has a certain freedom that causes the diversity of the speech signal.
Speech recognition based on inner structure is a pattern recognition method that estimates which inner structure produced the spoken phoneme.

(1) Because of the structure of the speech organs and the limits of their movement, and because utterances must be distinguishable, the significant sound-producing position of the organs (the place of articulation) must go either to an extreme or to the neutral position. Therefore, the combinations of positions that correspond to phonemes are limited to a few. The state is quantized.
(2) Because of constraints imposed by relations within the structure, the feature parameters of speech (for example, the set of formant frequencies) are not independent of each other. They cannot take values independently. If the first and second formants are known, the third formant can be predicted by calculating the generation mechanism of the structure. Applying the structure in this way is a kind of structure fitting.
(3) Everybody has different organs and nobody speaks exactly as others do, yet speech itself is understandable to everyone. This leads to the concept of relative valuation (like the concept of frequency ratio) and of a starting point (a key).
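
As a rough illustration of points (2) and (3), the sketch below uses a lossless two-tube model (closed at the glottis, open at the lips), whose resonances satisfy tan(k*l1) * tan(k*l2) = A2/A1 with k = 2*pi*f/c. The function name two_tube_resonances, the tube dimensions, and the speed of sound of 340 m/s are assumptions chosen for illustration, not values taken from this page. Once the structure is fixed, all resonances follow from it together, and scaling every length by the same factor shifts the frequencies but leaves their ratios unchanged.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value for air


def two_tube_resonances(l1, l2, area_ratio, f_max=4000.0, step=1.0):
    """Resonances (Hz) of a lossless two-tube model, closed at the glottis, open at the lips.

    l1, l2     : lengths (m) of the back and front tubes
    area_ratio : A2 / A1, front tube area over back tube area
    The resonances are the roots of
        g(f) = sin(k*l1)*sin(k*l2) - area_ratio*cos(k*l1)*cos(k*l2),  k = 2*pi*f/c,
    which is the condition tan(k*l1)*tan(k*l2) = A2/A1 written without poles.
    """
    def g(f):
        k = 2.0 * math.pi * f / SPEED_OF_SOUND
        return (math.sin(k * l1) * math.sin(k * l2)
                - area_ratio * math.cos(k * l1) * math.cos(k * l2))

    roots = []
    f_lo, g_lo = 0.0, g(0.0)
    while f_lo < f_max:
        f_hi = f_lo + step
        g_hi = g(f_hi)
        # A sign change between neighbouring grid points brackets one root.
        if g_lo != 0.0 and g_lo * g_hi <= 0.0:
            a, b, g_a = f_lo, f_hi, g_lo
            for _ in range(60):  # bisection
                mid = 0.5 * (a + b)
                g_mid = g(mid)
                if g_a * g_mid <= 0.0:
                    b = mid
                else:
                    a, g_a = mid, g_mid
            roots.append(0.5 * (a + b))
        f_lo, g_lo = f_hi, g_hi
    return roots


if __name__ == "__main__":
    # Two hypothetical speakers: same shape, speaker B is 15 % shorter overall.
    speaker_a = two_tube_resonances(0.09, 0.08, 4.0)
    speaker_b = two_tube_resonances(0.09 * 0.85, 0.08 * 0.85, 4.0)
    for name, formants in (("A", speaker_a), ("B", speaker_b)):
        f1, f2, f3 = formants[:3]
        print(name, round(f1), round(f2), round(f3),
              round(f2 / f1, 3), round(f3 / f1, 3))
```

With the same area ratio, the shorter tract gives higher resonances, but ratios such as F2/F1 stay the same, which is one way to read the idea of relative valuation.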

The recognition method is to find how the voice sound was generated, based on the generating mechanism, by fitting (slightly forcibly) its representative features to a simplified model, in a way similar to a face recognition method.


Source type and Effect type

source type:  glottal source, turbulent source
effect:       resonance, temporal change, nose effect

effect type           feature
resonance frequency   formant frequency
strength              formant gain
temporal change       variable center frequency
nose effect           loss compared to standard source and standard resonance



Band Pass Filter bank and its application to voice sound analysis


No.6, 30 June 2009
Revised, 24 April 2019


Comments about features of five Japanese vowels

In Japanese, vowel /a/ and vowel /u/ are basic.

The feature of vowel /a/ is that, following the principle of maximum radiation (*1) from the mouth using two waves, the waves radiated from the mouth are arranged close together and cooperatively. (1) In the higher frequency range, harmonics of the two waves also appear close together and cooperative, just as the fundamental ones do for vowel /a/. This effect is similar to that of a horn loudspeaker.
(*1) The principle of maximum radiation leads to a summit that consists of two or more peaks.

For vowel /e/, the waves from the mouth are adjusted by the tongue so that in the lower frequency range there is no pair of waves, only one, while in the higher frequency range a pair of waves is close together and cooperative. (4)
Vowel /o/, on the other hand, is a complex sound. It is based on vowel /a/ and extended toward /u/. (3) There is also an equivalent form that reverses the positions of /a/ and /u/.

Vowel /u/ has no color, which means that it has no pair of close, cooperative waves (as a key), or, if it has them, they are not pronounced and show only a gentle slope in the frequency spectrum. Vowel /u/ is featureless compared with vowels such as /a/ and /e/. Because of this, many solutions exist that match vowel /u/. (5)

Vowel /i/ consists of vowel /u/ with noisy signals added to it. The noisy signals are caused by air flow through the narrow passage out of the mouth. The presence of this noise indicates an utterance with a nearly closed mouth. (2)



Speech Waveform Generation by Two Tubes Model and Glottal Volume Velocity

Estimation of vocal tract as two tubes model or three tubes model



No.28, 25 March 2019


Consonants perception

Vowels can be explained as a resonance phenomenon. To generate consonants, however,
turbulent sound sources produced by the tongue, teeth, throat, and so on are necessary in addition to the glottal vibration pulse.
The physical phenomenon of turbulent sound is described by complicated differential equations and is difficult to understand.
However, very roughly, there are two qualitative considerations.
a) The approximate frequency of the generated sound is estimated as "air flow speed divided by typical length" at the location.
To obtain a high frequency, one can increase the air flow speed (narrow constriction) and/or use a shorter length (e.g., a gap between the teeth).

b) There is a threshold for generating turbulent sound.
The Reynolds number needs to exceed a certain value.
The acceleration of the flow needed to exceed it appears as a change in the signal period (frequency).
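
The two considerations above can be put into a small numerical sketch. The frequency rule is taken directly from a); the Reynolds number is the usual Re = v*d/nu. The kinematic viscosity of air (about 1.5e-5 m^2/s), the threshold value CRITICAL_REYNOLDS, the example dimensions, and all function names are assumptions for illustration; the actual threshold depends on the geometry of the constriction.

```python
import math

AIR_KINEMATIC_VISCOSITY = 1.5e-5  # m^2/s, approximate value for air near room temperature
CRITICAL_REYNOLDS = 2000.0        # hypothetical threshold; the real value depends on geometry


def turbulence_frequency(flow_speed, typical_length):
    """Rough frequency (Hz): air flow speed (m/s) divided by typical length (m)."""
    return flow_speed / typical_length


def reynolds_number(flow_speed, typical_length, nu=AIR_KINEMATIC_VISCOSITY):
    """Reynolds number Re = v * d / nu for flow speed v and typical length d."""
    return flow_speed * typical_length / nu


def is_turbulent(flow_speed, typical_length, critical=CRITICAL_REYNOLDS):
    """True when the Reynolds number exceeds the assumed threshold."""
    return reynolds_number(flow_speed, typical_length) > critical


def onset_speed(typical_length, critical=CRITICAL_REYNOLDS, nu=AIR_KINEMATIC_VISCOSITY):
    """Flow speed (m/s) at which the assumed threshold is reached."""
    return critical * nu / typical_length


if __name__ == "__main__":
    gap = 0.003  # m, a hypothetical 3 mm constriction
    for speed in (1.0, 10.0, 30.0):
        print(speed,
              round(turbulence_frequency(speed, gap)),
              round(reynolds_number(speed, gap)),
              is_turbulent(speed, gap))
```

Under these assumed numbers, the flow must speed up past the onset speed before any turbulent sound appears, and the estimated frequency rises together with the speed.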

Vowels can be simulated, although very roughly, by a simple resonance tube model and a glottal vibration pulse.
In the case of consonants, however, whatever generates the turbulent sound (constrictions and obstacles) breaks the tube up into pieces, and the frequency characteristic becomes complicated.

In addition, the accelerating air flow that generates turbulent sound may cause the amplitude to grow when its frequency meets a resonance frequency (resonance scan).
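
The resonance scan can be pictured with a generic second-order resonator; this is a stand-in chosen for illustration, not the model used on this page. The names resonator_gain and scan, and the values of f0 and q, are hypothetical. As the source frequency rises toward f0, the gain grows, peaks near f0, and falls again.

```python
import math


def resonator_gain(f, f0, q):
    """Magnitude response of a second-order resonator (unit gain at 0 Hz).

    f  : source frequency (Hz)
    f0 : resonance frequency (Hz)
    q  : quality factor; the gain at f = f0 equals q
    """
    r = f / f0
    return 1.0 / math.sqrt((1.0 - r * r) ** 2 + (r / q) ** 2)


def scan(f_start, f_end, f0, q, steps=200):
    """Sweep the source frequency linearly and return (frequency, gain) pairs."""
    pairs = []
    for i in range(steps + 1):
        f = f_start + (f_end - f_start) * i / steps
        pairs.append((f, resonator_gain(f, f0, q)))
    return pairs


if __name__ == "__main__":
    # Hypothetical case: the turbulent source sweeps from 1 kHz to 4 kHz
    # past a resonance assumed at 2.5 kHz.
    for f, gain in scan(1000.0, 4000.0, 2500.0, 10.0, steps=12):
        print(round(f), round(gain, 2))
```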

Consonant perception is to identify, from the observed (unsteady) signal, the location where the turbulence occurs, including the resonance effect around it.



Fricative voice /sa/ sound waveform generation by two tubes model and noise source

Plosive voice /ga/ /ka/ sound waveform generation by pseudo blast impulse, noise source, and two tubes model

Nasal voice /na/ /ma/ sound waveform generation by two tubes model and nasal effect source



No.6, 20 February 2019









No.127, 3 September 2022

This page was first established on 17 July 2005.