< Message from author >
The study of speech science has prospered, and its applications are useful to us.
However, I feel that we still lack an understanding of the principle of human speech recognition.
What is necessary is not to process superficial features, but to discover the hidden mechanism behind the diversity of the speech signal wave, which takes so many shapes but carries one common meaning.
So I made this web page in the hope of promoting the study and development of a proper speech recognition principle.
The human speech wave varies with the individual's vocal cords, the speaker's condition, and the surrounding environment.
There is no fixed reference value.
The mechanism, a generating structure, exists inside the observed speech signal. Let's call it the inner structure.
The nature of the speech signal is diversity: many forms but one common phonetic meaning.
Inner structures are grouped according to the conditions of vocal organ movement, such as how wide the mouth opens and how far the tongue turns, and there is a certain freedom in the generative mechanism that causes the diversity of the speech signal.
Speech recognition based on inner structure is a pattern recognition method that estimates which inner structure produced the spoken phoneme.
(1) Due to the structure of the speech organs and the limits of their movement, and because utterances are required to be discriminable, the significant sound positions of the organs (places of articulation) must end up either at an extreme or at neutral. Therefore, the combinations of positions that correspond to phonemes are limited to a few. The state is quantized.
(2) Due to restrictions imposed by relations within the structure, the feature parameters of speech (for example, the set of formant frequencies) are not independent of each other. They cannot take values independently. If the first and second formants are known, the third formant can be predicted by calculating the generation mechanism of the structure.
Applying the structure in this way is a kind of structure fitting.
(3) Everybody has different organs and nobody utters exactly the same sounds as others, yet speech itself is understandable to everyone. This leads to the concepts of relative valuation (like the concept of frequency ratio) and of a starting point (a key).
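The idea of relative valuation in (3) can be illustrated with a small sketch. It assumes, purely for illustration, that two speakers differ only by a uniform scaling of their formant frequencies and that each speaker's own first formant serves as the key. The formant values and the function name `relative_pattern` are hypothetical and are not taken from this page.

```python
import math


def relative_pattern(formants_hz, key_hz):
    """Express each formant as an interval (in semitones) above the key."""
    return [12.0 * math.log2(f / key_hz) for f in formants_hz]


# Hypothetical formant sets: speaker B is assumed to be speaker A scaled by 1.2.
speaker_a = [700.0, 1200.0, 2600.0]
speaker_b = [f * 1.2 for f in speaker_a]

# Assumption: each speaker's own first formant is used as the key.
pattern_a = relative_pattern(speaker_a, key_hz=speaker_a[0])
pattern_b = relative_pattern(speaker_b, key_hz=speaker_b[0])

for a, b in zip(pattern_a, pattern_b):
    print(f"{a:8.3f} {b:8.3f}")
```

Under these assumptions the absolute frequencies differ between the two speakers, but the intervals relative to the key come out the same. Real speakers do not differ by an exact uniform scaling, so this only shows the direction of the idea.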
The recognition method is to find how the voice sound was generated, based on the generating mechanism, by applying (slightly forcing) its representative features to a simplified model, similar to a method of face recognition.
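As a rough sketch of fitting representative features to a simplified model, the following example forces two measured formants onto a uniform tube (closed at the glottis, open at the lips) and then predicts the third formant from the fitted structure. The uniform tube, the speed of sound value, and the function names are assumptions chosen for illustration; this is not the model used on this page.

```python
SPEED_OF_SOUND = 340.0  # m/s, assumed value


def fit_tube_length(f1_hz, f2_hz):
    """Least-squares fit of a uniform tube to the first two formants.

    Assumed model: F_n = (2n - 1) * c / (4 * L), so F1 ~ 1*s and F2 ~ 3*s
    with s = c / (4 * L). The measured values are (slightly) forced onto it.
    """
    s = (1.0 * f1_hz + 3.0 * f2_hz) / (1.0 ** 2 + 3.0 ** 2)
    return SPEED_OF_SOUND / (4.0 * s)


def predict_formant(length_m, n):
    """Resonance number n of the assumed uniform tube."""
    return (2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_m)


# Hypothetical measured formants that do not match the model exactly.
length = fit_tube_length(520.0, 1480.0)
f3 = predict_formant(length, 3)
print(f"fitted length: {length:.4f} m, predicted F3: {f3:.1f} Hz")
```

A single uniform tube has only one free parameter, so it can describe only a neutral vowel. A model with more sections would be needed to cover real vowels, but the fitting step would have the same shape.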
Source type and Effect type
source type      | effect: resonance | effect: temporal change | effect: nose effect |
glottal source   | ✔                 |                         | ✔                   |
turbulent source | ✔                 | ✔                       |                     |
No.6, 30 June 2009
Revised, 24 April 2019
Comments about features of five Japanese vowels
In Japanese, the vowels /a/ and /u/ are basic.
The feature of vowel /a/ is that, following the principle of maximum radiation (*1) from the mouth using two waves, the waves radiated from the mouth are arranged closely and cooperatively. (1) In the higher frequency range, harmonics of the two waves appear as closely and cooperatively as the fundamental ones do in the case of vowel /a/. This effect is similar to a horn loudspeaker.
(*1) The principle of maximum radiation leads to a summit that consists of two or more peaks.
For vowel /e/, the waves from the mouth are adjusted by the tongue so that in the lower frequency range there is no pair of waves, just one, while in the higher frequency range a pair of waves is close and cooperative. (4)
Vowel /o/, on the other hand, is a complex sound. It is based on vowel /a/ and extended toward /u/. (3) There is also an equivalent that reverses the positions of /a/ and /u/.
Vowel /u/ has no color, which means that it has no pair of waves that are close and cooperative (as a key), or, if there are such waves, they are not extreme and the slope of the frequency spectrum is gentle. The feature of vowel /u/ is that it is featureless compared with vowels /a/, /e/, and so on. Because of that, many solutions exist that match vowel /u/. (5)
Vowel /i/ consists of vowel /u/ with noisy signals added to it. The noisy signals are caused by flow through a narrow way out of the mouth. The presence of this noise indicates an utterance with a nearly closed mouth. (2)
No.28, 25 March 2019
Consonant perception
Vowels can be explained as a resonance phenomenon. To generate consonants, however, turbulent sound sources generated by the tongue, teeth, throat, etc. are necessary in addition to the glottal vibration pulse.
The physical phenomenon of turbulent sound is described by complicated differential equations and is difficult to understand.
However, very roughly, there are two qualitative considerations.
a) The approximate frequency of the generated sound is evaluated as "air flow speed divided by typical length" at the location.
To obtain a high frequency, one can speed up the air flow (narrow constriction) and/or use a shorter length (e.g., a tooth gap).
b) There is a threshold for generating turbulent sound.
The Reynolds number needs to exceed a certain value.
The acceleration of flow speed needed to exceed it appears as a change in the signal period (frequency).
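The two considerations a) and b) can be put into numbers with a short sketch. The flow speeds and lengths are hypothetical. The kinematic viscosity is the usual approximate value for air at room temperature. The critical Reynolds number used here is only an assumed, illustrative threshold; this page does not give a value.

```python
AIR_KINEMATIC_VISCOSITY = 1.5e-5    # m^2/s, approximate for air at room temperature
ASSUMED_CRITICAL_REYNOLDS = 1800.0  # illustrative threshold, an assumption


def rough_frequency_hz(flow_speed_m_s, typical_length_m):
    """Consideration a): frequency ~ air flow speed / typical length."""
    return flow_speed_m_s / typical_length_m


def reynolds_number(flow_speed_m_s, typical_length_m):
    """Re = speed * length / kinematic viscosity."""
    return flow_speed_m_s * typical_length_m / AIR_KINEMATIC_VISCOSITY


def is_turbulent(flow_speed_m_s, typical_length_m):
    """Consideration b): turbulence needs Re above a threshold."""
    return reynolds_number(flow_speed_m_s, typical_length_m) > ASSUMED_CRITICAL_REYNOLDS


# Hypothetical constriction of 3 mm with slow and fast air flow.
for speed in (5.0, 20.0):
    print(speed,
          round(rough_frequency_hz(speed, 0.003)),
          round(reynolds_number(speed, 0.003)),
          is_turbulent(speed, 0.003))
```

With these assumed numbers the slow flow stays below the threshold and the fast flow exceeds it, and the rough frequency rises together with the flow speed.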
Vowels can be simulated, although quite roughly, by a simple resonance tube model and a glottal vibration pulse.
In the case of consonants, however, the things that generate turbulent sounds (constrictions and obstacles) break the tube up into pieces, and the frequency characteristic becomes complicated.
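For comparison, the rough vowel simulation mentioned above can be sketched as a pulse source passed through a few resonators. This is a minimal sketch under stated assumptions: an impulse train stands in for the glottal vibration pulse, each resonance is a standard two-pole digital resonator with unity gain at zero frequency, and the sample rate, formant frequencies, and bandwidths are hypothetical values, not taken from this page.

```python
import math

SAMPLE_RATE = 16000  # Hz, assumed


def impulse_train(f0_hz, n_samples):
    """Crude stand-in for the glottal pulse: one impulse per pitch period."""
    period = int(round(SAMPLE_RATE / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, center_hz, bandwidth_hz):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * center_hz * t))
    a = 1.0 - b - c  # unity gain at zero frequency
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, n_samples):
    """Pass the pulse source through one resonator per (center, bandwidth)."""
    signal = impulse_train(f0_hz, n_samples)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz)
    return signal


# Hypothetical /a/-like formants as (center Hz, bandwidth Hz) pairs.
wave = synthesize_vowel(125.0, [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)], 1600)
print(len(wave), max(wave), min(wave))
```

A consonant would not fit this sketch as it is, because the source would sit at the constriction instead of at the glottis and the tube would be split into pieces.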
In addition, the accelerating air flow that generates turbulent sound may cause the amplitude to grow when its frequency meets a resonance frequency (resonance scan).
Consonant perception is to identify, from the observed (unsteady) signal, the location where the turbulence occurs, including the resonance effect around it.
No.6, 20 February 2019
No.127, 3 September 2022
This page was first established on 17 July 2005.