A Study of Speech Recognition based on Inner Structure

by   Isi  Shun
 

Japanese page



This page describes a concept of voice recognition regarding phoneme as composite signal.



< Message from author >


Study of speech science has been some prospered and its application is some useful for us.
However, i feel that we still lack for understanding about human speech recognition principle.
Not process superficial feature, but discover hidden mechanism in diversity of speech signal wave, that is so many shapes but one common meaning,  is necessary.
So, i made this web page hoping of  proper speech recognition principle study and its development.




< Speech  Recognition Based On Inner Structure >


The mechanism, generative structure, exists inside of the observed feature of speech signal. Let's call it inner structure.
The nature of speech signal is diversity, that is many forms but one common phonetic meaning.
Every inner structure are grouped  due to vocal organs movement conditions, conditions of mouth open range and tongue turn range, and there is certain freedom in generative mechanism that causes diversity of speech signal.
Speech  recognition based on inner structure is a pattern recognition method to estimate which the inner structure of the spoken phoneme.


(1)Due to  the speech organs  structure and its movement limits, and requirement of utterance is discriminative, significant sound position of organs structure, place of articulation,  must lead to at the end or neutral.  Therefore, combination of position correspond to phoneme is limited to some. 
(2)Due to restriction  by relation in structure,   feature parameters of speech ( for example, formant frequency set)   are not independent each other. They can not  take  value independently.    If first formant and second formant are known,  third formant will be  able to be predicted by  generation mechanism  calculation of the structure.  To apply like this way  is a kind of structure fitting.
(3)Everybody has different organ and does not speak as utter same as  others , however speech itself  is understandable for everyone.
This  will lead the concept of  relative valuation (like concept of  frequency ratio) and  starting point (a key).

The recognition method is to find that the experiment signal contains the sound generated by the structure position.

Pattern recognition based on inner structure is a hypothesis under study, is not completed yet.  Human speech wave is varying,  due to individual  vocal  cords, comfortable condition, and  surrounding environment. And there is no fixed reference value in it.  As elements of speech wave will be defined by relationship each other,  solution for this pattern recognition should be optimization to fit elements to expectant structure.


 apply  speech waveform generation model as one of inner structure to speech recognition



The table below shows one consideration about the kind which can be generated by human mouth structure limitation.
Determine classification of what phoneme, by estimate following factors, and integrate them, and judgment

estimated factors kind Reason why the feature exists. Or, principle to decide it
resonance parameters r1,l1 /a/ principle of maximum radiation from mouth by using two waves
/e/ stationary point by using one wave and two waves
/u/,/i/ other
r1,l1,r2,l2 /o/ Extend method,
from 2 tubes model to 3 tubes model.   1st and 2nd tube follows principle of maximum radiation, 2nd and 3rd follows like /u/.
rl rate of resonance effect in mouth
noise nature constant,burst turbulent boundary
frequency range /i/,/s/,/t/,start mark of /r/ Are they determine by physical of human vocal organs ?
superpose sonant( more disturb origin),independent classification of noise superpose method
effective duration survival rate /t/,/p/ Is the unit to judge a pitch duration ?
break /p/ same as above ?
suppressed portion to be suppressed /m/,/n/,/N(nn)/ Compare with following vowel ? To increase kind by using nose effect
tune controlled pitch stable, rise, fall any state will be composed from these 3 states. To increase kind by using tune.
The content in the table above is still hypothesis. It is an idea to recognize human speech sound.


And transition between phonemes will take as short as possible. Due to both this transition principle and  physical  limitation of  vocal organs movement, transfer way from phoneme to one should be restricted.

Both estimation of vocal organ structure as effect and estimation of vocal  chord structure as sound source should be done  to phoneme discrimination.

No.6,  30  June  2009
Addition, 16 June 2013
Addition, 18 April 2014
Addition, 5 May 2014
Addition, 30 May 2014
Addition, 17 December 2014
Modification, 7 November 2018


How to find structure hidden in sound signal

In general, observed sound signal is not  similar to prediction from the structure. So, identification method based on error, whole error,  between observed signal and prediction from the structure, will not work well.
Getting knowledge of  characters which consist of two or more relation, and examine  if there is the character in some portion of observed. This procedure will be done  step by step,  through trial and error.
Although it is not about speech cognizance, to increase understanding of this procedure, there is
An example of structure identification by rough full search and nonlinear least square method to estimate parameters of two composite structures in signal

The best way for machine recognition of pattern  is logical process,  analytical method or recurrence relation , using whole elements at same time. However,  such method is not found yet.


No.2,  17  February  2015




Discrimination in mixed sound
Human speech, and also, other sound, which human can recognize, has its inner structure. For example, inner structure of piano sound, inner structure of  violin and so on. One may can discriminate the desired sound  by detecting the inner structure in mixed sound.



Comments about features of  five Japanese vowels
In Japanese, vowel /a/ and vowel /u/ are basic.

Vowel /a/ feature is that according to principle of maximum radiation from mouth by using  two waves , radiated waves from mouth are put in order as closely and cooperative. (1) In higher frequency range, harmonic waves of the two waves appear as closely and cooperative as same as fundamental ones in case of vowel /a/.  This effect is similar to horn loudspeaker.
Vowel /e/, waves from mouth are adjusted by tongue that in lower frequency range there is no pair of waves,  just one, beside in higher frequency range, pair of waves are as closely and cooperative. (4)
Beside vowel /o/ is basing on vowel /a/ and extended to /u/ .   (3)

Vowel /u/ has no color which means that  has no pair of waves which are closely and cooperative (as a key) , or, if there are, they are not extreme and gentle slope in frequency spectrum. The feature of vowel /u/  is feature less other than vowel /a/ , /e/, or etc.  Since that, many solutions exist as match as vowel /u/.(5)

Vowel /i/ is of vowel /u/ and noisy signals added to it. Its noisy signals are caused by flow in narrow way out in mouth. This noisy existence informs of closely mouth utterance. (2)




Attention: These comments are still hypothesis.


    principle of consonant

   An experiment to generate sound like phoneme by two or  three tubes model with digital filter



No.27,  7  November  2018




Comments about features of Japanese consonants of  /ha/, /sa/,  /ka/, /ta/, /ma/,  /na/, /ra/ and /pa/

  Besides other languages,  japanese language should be recognized as a pair of consonant and vowel,  in short, consonant  is not independent sound. But,  principle of consonant  may be similar in every languages.

  1. /ha/    adopts  an suitable noise sound instead of voiced sound as source for resonance system. The generated sound is voiceless but it  includes resonance feature.
  2. /sa/   creates  higher band noise due to draft between teeth.
  3. /ka/   creates  irregular burst    which is a  non linear  phenomenon due to strong blow. And it  resonates lightly in mouth.
  4. /ta/    has many forms.  One is fast rise with noise due to normal blow.
  5. /ma/  starts from  sound effect, that is erase nature peak and erase higher frequency band expect two major peaks by nose. And  gradually  makes the sound effect less and revitalize lost  peaks.
  6. /na/   similar effect to /ma/.  difference is initial state of mouth. /ma/ initial is almost of target form state, beside /na/ is from  close  tongue state to target form state.
  7. /ra/   starts from a marked sound when tongue take off the roof of  the mouth (example).    And then,  principle of /a/ will be gradually formed  through unstable states.
  8. /pa/ starts reflective sound by starting strong breath flow.

Attention: These comments are still hypothesis.


An experiment to generate sound like phoneme by two or  three tubes model with digital filter


No.5,  23  May  2009

  Sound Structure Description

 Inner Structure for Pattern Recognition


For speech recognition feature, unit of spectrum or unit of cepstrum is not enough to discriminate speech.
To discriminate, more detailed observation is necessary.
As an experiment, let's study elements of speech signal and their change of times.



<Analysis of Speech Wave >


Speech signal can be composed of several elements. The figure right shows elements which are some linear phase FIR digital filter outputs of speech signal. Frequency band of linear phase FIR digital filters are manual adapted more discriminative. In figure right, top red color wave form is original wave form of one part of utterance japanese vowel "A". Next blue color wave form and three gray color wave forms are elements of which sum makes almost original wave form red color. Blue color is mainly caused by vibration of vocal cords. And gray colors are burst signal in higher frequency range, which is caused by stress at initial cycle of vibration of vocal cords. And they are not simple sin wave.

Besides japanese vowel "A" is composed of set of higher frequency elements, let study next  japanese vowel "I". "I" is composed of a base wave and  some waves like noise on the base wave. In figure right, blue color, element 1, is  base and gray color below, element 2, is  high frequency waves on the base.  If japanese vowel "I" without element 2, it's heard as japanese vowel "U."

Next is japanese vowel "E." It mainly  consists of element 1, which is heard as "O" when it's solo,  and element 2 which features "E."  Element 3 gives effect to be heard as surely "E."

Human discrimination may use dynamic changing portion rather than stable portion of speech signal. Sometimes, a part speech wave which is cut out from continuous speech, is heard as other different syllable from the syllable in the continuous speech. This might be explained by human discrimination priority to dynamic changing portion rather than stable ones.

As simple samples, little bit hard to observe, there are initial dynamic portion of utterance japanese vowel "A", "I", "U", "E", and "O".











Beside utterance japanese vowel "A" makes a set of higher frequency waves, some kind of utterance makes sound restrained. For example, utterance japanese "MU", the figure below is comparison a part of consonant of "MU" with a part of vowel of "MU." Both have gray color element of which cycle is same. But, level of gray color element in consonant is restrained and is very small than one in vowel. In vowel, with blue color element, gray color element is opened and is well.
As a simple sample, there is utterance japanese "MA."




Generally,  japanese consonant portion  is not same, but  is sometimes  modified due to having reflection of each  vowel following the consonant.



Difference from other language utterance:
In Japanese utterance, independent consonant is rare case.  Usually, consonant links vowel following the consonant and is uttered as one syllable.  Duration of foreigner speaker's japanese consonant makes sometimes a little strange for us.



No.7,  13  May  2007




<Description of  Speech Elements >

Let's think about useful way  how to describe elements of speech signal.
This is an example for description of speech elements by  XML (Extensible Markup Language).
It's a description of speech elements for a syllable, japanese utterance vowel "A."

In XML below,
Element_1_for_A means that it's element 1 of vowel "A."
Element_3_for_A? , mark ? means option which  has it or doesn't have it.
Wave_Unit_Type3+, mark + means that Wave_Unit_Type3 occupies continuously some times.


<?xml version="1.0" encoding="UTF-8"?>

<!-- Description of a syllable, japanese utterance vowel "A"  Version 0.02 -->
<!DOCTYPE  Independent_Vowel_A  [
<!--  Speech wave signal consists of three portions, generative, stable, and resolvable one. -->
<!ELEMENT Independent_Vowel_A (Generative_Portion, Stable_Portion, Resolvable_Portion )>

<!-- Description of generative portion -->
<!ELEMENT  Generative_Portion (Wave_Unit_Type1, Wave_Unit_Type2, Wave_Unit_Type3+)>

<!-- Description of stable portion -->
<!ELEMENT  Stable_Portion (Wave_Unit_Type3+)>

<!-- It must have Element 1 -->
<!-- Element 0 is optional, you can hear it as vowel "A" without Element 0. -->
<!-- Element 3 is optional. It gives effect to be surely "A."  -->
<!ELEMENT  Wave_Unit_Type1  ( Element_0?, Element_1_for_A)>
<!ELEMENT  Wave_Unit_Type2  ( Element_0?, Element_1_for_A, Element_2_for_A)>
<!ELEMENT  Wave_Unit_Type3  ( Element_0?, Element_1_for_A, Element_2_for_A, Element_3_for_A?)>

<!-- Element 0, Element 1, Element 2, and  Element 3 are -->
<!ELEMENT  Element_0 (WaveData)>
<!ELEMENT  Element_1_for_A (WaveData)>
<!ELEMENT  Element_2_for_A (WaveData)>
<!ELEMENT  Element_3_for_A (WaveData)>

]>

<Independent_Vowel_A>
</Independent_Vowel_A>



And  also,  these are descriptions of speech elements of stable portion, for a syllable, japanese utterance vowel "I"  , for a syllable, japanese utterance vowel "U" , japanese utterance vowel "E", and  for a syllable, japanese utterance vowel "O."


<!DOCTYPE  Independent_Vowel_I [
<!ELEMENT  Stable_Portion (Wave_Unit_Type2+)>
<!ELEMENT  Wave_Unit_Type2  ( Element_1_for_U, Element_2_for_I)>
]>
Element 2 of "I"  differs from one of other vowels like "O" and "E."
It is rather  rubbing noise than sound in box of mouth.



<!DOCTYPE  Independent_Vowel_U  [
<!ELEMENT  Stable_Portion (Wave_Unit_Type1+)>
<!ELEMENT  Wave_Unit_Type1  ( Element_0?, Element_1_for_U, Element_2_for_U?)>
]>
Element 2 and more of  vowel "U"  gives effect to be surely "U."



<!DOCTYPE  Independent_Vowel_E  [
<!ELEMENT  Stable_Portion (Wave_Unit_Type1+)>
<!ELEMENT  Wave_Unit_Type1  ( Element_0?, Element_1_for_U, Element_2_for_E, Element_3_for_E)>
]>
Element_2_for_E is like weakened element_2_for_O.
Element_3_for_E is emphasized relatively due to weakened element_2_for_E.

And also, following is another representaion of "E." 

<!DOCTYPE  Independent_Vowel_E  [
<!ELEMENT  Stable_Portion (Wave_Unit_Type1+)>
<!ELEMENT  Wave_Unit_Type1  ( Element_0?, Element_for_O, Element_2_for_E,Element_3_for_E?)>
]>
In this representation, element 3 and more of  vowel "E"  gives effect to be surely "E." And element 0 is similar  to element of "U."



<!DOCTYPE  Independent_Vowel_O  [
<!ELEMENT  Stable_Portion (Wave_Unit_Type1+)>
<!ELEMENT  Wave_Unit_Type1  ( Element_0?, Element_1_for_U, Element_2_for_O, Element_3_for_O?)>
]>
Element_3_for_O is relatively too weak than element_2_for_O.



And here is an attention that measurement actual value of each element isn't  fixed one, isn't constant,  they are changeable according to circumstances. Each element is relatively defined between each others, but  this opinion is still a hypothesis.
And another important thing is that all elements are relation, of which origin is common. They are not independent. For example, there are three portions corresponded to glottis's change in wave.
Sum of elements makes a speech signal.  However, if change solely one element by its own way, sound may be broken and quality of the speech signal maybe will be poor.



No.10,  27 October  2007





 < Pattern Recognition Based on Elements of Speech Signal >



In speech signal, the frequency and its band of main elements are not fixed, they vary depending on speaker or situation. To begin search for pattern matching or pattern recognition, initial values about them can be estimated by frequency response by FFT analysis of a certain part of speech signal. The procedure may be combination with mathematics calculation and search technique developed by
AI study. Calculation values as some index  and judgment by certain condition term. Human speech signal is vary, so, search based on certain method will be resolved is Not guaranteed. An idea is multi-thread search, that simultaneously some methods are done until one of them will be resolved. (about matching and learning) Another idea is statistical pattern recognition method of which feature extraction is based on structure features like elements, trace along time, and so on.  Picking up effective discriminative features are significant.


MENU of SEARCH for Pattern Recognition
Efficiently and reasonably, change along time dimension might be done. The change does not  go out of one's way, but , it almost takes near path on structure between current structure and target structure under the condition of  movement  ability in mouth.





The right figures show a syllable description by neighboring two sections in speech waveform of utterance /HA/. In figure top, section 1 is a part of consonant, besides  section 2 is a part of  vowel, in this case /a/.  From the view of frequency response, both sections  have vowel /a/  features,  which are marked by  purple color circle in figure middle.  However, from the view of outputs of filter bank (figure  bottom),  a sure difference exists.  In section 2, a part of vowel, output signals   synchronize with glottal signal, besides in section 1, output signals are not in order well.
In like noisy  signal of consonant portion, it includes feature of following vowel /a/, and changes to the feature stable in vowel portion. This is the  /HA/ structure described by neighboring two sections along time dimension.


Outputs of filter bank (divided band quantity is 10)
You can look at more detail if you click the figure.


a filter bank program which was used for this analysis


Structure of sound changes of times. This is a sample of change of nasal sound, japanese utterance /NA/ and /MA/. /NA/ consists of  nasal sound plus /r/ plus /a/, besides /MA/ consists of nasal sound plus /a/.


Sometimes, in conversation, structure of sound may be out of shape, as similar to letter shape by hand writing against typewriter. We hear and can understand easily as one word unit or as one sentence unit, but if the speech is divide into syllable units and listen to each syllable, sometimes their sounds are not clear, like be sound occurred by the force of inertia, and are hard to discriminate only from the view of sound structure. So, as present speech recognition technique, total speech recognition should be done including to refer possible candidates of one word unit or one sentence unit.


"Synthesis ability and detection ability are complemental."

No.10,  25  May  2008




<Consonant structure example, transform utterance "KA" to utterance "TA"   >


From the view of structure, beginning part of utterance "KA" consists of two elements, which are cave sound in mouth and strong breath turbulent flow sound. Meantime, beginning part of utterance "TA" consists of, mostly, strong breath turbulent flow sound along mouth close to open.
In figure right, upper waveform shows utterance "KA", original waveform. And below one is waveform which is pulled element out by low cut filtering on beginning part of utterance "KA." Below waveform sounds like utterance "TA."
Turbulent flow sound or cave sound are sensitive to man to hear. These  sound waveforms differ much by speakers. So, to detect these sound, not only frequency analysis method, but effective alternate method had been better to be introduced.

Unvoiced sound, like /k/, /t/, and /s/ has no successful synchronize signal, although voiced sound includes glottis signal as synchronize signal.
Let's study burst  waveform tone burst waveform   in turbulent flow sound or cave sound. To compare them and find feature across them, for examples, waveforms of filter band which consists of some band pass filters are shown in left figure.

Another  example is of utterance /SA/. The bottom waveform of figures right shows  waveform replaced initial unvoiced portion of utterance /SA/ by no signal, and this waveform sounds like utterance /TA/. For sound /s/, "some time long stable sound  caused by flow through closing mouth" is one of key factor for /s/ identification.

And sonant depends on timing relate to glottis signal is also known.

Another example. Consonant portion /h/ of  /HA/  differs one of /HO/, because /h/   varies  depending on following vowel.



Waveforms in the below figures are outputs of filter bank (divided band quantity is 8). Comparison unvoiced sound /k/, /t/, and /s/ by a male speaker and a female speaker. You can look at more detail if you click each figure.

by a male speaker
by a female speaker
/k/



/t/


/s/








No.7,  15  April  2008



<A Sound in Mouth differs vowel "O" from vowel "U"  >


To study difference japanese vowel "O"  from japanese vowel "U,"  the figure below  is  a comparison a ending part of "U" portion in utterance japanese vowels "UO"  with a starting part of "O" portion in utterance japanese vowels "UO."  The difference is that element 1-3 appears  clearly, that maybe means as a new sound addition in utterance process in mouth.


And the figure upper is a comparison original (color red)  with element 1-3 removed one (color green), which are same starting parts of "O" portion in utterance japanese vowels "UO."  If element 1-3 is removed from original, it  can be heard as "U" instead of "O."  Both forms are similar as look, but  they are heard as different vowel.
As sample, there is a portion of utterance  japanese vowels "UO."

And,  there is a  turing portion  from "O" to "E" of utterance japanese vowels "OE." Beside element 2  weaken, element 3 (that is element 2 for "E")  strengthen during turing portion from "O" to "E."  This is the voice which consists of both element 2 and element 3 of "OE."

And,  there is a turing portion from "U" to "A" of utterance japanese vowels "UA" that is "UWA." During turing portion from "U" to "A",  vibration frequency of element 1 becomes higher. Beside element 2 strengthen, vibration frequency of element 2 does not change so much than one of element 1. This may mean that  both "U" and "A", element 2  sounds  nearly same  space in mouth, but  level of element 2 of "U" weaken due to tube formed mouth that functions high cut filter.
 
And element itself has its inner structure. There is most difference between "A" and "U"  in element 1.  And, the difference is  microscopically. The figure below is comparison a part of of "U" with a part of "A in utterance "UA" element 1 wave form and its FFT spectrum. Hearing element 1 of "UA" solo can be heard as "UA" roughly. Although both wave form of U" and wave form "A" are similar to vibrating wave if its vibration cycle adjust to be nearly equal, there is microscopically difference in element 1 of "UA." Current analysis method is not enough to recognize these microscopically feature.
However,  analyzing elements of the element makes difference clear. There are main differences in element 2 of element 1 of "UA" and element 3 of element 1 of "UA."




Result of FFT spectrum depends on kind of pre-process window and on which section calculated of the wave form. In this example above, right, spectrum of "A"  has a small peak (blue color arrow) at lower frequency side of top peak (blue color arrow), comparing left, spectrum of "U."  Both spectrums shape are like a mountain due to narrow band filter to calculate element 1 from original wave form.



No.9,  23  September  2007




A former home page:


    A Study of Speech Recognition based on Inner Structure


An opinion:

    Speech Technology and  its contribution for Mankind Social


Epilogue:

    epilogue



With gratitude:

mirror of dmoz.org www.scirus.com
scholar.google.com
validator.w3.org

old Open Directory Project mirror
search scientific
information only
a new search of
academic
literature
check HTML grammar
by W3C





No.114,  7  November  2018
This page first established on 17 July 2005.



Conclusion is " At last, Mystery must be referred to Veda immortal."