A Study of Speech Recognition based on Inner Structure

by Isi Shun
 




< Message from author >


The study of speech science has prospered to some extent, and its applications have been of some use to us.
However, I feel that we still lack an understanding of the principles of human speech recognition.
What is necessary is not to process superficial features, but to discover the hidden mechanism within the diversity of speech signal waves, which take many shapes yet carry one common meaning.
So I made this web page in the hope of proper study of the principles of speech recognition and of their development.


A Study of Speech Recognition based on Inner Structure


< Preface >

This is a study of the human process of recognizing spoken phonemes.
The following idea is a hypothesis of speech recognition, which is based on harmonic distortion.

To begin with, the spoken speech is analyzed in synchrony with the pitch, and harmonic elements are calculated as the measured features, in order to remove the dependency on the voice fundamental frequency, and also because the distortion produces the tone.

The sound source is distorted in the mouth and made into spoken speech. In general, the sound source is not a simple sine wave, so suppose that the sound source consists of a suitable set of simple harmonics. How it is distorted is then estimated by calculating backwards from the measured harmonic elements to the sound-source harmonic elements.
Here, the distortion is supposed to be the meaning of the phoneme.

Having described measurement and its processing for recognition, the next subject is discriminability.

One hypothesis is introduced to explain how it is possible to discriminate among the various forms: "Indices should exist. An index has a biased view and treats various shapes as a limited set of forms."

For instance, the harmonic elements of the Japanese vowel utterance "AIUEO" are calculated in synchrony with the pitch, and one distortion model is designed. Suppose that the distortion is identified by the estimated coefficients of the distortion model.
Principal component analysis is then performed on these coefficients (23-dimensional in this example). The principal axes with the largest variance are calculated and treated as indices, and the trace of the utterance is displayed by projecting the coefficients onto those axes.
The "A" portion of the trace is then compared with the "O" portion to observe whether there is some difference usable for discrimination.

In some languages, such as Chinese, changes of the voice fundamental frequency carry different meanings. In the future, the distortion model should be developed to take changes of the voice fundamental frequency into account.

This explanation is one experiment (a hypothesis), and it is not certain whether it is correct.
In order to discover the true principle of pattern recognition, I think various experiments are welcome.
 
 


1: Analysis of Harmonic Elements


In speech recognition, some studies based on harmonic analysis have already been done.
In this study, the pitch (fundamental wavelength) is indispensable information for recognition. The pitch is acquired from the output of a linear-phase FIR (finite impulse response) band-pass filter, because shifting the input wave by the linear phase delay synchronizes the input wave perfectly with the output of the filter.

In practice, the fundamental wavelength differs from person to person. So a set of band-pass filters with different center frequencies (a filter bank) is prepared, and the one that fits the person or the speech is selected.
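
As an illustration of this filter-bank idea, here is a minimal sketch in Python (assuming numpy and scipy; the filter lengths, bandwidths, and function names are chosen only for illustration, not taken from the original study). It builds several linear-phase FIR band-pass filters, picks the one with the strongest output, and reads pitch periods from the zero crossings of the output after compensating the constant group delay.

import numpy as np
from scipy.signal import firwin, lfilter

def make_filter_bank(fs, centers_hz, bandwidth_hz=40.0, numtaps=511):
    # Linear-phase FIR band-pass filters around candidate fundamental frequencies.
    bank = []
    for fc in centers_hz:
        lo, hi = fc - bandwidth_hz / 2, fc + bandwidth_hz / 2
        taps = firwin(numtaps, [lo, hi], pass_zero=False, fs=fs)
        bank.append((fc, taps))
    return bank

def pitch_by_filter_bank(x, fs, bank):
    # Pick the filter whose output is strongest, align its output with the
    # input by the constant group delay (numtaps - 1) / 2 of a linear-phase
    # FIR filter, and read pitch periods from rising zero crossings.
    best = None
    for fc, taps in bank:
        y = lfilter(taps, 1.0, x)
        delay = (len(taps) - 1) // 2        # constant group delay in samples
        y_aligned = y[delay:]               # now synchronized with the input
        energy = float(np.sum(y_aligned ** 2))
        if best is None or energy > best[0]:
            best = (energy, fc, y_aligned)
    _, fc, y = best
    zc = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]   # rising zero crossings
    periods = np.diff(zc) / fs                       # pitch periods in seconds
    return fc, zc, periods

For example, a bank built with make_filter_bank(16000, range(80, 300, 20)) would cover typical fundamental frequencies of adult speech.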

The figure at right shows one portion of a male speech wave whose fundamental wavelength is 7.89 ms. Taking the fundamental wavelength as the period, the harmonic elements are obtained, in dB, by a DFT (discrete Fourier transform) calculation.

The dynamic range of ordinary human hearing may be more than 60 dB. In this sample, the 1st to 23rd harmonics are calculated as the measured features. The harmonics should be calculated in this way for every pitch period of the speech wave.
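
The following minimal sketch (Python with numpy; the function name and arguments are illustrative, not from the original study) computes the harmonic elements of one pitch period in dB, by taking a DFT whose window length is exactly one fundamental period, so that bin k of the DFT is the k-th harmonic.

import numpy as np

def harmonic_elements_db(x, fs, period_start, period_s, n_harmonics=23):
    # x            : speech samples
    # period_start : sample index where the pitch period begins
    # period_s     : pitch period in seconds (e.g. 7.89e-3 in the example above)
    n = int(round(period_s * fs))            # samples in one fundamental period
    frame = x[period_start:period_start + n]
    spectrum = np.fft.rfft(frame)            # DFT over exactly one period
    mags = np.abs(spectrum[1:n_harmonics + 1])        # bins 1..23 = harmonics 1..23
    return 20.0 * np.log10(np.maximum(mags, 1e-12))   # in dB, floored to avoid log(0)

With a 7.89 ms period sampled at 16 kHz, one period is about 126 samples, so the 1st to 23rd harmonics (up to roughly 2.9 kHz) are comfortably available.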


 
 
 
Continued: Picking the Leading Part from the Speech Signal's Change Process
The two plots at right show 40 harmonic elements arranged on a time scale. In the figures, the values from left to right are the 1st (fundamental) to the 40th harmonic elements, and the direction from top to bottom is the passage of time.
You can see that there are strong-correlation zones, weak-correlation zones, and noisy zones (where the values hardly matter). It is important to discriminate among them by careful observation of the many harmonic elements arranged over time.
In the upper figure, the lower harmonics show strong correlation, while the middle and higher harmonics may form a noisy zone; then, from a certain time, strong correlation appears in the middle harmonics. In the lower figure, the middle harmonics show weak correlation while, at the same time, the lower harmonics show strong correlation.
Picking the leading part out of the surrounding noise may contribute to a more successful pattern recognition result; one way to make such zones visible is sketched below.
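
One assumed way to expose such zones numerically is the following sketch (Python with numpy; the window length and the correlation measure are hypothetical choices, not the study's method). The harmonic elements of successive pitch periods are stacked into a matrix, and for each harmonic the correlation between a short window of frames and the same window shifted by one frame is computed; values near 1 indicate a strongly correlated (controlled) zone, values near 0 a noisy zone.

import numpy as np

def correlation_map(H, win=8):
    # H : array of shape (n_frames, n_harmonics), harmonic elements in dB,
    #     one row per pitch period.
    n_frames, n_harm = H.shape
    corr = np.zeros((n_frames - win, n_harm))
    for t in range(n_frames - win):
        a = H[t:t + win - 1]          # window of frames
        b = H[t + 1:t + win]          # same window, shifted by one frame
        for k in range(n_harm):
            ak = a[:, k] - a[:, k].mean()
            bk = b[:, k] - b[:, k].mean()
            denom = np.sqrt((ak ** 2).sum() * (bk ** 2).sum())
            corr[t, k] = (ak * bk).sum() / denom if denom > 0 else 0.0
    return corr   # near 1: strong correlation; near 0: noisy zone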


I feel that "comparing on the basis of units such as the cepstrum or the spectrum is not enough for speech signal pattern recognition. It is necessary to analyze in more detail, to detect which part of the signal is controlled and how it is controlled on purpose, making use of its change over time. Grasping such close analysis will lead to a satisfactory answer.

And both speech recognition and moving-picture or image recognition may be done by the same algorithm, after appropriate signal processing."


2: Distortion Model of Sound Source


 
 
The sound source is distorted in the mouth and made into spoken speech. The phoneme is the distortion. (This is a hypothesis.)
As one mathematical technique for describing how it is distorted, a distortion model with an input and an output is designed, and the coefficients of the distortion model are estimated.
The distortion itself is a non-linear matter. Here, it is supposed that the distortion can be described as a transformation of the harmonic elements.
As one test, discrimination of phonemes by the coefficients of the distortion model is tried.

The true distortion model has not been decided yet. Many designs of the distortion model are possible; it should be developed further in the future.
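
Since the true model is left open, the sketch below shows only one assumed and very simple choice (Python with numpy; both the 1/k source shape and the per-harmonic-gain form are assumptions made here for illustration, not the study's actual model): the sound source is taken to be a set of simple harmonics with a 1/k amplitude decay, and the distortion is described as a per-harmonic gain in dB, giving one 23-dimensional coefficient vector per pitch period, as mentioned in the preface.

import numpy as np

def assumed_source_harmonics(n_harmonics=23):
    # An assumed sound source made of simple harmonics with a 1/k decay.
    k = np.arange(1, n_harmonics + 1)
    return 1.0 / k

def distortion_coefficients(measured_db, source=None):
    # measured_db : shape (n_frames, n_harmonics), measured harmonics in dB,
    #               one row per pitch period.
    # Returns the per-harmonic gain in dB relative to the assumed source:
    # one coefficient vector (23-dimensional here) per pitch period.
    n_harm = measured_db.shape[1]
    if source is None:
        source = assumed_source_harmonics(n_harm)
    source_db = 20.0 * np.log10(source)
    return measured_db - source_db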

 
 
 

3: Simultaneous Existence (A hypothesis)



 
An opposite state or a relative concept always co-exists. That is, somewhere is "low" when somewhere else is "high", and somewhere is "cold" when somewhere else is "hot". A state or concept like "high" or "hot" cannot exist without its opposite or relative. Here, Simultaneous Existence is defined as the property that concepts are discriminable because their relatives co-exist.
In general, the embodiments of one concept take various shapes; they are infinite.
 
 

How can we discriminate among them when there are so many various shapes?

So let us consider the necessary condition for being discriminable. Here, as one hypothesis, the necessary condition is supposed to be: "Indices should exist. An index has a biased view and treats various shapes as a limited set of forms."

As a simplified instance, in the Japanese language the five main vowels, "A", "I", "U", "E", and "O", are a simultaneous existence. The border between them is continuous or mixed rather than distinct.

Imagine that there is a space formed from some (finite number of) indices, and that each concept to be discriminated corresponds to a sub-space of that space.
In other words, with the indices as axes there is a space, and the space is divided into sub-spaces that represent each concept. The sub-spaces always exist simultaneously. (Simultaneous Existence.)
 

Indices will consist of several different measures, each of which detects a different feature. It may be better to think of indices not as a combination of parameters based on one common measure, but as a combination of several different ones.

Perhaps the indices of each language (or dialect) differ from one another. As an example, in French, most Japanese listeners hear only one kind of vowel where native French speakers hear different vowels. The indices should be tuned to the language (or the dialect).

 
 
 
4: Tracking in a Space of Principal Components as Indices


 
 
As a trial search for indices, a principal component analysis, one major method of multivariate analysis, is performed.

The figure at right shows the trace of the Japanese vowel utterance "AIUEO", projecting the coefficients of a distortion model onto the plane of the 1st and 2nd axes and the plane of the 1st and 3rd axes. To make the time dimension easier to read, the color of the trace line shows position along the track: blue shows the starting portion, the utterance "A", and red shows the ending portion, the utterance "O".

In the plane of the 1st and 3rd axes, the sub-space of "A" and the sub-space of "O" show a small difference that can be used for discrimination.
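
A minimal sketch of this trial is given below (Python with numpy and matplotlib; the function names, the SVD-based computation, and the plotting details are illustrative choices, not the original implementation). The principal axes of the coefficient sequence are found from the centered data, the coefficients are projected onto a chosen pair of axes, and the trace is colored from blue at the start ("A") to red at the end ("O").

import numpy as np
import matplotlib.pyplot as plt

def pca_axes(coeffs, n_axes=3):
    # Principal axes (largest variance first) of the distortion-model
    # coefficients, used here as trial indices.
    mean = coeffs.mean(axis=0)
    _, _, vt = np.linalg.svd(coeffs - mean, full_matrices=False)
    return mean, vt[:n_axes]

def plot_trace(coeffs, axes_pair=(0, 2)):
    # Project the coefficient sequence (n_frames x 23) onto two principal
    # axes and draw the trace, colored from blue (start) to red (end).
    mean, axes = pca_axes(coeffs, n_axes=max(axes_pair) + 1)
    proj = (coeffs - mean) @ axes.T
    t = np.linspace(0.0, 1.0, len(proj))    # position along the utterance
    plt.scatter(proj[:, axes_pair[0]], proj[:, axes_pair[1]],
                c=t, cmap='coolwarm', s=8)
    plt.xlabel('axis %d' % (axes_pair[0] + 1))
    plt.ylabel('axis %d' % (axes_pair[1] + 1))
    plt.show()

Calling plot_trace(coeffs, (0, 1)) and plot_trace(coeffs, (0, 2)) would correspond to the two planes described above.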

I do not think that principal component analysis is a good solution for this issue.
A new solution that makes good use of the features of Simultaneous Existence should be developed.


 


As shown above, the change of a speech signal over time can be described as a trace in the space.
One idea about how to handle the diversity of such traces over time is continued in "Inside of Theory."


Regarding Voiceless Sounds

In Japanese speech, to carry meaning, a voiceless sound always forms a pair with a voiced sound; that is, a voiced sound should follow the voiceless one. Therefore, suppose that the pitch, the sound source of the following voiced sound, can also be used for the voiceless sound that precedes it. Using the sound source of the following voiced sound, the voiceless sound can then be structurally analyzed. But this is still only an idea.



Supplement:



Latest home page:


    A Study of Speech Recognition based on Inner Structure


An opinion:

    Speech Technology and Its Contribution to Human Society


Epilogue:

    epilogue




No.90(E),  30  April  2007
This page was first established on 17 July 2005.



The conclusion is: "At last, the Mystery must be referred to the immortal Veda."