Is there a way to recognize vowels from voice?

Member
Posts: 164
Joined: 2010.10
Post: #1
I'm wondering if I can recognize not all speech, but only vowels...
Do I have to use an FFT?
In Apple's oscilloscope, the letter "e" looks the same as "a".
Does anyone know how to recognize letters?
Thanks
Member
Posts: 49
Joined: 2011.08
Post: #2
That would be pretty hard. I know that in animation there are apps that do automatic lip-syncing, but that basically just changes the mouth shape according to the dB level. Trying to pick out actual letters from a waveform would be like describing a man's hair color by judging his footprint.
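
To make the level-driven lip-sync idea concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration (the function names, the -40 dB floor, the linear mapping); it is not taken from any particular animation app.

```python
import math

def rms_db(frame):
    """Return the RMS level of a frame in dB relative to full scale (1.0)."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")

def mouth_openness(frame, floor_db=-40.0):
    """Map level to 0.0 (closed) .. 1.0 (fully open).

    floor_db is an assumed silence threshold, not a standard value.
    """
    level = rms_db(frame)
    if level <= floor_db:
        return 0.0
    return min(1.0, (level - floor_db) / -floor_db)
```

Note that the result depends only on loudness: an "a" and an "e" spoken at the same volume give the same openness, which is why this approach cannot tell letters apart.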
Moderator
Posts: 437
Joined: 2002.09
Post: #3
Speech recognition is a difficult problem, an active area of research.

OS X has some speech recognition built in (see Speech system preferences.) I've only played with it a little; it seemed hit or miss. It might get better with training. But perhaps Apple gives you a system API you can use instead of writing something yourself.

One way to turn it into a more tractable problem is to limit the legal input. For example, when you call a company's voicemail prompt system it's often designed to recognize spoken digits, and a few words like "yes", "no", "main menu".

Measure twice, cut once, curse three or four times.
Member
Posts: 49
Joined: 2011.08
Post: #4
(Sep 3, 2011 05:00 PM)MattDiamond Wrote:  For example, when you call a company's voicemail prompt system it's often designed to recognize spoken digits, and a few words like "yes", "no", "main menu".
Comparing waveforms can work but has issues. I'm sure we have all heard, "I did not understand your request. Please say bla or bla or bla again," or something like that.
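
One such issue can be shown with a naive sketch in Python. It assumes the comparison is a plain zero-lag normalized correlation against a stored template, and it uses synthetic tones in place of recordings; real phone systems do not necessarily work this way, and the names here are made up.

```python
import math

def normalized_correlation(x, y):
    """Zero-lag normalized correlation of two equal-length frames, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm > 0.0 else 0.0

def tone(freq_hz, n, rate_hz=8000, phase=0.0):
    """Synthetic test tone standing in for a recorded template (assumption)."""
    return [math.sin(2.0 * math.pi * freq_hz * i / rate_hz + phase)
            for i in range(n)]

template = tone(200, 80)
same = tone(200, 80)
shifted = tone(200, 80, phase=math.pi / 2)  # same sound, a quarter period later

score_same = normalized_correlation(template, same)        # close to 1
score_shifted = normalized_correlation(template, shifted)  # close to 0
```

The identical sound scores near 1, but the same sound shifted by a quarter period scores near 0, so a raw waveform comparison is very sensitive to timing before speaker or pitch differences even come into play.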
Member
Posts: 164
Joined: 2010.10
Post: #5
Seems pretty difficult. Thanks for the advice.
Nibbie
Posts: 3
Joined: 2011.09
Post: #6
It looks like this team from Rice University did exactly what you are trying to do:

http://cnx.org/content/m11734/latest/?co...l10223/1.5

They do use FFT and it turns out that vowel formants are pretty distinctive...
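
As a rough illustration of picking spectral peaks, here is a Python sketch. It is not the Rice team's method: it uses a naive DFT on a synthetic two-tone frame, and the helper names are made up. The 8000 Hz rate, the 256-sample window, and the 650/1150 Hz values are borrowed from the MATLAB code in this post.

```python
import cmath
import math

RATE_HZ = 8000   # sample rate assumed by the MATLAB code in this post
WINSIZE = 256    # window size used by the MATLAB code in this post

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 (an FFT computes the same values faster)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def strongest_peaks(frame, count=2, rate_hz=RATE_HZ):
    """Return the frequencies (Hz) of the `count` largest local maxima, lowest first."""
    windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]
    mag = magnitude_spectrum(windowed)
    peaks = [k for k in range(1, len(mag) - 1) if mag[k - 1] < mag[k] > mag[k + 1]]
    peaks.sort(key=lambda k: mag[k], reverse=True)
    return sorted(k * rate_hz / len(frame) for k in peaks[:count])

# Synthetic stand-in for a vowel: two tones at the "a" formant values (650 Hz, 1150 Hz).
frame = [math.sin(2 * math.pi * 650 * t / RATE_HZ)
         + 0.6 * math.sin(2 * math.pi * 1150 * t / RATE_HZ)
         for t in range(WINSIZE)]
f1, f2 = strongest_peaks(frame)
```

With a 256-sample window the bin spacing is 31.25 Hz, so `f1` and `f2` should land within one bin of 650 Hz and 1150 Hz. Real speech is full of pitch harmonics, so the raw peaks are not the formants themselves; the MATLAB code in this post fits an AR model (`ar`, `freqz`) and picks peaks from that smoothed response instead.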


And they include MATLAB code (looks pretty easy to port to C++) actually used to detect vowels:

( http://cnx.org/content/m11734/latest/formants.txt )

Code:
function answer = formants(x)
%input is a file at 8000 Hz at 8 bits
%windows input into sample sections, finding formants for each section
%after formants are found, it associates each set with a vowel or consonant
%returns a string which indicates the order of consonant and vowel sounds in the word
%example: formants('mexico.wav') = CeCiCoC
%note: Consonant will be returned at beginning and end regardless of actual content of input

%constants: window size is size of window; n is order of AR model
winsize=256;
n=10;

%reads input file into vector, establishes how many windows will be utilized
x=wavread(x);
[b1,b2]=size(x);
j = (b1 - rem(b1,winsize))/winsize;

%vowel database
%values shown are the frequencies (in kHz) of the first and second formant
a=[.650; 1.150];
e=[.430; 1.650];
i=[.250; 1.950];
o=[.420; 1.075];
u=[.325; 1.350];

for k=1:j
    %windowing/normalization process
    c = x(1+winsize*(k-1):winsize*k,1);
    c = (c-mean(c))/max(abs((c-mean(c))));
    c = c.*hamming(winsize);
    
    %AR model
    fn=ar(c,n);
    [num,den]=tfdata(fn,'v');
    
    %freq. response of AR model
    [h,w]=freqz(num,den);
    f=w.*8000/(2000*pi);
    
    %finds all formant frequency and magnitude values
    hnorm=abs(h);
    z=zeros(1,2);
    for(d=1:510)
        if max(hnorm(d:d+2,1)) == hnorm(d+1,1)
            z=[z; [f(d+1,1) hnorm(d+1,1)]];
        end
    end
    
    %generates graphs of frequency response of each window for troubleshooting purposes
    %figure
    %semilogy(f,hnorm)
    %xlabel('Frequency(kHz)')
    %ylabel('Response')
    %title('Vocal Model')
    %k
    
    %makes v vector which contains 0 for definite consonant, 1 for possible vowel
    [blip,blop]=size(z);
    g = z(2:blip,1:2);
    if g(1,1) >= .225 & g(1,2) >= 5
        v(k,1)=1;
    else v(k,1)=0;
    end
    
    %running log of all formant frequency and magnitude values
    t(1:blip-1,k)=g(1:blip-1,1);
    mag(1:blip-1,k)=g(1:blip-1,2);
end

%initial smoother, eliminates 1 long strings of possible vowels as being anomalous
for k=2:j-1
    if (v(k,1) ~= v(k-1,1)) & (v(k,1) ~= v(k+1,1))
        v(k,1)=0;
    end
end

%v(:,2) will label row numbers of output for troubleshooting reference
v(:,2)=(1:j)';
v(:,3) = 0;

%systematic elimination method
for k=1:j;
    if v(k,1) ~= 0
       flaga = 1;
       flage = 1;
       flagi = 1;
       flago = 1;
       flagu = 1;
       %weeds out negative matches by first formant
       if abs(a(1,1)-t(1,k)) > .15
           flaga = 0;
       end
       if abs(e(1,1)-t(1,k)) > .125
           flage = 0;
       end
       if abs(i(1,1)-t(1,k)) > .05
           flagi = 0;
       end
       if abs(o(1,1)-t(1,k)) > .100
           flago = 0;
       end
       if abs(u(1,1)-t(1,k)) > .075
           flagu = 0;
       end
       %weeds out negative matches by second formant
       if flaga == 1 & abs(a(2,1)-t(2,k)) > .250
           flaga = 0;
       end
       if flage == 1 & abs(e(2,1)-t(2,k)) > .200
           flage = 0;
       end
       if flagi == 1 & t(2,k) < 1.6
           flagi = 0;
       end
       if flago == 1 & abs(o(2,1)-t(2,k)) > .150
           flago = 0;
       end
       if flagu == 1 & abs(u(2,1)-t(2,k)) > .200
           flagu = 0;
       end
       %assesses what vowel(s) it thinks it has
       if flaga == 1
           v(k,3) = v(k,3) + 1;
       end
       if flage == 1
           v(k,3) = v(k,3) + 10;
       end
       if flagi == 1
           v(k,3) = v(k,3) + 100;
       end
       if flago == 1
           v(k,3) = v(k,3) + 1000;
       end
       if flagu == 1
           v(k,3) = v(k,3) + 10000;
       end
   end
end

%final smoother, eliminates 1 long strings of specific vowels
for k=2:j-1
    if v(k,3) ~= v(k-1,3) & v(k,3) ~= v(k+1,3)
        v(k,3) = 0;
    end
end

%converts the numeric values into letters
answer = 'C';
for k=2:j
   if v(k,3) == 0 & v(k-1,3) ~= 0
      answer = [answer 'C'];
  elseif v(k,3) == 1 & v(k-1,3) ~= 1
      answer = [answer 'a'];
  elseif v(k,3) == 10 & v(k-1,3) ~= 10
      answer = [answer 'e'];
  elseif v(k,3) == 100 & v(k-1,3) ~= 100
      answer = [answer 'i'];
  elseif v(k,3) == 1000 & v(k-1,3) ~= 1000
      answer = [answer 'o'];
  elseif v(k,3) == 10000 & v(k-1,3) ~= 10000
      answer = [answer 'u'];
  end
end
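
For anyone porting this, the elimination step of the MATLAB listing above can be sketched in Python as follows. The formant centers and tolerances are copied from that listing; the function name and table layout are my own, and the sketch assumes the first two formants (in kHz) have already been measured for a frame.

```python
# (F1, F2) centers in kHz and (F1, F2) tolerances, copied from the MATLAB listing.
# For "i" the listing has no F2 tolerance; it only requires F2 >= 1.6 kHz.
VOWELS = {
    "a": ((0.650, 1.150), (0.150, 0.250)),
    "e": ((0.430, 1.650), (0.125, 0.200)),
    "i": ((0.250, 1.950), (0.050, None)),
    "o": ((0.420, 1.075), (0.100, 0.150)),
    "u": ((0.325, 1.350), (0.075, 0.200)),
}

def vowel_candidates(f1, f2):
    """Return every vowel whose F1 and F2 checks both pass (possibly none or several)."""
    out = []
    for vowel, ((c1, c2), (tol1, tol2)) in VOWELS.items():
        if abs(c1 - f1) > tol1:
            continue  # rejected by first formant
        if tol2 is None:
            if f2 < 1.6:
                continue  # special F2 rule used for "i"
        elif abs(c2 - f2) > tol2:
            continue  # rejected by second formant
        out.append(vowel)
    return out

print(vowel_candidates(0.650, 1.150))  # ['a']
print(vowel_candidates(0.380, 1.200))  # ['o', 'u']
```

The second call shows that the ranges overlap, so one frame can match more than one vowel. In the MATLAB listing such a frame gets a combined code (here 1000 + 10000 = 11000), which the final conversion loop does not map to any letter.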
Member
Posts: 164
Joined: 2010.10
Post: #7
Thanks, I'll try to read up on it!