Geeks With Blogs
.NET Corner Jeans, .NET and Physics (eka The Quantum Boy)
[ This post has moved to - http://tempuri.org/tempuri.html ]

Our business often involves VoIP-based media processing. When we place a call to a 1800-xxx number (or any other number) and the far end "seems" like an answering machine, we need to detect it. This kind of DSP work is normally done by dedicated hardware DSP chips, such as the Dialogic® MSP 1010.



But we don't have that luxury. We have a primitive Answering Machine Detection (AMD) system that is primarily based on cadence analysis. We do have a Goertzel-based fax tone detector, but until now the core algorithm has had no sophisticated DSP-based processing. Then one of our customers in Turkey reported a really bad AMD experience while dialing two particular Turkish numbers. All calls to these two numbers should have been classified as answering machines; instead, our software declared them human (live). They had every reason to be upset. When we analyzed what was going on, we found the following:

1. We placed a call through a SER (SIP Express Router) server at 88.x.y.z to one of the offending numbers, say P (P@88.x.y.z).
2. The initial provisional SIP responses (100 Trying / 180 Ringing) came back promptly.
3. Then early media started streaming from a parallel media server at 88.x.y.z1. It was very similar to the greeting one usually hears after a call is connected, e.g. "You have reached Mr. Gordon's residence...".
4. After some time the 200 OK (call connected) reached our end, followed by a tone and then silence.
5. Our primitive algorithm took the tone for a short "Hello", and the subsequent silence made it look like a "Hello, ..." greeting, so it declared the call a potential human.

To make our life even tougher, the tone came at various frequencies, none of which were standard DTMF tones. And our Goertzel (DFT-based) fax detection was tuned only to 2100 Hz, the fax answer tone frequency. What we needed was single-frequency (SF) tone detection where the frequency is not fixed in advance. We needed a quick solution; otherwise a big customer might be on its way out. One of these recorded media streams is shown below:



The first challenge was to find a pattern that identifies the tone. We knew we might have to turn to frequency (spectrum) analysis to solve this puzzle. We used Adobe Audition's built-in frequency analysis, and the results were enough to get us going:



The peak is clearly visible. There are various tone detection algorithms out there, including SETI's, but we decided to roll our own, a decision that would prove useful later. We already had a license for the Intel IPP (Integrated Performance Primitives) library (http://www.intel.com/cd/software/products/asmo-na/eng/302910.htm), a high-performance signal processing library. Initially we applied the FFT to the whole recorded WAV sample (voice plus tone). The results were disappointing, and time was running out fast. So we shifted our attention to a windowed FFT with a window size of 1024, the same FFT size shown in Adobe Audition's frequency analysis panel. The FFT size determines the frequency resolution, i.e. the width of one frequency bin:

F.R = sampling rate / FFT size
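As a worked example, assume narrowband telephony audio sampled at 8000 Hz (an assumption; other rates work the same way). A 1024-point FFT then gives F.R = 8000 / 1024 = 7.8125 Hz per bin, and each window covers 1024 / 8000 = 128 ms of audio; halving the window would double the bin width. A minimal C sketch of the calculation (the function name is hypothetical):

```c
#include <assert.h>

/* Frequency resolution (width of one FFT bin) in Hz:
   F.R = sampling rate / FFT size */
double frequency_resolution(double sample_rate_hz, int fft_size)
{
    assert(fft_size > 0);
    return sample_rate_hz / (double)fft_size;
}
```

For example, frequency_resolution(8000.0, 1024) returns 7.8125.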

Since we split the whole sample into 1024-sample windows, and the signal is generally not periodic within a window, spectral leakage starts kicking in. To reduce this effect, each window is usually multiplied by a tapering window function before the FFT is applied. Converting the result to power was the obvious next step. Let's jot down the steps we went through:

1. Apply a Hamming window to the block of samples

// 1. Hamming window (smooths out the window edges)...
status = ippsWinHamming_32f_I(in_dbl, windowSize);

// Show message box if status is wrong
IppErrorMessage("ippsWinHamming_32f_I", status);

if (status < 0)
return;

2. Apply the FFT with order 10 (i.e. log2(1024), for a 1024-point FFT)

////////////////////
//Core FFT
////////////////////

status = ippsFFTFwd_RToCCS_32f(in_dbl,
out,
spec,
NULL);

// Show message box if status is wrong
IppErrorMessage("ippsFFTFwd_RToCCS_32f", status);

if (status < 0)
return;

3. Since the signal is real, ignore the upper half of the FFT output, which is the complex conjugate mirror of the lower half
4. Obtain the magnitude vector from the FFT output

///////////////////
//Magnitude

status = ippsMagnitude_32fc((Ipp32fc*)out,
power_spectrum,
windowSize / 2);

// Show message box if status is wrong
IppErrorMessage("ippsMagnitude_32fc", status);

if (status < 0)
return;

5. Scale (normalize) the magnitude by half the window size.

//Scale the magnitude so that it is not a function of the window length: mx = mx / (length(x) / 2)

status = ippsDivC_32f_I((Ipp32f)(windowSize / 2),
power_spectrum,
windowSize / 2);

// Show message box if status is wrong
IppErrorMessage("ippsDivC_32f_I", status);

if (status < 0)
return;

6. Calculate the power vector as the square of the scaled magnitude vector

//We need power ~ sqr(mag.)
status = ippsSqr_32f_I(power_spectrum,
windowSize / 2);

// Show message box if status is wrong
IppErrorMessage("ippsSqr_32f_I", status);

if (status < 0)
return;

At this point we were stuck. Should we go for a sophisticated peak detection algorithm, or was there room for a KISS (keep it simple, stupid) strategy? IPP's high-level peak detection function tempted us enough to try it:

IppStatus ippsFindPeaks_32f8u(
const Ipp32f* pSrc,
Ipp8u* pDstPeaks,
int len,
int searchSize,
int movingAvgSize);

But wait a minute: after scanning a couple of windowed-FFT plots, it seemed that:

a. in a voice window, power is dispersed over the whole spectrum of the window;

b. in a tone window, power is concentrated in one or two frequency bins (indices into the window's FFT output array), so for our single-frequency tone we can simply focus on the bin where the window's maximum power occurs.



Armed with this simple observation, we decided to do the following:

1. Normalize the power spectrum by dividing the power vector by the total power (only if the window's total power is non-zero).
2. Find the maximum normalized power in the window.
3. If this value exceeds a predetermined (trial) threshold for some (trial) number of consecutive windows, we say that a single-frequency (SF) tone is probably present.
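The three steps above can be sketched in C as follows, operating on the power vector from step 6. The struct, the function names and the field layout are hypothetical, and the threshold and consecutive-window count are left as parameters because they were found by trial.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical detector state; threshold and required_windows are trial values. */
typedef struct {
    double threshold;     /* minimum fraction of window power in the peak bin */
    int required_windows; /* consecutive tone-like windows needed */
    int run;              /* current count of consecutive tone-like windows */
} sf_detector;

/* Steps 1-2: maximum normalized power of one window (peak bin / total power).
   Returns 0 for a silent window; optionally reports the peak bin index. */
double max_normalized_power(const float *power, size_t bins, size_t *max_bin)
{
    assert(bins > 0);
    double total = 0.0;
    size_t best = 0;
    for (size_t k = 0; k < bins; ++k) {
        total += power[k];
        if (power[k] > power[best])
            best = k;
    }
    if (max_bin)
        *max_bin = best;
    if (total <= 0.0)
        return 0.0; /* only normalize if the total power is non-zero */
    return power[best] / total;
}

/* Step 3: returns 1 once the threshold has been exceeded for
   required_windows consecutive windows, 0 otherwise. */
int sf_detector_feed(sf_detector *d, const float *power, size_t bins)
{
    double peak = max_normalized_power(power, bins, NULL);
    d->run = (peak > d->threshold) ? d->run + 1 : 0;
    return d->run >= d->required_windows;
}
```

For example, with a threshold of 0.5 and required_windows of 3, three consecutive windows with 90% of their power in one bin trigger detection, while a single flat, voice-like window resets the count.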

And you know what, the strategy paid off beautifully. The idea worked consistently over many recorded samples, including single-frequency tones. We can also print or log the tone's frequency using the following formula:

frq := max_power_bin * F.R
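In other words, max_power_bin is the index of the bin with the largest power, and multiplying it by the bin width F.R gives the tone's frequency in Hz. A small C sketch with hypothetical names, assuming the power vector from step 6 (bins 0 .. N/2 - 1):

```c
#include <assert.h>
#include <stddef.h>

/* Index of the bin holding the maximum power. */
size_t max_power_bin(const float *power, size_t bins)
{
    assert(bins > 0);
    size_t best = 0;
    for (size_t k = 1; k < bins; ++k)
        if (power[k] > power[best])
            best = k;
    return best;
}

/* frq := max_power_bin * F.R, with F.R = sampling rate / FFT size.
   The result is the centre frequency of the peak bin. */
double tone_frequency_hz(const float *power, size_t bins,
                         double sample_rate_hz, size_t fft_size)
{
    assert(fft_size > 0);
    double fr = sample_rate_hz / (double)fft_size;
    return (double)max_power_bin(power, bins) * fr;
}
```

For example, at an assumed 8 kHz sampling rate with a 1024-point FFT, a peak in bin 64 means 64 * 7.8125 = 500 Hz. The estimate is only as precise as one bin, roughly +/- F.R / 2.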

But all our samples were offline recordings, and we needed to integrate this little piece of DSP logic into our VoIP media stack, where frames (RTP, 10 or 20 ms of audio each) arrive sequentially and our chosen window size (1024 samples) is not a multiple of the frame length. So the residual samples had to be carried over from one window to the next. After this integration work and a few rounds of testing, we got the green light. I hate working on Sunday nights.
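The re-blocking described above can be sketched as follows: decoded frames are appended to a 1024-sample buffer, and whatever does not fit into the current window starts the next one. This is a minimal sketch assuming non-overlapping windows and 160-sample frames (20 ms at an assumed 8 kHz); the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW_SIZE 1024 /* analysis window size */

/* Hypothetical accumulator that re-blocks decoded RTP frames
   into WINDOW_SIZE-sample analysis windows. */
typedef struct {
    float buf[WINDOW_SIZE];
    size_t fill; /* number of samples currently buffered */
} window_accumulator;

/* Copies as many of the len samples as fit into the current window and
   returns how many were consumed. When the window becomes full, *window_ready
   is set to 1 and acc->buf holds the complete window until the next call;
   otherwise it is set to 0. Samples that did not fit are left for the caller
   to pass in again, so they start the next window. */
size_t accumulator_push(window_accumulator *acc, const float *frame, size_t len,
                        int *window_ready)
{
    assert(acc != NULL && window_ready != NULL);
    size_t room = WINDOW_SIZE - acc->fill;
    size_t take = len < room ? len : room;
    if (take > 0)
        memcpy(acc->buf + acc->fill, frame, take * sizeof(float));
    acc->fill += take;
    *window_ready = (acc->fill == WINDOW_SIZE);
    if (*window_ready)
        acc->fill = 0; /* the next push starts a new, non-overlapping window */
    return take;
}
```

A caller feeds each decoded frame in a loop, advancing by the returned count, and runs steps 1 to 6 on acc->buf whenever window_ready is set. With 160-sample frames, the first window completes partway through the seventh frame (6 * 160 = 960 < 1024), leaving 96 samples for the next window.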

Happy programming and have some fun.
-Thanks
Deb.
Posted on Friday, November 7, 2008 8:49 PM in .NET Core, Scientific Geekology, lab.geek.com


Comments on this post: Tone Detection, FFT and more...

# re: Tone Detection, FFT and more...
Very good article.
I am currently working on the same thing. Trying to get frequency from FFT result, but I am new to DSP.
I have FFT code already to process each 1024 samples of WAV.
Can you tell me how to use the FFT result to get frequency of those 1024 samples?
frq := max_power_bin * F.R What does it mean?
Thanks,
ZWW
Left by ZWW on Apr 13, 2009 3:02 AM

# re: Tone Detection, FFT and more...
please send for me your Product Price List
Left by Mahmoud Mansouri on Aug 16, 2009 3:32 AM



Copyright © dbose | Powered by: GeeksWithBlogs.net