Does EURASIP Journal on Advances in Signal Processing send emails to the other authors?

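EURASIP Journal on Advances in Signal Processing: accepted and online~~ my first SCI paper
After a long process of submission, waiting for review, major revision, waiting for re-review, preliminary acceptance, formal acceptance, and online publication,
my first SCI paper has finally been born!
I hope that everyone who sees this post,
everyone who is writing a paper,
and all fellow forum members can smoothly achieve what they are hoping for!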
The original poster is a good person. Best wishes to you, and while I am here, may I ask for some BB... the more the better...
EURASIP Journal on Advances in Signal Processing - a SpringerOpen journal
Editor-in-Chief: Geert Leus
(electronic version)
Journal no. 13634
EURASIP Journal on Advances in Signal Processing is a peer-reviewed open access journal published under the brand SpringerOpen. It brings science and applications together, with emphasis on both practical and theoretical aspects of signal processing in new and emerging technologies. It is directed as much at the practicing engineer as at the academic researcher. The journal highlights the extended reach and the diverse applications of signal processing and encourages a cross-fertilization of techniques. All papers should attempt to bring theory to life with practical simulations and examples. The journal employs a paperless, electronic review process to foster a fast turnaround.
Application areas include (but are not limited to): communications, networking, sensors and actuators, radar and sonar, medical imaging, biomedical applications, remote sensing, consumer electronics, computer vision, pattern recognition, robotics, fiber optic sensing/transducers, industrial automation, transportation, stock market and financial analysis, seismography, avionics.
The average time to first decision is only 40 days.
Science Citation Index Expanded (SciSearch), Journal Citation Reports/Science Edition, SCOPUS, Astrophysics Data System (ADS), Zentralblatt Math, Google Scholar, ACM Digital Library, Current Abstracts, Current Contents/Engineering, Computing and Technology, DBLP, DOAJ, Earthquake Engineering Abstracts, EBSCO Applied Science & Technology Source, EBSCO Computers & Applied Sciences Complete, EBSCO Discovery Service, EBSCO STM Source, EI-Compendex, OCLC, SCImago, Summon by ProQuest
The aim of the EURASIP Journal on Advances in Signal Processing is to highlight the theoretical and practical aspects of signal processing in new and emerging technologies. The journal is directed as much at the practicing engineer as at the academic researcher. Authors of articles with novel contributions to the theory and/or practice of signal processing are welcome to submit their articles for consideration. All manuscripts undergo a rigorous review process. The journal employs a paperless, electronic review process to enable a fast turnaround. It has been an open access journal since 2007.
Submitting to EURASIP Journal on Advances in Signal Processing (5 participants so far)
I have submitted a manuscript; the current status is "new submission".
Could the original poster share the submission link? I cannot find it. Thanks.
Wu and Zhang EURASIP Journal on Advances in Signal Processing :18
http://asp.eurasipjournals.com/content/

RESEARCH Open Access

An efficient voice activity detection algorithm by combining statistical model and energy detection

Ji Wu* and Xiao-Lei Zhang

Abstract
In this article, we present a new voice activity detection (VAD) algorithm that is based on statistical models and an empirical rule-based energy detection algorithm. Specifically, it needs two steps to separate speech segments from background noise. In the first step, the VAD detects possible speech endpoints efficiently using the empirical rule-based energy detection algorithm. However, the possible endpoints are not accurate enough when the signal-to-noise ratio is low. Therefore, in the second step, we propose a new Gaussian mixture model-based multiple-observation log likelihood ratio algorithm to align the endpoints to their optimal positions. Several experiments are conducted to evaluate the proposed VAD on both accuracy and efficiency. The results show that it could achieve better performance than the six referenced VADs in various noise scenarios.

Keywords: energy detection, Gaussian mixture model (GMM), multiple-observation, voice activity detection (VAD)

* Correspondence: wuji_ee@
Department of Electronic Engineering, Multimedia Signal and Intelligent Information Processing Laboratory, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China

1 Introduction
A voice activity detector (VAD) separates speech from background noise. It finds diverse applications in many modern speech communication systems, such as speech recognition, speech coding, noisy speech enhancement, mobile telephony, and very small aperture terminals.
During the past few decades, researchers have tried many approaches to improve VAD performance. Traditional approaches include energy in the time domain [1,2], pitch detection [3], and zero-crossing rate [2,4]. Recently, several new spectral energy-based features were proposed, including the energy-entropy feature [5], spatial signal correlation [6], the cepstral feature [7], higher-order statistics [8,9], Teager energy [10], spectral divergence [11], etc. The multi-band technique, which utilizes the band differences between speech and noise, was also employed to construct features [12,13].

Meanwhile, statistical models have attracted much attention. Most of them focused on finding a suitable model to simulate the empirical distribution of speech. Sohn [14] assumed that the speech and noise signals in the discrete Fourier transform (DFT) domain follow independent Gaussian distributions. Gazor [15] further used the Laplace distribution to model the speech signals. Chang [16] analyzed the Gaussian, Laplace, and Gamma distributions in the DFT domain and integrated them with a goodness-of-fit test. Tahmasbi [17] assumed that the speech process, after being transformed by a GARCH filter, has a variance gamma distribution. Ramirez [18] proposed the multiple-observation likelihood ratio test instead of the single-frame LRT [14], which improved the VAD performance greatly. More recently, many machine learning-based statistical methods were proposed and have shown promising performance. They include the uniformly most powerful test [19], discriminative (weight) training [20,21], the support vector machine (SVM) [22-24], etc.

© 2011 Wu and Zhang; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
On the other hand, because speech signals are difficult to capture perfectly by feature analysis, many empirical rules were constructed to compensate for the drawbacks of the VADs. Ramirez [18] proposed the contextual multiple global hypothesis to control the false alarm rate (FAR), where the empirical minimum speech length was used as the premise of his global hypothesis. The ETSI frame dropping (FD) VAD [25] is essentially an assembly of rules based on the continuity of speech. Besides, to our knowledge, one widely used empirical technique is the "hangover" scheme. Davis [26] designed a state machine-based hangover scheme to improve the speech detection rate (SDR). Sohn [14] used the hidden Markov model (HMM) to cover the trivial speech segments, and Kuroiwa [27] designed a grammatical system to enhance the robustness of the VAD.

The statistical models can detect the voice activity accurately, but they are not efficient in practice. On the other hand, the empirical rules can not only distinguish the apparent noise from speech but also [...]; however, they are not accurate enough in detecting the endpoints. In this article, we propose a new VAD algorithm that combines the empirical rule-based energy detection algorithm with the statistical models.

The rest of the article is organized as follows. In Section 2, we present the empirical rule-based energy detection sub-algorithm and the Gaussian mixture model (GMM)-based multiple-observation log likelihood ratio (MO-LLR) sub-algorithm in detail, and then we present how the two independent sub-algorithms are combined. In Section 3, several experiments are conducted. The results show that the proposed algorithm could achieve better performance than the six existing algorithms in various noise scenarios at different signal-to-noise ratio (SNR) levels.
In Section 4, we conclude this article and summarize our findings.

2 The proposed efficient VAD algorithm

2.1 The proposed VAD algorithm in brief
In [28], Li summarized some general requirements for a practical VAD. In this article, we restate them as follows and take them as the objectives of the proposed algorithm.
1) Invariant outputs at various background energy levels, with maximum improvements of speech detection.
2) Accurate location of detected endpoints.
3) Short time delay or look-ahead.
If we use only one algorithm, then it is hard to satisfy the second and third items simultaneously. If the average SNR level of the current speech signals is above zero, then the short-term SNRs around the speech endpoints are usually much lower than those between the endpoints. Hence, we can use different detection schemes for different parts of one speech segment.
The proposed algorithm has two steps to separate speech segments from background noise. In the first step, we use the double-threshold energy detection algorithm [2] to detect the possible endpoints of the speech segments efficiently. However, the detected endpoints are rough. Therefore, in the second step, we use the GMM-based MO-LLR algorithm to search around the possible endpoints for the accurate ones. By doing so, only the signals around the endpoints need the computationally expensive algorithm, so a lot of detection time can be saved.

2.2 Empirical rules-based energy detection
The efficient energy detection algorithm is meant not only to detect the apparent speech but also to find the approximate positions of the endpoints. However, the algorithm is not robust enough when the SNR is low. To enhance its robustness, we integrate it with a group of rules and present it as follows:
Part 1. As for the beginning-point (BP) detection, the silence energy $E_{sil}$ and the low/high energy thresholds $Th_{low}$/$Th_{high}$ of the nth observation $o_n$ are defined by Equations 1 and 2, where $E_j$ is the short-term energy of $o_j$, and $\alpha$, $\beta$ are the user-defined threshold factors. [Equations 1 and 2 are not recoverable from the extracted text; the surviving fragments show a normalized sum of $E_j$ whose index starts at $j = n - I$.] Given a signal segment $\{o_n, o_{n+1}, \ldots, o_{n+N_B-1}\}$ with a length of $N_B$ observations, if there are $N_{Bl}$ consecutive observations in the segment whose energy is higher than $Th_{low}$, and if the ratio $N_{Bl}/N_B$ is higher than an empirical threshold $\varphi_B^{low}$, then the first observation $o_b$ whose energy is higher than $Th_{low}$ is remembered. Then, we examine the given segment starting from $o_b$; if there are another $N_{Bh}$ consecutive observations whose energy is higher than $Th_{high}$, and if the ratio $N_{Bh}/N_B$ is higher than another empirical threshold $\varphi_B^{high}$, then one possible BP is detected as $o_b$.
Part 2. As for the ending-point (EP) detection, suppose that the energy of the current observation $o_e$ is lower than the energy threshold; we then analyze its subsequent signal segment with $N_E$ observations. If there are $N_{Eh}$ observations with energy higher than $Th_{high}$ in the segment, and if the ratio $N_{Eh}/N_E$ is lower than an empirical threshold $\varphi_E$, then one possible EP is detected as the current observation $o_e$.

2.3 GMM-based MO-LLR algorithm
Although the energy-based algorithm is efficient at detecting speech signals roughly, the endpoints it detects are not sufficiently accurate. Therefore, a computationally more expensive algorithm is needed to detect the endpoints accurately. Here, a new algorithm called the GMM-based MO-LLR algorithm is proposed.
Given the current observation $o_n$, a window $\{o_{n-l}, \ldots, o_{n-1}, o_n, o_{n+1}, \ldots, o_{n+m}\}$ is defined over $o_n$. Acoustic features $\{x_{n-l}, \ldots, x_{n+m}\}$ are extracted from the window. Two K-mixture GMMs are employed to model the speech and noise distributions, respectively:
$$P(x_i \mid H_1) = \sum_{k=1}^{K} \pi_{1,k}\, \mathcal{N}(x_i \mid \mu_{1,k}, \Sigma_{1,k}) \qquad (3)$$
$$P(x_i \mid H_0) = \sum_{k=1}^{K} \pi_{0,k}\, \mathcal{N}(x_i \mid \mu_{0,k}, \Sigma_{0,k}) \qquad (4)$$
where $i = n-l, \ldots, n+m$, $H_1$ ($H_0$) denotes the hypothesis of speech (noise), and $\{\pi_{z,k}, \mu_{z,k}, \Sigma_{z,k}\}$, $z \in \{0, 1\}$, are the parameters of the kth mixture. Based on the above definition, the log likelihood ratio (LLR) $s_i$ of the observation $o_i$ can be calculated as
$$s_i = \log P(x_i \mid H_1) - \log P(x_i \mid H_0) \qquad (5)$$
and the hard decision on $s_i$ is obtained by
$$c_i = \begin{cases} 1, & \text{if } s_i \ge \varepsilon \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$
where $\varepsilon$ is employed to tune the operating point of a single observation. In practice, $\varepsilon$ is initialized as the sum of two terms (the expression itself is garbled in the source): the first term denotes the current SNR level, and $\Delta$ is a user-defined constant. The constant "15" that appears in the expression can be set to another value too.
Until now, we can obtain a new feature vector $I_n = \{s_{n-l}, \ldots, s_{n+m}\}^T$ (or $I_n = \{c_{n-l}, \ldots, c_{n+m}\}^T$) from the soft (or hard) decision. Many classifiers based on the new feature can be designed, such as the simplest one, which calculates the average value of the feature [29], the global hypothesis on the multiple observations [18], the long-term amplitude envelope method [22], and the discriminative (weight) training method of the feature [20,21]. For simplicity, we just calculate the average value of the feature:
$$\Lambda_n = \begin{cases} \dfrac{1}{l+m+1} \sum_{i=n-l}^{n+m} s_i, & \text{if soft decision is used} \\ \dfrac{1}{l+m+1} \sum_{i=n-l}^{n+m} c_i, & \text{otherwise} \end{cases} \qquad (7)$$
and classify the current observation $o_n$ by
$$o_n \text{ is classified as } \begin{cases} \text{speech}, & \text{if } \Lambda_n \ge \eta \\ \text{noise}, & \text{otherwise} \end{cases} \qquad (8)$$
where $\eta$ is employed to tune the operating point of the MO-LLR algorithm.
Figure 1 gives an example of the detection process of the MO-LLR sub-algorithm with $l = m - 1$. From the figure, we can see that when the window length becomes large, the proposed algorithm has a good ability to control the randomness of the speech signals but a relatively weak ability to detect very short pauses between speeches. Therefore, setting the window to a proper length is important to balance the speech detection accuracy against the FAR. In our study, the hard decision method (6) is adopted, and two thresholds, $\eta_{begin}$ and $\eta_{end}$, are used for the BP and EP detections, respectively, instead of a single $\eta$ in (8).

Figure 1 Example of the detection process of the MO-LLR sub-algorithm. Panels: a) single observation; b) window length = 10 observations; c) window length = 30 observations. Horizontal axis: observation index. [The rest of the caption is not recoverable from the extracted text.]

2.4 Combination of the energy detection algorithm and the MO-LLR algorithm
The main consideration of the combination is to first detect the noise/speech signals that can be easily differentiated by the energy detection algorithm, leaving the signals around the endpoints to the MO-LLR sub-algorithm. Figure 2 gives a direct explanation of the combination method. From the figure, it is clear that the MO-LLR sub-algorithm is only used around the possible endpoints that are detected by the energy detection algorithm. Hence, a lot of computation can be saved.
We summarize the proposed algorithm in Algorithm 1, with its state transition graph drawn in Figure 3. Note that for the MO-LLR sub-algorithm, because an observation might appear not only in the current window but also in the next window when the MO-LLR window shifts, its output value from Equation 5 or 6 might be used several times. Therefore, the MO-LLR output of any observation should be remembered for a few seconds to avoid recalculating the LLR score in (5).

Figure 2 Schematic diagram of the proposed combination algorithm. Labels: endpoints detected by energy detection; true endpoints; detection region of MO-LLR.

2.5 Considerations on model training

2.5.1 Matching training for MO-LLR sub-algorithm
The observations between the endpoints have higher energy than those around the endpoints, and their spatial distributions differ from those around the endpoints too. In our proposed algorithm, the input data of the MO-LLR sub-algorithm are just the observations around the endpoints. If we used all data for training, the mismatch between the distribution of the speech around the endpoints and the distribution of the speech over the entire dataset would lower the classification accuracy of the data around the endpoints. Therefore, we only use the observations around the endpoints for GMM training. The expectation-maximization (EM) algorithm is used for GMM training.

2.5.2 Selections of the training dataset
In practice, it is difficult to find a training dataset that matches the test environment perfectly. Hence, we need a VAD algorithm that is not sensitive to the selection of the training dataset. To find how much the mismatch between the training and the test sets affects the performance, we define two kinds of models as follows:
- Noise-dependent model (NDM). This kind of model is trained in a given noise environment and is only tested in the same environment.
- Noise-independent model (NIM). This kind of model is trained from a training set that is a collection of speech in various noise environments and is tested in arbitrary noise scenarios.
The performance of the NDM is thought to be better than that of the NIM. However, we will show in our experiments that the NIM can achieve performance similar to that of the NDM, which demonstrates the robustness of the proposed algorithm.
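The MO-LLR decision of Section 2.3 (Equations 5 to 8) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the GMMs are assumed to have diagonal covariances (the article does not state the covariance type), and windows that run past the start or end of the signal are simply truncated, which Equation 7 does not specify.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Return log P(x | H) for a K-mixture GMM (Equations 3 and 4).

    Assumption: diagonal covariances, so variances[k] holds one variance
    per feature dimension.
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x_d, mu_d, var_d in zip(x, mu, var):
            log_p -= 0.5 * (math.log(2.0 * math.pi * var_d)
                            + (x_d - mu_d) ** 2 / var_d)
        component_logs.append(log_p)
    # Log-sum-exp over the mixtures for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(v - top) for v in component_logs))


def llr_score(x, speech_gmm, noise_gmm):
    """Equation 5: s_i = log P(x_i | H_1) - log P(x_i | H_0).

    Each GMM is a (weights, means, variances) tuple.
    """
    return gmm_log_likelihood(x, *speech_gmm) - gmm_log_likelihood(x, *noise_gmm)


def mo_llr_decisions(scores, l, m, eta, epsilon=None):
    """Classify every observation with the averaged feature (Equations 7 and 8).

    scores  : per-observation LLR scores s_i
    l, m    : number of observations before and after o_n in the window
    eta     : threshold on the averaged feature (Equation 8)
    epsilon : if given, hard decisions c_i (Equation 6) are averaged;
              otherwise the soft scores s_i are averaged
    Returns a list with 1 for speech and 0 for noise.
    """
    if epsilon is None:
        feats = list(scores)
    else:
        feats = [1.0 if s >= epsilon else 0.0 for s in scores]
    decisions = []
    total = len(feats)
    for n in range(total):
        lo = max(0, n - l)            # assumption: truncate at the edges
        hi = min(total, n + m + 1)
        average = sum(feats[lo:hi]) / (hi - lo)
        decisions.append(1 if average >= eta else 0)
    return decisions


if __name__ == "__main__":
    # Toy one-dimensional, single-mixture models (illustrative values only).
    speech = ([1.0], [[2.0]], [[1.0]])
    noise = ([1.0], [[0.0]], [[1.0]])
    frames = [[0.0], [0.0], [2.0], [2.0], [2.0], [0.0]]
    scores = [llr_score(x, speech, noise) for x in frames]
    print(mo_llr_decisions(scores, l=1, m=1, eta=0.5))
```

Because each score s_i is computed once and then reused by every window that contains it, the sketch also reflects the remark in Section 2.4 that the MO-LLR output of an observation should be remembered rather than recalculated.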
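The beginning-point rule in Part 1 of Section 2.2 can be sketched in the same way. Several points here are assumptions rather than facts from the article: Equations 1 and 2 are garbled in the source, so the thresholds are assumed to be the silence energy multiplied by the factors alpha and beta; "the first observation whose energy is higher than the low threshold" is interpreted as the first such observation in the segment; and all function and parameter names are hypothetical. The default values N_B = 20, 1/4, and 1/5 are the ones given in the parameter settings of the article.

```python
def longest_run_above(energies, threshold):
    """Length of the longest run of consecutive values above threshold."""
    best = 0
    current = 0
    for e in energies:
        if e > threshold:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


def detect_beginning_point(energies, n, e_sil, alpha, beta,
                           n_b=20, phi_low=0.25, phi_high=0.2):
    """Return the index of a possible beginning point, or None.

    energies : short-term energies E_j of the observations
    n        : index of the first observation of the analysed segment
    e_sil    : silence energy estimate (how it is computed is not
               recoverable from the source, so it is passed in)
    alpha, beta : threshold factors; thresholds are ASSUMED to be
               alpha * e_sil and beta * e_sil
    """
    th_low = alpha * e_sil
    th_high = beta * e_sil
    segment = energies[n:n + n_b]

    # Rule 1: enough consecutive observations above the low threshold.
    if longest_run_above(segment, th_low) / n_b <= phi_low:
        return None

    # Remember the first observation above the low threshold (o_b).
    b = next((i for i, e in enumerate(segment) if e > th_low), None)
    if b is None:
        return None

    # Rule 2: starting from o_b, enough consecutive observations
    # above the high threshold.
    if longest_run_above(segment[b:], th_high) / n_b <= phi_high:
        return None
    return n + b


if __name__ == "__main__":
    # Toy energies: silence, a weak onset, strong speech, silence.
    energies = [1.0] * 4 + [1.5] * 2 + [3.0] * 10 + [1.0] * 4
    print(detect_beginning_point(energies, 0, e_sil=1.0, alpha=1.3, beta=1.9))
```

The two-stage check is what makes the rule robust to short bursts: a brief spike can exceed both thresholds but fails the ratio tests, so no candidate endpoint is reported for it.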
In conclusion, constructing a training dataset that consists of various noise environments is sufficient for the GMM training in practice.2.6 Extensions and limitations of the proposed algorithmThe proposed combination method is easily extended to other features and classifiers. Many efficient algorithms can replace the energy detection algorithm, and besides MO-LLR algorithm, many accurate algorithms can be used to detect the precise positions of the endpoints too. If designed properly, then we can combine the two complementary sub-algorithms in our proposed method so as to inherit both of their advantages. To better understand the idea, we construct a new combination algorithm using two other sub-algorithms, where the sub-algorithms were proposed by other researchers. -Efficient sub-algorithm. In [28], a new feature is defined asif a(9)J=ntwhere Oj is the jth sample in time domain, I is the user-defined window length, and nf is the index of the first sample in the window. Instead of using Li's system [28] directly, we can just use the feature to replace ours in the energy detection part. - Accurate sub-algorithm. In [22], Ramirez proposed a new feature vector for SVM-based VAD. It was inspiredif 0by [28]. We present it briefly as follows. After DFT analysis of an observation, an N-dimen- sional vector XnFigure 3 State transition diagram of the proposed algorithm.The number “ 1 ” denotes that the speech obs “0” denotes that the noise observation is detected. &E& is short for the energy detection sub- “G“ is short for the GMM based MO-LLR sub-algorithm.=10lOgioJ2nt+I-1({Xn/i}J=i is obtained. In each dimension of the feature, the long-term spectral envelope can be calculated as Xn,i =max{xn,i-t,…,Xn,i+i}, Wu and Zhang EURASIP Journal on Advances in Signal Processing :18 http://asp.eurasipjournals.eom/eontent/ where l is the user-defined window length. 
Then, we transform the feature vector to another K-band spectral representation [22]E u -1 XPage 7 of 10respect of the mixture number of the GMM and the combination scheme. At last, we will prove that the proposed algorithm can achieve robust performance in mismatching situation between the training and test sets.n,k = lOlogi,2Kk+i、&N& 53n,u U=Uk j(10)3.1 Experimental setupThe TIMIT [31] speech corpus is used as the dataset. It contains utterances from eight different dialect regions in the USA. It consists of a training set of 326 male and 136 female speakers, and a testing set of 112 male and 56 female speakers. Each speakers utters 10 sentences, so that there are 4,620 utterances in the training set and 1,680 utterances in the test set totally. All the recorded speech signals are sampled at fs = 16 kHz. These TIMIT sets, after resampling from 16 to 8 kHz, are distorted artificially by the NOISEX corpus [32]. To simulate the real-world noise environment, the original TIMIT and NOISEX corpora are filtered by intermediate reference system [33] to simulate the phone handset, and then the SNR estimation algorithm based on active speech level [34] is employed to add four different noise types (babble, factory, vehicle, and white noise) at five SNR levels in a range of [5, 10,..., 25 dB]. Eventually, we get 20 pairs of noise-distorted training and test corpora. As done in a previous study [35], the TIMIT word transcription is used for VAD evaluation, and the inactive speech regions, which are smaller than 200 ms, are set to speech. The percentage of the speech process is 87.78%, which is much higher than the average level of the true application environments. To make the corpora more suitable for VAD evaluation, every utterance is artificially extended at the head and the tail, respectively, with some noise. The percentage of the speech is afterwards reduced to 62.83%, and the renewed corpora can reflect the differences of the VADs apparently. 
To examine the effectiveness of the proposed VAD algorithm, we compare it with the following existing VAD methods. -G.729B VAD [4]. It is a standard method applied for improving the bandwidth efficiency of the speech communication system. Several traditional features and methods are arranged in parallel. -VAD from ETSI AFE ES 202 050 [25]. It is the front-end model of an European standard speech recognition system. It consists of two VADs. The first one, called “ AFE Wiener filtering (WF) VAD,” is based on the spectral SNR estimation algorithm. The second one, called “AFE FD VAD”, is a set of empirical rules. Its main purpose is to integrate the fragmental output from AFE WF VAD into speech segments. -Sohn VAD [14]. It is a statistical model-based VAD. It uses the minimum-mean square errorwhere Uk=LN/2-k/Kj and k = 0, 1, ..., K - 1. Eventually, the element of the feature vector zn for SVM is defined as Zn, k = En, k - Vn, k, where the spectral representation of the noise Vn& k is estimated in the same way as En, k during the initialization period and the silence period. In [22], Ramirez has shown that the SVM-based VAD could achieve higher classification accuracy than Li's [28]. However, the computational complexity has not been considered. The nonlinear kernel SVM [30]-based VAD has been proved to be superior to the linear kernel SVM-based VAD [23,24]. However, if we use the nonlinear kernel SVM, then the following calculation is traditionally needed to classify a single observation On:(11)where {入疋 1 are the non-negative lagrange variables, Q( ) is the nonlinear kernel operator, T denotes the total observation number of the training set, and Therefore, practice. -Combination of the two sub-algorithms. The two algorithms can be combined efficiently by modifying the sample Oj in time domain (in Equation 9) to the observations in spectral domain. 
Obviously, even after the combination, the time complexity of the above algorithm is much higher than our proposed method. Therefore, we never tried to realize it. Although the proposed combination method is easily extended, it has one limitation as well. It is weak in detecting very short pauses between speeches. This is because we mainly try to optimize the detecting efficiency instead of pursuing the highest accuracy. If the applications need to detect the short pauses accurately, then we might overcome the drawback by adding some new rules or some complementary algorithms to the energy detection part. the time {zij1i=i for is the training dataset. classifying a single complexityobservation is even as high as ^(T)which is unbearable in3 Experimental analysisIn this section, we will compare the performances of the proposed algorithm with the other referenced VADs in general at first. Then, we will analyze its efficiency in Wu and Zhang EURASIP Journal on Advances in Signal Processing :18 http://asp.eurasipjournals.eom/eontent/Page 8 of 10estimation algorithm [36] to estimate the spectral SNR, and the gaussian model to model the distributions of the speech and noise. - Ramirez VAD [18]. It combines the multiple-observation technique [11,29] and the statistical VAD at first, and then, it proposes the global hypothesis to control the FAR. -Tahmasbi VAD [17]. It assumes that the speeches, after being filtered by GARCH model, have a variance gamma distribution. We train the GARCH model in matching environment between the trainFor the NDM training, we train 20 pairs of 50-mixture ing and test sets.NDMs from 20 noisy corpora.3.3 Results 3.3.1 Performance comparison with referenced VADsTwo measures are used for evaluation. One measure is the speech detection rate (SDR) and the FAR [37]. 
In order to evaluate the performance in a single variable, another measure is the harmonic mean F-score [35] between the precision rate of the detected speeches (PR) and the SDRF - score = ― ------SDR+PR(12)3. Parameter settingsAingle observation (frame) length is 25 ms long with overlap of 10 ms. or the rule-based energy detection algorithm, NB in BP detection is set to 20 with ^Bpw = 1/4 and 由=1/5. The NE in EP detection is set to 35 with i = 1/7. or the MO-LLR algorithm, the 39-dimensional fea- econtains 13-dimensional static MFCC features (with energy and without C0), their delta and delta-delta features. The window length is set to 30 with l setting to 14. The constant A referred in (6) is set to 1.5. For the combination of the two sub-algorithms (Algorithm 1), the scanning range S is set to 50. The minimum practical speech length is set to 35. Other parameters related to SNR are show in Table 1. These values are the optimal ones in different SNR levels. We get them from the training set of the noisy TIMIT corpora. In respect of matching training for MO-LLR sub-algorithm, 50 neighboring observations of every endpoint are extracted from the training set for GMM training. In respect of the selections of the training dataset, two kinds of models should be trained for performance comparison. For the NIM training, we randomly extract 231 utterances from every noise-distorted training corpus to form a noise-independent training corpus, and then we train a serial GMM pairs with [1, 2, 3, 5, 15, 35, and 50] 2 . SDR . PR r mixtures correspondingly. Note that the new noise-independent corpus contains 4,620 utterances totally, which is the same size as each noise-distorted training set.The higher the F-score is the better the VAD performs. Table 2 lists the performance comparisons of the proposed algorithm (with 5-mixture NIM) with other existing VADs. 
From the table, the G.729B, the AFE WF, and AFE FD VAD, which are open sources, have relatively comparable performances with the Sohn, Ramirez, and Tahmasbi VAD. This conclusion is identical with other studies, e.g., [14,18,35]. Also, the performances of the proposed algorithm are better than other referenced VADs. Figure 4 shows the F-score comparisons of the VADs. From the figure, we can see that the proposed algorithm yields higher F-score curves than other VADs. Table 3b lists the average CPU time of the proposed algorithm (with 5-mixtures NIM) and the referenced statistical model-based VADs over all 20 noisy corpora. From the table, it is clear that the proposed algorithm is faster than the three statistical VADs. The reason for the Sohn VAD being slower than Ramirez VAD is that the HMM-based “ hangover ” scheme in Sohn VAD is computationally expensive.3.3.2 How does the mixture number of the GMM affect the performance?If the mixture number of the GMM increases, then it is preferred that the performance of the VAD will be better. However, the computational complexity increases with the mixture number too. Therefore, it is important to find how the mixture number of the GMM will affect the performance and how many mixtures are needed toTable 1 SNR-related parameter settingsSNR 5 dB 10 dB 15 dB 20 dB 25 dBa b” begir h end1.30 1.90 0.27 0.2 0.45 0.25 0.55 0.401.3 0 2.5 020 noisy corpora. From the row, a linear relationship between the mixture number and the CPU time is observed. Table 5 shows the average accuracy of the proposed methods with different mixture numbers over all the noisy corpora. From the table, we can see that the mixture number has little effect on the performance when the number is larger than 5.0.60 0.500.65 0.55compromise the detecting time and the accuracy. 
The first row of Table 4 lists the average CPU time of the proposed methods with different mixture numbers over all the

Wu and Zhang EURASIP Journal on Advances in Signal Processing :18 http://asp.eurasipjournals.eom/eontent/ Page 9 of 10

Table 2 Performance comparisons between the proposed algorithm (with 5-mixture NIM) and other referenced VADs (%)

Scenario  SNR   G.729B          AFE WF          AFE FD          Sohn            Ramirez         Tahmasbi        Proposed
          (dB)  SDR     FAR     SDR     FAR     SDR     FAR     SDR     FAR     SDR     FAR     SDR     FAR     SDR     FAR
Babble    5     70.31   55.11   87.78   25.87   99.97   87.41   80.18   39.43   86.30   36.28   77.79   39.86   95.53   27.62
          10    77.99   53.74   94.12   24.99   100.00  86.99   83.31   28.03   88.88   23.10   81.57   29.35   96.29   15.92
          15    84.00   50.29   97.15   24.86   100.00  83.67   85.76   17.19   90.68   10.86   83.56   15.64   96.79   10.28
          20    87.65   40.32   98.54   25.42   100.00  76.48   88.71   11.83   93.50   6.74    87.62   10.98   96.70   8.02
          25    87.97   23.40   99.16   27.09   99.99   64.91   90.93   8.19    95.30   5.02    90.89   6.95    95.84   6.51
Factory   5     64.22   50.86   95.35   25.65   99.99   79.89   85.78   20.93   88.00   22.21   83.29   29.09   96.23   13.67
          10    73.87   49.84   92.57   18.09   99.98   81.63   82.49   30.87   90.93   16.40   84.28   20.56   96.09   11.57
          15    81.72   47.63   96.64   19.19   99.99   78.88   84.49   18.18   90.32   11.78   85.79   16.70   96.89   7.79
          20    86.65   38.58   98.36   20.75   99.99   71.24   87.52   10.86   93.29   7.11    88.13   11.78   96.81   6.59
          25    87.60   23.24   99.07   22.87   99.99   59.30   90.00   7.87    95.04   5.02    90.43   9.04    95.97   5.87
Vehicle   5     56.78   44.49   76.09   2.05    99.92   81.13   80.12   25.56   85.94   10.04   80.98   38.63   93.53   6.58
          10    68.14   44.88   89.18   3.92    99.99   83.36   82.27   11.74   90.98   4.45    80.25   16.08   95.50   4.77
          15    77.47   43.65   95.26   5.91    100.00  77.96   86.23   6.07    94.74   3.99    84.82   8.48    96.99   3.95
          20    84.54   35.31   97.86   8.41    99.99   65.67   89.89   4.57    96.63   4.46    89.72   5.45    97.27   4.32
          25    86.90   19.76   98.90   11.46   99.99   49.62   92.61   5.43    97.21   5.07    93.22   5.08    96.44   4.45
White     5     51.98   44.66   74.69   1.39    99.75   66.18   79.50   17.75   86.01   6.20    79.50   29.19   92.98   5.63
          10    64.60   44.93   88.50   3.29    99.96   76.23   83.52   9.51    91.88   4.43    82.22   12.26   95.50   4.77
          15    75.07   43.89   94.92   5.34    99.99   78.42   87.63   5.02    95.15   3.32    87.32   5.78    96.95   3.60
          20    83.37   36.34   97.79   7.80    99.99   72.21   91.01   4.33    96.80   3.92    91.89   4.60    97.25   3.67
          25    86.56   20.55   98.87   10.93   99.99   61.01   93.51   4.27    97.50   4.91    94.37   5.11    96.60   3.78

SDR, s...; FAR, false alarm rate.

[Figure: ... comparisons in different noise scenarios. Panels: Babble noise, Factory noise, Vehicle noise, White noise. Horizontal axis: SNR (dB). Legend: G729, AFE WF, AFE FD, Sohn, Ramirez, Tahmasbi, Proposed.]

Table 3 CPU time (in seconds) comparisons between the proposed algorithm and other existing VADs

          Sohn  Ramirez  Tahmasbi  Proposed
CPU time  7.8114603.88             88.01

The reported results are average ones over all 20 noisy corpora.

Table 5 Performance comparisons of the proposed algorithm with different GMM mixture numbers

# Mixture  1      2      3      5      15     35     50
SDR        96.28  96.36  96.25  96.03  96.19  96.18  96.11
FAR        10.18  10.11  9.94   8.31   8.65   8.36   8.00
F-score    95.22  95.27  95.27  95.61  95.59  95.67  95.73

SDR, s...; FAR, false alarm rate.

In conclusion, setting the mixture number to 5 is enough to guarantee the detection accuracy.

3.3.3 How much time could be saved by using the combination algorithm instead of using MO-LLR only?
In order to show the advantage of the combination, we compare the proposed algorithm with the MO-LLR algorithm. Table 4 gives the CPU time comparison between the two. From the table, we can conclude that the proposed algorithm is about 2.3 to 2.5 times faster than the MO-LLR algorithm at every mixture number tested.

3.3.4 How does the mismatching between the training and the test sets affect the performance?
The histograms of the differences between the manually labeled endpoints and the detected ones [28] are used as the measure. The main reason for using this measure is that the MO-LLR sub-algorithm is only used in the area around the endpoints rather than over the entire corpora.
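As a rough illustration of this measure, the following Python sketch computes the endpoint differences, groups them into a histogram, and summarizes them by mean and standard deviation. The function name endpoint_difference_stats, the sign convention (detected minus labeled), the use of the population standard deviation, and the toy endpoint values are assumptions made for illustration and are not taken from the article; only the bin width of five observations follows the caption of Figure 5.

```python
from collections import Counter
from statistics import mean, pstdev


def endpoint_difference_stats(labeled, detected, bin_width=5):
    """Summarise detected-minus-labeled endpoint offsets (in observations).

    Assumption: both lists hold observation indices and pair up one-to-one.
    """
    if len(labeled) != len(detected):
        raise ValueError("labeled and detected must pair up one-to-one")
    diffs = [d - l for l, d in zip(labeled, detected)]
    # Histogram keyed by the left edge of each bin of `bin_width` observations.
    hist = Counter((diff // bin_width) * bin_width for diff in diffs)
    return {
        "mean": mean(diffs),
        "std": pstdev(diffs),  # population standard deviation (an assumption)
        "hist": dict(sorted(hist.items())),
    }


# Toy data (hypothetical): manually labeled vs. detected endpoint positions.
labeled = [100, 250, 400, 620]
detected = [103, 248, 400, 631]
stats = endpoint_difference_stats(labeled, detected)
print(stats)
```

A mean and standard deviation close to zero indicate that the detected endpoints sit close to the labeled ones.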
Figure 5 gives an example of the histograms. It is clear that the BP is much more easily detected than the EP. However, since there are too many histograms to show in this article, we summarize each histogram by its mean and standard deviation. The closer to zero the means and standard deviations are, the better the GMMs perform. Table 6 lists the average results of the means of the histograms over all the noisy corpora. It shows that the NDM does not perform much better than the NIM, especially when they have the same mixture number, which indicates the robustness of the proposed algorithm. From the NIM columns alone, we can also conclude that the performance changes only slightly from 5 to 50 mixtures. To summarize, in order to achieve robust performance, we only need to train 5-mixture GMMs from a dataset that covers various noisy environments instead of training new GMMs for each new test environment. The trouble of training new models can thus be avoided.

4 Conclusions
In this article, we present an efficient VAD algorithm by combining two sub-algorithms. The first sub-algorithm is an efficient rule-based energy detection algorithm, where the rules enhance the robustness of the energy detection. The second sub-algorithm is the GMM-based MO-LLR algorithm. Although the MO-LLR is computationally expensive, it can classify speech and noise accurately. The two sub-algorithms are combined by first using the energy detection algorithm to detect the speech that is easily differentiated, leaving the speech around the endpoints to the MO-LLR sub-algorithm. The experimental results show that the proposed algorithm achieves better performance than the six commonly used VADs. It has also been demonstrated that the proposed VAD is more efficient and robust in different noisy environments.

Endnotes
a Here, we use the MFCC, its delta and delta-delta features as the feature, which has a total dimension of 39.
But the proposed method is not limited to this feature.
b Because the G.729B VAD and the ETSI AFE VAD are implemented in C code but the other four are implemented in MATLAB code, it is meaningless to compare the proposed algorithm with the G.729B VAD and ETSI AFE VAD directly.

Algorithm 1: Combining energy detection & MO-LLR
1: initialization: start from silence.
BP detection:
2: if a possible BP ô_B is detected by Part 1 of the energy detection
3:   if ô_B is confirmed to be speech by MO-LLR
4:     search in a range of (ô_B - δ, ô_B + δ) for the accurate BP o_B by MO-LLR. o_B is defined as the change point from noise to speech.
5:     goto the ending-point detection (Step 12)
6:   else
7:     move to next observation, goto Step 2
8:   end
9: else
10:   move to next observation, goto Step 2
11: end
Ending-point (EP) detection:
12: if a possible EP ô_E is detected by Part 2 of the energy detection
13:   if ô_E is confirmed to be noise by MO-LLR
14:     search in a range of (ô_E - δ, ô_E + δ) for the accurate EP o_E by MO-LLR. o_E is defined as the change point from speech to noise.
15:     if the length from o_B to o_E is too small to be practical
16:       delete the detected speech endpoints o_B and o_E
17:     end
18:     goto the BP detection (Step 2)
19:   else
20:     move to next observation, goto Step 12.
21:   end
22: else
23:   move to next observation, goto Step 12.
24: end

Table 4 CPU time (unit: seconds per test corpus) comparisons between the proposed algorithm and the MO-LLR algorithm

# Mixture  Proposed          MO-LLR
1          67.27 (±6.20)     159.43 (±2.20)
2          72.73 (±5.75)     167.00 (±0.16)
3          77.91 (±6.58)     181.00 (±0.84)
5          88.01 (±8.38)     208.61 (±0.41)
15         139.10 (±14.86)   337.77 (±0.82)
35         241.49 (±29.33)   600.16 (±0.97)
50         318.40 (±40.55)   799.85 (±0.97)

Figure 5 The accumulating results (histograms) of the differences between the manually labeled endpoints and the detected ones in different noise scenarios. Each column of the histogram is in a width of five observations. If the detected endpoint is in the positive axis of the histogram, it means that the noises between the detected one and the labeled one are wrongly detected as speech, and vice versa. [Panels: Babble 15dB beginning-point, White 15dB beginning-point, Babble 15dB ending-point, White 15dB ending-point. Horizontal axis: Relative position.]

Table 6 Comparisons of the histogram means and standard deviations between NIMs and NDMs

Model  # Mixture  BP               EP
NIM    1          0.13 (±12.63)    2.46 (±19.88)
NIM    2          0.35 (±12.29)    2.52 (±19.93)
NIM    3          0.41 (±12.31)    1.99 (±19.73)
NIM    5          -0.05 (±11.66)   0.20 (±19.10)
NIM    15         0.06 (±11.60)    0.93 (±19.41)
NIM    35         -0.06 (±11.33)   0.65 (±18.99)
NIM    50         -0.15 (±11.09)   0.22 (±18.79)
NDM    50         0.23 (±11.34)    1.22 (±18.11)

The histogram is the accumulating result of the differences between the manually labeled endpoints and the detected ones. The reported results are average ones over all 20 noisy corpora. If the mean values are positive, it means that some noise is wrongly detected as speech; otherwise, some speech is wrongly detected as noise.

Abbreviations
DFT: discrete Fourier transform
EM: expectation-maximization
FAR: false alarm rate
FD:
GMM: Gaussian mixture model
HMM: hidden Markov model
LLR:
MO-LLR: multiple-observation
NDM: noise-
NIM: noise-
SDR: s
SNR: signal-to-noise ratio
SVM: support vector machine
VAD: voice activity detection.

Acknowledgements
This study was supported by The National High-Tech. R&D Program of China (863 Program) under Grant .

Competing interests
The authors declare that they have no competing interests.

Received: 26 November 2010 Accepted: 12 July 2011 Published: 12 July 2011

References
1. JG Wilpon, LR Rabiner, T Martin, An improved word detection algorithm for telephone-quality speech incorporating both syntactic and semantic constraints. AT&T Bell Labs Tech J. 63, 353-364 (1984)
2. LR Rabiner, MR Sambur, An algorithm for determining the endpoints of isolated utterances. Bell Sys Tech J. 54(2), 297-315 (1975)
3. R Chengalvarayan, Robust energy normalization using speech/nonspeech discriminator for German connected digit recognition. in 6th Euro Conf Speech Commun, Tech ISCA (1999)
4. A Benyassine, E Shlomot, HY Su, D Massaloux, C Lamblin, JP Petit, ITU-T Recommendation G. 729 Annex B: a silence compression scheme for use with G. 729 optimized for V. 70 digital simultaneous voice and data applications. IEEE Commun Mag. 35(9), 64-73 (1997). doi:10.527
5. L Huang, C Yang, A novel approach to robust speech endpoint detection in car environments, in Proc Int Conf Acoust, Speech and Signal Process, 3 (2000)
6. R Le Bouquin-Jeannes, G Faucon, Study of a voice activity detector and its influence on a noise reduction system. Speech Commun. 16(3), 245-254 (1995). doi:10.93(94)00056-G
7. J Shen, J Hung, L Lee, Robust entropy-based endpoint detection for speech recognition in noisy environments, in 5th Int Conf Spoken Lang Process (1998)
8. E Nemer, R Goubran, S Mahmoud, Robust voice activity detection using higher-order statistics in the LPC residual domain. IEEE Trans Acoust, Speech, Signal Process. 9(3), 217-231 (2001)
9. K Li, M Swamy, M Ahmad, An improved voice activity detection using higher order statistics. IEEE Trans Acoust, Speech, Signal Process. 13(5) Part 2, 965-974 (2005)
10. G Ying, L Jamieson, C Mitchell, Endpoint detection of isolated utterances based on a modified Teager energy measurement. in Int Conf Acoust, Speech, Signal Process Vol. 2 (1993)
11. J Ramirez, J Segura, C Benitez, A De La Torre, A Rubio, Efficient voice activity detection algorithms using long-term speech information. Speech Commun. 42(3-4), 271-287 (2004). doi:10.1016/j.specom.
12. G Evangelopoulos, P Maragos, Multiband modulation energy tracking for noisy speech detection. IEEE Trans Audio, Speech Lang Process. 14(6), (2006)
13. B-F Wu, K Wang, Robust endpoint detection algorithm based on the adaptive band-partitioning spectral entropy in adverse environments. IEEE Trans Acoust, Speech, Signal Process. 13(5), 762-775 (2005)
14. J Sohn, NS Kim, W Sung, A statistical model-based voice activity detection. IEEE Signal Process Lett. 6(1), 1-3 (1999). doi:10.233
15. S Gazor, W Zhang, A soft voice activity detector based on a Laplacian-Gaussian model. IEEE Trans Acoust, Speech, Signal Process. 11(5), 498-505 (2003)
16. JH Chang, NS Kim, SK Mitra, Voice activity detection based on multiple statistical models. IEEE Trans Signal Process. 54(6), (2006)
17. R Tahmasbi, S Rezaei, A soft voice activity detection using GARCH filter and variance Gamma distribution. IEEE Trans Audio, Speech Lang Process. 15(4), (2007)
18. J Ramirez, JC Segura, JM Gorriz, L Garcia, Improved voice activity detection using contextual multiple hypothesis testing for robust speech recognition. IEEE Trans Audio, Speech Lang Process. 15(8), (2007)
19. D Kim, K Jang, J Chang, A new statistical voice activity detection based on UMP test. IEEE Signal Process Lett. 14(11), 891-894 (2007)
20. S Kang, Q Jo, J Chang, Discriminative weight training for a statistical model based voice activity detection. IEEE Signal Process Lett. 15, 170-173 (2008)
21. T Yu, JHL Hansen, Discriminative training for multiple observation likelihood ratio based voice activity detection. IEEE Signal Process Lett. 17(11), 897-900 (2010)
22. J Ramirez, P Yelamos, J Gorriz, J Segura, SVM-based speech endpoint detection using contextual speech features. Electron Lett. 42(7), 426-428 (2006). doi:10.1049/el:.
23. Q Jo, J Chang, J Shin, N Kim, Statistical model based voice activity detection using support vector machine. IET Signal Process. 3(3), 205-210 (2009). doi:10.1049/iet-spr..
24. JW Shin, JH Chang, NS Kim, Voice activity detection based on statistical models and machine learning approaches. Computer Speech & Language. 24(3), 515-530 (2010). doi:10.1016/j.csl.
25. ETSI, Speech processing, transmission and quality aspects (STQ); distribute advanced front-end feature compression algorithms. ETSI ES 202 050
26. A Davis, S Nordholm, R Togneri, Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold. IEEE Trans Audio, Speech Lang Process. 14(2), 412-424 (2006)
27. S Kuroiwa, M Naito, S Yamamoto, N Higuchi, Robust speech detection method for telephone speech recognition system. Speech Commun. 27, 135-148 (1999). doi:10.-72-7
28. Q Li, J Zheng, A Tsai, Q Zhou, Robust endpoint detection and energy normalization for real-time speech and speaker recognition. IEEE Trans Acoust, Speech, Signal Process. 10(3), 146-157 (2002)
29. J Ramirez, JC Segura, C Benitez, L Garcia, A Rubio, Statistical voice activity detection using a multiple observation likelihood ratio test. IEEE Signal Process Lett. 12(10), 689-692 (2005)
30. B Scholkopf, AJ Smola, Learning With Kernels (MIT Press, Cambridge, MA, 2002)
31. J Garofolo, L Lamel, W Fisher, J Fiscus, D Pallett, N Dahlgren, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NTIS order number PB91-93)
32. The Rice University, Noisex-92 database, http://spib.rice.edu/spib
33. ITU-T Rec P.48, Specifications for an intermediate reference system, ITU-T, March 1989
34. ITU-T Rec P.56, Objective measurement of active speech level, ITU-T 1993
35. TV Pham, CT Tang, M Stadtschnitzer, Using artificial neural network for robust voice activity detection under adverse conditions. in Int Conf Comput, Commun Tech, RIVF '09, 1-8 (2009)
36. Y Ephraim, D Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans Audio, Speech Lang Proc. 32(6), (1984)
37. S Kay, Fundamentals of Statistical Signal Processing, Volume 2: Detection Theory (Prentice Hall PTR, 1998)

doi:10.80-2011-18
Cite this article as: Wu and Zhang: An efficient voice activity detection algorithm by combining statistical model and energy detection. EURASIP Journal on Advances in Signal Processing .