
Acoustic Event Detection and Classification

The Acoustic Event Detection and Classification (AEDC) task aims at detecting and classifying acoustic events typical of lecture rooms, meeting rooms, and domestic contexts. The AEDC problem, in particular its classification stage, was initially investigated in SHINE under the European project CHIL and is still a topical research activity. In fact, AEDC is a key component of effective acoustic scene analysis and reconstruction, which is crucial for achieving robust speech interaction in noisy and reverberant environments, such as those addressed by the DIRHA project. Typically, the task is split into two subtasks:

  • event detection;
  • event classification.

Concerning the detection stage, it is necessary to distinguish between speech (which is relevant for the final interaction) and any other event. In the latter case, the most commonly used feature is the signal energy. A decision criterion based on thresholds that split the bimodal distribution of signal energy is appropriate at high SNR, typical of close-talking microphones and quiet environments. Robust and reliable discrimination becomes increasingly difficult as the SNR decreases, as in noisy environments or when the distance between speaker and microphone is large. In those scenarios, dynamically adaptive thresholds are generally employed.
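
The adaptive-threshold idea above can be sketched as follows. This is a minimal illustration, not the SHINE detector: the frame length, the smoothing factor `alpha`, and the margin `k` are illustrative values chosen for the example.

```python
import numpy as np

def detect_events(signal, sr, frame_ms=10, alpha=0.95, k=3.0):
    """Flag frames whose short-time log-energy exceeds an adaptively
    tracked noise floor. alpha and k are illustrative parameters,
    not values taken from the SHINE system."""
    hop = int(sr * frame_ms / 1000)
    n_frames = len(signal) // hop
    frames = signal[:n_frames * hop].reshape(n_frames, hop)
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-12)

    noise_floor = log_e[0]           # initialise from the first frame
    active = np.zeros(n_frames, dtype=bool)
    for i, e in enumerate(log_e):
        threshold = noise_floor + k  # margin above the tracked floor
        active[i] = e > threshold
        if not active[i]:            # update the floor on quiet frames only
            noise_floor = alpha * noise_floor + (1 - alpha) * e
    return active
```

Because the floor is updated only on frames classified as quiet, the threshold follows slow changes in the background level without being dragged upwards by the events themselves.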
The detection of speech events is a research topic in itself, often referred to as Speech Activity Detection (SAD) or Voice Activity Detection (VAD). SAD is an important pre-processing step in many speech-enabled applications. In particular, for ASR it is essential to identify speech segments and isolate them from those containing only undesired noise. The task is tackled either with more speech-related features, such as pitch, or with more sophisticated classification schemes like those mentioned below. The SHINE single-channel VAD technology for distant-speech applications performs a preliminary segmentation based on pitch estimation, followed by a validation of the detected segments using the estimated SNR as a feature.
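
A crude version of the pitch cue used in the first stage can be sketched with frame autocorrelation: voiced speech shows a strong peak at a lag corresponding to the pitch period. The lag range and the `thresh` value below are illustrative assumptions, not parameters of the SHINE VAD.

```python
import numpy as np

def is_voiced(frame, sr, fmin=80.0, fmax=400.0, thresh=0.4):
    """Crude voicing test: a strong normalised autocorrelation peak
    in the plausible pitch-lag range (fmin..fmax Hz) suggests voiced
    speech. Parameter values are illustrative only."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return False                 # silent frame
    ac = ac / ac[0]                  # normalise so ac[0] == 1
    lo, hi = int(sr / fmax), int(sr / fmin)
    return float(ac[lo:hi].max()) > thresh
```

A real SAD would then validate the segments built from such frames, e.g. with an SNR estimate, as described above.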

The classification scheme currently available in SHINE works on isolated acoustic events. The front-end processing is based on 12 Mel-scaled Cepstral Coefficients (MCCs) and the log-energy of the signal, with an analysis step of 10 ms. As for the acoustic modeling, each acoustic event corresponds to a single HMM. Each of the 16 models has a left-to-right topology and uses Continuous Density HMMs, with output probability distributions represented by mixtures of a single Gaussian component with diagonal covariance matrices. HMM training is accomplished through the standard Baum-Welch procedure. The resulting event accuracy is 90.3% on a database collected for this purpose.
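
The decision rule of such a scheme can be sketched with a toy forward-algorithm scorer: one left-to-right, single-Gaussian, diagonal-covariance HMM per event, and the event whose model gives the highest log-likelihood wins. This is a simplified stand-in for the SHINE models (fixed 0.5/0.5 self-loop/advance probabilities, no training), not their actual implementation.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Per-frame log-density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var,
                         axis=-1)

def lr_forward(obs, means, varis):
    """Log-likelihood of obs (T x D) under a left-to-right HMM with one
    Gaussian per state; self-loop/advance fixed at 0.5 each (a toy
    assumption, not a trained model)."""
    n_states = means.shape[0]
    log_a = np.log(0.5)
    log_b = log_gauss(obs[:, None, :], means, varis)   # (T, n_states)
    alpha = np.full(n_states, -np.inf)
    alpha[0] = log_b[0, 0]                             # must start in state 0
    for t in range(1, len(obs)):
        stay = alpha + log_a
        move = np.concatenate(([-np.inf], alpha[:-1] + log_a))
        alpha = np.logaddexp(stay, move) + log_b[t]
    return alpha[-1]                                   # must end in last state

def classify(obs, models):
    """Pick the event whose HMM yields the highest log-likelihood."""
    return max(models, key=lambda name: lr_forward(obs, *models[name]))
```

In the real system the means, variances, and transition probabilities would of course come from Baum-Welch training on the event database rather than being fixed by hand.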

The database available at FBK focuses on events that can happen in small environments, like lecture and small-meeting rooms. It contains 16 semantic classes of events: door knock, door open, door slam, steps, chair moving, cough, paper wrapping, falling object, laugh, keyboard clicking, key jingle, spoon/cup jingle, phone ring, phone vibration, MIMIO pen buzz, and applause. More details are available here.

Video Clips and Demo

Some video clips showing real-time implementations of our AED solutions are available on our demo page: