The research activities were organized to exploit our background and our available methods and techniques in a synergistic way. In particular, the approach adopted to address distant-speech interaction was based on a tight combination of acoustic scene analysis (i.e., a preliminary detailed analysis of the acoustic activity produced by all sound sources active in the environment) and robust speech recognition. For music signal processing, we likewise relied on established techniques previously applied to distant-speech recognition.

Acoustic scene analysis aims to process and interpret, from different perspectives, the acoustic information diffused in the environment and picked up by a multi-microphone acquisition system. Multi-source localization and tracking, detection and classification of acoustic events, source separation, speech enhancement, and speaker identification are some of the basic problems investigated in order to characterize the acoustic scene.
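As an illustration of one of these problems, multi-source localization typically starts from time-difference-of-arrival (TDOA) estimates between microphone pairs. The sketch below, which is not this project's actual pipeline, estimates a single delay from two synthetic microphone signals via plain cross-correlation; the signals, the 16 kHz sampling rate, and the 5-sample delay are all invented for the example.

```python
import random

def cross_correlation_delay(x, y, max_lag):
    """Estimate the delay (in samples) of y relative to x by
    locating the peak of their cross-correlation."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(x[n] * y[n + lag]
                  for n in range(len(x))
                  if 0 <= n + lag < len(y))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

# Synthetic two-microphone capture: the same noise-like burst reaches
# microphone 2 five samples later than microphone 1 (invented setup).
random.seed(0)
burst = [random.gauss(0.0, 1.0) for _ in range(200)]
mic1 = burst + [0.0] * 10
mic2 = [0.0] * 5 + burst + [0.0] * 5

fs = 16000                                  # assumed sampling rate (Hz)
delay = cross_correlation_delay(mic1, mic2, max_lag=20)
print(delay)        # -> 5 (samples), i.e. a TDOA of 5 / fs seconds
```

A real multi-microphone system would compute the correlation in the frequency domain (e.g., with PHAT weighting) and combine TDOAs from several microphone pairs into a position estimate, but the peak-picking principle is the same.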

The acoustic properties of an enclosure determine how sound is affected as it propagates from an active acoustic source (e.g., a person talking, a loudspeaker, or a musical instrument) to a receiving point (e.g., a listener or a microphone). Early reflections convey important information about the surfaces found along the various propagation paths, and also give hints about the position, orientation, and directivity of the sources.
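A toy illustration of how early reflections encode geometry, under strongly simplifying assumptions (free field, a single reflecting surface, invented distances and attenuations): the spacing between peaks in the room impulse response directly reveals the extra path length travelled by the reflection.

```python
# Toy room impulse response: a direct path plus one early reflection.
# All geometry and attenuation values are illustrative assumptions.
fs = 16000                       # sampling rate (Hz)
c = 343.0                        # speed of sound (m/s)

direct_path = 2.0                # source-to-mic distance (m)
reflected_path = 3.5             # source-wall-mic path length (m)

# Sample delays for each propagation path.
d0 = round(direct_path / c * fs)       # direct arrival
d1 = round(reflected_path / c * fs)    # early reflection

rir = [0.0] * (d1 + 1)
rir[d0] = 1.0 / direct_path            # 1/r spherical spreading
rir[d1] = 0.6 / reflected_path         # wall absorbs some energy

def convolve(x, h):
    """Direct-form convolution: the microphone signal is the
    source signal filtered by the room impulse response."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

source = [1.0]                   # unit impulse as the source signal
mic = convolve(source, rir)

# The gap between the two peaks reveals the extra path length:
extra = (d1 - d0) / fs * c
print(round(extra, 2))           # -> 1.5 (m of additional travel)
```

Measured impulse responses contain many such reflections plus a dense reverberant tail, but the same reading of peak positions underlies inference about surfaces and source placement.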

Speech interaction with distant microphones represents a crucial step towards the deployment of flexible and non-invasive voice-enabled interfaces in novel application contexts, such as the smart home. In general, the distortion that the environment introduces in the signal, due to the distance between user and microphone, causes a significant reduction in speech recognition accuracy compared to the performance obtainable in an ideal close-talking condition, i.e., with the speaker a few centimeters from the microphone.
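A rough sense of why distance matters comes from the free-field inverse-distance law, under which the direct-path level drops 6 dB per doubling of distance, leaving the far microphone's signal much weaker relative to reverberation and ambient noise. The distances below are illustrative choices, not measurements from this research.

```python
import math

def level_drop_db(r_near, r_far):
    """Free-field attenuation of the direct path between two
    source-microphone distances (inverse-distance law)."""
    return 20.0 * math.log10(r_far / r_near)

close_talk = 0.05   # ~5 cm: close-talking headset microphone
distant = 3.0       # ~3 m: far-field microphone across a room

print(round(level_drop_db(close_talk, distant), 1))  # -> 35.6 (dB)
```

In a real room the direct path also gets smeared by reverberation, which is why distant-speech recognition needs enhancement and robust acoustic modeling rather than just gain.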

The goal of this research was the automatic extraction of relevant cues from audio streams, to enable effective music information retrieval solutions. Interest in this field grew considerably in the years 2005-2015, and not only at the scientific level: consider, for instance, the apps released by Shazam and SoundHound.
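One classical example of such a cue is the fundamental frequency (pitch), which can be estimated by autocorrelation. The sketch below applies this to a synthetic tone and is only an illustration of cue extraction, not the method used in this research; the tone frequency, sampling rate, and lag bounds are invented.

```python
import math

def estimate_f0(signal, fs, lag_min, lag_max):
    """Estimate fundamental frequency (Hz) by finding the lag
    that maximizes the signal's autocorrelation."""
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(signal[n] * signal[n - lag]
                  for n in range(lag, len(signal)))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

fs = 8000                         # assumed sampling rate (Hz)
f0 = 400.0                        # true pitch of the synthetic tone
tone = [math.sin(2 * math.pi * f0 * n / fs) for n in range(800)]

print(round(estimate_f0(tone, fs, lag_min=10, lag_max=100)))  # -> 400
```

Real music information retrieval front ends extract many such cues (spectral, timbral, rhythmic) frame by frame; fingerprinting services like Shazam instead hash constellations of spectrogram peaks to index recordings.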