SHINE Robotic Platforms

The SHINE unit has two robotic platforms, which serve as showcases for the technological components developed by the unit. Our main purpose is to exploit these platforms to make progress in scientific fields related to multi-microphone signal processing and distant-speech interaction.

Robot Description

Our robot is based on the TurtleBot 2 platform. The basic structure consists of a Kobuki mobile base, a Microsoft Kinect device and an entry-level laptop. The Kobuki base is equipped with two motors, bumpers, wheel encoders and a gyroscope. To this standard setup we added eight digital MEMS microphones, some LEDs and an Arduino-based LCD screen.

The MEMS microphones are positioned on a plate on top of the TurtleBot, with a layout designed to enhance speech coming from the front direction. The screen displays a simple face, with eyes that track a person moving in front of the robot. The nose and mouth are used in conjunction with gesture recognition to display the results of the user's movements. Eight red LEDs are arranged in a circle and point to the direction the speech is coming from, driven by a speaker localization node. The Microsoft Kinect is used for autonomous navigation and body skeleton tracking.
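As an illustration of how a localization estimate could drive the LED ring, the sketch below maps a direction of arrival to the nearest of the eight LEDs. The function name, the convention that LED 0 faces forward, and the counter-clockwise numbering are assumptions made for this example; the actual wiring and node interface are not described here.

```python
NUM_LEDS = 8  # eight LEDs arranged in a circle, as described above


def led_for_azimuth(azimuth_deg: float, num_leds: int = NUM_LEDS) -> int:
    """Return the index of the LED closest to the given direction.

    Assumption: LED 0 points straight ahead (0 degrees) and indices
    increase counter-clockwise in equal steps of 360 / num_leds degrees.
    """
    step = 360.0 / num_leds
    # Normalize the angle to [0, 360) and round to the nearest LED position.
    return int((azimuth_deg % 360.0) / step + 0.5) % num_leds
```

With this convention, a speaker directly behind the robot (180 degrees) lights LED 4, and a speaker at -90 degrees lights LED 6.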

The software architecture is based on ROS (Robot Operating System), an open-source framework, typically run on Ubuntu, that helps in building robot applications. Its main concept is a set of collaborating software nodes supervised by a master. Nodes can run on different machines to obtain a distributed architecture.

Various software nodes are available for the Kobuki platform to access its sensors and control its motors. Other nodes allow the use of the Microsoft Kinect. In addition to these low-level nodes, there are more complex nodes that provide autonomous navigation with obstacle avoidance, body skeleton tracking and a follow-me modality.

We enriched the architecture with new nodes to handle multichannel audio acquisition, beamforming, sound source localization and the Arduino LCD. In addition, some specific nodes have been developed to interact with the speech recognizer and the dialogue framework. At the moment all the nodes run on board, but the architecture allows the load to be distributed across multiple machines.

It is possible to control the robot via low-level commands like “move forward one meter” or “what is the battery level”, mid-level commands like “go to Maurizio’s office” or multimodal commands like “go there”.

In addition to the regular TurtleBot driven by a laptop, we developed a low-cost solution using an ODROID XU4 single-board computer and off-the-shelf components. The ARM-based ODROID platform can run ROS and most of the nodes developed for the laptop. The Kobuki base has been equipped with an ORBBEC Astra device to replace the discontinued Microsoft Kinect, and the eight-microphone board has been replaced by a commercial board with four MEMS microphones.

An interesting feature of the robot is its SLAM (Simultaneous Localization And Mapping) ability. The robot can enter an unknown environment and create a map using its sensors. The map can later be used for autonomous navigation. Alternatively, the robot can be provided with a map in a graphic format. In our case we adapted the floor plan of the building as a static map.

The navigation uses AMCL (Adaptive Monte Carlo Localization) to localize the robot in a 2D space. The robot uses a navigation stack for the route planning, including costmaps to store information about obstacles in the world. The planning is computed using the static map, but during the navigation the robot is able to modify its trajectory in order to avoid obstacles that appear in front of it.
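The interplay between a plan computed on the static map and replanning around newly sensed obstacles can be pictured with a toy grid example. This is only a sketch under simplifying assumptions: it uses breadth-first search on a small occupancy grid rather than the planners of the ROS navigation stack, and the `sensed_obstacles` argument is a hypothetical stand-in for live sensor data.

```python
from collections import deque


def plan(grid, start, goal):
    """Breadth-first search on a 4-connected grid (0 = free, 1 = occupied).

    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


def navigate(static_map, start, goal, sensed_obstacles):
    """Follow a plan and replan whenever the next cell turns out to be blocked.

    sensed_obstacles maps a step count to the cells that become occupied
    at that step (hypothetical substitute for sensor readings).
    """
    costmap = [row[:] for row in static_map]  # working copy of the static map
    position, travelled, step = start, [start], 0
    path = plan(costmap, position, goal)
    while path is not None and position != goal:
        for r, c in sensed_obstacles.get(step, ()):
            costmap[r][c] = 1  # record the newly observed obstacle
        nxt = path[1]
        if costmap[nxt[0]][nxt[1]] == 1:
            path = plan(costmap, position, goal)  # trajectory is blocked: replan
            continue
        position = nxt
        path = path[1:]
        travelled.append(position)
        step += 1
    return travelled if position == goal else None
```

On an empty 3x3 map, going from (0, 0) to (0, 2) follows the top row; if cell (0, 1) becomes occupied before the first move, the route detours through the middle row instead.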

Multimodal Dialogue Description

In the past, FBK built a dialogue engine that was used in several projects and prototypes. It can handle system-initiative and mixed-initiative dialogues, and can cope with relatively complex natural-language sentences in limited domains. The dialogue component implemented in the current version of the FBK robot is composed of the following main parts:

  • engine, written in Perl (there is also a version implemented in C, which has not yet been updated with the latest features): it is basically an interpreter of a description written in a proprietary language and compiled into a Perl data structure;
  • description of the dialogue, which consists of:
    • a dialogue strategy: a number of structures and procedures, common to many applications, that implement the philosophy of the dialogue. Very briefly, there is a main loop in which the status of the dialogue is analyzed in order to decide the next move. A number of semantic concepts are organized into contexts, each one corresponding to a subdomain, and the dialogue strategy has to cope with all of them;
    • a number of tools that can be easily included in the application, to handle common concepts like numbers, dates, confirmations, etc.;
    • an application-dependent part, where the semantic data necessary to model the desired domains are defined, together with all their dedicated procedures.

The dialogue module communicates with several external modules, each dedicated to a specific task, through a scheduler whose only responsibility is to exchange messages among processes. The most important processes are:

  • speech acquisition: it captures the audio of a sentence spoken by the user, at present in a synchronous way, i.e. the user can speak only after the prompt given by the robot. This process can be simple (just one microphone, for instance for phone applications) or complex (on the robot, the signals of the 8 on-board microphones are processed to obtain a single, enhanced speech signal). A text input mode is also available;
  • ASR + parsing: the former processes the input speech to obtain the word sequence, the latter produces a parse tree of the sentence. The labels of the parse tree must be aligned with the labels of the semantic concepts defined in the dialogue application;
  • speech synthesis: it produces a speech waveform from the word sequence generated by the dialogue;
  • database: it collects information from the “external world”, for instance train timetables, a personal agenda, tourist data, etc.;
  • reference resolution: it keeps track of the objects mentioned during the dialogue and, when a pronoun is used, tries to identify the object it refers to;
  • robot modules: several modules perform speaker localization, return or set the values of some parameters, and execute navigation commands that can be simple (go forward 1 meter, turn right 30 degrees) or complex (follow me, navigate to coordinates X,Y), etc.;
  • movement tracker: a Kinect tracks the movements of the users in front of the robot. Some gestures are recognized and passed to the dialogue, which combines them with spoken commands to handle multimodal input.
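One plausible way to picture the reference resolution process listed above is a history of mentioned objects from which, for a pronoun, the most recent compatible object is picked. The class names, the property tags and the compatibility test below are all hypothetical; the sketch does not reproduce the actual FBK module.

```python
from dataclasses import dataclass, field


@dataclass
class Mention:
    name: str
    properties: frozenset  # hypothetical property tags, e.g. {"settable"}
    turn: int              # dialogue turn in which the object was mentioned


@dataclass
class ReferenceResolver:
    history: list = field(default_factory=list)

    def mention(self, name, properties, turn):
        """Record that an object was mentioned at the given dialogue turn."""
        self.history.append(Mention(name, frozenset(properties), turn))

    def resolve(self, required):
        """Return the most recently mentioned object that has all the
        required properties, or None if no mentioned object qualifies."""
        candidates = [m for m in self.history if set(required) <= m.properties]
        if not candidates:
            return None
        return max(candidates, key=lambda m: m.turn).name
```

For example, after the user mentions an office (a place) and then the speed (a settable parameter), a request such as "bring it to 0.7" needs a settable object, so "it" would resolve to the speed rather than the office.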

The most interesting features recently added to the dialogue are:

  • compound commands: the system can accept multiple commands at once (go forward 1 meter, turn left 30 degrees and then go on for 2 meters) and execute them in the right order, waiting for each one to complete before starting the next;
  • reference resolution: the system remembers the objects mentioned during the dialogue and assigns the most probable one to a pronoun used by the speaker (how much is your speed? … increase it by 0.2 … bring it to 0.7…), taking into account both the properties of the objects and when the user spoke about them;
  • multimodality: the system accepts input from speech only, from gesture only (stop) or from a combination of the two (go there, indicating a direction with the arm). It can also detect the position of the speaker from his or her voice, exploiting the different delays with which it reaches the 8 microphones; this feature is essential to react properly to the command “come here” even when the speaker is behind the robot.
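The idea of locating a speaker from arrival delays can be sketched for a single pair of microphones. The example below estimates the delay by brute-force cross-correlation and converts it to an angle with the standard far-field relation sin(θ) = c·τ/d. It is a simplified illustration, not the localization node used on the robot: the function names, the sign convention and the restriction to two microphones are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at about 20 °C


def estimate_delay(ref, other, max_lag):
    """Return the lag in samples that maximizes the cross-correlation.

    A positive lag means the sound reaches `other` later than `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            ref[i] * other[i + lag]
            for i in range(len(ref))
            if 0 <= i + lag < len(other)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle(delay_samples, sample_rate, mic_distance):
    """Far-field direction of arrival in degrees, measured from the
    broadside of the microphone pair (0 = source straight ahead)."""
    ratio = SPEED_OF_SOUND * delay_samples / sample_rate / mic_distance
    ratio = max(-1.0, min(1.0, ratio))  # guard against noisy estimates
    return math.degrees(math.asin(ratio))
```

A single pair cannot distinguish front from back, since both give the same delay; an array with more microphones, such as the eight on the robot, resolves this ambiguity by combining several pairs.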

The system can currently handle several types of commands. Commands can be given in a natural way, and when a command is incomplete the system asks for the missing parameters. The most important commands are:

  • basic navigation commands: commands like “go forward 2 meters”, “turn right 90 degrees”, “go back 50 centimeters”, “stop”;
  • mid-level navigation commands: commands like “go home”, “go to the entrance door”, “go to the office of Mary” will cause the robot to perform autonomous navigation to a coordinate X,Y associated with the given label;
  • compound commands: the system can handle a list of commands, like “turn right 90 degrees, go forward one meter, turn 30 degrees to the left”;
  • help and self-awareness: the system can access and change some of its internal parameters using commands like “which is the value of your speed?”, “bring it to 1.0”, “set angular speed to 2”, “what can you do?”, “shut up”;
  • teaching commands: the user can teach the robot a new position with “learn, this is the office of John”; from then on the robot remembers that position, so the command “go to office of John” will be properly executed;
  • agenda: the user can save a new appointment in the agenda, or ask for the appointments previously saved. To be complete, each appointment needs a date, a time slot and a description;
  • multimodal commands: “come here”, “go there [+gesture]”, “follow me” rely on both spoken commands and information coming from various sensors (gesture detection using the Kinect, speaker localization via audio processing) to be properly executed.
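The behavior of asking for missing parameters can be illustrated with a small slot-filling sketch. The command names and slot names in the table below are hypothetical and chosen only to mirror the examples in the list above; the actual dialogue description is written in the engine's own language, not in Python.

```python
# Hypothetical command schemas: the parameters each command needs
# before it can be executed.
REQUIRED_SLOTS = {
    "go_forward": ("distance_m",),
    "turn": ("direction", "angle_deg"),
    "go_to": ("place",),
    "add_appointment": ("date", "time_slot", "description"),
}


def missing_slots(command, slots):
    """List the required parameters that the user has not provided yet."""
    return [name for name in REQUIRED_SLOTS[command] if slots.get(name) is None]


def next_move(command, slots):
    """Decide whether to execute the command or ask for a missing parameter."""
    missing = missing_slots(command, slots)
    if missing:
        return ("ask", missing[0])
    return ("execute", command)
```

For instance, “turn right” fills only the direction, so the next move is to ask for the angle; once the angle is given, the command can be executed.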

The system can handle multiple dialogues at a time: this feature was implemented, for instance, in the DIRHA system, where a different dialogue was running in each room so that different people could be served at the same time. Recently, a web interface has been implemented so that the dialogue can potentially be used from any device connected to the Internet. Together with a webcam, this feature could make it possible to control the robot remotely or, more generally, to use the dialogue to access information from anywhere.

The dialogue architecture is modular, in the sense that it is quite easy to add new processes to the system (recent additions include robot modules, pronoun resolution and multimodality). This will make it easy to introduce new processes that, for example by exploring the characteristics of the environment, suggest moves to the dialogue that improve the interaction with the user (for instance, when a noise source is detected, trying to reduce it by closing a window, or proposing to move closer to the speaker to get a better signal-to-noise ratio).

At present, no deep learning is included in the dialogue. However, we plan to add it to some components in order to improve the behavior of the whole system. In particular, deep learning could be useful to control the dialogue state and drive the behavior of the dialogue strategy, which is currently implemented by a set of rules.

The dialogue language is Italian. An equivalent version in English is under development.

The following videos show some examples of interaction:

  • The first video shows the main functionalities, including speaker localization and spoken dialogue management;
  • The second one concerns autonomous navigation;
  • The third one shows an example of gesture handling;
  • The last one shows a simultaneous interaction with two platforms, each of which is called by name before being given a command. Examples of compound commands are also given.