SmartWeb

Lead Innovation SmartWeb

The objective of SmartWeb (http://www.smartweb-projekt.de/), a project funded by the Federal Ministry of Education and Research (BMBF), is intelligent, multimodal, mobile access to the Semantic Web. Since 2004, fourteen partners coordinated by DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz / German Research Center for Artificial Intelligence) have been working on realizing this vision. Apart from a number of universities and other research institutes, the consortium includes small and medium-sized enterprises as well as large industrial companies.

A major component of SmartWeb is the Semantic Web. This Next-Generation Internet is based on standardized, semantically descriptive languages for the content of digital documents, so that they are interpretable by machines. One of the approaches within SmartWeb is to transform machine-readable pages of the World Wide Web into semantically structured data. The Chair for Pattern Recognition (LME) is involved in work packages which will provide users with multimodal, mobile access to the resources of the Semantic Web. The SmartWeb front-end will consist of a client (PDA, Smartphone, etc.) and a dialogue server. One component on the server is the multimodal recognizer (MMR), which is made up of several modules that process speech, video and biosignals. Three MMR modules are developed at the Chair for Pattern Recognition: detection and processing of out-of-vocabulary (OOV) words (in cooperation with the SmartWeb partner Sympalog), classification of the user state (such as stressed vs. relaxed), and classification of the user's focus of attention using multimodal input.
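To make the server-side architecture more concrete, the following minimal Python sketch shows how a multimodal recognizer could dispatch the incoming audio, video, and biosignal streams to the three modules developed at LME. All class and method names are illustrative assumptions, not the actual SmartWeb interfaces.

    # Minimal sketch of the server-side multimodal recognizer (MMR); the
    # class and method names are assumptions, not the real SmartWeb code.

    class OOVModule:
        def process(self, audio):
            # detect and handle out-of-vocabulary words in the utterance
            return {"oov_words": []}

    class UserStateModule:
        def process(self, biosignals):
            # classify the emotional user state, e.g. stressed vs. relaxed
            return {"user_state": "relaxed"}

    class AttentionModule:
        def process(self, audio, video):
            # classify On-Talk/Off-Talk and On-View/Off-View
            return {"focus": "On-Focus"}

    class MultimodalRecognizer:
        """Combines the module outputs for the dialogue server."""
        def __init__(self):
            self.oov = OOVModule()
            self.state = UserStateModule()
            self.attention = AttentionModule()

        def analyze(self, audio, video, biosignals):
            result = {}
            result.update(self.oov.process(audio))
            result.update(self.state.process(biosignals))
            result.update(self.attention.process(audio, video))
            return result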

The first version of the OOV module was integrated into the SmartWeb system in 2005. During 2006 the components to determine the user's focus of attention (classification of on-view/off-view, on-talk/off-talk) became part of the SmartWeb environment. Results concerning the classification of the user state are shown in a stand-alone demonstrator.

Detection and Processing of Out-of-Vocabulary Words

To further improve the processing of out-of-vocabulary (OOV) words within the SmartWeb demonstrator, the main focus in 2006 was on the recognition of sub-word units. The basis for this work was an analysis of two aspects of using words and sub-word units for speech recognition: the frequency of unknown words and sub-word units, and the error rate of a recognizer trained accordingly. For the EVAR corpus (Erlangen train information system) it was shown that syllables performed well in terms of recognition accuracy on the phone level (82.8%). At the same time, the rate of unknown syllables in the test set was a moderate 1.0% with respect to the training set. With a word recognizer the recognition accuracy could be improved to 84.6%; however, the out-of-vocabulary rate for words was significantly higher (2.6%). For phones the OOV rate was close to zero, but the recognition accuracy was only 70%. As a consequence, the monophone recognizer in the SmartWeb demonstrator was replaced with a syllable recognizer. Even more improvement is expected from an approach with dynamic (data-driven) sub-word units that is currently being explored.
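The trade-off between vocabulary coverage and recognition accuracy can be illustrated with a small Python sketch that computes the OOV rate of a test set for word units and for sub-word units; the toy data and the naive fixed-length chunking used in place of a proper syllabification are assumptions for illustration only, not the EVAR setup.

    # Illustrative OOV-rate computation for words vs. sub-word units
    # (toy data; a real system would use a phonetic syllabification).

    def oov_rate(train_tokens, test_tokens):
        """Fraction of test tokens whose type never occurred in training."""
        train_vocab = set(train_tokens)
        unseen = [t for t in test_tokens if t not in train_vocab]
        return len(unseen) / len(test_tokens)

    def to_units(word, size=3):
        # placeholder "syllabification": fixed-length chunks of the orthography
        return [word[i:i + size] for i in range(0, len(word), size)]

    train_words = ["fahrkarte", "nach", "nuernberg", "bahnhof", "fahren"]
    test_words = ["fahrplan", "nach", "bamberg", "fahren"]

    print("word OOV rate:", oov_rate(train_words, test_words))
    print("unit OOV rate:", oov_rate(
        [u for w in train_words for u in to_units(w)],
        [u for w in test_words for u in to_units(w)]))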

Apart from changes to the setup of the hybrid speech recognizer, such as the parallel recognition of words and sub-word units, there were also efforts to increase the amount of data for training the sub-word unit recognizers. For this purpose, several speech corpora recorded by the SmartWeb partner Sympalog could be used. To cover all new words contained in the speech data, the phonetic lexica at the Chair for Pattern Recognition had to be expanded. Since the lexica are also used for automatic phoneme-grapheme / grapheme-phoneme conversion with MASSIVE, the syllable boundaries in the orthographic representation of the words were annotated as well.
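As an illustration of such a lexicon entry, the following hypothetical snippet pairs orthography with annotated syllable boundaries and a syllabified phonetic transcription; the '|' boundary marker, the transcription alphabet, and the field layout are assumptions and not the actual LME/MASSIVE lexicon format.

    # Hypothetical lexicon entries: orthography with '|' as syllable boundary,
    # mapped to a syllabified, SAMPA-like phonetic transcription.
    lexicon = {
        "Bahn|hof": "b a: n - h o: f",
        "Fahr|kar|te": "f a: 6 - k a r - t @",
    }

    for orthography, phones in lexicon.items():
        graph_syllables = orthography.split("|")
        phone_syllables = phones.split(" - ")
        # aligned orthographic and phonetic syllables can serve as training
        # pairs for grapheme-to-phoneme / phoneme-to-grapheme conversion
        print(list(zip(graph_syllables, phone_syllables)))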

User State Classification

For efficient interaction between the SmartWeb system and the user, it can be beneficial for the system to have some information about the emotional user state (e.g. stressed vs. relaxed, or annoyed vs. satisfied). This is particularly apparent in the car and motorbike scenario, where the system should ideally be able to adjust the flow of information to the driver's current workload. Although speech, facial expression, and gesture can in many cases be a clue to the emotional user state, they are highly individual and subject to masking. Physiological parameters such as skin conductivity or pulse can indicate the state of the user directly. Mobile measurement systems for physiological parameters already exist, and the sensors may one day be integrated into clothing, the steering wheel of a car, or a mobile device like a PDA. Within SmartWeb, LME aims to develop techniques that allow user-independent, real-time classification of the user state from physiological signals.

In order to develop a classifier for discriminating between stress/non-stress in a data-driven manner, a comprehensive corpus has been collected during the period under report: DRIVAWORK (Driving under varying workload) contains recordings of six physiological signals, audio, and video under different workload/stress conditions during a simulated car drive. The recordings were taken from 24 subjects and amount to 15 hours of usable data or 1.1 GB of physiological signals. Further, the existing feature sets (on which the classification is based) were improved and extended. When combining all six physiological signals, the extreme states "relaxed" and "stressed" can be recognized person-independently with a class-wise averaged recognition rate of 90%.
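The evaluation scheme behind this figure can be sketched as follows: a classifier is trained on per-segment feature vectors and evaluated person-independently with leave-one-subject-out cross-validation, where the class-wise averaged recognition rate corresponds to balanced accuracy. The random data, the feature dimensionality, and the SVM below are placeholders, not the actual DRIVAWORK features or classifier.

    # Person-independent "relaxed" vs. "stressed" classification sketch
    # (placeholder data; feature extraction from the signals is omitted).
    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(240, 12))           # one feature vector per segment
    y = rng.integers(0, 2, size=240)         # 0 = relaxed, 1 = stressed
    subjects = np.repeat(np.arange(24), 10)  # 24 subjects, 10 segments each

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # Class-wise averaged recognition rate = balanced accuracy, evaluated
    # person-independently by leaving one subject out per fold.
    scores = cross_val_score(clf, X, y, groups=subjects,
                             cv=LeaveOneGroupOut(),
                             scoring="balanced_accuracy")
    print("mean class-wise averaged recognition rate:", scores.mean())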

Multimodal Recognition of the User's Focus of Attention

The user of a mobile device (e.g. the T-Mobile MDA Pro) can pose spoken questions to SmartWeb. The speech signals are transmitted to a server and analyzed there. The system then classifies whether it has actually been addressed: the user may have spoken to another person who is present, or even to himself or herself. Using the microphone and the camera of the MDA Pro, all necessary information can be obtained easily, and the user does not have to press any cumbersome push-to-talk button. If a face is detected in the camera image, the system classifies the user's focus of attention based on his or her gaze direction: On-View if the user is looking directly at the display, and Off-View if the user is looking not at the MDA but at another person or elsewhere. Similarly, speech signals are classified with respect to the user's intended dialogue partner: the system distinguishes between On-Talk (utterances directed to the system) and Off-Talk (the user talks to another person or to himself or herself). Thus, On-Focus in the SmartWeb scenario is assumed only if the user is both looking towards the device and talking to the system. The classification of Off-View is based on the Viola-Jones algorithm for face detection; to detect Off-Talk, prosodic features (duration, energy, fundamental frequency, jitter, shimmer) are used.
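A rough sketch of the On-View/Off-View front end is given below, using OpenCV's implementation of the Viola-Jones cascade (assuming the opencv-python package). Treating the mere presence of a frontal face as a necessary condition for On-View is a simplification; the actual module additionally evaluates the gaze direction.

    # Viola-Jones face detection as a first step of the On-View decision.
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def is_on_view_candidate(frame_bgr):
        """Return True if a frontal face is found in the camera frame."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        # A detected frontal face is only a prerequisite for On-View; a real
        # system would further estimate gaze direction from the face region.
        return len(faces) > 0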

In the period under report a demonstrator was implemented and presented at CeBIT 2006. In the SmartWeb system it can now be seen on the display of the MDA whether a request has been recognized as On-Focus. To train and evaluate the system, data as realistic as possible was collected by our partner LMU Munich. Per utterance, the two classes On-Focus and Off-Focus are classified correctly in 77% of cases using the audio signal (class-wise averaged recognition rate) and in 71% using the video signal. Fusing both classification scores with a meta-classifier improved the recognition rate to over 80%. Even the four classes On-Talk, read Off-Talk, spontaneous Off-Talk, and paraphrasing Off-Talk (a result of SmartWeb is reported to another person) are recognized at 67% (chance level: 25%). Based on the audio signal alone, 65% of single words are correctly assigned to the classes On-Talk and Off-Talk. An additionally recorded database with exceptionally cooperative users (acted Off-Talk) yielded recognition rates of up to 93%.
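The late fusion step can be illustrated with a small sketch in which the per-utterance scores of the audio and video classifiers are combined by a meta-classifier; the toy scores and the choice of logistic regression are assumptions for illustration only, not the actual SmartWeb fusion module.

    # Late fusion of audio and video classifier scores with a meta-classifier.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: [audio score for On-Talk, video score for On-View]
    scores = np.array([[0.9, 0.8],   # talking to and looking at the device
                       [0.2, 0.7],   # looking at device, talking to someone else
                       [0.8, 0.1],   # talking to device, looking away
                       [0.1, 0.2]])  # neither
    labels = np.array([1, 0, 0, 0])  # 1 = On-Focus, 0 = Off-Focus

    meta = LogisticRegression().fit(scores, labels)
    print(meta.predict([[0.85, 0.9]]))  # expected: On-Focus (1)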

Project manager:
Prof. Dr.-Ing. Elmar Nöth

Project participants:
Dr. phil. Anton Batliner, Dr.-Ing. Christian Hacker, Dipl.-Inf. Florian Hönig, Dr.-Ing. Axel Horndasch

Keywords:
Speech recognition; OOV processing; biosignals; user state classification; multimodal information sources

Duration: 1.4.2004 - 30.9.2007

Sponsored by:
Bundesministerium für Bildung und Forschung

Contact:
Nöth, Elmar
Phone +49 9131 85 27888, Fax +49 9131 85 27270, E-Mail: elmar.noeth@fau.de
Publications
Hacker, Christian ; Batliner, Anton ; Nöth, Elmar: Are You Looking at Me, are You Talking with Me -- Multimodal Classification of the Focus of Attention. In: Sojka, P. ; Kopecek, I. ; Pala, K. (Ed.) : Text, Speech and Dialogue. 9th International Conference, TSD 2006, Brno, Czech Republic, September 2006, Proceedings (9th International Conference, TSD 2006 Brno 11-15.9.2006). Berlin, Heidelberg : Springer, 2006, pp 581 -- 588. (Lecture Notes in Artificial Intelligence (LNAI), No. 4188) - ISBN 978-3-540-39090-9
Batliner, Anton ; Hacker, Christian ; Nöth, Elmar: To Talk or not to Talk with a Computer: On-Talk vs. Off-Talk. In: Fischer, Kerstin (Ed.) : How People Talk to Computers, Robots, and Other Artificial Communication Partners (How People Talk to Computers, Robots, and Other Artificial Communication Partners Bremen April 21-23, 2006). 2006, pp 79-100. (University of Bremen, SFB/TR 8 Report Vol. 010-09/2006)
Horndasch, Axel ; Nöth, Elmar ; Batliner, Anton ; Warnke, Volker: Phoneme-to-Grapheme Mapping for Spoken Inquiries to the Semantic Web. In: ISCA (Org.) : Proceedings of the Ninth International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP) (Ninth International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP) Pittsburgh 17.-21.09.2006). Bonn : ISCA, 2006, pp 13-16.
Batliner, Anton ; Hacker, Christian ; Kaiser, Moritz ; Mögele, Hannes ; Nöth, Elmar: Taking into Account the User's Focus of Attention with the Help of Audio-Visual Information: Towards less Artificial Human-Machine-Communication. In: Krahmer, Emiel ; Swerts, Marc ; Vroomen, Jean (Ed.) : AVSP 2007 (International Conference on Auditory-Visual Speech Processing 2007 Hilvarenbeek 31.08.-03.09.2007). 2007, pp 51-56.
Hönig, Florian ; Batliner, Anton ; Nöth, Elmar: Fast Recursive Data-driven Multi-resolution Feature Extraction for Physiological Signal Classification. In: Hornegger, Joachim ; Mayr, Ernst W. ; Schookin, Sergey ; Feußner, Hubertus ; Navab, Nassir ; Gulyaev, Yuri V. ; Höller, Kurt ; Ganzha, Victor (Ed.) : 3rd Russian-Bavarian Conference on Biomedical Engineering (3rd Russian-Bavarian Conference on Biomedical Engineering Erlangen 2.-3.07.2007). Vol. 1. Erlangen : Union aktuell, 2007, pp 47-52. - ISBN 3-921713-33-X
Hönig, Florian ; Batliner, Anton ; Nöth, Elmar: Real-time Recognition of the Affective User State with Physiological Signals. In: Cowie, Roddy ; de Rosis, Fiorella (Ed.) : The Second International Conference on Affective Computing and Intelligent Interaction, Proceedings of the Doctoral Consortium (Affective Computing and Intelligent Interaction Lisbon, Portugal 12-14.09.2007). 2007, pp 1-8. - ISBN 978-989-20-0798-4
Hönig, Florian ; Hacker, Christian ; Warnke, Volker ; Nöth, Elmar ; Hornegger, Joachim ; Kornhuber, Johannes: Developing Enabling Technologies for Ambient Assisted Living: Natural Language Interfaces, Automatic Focus Detection and User State Recognition. In: BMBF (Bundesministerium für Bildung und Forschung) ; VDE (Verband der Elektrotechnik Elektronik Informationstechnik e.V.) (Org.) : Tagungsband zum 1. deutschen AAL-Kongress (1. Deutscher AAL (Ambient Assisted Living)-Kongress Berlin 30.01.2008-01.02.2008). Berlin/Offenbach : VDE Verlag GMBH, 2008, pp 371-375. - ISBN 978-3-8007-3076-6
Nöth, Elmar ; Hacker, Christian ; Batliner, Anton: Does Multimodality Really Help? The Classification of Emotion and of On/Off-Focus in Multimodal Dialogues - Two Case Studies. In: Grgic, Mislav ; Grgic, Sonja (Ed.) : Proceedings Elmar-2007 (Elmar-2007 Zadar 12.-14.09.2007). Zadar : Croatian Society Electronics in Marine - ELMAR, 2007, pp 9-16. - ISBN 978-953-7044-05-3
Hönig, Florian ; Batliner, Anton ; Eskofier, Björn ; Nöth, Elmar: Predicting Continuous Stress Ratings of Multiple Labellers from Physiological Signals. In: Jan, Jiri ; Kozumplik, Jiri ; Provanznik, Ivo (Ed.) : Analysis of Biomedical Signals and Images (International Conference BIOSIGNAL 2008 Brno, Czech Republic June 29 - July 1, 2008). Brno : Vutium Press, 2008, pp no pagination. - ISBN 978-80-214-3612-1

Institution: Chair of Computer Science 5 (Pattern Recognition)