Preparing Future Multisensorial Interaction Research (PF-Star)

PF-Star was intended to contribute to putting future activities in the field of multi-sensorial and multilingual communication (interface technologies) on a stronger basis by providing technological baselines, comparative evaluations, and assessments of the prospects of core technologies on which future research and development efforts can build. The project ran for two years and was successfully completed at the end of September 2004. Besides the synthesis of emotional speech and emotional faces as well as automatic speech translation, the project comprised the two work packages Recognition of Emotions (WP3) and Children's Speech (WP5), in which the Chair for Pattern Recognition (Informatik 5) took part as work-package leader and research partner, respectively. Whereas in the first year of PF-Star new corpora were collected and research was mainly done with existing databases, in the second year the new databases were put to use. The database that was recorded for both work packages, WP3 and WP5, is the AIBO corpus; it contains 9 hours of speech from children playing with the AIBO robot (spontaneous, emotional speech). Comparable recordings were made at the University of Birmingham.

As for the work package Recognition of Emotions (WP3), annotations of prosodic peculiarities and of the emotional user state had been assigned to each spoken word of the AIBO database by five labelers by March 2004. As a third annotation, the alignment of the children's utterances with AIBO's actions is ongoing. In the second part of the period under report, the English recordings were annotated with emotional user states. We addressed the pivotal problem that annotations are always intermediary between the phenomenon and its automatic processing: a classifier cannot deal with the phenomena themselves, but only with the labels given to single items, be they turns, words, or paragraphs. We therefore developed an entropy-based measure that shows to what extent the performance of a classifier matches the performance of the human labelers. We could show that majority voting over non-expert labelers can indeed be an alternative to the expert approach, in which labelers are trained thoroughly.
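
The following is a minimal sketch of majority voting and of an entropy-style agreement measure over per-word labels; the function names and the five-label example are hypothetical and only illustrate the idea, not the project's actual implementation:

    import math
    from collections import Counter

    def majority_vote(labels):
        # Majority label among the annotations for one word; ties are
        # broken arbitrarily by Counter.most_common.
        return Counter(labels).most_common(1)[0][0]

    def label_entropy(labels):
        # Shannon entropy (bits) of the label distribution for one item:
        # 0.0 if all labelers agree, larger with more disagreement. High
        # entropy bounds what any classifier can be expected to achieve.
        counts = Counter(labels).values()
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts)

    # Hypothetical annotations of one word by five labelers:
    word = ["angry", "angry", "emphatic", "angry", "neutral"]
    print(majority_vote(word))            # angry
    print(round(label_entropy(word), 2))  # 1.37 bits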

For the automatic classification of emotional words of the AIBO database, we used different prosodic, spectral, and part-of-speech features. We ran several experiments using LDA and neural networks, with the majority vote of the labelers as reference. For the two-class problem positive+neutral vs. negative, we obtained a class-wise averaged recognition rate (CL) of 76%. For seven classes (joyful, motherese, neutral, emphatic, touchy, angry, reprimanding), 45% CL was achieved.
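
The class-wise averaged recognition rate (CL) is the unweighted mean of the per-class recalls, so that rare emotion classes count as much as the dominant neutral class. A minimal sketch with an illustrative toy example:

    from collections import defaultdict

    def class_wise_averaged_recognition_rate(y_true, y_pred):
        # Unweighted mean of per-class recall (CL): each class
        # contributes equally, regardless of how many samples it has.
        correct = defaultdict(int)
        total = defaultdict(int)
        for t, p in zip(y_true, y_pred):
            total[t] += 1
            if t == p:
                correct[t] += 1
        return sum(correct[c] / total[c] for c in total) / len(total)

    # Toy example for the two-class problem positive+neutral vs. negative:
    y_true = ["NEG", "NEG", "NEG", "P+N", "P+N"]
    y_pred = ["NEG", "NEG", "P+N", "P+N", "P+N"]
    print(class_wise_averaged_recognition_rate(y_true, y_pred))  # (2/3 + 2/2) / 2 = 0.83...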

SYMPAFLY is another database, mainly employed in year one of PF-Star. It contains emotional speech from people calling an automatic flight reservation system. In the period under report, we investigated alternative feature sets such as MFCC-based and HNR-based (harmonics-to-noise ratio) features. Furthermore, features for different levels of classification (utterance, word) were developed. For the two-class problem, 75% CL could be achieved, and for the four-class problem, 57% CL.
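
To illustrate what an HNR-based feature measures, the following estimates the harmonics-to-noise ratio of a single voiced frame from its normalized autocorrelation peak. This is a simplified textbook estimator under assumed pitch bounds, not the feature extraction actually used in the project:

    import numpy as np

    def hnr_db(frame, sr, f0_min=75.0, f0_max=500.0):
        # Normalized autocorrelation of the zero-mean frame.
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        ac = ac / ac[0]
        # Search for the pitch peak in the plausible lag range.
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        r = min(ac[lo:hi].max(), 0.999)  # harmonic share; clipped for the log
        return 10.0 * np.log10(r / (1.0 - r))

    # Synthetic voiced frame: 150 Hz tone plus noise, 40 ms at 16 kHz.
    sr = 16000
    t = np.arange(int(0.04 * sr)) / sr
    frame = np.sin(2 * np.pi * 150.0 * t) + 0.1 * np.random.randn(t.size)
    print(hnr_db(frame, sr))  # clearly positive for this mostly periodic frame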

In the work package Children's Speech (WP5), the work focused on the calibration of the new corpora by establishing speech recognition performance baselines and other measures. At the Chair for Pattern Recognition (Informatik 5), four new corpora containing about 23 hours of speech were investigated. The AIBO database contains 9 hours of spontaneous speech. The OHM8000 database contains 9 hours of read speech with a rather large vocabulary of 8,000 words. In addition, read speech from German children reading English texts was recorded; comparable English sentences can be found in the recordings of the University of Birmingham and ITC-irst. The fourth database comprises a small amount of recordings of children with speech disorders.

For all these databases, recognition baseline systems based on different feature sets and different training techniques were established. Variability and formant distributions of the datasets were investigated: intra-speaker variability is higher for younger children, and the formants are shifted towards higher frequencies. For the AIBO database, higher recognition rates were achieved for "angry" and "emphatic" speech, whereas for "motherese" the accuracy dropped. The recognition of the read speech could be improved with several language modeling approaches.

For the non-native data, automatically scoring the children's pronunciation is challenging; for this task, a 30-dimensional feature set was developed. The detection of mispronounced words was performed with a classification rate of 72%. A possible application is computer-assisted language learning. For further investigations, the proficiency of the children's English was judged by several English teachers of the OHM-Gymnasium Erlangen. For the children with speech disorders (cleft lip or palate), the Chair for Pattern Recognition has been developing an objective measure of the children's degree of nasality during speech therapy: the energy in the frequency bands typical of nasalized speech is set in relation to the total speech energy. Up to now, the best correlation with the human rating is 0.66.
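
A minimal sketch of such a band-energy ratio; the band edges below are illustrative assumptions, not the values actually used for the nasality measure:

    import numpy as np

    def nasality_energy_ratio(signal, sr, band=(200.0, 500.0)):
        # Share of spectral energy inside `band` (assumed typical of
        # nasalized speech) relative to the total speech energy.
        power = np.abs(np.fft.rfft(signal)) ** 2
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
        in_band = (freqs >= band[0]) & (freqs <= band[1])
        return power[in_band].sum() / power.sum()

Averaged over the frames of a recording, such a ratio yields a single score per speaker that can then be correlated with the human ratings.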

Further work with the AIBO database addressed the problem that it is still difficult to recognize speech under noisy conditions. For this purpose, TRAP-based features (TempoRAl Patterns), which are known to be more robust against noise, were implemented.
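
A sketch of how TRAP-style inputs differ from short-term spectral features: for each frame, one takes the temporal trajectory of a single critical-band energy over roughly one second of context. The band-level classifiers and the merger stage of the full TRAP architecture are omitted here:

    import numpy as np

    def trap_trajectories(log_band_energies, context=50):
        # log_band_energies: array of shape (num_frames, num_bands).
        # For each frame, cut the +-context trajectory of every band,
        # i.e. 101 values per band at a 10 ms frame shift (~1 second).
        # Classifiers then see long single-band trajectories instead of
        # short-term spectra, which is less sensitive to stationary noise.
        num_frames, num_bands = log_band_energies.shape
        traps = []
        for t in range(context, num_frames - context):
            window = log_band_energies[t - context:t + context + 1]
            traps.append(window.T)  # one long trajectory per band
        return np.array(traps)  # (frames, num_bands, 2 * context + 1)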

Project manager:
Prof. Dr.-Ing. Elmar Nöth

Project participants:
Dr. phil. Anton Batliner, PD Dr.-Ing. habil. Stefan Steidl, Dr.-Ing. Christian Hacker, Dr.-Ing. Dr. habil. med. Tino Haderlein, Dr.-Ing. Viktor Zeissler, Dr.-Ing. Michael Levit, Dipl.-Math. Silvia Dennerlein

Keywords:
speech recognition; emotions; prosody; children's speech; pronunciation scoring

Duration: 1.10.2002 - 31.12.2003

Sponsored by:
EU, 5th Framework Programme

Participating institutions:
ITC-irst, Italy
RWTH Aachen, Germany
Universität Karlsruhe, Germany
Kungl Tekniska Högskolan, Sweden
University of Birmingham, UK
Istituto di Scienze e Tecnologie della Cognizione, Sezione di Padova, Italy


Institution: Chair of Computer Science 5 (Pattern Recognition)