Speaker models generated from training recordings of different speakers should differentiate between speakers. These models are estimated using feature vectors that are based on acoustic observations. The feature vectors should therefore themselves show a high degree of inter-speaker variability and a low degree of intra-speaker variability.
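The requirement of high inter-speaker and low intra-speaker variability can be made concrete with a simple variance ratio. The following is a minimal sketch for a single scalar feature, not the project's actual evaluation procedure; the function name `variability_ratio` and the toy values are hypothetical.

```python
from statistics import mean, pvariance

def variability_ratio(features_by_speaker):
    """Return (inter, intra, ratio) for one scalar feature.

    inter: variance of the per-speaker means (between speakers)
    intra: average of the per-speaker variances (within speakers)
    """
    speaker_means = [mean(values) for values in features_by_speaker.values()]
    inter = pvariance(speaker_means)
    intra = mean(pvariance(values) for values in features_by_speaker.values())
    return inter, intra, inter / intra

# Hypothetical toy values of one feature dimension for three speakers.
toy = {
    "spk_a": [1.0, 1.2, 0.8, 1.0],
    "spk_b": [3.0, 3.1, 2.9, 3.0],
    "spk_c": [5.0, 4.8, 5.2, 5.0],
}
inter, intra, ratio = variability_ratio(toy)
print(inter, intra, ratio)
```

A feature dimension with a large ratio separates speakers well relative to how much each speaker varies; a ratio near or below one suggests the feature carries little speaker information.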


Cepstral coefficients of transformed short-time spectra (e.g. Mel-Frequency Cepstral Coefficients, MFCC) are experimentally developed features that are widely used in automatic speech and speaker detection. Because the feature extraction process admits many parameter choices, and there is no theoretically motivated way to determine them, only a stepwise investigation of the extraction process can lead to stable acoustic features.
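To show where the parameter choices enter, the sketch below computes MFCCs for a single frame in plain Python. It is an illustration under stated assumptions, not the extraction pipeline used in the project: the Hamming window, the common 2595/700 mel formula, the 20 triangular filters, the 12 retained coefficients, and the 440 Hz test tone are all assumed values, and the naive DFT is only suitable for short frames.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate, n_filters=20, n_ceps=12):
    """MFCCs of one frame (naive DFT, for illustration only)."""
    n = len(frame)
    # 1. Hamming window to reduce spectral leakage.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # 2. Power spectrum for bins 0 .. n/2.
    n_bins = n // 2 + 1
    power = []
    for k in range(n_bins):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        power.append(abs(acc) ** 2 / n)
    # 3. Triangular filters whose edges are equally spaced on the mel scale.
    mel_max = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(mel_max * j / (n_filters + 1))
                for j in range(n_filters + 2)]
    bin_hz = sample_rate / n
    log_energies = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        energy = 0.0
        for k in range(n_bins):
            f = k * bin_hz
            if lo < f <= mid:
                energy += power[k] * (f - lo) / (mid - lo)
            elif mid < f < hi:
                energy += power[k] * (hi - f) / (hi - mid)
        log_energies.append(math.log(energy + 1e-12))  # small offset avoids log(0)
    # 4. DCT-II of the log filterbank energies; keep coefficients 1 .. n_ceps.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for c in range(1, n_ceps + 1)]

sample_rate = 8000  # assumed sampling rate in Hz
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(256)]
coeffs = mfcc_frame(frame, sample_rate)
print(len(coeffs))
```

Each numbered step has at least one free parameter (window type and length, number and shape of filters, number of coefficients), which is why a stepwise investigation of the kind described above is needed.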


Optimized acoustic features for the representation of speakers enable improvements in automatic speaker identification and verification. They also support the development of manual and automatic methods for the forensic investigation of speakers.

BE-SyMPHONic: French-Austrian joint project granted by ANR and FWF

Principal investigators: Basilio Calderone, Wolfgang U. Dressler
Co-applicants: Hélène Giraudo, Sylvia Moosmüller

Start of the project: 13th January 2014


Language sounds are realized in several different ways. Every language exploits no more than a subset of the sounds that the vocal tract can produce, as well as a reduced number of their possible combinations. The restrictions and the phonemic combinations allowed in a language are the subject of a branch of phonology called phonotactics.

Phonotactics refers to the sequential arrangement of phonemic segments in morphemes, syllables, and words and underlies a wide range of phonological issues, from acceptability judgements (pseudowords like <poiture> in French or <Traus> in German are phonotactically plausible) to syllable processes (the syllabic structure in a given language is based on the phonotactic permission in that language) and the nature and length of possible consonant clusters (that may be seen as intrinsically marked structures with respect to the basic CV template).


Exploring the psycho-computational representation of the phonotactics in French and German is the aim of this research project.

In particular, our research will focus on the interplay between phonotactics and word structure in French and German, and investigate the behavioural and computational representations of phonotactic vs. morphonotactic clusters.

The basic hypothesis underlying this research project is that there exist different cognitive and computational representations for the same consonant cluster according to its phonotactic setting. The occurrence of a cluster across a morpheme boundary (morphonotactic cluster) is considered particularly interesting.


Our research will focus on the interplay between phonotactics and morphology and investigate the behavioural and computational representations of consonant clusters according to whether they are: a) exclusively phonotactic clusters, i.e. the consonant cluster occurs only without morpheme boundaries (e.g. Stein in German); b) exclusively morphonotactic clusters, i.e. the consonant cluster occurs only across morpheme boundaries (e.g. lach+st); or c) both, with one of the two being more or less dominant (e.g. dominant lob+st vs. Obst)[1]. We thus test the existence of different ‘cognitive and computational representations’ and processes for the same and for similar consonant clusters according to whether they belong to a), b) or c).
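The three-way classification can be expressed as a small counting procedure over a corpus in which each cluster token is marked for whether it spans a morpheme boundary. This is only a sketch of the idea: the function name, the placeholder cluster labels, and all counts are hypothetical, and deciding dominance by simple majority is an assumption.

```python
from collections import Counter

def classify_clusters(tokens):
    """tokens: iterable of (cluster, crosses_morpheme_boundary) pairs."""
    within, across = Counter(), Counter()
    for cluster, crosses in tokens:
        (across if crosses else within)[cluster] += 1
    labels = {}
    for cluster in set(within) | set(across):
        w, a = within[cluster], across[cluster]
        if a == 0:
            labels[cluster] = "a) exclusively phonotactic"
        elif w == 0:
            labels[cluster] = "b) exclusively morphonotactic"
        elif a > w:
            labels[cluster] = "c) both, morphonotactic dominant"
        elif w > a:
            labels[cluster] = "c) both, phonotactic dominant"
        else:
            labels[cluster] = "c) both, no dominance"
    return labels

# Hypothetical toy corpus with placeholder cluster labels and invented counts.
tokens = (
    [("cluster_A", False)] * 3    # only morpheme-internal (type a)
    + [("cluster_B", True)] * 2   # only across a boundary (type b)
    + [("cluster_C", True)] * 4   # mostly across a boundary ...
    + [("cluster_C", False)] * 1  # ... but also morpheme-internal (type c)
)
labels = classify_clusters(tokens)
for cluster in sorted(labels):
    print(cluster, "->", labels[cluster])
```

With real data, the boundary annotation would come from a morphologically analysed lexicon or corpus, and the dominance criterion could be replaced by a threshold on type or token frequency.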

The central hypothesis which we test is that speakers not only reactively exploit the potential boundary-signalling function of clusters that result from morphological operations, but also take active measures to maintain or even enhance this functionality, for example by treating morphologically produced clusters differently from morpheme-internal clusters in production or language acquisition. We call this hypothesis the ‘Strong Morphonotactic Hypothesis’ (henceforth: SMH) (Dressler & Dziubalska-Kołaczyk 2006, Dressler, Dziubalska-Kołaczyk & Pestal 2010).

In particular, we suppose that sequences of phonemes exhibiting morpheme boundaries (the ‘morphonotactic clusters’) should provide speakers with functional evidence about the morphological operation that occurred in that sequence; such evidence should be absent in the case of a sequence of phonemes without morpheme boundaries (the ‘phonotactic clusters’).

Hence our idea is to investigate the psycho-computational mechanisms underlying the phonotactic-morphonotactic distinction by approaching the problem from two angles simultaneously: (a) psycholinguistic experimental study of language acquisition and production and (b) computational modelling of language.

We therefore aim to provide, on the one hand, psycholinguistic and behavioural support for the hypothesis that morphologically produced clusters are treated differently from morpheme-internal clusters in French and German; on the other hand, we will focus on the distributional and statistical properties of the language in order to verify whether this difference in the treatment of clusters can be inductively modelled by appealing to distributional regularities of the language.

The competences of the two research teams overlap and complement each other. The French team will lead in modelling, computational simulation and psycholinguistic experiments, the Austrian team in first language acquisition, phonetic production and microdiachronic change. These synergies are expected to enrich each group in innovative ways.

[1] Equivalent examples for French are a) prise (/priz/ ‘grip’, exclusively phonotactic cluster), b) affiche+rai (/afiʃʁɛ/ ‘I (will) post’, exclusively morphonotactic cluster) and c) navigue+rai (/naviɡʁɛ/ ‘I (will) sail’) vs. engrais (/ãɡʁɛ/ ‘fertilizer’), where both conditions hold and the morphonotactic condition is dominant.

Speakers from Germany currently form the largest group of foreigners in Austria, and in Vienna in particular. This project, funded by the Cultural Department of the City of Vienna (Kulturamt der Stadt Wien), investigates whether and to what extent contact with the German standard pronunciation influences the Austrian standard pronunciation, and vice versa. Acoustic recordings are made of several groups of speakers who differ in the intensity of their contact with speakers from Germany.

Forensic Speech Analysis is currently being developed using two main methodologies:

  • Automatic methods, applying digital signal processing algorithms and Bayesian statistics.
  • Acoustic phonetics and phonology, based on acoustic measurements of speech parameters such as formant frequencies and the fundamental frequency of speech segments.
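In the Bayesian approach, evidence is commonly reported as a likelihood ratio: the probability of an observed measurement under the same-speaker hypothesis divided by its probability under the different-speaker hypothesis. The sketch below illustrates this with single Gaussians over one acoustic parameter. It is a simplified assumption, not the method used by the Institute, and all numbers (a mean fundamental frequency in Hz, the suspect model and the population model) are hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio(x, suspect, population):
    """LR = p(x | same speaker) / p(x | different speaker)."""
    return gaussian_pdf(x, *suspect) / gaussian_pdf(x, *population)

def posterior_odds(prior_odds, lr):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * lr

# Hypothetical values: mean F0 of the questioned recording, and (mean, sd)
# of the suspect's model and of a reference population model.
observed_f0 = 118.0
suspect_model = (120.0, 5.0)
population_model = (130.0, 20.0)

lr = likelihood_ratio(observed_f0, suspect_model, population_model)
print(lr, posterior_odds(1.0, lr))
```

A ratio above one supports the same-speaker hypothesis and a ratio below one supports the different-speaker hypothesis; the prior odds are supplied separately and are not part of the acoustic analysis.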

The Institute investigates both approaches in the framework of the FSAAWG (Forensic Speech and Audio Working Group) of ENFSI (the European Network of Forensic Science Institutes).

Implications for pathological speech

Coordinated Project 2016-17 Scuola Normale Superiore (SNS), Pisa – Acoustic Research Institute (ARI), Austrian Academy of Sciences, Vienna
PIs: Chiara Celata (SNS), Sylvia Moosmueller (ARI)
Research personnel: Chiara Meluzzi (SNS), Bettina Hobel (ARI)

Short Description

The project aims at modeling the impact of speech gesture coordination on the rhythmical properties of languages.

Speech gestural structures are sets of gestures together with a specification of how they are temporally and spatially coordinated with respect to one another. Gestural anticipations, posticipations and overlap are the ingredients of coarticulation, i.e., the coordinatory activity of speech movements that allows adjacent vowels and consonants to be produced simultaneously, thus resulting in one smooth whole.

Rhythm is the systematic patterning of timing, accent, and grouping in sequences of events and encompasses both speech and music domains. We only become aware of how important it is in verbal communication when we listen to non-fluent speech. For example, deaf people with impaired or absent auditory feedback can be taught, after cochlear implantation and logopedic rehabilitation, to develop an “auditory” map for speech processing and imitation, but native-like patterns of gestural and rhythmical coordination are much more difficult to achieve.

Both gestural coordination and rhythm thus contribute to the way fluent speech is programmed, produced, and even perceived.

However, we still lack a global understanding of how the two dimensions of gestural coordination and speech rhythm interact in natural languages.

Indeed, the gestural and the rhythmical approaches sometimes make different predictions. For example, we do not know whether the consonants composing heterosyllabic clusters are articulatorily independent from one another and are timed with respect to different vocalic nuclei, as some theoretical frameworks in the domain of gestural coordination would predict, or whether they are rather globally timed with the preceding vocalic nucleus, especially if it is stressed, as some proposals in the domain of speech rhythm assume. Also, we do not know if cross-linguistic differences in how heterosyllabic clusters are articulatorily coordinated to vocalic nuclei reflect or are reflected by cross-linguistic differences in the languages’ rhythmical properties.

This project thus tries to reconcile the gestural and the rhythmical perspective into a unified research framework devoted to uncovering how inter-segmental coordination influences, and is influenced by, the rhythmical properties of supra-segmental entities.

To that aim, we develop a series of cross-linguistic experiments on Italian and Standard Austrian German to clarify some critical aspects of speech organization in the two languages and to establish a link between language-specific phonotactic constraints and the temporal and spatial properties of segments’ production.

The experiments, based on a reading task, include acoustic analyses for the identification of the temporal patterns and articulatory (ultrasound tongue imaging, UTI) analyses for the investigation of gestural coordination.

In addition, the project aims to set the stage for an analysis of how the speech of cochlear-implanted (CI) speakers differs from normal speech with respect to gestural coordination and rhythmic patterns. Spontaneous conversations of both Italian and Standard Austrian German speakers will be recorded. The acoustic analyses will aim to identify the areas of greatest difficulty in the coarticulatory and temporal aspects of spontaneous speech produced by CI speakers.


The aim of this study is to investigate the phonetics of second language acquisition and first language attrition, based on the acoustic and articulatory lateral realizations of Bosnian migrants living in Vienna. Bosnian has two lateral phonemes (a palatalized and an alveolar/velarized one), whereas Standard Austrian German (SAG) features only one lateral phoneme (an alveolar lateral). In the Viennese dialect (Vd), however, this phoneme also has a velarized variant.

This phonetic investigation will be conducted with respect to the influence of language contact between Bosnian and SAG, and between Bosnian and Vd, as well as the influence of gender and identity construction.


The recordings will be conducted with female and male Bosnian speakers who were aged between 20 and 35 years at the time of emigration and who came to Vienna during the Bosnian war (1992-1995). Additionally, control groups of monolingual L1 speakers of Bosnian, SAG and Vd will be recorded. All recordings will include reading tasks in order to elicit controlled speech, as well as spontaneous speech in the form of biographical interviews. The analyses will comprise quantitative and qualitative aspects. Quantitatively, the formant frequencies (especially F2 and F3), duration and intensity of the laterals and their phonetic context will be analysed. Additionally, articulatory analyses will be performed using electropalatography (EPG) and ultrasound tongue imaging (UTI) data. Qualitatively, biographical information, language attitudes and social networks will be analysed in order to obtain information about speaker-specific or group-specific characteristics.


The results of this study are relevant to understanding the processes of sound realization and sound change in the domains of language contact (phonetic processes in second language acquisition and first language attrition), sociolinguistics, and the sociology of identity construction.


The modeling step in speaker detection has an enormous influence on the classification task, because the quality of the model depends on the parameters chosen in this step. False classifications, false identifications, and false verifications can result from malformed speaker models. The initial model parameters influence the final estimated parameters of the speaker models. To obtain optimized speaker models, different initialization methods are explored.


Speaker models are represented as Gaussian Mixture Models (GMMs). These models are mixtures of multivariate distributions that are parameterized by the means and covariance matrices of the distributions and by the mixture weights. The parameters are estimated with the expectation-maximization (EM) algorithm, which iteratively increases the likelihood of the model. Initial model parameters have to be selected for this algorithm, and different initial parameters can lead the algorithm to converge to different local maxima. The effect of different initialization methods on the identification rate is analyzed.
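The dependence of EM on its starting point can be explored with a small experiment. The sketch below fits a one-dimensional, two-component GMM from two different sets of initial means and records the log-likelihood after every iteration. It is a simplified illustration, not the project's implementation: real speaker models are multivariate, the synthetic data and both initializations are assumptions, and no variance floor is applied.

```python
import math
import random

def component_densities(x, weights, means, variances):
    """Weighted Gaussian density of x under each mixture component."""
    return [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)]

def log_likelihood(data, weights, means, variances):
    return sum(math.log(sum(component_densities(x, weights, means, variances)))
               for x in data)

def em_step(data, weights, means, variances):
    k = len(weights)
    # E-step: posterior probability of each component for each observation.
    resp = []
    for x in data:
        dens = component_densities(x, weights, means, variances)
        total = sum(dens)
        resp.append([d / total for d in dens])
    # M-step: re-estimate weights, means and variances from the posteriors.
    nk = [sum(r[j] for r in resp) for j in range(k)]
    new_weights = [n / len(data) for n in nk]
    new_means = [sum(r[j] * x for r, x in zip(resp, data)) / nk[j]
                 for j in range(k)]
    new_vars = [sum(r[j] * (x - new_means[j]) ** 2
                    for r, x in zip(resp, data)) / nk[j]
                for j in range(k)]
    return new_weights, new_means, new_vars

def fit(data, init_means, n_iter=50):
    """Run EM from the given initial means; return parameters and likelihood trace."""
    k = len(init_means)
    data_mean = sum(data) / len(data)
    data_var = sum((x - data_mean) ** 2 for x in data) / len(data)
    params = ([1.0 / k] * k, list(init_means), [data_var] * k)
    trace = [log_likelihood(data, *params)]
    for _ in range(n_iter):
        params = em_step(data, *params)
        trace.append(log_likelihood(data, *params))
    return params, trace

# Synthetic data: two well-separated groups (assumed, for illustration).
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(6.0, 1.0) for _ in range(200)])

(_, means_a, _), trace_a = fit(data, init_means=[-1.0, 7.0])
(_, means_b, _), trace_b = fit(data, init_means=[2.9, 3.1])
print(trace_a[-1], trace_b[-1])
```

Each trace is non-decreasing, as EM guarantees, but the final values need not coincide across initializations. On harder data with more components, comparing the final log-likelihoods (and the resulting identification rates) is one way to compare initialization methods.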


Optimized speaker models reflect the speech behavior of the speakers in an optimal way. The inter-speaker variability is maximized while the intra-speaker variability is minimized by avoidance of malformed speaker models. The usage of optimal initialization methods improves the robustness and the reliability of automatic speaker identification and verification systems.


As is customary for urban varieties, the varieties of Vienna are predominantly social varieties. Education and social background form the primary factors which define the language behaviour of the speakers.

The Viennese dialect belongs to the Middle Bavarian dialect group. Around the turn of the century, a sound change arose which monophthongized the diphthongs /aɛ/ and /ɑɔ/ to /æ:/ and /ɒ:/ respectively. This sound change was completed around 1950. As a result of the Viennese monophthongization, the palatal constriction location became overloaded. As early as the thirties, Kranzmayer observed what he called the "e-confusion", i.e., people stopped distinguishing the /e/-vowels, so that "Segen" (blessing) and "sehen" (to see) became homophones: [se:ŋ].


Five female and five male speakers of the Viennese dialect were asked to name pictures, to read sentences, and to speak spontaneously.


As a consequence of the Viennese monophthongization and the resulting overcrowding of the palatal constriction location, speakers of the Viennese dialect developed two strategies. One group, in line with Kranzmayer's observation, neutralized /e/ and /ɛ/ to /e/. This neutralization made room for the new palatal vowel /æ/.

The other group, however, preserved /e/ and /ɛ/, but sometimes applied the two vowels incorrectly, i.e., produced /ɛ/ instead of /e/ and the other way round. However, since no neutralization took place, the vowel /i/ is shifted to the pre-palatal constriction location. By this shift, room is created on the palatal bar for the new vowel /æ/.

Group I, consequently, discerns the following vowels:

  • palatal: /i:, i, e:, e, æ:/
  • velar: /u:, u/
  • uvular: /o:, o, ɔ:, ɔ/
  • pharyngeal: /ɑ:, ɑ, ɒ:/

Group II discerns the vowels as follows:

  • pre-palatal: /i:, i/
  • palatal: /e:, e, ɛ:, ɛ, æ:/
  • velar: /u:, u/
  • uvular: /o:, o, ɔ:, ɔ/
  • pharyngeal: /ɑ:, ɑ, ɒ:/

Lip rounding and duration are distinctive in each vowel system.


In speaker identification and speaker verification, wrong classifications can result from a high similarity between speakers that is represented in the speaker models. These similarities can be explored using cluster analysis.


In speaker detection, every speaker is represented as a Gaussian Mixture Model (GMM). By using a dissimilarity measure for these models (e.g. cross-entropy), cluster analysis can be applied. Hierarchical agglomerative clustering methods are able to show structures in the form of a dendrogram.
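The clustering step can be sketched as follows. For simplicity, each speaker is represented here by a single one-dimensional Gaussian (mean, variance) rather than a full GMM, and the symmetrised Kullback-Leibler divergence stands in for the dissimilarity measure; both are assumptions made to keep the example self-contained, as are the average-linkage criterion and the toy speaker models. The merge history returned is the information a dendrogram displays.

```python
import math
from itertools import combinations

def symmetric_kl(p, q):
    """Symmetrised KL divergence between two 1-D Gaussians given as (mean, variance)."""
    (m1, v1), (m2, v2) = p, q
    kl_pq = 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl_qp = 0.5 * (math.log(v1 / v2) + (v2 + (m1 - m2) ** 2) / v1 - 1.0)
    return kl_pq + kl_qp

def average_linkage(c1, c2, dist):
    """Mean pairwise dissimilarity between the members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def agglomerate(models):
    """Agglomerative clustering; returns a list of (cluster, cluster, height) merges."""
    names = list(models)
    dist = {frozenset((a, b)): symmetric_kl(models[a], models[b])
            for a, b in combinations(names, 2)}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average dissimilarity.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(pair[0], pair[1], dist))
        merges.append((c1, c2, average_linkage(c1, c2, dist)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical speaker models: (mean, variance) of one feature dimension.
models = {
    "spk1": (0.0, 1.0),
    "spk2": (0.2, 1.0),
    "spk3": (5.0, 1.0),
    "spk4": (5.3, 1.0),
}
merges = agglomerate(models)
for left, right, height in merges:
    print(left, "+", right, "at", round(height, 3))
```

Speakers that merge at a low height are the ones most likely to be confused with one another; cutting the merge history at a chosen height yields groups of highly similar speakers.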


Structures in speech corpora can be visualized and can therefore be used to select groups of highly similar or dissimilar speakers. Investigating these structures with respect to misclassification can lead to improvements in model generation.


Up to now, a thorough phonetic-acoustic and phonological description of the vowels and the vowel system of Standard Austrian German has not been provided.


Approximately 11,000 vowels of three female and three male speakers of Standard Austrian German have been segmented and analyzed acoustically.


Standard Austrian German discerns 13 vowels at five constriction locations:

  • pre-palatal for the /i/ and the /y/ vowels
  • mid-palatal for the /e/ and the /ø/ vowels
  • velar for the /u/ vowels
  • upper pharyngeal for the /o/ vowels
  • lower pharyngeal for /ɑ/

Each vowel pair consists of a constricted and an unconstricted vowel. The front vowels (pre-palatal and mid-palatal) additionally distinguish rounded and unrounded vowels. The following articulatory features sufficiently discriminate all vowels:

  • [± constricted]
  • [± front]
  • [± prepalatal]
  • [± pharyngeal]
  • [± round]

Contrary to general assumptions, F1 and F2 do not sufficiently discern the vowels of Standard Austrian German; F3 is necessary as well. Discriminatory ability is maintained over all speaking styles and prosodic positions.


Project part 02 of the Special Research Programme (Sonderforschungsbereich) Deutsch in Österreich. Variation - Kontakt - Perzeption ("German in Austria. Variation - Contact - Perception"), funded by the FWF (FWF6002), in cooperation with the University of Salzburg

Project leaders: Stephan Elspaß, Hannes Scheutz, Sylvia Moosmüller

Start of the project: 1st January 2016


The project addresses the diversity and dynamics of the various dialects in Austria. On the basis of a new survey, a range of research questions will be answered over the coming years, for example: What differences and changes (e.g. through processes of convergence and divergence) can be observed within and between the Austrian dialect regions? What differences in dialect change exist between urban and rural areas? Can generational and gender differences in dialect change be identified? What can a comprehensive comparison of real-time and apparent-time analyses contribute to a general theory of language change?

To answer these questions, speech samples from a total of 160 dialect speakers from two different age groups are recorded and analysed at 40 Austrian locations in the first survey phase. In addition, recordings of selected speakers are made in the speech laboratory in order to determine pronunciation characteristics as precisely as possible in phonetic terms. In the second survey phase, supplementary laboratory recordings are made at 100 further locations in Austria in order to analyse the differences and movements between the dialect regions in even greater detail. The latest dialectometric methods will also be applied here in order to make probabilistic statements about the variation and change of dialects in Austria. The analyses cover all linguistic levels, from pronunciation to grammar and vocabulary. The data obtained are documented digitally, among other means. At the end of the project, the data are to be made available to a broad audience on the platform "Deutsch in Österreich", in particular in the form of the first 'talking language atlas' of the whole of Austria.


Project homepage of the cooperation partners in Salzburg




This project describes the vowel systems of several languages acoustically and compares them. The project focuses mainly on languages whose vowel systems have so far been insufficiently described acoustically, e.g. Albanian, Romanian, Ful, Mandinka, or Crioulo.


Selected speakers are asked to perform a reading task and to speak spontaneously. Vowels in all positions are segmented, labeled, and analyzed. Formant frequencies (F1, F2, F3) are extracted and the vowel systems are defined.

Language specificity affects not only the number of vowels and their features, but also the extent of variability and stability of certain vowels. A given vowel of language A might be quite stable, whereas the same vowel might exhibit high variability in language B. In the same way, vowels might be discerned differently. For example, pre-palatal /i/ and mid-palatal /e/ are discerned by F3 in Standard Austrian German (see diagram on SAG), whereas /i/ and /e/, both mid-palatal, are predominantly discerned by F2 in Modern Standard Albanian (see diagram on MSA).


In forensic speaker identification, detailed descriptions of the languages in question are often needed in order to conduct a thorough comparison.

FWF DACH I 536-G20: 2011-2013
Cooperation with the Institute of Phonetics and Speech Processing, LMU Munich.

Project leader (Austria): Sylvia Moosmüller
Project leader (Germany): Jonathan Harrington


The distinction between so-called tense and lax vowels, e.g., Miete - Mitte ("rent" - "center") or Höhle - Hölle ("cave" - "hell"), is encountered in many languages. However, many different articulatory adjustments might cause this distinction, and these are language-specific.

In the current project, we address this issue by analysing high tense and lax vowel pairs of the type bieten - bitten ("to offer" - "to request"), Hüte - Hütte ("hats" - "hut"), and Buße - Busse ("penance" - "buses") in two related language varieties: Standard Austrian German (SAG) and Standard German German (SGG). Previous studies suggest that high lax vowels like those in bitten, Hütte, or Busse tend to approximate their respective tense cognates in bieten, Hüte, and Buße.

The research questions were investigated by a) comparing the tense and lax vowel pairs in SAG and SGG, b) investigating whether high lax vowels approximate their tense cognates in SAG, c) investigating whether the high vowel pairs in SAG are distinguished by quality, by quantity, or by quantity relations with the following consonant, and d) investigating whether an ongoing sound change can be observed in SAG, with young SAG speakers exhibiting a stronger tendency to merge the vowels than old SAG speakers.

Main Results:

SGG speakers clearly distinguish the high vowel pairs by quality, whereas speaker-specific strategies can be observed in SAG: some speakers distinguish high tense and lax vowel pairs by quality, others merge the quality contrast but restrict the merger to velar contexts, and still others merge high tense and lax vowels altogether. Where a distinction is made, the differences between high tense and high lax vowels are less pronounced in SAG than in SGG, and still less pronounced in the speech of young SAG speakers than in that of old SAG speakers.

The same pattern was observed for quantity distinctions. All speakers differentiate the high vowel pairs by quantity, meaning that the tense vowels of the type bieten, Hüte, and Buße are longer than their respective lax cognates. Again, the differences are most pronounced in SGG and least pronounced in the speech of the young SAG speakers, meaning that the tense vowels are shortened in the speech of young SAG speakers as compared to old SAG speakers and SGG speakers. Results on the quantity interactions of vowel + consonant sequences provide evidence of quantifying aspects in SAG. Again, some age-specific differences emerged: overall, young SAG speakers have shorter durations than old SAG speakers, but they maintain the timing relations observed for the old SAG speakers.

Results on perception strongly suggest that SAG speakers make use of quantity cues in order to distinguish the vowel pairs, whereas SGG speakers rely more on cues connected with quality. In general, it can be concluded that quantity distinctions are more relevant in SAG than in SGG.

Project Related Publications:

Harrington, Jonathan, Hoole, Philip, & Reubold, Ulrich. (2012). A physiological analysis of high front, tense-lax vowel pairs in Standard Austrian and Standard German. Italian Journal of Linguistics, 24, 158-183.

Brandstätter, Julia & Moosmüller, Sylvia. (in print). Neutralisierung der hohen Vokale in der Wiener Standardsprache – A sound change in progress? In M. Glauninger & A. Lenz (Eds.), Standarddeutsch in Österreich – Theoretische und empirische Ansätze. Vienna: Vandenhoeck & Ruprecht.

Brandstätter, Julia, Kaseß, Christian H., & Moosmüller, Sylvia (accepted). Quality and quantity in high vowels in Standard Austrian German. In: A. Leemann, M-J. Kolly & V. Dellwo (Eds.), Trends in phonetics and phonology in German-speaking Europe. Zurich: Peter Lang.

Cunha, Conceição, Harrington, Jonathan, Moosmüller, Sylvia, & Brandstätter, Julia (accepted). The influence of consonantal context on the tense-lax contrast in two standard varieties of German. In: A. Leemann, M-J. Kolly & V. Dellwo (Eds.), Trends in phonetics and phonology in German-speaking Europe. Zurich: Peter Lang.

Moosmüller, Sylvia. (in print). Methodisches zur Bestimmung der Standardaussprache in Österreich. In: M. Glauninger & A. Lenz (Eds.), Standarddeutsch in Österreich – Theoretische und empirische Ansätze. Vienna: Vandenhoeck & Ruprecht (=Wiener Arbeiten zur Linguistik).

Moosmüller, Sylvia & Brandstätter, Julia. (in print). Phonotactic Information in the temporal organisation of Standard Austrian German and the Viennese Dialect. Language Sciences.