• Time-Frequency Implementation of HRTFs (HRTF-Imp)

    The FWF project "Time-Frequency Implementation of HRTFs" has started.

    Principal Investigator: Damian Marelli

    Co-Applicants: Peter Balazs, Piotr Majdak

  • YIRG Dynamates: Dynamic auditory predictions in human and non-human primates

    Imagine yourself navigating through busy traffic, with cars, bikes, and pedestrians moving around you in all directions. In this and many other situations, one aspect crucial for survival is accurately determining when and where external events occur, and whether and how things move. To allow both fast and accurate responses to external events, the brain continuously generates predictions about the future. For example, your brain predicts where a car will be by the time you want to cross the road. Yet humans are not the only species for whom it is beneficial to predict the location and timing of sounds. Other primates might share these abilities, as they also need to navigate dense forest areas. To what degree recent evolutionary pressures have shaped human abilities in these contexts remains unclear.

    Furthermore, ambiguities in the sensory environment, due to noise or uncertainty about source properties, require the brain to generate several co-existing predictions and to decide among them. To date, most existing knowledge about perceptual decision making comes from studying visual tasks. Very little is known about this process in the auditory modality, which serves functions at least as important for survival and social behaviour.

    The YIRG Dynamates aims to address these current key challenges in auditory cognition by empirically testing and modelling the sensory prediction mechanisms in comparison across evolutionarily closely related species and in realistic but highly controllable virtual acoustic environments. Importantly, in humans, these mechanisms will also be investigated on a physiological level by means of high-density electroencephalography (EEG), making it possible to examine whether the resulting model complies with biological restrictions on neural computation. Hence, this project rests on a strong collaboration between cognitive neuroscience, cognitive biology, and computational modelling. To this end, Dynamates comprises an interdisciplinary research team with experts in the fields of computational neuroscience and psychoacoustics (Robert Baumgartner), human EEG and sensory processing (Ulrich Pomper), and comparative cognition between animal species (Michelle Spierings).

    Dynamates will conduct the first systematic comparative study on dynamic auditory predictions in space and time by both human and non-human primates. The findings of this project will broaden our understanding of the underlying neural mechanisms in humans, which may help future efforts to improve treatments for individuals with impairments or pathological biases in decision making. Moreover, the resulting computational model may stimulate further research by testable predictions of decision making across species as well as in more complex economic and social contexts, and it may be directly applied to improve artificial intelligence and virtual reality systems.

  • 04.02.2015 Master Studentship offer at the ARI

    Proposal for a Master studentship (f/m)


    Title: Measurements of auditory time-frequency masking kernels for various masker frequencies and levels.


    Duration: 6 months, working time = 20 hours/week.


    Starting date: ASAP.


    Closing date for applications: until the position is filled.



    Background: Over the last decades, many psychoacoustical studies investigated auditory masking, an important property of auditory perception. Masking refers to the degradation of the detection of a sound (referred to as the “target”) in the presence of another sound (the “masker”). In the literature, masking has been extensively investigated with simultaneous (spectral masking) and non-simultaneous (temporal masking) presentation of masker and target. The results were used to develop models of either spectral or temporal masking. Attempts were made to simply combine these models to account for time-frequency masking in perceptual audio codecs like mp3. However, a recent study on time-frequency masking conducted at our lab [1] revealed the inaccuracy of such simple models. The development of an efficient model of time-frequency masking for short-duration and narrow-band signals remains a challenge. For instance, such a model is crucial for the prediction of masking in time-frequency representations of sounds and is expected to improve current perceptual audio codecs.


    In the previous study [1], the time-frequency masking kernel for a 10-ms Gaussian-shaped sinusoid was measured at a frequency of 4 kHz and a sensation level of 60 dB. A Gaussian envelope is used because it allows for maximum compactness in the time-frequency domain. While these data constitute a crucial basis for the development of an efficient model of time-frequency masking, additional psychoacoustical data are required, particularly the time-frequency masking kernels for different Gaussian masker frequencies and sensation levels.
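    The stimulus described above can be sketched in a few lines of Python; the sampling rate and the mapping of the 10-ms duration onto the Gaussian envelope width are illustrative assumptions, not the exact definitions used in [1]:

```python
import numpy as np

def gaussian_tone(f0=4000.0, dur=0.010, fs=48000.0):
    """Sinusoid with a Gaussian envelope (maximally compact in time-frequency).

    f0 is the carrier frequency (4 kHz above); dur is the nominal duration
    (10 ms above). Mapping dur to +/- 3 standard deviations of the envelope
    is an assumption for illustration.
    """
    t = np.arange(int(dur * fs)) / fs - dur / 2.0  # time axis centered on 0
    sigma = dur / 6.0                              # dur spans +/- 3 sigma
    envelope = np.exp(-0.5 * (t / sigma) ** 2)     # Gaussian window
    return envelope * np.sin(2.0 * np.pi * f0 * t)

x = gaussian_tone()  # 480 samples at fs = 48 kHz
```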


    The proposed work is part of the ongoing research project POTION: “Perceptual Optimization of audio representaTIONs and coding”, jointly funded by the Austrian Science Fund (FWF) and the French National Research Agency (ANR).


    Aims: The first goal of the work is to conduct psychoacoustical experiments to measure the time-frequency masking kernels for three masker sensation levels (20, 40, and 60 dB) and three masker frequencies (0.75, 4.0, and 8.0 kHz) following the methods in [1]. This part will consist of experimental design, programming, and data collection. The second goal is to interpret the data and compare them to literature data for maskers with various spectro-temporal shapes. This step shall involve the use of state-of-the-art models of the auditory periphery to predict the data.


    Applications: The data will be used to develop a new model of time-frequency masking that should later be implemented and tested in a perceptual audio codec.


    Required skills: qualification for a Master thesis; knowledge of psychophysical methods and psychoacoustics; Matlab programming; good communication skills; proper spoken/written English. Experience with auditory models would be a plus.


    Gross salary: 948.80€/month.


    Supervisors: Thibaud Necciari and Bernhard Laback
    Tel: +43 1 51581-2538



    [1] T. Necciari. Auditory time-frequency masking: Psychoacoustical measures and application to the analysis-synthesis of sound signals. PhD thesis, Aix-Marseille I University, France, October 2010. Available online.

  • BE-SyMPHONic

    BE-SyMPHONic: French-Austrian joint project granted by ANR and FWF

    Principal investigators: Basilio Calderone, Wolfgang U. Dressler
    Co-applicants: Hélène Giraudo, Sylvia Moosmüller

    Start of the project: 13th January 2014


    Language sounds are realized in several different ways. Every language exploits no more than a sub-set of the sounds that the vocal tract can produce, as well as a reduced number of their possible combinations. The restrictions and the phonemic combinations allowed in a language define a branch of phonology called phonotactics.

    Phonotactics refers to the sequential arrangement of phonemic segments in morphemes, syllables, and words and underlies a wide range of phonological issues, from acceptability judgements (pseudowords like <poiture> in French or <Traus> in German are phonotactically plausible) to syllable processes (the syllabic structure in a given language is based on the phonotactic permissions in that language) and the nature and length of possible consonant clusters (which may be seen as intrinsically marked structures with respect to the basic CV template).


    Exploring the psycho-computational representation of the phonotactics in French and German is the aim of this research project.

    In particular, our research will focus on the interplay between phonotactics and word structure in French and German, and investigate the behavioural and computational representations of phonotactic vs. morphonotactic clusters.

    The basic hypothesis underlying this research project is that there exist different cognitive and computational representations for the same consonant cluster according to its phonotactic setting. In particular, the occurrence of a cluster across a morpheme boundary (morphonotactic cluster) is considered particularly interesting.


    Our research will focus on the interplay between phonotactics and morphology and investigate the behavioural and computational representations of consonant clusters according to whether they are: a) exclusively phonotactic clusters, i.e. the consonant cluster occurs only without morpheme boundaries (e.g. Stein in German); b) exclusively morphonotactic clusters, i.e. the consonant cluster occurs only across morpheme boundaries (e.g. lach+st); c) both, with one of the two being more or less dominant (e.g. dominant lob+st vs. Obst) [1]. Thus we test the existence of different ‘cognitive and computational representations’ and processes for the same and for similar consonant clusters according to whether they belong to a), b), or c).

    The central hypothesis which we test is that speakers not only reactively exploit the potential boundary-signaling function of clusters that result from morphological operations, but take active measures to maintain or even enhance this functionality, for example by treating morphologically produced clusters differently than morpheme-internal clusters in production or language acquisition. We call this hypothesis the ‘Strong Morphonotactic Hypothesis’ (henceforth: SMH) (Dressler & Dziubalska-Koɫaczyk 2006, Dressler, Dziubalska-Koɫaczyk & Pestal 2010).

    In particular, we suppose that sequences of phonemes exhibiting morpheme boundaries (the ‘morphonotactic clusters’) should provide speakers with functional evidence about the morphological operation that occurred in that sequence; such evidence should be absent in a sequence of phonemes without morpheme boundaries (the ‘phonotactic clusters’).

    Hence our idea is to investigate the psycho-computational mechanisms underlying the phonotactic-morphonotactic distinction by approaching the problem from two angles simultaneously: (a) psycholinguistic experimental study of language acquisition and production and (b) language computational modelling.

    We therefore aim to provide, on the one hand, psycholinguistic and behavioural support for the hypothesis that morphologically produced clusters are treated differently from morpheme-internal clusters in French and German; on the other hand, we will focus on the distributional and statistical properties of the language in order to verify whether such differences in cluster treatment can be inductively modelled by appealing to distributional regularities of the language.

    The competences of the two research teams overlap and complement each other. The French team will lead in modelling, computational simulation and psycholinguistic experiments, the Austrian team in first language acquisition, phonetic production and microdiachronic change. These synergies are expected to enrich each group in innovative ways.

    [1] An equivalent example for French is given by a) prise (/priz/ ‘grip’, exclusively phonotactic cluster), b) affiche+rai (/afiʃʁɛ/ ‘I (will) post’, exclusively morphonotactic cluster), and c) navigue+rai (/naviɡʁɛ/ ‘I (will) sail’) vs. engrais (/ãɡʁɛ/ ‘fertilizer’), where both conditions are true with the morphonotactic condition dominant.

  • Binaural Hearing and the Cochlear Phase Response (BiPhase)

    BiPhase:  Binaural Hearing and the Cochlear Phase Response

    Project Description

    While it is often assumed that our auditory system is phase-deaf, there is a body of literature showing that listeners are very sensitive to phase differences between spectral components of a sound. Particularly, for spectral components falling into the same perceptual filter, the so-called auditory filter, a change in relative phase across components causes a change in the temporal pattern at the output of the filter. The phase response of the auditory filter is thus important for any auditory tasks that rely on within-channel temporal envelope information, most notably temporal pitch or interaural time differences.

    Within-channel phase sensitivity has been used to derive a psychophysical measure of the phase response of auditory filters (Kohlrausch and Sander, 1995). The basic idea of the widely used masking paradigm is that a harmonic complex whose phase curvature roughly mirrors the phase response of the auditory filter spectrally centered on the complex causes a maximally modulated (peaked) internal representation and, thus, elicits minimal masking of a pure-tone target at the same center frequency. Therefore, systematic variation of the phase curvature of the harmonic complex (the masker) makes it possible to estimate the auditory filter’s phase response: the masker phase curvature causing minimal masking reflects the mirrored phase response of the auditory filter.
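    The masker stimulus used in this paradigm can be made concrete with a short sketch: a harmonic complex whose component phases follow the Schroeder phase rule phi_n = C·pi·n·(n+1)/N, where the scalar C sweeps the phase curvature. The fundamental frequency, harmonic range, and duration below are illustrative assumptions, not values from the studies cited above:

```python
import numpy as np

def schroeder_complex(C, f0=100.0, n_low=2, n_high=40, dur=0.3, fs=48000.0):
    """Harmonic complex with Schroeder phases phi_n = C*pi*n*(n+1)/N.

    Varying C in [-1, 1] sweeps the masker phase curvature: C = 0 gives a
    cosine-phase (highly peaked) waveform, C = +/-1 gives flat envelopes.
    """
    t = np.arange(int(dur * fs)) / fs
    N = n_high - n_low + 1                    # number of components
    x = np.zeros_like(t)
    for n in range(n_low, n_high + 1):
        phi = C * np.pi * n * (n + 1) / N     # Schroeder phase rule
        x += np.cos(2.0 * np.pi * n * f0 * t + phi)
    return x / np.max(np.abs(x))              # normalize to unit peak

flat = schroeder_complex(C=1.0)    # flat envelope, low crest factor
peaky = schroeder_complex(C=0.0)   # pulse-like, high crest factor
```

    After peak normalization, the flat-envelope complex carries more RMS energy than the peaked one, which is exactly the envelope difference the masking paradigm exploits.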

    Besides the obvious importance of detecting the target in the temporal dips of the masker, particularly if the target is short compared to the modulation period of the masker (Kohlrausch and Sander, 1995), there are several indications that fast compression in the cochlea is important to obtain the masker-phase effect (e.g., Carlyon and Datta, 1997; Oxenham and Dau, 2004). One indication is that listeners with sensorineural hearing impairment (HI), characterized by reduced or absent cochlear compression due to loss of outer hair cells, show only a very weak masker-phase effect, making it difficult to estimate the cochlear phase response.

    In the BiPhase project we propose a new paradigm for measuring the cochlear phase response that does not rely on cochlear compression and thus should be applicable in HI listeners. It relies on the idea that the amount of modulation (peakedness) in the internal representation of a harmonic complex, as given by its phase curvature, determines the listener’s sensitivity to envelope interaural time difference (ITD) imposed on the stimulus. Assuming that the listener’s sensitivity to envelope ITD does not rely on compression, systematic variation of the stimulus phase curvature should make it possible to estimate the cochlear phase response in both normal-hearing (NH) and HI listeners. The main goals of BiPhase are the following:

    • Aim 1: Assessment of the importance of cochlear compression for the masker-phase effect at different masker levels. Masking experiments are performed with NH listeners using Schroeder-phase harmonic complexes with and without a precursor stimulus, intended to reduce cochlear compression by activation of the efferent system controlling outer-hair-cell activity. In addition, a quantitative model approach is used to estimate the contribution of compression from outer-hair-cell activity and other factors to the masker-phase effect. The results are described in Tabuchi, Laback, Necciari, and Majdak (2016). A follow-up study on the dependency of the masker-phase effect on masker and target duration, the target’s position within the masker, the masker level, and the masker bandwidth, with conclusions on the role of compression in the mechanisms underlying simultaneous and forward masking, is underway.
    • Aim 2: Development and evaluation of an envelope ITD-based paradigm to estimate the cochlear phase response. The experimental results on NH listeners, complemented with a modeling approach and predictions, are described in Tabuchi and Laback (2017). This paper also provides model predictions for HI listeners.
      Besides the consistency of the overall pattern of ITD thresholds across phase curvatures with data on the masking paradigm and predictions of the envelope ITD model, an unexpected peak in the ITD thresholds was found for a negative phase curvature which was not predicted by the ITD model and is not found in masking data. Furthermore, the pattern of results for individual listeners appeared to reveal more variability than the masking paradigm. Data were also collected with an alternative method, relying on the extent of laterality of a target with supra-threshold ITD, as measured with an interaural-level-difference-based pointing stimulus. These data showed no nonmonotonic behavior at negative phase curvatures. Rather, they showed good correspondence with the ITD model prediction and more consistent results across individuals compared to the ITD threshold-based method (Zenke, Laback, and Tabuchi, 2016).
    • Aim 3: Development of an ITD-based method to account for potentially non-uniform curvatures of the phase response in HI listeners. Using two independent iterative approaches, NH listeners adjusted the phase of individual harmonics of an ITD-carrying complex so that it elicited the maximum extent of laterality. Although the pattern of adjusted phases very roughly resembled the expected pattern, there was a large amount of uncertainty (Zenke, 2014), preventing the method from further use. Modified versions of the method will be considered in a future study.


    This project is funded by the Austrian Science Fund (FWF, Project # P24183-N24, awarded to Bernhard Laback). It ran from 2013 to 2017.


    Peer-reviewed papers

    • Tabuchi, H. and Laback, B. (2017): Psychophysical and modeling approaches towards determining the cochlear phase response based on interaural time differences, The Journal of the Acoustical Society of America 141, 4314–4331.
    • Tabuchi, H., Laback, B., Necciari, T., and Majdak, P. (2016). The role of compression in the simultaneous masker phase effect, The Journal of the Acoustical Society of America 140, 2680-2694.

    Conference talks

    • Tabuchi, H., Laback, B., Majdak, P., and Necciari, T. (2014). The role of precursor in tone detection with Schroeder-phase complex maskers. Poster presented at 37th Association for Research in Otolaryngology (ARO) Meeting, San Diego, California.
    • Tabuchi, H., Laback, B., Majdak, P., and Necciari, T. (2014). The perceptual consequences of a precursor on tone detection with Schroeder-phase harmonic maskers. Invited talk at Alps Adria Acoustics Association, Graz, Austria.
    • Tabuchi, H., Laback, B., Majdak, P., Necciari, T., and Zenke,K. (2015). Measuring the auditory phase response based on interaural time differences. Talk at 169th Meeting of the Acoustical Society of America, Pittsburgh, Pennsylvania.
    • Zenke, K., Laback, B., and Tabuchi, H. (2016). Towards an Efficient Method to Derive the Phase Response in Hearing-Impaired Listeners. Talk at 37th Association for Research in Otolaryngology (ARO) Meeting, San Diego, California.
    • Tabuchi, H., Laback, B., Majdak, P., Necciari, T., and Zenke, K. (2016). Modeling the cochlear phase response estimated in a binaural task. Talk at 39th Association for Research in Otolaryngology (ARO) Meeting, San Diego, California.
    • Laback, B., and Tabuchi, H. (2017). Psychophysical and modeling approaches towards determining the cochlear phase response based on interaural time differences. Invited Talk at AABBA Meeting, Vienna, Austria.
    • Laback, B., and Tabuchi, H. (2017). Psychophysical and Modeling Approaches towards determining the Cochlear Phase Response based on Interaural Time Differences. Invited Talk at 3rd Workshop “Cognitive neuroscience of auditory and cross-modal perception”, Kosice, Slovakia.


    • Carlyon, R. P., and Datta, A. J. (1997). "Excitation produced by Schroeder-phase complexes: evidence for fast-acting compression in the auditory system," J Acoust Soc Am 101, 3636-3647.
    • Kohlrausch, A., and Sander, A. (1995). "Phase effects in masking related to dispersion in the inner ear. II. Masking period patterns of short targets," J Acoust Soc Am 97, 1817-1829.
    • Oxenham, A. J., and Dau, T. (2004). "Masker phase effects in normal-hearing and hearing-impaired listeners: evidence for peripheral compression at low signal frequencies," J Acoust Soc Am 116, 2248-2257.

  • BIOTOP: Adaptive Wavelet and Frame techniques for acoustic BEM. FWF Project I-1018-N25



    Localization of sound sources plays an important role in our everyday lives. The shape of the human head, the torso, and especially the shape of the outer ear (pinna) have a filtering effect on incoming sounds and thus play an important role in sound localization. This filtering effect can be described using so-called head-related transfer functions (HRTFs), which can be computed numerically by calculating the sound-pressure distribution around the head with methods such as the boundary element method (BEM).


    In BIOTOP, the numerical calculations shall be made more efficient by using adaptive wavelet and frame techniques. Compared to commonly used BEM basis functions, wavelets have the advantage that they can adapt better to a given distribution of the acoustic field on the head. As a generalization of wavelets, frames allow for an even more flexible construction method and thus for a better adaptation to the problem at hand.

    BIOTOP combines abstract theoretical mathematics with numerical and applied mathematics. It is an international DACH (DFG-FWF-SNF) project between the Philipps-Universität Marburg (Stephan Dahlke), the University of Basel (Helmut Harbrecht), and the ARI. The expertise of all three research groups shall be combined to develop new strategies and numerical methods. The project is funded by the FWF (project number I-1018-N25).


  • FLAME: Frames and Linear Operators for Acoustical Modeling and Parameter Estimation.

    START project of P. Balazs.



    This international, multi-disciplinary and team-oriented project will expand the group Mathematics and Acoustical Signal Processing at the Acoustics Research Institute in cooperation with NuHAG Vienna (Hans G. Feichtinger, M. Dörfler, K. Gröchenig), the Institute of Telecommunications Vienna (Franz Hlawatsch), LATP Marseille (Bruno Torrésani), LMA Marseille (Richard Kronland-Martinet), CAHR (Torsten Dau, Peter Soendergaard), FYMA Louvain-la-Neuve (Jean-Pierre Antoine), AG Numerics (Stephan Dahlke), the School of Electrical Engineering and Computer Science (Damian Marelli), as well as the BKA Wiesbaden (Timo Becker).

    Within the institute, the groups Audiological Acoustics and Psychoacoustics, Computational Acoustics, Acoustic Phonetics, and Software Development are involved in the project.

    This project is funded by the FWF as a START prize. It is planned to run from May 2012 to April 2018.






    General description:

    We live in the age of information, where the analysis, classification, and transmission of information is of essential importance. Signal processing tools and algorithms form the backbone of important technologies like MP3, digital television, mobile phones, and wireless networking. Many signal processing algorithms have been adapted for applications in audio and acoustics, also taking into account the properties of the human auditory system.

    The mathematical concept of frames provides a theoretical background for signal processing. Frames are generalizations of orthonormal bases that give more freedom for the analysis and modification of information; however, this concept is still not firmly rooted in applied research. The link between mathematical frame theory, signal processing algorithms, their implementations, and finally acoustical applications is a very promising, synergetic combination of research in different fields.
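    To make the idea of a frame concrete, a standard textbook example (unrelated to the specific constructions developed in FLAME) is the "Mercedes-Benz" frame: three unit vectors in the plane at 120° spacing. They are not a basis, yet every vector is perfectly recovered from its three redundant frame coefficients:

```python
import numpy as np

# Three unit vectors at 120-degree spacing: a tight frame for R^2 with
# frame bound A = 3/2, i.e. the frame operator S = F^T F equals (3/2) I.
angles = np.pi / 2 + np.array([0.0, 2.0, 4.0]) * np.pi / 3.0
F = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 3x2 analysis matrix

x = np.array([0.7, -1.3])
coeffs = F @ x                        # analysis: 3 redundant coefficients
x_rec = (2.0 / 3.0) * (F.T @ coeffs)  # synthesis: x = (1/A) sum <x,f_k> f_k

print(np.allclose(x, x_rec))  # True: perfect reconstruction despite redundancy
```

    The redundancy (three coefficients for a two-dimensional vector) is exactly the extra freedom for analysis and modification that the paragraph above refers to.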

    Therefore the main goal of this multidisciplinary project is to

    -> Establish Frame Theory as Theoretical Backbone of Acoustical Modeling

    in particular in psychoacoustics, phonetics, and computational acoustics as well as audio engineering.



    For this auspicious connection of disciplines, FLAME will produce substantial impact on both the theory and applied research.

    The theory-based part of FLAME consists of the following topics:

    • T1 Frame Analysis and Reconstruction Beyond Classical Approaches
    • T2 Frame Multipliers, Extended
    • T3 Novel Frame Representation of Operators Motivated by Computational Acoustics

    The application-oriented part of FLAME consists of:

    • A1 Advanced Frame Methods for Perceptual Sparsity in the Time-Frequency Plane
    • A2 Advanced Frame Methods for the Analysis and Classification of Speech
    • A3 Advanced Frame Methods for Signal Enhancement and System Estimation

  • LION - Localisation and Identification of Moving Noise Sources


    We thank the FWF for supporting the project – grant number I 4299-N32

    Sound source localisation methods are widely used in the automotive, railway, and aircraft industries. Many different methods are available for the analysis of sound sources at rest. However, methods for the analysis of moving sound sources still suffer from the complexities introduced by the Doppler frequency shift, the relatively short measuring times, and propagation effects in the atmosphere. The project LION combines the expertise of four research groups from three countries working in the field of sound source localisation: The Beuth Hochschule für Technik Berlin (Beuth), the Turbomachinery and Thermoacoustics chair at TU-Berlin (TUB), the Acoustic Research Institute (ARI) of the Austrian Academy of Sciences in Vienna and the Swiss laboratory for Acoustics / Noise Control of EMPA. The mentioned institutions cooperate to improve and extend the existing methods for the analysis of moving sound sources. They want to increase the dynamic range, the spatial, and the frequency resolution of the methods and apply them to complex problems like the analysis of tonal sources with strong directivities or coherent and spatially distributed sound sources.



    The partners want to jointly develop and validate these methods, exploiting the synergy effects that arise from such a partnership. Beuth plans to extend the equivalent source method in frequency domain to moving sources located in a halfspace, taking into account the influence of the ground and sound propagation through an inhomogeneous atmosphere. ARI contributes acoustic holography, principal component analysis, and independent component analysis methods and wants to use its experience with pass-by measurements for trains to improve numerical boundary element methods including the transformation from fixed to moving coordinates. TUB develops optimization methods and model based approaches for moving sound sources and will contribute its data base of fly-over measurements with large microphone arrays as test cases. EMPA contributes a sound propagation model based on Time Variant Digital Filters with particular consideration of turbulence and ground effects and will also generate synthetic test cases for the validation of sound source localization algorithms. The project is planned for a period of three years. The work program is organized in four work packages: 1) the development of algorithms and methods, 2) the development of a virtual test environment for the methods, 3) the simulation of virtual test cases, and 4) the application of the new methods to existing test cases of microphone array measurements of trains and aircraft.


  • LocaPhoto: Localization Model & Numeric Simulations

    Localization of sound sources is an important task of the human auditory system and much research effort has been put into the development of audio devices for virtual acoustics, i.e. the reproduction of spatial sounds via headphones. Even though the process of sound localization is not completely understood yet, it is possible to simulate spatial sounds via headphones by using head-related transfer functions (HRTFs). HRTFs describe the filtering of the incoming sound due to head, torso and particularly the pinna and thus they strongly depend on the particular details in the listener's geometry. In general, for realistic spatial-sound reproduction via headphones, the individual HRTFs must be measured. As of 2012, the available HRTF acquisition methods were acoustic measurements: a technically-complex process, involving placing microphones into the listener's ears, and lasting for tens of minutes.

    In LocaPhoto, we were working on an easily accessible method to acquire and evaluate listener-specific HRTFs. The idea was to numerically calculate HRTFs based on a geometrical representation of the listener (3-D mesh) obtained from 2-D photos by means of photogrammetric reconstruction.

    As a result, we have developed a software package for numerical HRTF calculations, a method for geometry acquisition, and models able to evaluate HRTFs in terms of broadband ITDs and sagittal-plane sound localization performance.
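    As an illustration of the broadband-ITD evaluation, one common estimator (one of several in the literature; not necessarily the one used in LocaPhoto) takes the lag that maximizes the interaural cross-correlation of the two head-related impulse responses (HRIRs):

```python
import numpy as np

def itd_from_hrirs(h_left, h_right, fs):
    """Broadband ITD estimate in seconds: the lag maximizing the interaural
    cross-correlation of the left/right HRIRs (positive lag: right ear leads).
    """
    xcorr = np.correlate(h_left, h_right, mode="full")
    lag = int(np.argmax(xcorr)) - (len(h_right) - 1)  # lag in samples
    return lag / fs

# Synthetic sanity check: identical impulses, right ear delayed by 24 samples
# (0.5 ms at 48 kHz), roughly a source far to the listener's left.
fs = 48000
h_l = np.zeros(256); h_l[10] = 1.0
h_r = np.roll(h_l, 24)
print(itd_from_hrirs(h_l, h_r, fs))  # -0.0005 (left ear leads by 0.5 ms)
```

    Real HRIRs are not pure delays, so practical estimators typically low-pass or envelope-filter the responses first; this sketch only shows the core idea.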


  • POTION: Perceptual Optimization of Audio Time-Frequency Representations and Coding.

    French-Austrian bilateral research project funded by the French National Agency of Research (ANR) and the Austrian Science Fund (FWF, project no. I 1362-N30). The project involves two academic partners, namely the Laboratory of Mechanics and Acoustics (LMA - CNRS UPR 7051, France) and the Acoustics Research Institute. At the ARI, two research groups are involved in the project: the Mathematics and Signal Processing in Acoustics and the Psychoacoustics and Experimental Audiology groups.

    Principal investigators: Thibaud Necciari (ARI), Piotr Majdak (ARI) and Olivier Derrien (LMA).

    Running period: 2014-2017 (project started on March 1, 2014).


    One of the greatest challenges in signal processing is to develop efficient signal representations. An efficient representation extracts relevant information and describes it with a minimal amount of data. In the specific context of sound processing, and especially in audio coding, where the goal is to minimize the size of binary data required for storage or transmission, it is desirable that the representation takes into account human auditory perception and allows reconstruction with a controlled amount of perceived distortion. Over the last decades, many psychoacoustical studies investigated auditory masking, an important property of auditory perception. Masking refers to the degradation of the detection threshold of a sound in the presence of another sound. The results were used to develop models of either spectral or temporal masking. Attempts were made to simply combine these models to account for time-frequency (t-f) masking effects in perceptual audio codecs. We recently conducted psychoacoustical studies on t-f masking that revealed the inaccuracy of such simple models. These new data on t-f masking represent a crucial basis to account for masking effects in t-f representations of sounds. Although t-f representations are standard tools in audio processing, the development of a t-f representation of audio signals that is mathematically founded, perception-based, perfectly invertible, and possibly with a minimum amount of redundancy remains a challenge. POTION thus addresses the following questions:

    1. To what extent is it possible to obtain a perception-based (i.e., as close as possible to “what we see is what we hear”), perfectly invertible, and possibly minimally redundant t-f representation of sound signals? Such a representation is essential for modeling complex masking interactions in the t-f domain and is expected to improve our understanding of auditory processing of real-world sounds. Moreover, it is of fundamental interest for many audio applications involving sound analysis-synthesis.
    2. Is it possible to improve current perceptual audio codecs by considering a joint t-f approach? To reduce the size of digital audio files, perceptual audio codecs like MP3 decompose sounds into variable-length time segments, apply a frequency transform, and use masking models to control the quantization of the transform coefficients within each segment. Thus, current codecs follow a mainly spectral approach, although temporal masking effects are taken into account in some implementations. By combining an efficient perception-based t-f transform with a joint t-f masking model in an audio codec, we expect to achieve significant performance improvements.
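    The quantization principle described above can be sketched in a few lines: size the quantization step of each transform coefficient according to a masking threshold, so that the quantization noise stays below what the ear is assumed to detect. This is a minimal illustration of the principle only, not any specific codec: the plain FFT frame, the triangular spreading of 2 dB per bin, and the 10 dB masker offset are arbitrary placeholder values, not calibrated psychoacoustic parameters.

    ```python
    import numpy as np

    def masking_threshold(spectrum_db, spread_db_per_bin=2.0, offset_db=10.0):
        """Crude simultaneous-masking estimate: every spectral component masks
        its neighbours with a triangular spread, offset below the masker level.
        Illustrative placeholder values, not a calibrated psychoacoustic model."""
        n = len(spectrum_db)
        thr = np.full(n, -np.inf)
        for k in range(n):
            spread = (spectrum_db[k] - offset_db
                      - spread_db_per_bin * np.abs(np.arange(n) - k))
            thr = np.maximum(thr, spread)
        return thr  # threshold in dB, one value per frequency bin

    def encode_frame(frame):
        """Transform one time segment and quantize with a threshold-driven step."""
        X = np.fft.rfft(frame)
        mag_db = 20 * np.log10(np.abs(X) + 1e-12)
        thr = 10 ** (masking_threshold(mag_db) / 20)  # back to linear amplitude
        step = thr / 2                                # keeps |error| below thr
        q = np.round(X.real / step) + 1j * np.round(X.imag / step)
        return q, step

    def decode_frame(q, step, n):
        """Dequantize and invert the transform."""
        return np.fft.irfft(q * step, n)

    # Example: encode and decode one 512-sample frame of a sinusoid
    frame = np.sin(2 * np.pi * 0.1 * np.arange(512))
    q, step = encode_frame(frame)
    decoded = decode_frame(q, step, len(frame))
    ```

    In a real codec the quantized coefficients would then be entropy-coded; the point of the sketch is only that the quantization step, and hence the bit demand, is governed by the masking model rather than being uniform across frequency.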

    Working program:

    POTION is structured in three main tasks:

    1. Perception-based t-f representation of audio signals with perfect reconstruction: A linear and perfectly invertible t-f representation will be created using the recently developed non-stationary Gabor theory as the mathematical background. The transform will be designed so that its t-f resolution mimics the t-f analysis properties of the auditory system while introducing as little redundancy as possible, to maximize coding efficiency.
    2. Development and implementation of a t-f masking model: Based on psychoacoustical data on t-f masking collected by the partners in previous projects and on literature data, a new, comprehensive model of t-f masking will be developed and implemented in the computationally efficient representation built in task 1. Additional psychoacoustical data required for the development of the model, involving frequency, level, and duration effects in masking for single and multiple maskers, will be collected. The resulting signal processing algorithm should represent and re-synthesize only the perceptually relevant components of the signal. It will be calibrated and validated in listening tests with synthetic and real-world sounds.
    3. Optimization of perceptual audio codecs: This task represents the main application of POTION. It will combine the efficient representation built in task 1 with the t-f masking model built in task 2 for implementation in a perceptual audio codec.
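    As a toy illustration of the perfect-reconstruction property demanded in task 1 (the project itself targets non-stationary Gabor frames, whose window length adapts over time; the function names below are made up for this sketch), here is a fixed-window Gabor (STFT) analysis/synthesis pair in which dividing by the summed squared windows guarantees exact reconstruction wherever the windows overlap:

    ```python
    import numpy as np

    def analysis(x, win, hop):
        """Gabor/STFT analysis: windowed FFT of overlapping frames."""
        return np.array([np.fft.rfft(win * x[i:i + len(win)])
                         for i in range(0, len(x) - len(win) + 1, hop)])

    def synthesis(X, win, hop, n):
        """Weighted overlap-add synthesis; dividing by the summed squared
        windows makes the pair perfectly invertible wherever windows overlap."""
        y = np.zeros(n)
        norm = np.zeros(n)
        for k, coeffs in enumerate(X):
            i = k * hop
            y[i:i + len(win)] += win * np.fft.irfft(coeffs, len(win))
            norm[i:i + len(win)] += win ** 2
        return y / np.maximum(norm, 1e-12)

    # Hann window with 50% overlap: a standard perfect-reconstruction setup
    win = np.hanning(512)
    hop = 256
    x = np.random.default_rng(0).standard_normal(4096)
    X = analysis(x, win, hop)
    y = synthesis(X, win, hop, len(x))
    ```

    Note that this fixed-window representation is roughly twice redundant (15 frames of 257 complex coefficients for 4096 samples); shrinking that redundancy while keeping invertibility and an auditory-motivated resolution is precisely what makes task 1 challenging.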

    More information on the project can be found on the POTION web page.


    • Chardon, G., Necciari, Th., Balazs, P. (2014): Perceptual matching pursuit with Gabor dictionaries and time-frequency masking, in: Proceedings of the 39th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014). Florence, Italy, 3126-3130. (proceedings)

    Related topics investigated at the ARI:

  • SpExCue: Role of spectral cues in sound externalization - objective measures & modeling

    Baumgartner et al. (2017a)

    Spatial hearing is important for monitoring the environment for interesting or hazardous sounds and for selectively attending to them. The spatial separation between the two ears and the complex geometry of the human body provide auditory cues about the location of a sound source. Depending on where a sound is coming from, the pinna (or auricle) changes the sound spectrum before the sound reaches the eardrum. Since the shape of a pinna is highly individual (even more so than a fingerprint), it also affects the spectral cues in a very individual manner. To produce realistic auditory perception artificially, this individuality must be reflected with sufficient precision, although the actual requirements are currently unclear. That is why SpExCue set out to find electrophysiological measures and prediction models of how spatially realistic (“externalized”) a virtual sound source is perceived to be.

    Virtual and augmented reality (VR/AR) systems aim to immerse a listener into a well-externalized 3D auditory space. This requires a perceptually accurate simulation of the listener’s natural acoustic exposure. Particularly challenging is to appropriately represent the high-frequency spectral cues induced by the pinnae. To simplify this task, we aim at developing a phenomenological computational model for sound externalization with a particular focus on spectral cues. The model will be designed to predict the listener’s degree of externalization based on binaural input signals and the listener’s individual head-related transfer functions (HRTFs) under static listening conditions.

    The naturally externalized auditory perception can be disrupted, for instance, when listening via headphones or hearing-assistive devices, and instead sounds are heard inside the head. Because of this change in externalization or perceived distance, our investigations of spectral cues also served to study the phenomenon of auditory looming bias (Baumgartner et al., 2017 PNAS): sounds approaching the listener are perceived more intensely than those that are receding from the listener. Previous studies demonstrated auditory looming bias exclusively by loudness changes (increasing/decreasing loudness used to simulate approaching/receding sounds). Hence, it was not clear whether this bias truly reflects perceptual differences in sensitivity to motion direction rather than changes in loudness. Our spectral cue changes were perceived as either approaching or receding at steady loudness and evoked auditory looming bias both on a behavioral level (approaching sounds easier to recognize than receding sounds) and an electrophysiological level (larger neural activity in response to approaching sounds). Therefore, our study demonstrated that the bias is truly about perceived motion in distance, not loudness changes.

    Further, SpExCue investigated how the combination of different auditory spatial cues affects attentional control in a speech recognition task with simultaneous talkers, which requires spatial selective attention as at a cocktail party (Deng et al., in prep). We found that natural combinations of auditory spatial cues caused larger neural activity in preparation for the test signal and optimized the neural processing of the attended speech.

    SpExCue also compared different computational modeling approaches that aim to predict the effect of spectral cue changes on how spatially realistic a sound is perceived (Baumgartner et al., 2017 EAA-ASA). Although many previous experimental results could be predicted by at least one of the models, none of them alone could explain these results. In order to assist the future design of more general computational models for spatial hearing, we finally created a conceptual cognitive model for the formation of auditory space (Majdak et al., in press).


    Erwin-Schrödinger Fellowship from Austrian Science Funds (FWF, J3803-N30) awarded to Robert Baumgartner. Duration: May 2016 - November 2017.

    Follow-up funding provided by Facebook Reality Labs, since March 2018. Project Investigator: Robert Baumgartner.


    • Baumgartner, R., Reed, D.K., Tóth, B., Best, V., Majdak, P., Colburn, H.S., Shinn-Cunningham, B. (2017): Asymmetries in behavioral and neural responses to spectral cues demonstrate the generality of auditory looming bias, in: Proceedings of the National Academy of Sciences of the USA 114, 9743-9748. (article)
    • Baumgartner, R., Majdak, P., Colburn, H.S., Shinn-Cunningham, B. (2017): Modeling Sound Externalization Based on Listener-specific Spectral Cues, presented at: Acoustics ‘17 Boston: The 3rd Joint Meeting of the Acoustical Society of America and the European Acoustics Association. Boston, MA, USA. (conference)
    • Deng, Y., Choi, I., Shinn-Cunningham, B., Baumgartner, R. (2019): Impoverished auditory cues limit engagement of brain networks controlling spatial selective attention, in: Neuroimage 202, 116151. (article)
    • Baumgartner, R., Majdak, P. (2019): Predicting Externalization of Anechoic Sounds, in: Proceedings of ICA 2019. (proceedings)
    • Majdak, P., Baumgartner, R., Jenny, C. (2019): Formation of three-dimensional auditory space, in: arXiv:1901.03990 [q-bio]. (preprint)
  • VarDiÖ: Variation and Change of dialect varieties in Austria (in apparent and real time)

    Project Part 02 of the special research area “German in Austria. Variation - Contact - Perception”, funded by the FWF (FWF6002), in cooperation with the University of Salzburg.

    Principal Investigators: Stephan Elspaß, Hannes Scheutz, Sylvia Moosmüller

    Start of the project: 1st of January 2016

    Project description:

    The diversity and dynamics of the various dialects in Austria are the topic of this project. Based on a new survey, different research questions will be addressed in the coming years, such as: What differences and changes (e.g., through processes of convergence and divergence) can be observed within and between the Austrian dialect regions? How does dialect change differ between urban and rural areas? Are there noticeable generational and gender differences with regard to dialect change? What can a comprehensive comparison of ‘real-time’ and ‘apparent-time’ analyses contribute to a general theory of language change?

    To answer these questions, speech samples from a total of 160 dialect speakers, balanced for age and gender, are collected and analysed at 40 locations in Austria within the first four years. Furthermore, samples from selected speakers will be recorded and evaluated under laboratory conditions to determine phonetic peculiarities as precisely as possible. In the second survey phase, complementary recordings will be carried out at another 100 locations in Austria in order to analyse differences and changes between the dialect landscapes in more detail. State-of-the-art dialectometric methods will be used to arrive at probabilistic statements regarding dialect variation and change in Austria. The analyses will include all linguistic levels, from phonetics to syntax and lexis. The data will be documented in the first visual and ‘talking’ dialect atlas of Austria.

    Project page of the project partners in Salzburg


  • Vowel Tensity in Standard Austrian and Standard German

    FWF DACH I 536-G20: 2011-2013
    Cooperation with the Institute of Phonetics and Speech Processing, LMU Munich.

    Project leader (Austria): Sylvia Moosmüller
    Project leader (Germany): Jonathan Harrington


    The distinction between so-called tense and lax vowels, e.g., Miete - Mitte ("rent" - "center") or Höhle - Hölle ("cave" - "hell"), is encountered in many languages. However, various articulatory adjustments can produce this distinction, and these are language-specific.

    In the current project, we address this issue by analysing high tense and lax vowel pairs of the type bieten - bitten ("to offer" - "to request"), Hüte - Hütte ("hats" - "hut"), and Buße - Busse ("penance" - "busses") in two related language varieties: Standard Austrian German (SAG) and Standard German German (SGG). Previous studies suggest that high lax vowels as in bitten, Hütte, and Busse tend to approximate their respective tense cognates in bieten, Hüte, and Buße.

    The research questions were investigated by a) comparing the tense and lax vowel pairs in SAG and SGG, b) investigating whether high lax vowels approximate their tense cognates in SAG, c) investigating whether the high vowel pairs in SAG are distinguished by quality, by quantity, or by quantity relations with the following consonant, and d) investigating whether an ongoing sound change can be observed in SAG, with young SAG speakers exhibiting a stronger tendency to merge the vowels than old SAG speakers.

    Main Results:

    SGG speakers clearly distinguish the high vowel pairs by quality, whereas speaker-specific strategies can be observed in SAG: some speakers distinguish high tense and lax vowel pairs by quality, others merge the quality contrast but restrict the merger to velar contexts, and still others merge high tense and lax vowels altogether. Where the distinction is maintained, the differences between high tense and high lax vowels are less pronounced in SAG than in SGG, and still less pronounced in the speech of young SAG speakers as compared to old SAG speakers. The same result was observed for quantity distinctions: all speakers differentiate the high vowel pairs by quantity, meaning that the tense vowels of the type bieten, Hüte, and Buße are longer than their respective lax cognates. Again, the differences are most pronounced in SGG and least pronounced in the speech of young SAG speakers, meaning that these tense vowels are shortened in the speech of young SAG speakers as compared to old SAG speakers and SGG speakers.

    Results on the quantity interactions of vowel + consonant sequences confirm the role of quantity in SAG. Again, some age-specific differences emerged insofar as, overall, young SAG speakers have shorter durations than old SAG speakers; however, they maintain the timing relations observed for the old SAG speakers. Results on perception strongly suggest that SAG listeners make use of quantity cues to distinguish the vowel pairs, whereas SGG listeners rather rely on cues connected with quality. Generally, it can be concluded that quantity distinctions are more relevant in SAG than in SGG.

    Project Related Publications:

    Harrington, Jonathan, Hoole, Philip, & Reubold, Ulrich (2012). A physiological analysis of high front, tense-lax vowel pairs in Standard Austrian and Standard German. Italian Journal of Linguistics, 24, 158-183.

    Brandstätter, Julia, & Moosmüller, Sylvia (in press). Neutralisierung der hohen Vokale in der Wiener Standardsprache – A sound change in progress? In M. Glauninger & A. Lenz (Eds.), Standarddeutsch in Österreich – Theoretische und empirische Ansätze. Vienna: Vandenhoeck & Ruprecht.

    Brandstätter, Julia, Kaseß, Christian H., & Moosmüller, Sylvia (accepted). Quality and quantity in high vowels in Standard Austrian German. In A. Leemann, M.-J. Kolly, & V. Dellwo (Eds.), Trends in phonetics and phonology in German-speaking Europe. Zurich: Peter Lang.

    Cunha, Conceição, Harrington, Jonathan, Moosmüller, Sylvia, & Brandstätter, Julia (accepted). The influence of consonantal context on the tense-lax contrast in two standard varieties of German. In A. Leemann, M.-J. Kolly, & V. Dellwo (Eds.), Trends in phonetics and phonology in German-speaking Europe. Zurich: Peter Lang.

    Moosmüller, Sylvia (in press). Methodisches zur Bestimmung der Standardaussprache in Österreich. In M. Glauninger & A. Lenz (Eds.), Standarddeutsch in Österreich – Theoretische und empirische Ansätze. Vienna: Vandenhoeck & Ruprecht (= Wiener Arbeiten zur Linguistik).

    Moosmüller, Sylvia, & Brandstätter, Julia (in press). Phonotactic information in the temporal organisation of Standard Austrian German and the Viennese dialect. Language Sciences.