Mathematics and Signal Processing in Acoustics

This is the companion webpage of the manuscript:

Audlet Filter Banks: A Versatile Analysis/Synthesis Framework using Auditory Frequency Scales

Thibaud Necciari, Nicki Holighaus, Peter Balazs, Zdeněk Průša, Piotr Majdak, and Olivier Derrien.

Abstract: Many audio applications rely on filter banks (FBs) to analyze, process, and re-synthesize sounds. For these applications, an important property of the analysis-synthesis system is the reconstruction error; it has to be kept to a minimum to avoid audible artifacts. Other advantageous properties include stability and low redundancy. To exploit some aspects of human auditory perception in the signal chain, some applications rely on FBs that approximate the frequency analysis performed in the auditory periphery, the gammatone FB being a popular example. However, current gammatone FBs only allow partial reconstruction and stability at high redundancies. In this article, we construct an analysis-synthesis system for audio applications. The proposed system, named Audlet, is based on an oversampled FB with filters distributed on auditory frequency scales. It allows perfect reconstruction for a wide range of FB settings (e.g., the shape and density of filters), efficient FB design, and adaptable redundancy. In particular, we show how to construct a gammatone FB with perfect reconstruction. Experiments demonstrate performance improvements of the proposed gammatone FB when compared to current gammatone FBs in terms of reconstruction error and stability, especially at low redundancies. An application of the framework to audio source separation illustrates its utility for audio processing.

Sound examples for the source separation experiment: click on a system's acronym to hear the corresponding reconstruction.
Reference signals: original mixture -- target

Rt | β = 1 | β = 1/6 | 1024-point STFT
1.1 | trev_gfb, Audlet_gfb, Audlet_hann | trev_gfb, Audlet_gfb, Audlet_hann | STFT_hann
1.5 | trev_gfb, Audlet_gfb, Audlet_hann | trev_gfb, Audlet_gfb, Audlet_hann | STFT_hann
4.0 | trev_gfb, Audlet_gfb, Audlet_hann | trev_gfb, Audlet_gfb, Audlet_hann | STFT_hann

This page provides resources for the research article:

"Frame Theory for Signal Processing in Psychoacoustics"

by Peter Balazs, Nicki Holighaus, Thibaud Necciari, and Diana Stoeva

to appear in the book "Excursions in Harmonic Analysis" published by Springer.

Abstract: This review chapter aims to strengthen the link between frame theory and signal processing tasks in psychoacoustics. On the one side, the basic concepts of frame theory are presented and some proofs are provided to explain those concepts in some detail. The goal is to reveal to hearing scientists how this mathematical theory could be relevant for their research. In particular, we focus on frame theory in a filter bank approach, which is probably the most relevant view-point for scientists in audio signal processing. On the other side, basic psychoacoustic concepts are presented to stimulate mathematicians to apply their knowledge in this field.

The present ZIP archive contains Matlab/Octave scripts that reproduce the results presented in Figures 7, 10, and 11 of the article.

IMPORTANT NOTE: The Matlab/Octave toolbox Large Time-Frequency Analysis (LTFAT, version 1.2.0 and above) must be installed to run the code. This toolbox is freely available at Sourceforge.

If you encounter any issue with the files, please do not hesitate to contact the authors.

 

This page provides the sound files corresponding to the results of the perceptual matching pursuit algorithm presented in:

"Perceptual Matching Pursuit with Gabor Dictionaries and Time-Frequency Masking"

Gilles Chardon, Thibaud Necciari, and Peter Balazs

submitted to the 39th International Conference on Acoustics, Speech, and Signal Processing (ICASSP2014).

Musical excerpt 1: Bruno Maderna (piano concerto, Fig. 2a in the manuscript)

original

Iterations | Matching Pursuit | Perceptual Matching Pursuit | Residual (MP) | Masked components (PMP) | Residual + masked components
10000 | wav | wav | wav | wav | wav
20000 | wav | wav | wav | wav | wav
40000 | wav | wav | wav | wav | wav
80000 | wav | wav | wav | wav | wav

 

Musical excerpt 2: Suzanne Vega (Fig. 2b in the manuscript)

original

Iterations | Matching Pursuit | Perceptual Matching Pursuit | Residual (MP) | Masked components (PMP) | Residual + masked components
10000 | wav | wav | wav | wav | wav
20000 | wav | wav | wav | wav | wav
40000 | wav | wav | wav | wav | wav
80000 | wav | wav | wav | wav | wav
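
For readers unfamiliar with the "Iterations" column, the following is a minimal sketch of plain matching pursuit on a toy real-valued dictionary. It is not the authors' implementation: it uses neither Gabor atoms nor the time-frequency masking model of the perceptual variant, and the function name matching_pursuit and the example dictionary are assumptions made for illustration only.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, atoms, iterations):
    """Greedy matching pursuit with unit-norm atoms.

    At each iteration the atom most correlated with the current residual
    is selected, its contribution is added to the coefficients, and it is
    subtracted from the residual. Returns (coefficients, residual).
    """
    residual = list(signal)
    coefficients = [0.0] * len(atoms)
    for _ in range(iterations):
        correlations = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(correlations[i]))
        coefficients[best] += correlations[best]
        residual = [r - correlations[best] * a
                    for r, a in zip(residual, atoms[best])]
    return coefficients, residual


# Hypothetical toy dictionary in R^3: the standard basis plus one diagonal atom.
s = 1.0 / math.sqrt(2.0)
atoms = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (s, s, 0.0)]
signal = (3.0, 3.0, 0.5)

for n in (1, 2):
    _, residual = matching_pursuit(signal, atoms, n)
    print(n, "iterations, residual norm:", math.sqrt(dot(residual, residual)))
```

The residual returned by this sketch plays the role of the "Residual (MP)" column; the "Masked components (PMP)" column has no counterpart here because no masking model is included.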

 

This page provides resources for the visualizations and the algorithms in the research manuscript:

"Representation of the inverse of a frame multiplier"
Peter Balazs and Diana T. Stoeva

Abstract

Certain mathematical objects appear in a lot of scientific disciplines, like physics, signal processing and, naturally, mathematics. In a general setting they can be described as frame multipliers, consisting of analysis, multiplication by a fixed sequence (called the symbol), and synthesis. They are not only interesting mathematical objects, but also important for applications, for example for the realization of time-varying filters. In this paper we show a surprising result about the inverse of such operators, if existing, as well as new results about a core concept of frame theory, dual frames. We show that for semi-normalized symbols, the inverse of any invertible frame multiplier can always be represented as a frame multiplier with dual frames and reciprocal symbol. Furthermore, one of those dual frames is uniquely defined and the other one can be arbitrarily chosen. We investigate sufficient conditions for the special case, when both dual frames can be chosen to be the canonical duals. In connection to the above, we show that the set of dual frames determines a frame uniquely. Furthermore, for a given frame, the union of all coefficients of its dual frames is dense in l2. We investigate invertible Gabor multipliers; we show that the inverse of every invertible lattice-invariant operator (in particular, every invertible Gabor frame multiplier with a constant symbol (1)) can be represented as a Gabor frame multiplier with a constant symbol (1). Finally we give a numerical example for the invertibility of multipliers in the Gabor case.

Figures:

Figure 1 in the manuscript (an illustrative example to visualize actions of a multiplier on a musical signal)

In Figure 1 we visualize a multiplier M_{m,Φ,Ψ} in the time-frequency plane. We consider a music signal f and the action of the multiplier M_{m,Φ,Ψ} on f. For f we use a 2-second excerpt of the song "Jump" by Van Halen (click here to listen to the signal). For the time-frequency representation of the musical signal f (TOP LEFT) we use a 'painless' Gabor frame Ψ (an 80 ms Hanning window with 12.5% overlap). By manual estimation, we determine the symbol m that should describe the time-frequency region of the singer's voice. The coefficients in this region are then multiplied by 0.01, the rest by 1 (TOP RIGHT) (see the symbol here). Finally, we show a time-frequency representation of the modified signal M_{m,Φ,Ψ} f (BOTTOM). To listen to the modified signal, click here.

(TOP LEFT) The time-frequency representation of the music signal f (TOP RIGHT) The symbol m, found by a (manual) estimation of the time-frequency region of the singer's voice.
(BOTTOM) Time-frequency representation of M_{m,Φ,Ψ} f.

Figure 2 in the manuscript (numerical visualization of inversion of multipliers)

Here we use the same signal f and the same multiplier M_{m,Φ,Ψ} as in Figure 1. Note that all elements of the symbol m satisfy m_{n,k} ∈ {1, 10^{-2}}. Since m is semi-normalized, the multiplier M is analytically invertible [1]. However, the operator is badly conditioned: its condition number is around 99. The signal f is approximately 2 seconds long, sampled at 44100 Hz. Thus, the signal is a 128148-dimensional vector.

Starting from

we compare two approaches numerically:

  • a 'naive' inversion

    (corresponding to the approach raised in Question [Q2] in the manuscript)

    To listen to the 'naive' inversion, click here.

  • the 'iterative' inversion

    To listen to the 'iterative' inversion, click here.

The naive approach clearly produces strong artifacts. The error is especially large at the boundaries of the constant regions of the symbol. The chosen atoms are well localized in time-frequency, so within the interior of the constant regions this inversion works well. This is to be expected, as we have shown in the manuscript that constant symbols allow this kind of inversion for equivalent frames.

The iterative inversion worked well, with an error of 3%. This could, naturally, be decreased by investing more computation time. Even in the chosen setting for the iterative inversion (100 iterations in iframemul [2]), no difference is visible in the time-frequency representation and no audible difference can be detected.

(TOP LEFT) The time-frequency representation of the result of the 'naive' inversion . (TOP RIGHT) The time-frequency representation of the error of the 'naive' inversion, i.e. .
(BOTTOM LEFT) The time-frequency representation of the iterative inversion . (BOTTOM RIGHT) The time-frequency representation of  the error of the iterative inversion .
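
The difference between the two approaches can be illustrated on a toy example. The sketch below replaces the Gabor frame by a three-vector frame in R^2 and assumes that the 'naive' inversion means a multiplier with the reciprocal symbol and the canonical dual frames; the manuscript's exact formula is not reproduced on this page, so this reading is an assumption. The iterative solver is a plain Richardson iteration chosen for brevity, not the method used by iframemul, and all function names are hypothetical.

```python
import math

# Toy setting: the three-vector "Mercedes-Benz" frame in R^2 (illustrative only).
ANGLES = [math.radians(a) for a in (90.0, 210.0, 330.0)]
PSI = [(math.cos(t), math.sin(t)) for t in ANGLES]        # analysis frame
PHI = [(2.0 * x / 3.0, 2.0 * y / 3.0) for x, y in PSI]    # synthesis frame (canonical dual of PSI)


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def multiplier(symbol, synthesis, analysis, f):
    """Apply M f = sum_k symbol[k] * <f, analysis[k]> * synthesis[k]."""
    out_x, out_y = 0.0, 0.0
    for m_k, s_k, a_k in zip(symbol, synthesis, analysis):
        c = m_k * dot(f, a_k)      # analysis, then multiplication by the symbol
        out_x += c * s_k[0]        # synthesis
        out_y += c * s_k[1]
    return (out_x, out_y)


def naive_inverse(symbol, g):
    """Reciprocal symbol with canonical dual frames (assumed 'naive' approach)."""
    reciprocal = [1.0 / m_k for m_k in symbol]
    # In this toy setting the canonical dual of PSI is PHI and that of PHI is PSI.
    return multiplier(reciprocal, PHI, PSI, g)


def iterative_inverse(symbol, g, iterations=100, step=1.0):
    """Richardson iteration f <- f + step * (g - M f)."""
    f = (0.0, 0.0)
    for _ in range(iterations):
        mf = multiplier(symbol, PHI, PSI, f)
        f = (f[0] + step * (g[0] - mf[0]), f[1] + step * (g[1] - mf[1]))
    return f


def relative_error(estimate, reference):
    diff = (estimate[0] - reference[0], estimate[1] - reference[1])
    return math.sqrt(dot(diff, diff) / dot(reference, reference))


f = (1.0, 2.0)
symbol = (1.0, 1.0, 0.01)          # semi-normalized, values in {1, 10^-2}
g = multiplier(symbol, PHI, PSI, f)
print("naive relative error:    ", relative_error(naive_inverse(symbol, g), f))
print("iterative relative error:", relative_error(iterative_inverse(symbol, g), f))
```

In this toy case the naive inversion is exact for a constant symbol but fails for the non-constant one, while the iteration recovers f. The step size of 1 is adequate here only because the eigenvalues of this particular operator lie between 0 and 2; a general operator needs a step chosen from its norm.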

Algorithms:

The above visualizations were produced using algorithms in the Matlab/Octave toolbox Linear Time-Frequency Analysis (LTFAT) [2] (version 1.4.0 and above). To run the script provided below, you first need to install the Matlab/Octave toolbox LTFAT, freely available at Sourceforge.

The Matlab script for producing Figures 1 and 2 is available for download here. To run the script, you need to have the following two files in the same folder:

Running the script produces Figures 1 and 2.

References:

  • [1] D. T. Stoeva and P. Balazs, "Invertibility of multipliers", Applied and Computational Harmonic Analysis, vol. 33 (2012), 292-299.
  • [2] P. Søndergaard, B. Torrésani, and P. Balazs, "The linear time frequency analysis toolbox", International Journal of Wavelets, Multiresolution and Information Processing, vol. 10, no. 4, pp. 1250032, 2012.

This page provides resources and complementary results for the research manuscript:

"Adapted and adaptive linear time-frequency representations: a synthesis point of view"

P. Balazs, M. Dörfler, M. Kowalski and B. Torrésani

accepted for the special issue 'Time-Frequency Analysis and Applications' of the IEEE Signal Processing Magazine.

Abstract:

In this paper, we give an overview of linear time-frequency representations, focusing mainly on two fundamental aspects. The first one is the introduction of flexibility, more precisely the construction of time-frequency waveform systems that can be adapted to specific signals, or specific signal processing problems. To do this, we base the constructions on frame theory, which allows a lot of options, while still ensuring perfect reconstruction. The second aspect is the choice of the synthesis framework rather than the usual analysis framework. Instead of the correlation of the signal with the chosen waveforms, i.e. the inner product with them, we look at how the signals can be constructed using those waveforms, i.e. we find the coefficients in their linear combination. We show how this point of view allows the easy introduction of prior information into the representation. We give an overview of methods for transform domain modeling, in particular those based on sparsity and structured sparsity. Finally we present an illustrative application for these concepts: a denoising scheme.

 

Figures:

  • Figure 1: Glockenspiel-Signal. Gabor representations with short window (2.9 ms), resp. long window (92.9 ms).

    NSGT

  • Figure 2: Glockenspiel-Signal: Nonstationary Gabor representation with time-adaptivity (upper plot) and frequency-adaptivity, corresponding to constant-Q scale (lower plot).

    NSGT

    CQT

  • Figure 3: Time-frequency representations of each denoised layer (left: transients, right: tonals) for the three methods. From top to bottom: Gabor-GL + Gabor-L, Gabor-GL + Gabor-WGL, Gabor-GL + CQ-Gabor-L.

    transient GLL

    tonal GLL

    transient GLWGL

    tonal GLL

    transient GLLCQ

    tonal GLL

 

Codes:

All files are collected in a ZIP file.

For this code several other packages are needed:

  • The LTFAT toolbox, which provides an implementation of the Gabor analysis and synthesis operations, with a C backend for efficiency.
  • The StrucAudioToolbox, which provides an implementation of various thresholding operators, in particular the social-sparsity operators.
  • The NSGT (NonStationary Gabor Transform) toolbox, which provides the implementation, based on the LTFAT toolbox, of the non-stationary Gabor frames, in particular the constant-Q transform.

Acknowledgments: P. Balazs is supported by the Austrian Science Fund (FWF) START-project FLAME ('Frames and Linear Operators for Acoustical Modeling and Parameter Estimation'; Y 551-N13); M. Dörfler is supported by the WWTF project Audiominer (MA09-24); B. Torrésani is supported by the European project UNLocX, grant number 255931, and by the ANR project Metason ANR-10-CORD-010; M. Kowalski benefited from the support of the "FMJH Program Gaspard Monge in optimization and operation research", and from the support to this program from EDF.

Mathematics and Signal Processing for Acoustics

A mathematical background is important and useful for all physical and engineering sciences. The connection between applied and mathematical research often leads to progress in both directions, due to natural synergy effects. The Acoustics Research Institute considers the investigation of the mathematical background of its numerous research projects, most prominently the signal processing aspects, an important part of acoustic research.

Application-oriented mathematics develops theoretical results motivated by applications, in contrast to “applied mathematics”, which focuses on tools for the applied sciences. The application-oriented approach provides results significant both for the applied sciences and for theoretical mathematics. The importance of application-oriented mathematics was acknowledged by the Vienna Science and Technology Fund (WWTF), which set up a specific research programme titled 'Mathematics and …', and it is a current research focus of both the Academy of Sciences and the city of Vienna.

Complex experimental designs generate empirical data and often lead to heuristic models with a modest mathematical basis. Mathematically precise statements considerably enhance the precision and stability of established algorithms and can be incorporated at an early stage of model generation. Mathematics therefore supports software development in the modelling stage as well as in the implementation stage (stability, precision).

The Acoustics Research Institute has strengthened its research in this area in recent years, and will continue to do so. The following goals are set:

  • Fundamental research in mathematical theory
  • Application in psychoacoustical, phonetical and acoustical models
  • Development of efficient algorithms
  • Particular focus on international, national and internal cooperation

The cooperation of the group 'Mathematics and Signal Processing' with the other groups of the Institute has proven very fruitful for all partners and will be further strengthened. The other groups obtain theoretically well-founded methods to solve their problems, while the mathematicians can address questions that are relevant for applications yet still interesting in theory. This dialog greatly increases the understanding of other fields. It enabled the successful application for the START-project 'FLAME: Frames and Linear Operators for Acoustical Modeling and Parameter Estimation' in 2011.

Staff

Further Information

Youtube Talks:

Nicki Holighaus - Time-Frequency Frames and Applications to Audio Analysis - Part 1
Peter Balazs - February Fourier Talks 2014

Hans G. Feichtinger - Mathematical and Numerical Aspects of Frame Theory - Part 1 (showing the institute's own software STx!)
Georg Tauböck - WWTF Project INSIGHT

 

This page provides resources and complementary results for the research article:

"The ERBlet Transform: An Auditory-Based Time-Frequency Representation with Perfect Reconstruction"

T. Necciari, P. Balazs, N. Holighaus, and P. Søndergaard

presented at the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP2013). A PDF version of the article is available here for download.

Abstract: This paper describes a method for obtaining a perceptually motivated and perfectly invertible time-frequency representation of a sound signal. Based on frame theory and the recent non-stationary Gabor transform, a linear representation with resolution evolving across frequency is formulated and implemented as a non-uniform filterbank. To match the human auditory time-frequency resolution, the transform uses Gaussian windows equidistantly spaced on the psychoacoustic "ERB" frequency scale. Additionally, the transform features adaptable resolution and redundancy. Simulations showed that perfect reconstruction can be achieved using fast iterative methods and preconditioning even using one filter per ERB and a very low redundancy (1.08). Comparison with a linear gammatone filterbank showed that the ERBlet approximates well the auditory time-frequency resolution.
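
The idea of placing filters equidistantly on the ERB scale can be sketched as follows. The sketch assumes the commonly used Glasberg and Moore ERB-rate formula, ERBnum(f) = 21.4 log10(1 + 0.00437 f) with f in Hz; the function names are hypothetical and the code only computes centre frequencies. It does not reproduce the Gaussian windows or the band-edge handling of the ERBlet scripts, so the number of centre frequencies it returns may differ from the filter counts reported on this page.

```python
import math


def hz_to_erb(f_hz):
    """ERB-rate (in ERB units) of a frequency in Hz (Glasberg and Moore form)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_to_hz(erb):
    """Inverse of hz_to_erb."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437


def erb_spaced_centers(f_min_hz, f_max_hz, filters_per_erb=1.0):
    """Centre frequencies spaced 1/filters_per_erb ERB units apart.

    Starts at f_min_hz and stops at the last grid point not exceeding f_max_hz.
    """
    lo = hz_to_erb(f_min_hz)
    hi = hz_to_erb(f_max_hz)
    count = int(math.floor((hi - lo) * filters_per_erb)) + 1
    return [erb_to_hz(lo + k / filters_per_erb) for k in range(count)]


centers = erb_spaced_centers(0.0, 8000.0, filters_per_erb=1.0)
for k, fc in enumerate(centers[:5]):
    print(k, round(fc, 1), "Hz")
```

With filters_per_erb = 1 this corresponds to the setting V = 1 filter/ERB listed in the complementary results; the spacing in Hz grows with frequency, which is what gives the representation its frequency-dependent resolution.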

  • Complementary results:
    • ERBlet windows representation: K = 35 ERBlet filters computed for the frequency range 0-8 kHz using V = 1 filter/ERB.
ERBlet analysis windows  ERBlet dual windows
    • Comparison between ERBlet and other representations. Simulations were performed on a 5-sec musical excerpt from the band Manowar (song "Heart of Steel", studio version) in mono format, sampled at 44.1 kHz, 16 bits/sample. All analyses considered the frequency band 0-22.05 kHz.
      ERBlet transform vs. standard Gabor transform:
        ERBlet transform: redundancy = 12, relative reconstruction error < 10^{-15}.
        Standard Gabor transform: redundancy = 11.80, relative reconstruction error < 10^{-15}.
      ERBlet transform vs. constant-Q transform:
        ERBlet transform: redundancy = 12, relative reconstruction error < 10^{-15}.
        Constant-Q transform: redundancy = 12, relative reconstruction error < 10^{-15}. Implementation in [1].
      ERBlet transform vs. linear gammatone filterbank:
        ERBlet transform: redundancy = 12, relative reconstruction error < 10^{-15}.
        Linear gammatone filterbank: redundancy = 128, relative reconstruction error = 1.4 for a delay of 4 ms and no post-processing correction of the filterbank delay. Accounting for the filterbank delay at the output of the re-synthesizer module led to relative reconstruction errors of 4.11 x 10^{-1}, 1.01 x 10^{-1} and 2.86 x 10^{-3} for delays of 4, 8 and 16 ms, respectively. Implementation in [2].

  • Algorithms (pseudo-code):
Iterative reconstruction algorithm
 NSGT analysis algorithm
 NSGT synthesis algorithm
  • Matlab/Octave scripts available for download. Archive content:
    • Scripts for computing the ERBlet transform and its inverse. Includes the iterative reconstruction using the conjugate gradients method (Algorithm 1 above).
    • Scripts for generating the figures 1 and 2 presented in the manuscript.
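
As a rough illustration of iterative reconstruction with conjugate gradients, the sketch below recovers a vector from its frame coefficients by solving S f = b, where S is the frame operator and b is the synthesis of the coefficients with the analysis frame itself. It is a generic, unpreconditioned conjugate gradient loop on a toy frame in R^2, not the algorithm from the archive; the frame and all function names are assumptions made for the example.

```python
import math

# Hypothetical non-tight frame for R^2 (three vectors).
FRAME = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def analysis(f):
    """Frame coefficients c_k = <f, g_k>."""
    return [dot(f, g) for g in FRAME]


def synthesis(c):
    """Sum of c_k * g_k (adjoint of the analysis operator)."""
    return [sum(ck * g[i] for ck, g in zip(c, FRAME)) for i in range(2)]


def frame_operator(f):
    """S f = synthesis(analysis(f)); symmetric positive definite for a frame."""
    return synthesis(analysis(f))


def conjugate_gradient(apply_op, b, iterations=50, tol=1e-12):
    """Solve apply_op(x) = b for a symmetric positive definite operator."""
    x = [0.0] * len(b)
    r = list(b)                    # residual b - A x for x = 0
    p = list(r)                    # first search direction
    rs = dot(r, r)
    for _ in range(iterations):
        if math.sqrt(rs) < tol:    # converged (also covers b = 0)
            break
        ap = apply_op(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


f = [0.5, -2.0]
coefficients = analysis(f)
reconstruction = conjugate_gradient(frame_operator, synthesis(coefficients))
print("reconstruction:", reconstruction)
```

The appeal of this approach is that only the analysis and synthesis operations are needed; no dual frame has to be computed explicitly.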

IMPORTANT NOTE: The Matlab/Octave toolboxes Linear Time-Frequency Analysis (LTFAT, version 1.2.0 and above) [3] and Auditory Modeling (AM) must be installed to run the ERBlet code. These toolboxes are freely available at Sourceforge.

  • References:
    • [1] G. A. Velasco, N. Holighaus, M. Dörfler, and T. Grill, "Constructing an invertible constant-Q transform with nonstationary Gabor frames", in Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), Paris, France, September 19-23 2011, pp. 93–99.
    • [2] V. Hohmann, "Frequency analysis and synthesis using a gammatone filterbank", Acta Acust. united Ac., vol. 88, no. 3, pp. 433–442, 2002.
    • [3] P. L. Søndergaard, B. Torrésani, and P. Balazs, "The linear time-frequency analysis toolbox", Int. J. Wavelets. Multi., vol. 10, no. 4, pp. 1250032, July 2012.

This page provides the sound files corresponding to the results of the irrelevance time-scale filter reported in Necciari et al. "Perceptual optimization of audio representations based on time-frequency masking data for maximally-compact stimuli", presented at the AES 45th conference on Applications of Time-Frequency Processing in Audio, Helsinki, Finland, 2012 March 1-4.

  • Result 1: x_1(t); sum of two Gaussians with frequency shift (ΔF) of +4 ERB units and time shift (ΔT) of 10 ms. Masker SPL (L_m) = 80 dB, target SPL (L_t) = 50 dB; target not masked.
  • Result 2: x_2(t); sum of two Gaussians with ΔF = -2 ERB units and ΔT = 5 ms. L_m = 80 dB, L_t = 15 dB; target masked.
  • Result 3: x_3(t); musical sound. Clarinet playing the note A3.

This webpage is linked to the paper

  • P. Balazs, D. Bayer and F. Jaillet, "The Structure of the Phase Around Zeros of the STFT", in preparation.

Regarding Section 2.1, 'Phase vs. Amplitude Reconstruction in the STFT'

Audio Files:
