Cepstrum and MFCCs

Mel Scale

The Mel scale is designed to reflect human perception of pitch. (The word "Mel" comes from "melody".) Equal distances on the Mel scale correspond to approximately equal perceived differences in pitch. The scale is roughly linear below about 1 kHz and roughly logarithmic above it, so compared with a purely logarithmic scale it compresses low frequencies, where listeners do not resolve pitch intervals as finely as a logarithmic scale would imply. Below is the formula for converting Hertz into Mels; the logarithm is base 10.

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
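
As a quick check of the conversion, here is a minimal Python sketch. It assumes the logarithm is base 10, which is the form that goes with the constant 2595. The function names `hz_to_mel` and `mel_to_hz` are illustrative, and the inverse is obtained by solving the formula for f.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hertz to Mels (base-10 logarithm form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Invert hz_to_mel by solving the formula for the frequency."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Print a few conversions to show how the spacing changes with frequency.
for f in (100, 500, 1000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} Mel")
```

With these constants, 1000 Hz maps to almost exactly 1000 Mels, and doubling the frequency from 4000 Hz to 8000 Hz adds far fewer Mels than 4000.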

The default scale used by Audacity for spectrograms is the Mel scale.

Cepstrum

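The cepstrum is the inverse Fourier transform of the logarithm of the power spectrum of a signal. By transforming the frequency domain representation of a signal into what is called the "quefrency" domain, the cepstrum separates the slowly varying components of the log spectrum (such as the formant structure, which shapes the spectral envelope) from the fast-varying ones (such as the harmonic ripple produced by pitch). The slowly varying components end up at low quefrencies, and the pitch appears as a peak at a quefrency equal to the pitch period. The cepstrum is defined by the following formula.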

$$\text{Cepstrum}(x[n]) = \text{IFFT}\left(\log\left(|\text{FFT}(x[n])|^2\right)\right)$$

In the above equation,

  • $\text{FFT}(x[n])$ is the Fast Fourier Transform of the input signal $x[n]$,

  • $|\text{FFT}(x[n])|^2$ is the power spectrum of the signal (the squared magnitude spectrum),

  • $\log()$ represents the logarithm operation, usually taken as the natural logarithm,

  • $\text{IFFT}()$ is the inverse Fast Fourier Transform.
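
The formula translates almost directly into NumPy. The sketch below is illustrative rather than a reference implementation: the function name `power_cepstrum`, the small `eps` floor that guards against `log(0)`, and the synthetic test signal (an impulse train with a 10 ms period at an assumed 8 kHz sample rate) are all assumptions made for the example.

```python
import numpy as np


def power_cepstrum(x, eps=1e-12):
    """Inverse FFT of the log power spectrum of x."""
    spectrum = np.fft.fft(x)
    # eps is an assumed floor that keeps log() finite where the power is zero.
    log_power = np.log(np.abs(spectrum) ** 2 + eps)
    # For a real input the log power spectrum is symmetric, so the inverse
    # transform is real up to rounding error; keep the real part only.
    return np.fft.ifft(log_power).real


sample_rate = 8000          # Hz (assumed for the example)
period = 80                 # samples, i.e. 10 ms, a 100 Hz "pitch"
x = np.zeros(2000)
x[::period] = 1.0           # idealized pitch pulses

c = power_cepstrum(x)

# Look for the pitch peak between 2.5 ms and 20 ms of quefrency.
lo, hi = 20, 160
peak = lo + int(np.argmax(c[lo:hi]))
print(f"pitch peak at {peak} samples ({1000 * peak / sample_rate:.1f} ms)")
# prints: pitch peak at 80 samples (10.0 ms)
```

The harmonics of the pulse train repeat every 100 Hz in the log spectrum, so the cepstrum shows a peak at a quefrency of 10 ms, the pitch period. Real speech would normally be windowed first and would give a less clean peak.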

Mel-Frequency Cepstral Coefficients

Mel-Frequency Cepstral Coefficients (MFCCs) are a small set of values that summarize the spectral envelope of each short frame of audio, giving a compact representation for speech processing. The calculation of MFCCs involves several steps.

  1. The signal is divided into small overlapping frames, typically of 20 to 40 milliseconds duration. This is because speech signals are non-stationary, but can be assumed quasi-stationary over short time windows.

  2. Each frame is multiplied by a window function, usually a Hamming window, to reduce spectral leakage in the FFT step.

  3. The FFT is applied to each windowed frame to transform it from the time domain into the frequency domain. The squared magnitude of the result gives the power spectrum of the frame.

  4. A bank of triangular filters, spaced according to the Mel scale, is applied to the power spectrum of each frame. Each filter sums the energy around a center frequency, and the center frequencies are equidistant on the Mel frequency axis (e.g. 500 Mels, 1000 Mels, 1500 Mels).

  5. The logarithm of each Mel filter bank energy is taken. The logarithm compresses the dynamic range of the signal, similar to human perception of loudness.

  6. Finally, the Discrete Cosine Transform (DCT) is applied to the log Mel spectrum to produce a set of coefficients. These coefficients represent the amplitude of different cepstral components. Typically, the first 12 to 13 coefficients are retained, representing the most important features of the signal.
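
The six steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the parameter values (25 ms frames, 10 ms hop, 26 filters, 13 coefficients), the helper names, the `1e-10` floor before the logarithm, and the unnormalized DCT-II are all choices made for the example. Audio libraries often add pre-emphasis, a different DCT scaling, or liftering, so their numbers will differ.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges and centers are equally spaced in Mels."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct2(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T


def mfcc(signal, sample_rate, frame_ms=25, hop_ms=10, n_filters=26, n_coeffs=13):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Step 1: slice the signal into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # Step 2: apply a Hamming window to each frame.
    frames = frames * np.hamming(frame_len)
    # Step 3: FFT of each frame, then squared magnitude (power spectrum).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 4: sum the power under each triangular Mel filter.
    energies = power @ mel_filterbank(n_filters, frame_len, sample_rate).T
    # Step 5: log of the filter bank energies (small floor avoids log(0)).
    log_energies = np.log(energies + 1e-10)
    # Step 6: DCT across the filter axis, keeping the first coefficients.
    return dct2(log_energies, n_coeffs)


# Placeholder input: one second of seeded noise at an assumed 16 kHz rate.
rng = np.random.default_rng(0)
sample_rate = 16000
signal = rng.standard_normal(sample_rate)
features = mfcc(signal, sample_rate)
print(features.shape)  # one row per frame, one column per coefficient
```

Each row of `features` describes one frame, so a one-second clip becomes a short sequence of 13-value vectors instead of 16000 raw samples.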

Applications of MFCCs

MFCCs were once a widely used feature extraction technique for machine learning applications. However, with more powerful computers and more accessible data sources, we can feed entire spectrograms into machine learning models without first reducing the audio to MFCCs. MFCCs are still sometimes used as input features for machine learning models, so you might see them in the wild from time to time.

Copyright © 2024 Audio Internals