How to compute convolution fast for realtime applications?
The Convolution Series
 Definition of convolution and intuition behind it
 Mathematical properties of convolution
 Convolution property of Fourier, Laplace, and ztransforms
 Identity element of the convolution
 Star notation of the convolution
 Circular vs. linear convolution
 Fast convolution
 Convolution vs. correlation
 Convolution in MATLAB, NumPy, and SciPy
 Deconvolution: Inverse convolution
 Convolution in probability: Sum of independent random variables
Introduction
Sound engineering and Virtual Reality audio demand realtime performance. Convolutions are inherently embedded in their inner workings, e.g., in the form of finite impulse response (FIR) filtering for artificial reverberation. Convolution evaluated according to the definition is too slow to keep up and introduces audible latency. To mitigate this we reach out to the fast convolution methods.
But speed itself is not enough. In the abovementioned applications we need to process sound in a blockwise fashion. Methods allowing this are called partitioned convolution techniques.
In this article, we first show why the naive approach to the convolution is inefficient, then show the FFTbased fast convolution. What follows is a description of two of the most popular blockbased convolution methods: overlapadd and overlapsave. Finally, we show how to deal with very long filters (over 1 second long) when using partitioned convolution.
Note: An amazing source about fast convolution techniques is [1]. I highly encourage you to check it out especially if you would like to read more on the topic.
Notation
In software, the signals are always of finite length. Thus, we will denote a signal $x[n]$ of length $N_x$ by a fixedsize vector $\pmb{x} = [x[0], x[1], \dots, x[N_x1]]$.
Naive implementation of convolution
We could implement the convolution between discretetime signals $x$ and $h$ by implementing the definition directly [1]
$y[n] = x[n] \ast h[n] = \sum_{k=\max\{0,nN_h+1\}}^{\min\{n, N_x1\}} x[k] h[n  k], \\ \quad n \in \{0, \dots, N_x + N_h  1\}. \quad (1)$
This translates to the following code
Listing 1. Naive linear convolution.
FFTbased implementation
As we know from the article on circular convolution, multiplication in the discretefrequency domain is equivalent to circular convolution in the discretetime domain. Elementwise multiplication of 2 vectors has time complexity $O(N)$. This is superior to $O(N^2)$ in case of the naive approach. The bottleneck of frequencybased convolution is the transformation to that domain, but it can be achieved in $O(N \log N)$ time, which is still better than $O(N^2)$. And so arises the FFTbased fast convolution (Figure 1).
Figure 1. FFTbased fast convolution. Source: [1].
In order to make circular convolution correspond to linear convolution we need to pad the input signals with sufficiently many zeros so that the frequencydomain representation has enough capacity to represent the result of the linear convolution. This means, that each signal should be extended to have length $K = Nx + Nh  1$.
We can obtain it using function pad_zeros_to()
.
Listing 2.
If our signals are sufficiently long we can compute their discrete Fourier transforms (DFTs) using the Fast Fourier Transform (FFT) algorithm. Thanks to the FFT, the transformation from the time domain to the frequency domain can be computed in $O(N \log N)$ time.
In the frequency domain we can multiply our signals elementwise and then come back to the time domain via inverse FFT.
FFT is known to be most computationally efficient if the transformed signal’s length is a power of 2. We can compute the exponent of the next largest power of 2 with respect to an integer $n > 0$ using the formula
$q_{\text{next}} = \log2(n  1) + 1. \quad (2)$
The above equation allows us to implement another helper function
Listing 3.
Wrapping it all together
Listing 4. FFTbased fast convolution.
Remember that the above algorithm is fast algorithmically. I am not claiming this code is maximally optimized 😉. It is provided for understanding and as a possible baseline implementation.
Blockbased convolution
FFTbased fast convolution is sufficiently fast for offline computation, but musical and virtual reality applications require realtime performance. In these applications, sound is usually processed in blocks of length 64, 128, 256, or more samples. If the signals to be convolved are longer than the block length, we need to adapt our methods so that we can convolve and output only partial results and handle incoming input.
In this section, we will assume that the input signal is an infinite stream (e.g., a looped source replaying a sound file) that comes in blocks $\pmb{x}$ of $B$ samples, and $\pmb{h}$ is a FIR filter of length $N$. Such a setup is called unified input partitioning because the input comes in blocks and the blocks are of equal size.
OverlapAdd Scheme
The first idea to process the input in blocks is to convolve each incoming block with the full filter using FFTbased convolution, store the results of appropriately many past convolutions, and output sums of their subsequent parts in blocks of the same length as the input signal we process. This scheme can be seen in Figure 2.
Figure 2. OverlapAdd convolution scheme. Source: [1].
Some important remarks concerning this methodology:

$K$ (transform length) must be sufficiently large to ensure that the circular convolution is equivalent to the linear convolution (no samples are timealiased). Lengthening the convolved signals is achieved through zeropadding.

OverlapAdd operation to form the output block is much more difficult than it may seem from the scheme. One needs to store all the convolution results and then sum appropriate indices. This may incur high memory and time cost.
Implementation
Listing 5. OverlapAdd convolution.
OverlapSave Scheme
A significant drawback of the OverlapAdd scheme is the necessity to store and sum the computed partial convolutions. Can we do better?
Another approach is called OverlapSave. It is based on storing appropriately many input blocks rather than output blocks, which are shorter than the calculated convolutions. The input blocks are stored in a buffer of length $K$ in a FirstInFirstOut manner, i.e., each new block of input samples shifts all previously stored samples in the buffer, discarding the oldest $B$ samples. We then perform a $K$point FFTbased convolution with the zeropadded filter coefficients.
Inherently, the result of this operation is a circular convolution (because the input signal is not zeropadded). It means that $K$ has to be sufficiently large so as to have the last $B$ samples correspond to the linear convolution. We can then output these $B$ valid samples and discard the rest. This results in a potentially longer transform length than in the OverlapAdd scheme, but removes the need to store and sum the outputs.
The entire scheme can be seen in Figure 3.
Figure 3. OverlapSave convolution scheme. Source: [1].
Implementation
Listing 6. OverlapSave convolution.
What If the Filter Is Long Too?
If the filter we want to convolve the input with has also significant length (e.g., when it represents an impulse response of a large hall with a significant reverberation time), we may have to partition both, the input and the filter. These methods are called partitioned convolution.
Describing partitioned convolution algorithms is beyond the scope of this article, but the main idea is presented in Figure 4. It shows a nonuniformly partitioned convolution scheme, which is generally faster than uniformly partitioned versions [1]. It also involves the use of frequencydomain delay lines (FDLs), which allow computing the FFT of the input blocks just once (in contrast to timedelaying the input blocks and calculating their FFTs for each new output block).
Nonuniformly partitioned convolution is the current state of the art in artificial reverberation using FIR filters for realtime auralization in games and Virtual Reality.
Figure 4. Nonuniformly partitioned convolution scheme. Note the presence of frequencydomain delay lines. Source: [1].
Summary
In this article, we have reviewed the most important convolution algorithms:
 naive linear convolution of fixedlength signals,
 FFTbased convolution of fixedlength signals,
 OverlapAdd and OverlapSave blockbased convolution schemes with unified input partitioning, where the input comes in blocks and the filter is of finite, short length, and
 Nonuniformly partitioned convolution where the input comes in blocks and the filter is very long.
With an understanding of these concepts you can rush off to code an unforgettable sonic Virtual Reality experience!
Bibliography
[1] Frank Wefers Partitioned convolution algorithms for realtime auralization, PhD Thesis, Zugl.: Aachen, Techn. Hochsch., 2015.
Comments powered by Talkyard.