Let’s perform auto-tuning of a voice recording in Python!
Introduction
I love singing. However, I am not the best singer out there.
Frankly, I am not a good singer at all.
But hey, I’m a programmer, so maybe I can write some code to fix my singing?
Can I auto-tune my voice with code? Let’s find out!
Table of Contents
- Introduction
- I wish I could sing this cleanly…
- Solution: Auto-tune to the rescue!
- How to implement the auto-tune?
- How to implement pitch tracking?
- How to shift the pitch of a vocals recording?
- How can we calculate the “correct” pitch?
- Show me the code!
- Auto-tuning: Before and After
- Summary
- Bibliography
I wish I could sing this cleanly…
Out of many songs that I like, one that I found quite interesting to sing is “Skyfall” by Adele.
Actually, I even took some lessons to be able to sing it well.
How well can I sing it? You can judge for yourself:
It could be worse, right? 🙂
Now, how to make it better?
(FYI, the backing track comes from YouTube).
Solution: Auto-tune to the rescue!
One of the most difficult tasks for singers is singing in tune.
Today, digital signal processing (DSP) and Python will help us achieve just that!
What would we have to do to correct the pitch (the fundamental frequency) of any vocals recording?
- First, we would have to find out what the pitch actually is at a given point in time. In DSP, this problem is called pitch tracking.
- Second, we would need to choose our pitch adjustment strategy, i.e., how we would map the actual pitch to the desired pitch. Should we use a musical score? Various options are available here.
- Third and final, we should adjust the pitch of the recording according to the calculated desired pitch. This step is called pitch shifting.
The whole procedure is called pitch correction in the digital audio effects community, but more broadly it is known as auto-tune or the Cher effect.
The above-described steps are shown as a diagram in Figure 1.
Figure 1. A sample DSP diagram of the auto-tune effect. Special thanks to Marie Tricaud for the tips on creating such diagrams.
So how to implement it?
How to implement the auto-tune?
To implement a simple auto-tune command-line tool, we will use Python because most of the DSP algorithms are readily available as Python packages 😉
How to implement pitch tracking?
If you google for pitch tracking methods, you’ll probably find the PYIN algorithm among the results [1].
It is the current state of the art when it comes to pitch tracking (apart from possibly some deep learning-based approaches).
PYIN is based on the YIN algorithm [2]. The YIN algorithm estimates the pitch from the time-domain signal by computing the autocorrelation function and then refining the result. PYIN extends this approach by applying a hidden Markov model to the outputs of the YIN algorithm.
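To give you a feel for the core idea (you won't need this to follow the rest of the tutorial), here is a minimal NumPy sketch of YIN's main steps. The function name, the default parameter values, and the simplifications (no parabolic interpolation, no guards for silent frames, and none of PYIN's probabilistic layer) are my own:

import numpy as np

def yin_pitch(frame, sr, fmin=60.0, fmax=600.0, threshold=0.1):
    """A stripped-down sketch of YIN for a single analysis frame."""
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)

    # Difference function: how much the frame differs from a copy
    # of itself shifted by each candidate lag tau.
    d = np.array([
        np.sum((frame[:-tau] - frame[tau:]) ** 2)
        for tau in range(1, tau_max + 1)
    ])

    # Cumulative mean normalized difference function:
    # d'(tau) = d(tau) * tau / sum(d(1)..d(tau)).
    cmnd = d * np.arange(1, tau_max + 1) / np.cumsum(d)

    # Pick the first lag below the threshold within the valid range.
    for tau in range(tau_min, tau_max + 1):
        if cmnd[tau - 1] < threshold:
            return sr / tau  # estimated fundamental frequency in Hz
    return np.nan  # no clear periodicity: treat the frame as unvoiced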
Although it’s a fascinating subject, you don’t need to understand the PYIN algorithm to use it.
Luckily, there is a pyin function in the librosa library. (If you are looking for music analysis algorithms, this library rocks).
We will use it in the following way:
Listing 1.
f0, voiced_flag, voiced_probabilities = librosa.pyin(audio,
                                                     frame_length=frame_length,
                                                     hop_length=hop_length,
                                                     sr=sr,
                                                     fmin=fmin,
                                                     fmax=fmax)
Given a monophonic audio signal, the length of the analysis frames frame_length, the distance between consecutive frames hop_length, the sampling rate sr, and the minimum and maximum viable frequencies fmin and fmax, we obtain the estimated pitch for each frame f0, the information whether the given frame was voiced or not voiced_flag, and the probability of each frame being voiced voiced_probabilities.
We can use the information in the f0 vector to correct the pitch in the voiced frames as indicated by voiced_flag. Alternatively, we can rely on the fact that f0 has not-a-number (NaN) values at the unvoiced frames.
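Both variants look like this in code (correct_pitch is a hypothetical placeholder for whichever pitch adjustment strategy we pick later):

import numpy as np

# Variant 1: correct only the frames that PYIN marked as voiced.
corrected_f0 = f0.copy()
corrected_f0[voiced_flag] = correct_pitch(f0[voiced_flag])

# Variant 2: equivalently, rely on the NaNs that pyin places
# at the unvoiced frames.
voiced = ~np.isnan(f0)
corrected_f0[voiced] = correct_pitch(f0[voiced])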
How to shift the pitch of a vocals recording?
Let’s again google for approaches to pitch shifting.
And again we’ll find the state of the art: the PSOLA algorithm [3].
PSOLA stands for Pitch-Synchronous Overlap-and-Add.
As a short recap:
- Overlap-and-add (OLA) techniques allow us to change the duration of a signal without changing its pitch (i.e., perform a time-scale modification, TSM) by dividing the signal into a series of overlapping frames and then reassembling those frames with a different spacing between them.
- Resampling allows us to change the duration and the pitch of the signal together (see my tutorial on the variable speed replay effect which uses this approach).
- Combining an OLA technique with resampling allows us to change the pitch of a signal without changing its duration; the OLA part counteracts the time-scale modification inherent in the resampling part, as sketched below.
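As a rough illustration of that last combination, here is a sketch using librosa. Note that it uses librosa's phase-vocoder time stretch rather than PSOLA, so it demonstrates the idea, not the exact algorithm:

import librosa

def pitch_shift_semitones(audio, sr, n_steps):
    """Shift the pitch by n_steps semitones, keeping the duration."""
    # A shift of n_steps semitones corresponds to this frequency ratio.
    rate = 2.0 ** (n_steps / 12.0)

    # Resampling to sr / rate and playing back at sr raises the pitch
    # by `rate` but also shortens the signal by the same factor...
    # (rounding target_sr introduces a tiny error; fine for a sketch)
    shifted = librosa.resample(audio, orig_sr=sr, target_sr=int(sr / rate))

    # ...so we time-stretch it back to (approximately) the original duration.
    return librosa.effects.time_stretch(shifted, rate=1.0 / rate)

This is essentially what librosa.effects.pitch_shift does under the hood; PSOLA swaps the phase vocoder for a pitch-synchronous overlap-and-add, which tends to sound more natural on monophonic material like a voice.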
Is PSOLA readily available in some Python package?
Of course it is!
We can simply use the vocode function from the psola package.
Listing 2.
pitch_shifted_signal = psola.vocode(audio,
                                    sample_rate=int(sr),
                                    target_pitch=corrected_f0,
                                    fmin=fmin,
                                    fmax=fmax)
In this line of code, audio is the vocals recording, sample_rate is self-explanatory, target_pitch contains the values of the corrected pitch, and fmin and fmax are the minimum and maximum target frequencies we'll pitch-shift to.
The passed-in vector of desired fundamental frequency values can be of any length; the psola library will simply space these values evenly throughout the signal duration. For example, if we pass in a signal with 4000 samples and target_pitch equal to [440, 880], the first 2000 samples will be pitch-shifted to 440 Hz and the remaining 2000 samples to 880 Hz.
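To make that concrete, here's a toy version of exactly that example; the sine-tone input and the parameter values are my own, while the vocode call mirrors Listing 2:

import numpy as np
import psola

sr = 44100
t = np.arange(4000) / sr
audio = np.sin(2 * np.pi * 220.0 * t)  # a 220 Hz tone as a stand-in "voice"

# The two target values are spread evenly over the 4000 samples:
# the first half is shifted towards 440 Hz, the second towards 880 Hz.
shifted = psola.vocode(audio,
                       sample_rate=sr,
                       target_pitch=np.array([440.0, 880.0]),
                       fmin=80.0,
                       fmax=1000.0)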
How can we calculate the “correct” pitch?
There are various approaches we could take to adjust the pitch. For example, we could
- round to the nearest MIDI note,
- round to the nearest note of the song’s scale, i.e., round to a note from the C major scale if the song is in C major, or
- use the sheet music to match the vocals to the original score.
We could mix and match the above techniques with other tweaks like introducing the dry/wet parameter.
Since “Skyfall” is in C minor, I have decided to go with the second approach 🙂
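Here is a sketch of that second strategy; the helper name round_to_scale and the octave wrap-around handling are mine, but the librosa calls are standard:

import numpy as np
import librosa

def round_to_scale(f0, scale='C:min'):
    """Round each voiced pitch estimate to the nearest note of `scale`."""
    # Pitch classes of the scale; for C minor: [0, 2, 3, 5, 7, 8, 10].
    degrees = librosa.key_to_degrees(scale)
    # Repeat the tonic an octave up so values near 12 wrap to the next octave.
    degrees = np.concatenate((degrees, [degrees[0] + 12]))

    midi = librosa.hz_to_midi(f0)
    corrected = midi.copy()
    voiced = ~np.isnan(midi)  # pyin marks unvoiced frames with NaN

    # Snap each voiced frame's pitch class to the closest scale degree.
    pitch_class = midi[voiced] % 12
    closest = degrees[np.argmin(np.abs(degrees[:, None] - pitch_class), axis=0)]
    corrected[voiced] = midi[voiced] - pitch_class + closest
    return librosa.midi_to_hz(corrected)

The corrected frequencies still contain NaNs at the unvoiced frames; depending on how psola.vocode is called, you may want to fill those in first (as done in Listing 3 below).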
Show me the code!
Below you'll find a condensed sketch of the auto-tuning command-line utility. The full, extensively commented version is in the repository linked at the end of this post.
Listing 3.
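(A sketch only: the argument handling, frame sizes, NaN filling, and output naming are my assumptions, and round_to_scale is the helper from the previous section.)

import argparse
from functools import partial
from pathlib import Path

import librosa
import numpy as np
import psola
import soundfile as sf


def closest_pitch(f0):
    """Strategy 1: round every voiced frame to the nearest MIDI note."""
    return librosa.midi_to_hz(np.around(librosa.hz_to_midi(f0)))


def autotune(audio, sr, correction_function):
    frame_length = 2048
    hop_length = frame_length // 4
    fmin = librosa.note_to_hz('C2')
    fmax = librosa.note_to_hz('C7')

    # Step 1: pitch tracking with PYIN (Listing 1).
    f0, _, _ = librosa.pyin(audio, frame_length=frame_length,
                            hop_length=hop_length, sr=sr,
                            fmin=fmin, fmax=fmax)

    # Step 2: calculate the desired pitch.
    corrected_f0 = correction_function(f0)

    # Fill the NaNs at unvoiced frames so PSOLA gets a target everywhere.
    nans = np.isnan(corrected_f0)
    corrected_f0[nans] = np.interp(np.flatnonzero(nans),
                                   np.flatnonzero(~nans),
                                   corrected_f0[~nans])

    # Step 3: pitch shifting with PSOLA (Listing 2).
    return psola.vocode(audio, sample_rate=int(sr),
                        target_pitch=corrected_f0, fmin=fmin, fmax=fmax)


def main():
    parser = argparse.ArgumentParser(description='Auto-tune a vocals recording.')
    parser.add_argument('file', type=Path, help='path to the input WAV file')
    parser.add_argument('-c', '--correction-method',
                        choices=['closest', 'scale'], default='closest')
    parser.add_argument('--scale', default='C:min',
                        help="key to round to when using 'scale', e.g. C:min")
    parser.add_argument('--plot', action='store_true',
                        help='plot the spectrogram with pitch trajectories '
                             '(omitted in this sketch)')
    args = parser.parse_args()

    if args.correction_method == 'closest':
        correction_function = closest_pitch
    else:
        correction_function = partial(round_to_scale, scale=args.scale)

    audio, sr = librosa.load(args.file, sr=None, mono=True)
    tuned = autotune(audio, sr, correction_function)
    output = args.file.with_name(args.file.stem + '_pitch_corrected'
                                 + args.file.suffix)
    sf.write(output, tuned, sr)


if __name__ == '__main__':
    main()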
I have run this code with the following command:
Listing 4.
$ python auto_tune.py "skyfall_vocals.wav" -c scale --scale C:min --plot
Auto-tuning: Before and After
Finally, it is time to listen to the original recording and the auto-tuned file!
"Skyfall" by Adele (performed by Jan Wilczek 🤐)
Of course, I have intentionally sung poorly to make the effect of the auto-tune all the more audible 😉
In Figure 2, there is a spectrogram of an excerpt of the original recording. I've plotted over it the tracked pitch and the pitch corrected according to the C minor scale. If you look closely, you will see how quantized the corrected pitch is. We could make the quantization less strict (and, thus, less artificial-sounding) by introducing a dry/wet parameter, i.e., controlling the amount of pitch correction applied.
Figure 2. Magnitude frequency spectrogram of the “Skyfall” vocals recording with overlaid pitch trajectories: original and corrected according to the C-minor scale.
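Such a correction-amount control could be as simple as a blend of the two pitch trajectories; this is an idea, not part of the utility in Listing 3, and it assumes the f0 and corrected_f0 arrays from the listings above:

# 0.0 leaves the pitch as sung; 1.0 applies the fully quantized pitch.
amount = 0.7
partially_corrected_f0 = amount * corrected_f0 + (1.0 - amount) * f0

Blending in the semitone (MIDI) domain instead of in hertz would arguably be more perceptually uniform.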
The auto-tune does the job, I’d say. Wouldn’t you agree? 😉
Summary
In this tutorial, we created a simple command-line Python utility that auto-tunes vocal recordings. It uses the PYIN algorithm for pitch tracking, the PSOLA algorithm for pitch shifting, and rounds the detected pitch to the nearest note of the given scale.
We then used the program to auto-tune my poor vocal recording of Adele's "Skyfall". The quality of the results is for you to judge!
Here is the full source code of the utility: https://github.com/JanWilczek/python-auto-tune
And finally: would you like to learn about other audio effects and how to implement them? Then subscribe to my newsletter to stay up to date with the upcoming tutorials and courses!
Thanks for reading and good luck making your voice shine!
Bibliography
[1] M. Mauch and S. Dixon, pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), 2014. [PDF], accessed 27.11.2022.
[2] A. de Cheveigné and H. Kawahara, YIN, a fundamental frequency estimator for speech and music, The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002. [PDF], accessed 27.11.2022.
[3] E. Moulines and F. Charpentier, Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, Speech Communication, 1990. [PDF], accessed 27.11.2022.