Slow Down Music Without Changing the Pitch (technical discussion)

Background

The program "Transcribe!" (of which I am the author) is intended to help musicians to transcribe music from recordings. It has the ability to slow down music (or speed it up) in real time without changing the pitch. People sometimes ask how this is done so I have written this discussion of the subject. It is a fairly general discussion but I also comment on the particular methods which I chose to use in Transcribe!

I mostly discuss slowing down rather than speeding up here, but with a moment's thought you can see that pretty much every issue discussed applies equally to both.

What Do We Really Want?

If you have never tried it then you might think that once you have some music on your hard disk in digital form, it would be easy to change the speed without changing the pitch - just a bit of resampling or something like that. But in fact it's difficult. Resampling - changing the sample rate - merely enables you to change the pitch and speed together in a way that's exactly analogous to varying the speed of an analogue tape recorder or vinyl record player. Halve the speed and the pitch goes down an octave.
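To see why resampling alone is not enough, here is a minimal Python sketch (an illustration only, not code from Transcribe!). The function names, the 8000 Hz sample rate and the use of linear interpolation are all assumptions made for the example. It plays a one second 440 Hz tone at half speed and estimates the resulting frequency by counting zero crossings.

```python
import math

def resample_linear(samples, speed):
    """Play `samples` at `speed` times the original rate, using linear interpolation."""
    out = []
    for i in range(int(len(samples) / speed)):
        pos = i * speed                                  # fractional read position
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]      # clamp at the final sample
        out.append((1.0 - frac) * samples[j] + frac * nxt)
    return out

def estimate_hz(samples, rate):
    """Rough frequency estimate: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0.0 <= b)
    return crossings * rate / len(samples)

RATE = 8000                                              # assumed sample rate
tone = [math.sin(2 * math.pi * 440 * n / RATE) for n in range(RATE)]   # 1 second of A440
slow = resample_linear(tone, 0.5)                        # half speed

print(len(tone) / RATE, estimate_hz(tone, RATE))   # about 1 second at about 440 Hz
print(len(slow) / RATE, estimate_hz(slow, RATE))   # about 2 seconds at about 220 Hz
```

The slowed version lasts twice as long but comes out an octave lower, just like the tape recorder.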

Before we can decide what we should be aiming for we have to ask what our slowdown program will be used for. Here are some questions we must consider:

Are we working on polyphonic material (many notes at a time) or just one note at a time? Different techniques may be preferred, depending.

Are we running in real time on a desktop computer in which case we must use efficient processing techniques, or are we running on expensive dedicated hardware or non-realtime, in which case we can use more sophisticated (slower) methods?

Are we aiming for a high quality realistic musical result, in which case it is reasonable to limit ourselves to small changes in speed (because large changes can never sound musically realistic anyway) or do we want large changes, in which case we must accept that it will sound "processed"?

Most instruments have a clearly defined start-up noise at the beginning of each note, which is quite different from the sound of the sustained note - for instance the tonguing of a brass instrument, the hammer of a piano, the hit of a drum. If we want the slowed-down music to sound like a real player on a real instrument who simply happens to be playing more slowly then we must not stretch these "transients" but only stretch the sustained notes or silences between them. On the other hand we might want to hear the details of the attack, for instance if we are analysing a player's technique for teaching purposes, in which case we want to stretch everything equally.

Suppose we have a sustained bass note at 31Hz (B, the open bottom string of a 5 string bass). When we slow this down we would clearly want the note to remain at 31Hz (31 cycles per second) so the slowdown method must generate more "cycles" of oscillation to fill up the extra time. On the other hand suppose we have a guitar strum where the guitarist smoothly strums all 6 strings in about a fifth of a second - a moderate speed strum. In this case the strum has about 30 note-attacks per second. If we slow this down we want the note-attacks to be spread further apart while there are still 6 of them. So here we have two examples of sonic events which happen at about 30 per second, for which the desired handling is radically different. How can any slowdown method know which approach to use, or even separate out the sounds to be handled with different approaches given that the bass note and the guitar strum could very easily both be happening at the same time?

Most commercial speed or pitch change software is intended for music recording and editing applications - for instance changing the pitch of a singer's voice to correct an out-of-tune note, or adjusting the duration of a music clip to make it fit an advertisement. In that case a natural sound is vital, but there is no need to support changes greater than about 20 or 30% either way, as anything more than that stands practically no chance of sounding natural anyway. High quality programs in this area do indeed make the effort of locating transients and not modifying them, and use many other sophisticated techniques. For this to work in real time you would usually be talking about a dedicated DSP effects unit.

Transcribe! is intended as an aid for transcribing and needs to slow the music down much more drastically - by a factor of up to 20 in fact - but "natural" sound is fortunately not such a priority. Instead the priority is to be able to hear clearly what's happening. For this reason I regard it as more sensible - and easier - for Transcribe! to stretch everything equally.

The bass note vs. guitar strum example above is a tricky one but multi-resolution processing (see below) helps a lot.

By the way, once you solve the problem of changing speed without changing pitch you can easily change pitch without changing speed by applying a touch of resampling afterwards. For instance if you want to raise the pitch then you first lower the speed without changing the pitch, then resample to speed it back up to the original speed while raising the pitch too. There are also more direct ways of changing the pitch which I won't be discussing here.
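The arithmetic behind this two-step pitch change can be sketched as follows. This is only an illustration: the function name is hypothetical and it assumes equal-tempered semitones, which the discussion above does not specify.

```python
def pitch_shift_plan(semitones):
    """Return (stretch_factor, playback_speed) for a pitch change at constant duration."""
    ratio = 2.0 ** (semitones / 12.0)     # equal-tempered frequency ratio (assumption)
    # Step 1: stretch the duration by `ratio` without changing the pitch.
    # Step 2: resample so that playback runs `ratio` times faster, which
    #         multiplies the pitch by `ratio` and restores the original duration.
    return ratio, ratio

stretch, speed = pitch_shift_plan(2)      # up a whole tone
duration = 10.0 * stretch / speed         # a 10 second clip: stretched, then sped up
print(stretch, duration)
```

The two factors are always equal, which is why the duration ends up unchanged. A negative number of semitones gives a ratio below 1, meaning speed up first and then resample to slow back down.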

Basic Technique

The basic technique used by most slowdown methods whether "time domain" or "frequency domain" (see below) is to slice the input sound into short segments - typically in the range from a 100th to a 10th of a second - to spread those segments further apart in time, and to fill the gaps by duplicating bits of the segments either side - a sort of "copy and paste" into the gaps. There are also "modelling" techniques which attempt to analyse the material at a high level and then reconstruct at slower speed from the high level description. These can be good on certain material but I won't be discussing them here.

Apparently back in the steam age you could get tape recorders which implemented this technique mechanically. The playback head assembly was circular and in fact carried four playback heads equally spaced around the circumference. The assembly rotated while the tape moved past it and a brush contact underneath ensured that the head which was currently in contact with the tape was the one whose output was fed to the playback amplifier. The overall speed was controlled by the speed of the tape while the pitch was controlled by the relative speed of the tape past the playback head, which would not be the same if the assembly was rotating. You can see how this involves playing certain little slices of tape twice as one head takes over from another.

This crude technique (whether implemented mechanically or digitally) is easy to do but has many problems with sound quality. The splice points introduce discontinuities in the sound and as there are perhaps 30 splice points per second, this causes a dreadful warbling noise. Also transients are duplicated if they happen to be in a segment which gets used twice, creating a smeared-out effect which becomes very bad at high slowdown ratios. The rest of this discussion will be about some of the techniques we can use to try to reduce these bad effects. The basic idea is, we must analyse the sound to some extent, then use the information gained to find ways of splicing it together without the discontinuities. You might think a simple cross-fade at the joins would do the trick, and certainly it helps by eliminating clicks, but it is not enough. A musical note has a repeating waveform of fundamental frequency plus harmonics and if you splice this at arbitrary points then the repeating waveform shape is upset with a jolt. In music editing you can get away with a cross-fade splice here and there, but not 30 per second. What we would like to do is somehow adjust our splice points in accordance with the frequency of the note so that we splice exactly on a whole number of cycles and avoid any lurch in the waveform shape.
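The crude slice-and-cross-fade technique can be sketched in a few lines of Python. This is a minimal illustration and not Transcribe!'s code: the function name, the segment length of 512 samples and the choice of a Hann-shaped cross-fade are assumptions made for the example.

```python
import math

def ola_stretch(samples, stretch, seg_len=512):
    """Naive overlap-add: lengthen `samples` by about `stretch` using cross-faded segments."""
    synth_hop = seg_len // 2                   # output segments overlap by half
    ana_hop = synth_hop / stretch              # input segments are read closer together
    # Hann-shaped fade: two copies half a segment apart always sum to 1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / seg_len) for n in range(seg_len)]
    n_segs = int((len(samples) - seg_len) / ana_hop) + 1
    out = [0.0] * ((n_segs - 1) * synth_hop + seg_len)
    for k in range(n_segs):
        start_in = int(k * ana_hop)            # where this segment is read from
        start_out = k * synth_hop              # where it is pasted in the output
        for n in range(seg_len):
            out[start_out + n] += window[n] * samples[start_in + n]
    return out

steady = [1.0] * 4096                          # a constant signal, just to check the fades
stretched = ola_stretch(steady, 2.0)
print(len(steady), len(stretched))
```

Because the splice points take no account of the waveform, real music processed this way still warbles. The cross-fade only removes the clicks.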

The techniques used for this divide into two categories, "time domain" and "frequency domain". The samples of a digital audio signal are considered to be "time domain" because the samples represent different points in time. To work in the time domain is to work directly with these. In the time domain it is easy to identify the time at which things happen but hard to identify frequency information. We can take a segment of sound and perform a discrete Fourier transform (DFT) and this gives us a description of that segment as an array of data points where each point represents a different frequency: to work with this data is to work in the frequency domain. In the frequency domain it is hard to identify the time at which things happen but easy to identify frequency information.
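As a small illustration of the frequency domain view, here is a deliberately naive DFT in Python (real programs use an FFT library). The function name, the 8000 Hz sample rate and the 256 sample segment are assumptions chosen so that the test tone falls exactly on one data point.

```python
import cmath
import math

def dft(segment):
    """Naive O(n^2) discrete Fourier transform of a list of samples."""
    n = len(segment)
    return [sum(segment[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

RATE = 8000                                    # assumed sample rate
N = 256                                        # assumed segment length
segment = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(N)]   # 1000 Hz tone

spectrum = dft(segment)
# Each data point k describes the frequency k * RATE / N; find the strongest one.
peak_bin = max(range(N // 2), key=lambda k: abs(spectrum[k]))
print(peak_bin, peak_bin * RATE / N)           # the peak sits at the 1000 Hz data point
```

Each entry of `spectrum` is a complex number whose magnitude is the strength of that frequency and whose angle is its phase.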

"Time Domain" Techniques

The idea here is to identify the frequency of the note being played (there are various techniques for this) and splice only on whole numbers of cycles. This can work very well, especially if combined with transient-detection to avoid duplicating segments that have a transient in them. But there is a catch: it doesn't work when there are many notes being played at the same time, it only works on one-note-at-a-time material. This makes it excellent for working with single note instruments or solo voice and it is used for this purpose in recording studios, but useless for general purpose music which is polyphonic - many notes at a time. Transcribe! of course needs to work with polyphonic material so does not use this technique.
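A minimal sketch of the time domain idea on a single steady note follows. It uses autocorrelation to find the period, which is just one of the various techniques available and is an assumption of this example, as are the function names and the lag limits.

```python
import math

def estimate_period(samples, min_lag, max_lag):
    """Return the lag (in samples) at which the signal best matches itself."""
    def corr(lag):
        return sum(samples[n] * samples[n + lag] for n in range(len(samples) - max_lag))
    return max(range(min_lag, max_lag + 1), key=corr)

def repeat_whole_cycles(samples, splice_at, period, cycles):
    """Lengthen a note by re-inserting `cycles` whole periods just before `splice_at`."""
    extra = samples[splice_at - cycles * period: splice_at]
    return samples[:splice_at] + extra + samples[splice_at:]

RATE = 8000                                    # assumed sample rate
note = [math.sin(2 * math.pi * 100 * n / RATE) for n in range(2000)]    # steady 100 Hz

period = estimate_period(note, min_lag=20, max_lag=120)
longer = repeat_whole_cycles(note, splice_at=1000, period=period, cycles=3)
print(period, len(note), len(longer))
```

Because the inserted material is a whole number of cycles, the waveform carries straight on across the splice. The lag limits matter: a pure tone matches itself equally well at two or three periods, so the search range here stops short of twice the expected period.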

"Frequency Domain" Techniques

The problem with time domain techniques on polyphonic material is that if we choose a splice point to avoid discontinuity on one of the notes present then this splice point is unlikely to be suitable for the other notes present. What we really want is to separate out the various notes and handle each one differently. "Phase alignment" is what we are talking about here. The phase of a repeating waveform means, exactly what point in its repetition cycle has it reached? If we splice and the waveform has the same phase on both sides of the splice point then we have no discontinuity in the waveform's repeating shape. But if the phase is different on either side, the waveform shape will lurch and not sound good.
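The size of the phase mismatch is easy to work out. The sketch below (function name and figures are illustrative assumptions) takes a splice that repeats 80 samples, which is exactly one cycle of a 100 Hz note at an 8000 Hz sample rate, and asks how far out of phase a second note at 130 Hz ends up.

```python
def phase_jump_cycles(freq_hz, repeated_samples, rate):
    """Phase mismatch at a splice, as a fraction of a cycle (0 = aligned, 0.5 = worst)."""
    cycles = freq_hz * repeated_samples / rate    # cycles contained in the repeated slice
    frac = cycles % 1.0                           # only the leftover part-cycle matters
    return min(frac, 1.0 - frac)

RATE = 8000                                       # assumed sample rate
repeated = 80                                     # one whole cycle of 100 Hz

print(phase_jump_cycles(100, repeated, RATE))     # the 100 Hz note is aligned
print(phase_jump_cycles(130, repeated, RATE))     # the 130 Hz note lurches by about 0.3 cycle
```

No single splice length suits both notes, which is why each frequency needs to be handled separately.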

The DFT (discrete Fourier transform) tells us the amplitude and phase of each of the frequency components present in the segment we DFT, and the fun part is that while we have the signal in the frequency domain we can adjust the phases of the various frequencies independently so as to make them right for the splice point we are using. Then we use the inverse discrete Fourier transform (IDFT) to convert this back into the time domain, and use the resulting segment for the splice. This is the "Phase Vocoder", so called because there was once a weird studio effect unit called the Vocoder which split a signal into maybe 8 frequency bands using analogue filters, then applied envelope information from another source to modulate these bands. The phase vocoder is a bit like that except it preserves phase information too, hence the name.
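Here is a bare-bones textbook phase vocoder in Python, offered as a sketch of the general idea and emphatically not as Transcribe!'s implementation. Everything specific in it is an assumption made for the example: the function names, the tiny 64 sample segment, the Hann window, the quarter-segment output spacing and the naive DFT (a real program would use an FFT and much larger segments).

```python
import cmath
import math

def dft(x, sign=-1):
    """Naive O(n^2) DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def wrap(phase):
    """Wrap a phase difference into the range [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi

def phase_vocoder(samples, stretch, n=64):
    """Stretch `samples` by roughly `stretch` (the input spacing is rounded to whole samples)."""
    hop_out = n // 4                                  # fixed spacing of output segments
    hop_in = max(1, int(round(hop_out / stretch)))    # input segments sit closer together
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
    n_frames = (len(samples) - n) // hop_in + 1
    out = [0.0] * ((n_frames - 1) * hop_out + n)
    prev_phase = acc_phase = None
    for f in range(n_frames):
        start = f * hop_in
        spec = dft([window[t] * samples[start + t] for t in range(n)])
        mag = [abs(c) for c in spec]
        phase = [cmath.phase(c) for c in spec]
        if acc_phase is None:
            acc_phase = phase[:]                      # first segment keeps its own phases
        else:
            for b in range(n):
                expected = 2 * math.pi * b * hop_in / n       # advance of the bin centre
                deviation = wrap(phase[b] - prev_phase[b] - expected)
                per_sample = (expected + deviation) / hop_in  # measured radians per sample
                acc_phase[b] += per_sample * hop_out          # phase at the new spacing
        prev_phase = phase
        frame = dft([m * cmath.exp(1j * p) for m, p in zip(mag, acc_phase)], sign=+1)
        for t in range(n):
            # scale by 1/n for the inverse DFT and 1/1.5 for the summed squared windows
            out[f * hop_out + t] += window[t] * frame[t].real / (n * 1.5)
    return out

RATE = 8000                                           # assumed sample rate
tone = [math.sin(2 * math.pi * 440 * t / RATE) for t in range(512)]
stretched = phase_vocoder(tone, 2.0)
print(len(tone), len(stretched))                      # output is nearly twice as long
```

For each frequency the code measures how fast the phase is really advancing between one input segment and the next, then advances the output phase at that same rate over the wider output spacing. That is what keeps each component's waveform continuous across the splices, so the stretched tone stays at 440 Hz.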

Transcribe!'s Slowdown

Transcribe! has always used a phase vocoder as do most programs which slow down polyphonic music in real time. The FFT (fast Fourier transform) algorithm makes it possible to compute DFTs fast enough for this. However the phase vocoder has its difficulties too.

Perhaps the biggest problem with the phase vocoder is the question, how large should the analysis segments be? (the segments we take from the input signal and apply the DFT to). To get accurate analysis of frequencies it is necessary for the segment to contain several (at least 3 or 4) full cycles of the lowest note we might see. If we expect notes down to say 30Hz (not unreasonable) then this means segments of a tenth of a second. Unfortunately this is a far larger segment than we would like to use at higher frequencies and results in severe smearing of transients at large slowdown ratios.
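The segment size arithmetic can be checked with a couple of one-line functions. The function names and the 44100 Hz sample rate are assumptions made for the example.

```python
def segment_seconds(lowest_hz, cycles=3):
    """Shortest segment holding `cycles` full cycles of the lowest note."""
    return cycles / lowest_hz

def cycles_in_segment(freq_hz, segment_sec):
    """How many cycles of a given note fit in one segment."""
    return freq_hz * segment_sec

RATE = 44100                                   # assumed sample rate (CD audio)
seg = segment_seconds(30, cycles=3)            # notes down to 30 Hz
print(seg, round(seg * RATE))                  # a tenth of a second, 4410 samples
print(cycles_in_segment(3000, seg))            # about 300 cycles of a 3000 Hz note
```

A segment that is barely long enough for the bass holds hundreds of cycles of a high note, far more than is needed to analyse it, and every transient in that tenth of a second is smeared across the whole segment.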

The answer to this is to use "multiresolution analysis" where we split the signal into several frequency channels and use a different segment size for each channel. However the phase computations are already quite tricky even for a single channel and if we have multiple channels then we must also synchronise the phase between the channels or horrible things happen in the crossover areas where one channel takes over from another. Prior to version 5.2 Transcribe! offered two slowdown techniques - "whole numbers" which used a two-channel multiresolution approach but only allowed whole number slowdown ratios because this makes phase synchronisation between channels much easier, and "continuous" which allowed continuously variable speed but used only a single channel.

From version 5.2, Transcribe!'s slowdown incorporates the best features of both previous techniques, and more besides. It uses 5 channels running at consecutively lower sampling rates and using larger analysis segment sizes as the frequencies get lower. It allows continuously variable slowdown while synchronising phase between adjacent channels in the crossover zones.
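To give a feel for how such a scheme trades time resolution against frequency resolution, here is a purely hypothetical channel layout in Python. The figures are invented for illustration and are not Transcribe!'s actual settings: it assumes each channel runs at half the sampling rate of the one above and that every channel uses the same number of samples per segment.

```python
def channel_plan(top_rate=44100, channels=5, segment_samples=1024):
    """Hypothetical multiresolution layout: halve the sample rate for each lower channel."""
    plan = []
    for c in range(channels):
        rate = top_rate / 2 ** c               # assumed: each channel halves the rate
        plan.append({
            "channel": c,
            "rate_hz": rate,
            "top_hz": rate / 2,                # highest frequency the channel can carry
            "segment_sec": segment_samples / rate,
        })
    return plan

for row in channel_plan():
    print(row)
```

With these made-up numbers the top channel uses segments of about a fortieth of a second, short enough to keep transients fairly crisp, while the bottom channel uses segments of more than a third of a second, long enough to hold many cycles of a low bass note.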

Transcribe! version 7.2 has further improvements for the handling of percussive sounds and also gives a steadier sound thanks to improved handling of phase. I think it sounds pretty good and hope that you agree. If you haven't already tried Transcribe! then you can download it for a 30 day free trial, and hear for yourself.

If You Want to Know More

Visit Google and search for "phase vocoder".
