Exploring how deep learning — specifically a U-Net architecture — can isolate individual instruments from a mixed polyphonic music recording using Python.
Music source separation is the task of isolating individual sound sources — such as vocals, drums, bass, or a single instrument — from a mixed audio recording. This is a long-standing challenge in audio signal processing, but recent advances in deep learning have made it more tractable than ever.
What is Music Source Separation?
When multiple instruments play simultaneously, their sound waves mix together into a single audio signal. The human auditory system can effortlessly pick out a guitar from a full band — but for a machine, untangling those layers is a hard inverse problem. Source separation aims to computationally reverse that mixing process.
Polyphonic vs. Monophonic
Polyphonic music contains multiple simultaneous notes and instruments — like an orchestra or a full band recording. This is far harder to separate than monophonic sources (a single melody line), because frequencies from different instruments overlap heavily in both time and pitch.
In this project, I explored using a U-Net convolutional neural network to separate individual musical instruments from polyphonic tracks. The goal was to take a mixed audio signal and extract each instrument's contribution as a clean, isolated audio stream.
The U-Net Architecture
U-Net was originally developed for biomedical image segmentation in 2015, but its encoder-decoder structure with skip connections maps perfectly onto spectrogram-based audio separation tasks. Instead of segmenting pixels, we're learning to mask frequency bins.
Encoder Path
The encoder is a series of convolutional blocks that progressively downsample the input spectrogram, capturing increasingly abstract features — from low-level frequency patterns to high-level timbral structure. Each block outputs a feature map that is stored for later use via skip connections.
Decoder Path
The decoder mirrors the encoder, upsampling back to the original spectrogram resolution. At each step, it receives the corresponding encoder feature map via a skip connection, allowing fine-grained frequency detail to be preserved. The final output is a soft mask (values between 0 and 1) applied to the mixture spectrogram to isolate the target instrument.
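As a sketch of this encoder-decoder-with-skips structure, here is a minimal two-level U-Net in PyTorch. The layer counts and channel widths are illustrative placeholders, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal 2-level U-Net sketch: one downsampling step, one
    upsampling step, a skip connection, and a sigmoid soft-mask head.
    Input/output: (batch, 1, freq_bins, time_frames)."""

    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        # stride-2 conv halves both spectrogram dimensions
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        # transposed conv upsamples back to the input resolution
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        # the decoder sees upsampled features concatenated with the skip
        self.dec = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1), nn.ReLU())
        self.mask = nn.Conv2d(ch, 1, 1)  # 1x1 conv -> one mask channel

    def forward(self, x):
        s1 = self.enc1(x)   # full-resolution features, kept for the skip
        s2 = self.enc2(s1)  # downsampled bottleneck features
        u = self.up(s2)     # upsample back to input resolution
        d = self.dec(torch.cat([u, s1], dim=1))  # skip connection
        return torch.sigmoid(self.mask(d))       # soft mask in [0, 1]
```

The `torch.cat` in the forward pass is the skip connection: it reinjects the encoder's full-resolution frequency detail that downsampling would otherwise discard.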
Working with the MUSDB18 Dataset
MUSDB18 at a glance: 150 music tracks · 10+ hours of total audio · 4 stem sources · 44.1 kHz sample rate.
MUSDB18 is the standard benchmark dataset for music source separation. It contains 150 full-length music tracks across various genres, each with professionally recorded stems: vocals, drums, bass, and other (melody/guitar/keys). This paired data — mixture and isolated sources — is essential for supervised training.
Training Pipeline in Python
- Load audio at 44.1kHz using librosa and convert to mono
- Apply Short-Time Fourier Transform (STFT) to produce magnitude spectrograms
- Chunk audio into fixed 3-second segments to manage GPU memory
- Train U-Net with Mean Absolute Error (MAE) loss on masked spectrograms
- Evaluate separation quality using Signal-to-Distortion Ratio (SDR)
- Reconstruct waveform using the estimated mask and original mixture phase (inverse STFT)
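The front half of this pipeline — STFT and fixed-length chunking — can be sketched as follows. I use scipy.signal.stft here in place of librosa so the snippet is self-contained; the FFT size and hop length are illustrative values:

```python
import numpy as np
from scipy.signal import stft

SR = 44_100
CHUNK_S = 3  # fixed 3-second segments, as in the training pipeline

def chunk_spectrograms(mono, sr=SR, n_fft=2048, hop=512):
    """Split a mono waveform into 3-second chunks and return a magnitude
    spectrogram per chunk, plus the complex STFTs (the phase is needed
    later for waveform reconstruction)."""
    chunk_len = sr * CHUNK_S
    n_chunks = len(mono) // chunk_len          # drop the ragged tail
    mags, specs = [], []
    for i in range(n_chunks):
        seg = mono[i * chunk_len:(i + 1) * chunk_len]
        _, _, Z = stft(seg, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
        mags.append(np.abs(Z))  # magnitude goes to the network
        specs.append(Z)         # complex STFT keeps the phase
    return mags, specs
```

The U-Net then trains on `mags`, predicting a mask per chunk; the complex STFTs are retained so the mixture phase is available at reconstruction time.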
Phase Reconstruction Trick
When converting back from a spectrogram to audio, you need phase information that the magnitude spectrogram discards. A simple but effective trick: reuse the original mixture's phase for the reconstructed waveform. This avoids expensive iterative phase estimation (like Griffin-Lim) and produces clean, listenable results.
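A minimal sketch of the trick, assuming an estimated mask with the same shape as the mixture spectrogram (scipy's STFT stands in for librosa's):

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(mask, mix, sr=44_100, n_fft=2048, hop=512):
    """Apply a soft mask to the mixture magnitude, then borrow the
    mixture's phase for the inverse STFT instead of estimating phase
    iteratively."""
    _, _, Z = stft(mix, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    est_mag = mask * np.abs(Z)                  # masked magnitude
    est = est_mag * np.exp(1j * np.angle(Z))    # reuse mixture phase
    _, wav = istft(est, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return wav
```

With an all-ones mask this round-trips the mixture essentially perfectly, which is a useful sanity check before plugging in a learned mask.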
Results & Findings
- Drums SDR: 6.2 dB
- Bass SDR: 5.1 dB
- Vocals SDR: 3.8 dB
- Other SDR: 2.9 dB
Results showed the strongest performance on drums and bass — instruments with distinct spectral profiles and minimal overlap with other sources. Vocal and melodic instrument separation was harder, as these sources share overlapping frequency ranges with each other and with harmonic overtones from other instruments.
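For reference, the basic SDR definition is simple to compute. Benchmark toolkits such as museval use a more elaborate, filter-aware variant, so figures are only directly comparable within one evaluation setup:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Signal-to-Distortion Ratio in dB: energy of the true source
    relative to the energy of the estimation error. Higher is better;
    eps guards against log(0) for a perfect estimate."""
    err = reference - estimate
    return 10 * np.log10((np.sum(reference ** 2) + eps)
                         / (np.sum(err ** 2) + eps))
```

As a sanity check, an estimate with a constant error one-tenth of a unit-amplitude reference gives exactly 20 dB, since the energy ratio is 1/0.01 = 100.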
“The spectrogram masking approach is computationally efficient and produces listenable outputs even without complex post-processing — a strong baseline for any source separation system.”
— Bipin Phaiju
Key Challenges
The two hardest problems in this project were phase reconstruction and computational cost. Magnitude spectrograms lose phase information, which is critical for waveform quality. And training on full songs at 44.1kHz is prohibitively expensive — chunking into segments solved the memory issue but introduced boundary artefacts at segment edges.
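One common mitigation for those boundary artefacts — a standard overlap-add technique, not something this project implemented — is to separate overlapping chunks and cross-fade them back together so segment edges blend rather than butt against each other:

```python
import numpy as np

def overlap_add(chunks, hop):
    """Stitch equal-length waveform chunks spaced `hop` samples apart,
    weighting each by a triangular window and normalising by the summed
    window so overlapping regions cross-fade smoothly."""
    n = len(chunks[0])
    win = np.bartlett(n)  # triangular fade-in / fade-out
    out = np.zeros(hop * (len(chunks) - 1) + n)
    norm = np.zeros_like(out)
    for i, c in enumerate(chunks):
        out[i * hop:i * hop + n] += win * c
        norm[i * hop:i * hop + n] += win
    return out / np.maximum(norm, 1e-8)  # avoid divide-by-zero at edges
```

Running the separator on chunks that overlap by half their length, then stitching with something like this, trades extra compute for smoother seams.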
Future Directions
This project built a solid foundation in audio deep learning. The natural next steps involve more powerful architectures and training strategies that address the limitations of pure spectrogram masking.
What to Try Next
Demucs (Meta AI) uses raw waveform modelling instead of spectrograms, avoiding phase loss entirely. Hybrid Transformer Demucs adds attention mechanisms for long-range temporal dependencies. Data augmentation via pitch shifting and time stretching can also significantly improve generalisation across genres.
This work reinforced how cross-domain transfer of deep learning techniques — from medical imaging to audio — can yield surprisingly effective results, and deepened my understanding of both signal processing fundamentals and practical deep learning in PyTorch.

Bipin Phaiju
Software engineer & MSc Data Science student based in Coventry, UK. Passionate about machine learning, audio processing, and building beautiful web experiences.