Exploring how deep learning — specifically a U-Net architecture — can isolate individual instruments from a mixed polyphonic music recording using Python.
Music source separation is the task of isolating individual sound sources — such as vocals, drums, bass, or a single instrument — from a mixed audio recording. This is a long-standing challenge in audio signal processing, but recent advances in deep learning have made it more tractable than ever.
What is Music Source Separation?
When multiple instruments play simultaneously, their sound waves mix together into a single audio signal. The human auditory system can effortlessly pick out a guitar from a full band — but for a machine, untangling those layers is a hard inverse problem. Source separation aims to computationally reverse that mixing process.
Polyphonic vs. Monophonic
Polyphonic music contains multiple simultaneous notes and instruments — like an orchestra or a full band recording. This is far harder to separate than monophonic sources (a single melody line), because frequencies from different instruments overlap heavily in both time and pitch.
In this project, I explored using a U-Net convolutional neural network to separate individual musical instruments from polyphonic tracks. The goal was to take a mixed audio signal and extract each instrument's contribution as a clean, isolated audio stream.
The U-Net Architecture
U-Net was originally developed for biomedical image segmentation in 2015, but its encoder-decoder structure with skip connections maps perfectly onto spectrogram-based audio separation tasks. Instead of segmenting pixels, we're learning to mask frequency bins.
Encoder Path
The encoder is a series of convolutional blocks that progressively downsample the input spectrogram, capturing increasingly abstract features — from low-level frequency patterns to high-level timbral structure. Each block outputs a feature map that is stored for later use via skip connections.
Decoder Path
The decoder mirrors the encoder, upsampling back to the original spectrogram resolution. At each step, it receives the corresponding encoder feature map via a skip connection, allowing fine-grained frequency detail to be preserved. The final output is a soft mask (values between 0 and 1) applied to the mixture spectrogram to isolate the target instrument.
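As a sketch of this encoder-decoder-with-skips structure, here is a minimal two-level U-Net in PyTorch. The layer counts and channel widths are illustrative placeholders, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal 2-level U-Net sketch: one downsampling step, one
    upsampling step, a skip connection, and a sigmoid soft-mask head.
    Input/output: (batch, 1, freq_bins, time_frames)."""

    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        # stride-2 conv halves both spectrogram dimensions
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        # transposed conv upsamples back to the input resolution
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        # the decoder sees upsampled features concatenated with the skip
        self.dec = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1), nn.ReLU())
        self.mask = nn.Conv2d(ch, 1, 1)  # 1x1 conv -> one mask channel

    def forward(self, x):
        s1 = self.enc1(x)   # full-resolution features, kept for the skip
        s2 = self.enc2(s1)  # downsampled bottleneck features
        u = self.up(s2)     # upsample back to input resolution
        d = self.dec(torch.cat([u, s1], dim=1))  # skip connection
        return torch.sigmoid(self.mask(d))       # soft mask in [0, 1]
```

The `torch.cat` in the forward pass is the skip connection: it reinjects the encoder's full-resolution frequency detail that downsampling would otherwise discard.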
Working with the MUSDB18 Dataset
MUSDB18 at a glance: 150 music tracks · 10+ hours of total audio · 4 stem sources · 44.1 kHz sample rate.
MUSDB18 is the standard benchmark dataset for music source separation. It contains 150 full-length music tracks across various genres, each with professionally recorded stems: vocals, drums, bass, and other (melody/guitar/keys). This paired data — mixture and isolated sources — is essential for supervised training.
Training Pipeline in Python
- Load audio at 44.1kHz using librosa and convert to mono
- Apply Short-Time Fourier Transform (STFT) to produce magnitude spectrograms
- Chunk audio into fixed 3-second segments to manage GPU memory
- Train U-Net with Mean Absolute Error (MAE) loss on masked spectrograms
- Evaluate separation quality using Signal-to-Distortion Ratio (SDR)
- Reconstruct waveform using the estimated mask and original mixture phase (inverse STFT)
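The front half of this pipeline — STFT and fixed-length chunking — can be sketched as follows. I use scipy.signal.stft here in place of librosa so the snippet is self-contained; the FFT size and hop length are illustrative values:

```python
import numpy as np
from scipy.signal import stft

SR = 44_100
CHUNK_S = 3  # fixed 3-second segments, as in the training pipeline

def chunk_spectrograms(mono, sr=SR, n_fft=2048, hop=512):
    """Split a mono waveform into 3-second chunks and return a magnitude
    spectrogram per chunk, plus the complex STFTs (the phase is needed
    later for waveform reconstruction)."""
    chunk_len = sr * CHUNK_S
    n_chunks = len(mono) // chunk_len          # drop the ragged tail
    mags, specs = [], []
    for i in range(n_chunks):
        seg = mono[i * chunk_len:(i + 1) * chunk_len]
        _, _, Z = stft(seg, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
        mags.append(np.abs(Z))  # magnitude goes to the network
        specs.append(Z)         # complex STFT keeps the phase
    return mags, specs
```

The U-Net then trains on `mags`, predicting a mask per chunk; the complex STFTs are retained so the mixture phase is available at reconstruction time.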
Phase Reconstruction Trick
When converting back from a spectrogram to audio, you need phase information that the magnitude spectrogram discards. A simple but effective trick: reuse the original mixture's phase for the reconstructed waveform. This avoids expensive iterative phase estimation (like Griffin-Lim) and produces clean, listenable results.
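A minimal sketch of the trick, assuming an estimated mask with the same shape as the mixture spectrogram (scipy's STFT stands in for librosa's):

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(mask, mix, sr=44_100, n_fft=2048, hop=512):
    """Apply a soft mask to the mixture magnitude, then borrow the
    mixture's phase for the inverse STFT instead of estimating phase
    iteratively."""
    _, _, Z = stft(mix, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    est_mag = mask * np.abs(Z)                  # masked magnitude
    est = est_mag * np.exp(1j * np.angle(Z))    # reuse mixture phase
    _, wav = istft(est, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return wav
```

With an all-ones mask this round-trips the mixture essentially perfectly, which is a useful sanity check before plugging in a learned mask.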
Results & Findings
- Drums SDR: 6.2 dB
- Bass SDR: 5.1 dB
- Vocals SDR: 3.8 dB
- Other SDR: 2.9 dB
Results showed the strongest performance on drums and bass — instruments with distinct spectral profiles and minimal overlap with other sources. Vocal and melodic instrument separation was harder, as these sources share overlapping frequency ranges with each other and with harmonic overtones from other instruments.
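For reference, the basic SDR definition is simple to compute. Benchmark toolkits such as museval use a more elaborate, filter-aware variant, so figures are only directly comparable within one evaluation setup:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Signal-to-Distortion Ratio in dB: energy of the true source
    relative to the energy of the estimation error. Higher is better;
    eps guards against log(0) for a perfect estimate."""
    err = reference - estimate
    return 10 * np.log10((np.sum(reference ** 2) + eps)
                         / (np.sum(err ** 2) + eps))
```

As a sanity check, an estimate with a constant error one-tenth of a unit-amplitude reference gives exactly 20 dB, since the energy ratio is 1/0.01 = 100.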
“The spectrogram masking approach is computationally efficient and produces listenable outputs even without complex post-processing — a strong baseline for any source separation system.”
— Bipin Phaiju
Key Challenges
The two hardest problems in this project were phase reconstruction and computational cost. Magnitude spectrograms lose phase information, which is critical for waveform quality. And training on full songs at 44.1kHz is prohibitively expensive — chunking into segments solved the memory issue but introduced boundary artefacts at segment edges.
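One common mitigation for those boundary artefacts — a standard overlap-add technique, not something this project implemented — is to separate overlapping chunks and cross-fade them back together so segment edges blend rather than butt against each other:

```python
import numpy as np

def overlap_add(chunks, hop):
    """Stitch equal-length waveform chunks spaced `hop` samples apart,
    weighting each by a triangular window and normalising by the summed
    window so overlapping regions cross-fade smoothly."""
    n = len(chunks[0])
    win = np.bartlett(n)  # triangular fade-in / fade-out
    out = np.zeros(hop * (len(chunks) - 1) + n)
    norm = np.zeros_like(out)
    for i, c in enumerate(chunks):
        out[i * hop:i * hop + n] += win * c
        norm[i * hop:i * hop + n] += win
    return out / np.maximum(norm, 1e-8)  # avoid divide-by-zero at edges
```

Running the separator on chunks that overlap by half their length, then stitching with something like this, trades extra compute for smoother seams.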
Future Directions
This project built a solid foundation in audio deep learning. The natural next steps involve more powerful architectures and training strategies that address the limitations of pure spectrogram masking.
What to Try Next
Demucs (Meta AI) uses raw waveform modelling instead of spectrograms, avoiding phase loss entirely. Hybrid Transformer Demucs adds attention mechanisms for long-range temporal dependencies. Data augmentation via pitch shifting and time stretching can also significantly improve generalisation across genres.
This work reinforced how cross-domain transfer of deep learning techniques — from medical imaging to audio — can yield surprisingly effective results, and deepened my understanding of both signal processing fundamentals and practical deep learning in PyTorch.

Bipin Phaiju
Software engineer & MSc Data Science student based in Coventry, UK. Passionate about machine learning, audio processing, and building beautiful web experiences.