Abstract: In this work, we aim to improve the expressive capacity of waveform-based discriminative music networks by modeling both sequential (temporal) and hierarchical information in an efficient end-to-end architecture. We present MuSLCAT, or Multi-scale and Multi-level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex music tags directly from raw waveform recordings. We also introduce a lightweight variant of MuSLCAT called MuSLCAN, short for Multi-scale and Multi-level Convolutional Attention Network. Both MuSLCAT and MuSLCAN model features from multiple scales and levels by integrating a frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions by using two convolutional attention networks with attention-augmented convolution (AAC) blocks. The backend dynamically recalibrates the multi-scale and multi-level features extracted from the frontend by incorporating self-attention. MuSLCAT and MuSLCAN differ in their backend components: MuSLCAT's backend is a modified version of BERT, while MuSLCAN's is a simple AAC block. We validate the proposed MuSLCAT and MuSLCAN architectures by comparing them to state-of-the-art networks on four benchmark datasets for music tagging and genre recognition. Our experiments show that MuSLCAT and MuSLCAN consistently yield competitive results when compared to state-of-the-art waveform-based models, yet require considerably fewer parameters.
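The abstract names attention-augmented convolution (AAC) blocks but does not specify them. As a rough illustration of the general idea only, the sketch below concatenates the output of a 1D convolution with the output of a single-head self-attention layer along the channel axis. It is not the authors' implementation: all function names, shapes, and the random weights are hypothetical, and it assumes a single head with no positional encoding, no output projection, and no training.

```python
import math
import random

def conv1d(x, weight):
    """Zero-padded 'same' 1D convolution. x: [C_in][T]; weight: [C_out][C_in][K], K odd."""
    c_in, t_len = len(x), len(x[0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for w_out in weight:
        row = []
        for t in range(t_len):
            acc = 0.0
            for c in range(c_in):
                for j in range(k):
                    s = t + j - pad
                    if 0 <= s < t_len:  # positions outside the signal count as zero
                        acc += w_out[c][j] * x[c][s]
            row.append(acc)
        out.append(row)
    return out

def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(a - m) for a in v]
    z = sum(e)
    return [a / z for a in e]

def project(x, w):
    """Pointwise linear map over channels. x: [C_in][T]; w: [C_out][C_in]."""
    t_len = len(x[0])
    return [[sum(w_o[c] * x[c][t] for c in range(len(x))) for t in range(t_len)]
            for w_o in w]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the time axis."""
    q, k, v = project(x, wq), project(x, wk), project(x, wv)
    d_k, t_len = len(q), len(x[0])
    out = [[0.0] * t_len for _ in v]
    for t in range(t_len):
        scores = [sum(q[c][t] * k[c][s] for c in range(d_k)) / math.sqrt(d_k)
                  for s in range(t_len)]
        attn = softmax(scores)  # weights over all time steps, summing to 1
        for c in range(len(v)):
            out[c][t] = sum(attn[s] * v[c][s] for s in range(t_len))
    return out

def aac_block(x, conv_w, wq, wk, wv):
    """Concatenate convolutional and attention feature maps along the channel axis."""
    return conv1d(x, conv_w) + self_attention(x, wq, wk, wv)

def rand_tensor(rng, *shape):
    """Nested lists of uniform random values (stand-in for learned weights)."""
    if len(shape) == 1:
        return [rng.uniform(-1.0, 1.0) for _ in range(shape[0])]
    return [rand_tensor(rng, *shape[1:]) for _ in range(shape[0])]

rng = random.Random(0)
x = rand_tensor(rng, 2, 8)                 # 2 input channels, 8 time steps
y = aac_block(
    x,
    rand_tensor(rng, 3, 2, 3),             # conv: 3 output channels, kernel size 3
    rand_tensor(rng, 4, 2),                # query projection, d_k = 4
    rand_tensor(rng, 4, 2),                # key projection, d_k = 4
    rand_tensor(rng, 2, 2),                # value projection, 2 output channels
)
print(len(y), len(y[0]))                   # 5 8
```

The convolutional branch sees only a local window of three time steps, whereas every attention output can draw on all eight, which is one way to read the abstract's pairing of local feature extraction with long-range dependencies. A practical model would use a tensor library and learned weights rather than nested lists.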

Stereo Feature Enhancement and Temporal Information Extraction Network for Automatic Music Transcription

DAFE-MSGAT: Dual-Attention Feature Extraction and Multi-Scale Graph Attention Network for Polyphonic Piano Transcription

Automatic Respiratory Sound Classification Via Multi-Branch Temporal Convolutional Network

Adaptive Multi-Scale TF-net for High-Resolution Time-Frequency Representations

Multi-Instrument Polyphonic Melody Transcription Based on Deep Learning

Improved Architecture for High-resolution Piano Transcription to Efficiently Capture Acoustic Characteristics of Music Signals

Multitrack Music Transcription with a Time-Frequency Perceiver

Musical Tempo Estimation Using a Multi-scale Network

Audio-Based Music Classification with DenseNet And Data Augmentation

Frequency-Temporal Attention Network for Singing Melody Extraction

Attention-Based Deep Spiking Neural Networks for Temporal Credit Assignment Problems

Triplet Convolutional Network for Music Version Identification

A holistic approach to polyphonic music transcription with neural networks

Hierarchic Temporal Convolutional Network With Cross-Domain Encoder for Music Source Separation

Unaligned Supervision For Automatic Music Transcription in The Wild

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Hierarchical Attentive Deep Neural Networks for Semantic Music Annotation Through Multiple Music Representations

MuSLCAT: Multi-Scale Multi-Level Convolutional Attention Transformer for Discriminative Music Modeling on Raw Waveforms

Harmonic Frequency-Separable Transformer for Instrument-Agnostic Music Transcription

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer