Deep learning technology has gained popularity in recent years, making our daily lives easier in a variety of ways. Beyond visual images and videos, numerous success stories in animal sound9,10,11,12 and the music information retrieval area13 have quickly spread. With the development of deep learning technology, affective computing on multimedia content has gained a lot of interest during the last 5 years. Since there are few studies of music videos, we compare and contrast them in detail in this section.
Many supervised automatic music emotion classifiers have been proposed in which manual annotation guides the system14,15,16,17,18; but the emotion annotation is a relatively costly and complex task. In contrast, an unsupervised approach does not require hard annotation. The system automatically learns from input data characteristics. If a neural network is trained on a large unlabeled dataset of music, the system automatically builds a generality to distinguish musical components. However, only a few unsupervised methods have been proposed for music emotion classification19,20,21.
As far as we know, no research has been done on video emotion analysis in the context of music video content. We review some of the supervised and unsupervised video processing techniques that have been developed; our model builds on prior art in video analysis. Supervised video classification is widely used in computer vision22,23,24,25. Several studies26,27,28,29,30 illustrate results of emotion analysis of video. Recent studies explore ways to deal with the spatiotemporal information of video6,31,32. The authors of31 proposed a system for memory-efficient video classification using cluster and aggregate models. Facial expression recognition (FER) is a sub-domain of human affective computing that uses face geometry and texture information to recognize human emotion. The majority of the studies in FER33,34,35 relied on facial context information and dominant facial landmarks. The quality of the face image, the camera distance, and the depiction of facial landmarks are the most important tuning parameters for FER. Yagya et al.36 proposed a supervised technique for music video emotion analysis that incorporates audio, video, and facial data. The study investigates the importance of musical elements for affective computing in music videos, and the quality of the face image is important for determining the overall music video emotion. SmallBigNet32 presented a method for dealing with the various perspectives in video, in which a big view directs a small view branch in 3D feature space. The video-domain works discussed so far address only supervised classification. Research has also been conducted on unsupervised video classification37,38,39 using convolutional and recurrent neural networks. Nomiya et al.40 proposed unsupervised emotion classification in video using the Gaussian mixture model (GMM).
Many supervised studies have used hand-crafted spatiotemporal features based on optical flow to capture the motion information of video. This method typically includes histograms of flow41, motion boundary histograms42, and trajectories43 for action recognition in video data. However, it is methodologically unsatisfactory given that optical flow is a hand-designed representation, and two-stream methods are often not learned end-to-end jointly with the flow. The slow–fast network in6 is based on the idea of preserving the spatial and temporal information of video using end-to-end learning. The slow branch captures static but semantically meaningful features, whereas the fast branch captures the temporal information of the video sequence. Xiao et al.44 demonstrated that a slow–fast video network with audio improves video action classification and detection, but they evaluated audio only on the fast pathway. In this study, we extend this concept with audio and video in both the slow and fast paths, and the learned information of both branches is boosted and shared using the MMTM module. The MMTM shares the information from the two modalities and boosts it using the squeeze-and-excitation45 method.
In the case of music video emotion analysis, supervised methods have been conducted using the conventional approach46,47 and deep learning approach5. At the time of writing, the authors are not aware of any unsupervised methods for music video emotion classification. We provide an unsupervised music video representation approach as well as an automatic music video emotion classification method in this work.
Music video emotion dataset
Unlabeled dataset
A large dataset with precise annotation has become increasingly important for training the huge number of parameters of a data-hungry deep learning model. However, the data annotation process is relatively difficult and costly in the field of affective computing. For music videos in particular, the task is more challenging because it includes multiple information sources, each with its own emotion representation paradigm. Annotating music videos is challenging even for an expert because emotion itself is subjective, dynamic across time, requires different emotion representation schemes for individual music video components (lyrics, music, video), and is influenced by culture. Therefore, data scarcity appears when affective computing algorithms are data driven. Unsupervised representation helps the data-driven method minimize data scarcity and annotation difficulties. An unsupervised network can use the massive amount of raw data floating around the Internet and expanding day by day. Classifying the human emotions contained in a music video without annotation is the most economical approach in a case like music video, where the labeling task is expensive and complicated.
To train the unsupervised neural network, we collected a large number of music videos from YouTube. A 30-s video clip was selected from each music video. The video clips were cut from a random temporal location of the full-length music video and downloaded using an automated Python script. We set a search key related to music videos and downloaded both official and user-generated music videos. We filtered the clips because some were not related to music videos and some had only music (no visual dynamics). The data was filtered to exclude reviews, interviews, and other music-related speech. Finally, our sample contained 0.13 million music videos for unsupervised training. The collected data were further processed using audio noise reduction techniques.
Labelled dataset
Several datasets have previously been proposed for supervised music emotion analysis. Some datasets48,49,50 follow the categorical model51, providing several discrete categories of emotions, and other datasets14,52,53 use the dimensional model54 to represent emotion as a value in 2D valence and arousal space. Similarly, like music, video emotion datasets55,56 have been proposed using the categorical model and others57,58 have used dimensional models.
The DEAP dataset59 was the first music video dataset available, with 120 data samples taken from western music. The dimensional emotion representation model was used to rate affective tags in terms of arousal, valence, and dominance, using values ranging from 1 to 9. The dataset does not include enough samples to support data-driven algorithms like deep neural networks. To overcome the data scarcity problem, we prepared a music video emotion dataset with the contribution of four annotators. The dataset is an extension of5 with the same emotion representation framework of six emotion categories, namely: Excited, Fear, Neutral, Relaxation, Sad, and Tension. The modified music video dataset includes training, validation, and test splits of 4788, 655, and 300 video samples, respectively. The test samples were equally distributed over the six class categories for a fair comparison. This new music video emotion dataset (MVED) has been used in36 and made publicly available.
We categorized the dataset into six distinct classes based on their corresponding emotional adjectives. The “Excited” class usually includes positive emotions. The visual elements of the “Excited” class include images of a smile on a face, movement of arms, dancing scenes, high lighting, and coloring effects. The audio components of this class include high pitch, large pitch variation, uniform harmony, high volume, low spectral repetition, and diverse articulations, ornamentation, and vibrato. The visual features of the “Fear” class reflect negative emotions via a dark background, unusual appearance, wide eyes, an open mouth, a visible pulse on the neck, elbows directed inward, and crossed arms. Common visual elements in the “Tension” class are fast-changing visual scenes; crowded scenes; people facing each other; aggressive facial expressions with large eyes and open mouths; and fast limb movement. The audio elements in the “Tension” and “Fear” classes include high pitch, high tempo, high rhythmic variation, high volume, and a dissonant complex harmony. The visual elements in the “Sad” class are closed arms, a face buried in one’s hands, hands touching the head, tears in the eyes, a single person in a scene, a dark background, and slow-changing scenes. The “Relaxation” class includes ethnic music and is visually represented with natural scenes in slow motion and single-person performances with musical instruments. The acoustic components of the “Sad” and “Relaxation” classes include slow tempo, uniform harmonics, soft music, and low volume. The “Neutral” class includes mixed characteristics from all the other five classes. The data samples in each class are diverse in terms of musical group, culture, nationality, language, number of music sources in one audio, and mood.
Data preprocessing
The raw music video data needed to be processed into an acceptable form for the neural network. Our dataset processing followed several steps for each individual data sample of music and video. The music network is trained on the real (magnitude) and imaginary (phase angle) components of the log magnitude spectrogram. The magnitude of the log Mel spectrogram was kept in one channel and the phase angle representation was placed on another channel to preserve both the magnitude and phase information of the acoustic signal. Several studies60,61,62 have demonstrated that phase information improves the performance of both speech and music processing.
For this work, the 30-s audio waveform x_i was converted to mono and then subsampled with a window size of 2048, a sampling rate of 22,050 Hz, and a time-shift parameter of 512 samples. The sampling rate varied for the slow–fast network, where x_i in the slow path had a sampling rate of 8 kHz and x_i in the fast path had a sampling rate of 32 kHz. The Fast Fourier Transform (FFT) was then applied to each window to transform x_i from the time domain to a time–frequency (T–F) representation X_i(t, f). From the entire frequency spectrum, 128 non-linear Mel-scale bands were selected to match the human auditory system. The use of the log Mel spectrogram has two benefits compared to waveform audio. First, it reduces the amount of data that needs to be processed by the neural network, and second, it is related to human auditory perception and the instrument frequency range63.
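As a rough sketch of this preprocessing, the two-channel time–frequency input can be computed as follows. This is numpy-only and illustrative: the crude Mel-style pooling stands in for a proper Mel filterbank (e.g. librosa's), and the exact padding, frame count, and scaling used in the paper may differ.

```python
import numpy as np

def two_channel_spectrogram(x, n_fft=2048, hop=512, n_mels=128):
    """Sketch of the two-channel audio input: log-magnitude Mel-style
    spectrogram in one channel, phase angle in the other."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)          # (frames, n_fft//2 + 1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Crude triangular-band pooling from 1025 bins down to n_mels;
    # a stand-in for a real Mel filterbank.
    edges = np.linspace(0, spec.shape[1], n_mels + 1, dtype=int)
    log_mel = np.stack([np.log1p(mag[:, a:b].mean(axis=1))
                        for a, b in zip(edges[:-1], edges[1:])], axis=0)
    mel_phase = np.stack([phase[:, a:b].mean(axis=1)
                          for a, b in zip(edges[:-1], edges[1:])], axis=0)
    return np.stack([log_mel, mel_phase], axis=-1)  # (n_mels, frames, 2)

sr = 22050
x = np.random.randn(30 * sr)                    # a toy 30-s mono clip
inp = two_channel_spectrogram(x)
print(inp.shape)                                 # (128, n_frames, 2)
```

With these parameters the frame count comes out near the 1292 temporal length mentioned later; the small difference is a matter of edge padding.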
The sequence of images in the video was collected in a distributed manner to preserve the temporal information of the entire video sequence. Each video was converted into a sequence of frames V_i = {v_1(τ), v_2(τ), v_3(τ), …, v_n(τ)}, where τ represents equal time intervals in the video sequence. For each sample, τ changes according to the total number n of frames that are extracted, as shown in Fig. 1. For the video network, 64 frames were taken in a distributed fashion. Video data was processed in a similar way to the audio data by varying the frame rate in the slow (8 frames) and fast (64 frames) branches of the slow–fast network.
Figure 1
Input video processing using distributed selection of frames. The video frames are from the first 30 s of the music video “Without Me” (https://www.youtube.com/watch?v=ZAfAud_M_mg).
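The distributed frame selection illustrated in Fig. 1 amounts to sampling n frame indices at equal intervals τ over the clip. A minimal sketch (the 25-fps clip length is a hypothetical example; actual frame rates vary per video):

```python
import numpy as np

def distributed_frame_indices(total_frames, n=64):
    """Pick n frame indices at equal time intervals tau over the whole
    clip; n is 64 for the fast branch and 8 for the slow branch."""
    tau = total_frames / n                    # equal time interval
    return (np.arange(n) * tau).astype(int)   # one frame per interval

# A 30-s clip at 25 fps has 750 frames; sample slow and fast inputs.
slow = distributed_frame_indices(750, n=8)
fast = distributed_frame_indices(750, n=64)
print(slow)
```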
After the preprocessing, the input to the audio network was A_N = ∑_{i=0}^{N} X_i and the input to the video network was V_N = ∑_{i=0}^{N} V_i, where N is the number of data samples used in one batch. The multimodal input was the integrated form of the audio and video inputs.
Proposed network
In this paper, we present multimodal learning using an unsupervised method. In multimodal representation, the complementary information provided by the different modalities is integrated to enhance the system’s capability. An autoencoder network is developed for multimodal representation using audio and video information. An autoencoder is a generative model that typically compresses the input into a latent bottleneck representation and uses it to reconstruct the input64. The CNN-based autoencoder is one of the popular unsupervised feature representation methods65,66. At the time of writing, we are unaware of any unsupervised music or music video emotion representation method using deep learning technology. In this study, an autoencoder was trained in the encoder–decoder paradigm, where the encoder network used two multimodal architectures with a dense residual block and a variety of convolutions. The decoder network is kept simple and lightweight using 2D/3D convolution. We discuss the encoder and decoder networks in detail after describing the convolution filters used in this study.
Convolution filter
In video processing, 3D convolution has been found to better capture spatial and motion information, but it exponentially increases the complexity of a system. Some popular 3D networks67,68 have accepted this complexity and, as a result, require a large dataset for successful training. In this paper, the complexity of 3D convolution was reduced using filter and channel separable convolution. The proposed convolution is an integrated form of channel separable convolution69 and (2 + 1)D convolution70. For the separable filter, the 3D convolution filter of size n × n × n was divided in 2D space into 1 × n × n and n × n × 1 factors, where n is the convolution filter size for one dimension. Separable filter and channel convolution was also used for the 2D audio network. The square filter of the 2D convolution was divided into a temporal filter (1 × n) and a spatial filter (n × 1), as in71. The channel size was reduced to one in the sequential block of the dense residual network for the separable channel. A detailed representation of the proposed filter and channel separable convolution is illustrated in the rightmost configuration in Fig. 2.
Figure 2
3D convolution and its variants in residual block representation.
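The parameter savings from factorizing the filter and separating the channels can be illustrated with rough counts. This is a back-of-the-envelope sketch: the intermediate channel width of the factorized variant is assumed equal to the output width, which may differ from the actual network configuration.

```python
def conv3d_params(cin, cout, n):
    # Full 3D convolution: one n*n*n kernel per (input, output) channel pair.
    return cin * cout * n ** 3

def factorized_params(cin, cout, n):
    # Kernel split into a 1 x n x n and an n x n x 1 factor, as in the text;
    # each factor needs only n*n weights per channel pair (intermediate
    # width assumed equal to cout).
    return cin * cout * n ** 2 + cout * cout * n ** 2

def channel_separable_params(cin, cout, n):
    # Depthwise variant: one n x n filter per input channel, followed by
    # a 1x1 pointwise mix across channels.
    return cin * n ** 2 + cin * cout

for f in (conv3d_params, factorized_params, channel_separable_params):
    print(f.__name__, f(64, 64, 3))
```

For a 64-to-64-channel block with n = 3, the factorized form already cuts the parameter count by a third, and adding channel separation reduces it by more than an order of magnitude.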
Encoder network
Two network architectures were used for information encoding, namely the “music-video encoder network” and “separable slow–fast encoder network”. Each network included an MMTM block for information sharing and feature enhancement. The music-video network used conventional 2D/3D convolution, whereas the separable slow–fast network used filter and channel separable convolution.
The music-video encoder network has two parallel branches for music and video processing, with an MMTM for information exchange after each dense residual block, as shown in Fig. 3. The music network input is a two-channel time–frequency representation of audio, where the first and second channels are the real and imaginary parts of the sinusoidal audio signal. The spectral representation has 128 frequency bins and a temporal length of 1292. Two-dimensional convolutional filters are used in each dense residual block. In each convolutional block, batch normalization and a rectified linear unit (ReLU) activation function are used for stable training. In the case of the video network, three-dimensional convolutional filters are used to capture both the spatial and temporal features of the video sequence. After each dense residual block, the output is smoothed using a convolution layer with batch normalization.
Figure 3
The music-video encoder network. Video stream (upper) and audio stream (lower) are connected with the MMTM information fusion block.
Inside each dense residual block, the max-pooling operation is used for dimension reduction in the two parallel branches. Each parallel branch is then added together and the result is smoothed with a convolutional layer. Finally, the output from each individual dense residual block of audio and video branch is passed to the MMTM block for information fusion, as shown in Fig. 4.
Figure 4
Detailed overview of the dense residual block of the audio and video networks (the values relate to the first block of the video network). The symbol ‘C’ indicates a convolutional layer, ‘K’ indicates kernel size, ‘BN’ indicates batch normalization, ‘ReLU’ indicates the rectified linear unit, and MMTM indicates the multimodal transfer module for multimodal information fusion. The lower-case values indicate the values assigned to the corresponding symbols.
The 2D/3D convolution in the music-video network increases the complexity, making it impractical to add more audio or video branches for end-to-end training. We drastically reduced the network complexity using filter and channel separable convolution and trained a slow–fast encoder network with audio and video information. Slow–fast networks can be described as a single stream that captures both spatial and motion information. The slow path is designed to capture more static but semantically rich information, whereas the fast path is tasked with capturing fast motion. We used the slow–fast representation for both the audio and video networks because both media are spatially meaningful over time. Both slow and fast network branches are trained in parallel with information sharing using MMTM after each dense residual block. The MMTM module helps to modulate the audio and video correspondence over time. The architectural detail of the slow branch of the music video emotion network is shown in Fig. 5a. The wave audio is sampled at 8 kHz for the slow path of the audio network, giving an input of size (128, 469, 2), where one channel is the log Mel spectrogram and the other is the phase angle. The slow video path includes 8 image frames with RGB color channels.
Figure 5
(a) Slow branch of the separable slow–fast music video network. (b) Fast branch of the separable slow–fast music video network.
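The MMTM-style fusion applied after each dense residual block can be sketched in numpy: each modality is squeezed by global average pooling, the pooled vectors are mixed through a shared bottleneck, and each modality's channels are re-scaled with a sigmoid gate, following the squeeze-and-excitation idea45. The weight shapes, bottleneck ratio, and random weights below are illustrative assumptions, not the trained module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mmtm_fuse(a_feat, v_feat, ratio=4, rng=np.random.default_rng(0)):
    """Sketch of MMTM-style fusion between an audio feature map (C_a, T, F)
    and a video feature map (C_v, T, H, W). Random weights stand in for
    learned ones; the real module is trained end-to-end."""
    ca, cv = a_feat.shape[0], v_feat.shape[0]
    # Squeeze: global average pooling over all non-channel axes.
    sa = a_feat.mean(axis=tuple(range(1, a_feat.ndim)))
    sv = v_feat.mean(axis=tuple(range(1, v_feat.ndim)))
    joint = np.concatenate([sa, sv])               # (C_a + C_v,)
    # Shared bottleneck, then one excitation branch per modality.
    w_sq = rng.standard_normal(((ca + cv) // ratio, ca + cv)) * 0.1
    z = np.maximum(w_sq @ joint, 0.0)              # ReLU
    gate_a = sigmoid(rng.standard_normal((ca, z.size)) * 0.1 @ z)
    gate_v = sigmoid(rng.standard_normal((cv, z.size)) * 0.1 @ z)
    # Excite: channel-wise re-scaling of each modality.
    a_out = a_feat * gate_a.reshape(ca, *([1] * (a_feat.ndim - 1)))
    v_out = v_feat * gate_v.reshape(cv, *([1] * (v_feat.ndim - 1)))
    return a_out, v_out

a = np.random.randn(32, 16, 16)      # toy audio features (C, T, F)
v = np.random.randn(64, 4, 8, 8)     # toy video features (C, T, H, W)
a2, v2 = mmtm_fuse(a, v)
print(a2.shape, v2.shape)            # shapes are preserved
```

Because the output shapes match the inputs, the module can be dropped between any pair of dense residual blocks without changing the rest of either branch.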
The fast branch of the slow–fast network has a similar network structure to the slow branch. However, the fast branch uses more video frames (64 vs. 8) and a higher audio sampling rate (32 kHz vs. 8 kHz) than the slow branch. Another difference is the size of the filter in the convolutional layer inside each dense residual block. To capture the motion information of the video, we process the input with a large filter size and a low feature dimension. The architectural detail of the fast branch of the network is shown in Fig. 5b.
In both fast and slow branches, the 3D convolution is limited to only one layer after a dense residual block for smoothing the learnt features. The dense residual block uses only a 2D convolutional layer with dense connection to the input so the network parameters are relatively low and easy to train in an end-to-end fashion. The detailed view of each dense residual block with filter and channel separable convolution is shown in Fig. 6.
Figure 6
Detailed overview of the dense residual block of the audio and video slow–fast network (the first block of video network is illustrated here). The symbol description is similar as in Fig. 4.
Decoder network
In an autoencoder, the latent representation of the input is reconstructed as output using a decoder network. A decoder network generally has the same structure as the encoder in reverse order, from latent representation to input, but other structures are possible. The goal of the decoder network is the reconstruction of the encoded data in the original input form. The two decoder networks in this study, the “music-video decoder network” and the “separable slow–fast decoder network”, are designed to be simple and lightweight to reduce the number of network parameters of the multimodal architecture.
The music-video decoder network is designed to reconstruct the audio and video sequences in separate branches. A simple 2D/3D transpose convolutional layer with batch normalization and a ReLU non-linear activation function is used to upsample the features from the latent space. The kernel size and strides are adjusted separately for each transpose convolution layer to make the final feature size equal to the input. The final video size is the same as the input, with three RGB channels, and the audio size is the same as the input log Mel spectrogram with phase information. The proposed decoder network introduces four auxiliary outputs for the video network and three auxiliary outputs for the audio network in the expanding pathway, with the intention of improving gradient propagation and decreasing the probability of vanishing gradients in the deep audio–video multimodal networks. The multiple auxiliary outputs act as a kind of deep supervision and minimize the overall loss functions72,73. The detailed architecture of the video and audio decoder networks with multiple auxiliary outputs is shown in Fig. 7a,b respectively.
Figure 7
(a) The 3D video decoder network with multi-stage loss. (b) The 2D audio decoder network with multi-stage loss.
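How the kernel size and stride of each transpose convolution are chosen so the decoder reaches the input size can be seen from the standard output-length formula. The kernel/stride progression below is a hypothetical example, not the paper's exact decoder configuration.

```python
def transpose_conv_out(size, kernel, stride, pad=0, out_pad=0):
    """Output length of a transposed convolution along one axis
    (standard formula: (size - 1) * stride - 2 * pad + kernel + out_pad)."""
    return (size - 1) * stride - 2 * pad + kernel + out_pad

# Upsampling a latent axis of length 8 back toward an input axis of 128:
size = 8
for kernel, stride in [(4, 2), (4, 2), (4, 2), (4, 2)]:
    size = transpose_conv_out(size, kernel, stride, pad=1)
print(size)  # 8 -> 16 -> 32 -> 64 -> 128
```

Each axis of the audio and video decoders can be tuned independently this way until the reconstruction matches the input dimension exactly.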
The separable slow–fast decoder network has two audio network branches and two video network branches without information sharing (MMTM block) across the audio and video networks. The decoder architecture for the audio and video network in each slow and fast branch reconstructs the respective input dimension from its latent representation. The dimensions of the latent space and the output of each audio and video decoder for the slow and fast paths are shown in Fig. 8a,b respectively.
Figure 8
(a) The slow branch of the separable slow–fast decoder network. (b) The fast branch of the separable slow–fast decoder network.
Fine-tune with labelled dataset
The unsupervised features of music video emotion can be useful in the initial phase of training, but human-annotated data is required for truly reliable evaluation. To exploit our unsupervised networks for real-world emotion analysis of music videos, the network was fine-tuned with the labelled data using the six emotion categories. The goal of fine-tuning the unsupervised features is to identify the emotion class to which a music video belongs. Given a fixed set of m classes c_1, c_2, c_3, …, c_m ∈ C, audio A_N = ∑_{i=0}^{N} X_i, and video V_N = ∑_{i=0}^{N} V_i with batch size N, we are interested in predicting the probability P{c_i | (A, V)} for each of the m classes. This probability can be parameterized using the proposed multimodal model M, which looks at the joint representation of the log Mel spectrogram of the audio and the video frames of the music video to predict: P{c_i | (A, V)} = M{X_i(t, f), (v_1(τ), v_2(τ), v_3(τ), …, v_n(τ))}.
The encoder network was fine-tuned with an additional global average pooling and a Softmax layer at the end. We used only five dense residual blocks of the music-video encoder network for the supervised training. The features from the audio and video networks are concatenated after global average pooling. Finally, the predicted logits are passed through a Softmax activation function that pushes the most probable result closer to 1 while pushing the others closer to 0.
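The fine-tuning head described above can be sketched as follows. Random weights stand in for the learned dense layer, and the feature-map sizes are toy values, not the actual encoder output shapes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(audio_feat, video_feat, w, b):
    """Sketch of the fine-tuning head: global average pooling on each
    branch, concatenation, then a softmax over the six emotion classes."""
    pooled = np.concatenate([
        audio_feat.mean(axis=tuple(range(1, audio_feat.ndim))),  # GAP, audio
        video_feat.mean(axis=tuple(range(1, video_feat.ndim))),  # GAP, video
    ])
    return softmax(w @ pooled + b)   # w, b: the (learned) dense layer

classes = ["Excited", "Fear", "Neutral", "Relaxation", "Sad", "Tension"]
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 16, 16))     # toy audio feature map (C, T, F)
v = rng.standard_normal((64, 4, 8, 8))    # toy video feature map (C, T, H, W)
w, b = rng.standard_normal((6, 96)) * 0.1, np.zeros(6)
p = classify(a, v, w, b)
print(classes[int(p.argmax())])           # most probable emotion class
```

The softmax output is a proper distribution over the six classes, so the predicted label is simply the argmax.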