Active Learning for Sound Event Classification Using Bayesian Neural Networks with Gaussian Variational Posterior

Manual annotation of audio material is cumbersome. Active learning aims at minimizing the annotation effort by iteratively selecting an acquisition batch of unlabeled data, asking a human to annotate the selected data and re-training a classifier until an annotation budget is depleted. In this paper we propose the Gaussian-dense active learning (GDAL) algorithm to train a sound event classifier. The classifier is a Bayesian neural network where the weights are normally distributed. This is in contrast to conventional neural networks where weights are not distributed, but have assigned values. The Bayesian nature of the classifier empowers GDAL to select acquisition batches from a set of unlabeled audio clips based on their estimated informativeness. Evaluation results on the UrbanSound8k dataset show that GDAL outperforms a state-of-the-art algorithm based on medoid active learning for all considered annotation budgets and an algorithm based on dropout active learning for sufficiently large annotation budgets.


INTRODUCTION
Sound event classification [1] aims at distinguishing events or situations based on properties of audio signals [2]. Its applications include wildlife [3] and environmental [4] monitoring as well as healthcare [5]. Training a sound event classifier requires a sufficiently large collection of annotated audio material. Providing manual annotations for such an audio corpus is often the most time-consuming part of the entire process of building a sound event classifier.
In active learning (AL) [6,7], the annotation process is integrated into the training process: the model learns from labels provided by a human annotator on the fly. The AL process iterates between selecting an acquisition batch of unlabeled data, collecting the annotations, and re-training the model on the updated dataset. As a result, the annotator does not need to label the entire dataset, but only the data that was selected by the AL algorithm.
Bayesian AL [8,9] is a subset of AL in which the trained model is Bayesian. A Bayesian model allows the AL algorithm to select acquisition batches based on the estimated informativeness of unlabeled data. A popular type of Bayesian model is the Bayesian neural network (BNN) [10-14].
Bayesian AL algorithms have been deployed on multiple occasions to solve computer vision problems [8, 15-17]. However, they have not yet been applied to the sound event classification problem. This paper proposes Gaussian-dense active learning, denoted as GDAL in the following, a Bayesian AL algorithm that trains a sound event classifier.
GDAL makes use of transfer learning and semi-supervised learning to train a label-efficient BNN on a corpus of initially unlabeled sound clips. Semi-supervised learning is implemented by tailoring the Bayesian prior around the empirical data distribution, an approach not yet explored in the context of AL with BNNs. Evaluated on the UrbanSound8k dataset [18], GDAL consistently beats the MAL-PANN [19] baseline, while outperforming a modification of the DAL [19] baseline for sufficiently large annotation budgets.

ACTIVE LEARNING WITH BAYESIAN NEURAL NETWORKS
Developing an accurate classifier requires a collection of data where a sufficiently large portion of the dataset is annotated with the respective class labels. Collecting manual labels for the data can be tedious. To address this, AL is a machine learning paradigm which exploits the idea that a machine learning algorithm can be more label-efficient if it is allowed to choose which data is to be annotated [6].
Given is a set of class labels $C$ and a dataset $D = \{a_i\}_{i=1 \ldots |D|}$ of unlabeled audio clips $a_i$, i.e. signals, with $|D|$ denoting the cardinality of $D$. The goal of an AL algorithm is to train a classifier that can accurately predict the respective class label of any clip $a \in D$. To do so, an AL algorithm is allowed to request an annotator to assign labels for up to $A$ clips, where $A$ is the so-called annotation budget. An AL algorithm does not have to request all $A$ labels at once, but can do so iteratively, by querying smaller acquisition batches and re-training the classifier on each iteration. Whenever a clip $a \in D$ is annotated with a class label $c \in C$, the tuple $(a,c)$ is added to the labeled dataset $L = \{(a_i, c_i)\}_{i=1 \ldots |L|}$. We denote the size of an acquisition batch as $\Delta|L|$.
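The iterative procedure described above can be sketched as a generic pool-based loop. This is a minimal illustration, not the implementation used in the paper: the function names, the random selection rule and the majority-class "classifier" are placeholder assumptions that only show how the budget, the acquisition batch size and the labeled set interact.

```python
import random

def active_learning_loop(pool, oracle, budget, batch_size, select, train):
    """Select, annotate and re-train until the annotation budget is spent."""
    labeled = {}                      # clip index -> label, plays the role of L
    model = train(pool, labeled)
    while len(labeled) < budget:
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Never request more labels than the remaining budget allows.
        k = min(batch_size, budget - len(labeled), len(unlabeled))
        for i in select(model, pool, unlabeled, k):
            labeled[i] = oracle(i)    # simulated human annotation
        model = train(pool, labeled)  # re-train on the updated labeled set
    return model, labeled

# Placeholder components (assumptions for illustration only).
def random_select(model, pool, unlabeled, k):
    return random.sample(unlabeled, k)

def majority_train(pool, labeled):
    labels = list(labeled.values())
    return max(set(labels), key=labels.count) if labels else None
```

An informativeness-based algorithm would replace `random_select` with an acquisition function that inspects the current model.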
Bayesian AL is a subclass of AL algorithms where the classifier does not define a single mapping from input $a$ to class probabilities, but is capable of representing a Bayesian distribution over mappings. By defining some prior distribution $p(w)$ over mappings $w$, and observing human annotations in $L$, the posterior distribution $p(w|L)$ over mappings can be constructed via the Bayes rule
$$p(w|L) = p(w) \prod_{(a,c) \in L} p(c|w,a) \cdot \mathrm{const}, \tag{1}$$
where $p(c|w,a)$ is the likelihood of the observed annotation (class label) $c$ for input $a$ predicted by a mapping $w$. A Bayesian AL algorithm selects unlabeled data for annotation based on the Bayesian posterior distribution: a data sample is deemed more informative when the different mappings in the Bayesian posterior predict different classes [8,9,15]. A popular family of models capable of representing a Bayesian distribution over mappings are Bayesian neural networks (BNNs) [10,12,20]. For these, a mapping from input to class probabilities corresponds to a configuration of network weights. While the Bayesian posterior $p(w|L)$ in (1) is not tractable, it can be approximated. To this end, a BNN defines a variational distribution $q(w|\boldsymbol{\theta})$ over weights, which depends on some parameter set $\boldsymbol{\theta}$. This variational distribution is fitted towards the Bayesian posterior $p(w|L)$ by finding a $\boldsymbol{\theta}$ that minimizes an objective that is based on the evidence lower bound (ELBO) [20]
$$\mathcal{L}(\boldsymbol{\theta}) = \mathrm{KL}\big(q(w|\boldsymbol{\theta}) \,\|\, p(w)\big) - \mathbb{E}_{q(w|\boldsymbol{\theta})}\Big[\sum_{(a,c) \in L} \log p(c|w,a)\Big], \tag{2}$$
where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback-Leibler divergence and $\mathbb{E}$ denotes the expectation.
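To make the Bayes rule (1) concrete, the following minimal sketch evaluates it for a finite set of candidate mappings, where the normalization constant can be computed exactly. The two toy mappings, the function names and the likelihood values are illustrative assumptions; in a BNN the set of mappings is continuous and the posterior can only be approximated.

```python
def posterior_over_mappings(prior, likelihood, labeled):
    """Posterior p(w|L) over a finite set of mappings.

    prior:      list with p(w) for each mapping index w
    likelihood: function (w, a, c) -> p(c|w,a)
    labeled:    list of (a, c) tuples, i.e. the labeled set L
    """
    unnormalized = []
    for w, p_w in enumerate(prior):
        value = p_w
        for a, c in labeled:
            value *= likelihood(w, a, c)   # product over all annotations
        unnormalized.append(value)
    z = sum(unnormalized)                  # the constant in the Bayes rule
    return [u / z for u in unnormalized]

# Hypothetical toy setup: mapping 0 favours class 0, mapping 1 favours class 1.
def toy_likelihood(w, a, c):
    return 0.9 if c == w else 0.1

posterior = posterior_over_mappings(
    [0.5, 0.5], toy_likelihood, [("clip_1", 0), ("clip_2", 0)]
)
```

After two annotations of class 0, almost all posterior mass moves to the mapping that favours class 0.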
While BNNs have been employed for AL tasks, the algorithm proposed in this paper, GDAL (cf. Section 3 for more details), is the first application of a BNN to solve the AL task in the domain of sound event classification.

GAUSSIAN-DENSE ACTIVE LEARNING (GDAL)
GDAL comprises a feature extractor, a BNN classifier and an acquisition mechanism. The feature extractor (cf. Section 3.1) makes use of transfer learning by feeding clips $a_i$ into a pre-trained neural network and obtaining the respective feature vectors $x_i$.
The classifier (cf. Section 3.2) operates on the feature vectors. It models the Bayesian posterior distribution over mappings from a feature vector to the respective predicted label. The model learns from the empirical clip distribution by explicitly incorporating it into the Bayesian prior. To our knowledge, such an approach has not yet been explored in the context of AL with BNNs.
For each iteration of the AL process, the acquisition mechanism (cf. Section 3.3) is invoked to select a batch of unlabeled clips for annotation, and the classifier is re-trained on the updated dataset. The acquisition mechanism incorporates two different approaches, a geometric approach and an information-theoretic approach, to find the most informative unlabeled clips for annotation.

Feature extractor
GDAL uses a pre-trained audio neural network (PANN) [21] to generate a 2048-dimensional feature vector $x$ for each clip $a$ in the dataset $D$. As a result, all signals $a_i$ are transformed into feature vectors $x_i$, upon which the classifier (Section 3.2) and the acquisition mechanism (Section 3.3) operate.

Classifier
GDAL employs a Bayesian classifier which models different mappings from input clip $a$ to class probabilities. This allows GDAL to quantify the informativeness of unlabeled clips and incorporate this quantity into the acquisition mechanism (Section 3.3).
The classifier is shown in Figure 1. An audio clip $a$ is transformed into the respective feature vector $x$. The feature vector passes through a dense layer followed by a softmax layer to produce a class probability distribution. The dense layer in Figure 1 multiplies a weight matrix $W$ with a feature vector $x$. Crucially, instead of assigning deterministic values to the entries of $W$, the weight matrix is sampled from a probability distribution. Throughout the AL process, this probability distribution is fitted towards the Bayesian posterior distribution over $W$. In other words, GDAL trains a BNN (cf. Section 2) where the weights, denoted as $w$ in (1) and (2), are given by the weight matrix $W$.
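The forward pass described above can be illustrated in a few lines. This is a sketch under stated assumptions rather than the paper's code: the function names are hypothetical, the entries of $W$ are drawn independently from Gaussians with given means and standard deviations, and the dimensions are tiny instead of the 2048 PANN features and 10 classes.

```python
import math
import random

def softmax(z):
    m = max(z)                                   # subtract max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sample_prediction(x, mu, sigma, rng):
    """Draw one weight matrix W with W_ij ~ N(mu_ij, sigma_ij^2), return softmax(W x)."""
    logits = []
    for mu_row, sigma_row in zip(mu, sigma):     # one row of W per class
        w_row = [rng.gauss(m, s) for m, s in zip(mu_row, sigma_row)]
        logits.append(sum(w * xi for w, xi in zip(w_row, x)))
    return softmax(logits)

rng = random.Random(0)
mu = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]        # 3 classes, 2 features
sigma = [[1.0, 1.0]] * 3
x = [2.0, -1.0]
# Each call samples a new W, so repeated predictions for the same x differ.
predictions = [sample_prediction(x, mu, sigma, rng) for _ in range(5)]
```

The spread of such repeated predictions is what the acquisition mechanism later uses as a signal of informativeness.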

Bayesian prior & data-driven regularization
To infer the Bayesian distribution of the network weights $W$ in the dense layer in Figure 1, a prior distribution $p(W)$ over those weights needs to be defined. To make the model learn from the unlabeled data that is typically abundant in AL settings, we propose a prior (3) which explicitly incorporates the empirical data distribution. In (3), $i$ and $j$ denote the rows and columns of $W$. The first factor in (3) is the Gaussian prior over network weights $W$, widely used in the literature [12,20] to apply weight regularization, i.e. to enforce small weights. It is regulated by the standard deviation $\sigma_p$ of the Gaussian term: a smaller $\sigma_p$ leads to more weight decay.
The second factor is a product over all clips in the dataset $D$ and makes the prior favor weight matrices $W$ that result in confident class predictions for all labeled and unlabeled clips. This can be seen as a generalization of pseudo-label training [22] to BNNs. The strength of this data-driven regularization is determined by the non-negative parameter $\lambda$: a higher $\lambda$ yields a stronger prior preference for weight matrices that give confident class predictions. Although the general idea to incorporate data into the prior is not new [20], to our knowledge, a data-driven regularization as in (3) has not yet been explored for BNNs in AL tasks.

Variational Bayesian posterior
To approximate the Bayesian posterior over weights $W$ in Figure 1, the variational distribution $q(W|\boldsymbol{\theta})$ needs to be parameterized. To this end, GDAL employs a fully factorized Gaussian variational posterior [12,20,23]. For that, the variational parameters $\boldsymbol{\theta}$ in (2) are defined as $\boldsymbol{\theta} = (\Theta_\mu, \Theta_\sigma)$, i.e. a tuple of two matrices, each with the same dimensions as $W$. The variational distribution is
$$q(W|\boldsymbol{\theta}) = \prod_{i,j} \mathcal{N}\big(W_{ij} \,\big|\, \Theta_{\mu,ij},\, \sigma_{ij}^2\big), \tag{4}$$
where the standard deviation in (4) is defined by applying the softplus function
$$\sigma_{ij} = \log\big(1 + \exp(\Theta_{\sigma,ij})\big) \tag{5}$$
to $\Theta_\sigma$ as in [20]. In other words, the weights in the dense layer are distributed independently and normally, whereby the mean is defined by the parameter $\Theta_\mu$, and the variance by $\Theta_\sigma$.
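The parameterization of a single weight can be sketched as follows. The softplus keeps the standard deviation positive for any real-valued parameter. Drawing the weight as mean plus standard deviation times standard normal noise is the usual reparameterization for Gaussian variational posteriors; that GDAL samples in exactly this way is an assumption, and the function names are hypothetical.

```python
import math
import random

def softplus(r):
    # log(1 + exp(r)), written so that large |r| does not overflow
    return max(r, 0.0) + math.log1p(math.exp(-abs(r)))

def sample_weight(theta_mu, theta_sigma, rng):
    """Draw one weight from N(theta_mu, softplus(theta_sigma)^2)."""
    sigma = softplus(theta_sigma)       # always positive
    eps = rng.gauss(0.0, 1.0)           # parameter-free noise
    return theta_mu + sigma * eps       # differentiable in both parameters

rng = random.Random(0)
draws = [sample_weight(2.0, 0.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)   # should be close to theta_mu = 2.0
```

Because the noise does not depend on the parameters, gradients of a loss computed from the sampled weight can flow back to both variational parameters.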

Optimization
BNNs are trained by minimizing the objective (2). The prior (3) in conjunction with the variational parameterization (4) amounts to the objective (6), which is defined over the set of clips $D$ and the labeled set $L$. The first term in (6) is the KL divergence between the Gaussian variational posterior (4) and the Gaussian term in the prior (3). The second term results from the KL divergence between the variational posterior and the data-driven regularization term in (3). The third term is the likelihood contribution to the objective.
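The KL divergence between two univariate Gaussians has a well-known closed form, which is why the first term needs no sampling. The sketch below assumes a zero-mean Gaussian prior with standard deviation $\sigma_p$ (consistent with a prior that enforces small weights) and sums the per-weight divergences, which is valid for a fully factorized posterior. The function names are hypothetical.

```python
import math

def kl_to_zero_mean_prior(mu, sigma, sigma_p):
    """KL( N(mu, sigma^2) || N(0, sigma_p^2) ) for one scalar weight."""
    return (math.log(sigma_p / sigma)
            + (sigma ** 2 + mu ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def kl_factorized(mu, sigma, sigma_p):
    """Sum of per-weight KL terms for matrices given as lists of rows."""
    return sum(
        kl_to_zero_mean_prior(m, s, sigma_p)
        for mu_row, sigma_row in zip(mu, sigma)
        for m, s in zip(mu_row, sigma_row)
    )
```

The divergence is zero only when the posterior of a weight coincides with the prior, and it grows as the mean moves away from zero, which is the weight-decay effect controlled by $\sigma_p$.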
Unlike the first term, the second and the third cannot be computed analytically, and are estimated via Monte Carlo sampling. First, a minibatch of $B$ clips is sampled, where one half is drawn from $D$ and used to compute the second term, and the other half is drawn from $L$ to compute the third term. For each sampled clip, the weight matrix $W$ is sampled $n_\mathrm{train}$ times. The loss computed from a minibatch is backpropagated to $(\Theta_\mu, \Theta_\sigma)$, and the parameters of the variational distribution are updated via Adam. On each iteration of the AL process, the parameter update is repeated several times.

Acquisition mechanism
GDAL employs a two-stage approach to select unlabeled clips for annotation. To select the first $K$ clips, K-medoid clustering [24] is done in the feature space, and the medoid of each of the $K$ clusters is annotated. The distance metric $s$ between two feature vectors $x_i$ and $x_j$ is based on the cosine similarity. After the initial acquisition of $K$ annotations, each further acquisition is done via the BatchBALD algorithm [15], which employs the Bayesian model described in Section 3.2 to select an acquisition batch of $\Delta|L|$ clips that maximizes the mutual information between the weight matrix $W$ and the class label $c$, thus seeking the most informative clips.
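The mutual information between the weights and the label of a single clip can be estimated from repeated sampled predictions as the entropy of the averaged prediction minus the average entropy of the individual predictions. The sketch below shows only this single-clip score; BatchBALD [15] extends it to the joint mutual information of a whole batch, which is not reproduced here. The function names are hypothetical.

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0.0)

def mutual_information(sampled_probs):
    """Single-clip estimate of I(W; c) from sampled class-probability vectors.

    sampled_probs: one probability vector per sampled weight matrix W.
    """
    n = len(sampled_probs)
    mean_p = [sum(col) / n for col in zip(*sampled_probs)]
    expected_entropy = sum(entropy(p) for p in sampled_probs) / n
    return entropy(mean_p) - expected_entropy

# Sampled mappings agree: low score even though each prediction is uncertain.
agree = mutual_information([[0.9, 0.1], [0.9, 0.1]])
# Sampled mappings are confident but contradict each other: high score.
disagree = mutual_information([[1.0, 0.0], [0.0, 1.0]])
```

The score is high exactly when different mappings from the posterior predict different classes, matching the notion of informativeness from Section 2.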

BASELINE ALGORITHMS
In Section 5.5, GDAL's performance is compared against two state-of-the-art AL algorithms for sound classification: one based on medoid active learning (MAL) and another based on dropout active learning (DAL).
MAL [25] splits sound clips into clusters by performing K-medoid clustering [24] with mean cluster size $\kappa$. Given the annotation budget $A$, the medoids of the $A$ largest clusters are annotated, and the medoid labels are propagated to the other clips in the respective clusters. A support vector machine classifier is then trained on ground truth and propagated labels. While the original MAL algorithm operates on features that are based on mel frequency cepstral coefficients, a PANN-embedding-based modification, denoted as MAL-PANN [19], has been shown to achieve significantly better performance. MAL-PANN is evaluated for the mean cluster size $\kappa = 4$, the same as in [25].
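The annotation and label propagation step of MAL can be sketched as follows, assuming the clustering has already been computed. The data layout (clusters as lists of clip indices with one medoid index each) and the function name are illustrative assumptions; the clustering itself and the support vector machine are omitted.

```python
def propagate_medoid_labels(clusters, medoids, oracle, budget):
    """Annotate the medoids of the largest clusters and propagate their labels.

    clusters: list of lists of clip indices
    medoids:  medoids[k] is the medoid clip index of clusters[k]
    oracle:   function clip index -> class label (the human annotator)
    budget:   annotation budget A
    """
    by_size = sorted(range(len(clusters)),
                     key=lambda k: len(clusters[k]), reverse=True)
    labels = {}
    for k in by_size[:budget]:          # only the A largest clusters
        c = oracle(medoids[k])          # one annotation per cluster
        for i in clusters[k]:
            labels[i] = c               # propagated to all cluster members
    return labels

clusters = [[0, 1, 2], [3], [4, 5]]
medoids = [1, 3, 4]
labels = propagate_medoid_labels(
    clusters, medoids, lambda i: "a" if i < 3 else "b", budget=2
)
```

With a budget of 2, the singleton cluster stays unlabeled, while five clips receive a label at the cost of two annotations.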
DAL [19] uses a classifier architecture similar to the one in Figure 1. Instead of defining a Gaussian distribution over weights, DAL achieves randomized predictions by applying random dropout masks to the PANN feature vector. Unlabeled clips are selected by making each prediction cast a vote in favor of one class, and picking the clip with the highest vote entropy, one clip per iteration of the AL process. Unlike GDAL, DAL incorporates unlabeled data into the training by assigning pseudo-labels, and training against those. For a fair comparison, DAL is modified to use the same acquisition batch size as in GDAL, and the initial acquisition is identical to GDAL's (i.e. based on K-medoid clustering).
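The vote entropy criterion can be sketched as follows: each randomized prediction votes for its most probable class, and the entropy of the resulting vote distribution scores the clip. The function name and the input layout are illustrative assumptions, and the dropout-based predictions themselves are not modelled here.

```python
import math
from collections import Counter

def vote_entropy(sampled_probs):
    """Entropy of the class votes cast by a set of randomized predictions."""
    n = len(sampled_probs)
    votes = Counter(
        max(range(len(p)), key=p.__getitem__)   # argmax = vote of one prediction
        for p in sampled_probs
    )
    return -sum((v / n) * math.log(v / n) for v in votes.values())

unanimous = vote_entropy([[0.6, 0.4], [0.9, 0.1], [0.7, 0.3]])
split = vote_entropy([[0.6, 0.4], [0.2, 0.8]])
```

Unlike the mutual information used by BatchBALD, this score ignores how confident each individual prediction is and only counts votes.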

EXPERIMENTS
The performance of GDAL and the baseline methods is evaluated by simulating an AL process and measuring the classification accuracy at each iteration. Section 5.1 describes the dataset and the performance metric. The default parameter values of GDAL are listed in Section 5.2. Finally, experimental results are presented in Sections 5.3 to 5.5.

Dataset and performance metric
The AL process is simulated on the UrbanSound8k [18] dataset, which consists of 8732 short, weakly labeled clips, each belonging to one of 10 classes.
First, the clips in the training split (without the class labels) are presented to the AL algorithm; this is the dataset $D$ in (3) and (6). Manual annotations are simulated by revealing the ground truth labels to the algorithm. In each iteration of the simulated AL process, the performance is evaluated as the macro-recall [26] on the test split, i.e. the per-class percentage of correctly predicted labels, averaged over all classes.
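The metric can be written down in a few lines. The sketch below is a plain illustration with a hypothetical function name; it computes the recall of each class present in the ground truth and averages these values with equal weight per class.

```python
def macro_recall(y_true, y_pred):
    """Mean of the per-class recalls, each class weighted equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))   # recall of class c
    return sum(recalls) / len(recalls)

# A classifier that always predicts class 0 on an imbalanced test set:
score = macro_recall([0, 0, 0, 1], [0, 0, 0, 0])
```

In this example the plain accuracy would be 0.75, whereas the macro-recall is 0.5, because the second class is never recognized.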
AL processes are simulated for annotation budgets $A$ up to 500 for baseline comparisons, and up to 100 for other experiments. Each experiment is conducted on 10 trials, measuring the mean macro-recall via 10-fold cross-validation. Confidence intervals at the 80% level for the mean macro-recall are computed via bootstrapping and shown as shaded areas in all following plots.
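A bootstrap confidence interval for a mean can be obtained as sketched below. The text does not state which bootstrap variant was used, so the percentile method, the number of resamples, the function name and the example values are all assumptions made for illustration.

```python
import random

def bootstrap_ci(values, level=0.80, n_resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of every resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((1.0 - level) / 2.0 * n_resamples)]
    hi = means[int((1.0 + level) / 2.0 * n_resamples) - 1]
    return lo, hi

# Hypothetical macro-recall values from 10 trials (not results from the paper).
trials = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.71, 0.69, 0.72]
lo, hi = bootstrap_ci(trials)
```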

Default parameter values in GDAL
In the following, we define GDAL's default parameter values used for the experiments. The prior $p(W)$ in (3) over network weights $W$ is defined by the standard deviation $\sigma_p$ of the Gaussian term and the data regularization strength $\lambda$, which are set to $\sigma_p = 30$ and $\lambda = 10$. The optimization-related parameters (cf. Section 3.2.3) are the minibatch size $B = 4096$, the number of weight samples for each input sample $n_\mathrm{train} = 2048$, and the learning rate of the Adam optimizer, which is set to 0.1. For the first acquisition, GDAL acquires $K = 30$ annotations via K-medoid clustering, and performs 6400 backpropagation steps to train the classifier. For each further iteration, $\Delta|L| = 3$ labels are acquired via BatchBALD, and 2048 backpropagation steps are performed.
The choice of $\sigma_p$ and $\lambda$ was motivated by preliminary experiments in which the classifier's performance on the training split, i.e. all clips in the UrbanSound8k dataset that are not in the test split, was analysed. From all tested combinations, we chose the one with the strongest regularization (i.e. highest $\lambda$, lowest $\sigma_p$) just before the performance starts dropping. The other parameters were chosen to result in a reasonable computation time.
The impact of the prior parameters $\lambda$ and $\sigma_p$ on GDAL's performance is studied in Sections 5.3 and 5.4, respectively. For that, multiple experiments are conducted with different values for one parameter, while the other is fixed at the default value defined above.
Finally, a comparison with baseline methods is performed in Section 5.5, while all of GDAL's parameters are at their default values.

Varying the strength $\lambda$ of the data-driven regularization
The parameter $\lambda$ in (3) regulates the impact of unlabeled audio clips on the classifier. As $\lambda$ is varied, the best performance is observed for $\lambda = 10$, as shown in Figure 2. The extreme case of $\lambda = 0$ makes the model ignore unlabeled data, whereas too high a $\lambda$ presumably leads to a poor variational posterior approximation, especially for a low number of annotations, ultimately resulting in low classification accuracy. Most importantly, Figure 2 shows that data-driven regularization with moderate strength has a beneficial effect on GDAL's performance.

Varying the width $\sigma_p$ of the Gaussian term in the prior
The standard deviation $\sigma_p$ of the Gaussian term in (3) defines the strength of the weight decay in the dense layer shown in Figure 1. Simulations of the AL process with different values of $\sigma_p$ are shown in Figure 3. The best performance is observed for intermediate levels of $\sigma_p$ around 30. Too weak ($\sigma_p = 100$) or too strong ($\sigma_p = 3$) weight decay worsens the performance, whereby too strong weight decay has a more detrimental effect. This is analogous to conventional neural networks: a network's performance can be boosted by a small weight decay, but degrades if the decay is too strong [27].

Comparison with baselines
GDAL's performance is compared with the baseline methods (cf. Section 4) in Figure 4. The proposed GDAL clearly outperforms the MAL-PANN approach, while performing similarly to the DAL approach. For annotation budgets above ca. 280, GDAL outperforms both baselines.

CONCLUSION
In this paper we propose and analyse GDAL, a Bayesian AL algorithm for solving the sound event classification task. GDAL is label-efficient due to its application of pre-trained feature extractors, incorporation of unlabeled data into the training process, and a smart acquisition mechanism that is based on the estimated informativeness of unlabeled clips. For sufficiently large annotation budgets, GDAL was shown to beat state-of-the-art baselines.

Fig. 1. Computation graph of the classifier in GDAL. Training a BNN requires defining a Bayesian prior $p(W)$ (cf. Section 3.2.1) and a variational Bayesian posterior parameterization $q(W|\boldsymbol{\theta})$ (cf. Section 3.2.2) before optimizing the resulting objective function (2) (cf. Section 3.2.3).

Fig. 3. Macro-recall over annotation budget $A$ up to 100 for different widths $\sigma_p$ of the Gaussian factor in the prior.

Fig. 4. Macro-recall over annotation budget $A$ up to 500 for GDAL and baseline methods.