Dynamic Scalable Self-Attention Ensemble for Task-Free Continual Learning

Continual learning represents a challenging task for modern deep neural networks due to the catastrophic forgetting that follows the adaptation of network parameters to new tasks. In this paper, we address a more challenging learning paradigm called Task-Free Continual Learning (TFCL), in which the task information is missing during training. To deal with this problem, we introduce the Dynamic Scalable Self-Attention Ensemble (DSSAE) model, which dynamically adds new Vision Transformer (ViT)-based experts to deal with the data distribution shift during training. To avoid frequent expansions and ensure an appropriate number of experts for the model, we propose a new dynamic expansion mechanism that evaluates the novelty of incoming samples as an expansion signal. Furthermore, the proposed expansion mechanism does not require knowing the task information or the class label, so it can be used in a realistic learning environment. Empirical results demonstrate that the proposed DSSAE achieves state-of-the-art performance in a series of TFCL experiments.


INTRODUCTION
One of the most fundamental desirable characteristics of artificial intelligence systems is the ability to continually acquire novel knowledge and skills without forgetting. Methods having such an ability are called lifelong learning or continual learning models. Modern deep learning models achieve impressive performance on individual tasks, including classification [1], representation learning and image generation. However, they suffer from severe performance losses when continually learning several different tasks. This performance loss is called catastrophic forgetting [2].
Lifelong learning methods can be roughly divided into three categories, according to the principle used: regularization [3,4], memory buffer [5,6,7] and dynamic expansion [8,9]. Regularization-based approaches impose constraints on the objective function during training in order to alleviate catastrophic forgetting [3,4]. Memory-based approaches either train a generator or use a fixed-length memory buffer for preserving and replaying past examples during training. Dynamic expansion approaches dynamically build new components to deal with new tasks during training [9]. However, these approaches rely on task information, which is not available in the more realistic learning paradigm called Task-Free Continual Learning (TFCL) [10].
Using a memory buffer is a popular approach in TFCL, which usually relies on an efficient sample selection strategy [10,11] for selectively storing and replaying samples during each training step. However, such methods suffer from interference between learning old and new samples, resulting in adverse knowledge transfer effects [12]. Dynamic expansion approaches [13,14,11] instead increase the model's capacity to deal with the data distribution shift during training, in order to address the adverse knowledge transfer effect. Recently, the Vision Transformer (ViT) [15] and its variants [16,17,18] have shown impressive capabilities when learning individual tasks, which can be extended to continual learning. The key component of the ViT is the self-attention mechanism, which models the similarity between different image patches. However, the effectiveness of the self-attention mechanism in TFCL has not been investigated so far. Therefore, we develop the Dynamic Scalable Self-Attention Ensemble (DSSAE), which employs the self-attention mechanism to learn a non-stationary data distribution without knowing the task information. To implement this goal, each DSSAE expert employs a self-attention-based feature extractor and a linear classifier. A dynamic expansion mechanism then adds new experts when identifying data distribution shifts. Specifically, a memory buffer stores the most recent samples from the data stream, and the novelty of the memory buffer is evaluated as the expansion signal. This mechanism enables a compact model in which each expert learns a different underlying data distribution. We summarize the contributions of this paper as follows:
• We propose the Dynamic Scalable Self-Attention Ensemble (DSSAE), which can learn non-stationary data distributions without requiring any task information.
• We introduce a new dynamic expansion mechanism that evaluates the memory buffer's novelty as an expansion signal, ensuring a compact model structure.
• The proposed DSSAE achieves state-of-the-art performance.

DYNAMIC VISION TRANSFORMER ENSEMBLE
In this section, we detail the dynamic scalable self-attention ensemble. First, we describe each module of DSSAE and the memory updating strategy. Then we propose a novel dynamic expansion mechanism, enabling DSSAE to increase its capacity to deal with the data distribution shift under continual learning. Finally, we propose a learning algorithm for training DSSAE.

Each expert E_i in DSSAE consists of a feature extractor F_δi and a linear classifier F_θi. We implement each feature extractor F_δi using a ViT, given its robust feature learning ability. Let x ∈ R^(H×W×S) be a data sample, where {H, W, S} represent the image height, width and number of channels, respectively. We split an input x into a set of image patches b ∈ R^(R×(K²×S)), where each patch b_j has the size of K² pixels and R = HW/K² is the number of image patches. Let W_p be a projection matrix which transfers the image patches into the H-dimensional embedding space:

e = b W_p .  (1)

We then combine several self-attention modules into a unified framework called the multi-head attention mechanism [19]. Each self-attention module has three independent trainable matrices {W_q^t, W_k^t, W_v^t}, aiming to capture different statistical information from an input; a multi-head attention mechanism with a heads therefore contains a self-attention modules. First, we calculate the output of each self-attention module:

A_t = softmax( (e W_q^t)(e W_k^t)^T / √r ) (e W_v^t) ,  t = 1, …, a ,  (2)

where √r is a scaling factor and r is the dimension of each head. Then we concatenate all the outputs of the self-attention modules into one matrix:

A = Concat(A_1, …, A_a) ,  (3)

where Concat(•) denotes that we concatenate all matrices into a single one.
The output of the multi-head attention mechanism, Eq. (3), is fed into a feed-forward Multi-Layer Perceptron (MLP) to produce the feature information for a linear classifier:

y = ε( W_m F_MLP(A) + b_m ) ,  (4)

where F_MLP is the MLP and ε(•) represents the sigmoid activation function. {W_m, b_m} are the trainable parameters of the linear classifier and y is the prediction for an input x. Let F_δi and F_θi represent the feature extractor and the classifier of the i-th expert, where δ_i and θ_i denote all parameters of the self-attention modules and of the classifier, respectively. Since the proposed DSSAE, E = {E_1, …, E_c}, involves multiple experts, we introduce a novel dynamic expansion mechanism which enables E to continually add new experts during training, as described in the following section.
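To make Eqs. (1)-(3) concrete, the following is a minimal NumPy sketch of the multi-head self-attention computation described above. All variable names (`heads`, the random weight matrices, the toy dimensions) are illustrative choices, not the paper's actual hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(e, Wq, Wk, Wv):
    """One head: scaled dot-product attention over R patch embeddings, Eq. (2)."""
    Q, K, V = e @ Wq, e @ Wk, e @ Wv
    r = Q.shape[-1]                           # per-head dimension (the sqrt(r) scale)
    return softmax(Q @ K.T / np.sqrt(r)) @ V  # shape (R, r)

def multi_head_attention(e, heads):
    """Concatenate the outputs of all heads into one matrix, as in Eq. (3)."""
    return np.concatenate([self_attention(e, *h) for h in heads], axis=-1)

rng = np.random.default_rng(0)
R, H, a, r = 16, 64, 4, 16                    # patches, embed dim, heads, head dim
e = rng.normal(size=(R, H))                   # patch embeddings, as in Eq. (1)
heads = [tuple(rng.normal(size=(H, r)) for _ in range(3)) for _ in range(a)]
out = multi_head_attention(e, heads)
print(out.shape)                              # (16, 64): a*r features per patch
```

The concatenated output `out` would then be passed through the MLP and the linear classifier of Eq. (4).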

The dynamic expansion mechanism
In this section, we propose a new dynamic expansion mechanism that enables DSSAE to perform continual learning. The main motivation is to create a new expert whenever a data distribution shift occurs during training with continuous streams of data. In order to implement this, we first assign an autoencoder V_i to each expert E_i, in order to evaluate, through data reconstruction, the correlation between the knowledge learnt by the expert and the information corresponding to a given data batch. In addition, the autoencoder V_i is also used as the expert selector when evaluating a given data batch during the testing phase. Let M_i be a fixed-length memory buffer updated at the training step T_i and |M|_Max be its maximum capacity. We consider a simple memory updating mechanism that removes the earliest samples while continuously adding new samples from the incoming data stream. We then introduce a new dynamic expansion mechanism that evaluates the novelty of the data from the memory buffer as the expansion signal at the training step T_i:

d(z, M_i) = (1/|M_i|) Σ_{j=1}^{|M_i|} F_Rec( x'_j , V_z(x'_j) ) ,  (5)

where F_Rec(•,•) is the reconstruction error and V_z(x'_j) is the reconstruction of the j-th memorized sample from the memory buffer, produced by the autoencoder of the z-th expert. The dynamic expansion mechanism then evaluates the memory buffer using all previously trained experts:

min_{z=1,…,k} d(z, M_i) ≥ λ ,  (6)

where we assume that DSSAE has already trained k experts and λ is a threshold that controls the model expansion. If Eq. (6) holds, no trained expert describes the current memory buffer well, and a new expert is added.
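A minimal sketch of the expansion check of Eqs. (5)-(6) is given below. The "autoencoders" here are toy stand-in functions (hypothetical, for illustration only); in the paper each V_z would be a trained encoder-decoder network, and F_Rec is assumed to be a mean squared error.

```python
import numpy as np

def reconstruction_error(x, x_rec):
    """F_Rec: mean squared error between a sample and its reconstruction."""
    return float(np.mean((x - x_rec) ** 2))

def should_expand(memory, autoencoders, lam):
    """Eq. (6): expand only when even the best-matching autoencoder
    reconstructs the memory buffer poorly (novelty exceeds the threshold)."""
    novelty = min(
        np.mean([reconstruction_error(x, ae(x)) for x in memory])  # Eq. (5)
        for ae in autoencoders
    )
    return bool(novelty >= lam)

# toy stand-ins: each "autoencoder" can only reproduce values near its mean
make_ae = lambda mu: (lambda x: np.full_like(x, mu))
memory = [np.full(8, 5.0) for _ in range(4)]            # recent samples near 5
print(should_expand(memory, [make_ae(0.0), make_ae(5.0)], lam=0.5))  # False
print(should_expand(memory, [make_ae(0.0)], lam=0.5))                # True
```

In the first call one expert already fits the buffer well, so no expansion is triggered; in the second, no expert fits and the signal fires.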
In addition, the threshold λ also balances the model's generalization performance against its complexity: a small λ triggers expansion more easily and thus leads to more experts, improving the performance at the cost of additional parameters, while a large λ tends to build fewer experts, resulting in lower performance.

Fig. 1. The learning procedure of the proposed model consists of three steps. The first step updates the memory buffer by continually adding new batches of samples; if the memory buffer is overloaded, the earliest samples are removed. The second step checks the model expansion using Eq. (6). The third step trains the current expert on the memory buffer using Eq. (7) and Eq. (8).

Implementation
Each expert E_i in DSSAE consists of three modules: a ViT-based feature extractor F_δi, a linear classifier F_θi and an autoencoder V_i. For V_i we consider an encoder Q_i(x) and a decoder S_i(z), which are trained using the reconstruction error loss on the memory buffer at T_i:

L_rec = (1/|M_i|) Σ_{j=1}^{|M_i|} F_Rec( x'_j , S_k(Q_k(x'_j)) ) .  (7)

The classifier of the current expert is then trained on the memory buffer using the cross-entropy loss:

L_cls = (1/|M_i|) Σ_{j=1}^{|M_i|} F_ce( (F_θk ⊗ F_δk)(x'_j) , y'_j ) ,  (8)

where F_ce(•,•) is the cross-entropy loss function and ⊗ represents the connection between the two modules.

Algorithm. We introduce a new algorithm for training the Dynamic Scalable Self-Attention Ensemble (DSSAE), which can be summarized into four steps:
• Step 1 (Memory updating mechanism). In the i-th training step T_i, the model sees a new batch of samples from the data stream D, which is added to the memory buffer M_i. If the memory buffer M_i is overloaded, we remove the earliest samples from the memory buffer until its size becomes equal to |M_i|_Max.
• Step 2 (Learning process). In the i-th training step, we only train the current expert E_k on the memory buffer M_i using Eq. (7) and Eq. (8), while all previously learnt experts are frozen in order to preserve the previously acquired knowledge.
• Step 3 (Checking the model's expansion). If the memory buffer is full, |M_i| = |M|_Max, we check the model's expansion using Eq. (6). If Eq. (6) is satisfied, we add a new expert E_{k+1} to E and return to Step 1.
• Step 4 (Expert selection). Once all training steps have been completed, we select an expert for a given testing sample x_t according to:

s = argmin_{z=1,…,k} F_Rec( x_t , V_z(x_t) ) ,  (9)

where s is the index of the expert selected for evaluating x_t.
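The four steps above can be sketched as a training loop. This is a deliberately simplified toy (an assumption-laden illustration, not the paper's implementation): each "expert" is reduced to a scalar mean, the novelty of Eq. (6) becomes a distance between means, and the memory buffer is a FIFO deque.

```python
from collections import deque
import statistics

def train_dssae(stream, max_mem=6, lam=2.0):
    """Toy DSSAE driver covering Steps 1-3 on a stream of scalar batches."""
    memory = deque(maxlen=max_mem)                # Step 1: FIFO memory buffer
    experts = [0.0]                               # start with one (untrained) expert
    for batch in stream:
        memory.extend(batch)                      # oldest samples drop out automatically
        if len(memory) == max_mem:                # Step 3: check expansion on a full buffer
            novelty = min(abs(statistics.mean(memory) - e) for e in experts)
            if novelty >= lam:                    # buffer looks unlike every expert
                experts.append(statistics.mean(memory))
        experts[-1] = statistics.mean(memory)     # Step 2: train only the current expert
    return experts

# a stream whose distribution shifts from ~0 to ~10 half-way through
stream = [[0.1, -0.1, 0.0]] * 4 + [[10.0, 9.9, 10.1]] * 4
print(len(train_dssae(stream)))  # 3: one expert per distribution plus a
                                 # transient one for the mixed buffer
```

Even in this toy, the buffer that straddles the distribution shift briefly spawns its own expert, which illustrates why the threshold λ and the buffer size jointly control how aggressively the ensemble grows.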

Experiment setting
We adopt the standard TFCL benchmarks from [20]. Split MNIST: we divide the MNIST dataset into five tasks, each consisting of samples from two classes. We repeat this on CIFAR10 [21], resulting in Split CIFAR10. Split CIFAR100: we divide CIFAR100 into 20 tasks, where each task consists of 2500 samples belonging to five classes.
Network architecture and hyperparameters for the classifier.
Following the setting in [20], the maximum memory size is considered to be 2000, 1000 and 5000 for Split MNIST, Split CIFAR10 and Split CIFAR100, respectively. In each training step, the model only accesses a batch of 10 samples. We consider an image patch size of 7 × 7 pixels for Split MNIST, and of 8 × 8 pixels for Split CIFAR10 and Split CIFAR100.
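As a quick sanity check of these patch sizes against the relation R = HW/K² from the model description, both settings yield the same number of patches per image:

```python
def num_patches(h, w, k):
    """R = HW / K^2 patches of K x K pixels (assumes K divides both H and W)."""
    assert h % k == 0 and w % k == 0
    return (h * w) // (k * k)

print(num_patches(28, 28, 7))  # 16 patches per 28x28 MNIST image
print(num_patches(32, 32, 8))  # 16 patches per 32x32 CIFAR image
```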

Classification task
In Table 1 we evaluate DSSAE on Split MNIST, Split CIFAR10, and Split CIFAR100, and compare the results with several baselines including: Finetune, which directly trains a classifier on the data stream, CURL [14], iCARL [22] [20], CNDPM, ER + GMED and ERa + GMED [23], where GMED is Gradient-based Memory Editing and ER is Experience Replay [24], while ERa is ER with data augmentation. These results show that the proposed DSSAE outperforms the other baselines on all datasets.
In the following, we also evaluate the performance of various models in a more challenging setting with fuzzy task boundaries [13], in which samples are randomly swapped between two tasks in each data stream, thus introducing outliers in their probabilistic representations. We report the results in Table 2, where we also compare the performance of various models on Split MINI-ImageNet [25], which divides MINI-ImageNet [25] into 20 tasks, each containing samples of five different classes with images of higher complexity. These results show that the proposed approach still outperforms the other baselines when learning this more challenging dataset.

Ablation study
In this section, we perform an ablation study to investigate the effectiveness of each module of the proposed DSSAE. We first evaluate DSSAE when changing the expansion threshold λ from Eq. (6) on Split MNIST; the results are provided in Fig. 2.

Fig. 2. The performance and the number of experts when changing the threshold λ.

Table 3. Classification accuracy of five independent runs when considering a CNN-based expert against a ViT-based expert.
From Fig. 2 we can observe that a small λ encourages the proposed DSSAE to build more experts, while a large threshold λ hinders the expansion process. In addition, more experts lead to better performance while requiring more parameters. We also investigate whether a ViT-based expert is better than a Convolutional Neural Network (CNN) based expert. We consider a baseline model in which each expert employs a CNN as the feature extractor instead of a ViT, called DSCNNE. We train DSCNNE on the three datasets and report the results in Table 3. These results show that the proposed DSSAE outperforms DSCNNE on all three datasets, demonstrating that the ViT-based experts used in DSSAE perform better than CNN-based experts.

CONCLUSION
In this paper, we propose a new model called the Dynamic Scalable Self-Attention Ensemble (DSSAE) for continual learning. DSSAE dynamically adds new experts, each containing a Vision Transformer (ViT), to deal with the data distribution shift under the continual learning scenario. In order to avoid frequent expansions and ensure knowledge diversity among the trained experts, we propose a new dynamic expansion mechanism that evaluates the novelty of incoming samples with respect to the knowledge already acquired by the model. Furthermore, such a mechanism does not require access to the task information or class labels, and can be used in a realistic continual learning setting. We perform a series of experiments, and the empirical results demonstrate that the proposed approach achieves state-of-the-art performance.

Experts of DSSAE
First, we introduce the learning paradigm for TFCL and then the detailed implementation of each expert of DSSAE. Let T = {T_1, …, T_n} be a set of training steps/times for learning a data stream D. This data stream consists of n data batches D = {D_1, …, D_n}, where each data batch D_i = {x_{i,j}, y_{i,j}}_{j=1}^{b} consists of several training examples and b is the batch size. At a given training step T_i, the model can only access the data batch D_i, while all previous data batches {D_1, …, D_{i-1}} are no longer available. After the model has finished all training steps, we evaluate its performance on all testing samples. In the following, we describe the implementation of an expert of DSSAE.
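This TFCL setup can be sketched as a toy stream builder. Everything here (the function name, the placeholder "images", the sample counts) is illustrative, not the benchmark code; the point is that batches arrive in task order but carry no task identifier.

```python
import random

def make_tfcl_stream(samples_per_class, classes_per_task=2, n_classes=10, batch=10):
    """Split-MNIST-style stream: two-class tasks arrive in sequence,
    but each batch exposes only (x, y) pairs, never a task id."""
    stream = []
    for task_start in range(0, n_classes, classes_per_task):
        task = [(f"img_{c}_{i}", c)                       # placeholder samples
                for c in range(task_start, task_start + classes_per_task)
                for i in range(samples_per_class)]
        random.shuffle(task)                              # i.i.d. within a task
        stream += [task[j:j + batch] for j in range(0, len(task), batch)]
    return stream                                          # list of batches D_1..D_n

stream = make_tfcl_stream(samples_per_class=50)
print(len(stream))     # 5 tasks x (100 samples / batch of 10) = 50 batches
print(len(stream[0]))  # each training step sees only one batch of 10 samples
```

A model consuming this stream sees D_i at step T_i and nothing earlier, matching the paradigm described above.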

Table 2. The classification accuracy of five independent runs for various models over streams with fuzzy task boundaries.