Compressing Cross-Domain Representation via Lifelong Knowledge Distillation

Most Knowledge Distillation (KD) approaches focus on transferring discriminative information and assume that the data is provided in batches during the training stages. In this paper, we address a more challenging scenario in which different tasks are presented sequentially, at different times, and the learning goal is to transfer the generative factors of visual concepts learned by a Teacher module to a compact latent space represented by a Student module. To achieve this, we develop a new Lifelong Knowledge Distillation (LKD) framework in which we train an infinite mixture model as the Teacher, which automatically increases its capacity to deal with a growing number of tasks. In order to ensure a compact architecture and to avoid forgetting, we propose to measure the relevance of the knowledge from a new task to each of the experts making up the Teacher module, guiding each expert to capture the probabilistic characteristics of several similar domains. The network architecture is expanded only when learning an entirely different task. The Student is implemented as a lightweight probabilistic generative model. The experiments show that LKD can train a compressed Student module that achieves state-of-the-art results with fewer parameters.


INTRODUCTION
Lifelong learning (LLL), representing the ability to continuously learn from experiences, is an essential characteristic of all living beings, enabling them to adapt and survive. However, learning and acquiring new knowledge from a series of tasks remains a challenge for artificial systems due to catastrophic forgetting [1], which occurs when switching between tasks.
Most existing studies address forgetting in a series of predictive tasks, where the model is trained to remember discriminative information across tasks. In this paper, we address a more challenging scenario in which the model is required to remember generative representation information across domains over time. To this end, we provide a mechanism for compressing and storing the probabilistic representations associated with the knowledge learnt from several data domains during LLL. For a given set of distinct domains (tasks) $\{\mathcal{T}_1, \ldots, \mathcal{T}_K\}$, when learning the $i$-th task we only access the data samples $x$ drawn from the specific domain $\mathcal{T}_i$. Our goal is to find a model $M = \{f_\omega, G_\varepsilon\}$ which can embed the knowledge from all prior tasks into a latent space $\mathcal{Z}$ through the inference process $f_\omega : \mathcal{X} \rightarrow \mathcal{Z}$ and recover the data from the embedded latent space through the generative process $G_\varepsilon : \mathcal{Z} \rightarrow \mathcal{X}$. Once $M$ is trained, many downstream tasks can be implemented on the embedded latent space, such as interpolation [2, 3] and log-likelihood estimation [4, 5]. This learning process opens a new direction for LLL, where the model compresses the accumulated knowledge from a sequence of tasks into a compact latent space.
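To make the setting concrete, the following is a minimal sketch (not the authors' implementation) of the sequential cross-domain protocol: tasks arrive one at a time, only the current task's data is accessible, and a single model $M = \{f_\omega, G_\varepsilon\}$ keeps being updated. The `TinyAutoencoder` class and the synthetic `domains` tensors are hypothetical placeholders used purely for illustration.

```python
# A minimal sketch of the cross-domain lifelong setting: tasks arrive sequentially
# and only the current task's data is accessible during training.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, x_dim=64, z_dim=8):
        super().__init__()
        self.f_w = nn.Linear(x_dim, z_dim)   # inference process f_w : X -> Z
        self.G_e = nn.Linear(z_dim, x_dim)   # generative process G_e : Z -> X

    def forward(self, x):
        z = self.f_w(x)
        return self.G_e(z), z

# Hypothetical sequence of domains; in the paper these would be distinct datasets.
domains = [torch.randn(256, 64) + shift for shift in (0.0, 2.0, -2.0)]

model = TinyAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for t, data_t in enumerate(domains):          # tasks presented sequentially
    for _ in range(100):                      # only samples from T_t are accessible
        x = data_t[torch.randint(0, len(data_t), (32,))]
        x_rec, _ = model(x)
        loss = ((x_rec - x) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    # after finishing task t, earlier domains are no longer available for training
```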
The primary challenge when attempting to learn multiple tasks is catastrophic forgetting. Many approaches aiming to alleviate catastrophic forgetting rely on episodic memory systems [6] or on Generative Replay Mechanisms (GRMs) [7]. In this paper, we focus on GRM-based models, since such methods do not rely on real data from prior tasks. Existing GRM models, although successfully used for LLL, fail to learn long sequences of tasks in which each database is characterized by different probabilistic representations. This is due to mode collapse [8], which occurs when GRMs learn several entirely different datasets. It was shown that expansible mixture models can deal with such challenges. However, existing mixture models [4, 9, 5, 10] require preserving the whole network architecture while performing model selection at the testing phase.
Motivated by addressing the drawbacks of GRMs while employing mixture models, we propose a new Lifelong Knowledge Distillation (LKD) framework, consisting of a Teacher-Student model, where the Teacher evolves over time according to an expansion mechanism in order to accumulate knowledge from a dynamically changing environment. The Student is designed to continually embed the generative factors of the knowledge learned by the Teacher into a single latent space, in which different data domains are embedded as multiple clusters. To ensure a compact network architecture for the Teacher, we propose to calculate the dependency between the incoming task and each expert through a knowledge consistency evaluation approach, which guides the selection and expansion of the experts in the Teacher. The main contributions are: (1) We propose to learn the generative factors accumulated from successive domains by developing a novel lifelong learning framework; (2) We introduce a new approach for regularizing the selection and expansion of the Teacher, which ensures a compact network architecture during training; (3) We introduce a new knowledge distillation loss that distills the generative representations from the Teacher to the Student.

Preliminaries
Problem definition: Let $\mathcal{T} = \{\mathcal{T}_1, \ldots, \mathcal{T}_N\}$ be a set of $N$ tasks, where each task $\mathcal{T}_i$ is defined by a training set $D_{S_i}$. When learning $\mathcal{T}_i$ during lifelong learning, the model only draws samples from the training set $D_{S_i}$, while all previous training sets $\{D_{S_1}, \ldots, D_{S_{i-1}}\}$ are no longer available. Let $\{D_{T_1}, \ldots, D_{T_N}\}$ be the corresponding testing sets, where each $D_{T_i}$ consists of $N_{T_i}$ testing samples $x_{T_i}$ over the data space $\mathcal{X}$; once the learning of all tasks is finished, the model's performance is evaluated on $\{D_{T_1}, \ldots, D_{T_N}\}$. This paper focuses on the cross-domain setting, aiming to learn a sequence of several datasets.

Variational Autoencoders: The Variational Autoencoder (VAE) is an explicit generative latent variable model which learns an observed variable $x$ while estimating a latent variable $z$ over the latent space $\mathcal{Z}$ within a unified optimization framework. For a given VAE model $p_\theta(x, z)$, we aim to find the optimal parameters $\theta$ that maximize the marginal log-likelihood $\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$. This involves the prior distribution $p(z) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ (a Gaussian with identity covariance matrix $\mathbf{I}$), and is intractable to optimize directly since it requires integrating over all $z$. VAE training therefore relies on maximizing an Evidence Lower Bound (ELBO) to the marginal log-likelihood [11]:
$$\mathcal{L}_{ELBO}(x; \theta, \xi) = \mathbb{E}_{q_\xi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\xi(z \mid x) \,\|\, p(z)\big), \quad (1)$$
where $q_\xi(z \mid x)$ is a variational distribution aiming to approximate the true posterior $p_\theta(z \mid x)$, $p_\theta(x \mid z)$ is the decoding distribution, and $D_{KL}(\cdot)$ denotes the Kullback-Leibler (KL) divergence.
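For illustration only, a minimal VAE sketch implementing the ELBO of Eq. (1) is given below; the architecture (a small fully connected encoder/decoder with a Bernoulli likelihood and a diagonal Gaussian posterior) is an arbitrary choice and not the specific network used in the experiments.

```python
# Minimal VAE sketch: Gaussian encoder q_xi(z|x), Bernoulli decoder p_theta(x|z),
# standard-normal prior p(z) = N(0, I), and the ELBO of Eq. (1). Sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def encode(self, x):                       # q_xi(z | x)
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):                       # p_theta(x | z), returns logits
        return self.dec(z)

    def elbo(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        rec = -F.binary_cross_entropy_with_logits(self.decode(z), x,
                                                  reduction='none').sum(-1)
        # KL( q_xi(z|x) || N(0, I) ) in closed form for diagonal Gaussians
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)
        return (rec - kl).mean()               # L_ELBO, to be maximised
```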

Teacher module
Using a single VAE under repeated GRM processes has significant limitations when learning a sequence of tasks [2]. In this paper, we develop a novel infinite mixture of VAEs (experts) as a dynamically expandable expert-based memory system for the Teacher module, where each expert captures one or several similar visual concepts from the given tasks. Let us assume that, after learning the first $t$ tasks, the Teacher has already trained $K$ experts, where each component $\mathcal{M}_i$ has the parameter set $\{\theta_i, \xi_i\}$. The dynamic expansion and selection mechanism is shown in Fig. 1. We need to evaluate the knowledge similarity between the information accumulated by each expert and that of the incoming task, in order to guide the Teacher module towards the appropriate learning strategy for the next task $\mathcal{T}_{t+1}$. In the following, we describe how the selection and expansion are performed when learning the next task.

Selection and expansion. As shown in Fig. 1, once the task $\mathcal{T}_t$ has been learnt, the selection or expansion decision is made through a non-parametric inference process: we first evaluate the probability $r$ of expanding the mixture or of selecting an existing component, Eq. (2), by comparing the knowledge measure $\min\{F_{ks}(\mathcal{M}_i, \mathcal{T}_{t+1})\}_{i=1}^{K}$ with a threshold, where $F_{ks}(\cdot)$ is a pre-defined function that evaluates the knowledge similarity. Then we update the selection probability $p_i$ of each expert, Eq. (3). If $r = 0$, the Teacher module expands its capacity, setting $p_{K+1} = 1$; otherwise, the Teacher module selects an existing expert according to $\{p_1, \ldots, p_K\}$.
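Since the exact forms of Eq. (2) and Eq. (3) are not reproduced above, the following sketch shows one plausible reading of the selection/expansion rule: expand ($r = 0$) when even the closest expert is too dissimilar to the incoming task, and otherwise select an existing expert with probability related to its similarity. The function `select_or_expand` and the exponential weighting of the distances are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the selection/expansion decision (one plausible reading of Eq. (2)-(3)).
import numpy as np

def select_or_expand(ks_scores, threshold):
    """ks_scores[i] = F_ks(M_i, T_{t+1}); smaller means more similar (a distance)."""
    ks_scores = np.asarray(ks_scores, dtype=float)
    if ks_scores.min() > threshold:           # no expert is similar enough
        r = 0                                 # expand: a new expert gets p_{K+1} = 1
        probs = np.append(np.zeros(len(ks_scores)), 1.0)
    else:
        r = 1                                 # reuse: favour the most similar experts
        weights = np.exp(-ks_scores)          # turn distances into unnormalised weights
        probs = weights / weights.sum()
    return r, probs

# Example: three experts, the second one close to the new task -> reuse (r = 1).
r, p = select_or_expand([0.9, 0.2, 1.4], threshold=0.5)
print(r, p)
```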
Training the infinite mixture model. After determining the selection probabilities, we define the Teacher's loss function for the following $(t+1)$-th task as in Eq. (4), where $S$ is the potential number of experts: $S = K$ if the Teacher does not expand ($r = 1$) when learning the $(t+1)$-th task, otherwise a new expert is added and $S = K + 1$; $\Theta$ is the set of all the components' parameters. We then train the Teacher by using Eq. (4) with the experts' weights $\mathbf{w} = \{w_1, \ldots, w_S\}$ sampled from a Categorical distribution $\mathrm{Cat}(p_1, \ldots, p_S)$, optimizing either a selected or a newly created component at $\mathcal{T}_{t+1}$.
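The training step itself can be sketched as below, again as one plausible reading of Eq. (4) rather than its exact form: a one-hot weight vector $\mathbf{w} \sim \mathrm{Cat}(p_1, \ldots, p_S)$ routes the $(t+1)$-th task's negative ELBO to a single expert, so previously trained experts remain unchanged. The `VAE` class is the sketch from the Preliminaries and `probs` comes from the selection step above; both are illustrative.

```python
# Hedged sketch of one Teacher update under the mixture loss of Eq. (4).
import torch

def teacher_step(experts, optimizers, probs, x_batch):
    """One mixture update: sample an expert index k ~ Cat(probs) and train only M_k."""
    # if the selection step put all mass on a new (K+1)-th expert, expand the mixture
    if len(probs) > len(experts):
        experts.append(VAE(x_dim=x_batch.shape[1]))
        optimizers.append(torch.optim.Adam(experts[-1].parameters(), lr=1e-3))
    # w ~ Cat(p_1, ..., p_S): a one-hot draw that decides which expert is updated
    k = torch.distributions.Categorical(torch.tensor(probs)).sample().item()
    loss = -experts[k].elbo(x_batch)          # maximise the chosen expert's ELBO
    optimizers[k].zero_grad(); loss.backward(); optimizers[k].step()
    return k, loss.item()
```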

Knowledge similarity (KS) evaluation
Existing mixture models [13, 14] use the log-likelihood, evaluated by each component when provided with a new training set, for the expansion or selection process. However, these models have the following drawbacks: 1) they require an inference mechanism for each component; 2) they lack a mechanism for comparing the statistical representations of a given task with those of their components. In the following, we address these shortcomings by proposing two approaches for evaluating the KS between each expert and the incoming task, without requiring any inference mechanism for the experts.
Firstly, we assume access to the Student module, implemented as a single VAE, which is trained to learn the knowledge from all experts. Since the Student already encodes all of the information learnt so far, we can use its representation for the KS evaluation. We employ the cosine distance on the feature space $\mathcal{Z}$ of the Student for the KS evaluation, Eq. (5), where $z_{t+1,u}$ and $z_{j,u}$ are the feature vectors extracted by the Student when considering the inputs $x_{t+1,u}$ and $x_{j,u}$, drawn from the incoming task $\mathcal{T}_{t+1}$ and from the $j$-th expert, respectively, and $m$ is the number of samples used for the evaluation. Given that a larger measure in Eq. (5) represents a higher similarity, we derive from it the KS criterion $F_{ks}$ used in Eq. (2). Secondly, we introduce another approach for evaluating the KS, which compares the sample log-likelihoods of the incoming task and of each expert, estimated using the Student model with parameters $\{\theta_s, \xi_s\}$, Eq. (7); in this case, $F_{ks}$ in Eq. (2) is implemented by $F_{Log}$.
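The two KS measures can be sketched as follows, using only the Student's encoder and ELBO so that no per-expert inference model is needed; the exact aggregation in Eqs. (5)-(7) may differ, so `ks_cosine` and `ks_loglik` below should be read as illustrative approximations, where `x_new` are samples from the incoming task and `x_gen` are samples generated by the $j$-th expert.

```python
# Hedged sketches of the two knowledge-similarity measures, built on the Student VAE.
import torch
import torch.nn.functional as F

def ks_cosine(student, x_new, x_gen):
    """Average cosine distance between Student latent codes of the two sample sets."""
    z_new, _ = student.encode(x_new)          # z_{t+1,u}
    z_gen, _ = student.encode(x_gen)          # z_{j,u}
    cos = F.cosine_similarity(z_new, z_gen, dim=-1)   # per-pair similarity over u = 1..m
    return (1.0 - cos).mean().item()          # smaller value -> more similar

def ks_loglik(student, x_new, x_gen):
    """Gap between the Student's (ELBO-approximated) log-likelihoods of the two sets."""
    with torch.no_grad():
        return abs(student.elbo(x_new).item() - student.elbo(x_gen).item())
```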

Data-free knowledge distillation (KD)
Unlike other KD approaches, which transfer knowledge for predictive tasks [15], the proposed KD transfers data representation information through a sampling procedure, without accessing real samples or labels. In order to embed the information from all experts into a single latent space, we implement the Student module as a VAE with parameters $\{\theta_s, \xi_s\}$.
In the following, we introduce a new KD-based loss function that encourages knowledge transfer of both the posterior and the decoding distributions between the Teacher and the Student, Eqs. (8) and (9), where $p_{\theta_s}(x \mid z)$ and $q_{\xi_s}(z \mid x)$ are the decoding and variational distributions of the Student module. Since we cannot access past samples when evaluating Eqs. (8) and (9), we estimate the KL divergence through a sampling process, in which the past data $x$ is generated by each expert and is then used as input to the inference model of the same expert. Together with the KD process, we introduce a new objective function, Eq. (10), allowing the Student to learn novel knowledge without forgetting previously learnt information, where $\{\eta_1, \eta_2\}$ are hyperparameters balancing the learning of the new task against the already learnt knowledge. In practice, we divide the optimization of Eq. (10) into two independent optimization processes: one learns the new task by minimizing $-\mathcal{L}_{ELBO}(x; \theta_s, \xi_s)$, and the other performs the knowledge transfer by minimizing $\mathcal{L}_{KD_1} + \eta \mathcal{L}_{KD_2}$. We set $\eta = 0.001$ in all experiments.
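A hedged sketch of the resulting data-free distillation step is given below. Since Eqs. (8)-(10) are not reproduced here, the two KD terms are approximated: matching the expert's decoding of its own generated samples stands in for $\mathcal{L}_{KD_1}$, and a Gaussian KL between the expert's and the Student's posteriors on those samples stands in for $\mathcal{L}_{KD_2}$; the two losses are optimized jointly for brevity, whereas the paper splits them into two independent processes. `experts` and `student` are VAE instances from the earlier sketches.

```python
# Hedged sketch of data-free KD: "past" data is replayed by the frozen experts.
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                  - 1.0).sum(-1).mean()

def student_step(student, experts, x_new, opt, eta=0.001, n_gen=64, z_dim=32):
    # (i) learn the new task with the Student's own ELBO
    loss_new = -student.elbo(x_new)
    # (ii) data-free KD: replay samples generated by each (frozen) expert
    loss_kd = 0.0
    for expert in experts:
        with torch.no_grad():
            z = torch.randn(n_gen, z_dim)              # z ~ p(z)
            x_gen = torch.sigmoid(expert.decode(z))    # "past" data from the expert
            mu_t, logvar_t = expert.encode(x_gen)      # expert posterior on its own samples
        mu_s, logvar_s = student.encode(x_gen)
        kd1 = ((torch.sigmoid(student.decode(mu_s)) - x_gen) ** 2).mean()   # ~ L_KD1
        kd2 = gaussian_kl(mu_s, logvar_s, mu_t, logvar_t)                   # ~ L_KD2
        loss_kd = loss_kd + kd1 + eta * kd2
    opt.zero_grad(); (loss_new + loss_kd).backward(); opt.step()
```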

Fig. 1. The Lifelong Knowledge Distillation (LKD) framework, consisting of the Teacher and Student modules.