Unsupervised Transfer Aided Lifelong Regression for Learning New Tasks Without Target Output

As an emerging learning paradigm, lifelong learning solves multiple consecutive tasks based upon previously accumulated knowledge. When facing a new task, existing lifelong learning approaches need both input and desired output data to construct task models before knowledge transfer can succeed. However, labeling each task requires extensive labor and time, which can be prohibitive for real-world lifelong regression problems. To reduce this burden, we propose to incorporate unsupervised features into lifelong regression via coupled dictionary learning, enabling the learner to learn new tasks without target output data. Specifically, the input data for each task are encoded as an unsupervised feature, while both input and output data are used to construct the task predictor. The unsupervised feature is linked with the task predictor through two dictionaries that are coupled by a joint sparse representation. Because of the learned coupling between the two spaces, the task predictor for a newly arriving task can be recovered given only the input data. We further incorporate active task selection into this framework, enabling the learner to actively choose tasks to learn in a task-efficient manner. Three case studies are used to evaluate the effectiveness of our method, in comparison with existing lifelong learning approaches. Results show that our method is able to accurately predict new tasks through unsupervised transfer, eliminating the need to label tasks before constructing the predictor.


Tong Liu, Member, IEEE, Xulong Wang, Po Yang, Senior Member, IEEE, Sheng Chen, Life Fellow, IEEE, and Chris J. Harris

Index Terms: Lifelong regression, unsupervised feature, coupled dictionary learning, knowledge transfer, active task selection.

I. INTRODUCTION
Transfer learning and multi-task learning methods reduce the amount of experience needed to learn individual tasks by reusing knowledge from other related tasks [1], [2]. This knowledge transfer significantly improves learning efficiency and modeling performance, as compared to learning tasks in isolation following the traditional machine learning paradigm.
Transfer learning methods transfer knowledge from source tasks to help learn a new task [3], [4], [5], [6], but they do not optimize the performance over all the tasks. Multi-task learning methods jointly learn all the observed tasks by sharing knowledge [7], [8], [9], [10], [11], but they cannot learn new, unseen tasks.
To overcome the limitations of both transfer learning and multi-task learning, lifelong learning was proposed as a new research paradigm that learns consecutive new tasks based upon previously built knowledge and automatically updates the knowledge accumulated from past tasks as each new task is learned [12]. This technique is particularly suitable for scenarios with multiple consecutive tasks over long time scales [13], [14], [15], [16], [17], such as sentiment classification, robotic control, natural language processing and disease modeling [12]. It is widely understood that a fundamental principle for better learning is to incorporate available prior knowledge in the learning process [18], [19], [20]. Anything learned from a previous learning task can be regarded as a piece of knowledge, and this knowledge can be retained to help future learning. This is the core idea of lifelong learning. More specifically, lifelong learning maintains a knowledge base which stores the knowledge learned in the previous learning tasks. When learning a new task, the accumulated knowledge provides prior knowledge for the current learning task. New knowledge acquired in the current learning process is then used to update the knowledge base. For example, consider a student who has never studied psychology before and wants to study it. This can be regarded as a new task. The student has previously studied philosophy, literature, and other subjects. The knowledge the student gained in these past learning tasks is stored in the student's knowledge base, i.e., the student's brain, and this 'prior' knowledge can help the student in learning the new subject of psychology. New knowledge that the student gains in studying psychology will in turn enhance the student's knowledge base. It can be seen that lifelong learning imitates human learning.
Within the lifelong learning community, the efficient lifelong learning algorithm (ELLA) framework is one of the most popular approaches [21], [22]. The ELLA factorizes learned task models into a shared latent dictionary that serves as the knowledge base and facilitates knowledge transfer as tasks arrive consecutively. When a new task arrives, the ELLA transfers knowledge through the shared dictionary to learn the new model, and refines the dictionary with the knowledge learned from the current task. By updating the dictionary over time, newly acquired knowledge is incorporated into the knowledge base, thereby improving the performance of previously learned models. The ELLA framework was first created for regression and classification, and it was later developed for policy gradient reinforcement learning (PG-ELLA) [23], [24], [25], [26], [27]. By replacing the task model with a policy, the PG-ELLA enables learning decision-making tasks consecutively, transferring knowledge to accelerate the learning of new policies. The work of [28] further extended ELLA from a single agent to a network of agents, and proposed the collective lifelong learning algorithm to enable sharing knowledge in a distributed manner among multiple agents. Different from ELLA, another typical lifelong learning model is the deep neural network, where catastrophic forgetting is the key issue in its continuous learning process. Inspired by synaptic consolidation in human brains, elastic weight consolidation (EWC) was proposed to combat the catastrophic forgetting problem in deep networks by restricting the change of neural network weights that are important for previous tasks when learning a new task [29]. The EWC has been successfully applied to object detection [30], neural machine translation [31], image generation [32], and so on.
One notable issue is that lifelong learning in the above problem setting is a passive process, in which the learner must learn every task it encounters and has no control over the order in which tasks are learned. In some situations, the agent may have a pool of candidate tasks to learn, and it can intelligently choose the next task to learn in order to maximize the overall performance. With this goal, the work [33] incorporates an active curriculum selection strategy into ELLA, enabling the learner to choose tasks in a certain order so as to maximize future learning performance using as few tasks as possible. The authors of [33] proposed several active task selection mechanisms for selecting the next best task, and demonstrated that the diversity heuristic method (ELLA-diver) builds the knowledge library more efficiently than the other methods. Considering a different form of active task selection, the work [34] integrates outlier detection into lifelong learning so as to selectively learn the next task based on the tasks' importance. By either choosing tasks in a certain order or selectively choosing important tasks to learn, both of these methods learn in a task-efficient manner, which is particularly important when dealing with massive numbers of tasks.
While the above lifelong learning methods demonstrate outstanding performance in many applications, one important prerequisite is to gather sufficient input and desired output data for each newly arriving task in order to characterize task relationships. For lifelong regression problems, the desired output is also referred to as the target output. When a new task arrives, the learner requires sufficient training data of both input and target output to identify task relationships before bootstrapping a model via transfer. This need for desired output data poses a serious challenge for practical lifelong regression problems, as persistent manual data annotation for every new task is time-consuming and economically costly, and the learner is often expected to learn a new task rapidly without waiting for the task to be labeled. To overcome this restriction, the well-known early work [35] incorporates high-level task descriptors into lifelong reinforcement learning (TaDeLL), and uses both task descriptors and training data to model inter-task relationships. The results of [35] show that using task descriptors improves the performance of learned policies and, moreover, enables predicting the policy for a new task without training data via zero-shot transfer given only task descriptors. TaDeLL was further extended to regression problems in [36], where the task model can be predicted given only the descriptors of a new task. This 'learning without training data' seems very appealing. However, TaDeLL requires domain-specific task descriptors that must characterize the underlying dynamics of the data in individual tasks well. For instance, the work [36] used the basic parameters of the engineering system considered, such as length, mass and damping constant, as task descriptors, because these parameters define the system's underlying dynamics and are closely related to the data characteristics. However, for most real-world tasks, finding such appropriate and unified descriptors to identify different tasks requires in-depth cross-domain knowledge, which is generally impossible to achieve. Moreover, inaccurate task descriptors will lead to a wrong task model and considerably degrade the achievable learning performance. Hence, TaDeLL is not generally applicable to many applications.
Consequently, to the best of our knowledge, how to efficiently utilize large amounts of unlabeled data in characterizing and learning each consecutive task with improved performance is an important challenge for the lifelong learning community. This motivates our current work to develop an effective lifelong regression model that can learn a new task without target output data, thus reducing the burden of labeling every incoming task. We explore the use of input data to achieve unsupervised transfer for learning a new task without desired output. Our approach to incorporating input information into lifelong regression is general, as it does not need domain-specific task descriptors that require human experts. Instead, we encode input data as feature vectors that identify each task, and treat these unsupervised features as side information to augment the task predictors of the individual tasks. Similar to [35], [36], [37], we use coupled dictionary learning to link the unsupervised feature space with the task predictor's parameter space, where the two spaces are linked through two dictionaries that are coupled by the same sparse coding. Because of the learned coupling between the two spaces, the unsupervised features act as a backup to the task predictor, enabling the learner to accurately construct predictors for unseen tasks given only their unsupervised features. This capacity is very important in the online setting of the lifelong regression process. It enables the agent to rapidly learn new tasks through unsupervised transfer from the previously learned tasks, without the need to first label the future tasks to collect the target output data. To make our lifelong model learn in a task-efficient manner, we further incorporate active task selection into this framework. Three case studies, 1) school examination score prediction, 2) Parkinson's disease symptom score prediction, and 3) Alzheimer's disease progression modeling, are used to demonstrate the effectiveness of the proposed scheme, in comparison with existing lifelong learning approaches. Extensive experiments demonstrate that our method can accurately predict the new task using only input data via unsupervised transfer.
Notably, it should be emphasized that our proposed unsupervised transfer aided lifelong learning differs from unsupervised transfer learning or domain adaptation. The goal of unsupervised domain adaptation is to train a single model for a target domain or task with unlabeled data by transferring knowledge from a source task in which desired output data are accessible [38], [39], [40]. These methods usually consider only a single target task, and they fail to learn in a lifelong setting where multiple tasks are acquired sequentially over long time scales. Unlike the traditional unsupervised domain adaptation methods, which are restricted to the single-source single-target case, the recently emerged multi-target domain adaptation methods are able to deal with multiple domains [41], [42], [43], [44]. But they still fail to learn in a continual manner. Another related learning paradigm is continual learning [45]. Continual learning aims to address the catastrophic forgetting problem, in which the model is likely to forget previously learned tasks when encountering new tasks. Existing continual learning methods use either model regularization or experience replay to tackle catastrophic forgetting [29], [46]. By incorporating a continual learning mechanism into unsupervised domain adaptation, the recent continual domain adaptation is most similar to our problem setting. In continual domain adaptation, the unlabeled or labeled target task data are received in streaming batches, and the model is continuously adapted with each batch of target data [47], [48], [49], [50], [51]. Note that continual domain adaptation aims to learn adaptively to deal with domain shift when encountering new unseen tasks. By contrast, our method learns each new task adaptively while at the same time optimizing the performance over all encountered tasks by updating the accumulated knowledge. A comparison of various learning paradigms is tabulated in Table I. Moreover, the continual domain adaptation methods only focus on object recognition or classification, and they are not applicable to regression learning [47], [48]. Although the existing lifelong learning approaches, such as ELLA [22] and TaDeLL [36], can be used for regression problems, they fail to learn and predict new consecutive tasks using solely unannotated data.
It is worth recapping that although TaDeLL [36] is the most similar to our method, its capacity of learning without data heavily depends on finding an appropriate task descriptor. As mentioned above, finding such appropriate task descriptors typically requires in-depth expert knowledge, which is generally unavailable for most real-world applications. By contrast, our method is immune to this restriction, and it only requires unlabeled data, which are easy to obtain for new tasks. Our method can be considered as an improvement over existing lifelong learning methods with the key idea of unsupervised transfer. Specifically, this paper provides the following contributions:
1) Based on coupled dictionary learning, we incorporate unsupervised features into lifelong learning, which uses a factorized representation of the learned knowledge to facilitate transfer and improve predictive performance.
2) Most importantly, we show that our proposed method is able to accurately model and predict new consecutive tasks using solely unannotated data through unsupervised transfer.
3) The proposed scheme is integrated with an active task selection mechanism, which further improves learning efficiency when encountering massive numbers of tasks.
4) We analyze the method theoretically, and use three real-world datasets to validate its effectiveness.
The rest of this paper is organized as follows. Section II reviews the background on lifelong learning. Section III presents the proposed unsupervised transfer aided lifelong regression framework in detail. Section IV summarizes our proposed algorithm with theoretical analysis. Section V evaluates the proposed method with three case studies. Section VI concludes the paper with remarks about future work.

A. Problem Definition
In the lifelong regression setting, the learner faces multiple consecutive regression tasks $\{Z^{(1)}, Z^{(2)}, \ldots, Z^{(T_{\max})}\}$, and must rapidly learn each new task by building upon its previous knowledge. Each regression task $Z^{(t)} = \{f^{(t)}, x^{(t)}, y^{(t)}\}$ is specified by a function mapping $f^{(t)}: x^{(t)} \to y^{(t)}$ from the input space $x^{(t)} \in \mathbb{R}^d$ onto the output space $y^{(t)} \in \mathbb{R}$. At each time step $t$, the agent receives a batch of $n_t$ labeled training data $\{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^{n_t}$ for learning task $t$, where $x_i^{(t)}$ and $y_i^{(t)}$ denote the $i$th training input sample and the associated desired output sample, respectively, for task $t$.
Let $T$ denote the number of tasks that the agent has encountered so far. Its goal is to consecutively construct a set of task models or predictors $\{\hat{f}^{(1)}, \ldots, \hat{f}^{(T)}\}$ such that each $\hat{f}^{(t)}$ approximates $f^{(t)}$ well enough to make accurate predictions on new data, and a new model $\hat{f}^{(t)}$ can be acquired efficiently when the agent encounters a new task $t$. Ideally, knowledge learned from the previous tasks $\{Z^{(1)}, \ldots, Z^{(T-1)}\}$ should accelerate training and improve performance on the new task $Z^{(T)}$.
It can be seen that lifelong learning is very different from existing learning frameworks. In the traditional learning framework, the learner has multiple batches of data generated by the same underlying process, and therefore the learner may use the multiple models identified from the previously encountered data batches to predict the new batch of data, using, for example, selective ensemble regression. Lifelong learning represents a much more general learning setting. Every task can represent a different batch of data, characterized by its own task definition and associated underlying data generation mechanism. Hence, it is necessary to construct a new task model for each new task, and the task data can be discarded after the new model has been learned. On the other hand, all the tasks share some common characteristics or knowledge, which can be exploited to facilitate faster and better learning of the new task. This is the essence of lifelong learning.

B. Efficient Lifelong Learning Algorithm
The ELLA [22] was developed to operate in this lifelong learning setting. To be specific, the ELLA learns and maintains a shared knowledge library $L \in \mathbb{R}^{d \times k}$, which forms a basis for all task models and facilitates knowledge transfer between tasks. For each task $t$, the ELLA learns a model $f^{(t)}(x) = f(x; \theta^{(t)})$ that is parametrized by a $d$-dimensional task-specific parameter vector $\theta^{(t)}$. This model parameter vector is a linear combination of the columns of $L$ using the sparse coefficients $s^{(t)} \in \mathbb{R}^k$, namely $\theta^{(t)} = L s^{(t)}$. The dictionary $L$ stores chunks of knowledge that are shared by all the tasks, and the sparse code $s^{(t)}$ extracts the relevant pieces of knowledge for a particular task $t$. Hence, this model parameter factorization enables effective knowledge transfer among tasks.
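The factorization described above can be illustrated with a small sketch. The dictionary and sparse codes below are hypothetical toy values chosen only for illustration (they do not come from the paper), and plain Python lists are used to keep the example dependency-free.

```python
def mat_vec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Hypothetical shared library L with d = 3 rows and k = 4 columns (basis vectors).
L = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]

# Hypothetical sparse codes: each task activates only a few columns of L.
s_task_a = [2.0, 0.0, 0.0, 0.5]
s_task_b = [0.0, 0.0, 4.0, 0.5]

# Task parameters are recovered as theta = L s.
theta_a = mat_vec(L, s_task_a)
theta_b = mat_vec(L, s_task_b)
```

Both tasks reuse the fourth column of `L`, which is the sense in which the dictionary carries shared knowledge while each sparse code selects the pieces relevant to one task.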
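Given the training data $\{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^{n_t}$ for each task $t$, the ELLA minimizes the following objective function:
$$\min_{L, S} \; \frac{1}{T} \sum_{t=1}^{T} \left\{ J\big(\theta^{(t)}\big) + \mu \big\| s^{(t)} \big\|_1 \right\} + \lambda \| L \|_F^2, \qquad (1)$$
where $J(\theta^{(t)}) = \frac{1}{n_t} \sum_{i=1}^{n_t} J\big(y_i^{(t)}, f(x_i^{(t)}; L s^{(t)})\big)$ with $J(\cdot)$ being a squared-loss function for the regression problem $y_i^{(t)} = f(x_i^{(t)}; L s^{(t)})$, $S = [s^{(1)} \; s^{(2)} \cdots s^{(T)}]$ is the matrix consisting of all the sparse coefficient vectors, and the $L_1$ norm is used to control the sparsity of $s^{(t)}$ with the regularization parameter $\mu$, while $\| \cdot \|_F$ is the Frobenius norm, which regularizes the complexity of the dictionary $L$ with the regularization parameter $\lambda$. This problem can be solved in a batch learning setting within an off-line multi-task learning framework [52].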
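To solve it in a lifelong learning setting, the ELLA takes a second-order Taylor expansion to approximate the objective around an estimate $\hat{\theta}^{(t)} = \arg\min_{\theta} J(\theta)$ of the single-task model parameters for each task, and updates only the coefficients $s^{(t)}$ for the current task at each time step. This process reduces the optimization (1) to the problem of sparse coding the single-task model in the shared dictionary $L$, and enables solving $L$ and $S$ efficiently by the following recursive updating rules that constitute the ELLA:
$$s^{(t)} \leftarrow \arg\min_{s} \big\| \hat{\theta}^{(t)} - L s \big\|_{\Upsilon^{(t)}}^2 + \mu \| s \|_1, \qquad (3)$$
$$A \leftarrow A + \big( s^{(t)} s^{(t)T} \big) \otimes \Upsilon^{(t)}, \quad b \leftarrow b + \mathrm{vec}\big[ s^{(t)T} \otimes \big( \hat{\theta}^{(t)T} \Upsilon^{(t)} \big) \big], \qquad (4)$$
$$L \leftarrow \mathrm{mat}\Big[ \big( \tfrac{1}{T} A + \lambda I_{(kd)} \big)^{-1} \tfrac{1}{T} b \Big]_{d \times k}, \qquad (5)$$
where $\| v \|_A^2 = v^T A v$, the elements of $L$ are initialized by randomly taking values from $(0, 1)$, $\Upsilon^{(t)}$ is the Hessian matrix of the loss $J$ evaluated at $\hat{\theta}^{(t)}$, $\otimes$ denotes the Kronecker product, $A \in \mathbb{R}^{(kd) \times (kd)}$ is initialized to the all-zero matrix, and $b \in \mathbb{R}^{kd}$ is initialized to the all-zero vector. The vector stacking operator $\mathrm{vec}[\cdot]$ stacks the columns of a matrix one by one to form a vector, $I_{(kd)}$ is the $(kd) \times (kd)$ identity matrix, and the matrix forming operator $\mathrm{mat}[\cdot]_{d \times k}$ converts a $(dk)$-dimensional vector into a $(d \times k)$-dimensional matrix.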
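As a rough illustration of the dictionary refresh step, the sketch below accumulates $A$ and $b$ for one task and re-solves for $L$, following the update rules published for ELLA [22]. It is a minimal sketch under stated assumptions: the function names `solve` and `ella_dictionary_update`, the toy values, and the use of plain Python lists are all choices made here for illustration, and the sparse code `s` is assumed to be already available.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def ella_dictionary_update(A, b, s, theta_hat, Upsilon, T, lam):
    """Accumulate A and b in place for one task, then return the d x k dictionary."""
    d, k = len(theta_hat), len(s)
    # Upsilon is assumed symmetric, so Upsilon @ theta_hat equals (theta_hat^T Upsilon)^T.
    Ut = [sum(Upsilon[i][j] * theta_hat[j] for j in range(d)) for i in range(d)]
    for p in range(k):
        for q in range(k):
            for i in range(d):
                for j in range(d):
                    # Entry of the Kronecker product (s s^T) (x) Upsilon.
                    A[p * d + i][q * d + j] += s[p] * s[q] * Upsilon[i][j]
        for i in range(d):
            b[p * d + i] += s[p] * Ut[i]  # entry of vec[s^T (x) (theta_hat^T Upsilon)]
    n = k * d
    lhs = [[A[r][c] / T + (lam if r == c else 0.0) for c in range(n)] for r in range(n)]
    vec_L = solve(lhs, [b_r / T for b_r in b])
    # mat[.]: un-stack the column-wise vectorisation back into a d x k matrix.
    return [[vec_L[c * d + i] for c in range(k)] for i in range(d)]

# Hypothetical toy task with d = 2, k = 2.
A = [[0.0] * 4 for _ in range(4)]
b = [0.0] * 4
L_new = ella_dictionary_update(A, b, s=[1.0, 0.0], theta_hat=[1.0, 2.0],
                               Upsilon=[[1.0, 0.0], [0.0, 1.0]], T=1, lam=0.01)
```

In this toy case the only active column of the dictionary moves to a slightly shrunk copy of `theta_hat`, and the unused column stays at zero because of the Frobenius-norm penalty.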
Each time a new task $t$ arrives, this method requires the input-output data $\{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^{n_t}$ before updating $s^{(t)}$ and $L$. However, labeling data for every upcoming task is time-consuming, and most of the time only unlabeled (input) data are available when a new task is first encountered. In order to eliminate this need for desired output data, in this paper, we propose to incorporate unsupervised features into the learning process, and hence to enable unsupervised transfer on new tasks. Specifically, after learning a few tasks with complete input-output data, future task models can be constructed given only input information.

A. Overview of Proposed Framework
For task $t$, define its training input data matrix $X^{(t)} \in \mathbb{R}^{d \times n_t}$ as $X^{(t)} = [x_1^{(t)} \cdots x_{n_t}^{(t)}]$ and the corresponding desired output vector $y^{(t)} \in \mathbb{R}^{n_t}$ as $y^{(t)} = [y_1^{(t)} \cdots y_{n_t}^{(t)}]^T$. As depicted in Fig. 1, our proposed framework follows the lifelong learning setting. During the agent's lifetime, massive numbers of tasks are received consecutively. As a new task arrives, knowledge accumulated from the previous tasks is selectively transferred to learn the new task, and newly acquired knowledge from the current task is stored in the knowledge base for future use. In order to achieve unsupervised knowledge transfer on a new task, we incorporate unsupervised features into lifelong learning via sparse coding with a coupled dictionary, enabling the unsupervised feature and the task predictor to augment each other. For each task with complete training data, the task predictor is constructed from input and output data, while the unsupervised feature is encoded from input data only. In order to link the two feature spaces, we employ two dictionaries that act as knowledge repositories for the two spaces, and they are coupled by a joint sparse representation. Because of the learned coupling, the predictor for a new task can be reconstructed given only the unsupervised feature. This capacity of learning new task predictors without desired output eliminates the need to label new tasks in the lifelong regression process.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The above lifelong learning framework is a passive process, in which the learner has no control over the order of the tasks to learn. In some situations, the learner has knowledge of the next several tasks that it needs to learn. Motivated by [33], we further extend this framework to active lifelong regression by incorporating a similar active task selection mechanism. Hence, our model can choose the next task to learn from a pool of candidate tasks in order to maximize future learning performance. In the following, we present each component of our algorithm in detail.

B. Task Predictor
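Ideally, each task has complete training input and output data $(X^{(t)}, y^{(t)})$ that enable the construction of the task predictor $f(X; \theta) = X^T \theta$. We construct the task predictor by the regularized least squares (LS) estimator, and the model parameter vector is thus computed as
$$\hat{\theta}^{(t)} = \big( X^{(t)} X^{(t)T} + \beta I_d \big)^{-1} X^{(t)} y^{(t)}, \qquad (6)$$
where $\beta$ is a small positive regularization parameter, e.g., $\beta = 10^{-6}$. The Hessian $\Upsilon^{(t)}$ of the squared-loss function $J(\theta^{(t)})$ around the single-task solution $\hat{\theta}^{(t)}$ is given by
$$\Upsilon^{(t)} = \frac{1}{2} \nabla_{\theta}^2 J(\theta) \Big|_{\theta = \hat{\theta}^{(t)}}, \qquad (7)$$
which for the linear predictor above is proportional to $X^{(t)} X^{(t)T}$. For each task with complete training input and output data, we first compute the predictor's parameters $\hat{\theta}^{(t)}$ and the Hessian $\Upsilon^{(t)}$ before performing knowledge transfer in the learning process.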
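The regularized LS step can be sketched as follows for the linear predictor $f(X; \theta) = X^T \theta$ with one sample per column of $X$. This is a minimal sketch, not the authors' implementation: the helper names `solve` and `ridge_predictor` and the toy data are assumptions made here, and the function returns the unscaled matrix $X X^T$ because the exact normalization of the Hessian depends on how the loss is scaled.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def ridge_predictor(X, y, beta=1e-6):
    """Return (theta_hat, XXt) for a d x n_t input matrix X and n_t targets y."""
    d, n = len(X), len(y)
    XXt = [[sum(X[i][m] * X[j][m] for m in range(n)) for j in range(d)] for i in range(d)]
    Xy = [sum(X[i][m] * y[m] for m in range(n)) for i in range(d)]
    # Add the small ridge term beta * I_d to the diagonal before solving.
    lhs = [[XXt[i][j] + (beta if i == j else 0.0) for j in range(d)] for i in range(d)]
    return solve(lhs, Xy), XXt

# Hypothetical noise-free toy task generated by y = 2 * x1 - 1 * x2.
X = [[1.0, 0.0, 1.0, 2.0],
     [0.0, 1.0, 1.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]
theta_hat, XXt = ridge_predictor(X, y)
```

With such a small `beta`, the estimate is essentially the ordinary LS solution; the ridge term mainly guards against a singular $X X^T$ when a task has few samples.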

C. Unsupervised Feature
When a new task $t$ arrives, it is often easy to obtain unlabeled (input) data $X^{(t)}$, while target output data $y^{(t)}$ are difficult to acquire quickly. Although input data alone cannot be used to construct a task predictor, they still contain vital information that identifies each task. Our goal is to use the input data to supplement the task predictor, treating them as a backup for learning a new task when output data are unavailable.
To incorporate input information into the learning procedure, the input data matrix $X^{(t)} \in \mathbb{R}^{d \times n_t}$ needs to be transformed into a $d$-dimensional feature vector that can be linked with the predictor's parameter vector $\theta \in \mathbb{R}^d$. We therefore transform the original input data matrix $X^{(t)}$ into the $d$-dimensional feature vector $\phi(X^{(t)})$, where $\phi(\cdot)$ is an operator that encodes a matrix as a vector. Express the $i$th column of $X^{(t)}$ as $x_i^{(t)}$. The simplest way to achieve this encoding is to compute the mean value of each row of $X^{(t)}$, yielding
$$\bar{x}^{(t)} = \phi\big(X^{(t)}\big) = \frac{1}{n_t} \sum_{i=1}^{n_t} x_i^{(t)}, \qquad (8)$$
where $\bar{x}^{(t)} = \phi(X^{(t)})$ is the unsupervised feature for task $t$. Our lifelong learner uses this feature vector to represent each task, treating it as side information to augment the task predictor for individual tasks.
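The row-mean encoding is simple enough to state in a few lines. The function name `encode_task` and the toy matrix below are hypothetical and used only for illustration.

```python
def encode_task(X):
    """Map a d x n_t input matrix to a d-dimensional feature: the mean of each row."""
    return [sum(row) / len(row) for row in X]

# Hypothetical input matrix with d = 2 features and n_t = 3 samples (one per column).
X_task = [[1.0, 3.0, 5.0],
          [2.0, 2.0, 2.0]]
x_bar = encode_task(X_task)
```

Note that the resulting vector always has length `d` regardless of how many samples a task has, which is what allows it to share a sparse code with the `d`-dimensional predictor parameters.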

D. Coupled Dictionary Optimization
After obtaining the predictor parameter vector $\hat{\theta}^{(t)}$ and the unsupervised feature $\bar{x}^{(t)} = \phi(X^{(t)})$ for each task, the next step is to link the two feature spaces, so that each can augment the learning of the other. Motivated by [35], [36], we link the two feature spaces through dual dictionaries that are coupled by a joint sparse representation. The original idea of using coupled dictionary learning is to link high-level task descriptions with the learned model to achieve zero-shot transfer for new tasks. We use the coupled dictionaries to link the task predictor's space with the unsupervised feature's space, so as to make full use of input information and learn new tasks without output data.
Recall that the lifelong learning approach factorizes the predictor parameters $\theta^{(t)}$ for each task as a sparse linear combination of a shared dictionary by $\theta^{(t)} = L s^{(t)}$, where each column of the dictionary $L$ represents a cohesive chunk of knowledge. In lifelong learning, the dictionary $L$ is refined over time as the model learns more tasks. The sparse coefficient vectors $S$ encode the task predictors in the shared dictionary, providing an embedding of the tasks based on how their predictors share knowledge. Similarly, the unsupervised feature vector $\bar{x}^{(t)}$ can also be linearly factorized using a shared dictionary $K \in \mathbb{R}^{d \times k}$ over the unsupervised feature's space. Like $L$, this dictionary $K$ captures the relationships among the unsupervised features of multiple tasks, with coefficients that similarly embed tasks based on the commonalities in their unsupervised features. In order to link the two spaces, we enforce the two dictionaries, $L$ and $K$, to share the same sparse coefficient vectors $S$ so as to reconstruct both the predictors and the unsupervised features. Hence, for task $t$,
$$\theta^{(t)} = L s^{(t)}, \quad \bar{x}^{(t)} = K s^{(t)}. \qquad (9)$$
Because we enforce the same sparse code $s^{(t)}$ on the two dictionaries, the relevant pieces of information for a task predictor are coupled with its associated unsupervised feature. To optimize $L$ and $K$, we first reformulate the objective (1) for the coupled dictionaries as
$$\min_{L, K, S} \; \frac{1}{T} \sum_{t=1}^{T} \left\{ J\big(\theta^{(t)}\big) + \rho \big\| \bar{x}^{(t)} - K s^{(t)} \big\|_2^2 + \mu \big\| s^{(t)} \big\|_1 \right\} + \lambda \big( \| L \|_F^2 + \| K \|_F^2 \big), \qquad (10)$$
where the parameter $\rho$ balances the task predictor's fit against the unsupervised feature's fit.
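To solve the optimization (10) in a lifelong setting, we approximate $J(\theta^{(t)})$ by a second-order Taylor expansion around the regularized LS parameter estimate $\hat{\theta}^{(t)}$ given in (6). That is, we expand $J(\theta^{(t)})$ around $\hat{\theta}^{(t)}$ for each task as
$$J\big(\theta^{(t)}\big) \approx J\big(\hat{\theta}^{(t)}\big) + \nabla J\big(\hat{\theta}^{(t)}\big)^T \big( \theta^{(t)} - \hat{\theta}^{(t)} \big) + \frac{1}{2} \big( \theta^{(t)} - \hat{\theta}^{(t)} \big)^T \nabla^2 J\big(\hat{\theta}^{(t)}\big) \big( \theta^{(t)} - \hat{\theta}^{(t)} \big), \qquad (11)$$
where $\nabla$ denotes the gradient operator. The first term $J(\hat{\theta}^{(t)})$ is a constant and can be omitted. Since $\hat{\theta}^{(t)}$ is the minimizer of the objective $J(\theta^{(t)})$, $\nabla J(\hat{\theta}^{(t)})$ is zero, and the second term can also be removed. Thus, the loss function $J(\theta^{(t)})$ is approximated by the last term of (11), which can be rewritten as $\| \hat{\theta}^{(t)} - L s^{(t)} \|_{\Upsilon^{(t)}}^2$, given $\theta^{(t)} = L s^{(t)}$. With this approximated $J(\theta^{(t)})$, the optimization (10) is simplified as
$$\min_{L, K, S} \; \frac{1}{T} \sum_{t=1}^{T} \left\{ \big\| \hat{\theta}^{(t)} - L s^{(t)} \big\|_{\Upsilon^{(t)}}^2 + \rho \big\| \bar{x}^{(t)} - K s^{(t)} \big\|_2^2 + \mu \big\| s^{(t)} \big\|_1 \right\} + \lambda \big( \| L \|_F^2 + \| K \|_F^2 \big). \qquad (12)$$
Further defining
$$\tilde{\theta}^{(t)} = \begin{bmatrix} \hat{\theta}^{(t)} \\ \bar{x}^{(t)} \end{bmatrix}, \quad H = \begin{bmatrix} L \\ K \end{bmatrix}, \quad \tilde{\Upsilon}^{(t)} = \begin{bmatrix} \Upsilon^{(t)} & 0_{d \times d} \\ 0_{d \times d} & \rho I_d \end{bmatrix}, \qquad (13)$$
where $0_{d \times d}$ is the $d \times d$ zero matrix, the optimization (12) can be rewritten in a concise form as
$$\min_{H, S} \; \frac{1}{T} \sum_{t=1}^{T} \left\{ \big\| \tilde{\theta}^{(t)} - H s^{(t)} \big\|_{\tilde{\Upsilon}^{(t)}}^2 + \mu \big\| s^{(t)} \big\|_1 \right\} + \lambda \| H \|_F^2. \qquad (14)$$
This optimization has the same form as the approximated ELLA objective, and it can be solved efficiently in a lifelong setting. Specifically, similar to the classic ELLA, we can solve for the sparse vector $s^{(t)}$ given the $H$ acquired at task $(t - 1)$, and then update $H$, i.e., $L$ and $K$, given $s^{(t)}$. Given $s^{(t)}$, the two dictionaries $L$ and $K$ can be updated independently, because the weighting matrix in (13) is block diagonal. When a task arrives, we perform three operations to update our model, namely, compute $s^{(t)}$, update $L$ and update $K$. Specifically, the sparse vector $s^{(t)}$ is first computed using the current basis $H$ by solving the following $L_1$-regularized regression problem, which is an instance of the Lasso:
$$s^{(t)} \leftarrow \arg\min_{s} \big\| \tilde{\theta}^{(t)} - H s \big\|_{\tilde{\Upsilon}^{(t)}}^2 + \mu \| s \|_1. \qquad (15)$$
After $s^{(t)}$ is obtained, the two dictionaries, $L$ and $K$, can be calculated independently by the recursive updating (3) to (5). In particular, to update the dictionary $K$, we simply replace $\Upsilon^{(t)}$ by $\rho I_d$, $\hat{\theta}^{(t)}$ by $\bar{x}^{(t)}$ and $L$ by $K$ in (3) to (5). The per-task updating rules are given in Algorithm 1.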
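The joint sparse coding step can be sketched as a weighted Lasso over the stacked dictionary $[L; K]$ with the block-diagonal weight built from $\Upsilon^{(t)}$ and $\rho I_d$. The paper does not state which Lasso solver is used; cyclic coordinate descent is assumed here purely for illustration, and the function names and toy values are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by Lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def weighted_lasso(H, W, target, mu, sweeps=200):
    """Minimise (target - H s)^T W (target - H s) + mu * ||s||_1 over s."""
    m, k = len(H), len(H[0])
    s = [0.0] * k
    for _ in range(sweeps):
        for j in range(k):
            # Residual with the contribution of column j removed.
            r = [target[i] - sum(H[i][c] * s[c] for c in range(k) if c != j)
                 for i in range(m)]
            Wr = [sum(W[i][q] * r[q] for q in range(m)) for i in range(m)]
            Wh = [sum(W[i][q] * H[q][j] for q in range(m)) for i in range(m)]
            num = sum(H[i][j] * Wr[i] for i in range(m))   # h_j^T W r
            den = sum(H[i][j] * Wh[i] for i in range(m))   # h_j^T W h_j
            s[j] = soft_threshold(num, mu / 2.0) / den if den > 0 else 0.0
    return s

def stack_coupled(L, K, Upsilon, rho):
    """Build H = [L; K] and the block-diagonal weight diag(Upsilon, rho * I_d)."""
    d = len(L)
    H = [row[:] for row in L] + [row[:] for row in K]
    W = [[0.0] * (2 * d) for _ in range(2 * d)]
    for i in range(d):
        for j in range(d):
            W[i][j] = Upsilon[i][j]
        W[d + i][d + i] = rho
    return H, W

# Hypothetical toy values with d = 2 and k = 2.
L = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
Upsilon = [[1.0, 0.0], [0.0, 1.0]]
theta_hat = [3.0, 0.0]
x_bar = [3.0, 0.0]
H, W = stack_coupled(L, K, Upsilon, rho=1.0)
s = weighted_lasso(H, W, theta_hat + x_bar, mu=0.4)
```

In this toy case both the predictor and the feature point along the first dictionary column, so the shared code keeps only its first entry, shrunk slightly by the $L_1$ penalty.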
Remark 1: Solving the sparse coding by (15) is basically learning the new task with the previously built knowledge repository $H$, that is, knowledge transfer from past learned tasks, while

Algorithm 1: Unsupervised Transfer Aided Lifelong Regression.

E. Unsupervised Transfer Learning
In a lifelong setting, multiple consecutive tasks arrive rapidly, and there may be insufficient time to label every incoming task. For tasks with only input data, existing approaches are unable to construct a predictor. Incorporating the unsupervised feature, however, enables our approach to construct a predictor for a new task with only input data. This ability to perform unsupervised transfer is enabled by the coupled dictionary learning, which allows us to use the unsupervised feature to recover the task predictor through the coupled dictionaries and sparse coding. The unsupervised transfer process for learning a new task using solely unlabeled data together with the previously learned libraries $L$ and $K$ is shown in Fig. 2.
Given the input data $X^{(t_{\rm new})}$ for a new task, we first encode $X^{(t_{\rm new})}$ as the feature vector $\bar{x}^{(t_{\rm new})} = \phi(X^{(t_{\rm new})})$, and then estimate the sparse coding in the latent unsupervised feature space via Lasso on the learned dictionary $K$:
$$s^{(t_{\rm new})} = \arg\min_{s} \big\| \bar{x}^{(t_{\rm new})} - K s \big\|_2^2 + \mu \| s \|_1. \qquad (16)$$
Since this estimated $s^{(t_{\rm new})}$ also serves as the sparse coding for the latent dictionary $L$, it can be used to recover the task predictor for the new task $t_{\rm new}$ as
$$\hat{\theta}^{(t_{\rm new})} = L s^{(t_{\rm new})}. \qquad (17)$$
Hence, the new task predictor's parameter vector $\hat{\theta}^{(t_{\rm new})}$ is obtained only through the task's input data $X^{(t_{\rm new})}$. This eliminates the need to collect output data for model construction. This unsupervised transfer learning procedure is given in Algorithm 2.
Algorithm 2: Unsupervised transfer learning procedure.
1: Inputs: input data for new task $X^{(t_{\text{new}})}$, learned libraries L and K.
2: Encode $X^{(t_{\text{new}})}$ into feature vector $x^{(t_{\text{new}})}$ using (8).
3: Solve sparse coding $s^{(t_{\text{new}})}$ by the Lasso of (16).
4: Recover task predictor by computing its parameter vector $\theta^{(t_{\text{new}})}$ using (17).
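The unsupervised transfer steps can be sketched in Python as below. The sketch is illustrative only: it assumes the feature encoding is the per-dimension mean of the input data (the choice described in the conclusion of this paper), that (16) is an unweighted Lasso with a sparsity weight called `mu` here, and that the predictor is recovered as the product of L and the sparse code. The function names are hypothetical.

```python
import numpy as np


def ista_lasso(A, y, mu, n_iter=500):
    """Minimise ||y - A s||_2^2 + mu * ||s||_1 with proximal gradient (ISTA)."""
    s = np.zeros(A.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        z = s - step * 2.0 * (A.T @ (A @ s - y))
        s = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)
    return s


def unsupervised_transfer(X_new, L, K, mu):
    """Recover a linear predictor for an unlabeled task from its inputs only.

    X_new: (n, d) input matrix; L, K: (d, k) learned dictionaries.
    """
    x_feat = X_new.mean(axis=0)      # step 2: encode inputs (mean feature, assumed)
    s = ista_lasso(K, x_feat, mu)    # step 3: sparse code on the feature dictionary K
    theta = L @ s                    # step 4: same code applied to the predictor dictionary L
    return theta, s


def predict(X, theta):
    """Linear prediction with the recovered parameter vector."""
    return X @ theta
```

No output data enters `unsupervised_transfer`; the only link between the feature space and the predictor space is the shared sparse code `s`.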

F. Active Task Selection
To make our method capable of learning in a task-efficient manner, we further incorporate an active task selection mechanism into our approach. The problem is formulated as follows. The agent has access to training data from a pool of candidate unlearned tasks $\{Z^{(T+1)}, \ldots, Z^{(T_{\text{pool}})}\}$, where $T + 1 < T_{\text{pool}} < T_{\max}$. Based on the training data for these candidate tasks, the learner selects the index of the next task to learn, $t_{\text{next}} \in \{T+1, \ldots, T_{\text{pool}}\}$, so as to maximize the learning performance. Without loss of generality, the value of $T_{\text{pool}}$ is fixed at $T_{\text{pool}} = \frac{1}{2} T_{\max}$ in our study. We employ the diversity heuristic proposed in [33] for selecting the next best task. The basic idea is to encourage the current learned model or library to capture information from the widest range of tasks. If the current library does not fit a new task $t$ well, the information on task $t$ has not been captured in the current library. Thus, in order to acquire information from the widest range of tasks, the next task should be the one on which the current library performs worst, that is, the one whose training-data loss is maximum. Although we have the dual dictionaries L and K, we can simply use the main dictionary L, which contains both input and output information, to calculate the heuristic as in (18), where $\hat{\theta}^{(t)}$ and $\Upsilon^{(t)}$ are calculated by (6) and (7), respectively.

Remark 2: The active task selection mechanism (18) tends to select tasks that are encoded poorly by the current dictionary L. The selected tasks are likely to differ significantly from the previously learned tasks, which encourages the agent to learn diverse tasks. However, this does not mean that, after training on $t_{\text{next}}$ tasks with active task selection, the model generalization or test performance is necessarily better than that of a model trained on $t_{\text{next}}$ tasks without active task selection. Whether this is the case depends on the underlying data generating process. Furthermore, when all the training tasks are used, the models obtained with and without active task selection should have the same or similar test performance, because both models have seen all the training tasks and the order in which the training tasks are learned should have little effect on the overall generalization performance.
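The diversity heuristic can be sketched in Python as follows. This is a hedged illustration rather than the exact form of (18): it assumes the score of a candidate task is the minimum over sparse codes of the $\Upsilon$-weighted encoding error plus an $L_1$ penalty with weight `mu`, as in the diversity heuristic of [33], and it selects the candidate with the largest score. The names `encoding_loss` and `select_next_task` are hypothetical.

```python
import numpy as np


def ista_lasso(A, y, mu, n_iter=500):
    """Minimise ||y - A s||_2^2 + mu * ||s||_1 with proximal gradient (ISTA)."""
    s = np.zeros(A.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        z = s - step * 2.0 * (A.T @ (A @ s - y))
        s = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)
    return s


def encoding_loss(theta_hat, Upsilon, L, mu):
    """Best achievable weighted encoding loss of one task under dictionary L."""
    R = np.linalg.cholesky(Upsilon).T  # Upsilon = R^T R
    s = ista_lasso(R @ L, R @ theta_hat, mu)
    r = theta_hat - L @ s
    return float(r @ Upsilon @ r + mu * np.abs(s).sum())


def select_next_task(candidates, L, mu):
    """Pick the candidate that the current dictionary encodes worst.

    candidates: dict mapping task index -> (theta_hat, Upsilon).
    """
    losses = {t: encoding_loss(th, U, L, mu) for t, (th, U) in candidates.items()}
    return max(losses, key=losses.get), losses
```

For example, with a single-column dictionary along the first axis, a candidate whose predictor lies along the second axis cannot be encoded and is therefore selected ahead of one aligned with the dictionary.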

IV. ALGORITHM SUMMARY AND ANALYSIS
Our proposed approach has two versions: unsupervised transfer aided lifelong regression (UTLR), in which the agent has no control over the learning order of tasks, and UTLR-Ac, which is equipped with the active task selection mechanism presented in Section III-F. The proposed framework has two phases: a training phase and an evaluation phase. During the training phase, some training tasks serve as the candidate task pool. Each time, the agent actively chooses (UTLR-Ac) or passively accepts (UTLR) one task from the task pool to learn, so as to incrementally build its libraries L and K. After the agent has encountered all the training tasks, the two libraries are fixed and act as the knowledge base for learning future unseen tasks. During the evaluation phase, new tasks arrive sequentially. With the aid of L and K, the agent performs either model prediction or unsupervised transfer, depending on whether the new task is labeled or not. For an unlabeled new task, the agent uses only the input data to recover the task predictor, which is then used to predict the new data of this task.

Convergence analysis: To establish the convergence of the proposed framework, we use the theoretical results of [22], since these results apply directly to our framework.
The work [22] has proved that the learned dictionary becomes increasingly stable, i.e., converges, as more tasks are learned. This convergence result requires two conditions:
1) The tuples $(\Upsilon^{(t)}, \hat{\theta}^{(t)})$ are drawn independently from an identical distribution (i.i.d.) with compact support, which bounds the norms of $L$ and $s^{(t)}$.
2) For all the tasks up to task $t$, let $L_k$ be the subset of the current dictionary $L_t$ in which only the columns corresponding to the non-zero elements of $s^{(t)}$ are included. Then, all the eigenvalues of the matrix $L_k^{\mathsf T} \Upsilon^{(t)} L_k$ need to be strictly positive.
The work [22] demonstrates that both these conditions are met for the lifelong learning framework given in (2) to (5).
We incorporate the unsupervised feature into this framework by augmenting $\hat{\theta}^{(t)}$ into $\Theta^{(t)}$, $L$ into $H$, and $\Upsilon^{(t)}$ into $\Psi^{(t)}$. Since $\hat{\theta}^{(t)}$ and $\Upsilon^{(t)}$ are drawn i.i.d., $\Theta^{(t)}$ and $\Psi^{(t)}$ are also drawn i.i.d., according to the definition in (13). Hence condition 1) holds for our method. To verify condition 2), we note that the eigenvalues of $H_k^{\mathsf T} \Psi^{(t)} H_k$ are determined by the eigenvalues of $L_k^{\mathsf T} \Upsilon^{(t)} L_k$ and the positive $\rho$, and hence they are strictly positive. Therefore, both conditions are met for our proposed method, and the convergence result of [22] can be applied to our proposed approach.
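Condition 2) for the augmented problem can be checked with a short calculation. The sketch assumes the block forms $H_k = [L_k; K_k]$ and $\Psi^{(t)} = \mathrm{diag}(\Upsilon^{(t)}, \rho I_d)$ suggested by (13); $K_k$, the columns of $K$ selected by the non-zero entries of $s^{(t)}$, is notation introduced here for illustration.

```latex
% Block product of the augmented dictionary and weight matrix
H_k^{\mathsf T} \Psi^{(t)} H_k
  = L_k^{\mathsf T} \Upsilon^{(t)} L_k + \rho \, K_k^{\mathsf T} K_k

% For any nonzero vector v, the second term is nonnegative, so
v^{\mathsf T} H_k^{\mathsf T} \Psi^{(t)} H_k \, v
  = v^{\mathsf T} L_k^{\mathsf T} \Upsilon^{(t)} L_k \, v + \rho \, \| K_k v \|_2^2
  \;\ge\; \lambda_{\min}\!\bigl( L_k^{\mathsf T} \Upsilon^{(t)} L_k \bigr) \, \| v \|_2^2
  \;>\; 0
```

Adding the positive semidefinite term $\rho K_k^{\mathsf T} K_k$ can only raise the eigenvalues, so strict positivity carries over from the single-dictionary case.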
Computational complexity: We now analyze the online computational complexity of learning a new task with our method. The construction of the task predictor by the regularized LS estimator (6) has a complexity on the order of $O(d^3)$. The adaptation of the single dictionary $L \in \mathbb{R}^{d \times k}$ and the sparse coding $s^{(t)} \in \mathbb{R}^{k}$ costs $O(k^2 d^3)$. Since we incorporate the unsupervised feature into lifelong learning by augmenting $L \in \mathbb{R}^{d \times k}$ into $H \in \mathbb{R}^{(2d) \times k}$, the coupled dictionary adaptation costs $O(k^2 (2d)^3)$. Thus, the overall complexity of per-task adaptation is $O(d^3 + k^2 (2d)^3)$, which is independent of the number of tasks.
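To make the cost of coupling concrete, the stated bound can be expanded as below. The numerical values are purely illustrative: $d = 28$ matches the school dataset used later, and $k = 5$ is taken as the upper end of the grid searched in the experiments, not as a reported setting.

```latex
% Doubling the row dimension multiplies the dictionary term by 2^3 = 8
O\bigl(d^3 + k^2 (2d)^3\bigr) = O\bigl(d^3 + 8 k^2 d^3\bigr) = O\bigl(k^2 d^3\bigr)

% Illustrative operation counts for d = 28, k = 5
d^3 = 21952, \qquad k^2 d^3 = 548800, \qquad k^2 (2d)^3 = 25 \times 175616 = 4390400
```

In other words, the coupled dictionary changes the per-task cost by a constant factor relative to a single dictionary and leaves its order unchanged.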

V. EXPERIMENTS
Three real-world applications, examination score prediction, Parkinson disease symptom score prediction and Alzheimer disease progression modeling, are included to demonstrate the effectiveness of our proposed approach.

A. Experimental Setup
Our two proposed methods, UTLR and UTLR-Ac, are compared with three existing lifelong learning approaches: ELLA [22]; ELLA-diver [33], which actively chooses tasks to learn using the diversity heuristic; and ELLA-diver++ [33], which is a stochastic version of ELLA-diver. The alternative lifelong learning approach EWC is also chosen as a benchmark for comparison. In the original work [29], the cross-entropy loss is used for classification problems, and we change the loss of EWC to the mean squared error in order to apply EWC to regression problems. Additionally, single-task learning (STL), which learns multiple tasks independently, is used as the baseline method. LS regression is used to construct the task predictor for all the methods. It should be noted that neither unsupervised domain adaptation nor continual domain adaptation methods are suitable for comparison with the proposed lifelong regression method, as they address very different problems. TaDell [36] also cannot be used for comparison, because it needs domain-specific task descriptors, which are not available for most real-world datasets. Basically, our method can be regarded as a generalized version of TaDell that uses task input data rather than domain-specific task descriptors.
For all the lifelong models, the dictionary size $k$ and the regularization parameters are chosen independently for each dataset by grid search over $\{1, 2, \ldots, 5\}$ for $k$ and $\{10^{-n}, n = 0, \ldots, 6\}$ for the regularization parameters, to achieve their best performance. The previous works [22], [33], [36] suggested selecting the dictionary size from $k \in \{1, 2, \ldots, 10\}$, but the datasets used in these works were mostly classification problems. We experimented with $k \in \{1, 2, \ldots, 10\}$ for our three case studies, but the results were no better than with $k \in \{1, 2, \ldots, 5\}$. An analysis of the sensitivity to the algorithmic parameters can be found in [22], [52]. For EWC, a two-layer MLP with ReLU non-linearities in each layer is used as the training model. The network is trained using stochastic gradient descent with learning rate 0.0001, and 10 independent experiments with different random seeds are conducted for model training. For our proposed method, ρ is a key parameter that balances the predictor's fit against the unsupervised feature's fit, and we empirically investigate its impact on the model prediction and unsupervised transfer performance.
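The grid search described above can be sketched as follows. The sketch assumes two regularization parameters searched over the same grid (called `mu` and `lam` here) and a user-supplied callable `evaluate` that trains a model and returns an error score to minimise; both the names and the choice of score are hypothetical, since the selection criterion is not specified beyond achieving the best performance.

```python
import itertools


def grid_search(evaluate, k_values=range(1, 6), reg_exponents=range(0, 7)):
    """Return the (k, mu, lam) triple with the lowest score and that score."""
    reg_values = [10.0 ** (-n) for n in reg_exponents]  # 1, 0.1, ..., 1e-6
    best, best_score = None, float("inf")
    for k, mu, lam in itertools.product(k_values, reg_values, reg_values):
        score = evaluate(k, mu, lam)
        if score < best_score:
            best, best_score = (k, mu, lam), score
    return best, best_score
```

With 5 dictionary sizes and 7 values for each of the two assumed regularization parameters, the search visits 245 configurations per dataset.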
In the experiments, we split the data 50%-50% into training and testing datasets for each task. The training set is used to construct the task predictor, while the testing set is used for performance evaluation. Additionally, we divide the set of tasks into two subsets: a set of training tasks, which serve as the pool for active task selection (for UTLR-Ac, ELLA-diver and ELLA-diver++) and are used to learn the knowledge library, and a set of evaluation tasks, on which we measure the performance of the learned library. We set $T_{\text{pool}} = \frac{1}{2} T_{\max}$ for all the experiments. After a model has learned all the training tasks, its knowledge library is fixed, and we use the learned library to measure the prediction performance on the testing datasets of the evaluation tasks. For the evaluation of unsupervised transfer, the model has no access to the training set's output data for each evaluation task, and it can only use the input data to recover the task predictor for performance evaluation on the testing set.
The root mean squared error (RMSE) and the mean absolute error (MAE) are used to evaluate the testing prediction performance. In the lifelong learning setting, we are also interested in the online computational complexity of learning each task. Hence, the averaged computation time per task (ACTpT) is used to quantify the online computational complexity of a lifelong model. For all the lifelong models, training tasks are presented sequentially to the learner, following the corresponding online learning setting. To mitigate the impact of task order on the algorithms, the training and evaluation task orders are randomly generated over 10 independent experiments, and we report the mean and standard deviation (STD) of the RMSE, MAE and ACTpT over the 10 realizations.
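The evaluation metrics can be sketched in a few lines of Python. The function names are hypothetical, and the standard deviation uses NumPy's default population form (`ddof=0`), which is an assumption because the estimator used for the reported STD is not stated.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error of a prediction vector."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error of a prediction vector."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


def summarize(run_scores):
    """Mean and standard deviation of a metric over independent runs."""
    scores = np.asarray(run_scores, dtype=float)
    return float(scores.mean()), float(scores.std())
```

In the setting described above, `summarize` would be applied to the 10 per-run values of each metric, one value per random task order.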

B. School Examination Score Prediction
We first evaluate the algorithms on the school examination score dataset, which has been widely used in multi-task and lifelong regression studies [7], [9], [11], [13], [21], [22]. The dataset contains examination scores of 15,362 students from 139 secondary schools, and each school is considered as a regression task. For each task, the goal is to predict the scores of all the students in the school from their input features. Each student has 28 features (the task dimension is $d = 28$), including student-specific and school-specific features, and the corresponding output is the student's examination score. The number of students per school, i.e., the number of samples $n_t$ for each task, varies from 25 to 251. From the total of 139 tasks, we use 69 as the training tasks and the other 70 as the evaluation tasks.
The impact of ρ on the prediction and unsupervised transfer performance of our two models is investigated in Fig. 3. When the value of ρ is large (ρ = 3, 2, 1), the unsupervised feature plays the dominant role and the task model has less impact on the algorithm's performance. Hence the model prediction performance is similar to the unsupervised transfer performance. When ρ = 0.1, the model prediction accuracy improves while an acceptable unsupervised transfer accuracy is maintained, which indicates that this value of ρ balances the task model's fit well against the unsupervised feature's fit. When ρ decreases further to 0.01, the model prediction accuracy decreases only slightly, but the unsupervised transfer performance degrades dramatically. This is because, when ρ becomes very small, the unsupervised feature has little impact on the algorithm, which makes it unable to recover the task predictor via transfer. Hence, ρ = 0.1 is appropriate for this case study.
Table II presents the test performance comparison of the various methods, where, for the lifelong learning methods, the models with the best and runner-up performance are emphasized in boldface black and blue, respectively. Note that for our proposed approach, the model prediction is carried out using both the task predictor and the unsupervised feature, while the unsupervised transfer performance is obtained using the unsupervised feature only. Our approach is the only one that can carry out this unsupervised feature based prediction. Since EWC is implemented in PyTorch, it would be unfair to compare its computation time with that of the other models, which are implemented in Matlab, so we do not present the ACTpT of EWC. However, since EWC is based on neural networks, its training and testing are much more time-consuming than those of the other methods, which use linear base models. Although STL imposes the least computational cost, it has the worst prediction performance compared with the lifelong models. Both our UTLR and UTLR-Ac attain the smallest model prediction RMSE and MAE, compared with the three ELLA-based methods and the EWC model. Moreover, our UTLR imposes the lowest ACTpT among all the lifelong models. Most significantly, our proposed approach is able to recover the task model from input data only, and it achieves unsupervised transfer performance that is similar to or slightly better than the prediction performance of the ELLA-based methods. This demonstrates the excellent unsupervised transfer performance of our method.
Clearly, after all the training tasks have been learned, the performance of an active task selection based lifelong model should be the same as or similar to that of its non-active counterpart, and incorporating active task selection into a lifelong model significantly increases the algorithm's complexity. To investigate how task selection affects the model's prediction performance, we conduct the following experiment. Each time the lifelong model selects a training task and updates its knowledge library, we measure its test performance on all the evaluation tasks using the current library. This procedure yields a learning curve that depicts the relationship between the prediction performance and the number of tasks learned, which is shown in Fig. 4 for the five lifelong models. The test RMSEs of all the models decrease as the number of tasks learned increases. This is because the lifelong models become more knowledgeable as their libraries capture knowledge from more tasks. The learning curves of our UTLR and UTLR-Ac are very similar, and this is also the case for ELLA and ELLA-diver. This indicates that active task selection has only a minor impact on the lifelong model's generalization performance for this case study. The reason may be that the data distributions of the tasks in the school examination score dataset are similar. Note that since the training pipeline of EWC is different from that of the ELLA-based methods, we do not conduct this experiment for EWC.

C. Parkinson Disease Symptom Score Prediction
This dataset is composed of a range of biomedical voice measurements from 42 patients with early-stage Parkinson's disease [53], and it has been used to evaluate lifelong models [13], [34]. The dataset contains 5875 voice recordings from these 42 patients, with the number of observations per patient varying from $n_t = 101$ to 168. The aim is to predict the Motor and Total UPDRS scores from the 16 voice measures. The symptom score prediction using $d = 16$ biomedical features for a patient is considered as a regression task, and we have 42 tasks in total. Since the UPDRS scores consist of Motor and Total, we establish two regression datasets in our experiment: Parkinson-Motor and Parkinson-Total, each containing 20 training tasks and 21 evaluation tasks. Based on the results of Fig. 5, we set ρ = 0.01 for our method, as this value best trades off model prediction against unsupervised transfer.
The test performance of the various models for the Parkinson-Motor and Parkinson-Total datasets is compared in Tables III and IV, respectively. Again, our methods achieve the best prediction performance, with the smallest test RMSE and MAE. Furthermore, our UTLR attains the second-lowest ACTpT and the lowest ACTpT for the two datasets, respectively. Our models can also recover the task model using input data only. The unsupervised transfer performance of our models, although not as accurate as the model prediction of ELLA, is better than the prediction performance of STL. Fig. 6 depicts the test learning curves as functions of the number of tasks learned for the various lifelong models. It can be seen that although our models begin with larger test RMSEs than the ELLA-based models, their prediction errors decrease dramatically after learning more tasks. This is because our methods have two libraries to initialize and thus have higher error at the beginning; as the number of tasks learned increases, the two libraries capture more knowledge, leading to higher prediction accuracy. Observe that the RMSE learning curve of UTLR-Ac decreases more quickly than that of UTLR. For the Parkinson-Motor and Parkinson-Total datasets, the active task selection mechanism seems to enable the model to learn faster.

D. Alzheimer Disease Progression Modeling
This dataset is from the Alzheimer's Disease Neuroimaging Initiative (ADNI) [54]. The ADNI project is a longitudinal study that collects various measurements from patients repeatedly at 6-month or 1-year intervals. The first hospital screening at which a patient undergoes magnetic resonance imaging (MRI) is called the baseline, and the time point of each follow-up visit is denoted by the duration from the baseline. The latest ADNI has up to 120 months of follow-up data available for some patients, which are divided into baseline (M00), 6th month (M06), 12th month (M12), 24th month (M24), 36th month (M36), 48th month (M48), 60th month (M60), 72nd month (M72), 84th month (M84), 96th month (M96), 108th month (M108) and 120th month (M120). The aim is to predict patients' cognitive scores at multiple time points using their MRI features. Hence, the cognitive score prediction at one time point is considered as a regression task, and we have 12 tasks in total.
The number of samples for each task varies from $n_t = 69$ to 1074, and the dimension of features is $d = 314$. Alzheimer disease progression prediction is a very popular multi-task regression problem [55], [56], [57], [58], [59].

The impact of ρ on the model prediction and unsupervised transfer performance of our methods for the Alzheimer-ADAS dataset is shown in Fig. 7. For Alzheimer-MMSE, the results are similar and are therefore omitted. Unlike in the previous two case studies, the unsupervised feature plays a more important role than the task model in this case. This may be because the dimension of the input is much larger here. According to Fig. 7, we set ρ = 5 for our models to achieve the best model prediction and unsupervised transfer accuracy.
The test performance comparison of the various models for the Alzheimer-ADAS and Alzheimer-MMSE datasets is presented in Tables V and VI, respectively. It can be seen that EWC attains the second-best prediction accuracy. Again, our methods achieve the best prediction accuracy, and our UTLR imposes the lowest ACTpT. More significantly, the unsupervised transfer accuracies of our models are similar to their model prediction accuracies. This makes sense because, in this case, the prediction is mainly contributed by the unsupervised feature, so the unsupervised transfer accuracy should be comparable to the model prediction accuracy. The test RMSE learning curves as functions of the number of tasks learned are depicted in Fig. 8 for the various lifelong models. Observe that the test RMSEs of our methods decrease much more rapidly than those of the ELLA-based methods as more tasks are learned, which again demonstrates the superior learning capability of our models. Also observe that active task selection does not seem to speed up learning. This may be because of the very limited number of training tasks.

E. Discussion of the Algorithm
Three real-world regression datasets from different application scenarios demonstrate the superiority of the proposed unsupervised transfer aided lifelong regression framework. Our proposed method not only consistently attains the best modeling accuracy compared with existing lifelong regression methods, but also provides the important capability of learning and predicting new tasks with only input data. Note that the best existing unsupervised transfer aided strategy, TaDell [36], cannot be applied to our real-world regression benchmark datasets. TaDell relies on zero-shot transfer based on so-called task descriptors. For it to work, these domain-specific task descriptors, which must characterize the underlying dynamics of the data in individual tasks well, need to be handcrafted first. For simple engineering systems, some basic system parameters, such as length, mass and damping constant, may be used as task descriptors because they define the system's underlying dynamics and are closely related to the data characteristics. However, for most real-world tasks, finding such appropriate and unified descriptors to identify different tasks requires in-depth cross-domain knowledge, which is generally impossible to obtain. By contrast, our proposed method learns a new task and performs unsupervised knowledge transfer with only input data, which makes it generally applicable to many applications. In terms of computational efficiency, the experimental results have demonstrated that our method has a lower online time cost than the efficient ELLA. Most importantly, the computation cost is independent of the number of tasks, which keeps it affordable when massive numbers of tasks are received over long time scales.

This work mainly focuses on the lifelong regression problem, hence we only use regression datasets to evaluate our algorithm. To the best of our knowledge, most existing lifelong/continual learning algorithms, such as the works of [47], [48], [49], [50], [51], focus only on object recognition or classification, and they are not applicable to regression learning. By contrast, our proposed UTLR algorithm is specifically designed for regression problems, which fills a gap in the field. It is also worth mentioning that our method has a flexible structure similar to that of ELLA, and it can also be extended to address classification problems. In that case, the proposed framework can be compared with more state-of-the-art lifelong learning algorithms on classification benchmarks. However, we emphasize again that this research is devoted specifically to lifelong regression problems with consecutive unlabeled tasks, and for this challenging application area, our proposed framework shows considerable advantages over the existing state-of-the-art, as evidenced by the experimental results.

VI. CONCLUSION AND FUTURE WORKS
This paper has proposed an effective lifelong regression framework capable of learning new consecutive tasks without desired output data. Specifically, during the training phase, the input data for each task are encoded as feature vectors, while both the input and output data are used to construct a single-task predictor using the LS estimator. The unsupervised features and the task predictors' parameters are factorized into two dictionaries that are coupled by a joint sparse coding. The learner can also actively choose the next training task to learn based on how poorly the current dictionary encodes the candidate tasks. When a new task arrives, the learner can perform either model prediction or unsupervised transfer, depending on whether the task's data are labeled or not. Even if the new task is not labeled, the learner can still recover the task predictor using unsupervised features via knowledge transfer. This novel capability has ensured that our proposed lifelong regression framework has better generalization performance than the existing state-of-the-art lifelong regression models, which has been validated with applications to three real-world lifelong regression problems.
Defining an appropriate unsupervised feature for an unlabeled task remains an open question. In our proposed framework, we simply use the mean values of the input data as the feature vectors. In the future, we will explore more advanced alternative features provided by various unsupervised learning methods to further improve the unsupervised transfer accuracy. Another interesting future direction is to extend this lifelong regression framework to nonlinear regression, since many real-life tasks are very complex and have strong nonlinearities. Hence, nonlinear models, such as neural networks, can potentially be used in the proposed scheme to replace linear regression. Note that our method has a flexible model structure, and it can also be extended to address classification problems by replacing the base linear regression predictor with a simple classifier. This future extension would make our framework applicable to a wider range of lifelong learning scenarios where labeling new tasks is challenging.

Fig. 3. Impact of ρ on the model prediction and unsupervised transfer performance of the proposed methods for school examination score dataset.

Fig. 5. Impact of ρ on the model prediction and unsupervised transfer performance of the proposed methods for (a) Parkinson-Motor, and (b) Parkinson-Total.

TABLE II. PERFORMANCE COMPARISON OF STL, ELLA, ELLA-DIVER, ELLA-DIVER++, EWC AS WELL AS PROPOSED UTLR AND UTLR-AC FOR SCHOOL EXAMINATION SCORE DATASET

TABLE III. PERFORMANCE COMPARISON OF STL, ELLA, ELLA-DIVER, ELLA-DIVER++, EWC AS WELL AS PROPOSED UTLR AND UTLR-AC FOR PARKINSON-MOTOR DATASET

Fig. 4. Comparison of test RMSE performance versus number of tasks learned for school examination score dataset.

TABLE IV. PERFORMANCE COMPARISON OF STL, ELLA, ELLA-DIVER, ELLA-DIVER++, EWC AS WELL AS PROPOSED UTLR AND UTLR-AC FOR PARKINSON-TOTAL DATASET

Fig. 6. Comparison of test RMSE performance versus number of tasks learned for (a) Parkinson-Motor, and (b) Parkinson-Total.

TABLE V. PERFORMANCE COMPARISON OF STL, ELLA, ELLA-DIVER, ELLA-DIVER++, EWC AS WELL AS PROPOSED UTLR AND UTLR-AC FOR ALZHEIMER-ADAS DATASET

Fig. 7. Impact of ρ on the model prediction and unsupervised transfer performance of the proposed methods for Alzheimer-ADAS dataset.