Lifelong Learning Meets Dynamic Processes: An Emerging Streaming Process Prediction Framework With Delayed Process Output Measurement

Tong Liu, Member, IEEE, Sheng Chen, Life Fellow, IEEE, Po Yang, Senior Member, IEEE, Yunpeng Zhu, Member, IEEE, Mehmet Mercangöz, and Chris J. Harris

Abstract-As an emerging machine learning technique, lifelong learning is capable of solving multiple consecutive tasks based on previously accumulated knowledge. Although this is highly desired for streaming process prediction in industry, lifelong learning methods have so far failed to gain applications to mainstream adaptive predictive modeling of time-varying industrial processes. This is because when faced with a new data batch, existing lifelong learning approaches need both input and output data to construct local predictors before knowledge transfer can succeed. But in many process industries, the process output data are hard to measure online and it often takes time to acquire them from off-site laboratory analysis. This delayed acquisition of target output data makes it challenging to apply lifelong learning and other existing adaptive mechanisms to dynamic industrial processes with delayed process output measurement. To overcome this difficulty, this article proposes a novel lifelong learning framework that can rapidly predict new data batches with input data only before the arrival of the process output measurement. Specifically, we propose to incorporate process input information into lifelong learning via coupled dictionary learning, to enable the prediction of new batches without target output data. The input feature is linked with a local predictor through two dictionaries that are coupled by a joint sparse representation. Because of the learned coupling between the two spaces, the local predictor for the new batch can be reconstructed by knowledge transfer given only process inputs. Two industrial case studies are used to evaluate the effectiveness of our proposed framework and reveal the intrinsic learning mechanism of our lifelong process modeling to perform knowledge base (KB) adaptation.


I. INTRODUCTION
Machine learning has taken center stage in recent years for scientific and engineering developments. Inspired by human intelligence, the recent advances in deep learning have brought it to new heights [1], and it has been successfully applied to all walks of life, such as image recognition, natural language processing, competitive games (AlphaGo) [2], protein structure prediction (AlphaFold) [3], and short-term weather forecasting [4]. Following the success of machine learning in these areas, the process industry has also begun to harvest the benefits of these breakthroughs [5]. Exploiting the availability of explosive process data, the current industrial revolution, also known as Industry 4.0, is focusing on advanced data modeling and analytics to improve control and high-level decision-making [6], [7]. Hence, adaptive and accurate modeling of industrial plants from massive and long-term process data will aid intelligent and autonomous industrial systems.
The current mainstream paradigm for industrial predictive models, which can predict the evolution of the process output given the process inputs, is to run machine learning algorithms on a given dataset that was historically collected from industrial plants, and to hope that the trained model will generalize well to new data unseen in training [5], [8], [9]. In machine learning terminology, this is called isolated learning, because it does not retain and accumulate the knowledge learned in past training and use it to facilitate future learning. Without the ability to accumulate knowledge from past learning, a machine learning model typically needs a large number of training samples to learn effectively. In particular, the collected dataset needs to be informative and sufficiently rich to cover the whole dynamics of the underlying process. However, this is often not the case in practice, because many industrial systems operate in a continuous manner and generate data from their operation in the form of streams, whose state changes over time [10], [11]. These time-varying process characteristics can be caused by many factors, such as changes of raw materials and operating conditions, mechanical abrasion, and catalyst deactivation. Due to these process drifts, predictive models trained over a historical dataset become obsolete, as they do not represent the newly emerged process state.
To tackle the above problem, various adaptive mechanisms are used for industrial predictive models to realize online adaptation with newly measured labeled samples, i.e., both process input and output samples, so as to maintain satisfactory performance over a long operating period [12]. One notable issue for this online learning is that the model adaptation needs to be performed at each sampling time with both process input and output information. This imposes two challenges. First, the online computation cost of reestimating the model structure and parameters should be sufficiently small to be accommodated within each sampling period, which is determined by the control systems [13]. More importantly, the process output data should be obtained in a timely manner at each time step so that the current modeling residual can be calculated to perform the model adaptation. The first issue has been addressed to some extent by numerous techniques, and some methods are reviewed in Section II-A. However, the need for label or process output data immediately at every sampling time is often infeasible in many process industries, because these output variables or labels are typically hard to measure online and can be obtained only through off-line laboratory analysis [5], which may take hours. For example, it is essential to monitor the lignite moisture online for the microwave lignite drying process [14], as it serves as the feedback information for real-time control of the microwave power to prevent overheating. However, measurement of the lignite moisture takes time, and it cannot be acquired in time at every sampling instant in a closed control loop. This causes delayed label data acquisition for process control. As a result, the unavailability of timely process output data or desired labels makes such an online learning strategy impractical for streaming process prediction applications.
Different from the aforementioned isolated learning, lifelong learning was proposed to mimic the human learning ability of accumulating and maintaining knowledge learned from the past and using it seamlessly for future learning [15]. A simple comparison between the two learning paradigms is shown in Fig. 1. Isolated learning learns each dataset independently, without knowledge accumulation or transfer. In contrast, lifelong learning learns consecutive new data batches based on the previously built knowledge base (KB) and automatically updates the KB learned from the past encountered data batches upon learning the new batch [16], [17]. Three key characteristics of a lifelong learning system, namely, 1) continuous learning ability; 2) knowledge accumulation and maintenance by KB adaptation; and 3) knowledge transfer to facilitate future learning [18], [19], [20], [21], make this emerging technique suitable for dealing with long-term data with changing dynamic characteristics, which is the case for streaming process prediction. Hence, by online learning of the predictive models on a batch-by-batch basis, lifelong learning has some advantages for streaming process prediction applications.
In the lifelong learning community, the efficient lifelong learning algorithm (ELLA) is one of the most popular methods [22]. With the assumption that the local models of multiple related tasks share a common knowledge library, ELLA learns new tasks by selectively transferring knowledge from the KB and refines the KB over time to incorporate the new knowledge learned from the current tasks. This framework has been demonstrated to be effective for handling lifelong regression, classification, and decision-making problems, such as image recognition [22], [23] and engineering system control [24], [25], [26], [27]. However, when faced with a new task or batch, ELLA needs both the input and targeted output data to construct the task model before knowledge transfer can succeed [22]. As mentioned before, for many process industries, the true process output data are hard to measure online, and it often takes time to obtain them from off-line laboratory analysis. Because of this need for output data in constructing the task models for new batches, ELLA cannot be applied directly to streaming process prediction with delayed process output data acquisition. This motivates us to investigate a new lifelong learning framework, which is capable of adaptively and accurately predicting new data batches with input data only by knowledge transfer, before observing the true process output.
Motivated by the above background, this article proposes a novel lifelong learning framework that makes full use of the process input information to enable predicting consecutive batches of process data with delayed output measurement. Specifically, we encode the input data into a feature vector that contains essential process operating information and treat these input features as side information to augment the local predictors on each data batch. To enable knowledge transfer between the two spaces, we use coupled dictionary learning to connect the input's feature space with the predictor's parameter space, where the two spaces are linked through two dictionaries that are coupled by the same sparse coding. Because of the learned coupling between the two spaces, the lifelong learner is capable of rapidly reconstructing local predictors for newly arriving batches given only the process inputs. This capacity of "learning without targeted output" is very important for lifelong process modeling, because the output data are often hard to obtain immediately and the learner is required to make predictions quickly by knowledge transfer before observing the true process output. Two industrial case studies, involving a penicillin fermentation process and a wastewater treatment plant (WWTP), are used to demonstrate the effectiveness of our proposed framework. Most importantly, we reveal the intrinsic learning mechanism of this new lifelong process modeling and demonstrate how our method deals with process drifts by KB adaptation. In summary, our novel contributions are listed below.
1) We define the lifelong learning or modeling problem for dynamic industrial processes for the first time and propose a novel lifelong learning framework that fully considers the key characteristics of process drifts and delayed process output measurement.
2) Based on coupled dictionary learning, we incorporate process input information into lifelong learning, which uses a factorized representation of the learned knowledge to facilitate transfer and improve predictive performance.
3) We show that our method is able to accurately predict new data batches using only process inputs through unsupervised knowledge transfer.
4) Two industrial case studies are carried out to demonstrate its effectiveness, and we reveal the intrinsic learning mechanism based on prior process knowledge.

The rest of this article is organized as follows. Section II summarizes the related works. Section III reviews lifelong learning and presents its challenge in application to industrial processes. Section IV details the proposed lifelong learning-based streaming process prediction framework, and Section V evaluates its effectiveness with two industrial case studies. Section VI concludes the article with remarks on future works.

II. RELATED WORK

A. Streaming Modeling of Dynamic Processes
For streaming or online modeling of time-varying dynamic processes, the key is to update the predictive model's structure and parameters to track the changing system dynamics. A popular method widely used in practice is the multiple local model learning strategy [28], [29]. By partitioning the overall modeling space into multiple local subspaces, a set of local models can be constructed to capture the overall process characteristics. Based on this principle, the selective-ensemble-based multiple local model learning enables automatic identification of newly emerged process patterns by growing the local model set online [30]. To reduce the computation burden, the growing and pruning selective ensemble regression can not only learn new process patterns but also discard outdated patterns by pruning unwanted local models [31], [32]. However, the local model adaptation for this strategy requires both process input and output data. When the acquisition of the output data is delayed, the local model adaptation cannot take place, and online prediction has to rely on the existing local model set given the input data, i.e., the method reduces to a nonadaptive model.
Inspired by gain scheduling [33], [34], another approach, locally linear regression, partitions the training input space into multiple subspaces and constructs a local linear model for each region. During inference or online prediction, appropriate local models from the trained model set are selected or combined based on the input data using switching or ensemble mechanisms [35], [36]. However, this locally linear regression is a nonadaptive model. During online operation, it cannot adapt the local linear model set even when both the process input and output are available. The success of this locally linear regression therefore depends heavily on sufficiently rich training data to determine the number and boundaries of all the subspaces of the underlying process in the training phase. If the process operates in real time with time-varying data streams, the online prediction performance of this nonadaptive model may degrade considerably.
Another online modeling strategy is based on global model learning. A typical approach is to adopt the radial basis function (RBF) network [37], [38]. During online model adaptation, the output weights of the RBF network are updated by recursive least squares (RLS) to track the time-varying characteristics [39]. To further enhance the adaptability to nonstationary data, the fast tunable gradient RBF model adjusts both the hidden node structure and the output weights to capture newly emerged process patterns [40], [41]. This efficient online tracker is further combined with deep learning techniques for high-dimensional nonstationary process modeling in [42]. Although this method is very effective and efficient for online process tracking, it needs both input and output information for model adaptation at each sampling time. This becomes impractical again when the acquisition of the process output is seriously delayed. It can be seen that most existing streaming modeling approaches cannot cope with delayed output measurement in the process industry, and a novel real-time learning framework is urgently needed to tackle this problem.

B. Lifelong Learning
The core idea of lifelong learning is to solve multiple consecutive tasks over long time scales upon previously accumulated knowledge, and ELLA is one of the most popular approaches [22]. It factorizes the learned task models into a shared latent dictionary as the KB to facilitate knowledge transfer as tasks arrive consecutively. When a new task arrives, ELLA transfers knowledge through the shared dictionary to learn the new model and refines the dictionary with the knowledge learned from the current task. By updating the dictionary over time, newly acquired knowledge is incorporated into the KB, thereby improving the performance of previously learned models. Although ELLA-based methods [22], [23], [24], [25], [26], [27] achieve very good performance in many applications, one important requirement is the need to first gather sufficient labeled data for the newly coming tasks. This need for labeled data imposes a serious challenge for practical problems, because data annotation for every new task is time-consuming. Often a learner is expected to learn a new task rapidly, without the delay of waiting for the labeling task.
To mitigate this difficulty, the work [43] incorporated high-level task descriptors into lifelong reinforcement learning and used both the task descriptors and training data to model intertask relationships. This method was extended to regression in [44], where the model can be predicted given only the input data and task descriptors for a new task. However, this method requires domain-specific task descriptors that must accurately characterize the underlying process dynamics. For simple engineering systems, such as a robotic arm, it is possible to define a set of task descriptors that accurately reflect the system's true underlying dynamics. However, practical industrial processes are highly complex, and it is difficult, if not impossible, to define accurate task descriptors for them. Consequently, similar to other existing lifelong learning methods, the lifelong learning with zero-shot knowledge transfer [43], [44] cannot be applied to practical industrial processes with delayed process output measurement.
Hence, it is necessary and vital to develop a new lifelong process modeling framework for industrial process applications, where labeled data for new tasks are difficult to obtain quickly. The novel contribution of this article is to propose a new lifelong learning framework for streaming industrial processes with delayed process output measurement. Note that, unlike the lifelong learning with zero-shot knowledge transfer [43], [44], our proposed lifelong learning method is capable of performing unsupervised knowledge transfer with input data only, and there is no need to first define task descriptors.

III. LIFELONG LEARNING AND ITS CHALLENGE FOR STREAMING PROCESSES

A. Problem Formulation
For a streaming process, let multiple data batches be received consecutively as $\{\mathcal{D}^{(1)}, \mathcal{D}^{(2)}, \ldots, \mathcal{D}^{(T_{\max})}\}$. The predictive model must rapidly predict each new batch by building upon its previously learned knowledge, and a local predictor $f^{(t)}$ can be constructed on each data batch $\mathcal{D}^{(t)} = \{x_i^{(t)}, y_i^{(t)}\}_{i=1}^{n_t}$, where $n_t$ is the number of samples (batch size), and $x_i^{(t)} \in \mathbb{R}^d$ and $y_i^{(t)} \in \mathbb{R}$ are the $i$th input sample and the associated output or label sample, respectively, for batch $t$.
The generic streaming process prediction is formulated as the following framework. Let $T - 1$ be the number of batches with complete process input and output data that the process has generated so far, and let $\{f^{(1)}, \ldots, f^{(T-1)}\}$ be the previously built local predictors. Because of delayed process output measurement, when a new batch $T$ first arrives, it contains only the process inputs $\{x_i^{(T)}\}_{i=1}^{n_T}$. The predictor must be able to accurately predict the true process output $y_i^{(T)}$ based on the process input $x_i^{(T)}$ and the knowledge learned from the previous batches, $\{f^{(1)}, \ldots, f^{(T-1)}\}$. Only later, when the true process outputs $\{y_i^{(T)}\}_{i=1}^{n_T}$ for the new batch are acquired, may the predictor $f^{(T)}$ be constructed on the completed data batch $\mathcal{D}^{(T)} = \{x_i^{(T)}, y_i^{(T)}\}_{i=1}^{n_T}$. Ideally, the knowledge learned from the previous batches should accelerate this model construction, and the constructed $f^{(T)}$ should contribute its newly learned knowledge to the knowledge library.
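To make this data protocol concrete, the following is a minimal sketch (in Python; the `Batch` class and its field names are our own illustration, not from the article) of a streaming batch whose outputs stay unmeasured until the delayed laboratory analysis returns.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Batch:
    """One streaming data batch. The inputs X arrive immediately; the
    outputs y stay None until the delayed laboratory analysis returns."""
    X: np.ndarray            # d x n_t matrix of process inputs
    y: Optional[np.ndarray]  # length-n_t outputs, or None while delayed
```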

B. Revisit of ELLA
ELLA [22] learns and maintains a KB $L \in \mathbb{R}^{d \times k}$, which forms a shared basis for all the predictors and facilitates knowledge transfer between them. For each batch $t$, ELLA constructs a local predictor $f^{(t)}(x) = f(x; \theta^{(t)})$ that is parameterized by a $d$-dimensional batch-specific parameter vector $\theta^{(t)}$. This model parameter is a linear combination of the columns of $L$ using the sparse coefficients $s^{(t)} \in \mathbb{R}^k$, i.e., $\theta^{(t)} = L s^{(t)}$. The dictionary $L$ stores chunks of knowledge that are shared by all the batches, and the sparse code $s^{(t)}$ extracts the relevant pieces of knowledge for a particular batch $t$. Hence, this model parameter factorization enables effective knowledge transfer among batches. Typically, a local linear model $f^{(t)}(x) = x^{\mathsf T} \theta^{(t)}$ is adopted, and therefore we must have $k \le d$, because in this case the number of columns of $L$, i.e., the dimension of the KB, cannot exceed the dimension $d$ of the input space. Typically, $k < d$, as some elements of $x$ may be collinear.
Given the process inputs and outputs $\{x_i^{(t)}, y_i^{(t)}\}_{i=1}^{n_t}$ for each batch $t$, ELLA optimizes the following cost function:

$$\min_{L, S}\ \frac{1}{T} \sum_{t=1}^{T} \Big\{ J\big(\theta^{(t)}\big) + \mu \big\| s^{(t)} \big\|_1 \Big\} + \lambda \| L \|_F^2, \quad \theta^{(t)} = L s^{(t)} \qquad (1)$$

where $J(\theta^{(t)}) = \frac{1}{n_t} \sum_{i=1}^{n_t} \big( y_i^{(t)} - {x_i^{(t)}}^{\mathsf T} \theta^{(t)} \big)^2$ is the loss on batch $t$, $S = [s^{(1)}\ s^{(2)}\ \cdots\ s^{(T)}]$ is the matrix consisting of all the sparse coefficient vectors, and the $L_1$ norm is used to control the sparsity of $s^{(t)}$ with the regularization parameter $\mu$, while $\|\cdot\|_F$ is the Frobenius norm, which regularizes the complexity of the dictionary $L$ with the regularization parameter $\lambda$.
To solve this optimization, ELLA takes a second-order Taylor expansion to approximate the objective around an estimate $\hat\theta^{(t)}$ of the local predictor parameters for each batch $t$, and only updates the coefficients $s^{(t)}$ for the current batch at each time step. This reduces the optimization (1) to the problem of sparsely coding the local predictors in the shared dictionary $L$, and it enables solving $L$ and $S$ efficiently by the following recursive updating rules [22]:

$$s^{(t)} = \arg\min_{s}\ \big\| \hat\theta^{(t)} - L s \big\|^2_{\Gamma^{(t)}} + \mu \| s \|_1 \qquad (2)$$
$$A \leftarrow A + \big( s^{(t)} {s^{(t)}}^{\mathsf T} \big) \otimes \Gamma^{(t)} \qquad (3)$$
$$b \leftarrow b + \mathrm{vec}\big[ {s^{(t)}}^{\mathsf T} \otimes \big( {\hat\theta^{(t)}}{}^{\mathsf T} \Gamma^{(t)} \big) \big] \qquad (4)$$
$$L \leftarrow \mathrm{mat}\Big[ \Big( \frac{1}{T} A + \lambda I_{kd} \Big)^{-1} \frac{1}{T} b \Big]_{d \times k} \qquad (5)$$

where $\| v \|^2_A = v^{\mathsf T} A v$, the elements of $L$ are initialized by taking values randomly from $(0, 1)$, and $\Gamma^{(t)} = \nabla^2 J(\hat\theta^{(t)})$ is the Hessian matrix of the loss $J(\hat\theta^{(t)})$, while $\otimes$ denotes the Kronecker product, $A \in \mathbb{R}^{(kd) \times (kd)}$ is initialized to the all-zero matrix, and $b \in \mathbb{R}^{kd}$ is initialized to the all-zero vector. Furthermore, the vector stacking operator $\mathrm{vec}[\cdot]$ stacks the columns of a matrix one by one to form a vector, $I_{kd}$ is the $(kd) \times (kd)$ identity matrix, and the matrix forming operator $\mathrm{mat}[\cdot]_{d \times k}$ converts a $(dk)$-dimensional vector into a $(d \times k)$-dimensional matrix.
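For concreteness, the following minimal sketch (Python/NumPy with scikit-learn's Lasso; all names are our own illustration, not the authors' code) implements one pass of the recursive rules (2)-(5), assuming the batch estimate $\hat\theta^{(t)}$ and its Hessian $\Gamma^{(t)}$ are already available. The weighted norm in (2) is handled by whitening with a Cholesky factor so that a standard Lasso solver applies.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ella_update(L, A, b, T, theta_hat, Gamma, mu, lam):
    """One ELLA step for a new batch, following (2)-(5): sparse-code the
    batch predictor theta_hat in the dictionary L, accumulate the
    sufficient statistics (A, b), and refresh L in closed form."""
    d, k = L.shape
    # (2): whiten with a Cholesky factor C (C C^T = Gamma) so that
    # ||theta - L s||_Gamma^2 = ||C^T theta - C^T L s||_2^2; sklearn's
    # Lasso scales its squared loss by 1/(2n), hence alpha = mu / (2d).
    C = np.linalg.cholesky(Gamma + 1e-10 * np.eye(d))
    s = Lasso(alpha=mu / (2 * d), fit_intercept=False,
              max_iter=10000).fit(C.T @ L, C.T @ theta_hat).coef_
    # (3)-(4): rank-one accumulation of the quadratic and linear terms.
    A = A + np.kron(np.outer(s, s), Gamma)
    b = b + np.kron(s, Gamma @ theta_hat)
    T = T + 1
    # (5): closed-form refresh of vec(L), then column-wise reshape.
    L = np.linalg.solve(A / T + lam * np.eye(k * d),
                        b / T).reshape((d, k), order="F")
    return L, A, b, T, s
```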
Remark 1: The theoretical justification of using $L_1$- and $L_2$-norms to regularize the model complexity is well understood in machine learning. In particular, imposing an $L_1$-norm on the model parameters has the desired property of enforcing the sparsity of the model, i.e., driving many parameters to zero. Analysis of ELLA is also widely available in the literature. For example, the convergence analysis of ELLA, (2)-(5), to the solution of the optimization (1) can be found in [22].
Remark 2: The dictionary size or sparsity level $k$ is an important hyperparameter of ELLA, which is obviously problem-dependent. Ideally, it would be highly desirable to be able to determine the value of $k$ from the data. For single-task learning, this is indeed achievable, as the $L_1$-norm of the encoding vector in the optimization naturally enforces sparsity and automatically makes $k$ smaller than $d$ according to the underlying data structure. However, ELLA considers multiple tasks, and it is unknown to us how to automatically determine an appropriate value of $k$ from the data. Therefore, the dictionary size $k$ is typically chosen empirically, as in [22].

C. Challenge for Applying ELLA to Streaming Processes
As can be seen from the updating rules (2)-(5), each time a new batch $t$ arrives, ELLA first requires an estimate of an initial local predictor $\hat\theta^{(t)}$ before it can update $s^{(t)}$ and $L$. The updated sparse coding vector $s^{(t)}$ and dictionary $L$ can then be used to construct the new predictor to predict the true process outputs for the new batch $t$. In other words, the new local predictor can only be constructed if at least some new batch data contain both the process inputs and the corresponding process outputs. However, for many streaming processes, it often takes a long time to acquire the process output data. When encountering a new batch at first glance, typically only the process input data are available for making predictions. This imposes a serious challenge for applying the existing lifelong models to streaming processes with delayed process output measurement, since they need sufficient labeled data for new tasks to start building new predictors for prediction.
To eliminate this need for labeled data when predicting new batches, and hence make lifelong learning better suited for industrial process prediction, we propose a novel lifelong learning framework that makes full use of the process inputs to enable unsupervised knowledge transfer when predicting new batches. Specifically, upon learning a few batches with complete process input and output data, future local predictors for new batches can be constructed given only the input data. It is worth emphasizing again that our scheme is completely novel and very different from the scheme of [43] and [44], which needs input data and task descriptors to construct local predictors for new batches. For a complex industrial process, it is impossible to craft its task descriptors.

IV. PROPOSED STREAMING PROCESS PREDICTION FRAMEWORK
Our proposed industrial lifelong learning system is depicted in Fig. 2. The industrial system operates in real time to consecutively generate multiple data batches in the form of streams. For each batch of data, we define the process input matrix $X^{(t)} = [x_1^{(t)}\ \cdots\ x_{n_t}^{(t)}] \in \mathbb{R}^{d \times n_t}$ and the corresponding desired output vector $y^{(t)} = [y_1^{(t)}\ \cdots\ y_{n_t}^{(t)}]^{\mathsf T} \in \mathbb{R}^{n_t}$. As a new batch arrives, the knowledge accumulated from the previous batches is selectively transferred to predict the new batch, and the newly acquired information from the current batch is stored in the KB for future use. To achieve knowledge transfer for new-batch prediction without output data, we propose to incorporate input features into the lifelong modeling framework via coupled dictionary learning, enabling the input features and the local predictor to augment each other. For historical batches with complete process input and output data, the local predictor is constructed, while the input features are encoded from the process inputs only. To link the two feature spaces, we use two dictionaries that act as knowledge repositories for the two spaces, and they are coupled by a joint sparse representation. Because of the learned coupling, the local predictor for a newly arriving batch can be reconstructed given only the process inputs. This capacity of learning new predictors without the targeted output is particularly suitable for streaming industrial process prediction, where the acquisition of true process output data may encounter a long delay. We now detail each part of our proposed framework.
A. Learning From Historical Batches

1) Local Predictor: For a historical batch with complete process input and output data $(X^{(t)}, y^{(t)})$, a local model $f(X; \theta) = X^{\mathsf T} \theta$ is constructed. Specifically, the parameter vector of the local model is computed using the regularized least square (LS) estimator as

$$\hat\theta^{(t)} = \big( X^{(t)} {X^{(t)}}^{\mathsf T} + \beta I_d \big)^{-1} X^{(t)} y^{(t)} \qquad (6)$$

where $\beta$ is a small positive regularization parameter, e.g., $\beta = 10^{-6}$. The Hessian $\Gamma^{(t)}$ of the squared-loss function $J(\theta^{(t)})$ around the single-task solution $\hat\theta^{(t)}$ is given by

$$\Gamma^{(t)} = X^{(t)} {X^{(t)}}^{\mathsf T} + \beta I_d. \qquad (7)$$

Hence, for historical data batches, we first compute the predictors' parameter vectors $\hat\theta^{(t)}$ and Hessian matrices $\Gamma^{(t)}$ before performing knowledge transfer in the learning process.
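A minimal sketch of (6) and (7) (Python/NumPy; function and variable names are illustrative):

```python
import numpy as np

def fit_local_predictor(X, y, beta=1e-6):
    """Regularized LS estimate (6) and squared-loss Hessian (7) for one
    historical batch; X is d x n_t, y has length n_t."""
    d = X.shape[0]
    Gamma = X @ X.T + beta * np.eye(d)         # Hessian of the loss, (7)
    theta_hat = np.linalg.solve(Gamma, X @ y)  # (X X^T + beta I)^{-1} X y, (6)
    return theta_hat, Gamma
```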
2) Input Feature: In the process industry, the process inputs, also known as operational data $X^{(t)}$, are often easy and quick to acquire online from sensors directly, while the process output data $y^{(t)}$ are difficult to obtain by online measurement and typically take a long time to acquire from off-line laboratory analysis. In such cases, the process output data experience a delayed acquisition, which makes supervised modeling impractical. Although the input data by themselves cannot be used to construct a predictor, they contain essential process operational information [45]. It is highly beneficial to use the input data for unsupervised learning to supplement the supervised modeling, thereby using them as a "backup" to predict a new batch when the desired process output data are unavailable for constructing the predictor in time.
To incorporate the process inputs alone into the learning procedure, the input data matrix $X^{(t)} \in \mathbb{R}^{d \times n_t}$ is transformed into a $d$-dimensional feature vector that can be linked with the predictor's parameter vector $\theta \in \mathbb{R}^d$. To link these two spaces, we can transform $X^{(t)}$ into the $d$-dimensional feature vector $\psi(X^{(t)})$, where $\psi(\cdot)$ is an operator that encodes a matrix into a vector. The simplest encoding is the mean value of each row of $X^{(t)}$. Expressing the $i$th column of $X^{(t)}$ as $x_i^{(t)} = [x_{1,i}^{(t)}\ \cdots\ x_{d,i}^{(t)}]^{\mathsf T}$, we have

$$\bar{x}^{(t)} = \psi\big( X^{(t)} \big) = \big[ \bar{x}_1^{(t)}\ \cdots\ \bar{x}_d^{(t)} \big]^{\mathsf T}, \quad \bar{x}_j^{(t)} = \frac{1}{n_t} \sum_{i=1}^{n_t} x_{j,i}^{(t)}, \ 1 \le j \le d.$$

Hence, $\bar{x}^{(t)}$ is the vector of input features for batch $t$.
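This mean encoding is trivial to implement; a minimal sketch (Python/NumPy, illustrative names):

```python
import numpy as np

def encode_inputs(X):
    """psi(X): encode the d x n_t input matrix as a d-dimensional feature
    vector by taking the mean of each row (one mean per input variable)."""
    return X.mean(axis=1)
```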
We next show how to link the input features with the local predictor via coupled dictionary learning.

3) Coupled Dictionary Learning: The idea of coupled dictionary learning was used in [43] and [44] to link high-level task descriptors with the learned model so as to achieve knowledge transfer for new tasks. Similarly, we use coupled dictionaries to link the local predictor's space with the input features' space, so as to fully exploit the process input information and achieve prediction of a new batch without the need for process output data.
Recall that the lifelong learning approach factorizes the predictor parameters $\theta^{(t)}$ for each task as a sparse linear combination of a shared dictionary, $\theta^{(t)} = L s^{(t)}$, where each column of $L$ represents a cohesive chunk of knowledge [22]. The sparse coefficient vectors $S$ encode the local predictors in the shared dictionary, providing an embedding of the batches based on how their predictors share knowledge.
In the same way, the input feature vector $\bar{x}^{(t)}$ can also be linearly factorized using a shared dictionary $K \in \mathbb{R}^{d \times k}$ over the process input's space. $K$ has a similar function to $L$: it captures the relationships among the input features of different batches. To link the two spaces, we enforce the two dictionaries, $L$ and $K$, to share the same sparse coefficient vectors $S$, so as to reconstruct both the local predictors and the input features. Hence, for batch $t$,

$$\theta^{(t)} = L s^{(t)} \qquad (8)$$
$$\bar{x}^{(t)} = K s^{(t)}. \qquad (9)$$

Because the two dictionaries are enforced to have the same sparse code $s^{(t)}$, the relevant pieces of information for the local predictor become coupled with the associated input features. To optimize the coupled dictionaries $L$ and $K$, the objective (1) is reformulated for the coupled dictionaries as

$$\min_{L, K, S}\ \frac{1}{T} \sum_{t=1}^{T} \Big\{ J\big( \theta^{(t)} \big) + \rho \big\| \bar{x}^{(t)} - K s^{(t)} \big\|_2^2 + \mu \big\| s^{(t)} \big\|_1 \Big\} + \lambda \big( \| L \|_F^2 + \| K \|_F^2 \big) \qquad (10)$$

with $\theta^{(t)} = L s^{(t)}$, where the parameter $\rho$ balances the local predictor's fit against the input feature's fit.
To solve the optimization (10) in a lifelong learning setting, $J(\theta^{(t)})$ is approximated by a second-order Taylor expansion around the regularized LS estimate $\hat\theta^{(t)}$ given in (6). That is, we expand $J(\theta^{(t)})$ around $\hat\theta^{(t)}$ for each batch as

$$J\big( \theta^{(t)} \big) \approx J\big( \hat\theta^{(t)} \big) + \nabla J\big( \hat\theta^{(t)} \big)^{\mathsf T} \big( \theta^{(t)} - \hat\theta^{(t)} \big) + \big\| \theta^{(t)} - \hat\theta^{(t)} \big\|^2_{\Gamma^{(t)}} \qquad (11)$$

where $\nabla$ denotes the gradient operator. The first term $J(\hat\theta^{(t)})$ is a constant and can be omitted. Since $\hat\theta^{(t)}$ is the minimizer of the objective $J(\theta^{(t)})$, $\nabla J(\hat\theta^{(t)})$ is zero, and the second term can also be removed. Thus, the loss function $J(\theta^{(t)})$ is approximated by the last term of (11), which can be rewritten as $\| \hat\theta^{(t)} - L s^{(t)} \|^2_{\Gamma^{(t)}}$ with $\theta^{(t)} = L s^{(t)}$. With the approximation (11), the optimization (10) is simplified as

$$\min_{L, K, S}\ \frac{1}{T} \sum_{t=1}^{T} \Big\{ \big\| \hat\theta^{(t)} - L s^{(t)} \big\|^2_{\Gamma^{(t)}} + \rho \big\| \bar{x}^{(t)} - K s^{(t)} \big\|_2^2 + \mu \big\| s^{(t)} \big\|_1 \Big\} + \lambda \big( \| L \|_F^2 + \| K \|_F^2 \big). \qquad (12)$$

Defining the stacked quantities

$$\beta^{(t)} = \begin{bmatrix} \hat\theta^{(t)} \\ \bar{x}^{(t)} \end{bmatrix}, \quad H = \begin{bmatrix} L \\ K \end{bmatrix}, \quad A^{(t)} = \begin{bmatrix} \Gamma^{(t)} & 0_{d \times d} \\ 0_{d \times d} & \rho I_d \end{bmatrix} \qquad (13)$$

where $0_{d \times d}$ is the $d \times d$ zero matrix, the optimization (12) can be rewritten in the concise form

$$\min_{H, S}\ \frac{1}{T} \sum_{t=1}^{T} \Big\{ \big\| \beta^{(t)} - H s^{(t)} \big\|^2_{A^{(t)}} + \mu \big\| s^{(t)} \big\|_1 \Big\} + \lambda \| H \|_F^2. \qquad (14)$$

Clearly, the optimization (14) has the identical form to (1), and it can be solved in the same way. Note that (14) can be decoupled into two optimization problems of similar form on $L$ and $K$, respectively. Hence, the two dictionaries can be updated independently. When batch $t$ arrives, three steps are performed to update the lifelong learning model, namely, compute $s^{(t)}$, update $L$, and update $K$. The sparse vector $s^{(t)}$ is first computed using the current basis $H$ by solving the following $L_1$-regularized regression problem, which is a Lasso:

$$s^{(t)} = \arg\min_{s}\ \big\| \beta^{(t)} - H s \big\|^2_{A^{(t)}} + \mu \| s \|_1. \qquad (15)$$

After $s^{(t)}$ is obtained, the two dictionaries, $L$ and $K$, are calculated independently by the recursive updating (3)-(5).
In particular, to update the dictionary $K$, we simply replace $\Gamma^{(t)}$ by $\rho I_d$, $\hat\theta^{(t)}$ by $\bar{x}^{(t)}$, and $L$ by $K$ in (3)-(5).
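Putting (12)-(15) together, one coupled update step might look as follows (a Python/NumPy sketch under the same Cholesky whitening trick as before; all names are our own illustration, not the authors' implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

def coupled_update(L, K, stats, theta_hat, x_bar, Gamma, mu, lam, rho=1.0):
    """One coupled-dictionary step, following (12)-(15): sparse-code the
    stacked target [theta_hat; x_bar] in H = [L; K] under the block
    weighting blkdiag(Gamma, rho*I), then refresh L and K independently
    via the same recursion as (3)-(5)."""
    d, k = L.shape
    H = np.vstack([L, K])                                # (2d) x k basis
    beta_t = np.concatenate([theta_hat, x_bar])          # stacked target
    W = np.block([[Gamma, np.zeros((d, d))],
                  [np.zeros((d, d)), rho * np.eye(d)]])  # A^(t) in (13)
    # (15): whitened Lasso (sklearn scales its loss by 1/(2n), n = 2d here).
    C = np.linalg.cholesky(W + 1e-10 * np.eye(2 * d))
    s = Lasso(alpha=mu / (4 * d), fit_intercept=False,
              max_iter=10000).fit(C.T @ H, C.T @ beta_t).coef_
    # The stacked problem decouples, so each dictionary keeps its own
    # running statistics and is refreshed exactly as in (3)-(5).
    A_L, b_L, A_K, b_K, T = stats
    T += 1
    A_L = A_L + np.kron(np.outer(s, s), Gamma)
    b_L = b_L + np.kron(s, Gamma @ theta_hat)
    A_K = A_K + np.kron(np.outer(s, s), rho * np.eye(d))
    b_K = b_K + np.kron(s, rho * x_bar)
    L = np.linalg.solve(A_L / T + lam * np.eye(k * d),
                        b_L / T).reshape((d, k), order="F")
    K = np.linalg.solve(A_K / T + lam * np.eye(k * d),
                        b_K / T).reshape((d, k), order="F")
    return L, K, (A_L, b_L, A_K, b_K, T), s
```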

B. Predicting a New Batch by Knowledge Transfer
The main advantage of linking the local predictor and the input features is that we can construct the local predictor for new batches using only the process input data, which is valuable for streaming processes. This capability of unsupervised knowledge transfer is enabled by the coupled dictionary learning, which allows us to use the input features to recover the local predictor through the coupled dictionaries and sparse coding.
Given the process input data $X^{(T)}$ for a new batch $T$, we first encode $X^{(T)}$ as the feature vector $\bar{x}^{(T)} = \psi(X^{(T)})$, and then estimate the sparse coding in the input feature space via Lasso based on the previously learned dictionary $K$:

$$\tilde{s}^{(T)} = \arg\min_{s}\ \big\| \bar{x}^{(T)} - K s \big\|_2^2 + \mu \| s \|_1. \qquad (16)$$

Since this estimated $\tilde{s}^{(T)}$ also serves as the sparse coding for the latent dictionary $L$, it can be used to recover the local predictor for the new batch $T$ as

$$\tilde\theta^{(T)} = L \tilde{s}^{(T)}. \qquad (17)$$

This new local predictor $\tilde\theta^{(T)}$ is obtained with the input data $X^{(T)}$ only, and it can then be used to predict the true process outputs for the new batch $T$ as $\hat{y}^{(T)} = ( X^{(T)} )^{\mathsf T} \tilde\theta^{(T)}$.
This completely eliminates the need to wait for the output data to construct a model. It can be seen from (16) and (17) that the construction of a new local predictor depends on the previously built dictionary $L$ and the current input features $\bar{x}^{(T)}$. $L$ contains the knowledge learned from all the past batches, while the input features contain the latest process operation information. Combining both $L$ and $\bar{x}^{(T)}$ can enhance the predictive performance of the new model. This is another advantage of the proposed unsupervised knowledge transfer.
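A minimal sketch of this unsupervised transfer step (16)-(17) (Python/NumPy; names are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def predict_new_batch(L, K, X_new, mu):
    """Unsupervised knowledge transfer, following (16)-(17): sparse-code
    the input features in K, reuse the code with L to rebuild a predictor,
    and predict the still-unmeasured outputs."""
    d = X_new.shape[0]
    x_bar = X_new.mean(axis=1)                      # psi(X^(T)), row means
    s = Lasso(alpha=mu / (2 * d), fit_intercept=False,
              max_iter=10000).fit(K, x_bar).coef_   # (16)
    theta_new = L @ s                               # (17)
    return X_new.T @ theta_new                      # y_hat = X^T theta
```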
Remark 3: Based on the coupled dictionary learning, the premise of using the learned dictionary $K$ and the input features to reconstruct the new predictor is that $L$ and $K$ are closely related. Recall that we factorize the local predictor's parameters and the input features as $\theta^{(t)} = L s^{(t)}$ and $\bar{x}^{(t)} = K s^{(t)}$, respectively. The dictionary $L$ is basically extracted from the previously learned local predictors, and hence it captures the inner characteristics of the underlying system, i.e., the relationship between the process input and output. If we assume that $K$ and $L$ contain similar knowledge of the process, the input features given in Section IV-A2 must also characterize the underlying system dynamics. In other words, the predictor's parameter is an implicit function of the process input for individual batches. If the process characteristics were completely independent of the process input, the use of the input features and dictionary $K$ would not be able to reconstruct an accurate predictor. However, this cannot be the case in practice, because the output is always related to the input, and hence the process input is closely related to the system characteristics. Since the input features contain essential process operation knowledge, the integration of the latest operating information into the previously built KB can enhance the accuracy of the new predictor construction.

C. Algorithm Summary
The proposed lifelong learning framework for industrial process prediction is summarized in Algorithm 1, which operates naturally in three stages, namely, initial training, online prediction, and KB adaptation.
1) Initial Training: Multiple historical batches with complete process input and output data are collected and used to build the KB for the lifelong learner. After supervised training, the trained KB, i.e., $L$ and $K$, is kept.
2) Online Prediction: When a new data batch with process input data only arrives, the lifelong learner first constructs a new local predictor based on the trained KB and the input data by unsupervised knowledge transfer, and then makes the prediction for the new data batch.
3) KB Adaptation: After observing the true output data, the completed current batch data are used to update the KB so as to acquire the most up-to-date knowledge.

Algorithm 1 Lifelong Learning-Based Streaming Process Prediction

It is worth recapping that the classic ELLA [22] cannot be applied to this industrial streaming process prediction application. This is because, to perform online prediction for new batches, ELLA first requires sufficient labeled data to identify the task relationships before bootstrapping a model via knowledge transfer. However, such labeled data are not available at online prediction time. Also, the scheme of [43] and [44] cannot be applied to this industrial streaming process prediction application, since it is generally impossible to craft high-level task descriptors for a practical industrial process. In contrast, our framework constructs predictors for new batches by unsupervised knowledge transfer based on process input information only. In addition, we use online KB adaptation in our framework because the process characteristics can change significantly, and it is vital to capture the latest data characteristics in the KB. This is the elegance of lifelong learning. Hence, our proposed lifelong learning framework is very suitable for dynamic process prediction and modeling, particularly when the acquisition of the process output data is seriously delayed.
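The three stages of Algorithm 1 can be orchestrated as in the following sketch (Python; it reuses the illustrative helpers fit_local_predictor, coupled_update, and predict_new_batch sketched in Section IV, and the defaults are our own assumptions, not the authors' settings):

```python
import numpy as np

def run_lifelong(train_batches, stream_batches, d, k, mu=1e-6, lam=1e-8):
    """Algorithm 1 driver (illustrative): initial training, then online
    prediction from inputs only, then KB adaptation per batch."""
    rng = np.random.default_rng(0)
    L = rng.random((d, k))                # dictionaries initialized in (0, 1)
    K = rng.random((d, k))
    stats = (np.zeros((k * d, k * d)), np.zeros(k * d),
             np.zeros((k * d, k * d)), np.zeros(k * d), 0)
    # Stage 1: initial training on historical batches with complete data.
    for X, y in train_batches:
        theta_hat, Gamma = fit_local_predictor(X, y)
        L, K, stats, _ = coupled_update(L, K, stats, theta_hat,
                                        X.mean(axis=1), Gamma, mu, lam)
    # Stages 2-3: online prediction before the outputs arrive, then KB
    # adaptation once the delayed outputs are measured.
    for X, y_delayed in stream_batches:
        y_hat = predict_new_batch(L, K, X, mu)   # inputs only
        theta_hat, Gamma = fit_local_predictor(X, y_delayed)
        L, K, stats, _ = coupled_update(L, K, stats, theta_hat,
                                        X.mean(axis=1), Gamma, mu, lam)
        yield y_hat
```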
We now analyze the online computational complexity of learning a new batch by our method. The construction of the local predictor by the regularized LS estimator (6) has a complexity on the order of $O(d^3)$. The adaptation of a single dictionary $L \in \mathbb{R}^{d \times k}$ and sparse coding $s^{(t)} \in \mathbb{R}^k$ costs $O(k^2 d^3)$. Since we incorporate the input features into the lifelong learning framework by augmenting $L \in \mathbb{R}^{d \times k}$ into $H \in \mathbb{R}^{(2d) \times k}$, the coupled dictionary adaptation costs $O(k^2 (2d)^3)$. Thus, the overall complexity of the per-batch adaptation is $O(d^3 + k^2 (2d)^3)$, which is independent of the batch number.
Remark 4: The online operations of our proposed lifelong learning framework include online prediction and KB adaptation. Specifically, when a new batch with process input data only arrives, a local predictor is constructed explicitly by unsupervised knowledge transfer based on the process inputs, and the constructed model is used to predict the process outputs for the batch. Later, only when the corresponding true process output data become available does the supervised KB adaptation take place. Since the base model is linear, i.e., $d$ is very small, and furthermore $k < d$, the operations of online prediction and KB adaptation are very fast, and their execution time is well within the time duration of a batch, which may contain tens to hundreds of samples. In a nutshell, the online operation time of our lifelong learning framework is not an issue.

V. EXPERIMENTS
Two industrial applications, a penicillin fermentation process and a WWTP, are used to verify the effectiveness of our proposed lifelong modeling framework. Three metrics, the mean square error (mse), the mean absolute error (MAE), and the coefficient of determination ($R^2$), are used to evaluate both training and online prediction performance.
We operate the proposed framework in the following three different learning modes.
1) Pro-Nonadaptive: The training data are first used to build a KB. During online operation, the trained KB is fixed and the learner makes a prediction after receiving each data batch. This is essentially Algorithm 1 minus the KB adaptation part.
2) Pro-Adaptive: This is the complete Algorithm 1. After online prediction, and when the true process outputs for the current batch become available, the KB is adapted to capture the system characteristics in the current batch.
3) Pro-Idealized: This scheme continuously operates in the training mode. After observing the complete process input and output data of a new batch, a fully updated KB is obtained that accumulates the knowledge from all the batches. This "full" KB is used to reconstruct all the individual models for the individual batches, and the individual models then make the "prediction" for the individual batches.

It can be seen that only pro-nonadaptive and pro-adaptive can be applied to the scenario of adaptive online modeling and prediction for streaming processes with delayed process output measurement. Pro-idealized is impractical and is used here to represent the idealized performance limit (training performance) of the lifelong modeling framework. Since the base model for the proposed framework is linear, for a fair comparison, the proposed framework is compared with the state-of-the-art linear models, including LS, Bayesian augmented Lagrangian (BAL) [46], [47], partial LS (PLS) [6], [45], RLS [39], [48], and clustering-based locally linear regression (CLR) [35], [36]. For the LS, BAL, and PLS, the training data are used to construct models, and the trained models are fixed during online operation. For the CLR, k-means clustering is first used to partition the training input data into multiple clusters, and local models are built by LS for the individual clusters. The number of clusters is determined by grid search based on the training mse. The trained local linear model set is fixed in online operation. During online prediction, the Euclidean distances between the training clusters and the new input data are measured, and the local model with the closest distance is selected for prediction. We also modify this CLR with ensemble model learning, called "CLR-ensemble," where the trained local models are combined for prediction with weights based on the distances. The standard RLS performs online prediction and adaptation on a sample-by-sample basis. Specifically, the model adapted at the previous sample is used to make the prediction on the current input sample. After this sample prediction, the true process output sample is assumed to be available, and the RLS performs the model adaptation using the process input-output sample pair. We refer to this standard RLS as "RLS-idealized"; it cannot be used for adaptive online modeling and prediction of streaming processes with delayed process output measurement, and it is included here to represent the idealized performance limit achievable if there were no delay in the acquisition of the process output. To have a fair comparison with our pro-adaptive, we modify the classic RLS with batch prediction and adaptation, called "RLS-batch." Specifically, the RLS model updated from the previous batch is used to predict the new batch. After the true process output data for this new batch have been acquired, the model is updated over the batch, and the updated model is then used for the prediction of the next batch. It is worth recapping that LS, BAL, PLS, CLR, CLR-ensemble, and our pro-nonadaptive are nonadaptive models, i.e., they only make online predictions from the inputs and do not perform model adaptation.
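For reference, the per-batch adaptation of the RLS-batch baseline described above can be sketched as follows (Python/NumPy; an illustrative rendering of the standard RLS recursion with forgetting, not the exact experimental code):

```python
import numpy as np

def rls_batch_update(theta, P, X, y, ff=0.98):
    """One 'RLS-batch' adaptation sweep: after the delayed outputs y of a
    predicted batch arrive, run sample-by-sample RLS with forgetting
    factor ff over that batch; the updated theta predicts the next batch."""
    for x_i, y_i in zip(X.T, y):
        g = P @ x_i / (ff + x_i @ P @ x_i)        # RLS gain
        theta = theta + g * (y_i - x_i @ theta)   # innovation correction
        P = (P - np.outer(g, x_i @ P)) / ff       # covariance update
    return theta, P
```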

A. Penicillin Fermentation Process
The penicillin fermentation process is a biochemical fed-batch process with multimode time-varying characteristics. It has been widely used for performance assessment of adaptive modeling approaches [28], [29]. During the fermentation process, the penicillin and substrate concentrations are two hard-to-measure process outputs, which may take hours to acquire from laboratory analysis. Our goal is to predict these two process outputs using the easy-to-measure process input variables before obtaining their true values from laboratory measurements. The process inputs and outputs are tabulated in Table I, and based on the knowledge of this process, the process input vector at sample $i$ is defined from the input variables listed there. By changing different operating conditions, 1600 samples are collected from the PenSim tool [28], and they are divided into the training (800 samples) and online testing (800 samples) sets. During online operation, batches of data are received consecutively. In practice, the batch size is determined by the application, e.g., by the delayed acquisition time of the process output. In general, a large batch size enables constructing a more accurate local model for each batch, while a small one can better capture the time-varying local data characteristics of different batches. Also, the batch size should not be too large, so that the process can be approximated by a local linear model over each batch. The batch size is set to 100 in this experiment. For the proposed framework, the dictionary size $k$ and the regularization parameters are chosen empirically. In addition, $\rho$ is an important parameter that balances the model's fit against the input feature's fit [44], and we set $\rho = 1$ to achieve the best prediction performance. The forgetting factor of RLS is set to 0.98. The ridge term for the LS models is empirically set to $10^{-5}$. The number of latent variables for PLS is set to 6 to attain its best performance. The augmented Lagrangian parameter and the regularization parameter for BAL are carefully chosen to be $10^{-3}$ to obtain its best performance. The number of clusters for CLR and CLR-ensemble is set to 8, as the training mse does not decrease with a further increase in clusters.
The performance comparison of the various methods for predicting the penicillin concentration and the substrate concentration is tabulated in Tables II and III, respectively. As expected, the training performance of the LS, BAL, PLS, and RLS is the same or similar, but our proposed lifelong learning framework attains a slightly better training performance, since it includes input features in knowledge transfer to enrich the modeling accuracy. CLR and CLR-ensemble attain much better training performance than the other methods, because they incorporate the local data characteristics of different regions to enhance the overall modeling capacity. What really matters, however, is the prediction performance. In terms of online prediction accuracy, the nonadaptive LS, BAL, and PLS schemes achieve similar performance, but again our pro-nonadaptive achieves a slightly better online prediction performance, again because it includes input features in constructing the predictive model. With batch adaptation after predicting new batches, our pro-adaptive scheme outperforms the pro-nonadaptive scheme by around 2 dB in the testing mse. The online prediction accuracy of CLR and CLR-ensemble degrades dramatically from the training accuracy and becomes even inferior to the nonadaptive single-model approaches. Their poor prediction performance may be due to the nonsmoothness of the boundaries of the subspaces identified in training. Also, the local models learned in training may not capture data states newly emerged in unseen data.
The RLS-batch has very poor online prediction performance. At first glance, this may seem strange, but it actually makes sense. The RLS algorithm has a short memory, allowing it to forget the past data and concentrate on the current data characteristics. Hence, after batch adaptation, the model forgets most of the past knowledge and captures the characteristics of the current batch. This model is used to predict the next batch. For a process with highly time-varying characteristics, the characteristics of the next new batch can be very different from those captured in the model, and this is the root cause of its poor adaptive performance. In contrast, in our pro-adaptive scheme, the learned KB contains information from all the past batches and, moreover, the process operational information of the current batch is extracted for unsupervised model adaptation to accurately predict the current batch. In Fig. 3, we compare the online predictions of RLS-batch and our pro-adaptive with the true process outputs for further illustration of the above discussion. With sample-by-sample adaptation, RLS-idealized is capable of attaining excellent online prediction performance that is very close to the training performance. But this adaptive scheme cannot be applied to the penicillin fermentation process with delayed process output measurement. Also, as expected, the impractical pro-idealized provides the ultimate performance limit. The online mse learning curves of the various methods are compared in Fig. 4, where the mse value at test sample $t$ is calculated from the first test sample to the $t$th test sample as

$$\mathrm{mse}(t) = \frac{1}{t} \sum_{i=1}^{t} \big( y_i - \hat{y}_i \big)^2 \qquad (18)$$

in which $y_i$ denotes the $i$th process output sample and $\hat{y}_i$ is its prediction (a code sketch is given at the end of this subsection). The prediction errors of our lifelong learning framework in the three different learning modes are compared in Fig. 5. It can be seen that the performance of pro-adaptive is much better than that of pro-nonadaptive, and it is very close to the idealized training performance offered by pro-idealized. This again demonstrates the importance of the online KB adaptation and knowledge sharing mechanism. To further investigate this online learning mechanism for building the KB, we study the impact of the number of training batches on the online prediction performance of the two practical lifelong learning schemes, pro-nonadaptive and pro-adaptive, over all the online test batches. The experimental results are plotted in Fig. 6, where the effectiveness of the adaptive lifelong model, pro-adaptive, is self-evident. On predicting the penicillin concentration, for example, two training batches are enough for pro-adaptive to attain a performance comparable to the case of using all eight training batches. This demonstrates that, equipped with online KB adaptation, the pro-adaptive scheme is capable of compensating for the lack of training data/tasks by continually acquiring the most up-to-date knowledge online.
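For completeness, the cumulative mse(t) used in these learning curves is straightforward to compute; a minimal sketch (Python/NumPy, illustrative names):

```python
import numpy as np

def mse_curve(y_true, y_pred):
    """Cumulative mse(t), as in (18): running mean of the squared errors
    from the first test sample up to sample t."""
    err2 = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    return np.cumsum(err2) / np.arange(1, err2.size + 1)
```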

B. Wastewater Treatment Plant
The WWTP detailed in [49] is a dynamic process subject to large perturbations in the influent flow rate and pollutant load, together with uncertainties in the composition of the incoming wastewater. The operation of this WWTP is to remove organic matter and perform nitrification and denitrification. To achieve this aim, it is essential to estimate the flow rate during the plant operation. The process inputs and outputs of this WWTP are listed in Table IV. The influent data are collected under severe weather conditions (a combination of dry weather and a long rainy period), which makes the underlying system characteristics highly time-varying and imposes a challenge on predictive models [50]. The plant knowledge suggests that the plant input vector at sample $i$ can be chosen from the input variables listed in Table IV. We collect 1300 samples from the WWTP dataset, among which the first 500 samples are used for training, while the rest are for online prediction. To better capture the local characteristics of this dynamic process, the batch size is set to 50. Hence, there are 26 data batches in total. The first ten batches are used for training, while the remaining 16 batches are for online prediction and adaptation. The dictionary size $k$, the regularization parameters, and $\rho$ are set to 2, $10^{-8}$, and 1, respectively, for the proposed method. The number of latent variables for PLS is three. The augmented Lagrangian parameter and the regularization parameter for BAL are chosen to be $10^{-3}$. The cluster number for CLR and CLR-ensemble is set to 25 empirically.
The performance comparison of the various models is presented in Table V, and their online mse learning curves are given in Fig. 7. Again, the same observations regarding the various methods can be drawn as for the penicillin fermentation process, with two exceptions. First, RLS-batch in this case achieves a slightly better prediction accuracy than the nonadaptive LS, BAL, and PLS. This is because the batch size is smaller and there are sufficient adjacent batches that have similar characteristics; see the true process output sequence plotted in Fig. 8. Second, different from the previous case study, the CLR and CLR-ensemble attain slightly better online prediction accuracy than the LS, BAL, PLS, RLS-batch, and pro-nonadaptive. But they are inferior to the pro-adaptive, because the trained local models of CLR and CLR-ensemble cannot capture the abrupt change in the testing data. This again demonstrates the importance of model adaptation. A notable feature of the online mse learning curves of Fig. 7 is that around test sample 300, there is a sharp deterioration in the online prediction accuracy of all the methods. This "abnormal" phenomenon can be explained by the time-varying plant characteristics caused by the abrupt weather change, shown in Fig. 8. Like the 500 training samples, the first 300 test samples (samples t = 500-800) are taken in a dry weather period, but the next 300 test samples (t = 800-1100) are taken in a long rainy period. Hence, the underlying plant dynamics experience an abrupt change around t = 800, and the data characteristics become very different from those of the training data. This kind of sudden sharp change is difficult for any adaptive model to cope with.
The online prediction results of the proposed framework in the three learning modes are compared in Fig. 9. It can be seen that pro-nonadaptive is hardly able to track the abrupt process drift starting at t = 800, while pro-adaptive does offer some capability to track this abrupt change in the plant dynamics. It is informative to examine the pro-adaptive scheme's prediction behavior over the rainy period in more detail. Observe from Fig. 9(a) that for the first rainy data batch (test samples 300-350), the local predictor performs poorly, because the current KB is learned from the previous dry data batches. After online prediction, the batch adaptation enables the KB to encode the new rainy characteristics. Consequently, for the next two batches (test samples 350-450), the prediction accuracy improves significantly. The prediction performance is degraded again for the next batch (test samples 450-500) as the underlying plant characteristic begins to change back to a dry-weather one. Because the online adaptation after batch prediction is able to capture this new dynamic, the subsequent batch prediction (test sample 500 onward) achieves high accuracy.

VI. CONCLUSION AND FUTURE RESEARCH
This article has proposed a novel lifelong learning framework for online modeling and prediction of industrial time-varying streaming processes with delayed process output measurement. Our main contribution has been to encode the input data into a feature vector that contains essential process operating information and to use these input features to augment the local predictors on each data batch. We have used coupled dictionary learning to connect the input feature's space with the predictor's parameter space, where the two spaces are linked through two dictionaries that are coupled by the same sparse coding. The proposed learning framework has the capability of unsupervised learning, which enables the learner to rapidly reconstruct local predictors for newly arriving batches given only the process inputs and to provide timely predictions of the true process outputs. Two industrial case studies, involving a penicillin fermentation process and a WWTP, have been used to demonstrate the effectiveness and the intrinsic learning mechanism of our proposed framework for online prediction and adaptation of industrial processes with hard-to-measure process output variables.
This article has opened up a new research direction for streaming process data analytics and modeling under the framework of lifelong learning. There are many theoretical and practical issues that deserve further investigation. One important issue is how to define the KB in lifelong learning by incorporating prior process knowledge [7], [45] and prior knowledge of the underlying process's critical constraints [51], [52]. This article has incorporated operation knowledge, namely, the process inputs, into lifelong learning to enhance the modeling performance and enable online prediction. Other interpretable process physical features can also be integrated. Although the delay time of process output data acquisition is determined by the constraints of industrial applications, there is scope for investigating how to choose an appropriate batch size. The base model for our lifelong learning is linear; a long-term research direction is to study the potential of adopting a nonlinear base model in lifelong learning.

Fig. 2. Illustration of the industrial lifelong learning system. For historical batches, the process inputs are encoded as a feature vector, while both input and output data are used to construct a local predictor. The input features and local predictor are factorized into two dictionaries that are coupled by a joint sparse coding. When a new batch arrives, the learner reconstructs the local predictor for the new batch using solely the process inputs by knowledge transfer.

Fig. 3. Prediction performance comparison of RLS-batch and pro-adaptive for predicting (a) penicillin concentration and (b) substrate concentration in the penicillin fermentation process.

Fig. 4. Comparison of the online mse learning curves of various methods for predicting (a) penicillin concentration and (b) substrate concentration in the penicillin fermentation process. The curves of RLS-batch, CLR, and CLR-ensemble are outside the plot.

Fig. 5. Prediction error comparison of the proposed framework in three different learning modes for predicting (a) penicillin concentration and (b) substrate concentration in the penicillin fermentation process.

Fig. 6. Performance of lifelong learning of consecutive training tasks for predicting (a) penicillin concentration and (b) substrate concentration in the penicillin fermentation process.

Fig. 7. Comparison of the online mse learning curves of various methods for predicting the flow rate of the WWTP.

Fig. 9. Comparison of (a) predicted outputs and (b) prediction errors of the proposed framework in three different learning modes for the WWTP.

TABLE I. VARIABLE DESCRIPTION OF PENICILLIN FERMENTATION PROCESS

TABLE II. PERFORMANCE COMPARISON OF LS, BAL, PLS, RLS, CLR, AND PROPOSED METHOD FOR PREDICTING PENICILLIN CONCENTRATION IN PENICILLIN FERMENTATION PROCESS

TABLE III. PERFORMANCE COMPARISON OF LS, BAL, PLS, RLS, CLR, AND PROPOSED METHOD FOR PREDICTING SUBSTRATE CONCENTRATION IN PENICILLIN FERMENTATION PROCESS

TABLE V. PERFORMANCE COMPARISON OF LS, BAL, PLS, RLS, CLR, AND PROPOSED METHOD FOR PREDICTING FLOW RATE OF WWTP