Spatio-Temporal Similarity Measure based Multi-Task Learning for Predicting Alzheimer’s Disease Progression using MRI Data

Identifying and utilising biomarkers for tracking Alzheimer’s disease (AD) progression has received considerable recent attention and can help clinicians make prompt decisions. Traditional progression models focus on extracting morphological biomarkers in regions of interest (ROIs) from MRI/PET images, such as regional average cortical thickness and regional volume. They are effective but ignore the relationships between brain ROIs over time, which may lead to synergistic deterioration. To explore the synergistic deteriorating relationships between these biomarkers, we propose a novel spatio-temporal similarity measure based multi-task learning approach for effectively predicting AD progression and sensitively capturing the critical relationships between biomarkers. Specifically, we first define a temporal measure for estimating the magnitude and velocity of biomarker change over time, which indicates a changing trend (temporal). Converting this trend into a vector, we then compare this variability between biomarkers in a unified vector space (spatial). The experimental results show that, compared with learning directly from ROI features, our proposed method is more effective in predicting disease progression. Our method also enables longitudinal stability selection to identify the changing relationships between biomarkers, which play a key role in disease progression. We show that the synergistically deteriorating biomarkers between cortical volumes or surface areas have a significant effect on cognitive prediction.


I. INTRODUCTION
Alzheimer's disease (AD) is a serious neurodegenerative disease characterized by memory loss and cognitive decline due to the progressive damage of neurons and their connections, which ultimately leads to death [1]. According to the World Health Organization (WHO), an estimated 47.5 million people worldwide were living with dementia in 2016, with 7.7 million new cases every year. Previous research has focused on using biomarkers combined with machine learning algorithms to predict patients' Mini Mental State Examination (MMSE) and Alzheimer's Disease Assessment Scale cognitive subscale (ADAS-Cog) scores, using these scores as targets to determine whether a patient has AD and to find the weight of each biomarker feature at different prediction time points. Existing AD progression models mainly use machine learning regression algorithms [2], survival models based on statistical probabilities [3], [4], and deep learning methods based on neural networks [5]-[7]. These studies focus on using the data obtained at the patient's first test (baseline data) to make predictions, and therefore rely on a small number of input features. The disadvantage is that they ignore the information contained in how the biomarkers change over time.
§ Po Yang is the corresponding author.
Previous studies focusing on brain ROIs have examined the differences in the correlation between brain biomarkers for AD patients, cognitively normal older individuals (NL) and patients with mild cognitive impairment (MCI). [8] proposed a deformation-based framework to jointly model the effects of aging and AD on the evolution of brain morphology, confirming the existence of components that significantly accelerate aging in AD patients. [9] evaluated the correlation of MRI and CSF biomarkers with clinical diagnosis and cognitive performance in NL subjects, subjects with aMCI (amnestic mild cognitive impairment) and AD patients. They concluded that MRI provides stronger cross-sectional grouping and recognition ability, correlates better with general cognitive and functional status cross-sectionally, and reflects the clinically determined disease stage better than CSF biomarkers. In longitudinal studies, [10] described a novel perspective on the differences in volume trajectories and brain atrophy progression of single biomarkers between normal aging and AD. Some previous studies focused on the similarity between ROI biomarkers: [11] employed the correlation of regional average cortical thickness and a multi-kernel support vector machine to integrate correlative information with ROI-based information and improve classification performance. However, the above-mentioned studies only used a single biomarker or the same type of biomarkers and did not consider the relationships of temporal and spatial changes between different types of biomarkers.

Fig. 1. The protocol of the MTL approach using the spatio-temporal measure.
To address the above challenges and uncover the critical relationships between biomarkers, we propose to utilise the temporal and spatial information of brain changes to model the disease process of AD. Additionally, to reinforce temporal relationships between follow-up time points, a multi-task learning method [12] based on temporal smoothness is introduced for interpretably modelling disease progression.
In this paper, we propose to utilise the spatio-temporal similarity between biomarker changes to predict the clinical scores of patients. Specifically, we first define a temporal measure for estimating the magnitude and velocity of biomarker change over time, which indicates a changing trend (Fig. 1: temporal feature mapping). Converting this trend into a vector, we then compare this variability between biomarkers in a unified vector space (Fig. 1: spatial feature mapping). Computing the spatial similarity for every pair of biomarkers makes the feature dimension grow quadratically. Faced with the scarcity of samples and a large number of feature dimensions, we introduce multiple regularization terms based on the $L_1$ norm [13] and its variants [12] to overcome the curse of dimensionality and interpretably capture the key relationships. The contributions of this work are summarized as follows:
• A novel spatio-temporal similarity measure approach is proposed for analysing and extracting reliable features from MRI. This similarity measure effectively quantifies the synergistic deterioration between these biomarkers over time;
• A multi-task learning (MTL) algorithm with spatio-temporal embedding is designed for effectively predicting AD progression and visualising the brain biomarkers related to this progression;
• A comprehensive experimental analysis is carried out by assessing the impact of AD progression on synergistically deteriorating brain biomarkers.
II. RELATED WORK
In the traditional machine learning paradigm, an accurate learner is usually treated as one single learning task (e.g., classification, regression) and learnt from a large number of training samples. For instance, deep learning models [5], [14] can train an accurate AD prediction neural network with hundreds of layers containing a great number of parameters via massive labelled baseline biomarkers from ADNI. One key challenge, however, is that sufficient and well-labelled longitudinal AD data at multiple time points can hardly be collected from AD patients. Missing, sparse and insufficient data strongly hinder learning a good model. In contrast to traditional ML approaches, multi-task learning [15] treats the prediction of AD progression as multiple learning tasks, each of which is a general prediction task at a certain time point. All of these prediction tasks are assumed to be related to each other in the time domain through relevant temporal features (e.g., biomarkers in MRI). A typical pipeline leverages MTL algorithms for predicting the cognitive functionality of AD patients from their brain imaging scans [16], where the predictive information is shared and transferred among related models to reinforce their generalization performance. The data sources employed are FreeSurfer outputs (features extracted from MRI, such as the volume of the hippocampus) and cognitive functional scores (AD cognitive scales such as MMSE [17] or ADAS-Cog [18]) collected repeatedly from selected AD patients at multiple time points. By considering the prediction of cognitive scales at a single time point (such as 6, 12 or 18 months) as a regression task, the prediction of clinical scores at multiple future time points becomes a multi-task regression problem. The weights of the MTL model are trained and optimized on pre-extracted MRI features and baseline cognitive scales.
Two important issues affect the progress of applying MTL to AD modelling problems. First, it is important to obtain good-quality baselines from raw AD data, where MRI reflects changes in brain structure, such as the cerebral cortex and ventricles, and cognitive scales directly show the cognitive functions of AD patients. Sparse representation [19] is a popular method in MTL for capturing key biomarkers in AD; it uses sparseness as a regularization condition to retain image blocks with key characteristics. Cognitive measures can be obtained using internationally standard AD cognitive assessments, such as MMSE [17], ADAS-Cog [18] and the Rey Auditory Verbal Learning Test (RAVLT) [20], [21]. Second, utilizing and improving advanced regression models [22] in MTL is critical, since they can better explore the relationships and correlations between MRI features and cognitive measures. Here, structural regularization [12] is a common approach in MTL for minimizing the penalized empirical loss while encoding the assumed correlations between tasks. In the field of MTL for AD, many prior works model relationships among tasks using novel regularizations [23], [24]. The addition of kernel methods allows the algorithm to fit non-linear relationships [25]. The benchmark of this paradigm is derived from [26], and subsequent work mostly addresses theoretical structure, relevance, and the fusion of multi-modality data. To the best of our knowledge, the above regularized MTL approaches deliver promising performance in many AD prediction applications.

III. METHODOLOGY
A. Problem formulation
Consider an MTL problem of $k$ tasks with $n$ training samples of $d$ features. Let $x_1, x_2, \ldots, x_n$ be the input data for the patients and $y_1, y_2, \ldots, y_n$ be the cognitive scores to be predicted for each patient, where each $x_i \in \mathbb{R}^d$ represents the feature data of an AD patient and $y_i \in \mathbb{R}^k$ collects the values of the cognitive scale over the $k$ tasks. Specifically, $x_i^j = [m, v]$ denotes the spatio-temporal ROI similarity on the $j$-th and $(j + r)$-th features of the $i$-th sample, where $m$ and $v$ represent the magnitude and velocity of two specific biomarkers over the time scale, and $j, (j + r) \in (0, d]$.
Then, let $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ be the data matrix, $Y = [y_1, \ldots, y_n]^T \in \mathbb{R}^{n \times k}$ be the target matrix, and $W = [w_1, \ldots, w_k] \in \mathbb{R}^{d \times k}$ be the weight matrix. Establishing an MTL model amounts to estimating $W$ from the training samples.
To solve the above problem, many prior works in MTL model relationships among tasks using regularization methods. Normally, they assume the empirical loss to be the square loss, and common regularization terms are the $L_1$ and $L_2$ norms, giving the Lasso regression and ridge regression models shown in Eq. 1 and 2, respectively. Ridge regression shrinks coefficients towards a smaller range, reducing the influence of factors with little impact on the model's prediction. Unfortunately, shrinkage alone means that these variables are still retained in the model. To solve this problem, Lasso was proposed as a sparse linear algorithm that simultaneously performs feature selection and regression: some coefficients are set exactly to zero to achieve sparsity and dimensionality reduction.
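For reference, the standard Lasso and ridge objectives can be written in the notation of this section as below. This is a sketch of the textbook forms with a squared loss; the single regularization parameter $\lambda > 0$ is a symbol assumed here and is not taken from the original equations.

```latex
% Lasso: L1-penalised least squares (drives some coefficients exactly to zero)
\min_{W} \; \|XW - Y\|_F^2 + \lambda \|W\|_1

% Ridge: squared L2 (Frobenius) penalised least squares (shrinks coefficients, no exact zeros)
\min_{W} \; \|XW - Y\|_F^2 + \lambda \|W\|_F^2
```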
In AD studies, the task of predicting a patient's cognitive scale at a certain time point is strongly associated with the tasks at adjacent time points. Thus, many recent studies have focused on designing novel structural regularization methods to improve prediction performance.
In this paper, we concentrate on two AD progression prediction models: Temporal Group Lasso (TGL) [16] and Convex Fused Sparse Group Lasso (cFSGL) [27]. Specifically, TGL contains a temporal smoothness term and a group Lasso term as constraints, which ensure that the regression models at different time points share a common set of features. The TGL formulation solves the following convex optimization problem:

$$\min_{W} \; \|XW - Y\|_F^2 + \theta_1 \|W\|_F^2 + \theta_2 \|WH\|_F^2 + \delta \|W\|_{2,1}$$

where the first term measures the empirical error on the training data, $\|W\|_F$ is the Frobenius norm, $\|WH\|_F^2$ is the temporal smoothness term, which ensures a small deviation between two regression models at successive time points, and $\|W\|_{2,1}$ is the group lasso penalty, which ensures that a small subset of features is selected for the regression models at all time points; $\theta_1$, $\theta_2$ and $\delta$ are regularization parameters.
cFSGL introduces sparsity between tasks: it considers both the features common to different time points and the features unique to each task, which helps to improve the overall performance of the model. The cFSGL formulation solves the following convex optimization problem:

$$\min_{W} \; \|XW - Y\|_F^2 + \lambda_1 \|W\|_1 + \lambda_2 \|RW^T\|_1 + \lambda_3 \|W\|_{2,1}$$

where the first term measures the empirical error on the training data, $\|W\|_1$ is the lasso penalty, $\|RW^T\|_1$ is the fused lasso penalty, and $\|W\|_{2,1}$ is the group lasso penalty; $\lambda_1$, $\lambda_2$ and $\lambda_3$ are regularization parameters. The combination of the lasso and group lasso penalties is called the sparse group lasso, which allows the simultaneous selection of a common set of features for all time points while generating sparse solutions specific to individual time points. The fused lasso penalty enforces temporal smoothness, making the features selected at nearby time points similar to each other. In addition, notice that the cFSGL formulation involves three non-smooth terms. An accelerated gradient descent method is utilised to solve this problem.
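To make the regularization terms of TGL and cFSGL concrete, the sketch below evaluates the penalties for a given weight matrix in plain Python. It is illustrative only and not the solver used in the paper. It assumes that H is the k x (k-1) successive-difference matrix used in the cited TGL work and that R = H^T, so that ||R W^T||_1 equals ||W H||_1. All function and parameter names are hypothetical.

```python
import math

def temporal_difference_matrix(k):
    """Assumed H (k x (k-1)): H[i][i] = 1, H[i+1][i] = -1, so column i of W H is w_i - w_{i+1}."""
    H = [[0.0] * (k - 1) for _ in range(k)]
    for i in range(k - 1):
        H[i][i] = 1.0
        H[i + 1][i] = -1.0
    return H

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def frobenius_sq(M):
    return sum(v * v for row in M for v in row)

def l1_norm(M):
    return sum(abs(v) for row in M for v in row)

def l21_norm(W):
    """Sum over features (rows of W) of the Euclidean norm across tasks (columns)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)

def tgl_penalty(W, theta1, theta2, delta):
    """theta1*||W||_F^2 + theta2*||W H||_F^2 + delta*||W||_{2,1}."""
    WH = matmul(W, temporal_difference_matrix(len(W[0])))
    return theta1 * frobenius_sq(W) + theta2 * frobenius_sq(WH) + delta * l21_norm(W)

def cfsgl_penalty(W, lam1, lam2, lam3):
    """lam1*||W||_1 + lam2*||R W^T||_1 + lam3*||W||_{2,1}, assuming R = H^T."""
    WH = matmul(W, temporal_difference_matrix(len(W[0])))
    return lam1 * l1_norm(W) + lam2 * l1_norm(WH) + lam3 * l21_norm(W)

# Toy weight matrix: d = 3 features (rows), k = 3 time points (columns).
W_example = [[1.0, 1.0, 1.0],   # constant over time: no smoothness or fusion cost
             [0.0, 0.0, 0.0],   # unselected feature: contributes nothing
             [3.0, 0.0, 4.0]]   # varies over time: penalised by the temporal terms
```

In this toy matrix only the third feature changes between successive time points, so it alone contributes to the temporal smoothness and fused lasso terms, while the all-zero row shows how the group penalty removes a feature from every task at once.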

B. Definition of spatio-temporal similarity
Two consecutive MRI scans are used to calculate the temporal and spatial changes of brain biomarkers. For instance, we utilise the BL and M06 MRI scans to calculate the magnitude and velocity of each biomarker. Let $x$ be the measured value of a brain biomarker and $t$ be the MRI scan date; the magnitude of change between the two scans and the velocity, $\frac{x_{M06} - x_{BL}}{t_{M06} - t_{BL}}$ per month, are computed. The magnitude and velocity are then composed into a vector that represents the changing trend of the brain biomarker.
Cosine similarity is used to calculate the similarity between two such vectors and thus to express the similarity of the temporal and spatial changes of two MRI biomarkers. Cosine similarity uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. Because different types of biomarkers take values on different scales in the MRI dataset, cosine similarity is a suitable choice: it measures the difference in trend rather than in value. The temporal and spatial relationships of the brain biomarkers of AD, NL and MCI subjects are compared using cosine similarity, Euclidean distance and Mahalanobis distance.
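The trend vector and its cosine similarity can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the magnitude is assumed here to be the absolute change between the two scans (its exact definition is not given in the text), the velocity is the signed change per month as defined above, and the function names are hypothetical.

```python
import math

def trend_vector(x_bl, x_m06, months_between):
    """Return [magnitude, velocity] for one biomarker between two scans.

    Assumption: magnitude is the absolute change |x_m06 - x_bl|.
    Velocity is the signed change per month, as in the text.
    """
    change = x_m06 - x_bl
    return [abs(change), change / months_between]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for a biomarker that did not change
    return dot / (norm_u * norm_v)

# Two biomarkers on very different scales that both decrease over 6 months,
# and a third one that increases (all values are made up for illustration).
volume = trend_vector(3500.0, 3400.0, 6.0)
thickness = trend_vector(2.5, 2.4, 6.0)
growing = trend_vector(1.0, 1.2, 6.0)

same_trend = cosine_similarity(volume, thickness)
opposite_trend = cosine_similarity(volume, growing)
```

The first pair yields a similarity of 1 despite the difference in scale, which illustrates why a trend-based measure is preferred over a value-based distance for biomarkers of different types.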

C. Experiment protocol
Firstly, we verified that MTL is superior in tracking AD progression. Combined with randomization techniques, we locate the stable and sensitive cortical biomarkers identified by the MTL algorithm. Our empirical protocol design is shown in Fig. 1. The complete experimental process mainly includes 6 steps:
1) Original feature extraction. Statistical features based on ROIs in the cerebral cortex/sub-cortex are extracted from MRI images.
2) Temporal feature mapping. The change over time of potential biomarkers (ROIs) is characterized by its magnitude and velocity, and temporal changes in biomarkers are described using a two-dimensional vector.
3) Spatial feature mapping. The spatial similarity (cosine similarity) between vectors is calculated.
4) Feature selection. Through this stage, the feature dimension is greatly reduced and the key temporal-spatial features are retained.
5) Predicting multiple cognitive scores. The AD progression between biomarkers and cognitive scales is modelled via MTL methods.
6) Stability selection. MTL methods are embedded in general stability selection to uncover the synergistic deterioration between biomarkers in AD progression.
Secondly, cross-validation is employed to split the training and test data. We utilise different metrics to evaluate the model performance on test data. The regression performance metric often employed in MTL is the normalized mean square error (nMSE), and the root mean square error (rMSE) is employed to measure the performance of each specific regression task. In particular, nMSE is normalized for each task before evaluation, so it is widely used in MTL methods based on regression tasks. The weighted correlation coefficient (wR) is also employed, as in the medical literature addressing AD progression problems [26], [28], [29]. nMSE, rMSE and wR are defined as follows:

$$\mathrm{nMSE}(Y, \hat{Y}) = \frac{\sum_{i=1}^{k} \|Y_i - \hat{Y}_i\|_2^2 / \sigma(Y_i)}{\sum_{i=1}^{k} n_i}$$

$$\mathrm{rMSE}(y, \hat{y}) = \sqrt{\frac{\|y - \hat{y}\|_2^2}{n}}$$

$$\mathrm{wR}(Y, \hat{Y}) = \frac{\sum_{i=1}^{k} \mathrm{Corr}(Y_i, \hat{Y}_i)\, n_i}{\sum_{i=1}^{k} n_i}$$

where $Y_i$ and $\hat{Y}_i$ are the ground truth and the prediction of task $i$, $n_i$ is the number of subjects in task $i$, $\sigma(Y_i)$ is the variance of $Y_i$, and $\mathrm{Corr}$ is the correlation coefficient. Finally, regarding repeated experiments, there is a consensus in MTL models for AD studies that the result of a single experiment is usually accidental and unreliable. To reduce accidental experimental errors, repeated experiments are required. We also evaluate the performance of four selected regularized MTL models under different numbers of repetitions, and lastly evaluate typical factors affecting MTL models, such as data size and number of tasks.
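The three evaluation metrics can be sketched in plain Python as below. The sketch follows the definitions commonly used in the MTL literature on AD progression: nMSE sums the per-task squared error divided by the variance of the task's targets and normalizes by the total number of samples, rMSE is computed per task, and wR weights each task's Pearson correlation by its sample count. The use of the population variance is an assumption, and all function names are hypothetical.

```python
import math

def _variance(values):
    """Population variance (assumption: the papers may use the sample variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def _pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rmse(y_true, y_pred):
    """Root mean square error of a single regression task."""
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared_error / len(y_true))

def nmse(tasks_true, tasks_pred):
    """Normalized mean square error over all tasks (one list of targets per task)."""
    total = sum(
        sum((t - p) ** 2 for t, p in zip(yt, yp)) / _variance(yt)
        for yt, yp in zip(tasks_true, tasks_pred)
    )
    return total / sum(len(yt) for yt in tasks_true)

def weighted_r(tasks_true, tasks_pred):
    """Correlation coefficient averaged over tasks, weighted by task sample size."""
    total = sum(_pearson(yt, yp) * len(yt) for yt, yp in zip(tasks_true, tasks_pred))
    return total / sum(len(yt) for yt in tasks_true)

# Toy example with two tasks (time points) of four subjects each.
truth = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
preds = [[1.0, 2.0, 3.0, 5.0], [2.0, 4.0, 6.0, 6.0]]
```

Because each task's error is divided by the variance of its own targets, the two toy tasks contribute equally to nMSE even though the second has targets on twice the scale.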

D. Stability Selection via MTL
In order to improve the interpretability and robustness of the results, stability selection was modified to meet our needs. The original stability selection strategy uses a Lasso algorithm as the core approach for searching feature subsets. In this paper, MTL algorithms are embedded in stability selection instead.
Let $F$ be the overall set of features and let $f \in F$ denote a feature. Let $\gamma$ denote the number of sub-sampling iterations and $D(i) = \{X(i), Y(i)\}$ denote the $i$-th random sub-sample, $i \in (0, \gamma]$, each of size $\lfloor n/2 \rfloor$. Let $\Lambda$ be the regularization parameter space. For $\lambda \in \Lambda$, let $\hat{W}^{(i)}$ denote the coefficient matrix of the MTFL model fitted on $D(i)$. Then, the subset of features selected in task $j$ by the sparse constraints of the MTFL algorithm can be denoted as:

$$\hat{S}_j^{\lambda}(D(i)) = \{ f : \hat{W}^{(i)}_{f,j} \neq 0 \}$$

With stability selection, we do not simply select one model in the parameter space $\Lambda$. Instead, the data are perturbed (e.g. by sub-sampling) $\gamma$ times at task $j$ and we choose all structures or variables that occur in a large fraction of the resulting selection sets:

$$\pi_j^{\lambda}(f) = \frac{1}{\gamma} \sum_{i=1}^{\gamma} I\left(\hat{W}^{(i)}_{f,j}\right)$$

where the indicator function $I(\cdot)$ is defined as $I(x) = 1$ if $x \neq 0$ and $I(x) = 0$ otherwise, and $\pi_j^{\lambda} \in [0, 1]$ denotes the stability probability of task $j$ under the MTFL approach, in which feature selection is based not on individual operations but on the collaboration constraints of multiple tasks.
Repeating the above procedure for all $\lambda \in \Lambda$, we obtain the stability score $S_j(f)$ for each feature $f$ at task $j$:

$$S_j(f) = \max_{\lambda \in \Lambda} \pi_j^{\lambda}(f)$$

Finally, for a cut-off $\pi_{th}$ with $0 < \pi_{th} < 1$ and a set of regularization parameters $\Lambda$, the set of stable variables is defined as:

$$\hat{F}_j^{stable} = \{ f : S_j(f) \geq \pi_{th} \}$$

The embedded multi-task approach ensures that the selected features have the following properties: 1) Stability. The selected features correspond to cortical regions of the brain that are closely related to the subject's disease progression. 2) Global significance. MTL makes sure that the selected features are important for each task. One technique that arises here is to pick the coefficient value for one of the tasks when computing the stability statistics of the selected features in Eq. 4.
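The stability selection procedure described above can be sketched as follows. This is a schematic illustration rather than the authors' code: the stability score is taken as the maximum selection frequency over the regularization grid, as in standard stability selection, and `covariance_selector` is a deliberately simple, hypothetical stand-in for a sparse MTL solver such as TGL, which would be plugged in through the `select` argument.

```python
import random

def stability_scores(X, Y, select, lambdas, n_subsamples=50, seed=0):
    """Return scores[j][f]: the maximum over lambdas of the fraction of
    sub-samples in which feature f is selected for task j.

    select(X_sub, Y_sub, lam) must return one set of selected feature
    indices per task.
    """
    rng = random.Random(seed)
    n, d, k = len(X), len(X[0]), len(Y[0])
    scores = [[0.0] * d for _ in range(k)]
    for lam in lambdas:
        counts = [[0] * d for _ in range(k)]
        for _ in range(n_subsamples):
            # Sub-sample floor(n / 2) subjects without replacement.
            idx = rng.sample(range(n), n // 2)
            selected = select([X[i] for i in idx], [Y[i] for i in idx], lam)
            for j in range(k):
                for f in selected[j]:
                    counts[j][f] += 1
        for j in range(k):
            for f in range(d):
                scores[j][f] = max(scores[j][f], counts[j][f] / n_subsamples)
    return scores

def stable_features(scores, threshold):
    """Per task, keep the features whose stability score reaches the cut-off."""
    return [{f for f, s in enumerate(task) if s >= threshold} for task in scores]

def covariance_selector(X, Y, lam):
    """Toy stand-in for a sparse MTL solver (hypothetical): keep feature f for
    task j when the absolute mean of x_f * y_j exceeds lam."""
    n, d, k = len(X), len(X[0]), len(Y[0])
    return [
        {f for f in range(d)
         if abs(sum(X[i][f] * Y[i][j] for i in range(n)) / n) > lam}
        for j in range(k)
    ]

# Toy data: feature 0 determines the single target, feature 1 is constant zero.
X_toy = [[1.0 if i % 2 == 0 else -1.0, 0.0] for i in range(10)]
Y_toy = [[row[0]] for row in X_toy]
toy_scores = stability_scores(X_toy, Y_toy, covariance_selector, [0.25, 0.5], n_subsamples=20)
toy_stable = stable_features(toy_scores, 0.8)
```

In the toy data the informative feature is selected in every sub-sample for every value of the regularization parameter, whereas the constant feature is never selected, so only the former survives the cut-off.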

A. Subjects
To track the effectiveness of disease progression models, ADNI-1 subjects with all corresponding MRI and cognitive scales are evaluated, as shown in Table I. Subjects are between 55 and 90 years of age, and males account for 52.18%. In terms of dementia status, AD, MCI and NL subjects account for 25%, 50% and 25% of the data, respectively.
To explore the impact of the correlation between ROIs on AD progression, MRI data from two follow-up points in the longitudinal cohort were extracted to facilitate observation of this spatio-temporal variation. At the same time, the cognitive scales (such as MMSE or ADAS-Cog) of the longitudinal cohort are employed to estimate the patients' cognitive functional decline during AD progression. During the screening period, every subject must satisfy the data integrity requirement to ensure reliable results. Namely, the cohort subjects must have completed MRI scans at two follow-up points and multiple cognitive scoring assessments.

B. Data pre-processing
To guarantee high image quality and reliable data handling, the MR images used in this paper were taken from standardized datasets, which provide intensity-normalized and gradient-unwarped T1 image volumes. Subsequently, FreeSurfer [30] was used for feature extraction from the MR images; it performs cortical reconstruction and volumetric segmentation for processing and analysing brain MR images.
For each MRI, cortical regions and subcortical regions are generated after this pre-processing suite.For each cortical region, the cortical thickness average, standard deviation of

C. Feature selection
To discover the impact of the similarity between ROIs on AD progression, we couple all the regions in pairs, which combines the 326 ROI statistical features into 52975 pairwise features. For a given sample size, the higher the dimensionality, the sparser the distribution of the samples in space.
To solve this issue, we utilised TGL combined with a stability selection algorithm to obtain features that play an important role for all tasks.Finally, 300 significant features are selected for the training of MTL algorithm.
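The pairwise feature count follows from choosing 2 out of 326 features, that is 326 × 325 / 2 = 52975. The short sketch below checks this arithmetic and shows one way to enumerate the unordered feature pairs; the function name is hypothetical.

```python
from itertools import combinations
from math import comb

N_ROI_FEATURES = 326

# Number of unordered pairs of distinct ROI features: 326 * 325 / 2.
n_pairs = comb(N_ROI_FEATURES, 2)

def pair_index(n_features):
    """List every unordered pair (j, j + r) of feature indices, with j < j + r."""
    return list(combinations(range(n_features), 2))

pairs = pair_index(N_ROI_FEATURES)
```

Each entry of `pairs` identifies one similarity feature, so a subject is represented by 52975 values before feature selection reduces them to the 300 retained features.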

V. EXPERIMENTAL RESULTS AND ANALYSIS
A. Spatio-Temporal similarity measure
We first implement three criteria for estimating the relevance between ROIs: Euclidean Distance (ED), Mahalanobis Distance (MD) and Cosine Similarity (CS). Each criterion is then computed between vectors composed of the magnitude and the signed (non-absolute) value of the velocity of each biomarker, and the selected feature subset is used to estimate the subjects' cognitive scales. Table II compares the different criteria for tracking AD progression. Note that Table II shows only the mean and variance of 30 independent experiments, for the temporal span from baseline to M06. Besides, we also reproduced the models of [16], [26], [27], with only MRI data as features.
Overall, the cosine similarity representation of our proposed ROI synchronization approach outperforms the original ROI features. We have the following observations: 1) The collaborative expression of ROIs is better than independent ROIs to a certain extent. 2) The cosine similarity representation performs better than the Euclidean Distance and Mahalanobis Distance representations. 3) The proposed cosine similarity representation yields a significant improvement for the early time point. This may be because the data span the period from baseline to M06.

B. Modelling AD progression via MTL
Inspired by the above experiments, we further explored the influence of the temporal span on tracking AD progression under the collaborative expression of ROIs. In this section, only cosine similarity was utilized to estimate the progression of cognitive function.
Four temporal spans are evaluated, namely baseline to M06, baseline to M12, baseline to M24 and baseline to M36. We follow the same experimental procedure as above. Table III shows the normalized results for the different time spans and the root mean square error of each sub-task.
We can observe from the table that, as the time span increases, the overall generalization performance of the model generally improves. As the temporal span grows, we also make the following observations: 1) The performance of the subtasks gradually improves. 2) The tasks at later time points are greatly enhanced. This may be because later MRI scans support a richer collaborative expression of ROIs; these results further validate the efficacy of the proposed method for the temporal-spatial collaborative expression of ROIs. 3) For the BL to M24 span, the overall task performance is the best. 4) For the BL to M36 span, although the performance of the global model decreases, the performance of each subtask is greatly improved.

C. Stable synergistic deterioration pattern
Firstly, we use the data from the best-performing setting of the temporal span experiment, namely the span from baseline to M24, which contains 94 dimensions corresponding to crucial pairs of ROIs. Secondly, the experimental settings are as follows: 1) Only half of the overall sample is randomly selected in each sampling subset. 2) A total of 210 combinations of model hyperparameters are used. 3) during every The synergistic effect of right posterior cingulate cortex on left pars triangularis, left parahippocampal. The fact that our findings are in line with those of previous studies [31]-[33] supports the validity of our proposed model.
In the longitudinal stability selection, we observed the 29 most stable features for the MMSE score, shown in Fig. 2, where the horizontal axis represents the markers between each salient ROI pair; details are available in the Appendix. Correlation features between Cortical Volume and Cortical Volume form the largest group (6 features), which shows that the similarity of the change trends of Cortical Volume biomarkers has an important effect in AD prediction. Previous studies have also observed a significant improvement in classification performance using abnormal cortical patterns, and that the coordinated patterns of cortical morphology are widely altered in AD patients [11]. In addition, the number of correlation features based on the similarity of changes between Surface Area and Surface Area is also relatively large (5 features).

VI. DISCUSSION
Although we modelled the spatio-temporal correlation of ROIs between time points, we focused only on the cognitive scales of the latter time point. Aligning the cognitive scales of both time points would provide valuable context in the MTL setting, as the cognitive scales may also change over time. Additionally, we only compare methods based on temporal smoothness and do not consider methods based on other assumptions, such as spatial ones. More comparisons will be carried out in future work.

VII. CONCLUSION
Identifying the synergistic deteriorating relationships of biomarkers can help clinicians assess AD progression for early intervention. We propose a new method to model and predict AD progression by extracting morphological information from MRI. This paper has three main contributions. Firstly, we employ cosine similarity to represent the temporal-spatial relationships between brain biomarkers. We then regard disease progression prediction as an MTL problem and combine it with cosine similarity to predict the progression of AD. Finally, stability selection is utilised to analyze the temporal and spatial dynamic patterns between biomarkers. We show that correlation information can better describe the brain structural changes in NL, MCI and AD subjects. Experiments demonstrate the impact of AD progression on synergistically deteriorating brain biomarkers.

Fig. 1 (steps 1 and 2 of the six-step experimental process). 1) Original feature extraction: statistical features based on ROIs in the cerebral cortex/sub-cortex are extracted from MRI images. 2) Temporal feature mapping: the change over time of potential biomarkers (ROIs) is characterized.

Fig. 2. Vectors of stable temporal collaborative patterns, with a total of 94 and 87 stable deteriorating pairs, respectively. Specifically, (a) and (b) belong to the MMSE-targeted model of AD progression; (c) and (d) belong to the ADAS-Cog-targeted model of AD progression.