Gaussian Process Upper Confidence Bounds in Distributed Point Target Tracking Over Wireless Sensor Networks

Uncertainty quantification plays a key role in the development of autonomous systems, decision-making, and tracking over wireless sensor networks (WSNs). However, there is a need of providing uncertainty confidence bounds, especially for distributed machine learning-based tracking, dealing with different volumes of data collected by sensors. This paper aims to fill in this gap and proposes a distributed Gaussian process (DGP) approach for point target tracking and derives upper confidence bounds (UCBs) of the state estimates. A unique contribution of this paper includes the derived theoretical guarantees on the proposed approach and its maximum accuracy for tracking with and without clutter measurements. Particularly, the developed approaches with uncertainty bounds are generic and can provide trustworthy solutions with an increased level of reliability. A novel hybrid Bayesian filtering method is proposed to enhance the DGP approach by adopting a Poisson measurement likelihood model. The proposed approaches are validated over a WSN case study, where sensors have limited sensing ranges. Numerical results demonstrate the tracking accuracy and robustness of the proposed approaches. The derived UCBs constitute a tool for trustworthiness evaluation of DGP approaches. The simulation results reveal that the proposed UCBs successfully encompass the true target states with 88% and 42% higher probability in X and Y coordinates, respectively, when compared to the confidence interval-based method.


I. INTRODUCTION Target tracking in wireless sensor networks (WSNs) is a fundamental task for various applications including sea
We are grateful to the support for the UK Dstl and US MoD SIGNeTS project.Research was sponsored by the US Army Research Laboratory and the UK MOD University Defence Research Collaboration (UDRC) in Signal Processing, and was accomplished under Cooperative Agreement Number W911NF-20-2-0225.The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, the MOD, the U.S. Government or the U.K. Government.The U.S. Government and U.K. Government are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.We acknowledge the support from the EPSRC through EP/T013265/1 project NSF-EPSRC: "ShiRAS.Towards Safe and Reliable Autonomy in Sensor Driven Systems" and the support for ShiRAS by the USA National Science Foundation under Grant NSF ECCS 1903466.For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
surveillance, autonomous vehicles, and traffic monitoring.The objective is to collect sensor measurements from one or multiple targets to estimate their current and future states [1].However, the measurements may not only originate from the targets but also from the environmental interference (e.g. from ground and rain), which is referred to as the clutter [2].To achieve reliable performance, a tracker should be able to distinguish between target measurements and clutter measurements, and also decide which measurement is associated with which target, which is called the data association problem [3].
However, these approaches rely on well-defined motion models, in particular, the target dynamics model and the sensor measurement model, which can be inaccurate when the target undergoes non-stationary evolution or mixed maneuvering behaviours.The multiple-model method [13] can capture complex behaviours by running a bank of elemental filters, each based on a unique model in the set and generating the overall estimates based on the results of these elemental filters.On the downside, this framework suffers from high computational complexity, and therefore, is not efficient when a large number of models are involved.
Tracking with multiple models can be achieved by Gaussian process (GP)-based model-free methods, which are powerful non-parametric machine learning inference methods [14].Instead of tracking via motion models, GP-based methods can directly learn unknown functions, which can be considered as a mapping from some system context to the target state, from noisy measurements.Particularly, GP regression can provide uncertainty quantification on the predictions.Ignoring this uncertainty can have disastrous consequences, especially when the output of such models is then fed into higherlevel decision-making procedures.Recently, GP regression has been applied to solve both point target tracking [15]- [21] and extended target tracking [22]- [24] problems.
Most of the existing GP-based tracking methods assume arXiv:2409.07652v1[stat.ML] 11 Sep 2024 that the sensor measurements are collected in a centralized manner and both training and state estimation are made upon the aggregated measurements [25], [26].However, due to the nature of distributed sensing systems, collecting all the measurements for GP training can bring a high communication cost, which is energy inefficient.In addition, considering a WSN with sheer amounts of sensors and measurements, the centralized tracking framework has inevitably reached an inherent bottleneck of scalability, which stems from the cubic computational complexity (O(N 3 ), where N in the number of measurements) of the standard GP regression due to the inversion and determinant calculations of the GP covariance matrix.Therefore, transferring the centralized-based approaches into distributed ones has become a popular choice, and multiple approaches including distributed consensus approaches [27] and message passing methods [28] have been proposed for target tracking.However, to the best of the authors' knowledge, there are rarely any studies about leveraging GP regression to solve the tracking and data association in a distributed way and provide theoretical performance analysis.Hence, in this paper, the distributed Gaussian process (DGP) framework is adopted to design a DGP-based tracking (DGPT) approach, which is able to learn the target motion online and solve the data association jointly in a data-driven way.GP has also been studied in resource allocation problems under the edge computing framework to account for efficient decision-making and network state prediction [29], [30].The presented DGPT approach can also rely on edge computing thanks to their distributed learning nature and since learning happens near the areas of data collection.The DGP framework has been implemented and tested on edge devices [31], and the whole edge learning process can be implemented in other ways as summarized in the survey [32].
The main contributions of this paper are summarized below: • A DGPT approach is proposed for point target tracking under the assumption that the number of measurements follows a Poisson distribution.The proposed approach can leverage both temporal and spatial features to learn the hyperparameters of the DGP online, through a sliding window of measurements.Bayesian filtering approaches are tested with challenging target trajectory scenarios -from uniform motions to highly maneuvering ones.Different levels of measurement noise and clutter rates are involved to evaluate the accuracy and robustness of the proposed approaches.The remaining part of the paper is organized as follows.Section II reviews the related works.Section III introduces the fundamentals of GP regression and multiple DGP methods.Section IV describes the proposed DGPT approach followed by the theoretical analysis of the tracking error bound in Section V. Section VI describes the DGP-assisted hybrid Bayesian filtering approach.The simulation setup and results are presented in Section VII, and the conclusions are drawn in Section VIII.Appendices A, B, and C contain details about the theoretical derivations.

II. RELATED WORK
A wealth of approaches for improving the scalability of the standard GP method have been studied including centralized, distributed, and hybrid methods [33].In [34], [35], sparse approximations of the original N × N covariance matrix of GP are obtained to summarize the dependence of the whole training data using M inducing points, which greatly reduces the computational complexity to O(N M 2 ).However, this type of methods requires all the data to make predictions and thus still works in a centralized manner.
The DGP which originates from the idea of divide-andconquer [36], focuses on training local GPs (which are also referred to as local experts) based on subsets of the whole training data or based on the partitioning of the big state vector into state vectors with smaller dimensions [37].After training the local experts, a family of aggregation methods [38]- [41] relying on the product of experts model, can be applied to aggregate the local knowledge together by multiplying local predictions, and then the overall prediction can be calculated.In addition, hybrid GP variants are studied to utilize a master GP expert to communicate with the local experts leading to a consistent posterior predictive distribution [42].DGP has been applied to problems such as the received signal strength-based location fingerprinting map construction [43].
GP regression is also applied to solve the data association problem.In [44], GP priors are placed on the different generative processes and the associations are modelled via a latent association matrix and inference is carried out using an expectation-maximization algorithm.[45] extends GPbased data association into the non-stationary process where a different number of generative processes can be activated in different locations in the input space.GP has been also combined with state-space models to solve the tracking problem.In [46], one-dimensional temporal GP regression models are reformulated as linear-Gaussian state-space models, which can be solved exactly with classical Kalman filter.The state-space model representation is also used in spatial-temporal GPs [47] and non-Gaussian likelihood [48] to derive computationally efficient infinite-dimensional Kalman filtering and smoothing methods.There are also works studying using GP to represent the state-space model [49].GP is used to learn the whole or part of the state-space model and the learned functions can be integrated into a particle filter or extended Kalman filter [50], [51], which results in hybrid tracking approach.Moreover, recently, different GP approaches are developed by assuming temporal and spatial correlation in the target trajectory and shape, and the state-space model is directly learned from the measurements [15], [16], [22].In Table I, we have summarized and compared more existing relevant works in GP-assisted target tracking and localization.From Table I, we can find that although GP methods have been used for solving tracking and localization problems, there are rarely works about using GP in a distributed system and integrating data association solutions into GP methods.
The next section presents background knowledge for the GP methodology before presenting the distributed learning and tracking approaches.

A. Standard Gaussian Process Method
GP is a stochastic process defining a distribution over functions that fit a set of points.Assume there exist correlations among target motions, at input x ∈ R d , the non-linear mapping f (•) between the current input feature and the target state f (x) can be modelled by a GP as where m(x) and k(x, x ′ ) denote the mean and covariance functions, respectively.The training and the test input data are denoted by x and x ′ , respectively.Popular examples of covariance functions include the squared exponential kernel.Namely, where l j represents the length-scale of the j th feature of the input data.The length-scale describes how smooth a function is which can be thought of as roughly the distance you have to move in input space before the function value can change significantly [14].The output variance σ 2 acts as a scaling factor.It determines the variation of function values from their mean.
The collected measurements can be treated as the noisy outputs of the unknown functions (which are target states in the tracking problem), therefore, a point target tracking problem with noisy observations can be written as where z represents the measurement and ϵ represents the i.i.d.zero-mean Gaussian measurement noise with variance σ 2 z .Given a training data set of input-output pairs as the covariance matrix of the training input, and k * = k(X, x * ) as the covariance between the training input X and test input x * .To make a prediction of the target state at a new input x * , the predictive mean µ * and the predictive variance σ 2 * can be written as where Σ = K + σ 2 z I, with I being the identity matrix.

B. Distributed Gaussian Process Method
The computational complexity and storage cost are major challenges for the large-scale learning problems.The computations require O(N 3 ) time with a standard GP implementation, where N represents the number of the training instances.Besides, the standard GP also requires O(N 2 + N d) of memory, where d is the dimensionality of the data.Both facts limit the scalability of the standard GP regression.Moreover, according to ( 6) and (7), the standard GP can only make predictions based on all the available data, which is a centralized scheme and requires data to be shared among local GPs.
In this section, inspired by the idea of divide-and-conquer, DGP methods are introduced to reduce not only the compu-tational cost but also the memory cost of the standard GP, by first training local GPs based on subsets of the whole training data set and then aggregating the knowledge of local GPS to achieve more accurate high-level predictions.The overall computational complexity and the memory cost can be reduced to O(N n 2 ) and O(M n 2 + N d) (n ≪ N ), respectively, where M represents the number of local GPs and n represents the size of data used for training a local GP.Particularly, both the computational complexity and storage cost can be further reduced through parallel/distributed computing [55].
The first type of DGP methods is the product of experts (PoEs) [37] approach.The idea is to multiply the local predictive probability distributions for overall predictions.Given the data D (i) collected by sensor i, the PoE predicts a function value f (x * ) at a corresponding test input x * according to where M is the number of GP experts and represents the number of active sensors which have measurements.Since the product of these Gaussian predictions is proportional to a Gaussian distribution, the aggregated predictive mean and variance can be calculated with closed form as where µ i (x * ) and σ 2 i (x * ) represent the predictive mean and variance of GP expert i, respectively, which can be calculated based on (6) and (7).
The PoE model provides a straightforward way to aggregate local predictions and sidesteps the weight assignment issue in other DGP models such as the mixture of expert model [56].However, this model becomes overconfident when making predictions, especially in regions without any training data.
The generalized product of experts (GPoEs) model [38] improves PoE by adding weights that represent the contributions of different experts.For instance, the weight can be calculated as the difference in the differential entropy between the prior distribution p(f (x * )) and the posterior predictive distribution p(f (x * ) | x * , D), which can be written as where σ 2 * * represents the variance of the prior distribution p(f (x * )) and σ 2 i (x * ) denote the predictive variance of GP expert i, which can be calculated based on (7).
Given the data D (i) collected by sensor i, the GPoE predicts a function value f (x * ) at a test input x * .The predictive distribution and the closed forms of the aggregated predictive mean and variance can be written as Alternatively, the Bayesian committee machine (BCM) [39] proposes to aggregate the experts' predictions from another view by imposing a conditional independence assumption that for the experts.Therefore, the BCM's prediction distribution can be written as where the denominator reaches an (M −1)-fold division by the prior, which plays the role of a correction term that helps to recover the GP prior when leaving regions of training data.The closed form of the aggregated predictive mean and variance of the BCM can be calculated as In [40], a robust Bayesian committee machine (RBCM) is proposed which combines both the features of the GPoE and BCM models.The predictive distribution and aggregated predictive mean and variance of the RBCM can be written as All the models discussed in this section can be applied to infer the target states in a distributed way in the target tracking problem.Particularly, the closed-form of posterior predictions can be obtained and the predictions are fully tractable.

C. Hyperparameter Learning
The hyperparameters of GP need to be learned from the data.As a standard GP, maximum likelihood estimation (MLE) is applied to learn the hyperparameters by maximising the log marginal likelihood which can be written as where θ θ θ = σ 2 , σ 2 z , l 1 , • • • l d represents the set of hyperparameters.l j represents the length-scale of the j th feature of the input data, σ 2 is the output variance of the kernel function, and σ 2 z is the variance of the measurement noise.
For the DGP, assuming the local GPs are independent with each other, the log marginal likelihood can be factorized as log p(z|X, θ θ θ) where X (i) and z (i) represent the training input and output of local GP i, respectively.The reduction in computational complexity of DGP can also be justified by looking into (22), where both the computations of determinant and inversion are only based on a much smaller matrix Σ (i) .In addition, as compared to (21), the factorized marginal likelihood can potentially be maximized in a decentralized manner like federated learning [57] since it is a summation over local marginal likelihood functions.
The learned hyperparameters are shared by all the local experts for automatic regularization to avoid overfitting.

IV. DGP-BASED POINT TARGET TRACKING
The previous section demonstrates that the DGP is a promising method for large-scale learning systems.In this section, we describe the proposed DGPT approach which solves the tracking problem in WSNs using the distributed machine learning method, in a data-driven way.Several improvement schemes are designed to help integrate DGP for efficient distributed tracking and deal with clutter measurements to achieve robust performance.
In a WSN, each sensor can collect its own measurements and the GP regression can be applied locally to process local data.Some sensors can be edge devices and could provide edge learning [32].The proposed DGPT is linked with federated learning [58].
At each time, after training, local GP-based target state estimations can be aggregated to reach a high-level prediction following different aggregation methods discussed in Section III-B.Only the the predictive means and variances of local GPs are propagated to calculate the overall prediction, without transmitting all the data to a central filter (controller).Particularly, having this aggregation process does not mean an extra central node is necessary.The aggregation can be implemented on any capable sensor or edge node, thus the proposed approach is fully distributed.

A. Temporal and Spatial-Temporal GP
To make a state estimation of a target, it is reasonable to assume that there is a temporal correlation in the motions of the target, and this correlation within time variables in distant past is weaker than in more recent ones.Therefore, the target state can be formulated as a function of the time variable which is used as the input data for training DGP and making state estimations.In this case, we have x = t, where t represents the  1) and ( 5) can be reformulated as where t and t ′ are the training and test data, respectively.Based on the time variables and measurements (training set), set the test input as x * = t, TGP can predict the target state at time t using ( 6) and (7).Moreover, TGP can also predict a next-step state predictions by setting x * = t + 1.
Inspired by [59] that uses GP regression to learn the target state transition function, the spatial correlation can also be involved for state prediction by including the target state of the previous time into the input data, namely x = (r t−1 , t), where r t−1 represents the target state in t − 1.The spatial-temporal GP (STGP)-based tracking problem can be formulated as

B. Sliding Window-based Tracking
Although DGP is designed for complexity reduction, training GPs distributively can still be computationally intense when local sensors collect extensive measurements originating from both the targets and the clutter.
Noticing the fact that the target motion can be a mixed maneuvering behaviour with time-varying parameters, based on the temporal correlation assumptions in the previous section, measurements with weaker motion correlations in distant past time cannot contribute to training and prediction as much as the recent ones.Therefore to further reduce the computational complexity, the distant measurements are abandoned in DGPT and only the recent measurements are utilized for training, which can be treated as using a sliding window to select valid measurements.The sliding window is a set of time variables which represents a number of time steps.Based on the sliding window, a sensor without valid measurements (the times of collecting the measurements are not in the sliding window) is excluded from state estimation at the current time.
When both the number of experts and target measurements is reduced, the computational complexity of GP training reduces and also the tracking accuracy improves.Figure 1 shows the framework of the proposed sliding window-based DGPT approach.

C. DGP-based Data Association
The previous section discuss using sliding window to reduce the number of measurements for training and state estimation.In this section, we focus on dealing with a large number of clutter measurements to improve the tracking accuracy and also reduce the computational complexity.
To cope with the clutter and for learning the GP hyperparameters, a method is designed to assign weights to different measurements based on the marginal likelihood (21).A weighted summation over measurements collected by one sensor is then calculated as the training data for DGP.Define j ∈ {1, 2, • • • , J t } as the index of measurements received by a sensor at time t and define w j,t as the weight of the j th measurement, the gating method can be written as where zt represents the weighted summation of the measurements, which will be used for hyperparameter learning.
The rationale for calculating this weighted summation is to determine a training instance based on the likelihoods of all the measurements (both target and clutter measurements) and the resulting summation is expected to be close to the target measurements.Particularly, since only one summation is calculated from each local expert, the number of measurements for DGP training is greatly reduced and therefore the computational cost is reduced as well.
In the distributed tracking scenario, at the beginning of each time, each sensor performs the proposed gating method independently based on the learned hyperparameters from the previous time and the local measurements which are inside the sliding window.The resulting data based on ( 27) and ( 28) is used for hyperparameter learning and state estimation in the current time.In addition, for a new sensor that does not have any historical measurements, average weights are calculated according to the weights assigned by all the other active sensors.Moreover, when the clutter rate is much higher, the clustering scheme can be used for preprocessing and the gating method will be applied to the cluster centers rather than to all the measurements.

D. Hyperparameter Online Learning in DGPT
Since the target motion can be time-varying and the sliding window is designed to keep valid measurements for training and tracking, the hyperparameters of DGP should be learned online to capture the non-stationary features.Hence, in the proposed DGPT approach, MLE which is based on the factorized marginal likelihood ( 22) is solved every time to update the hyperparameters.This process brings extra computational costs due to the non-convexity of (22) and requires an iterative solving process.To accelerate the hyperparameter learning process, optimized hyperparameters at time t is designed to be set as the initial value of hyperparameters of MLE at t + 1, which can significantly reduce the iterations needed for MLE.
Benefiting from the proposed sliding window design, the DGP-based data association, and the online learning properties, the DGPT can provide accurate target state estimations with collected measurements.

E. Complexity Analysis
For the proposed DGPT approach, the main computational complexity stems from the covariance matrix inversion, which scales cubically in terms of the number of data at each sensor.
Based on [23], the complexity of matrix inversion for DGP is ), where J Mt,t represents the number of data stored in the M t th active sensor at time t and M t is the number of active sensors at t. Based on ( 27) and ( 28), the measurements collected by a sensor at time t are used to calculate a weighted sum for DGP training.This means only a single measurement is saved per sensor per time.Given the length of the sliding window of time as C, the computational complexity due to the matrix inversion can be upper bounded as O(C 3 M t ).
In the DGPT approach, the hyperparameters are learned online by solving the maximum likelihood estimation problem as formulated in (22).Maximizing this function requires computing the covariance matrix inversion iteratively.Therefore, the computational complexity of the DGP algorithm update scales as O(C 3 M t ) per iteration at time t.The prediction step also scales as O(C 3 M t ).It is important to notice that the proposed DGP-based data association helps the DGPT achieve a low computational complexity in both DGP learning and state prediction since the complexity does not scale with the number of received measurements and the length of the sliding window does not increase over time.
V. THEORETICAL PERFORMANCE ANALYSIS Lemma 1. (Lemma 5.1 of [60]) Given a trained local GP based on training data D t = {X t , z t } till t, for any input x * ∈ X t , the probability Pr(.) that the predictive mean µ(x * ) deviates from the true function value by more than a certain amount can be upper bounded as where γ is a positive constant.
Lemma 1 proposes a UCB of the probability that the deviation between the true function and the estimated mean of the function is larger then a scaled version of the estimated variance function.Based on this lemma, error bounds of distributed GPs can be derived.
The following Lemma 2 represents a generalization of Lemma 1, assume each time one estimation is made, for the case of an infinite number of time t, e.g.t → ∞.
Lemma 2. Define δ ∈ (0, 1), set γ t = 2 log(π t /δ), for π t = π 2 t 2 /6, based on Lemma 1, apply the union bound over t, we have Lemma 2 further generalises the cumulative deviation of the predictive mean from the true function value based on the GP predictions from all n test inputs.Theorem 1. (One-step error bound of GPoE) Consider a distributed GP system with M local GPs, with probability at least 1 − M i=1 e −γi/2 , the deviation between the true function value at x * and the aggregated estimation of the mean value made by the GPoE method can be upper bounded as Proof: Define A i as the event in which the prediction of the target state from local expert i and the true target state differs larger than a quantity, which can be written as Define the union of events Applying the union bounds over M events, the probability of A can be upper bounded as where µ i (x * ) and σ i (x * ) represent the predictive mean and standard deviation (STD) of local GP i at x * , respectively.Define Ā as the complement of A, changing the direction of the inequality gives that According to ( 13) and ( 14), the deviation between the true function value and the aggregated predictive mean by GPoE can be written as .
Theorem 1 proposes a theoretical UCB for the tracking performance.Define the highest predictive variances of a local expert as σ 2 H , the bound can be further represented as This bound demonstrates that, given all other local GPs fixed, when one of the local GP makes a highly uncertain prediction which is reflected as a larger predictive variance, the upper bound of the deviation will increase, which means the overall prediction is exacerbated by this poor GP expert.
Next, we derive a Theorem about the UCB of the RBCM.
Theorem 2. (One-step error bound RBCM) following Theorem 1, for a distributed GP system with M local GPs, with probability at least 1 − M i=1 e −γi/2 , the deviation between the true function value at x * and the aggregated estimation of the mean value made by the RBCM method can be upper bounded as Proof: The detailed proof is given in Appendix A.
The UCB for the GPoE algorithm in general is not the same as the UCB (31) for the RBCM (37).However, under certain conditions the two upper bounds coincide.
The next Theorem 3 generalises the result further, to the GPoE's cumulative error bound for a number of estimations over a wide time interval T .The cumulative error bound of other DGPT approaches can be derived in a similar way.Theorem 3. (Cumulative error bound) Consider a distributed GP system with M local GPs.suppose one state estimation is made every time, with probability at least 1 − M i=1 δ i , the cumulative deviation between the true function value at each test input and the aggregated estimation of the mean value can be upper bounded as Proof: The detailed proof is given in Appendix B.

VI. DGP-ASSISTED BAYESIAN FILTERING WITH MEASUREMENT ORIGIN UNCERTAINTY
Inspired by ideas from [61]- [63], in this section, the proposed DGPT is enhanced by an elegant Bayesian filtering method which can solve the data association problem without the need to construct explicit measurement-target assignment hypotheses or gates.The resultant hybrid Bayesian filteringbased tracking approach provides a novel way to merge distributed machine learning and model-based Bayesian inference.The prediction made from the DGPT is used as the prior distribution of the target state and a Poisson measurement likelihood model is involved for posterior state inference.

A. Measurement Likelihood Function
According to [61], the state vector of L + 1 entities that needs to be estimated at time t is defined as where xt,c represents the state of the clutter process and L represents the number of targets which is assumed to be known.In a single point tracking problem, we have L = 1 and xt,1 = xt .The derivation of the measurement likelihood relies on the following three assumptions: A1: The numbers of target originated measurements in a time scan are assumed to be Poisson distributed, with a rate λ T .A2: The numbers of clutter measurements in a time scan are assumed to be Poisson distributed, with a clutter rate λ c A3: The clutter measurements are assumed to be uniformly distributed in the sensing space of each sensor.In many cases, in a time step, a high-resolution sensor is able to generate more the one measurement from the target and also from the environmental interference.Therefore, the assumptions that the clutter rate λ c and target rate λ T of the respected measurements follow Poisson distributions are well justified.These assumptions are also used in [64], [65].Then this is reflected in the first and second assumptions (A1 and A2).Assumption A3 reflects the fact that the clutter is uniformly distributed which is one of the most common cases in practice.The uniform distribution of the clutter in the areas of interest also reflects the full lack of prior knowledge about the possible locations of the environmental interference.
For a WSN, according to above assumptions and by assuming that the target originated measurement likelihood is a product of Gaussian likelihoods, the collected measurements result from superposition of multiple target measurements with Gaussian noise and uniform clutter measurements [61].Therefore, similarly to [62] and [63], the joint likelihood expression of the tracking problem can be written as where n t represents the number of measurements collected from all the sensors in time t.z t denotes the measurements collected in t.
To simplify the notation we rewrite xt as xt to represent the case when the target state is a scalar.Define p c (z t,j ) as the clutter measurement likelihood which is a uniform distribution, and define p(z t,j |x t ) as the target measurement likelihood which is Gaussian, based on the assumptions.In addition, define ϕ as a partition of the measurement set, the joint likelihood (40) can be written as where n T t (ϕ) represents the number of target measurements which is compatible with partition ϕ. ϕ(j) = 0 corresponds to the clutter measurement and ϕ(j) ̸ = 0 corresponds to the target measurement.λ T is the expected number of measurements originating from the target and λ c denotes the expected number of measurements corresponding to clutter in the sensing area.According to Appendix A of [63], the likelihood can be further represented as Based on Assumption A3, the clutter measurement likelihood can be written as p c (z t,j ) = 1  Asen , where A sen represents the sensing area and is considered to be the same for all the sensors.
The measurement likelihood is expressed in two different ways: with the Poisson likelihood model (42) or its equivalent form (43) and with a Gaussian process as in (45).
Assuming the measurement noise follows a zero-mean Gaussian distribution with variance σ 2 z , we have p(z t,j |x t ) ∼ N (x t , σ 2 z ), the measurement likelihood ( 42) can be represented as where μt,j = λc Asen + λ T xt and σ2 t,j = (λ T σ z ) 2 .The derivation from ( 44) to (45) holds due to the fact that the product of Gaussian probability density functions is proportional to a Gaussian probability density function.
Based on ( 9) and ( 10) of product of Gaussian, the mean and the variance of ( 45) can be calculated as The expressions ( 46) and ( 47) are functions of the clutter parameter (e.g. of the clutter rate λ c ) and of the rate λ T of the target originated measurement.The likelihood function could be expressed also as a function of the clutter density, as this is shown in [62], [63].The clutter rate λ c and the clutter density are connected with the area of the sensor and are parameters that can be estimated by knowing the sensor area A sen , as shown in [66] or by theoretical methods.

B. DGP-based Posterior State Inference
As discussed in Section IV-A, notice that at time t, the proposed DGPT approach can also provide a next-step state estimation distribution.This means the DGP model can be used as a high-quality prior distribution for the state estimation in the next time.Have this prior knowledge and by combining the likelihood, a novel state estimation method can be designed.
Define Z t−C:t−1 as the set of measurements within the sliding window from time t − C to t − 1.The length of the sliding window is C time steps.Given the current time t, the prior distribution of the target state can be written as where µ t−1 and σ 2 t−1 represent the prior mean and variance that can be calculated using the GP regression equations ( 4) and ( 5), respectively.According to Bayes rule, the posterior can be written as Based on the prior distribution ( 18), (19), and (20), which is learned by the DGP, considering the measurement likelihood (42), the posterior state distribution (49) can be derived.The posterior mean and variance can be written as Appendix C contains the detailed derivation of these results.A thorough performance validation and evaluation of the developed DGPT and hybrid Bayesian filtering approaches with the derived theoretical UCB and over several test scenarios are presented in the next section.

VII. PERFORMANCE EVALUATION AND VALIDATION A. Training Time of DGP
To evaluate the running time for training the DGP model, in this subsection, define the input data as two-dimensional, namely x = [x 1 , x 2 ] ⊤ , we test the running time of DGP training based on data from the following function where the measurements are generated by adding zero-mean Gaussian noise with the STD as σ z = 0.5 to this function.
The time required to compute the log marginal likelihood and its gradient with respect to the kernel hyperparameters is presented in Figure 2.This computation process acts as the fundamental step for solving the factorized MLE (22) for hyperparameter learning and DGP training.As a comparison,  Particularly, except in the cases of small training sets (500-1000), the running time of DGP increases much slower than the standard GP.Hence, DGP can handle much more data and is more suitable for real-time distributed learning and tracking in WSNs as compared to the standard GP.Moreover, the computation of the factorized likelihood can be implemented in parallel, thus each local computational unit will only need to compute one or a few terms of (22), and the overall running time can be further reduced.
In addition, in Figure 2, the impact of the number of local GPs on the running time for computing the log marginal likelihood and its gradient is also presented based on the data set with a fixed size of 10000.We can find that having more local GPs can help reduce the running time since according to (21), if the smaller size of covariance matrices are generated, solving the matrix determinant and inversion can be much faster on these smaller matrices.

B. Simulation Setup
The tracking performance of the proposed approaches are tested in a WSN with 250 sensors uniformly implemented in a 1000 meters × 1000 meters area.The sensing range is 50 meters and the sampling period is one second, both of which are identical for every sensor.The proposed algorithm can also be implemented in a heterogeneous network easily and can make state estimations considering the heterogeneity of sensors, which for example, can be reflected in the posterior predictive variance of the local sensor.
In this paper, the target states denotes the target locations, hence two GPs are needed at one sensor for the X-coordinate and Y -coordinate, respectively.For the GP, a zero-mean function is used which means no extra knowledge is utilized for tracking.Besides, the covariance function is selected to be the squared exponential kernel which is demonstrated to perform well in many maneuvering models [17].Following Assumptions A1-A3, the clutter measurements are modelled as a Poisson point process and are uniformly located in the sensing region of each active sensor.Moreover, the number of target measurements is modelled as a zero-truncated Poisson distribution, which ensures at least one measurement can be received by a sensor within a time step, namely, λ T = 1.The target measurement noise is modelled by the zero-mean Gaussian distribution.
In addition, we develop a range of scenarios with varying parameters of the clutter process, measurement noise, and target trajectories.To test the robustness of the proposed DGPT approach and also to account for the wireless effect of the channel, three noise levels are involved in the simulation with measurement uncertainty (STD) σ z = 1, 2, 4 meters.In addition, two clutter settings are simulated to test the performance of the proposed approach, the low clutter case sets the clutter rate as 1 and the high clutter case sets the rate to be 5.All the results are averaged over 100 Monte Carlo (MC) runs and the lengths of sliding windows in different trajectories are carefully tuned to be different for optimal performance.

C. Benchmarks
Since this paper focuses on the model-free approaches, the standard GP-based centralized tracking approach is simulated as the benchmark.This scheme relies on solving the MLE (21) to learn the hyperparameters, and the learning process requires the measurements to be transmitted in the WSN.To make fair comparisons, the standard GP-based centralized tracking approach is trained with the same sensor measurements in the sliding window, which means this approach uses the same amount of data for model training and hyperparameter learning as well as the DGPT approach.
To study the impact of different aggregation methods on the DGPT, both RBCM and GPoE are simulated.In addition, DGPTs based on both temporal and spatial-temporal input data are evaluated.For the temporal case, a set of time variables which is in the sliding window is used as the input data.For the spatial-temporal case, both the current time and the target state of the previous time are used as input.Notice that in the online tracking problem, the real target state is not available to be used as the training input.Therefore, the predictive state acquired by the DGPT in the previous time is used instead.

D. Target Trajectories
To evaluate the proposed algorithm and the benchmarks, four challenging scenarios are built following different models.The trajectories and the sensors are depicted in Fig. 3. S1: Similar to [67], the trajectory is generated based on the NCV model in the straight line, and the abrupt velocity change at each pre-defined turning point.S2: The target trajectory is generated by the gradual coordinated turns model (20

E. Normalized Root Mean Square Errors
In this section, the average normalized root mean square errors (NRMSEs) of the proposed DGPT and the standard GP-based tracker are evaluated.The NRMSE is defined as where f (x i ) max and f (x i ) min represent the maximum and minimum value of the target states, respectively.The tracking errors of the proposed DGPT approach with temporal input feature, which are collected from different trajectories are presented in Tables II-V.From the results, we can find that when considering the existence of clutter measurements, the clutter rate plays an important role in determining the prediction error.By contrast, the proposed DGPT approach achieves robust performance while changing the noise level of the target measurements.Moreover, using RBCM as the prediction aggregation method can achieve lower NRMSEs than GPoE in most scenarios, which justifies that adding the common prior into the aggregation process can improve the tracking accuracy.Particularly, the proposed DGPT approach performs competitively well and even outperforms the centralized approach in some scenarios.This is due to that in DGP, different weights can be assigned to the local predictions during the prediction aggregation process, so the final aggregated predictions are closer to the expert who makes more confident predictions.In the centralized method, all the data is aggregated before training without any difference.
The tracking errors of the proposed DGPT approach with both temporal and spatial input features are depicted in Tables    VI-IX.We can find that considering spatial feature can help to improve the tracking accuracy as compared to TGP, especially in the more challenging scenarios where the speed of the target keeps changing or the target keeps maneuvering (Scenarios S2,s3, and S4).Particularly, the improvement is even more significant in the high clutter case since adding spatial feature can help to learn a more accurate likelihood function which can better assign weights for measurement preprocessing.

F. Uncertainty Quantification
To visualise the derived UCB of the proposed DGPT, we choose the probability that the UCB holds as 99.7% which   corresponds to the 3σ confidence interval of the Gaussian distribution.Based on theoretical analysis from Theorem 2, the UCBs of predictions of DGPT-RBCM in both X and Y coordinates are presented in Figures 4 and 6.The confidence intervals of the predictive distribution of the DGPT-RBCM in both X and Y coordinates based on (19) and (20) are presented in Figures 5 and 7.The results in Figures 4 and 6 demonstrate that the proposed UCB can reveal the information of where the true location of the target is since the UCB can cover the true target location in most time steps with a high probability.However, in Figures 5 and 7, although the confidence intervals can quantify the uncertainties of the predictions themselves, the intervals fail to cover the true location of the target in most of the time steps.The derived UCB better characterizes the presence of the target in the error bound with 88% and 42% higher probability in X and Y coordinates, as compared to the confidence of DGP.The comparisons between the derived UCB and the confidence interval highlight the value and informativeness of the UCB, which can help to further refine the measurements for training the DGP model by excluding some measurements out of the bound and thus having the potential to further enhance the measurement preprocessing method presented in Section IV-C.

G. Hybrid Bayesian Filtering
In this section, the performance of the proposed hybrid Bayesian filtering approach for target tracking is evaluated as compared to both the Standard GP-based tracker and the DGPT approach.We have also compared with a model-based aapproach by using a convolutional particle filter (CPF) [66], These values cover possible uniform motions and maneuver within the minimum and maximum values of the turn rate.The CPF multiple model filter is tested over all 4 considered scenarios, by keeping the same transition probabilities matrix, initial mode probabilities vector and grid for the angular turn values.This means that we we give the same conditions for the CPF, for all testing scenarios.
Inspired by ideas from Gilholm and Salmond [61], [62], we adopt the powerful Poisson likelihood model in the CPF and in our proposed hybrid Bayesian filtering approach in order to deal with measurement origin uncertainties and solve the data association task.
The Poisson likelihood model for dealing with measurement origin uncertainties leads to one of the most efficient data association approaches which avoids combinatorial complexities which are typical for multiple hypothesis target tracking model-based approaches.
The dynamic model with a known turn rate has the follow-  In particular, we can see that a fine-tuned multiple-model CPF achieves the lowest NRMSE values in both X and Y coordinates.
We also present the standard deviation (STD) of averaged RMSEs over 100 MC runs.Each of the RMSEs is the tracking error calculated from a single independent MC run.The results are presented in Figures A (c) and (d).The proposed hybrid Bayesian filtering approach achieves the lowest STD of state estimations in both coordinates as compared to other GP-based approaches.It is also fairly stable over multiple scenarios.The results demonstrate that by involving the Poisson likelihood model, both the accuracy and the robustness of the proposed DGPT can be further improved.The STDs of RMSEs from the CPF are relatively higher in some cases, which shows the proposed algorithm achieves higher robustness than the modelbased tracking approach.The high STDs of the RMSEs for the CPF also mean that the estimates from the CPF are somehow far from the true target trajectories in some areas or some runs.

VIII. CONCLUSION
In this paper, a novel DGP-based model-free learning and tracking approach is proposed to solve distributed point tracking problems in WSNs with clutter measurements.The developed approach belongs to the distributed edge learning and overcomes the limitations of standard GP-based tracking methods from a different angle via distributed GP.Theoretical derivations are presented for the UCB of the tracking error for two important tasks: 1) when data have no clutter, 2) when the sensor data contain clutter which means there is measurement origin uncertainty.The UCBs characterize the trustworthiness of the proposed approach.The estimates are acceptable when the derived UCB is within certain prespecified limits.It characterizes the presence of the target in the error bound with 88% and 42% higher probability in X and Y coordinates, respectively, than the confidence interval-based method.By introducing the Poisson measurement likelihood, a hybrid Bayesian filtering approach is proposed to merge the distributed machine learning and model-based methods to further improve the tracking performance and robustness.Numerical experiments demonstrate that the proposed approaches perform competitively well and can deal with varying motion models, noise levels, and clutter rates.Future work will focus on sensor management challenges in the developed distributed tracking system.The challenges include efficiently utilizing edge computing resources to accelerate DGP training and prediction.Another direction is to consider different types of measurements such as received signal strength and non-Gaussian measurement noises in the distributed tracking system.The theoretical derivation of a PCRLB is an important topic to focus on, and the validation of the theoretical results over real case studies will also be considered in the future.

APPENDIX A
According to (18), (19), Lemma 1 and the analysis as in Theorem 1, the deviation of the true function value from the

Figure 1 :
Figure 1: A distributed point tracking system with 4 sensors.The length of sliding window in this example is 5 time steps

Figure 2 :
Figure 2: Computation time for the log-marginal likelihood and its gradient versus the size of the training data and the number of local GPs

Figure 5 :
Figure 5: Confidence interval for DGPT-RBCM of S1 in X coordinate

•
To solve the data association problem, different weights are assigned to the measurements based on the marginal likelihood of local GPs and a weighted summation is calculated for DGP training.This method achieves efficient hyperparameter learning and state prediction.We justify that the complexity of the DGPT does not scale with the number of measurement, but only with the length of the sliding window and the number of active sensors.
• For the first time this work derives theoretically upper confidence bounds (UCBs) for the state estimation error of the proposed DGPT.Numerical results demonstrate the superiority of the UCBs as compared to the confidence interval of the DGP model itself.• With the knowledge learned from the DGP, a novel hybrid Bayesian filtering method is proposed to combine distributed machine learning and classical Bayesian inference.The designed tracker refines DGPT's state estimation by introducing a Poisson likelihood model, which elegantly sidesteps the data association challenge.• The performance of the proposed DGP and hybrid • /s for 10 seconds) with the constant velocity model, which can go both left and right.This model can represent maneuvrable motions.The target trajectory is generated by the sharp coordinated turns model which is more agile with higher turning rate (30 • /s for 9 s).S4: The target trajectory is generated by the Singer acceleration model.The maximum possible acceleration is 50 m/s 2 , the probability of non-acceleration is 0.4.

Table II :
Updated NRMSEs for S1: TGP

Table V :
Updated NRMSEs for S4: TGP