PLPF‐VSLAM: An indoor visual SLAM with adaptive fusion of point‐line‐plane features

Simultaneous localization and mapping (SLAM) is required in many areas, and visual SLAM (VSLAM) in particular, owing to its low cost and strong scene recognition capabilities. Conventional VSLAM relies primarily on features of the scene, such as point features, which can make tracking and mapping challenging when texture is sparse. For instance, in environments with low or even no texture, such as certain indoor scenes, conventional VSLAM may fail for lack of sufficient features. To address this issue, this paper proposes a VSLAM system that adaptively fuses point-line-plane features (PLPF-VSLAM). As the name implies, it can adaptively employ different fusion strategies on the PLPF for tracking and mapping: in rich-textured scenes it utilizes point features, while in non-/low-textured scenes it automatically selects the fusion of point, line, and/or plane features. PLPF-VSLAM is evaluated on two RGB-D benchmarks, the TUM and ICL_NUIM data sets. The results demonstrate the superiority of PLPF-VSLAM over other commonly used VSLAM systems. Compared to ORB-SLAM2, PLPF-VSLAM improves accuracy by approximately 11.29%. Its processing speed outperforms PL(P)-VSLAM by approximately 21.57%.

VSLAM offers a simple structure, low cost, strong scene recognition ability, and the ability to capture rich texture information. Consequently, VSLAM has gained significant attention in both academic and industrial fields (Di et al., 2019).
Conventional VSLAM typically relies on point features to track the movements of agents and build maps, as this approach is simple and effective. However, because images of nontextured or low-textured environments (Figure 1) lack sufficient point features, conventional VSLAM may suffer from issues such as tracking loss (Guo et al., 2021) and failure in the loop detection stage (Tsintotas et al., 2022). To address these challenges, researchers have been exploring alternative approaches. For instance, to deal with tracking loss, they attempted to develop point-line-plane (PLP)-based VSLAM systems that combine line and plane features (Li, Li, et al., 2017).
Additionally, some researchers tried to employ deep-learning-based techniques for loop detection (An et al., 2022; Arshad & Kim, 2021; Lu et al., 2023). This paper focuses only on the tracking loss issue, because loop detection is not a necessary step in all scenarios.
However, current attempts still face limitations in low-/nontextured indoor environments. Consequently, further investigation and development of VSLAM systems that can operate effectively in such scenarios remain essential.
Conventional point-based VSLAM handles scenes with rich textures well, and the structures of indoor spaces (e.g., walls perpendicular to floors and ceilings) can serve as an effective supplementary feature in non-/low-textured areas. Inspired by these observations, we propose a VSLAM system with adaptive fusion of point-line-plane features (PLPF-VSLAM). It adaptively selects a proper feature fusion strategy for localization and mapping according to the texture richness of the scene. For scenes with rich texture, the system fuses point and line features for tracking and stores such features in the map. For scenes lacking texture, it employs the PLP fusion. This paper is organized as follows. Section 2 provides an overview of current research on VSLAM. Section 3 presents PLPF-VSLAM. Section 4 evaluates PLPF-VSLAM using two RGB-D benchmarks, the TUM data set and the ICL_NUIM data set. Conclusions and future work are drawn from the results in the final section.

| RELATED WORK
In recent years, many different VSLAM systems have been presented.
According to the type of feature utilized, VSLAM can be categorized into three types: (i) point-feature-based VSLAM (P-VSLAM), (ii) line-feature-based VSLAM (L-VSLAM), and (iii) VSLAM based on the fusion of point, line, and/or plane features (PL(P)-VSLAM). It should be noted that it is difficult for a SLAM system to fulfill the accuracy requirements of mapping and tracking based solely on plane features.
Thus, plane features are rarely used alone but are usually employed together with the other two types of features.

| P-VSLAM
P-VSLAM systems primarily rely on point features for tracking and mapping. Commonly used point features include the scale-invariant feature transform (SIFT) (Lowe, 2004), speeded-up robust features (SURF) (Bay et al., 2006), and oriented FAST and rotated BRIEF (ORB) (Rublee et al., 2011). The processing methods for point features are highly developed, which makes P-VSLAM the current mainstream of VSLAM. There are several classic P-VSLAM systems, such as PTAM (Klein & Murray, 2007), MonoSLAM (Davison et al., 2007), and semidirect visual odometry (SVO) (Forster et al., 2014). In general, PTAM is regarded as the prototype of P-VSLAM. It brought three major innovations to SLAM systems: (i) it replaces traditional Kalman filtering with nonlinear optimization; (ii) it employs a keyframe mechanism, so the system only needs to process the most representative images rather than every frame, which greatly improves computational efficiency; and (iii) to meet real-time requirements, it separates the tracking and mapping processes using a multithreading mechanism. However, because it does not consider global loop closing, PTAM is only applicable to small scenarios, and its tracking process fails easily. Afterward, an open-source point-feature-based VSLAM named ORB-SLAM was released (Mur-Artal et al., 2015). It employs the ORB feature, a loop-closing detection mechanism, and the bag-of-words (BoW) model, which together form a very complete framework for point-feature-based VSLAM. Because ORB-SLAM is prone to tracking loss when the camera rotates violently, the authors released ORB-SLAM2 two years later (Mur-Artal & Tardós, 2017). This system supports monocular, stereo, and RGB-D cameras and realizes real-time localization and mapping with centimeter-level localization accuracy. Hence, it is the most typical P-VSLAM system. It should be mentioned, however, that this system is very sensitive to dynamic objects and easily loses tracking in dynamic scenes.

| L-VSLAM
In response to the limitations of P-VSLAM in non-/low-textured scenarios, researchers began studying L-VSLAM, which utilizes line features as the primary source of information for tracking and mapping. For instance, Smith et al. (2006) applied line features in a SLAM system in which a line is represented by two endpoints. Yet this system is only suitable for small scenes where entire line segments can be fully captured. To address this limitation, Lemaire and Lacroix (2007) applied infinitely long line segments to large scenes. This practice effectively expands the applicable scenarios and further eases the matching of line segments between frames.
However, the initialization of line segments in space may fail in scenarios with a large landmark space. In addition, Zhang et al. (2015) proposed a three-dimensional (3D) line-based stereo VSLAM system that employs two different representations to parameterize 3D lines for better results. Inevitably, this system also has shortcomings; in particular, its straight-line tracking process, based on the optical flow method, is time-consuming. There are several PL-based and related VSLAM systems, such as large-scale direct (LSD) SLAM (Engel et al., 2014), monocular-based PL-SLAM (Pumarola et al., 2017; Yang et al., 2021), point-line fusion (PLF)-SLAM (Fang & Zhan, 2020), and PLI-VINS (Zhao et al., 2022). LSD-SLAM (Engel et al., 2014) applied the direct method to semidense monocular SLAM and achieved semidense scene reconstruction on the CPU, but it is prone to tracking loss when the camera moves quickly. The monocular-based PL-SLAM proposed by Pumarola et al. (2017) applied the fusion of point and line features to the entire SLAM process; it addresses the tracking and matching problems of specific line segments by removing outliers based on comparisons of the length and orientation of line features. Similarly, Li et al. (2020) proposed a low-drift monocular SLAM method for indoor scenes in which the estimation of rotation and translation is decoupled to reduce long-term drift. In particular, it estimates a drift-free rotation between cameras using spherical mean-shift clustering and a weak Manhattan-world hypothesis (Zhou et al., 2016); the translation between the cameras is then calculated from the point and line features.

| PL(P)-VSLAM

PL(P)-VSLAM incorporates a fusion of point, line, and/or plane features to enhance tracking and mapping accuracy. During the extraction and matching of line features, several challenges may arise, such as unclear endpoint positions and weak geometric constraints, leading to a high number of mismatches. As a result, researchers shifted their focus toward fusion-based VSLAM, which typically comes in three types: point-line (PL), point-plane (PP), and point-line-plane (PLP).
VSLAM that uses plane features generally includes PP-based (Guo et al., 2019; Taguchi et al., 2013; Zhang et al., 2019) and PLP-based VSLAM (Li, Hu, et al., 2017; Li et al., 2021; Xia et al., 2022). One typical PP-based VSLAM system was proposed by Zhang et al. (2019); it takes data from an RGB-D camera as input for localization and mapping in low-textured scenarios. This system improves accuracy and robustness by employing structural information throughout the whole process. However, it assumes that plane edges are intersections of vertical planes, limiting its applicability in scenarios with inclined planes. By fusing point, line, and plane features, the SLAM system presented by Li et al. (2021) decouples rotation and translation and obtains drift-free rotation by constructing a Manhattan world, which further improves accuracy. Meanwhile, based on an instance-based meshing strategy, this system constructs dense maps by dividing plane instances independently. However, the initialization of a Manhattan world requires three pairs of perpendicular planes or lines, so users need to check whether the scenario meets this condition before using it. PLP-SLAM (Shu et al., 2023) tightly incorporates semantic and geometric features (point, line, and plane) to boost both frontend pose tracking and backend map optimization.
However, this method does not perform well in low-textured environments. UPLP-SLAM (Yang et al., 2023) designed a mutual association scheme for the data association of point, line, and plane features, which considers not only the correspondence of homogeneous features (i.e., point-point, line-line, and plane-plane pairs) but also the association of heterogeneous features (i.e., point-line, point-plane, and line-plane pairs). By considering these cross-feature associations, UPLP-SLAM aims to improve the accuracy of the SLAM system even in low-textured environments.

| PLPF-VSLAM: VSLAM WITH ADAPTIVE FUSION OF PLPF

PLPF-VSLAM can adaptively select fusion strategies according to the characteristics of the scenario. As shown in Figure 2, PLPF-VSLAM has the same framework as conventional VSLAM (Mur-Artal & Tardós, 2017), which includes three threads: tracking, local mapping, and loop closing. Compared with conventional VSLAM, the improvements lie in the tracking thread: taking the numbers of matched features as a reference, one of four modes (Point Tracking Mode, PL Tracking Mode, PP Tracking Mode, and PLP Tracking Mode) is adaptively selected for the tracking process.

| Tracking
In the PLPF-VSLAM system, RGB and depth images are utilized as inputs. The tracking process estimates the pose transformation between two frames of images. Initially, point and line features are extracted from the RGB images, while plane features are extracted from the depth images; meanwhile, incorrect feature matches are eliminated to ensure accuracy. Once feature extraction is completed, the system constructs various projection error functions based on the matching results and predefined thresholds. These error functions capture the differences between the projected features and the actually observed ones. By minimizing these projection errors, the system obtains the pose estimate, that is, the transformation between the two frames of images.

| Feature extraction and matching
ORB features are commonly used in VSLAM systems due to their desirable characteristics, such as invariance to rotation and scale, fast extraction, and efficient matching, which contribute to improved efficiency and performance in many scenarios. However, in nontextured or low-textured environments, the effectiveness of ORB features may be limited because they struggle to provide sufficient point features for accurate pose estimation. With that in mind, PLPF-VSLAM adds line feature extraction based on the LSD approach (Von Gioi et al., 2008) and uses the line band descriptor (Zhang & Koch, 2013) to describe line segments.
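As an illustration, the sketch below extracts both feature types with OpenCV. It is a minimal sketch, not the paper's implementation: it assumes an OpenCV build that exposes cv2.createLineSegmentDetector (absent from some 4.x releases for licensing reasons), and the line band descriptor itself lives in the opencv-contrib line_descriptor module, which is omitted here.

```python
import cv2

def extract_point_line_features(rgb):
    """Extract ORB point features and LSD line segments from an RGB frame.

    A sketch of the extraction step described above; the feature count is
    illustrative, not the paper's tuned value.
    """
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)

    # ORB keypoints + descriptors (rotation/scale invariant, fast to match).
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints, descriptors = orb.detectAndCompute(gray, None)

    # LSD line segments; availability depends on the OpenCV build.
    lsd = cv2.createLineSegmentDetector()
    lines, _, _, _ = lsd.detect(gray)  # lines: N x 1 x 4 (x1, y1, x2, y2)

    return keypoints, descriptors, lines
```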
Indoor scenarios contain a large number of non-/low-textured planes, but these planes exhibit many structural regularities (e.g., vertical and parallel surfaces). Such features can also be employed to improve the stability of a VSLAM system. The approach presented in Feng et al. (2014) is employed to extract plane features (Figure 3); it includes three steps: (i) divide the point cloud of the image into N nodes and remove nodes with missing or discontinuous depth information; (ii) cluster the eigenvalues of each pixel and group contiguous blocks whose feature differences fall within a threshold into the same segment; and (iii) iteratively optimize each pixel assignment to output the parameters of the plane features. A heavily simplified sketch of this block-then-merge idea follows.
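The sketch below is a simplified stand-in for Feng et al.'s agglomerative method, kept only to make the three steps concrete: it fits a plane to each block by PCA, rejects poorly fitted blocks, and greedily merges consistent blocks. All thresholds are illustrative.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit by PCA; returns (unit normal, d, rms residual)."""
    centroid = points.mean(axis=0)
    _, s, vt = np.linalg.svd(points - centroid)
    n = vt[2]                       # normal = direction of smallest variance
    d = -n @ centroid               # plane: n . x + d = 0
    if n[2] < 0:                    # fix the sign so normals are comparable
        n, d = -n, -d
    rms = s[2] / np.sqrt(len(points))
    return n, d, rms

def extract_planes(blocks, res_thresh=0.01, cos_thresh=0.97, dist_thresh=0.05):
    """Step (i): fit each block, dropping degenerate/non-planar nodes;
    step (ii): greedily merge blocks with consistent fits;
    step (iii): refit the merged plane parameters."""
    fits = []
    for block in blocks:
        if len(block) < 3:          # missing or degenerate depth data
            continue
        n, d, rms = fit_plane(block)
        if rms < res_thresh:
            fits.append((n, d, block))

    planes = []
    for n, d, block in fits:
        for plane in planes:
            if n @ plane["n"] > cos_thresh and abs(d - plane["d"]) < dist_thresh:
                plane["pts"] = np.vstack([plane["pts"], block])
                plane["n"], plane["d"], _ = fit_plane(plane["pts"])  # refine
                break
        else:
            planes.append({"n": n, "d": d, "pts": block})
    return planes
```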
In this paper, a plane is described in the Hessian form π = (n^⊤, d)^⊤, where n is the unit normal vector of the plane and d is the distance from the camera's optical center to the plane. When matching two plane features, two values are compared: the angle between their normal vectors and the difference between their distances d. Two conditions determine whether two planes match: the angle between their normal vectors must be less than an angle threshold, and the distance between the two planes must be less than a distance threshold (the values used are given in Section 4).
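As a concrete rendering of this test, here is a small sketch assuming planes are stored as a unit normal n and distance d; the 10° and 0.1 m defaults mirror the thresholds chosen later in Section 4.

```python
import numpy as np

def planes_match(n1, d1, n2, d2, angle_thresh_deg=10.0, dist_thresh_m=0.1):
    """Return True if two Hessian-form planes (n, d) are considered a match.

    n1, n2: unit normal vectors, shape (3,); d1, d2: distances to camera center.
    """
    # Angle between the unit normals (clip guards against rounding issues).
    cos_angle = np.clip(np.dot(n1, n2), -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(abs(cos_angle)))  # abs: n and -n, same plane

    return angle_deg < angle_thresh_deg and abs(d1 - d2) < dist_thresh_m
```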

| Pose estimation
During the pose estimation process, the detected 3D points, lines, and planes from the previous frame are projected onto the current frame using the estimated pose transformation between the two frames; by projecting the features, their positions in the current frame can be predicted. To evaluate the accuracy of the pose estimate, a reprojection error is calculated by comparing the projected features with the corresponding features detected directly in the current frame. This reprojection error is used to construct an error function representing the differences between the projected and observed features, which is minimized during the optimization process to obtain the optimal pose estimate.
For point features, the reprojection error function is Equation (1):

$$e_{p,i} = u_i - \frac{1}{s_i}\,K\,T_{cw}\,P_i, \tag{1}$$

where u_i is the feature point in the current frame corresponding to the 3D point; P_i represents the 3D point in the world coordinate system; s_i is the depth of P_i in the camera frame; K denotes the camera intrinsic parameters; and T_cw denotes the transformation matrix from the world coordinate system to that of the camera.
The Jacobian of Equation (1) with respect to T_cw is Equation (2), and that with respect to P_i is Equation (3), where P′ = (X′, Y′, Z′)^⊤ represents the coordinates of P_i in the camera coordinate system and R denotes the rotation matrix from the world coordinate system to that of the camera.
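The extracted text drops the bodies of Equations (2) and (3). Under the usual pinhole model with intrinsics f_x, f_y and a left perturbation δξ of the pose, the standard forms (which the paper's equations presumably match up to sign and ordering conventions) are:

```latex
% Standard reprojection Jacobians for e = u - (1/s) K T_{cw} P_i,
% with P' = (X', Y', Z')^T in camera coordinates (translation-first convention).
\frac{\partial e}{\partial \delta\boldsymbol{\xi}} =
-\begin{bmatrix}
\frac{f_x}{Z'} & 0 & -\frac{f_x X'}{Z'^2} &
-\frac{f_x X' Y'}{Z'^2} & f_x + \frac{f_x X'^2}{Z'^2} & -\frac{f_x Y'}{Z'} \\[4pt]
0 & \frac{f_y}{Z'} & -\frac{f_y Y'}{Z'^2} &
-f_y - \frac{f_y Y'^2}{Z'^2} & \frac{f_y X' Y'}{Z'^2} & \frac{f_y X'}{Z'}
\end{bmatrix}
\qquad (2)

\frac{\partial e}{\partial P_i} =
-\begin{bmatrix}
\frac{f_x}{Z'} & 0 & -\frac{f_x X'}{Z'^2} \\[4pt]
0 & \frac{f_y}{Z'} & -\frac{f_y Y'}{Z'^2}
\end{bmatrix} R
\qquad (3)
```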
For line features, we formulate the reprojection error based on the point-to-line distance between the observed 2D line l and the two endpoints of the line projected from the matched 3D line in the keyframe. For each endpoint P, the reprojection error is Equation (4):

$$e_{l} = \mathbf{l}^{\top}\,\tilde{p}, \qquad \tilde{p} = \frac{1}{s}\,K\,T_{cw}\,P, \tag{4}$$

where K is the matrix of camera intrinsic parameters; T_cw represents the transformation matrix from the world coordinate system to that of the camera; P denotes an endpoint of the 3D line segment; s is the depth of P in the camera frame; and l holds the coefficients of the 2D line equation.
The normalized line coefficients of a line feature are given by Equation (5):

$$\mathbf{l} = \frac{(a,\, b,\, c)^{\top}}{\sqrt{a^{2} + b^{2}}}, \tag{5}$$

so that e_l in Equation (4) is the point-to-line distance in pixels. The Jacobian of Equation (4) with respect to T_cw is Equation (6), and the Jacobian with respect to P is Equation (7); both follow by the chain rule from Equation (4), where P′ is the coordinate of P in the camera coordinate system.
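A numpy sketch of Equations (1), (4), and (5), assuming a 3×3 intrinsic matrix K and a 4×4 homogeneous pose T_cw; the names and the explicit depth division are ours:

```python
import numpy as np

def project(K, T_cw, P_w):
    """Project a 3D world point to homogeneous pixel coordinates (u, v, 1)."""
    P_c = T_cw[:3, :3] @ P_w + T_cw[:3, 3]   # world -> camera
    return (K @ P_c) / P_c[2]                # divide by depth s

def point_reproj_error(u_obs, K, T_cw, P_w):
    """Equation (1): observed pixel minus projected pixel."""
    return u_obs - project(K, T_cw, P_w)[:2]

def line_reproj_error(line_abc, K, T_cw, P_w):
    """Equations (4)-(5): signed point-to-line distance of a projected endpoint."""
    a, b, c = line_abc
    l = np.array([a, b, c]) / np.hypot(a, b)  # normalize so the error is in pixels
    return l @ project(K, T_cw, P_w)
```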
A plane in the Hessian form has four parameters, but a plane in 3D space has only three degrees of freedom. To address this overparameterization, the unit normal vector is expressed through its azimuth and elevation angles ϕ and ψ, so that a plane is represented in a minimal parametric form with only three parameters, as in Equation (8) (Zhang et al., 2019):

$$q(\pi) = (\phi,\, \psi,\, d)^{\top}, \qquad \phi = \arctan\frac{n_y}{n_x}, \qquad \psi = \arcsin n_z. \tag{8}$$

The reprojection error is Equation (9):

$$e_{\pi} = q(\pi_m) - q(\pi_c), \qquad \pi_c = T_{cw}^{-\top}\,\pi_w, \tag{9}$$

where π_m is the observed value of the corresponding plane in the current frame; π_w is the 3D plane in the world coordinate system; T_cw is the transformation matrix from the world coordinate system to that of the camera; and π_c is the plane expressed in the camera coordinate system.
The Jacobian of the reprojection error (Equation 9) with respect to T_cw is Equation (10) (Zhang et al., 2019), and the Jacobian with respect to π_w is Equation (11).

After obtaining the reprojection error of each feature, we construct the optimization objective function based on least squares. Different objective functions are constructed for different scenarios according to their richness of features: for scenarios with rich textures we choose P-VSLAM, while for scenes with insufficient features we use one of the four modes P-VSLAM, PL-VSLAM, PP-VSLAM, and PLP-VSLAM, according to the number of features in the scenario. The criterion for distinguishing whether a scenario is non-/low-textured or rich in texture is the number of matched PLPF (n_p, n_l, n_π). The objective function is Equation (12):

$$F = F_1 + F_2 + F_3, \tag{12}$$

where F_1, F_2, and F_3 are the objective functions of the point, line, and plane features, summed over the α, β, and γ matched point, line, and plane features, respectively. F_1, F_2, and F_3 are expressed as Equation (13):

$$F_1 = \sum_{i=1}^{\alpha} H_p\!\left(e_{p,i}^{\top}\Gamma_p^{-1}e_{p,i}\right),\quad F_2 = \sum_{j=1}^{\beta} H_l\!\left(e_{l,j}^{\top}\Gamma_l^{-1}e_{l,j}\right),\quad F_3 = \sum_{k=1}^{\gamma} H_\pi\!\left(e_{\pi,k}^{\top}\Gamma_\pi^{-1}e_{\pi,k}\right), \tag{13}$$

where H_p, H_l, and H_π are Huber functions of point, line, and plane, respectively, and Γ_p, Γ_l, and Γ_π are the covariance matrices associated with the scale at which the key points, line endpoints, and planes were detected.

| Local mapping

The local mapping thread constructs the local map, leveraging the keyframes generated within the tracking thread to estimate the precise pose of each keyframe along with the associated map points, lines, and planes, which are subsequently assimilated into the local map. A local map in PLPF-VSLAM primarily consists of keyframes and their associated point, line, and plane features; its construction fuses the different types of features according to their richness in the given scenario. Initially, the keyframe generated by the tracking thread is added to the local map. A selection process then determines which point, line, and plane features are included in the map: a feature that can be reliably tracked across no fewer than three keyframes is considered stable and included in the local map, whereas a feature that cannot be tracked consistently is removed. Once the keyframes and their corresponding map features are added to the local map, optimization is carried out using the bundle adjustment (BA) algorithm, which refines the camera poses and the positions of the point, line, and plane features by minimizing the reprojection error. Optimizing the local map improves the accuracy and consistency of the map representation, leading to more reliable localization and mapping.
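Returning to the objective in Equations (12) and (13), the following sketch evaluates the combined cost over the matched features; the Huber kernel and the covariance scaling are simplified to scalar information weights, which is an assumption of this illustration:

```python
import numpy as np

def huber(r2, delta=1.0):
    """Huber kernel applied to a squared residual r2 (Equation 13's H)."""
    r = np.sqrt(r2)
    return r2 if r <= delta else 2 * delta * r - delta**2

def total_objective(point_errs, line_errs, plane_errs,
                    w_p=1.0, w_l=1.0, w_pi=1.0):
    """Equation (12): F = F1 + F2 + F3 over alpha/beta/gamma matched features.

    Each *_errs argument is a list of residuals (vectors or scalars); the w_*
    scalars stand in for the inverse covariances Gamma^{-1}."""
    F1 = sum(huber(w_p * np.dot(e, e)) for e in point_errs)
    F2 = sum(huber(w_l * np.dot(e, e)) for e in line_errs)
    F3 = sum(huber(w_pi * np.dot(e, e)) for e in plane_errs)
    return F1 + F2 + F3
```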

| LOOP CLOSING
In the field of VSLAM, relying solely on the pose transformation calculated between adjacent keyframes leads to an inevitable accumulation of error, which renders the system unreliable over extended periods of operation. Therefore, it is critical to eliminate the accumulated error by performing pose optimization in loop closing. The loop closing of PLPF-VSLAM is based on the approach presented in Mur-Artal and Tardós (2017) and mainly includes two components: loop detection and loop correction.
Loop detection finds the loop keyframe using a BoW model (Gálvez-López & Tardós, 2012). To determine whether the current keyframe can be used as a loop keyframe, we calculate the similarity transformation from the current keyframe to the loop keyframe; on the basis of this similarity transformation, we obtain the translation and rotation between the current and loop keyframes; and we perform projection and matching according to this translation and rotation to check the reliability of the current loop.
Loop correction starts by adjusting all camera poses based on the known similarity transformation. The adjusted poses are then used to update the map points corresponding to the connected keyframes, and the map points of the loop keyframe are fused with those of the current keyframe. Afterward, these fused map points are reprojected and rematched to establish new matching relationships, according to which the poses of all cameras are optimized over the pose graph. Finally, loop correction finishes with a full BA.

| EXPERIMENTS
To evaluate the performance of PLPF-VSLAM, we utilized two commonly used RGB-D benchmark data sets: the Technical University of Munich (TUM) data set (Sturm et al., 2012) and the Imperial College London and National University of Ireland Maynooth (ICL_NUIM) data set (Handa et al., 2014). The former includes a large set of sequences containing both RGB-D data from a Microsoft Kinect and ground-truth pose estimates from a motion-capture system. Notably, the ground-truth measurements attain millimeter-level precision. The latter collects image sequences in synthetic indoor spaces, mainly a living room and an office, and provides RGB images, depth images, and ground-truth camera poses. The TUM data set contains not only rich-textured scenes but also low-/nontextured scenarios and, more importantly, provides real trajectories recorded during data collection. Therefore, the TUM data set is employed to determine the parameters of PLPF-VSLAM, while the ICL_NUIM data set is used for testing. PLPF-VSLAM is compared with five other VSLAM systems: ORB-SLAM2 (Mur-Artal & Tardós, 2017), PL-SLAM (Pumarola et al., 2017), LSD-SLAM (Engel et al., 2014), Planar-SLAM (Li et al., 2021), and L-SLAM (Kim et al., 2018), of which the first is based on point features, the second and third on the fusion of point and line features, and the last two on the fusion of PLPF. It should be noted that the reported performances of the five VSLAM systems come from the literature. The computer for this experiment is equipped with an Intel Core i7-7500U (3.5 GHz) and 12 GB of memory.
The performance of the four modes (P-VSLAM, PL-VSLAM, PP-VSLAM, and PLP-VSLAM) is evaluated by the root mean square error (RMSE) of the absolute trajectory error (ATE), as in Equation (14):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left\lVert \hat{x}_i - x_i \right\rVert^{2}}, \tag{14}$$

where x̂_i represents the keyframe trajectory estimated by a VSLAM system and x_i denotes the real trajectory of the camera.
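A direct numpy rendering of Equation (14), assuming the estimated and ground-truth trajectories are already time-associated and aligned:

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of the absolute trajectory error (Equation 14).

    est_xyz, gt_xyz: (n, 3) arrays of associated, aligned positions.
    """
    diffs = est_xyz - gt_xyz
    return np.sqrt(np.mean(np.sum(diffs**2, axis=1)))
```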
Besides the RMSE, we propose another criterion to evaluate the overall performance (O_P) of the four modes, determined by the mean tracking time per frame and the RMSE. For a given scenario, the mode with the minimum O_P is selected. O_P is calculated by Equation (15):

$$O_{P,i} = \eta\,\mathrm{tm}_i + \lambda\,\mathrm{RMSE}_i, \tag{15}$$

where tm_i denotes the mean tracking time per frame of the ith mode; RMSE_i represents the RMSE of the ith mode; i = 1, 2, 3, 4 corresponds to the four modes P, PL, PP, and PLP; and η and λ are the weights of the mean tracking time per frame and the RMSE, respectively.

| Determination of parameters
In this part, we determine the parameters of PLPF-VSLAM by processing the TUM data set. The thresholds for judging whether two planes match are set on the basis of previous research (Li et al., 2021; Zhang et al., 2019): the threshold for the angle between the normal vectors of two planes is set to 10°, and the threshold for the distance between two planes is set to 0.1 m. Furthermore, we leverage the parallel and perpendicular relationships of the map planes as additional constraints during tracking.
The parameters of the four modes (P-VSLAM, PL-VSLAM, PP-VSLAM, and PLP-VSLAM) on the 10 selected sequences from the TUM data set are shown in Table 1. These sequences were chosen because they contain plane features, allowing a comprehensive evaluation of the different modes. For each sequence, three values were computed for each mode: the mean tracking time per frame, the RMSE, and the O_P. These values provide insight into the computational efficiency and accuracy of each mode.

| Weight calculation
To balance processing speed and accuracy in PLPF-VSLAM, suitable values must be chosen for the weights η and λ, which together determine the overall performance metric O_P: η weighs the mean tracking time per frame, and λ weighs the RMSE. Finding the right balance between these two weights is essential for optimizing the overall performance of the system.
First, according to the mean tracking times and RMSEs in the last four columns of Table 1, v_1 and v_2 for each of the four modes in each sequence are computed using Equation (16). Note that the sequence fr3/nstr_ntex_far is excluded from this calculation because it can be processed only in the PLP-VSLAM mode.
Then we calculated v_1 and v_2 for the four modes (P-VSLAM, PL-VSLAM, PP-VSLAM, and PLP-VSLAM) over the nine remaining sequences, and the average ratio ρ based on v_1 and v_2 is given by Equation (17).
TABLE 1 Parameters of the four modes on the 10 selected sequences from the TUM data set. Note: X_point, X_line, and X_plane are the normal distribution models of the numbers of matched point, line, and plane features. "×" means that tracking loss happened or that a significant portion of the sequence was not processed. η and λ are 0.47 and 0.53, respectively, when computing O_P. The bold values mark the minimum O_P, which indicates the mode with the best performance in each scenario.
Having ρ, to balance the impact of the mean tracking time per frame and the RMSE on O_P, we set the relationship between η and λ as Equation (18). Finally, we obtain η = 0.47 and λ = 0.53.
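Equations (16)-(18) do not survive extraction, so the exact definitions of v_1, v_2, and ρ are not recoverable here. One self-consistent reading, offered purely as an assumption, normalizes time and error within each sequence, averages their ratio over the nine sequences, and constrains the weights to sum to one:

```latex
% Assumed reconstruction: v_1, v_2 normalize each mode's tracking time and
% RMSE within a sequence (16); rho averages their ratio over the nine
% sequences (17); (18) then balances the weights under eta + lambda = 1.
v_{1,i} = \frac{\mathrm{tm}_i}{\sum_{j} \mathrm{tm}_j}, \qquad
v_{2,i} = \frac{\mathrm{RMSE}_i}{\sum_{j} \mathrm{RMSE}_j} \qquad (16)

\rho = \operatorname{mean}\!\left(\frac{v_1}{v_2}\right) \qquad (17)

\eta = \rho\,\lambda, \qquad \eta + \lambda = 1
\;\Rightarrow\; \eta = 0.47,\; \lambda = 0.53 \qquad (18)
```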
Then, the O P is computed (Table 1).

| Data processing
For each sequence, after counting the numbers of matched point, line, and plane features, we found that they follow Gaussian distributions (Figures 4 and 5). The two figures show the Gaussian fitting results for the numbers of matched features in the fr1/room and str_tex_far sequences, respectively.

| Data fusion
According to Table 1, each of the four fusion modes (P-VSLAM, PL-VSLAM, PP-VSLAM, and PLP-VSLAM) performs best on several sequences. Therefore, for each mode, we fuse the distributions of the numbers of matched PLPF over the sequences on which that mode performs best into a single Gaussian distribution, which defines the matched-feature threshold interval used by the mode. For example, on the basis of O_P, P-VSLAM performs best on the four sequences fr1/room, fr1/desk, fr1/xyz, and fr3/str_tex_near; we therefore fused the distributions of the numbers of matched point, line, and plane features over these four sequences to obtain an optimal distribution.
To combine multiple distributions into a single one, we minimize the variance of the resulting distribution. For instance, suppose we have two Gaussian distributions N(μ₁, σ₁²) and N(μ₂, σ₂²) and want to fuse them into a new distribution N(μ, σ²). The fused mean is a weighted combination of the two means (Equation 19):

$$\mu = k_1\,\mu_1 + k_2\,\mu_2, \qquad k_1 + k_2 = 1. \tag{19}$$

Equation (19) can be simplified to Equation (20), and the variance of the fused distribution then becomes Equation (21):

$$\mu = k\,\mu_1 + (1-k)\,\mu_2, \tag{20}$$

$$\sigma^{2} = k^{2}\,\sigma_1^{2} + (1-k)^{2}\,\sigma_2^{2}. \tag{21}$$

To minimize the variance of the fused distribution, we take the derivative of Equation (21) with respect to k and set it to zero (Equation 22), which yields Equation (23):

$$\frac{\mathrm{d}\sigma^{2}}{\mathrm{d}k} = 2k\,\sigma_1^{2} - 2(1-k)\,\sigma_2^{2} = 0, \tag{22}$$

$$k = \frac{\sigma_2^{2}}{\sigma_1^{2} + \sigma_2^{2}}. \tag{23}$$

On the basis of the k (Equation 23) that minimizes the variance, we compute the final fused distribution. After fusing the distributions for the four modes, we obtain their feature-matching distributions (Figure 6a,c,e).
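A quick numerical check of the minimum-variance fusion (Equations 19-23); the example values are illustrative only:

```python
import numpy as np

def fuse_gaussians(mu1, var1, mu2, var2):
    """Fuse N(mu1, var1) and N(mu2, var2) with the variance-minimizing
    weight k = var2 / (var1 + var2) from Equation (23)."""
    k = var2 / (var1 + var2)
    mu = k * mu1 + (1 - k) * mu2              # Equation (20)
    var = k**2 * var1 + (1 - k)**2 * var2     # Equation (21)
    return mu, var

# Example: fusing the matched-feature distributions of two sequences.
mu, var = fuse_gaussians(120.0, 25.0, 90.0, 16.0)
print(mu, var)  # the weight favors the lower-variance distribution
```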
After processing and optimizing the data, we summarized Figure 6b,d,f to obtain the thresholds for each mode (Equation 24).
However, overlaps can occur between the intervals of different modes. In such cases, we consider the means μ of the Gaussian distributions of the numbers of matched point, line, and plane features for the different modes and choose the mode whose μ is closest to the overlapping region.
where P, PL, PP, and PLP represent P-VSLAM, PL-VSLAM, PP-VSLAM, and PLP-VSLAM, respectively, and n_p, n_l, and n_π denote the numbers of matched point, line, and plane features. The two weights are assigned as η = 0.47 and λ = 0.53.
Based on the μ of the distributions of the numbers of matched point, line, and plane features, the applicable conditions of the four modes are shown in Figure 7. In the figure, the four modes P-VSLAM, PL-VSLAM, PP-VSLAM, and PLP-VSLAM are colored green, blue, yellow, and red, respectively. Figure 7a is the overview, while Figures 7b and 7c are side views; Figure 7a can be populated as Figure 7d.
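To make the selection rule concrete, the sketch below dispatches on the matched-feature counts (n_p, n_l, n_π). The threshold constants are placeholders of our own, since the actual intervals of Equation (24) come from the fused Gaussians in Figure 6 and are not recoverable from the extracted text:

```python
# Hypothetical thresholds standing in for Equation (24); the real values are
# derived from the fused matched-feature distributions (Figure 6).
POINT_RICH = 100   # enough matched points for the P mode (placeholder)
LINE_RICH = 40     # enough matched lines to add L (placeholder)
PLANE_RICH = 3     # enough matched planes to add a plane term (placeholder)

def select_tracking_mode(n_p, n_l, n_pi):
    """Pick one of the four tracking modes from matched-feature counts."""
    if n_p >= POINT_RICH:
        return "P"                      # rich texture: points suffice
    if n_l >= LINE_RICH and n_pi >= PLANE_RICH:
        return "PLP"                    # low texture: use all three
    if n_pi >= PLANE_RICH:
        return "PP"
    return "PL"
```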

| TUM data set
In this experiment, we used the thresholds given in Equation (24). With this PLPF-VSLAM, we evaluated the performance of the six VSLAM systems on the 10 sequences from the TUM data set (Table 2). In sequences with rich texture (fr1/xyz, fr1/desk, fr3/nstr_tex_near, fr3/str_tex_far, and fr3/str_tex_near), PLPF-VSLAM automatically selects point or PL features for tracking and mapping. In sequences with sparse features (fr1/room, fr3/nstr_ntex_far, fr3/nstr_tex_far, fr3/str_ntex_far, and fr3/str_ntex_near), it adds plane features to tracking and mapping to ensure accuracy. To show the performance of the different VSLAM systems more intuitively, Figure 8 is drawn from Table 2; for cases of tracking loss or missing data, the maximum RMSE in the sequence is used. The trajectories and reconstructed maps of PLPF-VSLAM are depicted in Figures 9 and 10 (only the most representative scenes are displayed), providing a comprehensive visualization of the camera paths and the resulting map reconstruction. In terms of mapping results, after adding plane features, the maps capture the structural characteristics of the scenes better and have clearer outlines.
In addition, considering that real-time performance is also an important indicator of a VSLAM system, we compared our system with ORB-SLAM2 and Planar-SLAM on five sequences in terms of total time and mean tracking time (Table 3 and Figure 11). The processing times show that ORB-SLAM2 performs best in scenarios with rich features (fr1/room, fr3/str_tex_far, and fr3/nstr_tex_near), being much faster than the systems based on the fusion of PLP.
Compared with Planar-SLAM, which also uses PLPF fusion, PLPF-VSLAM performs better on all sequences. This can be explained by the fact that not all scenarios are non-/low-textured; our system adaptively selects the fusion of point, line, and plane features, which shortens the processing time. However, compared with ORB-SLAM2, our method is slower. On the one hand, adaptive fusion is based on the number of matched features, and the extraction and matching of line and plane features increase the processing time. On the other hand, within a sequence our system uses not only point features but also line and plane features in some places for mapping and tracking, which also increases the processing time.
To sum up, PLPF-VSLAM demonstrates versatility in handling both rich-textured and non-/low-textured indoor scenarios. Its processing time is slightly longer than that of ORB-SLAM2 but shorter than that of Planar-SLAM, which is also based on the fusion of point, line, and plane features. Our system takes longer because it is designed to handle non-/low-textured indoor scenarios: its feature-extraction threshold is not as strict as that of ORB-SLAM2, so more features are involved in the computation, and adding the extra line and plane features to the map also takes time. Therefore, taking all aspects into account, PLPF-VSLAM performs best overall.

Note: "×" means that tracking was lost at some point or that a significant portion of the sequence was not processed. The bold values mark the minimum O_P, which indicates the VSLAM with the best performance in each scenario.

Note: The bold values mark the minimum O_P, which indicates the VSLAM with the best performance in each scenario.

| ICL_NUIM data set
The sequences from the living room and office in the ICL_NUIM data set are also used to evaluate the accuracy of PLPF-VSLAM. Based on the RMSE of the ATE, we compare our system with ORB-SLAM2, L-SLAM, and Planar-SLAM. The performances of the different systems are shown in Table 4 and Figure 12. As no non-/low-textured scenarios are involved, all four systems stably finish localization and mapping.
In the ICL_NUIM data set, our system, along with L-SLAM and Planar-SLAM, achieves favorable results. This can be attributed to the abundance of structural features in the data set, which offers ample line and plane features. These additional features provide valuable constraints for pose estimation, enhancing overall accuracy. Therefore, in such scenarios, VSLAM based on point, line, and plane features performs better than the point-based ORB-SLAM2.

FIGURE 1 Three indoor scenes with different levels of texture richness: (a) rich-textured scene, (b) low-textured scene, and (c) nontextured scene.

FIGURE 2 The framework of the PLPF-VSLAM.

FIGURE 3 Plane extraction of two sequences in the ICL_NUIM data set: (a) the lr-kt0 sequence and (b) the kt0 sequence.


FIGURE 4 The Gaussian fitting results for the numbers of matched features in the fr1/room sequence: the numbers of matched (a) points, (b) lines, and (c) planes.

FIGURE 5 The Gaussian fitting results for the numbers of matched features in the str_tex_far sequence: the numbers of matched (a) points, (b) lines, and (c) planes.

FIGURE 8 Comparison of the RMSE of different VSLAM systems on the TUM data set.

FIGURE 9 Reconstructed maps of the six most representative sequences from the TUM data set: (a) fr1/xyz, (b) fr1/room, (c) fr3/nstr_ntex_far, (d) fr3/nstr_tex_near, (e) fr3/str_ntex_far, and (f) fr3/str_tex_near.

FIGURE 11 Comparison of the mean tracking time of different VSLAM systems on the TUM data set.

TABLE 4 RMSE of different VSLAM systems on the ICL_NUIM data set (unit: m).