Feature Calibrating and Fusing Network for RGB-D Salient Object Detection

Due to their imaging mechanisms and techniques, some depth images inevitably have low visual quality or foregrounds that are inconsistent with their corresponding RGB images. Directly using such depth images will deteriorate the performance of RGB-D salient object detection (SOD). In view of this, a novel RGB-D salient object detection model is presented, which follows the principle of calibration-then-fusion to effectively suppress the influence of these two types of depth images on the final saliency prediction. Specifically, the proposed model is composed of two stages, i.e., an image generation stage and a saliency reasoning stage. The former generates high-quality and foreground-consistent pseudo depth images via an image generation network, while the latter first calibrates the original depth information with the aid of those newly generated pseudo depth images and then performs cross-modal feature fusion for the final saliency reasoning. In particular, in the first stage, a Two-steps Sample Selection (TSS) strategy is employed to select reliable depth images from the original RGB-D image pairs as supervision information to optimize the image generation network. Afterwards, in the second stage, a Feature Calibrating and Fusing Network (FCFNet) is proposed to achieve the calibration-then-fusion of cross-modal information for the final saliency prediction, which is realized by a Depth Feature Calibration (DFC) module, a Shallow-level Feature Injection (SFI) module and a Multi-modal Multi-scale Fusion (MMF) module. Moreover, a loss function, i.e., the Region Consistency Aware (RCA) loss, is presented as an auxiliary loss for FCFNet to facilitate the completeness of salient objects and reduce background interference by considering the local regional consistency in the saliency maps. Experiments on six benchmark datasets demonstrate the superiority of our proposed RGB-D SOD model over state-of-the-art methods.

Qiang Zhang, Qi Qin, Yang Yang, Qiang Jiao, and Jungong Han, Member, IEEE
Index Terms-Salient object detection, RGB-D images, two-steps sample selection, calibration-then-fusion, region consistency aware loss.

I. INTRODUCTION
Salient object detection (SOD) imitates the human vision system to identify the most visually appealing objects or regions in an image. It has been widely applied in many computer vision fields, such as object recognition [1], video segmentation [2], person re-identification [3], visual tracking [4] and image quality assessment [5].
In fact, depth information plays a critical role in RGB-D salient object detection, as it directly dictates the performance of subsequent saliency detection. However, depth images are sometimes unreliable. For example, as shown in the 3rd and 4th rows of Fig. 1(b), some depth images have poor visual quality and contain abundant disturbing cues, and thus cannot provide much valid spatial information for cross-modal information fusion. Besides such low-quality depth images, there exists another kind of depth image that also contains unreliable depth cues for RGB-D SOD. As shown in the 1st and 2nd rows of Fig. 1(b) and Fig. 1(e), objects that are closer to the depth camera usually tend to have higher intensities than other objects and are more likely to be regarded as potential salient objects in the depth images. However, as shown in the 5th and 6th rows of Fig. 1(a) and Fig. 1(d), such objects may not always be seen as the salient ones in their corresponding RGB images. For brevity, we call these depth images foreground-inconsistent ones in this paper. As shown in Fig. 1(f), directly using such low-quality or foreground-inconsistent depth images may contaminate the results of RGB-D SOD.
(Caption of Fig. 1: In the 1st and 2nd rows, RGB and depth images are both of high visual quality. In the 3rd and 4th rows, RGB images are of high visual quality, but depth images are of low visual quality. In the 5th and 6th rows, RGB and depth images contain inconsistent foreground salient objects.)

Recently, some works have paid attention to the quality of depth images for saliency detection [20], [24], [25], [26], [27], [28], [29]. For example, Fan et al. [24] designed a depth depurator unit to determine the quality of depth images and discard those of low visual quality in the pipeline. Differently, Jin et al. [25] first leveraged RGB images to generate some meaningful depth images and then fused the features extracted from the original and generated depth images to learn robust depth features. Thanks to that, as shown in the 1st and 2nd rows of Fig. 2, these models may work well for depth images with low visual quality. However, they may be powerless for scenes with foreground-inconsistent depth images, as shown in the 3rd and 4th rows of Fig. 2.
As a remedy for this issue, in this paper, we propose a two-stage RGB-D SOD model that follows the principle of calibration-then-fusion to effectively suppress the influence of these two types of depth images on the final saliency prediction. Specifically, the proposed RGB-D SOD model is composed of an image generation stage and a saliency reasoning stage. In the image generation stage, we use the input RGB images to generate their corresponding pseudo depth images via an image generation network. In the saliency reasoning stage, we first calibrate the unreliable information in the original depth images with the aid of those generated pseudo depth images and then perform cross-modal feature fusion for the final saliency prediction.
In particular, in the image generation stage, we propose a Two-steps Sample Selection (TSS) strategy to select high-quality and foreground-consistent depth images from the original RGB-D image pairs as supervision information for the image generation network. More specifically, in the first step of TSS, depth images with rich saliency cues are first selected from the input RGB-D image pairs in terms of the Intersection over Union (IoU) values between the predicted saliency maps and their corresponding ground truths. On top of that, in the second step, foreground-inconsistent depth images are filtered out from those depth images with rich saliency cues according to the true positive rates of their predicted saliency maps. By doing so, high-quality and foreground-consistent depth images are selected from the original RGB-D image pairs. This benefits the generation of pseudo depth images with more desirable depth information, thus facilitating the subsequent depth information calibration in the saliency reasoning stage.
In the saliency reasoning stage, we propose a novel Feature Calibrating and Fusing Network (FCFNet) to effectively calibrate the raw depth information and capture more reliable complementary information from the input RGB-D images for final saliency prediction. More specifically, in FCFNet, a Depth Feature Calibration (DFC) module is first designed to calibrate the unreliable information contained in the original depth images with the aid of the generated pseudo depth images. On top of that, a Multi-modal Multi-scale Fusion (MMF) module is proposed to capture the cross-modal complementary information and multi-scale context information between the calibrated depth features and the RGB features. Furthermore, to keep FCFNet computationally efficient, we only perform MMF on the deeper levels (i.e., the last three levels in this paper) of RGB and calibrated depth features. Accordingly, to avoid the loss of the detailed information contained in the shallower levels of features, which is vital for refining salient object boundaries, a Shallow-level Feature Injection (SFI) module is also presented to inject the detailed information contained in the shallower levels (i.e., the first two levels in this paper) of features into one deeper level (i.e., the third level in this paper) of features in each unimodal feature extraction stream.
Finally, in addition to the network architecture, the loss functions are also important for an SOD task. In particular, the Binary Cross Entropy (BCE) loss [30] and the IoU loss [31] are two widely used loss functions in the RGB-D SOD field. However, they are both implemented in a pixel-wise way and ignore the local regional consistency within the saliency maps, thus easily leading to the incomplete detection of salient objects or the introduction of disturbing backgrounds. In view of this, we also design a new loss function, called the Region Consistency Aware (RCA) loss, as an auxiliary loss for our proposed FCFNet, in which the saliency consistency among the pixels within the foreground salient object regions and that among the pixels within the background regions are simultaneously considered. Under the joint supervision of the three loss functions, i.e., BCE, IoU and RCA, our FCFNet achieves more accurate saliency results.
Our main contributions are summarized as follows:
(1) We propose a calibration-then-fusion based two-stage model for RGB-D salient object detection, in which the influence of two types of depth images, i.e., low-quality ones and foreground-inconsistent ones, on saliency detection is simultaneously considered. Comprehensive experiments on six benchmark datasets demonstrate the superiority of our proposed model over existing ones.
(2) In the image generation stage, we propose a TSS strategy to select those high-quality and foreground-consistent depth images from the original input RGB-D image pairs as supervision information for the generation network of pseudo depth images.
(3) In the saliency reasoning stage, we propose a Feature Calibrating and Fusing Network (FCFNet), which is realized by three dedicated modules, i.e., DFC, MMF and SFI, to first calibrate the unreliable depth information and then capture more cross-modal information from the RGB-D images for final salient object detection.
(4) We design a novel auxiliary loss, i.e., the RCA loss, for our proposed FCFNet, in which the local regional consistency within the foreground salient object regions and that within the background regions are simultaneously considered. With the collaboration of the BCE, IoU and RCA losses, FCFNet achieves more complete foregrounds and fewer disturbing backgrounds.
II. RELATED WORK

Early RGB-D SOD methods rely on various types of handcrafted features, such as contrast [13] and shape [45], to detect salient objects. However, these methods usually suffer from unsatisfactory performance due to the limited representation ability of handcrafted features. Recently, CNN-based RGB-D SOD approaches [20], [21], [24], [37], [46] have achieved a qualitative leap in performance thanks to the powerful feature representation ability of CNNs. Such CNN-based RGB-D SOD approaches may be mainly divided into three categories, i.e., pixel-level fusion based, result-level fusion based and feature-level fusion based ones. In particular, feature-level fusion based RGB-D SOD models have become the mainstream in the past few years.
Most of these feature-level fusion based RGB-D SOD methods focus on how to effectively integrate cross-modal complementary information from RGB-D images for saliency detection. For example, Chen et al. [16] proposed a novel complementarity-aware fusion module to explicitly learn the complementary information from paired RGB-D images by introducing cross-modal residual functions and complementarity-aware supervisions. Zhou et al. [38] proposed a cross-flow cross-scale adaptive fusion network to detect salient objects in RGB-D images. Chen et al. [20] proposed a novel network to explicitly model the potentiality of depth images and effectively integrate the cross-modal complementarity. Cong et al. [39] proposed an end-to-end cross-modality interaction and refinement network for RGB-D SOD by fully capturing and utilizing the cross-modality information in an interaction and refinement manner. Xia et al. [47] proposed a global contextual exploration network to exploit the role of multi-scale features at a single fine-grained level for RGB-D SOD.
Some recent works have also paid attention to the quality of input images. For example, Piao et al. [27] presented an RGB-D SOD network based on knowledge distillation [12], which transfers depth knowledge from the depth stream to the RGB stream, reducing the influence of low-quality depth images on the saliency detection results. Similarly, Chen et al. [28] presented a depth quality aware sub-network to evaluate the quality of depth images before the cross-modal feature fusion. Yang et al. [48] presented a Bi-directional Progressive Guidance Network for RGB-D salient object detection, which progressively and interactively suppresses the disturbing cues within the multi-modal input images. Alternatively, Jin et al. [25] and Chen et al. [26] suggested a promising way to alleviate the influence of low-quality depth images on the detection results by generating some high-quality depth images as complements to the original depth images in RGB-D SOD. Zhang et al. [29] exploited some valid priors to alleviate the influence of low-quality depth images on the SOD task. These methods may effectively reduce the influence of low-quality depth images on the final saliency detection. However, they still ignore the influence of foreground-inconsistent depth images on RGB-D saliency detection. Differently, in our proposed model, the influence of low-quality depth images and that of foreground-inconsistent depth images on saliency detection are simultaneously considered.

III. PROPOSED METHOD

A. Method Overview
The proposed two-stage RGB-D saliency detection model is composed of an image generation stage and a saliency reasoning stage. In the image generation stage, high-quality and foreground-consistent pseudo depth images are generated from the input RGB images by using an image generation network. More specifically, a TSS strategy is employed to select high-quality and foreground-consistent depth images from the original input RGB-D images as the supervision information when training the image generation network. In the saliency reasoning stage, a novel FCFNet is designed for SOD, which first calibrates the unreliable depth information in the original depth images with the aid of the generated pseudo depth images and then captures the cross-modal complementary information for saliency detection. This is achieved by a DFC module, an SFI module and an MMF module. Moreover, together with the BCE and IoU loss functions, the newly proposed RCA loss function is employed to train FCFNet. In the following, we will discuss the two stages in detail.

B. Image Generation Stage
As discussed earlier, there may exist some low-quality and foreground-inconsistent depth images in the RGB-D image pairs. Directly using such depth images may greatly degrade the detection performance of an RGB-D SOD model. Therefore, similar to [25], we also employ an image generation network on the input RGB images to generate high-quality and foreground-consistent pseudo depth images, which will be used to calibrate the original depth images in the subsequent saliency reasoning stage. More specifically, for simplicity, the same image generation network as in [49] is employed in this stage to generate pseudo depth images, which is composed of several cascaded FCNs and CNNs for discriminative depth estimation.
In addition to the image generation network, the supervision information also plays an important role in the quality of the generated images. Considering that, in this stage, we present a Two-steps Sample Selection (TSS) strategy to select depth images with high visual quality and foreground consistency from the input RGB-D image pairs as the supervision information for the image generation network. Fig. 3 illustrates the diagram of the proposed TSS strategy, and the specific details are as follows.
Step 1: Select depth samples Da containing more saliency cues from the input RGB-D image pairs via the IoU values [31] between the saliency maps predicted from the depth images and their ground truths (GTs). This is due to the following consideration: the IoU can measure the consistency between the depth saliency maps and their corresponding GTs, which to some extent reflects the amount of saliency information contained in the depth images. Specifically, we first train an encoder-decoder network to predict the depth saliency maps P_d. Here, the VGG network [50] is employed as the encoder and the decoder part of U-Net [51] is employed as the decoder. After that, for each original depth image (e.g., the k-th depth image), we compute the IoU value D_iou^k between the saliency map P_d^k and its corresponding ground truth GT^k, i.e.,

D_iou^k = |P_d^k ∩ GT^k| / |P_d^k ∪ GT^k|.

Finally, those depth images with more saliency cues, i.e., with D_iou^k ≥ th1, are selected from the original RGB-D image pairs, yielding a new depth image set Da. Here, the threshold th1 is experimentally set to 0.9 in this paper.
Step 2: Select the high-quality and foreground-consistent depth images from Da by further computing the true positive rates TP_d of their predicted saliency maps, obtaining the final depth image set Db as supervision information for the image generation network.
Specifically, we first feed the RGB images into the re-trained SOD network mentioned in the first step to predict the RGB saliency maps P_r. Then, for each depth image in Da, we calculate the true positive rate TP_d^k of its depth saliency map P_d^k, as well as the true positive rate TP_r^k of its corresponding RGB saliency map P_r^k, i.e.,

TP_d^k = |P_d^k ∩ GT^k| / |GT^k|, TP_r^k = |P_r^k ∩ GT^k| / |GT^k|.

Finally, we select the depth images with TP_d^k > TP_r^k from the depth sample set Da as the final depth sample set Db. As shown in Fig. 3, high-quality and foreground-consistent depth images can be effectively selected from the original RGB-D image pairs by using our proposed TSS strategy. On top of that, we further select the RGB images corresponding to the finally selected depth images from the original RGB-D images, obtaining a new set of RGB-D image pairs RD.
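As a shape-level sketch (not the authors' code), the two selection steps can be written as follows; the helper names `iou`, `tpr` and `tss_select` are hypothetical, and the saliency maps are assumed to be binarized before comparison:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union between a binarized saliency map and its GT."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / max(union, 1)

def tpr(pred, gt):
    """True positive rate: fraction of the GT foreground that is recovered."""
    return np.logical_and(pred, gt).sum() / max(gt.sum(), 1)

def tss_select(depth_preds, rgb_preds, gts, th1=0.9):
    """Two-steps Sample Selection: Step 1 keeps depth images whose depth
    saliency map reaches IoU >= th1; Step 2 keeps those with TP_d > TP_r."""
    keep = []
    for k, (p_d, p_r, gt) in enumerate(zip(depth_preds, rgb_preds, gts)):
        if iou(p_d, gt) >= th1 and tpr(p_d, gt) > tpr(p_r, gt):
            keep.append(k)
    return keep
```

With th1 = 0.9 as in the paper, only depth images that both agree closely with the GT and recover more of the foreground than the corresponding RGB prediction survive both steps.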
Given the RGB-D image set RD constructed above, the image generation network in [49] is re-trained in this paper. Here, the selected RGB images are used as the inputs of the image generation network and their corresponding depth images are used as the supervision information. After training, we feed all of the RGB images contained in the original RGB-D image pairs into the re-trained image generation network to produce their corresponding high-quality and foreground-consistent pseudo depth images. These generated pseudo depth images will be used to calibrate the original depth images in the subsequent saliency reasoning stage. As shown in Fig. 4, compared to the original low-quality or foreground-inconsistent depth images, the generated pseudo depth images usually have higher visual quality or better foreground consistency with their corresponding RGB images.

C. Saliency Reasoning Stage
In the saliency reasoning stage, the original depth features are first calibrated by using the pseudo depth images generated in the first stage and then fused with the RGB features to better capture the cross-modal complementary information within the multi-modal RGB-D images. To this end, we propose a Feature Calibrating and Fusing Network (FCFNet) in this stage for saliency detection. Specifically, as shown in Fig. 5, the proposed FCFNet contains three key components: a DFC module, an MMF module and an SFI module. First, a VGG16 based three-stream encoding network is deployed to simultaneously extract hierarchical features from the RGB images, original depth images and pseudo depth images, which are denoted as r^i, d_o^i and d_pse^i (i = 1, 2, 3, 4, 5), respectively. Here, i denotes the feature level index. After that, the DFC module is employed to calibrate the original depth features d_o^i with the aid of the pseudo depth features d_pse^i, obtaining the calibrated depth features d^i. Subsequently, several MMF modules are performed on the RGB features r^i and the calibrated depth features d^i, achieving cross-modal feature fusion. In particular, considering the trade-off between computational complexity and saliency detection performance, the MMF module is only performed on the last three levels, as in [52]. Meanwhile, to avoid the loss of detailed information from the shallower levels during the cross-modal feature fusion, the proposed SFI module is further employed to inject the valuable unimodal features from the shallower levels (i.e., the first two levels) into the middle level (i.e., the third level) of unimodal features before they are fused. Finally, the fused cross-modal features are integrated in a progressive way, and some auxiliary loss functions are applied to facilitate the optimization, obtaining three levels of saliency maps S^(t) (t = 3, 4, 5). Here, S^(3) is taken as the final saliency map in this paper. In particular, the existing BCE and IoU losses and our proposed RCA loss are jointly imposed on FCFNet for better saliency detection results. Details about these components are given below.
1) Depth Feature Calibration Module: As discussed earlier, there may exist some low-quality or foreground-inconsistent depth images in the original RGB-D image pairs. Directly using the features extracted from the original depth images may easily deteriorate the performance of an RGB-D SOD model. To address this problem, as shown in Fig. 6, we propose a Depth Feature Calibration (DFC) module to calibrate the original depth features with the aid of the generated pseudo depth images before they are fed into the cross-modal feature fusion module.
Specifically, we first suppress the unreliable cues contained in the original depth features d_o^i by utilizing the extracted pseudo depth features d_pse^i to re-weight the importance of the unimodal features in the channel dimension. More specifically, the original depth features d_o^i and the pseudo depth features d_pse^i are first concatenated to learn a set of channel-wise weights via convolution and global average pooling operations, i.e.,

W^i = σ(GAP(Conv3(Cat(d_o^i, d_pse^i), α))),

where GAP(*) denotes the global average pooling, σ(*) refers to the Sigmoid operation, W^i denotes the channel-wise weights for the i-th level of original depth features d_o^i, Conv3(*, α) denotes a 3 × 3 convolutional layer with parameters α, and Cat(*, *) denotes the concatenation operation. With the channel-wise weights, the initially calibrated depth features d_ic^i are obtained by

d_ic^i = (W^i ⊗ d_o^i) ⊕ ((1 − W^i) ⊗ d_pse^i),

where ⊗ denotes the element-wise multiplication, ⊕ denotes the element-wise addition and 1 denotes an all-one vector with the same size as W^i. Then, the pseudo depth features d_pse^i are further applied to calibrate d_ic^i via a spatial-attention mechanism, achieving salient content alignment, i.e.,

d_en^i = Wsa(d_pse^i) ⊗ d_ic^i,

where Wsa(*) denotes the spatial weight generation function [53], i.e.,

Wsa(x) = σ(Conv7(Cat(Avg(x), Max(x)), ω)).

Here, Conv7(*, ω) denotes a 7 × 7 convolutional layer with parameters ω, Avg(*) denotes the average pooling operation along the channel dimension and Max(*) denotes the max pooling operation along the channel dimension. Finally, the pseudo depth features d_pse^i are further embedded into the enhanced depth features d_en^i via a skip connection, obtaining the finally calibrated depth features d^i, i.e.,

d^i = d_en^i ⊕ d_pse^i.

(Figure caption: Next, unimodal RGB features and calibrated depth features are fused via the proposed MMF module to capture cross-modal complementary information and multi-scale context information. In addition, in order to avoid the loss of detailed information contained in the shallower levels of features, the SFI module is applied to inject such important detailed information from the first two levels into the third level before performing cross-modal feature fusion. Finally, the cross-modal features are fed into the decoder to progressively achieve saliency reasoning.)

Fig. 7 visualizes the calibration of the original depth features with the aid of the extracted pseudo depth features in the DFC module. Compared with the original depth features in Fig. 7(d), the finally calibrated depth features in Fig. 7(g) contain more reliable depth information about the salient objects, while suppressing disturbing cues within the background regions.
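The calibration pipeline above can be sketched numerically as follows (a minimal numpy sketch, not the trained network: the learned 3 × 3 and 7 × 7 convolutions are replaced by a fixed channel-mixing matrix `w_mix` and a parameter-free attention, which are our simplifications):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dfc_calibrate(d_o, d_pse, w_mix):
    """Sketch of Depth Feature Calibration on (C, H, W) feature maps.
    d_o: original depth features, d_pse: pseudo depth features,
    w_mix: (C, 2C) stand-in for the learned convolution."""
    cat = np.concatenate([d_o, d_pse], axis=0)             # Cat(d_o, d_pse)
    mixed = np.tensordot(w_mix, cat, axes=([1], [0]))      # "conv" -> (C, H, W)
    w = sigmoid(mixed.mean(axis=(1, 2)))[:, None, None]    # GAP + sigmoid -> W^i
    d_ic = w * d_o + (1.0 - w) * d_pse                     # initial calibration
    # Spatial attention from the pseudo features (channel avg + max pooling).
    sa = sigmoid(d_pse.mean(axis=0) + d_pse.max(axis=0))   # (H, W) weights
    d_en = sa[None] * d_ic                                 # salient content alignment
    return d_en + d_pse                                    # skip connection -> d^i
```

The channel weights decide, per channel, how much to trust the original depth features versus their pseudo counterparts, while the spatial attention derived from the pseudo features re-emphasizes foreground-consistent regions.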
2) Shallow-Level Feature Injection Module: As aforementioned, we only perform the cross-modal feature fusion on the last three levels of unimodal features to reduce computational complexity. However, doing so loses some important detailed information from the shallower levels, which is crucial for refining the boundaries of salient objects. In view of this, we design an SFI module that selectively injects the valuable unimodal features from the first two levels into the third level of unimodal features.
Fig. 8 illustrates the details of the proposed SFI module, where the injection of shallower-level RGB features is taken as an example. As shown in Fig. 8, in SFI, the first two levels of unimodal RGB features are first concatenated to obtain the shallower-level fusion features r_sl. Specifically, the two shallow levels are downsampled (Down(*)) to a common resolution and passed through two 3 × 3 convolutional layers, Conv3(*, ε) and Conv3(*, β), with parameters ε and β, respectively. The same spatial-attention mechanism as in the DFC module is then applied, driven by the third level of RGB features r^3, to select the valuable detailed information from r_sl. Finally, the selected RGB features are further injected into the third level of RGB features via a 3 × 3 convolutional layer Conv3(*, γ) with parameters γ, obtaining the enhanced RGB features r_en^3. Accordingly, we also obtain the middle level of enhanced depth features d_en^3.

3) Multi-Modal Multi-Scale Fusion Module: Simple fusion strategies can hardly fully exploit the complementary information within RGB-D image pairs. Considering that, we propose a Multi-modal Multi-scale Fusion (MMF) module to fuse the RGB and calibrated depth features, where the unimodal features (r^i and d^i) are first fused and then enhanced from a cross-scale perspective for better saliency detection results. As shown in Fig. 9, in MMF, a weighted channel attention mechanism is first employed to re-weight and fuse the corresponding channels of the unimodal features from different modalities, resulting in initially fused features. Afterwards, a Multi-Scale Attention (MA) module is designed, where the initially fused features are fed into four parallel branches to capture multi-scale contextual information for dealing with the challenge of object scale variations in the scenes. Meanwhile, the importance of the extracted multi-scale features is re-assigned to obtain the finally fused features. Specifically, the unimodal features r^i and d^i (i = 3, 4, 5) are first concatenated into F^i and then fed into some convolutional layers to learn the channel-wise relative importance weights W_fus^i, i.e.,

W_fus^i = σ(MLP(GAP(CB(F^i, ϕ)), η) + MLP(GMP(CB(F^i, ϕ)), ψ)).

In particular, when i = 3, r^3 and d^3 are the enhanced unimodal features (r_en^3 and d_en^3) obtained by the SFI module. CB(*, ϕ) denotes a convolutional block with parameters ϕ, which contains two 3 × 3 convolutional layers. MLP(*, η) and MLP(*, ψ) denote two fully connected blocks with parameters η and ψ, respectively. GAP(*) and GMP(*) denote the global average pooling and global max pooling operations, respectively. With the obtained channel-wise weights W_fus^i, the initially fused features F_fus^i are obtained by

F_fus^i = (W_fus^i ⊗ r^i) ⊕ ((1 − W_fus^i) ⊗ d^i),

where 1 denotes an all-one vector with the same size as W_fus^i. On top of that, as illustrated in the bottom row of Fig. 9, four parallel convolutional layers with different kernel sizes are performed on the initially fused features F_fus^i, obtaining four scales of features, i.e.,

F_m^i = Conv_m(F_fus^i, θ_m), m = 3, 5, 7, 9,

where Conv_m(*, θ_m) denotes a convolutional layer with kernel size m × m and parameters θ_m, and F_m^i denotes the m-th scale of features in the i-th level.
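The fusion path so far can be sketched as follows (a heavily simplified numpy sketch: the learned CB/MLP attention collapses into a fixed weight vector `w_fus`, each m × m convolution branch is stood in for by a parameter-free box filter, and `scale_scores` are placeholders for the scale-attention outputs described next):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def box_filter(x, m):
    """Mean over an m x m neighborhood (zero-padded): a parameter-free
    stand-in for one m x m convolution branch."""
    C, H, W = x.shape
    p = m // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = xp[:, i:i + m, j:j + m].mean(axis=(1, 2))
    return out

def mmf_fuse(r, d, w_fus, scale_scores, kernels=(3, 5, 7, 9)):
    """Sketch of Multi-modal Multi-scale Fusion on (C, H, W) features.
    w_fus: (C,) channel weights in [0, 1]; scale_scores: one scalar per
    branch, standing in for the scale-attention importance values."""
    w = w_fus[:, None, None]
    f_fus = w * r + (1.0 - w) * d                    # channel-weighted fusion
    branches = [box_filter(f_fus, m) for m in kernels]
    ws = softmax(np.asarray(scale_scores, float))    # importance per scale
    return sum(wi * f for wi, f in zip(ws, branches))
```

The softmax over scale scores implements the re-assignment of importance across the four branches; in the full module those scores come from SE blocks rather than being supplied directly.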
In addition, considering that different scales of features make different contributions to the final saliency prediction, we propose a scale-aware attention mechanism to adaptively fuse the multi-scale features. Specifically, we first feed the multi-scale features into several SE blocks [54], obtaining their scale attention weights V_m^i (m = 3, 5, 7, 9), respectively. Then, the softmax function is performed on V_m^i to obtain the multi-scale weight vector W^i, which reflects the feature importance of the different scales. Finally, the multi-scale features are re-weighted by using this importance weight vector, yielding the finally enhanced cross-modal features F^i in the i-th level. Here, SE(*) denotes the SE block in [54].

4) Loss Function: a) RCA loss: The BCE loss [30] and the IoU loss [31] are two widely used loss functions in SOD, which provide pixel-level constraints to force the predicted results to be close to the GTs. However, they usually ignore the local regional consistency within the saliency maps, resulting in inaccurate saliency detection results. In view of this, we propose a novel auxiliary loss function, i.e., the Region Consistency Aware (RCA) loss, where the local regional consistency within the foreground regions and that within the background regions are simultaneously considered.
For that, we first compute the false negative (FN) part in the foreground salient object regions and the false positive (FP) part in the background regions by using Eq. (17) and Eq. (18), respectively:

FN = |S − G| ⊗ G,   (17)
FP = |S − G| ⊗ (1 − G),   (18)

where S denotes the predicted saliency map, G denotes its corresponding ground truth, and |∗| denotes the absolute value. Given FN and FP, the proposed RCA loss ℓ_rca is computed by Eq. (19) to enhance the local regional saliency consistency within the foreground salient object regions as well as within the background regions by reducing the saliency differences between each pixel and its surrounding pixels:

ℓ_rca = (1 / (H × W)) Σ (|FN − Avg_spa(FN)| ⊗ G + |FP − Avg_spa(FP)| ⊗ (1 − G)).   (19)
In Eq. (19), Avg_spa(∗) denotes the 7 × 7 average pooling operation along the spatial dimension, H and W denote the height and width of the saliency map, respectively, and 1 denotes an all-one vector with the same size as G. The proposed RCA loss is beneficial for facilitating the completeness of salient objects and for suppressing the introduction of disturbing backgrounds into the saliency map, which will be verified in the experimental part.

b) Total loss: Given the proposed RCA loss ℓ_rca, a joint loss function L_joint is employed to train our proposed FCFNet, which is defined by

L_joint = ℓ_bce(S, G) + ℓ_iou(S, G) + λ · ℓ_rca,   (20)

where ℓ_bce(∗) and ℓ_iou(∗) denote the BCE loss and the IoU loss, respectively. The hyper-parameter λ is experimentally set to 15 in this paper.
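The RCA and joint losses described above can be sketched as follows. This is a hedged reimplementation from the textual description: the exact masking and normalization used by the authors may differ, and a simple soft-IoU stands in for the cited IoU loss [31].

```python
import torch
import torch.nn.functional as F


def rca_loss(s, g):
    """Sketch of the RCA loss: penalize the deviation of each FN/FP
    pixel from its 7x7 local average, separately in the foreground
    (mask g) and background (mask 1 - g)."""
    fn = (g - s).abs() * g            # false-negative part (foreground)
    fp = (g - s).abs() * (1.0 - g)    # false-positive part (background)
    fn_dev = (fn - F.avg_pool2d(fn, 7, stride=1, padding=3)).abs() * g
    fp_dev = (fp - F.avg_pool2d(fp, 7, stride=1, padding=3)).abs() * (1.0 - g)
    return (fn_dev + fp_dev).mean()   # mean over H x W (and batch)


def joint_loss(s, g, lam=15.0):
    """Sketch of the total loss: BCE + soft-IoU + lambda * RCA,
    with lambda = 15 as reported in the paper."""
    s = s.clamp(1e-6, 1.0 - 1e-6)     # numerical safety for BCE
    bce = F.binary_cross_entropy(s, g)
    inter = (s * g).sum()
    iou = 1.0 - (inter + 1.0) / (s.sum() + g.sum() - inter + 1.0)
    return bce + iou + lam * rca_loss(s, g)
```

Note that when the prediction matches the ground truth exactly, both FN and FP vanish and the RCA term is zero, so the loss only activates on erroneous, locally inconsistent regions.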
As shown in Fig. 3 and similar to [62], multi-level supervisions are also performed on the saliency maps.

B. Implementation Details
We implement the proposed model in PyTorch [72] on an NVIDIA 1080Ti GPU. Following the same splitting way as in [27], [55], we select 800 samples from DUT-RGBD [55], 1485 samples from NJU2K [63] and 700 samples from NLPR [64] for training. The remaining images in these three datasets and the images in the remaining two datasets are all used for testing to verify the performance of different models. In both the training and testing phases, all input images are resized to 224 × 224. Random horizontal flipping and random vertical flipping are adopted for data augmentation. The image generation network is trained with the Adam optimizer [73], where the batch size, learning rate and weight decay are set to 4, 1e-4 and 4e-5, respectively. FCFNet is trained with the Stochastic Gradient Descent (SGD) optimizer [74], where the batch size, momentum and weight decay are set to 5, 0.9 and 5e-4, respectively. Meanwhile, the learning rate is set to 1e-4 and divided by 10 after 35 epochs. The maximum number of training epochs is 70.
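The optimizer settings reported above can be written down as the following PyTorch configuration sketch. `gen_net` and `fcf_net` are stand-in placeholder modules, not the actual networks.

```python
import torch

# Placeholder modules standing in for the image generation network
# and FCFNet; only the optimizer hyper-parameters below are from the paper.
gen_net = torch.nn.Conv2d(3, 1, 3, padding=1)
fcf_net = torch.nn.Conv2d(4, 1, 3, padding=1)

# Image generation network: Adam, lr 1e-4, weight decay 4e-5 (batch size 4)
gen_opt = torch.optim.Adam(gen_net.parameters(), lr=1e-4, weight_decay=4e-5)

# FCFNet: SGD, lr 1e-4, momentum 0.9, weight decay 5e-4 (batch size 5)
fcf_opt = torch.optim.SGD(fcf_net.parameters(), lr=1e-4,
                          momentum=0.9, weight_decay=5e-4)

# lr divided by 10 after 35 epochs; training runs for at most 70 epochs
scheduler = torch.optim.lr_scheduler.StepLR(fcf_opt, step_size=35, gamma=0.1)
```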

C. Comparison With State-of-the-Art Models
We compare our proposed model with 21 state-of-the-art RGB-D SOD methods, including MMCI [18], TANet [17], DMRA [55], CPFP [56], D3Net [24], ICNet [19], A2dele [27], S2MA [57], DRLF [58], FRDT [59], JL-DCF [60], SSF [52], CMWNet [75], DQSD [28], DFMNet [21], CCAFNet [38], CIRNet [39], GCENet [47], CFIDNet [40], HINet [41] and DCMF [42]. For fair comparisons, the saliency maps of these SOTA models are obtained from their authors or generated with the codes provided by their authors. Some visualization results are illustrated in Fig. 10. As shown in the first two rows of Fig. 10, all of the methods mentioned here perform well on images with simple scenes. However, as shown in the 3rd, 4th, 5th and 6th rows of Fig. 10, when the depth images have low visual qualities or have some foregrounds inconsistent with their corresponding RGB images, those SOTA methods cannot obtain desirable detection results. For example, some salient objects are not completely detected, or parts of the backgrounds are not well suppressed during saliency detection. Moreover, as shown in the last two rows of Fig. 10, when the scenes contain multiple objects, most SOTA models fail to completely detect all salient objects, while our method can still effectively detect the salient objects in these challenging scenes. This may be due to the fact that the undesirable information in the original low-quality or foreground-inconsistent depth images is well calibrated by using the generated pseudo depth images. As well, the multi-scale context information may be effectively exploited for saliency detection in our method because of the proposed MMF module. As shown in Table I and Fig.
11, it can be easily seen that our proposed model outperforms those SOTA methods in terms of the five metrics, i.e., Fβ, Sα, Eξ, MAE and PR curves, on the NJU2K [63], DUT-RGBD [55] and LFSD [66] datasets. On NLPR [64] and STERE [65], our proposed model also achieves results competitive with these SOTA methods.

D. Ablation Study
Our proposed method consists of two stages, i.e., an image generation stage and a saliency reasoning stage. Thus, we conduct two sets of ablation experiments on the NJU2K dataset [63] to verify the effectiveness of each component in each stage. First, we verify the effectiveness of the TSS strategy proposed in the image generation stage. Then, we verify the effectiveness of each component in FCFNet proposed in the saliency reasoning stage.
1) Validity of TSS Strategy: In order to demonstrate the effectiveness of the proposed TSS strategy, we first construct a baseline model, denoted as 'B', where the DFC and SFI modules, together with the RCA loss, are removed from FCFNet, and MMF is replaced with element-wise addition operations. On top of that, we construct four versions of our proposed TSS strategy by using different sample selection ways, i.e., TSS (w/o 'step1' and 'step2'), TSS (w/o 'step1'), TSS (w/o 'step2') and TSS. In TSS (w/o 'step1' and 'step2'), 'step1' and 'step2' are simultaneously removed from our proposed TSS strategy, which means that all original depth images are used as the supervision information for the image generation network. In TSS (w/o 'step1') and TSS (w/o 'step2'), 'step1' and 'step2' are removed from our proposed TSS strategy, respectively.
The corresponding quantitative results of the different versions are provided in Table II. It can be seen that using all original depth images as the supervision information even deteriorates the performance of 'B'. This is due to the fact that the original low-quality depth images contaminate the qualities of the generated pseudo depth images, thus leading to unsatisfactory results. Compared to utilizing all original depth images as supervision information, the performance of 'B' is enhanced after employing the proposed TSS, which indicates the validity of TSS. Meanwhile, we can also observe that the performance of 'B+TSS' is degraded when 'step1' or 'step2' is removed, which indicates that each selection step is beneficial for the generation of pseudo depth images.
Furthermore, as shown in Fig. 12, we also provide some visual comparisons of the generated pseudo depth images to further verify the validity of TSS. As shown in Fig. 12(c), since the original low-quality and foreground-inconsistent depth images are utilized to supervise the image generation network, the pseudo depth images generated by TSS (w/o 'step1' and 'step2') still have low visual qualities or have foregrounds inconsistent with their corresponding RGB images. Differently, as shown in the first two rows of Fig. 12(e), the visual qualities of the pseudo depth images generated by TSS (w/o 'step2') are significantly improved by removing the low-quality depth images from the supervision information. However, as shown in the last two rows of Fig. 12(e), this variant may fail for depth images whose foregrounds are inconsistent with their corresponding RGB images. Similar phenomena are observed for TSS (w/o 'step1'). Specifically, the pseudo depth images generated by TSS (w/o 'step1') have foregrounds consistent with their corresponding RGB images, but their visual qualities still leave large room for improvement. By comparing Fig. 12(d), Fig. 12(e) and Fig. 12(f), it is easily observed that better pseudo depth images are obtained by using the full TSS strategy than by the other versions. This indicates that our proposed TSS strategy can better select high-quality and foreground-consistent depth images for supervising the image generation network than TSS (w/o 'step1') and TSS (w/o 'step2') do, thus generating better pseudo depth images for calibrating the original depth information.
2) Validity of Each Component in FCFNet: Specifically, the following five models are mainly involved in our ablation study to demonstrate the effectiveness of each component in FCFNet: FCFNet w/o DFC, FCFNet w/o SFI, FCFNet w/o MMF, FCFNet w/o RCA, and FCFNet (our proposed full model).

3) Validity of DFC:
In FCFNet w/o DFC, the features extracted from the original depth images are directly fed into the subsequent fusion module for saliency prediction. As shown in Table III, we can observe that FCFNet w/o DFC performs worse than the proposed FCFNet, which indicates the effectiveness of our DFC module for the calibration of the original depth features. Furthermore, as shown in Fig. 13, the influence of the unreliable information from the original depth images can be effectively reduced with the introduction of DFC.
4) Validity of SFI: As reported in Table III, SFI also promotes the performance of saliency detection by injecting the detailed information from the shallower levels into the middle level of features. Intuitively, as shown in Fig. 14, the detected salient objects have sharper boundaries by virtue of our proposed SFI module. This indicates that SFI can effectively exploit the detailed information contained in the shallower-level features for better saliency prediction.
5) Validity of MMF: It can be seen from the quantitative results in Table III that replacing our proposed MMF module with an element-wise addition operation decreases the performance of FCFNet. This demonstrates that MMF can facilitate the exploration of cross-modal and multi-scale complementary information for SOD. Moreover, we also provide some visual comparisons in Fig. 15 to verify the effectiveness of our proposed MMF module, which demonstrate that multiple salient objects can be simultaneously detected by using MMF.
Especially, in order to further demonstrate the validity of MMF for multi-scale information exploitation, we construct different versions of our proposed MMF module by keeping CF unchanged and replacing MA with different modules, denoted as MMF w/o MA, MMF w/o MA+ASPP and MMF w/o MA+DASPP, respectively. In MMF w/o MA, MA is directly removed from our proposed MMF, which means that the multi-scale contextual information is not further exploited for SOD. In MMF w/o MA+ASPP, the ASPP module is added to MMF w/o MA, where the dilation rates are set to 6/12/18/24 following [76]. In MMF w/o MA+DASPP, the DASPP module is added to MMF w/o MA, where the dilation rates are set to 3/6/12/18/24 following [77]. As shown in Table IV, our proposed MMF obtains better performance than these alternatives, which indicates that MMF can more effectively capture the multi-scale information from the cross-modal RGB-D features via the proposed MA sub-module.
6) Validity of RCA: As shown in Table III, the performance of SOD is significantly degraded after removing the RCA loss. This can also be verified in Fig. 16. As shown in the 1st row of Fig. 16, incomplete foregrounds are easily obtained if the RCA loss is removed from the joint loss function defined by Eq. (20). Similarly, as shown in the 2nd row of Fig. 16, some backgrounds are mistakenly detected as salient without the proposed RCA loss in FCFNet. This indicates that enforcing local regional saliency consistency within the foreground salient object regions as well as within the background regions is beneficial for the accurate segmentation of salient objects, and thus better saliency results are achieved by using our proposed RCA loss.

V. CONCLUSION
In this paper, a two-stage RGB-D salient object detection model has been presented, which is composed of an image generation stage and a saliency reasoning stage. In the image generation stage, owing to the proposed TSS strategy, high-quality and foreground-consistent depth images can be generated from the input RGB images. In the saliency reasoning stage, the original depth features are first calibrated by using the generated pseudo depth images via the proposed DFC module, and then fused with the RGB features for saliency prediction via the proposed SFI and MMF modules. By virtue of the proposed calibration-then-fusion strategy, the influence of low-quality depth images as well as that of foreground-inconsistent depth images on the saliency prediction can be greatly reduced. Moreover, thanks to the proposed RCA auxiliary loss function in the saliency reasoning stage, where the local regional saliency consistency within the foreground salient object regions and that within the background regions are both considered, more complete salient objects and less disturbing backgrounds can be obtained in the final saliency maps. Experimental results on six benchmark datasets demonstrate the validity of our proposed RGB-D SOD model, especially when the depth images have low visual qualities or have some foregrounds inconsistent with their corresponding RGB images.

Fig. 1 .
Fig. 1. Visualization of different types of RGB-D images and saliency maps deduced from different input data. (a) RGB images; (b) Depth images; (c) GTs; (d) Saliency maps deduced from RGB images; (e) Saliency maps deduced from depth images; (f) Saliency maps deduced from RGB-D images. In the 1st and 2nd rows, the RGB and depth images are both of high visual quality. In the 3rd and 4th rows, the RGB images are of high visual quality, but the depth images are of low visual quality. In the 5th and 6th rows, the RGB and depth images contain inconsistent foreground salient objects.

Fig. 4 .
Fig. 4. Visualization of the pseudo depth images obtained by using our proposed TSS strategy. (a) RGB images; (b) Depth images; (c) Generated pseudo depth images; (d) GTs for saliency maps.

Fig. 5 .
Fig. 5. Diagram of the proposed FCFNet. First, RGB features, original depth features and pseudo depth features are extracted by the three-stream backbone network. Then, the original depth features and pseudo depth features are fed into the DFC module to calibrate the original depth features. Next, the unimodal RGB features and the calibrated depth features are fused via the proposed MMF module to capture cross-modal complementary information and multi-scale context information. In addition, in order to avoid the loss of the detailed information contained in the shallower levels of features, the SFI module is applied to inject such detailed information from the first two levels into the third level before performing cross-modal feature fusion. Finally, the cross-modal features are fed into the decoder to progressively achieve saliency reasoning.
(a) RGB images; (b) Depth images; (c) Saliency maps obtained by FCFNet w/o MMF; (d) Saliency maps obtained by FCFNet; (e) GTs.The red boxes indicate the obvious differences.

Fig. 16 .
Fig. 16. Visual comparisons between FCFNet w/o RCA and FCFNet. (a) RGB images; (b) Depth images; (c) Saliency maps obtained by FCFNet w/o RCA; (d) Saliency maps obtained by FCFNet; (e) GTs. The red boxes indicate the false positive regions and the blue boxes indicate the false negative regions.

TABLE I
QUANTITATIVE COMPARISONS OF OUR PROPOSED MODEL WITH 21 STATE-OF-THE-ART RGB-D SALIENCY MODELS ON SIX BENCHMARK DATASETS. THE BEST THREE RESULTS ARE SHOWN IN RED, GREEN AND BLUE, RESPECTIVELY.

Fig. 11. Quantitative comparisons of our proposed method with other methods on six benchmark datasets.

TABLE II
QUANTITATIVE EVALUATION OF ABLATION STUDIES OF TSS ON THE NJU2K DATASET

TABLE III
QUANTITATIVE EVALUATION OF ABLATION STUDIES OF FCFNET ON THE NJU2K DATASET

TABLE IV
QUANTITATIVE EVALUATION OF ABLATION STUDIES OF MMF ON THE NJU2K DATASET

Fig. 15. Visual comparisons between FCFNet w/o MMF and FCFNet.