Beyond One-to-One: Rethinking the Referring Image Segmentation

Referring image segmentation aims to segment the target object referred to by a natural language expression. However, previous methods rely on the strong assumption that one sentence must describe one target in the image, which is often not the case in real-world applications. As a result, such methods fail when the expressions refer to either no object or multiple objects. In this paper, we address this issue from two perspectives. First, we propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches and enables information flow in two directions. In the text-to-image decoder, the text embedding is utilized to query the visual feature and localize the corresponding target. Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature. In this way, the visual features are encouraged to contain the critical semantic information about the target entity, which in turn supports accurate segmentation in the text-to-image decoder. Second, we collect a new challenging but realistic dataset called Ref-ZOM, which includes image-text pairs under different settings. Extensive experiments demonstrate that our method achieves state-of-the-art performance on different datasets, and the Ref-ZOM-trained model performs well on various types of text inputs. Code and dataset are available at https://github.com/toggle1995/RIS-DMMI.


Introduction
Referring image segmentation aims to segment the target object described by a given natural language expression. Compared to the traditional semantic segmentation task [32,41,3], referring image segmentation is not restricted to predefined classes and can selectively segment specific individuals according to the text description, which has large potential value for various applications such as human-robot interaction [43] and image editing [2]. Despite recent progress, several important challenges still need to be addressed in order to make this technology more applicable in real-world scenarios.
In referring image segmentation, most previous methods concentrate only on the one-to-one setting, where each sentence indicates exactly one target in the image. However, as shown in Fig. 1(a), the one-to-many and one-to-zero settings, where the sentence indicates many or no targets in the image, respectively, are also common and critical in real-world applications. Unfortunately, previous methods tend to struggle when confronting one-to-many and one-to-zero samples. As illustrated in Fig. 1(b), the recent SOTA method LAVT [48] localizes only one person in the image when given the description "Three persons playing baseball". As for one-to-zero input, previous methods still segment one target even if it is completely irrelevant to the given text. Therefore, it is imperative to enable the model to adapt to various types of text inputs.
We attribute this problem to two main factors. First, although existing methods design various ingenious modules to align multi-modal features, most of them only supervise the pixel matching of the segmentation map, which cannot ensure that the significant semantic clues from the text are fully incorporated into the visual stream. As a result, the visual features lack a comprehensive understanding of the entity referred to in the expression, which limits the capacity of the model when it confronts various types of text inputs. Second, all popular datasets [21,38,36] for referring image segmentation are established under the one-to-one assumption. During training, the model is enforced to localize the one entity that is most related to the text. As a result, a model trained on these datasets is prone to overfitting and only remembers to segment the object with the largest response, which leads to failure when segmenting one-to-many and one-to-zero samples.
To address the aforementioned issues, this paper proposes a Dual Multi-Modal Interaction (DMMI) Network to achieve robust segmentation given various types of text expressions, and establishes a new comprehensive dataset, Ref-ZOM (Zero/One/Many). The DMMI network addresses the referring segmentation task in a dual manner, which not only incorporates the text information into the visual features but also enables information flow from the visual stream to the linguistic one. As illustrated in Fig. 2, the whole framework contains two decoder branches. On the one hand, in the text-to-image decoder, linguistic information is incorporated into the visual features to segment the corresponding target. On the other hand, we randomly erase the entity-phrase in the original sentence and extract the incomplete linguistic feature. Then, in the image-to-text decoder, given the incomplete text embedding, we utilize the Context Clue Recovery (CCR) module to reconstruct the missing information conditioned on the visual features. Meanwhile, multi-modal contrastive learning is also deployed to assist the reconstruction. By doing so, the visual feature is encouraged to fully incorporate the semantic clues about the target entity, which promotes multi-modal feature interaction and leads to more accurate segmentation maps. Additionally, to facilitate the two decoder branches, we design a Multi-scale Bi-direction Attention (MBA) module to align the multi-modal information in the encoder. Beyond the interaction between a single pixel and a single word [48], the MBA module enables multi-modal interaction over local regions of various sizes, leading to a more comprehensive understanding of the multi-modal features.
With Ref-ZOM, we establish a comprehensive and challenging dataset to promote referring image segmentation given various types of text inputs. On the one hand, compared to the existing widely-used datasets [21,38,36], the text expressions in Ref-ZOM are more complex. They are not limited to the one-to-one assumption; instead, an expression can refer to multiple or no targets within the image. Additionally, the language style in Ref-ZOM is much more flowery than the short phrases found in [21]. On the other hand, Ref-ZOM also surpasses most mainstream datasets in size, containing 55,078 images and 74,942 annotated objects.
We conduct extensive experiments on three popular datasets [21,38,36], where our DMMI achieves state-of-the-art results. Meanwhile, we reproduce some representative methods on our newly established Ref-ZOM dataset, where the DMMI network consistently outperforms existing methods and exhibits a strong ability to handle one-to-zero and one-to-many text inputs. Moreover, the Ref-ZOM-trained network exhibits remarkable generalization capacity when transferred to different datasets without fine-tuning, highlighting its potential for real-world applications.
The main contributions of this paper are summarized as follows: • We identify the deficiency of referring image segmentation when meeting one-to-many and one-to-zero text inputs, which strongly limits its application value in real-world scenarios.

Referring Image Segmentation
Referring image segmentation was first introduced by [15]. Early approaches [15,27,29,37] generally employ Convolutional Neural Networks (CNNs) [3,40,13] and Recurrent Neural Networks (RNNs) [14,17] to extract the relevant visual and linguistic features. After feature extraction, a concatenation-convolution operation is employed to fuse the multi-modal features. However, this fails to exploit the inherent interaction between image and text. To overcome this shortcoming, some approaches [16,18,47] establish relation-aware reasoning based on a multi-modal graph. Recently, due to the breakthrough of the Transformer in the computer vision community [8,46,12,1], Transformer-based backbones have become dominant in referring image segmentation for both visual and linguistic feature extraction [31,6,19,20]. Meanwhile, the self-attention mechanism [42] in the Transformer has also inspired numerous studies that employ cross-attention blocks for better cross-modal alignment. For instance, VLT [7] utilizes a cross-attention module to generate query vectors by comprehensively understanding the multi-modal features, which are then used to query the given image through a Transformer decoder. LAVT [48] finds that early fusion of multi-modal features via a cross-attention module brings better cross-modal alignment. Moreover, CRIS [44] utilizes Transformer blocks to transfer the strong image-text alignment ability of the pre-trained CLIP model [39].
However, most previous methods only supervise the visual prediction and cannot ensure that the semantic clues in the text expression have been incorporated into the visual features. As a result, these methods tend to struggle when handling text expressions that refer to either no object or multiple objects. In this work, we establish a dual network and emphasize that the information flow from image to text is beneficial for a comprehensive understanding of the text expression. Furthermore, we collect a new dataset called Ref-ZOM, which contains various types of text inputs and compensates for the limitations of existing benchmarks.

Visual-Language Understanding
Vision-language understanding has received rapidly growing attention in recent years and plays an important role in various tasks such as video retrieval [10], image-text matching [24], and visual question answering [53,25]. In these tasks, effective multi-modal interaction and a comprehensive understanding of both visual and linguistic features are critical to achieving good performance. Some previous works employ masked word prediction (MWP) to achieve this goal, where a proportion of the words in a sentence are randomly masked and the masked words are predicted conditioned on the visual inputs [23,11,54]. Most MWP methods directly predict the value of the token. In our work, instead of predicting a single token, we reconstruct the holistic representation of the text embedding and measure the global similarity, leading to a comprehensive understanding of the entire sentence.
Moreover, recently popular vision-language pre-training models [39,26,49,50] have demonstrated the remarkable ability of contrastive learning in cross-modal representation learning. Motivated by their success, we incorporate a contrastive loss in our image-to-text decoder to facilitate text reconstruction. The experimental results show that the two components are highly complementary and effectively enhance the semantic clues in the visual features.

Method
The Dual Multi-Modal Interaction (DMMI) network adopts the encoder-decoder paradigm illustrated in Fig. 2. In the encoder part, the visual encoder and text encoder are utilized to extract visual and linguistic features, respectively. During this process, the Multi-scale Bi-direction Attention (MBA) module is employed to perform cross-modal interaction. After feature extraction, the two modalities are delivered to the decoder part. In the text-to-image decoder, the text embedding is utilized to query the visual feature and generate the segmentation mask, while in the image-to-text decoder, the Context Clue Recovery (CCR) module reconstructs the erased information about the target entity conditioned on the visual features. Meanwhile, a contrastive loss is utilized to promote the learning of the CCR module. We elaborate each component of the DMMI network in the following sections.

Feature Encoder
Given the text expression T, we randomly mask the entity-phrase via the TextBlob tool [33] and generate its corresponding counterpart T′. Then, we feed both T and T′ into the text encoder to generate the linguistic features E, E′ ∈ R^{C_t×L}, where C_t and L indicate the number of channels and the length of the sentence. For the input image X, we utilize the visual encoder to extract the multi-level visual features V_n ∈ R^{C_n×H_n×W_n}. Here, C_n, H_n, and W_n denote the number of channels, height, and width, and n indicates the n-th stage.
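As a concrete illustration of how T′ can be built from T, the sketch below erases one entity-phrase and replaces it with a placeholder token. The `mask_phrase` helper and the `[MASK]` token are hypothetical stand-ins for this discussion; the paper selects the phrase with the TextBlob tool.

```python
# Hedged sketch: erase one entity-phrase from T to obtain the masked T'.
# `mask_phrase` is a hypothetical helper, not the paper's implementation.
def mask_phrase(sentence, phrase, mask_token="[MASK]"):
    """Replace one entity-phrase with a mask token to build T' from T."""
    if phrase not in sentence:
        return sentence  # nothing to erase
    return sentence.replace(phrase, mask_token, 1)

t = "the man in a red cap catches the ball"
t_prime = mask_phrase(t, "a red cap")
```

Both T and the masked T′ would then be tokenized and fed to the text encoder.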
During the feature extraction, the MBA module is applied hierarchically to perform cross-modal feature interaction.

Hierarchical Structure
As illustrated in Fig. 2(a), the visual encoder is implemented as a hierarchical structure with four stages, interleaved with the MBA module. For the shallow-layer feature V_1 extracted from the first stage of the visual encoder, we deliver it to the MBA module together with the linguistic feature E_1 and obtain V*_1 and E*_1. Then, V*_1 is sent back to the visual encoder, based on which V_2 is extracted through the next stage. Meanwhile, E*_1 is denoted as E_2 and will be utilized in the next MBA module. Similarly, V_2 and E_2 are fed to the MBA module again, and the generated V*_2 is delivered to the next part of the visual encoder. By doing so, the visual and linguistic features are jointly refined, achieving cross-modal alignment in both the text-to-image and image-to-text directions.
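The alternation above can be sketched at the shape level; `stage` and `mba` below are toy stand-ins (spatial downsampling and an identity refinement), not the paper's implementation.

```python
import numpy as np

def stage(v):
    # stand-in for one visual encoder stage: halve the spatial resolution
    return v[:, ::2, ::2]

def mba(v, e):
    # stand-in for the MBA module: returns jointly refined (V*, E*)
    return v.copy(), e.copy()

v = np.zeros((8, 32, 32))   # V1 from the first stage, laid out (C, H, W)
e = np.zeros((8, 20))       # E1: one 8-d vector per token, L = 20
feats = []
for n in range(4):
    v_star, e = mba(v, e)   # E*_n is reused as E_{n+1}
    feats.append(v_star)
    if n < 3:
        v = stage(v_star)   # V_{n+1} comes from the next encoder stage
```

The loop makes explicit that each refined V* feeds the next visual stage while each refined E* becomes the linguistic input of the next MBA call.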

Multi-scale Bi-direction attention Module
The MBA module jointly refines the visual feature V and linguistic feature E to achieve text-to-image and image-to-text alignment. To simplify the notation, we drop the stage subscript here. Inspired by the success of self-attention [42], most recent works utilize the cross-attention operation to perform cross-modal feature interaction. During this process, the visual feature V is first flattened to R^{C×N}, where N = W × H. Then, the feature interaction is formulated as:

V′ = Softmax((W_q V)ᵀ(W_k E) / √Ĉ)(W_v E)ᵀ,  (1)

where W_q, W_k, and W_v are three transform functions unifying the number of channels to Ĉ. However, Eq. 1 only establishes the relationship between a single pixel and a single word. In fact, beyond the single-point representation, local visual regions and text sequences also store critical information for a comprehensive understanding of the multi-modal features. Following this idea, we design two alignment strategies in MBA to capture the relationship between visual features and text sequences in different local regions.
Text-to-Image Alignment. To fully leverage the structural information in various regions, we compute the affinity coefficients W^{α,r}_att between each token and different local regions Ω^α_r, in which r indicates different spatial sizes and ranges from 1 to R. Ω^α_r slides across the whole spatial plane of the visual feature. Then, given the region Ω^α_r(i) centered at position i, the i-th row weight w^{α,r}_i of the attention matrix W^{α,r}_att ∈ R^{N×L} is calculated as:

w^{α,r}_i = Softmax( Σ_{m∈Ω^α_r(i)} (W_q V_m)ᵀ(W_k E) / √Ĉ ),  (2)

where w^{α,r}_i ∈ R^{1×L}, m enumerates all spatial positions in Ω^α_r(i), and V_m ∈ R^{Ĉ×1} denotes one specific feature vector in Ω^α_r(i). Then, over all Ω^α_r, the final affinity coefficient is calculated as:

W^α_att = Σ_{r=1}^{R} λ^α_r W^{α,r}_att,  (3)

where λ^α_r is a learnable parameter reflecting the importance of regions of different sizes. Finally, after the transform function W^α_v, the linguistic information is incorporated into the visual feature:

V* = V + (W^α_v E)(W^α_att)ᵀ.  (4)

Image-to-Text Alignment. In human perception, to fully comprehend a language expression, we associate the context information rather than understanding each word separately. Therefore, for each visual pixel, we also establish a connection with various text sequences Ω^β_r, where r indicates different lengths of the sequence and Ω^β_r slides across the whole sentence. For the text sequence Ω^β_r(i) starting at position i, we calculate the i-th row weight w^{β,r}_i of the affinity coefficients W^{β,r}_att ∈ R^{L×N} as:

w^{β,r}_i = Softmax( Σ_{m∈Ω^β_r(i)} (W_q E_m)ᵀ(W_k V) / √Ĉ ),  (5)

where w^{β,r}_i ∈ R^{1×N}, m enumerates all tokens in Ω^β_r(i), and E_m ∈ R^{Ĉ×1} represents one specific feature vector in Ω^β_r(i). Then, similar to Eq. 3, we average the W^{β,r}_att through a set of learnable parameters λ^β_r to obtain W^β_att. Afterwards, the visual information is involved to generate the refined text embedding:

E* = E + (W^β_v V)(W^β_att)ᵀ.  (6)
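The multi-scale idea can be sketched numerically: affinities between each token and a local visual window are pooled over the window, softmax-normalized, and windows of different radii are mixed with fixed weights standing in for the learnable λ. This is an illustrative approximation, not the exact MBA computation.

```python
import numpy as np

# Hedged sketch of multi-scale text-to-image affinities; mean-pooling the
# region and the fixed lambda weights are simplifying assumptions.
def region_attention(V, E, r):
    # V: (C, H, W) visual feature, E: (C, L) token features, r: window radius
    C, H, W = V.shape
    L = E.shape[1]
    W_att = np.zeros((H * W, L))
    for i in range(H):
        for j in range(W):
            region = V[:, max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            pooled = region.reshape(C, -1).mean(axis=1)      # aggregate region
            logits = pooled @ E / np.sqrt(C)                 # token affinities
            W_att[i * W + j] = np.exp(logits) / np.exp(logits).sum()  # softmax
    return W_att

V = np.random.default_rng(0).normal(size=(4, 6, 6))
E = np.random.default_rng(1).normal(size=(4, 5))
lam = [0.5, 0.5]                 # stand-in for learnable lambda_r weights
W_att = sum(l * region_attention(V, E, r) for l, r in zip(lam, [1, 2]))
```

Because the mixing weights sum to one, each row of the combined matrix remains a valid distribution over the L tokens.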

Text-to-Image Decoder
The whole structure of the text-to-image decoder is depicted in Fig. 2(b). As advocated in [41,3], we implement skip-connections between the encoder and decoder to introduce the spatial information stored in the shallow layers. Specifically, the text-to-image decoder can be described as:

Y_3 = ψ(V*_4, E*_4),  Y_2 = ϕ(Y_3, V*_3, V*_2),  (7)

in which ψ(·) indicates the Transformer decoder layer and ϕ(·) consists of two 3×3 convolutions followed by batch normalization and the ReLU function, in which features from the shallow parts of the encoder are aggregated with the decoder feature. Then, a series of convolution operations is applied to Y_2 to produce the two-class score map Ŷ_1, which is considered the final visual prediction of the DMMI network. Finally, we calculate the binary cross-entropy loss between Ŷ_1 and Y_gt, denoted as L_ce.
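The skip-connection bookkeeping can be illustrated at the shape level; nearest-neighbour upsampling and channel concatenation below are simplified stand-ins for the Transformer decoder layer ψ and the conv block ϕ, so the channel counts are illustrative only.

```python
import numpy as np

# Shape-level sketch of skip connections in the text-to-image decoder;
# upsample-and-concat is a stand-in for the paper's psi/phi blocks.
def upsample2x(x):
    return x.repeat(2, axis=1).repeat(2, axis=2)

def phi(decoder_feat, skip_feat):
    # aggregate a shallow encoder feature with the upsampled decoder feature
    return np.concatenate([upsample2x(decoder_feat), skip_feat], axis=0)

y3 = np.zeros((16, 8, 8))                       # deepest decoder feature
v2_star, v1_star = np.zeros((8, 16, 16)), np.zeros((4, 32, 32))
y2 = phi(y3, v2_star)                           # fuse with a shallow feature
y1 = phi(y2, v1_star)                           # fuse with a shallower one
```

Each fusion doubles the spatial resolution while stacking the skip channels on top of the decoder channels.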

Context Clue Recovery Module
Besides the text-to-image decoder, the DMMI network promotes the referring segmentation in a dual manner and facilitates the information flow from vision to text, as illustrated in Fig. 2(c). For the incomplete linguistic feature E′, we utilize the CCR module to reconstruct its masked information under the guidance of the visual feature V*_g. To support precise reconstruction, the visual feature is encouraged to contain the essential semantic clues stored in E = {e_l}^L_{l=1}, which boosts sufficient multi-modal interaction in the encoder part and in turn supports accurate segmentation in the text-to-image decoder.
Specifically, given the visual feature V*_g, we employ a Transformer decoder layer D(E′, V*_g) to recover the missing information in E′, where the visual feature V*_g is employed to query E′. Notably, we extract V*_g from the middle part of the text-to-image decoder, so it contains both spatial and semantic information. The output of D(E′, V*_g) is taken as the reconstructed text embedding, denoted as Ê = {ê_l}^L_{l=1}.
To enforce the CCR module to precisely recover the missing information, we measure the similarity distance between the reconstructed embedding Ê and E*_4, and calculate L_sim as:

L_sim = δ · (1 − sim(Ê, Detach(E*_4))),  (8)

where sim(·,·) denotes the cosine similarity. Here, δ is an indicator that is set to 0 if the sample is a one-to-zero case, where the text input is unrelated to the corresponding image, making it impossible to reconstruct the linguistic information. Additionally, Detach(E*_4) refers to stopping the gradient flow of E*_4 in Eq. 8, which prevents E*_4 from being misled by Ê during optimization.
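A minimal sketch of the indicator-gated similarity loss is given below, assuming a cosine distance (the specific distance function is an assumption of this sketch); copying the reference embedding mimics the Detach operation.

```python
import numpy as np

# Hedged sketch of the similarity loss with the one-to-zero indicator delta;
# cosine distance is an assumption, not necessarily the paper's exact choice.
def l_sim(e_hat, e_ref, delta):
    e_ref = e_ref.copy()  # stand-in for Detach(.): treated as a constant target
    cos = float(e_hat @ e_ref) / (np.linalg.norm(e_hat) * np.linalg.norm(e_ref))
    return delta * (1.0 - cos)

e = np.array([1.0, 0.0, 0.0])
perfect = l_sim(e, e, delta=1)   # perfect reconstruction gives zero loss
masked = l_sim(e, e, delta=0)    # one-to-zero sample: loss is switched off
```

Setting δ = 0 removes the reconstruction pressure entirely for one-to-zero samples, exactly as the indicator in Eq. 8 intends.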

Multi-modal Contrastive Learning
We calculate a contrastive loss to reduce the distance between the visual feature and its corresponding linguistic one, which helps reconstruct the text embedding from the visual representation. Specifically, we aggregate features from different parts of the text-to-image decoder to generate Ṽ*_d. Then, for the visual feature Ṽ*_d ∈ R^{B×N×C} and the linguistic feature Ẽ*_4 ∈ R^{B×L×C} in a batch, we pool them into V_o, E_o ∈ R^{B×C}. Afterwards, the contrastive loss is computed as:

L_con = (L_{I→T} + L_{T→I}) / 2,  (9)

where L_{I→T} and L_{T→I} denote the image-to-text and text-to-image contrastive losses, respectively:

L_{I→T} = −(1/B) Σ_{i=1}^{B} δ_i log( exp(sim(V_o^i, E_o^i)/τ) / Σ_{j=1}^{B} exp(sim(V_o^i, E_o^j)/τ) ),  (10)

L_{T→I} = −(1/B) Σ_{i=1}^{B} δ_i log( exp(sim(E_o^i, V_o^i)/τ) / Σ_{j=1}^{B} exp(sim(E_o^i, V_o^j)/τ) ),  (11)

where V_o^i, E_o^i ∈ R^C denote the i-th sample in a batch and B indicates the batch size. Meanwhile, δ is the one-to-zero indicator, and τ is the temperature hyper-parameter that scales the logits. Finally, the total loss is the summation of L_ce, L_sim, and L_con over the batch.
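The symmetric contrastive term can be sketched as a standard InfoNCE loss over the pooled features; how δ enters the normalization below is an assumption of this sketch, not a statement of the paper's exact formulation.

```python
import numpy as np

# InfoNCE sketch of the symmetric contrastive loss over pooled V_o, E_o;
# delta masks out one-to-zero samples (an assumption on how it enters).
def info_nce(Vo, Eo, delta, tau=0.07):
    Vo = Vo / np.linalg.norm(Vo, axis=1, keepdims=True)   # cosine similarity
    Eo = Eo / np.linalg.norm(Eo, axis=1, keepdims=True)
    logits = Vo @ Eo.T / tau                              # (B, B) similarities
    def nll(mat):
        p = np.exp(mat) / np.exp(mat).sum(axis=1, keepdims=True)
        return -np.log(np.diag(p))                        # matched-pair NLL
    per_sample = 0.5 * (nll(logits) + nll(logits.T))      # I->T and T->I
    return (delta * per_sample).sum() / max(delta.sum(), 1)

rng = np.random.default_rng(0)
Vo = rng.normal(size=(4, 8))
loss = info_nce(Vo, Vo.copy(), delta=np.ones(4))
```

Matched pairs sit on the diagonal of the logit matrix, so minimizing the loss pulls each pooled visual feature toward its own sentence embedding and away from the other sentences in the batch.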

Ref-ZOM Dataset
We collect Ref-ZOM to address the limitation of mainstream datasets [21,38,36] that contain only one-to-one samples. Following previous works [21,38,36], images in Ref-ZOM are selected from the COCO dataset [28]. In total, Ref-ZOM contains 55,078 images and 74,942 annotated objects, in which 43,749 images and 58,356 objects are used for training, and 11,329 images and 16,586 objects for testing. Notably, Ref-ZOM is the first dataset that simultaneously contains one-to-zero, one-to-one, and one-to-many samples. Although the VGPhraseCut dataset [45] includes some one-to-many samples, it lacks one-to-zero cases, which makes it less applicable than Ref-ZOM. Due to space limitations, we only illustrate a selection of representative samples from Ref-ZOM in Fig. 3. More detailed information can be found in the supplementary materials.
One-to-many. We collect one-to-many samples in three different ways, as illustrated in the first row of Fig. 3 from left to right. (1) We manually create some image-text pairs based on the expressions from COCO Caption and annotate the corresponding targets in a two-player game [21,52]. Specifically, given an image with caption expressions and annotations, the first annotator selects and modifies the sentence to describe the masked objects. Then, given only the image, the second annotator is asked to select the targets according to the text expression from the first annotator. The image-text pair is collected only when the second annotator selects the targets correctly. (2) Based on existing one-to-one referring image segmentation datasets, we combine the text expressions describing different targets in one image to compose one-to-many expressions. (3) We utilize the category information with a prompt template to compose some text samples. In total, Ref-ZOM contains 41,842 annotated objects under the one-to-many setting.
One-to-zero. We carefully select 11,937 images from the COCO dataset [28] that are not included in [21,38,36]. Next, we randomly pair each image with a text expression taken from either the COCO captions or the text pools in [21,38,36]. Finally, we conduct a thorough double-checking process to verify that the selected text expressions are unrelated to the corresponding images.
One-to-one. First, we randomly select some samples from the existing datasets [21,38,36]. Meanwhile, we manually create some new samples based on the category information with a prompt template, similar to the third strategy used for the one-to-many samples. In total, there are 42,421 one-to-one objects in Ref-ZOM.

Implementation Details
We evaluate the performance of DMMI with two different visual encoders, ResNet-101 and Swin-Transformer-Base (Swin-B), which are initialized with classification weights pre-trained on ImageNet-1K and ImageNet-22K [5], respectively. Our text encoder is the base BERT model with 12 layers and a hidden size of 768, initialized with the official pre-trained weights [6]. For training, we use AdamW as the optimizer with a weight decay of 0.01. The initial learning rate is set to 5e-5 with a polynomial learning-rate decay policy. Images are resized to 448 × 448 and the maximum sentence length is set to 20. Additionally, we randomly erase only one phrase per sentence in each iteration.
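For reference, the polynomial learning-rate decay schedule can be sketched as below; the power 0.9 is a common default and an assumption here, since the exponent is not stated.

```python
# Sketch of a polynomial learning-rate decay policy; power=0.9 is an
# assumed value, the exponent is not specified in the text above.
def poly_lr(base_lr, step, total_steps, power=0.9):
    # learning rate decays from base_lr to 0 following (1 - t)^power
    return base_lr * (1.0 - step / total_steps) ** power

base = 5e-5                                  # initial learning rate used above
lrs = [poly_lr(base, s, 100) for s in range(101)]
```

The schedule starts at the initial rate, decreases monotonically, and reaches zero at the final step.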

Datasets and Metrics
In addition to our newly collected Ref-ZOM, we evaluate our method on three mainstream referring image segmentation datasets: RefCOCO [21], RefCOCO+ [21], and G-Ref [38,36]. Notably, G-Ref has two different partitions, established by UMD [38] and Google [36], respectively. We evaluate our method on both of them.
At test time, for one-to-one and one-to-many samples, we employ the overall intersection-over-union (oIoU), the mean intersection-over-union (mIoU), and prec@X to evaluate the quality of the segmentation masks. The oIoU measures the ratio between the total intersection area and the total union area accumulated over all test samples, while the mIoU averages the IoU score of each sample across the whole test set. Prec@X measures the percentage of test images with an IoU score higher than the threshold X ∈ {0.5, 0.7, 0.9}. As for the one-to-zero samples, since no target is included in the image, IoU-based metrics are not applicable. Thus, we utilize image-level accuracy (Acc) to evaluate performance. For each one-to-zero sample, the Acc value is 1 only when all points in the prediction mask are classified as background; otherwise, the Acc value is 0. We average the Acc value across the whole test set.
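The metrics above can be reproduced on toy binary masks as follows; the 2×2 masks are purely illustrative.

```python
import numpy as np

# Toy computation of oIoU, mIoU, prec@0.5 and the one-to-zero Acc described
# above, on tiny hand-written binary masks.
def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

preds = [np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]])]
gts   = [np.array([[1, 1], [0, 0]]), np.array([[1, 1], [0, 0]])]
ious = [iou(p, g) for p, g in zip(preds, gts)]
miou = sum(ious) / len(ious)                      # per-sample average
oiou = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts)) / \
       sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
prec_at_05 = sum(v > 0.5 for v in ious) / len(ious)
# one-to-zero accuracy: 1 only if the whole prediction is background
acc = float((np.zeros((2, 2)) == 0).all())
```

Note that oIoU pools intersections and unions before dividing, so large objects dominate it, while mIoU weights every sample equally.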

Comparison with State-of-the Arts
In Table 1, we compare the proposed DMMI network with state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref in terms of the oIoU metric. The table is divided into two parts according to the visual encoder. The first part reports the performance of methods equipped with CNNs as the visual encoder, while the second part presents methods using Transformer-based structures or backbones pre-trained beyond ImageNet. Generally speaking, DMMI delivers the best performance under both conditions. Here, we take the second part as an example for analysis. On the RefCOCO dataset, we surpass the second-best method by 1.4%, 1.31%, and 1.37% on the val, testA, and testB subsets, respectively. On the RefCOCO+ dataset, our DMMI network achieves a significant gain over the SOTA method, with increases of 1.84%, 1.35%, and 1.93% on the val, testA, and testB subsets, respectively. On the UMD partition of the G-Ref dataset, 2.22% and 2.1% oIoU improvements are obtained, while a 1.48% increase is also observed on the Google partition. These improvements are consistent in the first part of Table 1.
In addition, we reproduce some representative methods on the newly collected Ref-ZOM dataset and evaluate our DMMI against them. The performance comparison is presented in Table 2, where our DMMI is equipped with Swin-B as the visual encoder. As shown, our method achieves the best performance in handling the one-to-many and one-to-zero settings. More specifically, DMMI outperforms the second-best method by 4.32% and 3.43% in terms of oIoU and mIoU. Moreover, in terms of the metric for one-to-zero samples, DMMI surpasses the second-best method by 3.91% in Acc.

Ablation Study
In this part, we perform several ablation studies to evaluate the effectiveness of the key components of our DMMI network on both the G-Ref (U) and Ref-ZOM datasets. The results are listed in Table 3.
Effect of Image-to-Text Decoder. The first three rows in Table 3 verify the effectiveness of the image-to-text decoder. First, we remove the whole image-to-text decoder and report the performance in the first row of Table 3, where a 1.7% performance degradation is observed on G-Ref. This reflects that the image-to-text decoder contributes substantially to producing accurate segmentation results. Next, we add the similarity loss L_sim and report the results in the second row of Table 3. We find that L_sim brings significant improvements. In particular, on Ref-ZOM, the accuracy improves by 1.48% and 1.64% in terms of oIoU and Acc. Meanwhile, a 0.71% oIoU gain is also found on G-Ref when the network is equipped with L_sim. This demonstrates that, through the reconstruction of the incomplete text embedding during training, DMMI learns to fully incorporate the semantic clues about the target entities into the visual features, which brings superior performance when facing various types of text inputs. Additionally, we verify the effectiveness of L_con in the third row. Compared to the baseline in the first row, performance goes up by 0.59% and 0.82% on Ref-ZOM in terms of oIoU and Acc. This suggests L_con also contributes to the comprehensive understanding of the target entity by pairing corresponding multi-modal features. Moreover, in the seventh row, we find the best performance is achieved when the network is equipped with L_sim and L_con simultaneously, reflecting that contrastive learning and text reconstruction are highly complementary.
Effect of MBA Module. In the fourth to sixth rows of Table 3, we conduct experiments to investigate the effectiveness of the MBA module. On the one hand, as shown in the fifth and seventh rows, if we prohibit the multi-modal interaction between various local regions and only implement the interaction between single words and single pixels, the segmentation results drop significantly. Specifically, 1.46% and 1.2% degradations are observed on Ref-ZOM, which demonstrates that interaction over larger regions benefits the comprehensive understanding of multi-modal features.
On the other hand, we disable the bi-direction mechanism in the MBA module by removing the image-to-text alignment and retaining only the text-to-image one. The results are listed in the sixth row of Table 3, in which the performance drops considerably compared to the whole network. This reflects that mutually refining the multi-modal features during the interaction contributes to producing accurate segmentation maps.

Visualization
In this section, we visualize some segmentation maps generated by DMMI and LAVT [48] to further demonstrate the superiority of our method.
Zero-shot to Ref-ZOM. We first visualize some segmentation maps when the model is trained on G-Ref and transferred to Ref-ZOM under the zero-shot condition. The results are illustrated in the first row of Fig. 4. Since G-Ref only contains one-to-one samples, it is challenging to directly utilize the G-Ref-trained model to address one-to-many and one-to-zero cases. As shown in the first sample, our DMMI network precisely localizes the two boys and distinguishes the woman in the background, whereas LAVT only localizes the largest boy in the image. As for the second sample in the first row, DMMI also handles the one-to-zero case successfully.
Zero-shot to Cityscapes. To further verify the generalization ability of DMMI, we directly transfer the Ref-ZOM-trained networks to the Cityscapes dataset and give the model some expressions as text input. The training images in Ref-ZOM all come from the COCO dataset, whose image style is quite different from that of Cityscapes. Thus, it is challenging to produce satisfactory performance when the model is transferred to Cityscapes without fine-tuning. As shown in the second row of Fig. 4, DMMI presents satisfactory performance. Specifically, given the text "White cars on two side", DMMI precisely localizes the corresponding targets, while LAVT segments many irrelevant cars and fails to produce an accurate segmentation map, demonstrating the strong generalization ability of our method.

Conclusion
In this paper, we point out the limitations of existing referring image segmentation methods in handling expressions that refer to either no object or multiple objects. To solve this problem, we propose a Dual Multi-Modal Interaction (DMMI) Network and establish the Ref-ZOM dataset. In the DMMI network, besides the visual prediction, we reconstruct the erased entity-phrase based on the visual features, which promotes multi-modal interaction. Meanwhile, the newly collected Ref-ZOM includes image-text pairs under one-to-zero and one-to-many settings, making it more comprehensive than previous datasets. Experimental results show that the proposed method outperforms existing methods by a large margin, and the Ref-ZOM dataset endows the network with remarkable generalization ability in understanding various text expressions. We hope our work provides a new perspective for future research.

Figure 1: (a) Taking autonomous driving as an example, a text expression may refer to a varying number of targets, depending on the specific real-world scenario. (b) When sentences refer to multiple or no targets, existing methods cannot realize accurate segmentation.

Figure 2: The whole framework of the proposed Dual Multi-Modal Interaction (DMMI) Network. (a) The feature encoder, in which the visual encoder and text encoder are utilized to extract visual and linguistic features, respectively. Meanwhile, the MBA module is employed to perform multi-modal feature interaction. Notably, w_{i,j} denotes the j-th point in the i-th row of the attention weight. (b) Text-to-image decoder, in which the text embedding is utilized to query the visual feature and generate the prediction map. (c) Image-to-text decoder, in which the CCR module is utilized to reconstruct the erased linguistic information conditioned on the visual feature. L_con and L_sim are implemented to assist the reconstruction.

Figure 3: Selected samples from our newly collected Ref-ZOM dataset. From top to bottom are image-text pairs under one-to-many, one-to-zero, and one-to-one settings.

Figure 4: Comparisons of segmentation maps generated by LAVT and our DMMI network.

Table 1: Comparison with state-of-the-art methods in terms of oIoU (%) on three datasets. For G-Ref, U and G denote the UMD and Google partitions, respectively. The best results are in bold.

Table 2: Comparisons with some representative methods on the newly collected Ref-ZOM dataset.

Table 3: Ablation study of different components of the DMMI network on the G-Ref and Ref-ZOM datasets. Notably, "Bi-D" indicates the bi-direction operation in the MBA module and "I2T" denotes image-to-text.