Enabling Large-Scale Image Search with Co-Attention Mechanism

Content-based image retrieval (CBIR) consists of searching for the images most similar to a given query. Most existing attention mechanisms for CBIR are query non-sensitive: they are based only on each candidate image's own features, regardless of the actual query content. This can lead to attending to incorrect regions, especially when the target object is not salient or is surrounded by distractors. This paper proposes an efficient and effective query-sensitive co-attention mechanism for large-scale CBIR. Local feature selection and clustering are employed to reduce the computation cost introduced by the query sensitivity. Experimental results indicate that the proposed co-attention method generates good co-attention maps even in challenging situations, leading to new state-of-the-art performance on several benchmark datasets.


INTRODUCTION
Deep Convolutional Neural Network (CNN) based methods for CBIR can be divided into two categories: global and local feature methods. Global feature methods extract a compact feature vector from each image with a single forward pass through the network. This can be achieved by a fully connected layer [1] or by global spatial pooling [2,3,4]. In addition, several attention mechanisms have been proposed for feature refinement before global pooling. Weighted Generalized Mean pooling (WGeM) [5] employs a trainable spatial weighting module for feature re-weighting. SOLAR [6] explores the correlation between entries of the convolutional feature tensor with second-order attention. Deep Orthogonal Local and Global (DOLG) [7] proposes an orthogonal fusion module that combines the global feature with critical local features for a better image representation, while a dot-product fusion module is trained in [8]. Local feature methods treat each entry of the feature tensor as a local descriptor and use a separate aggregation method to build the final image representation [9,10,11]. In recent works, selected local features are further used in spatial verification mechanisms for re-ranking [12,13]. For example, HOW [14] combines CNN-based local features with the Aggregated Selective Match Kernel (ASMK) [15] to directly perform many-to-many local feature matching for image retrieval.
Despite the success of CNN-based methods, existing attention mechanisms for CBIR [5,12,13,14] are all query non-sensitive: for a given candidate image they predict the regions of interest purely from the knowledge learned during training, regardless of what the query content is. These query non-sensitive spatial attention modules are very likely to fail when the target object is not salient or is surrounded by distractors. For example, in Fig. 1 the query non-sensitive attention mechanism from WGeM [5] fails: as the Louvre Pyramid and the Louvre Palace are both potential objects of interest, when the Louvre Pyramid is the query item it is consistently ignored by the WGeM attention module while the adjacent Louvre Palace attracts more attention.

Fig. 1. Examples of query non-sensitive attention where the WGeM approach fails. Images taken from [5].
Ideally, the attention should be query sensitive, i.e., consistent with the current query content. When the Louvre Pyramid is the query, it should be highlighted in the resulting co-attention map, and vice versa, as shown in examples 3-4 of Fig. 3. This kind of query-sensitive attention, conditioned on the query content, is called co-attention in this paper. In other co-attention works [16,17,18] the query pattern was also shown to be essential for feature extraction.
Our contributions are: 1) we propose an efficient co-attention method based on local feature selection and clustering that requires no extra parameter training; 2) we show that our method generates good co-attention maps even in hard situations; 3) experimental results show that our co-attention method greatly improves retrieval performance, reaching new state-of-the-art results on several benchmark datasets.

BASELINE MODEL STRUCTURE
The proposed co-attention method serves as a post-processing module for pre-trained spatial pooling models, without requiring the training of any extra parameters. Accordingly, in this paper we follow the framework from [4] to construct the baseline GeM model. ResNet101 [19] is used as the backbone network for feature tensor extraction. The output feature tensor is globally pooled by a GeM layer [4], followed by a fully connected layer for feature whitening. Let X = [x_l] ∈ R^(H×W×D) denote the feature tensor output by the backbone network before pooling, where H, W and D represent the height, width and channel count (D = 2048 for ResNet101), and x_l is the local feature at location l of X. According to the spatial pooling analysis from [14], any loss function that optimizes the cosine similarity between globally pooled features implicitly optimizes the following: for irrelevant background locations x_bg, the L2 norm is minimized, so they contribute little or nothing to the final similarity score; on the contrary, for distinct foreground object or region locations x_fg, the L2 norm is maximized. Accordingly, the L2 norm can be treated as a spatial attention that the model implicitly learns during training [14].
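The GeM pooling and the implicit L2-norm attention described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the ResNet101 backbone and the whitening layer are omitted, and the function names are our own.

```python
import numpy as np

def gem_pool(X, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the spatial locations.

    X: feature tensor of shape (H, W, D), assumed non-negative
       (e.g. post-ReLU activations).  p = 1 gives average pooling;
    larger p approaches max pooling.  Returns a D-dim descriptor.
    """
    Xc = np.clip(X, eps, None)              # GeM needs positive inputs
    return (Xc ** p).mean(axis=(0, 1)) ** (1.0 / p)

def l2_attention(X):
    """Implicit spatial attention: the L2 norm of each local feature x_l."""
    return np.linalg.norm(X, axis=-1)       # attention map, shape (H, W)
```

With p = 1 GeM reduces to plain average pooling, while large p emphasizes the strongest activations; the L2-norm map is exactly the quantity the paper later reuses for feature selection.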

ENABLING CBIR WITH CO-ATTENTION
In the following, we build the co-attention generation process on the convolutional feature tensor output by the pre-trained GeM model from Section 2.
Local feature selection and clustering. The first challenge in using co-attention is the computation cost induced by the large number of local features extracted from a single image: hundreds of local features can be extracted from a single high-resolution image, and it is impractical to cache all of them. An intuitive way to reduce the memory cost is to discard the irrelevant background local features. The L2 norm of each entry of the CNN feature tensor can be used as an indicator of feature importance. Feature selection is then performed by keeping the top N local features with the highest L2 norm from the feature tensor X, resulting in a selected local feature set X_N ∈ R^(N×D). To further reduce the number of local features, k-means clustering is applied to X_N, grouping the features into K clusters. Within each cluster, GeM pooling is performed to obtain a representative local feature center, followed by whitening with the fully connected layer, which results in the clustered local feature set.

Co-attention generation with local features. The pipeline of co-attention generation and weighted feature extraction is illustrated in Fig. 2. After feeding the query image I_q and the candidate image I_c through the backbone network, we use the L2 norm for feature selection. The selected query local features X_{q,N} are directly GeM pooled and whitened to obtain the query global feature V_q. To extract representative feature vectors, the selected candidate local features X_{c,N} are clustered and then whitened, resulting in the local feature set X_{c,K}. The co-attention weights a = [a_i] ∈ R^K are then obtained by computing the cosine similarity between V_q and each local feature of X_{c,K}. However, these weights lie in the range [−1, 1], which may not ensure a high contrast among the locations. To better control the weight distribution and normalize the weights into the range [0, 1], the SoftMax function is applied to a:

    â_i = exp(T a_i) / Σ_{j=1..K} exp(T a_j),    (1)

where T is a temperature parameter. The final co-attention weighted candidate global feature vector V_c is obtained by weighted sum pooling:

    V_c = Σ_{i=1..K} â_i x_{c,i},    (2)

where x_{c,i} is the i-th local feature of X_{c,K}. The similarity measure is the cosine similarity between V_q and V_c.

Further computation cost reduction. To make co-attention practical for large-scale image retrieval and to further reduce the required computation cost, we propose two extra processing steps at the retrieval stage. First, PCA dimension reduction is applied to both the query global feature V_q and the candidate image local features X_{c,K}. Second, an inverted file indexing [23] module is used to reduce the number of candidate images that need to be compared with the query image at the retrieval stage. At the feature extraction stage, the selected query and candidate image local features X_{q,N} and X_{c,N}, after dimension reduction and whitening, are quantized to the visual words [23] of a codebook, and the visual word indices each image is assigned to are recorded. During retrieval, for each query image we only consider those candidate database images that share at least one visual word with the query image, performing co-attention generation and similarity assessment for them; all other images are discarded.
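The scoring pipeline of this section (top-N selection by L2 norm, k-means clustering, temperature-scaled SoftMax weighting, and the inverted-file candidate filter) can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: plain cluster means stand in for the paper's per-cluster GeM pooling and whitening, PCA is omitted, and all function names are hypothetical.

```python
import numpy as np

def select_local_features(X, n):
    """Keep the top-n local features by L2 norm from a (H, W, D) tensor."""
    flat = X.reshape(-1, X.shape[-1])
    idx = np.argsort(-np.linalg.norm(flat, axis=1))[:n]
    return flat[idx]

def cluster_features(X_n, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) giving k representative centers.
    (Simplification: the paper GeM-pools and whitens each cluster.)"""
    rng = np.random.default_rng(seed)
    centers = X_n[rng.choice(len(X_n), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(X_n[:, None] - centers[None], axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X_n[assign == j].mean(axis=0)
    return centers

def co_attention_score(v_q, X_ck, temperature=10.0):
    """Cosine similarities between query global feature v_q and the K
    clustered candidate features, sharpened by a SoftMax with temperature
    T, then used for weighted sum pooling of the candidate features."""
    v_q = v_q / np.linalg.norm(v_q)
    Xn = X_ck / np.linalg.norm(X_ck, axis=1, keepdims=True)
    a = Xn @ v_q                             # cosine weights in [-1, 1]
    w = np.exp(temperature * a)
    w /= w.sum()                             # SoftMax, now in [0, 1]
    v_c = (w[:, None] * X_ck).sum(axis=0)    # co-attention weighted pooling
    return float(v_q @ (v_c / np.linalg.norm(v_c))), w

def candidate_filter(inverted_index, query_words):
    """Inverted-file lookup: keep only database images that share at
    least one visual word with the query.  inverted_index maps a
    visual-word id to the set of image ids assigned to it."""
    hits = set()
    for word in query_words:
        hits |= inverted_index.get(word, set())
    return hits
```

In practice one would replace the toy k-means with a library implementation and quantize features against a trained codebook; the sketch only shows how the pieces fit together per query-candidate pair.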

EXPERIMENTS
Experiment setup. For a fair comparison with the current state-of-the-art (SOTA) work Deep Orthogonal Local and Global (DOLG) [7], the baseline model described in Section 2 is trained on the GLDv2 dataset [24] with the ArcFace margin loss [13]. The batch size is set to 128. The initial learning rate is 0.05, with a cosine learning rate decay strategy [7]. The model is trained for no more than 50 epochs. We set N = 500 for local feature selection, the cluster count K = 10 for k-means clustering, and T = 10 for the SoftMax temperature in Eq. (1). The feature dimension is reduced to 512 by PCA. For the inverted file index, we use 60,000 single-scale images from the training dataset (GLDv2) to train the codebook. From each image, 300 local features are selected and compressed to a dimension of 128 by PCA. The visual word count of the codebook is set to 65,536. In addition, the multi-scale feature extraction scheme [4] is applied with 5 scales: local features extracted at the different scales are merged together and jointly selected using the L2 norm. We use the ROxf/RPar datasets [25] along with the 1 million image distractor set R1M [25] for performance evaluation.

Visualization results. Visualization examples of the proposed co-attention are shown in Fig. 3. For comparison, the query non-sensitive L2 norm attention is shown in the fourth column. As can be observed, the L2 norm attention tends to highlight regions relevant to the training data, while our co-attention accurately highlights the regions that match the query content.

Image retrieval results. Image retrieval results for the proposed method and comparisons with other methods are provided in Table 1. For a fair comparison, some of the recent works are re-implemented and marked with "†". Group (A) of Table 1 shows the results of local feature methods. R101-HOW (GLDv2)† is the re-implementation of HOW [14] on the GLDv2 dataset with a ResNet101 backbone and the ArcFace loss. It reaches 71.3% mAP on the ROxf hard set before adding distractors, but performs relatively weakly with the 1 million distractors. Group (B) shows the results of the global feature methods. The original DEep Local and Global features (DELG) model [13] was trained on GLDv2 with a small batch size of 32; R101-DELG† is its re-implementation with ResNet101 as the backbone network. It can be seen that spatial verification yields limited improvement, especially when considering the 1 million distractor set. The bottom group (C) shows the results for the baseline model (GeM†) described in Section 2 and for its combination with the proposed co-attention method (GeM†+CA). GeM† and GeM†+CA share the exact same GeM model; the only difference is that GeM†+CA applies the co-attention method of Section 3, together with the PCA dimension reduction and inverted file indexing from Section 3, to re-weight the candidate image feature tensor before global GeM pooling. We observe that introducing co-attention into the CBIR pipeline greatly improves the retrieval performance. In particular, on the hard sets of ROxf (RPar), GeM†+CA reaches the best results of 72.6% (85.6%). The proposed co-attention method also gives the best retrieval results when considering the 1 million distractor set.

ABLATION EXPERIMENT AND DISCUSSION
Impact of clustering parameters. In the plots of Figures 4 (a), (b) and (c) we evaluate the impact of the hyper-parameters, namely the selected feature count N, the cluster count K and the temperature T of Eq. (1), on retrieval performance. The proposed method is robust to changes in these hyper-parameters. Varying the number of clusters K affects not only the performance but also the computation cost. The setting described at the beginning of Section 4 reaches a good balance between performance and computation cost.

Computation cost and speed. In terms of memory, caching the whole ROxf/RPar database with the 1 million distractor set takes around 21 GB. In terms of time, feature extraction takes on average 240 ms to cache one candidate image's local features, but this is performed offline and only once. Searching ROxf/RPar with the 1 million distractor set takes on average 530 ms per query image with acceleration on an NVIDIA Tesla GPU. A detailed computation cost comparison is provided in Table 2. The proposed method GeM†+CA requires a memory cost similar to that of DELG [13]. Although the proposed co-attention method costs more time than simple global feature methods such as GeM [4] and DOLG [7], it provides the best retrieval performance.
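The roughly 21 GB memory figure is consistent with a back-of-envelope estimate. Assuming (our assumptions, not stated explicitly in the paper) about one million cached database images, K = 10 clustered local features per image, 512 dimensions after PCA, and float32 storage:

```python
# Back-of-envelope memory estimate for caching the database features.
# Assumptions: ~1.0e6 images, K = 10 clustered local features each,
# 512 dims after PCA, 4 bytes per float32; global features and index
# overhead are ignored.
n_images, k, dim, bytes_per_float = 1_000_000, 10, 512, 4
total_gb = n_images * k * dim * bytes_per_float / 1e9
print(f"{total_gb:.2f} GB")   # ~20.48 GB, in line with the ~21 GB reported
```

The small remainder is plausibly accounted for by global features and the inverted-index structures.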

CONCLUSION
In this paper, we enable large-scale content-based image retrieval with a co-attention mechanism. The proposed co-attention method can be treated as a module without trainable parameters attached to a pre-trained spatial pooling model. It is intuitively based on the similarity scores between the global feature vector of the query image and the clustered local features of the candidate image. The extra computation cost caused by the query sensitivity is addressed by employing local feature selection and clustering, and by using inverted file indexing to speed up the retrieval procedure. While straightforward, the proposed co-attention method generates good co-attention maps even in challenging cases. Simply adding our co-attention method to the pre-trained baseline GeM model greatly improves retrieval performance and results in new state-of-the-art retrieval performance on benchmark datasets, while requiring computation costs comparable to other models.

Fig. 2. Illustration of clustering based co-attention generation and weighted feature extraction.