Few but Informative Local Hash Code Matching for Image Retrieval

Content-based image retrieval (CBIR) aims to find, in an extensive database, the images most similar to a given query. Existing CBIR works either represent each image with a compact global feature vector or extract a large number of highly compressed low-dimensional local features, each carrying limited information. In this research study, we propose an expressive local feature extraction pipeline and a many-to-many local feature matching method for large-scale CBIR. Unlike existing local feature methods, which tend to extract large numbers of low-dimensional local features from each image, the proposed method models characteristic feature representations for each image, aiming to employ fewer but more expressive local features. To further improve the results, an end-to-end trainable hash encoding layer is used to extract compact but informative codes from images. The proposed many-to-many local feature matching is then performed directly on the hash feature vectors of the input images, leading to new state-of-the-art performance on several benchmark datasets.


INTRODUCTION
Comparing images and retrieving those whose content is similar to a given query is an important image processing application called Content-Based Image Retrieval (CBIR). The crucial processing aspects that have to be addressed in CBIR are image feature extraction and the similarity measure. When employing Convolutional Neural Networks (CNNs) for CBIR, there are two levels of image representation, corresponding to global and local feature modelling.
Global feature methods [1,2,3,4] extract a compact feature vector for each image following a single forward pass through the network. In recent works, several types of attention mechanisms have been proposed to re-weight the convolutional feature tensor before compressing it into a compact global feature vector [5,6,7,8,9], instead of uniformly extracting features from the whole image. Local feature methods can be further divided into three categories. The first category employs an aggregation module to encode the local features into a compact feature vector [10,11,12,13]. The second type keeps several local features from each image and employs a similarity measure in a many-to-many matching manner [14,15]. The final category utilises the spatial information of each extracted local feature vector to perform verification during a re-ranking stage [16,17]. In a way, the first type of local feature method is similar to the global feature representation approach, as both eventually lead to a single compact global feature vector as the image representation. However, most existing local feature methods tend to extract a large number of local features from the input image. In particular, due to the fixed receptive field of CNNs and the variation in object size, each local feature may correspond to only a part of a target object or region. To address this problem, local feature vectors are normally extracted from multiple resolutions of the input image. However, the increase in the number of features leads to unbearable processing costs at the online retrieval stage. Consequently, most existing works apply drastic dimension reduction or binarization [14,15,16,17] for feature compression. This results in a large number of low-dimensional local features, each containing relatively limited relevant information.
This research study proposes a new method for extracting a comprehensive and compact representation of image information from pre-trained CNNs. Instead of storing huge amounts of low-dimensional local features, we employ clustering on selected local features from the CNN output to build compressed but expressive local feature representations for each image. Moreover, we propose a trainable hash encoding layer, optimized based on the idea of the Bi-half Net [18]. After training, the proposed feature extraction pipeline produces compact hash codes with limited information loss. Finally, a corresponding many-to-many similarity criterion is applied to the resulting expressive local features, leading to state-of-the-art results while using significantly less memory than other CBIR approaches.

METHODOLOGY
In this section, we first describe a simple model structure that generates global feature vectors optimized with image-level labels at the training stage. Then, we discuss how the local features and the hash code generation are optimized during training. After that, we propose a clustering-based local feature extraction method along with a many-to-many local feature matching strategy for image matching at the retrieval stage.

Fig. 1. Illustration of the proposed binary local feature model processing system.

Architecture
We consider ResNet [19] with an output channel count of D = 2048 as the backbone network. The feature tensor X output by the final convolutional layer is GeM-pooled [4] with a fixed power coefficient of 3. After that, the spatially pooled feature is whitened by a fully connected layer, resulting in the global feature vector V_g of dimension D_B. As shown in Fig. 1 (a), at the training stage, the model is trained with the ArcFace margin loss [17]:

\ell(\hat{V}_g, y) = -\log \frac{\exp\left(\gamma \cdot AF(\hat{w}_y^\top \hat{V}_g, 1)\right)}{\sum_{i=1}^{N_c} \exp\left(\gamma \cdot AF(\hat{w}_i^\top \hat{V}_g, \mathbb{1}[i = y])\right)},   (1)

where \hat{V}_g is the whitened, L2-normalized global GeM feature vector for each input training image, and AF(u, c) is the ArcFace-adjusted cosine similarity [17]:

AF(u, c) = \begin{cases} \cos(\arccos(u) + m), & c = 1 \\ u, & c = 0, \end{cases}   (2)

where m is the margin, \hat{w}_i refers to the trainable L2-normalized classifier weights for class i from the ArcFace weight matrix W \in \mathbb{R}^{N_c \times D_B}, and N_c is the number of classes in the training dataset. In other words, the ArcFace loss optimizes the cosine similarity not between single image pairs but between each training image and the proxies of the classes. According to the insight on spatial pooling from [14], optimizing the cosine similarity between global spatially pooled features implicitly optimizes the L2 norm of the local descriptor at each entry of the feature tensor X output by the backbone network. While the Sign function has been widely applied for feature binarization [14,17,16], its direct application to real-valued features optimized with a real-valued loss can lead to information loss, degrading the whole model's performance. What attributes make a good binary code has been discussed in several studies [20,21,22].
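The two building blocks above, GeM pooling and the ArcFace margin loss, can be sketched as follows. This is a minimal NumPy illustration rather than the training code; the helper names (`gem_pool`, `arcface_loss`) are ours, and whitening is omitted:

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """GeM pooling over the spatial grid of a feature tensor x of shape (H, W, D).
    p=3 is the fixed power coefficient used in the paper; activations are assumed
    non-negative (post-ReLU), hence the clipping."""
    x = np.clip(x, eps, None)
    return np.mean(x.reshape(-1, x.shape[-1]) ** p, axis=0) ** (1.0 / p)

def arcface_loss(v_g, W, y, m=0.15, gamma=30.0):
    """ArcFace margin loss of a global feature v_g (D,) against class proxies
    W (num_classes, D); both are L2-normalized, the angular margin m is added
    only to the true class y, and gamma is the scale factor."""
    v = v_g / np.linalg.norm(v_g)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = np.clip(Wn @ v, -1.0, 1.0)        # cosine similarity to every class proxy
    logits = gamma * cos
    logits[y] = gamma * np.cos(np.arccos(cos[y]) + m)  # margin on the true class only
    logits -= logits.max()                  # numerical stability for the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[y])
```

Adding the margin m shrinks the true-class logit, so the loss with m > 0 is strictly larger than without it, which is what forces tighter intra-class clustering during training.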
Recently, the Bi-half Net [18] observed that the information per channel transmitted from the original continuous features to the corresponding binary code is maximized when the distribution of the binary values {−1, 1} within each channel is equally balanced. The forward and backward processes of the Bi-half layer, given the input feature batch F, are [18]:

Forward: B = \mathrm{sign}(F - \mathrm{median}(F)), computed per channel over the batch, so that each channel is assigned half +1 and half −1;
Backward: \frac{\partial \ell}{\partial F} = \frac{\partial \ell}{\partial B} + \phi (F - B),   (3)

where φ is a hyper-parameter equal to the multiplicative inverse of the element count of the feature batch F.
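A minimal sketch of this balanced binarization follows. Note the simplifications: Bi-half Net assigns the half/half split via a sorting-based optimal transport step, whereas thresholding at the per-channel median, as done here, is an approximation that coincides with it for even batch sizes with distinct values; the backward pass follows the proxy-gradient description above:

```python
import numpy as np

def bihalf_forward(F):
    """Balanced binarization: per channel, batch entries above the channel median
    get +1 and the rest get -1, so each bit is (approximately) half/half balanced.
    F has shape (batch, channels)."""
    med = np.median(F, axis=0, keepdims=True)
    return np.where(F > med, 1.0, -1.0)

def bihalf_backward(grad_B, F, B):
    """Proxy gradient: the binary-code gradient is passed through, plus a pull of
    the continuous features toward their codes, weighted by phi = 1 / F.size."""
    phi = 1.0 / F.size
    return grad_B + phi * (F - B)
```

Because every channel ends up half +1 and half −1, each bit carries maximal entropy, which is the property Eq. (3) is designed to enforce.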
The Bi-half layer in [18] is applied to each batch of features at the training stage, which can make training unstable. Accordingly, in our implementation, we apply the Bi-half layer to the class proxy features W, resulting in binary proxy features B_W. The ArcFace loss from Eq. (1) is then re-written as:

\ell(\hat{V}_g, y) = -\log \frac{\exp\left(\gamma \cdot AF(\hat{b}_{W,y}^\top \hat{V}_g, 1)\right)}{\sum_{i=1}^{N_c} \exp\left(\gamma \cdot AF(\hat{b}_{W,i}^\top \hat{V}_g, \mathbb{1}[i = y])\right)},   (4)

where \hat{b}_{W,i} is the L2-normalized binary proxy feature for class i.
Intuitively, enforcing these proxy features to have equal binary symbol probabilities makes the binary codes of images from the same class be optimized towards a consistent target across all batch steps. This eliminates the distraction caused by the batch size setting or by random sample shuffling at the training stage.

Local feature extraction and matching
At the retrieval stage, as shown in Fig. 1 (b), an input image I is fed through the backbone network, and L2-norm-based feature selection is applied, keeping the top N local features X_N ∈ R^{N×D} with the highest L2 norm. Then, k-means clustering is employed to extract a set of representative feature vectors by performing GeM pooling within each cluster. This is followed by whitening and Sign-function-based binarization, yielding a set of K clustered binary codes. Consider a pair of images, the query image I_q and the candidate image I_c, along with their binary features B_{q,K} = [b_{q,i}] and B_{c,K} = [b_{c,j}] (i, j = 1, ..., K). Their similarity score is defined as a many-to-many aggregation of the pairwise Hamming distances Hamm(b_{q,i}, b_{c,j}), where Hamm(·, ·) represents the Hamming distance. To further speed up the online retrieval procedure, inverted file indexing [23] is used to eliminate the obvious non-matching images. We use the local features from X_N to build the visual word codebook. At the feature extraction stage, the local features of both query and candidate images, X_{q,N} and X_{c,N}, are assigned to visual words from the pre-trained codebook, and we record the visual word indices assigned to each image. During retrieval, for each query image, we only pick out those candidate database images that share at least one visual word with the query and perform local feature matching to assess their similarity. The remaining candidate images are simply assigned a zero similarity score with the query image.
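The selection and matching steps can be sketched as below. The exact aggregation of the pairwise Hamming distances is given by the paper's match kernel; the best-match-per-query-code scoring used here is only an illustrative assumption, and the clustering and whitening stages are omitted:

```python
import numpy as np

def select_topn(X, n):
    """Keep the n local features with the highest L2 norm; X has shape (num_feats, D)."""
    idx = np.argsort(-np.linalg.norm(X, axis=1))[:n]
    return X[idx]

def hamming(a, b):
    """Hamming distance between two {-1, +1} codes of equal length."""
    return int(np.sum(a != b))

def pair_similarity(Bq, Bc, D):
    """Many-to-many score between the query codes Bq and candidate codes Bc
    (each of shape (K, D)). Each query code contributes its best normalized match;
    this aggregation is an assumption, not the paper's exact kernel."""
    score = 0.0
    for bq in Bq:
        d = min(hamming(bq, bc) for bc in Bc)
        score += 1.0 - 2.0 * d / D   # map Hamming distance to a cosine-like similarity in [-1, 1]
    return score / len(Bq)
```

With K = 10 codes per image, each pairwise score costs only K² short Hamming comparisons, which is why the inverted-file pre-filter plus this match stays cheap at the +1M scale.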

EXPERIMENTAL RESULTS
Experiment setup. The model uses ResNet101 (ResNet50) as the backbone. The margin is m = 0.15 and γ = 30 for the ArcFace loss in Eq. (2) and Eq. (4). To speed up model convergence, the GeM backbone network is pre-trained on the GLDv2 dataset [24] with the ArcFace loss for 50 epochs. The batch size is set to 128, the initial learning rate is 0.05, and we use a cosine learning rate decay strategy [9]. Then, after PCA dimension reduction, the class proxies and the fully connected layer are fine-tuned for an extra 10 epochs with the Bi-half layer applied and the backbone network frozen. The final output feature dimension is D_B = 512. We set N = 500 for the local feature selection and use K = 10 clusters for k-means clustering. For the inverted file index, we use 60,000 single-scale images from the GLDv2 dataset to train the codebook. From each image, 300 local features are picked and compressed to a dimension of 128 using PCA. The visual word count of the codebook is set to 65,536. In addition, we consider 5 scales (powers of √2) for the multi-scale feature extraction scheme [4]. Local features extracted at different scales are merged together and jointly selected using the L2 norm. The ROxf/RPar datasets [25], along with the 1-million-image distractor set R1M [25], are used for evaluation.
Local match visualization.In Fig. 2 we visualize the contribution of each location to the image pair similarity score.
For comparison, the L2 norm attention given by simple GeM pooling is visualized in the images from the fourth column.
As we can observe, the L2 norm tends to uniformly highlight the objects relevant to the training data, while our local match method emphasizes the correct regions of interest. In a way, the visualization of the local match maps resembles co-attention, as the importance of each local feature from the candidate image is no longer fixed as in traditional global spatial pooling.
Image retrieval results. Comparative image retrieval results for the proposed method ("LM-BiHalf") and other approaches are provided in Table 1. For a fair comparison, we re-implemented GeM [4] and HOW [14] with the ArcFace loss on the GLDv2 dataset, denoted by †. We can observe that the proposed local match method "LM-BiHalf" leads to a substantial accuracy improvement over the baseline "GeM†". In particular, with ResNet101 as the backbone network, on the Hard set of ROxf (RPar), our method reaches the best result of 72.0% (83.6%). When considering the 1 million distractor set, our method obtains the best retrieval results on ROxf+1M and comparable results to the current state-of-the-art work DOLG on RPar+1M.

ABLATION EXPERIMENTS AND DISCUSSION
Bi-half layer impact. We first verify the impact of the Bi-half layer on the model's performance. According to the results in Table 2, after employing the Bi-half layer, the retrieval results globally outperform those obtained without it. Computation and memory costs. With the hyper-parameter setting described in Section 3, the memory cost for caching one image's features is about 0.64 KB. It takes around 0.64 GB to cache the entire ROxf/RPar dataset with the +1M distractor set. For the online retrieval search on ROxf/RPar with the +1M distractor set, with the help of inverted file indexing, one query takes on average 590 ms on a CPU. According to the results in Table 3, our method "LM-BiHalf" requires much less memory at a comparable time cost to other CBIR approaches.
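The per-image memory figure follows directly from the hyper-parameters above (K = 10 binary codes of D_B = 512 bits each), which a quick check confirms:

```python
K, D_B = 10, 512                      # codes per image, bits per hash code
bytes_per_image = K * D_B / 8         # 10 * 512 bits = 5120 bits = 640 bytes
gb_for_1m = bytes_per_image * 1_000_000 / 1e9
print(bytes_per_image, gb_for_1m)     # 640 bytes per image, 0.64 GB for 1M images
```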

CONCLUSION
In this paper, we propose extracting few but expressive binary codes as the representation of input images. The extracted compact binary features are employed in a many-to-many local matching method for CBIR. Unlike other local matching methods, which extract large sets of low-dimensional local features and may require complex matching kernel implementations, the proposed local matching method relies on L2-norm-based local feature selection and simple clustering to extract an appropriate number of expressive local features.
In addition, the adapted Bi-half binarization layer enriches the information capacity of each feature channel, alleviating the information loss caused by feature compression. The proposed CBIR methodology, enabled by deep feature space clustering and Bi-half binarization, achieves new state-of-the-art performance on benchmark datasets while having much lower memory requirements than other methods.

Table 2. Ablation experimental results when the Bi-half layer is applied on the proxy features W.