VS-Net: Voting with Segmentation for Visual Localization

CVPR 2021

Zhaoyang Huang1,2*, Han Zhou1*, Yijin Li1, Bangbang Yang1, Yan Xu2,
Xiaowei Zhou1, Hujun Bao1, Guofeng Zhang1, Hongsheng Li2,3

1State Key Lab of CAD & CG, Zhejiang University   
2CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong   
3 School of CST, Xidian University   
* denotes equal contributions


Visual localization is of great importance in robotics and computer vision. Recently, scene coordinate regression based methods have shown good performance in visual localization in small static scenes. However, these methods still estimate camera poses from many inferior scene coordinates. To address this problem, we propose a novel visual localization framework that establishes 2D-to-3D correspondences between the query image and the 3D map with a series of learnable scene-specific landmarks. In the landmark generation stage, the 3D surfaces of the target scene are oversegmented into mosaic patches whose centers are regarded as the scene-specific landmarks. To robustly and accurately recover the scene-specific landmarks, we propose the Voting with Segmentation Network (VS-Net), which segments the pixels into different landmark patches with a segmentation branch and estimates the landmark location within each patch with a landmark location voting branch. Since the number of landmarks in a scene may reach up to 5000, training a segmentation network with such a large number of classes is both computation and memory costly with the commonly used cross-entropy loss. We therefore propose a novel prototype-based triplet loss with hard negative mining, which can efficiently train semantic segmentation networks with a large number of labels. Our proposed VS-Net is extensively tested on multiple public benchmarks and outperforms state-of-the-art visual localization methods.

Motivation of Scene-specific Landmarks


Estimating camera poses with the PnP algorithm does not require dense 2D-to-3D correspondences. Instead, sparse, uniformly distributed, and accurate correspondences better benefit localization. Object pose estimation methods establish 2D-3D correspondences by identifying predefined keypoints on the 3D model, which produces exactly such sparse, uniformly distributed, and accurate correspondences. Inspired by these methods, we propose to define landmarks on the given 3D model, referred to as scene-specific landmarks, and train a neural network to identify these scene-specific landmarks in 2D images. However, visual localization faces a much larger 3D model and must therefore handle a large number of landmarks. We propose a voting with segmentation neural network to address this problem.

Pipeline overview

VS-Net Architecture

There are two decoder branches, respectively predicting a landmark segmentation map and a pixel-wise direction voting map. We identify the locations and labels of landmarks from these two maps by counting the intersections of pixel votes grouped by the segmentation map. After establishing 2D-3D correspondences according to the landmark labels, we estimate the 6-DoF camera pose with a standard RANSAC-PnP solver.
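The correspondence-building step above can be sketched as follows. This is a minimal illustration, not the authors' code: the detected landmark dictionary, the `landmarks_3d` lookup table of patch centers, and all toy values are assumptions for the example. The resulting arrays would feed a standard RANSAC-PnP solver (e.g. OpenCV's `solvePnPRansac`, not shown here).

```python
import numpy as np

def build_correspondences(detected_2d, landmarks_3d):
    """Pair detected landmarks with their known 3D patch centers.

    detected_2d: {landmark_label: (x, y)} 2D locations recovered by voting.
    landmarks_3d: (L, 3) array of 3D patch centers indexed by landmark label.
    Returns (M, 2) image points and (M, 3) scene points for PnP.
    """
    pts2d, pts3d = [], []
    for label, xy in detected_2d.items():
        pts2d.append(xy)
        pts3d.append(landmarks_3d[label])
    return np.asarray(pts2d, dtype=np.float64), np.asarray(pts3d, dtype=np.float64)

# Toy usage: two landmarks (labels 0 and 2) detected in the query image.
landmarks_3d = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 2.0], [0.0, 1.0, 3.0]])
detected = {0: (320.5, 240.1), 2: (100.0, 50.0)}
p2d, p3d = build_correspondences(detected, landmarks_3d)
# p2d has shape (2, 2) and p3d has shape (2, 3).
```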

Identifying Scene-specific Landmarks with VS-Net

VS landmark generation

We detect the landmarks with the voting-intersection based RANSAC algorithm.

1) Filter out outlier pixels for landmark localization by RANSAC. Estimating the location from the remaining pixel votes achieves high accuracy and robustness against distracting factors.

2) Reject unreliable landmark locations that have too few inlier votes, which ensures the accuracy of the correspondences used in the follow-up pose estimation.
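The two steps above can be sketched for a single landmark segment. This is a simplified illustration under assumptions (iteration count, thresholds, and the two-ray intersection hypothesis scheme are choices for the example, not the paper's exact parameters): sampled pairs of pixel votes propose a location by ray intersection, pixels whose votes point at the hypothesis count as inliers, and landmarks with too few inliers are rejected.

```python
import numpy as np

def _intersect(p1, d1, p2, d2):
    # Solve p1 + t1*d1 = p2 + t2*d2 for the intersection of two voting rays.
    A = np.column_stack([d1, -d2])
    if abs(np.linalg.det(A)) < 1e-8:   # near-parallel votes: no stable point
        return None
    t = np.linalg.solve(A, p2 - p1)
    return p1 + t[0] * d1

def vote_landmark(pixels, dirs, iters=100, cos_thresh=0.999, min_inliers=5, seed=0):
    """Locate one landmark from the pixels of a single segment.

    pixels: (N, 2) pixel coordinates; dirs: (N, 2) unit voting directions.
    Returns (location, inlier_count), or None if the landmark is rejected.
    """
    rng = np.random.default_rng(seed)
    best_h, best_in = None, 0
    for _ in range(iters):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        h = _intersect(pixels[i], dirs[i], pixels[j], dirs[j])
        if h is None:
            continue
        # Step 1: a pixel is an inlier if its vote points at hypothesis h.
        v = h - pixels
        norm = np.linalg.norm(v, axis=1)
        cos = np.where(norm > 1e-6, (v * dirs).sum(1) / np.maximum(norm, 1e-6), 0.0)
        inliers = int((cos > cos_thresh).sum())
        if inliers > best_in:
            best_in, best_h = inliers, h
    if best_in < min_inliers:          # step 2: reject unreliable landmarks
        return None
    return best_h, best_in
```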

Prototype-based triplet loss for segmentation

VS prototype-based

The cross-entropy loss is computation and memory costly when the number of labels is large, so we instead train the segmentation map with the proposed prototype-based triplet loss:

1) Maintain a learnable prototype set that contains one prototype feature for each landmark (label)

2) Pull the feature embedding of each pixel toward its target prototype

3) Push the feature embedding away from only the top-k closest non-corresponding prototypes (hard negatives) rather than all non-corresponding prototypes

Comparison with scene coordinates

VS Comparison

Reprojection errors of 2D-to-3D correspondences from scene coordinates and from scene-specific landmarks. (a) The query image. (b) The reprojection errors of dense scene coordinates predicted by the regression-only network. (c) The reprojection errors of scene-specific landmarks and their surrounding patches by the proposed method. Pixels belonging to the same landmark are painted with the same color representing that landmark's reprojection error. The white pixels in (c) are filtered out by our voting-by-segmentation algorithm. Our scene-specific landmarks achieve much lower reprojection error compared with dense scene coordinate regression.

Comparison with general features

FeatureMatching Comparison

All the matches have been filtered by epipolar geometry with RANSAC. R2D2 fails to produce more than three correspondences, which are insufficient for the subsequent fundamental matrix estimation, so we do not present it in the figures. Even after RANSAC, SIFT and D2-Net yield many erroneous matches when the baseline is large because the textures are locally similar; we highlight some of them with red circles. By contrast, our landmark association remains robust and accurate under such a significant viewpoint change because VS-Net learns the scene-specific bias well.

Quantitative Comparison

Quantitative Comparison

Our VS-Net outperforms previous visual localization methods on the Microsoft 7-Scenes dataset and the Cambridge Landmarks dataset.

Qualitative localization performance


Citation

@inproceedings{huang2021vsnet,
  title={VS-Net: Voting with Segmentation for Visual Localization},
  author={Huang, Zhaoyang and Zhou, Han and Li, Yijin and Yang, Bangbang and Xu, Yan and Zhou, Xiaowei and Bao, Hujun and Zhang, Guofeng and Li, Hongsheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}