TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers

CVPR 2022


Yikang Ding2*, Wentao Yuan3*, Qingtian Zhu3, Haotian Zhang1, Xiangyue Liu1, Yuanjiang Wang1, Xiao Liu1

1Megvii Research    2Tsinghua University    3Peking University   
* denotes equal contributions

Abstract


(Teaser figure)

In this paper, we present TransMVSNet, based on our exploration of feature matching in multi-view stereo (MVS). We trace MVS back to its nature as a feature matching task and therefore propose a powerful Feature Matching Transformer (FMT) that leverages intra- (self-) and inter- (cross-) attention to aggregate long-range context information within and across images. To facilitate a better adaptation of the FMT, we leverage an Adaptive Receptive Field (ARF) module to ensure a smooth transition in the receptive field of features, and we bridge different stages with a feature pathway that passes transformed features and gradients across different scales. In addition, we apply pair-wise feature correlation to measure the similarity between features, and adopt an ambiguity-reducing focal loss to strengthen the supervision. To the best of our knowledge, TransMVSNet is the first attempt to leverage the Transformer for the task of MVS. As a result, our method achieves state-of-the-art performance on the DTU dataset, the Tanks and Temples benchmark, and the BlendedMVS dataset.
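As a rough illustration of the pair-wise feature correlation mentioned above, the sketch below computes a channel-wise inner product between reference features and source features warped to a set of depth hypotheses. The function name `pairwise_correlation` and the tensor shapes are our own assumptions for this example, not the paper's exact implementation.

```python
import torch

def pairwise_correlation(ref_feat: torch.Tensor, src_feat_warped: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of pair-wise feature correlation.

    ref_feat:        [B, C, H, W]    reference-view features
    src_feat_warped: [B, C, D, H, W] source features warped to the reference
                                     view at D depth hypotheses
    returns:         [B, D, H, W]    one similarity score per depth and pixel
    """
    # Channel-wise inner product (averaged over C) as the similarity measure.
    return (ref_feat.unsqueeze(2) * src_feat_warped).mean(dim=1)
```

In this reading, the correlation scores over the D depth hypotheses play the role of a matching cost, which downstream regularization and depth regression then operate on.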


Architecture of FMT



Different from a typical one-to-one matching task between two views, MVS tackles a one-to-many matching problem, where the context information of all views should be considered simultaneously. To this end, we propose the FMT to capture long-range context information within and across images. The architecture of the FMT is illustrated above. Following SuperGlue, we add positional encoding, which implicitly enhances positional consistency and makes the FMT robust to feature maps of different resolutions. The flattened feature map of each view is processed by N_a attention blocks sequentially. Within each attention block, the reference feature and each source feature first compute intra-attention with shared weights, so that all features are updated with their respective embedded global context information. Afterwards, unidirectional inter-attention is performed, with which each source feature is updated according to the information retrieved from the reference feature.
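To make the intra-/inter-attention flow concrete, here is a minimal PyTorch sketch of one attention block. It uses standard multi-head attention and the assumed names `AttentionBlock`, `intra`, and `inter`; the actual FMT employs linear attention and positional encoding, so this should be read as an illustration of the data flow rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Minimal sketch of one FMT attention block (hypothetical sizes/names)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ref, srcs):
        # ref:  [B, N, C] flattened reference-view features
        # srcs: list of [B, N, C] flattened source-view features
        # 1) intra- (self-) attention with shared weights on every view,
        #    embedding global context within each image.
        ref = ref + self.intra(ref, ref, ref)[0]
        srcs = [s + self.intra(s, s, s)[0] for s in srcs]
        # 2) unidirectional inter- (cross-) attention: each source view
        #    queries the reference view and is updated with the retrieved
        #    context, while the reference features stay unchanged.
        srcs = [s + self.inter(s, ref, ref)[0] for s in srcs]
        return ref, srcs
```

Stacking N_a such blocks sequentially corresponds to the repeated intra-/inter-attention described above.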

Comparison results on Tanks and Temples Benchmark


Evolution of feature maps via FMT


Visualization of intra- and inter-attention weights


Point clouds of DTU evaluation set


Point clouds of Tanks and Temples Benchmark


Point clouds of BlendedMVS evaluation set
