FlowFormer: A Transformer Architecture for Optical Flow

ECCV 2022

Ranks 1st on Sintel Optical Flow benchmark on Mar. 17th, 2022

Zhaoyang Huang¹,³, Xiaoyu Shi¹,³, Chao Zhang², Qiang Wang², Ka Chun Cheung³, Hongwei Qin⁴, Jifeng Dai⁴, Hongsheng Li¹

¹The Chinese University of Hong Kong ²Samsung Telecommunication Research
³NVIDIA AI Technology Center ⁴SenseTime Research
* denotes equal contributions

Abstract

We introduce Optical Flow TransFormer (FlowFormer), a transformer-based neural network architecture for learning optical flow. FlowFormer tokenizes the 4D cost volume built from an image pair, encodes the cost tokens into a cost memory with alternate-group transformer (AGT) layers in a novel latent space, and decodes the cost memory via a recurrent transformer decoder with dynamic positional cost queries. On the Sintel benchmark, FlowFormer achieves 1.159 and 2.088 average end-point error (AEPE) on the clean and final passes, a 16.5% and 15.5% error reduction from the best published results (1.388 and 2.47). FlowFormer also generalizes well: without being trained on Sintel, it achieves 0.64 and 1.50 AEPE on the clean and final passes of the Sintel training set, outperforming the best published result (1.29) by 50.4% and 45.3%.
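The error-reduction percentages quoted above follow from a simple relative difference. The sketch below reproduces them from the AEPE values in the abstract; the function name `relative_error_reduction` is ours and is used only for illustration.

```python
def relative_error_reduction(baseline: float, ours: float) -> float:
    """Return how much lower `ours` is than `baseline`, in percent of `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# (best published AEPE, FlowFormer AEPE) pairs quoted in the abstract
# for the Sintel benchmark.
sintel_benchmark = {"clean": (1.388, 1.159), "final": (2.47, 2.088)}

for pass_name, (baseline, ours) in sintel_benchmark.items():
    reduction = relative_error_reduction(baseline, ours)
    print(f"{pass_name}: {reduction:.1f}% error reduction")
```

The same formula applied to 1.29 and 0.64 gives the 50.4% figure reported for the clean pass in the generalization setting.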

Method

Motivation

Recently, transformers have attracted much attention for their ability to model long-range relations, which can benefit optical flow estimation. Can we enjoy the advantages of both transformers and the cost volume used by previous milestone architectures? This question calls for novel transformer architectures for optical flow estimation that can effectively aggregate information from the cost volume. In this paper, we introduce the novel optical Flow TransFormer (FlowFormer) to address this challenging problem.

Architecture

FlowFormer adopts an encoder-decoder architecture for cost volume encoding and decoding. After building a 4D cost volume, FlowFormer processes it with two main components: 1) a cost volume encoder that embeds the 4D cost volume into a latent cost space and fully encodes the cost information in that space, and 2) a recurrent cost decoder that estimates flows from the encoded latent cost features. Compared with previous works, the main characteristic of FlowFormer is that it adapts transformer architectures to effectively process cost volumes, which are compact yet rich representations widely explored in the optical flow estimation community, for estimating accurate optical flows.
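To make the term "4D cost volume" concrete, here is a minimal pure-Python sketch. It assumes the cost between a source pixel and a target pixel is the dot product of their feature vectors, which is a common choice but is not spelled out on this page. The names `build_cost_volume`, `src_features`, and `tgt_features`, and the toy feature values, are hypothetical.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col] -> feature vector


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equally sized feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_cost_volume(src: FeatureMap, tgt: FeatureMap) -> List[List[List[List[float]]]]:
    """Return cost[i][j][k][l] = <src[i][j], tgt[k][l]> (assumed similarity)."""
    return [
        [
            # One 2D cost map over all target pixels for source pixel (i, j).
            [[dot(f_src, f_tgt) for f_tgt in tgt_row] for tgt_row in tgt]
            for f_src in src_row
        ]
        for src_row in src
    ]


# Toy 2x2 feature maps with 2-dimensional features (made-up values).
src_features = [[[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 1.0], [0.0, 0.0]]]
tgt_features = [[[0.0, 1.0], [1.0, 0.0]],
                [[0.0, 0.0], [1.0, 1.0]]]

cost_volume = build_cost_volume(src_features, tgt_features)
print(cost_volume[0][0])  # cost map of source pixel (0, 0) over the target image
```

The four indices are the two source-pixel coordinates and the two target-pixel coordinates, so each source pixel owns one 2D cost map. A real implementation would use batched tensor operations rather than nested lists.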

Our contributions are fourfold:

We propose a novel transformer-based neural network architecture, FlowFormer, for optical flow estimation, which achieves state-of-the-art flow estimation performance.

We design a novel cost volume encoder, effectively aggregating cost information into compact latent cost tokens.

We propose a recurrent cost decoder that recurrently decodes cost features with dynamic positional cost queries to iteratively refine the estimated optical flows.

We validate for the first time that an ImageNet-pretrained transformer can benefit the estimation of optical flow.

Comparison with Other Methods

* denotes that the methods use the warm-start strategy. † means the result is evaluated with the tile technique.

Generalization Performance

We train FlowFormer on FlyingChairs and FlyingThings (C+T) and evaluate it on the training sets of Sintel and KITTI-2015. This setting evaluates the generalization performance of optical flow models. FlowFormer ranks 1st among all compared methods on both benchmarks. It achieves 0.64 and 1.50 AEPE on the clean and final passes of Sintel. On the KITTI-2015 training set, FlowFormer achieves 4.09 F1-epe and 14.72 F1-all. Compared to GMA, FlowFormer reduces the error by 50.4% and 45.3% on Sintel clean and final, and by 13.9% on KITTI-2015 F1-all, which shows its strong generalization performance.
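For readers unfamiliar with the metrics, the sketch below shows how AEPE and the KITTI-2015 outlier rate (written F1-all on this page) are commonly computed. It assumes every pixel has valid ground truth and uses the KITTI convention that a pixel is an outlier when its end-point error is at least 3 px and at least 5% of the ground-truth flow magnitude. The function names and flow values are hypothetical and are not FlowFormer outputs.

```python
import math
from typing import List, Tuple

Flow = List[Tuple[float, float]]  # one (u, v) displacement per pixel


def endpoint_errors(pred: Flow, gt: Flow) -> List[float]:
    """Euclidean distance between predicted and ground-truth flow per pixel."""
    return [math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt)]


def aepe(pred: Flow, gt: Flow) -> float:
    """Average end-point error over all pixels."""
    errors = endpoint_errors(pred, gt)
    return sum(errors) / len(errors)


def f1_all(pred: Flow, gt: Flow) -> float:
    """Percentage of outlier pixels (error >= 3 px and >= 5% of |gt|)."""
    outliers = 0
    for err, (gu, gv) in zip(endpoint_errors(pred, gt), gt):
        if err >= 3.0 and err >= 0.05 * math.hypot(gu, gv):
            outliers += 1
    return 100.0 * outliers / len(gt)


# Four made-up pixels: two exact predictions and two large misses.
pred_flow = [(1.0, 0.0), (0.0, 0.0), (10.0, 0.0), (3.0, 4.0)]
gt_flow = [(1.0, 0.0), (3.0, 4.0), (10.0, 4.0), (3.0, 4.0)]

print(f"AEPE:   {aepe(pred_flow, gt_flow):.2f}")
print(f"F1-all: {f1_all(pred_flow, gt_flow):.1f}%")
```

AEPE averages the error magnitude, so a few large misses dominate it, while F1-all only counts how many pixels are badly wrong.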

Sintel Benchmark

FlowFormer achieves 1.16 and 2.09 on the Sintel clean and final passes, 16.5% and 15.5% lower error than GMA*, ranking 1st on both passes of the Sintel benchmark. It is noteworthy that RAFT* and GMA* use the warm-start strategy, which requires image sequences, while FlowFormer does not. Compared with GMA, which also does not use warm-start, FlowFormer obtains 17.2% and 27.5% error reduction. Comparing RAFT vs. RAFT* and GMA vs. GMA*, we can see a significant error reduction from the warm-start strategy, especially on the final pass. RAFT trained on the AutoFlow dataset (A+S+K+H) significantly outperforms RAFT trained on C+T+S+K+H on the final pass because AutoFlow provides more challenging training image pairs. We believe training FlowFormer with AutoFlow could achieve better accuracy, but the dataset has not been released yet.

KITTI-2015 Benchmark

FlowFormer achieves 4.68, ranking 2nd on the KITTI-2015 benchmark. S-Flow obtains a slightly smaller error than FlowFormer on KITTI (-0.85%) but is significantly worse on Sintel (31.6% and 22.5% larger error on the clean and final passes). S-Flow finds corresponding points by computing the coordinate expectation weighted by refined cost maps. Images in the KITTI dataset are captured in urban traffic scenes, which contain mostly rigid objects. Flows on rigid objects are rather simple, which is easier for cost-based coordinate expectation, but this assumption is easily violated in non-rigid scenarios such as Sintel.
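The idea of a coordinate expectation weighted by a cost map can be sketched as a soft-argmax. This is a generic illustration, not S-Flow's actual implementation. It assumes higher cost-map values mean better matches and that the weights come from a softmax; the function name and the toy cost maps are hypothetical.

```python
import math
from typing import List, Tuple


def expected_correspondence(cost_map: List[List[float]]) -> Tuple[float, float]:
    """Return the softmax-weighted mean (x, y) coordinate of a 2D cost map."""
    peak = max(c for row in cost_map for c in row)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [[math.exp(c - peak) for c in row] for row in cost_map]
    total = sum(sum(row) for row in weights)
    exp_x = sum(w * x for row in weights for x, w in enumerate(row)) / total
    exp_y = sum(w * y for y, row in enumerate(weights) for w in row) / total
    return exp_x, exp_y


# A single sharp peak at (x=2, y=1): the expectation lands on the peak.
single_peak = [[0.0, 0.0, 0.0],
               [0.0, 0.0, 20.0],
               [0.0, 0.0, 0.0]]
print(expected_correspondence(single_peak))

# Two equally strong peaks at opposite corners: the expectation falls
# between them, at a location that matches neither candidate.
two_peaks = [[10.0, 0.0, 0.0],
             [0.0, 0.0, 0.0],
             [0.0, 0.0, 10.0]]
print(expected_correspondence(two_peaks))
```

The second toy case shows one way an expectation can mislead when a cost map is ambiguous. It is meant only as an illustration of the general technique, not as a claim about how S-Flow behaves on any dataset.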

Qualitative Results

Qualitative Comparison

We visualize the flows estimated by FlowFormer and GMA for three examples in the figure to qualitatively show how FlowFormer outperforms GMA. As transformers can encode the cost information with a large receptive field, FlowFormer can distinguish overlapping objects via contextual information and thus reduce the leakage of flows over boundaries. Compared with GMA, the flows estimated by FlowFormer on the boundaries of the bamboo and the human body are more precise and clear. FlowFormer can also recover motion details that GMA misses, such as the hair and the holes in the box.