1 The Chinese University of Hong Kong
2 Samsung Telecommunication Research
3 NVIDIA AI Technology Center
4 SenseTime Research
* denotes equal contributions
We introduce Optical Flow TransFormer (FlowFormer), a transformer-based neural network architecture for learning optical flow. FlowFormer tokenizes the 4D cost volume built from an image pair, encodes the cost tokens into a cost memory with alternate-group transformer (AGT) layers in a novel latent space, and decodes the cost memory via a recurrent transformer decoder with dynamic positional cost queries. On the Sintel benchmark, FlowFormer achieves 1.159 and 2.088 average end-point-error (AEPE) on the clean and final passes, a 16.5% and 15.5% error reduction from the best published results (1.388 and 2.47). FlowFormer also generalizes well: without being trained on Sintel, it achieves 0.64 and 1.50 AEPE on the clean and final passes of the Sintel training set, outperforming the best published results (1.29 on the clean pass) by 50.4% and 45.3%.
We provide a video comparing results of our proposed FlowFormer with those of GMA on the Sintel training set, together with flow prediction on two video clips from the Internet.
Recently, transformers have attracted much attention for their ability to model long-range relations, which can benefit optical flow estimation. Can we enjoy the advantages of both transformers and the cost volume from previous milestone architectures? This question calls for novel transformer architectures for optical flow estimation that can effectively aggregate information from the cost volume. In this paper, we introduce the novel optical Flow TransFormer (FlowFormer) to address this challenging problem.
FlowFormer adopts an encoder-decoder architecture for cost volume encoding and decoding. After building a 4D cost volume, FlowFormer processes it with two main components: 1) a cost volume encoder that embeds the 4D cost volume into a latent cost space and fully encodes the cost information in that space, and 2) a recurrent cost decoder that estimates flows from the encoded latent cost features. Compared with previous works, the main characteristic of FlowFormer is that it adapts transformer architectures to effectively process cost volumes, which are compact yet rich representations widely explored in the optical flow estimation community, in order to estimate accurate optical flows.
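To make the 4D cost volume concrete, the following is a minimal sketch. It assumes, as in RAFT-style pipelines, that each cost is the dot product between a source-pixel feature and a target-pixel feature; the text above does not specify the similarity function, and `build_cost_volume` and the toy feature maps are hypothetical names and values used only for illustration.

```python
from typing import List

Feature = List[float]
FeatureMap = List[List[Feature]]  # indexed as [row][col][channel]


def build_cost_volume(src: FeatureMap, tgt: FeatureMap) -> List[List[List[List[float]]]]:
    """Return cost[i][j][u][v]: similarity of source pixel (i, j) and target pixel (u, v).

    Assumption: similarity is a plain dot product of feature vectors.
    """
    return [
        [
            [
                [sum(a * b for a, b in zip(src_feat, tgt_feat)) for tgt_feat in tgt_row]
                for tgt_row in tgt
            ]
            for src_feat in src_row
        ]
        for src_row in src
    ]


# Toy 2x2 feature maps with 2 channels each (made-up values).
src = [[[1.0, 0.0], [0.0, 1.0]],
       [[1.0, 1.0], [0.5, 0.5]]]
tgt = [[[0.0, 1.0], [1.0, 0.0]],
       [[0.5, 0.5], [1.0, 1.0]]]

cost = build_cost_volume(src, tgt)
# cost has shape H x W x H x W; cost[i][j] is the 2D cost map of source pixel (i, j).
cost_map_00 = cost[0][0]
print(cost_map_00)
```

Each `cost[i][j]` is a 2D cost map over the target image. In this simplified picture, these per-pixel cost maps are the raw material that a cost volume encoder would tokenize and embed into a latent space.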
Our contributions can be summarized as threefold:
* denotes that the methods use the warm-start strategy. † means the result is evaluated with the tile technique.
We train FlowFormer on FlyingChairs and FlyingThings (C+T) and evaluate it on the training sets of Sintel and KITTI-2015. This setting evaluates the generalization performance of optical flow models. FlowFormer ranks 1st among all compared methods on both benchmarks. FlowFormer achieves 0.64 and 1.50 AEPE on the clean and final passes of Sintel. On the KITTI-2015 training set, FlowFormer achieves 4.09 F1-epe and 14.72 F1-all. Compared to GMA, FlowFormer reduces the error by 50.4% and 45.3% on Sintel clean and final, and by 13.9% on KITTI-2015 F1-all, which shows its strong generalization performance.
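For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly computed. AEPE is the mean Euclidean distance between predicted and ground-truth flow vectors. For F1-all, the sketch assumes the standard KITTI-2015 outlier rule (end-point error above 3 pixels and above 5% of the ground-truth magnitude); this definition is not restated in the text, validity masks are ignored, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Flow = List[Tuple[float, float]]  # one (dx, dy) vector per pixel, flattened


def endpoint_errors(pred: Flow, gt: Flow) -> List[float]:
    """Per-pixel Euclidean distance between predicted and ground-truth flow."""
    return [math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt)]


def aepe(pred: Flow, gt: Flow) -> float:
    """Average end-point error over all pixels."""
    errors = endpoint_errors(pred, gt)
    return sum(errors) / len(errors)


def outlier_percentage(pred: Flow, gt: Flow) -> float:
    """Percentage of pixels whose error exceeds 3 px AND 5% of the ground-truth magnitude.

    Assumption: this is the KITTI-2015 outlier rule behind F1-all.
    """
    errors = endpoint_errors(pred, gt)
    outliers = sum(
        1
        for err, (gu, gv) in zip(errors, gt)
        if err > 3.0 and err > 0.05 * math.hypot(gu, gv)
    )
    return 100.0 * outliers / len(errors)


# Made-up flows for four pixels.
gt = [(10.0, 0.0), (0.0, 0.0), (100.0, 0.0), (3.0, 4.0)]
pred = [(10.0, 0.0), (3.0, 4.0), (104.0, 0.0), (3.0, 4.0)]

print(aepe(pred, gt))                # per-pixel errors are 0, 5, 4, 0
print(outlier_percentage(pred, gt))  # only the second pixel counts as an outlier
```

The third pixel shows why the rule has two conditions: its 4-pixel error exceeds the absolute threshold but is under 5% of the 100-pixel ground-truth motion, so it is not counted as an outlier.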
FlowFormer achieves 1.16 and 2.09 on the Sintel clean and final passes, 16.5% and 15.5% lower error than GMA*, and ranks 1st on both passes of the Sintel benchmark. It is noteworthy that RAFT* and GMA* use the warm-start strategy, which requires image sequences, while FlowFormer does not. Compared with GMA, which also does not use the warm-start, FlowFormer obtains 17.2% and 27.5% error reduction. Comparing RAFT vs. RAFT* and GMA vs. GMA*, we can see significant error reduction from the warm-start strategy, especially on the final pass. RAFT trained on the AutoFlow dataset (A+S+K+H) significantly outperforms RAFT trained on C+T+S+K+H on the final pass because AutoFlow provides more challenging training image pairs. We believe training FlowFormer with AutoFlow could achieve better accuracy, but AutoFlow has not been released yet.
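The percentage reductions quoted for the Sintel benchmark follow from simple arithmetic. As a quick check, this sketch recomputes them from the AEPE values reported above for FlowFormer (1.159 and 2.088) and for the previous best result (1.388 and 2.47); `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


clean = relative_reduction(baseline=1.388, ours=1.159)
final = relative_reduction(baseline=2.47, ours=2.088)

print(f"clean: {clean:.1f}%")  # clean: 16.5%
print(f"final: {final:.1f}%")  # final: 15.5%
```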
FlowFormer achieves 4.68, ranking 2nd on the KITTI-2015 benchmark. S-Flow obtains slightly smaller error than FlowFormer on KITTI (-0.85%) but is significantly worse on Sintel (31.6% and 22.5% larger error on the clean and final passes). S-Flow finds corresponding points by computing the coordinate expectation weighted by refined cost maps. Images in the KITTI dataset are captured in urban traffic scenes, which contain mostly rigid objects. Flows on rigid objects are rather simple, which makes them easier for cost-based coordinate expectation, but the rigidity assumption is easily violated in non-rigid scenarios such as Sintel.
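The coordinate-expectation idea can be illustrated with a minimal sketch. It assumes the cost map is turned into a probability distribution with a softmax and that the matched coordinate is the probability-weighted mean of target coordinates. This is a generic soft-argmax, not S-Flow's actual implementation, and all names and values are hypothetical.

```python
import math
from typing import List, Tuple


def expected_coordinate(cost_map: List[List[float]]) -> Tuple[float, float]:
    """Soft-argmax: softmax over the cost map, then the weighted mean (x, y) coordinate."""
    peak = max(max(row) for row in cost_map)  # subtract the max for numerical stability
    weights = [[math.exp(c - peak) for c in row] for row in cost_map]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


def flow_from_cost_map(src_xy: Tuple[int, int], cost_map: List[List[float]]) -> Tuple[float, float]:
    """Flow = expected matching coordinate minus the source pixel coordinate."""
    ex, ey = expected_coordinate(cost_map)
    return ex - src_xy[0], ey - src_xy[1]


# A sharply peaked 3x3 cost map (made-up values): the match is at x=2, y=1.
peaked = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 50.0],
          [0.0, 0.0, 0.0]]
print(flow_from_cost_map((1, 1), peaked))  # approximately (1.0, 0.0)

# Two equally strong peaks: the expectation lands between them, on neither match.
ambiguous = [[50.0, 0.0, 50.0]]
print(expected_coordinate(ambiguous))  # approximately (1.0, 0.0)
```

When the cost map has a single clear peak, the expectation recovers the match. When it has several modes, the expectation averages them, which is one plausible way such an estimator could degrade on complex motion.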
We visualize flows estimated by FlowFormer and GMA for three examples in the figure to qualitatively show how FlowFormer outperforms GMA. As transformers can encode the cost information with a large receptive field, FlowFormer can distinguish overlapping objects via contextual information and thus reduce the leakage of flows over boundaries. Compared with GMA, the flows estimated by FlowFormer on the boundaries of the bamboo and the human body are more precise and clear. FlowFormer can also recover motion details that GMA misses, such as the hair and the holes in the box.