Video object segmentation and tracking

Voigtlaender, Paul; Leibe, Bastian (Thesis advisor); Leal-Taixé, Laura (Thesis advisor)

Aachen : RWTH Aachen University (2021, 2022)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2021


Video Object Segmentation (VOS) is the computer vision task of segmenting generic objects in a video given their ground-truth segmentation masks in the first frame. Closely related are the tasks of single-object tracking (SOT) and multi-object tracking (MOT), in which one or multiple objects need to be tracked at the bounding-box level. All of these tasks have important applications such as autonomous driving and video editing, and all of them remain very challenging to this day.
In this work, we propose three different methods for VOS, each following a different paradigm. The first method, OnAVOS, follows the appearance-based paradigm and performs online updating so that it can adapt to appearance changes while processing a video. The second method, PReMVOS, follows the tracking-by-detection paradigm. PReMVOS uses an instance segmentor fine-tuned on the first frame to provide object mask proposals. These proposals are then linked over time into tracks using re-identification and optical flow mask warping cues. The third method, FEELVOS, follows the feature embedding-learning paradigm. FEELVOS is one of the first VOS methods that use a feature embedding as internal guidance of a convolutional network and learn the embedding end-to-end with a segmentation loss. Following this approach, FEELVOS achieves strong results while being fast and not requiring test-time fine-tuning. This feature embedding-learning paradigm, together with end-to-end learning, has by now become the dominant approach for VOS.
Since datasets are a major driving force behind progress in VOS, we further develop and validate a semi-automatic approach for labeling VOS datasets based on bounding box annotations. We demonstrate that training a state-of-the-art VOS model using the (semi-)automatically generated labels leads to results which come very close to those obtained with fully hand-labeled annotations. We apply this annotation procedure to create mask annotations for the challenging Tracking Any Object (TAO) dataset and release the resulting TAO-VOS benchmark. We demonstrate that, unlike existing VOS benchmarks, TAO-VOS is able to reveal significant differences in the performance of current methods and that results on TAO-VOS are not yet saturated.
We further extend the popular MOT task to Multi-Object Tracking and Segmentation (MOTS) by requiring methods to also produce segmentation masks. We annotate two existing MOT datasets with masks and release the resulting KITTI MOTS and MOTSChallenge benchmarks together with new evaluation measures and a baseline method. Additionally, we promote the new MOTS task by hosting a workshop challenge. MOTS is a step towards bringing the VOS and MOT communities together to facilitate further exchange of ideas.
Finally, we develop Siam R-CNN, a Siamese re-detection architecture based on Faster R-CNN, to tackle the task of long-term single-object tracking. In contrast to most previous long-term tracking approaches, Siam R-CNN performs re-detection on the whole image instead of a local window, allowing it to recover after losing the object of interest. Additionally, we propose a tracklet dynamic programming algorithm (TDPA) to incorporate spatio-temporal context into Siam R-CNN. Siam R-CNN produces strong results for SOT and VOS, and performs especially well for long-term tracking.


  • Department of Computer Science [120000]
  • Chair of Computer Science 13 (Computer Vision) [123710]