Computer Science Graduate Seminar

Friday, September 24, 2021, 9:00am

Video Object Segmentation and Tracking



Video Object Segmentation (VOS) is the computer vision task of segmenting generic objects in a video given their ground-truth segmentation masks in the first frame. Closely related are single-object tracking (SOT) and multi-object tracking (MOT), where one or multiple objects need to be tracked at the bounding-box level. All three tasks have important applications, such as autonomous driving and video editing, and all of them remain very challenging. In this talk, we present our work on VOS, MOT, and SOT.

First, we present FEELVOS, a VOS method that follows the feature embedding-learning paradigm. FEELVOS is one of the first VOS methods to use a feature embedding as internal guidance of a convolutional network and to learn the embedding end-to-end with a segmentation loss. With this approach, FEELVOS achieves strong results while being fast and not requiring test-time fine-tuning. The feature embedding-learning paradigm, combined with end-to-end learning, has since become the dominant approach for VOS.

We further extend the popular MOT task to Multi-Object Tracking and Segmentation (MOTS) by requiring methods to also produce segmentation masks. We propose a semi-automatic labeling method and use it to annotate two existing MOT datasets with masks. We release the resulting KITTI MOTS and MOTSChallenge benchmarks together with new evaluation measures and a baseline method. Additionally, we promote the new MOTS task by hosting a workshop challenge. MOTS is a step towards bringing the VOS and MOT communities together to facilitate further exchange of ideas.

Finally, we present Siam R-CNN, a Siamese re-detection architecture based on Faster R-CNN, to tackle the task of long-term single-object tracking. In contrast to most previous long-term tracking approaches, Siam R-CNN performs re-detection on the whole image instead of a local window, allowing it to recover after losing the object of interest. Additionally, we propose a Tracklet Dynamic Programming Algorithm (TDPA) to incorporate spatio-temporal context into Siam R-CNN. Siam R-CNN produces strong results for SOT and VOS, and performs especially well for long-term tracking.


The computer science lecturers invite everyone interested to attend.