Vision-based category-agnostic object tracking for mobile robots and intelligent vehicles

Osep, Aljose; Leibe, Bastian (Thesis advisor); Held, David (Thesis advisor)

Aachen (2019, 2020)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2019


Analyzing moving objects is a vital ability of mobile vehicles, such as self-driving cars. Through tracking, autonomous systems become aware of the whereabouts of important objects and can anticipate their future motion. This ability to foresee potential collisions and react to potentially harmful situations is essential for safe robot navigation, motion planning, and collision avoidance. In recent years, deep learning has revolutionized the way research is performed in computer vision, and one of the success stories of this development is the rapid progress in the area of object detection. With the availability of robust object detectors, tracking-by-detection has established itself as the leading paradigm for vision-based multi-object tracking. The majority of existing vision-based methods perform tracking in the image domain. Yet, in mobile robotics and autonomous driving scenarios, precise 3D localization and trajectory estimation are of fundamental importance. Furthermore, tracking-by-detection approaches are inherently limited to a pre-defined set of object categories for which object detectors can be robustly trained. However, future mobile systems will need the capability to cope with rich human-made environments, in which obtaining detectors for every possible object category would be infeasible. The first goal of this thesis is to develop a vision system that lifts the detection-based multi-object tracking paradigm to 3D using an inexpensive stereo setup and is able to detect, track, and localize surrounding objects precisely in 3D space. To this end, we propose a system that carefully combines 2D object detections and stereo-based depth measurements in order to improve both image-based tracking and, more importantly, precise 3D localization. During tracking, we loosely couple image detections and 3D object segmentation estimates and combine them on the object-track level.
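The basic idea of lifting a 2D detection to a 3D position with stereo depth can be sketched roughly as follows. This is a minimal illustration under common assumptions (pinhole intrinsics, a metric depth map from stereo matching, and a median-depth heuristic over the detection box); the function name and heuristic are illustrative, not the thesis's exact formulation.

```python
import numpy as np

def lift_detection_to_3d(box, depth_map, fx, fy, cx, cy):
    """Lift a 2D detection box to a 3D point using a stereo depth map.

    box: (x1, y1, x2, y2) pixel coordinates of the detection.
    depth_map: HxW array of metric depths from stereo matching.
    fx, fy, cx, cy: pinhole camera intrinsics.
    Returns a 3D point (X, Y, Z) in the camera frame, or None if no
    reliable depth is available (e.g. for very distant objects).
    """
    x1, y1, x2, y2 = [int(v) for v in box]
    patch = depth_map[y1:y2, x1:x2]
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None
    z = np.median(valid)          # robust depth estimate within the box
    u = 0.5 * (x1 + x2)           # box center in pixels
    v = 0.5 * (y1 + y2)
    # Back-project the box center through the pinhole model.
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
```

The `None` return path reflects the range limitation discussed above: beyond the usable stereo range a detection can still be tracked in the image, but no precise 3D estimate is available.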
This enables us to track distant objects and to continue these tracks with more precise information in the close range, smoothly transitioning between the two modalities. This approach still requires object detections and is therefore limited to the most common object categories, for which detectors are readily available. To overcome this limitation, we further propose CAMOT, a vision-based, category-agnostic multi-object tracking approach. CAMOT leverages recent developments in learning-based object proposal generation and lifts image-based proposal estimates to 3D space in order to estimate trajectories of arbitrary objects. At the core of this approach is an efficient mask-based representation of tracked objects that can easily be lifted to 3D space in the presence of depth estimates and that allows for robust and precise data association based on the estimated 3D position and a pixel-precise representation of the tracks. Even though objects are tracked regardless of their category, the most common traffic participants can still be recognized by classifying these object tracks. We further extend CAMOT to the task of video-object proposal generation and demonstrate that, by using motion and parallax consistency as filters, we can train our method on a smaller dataset containing labels for 80 classes and outperform state-of-the-art methods trained on a large-scale dataset with over 3,000 classes. To evaluate the capabilities of the proposed tracking methods, we evaluate them on the KITTI tracking benchmark. To further demonstrate their efficacy and robustness, we apply them to several hours of driving video from the Oxford RobotCar dataset, captured in challenging weather and lighting conditions.
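A data association score of the kind described above, combining a pixel-precise mask representation with estimated 3D position, could be sketched as a simple affinity function. The Gaussian position term, the weighting, and the dictionary-based track format below are illustrative assumptions, not the exact formulation used in the thesis.

```python
import numpy as np

def mask_iou(m1, m2):
    """Intersection-over-union of two boolean pixel masks."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union > 0 else 0.0

def affinity(track, detection, sigma=2.0, w_mask=0.5):
    """Affinity between an existing track and a new observation.

    Combines pixel-precise mask overlap with a Gaussian score on the
    distance (in meters) between predicted and observed 3D positions.
    Higher values indicate a more likely match; scores close to 0
    should be rejected during association.
    """
    iou = mask_iou(track["mask"], detection["mask"])
    d = np.linalg.norm(track["pos3d"] - detection["pos3d"])
    pos_score = np.exp(-0.5 * (d / sigma) ** 2)
    return w_mask * iou + (1.0 - w_mask) * pos_score
```

Because the score does not involve any category label, the same association logic applies to detections of known classes and to category-agnostic object proposals alike.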
Our experiments show that the proposed 3D tracking-by-detection method is on par with state-of-the-art image-based methods and that our category-agnostic variants achieve comparable performance in the camera near range. Furthermore, we show that the category-agnostic tracker can be used to analyze several hours of driving video and to mine several thousand tracks of previously known as well as unknown objects. We additionally show that, based on the mined tracks of unknown object categories, we can discover new object classes, learn new detectors for them, and learn to predict their future motion.