Learning-based visual scene and person understanding for mobile robotics

Hermans, Alexander; Leibe, Bastian (Thesis advisor); Stachniss, Cyrill (Thesis advisor)

Aachen (2020, 2021)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2020


We have seen tremendous progress in the computer vision community across the past decades. While early approaches often relied on heuristics and only saw limited application of machine learning methods, the recent advances in deep learning have significantly changed the field. It has enabled us to move beyond hand-crafted features and toward learning deep neural networks end-to-end. Especially in combination with growing computational power and dataset sizes, we have seen very impressive results, even surpassing human capabilities for some applications. However, when we want to use computer vision within actual robotic applications, we often find that top-performing methods are difficult to deploy. Since inference speed is typically not a real concern for computer vision research, the limited computational resources on mobile robotics platforms are not sufficient to run many of the research methods online. Furthermore, many sensor setups used on robotics platforms produce images with characteristics different than those found in many computer vision datasets, thus resulting in unexpected behavior. At the same time, an increasing number of robots, such as service robots, autonomous cars and agricultural robots rely on vision capabilities. In this thesis we deal with visual scene and person understanding which are highly relevant for robotics applications. Robots need to be able to understand their environment and take special care around persons to ensure a safe navigation and interaction. We specifically deal with three important sub-tasks: semantic segmentation, 2D laser-based object detection, and person re-identification. Semantic segmentation deals with the task of labeling every pixel or point in a scene with a class label. This can in turn be used to extract higher level information about the surrounding scene, which can be used as context for further planning and interaction tasks. While the resulting segmentations provide object labels, they do not contain instance labels, making it hard to detect object instances. However, object detection is an important capability for allowing robots to safely navigate between dynamic objects. Especially the detection of persons is an important task, enabling robots to interact with us. Since many mobile platforms are already equipped with a 2D laser scanner, they are interesting input sensors for object detection, even though the resulting scans only contain sparse data. In addition to person detection, person re-identification is an important task. This can be used to improve tracking, but also allows to gather longer-term statistics and enables person specific interactions. While we aim to improve the state-of-the-art for each of these tasks, we also focus on the actual applicability of the approaches. We propose three different semantic segmentation methods, tackling different aspects of the task. The first two deal with the semantic segmentation of 3D point clouds and rely on traditional machine learning approaches. For our third method, we propose a novel neural network architecture and show that we can train it from scratch - this is in contrast to the typical approach of pretraining a network on large additional datasets. We then introduce our deep learning based object detector, which relies on a learned voting scheme. We apply our detector to walking aids and persons and show that it outperforms existing methods. Finally, we turn to person re-identification and show that, contrary to the general opinion, a triplet loss can be used to train a re-identification network that achieves state-of-the-art results. We show several applications of our methods within the context of robotics projects. We believe we have been able to contribute to the respective computer vision fields, but we especially hope that we have brought the theoretical approaches and their actual applications closer together.