Deep visual human sensing with application in robotics

Beyer, Lucas Klaus; Leibe, Bastian (Thesis advisor); Triebel, Rudolph (Thesis advisor)

Aachen : RWTH Aachen University (2021, 2022)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2021


Thanks to advances in robotics, navigation, localization, and perception over the last decade, mobile robots (a category that includes self-driving cars) are starting to be deployed in everyday scenarios, surrounded by people. In such situations, understanding the people around the robot is crucial. This thesis is a collection of works that significantly advance the state of the art in the visual understanding of humans.

First, we introduce a fundamentally new paradigm for performing detection in 2D LiDAR scans. Our detector, dubbed DROW, is based on a voting scheme in which each individual measurement point casts a vote. It is completely data-driven, naturally multiclass, and significantly outperforms previous detectors and even trackers.

The body orientation of people, as well as their head orientation, is an important higher-level cue for attention and motion prediction. We introduce a new neural-network output module, the Biternion, and a corresponding von Mises loss function, which together allow accurate, continuous orientation prediction using only weak, discrete labels. We further extend the model with a principled, learned measure of confidence in its own prediction.

We then take a closer look at learning semantic embeddings of images, focusing mainly on person re-identification, with promising results on object recognition as well. We demonstrate that triplet-loss-based approaches perform much better than previously assumed, while being a simple and conceptually clean family of methods. In fact, our proposed model, using an ImageNet-pretrained ResNet-50, the batch-hard triplet loss, PK batches, and a soft margin, significantly outperforms the state of the art on multiple person re-identification benchmarks, as well as on fine-grained car, bird, and product recognition benchmarks.

All of the aforementioned advances make use of deep learning, which typically results in algorithms that require hardware accelerators on the robot.
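The embedding recipe mentioned above (batch-hard triplet loss over PK batches with a soft margin) can be sketched as follows. This is a minimal NumPy sketch, not the thesis implementation: the function name is illustrative, and the thesis computes the embeddings with a learned ResNet-50 rather than taking them as given.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels):
    """Batch-hard triplet loss with a soft margin (softplus).

    embeddings: (N, D) array of embeddings for one PK batch
                (P identities, K images per identity).
    labels:     (N,) identity labels.
    """
    # Pairwise Euclidean distance matrix (epsilon keeps sqrt stable).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)

    same = labels[:, None] == labels[None, :]  # same-identity mask
    # Hardest positive: farthest sample of the same identity.
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    # Hardest negative: closest sample of a different identity.
    hardest_neg = np.where(same, np.inf, dist).min(axis=1)

    # Soft margin: softplus(d_pos - d_neg) instead of a fixed hinge.
    return np.log1p(np.exp(hardest_pos - hardest_neg)).mean()
```

With well-separated identities the hardest positive stays closer than the hardest negative, so the softplus term, and hence the loss, approaches zero without ever becoming exactly zero, which is the point of the soft margin.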
Running multiple such components on a single robot comes at a cost. We investigate ways of mitigating this cost with our DetTA pipeline, which leverages a tracker to perform strided execution of the analysis modules (significantly reducing the computational load) and per-person smoothing of their results (so that prediction accuracy does not suffer). Finally, motivated by the importance of tracking on mobile robots and by our strong person re-identification results, we investigate a completely novel formulation of tracking that builds on a solid person re-identification model from the ground up, bypassing the need for complicated data association. This new formulation goes one step further towards end-to-end learning of tracking and opens up many novel research opportunities.
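The strided-execution idea behind DetTA can be sketched as follows. This is a minimal sketch under stated assumptions: the class, parameter names, and the exponential smoothing are illustrative choices, not the actual pipeline.

```python
from collections import defaultdict

class StridedAnalyzer:
    """Run an expensive per-person analysis module only every
    `stride` frames per track, and exponentially smooth the result
    in between (illustrative sketch of the DetTA idea)."""

    def __init__(self, analysis_fn, stride=5, alpha=0.3):
        self.analysis_fn = analysis_fn   # expensive module, e.g. a head-pose net
        self.stride = stride             # frames between fresh analyses
        self.alpha = alpha               # smoothing factor for new results
        self.state = {}                  # track_id -> smoothed prediction
        self.age = defaultdict(int)      # frames since last fresh analysis

    def step(self, tracks):
        """tracks: dict of track_id -> image crop for the current frame.
        Returns smoothed per-track results, calling analysis_fn only
        on tracks whose result is due for a refresh."""
        out = {}
        for tid, crop in tracks.items():
            if tid not in self.state or self.age[tid] >= self.stride:
                fresh = self.analysis_fn(crop)
                prev = self.state.get(tid, fresh)
                # Per-person exponential smoothing of the prediction.
                self.state[tid] = (1 - self.alpha) * prev + self.alpha * fresh
                self.age[tid] = 0
            else:
                self.age[tid] += 1
            out[tid] = self.state[tid]
        return out
```

Because the tracker maintains stable identities, each person still gets a result every frame, but the expensive module runs only once every `stride` frames per person, which is where the load reduction comes from.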