Alignment and localization in fine-grained image recognition

Hanselmann, Harald; Ney, Hermann (Thesis advisor); Rigoll, Gerhard (Thesis advisor)

Aachen : RWTH Aachen University (2020, 2021)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2020


The goal of image recognition is to identify or recognize objects shown in an image. Image recognition tasks can be classified into different categories with respect to the extent of the inter-class variation. General image recognition tasks typically classify images into a wide variety of broad categories and therefore display large inter-class variation. Fine-grained image classification tasks, however, are defined by low inter-class variation. Examples of such tasks include the classification of different car models or animal species. A special case of a fine-grained image classification task is face recognition, where individuals have to be classified. For fine-grained tasks, it is not only important to detect which features are in an image, but also where they are located and what their spatial relations are. In this thesis we look at different methods to align and localize features and discriminative regions for fine-grained image classification. On the one hand, we will look at computing dense pixel-wise alignments using 2D-Warping. In this context, we will introduce methods for speeding up the computation of the dense alignments, as runtime is the main drawback of 2D-Warping-based approaches. Additionally, we will introduce a new 2D-Warping algorithm that obtains better results in terms of optimization score and classification accuracy compared to previous 2D-Warping algorithms. On the other hand, we will explore a new method to obtain the local features needed to compute the dense alignments. These features are learned from data using convolutional neural networks (CNNs). Further, we will introduce a warped region-of-interest pooling layer based on 2D-Warping that can be inserted into a trained CNN to recognize images with spatial deformations not seen in training. We will observe that modeling translation and scaling matters most for classification accuracy.
For this reason we introduce a localization module that handles translation and scaling variations, is very lightweight and efficient, and needs only class labels to be trained. We then add an embedding layer and global K-max pooling to obtain a complete and efficient system for fine-grained image classification. While the aforementioned localization module is effective, it is implemented as a stand-alone module that is trained separately from the classification model. To simplify the training procedure and leverage the benefits of full end-to-end systems, we transform the localization module such that it can be integrated into the classification model and trained jointly. We evaluate our methods on popular and challenging tasks for fine-grained image classification and are able to report very competitive results. On some tasks we even achieve state-of-the-art accuracy.


  • Department of Computer Science [120000]
  • Chair of Computer Science 6 (Machine Learning and Reasoning) [122010]