Data integration and metadata management in data lakes

  • Datenintegration und Metadatenmanagement in Data Lakes

Hai, Rihan; Jarke, Matthias (Thesis advisor); Miller, Renée (Thesis advisor); Quix, Christoph Josef (Thesis advisor)

Aachen (2020)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2020


Although big data has been discussed for some years, it still raises many research challenges, one of which is the variety of data. Diverse data sources often reside in information silos: collections of non-integrated data management systems with heterogeneous schemas, query languages, and data models. Traditional 'schema-on-write' approaches such as data warehouses make it difficult to efficiently integrate, access, and query the large volume of diverse data in these silos. Data lake systems have been proposed as a solution to this problem: repositories that store raw data in its original formats and provide a common access interface. The challenges of combining multiple heterogeneous data sources in data lakes are rooted in the research area of data integration. To integrate the data in a data lake, the primary tasks are to understand the relationships (e.g., schema mappings) among its data sources and to answer user queries over the heterogeneous sources. Moreover, to prevent a data lake from turning into an unusable data swamp, metadata management is crucial, especially for accessing and querying the data. The main challenges for metadata management in data lakes are to acquire, model, store, and enrich the metadata that describes the data sources.

Therefore, in this thesis we present a comprehensive and flexible data lake architecture and a prototype system, Constance, which provides data ingestion, integration, querying, and sophisticated metadata management over structured, semi-structured (e.g., JSON, XML), and graph data. First, we propose a native mapping representation that captures the hierarchical structure of nested mappings, together with efficient mapping generation algorithms that avoid producing a considerable number of intermediate basic mappings. Second, to store heterogeneous data in raw formats, our data lake system enables the coexistence of several data storage systems with different data models.
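To make the notion of a nested mapping concrete, here is an illustrative example (the relation names Dept, Emp, Dept', and Emp' are hypothetical and not taken from the thesis). A flat approach generates a separate basic source-to-target dependency for each correlated pair of atoms, whereas a single nested mapping keeps the employee constraint inside the scope of its department, reusing the same existential witness:

```latex
% Flat basic mappings (generated pairwise, with independent witnesses):
%   \forall d\, \big( \mathrm{Dept}(d) \rightarrow \exists d'\, \mathrm{Dept'}(d') \big)
%   \forall d, e\, \big( \mathrm{Dept}(d) \wedge \mathrm{Emp}(d, e)
%       \rightarrow \exists d'\, ( \mathrm{Dept'}(d') \wedge \mathrm{Emp'}(d', e) ) \big)
%
% One nested mapping correlates both: the inner (employee) submapping is
% quantified inside the department's scope and reuses its witness d'.
\forall d\, \Big( \mathrm{Dept}(d) \rightarrow
  \exists d'\, \big( \mathrm{Dept'}(d') \wedge
    \forall e\, \big( \mathrm{Emp}(d, e) \rightarrow \mathrm{Emp'}(d', e) \big) \big) \Big)
```

Because correlated basic mappings are folded into one hierarchical expression, the number of generated intermediate mappings stays small even for deeply nested (e.g., JSON or XML) schemas.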
To provide a unified querying interface, we design a novel query rewriting engine that combines logical, declarative-mapping-based methods for data integration with the big data processing system Apache Spark. The engine efficiently executes the rewritten queries and reconciles the query results into an integrated dataset. Third, we study the formalism of the generated schema mappings as dependencies. With respect to the computational complexity and decidability of certain reasoning tasks, mapping formalisms in second-order logic are less desirable than first-order mapping languages; our algorithmic approach therefore transforms schema mappings expressed in second-order logic into their logically equivalent first-order forms. Finally, we define a generic metadata model to represent the structure of heterogeneous sources and introduce clustering-based algorithms to discover relaxed functional dependencies, which enrich the metadata and improve data quality in the data lake.
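The idea of discovering relaxed functional dependencies with clustering can be sketched as follows. This is a minimal illustration under assumed semantics, not the thesis's algorithm: a relaxed FD A ~> B is taken to hold when, after clustering near-equal values of attribute A, each cluster agrees on (almost) one value of B; the data, the `eps` distance threshold, and the `purity` ratio are all hypothetical.

```python
def cluster_values(values, eps):
    """Group sorted numeric values into clusters whose neighboring
    members differ by at most eps (single-linkage style grouping)."""
    clusters = []
    for v in sorted(set(values)):
        if clusters and v - clusters[-1][-1] <= eps:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def holds_relaxed_fd(rows, eps=0.5, purity=0.9):
    """Check the relaxed dependency A ~> B on (a, b) tuples: within each
    cluster of A-values, at least a `purity` fraction of the tuples must
    share the most frequent B-value."""
    a_values = [a for a, _ in rows]
    for cluster in cluster_values(a_values, eps):
        members = [b for a, b in rows if a in cluster]
        top = max(members.count(b) for b in set(members))
        if top / len(members) < purity:
            return False
    return True

# Hypothetical example: slightly noisy temperature readings still
# determine the label, so temperature ~> label holds in the relaxed sense.
rows = [(20.0, "warm"), (20.3, "warm"), (35.0, "hot"), (35.2, "hot")]
print(holds_relaxed_fd(rows))  # True
```

A strict FD would reject this instance because 20.0 and 20.3 are distinct values; tolerating such near-equality on the left-hand side is what makes the dependency "relaxed" and useful for profiling noisy raw data in a lake.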