Computer Science Graduate Seminar

Monday, July 13, 2020, 4:00pm

Data Integration and Metadata Management in Data Lakes

  • Location: https://rwth.zoom.us/j/95676455814?pwd=NUEvVnFVNEVLSjFsTWY2OEw2VWhrdz09
    • Meeting ID: 956 7645 5814
    • Password: 302988
  • Speaker: M.Eng. Rihan Hai (Chair for Computer Science 5)

Abstract

Although big data has been discussed for some years, it still poses many research challenges, such as the variety of data. Non-integrated data management systems with heterogeneous schemas, query languages, and data models result in information silos. Because traditional 'schema-on-write' approaches such as data warehouses cannot efficiently integrate, access, and query these information silos, data lake systems have been proposed as a solution. Data lakes are repositories that store raw data in its original format and provide a common access interface.

In this thesis, we present a comprehensive and flexible data lake architecture and the prototype system Constance. First, we propose a native mapping representation that captures the hierarchical structures of nested mappings, together with efficient mapping generation algorithms. Second, to provide a unified querying interface, we design a novel query rewriting engine that combines logical methods for data integration based on declarative mappings with the big data processing system Apache Spark. Third, we study the formalism of the generated schema mappings as dependencies. Our algorithmic approach transforms schema mappings expressed in second-order logic into their logically equivalent first-order forms. Finally, we introduce clustering-based algorithms to discover relaxed functional dependencies, which enrich the metadata and improve data quality in the data lake.

The computer science lecturers invite everyone interested to attend.