What's a Data Lake and What Does It Mean For My Open Source Stack?
MCLD 3038 | Thu 06 Aug 4:30 p.m.–5:15 p.m.
Presented by
Robert Hodges has worked on database systems and applications since 1983. His experience spans pre-relational databases like M204 to online transaction processing in SQL to Hadoop and analytics. His work in the last few years has focused on analytic databases, Kubernetes, and open source. Robert’s day job is CEO of Altinity, an enterprise provider for ClickHouse.
Abstract
Data lakes on open table formats like Iceberg are a popular way to manage large datasets for analytics, data science, and AI. This talk explains how data lakes work and how to adapt open source analytic stacks to use them. First, we'll tour projects like Arrow, Iceberg, and Unity Catalog that make data lakes possible. Next, we'll see how analytic engines like DuckDB, ClickHouse, and Spark are adapting. Finally, we'll survey a few projects that enable applications written in Python, Golang, or Rust to deliver fast queries. You'll have to build the app yourself, but this talk will show you a path to use data lakes and open source successfully.
Notes: This talk is designed to explain data lakes and how to work with them in open source to a wide audience. Data lakes can be a confusing topic for newcomers as there are many open source projects related to data lakes and they are evolving rapidly. The landscape is also confused by vendors crowding in with non-OSS solutions. The talk is designed to clear away some of the fog. It assumes an interest in and familiarity with databases in general but no particular knowledge of data lakes themselves.
P.S. I left "analytic" out of the title because the talk is submitted to the open source analytics track. For non-analytic conferences I would add it back as a qualifier.