The data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse, which will be based on open direct-access data formats such as Apache Parquet, have first-class support for machine learning and data science, and offer state-of-the-art performance.
The history of data warehousing started with helping business leaders get analytical insights by collecting data from operational databases into centralized warehouses, which then could be used for decision support and business intelligence (BI). Data in these warehouses would be written with schema-on-write, which ensured that the data model was optimized for downstream BI consumption.
A decade ago, these first-generation systems started to face several challenges. First, they typically coupled compute and storage into an on-premises appliance. This forced enterprises to provision and pay for the peak of user load and data under management, which became very costly as datasets grew. Second, not only were datasets growing rapidly, but more and more of them were completely unstructured (e.g., video, audio, and text documents), which data warehouses could not store or query at all.
To solve these problems, the second-generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC. This approach started with the Apache Hadoop movement, using the Hadoop Distributed File System (HDFS) for cheap storage. In this architecture, a small subset of the data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications.
From 2015 onwards, cloud data lakes, such as S3, ADLS and GCS, started replacing HDFS. They have superior durability (often >10 nines), geo-replication, and most importantly, extremely low cost with the possibility of automatic, even cheaper, archival storage, e.g., AWS Glacier. The rest of the architecture is largely the same in the cloud as in the second-generation systems, with a downstream data warehouse such as Redshift or Snowflake.
In our experience, this two-tier data lake + warehouse architecture is now dominant in the industry.
The Lakehouse architecture aims to address the problems of this two-tier design. We define a Lakehouse as a data management system based on low-cost and directly accessible storage that also provides traditional analytical DBMS management and performance features such as ACID transactions, data versioning, auditing, indexing, caching, and query optimization. A Lakehouse thus combines the key benefits of data lakes and data warehouses: low-cost storage in an open format accessible by a variety of systems from the former, and powerful management and optimization features from the latter.
The first key idea for implementing a Lakehouse is to have the system store data in a low-cost object store (e.g., Amazon S3) using a standard file format such as Apache Parquet, but implement a transactional metadata layer on top of the object store that defines which objects are part of a table version. Several recent systems, including Delta Lake, Apache Iceberg and Apache Hudi, have successfully added transactional features to data lakes in this fashion.
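To make the idea of a metadata layer more concrete, here is a minimal Python sketch of a transaction log that decides which data files belong to each table version. It is only an illustration under stated assumptions: the class ToyTableLog, the _log directory, and the JSON commit layout are hypothetical, a local temporary directory stands in for the object store, and none of this is the actual on-disk protocol of Delta Lake, Apache Iceberg, or Apache Hudi.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Hypothetical toy log: one JSON commit file per table version."""

    def __init__(self, root):
        # Assumption: a local directory stands in for the object store.
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version):
        return os.path.join(self.log_dir, f"{version:08d}.json")

    def latest_version(self):
        names = [n for n in os.listdir(self.log_dir) if n.endswith(".json")]
        return max((int(n[:-5]) for n in names), default=-1)

    def commit(self, add=(), remove=()):
        """Record which data files a new table version adds and removes."""
        version = self.latest_version() + 1
        # Mode "x" raises FileExistsError if the commit file already exists,
        # so two writers cannot both claim the same version number.
        with open(self._path(version), "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return version

    def files_at(self, version=None):
        """Replay the log to find the data files that make up a version."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(self._path(v)) as f:
                entry = json.load(f)
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return sorted(live)


log = ToyTableLog(tempfile.mkdtemp())
v0 = log.commit(add=["part-0.parquet", "part-1.parquet"])
v1 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])
current_files = log.files_at()      # files in the latest version
previous_files = log.files_at(v0)   # "time travel" to the first version
```

The data files themselves are never modified. A reader only needs the log to know which files to read, which is what makes versioning and atomic commits possible on top of plain files.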
Although a metadata layer adds management capabilities, it is not sufficient to achieve good SQL performance. Data warehouses use several techniques to get state-of-the-art performance, such as storing hot data on fast devices such as SSDs, maintaining statistics, building efficient access methods such as indexes, and co-optimizing the data format and compute engine. In a Lakehouse based on existing storage formats, it is not possible to change the format, but it is possible to implement other optimizations that leave the data files unchanged, including caching, auxiliary data structures such as indexes and statistics, and data layout optimizations.
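One of the optimizations mentioned above, keeping statistics alongside the data files, can be sketched in a few lines of Python. The example assumes that the metadata layer stores a minimum and maximum value of one column for every file; FileStats and files_to_scan are hypothetical names chosen for this illustration and do not belong to any real engine.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics kept outside the data file."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(stats, lo, hi):
    """Return paths whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.path for s in stats if s.max_value >= lo and s.min_value <= hi]


stats = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# A query such as "WHERE value BETWEEN 150 AND 220" can skip part-0 entirely.
selected = files_to_scan(stats, 150, 220)
```

Skipping works best when related values are stored close together, which is why data layout optimizations and statistics tend to go hand in hand.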
Databricks has built a Lakehouse platform based on this design using the Delta Lake, Delta Engine, and Databricks ML Runtime projects.
The multi-cluster, load-balancing SQL endpoints, together with the query optimizer, the native execution engine (Photon), and the caching layer, accelerate all workloads on top of the Lakehouse.
Is your data analytics practice still stuck with the old generation? To get the best out of your data, it's time to shift gears and embrace the future of data processing and management.
We provide advanced consulting on implementing Lakehouse architecture using Databricks.