In recent years, a new kind of architecture known as the lakehouse has been developed independently, and roughly simultaneously, by several companies. To understand this new paradigm, let’s go back a little in time:
At the end of the 1980s, Data Warehouses emerged as central repositories for long-term storage of structured, consistent data, supporting strategic decision-making and B.I. applications. This solution served well for almost 20 years, until companies began to deal with unstructured data such as images, text, and audio, a scenario to which Data Warehouses do not adapt well, handling such high-speed, high-volume data only at a high cost. In addition, manipulating data and creating multiple copies for different analyses resulted in systems that were complex, slow, and expensive to maintain.
To deal with unstructured data, the so-called Data Lakes emerged in the early 2010s, capable of storing large volumes of unstructured data from various sources and in multiple formats on distributed file systems. The big problem with the Data Lake structure, however, is that, unlike Data Warehouses, it does not support ACID transactions and therefore cannot fully guarantee data quality.
ACID transactions (i.e., atomic, consistent, isolated, and durable transactions) are standard in databases built on a traditional file system and ensure that the database behaves correctly, preventing data from being corrupted or lost mid-operation.
In the distributed setup of a Data Lake, however, unlike on a traditional file system, we may run into isolation problems, leading to conflicts between concurrent operations.
Beyond the isolation issue, it became common over time to use a two-tier architecture: one tier for storage in a Data Lake (e.g., S3) and another for the Data Warehouse (e.g., AWS Redshift). Data therefore often went through two processes: one ETL into the Data Lake and another ETL to move it into the Data Warehouse, generating duplicate data, excessive complexity, and susceptibility to failures.
Years later, the Data Lakehouse appeared. Although the name lakehouse had already been used, in 2017 by Snowflake and in 2019 by AWS with Redshift Spectrum, the concept of the Data Lakehouse only took clear shape in early 2021, with the article published by Armbrust et al. that combined the best of the Data Lake and the Data Warehouse.
A Lakehouse is, in essence, a metadata layer on top of a Data Lake that provides Data Warehouse-like query capabilities, such as ACID transaction support, data versioning, and schema enforcement.
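To make this concrete, below is a minimal sketch using PySpark with Delta Lake (the open table format mentioned later in this article) as that metadata layer. The session configuration follows the standard delta-spark setup; the bucket path and sample data are illustrative assumptions.

```python
# Minimal sketch: a Spark session configured for Delta Lake (assumes the
# delta-spark package is installed); the bucket path and sample data are
# illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://my-bucket/events"  # hypothetical Data Lake location

# Each write is an ACID transaction recorded in the Delta transaction log.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
events.write.format("delta").mode("append").save(path)

# Schema enforcement: a DataFrame whose schema does not match the table's
# is rejected instead of silently corrupting the data.
bad = spark.createDataFrame([("oops",)], ["wrong_column"])
# bad.write.format("delta").mode("append").save(path)  # raises AnalysisException
```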
In a Lakehouse architecture, metadata about the datasets is shared between the Data Lake and the Data Warehouse, allowing a single SQL query to combine data stored in the Data Lake with data in the Data Warehouse.
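As an illustration of such a combined query, the sketch below (continuing the Spark session from the previous example) joins a Delta table on the Data Lake with a warehouse table read over JDBC. The Redshift connection URL, credentials, and table and column names are hypothetical, and the matching JDBC driver is assumed to be available.

```python
# Sketch of a single SQL statement spanning Data Lake and Data Warehouse data.
lake = spark.read.format("delta").load("s3a://my-bucket/events")
lake.createOrReplaceTempView("lake_events")

warehouse = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/dev")  # hypothetical
    .option("dbtable", "public.customers")
    .option("user", "analyst")
    .option("password", "***")
    .load()
)
warehouse.createOrReplaceTempView("wh_customers")

spark.sql("""
    SELECT c.customer_name, COUNT(*) AS events
    FROM lake_events e
    JOIN wh_customers c ON e.id = c.customer_id
    GROUP BY c.customer_name
""").show()
```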
In addition to enabling this combination, Lakehouses also allow time travel between different versions of the data, since all underlying files remain in storage, as well as concurrent reads and writes, schema evolution, and schema enforcement. Using a non-proprietary format for the metadata and transaction log also enables compatibility and integration across several cloud providers, bringing flexibility and cost optimization to the Data Lake.
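The sketch below, again assuming the hypothetical Delta table from the previous examples, shows what time travel and schema evolution look like with the Delta Lake API.

```python
# Sketch of time travel and schema evolution on the hypothetical Delta table.
path = "s3a://my-bucket/events"

# Time travel: read the table as it was at version 0
# (a timestamp can be used instead via "timestampAsOf").
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Schema evolution: "mergeSchema" lets a write add a new column
# instead of being rejected by schema enforcement.
extra = spark.createDataFrame([(3, "purchase", "BR")], ["id", "event", "country"])
(extra.write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .save(path))
```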
In this way, we have a single location in which to perform analysis. This facilitates data administration and governance and efficiently handles both batch and streaming ingestion and processing of structured, semi-structured, and unstructured data for purposes such as B.I., log analysis, Data Science, and Machine Learning, among others.
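For instance, batch and streaming ingestion can land in the same table format. The sketch below uses Spark’s built-in rate source as a stand-in for a real streaming source; the paths are hypothetical.

```python
# Sketch of streaming ingestion into a Delta table, using Spark's built-in
# "rate" source as a stand-in for a real stream; paths are hypothetical.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "s3a://my-bucket/_checkpoints/rate_events")
    .outputMode("append")
    .start("s3a://my-bucket/rate_events")
)

# Once the stream has committed a micro-batch, the same table can also
# receive batch appends and be queried with ordinary batch reads:
# spark.read.format("delta").load("s3a://my-bucket/rate_events").count()
```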
It is worth noting, however, that depending on the situation, adopting such an architecture may not be advantageous.
For companies that already have Data Lakes with data in modern formats such as Parquet or ORC, the data can easily be converted into a Delta Lake table, as sketched below.
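Delta Lake exposes an in-place conversion API for Parquet directories; the sketch below assumes the Spark session from the earlier examples, and the path is hypothetical.

```python
# Sketch of an in-place conversion of an existing Parquet directory into a
# Delta table; the path is hypothetical.
from delta.tables import DeltaTable

DeltaTable.convertToDelta(spark, "parquet.`s3a://my-bucket/legacy_parquet`")

# For a partitioned directory, the partition schema must be supplied, e.g.:
# DeltaTable.convertToDelta(
#     spark, "parquet.`s3a://my-bucket/legacy_parquet`", "year INT")
```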
Companies with large volumes of data in older formats such as CSV or JSON, however, may face conversion processes so time-consuming that they are not feasible. In such cases, a qualified data professional is essential to evaluate the situation as a whole.
The fact is that the Data Lakehouse is a recent technology that is here to stay. If scalability and reliability are at the top of a project’s priorities, lakehouses can be an exciting alternative to consider.