Meet the Lakehouse: a Data Warehouse and Data Lake in one

Jerrold Stolk
20 Nov, 2023

For many years, data warehouses have provided organizations with the insights that support important decisions. Although Massively Parallel Processing (MPP) architecture lets data warehouses process large amounts of data with ease, they remain primarily focused on structured data.

As medium and large organizations are increasingly dealing with unstructured data as well as streaming data, they are running into the limitations of their data warehouses. In this first blog in a series of two, read about the next step in data-driven processes within these organizations: Lakehouse architecture.

As data warehouses are not suitable for storing unstructured data and streaming data, a data lake is often used for this purpose. A data lake is a repository for raw data in various formats. The advantage of a data lake is that it is inexpensive and supports all data formats. The disadvantage of a data lake is the lack of the following key features:

  • Support for ACID transactions
  • Enforcing data quality
  • A fixed schema

The lack of these features means that a data lake is NOT the solution for processing data into insights. Many organizations therefore use both a data lake and a data warehouse. This is not an ideal situation, as it splits data flows into silos: one for structured data (via the data warehouse) and one for unstructured and streaming data (via the data lake). Many of these organizations are looking for a way to break down these silos. That solution is Lakehouse architecture.

What is a Lakehouse?

A Lakehouse is an open architecture that combines the features of a data lake and a data warehouse. A Lakehouse does this by implementing the data structures and data management features of a data warehouse on a data lake. This creates a win-win situation: the benefits of a data warehouse and the benefits of a data lake.

What are the main benefits of a Lakehouse?

The main benefits of a Lakehouse are:

  • Centralized storage of data in the data lake only, rather than scattered across databases;
  • An enforced schema for the data, which imposes structure and safeguards integrity;
  • BI tools can be used directly on the data, reducing the cost and lead time of creating these solutions;
  • Storage is separated from compute, leading to a more scalable and manageable platform with more transparent costs;
  • Use of open file formats, such as Parquet, for storage. APIs allow direct interaction with the data in different languages (SQL/Python/R/etc.);
  • Support for different kinds of data, such as unstructured data and streaming data;
  • Support for various workloads, such as Data Science, Machine Learning, Business Intelligence and Analytics. Different tools are needed to perform these workloads, but ultimately they all use data from the same storage: the Lakehouse;
  • Real-time reporting on data in the Lakehouse, so that insights can be created faster (end-to-end streaming).
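
To make the enforced-schema benefit concrete, here is a minimal sketch in plain Python. It is not how any particular Lakehouse product implements this; the `ORDERS_SCHEMA` table, the `Table` class and the `validate_row` helper are hypothetical names chosen for illustration. The idea is that a write is checked against the declared schema first, and a batch is either written completely or not at all.

```python
from dataclasses import dataclass, field

# Hypothetical schema for an "orders" table: column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


def validate_row(row, schema):
    """Return the row unchanged if it matches the schema, else raise SchemaError."""
    missing = schema.keys() - row.keys()
    extra = row.keys() - schema.keys()
    if missing or extra:
        raise SchemaError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for column, expected in schema.items():
        if not isinstance(row[column], expected):
            raise SchemaError(
                f"{column}: expected {expected.__name__}, "
                f"got {type(row[column]).__name__}"
            )
    return row


@dataclass
class Table:
    """In-memory stand-in for a table stored on the data lake."""
    schema: dict
    rows: list = field(default_factory=list)

    def append(self, batch):
        """Validate the whole batch before writing: all rows or none."""
        validated = [validate_row(row, self.schema) for row in batch]
        self.rows.extend(validated)
```

With this sketch, appending a batch that contains one malformed row raises `SchemaError` and leaves the table untouched, which is the behavior a plain data lake does not give you.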

How do you make Lakehouse data available in an organized way?

You may now be wondering: how do I make all the data in a Lakehouse conveniently available for use? A Lakehouse stores data both in raw form and in processed form, with business logic applied. This distinction in dataset maturity is important: raw datasets are less useful for standard reports and KPIs, while processed data may exclude data of interest to Data Science.

A common solution for this is medallion architecture, in which data is categorized as Gold Data, Silver Data and Bronze Data. As with medals in sports, gold is better than silver and silver is better than bronze. In medallion architecture, raw data is labeled as Bronze data and processed data is labeled as Silver Data or Gold Data.
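
As a rough illustration of these layers, the sketch below moves a handful of records from Bronze to Silver to Gold in plain Python. The sample records and the functions `to_silver` and `to_gold` are hypothetical, and the split used here (Silver = cleaned and typed, Gold = aggregated with business logic) is one common convention assumed for the example, not something prescribed above.

```python
# Bronze: raw records exactly as they arrived from a hypothetical source system.
bronze = [
    {"order_id": "1", "customer": " Alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "bob", "amount": "n/a"},  # unparseable amount
    {"order_id": "3", "customer": "Alice", "amount": "4.50"},
]


def to_silver(raw_rows):
    """Clean and type the raw rows; skip rows that cannot be parsed."""
    silver_rows = []
    for row in raw_rows:
        try:
            silver_rows.append({
                "order_id": int(row["order_id"]),
                "customer": row["customer"].strip().title(),
                "amount": float(row["amount"]),
            })
        except (KeyError, ValueError):
            continue  # the rejected row still exists in Bronze
    return silver_rows


def to_gold(silver_rows):
    """Apply business logic: total order amount per customer."""
    totals = {}
    for row in silver_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'Alice': 15.0}
```

Note that the record with the unparseable amount never reaches Silver or Gold, yet it remains available in Bronze. That is exactly why processed layers may lack data that is still of interest to Data Science.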

Understanding and processing Bronze Data requires more expertise than understanding and processing Gold Data. In addition, Bronze Data may unintentionally contain more sensitive data than Gold Data. So it makes sense to restrict access to each layer with roles and groups. For example, Bronze Data can be made accessible only to Data Engineers and Data Scientists. The Access Control Lists (ACLs) in Azure Data Lake Storage support this: they allow rights to be set per layer, source system or domain.
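
The per-layer access idea can be modeled in a few lines of Python. This is only a conceptual sketch: it does not call Azure, the group names and the `LAYER_ACLS` mapping are hypothetical, and the rendered entries merely imitate the POSIX-style `group:<id>:<permissions>` notation. In a real Azure Data Lake Storage setup the identifier would be the object ID of a security group rather than a readable name.

```python
# Hypothetical layout: one directory per medallion layer, with
# POSIX-style "rwx" permission triplets per group (assumed group names).
LAYER_ACLS = {
    "bronze": {"data-engineers": "rwx", "data-scientists": "r-x"},
    "silver": {"data-engineers": "rwx", "data-scientists": "r-x",
               "analysts": "r-x"},
    "gold": {"data-engineers": "rwx", "data-scientists": "r-x",
             "analysts": "r-x", "report-users": "r-x"},
}


def can_read(group, layer):
    """Return True if the group has read permission on the layer."""
    return "r" in LAYER_ACLS[layer].get(group, "---")


def acl_entries(layer):
    """Render the layer's permissions as a comma-separated ACL-style string."""
    return ",".join(
        f"group:{group}:{perms}"
        for group, perms in sorted(LAYER_ACLS[layer].items())
    )


print(can_read("analysts", "bronze"))  # False
print(can_read("analysts", "gold"))    # True
print(acl_entries("bronze"))
# group:data-engineers:rwx,group:data-scientists:r-x
```

Keeping the mapping in one place like this makes it easy to review who can see which layer before the rights are applied to the storage account.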

Migration to Lakehouse: the way to go?

Most of the new data platforms we implement for clients follow the Lakehouse architecture. A Lakehouse fits well with organizations that need a widely deployable analytics platform. Within a Lakehouse, different types of use cases are supported, ranging from Data Exploration to Data Science, Reporting and Business Intelligence. This requires high data literacy and extensive knowledge of data-driven work among your users.

Therefore, for some organizations, a data warehouse is still the best choice. A data warehouse fits well with organizations that want to focus solely on Reporting and Business Intelligence in the medium term. A data warehouse makes for a less complex data platform, so users need less data literacy and less extensive knowledge of data-driven work.

If a Lakehouse is the right decision for your organization, the key is to determine the right migration strategy. In the second blog in this series, we take you through determining the best migration strategy, the skill set needed and the different mindset that is essential when following the Lakehouse architecture.
