Challenges & solutions
Around your data warehouse and data lake
Meet the Lakehouse: a Data Warehouse and Data Lake in one
Data warehouses have for many years provided the insights that support important decisions within organizations. Although Massively Parallel Processing (MPP) architecture enables data warehouses to process large amounts of data with ease, they are primarily focused on structured data.
As medium and large organizations are increasingly dealing with unstructured data as well as streaming data, they are running into the limitations of their data warehouses. In this first blog in a series of two, read about the next step in data-driven processes within these organizations: Lakehouse architecture.
As data warehouses are not suitable for storing unstructured data and streaming data, a data lake is often used for this purpose. A data lake is a repository for raw data in various formats. The advantage of a data lake is that it is inexpensive and supports all data formats. The disadvantage of a data lake is the lack of the following key features:
The lack of these features means that a data lake on its own is not the solution for processing data into insights. Many organizations therefore use both a data lake and a data warehouse. This is not an ideal situation, as it splits data flows into silos: structured data on one side (via the data warehouse) and unstructured and streaming data on the other (via the data lake). Many of these organizations are looking for a way to break down these silos. That solution is Lakehouse architecture.
What is a Lakehouse?
A Lakehouse is an open architecture that combines the features of a data lake and a data warehouse. A Lakehouse does this by implementing the data structures and data management features of a data warehouse on a data lake. This creates a win-win situation: the benefits of a data warehouse and the benefits of a data lake.
What are the main benefits of a Lakehouse?
The main benefits of a Lakehouse are:
You may now be wondering: how do I make all the data in a Lakehouse conveniently available for use? A Lakehouse stores data both in raw form and in processed form, with business logic applied. This distinction in dataset maturity is important: raw datasets are less useful for standard reports and KPIs, while processed data may exclude data that is of interest to Data Science.
A common solution for this is medallion architecture, in which data is categorized as Gold Data, Silver Data and Bronze Data. As with medals in sports, gold is better than silver and silver is better than bronze. In medallion architecture, raw data is labeled as Bronze data and processed data is labeled as Silver Data or Gold Data.
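To make the layers concrete, here is a minimal sketch in plain Python of how a record could move from Bronze to Silver to Gold. The sample orders, the field names and the functions to_silver and to_gold are assumptions made up for illustration; a real Lakehouse would run comparable steps on its own processing engine and table format.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived (hypothetical sample data).
bronze = [
    {"order_id": "1", "customer": " Alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "bob", "amount": "n/a"},     # unparseable amount
    {"order_id": "3", "customer": "Alice", "amount": "4.50"},
    {"order_id": "3", "customer": "Alice", "amount": "4.50"},  # duplicate delivery
]


def to_silver(rows):
    """Clean and validate: fix types, normalize names, skip bad rows and duplicates."""
    seen = set()
    silver = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # the row stays in Bronze but is not promoted to Silver
        if row["order_id"] in seen:
            continue  # already promoted once
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().title(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Apply business logic: total revenue per customer, ready for reporting."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["customer"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Note that the Bronze list is never modified: the raw data remains available for Data Science, while reports and KPIs read only from the Gold result.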
Understanding and processing Bronze Data requires more expertise than understanding and processing Gold Data. In addition, Bronze Data may unintentionally contain more sensitive data than Gold Data. So it makes sense to restrict access to each type of data with roles and groups. For example: Bronze Data is accessible only to Data Engineers and Data Scientists. The Access Control Lists (ACLs) in Azure Data Lake Storage offer a way to do this, as they allow rights to be set up per layer, source system or domain.
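The idea of rights per layer can be sketched as a simple lookup. This is an illustration in plain Python and does not call Azure Data Lake Storage; the group names and the Silver and Gold assignments are assumptions, with only the Bronze restriction taken from the example above.

```python
# Hypothetical mapping of medallion layers to the groups allowed to read them.
LAYER_READERS = {
    "bronze": {"data-engineers", "data-scientists"},
    "silver": {"data-engineers", "data-scientists", "analysts"},
    "gold": {"data-engineers", "data-scientists", "analysts", "business-users"},
}


def can_read(user_groups, layer):
    """Return True if at least one of the user's groups may read the layer."""
    return bool(set(user_groups) & LAYER_READERS[layer])


def readable_layers(user_groups):
    """List the layers a user may read, in Bronze, Silver, Gold order."""
    return [layer for layer in LAYER_READERS if can_read(user_groups, layer)]


print(readable_layers(["business-users"]))
print(readable_layers(["data-scientists"]))
```

In a real platform the same mapping would be expressed as ACL entries on the storage directories for each layer, source system or domain, rather than in application code.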
Most of the new data platforms we implement for clients follow the Lakehouse architecture. A Lakehouse fits well with organizations that need a widely deployable analytics platform. Within a Lakehouse, different types of use cases are supported, ranging from Data Exploration to Data Science, Reporting and Business Intelligence. This requires high data literacy and extensive knowledge of data-driven work among your users.
Therefore, for some organizations, a data warehouse is still the best choice. A data warehouse fits well with organizations that want to focus solely on Reporting and Business Intelligence in the medium term. A data warehouse makes for a less complex data platform, so it demands less data literacy and less knowledge of data-driven work from its users.
If a Lakehouse is the right decision for your organization, the key is to determine the right migration strategy. In the second blog in this series, we take you through determining the best migration strategy, the skill set needed and the different thinking that is essential when following the Lakehouse architecture.
Want to learn more about the key benefits and challenges of Lakehouse architecture, or get started right away? Connect with our data & analytics experts.