Challenges and opportunities
When managing a data lake or data warehouse
Building a Lakehouse part 2: tackle your data migration
For many years, Data Warehouses have provided the insights that support important decisions within organizations. Although a Massively Parallel Processing (MPP) architecture enables Data Warehouses to process large amounts of data with ease, they remain primarily focused on structured data.
As medium and large organizations increasingly deal with unstructured data as well as streaming data, they are running into the limitations of their Data Warehouses. The first blog in this series of two discusses what a Lakehouse architecture is and why it is the next step in data-driven processes within these organizations. In this second blog, we'll tell you how to go about migrating to a Lakehouse.
When an organization opts for a Lakehouse architecture to achieve a future-proof data platform, the next step is to plan the migration. A successful migration requires proper preparation, which consists of at least the following items:
After that, we cover the following topics:
1. Platform and security
After completing preparation steps 1 through 4, the first step in actually performing the migration is setting up the platform. This involves setting up the Landing Zones on which the data platform will be built. For this purpose, Microsoft recommends Data Landing Zones and Data Management Landing Zones, which are part of the Cloud Adoption Framework. The platform rollout also includes the security setup. Before the migration starts, it should be clear who has rights to do what. This is set up centrally with the help of firewalls, Network Security Groups and Private Endpoints, among others.
2. Data migration
Once the platform is in place, it's time for the data migration. This consists of two sub-steps: migrating the history and setting up the loading processes. For both, we recommend a side-by-side migration over an in-place migration. In a side-by-side migration, the new data platform is set up alongside the existing platform, so the two can be compared 1-to-1, which makes testing and validation straightforward.
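The 1-to-1 comparison can be automated per table. The sketch below is a minimal illustration, assuming the same table can be read from both platforms as a list of rows. The function names `table_fingerprint` and `compare_tables` are hypothetical and not part of any specific tool.

```python
import hashlib
from typing import Iterable, Mapping

_MODULUS = 2 ** 256  # keeps the combined checksum at a fixed size


def table_fingerprint(rows: Iterable[Mapping[str, object]]) -> tuple[int, str]:
    """Return (row count, order-independent checksum) for a table's rows."""
    count = 0
    combined = 0
    for row in rows:
        # Serialize with sorted column names so column order does not matter.
        canonical = "|".join(f"{key}={row[key]!r}" for key in sorted(row))
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        # Add the row hashes so row order does not matter either.
        combined = (combined + int.from_bytes(digest, "big")) % _MODULUS
        count += 1
    return count, f"{combined:064x}"


def compare_tables(old_rows, new_rows) -> dict:
    """Compare one table between the existing and the new platform."""
    old_count, old_sum = table_fingerprint(old_rows)
    new_count, new_sum = table_fingerprint(new_rows)
    return {
        "row_count_match": old_count == new_count,
        "checksum_match": old_sum == new_sum,
    }


# Same data, different row and column order: both checks should pass.
old = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
new = [{"amount": 5.5, "id": 2}, {"amount": 10.0, "id": 1}]
print(compare_tables(old, new))
```

In practice, the row counts and checksums would more likely be computed inside each platform and only the results compared, so that large tables do not have to be moved for validation.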
Migrating history
When migrating history, all relevant data is copied from the original Data Warehouse into the Data Lake. Because the source is a Data Warehouse, it is usually a database environment. To access it, we recommend metadata-driven extraction: an ETL tool reads a list of tables and creates one copy per table in the Data Lake, in a predefined structure.
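To illustrate the idea of metadata-driven extraction, the sketch below turns a list of table descriptions into one extraction task per table. It is not tied to any particular ETL tool. The `TableMetadata` class, the `build_extraction_plan` function, the folder layout and the Parquet file extension are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TableMetadata:
    """One entry in the (hypothetical) metadata list that drives extraction."""
    schema: str
    table: str


def build_extraction_plan(tables, lake_root: str, load_date: date) -> list[dict]:
    """Turn table metadata into one extraction task per table."""
    plan = []
    for meta in tables:
        plan.append({
            "source_query": f"SELECT * FROM {meta.schema}.{meta.table}",
            # Assumed predefined structure: <root>/<schema>/<table>/<yyyy>/<mm>/<dd>/
            "target_path": (
                f"{lake_root}/{meta.schema}/{meta.table}/"
                f"{load_date:%Y/%m/%d}/{meta.table}.parquet"
            ),
        })
    return plan


tables = [TableMetadata("sales", "orders"), TableMetadata("sales", "customers")]
for task in build_extraction_plan(tables, "raw/dwh", date(2024, 1, 31)):
    print(task["source_query"], "->", task["target_path"])
```

Adding a table to the migration then means adding a row to the metadata list rather than building a new pipeline.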
Setting up the loading processes
While both environments run side by side, their data must be kept identical, so both must be updated continuously. To do this, the loading processes of the Data Warehouse must also be set up for the Data Lake. For this, too, we recommend a metadata-driven solution, which in this case connects directly to the source: the Data Warehouse. When migrating from an on-premises solution to a cloud solution, a Gateway is required in many cases.
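How each table is refreshed depends on the source. One common approach, assumed here purely for illustration, is a full load on the first run followed by incremental loads based on a change-tracking column. The function `build_load_query`, the column name `modified_at` and the `?` placeholder style (which varies by database driver) are hypothetical.

```python
from typing import Optional


def build_load_query(
    table: str,
    watermark_column: Optional[str],
    last_loaded: Optional[str],
) -> tuple[str, tuple]:
    """Return a parameterized query: full load first, incremental afterwards."""
    if watermark_column is None or last_loaded is None:
        # No change-tracking column or no earlier run: load the whole table.
        return f"SELECT * FROM {table}", ()
    # Otherwise load only the rows changed since the previous run.
    return (
        f"SELECT * FROM {table} WHERE {watermark_column} > ?",
        (last_loaded,),
    )


print(build_load_query("sales.orders", "modified_at", None))
print(build_load_query("sales.orders", "modified_at", "2024-01-31T00:00:00"))
```

The table name and watermark column would come from the same metadata list that drives the history migration, so both steps share one configuration.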
3. Data transformations
The next step in migrating from a Data Warehouse to a Lakehouse is to convert the data transformations. There are two options here: migrating the existing transformation processes or rebuilding them. This decision should already have been made during the preparation stage and depends on the answers to the following questions:
If rebuilding is chosen, this is the time to also review data layering. Many Lakehouses use a medallion architecture, with Bronze, Silver and Gold layers.
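In a medallion architecture, Bronze commonly holds the raw data, Silver the cleaned data and Gold the business-level aggregates. The sketch below shows that flow on a few in-memory records. The column names and cleaning rules are assumptions for illustration, and a real Lakehouse would run such steps on a distributed engine rather than in plain Python.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Clean raw rows: drop incomplete records, cast types, deduplicate on order_id."""
    seen = set()
    silver = []
    for row in bronze_rows:
        if row.get("order_id") is None or row.get("amount") in (None, ""):
            continue  # incomplete record
        if row["order_id"] in seen:
            continue  # duplicate delivery of the same order
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned rows into revenue per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


bronze = [
    {"order_id": "1", "customer": " Acme ", "amount": "10.5"},
    {"order_id": "1", "customer": "Acme", "amount": "10.5"},  # duplicate
    {"order_id": "2", "customer": "Acme", "amount": "4.5"},
    {"order_id": "3", "customer": "Beta", "amount": ""},      # incomplete
]
print(to_gold(to_silver(bronze)))
```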
Whichever option is chosen, the location where the transformed data is stored must change, because a Lakehouse stores data differently from a Data Warehouse: the transformed data in the Silver and Gold layers is written back to the Data Lake as well.
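One way to make the new storage location explicit is to derive every output path from the layer and the dataset name. The sketch below assumes a simple `<root>/<layer>/<dataset>` folder convention. The `Layer` enum, the `layer_path` function and the example root are hypothetical.

```python
from enum import Enum


class Layer(Enum):
    BRONZE = "bronze"
    SILVER = "silver"
    GOLD = "gold"


def layer_path(lake_root: str, layer: Layer, dataset: str) -> str:
    """Build the Data Lake location for a dataset in a given layer."""
    return f"{lake_root.rstrip('/')}/{layer.value}/{dataset}"


# Every layer of the same dataset lives in the Data Lake, not in a separate database.
for layer in Layer:
    print(layer_path("lake/", layer, "sales_orders"))
```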
4. Data products
Once the data and transformation processes have been migrated successfully, the same datasets are available in the new environment. This is the time to convert the data products to the new environment. Here, a distinction can be made between managed reporting and self-service analytics.
In managed reporting, the data products are managed by a central reporting team. This team can handle the conversion itself by repointing the data products to the new environment. For self-service use, users should be informed of the change in three stages:
Want to learn more about the key benefits and challenges of Lakehouse architecture, or get started right away? Connect with one of our Data & Analytics experts.