Authors: Abhinav Jain, Scott Lyons, Maciej Szpakowski

Data Lakehouse

Data Lakehouse is a relatively new concept in the Big Data space that, like many things in the data ecosystem, combines two existing ideas: Data Warehouses and Data Lakes.

Data Warehouses (DWs) store large volumes of processed, highly-structured data while enabling efficient, performant querying on top of that data. Most DW implementations are based on MPP architectures and let users run the SQL operations familiar from standard databases, such as inserts, updates, and even merges. Some are even ACID compliant. However, with ever-increasing data volumes and the rise of unstructured data, warehouses have become expensive, difficult to maintain, and increasingly ill-suited to modern workloads.

Data Lakes (DLs), while built on similar underlying concepts such as high-volume storage and efficient querying, tend to hold far more unstructured data. They’re frequently used by data scientists and engineers as a pre-processing area to prepare data from a variety of upstream systems for loading into a Data Warehouse.

The Data Lakehouse aims to offer the best of both worlds: a flexible data preparation area with the tools and underlying architecture to move that data into a structured, governed space. It supports both batch and streaming paradigms, and any number of diverse workloads on top of them.

Apache Spark is the ETL/computation engine of choice, and it has enabled many companies to perform efficient and scalable data engineering on the cloud without breaking the bank.

The Architecture

Classically, the need for cost-effective storage, performant processing, and low-latency querying gave rise to a two-tier architecture: a Data Lake for raw storage of unstructured data, and a Data Warehouse on top of it for high-performance reporting and querying. Batch ETL processes integrate the two layers.

Even though this is one of the most commonly used architectures today, it comes with a number of problems, such as limited data accessibility, high query costs, and fragile, error-prone ETL pipelines.

The Lakehouse architecture, using Delta Lake as the underlying storage format and Spark as the querying engine, addresses these problems by unifying the two tiers into a single layer. Delta Lake brings the simplicity of a data warehouse, with advanced update operations and ACID guarantees, while supporting the vast majority of cloud vendors' storage offerings. This makes data more accessible, cheaper to query, and reduces the number of buggy data pipelines.
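To make this concrete, here is a minimal sketch of what working with a Delta table from Spark can look like. The paths and table contents are hypothetical, and the snippet assumes a Spark session with the Delta Lake extensions available (for example, on Databricks):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Land raw events on cloud object storage as a Delta table (hypothetical path).
events = spark.read.json("s3://my-bucket/raw/events/")
events.write.format("delta").mode("append").save("s3://my-bucket/lakehouse/events")

# An ACID, in-place update -- something a plain Parquet data lake cannot offer.
table = DeltaTable.forPath(spark, "s3://my-bucket/lakehouse/events")
table.update(
    condition="status = 'pending'",
    set={"status": "'processed'"},
)
```

The same table can then be queried with plain Spark SQL, much like a warehouse table.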

Making Data Lakehouse even easier

Even though Spark and Delta Lake are a perfect combination to base your future data pipelines on, and the Data Lakehouse architecture enables a tremendous number of new use cases, usability and productivity remain a challenge. Delta Lake is yet another technology your team has to learn, writing optimal code takes expertise, and common use cases like slowly changing dimensions require quite a few lines of repetitive code.

Along comes Prophecy. Gone is the chaotic collection of disconnected ETL notebooks and haphazard orchestration scripts, replaced with a visually built data flow that is turned into well-engineered, efficient Spark or PySpark code stored directly in Git. You can even start leveraging all the best practices that have classically been available only to traditional software engineers, including unit testing and CI/CD. Once you’re confident that the pipeline produces the data you’re looking for, you can use Prophecy to schedule the jobs.

Bronze, Silver, and Gold layers

With the Data Lakehouse architecture, all your data flows from the operational systems directly to your object store. However, this often results in poor data management and governance, which can turn the lake into a data swamp. To prevent this, it’s paramount to organize your Lakehouse well. Proper governance, lineage, and taxonomies are also essential (but we will leave those for a future blog).

A common approach is an organization pattern in which tables correspond to different quality levels of the data:

- Bronze: raw data ingested as-is from the source systems
- Silver: cleansed, filtered, and conformed data
- Gold: aggregated, business-level tables ready for reporting and analytics

Building such an architecture allows anyone to leverage data directly at the level that fits their needs. However, building it often requires multiple data pipelines at each stage that have to be worked on collaboratively. Using a low-code tool can make this job significantly easier.
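As a rough sketch of how the Bronze, Silver, and Gold layers can map to Spark code (the table names, columns, and cleaning logic here are illustrative assumptions, not a prescribed pipeline):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw data as-is (hypothetical source path).
raw = spark.read.json("s3://my-bucket/raw/orders/")
raw.write.format("delta").mode("append").saveAsTable("orders_bronze")

# Silver: clean and conform -- drop incomplete rows, normalize types.
silver = (
    spark.read.table("orders_bronze")
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_ts", F.to_timestamp("order_ts"))
)
silver.write.format("delta").mode("overwrite").saveAsTable("orders_silver")

# Gold: business-level aggregate ready for reporting.
gold = (
    spark.read.table("orders_silver")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spend"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("orders_gold")
```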

Pipeline example

To demonstrate how this architecture can be built, we’re going to use two datasets:

Our reporting pipeline is going to consist of three separate pieces:

After every change, the code for the entire pipeline is automatically generated in the backend as PySpark.

In just a few minutes we’ve built our report while following best practices for structuring the code. Now it’s time to schedule it.


Slowly Changing Dimensions (SCD)

So far we’ve built a set of pipelines that produce our report without taking any historical data into account. Historical information, however, can contribute tremendous value to the business, as it allows us to spot patterns over time.

A commonly followed pattern in the data warehouse space is known as “slowly changing dimensions” (SCD). It allows us to store and manage both current and historical data directly within our tables. This used to be very difficult in a traditional Data Lake due to the lack of table merges and updates; Delta Lake, however, makes it possible.

The three most popular types of SCD are:

- Type 1: overwrite the old values with the new ones, keeping no history
- Type 2: add a new row for every change, keeping the full history alongside the current record
- Type 3: keep the previous value of selected attributes in dedicated columns next to the current value

Consider a simple example where our initial data is a Customer table in the operational database. When a customer changes their address, the row with the corresponding customer_id is updated.

If we’re ingesting the data from our operational database using tools like Fivetran or AWS DMS, we’d likely see two files created for us:

customer_id | zip_code | state
123         | 94105    | California

customers/2022-01-22T00:00:00.parquet

And, after the customer changes their address:

customer_id | zip_code | state
123         | 10001    | New York

customers/2022-05-22T13:23:23.parquet
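To make the SCD examples below easier to follow, assume these change files are appended to a bronze customers table roughly like this (the landing path and table name are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each extract from the operational database lands as a timestamped Parquet file.
latest = spark.read.parquet(
    "s3://my-bucket/landing/customers/2022-05-22T13:23:23.parquet"
)

# Append the new changes to the bronze customers table.
latest.write.format("delta").mode("append").saveAsTable("customers_bronze")
```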

SCD Type 1

Standard operational databases store data using the SCD1 method to limit the amount of data present in the database at any point in time and to speed up querying.

In our Lakehouse, storing the information using Type 1 SCD simply means updating and overwriting the row in the existing silver customers table.

This would result in the following table:

customer_id | zip_code | state
123         | 10001    | New York

customers_silver_scd1

SCD1 doesn’t maintain historical information, so it’s only useful when changes to the data over time are not important.

Building a pipeline that updates the existing customers based on the latest records is very simple.
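In plain PySpark with Delta Lake, such an SCD1 upsert boils down to a single merge. The sketch below follows the example tables above; the table names are assumptions, and it is meant as an illustration rather than the exact code Prophecy generates:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Latest batch of customer changes (assumed to sit in the bronze layer).
updates = spark.read.table("customers_bronze")

# SCD Type 1: overwrite the matching rows, keep no history.
silver = DeltaTable.forName(spark, "customers_silver_scd1")
(
    silver.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={"zip_code": "s.zip_code", "state": "s.state"})
    .whenNotMatchedInsertAll()
    .execute()
)
```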


SCD Type 2

SCD2 is another approach to handling slowly changing dimensions. In this case, both current and historical data are stored in the warehouse. The purpose of SCD2 is to preserve the history of changes: if a customer changes their address, or any other attribute, SCD2 allows analysts to link facts back to the customer’s attributes as they were at the time of the fact event.

customer_id | zip_code | state      | updated_at | end_dt     | current_state
123         | 94105    | California | 2022-01-22 | 2022-05-22 | false
123         | 10001    | New York   | 2022-05-22 | null       | true

customers_silver_scd2


Here we have created three extra columns, which make it easy to consume this data in downstream pipelines:

- updated_at: when this version of the record became effective
- end_dt: when this version stopped being current (null for the active record)
- current_state: a flag marking whether this is the currently active record
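One way to maintain such a table with Delta Lake is a two-step "expire then append" pattern. The sketch below reuses the column names above, assumes the incoming changes carry an updated_at date, and simplifies change detection to the two tracked attributes; it is illustrative rather than generated code:

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Incoming changes; assumed to carry customer_id, zip_code, state, updated_at.
updates = spark.read.table("customers_bronze")

target = DeltaTable.forName(spark, "customers_silver_scd2")

# Step 1: close out the currently-active rows that are being superseded.
(
    target.alias("t")
    .merge(
        updates.alias("s"),
        "t.customer_id = s.customer_id AND t.current_state = true",
    )
    .whenMatchedUpdate(
        condition="t.zip_code <> s.zip_code OR t.state <> s.state",
        set={"end_dt": "s.updated_at", "current_state": "false"},
    )
    .execute()
)

# Step 2: append the new versions as the current rows.
# (A production pipeline would first filter out unchanged records.)
new_rows = (
    updates
    .withColumn("end_dt", F.lit(None).cast("date"))
    .withColumn("current_state", F.lit(True))
    .select("customer_id", "zip_code", "state", "updated_at", "end_dt", "current_state")
)
new_rows.write.format("delta").mode("append").saveAsTable("customers_silver_scd2")
```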

SCD Type 3

A Type 3 SCD stores two versions of values for selected attributes. Each record holds both the previous value and the current value of the selected attribute. When one of the selected attributes changes, its current value is moved into the previous-value column and the new value becomes the current value.

customer_id | previous_zip_code | zip_code | previous_state | state
123         | 94105             | 10001    | California     | New York

customers_silver_scd3
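A minimal sketch of the corresponding Delta merge, again using the column names from the table above (illustrative, not the exact generated code):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

updates = spark.read.table("customers_bronze")
target = DeltaTable.forName(spark, "customers_silver_scd3")

# SCD Type 3: shift the current values into the previous_* columns,
# then store the incoming values as the new current values.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(
        condition="t.zip_code <> s.zip_code OR t.state <> s.state",
        set={
            "previous_zip_code": "t.zip_code",
            "zip_code": "s.zip_code",
            "previous_state": "t.state",
            "state": "s.state",
        },
    )
    .whenNotMatchedInsert(
        values={
            "customer_id": "s.customer_id",
            "previous_zip_code": "null",
            "zip_code": "s.zip_code",
            "previous_state": "null",
            "state": "s.state",
        },
    )
    .execute()
)
```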


How can I try Prophecy?

Prophecy is available as a SaaS product: add your Databricks credentials and start using it right away. You can also use an Enterprise Trial with Prophecy's Databricks account for a couple of weeks to kick the tires with the included examples. We also support installing Prophecy in your own network (VPC or on-prem) on Kubernetes. Sign up for your account now:

Sign up for your free Account!

We're super excited to share our progress with you. Get in touch with us - we're looking to learn!


Learn more