Spark: Column Level Lineage

Author: Raj Bains, Maciej Szpakowski

At Prophecy, we’re building a Data Engineering product to replace Legacy ETL, bringing modern software engineering practices to data. Modern data engineering puts code at the center of the universe and relies on continuous integration and continuous deployment for agility. This means that lineage, too, must be computed from code - Spark or Hive.
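To make “computing lineage from code” concrete, here is a minimal sketch for Spark: it walks a DataFrame’s analyzed logical plan and maps each output column to the input columns its expression references. The table and column names are illustrative, and this is a sketch of the general technique, not Prophecy’s implementation.

```scala
// A minimal lineage sketch: map each output column of a Project node to the
// input attributes its defining expression references.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, lit}
import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference}
import org.apache.spark.sql.catalyst.plans.logical.Project

object ColumnLineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lineage-sketch").getOrCreate()
    import spark.implicits._

    // Illustrative input: a tiny customers table.
    val customers = Seq((1, "Jane", "Doe")).toDF("id", "firstName", "lastName")
    val report = customers.select(
      $"id",
      concat($"firstName", lit(" "), $"lastName").as("fullName"))

    // Walk the analyzed logical plan; for every projection, print which input
    // columns feed each output column and whether it is computed (an Alias)
    // or passed through unchanged (a bare AttributeReference).
    report.queryExecution.analyzed.foreach {
      case Project(projectList, _) =>
        projectList.foreach { ne =>
          val inputs = ne.references.map(_.name).mkString(", ")
          val kind = ne match {
            case _: Alias              => "modified"
            case _: AttributeReference => "passthrough"
            case _                     => "other"
          }
          println(s"${ne.name} <- [$inputs] ($kind)")
        }
      case _ => () // ignore non-projection nodes in this sketch
    }
    spark.stop()
  }
}
```

Running this prints, for example, `fullName <- [firstName, lastName] (modified)` and `id <- [id] (passthrough)`.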

Column-level lineage has important use cases for many different users in an enterprise.

Crawlers on git

We compute lineage by periodically crawling git repositories, and we currently support Spark and Hive repositories. We maintain datasets and workflows as metadata objects in our repositories, corresponding to the source locations.
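For illustration, a crawl pass might look something like the sketch below; the repository path, the git invocation, and the file filter are assumptions, not a description of our actual crawler.

```scala
// A minimal sketch of a periodic crawl loop, assuming repositories are
// already cloned locally.
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._
import scala.sys.process._

object GitCrawlerSketch {
  // Hypothetical local clones of the repositories being watched.
  val repos: Seq[Path] = Seq(Paths.get("/repos/etl-pipelines"))

  def crawlOnce(): Unit =
    repos.foreach { repo =>
      // Refresh the clone, then visit every Spark/Hive source file.
      Process(Seq("git", "pull", "--ff-only"), repo.toFile).!
      Files.walk(repo).iterator().asScala
        .filter(p => p.toString.endsWith(".scala") || p.toString.endsWith(".sql"))
        .foreach(src => println(s"would extract lineage from $src"))
    }
}
```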

Dataset View

The place to start is the dataset you’re interested in. Here, our view shows the dataset’s columns and the workflows connected to it. You can select a column on the right (such as id or firstName); on selecting the column, you can see which workflows modify it, pass it through unchanged, or don’t use it at all, each indicated by a color code.

(Figure: Simplified Entity-Aspect Model)
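As a sketch of what such an entity-aspect model can look like, the Scala below keeps lineage as one aspect of a dataset or workflow entity; the aspect names and status values are illustrative, grounded only in the modified/passthrough/absent color code used in these views.

```scala
// A minimal sketch of an entity-aspect model, with lineage as one aspect.
sealed trait ColumnStatus
case object Modified    extends ColumnStatus // the workflow computes the column
case object Passthrough extends ColumnStatus // the workflow copies it unchanged
case object Absent      extends ColumnStatus // the workflow never touches it

sealed trait Aspect
case class SchemaAspect(columns: Seq[String]) extends Aspect
// workflow id -> (column name -> status): this is what drives the color code.
case class LineageAspect(byWorkflow: Map[String, Map[String, ColumnStatus]]) extends Aspect

// An entity is a dataset or a workflow plus whatever aspects the crawler attaches.
case class Entity(id: String, kind: String, aspects: Map[String, Aspect])
```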

Workflow View

Once you have chosen a particular column, you can dive into a workflow in the context of that column. Here, you can see where the column is modified, passed through, or absent, using the same color code.

When you click on a node in the workflow, we pull up the code for that node, with the relevant line highlighted.

You can navigate by clicking a node in the workflow, or move left and right by clicking a particular column. If the column is present in multiple nodes, click the node you want to follow.
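Under the hood, each node’s output columns determine its color. Here is a minimal sketch of that classification, assuming each workflow node can report the named expressions it emits (the `outputs` parameter is hypothetical):

```scala
// A minimal sketch of the per-node classification behind the color code.
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, NamedExpression}

object NodeColumnStatus {
  def statusOf(column: String, outputs: Seq[NamedExpression]): String =
    outputs.find(_.name == column) match {
      case Some(_: AttributeReference) => "passthrough" // emitted unchanged
      case Some(_)                     => "modified"    // computed at this node
      case None                        => "absent"      // not emitted here
    }
}
```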

Summary

As we move to Data Engineering, we don’t have to throw out the features that make us productive during development, production debugging, or business analytics.

Data Engineering means the stack will be rebuilt with code and agility at its center, and Prophecy is focused on exactly that. We’d love to hear from you and see if we can help you succeed in the transition from Legacy ETL to Agile Data Engineering.