Spark: Column Level Lineage

Authors:
Raj Bains | Maciel Szpakowski

At Prophecy, we’re building a Data Engineering product to replace Legacy ETL, bringing modern software engineering practices to data. Modern data engineering hinges on using code as the center of the universe, and using continuous integration and continuous deployment for agility. This means that for lineage, we’ll also be relying on computing it from code - Spark or Hive.

Column level lineage has very important use cases for different users in an Enterprise

  • Production Debugging: If the data of a column is incorrect, help figure out which workflow, and within that workflow what line of code, modified this last, and go backward from that. Data Engineering needs this, and without it L1, L2 support can’t find the source of an error, and multiple devs need to get involved and discuss across teams to figure out where the bug happened.
  • Privacy: Given data such as social security numbers, figure out which of the forty thousand datasets in the Enterprise, has this value propagated to, and what are the resulting privacy restrictions on them. Metadata Architects need this, and in its absence they talk to multiple teams to figure out flow of values across thousands of workflows and datasets. The process is slow, expensive and error prone adding compliance risk.
  • Business Decisions: Given a data value, using which an analyst needs to make a business decision, understand how this got there, and summary-formula and description of what it represents. In the absence of this, they have to guess the value and hope that they’re right, or talk to data engineers who wrote this and understand it.

Crawlers on git

We compute lineage by periodically crawling git repositories and support Spark and Hive repositories currently. We maintain datasets and workflows as metadata objects in out repositories, corresponding to the source locations.

Dataset View

The place to start is the dataset that you’re interested in. Here, our view shows the

  • selected dataset in the center — indicated by a subtle dotted line also
  • all the adjacent workflows reading and writing that dataset
  • all datasets read and written by adjacent workflows

Here, you can select a column on the right (such as id or firstName), on selecting the column, you can see:

  • Red shows the workflow that edits the column
  • Yellow shows the flow of a passthrough value i.e. present but unmodified
  • Gray shows the absence of the column on a path, workflow or dataset

Workflow View

Once you have chosen a particular column, you can dive into a workflow in the context of a column. Here, you can view where the column is modified, passthrough or absent using the same color code.

When you click on a node in the workflow, we pull up the code for that node, with the relevant line highlighted.

You can navigate by clicking a node in the workflow, or moving right, left by clicking a particular column. If the column is present in multiple nodes, you click the node you want to follow.

Summary

As we move to Data Engineering, we don’t have to throw out all the features that provide us with productivity during development, production debugging or business analytics.

Data Engineering means the stack will be rebuilt, with code and agility at its center and Prophecy is focused on it. We’d love to hear from you and see if we can help you succeed in the transition from Legacy ETL to Agile Data Engineering.