Spark: Column-level lineage

Prophecy computes column-level lineage from code for Spark code on Git.

Raj Bains

Assistant Director of R&D
Texas Rangers Baseball Club
‍

March 17, 2020

April 22, 2023

At Prophecy, we’re building a Data Engineering product to replace Legacy ETL, bringing modern software engineering practices to data. Modern data engineering hinges on using code as the center of the universe, and using continuous integration and continuous deployment for agility. This means that for lineage, we’ll also be relying on computing it from code - Spark or Hive.

Column level lineage has very important use cases for different users in an Enterprise

Production Debugging: If the data of a column is incorrect, help figure out which workflow, and within that workflow what line of code, modified this last, and go backward from that. Data Engineering needs this, and without it L1, L2 support can’t find the source of an error, and multiple devs need to get involved and discuss across teams to figure out where the bug happened.
Privacy: Given data such as social security numbers, figure out which of the forty thousand datasets in the Enterprise, has this value propagated to, and what are the resulting privacy restrictions on them. Metadata Architects need this, and in its absence they talk to multiple teams to figure out flow of values across thousands of workflows and datasets. The process is slow, expensive and error prone adding compliance risk.
Business Decisions: Given a data value, using which an analyst needs to make a business decision, understand how this got there, and summary-formula and description of what it represents. In the absence of this, they have to guess the value and hope that they’re right, or talk to data engineers who wrote this and understand it.

Crawlers on git

We compute lineage by periodically crawling git repositories and support Spark and Hive repositories currently. We maintain datasets and workflows as metadata objects in out repositories, corresponding to the source locations.

Dataset View

The place to start is the dataset that you’re interested in. Here, our view shows the

selected dataset in the center — indicated by a subtle dotted line also
all the adjacent workflows reading and writing that dataset
all datasets read and written by adjacent workflows

Here, you can select a column on the right (such as id or firstName), on selecting the column, you can see:

Red shows the workflow that edits the column
Yellow shows the flow of a passthrough value i.e. present but unmodified
Gray shows the absence of the column on a path, workflow or dataset

Simplified Entity-Aspect Model

Workflow View

Once you have chosen a particular column, you can dive into a workflow in the context of a column. Here, you can view where the column is modified, passthrough or absent using the same color code.

When you click on a node in the workflow, we pull up the code for that node, with the relevant line highlighted.

You can navigate by clicking a node in the workflow, or moving right, left by clicking a particular column. If the column is present in multiple nodes, you click the node you want to follow.

Summary

As we move to Data Engineering, we don’t have to throw out all the features that provide us with productivity during development, production debugging or business analytics.

Data Engineering means the stack will be rebuilt, with code and agility at its center and Prophecy is focused on it. We’d love to hear from you and see if we can help you succeed in the transition from Legacy ETL to Agile Data Engineering.

Ready to give Prophecy a try?

You can create a free account and get full access to all features for 21 days. No credit card needed. Want more of a guided experience? Request a demo and we’ll walk you through how Prophecy can empower your entire data team with low-code ETL today.

Ready to see Prophecy in action?

Request a demo and we’ll walk you through how Prophecy’s AI-powered visual data pipelines and high-quality open source code empowers everyone to speed data transformation

Get started with the Low-code Data Transformation Platform

Meet with us at Gartner Data & Analytics Summit in Orlando March 11-13th. Schedule a live 1:1 demo at booth #600 with our team of low-code experts. Request a demo here.