ProphecyHub: Metadata re-invented with Git & GraphQL for Data Engineering

Authors: Raj Bains, Arpan Agrawal, Mayank Kotwal

At Prophecy, we’re building a Data Engineering product to replace legacy ETL, bringing modern software-engineering practices to data. We take a unique approach to the metadata system, merging traditional metadata, code, and big data metadata into a unified whole. Now that the foundation of the system is strong, we’d like to share our learnings.

As enterprises abandon legacy ETL products and adopt modern data engineering, they’re running into the same challenges that led Bay Area companies to build Airbnb’s Dataportal, Uber’s Databook, Netflix’s Metacat, Lyft’s Amundsen, Google’s Data Catalog, and LinkedIn’s DataHub.

Our metadata system represents persons, teams, projects, workflows, datasets, scheduled graphs, runtime environments, clusters, and jobs. It supports our Code=Visual IDE and Column-Level Lineage. We have designed the system with a focus on certain aspects that make us different:

Entity-Aspect Model

We liked the concept of modeling metadata as entities and aspects, pioneered by LinkedIn’s DataHub, and built on it: entities such as datasets and workflows form the core graph, while aspects decorate those entities with additional metadata, and each aspect can evolve independently of the others.

Now, if we want to add business metadata such as column-level lineage, we just decorate the datasets with the lineage aspect. This allows us to develop new features without any changes to existing code paths.
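As a minimal sketch of the pattern (the types and names here are illustrative, not our production code), decorating an entity with a new aspect looks like this:

```scala
// Illustrative sketch of an entity-aspect model.
sealed trait Aspect { def aspectType: String }

case class SchemaAspect(columns: List[String]) extends Aspect {
  val aspectType = "schema"
}

// A new feature such as column-level lineage is just another aspect type;
// nothing about Entity or the existing aspects has to change.
case class LineageAspect(upstream: Map[String, List[String]]) extends Aspect {
  val aspectType = "columnLineage"
}

// Entities carry an open-ended bag of aspects keyed by aspect type.
case class Entity(id: String, kind: String, aspects: Map[String, Aspect]) {
  def decorate(a: Aspect): Entity = copy(aspects = aspects + (a.aspectType -> a))
}

val dataset = Entity("ds-1", "Dataset", Map.empty)
  .decorate(SchemaAspect(List("customer_id", "amount")))
  .decorate(LineageAspect(Map("amount" -> List("raw_orders.total"))))
```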

Fabrics Concept

On premises, there are Hadoop clusters for test, staging, and production environments, while in the public cloud the Spark clusters are often ephemeral. In our system, a Fabric represents such a physical or virtual environment. Also, the same workflow needs to read or write a dataset (a logical dataset) that is stored at different physical locations in different environments. So we have a Physical Dataset for the same Logical Dataset on each Fabric.
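A small sketch of the idea, again with illustrative names:

```scala
// A Fabric is a physical or virtual execution environment.
case class Fabric(name: String, sparkEndpoint: String)

// One logical dataset, referenced by workflows...
case class LogicalDataset(id: String, name: String)

// ...resolves to a different physical location on each Fabric.
case class PhysicalDataset(logicalId: String, fabric: String, path: String)

val customers = LogicalDataset("ds-customers", "customers")

val physical = Map(
  "dev"  -> PhysicalDataset(customers.id, "dev",  "s3://dev-bucket/customers"),
  "prod" -> PhysicalDataset(customers.id, "prod", "s3://prod-bucket/customers")
)

// The workflow stays unchanged; only the Fabric it runs on decides the location.
def resolve(ds: LogicalDataset, fabric: Fabric): PhysicalDataset =
  physical(fabric.name)
```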

Simplified Entity-Aspect Model

VersionedAspects with Git

We have built a unique Code=Visual IDE for Spark, and one magical mechanism behind it is the VersionedAspect: an aspect, such as workflow code, whose history lives in Git rather than in the database.

This serves two important use cases: the same workflow can be edited visually or as code with the two always in sync, and every change is versioned and reviewable through standard Git workflows.

Our metadata storage is a completely functional Git repository, and our customers integrate Jenkins and CI/CD with it. The metadata contains much more beyond Git, though.
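As a rough illustration of the mechanism, assuming a plain JGit-backed repository and a file-per-aspect layout (both are simplifications, not our exact layout):

```scala
import java.io.File
import java.nio.file.{Files, Paths}
import org.eclipse.jgit.api.Git

// Persist a versioned aspect as a file in a working Git repository and
// commit it, returning the commit SHA. Each change becomes a commit, so
// standard Git tooling (review, Jenkins, CI/CD) applies directly.
def saveVersionedAspect(repoDir: String, entityId: String,
                        aspectType: String, body: String): String = {
  val relPath = s"$entityId/$aspectType.json"
  val file = Paths.get(repoDir, relPath)
  Files.createDirectories(file.getParent)
  Files.write(file, body.getBytes("UTF-8"))

  val git = Git.open(new File(repoDir))
  try {
    git.add().addFilepattern(relPath).call()
    git.commit().setMessage(s"Update $aspectType for $entityId").call().getName
  } finally git.close()
}
```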

HiveMetastore Aspects

For many Hadoop-based systems, the Hive Metastore is a challenge. It stores the schema and physical layout, and not much else. It will neither suffice for the needs of a rich metadata system, nor can you do away with it. We solved this by treating each Hive Metastore as one more source of aspects: we crawl the metastore and attach its schema and physical-layout information to our dataset entities.
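A simplified sketch of such a crawl, using the standard HiveMetaStoreClient (the aspect case class is our illustration):

```scala
import scala.jdk.CollectionConverters._
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

// The schema and physical layout become one more aspect on the dataset entity.
case class HiveTableAspect(columns: List[(String, String)], location: String)

def crawlTable(metastoreUri: String, db: String, table: String): HiveTableAspect = {
  val conf = new HiveConf()
  conf.setVar(HiveConf.ConfVars.METASTOREURIS, metastoreUri)
  val client = new HiveMetaStoreClient(conf)
  try {
    val t = client.getTable(db, table)
    val cols = t.getSd.getCols.asScala.toList.map(f => f.getName -> f.getType)
    HiveTableAspect(cols, t.getSd.getLocation)
  } finally client.close()
}
```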

Interface in GraphQL

Coming from a background in databases and compilers, we felt that a REST interface for metadata made little sense due to its high surface area.

GraphQL stack

Initially, with REST, we ended up with too many endpoints, no type safety, and interface changes that required significant coordination. This would be the equivalent of having a SQL database and adding a new JDBC endpoint for every query. We quickly abandoned it in favor of GraphQL.

Project GraphQL definition in Scala/Sangria
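As a minimal sketch of such a definition in Sangria (the entities and fields here are simplified illustrations, not our exact schema):

```scala
import sangria.schema._

case class Workflow(id: String, name: String)
case class Project(id: String, name: String, workflows: List[Workflow])

// Context handed to resolvers; a stand-in for the real storage layer.
trait ProjectRepo { def project(id: String): Option[Project] }

val WorkflowType = ObjectType(
  "Workflow",
  fields[Unit, Workflow](
    Field("id", IDType, resolve = _.value.id),
    Field("name", StringType, resolve = _.value.name)
  )
)

val ProjectType = ObjectType(
  "Project",
  fields[Unit, Project](
    Field("id", IDType, resolve = _.value.id),
    Field("name", StringType, resolve = _.value.name),
    Field("workflows", ListType(WorkflowType), resolve = _.value.workflows)
  )
)

// A single query endpoint replaces the sprawl of REST endpoints.
val QueryType = ObjectType(
  "Query",
  fields[ProjectRepo, Unit](
    Field("project", OptionType(ProjectType),
      arguments = Argument("id", IDType) :: Nil,
      resolve = c => c.ctx.project(c.arg[String]("id")))
  )
)

val ProjectSchema = Schema(QueryType)
```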

For the GraphQL implementation, we use the Apollo client in the user interface to work with React, and for our services we have written our own Scala client, though we could just as easily have added a Scala plugin to GraphQL Code Generator (which is written in JavaScript). Our services and crawlers use this interface. On the server side, we use Sangria, with GraphiQL for testing.
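Continuing the sketch above, executing a query server-side with Sangria (assuming the circe marshalling and a stub repository) looks roughly like this:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import sangria.execution.Executor
import sangria.macros._
import sangria.marshalling.circe._

// Parse the query at compile time with the graphql interpolator.
val query = graphql"""
  query {
    project(id: "p-1") {
      name
      workflows { name }
    }
  }
"""

// Stub repository for illustration; the real context is the storage layer.
val repo: ProjectRepo = new ProjectRepo {
  def project(id: String) = Some(Project(id, "demo", Nil))
}

// Returns a Future[Json] that a thin HTTP layer (or GraphiQL, in testing) serves.
val result = Executor.execute(ProjectSchema, query, userContext = repo)
```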

Interface Summary
Project business logic
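As a simplified sketch, the business-logic layer mediates between the GraphQL resolvers and storage (class and method names are our illustrations, reusing the Project type from the sketch above):

```scala
import scala.concurrent.{ExecutionContext, Future}

// SQL-backed storage for the entity graph (illustrative interface).
trait ProjectStore {
  def find(id: String): Future[Option[Project]]
  def insert(p: Project): Future[Project]
}

// Git-backed storage for versioned aspects such as workflow code.
trait VersionedStore {
  def commit(path: String, body: String, message: String): Future[String]
}

class ProjectService(sql: ProjectStore, git: VersionedStore)
                    (implicit ec: ExecutionContext) {

  // GraphQL resolvers delegate here; the service coordinates SQL and Git.
  def project(id: String): Future[Option[Project]] = sql.find(id)

  def saveWorkflowCode(projectId: String, workflowId: String,
                       code: String): Future[String] =
    git.commit(s"$projectId/$workflowId.scala", code,
      s"Update workflow $workflowId")
}
```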

Storage

Storage uses a Git client, and for SQL we use Slick, whose functional-relational mapping is intuitive and terse. The entity graph is small and stored in Postgres. Aspects are stored as JSON documents, also in Postgres. We store metadata for multiple Hive Metastores in Postgres as well.

Project storage interface in Scala/Slick
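A minimal sketch of what this looks like with Slick (table and column names are illustrative):

```scala
import slick.jdbc.PostgresProfile.api._

class ProjectsTable(tag: Tag) extends Table[(String, String)](tag, "projects") {
  def id   = column[String]("id", O.PrimaryKey)
  def name = column[String]("name")
  def *    = (id, name)
}

class AspectsTable(tag: Tag)
    extends Table[(String, String, String)](tag, "aspects") {
  def entityId   = column[String]("entity_id")
  def aspectType = column[String]("aspect_type")
  def body       = column[String]("body") // the aspect's JSON document
  def *          = (entityId, aspectType, body)
}

val projects = TableQuery[ProjectsTable]
val aspects  = TableQuery[AspectsTable]

// The functional-relational style keeps queries terse:
def aspectsFor(entityId: String) =
  aspects.filter(_.entityId === entityId)
    .map(a => (a.aspectType, a.body))
    .result
```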

What’s next

Apart from the incremental work of representing the consumption side, with reports, dashboards, business definitions, and business-user comments, the roadmap features that were critical considerations in the design are:

User Extensibility

Our users define new types of Aspects and decorate Entities with them. We’re adding an API that allows users to define Aspects with new schemas and then add Aspect objects conforming to them.
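As a rough sketch of the direction (this API is an illustration, not the final interface):

```scala
// A user-defined aspect type, described by a JSON schema.
case class AspectSchema(name: String, jsonSchema: String)

class AspectRegistry {
  private var schemas = Map.empty[String, AspectSchema]

  // Users register a new aspect type with its schema...
  def register(schema: AspectSchema): Unit =
    schemas += schema.name -> schema

  // ...and can then decorate any entity with aspect objects of that type.
  // (A real implementation would validate the body against the JSON schema.)
  def isRegistered(aspectType: String): Boolean =
    schemas.contains(aspectType)
}

val registry = new AspectRegistry
registry.register(AspectSchema(
  "dataQuality",
  """{"type": "object", "properties": {"score": {"type": "number"}}}"""
))
```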

Multi-Cloud

Enterprises often have a multi-cloud strategy, and we have designed for it. First, we have the concept of multiple Fabrics, so that one data plane can be on Azure Databricks and another on AWS EMR. Second, metadata will be visible across both locations via shared storage, made possible by geo-distributed, Postgres-compatible databases such as CockroachDB.

Search

We will soon implement text and facet search across all stored metadata, including the codebase; it’s an essential part of any metadata system. We’ll add relevance ranking to surface recent and important datasets for discovery.

We’re quite happy with how our metadata system has turned out and think it will serve us well for quite some time. If you have ideas for improving it or want to discuss the system, reach out to us at contact.us@prophecy.io.