ProphecyHub: Metadata re-invented with Git & GraphQL for Data Engineering

Authors:
Raj Bains | Arpan Agrawal | Mayank Kotwal

At Prophecy, we’re building a Data Engineering product to replace Legacy ETL, bringing modern software engineering practices to data. We have a unique take on the metadata system: it merges traditional metadata, code, and big data metadata into one unified system. Now that the foundation of the system is strong, we’d like to share what we have learned.

As enterprises abandon legacy ETL products to adopt modern data engineering, they’re running into the same challenges that led Bay Area companies to develop Airbnb’s Dataportal, Uber’s Databook, Netflix’s Metacat, Lyft’s Amundsen, Google’s Data Catalog, and LinkedIn’s DataHub.

Our metadata system represents persons, teams, projects, workflows, datasets, scheduled graphs, runtime environments, clusters, and jobs. It supports our Code=Visual IDE and Column Level Lineage. We designed the system with a focus on the aspects that make us different:

  • Unique Code Interplay: Code, tests and configuration are first-class metadata in our system and stored in Git.
  • Hive Metastore support: Our metadata system provides a built-in, persistent Hive Metastore.
  • Designed for a small engineering team: We are developing this system with one engineer, while aiming to exceed the capabilities of systems built by large teams.
  • Rapid Development: Speed of new feature development must be high, without destabilizing the existing features.


Consumers and Storage for Prophecy Hub

Entity Aspect Model

We liked the concept of modeling metadata as entities and aspects from LinkedIn’s DataHub and built on it. The model has the following characteristics:

  • Entities are the primary objects in the metadata system (such as Project, Workflow, User, shown in blue in the diagram below). The schema of an entity contains only the minimal information required to search for it, and thus rarely changes.
  • Aspects store details about the entities, and contain either the content itself or pointers to the external system where the content is stored. They can evolve independently without affecting other aspects. For a workflow, the info aspect stores basic information (in Postgres), the code aspect stores the code (in Git), and the test aspect stores unit tests.

Now, if we want to add business metadata such as column-level lineage, we just decorate the datasets with the lineage aspect. This allows us to develop new features without any changes to existing code paths.
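To make the idea concrete, here is a minimal sketch of the entity-aspect split. It is written in Python for brevity and is not our Scala implementation; the class name `Entity`, the method `decorate`, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Entity:
    """Minimal, rarely-changing record: just enough to find the entity."""
    entity_id: str
    kind: str   # e.g. "Workflow" or "Dataset"
    name: str
    # Aspects are keyed by aspect name and evolve independently of each other.
    aspects: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def decorate(self, aspect_name: str, content: Dict[str, Any]) -> None:
        """Attach or replace one aspect without touching the others."""
        self.aspects[aspect_name] = content


dataset = Entity("ds-1", "Dataset", "customers")
dataset.decorate("info", {"description": "Customer master data"})

# Adding lineage later is one more aspect; Entity and "info" stay unchanged.
dataset.decorate("lineage", {"columns": {"full_name": ["raw.first", "raw.last"]}})
```

The point of the sketch is that the lineage feature touches neither the `Entity` schema nor the existing `info` aspect.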

Fabrics Concept

On premise, there are Hadoop clusters for test, staging and production environments, and in the public cloud the Spark clusters are often ephemeral. In our system, a Fabric represents such a physical or virtual environment. Also, the same workflow reads or writes a dataset (the logical dataset) that is stored in a different physical location in each environment. So, we have a Physical Dataset for the same Logical Dataset on each Fabric.
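The logical-to-physical mapping can be sketched as a lookup keyed by dataset name and Fabric. This is an illustrative Python sketch under assumed names (`PhysicalDataset`, `DatasetCatalog`, `resolve`), and the storage locations are made up.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PhysicalDataset:
    logical_name: str
    fabric: str
    location: str  # where the data lives on this Fabric


class DatasetCatalog:
    def __init__(self) -> None:
        self._physical: Dict[Tuple[str, str], PhysicalDataset] = {}

    def register(self, ds: PhysicalDataset) -> None:
        self._physical[(ds.logical_name, ds.fabric)] = ds

    def resolve(self, logical_name: str, fabric: str) -> PhysicalDataset:
        """Find the physical dataset a workflow should use on a given Fabric."""
        try:
            return self._physical[(logical_name, fabric)]
        except KeyError:
            raise LookupError(
                f"{logical_name!r} has no physical dataset on fabric {fabric!r}"
            ) from None


catalog = DatasetCatalog()
# Hypothetical locations, one per Fabric, for the same logical dataset.
catalog.register(PhysicalDataset("customers", "test", "hdfs://test-nn/data/customers"))
catalog.register(PhysicalDataset("customers", "production", "s3://prod-bucket/customers"))
```

A workflow then refers only to `customers`, and the Fabric it runs on decides which location is read or written.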

Simplified Entity-Aspect Model

Versioned Aspects with Git

We are building a unique Code=Visual IDE for Spark, and one magical mechanism we have built for it is versioned aspects:

  • We developed VersionedAspect, an aspect whose content is stored in Git. Projects store the Git repo, and VersionedAspects store relative paths and cache commit ids.
  • Now, all we need to do to store Code, Tests, and Configurations is inherit from VersionedAspect with a few lines of code.
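The two points above can be sketched as follows. This is a Python illustration, not our Scala code: the `GitReader` callback stands in for a real Git client, and the subclass names, repo URL, and file path are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# (repo_url, relative_path, commit_id) -> file content.
# A stand-in for a real Git client, injected so the sketch stays self-contained.
GitReader = Callable[[str, str, Optional[str]], str]


@dataclass
class VersionedAspect:
    """Aspect whose content lives in Git; the metadata store keeps only pointers."""
    relative_path: str               # path inside the project's repo
    commit_id: Optional[str] = None  # cached commit id; None means "latest"

    def load(self, repo_url: str, read: GitReader) -> str:
        return read(repo_url, self.relative_path, self.commit_id)


# Concrete aspects need only a few lines each.
class CodeAspect(VersionedAspect):
    pass


class UnitTestAspect(VersionedAspect):
    pass


class ConfigAspect(VersionedAspect):
    pass


def fake_reader(repo_url: str, path: str, commit_id: Optional[str]) -> str:
    """Pretend Git read that just echoes what would be fetched."""
    return f"{repo_url}@{commit_id or 'HEAD'}:{path}"


code = CodeAspect("workflows/customers/code/main.scala", commit_id="abc123")
content = code.load("git@example.com:acme/project.git", fake_reader)
```

The repo URL is passed in by the caller because, as described above, it belongs to the Project rather than to the aspect.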

This serves two important use cases:

  • Our metadata system acts like a traditional metadata system to our IDE, serving and storing workflows, including the visual workflow, code, config and tests.
  • More importantly, you can just go to the Git repo, build the workflows, and run the tests from the command line.

Our metadata storage is a completely functional Git repository. Our customers integrate Jenkins and CI/CD with it. The metadata contains much more than what is in Git, though.

HiveMetastore Aspects

For many Hadoop-based systems, Hive Metastore is a challenge. It stores the schema, physical layout and not much else. Neither will it suffice for your needs for a rich metadata system, nor can you do away with it. We solved it this way:

  • PhysicalDataset Entities have a HiveMetastore Aspect that decorates the dataset in the metadata graph. In the metadata screen in the UI, you can pull information from Hive Metastore.
  • On premise, ProphecyHub can connect to an existing HiveMetastore of a persistent cluster.
  • On the public cloud, ProphecyHub provides a HiveMetastore, so that when spinning up a Spark cluster you can just point it to ProphecyHub, which provides a Thrift interface. Each Fabric, such as Test, Integration, or Production, gets its own environment. As ephemeral clusters come up, many clusters can connect to the same environment.
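As a rough illustration of pointing an ephemeral cluster at a per-Fabric metastore, the Python sketch below builds the Spark settings for a given Fabric. The host names and ports are invented, and the idea that each Fabric is reached through its own Thrift URI is an assumption for this example. The two configuration keys are the standard Spark and Hive settings for using an external Hive Metastore.

```python
from typing import Dict

# Hypothetical Thrift endpoints: one Hive Metastore environment per Fabric.
METASTORE_URIS: Dict[str, str] = {
    "test": "thrift://prophecyhub.example.com:9083",
    "integration": "thrift://prophecyhub.example.com:9084",
    "production": "thrift://prophecyhub.example.com:9085",
}


def spark_conf_for(fabric: str) -> Dict[str, str]:
    """Spark settings that make a new cluster use the Fabric's metastore."""
    return {
        # Use the Hive catalog instead of Spark's in-memory catalog.
        "spark.sql.catalogImplementation": "hive",
        # Standard Hive setting, passed through Spark's spark.hadoop.* prefix.
        "spark.hadoop.hive.metastore.uris": METASTORE_URIS[fabric],
    }


conf = spark_conf_for("production")
```

Every ephemeral cluster started for the same Fabric would receive the same settings, which is how many clusters end up sharing one environment.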

Interface in GraphQL

Coming from a systems background in databases and compilers, we found that a REST interface for metadata made little sense because of its large surface area.

GraphQL stack

Initially, with REST we ended up with too many endpoints, no type safety, and interface changes that required a lot of coordination. This would be the equivalent of having a SQL database and adding a new JDBC endpoint for every query. We quickly abandoned it in favor of GraphQL.
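To illustrate the difference, the Python sketch below builds the body of a single GraphQL request in the usual JSON shape with `query` and `variables` fields. The schema fields (`project`, `workflows`, `name`) and the project id are hypothetical and are not taken from our actual schema.

```python
import json
from typing import Any, Dict

# One query names exactly the fields the caller needs; no new endpoint required.
PROJECT_QUERY = """
query Project($id: ID!) {
  project(id: $id) {
    name
    workflows { name }
  }
}
"""


def graphql_request(query: str, variables: Dict[str, Any]) -> str:
    """Serialize a GraphQL request body as it would be POSTed to one endpoint."""
    return json.dumps({"query": query, "variables": variables})


body = graphql_request(PROJECT_QUERY, {"id": "project-42"})
```

A client that later also needs, say, the datasets of a project would change only the query text, while the endpoint and the server interface stay the same.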

Project GraphQL definition in Scala/Sangria

For the GraphQL implementation, we use Apollo Client in the user interface to work with React. For our services we have written our own Scala client, though we could just as easily have added a Scala plugin to GraphQL Code Generator (using JavaScript). Our services and crawlers use this interface. On the server side we use Sangria, with GraphiQL for testing.

Interface Summary
Project business logic

Storage

Storage uses a Git client, and for SQL we use Slick, whose functional-relational mapping is intuitive and terse. The Entity graph is small and stored in Postgres. Aspects are stored as JSON documents, also in Postgres. We store metadata for multiple Hive Metastores in Postgres as well.
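The storage layout can be sketched as one small entity table plus an aspect table holding JSON documents. To keep the example self-contained, it uses Python with an in-memory SQLite database and plain SQL in place of Postgres and Slick. The table names, column names, and helper functions are assumptions for illustration only.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Postgres here
conn.executescript("""
    CREATE TABLE entity (
        id   TEXT PRIMARY KEY,
        kind TEXT NOT NULL,
        name TEXT NOT NULL
    );
    CREATE TABLE aspect (
        entity_id TEXT NOT NULL REFERENCES entity(id),
        name      TEXT NOT NULL,
        content   TEXT NOT NULL,  -- JSON document
        PRIMARY KEY (entity_id, name)
    );
""")


def put_aspect(entity_id: str, name: str, content: dict) -> None:
    """Insert or overwrite one aspect document for an entity."""
    conn.execute(
        "INSERT OR REPLACE INTO aspect (entity_id, name, content) VALUES (?, ?, ?)",
        (entity_id, name, json.dumps(content)),
    )


def get_aspect(entity_id: str, name: str) -> dict:
    row = conn.execute(
        "SELECT content FROM aspect WHERE entity_id = ? AND name = ?",
        (entity_id, name),
    ).fetchone()
    if row is None:
        raise KeyError((entity_id, name))
    return json.loads(row[0])


conn.execute("INSERT INTO entity VALUES ('wf-1', 'Workflow', 'customer_cleanup')")
put_aspect("wf-1", "info", {"owner": "data-eng", "description": "Cleans customer records"})
```

Because each aspect is a separate row keyed by entity and aspect name, a new aspect type needs no change to the entity table.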

Project storage interface in Scala/Slick

What’s next

Apart from the incremental work of representing the consumption side with reports, dashboards, business definitions, and business user comments, the roadmap features that were critical considerations for the design are:

User Extensibility

Our users will be able to define new types of Aspects and decorate Entities with them. We’re adding an API that lets users define Aspects with new schemas and then add Aspect objects that follow those schemas.
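Since this API is still on the roadmap, the following Python sketch only shows one plausible shape for it: a registry where a user defines an aspect schema and aspect objects are checked against it before being attached. The names `AspectRegistry`, `define`, and `validate`, and the example schema, are all hypothetical.

```python
from typing import Any, Dict


class AspectRegistry:
    """Holds user-defined aspect schemas as field-name -> expected-type maps."""

    def __init__(self) -> None:
        self._schemas: Dict[str, Dict[str, type]] = {}

    def define(self, aspect_name: str, schema: Dict[str, type]) -> None:
        self._schemas[aspect_name] = dict(schema)

    def validate(self, aspect_name: str, content: Dict[str, Any]) -> None:
        """Raise if the content does not follow the schema defined for the aspect."""
        schema = self._schemas[aspect_name]  # KeyError if the aspect is undefined
        missing = schema.keys() - content.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        for field_name, expected in schema.items():
            if not isinstance(content[field_name], expected):
                raise TypeError(f"{field_name!r} must be {expected.__name__}")


registry = AspectRegistry()
# A user-defined aspect type that did not exist when the system was built.
registry.define("data_quality", {"score": float, "checked_by": str})
registry.validate("data_quality", {"score": 0.97, "checked_by": "nightly-job"})
```

Once an object passes validation, it could be attached to an Entity in the same way as any built-in aspect.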

Multi-Cloud

Enterprises often have a multi-cloud strategy, and we have designed for it. We have the concept of multiple Fabrics, so that one data plane can be on Azure Databricks and another on AWS EMR. In addition, the metadata will be visible across both locations via shared storage, made possible by geo-distributed, Postgres-compatible databases such as CockroachDB.

Search

We will implement text and facet search across all stored metadata soon, including searching the codebase. It’s an essential part of any metadata system. We’ll add relevance to show recent and important datasets for discovery.

We’re quite happy with how our metadata system has turned out, and think it will serve us well for quite some time. If you have ideas on improving it or want to discuss the system, reach out to me at raj.bains@prophecy.io.