Be More Productive on Spark with Prophecy
We've spent the last two years working to perfect the development and deployment experience on Spark, and we're very excited to share what we've learned about productive and powerful Spark development.
In this blog, we'll share the four main pillars of our new low-code product that make Spark development much easier and faster. Then we'll walk through the pieces of the data engineering process and show how they come together to support these pillars.
TL;DR, show me the product? Here is the link :)
Four Pillars for Productivity & Power
Here are the four pillars - let's understand each a bit better!
- Productive: Low-Code: Low code enables many more users to become successful on Spark, and enables all users to build workflows 10x faster. First, once you have one team enabled, you often want to expand usage to other teams - visual ETL developers, data analysts, and machine learning engineers - many of whom sit outside the central platform and data engineering teams. Second, a common pattern we see in enterprises is contract workers augmenting ETL teams; companies want to shrink that footprint by making their current teams more productive. We notice that (1) many more teams can become productive on Spark, and (2) workflows that used to take two weeks can now be completed in a couple of days.
- Productive: Complete: Switching from tool to tool, stitching tools together, sifting through various logs, and working through manual steps to keep the whole system running is a big drain on both the development and production sides of the house. A single well-thought-out tool - with development, deployment, scheduling, metadata search, and lineage all interoperating, and most manual tasks automated - is a big productivity boost for the entire team.
- Powerful: Code Based: No one wants the old-school ETL tools that store your workflow as XML that only they can run. By having the low-code tool generate high-quality code on Git, you not only avoid lock-in, but can easily integrate development with your CI, CD, and deployment pipelines - making agile development work seamlessly.
- Powerful: Extensible: Standardization increases productivity greatly - building workflows out of standard components makes development fast, performant, less error prone and the entire team can understand each others' code.
However, the few standard transforms provided by ETL tools have never been enough. So how do you handle this? You can write scripts, but then there is no standardization - which misses the whole point. These tools allowed some clunky user-defined components with JSON inputs - ugh.
We've made custom gems the heart of Prophecy: all our standard gems (sources, targets, and transforms) are built with the same Gem-Builder that you will use to add your own transforms. You can build a data quality library, or an Elasticsearch write library, and roll it out to your entire team - with a great UI and high-quality generated code.
Low-Code for Complete Data Engineering
Prophecy is a low-code data engineering product on Spark, Delta Lake, and Airflow. You get the best experience with Databricks in the cloud. Let's look at how to accomplish various tasks:
Develop a new Spark Workflow in 5 mins!
Developing a simple end-to-end workflow on Spark should take no more than 5 minutes with Low-Code. Here are the key things that make development fast:
- Standard Gems on Canvas - Drag-and-drop development is very fast - since a lot of standard functionality is baked in. You can browse your datasets instead of trying to remember what the third argument in the API was or what the path of the source data is.
- Interactive Execution - Connecting to a Spark cluster (or spinning one up) and interactively running after each step ensures your code is tested as you develop. You can see the schema at every step, and sample data at every step - so you understand what transform to write next.
- Code is generated - In the 5 minutes it takes to develop the workflow, code is generated in a clean structure on git - including the main file, supporting files, and build files. You will be done before someone doing this manually has run their first line - they'd still be creating files, getting the build to work, and adding the correct imports.
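The code Prophecy generates is its own; purely to illustrate the shape such generated pipeline code can take, here is a minimal hand-written sketch (all names hypothetical) using plain Python functions over lists of dicts in place of Spark DataFrames, so the example stays self-contained:

```python
# Hypothetical sketch of a generated pipeline's structure (not Prophecy's
# actual output). Lists of dicts stand in for Spark DataFrames.

def source_customers():
    # Source gem: in a real pipeline this would read a configured dataset.
    return [
        {"id": 1, "state": "CA", "amount": 120.0},
        {"id": 2, "state": "NY", "amount": 80.0},
        {"id": 3, "state": "CA", "amount": 200.0},
    ]

def filter_west_coast(rows):
    # Transform gem: keep only west-coast customers.
    return [r for r in rows if r["state"] in ("CA", "WA", "OR")]

def aggregate_by_state(rows):
    # Transform gem: total amount per state.
    totals = {}
    for r in rows:
        totals[r["state"]] = totals.get(r["state"], 0.0) + r["amount"]
    return [{"state": s, "total": t} for s, t in sorted(totals.items())]

def pipeline():
    # The "main" file wires the gems together in dependency order.
    return aggregate_by_state(filter_west_coast(source_customers()))

if __name__ == "__main__":
    print(pipeline())
```

Each gem becomes a small, named function, and the main entry point composes them - which is what makes the generated code easy to read, diff, and review in Git.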
Let's see this in action - we've accelerated the videos to respect your time!
Extend with your Own Gems!
If you're on the platform team, you want to standardize development. We'll show how you can build and roll out your own framework.
- Build a Framework - Architects and platform teams focused on standardization and quality explore the options for a function such as data quality, decide on the best approach, and then roll out a framework to the team. Prophecy enables you to build and roll out such a framework.
- Your Gems with fill-in-the-blanks - You can write a standard Spark function, then specify which parts (the blanks) you want the developer or data analyst to fill in. You can also specify how the UI for the gem should look. These show up in the Prophecy toolbar just like built-in gems.
- Templates for Analysts - As we extend the Gem-Builder to support subgraphs over the next few weeks, you'll be able to create standardized templates - perhaps consisting of standard inputs, a few transforms to modify, some auditing functionality, and a write to a standard location. This can make Spark available to many more users.
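The Gem-Builder has its own spec format; purely to make the fill-in-the-blanks idea concrete, here is a toy sketch (hypothetical names, plain Python rows instead of Spark DataFrames) where a platform team writes a full transform once and marks the parts a developer supplies:

```python
# Illustrative sketch of the fill-in-the-blanks idea (hypothetical, not the
# actual Gem-Builder API). The platform team writes the transform template.

def make_null_check_gem(columns_to_check, on_violation):
    """Data-quality gem template.

    columns_to_check and on_violation are the "blanks": in a low-code UI
    these would be rendered as form fields for the developer to fill in.
    """
    def gem(rows):
        good, bad = [], []
        for row in rows:
            if any(row.get(c) is None for c in columns_to_check):
                bad.append(row)
            else:
                good.append(row)
        if bad:
            on_violation(bad)
        return good
    return gem

# A developer "fills in the blanks" to get a concrete gem:
violations = []
check_ids = make_null_check_gem(
    columns_to_check=["id", "email"],
    on_violation=violations.extend,  # e.g. route bad rows to quarantine
)

clean = check_ids([
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
])
```

The template author controls the logic and the quality bar; the analyst only ever touches the two blanks.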
Let's see how you can create your own Gem:
Develop a Spark Test in 3 mins!
Everyone struggles to get good test coverage. Good coverage lets you be agile: you can have higher confidence when you run these tests in CI and CD before pushing new code to production. Here is how tests can be written quickly:
- One-click to add a new test with sample input data - You just ran the workflow on a Spark cluster, and we have sample rows of input data per gem on the canvas. For a new test, you just say which columns you want to test and how many rows to include.
- Expected output or predicates - The output currently produced can be captured to test against, and you can then modify a few values to ensure the output is what you want. Alternatively, you can add predicates such as state in [CA, WA, OR].
- The heavy lifting is automated - This quick addition generates test code on git, in the test folder, that can be run by the build/CI/CD system. These tests run every time you commit to master. From the Prophecy UI you can interactively run the tests for one component, or for the entire workflow, and get a human-readable report.
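The generated test code is Prophecy's own; as a simplified illustration of the two test styles described above (hypothetical component, plain Python rows in place of Spark DataFrames), the tests in the test folder look roughly like this:

```python
# Simplified sketch of the two generated test styles (hypothetical code).

def dedupe_by_id(rows):
    # Workflow component under test: keep the first row seen per id.
    seen, out = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

# Sample input captured from an interactive run on the cluster.
SAMPLE_INPUT = [
    {"id": 1, "state": "CA"},
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "WA"},
]

def test_expected_output():
    # Style 1: compare against captured output (edit values as needed).
    assert dedupe_by_id(SAMPLE_INPUT) == [
        {"id": 1, "state": "CA"},
        {"id": 2, "state": "WA"},
    ]

def test_predicates():
    # Style 2: assert predicates, e.g. state in [CA, WA, OR].
    for row in dedupe_by_id(SAMPLE_INPUT):
        assert row["state"] in ("CA", "WA", "OR")

test_expected_output()
test_predicates()
```

Because these are plain test functions in the repo, the same files run under pytest in CI/CD and interactively from the UI.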
Let's see this in action:
Develop a Schedule in 5 mins!
Now that you have developed a workflow, you'll want to run it regularly - perhaps deploy it to run every day at 9am. You'll also want other steps - perhaps run a sensor to wait for a file to show up in S3, send an e-mail on failure, or move some files around after they are processed. Prophecy's low-code Airflow makes all this super easy and fast.
Airflow is the popular open source scheduler - it handles that one must-have use case that no other product does, and this use case is different for every team. It is based on Python and has a good number of active developers enhancing it. Usability is quite a different story: getting it working correctly in production is hard, building a schedule DAG has a steep learning curve, and even then it is quite involved to test. We're fixing this experience based on the following tenets:
- Low-code development - You can visually add gems (Airflow operators) on the canvas and connect them to develop your schedule DAG. For each gem, you just fill out a few fields for details and connections, clearly specifying success actions and failure handling. A configuration builder assists you with details such as when to run, the timezone, and retries. You get clean Airflow code on git as you develop.
- Interactive Runs - With Prophecy you can just click play and run a schedule DAG interactively, without deploying it - much easier than Airflow alone. You can watch each gem run and see logs appear on each as the run proceeds.
- Deploy and Monitor - Deploy is a single click - this not only deploys the schedule, but ensures that up-to-date binaries/jars for all referenced Spark workflows are deployed to Airflow. Any update to a workflow (in the production branch) automatically triggers a rebuild and redeploy of its jars.
You can also monitor the schedules from Prophecy, rarely having to interact with Airflow underneath. You can see all the deployments and all the runs, and filter them.
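The DAG Prophecy generates is its own; as a rough hand-written sketch of the kind of Airflow DAG described above (the DAG id, schedule, jar path, and e-mail address are all made up for illustration, and operator import paths vary by Airflow version), a daily 9am run with an alert on failure might look like:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="daily_sales_workflow",           # hypothetical name
    schedule_interval="0 9 * * *",           # every day at 9am
    start_date=datetime(2021, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Run the Spark workflow; the jar is the binary built from the
    # workflow's git repo (this path is hypothetical).
    run_workflow = BashOperator(
        task_id="run_spark_workflow",
        bash_command="spark-submit --class Main /jars/daily_sales.jar",
    )

    # Alert if the upstream task fails.
    email_on_failure = EmailOperator(
        task_id="email_on_failure",
        to="data-team@example.com",
        subject="daily_sales_workflow failed",
        html_content="Check the Airflow logs.",
        trigger_rule=TriggerRule.ONE_FAILED,
    )

    run_workflow >> email_on_failure
```

Writing even this small DAG by hand means knowing operator imports, cron syntax, and trigger rules - exactly the boilerplate the visual builder fills in for you.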
Let's see the simplest way to schedule:
Search & Lineage for free!
Column-level lineage can track any value, at the column level, through thousands of workflows and datasets - helping solve some challenging problems:
- Debugging and Impact Analysis - Say you get a wrong value in production and want to figure out where the mistake was made. If you're like many of our enterprise customers, at some point your fact tables have 1000+ columns, and the last three workflows that wrote the datasets didn't even touch the column you care about - but you had to dig through workflows in Git or talk to two teams to figure that out. Lineage makes it easy to track the last edit, and the one before that, across workflows and projects.
Or say you're a good citizen about to remove 5 columns from a dataset. You can proactively look at the potential downstream impact, know which workflows might fail due to the change, and work with those teams ahead of time to fix them as well.
- Chase down the PII data and follow XYZ regulation - With GDPR and a myriad of other such regulations, you need to be able to track, for every value, which datasets it propagated to and what restrictions those datasets carry. You can also ensure you're not missing any values when removing information for a particular user who wants to be forgotten.
- Search - If you're looking for a particular column, or the uses of a function, you can just search in Prophecy and get matches across workflows, datasets, columns, and expressions - lighting up your entire system.
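Under the hood, impact analysis is a graph problem. As a toy illustration (not Prophecy's implementation - the datasets, columns, and workflows here are invented), finding everything downstream of a column amounts to traversing a column-dependency graph:

```python
# Toy column-level lineage graph (hypothetical workflows and columns).
# Each edge says: this workflow derives dst column from src column.
EDGES = [
    # (workflow, src "dataset.column", dst "dataset.column")
    ("clean_orders",  "raw.amount",    "orders.amount"),
    ("clean_orders",  "raw.state",     "orders.state"),
    ("daily_rollup",  "orders.amount", "rollup.total"),
    ("weekly_report", "rollup.total",  "report.revenue"),
]

def downstream_impact(column):
    """Return the workflows and columns affected if `column` is removed."""
    impacted_wfs, impacted_cols = set(), set()
    frontier = [column]
    while frontier:
        col = frontier.pop()
        for wf, src, dst in EDGES:
            if src == col and dst not in impacted_cols:
                impacted_cols.add(dst)
                impacted_wfs.add(wf)
                frontier.append(dst)   # keep walking downstream
    return impacted_wfs, impacted_cols
```

Here removing raw.amount would flag all three workflows and the derived columns orders.amount, rollup.total, and report.revenue - the automated version of "talk to two teams to figure this out".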
Let's see column level lineage in action, search is coming soon!
Ok, this is cool, can I try it?
Prophecy is available as a SaaS product: add your Databricks credentials and start using it with Databricks. You can use an Enterprise Trial with Prophecy's Databricks account for a couple of weeks to kick the tires with examples. Or you can do a POC where we install Prophecy in your network (VPC or on-prem) on Kubernetes. Sign up for your free account now!
We're super excited to share our progress with you - get in touch, we're looking to learn!