Spark deserves a better IDE

Prophecy IDE enables visual and code developers to produce high-quality Spark code.

Raj Bains
January 28, 2020

Spark has become the default Data Engineering platform in the cloud. With Databricks, AWS EMR, AWS Glue, Azure Databricks, Google Dataproc and Cloudera, one can rely on Spark being ubiquitously available.

As we work with Enterprises to move legacy ETL to Spark, we’ve been focusing on building the right replacement. We find that the current interfaces fall short, so we are defining a new one.

Legacy Visual ETL

Legacy ETL products are driven by visual drag-and-drop interfaces, and a vast number of Enterprise developers are adept at using them. A visual overview of how the data flows is genuinely helpful, but over time the clicking becomes exhausting.

Now, legacy ETL products claim to support Spark. On the ground, this means developing workflows in a proprietary format that sits in their legacy store and generates unmodifiable, crappy code. There is no longer an appetite for these boxed solutions in the Enterprise.

Spark Code

A lot of technology companies, especially in the Bay Area, choose to write code instead. One can use notebooks, but without ordering and standard structure, there is a consensus that they are no way to write production code.

IDEs give a vast canvas to paint code on, but with that power and flexibility, different teams paint differently. They end up with different ways of structuring code and managing configurations. In the worst case, this means long Spark scripts where it is a nightmare to understand how the data flows, chasing variables across statements. It’s no joy for a production support team trying to find errors under time pressure.
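
As a contrived sketch of the problem (the paths and column names here are hypothetical), consider a script where the same variable is reassigned at every step:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// One long script: the same variable is reassigned at every step, so
// understanding any output column means chasing `df` back through the
// whole file.
object MonolithicJob extends App {
  val spark = SparkSession.builder().appName("monolith").master("local[*]").getOrCreate()
  import spark.implicits._

  var df = spark.read.option("header", "true").csv("/data/orders.csv")
  df = df.filter($"status" === "shipped")
  df = df.withColumn("amount", $"amount".cast("double"))
  df = df.join(spark.read.parquet("/data/customers"), Seq("customer_id"))
  df = df.groupBy($"region").agg(sum($"amount").as("revenue"))
  df.write.mode("overwrite").parquet("/data/revenue") // which columns survived? read everything above
}
```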

Code=Visual

We looked at the various roles in Data Engineering, including Architects, Engineers, QA, and Support, each with different preferences, and thought hard about how to make everyone successful.

We believe the way forward is to use Git as the source of truth, and Visual Graph as a view on the code. Our Code=Visual (code-equals-visual) interface provides:

  1. Instantaneous toggle between the Code and Visual editors.
  2. Edits made in the Visual Editor are visible in the Code Editor, and vice versa.
  3. Inbuilt components such as read, write, join, and reformat, represented as unique dialogs (visual) or as functions (code). In code, these components can be edited freely, including adding comments, as long as the result is structurally equivalent.
  4. User-defined components, which can be individual components or subgraphs, to represent commonly used constructs such as auditing (see the sketch below this list).
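
As an illustration of item 4, here is a minimal sketch of what a user-defined audit component could look like; the names, signature, and audit sink are assumptions for illustration, not Prophecy’s actual generated code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical user-defined component: records a row count for auditing,
// then passes the data through unchanged (DataFrame in, DataFrame out).
object Audit {
  def apply(spark: SparkSession, in: DataFrame, stage: String): DataFrame = {
    in.groupBy()
      .agg(count(lit(1)).as("row_count"))
      .withColumn("stage", lit(stage))
      .withColumn("audited_at", current_timestamp())
      .write.mode("append").parquet("/audit/log") // hypothetical audit sink
    in // pass the data through unchanged
  }
}
```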

Above, you can see that edits made in the Code Editor are instantaneously visible in the Visual Editor. Note that this is standard Spark code, where every component is an object with an apply function that takes a DataFrame in and returns a DataFrame out. For now we focus on the DataFrame API, which is equivalent to Spark SQL.

You can edit the code and add expressions, local variables, newlines, and comments, as long as the AST reduces to the defined sequence of DataFrame operators.
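
For example, a generated component might take a shape like this minimal sketch (the component and column names are illustrative, not Prophecy’s actual output):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Illustrative shape of a generated Reformat component: an object whose
// apply function takes a DataFrame in and returns a DataFrame out.
object ReformatCustomers {
  def apply(spark: SparkSession, in: DataFrame): DataFrame = {
    // A local variable and this comment are legal edits: the AST still
    // reduces to the same sequence of DataFrame operators.
    val fullName = concat_ws(" ", col("first_name"), col("last_name"))
    in.select(
      col("customer_id"),
      fullName.as("full_name"),
      upper(col("country")).as("country")
    )
  }
}
```

Components then compose into a pipeline by chaining the DataFrame from one apply call into the next.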

Below, you’ll see that edits made in the visual editor show up in the code. Also note that the structure and comments previously added in the code editor are preserved.

Visual Edits show instantaneously in Code Editor

As we’ve built this interface, we’re finding that it has much more far-reaching consequences than we had initially thought:

  • All edits are Git commits, and the Git history shows edits by different users.
  • Standardization of components is achieved across the codebase.
  • Standard components give us a unit to unit-test; we can auto-generate high-quality unit tests for existing workflows (see the sketch after this list).
  • We provide a data quality test component, and we can auto-generate high-quality data quality tests.
  • Git commits, tests, and parallel runs (coupled with column-level lineage) bring CI/CD to data engineering, fundamentally changing the agility of teams whose current time to deploy is measured in months.
  • The Visual Editor gives a quick way to understand a workflow, even for coders. This is of course important for QA and Support, who have to come up to speed quickly on someone else’s code.
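
As a sketch of what an auto-generated unit test for the hypothetical ReformatCustomers component above could look like (using ScalaTest here; not Prophecy’s actual generated output):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Because the component is a standard unit (DataFrame in, DataFrame out),
// a generated test only needs sample input rows and expected output rows.
class ReformatCustomersTest extends AnyFunSuite {
  private val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
  import spark.implicits._

  test("reformat builds full_name and upper-cases country") {
    val in = Seq((1, "Ada", "Lovelace", "uk"))
      .toDF("customer_id", "first_name", "last_name", "country")

    val out = ReformatCustomers(spark, in).collect()

    assert(out.length == 1)
    assert(out.head.getAs[String]("full_name") == "Ada Lovelace")
    assert(out.head.getAs[String]("country") == "UK")
  }
}
```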

Edits are instantaneously converted to Git commits with version history

We’re quite excited to share what we’re building and to get feedback so we can build the right interface for Spark. We’ll dive into different areas in follow-up posts:

  • Spark Code=Visual Editor Design
  • Column Level Lineage for Spark
  • Step-by-step Debugger for Spark Development
  • Bringing Continuous Integration to Spark Data Engineering
  • Bringing Continuous Deployment to Spark Data Engineering

As we try to build the best interface, we are sure other engineers have ideas on how we can improve on it, and we’d love to hear from you. Reach out to me at contact.us@prophecy.io with feedback, or request a demo here.

