Spark deserves a better IDE

Author: Raj Bains, Maciej Szpakowski

Spark has become the default Data Engineering platform in the cloud. With Databricks, AWS EMR, AWS Glue, Azure Databricks, Google Dataproc and Cloudera, one can rely on Spark being ubiquitously available.

As we work with Enterprises to move legacy ETL to Spark, we’ve been focusing on building the right replacement. We find that the current interfaces fall short, so we are defining a new one.

Legacy Visual ETL

Legacy ETL products are driven by visual drag-and-drop interfaces, and a vast number of Enterprise developers are adept at using them. It is genuinely nice to get a visual overview of how the data is flowing, but over time the clicking gets exhausting.

Now, legacy ETL products claim to support Spark. On the ground, this means developing workflows in a proprietary format that sits in their legacy store and generates unmodifiable, crappy code. There is no longer an appetite for these boxed solutions in the Enterprise.

Spark Code

A lot of technology companies, especially in the Bay Area, choose to write code instead. One can use notebooks, but without cell ordering or a standard structure, there is a consensus that they are no way to write production code.

IDEs give a vast canvas to paint code on, but with that power and flexibility, different teams paint differently. They end up with different ways of structuring code and managing configuration. In the worst case, this means long Spark scripts where understanding how the data flows is a nightmare of chasing variables across instructions, as in the sketch below. It’s no joy for a production support team trying to find errors under time pressure.
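To make that concrete, here is a hypothetical sketch of the kind of script we mean (paths and column names are invented): intermediate DataFrames get reassigned over and over, so tracing the data flow means reading the file backwards.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MonolithicJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("monolithic-etl").getOrCreate()

    var df = spark.read.parquet("/data/orders")            // hypothetical paths and columns
    val customers = spark.read.parquet("/data/customers")
    df = df.filter(col("status") === "OPEN")
    val joined = df.join(customers, Seq("customer_id"))
    df = joined.withColumn("amount_usd", col("amount") * col("fx_rate"))
    // ...a few hundred lines later, "df" has been reassigned many times, and
    // working out which transformation produced it means reading the file backwards.
    df.write.mode("overwrite").parquet("/out/open_orders")
  }
}
```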

Code=Visual

We looked at the various roles in Data Engineering, including Architects, Engineers, QA and Support, each with different preferences, and thought hard about how to make everyone successful.

Note: we’ve made the GIFs low res — they still take a few seconds to load

Illustrate the switch between Visual and Code

We believe the way forward is to use Git as the source of truth, and Visual Graph as a view on the code. Our Code=Visual (code-equals-visual) interface provides:

  1. Instantaneous toggle between Code and Visual editors
  2. Edits made in the Visual Editor are visible in the Code Editor and vice-versa
  3. Inbuilt components such as read, write, join, and reformat, represented as unique dialogs (visual) or as functions (code). In code, these components can be edited freely, including adding comments, as long as their structure is preserved.
  4. User-defined components, either individual components or subgraphs, that capture commonly used constructs such as auditing (see the sketch after this list).
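As a rough illustration of what these components might look like in code (the names and paths here are ours, not the exact generated code), each component is an object whose apply function takes DataFrames in and returns a DataFrame; a user-defined subgraph such as auditing is just another such object, and wiring them together mirrors the visual graph:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Illustrative component shapes; the generated code will differ in detail.
object ReadOrders {
  def apply(spark: SparkSession): DataFrame =
    spark.read.parquet("/data/orders")                      // hypothetical path
}

object FilterOpen {
  def apply(in: DataFrame): DataFrame =
    in.filter(col("status") === "OPEN")
}

object JoinCustomers {
  def apply(orders: DataFrame, customers: DataFrame): DataFrame =
    orders.join(customers, Seq("customer_id"))
}

// A user-defined subgraph: a commonly used construct such as auditing,
// packaged as one component that passes its input through unchanged.
object AuditRowCount {
  def apply(in: DataFrame, stage: String): DataFrame = {
    in.groupBy().count()
      .withColumn("stage", lit(stage))
      .write.mode("append").parquet("/audit/row_counts")    // hypothetical audit sink
    in
  }
}

// Wiring the components together mirrors the visual graph.
object OpenOrdersPipeline {
  def apply(spark: SparkSession): Unit = {
    val orders    = ReadOrders(spark)
    val customers = spark.read.parquet("/data/customers")   // hypothetical path
    val joined    = JoinCustomers(FilterOpen(orders), customers)
    AuditRowCount(joined, "post_join")
      .write.mode("overwrite").parquet("/out/open_orders")
  }
}
```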
Illustrate the switch between Visual and Code

Above, you can see that edits made in the Code Editor are instantaneously visible in the Visual Editor. Note that this is standard Spark code, where every component is an object with an apply function that takes DataFrames in and returns a DataFrame. For now we focus on the DataFrame API, which is equivalent to Spark SQL.
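To illustrate that equivalence with a standard Spark example (not specific to our product; the dataset and columns are invented), the same transformation can be written with either the DataFrame API or Spark SQL:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("api-equivalence").getOrCreate()
val orders = spark.read.parquet("/data/orders")      // hypothetical dataset

// DataFrame API
val openTotals = orders
  .filter(col("status") === "OPEN")
  .groupBy("customer_id")
  .agg(sum("amount").as("total_amount"))

// Equivalent Spark SQL
orders.createOrReplaceTempView("orders")
val openTotalsSql = spark.sql(
  """SELECT customer_id, SUM(amount) AS total_amount
    |FROM orders
    |WHERE status = 'OPEN'
    |GROUP BY customer_id""".stripMargin)
```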

You can edit the code freely, adding expressions, local variables, newlines, and comments, as long as the AST reduces to the defined sequence of DataFrame operators.
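For example (our own illustration of the idea, with invented names), both of the following forms reduce to the same filter-then-select sequence, so either could round-trip through the Visual Editor:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Form 1: the shape as generated.
object SelectOpenOrders {
  def apply(in: DataFrame): DataFrame =
    in.filter(col("status") === "OPEN").select("order_id", "amount")
}

// Form 2: hand-edited with a comment and a local variable; the AST still
// reduces to the same filter-then-select sequence of DataFrame operators.
object SelectOpenOrdersEdited {
  def apply(in: DataFrame): DataFrame = {
    // Only open orders matter downstream.
    val open = in.filter(col("status") === "OPEN")
    open.select("order_id", "amount")
  }
}
```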

Below, you’ll see that you can make edits in the Visual Editor and they will show up in the code. Also note that the structure and comments previously added in the Code Editor are preserved.

Visual Edits show instantaneously in Code Editor

As we’ve built this interface, we’re finding that its consequences reach much further than we had initially thought:

Edits are instantaneously converted to Git commits with version history

We’re quite excited to share what we’re building and get feedback so we can build the right interface for Spark. We’ll dive into different areas in follow-up posts.

As we try to build the best interface, we are sure other engineers will have ideas on how we can improve it, and we’d love to hear from you. Reach out to me at contact.us@prophecy.io with feedback, or request a demo here.