Spark deserves a better IDE

Authors:
Raj Bains | Maciej Szpakowski

Spark has become the default Data Engineering platform in the cloud. With Databricks, AWS EMR, AWS Glue, Azure Databricks, Google Dataproc and Cloudera, one can rely on Spark being ubiquitously available.

As we work with Enterprises to move legacy ETL to Spark, we’ve been focusing on building the right replacement. We find that the current interfaces fall short, so we are defining a new one.

Legacy Visual ETL

These tools are driven by visual drag-and-drop interfaces, and a vast number of Enterprise developers are adept at using them. It’s genuinely nice to get a visual overview of how the data is flowing, but over time all the clicking gets exhausting.

Legacy ETL products now claim to support Spark. On the ground, this means developing workflows in their proprietary format, which sits in their legacy store and generates low-quality code that you cannot modify. There is no longer an appetite for these boxed solutions in the Enterprise.

Spark Code

A lot of technology companies, especially in the Bay Area, choose to write code instead. One can use notebooks, but without an enforced ordering or a standard structure, the consensus is that they are no way to write production code.

IDEs give a vast canvas to paint code on, but with that power and flexibility, different teams paint differently. They end up with different ways of structuring code and managing configurations. In the worst case, this means long Spark scripts where the only way to understand how the data flows is to chase variables across statements, which is a nightmare. It’s no joy for a production support team to find errors under time pressure.

Code=Visual

We looked at the various roles in Data Engineering, including Architects, Engineers, QA and Support, as well as Engineers with different preferences, and thought hard about how to make everyone successful.

Note: we’ve made the GIFs low res, but they may still take a few seconds to load.

Illustrate the switch between Visual and Code

We believe the way forward is to use Git as the source of truth, and Visual Graph as a view on the code. Our Code=Visual (code-equals-visual) interface provides:

  1. Instantaneous toggle between Code and Visual editors
  2. Edits made in the Visual Editor are visible in the Code Editor and vice-versa
  3. Built-in components are represented as dedicated dialogs (visual) or as functions (code), for example read, write, join and reformat. In code, these components can be edited freely, including adding comments, as long as they remain structurally similar.
  4. Users can add user-defined components, either individual components or subgraphs, to represent commonly used constructs such as auditing.

Code Edits show instantaneously in Visual Editor

Above, you can see that edits made in the Code Editor are instantaneously visible in the Visual Editor. Note that this is standard Spark code, where every component is an object with an apply function whose inputs and outputs are DataFrames. For now we focus on the DataFrame API, which is equivalent to Spark SQL.
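To make the "DataFrame in, DataFrame out" shape concrete, here is a minimal sketch in Python. It is not the code the editor generates. It assumes a plain list of row dictionaries as a stand-in for a Spark DataFrame so that it runs without a cluster, and the names Filter, Reformat and run_pipeline are hypothetical. Python's __call__ plays the role of apply.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "frame" is a list of row dictionaries, standing in for a
# Spark DataFrame so the sketch runs without a cluster.
Row = Dict[str, object]
Frame = List[Row]


@dataclass
class Filter:
    """Hypothetical component: keeps the rows that satisfy a predicate."""
    predicate: Callable[[Row], bool]

    def __call__(self, frame: Frame) -> Frame:  # frame in, frame out
        return [row for row in frame if self.predicate(row)]


@dataclass
class Reformat:
    """Hypothetical component: adds one derived column to every row."""
    column: str
    expression: Callable[[Row], object]

    def __call__(self, frame: Frame) -> Frame:  # frame in, frame out
        return [{**row, self.column: self.expression(row)} for row in frame]


def run_pipeline(frame: Frame, *components) -> Frame:
    """Apply the components in order, one call per node of the graph."""
    for component in components:
        frame = component(frame)
    return frame


orders = [{"id": 1, "amount": 120}, {"id": 2, "amount": 30}]
result = run_pipeline(
    orders,
    Filter(lambda row: row["amount"] > 100),
    Reformat("amount_cents", lambda row: row["amount"] * 100),
)
print(result)
```

Because every component has the same shape, a tool can draw each one as a node and each hand-off between components as an edge, which is the property a visual view of the code relies on.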

You can edit the code, adding expressions, local variables, newlines and comments, as long as the AST reduces to the defined sequence of DataFrame operators.
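One way to picture "reduces to the same sequence of operators" is the simplified sketch below, which uses Python's standard ast module. It is only an illustration under stated assumptions, not how the editor itself works: it compares operator names alone, the OPERATORS set is an assumed vocabulary (the names match PySpark DataFrame methods), and operator_sequence is a hypothetical helper.

```python
import ast
from typing import List

# Assumed operator vocabulary for this sketch.
OPERATORS = {"filter", "select", "join", "withColumn", "groupBy", "agg"}


class OperatorCollector(ast.NodeVisitor):
    """Records DataFrame-style method calls in the order they would execute."""

    def __init__(self) -> None:
        self.sequence: List[str] = []

    def visit_Call(self, node: ast.Call) -> None:
        # Visit children first, so the receiver of a chained call
        # (the filter in df.filter(...).select(...)) is recorded first.
        self.generic_visit(node)
        if isinstance(node.func, ast.Attribute) and node.func.attr in OPERATORS:
            self.sequence.append(node.func.attr)


def operator_sequence(source: str) -> List[str]:
    """Parse source code and return the operator names it applies, in order."""
    collector = OperatorCollector()
    collector.visit(ast.parse(source))
    return collector.sequence


compact = 'result = df.filter(df.amount > 100).select("id", "amount")'

edited = '''
# keep only the large orders
threshold = 100
large = df.filter(df.amount > threshold)

result = large.select("id", "amount")  # project two columns
'''

print(operator_sequence(compact))
print(operator_sequence(edited))
```

Both snippets yield the same operator sequence even though the second one adds a comment, a local variable and blank lines. A real implementation would also have to compare each operator's arguments and how the steps are wired together.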

Below, you’ll see that you can make edits in the visual editor and they will show up in the code. Also note that the structure and comments added earlier in the code editor are preserved.

Visual Edits show instantaneously in Code Editor

As we’ve built this interface, we’re finding that its consequences are much more far-reaching than we had initially thought:

  1. All edits are Git commits, and the Git history shows edits by different users
  2. Standardization of components is achieved across the codebase.
  3. Standard components provide a natural unit for unit testing. We can auto-generate high-quality unit tests for existing workflows.
  4. We provide a data quality test component, for which we can also auto-generate high-quality tests.
  5. Git commits, tests and parallel runs (coupled with column-level lineage) together bring CI/CD to data engineering. This fundamentally changes the agility of teams whose current time to deploy is months.
  6. The Visual Editor gives a quick way to understand the workflow, even for coders. This is of course important for QA and Support, who have to quickly come up to speed on someone else’s code.
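To illustrate why a standard component makes a convenient unit to test, here is a small hand-written sketch using Python's unittest. It is not an auto-generated test. It assumes rows are plain dictionaries standing in for a DataFrame, and drop_cancelled and null_fraction are hypothetical names for a component and a data quality metric.

```python
import unittest


def drop_cancelled(rows):
    """Hypothetical component: rows in, rows out (stand-in for a DataFrame)."""
    return [row for row in rows if row["status"] != "cancelled"]


def null_fraction(rows, column):
    """Hypothetical data quality metric: share of rows where column is None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


class DropCancelledTest(unittest.TestCase):
    def setUp(self):
        # Small fixed input; the component is tested in isolation.
        self.rows = [
            {"id": 1, "status": "shipped"},
            {"id": 2, "status": "cancelled"},
            {"id": 3, "status": "open"},
        ]

    def test_removes_only_cancelled_rows(self):
        output = drop_cancelled(self.rows)
        self.assertEqual([row["id"] for row in output], [1, 3])

    def test_output_has_no_null_ids(self):
        # Data quality check expressed as an assertion on the output.
        output = drop_cancelled(self.rows)
        self.assertEqual(null_fraction(output, "id"), 0.0)
```

Because the component only depends on its input, the test needs nothing but a few fixed rows, and the same pattern can be repeated for every component in a workflow.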

We’re quite excited to share what we’re building and get feedback so we can build the right interface for Spark. We’ll dive into different areas in follow up posts:

  • Spark Code=Visual Editor Design
  • Column Level Lineage for Spark
  • Step-by-step Debugger for Spark Development
  • Bringing Continuous Integration to Spark Data Engineering
  • Bringing Continuous Deployment to Spark Data Engineering

As we try to build the best interface, we are sure other engineers have ideas on how we can improve it, and we’d love to hear from you. Reach out to me at raj.bains@prophecy.io with feedback, or request a demo here.