Data Engineering Battle: Python vs SQL++ vs Visual=Code
There is a debate underway about what the right programming model for Data Engineering is. There are three approaches - Python, SQL++ and Visual=Code. The Visual=Code is a new approach that Prophecy is working on to address the challenges we are seeing in the field, but there is no consensus on the right approach.
In this blog, we'll articulate the essential complexity of operations we see in Data Engineering, and what each approach is best suited for. By the end of this blog, you will have a structured framework to articulate what approach is best for you team (where you might already start with implicit understanding of the concepts). The following articulates the various groups of users and operations that we commonly see with our customers.
Essential Complexity of Data Engineering
Data Engineering or ETL has an essential complexity that includes some SQL operations and some non-SQL operations. Here are some common operations that for the basics of Data Engineering operations
SQL operations form the backbone of Data Engineering operations whether you're writing code in SQL, writing DataFrame code in Python or doing Visual Dataflow programming.
- Load Operations: These are the operations to load data into a table. Examples here are INSERT, UPDATE, MERGE
- Common Transforms: These are the most frequently used transforms and form the bulk for data processing and include SCAN (Read), FILTER, JOIN, SORT, GROUP BY, ORDER BY
- More Complex Operations While frequent, these are not as frequent as the common transforms and are used more in analytics and reporting - PIVOT, ROLLUP, CUBE, WINDOW FUNCTIONS
SQL is a good solution that everyone can use - but there are many operations that are common in data engineering but are not covered by pure SQL. Also, as complexity increases SQL is harder and harder to understand and maintain.
SQL starts to get complex quite fast. There are CTAS, Table Functions, Correlated Subqueries - but let's start with an operation that is quite common - a standard SCD2 merge:
SCD2 merge is a slowly changing dimension merge where the operational database has a field such as an address that changes infrequently, so in the analytical database you keep of history of various addresses and the dates (from-date and to-date) capturing the period when this entry was active, along with flags to mark the first and last row in a chain. This can be same for analytics on how long a home delivery order was in ordered, or en-route.
Following is the example code for it. This is clearly SQL that should never be hand-written. This example uses the Dataframe API but can be written as a SQL string. It shows an example where SQL is too low an abstraction.
While consensus view is that these operations should be generated, there are multiple ways to generate it - code generators, macros and functions. The SQL++ approach of DBT gives some basic constructs (macros) to try to handle these operations (datespine, snapshots for scd2). DBT also brings software engineering practices to SQL and is being appreciated by the users for this.
Programming Language Constructs
Now, there are many operations in Data Engineering for which SQL is not the right abstraction and you must use a programming language instead. There are a few use cases here. Our customers need to perform operations that must be performed per row and across rows. Here are some example operations
- Data Quality Library - including computing statistics every day and comparing changes in patterns across days.
- Lookup from a REST service (too expensive per row - so done per partition). Also, Lookup a set of values and loop through them to find the right one.
- Encryption, Decryption of certain columns with sensitive data.
- Writing to Elastic Search, writing to Athena
SQL has always accepted that it is not the right paradigm for these operations and provides with numerous mechanisms to call non-SQL code such as User Defined Functions, User Defined Aggregate Functions, and Table Functions which support the full spectrum of use cases from most granular scope calling outside code per row - to passing the entire table out to code and accepting a new table back in.
Writing code in Python can capture these use cases, but only a small subset of users in an organization can produce high quality and standardized code and the productivity is always low.
Common Pattern Templates
Templates can encode common series of patterns - standardizing practices for various parts of the ecosystem. We've seen standard ingestion templates for pipelines from multiple similar source systems that include best practices such as auditing that correct numbers of rows were output that are required in financial environments.
Enabling All Users with All Operations!
As you can see with the previous approaches - either many users are left out, or many use cases are, greatly limiting what can be achieved.
At Prophecy we have been thinking from scratch what might be the best approach to handle all of the data engineering operations and enable all users at the same time. Here is our approach:
Use Gems: Visual with SQL Expressions
All users must be enabled to use all kinds of transforms and be enabled to build any kind of data engineering workflows, so we've created an interface where all the usage is in SQL - but your operations generate a mix of SQL and non-SQL code depending on the operations.
Build Gems: Code Templates with UI
In the team, you can have a few Gem Builders (or you can ask Prophecy for it). You can write code that you want generated for any particular operation by writing sample code and specifying what information the user of these gems should fill out. As your users develop gems - high quality code is being generated on git. Here is a quick preview of Gem Builder:
Visual=Code: Putting it all together
Now, when you put these two personas together - the Gem Builders and the Gem Users, you have your entire team enabled to perform all the operations you need. Also, all users can build these data pipelines and everyone is developing high quality code on Git.
There are multiple approaches to Data Engineering, and as different startups are looking at the problem, they're coming up with the approaches they think are best suited to solve them, toiling away to make the lives of Data Engineers better.
We have shared here the framework that we used to figure out the best approach to enable most users with all the common elements we find in Data Engineering. We look forward to tremendous innovation over the next 3-5 years to make Data Engineering more accessible and reducing the toil required for it.
Prophecy is available as a SaaS product where you can add your Databricks credentials and start using it with Databricks. You can use an Enterprise Trial with Prophecy's Databricks account for a couple of weeks to kick the tires with examples. Or you can do a POC where we will install Prophecy in your network (VPC or on-prem) on kubernetes. Sign up for you account now:Sign up for your free Account!