ELT is not the disruption - Data Engineering is!

Author: Raj Bains

The disruption is agile software practices in Data Engineering made usable for the many. Allow me to explain:

Prophecy is focused on Enterprise Data Engineering. We see a wide gap between what enterprise customers need and what the startup & VC ecosystem is excited about. There is irrational exuberance again, this time about bottom-up ELT taking over the world. Having been through the NoSQL and Hadoop waves before, I want to offer a cautionary note and explain what’s actually going on.

For context, I was the product manager of Apache Hive at Hortonworks through its IPO and talked to 100+ Enterprises using SQL for ETL. As Prophecy CEO, I’ve talked to 100+ Enterprises using various ETL products. I’ve seen my share of inscrutable SQL scripts (try getting performance with complex transforms), and equally gnarly data processing code. Technically, I’m an expert in Compilers, PL and Databases.

Data Engineering!

Data Processing Landscape

These are interesting times in the data space. Spark and Snowflake are well established and are converging on the same feature set that Data Engineering requires today.

Define ELT & ETL

Ab Initio, for example, is an excellent ETL product used in large Enterprises, and every user I have talked to loves it.

Omg! Omg! ELT taking over the world - bottom-up disruption!!

We’re cheering for dbt as a fellow startup/product adding great value. It provides SQL with agile software development practices. You load all your raw data directly into the data warehouse and do the transforms there. For this to work, the raw data has to be small, with no machine learning, no complex data and no complex transforms - not the world we live in.
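To make the ELT pattern concrete, here is a minimal sketch of "load raw, then transform in the warehouse". It is an illustration under stated assumptions, not how dbt or any warehouse is implemented: an in-memory SQLite database stands in for the warehouse, and the table names (raw_orders, customer_revenue) and sample rows are made up.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the data warehouse.
conn = sqlite3.connect(":memory:")

# "L" step: land the raw records untransformed in a staging table.
# Table and column names are hypothetical.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER)"
)
raw_rows = [
    (1, "acme", 1250),
    (2, "acme", 750),
    (3, "globex", 3000),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# "T" step: the transform runs inside the warehouse, in SQL only.
conn.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    GROUP BY customer
    """
)

result = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
print(result)
```

This works nicely while the transform is a simple aggregate over small, tabular data. The argument here is about what happens when it is not.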

The reverse is happening

To succeed in data engineering for Enterprises with massive data sets and complex use cases - complex data, complex transforms and machine learning - Snowflake is moving beyond SQL with Snowpark, and the tooling will follow. SQL-only is a losing battle in data engineering outside of simple use cases. Snowflake is now building a closed-source Spark.

Data Engineering: The Real Disruption!

Basically, ETL/ELT has lagged behind software engineering in adopting agile development techniques.

Data Engineering is the move to code-first development with agile practices - git, tests, continuous integration and continuous deployment. It’s bringing data pipelines out of closed-source, boxed software products and into mainstream development.
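As a rough sketch of what code-first means in practice: a transform written as a plain function lives in git and ships with a unit test that a CI job can run on every commit. The function name dedupe_latest, the field names and the sample records are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, List


def dedupe_latest(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the most recent record per customer_id.

    Assumes updated_at is an ISO-8601 date string, so plain string
    comparison orders the records chronologically.
    """
    latest: Dict[str, Dict[str, str]] = {}
    for rec in records:
        key = rec["customer_id"]
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    # Sort for a deterministic output order.
    return sorted(latest.values(), key=lambda r: r["customer_id"])


def test_dedupe_latest() -> None:
    """Unit test a CI job could run before the pipeline is deployed."""
    rows = [
        {"customer_id": "c1", "updated_at": "2021-01-01", "plan": "free"},
        {"customer_id": "c1", "updated_at": "2021-03-01", "plan": "pro"},
        {"customer_id": "c2", "updated_at": "2021-02-01", "plan": "free"},
    ]
    out = dedupe_latest(rows)
    assert [r["customer_id"] for r in out] == ["c1", "c2"]
    assert out[0]["plan"] == "pro"


if __name__ == "__main__":
    test_dedupe_latest()
    print("ok")
```

The point is not this particular function. Once the transform is ordinary code, review, testing and deployment work the same way they do for any other software.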

Now we have to solve the key Data Engineering problem - Usability. Code is too hard for many users, whereas Visual or SQL development makes pipelines more accessible. Here are the solutions:

That’s pretty much it! The disruption is agile software practices in Data Engineering made usable for the many.