ELT is not the disruption - Data Engineering is!
The disruption is agile software practices in Data Engineering made usable for the many. Allow me to explain:
Prophecy is focused on Enterprise Data Engineering. We see a wide gap between what enterprise customers need and the startup & VC ecosystem, which - with irrational exuberance - now believes bottom-up ELT is taking over the world. Having been through the NoSQL and Hadoop waves before, I want to sound a cautionary note and explain what’s actually going on.
For context, I was the product manager of Apache Hive at Hortonworks through its IPO and talked to 100+ Enterprises using SQL for ETL. As Prophecy CEO, I’ve talked to 100+ Enterprises using various ETL products. I’ve seen my share of inscrutable SQL scripts (try getting performance out of complex transforms), and equally gnarly data processing code. Technically, I’m an expert in compilers, programming languages, and databases.
Data Processing Landscape
These are interesting times in the data space: Spark and Snowflake are well established and converging toward the same feature set that Data Engineering requires today.
- SQL & Transactions are a must-have: SQL is declarative and productive. Transactions, change data capture, and merges are essential.
- Tabular Data & SQL is not enough: Data Engineering requires far more than tabular data - there is JSON, documents, and images - and one needs non-SQL transforms for data engineering and machine learning.
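To make the non-tabular point concrete, here is a minimal, illustrative Python sketch (all names and fields are hypothetical) of the kind of transform that is awkward in pure SQL: exploding a nested JSON document into flat rows before it can be queried as a table.

```python
import json

def flatten_order(doc: dict) -> list[dict]:
    """Explode one nested order document into one flat row per line item."""
    rows = []
    for item in doc.get("items", []):
        rows.append({
            "order_id": doc["order_id"],
            "customer": doc["customer"]["name"],
            "sku": item["sku"],
            "quantity": item["qty"],
        })
    return rows

# One raw document as it might arrive from an API or event stream.
raw = json.loads("""
{"order_id": 7, "customer": {"name": "Acme"},
 "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 5}]}
""")
rows = flatten_order(raw)  # two flat rows, ready to load as a table
```

In code this is a ten-line function; expressing the same unnesting, plus any machine-learning feature logic downstream, purely in warehouse SQL gets painful fast.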
Define ELT & ETL
- ELT means that you Load all your data into the data warehouse in tabular format, then use a set of SQL queries to Transform it, finally merging the results into your target tables.
- ETL means a processing engine does the Transforms and then Loads data into the data warehouse as the final step. The processing engine can be very powerful: you can write code, and you get code versioning, configurations, resolved configs, and rules engines. There are SQL operators for productivity as well.
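To make the ELT pattern concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for a warehouse (table names are illustrative; the upsert syntax shown requires SQLite 3.24+, and real warehouses would use MERGE): raw data is loaded untransformed, then a SQL transform aggregates it and merges into a target table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Load: raw data lands in the warehouse untransformed.
cur.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# Target table with a primary key so the merge has something to conflict on.
cur.execute("CREATE TABLE user_totals (user_id INTEGER PRIMARY KEY, total REAL)")
cur.execute("INSERT INTO user_totals VALUES (2, 100.0)")  # pre-existing row

# Transform + merge: aggregate in SQL, then upsert into the target table.
cur.execute("""
    INSERT INTO user_totals (user_id, total)
    SELECT user_id, SUM(amount) FROM raw_events GROUP BY user_id
    ON CONFLICT(user_id) DO UPDATE SET total = total + excluded.total
""")
totals = dict(cur.execute(
    "SELECT user_id, total FROM user_totals ORDER BY user_id"))
# totals: user 1 gets a new row (15.0), user 2's row is updated (107.5)
```

The whole pipeline is SQL against tables already in the warehouse - which is exactly why the pattern breaks down once the inputs are not tabular or the transforms are not expressible in SQL.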
AbInitio, for example, is an excellent ETL product in large Enterprises; every user I have talked to loves the product.
Omg! Omg! ELT is taking over the world - bottom-up disruption!!
We’re cheering for dbt as a fellow startup/product adding great value. It provides SQL with agile software development practices: you load all your raw data directly into the data warehouse and do transforms there. For this to work, the raw data must be small, with no machine learning, no complex data, and no complex transforms - not the world we live in.
The reverse is happening
To succeed in data engineering for Enterprises with massive data sets and complex use cases - complex data, complex transforms, and machine learning - Snowflake is moving beyond SQL with Snowpark, and the tooling will follow. SQL-only is a losing battle in data engineering outside of simple use cases. Snowflake is now building a closed-source Spark.
Data Engineering: The Real Disruption!
Basically, ETL/ELT has lagged software engineering in agile development techniques.
Data Engineering is the move to code-first development, with agile practices: git, tests, continuous integration, and continuous deployment. It’s bringing data pipelines out of closed-source, boxed software products into mainstream development.
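As a sketch of what code-first development with tests looks like in practice, here is a hypothetical pipeline step written as a plain Python function, alongside the kind of unit test that CI would run on every commit (names and data are illustrative):

```python
def dedupe_latest(records: list[dict]) -> list[dict]:
    """Keep only the most recent record per id - a common pipeline transform."""
    latest: dict = {}
    for rec in records:
        key = rec["id"]
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())

# The kind of check a CI pipeline runs on every commit.
def test_dedupe_latest():
    records = [
        {"id": 1, "updated_at": "2024-01-01", "v": "old"},
        {"id": 1, "updated_at": "2024-02-01", "v": "new"},
        {"id": 2, "updated_at": "2024-01-15", "v": "only"},
    ]
    result = {r["id"]: r["v"] for r in dedupe_latest(records)}
    assert result == {1: "new", 2: "only"}

test_dedupe_latest()
```

Because the transform is ordinary code in git, it gets reviewed, versioned, and tested like any other software - the opposite of logic locked inside a boxed ETL tool.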
Now we have to solve the key Data Engineering problem - usability. Code is too hard for many users, whereas visual or SQL development makes it more accessible. Here are the solutions:
- dbt is gaining traction in startups with its SQL editor and agile practices.
- Prophecy is gaining traction in the Enterprise with a unique IDE that offers both visual and code development, with SQL, Scala, and Python support. All users can develop high-quality Spark code with agile practices.
- Prophecy also provides metadata, lineage, observability, performance debugging, scheduling - features critical to the Enterprise.
That’s pretty much it! The disruption is agile software practices in Data Engineering made usable for the many.