Apache Spark™ Vs. Snowflake: The Cloud Data Engineering (ETL) Debate!
Data Integration is a critical engineering system in all Enterprises. Initially, it started with ad hoc scripts, which got replaced by Visual ETL tools such as Informatica, AbInitio, DataStage, and Talend. To cope with an explosion in data, consumer companies such as Google, Yahoo, and LinkedIn developed new data engineering systems based on commodity hardware. The usability of these systems was quite low, and the developer needed to be much more aware of the performance. Apache Spark has broken through from this clutter with thoughtful interfaces and product innovation, while Hadoop has effectively gotten disaggregated in the cloud and become a legacy technology.
Now, as Enterprises transition to the cloud, often they are developing expertise in the cloud ecosystem at the same time as trying to make decisions on the product and technology stack they are going to use.
In the rest of the blog, we'll take a look at the two primary processing paradigms for data integration, and their cloud equivalents.
What is Data Integration (or ETL)
Data Integration is your Data Factory. It reads data from various input sources such as Relational Databases, Flat Files, and Streaming. It then does various transformations on the data such as joining and de-duplicating data, standardizing formats, pivoting, and aggregating. Once the data is ready for analytics (such as in star schemas), it is stored or loaded into the target which is typically a Data Warehouse or a Data Lake.
The Two On-Premises Execution Paradigms
For most large Enterprises and companies rich in data, one server will be insufficient to execute the workloads, and thus, parallel processing is required. For this, there have historically been two primary methods:
- ETL Execution Engine Processing - here the ETL tool comes with a distributed high performance execution engine. Most of the processing happens in this execution engine, and after the data is ready for analytics, it is loaded into a data warehouse. AbInitio is a good example and is the market leader in performance.
- Data Warehouse Pushdown Processing - here the ETL tool comes with a single server execution engine. Since it cannot do high volume processing, it provides pushdown processing that pushes computations down to the Data Warehouse and leverages the distributed processing engine there. In the field, we see Informatica commonly deployed with Teradata this way, though Informatica has a PowerCenter Grid product as well.
Which Architecture is Better?
One natural question to ask is - whether one of these paradigms is preferable? The Answer is Yes!
The case for data warehouse ETL execution is that it reduces one system - ETL execution and data warehouse execution will both happen in Teradata. Also, most data warehouses are typically high-quality products. However, it's an expensive approach and not the right architectural fit. Data warehouses have an architectural focus on low latency since there is often a human analyst waiting for her BI query. For this, they collect high-quality statistics for query planning and have sophisticated caching mechanisms. This is not a great fit for ETL workloads where throughput is the most important factor, and there is no reuse, making caches and statistics useless. Often we've found that 70% of Teradata capacity was dedicated to ETL in Enterprises, and that is what got offloaded to Apache Hive.
On the other hand, high-quality parallel processing products, exemplified by AbInitio are perhaps the best solution - both in inherent processing cost and performance. Most users of AbInitio loved the product, but the high licensing cost has removed any architectural cost advantages they had and made them available to a very few of the largest Enterprises.
Cloud, with usage based pricing, is a great equalizer, let's look at how cloud is changing this equation...
Cloud Transition - the two ETL Architectures
There are two primary approaches to choose for your ETL or Data Engineering
- Data Warehouse ETL Approach: This is an as-is migration of the on-premises approach, done in a cloud context. An example here, one can use Snowflake as the data warehouse instead of Teradata on-premises. Then you can use any ETL tool such as Informatica or Matillion on top and it will push down queries to Snowflake that will do the heavy lifting. If you have small datasets, this works. As discussed above, for large datasets and complex transformations this architecture is far from ideal. This is far from the world of open-source code on Git & CI/CD that data engineering offers - again locking you into proprietary formats, and archaic development processes.
- Data Engineering Approach: Data Engineering based on Spark for the execution layer, merges the best of the previous generation in high performance, with the best of large scale commodity processing from consumer companies - such as Hadoop. If you use Databricks, it adds transactions from Data Warehouses via delta lake providing the best product in the cloud by a large margin. A product such as Prophecy adds the remaining functionality - code and visual drag-and-drop editing that generates code on Git, Metadata with lineage, Scheduling, and CI/CD, providing a complete stack that will free you from proprietary formats.
The following image is how the Cloud Data Engineering architecture looks. The data from on-premise operational systems lands inside the data lake, as does the data from streaming sources and other cloud services. Prophecy with Spark runs data engineering or ETL workflows, writing data into a data warehouse or data lake for consumption.
Reports, Machine Learning, and a majority of analytics can run directly from your Cloud Data Lake, saving you a lot of costs and making it the single system of record. For particular BI use cases (fast interactive queries), Data Marts can be created on Snowflake or another Cloud Data Warehouse such as Redshift, BigQuery, or Azure SQL.
How to Choose?
If you're moving you ETL to Data Engineering, you're deciding what your architecture for the next decade or more.
We recommend moving to Apache Spark and a product such as Prophecy. Apart from exceeding the capabilities of the Snowflake based stack at a much cheaper price point, this prevents you from getting locked into proprietary formats. You will also be able to deliver new analytics faster by embracing Git and continuous integration and continuous deployment - that is equally accessible to the Spark coders as well as the Visual ETL developers who have a lot of domain knowledge.