Apache Spark™ Vs. Snowflake: The Cloud Data Engineering (ETL) Debate!

Authors: Raj Bains, Saurabh Sharma

Data Integration is a critical engineering system in all Enterprises. Initially, it started with ad hoc scripts, which got replaced by Visual ETL tools such as Informatica, AbInitio, DataStage, and Talend. To cope with an explosion in data, consumer companies such as Google, Yahoo, and LinkedIn developed new data engineering systems based on commodity hardware. The usability of these systems was quite low, and the developer needed to be much more aware of the performance. Apache Spark has broken through from this clutter with thoughtful interfaces and product innovation, while Hadoop has effectively gotten disaggregated in the cloud and become a legacy technology.

Now, as Enterprises transition to the cloud, often they are developing expertise in the cloud ecosystem at the same time as trying to make decisions on the product and technology stack they are going to use.

In the rest of the blog, we'll take a look at the two primary processing paradigms for data integration, and their cloud equivalents.

What is Data Integration (or ETL)

Data Integration is your Data Factory. It reads data from various input sources such as Relational Databases, Flat Files, and Streaming. It then does various transformations on the data such as joining and de-duplicating data, standardizing formats, pivoting, and aggregating. Once the data is ready for analytics (such as in star schemas), it is stored or loaded into the target which is typically a Data Warehouse or a Data Lake.

The Two On-Premises Execution Paradigms

For most large Enterprises and companies rich in data,  one server will be insufficient to execute the workloads, and thus, parallel processing is required. For this, there have historically been two primary methods:

Two on-premises ETL Execution paradigms

Which Architecture is Better?

One natural question to ask is - whether one of these paradigms is preferable? The Answer is Yes!

The case for data warehouse ETL execution is that it reduces one system - ETL execution and data warehouse execution will both happen in Teradata. Also, most data warehouses are typically high-quality products. However, it's an expensive approach and not the right architectural fit. Data warehouses have an architectural focus on low latency since there is often a human analyst waiting for her BI query. For this, they collect high-quality statistics for query planning and have sophisticated caching mechanisms. This is not a great fit for ETL workloads where throughput is the most important factor, and there is no reuse, making caches and statistics useless. Often we've found that 70% of Teradata capacity was dedicated to ETL in Enterprises, and that is what got offloaded to Apache Hive.

On the other hand, high-quality parallel processing products, exemplified by AbInitio are perhaps the best solution - both in inherent processing cost and performance. Most users of AbInitio loved the product, but the high licensing cost has removed any architectural cost advantages they had and made them available to a very few of the largest Enterprises.

Cloud, with usage based pricing, is a great equalizer, let's look at how cloud is changing this equation...

Cloud Transition - the two ETL Architectures

There are two primary approaches to choose for your ETL or Data Engineering

The following image is how the Cloud Data Engineering architecture looks. The data from on-premise operational systems lands inside the data lake, as does the data from streaming sources and other cloud services. Prophecy with Spark runs data engineering or ETL workflows, writing data into a data warehouse or data lake for consumption.

Reports, Machine Learning, and a majority of analytics can run directly from your Cloud Data Lake, saving you a lot of costs and making it the single system of record. For particular BI use cases (fast interactive queries), Data Marts can be created on Snowflake or another Cloud Data Warehouse such as Redshift, BigQuery, or Azure SQL.

Cloud Data Engineering Architecture

How to Choose?

If you're moving you ETL to Data Engineering, you're deciding what your architecture for the next decade or more.

We recommend moving to Apache Spark and a product such as Prophecy. Apart from exceeding the capabilities of the Snowflake based stack at a much cheaper price point, this prevents you from getting locked into proprietary formats. You will also be able to deliver new analytics faster by embracing Git and continuous integration and continuous deployment - that is equally accessible to the Spark coders as well as the Visual ETL developers who have a lot of domain knowledge.