Announcing an exciting new feature – Data Quality Pushdown for Snowflake. This beta feature aims to deliver faster, easier time to value for data quality users who also run cloud databases. Cloud-native vendors now support workloads that scale to hundreds of concurrent jobs, with auto-scaling and related capabilities. One reason this is possible is that cloud-native databases offer more user-defined functions (UDFs) and more machine learning (ML) capabilities than ever before. Collibra has leveraged this growth to build a best-of-breed data quality and observability pushdown solution.
Running a DQ job without a pushdown option
When you run a DQ job without the pushdown option, you define certain parameters, such as the columns and the row range you want to check. You also define ML layers, such as Outliers or Patterns.
All of this work requires processing, and an Apache Spark compute engine performs it. The entire dataset defined by your parameters is read into Spark, which has its own memory, CPUs, and compute resources. Spark reads the data, then partitions and sorts it to execute the query. After that, it writes the data out and runs further processing to produce Outliers and Patterns.
The source data sits inside a database and is read out of it. All of the processing for your requirements takes place in Spark, and Spark then writes all of the results into the DQ Metastore.
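To make the data movement concrete, here is a minimal sketch of that flow in PySpark, assuming a hypothetical JDBC source table, metrics, and DQ Metastore connection; the names are illustrative only, not Collibra's actual implementation.

```python
# Minimal sketch of the non-pushdown flow (hypothetical names throughout).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-job-spark").getOrCreate()

# 1. Read the entire dataset defined by the job parameters into Spark.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")   # hypothetical source
    .option("dbtable", "public.orders")
    .option("user", "dq_reader")
    .option("password", "***")
    .load()
)

# 2. Spark partitions, sorts, and aggregates to compute example profile metrics.
profile = orders.select(
    F.count("*").alias("row_count"),
    F.sum(F.col("order_total").isNull().cast("int")).alias("order_total_nulls"),
)

# 3. The results are written into the DQ Metastore.
(
    profile.write.format("jdbc")
    .option("url", "jdbc:postgresql://dq-metastore:5432/dq")   # hypothetical metastore
    .option("dbtable", "dq_results.profile_metrics")
    .option("user", "dq_writer")
    .option("password", "***")
    .mode("append")
    .save()
)
```

Note that every row of the source table crosses the network into the Spark cluster before any metric is computed, which is exactly the movement that pushdown avoids.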
What is Data Quality Pushdown for Snowflake?
Before explaining Pushdown for Snowflake, let’s look at what Snowflake is. It is a best-of-breed cloud-native data platform. The Snowflake data platform is not built on any existing database technology or ‘big data’ software platforms, such as Hadoop. Instead, Snowflake combines a completely new SQL query engine with an innovative architecture natively designed for the cloud. To the user, Snowflake provides all of the functionality of an enterprise analytic database, along with many additional special features and unique capabilities.
In the Pushdown model, the Collibra DQ Agent that creates the Apache Spark DQ Job is no longer needed. Instead, pushdown uses the database engine itself to do this work: the processing is sent to Snowflake's own compute, so far less data physically moves.
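The sketch below illustrates the pushdown idea using the snowflake-connector-python package. The account, warehouse, table, and metric names are hypothetical, and the profiling SQL is an example of the kind of query that gets pushed down, not the SQL Collibra DQ actually generates.

```python
# Pushdown sketch: the scan and aggregation run inside Snowflake; only the
# small result set of metrics leaves the warehouse. Names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # hypothetical account
    user="DQ_SERVICE_USER",      # hypothetical service account user
    password="***",
    warehouse="DQ_WH",           # dedicated warehouse for DQ Job runs
    database="SALES",
    schema="PUBLIC",
)

profile_sql = """
    SELECT
        COUNT(*)                          AS row_count,
        COUNT_IF(ORDER_TOTAL IS NULL)     AS order_total_nulls,
        COUNT(DISTINCT CUSTOMER_ID)       AS customer_id_distinct
    FROM ORDERS
"""

cur = conn.cursor()
try:
    cur.execute(profile_sql)
    row_count, null_count, distinct_customers = cur.fetchone()
    print(row_count, null_count, distinct_customers)
finally:
    cur.close()
    conn.close()
```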
Why do we need Pushdown for Snowflake?
Pushdown is an alternative compute option for running a DQ Job, where all of the data quality processing is submitted to the target data warehouse. To use pushdown, you run a setup script, provided by Collibra, that creates a dedicated Snowflake Virtual Warehouse and a service account user for DQ Job runs. This service account user needs read access on all schemas that contain the target data.
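The sketch below shows the kinds of Snowflake objects and grants such a setup involves. It is illustrative only, assuming ACCOUNTADMIN-style privileges and hypothetical object names; it is not the Collibra-provided setup script.

```python
# Illustrative setup sketch only (NOT the actual Collibra script).
import snowflake.connector

admin = snowflake.connector.connect(
    account="my_account", user="ADMIN_USER", password="***", role="ACCOUNTADMIN",
)

setup_statements = [
    # Dedicated virtual warehouse for DQ Job runs, with auto-suspend/resume.
    """CREATE WAREHOUSE IF NOT EXISTS DQ_WH
         WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE""",
    # Service account user and role for DQ Job runs.
    "CREATE ROLE IF NOT EXISTS DQ_ROLE",
    "CREATE USER IF NOT EXISTS DQ_SERVICE_USER PASSWORD = '***' DEFAULT_ROLE = DQ_ROLE",
    "GRANT ROLE DQ_ROLE TO USER DQ_SERVICE_USER",
    "GRANT USAGE ON WAREHOUSE DQ_WH TO ROLE DQ_ROLE",
    # Read access on the schemas that hold the target data.
    "GRANT USAGE ON DATABASE SALES TO ROLE DQ_ROLE",
    "GRANT USAGE ON SCHEMA SALES.PUBLIC TO ROLE DQ_ROLE",
    "GRANT SELECT ON ALL TABLES IN SCHEMA SALES.PUBLIC TO ROLE DQ_ROLE",
]

cur = admin.cursor()
for stmt in setup_statements:
    cur.execute(stmt)
cur.close()
admin.close()
```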
A few more points explain why Snowflake Pushdown is a better alternative.
- Compute resources: When a DQ Job runs in Snowflake pushdown mode, you can take advantage of the Snowflake architecture, which means scale is not limited by a fixed cluster. When demand grows, warehouse nodes can scale out automatically and then scale back down again as required.
- Ephemeral bursting: Much of the processing on Snowflake can "burst" to 64 or 128 nodes. A large DQ Job working on millions of rows and hundreds of columns can trigger this bursting, and after the DQ Job completes the system scales back down (see the sketch after this list). This elasticity is the advantage of the SaaS (Software as a Service) model over static hardware.
- Data privacy: With Snowflake Pushdown, your customer data is never read out of the Snowflake environment, which helps with privacy regulation compliance and information security assurance.
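As a small sketch of the scaling knobs involved, the statements below show standard Snowflake warehouse settings. The warehouse name, size, and cluster counts are hypothetical examples, not Collibra defaults, and multi-cluster warehouses require Snowflake Enterprise Edition.

```python
# Scaling sketch with hypothetical names and example Snowflake settings.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="ADMIN_USER", password="***", role="ACCOUNTADMIN",
)
cur = conn.cursor()

# Resize the warehouse for a very large DQ Job (a 4X-Large warehouse
# provides 128 nodes per cluster) ...
cur.execute("ALTER WAREHOUSE DQ_WH SET WAREHOUSE_SIZE = 'X4LARGE'")

# ... or let a multi-cluster warehouse scale out and back in automatically
# as concurrent demand rises and falls.
cur.execute(
    "ALTER WAREHOUSE DQ_WH SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4 "
    "SCALING_POLICY = 'STANDARD'"
)

cur.close()
conn.close()
```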
So what exactly will Data Quality Pushdown for Snowflake do for our customers? It auto-generates SQL queries to offload the DQ compute to the data source, reducing the amount of data transferred and removing the Apache Spark stage of the DQ Job.
In summary
Collibra Data Quality Pushdown for Snowflake (in Beta) unlocks significant savings for customers through lower TCO, lower management costs, higher efficiency, and improved on-demand scaling. It eliminates the need for a separate Apache Spark compute platform to run Collibra Data Quality & Observability.