We are thrilled to announce the beta of Collibra Data Quality Pushdown for Databricks. As a key new feature of Collibra Data Quality & Observability, it can significantly accelerate data quality time-to-value for cloud database users.
Cloud-native vendors now successfully support workloads that scale to hundreds of concurrent jobs. Cloud-native databases can handle these large workloads thanks to their auto-scaling, user-defined function (UDF), and machine learning (ML) capabilities.
Collibra harnesses this evolution of cloud-native databases to introduce a top-notch pushdown solution for Collibra Data Quality & Observability.
Running a data quality job without a pushdown option
Executing a Data Quality job effectively involves several steps. First, you configure the source of your data and establish a connection to access it, specifying the database where your data resides and providing the necessary credentials or connection details. Once the data source is configured and the connection is established, you can define the parameters for the job, such as the desired columns and data range. You can also configure machine learning layers, including Outliers, Patterns, Duplicates, and Shapes, which provide valuable insights into the dataset being processed.
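As a rough illustration, the configuration for such a job covers a connection, a dataset, the columns and range of interest, and the ML layers to apply. The sketch below is hypothetical; the keys and values are for illustration only and are not Collibra's actual job specification:

```python
# Hypothetical sketch of the information a data quality job configuration covers.
# The keys and values below are illustrative only, not Collibra's actual job spec.
import json

dq_job = {
    "connection": {
        "type": "jdbc",                                   # where the source data resides
        "url": "jdbc:postgresql://db-host:5432/sales",    # placeholder database
        "user": "dq_service_account",                     # credentials / connection details
        "password": "********",
    },
    "dataset": "public.orders",                           # table or query to profile
    "columns": ["order_id", "amount", "region", "order_date"],
    "data_range": {"from": "2023-01-01", "to": "2023-01-31"},
    "ml_layers": {                                        # optional machine learning layers
        "outliers": True,
        "patterns": True,
        "duplicates": False,
        "shapes": True,
    },
}

if __name__ == "__main__":
    # In a real deployment this definition would be submitted to the DQ engine;
    # here we simply print it to show the shape of the configuration.
    print(json.dumps(dq_job, indent=2))
```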
Without the Pushdown option, a Data Quality job typically relies on an Apache Spark compute engine as the underlying computational framework. Once the parameters and ML layers are established, the specified dataset is ingested into Spark. At this stage, Spark performs a range of operations, such as partitioning, sorting, and executing queries, to process the data effectively. These operations organize and manipulate the dataset to produce meaningful results.
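Conceptually, the non-pushdown flow resembles the following PySpark sketch. The table and column names are assumptions, and the actual checks and ML layers Collibra applies are far richer; the point is that the data is pulled into the Spark compute plane and profiled there:

```python
# Rough illustration of what a Spark-based (non-pushdown) data quality pass does:
# ingest the dataset into the Spark compute plane, then profile it there.
# Table and column names are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-job-sketch").getOrCreate()

# Ingest the specified dataset into Spark (here via JDBC; credentials omitted).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.orders")
    .load()
)

# Spark partitions the data, then runs profiling queries on it.
orders = orders.repartition("region")

profile = orders.agg(
    F.count(F.lit(1)).alias("row_count"),
    F.sum(F.when(F.col("amount").isNull(), 1).otherwise(0)).alias("null_amounts"),
    F.countDistinct("order_id").alias("distinct_order_ids"),
)

# The profiled metrics are what gets written out for further analysis.
profile.show()
```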
Upon completion of the processing phase, the results are written out and further analyzed to identify outliers and patterns. This additional processing helps extract valuable insights and improve data quality. The outcomes are typically stored in the Data Quality Metastore, providing a consolidated repository for future reference, while the source data remains in its original database.
What is Data Quality Pushdown for Databricks?
Let’s begin with what Databricks is. It is a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. The Databricks Lakehouse Platform combines enterprise data warehouses with the flexibility and cost-efficiency of data lakes to enable BI and ML on all data. Databricks is gaining popularity because your data can be anywhere, in any format, and Databricks can process it.
The Data Quality Pushdown model transfers the processing to Databricks, resulting in reduced physical movement of data.
Why do we need Data Quality Pushdown for Databricks?
Data Quality Pushdown for Databricks provides an alternative approach for executing a data quality job, with the distinct advantage of having the data quality processing performed directly in the target data warehouse.
Performing the processing directly in Databricks also provides:
- Compute resources (Auto-scaling of nodes): When a Data Quality Job runs in pushdown mode, you can take advantage of the Databricks auto-scaling architecture, where processing can dynamically scale up to 64 or 128 nodes. When executing a large Data Quality Job involving millions of rows and hundreds of columns, Databricks can efficiently burst to manage the workload, and after the Data Quality Job is complete, the system automatically scales back down (see the cluster spec sketch after this list).
- Serverless: In addition to the resource bursting capability, Databricks offers a serverless option for cluster management. With the serverless option, you no longer need to provision and manage clusters manually. Instead of setting up and maintaining fixed hardware resources, Databricks automatically allocates and deallocates resources as needed. In serverless mode, you simply submit your Data Quality Job without worrying about cluster configuration or management; Databricks provisions the necessary compute resources on demand to execute your job efficiently.
- Improved data privacy: Since your data never leaves your Databricks environment, your privacy regulation compliance is not compromised and your information security is assured.
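To make the auto-scaling point above concrete, a Databricks cluster created for a large job can declare an autoscale range instead of a fixed node count. The sketch below mirrors the JSON-style payload accepted by the Databricks Clusters API; the node type, worker counts, and runtime version are assumptions for illustration:

```python
# Minimal sketch of a cluster spec with auto-scaling, expressed as the JSON-style
# payload accepted by the Databricks Clusters API. Node type, worker counts, and
# the runtime version shown here are assumptions for illustration.
import json

cluster_spec = {
    "cluster_name": "dq-pushdown-burst",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {
        "min_workers": 2,    # idle baseline
        "max_workers": 64,   # burst ceiling for large data quality jobs
    },
    "autotermination_minutes": 30,  # scale back down / shut off after the job completes
}

if __name__ == "__main__":
    print(json.dumps(cluster_spec, indent=2))
```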
Additionally, Collibra Data Quality & Observability with Pushdown for Databricks auto-generates SQL queries to offload the data quality compute to the data source, reducing the amount of data that needs to be transferred and removing the need for the Apache Spark compute plane. All of this helps customers improve processing time for data quality jobs while reducing egress costs.
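As an illustration of the pushdown idea, the profiling work offloaded to Databricks can be expressed as a single SQL query that returns only aggregated metrics. The sketch below uses the databricks-sql-connector; the connection details are placeholders, and the query is illustrative rather than the exact SQL that Collibra generates:

```python
# Sketch of pushdown-style profiling: the metrics are computed inside Databricks
# by a generated SQL query, so only the small result set leaves the warehouse.
# Connection details are placeholders; the query is illustrative, not the exact
# SQL that Collibra Data Quality & Observability generates.
from databricks import sql  # pip install databricks-sql-connector

PROFILE_QUERY = """
SELECT
  COUNT(*)                                        AS row_count,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
  COUNT(DISTINCT order_id)                        AS distinct_order_ids
FROM sales.orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-01-31'
"""

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123",                        # placeholder
    access_token="dapiXXXXXXXXXXXX",                               # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(PROFILE_QUERY)
        # Only the aggregated metrics travel back, not the underlying rows.
        print(cursor.fetchone())
```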
In summary
As a key new capability of Collibra Data Quality & Observability, the beta release of Pushdown for Databricks opens up significant opportunities for reduced TCO, increased efficiency, and instant scalability.