Having seen joint value and demand across a broad spectrum of customers, Collibra and Databricks have extended their cross-platform functionality. With cloud agility and self-service analytics continuing to drive the market, both companies have grown quickly into established market leaders, and both are investing further in product integrations. You can see these integrations in action at Databricks’ Data & AI Summit next week.
Introducing Unity Catalog from Databricks
At Data & AI Summit 2021, Databricks announced Unity Catalog, a unified governance solution for all the data assets in your lakehouse, including files, tables, dashboards, and machine learning models, on any cloud.
Databricks and Collibra’s joint customers are excited for the deeper lineage capabilities and the improved data governance Unity Catalog will provide.
Unity Catalog and the ongoing product partnership between Databricks and Collibra are great news for Collibra customers for several reasons:
- Collibra customers love the ability to do impact analysis. By harvesting cross-system lineage, customers can see the impact of changes across their data landscape.
- Collibra and Databricks lineage is simple and robust. Many on-platform lineage capture approaches depend on parsing scripts, whose nuances add complexity to the lineage harvesting framework. Databricks lineage, by contrast, is immediate, actionable, and captured automatically as part of every platform operation.
- Databricks lineage integrates easily with Collibra through Databricks APIs (see the sketch after this list). Customers can expect higher-quality lineage, no harvesting delays, and a robust integration into Collibra.
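As a concrete illustration, here is a minimal sketch of retrieving table lineage over REST, assuming the Unity Catalog lineage-tracking endpoint is enabled in your workspace. The workspace URL and table name are placeholders, and the sketch is illustrative only, not Collibra’s actual harvesting code.

```python
import os
import requests

# Placeholders: substitute your workspace URL and a personal access token.
WORKSPACE = "https://<your-workspace>.cloud.databricks.com"
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Ask Unity Catalog for the lineage of one table (name is hypothetical).
resp = requests.get(
    f"{WORKSPACE}/api/2.0/lineage-tracking/table-lineage",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "table_name": "main.sales.orders",
        "include_entity_lineage": "true",  # also return notebooks/jobs involved
    },
)
resp.raise_for_status()

# The response lists upstream and downstream entities; a catalog integration
# can map these edges into its own lineage graph.
lineage = resp.json()
for edge in lineage.get("upstreams", []):
    print("upstream:", edge)
for edge in lineage.get("downstreams", []):
    print("downstream:", edge)
```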
Collibra Data Catalog and Unity Catalog together
The introduction of Unity Catalog means an improved, seamless experience for joint Databricks and Collibra customers. Collibra Data Catalog is the industry leader for data governance workflows and for cataloging enterprise data systems. Integrating Collibra with the more technical, tactically focused Unity Catalog brings better coverage of on-platform functionality, better visibility into data governance through Databricks lineage and catalog attributes, and a better way to put data entitlement and data protection policies into action.
The advantages of data quality and observability for our joint customers
Data quality and observability are at the forefront of our joint customers’ data and governance strategies because:
- Data quality is a strategic advantage as it can significantly lower the time to value for adopting new analytic and programmatic data strategies (like data mesh and data fabric).
- Data quality and observability can dramatically improve the accuracy of machine learning and augmented decision support.
- Data quality ensures data migrations can happen faster and with fewer errors.
- Data quality is being revolutionized by modern advances in machine learning.
Together, Databricks and Collibra leave our customers well positioned to take advantage of these advancements.
The benefits of Collibra Data Quality and Databricks
Since late 2021, Collibra and Databricks have had a data quality integration that programmatically overlays declarative quality rules on Databricks’ Delta Live Tables (DLT). This pilot integration provides value to customers who already have defined data quality rules and use DLT. Collibra has also released a machine learning based data quality offering that is revolutionizing the approach: it can effectively replace and automate over 13,000 declarative rules in a single click, with much deeper insights and improved functionality compared with what was previously possible.
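For context, DLT expresses a declarative quality rule as an “expectation” attached to a pipeline table. The sketch below, with hypothetical table and column names, shows the kind of rules such an integration would generate programmatically rather than an analyst writing by hand.

```python
import dlt
from pyspark.sql.functions import col

# A DLT pipeline table with declarative quality rules ("expectations").
# Table and column names are hypothetical.
@dlt.table(comment="Orders cleaned with declarative quality rules")
@dlt.expect("valid_order_id", "order_id IS NOT NULL")                # log violations
@dlt.expect_or_drop("positive_amount", "amount > 0")                 # drop bad rows
@dlt.expect_or_fail("known_currency", "currency IN ('USD', 'EUR')")  # halt on breach
def orders_clean():
    return dlt.read("orders_raw").where(col("order_date").isNotNull())
```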
Collibra’s data quality and observability offering can be used natively within Databricks through multiple supported patterns. It leverages Databricks’ processing capabilities to run Collibra’s data quality algorithms on data stored in the lakehouse. In this approach, Collibra’s SaaS serves as the front end for viewing data quality reports over time, visualizing anomalous behavior, and triggering alerts. This requires network connectivity between the Databricks nodes and Collibra; however, source data never leaves the Databricks platform, only information about the results of the Collibra DQ jobs.
The first method of using Databricks clusters to process Collibra DQ jobs (tested and currently used in production) is to load the Collibra DQ worker jars onto the cluster, invoke them from a Databricks notebook against a Spark DataFrame over the desired data sets, and schedule that notebook to run, while ensuring the Databricks execution nodes have network access to Collibra to publish the results of the job (a minimal sketch follows). The DataFrame approach works well, but it requires an analyst to build the DataFrame for each specific data set and schedule its execution within Databricks, so it carries some manual management and technical overhead.
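Here is a minimal notebook sketch of this pattern. The PySpark parts are standard; the DQ invocation itself is a deliberately hypothetical stub, since the real entry point is supplied by the Collibra DQ worker jars attached to the cluster (see Collibra’s documentation for the actual classes and method names).

```python
from pyspark.sql import DataFrame, SparkSession

def run_dq_check(df: DataFrame, dataset: str, run_id: str) -> None:
    """Hypothetical stand-in for the entry point exposed by the Collibra DQ
    worker jars; replace with the actual API from the Collibra DQ docs."""
    raise NotImplementedError("wire up the Collibra DQ library here")

# In a Databricks notebook `spark` is predefined; shown here for completeness.
spark = SparkSession.builder.getOrCreate()

# Build the DataFrame the quality job will scan (table name is hypothetical).
orders = spark.read.table("main.sales.orders")

# The scan executes on the Databricks cluster; only the resulting scores and
# findings are published back to Collibra, which is why the executors need
# network access to the Collibra endpoint.
run_dq_check(orders, dataset="orders", run_id="2022-06-27")
```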
If a customer would rather schedule jobs externally without authoring a DataFrame, they can use the Databricks web service API for Spark job submission. In this second approach, an external system makes a web-service call that kicks off a Collibra Data Quality job within Databricks (a sketch follows). It requires making the Collibra Data Quality jars available to the Databricks cluster, then configuring a web-service payload and some environment variables in the cluster configuration within Databricks so that jobs can be submitted successfully.
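As an illustration of the web-service pattern, the call below submits a one-off Spark jar run through the Databricks Jobs API (runs/submit). The jar path, main class, job parameters, and node type are hypothetical placeholders; the real values come from the Collibra DQ documentation and your own cluster configuration.

```python
import os
import requests

# Placeholders: substitute your workspace URL and a personal access token.
WORKSPACE = "https://<your-workspace>.cloud.databricks.com"
TOKEN = os.environ["DATABRICKS_TOKEN"]

# One-off Spark jar run via the Databricks Jobs API. The main class and
# parameters below are hypothetical stand-ins for the Collibra DQ job.
payload = {
    "run_name": "collibra-dq-orders",
    "tasks": [
        {
            "task_key": "dq_scan",
            "new_cluster": {
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "i3.xlarge",  # cloud-specific placeholder
                "num_workers": 2,
            },
            "libraries": [{"jar": "dbfs:/FileStore/jars/collibra-dq-worker.jar"}],
            "spark_jar_task": {
                "main_class_name": "com.collibra.dq.Main",  # hypothetical
                "parameters": ["-ds", "orders", "-rd", "2022-06-27"],
            },
        }
    ],
}

resp = requests.post(
    f"{WORKSPACE}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print("run_id:", resp.json()["run_id"])
```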
It is worth noting that these patterns require a different technical approach than the open-source Spark method of submitting Collibra DQ jobs to a cluster from the Collibra DQ interface: Databricks requires familiarity with either DataFrames or web services. In return, our shared customers can take advantage of the processing power of their existing Databricks cluster and enjoy the scale and ease of a managed data service, with no need to stand up a secondary execution environment to process Collibra DQ jobs.
Read more about the different ways to use Collibra DQ on Databricks:
- Documentation for leveraging the Spark DataFrame method
- Documentation for the web-service API method
- Other supporting documentation
***
If you want to learn more about Collibra DQ and our work to integrate Collibra with Databricks’ Unity Catalog, come join us at Booth #846 at the Data & AI Summit in San Francisco, June 27-30.
Databricks will also speak about Unity Catalog in more detail on June 28 at 10:45 am in the session “Unity Catalog: Journey to unified governance for your Data and AI assets on Lakehouse.”