Collibra and Databricks are two companies with parallel missions. They each help organizations to unlock the value of their data, break through legacy silos, speed time to insights and drive digital transformation projects. Yet they do so in unique and complementary ways.
Collibra enables organizations to nurture data as an asset – helping to enhance data discovery, aid understanding, promote trust and ensure compliance with relevant policies. Building on that foundation, Databricks arms data scientists, engineers and analysts with a platform to quickly turn data into business insights, enabling them to rapidly ingest, process, store and analyze large and diverse datasets.
The crux of the partnership between Collibra and Databricks is simple: Collibra offers a Data Intelligence platform that helps organizations assure trust in data; Databricks offers a unified analytics platform to turn that trusted data into business insights. These two sets of capabilities go hand in hand. As organizations amass ever greater quantities of data, they need powerful analytics to derive insights from those datasets. But those insights can only be of true value if they are derived from trusted data.
Integration touchpoints
Given the complementary nature of Collibra and Databricks, there are several potential integration touchpoints between the two platforms. For the purpose of this blog, however, we will focus on the two most significant:
Delta Lake: Databricks is the original creator of Delta Lake, an open-source storage layer for big data implementations. Delta Lake solved a key issue that many organizations faced with their data lake implementations by making them more reliable with the introduction of ACID transactions. However, without proper governance, many organizations still experienced challenges around data discovery, classification and compliance. Collibra excels at addressing these challenges. Data ingested into Delta Lake can automatically be profiled and classified by Collibra. Doing so makes it easier for end users to find the right data, understand its context (including any restrictions to its use) and trust in its accuracy.
Databricks SQL and BI Integrations: Databricks has recently launched Databricks SQL, a set of capabilities offering enhanced integration with business intelligence tools along with faster query performance on Delta Engine, a vectorized engine optimized for SQL workloads. Collibra not only integrates with those same BI partners but also with Databricks SQL, enabling business analysts to shop for data and have it provisioned automatically using metadata housed in Collibra Data Catalog. Similarly, reports and dashboards created within those BI platforms can also be registered in Collibra Data Catalog, providing a platform for analysts to collaborate, share insights and reduce duplication in their efforts.
Solving a range of challenges
The combination of capabilities from Collibra and Databricks enables organizations to address a range of challenges associated with traditional data lake implementations.
Shopping for data
Collibra helps data consumers find the right sources by highlighting certified datasets and using techniques such as automated recommendations. Part of that shopping experience also includes aiding understanding of data by providing context (through business glossaries, data dictionaries and collated user feedback). Once the right data has been selected, it needs to be provisioned by transforming it into the target format that the user requires for their analysis. Collibra’s detailed knowledge of technical metadata supports this process by helping to orchestrate data pipelines. For example, users can select a Databricks SQL instance and create a Tableau data source that specifies the required table(s) in Delta Lake on Databricks. Tableau will then be able to access the data in Databricks guided by metadata housed in Collibra.
Compliance
Most organizations face a raft of rules and policies impacting their use of data. Data privacy regulations are rapidly evolving across the globe (examples include GDPR in Europe, CCPA in California, LGPD in Brazil and the PDP Bill in India), alongside industry-specific rules that mandate their own set of controls, internal policies that restrict sharing of sensitive information (such as ensuring salary information remains with HR professionals), and data retention requirements for audit purposes. Balancing such a complex mix of requirements can only be done with a granular, data-centric approach to compliance. By profiling and classifying data ingested into Delta Lake at the column level, Collibra helps ensure that sensitive information can be accurately identified, while also highlighting applicable policies that can be used to determine access permissions.
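To make the idea of column-level classification concrete, here is a minimal sketch in plain Python. It is not Collibra's classification method or API; the pattern set, the function name classify_columns and the 80% match threshold are all assumptions chosen for illustration. It samples the values in each column and labels the column when most of them match a known pattern.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration only; production classifiers
# cover far more data classes and use more than regular expressions.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_columns(rows: List[Dict[str, str]], threshold: float = 0.8) -> Dict[str, str]:
    """Label a column when at least `threshold` of its non-empty sampled values match a pattern."""
    labels: Dict[str, str] = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [r[col] for r in rows if r.get(col)]
        if not values:
            continue  # nothing to profile in this column
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                labels[col] = label
                break  # first matching class wins in this sketch
    return labels


# Made-up sample rows standing in for data sampled from a table.
sample = [
    {"name": "Ada", "contact": "ada@example.com", "tax_id": "123-45-6789"},
    {"name": "Lin", "contact": "lin@example.org", "tax_id": "987-65-4321"},
]
sensitive = classify_columns(sample)
```

In this sketch the contact and tax_id columns are flagged while name is left unlabeled. Labels like these are what allow a policy (for example, "only HR may read columns classified as salary") to be attached to individual columns rather than to whole tables.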
Lineage for BI
Data lineage offers a variety of benefits to any data-driven organization. Collibra can track data lineage from reports generated in Tableau through to specific Delta Lake tables and columns. In doing so, organizations can address operational risk by alerting business analysts when elements in their reports are being deprecated. Equally, anyone viewing a report and questioning the validity of the underlying data can quickly trace the source of that data to ascertain its reliability.
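The two uses of lineage described above, tracing a report back to its sources and finding the reports affected by a deprecated column, can both be sketched as traversals of a dependency graph. The example below is illustrative only: the asset names and the UPSTREAM mapping are hypothetical, and it does not reflect how Collibra stores or queries lineage.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: each asset maps to the upstream assets it reads from.
UPSTREAM: Dict[str, List[str]] = {
    "tableau:sales_dashboard": ["tableau:sales_datasource"],
    "tableau:sales_datasource": [
        "delta:sales.orders.amount",
        "delta:sales.orders.region",
    ],
    "delta:sales.orders.amount": [],
    "delta:sales.orders.region": [],
}


def trace_sources(asset: str, upstream: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk upstream; return the assets that have no further parents."""
    seen = {asset}
    queue = deque([asset])
    sources: Set[str] = set()
    while queue:
        current = queue.popleft()
        parents = upstream.get(current, [])
        if not parents and current != asset:
            sources.add(current)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sources


def impacted_assets(deprecated: str, upstream: Dict[str, List[str]]) -> Set[str]:
    """Return every asset whose lineage reaches the deprecated source."""
    return {a for a in upstream if deprecated in trace_sources(a, upstream)}


dashboard_sources = trace_sources("tableau:sales_dashboard", UPSTREAM)
to_alert = impacted_assets("delta:sales.orders.region", UPSTREAM)
```

Here dashboard_sources contains the two Delta Lake columns behind the dashboard, and to_alert contains the data source and dashboard whose owners would need to be told that the region column is being deprecated.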
Controlled ingestion
Data lakes offer scalable and cost-effective storage of enterprise data assets. However, by enabling storage of greater volumes and varieties of data, and at faster velocities, many implementations have run into problems relating to data governance and discovery. Collibra helps to address these issues by automatically capturing metadata as datasets are ingested into the lake and supplementing that information with insights from subject matter experts. This descriptive information provides organizations with an understanding of where data is sourced from, how each field is defined, whether datasets are complete and accurate, and whether there are restrictions governing their use. In addition, Collibra’s built-in governance capabilities can help address data quality issues by maintaining accountability and helping to introduce quality assurance and certification processes.
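As a rough illustration of what capturing metadata at ingestion can look like, the sketch below records a dataset's source, row count and per-field completeness, and leaves room for subject matter experts to add field definitions afterwards. The DatasetMetadata record and capture_metadata function are hypothetical names invented for this example, not part of any Collibra or Databricks API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record captured when a dataset lands in the lake."""
    name: str
    source: str
    row_count: int
    completeness: Dict[str, float]  # fraction of non-null values per field
    definitions: Dict[str, str] = field(default_factory=dict)  # filled in by experts


def capture_metadata(name: str, source: str, rows: List[Dict[str, Any]]) -> DatasetMetadata:
    """Profile an ingested batch: record where it came from and how complete each field is."""
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    completeness = {
        col: (sum(1 for row in rows if row.get(col) is not None) / total) if total else 0.0
        for col in columns
    }
    return DatasetMetadata(name=name, source=source, row_count=total, completeness=completeness)


# Made-up batch standing in for records arriving in the lake.
batch = [
    {"order_id": 1, "region": "EMEA"},
    {"order_id": 2, "region": None},
]
meta = capture_metadata("orders", "crm_export", batch)

# A subject matter expert later supplements the automatically captured metadata.
meta.definitions["region"] = "Sales region of the customer"
```

The automatically captured part answers "where did this come from and how complete is it", while the definitions added by people answer "what does this field mean", which mirrors the split between captured metadata and expert insight described above.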
Delivering business benefits
Data can make a positive impact on all aspects of a business. It can help organizations better understand their customers, build better products, drive operational efficiencies and reduce risk. But to achieve those goals, organizations need to ensure their data and analytics are supported by the right people, process and technology.
Databricks excels at driving agile data operations by arming data scientists, data engineers and analysts with a platform for sophisticated analytics. As the original creator of open-source initiatives including Apache Spark, Delta Lake and MLflow, the company prides itself on creating open and collaborative frameworks that accelerate innovation. The introduction of Databricks SQL has created a home for business analysts within that agile framework.
However, as data operations become more agile, organizations must also ensure that the right data is being sourced and that access to it is permissioned correctly. Collibra enables that to happen. Collibra Data Catalog makes it easier for business analysts and data scientists to find data, understand its business context, be aware of any restrictions regarding its use and have that data rapidly provisioned.
Behind the scenes, Collibra also helps ensure that data can be trusted. It does so through its built-in governance capabilities: driving consistency through data dictionaries and business glossaries, promoting data quality through accountability and certification, and supporting compliance by classifying data to keep track of personally identifiable information.
The partnership between Collibra and Databricks has always been complementary. The launch of Databricks SQL enables new use cases between the two platforms, with the ultimate goal of supporting data-driven organizations with an agile analytics platform underpinned by trusted data.