AI and ML can provide a tremendous boost to automating analysis and decisions, but they need high-quality data to harness their true power. With increasing volumes and varieties of data arriving at increasing velocity, data quality is proving to be the biggest challenge for trusted analytics and AI.
ML has a big appetite for data
A model trained on a small data set may not correctly capture the underlying pattern. Because additional data improves the model, ML models are expected to learn continuously from incoming new data and from feedback on their results.
ML models deliver faster results at scale, but the results can be accurate only when the data feeding them is of high quality. Data quality is among the top three barriers to AI adoption, according to a Gartner report. A modern data quality solution should be able to scan large and diverse databases (including files and streaming data) without needing to move or extract the data, accelerating the development of new data quality pipelines and ML initiatives. Data and models are the two pillars of ML-driven analytics; both must be of high quality to power accurate and trusted outcomes.
Bad Data + Good Models = Bad Results
The quality demands of ML are steep, and bad data can rear its ugly head twice: first in the historical data used to train the predictive model, and second in the new data that the model uses to make future decisions.
For any ML model, the training data set must fit the purpose. It should be complete, correct, valid, and free of empty or duplicated records. A model trained on poor-quality data will not deliver the expected results, even if you feed it good-quality input data.
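To make "fit for purpose" concrete, here is a minimal sketch of such checks in Python, assuming a pandas DataFrame; the column names and pass criteria are illustrative assumptions, not a prescribed standard.

```python
import pandas as pd

def validate_training_data(df: pd.DataFrame, required_columns: list) -> dict:
    """Run basic fitness-for-purpose checks on a training data set.

    Illustrative only: real pipelines would also validate types,
    value ranges, and referential integrity.
    """
    report = {
        # Completeness: every required column must be present
        "missing_columns": [c for c in required_columns if c not in df.columns],
        # Empty records: rows where every value is missing
        "empty_records": int(df.isna().all(axis=1).sum()),
        # Validity signal: share of null cells per column
        "null_ratio": df.isna().mean().round(3).to_dict(),
        # Uniqueness: fully duplicated rows distort the learned pattern
        "duplicate_records": int(df.duplicated().sum()),
    }
    report["passed"] = (
        not report["missing_columns"]
        and report["empty_records"] == 0
        and report["duplicate_records"] == 0
    )
    return report

# Example: a toy data set with one duplicated row and one empty row
df = pd.DataFrame({"customer_id": [1, 2, 2, None], "churned": [0, 1, 1, None]})
print(validate_training_data(df, ["customer_id", "churned"]))
```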
It is easy to label data good or bad, but when you start working on data quality, the real challenges become apparent.
- Meaning of data quality: Different stakeholders look at data quality from different perspectives. Data engineers and data stewards tend to place a high priority on the accuracy of individual records. Data consumers, on the other hand, prefer to consider data sets rather than records. They understand the importance of accuracy, but they also want to consider other attributes that correctly represent the state of business health and forecast market trends.
- Measurement of data quality: Data has several attributes or dimensions. Not all dimensions are relevant to your context, and not all contribute equally to data quality. You can choose the 3-6 dimensions that matter to your specific use cases, assign appropriate weights, and determine a combined score (see the sketch after this list).
- Approach to data quality: Hasty, disjointed efforts to measure and improve data quality do not deliver any long-term benefit. Consider approaching data quality as a fundamental part of data strategy, aligning it to your enterprise-wide data governance and data intelligence efforts.
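As an illustration of the measurement approach above, here is a minimal sketch of a weighted combined score in Python; the dimensions, scores, and weights are illustrative assumptions, and real dimension scores would come from your profiling jobs.

```python
# Pick the dimensions that matter to your use case, weight them,
# and roll them up into a single data quality score.
dimension_scores = {   # per-dimension scores in [0, 1], illustrative
    "completeness": 0.96,
    "accuracy": 0.91,
    "validity": 0.88,
    "timeliness": 0.80,
}
weights = {            # business-assigned importance, illustrative
    "completeness": 0.35,
    "accuracy": 0.30,
    "validity": 0.20,
    "timeliness": 0.15,
}

assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"

combined_score = sum(
    dimension_scores[dim] * weight for dim, weight in weights.items()
)
print(f"Combined data quality score: {combined_score:.2%}")  # 90.50%
```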
Data quality is no longer just about the accuracy of data. It is more about visibility into data and the ease of shopping for the right data. Gartner research recommends focusing on the supply chain to deliver the right data to data consumers.
Good Data + Bad Models = Bad Results
Ensuring good-quality data is the first step toward analytics and AI. But the ML models themselves must also be of the highest quality and appropriate for the planned analysis.
While ML is still a black box for some business users, data scientists are well aware of the effort behind successful ML modeling. If the models are bad, even top-quality input data can deliver the wrong results. Bad models can result from insufficient, incomplete, irrelevant, or biased training data. To design good, bias-free, objective ML models, data scientists need to continuously monitor any new training data.
Predictive, continuous, self-service data quality
Predictive data quality leverages ML to auto-generate SQL-based, non-proprietary, explainable, and adaptive data quality rules. The system can constantly learn from data to generate data quality rules, become incrementally smarter each day, and track down issues as soon as they arise. Monitoring data drift, outliers, patterns, and schema changes helps you track the accuracy and performance of ML models over time.
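As one concrete example of drift monitoring, here is a sketch of the Population Stability Index (PSI), a common drift check; this is a generic illustration rather than any vendor's implementation, and the bin count and thresholds are rules of thumb.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a baseline sample and newly arriving data.

    Rule of thumb (illustrative): < 0.1 stable, 0.1-0.25 moderate
    drift, > 0.25 significant drift worth investigating.
    """
    # Bin edges come from the baseline so both samples share one grid
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log(0) on empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct)
                        * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(loc=100, scale=15, size=10_000)  # training-time data
incoming = rng.normal(loc=110, scale=15, size=10_000)  # shifted new data
print(f"PSI: {population_stability_index(baseline, incoming):.3f}")
```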
Trusted analytics need timely access to relevant, high-quality data. But data quality is hardly a one-time activity. The quality of data can deteriorate over time, and data can lose its integrity during its journey across the enterprise. Data quality rules for accuracy can affect timeliness if they put excessive load on data processes. If the tools cannot manage the high volume and variety of data arriving from different sources and environments (e.g. cloud, on-prem, hybrid), they can impact the timeliness and accessibility of data.
Enabling contributions from all users strengthens continuous quality efforts and promotes a culture of quality. A self-service data quality solution empowers data engineers, data stewards, business analysts, data scientists, and managers to identify and resolve quality issues themselves.
- Flexible, distributed Apache Spark™ parallel processing gives you better stability and quick scalability for large databases (see the sketch after this list)
- Autogenerated and adaptive rules reduce complexity, bottlenecks, repetition and guesswork in data quality rule management
- A robust data quality assessment framework helps you define single scoring with your choice of quality dimensions
- Continuous anomaly detection helps monitor and improve data quality
- Powerful metadata management capability captures and reconciles metadata for data quality processes
- Collaborative self-service access boosts DataOps productivity and minimizes the cycle time
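To illustrate the Spark-based scanning mentioned in the first bullet, here is a minimal PySpark sketch that profiles a table in parallel without extracting the data; the table and column names are hypothetical, and the checks shown are only a small sample of what a full solution computes.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-scan").getOrCreate()

# Hypothetical large table, scanned in place by Spark executors
df = spark.read.table("sales.orders")

profile = df.agg(
    F.count("*").alias("row_count"),
    # Completeness: null counts for critical columns
    F.sum(F.col("order_id").isNull().cast("int")).alias("null_order_id"),
    F.sum(F.col("amount").isNull().cast("int")).alias("null_amount"),
    # Validity: out-of-range values
    F.sum((F.col("amount") < 0).cast("int")).alias("negative_amount"),
    # Uniqueness: distinct vs. total keys
    F.countDistinct("order_id").alias("distinct_order_id"),
).first()

duplicates = profile["row_count"] - profile["distinct_order_id"]
print(f"rows={profile['row_count']}, duplicate keys={duplicates}, "
      f"null amounts={profile['null_amount']}")
```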
Taking an ML-first, rules-second approach to the data quality solution makes you future-ready, aligning with the Gartner prediction that by 2022, 60% of organizations will leverage ML-enabled technology for data quality improvement.
Good Data + Good Models + Good Collaboration = Trusted Results
Data engineers work on making the data right, while data scientists keep looking for the right data to use. This gap in approach is the key reason for compartmentalized management of data quality and ML models.
Databricks bridges the gap with a simple, open, and collaborative platform to store and manage all of your data for all of your analytics workloads. Collibra provides native integration with the Databricks lakehouse, enabling continuous, intelligent data quality monitoring. Together, they offer high-quality, reusable data pipelines that support trusted results.
- No more silos: Databricks offers the foundation of a highly scalable lakehouse with a single open-format storage layer – Delta Lake – for structured, semi-structured, and unstructured data (see the sketch after this list).
- Continuous quality data pipelines: Collibra helps ensure the high quality of data pipelines via automated governance and lineage tracing, with the predictive data quality rules continuously adapting to the data arriving in the lakehouse.
- Autoscaling: Databricks’ auto-scaling infrastructure powers fast, high-performance data pipelines.
- Trusted results: Continuous, self-service, compliant data quality on a collaborative platform for trusted, bias-free analytical results.
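As a minimal illustration of how quality checks can gate data landing in the lakehouse, here is a PySpark sketch that appends valid records to a curated Delta table and quarantines the rest; the paths and validity rule are hypothetical, and this is a generic pattern rather than the Collibra-Databricks integration itself.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-gate").getOrCreate()

# Hypothetical landing zone with raw, unvalidated records
raw = spark.read.json("/landing/orders/")

# Illustrative validity rule; nulls evaluate to invalid, not unknown
is_valid = (
    F.col("order_id").isNotNull()
    & F.col("amount").isNotNull()
    & (F.col("amount") >= 0)
)

# Good records flow into the curated Delta table...
raw.filter(is_valid).write.format("delta").mode("append") \
   .save("/lakehouse/curated/orders")

# ...bad records are quarantined for data stewards to inspect and repair
raw.filter(~is_valid).write.format("delta").mode("append") \
   .save("/lakehouse/quarantine/orders")
```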
When good data and good models meet a unified, scalable platform with superior collaboration capabilities, your AI/ML initiatives can deliver trusted results. Trusted analytics and AI drive more effective decisions, higher productivity, and better cost-efficiency. Predictive, continuous, self-service data quality, together with the Databricks Lakehouse Platform, is your way to achieve trusted analytics and AI.