Mastering the art of data intelligence: empowering Collibra with ChatGPT

Don’t you just hate when you are working with a data set and you have no clue what the tables and columns are about? As a Data Scientist in our internal Data Office, I understand that data is the backbone of modern business and unlocking its value requires a thorough understanding. That’s why I rely on Collibra to find, trust, understand, and access my data. 

In the past when we were just launching the Data Office, I was responsible for ingesting and cataloging data in Collibra as a data engineer persona. However, adding column descriptions was often time-consuming, leading to many data sets being described incompletely on the column level. This is a major pain point for data scientists, causing confusion and delays in their work. 

In this blog post, my colleague Ahmad Al-Qinneh and I explore how ChatGPT can help data scientists master the art of data intelligence by empowering Collibra with column description generation. This is all made possible by the openness of Collibra as well as our native workflow designer.

How ChatGPT can help Data Scientists 

Ultimately, column descriptions are generated by Open AI’s gpt-3.5-turbo model; this is done by extracting a column’s complete metadata (column name, source table, and source schema) and sending these as additional components of the prompt to propose a more relevant description, e.g. This is the first step. 

Prompt:
Provide a description for a column called {currentColumnName} in a table called {tableName} within a schema called {schemaName}. Do not repeat the name of the column, table and or the schema. Also do not mention the data type.

Once the prompt has been engineered, we enter the development stage. As part of the development process, we noticed that the model’s behavior was to provide additional noise such as repeating the prompts (e.g., The column called {currentColumnName} in a table called {tableName} within a schema called {schemaName} is …”) therefore we’ve also provided explicit instruction in the prompt to avoid such redundancy. 

Next, the prompt has to be embedded in a workflow. This workflow behaves by extracting a list of columns from a table, at which point it iterates over the list and prompts the user to review the description for the column at hand:

In practice, a data steward or custodian can be given authority to invoke this “Generate Description” process:

Once invoked, the custodian is presented with a task in which they have to review the generated description. Good governance is still going to need a business person to be the authority for this description. This task then helps them choose to either regenerate a new one, approve, or make corrections and modifications before approving. Finally, once approved the description is persisted in the field, and the workflow proceeds to the next column.

In conclusion, we are thrilled to have collaborated with a talented team of individuals in bringing together Collibra and ChatGPT’s column description generation capabilities. With the power of ChatGPT, data stewards can now easily and accurately describe their data, leading to faster and more reliable decision-making. Although this is not part of our product and not currently planned, we hope this blog post has provided valuable insights and inspiration for your data intelligence journey. 

***

Thank you to Alexandre T’Kint, Ahmad Al-Qinneh, Michael Wilcox, Tony Hoang, Miguel Santana, Antonio Monteiro Goulao, and Danut Codrescu for your contributions to this project and blog.

 

Join the conversation in the Data Citizens community to learn more!

Related resources

Blog

AI governance: the holy grail for all data scientists

Blog

Evaluating Collibra’s data intelligence maturity with our IDC Assessment tool

View all resources

More stories like this one

Jan 22, 2025 - 3 min read

How Collibra is leading the way in AI Governance with its ISO 42001...

Read more
Arrow
Jan 13, 2025 - 4 min read

Collibra named a Leader in the Gartner® Magic Quadrant™ for Data and...

Read more
Arrow
Jan 9, 2025 - 7 min read

How to achieve data quality excellence for BCBS 239 risk data aggregation...

Read more
Arrow