Data Governance series Pt.1 - The Data Refinery Fallacy

I cannot count the number of times I have heard that data is the oil of the 21st century. Hardly any talk about data or AI goes by without someone repeating this claim - usually followed by a call for everything to become “data-driven.” However, I feel that the comparison is off.
The myth
Could you imagine a refinery that would live with processes polluting the oil it processes? A refinery that would knowingly dilute its raw product, or even accept leakages and spills without trying to fix them? A refinery that would give its oil away for free to third parties on the promise that they would turn it into gold? Yet I think we have all seen enterprises diluting, polluting and losing data. We have all witnessed companies giving their data away for free to AI startups on the promise that afterwards everything would magically be better. And we have all seen these promises broken.
Prominent examples of AI failing to magically make the world better can be found all over the internet, and they happen even to the biggest tech giants. Maybe you remember that in February 2023 the Google AI chatbot Bard claimed that the James Webb Space Telescope “took the very first pictures of a planet outside our own solar system”, when in fact the first picture of an exoplanet was taken by the European Southern Observatory in 2004. This small blunder caused more than a laugh: Google’s parent company Alphabet saw a notable share drop after the demo (see this CNN report). Now just imagine that this was a chatbot your company released to advertise your innovative approach and products.
When this happens, decision-makers should pause and assess whether their data quality meets the required standards. Then they conclude that they must not use AI tools without first getting a clear picture of that quality. That in turn entails understanding what data is created, transformed and used where, ideally followed by the question of what exit strategies exist if there is a data breach, the quality is not right, or the tools do not keep their promises. And when that happens, a door opens for change that would substantially improve the quality of the refinery’s oil and reduce the spillage. However, that door typically closes quite fast with the introduction of yet another data quality tool.
Manifestations
I was working for a customer who had multiple ERP and CRM systems across multiple business units. Data was never synchronized between the systems. Of course, they had duplicate customer records with a different customer number in each business unit.
The problem is that if you now want to use machine learning to find overarching customer behavior, the differing customer numbers mean the same customer might be treated as several distinct customers, and their behavior as a customer of the enterprise, not of a single business unit, will not be interpreted correctly. The same basic rule that holds for every report also holds for every AI model: if the data used for learning is low in quality, the output will also be low in quality. Or simply put: Shit in, shit out.
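To make this concrete, here is a minimal Python sketch - with invented customer numbers, column names and values - of how the same person, carried under two different customer numbers, fragments into two “customers” when behavior is aggregated naively, and how even a crude match on a normalized attribute would merge them again:

```python
import pandas as pd

# Hypothetical extracts from two business units' CRM systems.
# Customer numbers, columns and values are invented for illustration.
unit_a = pd.DataFrame({
    "customer_id": ["A-1001"],
    "name": ["Erika Mustermann"],
    "email": ["erika.mustermann@example.com"],
    "revenue": [12_000],
})
unit_b = pd.DataFrame({
    "customer_id": ["B-7734"],
    "name": ["Mustermann, Erika"],
    "email": ["Erika.Mustermann@example.com"],
    "revenue": [8_000],
})

combined = pd.concat([unit_a, unit_b], ignore_index=True)

# Naive enterprise-wide view: grouping by customer_id treats the same
# person as two customers, so neither row reflects her real value.
naive = combined.groupby("customer_id")["revenue"].sum()
print(naive)        # two "customers" with 12000 and 8000

# A crude match on a normalized attribute (lower-cased email) merges them.
combined["match_key"] = combined["email"].str.lower()
resolved = combined.groupby("match_key")["revenue"].sum()
print(resolved)     # one customer with 20000
```

Real entity resolution is far messier - typos, missing e-mail addresses, name variants - which is exactly why it belongs in governed, ongoing processes rather than in one-off scripts.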
Does introducing a tool like Microsoft Purview or Informatica in itself fix your data quality? Of course not: tools can help you understand the data, visualize its lifecycle and detect issues, but they cannot fix the broken data for you. If an input form allows letters in phone numbers, users will enter letters; downstream tools can detect the issue, but only a change to the frontend can prevent it.
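As a small illustration, here is a hedged Python sketch of such a check. The validation rule is an assumption for the example - digits, spaces, dashes and an optional leading “+”; real phone number formats vary widely. The point is that the same rule can only report bad records when it runs downstream, but can prevent them when it runs in the input form:

```python
import re

# Assumed validation rule for illustration: digits, spaces, dashes,
# and an optional leading "+". Real-world formats are more varied.
PHONE_PATTERN = re.compile(r"^\+?[0-9][0-9 \-]{4,20}$")

def is_valid_phone(raw: str) -> bool:
    """Return True if the value looks like a phone number under our rule."""
    return bool(PHONE_PATTERN.fullmatch(raw.strip()))

records = ["+49 721 12345", "call me maybe", "0151-2345678"]

# Downstream, a data quality tool can only *report* the bad record ...
invalid = [r for r in records if not is_valid_phone(r)]
print("invalid records:", invalid)

# ... whereas the same check wired into the input form would have
# rejected "call me maybe" before it ever reached the database.
```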
I know and understand that, as people working in IT, we are tool-driven professionals. We have learned how to create and adapt tools to our needs, how to make machines work for us (and sometimes against us). If there is a tool that promises to solve our problem, we will reach for that tool to fix the problem.

Many times in our professional environment, this approach is successful and delivers the results we strive for. That is satisfying, so we tend to reach for yet more tools when the next problem comes along. I have seen this approach followed so often in my career, and more than once I have asked “which tool solves this problem?” myself, forgetting that more often than not the tool might help us work around a problem but will not solve it in the sense of going to the root of the problem and removing that root cause from the equation.
For example, if sales are low, we could say that it is due to someone making wrong decisions. Or due to them not having all the right reports, or - even worse - due to the reports showing wrong data.
We work on the reports and on data quality using data quality tools, but we tend not to ask why the faulty data exists in the first place. Instead of avoiding the creation of erroneous data, we eliminate the errors in the data afterwards. The reason we do that (or at least why I often tend to) is that it is easier to fix data in an ETL process than to fix the process where the data was created and where people doing their jobs are involved.
Simply put: it is much easier to install one of the plethora of data quality or data governance tools than to change the organizational processes that are the reason for faulty data.
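To show what that symptom-fix looks like in practice, here is a hedged Python sketch (field names and values are invented) of a typical ETL-side cleansing step: it silently strips the letters out of polluted phone numbers so the reports look fine, while the form that lets users type letters keeps producing new bad records every day:

```python
import re

# Hypothetical records as they arrive from the source system.
source_rows = [
    {"customer": "A-1001", "phone": "+49 721 12345"},
    {"customer": "B-7734", "phone": "call me maybe"},   # polluted at entry
]

def cleanse_phone(raw: str) -> str:
    """Typical ETL-side 'fix': keep only digits, '+' and separators."""
    return re.sub(r"[^0-9+ \-]", "", raw).strip()

# The pipeline patches the symptom on every single run ...
cleansed_rows = [
    {**row, "phone": cleanse_phone(row["phone"])} for row in source_rows
]
print(cleansed_rows)
# ... and the second record now carries an empty phone number, while the
# input form that allowed the free-text entry is left untouched and will
# happily accept the next "call me maybe" tomorrow.
```

The data quality report improves, but the root cause stays exactly where it was.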
Root causes
This is where data governance enters the picture - not as yet another tool, but as a response to structural causes. The complexity of breaking up established processes and making a change for the better increases with each day that goes by. If you really want to ensure that data is used safely in the context of a bigger organization, you need new processes to monitor and act upon data quality. And you need to change the processes that cause the faulty data. This could mean as little as changing a user interface that users misinterpret, and as much as defining new responsibilities throughout your organizational units, even against their opposition.
Or, to clarify: Fixing the process always beats fixing the data.
If your process uses data of unclear origin and attempts to create information from it, you might end up with faulty information, as in the Google Bard case mentioned above, but you could also face legal consequences. For example, in November 2022 a lawsuit was filed against GitHub, Microsoft, and OpenAI, alleging that Codex (which powers Copilot) was trained on code without obeying open-source license terms (e.g., attribution). See the Reuters coverage of the lawsuit for more details.
These examples show that data governance goes beyond the processes at the beginning of the data lifecycle. Instead, the idea is to work on all processes that deal with your data and to control every step of its lifecycle. To do so in a holistic way, you of course need someone driving and orchestrating these efforts. In other words, you need to define responsibilities.
The beginning
This article is the beginning of a short series on data governance and how to achieve it. It is written from the organizational rather than the technical viewpoint, even though, as a technical person, I understand and prefer the technical one. Here I will try to adopt the organizational viewpoint to challenge myself and to make clear to myself and my fellow data workers what might cause purely technical changes to fail and how to make them a success.
I would like to encourage you not to wait until this series is finished. Instead, start thinking about data quality and your data management processes today. Find out where you stand. Detect the data issues that cause problems down the line and understand the flow of your data. Start forming a consortium of people who see the need to improve the way you handle data, and start talking about ownership and quality today.
The next article in this series will look at where to start developing a data governance strategy in your organization.
A German version of this post can be found on the virtual7 Blog.
Photos:
- Oil Rig: Photo by Jan-Rune Smenes Reite
- Toolbox: Photo by cottonbro studio
