Data Governance series Pt.2 - Why Fixing Reports Doesn’t Fix Data

Welcome to the jungle!

When enterprises embark on a data governance journey, they are often driven by problems with data. Countless times, a C-Level representative has seen a report that they disagree with. Usually, as a consequence, a Data Engineer, Data Scientist or Data Warehouse Developer then sets out on an expedition through the jungle of enterprise processes in an attempt to “fix” the “problem with the data”.

Finding the source

This “problem with the data” can have multiple sources.

  • Data creation: Is the data captured correctly in the source system? Business rules might not be technically enforced in the source system and only followed by convention. Data could also be stored inconsistently (e.g. “DE”, “Ger”, “Germany” and “Deutschland” as country values, depending on how the data got into the initial database).
  • Data movement and transformation: The data might have been processed incorrectly. This could simply be an error in the ETL process, such as a wrong join, or it could be caused by misusing columns to store other data (“we did not have a field to store the telephone number in, so we used address2 for that”).
  • Data definition and semantics: It might be unclear what a certain term means when different people in an organization use it. To machine operators in a factory, a “unit” might be the individual item that is produced, while to the logistics department it might be a box of multiple items that is shipped. These misalignments in communication can cause problems when building reports based on that data.
  • Data drift: As systems evolve from version to version, their databases change. An ETL pipeline might break and need to be fixed “ASAP”, or it might keep running and loading data while certain information silently goes missing. An optional field that used to be filled but no longer is after an upgrade can slowly erode the value of a report. By the time this is noticed and communicated, the version upgrade might not even be considered as the source of the problem.
  • Data consumption: The person reading the report might have asked a different business question than the report answers.

Trying to deal with all these different root causes the same way is bound to fail. Specific problems require specific solutions.
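
To make the data drift case above a bit more tangible, here is a minimal Python sketch of a check that compares how often an optional field is filled before and after a system upgrade. The field name `phone`, the sample records and the 10 percentage point tolerance are assumptions made up for illustration; they are not taken from any real system.

```python
from typing import Iterable, Mapping


def fill_rate(records: Iterable[Mapping], field: str) -> float:
    """Return the share of records in which `field` holds a non-empty value."""
    records = list(records)
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def fill_rate_dropped(before, after, field: str, tolerance: float = 0.10) -> bool:
    """True if the fill rate of `field` fell by more than `tolerance`."""
    return fill_rate(before, field) - fill_rate(after, field) > tolerance


# Hypothetical sample data: records loaded before and after an upgrade.
before_upgrade = [
    {"id": 1, "phone": "030 1234"},
    {"id": 2, "phone": "040 5678"},
    {"id": 3, "phone": ""},
    {"id": 4, "phone": "089 9012"},
]
after_upgrade = [
    {"id": 5, "phone": None},
    {"id": 6, "phone": ""},
    {"id": 7, "phone": "0221 3456"},
    {"id": 8},  # the field is missing entirely
]

if fill_rate_dropped(before_upgrade, after_upgrade, "phone"):
    print("Fill rate of 'phone' dropped noticeably, check the last upgrade.")
```

A check like this only addresses one of the root causes listed above. It says nothing about semantics or about whether the report answers the right question, which is exactly why each source needs its own kind of solution.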

Attempting to fix the problem

If the Data Engineer survives the jungle, they will return dehydrated and frustrated because typically they will have just uncovered more “problems” with the data and processes rather than fixed anything. So they will attempt to fix the problem where they first found it: in the Data Warehouse to make the report “look right”.

Data governance can really help break out of this vicious cycle. By committing to a company culture in which data is viewed as an asset that must be maintained and cultivated, these journeys can be avoided. But let’s take a step back first. In the last post in this series, I established the reason why I do not like the “data is the new oil” analogy and why for me data governance is much more than just the introduction of a new tool.

Finding the right approach

The (admittedly slightly tongue-in-cheek) description of the poor developer’s quest is exactly the reason why I see it this way. In the example I described above, the Data Engineer was on a quest to fix a problem with the data.

A lot of the issues with data are issues that you can fix in the ETL process just before the report. But let’s think a step ahead and assume you want to integrate more systems, make your whole operation digital. Then data drift or semantics might cause serious issues in your other operations. And when you think that far ahead, you will notice that just fixing an issue with data “downstream” cannot be a good solution when you want to run your organization on data and want to integrate as many of your operations as you can.

That is when your data issues change character. You no longer live in a “bottom up” world where the C-Level complains about a report and somebody goes into the machine room (which, depending on their responsibility, means a different system) to fix it. Instead, you have evolved into a “top down” organization where the data problem is no longer only a problem with a report but one that the organization as a whole needs to solve to ensure the effectiveness and flawless operation of its core business.

That means the data has evolved from a nice by-product shown on a management report into a central asset of the organization. All of a sudden, the data quality problem becomes an organizational problem: the source needs to be fixed, and documentation of how the data is interpreted has value in itself and needs to exist. At that point, you have embarked on the data governance journey.

![[pexels-sebastian-palomino-933481-1955134.jpg]] Into the unknown.

Choosing the right approach to data governance therefore starts with the organization’s strategic goals. If an organization wants to become more digital and truly data-driven, this must be reflected in concrete, realistic objectives. These objectives should define which processes and data domains matter most and when they need to be addressed rather than aiming for unrealistic completeness. A goal such as documenting and integrating data from a defined set of core processes within a two-year planning horizon is far more actionable than striving for perfect data everywhere.

Making it actionable

So that I do not leave you hanging with theoretical explorations and analogies, here is a short three-step checklist you can adapt and follow to start improving your data quality in an organized way:

  1. Pick one critical process and map its primary data sources.
  2. Document 5–10 key definitions for that domain and record who owns each.
  3. Assign a data steward and measure two simple success metrics over 6 months (e.g., percent of records conforming to key rules, number of incidents reported).
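
The first metric from step 3, the percentage of records conforming to key rules, could be computed along these lines. This is only a sketch: the two rules, the allowed country codes and the sample records are hypothetical and stand in for whatever definitions you documented in step 2.

```python
from typing import Callable, Dict, Iterable, Mapping

# Hypothetical rule set; in practice these come from the documented definitions.
ALLOWED_COUNTRIES = {"DE", "FR", "NL"}


def country_is_iso_code(record: Mapping) -> bool:
    """Country must be one of the agreed two-letter codes."""
    return record.get("country") in ALLOWED_COUNTRIES


def unit_count_is_positive(record: Mapping) -> bool:
    """Unit count must be a whole number greater than zero."""
    value = record.get("units")
    return isinstance(value, int) and value > 0


RULES: Dict[str, Callable[[Mapping], bool]] = {
    "country_is_iso_code": country_is_iso_code,
    "unit_count_is_positive": unit_count_is_positive,
}


def conformance_percent(records: Iterable[Mapping], rules) -> Dict[str, float]:
    """Return, per rule, the percentage of records that satisfy it."""
    records = list(records)
    if not records:
        return {name: 0.0 for name in rules}
    return {
        name: 100.0 * sum(1 for r in records if rule(r)) / len(records)
        for name, rule in rules.items()
    }


# Hypothetical sample records for one data domain.
sample = [
    {"country": "DE", "units": 12},
    {"country": "Germany", "units": 3},
    {"country": "Deutschland", "units": 0},
    {"country": "FR", "units": 7},
]

for name, percent in conformance_percent(sample, RULES).items():
    print(f"{name}: {percent:.1f}% of records conform")
```

Tracking these percentages per rule over the six months gives the data steward a simple trend to report, and it shows which definition is violated most often.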

In the next article I will discuss different kinds of data that can be found in an organization and how they can interact.

A German version of this post can be found on the virtual7 Blog

Read part 1 of this series
