Basic Definitions
Simply put (and we do need to keep things simple), Master Data is one of two (or three) classes of data.
There are many books that deal with the simple bullets above, but this is the in-a-nutshell version of the types of data. It gets very complex very quickly, but it always responds to the questions underlined above, so it can always be broken down into simple building blocks.
Transactional Data
Transactional Data holds values that sit within a range that is not fixed, like money or quantities. It cannot really be kept in a list somewhere, as the values can be anything within the logical range.
The main issue this class of data can experience is inaccuracy (someone types 10000 instead of 1000, for example). Fun note: around 2006 I was running the EMEA BI team for an American industrial group and we received an order for £1,000,000,000 for "Australia" when the correct figure was £100K for Austria; someone was really having a bad day. It is very hard to reconcile these values once inaccuracies creep in; usually one has to go back to the sales contract, which may be out of date. The important thing is to get the data right first time, every time, with logic to ensure consistency and with reconciliations as a safeguard.
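To make the "right first time" point concrete, here is a minimal sketch of an entry-time sanity check on an order value. The function name, field and threshold are hypothetical and would come from your own business rules, not from any standard.

```python
# A sketch of an entry-time sanity check on a transactional value.
# The threshold is a hypothetical business rule, not a fixed standard.

def check_order_value(amount_gbp: float, typical_max_gbp: float = 1_000_000) -> str:
    """Classify an incoming order value as ok, suspicious or invalid."""
    if amount_gbp <= 0:
        return "invalid: order value must be positive"
    if amount_gbp > typical_max_gbp:
        # Do not silently accept: route to a human to reconcile against
        # the sales contract before the record is booked.
        return "suspicious: exceeds typical maximum, needs manual review"
    return "ok"

print(check_order_value(100_000))        # ok
print(check_order_value(1_000_000_000))  # suspicious: the "Austria" order above
```

The point is not the particular limit but that the check runs at the moment of entry, before the bad figure ever needs reconciling.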
Master Data
Master Data is very different: it holds values that identify things, like a customer's company name (who), the specific product they bought (what), the concept of an order or a shipment (why) and where the product has to be shipped or the invoice address (where). These values are not normally in a small list (though they can be, for example if your company has a short product catalogue), but they must always stay in "high fidelity" with the reality they represent and only change when that reality changes. Remember, data is the mirror through which you obtain information about your organisation; clearly you would not want to get the wrong image.
The key issue with this class of data is usually duplicates, where a customer's company name may have been spelled differently by different sales managers (like RBS vs Royal Bank of Scotland vs R.B.S. vs RBS PLC). My last project had to deal with the "Heinz Ketchup" problem: 57 different spellings of a UK organisation. This is where the infamous concept of "data cleansing" usually comes in (see remediative techniques below), always after the fact, which really is of limited help.
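To illustrate, here is a minimal sketch of catching that kind of duplicate at the point of entry rather than cleansing it afterwards. The normalisation rules and the tiny customer master are hypothetical; a real solution would add business-agreed matching rules (and usually fuzzy matching) on top.

```python
import re

# Illustrative legal suffixes to strip; a real list would be business-agreed.
LEGAL_SUFFIXES = {"plc", "ltd", "limited", "llp"}

def normalise_company_name(name: str) -> str:
    """Reduce a company name to a crude canonical key for duplicate checks."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

# Hypothetical customer master, keyed by canonical name.
existing = {"rbs": "CUST-0001"}

for candidate in ["RBS", "R.B.S.", "RBS PLC", "Royal Bank of Scotland"]:
    key = normalise_company_name(candidate)
    if key in existing:
        print(f"'{candidate}' looks like a duplicate of {existing[key]}")
    else:
        print(f"'{candidate}' has no obvious match: create new or send for review")
```

Note that the last variant, "Royal Bank of Scotland", is not caught by simple normalisation, which is exactly why the matching rules have to come from the business rather than from a quick technical fix.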
Reference Data
Reference Data is a special subclass of Master Data. Where possible, the values for these records should be sourced from external or internal single lists of permitted values. External is preferred as the problem and the solution are outsourced (for example, Companies House company registration numbers are exquisitely curated by Gov.uk and the data is free to use). Another example is the list of countries in the world, counties in the UK and so on. Even postal addresses are maintained by different organisations in the UK that can be used as "reference".
The main issue with Reference Data is insufficient control over the shared list of admissible values, or no list at all; there is nothing worse for Reference Data than "free-text" fields. These fields should always follow the concept of a "drop-down" when human operators are involved in selecting values.
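As a minimal sketch of the "drop-down" principle, the check below only accepts a value that exists in a single shared reference list. The tiny country list is an illustrative subset; in practice the full list would be sourced from an external authority rather than typed by hand.

```python
# Illustrative subset of a shared reference list; in practice, sourced externally.
COUNTRIES = {"AT": "Austria", "AU": "Australia", "GB": "United Kingdom"}

def validate_country(code: str) -> str:
    """Accept only codes present in the shared reference list; never free text."""
    code = code.strip().upper()
    if code not in COUNTRIES:
        raise ValueError(f"'{code}' is not in the reference list of countries")
    return code

print(validate_country("at"))  # 'AT', i.e. Austria

try:
    validate_country("Austraia")  # a free-text typo a drop-down would have prevented
except ValueError as err:
    print(err)
```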
So, if we assume the values of transactional data will be entered accurately (with logical checks and balances, reconciliations and so on to provide safeguards) and Reference Data is kept in a single central internal or external "reference" list with "drop-downs" to select from, the only data family left is Master Data, which is up to each business to keep in top shape.
This is why Master Data Management (MDM) exists as a separate discipline in the data world.
Who is the Master?
How is data in your organisation created? I can think of two main entry points for data: machine to machine (imports, integrations) and what we love the most: people typing it (for example, a Customer Services operator receives a call from a customer to notify a change in invoice address). If we quickly rule out machine to machine ingestion as a source of problems (once designed, the data exchange will always be like-for-like), we are left with the question of people and the tools they have to do their jobs.
In short, the Master in Master Data is you. So whether you have a problem with duplicates or not, your current situation is, and has always been, up to you. This is really good news, actually; there is nothing worse than a problem over which one has no control. This control is what the Management in Master Data Management stands for, and it is a good thing that you can actually control your own destiny.
Who does the Management?
The actual management in MDM is, obviously, up to you. Of course "you" may be hundreds or thousands of people, and if any of these individuals or their information tools do not follow the same rules, then there is more than one "you" and "too many cooks spoil the broth" comes to mind.
So far, we can agree that one set of rules is required and that these need to be implementable: not just a solution design, an ambition or a vision in a white paper. The management of Master Data has to be defined before you start working with Master Data (and way before you go out to buy an MDM system). Of course that is impossible if you have been in business for more than five minutes, so we need to think about changing the way it works, assuming there is room for improvement (if your MDM is already perfect, then there is little to discuss and it's all good).
At this point, let's imagine that your MDM is not actually perfect and that you have, let's say, multiple instances of your customers, each for a different department (typical of service organisations), or a product catalogue that looks more like an infinite laundry list than a proper catalogue (typical in engineering firms). What are the steps to take?
There is the inevitable reality of the "historical" records that you know have a problem; there is no escaping this once it has happened. And there is what you start doing today to prevent it from happening in the future.
Techniques
Historical Master Data (a.k.a. Data Cleansing)
Let's get this out of the way quickly: data is not like dishes; you cannot "clean" data. Data Cleansing is a fallacy, a term used by vendors and contractors and sometimes by unrealistic senior management. Data cannot be cleansed; in fact, data cannot be dirty. Period.
What we have to think in terms of is "Quality": what is the quality of the data? And quality has several dimensions. We have talked about two of them earlier: accuracy and unambiguity. There are more, completeness for example, and a few others.
For each of the dimensions the business (I will repeat this: the business) has to define the parameters of quality that apply to that dimension of a specific column of attributes, the values that are permissible and the tolerance levels. IT will not, and cannot, do this for you. What IT can do is implement the rules, and do this consistently across your entire estate. Only when these rules are written down as specifications should you deploy an MDM system (which is going to cost a lot). All MDM systems are "vanilla"; they do not have these rules out of the box (that is another fallacy I may write about in a future post).
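As a minimal sketch of what "rules written down as specifications" can look like once IT implements them, here is an illustrative set of business-defined rules (attribute, quality dimension, check and tolerance) measured against a couple of records. Every name and threshold below is a placeholder, not a standard.

```python
# Illustrative business-defined rules: attribute, quality dimension,
# check and tolerance (the share of records allowed to fail the check).

RULES = [
    {"attribute": "customer_name", "dimension": "completeness",
     "check": lambda v: bool(v and v.strip()), "tolerance": 0.00},
    {"attribute": "country_code", "dimension": "unambiguity",
     "check": lambda v: v in {"AT", "AU", "GB"}, "tolerance": 0.00},
    {"attribute": "order_value", "dimension": "accuracy",
     "check": lambda v: v is not None and 0 < v <= 1_000_000, "tolerance": 0.01},
]

def measure(records):
    """Report, per rule, the failure rate and whether it breaches the tolerance."""
    for rule in RULES:
        failures = sum(1 for r in records if not rule["check"](r.get(rule["attribute"])))
        rate = failures / len(records) if records else 0.0
        status = "OK" if rate <= rule["tolerance"] else "BREACH"
        print(f"{rule['attribute']} / {rule['dimension']}: fail rate {rate:.0%} -> {status}")

measure([
    {"customer_name": "Royal Bank of Scotland", "country_code": "GB", "order_value": 100_000},
    {"customer_name": "", "country_code": "Austria", "order_value": 1_000_000_000},
])
```

The business owns the table of rules and tolerances; IT owns the loop that applies them, consistently, across the whole estate.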
Your future starts today (cheesy, I know)
It is only when you have the business rules for each of your groups of data that you can implement them to prevent these problems from happening. But here again, a word of caution: do not do this as a project. Information and data are the blood that runs through the entire business body; the minute you finish the project, it will start to go wrong. This is meant to be (following the analogy of the blood in the business body) a consistent and constant behaviour that keeps you healthy. When you stop a diet or stop your exercise regime, we all know what happens to all that weight you lost.
DevOps? BAU?
It does not matter what model you choose, but you have to assign the accountability and responsibility to your people, permanent people, as an ongoing effort that you should measure and, crucially, reward generously. My rule of thumb is one Data Governance Manager for every 100 employees, one Data Governance Architect for every 1,000 employees and a single Chief Data Officer for the entire organisation. A rule of thumb, of course; it depends on the business. But always as a constant, permanent, ongoing activity. Once you are at cruise altitude, you may even decide to outsource some of the crunching; recently Talend launched a Data Quality as a Service offering, which is worth looking at.
The Data Bakery
So, why a Bakery? Well, the way a bakery works illustrates pretty much every business there is. There are three domains in every business: incoming goods, processing and outgoing goods. Costs follow that flow in that direction; revenue follows the flow in the opposite direction. And since data, as we have said and as I hope you agree by now, is the Hi-Fi reflection of the business, data flows the same way.
So you need procurement and the data that reflects it; you need operations and the data that flows from procurement and reflects operations; and then you have the customer domain, which takes the data from operations (and implicitly from procurement) all the way to the point where you get paid by the customer.
Flour and water, bakers, and the bakery store where the customer buys, hopefully, the best bread around. Once you have this basic model (this is what AVANZ.IO do best), you break it down into smaller processes, sub-processes and single activities at the business level, coupled with its mirror image at the data level. Then you identify the Quality dimensions for each fine-grained data element; then, as above, you define the parameters for each dimension and the admissible values and tolerances. Then you may want to set up a DevOps team to start it off and embed it in the fabric of your business.
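For illustration only, here is a minimal sketch of that decomposition captured as data, from domains down to individual data elements and the quality dimensions chosen for each. All the names are placeholders for whatever your own model would contain.

```python
# Illustrative decomposition: domain -> process -> data element -> quality dimensions.
BUSINESS_MODEL = {
    "incoming goods (flour and water)": {
        "purchase raw materials": {
            "supplier_name": ["unambiguity", "completeness"],
            "unit_price":    ["accuracy"],
        },
    },
    "processing (bakers)": {
        "bake bread": {
            "recipe_id":  ["unambiguity"],
            "batch_size": ["accuracy"],
        },
    },
    "outgoing goods (bakery store)": {
        "sell to customer": {
            "customer_name":   ["unambiguity", "completeness"],
            "invoice_address": ["completeness"],
        },
    },
}

# Walk the model to list every data element still needing parameters and tolerances.
for domain, processes in BUSINESS_MODEL.items():
    for process, elements in processes.items():
        for element, dimensions in elements.items():
            print(f"{domain} > {process} > {element}: define parameters for {dimensions}")
```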
And if you are in doubt about any of these points, take a look at GDPR, SOX and the white papers from the UK Information Commissioner. These are all achievable by following the Bakery methodology.
This is what we do.
AVANZ.IO