Master Data (Whatever That Is) Has to Be Properly Managed

Master data management is rapidly being recognized as a strategic asset in the enterprise. It wasn't always like this, and the key word is management.

For too long, data was confined to the silo-based applications that produced it. Then, several years ago, technology enabled us to share data, primarily through implementing data warehouses and marts. Unfortunately, many data warehouse projects have failed to deliver what they promised.

The inability to share and exchange data effectively across many enterprises appears to have led business users to blame master data.

Perhaps the supreme value of data is that it can be used repeatedly without being consumed, unlike any other resource available to an enterprise. The dark side of this is that if an item of data has a defect and is repeatedly reused, then there is a huge negative multiplication of the defect. And master data is the most reused of all data. That's why it must be of extremely high quality. Unfortunately, it took major investments in data warehouses and collaborative transaction systems for many enterprises to realize this.

We are now witnessing an explosive growth in master data management (MDM) solutions, closely followed by an array of consultants and service providers. However, MDM is still not a mature area. People have quite different understandings of what MDM is, and the field is full of loosely defined terminology and fuzzy concepts. This situation presents real risks for anyone venturing into MDM.

What is Master Data?

Perhaps the most dangerous idea is that master data is a homogeneous set to which "one-size-fits-all" management techniques can be applied. However, master data consists of separate classes of data each with its own special properties and behaviors, and thus unique management needs.

Figure 1 shows how we can segment an enterprise's data resource into different layers if we consider the data from the viewpoint of supporting transactions in operational systems. If we look at all of the databases in an enterprise, it is possible to discern a pattern in the different types of data tables they contain. At the top of this hierarchy is the metadata layer that contains the definitions of database tables and columns. This metadata should stay unchanged for the lifespan of the database it describes. Any data quality problem, such as a column whose size is too small, can have a tremendous impact. Most importantly, metadata includes semantic content: tables and columns have meanings that must be understood by anyone using the data they contain.

The Three Types of Master Data

Below metadata is reference data. This is also known as code tables, domain values or valid values, and it is an important subclass of master data. Examples include Country, Currency, Product Type, Customer Credit Status, and so on. These database tables typically consist of a code column and a description column, and usually only a few records. Just to add to the confusion, some sectors, particularly finance, use the term reference data to mean master data. Reference data tables are the Rodney Dangerfield of the data world. They get very little respect because they are thought of as being small, simple, and not changing very much. Yet, in reality they are very important, and can make up anywhere from 20 percent to 50 percent of the number of tables in a database. Reference data can be defined as follows:

Reference Data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.

Some unique properties and behaviors of reference data that set it apart from other classes of data include:

• Codes have meanings, just like metadata (but unlike other classes of data), and this semantic content needs to be managed. E.g., a Customer Credit Status of "Bronze" may set credit limits for a customer.

• Codes drive business rules. If a business rule includes a data value found in a database, this will almost certainly be a code value from a reference data table.

• Reference data tables have to be fully and accurately populated when an application is being developed, long before it goes live.
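The properties listed above can be illustrated with a minimal sketch. It assumes a hypothetical Customer Credit Status code table held in memory. The codes "BRZ" and "SLV", their definitions, and the credit limit amounts are invented for illustration and do not come from any real system. The sketch shows a code carrying a managed definition, a business rule keyed on the code value, and a check that every code has a rule before go-live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreditStatus:
    code: str          # the value stored in other tables
    description: str   # short label shown to users
    definition: str    # managed semantic content: what the code means


# Hypothetical code table; codes and definitions are illustrative assumptions.
CREDIT_STATUS = {
    "BRZ": CreditStatus("BRZ", "Bronze", "New or low-volume customer"),
    "SLV": CreditStatus("SLV", "Silver", "Established customer with a good payment history"),
}

# Hypothetical business rule driven by the code value (amounts are made up).
CREDIT_LIMIT_BY_STATUS = {"BRZ": 1_000, "SLV": 10_000}


def credit_limit(status_code: str) -> int:
    """Return the credit limit implied by a Customer Credit Status code."""
    if status_code not in CREDIT_STATUS:
        raise ValueError(f"unknown Customer Credit Status code: {status_code!r}")
    return CREDIT_LIMIT_BY_STATUS[status_code]


def codes_without_rule() -> set:
    """Codes in the table that no business rule covers (should be empty before go-live)."""
    return set(CREDIT_STATUS) - set(CREDIT_LIMIT_BY_STATUS)
```

In this sketch the definition column is where the meaning of "Bronze" would be recorded, and codes_without_rule() is one way to confirm that the table and the rules that depend on it are fully populated during development.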

Next in the data hierarchy comes transaction structure data. This includes the two old favorites, Product and Customer. Transaction structure data tables define the parties to the transactions which an enterprise processes in its operational systems. For instance, if I buy a book online, the product information for the book has to be present, as do my customer details. Transaction structure data is another sub-class of master data but it is quite different to the reference data tables discussed above. It can be defined as follows:

Transaction structure data represents the direct participants in a transaction which must be present before a transaction fires.

Some of its unique properties and behaviors include:

• Transaction structure data tables contain more data elements (that is, fields or columns) than other tables in a typical database.

• Many of the data elements in transaction structure data tables have complex relationships. E.g., if Product Type is "Domestic Appliance" then "Working Voltage" must be populated, otherwise "Working Voltage" must be blank.

• Transaction structure data tables must be populated with information before the transactions they support can be fired.
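The dependency between Product Type and Working Voltage described above can be expressed as a simple validation check. This is a sketch only: the function name and the decision to treat an empty or whitespace-only string as blank are assumptions, not part of any particular MDM product.

```python
from typing import List, Optional


def validate_working_voltage(product_type: str, working_voltage: Optional[str]) -> List[str]:
    """Check the cross-field rule between Product Type and Working Voltage.

    Returns a list of error messages; an empty list means the record passes.
    """
    errors = []
    # Assumption: None, "" and whitespace-only strings all count as blank.
    has_voltage = working_voltage is not None and working_voltage.strip() != ""

    if product_type == "Domestic Appliance" and not has_voltage:
        errors.append('Working Voltage must be populated when Product Type is "Domestic Appliance"')
    if product_type != "Domestic Appliance" and has_voltage:
        errors.append('Working Voltage must be blank unless Product Type is "Domestic Appliance"')
    return errors
```

A real Product table would carry many such rules, which is part of what makes transaction structure data harder to manage than a two-column code table.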

Transaction structure data can also be further subdivided in terms of its management needs. For instance, the need for privacy management is increasing every year for Customer, and presents a special challenge. In the case of Product, getting to grips with the unstructured data that often resides in Product Description is an important need.

The final sub-class of master data is enterprise structure data. Although these tables are not present in every database, they are very important and can be defined as follows:

Enterprise structure data is data that permits business activity to be reported or analyzed by business responsibility.

Examples include Chart of Accounts and Organization Structure. Unique properties and behaviors of this sub-class of master data include:

• Enterprise structure data is typically very hierarchical.

• Enterprise structure data evolves over time, presenting challenges for the reporting of historical business activity in the current structure.
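The two properties above, hierarchy and change over time, can be sketched together. The example assumes a hypothetical organization structure and a hypothetical succession map recording which current unit took over from a retired one. All unit names and amounts are invented. Historical activity is first restated against the current structure and then rolled up through the hierarchy.

```python
from collections import defaultdict

# Hypothetical current organization structure: unit -> parent (None marks the root).
CURRENT_PARENT = {
    "Company": None,
    "Sales": "Company",
    "Sales East": "Sales",
    "Sales West": "Sales",
}

# Hypothetical succession map: retired unit -> unit that took over its responsibility.
SUCCESSOR = {"Sales North": "Sales East"}


def current_unit(unit):
    """Follow successions until reaching a unit that exists in the current structure."""
    seen = set()
    while unit not in CURRENT_PARENT:
        if unit in seen or unit not in SUCCESSOR:
            raise KeyError(f"no current equivalent for unit {unit!r}")
        seen.add(unit)
        unit = SUCCESSOR[unit]
    return unit


def roll_up(activity):
    """Total (unit, amount) pairs for every unit in the current hierarchy."""
    totals = defaultdict(int)
    for unit, amount in activity:
        node = current_unit(unit)
        # Add the amount to the unit itself and to each of its ancestors.
        while node is not None:
            totals[node] += amount
            node = CURRENT_PARENT[node]
    return dict(totals)
```

For example, activity recorded against the retired "Sales North" unit would be reported under "Sales East" and its ancestors. A one-to-one succession map is the simplest case; units that split or merge would need a richer mapping than this sketch provides.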

Below and Beyond Master Data

At this point we have reviewed the three different sub-classes of master data. The next layer of data in Figure 1 is transaction activity data. It is the traditional focus of information technology, and represents the actual transactions that flow through operational systems. Below this layer is transaction audit data, which tracks the progress of each transaction as it goes from initiation to termination. This type of data is typically stored in database logs or web logs, but often exists in some form in regular database tables too.

The perspective that master data consists of reference data, transaction structure data, and enterprise structure data is at odds with the perception that master data is any data in an enterprise that is shared, or any data that a database uses that is not created in the database. Such definitions do not bring out the special management needs of master data. Consider, for example, IBM's definition of MDM:

"IBM defines Master Data Management as the set of disciplines, technologies, and solutions used to create and maintain consistent, complete, contextual and accurate master data for all stakeholders. It focuses on the concept of master data objects which represent the key business entities that an organization interacts with to run the business. Core master data objects include products, organizations, locations, trading partners, employees, customers, consumers, citizens, assets, accounts, policies, etc." (http://www-306.ibm.com/software/data/masterdata/launch.html)

The dangers of not having a clear understanding of master data are many. One is scope creep in any MDM project. The inability to set clearly defined boundaries for master data can lead to expectations that a far greater set of data will be tamed by an MDM project than is practical. Stakeholders across an enterprise may genuinely believe that many of their data problems will be resolved via MDM when this is simply impossible.

Another danger that can arise from a fuzzy view of master data is assuming that one set of management techniques can apply to master data. Even some vendors say that MDM is nothing more than time-honored basic data management techniques put into a new format. This is simply not the case. The need to manage semantic content of codes in reference data tables is very important. It requires gathering precise definitions of values such as Customer Credit Status of "Bronze" and implementing a knowledge management infrastructure that will permit anyone in an enterprise to access these definitions. This is a need unique to reference data. De-duplicating Customer data is quite different, and does not apply to reference data. Transmitting Product data across an enterprise as a Product moves through its lifecycle is different again.

Master data is not a homogeneous class of data. The need for MDM to enable sharing of high quality master data is more urgent than ever. However, unless attention is paid to the different management needs of the different kinds of master data, generalized attempts to implement MDM may be doomed from the start.

Malcolm Chisholm has more than 20 years' worth of experience in strategic data management and metadata engineering. He runs the web site www.refdataportal.com.