Database Cleansing - the what, why & how

Heather Maloney, November 1, 2006.

In this article I will attempt to set out simply what it means to clean a database, why you need to do it, and how you might go about it.

Firstly, the "What"

Cleaning a database is done to:

  • Remove duplicate records
  • Ensure your data is consistently formatted
  • Correct data that is obviously wrong e.g. wrong postcode for a known suburb
  • Find other records that are likely to be the same (more on this later)

So "Why" would you want to do that?

To explain why, I am going to use the example of a customer database, but the principles apply to other types of data also.

Have you ever received the same marketing message or catalogue in the mail two or more times? I receive multiple copies of such communications regularly, and I don't always get around to telling the sender about their mistake. Sending duplicates can:

  • be interpreted as sloppiness on the part of the organisation
  • undo your efforts to target / personalise - any attempt to "personalise" and "target" the message is wasted, because the recipient knows immediately that it was a mindless distribution of information using a database.
  • waste $$$! Every time you send a communication twice to the one person or household, you have most likely just wasted some of your hard-earned funds.

In addition, cleaning your data will help you to analyse it more accurately. For instance, you will know the real number of contacts and perhaps how they are geographically distributed, rather than relying on the distorted figures that come from analysing a corrupted database.

It's not a crime! In fact, it is very easy for your data to get into a state that requires cleaning. For example, when a client changes their address, your staff might update the suburb but forget to enter the new postcode. Or an existing client might return to your organisation several years later without telling new staff that they are already a client; if your database does not have appropriate keys to prevent duplicates, the client could be set up again as another customer with the same or similar details.

Having documented processes that your staff can use as a checklist, and appropriate unique keys on your database fields, will go some way towards keeping your data clean, but incorrect data can never be prevented entirely.
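As a rough illustration of what a unique key does, here is a minimal sketch in Python using the built-in sqlite3 module. The table name, column names and the add_customer helper are all assumptions made up for this example, and your own database will have a different structure.

```python
import sqlite3

# Hypothetical customer table, held in memory for the purposes of the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        surname    TEXT NOT NULL,
        address    TEXT NOT NULL,
        postcode   TEXT NOT NULL,
        -- The unique key: no two rows may share all four of these values.
        UNIQUE (first_name, surname, address, postcode)
    )
    """
)


def add_customer(first_name, surname, address, postcode):
    """Insert a customer; return False if the unique key rejects the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (first_name, surname, address, postcode) "
                "VALUES (?, ?, ?, ?)",
                (first_name, surname, address, postcode),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_attempt = add_customer("John", "Citizen", "PO Box 33", "3199")
second_attempt = add_customer("John", "Citizen", "PO Box 33", "3199")
variant_attempt = add_customer("Jonathon", "Citien", "14 Beach Road", "3199")
```

The second attempt is rejected because it is an exact copy of the first. The third is accepted even though it may well be the same person, which is why unique keys only go some way towards keeping data clean.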

"How", then, do you efficiently clean your database?

Fixing incorrect information, such as a postcode that does not match its suburb, is usually done by comparing each record to the correct values in another table. For example, to correct all the postcodes in your data, assuming that the suburb entered is correct, you would write SQL code that compares the postcode of each record against a table of postcode + suburb + state that you may have obtained from Australia Post. Such a process would likely also generate a list of records where the suburb was not found, requiring you to manually investigate and correct the data.
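The postcode comparison described above might look something like the following sketch, written in Python with the built-in sqlite3 module. The table names (customer, postcode_ref), the column names and the sample rows are all assumptions for illustration, and the reference rows are not taken from real Australia Post data. The sketch also assumes that each suburb + state pair appears only once in the reference table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical reference table of suburb + state + postcode.
    CREATE TABLE postcode_ref (suburb TEXT, state TEXT, postcode TEXT);
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT,
                           suburb TEXT, state TEXT, postcode TEXT);

    INSERT INTO postcode_ref VALUES ('FRANKSTON', 'VIC', '3199');
    INSERT INTO postcode_ref VALUES ('SEAFORD',   'VIC', '3198');

    INSERT INTO customer VALUES (1, 'John Citizen', 'Frankston',  'VIC', '3199');
    INSERT INTO customer VALUES (2, 'Mary Smith',   'Seaford',    'VIC', '3199');
    INSERT INTO customer VALUES (3, 'Ann Lee',      'Frankstone', 'VIC', '3199');
    """
)

# Step 1: where the suburb is found but the postcode differs, trust the suburb
# and overwrite the postcode with the reference value.
with conn:
    cursor = conn.execute(
        """
        UPDATE customer
        SET postcode = (SELECT r.postcode FROM postcode_ref r
                        WHERE r.suburb = UPPER(customer.suburb)
                          AND r.state = customer.state)
        WHERE EXISTS (SELECT 1 FROM postcode_ref r
                      WHERE r.suburb = UPPER(customer.suburb)
                        AND r.state = customer.state
                        AND r.postcode <> customer.postcode)
        """
    )
    corrected = cursor.rowcount

# Step 2: list the records whose suburb was not found at all, for manual review.
unmatched = conn.execute(
    """
    SELECT c.id, c.suburb
    FROM customer c
    LEFT JOIN postcode_ref r
           ON r.suburb = UPPER(c.suburb) AND r.state = c.state
    WHERE r.suburb IS NULL
    ORDER BY c.id
    """
).fetchall()
```

In this sample, one record has its postcode corrected and the misspelt suburb "Frankstone" ends up in the unmatched list for someone to investigate by hand.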

Correcting the formatting of your data is usually done using some pretty simple SQL, perhaps combined with some programming logic. You need to decide the format you wish to apply to your data, for example, whether you would like the suburb in title case or all capitals. While this is much less important than getting the data actually right, it can help to make your communications look more professional.
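To give a feel for this, the sketch below combines a one-line SQL UPDATE with a small Python function, using the built-in sqlite3 module. It assumes you have chosen title case for suburbs and capitals for states; the table, the columns and the title_case_suburb helper are hypothetical names invented for the example.

```python
import sqlite3


def title_case_suburb(suburb):
    """Collapse repeated spaces and capitalise the first letter of each word."""
    return " ".join(word.capitalize() for word in suburb.split())


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL statements on this connection.
conn.create_function("title_case_suburb", 1, title_case_suburb)

conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, suburb TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "FRANKSTON", "vic"),
        (2, "  frankston   south ", "Vic"),
        (3, "Frankston", "VIC"),
    ],
)

# Apply the agreed format to every record in one statement.
with conn:
    conn.execute(
        "UPDATE customer SET suburb = title_case_suburb(suburb), "
        "state = UPPER(TRIM(state))"
    )

rows = conn.execute("SELECT id, suburb, state FROM customer ORDER BY id").fetchall()
```

A simple rule like this will not suit every value. A name with a capital in the middle, such as "McKinnon", would come out as "Mckinnon", so it is worth checking the results before relying on them.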

Finding exact duplicates is a fairly easy task for someone who knows a little about the SQL database language. It is more difficult to find similar records that really are the same person, but are not listed in exactly the same way in your database. For instance, the following two records may actually be the same person:

  3442   John       Citizen   PO Box 33       Frankston   3199   VIC
  682    Jonathon   Citien    14 Beach Road   FRANKSTON   3199   VIC

Finding records such as the above calls for what is usually called "fuzzy" matching. Software is available to find such records, and more experienced SQL programmers could write their own software to find such possible duplicates.

Because you can't confidently use logic to determine whether or not two records are the same in the case given above, usually fuzzy matching would leave the data as is, but produce an exception report, highlighting likely duplicate records.
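As a very simplified sketch of the idea, the Python below scores how alike two names are using the standard library's difflib module, and lists likely duplicates without changing any data. This is only one simple similarity measure, not necessarily what commercial fuzzy matching software uses. The field names, the third sample record, the 0.75 threshold and the rule of only comparing records with the same postcode are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The two records from the article, plus one made-up record for contrast.
customers = [
    {"id": 3442, "first_name": "John", "surname": "Citizen",
     "address": "PO Box 33", "suburb": "Frankston", "postcode": "3199"},
    {"id": 682, "first_name": "Jonathon", "surname": "Citien",
     "address": "14 Beach Road", "suburb": "FRANKSTON", "postcode": "3199"},
    {"id": 9001, "first_name": "Mary", "surname": "Smith",
     "address": "2 Example Street", "suburb": "Frankston", "postcode": "3199"},
]


def full_name(record):
    return f"{record['first_name']} {record['surname']}"


def similarity(a, b):
    """Return a score from 0.0 (nothing alike) to 1.0 (identical), ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_duplicates(records, threshold=0.75):
    """Build an exception report of record pairs that look like the same person."""
    report = []
    for left, right in combinations(records, 2):
        # Assumed rule: only compare people recorded in the same postcode.
        if left["postcode"] != right["postcode"]:
            continue
        score = similarity(full_name(left), full_name(right))
        if score >= threshold:
            report.append((left["id"], right["id"], round(score, 2)))
    return report


exception_report = likely_duplicates(customers)
```

With this sample data the report contains the pair 3442 and 682 and nothing else. Comparing every record against every other record becomes slow on a large database, and the right threshold depends on your data, so treat both as starting points only.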

Even when you can determine confidently that two records are the same, you may wish to manually process the data cleanup to ensure that only the correct data is kept, and that all associated pieces of information, e.g. customer payment history, are transferred across to the valid record. It is possible, however, to set up your de-duplication process to remove all the duplicates and clean up all the records automatically.
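An automatic clean-up of exact duplicates could be sketched as follows, again in Python with the built-in sqlite3 module. The customer and payment tables, their columns and the sample rows are hypothetical, and keeping the record with the lowest id is simply an assumption for the example; you may prefer to keep the most recently updated record instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT,
                           surname TEXT, address TEXT, postcode TEXT);
    CREATE TABLE payment (id INTEGER PRIMARY KEY,
                          customer_id INTEGER REFERENCES customer(id),
                          amount REAL);

    INSERT INTO customer VALUES (1, 'John', 'Citizen', 'PO Box 33', '3199');
    INSERT INTO customer VALUES (2, 'John', 'Citizen', 'PO Box 33', '3199');
    INSERT INTO customer VALUES (3, 'Mary', 'Smith', '14 Beach Road', '3199');

    INSERT INTO payment VALUES (1, 1, 50.0);
    INSERT INTO payment VALUES (2, 2, 75.0);
    INSERT INTO payment VALUES (3, 3, 20.0);
    """
)

# Find groups of identical records and pair each extra copy with the record
# to keep (assumed here to be the one with the lowest id).
FIND_DUPLICATES = """
    SELECT c.id AS duplicate_id, k.keep_id
    FROM customer c
    JOIN (SELECT MIN(id) AS keep_id, first_name, surname, address, postcode
          FROM customer
          GROUP BY first_name, surname, address, postcode
          HAVING COUNT(*) > 1) k
      ON c.first_name = k.first_name AND c.surname = k.surname
     AND c.address = k.address AND c.postcode = k.postcode
    WHERE c.id <> k.keep_id
    ORDER BY c.id
"""

with conn:  # all changes are committed together, or rolled back on error
    duplicates = conn.execute(FIND_DUPLICATES).fetchall()
    for duplicate_id, keep_id in duplicates:
        # Transfer the payment history to the surviving record first...
        conn.execute(
            "UPDATE payment SET customer_id = ? WHERE customer_id = ?",
            (keep_id, duplicate_id),
        )
        # ...then remove the duplicate customer record.
        conn.execute("DELETE FROM customer WHERE id = ?", (duplicate_id,))

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM customer ORDER BY id")]
```

After this runs, customer 2 has been removed and both of John Citizen's payments belong to customer 1. Any other tables that refer to the customer would need the same treatment before the duplicate is deleted.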

Cleaning your database can take some time, and some manual effort on the part of your staff. If you are just starting out with a new database, it is very worthwhile to:

  1. Agree and document the data structure, and what information will be stored in what field (which isn't always obvious despite the names you might give fields)
  2. Agree the format of the data entered into each field
  3. Agree a process to handle the case where a record needs to be entered that won't fit into the current structure

If you need help cleaning your database, Contact Point can help you. We provide a quick and efficient service to deal with all the database issues discussed above, and can tailor our service to meet your particular needs. Submit a request now for an obligation-free quote.

Copyright Contact Point IT Services. Publication or use of this article on or off-line, without prior written permission from the author, is prohibited. If you would like to use this article on your Web site or in your publication, please contact us with details of your desired use.