Clean Your Input Data

Garbage in, garbage out? Not if you pay serious attention to cleaning your input data in the first part of your spreadsheet. And if possible, in such a way that it becomes an automated process for the next time you use the same spreadsheet.

Excel Guidelines for Professionals - No 16 :: Clean Your Input Data ::PerfectXL

It is likely that you will recognize the following situation. Your spreadsheet gets different sources in different formats from different systems. The sources turn out to be far from complete, but they also contain many duplications. One source uses points for decimals, the other uses commas. It contains complete tables that are of no use to you at all. And the date here is 11.03.2018, so is it supposed to be 3 November 2018 or 11 March 2018?  You need to have your data analysis ready by tomorrow…

Most Excel users dislike input data cleansing and prefer to start their analyses straight away. But cleaning up is really necessary! No matter how much pressure there is from your organisation to deliver.

Quick wins in the dataset?

Often there is al lot you can do in the dataset itself to make it usable for analyses. Replace points with commas, delete lines, deduplicate a function in Excel, etc. But please note: a manually modified dataset is very difficult to validate by controllers and accountants. The connection between the original input data and the modified “Excel source data” is a well-known frustration. So if possible, clean your data using formulas and reserve separate tabs for the original source data (1 source, 1 tab) and your automatically edited data.

We could write a book about cleaning up data using formulas. But for now we will limit ourselves to a few tips here:

Do not import more data than necessary

Do you get more information than you actually need from your input source? Don’t try to burden your own spreadsheet with it. Supplied tables that you don’t use shouldn’t be included in the spreadsheet. If possible, run the export scripts of other systems in such a way that they don’t supply useless columns (read our guideline “Delete what you don’t need”). In any case, make sure that the tab with the edited data doesn’t contain the unnecessary data.

Excel Guidelines for Professionals - No 16 :: Clean Your Input Data ::PerfectXL

Have numbers been recognized as numbers?

In some import files numbers can’t be recognized as numbers in Excel. Figures remain text fields, with all kinds of annoying consequences for somewhat more complex analyses. A one-time trick is to select the column in question and convert it to numbers via the menu Data-> text to data (don’t ask us why it works, but it works). But you will have to do this trick over and over again and then you’ll have to deal with the verifiability again. Fortunately, there is the formula NUMBERVALUE(A2), where you can indicate which symbol is used for decimals in the source.

What to do with empty fields?

There are datasets in which, for example, a “main category” of something is listed in column A, but only if the main category is not equal to the main category of the row above. The human eye understands this, but Excel doesn’t. Fill these empty cells in the enriched data set with functions such as IF(SOURCE!A3=”;A2;SOURCE!A3).

Superfluous spaces in text fields? Consistent capitalization?

It might not seem important, but it happens more often than you think, that imported text data in Excel ends up with one or more spaces. If this happens inconsistently, you’ll have problems with search functions and turntables. The TRIM (SOURCE!A2) function is a powerful way to correct this. If you want to combine this with consistent capitalization, you can choose PROPER(TRIM(SOURCE!A2)).

In summary: devote time and attention to clean your input data using formulas. This will save you a lot of misery later on. Garbage in -> great analyses out!

Never miss a Guideline!