Clean Your Input Data

Garbage in, garbage out. So pay serious attention to cleaning your input data as it is the foundation of your spreadsheet and if possible, do it in such a way that it becomes an automated process for the next time you use the same spreadsheet.

Excel Guidelines for Professionals - No 16 :: Clean Your Input Data ::PerfectXL

You will probably recognize the follow situation: Your spreadsheet has several input sources, from different systems, the format of this data is different for each system. The data received seems to be incomplete, and it also includes a lot duplicates. One system uses points for decimals, the other uses commas. It contains complete tables that are of no use to you at all, and the date used is 11.03.2018, does this mean 3 November 2018 or 11 March 2018?  All these issues and you are expected to have your data analysis ready by tomorrow…

Most Excel users dislike the process data cleansing and prefer to start their analyses straight away, but cleaning up is necessary! No matter how much pressure there is from your organization to deliver the analysis quickly.

Quick wins in the dataset?

Often there is a lot you can do in the dataset itself to make it usable for analyses. Replace points with commas, delete lines, delete duplicates, etc. Please note: a manually modified dataset is very difficult to validate by controllers and accountants. The connection between the original input data and the modified “Excel source data” is a well-known frustration. So, when possible, clean your data using formulas and reserve separate tabs for the original source data (1 source, 1 tab) and the cleaned up data.

We could write a book about cleaning up data using formulas, but for now we will limit ourselves to a few tips:

Do not import more data than necessary

Do you get more information than you actually need from your input source? Don’t burden your own spreadsheet with it. Supplied tables that you don’t use shouldn’t be included in the spreadsheet. If possible, run the export scripts of other systems in such a way that they don’t export useless additional columns (read our guideline “Delete what you don’t need”). At the very least, make sure that the tab with the edited data doesn’t contain the unnecessary data.

Excel Guidelines for Professionals - No 16 :: Clean Your Input Data ::PerfectXL

Have numbers been recognized as numbers?

In some import files numbers are not recognized as such in Excel. Figures remain text fields, with all kinds of annoying consequences for more complex analyses. A one-time trick is to select the column in question and convert it to numbers via the menu Data-> text to data (don’t ask us why it works, but it works). You will have to do this trick over and over again and then you will have to deal with the verifiability issues again. Fortunately, there is the formula, =NUMBERVALUE(A2), with which you can indicate which symbol is used for decimals in the source.

What to do with empty fields?

There are datasets in which, for example, a “main category” of something is listed in column A, but only if the main category is not equal to the main category of the row above. The human eye understands this, but Excel doesn’t. Fill these empty cells in the enriched data set with functions such as =IF(SOURCE!A3=”;A2;SOURCE!A3).

Unneeded spaces in text fields?

At first glance unneeded spaces in text fields might not seem important, but it happens often, imported text data in Excel ends up with one or more spaces. If this happens inconsistently, you’ll have problems with search functions and pivot tables. The “=TRIM(SOURCE!A2)” function is a powerful way to correct this. If you want to combine this with consistent capitalization, you can choose “=PROPER(TRIM(SOURCE!A2))”.

In summary: devote time and attention to clean your input data using formulas. This will save you a lot of misery later on. Garbage in can still result in a great analysis coming out.

Never miss a Guideline!