Introduction

The focus of these notes is on inspecting and cleaning data before feeding it to models. The importance of this task is captured by the dictum, “Garbage in, garbage out”.


All models have assumptions and limitations. “Garbage in” refers to inputting data that violates critical model assumptions, and “Garbage out” refers to the inaccurate results produced by feeding garbage to your models.

As data scientists, a HUGE part of our job is making sure we aren’t putting garbage into models. In fact, for many of us, this is the most time-consuming part of the job!

The process of preparing data for modeling is so important that it deserves to be formalized, which brings us to a core concept of the course: The Inspect-Transform-Inspect workflow (ITI).

Motivating example

Intuitively, linear regression draws a line through the “middle” of the data. Mathematically, the “middle” is usually defined as the line that minimizes the Sum of the Squared Errors (SSE), or the least squares line:
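In symbols, SSE = Σᵢ (yᵢ − ŷᵢ)², where ŷᵢ is the line’s predicted value for observation i; the least squares line is the one that makes this sum as small as possible.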

When the errors are scattered evenly around the line (i.e., normally distributed errors), the least squares line produces the most accurate possible predictions. However, when the errors are NOT normally distributed, the least squares line can be misleading. Below, I simulate some data with outliers and skew to demonstrate two common ways in which non-normal errors can confuse linear regression:

library('tidyverse')
library('gridExtra')

set.seed(23)
n <- 100

# Simulate normally distributed x and y
normal_x <- rnorm(n)
y <- normal_x + rnorm(n, sd=0.25)

# Outlier version of x
outlier_x <- normal_x
outlier_x[which(outlier_x==min(outlier_x))] <- outlier_x[which(outlier_x==min(outlier_x))] + 10

# Skewed version of x
skewed_x <- exp(normal_x)

# Combine variables into dataframe
df <- data.frame(y, normal_x, outlier_x, skewed_x)

# Plot
p1 <- df %>% ggplot(aes(normal_x, y)) + ylim(-3, 3) +
  geom_point() + geom_smooth(method='lm') + ggtitle('Good')
p2 <- df %>% ggplot(aes(outlier_x, y)) + ylim(-3, 3) +
  geom_point() + geom_smooth(method='lm') + ggtitle('Bad - outlier')
p3 <- df %>% ggplot(aes(skewed_x, y)) + ylim(-3, 3) +
  geom_point() + geom_smooth(method='lm') + ggtitle('Bad - nonlinearity (skew)')

# Plot linear regressions
grid.arrange(p1, p2, p3, nrow=1)

The first plot above shows a ‘healthy’ linear regression of y ~ x, where both y and x are normally distributed. The middle plot shows what happens to the line if we insert a single severe outlier into x, and the plot on the far right shows what happens if we force x to be skewed by exponentiating it.

Notice how in both the middle and right plots, the line does not go through the ‘middle’ of the data and the points are not scattered evenly around the line. If we were to use these models to make predictions, they would be extremely biased and inaccurate!

THIS is why we must inspect and transform our data before feeding it into our models. In this case, since we have continuous features, we can inspect them individually with histograms:

# Plot histograms of the different versions of x
p1 <- df %>% ggplot(aes(x=normal_x)) + geom_histogram(bins=9)
p2 <- df %>% ggplot(aes(x=outlier_x)) + geom_histogram(bins=9) 
p3 <- df %>% ggplot(aes(x=skewed_x)) + geom_histogram(bins=9) 
grid.arrange(p1, p2, p3, nrow=1, top="Original predictor variables")

Now, we can fix the problems with some standard variable transformations and plot the histograms again to show that the variables are ‘fixed’ (don’t worry, we will discuss these transformations shortly):

# Drop outliers
df <- df %>% filter(abs(outlier_x) < 5)

# Log transform the skewed variable
df$skewed_x <- log(df$skewed_x)

# Plot histograms of the transformed versions of x
p1 <- df %>% ggplot(aes(x=normal_x)) + geom_histogram(bins=9) 
p2 <- df %>% ggplot(aes(x=outlier_x)) + geom_histogram(bins=9)
p3 <- df %>% ggplot(aes(x=skewed_x)) + geom_histogram(bins=9)
grid.arrange(p1, p2, p3, nrow=1, top="Transformed predictor variables")

They all look relatively normal now, so we can refit the linear regressions on the transformed data and observe that all three now look like healthy linear regressions:

# Plot
p1 <- df %>% ggplot(aes(normal_x, y)) + 
  geom_point() + geom_smooth(method='lm') + ggtitle('Good')
p2 <- df %>% ggplot(aes(outlier_x, y)) + 
  geom_point() + geom_smooth(method='lm') + ggtitle('Good')
p3 <- df %>% ggplot(aes(skewed_x, y)) + 
  geom_point() + geom_smooth(method='lm') + ggtitle('Good')

# Plot linear regressions
grid.arrange(p1, p2, p3, nrow=1)

Much better! In the next section, we will generalize the process that unfolded above to other types of variables.

Inspect-Transform-Inspect workflow (ITI)

The ITI workflow is an iterative process in which you inspect and transform your variables until they are ‘clean’. This messy process is almost always hidden from view in the final version of a data science project, but in reality it occupies much of your time as an analyst and has a huge impact on the quality of your results. It is the unglamorous part of scientific research in which you painstakingly set up your experiment and obsess over the details that could otherwise compromise your results.

In this course, rather than glossing over the dirty work, we will emphasize and expose it in our assignments. Every variable that you are considering for your models must go through the ITI workflow. You must demonstrate your ability to recognize problems in the data, fix them with code, and communicate your thought process effectively. This is what data science is about!

The workflow for a single variable goes like this:

  1. Plot the variable using an appropriate plot type based on the data type. Inspect the plot and interpret what you see in words. Note anything interesting and mention any transformations that might be needed (if there aren’t any, just say so).
  2. Transform the variable, or leave it alone, based on your observations above.
  3. Plot a final inspection of the variable to show your reader that the variable is now appropriately transformed and “passes” your inspection.

You can apply this workflow to each variable one at a time. However, an even better approach is to consolidate all of your plots into one “before transformation” panel of plots and one “after transformation” panel of plots, with each panel including a plot for each variable. You can combine arbitrary ggplot plots into multi-panel plots using the grid.arrange function from the gridExtra package, and there are many examples of how to do this in the Instructor Notes. For example, if you have 6 variables to inspect, your report could include:

  1. Initial inspection: One 2 x 3 panel plot showing each variable prior to transformation. After the plot is a block of narrative text, organized with a bullet point for each of the 6 variables, describing the plots and your ideas for transforming each variable.
  2. Transform: A block of code that executes the transformations to your variables, based on your observations above.
  3. Final inspection: Reproduce the 2 x 3 plot, except using the newly transformed variables, to demonstrate how the variables have been changed and improved by the transformations you just did.

Communication is absolutely critical: The ITI workflow is not only a method for doing careful work, it is also about proving to your reader that you are doing careful work. The reason I recommend consolidating your plots is purely to make it easier to communicate your steps to the reader. In the end, it doesn’t matter how careful you are if you can’t convince your reader. This is very important when you are applying for analyst jobs, because employers want to know that they can trust you to work independently and to put in the effort to not only DO the work, but also SHARE it in a way that is easy for others to understand. It is the exact same reason why companies selling exercise equipment or weight-loss drugs provide before-after photos: You have to show people what the impact is!

In addition, it is important to know that the ITI workflow is not only about finding problems: It can also lead to new questions and insights beyond the original goals of the project. As you go through the workflow, keep an open mind and don’t be afraid to follow up on something unusual or interesting that you notice in your plots, even if it wasn’t part of your original plan. For example, if you see an outlier, a decent analyst recognizes it and removes it, but an excellent analyst digs deeper to explain why the outlier is there in the first place, because that could lead to interesting insights and impress your reader with your thoroughness. Finding hidden value in datasets will make you stand out as an analyst!

In the next section, we discuss the most common types of inspections and transformations for individual variables.

ITI for individual variables

This week, the focus is on how to inspect and transform individual variables. This means that we will only be considering one variable at a time. We will turn our attention to multivariate relationships next week.

Generally speaking, there are two types of inspections: visual and numerical. Visual inspections involve plots, while numerical inspections involve tables or summary statistics. When you first pull a new dataframe into R, the first thing you should do is get a quick summary of all the variables to form a big-picture idea of what is in the dataset. I like to use the summary function, since it shows you a lot of useful information about each variable. Here is an example using the built-in iris dataset:

# load built-in dataframe
data(iris)

# print summary
iris %>% summary
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# print structure
iris %>% str
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Take a moment to inspect the output from these two functions. The summary function computes summary statistics for continuous variables and counts for categorical variables (although if a category has many levels, it will only show counts for a few of them to save space). The str function shows you how many observations (rows) and variables (columns) you have, the type of each variable (e.g., numeric or factor), and the first few values of each variable.

Sometimes you can spot glaring errors in the data at this stage, such as negative values in a variable that should be strictly positive (e.g., ‘height’ should never be negative, so any negative values must be errors). You can also get an early hint that a continuous feature is skewed if its mean and median are very different. The summary function also shows you how many missing values are in each column, although in this case there are no missing values. But really, the summary and str functions are just a quick peek at the data before you dive in to inspect each variable with plots.
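For example, here is a quick sketch (using simulated data rather than a variable from iris) of how skew pulls the mean away from the median:

# Simulated right-skewed data: the mean is dragged toward the long right tail,
# while the median stays near the bulk of the data
skewed_example <- rexp(1000, rate = 1)
mean(skewed_example)    # roughly 1
median(skewed_example)  # roughly 0.7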

For the rest of these notes, we are going to focus on visual inspections. What types of plots should you use to inspect individual variables, and how can you fix common problems? It depends on the variable type! Let’s go through each of the four variable types: continuous, ordinal, multinomial, and binary. Finally, the section ends with a discussion of how to handle missing values, which is a common issue that occurs for all data types.

Continuous variables

We have already seen that the standard way to inspect a continuous variable is with a histogram. Ideally, we want to see a normally distributed variable, i.e., the classic “bell curve”.

When you plot histograms, the goal is to get an idea of the overall shape of the distribution. You need to choose an appropriate number of bins so that you don’t show too much or too little detail. Below, I simulate a perfectly normal distribution that should not require any data cleaning steps, and I show how different numbers of bins can lead to histograms with either too little or too much detail:

# Simulate normal data
df <- data.frame(x = rnorm(1000))

# Plot, using the 'bins' argument to choose a nice number of bins
p1 <- ggplot(df, aes(x=x)) + geom_histogram(bins=3) + ggtitle('Too few bins (not enough detail).')
p2 <- ggplot(df, aes(x=x)) + geom_histogram(bins=12) + ggtitle('Just right!')
p3 <- ggplot(df, aes(x=x)) + geom_histogram(bins=100) + ggtitle('Too many bins (too much detail).')
grid.arrange(p1, p2, p3, nrow=1)

Note that if you have a somewhat small dataset (e.g., < 500 rows), even truly “normal” data can show random fluctuations that cause the histogram to deviate from a perfect bell curve. A common beginner mistake is to overreact to minor deviations from normality. With experience, you will get better at differentiating serious deviations from normality from benign wiggles due to random chance.
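For example, here is a sketch of what a perfectly normal variable can look like with only 100 observations:

# Even truly normal data looks a bit lumpy in a small sample
small_df <- data.frame(x = rnorm(100))
ggplot(small_df, aes(x=x)) + geom_histogram(bins=12) +
  ggtitle('100 draws from a normal distribution')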

The most common types of deviations from normality are outliers, skew, and multi-modal distributions. Let’s discuss each issue in turn:

Outliers: An outlier is an anomalous data point that falls far from the other points in the distribution. There is no single objective criterion for what counts as an outlier, so you have to use your judgement and justify your choice. In general, you should consider something an outlier if it is either an error (e.g., a data entry error) or if it has a dramatic influence on your models. Outliers can be handled in the same way as missing values, which are discussed at the end of these notes.

Skew: Skewed variables are a common cause of poor model fit, especially if your target variable is skewed. Data can be either ‘right skewed’ or ‘left skewed’. To remember which is which, think of the skew direction as the direction a ball would roll if you placed it at the highest point of the distribution:

# Simulate left and right skewed data
x <- rnbinom(1000, 10, .25)
df <- data.frame(x_right_skew = x, x_left_skew = -x + 200)

# Plots
p1 <- ggplot(df, aes(x=x_right_skew)) + geom_histogram(bins=12) + ggtitle('Right Skewed Data')
p2 <- ggplot(df, aes(x=x_left_skew)) + geom_histogram(bins=12) + ggtitle('Left Skewed Data')
grid.arrange(p1, p2, nrow=1)

Right skewed data can often be fixed with a square root or log transformation (you can imagine these as squishing the x-axis so that larger values shrink more than small values). Left skewed data can often be fixed by raising to a power greater than 1, e.g., squaring or cubing the values (you can imagine these as stretching the x-axis out so that larger values get stretched more than small values). Below, I demonstrate how different transformations can be used to make the data above (almost) normally distributed:

# Log transformation of the right skewed variable
df$x_right_skew_transformed <- log(df$x_right_skew)
  
# Cube transformation of the left skewed variable
df$x_left_skew_transformed <- df$x_left_skew^3

# Plots
p1 <- ggplot(df, aes(x=x_right_skew_transformed)) + 
  geom_histogram(bins=10) + ggtitle('Log transformed right skewed data')
p2 <- ggplot(df, aes(x=x_left_skew_transformed)) +
  geom_histogram(bins=10) + ggtitle('Cube transformed left skewed data')
grid.arrange(p1, p2, nrow=1)

The variables are now much less skewed than before. Note that the log transformation assumes your data are all strictly positive. So, if you have zeroes or negative values, you must add a constant to make all the values positive before transforming. For example, if your data contain only zeroes and positive values, you can add 1 to eliminate the zeroes before taking the log. This trick is so commonly used that R has a built-in function, log1p, which simply adds 1 and then log transforms.
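Here is a quick sketch of the problem and the fix:

# log() of zero gives -Inf (and negative values give NaN), which can silently cause problems later
vals <- c(0, 1, 10, 100)
log(vals)    # the first value is -Inf
log1p(vals)  # equivalent to log(vals + 1); all values are finite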

You may run into cases where it is impossible to find a good transformation for a skewed variable. If the skewed variable is a predictor, you might be able to get away with it if it does not have a serious effect on model performance. If the skewed variable is the target, you might need to use more advanced modeling techniques than plain linear regression. We won’t cover these more advanced models in this course, so you should avoid trying to model a skewed or otherwise highly non-normal continuous target variable.

Multi-modal distribution: This refers to a situation where your data has multiple peaks. Multi-modal distributions usually arise when your data includes multiple hidden groups. A classic example would be if you have data on weight for a mixed-sex sample of people, and there are two peaks: one for men and one for women:

# Simulate data for men's and women's weight
men_weight_lbs <- rnorm(500, mean=175, sd=10)
women_weight_lbs <- rnorm(500, mean=135, sd=10)
combined_weight <- data.frame(weight = c(men_weight_lbs, women_weight_lbs))

# Plot
ggplot(combined_weight, aes(x=weight)) + geom_histogram(bins=15) + ggtitle('Bimodal Distribution')

Multi-modal data indicates that there is some structure in the data that you should try to explain. For example, it could be that your sample includes a mixture of men and women, and the distribution has a peak for each sex. Alternatively, it could be that your data has some sort of bias or complexity built into it. For example, star rating systems on apps often have a bimodal distribution, because users are most likely to rate an app if they really love it or really hate it, while people with intermediate feelings aren’t as motivated to leave a rating. Another example is rainfall amount, which has a huge peak at 0 for all the days on which it does not rain, and a second peak reflecting the typical amount of rain on the days when it does.

Whenever you see multiple peaks in a distribution, you should try to explain it and think about what this means for your project. If the multi-modality reflects some underlying groupings (e.g., sex), you may want to “control” for that grouping variable in your models to ensure that you don’t get spurious results. If your data was generated by some biased or complex process, it might make sense to bin the data (e.g., transform rainfall amount into binary yes/no variable).
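For example, here is a hypothetical sketch of binning a zero-inflated rainfall variable (the values are made up, and the ifelse function is covered in more detail below):

# Hypothetical rainfall amounts (in inches) with many zero days
rainfall_inches <- c(0, 0, 0.5, 0, 2.1, 0, 0.1, 0, 0, 1.4)

# Collapse into a binary rained / did-not-rain variable
rained <- ifelse(rainfall_inches > 0, 1, 0)
table(rained)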

Ordinal variables (i.e., ordered categories)

Recall that ordinal variables have an order to them, but they can only take a few discrete values. A common example is star rating systems, e.g., Amazon star ratings, which clearly have an order to them, but can only take the values 1, 2, 3, 4, or 5.

The correct way to plot an ordinal variable is with a barplot:

# Simulate ordinal variable with possible values 1-5
x <- sample(1:5, 100, replace=TRUE)
df <- data.frame(x)

# Plot
ggplot(df, aes(x=factor(x))) + geom_bar() + ggtitle('Ordinal variable')

A common mistake is to confuse histograms and barplots. They are similar, but they are NOT the same. A histogram does not have any space between the bars, reflecting the fact that there IS no space between the bins. In contrast, a bar plot has spaces between the bars, reflecting the fact that values between these spaces are impossible. This might not seem important, but if you use the wrong type of plot for your variables, it makes your work look sloppy and calls into question whether you understand this basic distinction between continuous and discrete variables.
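To see the difference side by side, here is a sketch plotting the same simulated 1-5 ratings both ways:

# The same 1-5 ratings as a histogram (bars touch) and as a barplot (bars separated)
p1 <- ggplot(df, aes(x=x)) + geom_histogram(bins=5) + ggtitle('Histogram (continuous axis)')
p2 <- ggplot(df, aes(x=factor(x))) + geom_bar() + ggtitle('Barplot (discrete axis)')
grid.arrange(p1, p2, nrow=1)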

An important related point is that you can NOT use linear regression with an ordinal target variable because it will predict impossible values (e.g., with the Amazon star ratings example, a linear regression could predict a star rating of 1.2 or 6, which are impossible values). Instead, if you have an ordinal outcome variable, you can transform it into a binary variable and model it with logistic regression.

The easiest way to bin an ordinal variable is by using an ifelse statement. This is an incredibly useful function that you should master:

# Bin 1-5 ratings into a binary variable indicating whether the value is greater than two
df$greater_than_two <- ifelse(df$x > 2, 1, 0) 

# Plot
ggplot(df, aes(x=factor(greater_than_two))) + 
  geom_bar() + ggtitle('Binned version of ordinal variable')

You can read the ifelse statement like this: If df$x is greater than 2, let the new variable be 1, otherwise let the new variable be 0. Also note the use of the factor function in the ggplot, which tells R to treat this as a categorical (factor) rather than a numerical variable. If you don’t use the factor function, the x-axis of the plot may show a range of numerical values such as 0.5 and 1.5 rather than only the binary numbers 0 and 1, which shows that ggplot thinks it is a numerical variable rather than categorical.

If you have an ordinal predictor variable, you do not necessarily need to bin it. Whether you bin it depends on whether the numbers have a linear relationship to the target variable. We will discuss this more next week when we get into multivariable plotting. For now, you should only worry about binning ordinal target variables. For predictor variables, simply plot with a barplot and describe the distribution.

You might be wondering, how should you choose the cutoff for binning an ordinal variable into two groups? Sometimes there is a very obvious cutoff point. But other times, as in the example above, it isn’t very clear where to draw the line. If there is no natural breakpoint in the distribution, I would try to cut it roughly in half. So in the example above, a threshold of either 2 or 3 would be fine. However, take the example below:

df <- data.frame(x = c(rep(5, 200), rep(4, 5), rep(3, 10), rep(2, 7), rep(1, 50)))

ggplot(df, aes(x=factor(x))) + geom_bar() + 
  ggtitle("It makes sense to make 5 its own group")

Patterns such as the one above are pretty common in rating systems. Take Uber star ratings: most people give 5 stars every time unless something is seriously wrong, in which case they will go to the other extreme of 1 star, but people rarely give 2-4 stars. In such cases, it makes sense to take one extreme, such as 5 stars, and create a binary variable for 5 stars vs less than 5 stars, and model it with logistic regression.
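With the simulated ratings above, a sketch of this binning might look like:

# Bin into '5 stars' vs. 'less than 5 stars'
df$five_stars <- ifelse(df$x == 5, 1, 0)

# Plot the binned variable
ggplot(df, aes(x=factor(five_stars))) + geom_bar() +
  ggtitle('5 stars vs. less than 5 stars')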

Reminder: At a certain point, the line between an ordinal and a continuous feature becomes blurry. For example, if we have an ‘age’ variable in years, which can take any value between 1 and ~100, this is technically an ordinal variable, but it has so many levels that we may as well treat it as continuous. Deciding how to treat a given variable is more art than science, and you will have to use your judgement!

Multinomial variables (i.e., unordered categories)

Recall that multinomial variables are discrete like ordinal variables, however unlike ordinal variables, they do not represent an ordered scale. Examples could be different types of animals, different countries, or vehicle license plate numbers (that’s right, numbers can be considered unordered categories as long as the numbers are not actually meaningful).

As with ordinal variables, the correct way to plot a multinomial variable is with a barplot. However, a common problem that arises with multinomial variables is having too many categories to display nicely in one plot. For example, here is a barplot for a categorical variable that has 30 levels:

# Simulate 1000 categorical data points with 30 possible levels
df <- data.frame(multinomial_variable = sample(paste0('category_', 1:30), 1000, replace=TRUE))

# Plot
ggplot(df, aes(x=multinomial_variable)) + geom_bar() + 
  ggtitle('Multinomial variable with 30 levels') 

This is problematic because we can’t see any of the labels. Luckily, in a case where we have about 5-50 categories, a simple trick can make the labels easier to read. By adding the coord_flip layer to our plot, we can force the labels to appear on the y-axis rather than the x-axis, which prevents them from overlapping:

# Simulate 1000 categorical data points with 30 possible levels
df <- data.frame(multinomial_variable = sample(paste0('category_', 1:30), 1000, replace=TRUE))

# Plot with coord_flip
ggplot(df, aes(x=multinomial_variable)) + geom_bar() + 
  ggtitle('Multinomial variable with 30 levels') + coord_flip()

Another solution is to adjust the orientation of the labels to make them vertical, or to shrink the size of the labels. Either way works, but I usually prefer coord_flip since it puts the labels in a natural orientation for reading without making them too small to read.
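For reference, here is a sketch of the label-rotation approach using ggplot’s theme function:

# Alternative to coord_flip: rotate the x-axis labels and shrink them slightly
ggplot(df, aes(x=multinomial_variable)) + geom_bar() + 
  ggtitle('Multinomial variable with 30 levels') +
  theme(axis.text.x = element_text(angle = 90, size = 6))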

Let’s take an even more extreme case, where I have a categorical variable with 500 levels and I try to make a barplot (I will use coord_flip from the start, since I already know this one will be even more crowded than the last):

# Simulate 2000 categorical data points with 500 possible levels
df <- data.frame(multinomial_variable = sample(paste0('category_', 1:500), 2000, replace=TRUE))

# Plot
ggplot(df, aes(x=multinomial_variable)) + geom_bar() + 
  ggtitle('Multinomial variable with 500 levels') + coord_flip()

Yikes, this is a terrible plot because you cannot see any of the labels, even with the coordinates flipped. We could try making the labels smaller, but they would be too small to see. In this case, my recommendation is to make an entirely different plot: a histogram of the counts of each category, which you can do like this:

# Create dataframe with counts of the categories
multinomial_variable_counts <- df$multinomial_variable %>% table 
df_counts <- data.frame(multinomial_variable_counts)

# Plot
ggplot(df_counts, aes(x=Freq)) + geom_histogram(bins=10) +
  ggtitle('Counts of categories')

The plot above shows that most of the categories occur fewer than 5 times, while a few categories occur 10+ times. In other words, most of the categories are pretty rare, and no categories are dominating the dataset. You can’t see the individual categories, but when you have such an overwhelming number of categories, your goal should be to get an idea of how your data are distributed across the categories, and whether some categories are extremely common or extremely rare.

You might be wondering… how can we find out which are the most common categories? Unfortunately, the histogram above doesn’t show counts for individual categories. What you can do is print out a table of the categories and sort them to see which are the top categories, like this:

# Print a table with the top 10 most common categories
df$multinomial_variable %>% table %>% sort(decreasing=TRUE) %>% head(10)
## .
## category_110   category_2 category_455 category_188 category_273 
##           12           10           10            9            9 
## category_332 category_391 category_430 category_486 category_104 
##            9            9            9            9            8

We can also turn this table into a new dataframe and plot a barplot with only the top 10 most common categories:

# Create dataframe with counts of top 10 categories only
top_10_categories <- df$multinomial_variable %>% table %>% sort(decreasing=TRUE) %>% head(10) 
df_top10 <- data.frame(names=names(top_10_categories), top_10_categories)

# Define factor levels to put them in order of frequency
df_top10$names <- factor(df_top10$names, levels=rev(names(top_10_categories)))

# Plot
ggplot(df_top10, aes(y=Freq, x=names)) + geom_bar(stat='identity') + coord_flip() +
  ggtitle('Top 10 categories')

Later in the course, we will talk more about how to handle categories with too many levels, because they present a challenge for modeling. Each category becomes a separate predictor variable in your model, so even though it is just a single column of data, it actually adds an enormous amount of complexity, which makes results difficult to interpret and can have an adverse effect on predictive performance.

Binary variables

Binary variables are the easiest. In fact, they are so simple, I think it is overkill to make a barplot. You can simply use the table function to show counts:

# Simulate a random set of cats and dogs
x <- sample(c('cat','dog'), 100, replace=TRUE)

# Display a table counting the number of cats and dogs
table(x)
## x
## cat dog 
##  51  49

There isn’t much else to say about binary variables for now. Let’s move on to a very common problem that plagues all variable types: missing data!

Missing data

Models don’t know what to do with missing data, so we have to decide how to handle them. Although some model functions may silently handle missing data, usually by dropping entire rows that contain even one NA value, it is best practice to explicitly handle missing data in whatever way you think is most appropriate for your situation. Too often, people put missing data into a model and don’t realize how the model is handling those data points.

Although I saved this for the end of the lesson, I recommend checking for missing values at the start of your analysis, even before making any plots. This way, you will be aware from the beginning that there are missing values to replace. Conveniently, the summary function will show you how many NA values are in each column (which is another reason to use this function at the top of your script after you import the data!):

# Simulate dataframe with 100 rows and 4 columns, one per variable type
n <- 100
df <- data.frame(continuous_var = rnorm(n),
                 ordinal_var = sample(1:5, n, replace=TRUE),
                 multinomial_var = sample(c('cat','dog','pig'), n, replace=TRUE),
                 binary_var = sample(0:1, n, replace=TRUE))

# Add one missing value to each column
df[1, 1] <- df[10, 2] <- df[50, 3] <- df[75, 4] <- NA

# Print a summary, which shows the number of missing (NA) values in each column
df %>% summary
##  continuous_var      ordinal_var    multinomial_var   binary_var    
##  Min.   :-2.19076   Min.   :1.000   cat :39         Min.   :0.0000  
##  1st Qu.:-0.63254   1st Qu.:2.000   dog :32         1st Qu.:0.0000  
##  Median :-0.06202   Median :3.000   pig :28         Median :0.0000  
##  Mean   :-0.01540   Mean   :2.909   NA's: 1         Mean   :0.4545  
##  3rd Qu.: 0.53101   3rd Qu.:4.000                   3rd Qu.:1.0000  
##  Max.   : 2.47493   Max.   :5.000                   Max.   :1.0000  
##  NA's   :1          NA's   :1                       NA's   :1

There are two ways to handle missing data: you can drop rows with missing values, or you can replace the missing values with a reasonable alternative. Generally, replacing the missing data is better than dropping rows, unless there is something truly pathological about that entire row of data. In many cases, you want to keep the information in the other columns rather than throw away everything in the row.
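If you do decide that dropping rows is appropriate, a one-line sketch using tidyr’s drop_na function (loaded as part of the tidyverse) looks like this:

# Drop every row that contains at least one NA (use sparingly!)
df_complete <- df %>% drop_na()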

So what should you replace the missing values with? As you can probably guess, it depends on the variable type! The most common and easy way to replace missing data is by using an appropriate measure of the central tendency of the variable:

  • For continuous variables, I recommend using the median. When a distribution is perfectly symmetrical, as in the case of a normal distribution, the mean and median will be identical. However, when you have a skewed variable or outliers, the mean will be pulled in the direction of the rare values, and the median provides a better measure of central tendency. Because it is resistant to outliers and skew, the median is known as a robust estimator of central tendency.
  • For discrete variables (including ordinal, multinomial, and binary) use the mode.

Here are examples showing how I replace each of the missing values above with either the median or the mode, with the help of the ifelse function. One caveat: base R has no built-in function for the statistical mode (its mode function returns an object’s storage type), so we define a small helper to find the most frequent value:

# If NA, replace with the median, otherwise leave it the same
df$continuous_var <- ifelse(is.na(df$continuous_var), 
                            median(df$continuous_var, na.rm=TRUE),
                            df$continuous_var)

# Base R's mode() returns an object's storage type, not the most common value,
# so define a small helper that returns the most frequent non-missing value
get_mode <- function(v) {
  vals <- unique(v[!is.na(v)])
  vals[which.max(tabulate(match(v, vals)))]
}

# If NA, replace with the mode, otherwise leave it the same
df$ordinal_var <- ifelse(is.na(df$ordinal_var), 
                            get_mode(df$ordinal_var),
                            df$ordinal_var)
df$multinomial_var <- ifelse(is.na(df$multinomial_var), 
                            get_mode(df$multinomial_var),
                            df$multinomial_var)
df$binary_var <- ifelse(is.na(df$binary_var), 
                            get_mode(df$binary_var),
                            df$binary_var)

# Check to confirm the missing values are replaced
df %>% is.na %>% colSums
##  continuous_var     ordinal_var multinomial_var      binary_var 
##               0               0               0               0

Great, that fixed the missing values. Note the use of the na.rm=TRUE argument in the median function. This is critical because without it, the median of a vector containing NA is itself NA, so you would simply be replacing missing values with more missing values.
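Here is a quick demonstration:

# Without na.rm, the median of data containing NA is itself NA
median(c(1, 2, NA))               # NA
median(c(1, 2, NA), na.rm=TRUE)   # 1.5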

Earlier, I mentioned that outliers can be dealt with in the same way as missing values. You can use an ifelse statement to replace any value beyond some threshold with the median. For example, the following code replaces any value more than 5 standard deviations from the mean with the median value, and otherwise keeps the value the same:

ifelse(abs(df$var - mean(df$var, na.rm=TRUE)) > 5*sd(df$var, na.rm=TRUE),
       median(df$var, na.rm=TRUE), df$var)

A final suggestion is to always check for missing values again at the END of your data cleaning. People sometimes introduce NA values by accident during data cleaning steps. For instance, a common mistake is to log transform zeroes or negative values, which will silently get converted into NA. Best practice is to make one final check for NA in your data before diving into modeling.

Concluding remarks

ITI is an iterative process that you will return to again and again over the course of a project, like an artist who keeps returning to a drawing, refining the features a little more each time.

The image on the left is an artist’s cartoon ‘model’ of a foot, but it is barely recognizable as a foot. With each iteration of the drawing, the foot becomes more detailed. By the final drawing, it is obviously a foot. Part of keeping your motivation high during data science projects is to manage expectations. Your first draft may look like the foot on the left, not the foot on the right, and that is okay because we aren’t finished yet! Be patient and trust the process.