Introduction

These notes accompany my course at Northeastern University, a 6-week introduction to data mining using the R programming language. The goal of the course is to enable you to conduct independent data mining projects, with a focus on honing your workflow and communication skills.

These notes are a key resource for your homework assignments and in-class quizzes. They are the result of my efforts to distill a lot of useful code and information into a few digestible packets, and to structure those packets in a way that complements the weekly assignments (which are only available to enrolled students).

What is data mining?

The purpose of data mining is to extract useful insights from a dataset. In practice, data mining is an iterative process with several stages, diagrammed below. In a typical project, you will pass through these stages many times:

Rather than starting at the beginning of the workflow, we begin with modeling for two reasons. First, I have found that introducing modeling up front helps students understand the importance of inspecting and transforming their data prior to fitting models. Second, glimpsing the later stages of the pipeline helps students write better project proposals (which are due at the end of Week 2!).

Four fundamental models

A major challenge to learning data science is that there are so many algorithms and models clamoring for your attention. While there is no universally agreed upon core set of models an analyst should master, I have made the opinionated decision to restrict our focus in this course to a minimal set of models that cover the 4 most common use cases for data modeling:

  1. Regression: Predict a continuous target variable from a set of predictors.
  2. Classification: Predict a categorical target variable from a set of predictors.
  3. Dimension reduction: Identify continuous latent variables in a set of predictors.
  4. Clustering: Identify categorical latent variables in a set of predictors.

In machine learning, regression and classification are called supervised learning, a term for models that have an observed target variable. The supervised learning model y ~ x is ‘trained’ to predict the target y given input x. In contrast, dimension reduction and clustering are examples of unsupervised learning, or models that identify previously unknown structure in x.
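
To make the distinction concrete, here is a minimal sketch (using a small made-up data frame called toy, purely for illustration) of how the four R functions we will meet later are called. Notice that the supervised functions take a formula with a target on the left of the ~, while the unsupervised functions receive only the predictors:

# Toy data for illustration only (not the simulated data used later in these notes)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y  <- toy$x1 + rnorm(50)                       # continuous target
toy$y2 <- rbinom(50, size = 1, prob = plogis(toy$x1))  # binary target

# Supervised learning: the formula names an observed target left of '~'
lm(y ~ x1 + x2, data = toy)                        # regression
glm(y2 ~ x1 + x2, data = toy, family = binomial)   # classification

# Unsupervised learning: no target variable, only the predictors
prcomp(toy[, c('x1', 'x2')], scale. = TRUE)        # dimension reduction
kmeans(scale(toy[, c('x1', 'x2')]), centers = 2)   # clustering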

The matrix below shows the 4 models used in this course:

These are powerful tools that can achieve the four most common data modeling tasks, and they provide a solid foundation for learning about more advanced models. We will continue to revisit these four models throughout the course, deepening our knowledge of how to apply and interpret them.

Before I provide a high-level introduction to the four models, a quick digression about “information overload”:

Information overload in data science

Data science terminology can be confusing, especially when it comes to modeling. One reason for this is that many similar ideas emerged independently from the disciplines of statistics and machine learning, leading to multiple terms for similar concepts. For example:

  • Statistical models have predictor or independent variables; ML models have features.
  • Statistical models predict a response, outcome, or dependent variable; ML models predict a target.
  • Statistics folks fit models; ML folks train models.

There is also tremendous hype surrounding ML, with a deluge of new methods, software libraries, and blog posts being published on a daily basis. Each new tool comes with new concepts and new terminology, and every blog post tries to convince you that THIS is the thing you should pay attention to. Diagrams and listicles of “things you should know” are popular.

In this environment, your attention is the most precious resource that you have. The only way to avoid becoming overwhelmed is to be careful about what you pay attention to. If you get side-tracked by every new technique that someone wrote a blog post about, you will waste all of your time going down rabbit holes and feeling panicked about everything you don’t know. This can quickly escalate into imposter syndrome, which is the feeling that you are an imposter who does not belong in this field at all!


To combat this problem (which we all face sometimes!), focus on mastering the 4 basic models in this course. By staying focused on fundamentals, you will lay a solid foundation that you can build upon by doing real projects. These models are widely used in both statistics and ML, so they provide a good starting point regardless of whether your future plans involve more statistics or ML-type work.

Data types

In this course, we deal strictly with rectangular data that can be represented as a table, with columns representing variables and rows representing individual observations. Variables can have different types, and it is essential to know your data types because they determine which kinds of models and plots are appropriate.

Here is a taxonomy of data types:


While the concept of data types may seem simple, it can be tricky to decide whether some variables are continuous or discrete. Some common tricky examples include:

  • Count data. When count values are small, the variable may be treated as an ordinal feature, but when counts become large (e.g., typical values in the tens, hundreds, or more), it is usually better to treat it as continuous.
  • Continuous features with multiple peaks in the distribution (i.e., multimodality). Sometimes it makes sense to bin multimodal features into categories (e.g., if it is bimodal, make a cutoff point and convert it into a binary feature).
  • Numerical IDs that really represent categories. Common examples are phone numbers or postal codes. Although these are ‘numerical’, they don’t actually contain numerical information, because the phone number ‘617-548-2601’ is not ‘greater than’ the phone number ‘617-548-2600’. These are categorical variables in disguise.
  • ‘Words’ that really represent ordered information. For instance, an alert system might return colors from green to red indicating the severity of a problem. Although ‘colors’ are categories, the colors encode ranked information, so we may want to treat such a variable as ordinal (e.g., convert green-yellow-orange-red into 1, 2, 3, 4), as in the sketch below.
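
For example, here is a minimal sketch (with a made-up data frame called records) of how the last two cases might be encoded in R: ID-like numbers become unordered factors, and ranked ‘words’ become ordered factors:

# Made-up data purely for illustration
records <- data.frame(
  zip_code = c('02115', '02116', '02115'),   # numeric-looking, but really categorical
  severity = c('green', 'red', 'orange')     # words that encode ranked information
)

# Treat the ID-like column as an unordered factor (categorical)
records$zip_code <- factor(records$zip_code)

# Treat the ranked words as an ordered factor (ordinal)
records$severity <- factor(records$severity,
                           levels = c('green', 'yellow', 'orange', 'red'),
                           ordered = TRUE)

# Confirm the resulting types
str(records)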

Some of the most common mistakes arise from the mis-handling of data types. To test your knowledge, be sure you can answer the following questions:

  1. A dataset on U.S. elections has a column, Democrat_won, indicating whether a Democrat won or lost in a given district. What type of variable is Democrat_won?
  2. A clinical trial dataset contains hundreds of observations per patient, with an 8-digit patient ID column, e.g., 81125679. What type of variable is patient ID?
  3. A grocery store inventory database contains a count column for the number of items of each type. Values range from 0 to several thousand. What type of variable is count?
  4. Survey data contains Likert-scale questions, with possible values for the answer column being: ‘Strongly disagree’, ‘Disagree’, ‘Neutral’, ‘Agree’, and ‘Strongly Agree’. What type of variable is answer?

Correct answers are given at the end of these notes. If you got any wrong, make sure you understand why!

R Code in the Instructor Notes

I include the R code for generating output in my Instructor Notes, and that is not an accident! I demonstrate many useful data manipulations and plotting techniques in the code blocks, and you are encouraged to use the code in these notes as a resource when doing your assignments.

Getting help with code: You can get a lot of useful information by using the ? operator in R to look up the help pages for functions (e.g., typing ?print in R pulls up the help page for the print function). Any time you are confused about what a function does, your first step should be to check the R help page, and your second step should be Google. If you’re still confused about something, ask me in class or shoot me an e-mail.
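
For example, all of the following work in any R session:

?print             # help page for a specific function
help('kmeans')     # equivalent to ?kmeans
??'principal'      # search all installed help pages for a keyword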

Four models

It is time to meet the four models we will use in this course. I start by simulating a minimal dataset that I will use to illustrate the use of each model. Simulated data allows us to experiment with models in a context where we know what the ‘true’ relationships in the data are (with real-world data, you never know for sure).

Let us simulate a minimal dataset composed of a continuous predictor x and two alternative target variables: a continuous y and a binary y2. I draw x from two normal distributions with different means, such that the dataset contains 2 clusters, and both y and y2 are positively predicted by x.

# Import libraries
library('knitr')
library('tidyverse')
library('gridExtra')
library('factoextra')

# Make reproducible
set.seed(37)

# Sample size
n <- 200

# Intercept of linear relationship between y and x
a <- 5

# Slope of linear relationship between y and x
b <- 2

# Define continuous predictor variable x, composed of two normal distributions
x <- c(rnorm(n/2, mean=-2), rnorm(n/2, mean=2))

# Define continuous target y, which is positively related to x
y <- a + b*x + rnorm(n, sd=0.25)

# Define binary target y2, which is positively related to x
y2 <- rbinom(n=n, size=1, prob=1/(1+exp(-x)))

# Combine variables in a dataframe
df <- data.frame(x, y, y2)

# Check simulated data
summary(df)
##        x                   y                 y2       
##  Min.   :-4.704850   Min.   :-4.4995   Min.   :0.000  
##  1st Qu.:-1.964974   1st Qu.: 0.9279   1st Qu.:0.000  
##  Median :-0.001371   Median : 4.8509   Median :1.000  
##  Mean   : 0.005280   Mean   : 5.0104   Mean   :0.515  
##  3rd Qu.: 2.036169   3rd Qu.: 9.2433   3rd Qu.:1.000  
##  Max.   : 4.748582   Max.   :14.4114   Max.   :1.000

Let’s visually inspect the 3 variables I just created, using histograms to plot the continuous features and a barplot for the categorical feature:

# plot distributions of the 3 simulated variables
p1 <- df %>% ggplot(aes(x=x)) + geom_histogram()
p2 <- df %>% ggplot(aes(x=y)) + geom_histogram()
p3 <- df %>% ggplot(aes(x=y2)) + geom_bar()
grid.arrange(p1, p2, p3, nrow=1)

To summarize the dataset:

  1. x is a continuous predictor variable with a bimodal distribution that predicts both y and y2.
  2. y is a continuous target variable that is predicted by x, so it also has a bimodal distribution.
  3. y2 is a binary target variable that is predicted by x, and it has a roughly uniform distribution (about half 0s and half 1s).

With our simulated data in hand, let’s meet the supervised learning model and Stat 100 superstar, linear regression.

Linear regression

Linear regression predicts a continuous outcome variable, y, as a linear function of one or more predictor variables, x1, x2, ... , xn. The predictors can be continuous or categorical. The model assumes that predictors have a linear relationship to the response.

When you fit models in R, you specify the relationship between the outcome and predictors with a model formula that gets passed to the model function. For example, let’s fit a linear regression predicting y from x:

# fit linear regression for the continuous target
linear_model <- lm(y ~ x, data=df)

# View model coefficients
summary(linear_model)$coefficients[,'Estimate']
## (Intercept)           x 
##    4.999830    2.006942

The model coefficients correspond to the intercept and slope of a line fit to the two variables. The interpretation is straightforward if you are familiar with the slope-intercept form of a line.

  • Intercept: When x = 0, the predicted value of y is 5.00 (rounded to 2 decimal places)
  • Slope: When x increases by 1, the predicted value of y increases by 2.01 (rounded to 2 decimal places). See the quick check below.
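
As a quick (optional) check, the sketch below recomputes the prediction at x = 1 directly from the fitted coefficients and compares it to predict():

# Fitted coefficients: intercept and slope
coef(linear_model)

# Prediction at x = 1 by hand: intercept + slope * 1 (roughly 5 + 2.01)
coef(linear_model)['(Intercept)'] + coef(linear_model)['x'] * 1

# The same prediction via predict()
predict(linear_model, newdata = data.frame(x = 1))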

Let’s plot the regression. Be sure you recognize the slope and intercept on the graph.

# plot linear model
df %>% ggplot(aes(x=x, y=y)) +
  geom_point() +
  geom_point(aes(x=0, y=linear_model$coefficients[1]), color='red', size=3, alpha=0.75) +
  geom_smooth(method='lm') +
  geom_hline(yintercept=linear_model$coefficients[1], linetype=2, color='red') +
  geom_text(aes(x=2.25, y=linear_model$coefficients[1]+0.75, label='Y-intercept'), size=3, color='red') +
  theme(legend.position = "none")

Linear regression is useful when you have a continuous outcome variable, but what if your outcome is binary? This brings us to logistic regression.

Logistic regression

Things get more complicated if the outcome variable is binary rather than continuous. To model binary outcomes, we use logistic regression, which is essentially a linear model whose output is passed through the logistic function so that it falls between 0 and 1. This logistic transformation allows the output to be interpreted as probabilities, which can be converted into binary predictions using a threshold (usually 0.5).
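
As a minimal sketch of those two steps, we can fit the y2 ~ x model with glm(), extract the predicted probabilities, and convert them to 0/1 predictions with a 0.5 cutoff (the object names p_hat and y2_hat are just for illustration):

# Fit logistic regression for the binary target
logistic_model <- glm(y2 ~ x, data = df, family = binomial)

# Predicted probabilities, squeezed between 0 and 1 by the logistic function
p_hat <- predict(logistic_model, type = 'response')

# Convert probabilities into binary predictions using a 0.5 cutoff
y2_hat <- ifelse(p_hat >= 0.5, 1, 0)

# Proportion of observations classified correctly on the training data
mean(y2_hat == df$y2)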

To see the relationship between linear and logistic regression, it is helpful to visualize them side-by-side. Let’s visualize a logistic regression model for y2 ~ x and compare it to the linear regression predicting y ~ x (we won’t worry about interpreting logistic regression coefficients for now, since it’s a bit more complicated than linear regression). The color of the points in both plots corresponds to actual values of the binary outcome y2:

# visualize linear and logistic regression models
color <- ifelse(df$y2==1, 'red', 'black') %>% factor
g1 <- ggplot(df, aes(x, y)) + 
  geom_point(color=color) + 
  geom_smooth(method='lm') + 
  theme_gray() +
  ggtitle('Linear regression: y ~ x')
g2 <- ggplot(df, aes(x, y2)) + 
  geom_point(color=color) + 
  stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE) +
  theme_gray() +
  ggtitle('Logistic regression: y2 ~ x')
grid.arrange(g1, g2, nrow=1)

The straight line on the left represents the linear regression model, which directly predicts a y value for any input x using a straight line. For the logistic regression on the right, the logistic curve represents predicted probabilities for any input x, which get converted to binary predictions using a cutoff (i.e., if p >= 0.5, classify as 1; otherwise classify as 0). The predictor variable x on the x-axis is identical in the two models, but the target variable determines whether linear or logistic regression is appropriate: linear regression is for continuous targets, and logistic regression is for binary targets.

Both of the models above have been ‘fit’ or ‘trained’ on the data. If we wish, we could use the trained models to predict future values of y or y2, given some input for the predictor x. This is the basic idea behind supervised learning: You ‘train’ the model on data where both x and y are known, the model ‘learns’ a relationship between x and y, and then you use the model to generate predictions on new data where the input x is known but the target y is unknown.
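
For example, here is a sketch that generates predictions for a few hypothetical new values of x (where y and y2 are unobserved), using linear_model from above and logistic_model from the earlier sketch:

# Hypothetical new observations: x is known, y and y2 are not
new_data <- data.frame(x = c(-3, 0, 3))

# Predicted continuous target y from the linear regression
predict(linear_model, newdata = new_data)

# Predicted probability that y2 = 1 from the logistic regression
predict(logistic_model, newdata = new_data, type = 'response')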

Be sure you can answer the questions: What is supervised learning? What are the inputs and outputs of linear and logistic regression? Can you think of some real-world examples where you would use these models?

Principal Components Analysis (PCA)

PCA is a transformation that scales and rotates a set of N (possibly) correlated continuous variables into a set of N uncorrelated continuous variables, ranking the new variables according to how much variation they capture in the original data. These new variables are called ‘principal components’, or ‘PCs’, and they can be interpreted as latent features underlying the original variables. For more intuition about the mathematics of PCA, I recommend this blog post that visualizes the matrix transformations underlying PCA.

Let’s look at a 2D example of how PCA transforms a dataset with N = 2 input variables:

# fit pca 
pca <- prcomp(df[,c('y','x')], scale = TRUE)

# plot raw data
g3 <- ggplot(df, aes(x, y)) + 
  geom_point(color=color) + 
  geom_smooth(method='lm') + 
  theme_gray() +
  ggtitle('Raw data is correlated: y ~ x')

# plot PCs 
g4 <- fviz_pca_ind(pca, geom='point') + 
  geom_point(color=color) +
  geom_smooth(method='lm') + 
  theme_gray() +
  ggtitle('PCs are uncorrelated: pc2 ~ pc1')

# arrange in multi-panel plot
grid.arrange(g3, g4, nrow=1)

Observe that the original variables x and y have a positive correlation, but after PCA transformation, PC1 and PC2 are uncorrelated (these are shown as Dim1 and Dim2 in the plot on the right). In fact, this is guaranteed by PCA. Although it is difficult to visualize in more than 2 dimensions, the intuition remains the same: the input to PCA is N (possibly correlated) continuous variables, and the output is N uncorrelated variables, or PCs.

The axis labels of the PCA plot tell us that PC1 (99.9%, x-axis) explains a much larger proportion of variation in the original data compared to PC2 (0.1%, y-axis). This reflects the fact that PC1 captures the strong correlation between x and y, while PC2 is mainly residual noise. In a real analysis where x and y are correlated predictor variables, we could conclude that PC1 is sufficient to capture the effects of both x and y, and move forward using PC1 as a replacement for the 2 original correlated features. Later in the course, we will return to the topic of how to select a subset of PCs to keep for downstream analysis, since you rarely want to keep all of them.
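
If you want to verify these properties directly from the prcomp object rather than reading them off the plot, here is a short sketch:

# PCs are uncorrelated: the off-diagonal correlations are essentially zero
round(cor(pca$x), 3)

# Proportion of variance explained by each PC
# (should roughly match the axis labels in the plot above)
round(pca$sdev^2 / sum(pca$sdev^2), 3)

# summary() reports the same information
summary(pca)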

In this example, we only have two variables and the results are not very useful. The purpose here isn’t to show a realistic usage of PCA, but to provide intuition for its most useful properties: PCA decorrelates correlated continuous variables into a new set of uncorrelated continuous variables, and ranks them according to their importance. Since we frequently select only a subset of the most important PCs for downstream analysis, PCA is associated with ‘dimension reduction’ because this process reduces the dimensionality of the data (i.e., you end up with fewer columns than you began with).
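
To make the ‘fewer columns’ point concrete, here is a one-step sketch (the name reduced is just for illustration) of keeping only PC1 as a reduced representation of the two original features:

# Keep only the first PC: 1 column in place of the original 2 features
reduced <- data.frame(pc1 = pca$x[, 1])
dim(reduced)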

Don’t worry if it still isn’t obvious why decorrelating and reducing the dimensionality of your data is useful. It will become clearer as we dive deeper into the assumptions of linear models in the next couple of weeks, and discover how PCA can be used to engineer better continuous features for predictive models.

Be sure you can answer the question: What are the inputs and outputs of PCA?

K-Means Clustering

Like PCA, K-Means Clustering can be thought of as a transformation of your variables: K-Means takes N uncorrelated features as input, groups observations into clusters based on how similar they are with respect to the input features, and outputs a categorical variable representing cluster IDs.

The input features for K-Means should not be highly correlated, because highly correlated inputs provide redundant information to the algorithm, causing them to have an exaggerated impact on the results. For this reason, it is common to perform K-Means Clustering on the results of a PCA, which ‘decorrelates’ the data as discussed above. I take this approach below by performing K-Means Clustering on our PCA results from above:

# k-means clustering with k = 2
k <- 2
km_pca <- kmeans(pca$x, centers=k, nstart=5)

# visualize k-means
fviz_cluster(km_pca, data=pca$x, geom='point', shape=20) +
  geom_point(color=color) +
  geom_smooth(method='lm') +
  scale_fill_manual(values=c('blue','green')) +
  scale_color_manual(values=c('blue','green')) +
  ggtitle('K-Means Clustering: PCA (pc1, pc2)') +
  guides(color = 'none')

As you can see, with k = 2, K-Means Clustering identifies two clusters (blue and green) that largely (but imperfectly) capture the simulated black and red clusters. In real-life scenarios, we often do not have the black and red labels, so K-Means is incredibly useful because it can identify these hidden groups in the data. These group labels, or cluster IDs, can be used as a categorical feature in downstream models, or they may be of interest in their own right.
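
As a sketch of how you might carry the cluster IDs forward, we can store them as a categorical (factor) column in the data frame:

# Store the cluster assignments as a categorical feature
df$cluster <- factor(km_pca$cluster)

# How many observations fall in each cluster?
table(df$cluster)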

Note that I chose the setting k = 2 to specify 2 clusters in the code above. In general, the minimum number of clusters is 1 (all observations are in the same group) and the maximum number is equal to the number of observations (each observation is in its own group). A tricky part of K-Means Clustering is deciding what the value of k should be. We will revisit this issue later in the course, and will see that it is analogous to the question of how many PCs to keep from a PCA.
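
As a preview of that discussion, here is a minimal sketch of one common heuristic (the ‘elbow’ method): fit K-Means for several values of k and compare the total within-cluster variation:

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) kmeans(pca$x, centers = k, nstart = 5)$tot.withinss)

# Look for the 'elbow' where adding more clusters stops helping much
plot(1:8, wss, type = 'b',
     xlab = 'Number of clusters k',
     ylab = 'Total within-cluster sum of squares')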

In the coming weeks, we will learn more about how K-Means Clustering can uncover hidden patterns in data and help you engineer better categorical features for predictive models.

Be sure you can answer the question: What are the inputs and outputs of K-Means Clustering?

Concluding remarks

We have now met the four models used in this course. Here is a final visual to help you remember how these 4 models cover the 4 most common data modeling tasks:


Be sure you are comfortable with data types and that you understand the inputs and outputs of the 4 models. These are fundamental building blocks of the course, so it is important that you understand them before moving forward. If you are confused, chances are you are not the only one, so ask questions!

Answers to questions about variable types

  1. Binary, 2. Categorical, 3. Continuous, and 4. Ordinal.