Introduction

These notes focus on assessing the results of linear and logistic regression. There are two components to this assessment which parallel the two types of project goals, understanding and prediction:

  1. Interpreting model coefficients: How can we understand the model in real-world terms?
  2. Evaluating predictive performance: How well does the model predict the future?

While projects may emphasize one aspect of the assessment depending on the goal, it is often the case that both assessments are valuable for refining your model.

Note: If you need a refresher on linear and logistic regression, review Instructor Notes 1: Data mining and modeling.

# These notes use the following libraries
library('tidyverse')
library('gridExtra')
library('knitr')

Simulated data

I analyze simulated data in these notes, which lets us verify that the models accurately recover the relationships we build into the data.

First, I simulate a normally distributed response variable y predicted by two predictor variables: normally distributed x1 and binary x2. The relationship y ~ x1 has slope 0.5 (i.e., increasing x1 by 1 causes y to increase by 0.5), while y ~ x2 has slope -2 (i.e., increasing x2 from 0 to 1 causes y to decrease by 2). The intercept of the model is 1, meaning that when all predictors equal 0, the expected value of y is 1.

# Make reproducible
set.seed(42)

# Sample size
n <- 1000

# Y-intercept
a <- 1

# Normally distributed predictor with slope = 0.5
x1 <- rnorm(n)
b1 <- 0.5

# Binary predictor with slope = -2
x2 <- sample(c(0,1), n, replace=TRUE)
b2 <- -2

# Y is predicted by x1 and x2 (plus normally distributed errors)
y <- a + b1*x1 + b2*x2 + rnorm(n, sd=1)

# Make dataframe
df_lin <- data.frame(y, x1, x2)

# Plot linear model data
df_lin %>% ggplot(aes(x=x1, y=y, color=factor(x2))) + 
  geom_point() + geom_smooth(method='lm') +
  ggtitle('Simulated dataset for linear regression: y ~ x1 + x2') +
  geom_hline(yintercept=a, lty=2)

Note that we could visualize this with 2 separate y ~ x1 and y ~ x2 plots. However, the plot above shows all of the key patterns in the dataset in one figure. Let’s walk through it:

  • The relationship y ~ x1 is captured by the regression lines, where increases of size 1 along the x axis correspond to increases of size 0.5 along the y axis (since we simulated slope = 0.5 for this relationship).
  • The relationship y ~ x2 is captured by the vertically shifted regression lines, pink for when x2 = 0 and blue for when x2 = 1. When x2 = 1, the entire line is shifted down by 2, corresponding to the slope of -2 we simulated for this relationship. With binary predictors, you can think of the slope as simply representing the difference in y when the binary variable is 1 versus when it is 0. In this case, whenever x2 is 1, y decreases by 2.
  • The y-intercept occurs at the point where all of the predictors are 0. Here, when x1 = 0 (the midline of the plot) and x2 = 0, y is approximately 1, corresponding to the simulated y-intercept of 1.

Now let’s make a new dataframe for logistic regression, keeping the same predictor variables but transforming y into a new binary target variable in two steps:

  1. Transform y into probabilities p between 0 and 1 using the logistic function.
  2. Simulate a binomial outcome with probabilities p.

# Data for logistic regression
df_log <- df_lin

# Transform y into probabilities with logistic function
df_log$p <- 1/(1 + exp(-(a + x1*b1 + x2*b2)))

# Simulate binary outcome with probabilities
df_log$y <- rbinom(n, 1, df_log$p)

# Plot logistic model data
g1 <- df_log %>% ggplot(aes(x=x1, fill=factor(y))) + geom_density(alpha=0.5) +
  ggtitle('y ~ x1')
g2 <- df_log %>% ggplot(aes(x=factor(x2), fill=factor(y))) + geom_bar(position='fill') +
  ggtitle('y ~ x2')
grid.arrange(g1, g2, nrow=1, top='Simulated dataset for logistic regression: y ~ x1 + x2')

These plots show the key relationships in the dataset. Due to the logistic transformation of y into a binary variable, it is less straightforward to see how the values we simulated translate precisely into the patterns above, but we can still identify the major directional relationships:

  • Larger values of x1 are more likely to have y = 1 (blue density curve is shifted to the right of the pink one), corresponding to the positive relationship we simulated between y and x1.
  • When x2 = 1, the target y is less likely to be 1 (relatively smaller blue region for the x2 = 1 bar on the right), corresponding to the negative relationship we simulated between y and x2.

Now let’s fit models to the data using built-in R models:

# Fit linear model 
lm_lin <- lm(y ~ x1 + x2, data=df_lin)

# Fit logistic model
lm_log <- glm(y ~ x1 + x2, data=df_log, family='binomial')

We now have:

  1. Linear regression model y ~ x1 + x2
  2. Logistic regression model y ~ x1 + x2

With our simulated datasets and trained models, let’s interpret the regression coefficients estimated by the models.

Interpreting regression coefficients

Regression coefficients can add immensely to our understanding of a system. They allow us to go beyond qualitative claims that x predicts y, and instead make precise quantitative claims about how changes in x are associated with changes in y.

The meaning of coefficients differs for linear and logistic regression, so I cover them separately below.

Linear regression

By now, you should know what to expect. Here is how to extract the regression coefficients from the linear model:

# View linear regression coefficients
lm_lin %>% coefficients %>% round(3) 
## (Intercept)          x1          x2 
##       1.004       0.493      -1.964

We can see that the model coefficients are very close to the values we simulated: the intercept is ~ 1, the slope of x1 is ~0.5, and the slope of x2 is ~ -2. These slopes can each be interpreted as the change in y associated with an increase of 1 in the predictor x. For real datasets, you should be explicit about the units of variables when describing relationships, since the meaning of slopes depends on the units.
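
As a quick sanity check of this interpretation (a minimal sketch using the lm_lin model fitted above), we can predict y at x1 = 0 and x1 = 1 while holding x2 at 0; the difference between the two predictions equals the estimated slope of x1:

# Predict y at x1 = 0 and x1 = 1, holding x2 at 0; the difference is the x1 slope (~0.5)
newdata <- data.frame(x1 = c(0, 1), x2 = c(0, 0))
diff(predict(lm_lin, newdata))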

Note: Standardizing variables to have a standard deviation of 1 allows you to express effects in terms of standard deviations, which can be more interpretable in cases where the units of the variable don’t have an intuitive meaning (e.g., points on an arbitrary test).
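
For example, here is a minimal sketch of what that looks like on the simulated data, scaling y and the continuous predictor x1 while leaving the binary x2 as-is:

# Refit the linear model with y and x1 standardized (mean 0, sd 1)
lm(scale(y) ~ scale(x1) + x2, data = df_lin) %>% coefficients %>% round(3)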

Logistic regression

The slopes in logistic regression are not as intuitive to interpret as those of linear regression due to the transformation of the response variable. In a logistic model, each coefficient represents the change in the log-odds (the logarithm of the odds) of y = 1 associated with a unit increase in the predictor variable.

We can say that a positive log-odds coefficient represents a positive effect and a negative one represents a negative effect, but the number itself is hard to interpret in terms of relative probabilities. Fortunately, it is easier to understand the odds ratio, which we can obtain by simply exponentiating the log-odds. The odds ratio expresses the multiplicative change in the odds of y = 1 when the predictor increases by 1:

Odds of y = 1 = p(y = 1)/p(y = 0)

Odds Ratio = Odds of y = 1 when x = 1 / Odds of y = 1 when x = 0

An odds ratio of 1 means the predictor has no effect on the target. Odds ratios between 0 and 1 occur when increasing the predictor decreases the odds of y = 1 (i.e., negative relationship) and odds ratios > 1 occur when increasing the predictor increases the odds of y = 1 (i.e., positive relationship).
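
To make this concrete, here is a small worked example (the probabilities are chosen purely for illustration):

# Suppose p(y = 1) = 0.5 when x = 0 and p(y = 1) = 0.75 when x = 1
odds_x0 <- 0.5 / (1 - 0.5)     # odds = 1
odds_x1 <- 0.75 / (1 - 0.75)   # odds = 3
odds_x1 / odds_x0              # odds ratio = 3: the odds of y = 1 triple when x goes from 0 to 1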

Let’s see log-odds and odds ratios in action. First, we’ll look at the raw logistic regression coefficients from our model, which are the log-odds:

# Raw logistic regression coefficients (log odds ratios)
lm_log %>% coefficients %>% round(3)
## (Intercept)          x1          x2 
##       0.990       0.503      -2.103

The log-odds correspond to the values we simulated (intercept = 1, b1 = 0.5, and b2 = -2). We have recovered reasonable estimates of the relationships in the data, albeit a bit less accurately than before. The loss in estimation accuracy is due to the information loss that necessarily occurs when a continuous variable is transformed to a binary one, as we did with our target for the logistic regression.

What about odds ratios? We can get those by exponentiating the log-odds:

# Exponentiated logistic regression coefficients (odds ratios)
lm_log %>% coefficients %>% exp %>% round(3)
## (Intercept)          x1          x2 
##       2.692       1.654       0.122

Odds ratios can be interpreted as the multiplier applied to the odds of y = 1 when the predictor increases by 1. In the current example:

  1. Increasing x1 by 1 multiplies the odds of y = 1 by 1.65 (an increase)
  2. Increasing x2 from 0 to 1 multiplies the odds of y = 1 by 0.12 (a decrease)

It is common to see these expressed as percents:

  1. Increasing x1 by 1 is associated with a 65% increase in the odds of y = 1
  2. Increasing x2 from 0 to 1 is associated with an 88% decrease in the odds of y = 1
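
These percent changes can be computed directly from the odds ratios:

# Percent change in the odds of y = 1 per unit increase in each predictor
(exp(coefficients(lm_log)) - 1) * 100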

Odds ratios are more helpful for understanding the impact of each predictor, so you should focus on these rather than the raw log-odds that appear in the model summary. It is important to understand the difference between log-odds and odds ratios, because if you interpret log-odds as if they are odds ratios (or vice versa), you will draw incorrect conclusions from your model.

P-values

The p-value for a coefficient (i.e., the slope for a given predictor variable) tells you how likely it is that you would see an estimate at least this far from 0 if the true coefficient were exactly 0. A large p-value means you cannot confidently say that the coefficient is different from 0, while a small p-value means the coefficient is probably not 0, i.e., the predictor has a statistically significant effect on the outcome variable.

P-values are often used as a heuristic for identifying “interesting” or “meaningful” relationships in a dataset. A common threshold for claiming ‘statistical significance’ is p < 0.05. Output from R shows cascading significance levels: * for p < 0.05, ** for p < 0.01, and *** for p < 0.001:

# Summary of linear model coefficients
lm_lin %>% summary
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df_lin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8716 -0.6657 -0.0131  0.6763  3.5482 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.00414    0.04507   22.28   <2e-16 ***
## x1           0.49285    0.03158   15.60   <2e-16 ***
## x2          -1.96440    0.06330  -31.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.001 on 997 degrees of freedom
## Multiple R-squared:  0.5514, Adjusted R-squared:  0.5505 
## F-statistic: 612.8 on 2 and 997 DF,  p-value: < 2.2e-16
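
If you need the p-values programmatically (for example, to flag significant predictors), they are stored in the coefficient matrix returned by summary():

# Extract the p-value column from the coefficient matrix
summary(lm_lin)$coefficients[, 'Pr(>|t|)']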

P-values have an important but often overlooked caveat: for any non-zero effect, they shrink toward 0 as the sample size grows. This means that trivially small relationships become statistically significant in large datasets (how large 'large' needs to be depends on how small the effect is). It is critical to keep this point in mind when analyzing large datasets:

Statistical significance does not imply practical significance. You must interpret regression coefficients to know what the effect means in real world terms.

P-values should be used as a supplement to the interpretation of regression coefficients. Non-significant p-values indicate that the estimated relationship is uncertain and not worth further discussion. Significant p-values indicate that it is worth discussing the real-world significance of the effect (i.e., the magnitude of estimated regression coefficient).
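
To see why sample size matters so much, here is a back-of-the-envelope sketch (assuming a true slope of 0.01, a residual standard deviation of 1, and a predictor with standard deviation 1): the same tiny effect goes from clearly non-significant to highly significant as n grows.

# The standard error of a slope shrinks roughly as 1/sqrt(n)
n_values <- c(1e2, 1e4, 1e6)
se <- 1 / sqrt(n_values)           # approximate SE with residual SD = 1 and predictor SD = 1
t_stat <- 0.01 / se                # t = estimate / standard error
round(2 * pnorm(-abs(t_stat)), 4)  # approximate two-sided p-values: ~0.92, ~0.32, ~0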

R²

If you have taken statistics, you probably encountered the R² for linear regression (R² is undefined for logistic regression). It measures the proportion of variation in y that is explained by the model and ranges between 0 and 1. A low R² means that the model leaves much of the variation in y unexplained, while a high value means that the model explains much of the variation in y.

The R² is useful as an indicator of how much room for improvement you may have. For example, an R² of 0.5 means that 50% of the variability in y could still potentially be explained by, e.g., adding better features or using a better-specified model. However, it is also possible to get a high R² by overfitting, so a high R² does not necessarily mean that the model is well specified or useful for making predictions.

Note: You may notice that R outputs two versions of the R²: Multiple and Adjusted. The Adjusted R² attempts to guard against overfitting by penalizing the addition of variables and is never higher than the Multiple R².
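
Both versions can be pulled directly from the model summary if you want them programmatically:

# Multiple and adjusted R-squared for the linear model
summary(lm_lin)$r.squared
summary(lm_lin)$adj.r.squared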

Evaluating predictive performance

This section introduces the evaluation of predictive model performance, including methods for regression and classification.

Train/test validation

In machine learning projects, we must make decisions such as whether to do PCA or whether to bin certain categories together. A rigorous way to choose between options is to compare predictive performance in a train/test framework. This entails splitting the data into two subsets: we train models on one subset (the training set) and evaluate predictions on the other (the test set).

Such ‘out-of-sample validation’ allows you to detect overfitting, which occurs when your model makes accurate predictions within the training sample, but does not generalize well to new data. The test set gives you a more honest picture of how your model will perform on future datasets, since evaluating on the training set can lead to overly optimistic assessments of model performance.

Here is the workflow for train/test validation:

  1. Split data into train/test sets
    • Common train/test splits include 80%/20% or 90%/10%
    • If rows are independent, you may randomize rows into train/test sets
    • If you have date/time data and are predicting the future, your test set should come AFTER the training set in your time line (e.g., if you have 24 months of data and want to generate monthly predictions, train on the first 23 months and test on the 24th)
  2. Fit model on training set
  3. Use fitted model to generate predictions and performance metrics for:
    • Within-sample predictions on the training set
    • Out-of-sample predictions on the test set
  4. Evaluate performance on the training and test sets
    • Is there evidence of overfitting?
    • How does performance compare to a simple mean/mode model?
    • Is the model good enough to be useful?

If there is evidence of overfitting (much better within-sample than out-of-sample predictions), try to understand why and take action to reduce it. Common culprits include having too many predictor variables relative to observations, or having highly collinear predictors. Alternatively, it is possible that your training and test sets differ in some important way, so double-check that your train/test split is truly random.

Train/test split

Let’s split our dataset with an 80:20 split:

# Randomly select train/test split indices with 80% training size
train_ind <- sample(seq_len(n), size = floor(0.8 * n))

# Split training sets
df_lin_train <- df_lin[train_ind, ]
df_log_train <- df_log[train_ind, ]

# Split test sets
df_lin_test <- df_lin[-train_ind, ] 
df_log_test <- df_log[-train_ind, ] 

Important note: This simulated dataset doesn’t require any transformations, but in a real analysis with transformations, you must apply the transformations separately for the training and test sets. This means you must first split the original data into train and test sets, and then apply identical pre-processing steps to each set. It helps to consolidate your transformations into a sequence of tidyverse functions that you can reuse for each subset of the data.
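
As a minimal sketch of that workflow (the transformation shown here is hypothetical and not needed for this simulated data), you could wrap your pre-processing steps in a single function and apply it to each subset after splitting:

# Hypothetical reusable pre-processing function
preprocess <- function(df) {
  df %>%
    mutate(x2 = factor(x2))   # example step: convert the binary predictor to a factor
}

# Apply the same steps to the training and test sets (not run here)
# df_lin_train <- preprocess(df_lin_train)
# df_lin_test  <- preprocess(df_lin_test)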

Fit and predict

Let's re-train the linear and logistic regression models using the training set, then use the trained models to make predictions on the train and test sets.

# Train models 
lm_lin_train <- lm(y ~ x1 + x2, data=df_lin_train)
lm_log_train <- glm(y ~ x1 + x2, data=df_log_train, family='binomial')

# Predict training data y and add predictions to training dataframes
df_lin_train$y_pred <- predict(lm_lin_train, df_lin_train)
df_log_train$y_pred_probs <- predict(lm_log_train, df_log_train, type = "response")

# Predict test data y and add predictions to test dataframes
df_lin_test$y_pred <- predict(lm_lin_train, df_lin_test)
df_log_test$y_pred_probs <- predict(lm_log_train, df_log_test, type = "response")

# Convert logistic probabilities into binary predictions
df_log_train$y_pred <- ifelse(df_log_train$y_pred_probs > 0.5, 1, 0)
df_log_test$y_pred <- ifelse(df_log_test$y_pred_probs > 0.5, 1, 0)

We now have:

  1. Linear and logistic regression models trained on the training set
  2. Model predictions on the train dataset (within-sample predictions)
  3. Model predictions on the test dataset (out-of-sample predictions)

Time to evaluate the predictions.

Evaluation metrics

Evaluation differs for regression and classification. Although we are limited to linear and logistic regression in this course, these concepts apply broadly to predictive modeling and are essential tools of the trade.

Regression

The most straightforward metric for evaluating the accuracy of a regression model is the Mean Absolute Error (MAE), which is the average of the absolute differences between the predicted and actual values. If y is measured in meters and the MAE of a model is 2.4, it means that on average the model's predictions are off by 2.4 meters. By putting the model error into real-world terms, it is possible to evaluate whether the amount of error is acceptable given the intended purpose of the model.
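
In symbols, with y_pred denoting the model predictions:

MAE = mean( |y - y_pred| )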

Let’s compute MAE for our predictions on the training and test sets:

# Mean absolute error on training and test sets
MAE_train <- abs(df_lin_train$y - df_lin_train$y_pred) %>% mean 
MAE_test <- abs(df_lin_test$y - df_lin_test$y_pred) %>% mean 

# Plot results: Predicted vs Actual y 
g1 <- df_lin_train %>% ggplot(aes(x=y, y=y_pred)) + geom_point() + 
  ggtitle(paste('Training set MAE:', round(MAE_train, 2)))
g2 <- df_lin_test %>% ggplot(aes(x=y, y=y_pred)) + geom_point() + 
  ggtitle(paste('Test set MAE:', round(MAE_test, 2)))
grid.arrange(g1, g2, nrow=1, top='Predictive performance of linear regression')

We can see that the predictions are clearly correlated with the true values of y, but there is quite a bit of scatter. The MAE is similar between the training and test sets, which is good: If errors are higher on the test set, it usually means you are overfitting to the training set. The magnitude of these errors is about 0.8, which is not too bad considering that the standard deviation of y is 1.49.
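
For reference, you can compare the MAE directly to the spread of the target:

# Spread of y, for comparison with the MAE
sd(df_lin$y)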

Note: Another common metric for evaluating regression accuracy is Root Mean Squared Error (RMSE), which penalizes large errors more heavily than MAE but is harder to interpret in real-world units.

Classification

Now we turn to the logistic regression model predictions. For classification, we can describe predictive performance completely with a confusion matrix, which is simply a table of predicted versus actual classes. The widely used 'accuracy' metric is the number of correct predictions divided by the total number of predictions:
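
Accuracy = (True Positives + True Negatives) / Total Predictions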

However, the accuracy metric has a major shortcoming: it becomes very difficult to interpret when the ratio of 0s and 1s in the target variable is not 50:50. For example, suppose you have a dataset with 100 observations where the binary target variable is 98% zero and 2% one. By simply guessing 'zero' for every prediction, you get 98% accuracy! This illustrates why accuracy is difficult to interpret: impressive-sounding numbers are not always so impressive once you find out how well the 'guess zero every time' model performs.

So what alternatives do we have? I find that the tag-team duo of precision and recall is the best way to summarize the performance of a classifier. One number is simply not enough: you need both precision and recall to really understand how the model is performing. Study the figure below to see how precision and recall are computed from components of the confusion matrix:


The reason you need two metrics to understand the performance of a classifier is because there are two sources of error: false positives and false negatives. High precision minimizes false positives, while high recall minimizes false negatives. Be sure you understand this!
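
In terms of the confusion matrix counts, the two metrics are defined as:

Precision = True Positives / (True Positives + False Positives)

Recall = True Positives / (True Positives + False Negatives)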

Now let's evaluate the results of our logistic regression. To make all of the computations explicit, I manually compute the confusion matrices and performance metrics (accuracy, precision, and recall) for the training and test sets. Note that the layout of the matrices below is flipped relative to the images above (unfortunately, there is no universally agreed-upon way to arrange a confusion matrix):

# Confusion matrix - training data
cm_train <- table(df_log_train$y_pred, df_log_train$y, dnn = c('Predicted','Actual'))
cm_train 
##          Actual
## Predicted   0   1
##         0 305 101
##         1 116 278
paste('Accuracy:', round(( cm_train['1','1'] + cm_train['0','0'] ) / sum(cm_train),2))
## [1] "Accuracy: 0.73"
paste('Precision:', round(cm_train['1','1'] / sum(cm_train['1',]),2))
## [1] "Precision: 0.71"
paste('Recall:', round(cm_train['1','1'] / sum(cm_train[,'1']),2))
## [1] "Recall: 0.73"
# Confusion matrix - test data 
cm_test <- table(df_log_test$y_pred, df_log_test$y, dnn = c('Predicted','Actual'))
cm_test
##          Actual
## Predicted  0  1
##         0 75 32
##         1 21 72
paste('Accuracy:', round(( cm_test['1','1'] + cm_test['0','0'] ) / sum(cm_test),2))
## [1] "Accuracy: 0.74"
paste('Precision:', round(cm_test['1','1'] / sum(cm_test['1',]),2))
## [1] "Precision: 0.77"
paste('Recall:', round(cm_test['1','1'] / sum(cm_test[,'1']), 2))
## [1] "Recall: 0.69"

We can interpret the results above as follows:

  • The metrics are comparable between the training and test sets, so we aren’t overfitting.
  • The proportion of y = 1 in the test set is 0.52, so the accuracy score of 74% is pretty good (much better than guessing 0 or 1 every time).
  • The test precision is 0.77, which means that when the model guesses 1, it is correct 77% of the time.
  • The test recall is 0.69, which means that when the actual y is 1, the model correctly predicts 1 about 69% of the time.

Since this is a simulated dataset with no real meaning, there isn’t much else to say. In a real project, depending on your goals, you may care about optimizing for precision or recall rather than optimizing for overall accuracy. For example, if you are developing a medical test for a deadly illness, you may care more about maximizing recall because there is such a high cost to false negatives (i.e., sending a very sick person home when they need treatment).
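
One simple lever for shifting that balance (sketched below using the test-set probabilities computed above) is the classification threshold: lowering it from 0.5 typically raises recall at the cost of precision.

# Re-classify the test set with a lower threshold to favor recall over precision
y_pred_low_thresh <- ifelse(df_log_test$y_pred_probs > 0.3, 1, 0)
table(y_pred_low_thresh, df_log_test$y, dnn = c('Predicted', 'Actual'))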

Note: In biomedical research, Recall is usually called Sensitivity. A highly sensitive medical test (high recall) will rarely miss a case of disease, but might incorrectly label many healthy patients as having the disease (high false positive rate). Conversely, a precise test is nearly always correct when it says you have a disease, but it might miss many cases of disease (high false negative rate). Whether you want a more sensitive or a more precise test depends on the situation. You must ask which is the more harmful mistake: incorrectly telling a healthy patient that they have a disease (False Positive), or incorrectly telling a sick patient that they are healthy (False Negative)?

Concluding remarks

This concludes the Instructor Notes for ALY 6040: Data Mining!

We have covered the major components of data mining projects: Project ideation, inspecting and cleaning data, engineering new features, fitting and interpreting predictive models, and most important, tying everything together with a narrative.

Where should you go from here? Obviously, we have only scratched the surface. My parting advice is to make projects, not tools, the centerpiece of your studies. Learning tools without having a specific use for them will lead you to invent “projects” that don’t make much sense, much like Calvin here:


Not only is this an ineffective way to learn the tools, but it is an unrealistic workflow. Hopefully, this course laid the foundations for you to undertake independent projects with confidence. This is just the beginning of your analytics journey. Keep tackling datasets, solving problems that inspire you, and getting feedback on your work - it’ll pay off.

Happy mining. :)