These notes accompany my course at Northeastern University, a 6-week introduction to data mining using the R programming language. The goal of the course is to enable students to conduct independent data mining projects, with a focus on honing their workflow and communication skills.

These notes are a key resource for your homework assignments and in-class quizzes. They are the result of my efforts to distill a lot of useful code and information into a few digestible packets, and to structure these packets in a way that complements the weekly assignments (which are only available to enrolled students).

What is data mining?

The purpose of data mining is to extract useful insights from a dataset. In practice, data mining is an iterative process with several stages, diagrammed below. In a typical project, you will pass through these stages many times:

Rather than starting our learning at the beginning of the workflow, we begin with modeling for two reasons. First, I have found that introducing modeling up front helps students understand the importance of inspecting and transforming their data prior to fitting models. Second, glimpsing the later stages of the pipeline helps students write better project proposals (which are due at the end of Week 2!).

Four fundamental models

A major challenge in learning data science is that there are so many algorithms and models clamoring for your attention. While there is no universally agreed-upon core set of models an analyst should master, I have made the opinionated decision to restrict our focus in this course to a minimal set of models that covers the four most common use cases for data modeling:

  1. Regression: Predict a continuous target variable from a set of predictors.
  2. Classification: Predict a categorical target variable from a set of predictors.
  3. Dimension reduction: Identify continuous latent variables in a set of predictors.
  4. Clustering: Identify categorical latent variables in a set of predictors.
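
To make the four use cases concrete, here is a minimal sketch in base R using the built-in mtcars and iris datasets. The specific functions (lm, glm, prcomp, kmeans) and the chosen variables are my assumptions for illustration only; they are common base R choices for each task and are not necessarily the four models used in this course.

```r
# Illustrative only: datasets, variables, and model choices are assumptions.
set.seed(1)  # kmeans() starts from random centers

# 1. Regression: predict a continuous target (mpg) from predictors.
reg_fit <- lm(mpg ~ wt + hp, data = mtcars)

# 2. Classification: predict a categorical target (am: 0 = automatic, 1 = manual).
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# 3. Dimension reduction: continuous latent variables (principal components).
pca_fit <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# 4. Clustering: a categorical latent variable (one cluster label per row).
km_fit <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)

head(predict(reg_fit))                       # continuous predictions
head(predict(clf_fit, type = "response"))    # predicted probability that am == 1
head(pca_fit$x[, 1:2])                       # continuous component scores
table(km_fit$cluster)                        # categorical cluster labels
```

Note that the first two calls take a formula with a target on the left-hand side, while the last two receive only a table of predictors.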

In machine learning, regression and classification are called supervised learning, a term for models that have an observed target variable. The supervised learning model y ~ x is ‘trained’ to predict the target y given input x. In contrast, dimension reduction and clustering are examples of unsupervised learning: models that have no observed target and instead identify previously unknown structure in x.
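
The practical consequence of having an observed target can be sketched as follows. This is a rough illustration under my own assumptions (the mtcars dataset, a 24/8 train/test split, and the variables wt, hp, and disp), not a prescribed workflow: a supervised model can be scored against held-out values of y, whereas an unsupervised model has no y to score against.

```r
# Illustrative only: the dataset, split size, and variables are assumptions.
set.seed(42)  # make the random split reproducible

# Supervised: the formula mpg ~ wt names an observed target (mpg).
train_rows <- sample(nrow(mtcars), 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

fit  <- lm(mpg ~ wt, data = train)     # 'train' the model on observed y
pred <- predict(fit, newdata = test)   # predict y for rows the model has not seen

# Because y is observed, predictions can be compared with the truth.
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse

# Unsupervised: only x is supplied, so there is no target to compare against.
x   <- scale(mtcars[, c("wt", "hp", "disp")])
pca <- prcomp(x)
summary(pca)$importance["Proportion of Variance", ]
```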

The matrix below shows the 4 models used in this course:

These are powerful tools that can achieve the four most common data modeling tasks, and they provide a solid foundation for learning about more advanced models. We will continue to revisit these four models throughout the course, deepening our knowledge of how to apply and interpret them.

Before I provide a high-level introduction to the four models, a quick digression about “information overload”:

Information overload in data science

Data science terminology can be confusing, especially when it comes to modeling. One reason for this is that many similar ideas emerged independently from the disciplines of statistics and machine learning, leading to multiple terms for similar concepts. For example:

  • Statistical models have predictors or independent variables; ML models have features.
  • Statistical models predict a response, outcome, or dependent variable; ML models predict a target.
  • Statistics folks fit models; ML folks train models.

There is also tremendous hype surrounding ML, with a deluge of new methods, software libraries, and blog posts being published on a daily basis. Each new tool comes with new concepts and new terminology, and every blog post tries to convince you that THIS is the thing you should pay attention to. Diagrams and listicles of “things you should know” are popular.

In this environment, your attention is the most precious resource that you have. The only way to avoid becoming overwhelmed is to be careful about what you pay attention to. If you get side-tracked by every new technique that someone wrote a blog post about, you will waste all of your time going down rabbit holes and feeling panicked about everything you don’t know. This can quickly escalate into imposter syndrome, which is the feeling that you are an imposter who does not belong in this field at all!