Olympic history: Scraping data from sports-reference.com

This is the first in a series of posts analyzing data on the modern Olympics, comprising all the Games from Athens 1896 to Rio 2016. In this post, I introduce the dataset and the methods I used to collect it. In subsequent posts, I will wrangle and then analyze historical trends in the data.

Where the data comes from

At the time of this post, the International Olympic Committee (IOC) does not maintain an organized database about the Olympics. Luckily, a detailed database was developed and maintained by independent enthusiasts of Olympic history, and this data is currently available on www.sports-reference.com. However, as explained here, the IOC recently recognized the value of maintaining an Olympic database and partnered with the creators of the sports-reference database to transfer the data to an official IOC Olympic Statistical Database. When exactly that will happen and whether the data will be publicly accessible is unclear, so I decided to scrape this dataset before it disappears!

Scraping the data

Below, I present the code I used to scrape data from the individual athlete pages on sports-reference. I used the RCurl package for retrieving data from the web and the XML package for parsing HTML:

# Load packages
library("RCurl")
library("XML")
library("tidyverse")

The sports-reference website has an Olympic athlete directory page with a bunch of sub-pages (453, to be exact) that organize individual athlete pages by the first two letters of their surname. It looks like this:

My first step was to retrieve the athlete directory with getURL and parse the HTML to produce a vector containing the links to the sub-pages.

# Get list of sub-pages in athlete directory (each contains list of individual athlete pages)
athlete_directory <- getURL("https://www.sports-reference.com/olympics/athletes/") %>%
  htmlParse(asText=TRUE) %>%
  xpathSApply('//td/a', xmlGetAttr, 'href') %>%
  paste('https://www.sports-reference.com/', ., sep="")

# Check number of sub-pages 
length(athlete_directory) # 453 pages

My next step was to visit each of these sub-pages and extract a list of links to the individual athlete pages, which look like this:


I stored the links to the athlete pages in one long vector containing one link per athlete in the database. This step took about 3.5 minutes to complete, and the end result was a vector of links to pages with information on 135,584 Olympic athletes.

# Initialize vector to store links
individual_links <- c() 

system.time( # ~3.5 minutes
  for (i in 1:length(athlete_directory)) {
    
    # parse athlete directory sub-page to get all links
    new <- getURL(athlete_directory[i]) %>%
      htmlParse(asText=TRUE) %>%
      xpathSApply('//*[(@id = "page_content")]//a', xmlGetAttr, 'href') %>%
      paste('https://www.sports-reference.com/', ., sep="")
    
    # update vector of athlete pages
    individual_links <- c(individual_links, new) 
    
    # track progress in console
    print(i) 
    flush.console() # avoid output buffering
  }
) 

# Check number of individual links (athletes)
length(individual_links) # 135584

Next came the time-consuming part. I looped through each of the 135,584 individual athlete pages and extracted two bundles of information: 1) the ‘info-box’ containing basic biographical information such as the athlete’s name, gender, birthdate, and birthplace, and 2) the ‘results’ table containing a row for each event the athlete competed in, with columns describing the Olympics in which the event took place and the athlete’s performance in the event. I stored the two bundles in separate lists since they come in different formats and would have to be parsed separately. This step took ~26.5 hours.

# Initialize lists to store scraped data
infobox <- results_table <- vector("list", length(individual_links))

# Loop through links and extract data 
system.time( 
  for (i in 1:135584) {
    
    # get html (wait a minute and try again if it times out and throws an error)
    html <- try(getURL(individual_links[i], .opts=curlOptions(followlocation=TRUE)), silent=TRUE)
    if (inherits(html, "try-error")) {
      Sys.sleep(60)
      html <- getURL(individual_links[i], .opts=curlOptions(followlocation=TRUE))
    }
    html <- htmlParse(html, asText=TRUE)
    
    # save 'infobox'
    infobox[[i]] <- xpathSApply(html, '//*[@id="info_box"]/p', xmlValue) %>%
      strsplit('\n') %>% .[[1]]
    
    # save 'results table'
    results_table[[i]] <- readHTMLTable(html) %>% .$results
    
    # track progress in console
    print(i)
    flush.console() 
  }
)
# 95560.75/135584 = 0.705 sec/page
# total run time: 26.54 hours
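
As a quick sanity check before saving anything, it helps to peek at the first athlete's entries and confirm the three objects line up. Something like the following works (output not shown):

# Sanity check: inspect the first athlete's scraped data
individual_links[1]   # link to the athlete's page
infobox[[1]]          # character vector of lines from the info-box
results_table[[1]]    # data frame of results (or NULL if no table was found)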

Finally, I saved the resulting vector of links, the info-boxes, and the results-tables in an .Rdata file so that I can pick up from this point at any time in the future.

save(individual_links, infobox, results_table, file="C:/Users/Randi Griffin/Documents/GitHub/Olympic_history/scrapings.Rdata")
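
Picking up later is then just a matter of restoring the saved objects with load() in a fresh session, for example:

# Restore the scraped objects in a later session (same path as the save() call above)
load("C:/Users/Randi Griffin/Documents/GitHub/Olympic_history/scrapings.Rdata")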

The next step is wrangling the data into a usable format, and that will be the topic of my next post…

Footnotes

Source code for this project is on GitHub.
