Skip to content

Lab 3E: Scraping Web Data

Lab 3E - Scraping web data

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

The web as a data source

  • The internet contains huge amounts of information.

    • Using computers to gather this information in an automated fashion is referred to as scraping web data.

    • Scraping data from the web can be difficult because each website displays & stores data differently.

  • In this lab, we'll learn how to scrape data in two steps:

    • Step 1: Gather information from the web.

    • Step 2: Clean it up and turn it into a usable data frame for Lab 3F.

Our first web scraper

  • Copy and paste the link below into a web browser to view the website of data we'd like to scrape and analyze.
    https://labs.idsucla.org/extras/webdata/mountains.html

  • Briefly describe what the data on the website is about.

    Then write down 3 questions you'd be interested in answering by analyzing this data.

HTML

  • HTML is the code that's used to render every website you've ever visited.

  • The following slide shows the HTML code used to create the first two rows of the web data.

    How is the data table in HTML different than the data tables we're used to seeing in R, for example, when we use the View() function?

    What do you think the tags <TABLE>, <TR>, <TH>, <TD> mean? How does HTML use these tags to display the table?

<TABLE>
    <TR>
        <TH>peak</TH>
        <TH>range</TH>
        <TH>state</TH>
        <TH>long</TH>
        <TH>lat</TH>
        <TH>elev_ft</TH>
        <TH>elev_m</TH>
        <TH>prominence_ft</TH>
        <TH>prominence_m</TH>
        <TH>rank</TH>
    </TR>
    <TR>
        <TD>Denali (Mount McKinley)</TD>
        <TD>Alaska Range</TD>
        <TD>Alaska</TD>
        <TD>-151.0063</TD>
        <TD>63.0690</TD>
        <TD>20236</TD>
        <TD>6168</TD>
        <TD>20174</TD>
        <TD>6149</TD>
        <TD>1</TD>
    </TR>
</TABLE>

Get to scraping!

  • Use your browser to go back to the website with the data we're interested in scraping.

  • Find the URL address for the site and assign it the name data_url in R.

  • Then fill in the blanks below to have R scrape every web table available on the site:

    tables <- readHTMLTable(____)
    

Find our data

  • Since readHTMLTable() scrapes every table that is on a particular web URL, we need to find out which table has the data we're interested in.

    – For example, wikipedia.org often has articles with 3 or more tables.

    – This means we need to check all 3 tables to find the one we're interested in.

  • Use the length() function to find out how many tables of data were scraped in our set of tables.

Saving tables

  • Now that we know how many tables we've scraped, we can go back and scrape individual tables by adding the which argument to the readHTMLTable() function.

  • Use readHTMLTable() to re-scrape the data from the web but this time use the which argument to scrape just the individual table.

    – The which argument should be the integer denoting which table you want scraped.

    – Assign the scraped data the name mtns.

Check, save and use!

  • After scraping the data, the only thing left to do is to save it and use it.

  • Fill in the blanks to save the data and give it a file name.

    save(____, file = "____.Rda")
    
  • What is the mean and standard deviation of elev_ft?

  • Which state has the most mountains in our data?