Lab 3E: Scraping Web Data
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
The web as a data source
The internet contains huge amounts of information.
Using computers to gather this information in an automated fashion is referred to as scraping web data.
Scraping data from the web can be difficult because each website displays & stores data differently.
In this lab, we'll learn how to scrape data in two steps:
Step 1: Gather information from the web.
Step 2: Clean it up and turn it into a usable data frame for R.
Our first web scraper
Copy and paste the link below into a web browser to view the website of data we'd like to scrape and analyze.
Briefly describe what the data on the website is about.
– Then write down 3 questions you'd be interested in answering by analyzing this data.
HTML is the code that's used to render every website you've ever visited.
The following slide shows the HTML code used to create the first two rows of the web data.
– How is the data table in HTML different from the data tables we're used to seeing in R, for example, when we use the
– What do you think the tags <TABLE>, <TR>, <TH>, and <TD> mean? How does HTML use these tags to display the table?
<TABLE>
  <TR>
    <TH>peak</TH> <TH>range</TH> <TH>state</TH> <TH>long</TH> <TH>lat</TH>
    <TH>elev_ft</TH> <TH>elev_m</TH> <TH>prominence_ft</TH> <TH>prominence_m</TH> <TH>rank</TH>
  </TR>
  <TR>
    <TD>Denali (Mount McKinley)</TD> <TD>Alaska Range</TD> <TD>Alaska</TD> <TD>-151.0063</TD> <TD>63.0690</TD>
    <TD>20236</TD> <TD>6168</TD> <TD>20174</TD> <TD>6149</TD> <TD>1</TD>
  </TR>
</TABLE>
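To see how these tags map onto rows and columns, here is a minimal sketch that parses a shortened copy of the HTML above from a text string rather than from a live website. It assumes the XML package is installed. The names html_text, doc, and demo_table are made up for this illustration, and only three of the ten columns are kept.

```r
library(XML)

# A shortened copy of the slide's HTML (3 of the 10 columns), stored as text.
html_text <- "
<TABLE>
  <TR> <TH>peak</TH> <TH>state</TH> <TH>elev_ft</TH> </TR>
  <TR> <TD>Denali (Mount McKinley)</TD> <TD>Alaska</TD> <TD>20236</TD> </TR>
</TABLE>"

# Parse the text as HTML, then pull out the first (and only) table.
doc <- htmlParse(html_text, asText = TRUE)
demo_table <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

# The <TH> cells become column names; each <TR> of <TD> cells becomes one row.
print(demo_table)

# Scraped values typically arrive as text, so convert numeric columns by hand.
demo_table$elev_ft <- as.numeric(demo_table$elev_ft)
str(demo_table)
```

Parsing a string keeps the sketch independent of any website, so it can be run without an internet connection.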
Get to scraping!
Use your browser to go back to the website with the data we're interested in scraping.
Find the URL address for the site and assign it the name
Then fill in the blanks below to have R scrape every web table available on the site:
tables <- readHTMLTable(____)
Find our data
Because readHTMLTable() scrapes every table on a particular web URL, we need to find out which table has the data we're interested in.
– For example, wikipedia.org often has articles with 3 or more tables.
– For an article with 3 tables, this means we need to check all 3 to find the one we're interested in.
Use the length() function to find out how many tables of data were scraped in our set of tables.
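As a rough illustration of what counting and inspecting scraped tables can look like, the sketch below builds a stand-in list by hand instead of scraping anything. The name demo_tables and its contents are invented for this example; a real scrape would return a similar list of data frames.

```r
# Stand-in for a scrape result: a list holding two small data frames.
demo_tables <- list(
  data.frame(peak = "Denali (Mount McKinley)", state = "Alaska"),
  data.frame(note = c("a", "b", "c"))
)

# How many tables are in the list?
length(demo_tables)

# Rows and columns of each table, to help spot the one we want.
lapply(demo_tables, dim)

# Peek at the first table using double square brackets.
head(demo_tables[[1]])
```

Checking the dimensions of each table is often a quick way to tell a large data table apart from small layout or navigation tables.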
Now that we know how many tables we've scraped, we can go back and scrape individual tables by adding the which argument to readHTMLTable().
Use readHTMLTable() to re-scrape the data from the web, but this time use the which argument to scrape just the individual table.
– The which argument should be the integer denoting which table you want scraped.
– Assign the scraped data the name
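The effect of the which argument can be sketched on a small made-up page that contains two tables. This example assumes the XML package is installed and parses HTML from a text string rather than a URL. The names html_text, doc, all_tables, and second_table are hypothetical and are not the names the lab asks for.

```r
library(XML)

# A made-up page with two tables; only the second holds mountain data.
html_text <- "
<html><body>
<TABLE>
  <TR> <TH>note</TH> </TR>
  <TR> <TD>A table we do not need</TD> </TR>
</TABLE>
<TABLE>
  <TR> <TH>peak</TH> <TH>state</TH> </TR>
  <TR> <TD>Denali (Mount McKinley)</TD> <TD>Alaska</TD> </TR>
</TABLE>
</body></html>"

doc <- htmlParse(html_text, asText = TRUE)

# Without which, every table comes back together in one list.
all_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
length(all_tables)

# With which = 2, only the second table is returned, as a single data frame.
second_table <- readHTMLTable(doc, which = 2, stringsAsFactors = FALSE)
print(second_table)
```

Asking for one table directly saves the extra step of pulling it out of the list afterwards.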
Check, save and use!
After scraping the data, the only thing left to do is to save it and use it.
Fill in the blanks to save the data and give it a file name.
save(____, file = "____.Rda")
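To check that a save worked, one option is to remove the object and load it back from the file. The sketch below uses a made-up data frame called demo_data and a temporary file (demo_file) so that nothing in your working folder is touched. Both names are placeholders, not the ones the lab asks for.

```r
# A tiny made-up data frame standing in for scraped data.
demo_data <- data.frame(peak = "Denali (Mount McKinley)", elev_ft = 20236)

# Save to a temporary location so the example leaves no files behind.
demo_file <- file.path(tempdir(), "demo_data.Rda")
save(demo_data, file = demo_file)

# Remove the object, then load it back from the file to confirm the save.
rm(demo_data)
load(demo_file)
print(demo_data)
```

Note that load() restores the object under the name it had when it was saved, so no assignment is needed.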
What is the mean and standard deviation of
Which state has the most mountains in our data?
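Questions like these can be approached with a few base R functions. The sketch below is one possible approach using invented numbers, not the scraped data, and the names demo_peaks and state_counts are hypothetical. Your class may use different functions for the same summaries.

```r
# Made-up values for illustration only; elevations start out stored as text,
# which is common for scraped data.
demo_peaks <- data.frame(
  state   = c("Alaska", "Alaska", "California"),
  elev_ft = c("20000", "16000", "14000"),
  stringsAsFactors = FALSE
)

# Convert the text column to numbers before summarizing.
demo_peaks$elev_ft <- as.numeric(demo_peaks$elev_ft)

# Mean and standard deviation of a numeric column.
mean(demo_peaks$elev_ft)
sd(demo_peaks$elev_ft)

# Count rows per state, then find the state with the largest count.
state_counts <- table(demo_peaks$state)
sort(state_counts, decreasing = TRUE)
names(which.max(state_counts))
```

Converting text to numbers first matters because mean() and sd() do not work on character columns.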