Lab 3E: Scraping Web Data
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
The web as a data source
- The internet contains huge amounts of information.
- Using computers to gather this information in an automated fashion is referred to as scraping web data.
- Scraping data from the web can be difficult because each website displays and stores data differently.
- In this lab, we'll learn how to scrape data in two steps:
- Step 1: Gather information from the web.
- Step 2: Clean it up and turn it into a usable data frame for Lab 3F.
Our first web scraper
- Copy and paste the link below into a web browser to view the website containing the data we'd like to scrape and analyze.
  https://labs.idsucla.org/extras/webdata/mountains.html
- Briefly describe what the data on the website is about.
  – Then write down 3 questions you'd be interested in answering by analyzing this data.
HTML
- HTML is the code that's used to render every website you've ever visited.
- The following slide shows the HTML code used to create the first two rows of the web data.
  – How is the data table in HTML different from the data tables we're used to seeing in R, for example, when we use the View() function?
  – What do you think the tags <TABLE>, <TR>, <TH>, <TD> mean? How does HTML use these tags to display the table?
<TABLE>
<TR>
<TH>peak</TH>
<TH>range</TH>
<TH>state</TH>
<TH>long</TH>
<TH>lat</TH>
<TH>elev_ft</TH>
<TH>elev_m</TH>
<TH>prominence_ft</TH>
<TH>prominence_m</TH>
<TH>rank</TH>
</TR>
<TR>
<TD>Denali (Mount McKinley)</TD>
<TD>Alaska Range</TD>
<TD>Alaska</TD>
<TD>-151.0063</TD>
<TD>63.0690</TD>
<TD>20236</TD>
<TD>6168</TD>
<TD>20174</TD>
<TD>6149</TD>
<TD>1</TD>
</TR>
</TABLE>
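To see how R might turn this markup into a data frame, here is a minimal sketch that parses the same two rows from a text string instead of a live website. It assumes readHTMLTable() and htmlParse() come from the XML package and that the package is installed; the names html_text, doc, snippet_tables, and snippet are made up for this illustration.

```r
library(XML)  # assumption: readHTMLTable() and htmlParse() come from the XML package

# The header row and first data row from the slide, stored as one text string.
html_text <- "
<TABLE>
<TR>
<TH>peak</TH><TH>range</TH><TH>state</TH><TH>long</TH><TH>lat</TH>
<TH>elev_ft</TH><TH>elev_m</TH><TH>prominence_ft</TH><TH>prominence_m</TH><TH>rank</TH>
</TR>
<TR>
<TD>Denali (Mount McKinley)</TD><TD>Alaska Range</TD><TD>Alaska</TD><TD>-151.0063</TD><TD>63.0690</TD>
<TD>20236</TD><TD>6168</TD><TD>20174</TD><TD>6149</TD><TD>1</TD>
</TR>
</TABLE>"

# Parse the string as HTML, then pull out every table it contains.
doc <- htmlParse(html_text, asText = TRUE)
snippet_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# The snippet holds a single table, so take the first element of the list.
snippet <- snippet_tables[[1]]

names(snippet)   # column names, taken from the <TH> cells
snippet$state    # cell values, taken from the <TD> cells
```

Scraped values usually arrive as text, so a column such as elev_ft would likely need as.numeric() before any arithmetic.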
Get to scraping!
- Use your browser to go back to the website with the data we're interested in scraping.
- Find the URL for the site and assign it the name data_url in R.
- Then fill in the blank below to have R scrape every web table available on the site:
  tables <- readHTMLTable(____)
Find our data
- Since readHTMLTable() scrapes every table found at a particular URL, we need to find out which table has the data we're interested in.
  – For example, articles on wikipedia.org often have 3 or more tables.
  – This means we would need to check each of those tables to find the one we're interested in.
- Use the length() function to find out how many tables of data were scraped in our set of tables.
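As a rough illustration of counting scraped tables, the sketch below parses a small made-up page held in a text string. The page contents and the names page and toy_tables are hypothetical, and the XML package is assumed to be installed.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical page with two small tables (made-up values, for illustration only).
page <- "<html><body>
<table>
<tr><th>fruit</th><th>count</th></tr>
<tr><td>apple</td><td>3</td></tr>
</table>
<table>
<tr><th>city</th><th>temp_f</th></tr>
<tr><td>Los Angeles</td><td>75</td></tr>
<tr><td>Fresno</td><td>90</td></tr>
</table>
</body></html>"

# Scrape every table on the page; the result is a list with one entry per table.
toy_tables <- readHTMLTable(htmlParse(page, asText = TRUE))

# How many tables were scraped?
length(toy_tables)

# Peek at each table to decide which one holds the data of interest.
lapply(toy_tables, head)
```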
Saving tables
- Now that we know how many tables we've scraped, we can go back and scrape individual tables by adding the which argument to the readHTMLTable() function.
- Use readHTMLTable() to re-scrape the data from the web, but this time use the which argument to scrape just the individual table.
  – The which argument should be the integer denoting which table you want scraped.
  – Assign the scraped data the name mtns.
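The which argument can be tried out on a small made-up page before using it on the real site. In this sketch the page contents and the names page and second_table are hypothetical, and the XML package is assumed to be installed.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical page with two small tables (made-up values, for illustration only).
page <- "<html><body>
<table>
<tr><th>fruit</th><th>count</th></tr>
<tr><td>apple</td><td>3</td></tr>
</table>
<table>
<tr><th>city</th><th>temp_f</th></tr>
<tr><td>Los Angeles</td><td>75</td></tr>
<tr><td>Fresno</td><td>90</td></tr>
</table>
</body></html>"

# which = 2 asks for only the second table on the page.
# With a single integer, the result is one data frame rather than a list.
second_table <- readHTMLTable(htmlParse(page, asText = TRUE),
                              which = 2,
                              stringsAsFactors = FALSE)

second_table
```

Because only one table is requested, there is no need to index into a list afterwards.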
Check, save and use!
- After scraping the data, the only thing left to do is to save it and use it.
- Fill in the blanks to save the data and give it a file name.
  save(____, file = "____.Rda")
- What are the mean and standard deviation of elev_ft?
- Which state has the most mountains in our data?
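The same save-and-summarize pattern can be practiced on a tiny made-up data frame first. This is only a sketch in base R: the data frame toy, its columns, and its values are hypothetical, and the file is written to a temporary location rather than your working directory.

```r
# Hypothetical data frame standing in for scraped data (made-up values).
# The numbers start out as text, as scraped values often do.
toy <- data.frame(
  name      = c("A", "B", "C", "D"),
  region    = c("North", "North", "South", "North"),
  height_ft = c("1200", "1500", "900", "1800"),
  stringsAsFactors = FALSE
)

# Convert the text column to numbers before summarizing it.
toy$height_ft <- as.numeric(toy$height_ft)

# Mean and standard deviation of a numeric column.
toy_mean <- mean(toy$height_ft)
toy_sd   <- sd(toy$height_ft)

# Count rows per group, then find the group with the largest count.
region_counts <- table(toy$region)
top_region    <- names(which.max(region_counts))

# Save the data frame to an .Rda file, then load it back to check it.
rda_path <- file.path(tempdir(), "toy.Rda")
save(toy, file = rda_path)
load(rda_path)
```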