Lesson 20: Online Data-ing

Objective:

Students will discover that data exists on the Internet in a variety of areas, formats, and for a variety of purposes.

Materials:

  1. Video: Explore a Google Data Center with Street View found at:
    https://www.engadget.com/2012-10-17-google-inside-data-centers.html

  2. Wikipedia – Video Games handout (LMR_3.17_Wikipedia - Video Games)

  3. Wikipedia – Video Games – CSV Format handout (LMR_3.18_Video Games - CSV)

  4. Online Data-ing handout (LMR_3.19_Online Data-ing)

Vocabulary:

data farm, tags, HTML

Essential Concepts:

Stretching the conception of data involves seeing that many web pages present information that can be turned into data.

Lesson:

  1. By a show of hands, ask students whether they have ever heard the term data farm. If any have, ask them to share what they know about it.

  2. Inform students that a data farm is a physical space where high-capacity servers are placed to store large amounts of data.

  3. Introduce the video Explore a Google Data Center with Street View (https://www.engadget.com/2012-10-17-google-inside-data-centers.html) by explaining that the data center students are about to see is one of these data farms, used to store vast amounts of data.

  4. After students watch the video, have a class discussion using the following questions:

    1. We have been talking about data for a few months now. How would you respond if someone asked you, “What are data?” Answers will vary by class.

    2. What are some ways that we have stored data? Data frames in R, Excel spreadsheets, .csv files.

  5. Explain that one of the main ways data are distributed is through the Internet. Data displayed on a web page are often stored in a different format from those we have seen so far. For example, Wikipedia has a page dedicated to the top video games.

  6. Distribute the Wikipedia – Video Games handout (LMR_3.17), and have students explain the information that the data table provides.

  7. Once the students understand what the data table describes, walk them through the first portion of the HTML, or Hypertext Markup Language, source code (on page 1). Notice that the first header on the table is denoted as “Game.” Ask:

    1. How is “Game” represented in the source code? <th>Game</th>

    2. What do you think the <th> code represents? The <th> is a tag for “table header.”

    3. If this were in RStudio, what would we call this header? A variable.

  8. Assign each student team one video game from the Wikipedia data table. Each team will compare how the information is stored in the table with its corresponding HTML source code and answer the following questions in their DS journals.

    1. Each group was given HTML code for a different game. Which one did your group get? Answers will vary. The variable names are stored at the beginning of the code, between <th> and </th>. These markers are called tags; they tell the browser to display the information between them as a header in the table.

    2. Between what tags are the different values of the variables stored? Values are stored between the <td> and </td> tags.

    3. Why do you think the data are stored in such a complex way? Why can’t we just put them in a spreadsheet? Answers may vary by class. One reason is that the data must be displayed in a way that allows a browser to make it look pretty (and readable) on a computer screen.

    4. How could we get this into an R data frame so we can analyze it? In its current form, this would be very difficult. We would need to represent the data in a different format in order for R to understand it.

  9. Distribute the Wikipedia – Video Games – CSV Format handout (LMR_3.18) and explain that this is yet another way to represent the same video game data.

    Note: The handout only provides information on the first 5 rows of the Wikipedia table. A full version of the file (including all video games in the table) is located on the server with the title bestgames.csv.

  10. Inform students that a file with the CSV format is easily readable by R. Then ask:

    1. Where are the variable names stored? The variable names are stored in the first row.

    2. How are values of the variables separated? The values are separated by commas.

    3. If we were interested in using the online data, how would we obtain it? This is a challenging problem – one which students may not know how to answer at this point. The objective is for them to struggle with how they would obtain data and recognize that it is not always as simple as “export, upload, import.”

  11. Split the class into their student teams and distribute the Online Data-ing handout (LMR_3.19). Assign each team a different website (each page of the handout lists a different site) and have them use this site to complete the questions in the handout.

  12. Have each student team share their findings with one other team. They should have their website displayed while discussing their results.
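
The idea behind steps 7 through 10 can be sketched in code. The following is a minimal illustration, written in Python rather than the R used in class, of how the text between <th> and <td> tags might be pulled out of a small HTML table and rewritten as CSV text. The HTML snippet, the "Year" column, and the game names are invented for the example and are not taken from the Wikipedia page or the handouts. The class and function names are likewise hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Invented HTML for illustration only; it is NOT copied from the Wikipedia page.
SAMPLE_HTML = """
<table>
  <tr><th>Game</th><th>Year</th></tr>
  <tr><td>Example Game A</td><td>2001</td></tr>
  <tr><td>Example Game B</td><td>2015</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text inside <th> and <td> tags, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a header or data cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table's rows as CSV text (hypothetical helper)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

The printed text puts the variable names in the first row and separates values with commas, which is the layout described in step 10. Real web pages are messier (links, footnotes, and nested tags inside cells), so this is a starting point for discussion and not a general-purpose scraper.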

Class Scribes:

One team of students will give a brief talk to discuss what they think the 3 most important topics of the day were.

Homework & Next 2 Days

For the next 4 days, students will collect data using their newly created Participatory Sensing campaign.

Lab 3E: Scraping Web Data

Lab 3F: Maps

Complete Labs 3E and 3F prior to Lesson 21.