Fall 2020 projects

STAT 19000

Project 1


Motivation: In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called RStudio, we will take some time to learn how to connect to it, configure it, and run code.

Context: This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data!

Scope: r, rstudio, Scholar

Learning objectives:

  • Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.
  • Install R and set up a working environment.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Read and write basic (csv) data.

Make sure to read about, and use, the template found here, and the important information about project submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/disney/splash_mountain.csv

Questions

1. Read the webpage here. Scholar is the computing cluster you will be using throughout the semester, and your time with The Data Mine. Each node is an individual machine with CPUs and memory (RAM). How many cores and how much memory are available, in total, for our 7 frontend nodes? How about for the sub-clusters? How much is available on your computer or laptop?

Item(s) to submit:

  • A sentence explaining how much memory and how many cores the 7 frontends have combined.
  • A sentence explaining how much memory and how many cores the 28 sub-cluster nodes have combined.
  • A sentence explaining how much memory and how many cores your personal computer has.
Solution

The memory on the 7 frontend nodes is: 4 * 512 + 3 * 768 = 4352 GB = 4.3 TB.

The memory on the 28 nodes in the sub-cluster is: 24 * 64 + 4 * 192 = 2304 GB = 2.3 TB.

The memory on Dr. Ward's laptop is 8 GB.

The number of cores on the 7 frontend nodes is: 4 * 20 + 3 * 20 = 140 cores.

The number of cores on the 28 nodes in the sub-cluster is: 24 * 20 + 4 * 16 = 544 cores.

The number of cores on Dr. Ward's laptop is 2.

2. Press the Control and Enter/Return keys at the same time to run this line of code. Which frontend are you in?

Relevant topics: running R code

Item(s) to submit:

  • The # of the frontend your RStudio Server session is running on.
Solution

The node we are working on is:

system("hostname", intern=T)
3. From within RStudio, we can run every type of code that you will need to run throughout your time with The Data Mine: Python, R, Bash, SQL, etc. We've created a one-time setup script for you to run, called

/class/datamine/apps/runme.sh (as seen in the video at the top of this page).

After you restart R (as in the video, after 4 minutes and 16 seconds), there should be a message that is printed in your "Console" tab. What does the message say?

Item(s) to submit:

  • The sentence that is printed in the RStudio "Console".
Solution

The welcome message is: "You've successfully loaded The Data Mine R settings!"

4. Projects in The Data Mine should all be submitted using our template found here or on Scholar (/class/datamine/apps/templates/project_template.Rmd). At the beginning of every project, the first step should be downloading and/or copying and pasting the template into a .Rmd file in RStudio. This is also demonstrated in the video at the top of this page.
Open the project template and save it into your home directory, in a new RMarkdown file named project01.Rmd.
Code chunks are parts of the RMarkdown file that contain code. You can identify what type of code a code chunk contains by looking at the engine in the curly braces "{" and "}". How many of each type of code chunk are in our default template?

Hint: You can read about the template here.
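
For illustration (the chunk label below is made up), a code chunk in an RMarkdown file is delimited by three backticks, with the engine named inside the curly braces:

```{r my-example-chunk}
1 + 1
```

A Python chunk begins with ```{python} and a bash chunk with ```{bash} instead, so counting the engines inside the curly braces tells you how many chunks of each type the template contains.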

Item(s) to submit:

  • A list containing each type of code chunk (r, Python, sql, etc.) and how many code chunks of each type our default template contains.
Solution

There are 3 chunks of R code, 1 chunk of bash, 1 chunk of Python, and 1 chunk of SQL.

5. Fill out the project template, replacing the default information with your own. If a category is not applicable to you, put N/A. This template provides examples of how to run each "type" of code we will run in this course. Look for the second R code chunk, and run it by clicking the tiny green play button in the upper right hand corner of the code chunk. What is the output?

Item(s) to submit:

  • The output from running the R code chunk.
Solution

We store 1, 2, 3 into the variable my_variable, and then we display output: 1, 2, 3

my_variable <- c(1,2,3)
my_variable
## [1] 1 2 3
6. In question (1) we answered questions about CPUs and RAM for the Scholar cluster. To do so, we needed to perform some arithmetic. Instead of using a calculator (or paper), write these calculations using R. Replace the content of the second R code chunk in our template with your calculations.

Relevant topics: templates

Item(s) to submit:

  • The R code chunk with your calculations, and output.
Solution

We go back to question 1 and compute the values directly in R:

4 * 512 + 3 * 768
## [1] 4352
24 * 64 + 4 * 192
## [1] 2304
4 * 20 + 3 * 20
## [1] 140
24 * 20 + 4 * 16
## [1] 544
7. In (6) we got to see how you can type out arithmetic and R will calculate the result for you. One constant throughout the semester will be loading datasets into R. Load our dataset into R by running the following code:
dat <- read.csv("/class/datamine/data/disney/splash_mountain.csv")

Confirm the dataset has been read in by running the head function on it. head prints the first few rows of data:

head(dat)

dat is a variable which contains our data! We can name this variable anything we want; we do not have to name it dat. Run our code to read in our dataset again, but this time, instead of naming the resulting dataset dat, name it splash_mountain. Place all of your code into a new R code chunk under a new level 3 header (i.e., ### Question 7).

Relevant topics: reading data in R

Item(s) to submit:

  • Code used to answer this question in a code chunk in our template.
  • Output of head.
Solution

We load the splash_mountain data and display the head:

splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
head(splash_mountain)
##         date            datetime SACTMIN SPOSTMIN
## 1 01/01/2015 2015-01-01 07:51:12      NA        5
## 2 01/01/2015 2015-01-01 08:02:13      NA        5
## 3 01/01/2015 2015-01-01 08:09:12      NA        5
## 4 01/01/2015 2015-01-01 08:16:12      NA        5
## 5 01/01/2015 2015-01-01 08:23:12      NA        5
## 6 01/01/2015 2015-01-01 08:29:12      NA        5
8. Let's pretend we are now done with our project. We've written some R code, maybe added some text explaining what we did, and we are ready to turn things in. For this course, we will turn in a variety of work, depending on the type of project.
We will always require a PDF which contains text, code, and code output. Normally we would erase any code chunks from the template that are not used; however, for this project, it is OK to just keep the rest of the template intact.
A PDF is generated by "knitting" a PDF (using the "knit" button in RStudio).
In addition, if the project uses R code, you will need to also submit R code in an R script (file ending with .R). (Later this year, when submitting Python code, you will submit a Python script instead.)
Let's practice. Take the code from your project01.R file and paste it (perhaps one or two lines at a time) into your RMarkdown file (file ending with .Rmd).
Compile your RMarkdown project into a PDF. Follow the directions in Brightspace to upload and submit your RMarkdown file, compiled PDF, and R script.

Relevant topics: templates

Item(s) to submit:

  • Resulting knitted PDF.
  • project01.R script (with all of your R code) and the analogous project01.Rmd file.
How to build the R script for Project 1 in STAT 19000.

In the videos below, for Question 1 and Question 6, Dr. Ward forgot to calculate the number of cores. (He only included the total amount of memory.) Dr. Ward is a human being who sometimes makes mistakes. Please remember to (also) calculate the number of cores, when you submit Question 1 and Question 6!

Click here for video

How to build the Rmd file and the PDF file for Project 1 in STAT 19000.

Click here for video


Project 2


Introduction to R using 84.51 examples

Click here for video

Introduction to R using NYC Yellow Taxi Cab examples

Click here for video

Motivation: The R environment is a powerful tool to perform data analysis. R is a tool that is often compared to Python. Both have their advantages and disadvantages, and both are worth learning. In this project we will dive in head first and learn the basics while solving data-driven problems.

Context: Last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some R code. In this project, we will continue to use R within RStudio to solve problems. Soon you will see how powerful R is and why it is often a more effective tool to use than spreadsheets.

Scope: r, vectors, indexing, recycling

Learning objectives:

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Read and write basic (csv) data.
  • Explain what "recycling" is in R and predict behavior of provided statements.
  • Identify good and bad aspects of simple plots.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/disney/metadata.csv

A public sample of the data can be found here: /class/datamine/data/disney/metadata.csv

Questions

1. Use the read.csv function to load /class/datamine/data/disney/metadata.csv into a data.frame called myDF. Note that read.csv by default loads data into a data.frame. (We will learn more about the idea of a data.frame, but for now, just think of it like a spreadsheet, in which each column has the same type of data.) Print the first few rows of myDF using the head function (as in Project 1, Question 7).

Relevant topics: reading data in r, head

Item(s) to submit:

  • R code used to solve the problem in an R code chunk.
Solution

We load the data from the disney metadata into a data frame called myDF

myDF <- read.csv("/class/datamine/data/disney/metadata.csv")
2. We've provided you with R code below that will extract the column WDWMAXTEMP of myDF into a vector. What is the 1st value in the vector? What is the 50th value in the vector? What type of data is in the vector? (For this last question, use the typeof function to find the type of data.)
our_vec <- myDF$WDWMAXTEMP

Relevant topics: indexing in r, type, creating variables

Item(s) to submit:

  • R code used to solve the problem in an R code chunk.
  • The values of the first, and 50th element in the vector.
  • The type of data in the vector (using the typeof function).
Solution

We first load the column WDWMAXTEMP of the data frame myDF into a vector:

our_vec <- myDF$WDWMAXTEMP

The 1st element is:

our_vec[1]
## [1] 73.02

The 50th element is:

our_vec[50]
## [1] 51.24

The type of data in the vector is:

typeof(our_vec)
## [1] "double"
3. Use the head function to create a vector called first50 that contains the first 50 values of the vector our_vec. Use the tail function to create a vector called last50 that contains the last 50 values of the vector our_vec.
You can access many elements in a vector at the same time. To demonstrate this, create a vector called mymix that contains the element-wise sum of first50 and last50 (each element of first50 added to the corresponding element of last50).

Relevant topics: indexing in r, creating variables, head and tail

Item(s) to submit:

  • R code used to solve this problem.
  • The contents of each of the three vectors.
Solution

We store the first 50 values of our_vec in first50:

first50 <- head(our_vec, n=50)

Here are the first 50 values:

first50
##  [1] 73.02 78.00 83.12 83.93 72.30 77.67 67.24 59.44 54.89 67.16 77.10 78.24
## [13] 74.89 64.51 62.12 63.58 72.83 70.59 69.89 71.03 75.78 77.09 78.40 70.77
## [25] 63.48 65.48 66.78 61.13 70.11 71.89 69.80 73.34 75.02 66.33 73.02 65.22
## [37] 66.33 71.78 73.58 70.42 64.23 66.96 71.39 59.54 63.24 73.42 74.11 78.70
## [49] 60.48 51.24

We store the last 50 values of our_vec in last50:

last50 <- tail(our_vec, n=50)

Here are the last 50 values:

last50
##  [1] 78.73 78.38 75.73 74.79 76.84 81.25 79.42 77.66 81.15 82.07 70.62 82.63
## [13] 79.12 78.55 76.34 74.10 78.62 74.31 67.69 73.30 74.51 77.66 81.15 82.07
## [25] 70.62 82.63 70.54 57.34 62.29 72.54 62.47 72.31 75.78 72.01 78.48 81.57
## [37] 82.06 80.45 82.01 76.05 79.38 79.38 79.38 79.74 70.41 68.23 58.31 64.29
## [49] 64.29 64.29

The sums of these two vectors, element by element, are:

mymix <- first50 + last50

Here are these 50 values:

mymix
##  [1] 151.75 156.38 158.85 158.72 149.14 158.92 146.66 137.10 136.04 149.23
## [11] 147.72 160.87 154.01 143.06 138.46 137.68 151.45 144.90 137.58 144.33
## [21] 150.29 154.75 159.55 152.84 134.10 148.11 137.32 118.47 132.40 144.43
## [31] 132.27 145.65 150.80 138.34 151.50 146.79 148.39 152.23 155.59 146.47
## [41] 143.61 146.34 150.77 139.28 133.65 141.65 132.42 142.99 124.77 115.53
4. In (3) we were able to rapidly add values together from two different vectors. Both vectors were the same size; hence, it was obvious which elements in each vector were added together.
Create a new vector called hot which contains only the values of myDF$WDWMAXTEMP which are greater than or equal to 80 (our vector contains max temperatures for days at Disney World). How many elements are in hot?
Calculate the sum of hot and first50. Do we get a warning? Read this and then explain what is going on.

Relevant topics: logical indexing, length, recycling

Item(s) to submit:

  • R code used to solve this problem.
  • 1-2 sentences explaining what is happening when we are adding two vectors of different lengths.
Solution

The values of myDF$WDWMAXTEMP that are greater than or equal to 80 are:

hot <- myDF$WDWMAXTEMP[myDF$WDWMAXTEMP >= 80]

The length of the vector hot is:

length(hot)
## [1] 1255

If we calculate the sum of hot and first50,

mynewsum <- hot + first50
## Warning in hot + first50: longer object length is not a multiple of shorter
## object length

we get a warning because R recycles the shorter vector, first50, to match the length of hot, and the length of hot (1255) is not a multiple of the length of first50 (50). In this situation it does not make sense to add the two vectors element by element.
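
By contrast, when the longer vector's length is an exact multiple of the shorter vector's length, R recycles the shorter vector silently, with no warning. A small made-up example:

c(1, 2, 3, 4) + c(10, 20)
## [1] 11 22 13 24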

5. Plot the WDWMAXTEMP vector from myDF.

Item(s) to submit:

  • R code used to solve this problem.
  • Plot of the WDWMAXTEMP vector from myDF.

Relevant topics: plotting

Solution

We plot the WDWMAXTEMP vector from myDF:

plot(myDF$WDWMAXTEMP)

6. The following three pieces of code each create a graphic. The first two graphics are created using only core R functions. The third graphic is created using a package called ggplot. We will learn more about all of these things later on. For now, pick your favorite graphic, and write 1-2 sentences explaining why it is your favorite, what could be improved, and include any interesting observations (if any).
dat <- table(myDF$SEASON)
dotchart(dat, main="Seasons", xlab="Number of Days in Each Season")
dat <- tapply(myDF$WDWMEANTEMP, myDF$DAYOFYEAR, mean, na.rm=T)
seasons <- tapply(myDF$SEASON, myDF$DAYOFYEAR, function(x) unique(x)[1])
pal <- c("#4E79A7", "#F28E2B", "#A0CBE8",  "#FFBE7D", "#59A14F", "#8CD17D", "#B6992D", "#F1CE63", "#499894", "#86BCB6", "#E15759", "#FF9D9A", "#79706E", "#BAB0AC", "#1170aa", "#B07AA1")
colors <- factor(seasons)
levels(colors) <- pal
par(oma=c(7,0,0,0), xpd=NA)
barplot(dat, main="Average Temperature", xlab="Jan 1 (Day 0) - Dec 31 (Day 365)", ylab="Degrees in Fahrenheit", col=as.factor(colors), border = NA, space=0)
legend(0, -30, legend=levels(factor(seasons)), lwd=5, col=pal, ncol=3, cex=0.8, box.col=NA)
library(ggplot2)
library(tidyverse)
summary_temperatures <- myDF %>%
    select(MONTHOFYEAR,WDWMAXTEMP:WDWMEANTEMP) %>%
    group_by(MONTHOFYEAR) %>%
    summarise_all(mean, na.rm=T)
ggplot(summary_temperatures, aes(x=MONTHOFYEAR)) +
    geom_ribbon(aes(ymin = WDWMINTEMP, ymax = WDWMAXTEMP), fill = "#ceb888", alpha=.5) +
    geom_line(aes(y = WDWMEANTEMP), col="#5D8AA8") +
    geom_point(aes(y = WDWMEANTEMP), pch=21,fill = "#5D8AA8", size=2) +
    theme_classic() +
    labs(x = 'Month', y = 'Temperature', title = 'Average temperature range' ) +
    scale_x_continuous(breaks=1:12, labels=month.abb)
Solution

The seasons plot makes it easy to compare the values. On the other hand, the values do not appear to be given in any particular order.

The average temperature plot clearly indicates the rise and fall of the temperatures over the course of the year, but the color scheme is unusual. The colors in the plot do not seem to correspond to the colors in the legend. Additionally, the color schemes in both the plot and the legend seem to be chosen randomly.


Project 3


Motivation: data.frames are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a data.frame.

Context: In the previous project we got our feet wet, ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we've already learned and introduce a new, flexible data structure called data.frames.

Scope: r, data.frames, recycling, factors

Learning objectives:

  • Explain what "recycling" is in R and predict behavior of provided statements.
  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/disney

Questions

1. Read the dataset /class/datamine/data/disney/splash_mountain.csv into a data.frame called splash_mountain. How many columns, or features, are in the dataset? How many rows, or observations?

Relevant topics: str, dim

Item(s) to include:

  • R code used to solve the problem.
  • How many columns (features) and how many rows (observations) are in the dataset?
Solution

We read in the Splash Mountain data set.

splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")

We can use either the str or the dim function to see that this data frame has 223936 rows and 4 columns.

str(splash_mountain)
## 'data.frame':    223936 obs. of  4 variables:
##  $ date    : chr  "01/01/2015" "01/01/2015" "01/01/2015" "01/01/2015" ...
##  $ datetime: chr  "2015-01-01 07:51:12" "2015-01-01 08:02:13" "2015-01-01 08:09:12" "2015-01-01 08:16:12" ...
##  $ SACTMIN : int  NA NA NA NA NA NA NA NA NA 4 ...
##  $ SPOSTMIN: int  5 5 5 5 5 5 5 5 5 NA ...
dim(splash_mountain)
## [1] 223936      4
2. Splash Mountain is a fan favorite ride at Disney World's Magic Kingdom theme park. splash_mountain contains a series of dates and datetimes. For each datetime, splash_mountain contains a posted minimum wait time, SPOSTMIN, and an actual minimum wait time, SACTMIN. What is the average posted minimum wait time for Splash Mountain? What is the standard deviation? Based on the fact that SPOSTMIN represents the posted minimum wait time for our ride, do our mean and standard deviation make sense? Explain. (You might look ahead to Question 3 before writing the answer to Question 2.)

Hint: If you got NA or NaN as a result, see here.

Relevant topics: mean, var, NA, NaN

Item(s) to submit:

  • R code used to solve this problem.
  • The results of running the R code.
  • 1-2 sentences explaining why or why not the results make sense.
Solution

The average minimum posted wait time for Splash Mountain is

mean(splash_mountain$SPOSTMIN, na.rm=T)
## [1] -71.70373

and the standard deviation is

sd(splash_mountain$SPOSTMIN, na.rm=T)
## [1] 328.0586

This is strange because the average minimum posted wait time should not be a negative value, and the standard deviation seems to be very large.

3. In (2) we got some peculiar values for the mean and standard deviation. If you read the "attractions" tab in the file /class/datamine/data/disney/touringplans_data_dictionary.xlsx, you will find that -999 is used as a value in SPOSTMIN and SACTMIN to indicate that the ride is closed. Recalculate the mean and standard deviation of SPOSTMIN, excluding values that are -999. Does this seem to have fixed our problem?

Relevant topics: NA, mean, var, indexing, which

Item(s) to submit:

  • R code used to solve this problem.
  • The result of running the R code.
  • A statement indicating whether or not the values look reasonable now.
Solution

If we remove the -999 values, then the average minimum posted wait time (with the -999 values removed) is

mean(splash_mountain$SPOSTMIN[splash_mountain$SPOSTMIN != -999], na.rm=T)
## [1] 43.3892

and the standard deviation (with the -999 values removed) is

sd(splash_mountain$SPOSTMIN[splash_mountain$SPOSTMIN != -999], na.rm=T)
## [1] 31.74894

This looks more reasonable!

4. SPOSTMIN and SACTMIN aren't the greatest feature/column names. An outsider looking at the data.frame wouldn't be able to immediately get the gist of what they represent. Change SPOSTMIN to posted_min_wait_time and SACTMIN to actual_wait_time.

Hint: You can always use hard-coded integers to change names manually, however, if you use which, you can get the index of the column name that you would like to change. For data.frames like splash_mountain, this is a lot more efficient than manually counting which column is the one with a certain name.

Relevant topics: colnames, names, which

Item(s) to submit:

  • R code used to solve the problem.
  • The output from executing names(splash_mountain) or colnames(splash_mountain).
Solution

The current column names of splash_mountain are:

colnames(splash_mountain)
## [1] "date"     "datetime" "SACTMIN"  "SPOSTMIN"

We can find which column is called SPOSTMIN

which(colnames(splash_mountain) == "SPOSTMIN")
## [1] 4

and change this column name to posted_min_wait_time

colnames(splash_mountain)[4] <- "posted_min_wait_time"

Next, we can find which column is called SACTMIN

which(colnames(splash_mountain) == "SACTMIN")
## [1] 3

and change this column name to actual_wait_time

colnames(splash_mountain)[3] <- "actual_wait_time"
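
For reference, the which lookup and the renaming can also be combined into a single line each, so the column index never has to be typed by hand (shown here as it would be run on a freshly read copy of splash_mountain, before the renaming above):

colnames(splash_mountain)[which(colnames(splash_mountain) == "SPOSTMIN")] <- "posted_min_wait_time"
colnames(splash_mountain)[which(colnames(splash_mountain) == "SACTMIN")] <- "actual_wait_time"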

Alternatively, we could have used names instead of colnames and the effect would have been totally the same.

5. Use the cut function to create a new vector called quarter that breaks the date column up by quarter. Use the labels argument in the factor function to label the quarters "q1", "q2", ..., "qX" where X is the last quarter. Add quarter as a column named quarter in splash_mountain. How many quarters are there?

Hint: If you have 2 years of data, this will result in 8 quarters: "q1", ..., "q8".

Hint: We can generate sequential data using seq and paste0:

paste0("item", seq(1, 5))
## [1] "item1" "item2" "item3" "item4" "item5"

or

paste0("item", 1:5)
## [1] "item1" "item2" "item3" "item4" "item5"

Relevant topics: cut, dates, factor, paste0, seq, nlevels

Item(s) to submit:

  • R code used to solve the problem.
  • The head and tail of splash_mountain.
  • The number of quarters in the new quarter column.

Question 5 is intended to be a little more challenging, so we worked through the exact same steps with two other data sets. That way, if you work through these, all you will need to do to solve Question 5 is follow the example and change two things: the data set itself (in the read.csv call) and the format of the date.

This basically steps you through everything in Question 5.

We hope that these are helpful resources for you! We appreciate you very much and we are here to support you! You would not be expected to know how to solve this question on your own, because we are just getting started, but we like to sometimes put in a question like this, in which you get introduced to several new things, and we will dive deeper into these ideas as we push ahead.

Click here for video

Click here for video

Solution

We first take the date column from splash_mountain and treat it as Date values, and we break the dates according to which quarter they are in.

myresults <- cut((as.Date(splash_mountain$date,"%m/%d/%Y")), breaks="quarter")

There are 20 quarters altogether:

nlevels(myresults)
## [1] 20

We set the levels to be the letter q followed by a number between 1 and nlevels(myresults):

levels(myresults) <- paste0("q",1:nlevels(myresults))
splash_mountain$quarters <- myresults
head(splash_mountain)
##         date            datetime actual_wait_time posted_min_wait_time quarters
## 1 01/01/2015 2015-01-01 07:51:12               NA                    5       q1
## 2 01/01/2015 2015-01-01 08:02:13               NA                    5       q1
## 3 01/01/2015 2015-01-01 08:09:12               NA                    5       q1
## 4 01/01/2015 2015-01-01 08:16:12               NA                    5       q1
## 5 01/01/2015 2015-01-01 08:23:12               NA                    5       q1
## 6 01/01/2015 2015-01-01 08:29:12               NA                    5       q1
6. Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.

Project 4


Motivation: Control flow is (roughly) the order in which instructions are executed. We can execute certain tasks or code if certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops.

Context: We are gaining familiarity working in RStudio and writing R code. In this project we introduce and practice using control flow in R.

Scope: r, data.frames, recycling, factors, if/else, for

Learning objectives:

  • Explain what "recycling" is in R and predict behavior of provided statements.
  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • List the differences between lists, vectors, factors, and data.frames, and when to use each.
  • Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/disney

Questions

1. Use read.csv to read in the /class/datamine/data/disney/splash_mountain.csv data into a data.frame called splash_mountain. In the previous project we calculated the mean and standard deviation of SPOSTMIN (the posted minimum wait time). These are vectorized operations (we will learn more about this in the next project). Instead of using the mean function, use a loop to calculate the mean (average), just like in the previous project. Do not use sum either.

Hint: Remember, if a value is NA, we don't want to include it.

Hint: Remember, if a value is -999, it means the ride is closed, we don't want to include it.

Note: This exercise should make you appreciate the variety of useful functions R has to offer!

Relevant topics: for loops, if/else statements, is.na

Item(s) to submit:

  • R code used to solve the problem w/comments explaining what the code does.
  • The mean posted wait time.
Solution
splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
# running total of the valid posted wait times, and a count of how many valid values we see
sum_posted_wait_time <- 0
not_nas_and_nines <- 0
for (i in splash_mountain$SPOSTMIN) {
    # skip values that are NA or -999 (ride closed)
    if (!is.na(i) && i != -999) {
        sum_posted_wait_time <- sum_posted_wait_time + i
        not_nas_and_nines <- not_nas_and_nines + 1
    }
}
# the mean is the running total divided by the number of valid values
mean_posted_wait_time <- sum_posted_wait_time/not_nas_and_nines
mean_posted_wait_time
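
As a quick sanity check (not part of the required loop), the loop's result should agree with the vectorized calculation from Project 3, Question 3:

mean(splash_mountain$SPOSTMIN[splash_mountain$SPOSTMIN != -999], na.rm=T)
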
2. Choose one of the .csv files containing data for a ride. Use read.csv to load the file into a data.frame named ride_name where "ride_name" is the name of the ride you chose. Use a for loop to loop through the ride file and add a new column called status. status should contain a string whose value is either "open", or "closed". If SPOSTMIN or SACTMIN is -999, classify the row as "closed". Otherwise, classify the row as "open". After status is added to your data.frame, convert the column to a factor.

Hint: If you want to access two columns at once from a data.frame, you can do: splash_mountain[i, c("SPOSTMIN", "SACTMIN")].

Relevant topics: any, for loops, if/else statements, nrow

Note: For loops are often much slower (here is a video to demonstrate) than vectorized functions, as we will see in (3) below.

Item(s) to submit:

  • R code used to solve the problem w/comments explaining what the code does.
  • The output from running str on ride_name.

In this video, we basically go all the way through Question 2 using a video:

Click here for video

Solution
splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
status <- c()
for (i in 1:nrow(splash_mountain)) {
    if (any(splash_mountain[i, c("SACTMIN", "SPOSTMIN")]==-999, na.rm=T)) {
        status[i] <- "closed"
    } else {
        status[i] <- "open"
    }
}
splash_mountain$status <- factor(status)
3. Typically you want to avoid using for loops (or even apply functions (we will learn more about these later on, don't worry)) when they aren't needed. Instead you can use vectorized operations and indexing. Repeat (2) without using any for loops or apply functions (instead use indexing and the which function). Which method was faster?

Hint: To have multiple conditions within the which statement, use | for logical OR and & for logical AND.

Hint: You can start by assigning every value in status as "open", and then change the correct values to "closed".

Note: Here is a complete example (very much like question 3) with another video that shows how we can classify objects.

Note: Here is a complete example with a video that makes a comparison between the concept of a for loop versus the concept for a vectorized function.

Relevant topics: which

Item(s) to submit:

  • R code used to solve the problem w/comments explaining what the code does.
  • The output from running str on ride_name.
Solution
splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
splash_mountain$status <- "open"
splash_mountain$status[which(splash_mountain$SPOSTMIN == -999 | splash_mountain$SACTMIN == -999)] <- c("closed")
splash_mountain$status <- factor(splash_mountain$status)
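
To actually answer which method was faster, one option is to wrap each approach in system.time (a sketch; exact timings will vary by machine, but the vectorized version is typically much faster):

# time the for loop approach from question 2
system.time({
    status <- c()
    for (i in 1:nrow(splash_mountain)) {
        if (any(splash_mountain[i, c("SACTMIN", "SPOSTMIN")] == -999, na.rm=T)) {
            status[i] <- "closed"
        } else {
            status[i] <- "open"
        }
    }
})

# time the vectorized approach from question 3
system.time({
    status <- rep("open", nrow(splash_mountain))
    status[which(splash_mountain$SPOSTMIN == -999 | splash_mountain$SACTMIN == -999)] <- "closed"
})
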
4. Create a pie chart for open vs. closed for splash_mountain.csv. First, use the table command to get a count of each status. Use the resulting table as input to the pie function. Make sure to give your pie chart a title that somehow indicates the ride to the audience.

Relevant topics: pie, table

Item(s) to submit:

  • R code used to solve the problem w/comments explaining what the code does.
  • The resulting plot displayed as output in the RMarkdown.
Solution
pie(table(splash_mountain$status), main="Splash Mountain")
5. Loop through the vector of files we've provided below, and create a pie chart of open vs closed for each ride. Place all 6 resulting pie charts on the same image. Make sure to give each pie chart a title that somehow indicates the ride.
ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean", "expedition_everest", "flight_of_passage", "rock_n_rollercoaster")
ride_files <- paste0("/class/datamine/data/disney/", ride_names, ".csv")

Hint: To place all of the resulting pie charts in the same image, prior to running the for loop, run par(mfrow=c(2,3)).

Relevant topics: for loop, read.csv, pie, table, par

This is not exactly the same, but it is a similar example, using the campaign election data:

mypiechart <- function(x) {
  myDF <- read.csv( paste0("/class/datamine/data/election/itcont", x, ".txt"), sep="|")
  mystate <- rep("other", times=nrow(myDF))
  mystate[myDF$STATE == "CA"] <- "California"
  mystate[myDF$STATE == "TX"] <- "Texas"
  mystate[myDF$STATE == "NY"] <- "New York"
  myDF$stateclassification <- factor(mystate)
  pie(table(myDF$stateclassification))
}

myyears <- c("1980","1984","1988","1992","1996","2000")
par(mfrow=c(2,3))
for (i in myyears) {
  mypiechart(i)
}

Click here for video

Here is another video, which guides students even more closely through Question 5.

Click here for video

Item(s) to submit:

  • R code used to solve the problem w/comments explaining what the code does.
  • The resulting plot displayed as output in the RMarkdown.
Solution
ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean", "flight_of_passage", "expedition_everest", "rock_n_rollercoaster")
ride_files <- paste0(c("/class/datamine/data/disney/"), ride_names, ".csv")
par(mfrow=c(2,3))
for (i in 1:length(ride_names)) {
    dat <- read.csv(ride_files[i])
    dat$status <- "open"
    dat$status[which(dat$SPOSTMIN == -999 | dat$SACTMIN == -999)] <- c("closed")
    dat$status <- factor(dat$status)
    pie(table(dat$status), main=ride_names[i])
}

Project 5


Motivation: As briefly mentioned in project 4, R differs from other programming languages in that typically you will want to avoid using for loops, and instead use vectorized functions and the apply suite. In this project we will demonstrate some basic vectorized operations, and how they are better to use than loops.

Context: While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.

Scope: r, data.frames, recycling, factors, if/else, for

Learning objectives:

  • Explain what "recycling" is in R and predict behavior of provided statements.
  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • List the differences between lists, vectors, factors, and data.frames, and when to use each.
  • Demonstrate a working knowledge of control flow in r: for loops .

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/fars

To get more information on the dataset, see here.

Questions

1. The fars dataset contains a series of folders labeled by year. In each year folder there are (at least) the files ACCIDENT.CSV, PERSON.CSV, and VEHICLE.CSV. If you take a peek at any ACCIDENT.CSV file in any year, you'll notice that the column YEAR only contains the last two digits of the year. Add a new YEAR column that contains the full year. Use the rbind function to create a data.frame called accidents that combines the ACCIDENT.CSV files from the years 1975 through 1981 (inclusive) into one big dataset. After creating that accidents data frame, change the values in the YEAR column from two digits to four digits (i.e., paste a 19 onto each year value).

Relevant topics: rbind, read.csv, paste0

Here is a video to walk you through the method of solving Question 1.

Click here for video

Here is another video, using two functions you have not (yet) learned, namely, lapply and do.call. You do not need to understand these yet. It is just a glimpse of some powerful functions to come later in the course!

Click here for video

Item(s) to submit:

  • R code used to solve the problem/comments explaining what the code does.
  • The result of unique(accidents$YEAR).
Solution
accidents <- data.frame()
for (i in 1975:1981) {
    filename <- paste0("/class/datamine/data/fars/", i, "/ACCIDENT.CSV")
    dat <- read.csv(filename)
    dat$YEAR <- i
    accidents <- rbind(accidents, dat)
}
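
The lapply and do.call approach mentioned in the second video might look roughly like this (a sketch; the helper function name read_one_year is made up):

read_one_year <- function(y) {
    dat <- read.csv(paste0("/class/datamine/data/fars/", y, "/ACCIDENT.CSV"))
    dat$YEAR <- y   # overwrite the two-digit year with the full four-digit year
    dat
}
accidents <- do.call(rbind, lapply(1975:1981, read_one_year))
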
2. Using the new accidents data frame that you created in (1), how many accidents are there in which 1 or more drunk drivers were involved in an accident with a school bus?

Hint: Look at the variables DRUNK_DR and SCH_BUS.

Relevant topics: table, which, indexing

Here is a video about a related problem with 3 fatalities (instead of considering drunk drivers).

Click here for video

Item(s) to submit:

  • R code used to solve the problem/comments explaining what the code does.
  • The result/answer itself.
Solution
table(accidents$DRUNK_DR, accidents$SCH_BUS)
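
The table above lets us read off the answer; a more direct way to get the single count (assuming, as in the hint and in question 3, that SCH_BUS equal to 1 means a school bus was involved) is:

sum(accidents$DRUNK_DR >= 1 & accidents$SCH_BUS == 1, na.rm=TRUE)
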
3. Again using the accidents data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?

Relevant topics: table, which, indexing

Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers), tabulated according to year.

Click here for video

Item(s) to submit:

  • R code used to solve the problem/comments explaining what the code does.
  • The results.
  • Which year had the most qualifying accidents.
Solution
result <- table(accidents$YEAR[which(accidents$DRUNK_DR > 0 & accidents$SCH_BUS==1)])
result[which.max(result)]
4. Again using the accidents data frame: Calculate the mean number of motorists involved in an accident (variable PERSONS) with i drunk drivers, where i takes the values from 0 through 6.

Hint: It is OK that there are no accidents involving just 5 drunk drivers.

Hint: You can use either a for loop or a tapply function to accomplish this question.

Relevant topics: for loops, indexing, tapply, mean,

Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers). We calculate the mean number of fatalities for accidents with i drunk drivers, where i takes the values from 0 through 6.

Click here for video

Item(s) to submit:

  • R code used to solve the problem/comments explaining what the code does.
  • The output from running your code.
Solution
for (i in 0:6) {
    print(mean(accidents$PERSONS[accidents$DRUNK_DR == i], na.rm=T))
}
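
Alternatively, as the hint suggests, the same means can be computed in a single line with tapply (this is exactly the approach revisited in Project 6, Question 1):

tapply(accidents$PERSONS, accidents$DRUNK_DR, mean, na.rm=T)
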
5. Again using the accidents data frame: We have a theory that there are more accidents in cold weather months for Indiana and states around Indiana. For this question, only consider the data for which STATE is one of these: Indiana (18), Illinois (17), Ohio (39), or Michigan (26). Create a barplot that shows the number of accidents by STATE and by month (MONTH) simultaneously. What months have the most accidents? Are you surprised by these results? Explain why or why not.

We guide students through the methodology for Question 5 in this video. We also add a legend, in case students want to distinguish which stacked barplot goes with each of the four States.

Click here for video

Relevant topics: in, barplot

Item(s) to submit:

  • R code used to solve the problem/comments explaining what the code does.
  • The output (plot) from running your code.
  • 1-2 sentences explaining which month(s) have the most accidents and whether or not this surprises you.
Solution
dat <- accidents[accidents$STATE %in% c(17, 18, 26, 39),]
barplot(table(dat$STATE, dat$MONTH))
OPTIONAL QUESTION. Spruce up your plot from (5). Do any of the following:
  • add vibrant (and preferably colorblind friendly) colors to your plot
  • add a title
  • add a legend
  • add month names or abbreviations instead of numbers

Hint: Here is a resource to get you started.

Item(s) to submit:

  • R code used to solve the problem/comments explaining what the code does.
  • The output (plot) from running your code.
Solution
library(RColorBrewer)
dat <- accidents[accidents$STATE %in% c(17, 18, 26, 39),]
dat$STATE <- factor(dat$STATE, labels=c("Illinois", "Indiana", "Michigan", "Ohio"))
dat$MONTH <- factor(dat$MONTH, labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
color_palette <- brewer.pal(5, "Set2")
barplot(table(dat$STATE, dat$MONTH), col=color_palette, main="Accidents by Month", legend=T, ylim=c(0, 5000), las=2)

Project 6


The tapply function works like this:

tapply( somedata, thewaythedataisgrouped, myfunction)

myDF <- read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
head(myDF)

We could do four computations to compute the mean SPEND amount in each STORE_R

mean(myDF$SPEND[myDF$STORE_R == "CENTRAL"])
mean(myDF$SPEND[myDF$STORE_R == "EAST   "])
mean(myDF$SPEND[myDF$STORE_R == "SOUTH  "])
mean(myDF$SPEND[myDF$STORE_R == "WEST   "])

but it is easier to do all four of these calculations with the tapply function. We take a mean of the SPEND values, broken into groups according to the STORE_R

tapply( myDF$SPEND, myDF$STORE_R, mean)

We could find the total amount in the SPEND column in 2016 and then again in 2017.

sum(myDF$SPEND[myDF$YEAR == "2016"])
sum(myDF$SPEND[myDF$YEAR == "2017"])

or we could do both of these calculations at once, using the tapply function. We take the sum of all SPEND amounts, broken into groups according to the YEAR

tapply(myDF$SPEND, myDF$YEAR, sum)

As a last example, we can calculate the amount spent on each day of purchases. We take the sum of all SPEND amounts, broken into groups according to the purchase day (the PURCHASE_ column)

tapply(myDF$SPEND, myDF$PURCHASE_, sum)

It makes sense to sort the results and then look at the 20 days on which the sum of the SPEND amounts were the highest.

tail(sort(  tapply(myDF$SPEND, myDF$PURCHASE_, sum)  ),n=20)

Click here for video

tapply( mydata, mygroups, myfunction, na.rm=T )

Here are some generic uses to explain how this would look, if we made the calculations in a naive/verbose/painful way

myfunction(mydata[mygroups == 1], na.rm=T)
myfunction(mydata[mygroups == 2], na.rm=T)
myfunction(mydata[mygroups == 3], na.rm=T) ....

myfunction(mydata[mygroups == "IN"], na.rm=T)
myfunction(mydata[mygroups == "OH"], na.rm=T)
myfunction(mydata[mygroups == "IL"], na.rm=T) ....
myDF <- read.csv("/class/datamine/data/flights/subset/2005.csv")
head(myDF)

Sum the Distances of the flights, split into groups according to the airline (UniqueCarrier)

sort(tapply(myDF$Distance, myDF$UniqueCarrier, sum))

Find the mean flight Distance, grouped according to the city of Origin

sort(tapply(myDF$Distance, myDF$Origin, mean))

Calculate the mean departure delay (DepDelay), for each airplane (i.e., each TailNum), using na.rm=T because some of the values of the departure delays are NA

tail(sort(tapply(myDF$DepDelay, myDF$TailNum, mean, na.rm=T)),n=20)

Click here for video

library(data.table)
myDF <- fread("/class/datamine/data/election/itcont2016.txt", sep="|")
head(myDF)

Sum the amounts of all contributions made, grouped according to the STATE where the people lived

sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum))

Sum the amounts of all contributions made, grouped according to the CITY/STATE where the people lived

tail(sort(tapply(myDF$TRANSACTION_AMT, paste(myDF$CITY, myDF$STATE), sum)),n=20)
mylocations <-  paste(myDF$CITY, myDF$STATE)
tail(sort(tapply(myDF$TRANSACTION_AMT, mylocations, sum)),n=20)

Sum the amounts of all contributions made, grouped according to the EMPLOYER where the people worked

tail(sort(tapply(myDF$TRANSACTION_AMT, myDF$EMPLOYER, sum)), n=30)

Click here for video

Motivation: tapply is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. Typically, when writing R code, you will want to use an "apply suite" function rather than a for loop.

Context: The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called tapply from the "apply suite" of functions in R.

Scope: r, for, tapply

Learning objectives:

  • Explain what "recycling" is in R and predict behavior of provided statements.
  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • List the differences between lists, vectors, factors, and data.frames, and when to use each.
  • Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.
  • Demonstrate how apply functions are generally faster than using loops.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/fars/7581.csv

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

1. The dataset, /class/datamine/data/fars/7581.csv, contains the combined accident records from the years 1975 to 1981. Load up the dataset into a data.frame named dat. In the previous project's question 4, we asked you to calculate the mean number of motorists involved in an accident (variable PERSONS) with i drunk drivers, where i takes the values from 0 through 6. This time, solve this question using the tapply function instead. Which method did you prefer and why?
Now that you've read the data into a dataframe named dat, run the following code:
# Read in data that maps state codes to state names
state_names <- read.csv("/class/datamine/data/fars/states.csv")

# Create a vector of state names called v
v <- state_names$state

# Set the names of the new vector to the codes
names(v) <- state_names$code

# Create a new column in the dat dataframe with the actual names of the states
dat$mystates <- v[as.character(dat$STATE)]

Relevant topics: tapply, mean

Item(s) to submit:

  • R code used to solve the problem.
  • The output/solution.
Solution
dat <- read.csv("/class/datamine/data/fars/7581.csv")
state_names <- read.csv("/class/datamine/data/fars/states.csv")
v <- state_names$state
names(v) <- state_names$code
dat$mystates <- v[as.character(dat$STATE)]
tapply(dat$PERSONS, dat$DRUNK_DR, mean, na.rm=T)
2. Make a state-by-state classification of the average number of drunk drivers in an accident. Which state has the highest average number of drunk drivers per accident?

Relevant topics: tapply, mean, sort

Item(s) to submit:

  • R code used to solve the problem.
  • The entire output.
  • Which state has the highest average number of drunk drivers per accident?
Solution
st <- tapply(dat$DRUNK_DR, dat$mystates, mean, na.rm=T)
st[which.max(st)]

# or

sort(tapply(dat$DRUNK_DR, dat$mystates, mean, na.rm=T), decreasing=T)
3. Add up the total number of fatalities, according to the day of the week on which they occurred. Are the numbers surprising to you? What days of the week have a higher number of fatalities? If instead you calculate the proportion of fatalities over the total number of people in the accidents, what would you expect? Calculate it and see if your expectations match.

Hint: Sundays through Saturdays are days 1 through 7, respectively. Day 9 indicates that the day is unknown.

This video example uses the Amazon fine food reviews dataset to make a similar calculation, in which we have two tapply statements, and we divide the results to get a ton of similar ratios all at once. Powerful stuff! It may guide you in your thinking about this question.

Code from the example: double tapply examples

Click here for video

Relevant topics: tapply

Item(s) to submit:

  • R code used to solve the problem.
  • What days have the highest number of fatalities?
  • What would you expect if you calculate the proportion of fatalities over the total number of people in the accidents?
Solution
sort(tapply(dat$FATALS, dat$DAY_WEEK, sum))
sort(tapply(dat$FATALS, dat$DAY_WEEK, sum)/tapply(dat$PERSONS, dat$DAY_WEEK, sum))
4. How many drunk drivers are involved, on average, in crashes that occur on straight roads? How many drunk drivers are involved, on average, in crashes that occur on curved roads? Solve the pair of questions in a single line of R code.

Hint: The ALIGNMNT variable is 1 for straight, 2 for curved, and 9 for unknown.

Relevant topics: tapply, sum

Item(s) to submit:

  • R code used to solve the problem.
  • Results from running the R code.
Solution
tapply(dat$DRUNK_DR, dat$ALIGNMNT, mean, na.rm=T)
5. Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.

This example demonstrates a comparable calculation. In the video, I used the total number of people in the accident, and your question is (instead) about the number of fatalities, but this is essentially the only difference. I hope it helps to explain the way that the cut function works, along with the analogous breaks.

Click here for video

Relevant topics: tapply, cut, sum

Item(s) to submit:

  • R code used to solve the problem.
  • Results from running the R code.
Solution
tapply(dat$FATALS, cut(dat$HOUR, breaks=c(0,6,12,18,24,99), include.lowest=T), sum)
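
The question also asks for the average number of fatalities per crash in each time interval; the same cut grouping can be reused with mean instead of sum:

tapply(dat$FATALS, cut(dat$HOUR, breaks=c(0,6,12,18,24,99), include.lowest=T), mean, na.rm=T)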

Project 7


Motivation: Three bread-and-butter functions that are a part of base R are: subset, merge, and split. subset provides a more natural way to filter and select data from a data.frame. split is a useful function that splits a dataset based on one or more factors. merge brings to R the principles of combining data that SQL uses.

Context: We've been getting comfortable working with data within the R environment. Now we are going to expand our toolset with three useful functions, all the while gaining experience and practice wrangling data!

Scope: r, subset, merge, split, tapply

Learning objectives:

  • Gain proficiency using split, merge, and subset.
  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Demonstrate how to use tapply to solve data-driven problems.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/goodreads/csv

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

1. Load up the following two datasets, goodreads_books.csv and goodreads_book_authors.csv, into two data.frames, books and authors, respectively. How many columns and rows are in each of these two datasets?

Relevant topics: read.csv, dim

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
books <- read.csv("/class/datamine/data/goodreads/csv/books_sample.csv")
authors <- read.csv("/class/datamine/data/goodreads/csv/goodreads_book_authors.csv")

dim(books)
dim(authors)

# or

str(books)
str(authors)
2. We want to figure out how book size (num_pages) is associated with various metrics. First, let's create a vector called book_size that categorizes books into 4 categories based on num_pages: small (up to 250 pages), medium (250-500 pages), large (500-1000 pages), huge (1000+ pages).

Note: This video and code might be helpful.

Relevant topics: cut

Item(s) to submit:

  • R code used to solve the problem.
  • The result of table(book_size).
Solution
book_size <- cut(books$num_pages, breaks=c(-Inf, 250, 500, 1000, Inf), labels=c("small", "medium", "large", "huge"))
table(book_size)
head(books$num_pages)

# We can split each element into categories,
# based on how num_pages compares to 0, 250, 500, 1000, 
# and a huge number (say, 50000000, or Inf for infinity)
head(cut(books$num_pages, breaks=c(0,250,500,1000,50000000),include.lowest=T))

# Or like this:
head(cut(books$num_pages, breaks=c(0,250,500,1000,Inf),include.lowest=T))

# It is worthwhile to make sure that we included every book.  
# For instance, if you do not use include.lowest=T, then we 
# will miss the pages that are exactly equal to 0.
# Or if you do not use a big enough number on the right hand side,
# then you will miss the very long pages.
# We know, from question 1, that we should have 1000000 books
# and we can verify that we got them all here:
table( cut(books$num_pages, breaks=c(0,250,500,1000,Inf),include.lowest=T), useNA="always" )
sum(table( cut(books$num_pages, breaks=c(0,250,500,1000,Inf),include.lowest=T), useNA="always" ))

# Now we create the book_size vector:
book_size <- cut(books$num_pages, breaks=c(0,250,500,1000,Inf), include.lowest=T)

# and we can check that the first several entries look reasonable.
head(book_size)
3. Use tapply to calculate the mean average_rating, text_reviews_count, and publication_year by book_size. Did any of the results surprise you? Why or why not?

Relevant topics: tapply

Item(s) to submit:

  • R code used to solve the problem.
  • The output from running the R code.
Solution
tapply(books$average_rating, book_size, mean, na.rm=T)
tapply(books$text_reviews_count, book_size, mean, na.rm=T)
tapply(books$publication_year, book_size, mean, na.rm=T)
# It might be surprising, for instance, that longer books get 
# more reviews and higher reviews. Answers to this part of the 
# question will vary.
4. Notice in (3) how we used tapply 3 times. This would get burdensome if we decided to calculate 4 or 5 or 6 columns instead. Instead of using tapply, we can use split, lapply, and colMeans to perform the same calculations.
Use split to partition the data containing only the following 3 columns: average_rating, text_reviews_count, and publication_year, by book_size. Save the result as books_by_size. What class is the result? lapply is a function that allows you to loop over each item in a list and apply a function. Use lapply and colMeans to perform the same calculation as in (3).

Note: This video and code and also this video and code might be helpful.

Relevant topics: lapply, split, colMeans, indexing

Item(s) to submit:

  • R code used to solve the problem.
  • The output from running the code.
Solution
books_by_size <- split(books[,c('text_reviews_count','average_rating','publication_year')], book_size)
lapply(books_by_size, colMeans, na.rm=TRUE)
# We can build a temporarily data frame that has the 3 desired variables:
head(data.frame(books$average_rating, books$text_reviews_count, books$publication_year))

# and split it according to the book_size.
# We get a list of length 4 as a result:
books_by_size <- split(data.frame(books$average_rating, books$text_reviews_count, books$publication_year), book_size)
class(books_by_size)
length(books_by_size)
str(books_by_size)

# and then we can take the colMeans of each of these 4 parts of the list:
lapply(books_by_size, colMeans, na.rm=T)

# and the results agree with the results from question 3.
5. We are working with a lot more data than we really want right now. We've provided you with the following code to filter out non-English books and only keep columns of interest. This will create a data frame called en_books.
en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")]

Now create an equivalent data frame of your own, by using the subset function (instead of indexing). Use res as the name of the data frame that you create. Do the dimensions (using dim) of en_books and res agree? Why or why not? (They should both have 8 columns, but a different number of rows.)

Hint: Since the dimensions don't match, take a look at NA values for the variables used to subset our data.

Note: This video and code and also this video and code might be helpful.

Relevant topics: indexing, subset, NA, %in%

Item(s) to submit:

  • R code used to solve the problem.
  • Do the dimensions match?
  • 1-2 sentences explaining why or why not.
Solution
en_books_subset <- subset(books, language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & publication_year > 2000, select=c(author_id, book_id, average_rating, description, title, ratings_count, language_code, publication_year))


en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")]

dim(en_books)
dim(en_books_subset)

sum(is.na(en_books_subset$language_code))
sum(is.na(en_books$language_code))

# you can see that without the NAs the results are the same
all.equal(en_books_subset, en_books[!is.na(en_books$language_code),], check.attributes=F)

# subset is automatically removing NAs

# We build a new data frame called res that takes a subset
# of the data frame books that meet the two given conditions 
# on language_code and publication_year and that contains the 8 
# desired columns.
res <- subset(books, subset=language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & publication_year > 2000, select=c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year"))
class(res)
dim(res)

# This data frame is smaller.
# The earlier data frame has some rows with missing information,
# which are not included here.
# (We might want to give examples/details about what is missing.)
6. We now have a nice and tidy subset of data, called res. It would be really nice to get some information on the authors. We can find that information in the authors dataset loaded in question 1! In question 2 of the previous project, we had a similar issue with the state names. There is a much better and easier way to solve these types of problems. Use the merge function to combine res and authors in a way which appends all information from authors when there is a match in res. Use the condition by="author_id" in the merge. This is all you need to do:
mymergedDF <- merge(res, authors, by="author_id")

Note: The resulting data frame will have all of the columns that are found in either res or authors. When we perform the merge, we only insist that the author_id should match. We do not expect that the ratings_count or average_rating should agree in res versus authors. Why? In the res data frame, the ratings_count and average_rating refer to the specific book, but in the authors data frame, the ratings_count and average_rating refer to the total works by the author. Therefore, in mymergedDF, there are columns ratings_count.x and average_rating.x from res, and there are columns ratings_count.y and average_rating.y from authors.

Note: Although we provided the necessary code for this example, you might want to know more about the merge function. This video and code and also this video and code might be helpful.

Relevant topics: merge

Item(s) to submit:

  • the given R code used to solve the problem.
  • The dim of the newly merged data.frame.
Solution
mymergedDF <- merge(res, authors, by="author_id")
dim(mymergedDF)
7. For an author of your choice (that is in the dataset), find the author's highest rated book. Do you agree with the result?

Relevant topics: indexing, subset, which, max

Item(s) to submit:

  • R code used to solve the problem.
  • The title of the highest rated book (from your author).
  • 1-2 sentences explaining why you do or do not agree with it being the highest rated book from that author.
Solution
brandons_books <- mymergedDF[mymergedDF$name=="Brandon Sanderson",]
brandons_books[which.max(brandons_books$average_rating.x),]
myDF <- merge(res, authors, by="author_id")

# Here are the books by author Sandra Cisneros:
myfavoriteDF <- myDF[myDF$name=="Sandra Cisneros", ]
myfavoriteDF

# Her highest rated book has the rating 4.23:
max(myfavoriteDF$average_rating.x)

# It is called "A House of My Own: Stories from My Life":
myfavoriteDF[which.max(myfavoriteDF$average_rating.x), ]$title
OPTIONAL QUESTION. Look at the column names of the new dataframe created in question 6. Notice that there are two columns for ratings_count and two columns for average_rating. The names with an appended x are the values from the first argument to merge, and the names with an appended y are the values from the second argument to merge. Rename these columns to indicate whether they refer to a book or an author.

Hint: For example, ratings_count.x could become ratings_count_book, and ratings_count.y could become ratings_count_author.

Relevant topics: names

Item(s) to submit:

  • R code used to solve the problem.
  • The names of the new data.frame.
Solution
# rename the columns of the merged data frame from question 6
names(mymergedDF) <- gsub("\\.x", "_book", names(mymergedDF))
names(mymergedDF) <- gsub("\\.y", "_author", names(mymergedDF))

names(mymergedDF)

Project 8


Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!

Context: We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.

Scope: r, functions

Learning objectives:

  • Gain proficiency using split, merge, and subset.
  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Demonstrate how to use tapply to solve data-driven problems.
  • Comprehend what a function is, and the components of a function in R.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/goodreads/csv

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs may lose points.

1. Read in the same data, in the same way as the previous project (with the same names). We've provided you with the function below. How many arguments does the function have? Name all of the arguments. What is the name of the function? Replace the description column in our books data.frame with the same information, but with stripped punctuation using the function provided.
# A function that, given a string (myColumn), returns the string
# without any punctuation.
strip_punctuation <- function(myColumn) {
    # Use regular expressions to identify punctuation.
    # Replace identified punctuation with an empty string ''.
    desc_no_punc <- gsub('[[:punct:]]+', '', myColumn)
    
    # Return the result
    return(desc_no_punc)
}

Hint: Since gsub accepts a vector of values, you can pass an entire vector to strip_punctuation.

Relevant topics: functions

Item(s) to submit:

  • R code used to solve the problem.
  • How many arguments does the function have?
  • What are the name(s) of all of the arguments?
  • What is the name of the function?
Solution
# The function has 1 argument called "myColumn"
# The name of the function is "strip_punctuation"
books$description <- strip_punctuation(books$description)
2. Use the strsplit function to split a string by spaces. Some examples would be:
strsplit("This will split by space.", " ")
## [[1]]
## [1] "This"   "will"   "split"  "by"     "space."
strsplit("This. Will. Split. By. A. Period.", "\\.")
## [[1]]
## [1] "This"    " Will"   " Split"  " By"     " A"      " Period"
An example string is:
test_string <- "This is  a test string  with no punctuation"
Test out strsplit using the provided test_string. Make sure to copy and paste the code that declares test_string. If you counted the words shown in your results, would it be an accurate count? Why or why not?

Relevant topics: strsplit, functions

Item(s) to submit:

  • R code used to solve the problem.
  • 1-2 sentences explaining why or why not your count would be accurate.
Solution
test_string <- "This is a  test string  with no punctuation"
strsplit(test_string, " ")

# It would not be accurate because the empty strings created by the extra spaces are counted as words.
3. Fix the issue in (2), using which. You may need to unlist the strsplit result first. After you've accomplished this, you can count the remaining words!

Relevant topics: strsplit, sum, which

Item(s) to submit:

  • R code used to solve the problem (including counting the words).
Solution
test <- unlist(strsplit(test_string, " "))
length(test[test!=""])
sum(test!="")
4. We are finally to the point where we have code from questions (2) and (3) that we think we may want to use many times. Write a function called count_words which, given a string, description, returns the number of words in description. Test out count_words on the description from the second row of books. How many words are in the description?

Relevant topics: functions, unlist, indexing, strsplit

Item(s) to submit:

  • R code used to solve the problem.
  • The result of using the function on the description from the second row of books.
Solution
count_words <- function(description) {
    split_desc <- unlist(strsplit(description, " "))
    
    return(length(split_desc[split_desc!=""]))
}

count_words(books$description[2])
5. Practice makes perfect! Write a function of your own design that is intended on being used with one of our datasets. Test it out and share the results.

Note: You could even pass (as an argument) one of our datasets to your function and calculate a cool statistic or something like that! Maybe your function makes a plot? Who knows?

Relevant topics: functions

Item(s) to submit:

  • R code used to solve the problem.
  • An example (with output) of using your newly created function.
Solution
# Could be anything.

Project 9


Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!

Context: We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.

Scope: r, functions

Learning objectives:

  • Gain proficiency using split, merge, and subset.
  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Demonstrate how to use tapply to solve data-driven problems.
  • Comprehend what a function is, and the components of a function in R.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/goodreads/csv

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs may lose points.

1. We've provided you with a function below. How many arguments does the function have, and what are their names? You can get a book_id from the URL of a goodreads book's webpage. For example, if you search for the book Words of Radiance on goodreads, the book_id contained in the url https://www.goodreads.com/book/show/17332218-words-of-radiance#, is 17332218. Another example is https://www.goodreads.com/book/show/157993.The_Little_Prince?from_search=true&from_srp=true&qid=JJGqUK9Vp9&rank=1, (The Little Prince) with a book_id of 157993.
Find 2 or 3 book_ids and test out the function until you get two successes. Explain, in words, what the function is doing and what options you have.
library(imager)

books <- read.csv("/class/datamine/data/goodreads/csv/goodreads_books.csv")
authors <- read.csv("/class/datamine/data/goodreads/csv/goodreads_book_authors.csv")

get_author_name <- function(my_authors_dataset, my_author_id){
        return(my_authors_dataset[my_authors_dataset$author_id==my_author_id,'name'])
}

fun_plot <- function(my_authors_dataset, my_books_dataset, my_book_id, display_cover=T) {
    book_info <- my_books_dataset[my_books_dataset$book_id==my_book_id,]
    all_books_by_author <- my_books_dataset[my_books_dataset$author_id==book_info$author_id,]
    author_name <- get_author_name(my_authors_dataset, book_info$author_id)
    
    img <- load.image(book_info$image_url)
    
    if(display_cover){
        par(mfrow=c(1,2))
        plot(img, axes=FALSE)
    }
    
    plot(all_books_by_author$num_pages, all_books_by_author$average_rating, 
         ylim=c(0,5.1), pch=21, bg='grey80',
         xlab='Number of pages', ylab='Average rating', 
         main=paste('Books by', author_name))
    
    points(book_info$num_pages, book_info$average_rating,pch=21, bg='orange', cex=1.5)
}

Relevant topics: functions

Item(s) to submit:

  • How many arguments does the function have, and what are their names?
  • The result of using the function on 2-3 book_ids.
  • 1-2 sentences explaining what the function does (generally), and what (if any) options the function provides you with.
Solution
# fun_plot has 4 arguments, my_authors_dataset, my_books_dataset, my_book_id, and display_cover. 
# Given a `book_id`, the function finds all books by the same author and plots the book cover 
# next to a plot of ratings where the current book is highlighted. You can choose to show the 
# cover image or not.

fun_plot(authors, books, 17332218)
fun_plot(authors, books, 17593132)
2. You may have encountered a situation where the my_book_id was not in our dataset, and hence, didn't get plotted. When writing functions, it is usually best to try and foresee issues like this and have the function fail gracefully, instead of showing some ugly (and sometimes unclear) warning. Add some code at the beginning of our function that checks to see if my_book_id is within our dataset, and if it does not exist, prints "Book ID not found.", and exits the function. Test it out on book_id=123 and book_id=19063.

Hint: Run ?stop to see if that is a function that may be useful.

Relevant topics: functions, if/else, stop

Item(s) to submit:

  • R code with your new and improved function.
  • The results from fun_plot(authors, books, 123).
  • The results from fun_plot(authors, books, 19063).
Solution
fun_plot <- function(my_authors_dataset, my_books_dataset, my_book_id, display_cover=T) {
  
    if (!(my_book_id %in% my_books_dataset$book_id)) {
        stop("Book ID not found.")
    }
  
    book_info <- my_books_dataset[my_books_dataset$book_id==my_book_id,]
    all_books_by_author <- my_books_dataset[my_books_dataset$author_id==book_info$author_id,]
    author_name <- get_author_name(my_authors_dataset, book_info$author_id)
    
    img <- load.image(book_info$image_url)
    
    if(display_cover){
        par(mfrow=c(1,2))
        plot(img, axes=FALSE)
    }
    
    plot(all_books_by_author$num_pages, all_books_by_author$average_rating, 
         ylim=c(0,5.1), pch=21, bg='grey80',
         xlab='Number of pages', ylab='Average rating', 
         main=paste('Books by', author_name))
    
    points(book_info$num_pages, book_info$average_rating,pch=21, bg='orange', cex=1.5)
}

fun_plot(authors, books, 19063)
fun_plot(authors, books, 123)
3. We have this nice get_author_name function that accepts a dataset (in this case, our authors dataset) and an author_id, and returns the name of the author. Write a new function called get_author_id that accepts an author's name and returns the author_id of the author.
You can test your function using some of these examples:
get_author_id(authors, "Brandon Sanderson") # 38550
get_author_id(authors, "J.K. Rowling") # 1077326

Relevant topics: functions

Item(s) to submit:

  • R code containing your new function.
  • The results of using your new function on a few authors.
Solution
get_author_id <- function(my_author_dataset, my_author_name) {
  return(my_author_dataset[my_author_dataset$name==my_author_name,]$author_id)
}

get_author_id(authors, "Brandon Sanderson") # 38550
get_author_id(authors, "J.K. Rowling") # 1077326
4. See the function below.
search_books_for_word <- function(word) {
    return(books[grepl(word, books$description, fixed=T),]$title)
}
Given a word, search_books_for_word returns the titles of books where the provided word is inside the book's description. search_books_for_word utilizes the books dataset internally. It requires that the books dataset has been loaded into the environment prior to running (and with the correct name). By including and referencing objects defined outside of our function's scope within our function (in this case the variable books), our search_books_for_word function will be more prone to errors, as any changes to those objects may break our function. For example:
our_function <- function(x) {
  print(paste("Our argument is:", x))
  print(paste("Our variable is:", my_variable))
}

# our variable outside the scope of our_function
my_variable <- "dog"

# run our_function
our_function("first")

# change the variable outside the scope of our function
my_variable <- "cat"

# run our_function again
our_function("second")

# imagine a scenario where "my_variable" doesn't exist, our_function would break!
rm(my_variable)
our_function("third")
Fix our search_books_for_word function to accept the books dataset as an argument called my_books_dataset and utilize my_books_dataset within the function instead of the global variable books.

Relevant topics: functions, read.csv, scoping

Item(s) to submit:

  • R code with your new and improved function.
  • An example using the updated function.
Solution
search_books_for_word <- function(my_books_dataset, word) {
    return(my_books_dataset[grepl(word, my_books_dataset$description, fixed=T),]$title)
}
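For example, one illustrative call (the search word here is an arbitrary choice, just to show the updated function in use):

# example usage of the updated function; any search word works here
head(search_books_for_word(books, "wizard"))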
5. Write your own custom function. Make sure your function includes at least 2 arguments. If you access one of our datasets from within your function (which you definitely should do), use what you learned in (4), to avoid future errors dealing with scoping. Your function could output a cool plot, interesting tidbits of information, or anything else you can think of. Get creative and make a function that is fun to use!

Relevant topics: scoping, functions

Item(s) to submit:

  • R code used to solve the problem.
  • Examples using your function with included output.
Solution
# Could be anything.

Project 10


Motivation: Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called functional programming. In this project, we will learn to apply functions to entire vectors of data using sapply.

Context: We've just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. sapply is one of the best ways to do this in R.

Scope: r, sapply, functions

Learning objectives:

  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Utilize apply functions in order to solve a data-driven problem.
  • Gain proficiency using split, merge, and subset.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/okcupid/filtered

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs may lose points.

Solution
# You can show files and subdirectories using list.files
list.files("/class/datamine/data/okcupid")

# You can also take a look at files using file.info
file.info("/class/datamine/data/okcupid/filtered/questions.csv")

# You can peek at a file before reading it all in using readLines
readLines("/class/datamine/data/okcupid/filtered/questions.csv", n=10) 

# we can see things are separated by semicolons

users <- read.csv("/class/datamine/data/okcupid/filtered/users.csv")
questions <- read.csv("/class/datamine/data/okcupid/filtered/questions.csv", sep=";")
2. grep is an incredibly powerful tool available to us in R. We will learn more about grep in the future, but for now, know that a simple application of grep is to find a word in a string. In R, grep is vectorized and can be applied to an entire vector of strings. Use grep to find a question that references "google". What is the question?

Click here for video

Click here for video

Hint: If at first you don't succeed, run ?grep and check out the ignore.case argument.

Hint: To prepare for Question 3, look at the entire row of the questions data frame that has the question about google. The first entry on this row tells you the question that you need, in the users data frame, while working on Question 3.

Relevant topics: grep

Item(s) to submit:

  • R code used to solve the problem.
  • The text of the question that references Google.
Solution
questions[grep("google", questions$text, ignore.case=T),]
3. In (2) we found a pretty interesting question. What is the percentage of users that Google someone before the first date? Does the proportion change by gender (as defined by gender2)? How about by gender_orientation?

Hint: The two videos posted in Question 2 might help.

Hint: If you look at the column of users corresponding to the question identified in (2), you will see that this column of users has two possible answers, namely: "No. Why spoil the mystery?" and "Yes. Knowledge is power!".

Hint: Use the tapply function with three inputs:

the correct column of users,

breaking up the data according to gender2 or according to gender_orientation,

and use this as your function in the tapply:

function(x) {prop.table(table(x, useNA="always"))}

Relevant topics: functions, tapply, table, prop.table

Item(s) to submit:

  • R code used to solve this problem.
  • The results of running the code.
  • Written answers to the questions.
Solution
prop.table(table(users$`q170849`))

prop_google <- function(answers_to_q170849){
  return(prop.table(table(answers_to_q170849))[2])
}

tapply(users$`q170849`, users$gender2, prop_google)
tapply(users$`q170849`, users$gender_orientation, prop_google)

# ~35% of people Google people before the first date.

# According to this set of data, men Google women about 10% less than women Google men before the first date.

# When broken down by the provided gender orientations, hetero females, bisexual females, and gay males tend to
# Google people before the first date more than bisexual males, gay females, and hetero males.
4. In project (8) we created a function called count_words. Use this function and sapply to create a vector which contains the number of words in each row of the column text from the questions dataframe. Call the new vector question_length, and add it as a column to the questions dataframe.
count_words <- function(my_text) {
    my_split_text <- unlist(strsplit(my_text, " "))
    
    return(length(my_split_text[my_split_text!=""]))
}

Click here for video

Relevant topics: sapply

Item(s) to submit:

  • R code used to solve this problem.
  • The result of str(questions) (this shows how your questions data frame looks, after adding the new column called question_length).
Solution
count_words <- function(description) {
    split_desc <- unlist(strsplit(description, " "))
    
    return(length(split_desc[split_desc!=""]))
}

questions$question_length <- sapply(as.character(questions$text), count_words)
5. Consider this function called number_of_options that accepts a data frame (for instance, questions)
number_of_options <- function(myDF) {
   table(apply(as.matrix(myDF[ ,3:6]), 1, function(x) {sum(!(x==""))}))
}

and counts the number of questions that have each possible number of responses. For instance, if we calculate: number_of_options(questions) we get:

  0   2   3   4 
590 936 519 746

which means that: 590 questions have 0 possible responses; 936 questions have 2 possible responses; 519 questions have 3 possible responses; and 746 questions have 4 possible responses.

Now use the split function to break the data frame questions into 7 smaller data frames, according to the value in questions$Keywords. Then use the sapply function to determine, for each possible value of questions$Keywords, the analogous breakdown of questions with different numbers of responses, as we did above.

Hint: You can write:

mylist <- split(questions, questions$Keywords)
sapply(mylist, number_of_options)

Click here for video

Background/explanation: The way sapply works is that the first argument is, by default, the first argument to your function, the second argument is the function you want applied, and after that you can specify additional arguments by name. For example:

test1 <- c(1, 2, 3, 4, NA, 5)
test2 <- c(9, 8, 6, 5, 4, NA)
mylist <- list(first=test1, second=test2)

# for a single vector in the list
mean(mylist$first, na.rm=T)

# what if we want to do this for each vector in the list?
# how do we remove na's?
sapply(mylist, mean)

# we can specify the arguments that are for the mean function
# by naming them after the first two arguments, like this
sapply(mylist, mean, na.rm=T)

# in the code shown above, na.rm=T is passed to the mean function
# just like if you run the following
mean(mylist$first, na.rm=T)
mean(mylist$second, na.rm=T)

# you can include as many arguments to mean as you normally would
# and in any order. just make sure to name the arguments
sapply(mylist, mean, na.rm=T, trim=0.5)
# or sapply(mylist, mean, trim=0.5, na.rm=T) 

# which is similar to 
mean(mylist$first, na.rm=T, trim=0.5)
mean(mylist$second, na.rm=T, trim=0.5)

Relevant topics: sapply, functions, indexing

Item(s) to submit:

  • R code used to solve this problem.
  • The results of the running the code.
Solution
number_of_options <- function(myDF) {
    table(apply(as.matrix(myDF[ ,3:6]), 1, function(x) {sum(!(x==""))}))
}

number_of_options(questions)

mylist <- split(questions, questions$Keywords)
sapply(mylist, number_of_options)
6. Lots of questions are asked in this okcupid dataset. Explore the dataset, and either calculate an interesting statistic/result using sapply, or generate a graphic (with good x-axis and/or y-axis labels, main labels, legends, etc.), or both! Write 1-2 sentences about your analysis and/or graphic, and explain what you thought you'd find, and what you actually discovered.

Click here for video

Relevant topics: plotting, functions, sapply

Item(s) to submit:

  • R code used to solve this problem.
  • The results from running your code.
  • 1-2 sentences about your analysis and/or graphic, and explain what you thought you'd find, and what you actually discovered.
Solution
# Could be anything.
OPTIONAL QUESTION. Does it appear that there is an association between the length of the question and whether or not users answered the question? Assume NA means "unanswered". First create a function called percent_answered that, given a vector, returns the percentage of values that are not NA. Use percent_answered and sapply to calculate the percentage of users who answer each question. Plot this result, against the length of the questions.

Hint: length_of_questions <- questions$question_length[grep("^q", questions$X)]

Hint: grep("^q", questions$X) returns the column index of every column that starts with "q". Use the same trick we used in the previous hint, to subset our users data.frame before using sapply to apply percent_answered.

Relevant topics: sapply, is.na, length, grep, plot

Item(s) to submit:

  • R code used to solve this problem.
  • The plot.
  • Whether or not you think there may or may not be an association between question length and whether or not the question is answered.
Solution
number_of_options <- function(questions, question_id) {
    
    is_empty <- function(value) {
        if (is.na(value)) return(TRUE)
        else if (value=="") return(TRUE)
        else return(FALSE)
    }
    
    question <- questions[questions$X==question_id,]
    
    is_empty(question$option_1) + is_empty(question$option_2) + is_empty(question$option_3) + is_empty(question$option_4)
    
}
prop.table(table(sapply(questions$X, number_of_options, questions = questions)))

questions$number_options <- sapply(questions$X, number_of_options, questions = questions)


percent_answered <- function(col) {
    sum(!is.na(col))/length(col)
}

p_answered <- sapply(users[,grep("^q", names(users))], percent_answered)
length_of_questions <- questions$question_length[grep("^q", questions$X)]
plot(length_of_questions, p_answered)

Project 11


Motivation: The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you've learned so far this semester to solve data-driven problems. In previous projects, we've directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you'd like.

Context: You've learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of economic data from Zillow.

Scope: R

Learning objectives:

  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Utilize apply functions in order to solve a data-driven problem.
  • Gain proficiency using split, merge, and subset.
  • Comprehend what a function is, and the components of a function in R.
  • Demonstrate the ability to use nested apply functions to solve a data-driven problem.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/zillow

Questions

1. Read /class/datamine/data/zillow/Zip_time_series.csv into a data.frame called zipc. Look at the RegionName column. It is supposed to be a 5-digit zip code. Either fix the column by writing a function and applying it to the column, or take the time to read the read.csv documentation by running ?read.csv and use an argument to make sure that column is not read in as an integer (which is why zip codes starting with 0 lose the leading 0 when being read in).

Hint: This video demonstrates how to read in data and respect the leading zeroes.

Click here for video

Relevant topics: read.csv, sapply, functions, strrep, nchar

Item(s) to submit:

  • R code used to solve the problem.
  • head of the RegionName column.
Solution
zipc <- read.csv("/class/datamine/data/zillow/Zip_time_series.csv")
fix_zip <- function(x) {
    if (nchar(x) == 5) {
        return(as.character(x))
    }
    paste0(strrep("0", 5-nchar(x)), x)
}
zipc$RegionName <- sapply(zipc$RegionName, fix_zip)
head(zipc$RegionName)

# OR

zipc <- read.csv("/class/datamine/data/zillow/Zip_time_series.csv", colClasses = c("RegionName"="character"))

# OR, using fread from the data.table package (note that fread also needs
# colClasses so that RegionName keeps its leading zeroes):

library(data.table)
zipc <- data.frame(fread("/class/datamine/data/zillow/Zip_time_series.csv",
                         colClasses = c("RegionName"="character")))
zipc$Date <- as.Date(zipc$Date)
2. One might assume that the owner of a house tends to value that house more than the buyer. If that was the case, perhaps the median listing price (the price at which the seller puts the house on the market, or asking price) would be higher than the ZHVI (Zillow Home Value Index -- essentially an estimate of the home value). For those rows where both MedianListingPrice_AllHomes and ZHVI_AllHomes have non-NA values, on average how much higher or lower is the median listing price? Can you think of any other reasons why this may be?

Relevant topics: mean

Item(s) to submit:

  • R code used to solve the problem.
  • The result itself and 1-2 sentences talking about whether or not you can think of any other reasons that may explain the result.
Solution
mean(zipc$MedianListingPrice_AllHomes - zipc$ZHVI_AllHomes, na.rm=T)
3. Convert the Date column to a date using as.Date. How many years of data do we have in this dataset? Create a line plot with lines for the average MedianListingPrice_AllHomes and average ZHVI_AllHomes by year. The result should be a single plot with multiple lines on it.

Hint: Here we give two videos to help you with this question. The first video gives some examples about working with dates in R.

Click here for video

Hint: This second video gives an example about how to plot two line graphs at the same time in R.

Click here for video

Hint: For a nice addition, add a dotted vertical line on year 2008 near the housing crisis:

abline(v="2008", lty="dotted")

Relevant topics: cut, as.Date, tapply, plot, lines, legend

Item(s) to submit:

  • R code used to solve the problem.
  • The results of running the code.
Solution
zipc$Date <- as.Date(zipc$Date, format="%Y-%m-%d")

zipc$Year <- cut(zipc$Date, breaks="year")
zipc$Year <- substr(zipc$Year, 1, 4)

listings <- tapply(zipc$MedianListingPrice_AllHomes, zipc$Year, mean, na.rm=T)
zhvis <- tapply(zipc$ZHVI_AllHomes, zipc$Year, mean, na.rm=T)

plot(names(listings), listings, type="l", ylim=c(118000, 320000), col="blue", lwd=2)
lines(names(zhvis), zhvis, col="tomato", lwd=2)
legend("topleft", legend=c("Listing price", "ZHVI"), lwd=2, col=c("blue", "tomato"))
abline(v="2008", lty="dotted")
4. Read /class/datamine/data/zillow/State_time_series.csv into a data.frame called states. Calculate the average median listing price by state, and create a map using plot_usmap from the usmap package that shows the average median price by state.

Hint: We give a full example about how to plot values, by State, on a map.

Click here for video

Hint: In order for plot_usmap to work, you must name the column containing the states' names "state".

Hint: To split words like "OhSoCool" into "Oh So Cool", try this: trimws(gsub('([[:upper:]])', ' \\1', "OhSoCool")). This will be useful as you'll need to correct the RegionName column at some point in time. Notice that this will not completely fix "DistrictofColumbia". You will need to fix that one manually.

Relevant topics: tapply, plot_usmap

Item(s) to submit:

  • R code used to solve the problem.
  • The resulting map.
Solution
library(usmap)
library(ggplot2)   # needed for scale_fill_continuous and theme

states <- read.csv("/class/datamine/data/zillow/State_time_series.csv")
aux <- tapply(states$MedianListingPrice_AllHomes, states$RegionName, mean, na.rm=T)
dat <- data.frame(state=names(aux), val=aux)
dat$state <- trimws(gsub('([[:upper:]])', ' \\1', dat$state))
dat$state[dat$state=="Districtof Columbia"] <- "District of Columbia"

plot_usmap(data = dat, values = "val", color = "black") +
    scale_fill_continuous(low = "white", high = "#C28E0E",
                          name = "Median price listing",
                          label = scales::dollar) +
    theme(legend.position = "right")
5. Read /class/datamine/data/zillow/County_time_series.csv into a data.frame named counties. Choose a state (or states) that you would like to "dig down" into county-level data for, and create a plot (or plots) like in (4) that show some interesting statistic by county. You can choose average median listing price if you so desire, however, you don't need to! There are other cool data!

Hint: Make sure that you remember to aggregate your data by RegionName so the plot renders correctly.

Hint: plot_usmap looks for a column named fips. Make sure to rename the RegionName column to fips prior to passing the data.frame to plot_usmap.

Hint: If you get Question 4 working correctly, here are the main differences for Question 5. You need the regions to be "counties" instead of "states", and you need the data.frame to have a column called fips instead of state. These are the main differences between Question 4 and Question 5.

Relevant topics: tapply, plot_usmap

Item(s) to submit:

  • R code used to solve the problem.
  • The resulting map.
Solution
counties <- read.csv("/class/datamine/data/zillow/County_time_series.csv")

counties_avg <- tapply(counties$MedianListingPrice_AllHomes, counties$RegionName, mean, na.rm=T)
counties_avg <- data.frame(fips = names(counties_avg), MedianListingPrice = unname(counties_avg))

plot_usmap(data = counties_avg, values = "MedianListingPrice", include=c("CA"), color = "black") +
    scale_fill_continuous(low = "white", high = "#C28E0E",
                          name = "Average median price listing",
                          label = scales::dollar) +
    theme(legend.position = "right")

Project 12


Motivation: In the previous project you were forced to do a little bit of date manipulation. Dates can be very difficult to work with, regardless of the language you are using. lubridate is a package within the famous tidyverse, that greatly simplifies some of the most common tasks one needs to perform with date data.

Context: We've been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a tidyverse package that adds great stand-alone value when working with dates.

Scope: r

Learning objectives:

  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Utilize apply functions in order to solve a data-driven problem.
  • Gain proficiency using split, merge, and subset.
  • Demonstrate the ability to create basic graphs with default settings.
  • Demonstrate the ability to modify axes labels and titles.
  • Incorporate legends using legend().
  • Demonstrate the ability to customize a plot (color, shape/linetype).
  • Convert strings to dates, and format dates using the lubridate package.

Questions

1. Let's continue our exploration of the Zillow time series data. A useful package for dealing with dates is called lubridate. This is part of the famous tidyverse suite of packages. Run the code below to load it. Read the /class/datamine/data/zillow/State_time_series.csv dataset into a data.frame named states. What class and type is the column Date?
library(lubridate)

Relevant topics: class, typeof

Item(s) to submit:

  • R code used to solve the question.
  • class and typeof column Date.
Solution
library(lubridate)
states <- read.csv("/class/datamine/data/zillow/State_time_series.csv")
typeof(states$Date)
class(states$Date)
2. Convert column Date to a corresponding date format using lubridate. Check that you correctly transformed it by checking its class like we did in question (1). Compare and contrast this method of conversion with the solution you came up with for question (3) in the previous project. Which method do you prefer?

Hint: Take a look at the following functions from lubridate: ymd, mdy, dym.

Hint: Here is a video about ymd, mdy, dym

Click here for video

Relevant topics: dates, lubridate

Item(s) to submit:

  • R code used to solve the question.
  • class of modified column Date.
  • 1-2 sentences stating which method you prefer (if any) and why.
Solution
states$Date <- ymd(states$Date)
class(states$Date)
3. Create 3 new columns in states called year, month, day_of_week (Sun-Sat) using lubridate. Get the frequency table for your newly created columns. Do we have the same amount of data for all years, for all months, and for all days of the week? We did something similar in question (3) in the previous project -- specifically, we broke each date down by year. Which method do you prefer and why?

Hint: Take a look at functions month, year, day, wday.

Hint: You may find the argument of label in wday useful.

Hint: Here is a video about month, year, day, wday

Click here for video

Relevant topics: dates, lubridate

Item(s) to submit:

  • R code used to solve the question.
  • Frequency table for newly created columns.
  • 1-2 sentences answering whether or not we have the same amount of data for all years, months, and days of the week.
  • 1-2 sentences stating which method you prefer (if any) and why.
Solution
states$year <- year(states$Date)
states$month <- month(states$Date)
states$day_of_week <- wday(states$Date, label=TRUE)
table(states$year)
table(states$month)
table(states$day_of_week)
4. Is there a better month or set of months to put your house on the market? Use tapply to compare the average DaysOnZillow_AllHomes for all months. Make a barplot showing our results. Make sure your barplot includes "all of the fixings" (title, labeled axes, legend if necessary, etc. Make it look good.).

Relevant topics: tapply, barplot

Hint: If you want to have the month's abbreviation in your plot, you may find both the month.abb object and the argument names.arg in barplot useful.

Hint: This video might help with Question 4.

Click here for video

Item(s) to submit:

  • R code used to solve the question.
  • The barplot of the average DaysOnZillow_AllHomes for all months.
  • 1-2 sentences answering the question "Is there a better time to put your house on the market?" based on your results.
Solution
avgDaysPerMonth <- tapply(states$DaysOnZillow_AllHomes, states$month, mean, na.rm=T)
barplot(avgDaysPerMonth, names.arg=month.abb, col='steelblue', main="Avg days on Zillow", ylab="# Days")
Solution
states2010plus <- subset(states, year >= 2010)
avgdays <- tapply(states2010plus$DaysOnZillow_AllHomes, states2010plus$Date, mean, na.rm=T)
plot(ymd(names(avgdays)), avgdays, type='l')
6. Do homes sell faster in certain states? For the following states: 'California', 'Indiana', 'NewYork' and 'Florida', make a lineplot for DaysOnZillow_AllHomes by Date with one line per state. Use the states2010plus dataset for this question. Make sure to have each state line colored differently, and to add a legend to your plot. Examine the plot and write 1-2 sentences about any observations you have.

Hint: You may want to use the lines function to add the lines for the different states.

Hint: Make sure to fix the y-axis limits using the ylim argument in plot to properly show all four lines.

Hint: You may find the argument col useful to change the color of your line.

Hint: To make your legend fit, consider using the states abbreviation, and the arguments ncol and cex of the legend function.

Relevant topics: subset, indexing, plot, lines

Item(s) to submit:

  • R code used to solve the question.
  • The time series lineplot for DaysOnZillow_AllHomes per date for the 4 states.
  • 1-2 sentences commenting on the patterns found in the plot, and your answer to the question "Do homes sell faster in certain states rather than others?".
Solution
indiana <- subset(states2010plus, RegionName=="Indiana")
california <- subset(states2010plus, RegionName=="California")
florida <- subset(states2010plus, RegionName=="Florida")
new_york <- subset(states2010plus, RegionName=="NewYork")
plot(california$Date, california$DaysOnZillow_AllHomes, type='l', col='tomato', ylim = c(min(states2010plus$DaysOnZillow_AllHomes, na.rm=T), 210), ylab='Days on zillow', xlab = 'Date')
lines(indiana$Date, indiana$DaysOnZillow_AllHomes, col='steelblue')
lines(florida$Date, florida$DaysOnZillow_AllHomes, col='green')
lines(new_york$Date, new_york$DaysOnZillow_AllHomes, col='black')
legend('topright', legend=c('CA','IN','FL', 'NY'),lwd=2, col=c('tomato','steelblue','green','black'), ncol=2, cex=.7)

Project 13


Motivation: It's important to be able to lookup and understand the documentation of a new function. You may have looked up the documentation of functions like paste0 or sapply, and noticed that in the "usage" section, one of the arguments is an ellipsis (...). Well, unless you understand what this does, it's hard to really get it. In this project, we will experiment with ellipsis, and write our own function that utilizes one.

Context: We've learned about, used, and written functions in many projects this semester. In this project, we will utilize some of the less-known features of functions.

Scope: r, functions

Learning objectives:

  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Utilize apply functions in order to solve a data-driven problem.
  • Gain proficiency using split, merge, and subset.
  • Demonstrate the ability to create basic graphs with default settings.
  • Demonstrate the ability to modify axes labels and titles.
  • Incorporate legends using legend().
  • Demonstrate the ability to customize a plot (color, shape/linetype).
  • Convert strings to dates, and format dates using the lubridate package.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/beer/

Questions

1. Read /class/datamine/data/beer/beers.csv into a data.frame named beers. Read /class/datamine/data/beer/breweries.csv into a data.frame named breweries. Read /class/datamine/data/beer/reviews.csv into a data.frame named reviews.

Hint: Notice that reviews.csv is a large file. Luckily, you can use a function from the famous data.table package called fread. The function fread is much faster at reading large files compared to read.csv. It reads the data into a class called data.table. We will learn more about this later on. For now, use fread to read in the reviews.csv data, then convert it from the data.table class into a data.frame by wrapping the result of fread in the data.frame function.

Hint: Do not forget to load the data.table library before attempting to use the fread function.

Below we show you an example of how fast the fread function is compared to read.csv.

library(microbenchmark)
microbenchmark(read.csv("/class/datamine/data/beer/reviews.csv", nrows=100000), data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows=100000)), times=5)
Unit: milliseconds
                                                                       expr       min        lq      mean    median        uq       max neval
          read.csv("/class/datamine/data/beer/reviews.csv", nrows = 1e+05) 5948.6289 6482.3395 6746.8976 7040.5881 7086.6728 7176.2589     5
 data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows = 1e+05))  120.7705  122.3812  127.9842  128.7794  133.7695  134.2205     5

Hint: This video demonstrates how to read the reviews data using fread.

Click here for video

Relevant topics: fread, data.frame

Item(s) to submit:

  • R code used to solve the problem.
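A minimal sketch of one way to read the three files (using fread for the large reviews file, as the hint suggests):

library(data.table)

beers     <- read.csv("/class/datamine/data/beer/beers.csv")
breweries <- read.csv("/class/datamine/data/beer/breweries.csv")
reviews   <- data.frame(fread("/class/datamine/data/beer/reviews.csv"))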
2. Take some time to explore the datasets. Like many datasets, our data is broken into 3 "tables". What columns connect each table? How many breweries in breweries don't have an associated beer in beers? How many beers in beers don't have an associated brewery in breweries?

Hint: We compare lists of names using sum or intersect. Similar techniques can be used for Question 2.

Click here for video

Relevant topics: names, in, logical operators, unique

Item(s) to submit:

  • R code used to solve the problem.
  • A description of columns which connect each of the files.
  • How many breweries don't have an associated beer in beers.
  • How many beers don't have an associated brewery in breweries.
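As a rough sketch, one might start with names() to spot the linking columns; the column names beers$brewery_id and breweries$id used below are assumptions and should be verified against the actual names() output:

# look at the column names of each table to find the columns that connect them
names(beers)
names(breweries)
names(reviews)

# assuming breweries$id links to beers$brewery_id (verify with names() first):
# breweries with no associated beer in beers
sum(!(breweries$id %in% beers$brewery_id))
# beers with no associated brewery in breweries
sum(!(beers$brewery_id %in% breweries$id))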
3. Run ?sapply and look at the usage section for sapply. If you look at the description for the ... argument, you'll see it is "optional arguments to FUN". What this means is you can specify additional input for the function you are passing to sapply. One example would be passing T to na.rm in the mean function: sapply(dat, mean, na.rm=T). Use sapply and the strsplit function to separate the types of breweries (types) by commas. Use another sapply to loop through your results and count the number of types for each brewery. Be sure to name your final results n_types. What is the average number of services (n_types) that breweries in IN and MI offer (we are looking for the average of IN and MI combined)? Does that surprise you?

Note: When you have one sapply inside of another, or one loop inside of another, or an if/else statement inside of another, this is commonly referred to as nesting. So when Googling, you can type "nested sapply" or "nested if statements", etc.

Hint: We show, in this video, how to find the average number of parts in a midwesterner's name. Perhaps surprisingly, this same technique will be useful in solving Question 3.

Click here for video

Relevant topics: sapply, ..., strsplit, in, mean

Item(s) to submit:

  • R code used to solve the question.
  • 1-2 sentences answering the average number of services breweries in Indiana and Michigan offer, and commenting on this answer.
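One possible sketch of this calculation: the types column is named in the question, but the breweries$state column name below is an assumption to check with names(breweries). Note that because strsplit is vectorized, a single sapply suffices here; the nested-sapply approach described in the question works equally well.

# split each brewery's comma-separated services, then count the services per brewery
n_types <- sapply(strsplit(breweries$types, ","), length)

# average number of services for IN and MI breweries combined
# (the column name breweries$state is an assumption)
mean(n_types[breweries$state %in% c("IN", "MI")], na.rm=TRUE)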
4. Write a function called compare_beers that accepts a function that you will call FUN, and any number of vectors of beer ids. The function, compare_beers, should cycle through each vector/group of beer_ids, compute the function, FUN, on the corresponding subset of reviews, and print "Group X: some_score", where X is the group number (1, 2, ...) and some_score is the result of applying FUN on that subset of the reviews data.

In the example below the function FUN is the median function and we have two vectors/groups of beer_ids passed with c(271781) being group 1 and c(125646, 82352) group 2. Note that even though our example only passes two vectors to our compare_beers function, we want to write the function in a way that we could pass as many vectors as we want to.

Example:

compare_beers(reviews, median, c(271781), c(125646, 82352))

This example gives the output:

Group 1: 4
Group 2: 4.56

For your solution to this question, find the behavior of compare_beers in this example:

compare_beers(reviews, median, c(88,92,7971), c(74986,1904), c(34,102,104,355))

Hint: There are different approaches to this question. You can use for loops or sapply. It will probably help to start small and build slowly toward the solution.

Hint: This first video shows how to use ... in defining a function.

Click here for video

Hint: This second video basically walks students through how to build this function. If you use this video to learn how to build this function, please be sure to acknowledge this in your project solutions.

Click here for video

Relevant topics: in, ..., indexing, paste0, for loops

Item(s) to submit:

  • R code used to solve the problem.
  • The result from running the provided example.
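A minimal sketch of one way to write such a function with ...; it assumes the reviews data frame has columns named beer_id and score, which should be verified with names(reviews):

compare_beers <- function(myDF, FUN, ...) {
    groups <- list(...)                    # collect every vector of beer_ids passed in
    for (i in seq_along(groups)) {
        # review scores for the beers in this group (column names are assumptions)
        scores <- myDF$score[myDF$beer_id %in% groups[[i]]]
        scores <- scores[!is.na(scores)]
        print(paste0("Group ", i, ": ", FUN(scores)))
    }
}

# the example from the question
compare_beers(reviews, median, c(271781), c(125646, 82352))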
5. Beer wars! IN and MI against AZ and CO. Use the function you wrote in question (4) to compare beer_id from each group of states. Make a cool plot of some sort. Be sure to comment on your plot.

Hint: Create a vector of beer_ids per group before passing it to your function from (4).

Hint: This video demonstrates an example of how to use the compare_beers function.

Click here for video

Relevant topics: in, ..., indexing, paste0, for loops

Item(s) to submit:

  • R code used to solve the problem.
  • The result from running your function.
  • The resulting plot.
  • 1-2 sentence commenting on your plot.

Project 14


Motivation: Functions are the building blocks of more complex programming. It's vital that you understand how to read and write functions. In this project we will incrementally build and improve upon a function designed to recommend a beer. Note that you will not be winning any awards for this recommendation system, it is just for fun!

Context: One of the main focuses throughout the semester has been on functions, and for good reason. In this project we will continue to exercise our R skills and build up our recommender function.

Scope: r, functions

Learning objectives:

  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Utilize apply functions in order to solve a data-driven problem.
  • Gain proficiency using split, merge, and subset.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/beer/

Questions

1. Read /class/datamine/data/beer/beers.csv into a data.frame named beers. Read /class/datamine/data/beer/breweries.csv into a data.frame named breweries. Read /class/datamine/data/beer/reviews.csv into a data.frame named reviews. As in the previous project, make sure you used the fread function from the data.table package, and convert the data.table to a data.frame. We want to create a very basic beer recommender. We will start simple. Create a function called recommend_a_beer that takes as input my_beer_id (a single value) and returns a vector of beer_ids from the same style. Test your function on 2093.

Hint: Make sure you do not include the given my_beer_id in the vector of beer_ids of your recommended beers.

Hint: You may find the function setdiff useful. Run the example below to get an idea of what it does.

Note: You will not win any awards for this recommendation system!

x <- c('a','b','b','c')
y <- c('c','b','d','e','f')
setdiff(x,y)
## [1] "a"
setdiff(y,x)
## [1] "d" "e" "f"

Relevant topics: fread, data.frame, function, setdiff

Item(s) to submit:

  • R code used to solve the problem.
  • Length of result from recommend_a_beer(2093).
  • The result of 2093 %in% recommend_a_beer(2093).
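One possible sketch; the column names beers$beer_id and beers$style are assumptions to verify with names(beers):

recommend_a_beer <- function(my_beer_id) {
    # style of the given beer (column names are assumptions)
    my_style <- beers$style[beers$beer_id == my_beer_id]
    # all beers of that style, excluding the given beer itself
    setdiff(beers$beer_id[which(beers$style == my_style)], my_beer_id)
}

length(recommend_a_beer(2093))
2093 %in% recommend_a_beer(2093)   # should be FALSE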
2. That is a lot of beer recommendations! Let's try to narrow it down. Include an argument in your function called min_score with default value of 4.5. Our recommender will only recommend beer_ids with a review score of at least min_score. Test your improved beer recommender with the same beer_id from question (1).

Hint: Note that now we need to look at both beers and reviews datasets.

Relevant topics: in, unique, subset/indexing

Item(s) to submit:

  • R code used to solve the problem.
  • Length of result from recommend_a_beer(2093).
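Extending the sketch from question (1) with a min_score argument (again, the column names beers$style, reviews$beer_id, and reviews$score are assumptions):

recommend_a_beer <- function(my_beer_id, min_score=4.5) {
    my_style <- beers$style[beers$beer_id == my_beer_id]
    same_style_ids <- setdiff(beers$beer_id[which(beers$style == my_style)], my_beer_id)
    # beer_ids that received at least one review with a score of at least min_score
    highly_rated <- unique(reviews$beer_id[which(reviews$score >= min_score)])
    same_style_ids[same_style_ids %in% highly_rated]
}

length(recommend_a_beer(2093))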
3. There is still room for improvement (obviously) for our beer recommender. Include a new argument in your function called same_brewery_only with default value FALSE. This argument will determine whether or not our beer recommender will return only beers from the same brewery. Test our newly improved beer recommender with the same beer_id from question (1) with the argument same_brewery_only set to TRUE.

Hint: You may find the function intersect useful. Run the example below to get an idea of what it does.

x <- c('a','b','b','c')
y <- c('c','b','d','e','f')
intersect(x,y)
## [1] "b" "c"
intersect(y,x)
## [1] "c" "b"

Relevant topics: if/else, subset, intersect, indexing

Item(s) to submit:

  • R code used to solve the problem.
  • Length of result from recommend_a_beer(2093, same_brewery_only=TRUE).
4. Oops! Bad idea! Maybe including only beers from the same brewery is not the best idea. Add an argument to our beer recommender named type. If type=style, our recommender will recommend beers based on the style as we did in question (3). If type=reviewers, our recommender will recommend beers based on reviewers with "similar taste". Select reviewers that gave a score equal to or greater than min_score for the given beer id (my_beer_id). For those reviewers, find the beer_ids for other beers that these reviewers have given a score of at least min_score. These beer_ids are the ones our recommender will return. Be sure to test our improved recommender on the same beer_id as in (1)-(3).

Relevant topics: if, subset, in, setdiff, unique

Item(s) to submit:

  • R code used to solve the problem.
  • Length of result from recommend_a_beer(2093, type="reviewers").
6. Play with our recommend_a_beer function. Include another feature to it. Some ideas are: putting a limit on the number of beer_ids we will return, error catching (what if we don't have reviews for a given beer_id?), including a plot to the output, returning beer names instead of ids or new arguments to decide what beer_ids to recommend. Be creative and have fun!

Item(s) to submit:

  • R code used to solve the problem.
  • The result from running the improved recommend_a_beer function showcasing your improvements to it.
  • 1-2 sentences commenting on what you decided to include and why.

Project 15


Motivation: Some people say it takes 20 hours to learn a skill, some say 10,000 hours. What is certain is that it definitely takes time. In this project we will explore an interesting dataset and exercise some of the skills learned this semester.

Context: This is the final project of the semester. We sincerely hope that you've learned something, and that we've provided you with first hand experience digging through data.

Scope: r

Learning objectives:

  • Read and write basic (csv) data.
  • Explain and demonstrate: positional, named, and logical indexing.
  • Utilize apply functions in order to solve a data-driven problem.
  • Gain proficiency using split, merge, and subset.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/donerschoose/

Questions

1. Read the data /class/datamine/data/donerschoose/Projects.csv into a data.frame called projects. Make sure you use the function you learned in Project 13 (fread) from the data.table package to read the data. Don't forget to then convert the data.table into a data.frame. Let's do an initial exploration of this data. What types of projects (Project.Type) are there? How many resource categories (Project.Resource.Category) are there?
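For instance, once projects is a data.frame, the exploration might look like this:

unique(projects$Project.Type)                       # the distinct project types
length(unique(projects$Project.Resource.Category))  # number of resource categories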

Relevant topics: fread, unique, length

Item(s) to submit:

  • R code used to solve the question.
  • 1-2 sentences containing the project's types and how many resource categories are in the dataset.
2. Create two new variables in projects: the number of days a project lasted, and the number of days until the project was fully funded. Name these variables project_duration and time_until_funded, respectively. To calculate them, use the project's posted date (Project.Posted.Date), expiration date (Project.Expiration.Date), and fully funded date (Project.Fully.Funded.Date). What are the shortest and longest times until a project is fully funded? As a consistency check, see whether any projects have a negative duration. If so, how many?

Hint: You may find the argument units in the function difftime useful.

Hint: Be sure to pay attention to the order of operations of difftime.

Hint: Note that if you used the fread function from data.table to read in the data, you will not need to convert the columns to dates.

Hint: It is not required that you use difftime.
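As a small standalone illustration of difftime (not tied to this dataset), the first argument minus the second gives the difference, and units controls the scale:

d1 <- as.Date("2020-08-24")
d2 <- as.Date("2020-12-12")
difftime(d2, d1, units="days")   # Time difference of 110 days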

Relevant topics: difftime, lubridate

Item(s) to submit:

  • R code used to solve the question.
  • Shortest and longest times until a project is fully funded.
  • 1-2 sentences answering whether there are any projects with negative duration, and if so, how many.
3. As you noted in question (2), there may be some projects with a negative duration. As we may have some concerns about the data for these projects, filter the projects data to exclude the projects with negative duration, and call this filtered data selected_projects. With that filtered data, make a dotchart of the mean time until the project is fully funded (time_until_funded) for the various resource categories (Project.Resource.Category). Make sure to comment on your results. Are they surprising? Could there be another variable influencing this result? If so, name at least one.

Hint: You will first need to average time until fully funded for the different categories before making your plot.

Hint: To make your dotchart look nicer, you may want to first order the average time until fully funded before passing it to the dotchart function. In addition, consider reducing the y-axis font size using the argument cex.
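A sketch of this plotting step, assuming selected_projects was built as described above:

# mean time (in days) until fully funded, for each resource category;
# as.numeric guards against time_until_funded being stored as a difftime
avg_funding_time <- tapply(as.numeric(selected_projects$time_until_funded),
                           selected_projects$Project.Resource.Category,
                           mean, na.rm=TRUE)
# sort the averages and shrink the labels so the dotchart is easier to read
dotchart(sort(avg_funding_time), cex=0.7, xlab="Mean days until fully funded")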

Relevant topics: indexing, subset, tapply, dotchart

Item(s) to submit:

  • R code used to solve the question.
  • Resulting dotchart.
  • 1-2 sentences commenting on your plot. Make sure to mention whether you are surprised or not by the results. Don't forget to add if you think there could be more factors influencing your answer, and if so, be sure to give examples.
4. Read /class/datamine/data/donerschoose/Schools.csv into a data.frame called schools. Combine selected_projects and schools by School.ID keeping only School.IDs present in both datasets. Name the combined data.frame selected_projects. Use the newly combined data to determine the percentage of already fully funded projects (Project.Current.Status) for schools in West Lafayette, IN. In addition, determine the state (School.State) with the highest number of projects. Be sure to specify the number of projects this state has.

Hint: West Lafayette, IN zip codes are 47906 and 47907.
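The combining step might be sketched like this (merge keeps only School.IDs present in both data.frames by default, which is exactly what we want here; if the column name differs slightly after reading, adjust the by= argument accordingly):

library(data.table)
schools <- as.data.frame(fread("/class/datamine/data/donerschoose/Schools.csv"))
# inner join on School.ID; only IDs found in both data.frames are kept
selected_projects <- merge(selected_projects, schools, by="School.ID")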

Relevant topics: fread, subset, indexing, merge, table, prop.table, which.max

Item(s) to submit:

  • R code used to solve the question.
  • 1-2 sentences answering the percentage of already fully funded projects for schools in West Lafayette, IN, the state with the highest number of projects, and the number of projects this state has.
5. Using the combined selected_projects data, get the school(s) (School.Name), city/cities (School.City) and state(s) (School.State) for the teacher with the highest percentage of fully funded projects (Project.Current.Status).

Hint: There are many ways to solve this problem. For example, one option to get the teacher's ID is to create a variable indicating whether or not the project is fully funded and use tapply. Another option is to create a prop.table and select the corresponding column/row.

Hint: Note that each row in the data corresponds to a unique project ID.

Hint: Once you have the teacher's ID, consider filtering the combined selected_projects data to contain only the rows for that teacher's ID, and only the columns we are interested in: School.Name, School.City, and School.State. Then, you can get the unique values in this reduced data.

Hint: To get only certain columns when subsetting, you may find the argument select from subset useful.
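One possible sketch of the tapply approach (the column name Teacher.ID and the status label "Fully Funded" are assumptions; check names(selected_projects) and unique(selected_projects$Project.Current.Status) first):

# indicator of whether each project is fully funded ("Fully Funded" is an assumed label)
is_funded <- selected_projects$Project.Current.Status == "Fully Funded"
# proportion of fully funded projects per teacher (Teacher.ID is an assumed column name)
funded_prop <- tapply(is_funded, selected_projects$Teacher.ID, mean)
best_teacher <- names(funded_prop)[which.max(funded_prop)]
# school(s), city/cities, and state(s) for that teacher
unique(subset(selected_projects, Teacher.ID == best_teacher,
              select=c(School.Name, School.City, School.State)))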

Relevant topics: indexing, which, max, subset, unique, row.names (if using table), names (if using tapply)

Item(s) to submit:

  • R code used to solve the question.
  • Output of your code containing the school(s), city/cities, and state(s) of the selected teacher.

STAT 29000

Project 1


Motivation: In this project we will jump right into an R review, breaking one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.

Context: We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we've previously learned.

Scope: data wrangling in R, functions

Learning objectives:

  • Comprehend what a function is, and the components of a function in R.
  • Read and write basic (csv) data.
  • Utilize apply functions in order to solve a data-driven problem.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

You can find useful examples that walk you through relevant material in The Examples Book:

https://thedatamine.github.io/the-examples-book

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

Important note: It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials.

We decided to move away from ThinLinc and away from the version of RStudio used last year (https://desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.

Remember the very useful documentation shortcut ?. To use, simply type ? in the console, followed by the name of the function you are interested in.
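For example, to pull up the documentation for the read.csv function, you could run:

?read.csv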

You can also look for package documentation by using help(package=PACKAGENAME), so for example, to see the documentation for the package ggplot2, we could run:

help(package=ggplot2)

Sometimes it can be helpful to see the source code of a defined function. A function is any chunk of organized code that is used to perform an operation. Source code is the underlying R (or C, or C++) code that is used to create the function. To see the source code of a defined function, type the function's name without the (). For example, if we were curious about what the function Reduce does, we could run:

Reduce

Occasionally this will be less useful, as the resulting code may simply call compiled C code we can't see. Other times it will allow you to understand the function better.

Dataset:

/class/datamine/data/airbnb

Oftentimes (maybe even the majority of the time) data doesn't come in one nice file or database. Explore the datasets in /class/datamine/data/airbnb.

1. You may have noted that, for each country, city, and date we can find 3 files: calendar.csv.gz, listings.csv.gz, and reviews.csv.gz (for now, we will ignore all files in the "visualisations" folders).
Let's take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (calendar.csv.gz, listings.csv.gz, and reviews.csv.gz). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.

Hint: read.csv has an argument to select the number of rows we want to read.

Hint: Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown. So you do not need to display the first 50 rows of the listings and/or reviews, in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.

Item(s) to submit:

  • Chunk of code used to read the first 50 rows of each dataset.
  • 1-2 sentences briefly describing the information contained in each dataset.
  • Name(s) of variable(s) that could be used to join them.

To read a compressed csv, simply use the read.csv function:

dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
head(dat)
Solution

The calendar.csv.gz file for 2019-07-08 in Hawaii describes the listing_id, date, available (t or f), price, adjusted_price, minimum_nights, and maximum_nights.

hawaii_calendar <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/calendar.csv.gz")
head(hawaii_calendar, n=50)

The listings.csv.gz file for 2019-07-08 in Hawaii has 106 variables, which describe the very specific attributes of the airbnb listings.

hawaii_listings <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/listings.csv.gz")
dim(hawaii_listings)

The reviews.csv.gz file for 2019-07-08 in Hawaii describes the listing_id, id, date, reviewer_id, reviewer_name, and comments.

hawaii_reviews <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/reviews.csv.gz")
head(hawaii_reviews, n=50)

The variables that appear in more than one of the three tables, and so might be used to join them, are: date, id, listing_id, maximum_nights, minimum_nights, and price.

t <- table(c(names(hawaii_calendar), names(hawaii_listings), names(hawaii_reviews)))
t[t > 1]

Let's work towards getting this data into an easier format to analyze. From now on, we will focus on the listings.csv.gz datasets.

2. Write a function called get_paths_for_country, that, given a string with the country name, returns a vector with the full paths for all listings.csv.gz files, starting with /class/datamine/data/airbnb/....
For example, the output from get_paths_for_country("united-states") should have 28 entries. Here are the first 5 entries in the output:
 [1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"       
 [2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"           
 [3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"     
 [4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"         
 [5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"     

Hint: list.files is useful with the recursive=T option.

Hint: Use grep to search for the pattern listings.csv.gz (within the results from the first hint), and use the option value=T to display the values found by the grep function.

Item(s) to submit:

  • Chunk of code for your get_paths_for_country function.
Solution

We extract all 28 of the listings for the United States first:

myprefix <- "/class/datamine/data/airbnb/united-states/"
paste0(myprefix, grep("listings.csv.gz", list.files(myprefix, recursive=T), value=T))

Now we build a function that can do the same thing, for any country

get_paths_for_country <- function(mycountry) {
  myprefix <- paste0("/class/datamine/data/airbnb/", mycountry, "/")
  paste0(myprefix, grep("listings.csv.gz", list.files(myprefix, recursive=T), value=T))
}

and we test this for several countries:

get_paths_for_country("united-states")
get_paths_for_country("brazil")
get_paths_for_country("south-africa")
get_paths_for_country("canada")
3. Write a function called get_data_for_country that, given a string with the country name, returns a data.frame containing all the listings data for that country. Use your previously written function to help you.

Hint: Use stringsAsFactors=F in the read.csv function.

Hint: Use do.call(rbind, <listofdataframes>) to combine a list of dataframes into a single dataframe.

Relevant topics: rbind, lapply, function

Item(s) to submit:

  • Chunk of code for your get_data_for_country function.
Solution

We first get the data from the Canada entries. To do this, we sapply the read.csv function to each of the 6 results from get_paths_for_country("canada"). In other words, we read in these 6 data frames.

myresults <- sapply(get_paths_for_country("canada"), read.csv, stringsAsFactors=F, simplify=F)

We get a list of 6 data frames:

length(myresults)
class(myresults)
class(myresults[[1]])
class(myresults[[6]])

and we can check the dimensions of each of the 6 data frames

dim(myresults[[1]])
dim(myresults[[2]])
dim(myresults[[3]])
dim(myresults[[4]])
dim(myresults[[5]])
dim(myresults[[6]])

This is more easily accomplished with another sapply:

sapply(myresults, dim)

We can rbind all 6 of these data frames into one big data frame as follows:

bigDF <- do.call(rbind, myresults)
class(bigDF)
dim(bigDF)

Now we create the desired function called get_data_for_country

get_data_for_country <- function(mycountry) {
  myresults <- sapply(get_paths_for_country(mycountry), read.csv, stringsAsFactors=F, simplify=F)
  do.call(rbind, myresults)
}

and we test it on Canada.

mynewbigDF <- get_data_for_country("canada")

The result has the same size as before

dim(mynewbigDF)
4. Use your get_data_for_country to get the data for a country of your choice, and make sure to name the data.frame listings. Take a look at the following columns: host_is_superhost, host_has_profile_pic, host_identity_verified, and is_location_exact. What is the data type for each column? (You can use class or typeof or str to see the data type.)
These columns would make more sense as logical values (TRUE/FALSE/NA).
Write a function called transform_column that, given a column containing lowercase "t"s and "f"s, transforms it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (""), so we need to be careful when transforming the data. Test your function on the column host_is_superhost.

Relevant topics: class, typeof, str, toupper, as.logical

Item(s) to submit:

  • Chunk of code for your transform_column function.
  • Type of transform_column(listings$host_is_superhost).
Solution

These 4 columns from mynewbigDF (which has the data for Canada) only have the values "t", "f", and "":

head(mynewbigDF$host_is_superhost)
head(mynewbigDF$host_has_profile_pic)
head(mynewbigDF$host_identity_verified)
head(mynewbigDF$is_location_exact)

Please note the 44 values of "" (which are easy to miss) in the first 3 of these 4 columns:

table(mynewbigDF$host_is_superhost)
table(mynewbigDF$host_has_profile_pic)
table(mynewbigDF$host_identity_verified)
table(mynewbigDF$is_location_exact)

These are all character vectors, which we can check using class, typeof, or str:

class(mynewbigDF$host_is_superhost)
class(mynewbigDF$host_has_profile_pic)
class(mynewbigDF$host_identity_verified)
class(mynewbigDF$is_location_exact)

typeof(mynewbigDF$host_is_superhost)
typeof(mynewbigDF$host_has_profile_pic)
typeof(mynewbigDF$host_identity_verified)
typeof(mynewbigDF$is_location_exact)

str(mynewbigDF$host_is_superhost)
str(mynewbigDF$host_has_profile_pic)
str(mynewbigDF$host_identity_verified)
str(mynewbigDF$is_location_exact)

We have several ways to transform a column. For example, we could go element-by-element, and make substitutions, like this:

v <- mynewbigDF$host_is_superhost

Here is the way that the values look at the start:

table(v)
v[toupper(v)=="T"] <- TRUE
v[toupper(v)=="F"] <- FALSE
v[toupper(v)==""] <- NA

and here are the values now:

table(v)

You might think that the NA values disappeared, but they just do not show up in the table by default. You can force them to appear, and then we see that the counts of the three values are the same as before.

table(v, useNA="always")

Here is the function:

transform_column <- function(v) {
  v[toupper(v)=="T"] <- TRUE
  v[toupper(v)=="F"] <- FALSE
  v[toupper(v)==""] <- NA
  # the assignments above keep v as a character vector ("TRUE"/"FALSE"/NA),
  # so convert it to a true logical vector before returning
  as.logical(v)
}

We can try the function on mynewbigDF$host_is_superhost:

head(transform_column(mynewbigDF$host_is_superhost))
table(transform_column(mynewbigDF$host_is_superhost))

Another possibility is to make a map, in which we put the old values as the names, and the new values as the values in the vector:

mymap <- c(TRUE, FALSE, NA)
names(mymap) <- c("T", "F", "")
head(mymap[toupper(mynewbigDF$host_is_superhost)])

and if you do not want the names to appear on the vector, you can remove them, like this:

head(unname(mymap[toupper(mynewbigDF$host_is_superhost)]))

Finally we can check the table of the results:

table(mymap[toupper(mynewbigDF$host_is_superhost)])

This might seem strange, and if you do not like it, you can just use the solution given above. If you do like this approach, and want to wrap it into a function, we can write:

transform_column <- function(v) {
  mymap <- c(TRUE, FALSE, NA)
  names(mymap) <- c("T", "F", "")
  unname(mymap[toupper(v)])
}

and again we can try this new version of the function on mynewbigDF$host_is_superhost:

head(transform_column(mynewbigDF$host_is_superhost))
table(transform_column(mynewbigDF$host_is_superhost))
5. Apply your function transform_column to the columns instant_bookable and is_location_exact in your listings data.
Based on your listings data, if you are looking at an instant bookable listing (where instant_bookable is TRUE), would you expect the location to be exact (where is_location_exact is TRUE)? Why or why not?

Hint: Make a frequency table, and see how many instant bookable listings have exact location.

Relevant topics: apply, table

Item(s) to submit:

  • Chunk of code to get a frequency table.
  • 1-2 sentences explaining whether or not we would expect the location to be exact if we were looking at an instant bookable listing.
Solution

Now we look at the Canada results, for which instant_bookable is TRUE, to see if is_location_exact is TRUE or FALSE:

table(transform_column(mynewbigDF$is_location_exact)[transform_column(mynewbigDF$instant_bookable) == TRUE])

In other words, if instant_bookable is TRUE, then we expect the value for is_location_exact to usually be TRUE as well.

As a closing note, we could remove the check to see whether the inner values are TRUE because, by default, logical indexing only extracts the TRUE values when we do a lookup like this:

table(transform_column(mynewbigDF$is_location_exact)[transform_column(mynewbigDF$instant_bookable)])

Project 2


Motivation: The ability to quickly reproduce an analysis is important. It is often necessary that other individuals will need to be able to understand and reproduce an analysis. This concept is so important there are classes solely on reproducible research! In fact, there are papers that investigate and highlight the lack of reproducibility in various fields. If you are interested in reading about this topic, a good place to start is the paper titled "Why Most Published Research Findings Are False", by John Ioannidis (2005).

Context: Making your work reproducible is extremely important. We will focus on the computational part of reproducibility. We will learn RMarkdown to document your analyses so others can easily understand and reproduce the computations that led to your conclusions. Pay close attention as future project templates will be RMarkdown templates.

Scope: Understand Markdown, RMarkdown, and how to use it to make your data analysis reproducible.

Learning objectives:

  • Use Markdown syntax within an Rmarkdown document to achieve various text transformations.
  • Use RMarkdown code chunks to display and/or run snippets of code.

Questions

Click here for video

1. Make the following text (including the asterisks) bold: This needs to be **very** bold. Make the following text (including the underscores) italicized: This needs to be _very_ italicized.

Important note: Surround your answer in 4 backticks. This will allow you to display the markdown without having the markdown "take effect". For example:

````
Some *marked* **up** text.
````

Hint: Be sure to check out the Rmarkdown Cheatsheet and our section on Rmarkdown in the book.

Note: Rmarkdown is essentially Markdown + the ability to run and display code chunks. In this question, we are actually using Markdown within Rmarkdown!

Relevant topics: rmarkdown, escaping characters

Item(s) to submit:

  • 2 lines of markdown text, surrounded by 4 backticks. Note that when compiled, this text will be unmodified, regular text.

Solution

We can achieve this style of text:

This needs to be **very** bold

This needs to be _very_ italicized.

by using this Markdown text:

**This needs to be \*\*very\*\* bold**
_This needs to be \_very\_ italicized._

The backslashes specify that we want the asterisks and underscores to appear.

2. Create an unordered list of your top 3 favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another ordered list that ranks your academic interests in order of most interested to least interested.

Hint: You can learn what ordered and unordered lists are here.

Note: Similar to (1), in this question we are dealing with Markdown. If we were to copy and paste the solution to this problem in a Markdown editor, it would be the same result as when we Knit it here.

Relevant topics: rmarkdown

Item(s) to submit:

  • Create the lists, this time don't surround your code in backticks. Note that when compiled, this text will appear as nice, formatted lists.

Solution

An unordered list of my top 3 favorite academic interests is:

  • asymptotic analysis of sequences
  • data science
  • analysis of algorithms

An ordered list of my top 3 favorite academic interests is:

  1. analysis of algorithms
  2. asymptotic analysis of sequences
  3. data science
3. Browse https://www.linkedin.com/ and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown. Include the following:
  • A header for this section (your choice of size) that says "About".
  • The text of your personal "About" section that you would feel comfortable uploading to linkedin, including at least 1 link.

Relevant topics: rmarkdown

Item(s) to submit:

  • Create the described profile, don't surround your code in backticks.

Solution
About

I am Professor in Statistics and (by courtesy) of Mathematics and Public Health at Purdue University. My research is in probabilistic, combinatorial, and analytic techniques for the analysis of algorithms and data structures; I am also interested in science of information, game theory, and large-scale computation. I currently serve as

  • Director of The Data Mine
  • Interim Co-Director of the Integrative Data Science Initiative
  • Principal Investigator for the Purdue Statistics Living Learning Community, funded by the National Science Foundation
  • Associate Director for the NSF Center for Science of Information (now a core center in Purdue's Discovery Park)
  • Associate Director of the Actuarial Science Program
4. Your co-worker wrote a report, and has asked you to beautify it. Knowing Rmarkdown, you agreed. Make improvements to this section. At a minimum:
  • Make the title pronounced.
  • Make all links appear as a word or words, rather than the long-form URL.
  • Organize all code into code chunks where code and output are displayed. If the output is really long, just display the code.
  • Make the calls to the library function be evaluated but not displayed.
  • Make sure all warnings and errors that may eventually occur, do not appear in the final document.

Feel free to make any other changes that make the report more visually pleasing.

```{r my-load-packages}
library(ggplot2)
```

```{r declare-variable-290, eval=FALSE}
my_variable <- c(1,2,3)
```

All About the Iris Dataset

This paper goes into detail about the `iris` dataset that is built into r. You can find a list of built-in datasets by visiting https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html or by running the following code:

data()

The iris dataset has 5 columns. You can get the names of the columns by running the following code:

names(iris)

Alternatively, you could just run the following code:

iris

The second option provides more detail about the dataset.

According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html there is another dataset built-in to r called `iris3`. This dataset is 3 dimensional instead of 2 dimensional.

An iris is a really pretty flower. You can see a picture of one here:

https://www.gardenia.net/storage/app/public/guides/detail/83847060_mOptimized.jpg

In summary. I really like irises, and there is a dataset in r called `iris`.

Relevant topics: rmarkdown

Item(s) to submit:

  • Make improvements to this section, and place it all under the Question 4 header in your template.
Solution
my_variable <- c(1,2,3)

All About the Iris Dataset

This paper goes into detail about the iris dataset that is built into r. You can find a list of built-in datasets by visiting the R datasets index or by running the following code:

data()

The iris dataset has 5 columns. You can get the names of the columns by running the following code:

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Alternatively, you could just run the following code:

iris

The second option provides more detail about the dataset.

According to the R iris help page there is another dataset built-in to R called iris3. This dataset is 3 dimensional instead of 2 dimensional.

An iris is a really pretty flower. You can see a picture of one here:

In summary. I really like irises, and there is a dataset in R called iris.
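The chunk options that achieve this are not visible in the knitted output above. A sketch of what the header of the package-loading chunk might look like is shown below (echo=FALSE hides the code while still evaluating it, and message=FALSE and warning=FALSE suppress any messages and warnings):

```{r my-load-packages, echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
```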

5. Create a plot using a built-in dataset like iris, mtcars, or Titanic, and display the plot using a code chunk. Make sure the code used to generate the plot is hidden. Include a descriptive caption for the image. Make sure to use an RMarkdown chunk option to create the caption.

Relevant topics: rmarkdown, plotting in r

Item(s) to submit:

  • Code chunk that creates and displays a plot using a built-in dataset like iris, mtcars, or Titanic.
Solution

We create the figure called "Sepal Length for the Iris Data Set".

Figure 1: Sepal Length for the Iris Data Set
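A sketch of a chunk that could produce such a figure (the particular plot type here is just an illustration; fig.cap supplies the caption and echo=FALSE hides the code):

```{r sepal-length-plot, echo=FALSE, fig.cap="Sepal Length for the Iris Data Set"}
hist(iris$Sepal.Length, main="Sepal Length for the Iris Data Set",
     xlab="Sepal Length (cm)")
```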

6. Insert the following code chunk under the Question 6 header in your template. Try knitting the document. Two things will go wrong. What is the first problem? What is the second problem?
```{r my-load-packages}
plot(my_variable)
```

Hint: Take a close look at the name we give our code chunk.

Hint: Take a look at the code chunk where my_variable is declared.

Relevant topics: rmarkdown

Item(s) to submit:

  • The modified version of the inserted code that fixes both problems.
  • A sentence explaining what the first problem was.
  • A sentence explaining what the second problem was.
Solution

We needed to change the chunk label my-load-packages to a new name, since we already had a chunk with this same label.

In the chunk called declare-variable-290, we had eval=FALSE, and as a result, the variable my_variable was never declared. So we removed the option eval=FALSE.

plot(my_variable)
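Putting both fixes together, the chunks might look like this (the new chunk label is arbitrary, as long as it is unique within the document):

```{r declare-variable-290}
my_variable <- c(1,2,3)
```

```{r plot-my-variable}
plot(my_variable)
```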
For Project 2, please submit your .Rmd file and the resulting .pdf file. (For this project, you do not need to submit a .R file.)

Project 3


Motivation: The ability to navigate a shell, like bash, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will come in handy. We want to expose you to some of the most useful bash tools, help you navigate a filesystem, and even run bash tools from within an RMarkdown file in RStudio.

Context: At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file.

Scope: bash, RStudio

Learning objectives:

  • Distinguish differences in /home, /scratch, and /class.
  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
  • Analyzing files in a UNIX filesystem: wc, du, cat, head, tail, etc.
  • Creating and destroying files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
  • Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.
  • Use man to read and learn about UNIX utilities.
  • Run bash commands from within an RMarkdown file in RStudio.

There are a variety of ways to connect to Scholar. In this class, we will primarily connect to RStudio Server by opening a browser and navigating to https://rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface.

Here is a video to remind you about some of the basic tools you can use in UNIX/Linux:

Click here for video

This is the easiest book for learning this stuff; it is short and gets right to the point:

https://go.oreilly.com/purdue-university/library/view/-/0596002610

you just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip chapters 2 and 6 the first time through).

It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details.

Solution

Here are some changes Kevin likes:

  • Uncheck "Restore .Rdata into workspace at startup".
  • Change tab width 4.
  • Check "Soft-wrap R source files".
  • Check "Highlight selected line".
  • Check "Strip trailing horizontal whitespace when saving".
  • Uncheck "Show margin".
2. There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a bash shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient!
What is the default directory of your bash shell?

Hint: Start by reading the section on man. man stands for manual, and you can find the "official" documentation for the command by typing man <command_of_interest>. For example:

# read the manual for the `man` command
# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down
man man 

Relevant topics: man, pwd, ~, .., .

Item(s) to submit:

  • The full filepath of the default directory (home directory). Ex: Kevin's is: /home/kamstut
  • The bash code used to show your home directory or current directory (also known as the working directory) when the bash shell is first launched.
Solution
# whatever is stored in the $HOME environment variable 
# is what ~ represents
cd ~
pwd

# if we change $HOME, ~ changes too!
HOME=/home/kamstut/projects
cd ~
pwd

# if other users on the linux system share certain files or folders
# in their home directory, you can access their home folder similarly
ls ~mdw

# but they _have_ to give you permissions 
3. Learning to navigate away from our home directory to other folders, and back again, is vital. Perform the following actions, in order:
  • Write a single command to navigate to the folder containing our full datasets: /class/datamine/data.
  • Write a command to confirm you are in the correct folder.
  • Write a command to list the files and directories within the data directory. (You do not need to recursively list subdirectories and files contained therein.) What are the names of the files and directories?
  • Write another command to return back to your home directory.
  • Write a command to confirm you are in the correct folder.

Note: / is commonly referred to as the root directory in a linux/unix filesystem. Think of it as a folder that contains every other folder in the computer. /home is a folder within the root directory. /home/kamstut is the full filepath of Kevin's home directory. There is a folder home inside the root directory. Inside home is another folder named kamstut which is Kevin's home directory.

Relevant topics: man, cd, pwd, ls, ~, .., .

Item(s) to submit:

  • Command used to navigate to the data directory.
  • Command used to confirm you are in the data directory.
  • Command used to list files and folders.
  • List of files and folders in the data directory.
  • Command used to navigate back to the home directory.
  • Command used to confirm you are in the home directory.
Solution
# navigate to the data directory using `cd` (change directory)
cd /class/datamine/data

# confirm the location using `pwd` (print working directory)
pwd

# list files
ls

# cd without any arguments automatically returns to the directory
# saved in the $HOME environment variable
cd

# another trick, if you wanted to _quickly_ return to the data 
# directory, or most recent directory is the following (uncommented)
cd ~-

# confirm the location using pwd
pwd
4. Let's learn about two more important concepts. . refers to the current working directory, or the directory displayed when you run pwd. Unlike pwd you can use this when navigating the filesystem! So, for example, if you wanted to see the contents of a file called my_file.txt that lives in /home/kamstut (so, a full path of /home/kamstut/my_file.txt), and you are currently in /home/kamstut, you could run: cat ./my_file.txt.
.. represents the parent folder or the folder in which your current folder is contained. So let's say I was in /home/kamstut/projects/ and I wanted to get the contents of the file /home/kamstut/my_file.txt. You could do: cat ../my_file.txt.
When you navigate a directory tree using . and .. you create paths that are called relative paths, because they are relative to your current directory. Alternatively, a full path (or absolute path) is the path starting from the root directory. So /home/kamstut/my_file.txt is the absolute path for my_file.txt and ../my_file.txt is a relative path. Perform the following actions, in order:
  • Write a single command to navigate to the data directory.
  • Write a single command to navigate back to your home directory using a relative path. Do not use ~ or the cd command without a path argument.

Relevant topics: man, cd, pwd, ls, ~, .., .

Item(s) to submit:

  • Command used to navigate to the data directory.
  • Command used to navigate back to your home directory that uses a relative path.
Solution
cd /class/datamine/data
pwd
cd ../../../home/kamstut
pwd
5. In Scholar, when you want to deal with really large amounts of data, you want to access scratch (you can read more here). Your scratch directory on Scholar is located here: /scratch/scholar/$USER. $USER is an environment variable containing your username. Test it out: echo /scratch/scholar/$USER. Perform the following actions:
  • Navigate to your scratch directory.
  • Confirm you are in the correct location.
  • Execute myquota.
  • Find the location of the myquota bash script.
  • Output the first 5 and last 5 lines of the bash script.
  • Count the number of lines in the bash script.
  • How many kilobytes is the script?

Hint: You could use each of the commands in the relevant topics once.

Hint: When you type myquota on Scholar there are sometimes two warnings about xauth but sometimes there are no warnings. If you get a warning that says Warning: untrusted X11 forwarding setup failed: xauth key data not generated it is safe to ignore this error.

Hint: Commands often have options. Options are flags that modify a command's behavior, and you can see the options of a command in the DESCRIPTION section of its man pages. For example: man wc. You can see that -m, -l, and -w are all options for wc. To test this out:

# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command.
wc /class/datamine/data/flights/1987.csv
# to count the lines, use the -l option
wc -l /class/datamine/data/flights/1987.csv
# to count the words, use the -w option
wc -w /class/datamine/data/flights/1987.csv
# you can combine options as well
wc -w -l /class/datamine/data/flights/1987.csv
# some people like to use a single tack `-`
wc -wl /class/datamine/data/flights/1987.csv
# order doesn't matter
wc -lw /class/datamine/data/flights/1987.csv

Hint: The -h option for the du command is useful.

Relevant topics: cd, pwd, type, head, tail, wc, du

Item(s) to submit:

  • Command used to navigate to your scratch directory.
  • Command used to confirm your location.
  • Output of myquota.
  • Command used to find the location of the myquota script.
  • Absolute path of the myquota script.
  • Command used to output the first 5 lines of the myquota script.
  • Command used to output the last 5 lines of the myquota script.
  • Command used to find the number of lines in the myquota script.
  • Number of lines in the script.
  • Command used to find out how many kilobytes the script is.
  • Number of kilobytes that the script takes up.
Solution
# navigate to my scratch folder
cd /scratch/scholar/$USER

# confirm
pwd

# what is my quota, execute the myquota script
myquota

# get the location of the myquota script
type myquota

# get the first 5 lines of the myquota script
head /usr/local/bin/myquota

# get the last 5 lines of the myquota script
tail /usr/local/bin/myquota

# get the number of lines in the myquota script
wc -l /usr/local/bin/myquota

# get the number of kilobytes of the myquota script
du -h --apparent-size /usr/local/bin/myquota
ls -la /usr/local/bin/myquota
6. Perform the following operations:
  • Navigate to your scratch directory.
  • Copy and paste the file: /class/datamine/data/flights/1987.csv to your current directory (scratch).
  • Create a new directory called my_test_dir in your scratch folder.
  • Move the file you copied to your scratch directory, into your new folder.
  • Use touch to create an empty file named im_empty.txt in your scratch folder.
  • Remove the directory my_test_dir and the contents of the directory.
  • Remove the im_empty.txt file.

Hint: rmdir may not be able to do what you think, instead, check out the options for rm using man rm.

Relevant topics: cd, cp, mv, mkdir, touch, rmdir, rm

Item(s) to submit:

  • Command used to navigate to your scratch directory.
  • Command used to copy the file, /class/datamine/data/flights/1987.csv to your current directory (scratch).
  • Command used to create a new directory called my_test_dir in your scratch folder.
  • Command used to move the file you copied earlier 1987.csv into your new my_test_dir folder.
  • Command used to create an empty file named im_empty.txt in your scratch folder.
  • Command used to remove the directory and the contents of the directory my_test_dir.
  • Command used to remove the im_empty.txt file.
Solution
# navigate to the scratch folder
cd /scratch/scholar/$USER

# copy the 1987.csv file to the current directory (scratch)
cp /class/datamine/data/flights/1987.csv .

# make a directory in the scratch directory called `my_test_dir`
mkdir my_test_dir

# move 1987.csv to the new folder
mv 1987.csv my_test_dir

# create an empty file in the scratch folder
touch im_empty.txt

# remove the directory and the contents of the directory
rm -r my_test_dir

# remove the im_empty.txt file
rm im_empty.txt
7. Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.

Project 4


Motivation: The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. grep is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, even professionals can make critical mistakes. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.

Context: We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.

Scope: grep, regular expression basics, utilizing regular expression tools in R and Python

Learning objectives:

  • Use grep to search for patterns within a dataset.
  • Use cut to section off and slice up data from the command line.
  • Use wc to count the number of lines of input.

You can find useful examples that walk you through relevant material in The Examples Book:

https://thedatamine.github.io/the-examples-book

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

Important note: I would highly recommend using single quotes ' to surround your regular expressions. Double quotes can have unexpected behavior due to some shells' expansion rules. In addition, pay close attention to escaping certain characters in your regular expressions.

Dataset

The following questions will use the dataset the_office_dialogue.csv found in Scholar under the data directory /class/datamine/data/. A public sample of the data can be found here: the_office_dialogue.csv

Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.

grep stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep, we will be using it with textual data. You can read about and see examples of grep here.

1. Login to Scholar and use grep to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica.". Where is it located exactly?

Relevant topics: grep

Item(s) to submit:

  • The grep command used to find the dataset.
  • The name and location in Scholar of the dataset.
Solution
grep -Ri "bears. beets. battlestar galactica." /class/datamine 

/class/datamine/data/movies_and_tv/the_office_dialogue.csv

2. grep prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first n lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in.
Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.

Hint: The result if you were to search for "bears. beets. battlestar galactica." should be:

"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."

Hint: One method to solve this problem would be to pipe the output from grep to cut.

Relevant topics: cut, grep

Item(s) to submit:

  • The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".
Solution
grep -i "bears. beets. battlestar galactica." /class/datamine/data/movies_and_tv/the_office_dialogue.csv | cut -d "," -f 7,8  
3. This particular dataset happens to be very small. You could imagine a scenario where the file is many gigabytes and not easy to load completely into R or Python. We are interested in learning what makes Jim and Pam tick as a couple. Use a line of UNIX commands to create a new dataset called jim_and_pam.csv (remember, a good place to store data temporarily is /scratch/scholar/$USER). Include only lines that are spoken by either Jim or Pam, or reference Jim or Pam in any way. How many rows of data are in the new file? How many megabytes is the new file (to the nearest 1/10th of a megabyte)?

Hint: Redirection.

Hint: It is OK if you get an erroneous line where the word "jim" or "pam" appears as a part of another word.

Relevant topics: grep, ls, wc, redirection

Item(s) to submit:

  • The line of UNIX commands used to create the new file.
  • The number of rows of data in the new file, and the accompanying UNIX command used to find this out.
  • The number of megabytes (to the nearest 1/10th of a megabyte) that the new file has, and the accompanying UNIX command used to find this out.
Solution
grep -i "jim\|pam" /class/datamine/data/movies_and_tv/the_office_dialogue.csv | cut -d "," -f4,7,8,9,12 > jim_and_pam.csv
wc -l jim_and_pam.csv

13779 lines

ls -lh jim_and_pam.csv

1.4 MB

4. Find all lines where Jim's, Pam's, Michael's, or Dwight's name is followed by an exclamation mark. Use only 1 "!" within your regular expression. How many lines are there? Ignore case (whether or not parts of the names are capitalized).

Relevant topics: grep, basic matches, escaping characters

Item(s) to submit:

  • The UNIX command(s) used to solve this problem.
  • The number of lines where either Jim/Pam/Michael/Dwight's name is followed by an exclamation mark.
Solution
grep -Ei '(Jim|Pam|Michael|Dwight)!' the_office_dialogue.csv | wc -l

# or

grep -i '\(Jim\|Pam\|Michael\|Dwight\)!' the_office_dialogue.csv | wc -l
5. Find all lines that contain the text "that's what" followed by any amount of any text and then "said". How many lines are there?

Relevant topics: grep

Item(s) to submit:

  • The UNIX command used to solve this problem.
  • The number of lines that contain the text "that's what" followed by any amount of text and then "said".
Solution
grep -i "that's what .* said" /class/datamine/data/movies_and_tv/the_office_dialogue.csv

Regular expressions are really a useful, semi language-agnostic tool. What this means is that regardless of the programming language you are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using read.csv. Name the resulting data.frame dat.

6. The text_w_direction column in dat contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expression to detect the directions.
(a) Create a new column called has_direction that is set to TRUE if the text_w_direction column has direction, and FALSE otherwise. Use the grepl function in R to accomplish this.

Hint: Make sure all opening brackets "[" have a corresponding closing bracket "]".

Hint: Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.

(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have at least 2 directions? Modify your code again and find how many have at least 5.

We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.

This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

Your solution to part (a) should match both lines. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.

In our actual dataset, for example, dat$text_w_direction[2789] is a line with 2 directions.

Relevant topics: grep, grepl, basic matches, escaping characters

Item(s) to submit:

  • The R code and regular expression used to solve the first part of this problem.
  • The R code and regular expression used to solve the second part of this problem.
  • How many lines have >= 2 directions?
  • How many lines have >= 5 directions?
Solution
dat$has_direction <- grepl("(\\[.*\\])+", dat$text_w_direction)

#

length(grep("\\[.*\\].*\\[.*\\]", dat$text_w_direction))
length(grep("\\[.*\\].*\\[.*\\].*\\[.*\\].*\\[.*\\].*\\[.*\\]", dat$text_w_direction))
OPTIONAL QUESTION. Use the str_extract_all function from the stringr package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called direction.
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

In this question, your solution may have extracted:

[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]

(It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.)

Relevant topics: str_extract_all, basic matches, escaping characters

Item(s) to submit:

  • The R code used to solve this problem.
Solution
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[[^\\[\\]]*\\])", simplify=F)

# or 

dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[.*?\\])", simplify=F)

Project 5


Motivation: Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc.

Context: We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.

Scope: grep, regular expression basics, UNIX utilities, redirection, piping

Learning objectives:

  • Use cut to section off and slice up data from the command line.
  • Use piping to string UNIX commands together.
  • Use sort and its options to sort data in different ways.
  • Use head to isolate n lines of output.
  • Use wc to summarize the number of lines in a file or in output.
  • Use uniq to filter out non-unique lines.
  • Use grep to search files effectively.

You can find useful examples that walk you through relevant material in The Examples Book:

https://thedatamine.github.io/the-examples-book

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

Don't forget the very useful documentation shortcut ? for R code. To use, simply type ? in the console, followed by the name of the function you are interested in. In the Terminal, you can use the man command to check the documentation of bash code.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/amazon/amazon_fine_food_reviews.csv

A public sample of the data can be found here: amazon_fine_food_reviews.csv

Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.

Here are three videos that might also be useful, as you work on Project 5:

Click here for video

Click here for video

Click here for video

Questions

1. What is the Id of the most helpful review, according to the highest HelpfulnessNumerator?

Important note: You can always pipe output to head in case you want the first few values of a lot of output. Note that if you used sort before head, you may see the following error messages:

sort: write failed: standard output: Broken pipe
sort: write error

This is because head would truncate the output from sort. This is okay. See this discussion for more details.

Relevant topics: cut, sort, head, piping

Item(s) to submit:

  • Line of UNIX commands used to solve the problem.
  • The Id of the most helpful review.
Solution
sort -t, -k5rn,5 amazon_fine_food_reviews.csv | head -n3 | cut -d, -f1,5
2. Some entries under the Summary column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion.
To further clarify what we mean by unique, if we had the following vector in R, c("a", "b", "a", "c"), its unique values are c("a", "b", "c").

Relevant topics: cut, uniq, sort, wc, piping

Item(s) to submit:

  • Two lines of UNIX commands used to solve the problem.
  • The proportion of unique Summary entries.
Solution
cut -d, -f9 amazon_fine_food_reviews.csv | sort -u | wc -l
cut -d, -f9 amazon_fine_food_reviews.csv | wc -l

# 295162/568455
3. Use a chain of UNIX commands, piped in a sequence, to create a frequency table of Score.

Relevant topics: cut, uniq, sort, piping

Item(s) to submit:

  • The line of UNIX commands used to solve the problem.
  • The frequency table.
Solution
cut -d, -f7 amazon_fine_food_reviews.csv | sort -n | uniq -c
4. Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why?

Hint: You may need to pipe the output to sort multiple times.

Hint: To create the frequency table, read through the man pages for uniq. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following:

man uniq

Relevant topics: cut, uniq, sort, head, piping, man

Item(s) to submit:

  • The line of UNIX commands used to solve the problem.
  • The frequency table.
Solution
cut -d, -f3 amazon_fine_food_reviews.csv | sort | uniq -c | sort -nr | head -n1

# You should use UserId instead of ProfileName as the former is a unique identifier
5. Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let's consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let's consider anything in between neither strongly positive nor negative.

Hint: You may find the solution to problem (3) useful.

Relevant topics: cut, uniq, sort, grep, piping

Item(s) to submit:

  • The line of UNIX commands used to solve the problem.
Solution
grep -i 'A3OXHLG6DIBRW8' amazon_fine_food_reviews.csv | cut -d, -f7 | sort | uniq -c
6. Find the most helpful review with a Score of 5. Then (separately) find the most helpful review with a Score of 1. As before, we are considering the most helpful review to be the review with the highest HelpfulnessNumerator.

Hint: You can use multiple lines to solve this problem.

Relevant topics: sort, head, piping

Item(s) to submit:

  • The lines of UNIX commands used to solve the problem.
  • ProductId's of both requested reviews.
Solution
# most helpful review with a Score of 5: sort descending by Score (field 7), then descending by HelpfulnessNumerator (field 5)
sort -t, -k7rn,7 -k5rn,5 amazon_fine_food_reviews.csv | head -n2
# most helpful review with a Score of 1: sort ascending by Score, then descending by HelpfulnessNumerator
sort -t, -k7n,7 -k5rn,5 amazon_fine_food_reviews.csv | head -n2
OPTIONAL QUESTION. For only the two ProductIds from the previous question, create a new dataset called scores.csv that contains the ProductIds and Scores from all reviews for these two items.

Relevant topics: cut, grep, redirection

Item(s) to submit:

  • The line of UNIX commands used to solve the problem.
Solution
cut -d, -f2,7 amazon_fine_food_reviews.csv | grep -Ei '(B00012182G|B003EMQGVI|B00065LI0A)' > scores.csv

Project 6


Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives:

  • Use awk to process and manipulate textual data.
  • Use piping and redirection within the terminal to pass around data between utilities.

Dataset

The following questions will use the dataset found here or in Scholar:

/class/datamine/data/flights/subset/YYYY.csv

An example from 1987 data can be found here or in Scholar:

/class/datamine/data/flights/subset/1987.csv

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

1. In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the Origin, from 1987.csv. Write another line, this time using awk to do the same thing. Which one do you prefer, and why?

Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk:

Click here for video
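As a small generic illustration of the difference (using a hypothetical file data.csv, not one of the course datasets): both tools can pull out a single comma-separated field, but awk can also filter and transform the field as it extracts it.

cut -d, -f2 data.csv                                # prints the 2nd field of every line
awk -F, '{print $2}' data.csv                       # same output as the cut command
awk -F, '$2 != "" {print toupper($2)}' data.csv     # awk can also filter and transform as it goes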

Relevant topics: cut, awk

Item(s) to submit:

  • One line of UNIX commands to solve the problem without using awk.
  • One line of UNIX commands to solve the problem using awk.
  • 1-2 sentences describing which method you prefer and why.
Solution
cut -d, -f17 1987.csv
awk -F, '{print $17}' 1987.csv
2. Write a bash script that accepts a year (1987, 1988, etc.) and a column n and returns the nth column of the associated year of data.

Here are two examples to illustrate how to write a bash script:

Click here for video

Click here for video

Hint: In this example, you only need to turn in the content of your bash script (starting with #!/bin/bash) without evaluation in a code chunk. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it:

#!/bin/bash

echo "First argument: $1"
echo "Second argument: $2"

If you simply drop that text into a file called my_script.sh, located here: /home/$USER/my_script.sh, and if you run the following:

# Setup bash to run; this only needs to be run one time per session.
# It makes bash behave a little more naturally in RStudio.
exec bash

# Navigate to the location of my_script.sh
cd /home/$USER

# Make sure that the script is runable.
# This only needs to be done one time for each new script that you write.
chmod 755 my_script.sh

# Execute my_script.sh
./my_script.sh okay cool

then it will print:

First argument: okay
Second argument: cool

In this example, if we were to turn in the "content of your bash script (starting with #!/bin/bash) in a code chunk", our solution would look like this:

#!/bin/bash

echo "First argument: $1"
echo "Second argument: $2"

And although we aren't running the code chunk above, we know that it works because we tested it in the terminal.

Hint: Using awk you could have a script with just two lines: 1 with the "hash-bang" (#!/bin/bash), and 1 with a single awk command.

Relevant topics: awk, bash scripts

Item(s) to submit:

  • The content of your bash script (starting with #!/bin/bash) in a code chunk.
Solution
#!/bin/bash
# $1 is the year (e.g., 1987) and $2 is the column number to print
awk -F, -v col=$2 '{print $col}' $1.csv
3. How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using awk, then solve this problem using only awk.

Here is a similar example, using the election data set:

Click here for video

Relevant topics: cut, grep, wc, awk, piping

Item(s) to submit:

  • One line of UNIX commands to solve the problem without using awk.
  • One line of UNIX commands to solve the problem using awk.
  • The number of flights that arrived at Indianapolis (IND) in 2008.
Solution
cut -d, -f18 2008.csv | grep 'IND' | wc -l
awk -F, '{if ($18 == "IND") count++}END{print count}' 2008.csv
4. Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you'd like. Are they indeed the same? How many unique values do we have per category (Origin, Dest)?

Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data:

Click here for video

Relevant topics: cut, sort, uniq, wc, awk

Item(s) to submit:

  • 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same.
  • The UNIX command(s) used to figure out if the number of unique origins and destinations are the same.
  • The number of unique values per category (Origin, Dest).
Solution
cut -d, -f17 2008.csv | sort | uniq | wc -l
cut -d, -f18 2008.csv | sort | uniq | wc -l
5. In (4) we found that there are not the same number of unique Origin's as Dest's. Find the IATA airport code for all Origin's that don't appear in a Dest and all Dest's that don't appear in an Origin in the 2008 data.

Hint: The examples on this page should help. Note that these examples are based on Process Substitution, which basically allows you to specify commands whose output would be used as the input of comm. There should be no space between < and (, otherwise your bash will not work as intended.

Relevant topics: comm, cut, sort, uniq, redirection

Item(s) to submit:

  • The line(s) of UNIX command(s) used to answer the question.
  • The list of Origins that don't appear in Dest.
  • The list of Dests that don't appear in Origin.
Solution
comm -23 <(cut -d, -f17 2008.csv | sort | uniq) <(cut -d, -f18 2008.csv | sort | uniq)
comm -13 <(cut -d, -f17 2008.csv | sort | uniq) <(cut -d, -f18 2008.csv | sort | uniq)
6. What was the percentage of flights in 2008 per unique Origin with the Dest of "IND"? What percentage of flights had "PHX" as Origin (among all flights with Dest of "IND")?

Here is an example using the percentages of donations contributed from CEOs from various States:

Click here for video

Hint: You can do the mean calculation in awk by dividing the result from (3) by the number of unique Origin's that have a Dest of "IND".
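For instance, here is a hedged sketch of that calculation in awk (assuming, as in the solutions above, that field 17 is Origin and field 18 is Dest). The first line computes the average number of flights per unique Origin among flights with Dest "IND"; the second computes the percentage of those flights that originated at "PHX".

awk -F, '$18=="IND" {n++; if (seen[$17]++ == 0) k++} END {print "average flights per unique Origin:", n/k}' 2008.csv
awk -F, '$18=="IND" {n++; if ($17=="PHX") p++} END {print "percent of IND arrivals from PHX:", 100*p/n}' 2008.csv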

Relevant topics: awk, sort, grep, wc

Item(s) to submit:

  • The percentage of flights in 2008 per unique Origin with the Dest of "IND".
  • 1-2 sentences explaining how "PHX" compares (as a unique Origin) to the other Origins (all with the Dest of "IND").
Solution
# number of unique Origins among flights with Dest "IND"
awk -F, '{if($18=="IND") print $17}' 2008.csv | sort -u | wc -l
# number of flights with Origin "PHX" and Dest "IND"
awk -F, '{if($18=="IND") print $17}' 2008.csv | grep -i PHX | wc -l
OPTIONAL QUESTION. Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like:
1987, 12345
1988, 44

Run the script with inputs: 1991 and ORD. Include the output in your submission.

Relevant topics: bash scripts, cut, piping, grep, wc

Item(s) to submit:

  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.
  • The output of the script given 1991 and ORD as inputs.
Solution
#!/bin/bash
# $1 is the year (e.g., 1991) and $2 is the IATA airport code (e.g., ORD)
FLIGHTS_OUT="$(cut -d, -f17 $1.csv | grep -i $2 | wc -l)"
FLIGHTS_IN="$(cut -d, -f18 $1.csv | grep -i $2 | wc -l)"
echo "$1, $((FLIGHTS_OUT + FLIGHTS_IN))"

Project 7


Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the second part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives:

  • Use awk to process and manipulate textual data.
  • Use piping and redirection within the terminal to pass around data between utilities.

Dataset

The following questions will use the dataset found in Scholar: /class/datamine/data/flights/subset/YYYY.csv

An example of the data for the year 1987 can be found here.

Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

1. Write a line of code that prints a list of the unique values in the DayOfWeek column. Write a line of code that prints a list of the unique values in the DayOfMonth column. Write a line of code that prints a list of the unique values in the Month column. Use the 1987.csv dataset. Are the results what you expected?

Relevant topics: cut, sort

Item(s) to submit:

  • 3 lines of code used to get a list of unique values for the chosen columns.
  • 1-2 sentences explaining whether or not the results are what you expected.
Solution
cut -d, -f4 1987.csv | sort -nu
cut -d, -f3 1987.csv | sort -nu
cut -d, -f2 1987.csv | sort -nu
# We can print the unique values in the DayOfWeek, DayOfMonth, and Month columns
# using these three pipelines, respectively, in the bash shell:
cat /class/datamine/data/flights/subset/1987.csv | cut -d, -f4 | sort | uniq
cat /class/datamine/data/flights/subset/1987.csv | cut -d, -f3 | sort | uniq
cat /class/datamine/data/flights/subset/1987.csv | cut -d, -f2 | sort | uniq

# or we can do this equivalently with awk:
cat /class/datamine/data/flights/subset/1987.csv | awk -F, '{print $4}' | sort | uniq
cat /class/datamine/data/flights/subset/1987.csv | awk -F, '{print $3}' | sort | uniq
cat /class/datamine/data/flights/subset/1987.csv | awk -F, '{print $2}' | sort | uniq
2. Our files should have 29 columns. For a given file, write a line of code that prints any lines that do not have 29 columns. Test it on 1987.csv. Were there any rows without 29 columns?

Hint: See here. NF looks like it may be useful!

Relevant topics: awk

Item(s) to submit:

  • Line of code used to solve the problem.
  • 1-2 sentences explaining whether or not there were any rows without 29 columns.
Solution
awk -F, '{if (NF != 29) print $0}' 1987.csv
# We can print any lines that do not have 29 columns, as follows:
cat /class/datamine/data/flights/subset/1987.csv | awk -F, '{if (NF != 29) {print $0}}'
# Note that there are no such lines.
# In other words, every line has exactly 29 columns.
3. Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do not have 29 columns.

Relevant topics: awk, bash scripts

Item(s) to submit:

  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.
  • The results of running your bash scripts from year 1987 to 2008.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do
  awk -F, '{if (NF != 29) print $0}' $f.csv
done
# Here is a bash script for which the variable i cycles through the
# desired years, and the script prints any lines that do not have 29 fields.

#!/bin/bash
for (( i=$1; i<=$2; i++ ))
do
  cat /class/datamine/data/flights/subset/${i}.csv | awk -F, '{if (NF != 29) {print $0}}'
done

# change permissions, to make the script executable
chmod 755 myscript.sh 

./myscript.sh 1987 2008  # run the bash script

# As in question 2, note that there are no such lines.
# In other words, every line has exactly 29 columns.
4. awk is a really good tool to quickly get some data and manipulate it a little bit. The column Distance contains the distances of the flights in miles. Use awk to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.

Example output:

Miles: 12345
Kilometers: 19867.35168

Relevant topics: awk, piping

Item(s) to submit:

  • The code used to solve the problem.
  • The results of running the code.
Solution
awk -F, '{miles=miles+$19}END{print "Miles: " miles "\nKilometers: " miles*1.609344}' 1990.csv
# The total distance traveled in miles is stored in mytotaldistance
# and at the end, we print mytotaldistance (which is given in miles)
# and also mytotaldistance*1.609344 (which is the total distance in
# kilometers).
cat /class/datamine/data/flights/subset/1990.csv | awk -F, '{mytotaldistance = mytotaldistance + $19} END{print "The total distance is ", mytotaldistance, " miles."; print "Equivalently, this is ", mytotaldistance*1.609344, " kilometers."}'
5. Use awk to calculate the sum of the number of DepDelay minutes, grouped according to DayOfWeek. Use 2007.csv.

Example output:

DayOfWeek:  0
1:  1234567
2:  1234567
3:  1234567
4:  1234567
5:  1234567
6:  1234567
7:  1234567

Note: 1 is Monday.

Relevant topics: awk, sort, piping

Item(s) to submit:

  • The code used to solve the problem.
  • The output from running the code.
Solution
awk -F, '{delay[$4]=delay[$4]+$16}END{for (d in delay) print d": ", delay[d]}' 2007.csv | sort -n
# We store the DepDelay in an array called mytotaldelays:
cat /class/datamine/data/flights/subset/2007.csv | awk -F, '{mytotaldelays[$4] = mytotaldelays[$4] + $16} END{for (i=1; i<=7; i++) {print "The total DepDelay in minutes is ", mytotaldelays[i], " on day ", i, " of the week."}}'
6. It wouldn't be fair to compare the total DepDelay minutes by DayOfWeek as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of DepDelay minutes by the number of flights per DayOfWeek. Use 2007.csv.

Example output:

DayOfWeek:  0
1:  1.234567
2:  1.234567
3:  1.234567
4:  1.234567
5:  1.234567
6:  1.234567
7:  1.234567

Relevant topics: awk, sort, piping

Item(s) to submit:

  • The code used to solve the problem.
  • The output from running the code.
Solution
awk -F, '{delay[$4]=delay[$4]+$16; flights[$4]++}END{for (d in delay) print d": ", delay[d]/flights[d]}' 2007.csv | sort -n
# Now we also keep track of mynumberofflights on each day of the week.
cat /class/datamine/data/flights/subset/2007.csv | awk -F, '{mytotaldelays[$4] = mytotaldelays[$4] + $16; mynumberofflights[$4] = mynumberofflights[$4] + 1} END{for (i=1; i<=7; i++) {print "The average DepDelay in minutes is ", mytotaldelays[i]/mynumberofflights[i], " on day ", i, " of the week."}}'

Project 8


Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives:

  • Use awk to process and manipulate textual data.
  • Use piping and redirection within the terminal to pass around data between utilities.

Dataset

The following questions will use the dataset found in Scholar: /class/datamine/data/flights/subset/YYYY.csv

An example of the data for the year 1987 can be found here.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

1. Let's say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true!
Write a line of awk code that prints the total number of flights that occur on weekend days, followed by the total number of flights that occur on the weekdays. Complete this calculation for 2008 using the 2008.csv file.

Note: Under the column DayOfWeek, Monday through Sunday are represented by 1-7, respectively.

Relevant topics: awk

Item(s) to submit:

  • Line of awk code that solves the problem.
  • The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.
Solution
awk -F, '{if ($4==1 || $4==2 || $4==3 || $4==4) weekday++; else (weekend++)}END{print weekend,weekday}' 2008.csv
awk -F, '{if ($4 ~ /1|2|3|4/) weekday++; else (weekend++)}END{print weekend,weekday}' 2008.csv
2. Note that in (1), we are comparing 3 days to 4! Write a line of awk code that prints the average number of flights on a weekend day, followed by the average number of flights on a weekday. Continue to use data for 2008.

Hint: You don't need a large if statement to do this, you can use the ~ comparison operator.

Relevant topics: awk

Item(s) to submit:

  • Line of awk code that solves the problem.
  • The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.
Solution
awk -F, '{if ($4==1 || $4==2 || $4==3 || $4==4) weekday++; else (weekend++)}END{print weekend/3,weekday/4}' 2008.csv
awk -F, '{if ($4 ~ /1|2|3|4/) weekday++; else (weekend++)}END{print weekend/3,weekday/4}' 2008.csv
3. We want to look to see if there may be some truth to the whole "snow bird" concept where people will travel to warmer states like Florida and Arizona during the Winter. Let's use the tools we've learned to explore this a little bit.
Take a look at airports.csv. In particular run the following:
head airports.csv
Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within awk. This is messy and we would prefer to create a new file called new_airports.csv without any quotes. Write a line of code to do this.

Note: You may be wondering why we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It's important to practice and learn ways to fix these things.

Hint: You could use gsub within awk to replace '"' with ''. You can find how to use gsub here.

Hint: If you leave out the column number argument to gsub it will apply the substitution to every field in every column.

Hint:

cat new_airports.csv | wc -l # should be 159 without header

Relevant topics: awk, redirection

Item(s) to submit:

  • Line of awk code used to create the new dataset.
Solution
awk -F, '{gsub(/"/, ""); print $0}' airports.csv > new_airports.csv
4. Write a line of commands that creates a new dataset called az_fl_airports.txt. az_fl_airports.txt should only contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3), new_airports.csv, as a starting point.
How many airports are there? Did you expect this? Use a line of bash code to count this.
Create a new dataset (called az_fl_flights.txt) that contains all of the data for flights into or out of Florida and Arizona (using the 2008.csv file). Use the newly created dataset, az_fl_airports.txt to accomplish this.

Hint: https://unix.stackexchange.com/questions/293684/basic-grep-awk-help-extracting-all-lines-containing-a-list-of-terms-from-one-f

Hint:

cat az_fl_flights.txt | wc -l # should be 484705

Relevant topics: awk, wc, piping

Item(s) to submit:

  • All UNIX commands used to answer the questions.
  • The number of airports.
  • 1-2 sentences explaining whether you expected this number of airports.
Solution
awk -F, '{if ($4 == "AZ" || $4 == "FL") print $1}' new_airports.csv > az_fl_airports.txt
wc -l az_fl_airports.txt
grep -w -F -f az_fl_airports.txt 2008.csv > az_fl_flights.txt
5. Write a bash script that accepts the year as an argument and performs the same operations as in question 4, returning the number of flights into and out of both AZ and FL for any given year.

Relevant topics: bash scripts, grep, for loop, redirection

Item(s) to submit:

  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.
  • The line of UNIX code you used to execute the script and create the new dataset.
Solution
#!/bin/bash
# $1 is the first year, $2 is the last year, and $3 is the file of airport codes (e.g., az_fl_airports.txt)
for ((f=$1; f<=$2; f++)); do
  grep -w -F -f $3 $f.csv
done
./my_script.sh 1987 2008 az_fl_airports.txt > az_fl_flights.csv

Project 9


Motivation: Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, cloudflare, a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per this thread on hackernews). Learning SQL is well worth your time!

Context: There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite.

Scope: SQL, sqlite

Learning objectives:

  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/lahman/lahman.db

This is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

Important note: For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:

```{r, include=F}
library(RSQLite)
lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db")
```

Each solution should then be placed in a code chunk like this:

```{sql, connection=lahman}
SELECT * FROM batting LIMIT 1;
```

If you want to use a SQLite-specific function like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/lahman/lahman.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

```{sql, connection=lahman, eval=F}
SELECT * FROM batting LIMIT 1;
```

This will allow the code to be displayed without throwing an error.

1. Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal and access the Lahman database. How many tables are available?

Hint: To connect to the database, do the following:

sqlite3 /class/datamine/data/lahman/lahman.db

Hint: This is a good resource.

Relevant topics: sqlite3

Item(s) to submit:

  • How many tables are available in the Lahman database?
  • The sqlite3 commands used to figure out how many tables are available.
Solution
.tables
2. Some people like to try to visit all 30 MLB ballparks in their lifetime. Use SQL commands to get a list of parks and the cities they're located in. For your final answer, limit the output to 10 records/rows.

Note: There may be more than 30 parks in your result, this is ok. For long results, you can limit the number of printed results using the LIMIT clause.

Hint: Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following:

.headers on

Relevant topics: SELECT, FROM, LIMIT

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT parkname, city FROM parks LIMIT 10;
3. There is nothing more exciting to witness than a home run hit by a batter. It's impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (HR) in a season. List their playerID, yearID, home run total, and the teamID they played for.

Hint: There are 8 occurrences of home runs greater than or equal to 60.

Hint: The batting table is where you should look for this question.

Relevant topics: SELECT, FROM, WHERE, LIMIT

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT playerID, yearID, HR, teamID FROM batting WHERE HR>=60;
4. Make a list of players born on your birth day (don't worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year.

Hint: The people table is where you should look for this question.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT nameFirst, nameLast, birthYear FROM people
  WHERE birthMonth==5 AND birthDay==29
  ORDER BY birthYear DESC LIMIT 10;
5. Get the Cleveland (CLE) Pitching Roster from the 2016 season (playerID, W, L, SO). Order the pitchers by number of Strikeouts (SO) in descending order.

Hint: The pitching table is where you should look for this question.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT playerID, W, L, SO FROM pitching 
    WHERE teamID=='CLE' AND yearID==2016
    ORDER BY SO DESC LIMIT 10;
6. Find the 10 team and year pairs that have the highest number of Errors (E) between 1960 and 1970. Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs?

Hint: The teams table is where you should look for this question.

Hint: The BETWEEN clause is useful here.

Hint: It is OK to use multiple queries to answer the question.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT teamID, franchID, yearID, W, L, E FROM teams
  WHERE yearID BETWEEN 1960 AND 1970
  ORDER BY E DESC
  LIMIT 10;
SELECT franchName FROM teamsfranchises WHERE franchID=='LAA';
-- Equivalently, without using BETWEEN:
SELECT teamID, franchID, yearID, W, L, E FROM teams
  WHERE yearID >= 1960 AND yearID <= 1970
  ORDER BY E DESC
  LIMIT 10;
SELECT franchName FROM teamsfranchises WHERE franchID=='LAA';
7. Find the playerID for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table pitching)? What year and team did he win the most games as a manager (use table managers)?

Hint: It is OK to use multiple queries to answer the question.

Note: There was a tie among the two years in which Bob Lemon had the most wins as a pitcher.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT playerID FROM people
  WHERE nameFirst=='Bob' AND nameLast=='Lemon';
SELECT teamID, yearID, W FROM pitching
  WHERE playerID=='lemonbo01'
  ORDER BY W DESC;
SELECT teamID, yearID, W FROM managers
  WHERE playerID=='lemonbo01'
  ORDER BY W DESC;

Project 10


Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data.

Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG, MIN, and MAX.

Scope: SQL, sqlite

Learning objectives:

  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
  • Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.

Dataset

The following questions will use the dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at scholar-db.rcac.purdue.edu. As in Project 9, this is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

Important note: For this project all solutions should be done using R code chunks, and the RMariaDB package. Run the following code to load the library:

library(RMariaDB)
1. Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu and, rather than navigating to the terminal as we did in the previous project, create a connection to our MariaDB lahman database using the RMariaDB package in R and the credentials below. Confirm the connection by running the following code chunk:
con <- dbConnect(RMariaDB::MariaDB(), 
                 host="scholar-db.rcac.purdue.edu", 
                 db="lahmandb", 
                 user="lahman_user", 
                 password="HitAH0merun")

head(dbGetQuery(con, "SHOW tables;"))

Hint: In the example provided, the variable con from the dbConnect function is the connection. Each query that you make, using the dbGetQuery, needs to use this connection con. You can change the name con if you want to (it is user defined), but if you change the name con, you need to change it on all of your connections. If your connection to the database dies while you are working on the project, you can always re-run the dbConnect line again, to reset your connection to the database.

Relevant topics: RMariaDB, dbConnect, dbGetQuery

Item(s) to submit:

  • R code used to solve the problem.
  • Output from running your (potentially modified) head(dbGetQuery(con, "SHOW tables;")).
Solution
library(RMariaDB)
host <- "scholar-db.rcac.purdue.edu"
dbname <- "lahmandb"
user <- "lahman_user"
password <- "HitAH0merun"
con <- dbConnect(MariaDB(),host=host,
                 dbname=dbname,
                 user=user,
                 password=password)
head(dbGetQuery(con, "SHOW tables;"))
2. How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (SB) and hit at least 40 home runs (HR) in one year.

Hint: Use the batting table.

Important note: You only need to run library(RMariaDB) and the dbConnect portion of the code a single time towards the top of your project. After that, you can simply reuse your connection con to run queries.

Important note: In our project template, for this project, make all of the SQL queries using the dbGetQuery function, which returns the results directly in R. Therefore, your RMarkdown blocks for this project should all be {r} blocks (as opposed to the {sql} blocks used in Project 9).

Hint: You can use dbGetQuery to run your queries from within R. Example:

dbGetQuery(con, "SELECT * FROM batting LIMIT 5;")

Note: We already demonstrated the correct SQL query to use for the 40/40 club in this video, but now we want you to use RMariaDB to solve this query:

Click here for video

Relevant topics: dbGetQuery, AND/OR, DISTINCT, COUNT

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con,"SELECT DISTINCT COUNT(*) FROM batting WHERE HR>=40 AND SB>=40;")
3. Find Corey Kluber's lifetime totals across his career (i.e., use SUM in SQL to summarize his achievements) in two categories: strikeouts (SO) and walks (BB). Also display his Strikeouts to Walks ratio. A Strikeouts to Walks ratio is calculated by this equation: \(\frac{Strikeouts}{Walks}\).

Click here for video

Important note: Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use sum in R rather than SUM in SQL.

Hint: Use the people table to find the playerID and use the pitching table to find the statistics.

Relevant topics: dbGetQuery, SUM, SELECT, FROM, WHERE

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con, "SELECT playerID FROM people WHERE nameFirst ='Corey' AND nameLast = 'Kluber';")
dbGetQuery(con, "SELECT SUM(SO), SUM(BB), SUM(SO)/SUM(BB) FROM pitching WHERE playerID ='klubeco01';")
4. How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"?

Hint: Use the people table to find the playerID and use the batting table to find the statistics.

Relevant topics: dbGetQuery, AND/OR, SUM

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con, "SELECT playerID FROM people WHERE nameFirst ='Giancarlo' AND nameLast = 'Stanton';")
dbGetQuery(con, "SELECT COUNT(*), SUM(SO) FROM batting WHERE playerID = 'stantmi03' AND (teamID = 'MIA' OR teamID = 'FLO');")
5. The Batting Average is a metric for a batter's performance. The Batting Average in a year is calculated by \(\frac{H}{AB}\) (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to playerID, teamID, and yearID.

Hint: Use the batting table.

Relevant topics: dbGetQuery, ORDER BY, BETWEEN

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con,"SELECT playerID, teamID, yearID, H, AB, H/AB FROM batting
  WHERE yearID BETWEEN 2000 AND 2010 AND AB>300
  ORDER BY H/AB DESC LIMIT 5;")
6. How many unique players have hit > 50 home runs (HR) in a season?

Hint: Use the batting table.

Hint: Rather than viewing DISTINCT as being paired with SELECT, think of it as being paired with one of the fields you are selecting.

Relevant topics: dbGetQuery, DISTINCT, COUNT

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con, "SELECT COUNT(DISTINCT playerID) FROM batting WHERE HR>50;")
# As you can see, DISTINCT is with the field playerID, not with SELECT. 
# The following query should help distinguish this.
dbGetQuery(con, "SELECT COUNT(DISTINCT playerID), COUNT(playerID) FROM batting WHERE HR>50;")
7. Find the number of unique players that attended Purdue University. Start by finding the schoolID for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic.

Hint: Use the schools table to get the schoolIDs, and the collegeplaying table to get the statistics.

Hint: You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates.

Hint: To find IU's id, try the following query: SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';. You can find more about the LIKE clause and % here.

Relevant topics: dbGetQuery, plotting in R, DISTINCT, COUNT

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con, "SELECT schoolID FROM schools WHERE name_full='Purdue University';")
dbGetQuery(con, "SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';")
dbGetQuery(con,"SELECT COUNT(DISTINCT playerID) FROM collegeplaying WHERE schoolID ='purdue';")
dbGetQuery(con,"SELECT COUNT(DISTINCT playerID) FROM collegeplaying WHERE schoolID ='indiana';")
purdue <- length(unique(dbGetQuery(con,"SELECT playerID FROM collegeplaying WHERE schoolID ='purdue';"))$playerID)
iu <- length(unique(dbGetQuery(con,"SELECT playerID FROM collegeplaying WHERE schoolID ='indiana';"))$playerID)
barplot(c(purdue, iu), names.arg=c("Purdue", "IU"))
barplot(log(c(purdue, iu)), main="Pro baseball players: Indiana vs. Purdue", col=c("#990000","#CFB53B"), names.arg=c("Purdue", "IU"), cex.names=.5, ylab="Log count")

Project 11


Motivation: Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.

Context: We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset!

Scope: SQL, SQL in R

Learning objectives:

  • Demonstrate the ability to interact with popular database management systems within R.
  • Solve data-driven problems using a combination of SQL and R.
  • Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.
  • Showcase the ability to filter, alias, and write subqueries.
  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.

Dataset

The following questions will use the elections database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar:

/class/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be /class/datamine/data/election/itcont1980.txt)

A public sample of the data can be found here:

https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt)

Questions

Important note: For this project you will need to connect to the database elections using the RMariaDB package in R. Include the following code chunk in the beginning of your RMarkdown file:

```{r setup-database-connection}
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(), 
                 host="scholar-db.rcac.purdue.edu", 
                 db="elections", 
                 user="elections_user", 
                 password="Dataelect!98")
```

When a question involves SQL queries in this project, you may use a SQL code chunk (with {sql}), or an R code chunk (with {r}) and functions like dbGetQuery as you did in Project 10. Please refer to Question 5 in the project template for examples.

1. Approximately how large was the lahman database (use the sqlite database in Scholar: /class/datamine/data/lahman/lahman.db)? Use UNIX utilities you've learned about this semester to write a line of code to return the size of that .db file (in MB).
The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder /class/datamine/data/election/. How much data (in MB) is there?
The data in that folder has been added to the elections database, all aggregated in the elections table. Write a SQL query that returns the number of rows in the elections table. How many rows of data are in the table elections?

Note: These are some examples of how to get the sizes of collections of files in UNIX.

Click here for video

Hint: The SQL query will take some time! Be patient.

Note: You may use more than one code chunk in your RMarkdown file for the different tasks.

Note: We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from ls and du to match, use the --apparent-size option with du.

Note: A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See here for more information. For this question, either solution will be given full credit. This is a potentially useful example.
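For example, here is a minimal sketch of both approaches using GNU ls and du (the paths are the ones given in this question; reporting MB or MiB is equally acceptable per the note above):

ls -l /class/datamine/data/lahman/lahman.db              # shows the size of the single .db file, in bytes
du -sh --apparent-size /class/datamine/data/election/    # human-readable total for the whole folder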

Relevant topics: SQL, SQL in R, awk, ls, du

Item(s) to submit:

  • Line of code (bash/awk) to show the size (in MB) of the lahman database file.
  • Approximate size of the lahman database in MB.
  • Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in /class/datamine/data/election.
  • The size of the elections data in MB.
  • SQL query used to find the number of rows of data in the elections table in the elections database.
  • The number of rows in the elections table in the elections database.
Solution
# total size (in MB) of the elections data files (run from /class/datamine/data/election)
ls -la *.txt | awk '{ total += $4; }END{print total/1000000}'
# number of rows in the elections table
SELECT COUNT(*) FROM elections;
2. Write a SQL query using the LIKE command to find a unique list of zip_codes that start with "479".
Write another SQL query and answer: How many unique zip_codes are there that begin with "479"?

Note: Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project.

Click here for video

Hint: The first query returns a list of zip codes, and the second returns a count.

Hint: Make sure you only select zip_codes.

Relevant topics: SQL, LIKE

Item(s) to submit:

  • SQL queries used to answer the question.
  • The first 5 results from running the query.
Solution
SELECT DISTINCT zip_code FROM elections WHERE zip_code LIKE '479%' LIMIT 5;
SELECT COUNT(DISTINCT zip_code) FROM elections WHERE zip_code LIKE '479%';
3. Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an alias for our field so it doesn't read COUNT(*) but rather Indiana Donations.

Hint: You may enclose an alias's name in quotation marks (single or double) when the name contains space.

Relevant topics: SQL, WHERE, aliasing

Item(s) to submit:

  • SQL query used to answer the question.
  • The result of the SQL query.
Solution
SELECT COUNT(*) FROM elections WHERE state='IN';
SELECT COUNT(*) AS 'Indiana Donations' FROM elections WHERE state='IN';
4. Rewrite the query in (3) so the result is displayed like: IN: 1234567. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to OH: 1234567. In other words, the state abbreviation should be dynamic, not static.

Note: This video demonstrates how to use CONCAT in a MySQL query.

Click here for video

Hint: Use CONCAT and aliasing to accomplish this.

Hint: Remember, state contains the state abbreviation.

Relevant topics: SQL, aliasing, CONCAT

Item(s) to submit:

  • SQL query used to answer the question.
Solution
SELECT CONCAT(state, ": ", COUNT(*)) AS 'Donations' FROM elections WHERE state='IN';
5. In (2) we wrote a query that returns a unique list of zip_codes that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip_codes starting with "479". What percent of donations in Indiana come from said zip_codes?

Note: This video gives two examples of sub-queries.

Click here for video

Hint: You can simply manually calculate the percent using the count in (2) and (5).

Relevant topics: SQL, aliasing, subqueries, IN

Item(s) to submit:

  • SQL queries used to answer the question.
  • The percentage of donations from Indiana from zip_codes starting with "479".
Solution
SELECT COUNT(*) FROM elections WHERE zip_code IN (SELECT DISTINCT zip_code FROM elections WHERE zip_code LIKE '479%');
6. In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, the number of donations from each state, and the sum of the donations (transaction_amt). Which 5 states gave the most donations (highest count)? Order your result from most to least.

Note: In this video we demonstrate GROUP BY, ORDER BY, DESC, and other aspects of MySQL that might help with this question.

Click here for video

Hint: You may want to create an alias in order to sort.

Relevant topics: SQL, GROUP BY

Item(s) to submit:

  • SQL query used to answer the question.
  • Which 5 states gave the most donations?
Solution
SELECT state, COUNT(*) AS 'cnt', SUM(transaction_amt) AS 'total' FROM elections GROUP BY state ORDER BY cnt DESC;

Project 12


Motivation: Databases are composed of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore, learn about, and practice using joins on a database containing bike trip information from the Bay Area Bike Share.

Context: We've introduced a variety of SQL commands that let you filter and extract information from a database in a systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.

Scope: SQL, sqlite, joins

Learning objectives:

  • Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.
  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.
  • Showcase the ability to filter, alias, and write subqueries.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db

A public sample of the data can be found here.

Important note: For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:

```{r, include=F}
library(RSQLite)
bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db")
```

Each solution should then be placed in a code chunk like this:

```{sql, connection=bikeshare}
SELECT * FROM station LIMIT 5;
```

If you want to use a SQLite-specific function like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

```{sql, connection=bikeshare, eval=F}
SELECT * FROM station LIMIT 5;
```

This will allow the code to be displayed without throwing an error.

There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform all of the other types of joins (RIGHT JOIN, FULL OUTER JOIN). You can see the documentation here: SQL, INNER JOIN, LEFT JOIN
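As a generic reminder of the distinction (using two hypothetical tables a and b joined on a shared key, not tables from this database): an INNER JOIN keeps only the rows that match in both tables, while a LEFT JOIN keeps every row of the left table and fills in NULL where the right table has no match.

SELECT a.id, b.value FROM a INNER JOIN b ON b.a_id = a.id;
SELECT a.id, b.value FROM a LEFT JOIN b ON b.a_id = a.id;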

Questions

1. Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the sqlite concatenate operator || (see here). Write a query that returns the first 5 records of information from the station table formatted in the following way:

(id) name @ (lat, long)

For example:

(84) Ryland Park @ (37.342725, -121.895617)

Hint: Here is a video about how to concatenate strings in SQLite.

Click here for video

Relevant topics: aliasing, concat

Item(s) to submit:

  • SQL query used to solve this problem.
  • The first 5 records of information from the station table.
Solution
SELECT '('||id||') '||name||' @ ('||lat||','||long||')' FROM station LIMIT 5;
2. There is a variety of interesting weather information in the weather table. Write a query that finds the average mean_temperature_f by zip_code. Which is on average the warmest zip_code?

Use aliases to format the result in the following way:

Zip Code|Avg Temperature
94041|61.3808219178082

Note that this is the output if you use sqlite in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly.

Hint: Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT

Click here for video

Relevant topics: aliasing, GROUP BY, AVG

Item(s) to submit:

  • SQL query used to solve this problem.
  • The results of the query copy and pasted.
Solution
SELECT zip_code AS 'Zip Code', AVG(mean_temperature_f) AS 'Avg Temperature' FROM weather GROUP BY zip_code;
Solution
SELECT COUNT(DISTINCT zip_code) FROM trip;
SELECT zip_code, COUNT(zip_code) FROM trip GROUP BY zip_code;
4. In (2) we wrote a query that finds the average mean_temperature_f by zip_code. What if we want to tack on our results in (2) to information from each row in the station table based on the zip_codes? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables.

Hint: Use the query from (2) as a sub query within your solution.

Hint: Here is a video about JOIN and LEFT JOIN.

Click here for video

Relevant topics: INNER JOIN, subqueries, aliasing

Item(s) to submit:

  • SQL query used to solve this problem.
Solution
SELECT * FROM station as s INNER JOIN (SELECT zip_code AS 'Zip Code', AVG(mean_temperature_f) AS 'Avg Temperature' FROM weather GROUP BY zip_code) AS sub ON s.zip_code=sub.'Zip Code';
5. In (3) we alluded to the fact that the zip_codes in the trip table aren't very consistent. Users can enter a zip code when using the app. This means that zip_code can be from anywhere in the world! With that being said, if the zip_code is one of the 5 zip_codes for which we have weather data (from question 2), we can add that weather information to matching rows of the trip table. In (4) we used an INNER JOIN to append some weather information to each row in the station table. For this question, write a query that performs an INNER JOIN and appends weather data from the weather table to the trip data from the trip table. Limit your output to 5 lines.

Important note: Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the trip table, you can use start_date for the date information.

Hint: You will want to wrap your dates and datetimes in sqlite's date function prior to comparison.

Relevant topics: INNER JOIN, aliasing

Item(s) to submit:

  • SQL query used to solve this problem.
  • First 5 lines of output.
Solution
SELECT * FROM trip AS t INNER JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date) LIMIT 5;

Project 13


Motivation: Databases you will work with won't necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like primary key, and foreign key will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice!

Context: We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you've already learned, with a focus on subqueries and joins.

Scope: SQL, sqlite

Learning objectives:

  • Write and run SQL queries in sqlite on real-world data.
  • Identify primary and foreign keys in a SQL table.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/movies_and_tv/imdb.db

Important note: For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd:

```{r, include=F}
library(RSQLite)
imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
```

If you want to use a SQLite-specific function like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/movies_and_tv/imdb.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

```{sql, connection=imdb, eval=F}
SELECT * FROM titles LIMIT 5;
```

This will allow the code to be displayed without throwing an error.

Questions

1. A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the titles table. Do you think there are any primary keys? How about foreign keys? Now examine the episodes table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?

Hint: A primary key can also be a foreign key.

Hint: Here are two videos. The first video will remind you how to find the names of all of the tables in the imdb database. The second video will introduce you to the titles and episodes tables in the imdb database.

Click here for video

Click here for video

Relevant topics: primary key, foreign key

Item(s) to submit:

  • List any primary or foreign keys in the titles table.
  • List any primary or foreign keys in the episodes table.
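If you want to see the columns of each table before reasoning about the keys, a minimal starting point (not a full answer) is to list the fields through the imdb connection set up above, or to run PRAGMA table_info(...) in the sqlite3 terminal:

# list the column names of each table through the RSQLite connection
dbListFields(imdb, "titles")
dbListFields(imdb, "episodes")

# in the sqlite3 terminal, the equivalent is:
#   PRAGMA table_info(titles);
#   PRAGMA table_info(episodes);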
2. If you paste a title_id to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series Grey's Anatomy. Write a SQL query to confirm that the title_id tt0413573 does indeed belong to Grey's Anatomy. Then browse imdb.com and find your favorite TV show. Get the title_id from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database:
SELECT * FROM titles WHERE title_id='<title id here>';

Make sure to replace "<title id here>" with the title_id of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons.

Relevant topics: SELECT, WHERE

Item(s) to submit:

  • SQL query used to confirm that title_id tt0413573 does indeed belong to Grey's Anatomy.
  • The output of the query.
  • The title_id of your favorite TV show.
  • SQL query used to confirm the title_id for your favorite TV show.
  • The output of the query.
3. The episode_title_id column in the episodes table references titles of individual episodes of a TV series. The show_title_id references the titles of the show itself. With that in mind, write a query that gets a list of all of the episode_title_ids (found in the episodes table), with the associated primary_title (found in the titles table) for each episode of Grey's Anatomy.

Relevant topics: INNER JOIN

Hint: This video shows how to extract titles of episodes in the imdb database.

Click here for video

Item(s) to submit:

  • SQL query used to solve the problem in a code chunk.
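This is not the official solution, just a sketch of one possible shape for the join, assuming the column names described above (episode_title_id and show_title_id in episodes, title_id and primary_title in titles); verify them against the actual schema first:

SELECT e.episode_title_id, t.primary_title
FROM episodes AS e
INNER JOIN titles AS t ON e.episode_title_id = t.title_id
WHERE e.show_title_id = 'tt0413573';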
4. We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of episode_title_ids (found in the episodes table), with the associated primary_title (found in the titles table) for each episode.

Hint: This part is just like question (3) but this time with your favorite TV show, which you chose in (2).

Hint: This video shows how to use a subquery, to JOIN a total of three tables in the imdb database.

Click here for video

Relevant topics: INNER JOIN, aliasing

Item(s) to submit:

  • SQL query used to solve the problem in a code chunk.
  • The first 5 results from your query.
5. Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode?

Relevant topics: INNER JOIN, aliasing, subqueries, ORDER BY, DESC, LIMIT

Note: Various helpful examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL query used to solve the problem in a code chunk.
  • The episode_title_id, primary_title, and rating of the top rated episode from your favorite TV series, in question (2).
  • A statement saying whether the highest rated episode is also your favorite episode.
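As a rough sketch only: if (and this is an assumption to verify with .tables or PRAGMA table_info) the database stores ratings in a ratings table keyed by title_id, the query from (4) can be reused as a subquery and the result sorted by rating. The title_id below is the Grey's Anatomy example from (2); swap in the title_id of your favorite show:

SELECT sub.episode_title_id, sub.primary_title, r.rating
FROM (SELECT e.episode_title_id, t.primary_title
      FROM episodes AS e
      INNER JOIN titles AS t ON e.episode_title_id = t.title_id
      WHERE e.show_title_id = 'tt0413573') AS sub
INNER JOIN ratings AS r ON r.title_id = sub.episode_title_id
ORDER BY r.rating DESC LIMIT 1;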

Project 14


Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.

Context: We've just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting by writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.

Scope: awk, UNIX utilities, bash scripts, fread

Learning objectives:

  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
  • Analyzing files in a UNIX filesystem: wc, du, cat, head, tail, etc.
  • Creating and destroying files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
  • Use grep to search files effectively.
  • Use cut to section off data from the command line.
  • Use piping to string UNIX commands together.
  • Use awk for data extraction, and preprocessing.
  • Create bash scripts to automate a process or processes.

Dataset

The following questions will use PLOTSNAP.csv from the data folder found in Scholar:

/class/datamine/data/forest

To read more about PLOTSNAP.csv that you will be working with:

https://www.uvm.edu/femc/data/archive/project/federal-forest-inventory-analysis-data-for/dataset/plot-level-data-gathered-through-forest/metadata#fields

Questions

1. Take a look at PLOTSNAP.csv. Write a line of awk code that displays the STATECD followed by the number of rows with that STATECD.

Relevant topics: awk

Item(s) to submit:

  • Code used to solve the problem.
  • Count of the following STATECDs: 1, 2, 4, 5, 6
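One possible starting point (not the only way to do this) uses an awk associative array. The $1 below assumes STATECD is the first column; check the header row with head -n1 PLOTSNAP.csv and adjust the field number if it lives elsewhere:

# count the rows for each STATECD (skipping the header row)
awk -F, 'NR>1 {counts[$1]++} END {for (s in counts) print s, counts[s]}' /class/datamine/data/forest/PLOTSNAP.csv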
2. Unfortunately, there isn't a very accessible list available that shows which state each STATECD represents. This is no problem for us though, the dataset has LAT and LON! Write some bash that prints just the STATECD, LAT, and LON.

Note: There are 92 columns in our dataset: awk -F, 'NR==1{print NF}' PLOTSNAP.csv. To create a list of STATECD to state, we only really need STATECD, LAT, and LON. Keeping the other 89 variables will keep our data at 2.1 GB.

Relevant topics: cut, awk

Item(s) to submit:

  • Code used to solve the problem.
  • The output of your code piped to head.
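A minimal sketch: first locate the column numbers of STATECD, LAT, and LON from the header row, then keep only those columns with cut. The field numbers 1,12,13 below are placeholders; replace them with the positions the first command prints:

# print each header name next to its column number, keeping only the three we care about
head -n1 /class/datamine/data/forest/PLOTSNAP.csv | tr ',' '\n' | cat -n | grep -wE 'STATECD|LAT|LON'

# keep only those three columns (replace 1,12,13 with the real positions)
cut -d, -f1,12,13 /class/datamine/data/forest/PLOTSNAP.csv | head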
4. We are going to further understand the data from question (3) by finding the actual locations based on the LAT and LON columns. We can use the library revgeo to get a location given a pair of longitude and latitude values. revgeo uses a free API hosted by photon in order to do so.
For example:
library(revgeo)
revgeo(longitude=-86.926153, latitude=40.427055, output='frame')
The code above will give you the address information in six columns, from the most-granular housenumber to the least-granular country. Depending on the coordinates, revgeo may or may not give you results for each column. For this question, we are going to keep only the state column.
There are over 4 million rows in our dataset -- we do not want to hit photon's API that many times. Instead, we are going to do the following:
First: Unless you feel comfortable using data.table, convert your data.table to a data.frame:
my_dataframe <- data.frame(my_datatable)
Second: Calculate the average LAT and LON for each STATECD, and call the new data.frame, dat. This should result in 57 rows of lat/long pairs.
Third: For each row in dat, run a reverse geocode and append the state to a new column called STATE.

Hint: To calculate the average LAT and LON for each STATECD, you could use the sqldf package to run SQL queries on your data.frame.

Hint: mapply is a useful apply function to use to solve this problem.

Hint: Here is some extra help:

library(revgeo)

points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866))

# Note that the "output" argument gets passed to the "revgeo" function.
mapply(revgeo, points$longitude, points$latitude, output="frame")

# The output isn't in a great format, and we'd prefer to just get the "state" data.
# Let's wrap "revgeo" into another function that just gets "state" and try again.
get_state <- function(lon, lat) {
  return(revgeo(lon, lat, output="frame")["state"])
}

mapply(get_state, points$longitude, points$latitude)

Important note: It is okay to get "Not Found" for some of the addresses.

Relevant topics: apply functions, sqldf

Item(s) to submit:

  • Code used to solve the problem.
  • The head of the resulting data.frame.
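Here is one rough sketch of steps two and three, building on the hints above. It assumes statecd_df is the data.frame from question (3) holding the STATECD, LAT, and LON columns (that name is just a placeholder), and that you are comfortable hitting photon's API 57 times:

library(sqldf)
library(revgeo)

# step two: average LAT and LON for each STATECD (one row per STATECD)
dat <- sqldf("SELECT STATECD, AVG(LAT) AS LAT, AVG(LON) AS LON FROM statecd_df GROUP BY STATECD")

# step three: reverse geocode each averaged coordinate pair and keep only the state
get_state <- function(lon, lat) {
  revgeo(lon, lat, output="frame")["state"]
}
dat$STATE <- unlist(mapply(get_state, dat$LON, dat$LAT))

head(dat)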

Project 15


Motivation: We've done a lot of work with SQL this semester. Let's review concepts in this project and mix and match R and SQL to solve data-driven problems.

Context: In this project, we will reinforce topics you've already learned, with a focus on SQL.

Scope: SQL, sqlite, R

Learning objectives:

  • Write and run SQL queries in sqlite on real-world data.
  • Use SQL from within R.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/movies_and_tv/imdb.db

Questions

1. What is the first year where our database has > 1000 titles? Use the premiered column in the titles table as our year. What year has the most titles?

Hint: There could be missing values in premiered. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL here.

Relevant topics: COUNT, GROUP BY, ORDER BY, DESC, HAVING, NULL

Item(s) to submit:

  • SQL queries used to answer the questions.
  • What year is the first year to have > 1000 titles?
  • What year has the most titles?
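If you get stuck, here is a sketch of two queries that follow the relevant topics listed above (run through the same imdb connection used in Project 13, or in the sqlite3 terminal):

-- first year with more than 1000 titles
SELECT premiered, COUNT(*) AS n FROM titles
WHERE premiered IS NOT NULL
GROUP BY premiered HAVING COUNT(*) > 1000
ORDER BY premiered LIMIT 1;

-- year with the most titles
SELECT premiered, COUNT(*) AS n FROM titles
WHERE premiered IS NOT NULL
GROUP BY premiered
ORDER BY n DESC LIMIT 1;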
2. How many unique types are there in the titles table, and what are they? For the year found in question (1) with the most titles, how many titles of each type are there?

Item(s) to submit:

  • SQL queries used to answer the questions.
  • How many and what are the unique types from the titles table?
  • A list of type and count for the year (premiered) that had the most titles.
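A possible sketch, using a subquery so you do not have to hard-code the year from question (1):

SELECT COUNT(DISTINCT type) FROM titles;
SELECT DISTINCT type FROM titles;

SELECT type, COUNT(*) AS n FROM titles
WHERE premiered = (SELECT premiered FROM titles
                   WHERE premiered IS NOT NULL
                   GROUP BY premiered
                   ORDER BY COUNT(*) DESC LIMIT 1)
GROUP BY type;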
3. Write a query that gets the episode_title_id, primary_title, rating, and votes, of all of the episodes of Friends (title_id is tt0108778).

Hint: You can slightly modify the solution to question (5) in project 13.

Relevant topics: INNER JOIN, subqueries, aliasing

Item(s) to submit:

  • SQL query used to answer the question.
  • First 5 results of the query.
4. Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to use regular expressions to search for text. For each character, how many of the episodes' primary_title values contain their name?

Relevant topics: SQL in R, grep

Item(s) to submit:

  • R code in a code chunk that was used to find the solution.
  • The solution pasted below the code chunk.
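A minimal sketch of one way to do this in R. The join below reuses the (assumed) episodes/titles column names from Project 13, and the character names are simply the six main Friends characters; adjust either as needed:

library(RSQLite)
imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")

# pull the primary_title of every Friends episode into a data.frame
friends <- dbGetQuery(imdb, "SELECT e.episode_title_id, t.primary_title
                             FROM episodes AS e
                             INNER JOIN titles AS t ON e.episode_title_id = t.title_id
                             WHERE e.show_title_id = 'tt0108778'")

# count how many episode titles mention each character
characters <- c("Monica", "Ross", "Rachel", "Chandler", "Joey", "Phoebe")
counts <- sapply(characters, function(ch) sum(grepl(ch, friends$primary_title)))
counts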
5. Create a graphic showing our results in (4) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.

Relevant topics: plotting

Item(s) to submit:

  • The R code used to generate the graphic.
  • The graphic in a png or jpg/jpeg format.
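For example, sticking with base R, a barplot of the counts vector from the sketch in (4) might look like the following (the title and labels are yours to improve):

mycolors <- c("#273c8b", "#bd253a", "#016f7c", "#f56934", "#016c5a", "#9055b1")
barplot(counts, col=mycolors,
        main="Friends episode titles that mention each main character",
        xlab="Character", ylab="Number of episode titles")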
6. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see type column from titles table): Romance, Comedy, Animation. In the titles table, you can find the genres in the genres column. There may be some overlap (i.e. a movie may have more than one genre), this is ok.

To query rows which have the genre Action as one of its genres:

SELECT * FROM titles WHERE genres LIKE '%action%';

Relevant topics: LIKE, INNER JOIN

Item(s) to submit:

  • Any code you used to solve the problem in a code chunk.
  • The average rating of each of the genres listed for movies.
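One hedged sketch that loops the SQL over the three genres from R, using the imdb connection from (4). It assumes (as in the earlier sketches) that ratings live in a ratings table keyed by title_id, and that movies have type = 'movie' in the titles table; check both assumptions before relying on the numbers:

genres <- c("Romance", "Comedy", "Animation")
avg_ratings <- sapply(genres, function(g) {
  q <- paste0("SELECT AVG(r.rating) FROM titles AS t ",
              "INNER JOIN ratings AS r ON t.title_id = r.title_id ",
              "WHERE t.type = 'movie' AND t.genres LIKE '%", g, "%'")
  dbGetQuery(imdb, q)[[1]]
})
avg_ratings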

STAT 39000

Project 1


Motivation: In this project we will jump right into an R review, breaking one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.

Context: We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we've previously learned.

Scope: data wrangling in R, functions

Learning objectives:

  • Comprehend what a function is, and the components of a function in R.
  • Read and write basic (csv) data.
  • Utilize apply functions in order to solve a data-driven problem.

Make sure to read about, and use the template found here, and the important information about project submissions here.

You can find useful examples that walk you through relevant material in The Examples Book:

https://thedatamine.github.io/the-examples-book

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

Important note: It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials.

We decided to move away from ThinLinc and away from the version of RStudio used last year (https://desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.

Remember the very useful documentation shortcut ?. To use, simply type ? in the console, followed by the name of the function you are interested in.

You can also look for package documentation by using help(package=PACKAGENAME), so for example, to see the documentation for the package ggplot2, we could run:

help(package=ggplot2)

Sometimes it can be helpful to see the source code of a defined function. A function is any chunk of organized code that is used to perform an operation. Source code is the underlying R or C or C++ code that is used to create the function. To see the source code of a defined function, type the function's name without the (). For example, if we were curious about what the function Reduce does, we could run:

Reduce

Occasionally this will be less useful, as the resulting code will be code that calls C code we can't see. Other times it will allow you to understand the function better.

Dataset:

/class/datamine/data/airbnb

Oftentimes (maybe even the majority of the time) data doesn't come in one nice file or database. Explore the datasets in /class/datamine/data/airbnb.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

1. You may have noted that, for each country, city, and date we can find 3 files: calendar.csv.gz, listings.csv.gz, and reviews.csv.gz (for now, we will ignore all files in the "visualisations" folders).
Let's take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (calendar.csv.gz, listings.csv.gz, and reviews.csv.gz). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.

Hint: read.csv has an argument to select the number of rows we want to read.

Hint: Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown. So you do not need to display the first 50 rows of the listings and/or reviews, in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.

Item(s) to submit:

  • Chunk of code used to read the first 50 rows of each dataset.
  • 1-2 sentences briefly describing the information contained in each dataset.
  • Name(s) of variable(s) that could be used to join them.

To read a compressed csv, simply use the read.csv function:

dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
head(dat)

Let's work towards getting this data into an easier format to analyze. From now on, we will focus on the listings.csv.gz datasets.

Solution

The calendar.csv.gz file for 2019-07-08 in Hawaii describes the listing_id, date, available (t or f), price, adjusted_price, minimum_nights, and maximum_nights

hawaii_calendar <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/calendar.csv.gz")
head(hawaii_calendar, n=50)

The listings.csv.gz file for 2019-07-08 in Hawaii has 106 variables, which describe the very specific attributes of the airbnb listings.

hawaii_listings <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/listings.csv.gz")
dim(hawaii_listings)

The reviews.csv.gz file for 2019-07-08 in Hawaii describes the listing_id, id, date, reviewer_id, reviewer_name, and comments

hawaii_reviews <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/reviews.csv.gz")
head(hawaii_reviews, n=50)

The variables that appear in more than one of the tables (and so might be used to join them) are: date, id, listing_id, maximum_nights, minimum_nights, price

t <- table(c(names(hawaii_calendar), names(hawaii_listings), names(hawaii_reviews)))
t[t > 1]
2. Write a function called get_paths_for_country, that, given a string with the country name, returns a vector with the full paths for all listings.csv.gz files, starting with /class/datamine/data/airbnb/....
For example, the output from get_paths_for_country("united-states") should have 28 entries. Here are the first 5 entries in the output:
 [1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"       
 [2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"           
 [3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"     
 [4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"         
 [5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"     

Hint: list.files is useful with the recursive=T option.

Hint: Use grep to search for the pattern listings.csv.gz (within the results from the first hint), and use the option value=T to display the values found by the grep function.

Item(s) to submit:

  • Chunk of code for your get_paths_for_country function.
Solution

We extract all 28 of the listings for the United States first:

myprefix <- "/class/datamine/data/airbnb/united-states/"
paste0(myprefix, grep("listings.csv.gz", list.files(myprefix, recursive=T), value=T))

Now we build a function that can do the same thing, for any country

get_paths_for_country <- function(mycountry) {
  myprefix <- paste0("/class/datamine/data/airbnb/", mycountry, "/")
  paste0(myprefix, grep("listings.csv.gz", list.files(myprefix, recursive=T), value=T))
}

and we test this for several countries:

get_paths_for_country("united-states")
get_paths_for_country("brazil")
get_paths_for_country("south-africa")
get_paths_for_country("canada")
3. Write a function called get_data_for_country that, given a string with the country name, returns a data.frame containing all of the listings data for that country. Use your previously written function to help you.

Hint: Use stringsAsFactors=F in the read.csv function.

Hint: Use do.call(rbind, <listofdataframes>) to combine a list of dataframes into a single dataframe.

Relevant topics: rbind, lapply, function

Item(s) to submit:

  • Chunk of code for your get_data_for_country function.
Solution

We first get the data from the Canada entries. To do this, we sapply the read.csv function to each of the 6 results from get_paths_for_country("canada"). In other words, we read in these 6 data frames.

myresults <- sapply(get_paths_for_country("canada"), read.csv, stringsAsFactors=F, simplify=F)

We get a list of 6 data frames:

length(myresults)
class(myresults)
class(myresults[[1]])
class(myresults[[6]])

and we can check the dimensions of each of the 6 data frames

dim(myresults[[1]])
dim(myresults[[2]])
dim(myresults[[3]])
dim(myresults[[4]])
dim(myresults[[5]])
dim(myresults[[6]])

this is more easily accomplished with another sapply:

sapply(myresults, dim)

We can rbind all 6 of these data frames into one big data frame as follows:

bigDF <- do.call(rbind, myresults)
class(bigDF)
dim(bigDF)

Now we create the desired function called get_data_for_country

get_data_for_country <- function(mycountry) {
  myresults <- sapply(get_paths_for_country(mycountry), read.csv, stringsAsFactors=F, simplify=F)
  do.call(rbind, myresults)
}

and we test it on Canada.

mynewbigDF <- get_data_for_country("canada")

The result has the same size as before

dim(mynewbigDF)
4. Use your get_data_for_country to get the data for a country of your choice, and make sure to name the data.frame listings. Take a look at the following columns: host_is_superhost, host_has_profile_pic, host_identity_verified, and is_location_exact. What is the data type for each column? (You can use class or typeof or str to see the data type.)
These columns would make more sense as logical values (TRUE/FALSE/NA).
Write a function called transform_column that, given a column containing lowercase "t"s and "f"s, transforms it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (""), and we need to be careful when transforming the data. Test your function on column host_is_superhost.

Relevant topics: class, typeof, str, toupper, as.logical

Item(s) to submit:

  • Chunk of code for your transform_column function.
  • Type of transform_column(listings$host_is_superhost).
Solution

These 4 columns from mynewbigDF (which has the data for Canada) only have values "t", "f", ""

head(mynewbigDF$host_is_superhost)
head(mynewbigDF$host_has_profile_pic)
head(mynewbigDF$host_identity_verified)
head(mynewbigDF$is_location_exact)

Please note the 44 values of "" (which are easy to miss) in the first 3 of these 4 columns:

table(mynewbigDF$host_is_superhost)
table(mynewbigDF$host_has_profile_pic)
table(mynewbigDF$host_identity_verified)
table(mynewbigDF$is_location_exact)

These are all character vectors, which we can check using class, typeof, or str:

class(mynewbigDF$host_is_superhost)
class(mynewbigDF$host_has_profile_pic)
class(mynewbigDF$host_identity_verified)
class(mynewbigDF$is_location_exact)

typeof(mynewbigDF$host_is_superhost)
typeof(mynewbigDF$host_has_profile_pic)
typeof(mynewbigDF$host_identity_verified)
typeof(mynewbigDF$is_location_exact)

str(mynewbigDF$host_is_superhost)
str(mynewbigDF$host_has_profile_pic)
str(mynewbigDF$host_identity_verified)
str(mynewbigDF$is_location_exact)

We have several ways to transform a column. For example, we could go element-by-element, and make substitutions, like this:

v <- mynewbigDF$host_is_superhost

Here is the way that the values look at the start:

table(v)
v[toupper(v)=="T"] <- TRUE
v[toupper(v)=="F"] <- FALSE
v[toupper(v)==""] <- NA

and here are the values now:

table(v)

You might think that the NA values disappeared, but they just do not show up in the table by default. You can force them to appear, and then we see that the counts of the three values are the same as before.

table(v, useNA="always")

Here is the function:

transform_column <- function(v) {
  v[toupper(v)=="T"] <- TRUE
  v[toupper(v)=="F"] <- FALSE
  v[toupper(v)==""] <- NA
  # at this point v is still a character vector ("TRUE"/"FALSE"/NA),
  # so convert it to an actual logical vector before returning
  as.logical(v)
}

We can try the function on mynewbigDF$host_is_superhost:

head(transform_column(mynewbigDF$host_is_superhost))
table(transform_column(mynewbigDF$host_is_superhost))

Another possibility is to make a map, in which we put the old values as the names, and the new values as the values in the vector:

mymap <- c(TRUE, FALSE, NA)
names(mymap) <- c("T", "F", "")
head(mymap[toupper(mynewbigDF$host_is_superhost)])

and if you do not want the names to appear on the vector, you can remove them, like this:

head(unname(mymap[toupper(mynewbigDF$host_is_superhost)]))

Finally we can check the table of the results:

table(mymap[toupper(mynewbigDF$host_is_superhost)])

This might seem strange, and if you do not like it, you can just use the solution given above. If you do like this, and want to wrap it into a function, we can write:

transform_column <- function(v) {
  mymap <- c(TRUE, FALSE, NA)
  names(mymap) <- c("T", "F", "")
  unname(mymap[toupper(v)])
}

and again we can try this new version of the function on mynewbigDF$host_is_superhost:

head(transform_column(mynewbigDF$host_is_superhost))
table(transform_column(mynewbigDF$host_is_superhost))
5. Create a histogram for response rates (host_response_rate) for super hosts (where host_is_superhost is TRUE). If your listings do not contain any super hosts, load data from a different country. Note that we first need to convert host_response_rate from a character containing "%" signs to a numeric variable.

Relevant topics: gsub, as.numeric

Item(s) to submit:

  • Chunk of code used to answer the question.
  • Histogram of response rates for super hosts.
Solution

Now we look at the Canada results, for which host_is_superhost is TRUE. We make a histogram of the host_response_rate for those values. To do this, we first get the host_response_rate values (for which host_is_superhost is TRUE)

myvalues <- mynewbigDF$host_response_rate[transform_column(mynewbigDF$host_is_superhost) == TRUE]

and then we remove the percentage symbols from myvalues and convert the character vector to numbers, and then finally make the histogram.

hist(as.numeric(gsub("%", "", myvalues)))

As a closing note, we could remove the check to see whether the inner values are TRUE because, by default, we will only extract the TRUE values when we do a lookup like this:

myvalues <- mynewbigDF$host_response_rate[transform_column(mynewbigDF$host_is_superhost)]
hist(as.numeric(gsub("%", "", myvalues)))

Project 2


Motivation: The ability to quickly reproduce an analysis is important. It is often necessary that other individuals will need to be able to understand and reproduce an analysis. This concept is so important there are classes solely on reproducible research! In fact, there are papers that investigate and highlight the lack of reproducibility in various fields. If you are interested in reading about this topic, a good place to start is the paper titled "Why Most Published Research Findings Are False", by John Ioannidis (2005).

Context: Making your work reproducible is extremely important. We will focus on the computational part of reproducibility. We will learn RMarkdown to document your analyses so others can easily understand and reproduce the computations that led to your conclusions. Pay close attention as future project templates will be RMarkdown templates.

Scope: Understand Markdown, RMarkdown, and how to use it to make your data analysis reproducible.

Learning objectives:

  • Use Markdown syntax within an Rmarkdown document to achieve various text transformations.
  • Use RMarkdown code chunks to display and/or run snippets of code.

Questions

Click here for video

1. Make the following text (including the asterisks) bold: This needs to be **very** bold. Make the following text (including the underscores) italicized: This needs to be _very_ italicized.

Important note: Surround your answer in 4 backticks. This will allow you to display the markdown without having the markdown "take effect". For example:

````
Some *marked* **up** text.
````

Hint: Be sure to check out the Rmarkdown Cheatsheet and our section on Rmarkdown in the book.

Note: Rmarkdown is essentially Markdown + the ability to run and display code chunks. In this question, we are actually using Markdown within Rmarkdown!

Relevant topics: rmarkdown, escaping characters

Item(s) to submit: - 2 lines of markdown text, surrounded by 4 backticks. Note that when compiled, this text will be unmodified, regular text.

Solution

We can achieve this style of text:

This needs to be **very** bold

This needs to be _very_ italicized.

by using this Markdown text:

**This needs to be \*\*very\*\* bold**
_This needs to be \_very\_ italicized._

The backslashes specify that we want the asterisks and underscores to appear.

2. Create an unordered list of your top 3 favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another ordered list that ranks your academic interests in order of most interested to least interested.

Hint: You can learn what ordered and unordered lists are here.

Note: Similar to (1), in this question we are dealing with Markdown. If we were to copy and paste the solution to this problem in a Markdown editor, it would be the same result as when we Knit it here.

Relevant topics: rmarkdown

Item(s) to submit: - Create the lists, this time don't surround your code in backticks. Note that when compiled, this text will appear as nice, formatted lists.

Solution

An unordered list of my top 3 favorite academic interests is:

  • asymptotic analysis of sequences
  • data science
  • analysis of algorithms

An ordered list of my top 3 favorite academic interests is:

  1. analysis of algorithms
  2. asymptotic analysis of sequences
  3. data science
3. Browse https://www.linkedin.com/ and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown. Include the following:
  • A header for this section (your choice of size) that says "About".
  • The text of your personal "About" section that you would feel comfortable uploading to linkedin, including at least 1 link.

Relevant topics: rmarkdown

Item(s) to submit: - Create the described profile, don't surround your code in backticks.

Solution
About

I am Professor in Statistics and (by courtesy) of Mathematics and Public Health at Purdue University. My research is in probabilistic, combinatorial, and analytic techniques for the analysis of algorithms and data structures; I am also interested in science of information, game theory, and large-scale computation. I currently serve as

  • Director of The Data Mine
  • Interim Co-Director of the Integrative Data Science Initiative
  • Principal Investigator for the Purdue Statistics Living Learning Community, funded by the National Science Foundation
  • Associate Director for the NSF Center for Science of Information (now a core center in Purdue's Discovery Park)
  • Associate Director of the Actuarial Science Program


4. LaTeX is a powerful editing tool where you can create beautifully formatted equations and formulas. Replicate the equation found here as closely as possible.

Hint: Lookup "latex mid" and "latex frac".

Item(s) to submit:

  • Replicate the equation using LaTeX under the Question 4 header in your template.
Solution

The equation is \[P(A\ |\ B) = \frac{P(B\ |\ A)P(A)}{P(B)}\]

5. Your co-worker wrote a report, and has asked you to beautify it. Knowing Rmarkdown, you agreed. Make improvements to this section. At a minimum:
  • Make the title pronounced.
  • Make all links appear as a word or words, rather than the long-form URL.
  • Organize all code into code chunks where code and output are displayed. If the output is really long, just display the code.
  • Make the calls to the library function be evaluated but not displayed.
  • Make sure all warnings and errors that may eventually occur, do not appear in the final document.

Feel free to make any other changes that make the report more visually pleasing.

```{r my-load-packages}
library(ggplot2)
```

```{r declare-variable-390, eval=FALSE}
my_variable <- c(1,2,3)
```

All About the Iris Dataset

This paper goes into detail about the `iris` dataset that is built into r. You can find a list of built-in datasets by visiting https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html or by running the following code:

data()

The iris dataset has 5 columns. You can get the names of the columns by running the following code:

names(iris)

Alternatively, you could just run the following code:

iris

The second option provides more detail about the dataset.

According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html there is another dataset built-in to r called `iris3`. This dataset is 3 dimensional instead of 2 dimensional.

An iris is a really pretty flower. You can see a picture of one here:

https://www.gardenia.net/storage/app/public/guides/detail/83847060_mOptimized.jpg

In summary. I really like irises, and there is a dataset in r called `iris`.

Relevant topics: rmarkdown

Item(s) to submit:

  • Make improvements to this section, and place it all under the Question 5 header in your template.
Solution

my_variable <- c(1,2,3)

All About the Iris Dataset

This paper goes into detail about the iris dataset that is built into R. You can find a list of built-in datasets by visiting the R datasets index or by running the following code:

data()

The iris dataset has 5 columns. You can get the names of the columns by running the following code:

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Alternatively, you could just run the following code:

iris

The second option provides more detail about the dataset.

According to the R iris help page there is another dataset built-in to R called iris3. This dataset is 3 dimensional instead of 2 dimensional.

An iris is a really pretty flower. You can see a picture of one here:

In summary, I really like irises, and there is a dataset in R called iris.

6. Create a plot using a built-in dataset like iris, mtcars, or Titanic, and display the plot using a code chunk. Make sure the code used to generate the plot is hidden. Include a descriptive caption for the image. Make sure to use an RMarkdown chunk option to create the caption.

Relevant topics: rmarkdown, plotting in r

Item(s) to submit:

  • Code chunk (under the Question 6 header) that creates and displays a plot using a built-in dataset like iris, mtcars, or Titanic.
Solution

Here are the Sepal.Length values from the iris data set.

plot(iris$Sepal.Length, main="Iris Sepal Lengths")
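The solution above shows the plotting code itself; the question, though, asks for the code to be hidden and for the caption to come from a chunk option. A minimal sketch of such a chunk (the label and caption wording are just examples) is:

```{r iris-sepal-plot, echo=FALSE, fig.cap="Sepal lengths of the 150 flowers in the built-in iris dataset."}
plot(iris$Sepal.Length, main="Iris Sepal Lengths", ylab="Sepal length (cm)")
```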
7. Insert the following code chunk under the Question 7 header in your template. Try knitting the document. Two things will go wrong. What is the first problem? What is the second problem?
```{r my-load-packages}
plot(my_variable)
```

Hint: Take a close look at the name we give our code chunk.

Hint: Take a look at the code chunk where my_variable is declared.

Relevant topics: rmarkdown

Item(s) to submit:

  • The modified version of the inserted code that fixes both problems.
  • A sentence explaining what the first problem was.
  • A sentence explaining what the second problem was.
Solution

We needed to change the code chunk label from my-load-packages to a new name, since the document already contains a chunk with that same label (chunk labels must be unique).

In the chunk labeled declare-variable-390, we had eval=FALSE, and as a result, the variable my_variable was never declared. So we removed the eval=FALSE option.

plot(my_variable)
For Project 2, please submit your .Rmd file and the resulting .pdf file. (For this project, you do not need to submit a .R file.)
OPTIONAL QUESTION. RMarkdown is also an excellent tool to create a slide deck. Use the information here or here to convert your solutions into a slide deck rather than the regular PDF. You may experiment with slidy, ioslides, or beamer; however, make your final set of solutions use beamer, as its output is a PDF. Make any needed modifications to make the solutions knit into a well-organized slide deck (for example, include slide breaks and make sure the contents are shown completely). Modify (2) so the bullets are incrementally presented as the slides progress.

Important note: You do not need to submit the original PDF for this project, just the beamer slide version of the PDF.

Relevant topics: rmarkdown

Item(s) to submit:

  • The modified version of the solutions in beamer slide form.

Project 3


Motivation: The ability to navigate a shell, like bash, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful bash tools, help you navigate a filesystem, and even run bash tools from within an RMarkdown file in RStudio.

Context: At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file.

Scope: bash, RStudio

Learning objectives:

  • Distinguish differences in /home, /scratch, and /class.
  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
  • Analyzing files in a UNIX filesystem: wc, du, cat, head, tail, etc.
  • Creating and destroying files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
  • Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.
  • Use man to read and learn about UNIX utilities.
  • Run bash commands from within an RMarkdown file in RStudio.

There are a variety of ways to connect to Scholar. In this class, we will primarily connect to RStudio Server by opening a browser and navigating to https://rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface.

Here is a video to remind you about some of the basic tools you can use in UNIX/Linux:

Click here for video

This is the easiest book for learning this stuff; it is short and gets right to the point:

https://go.oreilly.com/purdue-university/library/view/-/0596002610

You just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip Chapters 2 and 6 the first time through).

It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details.

Solution

Here are some changes Kevin likes:

  • Uncheck "Restore .Rdata into workspace at startup".
  • Change the tab width to 4.
  • Check "Soft-wrap R source files".
  • Check "Highlight selected line".
  • Check "Strip trailing horizontal whitespace when saving".
  • Uncheck "Show margin".
2. There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a bash shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient!
What is the default directory of your bash shell?

Hint: Start by reading the section on man. man stands for manual, and you can find the "official" documentation for the command by typing man <command_of_interest>. For example:

# read the manual for the `man` command
# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down
man man 

Relevant topics: man, pwd, ~, .., .

Item(s) to submit:

  • The full filepath of default directory (home directory). Ex: Kevin's is: /home/kamstut
  • The bash code used to show your home directory or current directory (also known as the working directory) when the bash shell is first launched.
Solution
# whatever is stored in the $HOME environment variable 
# is what ~ represents
cd ~
pwd

# if we change $HOME, ~ changes too!
HOME=/home/kamstut/projects
cd ~
pwd

# if other users on the linux system share certain files or folders
# in their home directory, you can access their home folder similarly
ls ~mdw

# but they _have_ to give you permissions 
3. Learning to navigate away from our home directory to other folders, and back again, is vital. Perform the following actions, in order:
  • Write a single command to navigate to the folder containing our full datasets: /class/datamine/data.
  • Write a command to confirm you are in the correct folder.
  • Write a command to list the files and directories within the data directory. (You do not need to recursively list subdirectories and files contained therein.) What are the names of the files and directories?
  • Write another command to return back to your home directory.
  • Write a command to confirm you are in the correct folder.

Note: / is commonly referred to as the root directory in a linux/unix filesystem. Think of it as a folder that contains every other folder in the computer. /home is a folder within the root directory. /home/kamstut is the full filepath of Kevin's home directory. There is a folder home inside the root directory. Inside home is another folder named kamstut which is Kevin's home directory.

Relevant topics: man, cd, pwd, ls, ~, .., .

Item(s) to submit:

  • Command used to navigate to the data directory.
  • Command used to confirm you are in the data directory.
  • Command used to list files and folders.
  • List of files and folders in the data directory.
  • Command used to navigate back to the home directory.
  • Command used to confirm you are in the home directory.
Solution
# navigate to the data directory using `cd` (change directory)
cd /class/datamine/data

# confirm the location using `pwd` (print working directory)
pwd

# list files
ls

# cd without any arguments automatically returns to the directory
# saved in the $HOME environment variable
cd

# another trick, if you wanted to _quickly_ return to the data 
# directory, or most recent directory is the following (uncommented)
cd ~-

# confirm the location using pwd
pwd
4. Let's learn about two more important concepts. . refers to the current working directory, or the directory displayed when you run pwd. Unlike pwd you can use this when navigating the filesystem! So, for example, if you wanted to see the contents of a file called my_file.txt that lives in /home/kamstut (so, a full path of /home/kamstut/my_file.txt), and you are currently in /home/kamstut, you could run: cat ./my_file.txt.
.. represents the parent folder or the folder in which your current folder is contained. So let's say I was in /home/kamstut/projects/ and I wanted to get the contents of the file /home/kamstut/my_file.txt. You could do: cat ../my_file.txt.
When you navigate a directory tree using . and .., you create paths that are called relative paths because they are relative to your current directory. Alternatively, a full path (or absolute path) is the path starting from the root directory. So /home/kamstut/my_file.txt is the absolute path for my_file.txt and ../my_file.txt is a relative path. Perform the following actions, in order:
  • Write a single command to navigate to the data directory.
  • Write a single command to navigate back to your home directory using a relative path. Do not use ~ or the cd command without a path argument.

Relevant topics: man, cd, pwd, ls, ~, .., .

Item(s) to submit:

  • Command used to navigate to the data directory.
  • Command used to navigate back to your home directory that uses a relative path.
Solution
cd /class/datamine/data
pwd
cd ../../../home/kamstut
pwd
5. In Scholar, when you want to deal with really large amounts of data, you want to access scratch (you can read more here). Your scratch directory on Scholar is located here: /scratch/scholar/$USER. $USER is an environment variable containing your username. Test it out: echo /scratch/scholar/$USER. Perform the following actions:
  • Navigate to your scratch directory.
  • Confirm you are in the correct location.
  • Execute myquota.
  • Find the location of the myquota bash script.
  • Output the first 5 and last 5 lines of the bash script.
  • Count the number of lines in the bash script.
  • How many kilobytes is the script?

Hint: You could use each of the commands in the relevant topics once.

Hint: When you type myquota on Scholar there are sometimes two warnings about xauth but sometimes there are no warnings. If you get a warning that says Warning: untrusted X11 forwarding setup failed: xauth key data not generated it is safe to ignore this error.

Hint: Commands often have options. Options are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages. For example: man wc. You can see -m, -l, and -w are all options for wc. To test this out:

# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command.
wc /class/datamine/data/flights/1987.csv
# to count the lines, use the -l option
wc -l /class/datamine/data/flights/1987.csv
# to count the words, use the -w option
wc -w /class/datamine/data/flights/1987.csv
# you can combine options as well
wc -w -l /class/datamine/data/flights/1987.csv
# some people like to use a single tack `-`
wc -wl /class/datamine/data/flights/1987.csv
# order doesn't matter
wc -lw /class/datamine/data/flights/1987.csv

Hint: The -h option for the du command is useful.

Relevant topics: cd, pwd, type, head, tail, wc, du

Item(s) to submit:

  • Command used to navigate to your scratch directory.
  • Command used to confirm your location.
  • Output of myquota.
  • Command used to find the location of the myquota script.
  • Absolute path of the myquota script.
  • Command used to output the first 5 lines of the myquota script.
  • Command used to output the last 5 lines of the myquota script.
  • Command used to find the number of lines in the myquota script.
  • Number of lines in the script.
  • Command used to find out how many kilobytes the script is.
  • Number of kilobytes that the script takes up.
Solution
# navigate to my scratch folder
cd /scratch/scholar/$USER

# confirm
pwd

# what is my quota, execute the myquota script
myquota

# get the location of the myquota script
type myquota

# get the first 5 lines of the myquota script
head /usr/local/bin/myquota

# get the last 5 lines of the myquota script
tail /usr/local/bin/myquota

# get the number of lines in the myquota script
wc -l /usr/local/bin/myquota

# get the number of kilobytes of the myquota script
du -h --apparent-size /usr/local/bin/myquota
ls -la /usr/local/bin/myquota
6. Perform the following operations:
  • Navigate to your scratch directory.
  • Copy and paste the file: /class/datamine/data/flights/1987.csv to your current directory (scratch).
  • Create a new directory called my_test_dir in your scratch folder.
  • Move the file you copied to your scratch directory, into your new folder.
  • Use touch to create an empty file named im_empty.txt in your scratch folder.
  • Remove the directory my_test_dir and the contents of the directory.
  • Remove the im_empty.txt file.

Hint: rmdir may not be able to do what you think, instead, check out the options for rm using man rm.

Relevant topics: cd, cp, mv, mkdir, touch, rmdir, rm

Item(s) to submit:

  • Command used to navigate to your scratch directory.
  • Command used to copy the file, /class/datamine/data/flights/1987.csv to your current directory (scratch).
  • Command used to create a new directory called my_test_dir in your scratch folder.
  • Command used to move the file you copied earlier 1987.csv into your new my_test_dir folder.
  • Command used to create an empty file named im_empty.txt in your scratch folder.
  • Command used to remove the directory and the contents of the directory my_test_dir.
  • Command used to remove the im_empty.txt file.
Solution
# navigate to the scratch folder
cd /scratch/scholar/$USER

# copy the 1987.csv file to the current directory (scratch)
cp /class/datamine/data/flights/1987.csv .

# make a directory in the scratch directory called `my_test_dir`
mkdir my_test_dir

# move 1987.csv to the new folder
mv 1987.csv my_test_dir

# create an empty file in the scratch folder
touch im_empty.txt

# remove the directory and the contents of the directory
rm -r my_test_dir

# remove the im_empty.txt file
rm im_empty.txt
7. Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.

Project 4


Motivation: The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. grep is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, even professionals can make critical mistakes. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.

Context: We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.

Scope: grep, regular expression basics, utilizing regular expression tools in R and Python

Learning objectives:

  • Use grep to search for patterns within a dataset.
  • Use cut to section off and slice up data from the command line.
  • Use wc to count the number of lines of input.

You can find useful examples that walk you through relevant material in The Examples Book:

https://thedatamine.github.io/the-examples-book

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

Important note: I would highly recommend using single quotes ' to surround your regular expressions. Double quotes can have unexpected behavior due to some shells' expansion rules. In addition, pay close attention to escaping certain characters in your regular expressions.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/movies_and_tv/the_office_dialogue.csv

A public sample of the data can be found here: the_office_dialogue.csv

All questions should be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.

grep stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep, we will be using it with textual data. You can read about and see examples of grep here.

1. Login to Scholar and use grep to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica.". What is the name of the dataset and where is it located?

Relevant topics: grep

Item(s) to submit:

  • The grep command used to find the dataset.
  • The name and location in Scholar of the dataset.
  • Use grep and grepl within R to solve a data-driven problem.
Solution
grep -Ri "bears. beets. battlestar galactica." /class/datamine 

/class/datamine/data/movies_and_tv/the_office_dialogue.csv

2. grep prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first n lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in.
Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.

Hint: The result if you were to search for "bears. beets. battlestar galactica." should be:

"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."

Hint: One method to solve this problem would be to pipe the output from grep to cut.

Relevant topics: cut, grep

Item(s) to submit:

  • The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".
Solution
grep -i "bears. beets. battlestar galactica." /class/datamine/data/movies_and_tv/the_office_dialogue.csv | cut -d "," -f 7,8  
3. Find all of the lines where Pam is called "Beesley" instead of "Pam" or "Pam Beesley".

Hint: A negative lookbehind would be one way to solve this; in order to use a negative lookbehind with grep, make sure to add the -P option. In addition, make sure to use single quotes so that your regular expression is taken literally. If you use double quotes, variables are expanded.

Relevant topics: grep

Item(s) to submit:

  • The UNIX command used to solve this problem.
Solution
grep -Pi '(?<!Pam )Beesley' /class/datamine/data/movies_and_tv/the_office_dialogue.csv

Regular expressions are really a useful, semi language-agnostic tool. What this means is that regardless of the programming language you are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using read.csv. Name the resulting data.frame dat.

4. The text_w_direction column in dat contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expression to detect the directions.
(a) Create a new column called has_direction that is set to TRUE if the text_w_direction column has direction, and FALSE otherwise. Use the grepl function in R to accomplish this.

Hint: Make sure all opening brackets "[" have a corresponding closing bracket "]".

Hint: Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.

(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have 2 or more directions? Modify your code again and find how many have 5 or more.

We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.

This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

Your solution to part (a) should find both lines a match. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.

In our actual dataset, for example, dat$text_w_direction[2789] is a line with 2 directions.

Relevant topics: grep, grepl, basic matches, escaping characters

Item(s) to submit:

  • The R code and regular expression used to solve the first part of this problem.
  • The R code and regular expression used to solve the second part of this problem.
  • How many lines have >= 2 directions?
  • How many lines have >= 5 directions?
Solution
dat$has_direction <- grepl("(\\[.*\\])+", dat$text_w_direction)

#

length(grep("\\[.*\\].*\\[.*\\]", dat$text_w_direction))
length(grep("\\[.*\\].*\\[.*\\].*\\[.*\\].*\\[.*\\].*\\[.*\\]", dat$text_w_direction))
5. Use the str_extract_all function from the stringr package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called direction.
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

In this question, your solution may have extracted:

[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]

(It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.)

Relevant topics: str_extract_all, basic matches, escaping characters

Item(s) to submit:

  • The R code used to solve this problem.
Solution
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[[^\\[\\]]*\\])", simplify=F)

# or 

dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[.*?\\])", simplify=F)
OPTIONAL QUESTION. Repeat (5) but this time make sure you only capture the brackets and text within the brackets. Save the results in a new column called direction_correct. You can test to see if it is working by running the following code:
dat$direction_correct[747]
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

In (5), your solution may have extracted:

[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]

This is ok for (5). In this question, however, we want to fix this to only extract:

[emphasize this]
[emphasize this] [shrug]

Hint: This regular expression will be hard to read.

Hint: The pattern we want is: literal opening bracket, followed by 0+ of any character other than the literal [ or literal ], followed by a literal closing bracket.

Relevant topics: str_extract_all

Item(s) to submit:

  • The R code used to solve this problem.
Solution
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[[^\\[\\]]*\\])", simplify=F)

# or 

dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[.*?\\])", simplify=F)

Project 5


Motivation: Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc.

Context: We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.

Scope: grep, regular expression basics, UNIX utilities, redirection, piping

Learning objectives:

  • Use cut to section off and slice up data from the command line.
  • Use piping to string UNIX commands together.
  • Use sort and its options to sort data in different ways.
  • Use head to isolate n lines of output.
  • Use wc to summarize the number of lines in a file or in output.
  • Use uniq to filter out non-unique lines.
  • Use grep to search files effectively.

You can find useful examples that walk you through relevant material in The Examples Book:

https://thedatamine.github.io/the-examples-book

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

Don't forget the very useful documentation shortcut ? for R code. To use, simply type ? in the console, followed by the name of the function you are interested in. In the Terminal, you can use the man command to check the documentation of bash code.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/amazon/amazon_fine_food_reviews.csv

A public sample of the data can be found here: amazon_fine_food_reviews.csv

Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.

Here are three videos that might also be useful, as you work on Project 5:

Click here for video

Click here for video

Click here for video

Questions

1. What is the Id of the most helpful review, according to the highest HelpfulnessNumerator?

Important note: You can always pipe output to head in case you want the first few values of a lot of output. Note that if you used sort before head, you may see the following error messages:

sort: write failed: standard output: Broken pipe
sort: write error

This is because head would truncate the output from sort. This is okay. See this discussion for more details.

Relevant topics: cut, sort, head, piping

Item(s) to submit:

  • Line of UNIX commands used to solve the problem.
  • The Id of the most helpful review.
Solution
cut -d, -f5 amazon_fine_food_reviews.csv| sort -nr | head -n3
2. Some entries under the Summary column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion.
To further clarify what we mean by unique, if we had the following vector in R, c("a", "b", "a", "c"), its unique values are c("a", "b", "c").

Relevant topics: cut, uniq, sort, wc, piping

Item(s) to submit:

  • Two lines of UNIX commands used to solve the problem.
  • The proportion of unique Summary values.
Solution
cut -d, -f9 amazon_fine_food_reviews.csv | sort -u | wc -l
cut -d, -f9 amazon_fine_food_reviews.csv | wc -l
3. Use a chain of UNIX commands, piped in a sequence, to create a frequency table of Score.

Relevant topics: cut, uniq, sort, piping

Item(s) to submit:

  • The line of UNIX commands used to solve the problem.
  • The frequency table.
Solution
cut -d, -f7 amazon_fine_food_reviews.csv | sort -n | uniq -c
4. Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why?

Hint: You may need to pipe the output to sort multiple times.

Hint: To create the frequency table, read through the man pages for uniq. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following:

man uniq

Relevant topics: cut, uniq, sort, head, piping, man

Item(s) to submit:

  • The line of UNIX commands used to solve the problem.
  • The frequency table.
Solution
cut -d, -f3 amazon_fine_food_reviews.csv | sort | uniq -c | sort -nr | head -n1
5. Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let's consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let's consider anything in between neither strongly positive nor negative.

Hint: You may find the solution to problem (3) useful.

Relevant topics: cut, uniq, sort, grep, piping

Item(s) to submit:

  • The line of UNIX commands used to solve the problem.
Solution
grep -i 'A3OXHLG6DIBRW8' amazon_fine_food_reviews.csv | cut -d, -f7 | sort | uniq -c
6. Find the most helpful review with a Score of 5. Then (separately) find the most helpful review with a Score of 1. As before, we are considering the most helpful review to be the review with the highest HelpfulnessNumerator.

Hint: You can use multiple lines to solve this problem.

Relevant topics: sort, head, piping

Item(s) to submit:

  • The lines of UNIX commands used to solve the problem.
  • ProductId's of both requested reviews.
Solution
sort -t, -k7rn,7 -k5rn,5 amazon_fine_food_reviews.csv | head -n2
sort -t, -k7n,7 -k5rn,5 amazon_fine_food_reviews.csv | head -n2
7. For only the two ProductIds from the previous question, create a new dataset called scores.csv that contains the ProductIds and Scores from all reviews for these two items.

Relevant topics: cut, grep, redirection

Item(s) to submit:

  • The line of UNIX commands used to solve the problem.
Solution
cut -d, -f2,7 amazon_fine_food_reviews.csv| grep -Ei '(B00012182G|B003EMQGVI|B00065LI0A)' > scores.csv
OPTIONAL QUESTION. Use R to load up scores.csv into a new data.frame called dat. Create a histogram for each product's Score. Compare the most helpful review Score with the Scores shown in the histograms. Based on this comparison, point out some curiosities about the product that may be worth exploring. For example, if a product receives many high scores, but has a super helpful review that gives the product 1 star, I may tend to wonder if the product is not as great as it seems to be.

Relevant topics: read.csv, hist

Item(s) to submit:

  • R code used to create the histograms.
  • 3 histograms, 1 for each ProductId.
  • 1-2 sentences describing the curious pattern that you would like to further explore.
Solution
dat <- read.csv("scores.csv", header=F)
par(mfrow=c(1,3))
tapply(dat$V2, dat$V1, hist)

Project 6


Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives:

  • Use awk to process and manipulate textual data.
  • Use piping and redirection within the terminal to pass around data between utilities.
  • Use output created from the terminal to create a plot using R.

Dataset

The following questions will use the dataset found here or in Scholar:

/class/datamine/data/flights/subset/YYYY.csv

An example from 1987 data can be found here or in Scholar:

/class/datamine/data/flights/subset/1987.csv

Questions

1. In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the Origin, from 1987.csv. Write another line, this time using awk to do the same thing. Which one do you prefer, and why?

Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk:

Click here for video

Relevant topics: cut, awk

Item(s) to submit:

  • One line of UNIX commands to solve the problem without using awk.
  • One line of UNIX commands to solve the problem using awk.
  • 1-2 sentences describing which method you prefer and why.
Solution
cut -d, -f17 1987.csv
awk -F, '{print $17}' 1987.csv
2. Write a bash script that accepts a year (1987, 1988, etc.) and a column n and returns the nth column of the associated year of data.

Here are two examples to illustrate how to write a bash script:

Click here for video

Click here for video

Hint: In this example, you only need to turn in the content of your bash script (starting with #!/bin/bash) without evaluation in a code chunk. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it:

#!/bin/bash

echo "First argument: $1"
echo "Second argument: $2"

If you simply drop that text into a file called my_script.sh, located here: /home/$USER/my_script.sh, and if you run the following:

# Setup bash to run; this only needs to be run one time per session.
# It makes bash behave a little more naturally in RStudio.
exec bash

# Navigate to the location of my_script.sh
cd /home/$USER

# Make sure that the script is runnable.
# This only needs to be done one time for each new script that you write.
chmod 755 my_script.sh

# Execute my_script.sh
./my_script.sh okay cool

then it will print:

First argument: okay
Second argument: cool

In this example, if we were to turn in the "content of your bash script (starting with #!/bin/bash) in a code chunk", our solution would look like this:

#!/bin/bash

echo "First argument: $1"
echo "Second argument: $2"

And although we aren't running the code chunk above, we know that it works because we tested it in the terminal.

Hint: Using awk you could have a script with just two lines: 1 with the "hash-bang" (#!/bin/bash), and 1 with a single awk command.

Relevant topics: awk, bash scripts

Item(s) to submit:

  • The content of your bash script (starting with #!/bin/bash) in a code chunk.
Solution
#!/bin/bash
awk -F, -v col=$2 '{print $col}' $1.csv
3. How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using awk, then solve this problem using only awk.

Here is a similar example, using the election data set:

Click here for video

Relevant topics: cut, grep, wc, awk, piping

Item(s) to submit:

  • One line of UNIX commands to solve the problem without using awk.
  • One line of UNIX commands to solve the problem using awk.
  • The number of flights that arrived at Indianapolis (IND) in 2008.
Solution
cut -d, -f18 2008.csv | grep 'IND' | wc -l
awk -F, '{if ($18 == "IND") count++}END{print count}' 2008.csv
4. Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you'd like. Are they indeed the same? How many unique values do we have per category (Origin, Dest)?

Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data:

Click here for video

Relevant topics: cut, sort, uniq, wc, awk

Item(s) to submit:

  • 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same.
  • The UNIX command(s) used to figure out if the number of unique origins and destinations are the same.
  • The number of unique values per category (Origin, Dest).
Solution
cut -d, -f17 2008.csv | sort | uniq | wc -l
cut -d, -f18 2008.csv | sort | uniq | wc -l
5. In (4) we found that there are not the same number of unique Origin's as Dest's. Find the IATA airport code for all Origin's that don't appear in a Dest and all Dest's that don't appear in an Origin in the 2008 data.

Hint: The examples on this page should help. Note that these examples are based on Process Substitution, which basically allows you to specify commands whose output would be used as the input of comm. There should be no space between < and (, otherwise your bash will not work as intended.
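
Here is a toy illustration (not part of the solution) of comm and process substitution; comm expects both inputs to be sorted:

# column 1: lines only in the first input; column 2: lines only in the second; column 3: lines in both
comm <(printf "a\nb\nc\n") <(printf "b\nc\nd\n")
# -23 suppresses columns 2 and 3, leaving only the lines unique to the first input
comm -23 <(printf "a\nb\nc\n") <(printf "b\nc\nd\n")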

Relevant topics: comm, cut, sort, uniq, redirection

Item(s) to submit:

  • The line(s) of UNIX command(s) used to answer the question.
  • The list of Origins that don't appear in Dest.
  • The list of Dests that don't appear in Origin.
Solution
comm -23 <(cut -d, -f17 2008.csv | sort | uniq) <(cut -d, -f18 2008.csv | sort | uniq)
comm -13 <(cut -d, -f17 2008.csv | sort | uniq) <(cut -d, -f18 2008.csv | sort | uniq)
6. What was the percentage of flights in 2008 per unique Origin, among flights with the Dest of "IND"? What percentage of flights had "PHX" as the Origin (among all flights with a Dest of "IND")?

Here is an example using the percentages of donations contributed from CEOs from various States:

Click here for video

Hint: You can do the mean calculation in awk by dividing the result from (3) by the number of unique Origin's that have a Dest of "IND".
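
For instance, here is one possible sketch (not the only approach; it assumes, as in the earlier questions, that Origin is column 17 and Dest is column 18) that tallies the flights into "IND" per Origin and prints each Origin's share as a percentage:

awk -F, '$18=="IND" {n[$17]++; total++} END {for (o in n) printf "%s: %.2f%%\n", o, 100*n[o]/total}' 2008.csv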

Relevant topics: awk, sort, grep, wc

Item(s) to submit:

  • The percentage of flights in 2008 per unique Origin with the Dest of "IND".
  • 1-2 sentences explaining how "PHX" compares (as a unique Origin) to the other Origins (all with the Dest of "IND").
Solution
awk -F, '{if($18=="IND") print $17}' 2008.csv | sort -u | wc -l
awk -F, '{if($18=="IND") print $17}' 2008.csv | grep -i PHX | wc -l
7. Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like:
1987, 12345
1988, 44

Run the script with inputs: 1991 and ORD. Include the output in your submission.

Relevant topics: bash scripts, cut, piping, grep, wc

Item(s) to submit:

  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.
  • The output of the script given 1991 and ORD as inputs.
Solution
#!/bin/bash
FLIGHTS_OUT="$(cut -d, -f17 $1.csv | grep -i $2 | wc -l)"
FLIGHTS_IN="$(cut -d, -f18 $1.csv | grep -i $2 | wc -l)"
echo "$((FLIGHTS_OUT + FLIGHTS_IN)), $1"
OPTIONAL QUESTION. Pick your favorite airport and get its IATA airport code. Write a bash script that, given the first year, last year, and airport code, runs the bash script from (7) for all years in the provided range for your given airport, or loops through all of the files for the given airport, appending all of the data to a new file called my_airport.csv.

Relevant topics: bash scripts, cut, grep, wc, for loops, echo, redirection

Item(s) to submit:

  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do 
    FLIGHTS_OUT="$(cut -d, -f17 $f.csv | grep -i $3 | wc -l)"
    FLIGHTS_IN="$(cut -d, -f18 $f.csv | grep -i $3 | wc -l)"
    echo "$((FLIGHTS_OUT + FLIGHTS_IN)), $f" 
done
OPTIONAL QUESTION. In R, load my_airport.csv and create a line plot showing the year-by-year change. Label your x-axis "Year", your y-axis "Num Flights", and your title the name of the IATA airport code. Write 1-2 sentences with your observations.

Relevant topics: read.csv, lines

Item(s) to submit:

  • Line chart showing year-by-year change in flights into and out of the chosen airport.
  • R code used to create the chart.
  • 1-2 sentences with your observations.
Solution
dat <- read.csv("my_airport.csv", header=F)
plot(dat[,1], dat[,2], xlab="Year", ylab="Num Flights", main="JFK", type="l")

Project 7


Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the second part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives:

  • Use awk to process and manipulate textual data.
  • Use piping and redirection within the terminal to pass around data between utilities.

Dataset:

The following questions will use the dataset found in Scholar:

/class/datamine/data/flights/subset/YYYY.csv

An example of the data for the year 1987 can be found here.

Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

1. Write a line of code that prints a list of the unique values in the DayOfWeek column. Write a line of code that prints a list of the unique values in the DayOfMonth column. Write a line of code that prints a list of the unique values in the Month column. Use the 1987.csv dataset. Are the results what you expected?

Relevant topics: cut, sort

Item(s) to submit:

  • 3 lines of code used to get a list of unique values for the chosen columns.
  • 1-2 sentences explaining whether or not the results are what you expected.
Solution
cut -d, -f3 1987.csv | sort -nu
cut -d, -f4 1987.csv | sort -nu
cut -d, -f2 1987.csv | sort -nu
2. Our files should have 29 columns. For a given file, write a line of code that prints any lines that do not have 29 columns. Test it on 1987.csv. Were there any rows without 29 columns?

Hint: See here. NF looks like it may be useful!
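
As a quick illustration, NF is awk's count of fields on the current line:

printf "a,b,c\nx,y\n" | awk -F, '{print NF, $0}'
# prints "3 a,b,c" and then "2 x,y"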

Relevant topics: awk

Item(s) to submit:

  • Line of code used to solve the problem.
  • 1-2 sentences explaining whether or not there were any rows without 29 columns.
Solution
awk -F, '{if (NF != 29) print $0}' 1987.csv
3. Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do not have 29 columns.

Relevant topics: awk, bash scripts

Item(s) to submit:

  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.
  • The results of running your bash scripts from year 1987 to 2008.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do
  awk -F, '{if (NF != 29) print $0}' $f.csv
done
4. awk is a really good tool to quickly get some data and manipulate it a little bit. The column Distance contains the distances of the flights in miles. Use awk to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.

Example output:

Miles: 12345
Kilometers: 19867.35168

Relevant topics: awk, piping

Item(s) to submit:

  • The code used to solve the problem.
  • The results of running the code.
Solution
awk -F, '{miles=miles+$19}END{print "Miles: " miles "\nKilometers: " miles*1.609344}' 1990.csv
5. Use awk to calculate the sum of the number of DepDelay minutes, grouped according to DayOfWeek. Use 2007.csv.

Example output:

DayOfWeek:  0
1:  1234567
2:  1234567
3:  1234567
4:  1234567
5:  1234567
6:  1234567
7:  1234567

Note: 1 is Monday.

Relevant topics: awk, sort, piping

Item(s) to submit:

  • The code used to solve the problem.
  • The output from running the code.
Solution
awk -F, '{delay[$4]=delay[$4]+$16}END{for (d in delay) print d": ", delay[d]}' 2007.csv | sort -n
6. It wouldn't be fair to compare the total DepDelay minutes by DayOfWeek as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of DepDelay minutes by the number of flights per DayOfWeek. Use 2007.csv.

Example output:

DayOfWeek:  0
1:  1.234567
2:  1.234567
3:  1.234567
4:  1.234567
5:  1.234567
6:  1.234567
7:  1.234567

Relevant topics: awk, sort, piping

Item(s) to submit:

  • The code used to solve the problem.
  • The output from running the code.
Solution
awk -F, '{delay[$4]=delay[$4]+$16; flights[$4]++}END{for (d in delay) print d": ", delay[d]/flights[d]}' 2007.csv | sort -n
7. Anyone who has flown knows how frustrating it can be waiting for takeoff, or deboarding the aircraft. These roughly translate to TaxiOut and TaxiIn respectively. If you were to fly into or out of IND what is your expected total taxi time? Use 2007.csv.

Note: Taxi times are in minutes.

Relevant topics: awk, grep

Item(s) to submit:

  • The code used to solve the problem.
  • The output from running the code.
Solution
grep -i IND 2007.csv | awk -F, '{taxi=taxi+$20+$21; count++}END{print taxi/count}'

Project 8


Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives:

  • Use awk to process and manipulate textual data.
  • Use piping and redirection within the terminal to pass around data between utilities.

Dataset:

The following questions will use the dataset found in Scholar:

/class/datamine/data/flights/subset/YYYY.csv

An example of the data for the year 1987 can be found here.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

1. Let's say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true!
Write a line of awk code that prints the total number of flights that occur on weekend days, followed by the total number of flights that occur on the weekdays. Complete this calculation for 2008 using the 2008.csv file.
Modify your code to instead print the average number of flights that occur on weekend days, followed by the average number of flights that occur on the weekdays.

Hint: You don't need a large if statement to do this, you can use the ~ comparison operator.
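
For example, here is a small illustration of the ~ operator (matching a field against a regular expression):

printf "1\n5\n6\n" | awk '{if ($1 ~ /5|6|7/) print $1, "weekend"; else print $1, "weekday"}'
# prints "1 weekday", "5 weekend", "6 weekend"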

Relevant topics: awk

Item(s) to submit:

  • Lines of awk code that solves the problem.
  • The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.
  • The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.
Solution
awk -F, '{if ($4 ~ /1|2|3|4/) weekday++; else (weekend++)}END{print weekend,weekday}' 2008.csv
awk -F, '{if ($4 ~ /1|2|3|4/) weekday++; else (weekend++)}END{print weekend/3,weekday/4}' 2008.csv
2. We want to look to see if there may be some truth to the whole "snow bird" concept where people will travel to warmer states like Florida and Arizona during the Winter. Let's use the tools we've learned to explore this a little bit.
Take a look at airports.csv. In particular run the following:
head airports.csv
Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within awk. This is messy and we would prefer to create a new file called new_airports.csv without any quotes. Write a line of code to do this.

Note: You may be wondering why we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It's important to practice and learn ways to fix these things.

Hint: You could use gsub within awk to replace '"' with ''. You can find how to use gsub here.

Hint: If you leave out the target-string argument to gsub, it will apply the substitution to the entire record ($0), that is, to every field on the line.
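
For example, on a toy line (not the real data):

echo '"IND","Indianapolis"' | awk '{gsub(/"/, ""); print $0}'
# prints: IND,Indianapolis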

Hint:

cat new_airports.csv | wc -l # should be 159 without header

Relevant topics: awk, redirection

Item(s) to submit:

  • Line of awk code used to create the new dataset.
Solution
awk -F, '{gsub(/"/, ""); print $0}' airports.csv > new_airports.csv
3. Write a line of commands that creates a new dataset called az_fl_airports.txt. az_fl_airports.txt should only contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (2), new_airports.csv, as a starting point.
How many airports are there? Did you expect this? Use a line of bash code to count this.
Create a new dataset (called az_fl_flights.txt) that contains all of the data for flights into or out of Florida and Arizona (using the 2008.csv file). Use the newly created dataset, az_fl_airports.txt to accomplish this.

Hint: https://unix.stackexchange.com/questions/293684/basic-grep-awk-help-extracting-all-lines-containing-a-list-of-terms-from-one-f
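
As a small illustration of the approach from that link (codes.txt here is a hypothetical two-airport list used only for this example):

printf "IND\nPHX\n" > codes.txt
# -F treats each pattern as a fixed string, -w requires whole-word matches,
# and -f reads the list of patterns from codes.txt
grep -w -F -f codes.txt 2008.csv | head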

Hint:

cat az_fl_flights.txt | wc -l # should be 484705

Relevant topics: awk, wc, piping

Item(s) to submit:

  • All UNIX commands used to answer the questions.
  • The number of airports.
  • 1-2 sentences explaining whether you expected this number of airports.
Solution
awk -F, '{if ($4 == "AZ" || $4 == "FL") print $1}' new_airports.csv > az_fl_airports.txt
wc -l az_fl_airports.txt
4. Write a bash script that accepts the start year, end year, and filename containing airport codes (az_fl_airports.txt), and outputs the data for flights into or out of any of the airports listed in the provided filename (az_fl_airports.txt). The script should output data for flights using all of the years of data in the provided range. Run the bash script to create a new file called az_fl_flights_total.csv.

Relevant topics: bash scripts, grep, for loop, redirection

Item(s) to submit:

  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.
  • The line of UNIX code you used to execute the script and create the new dataset.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do
  grep -w -F -f $3 $f.csv
done
./my_script.sh 1987 2008 az_fl_airports.txt > az_fl_flights_total.csv
5. Use the newly created dataset, az_fl_flights_total.csv, from question 4 to calculate the total number of flights into and out of both states by month, and by year, for a total of 3 columns (year, month, flights). Export this information to a new file called snowbirds.csv.
Load up your newly created dataset and use either R or Python (or some other tool) to create a graphic that illustrates whether or not we believe the "snowbird effect" affects flights. Include a description of your graph, as well as your (anecdotal) conclusion.

Hint: You can use 1 dimensional arrays to accomplish this if the key is the combination of, for example, the year and month.
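
For example, here is a toy illustration of a combined key in a one-dimensional awk array:

printf "2007,1\n2007,1\n2007,2\n" | awk -F, '{cnt[$1","$2]++} END {for (k in cnt) print k","cnt[k]}'
# prints 2007,1,2 and 2007,2,1 (in some order): year, month, count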

Relevant topics: awk, redirection

Item(s) to submit:

  • The line of awk code used to create the new dataset, snowbirds.csv.
  • Code used to create the visualization in a code chunk.
  • The generated plot as either a png or jpg/jpeg.
  • 1-2 sentences describing your plot and your conclusion.
Solution
awk -F, '{M[$1","$2]++}END{for (monthyear in M) print monthyear","M[monthyear]}' az_fl_flights_total.csv > snowbirds.csv
library(ggplot2)
dat <- read.csv("snowbirds.csv", header=F)
names(dat) <- c("year", "month", "n_flights")
ggplot(dat) + 
  geom_line(aes(x=month, y=n_flights, group=year, col = as.factor(year))) + 
  geom_point(aes(x=month, y=n_flights)) + 
  theme_classic(base_size=14) + 
  scale_x_continuous(breaks=1:12, labels = month.abb) +
  labs(x= 'Month', y='', title = 'Total flights per month in/out AZ and FL (1987-2008)', col='Year')

Project 9


Motivation: Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, cloudflare, a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per this thread on hackernews). Learning SQL is well worth your time!

Context: There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite.

Scope: SQL, sqlite

Learning objectives:

  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.

Dataset:

The following questions will use the dataset found in Scholar:

/class/datamine/data/lahman/lahman.db

This is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

Important note: For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:

```{r, include=F}
library(RSQLite)
lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db")
```

Each solution should then be placed in a code chunk like this:

```{sql, connection=lahman}
SELECT * FROM batting LIMIT 1;
```

If you want to use a SQLite-specific command like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/lahman/lahman.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

```{sql, connection=lahman, eval=F}
SELECT * FROM batting LIMIT 1;
```

This will allow the code to be displayed without throwing an error.

1. Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal and access the Lahman database. How many tables are available?

Hint: To connect to the database, do the following:

sqlite3 /class/datamine/data/lahman/lahman.db

Hint: This is a good resource.

Relevant topics: sqlite3

Item(s) to submit:

  • How many tables are available in the Lahman database?
  • The sqlite3 commands used to figure out how many tables are available.
Solution
.tables
2. Some people like to try to visit all 30 MLB ballparks in their lifetime. Use SQL commands to get a list of parks and the cities they're located in. For your final answer, limit the output to 10 records/rows.

Note: There may be more than 30 parks in your result, this is ok. For long results, you can limit the number of printed results using the LIMIT clause.

Hint: Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following:

.headers on

Relevant topics: SELECT, FROM, LIMIT

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT parkname, city FROM parks LIMIT 10;
3. There is nothing more exciting to witness than a home run hit by a batter. It's impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (HR) in a season. List their playerID, yearID, home run total, and the teamID they played for.

Hint: There are 8 occurrences of home runs greater than or equal to 60.

Hint: The batting table is where you should look for this question.

Relevant topics: SELECT, FROM, WHERE, LIMIT

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT playerID, yearID, HR, teamID FROM batting WHERE HR>=60;
4. Make a list of players born on your birth day (don't worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year.

Hint: The people table is where you should look for this question.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT nameFirst, nameLast, birthYear FROM people
  WHERE birthMonth==5 AND birthDay==29
  ORDER BY birthYear DESC LIMIT 10;
5. Get the Cleveland (CLE) Pitching Roster from the 2016 season (playerID, W, L, SO). Order the pitchers by number of Strikeouts (SO) in descending order.

Hint: The pitching table is where you should look for this question.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT playerID, W, L, SO FROM pitching 
    WHERE teamID=='CLE' AND yearID==2016
    ORDER BY SO DESC LIMIT 10;
6. Find the 10 team and year pairs that have the highest number of Errors (E) between 1960 and 1970. Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs?

Hint: The teams table is where you should look for this question.

Hint: The BETWEEN clause is useful here.

Hint: It is OK to use multiple queries to answer the question.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT teamID, franchID, yearID, W, L, E FROM teams
  WHERE yearID BETWEEN 1960 AND 1970
  ORDER BY E DESC
  LIMIT 10;
-- equivalently, without BETWEEN:
SELECT teamID, franchID, yearID, W, L, E FROM teams
  WHERE yearID >= 1960 AND yearID <= 1970
  ORDER BY E DESC
  LIMIT 10;
SELECT franchName FROM teamsfranchises WHERE franchID=='LAA';
7. Find the playerID for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table pitching)? What year and team did he win the most games as a manager (use table managers)?

Hint: It is OK to use multiple queries to answer the question.

Note: There was a tie among the two years in which Bob Lemon had the most wins as a pitcher.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT playerID FROM people
  WHERE nameFirst=='Bob' AND nameLast=='Lemon';
SELECT teamID, yearID, W FROM pitching
  WHERE playerID=='lemonbo01'
  ORDER BY W DESC;
SELECT teamID, yearID, W FROM managers
  WHERE playerID=='lemonbo01'
  ORDER BY W DESC;
8. Find the AL West (use lgID and divID to specify AL West) home run (HR), walk (BB), and stolen base (SB) totals by team between 2000 and 2010. Which team and year combo led in each category in the decade?

Hint: The teams table is where you should look for this question.

Hint: It is OK to use multiple queries to answer the question.

Hint: Use divID == 'W' as one of the conditions. Please note using double quotes: divID == "W" will not work.

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The team-year combination that ranked top in each category.
Solution
SELECT teamID, yearID, HR, BB, SB FROM teams
  WHERE lgID=='AL' AND divID=='W'
  AND yearID BETWEEN 2000 AND 2010
  ORDER BY HR DESC LIMIT 1;
SELECT teamID, yearID, HR, BB, SB FROM teams
  WHERE lgID=='AL' AND divID=='W'
  AND yearID BETWEEN 2000 AND 2010
  ORDER BY BB DESC LIMIT 1;
SELECT teamID, yearID, HR, BB, SB FROM teams
  WHERE lgID=='AL' AND divID=='W'
  AND yearID BETWEEN 2000 AND 2010
  ORDER BY SB DESC LIMIT 1;
9. Get a list of the following by year: wins (W), losses (L), Home Runs Hit (HR), homeruns allowed (HRA), and total home game attendance (attendance) for the Detroit Tigers when winning a World Series (WSWin is Y) or when winning league champion (LgWin is Y).

Hint: The teams table is where you should look for this question.

Hint: Be careful with the order of operations for AND and OR. Remember you can force order of operations using parentheses.
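
For instance, here is a sketch you could run in the terminal (AND binds more tightly than OR):

# with parentheses: Detroit teams that won either the World Series or the league championship
sqlite3 /class/datamine/data/lahman/lahman.db "SELECT COUNT(*) FROM teams WHERE teamID='DET' AND (WSWin='Y' OR LgWin='Y');"
# without parentheses this reads as (teamID='DET' AND WSWin='Y') OR LgWin='Y',
# which also counts every league champion, regardless of team
sqlite3 /class/datamine/data/lahman/lahman.db "SELECT COUNT(*) FROM teams WHERE teamID='DET' AND WSWin='Y' OR LgWin='Y';"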

Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN

Note: Examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL code used to solve the problem.
  • The first 10 results of the query.
Solution
SELECT teamID, yearID, W, L, HR, HRA, attendance FROM teams
  WHERE teamID=='DET' AND (WSWin=='Y' OR LgWin=='Y');

Project 10


Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data.

Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG, MIN, and MAX.

Scope: SQL, sqlite

Learning objectives:

  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
  • Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.

Dataset

The following questions will use the dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at scholar-db.rcac.purdue.edu. As in Project 9, this is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.

Questions

Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to losing points.

Important note: For this project all solutions should be done using R code chunks, and the RMariaDB package. Run the following code to load the library:

library(RMariaDB)
1. Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu and, rather than navigating to the terminal like we did in the previous project, create a connection to our MariaDB lahman database using the RMariaDB package in R, and the credentials below. Confirm the connection by running the following code chunk:
con <- dbConnect(RMariaDB::MariaDB(), 
                 host="scholar-db.rcac.purdue.edu", 
                 db="lahmandb", 
                 user="lahman_user", 
                 password="HitAH0merun")

head(dbGetQuery(con, "SHOW tables;"))

Hint: In the example provided, the variable con from the dbConnect function is the connection. Each query that you make, using the dbGetQuery, needs to use this connection con. You can change the name con if you want to (it is user defined), but if you change the name con, you need to change it on all of your connections. If your connection to the database dies while you are working on the project, you can always re-run the dbConnect line again, to reset your connection to the database.

Relevant topics: RMariaDB, dbConnect, dbGetQuery

Item(s) to submit:

  • R code used to solve the problem.
  • Output from running your (potentially modified) head(dbGetQuery(con, "SHOW tables;")).
Solution
library(RMariaDB)
host <- "scholar-db.rcac.purdue.edu"
dbname <- "lahmandb"
user <- "lahman_user"
password <- "HitAH0merun"
con <- dbConnect(MariaDB(),host=host,
                 dbname=dbname,
                 user=user,
                 password=password)
head(dbGetQuery(con, "SHOW tables;"))
2. How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (SB) and hit at least 40 home runs (HR) in one year.

Hint: Use the batting table.

Important note: You only need to run library(RMariaDB) and the dbConnect portion of the code a single time towards the top of your project. After that, you can simply reuse your connection con to run queries.

Important note: In our project template, for this project, make all of the SQL queries using the dbGetQuery function, which returns the results directly in R. Therefore, your RMarkdown blocks for this project should all be {r} blocks (as opposed to the {sql} blocks used in Project 9).

Hint: You can use dbGetQuery to run your queries from within R. Example:

dbGetQuery(con, "SELECT * FROM battings LIMIT 5;")

Note: We already demonstrated the correct SQL query to use for the 40/40 club in this video, but now we want you to use RMariaDB to solve this query:

Click here for video

Relevant topics: dbGetQuery, AND/OR, DISTINCT, COUNT

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con,"SELECT DISTINCT COUNT(*) FROM batting WHERE HR>=40 AND SB>=40;")
3. How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"?

Click here for video

Important note: Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use sum in R rather than SUM in SQL.

Hint: Use the people table to find the playerID and use the batting table to find the statistics.

Relevant topics: dbGetQuery, AND/OR, SUM

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con, "SELECT playerID FROM people WHERE nameFirst ='Giancarlo' AND nameLast = 'Stanton';")
dbGetQuery(con, "SELECT COUNT(*), SUM(SO) FROM batting WHERE playerID = 'stantmi03' AND (teamID = 'MIA' OR teamID = 'FLO');")
4. The Batting Average is a metric for a batter's performance. The Batting Average in a year is calculated by \(\frac{H}{AB}\) (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to playerID, teamID, and yearID.

Hint: Use the batting table.

Relevant topics: dbGetQuery, ORDER BY, BETWEEN

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con,"SELECT playerID, teamID, yearID, H, AB, H/AB FROM batting
  WHERE yearID BETWEEN 2000 AND 2010 AND AB>300
  ORDER BY H/AB DESC LIMIT 5;")
5. How many unique players have hit > 50 home runs (HR) in a season?

Hint: If you view DISTINCT as being paired with SELECT, instead, think of it as being paired with one of the fields you are selecting.

Relevant topics: dbGetQuery, DISTINCT, COUNT

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con, "SELECT COUNT(DISTINCT playerID) FROM batting WHERE HR>50;")
# As you can see, DISTINCT is with the field playerID, not with SELECT. 
# The following query should help distinguish this.
dbGetQuery(con, "SELECT COUNT(DISTINCT playerID), COUNT(playerID) FROM batting WHERE HR>50;")
6. Find the number of unique players that attended Purdue University. Start by finding the schoolID for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic.

Hint: Use the schools table to get the schoolIDs, and the collegeplaying table to get the statistics.

Hint: You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates.

Hint: To find IU's id, try the following query: SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';. You can find more about the LIKE clause and % here.

Relevant topics: dbGetQuery, plotting in R, DISTINCT, COUNT

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
dbGetQuery(con, "SELECT schoolID FROM schools WHERE name_full='Purdue University';")
dbGetQuery(con, "SELECT schoolID FROM schools WHERE LIKE '%indiana%';")
dbGetQuery(con,"SELECT COUNT(DISTINCT playerID) FROM collegeplaying WHERE schoolID ='purdue';")
dbGetQuery(con,"SELECT DISTINCT playerID FROM collegeplaying WHERE schoolID ='indiana';")
purdue <- length(unique(dbGetQuery(con,"SELECT playerID FROM collegeplaying WHERE schoolID ='purdue';"))$playerID)
iu <- length(unique(dbGetQuery(con,"SELECT playerID FROM collegeplaying WHERE schoolID ='indiana';"))$playerID)
barplot(c(purdue, iu), names.arg=c("Purdue", "IU"), ylab="Count")
barplot(log(c(purdue, iu)), main="Pro baseball players: Indiana vs. Purdue", col=c("#990000","#CFB53B"), names.arg=c("Purdue", "IU"), cex.names=.5, ylab="Log count")
7. Use R, SQL and the lahman database to create an interesting infographic. For those of you who are not baseball fans, try doing a Google image search for "baseball plots" for inspiration. Make sure the plot is polished, has appropriate labels, color, etc.

Relevant topics: SQL, plotting in R

Item(s) to submit:

  • R code used to solve the problem.
  • The result of running the R code.
Solution
# Could be anything.

Project 11


Motivation: Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.

Context: We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset!

Scope: SQL, SQL in R

Learning objectives:

  • Demonstrate the ability to interact with popular database management systems within R.
  • Solve data-driven problems using a combination of SQL and R.
  • Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.
  • Showcase the ability to filter, alias, and write subqueries.
  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.

Dataset

The following questions will use the elections database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar:

/class/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be /class/datamine/data/election/itcont1980.txt)

A public sample of the data can be found here:

https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt)

Questions

Important note: For this project you will need to connect to the database elections using the RMariaDB package in R. Include the following code chunk in the beginning of your RMarkdown file:

```{r setup-database-connection}
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(), 
                 host="scholar-db.rcac.purdue.edu", 
                 db="elections", 
                 user="elections_user", 
                 password="Dataelect!98")
```

When a question involves SQL queries in this project, you may use a SQL code chunk (with {sql}), or an R code chunk (with {r}) and functions like dbGetQuery as you did in Project 10. Please refer to Question 5 in the project template for examples.

1. Approximately how large was the lahman database (use the sqlite database in Scholar: /class/datamine/data/lahman/lahman.db)? Use UNIX utilities you've learned about this semester to write a line of code to return the size of that .db file (in MB).
The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder /class/datamine/data/election/. How much data (in MB) is there?
The data in that folder has been added to the elections database, all aggregated in the elections table. Write a SQL query that returns the number of rows of data in the elections table. How many rows of data are in the table elections?

Note: These are some examples of how to get the sizes of collections of files in UNIX.

Click here for video

Hint: The SQL query will take some time! Be patient.

Note: You may use more than one code chunk in your RMarkdown file for the different tasks.

Note: We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from ls and du to match, use the --apparent-size option with du.

Note: A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See here for more information. For this question, either solution will be given full credit. This is a potentially useful example.
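
For example, assuming GNU du on Scholar, one way to see both sizes is:

# allocated size on disk, in MB
du -s --block-size=1M /class/datamine/data/election/
# apparent size (the byte count ls reports), in MB
du -s --apparent-size --block-size=1M /class/datamine/data/election/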

Relevant topics: SQL, SQL in R, awk, ls, du

Item(s) to submit:

  • Line of code (bash/awk) to show the size (in MB) of the lahman database file.
  • Approximate size of the lahman database in MB.
  • Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in /class/datamine/data/election.
  • The size of the elections data in MB.
  • SQL query used to find the number of rows of data in the elections table in the elections database.
  • The number of rows in the elections table in the elections database.
Solution
ls -la /class/datamine/data/lahman/lahman.db | awk '{print $5/1000000 " MB"}'
ls -la /class/datamine/data/election/*.txt | awk '{ total += $5 }END{print total/1000000 " MB"}'
SELECT COUNT(*) FROM elections;
2. Write a SQL query using the LIKE command to find a unique list of zip_codes that start with "479".
Write another SQL query and answer: How many unique zip_codes are there that begin with "479"?

Note: Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project.

Click here for video

Hint: The first query returns a list of zip codes, and the second returns a count.

Hint: Make sure you only select zip_codes.

Relevant topics: SQL, LIKE

Item(s) to submit:

  • SQL queries used to answer the question.
  • The first 5 results from running the query.
Solution
SELECT DISTINCT zip_code FROM elections WHERE zip_code LIKE '479%' LIMIT 5;
SELECT COUNT(DISTINCT zip_code) FROM elections WHERE zip_code LIKE '479%';
3. Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an alias for our field so it doesn't read COUNT(*) but rather Indiana Donations.

Hint: You may enclose an alias's name in quotation marks (single or double) when the name contains a space.

Relevant topics: SQL, WHERE, aliasing

Item(s) to submit:

  • SQL query used to answer the question.
  • The result of the SQL query.
Solution
SELECT COUNT(*) FROM elections WHERE state='IN';
SELECT COUNT(*) AS 'Indiana Donations' FROM elections WHERE state='IN';
4. Rewrite the query in (3) so the result is displayed like: IN: 1234567. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to OH: 1234567. In other words, the state abbreviation should be dynamic, not static.

Note: This video demonstrates how to use CONCAT in a MySQL query.

Click here for video

Hint: Use CONCAT and aliasing to accomplish this.

Hint: Remember, state contains the state abbreviation.

Relevant topics: SQL, aliasing, CONCAT

Item(s) to submit:

  • SQL query used to answer the question.
Solution
SELECT CONCAT(state, ": ", COUNT(*)) AS 'Donations' FROM elections WHERE state='IN';
5. In (2) we wrote a query that returns a unique list of zip_codes that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip_codes starting with "479". What percent of donations in Indiana come from said zip_codes?

Note: This video gives two examples of sub-queries.

Click here for video

Hint: You can simply calculate the percent manually, using the counts from (3) and (5).

Relevant topics: SQL, aliasing, subqueries, IN

Item(s) to submit:

  • SQL queries used to answer the question.
  • The percentage of donations from Indiana from zip_codes starting with "479".
Solution
SELECT COUNT(*) FROM elections WHERE zip_code IN (SELECT DISTINCT zip_code FROM elections WHERE zip_code LIKE '479%');
-- divide this count by the Indiana donation count from question (3), and multiply by 100, to get the percentage
6. In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, the number of donations from each state, and the sum of the donations (transaction_amt). Which 5 states gave the most donations (highest count)? Order your result from most to least.

Note: In this video we demonstrate GROUP BY, ORDER BY, DESC, and other aspects of MySQL that might help with this question.

Click here for video

Hint: You may want to create an alias in order to sort.

Relevant topics: SQL, GROUP BY

Item(s) to submit:

  • SQL query used to answer the question.
  • Which 5 states gave the most donations?
Solution
SELECT state, COUNT(*) AS 'cnt', SUM(transaction_amt) AS 'total' FROM elections GROUP BY state ORDER BY cnt DESC;
7. Write a query that gets the number of donations, and sum of donations, by year, for Indiana. Create one or more graphics that highlight the year-by-year changes. Write 1-2 short sentences explaining your graphic(s).

Relevant topics: SQL in R, GROUP BY, date functions in SQL

Item(s) to submit:

  • SQL query used to answer the question.
  • R code used to create your graphic(s).
  • 1 or more graphics in png/jpeg format.
  • 1-2 sentences summarizing your graphic(s).
Solution
library(RMariaDB)
library(tidyverse)   # loads ggplot2, dplyr, etc.
db <- dbConnect(RMariaDB::MariaDB(), host="scholar-db.rcac.purdue.edu", db="elections", user="elections_user", password="Dataelect!98")
# number of donations (cnt) and total amount donated (amt), by year, for Indiana
dat <- dbGetQuery(db, "SELECT COUNT(*) as cnt, SUM(transaction_amt) as amt, YEAR(transaction_dt) as yr FROM elections WHERE state='IN' GROUP BY yr ORDER BY yr;")
# keep only the rows with years between 1978 and 2020
filtered_dat <- filter(dat, between(yr, 1978, 2020))
# total amount donated per year, in millions of dollars
ggplot(filtered_dat, aes(x=yr, y=amt/1e6)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  labs(x='Year', y='Amount in millions')
# average donation per year, in thousands of dollars
ggplot(filtered_dat, aes(x=yr, y=amt/(1e3*cnt))) +
  geom_line() +
  geom_point() +
  theme_bw() +
  labs(x='Year', y='Average amount in thousands')
# number of donations per year, in thousands
ggplot(filtered_dat, aes(x=yr, y=as.numeric(cnt)/1e3)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  labs(x='Year', y='Number of donations in thousands')

Project 12


Motivation: Databases are composed of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore, learn about, and practice using joins on a database containing bike trip information from the Bay Area Bike Share.

Context: We've introduced a variety of SQL commands that let you filter and extract information from a database in a systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.

Scope: SQL, sqlite, joins

Learning objectives:

  • Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.
  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.
  • Showcase the ability to filter, alias, and write subqueries.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db

A public sample of the data can be found here.

Important note: For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:

```{r, include=F}
library(RSQLite)
bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db")
```

Each solution should then be placed in a code chunk like this:

```{sql, connection=bikeshare}
SELECT * FROM station LIMIT 5;
```

If you want to use a SQLite-specific command like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

```{sql, connection=bikeshare, eval=F}
SELECT * FROM station LIMIT 5;
```

This will allow the code to be displayed without throwing an error.

There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform all of the other types of joins (RIGHT JOIN, FULL OUTER JOIN). You can see the documentation here: SQL, INNER JOIN, LEFT JOIN
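
As a quick illustration of the difference, here is a small sketch that builds two made-up tables, a and b, in a throwaway in-memory SQLite database and joins them both ways:

```{r}
library(RSQLite)
demo <- dbConnect(RSQLite::SQLite(), ":memory:")                # throwaway in-memory database
dbWriteTable(demo, "a", data.frame(id=c(1, 2, 3), x=c("a", "b", "c")))
dbWriteTable(demo, "b", data.frame(id=c(2, 3, 4), y=c("B", "C", "D")))
dbGetQuery(demo, "SELECT * FROM a INNER JOIN b ON a.id=b.id;")  # only ids 2 and 3 (matched in both tables)
dbGetQuery(demo, "SELECT * FROM a LEFT JOIN b ON a.id=b.id;")   # all rows of a; NA where b has no match
dbDisconnect(demo)
```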

Questions

1. Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the sqlite concatenate operator || (see here). Write a query that returns the first 5 records of information from the station table formatted in the following way:

(id) name @ (lat, long)

For example:

(84) Ryland Park @ (37.342725, -121.895617)

Hint: Here is a video about how to concatenate strings in SQLite.

Click here for video

Relevant topics: aliasing, concat

Item(s) to submit:

  • SQL query used to solve this problem.
  • The first 5 records of information from the station table.
Solution
SELECT '('||id||') '||name||' @ ('||lat||', '||long||')' FROM station LIMIT 5;
2. There is a variety of interesting weather information in the weather table. Write a query that finds the average mean_temperature_f by zip_code. Which is on average the warmest zip_code?

Use aliases to format the result in the following way:

Zip Code|Avg Temperature
94041|61.3808219178082

Note that this is the output if you use sqlite in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly.

Hint: Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT

Click here for video

Relevant topics: aliasing, GROUP BY, AVG

Item(s) to submit:

  • SQL query used to solve this problem.
  • The results of the query copy and pasted.
Solution
SELECT zip_code AS 'Zip Code', AVG(mean_temperature_f) AS 'Avg Temperature' FROM weather GROUP BY zip_code;

3. The trip table also has a zip_code column, containing the zip code entered by the user of the app. Write a query that counts the number of unique zip_codes in the trip table. Write another query that counts the number of trips for each zip_code. Do the zip_codes look consistent?

Solution
SELECT COUNT(DISTINCT zip_code) FROM trip;
SELECT zip_code, COUNT(zip_code) FROM trip GROUP BY zip_code;
4. In (2) we wrote a query that finds the average mean_temperature_f by zip_code. What if we want to tack our results from (2) onto the information from each row in the station table, matching on the zip_codes? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables.

Hint: Use the query from (2) as a sub query within your solution.

Hint: Here is a video about JOIN and LEFT JOIN.

Click here for video

Relevant topics: INNER JOIN, subqueries, aliasing

Item(s) to submit:

  • SQL query used to solve this problem.
Solution
SELECT * FROM station as s INNER JOIN (SELECT zip_code AS 'Zip Code', AVG(mean_temperature_f) AS 'Avg Temperature' FROM weather GROUP BY zip_code) AS sub ON s.zip_code=sub.'Zip Code';
5. In (3) we alluded to the fact that the zip_codes in the trip table aren't very consistent. Users can enter a zip code when using the app. This means that zip_code can be from anywhere in the world! With that being said, if the zip_code is one of the 5 zip_codes for which we have weather data (from question 2), we can add that weather information to matching rows of the trip table. In (4) we used an INNER JOIN to append some weather information to each row in the station table. For this question, write a query that performs an INNER JOIN and appends weather data from the weather table to the trip data from the trip table. Limit your output to 5 lines.

Important note: Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the trip table, you can use start_date for the date information.

Hint: You will want to wrap your dates and datetimes in sqlite's date function prior to comparison.

Relevant topics: INNER JOIN, aliasing

Item(s) to submit:

  • SQL query used to solve this problem.
  • First 5 lines of output.
Solution
SELECT * FROM trip AS t INNER JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date) LIMIT 5;
6. How many rows are in the result from (5) (when not limiting to 5 lines)? How many rows are in the trip table? As you can see, a large proportion of the data from the trip table did not match the data from the weather table, and therefore was removed from the result. What if we want to keep all of the data from the trip table and add on data from the weather table if we have a match? Write a query to accomplish this. How many rows are in the result?

Item(s) to submit:

  • SQL query used to find how many rows from the result in (5).
  • The number of rows in the result of (5).
  • SQL query to find how many rows are in the trip table.
  • The number of rows in the trip table.
  • SQL query to keep all of the data from the trip table and add on matching data from the weather table when available.
  • The number of rows in the result.
Solution
SELECT COUNT(*) FROM trip AS t INNER JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date);
SELECT COUNT(*) FROM trip;
SELECT * FROM trip AS t LEFT JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date) LIMIT 5;
SELECT COUNT(*) FROM trip AS t LEFT JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date);

Project 13


Motivation: Databases you will work with won't necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like primary key, and foreign key will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice!

Context: We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you've already learned, with a focus on subqueries and joins.

Scope: SQL, sqlite

Learning objectives:

  • Write and run SQL queries in sqlite on real-world data.
  • Identify primary and foreign keys in a SQL table.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/movies_and_tv/imdb.db

Important note: For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd:

```{r, include=F}
library(RSQLite)
imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
```

If you want to use a SQLite-specific command like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/movies_and_tv/imdb.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

```{sql, connection=imdb, eval=F}
SELECT * FROM titles LIMIT 5;
```

This will allow the code to be displayed without throwing an error.

Questions

1. A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the titles table. Do you think there are any primary keys? How about foreign keys? Now examine the episodes table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?

Hint: A primary key can also be a foreign key.

Hint: Here are two videos. The first video will remind you how to find the names of all of the tables in the imdb database. The second video will introduce you to the titles and episodes tables in the imdb database.

Click here for video

Click here for video
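
One way to examine the columns of each table from R is a sketch like the following; PRAGMA table_info is a SQLite feature that lists every column of a table:

```{r}
library(RSQLite)
imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
dbListTables(imdb)                                # names of all tables in the database
dbGetQuery(imdb, "PRAGMA table_info(titles);")    # column names and types in the titles table
dbGetQuery(imdb, "PRAGMA table_info(episodes);")  # column names and types in the episodes table
```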

Relevant topics: primary key, foreign key

Item(s) to submit:

  • List any primary or foreign keys in the titles table.
  • List any primary or foreign keys in the episodes table.
2. If you paste a title_id to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series Grey's Anatomy. Write a SQL query to confirm that the title_id tt0413573 does indeed belong to Grey's Anatomy. Then browse imdb.com and find your favorite TV show. Get the title_id from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database:
SELECT * FROM titles WHERE title_id='<title id here>';

Make sure to replace "<title id here>" with the title_id of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons.

Relevant topics: SELECT, WHERE

Item(s) to submit:

  • SQL query used to confirm that title_id tt0413573 does indeed belong to Grey's Anatomy.
  • The output of the query.
  • The title_id of your favorite TV show.
  • SQL query used to confirm the title_id for your favorite TV show.
  • The output of the query.
3. The episode_title_id column in the episodes table references the titles of individual episodes of a TV series. The show_title_id column references the title of the show itself. With that in mind, write a query that gets a list of all of the episode_title_ids (found in the episodes table), with the associated primary_title (found in the titles table) for each episode of Grey's Anatomy.

Relevant topics: INNER JOIN

Hint: This video shows how to extract titles of episodes in the imdb database.

Click here for video
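
The general shape of such a join might look something like the sketch below (one possible approach, not necessarily the only one):

```{r}
library(RSQLite)
imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
# join each episode row to the titles row for that episode,
# keeping only episodes whose show_title_id matches Grey's Anatomy
dbGetQuery(imdb, "SELECT e.episode_title_id, t.primary_title
                  FROM episodes AS e
                  INNER JOIN titles AS t ON e.episode_title_id = t.title_id
                  WHERE e.show_title_id = 'tt0413573'
                  LIMIT 5;")
```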

Item(s) to submit:

  • SQL query used to solve the problem in a code chunk.
4. We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of episode_title_ids (found in the episodes table), with the associated primary_title (found in the titles table) for each episode.

Hint: This part is just like question (3) but this time with your favorite TV show, which you chose in (2).

Hint: This video shows how to use a subquery, to JOIN a total of three tables in the imdb database.

Click here for video

Relevant topics: INNER JOIN, aliasing

Item(s) to submit:

  • SQL query used to solve the problem in a code chunk.
  • The first 5 results from your query.
5. Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode?

Relevant topics: INNER JOIN, aliasing, subqueries, ORDER BY, DESC, LIMIT

Note: Various helpful examples that utilize the relevant topics in this problem can be found here.

Item(s) to submit:

  • SQL query used to solve the problem in a code chunk.
  • The episode_title_id, primary_title, and rating of the top rated episode from your favorite TV series, in question (2).
  • A statement saying whether the highest rated episode is also your favorite episode.
6. Write a query that returns the season_number (from the episodes table), and average rating (from the ratings table) for each season of your favorite TV show from (2). Write another query that only returns the season number and rating for the highest rated season. Consider the highest rated season the season with the highest average.

Relevant topics: INNER JOIN, aliasing, GROUP BY, AVG

Item(s) to submit:

  • The 2 SQL queries used to solve the problems in two code chunks.
7. Write a query that returns the primary_title and rating of the highest rated episode per season for your favorite TV show from question (2).

Note: You can show one highest rated episode for each season, without the need to worry about ties.

Relevant topics: MAX, subqueries, GROUP BY, INNER JOIN, aliasing

Item(s) to submit:

  • The SQL query used to solve the problem.
  • The output from your query.

Project 14


Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.

Context: We've just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting with writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.

Scope: awk, UNIX utilities, bash scripts, fread

Learning objectives:

  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
  • Analyzing files in a UNIX filesystem: wc, du, cat, head, tail, etc.
  • Creating and destroying files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
  • Use grep to search files effectively.
  • Use cut to section off data from the command line.
  • Use piping to string UNIX commands together.
  • Use awk for data extraction, and preprocessing.
  • Create bash scripts to automate a process or processes.

Questions

1. Take a look at PLOTSNAP.csv (in the /class/datamine/data/forest folder). Write a line of awk code that displays the STATECD followed by the number of rows with that STATECD.

Relevant topics: awk

Item(s) to submit:

  • Code used to solve the problem.
  • Count of the following STATECDs: 1, 2, 4, 5, 6
2. Unfortunately, there isn't a very accessible list available that shows which state each STATECD represents. This is no problem for us, though: the dataset has LAT and LON! Write some bash that prints just the STATECD, LAT, and LON.

Note: There are 92 columns in our dataset: awk -F, 'NR==1{print NF}' PLOTSNAP.csv. To create a list of STATECD to state, we only really need STATECD, LAT, and LON. Keeping the other 89 variables would leave our data at 2.1 GB.

Relevant topics: cut, awk

Item(s) to submit:

  • Code used to solve the problem.
  • The output of your code piped to head.
4. We are going to further understand the data from question (3) by finding the actual locations based on the LAT and LON columns. We can use the library revgeo to get a location given a pair of longitude and latitude values. revgeo uses a free API hosted by photon in order to do so. For example:

library(revgeo)
revgeo(longitude=-86.926153, latitude=40.427055, output='frame')

The code above will give you the address information in six columns, from the most-granular housenumber to the least-granular country. Depending on the coordinates, revgeo may or may not give you results for each column. For this question, we are going to keep only the state column.

There are over 4 million rows in our dataset -- we do not want to hit photon's API that many times. Instead, we are going to do the following:

First: Unless you feel comfortable using data.table, convert your data.table to a data.frame:

my_dataframe <- data.frame(my_datatable)

Second: Calculate the average LAT and LON for each STATECD, and call the new data.frame dat. This should result in 57 rows of lat/long pairs.

Third: For each row in dat, run a reverse geocode and append the state to a new column called STATE.

Hint: To calculate the average LAT and LON for each STATECD, you could use the sqldf package to run SQL queries on your data.frame.
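
For example, a minimal sqldf sketch, assuming your data.frame is called my_dataframe as in the First step above:

```{r}
library(sqldf)
# average LAT and LON for each STATECD, via a SQL query run directly on the data.frame
dat <- sqldf("SELECT STATECD, AVG(LAT) AS LAT, AVG(LON) AS LON FROM my_dataframe GROUP BY STATECD")
head(dat)
```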

Hint: mapply is a useful apply function to use to solve this problem.

Hint: Here is some extra help:

library(revgeo)

points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866))

# Note that the "output" argument gets passed to the "revgeo" function.
mapply(revgeo, points$longitude, points$latitude, output="frame")

# The output isn't in a great format, and we'd prefer to just get the "state" data.
# Let's wrap "revgeo" into another function that just gets "state" and try again.
get_state <- function(lon, lat) {
  return(revgeo(lon, lat, output="frame")["state"])
}

mapply(get_state, points$longitude, points$latitude)

Important note: It is okay to get "Not Found" for some of the addresses.

Relevant topics: apply functions, sqldf

Item(s) to submit:

  • Code used to solve the problem.
  • The head of the resulting data.frame.
5. Use the leaflet, addTiles, and addCircles functions from the leaflet package to plot our average latitude and longitude data from question (4) on a map (there should be a total of 57 lat/long pairs).

Hint: See here for an example of adding points to a map.
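
A minimal sketch of the pattern, assuming dat is the data.frame from question (4) with LAT and LON columns:

```{r}
library(leaflet)
# draw one circle per row of dat, at its average latitude/longitude
leaflet(dat) %>%
  addTiles() %>%
  addCircles(lng=~LON, lat=~LAT)
```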

Relevant topics: leaflet

Item(s) to submit:

  • Code used to create the map.
  • The map itself as output from running the code chunk.
6. Write a bash script that accepts at least 1 argument, and performs a useful task using at least 1 dataset from the forest folder in /class/datamine/data/forest. An example of a useful task could be printing a report of summary statistics for the data. Feel free to get creative. Note that tasks must be non-trivial -- a bash script that counts the number of lines in a file is not appropriate. Make sure to properly document (via comments) what your bash script does. Also ensure that your script returns columnar data with appropriate separating characters (for example a csv).

Relevant topics: bash scripts, awk, unix utilities

Item(s) to submit:

  • The content of your bash script starting from #!/bin/bash.
  • Example output from running your script as intended.
  • A description of what your script does.
7. You used fread in question (2). Now use the cmd argument in conjunction with your script from (6) to read the script output into a data.table in your R environment.
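
A minimal sketch of the fread pattern is shown below; my_script.sh and its argument are hypothetical placeholders for your own script from (6):

```{r}
library(data.table)
# run the (hypothetical) bash script and read its csv output directly into a data.table
dat <- fread(cmd="bash my_script.sh /class/datamine/data/forest/PLOTSNAP.csv")
head(dat)
```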

Relevant topics: fread

Item(s) to submit:

  • The R code used to read in and preprocess your data using your bash script from (6).
  • The head of the resulting data.table.

Project 15


Motivation: We've done a lot of work with SQL this semester. Let's review concepts in this project and mix and match R and SQL to solve data-driven problems.

Context: In this project, we will reinforce topics you've already learned, with a focus on SQL.

Scope: SQL, sqlite, R

Learning objectives:

  • Write and run SQL queries in sqlite on real-world data.
  • Use SQL from within R.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/movies_and_tv/imdb.db

F.R.I.E.N.D.S is a popular TV show. It has an interesting naming convention for the names of its episodes: they all begin with the text "The One ...". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let's use SQL and R to take a look at how many times each character's name appears in the titles of the episodes.

Questions

1. Write a query that gets the episode_title_id, primary_title, rating, and votes, of all of the episodes of Friends (title_id is tt0108778).

Hint: You can slightly modify the solution to question (5) in project 13.

Relevant topics: INNER JOIN, subqueries, aliasing

Item(s) to submit:

  • SQL query used to answer the question.
  • First 5 results of the query.
2. Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to use regular expressions to search for text. For each character, how many episodes' primary_titles contain their name?
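
Counting matches with grepl might look something like the sketch below, assuming dat is the data frame returned by your query and has a primary_title column:

```{r}
characters <- c("Chandler", "Joey", "Monica", "Phoebe", "Rachel", "Ross")
# for each character, count how many episode titles contain that name
sapply(characters, function(name) sum(grepl(name, dat$primary_title)))
```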

Relevant topics: SQL in R, grep

Item(s) to submit:

  • R code in a code chunk that was used to find the solution.
  • The solution pasted below the code chunk.
3. Create a graphic showing our results in (2) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.
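
For example, a barplot sketch, assuming counts is a named vector holding the results from question (2):

```{r}
# one bar per character, using some of the suggested colors
barplot(counts,
        col=c("#273c8b", "#bd253a", "#016f7c", "#f56934", "#016c5a", "#9055b1"),
        main="Episode titles mentioning each character",
        xlab="Character", ylab="Number of episode titles")
```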

Relevant topics: plotting

Item(s) to submit:

  • The R code used to generate the graphic.
  • The graphic in a png or jpg/jpeg format.
4. Now we will turn our focus to other information in the database. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see the type column from the titles table): Romance, Comedy, Animation. In the titles table, you can find the genres in the genres column. There may be some overlap (i.e. a movie may have more than one genre); this is ok.

To query rows which have the genre Action as one of its genres:

SELECT * FROM titles WHERE genres LIKE '%action%';

Relevant topics: LIKE, INNER JOIN

Item(s) to submit:

  • Any code you used to solve the problem in a code chunk.
  • The average rating of each of the genres listed for movies.
5. Write a function called top_episode in R which accepts the path to the imdb.db database, as well as the title_id of a tv series (for example, "tt0108778" or "tt1266020"), and returns the season_number, episode_number, primary_title, and rating of the highest rated episode in the series. Test it out on some of your favorite series, and share the results.
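
Wrapping a parameterized query inside an R function might look something like the sketch below; show_info is a hypothetical example to show the shape, not the requested top_episode function:

```{r}
library(RSQLite)
# hypothetical example: look up the titles row for a given title_id
show_info <- function(db_path, title_id) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con))
  dbGetQuery(con, "SELECT * FROM titles WHERE title_id = ?", params=list(title_id))
}
show_info("/class/datamine/data/movies_and_tv/imdb.db", "tt0108778")
```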

Relevant topics: functions, INNER JOIN, ORDER BY

Item(s) to submit:

  • Any code you used to solve the problem in a code chunk.
  • The results for at least 3 of your favorite tv series.