Fall 2020 projects
STAT 19000
Project 1
Motivation: In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called RStudio, we will take some time to learn how to connect to it, configure it, and run code.
Context: This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data!
Scope: r, rstudio, Scholar
Learning objectives:
- Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.
- Install R and setting up a working environment.
- Explain and demonstrate: positional, named, and logical indexing.
- Read and write basic (csv) data.
Make sure to read about, and use the template found here, and the important information about projects submissions here.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/disney/splash_mountain.csv
Questions
1. Read the webpage here. Scholar is the computing cluster you will be using throughout the semester, and your time with The Data Mine. Each node is an individual machine with CPUs and memory (RAM). How many cores and how much memory is available, in total, for our 7 frontend nodes? How about for the sub-clusters? How much is available on your computer or laptop?
Item(s) to submit:
- A sentence explaining how much memory and how many cores the 7 frontends have combined.
- A sentence explaining how much memory and how many cores the 28 sub-clusters have combined.
- A sentence explaining how much memory and how many cores your personal computer has.
Solution
The memory on the 7 frontend nodes is: 4 * 512 + 3 * 768 = 4352 GB = 4.3 TB.
The memory on the 28 nodes in the sub-cluster is: 24 * 64 + 4 * 192 = 2304 GB = 2.3 TB.
The memory on Dr. Ward's laptop is 8 GB.
The number of cores on the 7 frontend nodes is: 4 * 20 + 3 * 20 = 140 cores.
The number of cores on the 28 nodes in the sub-cluster is: 24 * 20 + 4 * 16 = 544 cores.
The number of cores on Dr. Ward's laptop is 2.
Press Control and Enter/Return keys at the same time, to run this line. Which frontend are you in?
Relevant topics: running R code
Item(s) to submit:
-
The
#
of the frontend your RStudio Server session is running on.
Solution
The node we are working on is:
system("hostname", intern=T)
3. From within RStudio, we can run every type of code that you will need to run throughout your time with The Data Mine: Python, R, Bash, SQL, etc. We've created a one-time setup script for you to run, called
/class/datamine/apps/runme.sh
(as seen in the video at the top of this page).
After you restart R (as in the video, after 4 minutes and 16 seconds), there should be a message that is printed in your "Console" tab. What does the message say?
Item(s) to submit:
- The sentence that is printed in the RStudio "Console".
Solution
The welcome message is: "You've successfully loaded The Data Mine R settings!"
4. Projects in The Data Mine should all be submitted using our template found here or on Scholar (/class/datamine/apps/templates/project_template.Rmd
). At the beginning of every project, the first step should be downloading and/or copying and pasting the template into a .Rmd
file in RStudio. This is also demonstrated in the video at the top of this page.
Open the project template and save it into your home directory, in a new RMarkdown file named project01.Rmd
.
Code chunks are parts of the RMarkdown file that contains code. You can identify what type of code a code chunk contains by looking at the engine in the curly braces "{" and "}". How many of each type of code chunk are in our default template?
Hint: You can read about the template here.
Item(s) to submit:
- A list containing the type of code chunk (r, Python, sql, etc), and how many of each code chunks our default template contains.
Solution
There are 3 chunks of R code, 1 chunk of bash, 1 chunk of Python, 1 chunk of SQL
Solution
We store 1, 2, 3
into the variable my_variable, and then we display output: 1, 2, 3
my_variable <- c(1,2,3)
my_variable
## [1] 1 2 3
6. In question (1) we answered questions about CPUs and RAM for the Scholar cluster. To do so, we needed to perform some arithmetic. Instead of using a calculator (or paper), write these calculations using R. Replace the content of the second R code chunk in our template with your calculations.
Relevant topics: templates
Item(s) to submit:
- The R code chunk with your calculations, and output.
Solution
We go back to question 1 and compute directly
4 * 512 + 3 * 768
## [1] 4352
24 * 64 + 4 * 192
## [1] 2304
4 * 20 + 3 * 20
## [1] 140
24 * 20 + 4 * 16
## [1] 544
7. In (6) we got to see how you can type out arithmetic and R will calculate the result for you. One constant throughout the semester will be loading datasets into R. Load our dataset into R by running the following code:
dat <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
Confirm the dataset has been read in by running the head
function on it. head
prints the first few rows of data:
head(dat)
dat
is a variable which contains our data! We can name this variable anything we want, we do not have to name it dat
. Run our code to read in our dataset, this time, instead of naming our resulting dataset dat
, name it splash_mountain
. Place all of your code into a new R code chunk under a new level 3 header (i.e. ### Question 7
).
Relevant topics: reading data in R
Item(s) to submit:
- Code used to answer this question in a code chunk in our template.
-
Output of
head
.
Solution
We load in splash_mountain data and display the head
splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
head(splash_mountain)
## date datetime SACTMIN SPOSTMIN
## 1 01/01/2015 2015-01-01 07:51:12 NA 5
## 2 01/01/2015 2015-01-01 08:02:13 NA 5
## 3 01/01/2015 2015-01-01 08:09:12 NA 5
## 4 01/01/2015 2015-01-01 08:16:12 NA 5
## 5 01/01/2015 2015-01-01 08:23:12 NA 5
## 6 01/01/2015 2015-01-01 08:29:12 NA 5
8. Let's pretend we are now done with our project. We've written some R code, maybe added some text explaining what we did, and we are ready to turn things in. For this course, we will turn in a variety of work, depending on the type of project.
We will always require a PDF which contains text, code, and code output. Normally we would erase any code chunks from the template that are not used, however, for this project, it is OK to just keep the rest of the template intact.
In addition, if the project uses R code, you will need to also submit R code in an R script (file ending with .R
). (Later this year, when submitting Python code, you will submit a Python script instead.)
Let's practice. Take the code from your project01.R
file and paste it (perhaps one or two lines at time) into your RMarkdown file (file ending with .Rmd
).
Compile your RMarkdown project into a PDF. Follow the directions in Brightspace to upload and submit your RMarkdown file, compiled PDF, and R script.
Relevant topics: templates
Item(s) to submit:
- Resulting knitted PDF.
-
project01.R
script (with all of your R code) and the analogousproject01.Rmd
file.
How to build the R script for Project 1 in STAT 19000.
In the videos below, for Question 1 and Question 6, Dr. Ward forgot to calculate the number of cores. (He only included the total amount of memory.) Dr. Ward is a human being who sometimes makes mistakes. Please remember to (also) calculate the number of cores, when you submit Question 1 and Question 6!
Project 2
Introduction to R using 84.51 examples
Introduction to R using NYC Yellow Taxi Cab examples
Motivation: The R environment is a powerful tool to perform data analysis. R is a tool that is often compared to Python. Both have their advantages and disadvantages, and both are worth learning. In this project we will dive in head first and learn the basics while solving data-driven problems.
Context: Last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some R code. In this project, we will continue to use R within RStudio to solve problems. Soon you will see how powerful R is and why it is often a more effective tool to use than spreadsheets.
Scope: r, vectors, indexing, recycling
Learning objectives:
- List the differences between lists, vectors, factors, and data.frames, and when to use each.
- Explain and demonstrate: positional, named, and logical indexing.
- Read and write basic (csv) data.
- Explain what "recycling" is in R and predict behavior of provided statements.
- Identify good and bad aspects of simple plots.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/disney/metadata.csv
A public sample of the data can be found here: /class/datamine/data/disney/metadata.csv
Questions
1. Use the read.csv
function to load /class/datamine/data/disney/metadata.csv into a data.frame
called myDF
. Note that read.csv
by default loads data into a data.frame
. (We will learn more about the idea of a data.frame
, but for now, just think of it like a spreadsheet, in which each column has the same type of data.) Print the first few rows of myDF
using the head
function (as in Project 1, Question 7).
Relevant topics: reading data in r, head
Item(s) to submit:
- R code used to solve the problem in an R code chunk.
Solution
We load the data from the disney metadata into a data frame called myDF
myDF <- read.csv("/class/datamine/data/disney/metadata.csv")
2. We've provided you with R code below that will extract the column WDWMAXTEMP
of myDF
into a vector. What is the 1st value in the vector? What is the 50th value in the vector? What type of data is in the vector? (For this last question, use the typeof
function to find the type of data.)
our_vec <- myDF$WDWMAXTEMP
Relevant topics: indexing in r, type, creating variables
Item(s) to submit:
- R code used to solve the problem in an R code chunk.
- The values of the first, and 50th element in the vector.
-
The type of data in the vector (using the
typeof
function).
Solution
We first load the column WDWMAXTEMP
of the data frame myDF
into a vector:
our_vec <- myDF$WDWMAXTEMP
The 1st element is:
our_vec[1]
## [1] 73.02
The 50th element is:
our_vec[50]
## [1] 51.24
The type of data in the vector is:
typeof(our_vec)
## [1] "double"
3. Use the head function to create a vector called first50
that contains the first 50 values of the vector our_vec
. Use the tail function to create a vector called last50
that contains the last 50 values of the vector our_vec
.
You can access many elements in a vector at the same time. To demonstrate this, create a vector called mymix
that contain the sum of each element of first50
being added to the analogous element of last50
.
Relevant topics: indexing in r, creating variables, head and tail
Item(s) to submit:
- R code used to solve this problem.
- The contents of each of the three vectors.
Solution
We store the first 50 values of our_vec
in first50
:
first50 <- head(our_vec, n=50)
Here are the first 50 values:
first50
## [1] 73.02 78.00 83.12 83.93 72.30 77.67 67.24 59.44 54.89 67.16 77.10 78.24
## [13] 74.89 64.51 62.12 63.58 72.83 70.59 69.89 71.03 75.78 77.09 78.40 70.77
## [25] 63.48 65.48 66.78 61.13 70.11 71.89 69.80 73.34 75.02 66.33 73.02 65.22
## [37] 66.33 71.78 73.58 70.42 64.23 66.96 71.39 59.54 63.24 73.42 74.11 78.70
## [49] 60.48 51.24
We store the last 50 values of our_vec
in last50:
last50 <- tail(our_vec, n=50)
Here are the last 50 values:
last50
## [1] 78.73 78.38 75.73 74.79 76.84 81.25 79.42 77.66 81.15 82.07 70.62 82.63
## [13] 79.12 78.55 76.34 74.10 78.62 74.31 67.69 73.30 74.51 77.66 81.15 82.07
## [25] 70.62 82.63 70.54 57.34 62.29 72.54 62.47 72.31 75.78 72.01 78.48 81.57
## [37] 82.06 80.45 82.01 76.05 79.38 79.38 79.38 79.74 70.41 68.23 58.31 64.29
## [49] 64.29 64.29
The sums of these two vectors, element by element, are:
mymix <- first50 + last50
Here are these 50 values:
mymix
## [1] 151.75 156.38 158.85 158.72 149.14 158.92 146.66 137.10 136.04 149.23
## [11] 147.72 160.87 154.01 143.06 138.46 137.68 151.45 144.90 137.58 144.33
## [21] 150.29 154.75 159.55 152.84 134.10 148.11 137.32 118.47 132.40 144.43
## [31] 132.27 145.65 150.80 138.34 151.50 146.79 148.39 152.23 155.59 146.47
## [41] 143.61 146.34 150.77 139.28 133.65 141.65 132.42 142.99 124.77 115.53
4. In (3) we were able to rapidly add values together from two different vectors. Both vectors were the same size, hence, it was obvious which elements in each vector were added together.
Create a new vector called hot
which contains only the values of myDF$WDWMAXTEMP
which are greater than or equal to 80 (our vector contains max temperatures for days at Disney World). How many elements are in hot
?
Calculate the sum of hot
and first50
. Do we get a warning? Read this and then explain what is going on.
Relevant topics: logical indexing, length, recycling
Item(s) to submit:
- R code used to solve this problem.
- 1-2 sentences explaining what is happening when we are adding two vectors of different lengths.
Solution
The values of myDF$WDWMAXTEMP
that are greater than or equal to 80 are:
hot <- myDF$WDWMAXTEMP[myDF$WDWMAXTEMP >= 80]
The length of the vector hot
is:
length(hot)
## [1] 1255
If we calculate the sum of hot
and first50
,
mynewsum <- hot + first50
## Warning in hot + first50: longer object length is not a multiple of shorter
## object length
we get a warning that hot
and first50
have lengths that are not multiples of each other (so it does not make sense to be adding them, element by element).
5. Plot the WDWMAXTEMP
vector from myDF
.
Item(s) to submit:
- R code used to solve this problem.
-
Plot of the
WDWMAXTEMP
vector frommyDF
.
Relevant topics: plotting
Solution
We plot the max_temp
vector from myDF
:
plot(myDF$WDWMAXTEMP)
6. The following three pieces of code each create a graphic. The first two graphics are created using only core R functions. The third graphic is created using a package called ggplot
. We will learn more about all of these things later on. For now, pick your favorite graphic, and write 1-2 sentences explaining why it is your favorite, what could be improved, and include any interesting observations (if any).
dat <- table(myDF$SEASON)
dotchart(dat, main="Seasons", xlab="Number of Days in Each Season")
dat <- tapply(myDF$WDWMEANTEMP, myDF$DAYOFYEAR, mean, na.rm=T)
seasons <- tapply(myDF$SEASON, myDF$DAYOFYEAR, function(x) unique(x)[1])
pal <- c("#4E79A7", "#F28E2B", "#A0CBE8", "#FFBE7D", "#59A14F", "#8CD17D", "#B6992D", "#F1CE63", "#499894", "#86BCB6", "#E15759", "#FF9D9A", "#79706E", "#BAB0AC", "#1170aa", "#B07AA1")
colors <- factor(seasons)
levels(colors) <- pal
par(oma=c(7,0,0,0), xpd=NA)
barplot(dat, main="Average Temperature", xlab="Jan 1 (Day 0) - Dec 31 (Day 365)", ylab="Degrees in Fahrenheit", col=as.factor(colors), border = NA, space=0)
legend(0, -30, legend=levels(factor(seasons)), lwd=5, col=pal, ncol=3, cex=0.8, box.col=NA)
library(ggplot2)
library(tidyverse)
summary_temperatures <- myDF %>%
select(MONTHOFYEAR,WDWMAXTEMP:WDWMEANTEMP) %>%
group_by(MONTHOFYEAR) %>%
summarise_all(mean, na.rm=T)
ggplot(summary_temperatures, aes(x=MONTHOFYEAR)) +
geom_ribbon(aes(ymin = WDWMINTEMP, ymax = WDWMAXTEMP), fill = "#ceb888", alpha=.5) +
geom_line(aes(y = WDWMEANTEMP), col="#5D8AA8") +
geom_point(aes(y = WDWMEANTEMP), pch=21,fill = "#5D8AA8", size=2) +
theme_classic() +
labs(x = 'Month', y = 'Temperature', title = 'Average temperature range' ) +
scale_x_continuous(breaks=1:12, labels=month.abb)
Solution
The seasons plot makes it easy to compare the values. On the other hand, the values do not appear to be given in any particular order.
The average temperature plot clearly indicates the rise and fall of the temperatures over the course of the year, but the color scheme is unusual. The colors in the plot do not seem to correspond to the colors in the legend. Additionally, the color schemes in both the plot and the legend seem to be chosen randomly.
Project 3
Motivation: data.frame
s are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a data.frame
.
Context: In the previous project we got our feet wet, and ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we've already learned and introduce a new, flexible data structure called data.frame
s.
Scope: r, data.frames, recycling, factors
Learning objectives:
- Explain what "recycling" is in R and predict behavior of provided statements.
- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- List the differences between lists, vectors, factors, and data.frames, and when to use each.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/disney
Questions
1. Read the dataset /class/datamine/data/disney/splash_mountain.csv
into a data.frame called splash_mountain
. How many columns, or features are in each dataset? How many rows or observations?
Item(s) to include:
- R code used to solve the problem.
- How many columns or features in each dataset?
Solution
We read in the Splash Mountain data set.
splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
We can use either the str
or the dim
function to see that this data frame has 223936 rows and 4 columns.
str(splash_mountain)
## 'data.frame': 223936 obs. of 4 variables:
## $ date : chr "01/01/2015" "01/01/2015" "01/01/2015" "01/01/2015" ...
## $ datetime: chr "2015-01-01 07:51:12" "2015-01-01 08:02:13" "2015-01-01 08:09:12" "2015-01-01 08:16:12" ...
## $ SACTMIN : int NA NA NA NA NA NA NA NA NA 4 ...
## $ SPOSTMIN: int 5 5 5 5 5 5 5 5 5 NA ...
dim(splash_mountain)
## [1] 223936 4
2. Splash Mountain is a fan favorite ride at Disney World's Magic Kingdom theme park. splash_mountain
contains a series of dates and datetimes. For each datetime, splash_mountain
contains a posted minimum wait time, SPOSTMIN
, and an actual minimum wait time, SACTMIN
. What is the average posted minimum wait time for Splash Mountain? What is the standard deviation? Based on the fact that SPOSTMIN
represents the posted minimum wait time for our ride, does our mean and standard deviation make sense? Explain. (You might look ahead to Question 3 before writing the answer to Question 2.)
Hint: If you got NA
or NaN
as a result, see here.
Relevant topics: mean, var, NA, NaN
Item(s) to submit:
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences explaining why or why not the results make sense.
Solution
The average minimum posted wait time for Splash Mountain is
mean(splash_mountain$SPOSTMIN, na.rm=T)
## [1] -71.70373
and the standard deviation is
sd(splash_mountain$SPOSTMIN, na.rm=T)
## [1] 328.0586
This is strange because the average minimum posted wait time should not be a negative value, and the standard deviation seems to be very large.
3. In (2) we got some peculiar values for the mean and standard deviation. If you read the "attractions" tab in the file /class/datamine/data/disney/touringplans_data_dictionary.xlsx
, you will find that -999 is used as a value in SPOSTMIN
and SACTMIN
to indicate the ride as being closed. Recalculate the mean and standard deviation of SPOSTMIN
, excluding values that are -999. Does this seem to have fixed our problem?
Relevant topics: NA, mean, var, indexing, which
Item(s) to submit:
- R code used to solve this problem.
- The result of running the R code.
- A statement indicating whether or not the value look reasonable now.
Solution
If we remove the -999 values, then the average minimum posted wait time (with the -999 values removed) is
mean(splash_mountain$SPOSTMIN[splash_mountain$SPOSTMIN != -999], na.rm=T)
## [1] 43.3892
and the standard deviation (with the -999 values removed) is
sd(splash_mountain$SPOSTMIN[splash_mountain$SPOSTMIN != -999], na.rm=T)
## [1] 31.74894
This looks more reasonable!
4. SPOSTMIN
and SACTMIN
aren't the greatest feature/column names. An outsider looking at the data.frame wouldn't be able to immediately get the gist of what they represent. Change SPOSTMIN
to posted_min_wait_time
and SACTMIN
to actual_wait_time
.
Hint: You can always use hard-coded integers to change names manually, however, if you use which
, you can get the index of the column name that you would like to change. For data.frames like splash_mountain
, this is a lot more efficient than manually counting which column is the one with a certain name.
Relevant topics: colnames, names, which
Item(s) to submit:
- R code used to solve the problem.
-
The output from executing
names(splash_mountain)
orcolnames(splash_mountain)
.
Solution
The current column names of splash_mountain
are:
colnames(splash_mountain)
## [1] "date" "datetime" "SACTMIN" "SPOSTMIN"
We can find which column is called SPOSTMIN
which(colnames(splash_mountain) == "SPOSTMIN")
## [1] 4
and change this column name to posted_min_wait_time
colnames(splash_mountain)[4] <- "posted_min_wait_time"
Next, we can find which column is called SACTMIN
which(colnames(splash_mountain) == "SACTMIN")
## [1] 3
and change this column name to actual_wait_time
colnames(splash_mountain)[3] <- "actual_wait_time"
Alternatively, we could have used names instead of colnames
and the effect would have been totally the same.
5. Use the cut
function to create a new vector called quarter
that breaks the date
column up by quarter. Use the labels
argument in the factor
function to label the quarters "q1", "q2", ..., "qX" where X
is the last quarter. Add quarter
as a column named quarter
in splash_mountain
. How many quarters are there?
Hint: If you have 2 years of data, this will result in 8 quarters: "q1", ..., "q8".
Hint: We can generate sequential data using seq
and paste0
:
paste0("item", seq(1, 5))
## [1] "item1" "item2" "item3" "item4" "item5"
or
paste0("item", 1:5)
## [1] "item1" "item2" "item3" "item4" "item5"
Relevant topics: cut, dates, factor, paste0, seq, nlevels
Item(s) to submit:
- R code used to solve the problem.
-
The
head
andtail
ofsplash_mountain
. -
The number of quarters in the new
quarter
column.
Question 5 is intended to be a little more challenging, so we worked through the exact same steps, with two other data sets. That way, if you work through these, all you will need to do, to solve Question 5, is to follow the example, and change two things, namely, the data set itself (in the read.csv
file) and also the format of the date.
This basically steps you through everything in Question 5.
We hope that these are helpful resources for you! We appreciate you very much and we are here to support you! You would not know how to solve this question on your own--because we are just getting started--but we like to sometimes put in a question like this, in which you get introduced to several new things, and we will dive deeper into these ideas as we push ahead.
Solution
We first take the date column from splash_mountain
and treat it as Date
values, and we break the dates according to which quarter they are in.
myresults <- cut((as.Date(splash_mountain$date,"%m/%d/%Y")), breaks="quarter")
There are 20 quarters altogether:
nlevels(myresults)
## [1] 20
We set the levels to be a letter q
and then a number between 1
and nlevels(myresults)
levels(myresults) <- paste0("q",1:nlevels(myresults))
splash_mountain$quarters <- myresults
head(splash_mountain)
## date datetime actual_wait_time posted_min_wait_time quarters
## 1 01/01/2015 2015-01-01 07:51:12 NA 5 q1
## 2 01/01/2015 2015-01-01 08:02:13 NA 5 q1
## 3 01/01/2015 2015-01-01 08:09:12 NA 5 q1
## 4 01/01/2015 2015-01-01 08:16:12 NA 5 q1
## 5 01/01/2015 2015-01-01 08:23:12 NA 5 q1
## 6 01/01/2015 2015-01-01 08:29:12 NA 5 q1
6. Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.
Project 4
Motivation: Control flow is (roughtly) the order in which instructions are executed. We can execute certain tasks or code if certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops.
Context: We are gaining familiarity working in RStudio and writing R code. In this project we introduce and practice using control flow in R.
Scope: r, data.frames, recycling, factors, if/else, for
Learning objectives:
- Explain what "recycling" is in R and predict behavior of provided statements.
- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- List the differences between lists, vectors, factors, and data.frames, and when to use each.
- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/disney
Questions
Solution
splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
sum_posted_wait_time <- 0
not_nas_and_nines <- 0
for (i in splash_mountain$SPOSTMIN) {
if (!is.na(i) && i != -999) {
sum_posted_wait_time <- sum_posted_wait_time + i
not_nas_and_nines <- not_nas_and_nines + 1
}
}
mean_posted_wait_time <- sum_posted_wait_time/not_nas_and_nines
mean_posted_wait_time
2. Choose one of the .csv
files containing data for a ride. Use read.csv
to load the file into a data.frame named ride_name
where "ride_name" is the name of the ride you chose. Use a for loop to loop through the ride file and add a new column called status
. status
should contain a string whose value is either "open", or "closed". If SPOSTMIN
or SACTMIN
is -999, classify the row as "closed". Otherwise, classify the row as "open". After status
is added to your data.frame, convert the column to a factor
.
Hint: If you want to access two columns at once from a data.frame, you can do: splash_mountain[i, c("SPOSTMIN", "SACTMIN")]
.
Relevant topics: any, for loops, if/else statements, nrow
Note: For loops are often much slower (here is a video to demonstrate) than vectorized functions, as we will see in (3) below.
Item(s) to submit:
- R code used to solve the problem w/comments explaining what the code does.
-
The output from running
str
onride_name
.
In this video, we basically go all the way through Question 2 using a video:
Solution
splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
status <- c()
for (i in 1:nrow(splash_mountain)) {
if (any(splash_mountain[i, c("SACTMIN", "SPOSTMIN")]==-999, na.rm=T)) {
status[i] <- "closed"
}else {
status[i] <- "open"
}
}
splash_mountain$status <- factor(status)
3. Typically you want to avoid using for loops (or even apply functions (we will learn more about these later on, don't worry)) when they aren't needed. Instead you can use vectorized operations and indexing. Repeat (2) without using any for loops or apply functions (instead use indexing and the which
function). Which method was faster?
Hint: To have multiple conditions within the which
statement, use |
for logical OR and &
for logical AND.
Hint: You can start by assigning every value in status
as "open", and then change the correct values to "closed".
Note: Here is a complete example (very much like question 3) with another video that shows how we can classify objects.
Note: Here is a complete example with a video that makes a comparison between the concept of a for loop versus the concept for a vectorized function.
Relevant topics: which
Item(s) to submit:
- R code used to solve the problem w/comments explaining what the code does.
-
The output from running
str
onride_name
.
Solution
splash_mountain <- read.csv("/class/datamine/data/disney/splash_mountain.csv")
splash_mountain$status <- "open"
splash_mountain$status[which(splash_mountain$SPOSTMIN == -999 | splash_mountain$SACTMIN == -999)] <- c("closed")
splash_mountain$status <- factor(splash_mountain$status)
4. Create a pie chart for open vs. closed for splash_mountain.csv
. First, use the table
command to get a count of each status
. Use the resulting table as input to the pie
function. Make sure to give your pie chart a title that somehow indicates the ride to the audience.
Item(s) to submit:
- R code used to solve the problem w/comments explaining what the code does.
- The resulting plot displayed as output in the RMarkdown.
Solution
pie(table(splash_mountain$status), main="Splash Mountain")
5. Loop through the vector of files we've provided below, and create a pie chart of open vs closed for each ride. Place all 6 resulting pie charts on the same image. Make sure to give each pie chart a title that somehow indicates the ride.
ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean", "expedition_everest", "flight_of_passage", "rock_n_rollercoaster")
ride_files <- paste0("/class/datamine/data/disney/", ride_names, ".csv")
Hint: To place all of the resulting pie charts in the same image, prior to running the for loop, run par(mfrow=c(2,3))
.
Relevant topics: for loop, read.csv, pie, table, par
This is not exactly the same, but it is a similar example, using the campaign election data:
mypiechart <- function(x) {
myDF <- read.csv( paste0("/class/datamine/data/election/itcont", x, ".txt"), sep="|")
mystate <- rep("other", times=nrow(myDF))
mystate[myDF$STATE == "CA"] <- "California"
mystate[myDF$STATE == "TX"] <- "Texas"
mystate[myDF$STATE == "NY"] <- "New York"
myDF$stateclassification <- factor(mystate)
pie(table(myDF$stateclassification))
}
myyears <- c("1980","1984","1988","1992","1996","2000")
par(mfrow=c(2,3))
for (i in myyears) {
mypiechart(i)
}
Here is another video, which guides students even more closely through Question 5.
Item(s) to submit:
- R code used to solve the problem w/comments explaining what the code does.
- The resulting plot displayed as output in the RMarkdown.
Solution
ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean", "flight_of_passage", "expedition_everest", "rock_n_rollercoaster")
ride_files <- paste0(c("/class/datamine/data/disney/"), ride_names, ".csv")
par(mfrow=c(2,3))
for (i in 1:length(ride_names)) {
dat <- read.csv(ride_files[i])
dat$status <- "open"
dat$status[which(dat$SPOSTMIN == -999 | dat$SACTMIN == -999)] <- c("closed")
dat$status <- factor(dat$status)
pie(table(dat$status), main=ride_names[i])
}
Project 5
Motivation: As briefly mentioned in project 4, R differs from other programming languages in that typically you will want to avoid using for loops, and instead use vectorized functions and the apply suite. In this project we will demonstrate some basic vectorized operations, and how they are better to use than loops.
Context: While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.
Scope: r, data.frames, recycling, factors, if/else, for
Learning objectives:
- Explain what "recycling" is in R and predict behavior of provided statements.
- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- List the differences between lists, vectors, factors, and data.frames, and when to use each.
- Demonstrate a working knowledge of control flow in r: for loops .
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/fars
To get more information on the dataset, see here.
Questions
1. The fars
dataset contains a series of folders labeled by year. In each year folder there is (at least) the files ACCIDENT.CSV
, PERSON.CSV
, and VEHICLE.CSV
. If you take a peek at any ACCIDENT.CSV
file in any year, you'll notice that the column YEAR
only contains the last two digits of the year. Add a new YEAR
column that contains the full year. Use the rbind
function to create a data.frame called accidents
that combines the ACCIDENT.CSV
files from the years 1975 through 1981 (inclusive) into one big dataset. After creating that accidents
data frame, change the values in the YEAR
column from two digits to four digits (i.e., paste a 19 onto each year value).
Relevant topics: rbind, read.csv, paste0
Here is a video to walk you through the method of solving Question 1.
Here is another video, using two functions you have not (yet) learned, namely, lapply
and do.call
. You do not need to understand these yet. It is just a glimpse of some powerful functions to come later in the course!
Item(s) to submit:
- R code used to solve the problem/comments explaining what the code does.
-
The result of
unique(accidents$YEAR)
.
Solution
accidents <- data.frame()
for (i in 1975:1981) {
filename <- paste0("/class/datamine/data/fars/", i, "/ACCIDENT.CSV")
dat <- read.csv(filename)
dat$YEAR <- i
accidents <- rbind(accidents, dat)
}
2. Using the new accidents
data frame that you created in (1), how many accidents are there in which 1 or more drunk drivers were involved in an accident with a school bus?
Hint: Look at the variables DRUNK_DR
and SCH_BUS
.
Relevant topics: table, which, indexing
Here is a video about a related problem with 3 fatalities (instead of considering drunk drivers).
Item(s) to submit:
- R code used to solve the problem/comments explaining what the code does.
- The result/answer itself.
Solution
table(accidents$DRUNK_DR, accidents$SCH_BUS)
3. Again using the accidents
data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?
Relevant topics: table, which, indexing
Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers), tabulated according to year.
Item(s) to submit:
- R code used to solve the problem/comments explaining what the code does.
- The results.
- Which year had the most qualifying accidents.
Solution
result <- table(accidents$YEAR[which(accidents$DRUNK_DR > 0 & accidents$SCH_BUS==1)])
result[which.max(result)]
4. Again using the accidents
data frame: Calculate the mean number of motorists involved in an accident (variable PERSON
) with i drunk drivers, where i takes the values from 0 through 6.
Hint: It is OK that there are no accidents involving just 5 drunk drivers.
Hint: You can use either a for
loop or a tapply
function to accomplish this question.
Relevant topics: for loops, indexing, tapply, mean,
Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers). We calculate the mean number of fatalities for accidents with i drunk drivers, where i takes the values from 0 through 6.
Item(s) to submit:
- R code used to solve the problem/comments explaining what the code does.
- The output from running your code.
Solution
for (i in 0:6) {
print(mean(accidents$PERSONS[accidents$DRUNK_DR == i], na.rm=T))
}
5. Again using the accidents
data frame: We have a theory that there are more accidents in cold weather months for Indiana and states around Indiana. For this question, only consider the data for which STATE
is one of these: Indiana (18), Illinois (17), Ohio (39), or Michigan (26). Create a barplot that shows the number of accidents by STATE
and by month (MONTH
) simultanously. What months have the most accidents? Are you surprised by these results? Explain why or why not?
We guide students through the methodology for Question 5 in this video. We also add a legend, in case students want to distinguish which stacked barplot goes with each of the four States.
Item(s) to submit:
- R code used to solve the problem/comments explaining what the code does.
- The output (plot) from running your code.
- 1-2 sentences explaining which month(s) have the most accidents and whether or not this surprises you.
Solution
dat <- accidents[accidents$STATE %in% c(17, 18, 26, 39),]
barplot(table(dat$STATE, dat$MONTH))
OPTIONAL QUESTION. Spruce up your plot from (5). Do any of the following:
- add vibrant (and preferably colorblind friendly) colors to your plot
- add a title
- add a legend
- add month names or abbreviations instead of numbers
Hint: Here is a resource to get you started.
Item(s) to submit:
- R code used to solve the problem/comments explaining what the code does.
- The output (plot) from running your code.
Solution
library(RColorBrewer)
dat <- accidents[accidents$STATE %in% c(17, 18, 26, 39),]
dat$STATE <- factor(dat$STATE, labels=c("Illinois", "Indiana", "Michigan", "Ohio"))
dat$MONTH <- factor(dat$MONTH, labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
color_palette <- brewer.pal(5, "Set2")
barplot(table(dat$STATE, dat$MONTH), col=color_palette, main="Accidents by Month", legend=T, ylim=c(0, 5000), las=2)
Project 6
The tapply
function works like this:
tapply( somedata, thewaythedataisgrouped, myfunction)
myDF <- read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
head(myDF)
We could do four computations to compute the mean
SPEND
amount in each STORE_R
mean(myDF$SPEND[myDF$STORE_R == "CENTRAL"])
mean(myDF$SPEND[myDF$STORE_R == "EAST "])
mean(myDF$SPEND[myDF$STORE_R == "SOUTH "])
mean(myDF$SPEND[myDF$STORE_R == "WEST "])
but it is easier to do all four of these calculations with the tapply
function. We take a mean
of the SPEND
values, broken into groups according to the STORE_R
tapply( myDF$SPEND, myDF$STORE_R, mean)
We could find the total amount in the SPEND
column in 2016 and then again in 2017.
sum(myDF$SPEND[myDF$YEAR == "2016"])
sum(myDF$SPEND[myDF$YEAR == "2017"])
or we could do both of these calculations at once, using the tapply
function. We take the sum
of all SPEND
amounts, broken into groups according to the YEAR
tapply(myDF$SPEND, myDF$YEAR, sum)
As a last example, we can calculate the amount spent on each day of purchases. We take the sum
of all SPEND
amounts, broken into groups according to the PURCHASE_
day
tapply(myDF$SPEND, myDF$PURCHASE_, sum)
It makes sense to sort the results and then look at the 20 days on which the sum
of the SPEND
amounts were the highest.
tail(sort( tapply(myDF$SPEND, myDF$PURCHASE_, sum) ),n=20)
tapply( mydata, mygroups, myfunction, na.rm=T )
Some generic uses to explain how this would look, if we made the calculations in a naive/verbose/painful way
myfunction(mydata[mygroups == 1], na.rm=T)
myfunction(mydata[mygroups == 2], na.rm=T)
myfunction(mydata[mygroups == 3], na.rm=T) ....
myfunction(mydata[mygroups == "IN"], na.rm=T)
myfunction(mydata[mygroups == "OH"], na.rm=T)
myfunction(mydata[mygroups == "IL"], na.rm=T) ....
myDF <- read.csv("/class/datamine/data/flights/subset/2005.csv")
head(myDF)
sum
the Distance
s of the flights, split into groups according to the airline (UniqueCarrier
)
sort(tapply(myDF$Distance, myDF$UniqueCarrier, sum))
Find the mean
flight Distance
, grouped according to the city of Origin
sort(tapply(myDF$Distance, myDF$Origin, mean))
Calculate the mean
departure delay (DepDelay
), for each airplane (i.e., each TailNum
), using na.rm=T
because some of the values of the departure delays are NA
tail(sort(tapply(myDF$DepDelay, myDF$TailNum, mean, na.rm=T)),n=20)
library(data.table)
myDF <- fread("/class/datamine/data/election/itcont2016.txt", sep="|")
head(myDF)
sum
the amounts of all contributions made, grouped according to the STATE
where the people lived
sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum))
sum
the amounts of all contributions made, grouped according to the CITY
/STATE
where the people lived
tail(sort(tapply(myDF$TRANSACTION_AMT, paste(myDF$CITY, myDF$STATE), sum)),n=20)
mylocations <- paste(myDF$CITY, myDF$STATE)
tail(sort(tapply(myDF$TRANSACTION_AMT, mylocations, sum)),n=20)
sum
the amounts of all contributions made, grouped according to the EMPLOYER
where the people worked
tail(sort(tapply(myDF$TRANSACTION_AMT, myDF$EMPLOYER, sum)), n=30)
Motivation: tapply
is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. Typically, when writing R code, you will want to use an "apply suite" function rather than a for loop.
Context: The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called tapply
from the "apply suite" of functions in R.
Scope: r, for, tapply
Learning objectives:
- Explain what "recycling" is in R and predict behavior of provided statements.
- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- List the differences between lists, vectors, factors, and data.frames, and when to use each.
- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.
- Demonstrate how apply functions are generally faster than using loops.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/fars/7581.csv
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. The dataset, /class/datamine/data/fars/7581.csv
contains the combined accident records from year 1975 to 1981. Load up the dataset into a data.frame named dat
. In the previous project's question 4, we asked you to calculate the mean number of motorists involved in an accident (variable PERSON
) with i drunk drivers where i takes the values from 0 through 6. This time, solve this question using the tapply
function instead. Which method did you prefer and why?
Now that you've read the data into a dataframe named dat
, run the following code:
# Read in data that maps state codes to state names
state_names <- read.csv("/class/datamine/data/fars/states.csv")
# Create a vector of state names called v
v <- state_names$state
# Set the names of the new vector to the codes
names(v) <- state_names$code
# Create a new column in the dat dataframe with the actual names of the states
dat$mystates <- v[as.character(dat$STATE)]
Item(s) to submit:
- R code used to solve the problem.
- The output/solution.
Solution
dat <- read.csv("/class/datamine/data/fars/7581.csv")
state_names <- read.csv("/class/datamine/data/fars/states.csv")
v <- state_names$state
names(v) <- state_names$code
dat$mystates <- v[as.character(dat$STATE)]
tapply(dat$PERSONS, dat$DRUNK_DR, mean, na.rm=T)
2. Make a state-by-state classification of the average number of drunk drivers in an accident. Which state has the highest average number of drunk drivers per accident?
Relevant topics: tapply, mean, sort
Item(s) to submit:
- R code used to solve the problem.
- The entire output.
- Which state has the highest average number of drunk drivers per accident?
Solution
st <- tapply(dat$DRUNK_DR, dat$mystates, mean, na.rm=T)
st[which.max(st)]
# or
sort(tapply(dat$DRUNK_DR, dat$mystates, mean, na.rm=T), decreasing=T)
3. Add up the total number of fatalities, according to the day of the week on which they occurred. Are the numbers surprising to you? What days of the week have a higher number of fatalities? If instead you calculate the proportion of fatalities over the total number of people in the accidents, what would you expect? Calculate it and see if your expectations match.
Hint: Sundays through Saturdays are days 1 through 7, respectively. Day 9 indicates that the day is unknown.
This video example uses the Amazon fine food reviews dataset to make a similar calculation, in which we have two tapply statements, and we divide the results to get a ton of similar ratios all at once. Powerful stuff! It may guide you in your thinking about this question.
Code from the example: double tapply examples
Relevant topics: tapply
Item(s) to submit:
- R code used to solve the problem.
- What days have the highest number of fatalities?
- What would you expect if you calculate the proportion of fatalities over the total number of people in the accidents?
Solution
sort(tapply(dat$FATALS, dat$DAY_WEEK, sum))
sort(tapply(dat$FATALS, dat$DAY_WEEK, sum)/tapply(dat$PERSONS, dat$DAY_WEEK, sum))
4. How many drunk drivers are involved, on average, in crashes that occur on straight roads? How many drunk drivers are involved, on average, in crashes that occur on curved roads? Solve the pair of questions in a single line of R code.
Hint: The ALIGNMNT
variable is 1 for straight, 2 for curved, and 9 for unknown.
Item(s) to submit:
- R code used to solve the problem.
- Results from running the R code.
Solution
tapply(dat$DRUNK_DR, dat$ALIGNMNT, mean, na.rm=T)
5. Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.
This example demonstrates a comparable calculation. In the video, I used the total number of people in the accident, and your question is (instead) about the number of fatalities, but this is essentially the only difference. I hope it helps to explain the way that the cut function works, along with the analogous breaks.
Relevant topics: tapply, cut, sum
Item(s) to submit:
- R code used to solve the problem.
- Results from running the R code.
Solution
tapply(dat$FATALS, cut(dat$HOUR, breaks=c(0,6,12,18,24,99), include.lowest=T), sum)
Project 7
Motivation: Three bread-and-butter functions that are a part of the base R are: subset
, merge
, and split
. subset
provides a more natural way to filter and select data from a data.frame. split
is a useful function that splits a dataset based on one or more factors. merge
brings the principals of combining data that SQL uses, to R.
Context: We've been getting comfortable working with data in within the R environment. Now we are going to expand our toolset with three useful functions, all the while gaining experience and practice wrangling data!
Scope: r, subset, merge, split, tapply
Learning objectives:
- Gain proficiency using split, merge, and subset.
- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Demonstrate how to use tapply to solve data-driven problems.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/goodreads/csv
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
Solution
books <- read.csv("/class/datamine/data/goodreads/csv/books_sample.csv")
authors <- read.csv("/class/datamine/data/goodreads/csv/goodreads_book_authors.csv")
dim(books)
dim(authors)
# or
str(books)
str(authors)
2. We want to figure out how book size (num_pages
) is associated with various metrics. First, let's create a vector called book_size
, that categorizes books into 4 categories based on num_pages
: small
(up to 250 pages), medium
(250-500 pages), large
(500-1000 pages), huge
(1000+ pages).
Note: This video and code might be helpful.
Relevant topics: cut
Item(s) to submit:
- R code used to solve the problem.
-
The result of
table(book_size)
.
Solution
book_size <- cut(books$num_pages, breaks=c(-Inf, 250, 500, 1000, Inf), labels=c("small", "medium", "large", "huge"))
table(book_size)
head(books$num_pages)
# We can split each element into categories,
# based on how num_pages compares to 0, 250, 500, 1000,
# and a huge number (say, 50000000, or Inf for infinity)
head(cut(books$num_pages, breaks=c(0,250,500,1000,50000000),include.lowest=T))
# Or like this:
head(cut(books$num_pages, breaks=c(0,250,500,1000,Inf),include.lowest=T))
# It is worthwhile to make sure that we included every book.
# For instance, if you do not use include.lowest=T, then we
# will miss the pages that are exactly equal to 0.
# Or if you do not use a big enough number on the right hand side,
# then you will miss the very long pages.
# We know, from question 1, that we should have 1000000 books
# and we can verify that we got them all here:
table( cut(books$num_pages, breaks=c(0,250,500,1000,Inf),include.lowest=T), useNA="always" )
sum(table( cut(books$num_pages, breaks=c(0,250,500,1000,Inf),include.lowest=T), useNA="always" ))
# Now we create the book_size vector:
book_size <- cut(books$num_pages, breaks=c(0,250,500,1000,Inf), include.lowest=T)
# and we can check that the first several entries look reasonable.
head(book_size)
3. Use tapply
to calculate the mean average_rating
, text_reviews_count
, and publication_year
by book_size
. Did any of the result surprise you? Why or why not?
Relevant topics: tapply
Item(s) to submit:
- R code used to solve the problem.
- The output from running the R code.
Solution
tapply(books$average_rating, book_size, mean, na.rm=T)
tapply(books$text_reviews_count, book_size, mean, na.rm=T)
tapply(books$publication_year, book_size, mean, na.rm=T)
# It might be surprising, for instance, that longer books get
# more reviews and higher reviews. Answers to this part of the
# question will vary.
4. Notice in (3) how we used tapply
3 times. This would get burdensome if we decided to calculate 4 or 5 or 6 columns instead. Instead of using tapply, we can use split
, lapply
, and colMeans
to perform the same calculations.
Use split
to partition the data containing only the following 3 columns: average_rating
, text_reviews_count
, and publication_year
, by book_size
. Save the result as books_by_size
. What class is the result? lapply
is a function that allows you to loop over each item in a list and apply a function. Use lapply
and colMeans
to perform the same calculation as in (3).
Note: This video and code and also this video and code might be helpful.
Relevant topics: lapply, split, colMeans, indexing
Item(s) to submit:
- R code used to solve the problem.
- The output from running the code.
Solution
books_by_size <- split(books[,c('text_reviews_count','average_rating','publication_year')], book_size)
lapply(books_by_size, colMeans, na.rm=TRUE)
# We can build a temporarily data frame that has the 3 desired variables:
head(data.frame(books$average_rating, books$text_reviews_count, books$publication_year))
# and split it according to the book_size.
# We get a list of length 4 as a result:
books_by_size <- split(data.frame(books$average_rating, books$text_reviews_count, books$publication_year), book_size)
class(books_by_size)
length(books_by_size)
str(books_by_size)
# and then we can take the colMeans of each of these 4 parts of the list:
lapply(books_by_size, colMeans, na.rm=T)
# and the results agree with the results from question 3.
5. We are working with a lot more data than we really want right now. We've provided you with the following code to filter out non-English books and only keep columns of interest. This will create a data frame called en_books
.
en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")]
Now create an equivalent data frame of your own, by using the subset
function (instead of indexing). Use res
as the name of the data frame that you create. Do the dimensions (using dim
) of en_books
and res
agree? Why or why not? (They should both have 8 columns, but a different number of rows.)
Hint: Since the dimensions don't match, take a look at NA values for the variables used to subset our data.
Note: This video and code and also this video and code might be helpful.
Relevant topics: indexing, subset, NA, %in%
Item(s) to submit:
- R code used to solve the problem.
- Do the dimensions match?
- 1-2 sentences explaining why or why not.
Solution
en_books_subset <- subset(books, language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & publication_year > 2000, select=c(author_id, book_id, average_rating,description, title, ratings_count, language_code, publication_year))
en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")]
dim(en_books)
dim(en_books_subset)
sum(is.na(en_books_subset$language_code))
sum(is.na(en_books$language_code))
# you can see that without the NAs the results are the same
all.equal(en_books_subset, en_books[!is.na(en_books$language_code),], check.attributes=F)
# subset is automatically removing NAs
en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")]
# We build a new data frame called res that takes a subset
# of the data frame books that meet the two given conditions
# on language_code and publication_year and that contains the 8
# desired columns.
res <- subset(books, subset=language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & publication_year > 2000, select=c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year"))
class(res)
dim(res)
# This data frame is smaller.
# The earlier data frame has some rows with missing information,
# which are not included here.
# (We might want to give examples/details about what is missing.)
Solution
res <- merge(books, authors, by="author_id")
dim(res)
Solution
brandons_books <- res[res$name=="Brandon Sanderson",]
brandons_books[which.max(brandons_books$average_rating_books),]
myDF <- merge(res, authors, by="author_id")
# Here are the books by author Sandra Cisneros:
myfavoriteDF <- myDF[myDF$name=="Sandra Cisneros", ]
myfavoriteDF
# Her highest rated book has the rating 4.23:
max(myfavoriteDF$average_rating.x)
# It is called "A House of My Own: Stories from My Life":
myfavoriteDF[which.max(myfavoriteDF$average_rating.x), ]$title
Solution
res <- merge(books, authors, by="author_id")
# rename columns
names(res) <- gsub("\\.x", "_books", names(res))
names(res) <- gsub("\\.y", "_author", names(res))
dim(res)
Project 8
Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!
Context: We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.
Scope: r, functions
Learning objectives:
- Gain proficiency using split, merge, and subset.
- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Demonstrate how to use tapply to solve data-driven problems.
- Comprehend what a function is, and the components of a function in R.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/goodreads/csv
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. Read in the same data, in the same way as the previous project (with the same names). We've provided you with the function below. How many arguments does the function have? Name all of the arguments. What is the name of the function? Replace the description
column in our books
data.frame with the same information, but with stripped punctuation using the function provided.
# A function that, given a string (myColumn), returns the string
# without any punctuation.
strip_punctuation <- function(myColumn) {
# Use regular expressions to identify punctuation.
# Replace identified punctuation with an empty string ''.
desc_no_punc <- gsub('[[:punct:]]+', '', myColumn)
# Return the result
return(desc_no_punc)
}
Hint: Since gsub
accepts a vector of values, you can pass an entire vector to strip_punctuation
.
Relevant topics: functions
Item(s) to submit:
- R code used to solve the problem.
- How many arguments does the function have?
- What are the name(s) of all of the arguments?
- What is the name of the function?
Solution
# The function has 1 argument called "myColumn"
# The name of the function is "strip_punctuation"
books$description <- strip_punctuation(books$description)
2. Use the strsplit
function to split a string by spaces. Some examples would be:
strsplit("This will split by space.", " ")
## [[1]]
## [1] "This" "will" "split" "by" "space."
strsplit("This. Will. Split. By. A. Period.", "\\.")
## [[1]]
## [1] "This" " Will" " Split" " By" " A" " Period"
An example string is:
test_string <- "This is a test string with no punctuation"
Test out strsplit
using the provided test_string
. Make sure to copy and paste the code that declares test_string
. If you counted the words shown in your results, would it be an accurate count? Why or why not?
Relevant topics: strsplit, functions
Item(s) to submit:
- R code used to solve the problem.
- 1-2 sentences explaining why or why not your count would be accurate.
Solution
test_string <- "This is a test string with no punctuation"
strsplit(test_string, " ")
# It would not be accurate because it is counting the extra spaces as words.
3. Fix the issue in (2), using which
. You may need to unlist
the strsplit
result first. After you've accomplished this, you can count the remaining words!
Relevant topics: strsplit, sum, which
Item(s) to submit:
- R code used to solve the problem (including counting the words).
Solution
test <- unlist(strsplit(test_string, " "))
length(test[test!=""])
sum(test!="")
4. We are finally to the point where we have code from questions (2) and (3) that we think we may want to use many times. Write a function called count_words
which, given a string, description
, returns the number of words in description
. Test out count_words
on the description
from the second row of books
. How many words are in the description?
Relevant topics: functions, unlist, indexing, strsplit
Item(s) to submit:
- R code used to solve the problem.
-
The result of using the function on the
description
from the second row ofbooks
.
Solution
count_words <- function(description) {
split_desc <- unlist(strsplit(description, " "))
return(length(split_desc[split_desc!=""]))
}
count_words(books$description[2])
Solution
# Could be anything.
Project 9
Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!
Context: We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.
Scope: r, functions
Learning objectives:
- Gain proficiency using split, merge, and subset.
- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Demonstrate how to use tapply to solve data-driven problems.
- Comprehend what a function is, and the components of a function in R.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/goodreads/csv
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. We've provided you with a function below. How many arguments does the function have, and what are their names? You can get a book_id
from the URL of a goodreads book's webpage. For example, if you search for the book Words of Radiance
on goodreads, the book_id
contained in the url https://www.goodreads.com/book/show/17332218-words-of-radiance#, is 17332218. Another example is https://www.goodreads.com/book/show/157993.The_Little_Prince?from_search=true&from_srp=true&qid=JJGqUK9Vp9&rank=1, (the little prince) with a book_id
of 157993.
Find 2 or 3 book_id
s and test out the function until you get two successes. Explain in words, what the function is doing, and what options you have.
library(imager)
books <- read.csv("/class/datamine/data/goodreads/csv/goodreads_books.csv")
authors <- read.csv("/class/datamine/data/goodreads/csv/goodreads_book_authors.csv")
get_author_name <- function(my_authors_dataset, my_author_id){
return(my_authors_dataset[my_authors_dataset$author_id==my_author_id,'name'])
}
fun_plot <- function(my_authors_dataset, my_books_dataset, my_book_id, display_cover=T) {
book_info <- my_books_dataset[my_books_dataset$book_id==my_book_id,]
all_books_by_author <- my_books_dataset[my_books_dataset$author_id==book_info$author_id,]
author_name <- get_author_name(my_authors_dataset, book_info$author_id)
img <- load.image(book_info$image_url)
if(display_cover){
par(mfrow=c(1,2))
plot(img, axes=FALSE)
}
plot(all_books_by_author$num_pages, all_books_by_author$average_rating,
ylim=c(0,5.1), pch=21, bg='grey80',
xlab='Number of pages', ylab='Average rating',
main=paste('Books by', author_name))
points(book_info$num_pages, book_info$average_rating,pch=21, bg='orange', cex=1.5)
}
Relevant topics: functions
Item(s) to submit:
- How many arguments does the function have, and what are their names?
-
The result of using the function on 2-3
book_id
s. - 1-2 sentences explaining what the function does (generally), and what (if any) options the function provides you with.
Solution
# fun_plot has 4 arguments, my_authors_dataset, my_books_dataset, my_book_id, and display_cover.
# Given a `book_id`, the function finds all books by the same author and plots the book cover
# next to a plot of ratings where the current book is highlighted. You can choose to show the
# cover image or not.
fun_plot(authors, books, 17332218)
fun_plot(authors, books, 17593132)
2. You may have encountered a situation where the my_book_id
was not in our dataset, and hence, didn't get plotted. When writing functions, it is usually best to try and foresee issues like this and have the function fail gracefully, instead of showing some ugly (and sometimes unclear) warning. Add some code at the beginning of our function that checks to see if my_book_id
is within our dataset, and if it does not exist, prints "Book ID not found.", and exits the function. Test it out on book_id=123
and book_id=19063
.
Hint: Run ?stop
to see if that is a function that may be useful.
Relevant topics: functions, if/else, stop
Item(s) to submit:
- R code with your new and improved function.
-
The results from
fun_plot(123)
. -
The results from
fun_plot(19063)
.
Solution
fun_plot <- function(my_authors_dataset, my_books_dataset, my_book_id, display_cover=T) {
if (!(my_book_id %in% my_books_dataset$book_id)) {
stop("Book ID not found.")
}
book_info <- my_books_dataset[my_books_dataset$book_id==my_book_id,]
all_books_by_author <- my_books_dataset[my_books_dataset$author_id==book_info$author_id,]
author_name <- get_author_name(my_authors_dataset, book_info$author_id)
img <- load.image(book_info$image_url)
if(display_cover){
par(mfrow=c(1,2))
plot(img, axes=FALSE)
}
plot(all_books_by_author$num_pages, all_books_by_author$average_rating,
ylim=c(0,5.1), pch=21, bg='grey80',
xlab='Number of pages', ylab='Average rating',
main=paste('Books by', author_name))
points(book_info$num_pages, book_info$average_rating,pch=21, bg='orange', cex=1.5)
}
fun_plot(authors, books, 19063)
fun_plot(authors, books, 123)
You can test your function using some of these examples:
get_author_id(authors, "Brandon Sanderson") # 38550
get_author_id(authors, "J.K. Rowling") # 1077326
Relevant topics: functions
Item(s) to submit:
- R code containing your new function.
- The results of using your new function on a few authors.
Solution
get_author_id <- function(my_author_dataset, my_author_name) {
return(my_author_dataset[my_author_dataset$name==my_author_name,]$author_id)
}
get_author_id(authors, "Brandon Sanderson") # 38550
get_author_id(authors, "J.K. Rowling") # 1077326
4. See the function below.
search_books_for_word <- function(word) {
return(books[grepl(word, books$description, fixed=T),]$title)
}
Given a word, search_books_for_word
returns the titles of books where the provided word is inside the book's description. search_books_for_word
utilizes the books
dataset internally. It requires that the books
dataset has been loaded into the environment prior to running (and with the correct name). By including and referencing objects defined outside of our function's scope within our function (in this case the variable books
), our search_books_for_word
function will be more prone to errors, as any changes to those objects may break our function. For example:
our_function <- function(x) {
print(paste("Our argument is:", x))
print(paste("Our variable is:", my_variable))
}
# our variable outside the scope of our_function
my_variable <- "dog"
# run our_function
our_function("first")
# change the variable outside the scope of our function
my_variable <- "cat"
# run our_function again
our_function("second")
# imagine a scenario where "my_variable" doesn't exist, our_function would break!
rm(my_variable)
our_function("third")
Fix our search_books_for_word
function to accept the books
dataset as an argument called my_books_dataset
and utilize my_books_dataset
within the function instead of the global variable books
.
Relevant topics: functions, read.csv, scoping
Item(s) to submit:
- R code with your new and improved function.
- An example using the updated function.
Solution
search_books_for_word <- function(my_books_dataset, word) {
return(my_books_dataset[grepl(word, my_books_dataset$description, fixed=T),]$title)
}
5. Write your own custom function. Make sure your function includes at least 2 arguments. If you access one of our datasets from within your function (which you definitely should do), use what you learned in (4), to avoid future errors dealing with scoping. Your function could output a cool plot, interesting tidbits of information, or anything else you can think of. Get creative and make a function that is fun to use!
Relevant topics: scoping, functions
Item(s) to submit:
- R code used to solve the problem.
- Examples using your function with included output.
Solution
# Could be anything.
Project 10
Motivation: Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called functional programming. In this project, we will learn to apply functions to entire vectors of data using sapply
.
Context: We've just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. sapply
is one of the best ways to do this in R.
Scope: r, sapply, functions
Learning objectives:
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Utilize apply functions in order to solve a data-driven problem.
- Gain proficiency using split, merge, and subset.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/okcupid/filtered
Questions
Important note: Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
Solution
# You can show files and subdirectories using list.files
list.files("/class/datamine/data/okcupid")
# You can also take a look at files using file.info
file.info("/class/datamine/data/okcupid/filtered/questions.csv")
# You can peek at a file before reading it all in using readLines
readLines("/class/datamine/data/okcupid/filtered/questions.csv", n=10)
# we can see things are separated by semicolons
users <- read.csv("/class/datamine/data/okcupid/filtered/users.csv")
questions <- read.csv("/class/datamine/data/okcupid/filtered/questions.csv", sep=";")
2. grep
is an incredibly powerful tool available to us in R. We will learn more about grep
in the future, but for now, know that a simple application of grep
is to find a word in a string. In R, grep
is vectorized and can be applied to an entire vector of strings. Use grep
to find a question that references "google". What is the question?
Hint: If at first you don't succeed, run ?grep
and check out the ignore.case
argument.
Hint: To prepare for Question 3, look at the entire row of the questions
data frame that has the question about google. The first entry on this row tells you the question that you need, in the users
data frame, while working on Question 3.
Relevant topics: grep
Item(s) to submit:
- R code used to solve the problem.
-
The
text
of the question that references Google.
Solution
questions[grep("google", questions$text, ignore.case=T),]
3. In (2) we found a pretty interesting question. What is the percentage of users that Google someone before the first date? Does the proportion change by gender (as defined by gender2
)? How about by gender_orientation
?
Hint: The two videos posted in Question 2 might help.
Hint: If you look at the column of users
corresponding to the question identified in (2), you will see that this column of users
has two possible answers, namely: "No. Why spoil the mystery?"
and "Yes. Knowledge is power!"
.
Hint: Use the tapply
function with three inputs:
the correct column of users
,
breaking up the data according to gender2
or according to gender_orientation
,
and use this as your function in the tapply
:
function(x) {prop.table(table(x, useNA="always"))}
Relevant topics: functions, tapply, table, prop.table
Item(s) to submit:
- R code used to solve this problem.
- The results of running the code.
- Written answers to the questions.
Solution
prop.table(table(users_filtered$`q170849`))
prop_google <- function(answers_to_q170849){
return(prop.table(table(answers_to_q170849))[2])
}
tapply(users_filtered$`q170849`, users_filtered$gender2, prop_google)
tapply(users_filtered$`q170849`, users_filtered$gender_orientation, prop_google)
# ~35% of people Google people before the first date.
# According to this set of data, men Google woman 10% less than woman Google men before the first date.
# When broken down by the provided gender orientations, hetero females, bisexual females, and gay males tend to
# Google people before the first date more than bisexual males, gay females, and hetero males.
4. In project (8) we created a function called count_words
. Use this function and sapply
to create a vector which contains the number of words in each row of the column text
from the questions
dataframe. Call the new vector question_length
, and add it as a column to the questions
dataframe.
count_words <- function(my_text) {
my_split_text <- unlist(strsplit(my_text, " "))
return(length(my_split_text[my_split_text!=""]))
}
Relevant topics: sapply
Item(s) to submit:
- R code used to solve this problem.
-
The result of
str(questions)
(this shows how yourquestions
data frame looks, after adding the new column calledquestion_length
).
Solution
count_words <- function(description) {
split_desc <- unlist(strsplit(description, " "))
return(length(split_desc[split_desc!=""]))
}
questions$question_length <- sapply(as.character(questions$text), count_words)
5. Consider this function called number_of_options
that accepts a data frame (for instance, questions
)
number_of_options <- function(myDF) {
table(apply(as.matrix(myDF[ ,3:6]), 1, function(x) {sum(!(x==""))}))
}
and counts the number of questions that have each possible number of responses. For instance, if we calculate: number_of_options(questions)
we get:
0 2 3 4
590 936 519 746
which means that: 590 questions have 0 possible responses; 936 questions have 2 possible responses; 519 questions have 3 possible responses; and 746 questions have 4 possible responses.
Now use the split
function to break the data frame questions
into 7 smaller data frames, according to the value in questions$Keywords
. Then use the sapply
function to determine, for each possible value of questions$Keywords
, the analogous breakdown of questions with different numbers of responses, as we did above.
Hint: You can write:
mylist <- split(questions, questions$Keywords)
sapply(mylist, number_of_options)
Background/explanation: The way sapply
works is the the first argument is by default the first argument to your function, the second argument is the function you want applied, and after that you can specify arguments by name. For example:
test1 <- c(1, 2, 3, 4, NA, 5)
test2 <- c(9, 8, 6, 5, 4, NA)
mylist <- list(first=test1, second=test2)
# for a single vector in the list
mean(mylist$first, na.rm=T)
# what if we want to do this for each vector in the list?
# how do we remove na's?
sapply(mylist, mean)
# we can specify the arguments that are for the mean function
# by naming them after the first two arguments, like this
sapply(mylist, mean, na.rm=T)
# in the code shown above, na.rm=T is passed to the mean function
# just like if you run the following
mean(mylist$first, na.rm=T)
mean(mylist$second, na.rm=T)
# you can include as many arguments to mean as you normally would
# and in any order. just make sure to name the arguments
sapply(mylist, mean, na.rm=T, trim=0.5)
# or sapply(mylist, mean, trim=0.5, na.rm=T)
# which is similar to
mean(mylist$first, na.rm=T, trim=0.5)
mean(mylist$second, na.rm=T, trim=0.5)
Relevant topics: sapply, functions, indexing
Item(s) to submit:
- R code used to solve this problem.
- The results of the running the code.
Solution
number_of_options <- function(myDF) {
table(apply(as.matrix(myDF[ ,3:6]), 1, function(x) {sum(!(x==""))}))
}
number_of_options(questions)
mylist <- split(questions, questions$Keywords)
sapply(mylist, number_of_options)
6. Lots of questions are asked in this okcupid
dataset. Explore the dataset, and either calculate an interesting statistic/result using sapply
, or generate a graphic (with good x-axis and/or y-axis labels, main labels, legends, etc.), or both! Write 1-2 sentences about your analysis and/or graphic, and explain what you thought you'd find, and what you actually discovered.
Relevant topics: plotting, functions, sapply
Item(s) to submit:
- R code used to solve this problem.
- The results from running your code.
- 1-2 sentences about your analysis and/or graphic, and explain what you thought you'd find, and what you actually discovered.
Solution
# Could be anything.
OPTIONAL QUESTION. Does it appear that there is an association between the length of the question and whether or not users answered the question? Assume NA means "unanswered". First create a function called percent_answered
that, given a vector, returns the percentage of values that are not NA. Use percent_answered
and sapply
to calculate the percentage of users who answer each question. Plot this result, against the length of the questions.
Hint: length_of_questions <- questions$question_length[grep("^q", questions$X)]
Hint: grep("^q", questions$X)
returns the column index of every column that starts with "q". Use the same trick we used in the previous hint, to subset our users
data.frame before using sapply
to apply percent_answered
.
Relevant topics: sapply, is.na, length, grep, plot
Item(s) to submit:
- R code used to solve this problem.
- The plot.
- Whether or not you think there may or may not be an association between question length and whether or not the question is answered.
Solution
number_of_options <- function(questions, question_id) {
is_empty <- function(value) {
if (is.na(value)) return(TRUE)
else if (value=="") return(TRUE)
else return(FALSE)
}
question <- questions[questions$X==question_id,]
is_empty(question$option_1) + is_empty(question$option_2) + is_empty(question$option_3) + is_empty(question$option_4)
}
prop.table(table(sapply(questions$X, number_of_options, questions = questions)))
questions$number_options <- sapply(questions$X, number_of_options, questions = questions)
prop.table(table(sapply(questions$X, number_of_options, questions = questions)))
questions$number_options <- sapply(questions$X, number_of_options, questions = questions)
percent_answered <- function(col) {
sum(!is.na(col))/length(col)
}
p_answered <- sapply(users[,grep("^q", names(users))], percent_answered)
length_of_questions <- questions$question_length[grep("^q", questions$X)]
plot(length_of_questions, p_answered)
Project 11
Motivation: The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you've learned so far this semester to solve data-driven problems. In previous projects, we've directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you'd like.
Context: You've learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of economic data from Zillow.
Scope: R
Learning objectives:
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Utilize apply functions in order to solve a data-driven problem.
- Gain proficiency using split, merge, and subset.
- Comprehend what a function is, and the components of a function in R.
- Demonstrate the ability to use nested apply functions to solve a data-driven problem.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/zillow
Questions
1. Read /class/datamine/data/zillow/Zip_time_series.csv
into a data.frame called zipc
. Look at the RegionName
column. It is supposed to be a 5-digit zip code. Either fix the column by writing a function and applying it to the column, or take the time to read the read.csv
documentation by running ?read.csv
and use an argument to make sure that column is not read in as an integer (which is why zip codes starting with 0
lose the leading 0
when being read in).
Hint: This video demonstrates how to read in data and respect the leading zeroes.
Relevant topics: read.csv, sapply, functions, strrep, nchar
Item(s) to submit:
- R code used to solve the problem.
-
head
of theRegionName
column.
Solution
zipc <- read.csv("/class/datamine/data/zillow/Zip_time_series.csv")
fix_zip <- function(x) {
if (nchar(x) == 5) {
return(as.character(x))
}
paste0(strrep("0", 5-nchar(x)), x)
}
sapply(zipc$RegionName, fix_zip)
# OR
zipc <- read.csv("/class/datamine/data/zillow/Zip_time_series.csv", colClasses = c("RegionName"="character"))
# OR
zipc <- data.frame(fread("/class/datamine/data/zillow/Zip_time_series.csv"))
zipc$Date <- as.Date(zipc$Date)
2. One might assume that the owner of a house tends to value that house more than the buyer. If that was the case, perhaps the median listing price (the price which the seller puts the house on the market, or ask price) would be higher than the ZHVI (Zillow Home Value Index -- essentially an estimate of the home value). For those rows where both MedianListingPrice_AllHomes
and ZHVI_AllHomes
have non-NA values, on average how much higher or lower is the median listing price? Can you think of any other reasons why this may be?
Relevant topics: mean
Item(s) to submit:
- R code used to solve the problem.
- The result itself and 1-2 sentences talking about whether or not you can think of any other reasons that may explain the result.
Solution
mean(zipc$MedianListingPrice_AllHomes - zipc$ZHVI_AllHomes, na.rm=T)
3. Convert the Date
column to a date using as.Date
. How many years of data do we have in this dataset? Create a line plot with lines for the average MedianListingPrice_AllHomes
and average ZHVI_AllHomes
by year. The result should be a single plot with multiple lines on it.
Hint: Here we give two videos to help you with this question. The first video gives some examples about working with dates in R.
Hint: This second video gives an example about how to plot two line graphs at the same time in R.
Hint: For a nice addition, add a dotted vertical line on year 2008 near the housing crisis:
abline(v="2008", lty="dotted")
Relevant topics: cut, as.Date, tapply, plot, lines, legend
Item(s) to submit:
- R code used to solve the problem.
- The results of running the code.
Solution
zipc$Date <- as.Date(zipc$Date, format="%Y-%m-%d")
zipc$Year <- cut(zipc$Date, breaks="year")
zipc$Year <- substr(zipc$Year, 1, 4)
listings <- tapply(zipc$MedianListingPrice_AllHomes, zipc$Year, mean, na.rm=T)
zhvis <- tapply(zipc$ZHVI_AllHomes, zipc$Year, mean, na.rm=T)
plot(names(listings), listings, type="l", ylim=c(118000, 320000), col="blue", lwd=2)
lines(names(zhvis), zhvis, col="tomato", lwd=2)
legend("topleft", legend=c("Listing price", "ZHVI"), lwd=2, col=c("blue", "tomato"))
abline(v="2008", lty="dotted")
4. Read /class/datamine/data/zillow/State_time_series.csv
into a data.frame called states
. Calculate the average median listing price by state, and create a map using plot_usmap
from the usmap
package that shows the average median price by state.
Hint: We give a full example about how to plot values, by State, on a map.
Hint: In order for plot_usmap
to work, you must name the column containing states' names to "state".
Hint: To split words like "OhSoCool" into "Oh So Cool", try this: trimws(gsub('([[:upper:]])', ' \\1', "OhSoCool"))
. This will be useful as you'll need to correct the RegionName
column at some point in time. Notice that this will not completely fix "DistrictofColumbia". You will need to fix that one manually.
Relevant topics: tapply, plot_usmap
Item(s) to submit:
- R code used to solve the problem.
- The resulting map.
Solution
states <- read.csv("/class/datamine/data/zillow/State_time_series.csv")
aux <- tapply(states$MedianListingPrice_AllHomes, states$RegionName, mean, na.rm=T)
dat <- data.frame(state=names(aux), val=aux)
dat$state <- trimws(gsub('([[:upper:]])', ' \\1', dat$state))
dat$state[dat$state=="Districtof Columbia"] <- "District of Columbia"
plot_usmap(data = dat, values = "val", color = "black") +
scale_fill_continuous(low = "white", high = "#C28E0E",
name = "Median price listing",
label = scales::dollar) +
theme(legend.position = "right")
5. Read /class/datamine/data/zillow/County_time_series.csv
into a data.frame named counties
. Choose a state (or states) that you would like to "dig down" into county-level data for, and create a plot (or plots) like in (4) that show some interesting statistic by county. You can choose average median listing price if you so desire, however, you don't need to! There are other cool data!
Hint: Make sure that you remember to aggregate your data by RegionName
so the plot renders correctly.
Hint: plot_usmap
looks for a column named fips
. Make sure to rename the RegionName
column to fips
prior to passing the data.frame to plot_usmap
.
Hint: If you get Question 4 working correctly, here are the main differences for Question 5. You need the regions
to be "counties"
instead of "states"
, and you need the data.frame
to have a column called fips
instead of state
. These are the main differences between Question 4 and Question 5.
Relevant topics: tapply, plot_usmap
Item(s) to submit:
- R code used to solve the problem.
- The resulting map.
Solution
counties_avg <- tapply(counties$MedianListingPrice_AllHomes, counties$RegionName, mean, na.rm=T)
counties_avg <- data.frame(fips = names(counties_avg), MedianListingPrice = unname(counties_avg))
plot_usmap(data = counties_avg, values = "MedianListingPrice", include=c("CA"), color = "black") +
scale_fill_continuous(low = "white", high = "#C28E0E",
name = "Average median price listing",
label = scales::dollar) +
theme(legend.position = "right")
Project 12
Motivation: In the previous project you were forced to do a little bit of date manipulation. Dates can be very difficult to work with, regardless of the language you are using. lubridate
is a package within the famous tidyverse, that greatly simplifies some of the most common tasks one needs to perform with date data.
Context: We've been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a tidyverse package that adds great stand-alone value when working with dates.
Scope: r
Learning objectives:
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Utilize apply functions in order to solve a data-driven problem.
- Gain proficiency using split, merge, and subset.
- Demostrate the ability to create basic graphs with default settings.
- Demonstratre the ability to modify axes labels and titles.
- Incorporate legends using legend().
- Demonstrate the ability to customize a plot (color, shape/linetype).
- Convert strings to dates, and format dates using the lubridate package.
Questions
1. Let's continue our exploration of the Zillow time series data. A useful package for dealing with dates is called lubridate
. This is part of the famous tidyverse suite of packages. Run the code below to load it. Read the /class/datamine/data/zillow/State_time_series.csv
dataset into a data.frame named states
. What class and type is the column Date
?
library(lubridate)
Relevant topics: class, typeof
Item(s) to submit:
- R code used to solve the question.
-
class
andtypeof
columnDate
.
Solution
library(lubridate)
states <- read.csv("/class/datamine/data/zillow/State_time_series.csv")
typeof(states$Date)
class(states$Date)
2. Convert column Date
to a corresponding date format using lubridate
. Check that you correctly transformed it by checking its class like we did in question (1). Compare and contrast this method of conversion with the solution you came up with for question (3) in the previous project. Which method do you prefer?
Hint: Take a look at the following functions from lubridate
: ymd
, mdy
, dym
.
Hint: Here is a video about ymd
, mdy
, dym
Relevant topics: dates, lubridate
Item(s) to submit:
- R code used to solve the question.
-
class
of modified columnDate
. - 1-2 sentences stating which method you prefer (if any) and why.
Solution
states$Date <- ymd(states$Date)
class(states$Date)
3. Create 3 new columns in states
called year
, month
, day_of_week
(Sun-Sat) using lubridate
. Get the frequency table for your newly created columns. Do we have the same amount of data for all years, for all months, and for all days of the week? We did something similar in question (3) in the previous project -- specifically, we broke each date down by year. Which method do you prefer and why?
Hint: Take a look at functions month
, year
, day
, wday
.
Hint: You may find the argument of label
in wday
useful.
Hint: Here is a video about month
, year
, day
, wday
Relevant topics: dates, lubridate
Item(s) to submit:
- R code used to solve the question.
- Frequency table for newly created columns.
- 1-2 sentences answering whether or not we have the same amount of data for all years, months, and days of the week.
- 1-2 sentences stating which method you prefer (if any) and why.
Solution
states$year <- year(states$Date)
states$month <- month(states$Date)
states$day_of_week <- wday(states$Date, label=TRUE)
table(states$year)
table(states$month)
table(states$day_of_week)
4. Is there a better month or set of months to put your house on the market? Use tapply
to compare the average DaysOnZillow_AllHomes
for all months. Make a barplot showing our results. Make sure your barplot includes "all of the fixings" (title, labeled axes, legend if necessary, etc. Make it look good.).
Relevant topics: tapply, barplot
Hint: If you want to have the month's abbreviation in your plot, you may find both the month.abb
object and the argument names.arg
in barplot
useful.
Hint: This video might help with Question 4.
Item(s) to submit:
- R code used to solve the question.
-
The barplot of the average
DaysOnZillow_AllHomes
for all months. - 1-2 sentences answering the question "Is there a better time to put your house on the market?" based on your results.
Solution
avgDaysPerMonth <- tapply(states$DaysOnZillow_AllHomes, states$month, mean, na.rm=T)
barplot(avgDaysPerMonth, names.arg=month.abb, col='steelblue', main="Avg days on Zillow", ylab="# Days")
5. Filter the states
data to contain only years from 2010+ and call it states2010plus
. Make a lineplot showing the average DaysOnZillow_AllHomes
by Date
using states2010plus
data. Can you spot any trends? Write 1-2 sentences explaining what (if any) trends you see.
Relevant topics: subset, tapply, plot
Item(s) to submit:
- R code used to solve the question.
-
The time series lineplot for the average
DaysOnZillow_AllHomes
per date. - 1-2 sentences commenting on the patterns found in the plot, and your impressions of it.
Solution
states2010plus <- subset(states, year >= 2010)
avgdays <- tapply(states2010plus$DaysOnZillow_AllHomes, states2010plus$Date, mean, na.rm=T)
plot(ymd(names(avgdays)), avgdays, type='l')
6. Do homes sell faster in certain states? For the following states: 'California', 'Indiana', 'NewYork' and 'Florida', make a lineplot for DaysOnZillow_AllHomes
by Date
with one line per state. Use the states2010plus
dataset for this question. Make sure to have each state line colored differently, and to add a legend to your plot. Examine the plot and write 1-2 sentences about any observations you have.
Hint: You may want to use the lines
function to add the lines for different state.
Hint: Make sure to fix the y-axis limits using the ylim
argument in plot
to properly show all four lines.
Hint: You may find the argument col
useful to change the color of your line.
Hint: To make your legend fit, consider using the states abbreviation, and the arguments ncol
and cex
of the legend
function.
Relevant topics: subset, indexing, plot, lines
Item(s) to submit:
- R code used to solve the question.
-
The time series lineplot for
DaysOnZillow_AllHomes
per date for the 4 states. - 1-2 sentences commenting on the patterns found in the plot, and your answer to the question "Do homes sell faster in certain states rather than others?".
Solution
indiana <- subset(states2010plus, RegionName=="Indiana")
california <- subset(states2010plus, RegionName=="California")
florida <- subset(states2010plus, RegionName=="Florida")
new_york <- subset(states2010plus, RegionName=="NewYork")
plot(california$Date, california$DaysOnZillow_AllHomes, type='l', col='tomato', ylim = c(min(states2010plus$DaysOnZillow_AllHomes, na.rm=T), 210), ylab='Days on zillow', xlab = 'Date')
lines(indiana$Date, indiana$DaysOnZillow_AllHomes, col='steelblue')
lines(florida$Date, florida$DaysOnZillow_AllHomes, col='green')
lines(new_york$Date, new_york$DaysOnZillow_AllHomes, col='black')
legend('topright', legend=c('CA','IN','FL', 'NY'),lwd=2, col=c('tomato','steelblue','green','black'), ncol=2, cex=.7)
Project 13
Motivation: It's important to be able to lookup and understand the documentation of a new function. You may have looked up the documentation of functions like paste0
or sapply
, and noticed that in the "usage" section, one of the arguments is an ellipsis (...
). Well, unless you understand what this does, it's hard to really get it. In this project, we will experiment with ellipsis, and write our own function that utilizes one.
Context: We've learned about, used, and written functions in many projects this semester. In this project, we will utilize some of the less-known features of functions.
Scope: r, functions
Learning objectives:
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Utilize apply functions in order to solve a data-driven problem.
- Gain proficiency using split, merge, and subset.
- Demostrate the ability to create basic graphs with default settings.
- Demonstratre the ability to modify axes labels and titles.
- Incorporate legends using legend().
- Demonstrate the ability to customize a plot (color, shape/linetype).
- Convert strings to dates, and format dates using the lubridate package.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/beer/
Questions
1. Read /class/datamine/data/beer/beers.csv
into a data.frame named beers
. Read /class/datamine/data/beer/breweries.csv
into a data.frame named breweries
. Read /class/datamine/data/beer/reviews.csv
into a data.frame named reviews
.
Hint: Notice that reviews.csv
is a large file. Luckily, you can use a function from the famous data.table
package called fread
. The function fread
is much faster at reading large file compared to read.csv
. It reads the data into a class called data.table
. We will learn more about this later on. For now, use fread
to read in the reviews.csv
data then convert it from the data.table
class into a data.frame
by wrapping the result of fread
in the data.frame
function.
Hint: Do not forget to load the data.table
library before attempeting to use the fread
function.
Below we show you an example of how fast the fread
function is compared toread.csv
.
microbenchmark(read.csv("/class/datamine/data/beer/reviews.csv", nrows=100000), data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows=100000)), times=5)
Unit: milliseconds
expr
read.csv("/class/datamine/data/beer/reviews.csv", nrows = 1e+05)
data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows = 1e+05))
min lq mean median uq max neval
5948.6289 6482.3395 6746.8976 7040.5881 7086.6728 7176.2589 5
120.7705 122.3812 127.9842 128.7794 133.7695 134.2205 5
Hint: This video demonstrates how to read the reviews
data using fread
.
Relevant topics: fread, data.frame
Item(s) to submit:
- R code used to solve the problem.
2. Take some time to explore the datasets. Like many datasets, our data is broken into 3 "tables". What columns connect each table? How many breweries in breweries
don't have an associated beer in beers
? How many beers in beers
don't have an associated brewery in breweries
?
Hint: We compare lists of names using sum
or intersect
. Similar techniques can be used for Question 2.
Relevant topics: names, in, logical operators, unique
Item(s) to submit:
- R code used to solve the problem.
- A description of columns which connect each of the files.
-
How many breweries don't have an associated beer in
beers
. -
How many beers don't have an associated brewery in
breweries
.
3. Run ?sapply
and look at the usage section for sapply
. If you look at the description for the ...
argument, you'll see it is "optional arguments to FUN
". What this means is you can specify additional input for the function you are passing to sapply
. One example would be passing T
to na.rm
in the mean function: sapply(dat, mean, na.rm=T)
. Use sapply
and the strsplit
function to separate the types of breweries (types
) by commas. Use another sapply
to loop through your results and count the number of types for each brewery. Be sure to name your final results n_types
. What is the average amount of services (n_types
) breweries in IN and MI offer (we are looking for the average of IN and MI combined)? Does that surprise you?
Note: When you have one sapply
inside of another, or one loop inside of another, or an if/else statement inside of another, this is commonly referred to as nesting. So when Googling, you can type "nested sapply" or "nested if statements", etc.
Hint: We show, in this video, how to find the average number of parts in a midwesterner's name. Perhaps surprisingly, this same technique will be useful in solving Question 3.
Relevant topics: sapply, ...,strplit, in, mean
Item(s) to submit:
- R code used to solve the question.
- 1-2 sentences answering the average amount of services breweries in Indiana and Michigan offer, and commenting on this answer.
4. Write a function called compare_beers
that accepts a function that you will call FUN
, and any number of vectors of beer ids. The function, compare_beers
, should cycle through each vector/groups of beer_id
s, compute the function, FUN
, on the subset of reviews
, and print "Group X: some_score" where X is the number 1+, and some_score is the result of applying FUN
on the subset of the reviews
data.
In the example below the function FUN
is the median
function and we have two vectors/groups of beer_id
s passed with c(271781) being group 1 and c(125646, 82352) group 2. Note that even though our example only passes two vectors to our compare_beers
function, we want to write the function in a way that we could pass as many vectors as we want to.
Example:
compare_beers(reviews, median, c(271781), c(125646, 82352))
This example gives the output:
Group 1: 4
Group 2: 4.56
For your solution to this question, find the behavior of compare_beers
in this example:
compare_beers(reviews, median, c(88,92,7971), c(74986,1904), c(34,102,104,355))
Hint: There are different approaches to this question. You can use for loops or sapply
. It will probably help to start small and build slowly toward the solution.
Hint: This first video shows how to use ...
in defining a function.
Hint: This second video basically walks students through how to build this function. If you use this video to learn how to build this function, please be sure to acknowledge this in your project solutions.
Relevant topics: in, ..., indexing, paste0, for loops
Item(s) to submit:
- R code used to solve the problem.
- The result from running the provided example.
5. Beer wars! IN and MI against AZ and CO. Use the function you wrote in question (4) to compare beer_id from each group of states. Make a cool plot of some sort. Be sure to comment on your plot.
Hint: Create a vector of beer_ids
per group before passing it to your function from (4).
Hint: This video demonstrates an example of how to use the compare_beers
function.
Relevant topics: in, ..., indexing, paste0, for loops
Item(s) to submit:
- R code used to solve the problem.
- The result from running your function.
- The resulting plot.
- 1-2 sentence commenting on your plot.
Project 14
Motivation: Functions are the building blocks of more complex programming. It's vital that you understand how to read and write functions. In this project we will incrementally build and improve upon a function designed to recommend a beer. Note that you will not be winning any awards for this recommendation system, it is just for fun!
Context: One of the main focuses throughout the semester has been on functions, and for good reason. In this project we will continue to exercise our R skills and build up our recommender function.
Scope: r, functions
Learning objectives:
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Utilize apply functions in order to solve a data-driven problem.
- Gain proficiency using split, merge, and subset.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/beer/
Questions
1. Read /class/datamine/data/beer/beers.csv
into a data.frame named beers
. Read /class/datamine/data/beer/breweries.csv
into a data.frame named breweries
. Read /class/datamine/data/beer/reviews.csv
into a data.frame named reviews
. As in the previous project, make sure you used the fread
function from the data.table
package, and convert the data.table
to a data.frame
. We want to create a very basic beer recommender. We will start simple. Create a function called recommend_a_beer
that takes as input my_beer_id
(a single value) and returns a vector of beer_ids
from the same style
. Test your function on 2093
.
Hint: Make sure you do not include the given my_beer_id
in the vector of beer_ids
containing the beer_ids
of your recommended beers.
Hint: You may find the function setdiff
useful. Run the example below to get an idea of what it does.
Note: You will not win any awards for this recommendation system!
x <- c('a','b','b','c')
y <- c('c','b','d','e','f')
setdiff(x,y)
## [1] "a"
setdiff(y,x)
## [1] "d" "e" "f"
Relevant topics: fread, data.frame, function, setdiff
Item(s) to submit:
- R code used to solve the problem.
-
Length of result from
recommend_a_beer(2093)
. -
The result of
2093 %in% recommend_a_beer(2093)
.
2. That is a lot of beer recommendations! Let's try to narrow it down. Include an argument in your function called min_score
with default value of 4.5. Our recommender will only recommend beer_ids
with a review score of at least min_score
. Test your improved beer recommender with the same beer_id
from question (1).
Hint: Note that now we need to look at both beers
and reviews
datasets.
Relevant topics: in, unique, subset/indexing
Item(s) to submit:
- R code used to solve the problem.
-
Length of result from
recommend_a_beer(2093)
.
3. There is still room for improvement (obviously) for our beer recommender. Include a new argument in your function called same_brewery_only
with default value FALSE
. This argument will determine whether or not our beer recommender will return only beers from the same brewery. Test our newly improved beer recommender with the same beer_id
from question (1) with the argument same_brewery_only
set to TRUE
.
Hint: You may find the function intersect
useful. Run the example below to get an idea of what it does.
x <- c('a','b','b','c')
y <- c('c','b','d','e','f')
intersect(x,y)
## [1] "b" "c"
intersect(y,x)
## [1] "c" "b"
Relevant topics: if/else, subset, intersect, indexing
Item(s) to submit:
- R code used to solve the problem.
-
Length of result from
recommend_a_beer(2093, same_brewery_only=TRUE)
.
4. Oops! Bad idea! Maybe including only beers from the same brewery is not the best idea. Add an argument to our beer recommender named type
. If type=style
our recommender will recommend beers based on the style
as we did in question (3). If type=reviewers
, our recommender will recommend beers based on reviewers with "similar taste". Select reviewers that gave score equal to or greater than min_score
for the given beer id (my_beer_id
). For those reviewers, find the beer_ids
for other beers that these reviewers have given a score of at least min_score
. These beer_ids
are the ones our recommender will return. Be sure to test our improved recommender on the same beer_id
as in (1)-(3).
Relevant topics: if, subset, in, setdiff, unique
Item(s) to submit:
- R code used to solve the problem.
-
Length of result from
recommend_a_beer(2093, type="reviewers")
.
5. Let's try to narrow down the recommendations. Include an argument called abv_range
that indicates the abv range we would like the recommended beers to be at. Set abv_range
default value to NULL
so that if a user does not specify the abv_range
our recommender does not consider it. Test our recommender for beer_id
2093, with abv_range = c(8.9,9.1)
and min_score=4.9
.
Hint: You may find the function is.null
useful.
Relevant topics: if, >=, <=, intersect
Item(s) to submit:
- R code used to solve the problem.
-
Length of result from
recommend_a_beer(2093, abv_range=c(8.9, 9.1), type="reviewers", min_score=4.9)
.
6. Play with our recommend_a_beer
function. Include another feature to it. Some ideas are: putting a limit on the number of beer_id
s we will return, error catching (what if we don't have reviews for a given beer_id
?), including a plot to the output, returning beer names instead of ids or new arguments to decide what beer_id
s to recommend. Be creative and have fun!
Item(s) to submit:
- R code used to solve the problem.
-
The result from running the improved
recommend_a_beer
function showcasing your improvements to it. - 1-2 sentecens commenting on what you decided to include and why.
Project 15
Motivation: Some people say it takes 20 hours to learn a skill, some say 10,000 hours. What is certain is it definitely takes time. In this project we will explore an interesting dataset and exercise some of the skills learned this semester.
Context: This is the final project of the semester. We sincerely hope that you've learned something, and that we've provided you with first hand experience digging through data.
Scope: r
Learning objectives:
- Read and write basic (csv) data.
- Explain and demonstrate: positional, named, and logical indexing.
- Utilize apply functions in order to solve a data-driven problem.
- Gain proficiency using split, merge, and subset.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/donerschoose/
Questions
1. Read the data /class/datamine/data/donerschoose/Projects.csv
into a data.frame called projects
. Make sure you use the function you learned in Project 13 (fread
) from the data.table
package to read the data. Don't forget to then convert the data.table
into a data.frame
. Let's do an initial exploration of this data. What types of projects (Project.Type
) are there? How many resource categories (Project.Resource.Category
) are there?
Relevant topics: fread, unique, length
Item(s) to submit:
- R code used to solve the question.
- 1-2 sentences containing the project's types and how many resource categories are in the dataset.
2. Create two new variables in projects
, the number of days a project's lasted and the number of days until the project was fully funded. Name those variables project_duration
and time_until_funded
, respectively. To calculate them use the project's posted date (Project.Posted.Date
), expiration date (Project.Expiration.Date
), and fully funded date (Project.Fully.Funded.Date
). What are the shortest and longest times until a project is fully funded? For consistency check, see if we have any negative project's duration. If so, how many?
Hint: You may find the argument units
in the function difftime
useful.
Hint: Be sure to pay attention to the order of operations of difftime
.
Hint: Note that if you used the fread
function from data.table
to read in the data, you will not need to convert the columns as date.
Hint: It is not required that you use difftime
.
Relevant topics: difftime, lubridate
Item(s) to submit:
- R code used to solve the question.
- Shortest and longest times until a project is fully funded.
- 1-2 sentences answering whether we have if we have negative project's duration, and if so how many.
3. As you noted in question (2) there may be some project's with negative duration time. As we may have some concerns for the data regarding these projects, filter the projects
data to exclude the projects with negative duration, and call this filtered data selected_projects
. With that filtered data, make a dotchart
for mean time until the project is fully funded (time_until_funded
) for the various resource categories (Project.Resource.Category
). Make sure to comment on your results. Are they surprising? Could there be another variable influencing this result? If so, name at least one.
Hint: You will first need to average time until fully funded for the different categories before making your plot.
Hint: To make your dotchart
look nicer, you may want to first order the average time until fully funded before passing it to the dotchart
function. In addition, consider reducing the y-axis font size using the argument cex
.
Relevant topics: indexing, subset, tapply, dotchart
Item(s) to submit:
- R code used to solve the question.
- Resulting dotchart.
- 1-2 sentences commenting on your plot. Make sure to mention whether you are surprised or not by the results. Don't forget to add if you think there could be more factors influencing your answer, and if so, be sure to give examples.
4. Read /class/datamine/data/donerschoose/Schools.csv
into a data.frame called schools
. Combine selected_projects
and schools
by School.ID
keeping only School.ID
s present in both datasets. Name the combined data.frame selected_projects
. Use the newly combined data to determine the percentage of already fully funded projects (Project.Current.Status
) for schools in West Lafayette, IN. In addition, determine the state (School.State
) with the highest number of projects. Be sure to specify the number of projects this state has.
Hint: West Lafayette, IN zip codes are 47906 and 47907.
Relevant topics: fread, subset, indexing, merge, table, prop.table, which.max
Item(s) to submit:
- R code used to solve the question.
- 1-2 sentences answering the percentage of already fully funded projects for schools in West Lafayette, IN, the state with the highest number of projects, and the number of projects this state has.
5. Using the combined selected_projects
data, get the school(s) (School.Name
), city/cities (School.City
) and state(s) (School.State
) for the teacher with the highest percentage of fully funded projects (Project.Current.Status
).
Hint: There are many ways to solve this problem. For example, one option to get the teacher's ID is to create a variable indicating whether or not the project is fully funded and use tapply
. Another option is to create prop.table
and select the corresponding column/row.
Hint: Note that each row in the data corresponds to a unique project ID.
Hint: Once you have the teacher's ID, consider filtering projects
to contain only rows for which the corresponding teacher's ID is in, and only the columns we are interested in: School.Name
, School.City
, and School.State
. Then, you can get the unique values in this shortened data.
Hint: To get only certain columns when subetting, you may find the argument select
from subset
useful.
Relevant topics: indexing, which, max, subset, unique, row.names (if using table
), names (if using tapply
)
Item(s) to submit:
- R code used to solve the question.
- Output of your code containing school(s), city(s) and state(s) of the selected teacher.
STAT 29000
Project 1
Motivation: In this project we will jump right into an R review. In this project we are going to break one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.
Context: We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we've previously learned.
Scope: data wrangling in R, functions
Learning objectives:
- Comprehend what a function is, and the components of a function in R.
- Read and write basic (csv) data.
- Utilize apply functions in order to solve a data-driven problem.
Make sure to read about, and use the template found here, and the important information about projects submissions here.
You can find useful examples that walk you through relevant material in The Examples Book:
https://thedatamine.github.io/the-examples-book
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
Important note: It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials.
We decided to move away from ThinLinc and away from the version of RStudio used last year (https://desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.
Remember the very useful documentation shortcut ?
. To use, simply type ?
in the console, followed by the name of the function you are interested in.
You can also look for package documentation by using help(package=PACKAGENAME)
, so for example, to see the documentation for the package ggplot2
, we could run:
help(package=ggplot2)
Sometimes it can be helpful to see the source code of a defined function. A function is any chunk of organized code that is used to perform an operation. Source code is the underlying R
or c
or c++
code that is used to create the function. To see the source code of a defined function, type the function's name without the ()
. For example, if we were curious about what the function Reduce
does, we could run:
Reduce
Occasionally this will be less useful as the resulting code will be code that calls c
code we can't see. Other times it will allow you to understand the function better.
Dataset:
/class/datamine/data/airbnb
Often times (maybe even the majority of the time) data doesn’t come in one nice file or database. Explore the datasets in /class/datamine/data/airbnb
.
1. You may have noted that, for each country, city, and date we can find 3 files: calendar.csv.gz
, listings.csv.gz
, and reviews.csv.gz
(for now, we will ignore all files in the "visualisations" folders).
Let's take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (calendar.csv.gz
, listings.csv.gz
, and reviews.csv.gz
). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.
Hint: read.csv
has an argument to select the number of rows we want to read.
Hint: Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown. So you do not need to display the first 50 rows of the listings and/or reviews, in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.
Item(s) to submit:
- Chunk of code used to read the first 50 rows of each dataset.
- 1-2 sentences briefly describing the information contained in each dataset.
- Name(s) of variable(s) that could be used to join them.
To read a compressed csv, simply use the read.csv
function:
dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
head(dat)
Solution
The calendar.csv.gz
file for 2019-07-08 in Hawaii describes the listing_id
, date
, available
(t or f), price
, adjusted_price
, minimum_nights
, and maximum_nights
hawaii_calendar <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/calendar.csv.gz")
head(hawaii_calendar, n=50)
The listings.csv.gz
file for 2019-07-08 in Hawaii has 106 variables, which describe the very specific attributes of the airbnb listings.
hawaii_listings <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/listings.csv.gz")
dim(hawaii_listings)
The reviews.csv.gz
file for 2019-07-08 in Hawaii describes the listing_id
, id
, date
, reviewer_id
, reviewer_name
, and comments
hawaii_reviews <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/reviews.csv.gz")
head(hawaii_reviews, n=50)
The variables that might be used to compare the tables are: date
, id
, listing_id
, maximum_nights
, minimum_nights
, price
t <- table(c(names(hawaii_calendar), names(hawaii_listings), names(hawaii_reviews)))
t[t > 1]
Let's work towards getting this data into an easier format to analyze. From now on, we will focus on the listings.csv.gz
datasets.
2. Write a function called get_paths_for_country
, that, given a string with the country name, returns a vector with the full paths for all listings.csv.gz
files, starting with /class/datamine/data/airbnb/...
.
For example, the output from get_paths_for_country("united-states")
should have 28 entries. Here are the first 5 entries in the output:
[1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"
[2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"
[3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"
[4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"
[5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"
Hint: list.files
is useful with the recursive=T
option.
Hint: Use grep
to search for the pattern listings.csv.gz
(within the results from the first hint), and use the option value=T
to display the values found by the grep
function.
Item(s) to submit:
-
Chunk of code for your
get_paths_for_country
function.
Solution
We extract all 28 of the listings for the United States first:
myprefix <- "/class/datamine/data/airbnb/united-states/"
paste0(myprefix, grep("listings.csv.gz", list.files(myprefix, recursive=T), value=T))
Now we build a function that can do the same thing, for any country
get_paths_for_country <- function(mycountry) {
myprefix <- paste0("/class/datamine/data/airbnb/", mycountry, "/")
paste0(myprefix, grep("listings.csv.gz", list.files(myprefix, recursive=T), value=T))
}
and we test this for several countries:
get_paths_for_country("united-states")
get_paths_for_country("brazil")
get_paths_for_country("south-africa")
get_paths_for_country("canada")
3. Write a function called get_data_for_country
that, given a string with the country name, returns a data.frame containing the all listings data for that country. Use your previously written function to help you.
Hint: Use stringsAsFactors=F
in the read.csv
function.
Hint: Use do.call(rbind, <listofdataframes>)
to combine a list of dataframes into a single dataframe.
Relevant topics: rbind, lapply, function
Item(s) to submit:
-
Chunk of code for your
get_data_for_country
function.
Solution
We first get the data from the Canada entries. To do this, we sapply the read.csv
function to each of the 6 results from get_paths_for_country("canada")
In other words, we read in these 6 data frames.
myresults <- sapply(get_paths_for_country("canada"), read.csv, stringsAsFactors=F, simplify=F)
We get a list of 6 data frames:
length(myresults)
class(myresults)
class(myresults[[1]])
class(myresults[[6]])
and we can check the dimensions of each of the 6 data frames
dim(myresults[[1]])
dim(myresults[[2]])
dim(myresults[[3]])
dim(myresults[[4]])
dim(myresults[[5]])
dim(myresults[[6]])
this is more easily accomplished with another sapply
:
sapply(myresults, dim)
We can rbind
all 6 of these data frames into one big data frame as follows:
bigDF <- do.call(rbind, myresults)
class(bigDF)
dim(bigDF)
Now we create the desired function called get_data_for_country
get_data_for_country <- function(mycountry) {
myresults <- sapply(get_paths_for_country(mycountry), read.csv, stringsAsFactors=F, simplify=F)
do.call(rbind, myresults)
}
and we test it on Canada.
mynewbigDF <- get_data_for_country("canada")
The result has the same size as before
dim(mynewbigDF)
4. Use your get_data_for_country
to get the data for a country of your choice, and make sure to name the data.frame listings
. Take a look at the following columns: host_is_superhost
, host_has_profile_pic
, host_identity_verified
, and is_location_exact
. What is the data type for each column? (You can use class
or typeof
or str
to see the data type.)
These columns would make more sense as logical values (TRUE/FALSE/NA).
Write a function called transform_column
that, given a column containing lowercase "t"s and "f"s, your function will transform it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (""
), and we need to be careful when transforming the data. Test your function on column host_is_superhost
.
Relevant topics: class, typeof, str, toupper, as.logical
Item(s) to submit:
-
Chunk of code for your
transform_column
function. -
Type of
transform_column(listings$host_is_superhost)
.
Solution
These 4 columns from mynewbigDF
(which has the data for Canada) only have values "t", "f", ""
head(mynewbigDF$host_is_superhost)
head(mynewbigDF$host_has_profile_pic)
head(mynewbigDF$host_identity_verified)
head(mynewbigDF$is_location_exact)
Please note the 44 values of ""
(which are easy to miss) In the first 3 out of 4 of these columns:
table(mynewbigDF$host_is_superhost)
table(mynewbigDF$host_has_profile_pic)
table(mynewbigDF$host_identity_verified)
table(mynewbigDF$is_location_exact)
These are all character vectors, which we can check using class
, typeof
, or str
:
class(mynewbigDF$host_is_superhost)
class(mynewbigDF$host_has_profile_pic)
class(mynewbigDF$host_identity_verified)
class(mynewbigDF$is_location_exact)
typeof(mynewbigDF$host_is_superhost)
typeof(mynewbigDF$host_has_profile_pic)
typeof(mynewbigDF$host_identity_verified)
typeof(mynewbigDF$is_location_exact)
str(mynewbigDF$host_is_superhost)
str(mynewbigDF$host_has_profile_pic)
str(mynewbigDF$host_identity_verified)
str(mynewbigDF$is_location_exact)
We have several ways to transform a column. For example, we could go element-by-element, and make substitutions, like this:
v <- mynewbigDF$host_is_superhost
Here is the way that the values look at the start:
table(v)
v[toupper(v)=="T"] <- TRUE
v[toupper(v)=="F"] <- FALSE
v[toupper(v)==""] <- NA
and here are the values now:
table(v)
You might think that the NA
values disappeared, but they just do not show up in the table by default. You can force them to appear, and then we see that the counts of the three values are the same as before.
table(v, useNA="always")
Here is the function:
transform_column <- function(v) {
v[toupper(v)=="T"] <- TRUE
v[toupper(v)=="F"] <- FALSE
v[toupper(v)==""] <- NA
v
}
We can try the function on mynewbigDF$host_is_superhost
:
head(transform_column(mynewbigDF$host_is_superhost))
table(transform_column(mynewbigDF$host_is_superhost))
Another possibility is to make a map, in which we put the old values as the names, and the new values as the values in the vector:
mymap <- c(TRUE, FALSE, NA)
names(mymap) <- c("T", "F", "")
head(mymap[toupper(mynewbigDF$host_is_superhost)])
and if you do not want the names to appear on the vector, you can remove them, like this:
head(unname(mymap[toupper(mynewbigDF$host_is_superhost)]))
Finally we can check the table of the results:
table(mymap[toupper(mynewbigDF$host_is_superhost)])
This might seem strange, and if you do not like it, you can just use the solution given above. If you do like this, and want to wrap it into a function, we can write:
transform_column <- function(v) {
mymap <- c(TRUE, FALSE, NA)
names(mymap) <- c("T", "F", "")
unname(mymap[toupper(v)])
}
and again we can try this new version of the function on mynewbigDF$host_is_superhost
:
head(transform_column(mynewbigDF$host_is_superhost))
table(transform_column(mynewbigDF$host_is_superhost))
5. Apply your function transform_column
to the columns instant_bookable
and is_location_exact
in your listings
data.
Based on your listings
data, if you are looking at an instant bookable listing (where instant_bookable
is TRUE
), would you expect the location to be exact (where is_location_exact
is TRUE
)? Why or why not?
Hint: Make a frequency table, and see how many instant bookable listings have exact location.
Item(s) to submit:
- Chunk of code to get a frequency table.
- 1-2 sentences explaining whether or not we would expect the location to be exact if we were looking at a instant bookable listing.
Solution
Now we look at the Canada results, for which instant_bookable
is TRUE
, to see if is_location_exact
is TRUE
or FALSE
:
table(transform_column(mynewbigDF$is_location_exact)[transform_column(mynewbigDF$instant_bookable) == TRUE])
In other words, if instant_bookable
is TRUE
, then we expect the value for is_location_exact
to usually be TRUE
as well.
As a closing note, we could remove the check to see whether the inner values are TRUE
because, by default, we will only exact the TRUE
values when we do a lookup like this:
table(transform_column(mynewbigDF$is_location_exact)[transform_column(mynewbigDF$instant_bookable)])
Project 2
Motivation: The ability to quickly reproduce an analysis is important. It is often necessary that other individuals will need to be able to understand and reproduce an analysis. This concept is so important there are classes solely on reproducible research! In fact, there are papers that investigate and highlight the lack of reproducibility in various fields. If you are interested in reading about this topic, a good place to start is the paper titled "Why Most Published Research Findings Are False", by John Ioannidis (2005).
Context: Making your work reproducible is extremely important. We will focus on the computational part of reproducibility. We will learn RMarkdown to document your analyses so others can easily understand and reproduce the computations that led to your conclusions. Pay close attention as future project templates will be RMarkdown templates.
Scope: Understand Markdown, RMarkdown, and how to use it to make your data analysis reproducible.
Learning objectives:
- Use Markdown syntax within an Rmarkdown document to achieve various text transformations.
- Use RMarkdown code chunks to display and/or run snippets of code.
Questions
1. Make the following text (including the asterisks) bold: This needs to be **very** bold
. Make the following text (including the underscores) italicized: This needs to be _very_ italicized.
Important note: Surround your answer in 4 backticks. This will allow you to display the markdown without having the markdown "take effect". For example:
````
Some *marked* **up** text.
````
Hint: Be sure to check out the Rmarkdown Cheatsheet and our section on Rmarkdown in the book.
Note: Rmarkdown is essentially Markdown + the ability to run and display code chunks. In this question, we are actually using Markdown within Rmarkdown!
Relevant topics: rmarkdown, escaping characters
Item(s) to submit: - 2 lines of markdown text, surrounded by 4 backticks. Note that when compiled, this text will be unmodified, regular text.
Solution
We can achieve this style of text:
This needs to be **very** bold
This needs to be _very_ italicized.
by using this Markdown text:
**This needs to be \*\*very\*\* bold**
_This needs to be \_very\_ italicized._
The backslashes specify that we want the asterisks and underscores to appear.
2. Create an unordered list of your top 3 favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another ordered list that ranks your academic interests in order of most interested to least interested.
Hint: You can learn what ordered and unordered lists are here.
Note: Similar to (1), in this question we are dealing with Markdown. If we were to copy and paste the solution to this problem in a Markdown editor, it would be the same result as when we Knit it here.
Relevant topics: rmarkdown
Item(s) to submit: - Create the lists, this time don't surround your code in backticks. Note that when compiled, this text will appear as nice, formatted lists.
Solution
An unordered list of my top 3 favorite academic interests is:
- asymptotic analysis of sequences
- data science
- analysis of algorithms
An ordered list of my top 3 favorite academic interests is:
- analysis of algorithms
- asymptotic analysis of sequences
- data science
3. Browse https://www.linkedin.com/ and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown. Include the following:
- A header for this section (your choice of size) that says "About".
- The text of your personal "About" section that you would feel comfortable uploading to linkedin, including at least 1 link.
Relevant topics: rmarkdown
Item(s) to submit: - Create the described profile, don't surround your code in backticks.
Solution
About
I am Professor in Statistics and (by courtesy) of Mathematics and Public Health at Purdue University. My research is in probabilistic, combinatorial, and analytic techniques for the analysis of algorithms and data structures; I am also interested in science of information, game theory, and large-scale computation. I currently serve as
- Director of The Data Mine
- Interim Co-Director of the Integrative Data Science Initiative
- Principal Investigator for the Purdue Statistics Living Learning Community, funded by the National Science Foundation
- Associate Director for the NSF Center for Science of Information (now a core center in Purdue's Discovery Park)
- Associate Director of the Actuarial Science Program
4. Your co-worker wrote a report, and has asked you to beautify it. Knowing Rmarkdown, you agreed. Make improvements to this section. At a minimum:
- Make the title pronounced.
- Make all links appear as a word or words, rather than the long-form URL.
- Organize all code into code chunks where code and output are displayed. If the output is really long, just display the code.
- Make the calls to the
library
function be evaluated but not displayed. - Make sure all warnings and errors that may eventually occur, do not appear in the final document.
Feel free to make any other changes that make the report more visually pleasing.
```{r my-load-packages}
library(ggplot2)
```
```{r declare-variable-290, eval=FALSE}
my_variable <- c(1,2,3)
```
All About the Iris Dataset
This paper goes into detail about the `iris` dataset that is built into r. You can find a list of built-in datasets by visiting https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html or by running the following code:
data()
The iris dataset has 5 columns. You can get the names of the columns by running the following code:
names(iris)
Alternatively, you could just run the following code:
iris
The second option provides more detail about the dataset.
According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html there is another dataset built-in to r called `iris3`. This dataset is 3 dimensional instead of 2 dimensional.
An iris is a really pretty flower. You can see a picture of one here:
https://www.gardenia.net/storage/app/public/guides/detail/83847060_mOptimized.jpg
In summary. I really like irises, and there is a dataset in r called `iris`.
Relevant topics: rmarkdown
Item(s) to submit:
- Make improvements to this section, and place it all under the Question 4 header in your template.
Solution
my_variable <- c(1,2,3)
*** All About the Iris Dataset
This paper goes into detail about the iris
dataset that is built into r. You can find a list of built-in datasets by visiting the R datasets index or by running the following code:
data()
The iris dataset has 5 columns. You can get the names of the columns by running the following code:
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Alternatively, you could just run the following code:
iris
The second option provides more detail about the dataset.
According to the R iris help page there is another dataset built-in to R
called iris3
. This dataset is 3 dimensional instead of 2 dimensional.
An iris is a really pretty flower. You can see a picture of one here:
In summary. I really like irises, and there is a dataset in R
called iris
.
Solution
We create the figure called "Sepal Length for the Iris Data Set".
6. Insert the following code chunk under the Question 6 header in your template. Try knitting the document. Two things will go wrong. What is the first problem? What is the second problem?
```{r my-load-packages}
plot(my_variable)
```
Hint: Take a close look at the name we give our code chunk.
Hint: Take a look at the code chunk where my_variable
is declared.
Relevant topics: rmarkdown
Item(s) to submit:
- The modified version of the inserted code that fixes both problems.
- A sentence explaining what the first problem was.
- A sentence explaining what the second problem was.
Solution
We needed to change the section title from install-packages
to a new name, since we already had a section with this same title.
In the section called declare-variable
, we had eval=F
, and as a result, the variable my_variable
was never declared. So we removed the option eval=F
.
plot(my_variable)
For Project 2, please submit your .Rmd file and the resulting .pdf file. (For this project, you do not need to submit a .R file.)
Project 3
Motivation: The ability to navigate a shell, like bash
, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful bash
tools, help you navigate a filesystem, and even run bash
tools from within an RMarkdown file in RStudio.
Context: At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file.
Scope: bash, RStudio
Learning objectives:
- Distinguish differences in /home, /scratch, and /class.
- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
- Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.
- Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
- Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.
-
Use
man
to read and learn about UNIX utilities. -
Run
bash
commands from within and RMarkdown file in RStudio.
There are a variety of ways to connect to Scholar. In this class, we will primarily connect to RStudio Server by opening a browser and navigating to https://rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface.
Here is a video to remind you about some of the basic tools you can use in UNIX/Linux:
This is the easiest book for learning this stuff; it is short and gets right to the point:
https://go.oreilly.com/purdue-university/library/view/-/0596002610
you just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip chapters 2 and 6 the first time through).
It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details.
Solution
Here are some changes Kevin likes:
- Uncheck "Restore .Rdata into workspace at startup".
- Change tab width 4.
- Check "Soft-wrap R source files".
- Check "Highlight selected line".
- Check "Strip trailing horizontal whitespace when saving".
- Uncheck "Show margin".
2. There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a bash
shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient!
What is the default directory of your bash shell?
Hint: Start by reading the section on man
. man
stands for manual, and you can find the "official" documentation for the command by typing man <command_of_interest>
. For example:
# read the manual for the `man` command
# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down
man man
Relevant topics: man, pwd, ~, .., .
Item(s) to submit:
-
The full filepath of default directory (home directory). Ex: Kevin's is:
/home/kamstut
-
The
bash
code used to show your home directory or current directory (also known as the working directory) when thebash
shell is first launched.
Solution
# whatever is stored in the $HOME environment variable
# is what ~ represents
cd ~
pwd
# if we change $HOME, ~ changes too!
HOME=/home/kamstut/projects
cd ~
pwd
# if other users on the linux system share certain files or folders
# in their home directory, you can access their home folder similarly
ls ~mdw
# but they _have_ to give you permissions
Solution
# navigate to the data directory using `cd` (change directory)
cd /class/datamine/data
# confirm the location using `pwd` (print working directory)
pwd
# list files
ls
# cd without any arguments automatically returns to the directory
# saved in the $HOME environment variable
cd
# another trick, if you wanted to _quickly_ return to the data
# directory, or most recent directory is the following (uncommented)
cd ~-
# confirm the location using pwd
pwd
..
represents the parent folder or the folder in which your current folder is contained. So let's say I was in /home/kamstut/projects/
and I wanted to get the contents of the file /home/kamstut/my_file.txt
. You could do: cat ../my_file.txt
.
Solution
cd /class/datamine/data
pwd
cd ../../../home/kamstut
pwd
5. In Scholar, when you want to deal with really large amounts of data, you want to access scratch (you can read more here). Your scratch directory on Scholar is located here: /scratch/scholar/$USER
. $USER
is an environment variable containing your username. Test it out: echo /scratch/scholar/$USER
. Perform the following actions:
- Navigate to your scratch directory.
- Confirm you are in the correct location.
- Execute
myquota
. - Find the location of the
myquota
bash script. - Output the first 5 and last 5 lines of the bash script.
- Count the number of lines in the bash script.
- How many kilobytes is the script?
Hint: You could use each of the commands in the relevant topics once.
Hint: When you type myquota
on Scholar there are sometimes two warnings about xauth
but sometimes there are no warnings. If you get a warning that says Warning: untrusted X11 forwarding setup failed: xauth key data not generated
it is safe to ignore this error.
Hint: Commands often have options. Options are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION
section of the man
pages. For example: man wc
. You can see -m
, -l
, and -w
are all options for wc
. To test this out:
# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command.
wc /class/datamine/data/flights/1987.csv
# to count the lines, use the -l option
wc -l /class/datamine/data/flights/1987.csv
# to count the words, use the -w option
wc -w /class/datamine/data/flights/1987.csv
# you can combine options as well
wc -w -l /class/datamine/data/flights/1987.csv
# some people like to use a single tack `-`
wc -wl /class/datamine/data/flights/1987.csv
# order doesn't matter
wc -lw /class/datamine/data/flights/1987.csv
Hint: The -h
option for the du
command is useful.
Relevant topics: cd, pwd, type, head, tail, wc, du
Item(s) to submit:
- Command used to navigate to your scratch directory.
- Command used to confirm your location.
-
Output of
myquota
. -
Command used to find the location of the
myquota
script. -
Absolute path of the
myquota
script. -
Command used to output the first 5 lines of the
myquota
script. -
Command used to output the last 5 lines of the
myquota
script. -
Command used to find the number of lines in the
myquota
script. - Number of lines in the script.
- Command used to find out how many kilobytes the script is.
- Number of kilobytes that the script takes up.
Solution
# navigate to my scratch folder
cd /scratch/scholar/$USER
# confirm
pwd
# what is my quota, execute the myquota script
myquota
# get the location of the myquota script
type myquota
# get the first 5 lines of the myquota script
head /usr/local/bin/myquota
# get the last 5 lines of the myquota script
tail /usr/local/bin/myquota
# get the number of lines in the myquota script
wc -l /usr/local/bin/myquota
# get the number of kilobytes of the myquota script
du -h --apparent-size /usr/local/bin/myquota
ls -la /usr/local/bin/myquota
6. Perform the following operations:
- Navigate to your scratch directory.
- Copy and paste the file:
/class/datamine/data/flights/1987.csv
to your current directory (scratch). - Create a new directory called
my_test_dir
in your scratch folder. - Move the file you copied to your scratch directory, into your new folder.
- Use
touch
to create an empty file namedim_empty.txt
in your scratch folder. - Remove the directory
my_test_dir
and the contents of the directory. - Remove the
im_empty.txt
file.
Hint: rmdir
may not be able to do what you think, instead, check out the options for rm
using man rm
.
Relevant topics: cd, cp, mv, mkdir, touch, rmdir, rm
Item(s) to submit:
- Command used to navigate to your scratch directory.
-
Command used to copy the file,
/class/datamine/data/flights/1987.csv
to your current directory (scratch). -
Command used to create a new directory called
my_test_dir
in your scratch folder. -
Command used to move the file you copied earlier
1987.csv
into your newmy_test_dir
folder. -
Command used to create an empty file named
im_empty.txt
in your scratch folder. -
Command used to remove the directory and the contents of the directory
my_test_dir
. -
Command used to remove the
im_empty.txt
file.
Solution
# navigate to the scratch folder
cd /scratch/scholar/$USER
# copy the 1987.csv file to the current directory (scratch)
cp /class/datamine/data/flights/1987.csv .
# make a directory in the scratch directory called `my_test_dir`
mkdir my_test_dir
# move 1987.csv to the new folder
mv 1987.csv my_test_dir
# create an empty file in the scratch folder
touch im_empty.txt
# remove the directory and the contents of the directory
rm -r my_test_dir
# remove the im_empty.txt file
rm im_empty.txt
7. Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.
Project 4
Motivation: The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. grep
is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, even professionals can make critical mistakes. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.
Context: We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep
, and experiment with regular expressions using grep
, R, and later on, Python.
Scope: grep, regular expression basics, utilizing regular expression tools in R and Python
Learning objectives:
-
Use
grep
to search for patterns within a dataset. -
Use
cut
to section off and slice up data from the command line. -
Use
wc
to count the number of lines of input.
You can find useful examples that walk you through relevant material in The Examples Book:
https://thedatamine.github.io/the-examples-book
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
Important note: I would highly recommend using single quotes '
to surround your regular expressions. Double quotes can have unexpected behavior due to some shell's expansion rules. In addition, pay close attention to escaping certain characters in your regular expressions.
Dataset
The following questions will use the dataset the_office_dialogue.csv
found in Scholar under the data directory /class/datamine/data/
. A public sample of the data can be found here: the_office_dialogue.csv
Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.
grep
stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep
, we will be using it with textual data. You can read about and see examples of grep
here.
1. Login to Scholar and use grep
to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica.". Where is it located exactly?
Relevant topics: grep
Item(s) to submit:
-
The
grep
command used to find the dataset. - The name and location in Scholar of the dataset.
Solution
grep -Ri "bears. beets. battlestar galactica." /class/datamine
/class/datamine/data/movies_and_tv/the_office_dialogue.csv
2. grep
prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first n lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in.
Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.
Hint: The result if you were to search for "bears. beets. battlestar galactica." should be:
"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."
Hint: One method to solve this problem would be to pipe the output from grep
to cut
.
Item(s) to submit:
- The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".
Solution
grep -i "bears. beets. battlestar galactica." /class/datamine/data/movies_and_tv/the_office_dialogue.csv | cut -d "," -f 7,8
3. This particular dataset happens to be very small. You could imagine a scenario where the file is many gigabytes and not easy to load completely into R or Python. We are interested in learning what makes Jim and Pam tick as a couple. Use a line of UNIX commands to create a new dataset called jim_and_pam.csv
(remember, a good place to store data temporarily is /scratch/scholar/$USER
). Include only lines that are spoken by either Jim or Pam, or reference Jim or Pam in any way. How many rows of data are in the new file? How many megabytes is the new file (to the nearest 1/10th of a megabyte)?
Hint: Redirection.
Hint: It is OK if you get an erroneous line where the word "jim" or "pam" appears as a part of another word.
Relevant topics: grep, ls, wc, redirection
Item(s) to submit:
- The line of UNIX commands used to create the new file.
- The number of rows of data in the new file, and the accompanying UNIX command used to find this out.
- The number of megabytes (to the nearest 1/10th of a megabyte) that the new file has, and the accompanying UNIX command used to find this out.
Solution
grep -i "jim\|pam" /class/datamine/data/movies_and_tv/the_office_dialogue.csv | cut -d "," -f4,7,8,9,12 > jim_and_pam.csv
wc -l jim_and_pam.csv
13779 lines
ls -h jim_and_pam.csv
1.4mb
4. Find all lines where either Jim/Pam/Michael/Dwight's name is followed by an exclamation mark. Use only 1 "!" within your regular expression. How many lines are there? Ignore case (whether or not parts of the names are capitalized or not).
Relevant topics: grep, basic matches, escaping characters
Item(s) to submit:
- The UNIX command(s) used to solve this problem.
- The number of lines where either Jim/Pam/Michael/Dwight's name is followed by an exclamation mark.
Solution
grep -E '(Jim|Pam|Michael|Dwight)!' the_office_dialogue.csv
# or
grep '\(Jim\|Pam\|Michael\|Dwight\)!' the_office_dialogue.csv| wc -l
5. Find all lines that contain the text "that's what" followed by any amount of any text and then "said". How many lines are there?
Relevant topics: grep
Item(s) to submit:
- The UNIX command used to solve this problem.
- The number of lines that contain the text "that's what" followed by any amount of text and then "said".
Solution
grep -i "that's what .* said" /class/datamine/data/movies_and_tv/the_office_dialogue.csv
Regular expressions are really a useful semi language-agnostic tool. What this means is regardless of the programming language your are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using read.csv
. Name the resulting data.frame dat
.
6. The text_w_direction
column in dat
contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expression to detect the directions.
(a) Create a new column called has_direction
that is set to TRUE
if the text_w_direction
column has direction, and FALSE
otherwise. Use the grepl
function in R to accomplish this.
Hint: Make sure all opening brackets "[" have a corresponding closing bracket "]".
Hint: Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.
(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have more than 2 directions? Modify your code again and find how many have more than 5.
We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
Your solution to part (a) should find both lines a match. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.
In our actual dataset, for example, dat$text_w_direction[2789]
is a line with 2 directions.
Relevant topics: grep, grepl, basic matches, escaping characters
Item(s) to submit:
- The R code and regular expression used to solve the first part of this problem.
- The R code and regular expression used to solve the second part of this problem.
- How many lines have >= 2 directions?
- How many lines have >= 5 directions?
Solution
dat$has_direction <- grepl("(\\[.*\\])+", dat$text_w_direction)
#
length(grep("\\[.*\\].*\\[.*\\]", dat$text_w_direction))
length(grep("\\[.*\\].*\\[.*\\].*\\[.*\\].*\\[.*\\].*\\[.*\\]", dat$text_w_direction))
OPTIONAL QUESTION. Use the str_extract_all
function from the stringr
package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called direction
.
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
In this question, your solution may have extracted:
[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]
(It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.)
Relevant topics: str_extract_all, basic matches, escaping characters
Item(s) to submit:
- The R code used to solve this problem.
Solution
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[[^\\[\\]]*\\])", simplify=F)
# or
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[.*?\\])", simplify=F)
Project 5
Motivation: Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc.
Context: We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
Scope: grep, regular expression basics, UNIX utilities, redirection, piping
Learning objectives:
-
Use
cut
to section off and slice up data from the command line. - Use piping to string UNIX commands together.
-
Use
sort
and it's options to sort data in different ways. -
Use
head
to isolate n lines of output. -
Use
wc
to summarize the number of lines in a file or in output. -
Use
uniq
to filter out non-unique lines. -
Use
grep
to search files effectively.
You can find useful examples that walk you through relevant material in The Examples Book:
https://thedatamine.github.io/the-examples-book
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
Don't forget the very useful documentation shortcut ?
for R code. To use, simply type ?
in the console, followed by the name of the function you are interested in. In the Terminal, you can use the man
command to check the documentation of bash
code.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/amazon/amazon_fine_food_reviews.csv
A public sample of the data can be found here: amazon_fine_food_reviews.csv
Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.
Here are three videos that might also be useful, as you work on Project 5:
Questions
1. What is the Id
of the most helpful review, according to the highest HelpfulnessNumerator
?
Important note: You can always pipe output to head
in case you want the first few values of a lot of output. Note that if you used sort
before head
, you may see the following error messages:
sort: write failed: standard output: Broken pipe
sort: write error
This is because head
would truncate the output from sort
. This is okay. See this discussion for more details.
Relevant topics: cut, sort, head, piping
Item(s) to submit:
- Line of UNIX commands used to solve the problem.
-
The
Id
of the most helpful review.
Solution
cut -d, -f5 amazon_fine_food_reviews.csv| sort -nr | head -n3
2. Some entries under the Summary
column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion.
To further clarify what we mean by unique, if we had the following vector in R, c("a", "b", "a", "c")
, its unique values are c("a", "b", "c")
.
Relevant topics: cut, uniq, sort, wc, piping
Item(s) to submit:
- Two lines of UNIX commands used to solve the problem.
-
The ratio of unique
Summary
's.
Solution
cut -d, -f9 amazon_fine_food_reviews.csv | sort -u | wc -l
cut -d, -f9 amazon_fine_food_reviews.csv | wc -l
# 295162/568455
3. Use a chain of UNIX commands, piped in a sequence, to create a frequency table of Score
.
Relevant topics: cut, uniq, sort, piping
Item(s) to submit:
- The line of UNIX commands used to solve the problem.
- The frequency table.
Solution
cut -d, -f7 amazon_fine_food_reviews.csv | sort -n | uniq -c
4. Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why?
Hint: You may need to pipe the output to sort
multiple times.
Hint: To create the frequency table, read through the man
pages for uniq
. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following:
man uniq
Relevant topics: cut, uniq, sort, head, piping, man
Item(s) to submit:
- The line of UNIX commands used to solve the problem.
- The frequency table.
Solution
cut -d, -f3 amazon_fine_food_reviews.csv | sort | uniq -c | sort -nr | head -n1
# You should use UserId instead of ProfileName as the former is a unique identifier
5. Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let's consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let's consider anything in between neither strongly positive nor negative.
Hint: You may find the solution to problem (3) useful.
Relevant topics: cut, uniq, sort, grep, piping
Item(s) to submit:
- The line of UNIX commands used to solve the problem.
Solution
grep -i 'A3OXHLG6DIBRW8' amazon_fine_food_reviews.csv | cut -d, -f7 | sort | uniq -c
6. Find the most helpful review with a Score
of 5. Then (separately) find the most helpful review with a Score
of 1. As before, we are considering the most helpful review to be the review with the highest HelpfulnessNumerator
.
Hint: You can use multiple lines to solve this problem.
Relevant topics: sort, head, piping
Item(s) to submit:
- The lines of UNIX commands used to solve the problem.
-
ProductId
's of both requested reviews.
Solution
sort -t, -k7rn,7 -k5rn,5 amazon_fine_food_reviews.csv | head -n2
sort -t, -k7n,7 -k5rn,5 amazon_fine_food_reviews.csv | head -n2
OPTIONAL QUESTION. For only the two ProductId
s from the previous question, create a new dataset called scores.csv
that contains the ProductId
s and Score
s from all reviews for these two items.
Relevant topics: cut, grep, redirection
Item(s) to submit:
- The line of UNIX commands used to solve the problem.
Solution
cut -d, -f2,7 amazon_fine_food_reviews.csv| grep -Ei '(B00012182G|B003EMQGVI|B00065LI0A)' > scores.csv
Project 6
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Learning objectives:
-
Use
awk
to process and manipulate textual data. - Use piping and redirection within the terminal to pass around data between utilities.
Dataset
The following questions will use the dataset found here or in Scholar:
/class/datamine/data/flights/subset/YYYY.csv
An example from 1987 data can be found here or in Scholar:
/class/datamine/data/flights/subset/1987.csv
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the Origin
, from 1987.csv
. Write another line, this time using awk
to do the same thing. Which one do you prefer, and why?
Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk:
Item(s) to submit:
-
One line of UNIX commands to solve the problem without using
awk
. -
One line of UNIX commands to solve the problem using
awk
. - 1-2 sentences describing which method you prefer and why.
Solution
cut -d, -f17 1987.csv
awk -F, '{print $17}' 1987.csv
2. Write a bash script that accepts a year (1987, 1988, etc.) and a column n and returns the nth column of the associated year of data.
Here are two examples to illustrate how to write a bash script:
Hint: In this example, you only need to turn in the content of your bash script (starting with #!/bin/bash
) without evaluation in a code chunk. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it:
#!/bin/bash
echo "First argument: $1"
echo "Second argument: $2"
If you simply drop that text into a file called my_script.sh
, located here: /home/$USER/my_script.sh
, and if you run the following:
# Setup bash to run; this only needs to be run one time per session.
# It makes bash behave a little more naturally in RStudio.
exec bash
# Navigate to the location of my_script.sh
cd /home/$USER
# Make sure that the script is runable.
# This only needs to be done one time for each new script that you write.
chmod 755 my_script.sh
# Execute my_script.sh
./my_script.sh okay cool
then it will print:
First argument: okay
Second argument: cool
In this example, if we were to turn in the "content of your bash script (starting with #!/bin/bash
) in a code chunk, our solution would look like this:
#!/bin/bash
echo "First argument: $1"
echo "Second argument: $2"
And although we aren't running the code chunk above, we know that it works because we tested it in the terminal.
Hint: Using awk
you could have a script with just two lines: 1 with the "hash-bang" (#!/bin/bash
), and 1 with a single awk
command.
Relevant topics: awk, bash scripts
Item(s) to submit:
-
The content of your bash script (starting with
#!/bin/bash
) in a code chunk.
Solution
#!/bin/bash
awk -F, -v col=$2 '{print $col}' $1.csv
3. How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using awk
, then solve this problem using only awk
.
Here is a similar example, using the election data set:
Relevant topics: cut, grep, wc, awk, piping
Item(s) to submit:
-
One line of UNIX commands to solve the problem without using
awk
. -
One line of UNIX commands to solve the problem using
awk
. - The number of flights that arrived at Indianapolis (IND) in 2008.
Solution
cut -d, -f18 2008.csv | grep 'IND' | wc -l
awk -F, '{if ($18 == "IND") count++}END{print count}' 2008.csv
4. Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you'd like. Are they indeed the same? How many unique values do we have per category (Origin
, Dest
)?
Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data:
Relevant topics: cut, sort, uniq, wc, awk
Item(s) to submit:
- 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same.
- The UNIX command(s) used to figure out if the number of unique origins and destinations are the same.
-
The number of unique values per category (
Origin
,Dest
).
Solution
cut -d, -f17 2008.csv | sort | uniq | wc -l
cut -d, -f18 2008.csv | sort | uniq | wc -l
5. In (4) we found that there are not the same number of unique Origin
's as Dest
's. Find the IATA airport code for all Origin
's that don't appear in a Dest
and all Dest
's that don't appear in an Origin
in the 2008 data.
Hint: The examples on this page should help. Note that these examples are based on Process Substitution, which basically allows you to specify commands whose output would be used as the input of comm
. There should be no space between <
and (
, otherwise your bash will not work as intended.
Relevant topics: comm, cut, sort, uniq, redirection
Item(s) to submit:
- The line(s) of UNIX command(s) used to answer the question.
-
The list of
Origin
s that don't appear inDest
. -
The list of
Dest
s that don't appear inOrigin
.
Solution
comm -23 <(cut -d, -f17 2008.csv | sort | uniq) <(cut -d, -f18 2008.csv | sort | uniq)
comm -13 <(cut -d, -f17 2008.csv | sort | uniq) <(cut -d, -f18 2008.csv | sort | uniq)
6. What was the percentage of flights in 2008 per unique Origin
with the Dest
of "IND"? What percentage of flights had "PHX" as Origin
(among all flights with Dest
of "IND"?
Here is an example using the percentages of donations contributed from CEOs from various States:
Hint: You can do the mean calculation in awk by dividing the result from (3) by the number of unique Origin
's that have a Dest
of "IND".
Relevant topics: awk, sort, grep, wc
Item(s) to submit:
-
The percentage of flights in 2008 per unique
Origin
with theDest
of "IND". -
1-2 sentences explaining how "PHX" compares (as a unique
ORIGIN
) to the otherOrigin
s (all with theDest
of "IND")?
Solution
awk -F, '{if($18=="IND") print $17}' 2008.csv | sort -u | wc -l
awk -F, '{if($18=="IND") print $17}' 2008.csv | grep -i PHX | wc -l
OPTIONAL QUESTION. Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like:
1987, 12345
1988, 44
Run the script with inputs: 1991
and ORD
. Include the output in your submission.
Relevant topics: bash scripts, cut, piping, grep, wc
Item(s) to submit:
- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-
The output of the script given
1991
andORD
as inputs.
Solution
#!/bin/bash
FLIGHTS_OUT="$(cut -d, -f17 $1.csv | grep -i $2 | wc -l)"
FLIGHTS_IN="$(cut -d, -f18 $1.csv | grep -i $2 | wc -l)"
echo "$((FLIGHTS_OUT + FLIGHTS_IN)), $1"
Project 7
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Learning objectives:
-
Use
awk
to process and manipulate textual data. - Use piping and redirection within the terminal to pass around data between utilities.
Dataset
The following questions will use the dataset found in Scholar: /class/datamine/data/flights/subset/YYYY.csv
An example of the data for the year 1987 can be found here.
Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. Write a line of code that prints a list of the unique values in the DayOfWeek
column. Write a line of code that prints a list of the unique values in the DayOfMonth
column. Write a line of code that prints a list of the unique values in the Month
column. Use the 1987.csv
dataset. Are the results what you expected?
Item(s) to submit:
- 3 lines of code used to get a list of unique values for the chosen columns.
- 1-2 sentences explaining whether or not the results are what you expected.
Solution
cut -d, -f3 1987.csv | sort -nu
cut -d, -f4 1987.csv | sort -nu
cut -d, -f2 1987.csv | sort -nu
# We can print the unique values in the DayOfWeek, DayOfMonth, and Month,
# using these three respectively pipelines in bash shell:
cat /class/datamine/data/flights/subset/1987.csv | cut -d, -f4 | sort | uniq
cat /class/datamine/data/flights/subset/1987.csv | cut -d, -f3 | sort | uniq
cat /class/datamine/data/flights/subset/1987.csv | cut -d, -f2 | sort | uniq
# or we can do this equivalently with awk:
cat /class/datamine/data/flights/subset/1987.csv | awk -F, '{print $4}' | sort | uniq
cat /class/datamine/data/flights/subset/1987.csv | awk -F, '{print $3}' | sort | uniq
cat /class/datamine/data/flights/subset/1987.csv | awk -F, '{print $2}' | sort | uniq
2. Our files should have 29 columns. For a given file, write a line of code that prints any lines that do not have 29 columns. Test it on 1987.csv
, were there any rows without 29 columns?
Hint: See here. NF
looks like it may be useful!
Relevant topics: awk
Item(s) to submit:
- Line of code used to solve the problem.
- 1-2 sentences explaining whether or not there were any rows without 29 columns.
Solution
awk -F, '{if (NF != 29) print $0}' 1987.csv
# We can print any lines that do not have 29 columns, as follows:
cat /class/datamine/data/flights/subset/1987.csv | awk -F, '{if (NF !=
29) {print $0}}'
# Note that there are no such lines.
# In other words, every line has exactly 29 columns.
3. Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do not have 29 columns.
Relevant topics: awk, bash scripts
Item(s) to submit:
- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
- The results of running your bash scripts from year 1987 to 2008.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do
awk -F, '{if (NF != 29) print $0}' $f.csv
done
# Here is a bash script for which the variable i cycles through the
# desired years, and the script prints any lines that do not have 29 fields.
#!/bin/bash
for (( i=$1; i<=$2; i++ ))
do
cat /class/datamine/data/flights/subset/${i}.csv | awk -F, '{if (NF
!= 29) {print $0}}'
done
# change permissions, to make the script executable
chmod 755 myscript.sh
./myscript.sh 1987 2008 # run the bash script
# As in question 2, note that there are no such lines.
# In other words, every line has exactly 29 columns.
4. awk
is a really good tool to quickly get some data and manipulate it a little bit. The column Distance
contains the distances of the flights in miles. Use awk
to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.
Example output:
Miles: 12345
Kilometers: 19867.35168
Item(s) to submit:
- The code used to solve the problem.
- The results of running the code.
Solution
awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' 1990.csv
# The total distance traveled in miles is stored in mytotaldistance
# and at the end, we print mytotaldistance (which is given in miles)
# and also mytotaldistance*1.609344 (which is the total distance in
# kilometers).
cat /class/datamine/data/flights/subset/1990.csv | awk -F, '{mytotaldistance = mytotaldistance + $19} END{print "The total distance is ", mytotaldistance, " miles."; print "Equivalently, this is ", mytotaldistance*1.609344, " kilometers."}'
5. Use awk
to calculate the sum of the number of DepDelay
minutes, grouped according to DayOfWeek
. Use 2007.csv
.
Example output:
DayOfWeek: 0
1: 1234567
2: 1234567
3: 1234567
4: 1234567
5: 1234567
6: 1234567
7: 1234567
Note: 1 is Monday.
Relevant topics: awk, sort, piping
Item(s) to submit:
- The code used to solve the problem.
- The output from running the code.
Solution
awk -F, '{delay[$4]=delay[$4]+$16}END{for (d in delay) print d": ", delay[d]}' 2007.csv | sort -n
# We store the DepDelay in an array called mytotaldelays:
cat /class/datamine/data/flights/subset/2007.csv | awk -F, '{mytotaldelays[$4] = mytotaldelays[$4] + $16} END{for (i=1; i<=7; i++) {print "The total DepDelay in minutes is ", mytotaldelays[i], " on day
", i, " of the week."}}'
6. It wouldn't be fair to compare the total DepDelay
minutes by DayOfWeek
as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of DepDelay
minutes by the number of flights per DayOfWeek
. Use 2007.csv
.
Example output:
DayOfWeek: 0
1: 1.234567
2: 1.234567
3: 1.234567
4: 1.234567
5: 1.234567
6: 1.234567
7: 1.234567
Relevant topics: awk, sort, piping
Item(s) to submit:
- The code used to solve the problem.
- The output from running the code.
Solution
awk -F, '{delay[$4]=delay[$4]+$16; flights[$4]++}END{for (d in delay) print d": ", delay[d]/flights[d]}' 2007.csv | sort -n
# Now we also keep track of mynumberofflights on each day of the week.
cat /class/datamine/data/flights/subset/2007.csv | awk -F, '{mytotaldelays[$4] = mytotaldelays[$4] + $16; mynumberofflights[$4] = mynumberofflights[$4] + 1} END{for (i=1; i<=7; i++) {print "The average DepDelay in minutes is ", mytotaldelays[i]/mynumberofflights[i], " on day ", i, " of the week."}}'
Project 8
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Learning objectives:
-
Use
awk
to process and manipulate textual data. - Use piping and redirection within the terminal to pass around data between utilities.
Dataset
The following questions will use the dataset found in Scholar: /class/datamine/data/flights/subset/YYYY.csv
An example of the data for the year 1987 can be found here.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. Let's say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true!
Write a line of awk
code that, prints the total number of flights that occur on weekend days, followed by the total number of flights that occur on the weekdays. Complete this calculation for 2008 using the 2008.csv
file.
Note: Under the column DayOfWeek
, Monday through Sunday are represented by 1-7, respectively.
Relevant topics: awk
Item(s) to submit:
-
Line of
awk
code that solves the problem. - The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.
Solution
awk -F, '{if ($4==1 || $4==2 || $4==3 || $4==4) weekday++; else (weekend++)}END{print weekend,weekday}' 2008.csv
awk -F, '{if ($4 ~ /1|2|3|4/) weekday++; else (weekend++)}END{print weekend,weekday}' 2008.csv
2. Note that in (1), we are comparing 3 days to 4! Write a line of awk
code that, prints the average number of flights on a weekend day, followed by the average number of flights on the weekdays. Continue to use data for 2008.
Hint: You don't need a large if statement to do this, you can use the ~
comparison operator.
Relevant topics: awk
Item(s) to submit:
-
Line of
awk
code that solves the problem. - The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.
Solution
awk -F, '{if ($4==1 || $4==2 || $4==3 || $4==4) weekday++; else (weekend++)}END{print weekend/3,weekday/4}' 2008.csv
awk -F, '{if ($4 ~ /1|2|3|4/) weekday++; else (weekend++)}END{print weekend/3,weekday/4}' 2008.csv
3. We want to look to see if there may be some truth to the whole "snow bird" concept where people will travel to warmer states like Florida and Arizona during the Winter. Let's use the tools we've learned to explore this a little bit.
Take a look at airports.csv
. In particular run the following:
head airports.csv
Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within awk
. This is messy and we would prefer to create a new file called new_airports.csv
without any quotes. Write a line of code to do this.
Note: You may be wondering why we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It's important to practice and learn ways to fix these things.
Hint: You could use gsub
within awk
to replace '"' with ''. You can find how to use gsub
here.
Hint: If you leave out the column number argument to gsub
it will apply the substitution to every field in every column.
Hint:
cat new_airports.csv | wc -l # should be 159 without header
Relevant topics: awk, redirection
Item(s) to submit:
-
Line of
awk
code used to create the new dataset.
Solution
awk -F, '{gsub(/"/, ""); print $0}' airports.csv > new_airports.csv
4. Write a line of commands that creates a new dataset called az_fl_airports.txt
. az_fl_airports.txt
should only contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3),new_airports.csv
as a starting point.
How many airports are there? Did you expect this? Use a line of bash code to count this.
Create a new dataset (called az_fl_flights.txt
) that contains all of the data for flights into or out of Florida and Arizona (using the 2008.csv
file). Use the newly created dataset, az_fl_airports.txt
to accomplish this.
Hint:
cat az_fl_flights.txt | wc -l # should be 484705
Relevant topics: awk, wc, piping
Item(s) to submit:
- All UNIX commands used to answer the questions.
- The number of airports.
- 1-2 sentences explaining whether you expected this number of airports.
Solution
awk -F, '{if ($4 == "AZ" || $4 == "FL") print $1}' new_airports.csv > az_fl_airports.txt
wc -l az_fl_airports.txt
grep -w -F -f az_fl_airports.txt 2008.csv
5. Write a bash script that accepts the year as an argument and performs the same operations as in question 4, returning the number of flights into and out of both AZ and FL for any given year.
Relevant topics: bash scripts, grep, for loop, redirection
Item(s) to submit:
- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
- The line of UNIX code you used to execute the script and create the new dataset.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do
grep -w -F -f $3 $f.csv
done
./my_script.sh 1987 2008 az_fl_airports.txt > az_fl_flights.csv
Project 9
Motivation: Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, cloudflare, a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per this thread on hackernews). Learning SQL is well worth your time!
Context: There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite.
Scope: SQL, sqlite
Learning objectives:
- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/lahman/lahman.db
This is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
Important note: For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:
```{r, include=F}
library(RSQLite)
lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db")
```
Each solution should then be placed in a code chunk like this:
```{sql, connection=lahman}
SELECT * FROM batting LIMIT 1;
```
If you want to use a SQLite-specific function like .tables
(or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
sqlite3 /class/datamine/data/lahman/lahman.db
From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F
like this:
```{sql, connection=lahman, eval=F}
SELECT * FROM batting LIMIT 1;
```
This will allow the code to be displayed without throwing an error.
Solution
.tables
2. Some people like to try to visit all 30 MLB ballparks in their lifetime. Use SQL commands to get a list of parks
and the cities they're located in. For your final answer, limit the output to 10 records/rows.
Note: There may be more than 30 parks in your result, this is ok. For long results, you can limit the number of printed results using the LIMIT
clause.
Hint: Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following:
.headers on
Relevant topics: SELECT, FROM, LIMIT
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT parkname, city FROM parks LIMIT 10;
3. There is nothing more exciting to witness than a home run hit by a batter. It's impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (HR
) in a season. List their playerID
, yearID
, home run total, and the teamID
they played for.
Hint: There are 8 occurrences of home runs greater than or equal to 60.
Hint: The batting
table is where you should look for this question.
Relevant topics: SELECT, FROM, WHERE, LIMIT
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT teamID, yearID, teamID, HR FROM batting WHERE HR>=60;
4. Make a list of players born on your birth day (don't worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year.
Hint: The people
table is where you should look for this question.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT nameFirst, nameLast, birthYear FROM people
WHERE birthMonth==5 AND birthDay==29
ORDER BY birthYear DESC LIMIT 10;
5. Get the Cleveland (CLE) Pitching Roster from the 2016 season (playerID
, W
, L
, SO
). Order the pitchers by number of Strikeouts (SO) in descending order.
Hint: The pitching
table is where you should look for this question.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT playerID, W, L, SO FROM pitching
WHERE teamID=='CLE' AND yearID==2016
ORDER BY SO DESC LIMIT 10;
6. Find the 10 team and year pairs that have the most number of Errors (E
) between 1960 and 1970. Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs?
Hint: The teams
table is where you should look for this question.
Hint: The BETWEEN
clause is useful here.
Hint: It is OK to use multiple queries to answer the question.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT teamID, franchID, yearID, W, L, E FROM teams
WHERE yearID BETWEEN 1960 AND 1970
ORDER BY E DESC
LIMIT 10;
SELECT franchName FROM teamsfranchises WHERE franchID=='LAA';
SELECT teamID, franchID, yearID, W, L, E FROM teams
WHERE yearID >= 1960 AND yearID <= 1970
ORDER BY E DESC
LIMIT 10;
SELECT franchName FROM teamsfranchises WHERE franchID=='LAA';
7. Find the playerID
for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table pitching
)? What year and team did he win the most games as a manager (use table managers
)?
Hint: It is OK to use multiple queries to answer the question.
Note: There was a tie among the two years in which Bob Lemon had the most wins as a pitcher.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT playerID FROM people
WHERE nameFirst=='Bob' AND nameLast=='Lemon';
SELECT teamID, yearID, W FROM pitching
WHERE playerID=='lemonbo01'
ORDER BY W DESC;
SELECT teamID, yearID, W FROM managers
WHERE playerID=='lemonbo01'
ORDER BY W DESC;
Project 10
Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data.
Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG
, MIN
, and MAX
.
Scope: SQL, sqlite
Learning objectives:
- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.
Dataset
The following questions will use the dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at scholar-db.rcac.purdue.edu
. As in Project 9, this is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
Important note: For this project all solutions should be done using R code chunks, and the RMariaDB
package. Run the following code to load the library:
library(RMariaDB)
Solution
library(RMariaDB)
host <- "scholar-db.rcac.purdue.edu"
dbname <- "lahmandb"
user <- "lahman_user"
password <- "HitAH0merun"
con <- dbConnect(MariaDB(),host=host,
dbname=dbname,
user=user,
password=password)
head(dbGetQuery(con, "SHOW tables;"))
2. How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (SB
) and hit at least 40 home runs (HR
) in one year.
Hint: Use the batting
table.
Important note: You only need to run library(RMariaDB)
and the dbConnect
portion of the code a single time towards the top of your project. After that, you can simply reuse your connection con
to run queries.
Important note: In our project template, for this project, make all of the SQL queries using the dbGetQuery
function, which returns the results directly in R
. Therefore, your RMarkdown
blocks for this project should all be {r}
blocks (as opposed to the {sql}
blocks used in Project 9).
Hint: You can use dbGetQuery
to run your queries from within R. Example:
dbGetQuery(con, "SELECT * FROM battings LIMIT 5;")
Note: We already demonstrated the correct SQL query to use for the 40/40 club in this video, but now we want you to use RMariaDB
to solve this query:
Relevant topics: dbGetQuery, AND/OR, DISTINCT, COUNT
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con,"SELECT DISTINCT COUNT(*) FROM batting WHERE HR>=40 AND SB>=40;")
3. Find Corey Kluber's lifetime across his career (i.e., use SUM
from SQL
to summarize his achievements) in two categories: strikeouts (SO
) and walks (BB
). Also display his Strikeouts to Walks ratio. A Strikeout to Walks ratio is calculated by this equation: \(\frac{Strikeouts}{Walks}\).
Important note: Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use sum
in R rather than SUM
in SQL.
Hint: Use the people
table to find the playerID
and use the pitching
table to find the statistics.
Relevant topics: dbGetQuery, SUM, SELECT, FROM, WHERE
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con, "SELECT playerID FROM people WHERE nameFirst ='Corey' AND nameLast = 'Kluber';")
dbGetQuery(con, "SELECT SUM(SO), SUM(BB), SUM(SO)/SUM(BB) FROM pitching WHERE playerID ='klubeco01';")
4. How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"?
Hint: Use the people
table to find the playerID
and use the batting
table to find the statistics.
Relevant topics: dbGetQuery, AND/OR, SUM
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con, "SELECT playerID FROM people WHERE nameFirst ='Giancarlo' AND nameLast = 'Stanton';")
dbGetQuery(con, "SELECT COUNT(*), SUM(SO) FROM batting WHERE playerID = 'stantmi03' AND (teamID = 'MIA' OR teamID = 'FLO');")
5. The Batting Average is a metric for a batter's performance. The Batting Average in a year is calculated by \(\frac{H}{AB}\) (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to playerID
, teamID
, and yearID.
Hint: Use the batting
table.
Relevant topics: dbGetQuery, ORDER BY, BETWEEN
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con,"SELECT playerID, teamID, yearID, H, AB, H/AB FROM batting
WHERE yearID BETWEEN 2000 AND 2010 AND AB>300
ORDER BY H/AB DESC LIMIT 5;")
6. How many unique players have hit > 50 home runs (HR
) in a season?
Hint: Use the batting
table.
Hint: If you view DISTINCT
as being paired with SELECT
, instead, think of it as being paired with one of the fields you are selecting.
Relevant topics: dbGetQuery, DISTINCT, COUNT
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con, "SELECT COUNT(DISTINCT playerID) FROM batting WHERE HR>50;")
# As you can see, DISTINCT is with the field playerID, not with SELECT.
# The following query should help distinguish this.
dbGetQuery(con, "SELECT COUNT(DISTINCT playerID), COUNT(playerID) FROM batting WHERE HR>50;")
7. Find the number of unique players that attended Purdue University. Start by finding the schoolID
for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic.
Hint: Use the schools
table to get the schoolID
s, and the collegeplaying
table to get the statistics.
Hint: You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates.
Hint: To find IU's id, try the following query: SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';
. You can find more about the LIKE clause and %
here.
Relevant topics: dbGetQuery, plotting in R, DISTINCT, COUNT
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
dbGetQuery(con, "SELECT schoolID FROM schools WHERE name_full='Purdue University';")
dbGetQuery(con, "SELECT schoolID FROM schools WHERE LIKE '%indiana%';")
dbGetQuery(con,"SELECT COUNT(DISTINCT playerID) FROM collegeplaying WHERE schoolID ='purdue';")
dbGetQuery(con,"SELECT DISTINCT playerID FROM collegeplaying WHERE schoolID ='indiana';")
purdue <- length(unique(dbGetQuery(con,"SELECT playerID FROM collegeplaying WHERE schoolID ='purdue';"))$playerID)
iu <- length(unique(dbGetQuery(con,"SELECT playerID FROM collegeplaying WHERE schoolID ='indiana';"))$playerID)
barplot(purdue, iu)
barplot(log(c(purdue, iu)), main="Pro baseball players: Indiana vs. Purdue", col=c("#990000","#CFB53B"), names.arg=c("Purdue", "IU"), cex.names=.5, ylab="Log count")
Project 11
Motivation: Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.
Context: We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset!
Scope: SQL, SQL in R
Learning objectives:
- Demonstrate the ability to interact with popular database management systems within R.
- Solve data-driven problems using a combination of SQL and R.
- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.
- Showcase the ability to filter, alias, and write subqueries.
- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.
Dataset
The following questions will use the elections
database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar:
/class/datamine/data/election/itcontYYYY.txt
(for example, data for year 1980 would be /class/datamine/data/election/itcont1980.txt
)
A public sample of the data can be found here:
https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt)
Questions
Important note: For this project you will need to connect to the database elections
using the RMariaDB
package in R. Include the following code chunk in the beginning of your RMarkdown file:
```{r setup-database-connection}
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(),
host="scholar-db.rcac.purdue.edu",
db="elections",
user="elections_user",
password="Dataelect!98")
```
When a question involves SQL queries in this project, you may use a SQL code chunk (with {sql}
), or an R code chunk (with {r}
) and functions like dbGetQuery
as you did in Project 10. Please refer to Question 5 in the project template for examples.
1. Approximately how large was the lahman database (use the sqlite database in Scholar: /class/datamine/data/lahman/lahman.db
)? Use UNIX utilities you've learned about this semester to write a line of code to return the size of that .db file (in MB).
The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder /class/datamine/data/election/
. How much data (in MB) is there?
The data in that folder has been added to the elections
database, all aggregated in the elections
table. Write a SQL query that returns the number of rows of data are in the database. How many rows of data are in the table elections
?
Note: These are some examples of how to get the sizes of collections of files in UNIX.
Hint: The SQL query will take some time! Be patient.
Note: You may use more than one code chunk in your RMarkdown file for the different tasks.
Note: We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from ls
and du
to match, use the --apparent-size
option with du
.
Note: A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See here for more information. For this question, either solution will be given full credit. This is a potentially useful example.
Relevant topics: SQL, SQL in R, awk, ls, du
Item(s) to submit:
- Line of code (bash/awk) to show the size (in MB) of the lahman database file.
- Approximate size of the lahman database in MB.
-
Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in
/class/datamine/data/election
. - The size of the elections data in MB.
-
SQL query used to find the number of rows of data in the
elections
table in theelections
database. -
The number of rows in the
elections
table in theelections
database.
Solution
ls -la *.txt | awk '{ total += $4; }END{print total/1000000}'
SELECT COUNT(*) FROM elections;
2. Write a SQL query using the LIKE
command to find a unique list of zip_code
s that start with "479".
Write another SQL query and answer: How many unique zip_code
s are there that begin with "479"?
Note: Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project.
Hint: The first query returns a list of zip codes, and the second returns a count.
Hint: Make sure you only select zip_code
s.
Item(s) to submit:
- SQL queries used to answer the question.
- The first 5 results from running the query.
Solution
SELECT DISTINCT zip_code FROM elections WHERE zip_code LIKE '479%' LIMIT 5;
SELECT COUNT(DISTINCT zip_code) FROM elections WHERE zip_code LIKE '479%';
3. Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an alias for our field so it doesn't read COUNT(*)
but rather Indiana Donations
.
Hint: You may enclose an alias's name in quotation marks (single or double) when the name contains space.
Relevant topics: SQL, WHERE, aliasing
Item(s) to submit:
- SQL query used to answer the question.
- The result of the SQL query.
Solution
SELECT COUNT(*) FROM elections WHERE state='IN';
SELECT COUNT(*) AS 'Indiana Donations' FROM elections WHERE state='IN';
4. Rewrite the query in (3) so the result is displayed like: IN: 1234567
. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to OH: 1234567
. In other words, the state abbreviation should be dynamic, not static.
Note: This video demonstrates how to use CONCAT in a MySQL query.
Hint: Use CONCAT and aliasing to accomplish this.
Hint: Remember, state
contains the state abbreviation.
Relevant topics: SQL, aliasing, CONCAT
Item(s) to submit:
- SQL query used to answer the question.
Solution
SELECT CONCAT(state, ": ", COUNT(*)) AS 'Donations' FROM elections WHERE state='IN';
5. In (2) we wrote a query that returns a unique list of zip_code
s that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip_code
s starting with "479". What percent of donations in Indiana come from said zip_code
s?
Note: This video gives two examples of sub-queries.
Hint: You can simply manually calculate the percent using the count in (2) and (5).
Relevant topics: SQL, aliasing, subqueries, IN
Item(s) to submit:
- SQL queries used to answer the question.
-
The percentage of donations from Indiana from
zip_code
s starting with "479".
Solution
SELECT COUNT(*) FROM elections WHERE zip_code IN (SELECT DISTINCT zip_code FROM elections WHERE zip_code LIKE '479%');
6. In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, number of donations from each state, the sum of the donations (transaction_amt
). Which 5 states gave the most donations (highest count)? Order you result from most to least.
Note: In this video we demonstrate GROUP BY
, ORDER BY
, DESC
, and other aspects of MySQL that might help with this question.
Hint: You may want to create an alias in order to sort.
Relevant topics: SQL, GROUP BY
Item(s) to submit:
- SQL query used to answer the question.
- Which 5 states gave the most donations?
Solution
SELECT state, COUNT(*) AS 'cnt', SUM(transaction_amt) AS 'total' FROM elections GROUP BY state ORDER BY cnt DESC;
Project 12
Motivation: Databases are comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore learn about and practice using joins on a database containing bike trip information from the Bay Area Bike Share.
Context: We've introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
Scope: SQL, sqlite, joins
Learning objectives:
- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.
- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.
- Showcase the ability to filter, alias, and write subqueries.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db
A public sample of the data can be found here.
Important note: For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:
```{r, include=F}
library(RSQLite)
bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db")
```
Each solution should then be placed in a code chunk like this:
```{sql, connection=bikeshare}
SELECT * FROM station LIMIT 5;
```
If you want to use a SQLite-specific function like .tables
(or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db
From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F
like this:
```{sql, connection=bikeshare, eval=F}
SELECT * FROM station LIMIT 5;
```
This will allow the code to be displayed without throwing an error.
There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform all of the other types of joins (RIGHT JOIN, FULL OUTER JOIN). You can see the documentation here: SQL, INNER JOIN, LEFT JOIN
Questions
1. Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the sqlite
concatenate operator ||
(see here). Write a query that returns the first 5 records of information from the station
table formatted in the following way:
(id) name @ (lat, long)
For example:
(84) Ryland Park @ (37.342725, -121.895617)
Hint: Here is a video about how to concatenate strings in SQLite.
Relevant topics: aliasing, concat
Item(s) to submit:
- SQL query used to solve this problem.
-
The first 5 records of information from the
station
table.
Solution
SELECT '('||id||') '||name||' @ ('||lat||','||long||')' FROM station LIMIT 5;
2. There is a variety of interesting weather information in the weather
table. Write a query that finds the average mean_temperature_f
by zip_code
. Which is on average the warmest zip_code
?
Use aliases to format the result in the following way:
Zip Code|Avg Temperature
94041|61.3808219178082
Note that this is the output if you use sqlite
in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly.
Hint: Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT
Relevant topics: aliasing, GROUP BY, AVG
Item(s) to submit:
- SQL query used to solve this problem.
- The results of the query copy and pasted.
Solution
SELECT zip_code AS 'Zip Code', AVG(mean_temperature_f) AS 'Avg Temperature' FROM weather GROUP BY zip_code;
Solution
SELECT COUNT(DISTINCT zip_code) FROM trip;
SELECT zip_code, COUNT(zip_code) FROM trip GROUP BY zip_code;
4. In (2) we wrote a query that finds the average mean_temperature_f
by zip_code
. What if we want to tack on our results in (2) to information from each row in the station
table based on the zip_code
s? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables.
Hint: Use the query from (2) as a sub query within your solution.
Hint: Here is a video about JOIN and LEFT JOIN.
Relevant topics: INNER JOIN, subqueries, aliasing
Item(s) to submit:
- SQL query used to solve this problem.
Solution
SELECT * FROM station as s INNER JOIN (SELECT zip_code AS 'Zip Code', AVG(mean_temperature_f) AS 'Avg Temperature' FROM weather GROUP BY zip_code) AS sub ON s.zip_code=sub.'Zip Code';
5. In (3) we alluded to the fact that the zip_code
s in the trip
table aren't very consistent. Users can enter a zip code when using the app. This means that zip_code
can be from anywhere in the world! With that being said, if the zip_code
is one of the 5 zip_code
s for which we have weather data (from question 2), we can add that weather information to matching rows of the trip
table. In (4) we used an INNER JOIN to append some weather information to each row in the station
table. For this question, write a query that performs an INNER JOIN and appends weather data from the weather
table to the trip data from the trip
table. Limit your output to 5 lines.
Important note: Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the trip
table, you can use start_date
for for the date information.
Hint: You will want to wrap your dates and datetimes in sqlite's date
function prior to comparison.
Relevant topics: INNER JOIN, aliasing
Item(s) to submit:
- SQL query used to solve this problem.
- First 5 lines of output.
Solution
SELECT * FROM trip AS t INNER JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date) LIMIT 5;
Project 13
Motivation: Databases you will work with won't necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like primary key, and foreign key will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice!
Context: We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you've already learned, with a focus on subqueries and joins.
Scope: SQL, sqlite
Learning objectives:
-
Write and run SQL queries in
sqlite
on real-world data. - Identify primary and foreign keys in a SQL table.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/imdb.db
Important note: For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd:
```{r, include=F}
library(RSQLite)
imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
```
If you want to use a SQLite-specific function like .tables
(or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
sqlite3 /class/datamine/data/movies_and_tv/imdb.db
From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F
like this:
```{sql, connection=imdb, eval=F}
SELECT * FROM titles LIMIT 5;
```
This will allow the code to be displayed without throwing an error.
Questions
1. A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the titles
table. Do you think there are any primary keys? How about foreign keys? Now examine the episodes
table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?
Hint: A primary key can also be a foreign key.
Hint: Here are two videos. The first video will remind you how to find the names of all of the tables in the imdb
database. The second video will introduce you to the titles
and episodes
tables in the imdb
database.
Relevant topics: primary key, foreign key
Item(s) to submit:
-
List any primary or foreign keys in the
titles
table. -
List any primary or foreign keys in the
episodes
table.
2. If you paste a title_id
to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series Grey's Anatomy. Write a SQL query to confirm that the title_id
tt0413573 does indeed belong to Grey's Anatomy. Then browse imdb.com and find your favorite TV show. Get the title_id
from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database:
SELECT * FROM titles WHERE title_id='<title id here>';
Make sure to replace "<title id here>" with the title_id
of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons.
Relevant topics: SELECT, WHERE
Item(s) to submit:
-
SQL query used to confirm that
title_id
tt0413573 does indeed belong to Grey's Anatomy. - The output of the query.
-
The
title_id
of your favorite TV show. -
SQL query used to confirm the
title_id
for your favorite TV show. - The output of the query.
3. The episode_title_id
column in the episodes
table references titles of individual episodes of a TV series. The show_title_id
references the titles of the show itself. With that in mind, write a query that gets a list of all of the episodes_title_id
s (found in the episodes
table), with the associated primary_title
(found in the titles
table) for each episode of Grey's Anatomy.
Relevant topics: INNER JOIN
Hint: This video shows how to extract titles of episodes in the imdb
database.
Item(s) to submit:
- SQL query used to solve the problem in a code chunk.
4. We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of episode_title_id
s (found in the episodes
table), with the associated primary_title
(found in the titles
table) for each episode.
Hint: This part is just like question (3) but this time with your favorite TV show, which you chose in (2).
Hint: This video shows how to use a subquery, to JOIN
a total of three tables in the imdb
database.
Relevant topics: INNER JOIN, aliasing
Item(s) to submit:
- SQL query used to solve the problem in a code chunk.
- The first 5 results from your query.
5. Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode?
Relevant topics: INNER JOIN, aliasing, subqueries, ORDER BY, DESC, LIMIT
Note: Various helpful examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL query used to solve the problem in a code chunk.
-
The
episode_title_id
,primary_title
, andrating
of the top rated episode from your favorite TV series, in question (2). - A statement saying whether the highest rated episode is also your favorite episode.
Project 14
Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.
Context: We've just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.
Scope: awk, UNIX utilities, bash scripts, fread
Learning objectives:
- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
- Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.
- Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
- Use grep to search files effectively.
- Use cut to section off data from the command line.
- Use piping to string UNIX commands together.
- Use awk for data extraction, and preprocessing.
- Create bash scripts to automate a process or processes.
Dataset
The following questions will use PLOTSNAP.csv from the data folder found in Scholar:
/class/datamine/data/forest
To read more about PLOTSNAP.csv that you will be working with:
Questions
1. Take a look at at PLOTSNAP.csv
. Write a line of awk code that displays the STATECD
followed by the number of rows with that STATECD
.
Relevant topics: awk
Item(s) to submit:
- Code used to solve the problem.
-
Count of the following
STATECD
s: 1, 2, 4, 5, 6
2. Unfortunately, there isn't a very accessible list available that shows which state each STATECD
represents. This is no problem for us though, the dataset has LAT
and LON
! Write some bash that prints just the STATECD
, LAT
, and LON
.
Note: There are 92 columns in our dataset: awk -F, 'NR==1{print NF}' PLOTSNAP.csv
. To create a list of STATECD
to state, we only really need STATECD
, LAT
, and LON
. Keeping the other 89 variables will keep our data at 2.1gb.
Item(s) to submit:
- Code used to solve the problem.
-
The output of your code piped to
head
.
3. fread
is a "Fast and Friendly File Finagler". It is part of the very popular data.table
package in R. We will learn more about this package next semester. For now, read the documentation here and use the cmd
argument in conjunction with your bash code from (2) to read the data of STATECD
, LAT
, and LON
into a data.table
in your R environment.
Relevant topics: fread
Item(s) to submit:
- Code used to solve the problem.
-
The
head
of the resultingdata.table
.
4. We are going to further understand the data from question (3) by finding the actual locations based on the LAT
and LON
columns. We can use the library revgeo
to get a location given a pair of longitude and latitude values. revgeo
uses a free API hosted by photon in order to do so.
For example:
library(revgeo)
revgeo(longitude=-86.926153, latitude=40.427055, output='frame')
The code above will give you the address information in six columns, from the most-granular housenumber
to the least-granular country
. Depending on the coordinates, revgeo
may or may not give you results for each column. For this question, we are going to keep only the state
column.
There are over 4 million rows in our dataset -- we do not want to hit photon's API that many times. Instead, we are going to do the following:
First: Unless you feel comfortable using data.table
, convert your data.table
to a data.frame
:
my_dataframe <- data.frame(my_datatable)
Second: Calculate the average LAT
and LON
for each STATECD
, and call the new data.frame
, dat
. This should result in 57 rows of lat/long pairs.
Third: For each row in dat
, run a reverse geocode and append the state
to a new column called STATE
.
Hint: To calculate the average LAT
and LON
for each STATECD
, you could use the sqldf
package to run SQL queries on your data.frame
.
Hint: mapply
is a useful apply function to use to solve this problem.
Hint: Here is some extra help:
library(revgeo)
points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866))
# Note that the "output" argument gets passed to the "revgeo" function.
mapply(revgeo, points$longitude, points$latitude, output="frame")
# The output isn't in a great format, and we'd prefer to just get the "state" data.
# Let's wrap "revgeo" into another function that just gets "state" and try again.
get_state <- function(lon, lat) {
return(revgeo(lon, lat, output="frame")["state"])
}
mapply(get_state, points$longitude, points$latitude)
Important note: It is okay to get "Not Found" for some of the addresses.
Relevant topics: apply functions, sqldf
Item(s) to submit:
- Code used to solve the problem.
-
The
head
of the resultingdata.frame
.
Project 15
Motivation: We've done a lot of work with SQL this semester. Let's review concepts in this project and mix and match R and SQL to solve data-driven problems.
Context: In this project, we will reinforce topics you've already learned, with a focus on SQL.
Scope: SQL, sqlite, R
Learning objectives:
-
Write and run SQL queries in
sqlite
on real-world data. - Use SQL from within R.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/imdb.db
Questions
1. What is the first year where our database has > 1000 titles? Use the premiered
column in the titles
table as our year. What year has the most titles?
Hint: There could be missing values in premiered
. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL here.
Relevant topics: COUNT, GROUP BY, ORDER BY, DESC, HAVING, NULL
Item(s) to submit:
- SQL queries used to answer the questions.
- What year is the first year to have > 1000 titles?
- What year has the most titles?
2. How many, and what are the unique type
s from the titles
table? Fpr the year found in question (1) with the most titles
, how many titles of each type
are there?
Item(s) to submit:
- SQL queries used to answer the questions.
-
How many and what are the unique
types
from thetitles
table? -
A list of
type
and and count for the year (premiered
) that had the mosttitles
.
F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One ...". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let's use SQL and R to take a look at how many times each characters' names appear in the title of the episodes.
3. Write a query that gets the episode_title_id
, primary_title
, rating
, and votes
, of all of the episodes of Friends (title_id
is tt0108778).
Hint: You can slightly modify the solution to question (5) in project 13.
Relevant topics: INNER JOIN, subqueries, aliasing
Item(s) to submit:
- SQL query used to answer the question.
- First 5 results of the query.
4. Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes primary_title
s contained their name?
Relevant topics: SQL in R, grep
Item(s) to submit:
- R code in a code chunk that was used to find the solution.
- The solution pasted below the code chunk.
5. Create a graphic showing our results in (4) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.
Relevant topics: plotting
Item(s) to submit:
- The R code used to generate the graphic.
- The graphic in a png or jpg/jpeg format.
6. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see type
column from titles
table): Romance, Comedy, Animation. In the titles
table, you can find the genres in the genres
column. There may be some overlap (i.e. a movie may have more than one genre), this is ok.
To query rows which have the genre Action as one of its genres:
SELECT * FROM titles WHERE genres LIKE '%action%';
Relevant topics: LIKE, INNER JOIN
Item(s) to submit:
- Any code you used to solve the problem in a code chunk.
- The average rating of each of the genres listed for movies.
STAT 39000
Project 1
Motivation: In this project we will jump right into an R review. In this project we are going to break one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.
Context: We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we've previously learned.
Scope: data wrangling in R, functions
Learning objectives:
- Comprehend what a function is, and the components of a function in R.
- Read and write basic (csv) data.
- Utilize apply functions in order to solve a data-driven problem.
Make sure to read about, and use the template found here, and the important information about projects submissions here.
You can find useful examples that walk you through relevant material in The Examples Book:
https://thedatamine.github.io/the-examples-book
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
Important note: It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials.
We decided to move away from ThinLinc and away from the version of RStudio used last year (https://desktop.scholar.rcac.purdue.edu). The version of RStudio is known to have some strange issues when running code chunks.
Remember the very useful documentation shortcut ?
. To use, simply type ?
in the console, followed by the name of the function you are interested in.
You can also look for package documentation by using help(package=PACKAGENAME)
, so for example, to see the documentation for the package ggplot2
, we could run:
help(package=ggplot2)
Sometimes it can be helpful to see the source code of a defined function. A function is any chunk of organized code that is used to perform an operation. Source code is the underlying R
or c
or c++
code that is used to create the function. To see the source code of a defined function, type the function's name without the ()
. For example, if we were curious about what the function Reduce
does, we could run:
Reduce
Occasionally this will be less useful as the resulting code will be code that calls c
code we can't see. Other times it will allow you to understand the function better.
Dataset:
/class/datamine/data/airbnb
Often times (maybe even the majority of the time) data doesn’t come in one nice file or database. Explore the datasets in /class/datamine/data/airbnb
.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. You may have noted that, for each country, city, and date we can find 3 files: calendar.csv.gz
, listings.csv.gz
, and reviews.csv.gz
(for now, we will ignore all files in the "visualisations" folders).
Let's take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (calendar.csv.gz
, listings.csv.gz
, and reviews.csv.gz
). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.
Hint: read.csv
has an argument to select the number of rows we want to read.
Hint: Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown. So you do not need to display the first 50 rows of the listings and/or reviews, in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.
Item(s) to submit:
- Chunk of code used to read the first 50 rows of each dataset.
- 1-2 sentences briefly describing the information contained in each dataset.
- Name(s) of variable(s) that could be used to join them.
To read a compressed csv, simply use the read.csv
function:
dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
head(dat)
Let's work towards getting this data into an easier format to analyze. From now on, we will focus on the listings.csv.gz
datasets.
Solution
The calendar.csv.gz
file for 2019-07-08 in Hawaii describes the listing_id
, date
, available
(t or f), price
, adjusted_price
, minimum_nights
, and maximum_nights
hawaii_calendar <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/calendar.csv.gz")
head(hawaii_calendar, n=50)
The listings.csv.gz
file for 2019-07-08 in Hawaii has 106 variables, which describe the very specific attributes of the airbnb listings.
hawaii_calendar <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/listings.csv.gz")
dim(hawaii_listings)
The reviews.csv.gz
file for 2019-07-08 in Hawaii describes the listing_id
, id
, date
, reviewer_id
, reviewer_name
, and comments
hawaii_calendar <- read.csv("/class/datamine/data/airbnb/united-states/hi/hawaii/2019-07-08/data/reviews.csv.gz")
head(hawaii_reviews, n=50)
The variables that might be used to compare the tables are: date
, id
, listing_id
, maximum_nights
, minimum_nights
, price
t <- table(c(names(hawaii_calendar), names(hawaii_listings), names(hawaii_reviews)))
t[t > 1]
2. Write a function called get_paths_for_country
, that, given a string with the country name, returns a vector with the full paths for all listings.csv.gz
files, starting with /class/datamine/data/airbnb/...
.
For example, the output from get_paths_for_country("united-states")
should have 28 entries. Here are the first 5 entries in the output:
[1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"
[2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"
[3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"
[4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"
[5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"
Hint: list.files
is useful with the recursive=T
option.
Hint: Use grep
to search for the pattern listings.csv.gz
(within the results from the first hint), and use the option value=T
to display the values found by the grep
function.
Item(s) to submit:
-
Chunk of code for your
get_paths_for_country
function.
Solution
We extract all 28 of the listings for the United States first:
myprefix <- "/class/datamine/data/airbnb/united-states/"
paste0(myprefix, grep("listings.csv.gz", list.files(myprefix, recursive=T), value=T))
Now we build a function that can do the same thing, for any country
get_paths_for_country <- function(mycountry) {
myprefix <- paste0("/class/datamine/data/airbnb/", mycountry, "/")
paste0(myprefix, grep("listings.csv.gz", list.files(myprefix, recursive=T), value=T))
}
and we test this for several countries:
get_paths_for_country("united-states")
get_paths_for_country("brazil")
get_paths_for_country("south-africa")
get_paths_for_country("canada")
3. Write a function called get_data_for_country
that, given a string with the country name, returns a data.frame containing the all listings data for that country. Use your previously written function to help you.
Hint: Use stringsAsFactors=F
in the read.csv
function.
Hint: Use do.call(rbind, <listofdataframes>)
to combine a list of dataframes into a single dataframe.
Relevant topics: rbind, lapply, function
Item(s) to submit:
-
Chunk of code for your
get_data_for_country
function.
Solution
We first get the data from the Canada entries. To do this, we sapply the read.csv
function to each of the 6 results from get_paths_for_country("canada")
In other words, we read in these 6 data frames.
myresults <- sapply(get_paths_for_country("canada"), read.csv, stringsAsFactors=F, simplify=F)
We get a list of 6 data frames:
length(myresults)
class(myresults)
class(myresults[[1]])
class(myresults[[6]])
and we can check the dimensions of each of the 6 data frames
dim(myresults[[1]])
dim(myresults[[2]])
dim(myresults[[3]])
dim(myresults[[4]])
dim(myresults[[5]])
dim(myresults[[6]])
this is more easily accomplished with another sapply
:
sapply(myresults, dim)
We can rbind
all 6 of these data frames into one big data frame as follows:
bigDF <- do.call(rbind, myresults)
class(bigDF)
dim(bigDF)
Now we create the desired function called get_data_for_country
get_data_for_country <- function(mycountry) {
myresults <- sapply(get_paths_for_country(mycountry), read.csv, stringsAsFactors=F, simplify=F)
do.call(rbind, myresults)
}
and we test it on Canada.
mynewbigDF <- get_data_for_country("canada")
The result has the same size as before
dim(mynewbigDF)
4. Use your get_data_for_country
to get the data for a country of your choice, and make sure to name the data.frame listings
. Take a look at the following columns: host_is_superhost
, host_has_profile_pic
, host_identity_verified
, and is_location_exact
. What is the data type for each column? (You can use class
or typeof
or str
to see the data type.)
These columns would make more sense as logical values (TRUE/FALSE/NA).
Write a function called transform_column
that, given a column containing lowercase "t"s and "f"s, your function will transform it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (""
), and we need to be careful when transforming the data. Test your function on column host_is_superhost
.
Relevant topics: class, typeof, str, toupper, as.logical
Item(s) to submit:
-
Chunk of code for your
transform_column
function. -
Type of
transform_column(listings$host_is_superhost)
.
Solution
These 4 columns from mynewbigDF
(which has the data for Canada) only have values "t", "f", ""
head(mynewbigDF$host_is_superhost)
head(mynewbigDF$host_has_profile_pic)
head(mynewbigDF$host_identity_verified)
head(mynewbigDF$is_location_exact)
Please note the 44 values of ""
(which are easy to miss) In the first 3 out of 4 of these columns:
table(mynewbigDF$host_is_superhost)
table(mynewbigDF$host_has_profile_pic)
table(mynewbigDF$host_identity_verified)
table(mynewbigDF$is_location_exact)
These are all character vectors, which we can check using class
, typeof
, or str
:
class(mynewbigDF$host_is_superhost)
class(mynewbigDF$host_has_profile_pic)
class(mynewbigDF$host_identity_verified)
class(mynewbigDF$is_location_exact)
typeof(mynewbigDF$host_is_superhost)
typeof(mynewbigDF$host_has_profile_pic)
typeof(mynewbigDF$host_identity_verified)
typeof(mynewbigDF$is_location_exact)
str(mynewbigDF$host_is_superhost)
str(mynewbigDF$host_has_profile_pic)
str(mynewbigDF$host_identity_verified)
str(mynewbigDF$is_location_exact)
We have several ways to transform a column. For example, we could go element-by-element, and make substitutions, like this:
v <- mynewbigDF$host_is_superhost
Here is the way that the values look at the start:
table(v)
v[toupper(v)=="T"] <- TRUE
v[toupper(v)=="F"] <- FALSE
v[toupper(v)==""] <- NA
and here are the values now:
table(v)
You might think that the NA
values disappeared, but they just do not show up in the table by default. You can force them to appear, and then we see that the counts of the three values are the same as before.
table(v, useNA="always")
Here is the function:
transform_column <- function(v) {
v[toupper(v)=="T"] <- TRUE
v[toupper(v)=="F"] <- FALSE
v[toupper(v)==""] <- NA
v
}
We can try the function on mynewbigDF$host_is_superhost
:
head(transform_column(mynewbigDF$host_is_superhost))
table(transform_column(mynewbigDF$host_is_superhost))
Another possibility is to make a map, in which we put the old values as the names, and the new values as the values in the vector:
mymap <- c(TRUE, FALSE, NA)
names(mymap) <- c("T", "F", "")
head(mymap[toupper(mynewbigDF$host_is_superhost)])
and if you do not want the names to appear on the vector, you can remove them, like this:
head(unname(mymap[toupper(mynewbigDF$host_is_superhost)]))
Finally we can check the table of the results:
table(mymap[toupper(mynewbigDF$host_is_superhost)])
This might seem strange, and if you do not like it, you can just use the solution given above. If you do like this, and want to wrap it into a function, we can write:
transform_column <- function(v) {
mymap <- c(TRUE, FALSE, NA)
names(mymap) <- c("T", "F", "")
unname(mymap[toupper(v)])
}
and again we can try this new version of the function on mynewbigDF$host_is_superhost
:
head(transform_column(mynewbigDF$host_is_superhost))
table(transform_column(mynewbigDF$host_is_superhost))
5. Create a histogram for response rates (host_response_rate
) for super hosts (where host_is_superhost
is TRUE
). If your listings do not contain any super hosts, load data from a different country. Note that we first need to convert host_response_rate
from a character containing "%" signs to a numeric variable.
Relevant topics: gsub, as.numeric
Item(s) to submit:
- Chunk of code used to answer the question.
- Histogram of response rates for super hosts.
Solution
Now we look at the Canada results, for which host_is_superhost
is TRUE
. We make a histogram of the host_response_rate
for those values. To do this, we first get the host_response_rate
values (for which host_is_superhost
is TRUE
)
myvalues <- mynewbigDF$host_response_rate[transform_column(mynewbigDF$host_is_superhost) == TRUE]
and then we remove the percentage symbols from myvalues
and convert the character vector to numbers, and then finally make the histogram.
hist(as.numeric(gsub("%", "", myvalues)))
As a closing note, we could remove the check to see whether the inner values are TRUE
because, by default, we will only exact the TRUE
values when we do a lookup like this:
myvalues <- mynewbigDF$host_response_rate[transform_column(mynewbigDF$host_is_superhost)]
hist(as.numeric(gsub("%", "", myvalues)))
Project 2
Motivation: The ability to quickly reproduce an analysis is important. It is often necessary that other individuals will need to be able to understand and reproduce an analysis. This concept is so important there are classes solely on reproducible research! In fact, there are papers that investigate and highlight the lack of reproducibility in various fields. If you are interested in reading about this topic, a good place to start is the paper titled "Why Most Published Research Findings Are False", by John Ioannidis (2005).
Context: Making your work reproducible is extremely important. We will focus on the computational part of reproducibility. We will learn RMarkdown to document your analyses so others can easily understand and reproduce the computations that led to your conclusions. Pay close attention as future project templates will be RMarkdown templates.
Scope: Understand Markdown, RMarkdown, and how to use it to make your data analysis reproducible.
Learning objectives:
- Use Markdown syntax within an Rmarkdown document to achieve various text transformations.
- Use RMarkdown code chunks to display and/or run snippets of code.
Questions
1. Make the following text (including the asterisks) bold: This needs to be **very** bold
. Make the following text (including the underscores) italicized: This needs to be _very_ italicized.
Important note: Surround your answer in 4 backticks. This will allow you to display the markdown without having the markdown "take effect". For example:
````
Some *marked* **up** text.
````
Hint: Be sure to check out the Rmarkdown Cheatsheet and our section on Rmarkdown in the book.
Note: Rmarkdown is essentially Markdown + the ability to run and display code chunks. In this question, we are actually using Markdown within Rmarkdown!
Relevant topics: rmarkdown, escaping characters
Item(s) to submit: - 2 lines of markdown text, surrounded by 4 backticks. Note that when compiled, this text will be unmodified, regular text.
Solution
We can achieve this style of text:
This needs to be **very** bold
This needs to be _very_ italicized.
by using this Markdown text:
**This needs to be \*\*very\*\* bold**
_This needs to be \_very\_ italicized._
The backslashes specify that we want the asterisks and underscores to appear.
2. Create an unordered list of your top 3 favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another ordered list that ranks your academic interests in order of most interested to least interested.
Hint: You can learn what ordered and unordered lists are here.
Note: Similar to (1), in this question we are dealing with Markdown. If we were to copy and paste the solution to this problem in a Markdown editor, it would be the same result as when we Knit it here.
Relevant topics: rmarkdown
Item(s) to submit: - Create the lists, this time don't surround your code in backticks. Note that when compiled, this text will appear as nice, formatted lists.
Solution
An unordered list of my top 3 favorite academic interests is:
- asymptotic analysis of sequences
- data science
- analysis of algorithms
An ordered list of my top 3 favorite academic interests is:
- analysis of algorithms
- asymptotic analysis of sequences
- data science
3. Browse https://www.linkedin.com/ and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown. Include the following:
- A header for this section (your choice of size) that says "About".
- The text of your personal "About" section that you would feel comfortable uploading to linkedin, including at least 1 link.
Relevant topics: rmarkdown
Item(s) to submit: - Create the described profile, don't surround your code in backticks.
Solution
About
I am Professor in Statistics and (by courtesy) of Mathematics and Public Health at Purdue University. My research is in probabilistic, combinatorial, and analytic techniques for the analysis of algorithms and data structures; I am also interested in science of information, game theory, and large-scale computation. I currently serve as
- Director of The Data Mine
- Interim Co-Director of the Integrative Data Science Initiative
- Principal Investigator for the Purdue Statistics Living Learning Community, funded by the National Science Foundation
- Associate Director for the NSF Center for Science of Information (now a core center in Purdue's Discovery Park)
- Associate Director of the Actuarial Science Program ##### Solution
The equation is \[P(A\ |\ B) = \frac{P(B\ |\ A)P(A)}{P(B)}\]
4. LaTeX is a powerful editing tool where you can create beautifully formatted equations and formulas. Replicate the equation found here as closely as possible.
Hint: Lookup "latex mid" and "latex frac".
Item(s) to submit:
- Replicate the equation using LaTeX under the Question 4 header in your template.
Solution
my_variable <- c(1,2,3)
*** All About the Iris Dataset
This paper goes into detail about the iris
dataset that is built into r. You can find a list of built-in datasets by visiting the R datasets index or by running the following code:
data()
The iris dataset has 5 columns. You can get the names of the columns by running the following code:
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Alternatively, you could just run the following code:
iris
The second option provides more detail about the dataset.
According to the R iris help page there is another dataset built-in to R
called iris3
. This dataset is 3 dimensional instead of 2 dimensional.
An iris is a really pretty flower. You can see a picture of one here:
In summary. I really like irises, and there is a dataset in R
called iris
.
5. Your co-worker wrote a report, and has asked you to beautify it. Knowing Rmarkdown, you agreed. Make improvements to this section. At a minimum:
- Make the title pronounced.
- Make all links appear as a word or words, rather than the long-form URL.
- Organize all code into code chunks where code and output are displayed. If the output is really long, just display the code.
- Make the calls to the
library
function be evaluated but not displayed. - Make sure all warnings and errors that may eventually occur, do not appear in the final document.
Feel free to make any other changes that make the report more visually pleasing.
```{r my-load-packages}
library(ggplot2)
```
```{r declare-variable-390, eval=FALSE}
my_variable <- c(1,2,3)
```
All About the Iris Dataset
This paper goes into detail about the `iris` dataset that is built into r. You can find a list of built-in datasets by visiting https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html or by running the following code:
data()
The iris dataset has 5 columns. You can get the names of the columns by running the following code:
names(iris)
Alternatively, you could just run the following code:
iris
The second option provides more detail about the dataset.
According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html there is another dataset built-in to r called `iris3`. This dataset is 3 dimensional instead of 2 dimensional.
An iris is a really pretty flower. You can see a picture of one here:
https://www.gardenia.net/storage/app/public/guides/detail/83847060_mOptimized.jpg
In summary. I really like irises, and there is a dataset in r called `iris`.
Relevant topics: rmarkdown
Item(s) to submit:
- Make improvements to this section, and place it all under the Question 5 header in your template.
Solution
Here are the Sepal.Length
values from the iris
data set.
plot(iris$Sepal.Length, main="Iris Sepal Lengths")
7. Insert the following code chunk under the Question 7 header in your template. Try knitting the document. Two things will go wrong. What is the first problem? What is the second problem?
```{r my-load-packages}
plot(my_variable)
```
Hint: Take a close look at the name we give our code chunk.
Hint: Take a look at the code chunk where my_variable
is declared.
Relevant topics: rmarkdown
Item(s) to submit:
- The modified version of the inserted code that fixes both problems.
- A sentence explaining what the first problem was.
- A sentence explaining what the second problem was.
Solution
We needed to change the section title from install-packages
to a new name, since we already had a section with this same title.
In the section called declare-variable
, we had eval=F
, and as a result, the variable my_variable
was never declared. So we removed the option eval=F
.
plot(my_variable)
For Project 2, please submit your .Rmd file and the resulting .pdf file. (For this project, you do not need to submit a .R file.)
OPTIONAL QUESTION. RMarkdown is also an excellent tool to create a slide deck. Use the information here or here to convert your solutions into a slide deck rather than the regular PDF. You may experiment with slidy
, ioslides
or beamer
, however, make your final set of solutions use beamer
as the output is a PDF. Make any needed modifications to make the solutions knit into a well-organized slide deck (For example, include slide breaks and make sure the contents are shown completely.). Modify (2) so the bullets are incrementally presented as the slides progress.
Important note: You do not need to submit the original PDF for this project, just the beamer
slide version of the PDF.
Relevant topics: rmarkdown
Item(s) to submit:
-
The modified version of the solutions in
beamer
slide form.
Project 3
Motivation: The ability to navigate a shell, like bash
, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful bash
tools, help you navigate a filesystem, and even run bash
tools from within an RMarkdown file in RStudio.
Context: At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file.
Scope: bash, RStudio
Learning objectives:
- Distinguish differences in /home, /scratch, and /class.
- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
- Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.
- Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
- Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc.
-
Use
man
to read and learn about UNIX utilities. -
Run
bash
commands from within and RMarkdown file in RStudio.
There are a variety of ways to connect to Scholar. In this class, we will primarily connect to RStudio Server by opening a browser and navigating to https://rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface.
Here is a video to remind you about some of the basic tools you can use in UNIX/Linux:
This is the easiest book for learning this stuff; it is short and gets right to the point:
https://go.oreilly.com/purdue-university/library/view/-/0596002610
you just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip chapters 2 and 6 the first time through).
It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details.
Solution
Here are some changes Kevin likes:
- Uncheck "Restore .Rdata into workspace at startup".
- Change tab width 4.
- Check "Soft-wrap R source files".
- Check "Highlight selected line".
- Check "Strip trailing horizontal whitespace when saving".
- Uncheck "Show margin".
2. There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a bash
shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient!
What is the default directory of your bash shell?
Hint: Start by reading the section on man
. man
stands for manual, and you can find the "official" documentation for the command by typing man <command_of_interest>
. For example:
# read the manual for the `man` command
# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down
man man
Relevant topics: man, pwd, ~, .., .
Item(s) to submit:
-
The full filepath of default directory (home directory). Ex: Kevin's is:
/home/kamstut
-
The
bash
code used to show your home directory or current directory (also known as the working directory) when thebash
shell is first launched.
Solution
# whatever is stored in the $HOME environment variable
# is what ~ represents
cd ~
pwd
# if we change $HOME, ~ changes too!
HOME=/home/kamstut/projects
cd ~
pwd
# if other users on the linux system share certain files or folders
# in their home directory, you can access their home folder similarly
ls ~mdw
# but they _have_ to give you permissions
Solution
# navigate to the data directory using `cd` (change directory)
cd /class/datamine/data
# confirm the location using `pwd` (print working directory)
pwd
# list files
ls
# cd without any arguments automatically returns to the directory
# saved in the $HOME environment variable
cd
# another trick, if you wanted to _quickly_ return to the data
# directory, or most recent directory is the following (uncommented)
cd ~-
# confirm the location using pwd
pwd
..
represents the parent folder or the folder in which your current folder is contained. So let's say I was in /home/kamstut/projects/
and I wanted to get the contents of the file /home/kamstut/my_file.txt
. You could do: cat ../my_file.txt
.
Solution
cd /class/datamine/data
pwd
cd ../../../home/kamstut
pwd
5. In Scholar, when you want to deal with really large amounts of data, you want to access scratch (you can read more here). Your scratch directory on Scholar is located here: /scratch/scholar/$USER
. $USER
is an environment variable containing your username. Test it out: echo /scratch/scholar/$USER
. Perform the following actions:
- Navigate to your scratch directory.
- Confirm you are in the correct location.
- Execute
myquota
. - Find the location of the
myquota
bash script. - Output the first 5 and last 5 lines of the bash script.
- Count the number of lines in the bash script.
- How many kilobytes is the script?
Hint: You could use each of the commands in the relevant topics once.
Hint: When you type myquota
on Scholar there are sometimes two warnings about xauth
but sometimes there are no warnings. If you get a warning that says Warning: untrusted X11 forwarding setup failed: xauth key data not generated
it is safe to ignore this error.
Hint: Commands often have options. Options are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION
section of the man
pages. For example: man wc
. You can see -m
, -l
, and -w
are all options for wc
. To test this out:
# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command.
wc /class/datamine/data/flights/1987.csv
# to count the lines, use the -l option
wc -l /class/datamine/data/flights/1987.csv
# to count the words, use the -w option
wc -w /class/datamine/data/flights/1987.csv
# you can combine options as well
wc -w -l /class/datamine/data/flights/1987.csv
# some people like to use a single tack `-`
wc -wl /class/datamine/data/flights/1987.csv
# order doesn't matter
wc -lw /class/datamine/data/flights/1987.csv
Hint: The -h
option for the du
command is useful.
Relevant topics: cd, pwd, type, head, tail, wc, du
Item(s) to submit:
- Command used to navigate to your scratch directory.
- Command used to confirm your location.
-
Output of
myquota
. -
Command used to find the location of the
myquota
script. -
Absolute path of the
myquota
script. -
Command used to output the first 5 lines of the
myquota
script. -
Command used to output the last 5 lines of the
myquota
script. -
Command used to find the number of lines in the
myquota
script. - Number of lines in the script.
- Command used to find out how many kilobytes the script is.
- Number of kilobytes that the script takes up.
Solution
# navigate to my scratch folder
cd /scratch/scholar/$USER
# confirm
pwd
# what is my quota, execute the myquota script
myquota
# get the location of the myquota script
type myquota
# get the first 5 lines of the myquota script
head /usr/local/bin/myquota
# get the last 5 lines of the myquota script
tail /usr/local/bin/myquota
# get the number of lines in the myquota script
wc -l /usr/local/bin/myquota
# get the number of kilobytes of teh myquota script
du -h --apparent-size /usr/local/bin/myquota
ls -la /usr/local/bin/myquota
6. Perform the following operations:
- Navigate to your scratch directory.
- Copy and paste the file:
/class/datamine/data/flights/1987.csv
to your current directory (scratch). - Create a new directory called
my_test_dir
in your scratch folder. - Move the file you copied to your scratch directory, into your new folder.
- Use
touch
to create an empty file namedim_empty.txt
in your scratch folder. - Remove the directory
my_test_dir
and the contents of the directory. - Remove the
im_empty.txt
file.
Hint: rmdir
may not be able to do what you think, instead, check out the options for rm
using man rm
.
Relevant topics: cd, cp, mv, mkdir, touch, rmdir, rm
Item(s) to submit:
- Command used to navigate to your scratch directory.
-
Command used to copy the file,
/class/datamine/data/flights/1987.csv
to your current directory (scratch). -
Command used to create a new directory called
my_test_dir
in your scratch folder. -
Command used to move the file you copied earlier
1987.csv
into your newmy_test_dir
folder. -
Command used to create an empty file named
im_empty.txt
in your scratch folder. -
Command used to remove the directory and the contents of the directory
my_test_dir
. -
Command used to remove the
im_empty.txt
file.
Solution
# navigate to the scratch folder
cd /scratch/scholar/$USER
# copy the 1987.csv file to the current directory (scratch)
cp /class/datamine/data/flights/1987.csv .
# make a directory in the scratch directory called `my_test_dir`
mkdir my_test_dir
# move 1987.csv to the new folder
mv 1987.csv my_test_dir
# create an empty file in the scratch folder
touch im_empty.txt
# remove the directory and the contents of the directory
rm -r my_test_dir
# remove the im_empty.txt file
rm im_empty.txt
7. Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.
Project 4
Motivation: The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. grep
is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, even professionals can make critical mistakes. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.
Context: We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep
, and experiment with regular expressions using grep
, R, and later on, Python.
Scope: grep, regular expression basics, utilizing regular expression tools in R and Python
Learning objectives:
-
Use
grep
to search for patterns within a dataset. -
Use
cut
to section off and slice up data from the command line. -
Use
wc
to count the number of lines of input.
You can find useful examples that walk you through relevant material in The Examples Book:
https://thedatamine.github.io/the-examples-book
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
Important note: I would highly recommend using single quotes '
to surround your regular expressions. Double quotes can have unexpected behavior due to some shell's expansion rules. In addition, pay close attention to escaping certain characters in your regular expressions.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/the_office_dialogue.csv
A public sample of the data can be found here: the_office_dialogue.csv
Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.
grep
stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep
, we will be using it with textual data. You can read about and see examples of grep
here.
1. Login to Scholar and use grep
to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica.". What is the name of the dataset and where is it located?
Relevant topics: grep
Item(s) to submit:
-
The
grep
command used to find the dataset. - The name and location in Scholar of the dataset.
-
Use
grep
andgrepl
within R to solve a data-driven problem.
Solution
grep -Ri "bears. beets. battlestar galactica." /class/datamine
/class/datamine/data/the_office/the_office_dialogue.csv
2. grep
prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first n lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in.
Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.
Hint: The result if you were to search for "bears. beets. battlestar galactica." should be:
"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."
Hint: One method to solve this problem would be to pipe the output from grep
to cut
.
Item(s) to submit:
- The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".
Solution
grep -i "bears. beets. battlestar galactica." /class/datamine/data/movies_and_tv/the_office_dialogue.csv | cut -d "," -f 7,8
3. Find all of the lines where Pam is called "Beesley" instead of "Pam" or "Pam Beesley".
Hint: A negative lookbehind would be one way to solve this, in order to use a negative lookbehind with grep
make sure to add the -P option. In addition, make sure to use single quotes to make sure your regular expression is taken literally. If you use double quotes, variables are expanded.
Relevant topics: grep
Item(s) to submit:
- The UNIX command used to solve this problem.
Solution
grep -Pi '(?<!Pam )Beesley'
Regular expressions are really a useful semi language-agnostic tool. What this means is regardless of the programming language your are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using read.csv
. Name the resulting data.frame dat
.
4. The text_w_direction
column in dat
contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expression to detect the directions.
(a) Create a new column called has_direction
that is set to TRUE
if the text_w_direction
column has direction, and FALSE
otherwise. Use the grepl
function in R to accomplish this.
Hint: Make sure all opening brackets "[" have a corresponding closing bracket "]".
Hint: Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.
(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have more than 2 directions? Modify your code again and find how many have more than 5.
We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
Your solution to part (a) should find both lines a match. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.
In our actual dataset, for example, dat$text_w_direction[2789]
is a line with 2 directions.
Relevant topics: grep, grepl, basic matches, escaping characters
Item(s) to submit:
- The R code and regular expression used to solve the first part of this problem.
- The R code and regular expression used to solve the second part of this problem.
- How many lines have >= 2 directions?
- How many lines have >= 5 directions?
Solution
dat$has_direction <- grepl("(\\[.*\\])+", dat$text_w_direction)
#
length(grep("\\[.*\\].*\\[.*\\]", dat$text_w_direction))
length(grep("\\[.*\\].*\\[.*\\].*\\[.*\\].*\\[.*\\].*\\[.*\\]", dat$text_w_direction))
5. Use the str_extract_all
function from the stringr
package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called direction
.
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
In this question, your solution may have extracted:
[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]
(It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.)
Relevant topics: str_extract_all, basic matches, escaping characters
Item(s) to submit:
- The R code used to solve this problem.
Solution
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[[^\\[\\]]*\\])", simplify=F)
# or
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[.*?\\])", simplify=F)
OPTIONAL QUESTION. Repeat (5) but this time make sure you only capture the brackets and text within the brackets. Save the results in a new column called direction_correct
. You can test to see if it is working by running the following code:
dat$direction_correct[747]
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].
In (5), your solution may have extracted:
[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]
This is ok for (5). In this question, however, we want to fix this to only extract:
[emphasize this]
[emphasize this] [shrug]
Hint: This regular expression will be hard to read.
Hint: The pattern we want is: literal opening bracket, followed by 0+ of any character other than the literal [ or literal ], followed by a literal closing bracket.
Relevant topics: str_extract_all
Item(s) to submit:
- The R code used to solve this problem.
Solution
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[[^\\[\\]]*\\])", simplify=F)
# or
dat$direction_correct <- str_extract_all(dat$text_w_direction, "(\\[.*?\\])", simplify=F)
Project 5
Motivation: Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc.
Context: We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
Scope: grep, regular expression basics, UNIX utilities, redirection, piping
Learning objectives:
-
Use
cut
to section off and slice up data from the command line. - Use piping to string UNIX commands together.
-
Use
sort
and it's options to sort data in different ways. -
Use
head
to isolate n lines of output. -
Use
wc
to summarize the number of lines in a file or in output. -
Use
uniq
to filter out non-unique lines. -
Use
grep
to search files effectively.
You can find useful examples that walk you through relevant material in The Examples Book:
https://thedatamine.github.io/the-examples-book
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
Don't forget the very useful documentation shortcut ?
for R code. To use, simply type ?
in the console, followed by the name of the function you are interested in. In the Terminal, you can use the man
command to check the documentation of bash
code.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/amazon/amazon_fine_food_reviews.csv
A public sample of the data can be found here: amazon_fine_food_reviews.csv
Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.
Here are three videos that might also be useful, as you work on Project 5:
Questions
1. What is the Id
of the most helpful review, according to the highest HelpfulnessNumerator
?
Important note: You can always pipe output to head
in case you want the first few values of a lot of output. Note that if you used sort
before head
, you may see the following error messages:
sort: write failed: standard output: Broken pipe
sort: write error
This is because head
would truncate the output from sort
. This is okay. See this discussion for more details.
Relevant topics: cut, sort, head, piping
Item(s) to submit:
- Line of UNIX commands used to solve the problem.
-
The
Id
of the most helpful review.
Solution
cut -d, -f5 amazon_fine_food_reviews.csv| sort -nr | head -n3
2. Some entries under the Summary
column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion.
To further clarify what we mean by unique, if we had the following vector in R, c("a", "b", "a", "c")
, its unique values are c("a", "b", "c")
.
Relevant topics: cut, uniq, sort, wc, piping
Item(s) to submit:
- Two lines of UNIX commands used to solve the problem.
-
The ratio of unique
Summary
's.
Solution
cut -d, -f9 amazon_fine_food_reviews.csv | sort -u | wc -l
cut -d, -f9 amazon_fine_food_reviews.csv | wc -l
3. Use a chain of UNIX commands, piped in a sequence, to create a frequency table of Score
.
Relevant topics: cut, uniq, sort, piping
Item(s) to submit:
- The line of UNIX commands used to solve the problem.
- The frequency table.
Solution
cut -d, -f7 amazon_fine_food_reviews.csv | sort -n | uniq -c
4. Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why?
Hint: You may need to pipe the output to sort
multiple times.
Hint: To create the frequency table, read through the man
pages for uniq
. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following:
man uniq
Relevant topics: cut, uniq, sort, head, piping, man
Item(s) to submit:
- The line of UNIX commands used to solve the problem.
- The frequency table.
Solution
cut -d, -f3 amazon_fine_food_reviews.csv | sort | uniq -c | sort -nr | head -n1
5. Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let's consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let's consider anything in between neither strongly positive nor negative.
Hint: You may find the solution to problem (3) useful.
Relevant topics: cut, uniq, sort, grep, piping
Item(s) to submit:
- The line of UNIX commands used to solve the problem.
Solution
grep -i 'A3OXHLG6DIBRW8' amazon_fine_food_reviews.csv | cut -d, -f7 | sort | uniq -c
6. Find the most helpful review with a Score
of 5. Then (separately) find the most helpful review with a Score
of 1. As before, we are considering the most helpful review to be the review with the highest HelpfulnessNumerator
.
Hint: You can use multiple lines to solve this problem.
Relevant topics: sort, head, piping
Item(s) to submit:
- The lines of UNIX commands used to solve the problem.
-
ProductId
's of both requested reviews.
Solution
sort -t, -k7rn,7 -k5rn,5 amazon_fine_food_reviews.csv | head -n2
sort -t, -k7n,7 -k5rn,5 amazon_fine_food_reviews.csv | head -n2
7. For only the two ProductId
s from the previous question, create a new dataset called scores.csv
that contains the ProductId
s and Score
s from all reviews for these two items.
Relevant topics: cut, grep, redirection
Item(s) to submit:
- The line of UNIX commands used to solve the problem.
Solution
cut -d, -f2,7 amazon_fine_food_reviews.csv| grep -Ei '(B00012182G|B003EMQGVI|B00065LI0A)' > scores.csv
OPTIONAL QUESTION. Use R to load up scores.csv
into a new data.frame called dat
. Create a histogram for each products' Score
. Compare the most helpful review Score
with the Score
's given in the histogram. Based on this comparison, point out some curiosities about the product that may be worth exploring. For example, if a product receives many high scores, but has a super helpful review that gives the product 1 star, I may tend to wonder if the product is not as great as it seems to be.
Relevant topics: read.csv, hist
Item(s) to submit:
- R code used to create the histograms.
-
3 histograms, 1 for each
ProductId
. - 1-2 sentences describing the curious pattern that you would like to further explore.
Solution
dat <- read.csv("scores.csv", header=F)
par(mfrow=c(1,3))
tapply(dat$V2, dat$V1, hist)
Project 6
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Learning objectives:
-
Use
awk
to process and manipulate textual data. - Use piping and redirection within the terminal to pass around data between utilities.
- Use output created from the terminal to create a plot using R.
Dataset
The following questions will use the dataset found here or in Scholar:
/class/datamine/data/flights/subset/YYYY.csv
An example from 1987 data can be found here or in Scholar:
/class/datamine/data/flights/subset/1987.csv
Questions
1. In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the Origin
, from 1987.csv
. Write another line, this time using awk
to do the same thing. Which one do you prefer, and why?
Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk:
Item(s) to submit:
-
One line of UNIX commands to solve the problem without using
awk
. -
One line of UNIX commands to solve the problem using
awk
. - 1-2 sentences describing which method you prefer and why.
Solution
cut -d, -f17 1987.csv
awk -F, '{print $17}' 1987.csv
2. Write a bash script that accepts a year (1987, 1988, etc.) and a column n and returns the nth column of the associated year of data.
Here are two examples to illustrate how to write a bash script:
Hint: In this example, you only need to turn in the content of your bash script (starting with #!/bin/bash
) without evaluation in a code chunk. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it:
#!/bin/bash
echo "First argument: $1"
echo "Second argument: $2"
If you simply drop that text into a file called my_script.sh
, located here: /home/$USER/my_script.sh
, and if you run the following:
# Setup bash to run; this only needs to be run one time per session.
# It makes bash behave a little more naturally in RStudio.
exec bash
# Navigate to the location of my_script.sh
cd /home/$USER
# Make sure that the script is runable.
# This only needs to be done one time for each new script that you write.
chmod 755 my_script.sh
# Execute my_script.sh
./my_script.sh okay cool
then it will print:
First argument: okay
Second argument: cool
In this example, if we were to turn in the "content of your bash script (starting with #!/bin/bash
) in a code chunk, our solution would look like this:
#!/bin/bash
echo "First argument: $1"
echo "Second argument: $2"
And although we aren't running the code chunk above, we know that it works because we tested it in the terminal.
Hint: Using awk
you could have a script with just two lines: 1 with the "hash-bang" (#!/bin/bash
), and 1 with a single awk
command.
Relevant topics: awk, bash scripts
Item(s) to submit:
-
The content of your bash script (starting with
#!/bin/bash
) in a code chunk.
Solution
#!/bin/bash
awk -F, -v col=$2 '{print $col}' $1.csv
3. How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using awk
, then solve this problem using only awk
.
Here is a similar example, using the election data set:
Relevant topics: cut, grep, wc, awk, piping
Item(s) to submit:
-
One line of UNIX commands to solve the problem without using
awk
. -
One line of UNIX commands to solve the problem using
awk
. - The number of flights that arrived at Indianapolis (IND) in 2008.
Solution
cut -d, -f18 2008.csv | grep 'IND' | wc -l
awk -F, '{if ($18 == "IND") count++}END{print count}' 2008.csv
4. Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you'd like. Are they indeed the same? How many unique values do we have per category (Origin
, Dest
)?
Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data:
Relevant topics: cut, sort, uniq, wc, awk
Item(s) to submit:
- 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same.
- The UNIX command(s) used to figure out if the number of unique origins and destinations are the same.
-
The number of unique values per category (
Origin
,Dest
).
Solution
cut -d, -f17 2008.csv | sort | uniq | wc -l
cut -d, -f18 2008.csv | sort | uniq | wc -l
5. In (4) we found that there are not the same number of unique Origin
's as Dest
's. Find the IATA airport code for all Origin
's that don't appear in a Dest
and all Dest
's that don't appear in an Origin
in the 2008 data.
Hint: The examples on this page should help. Note that these examples are based on Process Substitution, which basically allows you to specify commands whose output would be used as the input of comm
. There should be no space between <
and (
, otherwise your bash will not work as intended.
Relevant topics: comm, cut, sort, uniq, redirection
Item(s) to submit:
- The line(s) of UNIX command(s) used to answer the question.
-
The list of
Origin
s that don't appear inDest
. -
The list of
Dest
s that don't appear inOrigin
.
Solution
comm -23 <(cut -d, -f17 2008.csv | sort | uniq) <(cut -d, -f18 2008.csv | sort | uniq)
comm -13 <(cut -d, -f17 2008.csv | sort | uniq) <(cut -d, -f18 2008.csv | sort | uniq)
6. What was the percentage of flights in 2008 per unique Origin
with the Dest
of "IND"? What percentage of flights had "PHX" as Origin
(among all flights with Dest
of "IND"?
Here is an example using the percentages of donations contributed from CEOs from various States:
Hint: You can do the mean calculation in awk by dividing the result from (3) by the number of unique Origin
's that have a Dest
of "IND".
Relevant topics: awk, sort, grep, wc
Item(s) to submit:
-
The percentage of flights in 2008 per unique
Origin
with theDest
of "IND". -
1-2 sentences explaining how "PHX" compares (as a unique
ORIGIN
) to the otherOrigin
s (all with theDest
of "IND")?
Solution
awk -F, '{if($18=="IND") print $17}' 2008.csv | sort -u | wc -l
awk -F, '{if($18=="IND") print $17}' 2008.csv | grep -i PHX | wc -l
7. Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like:
1987, 12345
1988, 44
Run the script with inputs: 1991
and ORD
. Include the output in your submission.
Relevant topics: bash scripts, cut, piping, grep, wc
Item(s) to submit:
- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-
The output of the script given
1991
andORD
as inputs.
Solution
#!/bin/bash
FLIGHTS_OUT="$(cut -d, -f17 $1.csv | grep -i $2 | wc -l)"
FLIGHTS_IN="$(cut -d, -f18 $1.csv | grep -i $2 | wc -l)"
echo "$((FLIGHTS_OUT + FLIGHTS_IN)), $1"
OPTIONAL QUESTION. Pick your favorite airport and get its IATA airport code. Write a bash script that, given the first year, last year, and airport code, runs the bash script from (7) for all years in the provided range for your given airport, or loops through all of the files for the given airport, appending all of the data to a new file called my_airport.csv
.
Relevant topics: bash scripts, cut, grep, wc, for loops, echo, redirection
Item(s) to submit:
- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do
FLIGHTS_OUT="$(cut -d, -f17 $f.csv | grep -i $3 | wc -l)"
FLIGHTS_IN="$(cut -d, -f18 $f.csv | grep -i $3 | wc -l)"
echo "$((FLIGHTS_OUT + FLIGHTS_IN)), $f"
done
OPTIONAL QUESTION. In R, load my_airport.csv
and create a line plot showing the year-by-year change. Label your x-axis "Year", your y-axis "Num Flights", and your title the name of the IATA airport code. Write 1-2 sentences with your observations.
Relevant topics: read.csv, lines
Item(s) to submit:
- Line chart showing year-by-year change in flights into and out of the chosen airport.
- R code used to create the chart.
- 1-2 sentences with your observations.
Solution
dat <- read.csv("my_airport.csv", header=F)
plot(dat[,2], dat[,1],xlab="Year", ylab="Num Flights", main="JFK", type="l")
Project 7
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Learning objectives:
-
Use
awk
to process and manipulate textual data. - Use piping and redirection within the terminal to pass around data between utilities.
Dataset:
The following questions will use the dataset found in Scholar:
/class/datamine/data/flights/subset/YYYY.csv
An example of the data for the year 1987 can be found here.
Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. Write a line of code that prints a list of the unique values in the DayOfWeek
column. Write a line of code that prints a list of the unique values in the DayOfMonth
column. Write a line of code that prints a list of the unique values in the Month
column. Use the 1987.csv
dataset. Are the results what you expected?
Item(s) to submit:
- 3 lines of code used to get a list of unique values for the chosen columns.
- 1-2 sentences explaining whether or not the results are what you expected.
Solution
cut -d, -f3 1987.csv | sort -nu
cut -d, -f4 1987.csv | sort -nu
cut -d, -f2 1987.csv | sort -nu
2. Our files should have 29 columns. For a given file, write a line of code that prints any lines that do not have 29 columns. Test it on 1987.csv
, were there any rows without 29 columns?
Hint: See here. NF
looks like it may be useful!
Relevant topics: awk
Item(s) to submit:
- Line of code used to solve the problem.
- 1-2 sentences explaining whether or not there were any rows without 29 columns.
Solution
awk -F, '{if (NF != 29) print $0}' 1987.csv
3. Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do not have 29 columns.
Relevant topics: awk, bash scripts
Item(s) to submit:
- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
- The results of running your bash scripts from year 1987 to 2008.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do
awk -F, '{if (NF != 29) print $0}' $f.csv
done
4.awk
is a really good tool to quickly get some data and manipulate it a little bit. The column Distance
contains the distances of the flights in miles. Use awk
to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.
Example output:
Miles: 12345
Kilometers: 19867.35168
Item(s) to submit:
- The code used to solve the problem.
- The results of running the code.
Solution
awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' 1990.csv
5. Use awk
to calculate the sum of the number of DepDelay
minutes, grouped according to DayOfWeek
. Use 2007.csv
.
Example output:
DayOfWeek: 0
1: 1234567
2: 1234567
3: 1234567
4: 1234567
5: 1234567
6: 1234567
7: 1234567
Note: 1 is Monday.
Relevant topics: awk, sort, piping
Item(s) to submit:
- The code used to solve the problem.
- The output from running the code.
Solution
awk -F, '{delay[$4]=delay[$4]+$16}END{for (d in delay) print d": ", delay[d]}' 2007.csv | sort -n
6. It wouldn't be fair to compare the total DepDelay
minutes by DayOfWeek
as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of DepDelay
minutes by the number of flights per DayOfWeek
. Use 2007.csv
.
Example output:
DayOfWeek: 0
1: 1.234567
2: 1.234567
3: 1.234567
4: 1.234567
5: 1.234567
6: 1.234567
7: 1.234567
Relevant topics: awk, sort, piping
Item(s) to submit:
- The code used to solve the problem.
- The output from running the code.
Solution
awk -F, '{delay[$4]=delay[$4]+$16; flights[$4]++}END{for (d in delay) print d": ", delay[d]/flights[d]}' 2007.csv | sort -n
7. Anyone who has flown knows how frustrating it can be waiting for takeoff, or deboarding the aircraft. These roughly translate to TaxiOut
and TaxiIn
respectively. If you were to fly into or out of IND what is your expected total taxi time? Use 2007.csv
.
Note: Taxi times are in minutes.
Item(s) to submit:
- The code used to solve the problem.
- The output from running the code.
Solution
grep -i IND 2007.csv | awk -F, '{taxi=taxi+$20+$21; count++}END{print taxi/count}'
Project 8
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Learning objectives:
-
Use
awk
to process and manipulate textual data. - Use piping and redirection within the terminal to pass around data between utilities.
Dataset:
The following questions will use the dataset found in Scholar:
/class/datamine/data/flights/subset/YYYY.csv
An example of the data for the year 1987 can be found here.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
1. Let's say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true!
Write a line of awk
code that, prints the total number of flights that occur on weekend days, followed by the total number of flights that occur on the weekdays. Complete this calculation for 2008 using the 2008.csv
file.
Modify your code to instead print the average number of flights that occur on weekend days, followed by the average number of flights that occur on the weekdays.
Hint: You don't need a large if statement to do this, you can use the ~
comparison operator.
Relevant topics: awk
Item(s) to submit:
-
Lines of
awk
code that solves the problem. - The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.
- The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.
Solution
awk -F, '{if ($4 ~ /1|2|3|4/) weekday++; else (weekend++)}END{print weekend,weekday}' 2008.csv
awk -F, '{if ($4 ~ /1|2|3|4/) weekday++; else (weekend++)}END{print weekend/3,weekday/4}' 2008.csv
2. We want to look to see if there may be some truth to the whole "snow bird" concept where people will travel to warmer states like Florida and Arizona during the Winter. Let's use the tools we've learned to explore this a little bit.
Take a look at airports.csv
. In particular run the following:
head airports.csv
Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within awk
. This is messy and we would prefer to create a new file called new_airports.csv
without any quotes. Write a line of code to do this.
Note: You may be wondering why we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It's important to practice and learn ways to fix these things.
Hint: You could use gsub
within awk
to replace '"' with ''. You can find how to use gsub
here.
Hint: If you leave out the column number argument to gsub
it will apply the substitution to every field in every column.
Hint:
cat new_airports.csv | wc -l # should be 159 without header
Relevant topics: awk, redirection
Item(s) to submit:
-
Line of
awk
code used to create the new dataset.
Solution
awk -F, '{gsub(/"/, ""); print $0}' airports.csv > new_airports.csv
3. Write a line of commands that creates a new dataset called az_fl_airports.txt
. az_fl_airports.txt
should only contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3),new_airports.csv
as a starting point.
How many airports are there? Did you expect this? Use a line of bash code to count this.
Create a new dataset (called az_fl_flights.txt
) that contains all of the data for flights into or out of Florida and Arizona (using the 2008.csv
file). Use the newly created dataset, az_fl_airports.txt
to accomplish this.
Hint:
cat az_fl_flights.txt | wc -l # should be 484705
Relevant topics: awk, wc, piping
Item(s) to submit:
- All UNIX commands used to answer the questions.
- The number of airports.
- 1-2 sentences explaining whether you expected this number of airports.
Solution
awk -F, '{if ($4 == "AZ" || $4 == "FL") print $1}' new_airports.csv > az_fl_airports.txt
wc -l az_fl_airports.txt
4. Write a bash script that accepts the start year, end year, and filename containing airport codes (az_fl_airports.txt
), and outputs the data for flights into or out of any of the airports listed in the provided filename (az_fl_airports.txt
). The script should output data for flights using all of the years of data in the provided range. Run the bash script to create a new file called az_fl_flights_total.csv
.
Relevant topics: bash scripts, grep, for loop, redirection
Item(s) to submit:
- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
- The line of UNIX code you used to execute the script and create the new dataset.
Solution
#!/bin/bash
for ((f=$1; f<=$2; f++)); do
grep -w -F -f $3 $f.csv
done
./my_script.sh 1987 2008 az_fl_airports.txt > az_fl_flights_total.csv
5. Use the newly created dataset, az_fl_flights_total.csv
, from question 4 to calculate the total number of flights into and out of both states by month, and by year, for a total of 3 columns (year, month, flights). Export this information to a new file called snowbirds.csv
.
Load up your newly created dataset and use either R or Python (or some other tool) to create a graphic that illustrates whether or not we believe the "snowbird effect" effects flights. Include a description of your graph, as well as your (anecdotal) conclusion.
Hint: You can use 1 dimensional arrays to accomplish this if the key is the combination of, for example, the year and month.
Relevant topics: awk, redirection
Item(s) to submit:
-
The line of
awk
code used to create the new dataset,snowbirds.csv
. - Code used to create the visualization in a code chunk.
- The generated plot as either a png or jpg/jpeg.
- 1-2 sentences describing your plot and your conclusion.
Solution
awk -F, '{M[$1","$2]++}END{for (monthyear in M) print monthyear","M[monthyear]}' az_fl_flights_total.csv > snowbirds.csv
library(ggplot2)
dat <- read.csv("snowbirds.csv", header=F)
names(dat) <- c("year", "month", "n_flights")
ggplot(dat) +
geom_line(aes(x=month, y=n_flights, group=year, col = as.factor(year))) +
geom_point(aes(x=month, y=n_flights)) +
theme_classic(base_size=14) +
scale_x_continuous(breaks=1:12, labels = month.abb) +
labs(x= 'Month', y='', title = 'Total flights per month in/out AZ and FL (1987-2008)', col='Year')
Project 9
Motivation: Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, cloudflare, a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per this thread on hackernews). Learning SQL is well worth your time!
Context: There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite.
Scope: SQL, sqlite
Learning objectives:
- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
Dataset:
The following questions will use the dataset found in Scholar:
/class/datamine/data/lahman/lahman.db
This is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
Important note: For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:
```{r, include=F}
library(RSQLite)
lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db")
```
Each solution should then be placed in a code chunk like this:
```{sql, connection=lahman}
SELECT * FROM batting LIMIT 1;
```
If you want to use a SQLite-specific function like .tables
(or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
sqlite3 /class/datamine/data/lahman/lahman.db
From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F
like this:
```{sql, connection=lahman, eval=F}
SELECT * FROM batting LIMIT 1;
```
This will allow the code to be displayed without throwing an error.
Solution
.tables
2. Some people like to try to visit all 30 MLB ballparks in their lifetime. Use SQL commands to get a list of parks
and the cities they're located in. For your final answer, limit the output to 10 records/rows.
Note: There may be more than 30 parks in your result, this is ok. For long results, you can limit the number of printed results using the LIMIT
clause.
Hint: Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following:
.headers on
Relevant topics: SELECT, FROM, LIMIT
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT parkname, city FROM parks LIMIT 10;
3. There is nothing more exciting to witness than a home run hit by a batter. It's impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (HR
) in a season. List their playerID
, yearID
, home run total, and the teamID
they played for.
Hint: There are 8 occurrences of home runs greater than or equal to 60.
Hint: The batting
table is where you should look for this question.
Relevant topics: SELECT, FROM, WHERE, LIMIT
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT teamID, yearID, teamID, HR FROM batting WHERE HR>=60;
4. Make a list of players born on your birth day (don't worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year.
Hint: The people
table is where you should look for this question.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT nameFirst, nameLast, birthYear FROM people
WHERE birthMonth==5 AND birthDay==29
ORDER BY birthYear DESC LIMIT 10;
5. Get the Cleveland (CLE) Pitching Roster from the 2016 season (playerID
, W
, L
, SO
). Order the pitchers by number of Strikeouts (SO) in descending order.
Hint: The pitching
table is where you should look for this question.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT playerID, W, L, SO FROM pitching
WHERE teamID=='CLE' AND yearID==2016
ORDER BY SO DESC LIMIT 10;
6. Find the 10 team and year pairs that have the most number of Errors (E
) between 1960 and 1970. Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs?
Hint: The teams
table is where you should look for this question.
Hint: The BETWEEN
clause is useful here.
Hint: It is OK to use multiple queries to answer the question.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT teamID, franchID, yearID, W, L, E FROM teams
WHERE yearID BETWEEN 1960 AND 1970
ORDER BY E DESC
LIMIT 10;
SELECT franchName FROM teamsfranchises WHERE franchID=='LAA';
SELECT teamID, franchID, yearID, W, L, E FROM teams
WHERE yearID >= 1960 AND yearID <= 1970
ORDER BY E DESC
LIMIT 10;
SELECT franchName FROM teamsfranchises WHERE franchID=='LAA';
7. Find the playerID
for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table pitching
)? What year and team did he win the most games as a manager (use table managers
)?
Hint: It is OK to use multiple queries to answer the question.
Note: There was a tie among the two years in which Bob Lemon had the most wins as a pitcher.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT playerID FROM people
WHERE nameFirst=='Bob' AND nameLast=='Lemon';
SELECT teamID, yearID, W FROM pitching
WHERE playerID=='lemonbo01'
ORDER BY W DESC;
SELECT teamID, yearID, W FROM managers
WHERE playerID=='lemonbo01'
ORDER BY W DESC;
8. Find the AL West (use lgID
and divID
to specify AL West) home run (HR
), walk (BB
), and stolen base (SB
) totals by team between 2000 and 2010. Which team and year combo led in each category in the decade?
Hint: The teams
table is where you should look for this question.
Hint: It is OK to use multiple queries to answer the question.
Hint: Use divID == 'W'
as one of the conditions. Please note using double quotes: divID == "W"
will not work.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The team-year combination that ranked top in each category.
Solution
SELECT teamID, yearID, HR, BB, SB FROM teams
WHERE lgID=='AL' AND divID=='W'
AND yearID BETWEEN 2000 AND 2010
ORDER BY HR DESC LIMIT 1;
SELECT teamID, yearID, HR, BB, SB FROM teams
WHERE lgID=='AL' AND divID=='W'
AND yearID BETWEEN 2000 AND 2010
ORDER BY BB DESC LIMIT 1;
SELECT teamID, yearID, HR, BB, SB FROM teams
WHERE lgID=='AL' AND divID=='W'
AND yearID BETWEEN 2000 AND 2010
ORDER BY SB DESC LIMIT 1;
9. Get a list of the following by year: wins (W
), losses (L
), Home Runs Hit (HR
), homeruns allowed (HRA
), and total home game attendance (attendance
) for the Detroit Tigers when winning a World Series (WSWin
is Y
) or when winning league champion (LgWin
is Y
).
Hint: The teams
table is where you should look for this question.
Hint: Be careful with the order of operations for AND
and OR
. Remember you can force order of operations using parentheses.
Relevant topics: SELECT, FROM, WHERE, AND, ORDER BY, DESC, LIMIT, BETWEEN
Note: Examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL code used to solve the problem.
- The first 10 results of the query.
Solution
SELECT teamID, yearID, W, L, HR, HRA, attendance FROM teams
WHERE teamID=='DET' AND (WSWin=='Y' OR LgWin=='Y');
Project 10
Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data.
Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG
, MIN
, and MAX
.
Scope: SQL, sqlite
Learning objectives:
- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.
- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.
- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.
- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.
Dataset
The following questions will use the dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at scholar-db.rcac.purdue.edu
. As in Project 9, this is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.
Questions
Important note: Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Important note: Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head
to print a sample of the data or output. Extremely large PDFs will be subject to lose points.
Important note: For this project all solutions should be done using R code chunks, and the RMariaDB
package. Run the following code to load the library:
library(RMariaDB)
Solution
library(RMariaDB)
host <- "scholar-db.rcac.purdue.edu"
dbname <- "lahmandb"
user <- "lahman_user"
password <- "HitAH0merun"
con <- dbConnect(MariaDB(),host=host,
dbname=dbname,
user=user,
password=password)
head(dbGetQuery(con, "SHOW tables;"))
2. How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (SB
) and hit at least 40 home runs (HR
) in one year.
Hint: Use the batting
table.
Important note: You only need to run library(RMariaDB)
and the dbConnect
portion of the code a single time towards the top of your project. After that, you can simply reuse your connection con
to run queries.
Important note: In our project template, for this project, make all of the SQL queries using the dbGetQuery
function, which returns the results directly in R
. Therefore, your RMarkdown
blocks for this project should all be {r}
blocks (as opposed to the {sql}
blocks used in Project 9).
Hint: You can use dbGetQuery
to run your queries from within R. Example:
dbGetQuery(con, "SELECT * FROM battings LIMIT 5;")
Note: We already demonstrated the correct SQL query to use for the 40/40 club in this video, but now we want you to use RMariaDB
to solve this query:
Relevant topics: dbGetQuery, AND/OR, DISTINCT, COUNT
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con,"SELECT DISTINCT COUNT(*) FROM batting WHERE HR>=40 AND SB>=40;")
3. How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"?
Important note: Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use sum
in R rather than SUM
in SQL.
Hint: Use the people
table to find the playerID
and use the batting
table to find the statistics.
Relevant topics: dbGetQuery, AND/OR, SUM
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con, "SELECT playerID FROM people WHERE nameFirst ='Giancarlo' AND nameLast = 'Stanton';")
dbGetQuery(con, "SELECT COUNT(*), SUM(SO) FROM batting WHERE playerID = 'stantmi03' AND (teamID = 'MIA' OR teamID = 'FLO');")
4. The Batting Average is a metric for a batter's performance. The Batting Average in a year is calculated by \(\frac{H}{AB}\) (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to playerID
, teamID
, and yearID.
Hint: Use the batting
table.
Relevant topics: dbGetQuery, ORDER BY, BETWEEN
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con,"SELECT playerID, teamID, yearID, H, AB, H/AB FROM batting
WHERE yearID BETWEEN 2000 AND 2010 AND AB>300
ORDER BY H/AB DESC LIMIT 5;")
5. How many unique players have hit > 50 home runs (HR
) in a season?
Hint: If you view DISTINCT
as being paired with SELECT
, instead, think of it as being paired with one of the fields you are selecting.
Relevant topics: dbGetQuery, DISTINCT, COUNT
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con, "SELECT COUNT(DISTINCT playerID) FROM batting WHERE HR>50;")
# As you can see, DISTINCT is with the field playerID, not with SELECT.
# The following query should help distinguish this.
dbGetQuery(con, "SELECT COUNT(DISTINCT playerID), COUNT(playerID) FROM batting WHERE HR>50;")
6. Find the number of unique players that attended Purdue University. Start by finding the schoolID
for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic.
Hint: Use the schools
table to get the schoolID
s, and the collegeplaying
table to get the statistics.
Hint: You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates.
Hint: To find IU's id, try the following query: SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';
. You can find more about the LIKE clause and %
here.
Relevant topics: dbGetQuery, plotting in R, DISTINCT, COUNT
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
dbGetQuery(con, "SELECT schoolID FROM schools WHERE name_full='Purdue University';")
dbGetQuery(con, "SELECT schoolID FROM schools WHERE LIKE '%indiana%';")
dbGetQuery(con,"SELECT COUNT(DISTINCT playerID) FROM collegeplaying WHERE schoolID ='purdue';")
dbGetQuery(con,"SELECT DISTINCT playerID FROM collegeplaying WHERE schoolID ='indiana';")
purdue <- length(unique(dbGetQuery(con,"SELECT playerID FROM collegeplaying WHERE schoolID ='purdue';"))$playerID)
iu <- length(unique(dbGetQuery(con,"SELECT playerID FROM collegeplaying WHERE schoolID ='indiana';"))$playerID)
barplot(purdue, iu)
barplot(log(c(purdue, iu)), main="Pro baseball players: Indiana vs. Purdue", col=c("#990000","#CFB53B"), names.arg=c("Purdue", "IU"), cex.names=.5, ylab="Log count")
7. Use R, SQL and the lahman database to create an interesting infographic. For those of you who are not baseball fans, try doing a Google image search for "baseball plots" for inspiration. Make sure the plot is polished, has appropriate labels, color, etc.
Relevant topics: SQL, plotting in R
Item(s) to submit:
- R code used to solve the problem.
- The result of running the R code.
Solution
# Could be anything.
Project 11
Motivation: Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.
Context: We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset!
Scope: SQL, SQL in R
Learning objectives:
- Demonstrate the ability to interact with popular database management systems within R.
- Solve data-driven problems using a combination of SQL and R.
- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.
- Showcase the ability to filter, alias, and write subqueries.
- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.
Dataset
The following questions will use the elections
database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar:
/class/datamine/data/election/itcontYYYY.txt
(for example, data for year 1980 would be /class/datamine/data/election/itcont1980.txt
)
A public sample of the data can be found here:
https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt)
Questions
Important note: For this project you will need to connect to the database elections
using the RMariaDB
package in R. Include the following code chunk in the beginning of your RMarkdown file:
```{r setup-database-connection}
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(),
host="scholar-db.rcac.purdue.edu",
db="elections",
user="elections_user",
password="Dataelect!98")
```
When a question involves SQL queries in this project, you may use a SQL code chunk (with {sql}
), or an R code chunk (with {r}
) and functions like dbGetQuery
as you did in Project 10. Please refer to Question 5 in the project template for examples.
1. Approximately how large was the lahman database (use the sqlite database in Scholar: /class/datamine/data/lahman/lahman.db
)? Use UNIX utilities you've learned about this semester to write a line of code to return the size of that .db file (in MB).
The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder /class/datamine/data/election/
. How much data (in MB) is there?
The data in that folder has been added to the elections
database, all aggregated in the elections
table. Write a SQL query that returns the number of rows of data are in the database. How many rows of data are in the table elections
?
Note: These are some examples of how to get the sizes of collections of files in UNIX.
Hint: The SQL query will take some time! Be patient.
Note: You may use more than one code chunk in your RMarkdown file for the different tasks.
Note: We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from ls
and du
to match, use the --apparent-size
option with du
.
Note: A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See here for more information. For this question, either solution will be given full credit. This is a potentially useful example.
Relevant topics: SQL, SQL in R, awk, ls, du
Item(s) to submit:
- Line of code (bash/awk) to show the size (in MB) of the lahman database file.
- Approximate size of the lahman database in MB.
-
Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in
/class/datamine/data/election
. - The size of the elections data in MB.
-
SQL query used to find the number of rows of data in the
elections
table in theelections
database. -
The number of rows in the
elections
table in theelections
database.
Solution
ls -la *.txt | awk '{ total += $4; }END{print total/1000000}'
SELECT COUNT(*) FROM elections;
2. Write a SQL query using the LIKE
command to find a unique list of zip_code
s that start with "479".
Write another SQL query and answer: How many unique zip_code
s are there that begin with "479"?
Note: Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project.
Hint: The first query returns a list of zip codes, and the second returns a count.
Hint: Make sure you only select zip_code
s.
Item(s) to submit:
- SQL queries used to answer the question.
- The first 5 results from running the query.
Solution
SELECT DISTINCT zip_code FROM elections WHERE zip_code LIKE '479%' LIMIT 5;
SELECT COUNT(DISTINCT zip_code) FROM elections WHERE zip_code LIKE '479%';
3. Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an alias for our field so it doesn't read COUNT(*)
but rather Indiana Donations
.
Hint: You may enclose an alias's name in quotation marks (single or double) when the name contains space.
Relevant topics: SQL, WHERE, aliasing
Item(s) to submit:
- SQL query used to answer the question.
- The result of the SQL query.
Solution
SELECT COUNT(*) FROM elections WHERE state='IN';
SELECT COUNT(*) AS 'Indiana Donations' FROM elections WHERE state='IN';
4. Rewrite the query in (3) so the result is displayed like: IN: 1234567
. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to OH: 1234567
. In other words, the state abbreviation should be dynamic, not static.
Note: This video demonstrates how to use CONCAT in a MySQL query.
Hint: Use CONCAT and aliasing to accomplish this.
Hint: Remember, state
contains the state abbreviation.
Relevant topics: SQL, aliasing, CONCAT
Item(s) to submit:
- SQL query used to answer the question.
Solution
SELECT CONCAT(state, ": ", COUNT(*)) AS 'Donations' FROM elections WHERE state='IN';
5. In (2) we wrote a query that returns a unique list of zip_code
s that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip_code
s starting with "479". What percent of donations in Indiana come from said zip_code
s?
Note: This video gives two examples of sub-queries.
Hint: You can simply manually calculate the percent using the count in (2) and (5).
Relevant topics: SQL, aliasing, subqueries, IN
Item(s) to submit:
- SQL queries used to answer the question.
-
The percentage of donations from Indiana from
zip_code
s starting with "479".
Solution
SELECT COUNT(*) FROM elections WHERE zip_code IN (SELECT DISTINCT zip_code FROM elections WHERE zip_code LIKE '479%');
6. In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, number of donations from each state, the sum of the donations (transaction_amt
). Which 5 states gave the most donations (highest count)? Order you result from most to least.
Note: In this video we demonstrate GROUP BY
, ORDER BY
, DESC
, and other aspects of MySQL that might help with this question.
Hint: You may want to create an alias in order to sort.
Relevant topics: SQL, GROUP BY
Item(s) to submit:
- SQL query used to answer the question.
- Which 5 states gave the most donations?
Solution
SELECT state, COUNT(*) AS 'cnt', SUM(transaction_amt) AS 'total' FROM elections GROUP BY state ORDER BY cnt DESC;
7. Write a query that gets the number of donations, and sum of donations, by year, for Indiana. Create one or more graphics that highlights the year-by-year changes. Write a short 1-2 sentences explaining your graphic(s).
Relevant topics: SQL in R, GROUP, date functions in SQL
Item(s) to submit:
- SQL query used to answer the question.
- R code used to create your graphic(s).
- 1 or more graphics in png/jpeg format.
- 1-2 sentences summarizing your graphic(s).
Solution
library(RMariaDB)
library(ggplot2)
library(tidyverse)
db <- dbConnect(RMariaDB::MariaDB(), host="scholar-db.rcac.purdue.edu", db="elections", user="elections_user", password="Dataelect!98")
dat <- dbGetQuery(db, "SELECT COUNT(*) as cnt, SUM(transaction_amt) as amt, YEAR(transaction_dt) as yr FROM elections WHERE state='IN' GROUP BY yr ORDER BY yr;")
filtered_dat = filter(dat, between(yr, 1978, 2020))
ggplot(filtered_dat, aes(x=yr,y=amt/1e6)) +
geom_line() +
geom_point() +
theme_bw() +
labs(x='Year', y='Amount in millions')
ggplot(filtered_dat, aes(x=yr,y=amt/(1e3*cnt))) +
geom_line() +
geom_point() +
theme_bw() +
labs(x='Year', y='Average amount in thousands')
ggplot(filtered_dat, aes(x=yr,y=as.numeric(cnt)/1e3)) +
geom_line() +
geom_point() +
theme_bw() +
labs(x='Year', y='Number of donations in thousands')
Project 12
Motivation: Databases are comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore learn about and practice using joins on a database containing bike trip information from the Bay Area Bike Share.
Context: We've introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
Scope: SQL, sqlite, joins
Learning objectives:
- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.
- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.
- Showcase the ability to filter, alias, and write subqueries.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db
A public sample of the data can be found here.
Important note: For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd:
```{r, include=F}
library(RSQLite)
bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db")
```
Each solution should then be placed in a code chunk like this:
```{sql, connection=bikeshare}
SELECT * FROM station LIMIT 5;
```
If you want to use a SQLite-specific function like .tables
(or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db
From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F
like this:
```{sql, connection=bikeshare, eval=F}
SELECT * FROM station LIMIT 5;
```
This will allow the code to be displayed without throwing an error.
There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform all of the other types of joins (RIGHT JOIN, FULL OUTER JOIN). You can see the documentation here: SQL, INNER JOIN, LEFT JOIN
Questions
1. Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the sqlite
concatenate operator ||
(see here). Write a query that returns the first 5 records of information from the station
table formatted in the following way:
(id) name @ (lat, long)
For example:
(84) Ryland Park @ (37.342725, -121.895617)
Hint: Here is a video about how to concatenate strings in SQLite.
Relevant topics: aliasing, concat
Item(s) to submit:
- SQL query used to solve this problem.
-
The first 5 records of information from the
station
table.
Solution
SELECT '('||id||') '||name||' @ ('||lat||','||long||')' FROM station LIMIT 5;
2. There is a variety of interesting weather information in the weather
table. Write a query that finds the average mean_temperature_f
by zip_code
. Which is on average the warmest zip_code
?
Use aliases to format the result in the following way:
Zip Code|Avg Temperature
94041|61.3808219178082
Note that this is the output if you use sqlite
in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly.
Hint: Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT
Relevant topics: aliasing, GROUP BY, AVG
Item(s) to submit:
- SQL query used to solve this problem.
- The results of the query copy and pasted.
Solution
SELECT zip_code AS 'Zip Code', AVG(mean_temperature_f) AS 'Avg Temperature' FROM weather GROUP BY zip_code;
Solution
SELECT COUNT(DISTINCT zip_code) FROM trip;
SELECT zip_code, COUNT(zip_code) FROM trip GROUP BY zip_code;
4. In (2) we wrote a query that finds the average mean_temperature_f
by zip_code
. What if we want to tack on our results in (2) to information from each row in the station
table based on the zip_code
s? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables.
Hint: Use the query from (2) as a sub query within your solution.
Hint: Here is a video about JOIN and LEFT JOIN.
Relevant topics: INNER JOIN, subqueries, aliasing
Item(s) to submit:
- SQL query used to solve this problem.
Solution
SELECT * FROM station as s INNER JOIN (SELECT zip_code AS 'Zip Code', AVG(mean_temperature_f) AS 'Avg Temperature' FROM weather GROUP BY zip_code) AS sub ON s.zip_code=sub.'Zip Code';
5. In (3) we alluded to the fact that the zip_code
s in the trip
table aren't very consistent. Users can enter a zip code when using the app. This means that zip_code
can be from anywhere in the world! With that being said, if the zip_code
is one of the 5 zip_code
s for which we have weather data (from question 2), we can add that weather information to matching rows of the trip
table. In (4) we used an INNER JOIN to append some weather information to each row in the station
table. For this question, write a query that performs an INNER JOIN and appends weather data from the weather
table to the trip data from the trip
table. Limit your output to 5 lines.
Important note: Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the trip
table, you can use start_date
for for the date information.
Hint: You will want to wrap your dates and datetimes in sqlite's date
function prior to comparison.
Relevant topics: INNER JOIN, aliasing
Item(s) to submit:
- SQL query used to solve this problem.
- First 5 lines of output.
Solution
SELECT * FROM trip AS t INNER JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date) LIMIT 5;
6. How many rows are in the result from (5) (when not limiting to 5 lines)? How many rows are in the trip
table? As you can see, a large proportion of the data from the trip
table did not match the data from the weather
table, and therefore was removed from the result. What if we want to keep all of the data from the trip
table and add on data from the weather
table if we have a match? Write a query to accomplish this. How many rows are in the result?
Item(s) to submit:
- SQL query used to find how many rows from the result in (5).
- The number of rows in the result of (5).
-
SQL query to find how many rows are in the
trip
table. -
The number of rows in the
trip
table. -
SQL query to keep all of the data from the
trip
table and add on matching data from theweather
table when available. - The number of rows in the result.
Solution
SELECT COUNT(*) FROM trip AS t INNER JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date);
SELECT COUNT(*) FROM trip;
SELECT * FROM trip AS t LEFT JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date) LIMIT 5;
SELECT COUNT(*) FROM trip AS t LEFT JOIN weather AS w ON t.zip_code=w.zip_code AND date(t.start_date)=date(w.date);
Project 13
Motivation: Databases you will work with won't necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like primary key, and foreign key will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice!
Context: We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you've already learned, with a focus on subqueries and joins.
Scope: SQL, sqlite
Learning objectives:
-
Write and run SQL queries in
sqlite
on real-world data. - Identify primary and foreign keys in a SQL table.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/imdb.db
Important note: For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd:
```{r, include=F}
library(RSQLite)
imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
```
If you want to use a SQLite-specific function like .tables
(or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:
sqlite3 /class/datamine/data/movies_and_tv/imdb.db
From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F
like this:
```{sql, connection=imdb, eval=F}
SELECT * FROM titles LIMIT 5;
```
This will allow the code to be displayed without throwing an error.
Questions
1. A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the titles
table. Do you think there are any primary keys? How about foreign keys? Now examine the episodes
table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?
Hint: A primary key can also be a foreign key.
Hint: Here are two videos. The first video will remind you how to find the names of all of the tables in the imdb
database. The second video will introduce you to the titles
and episodes
tables in the imdb
database.
Relevant topics: primary key, foreign key
Item(s) to submit:
-
List any primary or foreign keys in the
titles
table. -
List any primary or foreign keys in the
episodes
table.
2. If you paste a title_id
to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series Grey's Anatomy. Write a SQL query to confirm that the title_id
tt0413573 does indeed belong to Grey's Anatomy. Then browse imdb.com and find your favorite TV show. Get the title_id
from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database:
SELECT * FROM titles WHERE title_id='<title id here>';
Make sure to replace "<title id here>" with the title_id
of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons.
Relevant topics: SELECT, WHERE
Item(s) to submit:
-
SQL query used to confirm that
title_id
tt0413573 does indeed belong to Grey's Anatomy. - The output of the query.
-
The
title_id
of your favorite TV show. -
SQL query used to confirm the
title_id
for your favorite TV show. - The output of the query.
3. The episode_title_id
column in the episodes
table references titles of individual episodes of a TV series. The show_title_id
references the titles of the show itself. With that in mind, write a query that gets a list of all of the episodes_title_id
s (found in the episodes
table), with the associated primary_title
(found in the titles
table) for each episode of Grey's Anatomy.
Relevant topics: INNER JOIN
Hint: This video shows how to extract titles of episodes in the imdb
database.
Item(s) to submit:
- SQL query used to solve the problem in a code chunk.
4. We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of episode_title_id
s (found in the episodes
table), with the associated primary_title
(found in the titles
table) for each episode.
Hint: This part is just like question (3) but this time with your favorite TV show, which you chose in (2).
Hint: This video shows how to use a subquery, to JOIN
a total of three tables in the imdb
database.
Relevant topics: INNER JOIN, aliasing
Item(s) to submit:
- SQL query used to solve the problem in a code chunk.
- The first 5 results from your query.
5. Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode?
Relevant topics: INNER JOIN, aliasing, subqueries, ORDER BY, DESC, LIMIT
Note: Various helpful examples that utilize the relevant topics in this problem can be found here.
Item(s) to submit:
- SQL query used to solve the problem in a code chunk.
-
The
episode_title_id
,primary_title
, andrating
of the top rated episode from your favorite TV series, in question (2). - A statement saying whether the highest rated episode is also your favorite episode.
6. Write a query that returns the season_number
(from the episodes
table), and average rating
(from the ratings
table) for each season of your favorite TV show from (2). Write another query that only returns the season number and rating
for the highest rated season. Consider the highest rated season the season with the highest average.
Relevant topics: INNER JOIN, aliasing, GROUP BY, AVG
Item(s) to submit:
- The 2 SQL queries used to solve the problems in two code chunks.
7. Write a query that returns the primary_title
and rating
of the highest rated episode per season for your favorite TV show from question (2).
Note: You can show one highest rated episode for each season, without the need to worry about ties.
Relevant topics: MAX, subqueries, GROUP BY, INNER JOIN, aliasing
Item(s) to submit:
- The SQL query used to solve the problem.
- The output from your query.
Project 14
Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.
Context: We've just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.
Scope: awk, UNIX utilities, bash scripts, fread
Learning objectives:
- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.
- Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.
- Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.
- Use grep to search files effectively.
- Use cut to section off data from the command line.
- Use piping to string UNIX commands together.
- Use awk for data extraction, and preprocessing.
- Create bash scripts to automate a process or processes.
Dataset:
The following questions will use PLOTSNAP.csv and TREE.csv from the data folder found in Scholar:
/class/datamine/data/forest
To read more about the two files from this dataset that you will be working with:
PLOTSNAP.csv:
TREE.csv:
AND
Questions
1. Take a look at at PLOTSNAP.csv
. Write a line of awk code that displays the STATECD
followed by the number of rows with that STATECD
.
Relevant topics: awk
Item(s) to submit:
- Code used to solve the problem.
-
Count of the following
STATECD
s: 1, 2, 4, 5, 6
2. Unfortunately, there isn't a very accessible list available that shows which state each STATECD
represents. This is no problem for us though, the dataset has LAT
and LON
! Write some bash that prints just the STATECD
, LAT
, and LON
.
Note: There are 92 columns in our dataset: awk -F, 'NR==1{print NF}' PLOTSNAP.csv
. To create a list of STATECD
to state, we only really need STATECD
, LAT
, and LON
. Keeping the other 89 variables will keep our data at 2.1gb.
Item(s) to submit:
- Code used to solve the problem.
-
The output of your code piped to
head
.
3. fread
is a "Fast and Friendly File Finagler". It is part of the very popular data.table
package in R. We will learn more about this package next semester. For now, read the documentation here and use the cmd
argument in conjunction with your bash code from (2) to read the data of STATECD
, LAT
, and LON
into a data.table
in your R environment.
Relevant topics: fread
Item(s) to submit:
- Code used to solve the problem.
-
The
head
of the resultingdata.table
.
4. We are going to further understand the data from question (3) by finding the actual locations based on the LAT
and LON
columns. We can use the library revgeo
to get a location given a pair of longitude and latitude values. revgeo
uses a free API hosted by photon in order to do so.
For example:
library(revgeo)
revgeo(longitude=-86.926153, latitude=40.427055, output='frame')
The code above will give you the address information in six columns, from the most-granular housenumber
to the least-granular country
. Depending on the coordinates, revgeo
may or may not give you results for each column. For this question, we are going to keep only the state
column.
There are over 4 million rows in our dataset -- we do not want to hit photon's API that many times. Instead, we are going to do the following:
First: Unless you feel comfortable using data.table
, convert your data.table
to a data.frame
:
my_dataframe <- data.frame(my_datatable)
Second: Calculate the average LAT
and LON
for each STATECD
, and call the new data.frame
, dat
. This should result in 57 rows of lat/long pairs.
Third: For each row in dat
, run a reverse geocode and append the state
to a new column called STATE
.
Hint: To calculate the average LAT
and LON
for each STATECD
, you could use the sqldf
package to run SQL queries on your data.frame
.
Hint: mapply
is a useful apply function to use to solve this problem.
Hint: Here is some extra help:
library(revgeo)
points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866))
# Note that the "output" argument gets passed to the "revgeo" function.
mapply(revgeo, points$longitude, points$latitude, output="frame")
# The output isn't in a great format, and we'd prefer to just get the "state" data.
# Let's wrap "revgeo" into another function that just gets "state" and try again.
get_state <- function(lon, lat) {
return(revgeo(lon, lat, output="frame")["state"])
}
mapply(get_state, points$longitude, points$latitude)
Important note: It is okay to get "Not Found" for some of the addresses.
Relevant topics: apply functions, sqldf
Item(s) to submit:
- Code used to solve the problem.
-
The
head
of the resultingdata.frame
.
5. Use the leaflet
, addTiles
, and addCircles
functions from the leaflet
package to map our average latitude and longitude data from question (4) to a map (should be a total of 57 lat/long pairs).
Hint: See here for an example of adding points to a map.
Relevant topics: leaflet
Item(s) to submit:
- Code used to create the map.
- The map itself as output from running the code chunk.
6. Write a bash script that accepts at least 1 argument, and performs a useful task using at least 1 dataset from the forest
folder in /class/datamine/data/forest
. An example of a useful task could be printing a report of summary statistics for the data. Feel free to get creative. Note that tasks must be non-trivial -- a bash script that counts the number of lines in a file is not appropriate. Make sure to properly document (via comments) what your bash script does. Also ensure that your script returns columnar data with appropriate separating characters (for example a csv).
Relevant topics: bash scripts, awk, unix utilities
Item(s) to submit:
-
The content of your bash script starting from
#!/bin/bash
. - Example output from running your script as intended.
- A description of what your script does.
7. You used fread
in question (2). Now use the cmd
argument in conjunction with your script from (6) to read the script output into a data.table
in your R environment.
Relevant topics: fread
Item(s) to submit:
- The R code used to read in and preprocess your data using your bash script from (6).
-
The
head
of the resultingdata.table
.
Project 15
Motivation: We've done a lot of work with SQL this semester. Let's review concepts in this project and mix and match R and SQL to solve data-driven problems.
Context: In this project, we will reinforce topics you've already learned, with a focus on SQL.
Scope: SQL, sqlite, R
Learning objectives:
-
Write and run SQL queries in
sqlite
on real-world data. - Use SQL from within R.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/imdb.db
F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One ...". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let's use SQL and R to take a look at how many times each characters' names appear in the title of the episodes.
Questions
1. Write a query that gets the episode_title_id
, primary_title
, rating
, and votes
, of all of the episodes of Friends (title_id
is tt0108778).
Hint: You can slightly modify the solution to question (5) in project 13.
Relevant topics: INNER JOIN, subqueries, aliasing
Item(s) to submit:
- SQL query used to answer the question.
- First 5 results of the query.
2. Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes primary_title
s contained their name?
Relevant topics: SQL in R, grep
Item(s) to submit:
- R code in a code chunk that was used to find the solution.
- The solution pasted below the code chunk.
3. Create a graphic showing our results in (2) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.
Relevant topics: plotting
Item(s) to submit:
- The R code used to generate the graphic.
- The graphic in a png or jpg/jpeg format.
4. Now we will turn our focus to other information in the database. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see type
column from titles
table): Romance, Comedy, Animation. In the titles
table, you can find the genres in the genres
column. There may be some overlap (i.e. a movie may have more than one genre), this is ok.
To query rows which have the genre Action as one of its genres:
SELECT * FROM titles WHERE genres LIKE '%action%';
Relevant topics: LIKE, INNER JOIN
Item(s) to submit:
- Any code you used to solve the problem in a code chunk.
- The average rating of each of the genres listed for movies.