BUSINESS DATA MINING (IDS 472)

HOMEWORK 2

• Submit a report in PDF or Word format on Blackboard, in addition to your R script file.

• One submission is sufficient for the entire group.

• Please include the names of all team members in your write-up and in the file name.

• Please include any R code you use to answer the questions in your PDF report.

Problem 1. Explain what each of the following R expressions does. You can run them in R and check the results.

(a) c(1, 17, -6, 3)

(b) seq(1, 5, by=0.5)

(c) seq(0, 10, length=5)

(d) rep(0, 5)

(e) rep(1:3, 4)

(f) rep(4:6, 1:3)

(g) sample(1:3)

(h) sample(1:5, size=3, replace=FALSE)

(i) sample(c(2,5,3), size=4, replace=TRUE)

(j) sample(1:2, size=10, prob=c(1,3), replace=TRUE)

(k) c(1, 2, 3) + c(4, 5, 6)

(l) max(1:10)

(m) min(1:10)

(n) range(1:10)

(o) matrix(1:12, nrow=3, ncol=4)

(q) Let a <- c(1, 2, 3), b <- c(10, 20, 30), c <- c(100, 200, 300), d <- c(1000, 2000, 3000). What does the function rbind(a, b, c, d) do? What does cbind(a, b, c, d) do?

HOMEWORK 2 DUE DATE: FRIDAY, SEPTEMBER 25 AT 11:59 PM

(r) Let C be the following matrix

    a   b    c     d
    1  10  100  1000
    2  20  200  2000
    3  30  300  3000

What is sum(C)? What is apply(C, 1, sum)? What is apply(C, 2, sum)?

(s) Let movies <- c("SPYDERMAN", "BATMAN", "VERTIGO", "CHINATOWN"). What does lapply(movies, tolower) do? Note that tolower converts the characters of a character vector to lower case.

(t) Let x <- factor(c("alpha", "beta", "gamma", "alpha", "beta")). What does the function levels(x) return?

(u) c <- 35:50

(v) c(1, 2, 3) + c(4, 5, 6)

c(1, 2, 3, 4) + c(10, 20)

(x) sqrt(c(100, 225, 400))

Problem 2. Create the following vectors in R.

a = (5, 10, 15, 20, …, 160)

b = (87, 86, 85, …, 56)

Use vector arithmetic to multiply these vectors and call the result d. Select subsets of d to identify the

following.

(a) What are the 19th, 20th, and 21st elements of d?

(b) What are all of the elements of d which are less than 2000?

(c) How many elements of d are greater than 6000?
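
The techniques Problem 2 relies on (building sequences, element-wise vector arithmetic, and subsetting by position or by a logical condition) can be tried on smaller vectors first. The sketch below is a generic illustration, not a solution: the vectors x, y, and z and their values are made up for this example.

```r
# Hypothetical toy vectors; these are not the vectors from Problem 2
x <- seq(2, 10, by = 2)   # 2 4 6 8 10
y <- 5:1                  # 5 4 3 2 1
z <- x * y                # element-wise product: 10 16 18 16 10

z[2:3]                    # elements selected by position: 16 18
z[z < 16]                 # elements satisfying a condition: 10 10
sum(z > 15)               # number of elements above 15: 3
```

Summing a logical vector counts its TRUE values, which is why sum(z > 15) returns a count rather than a total.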

Problem 3. This exercise relates to the College data set, which can be found in the file College.csv. It

contains a number of variables for 777 different universities and colleges in the US. The variables are

• Private : Public/private indicator

• Apps : Number of applications received

• Accept : Number of applicants accepted

• Enroll : Number of new students enrolled

• Top10perc : New students from top 10% of high school class

• Top25perc : New students from top 25% of high school class

• F.Undergrad : Number of full-time undergraduates

• P.Undergrad : Number of part-time undergraduates

• Outstate : Out-of-state tuition

• Room.Board : Room and board costs

• Books : Estimated book costs

• Personal : Estimated personal spending

• PhD : Percent of faculty with Ph.D.’s

• Terminal : Percent of faculty with terminal degree

• S.F.Ratio : Student/faculty ratio

• perc.alumni : Percent of alumni who donate

• Expend : Instructional expenditure per student

• Grad.Rate : Graduation rate

(a) Read the data into R. Call the loaded data “college”. Explain how you do this.

(b) How many variables are in this data set? What are their measurement types (for example, numeric or categorical)? How do you obtain this information?

(c) Use the function colnames() to change the "Top10perc" and "Top25perc" variable names to "Top10" and "Top25".

(d) Look at the data. You should notice that the first column is just the name of each university. We don't really want R to treat this as data. However, it may be handy to have these names for later. Try the following command:

rownames(college) <- college[, 1]

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Write code to eliminate the first column.

(e) Add a column to indicate the acceptance rate for each university (acceptance rate = number of

accepted applications / number of applications received).

(f) Provide summary statistics for the numerical variables in the data set.

(g) Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of

the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10]. Can

you observe any useful information in the plots?

(h) Use the boxplot() function to produce side-by-side boxplots of Outstate versus Private. Do you

observe any useful information in this plot?

(i) Create a new qualitative variable, called Elite, by binning the Top10perc variable (named Top10 if you renamed it in part (c)). We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. Follow the code below.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

i. Explain each line of the above code.

ii. Use the summary() function to see how many elite universities there are. Now use the

plot() function to produce side-by-side boxplots of Outstate versus Elite.

(j) Use the hist() function to produce some histograms with differing numbers of bins for a few of

the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide

the print window into four regions so that four plots can be made simultaneously. Modifying

the arguments to this function will divide the screen in other ways.

(k) What is the average room and board cost of private schools?

(l) Create a new binary variable that is 1 if the student/faculty ratio is greater than 0.5 and 0

otherwise.

(m) Compare the distribution of out-of-state tuition for private and public colleges.
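
The row-name, column-indexing, and par(mfrow) mechanics mentioned in parts (d), (g), and (j) can be tried on a small made-up data frame before working with the real data. This is only a sketch: the data frame toy, its column names, and its values are hypothetical and unrelated to College.csv.

```r
# Hypothetical toy data; names and values are invented for illustration
toy <- data.frame(
  name  = c("A College", "B University", "C Institute"),
  apps  = c(1200, 5400, 800),
  books = c(450, 600, 500)
)

# Use the first column as row names; rows can then be looked up by name
rownames(toy) <- toy[, 1]
toy["B University", "apps"]    # 5400

# Column indexing: toy[, 2:3] selects the second and third columns
numeric_part <- toy[, 2:3]
colMeans(numeric_part)

# Split the plotting window into 1 row and 2 columns, then draw two histograms
old_par <- par(mfrow = c(1, 2))
hist(toy$apps,  breaks = 3, main = "apps")
hist(toy$books, breaks = 3, main = "books")
par(old_par)                   # restore the previous layout
```

Saving the value returned by par() and passing it back afterwards restores the original single-plot layout, so later plots are not squeezed into the split window.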

Problem 4. This exercise involves the “Auto” data set.

(a) Remove the missing values from this data set.

(b) What is the range of each quantitative predictor? You can answer this using the range() function.

(c) What are the mean and standard deviation of each quantitative predictor?

(d) Remove the 10th through 85th observations. What is the range, mean, and standard deviation

of each predictor in the subset of the data that remains?

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of

your choice. Create some plots highlighting the relationships among the predictors. Comment

on your findings.

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your

plots suggest that any of the other variables might be useful in predicting mpg? Justify your

answer.

Problem 5. FiveThirtyEight, a data journalism site devoted to politics, sports, science, economics,

and culture, recently published a series of articles on gun deaths in America. Gun violence in the

United States is a significant political issue, and while reducing gun deaths is a noble goal, we must first

understand the causes and patterns in gun violence in order to craft appropriate policies. As part of the

project, FiveThirtyEight collected data from the Centers for Disease Control and Prevention, as well as

other governmental agencies and non-profits, on all gun deaths in the United States from 2012 to 2014. You can find this dataset, called "gun deaths.csv", on Blackboard.

(a) Generate a data frame that summarizes the number of gun deaths per month.

(b) Generate a bar chart of the number of gun deaths per month, with labels on the x-axis. That is, each month should be labeled "Jan", "Feb", "Mar", and so on.

(c) Generate a bar chart that identifies the number of gun deaths associated with each type of intent (cause of death). The bars should be sorted from highest to lowest values.

(d) Generate a boxplot visualizing the age of gun death victims, by sex. Print the average age of

female gun death victims.

Answer the following questions. Generate appropriate figures/tables to support your conclusions.

(e) How many white males with at least a high school education were killed by guns in 2012?

(f) Which season of the year has the most gun deaths? Assume that

– Winter = January – March

– Spring = April – June

– Summer = July – September

– Fall = October – December

– Hint: You need to convert a continuous variable into a categorical variable.

(g) Are whites who are killed by guns more likely to die because of suicide or homicide? How does

this compare to blacks and Hispanics?

(h) Are police-involved gun deaths significantly different from other gun deaths? Assess the relationship between police involvement and other variables.
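
The hint in part (f) refers to binning a numeric variable into categories. A minimal sketch of one way to do this with cut() follows; the vector ages, the break points, and the labels are made up for illustration and are not taken from the gun deaths data.

```r
# Hypothetical values; not taken from the homework data set
ages <- c(4, 17, 25, 42, 67, 80)

# cut() assigns each value to an interval; intervals are right-closed by default
age_group <- cut(
  ages,
  breaks = c(0, 17, 64, Inf),            # intervals (0,17], (17,64], (64,Inf]
  labels = c("minor", "adult", "senior")
)

table(age_group)                          # two values fall in each group
```

The result is a factor, so it can be passed directly to table() or used as a grouping variable in a plot.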