Homework #2 Due Thursday, September 29

Question #1: Data Mining and Terrorism

Read the following two short articles:

"Mining The Matrix:"

"Why Data Mining Won't Stop Terror"

One article argues that data mining cannot possibly work in
fighting terrorism; the other seems to describe an actual example
where data mining effectively identified persons of interest.

Write a paragraph explaining the conflict in these two articles,
and expressing your opinion about the role of data mining in fighting
terrorism. Would your opinion change if the government used the same
technology for another purpose, say for looking for drug dealers, or

Question #2: Missing Data Techniques
(reference: Cook/Swayne p.52)

Here we will be using the "pbc" data set, a data set collected by the
Mayo Clinic to study primary biliary cirrhosis.

PBC Data
Info on PBC Data

Read the data in, but be very careful of the missing data
indicator. (Hint: it is not numeric, so this might cause trouble
reading in the data.  In R, consider using the 'na.strings' option in
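
The idea of declaring the missing-data indicator at read time can be sketched in Python as well. This is only an illustration: the function name `read_table`, the `"?"` marker, and the three-row sample are made up for the sketch and are not the real pbc file or its actual indicator.

```python
import csv
import io

def read_table(text, missing_marker):
    """Parse comma-delimited text with a header row.

    Cells equal to missing_marker become None; all other cells become floats.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        row = {}
        for name, cell in record.items():
            cell = cell.strip()
            row[name] = None if cell == missing_marker else float(cell)
        rows.append(row)
    return rows

# Made-up sample; "?" is a hypothetical marker, not the one used in the pbc file.
sample = "chol,copper\n261,156\n?,54\n176,?\n"
rows = read_table(sample, missing_marker="?")
print(rows)
```

Without the marker check, `float("?")` would raise an error, which is the kind of trouble the hint warns about.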

There are many interesting variables to use for modelling the
dependent variable, status, but for simplicity, we will focus on four
of these variables, chol, copper, trig and platelet.

Look at the univariate distributions of complete (non-missing) cases
for these four variables.  What transformations could you use to make
these more bell-shaped?  Make these transformations and use the
transformed variables for the rest of this exercise.
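
One way to judge whether a transformation helps is to compare skewness before and after. The sketch below tries a log transform, which is only one candidate, on made-up right-skewed numbers; the values and the helper `skewness` are assumptions for illustration, not taken from the pbc data.

```python
import math
import statistics

def skewness(values):
    """Skewness as the average cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Made-up right-skewed values standing in for a variable such as copper.
raw = [10, 12, 15, 20, 28, 40, 75, 150, 400]
logged = [math.log(v) for v in raw]

# A skewness closer to zero suggests a more symmetric distribution.
print(round(skewness(raw), 2), round(skewness(logged), 2))
```

A histogram of each version is still worth looking at, since skewness alone does not show multiple modes or outliers.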

Plot a scatterplot matrix using just these four variables.

Focus on two of the variables: chol and copper.  There are many missing values.
How do the missing values of these two variables relate to each other and to the response?
Assess whether the missing data can be considered 'missing at random'.
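
A simple starting point for these questions is to tabulate the missingness patterns. The sketch below counts how often chol and copper are missing together and compares the missing rate of chol across levels of status. The rows are invented, and both helper functions are hypothetical names used only for this illustration.

```python
from collections import Counter

def missingness_table(rows, var_a, var_b):
    """Count rows by the pair (var_a is missing, var_b is missing)."""
    return Counter((row[var_a] is None, row[var_b] is None) for row in rows)

def missing_rate_by_group(rows, var, group):
    """Fraction of rows with var missing, within each level of group."""
    totals, missing = Counter(), Counter()
    for row in rows:
        totals[row[group]] += 1
        missing[row[group]] += row[var] is None
    return {level: missing[level] / totals[level] for level in totals}

# Made-up rows, not the real pbc data.
rows = [
    {"chol": 261.0, "copper": 156.0, "status": 1},
    {"chol": None, "copper": None, "status": 0},
    {"chol": None, "copper": 54.0, "status": 0},
    {"chol": 176.0, "copper": 210.0, "status": 1},
    {"chol": None, "copper": None, "status": 1},
    {"chol": 302.0, "copper": None, "status": 0},
]
table = missingness_table(rows, "chol", "copper")
rates = missing_rate_by_group(rows, "chol", "status")
print(table)
print(rates)
```

If the two variables tend to be missing on the same rows, or the missing rate differs clearly across levels of the response, that is evidence worth weighing when you assess 'missing at random'.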

Now, pick a method for filling in all of the missing values with some numeric value.
Explain what you did, plot the new variables and comment on why this might have been a good or bad strategy.
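
As one possible strategy among many, the sketch below fills missing values with the median of the observed values and keeps a flag recording which entries were filled. The numbers are made up and `impute_median` is a hypothetical helper; it is not suggested as the best choice.

```python
import statistics

def impute_median(values):
    """Replace None with the median of the observed values.

    Returns the filled list and a parallel list of was-missing flags.
    """
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [v is None for v in values]
    return filled, flags

# Made-up values standing in for chol.
chol = [261.0, None, 176.0, None, 302.0]
filled, flags = impute_median(chol)
observed = [v for v in chol if v is not None]

# Filling with a single constant shrinks the spread of the variable.
print(statistics.pvariance(observed), statistics.pvariance(filled))
```

Keeping the flags lets you mark the filled points in a plot, which makes it easier to see whether the strategy distorts the distribution.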

(A good reference for missing values is here. You may want to look at this for inspiration.)

Question #3: Principal Components

Input the Boston Housing data

Info on the Boston Housing data

This data set has 13 variables measured on census tracts in Boston.
The goal is to predict median value of a home in the tract (medv).

Find the first two principal components of the 13 independent
variables (without scaling) and plot them against each other.
Also show a plot of the cumulative variance
explained by the 13 principal components.
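
The computation can be sketched with NumPy, using a singular value decomposition of the centered (but unscaled) data. The data here are synthetic, with three columns on very different scales, and the function name `pca` is an assumption for the sketch. Plotting is left out.

```python
import numpy as np

def pca(X):
    """PCA of the columns of X via SVD of the centered data.

    Returns (scores, loadings, explained_variance), with components
    ordered from largest to smallest variance.
    """
    Xc = X - X.mean(axis=0)                      # center, do not scale
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                               # projections of each row
    loadings = Vt.T                              # one column per component
    explained_variance = s**2 / (X.shape[0] - 1)
    return scores, loadings, explained_variance

# Synthetic data, not the Boston Housing data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

scores, loadings, var = pca(X)
first_two = scores[:, :2]                        # the points to scatter-plot
cumulative = np.cumsum(var) / var.sum()          # cumulative variance explained
print(cumulative)
```

The array `first_two` holds the coordinates for the plot of the first two components, and `cumulative` holds the values for the cumulative variance plot.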

Look at the projection matrix (also called the rotation or loadings matrix).  This
matrix gives the "loadings" on the variables - the weights in the
linear combination that make up the projection.  You will see that one
of the variables has a much higher weight than the others.  Why is
this?

Now standardize each of the variables by subtracting its mean and dividing by
its standard deviation. Plot the first two PCs and the scree plot again.
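
The standardization step might look like the sketch below, which also compares the first principal component's loadings before and after scaling. The data are synthetic with deliberately unequal column scales, and `standardize` and `first_loading` are hypothetical helper names.

```python
import numpy as np

def standardize(X):
    """Subtract each column's mean and divide by its sample std. dev."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def first_loading(X):
    """Loadings (weights) of the first principal component of X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Synthetic data, not the Boston Housing data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

Z = standardize(X)
raw_loading = first_loading(X)    # computed on the original scales
std_loading = first_loading(Z)    # computed after standardizing
print(np.round(raw_loading, 3))
print(np.round(std_loading, 3))
```

Comparing the two printed vectors on your own data is a useful check when answering the question about the loadings.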