Question #1: Data Mining and Terrorism

Read the following two short articles:

"Mining The Matrix:"

"Why Data Mining Won't Stop Terror"

One article talks about why data mining cannot possibly work in

fighting terrorism; the other seems to describe an actual example

where data mining effectively identified persons of interest.

Write a paragraph explaining the conflict in these two articles,

and expressing your opinion about the role of data mining in fighting

terrorism. Would your opinion change if the government used the same

technology for another purpose, say for looking for drug dealers, or

tax evaders?

Question #2: Missing Data Techniques

(reference: Cook/Swayne p.52)

Here we will be using the "pbc" data set, a data set collected by the

Mayo Clinic to study primary biliary cirrhosis.

PBC Data

Info on PBC Data

Read the data in, but be very careful of the missing data

indicator. (Hint: it is not numeric, so this might cause trouble

reading in the data. In R, consider using the 'na.strings' option in

'read.table')
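
For readers working in Python rather than R, the same idea can be sketched with pandas, whose `read_csv` takes an `na_values` argument that plays the role of `na.strings`. The inline excerpt below is made up, and the choice of "." as the missing-data marker is an assumption; inspect the actual file to see which marker it uses.

```python
import io
import pandas as pd

# Hypothetical four-row excerpt; the real file has many more rows and columns.
# "." as the missing-data marker is an assumption; check the actual file.
raw = """status chol copper trig platelet
1 261 156 172 190
0 302 54 88 221
1 . 210 . 151
0 244 . 64 .
"""

# Declaring the marker up front keeps the columns numeric instead of text.
pbc = pd.read_csv(io.StringIO(raw), sep=r"\s+", na_values=["."])

print(pbc.dtypes)        # column types after parsing
print(pbc.isna().sum())  # number of missing values per column
```

Without the `na_values` argument, any column containing the marker would be read as strings, which is the trouble the hint warns about.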

There are many interesting variables to use for modelling the

dependent variable, status, but for simplicity, we will focus on four

of these variables, chol, copper, trig and platelet.

Look at the univariate distributions of complete (non-missing) cases

for these four variables. What transformations could you use to make

these more bell-shaped? Make these transformations and use the

transformed variables for the rest of this exercise.
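
One way to judge whether a transformation helps is to compare the skewness of the complete cases before and after it. The sketch below uses synthetic right-skewed data as a stand-in for a lab measurement and tries a log transform; the log is only one candidate, and the data and the helper `skewness` are illustrative, not part of the pbc data set.

```python
import numpy as np

def skewness(x):
    """Sample skewness of the non-missing values (0 for a symmetric sample)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]              # complete cases only
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Synthetic right-skewed values with a few missing entries.
values = rng.lognormal(mean=5.0, sigma=0.6, size=500)
values[::50] = np.nan

logged = np.log(values)              # missing entries stay missing

print(skewness(values), skewness(logged))
```

A skewness near zero after the transformation suggests a more symmetric, bell-like shape; a histogram of the transformed values is still worth checking by eye.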

Plot a scatterplot matrix using just these four variables.

Focus on two of the variables: chol and copper. There are many missing values.

How do the missing values of these two variables relate to each other and to the response?

Assess whether the missing data can be considered 'missing at random'.
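
A simple starting point for these questions is to turn each variable into a missing/not-missing indicator, cross-tabulate the indicators, and compare the response between the two groups. The sketch below does this on a tiny made-up data frame; the values are illustrative only, and a comparison like this is suggestive rather than a formal test of the missing-at-random assumption.

```python
import numpy as np
import pandas as pd

# Tiny made-up data frame; the values are illustrative only.
df = pd.DataFrame({
    "status": [0, 0, 1, 1, 0, 1, 0, 1],
    "chol":   [261, np.nan, 302, np.nan, 244, np.nan, 280, 310],
    "copper": [156, np.nan, 54, np.nan, 210, 64, np.nan, 90],
})

chol_missing = df["chol"].isna().rename("chol_missing")
copper_missing = df["copper"].isna().rename("copper_missing")

# Do the two variables tend to be missing on the same rows?
joint = pd.crosstab(chol_missing, copper_missing)

# Does the mean response differ between rows with and without chol?
status_by_missing = df.groupby(chol_missing)["status"].mean()

print(joint)
print(status_by_missing)
```

If missingness in one variable strongly predicts missingness in the other, or the response differs noticeably between the groups, that is worth discussing when arguing for or against 'missing at random'.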

Now, pick a method for filling in all of the missing values with some numeric value.

Explain what you did, plot the new variables and comment on why this might have been a good or bad strategy.
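
As one possible strategy among many, the sketch below fills missing entries with the median of the observed values and keeps a mask of which entries were filled. The helper `impute_median` and the numbers are hypothetical; the point is only to show the mechanics and one known side effect.

```python
import numpy as np

def impute_median(x):
    """Return (filled copy, missing mask) using the median of observed values."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    filled = x.copy()
    filled[missing] = np.nanmedian(x)
    return filled, missing

# Made-up transformed values with two missing entries.
chol = np.array([5.2, np.nan, 5.7, 5.5, np.nan, 6.1])
filled, was_missing = impute_median(chol)

print(filled)
print(np.nanstd(chol), filled.std())  # spread before and after filling
```

Filling with a single constant shrinks the spread of the variable and stacks the filled points at one value, which shows up clearly when the new variable is plotted. Keeping the mask makes it possible to color the filled points differently.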

(A good reference for missing values is here. You may want to look at it for inspiration.)

Question #3: Principal Components

Input the Boston Housing data

Info on the Boston Housing data

This data set has 13 variables measured on census tracts in Boston.

The goal is to predict the median value of a home in the tract (medv).

Find the first two principal components of the 13 independent

variables (without scaling) and plot them against each other.

Also show a plot of the cumulative variance

explained by the 13 principal components.
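
The computation can be sketched with NumPy alone: center the columns, take a singular value decomposition, and read off the scores, loadings and variances. The function `pca` and the synthetic matrix below are illustrative stand-ins, not the Boston Housing data, and a library routine (for example `prcomp` in R) would normally be used instead.

```python
import numpy as np

def pca(X):
    """PCA by SVD of the column-centered data matrix (no scaling).

    Returns (scores, loadings, explained_variance); loadings has one
    column per component, ordered by decreasing variance.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                                # same as Xc @ Vt.T
    explained_variance = s ** 2 / (X.shape[0] - 1)
    return scores, Vt.T, explained_variance

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # synthetic stand-in

scores, loadings, var = pca(X)
cumulative = np.cumsum(var) / var.sum()           # cumulative share of variance
print(cumulative)
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the first two components against each other, and plotting `cumulative` against the component number gives the cumulative variance plot.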

Look at the projection matrix (also called the rotation or loadings matrix). This

matrix gives the "loadings" on the variables - the weights in the

linear combination that make up the projection. You will see that one

of the variables has a much higher weight than the others. Why is

this?

Now, standardize each of the variables by subtracting its mean and dividing by

its standard deviation. Plot the first two PCs and the cumulative variance explained again.
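
A minimal sketch of the standardization step, again on synthetic data rather than the Boston Housing data: after each column is centered and divided by its sample standard deviation, the covariance matrix of the result equals the correlation matrix of the original variables, so PCA on standardized data is PCA on the correlation matrix. The helper `standardize` is hypothetical.

```python
import numpy as np

def standardize(X):
    """Subtract each column's mean and divide by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

rng = np.random.default_rng(2)
# Synthetic columns on very different scales (not the Boston data).
X = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])

Z = standardize(X)

# Covariance of the standardized data equals the correlation matrix of X.
cov_Z = np.cov(Z, rowvar=False)
corr_X = np.corrcoef(X, rowvar=False)

# Eigenvalues of the correlation matrix are the PC variances after scaling.
eigvals = np.linalg.eigvalsh(corr_X)[::-1]
print(eigvals, eigvals.sum())   # the sum equals the number of variables
```

Because every standardized variable has variance one, no single variable can dominate the components purely through its units of measurement.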