Email: volinsky@research.att.com

Phone: 973-360-8644

Office Hours: by request before or after lectures

Website for Class: http://www.research.att.com/~volinsky/DataMining/Columbia2011/Columbia2011.html

Form for requesting overloads

**Class Logistics:**

September 6 - December 8, 2011

Tuesdays and Thursdays, 6:10pm to 7:25pm

428 Pupin Hall

Please note the Columbia Academic Integrity Policy.

"Academic integrity is the cornerstone of the work of our intellectual community. It is therefore vital that we teach our students its value as well as identify instances of dishonesty. In these pages you will find information about the resources available to you as you address this in your classroom, specifically regarding prevention, detection, and the Dean's Discipline process."

Teaching Assistants:

Wei Wang - ww2243

Michael Agne - mra2110

You are encouraged to hand in homeworks in paper form at the beginning of the class when they are due.

If HW must be submitted electronically, please submit to:

w4240.fall2011.stat.columbia.edu@gmail.com

**TA Office Hours:**

Tuesday 3:30-4:30 PM - SSW 901

Wednesday 4-5 PM - Pupin 214

**Overview:**

Data mining is a field that has developed over the past decade to extract knowledge from the vast stores of data available in many fields, including astronomy, medicine, computational biology, internet traffic, and retail.

This class will give a broad overview of concepts, methodology, algorithms and applications in data mining. The course is designed to give a wide appreciation of the many tools and models that are available to do data mining, and provide an introduction to the data mining toolbox. The first half of the course will focus on data mining concepts and techniques, and will cover the most basic and popular tools for data mining: EDA, regression, classification, and clustering. The second half of the course will cover more specific application areas, including text mining, web data mining, recommendation models, and fraud detection.

The class is appropriate for graduate students from many disciplines. To take the class, you should be familiar with basic statistical techniques and concepts (linear models, regression, variance, correlation, etc.) and proficient with some software that will allow you to do statistical analysis (e.g. R, SAS, SPSS). As a prerequisite to this class, you should be able to input data, graph data, manipulate and filter data, and run statistical models.
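As a rough self-check of the prerequisite skills above, the sketch below walks through the four tasks (input, graph, filter, model) on a made-up five-row data set. It is written in Python with only the standard library, purely for illustration. The class labs use R, and the data, column names, and function names here are all hypothetical.

```python
import csv
import io
from statistics import mean

# Hypothetical toy data, invented for this illustration only.
RAW = """x,y,group
1,2.1,a
2,3.9,a
3,6.2,b
4,7.8,b
5,10.1,a
"""


def load_rows(text):
    """Input data: parse CSV text into dicts with numeric x and y."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"x": float(r["x"]), "y": float(r["y"]), "group": r["group"]}
        for r in reader
    ]


def filter_rows(rows, group):
    """Manipulate and filter: keep only the rows in one group."""
    return [r for r in rows if r["group"] == group]


def text_plot(rows, width=40):
    """Graph data: a crude text bar per row, scaled to the largest y."""
    top = max(r["y"] for r in rows)
    return [
        f"{r['x']:>4.1f} | " + "#" * round(width * r["y"] / top)
        for r in rows
    ]


def fit_line(rows):
    """Run a model: ordinary least squares for y = intercept + slope * x."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


if __name__ == "__main__":
    rows = load_rows(RAW)
    print("\n".join(text_plot(rows)))
    print("rows in group a:", len(filter_rows(rows, "a")))
    intercept, slope = fit_line(rows)
    print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

If each of these steps feels routine in the statistical software you already use, you likely meet the software prerequisite.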

Following is a list of our units; slides will be posted here when available.

In Class Final Presentations:

Hernandez (Music)

Wang (GPS)

Li (Twitter)

Shekhar (Finance)

Feder (Baseball)

MundtGreen (Restaurants)

The timeline is as follows:

| Date | Event |
|---|---|
| 9/15 | HW1 Due |
| 9/29 | HW2 Due |
| 10/11 | Proposal for Class Project Due |
| 10/13 | HW3 Due |
| 10/25 | Midterm |
| 11/3 | HW4 Due |
| 11/17 | Guest Lecture - Ken Shirley |
| 11/17 | HW5 Due |
| 12/8 | HW6 Due |
| 12/8 | Last Day of Class |
| 12/13 | Extra Credit HW Due |
| 12/13 | Class Project Due |
| 12/20 | Final Exam - 7PM |

**Assessment:** Assessment in the class will be based on three elements:

- **Exams** (40%). There will be a midterm in mid-October and a final exam during finals week.
- **Homework** (30%). There will be a homework assignment due approximately every other Thursday. Homeworks are due at the beginning of class on the day they are due. Late homeworks are not accepted.
- **Data Analysis Project** (30%). The class is designed to provide you with a toolbox of data mining methods so that you can apply them in different domains. You will be expected to collect a data set (online or otherwise), formulate a scientific question of interest, and perform the data analysis to address that question. More information is on the project page.

**Software:** You may use whatever software you like for your Data Analysis Project, but since the labs will focus on R, learning R is recommended to get the most out of the class.

There are many resources available for R in the library and online. R is an open-source software system, and many people have created free tutorials and other documentation online. Go to www.r-project.org and click on "Manuals" on the left menu for several options. The Introduction to R is particularly helpful. There are many books about R that might be helpful. The library also has a few that focus on S or Splus; these will be helpful too, because S and Splus have basically the same functionality and syntax as R, so many of their examples can be used verbatim.

Helpful R Resources:

- Revolution R Blog
- R Bloggers has great contributed articles
- More R web sites at the bottom of this page.
- A useful R Tutorial
- Another useful R Tutorial

Recommended R Books:

- Introductory Statistics with R - Dalgaard
- A Beginner's Guide to R - Zuur
- R In A Nutshell (O'Reilly) - Adler
- A Handbook of Statistical Analyses Using R - Everitt and Hothorn
- Here is a comprehensive list of R Books

**Text:** There is no official text for this class. I will use materials drawn from many different sources. However, I will refer often to the following three texts:

1. T. Hastie, R. Tibshirani, and J. Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag.

This text is available online as a PDF file.

2. D. Hand, H. Mannila, P. Smyth: Principles of Data Mining

3. Witten and Frank: Data Mining: Practical Machine Learning Tools and Techniques

In addition, we will use:

- Padhraic Smyth's online Data Mining class notes
- Interactive and Dynamic Graphics for Data Analysis by Di Cook and Deborah Swayne. The web page for this book has great data sets, examples and even movies about data mining.
- David Madigan's Course notes
- Also thanks to Shawndra Hill, who graciously shared some of her notes and examples