Data Mining, Columbia University, Fall 2011


Instructor: Chris Volinsky
Email: volinsky@research.att.com

Phone: 973-360-8644
Office Hours: by request before or after lectures

Website for Class: http://www.research.att.com/~volinsky/DataMining/Columbia2011/Columbia2011.html

Form for requesting overloads

Class Logistics: 

September 6 - December 8, 2011

Tuesdays and Thursdays, 6:10pm to 7:25pm

428 Pupin Hall

Please note the Columbia Academic Integrity Policy:

"Academic integrity is the cornerstone of the work of our intellectual community. It is therefore vital that we teach our students its value as well as identify instances of dishonesty. In these pages you will find information about the resources available to you as you address this in your classroom, specifically regarding prevention, detection, and the Dean's Discipline process."

Teaching Assistants: 

Wei Wang - ww2243
Michael Agne - mra2110

You are encouraged to hand in homework in paper form at the beginning of the class in which it is due.
If HW must be submitted electronically, please submit to:

w4240.fall2011.stat.columbia.edu@gmail.com

TA Office Hours:  

Tuesday 3:30-4:30  PM - SSW 901
Wednesday 4-5 PM - Pupin 214

Overview:

Data mining is a field that has developed over the past decade to extract knowledge from the vast stores of data available in many areas, including astronomy, medicine, computational biology, internet traffic, and retail.

This class will give a broad overview of concepts, methodology, algorithms, and applications in data mining. The course is designed to give a broad appreciation of the many tools and models available for data mining and to provide an introduction to the data mining toolbox. The first half of the course will focus on data mining concepts and techniques and will cover the most basic and popular tools: exploratory data analysis (EDA), regression, classification, and clustering. The second half of the course will cover more specific application areas, including text mining, web data mining, recommendation models, and fraud detection.

The class is appropriate for graduate students from many disciplines. To take the class, you should be familiar with basic statistical techniques and concepts (linear models, regression, variance, correlation, etc.) and proficient with some software for statistical analysis (e.g., R, SAS, SPSS). As a prerequisite, you should be able to input data, graph data, manipulate and filter data, and run statistical models.

The following is a list of our units. Slides will be posted here when available.

Unit | Title | Slides / HW
1 | Intro to Data Mining | Topic 1-DMIntro.ppt; HW1.html; Solution
2 | EDA/Visualization Tools | Topic 2-EDAViz.ppt; HW2.html; Solution
3 | Data Mining Concepts | Topic3-DMConcepts
4 | Regression | Topic4.1-RegressionZorych.pdf; Topic4.2-RegressionVolinsky.ppt
5 | Classification | Topic5.1-ClassificationMadigan.pdf; Topic5.2-ClassificationVolinsky.ppt; HW3.html; Solution
6 | Clustering and Unsupervised Learning | Topic6-Clustering.ppt; Topic6.2-ClusteringExample.ppt
7 | Text Mining | Topic7-TextMining.ppt
- | Midterm | Sample Midterm Questions; MidtermReview; Midterm Solutions; R: Tree Code; Class Data
8 | Web Mining | Topic8-WebMining.ppt; HW4.html; Solution
9 | Advanced Classification: SVM and Neural Nets | Topic9-AdvancedClassfication.ppt; HW5.pdf; Solution
10 | Ensemble Methods | Topic10-EnsembleMethods.ppt
11 | Bayesian Methods - Ken Shirley | Topic11-BayesianMethods.pdf
12 | Recommender Systems and the Netflix Prize | Topic12-Recommender Systems / Netflix Prize; HW6.html; Solution
13 | Social and other Networks | Topic13-Networks; ExtraCreditHW.html; Extra Credit Solution
14 | Class Presentations | Term Project Notes
15 | Final Exam | Final Review; Final Answer Key

In Class Final Presentations:
Hernandez (Music)
Wang (GPS)
Li (Twitter)
Shekhar (Finance)
Feder (Baseball)
MundtGreen (Restaurants)

The timeline is as follows:

9/15 HW1 Due
9/29 HW2 Due
10/11 Proposal for Class Project Due
10/13 HW3 Due
10/25 Midterm
11/3 HW4 Due
11/17 Guest Lecture - Ken Shirley
11/17 HW5 Due
12/8 HW6 Due
12/8 Last Day of Class
12/13 Extra Credit HW Due
12/13 Class Project Due
12/20 Final Exam - 7PM


Assessment: Assessment in the class will be based on three elements




Software: You may use whatever software you like for your Data Analysis Project, but since the labs will focus on R, I recommend learning R to get the most out of the class.

There are many resources available for R in the library and online. R is an open-source software system, and many people have created free tutorials and other documentation online. Go to www.r-project.org and click on "Manuals" in the left menu for several options. The manual "An Introduction to R" is particularly helpful. There are also many books about R. The library has a few that focus on S or Splus, which will still be useful: S and Splus have basically the same functionality and syntax as R, so many of their examples can be used verbatim.

Helpful R Resources:

Recommended R Books:


Text:
  There is no official text for this class. I will use materials drawn from many different sources. However, I will refer often to the following three texts:

1. T. Hastie, R. Tibshirani, and J. Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

       This text is available online as a PDF file

2. D. Hand, H. Mannila, P. Smyth:  Principles of Data Mining

3. Witten and Frank: Data Mining: Practical Machine Learning Tools and Techniques

In addition we will use


Data Sets: Data used during the class can be accessed at the Data Sets page