September 6 - December 8, 2011
Tuesdays and Thursdays, 6:10pm to 7:25pm
428 Pupin Hall
Please note the Columbia
Academic Integrity Policy.
"Academic integrity is the
the work of our intellectual community. It is therefore vital that we
teach our students its value as well as identify instances of
dishonesty. In these pages you will find information about the
resources available to you as you address this in your classroom,
specifically regarding prevention, detection, and the Dean's Discipline
Wei Wang - ww2243
Michael Agne - mra2110
You are encouraged to hand in homework
in paper form at the beginning of the class on the day it is due.
If HW must be submitted electronically, please submit to:
Tuesday 3:30-4:30 PM - SSW 901
Wednesday 4-5 PM - Pupin 214
Data mining is a field that has developed over the past decade to
gain knowledge from the vast stores of data available in many fields,
including astronomy, medicine, computational biology,
internet traffic, and retail.
This class will give a broad overview of concepts,
algorithms and applications in data mining. The course is designed to
give a wide appreciation of the many tools and models that are
available to do data mining, and provide an introduction to the data
mining toolbox. The first half of the course will focus on data mining
concepts and techniques, and will cover the most basic and popular
tools for data mining: EDA, regression, classification, and clustering.
The second half of the course will cover more specific application
areas, including text mining, web data mining, recommendation models,
and fraud detection.
The class is appropriate for graduate students from many
disciplines. To take the class, you should have familiarity with
basic statistical techniques and concepts (linear models, regression,
variance, correlation, etc.) and proficiency with some software that
will allow you to do statistical analysis (e.g. R, SAS,
SPSS). As a prerequisite to this class, you should be able to input
data, graph, manipulate, and filter data, and run statistical
models.
Following is a list of our units; slides will be posted here.
||Intro to Data Mining
||Data Mining Concepts
||Sample Midterm Questions
R: Tree Code
||Advanced Classification: SVM and Neural Nets||Topic9-AdvancedClassfication.ppt
||Bayesian Methods - Ken Shirley||Topic11-BayesianMethods.pdf
||Recommender Systems and the Netflix Prize
||Topic12-Recommender Systems / Netflix Prize
||Social and other Networks
Extra Credit Solution
||Term Project Notes
||Final Answer Key
The timeline is as follows:
||HW 1 Due
||Proposal for Class Project Due
||Guest Lecture - Ken Shirley
||Last Day of Class
|12/13||Extra Credit HW due|
|12/13||Class Project Due
||Final Exam - 7PM
Assessment: Assessment in the class will be based on three elements
Software: You may use whatever software you like for your Data Analysis Project, but since the labs will focus on R, I recommend learning R to get the most out of the class.
There are many resources available for
R in the library and online. R is an open-source software system and
many people have created free tutorials and other documentation online.
Go to www.r-project.org
and click on "Manuals" on the left menu for several options. The manual
"An Introduction to R" is
particularly helpful. There are many books about R that might be
helpful -- the library has a few that focus on S or Splus, which will
also be useful. S and Splus have basically the same functionality and
syntax as R, so many of the examples can be used verbatim.
Helpful R Resources:
Recommended R Books:
Text: There is no official text for this class. I will use materials drawn from many different sources. However, I will refer often to the following three texts:
1. T. Hastie, R. Tibshirani, and J. Friedman: The Elements of Statistical Learning: data mining, inference and prediction. Springer Verlag.
This text is available online as a PDF file
2. D. Hand, H. Mannila, P. Smyth: Principles of Data Mining
3. Witten and Frank: Data Mining: Practical Machine Learning Tools and Techniques
In addition we will use