Class Logistics:
September 6 - December 8, 2011
Tuesdays and Thursdays, 6:10pm to 7:25pm
428 Pupin Hall
Please note the Columbia Academic Integrity Policy.
"Academic integrity is the cornerstone of
the work of our intellectual community. It is therefore vital that we
teach our students its value as well as identify instances of
dishonesty. In these pages you will find information about the
resources available to you as you address this in your classroom,
specifically regarding prevention, detection, and the Dean's Discipline
process."
Teaching Assistants:
Wei Wang - ww2243
Michael Agne - mra2110
You are encouraged to hand in homework in paper form at the beginning
of the class in which it is due.
If homework must be submitted electronically, please send it to:
w4240.fall2011.stat.columbia.edu@gmail.com
TA Office Hours:
Tuesday 3:30-4:30 PM - SSW 901
Wednesday 4-5 PM - Pupin 214
Overview:
Data mining is a field that has developed over the past decade to
extract knowledge from the vast stores of data available in many areas,
including astronomy, medicine, computational biology, internet traffic,
and retail.
This class will give a broad overview of concepts,
methodology,
algorithms and applications in data mining. The course is designed to
give a wide appreciation of the many tools and models that are
available to do data mining, and provide an introduction to the data
mining toolbox. The first half of the course will focus on data mining
concepts and techniques, and will cover the most basic and popular
tools for data mining: EDA, regression, classification, and clustering.
The second half of the course will cover more specific application
areas, including text mining, web data mining, recommendation models,
and fraud detection.
The class is appropriate for graduate students from many
disciplines. To take the class, you should be familiar with
basic statistical techniques and concepts (linear models, regression,
variance, correlation, etc.) and proficient with some software for
statistical analysis (e.g., R, SAS, SPSS). As a prerequisite to this
class, you should be able to input, graph, manipulate, and filter data,
and run statistical models.
Following is a list of our units; slides will be posted here when
available.
The timeline is as follows:
| 9/15 | HW1 Due |
| 9/29 | HW2 Due |
| 10/11 | Proposal for Class Project Due |
| 10/13 | HW3 Due |
| 10/25 | Midterm |
| 11/3 | HW4 Due |
| 11/17 | Guest Lecture - Ken Shirley |
| 11/17 | HW5 Due |
| 12/8 | HW6 Due |
| 12/8 | Last Day of Class |
| 12/13 | Extra Credit HW Due |
| 12/13 | Class Project Due |
| 12/20 | Final Exam - 7PM |
Assessment: Assessment in the class will be based on three elements
Software: You may use whatever software you like for your Data Analysis Project, but since the labs will focus on R, I recommend learning R to get the most out of the class.
There are many resources available for R in the library and online.
R is an open-source software system, and many people have created free
tutorials and other documentation. Go to www.r-project.org
and click on "Manuals" in the left menu for several options. The manual
"An Introduction to R" is particularly helpful. There are also many
books about R; the library has a few that focus on S or Splus, which
are useful as well. S and Splus have basically the same functionality
and syntax as R, so many of their examples can be used verbatim.
Helpful R Resources:
Recommended R Books:
Text: There is no official text for this class. I will use materials
drawn from many different sources. However, I will refer often to the
following three texts:
1. T. Hastie, R. Tibshirani, and J. Friedman: The Elements of
Statistical Learning: Data Mining, Inference, and Prediction.
Springer Verlag.
This text is available online as a PDF file.
2. D. Hand, H. Mannila, and P. Smyth: Principles of Data Mining
3. I. Witten and E. Frank: Data Mining: Practical Machine Learning
Tools and Techniques
In addition we will use