Data Mining
I400 - Spring 2005
▪ Syllabus
▪ Grades
▪ Midterm Exam: March 9th, 2005, in class. Review: March 7th.
▪ Final Exam: May 2nd, 2005, 2:45-4:45pm. Review: April 27th.
Week 1: January 10-14, 2005
Topics
Overview of the course
Introduction to data mining
Data sets
Reading material
Textbook: Introduction (Chapter 1, pp.1-24)
Textbook: Measurement and data (Chapter 2, sections 2.1, 2.2, 2.5-2.7)
Matlab primer by Kermit Sigmon from University of Florida (pdf)
(Lecture Notes M) (Lecture Notes W) (Homework Assignment) (Lab tutorial)
Week 2: January 17-21, 2005
Topics
Distance measures and correlation
K-nearest neighbor algorithm
Reading material
Textbook: Measurement and data (Chapter 2, section 2.3)
Textbook: Predictive modeling for classification (Chapter 10, section 10.6)
Week 3: January 24-28, 2005
Topics
K-nearest neighbor algorithm
Introduction to data visualization
Introduction to data preprocessing (data cleaning and data integration)
Reading material
Textbook: Visualizing and exploring data (Chapter 3, sections 3.1-3.5)
Demonstration
K-NN algorithm by Dennis Groth from Indiana University
(Lecture Notes M) (Lecture Notes W) (Homework Assignment)
Week 4: January 31-February 4, 2005
Topics
Data transformation and data reduction
Introduction to principal components analysis (PCA)
Reading material
Textbook: Transforming data (Chapter 2, section 2.4)
Student tutorial on PCA by Lindsay Smith from University of Otago, NZ (pdf)
(Lecture Notes M) (Lecture Notes W) (Lab Material)
Week 5: February 7-11, 2005
Topics
Principal components analysis (PCA)
Statistical approaches to classification and regression
Reading material
Textbook: Data analysis and uncertainty (Chapter 4, sections 4.1-4.4)
Useful link
N-fold cross-validation by Jeff Schneider from Carnegie Mellon University
(Lecture Notes M) (Lecture Notes W)
Week 6: February 14-18, 2005
Topics
Bayes' rule and statistical inference
Discriminant function
(Lecture Notes M) (Lecture Notes W) (Homework Assignment)
Week 7: February 21-25, 2005
Topics
Naive Bayes model
Entropy
Reading material
Textbook: Predictive modeling for classification (Chapter 10, section 10.8)
Week 8: February 28-March 4, 2005
Topics
Decision trees
Reading material
Decision tree learning, from the book Machine Learning by Tom Mitchell (Chapter 3)
(Lecture Notes by Tom Mitchell from Carnegie Mellon University)
Week 9: March 7-11, 2005
Review for the midterm: Monday
Midterm exam: Wednesday
Week 10: March 14-18, 2005
Spring break
Week 11: March 21-25, 2005
Topics
Linear regression
Logistic regression
Introduction to Weka software by Henry Paik
Reading material
Textbook: Models and patterns (Chapter 6: 6.1, 6.2, 6.3.1)
(Lecture Notes M) (Lecture Notes W) (Homework Assignment) (Datasets)
Week 12: March 28-April 1, 2005
Topics
Data mining applications: prediction of protein structure
Introduction to neural networks
Reading material
Artificial neural networks, from the book Machine Learning by Tom Mitchell (Chapter 4)
(Lecture Notes M - 9MB) (Lecture Notes by Tom Mitchell from Carnegie Mellon University)
Week 13: April 4-8, 2005
Topics
Neural networks: training and applications
Introduction to association rules; Apriori algorithm
Reading material
Artificial neural networks, from the book Machine Learning by Tom Mitchell (Chapter 4)
Textbook: Finding patterns and rules (Chapter 13)
Week 14: April 11-15, 2005
Topics
Mining association rules: extracting multilevel and statistically significant rules
Cluster analysis: hierarchical clustering
Reading material
Textbook: Descriptive modeling (Chapter 9; Sections 9.3 and 9.5)
Good paper on association rules!
S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: generalizing association rules to correlations. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 265-276, Tucson, Arizona, May 13-15, 1997. (ps)
Good paper on clustering!
G. Karypis, E-H Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8): 68-75, 1999. (pdf)
(Lecture Notes M) (Lecture Notes W) by Jiawei Han from University of Illinois Urbana-Champaign
(Homework Assignment) (Assignment Data)
Week 15: April 18-22, 2005
Topics
Clustering: partitioning methods
Clustering: density-based methods
Introduction to outlier detection
Reading material
Textbook: Descriptive modeling (Chapter 9; Section 9.4)
Review paper on clustering
A. K. Jain, M. N. Murty, P. J. Flynn. Data clustering: a review. ACM Computing Surveys 31(3): 264-323, 1999. (pdf)
(Lecture Notes W by Jiawei Han from UIUC)
Week 16: April 25-29, 2005
Topics
Data mining applications: information retrieval
Data mining applications: biometrics
"Google" paper
S. Brin, L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7): 107-117, 1998. (pdf)
Biometrics review paper
A. K. Jain, S. Pankanti, S. Prabhakar, L. Hong, A. Ross, J. L. Wayman. Biometrics: a grand challenge. Proceedings of the International Conference on Pattern Recognition, 2004. (pdf)
Guest presentations:
Filippo Menczer, "Google under the hood", Indiana University (Monday, April 25)
Nitesh Chawla, University of Notre Dame (Friday, April 29)
Week 17: May 2, 2005
Final Exam:
Time: 2:45-4:45pm
Place: BU 327
Grades:
Wednesday, May 4th by the end of the day (11:59pm)
Last updated: 05/05/2005 12:35 PM