Data Mining

I400 - Spring 2005

 

Syllabus

Grades

Midterm Exam: March 9th, 2005, in class. Review: March 7th.

Final Exam: May 2nd, 2005, 2:45-4:45pm. Review: April 27th.

 


Week 1: January 10-14, 2005

 

Topics

Overview of the course

Introduction to data mining

Data sets

 

Reading material

Textbook: Introduction (Chapter 1, pp.1-24)

Textbook: Measurement and data (Chapter 2, sections 2.1, 2.2, 2.5-2.7)

MATLAB Primer by Kermit Sigmon from the University of Florida (pdf)

 

(Lecture Notes M) (Lecture Notes W) (Homework Assignment) (Lab tutorial)


Week 2: January 17-21, 2005

 

Topics

Distance measures and correlation

K-nearest neighbor algorithm

 

Reading material

Textbook: Measurement and data (Chapter 2, section 2.3)

Textbook: Predictive modeling for classification (Chapter 10, section 10.6)

 

(Lecture Notes W)


Week 3: January 24-28, 2005

 

Topics

K-nearest neighbor algorithm

Introduction to data visualization

Introduction to data preprocessing (data cleaning and data integration)

 

Reading material

Textbook: Visualizing and exploring data (Chapter 3, sections 3.1-3.5)

 

Demonstration

K-NN algorithm by Dennis Groth from Indiana University

 

(Lecture Notes M) (Lecture Notes W) (Homework Assignment)


Week 4: January 31-February 4, 2005

 

Topics

Data transformation and data reduction

Introduction to principal components analysis (PCA)

 

Reading material

Textbook: Transforming data (Chapter 2, section 2.4)

Student tutorial on PCA by Lindsay Smith from University of Otago, NZ (pdf)

 

(Lecture Notes M) (Lecture Notes W) (Lab Material)


Week 5: February 7-11, 2005

 

Topics

Principal components analysis (PCA)

Statistical approaches to classification and regression

 

Reading material

Textbook: Data analysis and uncertainty (Chapter 4, sections 4.1-4.4)

 

Useful link

N-fold cross-validation by Jeff Schneider from Carnegie Mellon University

 

(Lecture Notes M) (Lecture Notes W)


Week 6: February 14-18, 2005

 

Topics

Bayes' rule and statistical inference

Discriminant function

 

(Lecture Notes M) (Lecture Notes W) (Homework Assignment)


Week 7: February 21-25, 2005

 

Topics

Naive Bayes model

Entropy

 

Reading material

Textbook: Predictive modeling for classification (Chapter 10, section 10.8)

 

(Lecture Notes M)


Week 8: February 28-March 4, 2005

 

Topics

Decision trees

 

Reading material

Decision tree learning, from the book Machine Learning by Tom Mitchell (Chapter 3)

 

(Lecture Notes by Tom Mitchell from Carnegie Mellon University)


Week 9: March 7-11, 2005

 

Review for the midterm: Monday

Midterm exam: Wednesday

 


Week 10: March 14-18, 2005

 

Spring break

 


Week 11: March 21-25, 2005

 

Topics

Linear regression

Logistic regression

Introduction to Weka software by Henry Paik

 

Reading material

Textbook: Models and patterns (Chapter 6: 6.1, 6.2, 6.3.1)

 

(Lecture Notes M) (Lecture Notes W) (Homework Assignment) (Datasets)


Week 12: March 28-April 1, 2005

 

Topics

Data mining applications: prediction of protein structure

Introduction to neural networks

 

Reading material

Artificial neural networks, from the book Machine Learning by Tom Mitchell (Chapter 4)

 

(Lecture Notes M - 9MB) (Lecture Notes by Tom Mitchell from Carnegie Mellon University)


Week 13: April 4-8, 2005

 

Topics

Neural networks: training and applications

Introduction to association rules; Apriori algorithm

 

Reading material

Artificial neural networks, from the book Machine Learning by Tom Mitchell (Chapter 4)

Textbook: Finding patterns and rules (Chapter 13)

 

(Lecture Notes W)


Week 14: April 11-15, 2005

 

Topics

Mining association rules: extracting multilevel and statistically significant rules

Cluster analysis: hierarchical clustering

 

Reading material

Textbook: Descriptive modeling (Chapter 9; Sections 9.3 and 9.5)

 

Good paper on association rules!

S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: generalizing association rules to correlations. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 265-276, Tucson, Arizona, May 13-15, 1997. (ps)

 

Good paper on clustering!

G. Karypis, E-H Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8): 68-75, 1999. (pdf)

 

(Lecture Notes M) (Lecture Notes W) by Jiawei Han from University of Illinois Urbana-Champaign

(Homework Assignment) (Assignment Data)


Week 15: April 18-22, 2005

 

Topics

Clustering: partitioning methods

Clustering: density-based methods

Introduction to outlier detection

 

Reading material

Textbook: Descriptive modeling (Chapter 9; Section 9.4)

 

Review paper on clustering

A. K. Jain, M. N. Murty, P. J. Flynn. Data clustering: a review. ACM Computing Surveys 31(3): 264-323, 1999. (pdf)

 

(Lecture Notes W by Jiawei Han from UIUC)


Week 16: April 25-29, 2005

 

Topics

Data mining applications: information retrieval

Data mining applications: biometrics

 

"Google" paper

S. Brin, L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7): 107-117, 1998. (pdf)

 

Biometrics review paper

A. K. Jain, S. Pankanti, S. Prabhakar, L. Hong, A. Ross, J. L. Wayman. Biometrics: a grand challenge. Proceedings of the International Conference on Pattern Recognition, 2004. (pdf)

 

Guest presentations:

Filippo Menczer, "Google under the hood", Indiana University (Monday, April 25)

Nitesh Chawla, University of Notre Dame (Friday, April 29)

 


Week 17: May 2, 2005

 

Final Exam:

Time: 2:45-4:45pm

Place: BU 327

 

Grades:

Wednesday, May 4th, by the end of the day (11:59pm)

 


Last updated: 05/05/2005 12:35 PM