BAD 64020 Summer 2010 Booth
BUSINESS ADMINISTRATION
ADVANCED STATISTICAL MODELS
BAD 64020
Topic: Data Mining
Summer 2010
Instructor: Dr. David Booth
Office: A428 BSA
Phone (Office): 672-1143
Office Hours: MT 2-4:30 pm and by appointment
e-mail: dbooth@kent.edu
Please note that if I am not in my office at these times you will find a note on my door telling you where I am. Please then go to that location to see me. Please feel free to call me or leave a note in my mailbox if you need to contact me.
Textbook: Pearson, R. K., Mining Imperfect Data, SIAM (2005)
Booth, D. E., UMAP Module 626 (1985)
Croux & Haesbroeck (2003), Comp. Stat. & Data Analysis 44, 273-295
Pregibon (1981), Annals of Stat. 9, 705-724
Seis, G, ED230B/C: Regression with clustered data
Booth, D. E. (1982), Decision Sci. 13, 71-81
Booth, D. et al. (1986), Computers & Biomed. Res. 19, 1-12
Zhu et al. (2007), Current Analyt. Chem. 3, 233-237
Booth et al. (2005), Current Analyt. Chem. 1, 181-186
Connor et al. (1994), IEEE Trans. on Neural Networks 5, 240-254
Kaufman & Rousseeuw, Finding Groups in Data, Wiley (1990) (R package cluster)
Draghici (2003), Data Analysis Tools for DNA Microarrays, Chapman & Hall
Speed, T. (2003), Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall
Stekel, D. (2003), Microarray Bioinformatics, Cambridge
Wu et al. (2009), Biotechniques 47(2), 691-700
Hauser & Booth (2010), J. Data Sci., in press
Bianco & Martinez (2009), Comp. Stat. & Data Analysis 53, 4095-4105
Lee et al. (2005), Expert Systems with Applications 29, 1-16
R Handout
SAS Handout
Course objectives:
At the end of the course the student will have:
0) Learned some common methods of data analysis for large data sets
1) Learned the major causes of problems in large data sets (Missing Values, Outliers)
2) Learned the major methods of dealing with the problems in 1): imputation, trimming, and other robust methods.
3) Gained experience with one or more of the methods in 2) with a large data set, using either SAS or R.
These skills will prepare you for more advanced work in your major, either in college or on the job.
Attendance and Make-up Policy:
In general, students are expected to attend class and are responsible for any material discussed and/or assigned. The general policy is that missed work (including exams) cannot be made up and late work will not be accepted. The only exceptions are:
1) A prearranged situation (e.g., course field trip, athletic trip, etc.)
2) Emergency illness, death in the family, etc. In this case, notify the instructor as soon as possible.
In all cases, contact the instructor early.
Performance Evaluation:
There will be one hourly examination, worth 100 points, plus class assignments to be turned in. The exam will be open book and open notes.
There will be a term paper applying one or more of the topics discussed under objective 2) to a large data set, using an appropriate analysis method (e.g., neural network, cluster analysis) chosen in consultation with the instructor. The topic and data set must be approved by the instructor; get approval early. The method must be implemented in SAS or R if possible (discuss with the instructor) and applied to a real data set, and a complete journal-style paper must be submitted. The paper will be worth 100 points.
Academic dishonesty, in all forms, is prohibited. All material handed in is in the public domain. This syllabus is a guide, not an absolute contract. The grading scale is 90+ A, 80+ B, etc.
Students with Disabilities:
In accordance with University policy, if you have a documented disability and require accommodations to obtain equal access in this course, please contact the instructor at the beginning of the semester or when given an assignment for which an accommodation is required. Students with disabilities must verify their eligibility through the Office of Student Accessibility Services (SAS) (672-3391).
Tentative Course Outline
Section | Topic & References | Suggested Problems
Introduction | |
Common Analysis Methods for Large Data Sets | Stekel, Chapt. 7-9: Methods for hypothesis testing (GLM) and classification (clustering, PCA, SVM, and ANN) |
Large n allows close distinctions to be drawn | |
Multiple comparisons with a large number of tests (Bonferroni) | Stekel, p. 133; Speed, p. 63; Draghici, pp. 166, 225 |
Introduction to Data Anomalies | Pearson, Chapt. 1 |
Data Pretreatment for Missing Values | Pearson, Sect. 4.1, 4.2 |
Overview of outliers and related anomalies | Pearson, Chapt. 2 |
Univariate outliers | Pearson, Chapt. 3 |
Data pretreatment: outliers | Pearson, Sect. 4.3-4.5; Wu et al. (2009) |
Good Data Characterization | Pearson, Chapt. 5 |
Robust Statistics and nonparametric regression | UMAP Module 626, COMAP |
RPDA | Booth (1986), (1982) |
Classification by logistic regression | Pregibon (1981) |
Classification by AGORAS (p>n?) | Kaufman & Rousseeuw |
Cross-validation | Speed, pp. 107, 125-6, 135 |
Classification with the BY estimator | Croux and Haesbroeck (2003) |
Classification by Neural Networks | Zhu et al. (2007) |
Time series outliers | Connor et al. (1994); Booth et al. (2005) |
GSA | Pearson, Chapt. 6 |
Sampling a large data set | Pearson, Chapt. 7 |
Open questions | Pearson, Chapt. 8 |
FINAL EXAM | |