BAD 64020 Summer 2010 Booth

BUSINESS ADMINISTRATION

ADVANCED STATISTICAL MODELS

BAD 64020

Topic: Data Mining

Summer 2010

Instructor: Dr. David Booth

Office: A428 BSA

Phone (Office): 672-1143

Office Hours: MT 2-4:30 pm and by appointment

e-mail: dbooth@kent.edu

Please note that if I am not in my office at these times, you will find a note on my door telling you where I am; please go to that location to see me. Feel free to call me or leave a note in my mailbox if you need to contact me.

Textbook: Pearson, R. K., Mining Imperfect Data, SIAM (2005)

Booth, D. E., UMAP Module 626 (1985)

Croux & Haesbroeck (2003), Comp. Stat. & Data Analysis 44, 273-295

Pregibon (1981), Annals of Stat. 9, 705-724

Seis, G, ED230B/C: Regression with clustered data

Booth, D. E. (1982), Decision Sci. 13, 71-81

Booth, D. et al. (1986), Computers & Biomed. Res. 19, 1-12

Zhu et al. (2007), Current Analyt. Chem. 3, 233-237

Booth et al. (2005), Current Analyt. Chem. 1, 181-186

Connor et al. (1994), IEEE Trans. on Neural Networks 5, 240-254

Kaufman & Rousseeuw, Finding Groups in Data, Wiley (1990) (R package cluster)

Draghici (2003), Data Analysis Tools for DNA Microarrays, Chapman & Hall

Speed, T. (2003), Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall

Stekel, D. (2003), Microarray Bioinformatics, Cambridge

Wu et al. (2009), Biotechniques 47(2), 691-700

Hauser & Booth (2010), J. Data Sci., in press

Bianco & Martinez (2009), Comp. Stat. & Data Analysis 53, 4095-4105

Lee et al. (2005), Expert Systems with Applications 29, 1-16

R Handout

SAS Handout

Course objectives:

At the end of the course the student will have:

0) Learned some common methods of data analysis for large data sets

1) Learned the major causes of problems in large data sets (Missing Values, Outliers)

2) Learned the major methods of dealing with the problems in 1): imputation, trimming, and other robust methods.

3) Gained experience with one or more of the methods in 2) with a large data set, using either SAS or R.

These skills will prepare you for more advanced work in your major, either in college or on the job.

Attendance and Make-up Policy:

In general, students are expected to attend class and are responsible for any material discussed and/or assigned. The general policy is that no make-up of missed work (including exams) is allowed and no late work will be accepted. The only exceptions are:

1) A prearranged situation (e.g., a course field trip or an athletic trip)

2) Emergency illness, death in the family, etc. In this case, the instructor should be notified as soon as possible.

3) Contact the instructor early

Performance Evaluation:

There will be one hourly examination, worth 100 points, and class assignments to be turned in. The exam format will be open book and open notes.

There will be a term paper applying one or more of the topics discussed under objective 2) to a large data set, with an appropriate analysis method (e.g., neural network, cluster analysis) chosen in consultation with the instructor. The topic and data set must be approved by the instructor; get approval early. The method must be implemented in SAS or R, if possible (discuss with the instructor), using a real data set, and a complete journal-style paper must be submitted. The paper will be worth 100 points.

Academic dishonesty, in all forms, is prohibited. All material handed in is in the public domain. This syllabus is a guide, not an absolute contract. The grading scale is 90+ A, 80+ B, etc.

Students with Disabilities:

In accordance with University policy, if you have a documented disability and require accommodations to obtain equal access in this course, please contact the instructor at the beginning of the semester or when given an assignment for which an accommodation is required. Students with disabilities must verify their eligibility through the Office of Student Accessibility Services (SAS) (672-3391).

Tentative Course Outline

Section	Topic & References	Suggested Problems

Introduction; Common Analysis Methods for Large Data Sets; Large n allows close distinctions to be drawn; Multiple comparisons with a large number of tests (Bonferroni)	Stekel, Chapt. 7-9: Methods for hypothesis testing (GLM) and classification (clustering, PCA, SVM, and ANN); Stekel, p. 133; Speed, p. 63; Draghici, pp. 166, 225
Introduction to Data Anomalies	Pearson, Chapt. 1
Data Pretreatment for Missing Values; Overview of outliers and related anomalies; Univariate outliers; Data pretreatment: outliers	Pearson, Sect. 4.1, 4.2; Pearson, Chapt. 2; Pearson, Chapt. 3; Pearson, Sect. 4.3-4.5; Wu et al. (2009)
Good Data Characterization; Robust Statistics and nonparametric regression	Pearson, Chapt. 5; UMAP Module 626, COMAP
RPDA	Booth (1986), (1982)
Classification by logistic regression; Classification by AGORAS (p > n?); Cross-validation	Pregibon (1981); Kaufman & Rousseeuw; Speed, pp. 107, 125-6, 135
Classification with the BY estimator; Classification by Neural Networks; Time series outliers	Croux and Haesbroeck (2003); Zhu et al. (2007); Connor et al. (1994); Booth et al. (2005)
GSA	Pearson, Chapt. 6
Sampling a Large Data Set	Pearson, Chapt. 7
Open questions	Pearson, Chapt. 8

FINAL EXAM
