
BAD 64020 Summer 2010 Booth

BUSINESS ADMINISTRATION

ADVANCED STATISTICAL MODELS

BAD 64020

Topic: Data Mining

Summer 2010

 

Instructor:              Dr. David Booth

Office:                   A428 BSA

Phone (Office):     672-1143

Office Hours:        MT 2-4:30 pm and by appointment

e-mail: dbooth@kent.edu

If I am not in my office at these times, you will find a note on my door telling you where I am; please come to that location to see me. Feel free to call me or leave a note in my mailbox if you need to contact me.

 

Textbook: Pearson, R. K., Mining Imperfect Data, SIAM (2005)

                  Booth, D. E., UMAP Module 626 (1985)

                  Croux & Haesbroeck (2003), Comp. Stat. & Data Analysis 44, 273-295

                  Pregibon (1981), Annals of Stat. 9, 705-724

                  Seis, G., ED230B/C: Regression with clustered data

                  Booth, D. E. (1982), Decision Sci. 13, 71-81

                  Booth, D. et al. (1986), Computers & Biomed. Res. 19, 1-12

                  Zhu et al. (2007), Current Analyt. Chem. 3, 233-237

                  Booth et al. (2005), Current Analyt. Chem. 1, 181-186

                  Connor et al. (1994), IEEE Trans. on Neural Networks 5, 240-254

                  Kaufman & Rousseeuw, Finding Groups in Data, Wiley (1990) (R package cluster)

                  Draghici (2003), Data Analysis Tools for DNA Microarrays, Chapman & Hall

                  Speed, T. (2003), Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall

                  Stekel, D. (2003), Microarray Bioinformatics, Cambridge

                  Wu et al. (2009), Biotechniques 47(2), 691-700

                  Hauser & Booth (2010), J. Data Sci., in press

                  Bianco & Martinez (2009), Comp. Stat. & Data Analysis 53, 4095-4105

                  Lee et al. (2005), Expert Systems with Applications 29, 1-16

                  R Handout

                  SAS Handout

 

Course objectives:

 

At the end of the course the student will have:

0)      Learned some common methods of data analysis for large data sets

1)      Learned the major causes of problems in large data sets (Missing Values, Outliers)

2)      Learned the major methods of dealing with the problems in 1): imputation, trimming, and other robust methods.

3)      Gained experience with one or more of the methods in 2) with a large data set, using either SAS or R.

These skills will prepare you for more advanced work in your major, either in college or on the job.
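
As a small illustration of two of the methods named in objective 2), the sketch below fills a missing value by median imputation and then compares an ordinary mean with a trimmed mean. It is only a sketch under stated assumptions: the course work itself is done in SAS or R, Python is used here purely for illustration, and the function names and the toy data are hypothetical rather than taken from the course materials.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None or NaN) with the median of the observed ones."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    fill = statistics.median(observed)
    return [fill if (v is None or math.isnan(v)) else v for v in values]


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the given proportion of points from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of points cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)


# Hypothetical toy data: one missing value and one gross outlier (250.0).
raw = [10.2, 9.8, None, 10.5, 250.0, 10.1, 9.9, 10.3, 10.0, 9.7]

filled = impute_median(raw)
print(filled[2])                 # the imputed value (median of the observed data)
print(statistics.fmean(filled))  # ordinary mean, pulled upward by the outlier
print(trimmed_mean(filled))      # 10% trimmed mean, close to the bulk of the data
```

The median is used for the fill value because, unlike the mean, it is not distorted by the outlier; the trimmed mean then shows how discarding the extreme tails gives a location estimate that reflects the bulk of the data.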

 

Attendance and Make-up Policy:

 

In general, students are expected to attend class and are responsible for any material discussed and/or assigned. As a general policy, missed work (including exams) may not be made up, and no late work will be accepted. The only exceptions are:

1)      A prearranged situation (e.g., course field trips, athletic trip, etc.)

2)      Emergency illness, death in the family, etc.; in such cases the instructor should be notified as soon as possible.

3)      Contact the instructor early

 

Performance Evaluation:

 

There will be one hour examination, worth 100 points, and class assignments to be turned in. The exam will be open book and open notes.

There will be a term paper applying one or more of the methods discussed under objective 2) to a large data set, together with an appropriate analysis method (e.g., neural network, cluster analysis) chosen in consultation with the instructor. The topic and data set must be approved by the instructor; get approval early. The method must be implemented in SAS or R if possible (discuss with the instructor) using a real data set, and a complete journal-style paper must be submitted. The paper will be worth 100 points.

Academic dishonesty, in all forms, is prohibited. All material handed in is in the public domain. This syllabus is a guide, not an absolute contract. The grading scale is 90+ A, 80+ B, etc.

 

Students with Disabilities:

 

In accordance with University policy, if you have a documented disability and require accommodations to obtain equal access in this course, please contact the instructor at the beginning of the semester or when given an assignment for which an accommodation is required. Students with disabilities must verify their eligibility through the Office of Student Accessibility Services (SAS) (672-3391).

 

 

Tentative Course Outline

 

Section                                                           Topic & References                                Suggested Problems

Introduction
Common Analysis Methods for Large Data Sets                       Stekel, Chapt. 7-9: Methods for hypothesis testing (GLM) and classification (clustering, PCA, SVM and ANN)
Large n allows close distinctions to be drawn
Multiple comparisons with a large number of tests (Bonferroni)    Stekel, p. 133; Speed, p. 63; Draghici, pp. 166, 225

Introduction to Data Anomalies                                    Pearson, Chapt. 1

Data Pretreatment for Missing Values                              Pearson, Sect. 4.1, 4.2
Overview of outliers and related anomalies                        Pearson, Chapt. 2
Univariate outliers                                               Pearson, Chapt. 3
Data pretreatment: outliers                                       Pearson, Sect. 4.3-4.5; Wu et al. (2009)

Good Data Characterization                                        Pearson, Chapt. 5
Robust Statistics and nonparametric regression                    UMAP Module 626, COMAP

RPDA                                                              Booth et al. (1986); Booth (1982)

Classification by logistic regression                             Pregibon (1981)
Classification by AGORAS (p>n?)                                   Kaufman & Rousseeuw
Cross-validation                                                  Speed, pp. 107, 125-6, 135

Classification with the BY estimator                              Croux and Haesbroeck (2003)
Classification by Neural Networks                                 Zhu et al. (2007)
Time series outliers                                              Connor et al. (1994); Booth et al. (2005)

GSA                                                               Pearson, Chapt. 6

Sampling a large Data set                                         Pearson, Chapt. 7

Open questions                                                    Pearson, Chapt. 8

FINAL EXAM