STATISTICS/ECON 206
COURSE INFORMATION
POLICIES OUTLINE & CALENDAR FAQ
PAST TESTS (test topics/arrangement may differ in current course)
Fall 2012 T1; T2; T3 ; T4; T5;
Spring 2013 T1; T2; T3; T4; T5
Spring 2013 Grades (weighted for current items only)
RELATED WEBSITES
Java Applets / Coin Flipping Page / American Statistical Association / Journal of Statistics Education /
Dr. Brian Goff/414 Grise Hall
Phone (270)745-3855/brian.goff@wku.edu
Last Modified: March 29, 2013
Western Kentucky University
CONTACT INFORMATION
Dr. Brian Goff Grise 414/745-3855
Email: brian.goff@wku.edu / Office Hours for Spring 2013: TTh 9:30-11:30
(Appointments & drop-ins welcome at other times)
RESOURCES
Hyperstats (online statistics reference by David Lane) and other online links
Excel and SPSS (available on WKUnet)
Website: http://people.wku.edu/brian.goff/e206.htm
OBJECTIVES FOR STUDENTS
To gain a basic and practical understanding of how to collect and analyze data with emphasis on business applications
GRADING POLICIES
Tests 1-5 40% (drop lowest grade)
Test 6 25%
Software Skills Test 10%
Assignments (5) 25%
Total 100% A=90%+; B=80-89%; C=70-79%; D= 60-69%;F below 60%
Tests 1-5: These tests will assess your understanding and ability to use the key statistical concepts and skills developed in this course. You will need a scantron (in good condition), a pencil, and a calculator for each exam. I will drop your lowest score. NO EARLY OR MAKEUP TESTS. If weather (etc.) postpones a test or assignment deadline date, the test or deadline will move to the next class meeting. Material on tests will reflect class lectures, mini-assignments, and assignments. Past tests are provided as aids but not as complete guides to current tests.
Test 6: This is a test over the prior stat concepts tests (and any additional concepts covered after the last one).
Assignments: These are short assignments that make use of Excel and SPSS. See links in course outline for details. You may complete these individually or as part of a 2-person team (the team should turn in one assignment with all names). You may seek help from other students or graduate assistants, but you must not merely copy the work of others. Grades will be either full credit, half credit, or zero. (If you complete these as part of a team, make sure you know how to accomplish the Excel/SPSS tasks yourself. They will be tested on the Software Skills Test.) Assignment due dates are clearly posted. Assignments are due at the beginning of class. Assignments turned in after the beginning of class receive a 10% reduction. Assignments more than 48 hours late receive zero credit.
Software Skills Test: This will be completed in class in the computer lab and will test your ability to complete the data analytic tasks from the assignments using Excel and SPSS.
MISCELLANEOUS & CLASSROOM POLICIES
Last day to drop with a "W" or change to audit is listed on the Academic Affairs website. If you have an ADA covered disability requiring special consideration, please register with the ADA Compliance Office, and then see me.
Classroom Policies: Orderly behavior and respect for others who are speaking (including me) is expected. No computer, cell phone, or other electronic device use is permitted (including texting). No food or drinks without tops are permitted. If late, enter with a minimum of disturbance and be seated in the nearest seat. Behavior that is inappropriate or distracting to other students or to me, including personal conversations, is not permitted. Individuals involved in incidents that significantly violate these policies will receive a warning and then will be notified of a letter grade reduction per subsequent incident.
COURSE OUTLINE & CALENDAR (Some dates/topics subject to change. Check back frequently for updates.)
Note: HS refers to Hyperstats by David Lane; For additional background, see Statsoft and Wikipedia.
Weeks 1-3: Basic Measurement and Management of Data (HS Chapter 1, 2 including links within text)
Data analytics goes mainstream: IBM Commercial Moneyball MIT Sports Analytics Conference
Software: SAS IBM-SPSS
Data, variables, and measurement: data validity and reliability, cross sectional data, time series data, longitudinal study, quantitative data, qualitative data
Statistical notation and frequently used transformations: See SPSS and Excel
Data entry, manipulation, and transformation in software
Graph Basics (histograms, time plots, multi-dimensional GapMinder)
Assignment 1 Data entry and manipulation in Excel & SPSS (Complete in class Thursday January 31 Room 530; otherwise due Thursday Feb 7 )
Test 1, Thursday, February 7
Additional Materials & Tutorials: Wiki on Validity (intro and sections on reliability and construct validity); Data Analysis Add-in in Excel (2007 version)
Data Sets for Section: bp cross bp long sp500 2001 mlb manager long mlb manager spss Flu Time Plot Old Faithful Histogram Google Images for Pie Charts Histograms Time Plots
Random Walk with Apparent Trend
Weeks 4-5: Describing Characteristics of Single Variables (HS Chapter 2)
Statistical Descriptions (mean, median, mode, standard deviation for quantitative and qualitative data)
Graphs and Statistics (histograms, time plots, scatterplots, multi-dimensional)
Empirical Rule, Chebyshev's Theorem, Outliers, Statistical Process Control
Assignment 2 Describing variables with stats, Due Tuesday February 19
Test 2, Thursday, February 21
Additional Materials and Tutorials: Mean, Median, Mode Standard Deviation Data Analysis: Descriptive Stats in Excel Pie Charts Histograms Time Plots GapMinder Graphics
Statistical Process Control (Supplement 1) Excel File Example
Data Sets for Section: bodytemp(SPSS) cancer survival(SPSS) minor leagues(Excel) hot chocolate temp(Excel)
Weeks 6-7: Describing Relationships Between Variables (HS Chapter 3, 15-0, 15-2, 16-3a and b)
Correlation analysis
Regression analysis
Crosstabulation and Contingency analysis
Assignment 3: Describing relationships between variables, Due Tuesday March 5
Test 3, Thursday March 7
Additional Materials & Tutorials: Regression & Correlation Applet DC Times on Causation
Data for Section: Diamond Price Data Titanic Data 5K Times Data Country Health Care Data Affairs Data Advertising Data
Week 8: SPRING BREAK
Weeks 9-10: Basic Ideas of Probability (HS Chapter 4, 5-0, 5-2)
Basic Concepts; Addition Rule; Multiplication Rule;
Expected Value; Law of Large Numbers
Conditional probabilities Probability distributions using Excel
Assignment 4: Probability problems, Due Tuesday March 26
Test 4, Thursday March 28
Additional Materials & Tutorials: Normal Distribution Normal Distribution & Empirical Rule Addition Rule Compound Probability Making Free Throws
Coin Flipping Applet Birthday Problem App Let's Make a Deal Applet Binomial Applet Normal Distribution Applet
Birthday Problem Computations Birthday Problem Explained Bayesian Probability Video
Weeks 11-12: Statistical Tests and Sampling Errors (HS Chapter 6, 7-0, 7-1, 8-0, 8-1, 8-2, 8-10, 9)
Errors in Surveys and Samples: Non-sampling error and sampling error
Sampling error and its distribution (also Central Limit Theorem)
Sampling error and confidence interval estimates
Statistical Tests using p-values; Power of Tests: Big Samples - Small Effects; Small Sample - Insignificant Effects
Assignment 5: Practice with statistical tests, Due Tuesday April 9
Test 5, Thursday April 11
Additional Materials & Tutorials: K Tests and P-values Type I Errors Sampling Distribution Applet
Week 13: Statistical Applications & "Literacy"
Geometric mean (HS 2-2a)
Regression Applications: Statistical analysis of presidential elections -- Ray Fair Yale CU Boulder State Level Gasoline Prices; Market Model for Stock; Simpson's Paradox (omitted variables bias);
Issues in Dealing with Rates: Denominator Choice; Rates of Change of Rates of Change (Percent of Percent)
Looking out for Junk Science
Week 14
Help Session for Skills Test, Tuesday April 23 Room 530
Software Skills Test, Thursday April 25 Room 530
Dropbox Excel file link: Advertising.xlsx
Dropbox SPSS file link: Advertising.sav
Week 15
Test 6, Thursday May 2
Questions:
3a. Briefly describe the data set used in the assignment and measurement of specific variables.
b. Based on your output, what would "typical" kilowatts used per day be? Is the data set symmetric and what are outlying values?
c. Do winter months have higher gas usage than other months? Conduct test where the null hypothesis is that winter month gas usage is the same as other months.
PRETTY GOOD ANSWERS
3. a. The data set consisted of 5 variables related to the monthly electric and gas usage of residential utility customers from July 1990 to June 1998. The data was obtained from an SPSS data file on WKUnet (m:spss/utility). The variables included were a month identifier, average kilowatt hours used per day, gas thermal units used, average daily temperature, and number of days in the month.
b. Based on the histogram and descriptive statistics, typical kilowatt hours used per day ranged from about 30 to 40 per month. The mean was about 34 and the median was 37 with a standard deviation of 3. The data were skewed to the right with a few large outliers well beyond 50 kw hours per day.
c. The means for winter month gas usage (December, January, February) were about 30 percent higher than for other months. Based on a low p-value (0.005), a hypothesis that winter months and other months have the same mean could be rejected with strong confidence.
POOR ANSWERS
3. a. It had some variables about utility customers.
b. 34. A few big outliers. s.d. was 3.
c. winter had higher don't know what p-value means
Why are these poor answers? The information provided is far too little. No detail or sources about the variables used are provided in (a). Incomplete sentences are used in (b) and (c). The pronoun "it" is used without an antecedent in (a). The abbreviation for standard deviation is used in (b) without first being defined. Improper capitalization and punctuation appear in (c). Contractions such as "don't" in (c) are not appropriate in formal reports.
READING SUPPLEMENT 1 -- Statistical Process Control (SPC)
Empirical Rule: a summary of variation in outcomes for data that have a roughly bell-shaped (normal) distribution and a means to identify outliers. The Empirical Rule states: i) about 68 percent of the data will be between +/- 1 standard deviation from the mean, ii) about 95% of the data will be between +/- 2 standard deviations from the mean, and iii) about 99.7% of the data will be between +/- 3 standard deviations from the mean.
Chebyshev's Rule: a summary of variation in outcomes for data regardless of their shape and a means to identify outliers. The rule states: i) at least 75% of the data will be between +/- 2 standard deviations from the mean, ii) at least 89% of the data will be between +/- 3 standard deviations from the mean, and iii) at least 94% of the data will be between +/- 4 standard deviations from the mean.
Standardized Units (z-units, z-values, z-scores): the name given to a variable (X) which has been converted so that the mean is zero (0) and the standard deviation is one (1). This conversion is done by the formula Zi = (Xi - Mean)/Std. Deviation, where i refers to each individual item in the data set. This conversion eliminates whatever units were used to measure X, and it allows each data point to be easily evaluated in terms of how much it differs from the mean.
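For those curious, the standardization formula can be checked with a short Python sketch (not part of the assignments; the data values below are made up, not from any course file):

```python
# Standardize a small, made-up data set: z_i = (x_i - mean) / std. deviation
data = [30, 34, 37, 40, 59]

mean = sum(data) / len(data)  # arithmetic mean
# population standard deviation (divide by n)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5

z_scores = [(x - mean) / std for x in data]

# Standardized data always have mean 0 and standard deviation 1,
# no matter what units the original X was measured in
z_mean = sum(z_scores) / len(z_scores)
z_std = (sum((z - z_mean) ** 2 for z in z_scores) / len(z_scores)) ** 0.5
```

Note that the last value (59) standardizes to a z-score near +1.9, flagging it as the observation farthest from the mean.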
Statistical Process Control (SPC): The phrase and acronym applied to systematic methods of analyzing repetitive production processes using charts that track variation of the process, usually with the goal of monitoring and improving quality. Chapter 18 provides more details.
Process: The forces (inputs) working together to generate the outcomes of a variable; in manufacturing or provision of services these include equipment, tools, materials, people, and "environmental" influences such as weather or other events that determine the characteristics of a good or service being produced; the same kinds of forces generate outcomes in personal settings as well.
Process Variation: changes in the process that lead to quantitative or qualitative differences in the characteristics of a part, a good, or service being produced
Common Causes of (Process) Variation: sources of variation which are inherent (built-in) to the process design as it is currently configured; usual or normal process variation; sources of variation which have the potential to influence all process observations; variation only eliminated through redesign or improvement of design of the process; random variation
Assignable Causes of (Process) Variation: special or specific sources of variation which are not built into the design of the process; variation which does not influence all observations; variation which can be eliminated without altering the basic design
Control Charts: Graphs which record sample measurements of a process -- usually a repetitive process. X-Bar Charts are the simplest and monitor variation in sample means of the process. The chart is used to judge whether the variation in samples is likely due to assignable variation (high variability or patterns) or merely to common (expected) variation. Chapter 18 provides more details.
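As an illustration of the X-Bar chart logic (not a course requirement), the sketch below computes control limits as the grand mean +/- 3 standard errors, following the Empirical Rule; all measurements are made up:

```python
# X-Bar chart limits for a made-up process: 5 samples of n = 4 measurements each
samples = [
    [10.1, 9.9, 10.0, 10.2],
    [10.0, 10.1, 9.8, 10.1],
    [9.9, 10.0, 10.2, 9.9],
    [10.3, 10.0, 10.1, 10.0],
    [9.8, 10.1, 9.9, 10.0],
]

n = len(samples[0])
sample_means = [sum(s) / n for s in samples]
grand_mean = sum(sample_means) / len(sample_means)

# Estimate the process standard deviation from all observations pooled together
all_obs = [x for s in samples for x in s]
overall_mean = sum(all_obs) / len(all_obs)
sigma = (sum((x - overall_mean) ** 2 for x in all_obs) / len(all_obs)) ** 0.5

# Control limits: sample means outside +/- 3 standard errors of the grand mean
# suggest assignable (not merely common) variation
ucl = grand_mean + 3 * sigma / n ** 0.5
lcl = grand_mean - 3 * sigma / n ** 0.5

out_of_control = [m for m in sample_means if m > ucl or m < lcl]
```

Here every sample mean falls inside the limits, so the process shows only common variation.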
ASSIGNMENTS
Assignment 1
Objective: To gain familiarity entering/transforming data and generating output using Excel and SPSS
Note: If using Excel on your own computer, the Analysis Toolpak Add-in must be enabled in order for the Data Analysis option to appear on the Data menu.
Be aware: the commands below are based on Excel 2007. Excel 2010 differs slightly in some respects.
1. Open Excel and Activate the Analysis Toolpak Add-in (see class instructions)
2. Retrieve the file. Click on this Dropbox link: https://www.dropbox.com/s/31lwa7c7x4soes9/e206%20assign%201%20SP500.xlsx
The file contains monthly values for the Standard & Poor's 500 Index from Jan 2010 through August 2012.
3. Create a new variable, labeled PCT Change S&P (percentage changes in the monthly S&P 500 Index)
(To compute PCT Change S&P point and click the cursor on Cell C3. Type the following formula to compute the percent change in the S&P 500 Index:
= 100*(b3 - b2)/b2 (make sure to include the =)
Press Enter. Now, highlight Cell C3 again, right click the mouse, and select "Copy". Select all the cells from C4 to the end of the data by dragging the cursor until the entire column next to the data in Column B is highlighted. Release the mouse button, right click, and select "Paste." You should now have a column of numbers showing monthly percentage changes.
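The same percent-change calculation that the Excel formula performs can be sketched in Python (not required for the assignment; the index values below are made up, standing in for column B of the S&P 500 file):

```python
# Monthly percent change: 100 * (current - previous) / previous,
# which is what the Excel formula =100*(b3-b2)/b2 computes cell by cell
index_values = [1257.6, 1286.1, 1327.2, 1325.8, 1363.6]  # made-up values

pct_change = [
    round(100 * (index_values[i] - index_values[i - 1]) / index_values[i - 1], 1)
    for i in range(1, len(index_values))
]
print(pct_change)  # [2.3, 3.2, -0.1, 2.9]
```

Notice there is one fewer percent change than index values: the first month has no previous month to compare against, which is why the Excel formula starts in Cell C3 rather than C2.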
4. Display the percent change data with 2 decimal places.
Highlight all of the data in Column C. Right click and select "Format Cells." Then select "Number" and set "Decimal places" = 2.
5. Place your name in Cell A34. Print all three columns of data and your name (A1..C34)
I have set up the file so that a chart of the percent change in S&P 500 displays. Do not print it.
(Also, practice generating descriptive statistics tables: >Data>Data Analysis>Descriptive Stats. For input range, select columns B and C. Select boxes for "labels in the first row" and "summary statistics."
Reformat Column Widths so everything displays properly, and change the number of decimal places to 2 for all the numbers.)
6. Access SPSS on WKUnet.
7. Retrieve the file Employee data.sav via Dropbox link: https://dl.dropbox.com/u/100884157/e206%20Employee%20data.sav
(It should open into SPSS; if not, save it to the desktop and open SPSS, then open the file)
The file contains data on 474 employees of a particular company.
(Note: you can see descriptions of the variables by pointing at the variable labels in the top row or by clicking on "Variable View" at the bottom of the spreadsheet).
8. Use Transform>Compute to create a new variable (Totexp) as the sum of Jobtime and Prevexp.
9. Use Transform>Compute to create a new variable (LogSal) as the logarithm of Current Salary, log10(currentsalary).
10. Create basic descriptive statistics for Totexp and Logsal using Analyze>Descriptive Stats.
11. Clean up the output and put your name on it. Include your name next to "Descriptives". Delete the unneeded output (the commands)
12. Print your output.
Staple/Clip your Excel and SPSS output and place your name on the front.
Assignment 2
Objective: Generating and interpreting basic statistics and graphics in Excel and SPSS
Note: When using Excel, the Analysis Toolpak Add-in must be enabled in order for the Data Analysis option to appear on the Data menu.
If you need assistance generating descriptive statistics in Excel, consult the YouTube tutorial.
In Excel:
1. Retrieve the data file: https://www.dropbox.com/s/31lwa7c7x4soes9/e206%20assign%201%20SP500.xlsx
The file contains monthly values for the Standard & Poor's 500 Index from 1957 through August 2008.
2. Compute the monthly percentage change for the S&P 500 Index as we did in Assignment 1.
3. Generate descriptive statistics for the percentage change using the methods shown in class (Data Analysis>Descriptive Statistics>Summary Statistics). Select the output to be shown in a new worksheet.
4. Place a title on the output: Table 1. Descriptive Statistics for S&P 500. Adjust the column widths so everything is readable. Change the number of decimal points displayed to 2.
5. Print the descriptive statistics. (You should be able to explain the specific numerical meaning of the results, not just their general meaning.)
6. On the output, compute (by hand) the standardized value (z-value) for a change in the Index of 10% showing your work
7. On the output, draw a histogram (roughly) based on the descriptive statistics.
In SPSS:
8. Retrieve the data file: https://dl.dropbox.com/u/100884157/e206%20Employee%20data.sav
The file contains salary information on employees listed by their ID
9. Generate descriptive statistics, histogram, and stem-and-leaf plot for Current Salary (Salary)
(Analyze>Descriptive Statistics>Explore. Place Salary and Totexp in the "Dependent List" box. Select the "Plots" button. Select "Histogram" and "Stem-and-leaf." Deselect "Boxplot.")
10. In the output window, change the title of the statistics to Table 1. Employee Salary and the title of the histogram to Figure 1. Employee Salary
11. Delete the programming commands from the output window. (Make sure you know the meaning of the statistics that you generated)
12. Next to the statistics table, compute (by hand) the standardized value (z-value) for an employee with a salary of $51,000
13. Print the output
Staple/Clip your Excel and SPSS output and place your name on the top of the first page
Assignment 3
Objective: Generating regression and crosstabulation results
1. Open the file airfares.xls from Dropbox: airfares.xlsx
The file includes several variables regarding roundtrip air fares to several major cities collected on a particular day for Wed-Wed flights. Variables are defined next to the data.
2. Click Tools>Data Analysis>Regression. Use the pop-up window to generate a regression analysis with Fare as the dependent variable (y-variable) and Distance as the independent variable (x-variable).
*****Remember to click the "Labels" box and also click the "Line Fit Plot" box.
3. Replace the "SUMMARY OUTPUT" title with Table 1: Regression of Air Fare and Distance. Format the output so that only 3 decimal places appear and all columns are readable. You should remove columns beyond Column C.
4. Print the output.
5. Next to the output:
a) Write out the regression output in equation form.
b) What does a 1 mile increase in distance predict for Fare? What about a 100 mile increase?
c) Hypothetically, what would Fare be if distance were 0?
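To see what the regression output represents, the least-squares slope and intercept can be computed directly in Python (the fare/distance pairs below are made up, not from the airfares file):

```python
# Simple least-squares regression: Fare = intercept + slope * Distance
# Made-up (distance in miles, roundtrip fare in dollars) pairs
distance = [200, 450, 700, 1100, 1500]
fare = [120, 180, 230, 330, 430]

n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(fare) / n

# slope = sum of cross-deviations / sum of squared x-deviations
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, fare))
den = sum((x - mean_x) ** 2 for x in distance)
slope = num / den
intercept = mean_y - slope * mean_x

# A 1-mile increase in Distance predicts a Fare change of `slope` dollars;
# a 100-mile increase predicts 100 * slope; Distance = 0 predicts `intercept`.
```

With these made-up numbers the fitted equation is roughly Fare = 70.2 + 0.238 * Distance, so each additional 100 miles predicts about $24 more in fare.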
6. Start SPSS and open the data file bankloan assign 3.sav from Dropbox: bank loan assign 3.sav
The file includes several variables related to bank loans and borrowers including
Default (whether they defaulted on the loan (1) or not (0)) and Debt to Income ratio (DebtIncome) put into one of three categories (low, medium, high).
7. Generate a crosstabulation using Analyze>Descriptive Stats>Crosstabs: place Debtincome as the Row Variable and Default as the Column Variable. Click the "Cells" button and select both "Observed" and "Expected" under Counts. Then select Continue.
8. On the output page replace "Crosstabs" with Table 2: Crosstabulation of Default and Debt to Income Levels. Also, remove the SPSS commands
9. Print the output
10. On the output
a. Identify which cells have an expected count larger than the observed (actual) count.
b. Show the manual calculation of "expected count" for the combination of Default (=1) and high Debt to Income.
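The "expected count" in any crosstab cell is (row total * column total) / grand total. A Python sketch of the calculation, using made-up counts rather than the bank loan file:

```python
# Expected count in a crosstab cell = (row total * column total) / grand total
# Made-up observed counts: rows = Debt to Income level, columns = Default (0 or 1)
observed = {
    ("low", 0): 120, ("low", 1): 10,
    ("medium", 0): 90, ("medium", 1): 30,
    ("high", 0): 50, ("high", 1): 40,
}

grand_total = sum(observed.values())                       # all cases
row_high = observed[("high", 0)] + observed[("high", 1)]   # row total for "high"
col_default = sum(v for (_, c), v in observed.items() if c == 1)  # column total

# Expected count for the (high, Default = 1) cell if the two variables
# were unrelated (independent)
expected_high_default = row_high * col_default / grand_total
print(round(expected_high_default, 1))  # 21.2
```

Here the observed count (40) is well above the expected count (21.2), which is the kind of gap that suggests default and debt-to-income level are related.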
Staple/Clip your Excel and SPSS output and place your name on the front
Assignment 4
Objective: Practice computing and understanding probability distributions & calculations
1. Suppose that the average amount of time that customers are put on hold by customer support is 8 minutes with a standard deviation of 1.5 minutes. If wait time is normally distributed, calculate the probability of waiting less than 10 minutes.
a. In Excel, highlight cell A1, click the fx icon on the tool bar; select Function Category = statistical; Function name = NORMDIST; fill in the following values in the blanks:
X = 10; Mean = 8; Standard_Dev = 1.5; Cumulative = True.
b. Calculate the probability of waiting more than 10 minutes and enter this figure into cell A2.
c. After you print the Excel worksheet below, draw a graphic that illustrates the problem that you just solved.
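The NORMDIST calculation in problem 1 can be reproduced in Python using the standard library's error function (an illustration only, not a substitute for the Excel steps):

```python
import math

# P(X < 10) for X ~ Normal(mean = 8, sd = 1.5),
# matching Excel's NORMDIST(10, 8, 1.5, TRUE)
def normal_cdf(x, mean, sd):
    """Cumulative normal probability computed from the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

p_less_than_10 = normal_cdf(10, 8, 1.5)   # about 0.91
p_more_than_10 = 1 - p_less_than_10       # complement rule, for part (b)
```

A wait of 10 minutes is z = (10 - 8)/1.5, about 1.33 standard deviations above the mean, so roughly 91% of customers wait less than 10 minutes.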
2. Suppose that data entry has a 1% (0.01) probability of an error on any given entry, and that data entry errors are a binomial variable (error or no error). Compute the likelihood of zero errors given 100 entries.
a. In Excel, highlight cell A5, click the fx icon on tool bar; select Function Category = statistical; Function name = BINOMDIST; fill in the following values in the blanks:
Number_s = 0, Trials = 100, Probability_s = 0.01, Cumulative = True.
b. Repeat the procedure above (except change Cumulative = False) for the probability of 1 error and 2 errors (using cells A6 and A7).
c. Calculate the sum of the probabilities of 1 or 2 errors and place this sum in cell B4.
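The BINOMDIST values in problem 2 can be checked in Python with the binomial formula (again an illustration only; n = 100 entries with a 1% error rate, per the problem statement):

```python
import math

# Binomial probability of exactly k successes in n trials,
# matching Excel's BINOMDIST(k, n, p, FALSE)
def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.01   # 100 entries, 1% error probability per entry
p0 = binom_pmf(0, n, p)   # zero errors: about 0.366
p1 = binom_pmf(1, n, p)   # exactly 1 error
p2 = binom_pmf(2, n, p)   # exactly 2 errors
p1_or_2 = p1 + p2         # sum for part (c)
```

Even with only a 1% error rate per entry, a batch of 100 entries is error-free only about 37% of the time, a good example of how small per-trial risks accumulate.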
3. Suppose that on average 10 customers arrive at the returns counter each hour and that arrivals follow a Poisson distribution. Compute the likelihood of 15 or fewer customers arriving in an hour at the desk:
a. In Excel, highlight cell A10, click the fx icon on the tool bar; Function Category = statistical; Function name = POISSON; fill in the following values in the blanks:
X = 15; Mean = 10; Cumulative = True
b. From this value, compute the probability of more than 15 customers arriving in an hour, and enter this value in cell A11.
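Problem 3's POISSON calculation can likewise be reproduced in Python by summing the Poisson probabilities from 0 through 15 (illustration only):

```python
import math

# Poisson cumulative probability of k or fewer arrivals,
# matching Excel's POISSON(k, mean, TRUE)
def poisson_cdf(k, mean):
    return sum(mean ** i * math.exp(-mean) / math.factorial(i)
               for i in range(k + 1))

p_15_or_fewer = poisson_cdf(15, 10)     # P(X <= 15) with mean 10/hour, ~0.95
p_more_than_15 = 1 - p_15_or_fewer      # complement rule, for part (b)
```

So in only about 1 hour in 20 would more than 15 customers show up at the returns counter.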
4. Print the worksheet (and remember to complete 1c)
Assignment 5
Objective: Practice computing and understanding standard errors, interval estimates, and hypothesis tests
1. Open the SPSS file Market Sales Data from Dropbox: https://www.dropbox.com/s/f4qoqf6ojalzygs/Market%20Sales%20Data.sav
(You may need to save the file to the desktop first. You must use a campus computer with SPSS)
The file includes several variables on quantity of sales for a national chain. The variables are:
Market ID: Numeric ID for 1 of 10 regional market definitions
MktSize: Numeric ID for size of market (1=small, 2=medium, 3=large)
LocID: Numeric ID for specific store
AgeLoc: Age of store in years
Promo: Numeric ID for 1 of 3 different promotions
Week: Week of observation (1-4)
Sales: Number of units sold in thousands
2. i) Create descriptive statistics including a 95% confidence interval for Sales
(Select Analyze>Descriptive Stats>Explore. In the dialog box, put Sales into "Dependent List," select the "Statistics" button under "Display," then select "OK.")
ii) Print this output with Title: Table 1. Descriptive Stats for Sales (in thousands)
iii) Below the output, using a simple number line, show what happens if you increase the Confidence Interval to 99%
(To derive the numbers for the 99% CI, you can make the computation by hand using a t-value of 2.5 or by repeating the prior step, and in the dialog box, Select the "Statistics" button and change 95% to 99%)
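The confidence interval arithmetic behind step 2 can be sketched in Python (the summary numbers below are made up, standing in for the Sales output; 2.0 and 2.6 are large-sample approximations of the t-values):

```python
# Confidence interval for a mean: mean +/- t * (s / sqrt(n))
# Made-up summary statistics standing in for the Sales variable
mean, s, n = 52.4, 16.8, 560

se = s / n ** 0.5     # standard error of the mean
t95, t99 = 2.0, 2.6   # approximate large-sample t-values for 95% and 99%

ci95 = (mean - t95 * se, mean + t95 * se)
ci99 = (mean - t99 * se, mean + t99 * se)

# Raising the confidence level widens the interval: to be more confident
# the interval contains the true mean, it must cover a wider range
width95 = ci95[1] - ci95[0]
width99 = ci99[1] - ci99[0]
```

This is exactly what the number line in part (iii) should show: the 99% interval contains the 95% interval, with extra width on each side.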
3. (i) Test the null hypothesis that the average number of units sold (over all possible weeks and stores) is equal to 50 (and the alternative that it is not equal to 50) using the p-value method (SPSS labels 2-tailed p-values as "Sig. (2-tailed)") and with a cutoff significance level of 0.05.
(Select Analyze>Compare Means>One-Sample t-test. Put Sales as "Test Variable" and 50 as "Test Value.")
ii) Print this output with Title: Table 2. Test of Average Sales = 50 thousand
iii) Under the output, write out the null and alternative hypotheses and explain whether you would accept or reject the null hypothesis
iv) What would you do with the reported p-value to turn this into a 1-tailed test that the mean was 50 or greater (null)?
4. (i) Test the null hypothesis that the average number of units sold in weeks 1 and 2 is the same as in weeks 3 and 4 (the alternative is that they are not equal) using the p-value approach with a cutoff significance level of 0.05.
(Select Analyze>Compare Means>Independent Samples t-test. Put Sales as "Test Variables" and Week as "Grouping Variable." Select Define Groups, Select "Cut Point" and put "3" in the blank (this tests 1,2 versus 3,4). Select Continue and OK.)
ii) Print this output with Title: Table 3. Test of Average Sales Week 1-2 = Week 3-4.
iii) Under the output, write out the null and alternative hypotheses and explain whether you would accept or reject the null hypothesis
5. i) Run a regression with Sales as the Dependent Variable and Age of Store as the Independent Variable to test the null hypothesis that the slope coefficient equals 0 (alternative that it does not equal zero).
(Select Analyze>Regression>Linear. The 2-tailed p-value is labeled "Sig")
ii) Print this output with Title: Table 4. Test of Slope Coefficient for Age of Store
iii) Under the output, write out the null and alternative hypotheses and explain whether you would accept or reject the null hypothesis
Staple/Clip output and place your name on the front
Additional Questions/Project Worksheet
5. Write a 1-page cover sheet report in the following format:
Deadline = Wednesday, April 21 (beginning of class)