POLICIES OUTLINE & CALENDAR FAQ
PAST TESTS (test topics/arrangement may differ in current course)
Fall 2012 T1; T2; T3; T4; T5
Spring 2013 T1; T2; T3; T4; T5
Spring 2013 Grades (weighted for current items only)
Java Applets / Coin Flipping Page / American Statistical Association / Journal of Statistics Education /
Dr. Brian Goff/414 Grise Hall
Last Modified: March 29, 2013
Western Kentucky University
Dr. Brian Goff Grise 414/745-3855
Email: firstname.lastname@example.org / Office Hours for Spring 2013: TTh 9:30-11:30
(Appointments & drop-ins welcome at other times)
Hyperstats (online statistics reference by David Lane) and other online links
Excel and SPSS (available on WKUnet)
OBJECTIVES FOR STUDENTS
To gain a basic and practical understanding of how to collect and analyze data with emphasis on business applications
Tests 1-5 40% (drop lowest grade)
Test 6 25%
Software Skills Test 10%
Assignments (5) 25%
Total 100%. A = 90%+; B = 80-89%; C = 70-79%; D = 60-69%; F = below 60%
Tests 1-5: These tests will assess your understanding of and ability to use the key statistical concepts and skills developed in this course. You will need a scantron (in good condition), a pencil, and a calculator for each exam. I will drop your lowest score. NO EARLY OR MAKEUP TESTS. If weather (etc.) postpones a test or assignment deadline, the test or deadline will move to the next class meeting. Material on tests will reflect class lectures, mini-assignments, and assignments. Past tests are provided as aids but not as complete guides to current tests.
Test 6: This is a comprehensive test covering the material from the prior stat concepts tests (and any additional concepts covered after the last one).
Assignments: These short assignments make use of Excel and SPSS. See links in the course outline for details. You may complete these individually or as part of a 2-person team (a team should turn in one assignment with all names). You may seek help from other students or graduate assistants, but you must not merely copy the work of others. Grades will be full credit, half credit, or zero. (If you complete these as part of a team, make sure you know how to accomplish the Excel/SPSS tasks yourself; they will be tested on the Software Skills Test.) Assignment due dates are clearly posted. Assignments are due at the beginning of class. Assignments turned in after the beginning of class receive a 10% reduction. Assignments more than 48 hours late receive zero credit.
Software Skills Test: This test will be completed in class in the computer lab and will assess your ability to complete the data analysis tasks from the assignments using Excel and SPSS.
MISCELLANEOUS & CLASSROOM POLICIES
Last day to drop with a "W" or change to audit is listed on the Academic Affairs website. If you have an ADA covered disability requiring special consideration, please register with the ADA Compliance Office, and then see me.
Classroom Policies: Orderly behavior and respect for others who are speaking (including me) is expected. No computer, cell phone, or other electronic device use is permitted (including texting). No food or drinks without tops are permitted. If late, enter with a minimum of disturbance and be seated in the nearest seat. Behavior that is inappropriate or distracting to other students or to me, including personal conversations, is not permitted. Individuals involved in incidents that significantly violate these policies will receive a warning and then will be notified of a letter-grade reduction per subsequent incident.
COURSE OUTLINE & CALENDAR (Some dates/topics subject to change. Check back frequently for updates)
Note: HS refers to Hyperstats by David Lane; For additional background, see Statsoft and Wikipedia.
Weeks 1-3: Basic Measurement and Management of Data (HS Chapter 1, 2 including links within text)
Data analytics goes mainstream: IBM Commercial; Moneyball; MIT Sports Analytics Conference
Software: SAS IBM-SPSS
Data, variables, and measurement: data validity and reliability, cross sectional data, time series data, longitudinal study, quantitative data, qualitative data
Statistical notation and frequently used transformations: See SPSS and Excel
Data entry, manipulation, and transformation in software
Graph Basics (histograms, time plots, multi-dimensional GapMinder)
Assignment 1 Data entry and manipulation in Excel & SPSS (Complete in class Thursday January 31 Room 530; otherwise due Thursday Feb 7 )
Test 1, Thursday, February 7
Additional Materials & Tutorials: Wiki on Validity (intro and sections on reliability and construct validity); Data Analysis Add-in in Excel (2007 version)
Data Sets for Section: bp cross; bp long; sp500 2001; mlb manager long; mlb manager spss; Flu Time Plot; Old Faithful Histogram; Google Images for Pie Charts, Histograms, Time Plots
Random Walk with Apparent Trend
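The "Random Walk with Apparent Trend" link above demonstrates how purely random steps can look trended to the eye. A minimal sketch of the same idea (the step size, walk length, and seed below are arbitrary choices, not from the course materials):

```python
import random

def random_walk(n_steps, seed=42):
    """Cumulative sum of fair +1/-1 coin-flip steps.

    Each step has expected value 0, yet long stretches of the path
    often drift up or down, which the eye reads as a 'trend'.
    """
    rng = random.Random(seed)
    position = 0
    path = []
    for _ in range(n_steps):
        position += rng.choice([-1, 1])
        path.append(position)
    return path

walk = random_walk(500)
print("final position:", walk[-1])
print("max drift from zero:", max(abs(p) for p in walk))
```

Plotting such a path as a time plot (as in the Graph Basics topic above) makes the apparent trend easy to see.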
Weeks 4-5: Describing Characteristics of Single Variables (HS Chapter 2)
Statistical Descriptions (mean, median, mode, standard deviation for quantitative and qualitative data)
Graphs and Statistics (histograms, time plots, scatterplots, multi-dimensional)
Empirical Rule, Chebyshev's Theorem, Outliers, Statistical Process Control
Assignment 2 Describing variables with stats, Due Tuesday February 19
Test 2, Thursday, February 21
Additional Materials and Tutorials: Mean, Median, Mode; Standard Deviation; Data Analysis: Descriptive Stats in Excel; Pie Charts; Histograms; Time Plots; GapMinder Graphics
Statistical Process Control ( Supplement 1 ) Excel File Example
Data Sets for Section: bodytemp(SPSS) cancer survival(SPSS) minor leagues(Excel) hot chocolate temp(Excel)
Weeks 6-7: Describing Relationships Between Variables (HS Chapter 3, 15-0, 15-2, 16-3a and b)
Crosstabulation and Contingency analysis
Assignment 3: Describing relationships between variables, Due Tuesday March 5
Test 3, Thursday March 7
Additional Materials & Tutorials: Regression & Correlation Applet DC Times on Causation
Data for Section: Diamond Price Data Titanic Data 5K Times Data Country Health Care Data Affairs Data Advertising Data
Week 8: SPRING BREAK
Weeks 9-10: Basic Ideas of Probability (HS Chapter 4, 5-0, 5-2)
Basic Concepts; Addition Rule; Multiplication Rule;
Expected Value; Law of Large Numbers
Conditional probabilities; Probability distributions using Excel
Assignment 4: Probability problems, Due Tuesday March 26
Test 4, Thursday March 28
Additional Materials & Tutorials: Normal Distribution; Normal Distribution & Empirical Rule; Addition Rule; Compound Probability; Making Free Throws
Coin Flipping Applet; Birthday Problem App; Let's Make a Deal Applet; Binomial Applet; Normal Distribution Applet
Birthday Problem Computations; Birthday Problem Explained; Bayesian Probability Video
Weeks 11-12: Statistical Tests and Sampling Errors (HS Chapter 6, 7-0, 7-1, 8-0, 8-1, 8-2, 8-10, 9)
Errors in Surveys and Samples: Non-sampling error and sampling error
Sampling error and its distribution (also Central Limit Theorem)
Sampling error and confidence interval estimates
Statistical Tests using p-values; Power of Tests: Big Samples - Small Effects; Small Sample - Insignificant Effects
Assignment 5: Practice with statistical tests, Due Tuesday April 9
Test 5, Thursday April 11
Additional Materials & Tutorials: K Tests and P-values Type I Errors Sampling Distribution Applet
Week 13: Statistical Applications & "Literacy"
Geometric mean (HS 2-2a)
Regression Applications: Statistical analysis of presidential elections -- Ray Fair Yale CU Boulder State Level Gasoline Prices; Market Model for Stock; Simpson's Paradox (omitted variables bias);
Issues in Dealing with Rates: Denominator Choice; Rates of Change of Rates of Change (Percent of Percent)
Looking out for Junk Science
Help Session for Skills Test, Tuesday April 23 Room 530
Software Skills Test, Thursday April 25 Room 530
Dropbox Excel file link: Advertising.xlsx
Dropbox SPSS file link: Advertising.sav
Test 6, Thursday May 2
Tips for Assignments
3a. Briefly describe the data set used in the assignment and measurement of specific variables.
b. Based on your output, what would "typical" kilowatts used per day be? Is the data set symmetric and what are outlying values?
c. Do winter months have higher gas usage than other months? Conduct test where the null hypothesis is that winter month gas usage is the same as other months.
PRETTY GOOD ANSWERS
3. a. The data set consisted of 5 variables related to the monthly electric and gas usage of residential utility customers from July 1990 to June 1998. The data was obtained from an SPSS data file on WKUnet (m:spss/utility). The variables included were a month identifier, average kilowatt hours used per day, gas thermal units used, average daily temperature, and number of days in the month.
b. Based on the histogram and descriptive statistics, typical kilowatt hours used per day ranged from about 30 to 40 per month. The mean was about 34 and the median was 37 with a standard deviation of 3. The data were skewed to the right with a few large outliers well beyond 50 kw hours per day.
c. The means for winter month gas usage (December, January, February) were about 30 percent higher than for other months. Based on a low p-value (0.005), a hypothesis that winter months and other months have the same mean could be rejected with strong confidence.
3. a. It had some variables about utility customers.
b. 34. A few big outliers. s.d. was 3.
c. winter had higher don't know what p-value means
Why are these poor answers? The information provided is far too little. No detail or sources about the variables used are provided in (a). Incomplete sentences are used in (b) and (c). The pronoun "it" is used without an antecedent in (a). An abbreviation for standard deviation is used in (b) without first being defined. Improper capitalization and punctuation appear in (c). Contractions such as "don't" in (c) are not appropriate in formal reports.
READING SUPPLEMENT 1 -- Statistical Process Control (SPC)
Empirical Rule: a summary of variation in outcomes for data that have a roughly bell-shaped (normal) distribution and a means to identify outliers. The Empirical Rule states: i) about 68% of the data will fall within +/- 1 standard deviation of the mean, ii) about 95% of the data will fall within +/- 2 standard deviations of the mean, and iii) about 99.7% of the data will fall within +/- 3 standard deviations of the mean.
Chebyshev's Rule: a summary of variation in outcomes for data regardless of their shape and a means to identify outliers. The rule states: i) at least 75% of the data will fall within +/- 2 standard deviations of the mean, ii) at least 89% within +/- 3 standard deviations, and iii) at least 94% within +/- 4 standard deviations.
Standardized Units (z-units, z-values, z-scores): the name given to a variable (X) that has been converted so that its mean is zero (0) and its standard deviation is one (1). The conversion uses the formula Zi = (Xi - Mean)/Std. Deviation, where i refers to each individual item in the data set. This conversion eliminates whatever units were used to measure X and allows each data point to be easily evaluated in terms of how much it differs from the mean.
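The standardization formula above, together with the Empirical Rule's outlier idea, can be illustrated with a short calculation (the data values below are hypothetical):

```python
import statistics

data = [30, 32, 33, 34, 35, 36, 37, 38, 40, 55]  # hypothetical kWh/day values
mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

# Standardize each observation: z = (x - mean) / sd
z_scores = [(x - mean) / sd for x in data]

# Flag potential outliers: |z| > 2 is unusual under the Empirical Rule
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 2]
print("mean:", mean, "sd:", round(sd, 2), "outliers:", outliers)
```

After standardizing, the z-scores themselves have mean 0 and standard deviation 1, so any data set can be judged on the same unit-free scale.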
Statistical Process Control (SPC): The phrase and acronym applied to systematic methods of analyzing repetitive production processes using charts that track variation of the process, usually with the goal of monitoring and improving quality. Chapter 18 provides more details.
Process: The forces (inputs) working together to generate the outcomes of a variable; in manufacturing or the provision of services these include equipment, tools, materials, people, and "environmental" influences such as weather or other events that determine the characteristics of a good or service being produced; the same kinds of forces generate outcomes in personal settings as well.
Process Variation: changes in the process that lead to quantitative or qualitative differences in the characteristics of a part, a good, or service being produced
Common Causes of (Process) Variation: sources of variation which are inherent (built-in) to the process design as it is currently configured; usual or normal process variation; sources of variation which have the potential to influence all process observations; variation only eliminated through redesign or improvement of design of the process; random variation
Assignable Causes of (Process) Variation: special or specific sources of variation which are not built into the design of the process; variation which does not influence all observations; variation which can be eliminated without altering the basic design
Control Charts: Graphs that record sample measurements of a process -- usually a repetitive process. X-Bar Charts are the simplest and monitor variation in sample means of the process. The chart is used to judge whether the variation in samples is likely due to assignable variation (high variability or patterns) or merely to common (expected) variation.
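The X-Bar chart idea above can be sketched numerically: compute the grand mean of the sample means and flag any sample mean falling outside +/- 3 standard errors. The sample means below are hypothetical, and using the standard deviation of the sample means directly as the standard error is a simplifying assumption:

```python
import statistics

# Hypothetical sample means from a repetitive process (one mean per sample)
sample_means = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7]

grand_mean = statistics.mean(sample_means)
# Simplification: estimate the standard error of the means directly
# from the spread of the sample means themselves
se = statistics.stdev(sample_means)

ucl = grand_mean + 3 * se  # upper control limit
lcl = grand_mean - 3 * se  # lower control limit

# Sample means outside the limits suggest assignable-cause variation
out_of_control = [m for m in sample_means if m > ucl or m < lcl]
print(f"UCL={ucl:.2f}, LCL={lcl:.2f}, signals={out_of_control}")
```

Here every sample mean stays inside the limits, so the variation looks like common-cause (expected) variation only.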
Objective: To gain familiarity entering/transforming data and generating output using Excel and SPSS
Note: If using Excel on your own computer, the Analysis Toolpak Add-in must be enabled in order for the Data Analysis option to appear on the Data menu.
Be aware: the commands below are based on Excel 2007. Excel 2010 differs slightly in some respects
1. Open Excel and Activate the Analysis Toolpak Add-in (see class instructions)
2. Retrieve the file. Click on this Dropbox link: https://www.dropbox.com/s/31lwa7c7x4soes9/e206%20assign%201%20SP500.xlsx
The file contains monthly values for the monthly Standard & Poor's 500 Index from Jan 2010 through August 2012.
3. Create a new variable, labeled PCT Change S&P (percentage changes in the monthly S&P 500 Index)
(To compute PCT Change S&P point and click the cursor on Cell C3. Type the following formula to compute the percent change in the S&P 500 Index:
= 100*(b3 - b2)/b2 (make sure to include the =)
Press Enter. Now highlight Cell C3 again, right-click the mouse, and select "Copy". Select all the cells from C4 to the end of the data by dragging the cursor until the entire column next to the data
in Column B is highlighted. Release the left mouse button, then right-click the mouse and select "Paste." You should now have a column of numbers showing monthly percentage changes.
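As a cross-check on step 3, the same percent-change formula can be computed outside Excel (the index values below are made up for illustration, not the actual S&P 500 data):

```python
# Hypothetical monthly index levels
index = [1100.0, 1122.0, 1099.56, 1150.0]

# Month-over-month percent change, mirroring Excel's =100*(B3-B2)/B2
pct_change = [100 * (curr - prev) / prev
              for prev, curr in zip(index, index[1:])]
print([round(p, 1) for p in pct_change])  # → [2.0, -2.0, 4.6]
```

Note there is always one fewer percent-change value than index value, which is why the Excel formula starts in row 3 rather than row 2.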
4. Reduce the percent change data to 2 decimal places.
Highlight all of the data in column C. Right-click and select "Format Cells." Then select "Number" and set "decimal places" = 2.
5. Place your name in Cell A34. Print all three columns of data and your name (A1..C34)
I have set up the file so that a chart of the percent change in the S&P 500 displays. Do not print it.
(Also, practice generating descriptive statistics tables: >Data>Data Analysis>Descriptive Stats. For input range, select columns B and C. Select boxes for "labels in the first row" and "summary statistics."
Reformat Column Widths so everything displays properly, and change the number of decimal places to 2 for all the numbers.)
6. Access SPSS on WKUnet.
7. Retrieve the file Employee data.sav via Dropbox link: https://dl.dropbox.com/u/100884157/e206%20Employee%20data.sav
(It should open into SPSS; if not, save it to the desktop and open SPSS, then open the file)
The file contains data on 474 employees of a particular company.
(Note: you can see descriptions of the variables by pointing at the variable labels in the top row or by clicking on "Variable View" at the bottom of the spreadsheet).
8. Use Transform>Compute to create a new variable (Totexp) as the sum of Jobtime and Prevexp.
9. Use Transform>Compute to create a new variable (LogSal) as the logarithm of Current Salary, log10(currentsalary).
10. Create basic descriptive statistics for Totexp and Logsal using Analyze>Descriptive Stats.
11. Clean up the output and put your name on it. Include your name next to "Descriptives". Delete the unneeded output (the commands)
12. Print your output.
Staple/Clip your Excel and SPSS output and place your name on the front.
Objective: Generating and interpreting basic statistics and graphics in Excel and SPSS
Note: When using Excel, the Analysis Toolpak Add-in must be enabled in order for the Data Analysis option to appear on the Data menu.
If you need assistance generating descriptive statistics in Excel, consult the YouTube tutorial.
1. Retrieve the data file: https://www.dropbox.com/s/31lwa7c7x4soes9/e206%20assign%201%20SP500.xlsx
The file contains monthly values for the Standard & Poor's 500 Index from 1957 through August 2008.
2. Compute the monthly percentage change for the S&P 500 Index as we did in Assignment 1.
3. Generate descriptive statistics for the percentage change using the methods shown in class (Data Analysis>Descriptive Statistics>Summary Statistics). Select the output to be shown in a new worksheet.
4. Place a title on the output: Table 1. Descriptive Statistics for S&P 500. Adjust the column widths so everything is readable. Change the number of decimal points displayed to 2.
5. Print the descriptive statistics. (You should be able to explain the specific meaning of the results -- the specific numbers, not just their general meaning.)
6. On the output, compute (by hand) the standardized value (z-value) for a change in the Index of 10% showing your work
7. On the output, draw a histogram (roughly) based on the descriptive statistics.
8. Retrieve the data file: https://dl.dropbox.com/u/100884157/e206%20Employee%20data.sav
The file contains salary information on employees listed by their ID
9. Generate descriptive statistics, histogram, and stem-and-leaf plot for Current Salary (Salary)
(Analyze>Descriptive Statistics>Explore. Place Salary and Totexp in the "Dependent List" box. Select the "Plots" button. Select "Histogram" and "Stem-and-leaf." Deselect "Boxplot.")
10. In the output window, change the title of the statistics to Table 1. Employee Salary and the title of the histogram to Figure 1. Employee Salary
11. Delete the programming commands from the output window. (Make sure you know the meaning of the statistics that you generated)
12. Next to the statistics table, compute (by hand) the standardized value (z-value) for an employee with a salary of $51,000
13. Print the output
Staple/Clip your Excel and SPSS output and place your name on the top of the first page
Objective: Generating regression and crosstabulation results
1. Open the file airfares.xls from Dropbox: airfares.xlsx
The file includes several variables regarding roundtrip air fares to several major cities collected on a particular day for Wed-Wed flights. Variables are defined next to the data.
2. Click Tools>Data Analysis>Regression. Use the pop-up window to generate a regression analysis with Fare as the dependent variable (y-variable) and Distance as the independent variable (x-variable).
*****Remember to click the "Labels" box and also click the "Line Fit Plot"box.
3. Replace the "SUMMARY OUTPUT" title with Table 1: Regression of Air Fare and Distance. Format the output so that only 3 decimal places appear and all columns are readable. You should remove columns beyond column C.
4. Print the output.
5. Next to the output:
a) Write out the regression output in equation form.
b) What does a 1 mile increase in distance predict for Fare? What about a 100 mile increase?
c) Hypothetically, what would Fare be if distance were 0?
6. Start SPSS and open the data file bankloan assign 3.sav from Dropbox: bank loan assign 3.sav
The file includes several variables related to bank loans and borrowers including
Default (whether they defaulted on the loan (1) or not (0)) and Debt to Income ratio (DebtIncome) put into one of three categories (low, medium, high).
7. Generate a crosstabulation using Analyze>Descriptive Stats>Crosstabs: place Debtincome as the Row Variable and Default as the Column Variable. Click the "Cells" button and select both "Observed" and "Expected" under Counts. Then select Continue.
8. On the output page replace "Crosstabs" with Table 2: Crosstabulation of Default and Debt to Income Levels. Also, remove the SPSS commands
9. Print the output
10. On the output
a. identify which cells have an expected count larger than the count (the actual count).
b. Show the manual calculation of the "expected count" for the combination of Default (=1) and high DebtIncome.
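For 10b, the expected count under independence is (row total x column total) / grand total. A sketch with hypothetical counts (the real counts come from your SPSS crosstab):

```python
# Hypothetical crosstab: DebtIncome level (rows) by Default 0/1 (columns)
observed = {
    ("low", 0): 150, ("low", 1): 10,
    ("medium", 0): 90, ("medium", 1): 30,
    ("high", 0): 40, ("high", 1): 60,
}

grand_total = sum(observed.values())
row_total = {r: sum(v for (rr, _), v in observed.items() if rr == r)
             for r in ("low", "medium", "high")}
col_total = {c: sum(v for (_, cc), v in observed.items() if cc == c)
             for c in (0, 1)}

# Expected count if DebtIncome and Default were independent
expected_high_default = row_total["high"] * col_total[1] / grand_total
print(round(expected_high_default, 2))  # → 26.32
```

Comparing the observed count (60 here) with the expected count (about 26) is what contingency analysis uses to judge whether the two variables are related.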
Staple/Clip your Excel and SPSS output and place your name on the front
Objective: Practice computing and understanding probability distributions & calculations
1. Suppose that the average amount of time that customers are put on hold by customer support is 8 minutes with a standard deviation of 1.5 minutes. If wait time is normally distributed, calculate the probability of waiting less than 10 minutes.
a. In Excel, highlight cell A1, click click fx icon on tool bar; select Function Category = statistical; Function name = NORMDIST; fill in the following values in the blanks.
X = 10; Mean = 8; Standard_Dev = 1.5; Cumulative = True.
b. Calculate the probability of waiting more than 10 minutes and enter this figure into cell A2.
c. After you print the Excel worksheet below, draw a graphic that illustrates the problem that you just solved.
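The NORMDIST calculation in 1a can be reproduced with the standard normal CDF written in terms of the error function (a standard identity, shown here only as a cross-check):

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ Normal(mean, sd), via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# P(wait < 10 minutes) with mean 8, sd 1.5 -- matches NORMDIST(10, 8, 1.5, TRUE)
p_less = normal_cdf(10, 8, 1.5)
# P(wait > 10 minutes) is the complement (part b)
p_more = 1 - p_less
print(round(p_less, 4), round(p_more, 4))
```

The graphic asked for in 1c is the normal curve with the area to the left of 10 shaded for part a and the area to the right shaded for part b.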
2. Suppose that data entry has a 1% (0.01) probability of an error on any given entry, and that data entry errors are a binomial variable (error or no error). Compute the likelihood of zero errors given 100 entries.
a. In Excel, highlight cell A5, click the fx icon on tool bar; select Function Category = statistical; Function name = BINOMDIST; fill in the following values in the blanks:
Number_s = 0, Trials = 100, Probability_s = 0.01, Cumulative = True.
b. Repeat the procedure above (except change Cumulative = False) for the probability of 1 error and 2 errors (using cells A6 and A7).
c. Calculate the sum of the probabilities of 1 or 2 errors and place this sum in cell B4.
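The BINOMDIST values in problem 2 follow the binomial formula P(k) = C(n,k) p^k (1-p)^(n-k). A sketch using the problem's numbers (100 entries, 1% error rate):

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.01  # 100 entries, 1% error rate
p_zero = binomial_pmf(0, n, p)  # probability of zero errors (part a)
# Part c: sum the individual probabilities of exactly 1 and exactly 2 errors
p_one_or_two = binomial_pmf(1, n, p) + binomial_pmf(2, n, p)
print(round(p_zero, 4), round(p_one_or_two, 4))
```

Note that for k = 0, the cumulative probability P(X <= 0) equals the exact probability P(X = 0), which is why Cumulative = True still gives the zero-error answer in part a.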
3. Suppose that on average 10 customers arrive at the returns counter each hour and that arrivals follow a Poisson distribution. Compute the likelihood of 15 or fewer customers arriving in an hour at the desk:
a. In Excel, highlight cell A10, click the fx icon on the tool bar; Function Category = statistical; Function name = POISSON; fill in the following values in the blanks:
X = 15; Mean = 10; Cumulative = True
b. From this value, compute the probability of more than 15 customers arriving in an hour, and enter this value in cell A11.
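The POISSON calculation in 3a sums the Poisson probability mass function from 0 through 15. A sketch using the problem's numbers:

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the pmf term by term."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i)
               for i in range(k + 1))

lam = 10  # average arrivals per hour
p_at_most_15 = poisson_cdf(15, lam)   # matches POISSON(15, 10, TRUE)
p_more_than_15 = 1 - p_at_most_15     # part b
print(round(p_at_most_15, 4), round(p_more_than_15, 4))
```

As in part b of the earlier problems, "more than 15" is just one minus the cumulative probability.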
4. Print the worksheet (and remember to complete 1c)
Objective: Practice computing and understanding standard errors, interval estimates, and hypothesis tests
1. Open the SPSS file Market Sales Data from Dropbox: https://www.dropbox.com/s/f4qoqf6ojalzygs/Market%20Sales%20Data.sav
(You may need to save the file to the desktop first. You must use a campus computer with SPSS)
The file includes several variables on quantity of sales for a national chain. The variables are:
Market ID: Numeric ID for 1 of 10 regional market definitions
MktSize: Numeric ID for size of market (1=small, 2=medium, 3=large)
LocID: Numeric ID for specific store
AgeLoc: Age of store in years
Promo: Numeric ID for 1 of 3 different promotions
Week: Week of observation (1-4)
Sales: Number of units sold in thousands
2. i) Create descriptive statistics including a 95% confidence interval for Sales
(Select Analyze>Descriptive Stats>Explore. In the dialog box, put Sales into "Dependent List" and select the "Statistics" button under "Display," then select "OK.")
ii) Print this output with Title: Table 1. Descriptive Stats for Sales (in thousands)
iii) Below the output, using a simple number line, show what happens if you increase the Confidence Interval to 99%
(To derive the numbers for the 99% CI, you can make the computation by hand using a t-value of 2.5 or by repeating the prior step, and in the dialog box, Select the "Statistics" button and change 95% to 99%)
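The widening of the interval in step 2(iii) follows directly from the formula mean ± t × (standard error). The sales values below are hypothetical, and the t multipliers are the two-tailed critical values for 9 degrees of freedom:

```python
import math
import statistics

sales = [42, 55, 48, 61, 50, 47, 58, 44, 52, 49]  # hypothetical (thousands of units)
n = len(sales)
mean = statistics.mean(sales)
se = statistics.stdev(sales) / math.sqrt(n)  # standard error of the mean

intervals = {}
for label, t in [("95%", 2.262), ("99%", 3.250)]:  # t critical values, df = n - 1 = 9
    intervals[label] = (mean - t * se, mean + t * se)
    print(f"{label} CI: ({intervals[label][0]:.2f}, {intervals[label][1]:.2f})")
```

The 99% interval is wider than the 95% interval because more confidence requires casting a wider net around the sample mean, which is exactly what the number-line drawing in 2(iii) should show.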
3. (i) Test the null hypothesis that the average number of units sold (over all possible weeks and stores) is equal to 50 (and the alternative that it is not equal to 50) using the p-value method (SPSS labels 2-tailed p-values as "Sig. (2-tailed)") and with a cutoff significance level of 0.05.
(Select Analyze>Compare Means>One-Sample t-test. Put Sales as "Test Variable" and 50 as "Test Value.")
ii) Print this output with Title: Table 2. Test of Average Sales = 50 thousand
iii) Under the output, write out the null and alternative hypotheses and explain whether you would accept or reject the null hypothesis
iv) What would you do with the reported p-value to turn this into a 1-tailed test with a null that the mean is 50 or greater?
4. (i) Test the null hypothesis that the average number of units sold in weeks 1 and 2 is the same as in weeks 3 and 4 (the alternative is that they are not equal) using the p-value approach with a cutoff significance level of 0.05.
(Select Analyze>Compare Means>Independent Samples t-test. Put Sales as "Test Variables" and Week as "Grouping Variable." Select Define Groups, Select "Cut Point" and put "3" in the blank (this tests 1,2 versus 3,4). Select Continue and OK.)
ii) Print this output with Title: Table 3. Test of Average Sales Week 1-2 = Week 3-4.
iii) Under the output, write out the null and alternative hypotheses and explain whether you would accept or reject the null hypothesis
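The independent-samples test in step 4 is built on a t statistic that compares the two group means relative to their sampling variability. A sketch of the statistic itself (Welch's unequal-variance form; the sales values are hypothetical, and SPSS converts the statistic to the "Sig." p-value for you):

```python
import math
import statistics

weeks_1_2 = [50, 53, 47, 55, 49, 52]  # hypothetical sales, weeks 1-2
weeks_3_4 = [44, 46, 43, 48, 45, 47]  # hypothetical sales, weeks 3-4

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

t = welch_t(weeks_1_2, weeks_3_4)
# |t| well above ~2 suggests rejecting the null of equal means at the 0.05 level
print(round(t, 2))
```

As in 3(iv), a two-tailed p-value can be halved for a one-tailed test when the observed difference falls in the hypothesized direction.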
5. i) Run a regression with Sales as the Dependent Variable and Age of Store as the Independent Variable to test the null hypothesis that the slope coefficient equals 0 (alternative that it does not equal zero).
(Select Analyze>Regression>Linear. The 2-tailed p-value is labeled "Sig.")
ii) Print this output with Title: Table 4. Test of Slope Coefficient for Age of Store
iii) Under the output, write out the null and alternative hypotheses and explain whether you would accept or reject the null hypothesis
Staple/Clip output and place your name on the front
Additional Questions/Project Worksheet
1. Select one of the data gathering activities below (or, you can seek my approval on an alternative data source/topic)
-- Use Carmax.com to collect information on used car prices for 30 minivan models (collect 10 of one make (Honda, Toyota, ...) and 20 of various other makes, so your data include an indicator such as Honda = 1 or 0) along with model year and mileage;
(Variant: instead of different makes, you could consider different colors with, for example, 10 black and 20 of various other colors)
-- Use Amazon or similar site to collect prices for 30 televisions along with the size, Type (Plasma = 1,0), and refresh rate
(Variant: instead of televisions you could consider some other item along with 2 or 3 independent variables)
-- Use Edmunds.com to collect information on miles per gallon for 30 models along with each model's horsepower, weight, and width;
(Variant: instead of mileage you could collect information on 0-60 time or some other performance measure)
-- Use MLB websites and Baseball-Reference.com to collect information on attendance for all MLB teams along with each team's winning percentage in the season, the number of league championships in last 10 years, and metro area population;
(Variant: in place of attendance, use ticket prices for each team for a specific section -- for example, lowest section near third base or use NFL data from Stub Hub instead of MLB data)
2. Formulate a question/idea that you would like to estimate/test
Examples: How much does a specific color (black) add or subtract from the value of a car?
How much does the type of car (SUV) add or subtract?
How much does additional size (inches) add to the price of a television?
How much does more horsepower reduce miles per gallon?
How much does a history of winning a league championship influence ticket prices?
How much does winning percentage influence attendance?
How much does population influence ticket prices?
3. Write out a regression equation that can estimate your question/idea while taking account of other Key influences
Examples: Car Price = Black Color + Year + Mileage + Horsepower + Length + Weight
Attendance = Championships in last 10 years + Winning Percentage + Metro Population
TV Price = Size + Type + Refresh rate
MPG = Horsepower + Weight + Width
4. i) Enter your data into Excel with variable names on the top row.
ii) On its own worksheet, generate a table of descriptive statistics for your variables
iii) On its own worksheet, generate a regression with your dependent variable and the independent variable linked to your question.
iv) On its own worksheet, generate a regression with your dependent variable and all of your independent variables
v) For all of these, give an appropriate title (for example: Table 1. MLB Descriptive Statistics), adjust column widths, limit decimals to 3 places
vi) Print the tables and your raw data worksheet
5. Write a 1 page cover sheet report in the following format:
Paragraph 1: Express the question that you are investigating
Paragraph 2: Briefly summarize the variables that you have collected to investigate the question.
Paragraph 3: Write out the regression equation that you used.
Paragraph 4: Explain the meaning of the slope coefficient that you found in 4iii and how it answers your basic question
Paragraph 5: Express whether this coefficient changed very much in 4iv and whether the R2 changed very much
Accessing SPSS on WKUnet
Objective: Practice in collecting and assessing key facts about univariate data
(You may work alone or in 2-person teams. If you work as a team, turn in one (1) report with both names on the front.
You or your team may discuss the assignment with others but you may not copy answers or use the same data.)
1. Select some items for sale that have relatively standard attributes. Examples would be things such as gasoline, 32-inch or 36-inch televisions, tickets to similar types of games or concerts, computers of a given storage capacity, RAM, & monitor size and so on. For this item, collect at least 20 prices offered by different suppliers or on different brands. Keep track of the brand and/or location of item collected. For example, SonyCC, for Sony-Circuit City. You may collect this from online sources such as Yahoo Shopping or you may use locations around town. In either case, you must document the exact source(s) of your data such as Best Buy, or Yahoo Shopping. If it is an online source, provide the URL. If each data point comes from a different source (such as gas station), you must document each separate source such as the specific station.
2. i) Enter the prices you find as a variable into Excel along with another variable to identify the source of the price. Remember to use labels for each variable. For example, the spreadsheet might look like:
Source Price (Source and Price are the names of your variables; the data are entered in the columns below)
ii) Create a numerical statistical summary of your variable. Also create a histogram. Make sure to place each of these on separate worksheets. (Remember, to create a histogram, you need to create a column with "bin widths" -- numbers that tell Excel where to put the columns.)
iii) Edit your output. Resize the columns so the table fits properly. Redo cells so that you have no more than 2 decimal places. Place the title, TABLE 1: STATISTICS ON (whatever your variable describes) at the top of your descriptives table. Resize the histogram so that it has a nice appearance. Change the "gap width" on the histogram so that the bars are touching each other (if you know how). Place the title FIGURE 1: HISTOGRAM ON (your variable) at the top of your histogram.
iv) Print both your data spreadsheet and your output worksheets.
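The "bin widths" column mentioned in step 2 simply marks the edges of the histogram bars. The same binning can be sketched by hand (the prices and bin width below are hypothetical):

```python
from collections import Counter

prices = [178, 185, 192, 199, 203, 181, 210, 195, 188, 224]  # hypothetical prices ($)
bin_width = 10

# Assign each price to the lower edge of its bin (like Excel's bin column)
bins = Counter((p // bin_width) * bin_width for p in prices)

# Text histogram: one '#' per price in each bin
for edge in sorted(bins):
    print(f"{edge}-{edge + bin_width}: {'#' * bins[edge]}")
```

Choosing a sensible bin width matters: too narrow and every bar holds one price; too wide and the shape of the distribution disappears.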
3. On a separate sheet of paper, print or type answers to these questions in complete sentences. Explain your answers by making specific use of the pertinent statistics and histogram that you generate.
i) Describe the price variable that you collected, the units of measure, your method of sampling, and precisely report the source(s). If you have a large number of sources, place this information on a separate sheet at the back and refer to it here.
ii) Using statistics describing the center and variability of your data, explain what would be typical and unusual prices.
iii) Using the skewness and kurtosis statistics, explain other characteristics of your price data. Overall, does your data have a nearly normal distribution or not? Explain your answer using both the graphical data as well as the numerical statistics.
iv) Compute the standardized values for the first two prices in your data set. Show your calculations.
v) Based on your data and output, draw a boxplot that includes labels indicating the information shown by the boxplot.
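The arithmetic behind parts iv) and v) can be sketched as follows; the prices here are hypothetical, so plug in your own data and your own mean and standard deviation from the Excel output.

```python
# Standardized (z) values for the first two prices, and the five-number
# summary a boxplot displays -- on made-up prices, not real data.
import statistics

prices = [4.10, 4.75, 4.30, 4.55, 4.90, 4.20, 4.65, 4.40,
          4.85, 4.15, 4.60, 4.35, 4.70, 4.25, 4.50, 4.45,
          4.80, 4.05, 4.95, 4.00]

mean = statistics.mean(prices)
sd = statistics.stdev(prices)

# z = (x - mean) / standard deviation
z1 = (prices[0] - mean) / sd
z2 = (prices[1] - mean) / sd

# Five-number summary for the boxplot: min, Q1, median, Q3, max
q1, q2, q3 = statistics.quantiles(prices, n=4)
summary = (min(prices), q1, q2, q3, max(prices))

print(f"z1 = {z1:.2f}, z2 = {z2:.2f}")
print("five-number summary:", summary)
```

Note that Excel's QUARTILE function and Python's `statistics.quantiles` use slightly different interpolation rules, so quartiles may differ a little; the min, median, and max will agree.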
Deadline = Wednesday, February 4 (beginning of class)
Remember: Place answers on a separate sheet at the back. Staple or clip all sheets together. Your output should include the specified results -- no more and no less with answers typed or printed very neatly. Make sure your name is clearly printed or typed on a cover sheet.
(For details on using Excel: See instructions for Mini-Assignment 2 and Excel Help)
ASSIGNMENT zzzz (INCOMPLETE -- DO NOT USE)
Objective: Generating & reporting a regression analysis (you may work alone or in a 2-person team; teams turn in one report with both names on the front)
1. Collect data on two quantitative variables that you think would be related to each other in some way. You should have at least 30 observations for each variable. These variables must be from a documented source (no personal surveys).
Enter values for the two variables that you have collected into Excel.
2. i) Use the Chart Wizard to create a scatterplot (XY scatter graph in Excel) with your dependent variable on the Y-axis and your explanatory variable on the X-axis. (Refer to Mini-assignment #3 or Excel Help for assistance).
ii) Create linear regression results for your dependent variable and explanatory variable. (Use Tools>Data Analysis). Make sure to use a label for each variable, include the label in the data range, and check "labels." Also check "residuals" to compute residual (error) and predicted values. Place the regression output on a separate worksheet.
iii) EDIT YOUR OUTPUT in the following way. Make sure to format the results so they fit within the columns properly. For the scatterplot, change the title to FIGURE 1: Scatterplot of (descriptions of your variables). Change the regression title to TABLE 1: REGRESSION RESULTS FOR (descriptions of your variables). (If you change the Page Setup under "File" to "Landscape," the printout will look nicer. This is advisable but not required).
iv) Print your data sheet, the Scatterplot sheet, and the Regression sheet.
3. Answer the following questions on a separate sheet attached to the back of your output.
i) Describe the variables in your data set (what they are; what units are used) and specifically document the source.
ii) Neatly write out the regression results in equation form.
iii) Take the first value for the predicted and residual (error) values and show how they are computed. Then, in a sentence or two, explain the meaning of the predicted values.
iv) Using your scatterplot, draw the regression line as accurately as possible.
v) Using the output that you generated as the basis, describe how well your explanatory variable accounts for differences in your dependent variable.
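To see where the numbers in parts ii), iii), and v) come from, here is a minimal by-hand sketch of least-squares regression in Python. The (x, y) pairs are invented for illustration; the formulas are the same ones behind Excel's regression output.

```python
# Least-squares slope and intercept, predicted values, residuals, and R^2,
# computed by hand on made-up (x, y) data -- replace with your own variables.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# slope b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2); intercept b0 = ybar - b1*xbar
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# predicted value = b0 + b1*x; residual (error) = actual y - predicted y
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# R^2 = 1 - SSE/SST: the share of variation in y accounted for by x
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - sse / sst

print(f"y-hat = {b0:.3f} + {b1:.3f} x   (R^2 = {r2:.3f})")
print("first predicted:", round(predicted[0], 3), " first residual:", round(residuals[0], 3))
```

Part iii) of the questions asks for exactly this calculation on the first observation: plug the first x into the fitted equation to get the predicted value, then subtract it from the actual y to get the residual.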
Deadline = March 10 (beginning of class; remember to use a staple or clip; print neatly or type your answers. Where responses are in words, use complete sentences.)
ASSIGNMENT yyy (INCOMPLETE -- DO NOT USE)
Objective: Sampling error and testing for statistical significance (hypothesis testing) with Excel.
1. Develop a proposition between two variables (measured in identical units) that you can test with data. The claim must be stated as an equality or inequality between the means or proportions of the two variables. These variables must be obtained from an observable source such as a local store, online, SPSS files, ... . Do not use surveys of people (students, friends, ...) for this assignment. Collect at least 20 observations for each variable.
Example: I could collect 20 fiction book prices and 20 non-fiction book prices from Amazon.com and make my null hypothesis: Average Price (in $) of Fiction Books = Average Price (in $) of Non-fiction Books;
(Note: This is the same thing as Avg. Price Fiction - Avg. Price Non-fiction = 0; The test must be between two variables that are measured the same way such as in $, miles, gallons, lbs., ....) Examples of other variables would be financial variables on companies, points or performance measures for sports teams or players, prices on products, ...
2. i) Input your variables into two columns in an Excel spreadsheet.
ii) Obtain descriptive statistics for each of your variables using Tools/Data Analysis/Descriptive Statistics. Select "Summary Statistics," "Confidence Level for the Mean," and "Labels" and click "OK." In the output worksheet, make sure the columns with the output table are highlighted, and then choose Format/Columns/Autofit, so all of the table is shown. Then print your output.
iii) Select Tools/Data Analysis/t-test: two-sample assuming unequal variances and put your two variables in the input ranges, select "Labels," and choose the value from your null hypothesis to plug in for "hypothesized mean difference." Click "OK." In the output worksheet, use Format/Columns/Autofit to make the column width fit your table, and then print this output. Also print your data sheet.
3. Print or type answers to the following questions in complete sentences on a separate sheet attached to your output with a paper clip or staple:
i) Describe your variables and the source from which you obtained them.
ii) Explain your sampling method and how this might influence your estimate of sampling error. What might be some possible sources of non-sampling error?
iii) Clearly write out your proposition (null hypothesis) as an equation and then explain it in a sentence or two. What is the opposite of your proposition (alternative hypothesis)?
iv) What is the estimated size of the sampling error for the main statistics (sample mean or sample proportion)? Write out an equation showing how the 95% confidence intervals are computed (be specific: use actual figures from your output).
v) Evaluate your proposition using the statistics that you generated (be specific!).
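For parts iv) and v), the calculations behind Excel's output can be sketched in Python as follows. The book prices are invented to mirror the fiction/non-fiction example above; the t statistic formula matches the "Two-Sample Assuming Unequal Variances" (Welch) test, and the confidence interval uses the t-table critical value 2.093 for 19 degrees of freedom.

```python
# Welch two-sample t statistic and a 95% confidence interval for one mean,
# on made-up fiction/non-fiction prices (hypothetical data, not real).
import math
import statistics

fiction = [12.99, 9.99, 14.50, 11.25, 10.75, 13.40, 8.99, 15.20,
           12.10, 9.50, 11.80, 13.99, 10.25, 12.60, 9.75, 14.10,
           11.40, 10.99, 13.25, 12.45]
nonfiction = [18.50, 22.99, 16.75, 19.99, 24.50, 17.25, 21.40, 15.99,
              20.10, 23.75, 18.99, 16.40, 22.25, 19.50, 17.99, 25.10,
              20.99, 18.25, 21.75, 16.99]

m1, m2 = statistics.mean(fiction), statistics.mean(nonfiction)
v1, v2 = statistics.variance(fiction), statistics.variance(nonfiction)
n1, n2 = len(fiction), len(nonfiction)

# t = (observed mean difference - hypothesized difference of 0) / standard error
se = math.sqrt(v1 / n1 + v2 / n2)
t = (m1 - m2 - 0) / se

# 95% CI for the fiction mean: mean +/- t* x (s / sqrt(n)),
# with t* = 2.093 (t table, df = n - 1 = 19)
half_width = 2.093 * math.sqrt(v1 / n1)
ci = (m1 - half_width, m1 + half_width)

print(f"t = {t:.2f}; 95% CI for fiction mean = ({ci[0]:.2f}, {ci[1]:.2f})")
```

In this made-up example a large negative t means fiction prices average well below non-fiction prices, so the null hypothesis of equal means would be rejected; compare your own t statistic against the critical values in your Excel output the same way.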
Deadline = Wednesday, April 21 (beginning of class)