Descriptive Statistics

You can copy any of the code you see in the shaded boxes directly into your R Markdown document. I recommend watching the video tutorial for R Markdown here: https://youtu.be/r9bFBsNqnww?si=eWLJaVMbPVvQzZUN.

You should start every R script with the setwd() command to tell R where to look for data and store output. You can find this path using RStudio by clicking Session-> Set Working Directory -> Choose Directory and selecting the folder that contains the relevant data.

I suggest creating a folder on your desktop named ECON_307 that contains all of the data for this class. This will allow you to use the same path for each assignment. You could also have sub folders for each assignment to help keep things organized.

# setting the directory
# this is the path for my computer. yours will be different.
setwd('/Users/stp48131/Library/CloudStorage/Dropbox/WKU/Teaching/ECON_307/Class_Materials/Honors/Descriptive_Statistics')

R uses packages that are designed to do most of the things we would need to do in this class. You install packages using the install.packages() function. You can do this in the console on your personal computer since it only needs to be done once. You will need to comment these lines out by putting a # at the beginning if you want to leave the code in your script when you knit.

# Installing packages
install.packages('readxl') # this allows us to read Excel files
install.packages('moments') # this is necessary for a few of the summary statistics
install.packages('tidyverse') # useful for generating variable and cleaning data
install.packages('vtable') # nicely formatted summary tables
install.packages('ggplot2') # nice graphs

Once the packages are installed we need to load them so they can be used. This must be done each time you start a new R session.

# loading the packages
library(readxl)
library(moments)
library(tidyverse)
library(vtable)
library(ggplot2)

We load Excel file using the read_excel() function that comes with the readxl package. The following code loads the data and stores it as a data frame named “data.”

# loading the excel file and assigning it to the data frame named data
data <- read_excel("Descriptive_Statistics_R.xlsx")

The next step is to estimate the returns for XOM. Recall that returns are computed using the following formula: $\frac{Price_t}{Price_{t-1}}-1$. We will use the lag() function to compute returns since we cannot reference cells like we would in Excel.

# estimating returns
# data$XOM_return tells R to add the variable XOM_return to the data frame named data. 
# I multiplied by 100 so the format looks like a percentage.
data$XOM_return <-  (data$adjusted_price / lag(data$adjusted_price) -1)*100

We can attach the data frame since we will only be working with one data set. This allows us to reference variables without having to specify a data frame each time. For example, if we do not attach the data frame and want generic summary statistics for the XOM_return variable, we need to use summary(data$XOM_return). Once we have attached the data we can use summary(XOM_return) because R knows to look in the data data frame.

attach(data)

We will use functions to estimate the summary statistics. Most of these are intuitive but some require additional options. We will store each sample statistic as a variable so we can reference it later.

# the na.rm=TRUE option tells R to drop any missing values before estimating the statistic
average <- mean(XOM_return,na.rm = TRUE)
median <- median(XOM_return,na.rm = TRUE)
var <- var(XOM_return,na.rm=TRUE)
stddev <- sd(XOM_return,na.rm=TRUE)
minimum <- min(XOM_return,na.rm=TRUE)
maximum <- max(XOM_return,na.rm=TRUE)
# probs=.25 tells R to estimate the 25th percentile
q1 <- quantile(XOM_return,na.rm=TRUE,probs = .25)
# probs=.75 tells R to estimate the 75th percentile
q3 <- quantile(XOM_return,na.rm=TRUE,probs = .75)
# skewness and kurtosis will be slightly different than what
# you see in Excel
skew <- skewness(XOM_return,na.rm=TRUE)
kurt <- kurtosis(XOM_return,na.rm=TRUE)-3 # subtracting 3 to produce excess kurtosis

Formatting the summary statistics as a table can be a little clunky. I like the st() function that comes with the vtable package. The following code creates a table that contains labels and values for each statistic. You can add or drop any statistics that you need or do not need for your application.

# creating a table with the summary statistics
st(data,
   vars = c('XOM_return'), # variables to keep
   digits = 4, 
   out = 'return',
   summ.names = c('Mean',
                  'Median',
                  'Std. Dev.',
                  'Min',
                  'Max',
                  'Skew',
                  'Kurt',
                  'Q1',
                  'Q3'),
   summ = c('mean(x)', # average
            'median(x)', # median
            'sd(x)', # standard deviation
            'min(x)', # minimum
            'max(x)', # maximum
            'skewness(x)', # skewness from moments package 
            'kurtosis(x)', # kurtosis from moments package
            'pctile(x)[25]', # first quartile
            'pctile(x)[75]' # third quartile
            )
   )

##     Variable   Mean  Median Std. Dev.    Min   Max    Skew  Kurt     Q1    Q3
## 1 XOM_return -1.143 -0.6436     8.385 -26.19 22.39 -0.1982 4.564 -4.808 2.766

Frequency tables for continuous variables are also a little clunky in R. The trade-off is that you can easily customize the bin sizes and range of values.

## frequency table
# we can set the breaks so they can be used in other functions later
# lower bound is -30, upper bound is 25, bin width is 5
break_info=seq(-30, # lower bound
               25, # upper bound
               by=5 # bin size
               )
# lower bound is -.1 and the upper bound is .1. the bin size is .025
as.data.frame(table(cut(XOM_return, # variable of interest
                        breaks=break_info # notice this references what I added above
                        )
                    )
              )

##         Var1 Freq
## 1  (-30,-25]    1
## 2  (-25,-20]    0
## 3  (-20,-15]    1
## 4  (-15,-10]    5
## 5   (-10,-5]    4
## 6     (-5,0]   16
## 7      (0,5]   12
## 8     (5,10]    6
## 9    (10,15]    0
## 10   (15,20]    1
## 11   (20,25]    1

We can use ggplot() to make a frequency plot of the monthly returns. Again, the code is a little clunky but it gives you complete control of how the plot looks.

ggplot(data, aes(x=XOM_return)) +
  geom_histogram(breaks=break_info,
                 color="black", 
                 fill="white",
                 na.rm = TRUE,
                 stat = 'bin') +
  scale_x_continuous(breaks=break_info) +
  labs(x='XOM Return',
       y='Frequency')

We can also directly count the number of monthly returns that fall within a given range using the subset and length functions.

# Counting the number of observations where the return is between -2.5% and 2.5%
length(subset(XOM_return, # variable of interest
              XOM_return>-2.5 & XOM_return<2.5 # range of values we are interested in
              )
       )

## [1] 19

End the script by detaching the data.

detach(data)