You can copy any of the code you see in the shaded boxes directly into your R Markdown document. I recommend watching the video tutorial for R Markdown here: https://youtu.be/r9bFBsNqnww?si=eWLJaVMbPVvQzZUN.
You should start every R script with the setwd()
command
to tell R where to look for data and store output. You can find this
path using RStudio by clicking Session-> Set Working Directory ->
Choose Directory and selecting the folder that contains the relevant
data.
I suggest creating a folder on your desktop named ECON_307 that contains all of the data for this class. This will allow you to use the same path for each assignment. You could also have sub folders for each assignment to help keep things organized.
# setting the directory
# this is the path for my computer. yours will be different.
setwd('/Users/stp48131/Library/CloudStorage/Dropbox/WKU/Teaching/ECON_307/Class_Materials/Honors/Descriptive_Statistics')
R uses packages that are designed to do most of the things we would
need to do in this class. You install packages using the
install.packages()
function. You can do this in the console
on your personal computer since it only needs to be done once. You will
need to comment these lines out by putting a # at the beginning if you
want to leave the code in your script when you knit.
# Installing packages
install.packages('readxl') # this allows us to read Excel files
install.packages('moments') # this is necessary for a few of the summary statistics
install.packages('tidyverse') # useful for generating variable and cleaning data
install.packages('vtable') # nicely formatted summary tables
install.packages('ggplot2') # nice graphs
Once the packages are installed we need to load them so they can be used. This must be done each time you start a new R session.
# loading the packages
library(readxl)
library(moments)
library(tidyverse)
library(vtable)
library(ggplot2)
We load Excel file using the read_excel()
function that
comes with the readxl package. The following code loads the data and
stores it as a data frame named “data.”
# loading the excel file and assigning it to the data frame named data
data <- read_excel("Descriptive_Statistics_R.xlsx")
The next step is to estimate the returns for XOM. Recall that returns
are computed using the following formula: \(\frac{Price_t}{Price_{t-1}}-1\). We will
use the lag()
function to compute returns since we cannot
reference cells like we would in Excel.
# estimating returns
# data$XOM_return tells R to add the variable XOM_return to the data frame named data.
# I multiplied by 100 so the format looks like a percentage.
data$XOM_return <- (data$adjusted_price / lag(data$adjusted_price) -1)*100
We can attach the data frame since we will only be working with one
data set. This allows us to reference variables without having to
specify a data frame each time. For example, if we do not attach the
data frame and want generic summary statistics for the XOM_return
variable, we need to use summary(data$XOM_return)
. Once we
have attached the data we can use summary(XOM_return)
because R knows to look in the data data frame.
attach(data)
We will use functions to estimate the summary statistics. Most of these are intuitive but some require additional options. We will store each sample statistic as a variable so we can reference it later.
# the na.rm=TRUE option tells R to drop any missing values before estimating the statistic
average <- mean(XOM_return,na.rm = TRUE)
median <- median(XOM_return,na.rm = TRUE)
var <- var(XOM_return,na.rm=TRUE)
stddev <- sd(XOM_return,na.rm=TRUE)
minimum <- min(XOM_return,na.rm=TRUE)
maximum <- max(XOM_return,na.rm=TRUE)
# probs=.25 tells R to estimate the 25th percentile
q1 <- quantile(XOM_return,na.rm=TRUE,probs = .25)
# probs=.75 tells R to estimate the 75th percentile
q3 <- quantile(XOM_return,na.rm=TRUE,probs = .75)
# skewness and kurtosis will be slightly different than what
# you see in Excel
skew <- skewness(XOM_return,na.rm=TRUE)
kurt <- kurtosis(XOM_return,na.rm=TRUE)-3 # subtracting 3 to produce excess kurtosis
Formatting the summary statistics as a table can be a little clunky.
I like the st()
function that comes with the
vtable
package. The following code creates a table that
contains labels and values for each statistic. You can add or drop any
statistics that you need or do not need for your application.
# creating a table with the summary statistics
st(data,
vars = c('XOM_return'), # variables to keep
digits = 4,
out = 'return',
summ.names = c('Mean',
'Median',
'Std. Dev.',
'Min',
'Max',
'Skew',
'Kurt',
'Q1',
'Q3'),
summ = c('mean(x)', # average
'median(x)', # median
'sd(x)', # standard deviation
'min(x)', # minimum
'max(x)', # maximum
'skewness(x)', # skewness from moments package
'kurtosis(x)', # kurtosis from moments package
'pctile(x)[25]', # first quartile
'pctile(x)[75]' # third quartile
)
)
## Variable Mean Median Std. Dev. Min Max Skew Kurt Q1 Q3
## 1 XOM_return -1.143 -0.6436 8.385 -26.19 22.39 -0.1982 4.564 -4.808 2.766
Frequency tables for continuous variables are also a little clunky in R. The trade-off is that you can easily customize the bin sizes and range of values.
## frequency table
# we can set the breaks so they can be used in other functions later
# lower bound is -30, upper bound is 25, bin width is 5
break_info=seq(-30, # lower bound
25, # upper bound
by=5 # bin size
)
# lower bound is -.1 and the upper bound is .1. the bin size is .025
as.data.frame(table(cut(XOM_return, # variable of interest
breaks=break_info # notice this references what I added above
)
)
)
## Var1 Freq
## 1 (-30,-25] 1
## 2 (-25,-20] 0
## 3 (-20,-15] 1
## 4 (-15,-10] 5
## 5 (-10,-5] 4
## 6 (-5,0] 16
## 7 (0,5] 12
## 8 (5,10] 6
## 9 (10,15] 0
## 10 (15,20] 1
## 11 (20,25] 1
We can use ggplot()
to make a frequency plot of the
monthly returns. Again, the code is a little clunky but it gives you
complete control of how the plot looks.
ggplot(data, aes(x=XOM_return)) +
geom_histogram(breaks=break_info,
color="black",
fill="white",
na.rm = TRUE,
stat = 'bin') +
scale_x_continuous(breaks=break_info) +
labs(x='XOM Return',
y='Frequency')
We can also directly count the number of monthly returns that fall within a given range using the subset and length functions.
# Counting the number of observations where the return is between -2.5% and 2.5%
length(subset(XOM_return, # variable of interest
XOM_return>-2.5 & XOM_return<2.5 # range of values we are interested in
)
)
## [1] 19
End the script by detaching the data.
detach(data)