The first thing you must do when starting a new R script is to set
the working directory. This tells R where to look for the files you want
to load. Any output or figures you export will also be saved in this
directory. You change the working directory with the setwd() command.
setwd('/Users/stp48131/Dropbox/WKU/Teaching/R/Introduction to R')
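To confirm that the working directory was set correctly, you can print it with getwd() and list its contents with list.files(). The sketch below uses a temporary folder from tempdir() as a stand-in for your own course folder. That substitution is an assumption made only so the example runs on any computer.

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# tempdir() is a stand-in for your own course folder (an assumption)
setwd(tempdir())

# getwd() returns the directory R is currently using
current_dir <- getwd()
print(current_dir)

# list.files() shows the files R can see in that directory
print(list.files())

# Switch back to the original directory
setwd(old_dir)
```

If getwd() does not print the folder you expected, check the path you passed to setwd() for typos.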
A number of useful packages are installed when you download R. There
are also many user-written packages that make certain tasks much easier.
You have to install these packages and then load them before they can be
used. You install packages using the install.packages() function. You
could put this in the script, but I recommend installing packages from
the console. You only need to do this once on your personal computer.
Once you have the package installed, you can skip to the next step,
where the packages are loaded.
We are going to use a package called pacman to manage all of the
installation and loading of packages. This package is useful because it
first checks whether a package is installed before trying to load it. If
the package has not been installed, pacman will install it and then
load it.
install.packages('pacman') # we will use this to install everything else.
# this only needs to be done once on your personal computer
# and should be done in the console
You must load the installed packages before they can be used. Load
installed packages with the library() function.
library(pacman) # this loads the pacman package.
Now that we have pacman installed and loaded, we can use it to install
the other packages we need. readxl, tidyverse, and ggplot2 are packages
we will use very often. We can install and load these using the p_load()
function that comes with pacman.
# loads the readxl, ggplot2, tidyverse, vtable, moments, and stargazer packages.
# any package that has not been installed is installed before it is loaded
p_load(readxl, ggplot2, tidyverse, vtable, moments, stargazer)
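The check-install-load logic described above can be sketched in base R. This is only an illustration of the idea, not how pacman is actually implemented, and load_or_install is a hypothetical helper name.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE when the package is not installed
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE lets library() accept the name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# "stats" ships with R, so nothing is downloaded in this example
load_or_install("stats")
```

In practice p_load() is shorter to type, which is why we use it in this course.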
The next step is to load data into memory. You load Excel files using
the read_excel() function from the readxl package. You must assign the
data to a data frame for future reference. The following line assigns
data from the RealEstate.xlsx Excel file to the data frame named
housedata. You can view this data frame in the Environment tab in
RStudio.
The .name_repair option is optional but is helpful when the Excel file
has column names with spaces. It makes the names valid R names, for
example by replacing spaces with periods. Notice the variable
"college.town" was "college town" in the Excel file. The .name_repair
option replaced the space with a period.
housedata<-read_excel('RealEstate.xlsx',
.name_repair = 'universal')
## New names:
## • `college town` -> `college.town`
You can use the sheet = "sheetname" option if you need to load data from a specific sheet in the Excel file.
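Here is a short sketch of the sheet option. Because the sheet names in RealEstate.xlsx are not shown here, the example uses datasets.xlsx, a sample workbook that ships with the readxl package. The file and the sheet name "mtcars" belong to that sample workbook, not to the course data.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl
example_path <- readxl_example("datasets.xlsx")

# excel_sheets() lists the sheet names so you know what you can ask for
sheet_names <- excel_sheets(example_path)
print(sheet_names)

# Load one specific sheet by name
mtcars_sheet <- read_excel(example_path, sheet = "mtcars")
head(mtcars_sheet)
```

Running excel_sheets() first is a handy habit when you are not sure how the sheets in a workbook are named.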
The second most common data format you will use in this course is the
csv file. These are text files where columns are separated by commas.
We can load these files using the read_csv() function that comes with
the tidyverse package.
Notice that the option used to fix the variable names here is name_repair rather than .name_repair. There are many small details like this that will drive you nuts as you learn R.
housedata_2<-read_csv('RealEstate.csv',
name_repair = 'universal')
## New names:
## • `college town` -> `college.town`
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): price, sqft, age, pool, fireplace, college.town
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
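If you want to see the name repair in isolation, the sketch below writes a tiny two-row csv file to a temporary location and reads it back. The file is a made-up stand-in for RealEstate.csv (the two rows reuse values from the housedata listings in this tutorial), and show_col_types = FALSE is added only to quiet the column specification message.

```r
library(readr)  # readr is loaded automatically as part of the tidyverse

# Write a tiny stand-in csv file whose second column name contains a space
csv_path <- tempfile(fileext = ".csv")
writeLines(c("price,college town",
             "205452,0",
             "253392,1"),
           csv_path)

# name_repair = "universal" replaces the space with a period
toy_data <- read_csv(csv_path,
                     name_repair = "universal",
                     show_col_types = FALSE)

names(toy_data)
```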
There are several ways to view data in R Studio. You can view the entire data frame by clicking on the relevant data frame in the Environment tab. You can inspect the first and last several rows of data using the head() and tail() functions.
# head() displays the first 6 rows
head(housedata)
## # A tibble: 6 × 6
## price sqft age pool fireplace college.town
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 205452 2346 6 0 1 0
## 2 185328 2003 5 0 1 0
## 3 248422 2777 6 0 0 0
## 4 154690 2017 1 0 0 0
## 5 221801 2645 0 0 1 0
## 6 199119 2156 6 0 1 0
# tail() displays the last 6 rows
tail(housedata)
## # A tibble: 6 × 6
## price sqft age pool fireplace college.town
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 253392 2053 1 0 0 1
## 2 257195 2284 4 0 0 1
## 3 338295 3000 11 0 1 1
## 4 263526 2399 6 0 0 1
## 5 300728 2874 9 0 0 1
## 6 220987 2093 2 0 1 1
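Both functions accept an n argument if you want a different number of rows, and dim(), names(), and str() give a quick overview of a data frame's structure. The sketch below uses mtcars, a data set built into R, so that it runs without the course files. You would replace mtcars with housedata in your own script.

```r
# mtcars is a data set that ships with R; it stands in for housedata here

# n controls how many rows head() and tail() display
first_three <- head(mtcars, n = 3)
last_three <- tail(mtcars, n = 3)
print(first_three)
print(last_three)

# dim() returns the number of rows and columns
dim(mtcars)

# names() lists the variable names
names(mtcars)

# str() shows each variable's type along with its first few values
str(mtcars)
```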
There are a few built-in functions you can use to estimate summary
statistics in R. The summary() function is the quickest way to get
basic descriptive statistics.
summary(housedata)
## price sqft age pool
## Min. :134316 Min. :2003 Min. : 0.000 Min. :0.000
## 1st Qu.:215647 1st Qu.:2283 1st Qu.: 3.000 1st Qu.:0.000
## Median :245832 Median :2536 Median : 6.000 Median :0.000
## Mean :247656 Mean :2521 Mean : 9.392 Mean :0.204
## 3rd Qu.:278264 3rd Qu.:2775 3rd Qu.:13.000 3rd Qu.:0.000
## Max. :345197 Max. :3000 Max. :60.000 Max. :1.000
## fireplace college.town
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :1.000
## Mean :0.518 Mean :0.519
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000
Notice this provided summary statistics for each variable in the
housedata data frame. If you want summary statistics for a particular
variable, you can specify that in the summary() function. The following
line produces summary statistics for the price variable. The general
format for this function is summary(dataframe$varname).
summary(housedata$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 134316 215647 245832 247656 278264 345197
If you do not need all of the statistics provided by summary(), you can
estimate specific statistics by using the appropriate function.
mean(housedata$price) # mean
## [1] 247655.7
sd(housedata$price) # standard deviation
## [1] 42192.73
median(housedata$price) # median
## [1] 245832.5
min(housedata$price) # minimum
## [1] 134316
max(housedata$price) # maximum
## [1] 345197
quantile(housedata$price,probs=c(.25, .5,.75)) # specific quantiles
## 25% 50% 75%
## 215646.8 245832.5 278264.5
You can store the summary values using “<-” if you need to reference them later. For example, we can store the average of the price variable by using the following:
average_price<-mean(housedata$price) # storing the mean
average_price # displaying the stored value
## [1] 247655.7
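Stored values are useful because you can reuse them in later calculations. As a small sketch, the code below stores a mean and a standard deviation and then uses both to standardize each price. The prices vector holds only the six prices shown in the head(housedata) output, so the numbers will differ from the full-sample results.

```r
# The six prices from the head(housedata) output, used as a small example
prices <- c(205452, 185328, 248422, 154690, 221801, 199119)

# Store the statistics we want to reuse
average_price <- mean(prices)
sd_price <- sd(prices)

# Reuse the stored values: distance from the mean in standard deviations
z_scores <- (prices - average_price) / sd_price
z_scores

# Coefficient of variation: standard deviation relative to the mean
cv_price <- sd_price / average_price
cv_price
```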
We can use the st() function in the vtable package to make a nice
summary table that can be exported to Word. This function allows you to
select certain variables and generate any statistics you may want. If
you want to include skewness and kurtosis, you need to add the moments
package to your p_load() function above.
# creating a table with the summary statistics
st(housedata,
#vars = c('price','sqft','age','pool','fireplace','college.town'), # variables to keep
digits = 4,
summ = c('mean(x)', # average
'median(x)', # median
'sd(x)', # standard deviation
'min(x)', # minimum
'max(x)', # maximum
'skewness(x)', # skewness from moments package
'kurtosis(x)', # kurtosis from moments package
'pctile(x)[25]', # first quartile
'pctile(x)[75]' # third quartile
)
)
Variable | Mean | Median | Sd | Min | Max | Skewness | Kurtosis | Pctile[25] | Pctile[75] |
---|---|---|---|---|---|---|---|---|---|
price | 247656 | 245832 | 42193 | 134316 | 345197 | 0.09056 | 2.333 | 215647 | 278264 |
sqft | 2521 | 2536 | 291.8 | 2003 | 3000 | -0.09283 | 1.815 | 2283 | 2775 |
age | 9.392 | 6 | 9.427 | 0 | 60 | 1.648 | 6.015 | 3 | 13 |
pool | 0.204 | 0 | 0.4032 | 0 | 1 | 1.469 | 3.158 | 0 | 0 |
fireplace | 0.518 | 1 | 0.4999 | 0 | 1 | -0.07205 | 1.005 | 0 | 1 |
college.town | 0.519 | 1 | 0.4999 | 0 | 1 | -0.07605 | 1.006 | 0 | 1 |
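If you would rather work with the summary table as a data frame, st() has an out argument. The sketch below assumes out = 'return' hands the table back as a data frame, which is how I understand the vtable documentation, and then saves it with write.csv() from base R. It uses three columns of the built-in mtcars data and a temporary file path so that it runs without the course files.

```r
library(vtable)

# Three columns of the built-in mtcars data stand in for housedata
example_data <- mtcars[, c("mpg", "hp", "wt")]

# out = 'return' asks st() to return the table instead of displaying it
summary_table <- st(example_data, out = "return")
print(summary_table)

# Save the table so it can be opened in Excel or pasted into Word
table_path <- file.path(tempdir(), "summary_table.csv")
write.csv(summary_table, table_path, row.names = FALSE)
```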
R is an excellent tool for creating really nice plots. The most basic
plot you will use to visualize the distribution of a variable is a
histogram. The hist() function is a quick way to create a basic
histogram in R.
hist(housedata$price)
You can create a nicer plot using ggplot with geom_histogram(). We will
use ggplot to plot everything moving forward, so I recommend getting
familiar with that syntax as soon as possible. The syntax is a little
clunky, but you will use the same basic syntax for every plot you
create.
Here is a histogram:
ggplot(housedata, aes(x=price)) +
geom_histogram(breaks=seq(130000,350000,by=15000),
color="black",
fill="white") +
labs(x='Sales Price',
y='Frequency')
Here is a scatter plot:
ggplot(housedata,aes(x=sqft, y=price)) +
geom_point() +
labs(x="Square Footage",
y="Sales Price"
)
The code above generates a scatter plot with square footage on the x
axis and price on the y axis. The first argument is the data frame we
want to use for the plot. aes() assigns variables to the axes. Adding
+ geom_point() produces the scatter plot. R will display an empty plot
if you forget this. The labs() function labels the x and y axes. If you
wanted to generate a line plot, you would use geom_line() instead of
geom_point(). A line graph is not appropriate here, but you can see the
output below.
ggplot(housedata,aes(x=sqft, y=price)) +
geom_line() +
labs(x="Square Footage",
y="Sales Price")
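Any ggplot can be stored in an object and written to a file with ggsave(). The sketch below is a minimal example under a few assumptions: toy_data contains only the six rows shown in the head(housedata) output, and the plot is saved to a temporary folder. If you give ggsave() a plain file name instead, the file is saved in your working directory.

```r
library(ggplot2)

# Six rows from the head(housedata) output, used as a small stand-in data set
toy_data <- data.frame(
  sqft  = c(2346, 2003, 2777, 2017, 2645, 2156),
  price = c(205452, 185328, 248422, 154690, 221801, 199119)
)

# Store the plot in an object instead of only displaying it
scatter_plot <- ggplot(toy_data, aes(x = sqft, y = price)) +
  geom_point() +
  labs(x = "Square Footage",
       y = "Sales Price")

# Save the stored plot; width and height are in inches by default
plot_path <- file.path(tempdir(), "scatter_plot.png")
ggsave(plot_path, plot = scatter_plot, width = 6, height = 4)
```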
You will use the default function for linear models to estimate most regressions in this class. Assume the model you want to estimate is:
\(price = \beta_0 + \beta_1 sqft + \epsilon\).
You will estimate this regression using the lm() function. The basic
syntax for this function is: lm(dependent variable ~ independent
variable, data=data frame). You need to store the results so you can
view them later.
reg_results<-lm(price~sqft, data=housedata)
The previous line did not display any output, but the results were
stored in the reg_results list. You can view the regression results
using the summary() function.
summary(reg_results)
##
## Call:
## lm(formula = price ~ sqft, data = housedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77683 -30522 2576 30912 73371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30920.276 9336.991 3.312 0.000961 ***
## sqft 85.973 3.679 23.367 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33940 on 998 degrees of freedom
## Multiple R-squared: 0.3536, Adjusted R-squared: 0.353
## F-statistic: 546 on 1 and 998 DF, p-value: < 2.2e-16
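The "Estimate" column contains the estimates of \(\beta_0\) and \(\beta_1\). In this example, \(\hat{\beta}_0 = 30{,}920\) and \(\hat{\beta}_1 = 86\). The estimated regression equation is:
\(\widehat{price} = 30{,}920 + 86 \, sqft\).
Taken literally, the intercept says that a property with 0 square feet will sell for $30,920 on average. No property in the data is smaller than 2,003 square feet, so this is an extrapolation with little practical meaning. On average, the sales price increases by $86 when square footage increases by 1 square foot.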
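The estimated equation can also be used to predict a sales price. As a worked example, the code below plugs a 2,500 square foot property into the equation using the unrounded coefficients from the regression output. The 2,500 figure is a hypothetical value chosen for illustration.

```r
# Coefficient estimates copied from the summary(reg_results) output
b0 <- 30920.276  # intercept
b1 <- 85.973     # slope on sqft

# Hypothetical property size, chosen only for illustration
sqft_new <- 2500

# Predicted price = intercept + slope * square footage
predicted_price <- b0 + b1 * sqft_new
predicted_price  # 30920.276 + 214932.5 = 245852.776

# With the fitted model in memory, this call should give approximately
# the same answer:
# predict(reg_results, newdata = data.frame(sqft = 2500))
```

A 2,500 square foot property falls inside the range of the data (2,003 to 3,000 square feet), so this prediction does not rely on extrapolation.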
You can add the estimated regression line to the scatter plot to visualize the relationship. The fitted values (predicted values) are stored in the reg_results list. You access these using “reg_results$fitted.values”.
We will use the same code from before but add to it a line plot of the fitted values. We do this by adding “geom_line(aes(y=reg_results$fitted.values))” to the scatter plot.
ggplot(housedata,aes(x=sqft, y=price)) +
geom_point() +
geom_line(aes(y=reg_results$fitted.values)) +
labs(x="Square Footage",
y="Sales Price")
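An alternative that this tutorial does not otherwise use is geom_smooth(method = "lm"), which fits and draws the least-squares line itself, so you do not have to pull the fitted values out of the regression object. The sketch below assumes a small stand-in data set made of the twelve rows shown in the head() and tail() output. The line it draws will therefore differ from the full-sample regression line.

```r
library(ggplot2)

# Twelve rows from the head() and tail() output, used as a stand-in data set
toy_data <- data.frame(
  sqft  = c(2346, 2003, 2777, 2017, 2645, 2156,
            2053, 2284, 3000, 2399, 2874, 2093),
  price = c(205452, 185328, 248422, 154690, 221801, 199119,
            253392, 257195, 338295, 263526, 300728, 220987)
)

# geom_smooth() with method = "lm" fits price ~ sqft and draws the line
# se = FALSE hides the confidence band around the line
fitted_plot <- ggplot(toy_data, aes(x = sqft, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Square Footage",
       y = "Sales Price")

fitted_plot
```

Both approaches draw the same line for a simple regression. Using the stored fitted values makes it explicit which model the line comes from.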
Extending this model to include more than one independent variable is
as simple as adding the additional variables to the lm() function.
reg_results2<-lm(price~sqft+age+pool+fireplace+college.town,
data=housedata)
summary(reg_results2)
##
## Call:
## lm(formula = price ~ sqft + age + pool + fireplace + college.town,
## data = housedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47971 -10411 198 10438 44759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6911.880 4289.365 1.611 0.107410
## sqft 83.183 1.672 49.759 < 2e-16 ***
## age -192.991 51.567 -3.743 0.000193 ***
## pool 4352.570 1205.261 3.611 0.000320 ***
## fireplace 1398.810 976.807 1.432 0.152452
## college.town 60196.233 971.531 61.960 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15330 on 994 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.8679
## F-statistic: 1314 on 5 and 994 DF, p-value: < 2.2e-16
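When you need individual numbers rather than the full printout, base R provides helper functions such as coef() and confint(). The sketch below is a minimal example that assumes a stand-in data set built from the twelve rows shown in the head() and tail() output, so its estimates will not match the full-sample results above.

```r
# Twelve rows from the head() and tail() output, used as a stand-in data set
toy_data <- data.frame(
  price = c(205452, 185328, 248422, 154690, 221801, 199119,
            253392, 257195, 338295, 263526, 300728, 220987),
  sqft  = c(2346, 2003, 2777, 2017, 2645, 2156,
            2053, 2284, 3000, 2399, 2874, 2093),
  college.town = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
)

toy_results <- lm(price ~ sqft + college.town, data = toy_data)

# coef() returns the coefficient estimates as a named vector
toy_coefs <- coef(toy_results)
toy_coefs

# confint() returns confidence intervals for each coefficient
toy_intervals <- confint(toy_results, level = 0.95)
toy_intervals

# R-squared is stored inside the summary object
summary(toy_results)$r.squared
```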
We will use the stargazer package to generate formatted regression
tables. We can combine the output from the two regressions above into a
single table. The type='text' option tells R to generate a table that
can easily be viewed in the output window. You will use the type='html'
option later when you want to export the table to a Word or PowerPoint
file.
stargazer(reg_results,reg_results2,
type='text')
##
## =======================================================================
## Dependent variable:
## ---------------------------------------------------
## price
## (1) (2)
## -----------------------------------------------------------------------
## sqft 85.973*** 83.183***
## (3.679) (1.672)
##
## age -192.991***
## (51.567)
##
## pool 4,352.570***
## (1,205.261)
##
## fireplace 1,398.810
## (976.807)
##
## college.town 60,196.230***
## (971.531)
##
## Constant 30,920.280*** 6,911.880
## (9,336.991) (4,289.365)
##
## -----------------------------------------------------------------------
## Observations 1,000 1,000
## R2 0.354 0.869
## Adjusted R2 0.353 0.868
## Residual Std. Error 33,938.410 (df = 998) 15,334.440 (df = 994)
## F Statistic 546.037*** (df = 1; 998) 1,313.837*** (df = 5; 994)
## =======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
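As a preview of the html option, the sketch below writes a table to a file using stargazer's out argument. It assumes a small stand-in data set built from the twelve rows shown in the head() and tail() output, and it saves to a temporary folder. In your own script you would pass reg_results and reg_results2 and a file name in your working directory.

```r
library(stargazer)

# Twelve rows from the head() and tail() output, used as a stand-in data set
toy_data <- data.frame(
  price = c(205452, 185328, 248422, 154690, 221801, 199119,
            253392, 257195, 338295, 263526, 300728, 220987),
  sqft  = c(2346, 2003, 2777, 2017, 2645, 2156,
            2053, 2284, 3000, 2399, 2874, 2093),
  college.town = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
)

toy_results1 <- lm(price ~ sqft, data = toy_data)
toy_results2 <- lm(price ~ sqft + college.town, data = toy_data)

# type = 'html' produces an html table; out saves it to the given file
html_path <- file.path(tempdir(), "regression_table.html")
stargazer(toy_results1, toy_results2,
          type = "html",
          out = html_path)
```

The saved html file can be opened in a browser or in Word, and the table can then be copied into your document.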