Introduction to R

Setting the Directory

The first thing you must do when starting a new R script is to set the directory. This tells R where it should look for the files you want to load. This directory will also contain any output or figures you export. You change the directory with the setwd() command. Click here view a short video on how to save a script and set the directory.

setwd('/Users/stp48131/Dropbox/WKU/Teaching/R/Introduction to R')

Installing and Loading Packages

There are a number of useful packages that are installed when you download R. There are also many user-written packages that make certain tasks much easier. You have to install these packages and then load them before they can be used. You install packages using the install.packages() function. You could put this in the script but I recommend installing packages from the console. You only need to do this once on your personal computer. Once you have the package installed you can skip to the next step were the packages are loaded.

We are going to use a package called pacman to manage all of the installation and loading of packages. This package is useful because it will first check to see if a package is installed before trying to load it. If the package has not been installed, pacman will install and then load it.

install.packages('pacman')  # we will use this to install everything else. 
                            # this only needs to be done once on your personal computer
                            # and should be done in the console

You must load the installed packages before they can be used. Load installed packages with the library() function.

library(pacman) # this loads the pacman package.

Now that we have pacman installed and loaded, we can use it to install the other packages we need. readxl, tidyverse, and ggplot2 are packages we will use very often. We can install and load these using the p_load() function that comes with pacman.

# loads the readxl, ggplot2, and tidyverse packages. 
# if they have not been installed, this will install them before loading
p_load(readxl, ggplot2, tidyverse, vtable, moments, stargazer)

Loading Data

The next step is to load data into memory. You load Excel files using the read_excel() function from the readxl package. You must assign the data to a data frame for future reference. The following line assigns data from the RealEstate.xlsx Excel file to the data frame named housedata. You can view this data frame in the Environment tab in R Studio.

The .name_repair is optional but is helpful when the Excel file has column names with spaces. This will remove spaces from the names. Notice the variable college.town" wascollege town” in the Excel file. The .name_repair option replaced the space with a period.

housedata<-read_excel('RealEstate.xlsx',
                      .name_repair = 'universal')

## New names:
## • `college town` -> `college.town`

You can use the sheet=“sheetname” option if you need to load data from a specific sheet in the Excel file.

The second most common data format you will use in this course is the csv file. These are text files where columns are separated by a comma. We can load these files using the read_csv() function that comes with the tidyverse package.

Notice the option used to fix the variable names was “name_repair” instead of “.name_repair.” There are many small details like this that will drive you nuts as you learn R.

housedata_2<-read_csv('RealEstate.csv', 
                      name_repair = 'universal')

## New names:
## Rows: 1000 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (6): price, sqft, age, pool, fireplace, college.town
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `college town` -> `college.town`

Inspecting Data

There are several ways to view data in R Studio. You can view the entire data frame by clicking on the relevant data frame in the Environment tab. You can inspect the first and last several rows of data using the head() and tail() functions.

# head() displays the first 6 rows
head(housedata)

## # A tibble: 6 × 6
##    price  sqft   age  pool fireplace college.town
##    <dbl> <dbl> <dbl> <dbl>     <dbl>        <dbl>
## 1 205452  2346     6     0         1            0
## 2 185328  2003     5     0         1            0
## 3 248422  2777     6     0         0            0
## 4 154690  2017     1     0         0            0
## 5 221801  2645     0     0         1            0
## 6 199119  2156     6     0         1            0

# tail() displays the last 6 rows
tail(housedata)

## # A tibble: 6 × 6
##    price  sqft   age  pool fireplace college.town
##    <dbl> <dbl> <dbl> <dbl>     <dbl>        <dbl>
## 1 253392  2053     1     0         0            1
## 2 257195  2284     4     0         0            1
## 3 338295  3000    11     0         1            1
## 4 263526  2399     6     0         0            1
## 5 300728  2874     9     0         0            1
## 6 220987  2093     2     0         1            1

Estimating Summary Statistics

There are a few built-in packages you can use to estimate summary statistics in R. The summary() function is the quickest way to get basic descriptive statistics.

summary(housedata)

##      price             sqft           age              pool      
##  Min.   :134316   Min.   :2003   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:215647   1st Qu.:2283   1st Qu.: 3.000   1st Qu.:0.000  
##  Median :245832   Median :2536   Median : 6.000   Median :0.000  
##  Mean   :247656   Mean   :2521   Mean   : 9.392   Mean   :0.204  
##  3rd Qu.:278264   3rd Qu.:2775   3rd Qu.:13.000   3rd Qu.:0.000  
##  Max.   :345197   Max.   :3000   Max.   :60.000   Max.   :1.000  
##    fireplace      college.town  
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.000   Median :1.000  
##  Mean   :0.518   Mean   :0.519  
##  3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :1.000   Max.   :1.000

Notice this provided summary statistics for each variable in the housedata data frame. If you want summary statistics for a particular variable, you can specify that in the summary() function. The following line produces summary statistics for the price variable. The general format for this function is summary(dataframe$varname).

summary(housedata$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  134316  215647  245832  247656  278264  345197

If you do not need all of the statistics provided by summary(), you can estimate specific statistics by using the appropriate function.

mean(housedata$price) # mean

## [1] 247655.7

sd(housedata$price) # standard deviation

## [1] 42192.73

median(housedata$price) # median

## [1] 245832.5

min(housedata$price) # minimum

## [1] 134316

max(housedata$price) # maximum

## [1] 345197

quantile(housedata$price,probs=c(.25, .5,.75)) # specific quantiles

##      25%      50%      75% 
## 215646.8 245832.5 278264.5

You can store the summary values using “<-” if you need to reference them later. For example, we can store the average of the price variable by using the following:

average_price<-mean(housedata$price) # storing the mean
average_price  # displaying the stored value

## [1] 247655.7

we can use the st() function in the vtable package to make a nice summary table that can be exported to Word. This function allows you to select certain variables and generate any statistics you may want. If you want to include skewness and kurtosis, you need to add the “moments” package to your p_load() function above.

# creating a table with the summary statistics
st(housedata,
   #vars = c('price','sqft','age','pool','fireplace','college.town'), # variables to keep
   digits = 4, 
   summ = c('mean(x)', # average
            'median(x)', # median
            'sd(x)', # standard deviation
            'min(x)', # minimum
            'max(x)', # maximum
            'skewness(x)', # skewness from moments package 
            'kurtosis(x)', # kurtosis from moments package
            'pctile(x)[25]', # first quartile
            'pctile(x)[75]' # third quartile
            )
   )

Summary Statistics
Variable	Mean	Median	Sd	Min	Max	Skewness	Kurtosis	Pctile[25]	Pctile[75]
price	247656	245832	42193	134316	345197	0.09056	2.333	215647	278264
sqft	2521	2536	291.8	2003	3000	-0.09283	1.815	2283	2775
age	9.392	6	9.427	0	60	1.648	6.015	3	13
pool	0.204	0	0.4032	0	1	1.469	3.158	0	0
fireplace	0.518	1	0.4999	0	1	-0.07205	1.005	0	1
college.town	0.519	1	0.4999	0	1	-0.07605	1.006	0	1

Plotting Data

R is an excellent tool for creating really nice plots. The most basic plot you will use to visualize the distribution of a variable is a histogram. The hist() function is a quick way to create a basic histogram in R.

hist(housedata$price)

You can create a nicer plot using ggplot with geom_histogram(). We will use ggplot to plot everything moving forward so I recommend getting familiar with that syntax as soon as possible. The syntax is a little clunky but you will use the same basic syntax for every plot you create.

Here is a histogram:

ggplot(housedata, aes(x=price)) +
  geom_histogram(breaks=seq(130000,350000,by=15000),
                 color="black", 
                 fill="white") +
  labs(x='Sales Price',
       y='Frequency')

Here is a scatter plot:

ggplot(housedata,aes(x=sqft, y=price)) + 
  geom_point() +
  labs(x="Square Footage",
       y="Sales Price"
       )

The code above generates a scatter plot with square footage on the x axis and price on the y axis. The first argument is the dataframe we want to use for the plot. aes() allows you to assign variables to the axes. We add the “+ geom_point()” to produce a scatter plot. R will display an empty plot if you forget this. The labs() option allows you to label the x and y axes. If you wanted to generate a line plot, you would use “geom_line()” instead of “geom_point()”. A line graph is not appropriate here but you can see the output below.

ggplot(housedata,aes(x=sqft, y=price)) + 
  geom_line() +
  labs(x="Square Footage",
       y="Sales Price")

Estimating a Regression

You will use the default function for linear models to estimate most regressions in this class. Assume the model you want to estimate is:

price=$\beta_0$+$\beta_1$sqft+$\epsilon$.

You will estimate this regression using the lm() function. The basic syntax for this function is: lm(dependent variable ~ independent variable, data=data frame). You need to store the results so you can view them later.

reg_results<-lm(price~sqft, data=housedata)

The previous line did not display any output but the results were stored in the reg_results list. You can view the regression results using the summary() function.

summary(reg_results)

## 
## Call:
## lm(formula = price ~ sqft, data = housedata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -77683 -30522   2576  30912  73371 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30920.276   9336.991   3.312 0.000961 ***
## sqft           85.973      3.679  23.367  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33940 on 998 degrees of freedom
## Multiple R-squared:  0.3536, Adjusted R-squared:  0.353 
## F-statistic:   546 on 1 and 998 DF,  p-value: < 2.2e-16

The “Estimate” column contains the estimates of $\beta_0$ and $\beta_1$. In this example, $\hat{\beta_0}$=30,920 and $\hat{\beta_1}$=86. The estimated regression equation is:

$\hat{price}$=30,920+86sqft.

On average, a property with 0 square feet will sell for $30,920. On average, sales price increases by $86 when square footage increases by 1.

You can add the estimated regression line to the scatter plot to visualize the relationship. The fitted values (predicted values) are stored in the reg_results list. You access these using “reg_results$fitted.values”.

We will use the same code from before but add to it a line plot of the fitted values. We do this by adding “geom_line(aes(y=reg_results$fitted.values))” to the scatter plot.

ggplot(housedata,aes(x=sqft, y=price)) + 
  geom_point() +
  geom_line(aes(y=reg_results$fitted.values)) +
  labs(x="Square Footage",
       y="Sales Price")

Extending this model to include more than one independent variable is as simple as adding the additional variables to the lm() function.

reg_results2<-lm(price~sqft+age+pool+fireplace+college.town,
                 data=housedata)

summary(reg_results2)

## 
## Call:
## lm(formula = price ~ sqft + age + pool + fireplace + college.town, 
##     data = housedata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47971 -10411    198  10438  44759 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6911.880   4289.365   1.611 0.107410    
## sqft            83.183      1.672  49.759  < 2e-16 ***
## age           -192.991     51.567  -3.743 0.000193 ***
## pool          4352.570   1205.261   3.611 0.000320 ***
## fireplace     1398.810    976.807   1.432 0.152452    
## college.town 60196.233    971.531  61.960  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15330 on 994 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8679 
## F-statistic:  1314 on 5 and 994 DF,  p-value: < 2.2e-16

We will use the stargazer package to generate formatted regression tables. We can combine the output from the two regressions above into a single table. The “type=‘text’” option tells R to generate a table that can easily be viewed in the output window. You will use the “type=‘html’” option later when you want to export the table to a Word or Powerpoint file.

stargazer(reg_results,reg_results2, 
          type='text')

## 
## =======================================================================
##                                     Dependent variable:                
##                     ---------------------------------------------------
##                                            price                       
##                               (1)                       (2)            
## -----------------------------------------------------------------------
## sqft                       85.973***                 83.183***         
##                             (3.679)                   (1.672)          
##                                                                        
## age                                                 -192.991***        
##                                                       (51.567)         
##                                                                        
## pool                                                4,352.570***       
##                                                     (1,205.261)        
##                                                                        
## fireplace                                            1,398.810         
##                                                      (976.807)         
##                                                                        
## college.town                                       60,196.230***       
##                                                      (971.531)         
##                                                                        
## Constant                 30,920.280***               6,911.880         
##                           (9,336.991)               (4,289.365)        
##                                                                        
## -----------------------------------------------------------------------
## Observations                 1,000                     1,000           
## R2                           0.354                     0.869           
## Adjusted R2                  0.353                     0.868           
## Residual Std. Error  33,938.410 (df = 998)     15,334.440 (df = 994)   
## F Statistic         546.037*** (df = 1; 998) 1,313.837*** (df = 5; 994)
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01