Rarely will you find data that is ready to use without at least a small amount of cleaning. The steps you take to clean your data will change project to project but there are a few steps we almost always take. I will demonstrate how to generate variables, select variables, and filter observations based on some condition.

The first step is to load the necessary packages. The tidyverse package provides a number of convenient tools for data cleaning. We will also load the readxl package to load our dataset.

# loading packages
library(readxl)
library(tidyverse)
# setting the directory
setwd('~/Desktop/ECON 465')
# loading the data 
data<-data.frame(read_excel('largehousesample.xlsx'))

This dataset has 14,300 observations and 15 variables. We will use piping to clean this dataset. Anytime you see %>%, you are telling R to take the object to the left of %>% and apply whatever is on the right of %>%. For example, “data %>% summary()” tells R to take the data data frame and insert that into the summary() function. This is equivalent to using summary(data).

# notice these two lines produce the same result
data %>% summary()
##     soldyear      soldmonth      abvgroundsqft       age        
##  Min.   :2000   Min.   : 1.000   Min.   : 500   Min.   :  0.00  
##  1st Qu.:2003   1st Qu.: 4.000   1st Qu.:1150   1st Qu.:  7.00  
##  Median :2006   Median : 7.000   Median :1490   Median : 28.00  
##  Mean   :2006   Mean   : 6.595   Mean   :1672   Mean   : 32.87  
##  3rd Qu.:2008   3rd Qu.: 9.000   3rd Qu.:1994   3rd Qu.: 50.00  
##  Max.   :2011   Max.   :12.000   Max.   :9392   Max.   :200.00  
##     bedrooms        centralair      daysonmarket     disttorailroad     
##  Min.   : 1.000   Min.   :0.0000   Min.   :   0.00   Min.   :     0.42  
##  1st Qu.: 3.000   1st Qu.:1.0000   1st Qu.:  27.00   1st Qu.:  2301.10  
##  Median : 3.000   Median :1.0000   Median :  61.00   Median :  5087.70  
##  Mean   : 3.262   Mean   :0.9113   Mean   :  83.45   Mean   :  8680.47  
##  3rd Qu.: 4.000   3rd Qu.:1.0000   3rd Qu.: 114.00   3rd Qu.: 10565.60  
##  Max.   :12.000   Max.   :1.0000   Max.   :1158.00   Max.   :134817.02  
##    fireplace        fullbaths      lotsizeuse        partialbaths   
##  Min.   :0.0000   Min.   :1.00   Min.   :  0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.00   1st Qu.:  0.1800   1st Qu.:0.0000  
##  Median :0.0000   Median :2.00   Median :  0.2500   Median :0.0000  
##  Mean   :0.4799   Mean   :1.83   Mean   :  0.8259   Mean   :0.3687  
##  3rd Qu.:1.0000   3rd Qu.:2.00   3rd Qu.:  0.4300   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :8.00   Max.   :220.0000   Max.   :5.0000  
##  salesprice2011    listingoffice      distclosesttower  
##  Min.   :   1028   Length:14300       Min.   :   95.24  
##  1st Qu.: 105944   Class :character   1st Qu.: 2794.41  
##  Median : 151723   Mode  :character   Median : 4290.04  
##  Mean   : 185468                      Mean   : 5278.84  
##  3rd Qu.: 225647                      3rd Qu.: 6499.05  
##  Max.   :2968308                      Max.   :32094.04
summary(data)
##     soldyear      soldmonth      abvgroundsqft       age        
##  Min.   :2000   Min.   : 1.000   Min.   : 500   Min.   :  0.00  
##  1st Qu.:2003   1st Qu.: 4.000   1st Qu.:1150   1st Qu.:  7.00  
##  Median :2006   Median : 7.000   Median :1490   Median : 28.00  
##  Mean   :2006   Mean   : 6.595   Mean   :1672   Mean   : 32.87  
##  3rd Qu.:2008   3rd Qu.: 9.000   3rd Qu.:1994   3rd Qu.: 50.00  
##  Max.   :2011   Max.   :12.000   Max.   :9392   Max.   :200.00  
##     bedrooms        centralair      daysonmarket     disttorailroad     
##  Min.   : 1.000   Min.   :0.0000   Min.   :   0.00   Min.   :     0.42  
##  1st Qu.: 3.000   1st Qu.:1.0000   1st Qu.:  27.00   1st Qu.:  2301.10  
##  Median : 3.000   Median :1.0000   Median :  61.00   Median :  5087.70  
##  Mean   : 3.262   Mean   :0.9113   Mean   :  83.45   Mean   :  8680.47  
##  3rd Qu.: 4.000   3rd Qu.:1.0000   3rd Qu.: 114.00   3rd Qu.: 10565.60  
##  Max.   :12.000   Max.   :1.0000   Max.   :1158.00   Max.   :134817.02  
##    fireplace        fullbaths      lotsizeuse        partialbaths   
##  Min.   :0.0000   Min.   :1.00   Min.   :  0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.00   1st Qu.:  0.1800   1st Qu.:0.0000  
##  Median :0.0000   Median :2.00   Median :  0.2500   Median :0.0000  
##  Mean   :0.4799   Mean   :1.83   Mean   :  0.8259   Mean   :0.3687  
##  3rd Qu.:1.0000   3rd Qu.:2.00   3rd Qu.:  0.4300   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :8.00   Max.   :220.0000   Max.   :5.0000  
##  salesprice2011    listingoffice      distclosesttower  
##  Min.   :   1028   Length:14300       Min.   :   95.24  
##  1st Qu.: 105944   Class :character   1st Qu.: 2794.41  
##  Median : 151723   Mode  :character   Median : 4290.04  
##  Mean   : 185468                      Mean   : 5278.84  
##  3rd Qu.: 225647                      3rd Qu.: 6499.05  
##  Max.   :2968308                      Max.   :32094.04

The first step we will take is to select only the variables we want. We do this using the select() function. We will store the selected variables as a new data frame named data1. The code below tells R to select the listed variables in the data data frame and assign the result to the data frame named data1. You should see a new data frame in the environment window named data1.

# selecting a few variables from the original dataset and storing the subset
# in a new data frame
data1 <- data %>%
  select('soldyear', 'abvgroundsqft','age','bedrooms',
         'daysonmarket','fullbaths','lotsizeuse','salesprice2011',
         'listingoffice'
         )

The next step is to generate new variables. First, we will create a variable that depends on the value of a current continuous variable. If the relationships between the independent and dependent variables are non-linear we often need to generate the square or natural log of a variable. We can do this using the mutate() function. The following code starts with the data1 data frame and generates new variables that are equal to the natural log of salesprice2011 and age squared. We will save the new data in a data frame named data2. You should see these new variables in the data2 data frame.

# generating age squared and the natural log of salesprice2011
# storing the updated data frame 
data2<- data1 %>%
  mutate(agesq=age^2,
         lnsalesprice=log(salesprice2011)
         )

We may also want to generate a dummy variable that equals 1 if some condition is met and 0 otherwise. We do this using the ifelse() function inside of the mutate() function. The ifelse() function works with quantitative or string variables. We will generate a dummy variable equal to one if listingoffice is equal to RE/MAX. We will also generate a dummy variable equal to one if the property took more than 365 days to sell. We will store the updated data in a new data frame named data3.

# generating two dummy variables
data3<- data2 %>%
  mutate(remax=ifelse(listingoffice=='RE/MAX',1,0),
         over1year=ifelse(daysonmarket>365,1,0))

Notice we used “=” and “==” when we generated the remax variable. When we use “==” we are saying “if this condition is met”. We use “=” when we are assigning values to a variable. For example, mutate(x=60) says to assign the value of 60 to the variable x. “listingoffice==‘RE/MAX’” says “where listingoffice is equal to RE/MAX”. “remax=ifelse(listingoffice==‘RE/MAX’,1,0)” says “where listingoffice is equal to RE/MAX, assign the value of 1 to the variable remax. Assign the value of 0 to remax if listingoffice is equal to anything else.”

“over1year=ifelse(daysonmarket>365,1,0)” says “where daysonmarket is greater than 365 assign the value of 1 to over1year. Assign the value of 0 to over1year if daysonmarket is less than or equal to 365.” You can use the “>”, “<”, “>=”, “<=”, or “!=” (not equal to) operators for quantitative variables. You can use “==” and “!=” for string variables such as listingoffice.

Lastly, we will filter our datat to keep observations that satisfy certain conditions. We do this using the filter() function. We will end with na.omit() to remove every row with a missing value.

The following code starts with the data3 data frame and selects observations where soldyear is greater than 2005 and bedrooms is less than or equal to 5. The new data is stored as the data4 data frame.

# filtering the data based on sold year and bedrooms
data4<-data3 %>%
  filter(soldyear>2005,
         bedrooms<=5
         ) %>%
  na.omit()
# displaying the unique values of soldyear and bedrooms to check
unique(data4$soldyear)
## [1] 2009 2010 2011 2006 2007 2008
unique(data4$bedrooms)
## [1] 3 2 4 5 1

Notice our sample only includes obseervations where the soldyear is 2006 or greater and bedrooms is less than or equal to 5.

I applied each step of the data cleaning process separately and stored each result in a separate data frame. We can do this all in one step to avoid having multiple data frames in memory.

# all of the cleaning at one time
data5<-data.frame(read_excel('largehousesample.xlsx')) %>% # loading data
  select('soldyear', 'abvgroundsqft','age','bedrooms',
         'daysonmarket','fullbaths','lotsizeuse','salesprice2011',
         'listingoffice'
         ) %>% # selecting variables
  mutate(agesq=age^2,
         lnsalesprice=log(salesprice2011),
         remax=ifelse(listingoffice=='RE/MAX',1,0),
         over1year=ifelse(daysonmarket>365,1,0)
         ) %>% # generating variables
  filter(soldyear>2005,
         bedrooms<=5
         ) %>% # filtering data 
  na.omit()

Notice the data5 data frame is identical to data4 that was generated when we did each step separately.

This is slightly more advanced but we can use the filter() function within other functions. The following code provide summary statistics for sales price where bedrooms is equal to 3.

summary(filter(data,bedrooms==3)$salesprice2011)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1986  105942  140236  152090  185469 1229191

We could also create a scatter plot of sales price againt square footage for 3 bedroom homes.

ggplot(filter(data,bedrooms==3),aes(x=abvgroundsqft,y=salesprice2011)) +
  geom_point()

Tidyverse contains many more functions to make data cleaning easier. If you are familiar with mutate(), filter(), and select() you should be able to handle 95% of what you would ever need to do.