Rarely will you find data that is ready to use without at least a small amount of cleaning. The steps you take to clean your data will change project to project but there are a few steps we almost always take. I will demonstrate how to generate variables, select variables, and filter observations based on some condition.
The first step is to load the necessary packages. The tidyverse package provides a number of convenient tools for data cleaning. We will also load the readxl package to load our dataset.
# loading packages
library(readxl)
library(tidyverse)
# setting the directory
setwd('~/Desktop/ECON 465')
# loading the data
data<-data.frame(read_excel('largehousesample.xlsx'))
This dataset has 14,300 observations and 15 variables. We will use piping to clean this dataset. Anytime you see %>%, you are telling R to take the object to the left of %>% and apply whatever is on the right of %>%. For example, “data %>% summary()” tells R to take the data data frame and insert that into the summary() function. This is equivalent to using summary(data).
# notice these two lines produce the same result
data %>% summary()
## soldyear soldmonth abvgroundsqft age
## Min. :2000 Min. : 1.000 Min. : 500 Min. : 0.00
## 1st Qu.:2003 1st Qu.: 4.000 1st Qu.:1150 1st Qu.: 7.00
## Median :2006 Median : 7.000 Median :1490 Median : 28.00
## Mean :2006 Mean : 6.595 Mean :1672 Mean : 32.87
## 3rd Qu.:2008 3rd Qu.: 9.000 3rd Qu.:1994 3rd Qu.: 50.00
## Max. :2011 Max. :12.000 Max. :9392 Max. :200.00
## bedrooms centralair daysonmarket disttorailroad
## Min. : 1.000 Min. :0.0000 Min. : 0.00 Min. : 0.42
## 1st Qu.: 3.000 1st Qu.:1.0000 1st Qu.: 27.00 1st Qu.: 2301.10
## Median : 3.000 Median :1.0000 Median : 61.00 Median : 5087.70
## Mean : 3.262 Mean :0.9113 Mean : 83.45 Mean : 8680.47
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.: 114.00 3rd Qu.: 10565.60
## Max. :12.000 Max. :1.0000 Max. :1158.00 Max. :134817.02
## fireplace fullbaths lotsizeuse partialbaths
## Min. :0.0000 Min. :1.00 Min. : 0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.: 0.1800 1st Qu.:0.0000
## Median :0.0000 Median :2.00 Median : 0.2500 Median :0.0000
## Mean :0.4799 Mean :1.83 Mean : 0.8259 Mean :0.3687
## 3rd Qu.:1.0000 3rd Qu.:2.00 3rd Qu.: 0.4300 3rd Qu.:1.0000
## Max. :1.0000 Max. :8.00 Max. :220.0000 Max. :5.0000
## salesprice2011 listingoffice distclosesttower
## Min. : 1028 Length:14300 Min. : 95.24
## 1st Qu.: 105944 Class :character 1st Qu.: 2794.41
## Median : 151723 Mode :character Median : 4290.04
## Mean : 185468 Mean : 5278.84
## 3rd Qu.: 225647 3rd Qu.: 6499.05
## Max. :2968308 Max. :32094.04
summary(data)
## soldyear soldmonth abvgroundsqft age
## Min. :2000 Min. : 1.000 Min. : 500 Min. : 0.00
## 1st Qu.:2003 1st Qu.: 4.000 1st Qu.:1150 1st Qu.: 7.00
## Median :2006 Median : 7.000 Median :1490 Median : 28.00
## Mean :2006 Mean : 6.595 Mean :1672 Mean : 32.87
## 3rd Qu.:2008 3rd Qu.: 9.000 3rd Qu.:1994 3rd Qu.: 50.00
## Max. :2011 Max. :12.000 Max. :9392 Max. :200.00
## bedrooms centralair daysonmarket disttorailroad
## Min. : 1.000 Min. :0.0000 Min. : 0.00 Min. : 0.42
## 1st Qu.: 3.000 1st Qu.:1.0000 1st Qu.: 27.00 1st Qu.: 2301.10
## Median : 3.000 Median :1.0000 Median : 61.00 Median : 5087.70
## Mean : 3.262 Mean :0.9113 Mean : 83.45 Mean : 8680.47
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.: 114.00 3rd Qu.: 10565.60
## Max. :12.000 Max. :1.0000 Max. :1158.00 Max. :134817.02
## fireplace fullbaths lotsizeuse partialbaths
## Min. :0.0000 Min. :1.00 Min. : 0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.: 0.1800 1st Qu.:0.0000
## Median :0.0000 Median :2.00 Median : 0.2500 Median :0.0000
## Mean :0.4799 Mean :1.83 Mean : 0.8259 Mean :0.3687
## 3rd Qu.:1.0000 3rd Qu.:2.00 3rd Qu.: 0.4300 3rd Qu.:1.0000
## Max. :1.0000 Max. :8.00 Max. :220.0000 Max. :5.0000
## salesprice2011 listingoffice distclosesttower
## Min. : 1028 Length:14300 Min. : 95.24
## 1st Qu.: 105944 Class :character 1st Qu.: 2794.41
## Median : 151723 Mode :character Median : 4290.04
## Mean : 185468 Mean : 5278.84
## 3rd Qu.: 225647 3rd Qu.: 6499.05
## Max. :2968308 Max. :32094.04
The first step we will take is to select only the variables we want. We do this using the select() function. We will store the selected variables as a new data frame named data1. The code below tells R to select the listed variables in the data data frame and assign the result to the data frame named data1. You should see a new data frame in the environment window named data1.
# selecting a few variables from the original dataset and storing the subset
# in a new data frame
data1 <- data %>%
select('soldyear', 'abvgroundsqft','age','bedrooms',
'daysonmarket','fullbaths','lotsizeuse','salesprice2011',
'listingoffice'
)
The next step is to generate new variables. First, we will create a variable that depends on the value of a current continuous variable. If the relationships between the independent and dependent variables are non-linear we often need to generate the square or natural log of a variable. We can do this using the mutate() function. The following code starts with the data1 data frame and generates new variables that are equal to the natural log of salesprice2011 and age squared. We will save the new data in a data frame named data2. You should see these new variables in the data2 data frame.
# generating age squared and the natural log of salesprice2011
# storing the updated data frame
data2<- data1 %>%
mutate(agesq=age^2,
lnsalesprice=log(salesprice2011)
)
We may also want to generate a dummy variable that equals 1 if some condition is met and 0 otherwise. We do this using the ifelse() function inside of the mutate() function. The ifelse() function works with quantitative or string variables. We will generate a dummy variable equal to one if listingoffice is equal to RE/MAX. We will also generate a dummy variable equal to one if the property took more than 365 days to sell. We will store the updated data in a new data frame named data3.
# generating two dummy variables
data3<- data2 %>%
mutate(remax=ifelse(listingoffice=='RE/MAX',1,0),
over1year=ifelse(daysonmarket>365,1,0))
Notice we used “=” and “==” when we generated the remax variable. When we use “==” we are saying “if this condition is met”. We use “=” when we are assigning values to a variable. For example, mutate(x=60) says to assign the value of 60 to the variable x. “listingoffice==‘RE/MAX’” says “where listingoffice is equal to RE/MAX”. “remax=ifelse(listingoffice==‘RE/MAX’,1,0)” says “where listingoffice is equal to RE/MAX, assign the value of 1 to the variable remax. Assign the value of 0 to remax if listingoffice is equal to anything else.”
“over1year=ifelse(daysonmarket>365,1,0)” says “where daysonmarket is greater than 365 assign the value of 1 to over1year. Assign the value of 0 to over1year if daysonmarket is less than or equal to 365.” You can use the “>”, “<”, “>=”, “<=”, or “!=” (not equal to) operators for quantitative variables. You can use “==” and “!=” for string variables such as listingoffice.
Lastly, we will filter our datat to keep observations that satisfy certain conditions. We do this using the filter() function. We will end with na.omit() to remove every row with a missing value.
The following code starts with the data3 data frame and selects observations where soldyear is greater than 2005 and bedrooms is less than or equal to 5. The new data is stored as the data4 data frame.
# filtering the data based on sold year and bedrooms
data4<-data3 %>%
filter(soldyear>2005,
bedrooms<=5
) %>%
na.omit()
# displaying the unique values of soldyear and bedrooms to check
unique(data4$soldyear)
## [1] 2009 2010 2011 2006 2007 2008
unique(data4$bedrooms)
## [1] 3 2 4 5 1
Notice our sample only includes obseervations where the soldyear is 2006 or greater and bedrooms is less than or equal to 5.
I applied each step of the data cleaning process separately and stored each result in a separate data frame. We can do this all in one step to avoid having multiple data frames in memory.
# all of the cleaning at one time
data5<-data.frame(read_excel('largehousesample.xlsx')) %>% # loading data
select('soldyear', 'abvgroundsqft','age','bedrooms',
'daysonmarket','fullbaths','lotsizeuse','salesprice2011',
'listingoffice'
) %>% # selecting variables
mutate(agesq=age^2,
lnsalesprice=log(salesprice2011),
remax=ifelse(listingoffice=='RE/MAX',1,0),
over1year=ifelse(daysonmarket>365,1,0)
) %>% # generating variables
filter(soldyear>2005,
bedrooms<=5
) %>% # filtering data
na.omit()
Notice the data5 data frame is identical to data4 that was generated when we did each step separately.
This is slightly more advanced but we can use the filter() function within other functions. The following code provide summary statistics for sales price where bedrooms is equal to 3.
summary(filter(data,bedrooms==3)$salesprice2011)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1986 105942 140236 152090 185469 1229191
We could also create a scatter plot of sales price againt square footage for 3 bedroom homes.
ggplot(filter(data,bedrooms==3),aes(x=abvgroundsqft,y=salesprice2011)) +
geom_point()
Tidyverse contains many more functions to make data cleaning easier. If you are familiar with mutate(), filter(), and select() you should be able to handle 95% of what you would ever need to do.