library(here)
library(tidyverse)
library(stargazer)
library(estimatr)
library(dplyr)
library(plm)
library(lmtest)
library(car)
library(sandwich)
library(formatR)
How Can We Differentiate Between Correlation and Causation?
Correlation can easily be confused with causation. Sometimes it’s not obvious whether X causes Y, or whether X and Y are merely correlated because both are driven by a third factor, Z.
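A minimal simulated sketch of this situation is shown here. The variables x, y, and z are hypothetical and the numbers are made up; the only point is that a shared driver Z can produce a strong X-Y correlation that mostly disappears once Z is controlled for.

```r
# Simulated illustration only: x, y, and z are hypothetical variables.
set.seed(241)
n <- 1000
z <- rnorm(n)            # common driver Z
x <- 2 * z + rnorm(n)    # X depends on Z, not on Y
y <- 3 * z + rnorm(n)    # Y depends on Z, not on X

# Raw correlation between X and Y is strong even though neither causes the other
raw_cor <- cor(x, y)

# Regressing Y on X alone attributes Z's influence to X
naive_slope <- coef(lm(y ~ x))[["x"]]

# Adding Z as a control shrinks the coefficient on X toward zero
adjusted_slope <- coef(lm(y ~ x + z))[["x"]]

print(c(raw_cor = raw_cor, naive_slope = naive_slope, adjusted_slope = adjusted_slope))
```

In real problems Z is often unobserved, which is why simply adding a control is not always an option.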
For example, consider that people often get ice cream on days when they also end up getting sunburnt. Could this mean that ice cream causes skin to burn? Probably not. We know intuitively that this is not the case, and logically that people get ice cream more often on warm, sunny days; it is those warm, sunny days that cause the sunburn, not the ice cream. However, other problems are more nuanced and complex, and the answer may not be so obvious. How do we discern causation from correlation, then?
The TLDR: we can use control groups! Here, I use a difference-in-differences control group analysis to determine whether closer proximity to an incinerator (a proxy for exposure to higher levels of air pollution) causes a decrease in average housing values. A key assumption that must hold for this analysis to be valid is the parallel trends assumption.
For the parallel trends assumption to hold water, we need to be fairly certain that, had treatment been withheld (i.e. had a nearby incinerator never been built), the prices of houses in the ‘treated’ group (i.e. houses that did end up close to an incinerator) would have followed the same trend as prices in the ‘control’ group (i.e. houses that did not end up close to an incinerator). Because I control for house characteristics such as number of rooms, house size, and other attributes, I consider this assumption and the analysis valid.
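In symbols, the assumption and the resulting estimate can be written roughly as follows. The notation is my own shorthand rather than anything defined in the dataset: Y(0) denotes the price a house would sell for with no incinerator nearby, and ‘near’ and ‘far’ denote the treated and control groups.

```latex
% Parallel trends: absent the incinerator, both groups share the same expected price change
\mathbb{E}\left[ Y_{1981}(0) - Y_{1978}(0) \mid \text{near} \right]
  = \mathbb{E}\left[ Y_{1981}(0) - Y_{1978}(0) \mid \text{far} \right]

% Difference-in-differences estimate built from the four group-by-year sample means
\hat{\delta} = \left( \bar{Y}_{\text{near},1981} - \bar{Y}_{\text{near},1978} \right)
             - \left( \bar{Y}_{\text{far},1981} - \bar{Y}_{\text{far},1978} \right)
```

The first line cannot be checked directly, since Y(0) is never observed for the treated houses in 1981; this is why it is an assumption rather than a result.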
The Code
Load Libraries
Read in the Data¹
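# build the path to the data file relative to the project root
file_path <- here("posts", "correlation-vs-causation", "data", "KM_EDS241.csv")
# read the data and print a quick overview of its columns
raw_dat <- read_csv(file_path) |> glimpse()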
Rows: 321
Columns: 7
$ year <dbl> 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 19…
$ age <dbl> 48, 83, 58, 11, 48, 78, 22, 78, 42, 41, 78, 38, 18, 32, 18, 58…
$ rooms <dbl> 7, 6, 6, 5, 5, 6, 6, 6, 8, 5, 6, 5, 6, 6, 6, 7, 6, 5, 4, 5, 4,…
$ area <dbl> 1660, 2612, 1144, 1136, 1868, 1780, 1700, 1556, 1642, 1443, 14…
$ land <dbl> 4578, 8370, 5000, 10000, 10000, 9500, 10878, 3870, 7000, 7950,…
$ nearinc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,…
$ rprice <dbl> 60000, 40000, 34000, 63900, 44000, 46000, 56000, 38500, 60500,…
clean_dat <- raw_dat |>
  mutate(nearinc = as.factor(nearinc)) |>
  # flag near-incinerator houses in 1981 only, since in 1978 the incinerator location wasn't well known
  mutate(known_nearinc = ifelse(nearinc == 1 & year == 1981, 1, 0)) |>
  mutate(known_nearinc = as.factor(known_nearinc))
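As a small check on what this step builds, the sketch below uses four made-up rows (not taken from KM_EDS241.csv) to show that known_nearinc is the same thing as the usual difference-in-differences interaction term: the near-incinerator indicator multiplied by a 1981 indicator. The column names y81 and interaction are hypothetical helpers introduced only for this illustration.

```r
library(dplyr)

# Toy rows for illustration only (not taken from KM_EDS241.csv)
toy <- tibble(
  year    = c(1978, 1978, 1981, 1981),
  nearinc = c(0, 1, 0, 1)
) |>
  mutate(
    y81           = as.integer(year == 1981),                  # post-period indicator
    known_nearinc = ifelse(nearinc == 1 & year == 1981, 1, 0), # same rule as in clean_dat
    interaction   = nearinc * y81                              # classic DiD interaction term
  )

print(toy)
```

Only the row that is both near the incinerator and observed in 1981 is flagged, so the two columns match row for row.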
A Quick Data Exploration
Looking at the housing data plotted as boxplots below, we see that houses perceived as sited close to an incinerator (right-hand side) appear to have lower prices on average than houses not perceived as sited close to an incinerator (left-hand side). Is this because the locations perceived as close to the incinerator are simply correlated with lower housing values for reasons other than incinerator proximity? For instance, could the incinerator have been sited on less expensive land, with less expensive homes? Or is it because close incinerator proximity causes lower housing values, perhaps because people do not want to buy homes in areas with worse air quality?
# boxplots of selling price, split by whether the house was known to be near the incinerator
ggplot(data = clean_dat, mapping = aes(x = known_nearinc, y = rprice)) +
  geom_boxplot(alpha = 0.5) +
  xlab("Perceived as Close to Incinerator") +
  ylab("Inflation-Adjusted House Selling Price") +
  ggtitle("House Selling Prices and Incinerator Proximity") +
  scale_x_discrete(labels = c("No", "Yes"))
Finding Causation: Linear Regression with Difference In Differences Control Group Analysis
To help tease this out, we can use a difference-in-differences analysis. Under the parallel trends assumption, housing values in both groups would have changed at the same rate from 1978 to 1981 had no incinerator been built. We therefore compare the change over time in the ‘treated’ housing values (houses with an incinerator nearby) against the change over time in the ‘control’ housing values (houses without an incinerator nearby); the control group’s change stands in for what the treated group’s change would have been without the incinerator. This approach also controls for time-variant omitted variables that affect both groups equally (for example, a policy change that influences housing prices across the board). We also control for the house characteristics in our dataset (such as house size) that would otherwise create bias.
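The comparison can be sketched on simulated data. Everything in the block below is made up for illustration (group sizes, price levels, and a hypothetical true effect of -10), and the names near, post, and price are placeholders rather than columns from the dataset. It shows that the difference in differences computed from the four group means is the same number as the interaction coefficient in a regression.

```r
# Simulated data for illustration; all numbers below are made up.
set.seed(1981)
n_per_cell <- 200
sim <- expand.grid(near = c(0, 1), post = c(0, 1), id = seq_len(n_per_cell))
true_effect <- -10   # hypothetical treatment effect
sim$price <- 80 - 15 * sim$near + 12 * sim$post +
  true_effect * sim$near * sim$post + rnorm(nrow(sim), sd = 5)

# Difference in differences from the four cell means
cell_means <- tapply(sim$price, list(near = sim$near, post = sim$post), mean)
did_means <- (cell_means["1", "1"] - cell_means["1", "0"]) -
  (cell_means["0", "1"] - cell_means["0", "0"])

# The same quantity as the interaction coefficient in a regression
did_reg <- coef(lm(price ~ near * post, data = sim))[["near:post"]]

print(c(did_means = did_means, did_reg = did_reg))
```

Note that the level difference between groups (the -15) and the common time trend (the +12) both drop out of the estimate; only the extra change in the treated group remains.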
The known_nearinc variable is an indicator equal to 1 for houses sold in 1981 with an incinerator known to be nearby (our treatment group); our control group is every other house sale (since in 1978, it was not common knowledge that houses would be near an incinerator). Note that we still control for the nearinc variable (which marks houses that end up being near the incinerator, whether this was known or not), since this variable may act as a proxy for other factors, like lower land values or other missing variables, that could influence housing prices and are also associated with choosing where to site an incinerator. Under the parallel trends assumption, the treatment and control groups should have the same slope over time if the incinerator has no effect on housing prices. If the slopes differ, that difference is the effect of treatment (i.e. the effect caused by incinerator proximity). So, if the coefficient on known_nearinc is negative and statistically significantly different from 0 (since a non-zero value creates different slopes), we can conclude, provided the parallel trends assumption holds, that perceived close proximity to an incinerator causes a decrease in housing value.
Here, we test whether or not the coefficient on the known_nearinc variable is statistically significantly different from 0 and therefore whether or not closer incinerator proximity causes a decrease in housing value.
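Written out, the regression and the hypothesis being tested look like this. The variable names come from the dataset; the Greek letters and the error term are my own notation, with δ standing for the coefficient on known_nearinc.

```latex
% Regression specification: delta is the difference-in-differences effect
\text{rprice}_i = \beta_0 + \delta\,\text{known\_nearinc}_i + \beta_1\,\text{nearinc}_i
  + \beta_2\,\mathbb{1}[\text{year}_i = 1981]
  + \beta_3\,\text{age}_i + \beta_4\,\text{rooms}_i + \beta_5\,\text{area}_i
  + \beta_6\,\text{land}_i + u_i

% Null and alternative hypotheses for the treatment effect
H_0 : \delta = 0 \qquad \text{vs.} \qquad H_1 : \delta \neq 0
```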
# linear regression also controlling for house and lot characteristics in addition to year
lm_mod_all <- lm(formula = rprice ~ known_nearinc + nearinc + as.factor(year) + age + rooms + area + land, data = clean_dat)
# heteroskedasticity-robust (HC2) standard errors for the fitted model
se_lmmod_all <- starprep(lm_mod_all, stat = c("std.error"), se_type = "HC2", alpha = 0.05)
# print the regression table using the robust standard errors
stargazer(lm_mod_all, se = se_lmmod_all, type = "text")
===============================================
                        Dependent variable:
                    ---------------------------
                              rprice
-----------------------------------------------
known_nearinc1             -13,320.150**
                            (6,785.662)
nearinc1                     3,514.141
                            (7,149.521)
as.factor(year)1981        13,093.930***
                            (2,795.311)
age                         -266.338***
                             (50.716)
rooms                       6,969.002***
                            (1,542.265)
area                         23.782***
                              (3.901)
land                           0.127
                              (0.137)
Constant                    -17,688.850
                           (11,070.580)
-----------------------------------------------
Observations                    321
R2                             0.612
Adjusted R2                    0.603
Residual Std. Error    20,857.870 (df = 313)
F Statistic           70.541*** (df = 7; 313)
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
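To see how the two stars on known_nearinc1 arise, the arithmetic can be redone by hand from the numbers in the table. This is a rough sketch: the choice of reference distribution (normal vs. t with 313 degrees of freedom) is an assumption on my part, and the estimate sits close enough to the 5% threshold that the choice is worth printing both ways.

```r
# Values copied from the regression table above
estimate  <- -13320.150   # coefficient on known_nearinc1
robust_se <- 6785.662     # HC2 standard error
df_resid  <- 313          # residual degrees of freedom

# Test statistic: estimate divided by its standard error
t_stat <- estimate / robust_se

# Two-sided p-values under two common reference distributions
p_normal <- 2 * pnorm(-abs(t_stat))
p_t      <- 2 * pt(-abs(t_stat), df = df_resid)

# Approximate 95% confidence interval using the normal critical value
ci_95 <- estimate + c(-1, 1) * qnorm(0.975) * robust_se

print(round(c(t_stat = t_stat, p_normal = p_normal, p_t = p_t), 4))
print(round(ci_95))
```

The test statistic is roughly -1.96, so the result is best read as borderline significant at the 5% level rather than overwhelming evidence.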
Final Thoughts on Incinerator Construction in North Andover
The key assumption underlying the causal interpretation of the difference-in-differences estimator is the parallel trends assumption. In other words, we must assume that the effect of time (1981 vs 1978) on inflation-adjusted house selling prices would have been the same in the treatment group (near the incinerator) and the control group (not near the incinerator) had no incinerator been built. Put differently, we assume there is no omitted variable bias from time-variant variables other than the construction of the incinerator itself. There may, however, be other missing time-variant variables that are correlated with both incinerator proximity and housing values, such as the construction of a new school. If this were the case, we could not cleanly attribute the estimated effect to incinerator proximity. However, barring such omitted time-variant variables, this analysis indicates that close proximity to an incinerator does cause a decrease in housing value: an estimated drop of about 13,320 in inflation-adjusted selling price, significant at the 5% level.
References
Footnotes
¹ Here we use data from a 1995 paper studying housing prices during the various stages of incinerator siting operations (Kiel and McClain 1995).
Citation
@online{cutler2023,
author = {Victoria Cutler},
editor = {},
title = {Correlation Vs {Causation:} {Air} {Quality} and {Housing}
{Prices}},
date = {2023-07-27},
url = {https://victoriacutler.github.io/posts/correlation-vs-causation/},
langid = {en}
}