15 min read

Estimating causal effects from aggregated data

The publically available datasets from Statistics Sweden are aggregated tables. Groups with fewer than five records are filtered out not to have any individual data being made public.

In this post, I am going to investigate with what precision it is possible to estimate the causal effect of predictors using aggregated data. I will use the dataset CPS1988 which is contained in the AER library. Cross-section data originating from the March 1988 Current Population Survey by the US Census Bureau.

I will estimate the average treatment effect on wage for ethnicity and experience respectively. I will subclassify the predictors’ education and experience. I will balance on the observed confounders and make no attempts to handle unobserved covariates. I will not try to draw any conclusions on causal effects based on SUTVA assumptions.

First, define libraries and functions.

library (tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library (AER)
## Warning: package 'AER' was built under R version 4.0.5
## Loading required package: car
## Warning: package 'car' was built under R version 4.0.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 4.0.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.0.3
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.0.5
library (bnlearn)
## Warning: package 'bnlearn' was built under R version 4.0.3
library (PerformanceAnalytics)
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.0.3
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library (tableone)
## Warning: package 'tableone' was built under R version 4.0.4
library (Matching)
## Warning: package 'Matching' was built under R version 4.0.5
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## ## 
## ##  Matching (Version 4.9-9, Build Date: 2021-03-15)
## ##  See http://sekhon.berkeley.edu/matching for additional documentation.
## ##  Please cite software as:
## ##   Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## ##   Software with Automated Balance Optimization: The Matching package for R.''
## ##   Journal of Statistical Software, 42(7): 1-52. 
## ##
library (WeightIt)
## Warning: package 'WeightIt' was built under R version 4.0.5
library (lavaan)
## Warning: package 'lavaan' was built under R version 4.0.5
## This is lavaan 0.6-8
## lavaan is FREE software! Please report any bugs.
library (tidySEM)
## Registered S3 methods overwritten by 'tidySEM':
##   method              from           
##   print.mplus.model   MplusAutomation
##   print.mplusObject   MplusAutomation
##   summary.mplus.model MplusAutomation
## 
## Attaching package: 'tidySEM'
## The following objects are masked from 'package:bnlearn':
## 
##     nodes, nodes<-
library (cobalt)
## Warning: package 'cobalt' was built under R version 4.0.4
##  cobalt (Version 4.3.1, Build Date: 2021-03-30 09:50:18 UTC)
library (jtools)
## Warning: package 'jtools' was built under R version 4.0.5
## 
## Attaching package: 'jtools'
## The following object is masked from 'package:tidySEM':
## 
##     get_data

# Argument: Vector with binned values; Value: Numeric vector where each value is the mean of the binwidth
unbin_bin <- function(x){
  unbin_x <- function(x) (parse_number(unlist(strsplit(as.character(x), ",")))[1] + parse_number(unlist(strsplit(as.character(x), ",")))[2])/2

  unlist(map(x, unbin_x))
}

data(CPS1988)

CPS1988_n <- CPS1988 %>%
  mutate(education = as.numeric(education)) %>%
  mutate(experience = as.numeric(experience)) %>%
  mutate(region = as.numeric(region)) %>%
  mutate(smsa = as.numeric(smsa)) %>%
  mutate(parttime = as.numeric(parttime)) %>%
  mutate(ethnicity = as.numeric(ethnicity))

The correlation chart shows that many predictors are correlated with the response variable but also that many predictors are correlated with each other.

chart.Correlation(CPS1988_n, histogram = TRUE, pch = 19)

A Directed Ascyclical Graph (DAG) is a useful tool to identify backdoor paths, confounders, mediators and colliders. A DAG is usually constructed by expert knowledge in the problem domain. There are also algorithms for Bayesian networks that can estimate a DAG based on the statistical properties of the data, these estimations need to be validated against expert knowledge. I will estimate a DAG using a Bayesian network, the Hill Climbing (HC) algorithm.

hcmodel <- hc(CPS1988 %>%
  mutate(education = as.numeric(education)) %>%
  mutate(experience = as.numeric(experience)))

plot(hcmodel)

Structural Equation Modeling (SEM) is a tool to represent a system of regressions. I will use Lavaan to represent the DAG from the Bayesian network above.

semmodel = '
  education ~ wage
  wage ~ parttime
  experience ~ wage
  experience ~ parttime
  wage ~ ethnicity
  region ~ ethnicity
  region ~ smsa
  wage ~ smsa
  wage ~ region
  education ~ region
  education ~ parttime
  education ~ smsa
  smsa ~ ethnicity
  experience ~ education
  experience ~ ethnicity
  parttime ~ ethnicity
  '

semfit <- sem(semmodel,
  data =  CPS1988_n)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan
## WARNING: some observed variances are (at least) a factor 1000 times larger than
## others; use varTable(fit) to investigate

graph_sem(model = semfit)

Since ethnicity is a binary variable I will use the Match algorithm to find individuals that are as similar as possible from the two groups African American and Caucasian. I will do a greedy matching based on Mahalanobis distance.

xvars <- c("education", "experience", "smsa", "region", "parttime")

print(CreateTableOne(vars = xvars, strata = "ethnicity", data = CPS1988, test = FALSE), smd = TRUE)
##                         Stratified by ethnicity
##                          cauc          afam          SMD   
##   n                      25923          2232               
##   education (mean (SD))  13.13 (2.90)  12.33 (2.77)   0.284
##   experience (mean (SD)) 18.15 (13.04) 18.74 (13.51)  0.044
##   smsa = yes (%)         19095 (73.7)   1837 (82.3)   0.210
##   region (%)                                          0.644
##      northeast            6073 (23.4)    368 (16.5)        
##      midwest              6486 (25.0)    377 (16.9)        
##      south                7468 (28.8)   1292 (57.9)        
##      west                 5896 (22.7)    195 ( 8.7)        
##   parttime = yes (%)      2280 ( 8.8)    244 (10.9)   0.072

greedymatch <- Match(Tr = as.integer(CPS1988$ethnicity) - 1, M = 1, X = data.frame(data.matrix(CPS1988[xvars])), replace = FALSE)

matched <- CPS1988[unlist(greedymatch[c("index.treated", "index.control")]), ]

print(CreateTableOne(vars = xvars, strata = "ethnicity", data = matched, test = FALSE), smd = TRUE)
##                         Stratified by ethnicity
##                          cauc          afam          SMD   
##   n                       2232          2232               
##   education (mean (SD))  12.33 (2.76)  12.33 (2.77)   0.001
##   experience (mean (SD)) 18.71 (13.44) 18.74 (13.51)  0.003
##   smsa = yes (%)          1837 (82.3)   1837 (82.3)  <0.001
##   region (%)                                          0.003
##      northeast             368 (16.5)    368 (16.5)        
##      midwest               375 (16.8)    377 (16.9)        
##      south                1293 (57.9)   1292 (57.9)        
##      west                  196 ( 8.8)    195 ( 8.7)        
##   parttime = yes (%)       244 (10.9)    244 (10.9)  <0.001

matched <- matched %>% mutate(ethnicity_n = as.integer(ethnicity) - 1)

t.test(matched$wage[matched$ethnicity_n == 1] - matched$wage[matched$ethnicity_n == 0])
## 
##  One Sample t-test
## 
## data:  matched$wage[matched$ethnicity_n == 1] - matched$wage[matched$ethnicity_n == 0]
## t = -12.989, df = 2231, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -131.18979  -96.77381
## sample estimates:
## mean of x 
## -113.9818

Another way to estimate the causal effect of ethnicity on wage is by calculating the propensity score. By regressing on the treatment, i.e. the variable that we want to calculate the effect for, we can reduce the selection bias by balancing on the covariates. Below you can see the balancing before and after using the propensity score.

W.out <- weightit(ethnicity ~ education + experience + smsa + region + parttime, 
  data = CPS1988, method = "ebal")

model_lm_ethnicity <- lm(wage ~ ethnicity, data = CPS1988, weights = W.out$weights)

bal.tab(ethnicity ~ education + experience + smsa + region + parttime, 
  data = CPS1988, estimand = "ATT", m.threshold = .05)
## Balance Measures
##                     Type Diff.Un      M.Threshold.Un
## education        Contin. -0.2909 Not Balanced, >0.05
## experience       Contin.  0.0435     Balanced, <0.05
## smsa_yes          Binary  0.0864 Not Balanced, >0.05
## region_northeast  Binary -0.0694 Not Balanced, >0.05
## region_midwest    Binary -0.0813 Not Balanced, >0.05
## region_south      Binary  0.2908 Not Balanced, >0.05
## region_west       Binary -0.1401 Not Balanced, >0.05
## parttime_yes      Binary  0.0214     Balanced, <0.05
## 
## Balance tally for mean differences
##                     count
## Balanced, <0.05         2
## Not Balanced, >0.05     6
## 
## Variable with the greatest mean difference
##   Variable Diff.Un      M.Threshold.Un
##  education -0.2909 Not Balanced, >0.05
## 
## Sample sizes
##      cauc afam
## All 25923 2232

bal.tab(W.out, m.threshold = .05, disp.v.ratio = TRUE)
## Call
##  weightit(formula = ethnicity ~ education + experience + smsa + 
##     region + parttime, data = CPS1988, method = "ebal")
## 
## Balance Measures
##                     Type Diff.Adj     M.Threshold V.Ratio.Adj
## education        Contin.   0.0001 Balanced, <0.05      0.7731
## experience       Contin.   0.0000 Balanced, <0.05      0.9613
## smsa_yes          Binary   0.0000 Balanced, <0.05           .
## region_northeast  Binary   0.0000 Balanced, <0.05           .
## region_midwest    Binary  -0.0000 Balanced, <0.05           .
## region_south      Binary  -0.0000 Balanced, <0.05           .
## region_west       Binary  -0.0000 Balanced, <0.05           .
## parttime_yes      Binary   0.0001 Balanced, <0.05           .
## 
## Balance tally for mean differences
##                     count
## Balanced, <0.05         8
## Not Balanced, >0.05     0
## 
## Variable with the greatest mean difference
##   Variable Diff.Adj     M.Threshold
##  education   0.0001 Balanced, <0.05
## 
## Effective sample sizes
##                cauc   afam
## Unadjusted 25923.   2232. 
## Adjusted   25833.39 1233.1

Estimate the causal effect of experience on wage by calculating the propensity score.

W.out <- weightit(experience ~ ethnicity + education + smsa + region + parttime,
  data = CPS1988, method = "ebal")

model_lm_experience <- lm(wage ~ experience, data = CPS1988, weights = W.out$weights)

bal.tab(experience  ~ ethnicity + education + smsa + region + parttime,
  data = CPS1988, estimand = "ATT", m.threshold = .05)
## Balance Measures
##                     Type Corr.Un
## ethnicity_afam    Binary  0.0121
## education        Contin. -0.2867
## smsa_yes          Binary -0.0397
## region_northeast  Binary  0.0251
## region_midwest    Binary -0.0166
## region_south      Binary  0.0114
## region_west       Binary -0.0212
## parttime_yes      Binary -0.0942
## 
## Sample sizes
##     Total
## All 28155

bal.tab(W.out, m.threshold = .05, disp.v.ratio = TRUE)
## Call
##  weightit(formula = experience ~ ethnicity + education + smsa + 
##     region + parttime, data = CPS1988, method = "ebal")
## 
## Balance Measures
##                     Type Corr.Adj Diff.Adj     M.Threshold
## ethnicity_afam    Binary       -0       -0 Balanced, <0.05
## education        Contin.       -0        0 Balanced, <0.05
## smsa_yes          Binary       -0       -0 Balanced, <0.05
## region_northeast  Binary        0        0 Balanced, <0.05
## region_midwest    Binary        0        0 Balanced, <0.05
## region_south      Binary       -0       -0 Balanced, <0.05
## region_west       Binary       -0       -0 Balanced, <0.05
## parttime_yes      Binary        0        0 Balanced, <0.05
## 
## Balance tally for target mean differences
##                     count
## Balanced, <0.05         8
## Not Balanced, >0.05     0
## 
## Variable with the greatest target mean difference
##   Variable Diff.Adj     M.Threshold
##  education        0 Balanced, <0.05
## 
## Effective sample sizes
##               Total
## Unadjusted 28155.  
## Adjusted   25793.54

Let’s now investigate how aggregating the numerical predictors in the data affects the precision when estimating the causal effect on wage. I will also filter out groups with less than 5 persons in them so that any individual can not be identified in the material. The binning and filtering reduces the original dataset by 98.9 %. By expanding the reduced dataset the original dataset can be estimated.

CPS1988_refi <- CPS1988 %>%
  mutate(education = as.numeric(education)) %>%
  mutate(experience = as.numeric(experience)) %>%
  mutate(education = cut_interval(education, 5)) %>%
  mutate(experience = cut_interval(experience, 5)) %>%
  group_by(education, experience, ethnicity, smsa, region, parttime) %>%
  mutate (wage = mean(wage)) %>%
  group_by(wage, education, experience, ethnicity, smsa, region, parttime) %>% 
  tally() %>%
  mutate(experience = unbin_bin(experience)) %>%
  mutate(education = unbin_bin(education)) %>%
  filter(n > 4)  

dim(CPS1988_refi)
## [1] 302   8

CPS1988_refiexp <- CPS1988_refi[rep(seq(nrow(CPS1988_refi)), CPS1988_refi$n),]

dim(CPS1988_refiexp) 
## [1] 27797     8

How has the reduction affected the DAG? It is not possible to use weights in the hc algorithm. Therefore I will use the expanded table based on the data in the aggregated table.

hcmodel_refiexp <- hc(dplyr::select(CPS1988_refiexp %>%
  mutate(education = as.numeric(education)) %>%
  mutate(experience = as.numeric(experience)), -n))

plot(hcmodel_refiexp)

For comparison, I will plot the coefficients for ethnicity for the linear model based on the original data, aggregated data and the expanded data. I will use robust (Heteroskedasticity-Consistent) error estimates.

model <- lm(wage ~ ethnicity, data = CPS1988)

model_refi <- lm(wage ~ ethnicity, data = CPS1988_refi, weights = n)

model_refiexp <- lm(wage ~ ethnicity, data = CPS1988_refiexp)

plot_summs(model, model_refi, model_refiexp, robust = "HC1", 
  model.names = c(
    "Original dataset", 
    "Reduced dataset using weights", 
    "Expanded dataset"))
## Loading required namespace: broom.mixed
## Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
## TMB was built with Matrix version 1.3.2
## Current Matrix version is 1.2.18
## Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package

Now let’s estimate the causal effect of ethnicity on wage using propensity scores. I will use the expanded dataset since I did not get any reasonable results using the argument s.weights in weightit.

W.out <- weightit(ethnicity ~ education + experience + smsa + region + parttime,
  data = CPS1988_refiexp, method = "ebal")

model_refiexp <- lm(wage ~ ethnicity, data = CPS1988_refiexp, weights = W.out$weights)

coeftest(model_refiexp, vcov = vcovHC, type = "HC1")
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)    616.8706     1.4634 421.541 < 2.2e-16 ***
## ethnicityafam -133.7648     6.1773 -21.654 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bal.tab(ethnicity ~ education + experience + smsa + region + parttime,
  data = CPS1988_refiexp, estimand = "ATT", m.threshold = .05)
## Balance Measures
##                     Type Diff.Un      M.Threshold.Un
## education        Contin. -0.2050 Not Balanced, >0.05
## experience       Contin. -0.0567 Not Balanced, >0.05
## smsa_yes          Binary  0.0995 Not Balanced, >0.05
## region_northeast  Binary -0.0784 Not Balanced, >0.05
## region_midwest    Binary -0.0800 Not Balanced, >0.05
## region_south      Binary  0.3059 Not Balanced, >0.05
## region_west       Binary -0.1475 Not Balanced, >0.05
## parttime_yes      Binary -0.0052     Balanced, <0.05
## 
## Balance tally for mean differences
##                     count
## Balanced, <0.05         1
## Not Balanced, >0.05     7
## 
## Variable with the greatest mean difference
##      Variable Diff.Un      M.Threshold.Un
##  region_south  0.3059 Not Balanced, >0.05
## 
## Sample sizes
##      cauc afam
## All 25732 2065

bal.tab(W.out, m.threshold = .05, disp.v.ratio = TRUE)
## Call
##  weightit(formula = ethnicity ~ education + experience + smsa + 
##     region + parttime, data = CPS1988_refiexp, method = "ebal")
## 
## Balance Measures
##                     Type Diff.Adj     M.Threshold V.Ratio.Adj
## education        Contin.   0.0001 Balanced, <0.05      0.6710
## experience       Contin.   0.0000 Balanced, <0.05      0.8969
## smsa_yes          Binary   0.0000 Balanced, <0.05           .
## region_northeast  Binary   0.0000 Balanced, <0.05           .
## region_midwest    Binary   0.0001 Balanced, <0.05           .
## region_south      Binary  -0.0000 Balanced, <0.05           .
## region_west       Binary  -0.0001 Balanced, <0.05           .
## parttime_yes      Binary  -0.0000 Balanced, <0.05           .
## 
## Balance tally for mean differences
##                     count
## Balanced, <0.05         8
## Not Balanced, >0.05     0
## 
## Variable with the greatest mean difference
##        Variable Diff.Adj     M.Threshold
##  region_midwest   0.0001 Balanced, <0.05
## 
## Effective sample sizes
##               cauc    afam
## Unadjusted 25732.  2065.  
## Adjusted   25650.5 1094.35

plot_summs(model_lm_ethnicity, model_refiexp, scale = TRUE, robust = "HC1", 
  model.names = c(
    "Original dataset", 
    "Expanded dataset"))

Let’s compare the coefficients for experience for the linear model based on the original data, aggregated data and the expanded data.

model <- lm(wage ~ experience, data = CPS1988)

model_refi <- lm(wage ~ experience, data = CPS1988_refi, weights = n)

model_refiexp <- lm(wage ~ experience, data = CPS1988_refiexp)

plot_summs(model, model_refi, model_refiexp, robust = "HC1",
  model.names = c(
    "Original dataset", 
    "Reduced dataset using weights", 
    "Expanded dataset"))

Let’s estimate the causal effect of ethnicity on wage using propensity scores. I will use the expanded dataset since I did not get any reasonable results using the argument s.wights in weightit.

W.out <- weightit(experience ~ ethnicity + education + smsa + region + parttime,
  data = CPS1988_refiexp, method = "ebal")

model_refiexp <- lm(wage ~ experience, data = CPS1988_refiexp, weights = W.out$weights)

coeftest(model_refiexp, vcov = vcovHC, type = "HC1")
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 452.45062    2.25958 200.237 < 2.2e-16 ***
## experience    8.60394    0.14098  61.029 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bal.tab(experience ~ ethnicity + education + smsa + region + parttime,
  data = CPS1988_refiexp, estimand = "ATT", m.threshold = .05)
## Balance Measures
##                     Type Corr.Un
## ethnicity_afam    Binary -0.0143
## education        Contin. -0.2736
## smsa_yes          Binary -0.0339
## region_northeast  Binary  0.0256
## region_midwest    Binary -0.0156
## region_south      Binary  0.0092
## region_west       Binary -0.0202
## parttime_yes      Binary -0.1072
## 
## Sample sizes
##     Total
## All 27797

bal.tab(W.out, m.threshold = .05, disp.v.ratio = TRUE)
## Call
##  weightit(formula = experience ~ ethnicity + education + smsa + 
##     region + parttime, data = CPS1988_refiexp, method = "ebal")
## 
## Balance Measures
##                     Type Corr.Adj Diff.Adj     M.Threshold
## ethnicity_afam    Binary       -0       -0 Balanced, <0.05
## education        Contin.       -0       -0 Balanced, <0.05
## smsa_yes          Binary       -0       -0 Balanced, <0.05
## region_northeast  Binary        0        0 Balanced, <0.05
## region_midwest    Binary        0        0 Balanced, <0.05
## region_south      Binary       -0       -0 Balanced, <0.05
## region_west       Binary       -0       -0 Balanced, <0.05
## parttime_yes      Binary        0       -0 Balanced, <0.05
## 
## Balance tally for target mean differences
##                     count
## Balanced, <0.05         8
## Not Balanced, >0.05     0
## 
## Variable with the greatest target mean difference
##   Variable Diff.Adj     M.Threshold
##  education       -0 Balanced, <0.05
## 
## Effective sample sizes
##               Total
## Unadjusted 27797.  
## Adjusted   25618.89

plot_summs(model_lm_experience, model_refiexp, robust = "HC1",
  model.names = c(
    "Original dataset", 
    "Expanded dataset"))