MRP, simplified

Demonstrating my STEM expertise

The irony of my life is that I failed high school math, went to graduate school to study political theory…and left with a dissertation about something called “multi-level regression, imputation, and post-stratification” (MRP, though it should be called “MRIP”). Was this God’s joke? My newfound passion? My brilliant adaptation to the vagaries of the post-industrial job market, where the “STEM” shibboleth strikes fear and awe in the hearts of men? No one knows…

Regardless, MRP is a topic that interests many in social science, if only because it has triggered a tidal wave of published papers since approximately 2009. MRP is a statistical technique that models the desired attitudinal propensities (party, ideology, presidential approval, and everything in between) of different groups (separated by geography [region, state, congressional district, county, and every combination thereof] and demographics), then weights those propensities by each group’s share of the population in each jurisdiction. You can thus feed a national poll or concatenation of polls into MRP (different weighting and coverage frames between the polls? doesn’t matter) and arrive at a point estimate of (say) the proportion of liberals in Aroostook County, Maine. MRP can produce this estimate even if not a single Aroostook resident was polled.
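To make the weighting step concrete, here is a minimal sketch in base R. Everything in it is invented for illustration: the cell predictions stand in for what a multilevel model would produce, the population counts are not from any real census table, and the column names are my own.

```r
# Hypothetical cell-level predictions from a multilevel model:
# Pr(liberal) for each county-by-education cell (all numbers invented).
cells <- data.frame(
  county = c("A", "A", "A", "B", "B", "B"),
  educ   = c("hs", "college", "grad", "hs", "college", "grad"),
  pred   = c(0.20, 0.35, 0.50, 0.25, 0.40, 0.55),
  pop    = c(6000, 3000, 1000, 2000, 3000, 5000)
)

# Post-stratify: the population-weighted mean of the cell predictions,
# computed separately for each county.
estimates <- sapply(
  split(cells, cells$county),
  function(d) weighted.mean(d$pred, d$pop)
)
print(round(estimates, 3))
```

County A comes out at 0.275 and county B at 0.445. Most of that gap comes from who lives in each county rather than from the cell predictions themselves, which is the point of the post-stratification step.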

The technique’s raison d’être is that comprehensive public opinion data do not exist at levels below that of the state. (Even at the state level, catastrophic non-response error means that Trump can win Wisconsin even though not one of the dozens of polls between the primaries and general election showed him in the lead.) Polls are expensive commercial products that do not spontaneously appear to meet social-scientific demand. It is unlikely that newspapers would pay for polls of every U.S. Congressional district when almost all House races are virtually uncontested. It’s even less likely that the National Science Foundation would sponsor such a poll, as it would mean House members spending taxpayer money to investigate (say) whether, in this new Gilded Age of unequal democracy, they are responsive, effective legislators.

MRP is a solid replacement for the older way of doing things. It’s superior to raw disaggregation, which involves taking national polls and operationalizing (say) state public opinion as the raw sample means of survey responses by jurisdiction. Raw disaggregation is problematic because the samples are not geographically comprehensive within each state; in a national poll, all the respondents in Delaware are probably going to be in Wilmington. MRP is also superior to using various proxies for public opinion, for instance presidential vote as a measure of a Congressional district’s liberalism. The issue here is simple validity: factors other than ideology usually affect the outcome of the presidential race.
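For comparison, raw disaggregation amounts to very little code. The sketch below uses a made-up six-respondent poll; the data frame and its column names are hypothetical.

```r
# Hypothetical national poll: one row per respondent (values invented).
poll <- data.frame(
  state   = c("DE", "DE", "ME", "ME", "ME", "WI"),
  liberal = c(1, 1, 0, 1, 0, 0)
)

# Raw disaggregation: the unadjusted sample mean and sample size per state.
state.mean <- tapply(poll$liberal, poll$state, mean)
state.n    <- tapply(poll$liberal, poll$state, length)
disagg <- data.frame(
  state = names(state.mean),
  mean  = as.vector(state.mean),
  n     = as.vector(state.n)
)
print(disagg)
```

In this toy poll the Wisconsin "estimate" rests on a single respondent, and any state with no respondents simply drops out of the table.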

The two things to know about MRP are:

  1. It’s still very flawed. Assessing the validity and reliability of MRP is a fraught endeavor. MRP, like all simulations of public opinion, produces none of the standard errors of (say) commercial polls. In most cases, no population parameter exists against which to compare the measure (in other words, the true ideological composition of Aroostook County is unobserved; otherwise, MRP wouldn’t be necessary in the first place). Exactly how it produces estimates is redolent of mystery meat. For instance, you often get identical results after making substantial changes to your individual-level model, especially in jurisdictions well-represented in the poll you’re disaggregating. All we really know is that within-jurisdiction attitude heterogeneity is bad for MRP (imputation is harder if attitudes are polarized within counties or House districts, etc.), and that incorporating aggregate effects in the model (for instance, a county’s demographic composition) is good. In all likelihood, a cranky senior American behavioralist will publish a debunking thinkpiece and end the MRP golden age. If you’re in this line of work: publish now!
  2. It’s still better than any known alternative.

I’ve been working on two MRP-related things lately, and was tired of having to write the same R script over and over again. (For a basic overview of how to estimate MRP in R, see Kastellec et al.’s primer.) I’ve decided to write my own MRP library with the following simple features:

  • An MRP class to wrap the various methods needed to impute estimates and run diagnostics.
  • Ease of use. Just define the paths for the source data files, important variables, and settings, run the MRP, and you’re done.
  • Descriptive output. Output includes observations in the sample, the unadjusted sample mean, the MRP point estimate, and the weighted variance of the point estimate by jurisdiction. (The last of these is important in House/Senate responsiveness scholarship, which deems the heterogeneity of public opinion a crucial factor in the style and extent of representation.)
  • A robustness check. Includes a method that lets you see for yourself whether the MRP is outperforming raw disaggregation as an estimate of public opinion.

The source code is here: https://github.com/cliffordvickrey/easyMRP

Documentation

As I’m pressed for time, what follows is very terse, elliptical, and incomplete; I will be updating this post and library as my schedule permits.

Setting it up is very simple.

  1. Install the packages arm (Andrew Gelman’s multilevel/HLM package; R package names are case-sensitive) and Hmisc (for weighted variance).
  2. Download my MRP library.
  3. Include it using the “source” command.

General usage

Create the MRP class:

mrp<-Mrp()

Using R's S4 class slot selector ("@"), set the following properties:


1. REQUIRED

mrp.model: String containing the binomial logit model used in the multilevel step. MRP invariably uses mixed-effects estimation. For example, "Y ~ X + (1|Z)" indicates "model Y as a function of fixed effect X and random effect Z." Fixed effects are the familiar predictors of ordinary least squares (OLS) regressions. Random effects are a little trickier to grasp, and their disambiguation from fixed effects in statistical manuals is fraught with semantic complications (it's unclear if "fixed" or "random" effects are even the right terms to use), so I here assume you have some familiarity with them. The arm estimator fits a different intercept for each category of a random predictor (for instance, one for Alabama, one for Alaska, etc.). In theory, it's good practice not to model predictors with fewer than 7 or so categories as random effects, but in practice this doesn't really matter. The library assigns "0" to categories missing from the survey data.

state.var: Column name indicating an ordered list (1,2,...51) of states/Washington D.C. Must be the same across all the files you use. Defaults to "state".

weight.var: Column name indicating weight in the post-stratification dataset. Defaults to "weight".

convert.underscore: Boolean for whether or not you wish to convert Stata-like underscores ("_") to R-like periods (".") in the column names of imported files. Defaults to TRUE.

geographies.path: Location ('/path/to/file.csv') of a CSV containing columns of geographic information (e.g. names of counties and county IDs; see example). The MRP estimates, as well as sample sizes and weighted variance of MRP, will be appended to this file horizontally. Defaults to 'geographies.csv'.

state.fe.path: Location of a CSV containing aggregate fixed effects by state. These are merged into the poll and post-stratification dataframes.

poll.path: Location of a CSV containing the poll you wish to disaggregate.

post.strat.path: Location of the post-stratification data (in other words, the observed demographic characteristics of the districts for which you wish to produce estimates) for MRP's probability adjustment. Must contain every variable in the multilevel model. Usually, the model is post-stratified with either tables from the American Community Survey's Summary File (SF), or with the ACS's Public-Use Microdata Sample (PUMS).

output.path: Path to a CSV file where the MRP estimates and simulations will be saved. Defaults to "mrp.csv".

bin.path: Path to a directory ("/path/to") where the script will store compressed, binary versions of the dataframes in the MRP for future reference. The script will try to load its dataframes from this path first, before loading the CSVs.

2. OPTIONAL

substate.var: If you're estimating below the level of state (county, Congressional district, etc.), this is the identifier (again, should be an ordered list) for the desired geography.

year.var: If your poll is collapsed over several years, and fixed effects are merged into your poll and post-stratification dataframes by year, define a year variable (consistent across all files) here.

substate.fe.path: Path to a CSV containing aggregate variables for your substate jurisdiction.

Once you're done, use the following method to run the MRP:

mrp<-runMrp(mrp)
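If the mixed-effects formula syntax is unfamiliar, the following sketch shows what a string like "Y ~ X + (1|Z)" means when fitted directly. It uses lme4's glmer (the package that arm builds on) with simulated data; whether my library's internals match this call exactly is not something this snippet demonstrates, and all variable names and numbers here are made up.

```r
set.seed(1)

# Simulated respondents (invented): binary outcome y, binary predictor x,
# and a grouping variable z with 10 categories.
n <- 2000
z <- sample(1:10, n, replace = TRUE)
x <- rbinom(n, 1, 0.5)
z.effect <- rnorm(10, 0, 0.5)            # true varying intercepts
p <- plogis(-0.3 + 0.6 * x + z.effect[z])
d <- data.frame(y = rbinom(n, 1, p), x = x, z = factor(z))

# Fixed effect for x, varying ("random") intercept for each category of z.
model.string <- "y ~ x + (1 | z)"

# Only fit if lme4 is installed; the fit is a binomial logit.
if (requireNamespace("lme4", quietly = TRUE)) {
  fit <- lme4::glmer(as.formula(model.string), data = d,
                     family = binomial(link = "logit"))
  print(lme4::fixef(fit))      # intercept and the coefficient on x
  print(lme4::ranef(fit)$z)    # one estimated intercept shift per category of z
}
```

The varying intercepts are pulled toward zero relative to what ten separate dummy variables would give, which is what lets the model say something sensible about sparsely sampled categories.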

For robustness checks, you can use Lax and Phillips' (2009) procedure of split-level simulations. This procedure treats the sample mean of one random half of your data as a pseudo-population parameter, runs MRP on a random sample of the other half (say, 5%), and checks to make sure the MRP is doing a better job of predicting the parameter than the 5% sample mean.

# Run 10 simulations of the MRP
for(i in c(1:10))
    mrp<-validationSim(mrp,.05)

Sample output:

MRP split-level validation procedure using 1.00% sample of the survey...

Results:
Sampling error: 0.42
MRP error: 0.34
Win %: 0.74%

If your MRP is not beating the sample mean at a rate of well over 50%, either your model is terribly misspecified, or you made a coding mistake in one or more of your source data files.
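The logic of the procedure can be sketched in base R without any of my library's code. In the sketch below, the poll is simulated, and a simple shrinkage estimator stands in for the MRP step (it is not MRP, only a placeholder with the same inputs and outputs); the function names and the prior strength k are hypothetical choices of mine.

```r
set.seed(42)

# Simulated poll (invented data): 20 states with differing true support.
n.states <- 20
n <- 20000
state <- sample(seq_len(n.states), n, replace = TRUE)
true.p <- plogis(rnorm(n.states, 0, 0.5))
poll <- data.frame(state = state, y = rbinom(n, 1, true.p[state]))

# One split-sample round. `estimator` is any function(sample) returning a
# named vector of state estimates; here it stands in for the MRP step.
validation.round <- function(poll, frac, estimator) {
  half  <- sample(nrow(poll), floor(nrow(poll) / 2))
  truth <- tapply(poll$y[half], poll$state[half], mean)  # pseudo-parameter
  other <- poll[-half, ]
  samp  <- other[sample(nrow(other), ceiling(frac * nrow(other))), ]
  raw   <- tapply(samp$y, samp$state, mean)              # raw disaggregation
  est   <- estimator(samp)
  common <- intersect(names(truth), names(raw))          # states seen in both
  c(sampling.error = mean(abs(raw[common] - truth[common])),
    model.error    = mean(abs(est[common] - truth[common])))
}

# Stand-in estimator (NOT MRP): shrink each state mean toward the grand mean.
# The prior strength k = 20 is an arbitrary illustrative choice.
shrink <- function(samp, k = 20) {
  m   <- tapply(samp$y, samp$state, mean)
  n.s <- tapply(samp$y, samp$state, length)
  (n.s * m + k * mean(samp$y)) / (n.s + k)
}

# Ten rounds, then the share of rounds in which the model beats the raw mean.
rounds <- t(replicate(10, validation.round(poll, 0.05, shrink)))
win.rate <- mean(rounds[, "model.error"] < rounds[, "sampling.error"])
print(colMeans(rounds))
print(win.rate)
```

Swapping a real MRP fit in for `shrink` gives the comparison described above: mean absolute error against the held-out half, for the model and for the raw sample mean, plus the share of rounds the model wins.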

Example

This Zip offers a quick example of how to use the library. The script uses the 2009 Cooperative Congressional Election Study, an opt-in Internet survey that is the gold standard of American behavioralism, to model partisanship (1 = Democrat; 0 = not a Democrat). It then post-stratifies the predicted responses using the 2009 American Community Survey 5-year averages (tables C15002B through C15002I), and averages the re-adjusted responses by county. The R script included in the Zip is below.

############################################################
# MRP EXAMPLE                                              #
# Impute party ID at the county level using the 2009 CCES  #
#                                                          #
# Clifford Vickrey                                         #
############################################################

# include library
source('vickrey.mrp.lib.R')

# new MRP object
cces.mrp<-Mrp()

# variable names
cces.mrp@weight.var<-'pop'
cces.mrp@state.var<-'state'
cces.mrp@substate.var<-'county'

# path slots
wd<-getwd()
cces.mrp@geographies.path<-paste(wd,'csv','counties.csv',sep='/')
cces.mrp@state.fe.path<-paste(wd,'csv','state.fe.csv',sep='/')
cces.mrp@substate.fe.path<-paste(wd,'csv','county.fe.csv',sep='/')
cces.mrp@poll.path<-paste(wd,'csv','cces.truncated.csv',sep='/')
cces.mrp@post.strat.path<-paste(wd,'csv','post.strat.csv',sep='/')
cces.mrp@bin.path<-paste(wd,'bin',sep='/')
cces.mrp@output.path<-paste(wd,'cces.mrp.csv',sep='/')

# MRP model
# model Democrat (Y) as a function of:
#
# region (ICPSR region)
#
# state (1:51)
#
# state-level labor union rates
#
# whether states have right-to-work laws (as of 2007)
#
# county-level Republican presidential vote in 2004
#
# county population density
#
# sex (0 = male; 1 = female)
#
# race (1 = non-Latino white; 2 = African-American; 3 = Latino; 4 = Asian or
# Pacific Islander; 5 = Native American/Alaskan; 6 = more than one race; 7 =
# other)
#
# interaction term of sex and race
# 
# education (1 = less than high school; 2 = high school/GED; 3 = some college
# or Associate degree; 4 = Bachelor's degree or higher)
#
cces.mrp@mrp.model<-
    paste0(
        'democrat~(1|region)+(1|state)+state.labor.union+state.rtw+',
        '(1|county)+county.bush.vote04+county.pop.density+sex+(1|race)+',
        '(1|sex.by.race)+(1|educ)'
    )

# run!
cces.mrp<-runMrp(cces.mrp)

# sim
sim<-validationSim(cces.mrp,.05)

About cvickrey

Clifford Vickrey spends his days confounding the wise.