A closer look at the methodology behind YouGov's massive survey for CBS News and the New York Times
YouGov has created a panel of over 100,000 registered voters who will be interviewed four times between July and November for the nation’s newspaper of record and premier television network. The panelists come from each of the 50 states and 435 Congressional districts. They are interviewed on the internet and have been weighted to be representative of registered voters in each Congressional district using data from the U.S. Bureau of the Census and other sources. The weighting variables are age, race, gender, education, 2012 vote for President and Congress, and party identification.
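To illustrate the idea behind weighting, here is a minimal sketch of post-stratification on a single variable. It is not the procedure used for this survey, which weights on several variables at once. The function name, the group labels, and the population shares below are made up for the example.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Give each respondent a weight equal to population share / sample share of their group."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]

# Hypothetical sample: college graduates are overrepresented relative to the population.
sample = ["college"] * 60 + ["no_college"] * 40
population = {"college": 0.35, "no_college": 0.65}  # made-up target shares

weights = poststratification_weights(sample, population)

# After weighting, the weighted share of each group matches the target share.
weighted_college = sum(w for w, g in zip(weights, sample) if g == "college") / sum(weights)
print(f"weighted college share: {weighted_college:.2f}")
```

Respondents from an overrepresented group receive weights below 1 and those from an underrepresented group receive weights above 1, so the weighted sample reproduces the target shares.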
As with any survey, the estimates are subject to sampling error, with larger samples usually giving more accurate estimates than smaller ones. Panelists were selected disproportionately from the most competitive states and Congressional districts to provide more accurate estimates for these races. If we had divided the sample of 100,000 evenly among districts, we would have had only 250 persons in each district. Instead, we oversampled competitive districts (ending up with about 800 persons in each of the 60 most competitive districts), leaving smaller samples (around 150 people) in the less competitive ones. To say anything about the overall outcome of the election (which party will control the House and the Senate in the 114th Congress), we need estimates for every House and Senate race, even in places where our sample is too small to make the most reliable predictions.
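The trade-off between these sample sizes can be made concrete with the textbook margin of error for a proportion. The sketch below assumes simple random sampling and a 50/50 split, which a weighted internet panel does not strictly satisfy, so the figures indicate relative precision only and are not the survey's actual error margins. The function name is ours.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)

# District sample sizes mentioned in the text.
scenarios = [
    ("even split", 250),
    ("competitive district", 800),
    ("less competitive district", 150),
]

for label, n in scenarios:
    moe_points = 100 * margin_of_error(n)
    print(f"{label:>26}: n={n:4d}, margin of error about +/-{moe_points:.1f} points")
```

Because the margin shrinks with the square root of the sample size, moving from 250 to 800 respondents cuts it by a little under half, while dropping to 150 widens it.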
How can we make accurate predictions when the sample size is so small in many Congressional districts? We have quite a bit of information about the people who are not in our sample and the races in those districts. From the Census, we know their demographics. From the 2012 election returns, we know the proportion who voted for each candidate in 2012 (or didn’t vote). From the 2012 exit poll, we know the relationship between voter demographics and 2012 vote. And from our 2014 panel, we have data on how these variables relate to 2014 voting intentions.
We have combined these data into a statistical model that predicts 2014 vote on the basis of demographics and past vote. The model uses common patterns in the data to make estimates for people not interviewed. For example, if most of the 18-to-24-year-old white female respondents in the sample who voted for Romney in 2012 tell us that they intend to vote for the Republican Congressional candidate in 2014, the model predicts similar behavior for 18-to-24-year-old white female voters in a district where our sample doesn’t include any voters of this type. Where we have only a few voters in a particular group, we average the model predictions with the sample, with the model estimates discounted as the sample size in that group increases. These techniques have been developed by statisticians and are commonly used by the Census Bureau for small-area estimates.
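The averaging of model and sample can be sketched as a weighted average in which the weight on the sample grows with the number of respondents in the group. This is an illustration only: the function name, the weight formula n / (n + k), and the constant k are assumptions, not the formula used in the actual model.

```python
def blended_estimate(sample_share, n, model_share, k=20.0):
    """Weighted average of a group's sample share and the model prediction.

    k is a hypothetical tuning constant: the number of respondents at which
    the sample and the model receive equal weight. It is not from the article.
    """
    if n == 0:
        return model_share  # no respondents in the group: rely entirely on the model
    w = n / (n + k)  # weight on the sample, rising toward 1 as n grows
    return w * sample_share + (1 - w) * model_share

# Made-up numbers: the model predicts 70% Republican for a group,
# while the respondents in one district report 40%.
for n in (0, 5, 200):
    sample_share = 0.40 if n > 0 else None
    estimate = blended_estimate(sample_share, n, model_share=0.70)
    print(f"n={n:3d}: blended estimate = {estimate:.3f}")
```

With no respondents the estimate equals the model prediction. With a handful it stays close to the model, and with a large group it moves most of the way to what the respondents actually said.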
The output of the statistical model is a range of estimates for each state and Congressional district, reflecting the uncertainty around the predictions from the model. We have made 100,000 random draws from these estimates, producing thousands of different combinations of election outcomes: some with Republicans winning both houses, others with Democrats maintaining control of the Senate, and countless variations of outcomes for particular races. In these simulations, we can calculate the proportion of times that, say, the Republicans win 51 or more seats in the Senate. This is the probability of Republican control of the Senate implied by the model.
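A simulation of this kind can be sketched in a few lines. Every input below is invented for illustration: the race margins, their standard errors, and the count of seats treated as settled are not YouGov's estimates. The sketch also draws each race independently from a normal distribution, a simplification, since errors across races are likely to be correlated in practice.

```python
import random

# Hypothetical inputs: (mean Republican margin in points, standard error) per contested race.
RACES = [(6.0, 4.0), (3.0, 4.0), (1.0, 5.0), (0.5, 4.0), (-1.0, 5.0),
         (-2.0, 4.0), (-4.0, 4.0), (8.0, 3.0), (-7.0, 3.0), (2.0, 5.0)]
SEATS_NOT_IN_PLAY = 45  # hypothetical number of seats counted as Republican without simulation

def simulate_control_probability(races, base_seats, seats_needed=51,
                                 n_draws=100_000, seed=0):
    """Share of random draws in which the party reaches seats_needed seats."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_draws):
        # A race is won in this draw if the simulated margin is positive.
        won = sum(1 for mean, se in races if rng.gauss(mean, se) > 0)
        if base_seats + won >= seats_needed:
            successes += 1
    return successes / n_draws

probability = simulate_control_probability(RACES, SEATS_NOT_IN_PLAY)
print(f"Share of draws with 51 or more seats: {probability:.3f}")
```

Each draw is one possible election night. The reported probability is simply the fraction of draws in which the seat total reaches the threshold.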