Posted on 8th May 2016
Words used by the media to describe primary races in New Hampshire, South Carolina, and Nevada include “sweeping”, “decisive”, and “easy”. CNN’s election data, formatted in a way that makes parsing and reshaping easy, make it possible to look a little closer at the magnitude of these wins, in terms of the popular vote. I am not a frequent user of JSON data, so this post moves slowly through the JSON file structure with appreciation to Bob Rudis for sharing examples and data source URLs. Credit for the source data preparation goes to the folks at CNN, whose timely reporting on the elections can be found here.
Section One of this post describes the structure of the downloaded JSON files, which are deeply nested and hierarchical, and steps to reshape the data into a format useful for creating basic, descriptive statistics. Section Two describes the calculation of the margin of victory for each winning candidate and provides the code used to graph the margins, side-by-side, by party and race.
Section One: Preparing the data
I used Jeroen Ooms' jsonlite package to parse the files containing the Democratic and Republican race results.
library(curl, lib.loc = "c:/r/packages")
library(jsonlite, lib.loc = "c:/r/packages")
nhR <- fromJSON("http://data.cnn.com/ELECTION/2016primary/NH/county/R.json", flatten = TRUE)
nhD <- fromJSON("http://data.cnn.com/ELECTION/2016primary/NH/county/D.json", flatten = TRUE)
Passing the lists created in the last step to the summary() function provides a quick view of the files' structure. summary() is intended for statistical summaries, but it is also a generic function, so you can pass it different kinds of objects. Each object in R has a length, a class, and a mode, which describes how the object is stored in memory. In this case, the summary() output lists each object and whether it is numeric, character, or a nested list. The downloaded files contained:
R, so these are vectors of length 1;
$race represents the second level of information in the $counties object. Its fourth member, "candidates", is the level where the race results are stored. The 15 column names of nhD$counties$race.candidates can be retrieved by flattening the list (select distinct records to avoid having a list for each county) with the following command.
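One way to do this, sketched here on a toy list rather than the CNN download (the `candidates_by_county` object and its two columns are assumptions for illustration, not the real 15 columns):

```r
# Toy stand-in for nhD$counties$race.candidates: one data frame per county.
# Column names and values are illustrative only.
candidates_by_county <- list(
  data.frame(lname = c("Clinton", "Sanders"), votes = c(10, 20)),
  data.frame(lname = c("Clinton", "Sanders"), votes = c(30, 25))
)

# Collect the column names of every county's data frame,
# then keep each distinct name once.
col_names <- unique(unlist(lapply(candidates_by_county, names)))
print(col_names)
```

Because every county shares the same columns, unique() collapses the repeated names into a single set.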
The melt() function in Hadley Wickham's reshape2 package can flatten the race results before joining them to counties, one level up. There are many ways to flatten JSON files, one nicely described in this post by Julia Silge. The following steps work through the hierarchy slowly.
library(reshape2)
nhd1 <- melt(nhD$counties$race.candidates, id = 1:15)
nhd2 <- nhD$counties[1:2]  # co_id and county name
# Add a sequence number and merge with the variable (L1) automatically created by melt().
nhd2$L1 <- seq.int(nrow(nhd2))
nhdem <- merge(x = nhd1, y = nhd2, by = "L1", all.x = TRUE)  # Left outer join.
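To make the role of L1 concrete, here is a rough base R sketch of the same idea on made-up data: tag each county's candidate table with its position in the list, stack the tables, and join the county names by that position. The objects `candidates_by_county` and `counties`, and all their values, are hypothetical, and this is not meant to show how melt() works internally.

```r
# Hypothetical per-county candidate tables and county lookup.
candidates_by_county <- list(
  data.frame(lname = c("Clinton", "Sanders"), votes = c(10, 20)),
  data.frame(lname = c("Clinton", "Sanders"), votes = c(30, 25))
)
counties <- data.frame(co_id = c("001", "003"),
                       name  = c("Belknap", "Carroll"))

# Tag each county's rows with its list position (the role L1 plays),
# then stack the pieces into one long data frame.
stacked <- do.call(rbind, Map(function(df, i) cbind(df, L1 = i),
                              candidates_by_county,
                              seq_along(candidates_by_county)))

# Give the county table the same sequence number and left-join on it.
counties$L1 <- seq_len(nrow(counties))
long <- merge(x = stacked, y = counties, by = "L1", all.x = TRUE)
print(long)
```

The join only works because the county table and the candidate list are in the same order, which is the assumption the sequence number relies on.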
Transposing by county and candidate creates the wide-format file (candidate data in columns, counties in rows) needed for graphing.
nhdemVpct1 <- dcast(nhdem, L1 + name + co_id ~ lname, value.var = "pctDecimal")
nhdemVnum1 <- dcast(nhdem, co_id ~ lname, value.var = "votes")
nhdemVpct1 <- transform(nhdemVpct1, Clinton = as.numeric(Clinton))
nhdemVpct1 <- transform(nhdemVpct1, Sanders = as.numeric(Sanders))
nhdem2 <- merge(x = nhdemVpct1, y = nhdemVnum1, by = "co_id", all.x = TRUE)
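For readers without reshape2, a similar long-to-wide step can be approximated with base R's reshape(). This is a swapped-in technique rather than the dcast() call used above, shown on a small made-up data frame (`long` and its values are assumptions):

```r
# Hypothetical long-format results: one row per county and candidate.
long <- data.frame(
  co_id = c("001", "001", "003", "003"),
  lname = c("Clinton", "Sanders", "Clinton", "Sanders"),
  votes = c(10, 20, 30, 25)
)

# One row per county, one votes.<candidate> column per candidate.
wide <- reshape(long, idvar = "co_id", timevar = "lname", direction = "wide")
print(wide)
```

Note that reshape() prefixes the new columns with the value name (votes.Clinton, votes.Sanders), whereas dcast() names them after the candidate alone.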
Section Two: Calculating the margin of victory and graphing the results
This calculation uses the definition provided by Magrino et al.: "half the difference in votes between the winner and the runnerup, rounded up if necessary." The Democratic races had only two candidates, making the calculation simple. To provide context, this minimum number of votes is then divided by the total number of votes cast.
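As a small worked example of that definition, here is a sketch with made-up vote counts. The helper name `margin_of_victory` and the numbers are hypothetical; the arithmetic follows the quoted definition and the division by total votes described above.

```r
# Half the winner/runner-up difference, rounded up, as a share of all votes.
margin_of_victory <- function(first, second, total) {
  min_votes <- ceiling((first - second) / 2)
  round(min_votes / total, digits = 2)
}

# Hypothetical county: 1,200 votes for the winner, 901 for the runner-up.
# (1200 - 901) / 2 = 149.5, rounded up to 150; 150 / 2101 is about 0.07.
example_margin <- margin_of_victory(first = 1200, second = 901, total = 2101)
print(example_margin)
```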
nhdem2 <- nhdem2[order(nhdem2$L1), ]
rownames(nhdem2) <- NULL
# Vote counts are the columns that merge() suffixed with ".y".
nhdem2$tvotes <- rowSums(nhdem2[, grepl("\\.y$", names(nhdem2))], na.rm = TRUE)
nhdem2$first <- do.call(pmax, nhdem2[, grepl("\\.y$", names(nhdem2))])
nhdvotes <- nhdem2[, grepl("\\.y$", names(nhdem2))]
nhdem2$second <- do.call(pmin, nhdem2[, grepl("\\.y$", names(nhdem2))])
nhdem2$minVotes <- ceiling((nhdem2$first - nhdem2$second) / 2)
nhdem2$margin <- round(nhdem2$minVotes / nhdem2$tvotes, digits = 2)
nhdem2$Winner <- ifelse(nhdem2$Sanders.y > nhdem2$Clinton.y, "Sanders", "Clinton")
# Check for ties (change the name of the winner to "Tied" if a tie exists)
ties <- subset(nhdem2, nhdem2$first == nhdem2$second)
ties
Calculating the margins for the Republican races, which started with 8 candidates, was a little more involved. An intermediate step was added before calculating the margins: the race results were sorted twice, once to find the winner and once to find the runner-up, then merged back to the original data frame.
nhrep2 <- nhrep2[order(nhrep2$L1), ]
rownames(nhrep2) <- NULL
nhrep2$tvotes <- rowSums(nhrep2[, grepl("\\.y$", names(nhrep2))], na.rm = TRUE)
nhrep2$first <- do.call(pmax, nhrep2[, grepl("\\.y$", names(nhrep2))])
nhrvotes <- nhrep2[, grepl("\\.y$", names(nhrep2))]
# First place
nhrWin <- as.data.frame(cbind(row.names(nhrvotes), apply(nhrvotes, 1, function(x) names(nhrvotes)[which(x == max(x))])))
nhrWin$L1 <- as.numeric(as.character(nhrWin$V1))
names(nhrWin)[names(nhrWin) == "V2"] <- "Winner"
nhrWin <- subset(nhrWin, select = c(Winner, L1))
nhrWin$Winner <- as.character(nhrWin$Winner)
# Second place: sort each row in decreasing order and keep the second value
nhr2nd <- as.data.frame(cbind(row.names(nhrvotes), apply(nhrvotes, 1, function(x) sort(x, decreasing = TRUE)[2])))
nhr2nd$L1 <- as.numeric(as.character(nhr2nd$V1))  # Convert to numeric.
names(nhr2nd)[names(nhr2nd) == "V2"] <- "second"
nhr2nd$second <- as.numeric(as.character(nhr2nd$second))
# Merge first and second: winner and 2nd highest vote count.
nhrk <- merge(x = nhr2nd, y = nhrWin, by = "L1")
nhrk$Winner <- sub(pattern = "\\.y$", replacement = "", x = nhrk$Winner)
# Reconnect to the source data frame
nhrep3 <- merge(x = nhrk, y = nhrep2, by = "L1")  # Merge with source
nhrep3$minVotes <- ceiling((nhrep3$first - nhrep3$second) / 2)
nhrep3$margin <- round(nhrep3$minVotes / nhrep3$tvotes, digits = 2)
# Check for ties (change the name of the winner to "Tied" if a tie exists)
ties <- subset(nhrep3, nhrep3$first == nhrep3$second)
ties
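The winner and runner-up lookup is easier to follow on a tiny example. The sketch below uses a made-up `votes` data frame with three candidates and two counties; the column names mimic the ".y" suffix used above, but the numbers are hypothetical and the use of which.max() is my shortcut rather than the exact code in this post.

```r
# Hypothetical vote counts: one row per county, one column per candidate.
votes <- data.frame(
  Trump.y  = c(500, 300),
  Kasich.y = c(450, 320),
  Cruz.y   = c(200, 100)
)

# Winner: the column holding each row's largest count, suffix removed.
# which.max() returns the first maximum, so a tie is not flagged here.
winner <- sub("\\.y$", "", names(votes)[apply(votes, 1, which.max)])

# Runner-up count: second value after sorting each row in decreasing order.
second <- unname(apply(votes, 1, function(x) sort(x, decreasing = TRUE)[2]))

print(data.frame(winner, second))
```

Because which.max() silently picks the first of any tied candidates, a separate tie check like the one above is still needed.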
The bar charts in this post plot the number of votes cast for each winning candidate against that candidate's margin of victory.
nhrtabl <- aggregate(first ~ margin + Winner, data = nhrep3, sum, na.rm = TRUE)
nhrbubble <- aggregate(tvotes ~ margin + Winner, data = nhrep3, sum, na.rm = TRUE)
nhrtabl <- merge(x = nhrtabl, y = nhrbubble, by = c("margin", "Winner"))
nhrtabl$radius <- round(sqrt(nhrtabl$tvotes / pi), digits = 1)  # Save for later
Now, to the fun part: graphing. Side-by-side bar charts showing the margins of victory, the winners, and the number of votes cast were created using Hadley Wickham's ggplot2 package and the gridExtra package created by Baptiste Auguie and Anton Antonov. A few searches on Stack Overflow helped find a way to add global variables, in this case the statewide total number of votes for each party's slate, to the chart's title.
By themselves, the charts aren't terribly informative. To get some perspective on the victories, I added a couple of markers for close races and more decisive victories using election data from the American Presidency Project. Vertical lines at 0.05 and 0.15 separate close races from races that were more one-sided. Margins in elections are generally calculated using electoral rather than popular votes, but for historical context, using the margin-of-victory formula used here, presidential races in the lower category would include the 2000 Bush-Gore (0.002) and 2008 Obama-McCain (0.04) elections.
library(ggplot2)
library(scales)  # provides comma() for the y-axis labels
# Bar charts: New Hampshire
votes <- as.character(format(sum(nhrep3$tvotes), big.mark = ","))
plot.title <- "New Hampshire Republican Primary Presidential Race"
plot.subtitle <- paste("Total popular votes: ", votes)
nhrepBar <- ggplot(nhrtabl, aes(x = margin, y = first, fill = Winner)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(limits = c(0, .5)) +
  scale_y_continuous(labels = comma, limits = c(0, 25000)) +
  xlab("Margin of Victory (Popular Vote)") +
  ylab("Votes Cast for Winner") +
  ggtitle(bquote(atop(.(plot.title), atop(italic(.(plot.subtitle)), "")))) +
  theme(axis.text.y = element_text(hjust = 1.2)) +
  geom_vline(xintercept = c(.05, .15)) +
  geom_text(x = .02, y = 20000, label = "Closer\nraces") +
  geom_text(x = .20, y = 20000, label = "Less even\nraces")
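The two vertical lines amount to a three-way grouping of the margins. As a rough sketch, the same cut points can be applied with base R's cut(); the category labels and the example margins are my own assumptions (only 0.002 and 0.04 come from the historical comparison above), and 0.5 is used as the upper bound because half the vote difference can never exceed half the total.

```r
# Example margins: two historical values plus two made-up ones.
margins <- c(0.002, 0.04, 0.07, 0.22)

# Bin the margins at the 0.05 and 0.15 cut points used for the vertical lines.
category <- cut(margins,
                breaks = c(0, 0.05, 0.15, 0.5),
                labels = c("Close", "In between", "One-sided"),
                include.lowest = TRUE)

print(data.frame(margins, category))
```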
A custom set of bar colors, varying shades of red and orange for Republicans and blues for Democrats, was added manually with scale_fill_manual(), one call per party:
+ scale_fill_manual(values=c("Cruz"="red", "Trump"="darkorange", "Kasich"="maroon", "Bush"="tomato", "Fiorina"="lightcoral", "Carson"="firebrick", "Rubio"="#b34d4d", "Christie"="#cc4400"))
+ scale_fill_manual(values=c("Clinton"="lightblue", "Sanders"="darkblue"))
The results appear below. With the exception of the Wisconsin Republican primary, there tended to be one winner across county races in most states, which does not make for particularly fascinating graphs. However, the charts suggest a range of preference levels among voters for particular candidates between parties and states.
New Hampshire Primary Results
Tuesday, February 9, 2016
South Carolina Primary Results
Saturday, February 27, 2016
Wisconsin Primary Results
Tuesday, April 5, 2016