
Capturing the Relationship between Election Polls and News

posted Jan 14, 2017, 9:39 AM by Jack Bandy
The following is from a group project I recently completed for my "digital assets at scale" course. Note that not all of the work is my own, as three other students worked on this project. Also note that the data set we analyzed is fairly incomplete, so no firm conclusions should be drawn.

This project used a homemade web crawler to download and inspect approximately 60,000 online news articles related to the 2016 election. Of these articles, approximately 3,000 contained direct links to published poll data. Using data from FiveThirtyEight, we tallied the articles and the polls they linked to.

Juicy stats

This graph (Reference Frequency for Grades) uses the poll grades given by FiveThirtyEight to show how often news articles linked to polls with specific grades. The graph is somewhat bimodal, with many links to A- polls but also a fair number of links to C+ polls. In a perfect world, news articles would only link to A or A+ polls, ignoring flawed poll results.

The second graph shows how often news articles linked to polls in particular states. Most are battleground (swing) states.

The final graph (Polls in final two months before election) shows polls taken over time, both cumulative (red) and by day. There are noticeable correlations between polls taken and the news cycle, including the final debate (see peak at 10/19 on graph), the sexual assault allegations against Trump (see peak at 10/25 on graph), and the newly unveiled emails related to Clinton (see peak at 10/30 on graph). This suggests that pollsters were, understandably, trying to account for shifting public attitudes by taking new polls.

The table (below) shows the 10 polls that were most linked to within the articles we inspected. "PollRank" is a modification of the PageRank algorithm, on which Google search is built, and can be thought of as a popularity score (i.e. a higher PollRank indicates more news articles linking to that poll).

The top two entries are generic urls rather than specific poll results, which means the top-ranked (most-cited) specific poll from our crawl was an “A” grade poll indicating a narrow lead for Clinton a week before the election. The next two are “B” and “C” grade polls that also indicate a Clinton lead. It was not immediately clear why Arizona, with 11 electoral votes and a strong Republican-voting history, would be so interesting to news articles; however, Bill Clinton is the only Democrat to have won the state since 1952 (in 1996), which may have made it interesting to some news sites. Generally, polls in the top 10 with “A” grades predicted outcomes consistent with the results, notably Trump leading in Florida, closing the gap in Pennsylvania, and steadily drawing closer in the popular vote. Polls in our top 10 with “B” or “C” grades predicted outcomes that favored Clinton in the popular vote.


(The poll urls, grades, and end dates in this table were lost when the page was exported; the surviving cells, ordered by PollRank, are reproduced below.)

Poll URL | Poll Grade | Poll Outcome (Clinton, Trump) | Poll End Date
? | ? | 46, 46 | ?
? | ? | 42, 44 | ?
? | ? | 49, 44 | ?
? | ? | 48, 40 | ?
? | ? | 43, 47 | ?
? | ? | 43, 45 | ?
? | ? | 46, 40 | ?
? | ? | 43.4, 37.8 | ?
? | ? | 45, 42 | ?
? | ? | 45, 46 | ?
? | ? | 44.4, 43.9 | ?
? | (no grade) | 44.9, 44.5 | ?

Nerdy details
  • Scraping Polls

    • Scraping the polls was straightforward: we scraped the urls of the original polls from two websites, both of which provide multiple lists of polls in chronological order.

      • From the two websites, we extracted 3,510 unique nationwide poll urls and 9,554 unique urls of statewide polls.

      • Two sources for scraping polls:

        • FiveThirtyEight

        • RealClearPolitics

    • For instance, the screenshot attached below is from RealClearPolitics. They have poll lists for nationwide polls, statewide polls, etc. The other picture attached below shows the HTML code for one particular poll. The “a” tag on line 300 contains the url of the original poll, which is what our poll crawler extracts from the website.

[Screenshot: RCP_Poll copy.jpg]
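The extraction itself can be sketched with Python's standard-library HTML parser. The markup below is a simplified stand-in for a poll-list row; the real page structure (including the line-300 anchor mentioned above) differs in detail:

```python
from html.parser import HTMLParser

class PollLinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a poll-list page
    (stdlib-only stand-in for the crawler's extraction step)."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

# Simplified, made-up stand-in for one poll-list row.
page = '<td><a href="http://example-pollster.com/poll/123">Example Poll</a></td>'
parser = PollLinkExtractor()
parser.feed(page)
print(parser.urls)  # the extracted poll urls
```

A real run would feed the fetched poll-list HTML to the parser instead of the inline snippet.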

  • Scraping News Aggregators

    • While crawling the domains of known news websites would give us millions of related articles, we would have to first crawl through hundreds of millions (or even over a billion) webpages. Because our 2016 election polls topic is so specific, it is likely only a small fraction of crawled articles would be relevant.

    • We therefore first crawled from a seed set of URLs that were obtained by scraping news aggregators such as Google News and News Look Up, from which we obtained a total of approximately 36,000 unique articles.

      • Keyword search queries such as “polls”, “2016 elections”, “trump leads”, “clinton leads”, “electoral college”, “battleground states”, and more were used to fetch as many related articles as possible.

      • Returned articles were also filtered by date (7/31 to 11/8 for Google News, and the past month for News Look Up) to maximize the proportion of articles relevant to the 2016 presidential election.

    • Upon fetching the HTML of a Google News search result page, our Google News scraper looks for three different types of anchor tags that have links to articles ('l _HId', '_sQb', and '_rQb'). Each of these three tags represents a different type of content (see figure below).

      • Note that Google does not usually permit scraping/crawling of their search results and blocks non-browser users, so we spoofed the user-agent of the scraper to do this. However, we did attempt to minimize the number of GET requests our scraper was making by forcing the return of 100 results at a time instead of the default 10.

  • The structure of the News Look Up search result page is much simpler as there is only a single class for links to articles (see figure below). However, we were limited to fetching articles that were published within the past month.
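The request construction for the aggregator scrapers can be sketched as follows. The `tbm=nws`, `num`, and `start` names are Google's commonly known search parameters, and the User-Agent string is an arbitrary browser-like stand-in, not necessarily the exact value the project used:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_news_query(keywords, start=0):
    """Build a Google News search request that asks for 100 results
    per page and carries a browser-like User-Agent header."""
    params = {
        "q": keywords,
        "tbm": "nws",   # news vertical
        "num": 100,     # 100 results per page instead of the default 10
        "start": start, # pagination offset
    }
    url = "https://www.google.com/search?" + urlencode(params)
    # Spoof a browser User-Agent, since Google blocks non-browser clients.
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

req = build_news_query("2016 elections polls")
print(req.full_url)
```

Fetching `req` with `urllib.request.urlopen` (not shown) would return the result page whose anchor tags are parsed as described above.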

  • Scraping Web Archive

    • Scraping news aggregators alone could not provide enough urls, so we also tried scraping from snapshots of the web.

    • We used the Internet Archive for this task. According to one online description:

      • The Internet Archive is a San Francisco–based nonprofit digital library with the stated mission of "universal access to all knowledge". It provides free public access to collections of digitized materials, including web sites, software applications/games, music, movies/videos, moving images, and nearly three million public-domain books. As of October 2016, its collection topped 15 petabytes.

    • We scraped over 340,000 distinct urls from the Internet Archive. Given more time, we could have scraped far more.

    • Root urls for scraping web archive

      • We used RealClearPolitics, the politics section of FiveThirtyEight, NewsLookUp, and the Associated Press as the root urls.

      • By using these four root urls, we hoped to extract more politics-related urls during the recursive crawling process.

    • The screenshot attached below is the HTML snippet for the snapshots of FiveThirtyEight on August 9, 2016, from the Internet Archive.

      • Their code is organized very well, and from it we can easily identify the information we want. For instance, three snapshots were taken on that day, and the url of each snapshot contains the date and time.

      • One thing we found interesting, and could not figure out, is that the Internet Archive includes the url of the very first snapshot of each day for each website in their HTML code, yet users cannot see it in the browser. We do not know why, but we guess it serves some internal purpose.
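Since each snapshot url embeds its capture timestamp, pulling out the date and time is straightforward. The url below is a made-up example in the Wayback Machine's `/web/YYYYMMDDhhmmss/` format:

```python
import re
from datetime import datetime

def snapshot_time(wayback_url):
    """Extract the capture date and time embedded in a Wayback Machine
    snapshot url (format: /web/YYYYMMDDhhmmss/original-url)."""
    m = re.search(r"/web/(\d{14})/", wayback_url)
    if m is None:
        raise ValueError("not a snapshot url: " + wayback_url)
    return datetime.strptime(m.group(1), "%Y%m%d%H%M%S")

# Hypothetical snapshot url; the timestamp here is illustrative.
url = "http://web.archive.org/web/20160809143000/http://fivethirtyeight.com/politics/"
print(snapshot_time(url))
```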

  • Creating a Graph

We used BeautifulSoup to parse links once a news article was retrieved, since most poll-related articles link directly to a poll. The resulting graph is in the file “data/depth0_graph.csv”.
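A minimal sketch of this step, assuming a hard-coded list of poll domains stands in for the full scraped poll-url list (the real crawler matched against the scraped list itself):

```python
import csv
import io
from bs4 import BeautifulSoup

# Assumption: a short domain list stands in for the scraped poll urls.
POLL_DOMAINS = ("realclearpolitics.com", "projects.fivethirtyeight.com")

def poll_links(article_url, article_html):
    """Return (article_url, poll_url) edges for every poll link found."""
    soup = BeautifulSoup(article_html, "html.parser")
    return [(article_url, a["href"])
            for a in soup.find_all("a", href=True)
            if any(d in a["href"] for d in POLL_DOMAINS)]

# Toy article HTML containing one poll link.
html = '<p>See <a href="http://realclearpolitics.com/epolls/2016/x.html">the poll</a>.</p>'
edges = poll_links("http://news.example.com/story", html)

# Write the edge list as CSV, the same shape as depth0_graph.csv.
buf = io.StringIO()
csv.writer(buf).writerows(edges)
print(buf.getvalue())
```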

  • Ranking Polls

To implement the PageRank algorithm, we used the Hadoop MapReduce framework; this part was written in Java. Our code has three mapper and two reducer classes: the first mapper parses the input file (i.e., the web link graph), and its reducer creates the input file for the second mapper. The second mapper assigns the initial PageRank values and sends them to the second reducer, which does the bulk of the PageRank calculation. The third mapper sorts the web pages by PageRank. Our final output is in the format <Rank, Link>. The figure shows a sample output of PageRank for one subdomain.

Job 1:
  • Input: <fromLink   toLinks>
  • Output: <link   link>
  • Assign the initial rank for every link.
  • Output: <link   initialRank   links>

Job 2:
  • Input: <link   initialRank   links>
  • Output: <link   rank   links>
  • Calculate the PageRank.
  • Output: <link   rank>

Job 3:
  • Sort the links according to rank.
  • Output: <Rank, Link>
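Outside Hadoop, the same three-stage computation can be sketched as a single-machine Python loop (illustrative only; the project's actual implementation was the Java MapReduce code described above):

```python
def pagerank(graph, damping=0.85, iters=50):
    """Iterative PageRank over an adjacency dict {node: [outlinks]}.
    Mirrors the jobs above: initialize ranks (job 1), iterate the
    rank update (job 2), then sort by rank (job 3)."""
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}           # job 1: initial value
    for _ in range(iters):                                # job 2: rank updates
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
        rank = new
    return sorted(rank.items(), key=lambda kv: -kv[1])    # job 3: sort by rank

# Toy link graph: two articles, both linking to the same poll.
graph = {"articleA": ["poll1"], "articleB": ["poll1", "poll2"]}
ranked = pagerank(graph)
print(ranked[0][0])  # the most-linked poll ranks first
```

In our setting the nodes are article and poll urls, so the top-ranked nodes are exactly the "PollRank" winners shown in the table above.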

  • Scaled crawl with scrapy

We were able to collect approximately 47,340 unique article urls from scraping news aggregators. (Internet Archive urls were not included in this batch, as there was a significant crawling delay in those retrievals.) We used this “seed set” of urls to start a much more expansive crawl, powered by scrapy. The first round of 47k urls was crawled in about 2 hours (400 pages per minute) on a Mac laptop with a home internet connection. This produced a set of 355,501 unique urls after de-duplicating links and removing links to polls or previously crawled articles. At the rate of 400 pages per minute, this list could have been completely crawled in about 14 hours.

However, in the second crawl (depth 1), scale became an issue. The spider had only crawled about 50,000 articles after running overnight (approximately 80 pages per minute). This was because many news sites refused connections after some number of requests, even using scrapy’s request delays and other spoofing/cloaking functionality. Eventually, we reached a configuration that allowed us to crawl depth1 links reasonably quickly, and could have completed the full list in about 31 hours. This was done by changing some of scrapy’s default settings and shuffling urls to minimize repetitive requests to a given server.
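The url shuffling can be implemented by bucketing urls by domain and interleaving the buckets round-robin. This is an illustrative sketch (the function name and seed are chosen here, not taken from the project code; the real crawl also relied on scrapy's own scheduling):

```python
import random
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def spread_by_domain(urls, seed=0):
    """Reorder urls so consecutive requests hit different servers."""
    buckets = defaultdict(list)
    for u in urls:
        buckets[urlparse(u).netloc].append(u)
    groups = list(buckets.values())
    random.Random(seed).shuffle(groups)   # randomize domain order
    # Round-robin across domains: take one url per domain per pass.
    out = []
    for layer in zip_longest(*groups):
        out.extend(u for u in layer if u is not None)
    return out

urls = ["http://a.com/1", "http://a.com/2", "http://b.com/1", "http://b.com/2"]
print(spread_by_domain(urls))  # alternates between a.com and b.com
```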

Crawling just a subset of depth1 links generated a list of over 2 million more articles (before de-duplication and filtering); however, we expect the percentage of those articles containing links to polls to be significantly lower, as shown by the difference between the depth0 and depth1 crawls. At the pace achieved during the depth1 crawl, our crawler would have taken quite a while to crawl 2 million articles.

Below is a summary of the statistics from our crawls:

(Most cells of this table were lost when the page was exported; the surviving entries are shown.)

Crawl Depth | Total Urls | Crawled Urls | Total Poll links | Poll links/Crawled URLs | Approximate Pages Per Minute
0 (aggregated list) | ? | ? | ? | ? | ?
1 | ? | ? | ? | ? | 80, 190 after adjustments