Friday, December 21, 2012

2012-12-21: The Performance of Betting Lines for Predicting the Outcome of NFL Games

It was the first week of the 2007 National Football League (NFL) season. After waiting all summer for the NFL season to begin, the fans were rabid with anticipation. The airwaves were filled with sportscasters debating the prospects of teams from both conferences and how they would perform.

Of particular interest were the New England Patriots. They had two starters out with injuries, and their star receiver, Randy Moss, was questionable for the game. New England was playing the NY Jets, and their simmering rivalry added fuel to the fire. Many of the sportscasters were lining up with the Jets, and Vegas was favoring the Jets with a 6-point line at home.

When betting opened for the game, the action on the Patriots was heavy. The sheer volume of bets placed on New England to win forced the sportsbooks to move the spread in an attempt to equalize betting on both sides. Eventually the line moved all the way to New England being a seven-point favorite by game time. New England went on to win the game 38 to 14, easily covering the spread. This is one example where the collective intelligence of NFL fans was confident that New England would win even when the "experts" thought otherwise.

We recently released a paper examining this phenomenon entitled The Performance of Betting Lines for Predicting the Outcome of NFL Games. A copy can be found on arXiv. In this paper we investigated how well the collective intelligence of NFL fans, as expressed in the Vegas betting lines, predicts the outcome of NFL games.

We found that although home teams beat the spread only 47% of the time, a strategy of betting on home team underdogs (2002 - 2011) would have produced a cumulative winning percentage of 53.5%, above the 52.38% threshold needed to break even against a 10% vigorish.
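
The 52.38% figure follows from the usual arithmetic of a 10% vigorish, where a bettor risks 110 to win 100. The sketch below is illustrative only: the function names and the 1,000-bet season are assumptions made for the example, not something taken from the paper.

```python
def break_even_win_rate(risk: float, win: float) -> float:
    """Fraction of bets that must win for zero net profit
    when every bet risks `risk` to win `win`."""
    return risk / (risk + win)


def net_result(win_rate: float, n_bets: int,
               risk: float = 110.0, win: float = 100.0) -> float:
    """Net profit after n_bets at a given win rate (hypothetical flat betting)."""
    wins = win_rate * n_bets
    losses = n_bets - wins
    return wins * win - losses * risk


# Risking 110 to win 100 is the standard 10% vigorish.
print(f"{break_even_win_rate(110, 100):.2%}")  # 52.38%

# A 53.5% win rate over a made-up 1,000 bets ends roughly 2,350 ahead.
print(round(net_result(0.535, 1000)))
```

At exactly the break-even rate the wins and the losses cancel, which is why a winning percentage only slightly above 50% is not enough to turn a profit.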



-- Greg Szalkowski

2012-12-20: NFL Power Rankings Week 16

The NFL Playoffs are only a few weeks away. With the end of the regular season in sight, there are a few trends that subtly change the game. One of the trends is the weather: Tennessee is playing at Green Bay and snow is in the forecast. Another end-of-season trend is displayed by teams that have clinched playoff positions. They rest their starting lineup and play backup players. That is more of a week 17 phenomenon, but with Atlanta and Houston both at 12-2 for the season, they may play some non-starters during the game. This ranking system is based on team performance and does not take trends like the weather into account.

Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser pointing to the winner, and it is weighted by the margin of victory.

In the PageRank model, each link from a webpage i to webpage j causes webpage i to give some of its own PageRank to webpage j. This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week the Redskins beat the Browns 38 to 21, so a directed edge from the Browns to the Redskins with a weight of 17 was created in the graph.
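
To make the voting idea concrete, here is a minimal sketch of a margin-weighted PageRank in plain Python. It is not the R script used for these rankings. The damping factor of 0.85, the even redistribution of rank from teams with no losses, and the function name are all assumptions made for the example. The three games are results quoted in these weekly posts.

```python
def weighted_pagerank(games, damping=0.85, iterations=100):
    """games: iterable of (winner, loser, margin_of_victory).
    Each loser 'votes' for the winner with weight equal to the margin."""
    teams = sorted({t for winner, loser, _ in games for t in (winner, loser)})
    edges = {}                       # (loser, winner) -> total margin
    out_weight = {t: 0.0 for t in teams}
    for winner, loser, margin in games:
        edges[(loser, winner)] = edges.get((loser, winner), 0.0) + margin
        out_weight[loser] += margin

    n = len(teams)
    rank = {t: 1.0 / n for t in teams}
    for _ in range(iterations):
        # Teams with no losses have no outgoing edges; spread their rank evenly.
        dangling = sum(rank[t] for t in teams if out_weight[t] == 0)
        new_rank = {t: (1 - damping) / n + damping * dangling / n for t in teams}
        for (loser, winner), weight in edges.items():
            new_rank[winner] += damping * rank[loser] * weight / out_weight[loser]
        rank = new_rank
    return rank


games = [
    ("Redskins", "Browns", 17),   # 38 to 21
    ("Giants", "Redskins", 4),    # 27 to 23
    ("Falcons", "Redskins", 7),   # 24 to 17
]
ranks = weighted_pagerank(games)
for team in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{team:10s} {ranks[team]:.3f}")
```

In this toy graph the Falcons finish ahead of the Giants even though each has a single win over the same opponent, because the larger margin of victory earns a larger share of the Redskins' vote.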

The season graph so far can be visualized in the following graph.


The PageRank algorithm is run and all of the votes from losing teams are calculated. The nodes in the graph are given a final ranking, and that ranking is represented by the size of the node in the graph. This algorithm does a much better job of taking strength of schedule into account than many of the other ranking systems, which are essentially based on win-loss ratios. Atlanta has had a good season so far, but their schedule was the easiest in the entire league. Strength of schedule was based on performance from the 2011 season. Up until last week Atlanta was not considered a strong contender for the Super Bowl, but after shutting out the Giants, their performance shows the team is capable of a dominant effort against a quality opponent. Although with the Giants you have to wonder which team will show up each week.

The numerical rankings are as follows:


Rank  Team
1     NY Giants
2     San Francisco
3     Atlanta
4     New England
5     Carolina
6     Denver
7     Houston
8     New Orleans
9     Cincinnati
10    Seattle
11    Green Bay
12    Minnesota
13    Baltimore
14    Chicago
15    Arizona
16    St Louis
17    Pittsburgh
18    Tampa Bay
19    Washington
20    Dallas
21    Miami
22    Cleveland
23    San Diego
24    NY Jets
25    Tennessee
26    Detroit
27    Kansas City
28    Indianapolis
29    Buffalo
30    Philadelphia
31    Oakland
32    Jacksonville

-- Greg Szalkowski

Monday, December 17, 2012

2012-12-17: Archive-It Partners Meeting

I attended the 2012 Archive-It Partners Meeting in Annapolis, MD on December 3.

I decided to attend at the last minute, and Kristine and Lori graciously let me have 5 minutes to talk about our project and upcoming NEH proposal.  We're looking for humanities-types and Archive-It partners to work with in evaluating our visualizations. After my presentation, I was able to make contacts with several potential partners.

 
Visualizing Digital Collections at Archive-It from Michele Weigle

There were several nice talks in the half-day session.  The full schedule and slides from all of the presentations are available.

Related to what we're working on, Alex Thurman from Columbia University Libraries talked about their local portal to their Human Rights collection (collection at Archive-It).  They offer a rotated list of screenshots for featured sites and have tabs to show the collection pages by title, URL, subject, place, and language. One nice feature they've added is the ability to collect and group different URLs that point to the same site (i.e., handling URL changes over time). They're currently in the user testing phase.

Students from Virginia Tech's Crisis, Tragedy, and Recovery Network (list of collections at Archive-It) presented their work on archiving web pages related to disasters and visualizing tweets related to disasters.  Their recent Hurricane Sandy collection contains 8 million tweets.  For their Archive-It collection, they extracted seed URIs from tweets, Google news, popular news/weather portals, and direct user input.  A big problem was the addition of spam links into the archive.  The tweet visualization project looks at classifying tweets into four different phases of a disaster (response, recovery, mitigation, and preparedness).  From this, they produced several views, including a ThemeRiver/Streamgraph type view, a social network view, a map, and a table of tweets.

-Michele

Friday, December 14, 2012

2012-12-14: InfoVis at Grace Hopper

I was selected to give a 5-minute faculty lightning talk at the Grace Hopper Celebration of Women in Computing in October in Baltimore.  Short talks are among the most difficult to prepare, especially short talks for a general audience. I decided to increase the level of difficulty by combining two topics in my 5-minute talk: information visualization (infovis) and web archiving.

I ended up presenting a snapshot of the work that Kalpesh Padia and Yasmin AlNoamany did for their JCDL 2012 paper, Visualizing Digital Collections at Archive-It (see related blog post).


Information Visualization - Visualizing Digital Collections at Archive-It from Michele Weigle

The faculty lightning talks session was new at Grace Hopper, but went very well.  We had a 45-minute session and got to hear about 8 totally different research projects.  Info and slides from all of the presentations are available on the GHC wiki.  Especially for work-in-progress, this format was a great way for the speakers to really focus in on the important aspects of their work and for the audience to hear snippets about different research projects without any presentation being long enough to be boring.

We had others from ODU attend GHC as well (faculty member Janet Brunelle and students Erin, Tiffany, and Tamara).  Tiffany and Tamara blogged about their experiences: Tamara's blog, Tiffany's blog.

The GHC wiki has a ton of information about the conference, including notes and slides for many of the talks.

I hadn't been to GHC in about 5 years and was amazed to see how much it had grown.  There were over 3600 attendees (1500 students) from 42 countries.  Happily, even with that many people, I was able to meet up with all of my old friends.

The highlight of the conference for me was Nora Denzel's keynote on Thursday morning.  It's recommended viewing for all, but especially for female students in CS or Engineering.  The video is embedded below, but if you'd rather read about it, here are some blog posts it generated: Aakriti's blog, Valerie's blog, and Kathleen's blog.

-Michele

Saturday, November 10, 2012

2012-11-10: Site Transitions, Cool URIs, URI Slugs, Topsy

Recently I was emailing a friend and wanted to update her about the recent buzz we have enjoyed with Hany SalahEldeen's TPDL 2012 paper about the loss rate of resources shared over Twitter.  I remembered that an article in the MIT Technology Review from the Physics arXiv blog started the whole wave of popular press (e.g., MIT Technology Review, BBC, The Atlantic, Spiegel).  To help convey the amount of social media sharing of these stories, I was sending links to the sites using the social media search engine Topsy.  Although I only recently discovered it, Topsy has quickly become one of my favorite sites.  It does many things, but the part I enjoy most is the ability to prepend "http://topsy.com/" to a URI to discover how many times that URI has been shared and who is sharing it.  For example:

http://www.bbc.com/future/story/20120927-the-decaying-web

becomes:

http://topsy.com/http://www.bbc.com/future/story/20120927-the-decaying-web

and you can see all the tweets that have linked to the bbc.com URI. 
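
The prepending trick is simple enough to script when checking several stories at once. The helper below is only a sketch of the string manipulation described above; the function name is made up for this example.

```python
TOPSY_PREFIX = "http://topsy.com/"


def topsy_uri(uri: str) -> str:
    """Build the Topsy lookup URI by prepending the Topsy prefix to a URI."""
    return TOPSY_PREFIX + uri


print(topsy_uri("http://www.bbc.com/future/story/20120927-the-decaying-web"))
```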

While composing my email I recalled the Technology Review article was one of the first (September 19, 2012) and most popular, so I did a Google search for the article and converted the resulting URI from:

http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

to:

http://topsy.com/http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

I was surprised when I saw Topsy reported 0 posts about the MIT TR story, because I recalled it being quite large.  I thought maybe it was a transient error and didn't think too much about it until later that night when I was on my home computer where I had bookmarked the MIT TR Topsy URI and it said "900 posts".  Then I looked carefully: the URI I had bookmarked now issues a 301 redirection to another URI:

% curl -I http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352561072"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 15:24:32 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 15:24:32 GMT
X-Varnish: 1779081554
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


A little poking around revealed that technologyreview.com reorganized and rebranded their site on October 24, 2012, and Google had already swapped the prior URI to the article with the new URI.  Their site uses Drupal, and it appears their old site did as well, but the URIs have changed.  The base URIs (e.g., http://www.technologyreview.com/view/429274/) have stayed the same (and are thus almost "cool"), but the slug has lengthened from 8 terms ("history as recorded on twitter is vanishing from") to the full title ("history as recorded on twitter is vanishing from the web say computer scientists").  Slugs are a nice way to make the URI more human readable, and can be useful in determining what the URI was "about" if (or when) it becomes 404 (see also Martin Klein's dissertation on lexical signatures).  The base URI will 301 redirect to the URI with the slug:

% curl -I http://www.technologyreview.com/view/429274/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352563816"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 16:10:16 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 16:10:16 GMT
X-Varnish: 1779473907
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


But this redirection is transparent to the user, so all the tweets that Topsy analyzes are the versions with slugs.  This results in two URIs for the article: the version from Sept 19 -- Oct 24 that has 900 tweets, and the Oct 24 -- now version that currently has 3 tweets (up from 0 when I first noticed this).  technologyreview.com is to be commended for not breaking the pre-update URIs (see the post about how ctv.ca handled a similar situation) and issuing 301 redirections to the new versions, but it would have been preferable to have maintained the old URIs completely (perhaps the new software installation has a different default slug length; I'm not familiar with Drupal, and the code examples I can find do not define a limit).

Splitting PageRank across URI aliases is a well-known problem that can be addressed with 301 redirects; this is why most URI shorteners like bitly issue 301 redirects instead of 302s, so that the PageRank accumulates at the target and not at the short URI.  It would be nice if Topsy also merged redirects when building its pages, or at least provided that as an option.  In the example above, that would result in either of the Topsy URIs (pre- and post-October 24) reporting 900+3 = 903 posts.
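
The merging suggested above can be sketched in a few lines. This is a hypothetical illustration, not a description of how Topsy works: the redirect map is assumed to have been collected already (for example from the Location headers of 301 responses), and the function names are invented for the example. The counts are the ones reported in this post.

```python
def resolve(uri, redirects, max_hops=10):
    """Follow known 301 redirects from uri until a URI with no redirect is reached."""
    seen = set()
    while uri in redirects and uri not in seen and len(seen) < max_hops:
        seen.add(uri)          # guards against redirect loops
        uri = redirects[uri]
    return uri


def merged_counts(counts, redirects):
    """Add each alias's share count to the count of its redirect target."""
    merged = {}
    for uri, n in counts.items():
        target = resolve(uri, redirects)
        merged[target] = merged.get(target, 0) + n
    return merged


OLD = "http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from/"
NEW = ("http://www.technologyreview.com/view/429274/"
       "history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/")

redirects = {OLD: NEW}          # old slug 301-redirects to the new slug
counts = {OLD: 900, NEW: 3}     # post counts observed for each URI

print(merged_counts(counts, redirects)[NEW])  # 903
```

A lookup for either alias would then resolve to the same target first and report the combined total.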

--Michael

Edit: I did some more investigating and found that the slug doesn't matter, only the Drupal node ID of "429274" (those familiar with Drupal probably already knew that).  Here's a URI that should obviously return 404 but instead redirects to the URI with the full title as the slug:

% curl -I http://www.technologyreview.com/view/429274/lasdkfjlajfdsljkaldsf/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352581871"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 21:11:11 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 21:11:11 GMT
X-Varnish: 1782237238
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


This makes the Drupal slug very close to the original Phelps & Wilensky concept of "Robust Hyperlinks Cost Just Five Words Each", which formed the basis for Martin's dissertation mentioned above.  While this is convenient in that it reduces the number of 404s in the world, it is also a bit of a white lie; user agents need to be careful to not assume that the original URI ever existed even though it is issuing a redirect to a target URI. 
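
Since only the node ID appears to matter on this site, a client that wants to know whether two of these URIs name the same article could compare node IDs rather than full URIs. The sketch below assumes the /view/<node id>/<slug>/ path layout seen in the examples above; it is based only on the behavior observed here, not on Drupal documentation, and the function name is made up.

```python
from urllib.parse import urlsplit


def node_id_and_slug(uri):
    """Split a /view/<node id>/<slug>/ URI into (node_id, slug).
    Returns None if the path does not follow that layout."""
    parts = [p for p in urlsplit(uri).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "view" or not parts[1].isdigit():
        return None
    slug = parts[2] if len(parts) > 2 else ""
    return parts[1], slug


uris = [
    "http://www.technologyreview.com/view/429274/",
    "http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from/",
    "http://www.technologyreview.com/view/429274/lasdkfjlajfdsljkaldsf/",
]
# All three URIs carry the same node ID, so they redirect to the same article.
print({node_id_and_slug(u)[0] for u in uris})  # {'429274'}
```

Matching node IDs only show that the URIs redirect to the same place. As noted above, they do not show that a particular slug was ever a real URI for the article.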

Wednesday, November 7, 2012

2012-11-06: TPDL 2012 Conference


It all started last April, on the 9th to be exact, when I received an email from Dr. George Buchanan delivering the good news: my paper had been accepted at the annual international conference on Theory and Practice of Digital Libraries, TPDL 2012. As Program Chair, Dr. Buchanan sent me the reviews and feedback for my paper, entitled “Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?”, which paved the way for the months of preparation to present the paper.


In addition to the paper submission, Dr. Nelson gave me permission to submit my PhD proposal to be considered for the Doctoral Consortium at the conference. Scoring my second goal, I received a delightful email from Dr. Birger Larsen and Dr. Stefan Gradmann announcing the committee's acceptance of my proposal, and I was invited to present my work at the consortium the day before the conference.

The hat-trick came a few weeks before the conference in the form of an email from Dr. Larsen proposing that I present my work from the doctoral consortium at the poster session on the first day of the conference. Overwhelmed with joy, I gladly accepted this gracious invitation and started working on the poster.

After an 8-hour drive to New York and a couple of flights, I arrived at Larnaca airport in Cyprus. I can't complain, because two of the activities closest to my heart are driving and travelling. Anyway, I took the bus from the airport to Limassol and was supposed to take another bus to Paphos, where the conference was held, but unfortunately it didn't come. After a quick chat with two French ladies who happened to be heading to Paphos too, we shared a taxi there and I finally arrived at the hotel where I would be spending the following nights, the Cynthiana beach hotel. With a captivating view and spacious suites, the Cynthiana hotel was located 10 minutes by bus from the conference venue and halfway to the center of the city.

The following morning, after being confused by the British system of driving on the opposite side of the road and being directed by the kind locals to the station, I took the bus to the conference venue at the Coral Bay Hotel. Dr. Larsen and Dr. Gradmann were waiting to guide me and the others to the presentation room. I was assigned Ms. Justyna Walkowska as my mentor to guide the discussion and give me constructive criticism. I personally loved this consortium model, as it gave the mentors the opportunity to read the proposals in detail before the presentations, giving them an in-depth view of the work and making the feedback more constructive and beneficial.



After giving the presentation and receiving questions and feedback, I sat down and listened to the work of fellow PhD students: Tuan VuTran, Armand Brahaj, and Nut Limsopatham. Shortly after wrapping up the consortium, Dr. Larsen and Dr. Gradmann took us to the city pier to have an authentic Cypriot dinner. The food, the atmosphere, and the company were marvelous. Later that night I arrived back at the hotel exhausted.

The next morning the conference commenced. Following the welcome notes by Dr. Buchanan, Dr. Mounia Lalmas gave a marvelous keynote speech entitled “User Engagement in the Digital World”. Dr. Lalmas is a visiting principal scientist at Yahoo! Labs Barcelona. She talked about user engagement and the emotional, cognitive, and behavioral connection between the user and the technological resource. She discussed ways to measure and model this engagement, along with selected experiments covering those aspects.

After the keynote speech we had a short coffee break where I met some people I hadn't seen since JCDL in June. Then I headed to the 2nd track session entitled “Analyzing and Enriching Documents”, which included several interesting papers by Róisín Rowley-Brooke, my friend Luis Meneses, Daan Odijk, and Annika Hinze, who had 4 papers published in this conference, which I found fascinating. The lunch break followed, and I had to do a phone interview with Ms. Lesley Taylor from the Toronto Star, who wrote an article about the paper I was presenting at the conference.

Following lunch I attended the session entitled “Extracting and Indexing”, where Guido Sautter, Benjamin Köhncke, and Georgina Tryfou presented their work. The minute madness started shortly after and was followed by the poster session.

Standing by my poster in the middle of the room, I started explaining my work to interested researchers in the field. After a while I started checking out the neighboring posters, and I bet my friend Clare Llewellyn drinks that she would win the best poster award with her brilliant linen cloth poster (spoiler alert: she owes me drinks now!). Later that evening, after the welcome reception, we went out for dinner and drinks in another authentic Cypriot restaurant and had a lovely time.

The following day started with the second keynote speech, by Dr. Andreas Lanitis from the Department of Multimedia and Graphic Arts, Cyprus University of Technology, entitled “On the Preservation of Cultural Heritage Through Digitization, Restoration, Reproduction and Usage”. In this captivating talk, Dr. Lanitis discussed the digital preservation of Cypriot cultural heritage artifacts, as well as their restoration and reproduction.

After the coffee break I attended the second track session entitled “Content and Metadata Quality”, where two fascinating papers were presented, one regarding SKOS vocabularies and the other about meta learning from wiki articles. I was fairly nervous because I was due to present my long paper in the following session, just after lunch.

During lunch I had my second phone interview, with Ms. Claire Connelly, a journalist from News Ltd in Australia who was also writing an article about our work. Following lunch, I joined the 1st track session, in which I was to present my work. It started with Anqi Cui presenting his interesting work on PrEV (Preserving and Providing Web Pages and User-generated Contents). To my surprise he cited my work in his presentation, and a sense of accomplishment flooded me. The next paper, entitled “Preserving Scientific Processes from Design to Publications”, analyzed scientific processes. After that I took the stage and was surprised by the large number of attendees. The questions were marvelous, and Cathy Marshall, among others, gave me very valuable feedback. Following my presentation, Ray Larson and Maria Sumbana presented the next two papers.


After the coffee break we returned for the last round of sessions, for which I again chose track 2, “Information Retrieval”, with four more papers. At 7 o'clock we gathered in the lobby to board the buses taking us to an authentic Cypriot restaurant on the outskirts of town. This one was different, as it had a band and a folk dancing group who taught us how to do the Cypriot round and line group dances.

The following morning I packed my bags and checked out before attending the last day of the conference, which started with a talk, captivating as usual, by Cathy Marshall from Microsoft Research San Francisco. The talk was entitled “Whose content is it anyway? Social media, personal data, and the fate of our digital legacy”, and was similar to the equally wonderful speech she gave at JCDL. Finally I attended the session I had been looking forward to the most, track 2 “User Behavior”, with presentations by Michael Khoo and Catherine Hall, Sally Jo Cunningham, Fernando Loizides, and my friend Gerhard Gossen.

The closing session followed, concluding the conference, and the best paper/demo/poster awards were handed to the authors, among them our friend Clare Llewellyn.
In conclusion, it was a well-organized and successful conference. Our presence was evident three times, and I attended several interesting sessions, met old colleagues, made a lot of new contacts, and got really great feedback.


-- Hany SalahEldeen

Thursday, October 25, 2012

2012-10-24: NFL Power Rankings Week 8


After running the R script for the week 8 rankings, the first thing that struck my mind was the disparity in the size of the nodes between the AFC on the left side of our graph and the NFC on the right side.

Two weeks ago we wrote that the NFC West has been dominant so far this year. The NFC West has the best combined record, and their aggregate point differential puts others to shame.  However, it is not just the West division: the entire NFC has dominated and outperformed the AFC at every turn. CBS Sports rates the NFC as head and shoulders above the AFC this year.

Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser pointing to the winner, and it is weighted by the margin of victory.

In the PageRank model, each link from a webpage i to webpage j causes webpage i to give some of its own PageRank to webpage j.  This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week the Giants beat the Redskins 27 to 23, so a directed edge from the Redskins to the Giants with a weight of 4 was created in the graph.

The season graph so far can be visualized in the following graph.


The PageRank algorithm is run and all of the votes from losing teams are calculated. The nodes in the graph are given a final ranking, and that ranking is represented by the size of the node in the graph. This algorithm does a much better job of taking strength of schedule into account than many of the other ranking systems, which are essentially based on win-loss ratios. Barring any injuries or other problems, it is a good guess that Houston will be representing the AFC once the playoffs are complete. The real question is which team from the NFC will rise to the surface to take them on in the Super Bowl.

The numerical rankings are as follows:

Rank  Team
1     San Francisco
2     Green Bay
3     NY Giants
4     Dallas
5     Chicago
6     Seattle
7     Minnesota
8     St Louis
9     Washington
10    Houston
11    Arizona
12    Atlanta
13    Baltimore
14    Philadelphia
15    Cincinnati
16    Denver
17    New England
18    NY Jets
19    Indianapolis
20    Pittsburgh
21    Buffalo
22    Miami
23    Detroit
24    San Diego
25    New Orleans
26    Cleveland
27    Tennessee
28    Carolina
29    Tampa Bay
30    Oakland
31    Jacksonville
32    Kansas City

-- Greg Szalkowski

Thursday, October 11, 2012

2012-10-11: NFL Power Rankings Week 6

It is now five weeks into the 2012 season and the season is starting to come into focus. The topic of many online discussions is this year's performance of the NFC West division compared to last year. The NFC West is one of the best performing divisions so far this year, which is a far cry from last year. They are certainly doing well in our ranking system.

Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser pointing to the winner, and it is weighted by the margin of victory.

In the PageRank model, each link from a webpage i to webpage j causes webpage i to give some of its own PageRank to webpage j.  This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week the Falcons beat the Redskins 24 to 17, so a directed edge from the Redskins to the Falcons with a weight of 7 was created in the graph.

The season graph so far can be visualized in the following graph.

The PageRank algorithm is run and all of the votes from losing teams are calculated. The nodes in the graph are given a final ranking, and that ranking is represented by the size of the node in the graph. The PageRank algorithm used in this fashion has the nice effect of representing strength of schedule. This should be of interest to many of the Houston Texans fans out there. The majority of the NFL power ranking sites currently have Houston ranked number one. A simple glance at the schedule for the past five weeks shows that Houston has had a pretty easy season so far. They have played well, and this week's game against Green Bay should be a good one.

The numerical rankings are as follows:

Rank  Team
1     Chicago
2     Green Bay
3     San Francisco
4     Minnesota
5     Indianapolis
6     St Louis
7     Arizona
8     Seattle
9     Philadelphia
10    Houston
11    Baltimore
12    Atlanta
13    Jacksonville
14    Dallas
15    Detroit
16    Denver
17    New England
18    NY Giants
19    Cincinnati
20    San Diego
21    Pittsburgh
22    Miami
23    Washington
24    Buffalo
25    NY Jets
26    Carolina
27    Tennessee
28    New Orleans
29    Oakland
30    Tampa Bay
31    Kansas City
32    Cleveland

-- Greg Szalkowski

Wednesday, October 10, 2012

2012-10-10: Zombies in the Archives

Image provided from http://www.taxhelpattorney.com/
In our current research, the WS-DL group has observed leakage in archived sites. Leakage occurs when archived resources include current content. I enjoy referring to such occurrences as "zombie" resources (which is appropriate given the upcoming Halloween holiday). That is to say, these resources are expected to be archived ("dead") but still reach into the current Web.

In the examples below, this reach into the live Web is caused by URIs contained in JavaScript not being rewritten to be relative to the Web archive; the page in the archive is not pulling from the past archived content but is "reaching out" (zombie-style) from the archive to the live Web. 
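
To illustrate what "rewritten to be relative to the Web archive" means, the sketch below builds Wayback Machine style URIs of the form seen in the examples in this post (archive prefix, 14-digit timestamp, original URI). It is not the Wayback Machine's actual rewriting code, and the function names are invented for the example. A URI that a script assembles at run time never passes through this kind of rewriting, which is how the zombies escape.

```python
ARCHIVE_PREFIX = "http://web.archive.org/web/"


def to_archive_uri(live_uri: str, timestamp: str) -> str:
    """Rewrite a live-Web URI so that it points at an archived copy."""
    return f"{ARCHIVE_PREFIX}{timestamp}/{live_uri}"


def is_rewritten(uri: str) -> bool:
    """True if a URI already points into the archive rather than the live Web."""
    return uri.startswith(ARCHIVE_PREFIX)


print(to_archive_uri("http://www.cnn.com/", "20080903204222"))
print(is_rewritten("http://ad.doubleclick.net/"))  # False
```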

We provide two examples with humorous juxtaposition of past and present content. Because of JavaScript, rendering a page from the past will include advertisements from the present Web.


2008 memento of cnn.com from the Wayback Machine
First, we look at cnn.com. We can observe an archived resource from the Wayback Machine at http://web.archive.org/web/20080903204222/http://www.cnn.com/. This memento from September 16th, 2008 includes links to the 2008 presidential race between McCain-Palin and Obama-Biden. However, this memento was observed on September 28, 2012 -- during the 2012 presidential race between Romney-Ryan and Obama-Biden. The memento includes embedded JavaScript that pulls advertisements from the live Web. The advertisement included in the memento is a zombie resource that promotes the 2012 presidential debate between Romney and Obama. This drift from the expected archived time seems to provide a prophetic look at the 2012 presidential candidates in a 2008 resource. The current cnn.com homepage gives the same advertisement as the archived version.


Current cnn.com homepage as observed in 2012

A second case study comes from the IMDB movie database site. We observed the July 28th, 2011 memento of the IMDB homepage at http://web.archive.org/web/20110728165802/http://www.imdb.com/. This memento advertises the movie Cowboys and Aliens. This movie is set to start "tomorrow" according to our observed July 28th, 2011 memento. Additionally, we see the current feature movie is Captain America.

2011 memento of IMDB.com from the Wayback Machine 

According to the currently observed IMDB site, Cowboys and Aliens was released in 2011 and Captain America was released in 2011, in keeping with our observed memento. However, the ad included on the IMDB memento promotes the movie "Won't Back Down." According to IMDB, this movie won't be released until 2012. Again, we can observe a memento with a reference to present-day events.

Cowboys and Aliens was released in 2011
Captain America was released in 2011
Won't Back Down is scheduled to be released in 2012

When we observe the HTTP requests that are made when loading the mementos, there is evidence of reach into the current Web. We've stored all HTTP headers from the archive into a text file for analysis.  All of the requests should be to other archive.org resources. However, we can filter out the requests that went to live-Web resources instead:

$ grep Host: headers.txt | grep -v archive.org
Host: ocsp.incommon.org
Host: ocsp.usertrust.com
Host: exchange.cs.odu.edu
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: core.insightexpressai.com
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: core.insightexpressai.com
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: ad.doubleclick.net
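
The same filtering can be done with a tally per host, which makes it easier to see which live-Web servers are contacted most often. The following is a minimal sketch rather than the script we used: the function name live_web_hosts is hypothetical, the sample text reuses a few hosts from the listing above, and the web.archive.org line is an assumed example of a request that should be filtered out. Unlike grep -v, it matches on the domain suffix instead of any line containing the string.

```python
from collections import Counter

def live_web_hosts(header_text, archive_domain="archive.org"):
    """Count Host: header values that do not belong to the archive's domain."""
    counts = Counter()
    for line in header_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host = value.strip()
            # Keep only hosts outside the archive's own domain.
            if host != archive_domain and not host.endswith("." + archive_domain):
                counts[host] += 1
    return counts

# Assumed sample input; in practice this would be the contents of headers.txt.
sample = """\
Host: web.archive.org
Host: ocsp.incommon.org
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
"""

hosts = live_web_hosts(sample)
for host, n in hosts.most_common():
    print(n, host)
```

Repeated hosts such as ad.doubleclick.net then stand out as the main source of leaked content.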


These requests from archives into the live Web are initiated by embedded JavaScript:

<iframe src="http://www.imdb.com/images/SF99c7f777fc74f1d954417f99b985a4af/a/ifb/doubleclick/expand.html#imdb2.consumer.homepage/;tile=5;sz=1008x60,1008x66,7x1;p=ns;ct=com
;[PASEGMENTS];u=[CLIENT_SIDE_ORD];ord=[CLIENT_SIDE_ORD]?" ... onload="ad_utils.on_ad_load(this)"></iframe>


During our investigation of these zombie resources, we observed that this leakage of live content into archived resources is not consistent. We noticed that some versions of some browsers would not produce the leakage; this is potentially due to the browsers' different methods of handling JavaScript and Ajax calls. In our experience, older browsers have a higher percentage of leakage, while the newer browsers demonstrate the leakage less frequently.

The CNN and IMDB mementos mentioned above were rendered in Mozilla Firefox version 3.6.3. Below are two examples of our CNN and IMDB mementos rendered in Mozilla Firefox 15.0.1. Note that the examples below attempt to load the advertisements but produce a "Not Found In Archive" message.

CNN.com memento rendered in a newer browser with no leakage.

IMDB.com memento rendered in a newer browser with no leakage.

When analyzing the headers with these new browsers, we get fewer requests for live content. More importantly, we get different requests than we saw in the other browsers:

$ grep Host: headers.txt | grep -v archive.org
Host: ia.media-imdb.com
Host: ia.media-imdb.com
Host: ia.media-imdb.com
Host: b.scorecardresearch.com
Host: s0.2mdn.net
Host: s0.2mdn.net
Host: b.voicefive.com
Host: b.scorecardresearch.com


These mementos still attempted to load the wrong resources, albeit unsuccessfully. Essentially, these mementos are shown as incomplete instead of incorrect (and without our humorous results). The exact relationship between browsers, mementos, and zombie resources will require additional investigation before we can establish a cause and solution for these leakages.

The Internet Archive is not the only archive that contains these leakages. We found an example in the following WebCite memento of cnn.com, archived on 2012-09-09.

WebCite memento of cnn.com.

The "Popular on Facebook" section of the page has activity from two of my "friends." The page that was shared was the 10 questions for Obama to answer page, which was published on October 1st, 2012 and is shown below. It should be obvious that my "friends" shouldn't have been able to share a page that had not yet been published (2012-09-09 occurs before 2012-10-01). So, the WebCite page allows live-Web leakage in the cnn.com memento.

Live cnn.com resource

Such occurrences of leakage and zombie resources are not uncommon in today's archives. Current Web technologies such as JavaScript make a pure, unchanging capture difficult in the modern Web. However, it is useful for us as Web users and Web scientists to understand that zombies do exist in our archives.

--Justin F. Brunelle

Saturday, September 29, 2012

2012-09-29: Data Curation, Data Citation, ResourceSync

During September 10-11, 2012 I attended the UNC/NSF Workshop Curating for Quality: Ensuring Data Quality to Enable New Science in Arlington.  The structure of the workshop was to invite about 20 researchers involved with all aspects of data curation and solicit position papers in one of four broad topics:
  1. data quality criteria and contexts
  2. human and institutional factors
  3. tools for effective and painless curation
  4. metrics
Although the majority of the discussion was about science data, my position paper was about the importance of archiving the web.  In short, treating the web as the corpus that should be retained for future research.  The pending workshop report will have a full list of participants and their papers, but in the meantime I've uploaded to arXiv my paper, "A Plan for Curating `Obsolete Data or Resources'", which is a summary version of the slides I presented at the Web Archiving Cooperative meeting this summer. 

To be included in the workshop report are the results of various breakout sessions.  The ones that I took part in covered questions such as: how contextual information should be archived with the data (cf. "preservation description information" and "knowledge base" from OAIS), how much of a university's institutional overhead goes to institutional repositories and archiving capability ("put everything in the cloud" is neither an informed nor acceptable answer), and how to handle versioning and diff/patch in large data sets (tools like Galaxy and Google Refine were mentioned in the larger discussion).

(2012-10-23 edit: the final workshop report is now available.)

A nice complement to the Data Curation workshop was the NISO "Tracking it Back to the Source: Managing and Citing Research Data" workshop in Denver on September 24.  This one day workshop focused on how to cite and link to scientific data sets (which came up several times in the UNC workshop as well).  While I applaud the move to make data sets first-class objects in the scholarly communication infrastructure, I always feel there is an unstoppable momentum to "solve" the problem by simply saying "use DOIs" (e.g., DataCite), while ignoring the hard issues of what exactly a DOI refers to (see: ORE Primer), versioning what it might point to (see: Memento), as well as the minor quibble that DOIs aren't actually URIs (look it up: "doi" is not in the registry).  In short, DOIs are a good start, but they just push the problem one level down instead of solving it.  Highlights from the workshop included a ResourceSync+Memento presentation from Herbert Van de Sompel and "Data Equivalence", by Mark Parsons of the NSIDC.

After the NISO workshop, there was a two day ResourceSync working group meeting (September 25-26) in Denver.  We made a great deal of progress on the specification; the pre-meeting (0.1) version of the specification is no longer valid.  Many issues are still being considered and I won't cover the details here, but the main result is that the ResourceSync format will no longer be based on Sitemaps.  We were all disappointed to have to make that break, but Martin Klein did a nice set of experiments (to be released later) that showed that, although Sitemaps are superficially suitable for the job, there were just too many areas where their primary focus of advertising URIs to search engines inhibited the more nuanced use of advertising resources that have changed.

--Michael

Thursday, September 27, 2012

2012-09-27: NFL Referee Kerfuffle

For the first three weeks of the 2012 NFL season, replacement officials have refereed the games due to an ongoing labor dispute between the referees and the NFL. Every fan of a team that has been on the losing side of a call has voiced their opinion on the abilities of the replacement referees. Even Jon Stewart had something to say about the labor dispute.

This past Monday night during the Seahawks - Packers game, a controversial call essentially determined the winner of the game. This call was the powder keg that blew open the dam of angry recriminations and complaints directed at the replacement referees and the NFL. This was somewhat amusing to me, as the people complaining seem to forget about all of the mistakes the regular referees appeared to make in all of the previous years. In 2008 one of the best referees in the NFL, Ed Hochuli, made a rather horrendous call. I have to give him respect for owning up to it and apologizing. NFL fans have always complained about the officiating, warranted or not.

Seeing as how I have been collecting NFL statistics for a number of years, I decided to see what the data could tell me about the replacement referees' performance. First I wanted to see if there was a disparity in the number of penalties called by the referees during the first three weeks of this year compared to the first three weeks of other years.

Year Mean penalties per game
2002 13.2609
2003 15.4783
2004 14.2609
2005 15.5870
2006 12.3261
2007 11.4583
2008 12.3617
2009 12.3333
2010 13.1489
2011 13.0417
2012 13.6250

The average number of penalties appears to be consistent with the previous decade.
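
One rough way to put a number on "consistent" is to compare 2012 against the mean and standard deviation of the ten earlier seasons. This is only a sketch of that idea, not the analysis behind the table: it treats each season's average as a single observation and ignores game-to-game variation within a season. The values are copied from the table above.

```python
from statistics import mean, stdev

# Mean penalties per game over the first three weeks, from the table above.
penalties = {
    2002: 13.2609, 2003: 15.4783, 2004: 14.2609, 2005: 15.5870,
    2006: 12.3261, 2007: 11.4583, 2008: 12.3617, 2009: 12.3333,
    2010: 13.1489, 2011: 13.0417, 2012: 13.6250,
}

# Baseline: every season before 2012.
baseline = [value for year, value in penalties.items() if year < 2012]
mu = mean(baseline)
sigma = stdev(baseline)  # sample standard deviation

# How many baseline standard deviations 2012 sits from the baseline mean.
z = (penalties[2012] - mu) / sigma
print(f"2002-2011 mean {mu:.2f}, sd {sigma:.2f}, 2012 z-score {z:.2f}")
```

By this measure 2012 sits well under one standard deviation above the 2002-2011 average.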

One concern that I read about was that the replacement referees were local and would favor the home team. Indeed one referee was removed from his assignment after some of his Facebook posts described him as a Saints fan. Just one more reason to watch what you release on social media.
So, have the home teams done better this year than others?


Year Home Wins
2002 23
2003 25
2004 26
2005 29
2006 21
2007 30
2008 28
2009 25
2010 27
2011 31
2012 31

Looking at the number of home team wins in the first three weeks of each season shows that 31 wins in 2012, while a little higher than average and exactly equal to last year, is nowhere close to a statistical anomaly. This leads me to wonder what Vegas thinks about the whole situation. The collective intelligence of the NFL fan population, as realized by the Vegas spread, has been the focus of much of my research.
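
A quick way to see where 2012 falls among the earlier seasons is to rank it. The sketch below uses only the counts from the table above; it does not account for the number of games played in each season's first three weeks, which the table does not list.

```python
# Home wins in the first three weeks of each season, from the table above.
home_wins = {
    2002: 23, 2003: 25, 2004: 26, 2005: 29, 2006: 21, 2007: 30,
    2008: 28, 2009: 25, 2010: 27, 2011: 31, 2012: 31,
}

prior = {year: wins for year, wins in home_wins.items() if year < 2012}
prior_average = sum(prior.values()) / len(prior)

# Earlier seasons with at least as many home wins as 2012.
matched = [year for year, wins in prior.items() if wins >= home_wins[2012]]

print(f"2002-2011 average: {prior_average:.1f} home wins")
print(f"2012: {home_wins[2012]} home wins, matched or exceeded in {matched}")
```

2012 is above the ten-season average, but 2011 already reached the same total with the regular referees.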

ESPN and other sites are reporting that over $100 million was lost as a result of the controversial call on Monday night. How far off from reality has the Vegas line been this year compared to the past 20 years?

This figure shows the average difference and standard deviation between the Vegas betting line and the actual Margin of Victory over the first three weeks of each year. Negative numbers indicate that either the visitor performed better than expected or the home team was favored more than it should have been. 2012 is a little less than -2, so the argument could be made that the NFL fans favored the home teams a little more than they should have. The key takeaway is that the results are well within normal values and maybe even a little more consistent than any other season in the past two decades.
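
To make the sign convention concrete, here is a small sketch of how such a per-game difference could be computed. Several things are assumptions on my part for illustration: the spread is written in the usual sportsbook form (a home favorite has a negative spread), the difference is taken as actual home margin minus expected home margin, and the four games listed are invented values, not real results.

```python
from statistics import mean, stdev

# (home spread, home points, visitor points): invented values for illustration only.
games = [
    (-7.0, 24, 21),   # home favored by 7, wins by 3
    (3.0, 17, 20),    # home is a 3 point underdog, loses by 3
    (-3.5, 30, 10),   # home favored by 3.5, wins by 20
    (-6.0, 13, 16),   # home favored by 6, loses by 3
]

def line_error(spread, home_points, visitor_points):
    """Actual home margin minus the home margin implied by the spread."""
    expected_home_margin = -spread  # assumed convention: favorite has a negative spread
    actual_home_margin = home_points - visitor_points
    return actual_home_margin - expected_home_margin

errors = [line_error(*game) for game in games]
print("per-game differences:", errors)
print(f"mean {mean(errors):.3f}, sd {stdev(errors):.3f}")
```

Under this convention the first game yields -4: the home team was favored by more than it delivered, which is the case the negative values in the figure describe.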

This little experiment did not really prove anything other than the first three weeks of this season have not been statistically different than any other season in the past decade. Things being what they are, people will still find something to complain about and the search for someone to blame will always be successful.

The news is reporting that starting tonight the regular referees will be back and all will be well with the world. The question I leave you with is after the next controversial call, who will they blame?

-- Greg Szalkowski

Monday, September 3, 2012

2012-08-31: Benchmarking LANL's SiteStory

On August 17th, 2012, Los Alamos National Laboratory's Herbert Van de Sompel announced the release of the anticipated transactional web archiver called SiteStory.



The ODU WS-DL research group (in conjunction with The MITRE Corporation) performed a series of studies to measure the effect of SiteStory on web server performance. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.
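
For readers who prefer the slowdown as an absolute and relative figure, the arithmetic on the four timings quoted above can be sketched as follows. The labels and the helper name overhead are mine; the numbers come straight from the paragraph above.

```python
# (seconds per page without archiving, seconds per page with SiteStory archiving)
timings = {
    "server under load": (0.076, 0.086),
    "many embedded, changing resources": (0.15, 0.21),
}

def overhead(baseline, with_sitestory):
    """Return (added seconds per page, added time as a fraction of the baseline)."""
    added = with_sitestory - baseline
    return added, added / baseline

for label, (base, archived) in timings.items():
    added, relative = overhead(base, archived)
    print(f"{label}: +{added * 1000:.0f} ms per page ({relative:.0%})")
```

That works out to roughly 10 ms (about 13%) per page under load and 60 ms (40%) for the pages with many embedded resources, so the added time per access stays in the tens of milliseconds.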

A sneak-peek at how SiteStory affects server performance is provided below. Please see the technical report for a full description of these results. But first, let's compare the archival behaviors of transactional and conventional Web archives.

Crawler and user visits generate archived copies of a changing page.


A visual representation of a typical page change and user access scenario is depicted in the above figure. This scenario assumes an arbitrary page, called P, that changes at inconsistent intervals. The timeline shows page P changing at points C1, C2, C3, C4, and C5 at times t2, t6, t8, t10, and t13, respectively. A user makes a request for P at points O1, O2, and O3 at times t3, t5, and t11, respectively. A Web crawler (that captures representations for storage in a Web archive) visits P at points V1 and V2 at times t4 and t9, respectively. Since O1 occurs after change C1, an archived copy of C1 is made by the transactional archive (TA). When O2 is made, P has not changed since O1, and therefore an archived copy is not made since one already exists. The Web crawler's visit V1 captures C1 and makes a copy in the Web archive. In servicing V1, an unoptimized TA will store another copy of C1 at t4, while an optimized TA could detect that no change has occurred and not store another copy of C1.

Change C2 occurs at time t6, and C3 occurs at time t8. There was no access to P between t6 and t8, which means C2 is lost -- an archived copy exists in neither the TA nor the Web crawler's archive. However, one could ask whether a change that no entity observed needs to be archived at all. Change C3 occurs and is archived during the crawler's visit V2, and the TA will also archive C3. After C4, a user accessed P at O3, creating an archived copy of C4 in the TA. In the scenario depicted in the above figure, the TA will have changes C1, C3, and C4, and a conventional archive will only have C1 and C3. Change C2 was never served to any client (human or crawler) and is thus not archived by either system. Change C5 will be captured by the TA when P is next accessed.

The example in the above figure demonstrates a transactional archive's ability to capture a single copy of each version of a page that is served to a client, while versions never seen by any client are not captured.
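
The scenario can also be replayed in a few lines of code. This is a minimal sketch of the timeline described above, not SiteStory's implementation: the function names are hypothetical, times are the integer subscripts of t, and the TA modeled here is the optimized one that stores each distinct version once.

```python
from bisect import bisect_right

# Timeline from the scenario: changes at t2, t6, t8, t10, t13 produce C1..C5.
CHANGE_TIMES = [2, 6, 8, 10, 13]
VERSIONS = ["C1", "C2", "C3", "C4", "C5"]
USER_ACCESSES = [3, 5, 11]   # O1, O2, O3
CRAWLER_VISITS = [4, 9]      # V1, V2

def version_served(t):
    """Return the version of P that is live at time t (None before the first change)."""
    i = bisect_right(CHANGE_TIMES, t)
    return VERSIONS[i - 1] if i else None

def archive(request_times):
    """Store one copy of each distinct version served, in order of first capture."""
    stored = []
    for t in sorted(request_times):
        version = version_served(t)
        if version is not None and version not in stored:
            stored.append(version)
    return stored

# The TA sees every request (users and crawler); the crawler sees only its own visits.
transactional = archive(USER_ACCESSES + CRAWLER_VISITS)
crawler_only = archive(CRAWLER_VISITS)
unseen = [v for v in VERSIONS if v not in transactional]

print("transactional archive:", transactional)
print("crawler archive:", crawler_only)
print("not yet archived:", unseen)
```

The replay gives C1, C3, and C4 for the TA and C1 and C3 for the crawler, with C2 lost and C5 waiting for the next access, matching the walk-through above.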

Los Alamos National Laboratory has developed SiteStory, an open-source transactional Web archive. First, mod_sitestory is installed on the Apache server that contains the content to be archived. When the Apache server builds the response for the requesting client, mod_sitestory sends a copy of the response to the SiteStory Web archive, which is deployed as a separate entity. This Web archive then provides Memento-based access to the content served by the Apache server with mod_sitestory installed, and the SiteStory Web archive is discoverable from the Apache web server using standard Memento conventions.

Sending a copy of the HTTP response to the archive is an additional task for the Apache Web server, and this task must not come at too great a performance penalty to the Web server. The goal of this study is to quantify the additional load mod_sitestory places on the Apache Web server to be archived.

ApacheBench (ab) was used to gather the throughput statistics of a server when SiteStory was actively archiving content and compare those statistics to those of the same server when SiteStory was not running. The below figures from the technical report show that SiteStory does not hinder a server's ability to provide content to users in a timely manner.

Total run time for the ab test with 10,000 connections and 1 concurrency.

Total run time for the ab test with 10,000 connections and 100 concurrency.

Total run time for the ab test with 216,000 connections and 1 concurrency.

Total run time for the ab test with 216,000 connections and 100 concurrency.
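
For anyone who wants to run a similar comparison, the four configurations in the captions above map naturally onto ab's -n (total requests) and -c (concurrency) options. The sketch below only builds and prints the command lines. The URL is a placeholder, and reading "connections" as -n is my assumption, so see the technical report for the exact invocations.

```python
import shlex

# Placeholder target; replace with the page served by the Apache server under test.
TARGET_URL = "http://localhost/"

# (total requests, concurrency) pairs taken from the figure captions above.
CONFIGURATIONS = [(10000, 1), (10000, 100), (216000, 1), (216000, 100)]

def ab_command(requests, concurrency, url=TARGET_URL):
    """Build the argument list for one ApacheBench run."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]

commands = [ab_command(n, c) for n, c in CONFIGURATIONS]
for command in commands:
    print(shlex.join(command))
```

Each command would be run twice, once with SiteStory archiving enabled and once without, and the total run times compared.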

To test the effect of sites with large numbers of embedded resources, 100 HTML pages were constructed with Page 0 containing 0 embedded images, Page 1 containing 1 embedded image, ..., Page n containing n embedded images. As expected, larger resources take longer to serve to a requesting user. SiteStory is affected more for larger resources, as depicted in the below figures.
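
A test set of this shape is easy to generate. The following sketch builds the 100 pages in memory. The file names, the image names, and the markup are hypothetical choices of mine and are not necessarily what was used in the study.

```python
def build_page(n_images):
    """Return an HTML document that embeds n_images images."""
    # Hypothetical image names: img0.png, img1.png, ...
    images = "\n".join(
        f'    <img src="img{i}.png" alt="image {i}">' for i in range(n_images)
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"  <head><title>Page {n_images}</title></head>\n"
        "  <body>\n"
        f"{images}\n"
        "  </body>\n"
        "</html>\n"
    )

# Page 0 has no images, Page 1 has one, ..., Page 99 has ninety-nine.
pages = {f"page{n}.html": build_page(n) for n in range(100)}
print(len(pages), "pages;", pages["page3.html"].count("<img "), "images on page 3")
```

Writing each entry of pages to the server's document root would give a set of resources whose size grows linearly with the page number.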






As depicted in these figures, SiteStory does not significantly hinder a server, and increases the ability to actively archive content served from a server. More details on these graphs can be found in the technical report, which has been posted to arXiv.org:

Justin F. Brunelle, Michael L. Nelson, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Technical Report 1209.1811v1, 2012.

--Justin F. Brunelle