Tuesday, September 1, 2015

2015-09-01: From Student To Researcher II

After successfully defending my Master's Thesis, I accepted a position as a Graduate Research Assistant at Los Alamos National Laboratory (LANL) Library's Digital Library Research and Prototyping Team.  I now work directly for Herbert Van de Sompel, in collaboration with my advisor, Michael Nelson.

Up to this point, I worked for years as a software engineer, but then re-entered academia in 2010 to finish my Master's Degree.  I originally just wanted to be able to apply for jobs that required Master's Degrees in Computer Science, but during my time working on my thesis, I discovered that I had more of a passion for the research than I had expected, so I became a PhD student in Computer Science at Old Dominion University.  During the time of my Master's Degree, I had taken coursework that counts toward my PhD, so I am free to accept this current extended internship while I complete my PhD dissertation.

LANL is a fascinating place to work.  In my first week, we learned all about safety and security. We learned not only about office safety (don't trip over cables), but also nuclear and industrial safety (don't eat the radioactive material).  This was in preparation for the possibility that we might actually encounter an environment where these dangers existed. One of the more entertaining parts of the training was being aware of wildlife and natural dangers, such as rattlesnakes, falling rocks, flash floods, and tornadoes.  We also learned about the more mundane concepts like how to use our security badge and how to fill out timesheets.  I was fortunate to meet people from a variety of different disciplines.

We have nice, powerful computing equipment, good systems support, excellent collaboration areas, and very helpful staff. Everyone has been very conscientious and supportive as I have acquired access rights and equipment.

By the end of my first week, I had begun working with the Prototyping Team.  They shared their existing work with me, educating me on back-end technical aspects of Memento as well as other projects, such as Hiberlink. My team members Harihar Shankar and Lyudmila Balakireva have been nice enough to answer questions about the research work as well as general LANL processes.

I am already doing literature reviews and writing code for upcoming research projects.  We just released a new Python Memento Client library in collaboration with Wikipedia. I am also evaluating Jupyter for use in future data analysis and code collection.  I have learned so much in my first month here!

I know my friends and family miss me back in Virginia, but this time spent among some of the best and brightest in the world is already shaping up to be an enjoyable and exciting experience.

--Shawn M. Jones, Graduate Research Assistant, Los Alamos National Laboratory

Friday, August 28, 2015

2015-08-28 Original Header Replay Considered Coherent


As web archives have advanced over time, their ability to capture and playback web content has grown. The Memento Protocol, defined in RFC 7089, defines an HTTP protocol extension that bridges the present and past web by allowing time-based content negotiation. Now that Memento is operational at many web archives, analysis of archive content is simplified. Over the past several years, I have conducted analysis of web archive temporal coherence. Some of the results of this analysis will be published at Hypertext'15. This blog post discusses one implication of the research: the benefits achieved when web archives playback original headers.

Archive Headers and Original Headers

Consider the headers (Figure 1) returned for a logo from the ODU Computer Science Home Page as archived on Wed, 29 Apr 2015 15:15:23 GMT.

HTTP/1.1 200 OK
Content-Type: image/gif
Last-Modified: Wed, 29 Apr 2015 15:15:23 GMT
Figure 1. No Original Header Playback

Try to answer the question "Was the representation provided by the web archive valid for Tue, 28 Apr 2015 12:00:00 GMT?" (i.e., the day before). The best answer possible is maybe. Because I have spent many hours using the Computer Science web site, I know the site changes infrequently. Given this knowledge, I might upgrade the answer from maybe to probably. This difficulty answering is due to the Last-Modified header reflecting the date archived instead of the date the image itself was last modified. And, although it is true that the memento (archived copy) was indeed modified Wed, 29 Apr 2015 15:15:23 GMT, this merging of original resource Last-Modified and memento Last-Modified loses valuable information. (Read Memento-Datetime is not Last-Modified for more details.)

Now consider the headers (figure 2) for another copy that was archived Sun, 14 Mar 2015 22:21:07 GMT. Take special note of the X-Archive-Orig-* headers. These are a playback of original headers that were included in the response when the logo image was captured by the web archive.

HTTP/1.1 200 OK
Content-Type: image/gif
X-Archive-Orig-etag: "52d202fb-19db"
X-Archive-Orig-last-modified: Sun, 12 Jan 2014 02:50:35 GMT
X-Archive-Orig-expires: Sat, 19 Dec 2015 13:01:55 GMT
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-cache-control: max-age=31104000
X-Archive-Orig-connection: keep-alive
X-Archive-Orig-date: Wed, 24 Dec 2014 13:01:55 GMT
X-Archive-Orig-content-type: image/gif
X-Archive-Orig-server: nginx
X-Archive-Orig-content-length: 6619
Memento-Datetime: Sun, 14 Mar 2015 22:21:07 GMT
Figure 2. Original Header Playback

Compare the Memento-Datetime (which is the archive datetime) and the X-Archive-Orig-last-modified headers while answering this question: "Was the representation provided by the web archive valid for Tue, 13 Mar 2015 12:00:00 GMT?". Clearly the answer is yes.

Why This Matters

For the casual web archive user, the previous example may seem like must nit-picky detail. Still, consider the Weather Underground page archived on Thu, 09 Dec 2004 19:09:26 GMT and shown in Figure 3.

Weather Underground as archived Thu, 09 Dec 2004 19:09:26 GMT
Figure 3. Weather Underground as archived Thu, 09 Dec 2004 19:09:26 GMT
The Weather Underground page (like most) is a composition of many resources including the web page itself, images,  style sheets, and JavaScript. Note the conflict between the forecast of light drizzle and the completely clear radar image. Figure 4 shows the relevant headers returned for the radar image:

HTTP/1.1 200 OK
Memento-Datetime: Mon, 12 Sep 2005 22:34:45 GMT
X-Archive-Orig-last-modified: Mon, 12 Sep 2005 22:32:24 GMT
Figure 4. Prima Facie Coherence Headers

Clearly the radar image was captured much later than the web page—over 9 months later in fact! But this alone does not prove the radar image is the incorrect image (perhaps Weather Underground radar images were broken on 09 Dec 2004). However, the Memento-Datetime and X-Archive-Orig-last-modified headers tell the full story, showing that not only was the radar image captured well after the web page was archived, but also that the radar image was modified well after the web page was archived. Thus, together Memento-Datetime and X-Archive-Org-Last-Modified are prima facie evidence that the radar image is temporally violative with respect to the archived web page in which it is displayed. Figure 5 illustrates this pattern. The black left-to-right arrow is time. The black diamond and text represent the web page; the green represents the radar image. The green line shows that the radar image X-Archive-Orig-Last-Modified and Memento-Datetime bracket the web page archival time. Details on this pattern and others are detailed in our framework technical report.

Figure 5. Prima Facie Coherence

But Does Header Playback Really Matter?

Of course, if few archived images and other embedded resources include Last-Modified headers, the overall effect could be inconsequential. However, the results to be published at Hypertext'15 show that using the Last-Modified header makes a significant coherence improvement: using Last-Modified to select embedded resources increased mean prima facie coherence from ~41% to ~55% compared to using just Memento-Datetime. And, at the time the research was conducted, only the Internet Archive included Last-Modified playback. If the other 14 web archives included in the study also implemented header playback, we project that mean prima facie coherence would have been about 80%!

Web Archives Today

When the research leading to the Hypertext'15 paper was conducted, only the Internet Archive included Last-Modified playback. This limited prima facie coherence determination to only embedded resources retrieved from the Internet Archive. As shown in Table 1, additional web archives now playback original headers. The table also show which archives implement the Memento Protocol (and are therefore RFC 7089 compliant) and which archives use OpenWayback, which already implements header replay. Although header playback is a long way from universal, progress is happening. We look forward to continuing coherence improvement as additional web archives implement header playback and the Memento Protocol.

Table 1. Current Web Archive Status
Web Archive Header Playback? Memento Compliant? OpenWayback?
Archive-It Yes Yes Yes
archive.today No Yes No
arXiv No No No
Bibliotheca Alexandrina Web Archive Unknown1 Yes Yes
Canadian Government Web Archive No Proxy No
Croatian Web Archive No Proxy No
DBPedia Archive No Yes No
Estonian Web Archive No Proxy No
GitHub No Proxy No
Icelandic Web Archive Yes Yes Yes
Internet Archive Yes Yes Yes
Library of Congress Web Archive Yes Yes Yes
NARA Web Archive No Proxy Yes
Orain No Proxy No
PastPages Web Archive No Yes No
Portugese Web Archive No Proxy No
PRONI Web Archive No Yes Yes
Slovenian Web Archive No Proxy No
Stanford Web Archive Yes Yes Yes
UK Government Web Archive No Yes Yes
UK Parliament's Web Archive No Yes Yes
UK Web Archive Yes Yes Yes
Web Archive Singapore No Proxy No
WebCite No Proxy No
WikiPedia No Proxy No
1Unavailable at the time this post was written.

Wrap Up

Web archives featuring both the capture and replay of original header show significantly better temporal coherence in recomposed web pages. Currently, web archives using Heritrix and OpenWayback implement these features; no archives using other software are known to do so. Implementing original header capture and replay is highly recommended as it will allow implementation of improved recomposition heuristics (which is a topic for another day and another post).

— Scott G. Ainsworth

Friday, August 21, 2015

2015-08-20: ODU, L3S, Stanford, and Internet Archive Web Archiving Meeting

Two weeks ago (on Aug 3, 2015), I was glad to be invited to visit Internet Archive in San Francisco in order to share our latest work with a set of the Web Archiving pioneers from around the world.

The attendees were Jefferson Bailey and Vinay Goel from IA, Nicholas Taylor and Ahmed AlSum from Stanford, and Wolfgang Nejdl, Ivana Marenzi and Helge Holzmann from L3S.

First, we took a quick introduction to each others mentioning the purpose and the nature of our work to IA.

Then, Nejdl introduced the Alexandria project, and demoed the ArchiveWeb project, which aims to develop tools and techniques to explore and analyze Web archives in a meaningful way. In the project, they develop tools that will allow users to visualize and collaboratively interact with Archive-it collections by adding new resources in the form of tags and comments. Furthermore, it contains a collaborative search and sharing platform.

I presented the off-topic detection work with a live demo for the tool, which can be downloaded and tested from https://github.com/yasmina85/OffTopic-Detection.

The off-topic tool aims to automatically detect when the archived page goes off-topic, which means the page changed through time to move away from the initial scope of the page. The tool suggests a list of off-topic pages based on a specific threshold that is input by the user. Based on evaluating the tool, we suggest values for the threshold in a research paper* that can be used to detect the off-topic pages.

A site for one of the candidates for Egypt’s 2012 presidential election. Many of the captures of hamdeensabhay.com are not about the Egyptian Revolution. Later versions show an expired domain (as does the live Web version).

Examples for the usage of the tool:

Example 1: Detecting off-topic pages in 1826 collection

python detect_off_topic.py -i 1826 -th 0.15
extracting seed list


50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org

Downloading 4 mementos out of 306
Downloading 14 mementos out of 306

Detecting off-topic mementos using Cosine Similarity method

Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html

Example 2: Detecting off-topic pages for http://hamdeensabahy.com/

python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/  -m wcount -th -0.85

Downloading 0 mementos out of 270

Downloading 4 mementos out of 270

Extracting text from the html

Detecting off-topic mementos using Word Count method

Similarity memento_uri
-0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/

-0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/

Nicholas insisted on the importance of the off-topic tool from QA perspective, while Internet Archives folks focused on the required computation resources and how it can be shared with Archive-It partners. The group discussed some user interface options to display the output of the tool.

After the demo, we discussed the importance of the tool, especially in the crawling quality assurance practices.  While demoing ArchiveWeb interface, some of the visualization for pages from different collections showed off-topic pages.  We all agreed that it is important that those pages won’t appear to the users when they browse the collections.

It was amazing to spend time in IA and knowing about the last trend from other research groups. The discussion showed the high reputation of WS-DL research in the web archiving community around the world.

*Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, Detecting Off-Topic Pages in Web Archives, Proceedings of TPDL 2015, 2015.


Tuesday, August 18, 2015

2015-08-18: Three WS-DL Classes Offered for Fall 2015


The Web Science and Digital Libraries Group is offering three classes this fall.  Unfortunately there are no undergraduate offerings this semester, but there are three graduate classes covering the full WS-DL spectrum:

Note that while 891 classes count toward the 24 hours of 800-level class work for the PhD program, they do not count as one of the "four 800-level regular courses" required.  Students looking to satisfy one of the 800-level regular courses should consider CS 834.  Students considering doing research in the broad areas of Web Science should consider taking all three of these classes this semester.


Monday, July 27, 2015

2015-07-27: Upcoming Colloquium, Visit from Herbert Van de Sompel

On Wednesday, August 5, 2015 Herbert Van de Sompel (Los Alamos National Laboratory) will give a colloquium in the ODU Computer Science Department entitled "A Perspective on Archiving the Scholarly Web".  It will be held in the third floor E&CS conference room (r. 3316) at 11am.  Space is somewhat limited (the first floor auditorium is being renovated), but all are welcome to attend.  The abstract for his talk is:

 A Perspective on Archiving the Scholarly Web
As the scholarly communication system evolves to become natively web-based and starts supporting the communication of a wide variety of objects, the manner in which its essential functions -- registration, certification, awareness, archiving -- are fulfilled co-evolves.  Illustrations of the changing implementation of these functions will be used to arrive at a high-level characterization of a future scholarly communication system and of the objects that will be communicated. The focus will then shift to the fulfillment of the archival function for web-native scholarship. Observations regarding the status quo, which largely consists of back-office processes that have their origin in paper-based communication, suggest the need for a change. The outlines of a different archival approach inspired by existing web archiving practices will be explored.
This presentation will be an evolution of ideas following his time as a visiting scholar at DANS, in conjunction with Dr. Andrew Treloar (ANDS) (2014-01 & 2014-12). 

Dr. Van de Sompel is an internationally recognized pioneer in the field of digital libraries and web preservation, with his contributions including many of the architectural solutions that define the community, including: OpenURL, SFX, OAI-PMH, OAI-ORE, info URI, bX, djatoka, MESUR, aDORe, Memento, Open Annotation, SharedCanvas, ResourceSync, and Hiberlink.

Also during his time at ODU, he will be reviewing the research projects of PhD students in the Web Science and Digital Libraries group as well as exploring new areas for collaboration with us.  This will be Dr. Van de Sompel's first trip to ODU since 2011 when he and Dr. Sanderson served as the external committee members for Martin Klein's PhD dissertation defense.

2015-08-17 Edit:

We recorded the presentation, but this professionally edited version from summer 2014 is better to watch online:

I've also compiled the slides of the seven students (Corren McCoy, Alexander Nwala, Scott Ainsworth, Lulwah Alkwai, Sawood Alam, Justin Brunelle, and Mat Kelly) who gave presentations on the status of their PhD research:


Friday, July 24, 2015

2015-07-24: ICSU World Data System Webinar #6: Web-Centric Solutions for Web-Based Scholarship

Earlier this week Herbert Van de Sompel gave a webinar for the ICSU World Data System entitled "Web-Centric Solutions for Web-Based Scholarship".  It's a short and simple review of some of the interoperability projects we've worked on through since 1999, including OAI-PMH, OAI-ORE, and Memento.  He ends with a short nod to his simple but powerful "Signposting the Scholarly Web" proposal, but the slides in the appendix give the full description.

The main point of this presentation was to document how each project successively further embraced the web, not just as a transport protocol but fully adopting the semantics as part of the protocol.  Herbert and I then had a fun email discussion about how the web, scholarly communication, and digital libraries were different in 1999 (the time of OAI-PMH & our initial collaboration) and now.  Some highlights include:
  • Although Google existed, it was not the hegemonic force that it is today, and contemporary search engines that did exist (e.g., AltaVista, Lycos) weren't that great (both in terms of precision and recall).  
  • The Deep Web was still a thing -- search engines did not reliably find obscure resources likely scholarly resources (cf. our 2006 IEEE IC study "Search Engine Coverage of the OAI-PMH Corpus" and Kat Hagedorn's 2008 follow up "Google Still Not Indexing Hidden Web URLs").
  • Related to the above, the focus in digital libraries was on repositories, not the web itself.  Everyone was sitting on an SQL database of "stuff" and HTTP was seen just as a transport in which to export the database contents.  This meant that the gateway script (ca. 1999, it was probably in Perl DBI) between the web and the database was the primary thing, not the database records or the resultant web pages (i.e., the web "resource").  
  • Focus on database scripts resulted in lots of people (not just us in OAI-PMH) tunneling ad-hoc/homemade protocols over HTTP.  In fairness, Roy Fielding's thesis defining REST only came out in 2000, and the W3C Web Architecture document was drafted in 2002 and finalized in 2004.  Yes, I suppose we should have sensed the essence of these documents in the early HTTP RFCs (2616, 2068, 1945) but... we didn't. 
  • The very existence of technologies such as SOAP (ca. 1998) nicely illustrates the prevailing mindset of HTTP as a replaceable transport. 
  • Technologies similar to OAI-PMH, such as RSS, were in flux and limited to 10 items (belying their news syndication origin which made them unsuitable for digital library applications).  
  • Full-text was relatively rare, so the focus was on metadata (see table 3 in the original UPS paper; every digital library description at the time distinguished between "records" and "records with full-text links").  Even if full-text was available, downloading and indexing it was an expensive operation for everyone involved -- bandwidth was limited and storage was expensive in 1999!  Sites like xxx.lanl.gov even threatened retaliation if you downloaded their full-text (today's text on that page is less antagonistic, but I recall the phrase "we fight back!").  Credit to CiteSeer for being an early digital library that was the first to use full-text (DL 1998).
Eventually Google Scholar announced they were deprecating OAI-PMH support, but the truth is they never really supported it in the first place.  It was just simpler to crawl the web, and the early focus on keeping robots out of the digital library had given way to making sure that they got into the digital library (e.g., Sitemaps).

The OAI-ORE and then Memento projects were more web-centric, as Herbert nicely explains in the slides, with OAI-ORE having a Semantic Web spin and Memento being more grounded in the IETF community.   As Herbert says at the beginning of the video, our perspective in 1999 was understandable given the practices at the time, but he goes on to say that he frequently reviews proposals about data management, scholarly communication, data preservation, etc. that continue to treat the web as a transport protocol over which the "real" protocol is deployed.  I would add that despite the proliferation of web APIs that claim to be RESTful, we're seeing a general retreat from REST/HATEOAS principles by the larger web community and not just the academic and scientific community.

In summary, our advice would be to fully embrace HTTP, since it is our community's Fortran and it's not going anywhere anytime soon


Thursday, July 23, 2015

2015-07-22: I Can Haz Memento

Inspired by the "#icanhazpdfmovement and built upon the Memento  service, I Can Haz Memento attempts to expand the awareness of Web Archiving through Twitter. Given a URL (for a page) in a tweet with the hash tag "#icanhazmemento," the I Can Haz Memento service replies the tweet with a link pointing to an archived version of the page closest to the time of the tweet. The consequence of this is: the archived version closest to the time of the tweet likely expresses the intent of the user at the time the link was shared.
Consider a scenario where Jane shares a link in a tweet to the front page of cnn about a story on healthcare. Given the fluid nature of the news cycle, at some point, the story about healthcare would be replaced by another fresh story; thus the link in Jane's tweet and its corresponding intent (healthcare story) become misrepresented by Jane's original link (for the new story). This is where I Can Haz Memento comes into the picture. If Jane included "#icanhazmemento" in her tweet, the service would have replied Jane's tweet with a link representing:
  • An archived version (closest to her tweet time) of the front page healthcare story on cnn, if the page had already been archived within a given temporal threshold (e.g 24 hours)Or
  • A newly archived version of the same page. In other words, the service does the archiving and returns the link to the newly archived page, if the page was not already archived.
How to use I Can Haz Memento
Method 1: In order to use the service, include the hashtag "#icanhazmemento" in the tweet with the link to the page you intend to archive or retrieve an archived version. For example, consider Shawn Jones' tweet below for http://www.cs.odu.edu:
Which prompted the following reply from the service:
Method 2: In Method 1, the hashtag "#icanhazmemento" and the URL,  http://www.cs.odu.edu, reside in the same tweet, but Method 2 does not impose this restriction. If someone (@anwala) tweeted a link (e.g arsenal.com), and you (@wsdlodu) wished the request be treated in the same manner as Method 1 (as though "#icanhazmemento" and  arsenal.com were in the same tweet), all that is required is a reply to the original tweet (without the "#icanhazmemento") with a tweet which includes "#icanhazmemento." Consider an example of Method 2 usage:
  1. @acnwala tweets arsenal.com without "#icanhazmemento"
  2. @wsdlodu replies the @acnwala's tweet with "#icanhazmemento"
  3. @icanhazmemento replies @wsdlodu with the archived versions of arsenal.com
The scenario (1, 2 and 3) is outlined by the following tweet threads:
 I Can Haz Memento - Implementation

I Can Haz Memento is implemented in Python and leverages the Twitter Tweepy API. The implementation is captured by the following subroutines:
  1. Retrieve links from tweets with "#icanhazmemento": This was achieved due to Tweepy's api.search API method. The sinceIDValue is used to keep track of already visited tweets. Also, the application sleeps in between each request in order to comply with Twitter's API rate limits, but not before retrieving the URLs from each tweet.
  2. After the URLs in 1. have been retrieved, the following subroutine
    • Makes an HTTP Request to the Timegate API in order to get the the Memento (instance of the resource) closest to the time of tweet (since the time of tweet is passed as a parameter for datetime content negotiation):
    • If the page is not found in any archive, it is pushed to archive.org and archive.is for archiving:
The source code for the application is available on Gitub. We acknowledge the effort of Mat Kelly who wrote the first draft of the application. And we hope you use #icanhazmemento.