We performed a study of very large Memento TimeMaps to evaluate the ratio of representations versus redirects obtained when dereferencing each archived capture. Read along below or check out the full report.
Memento represents a set of captures for a URI (e.g., http://google.com) with a TimeMap. Web archives may provide a Memento endpoint that allows users to obtain this list of URIs for the captures, called URI-Ms. Each URI-M represents a single capture (memento), accessible when dereferencing the URI-M (resolving the URI-M to an archived representation of a resource).
Variations of the "original URI" are canonicalized (for instance, https://google.com and http://www.google.com:80/ are coalesced), and the original URI (URI-R in Memento terminology) is also included in the TimeMap with a literal "original" relationship value.
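To make the coalescing idea concrete, here is a minimal sketch of a canonicalization key. The rules it applies (ignore the scheme, lowercase the host, drop a leading "www.", drop a default port, treat an empty path as "/") are assumptions chosen only to reproduce the examples in this post; they are not the archive's actual algorithm, and `canonical_key` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Default ports per scheme; a port equal to the default carries no information.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_key(uri: str) -> str:
    """Reduce a URI to a simplified key so that trivial variants collide.

    Illustrative rules only (assumed, not an archive's real procedure).
    """
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep the port only when it differs from the scheme's default.
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{port}"
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    # The scheme is deliberately left out of the key.
    return f"{host}{path}{query}"
```

Under these assumed rules, https://google.com and http://www.google.com:80/ map to the same key, as do the www and non-www variants of ws-dl.blogspot.com.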
<http://ws-dl.blogspot.com/>; rel="original",
<http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/>; rel="self"; type="application/link-format"; from="Wed, 29 Sep 2010 00:03:40 GMT"; until="Mon, 20 Mar 2017 19:09:10 GMT",
<http://web.archive.org/web/http://ws-dl.blogspot.com/>; rel="timegate",
<http://web.archive.org/web/20100929000340/http://ws-dl.blogspot.com/>; rel="first memento"; datetime="Wed, 29 Sep 2010 00:03:40 GMT",
<http://web.archive.org/web/20110202180231/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Wed, 02 Feb 2011 18:02:31 GMT",
<http://web.archive.org/web/20110902171049/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 02 Sep 2011 17:10:49 GMT",
<http://web.archive.org/web/20110902171256/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 02 Sep 2011 17:12:56 GMT",
...
<http://web.archive.org/web/20151205080546/http://www.ws-dl.blogspot.com/>; rel="memento"; datetime="Sat, 05 Dec 2015 08:05:46 GMT",
<http://web.archive.org/web/20161104143102/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 04 Nov 2016 14:31:02 GMT",
<http://web.archive.org/web/20161109005749/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Wed, 09 Nov 2016 00:57:49 GMT",
<http://web.archive.org/web/20170119233646/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Thu, 19 Jan 2017 23:36:46 GMT",
<http://web.archive.org/web/20170320190910/http://ws-dl.blogspot.com/>; rel="last memento"; datetime="Mon, 20 Mar 2017 19:09:10 GMT"
For instance, to view the TimeMap for this very blog from Internet Archive, a user may request http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/ (Figure 1). Each URI-M (e.g., http://web.archive.org/web/20110902171256/http://ws-dl.blogspot.com/) is listed with a corresponding relationship (rel) and datetime value. Note that the www.ws-dl.blogspot.com and ws-dl.blogspot.com subdomain variants are both included in the same TimeMap, a product of the canonicalization procedure. The TimeMap for this URI-R currently contains 60 URI-Ms, whereas Internet Archive's Web interface reports 58 captures. This subtle difference in "count" gets much more extreme with other URI-Rs.
The quality of each memento (e.g., in terms of completeness of capture of embedded resources) cannot be determined using the TimeMap alone. This is inherent in the need to dereference a URI-M and to request each embedded resource when rendering the base URI-M. Comprehensively evaluating the quality over time is something we have already covered (see our TPDL2013, JCDL2014, and IJDL2015 papers/article).
In performing some studies and developing web archiving tools, we needed to know how many captures existed for a particular URI, using both a Memento aggregator and the TimeMap from an archive's Memento endpoint. For http://google.com, counting the number of URIs in a TimeMap with a rel value of "memento" produces a count of 695,525 (as of May 2017). The counts obtained from Internet Archive's calendar interface and CDX endpoint are currently much smaller (e.g., the calendar interface currently states 62,339 captures for google.com).
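The naïve count described above can be sketched as follows. This is a minimal parser that assumes every attribute value in the link-format TimeMap is double-quoted, as in Figure 1; `parse_timemap`, `count_mementos`, and `SAMPLE` are hypothetical names used for illustration, and the sample is an abbreviated excerpt of the figure.

```python
import re

# One entry looks like: <URI>; rel="..."; datetime="..."
# Quoted values may contain commas (e.g., datetimes), so we match
# entries with a pattern rather than splitting the text on ",".
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of (uri, attributes) pairs from a link-format TimeMap."""
    return [
        (match.group(1), dict(ATTR.findall(match.group(2))))
        for match in ENTRY.finditer(text)
    ]


def count_mementos(text):
    """Count entries whose rel value includes the token 'memento'.

    This also counts 'first memento' and 'last memento', but not
    'original', 'self', or 'timegate'.
    """
    return sum(
        1
        for _, attrs in parse_timemap(text)
        if "memento" in attrs.get("rel", "").split()
    )


SAMPLE = (
    '<http://ws-dl.blogspot.com/>; rel="original", '
    '<http://web.archive.org/web/http://ws-dl.blogspot.com/>; rel="timegate", '
    '<http://web.archive.org/web/20100929000340/http://ws-dl.blogspot.com/>; '
    'rel="first memento"; datetime="Wed, 29 Sep 2010 00:03:40 GMT", '
    '<http://web.archive.org/web/20110202180231/http://ws-dl.blogspot.com/>; '
    'rel="memento"; datetime="Wed, 02 Feb 2011 18:02:31 GMT", '
    '<http://web.archive.org/web/20170320190910/http://ws-dl.blogspot.com/>; '
    'rel="last memento"; datetime="Mon, 20 Mar 2017 19:09:10 GMT"'
)

if __name__ == "__main__":
    print(count_mementos(SAMPLE))
```

A count produced this way says only how many URI-Ms the TimeMap lists; it says nothing about what each URI-M returns when dereferenced.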
Dereferencing these URI-Ms would take a very long time due to network latency in accessing the archive as well as limits on pipelining (though the latter can be mitigated by distributing the task). We did exactly this for google.com and found that the large majority of the URI-Ms produced a redirect to another URI-M in the TimeMap. This led us to conclude that counting the rel="memento" entries in a TimeMap is not sufficient for counting the mementos in an archive's holdings.
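A rough sketch of the dereferencing step is shown below. It assumes that a HEAD request is enough to tell a redirect from a representation and that any 3xx status counts as a redirect; the procedure in the tech report may differ. The function names (`fetch_status`, `classify`, `tally`, `redirect_ratio`) are hypothetical.

```python
import http.client
from collections import Counter
from urllib.parse import urlsplit


def fetch_status(uri_m, timeout=30):
    """Send one HEAD request and return (status, Location header).

    http.client does not follow redirects, so a 3xx response is
    reported as-is instead of being resolved to its target.
    """
    parts = urlsplit(uri_m)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def classify(status):
    """Map an HTTP status code to a coarse category (assumed buckets)."""
    if 300 <= status < 400:
        return "redirect"
    if 200 <= status < 300:
        return "representation"
    return "other"


def tally(uri_ms, fetch=fetch_status):
    """Dereference each URI-M sequentially and count the categories."""
    counts = Counter()
    for uri_m in uri_ms:
        status, _location = fetch(uri_m)
        counts[classify(status)] += 1
    return counts


def redirect_ratio(counts):
    """Fraction of dereferenced URI-Ms that produced a redirect."""
    total = sum(counts.values())
    return counts["redirect"] / total if total else 0.0
```

The `fetch` parameter is injectable so the tallying logic can be exercised without touching the network. A sequential loop like this is exactly what makes a TimeMap with hundreds of thousands of URI-Ms slow to process; distributing the requests is left out of the sketch.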
For google.com, we found that nearly 85% of the URI-Ms resulted in a redirect when dereferenced. We repeated this procedure for seven other TimeMaps for large web sites (e.g., yahoo.com, instagram.com, wikipedia.org) and found that the results of this naïve counting method vary widely (88.2%, 67.3%, and 44.6% are redirects, respectively). We also repeated this procedure with thirteen academic institutions' URI-Rs to observe whether this trend persisted.
We have posted our full findings as a tech report on arXiv (linked below).
— Mat (@machawk1)
Mat Kelly, Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle, and Herbert Van de Sompel. "Impact of URI Canonicalization on Memento Count," Technical Report arXiv:1703.03302, 2017.