Monday, April 24, 2017

2017-04-24: Pushing Boundaries

Since the advent of the web, more elements of scholarly communication are occurring online. A world that once consisted mostly of conference proceedings, books, and journal articles now includes blog posts, project websites, datasets, software projects, and more. Efforts like LOCKSS, CLOCKSS, and Portico preserve the existing journal system, but there is no similar dedicated effort for the web presence of scholarly communication. Because web-based scholarly communication is born on the web, it can benefit from web archiving.

Archiving this web-based communication is complicated by the complexity of scholarly objects. Consider a dataset on the website Figshare, whose landing page is shown in Fig. 1. Each dataset on Figshare has a landing page consisting of a title, owner name, brief description, licensing information, and links to bibliographic metadata in various forms. If an archivist merely downloads the dataset and ignores the rest, then a future scholar using their holdings is denied context and additional metadata. The landing page, dataset, and bibliographic metadata are all objects making up this artifact. Thus, in order to preserve context, a crawler will need to acquire all of these linked resources belonging to this artifact on Figshare.

Fig. 1: A screenshot of the landing page of an artifact on Figshare.
Green boxes outline links to URIs that belong to this artifact.
Red boxes outline links to URIs that do not belong to this artifact.

Interestingly, this artifact links to another artifact, a master's thesis, that does not link back to this artifact, complicating discovery of the dataset associated with the research paper. Both artifacts are fully archived in the Internet Archive. In contrast, Fig. 2 below shows a different, incompletely archived Figshare artifact -- as of April 12, 2017 -- at the Internet Archive. Through incidental crawling, the Internet Archive discovered the landing page for this artifact, but has not acquired the actual dataset or the bibliographic metadata. Such cases show that incidental crawling is insufficient for archiving scholarly artifacts.

Fig. 2: A screenshot of the web pages from an incompletely archived artifact. The Internet Archive has archived the landing page of this Figshare artifact, but did not acquire the dataset or the bibliographic metadata about the artifact.

What qualifies as an artifact? An artifact is a set of interconnected objects belonging to a portal that represent some unit of scholarly discourse. Example artifacts include datasets, blog posts, software projects, presentations, discussion, and preprints. Artifacts like blog posts and presentations may only consist of a single object. As seen in Fig. 1, datasets can consist of landing pages, metadata, and additional documentation that are all part of the artifact. Software projects hosted online may consist of source code, project documentation, discussion pages, released binaries, and more. For example, the Python Memento Client library on the GitHub portal consists of source code, documentation, and issues. All of these items would become part of the software project artifact.  An artifact is usually a citable object, often referenced by a DOI.

Artifacts are attributed to a scholar or scholars. Portals provide methods like search engines, APIs, and user profile pages to discover artifacts. Outside of portals, web search engine results and focused crawling can also be used to discover artifacts. From experience, I have observed that each result from these search efforts contains a URI pointing to an entry page. To acquire the Figshare example above, a Figshare search engine result contained a link to the entry page, not links to the dataset or bibliographic data. A screenshot showing entry pages as local search engine results is seen in Fig. 3 below. Poursardar and Shipman have studied this problem for complex objects on the web and have concluded that "there is no simple answer to what is related to a resource" thus making it difficult to discover artifacts on the general web. Artifacts stored on scholarly portals, however, appear to have some structure that can be exploited. For the purposes of this post, I will discuss capturing all objects in an artifact starting from its entry page because the entry page is designed to be used by humans to reach the rest of the objects in the artifact and because entry pages are the results returned by these search methods.

Fig. 3: A screenshot showing search engine results in Figshare that lead to entry pages for artifacts.
HTML documents contain embedded resources, like JavaScript, CSS, and images. Web archiving technology is still evolving in its ability to acquire embedded resources. For the sake of this post, I assume that any archiving solution will capture these embedded resources for any HTML document; instead, this post focuses on discovering the base URIs of the linked resources making up the artifact, referred to in the rest of this article as artifact URIs.

For simplicity of discovery, I want to restrict artifact URIs to a specific portal. Thus, the set of domain names possible for each artifact URI in an artifact is restricted to the set of domain names used by the portal. I consider linked items on other portals to be separate artifacts. As mentioned with the example in Fig. 1, a dataset page links to its associated thesis, thus we have two interlinked artifacts: a dataset and a thesis. A discussion of interlinking artifacts is outside of the scope of this post and is being investigated by projects such as Research Objects by Bechhofer, De Roure, Gamble, Goble, and Buchan, as well as already being supported by efforts such as OAI-ORE.

Fig. 4: This diagram demonstrates an artifact and its boundary. Artifacts often have links to content elsewhere in the scholarly portal, but only some of these links are to items that belong to the artifact.
How does a crawler know which URIs belong to the artifact and which should be ignored?  Fig. 4 shows a diagram containing an entry page that links to several resources. Only some of these resources, the artifact URIs, belong to the artifact. How do we know which URIs linked from an entry page are artifact URIs? Collection synthesis and focused crawling will acquire pages matching a specific topic, but we want as close to the complete artifact as possible, with no missed objects and minimal extra ones. OAI-ORE provides a standard for aggregations of web resources using a special vocabulary as well as resource maps in RDF and other formats. Signposting is a machine-friendly solution that informs crawlers of this boundary by using link relations in the HTTP Link header to indicate which URIs belong to an artifact. The W3C work on "Packaging on the Web" and "Portable Web Publications" requires that the content be formatted to help machines find related resources. LOCKSS boxes use site-specific plugins to intelligently crawl publisher web sites for preservation. How can a crawler determine this boundary without signposting, OAI-ORE, these W3C drafts, or site-specific heuristics? Can we infer the boundary from the structures used in each site?

Fortunately, portals have predictable behaviors that automated tools can use. In this post I assume an automated system will use heuristics to advise a crawler that is attempting to discover all artifact URIs within the boundary of an artifact. The resulting artifact URIs can then be supplied to a high-resolution capture system, like Webrecorder.io, so that an archivist can create high-resolution captures of their content. The goal is to develop a limited set of general, rather than site-specific, heuristics. In addition to defining these heuristics, I also correlate them with similar settings in Archive-It to demonstrate that the problem of capturing many of these artifacts is largely addressed. I then indicate which heuristics apply to which artifacts on some known scholarly portals. I assume that all authentication, access (e.g., robots.txt exclusions), and licensing issues have been resolved and therefore all content is available.

Artifact Classes and Crawling Heuristics


To discover crawling heuristics in scholarly portals, I reviewed portals from Kramer and Boseman's Crowdsourced database of 400+ tools, part of their Innovations in Scholarly Communication. I filtered the list to only include entries from the categories of publication, outreach, and assessment. To find the most popular portals, I sorted the list by Twitter followers as a proxy for popularity. I then selected the top 36 portals from this list that were not journal articles and that contained scholarly artifacts. After that I manually reviewed artifacts on each portal to find common crawling heuristics shared across portals.

Single Artifact URI and Single Hop


In the figures below, I have drawn three different classes of web-based scholarly artifacts. The simplest class, in Fig. 5a, is an artifact consisting of a single artifact URI. This blog post, for example, is an artifact consisting of a single artifact URI.

Archiving single artifact URIs can be done easily in one shot with Webrecorder.io, Archive-It's One Page setting, and the "save the page" functionality offered by web archives like archive.is. I will refer to this heuristic as Single Page.


Fig. 5a: A diagram of the Single Artifact URI artifact class.
Fig. 5b: A diagram showing an example Single Hop artifact class.

Fig. 5c: A diagram showing an example of a Multi-Hop artifact class.

Fig. 5b shows an example artifact consisting of one entry page artifact URI and those artifact URIs linked from it. Our Figshare example above matches this second form. I will refer to this artifact class as Single Hop because all artifact URIs are available within one hop from the entry page. To capture all artifact URIs for this class, an archiving solution merely captures the entry page and any linked pages, stopping at one hop away from the entry page. Archive-It has a setting, named One Page+, that addresses this. Inspired by Archive-It's terminology, I will use the + to indicate "including all linked URIs within one hop". Thus, I will refer to the heuristic for capturing this artifact class as Single Page+.

Because one hop away will acquire menu items and other site content, Single Page+ will acquire more URIs than needed.  As an optimization, our automated system can first create an Ignored URIs List, inspired in part by a dead page detection algorithm by Bar-Yossef, Broder, Kumar, and Tomkins. The automated tool would fill this list using the following method:
  1. Construct an invalid URI (i.e., one that produces a 404 HTTP status) for the portal.
  2. Capture the content at that URI and place all links from that content into the ignored URIs list.
  3. Capture the content from the homepage of the portal and place all links from that content into the ignored URIs list.
  4. Remove the entry page URI from the ignored URIs list, if present.
The ignored URIs list should now contain URIs that are outside of the boundary, like those that refer to site menu items and licensing information. This method captures content both from the invalid URI and a homepage because homepages may not contain all menu items. As part of a promotion effort, the entry page URI may be featured on the homepage, hence we remove it from the list in the final step. Our system would then advise the crawler to ignore any URIs on this list, reducing the number of URIs crawled.
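Here is a minimal sketch of how an automated tool might build such a list, assuming the requests and BeautifulSoup packages; the function names and the fabricated 404 path are illustrative only, not part of any existing tool.

    import uuid
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup


    def extract_links(page_uri):
        # Dereference the URI and return the absolute URIs of all anchors on the page.
        response = requests.get(page_uri)
        soup = BeautifulSoup(response.text, "html.parser")
        return {urljoin(page_uri, a["href"]) for a in soup.find_all("a", href=True)}


    def build_ignored_uris(portal_homepage, entry_page_uri):
        # Step 1: construct a URI that should produce a 404 on the portal.
        invalid_uri = urljoin(portal_homepage, "/" + uuid.uuid4().hex)
        # Steps 2 and 3: gather links from the 404 page and from the portal homepage.
        ignored = extract_links(invalid_uri) | extract_links(portal_homepage)
        # Step 4: the entry page may be featured on the homepage, so do not ignore it.
        ignored.discard(entry_page_uri)
        return ignored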

I will refer to this modified heuristic as Single Page+ with Ignored URIs.

Multi-Hop Artifacts


Fig. 5c shows an example artifact of high complexity. It consists of many interlinked artifact URIs. Examples of scholarly sites fitting into this category include GitHub, Open Science Framework, and Dryad. Because multiple hops are required to reach all artifact URIs, I will refer to this artifact class as Multi-Hop. Due to its complexity, Multi-Hop breaks down into additional types that require special heuristics to acquire completely.

Software projects are stored on portals like GitHub and BitBucket. These portals host source code in repositories using a software version control system, typically Git or Mercurial. Each of these version control systems provides archivists with the ability to create a complete copy of the version control system repository. The portals provide more than just hosting for these repositories. They also provide issue tracking, documentation services, released binaries, and other content that provides additional context for the source code itself. The content from these additional services is not present in the downloaded copy of the version control system repository.

Fig. 6: Entry page belonging to the artifact representing the Memento Damage software project.

For these portals, the entry page URI is a substring of all artifact URIs. Consider the example GitHub source code page shown in Fig. 6. This entry page belongs to the artifact representing the Memento Damage software project. The entry page artifact URI is https://github.com/erikaris/web-memento-damage/. Artifact URIs belonging to this artifact will contain the entry page URI as a substring; here are some examples:
  • https://github.com/erikaris/web-memento-damage/issues
  • https://github.com/erikaris/web-memento-damage/graphs/contributors
  • https://github.com/erikaris/web-memento-damage/blob/master/memento_damage/phantomjs/text-coverage.js
  • https://github.com/erikaris/web-memento-damage/commit/afcdf74cc31178166f917e79bbad8f0285ae7831
Because all artifact URIs are based on the entry page URI, I have named this heuristic Path-Based.

In the case of GitHub, some artifact URIs reside in a different domain: raw.githubusercontent.com. Because these URIs are in a different domain and hence do not contain the entry page URI as a substring, they will be skipped by the Path-Based heuristic. We can amend the Path-Based heuristic to capture these additional resources by allowing the crawler to capture all linked URIs that belong to a domain different from the domain of the entry page. I refer to this heuristic as Path-Based with Externals.
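Here is a minimal sketch of a boundary test implementing the Path-Based heuristic, with an optional flag for the Path-Based with Externals variant; the function name and the plain substring comparison are assumptions for illustration.

    from urllib.parse import urlparse


    def in_artifact_boundary(candidate_uri, entry_page_uri, allow_externals=False):
        # Path-Based: keep any URI containing the entry page URI as a substring.
        if entry_page_uri in candidate_uri:
            return True
        # Path-Based with Externals: also keep URIs hosted on a different domain
        # than the entry page (e.g., raw.githubusercontent.com for GitHub).
        if allow_externals:
            return urlparse(candidate_uri).netloc != urlparse(entry_page_uri).netloc
        return False


    # Using the Memento Damage entry page from Fig. 6:
    entry = "https://github.com/erikaris/web-memento-damage/"
    print(in_artifact_boundary("https://github.com/erikaris/web-memento-damage/issues", entry))  # True
    print(in_artifact_boundary("https://github.com/erikaris", entry))                            # False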

Silk allows a user to create an entire web site devoted to the data visualization and interaction of a single dataset. When a user creates a new Silk project, a subdomain is created to host that project (e.g., http://dashboard101innovations.silk.co/). Because the data and visualization are intertwined, the entire subdomain site is itself an artifact. Crawling this artifact still relies upon the path (i.e., a single slash), and hence, its related heuristic is Path-Based as well, but without the need to acquire content external to the portal.

For some portals, like Dryad, a significant string exists in the content of each object that is part of the artifact. An automated tool can acquire this significant string from the <title> element of the HTML of the entry page and a crawler can search for the significant string in the content -- not just the title, but the complete content -- of each resource discovered during the crawl. If the resource's content does not contain this significant string, then it is discarded. I refer to this heuristic as Significant String from Title.
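A minimal sketch of the Significant String from Title heuristic, assuming the requests and BeautifulSoup packages; a real crawler would fold this check into its frontier processing.

    import requests
    from bs4 import BeautifulSoup


    def significant_string_from_title(entry_page_uri):
        # The significant string is taken from the <title> element of the entry page.
        html = requests.get(entry_page_uri).text
        return BeautifulSoup(html, "html.parser").title.get_text().strip()


    def content_matches_significant_string(candidate_uri, significant_string):
        # Keep the candidate only if its complete content contains the significant string.
        return significant_string in requests.get(candidate_uri).text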

Fig. 7: A diagram of a Dryad artifact consisting of a single dataset, but multiple metadata pages. Red boxes outline the significant string, Data from: Cytokine responses in birds challenged with the human food-borne pathogen Campylobacter jejuni implies a Th17 response., which is found in the title of the entry page and is present in almost all objects within the artifact. Only the dataset does not contain this significant string, hence a crawler must crawl URIs one hop out from those matching the Significant String from Title heuristic and also ignore menu items and other common links, hence the Significant String from Title+ with Ignored URIs is the prescribed heuristic in this case.

In reality, this heuristic misses the datasets linked from each Dryad page. To solve this our automated tool can create an ignored URI list using the techniques mentioned above. Then the crawler can crawl one hop out from each captured page, ignoring URIs in this list. I refer to this heuristic as Significant String from Title+ with Ignored URIs. Fig.7 shows an example Dryad artifact that can make use of this heuristic.

A crawler can use this heuristic for Open Science Framework (OSF) with one additional modification. OSF includes the string "OSF | " in all page titles, but not in the content of resources belonging to the artifact, hence an automated system needs to remove it before the title can be compared with the content of linked pages. Fig. 8 shows an example of this.

Fig. 8: A screenshot of an OSF artifact entry page showing the source in the lower pane. The title element of the page contains the string "OSF | Role of UvrD/Mfd in TCR Supplementary Material". The string "Role of UvrD/Mfd in TCR Supplementary Material" is present in all objects related to this artifact. To use this significant string, the substring "OSF | " must be removed.
Here are the steps for removing the matching text:
  1. Capture the content of the entry page.
  2. Save the text from the <title> tag of the entry page.
  3. Capture the content of the portal homepage.
  4. Save the text from the <title> tag of the homepage.
  5. Starting from the leftmost character of each string, compare the characters of the entry page title text with the homepage title text.
    1. If the characters match, remove the character in the same position from the entry page title.
    2. Stop comparing when characters no longer match.
I will refer to the heuristic with this modification as Significant String from Filtered Title.
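A minimal sketch of this title-filtering step; the entry page and homepage titles are assumed to have been extracted as in the earlier sketch.

    def filter_title(entry_title, homepage_title):
        # Compare characters from the left; drop the shared prefix from the entry title.
        i = 0
        while i < len(entry_title) and i < len(homepage_title) and entry_title[i] == homepage_title[i]:
            i += 1
        return entry_title[i:]


    # For OSF, "OSF | Role of UvrD/Mfd in TCR Supplementary Material" compared with a
    # homepage title beginning "OSF | " yields "Role of UvrD/Mfd in TCR Supplementary Material".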

Entry pages for the Global Biodiversity Information Facility (GBIF) contain an identification string in the path part of each URI that is present in all linked URIs belonging to the same artifact. For example, the entry page at URI http://www.gbif.org/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba contains the string 98333cb6-6c15-4add-aa0e-b322bf1500ba and its page content links to the following artifact URIs:
  • http://www.gbif.org/occurrence/search?datasetKey=98333cb6-6c15-4add-aa0e-b322bf1500ba
  • http://api.gbif.org/v1/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba/document
An automated system can compare the entry page URI to each of the URIs of its links to extract this significant string. Informed by this system, a crawler will then ignore URIs that do not contain this string. I will refer to the heuristic for artifacts on this portal as Significant String from URI.

Discovering the significant string in the URI may also require site-specific heuristics. Finding the longest common substring between the path elements of the entry page URI and any linked URIs may work for the GBIF portal, but it may not work for other portals, so this heuristic may need further development before it can be applied more broadly.
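A sketch of that longest-common-substring idea, applied to the GBIF example above with Python's standard difflib; a real system would likely intersect the results from several linked URIs and, as noted, may still need portal-specific refinement.

    from difflib import SequenceMatcher


    def significant_string_from_uri(entry_page_uri, linked_uri):
        # Longest common substring between the entry page URI and one linked URI.
        matcher = SequenceMatcher(None, entry_page_uri, linked_uri, autojunk=False)
        match = matcher.find_longest_match(0, len(entry_page_uri), 0, len(linked_uri))
        return entry_page_uri[match.a:match.a + match.size]


    entry = "http://www.gbif.org/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba"
    linked = "http://www.gbif.org/occurrence/search?datasetKey=98333cb6-6c15-4add-aa0e-b322bf1500ba"
    print(significant_string_from_uri(entry, linked))  # prints the dataset identifier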

The table below lists the artifact classes and associated heuristics that have been covered. As noted above, even though an artifact fits into a particular class, its structure on the portal is ultimately what determines its applicable crawling heuristic.


Artifact Class | Potential Heuristics | Potentially Adds Extra URIs Outside Artifact | Potentially Misses Artifact URIs
Single Artifact URI | Single Page | No | No
Single Hop | Single Page+ | Yes | No
Single Hop | Single Page+ with Ignored URIs | Yes (but reduced compared to Single Page+) | No
Multi-Hop | Path-Based | Depends on Portal/Artifact | Depends on Portal/Artifact
Multi-Hop | Path-Based with Externals | Yes | No
Multi-Hop | Significant String from Title | No | Yes
Multi-Hop | Significant String from Title+ with Ignored URIs | Yes | No
Multi-Hop | Significant String from Filtered Title | No | Yes
Multi-Hop | Significant String from URI | No | Yes


Comparison to Archive-It Settings


Even though the focus of this post has been to find artifact URIs with the goal of feeding them into a high-resolution crawler, like Webrecorder.io, it is important to note that Archive-It has settings that match or approximate many of these heuristics. This would allow an archivist to use an entry page as a seed URI and capture all artifact URIs. The table below pairs each heuristic mentioned here with an Archive-It setting that functions similarly.


Crawling Heuristic from this Post | Similar Setting in Archive-It
Single Page | Seed Type: One Page
Single Page+ | Seed Type: One Page+
Single Page+ with Ignored URIs | Seed Type: Standard, plus Host Scope Rule: Block URL if it contains the text <string>
Path-Based | Seed Type: Standard
Path-Based with Externals | Seed Type: Standard+
Significant String from Title | None
Significant String from Title+ with Ignored URIs | None
Significant String from Filtered Title | None
Significant String from URI | Seed Type: Standard, plus Expand Scope to Include URL if it contains the text <string>

Fig. 9: This is a screenshot of part of the Archive-It configuration allowing a user to control the crawling strategy for each seed.
Archive-It's seed type settings allow one to change how a seed is crawled. As shown in Fig. 9, four settings are available. One Page, One Page+, and Standard all map exactly to our heuristics of Single Page, Single Page+, and Path-Based. For Path-Based, one merely needs to supply the entry page URI as a seed -- including the ending slash -- and Archive-It's scoping rules will restrict the crawl to URIs that include the entry page URI. Depending on the portal, Standard+ may crawl more URIs than Path-Based with Externals, but is otherwise successful in acquiring all artifact URIs.

Fig. 10: This is a screenshot of the Archive-It configuration allowing the user to expand the crawl scope to include URIs that contain a given string.

Fig. 11: This screenshot displays the portion of the Archive-It configuration allowing the user to block URIs that contain a given string.
To address our other heuristics, Archive-It's scoping rules must be altered, with screenshots of these settings shown in Figs 10 and 11. To mimic our Significant String from URI heuristic, a user would first need to know the significant string, and then can supply it as an argument to the setting "Expand Scope to include URL if it contains the text:". Likewise, to mimic Single Page+ with Ignored URIs, a user would need to know which URIs to ignore, and can use them as arguments to the setting "Block URL if...".

Archive-It does not have a setting for analyzing page content during a crawl, and hence I have not found settings that can address any of the members of the Significant String from Title family of heuristics.

In addition to these settings, a user will need to experiment with crawl times to capture some of the Multi-Hop artifacts due to the number of artifact URIs that must be visited.

Heuristics Used In Review of Artifacts on Scholarly Portals


While manually reviewing one or two artifacts from each of the 36 portals from the dataset, I documented the crawling heuristic I used for each artifact, shown in the table below. I focused on a single type of artifact for each portal. It is possible that different artifact types (e.g., blog post vs. forum) may require different heuristics even though they reside on the same portal.


Portal | Artifact Type Reviewed | Artifact Class | Applicable Crawling Heuristic
Academic Room | Blog Post | Single Artifact URI | Single Page
AskforEvidence | Blog Post | Single Artifact URI | Single Page
Benchfly | Video | Single Artifact URI | Single Page
BioRxiv | Preprint | Single Hop | Single Page+ with Ignores
BitBucket | Software Project | Multi-Hop | Path-Based
Dataverse* | Dataset | Multi-Hop | Significant String From Filtered Title (starting from title end instead of beginning like OSF)
Dryad | Dataset | Multi-Hop | Significant String From Title+ with Ignored URI List
ExternalDiffusion | Blog Post | Single Artifact URI | Single Page
Figshare | Dataset | Single Hop | Single Page+ with Ignores
GitHub | Software Project | Multi-Hop | Path-Based with Externals
GitLab.com | Software Project | Multi-Hop | Path-Based
Global Biodiversity Information Facility* | Dataset | Multi-Hop | Significant String From URI
HASTAC | Blog Post | Single Artifact URI | Single Page
Hypotheses | Blog Post | Single Artifact URI | Single Page
JoVe | Videos | Single Hop | Single Page+ with Ignores
JSTOR daily | Article | Single Artifact URI | Single Page
Kaggle Datasets | Dataset with Code and Discussion | Multi-Hop | Path-Based with Externals
MethodSpace | Blog Post | Single Artifact URI | Single Page
Nautilus | Article | Single Artifact URI | Single Page
Omeka.net* | Collection Item | Single Artifact URI | Single Page (but Depends on Installation)
Open Science Framework* | Non-web content, e.g., datasets and PDFs | Multi-Hop | Significant String From Filtered Title
PubMed Commons | Discussion | Single Artifact URI | Single Page
PubPeer | Discussion | Single Artifact URI | Single Page
ScienceBlogs | Blog Post | Single Artifact URI | Single Page
Scientopia | Blog Post | Single Artifact URI | Single Page
SciLogs | Blog Post | Single Artifact URI | Single Page
Silk* | Data Visualization and Interaction | Multi-Hop | Path-Based
Slideshare | Slideshow | Multi-Hop | Path-Based
SocialScienceSpace | Blog Post | Single Artifact URI | Single Page
SSRN | Preprint | Single Hop | Single Page+ with Ignores
Story Collider | Audio | Single Artifact URI | Single Page
The Conversation | Article | Single Artifact URI | Single Page
The Open Notebook | Blog Post | Single Artifact URI | Single Page
United Academics | Article | Single Artifact URI | Single Page
Wikipedia | Encyclopedia Article | Single Artifact URI | Single Page
Zenodo | Non-web content | Single Hop | Single Page+ with Ignores

Five entries are marked with an asterisk (*) because they may offer additional challenges.

Omeka.net provides hosting for the Omeka software suite, allowing organizations to feature collections of artifacts and their metadata on the web. Because each organization can customize their Omeka installation, they may add features that make the Single Page heuristic no longer function. Dataverse is similar in this regard. I only reviewed artifacts from Harvard's Dataverse.

Global Biodiversity Information Facility (GBIF) contains datasets submitted by various institutions throughout the world. A crawler can acquire some metadata about these datasets, but the dataset itself cannot be downloaded from these pages. Instead, an authenticated user must request the dataset. Once the request has been processed, the portal then sends an email to the user with a URI indicating where the dataset may be downloaded. Because of this extra step, this additional dataset URI will need to be archived by a human separately. In addition, it will not be linked from content of the other captured artifact URIs.

Dataverse, Open Science Framework, and Silk offer additional challenges. A crawler cannot just use anchor tags to find artifact URIs because some content is only reachable via user interaction with page elements (e.g., buttons, dropdowns, specific <div> tags). Webrecorder.io can handle these interactive elements because a human performs the crawling. The automated system that we are proposing to aid a crawler will not be as successful unless it can detect these elements and mimic the human's actions. CLOCKSS has been working on this problem since 2009 and has developed an AJAX collector to address some of these issues.

Further Thoughts


There may be additional types of artifacts that I did not see on these portals. Those artifacts may require different heuristics. Also, there are many more scholarly portals that have not yet been reviewed, and it is likely that additional heuristics will need to be developed to address some of them. A larger study analyzing the feasibility and accuracy of these heuristics is needed.

From these 36 portals, most artifacts fall into the class of Single Artifact URI. A larger study on the distribution of classes of artifacts could indicate how well existing crawling technology can discover artifact URIs and hence archive complete artifacts.

Currently, a system would need to know which of these heuristics to use based on the portal and type of artifact. Without any prior knowledge, is there a way our system can use the entry page -- including its URI, response headers, and content -- to determine to which artifact type the entry page belongs? From there, can the system determine which heuristic can be used? Further work may be able to develop a more complex heuristic or even an algorithm applicable to most artifacts.

These solutions rely on the entry page for initial information (e.g., URI strings, content). Given any other artifact URI in the artifact, is it possible to discover the rest of the artifact URIs? If a given artifact URI references content that does not contain other URIs -- either through links or text -- then the system will not be able to discover other artifact URIs. If the content of a given artifact URI does contain other URIs, a system would need to determine which heuristic might apply in order to find the other artifact URIs.

What about artifacts that link to other artifacts? Consider again our example in Fig. 1 where a dataset links to a thesis. A crawler can save those artifact URIs to its frontier and pursue the crawl of those additional artifacts separately, if so desired.  The crawler would need to determine when it had encountered a new artifact and pursue its crawl separately with the heuristics appropriate to the new artifact and portal.

Conclusion


I have outlined several heuristics for discovering artifact URIs belonging to an artifact. I also demonstrated that many of those heuristics can already be used with Archive-It. The heuristics offered here require that one know the entry page URI of the artifact and they expect that any system analyzing pages can work with interactive elements. Because portals provide predictable patterns, finding the boundary appears to be a tractable problem for anyone looking to archive a scholarly object.

--Shawn M. Jones

Acknowledgements: Special thanks to Mary Haberle for helping to explain Archive-It scoping rules.

Sunday, April 23, 2017

2017-04-23: Remembering Professor Hussein Abdel-Wahab

Hussein (blue shirt) at the post-defense feast for Dr. SalahEldeen.
As we head into exam week, I can't help but reflect that this is the first exam week at ODU since 1980 that does not involve Professor Hussein Abdel-Wahab.  The department itself was established in 1979, so Hussein has been here nearly since the beginning.  For comparison, in 1980 I was in middle school. 

I had the privilege of knowing Hussein both as my instructor for three classes in 1996 & 1997, and as a colleague since 2002.  None who knew Hussein would dispute that he was an excellent instructor with unrivaled concern for students' learning and general well-being. It is fitting that ODU is establishing the Dr. Abdel-Wahab Memorial Scholarship (http://bit.ly/HusseinAbdelWahabODU) which will support graduate students.  As of April 11, the scholarship is 58% of the way to its goal of $25k.  I've donated, and I call on all former students and colleagues to continue Hussein's legacy and ensure this scholarship is fully funded.


--Michael

Thursday, April 20, 2017

2017-04-20: Trusted Timestamping of Mementos


The Memento Protocol provides a Memento-Datetime header indicating at what datetime a memento was captured by a given web archive. In most cases, this metadata sufficiently informs the user of when the given web resource existed. Even though it has been established in US courts that web archives can be used to legally establish that a given web resource existed at a given time, there is still potential to doubt this timestamp because the same web archive that provides the memento also provides its Memento-Datetime. Though not a replacement for Memento-Datetime, trusted timestamping provides independent certainty about when content existed and can be used to provide additional data to alleviate this doubt.
In this post, I examine different trusted timestamping methods. I start with some of the more traditional methods before discussing OriginStamp, a solution by Gipp, Meuschke, and Gernandt that uses the Bitcoin blockchain for timestamping.

Brief Cryptography Background


Trusted timestamping systems use some concepts from cryptography for confidentiality and integrity. I will provide a brief overview of these concepts here.

Throughout this document I will use the verb hashing to refer to the use of a one-way collision-resistant hash function. Users supply content, such as a document, as input and the hash function provides a digest as output.  Hash functions are one-way, meaning that no one can take that digest and reconstitute the document. Hash functions are collision-resistant, meaning that there is a very low probability that another input will produce the same digest, referred to as a collision. As shown in the figure below, small changes to the input of a hash function produce completely different digests. Thus, hash digests provide a way to identify content without revealing it. The output of the hash function is also referred to as a hash.
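As a quick illustration with Python's standard hashlib module, changing one character of the input produces a completely different SHA-256 digest:

    import hashlib

    print(hashlib.sha256(b"Red fox jumps over the blue dog").hexdigest())
    print(hashlib.sha256(b"Red fox jumps over the blue dof").hexdigest())
    # The two 64-character hexadecimal digests bear no obvious resemblance to each
    # other, and neither digest can be reversed to recover its input.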

This diagram shows the digests produced with the same cryptographic hash function over several inputs. A small change in the input produces a completely different hash digest as output. Source: Wikipedia
The timestamping solutions in this post use the SHA-256 and RIPEMD-160 hash functions. SHA-256 is a version of the SHA-2 algorithm that produces a 256-bit digest. Its predecessor, SHA-1, has been under scrutiny for some time. In 2005, cryptographers showed mathematically that SHA-1 was not collision-free, prompting many to start moving to SHA-2. In February of 2017, Google researchers were able to create a collision with SHA-1, showing that SHA-1 is no longer reliably trustworthy. Because collision attacks are theoretically possible, though technically infeasible, for SHA-2, SHA-3 has been developed as a future replacement. I mention this to show how the world of hash functions is dynamic, resulting in continued research on better functions. For this post, however, it is most important to just understand the purpose of hash functions.

In addition to hash functions, this post discusses solutions that utilize public-key cryptography, consisting of private keys and public keys. Users typically generate a private key from random information produced by their computer, then use an algorithm such as RSA or ECC to derive the corresponding public key. Users are expected to secure their private key, but share the public key.

A diagram showing an example of encryption using public and private keys. Source: Wikipedia
Users use public keys to encrypt content and private keys to decrypt it. In the figure above, everyone has access to Alice's public key. Bob encrypts a message using Alice's public key, but only Alice can decrypt it because she is the only one with access to her private key.

This process can be used in reverse to digitally sign content. The private key can be used to encrypt content and the public key can be used to decrypt it. This digital signature allows anyone with access to the public key to verify that the content was encrypted by the owner of the private key because only the owner of the private key should have access to the private key.
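As an illustration, here is a minimal sign-and-verify sketch using the pyca/cryptography package; the package choice, message, and padding parameters are assumptions for the example, not anything prescribed by the timestamping standards discussed later.

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    message = b"This content existed when it was signed."

    # The signer keeps the private key secret and publishes the public key.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    # Sign with the private key.
    signature = private_key.sign(
        message,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )

    # Anyone holding the public key can verify the signature;
    # verify() raises InvalidSignature if the message or signature was altered.
    public_key.verify(
        signature,
        message,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )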

Certificates are documents containing a public key and a digital signature. A user typically requests a certificate on behalf of themselves or a server. A trusted certificate authority verifies the user's identity and issues the certificate with a digital signature. Other users can verify the identity of the owner of the certificate by verifying the digital signature of the certificate with the certificate authority. Certificates expire after some time and must be renewed. If a user's private key is compromised, then the certificate authority can revoke the associated certificate.

A nonce is a single-use value that is added to data prior to encryption or hashing. Systems often insert it to ensure that transmitted encrypted data can not be reused by an attacker in the future. In this article nonces are used with hashing as part of Bitcoin's proof-of-work function, to be explained later.

Finally, there is the related concept of binary-to-text encoding. Encoding allows a system to convert data to printable text. Unlike hash digests, encoded text can be converted back into its original input data. Cryptographic systems typically use encoding to create human-readable versions of public/private keys and hash digests. Base64 is a popular encoding scheme used on the Internet. Bitcoin also uses the lesser known Base58 scheme.

Brief Bitcoin Background


Bitcoin is a cryptocurrency. It is not issued by an institution or backed by quantities of physical objects, like gold or silver. It is software that was released with an open source license to the world by an anonymous individual using the pseudonym Satoshi Nakamoto. Using a complex peer-to-peer network protocol, it ensures that funds (Bitcoins) are securely transferred from one account to another.
Bitcoin accounts are identified by addresses. Addresses are used to indicate where Bitcoins should be sent (paid). The end user's Bitcoin software uses public and private keys to generate an address. Users often have many addresses to ensure their privacy. Users have special purpose software, called wallets, that generates and keeps track of addresses. There is no central authority to issue addresses, meaning that addresses must be generated individually by all participants.
Wallets generate Bitcoin addresses using the following process:
  1. Generate an ECC public-private key pair
  2. Perform SHA-256 hash of public key
  3. Perform RIPEMD-160 hash of that result
  4. Add version byte (0x00) to the front of that result
  5. Perform a SHA-256 hash of that result, twice
  6. Append the first 4 bytes of that result to the value from #4
  7. Convert that result into Base58, which eliminates confusing characters 0 and O as well as 1 and l
The last step uses Base58 so that users can write the address on a piece of paper or speak it aloud over the phone. The ECC algorithms are used by Bitcoin to make the public-private key pair "somewhat resistant" to quantum computers. SHA-256 is used twice in step 5 to reduce the chance of success for any as yet unknown attacks against the SHA-2 hash function. Because all Bitcoin users generate addresses themselves, without a central addressing authority, this long process exists to reduce the probability of a collision between addresses to 0.01%. Even so, for improved security, the community suggests generating new addresses for each transaction. Note that only public-private keys and hashing are involved. There are no certificates to revoke or expire.
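Here is a sketch of steps 2 through 7, using only Python's standard hashlib module (RIPEMD-160 availability depends on the underlying OpenSSL build) and a hand-rolled Base58 encoder; step 1 is omitted, so the ECC public key is assumed to already be available as bytes.

    import hashlib

    BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


    def base58_encode(data: bytes) -> str:
        # Convert the bytes to a big integer, then repeatedly divide by 58.
        num = int.from_bytes(data, "big")
        encoded = ""
        while num > 0:
            num, rem = divmod(num, 58)
            encoded = BASE58_ALPHABET[rem] + encoded
        # Each leading zero byte is represented by the first alphabet character ("1").
        for byte in data:
            if byte != 0:
                break
            encoded = "1" + encoded
        return encoded


    def bitcoin_address_from_public_key(public_key: bytes) -> str:
        # Steps 2-3: SHA-256 of the public key, then RIPEMD-160 of that digest.
        sha = hashlib.sha256(public_key).digest()
        ripe = hashlib.new("ripemd160", sha).digest()
        # Step 4: prepend the version byte (0x00).
        versioned = b"\x00" + ripe
        # Steps 5-6: double SHA-256 checksum; append its first 4 bytes.
        checksum = hashlib.sha256(hashlib.sha256(versioned).digest()).digest()[:4]
        # Step 7: Base58-encode the versioned payload plus checksum.
        return base58_encode(versioned + checksum)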

Transactions contain the following types of entries:
  • Transaction inputs contain a list of addresses and amount of Bitcoins to transfer from those addresses. Also included is a digital signature corresponding to each address. This digital signature is used by the Bitcoin software to verify that the transaction is legitimate and thus these Bitcoins can be spent. There is also a user-generated script used to specify how to access the bitcoins, but the workings of these scripts are outside the scope of this post.
  • Transaction outputs contain a list of addresses and amount of Bitcoins to transfer to those addresses. As with transaction inputs, a user-generated script is included to specify how to spend the bitcoins, but I will not go into further detail here.
  • Another field exists to enter the amount of transaction fees paid to the miners for processing the transaction.
The Bitcoin system broadcasts new transactions to all nodes. Miners select transactions and group them into blocks. A block contains the transactions, a timestamp, a nonce, and a hash of the previous block.

Within each block, Bitcoin stores transactions in a Merkle tree, an example diagram of which is shown below. Transactions reside in the leaves of the tree. Each non-leaf node contains a hash of its children. This data structure is used to prevent corrupt or illicit transactions from being shared, and thus included in the blockchain.
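A simplified sketch of how a Merkle root can be computed from a list of transaction hashes, using Python's hashlib; Bitcoin's byte-ordering conventions are ignored here.

    import hashlib


    def double_sha256(data: bytes) -> bytes:
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()


    def merkle_root(transaction_hashes):
        # When a level has an odd number of hashes, the last hash is duplicated.
        level = list(transaction_hashes)
        while len(level) > 1:
            if len(level) % 2 == 1:
                level.append(level[-1])
            level = [double_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]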
A diagram showing an example of a Merkle tree. Each non-leaf node contains a hash of its children. For Bitcoin, transactions reside in the leaves. Source: Wikipedia
A conceptual diagram showing the Bitcoin blockchain. Each block contains a hash of the previous block, a timestamp, a nonce, and the root of a tree of transactions. Source: Wikipedia
Miners only see Bitcoin addresses and amounts in each transaction, providing some privacy to those submitting transactions. To add a block to the blockchain, miners must solve a proof-of-work function: the miner repeatedly chooses a nonce, combines it with the contents of the block, and hashes that combination using SHA-256, twice, until the resulting digest satisfies a difficulty target set by the network. This proof-of-work function is designed to be fast for other nodes to verify, but time-consuming for miners to execute. The difficulty is adjusted roughly every 14 days to ensure that miners continue to take about 10 minutes to process each block. For each block completed, miners are rewarded any user-provided transaction fees included in the transactions as well as newly minted Bitcoins -- a block reward. The block reward is currently set at 12.5 bitcoins, worth $15,939 as of March 2, 2017. Miners run software and dedicated hardware around the globe to solve the proof-of-work function. Currently the local cost of electricity is the limiting factor in profiting from mining bitcoins.
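A toy sketch of the proof-of-work search described above; the difficulty here is expressed as a number of leading zero bits rather than Bitcoin's compact target encoding, and the header layout is invented for illustration.

    import hashlib


    def mine(block_header: bytes, difficulty_bits: int = 20):
        # Find a nonce whose double SHA-256, combined with the header, falls below
        # the target. Real Bitcoin compares against a target encoded in the header.
        target = 2 ** (256 - difficulty_bits)
        nonce = 0
        while True:
            candidate = block_header + nonce.to_bytes(8, "little")
            digest = hashlib.sha256(hashlib.sha256(candidate).digest()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce, digest.hex()
            nonce += 1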
To alter previous transactions, an attacker would need to select the block containing the transaction they wished to alter, insert their illicit transaction, and create a new version of that block. They would then need to solve the proof-of-work for that block and all subsequent blocks faster than the rest of the miners, who collectively control more than 50% of the network's hashing power; thus it is considered extremely hard to alter the blockchain.

Bitcoins do not really exist, even on an individual's hard drive. The blockchain contains the record of every bitcoin spent and indicates the current balance at each Bitcoin address. Full Bitcoin nodes have a copy of the blockchain, currently at 105GB, which can create problems for users running full nodes. Satoshi Nakamoto recommended periodically pruning the blockchain of old transactions, but so far this has not been done.

Technology exists to create blockchains outside of Bitcoin, but Bitcoin provides incentives for participation, in terms of monetary rewards. Any system attempting to use a blockchain outside of Bitcoin would need to produce similar incentives for participants. The participation by the miners also secures the blockchain by preventing malicious users from spamming it with transactions.

How accurate are the timestamps in the blockchain? According to the Bitcoin wiki:
A timestamp is accepted as valid if it is greater than the median timestamp of previous 11 blocks, and less than the network-adjusted time + 2 hours. "Network-adjusted time" is the median of the timestamps returned by all nodes connected to you. As a result, block timestamps are not exactly accurate, and they do not even need to be in order. Block times are accurate only to within an hour or two.
Bitcoins come in several denominations. The largest is the Bitcoin. The smallest is the satoshi. One satoshi equals 0.00000001 (1 x 10^-8) Bitcoins.

Trusted Timestamping

Trusted Timestamping allows a verifier to determine that the given content existed during the time of the timestamp. It does not indicate time of creation. In many ways, it is like Memento-Datetime because it is still an observation of the document at a given point in time.
Timestamping can be performed by anyone with access to a document. For a timestamp to be defensible, however, it must be produced by a reliable and trusted source. For example, timestamps can be generated for a document by a user's personal computer and then signed with digital signatures. At some point in the future, a verifier can check that the digital signature is correct and verify the timestamp. This timestamp is not trustworthy because the clock on the personal computer may be altered or set incorrectly, thus providing doubt in the accuracy of the timestamp. Some trustworthy party must exist that not only sets its time correctly, but also ensures that timestamps are verifiable in the future.
Trusted Timestamping relies upon a trustworthy authority to accept data, typically a document, from a requestor and issue timestamps for future verification. The process then allows a verifier that has access to the timestamp and the original data to verify that the data existed at that point in time. Thus, two basic overall processes exist: (1) timestamp issue, and (2) timestamp verification.
In addition, privacy is a concern for documents. A document being transmitted can be intercepted and if the document is held by some third party for the purposes of verifying a timestamp in the future, then it is possible that the document can be stolen from the third party.  It is also possible for such a document to become corrupted. To address privacy concerns, trusted timestamping focuses on providing a timestamp for the hash of the content instead. Because such hashes cannot be reversed, the document cannot be reconstructed. Owners of the document, however, can generate the hashes from the document to verify it with the timestamping system.
Finally, verifying the timestamps should not depend on some ephemeral service. If such a service is nonexistent in the future, then the timestamps cannot be verified. Any timestamping solution will need to ensure that verification can be done for the foreseeable future.

Trusted Timestamping with RFC 3161 and ANSI X9.95


ANSI X9.95 extends RFC 3161 to provide standards for trusted timestamping in the form of a third party service called a Time Stamping Authority (TSA). Both standards discuss the formatting of request and response messages used to communicate with a TSA as well as indicating what requirements a TSA should meet.
The TSA issues time-stamp tokens (TST) as supporting evidence that the given content existed prior to a specific datetime. The following process allows the requestor to acquire a given timestamp:
  1. The requestor creates a hash of the content.
  2. The requestor submits this hash to the TSA.
  3. The TSA ensures that its clock is synchronized with an authoritative time source.
  4. The TSA ensures that the hash is the correct length, but, to ensure privacy, does not examine the hash in any other way.
  5. The TSA generates a TST containing the hash of the document, the timestamp, and a digital signature of these two pieces of data. The digital signature is signed with a private key whose sole purpose is timestamping. RFC 3161 requires that the requestor not be identified in the TST. The TST may also include additional metadata, such as the security policy used.
  6. The TST is sent back to the requestor, who should then store it along with the original document for future verification.

Simplified diagram showing the process of using a Time Stamp Authority (TSA) to issue and verify timestamps. Source: Wikipedia
To verify a timestamp, a verifier does not need the TSA. The verifier only needs:
  • the hash of the original document
  • the TST
  • the TSA's certificate
They use the original data and the TST in the following process (a simplified sketch in code follows the list):
  1. The verifier verifies the digital signature of the TST against the TSA’s certificate. If this is correct, then they know that the TST was issued by the TSA.
  2. The verifier then checks that the hash in the TST matches the hash of the document. If they match, then the verifier knows that the document hash was used to generate that TST.
  3. The timestamp contained in the TST and the hash were used in the generation of the digital signature, hence the TSA observed the document at the given time.
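Below is a highly simplified sketch of this verification logic; it treats the TST as a plain (hash, timestamp, signature) triple rather than the ASN.1 structure the standards actually define, and it assumes the pyca/cryptography package with an RSA TSA key and PKCS#1 v1.5 padding.

    import hashlib

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding


    def verify_timestamp(document_bytes, tst_hash, tst_timestamp, tst_signature, tsa_public_key):
        # Step 1: check the TSA's signature over the (hash, timestamp) pair.
        signed_data = tst_hash + tst_timestamp.encode("utf-8")
        try:
            tsa_public_key.verify(tst_signature, signed_data,
                                  padding.PKCS1v15(), hashes.SHA256())
        except InvalidSignature:
            return False
        # Step 2: check that the hash in the TST matches the hash of the document.
        return hashlib.sha256(document_bytes).digest() == tst_hash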
Haber and Stornetta mentioned that the TSA can be compromised in their 1991 paper "How to Time-Stamp a Digital Document" and prescribed a few solutions, such as linked timestamping, which is implemented by ANSI X9.95. With linked timestamping, each TST includes a hash digest of the previous TST. Users can then additionally verify that a timestamp is legitimate by comparing this hash digest with the previously granted TST.

ANSI X9.95 also supports the use of transient-key cryptography. In this case, the system generates a distinct public-private key pair for each timestamp issued. Once a timestamp is issued and digitally signed, the system deletes the private key so that it cannot be compromised. The verifier uses the public key to verify the digital signature.

Services using these standards are offered by companies like DigiStamp, eMudhra, Tecxoft, and Safe Stamper TSA. Up to 5 free timestamps can be generated per day per IP at Safe Creative's TSA.

The solutions above have different issues.

ANSI X9.95 and RFC 3161 provide additional guidance on certificate management and security to ensure that the TSA is not easily compromised, but the TSA is still the single point of failure in this scheme. If the TSA relies on an incorrect time source or is otherwise compromised, then all timestamps generated are invalid. If the TSA's certificate expires or is revoked, then verifying past timestamps becomes difficult if not impossible, depending on the availability of the now invalid certificate. If the revoked certificate is still available, the datetime of revocation can be used as an upper bound for the validity of any timestamps. Unfortunately, a certificate is usually revoked because its private key was compromised. A compromised key creates doubt in any timestamps issued using it. If transient-key cryptography is used, doubt exists with any generated public-private keys as well as their associated timestamps.

Linked timestamping helps ensure that the TSA's tokens are not easily faked, but requires that the verifier meet with other verifiers to review the tokens. This requirement violates the need for privacy.

Haber and Stornetta developed the idea of distributed trust for providing timestamps. The system relies on many clients being ready, available, and synchronized to a time source. Requestors would submit a document hash digest to a random set of k timestamping clients. These clients would in turn each digitally sign their timestamp response. Because the choice in clients is random, there is a low probability of malicious clients issuing bad timestamps. The requestor would then store all timestamps from the k clients who responded. Unfortunately, this system requires participation without direct incentives.

Trusted Timestamping with OriginStamp


Gipp, Meuschke, and Gernandt recognized that the cryptocurrency Bitcoin provides timestamping as part of maintaining the blockchain. Each block contains a hash of the previous block, implementing something similar to the linking concept developed by Haber and Stornetta and used in ANSI X9.95. The blockchain is distributed among all full Bitcoin clients and updated by miners, who only see transactions and cannot modify them. In some ways, the distributed nature of Bitcoin resembles parts of Haber and Stornetta's distributed trust. Finally, the blockchain, because it is distributed to all clients, is an independent authority able to verify timestamps of transactions, much like a TSA, but without the certificate and compromise issues.
They created the OriginStamp system for timestamping user-submitted documents with the Bitcoin blockchain. They chose Bitcoin because it is the most widely used cryptocurrency and thus is perceived to last for a long time. This longevity is a requirement for verification of timestamps in the future.

The OriginStamp process for converting document content into a Bitcoin address for use in a Bitcoin transaction that can later be verified against the blockchain.
The figure above displays the OriginStamp process for creating a Bitcoin address from a document:
  1. A user submits a document to the system which is then hashed, or just submits the hash of a document.
  2. The submitted document hash is placed into a list of hashes -- seed text -- from other submissions during that day.
  3. Once per day, this seed text is itself hashed using SHA-256 to produce an aggregate hash.
  4. This aggregate hash is used as the Bitcoin private key which is used to generate a public key.
  5. That public key is used to generate a Bitcoin address which can be used in a timestamped Bitcoin transaction of 1 satoshi.
OriginStamp could submit each document hash to the blockchain as an individual transaction, but the hashes are aggregated together to keep operating costs low. Because fees are taken out of every Bitcoin transaction, each transaction costs $0.03, allowing Gipp and his team to offer this low-cost service for free. They estimate that the system costs $10/year to operate.

Their paper was published in March of 2015. According to coindesk.com, for the March 2015 time period 1 Bitcoin was equal to $268.32. As of March of 2017, 1 Bitcoin is now equal to $960.36. The average transaction fee now sits at approximately 45,200 satoshis, resulting in a transaction fee of $0.43, as of March 26, 2017.

A screenshot of the memento that I timestamp throughout this section.

OriginStamp allows one to submit documents for timestamping using the Bitcoin blockchain. In this case, I submitted the HTML content of the memento shown in the figure above.

OriginStamp responds to the submission by indicating that it will submit a group of hashes to the Bitcoin blockchain in a few hours.

With the OriginStamp service, the requestor acquires a timestamp using the following process:
  1. Submit the document -- or just its hash -- to the OriginStamp website as seen in the screenshots above. If a document is submitted, its hash is calculated and the document is not retained.
  2. OriginStamp sends an email once the system has submitted the concatenated hashes to the Bitcoin blockchain. This email will contain information about the seed text used, and this seed must be retained for verification.
  3. In addition, the @originstamp Twitter account will tweet that the given hash was submitted to the blockchain.
A screenshot showing how OriginStamp displays verification information for a given hash. In this case, the document hash is da5328049647343c31e0e62d3886d6a21edb28406ede08a845adeb96d5e8bf50 and it was submitted to the blockchain on 4/10/2017 10:18:29 AM GMT-0600 (MDT) as part of seed text whose hash, and hence private key is c634bcafba86df8313332abc0ae854eea9083b279cdd4d9cde1d516ee6fb70d9.
Because the blockchain is expected to last for the foreseeable future and is tamper-proof, it can be used at any time to verify the timestamp. There are two methods available: with the OriginStamp service, or directly against the Bitcoin blockchain using the seed text.

To do so with the OriginStamp service, the verifier can follow this process:

  1. Using the OriginStamp web site, the verifier can submit the hash of the original document and will receive a response as shown in the screenshot above. The response contains the timestamp under the heading "Submitted to Bitcoin network".
  2. If the verifier wishes to find the timestamp in the blockchain, they can expand the "Show transaction details" section of this page, shown below. This section reveals a button allowing one to download the list of hashes (seed text) used in the transaction, the private and public keys used in the transaction, the recipient Bitcoin address, and a link to blockchain.info also allowing verification of the transaction at a specific time.
  3. Using the link "Verify the generated secret on http://blockchain.info", they can see the details of the transaction and verify the timestamp, shown in the figure below.
A screenshot showing that more information is available once the user clicks "show transaction details". The recipient Bitcoin address is outlined in red. From this screen, the user can download the seed text containing the list of hashes submitted to the Bitcoin blockchain. A SHA-256 hash of this seed text is the Bitcoin private key. From this private key, a user can generate the public key and eventually the Bitcoin address for verification. In this case the generated Bitcoin address is 1EcnftDQwawHQWhE67zxEHSLUEoXKZbasy.

A screenshot of the blockchain.info web site showing the details of a Bitcoin transaction, complete with its timestamp. The Bitcoin address and timestamp have been enlarged and outlined in red. For Bitcoin address 1EcnftDQwawHQWhE67zxEHSLUEoXKZbasy, 1 Satoshi was transferred on 2017-02-28 00:01:15, thus that transaction date is the timestamp for the corresponding document.
To verify that a document was timestamped by directly checking the Bitcoin blockchain, one only needs:
  • The hash of the original document.
  • The seed text containing the list of hashes submitted to the Bitcoin network.
  • Tools necessary to generate a Bitcoin address from a private key and also search the contents of the blockchain.
If OriginStamp is not available for verification, the verifier would then follow this process (a Python sketch of these steps appears after the figure below):
  1. Generate the hash of the document.
  2. Verify that this hash is in the seed text. This seed text should have been saved as a result of the email or tweet from OriginStamp.
  3. Hash the seed text with SHA-256 to produce the Bitcoin private key.
  4. Generate the Bitcoin address using a tool such as bitaddress.org. The figure below shows the use of bitaddress.org to generate a Bitcoin address using a private key.
  5. Search the Bitcoin blockchain for this address, using a service such as blockchain.info. The block timestamp is the timestamp of the document.
A screenshot of bitaddress.org's "Wallet Details" tab with the Bitcoin address enlarged and outlined in red. One can insert a Bitcoin private key and it will generate the public key and associated Bitcoin address. This example uses the private key c634bcafba86df8313332abc0ae854eea9083b279cdd4d9cde1d516ee6fb70d9 shown in previous screenshots, which corresponds to the Bitcoin address 1EcnftDQwawHQWhE67zxEHSLUEoXKZbasy, also shown in previous figures.
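To make these steps concrete, here is a minimal Python sketch of the direct verification process. It assumes the original document and the saved seed text are available as local files (the file names are placeholders) and that the seed text is a plain-text list of hashes. The third-party ecdsa and base58 packages are my choice of tooling rather than anything mandated by OriginStamp, and RIPEMD-160 support depends on the local OpenSSL build.

import hashlib

import base58  # pip install base58
import ecdsa   # pip install ecdsa


def bitcoin_address_from_private_key(private_key_hex):
    # Derive an uncompressed-key P2PKH Bitcoin address from a hex private key.
    signing_key = ecdsa.SigningKey.from_string(
        bytes.fromhex(private_key_hex), curve=ecdsa.SECP256k1)
    public_key = b"\x04" + signing_key.get_verifying_key().to_string()
    # HASH160 = RIPEMD-160(SHA-256(public key)); requires RIPEMD-160 in OpenSSL
    hash160 = hashlib.new("ripemd160", hashlib.sha256(public_key).digest()).digest()
    payload = b"\x00" + hash160  # version byte 0x00 for a mainnet address
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    return base58.b58encode(payload + checksum).decode()


# 1. Generate the hash of the document
document_hash = hashlib.sha256(open("memento.html", "rb").read()).hexdigest()

# 2. Verify that this hash appears in the saved seed text
seed_text = open("seedtext.txt", "rb").read()
assert document_hash in seed_text.decode()

# 3. Hash the seed text with SHA-256 to produce the Bitcoin private key
private_key = hashlib.sha256(seed_text).hexdigest()

# 4. Generate the Bitcoin address from the private key
address = bitcoin_address_from_private_key(private_key)

# 5. Search the blockchain (e.g., on blockchain.info) for this address;
#    the block timestamp of its transaction is the document's timestamp
print(address)

For the example above, this derivation should reproduce the address 1EcnftDQwawHQWhE67zxEHSLUEoXKZbasy, assuming the uncompressed form of the public key was used; bitaddress.org also displays a compressed-key address, which would differ.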
OriginStamp also supplies an API that developers can use to submit documents for timestamping as well as verify timestamps and download the seed text for a given document hash.

Comparison of Timestamping Solutions


In the table below I provide a summary comparison between the TSA, OriginStamp, and submitting directly to the blockchain without OriginStamp.

Financial Cost per Timestamp
  • TSA: Dependent on service and subscription; ranges from $3 down to $0.024 per timestamp
  • OriginStamp: Dependent on the size of the seed text, but less than a Bitcoin transaction fee
  • Directly to Blockchain: Bitcoin transaction fee, optimally $0.56

Accuracy of Timestamp
  • TSA: Within seconds of the time of request, but dependent on the number of requests in the queue if linked timestamping is used
  • OriginStamp: Within 1 day + 2 hours
  • Directly to Blockchain: Within 2 hours

Items Needed For Verification
  • TSA: Original document, the TST, and the certificate of the server to verify the signature
  • OriginStamp: Original document and the seed text of hashes submitted at the same time
  • Directly to Blockchain: Original document

Tools Needed For Verification
  • TSA: Certificate verification tools
  • OriginStamp and Directly to Blockchain: Software to generate a Bitcoin address and software to search the blockchain

Timestamp Storage
  • TSA: In the TST saved by the requestor
  • OriginStamp and Directly to Blockchain: Blockchain

Privacy
  • TSA: Only the hash of the document is submitted, but the TSA knows the requestor's IP address
  • OriginStamp and Directly to Blockchain: Miners only see the Bitcoin address, not who submitted the document or even its hash

Targets of Compromise
  • TSA: The TSA time server and the TSA certificate's private key
  • OriginStamp and Directly to Blockchain: The blockchain

Requirement for Compromise
  • TSA: Server is configured insecurely or has open software vulnerabilities
  • OriginStamp and Directly to Blockchain: More than 50% of Bitcoin miners colluding

Dependency of Longevity
  • TSA: Life of the organization offering the timestamping service
  • OriginStamp and Directly to Blockchain: Continued interest in preserving the blockchain

In the first row, we compare the cost of timestamps from each service. At the TSA service run by Digistamp, an archivist can obtain a rate of $0.024 per timestamp for 600,000 timestamps, but would need to commit to a one-year fee of $14,400 and acquire all 600,000 timestamps within that year or lose them. If they pay $10, they are only allocated 30 timestamps and need to use them within 3 months, resulting in a cost of $3 per timestamp. Tecxoft's pricing is similar. OriginStamp attempts to keep costs down by bundling many hashes into a seed text file, but is still at the mercy of Bitcoin's transaction fees. The price of Bitcoin is currently very volatile. The transaction fee mentioned in Gipp's work from 2015 was $0.03, but miners tend to process a block faster if it carries a higher transaction fee, and the optimal fee has gone up due to Bitcoin's rise in value and an increase in the number of transactions waiting to be processed. The lowest rate for the least delay is currently 200 satoshis per byte, and the median transaction size is 226 bytes, for a total cost of 45,200 satoshis. This was equivalent to $0.43 when I started writing this blog post and is now $0.56.
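As a quick sanity check on these figures, the arithmetic works out as follows; the fee rate, transaction size, and exchange rate are the point-in-time values quoted above, not constants.

fee_rate = 200       # satoshis per byte, lowest rate for least delay
tx_size = 226        # bytes, median transaction size
btc_usd = 960.36     # USD per BTC, March 2017 figure quoted earlier

fee_satoshis = fee_rate * tx_size       # 45,200 satoshis
fee_btc = fee_satoshis / 100_000_000    # 1 BTC = 10^8 satoshis
print(f"{fee_satoshis} satoshis = {fee_btc} BTC = ${fee_btc * btc_usd:.2f}")

This prints $0.43; the later $0.56 figure corresponds to the same 45,200 satoshis at a higher exchange rate.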

In the second row, we compare timestamp accuracy. The TSA should be able to issue a timestamp to the requestor within seconds of the request, though this can be delayed by a long queue if the TSA uses linked timestamping, because every request must be satisfied in order. OriginStamp, however, keeps costs down by submitting its seed list to the blockchain at once-a-day intervals, according to the paper. On top of this, the timestamp in the blockchain is accurate to within two hours of submission of the Bitcoin transaction. This means that an OriginStamp timestamp may be as much as 24 hours + 2 hours = 26 hours off from the time the document hash was submitted. In practice, I do not know the schedule used by OriginStamp: I submitted a document on February 28, 2017 that did not reach the Bitcoin network until March 4, 2017, while a document submitted on March 19, 2017 reached the Bitcoin network almost 18 hours later.
If the added cost is considered worthwhile, this lack of precision can be alleviated by bypassing OriginStamp and submitting to the blockchain directly. One could generate a Bitcoin address from a single document hash and submit a transaction to that address immediately. The timestamp precision would then be within 2 hours of transaction submission.
For the TSA, timestamps are stored in the TST, which must be saved by the requestor for future verification. In contrast, OriginStamp saves timestamps in the blockchain. OriginStamp users still need to save the seed list, so both solutions require the requestor to retain something, along with the original document, for future verification.

All solutions offer privacy through the use of document hashes. The Bitcoin miners receiving OriginStamp transactions only see the Bitcoin address generated from the hash of the seed list and do not even know it came from OriginStamp, hiding the original document submission in additional layers. The TSA, on the other hand, is aware of the requestor's IP address and potentially other identifying information.
To verify the timestamp, TSA users must have access to the original document, the TST, and the certificate of the TSA to verify the digital signature. OriginStamp only requires the original document and the seed list of hashes submitted to the blockchain. This means that OriginStamp requires slightly fewer items to be retained.
If using the blockchain directly, without OriginStamp, a single document hash could be used as the private key. There would be no seed list in this case. For verification, one would merely need the original document, which would be retained anyway.
To compromise the timestamps, the TSA time server must be attacked. This can be done by taking advantage of software vulnerabilities or insecure configurations. TSAs are usually audited to prevent insecure configurations, but vulnerabilities are frequently discovered in software. OriginStamp, on the other hand, requires that the blockchain be attacked directly, which is only possible if more than 50% of Bitcoin miners collude to manipulate blocks.
Finally, each service has different vulnerabilities when it comes to longevity. Mementos belong to web archives, and as such are intended to exist far longer than 20 years, making longevity a key concern in any choice of timestamping solution. The average lifespan of a company is now less than 20 years and is expected to decrease. The certificates a TSA uses to sign timestamps may last little more than 3 years, which means that someone must have saved the TSA's certificate before the timestamp can be verified. If the organization holding the document and the TST is also the organization providing the TSA's certificate, then there is cause to doubt the timestamp's validity, because that organization can potentially forge any or all of these verification components.
The Bitcoin blockchain, on the other hand, is not tied to any single organization and is expected to last as long as there is interest in investing in the cryptocurrency. In addition, there are many copies of the blockchain in the world. Even if Bitcoin goes away, there will still be interest in maintaining the blockchain for verification of past transactions, and thus in many parties retaining copies of it. If someone wishes to forge a timestamp, they would need to construct their own illicit blockchain, and even then, a copy of their blockchain could be compared to other existing copies to evaluate its validity. Thus, even if the blockchain is no longer updated, it is still an independent source of information that can be used for future verification. If the blockchain is ever pruned, prior copies will still need to be archived somewhere for verification of prior transactions. The combined interests of all of these parties support the expectation that the Bitcoin blockchain will outlast a single server certificate or company.

So, with trusted timestamping available, what options do we have to make it easy to verify memento timestamps?

Trusted Timestamping of Mementos


It is worth noting that, due to delays between timestamp request and response in each of these solutions, trusted timestamping is not a replacement for Memento-Datetime. The Memento-Datetime header provides a timestamp of when the web archive captured a given resource. Trusted timestamping, on the other hand, can provide an additional dimension of certainty that a resource existed at a given datetime. Just as Memento-Datetime applies to a resource at a specific URI-M, so would a trusted timestamp.

A crawler can capture a memento as part of its normal operations, compute the hash of its content, and then submit this hash for timestamping to one of these services. The original memento content encountered during the crawl, the raw memento, must be preserved by the archive indefinitely. The memento can include a link relation in the response headers, such as the unregistered trusted-timestamp-info relation shown in the example headers below, indicating where one can find additional information to verify the timestamp.

HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Thu, 21 Jul 2016 17:34:15 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 109672
Connection: keep-alive
Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
Link: <http://www.cnn.com/>; rel="original",
 <http://a.example.org/web/timemap/link/http://www.cnn.com/>; rel="timemap"; type="application/link-format",
 <http://a.example.org/web/http://www.cnn.com/>; rel="timegate",
 <http://a.example.org/web/20160721152544/http://www.cnn.com/>; rel="last memento"; datetime="Thu, 21 Jul 2016 15:25:44 GMT",
 <http://a.example.org/web/20160120080735/http://www.cnn.com/>; rel="first memento"; datetime="Wed, 20 Jan 2016 08:07:35 GMT",
 <http://a.example.org/web/20160721143259/http://www.cnn.com/>; rel="prev memento"; datetime="Thu, 21 Jul 2016 14:32:59 GMT",
 <http://a.example.org/timestamping/20160722191106/http://www.cnn.com/>; rel="trusted-timestamp-info"

The URI linked by the timestamp-info relation could be a JSON-formatted resource providing information for verification. For example, if OriginStamp is used for timestamping, then the resource might look like this:

{ "timestamping-method": "OriginStamp", "seed-text": "http://a.example.org/timestamping/20160722191106/seedtext.txt",
"hash-algorithm": "SHA-256" }

In this case, a verifier already knows the URI-M of the memento. They only need to locate the raw memento, calculate its hash, and use the seed text as described above to generate the Bitcoin address and find the timestamp in the Bitcoin blockchain.
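A minimal sketch of this flow is below, assuming the example URI-M, link relation, and JSON layout above (none of these endpoints are real) and reusing the address derivation from the earlier direct-verification sketch.

import hashlib

import requests

urim = "http://a.example.org/web/20160721152544/http://www.cnn.com/"
response = requests.get(urim)

# requests parses the Link header into a dict keyed by relation type
info_uri = response.links["trusted-timestamp-info"]["url"]
info = requests.get(info_uri).json()

# Fetch the seed text and confirm that the hash of the raw memento appears
# in it. The raw memento, not the rewritten playback version, must be
# hashed; how to retrieve it is archive-specific.
seed_text = requests.get(info["seed-text"]).content
document_hash = hashlib.sha256(response.content).hexdigest()
assert document_hash in seed_text.decode()

# The SHA-256 hash of the seed text is the Bitcoin private key; derive the
# Bitcoin address from it as in the earlier sketch and look up its
# transaction timestamp in the blockchain.
private_key = hashlib.sha256(seed_text).hexdigest()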

Or, if an RFC 3161 solution is used, the timestamp-info resource could look like this:

{ "timestamping-method": "RFC 3161", "timestamp-token": "http://a.example.org/timestamping/tst/20160721174431/http://www.cnn.com", "tsa-certificate": "http://a.example.org/timestamping/tsacert/cert.pem",
"hash-algorithm": "SHA-256" }

In this case, a verifier can locate the raw memento, calculate its hash, and verify it using the timestamp token (TST) and the TSA certificate, as described above for RFC 3161.
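A sketch of what that verification might look like, assuming the TST and the TSA certificate referenced in the JSON above have been downloaded to local files (the file names are placeholders) and that OpenSSL's ts subcommand is available. Depending on how the TSA certificate was issued, it may instead need to be passed via -untrusted alongside a separate CA bundle.

import hashlib
import subprocess

# Hash the raw memento with the algorithm named in the timestamp-info resource
digest = hashlib.sha256(open("raw-memento.html", "rb").read()).hexdigest()

# openssl ts -verify checks the token's signature against the certificate
# and confirms that the token covers this digest; -token_in indicates that
# the input is a timestamp token rather than a full timestamp response
subprocess.run(
    ["openssl", "ts", "-verify",
     "-digest", digest,
     "-token_in", "-in", "memento.tst",
     "-CAfile", "cert.pem"],
    check=True,
)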

If it is known that the crawler creates the hash of the raw memento and uses it as a private key for generating a Bitcoin address, thus submitting it directly to the blockchain for timestamping, then no additional headers would be needed. Verifiers only need the content of the raw memento to generate the hash. In addition, perhaps a separate timestamping service could exist for mementos, using the URI-M (e.g., https://timestamping-service.example.org/{URI-M}).

If one specific timestamping scheme is used, then perhaps specific link relations can be created to convey the resources from each of these fields.

Of course, this assumes that we only want to timestamp raw mementos. Conceivably, one might wish to timestamp a screenshot of a web page, or its WARC. Additional analysis of these other potential use cases is needed.

Summary


In this post, I have discussed different timestamping options. These options have different levels of complexity and security. All of them support privacy by permitting the submission of a document hash rather than the document itself. OriginStamp attempts to address some of the concerns with the existing ANSI X9.95/RFC 3161 standard by using the timestamping features of the Bitcoin blockchain.

Of these options, the Bitcoin blockchain offers a decentralized, secure solution that supports privacy and does not depend on a central server that can fail or be compromised. Because copies of the blockchain are distributed to all full Bitcoin clients, it remains available for verification in the future. Bitcoin has been around for 8 years and continues to increase in value. Because all participants have incentives to keep the blockchain distributed and up to date, it is expected to outlast most companies, which have a median age of 20 years. In addition, if Bitcoin is no longer used, copies of the blockchain will still need to be maintained indefinitely for verification. The blockchain does, however, suffer from the timestamp accuracy issues inherent in the Bitcoin protocol, which can be alleviated by submitting a document hash directly to the blockchain.

Companies offering trusted timestamping using TSAs, on the other hand, may not have that longevity and require subscription fees for a limited number of timestamps. Though Bitcoin is currently volatile, it has stabilized before, and the subscription fees from these companies are still, on average, more expensive than the Bitcoin transaction fee.

Even though timestamping options exist, use cases must be identified for the verification of such timestamps in the future. These use cases will inform requestors about which content should be timestamped and will also affect which timestamping solution is selected. It would also be beneficial for verifiers to have links to additional resources for verification.

Trusted timestamping of mementos is possible, but will require some additional decisions and technology to become a reality.

Additional References Used For This Post: