2017-04-24: Pushing Boundaries

Since the advent of the web, more elements of scholarly communication are occurring online. A world that once consisted mostly of conference proceedings, books, and journal articles now includes blog posts, project websites, datasets, software projects, and more. Efforts like LOCKSS, CLOCKSS, and Portico preserve the existing journal system, but there is no similar dedicated effort for the web presence of scholarly communication. Because web-based scholarly communication is born on the web, it can benefit from web archiving.

Web archiving is complicated, however, by the complexity of scholarly objects. Consider a dataset on the website Figshare, whose landing page is shown in Fig. 1. Each dataset on Figshare has a landing page consisting of a title, owner name, brief description, licensing information, and links to bibliographic metadata in various forms. If an archivist merely downloads the dataset and ignores the rest, then a future scholar using their holdings is denied context and additional metadata. The landing page, dataset, and bibliographic metadata are all objects making up this artifact. Thus, in order to preserve context, a crawler will need to acquire all of these linked resources belonging to this artifact on Figshare.

Fig. 1: A screenshot of the landing page of an artifact on Figshare.

Green boxes outline links to URIs that belong to this artifact.

Red boxes outline links to URIs that do not belong to this artifact.

Interestingly, this artifact links to another artifact, a master's thesis, that does not link back to this artifact, complicating discovery of the dataset associated with the research paper. Both artifacts are fully archived in the Internet Archive. In contrast, Fig. 2 below shows a different incomplete Figshare artifact -- as of April 12, 2017 -- at the Internet Archive. Through incidental crawling, the Internet Archive discovered the landing page for this artifact, but has not acquired the actual dataset or the bibliographic metadata. Such cases show that incidental crawling is insufficient for archiving scholarly artifacts.

Fig. 2: A screenshot of the web pages from an incompletely archived artifact. The Internet Archive has archived the landing page of this Figshare artifact, but did not acquire the dataset or the bibliographic metadata about the artifact.

What qualifies as an artifact? An artifact is a set of interconnected objects belonging to a portal that represent some unit of scholarly discourse. Example artifacts include datasets, blog posts, software projects, presentations, discussions, and preprints. Artifacts like blog posts and presentations may consist of only a single object. As seen in Fig. 1, datasets can consist of landing pages, metadata, and additional documentation that are all part of the artifact. Software projects hosted online may consist of source code, project documentation, discussion pages, released binaries, and more. For example, the Python Memento Client library on the GitHub portal consists of source code, documentation, and issues. All of these items would become part of the software project artifact. An artifact is usually a citable object, often referenced by a DOI.

Artifacts are attributed to a scholar or scholars. Portals provide methods like search engines, APIs, and user profile pages to discover artifacts. Outside of portals, web search engine results and focused crawling can also be used to discover artifacts. From experience, I have observed that each result from these search efforts contains a URI pointing to an entry page. To acquire the Figshare example above, a Figshare search engine result contained a link to the entry page, not links to the dataset or bibliographic data. A screenshot showing entry pages as local search engine results is seen in Fig. 3 below. Poursardar and Shipman have studied this problem for complex objects on the web and have concluded that "there is no simple answer to what is related to a resource" thus making it difficult to discover artifacts on the general web. Artifacts stored on scholarly portals, however, appear to have some structure that can be exploited. For the purposes of this post, I will discuss capturing all objects in an artifact starting from its entry page because the entry page is designed to be used by humans to reach the rest of the objects in the artifact and because entry pages are the results returned by these search methods.

Fig. 3: A screenshot showing search engine results in Figshare that lead to entry pages for artifacts.

HTML documents contain embedded resources, like JavaScript, CSS, and images. Web archiving technology is still evolving in acquiring embedded resources. For the sake of this post, I assume that any archiving solution will capture these embedded resources for any HTML document. This post will instead focus on discovering the base URIs of the linked resources making up the artifact, referred to in the rest of this article as artifact URIs.

For simplicity of discovery, I want to restrict artifact URIs to a specific portal. Thus, the set of domain names possible for each artifact URI in an artifact is restricted to the set of domain names used by the portal. I consider linked items on other portals to be separate artifacts. As mentioned with the example in Fig. 1, a dataset page links to its associated thesis, thus we have two interlinked artifacts: a dataset and a thesis. A discussion of interlinking artifacts is outside of the scope of this post and is being investigated by projects such as Research Objects by Bechhofer, De Roure, Gamble, Goble, and Buchan, as well as already being supported by efforts such as OAI-ORE.

Fig. 4: This diagram demonstrates an artifact and its boundary. Artifacts often have links to content elsewhere in the scholarly portal, but only some of these links are to items that belong to the artifact.

How does a crawler know which URIs belong to the artifact and which should be ignored? Fig. 4 shows a diagram containing an entry page that links to several resources. Only some of these resources, the artifact URIs, belong to the artifact. How do we know which URIs linked from an entry page are artifact URIs? Collection synthesis and focused crawling will acquire pages matching a specific topic, but we want as close to the complete artifact as possible with no missed and minimal extra objects. OAI-ORE provides a standard for aggregations of web resources using a special vocabulary as well as resource maps in RDF and other formats. Signposting is a machine-friendly solution that informs crawlers of this boundary by using link relations in the HTTP Link header to indicate which URIs belong to an artifact. The W3C work on "Packaging on the Web" and "Portable Web Publications" requires that the content be formatted to help machines find related resources. LOCKSS boxes use site-specific plugins to intelligently crawl publisher web sites for preservation. How can a crawler determine this boundary without signposting, OAI-ORE, these W3C drafts, or site-specific heuristics? Can we infer the boundary from the structures used in each site?

Fortunately, portals have predictable behaviors that automated tools can use. In this post I assume an automated system will use heuristics to advise a crawler that is attempting to discover all artifact URIs within the boundary of an artifact. The resulting artifact URIs can then be supplied to a high resolution capture system, like Webrecorder.io. The goal is to develop a limited set of general, rather than site-specific, heuristics. Once an archivist is provided these artifact URIs, they can then create high-resolution captures of their content. In addition to defining these heuristics, I also correlate these heuristics with similar settings in Archive-It to demonstrate that the problem of capturing many of these artifacts is largely addressed. I then indicate which heuristics apply to which artifacts on some known scholarly portals. I make the assumption that all authentication, access (e.g., robots.txt exclusions), and licensing issues have been resolved and therefore all content is available.

Artifact Classes and Crawling Heuristics

To discover crawling heuristics in scholarly portals, I reviewed portals from Kramer and Bosman's crowdsourced database of 400+ tools, part of their Innovations in Scholarly Communication. I filtered the list to only include entries from the categories of publication, outreach, and assessment. To find the most popular portals, I sorted the list by Twitter followers as a proxy for popularity. I then selected the top 36 portals from this list that were not journal articles and that contained scholarly artifacts. After that I manually reviewed artifacts on each portal to find common crawling heuristics shared across portals.

Single Artifact URI and Single Hop

In Figs. 5a, 5b, and 5c below, I have drawn three different classes of web-based scholarly artifacts. The simplest class, in Fig. 5a, is an artifact consisting of a single artifact URI. This blog post, for example, is an artifact consisting of a single artifact URI.

Archiving single artifact URIs can be done easily in one shot with Webrecorder.io, Archive-It's One Page setting, and "save the page" functionality offered from web archives like archive.is. I will refer to this heuristic as Single Page.

Fig. 5a: A diagram of the Single Artifact URI artifact class.

Fig. 5b: A diagram showing an example Single Hop artifact class.

Fig. 5c: A diagram showing an example of a Multi-Hop artifact class.

Fig. 5b shows an example artifact consisting of one entry page artifact URI and those artifact URIs linked to it. Our Figshare example above matches this second form. I will refer to this artifact class as Single Hop because all artifact URIs are available within one hop from the entry page. To capture all artifact URIs for this class, an archiving solution merely captures the entry page and any linked pages, stopping at one hop away from the entry page. Archive-It has a setting that addresses this named One Page+. Inspired by Archive-It's terminology, I will use the + to indicate "including all linked URIs within one hop". Thus, I will refer to the heuristic for capturing this artifact class as Single Page+.
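As a minimal sketch of Single Page+, assuming the entry page HTML has already been fetched, the following Python collects the entry page URI plus every URI linked one hop away, keeping only links whose host is one of the portal's domain names. The function name, the `portal_hosts` parameter, and the example.org URIs are illustrative assumptions and do not come from any existing tool.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def single_page_plus(entry_uri, entry_html, portal_hosts):
    """Return the entry page URI and all portal URIs linked one hop away."""
    collector = LinkCollector()
    collector.feed(entry_html)
    uris = [entry_uri]
    for href in collector.hrefs:
        # Resolve relative links and drop fragments so each resource appears once.
        absolute, _fragment = urldefrag(urljoin(entry_uri, href))
        if urlsplit(absolute).hostname in portal_hosts and absolute not in uris:
            uris.append(absolute)
    return uris


# Hypothetical entry page, used only for illustration.
html = """
<a href="/articles/42/files/data.csv">Download</a>
<a href="/articles/42/export.bib">BibTeX</a>
<a href="/about">About</a>
<a href="https://other.example.net/thesis">Related thesis</a>
<a href="#top">Top</a>
"""
uris = single_page_plus(
    "https://portal.example.org/articles/42", html, {"portal.example.org"}
)
```

In this toy page the thesis link is dropped because it lives on another portal, while the "About" menu link is still collected, which is the over-capture that Single Page+ is prone to.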

Because one hop away will acquire menu items and other site content, Single Page+ will acquire more URIs than needed. As an optimization, our automated system can first create an Ignored URIs List, inspired in part by a dead page detection algorithm by Bar-Yossef, Broder, Kumar, and Tomkins. The automated tool would fill this list using the following method:

Construct an invalid URI (i.e., one that produces a 404 HTTP status) for the portal.
Capture the content at that URI and place all links from that content into the ignored URIs list.
Capture the content from the homepage of the portal and place all links from that content into the ignored URIs list.
Remove the entry page URI from the ignored URIs list, if present.

The ignored URIs list should now contain URIs that are outside of the boundary, like those that refer to site menu items and licensing information. This method captures content both from the invalid URI and a homepage because homepages may not contain all menu items. As part of a promotion effort, the entry page URI may be featured on the homepage, hence we remove it from the list in the final step. Our system would then advise the crawler to ignore any URIs on this list, reducing the number of URIs crawled.
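The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the content of the invalid URI and of the homepage is passed in as already-captured HTML strings, the helper names (`extract_links`, `make_invalid_uri`, `build_ignored_uris`) are hypothetical, and using a random path for step 1 assumes the portal answers unknown paths with a 404, which a real tool would need to confirm.

```python
import uuid
from html.parser import HTMLParser
from urllib.parse import urljoin


def extract_links(base_uri, html):
    """Return the set of absolute URIs linked from <a> tags in html."""
    links = set()

    class _Collector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            href = dict(attrs).get("href")
            if tag == "a" and href:
                links.add(urljoin(base_uri, href))

    _Collector().feed(html)
    return links


def make_invalid_uri(home_uri):
    # Step 1: a random path is very unlikely to exist on the portal.
    return urljoin(home_uri, "/" + uuid.uuid4().hex)


def build_ignored_uris(entry_uri, invalid_uri, invalid_html, home_uri, home_html):
    ignored = set()
    ignored |= extract_links(invalid_uri, invalid_html)  # step 2: links on the 404 page
    ignored |= extract_links(home_uri, home_html)        # step 3: links on the homepage
    ignored.discard(entry_uri)                           # step 4: keep the entry page
    return ignored


# Hypothetical portal content, used only for illustration.
home = "https://portal.example.org/"
entry = "https://portal.example.org/articles/42"
invalid_html = '<a href="/about">About</a> <a href="/terms">Terms</a>'
home_html = (
    '<a href="/about">About</a> <a href="/browse">Browse</a> '
    '<a href="/articles/42">Featured</a>'
)
ignored = build_ignored_uris(entry, make_invalid_uri(home), invalid_html, home, home_html)
```

A crawler using Single Page+ would then subtract `ignored` from the set of URIs found one hop from the entry page.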

I will refer to this modified heuristic as Single Page+ with Ignored URIs.

Multi-Hop Artifacts

Fig. 5c shows an example artifact of high complexity. It consists of many interlinked artifact URIs. Examples of scholarly sites fitting into this category include GitHub, Open Science Framework, and Dryad. Because multiple hops are required to reach all artifact URIs, I will refer to this artifact class as Multi-Hop. Due to its complexity, Multi-Hop breaks down into additional types that require special heuristics to acquire completely.

Software projects are stored on portals like GitHub and BitBucket. These portals host source code in repositories using a software version control system, typically Git or Mercurial. Each of these version control systems provides archivists with the ability to create a complete copy of the version control system repository. The portals provide more than just hosting for these repositories. They also provide issue tracking, documentation services, released binaries, and other content that provides additional context for the source code itself. The content from these additional services is not present in the downloaded copy of the version control system repository.

Fig. 6: Entry page belonging to the artifact representing the Memento Damage software project.

For these portals, the entry page URI is a substring of all artifact URIs. Consider the example GitHub source code page shown in Fig. 6. This entry page belongs to the artifact representing the Memento Damage software project. The entry page artifact URI is https://github.com/erikaris/web-memento-damage/. Artifact URIs belonging to this artifact will contain the entry page URI as a substring; each of the following examples begins with the entry page URI:

https://github.com/erikaris/web-memento-damage/issues
https://github.com/erikaris/web-memento-damage/graphs/contributors
https://github.com/erikaris/web-memento-damage/blob/master/memento_damage/phantomjs/text-coverage.js
https://github.com/erikaris/web-memento-damage/commit/afcdf74cc31178166f917e79bbad8f0285ae7831

Because all artifact URIs are based on the entry page URI, I have named this heuristic Path-Based.

In the case of GitHub, some artifact URIs reside in a different domain: raw.githubusercontent.com. Because these URIs are in a different domain and hence do not contain our significant string, they will be skipped by the Path-Based heuristic. We can amend the Path-Based heuristic to capture these additional resources by allowing the crawler to capture all linked URIs that belong to a domain different from the domain of the entry page. I refer to this heuristic as Path-Based with Externals.
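A minimal sketch of the two scope checks follows. It treats "contains the entry page URI as a substring" as a prefix match, which is a slightly stricter reading and an assumption on my part. The function names are hypothetical, and the raw.githubusercontent.com URI is an invented example of the kind of link described above, not one taken from the project.

```python
from urllib.parse import urlsplit


def path_based(entry_uri, candidate_uri):
    """Path-Based: keep a URI only if it begins with the entry page URI."""
    return candidate_uri.startswith(entry_uri)


def path_based_with_externals(entry_uri, candidate_uri):
    """Path-Based with Externals: also keep URIs hosted on a different domain."""
    entry_host = urlsplit(entry_uri).hostname
    candidate_host = urlsplit(candidate_uri).hostname
    return path_based(entry_uri, candidate_uri) or candidate_host != entry_host


entry = "https://github.com/erikaris/web-memento-damage/"
issues = "https://github.com/erikaris/web-memento-damage/issues"
owner = "https://github.com/erikaris"
# Hypothetical external URI, for illustration only.
raw = "https://raw.githubusercontent.com/erikaris/web-memento-damage/master/README.md"

in_scope = [u for u in (issues, owner, raw) if path_based(entry, u)]
in_scope_ext = [u for u in (issues, owner, raw) if path_based_with_externals(entry, u)]
```

Here `in_scope` keeps only the issues page, while `in_scope_ext` also keeps the external URI. Note that the externals check accepts any link to another domain, so it can pull in URIs outside the artifact.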

Silk allows a user to create an entire web site devoted to the data visualization and interaction of a single dataset. When a user creates a new Silk project, a subdomain is created to host that project (e.g., http://dashboard101innovations.silk.co/). Because the data and visualization are intertwined, the entire subdomain site is itself an artifact. Crawling this artifact still relies upon the path (i.e., a single slash), and hence, its related heuristic is Path-Based as well, but without the need to acquire content external to the portal.

For some portals, like Dryad, a significant string exists in the content of each object that is part of the artifact. An automated tool can acquire this significant string from the <title> element of the HTML of the entry page and a crawler can search for the significant string in the content -- not just the title, but the complete content -- of each resource discovered during the crawl. If the resource's content does not contain this significant string, then it is discarded. I refer to this heuristic as Significant String from Title.
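The following sketch shows one way to implement this check with Python's standard library. The names `significant_string_from_title` and `belongs_to_artifact` are hypothetical, the HTML is an invented example, and collapsing runs of whitespace before comparing is my own assumption about how a tool might make the match more robust.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Accumulates the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def normalize(text):
    # Collapse whitespace so line breaks in the markup do not defeat the match.
    return " ".join(text.split())


def significant_string_from_title(entry_html):
    parser = TitleParser()
    parser.feed(entry_html)
    return normalize(parser.title)


def belongs_to_artifact(significant, resource_content):
    """Search the complete content of a resource, not just its title."""
    return bool(significant) and significant in normalize(resource_content)


# Hypothetical pages, used only for illustration.
entry_html = "<html><head><title>Data from: An example\n study</title></head></html>"
significant = significant_string_from_title(entry_html)
metadata_page = "<p>Full metadata for Data from: An example study</p>"
menu_page = "<p>About this repository</p>"
```

With these inputs, `belongs_to_artifact(significant, metadata_page)` is true and `belongs_to_artifact(significant, menu_page)` is false, so the crawler would keep the first page and discard the second.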

Fig. 7: A diagram of a Dryad artifact consisting of a single dataset, but multiple metadata pages. Red boxes outline the significant string, Data from: Cytokine responses in birds challenged with the human food-borne pathogen Campylobacter jejuni implies a Th17 response., which is found in the title of the entry page and is present in almost all objects within the artifact. Only the dataset does not contain this significant string, hence a crawler must crawl URIs one hop out from those matching the Significant String from Title heuristic and also ignore menu items and other common links, hence the Significant String from Title+ with Ignored URIs is the prescribed heuristic in this case.

In reality, this heuristic misses the datasets linked from each Dryad page. To solve this, our automated tool can create an ignored URI list using the techniques mentioned above. Then the crawler can crawl one hop out from each captured page, ignoring URIs in this list. I refer to this heuristic as Significant String from Title+ with Ignored URIs. Fig. 7 shows an example Dryad artifact that can make use of this heuristic.
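To make the combination concrete, here is a small sketch of such a crawl over a toy in-memory portal. Everything in it is an assumption made for illustration: `fetch` stands in for whatever capture mechanism a real crawler uses and is expected to return a resource's text and its outgoing links, and the URIs and page contents are invented.

```python
from collections import deque


def crawl_title_plus(entry_uri, significant, fetch, ignored):
    """Capture pages containing the significant string, plus one hop out from them.

    fetch(uri) is a caller-supplied function returning (text_content, linked_uris).
    """
    captured = []
    seen = {entry_uri}
    queue = deque([entry_uri])
    while queue:
        uri = queue.popleft()
        content, links = fetch(uri)
        captured.append(uri)
        if significant not in content:
            # Kept as a one-hop resource (e.g. the dataset), but not expanded further.
            continue
        for link in links:
            if link not in ignored and link not in seen:
                seen.add(link)
                queue.append(link)
    return captured


# A toy portal standing in for real HTTP requests.
TITLE = "Data from: An example study"
web = {
    "/entry": (TITLE + " (summary)", ["/full", "/data.csv", "/about"]),
    "/full": (TITLE + " (full metadata)", ["/entry", "/data.csv", "/about"]),
    "/data.csv": ("id,value\n1,2\n", []),
    "/about": ("About this portal", ["/jobs"]),
    "/jobs": ("Jobs", []),
}
captured = crawl_title_plus("/entry", TITLE, web.__getitem__, ignored={"/about"})
```

The dataset file is captured even though it lacks the significant string, because it sits one hop from a matching page, while the menu link on the ignored list is never visited.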

A crawler can use this heuristic for Open Science Framework (OSF) with one additional modification. OSF includes the string "OSF | " in all page titles, but not in the content of resources belonging to the artifact, hence an automated system needs to remove it before the title can be compared with the content of linked pages. Fig. 8 shows an example of this.

Fig. 8: A screenshot of an OSF artifact entry page showing the source in the lower pane. The title element of the page contains the string "OSF | Role of UvrD/Mfd in TCR Supplementary Material". The string "Role of UvrD/Mfd in TCR Supplementary Material" is present in all objects related to this artifact. To use this significant string, the substring "OSF | " must be removed.

Here are the steps for removing the matching text:

Capture the content of the entry page.
Save the text from the <title> tag of the entry page.
Capture the content of the portal homepage.
Save the text from the <title> tag of the homepage.
Starting from the leftmost character of each string, compare the characters of the entry page title text with the homepage title text.

If the characters match, remove the character in the same position from the entry page title.
Stop comparing when characters no longer match.

I will refer to the heuristic with this modification as Significant String from Filtered Title.
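The character-by-character comparison in the steps above can be sketched as follows, assuming both titles have already been extracted from the captured pages. The function name is hypothetical, and the homepage title "OSF | Home" is an assumed value used only to illustrate the comparison.

```python
def strip_common_prefix(entry_title, homepage_title):
    """Remove from entry_title the leading characters it shares with homepage_title."""
    i = 0
    limit = min(len(entry_title), len(homepage_title))
    # Walk from the leftmost character and stop at the first mismatch.
    while i < limit and entry_title[i] == homepage_title[i]:
        i += 1
    return entry_title[i:]


entry_title = "OSF | Role of UvrD/Mfd in TCR Supplementary Material"
homepage_title = "OSF | Home"  # assumed homepage title, for illustration only
significant = strip_common_prefix(entry_title, homepage_title)
```

One caveat of this approach: if an artifact's own title happened to begin with the same characters that follow the shared prefix in the homepage title, those characters would be stripped as well.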

Entry pages for the Global Biodiversity Information Facility (GBIF) contain an identification string in the path part of each URI that is present in all linked URIs belonging to the same artifact. For example, the entry page at URI http://www.gbif.org/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba contains the string 98333cb6-6c15-4add-aa0e-b322bf1500ba and its page content links to the following artifact URIs:

http://www.gbif.org/occurrence/search?datasetKey=98333cb6-6c15-4add-aa0e-b322bf1500ba
http://api.gbif.org/v1/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba/document

An automated system can compare the entry page URI to each of the URIs of its links to extract this significant string. Informed by this system, a crawler will then ignore URIs that do not contain this string. I will refer to the heuristic for artifacts on this portal as Significant String from URI.

Discovering the significant string in the URI may also require site-specific heuristics. Discovering the longest common substring between the path elements of the entry page URI and any linked URIs may work for the GBIF portal, but it may not work for other portals, hence this heuristic may need further development, if applicable to other portals.
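As a rough sketch of the longest-common-substring idea for GBIF, the code below compares each path element of the entry page URI with each linked URI and keeps the longest shared substring. This is one possible reading of the approach and, as noted above, it may not generalize to other portals. The function names are hypothetical, and the third linked URI is an invented menu link added to show a URI being filtered out.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def longest_common_substring(a, b):
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def significant_string_from_uri(entry_uri, linked_uris):
    """Longest substring shared by any entry URI path element and any linked URI."""
    segments = [s for s in urlsplit(entry_uri).path.split("/") if s]
    best = ""
    for segment in segments:
        for link in linked_uris:
            candidate = longest_common_substring(segment, link)
            if len(candidate) > len(best):
                best = candidate
    return best


entry = "http://www.gbif.org/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba"
links = [
    "http://www.gbif.org/occurrence/search?datasetKey=98333cb6-6c15-4add-aa0e-b322bf1500ba",
    "http://api.gbif.org/v1/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba/document",
    "http://www.gbif.org/dataset/search",  # hypothetical menu link
]
significant = significant_string_from_uri(entry, links)
artifact_uris = [u for u in links if significant in u]
```

For this example the identifier in the entry page URI is recovered as the significant string, and only the two links containing it are kept.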

The table below lists the artifact classes and associated heuristics that have been covered. As noted above, even though an artifact fits into a particular class, its structure on the portal is ultimately what determines its applicable crawling heuristic.

Artifact Class	Potential Heuristics	Potentially Adds Extra URIs Outside Artifact	Potentially Misses Artifact URIs
Single Artifact URI	Single Page	No	No
Single Hop	Single Page+	Yes	No
Single Hop	Single Page+ with Ignored URIs	Yes (but reduced amount compared to Single Page+)	No
Multi-Hop	Path-Based	Depends on Portal/Artifact	Depends on Portal/Artifact
Multi-Hop	Path-Based with Externals	Yes	No
Multi-Hop	Significant String from Title	No	Yes
Multi-Hop	Significant String from Title+ with Ignored URIs	Yes	No
Multi-Hop	Significant String from Filtered Title	No	Yes
Multi-Hop	Significant String from URI	No	Yes

Comparison to Archive-It Settings

Even though the focus of this post has been to find artifact URIs with the goal of feeding them into a high resolution crawler, like Webrecorder.io, it is important to note that Archive-It has settings that match or approximate many of these heuristics. This would allow an archivist to use an entry page as a seed URI and capture all artifact URIs. The table below provides a listing of similar settings between the heuristics mentioned here and a setting in Archive-It that functions similarly.

Crawling Heuristic from this Post	Similar Setting in Archive-It
Single Page	Seed Type: One Page
Single Page+	Seed Type: One Page+
Single Page+ with Ignored URIs	Seed Type: Standard; Host Scope Rule: Block URL if it contains the text <string>
Path-Based	Seed Type: Standard
Path-Based with Externals	Seed Type: Standard+
Significant String from Title	None
Significant String from Title+ with Ignored URIs	None
Significant String from Filtered Title	None
Significant String from URI	Seed Type: Standard; Expand Scope to Include URL if it contains the text <string>

Fig. 9: This is a screenshot of part of the Archive-It configuration allowing a user to control the crawling strategy for each seed.

Archive-It's seed type settings allow one to change how a seed is crawled. As shown in Fig. 9, four settings are available. One Page, One Page+, and Standard all map exactly to our heuristics of Single Page, Single Page+, and Path-Based. For Path-Based, one merely needs to supply the entry page URI as a seed -- including the ending slash -- and Archive-It's scoping rules will ensure that all links include the entry page URI. Depending on the portal, Standard+ may crawl more URIs than Path-Based with Externals, but is otherwise successful in acquiring all artifact URIs.

Fig. 10: This is a screenshot of the Archive-It configuration allowing the user to expand the crawl scope to include URIs that contain a given string.

Fig. 11: This screenshot displays the portion of the Archive-It configuration allowing the user to block URIs that contain a given string.

To address our other heuristics, Archive-It's scoping rules must be altered, with screenshots of these settings shown in Figs. 10 and 11. To mimic our Significant String from URI heuristic, a user would first need to know the significant string, and then can supply it as an argument to the setting "Expand Scope to include URL if it contains the text:". Likewise, to mimic Single Page+ with Ignored URIs, a user would need to know which URIs to ignore, and can use them as arguments to the setting "Block URL if...".

Archive-It does not have a setting for analyzing page content during a crawl, and hence I have not found settings that can address any of the members of the Significant String from Title family of heuristics.

In addition to these settings, a user will need to experiment with crawl times to capture some of the Multi-Hop artifacts due to the number of artifact URIs that must be visited.

Heuristics Used In Review of Artifacts on Scholarly Portals

While manually reviewing one or two artifacts from each of the 36 portals from the dataset, I documented the crawling heuristic I used for each artifact, shown in the table below. I focused on a single type of artifact for each portal. It is possible that different artifact types (e.g., blog post vs. forum) may require different heuristics even though they reside on the same portal.

Portal	Artifact Type Reviewed	Artifact Class	Applicable Crawling Heuristic
Academic Room	Blog Post	Single Artifact URI	Single Page
AskforEvidence	Blog Post	Single Artifact URI	Single Page
Benchfly	Video	Single Artifact URI	Single Page
BioRxiv	Preprint	Single Hop	Single Page+ with Ignored URIs
BitBucket	Software Project	Multi-Hop	Path-Based
Dataverse*	Dataset	Multi-Hop	Significant String From Filtered Title (starting from title end instead of beginning like OSF)
Dryad	Dataset	Multi-Hop	Significant String from Title+ with Ignored URIs
ExternalDiffusion	Blog Post	Single Artifact URI	Single Page
Figshare	Dataset	Single Hop	Single Page+ with Ignored URIs
GitHub	Software Project	Multi-Hop	Path-Based with Externals
GitLab.com	Software Project	Multi-Hop	Path-Based
Global Biodiversity Information Facility*	Dataset	Multi-Hop	Significant String From URI
HASTAC	Blog Post	Single Artifact URI	Single Page
Hypotheses	Blog Post	Single Artifact URI	Single Page
JoVe	Videos	Single Hop	Single Page+ with Ignored URIs
JSTOR daily	Article	Single Artifact URI	Single Page
Kaggle Datasets	Dataset with Code and Discussion	Multi-Hop	Path-Based with Externals
MethodSpace	Blog Post	Single Artifact URI	Single Page
Nautilus	Article	Single Artifact URI	Single Page
Omeka.net*	Collection Item	Single Artifact URI	Single Page (but Depends on Installation)
Open Science Framework*	Non-web content, e.g., datasets and PDFs	Multi-Hop	Significant String From Filtered Title
PubMed Commons	Discussion	Single Artifact URI	Single Page
PubPeer	Discussion	Single Artifact URI	Single Page
ScienceBlogs	Blog Post	Single Artifact URI	Single Page
Scientopia	Blog Post	Single Artifact URI	Single Page
SciLogs	Blog Post	Single Artifact URI	Single Page
Silk*	Data Visualization and Interaction	Multi-Hop	Path-Based
Slideshare	Slideshow	Multi-Hop	Path-Based
SocialScienceSpace	Blog Post	Single Artifact URI	Single Page
SSRN	Preprint	Single Hop	Single Page+ with Ignored URIs
Story Collider	Audio	Single Artifact URI	Single Page
The Conversation	Article	Single Artifact URI	Single Page
The Open Notebook	Blog Post	Single Artifact URI	Single Page
United Academics	Article	Single Artifact URI	Single Page
Wikipedia	Encyclopedia Article	Single Artifact URI	Single Page
Zenodo	Non-web content	Single Hop	Single Page+ with Ignored URIs

Five entries are marked with an asterisk (*) because they may offer additional challenges.

Omeka.net provides hosting for the Omeka software suite, allowing organizations to feature collections of artifacts and their metadata on the web. Because each organization can customize their Omeka installation, they may add features that make the Single Page heuristic no longer function. Dataverse is similar in this regard. I only reviewed artifacts from Harvard's Dataverse.

Global Biodiversity Information Facility (GBIF) contains datasets submitted by various institutions throughout the world. A crawler can acquire some metadata about these datasets, but the dataset itself cannot be downloaded from these pages. Instead, an authenticated user must request the dataset. Once the request has been processed, the portal then sends an email to the user with a URI indicating where the dataset may be downloaded. Because of this extra step, this additional dataset URI will need to be archived by a human separately. In addition, it will not be linked from content of the other captured artifact URIs.

Dataverse, Open Science Framework, and Silk offer additional challenges. A crawler cannot just use anchor tags to find artifact URIs because some content is only reachable via user interaction with page elements (e.g., buttons, dropdowns, specific <div> tags). Webrecorder.io can handle these interactive elements because a human performs the crawling. The automated system that we are proposing to aid a crawler will not be as successful unless it can detect these elements and mimic the human's actions. CLOCKSS has been working on this problem since 2009 and has developed an AJAX collector to address some of these issues.

Further Thoughts

There may be additional types of artifacts that I did not see on these portals. Those artifacts may require different heuristics. Also, there are many more scholarly portals that have not yet been reviewed, and it is likely that additional heuristics will need to be developed to address some of them. A larger study analyzing the feasibility and accuracy of these heuristics is needed.

From these 36 portals, most artifacts fall into the class of Single Artifact URI. A larger study on the distribution of classes of artifacts could indicate how well existing crawling technology can discover artifact URIs and hence archive complete artifacts.

Currently, a system would need to know which of these heuristics to use based on the portal and type of artifact. Without any prior knowledge, is there a way our system can use the entry page -- including its URI, response headers, and content -- to determine to which artifact type the entry page belongs? From there, can the system determine which heuristic can be used? Further work may be able to develop a more complex heuristic or even an algorithm applicable to most artifacts.

These solutions rely on the entry page for initial information (e.g., URI strings, content). Given any other artifact URI in the artifact, is it possible to discover the rest of the artifact URIs? If a given artifact URI references content that does not contain other URIs -- either through links or text -- then the system will not be able to discover other artifact URIs. If the content of a given artifact URI does contain other URIs, a system would need to determine which heuristic might apply in order to find the other artifact URIs.

What about artifacts that link to other artifacts? Consider again our example in Fig. 1 where a dataset links to a thesis. A crawler can save those artifact URIs to its frontier and pursue the crawl of those additional artifacts separately, if so desired. The crawler would need to determine when it had encountered a new artifact and pursue its crawl separately with the heuristics appropriate to the new artifact and portal.

Conclusion

I have outlined several heuristics for discovering artifact URIs belonging to an artifact. I also demonstrated that many of those heuristics can already be used with Archive-It. The heuristics offered here require that one know the entry page URI of the artifact and they expect that any system analyzing pages can work with interactive elements. Because portals provide predictable patterns, finding the boundary appears to be a tractable problem for anyone looking to archive a scholarly object.

--Shawn M. Jones

Acknowledgements: Special thanks to Mary Haberle for helping to explain Archive-It scoping rules.

Web Science and Digital Libraries Research Group