Fig. 1: A screenshot of the landing page of an artifact on Figshare.
Green boxes outline links to URIs that belong to this artifact.
Red boxes outline links that do not belong to this artifact.
Interestingly, this artifact links to another artifact, a master's thesis, that does not link back to this artifact, complicating discovery of the dataset associated with the research paper. Both artifacts are fully archived in the Internet Archive. In contrast, Fig. 2 below shows a different incomplete Figshare artifact -- as of April 12, 2017 -- at the Internet Archive. Through incidental crawling, the Internet Archive discovered the landing page for this artifact, but has not acquired the actual dataset or the bibliographic metadata. Such cases show that incidental crawling is insufficient for archiving scholarly artifacts.
Fig. 2: A screenshot of the web pages from an incompletely archived artifact. The Internet Archive has archived the landing page of this Figshare artifact, but did not acquire the dataset or the bibliographic metadata about the artifact.
What qualifies as an artifact? An artifact is a set of interconnected objects belonging to a portal that represent some unit of scholarly discourse. Example artifacts include datasets, blog posts, software projects, presentations, discussions, and preprints. Artifacts like blog posts and presentations may only consist of a single object. As seen in Fig. 1, datasets can consist of landing pages, metadata, and additional documentation that are all part of the artifact. Software projects hosted online may consist of source code, project documentation, discussion pages, released binaries, and more. For example, the Python Memento Client library on the GitHub portal consists of source code, documentation, and issues. All of these items would become part of the software project artifact. An artifact is usually a citable object, often referenced by a DOI.
Artifacts are attributed to a scholar or scholars. Portals provide methods like search engines, APIs, and user profile pages to discover artifacts. Outside of portals, web search engine results and focused crawling can also be used to discover artifacts. From experience, I have observed that each result from these search efforts contains a URI pointing to an entry page. To acquire the Figshare example above, a Figshare search engine result contained a link to the entry page, not links to the dataset or bibliographic data. A screenshot showing entry pages as local search engine results is seen in Fig. 3 below. Poursardar and Shipman have studied this problem for complex objects on the web and have concluded that "there is no simple answer to what is related to a resource" thus making it difficult to discover artifacts on the general web. Artifacts stored on scholarly portals, however, appear to have some structure that can be exploited. For the purposes of this post, I will discuss capturing all objects in an artifact starting from its entry page because the entry page is designed to be used by humans to reach the rest of the objects in the artifact and because entry pages are the results returned by these search methods.
|Fig. 3: A screenshot showing search engine results in Figshare that lead to entry pages for artifacts.|
For simplicity of discovery, I want to restrict artifact URIs to a specific portal. Thus, the set of domain names possible for each artifact URI in an artifact is restricted to the set of domain names used by the portal. I consider linked items on other portals to be separate artifacts. As mentioned with the example in Fig. 1, a dataset page links to its associated thesis, thus we have two interlinked artifacts: a dataset and a thesis. A discussion of interlinking artifacts is outside of the scope of this post and is being investigated by projects such as Research Objects by Bechhofer, De Roure, Gamble, Goble, and Buchan, as well as already being supported by efforts such as OAI-ORE.
Fig. 4: This diagram demonstrates an artifact and its boundary. Artifacts often have links to content elsewhere in the scholarly portal, but only some of these links are to items that belong to the artifact.
Fortunately, portals have predictable behaviors that automated tools can use. In this post I assume an automated system will use heuristics to advise a crawler that is attempting to discover all artifact URIs within the boundary of an artifact. The resulting artifact URIs can then be supplied to a high resolution capture system, like Webrecorder.io. The goal is to develop a limited set of general, rather than site-specific, heuristics. Once an archivist is provided these artifact URIs, they can then create high-resolution captures of their content. In addition to defining these heuristics, I also correlate these heuristics with similar settings in Archive-It to demonstrate that the problem of capturing many of these artifacts is largely addressed. I then indicate which heuristics apply to which artifacts on some known scholarly portals. I make the assumption that all authentication, access (e.g., robots.txt exclusions), and licensing issues have been resolved and therefore all content is available.
Artifact Classes and Crawling Heuristics
To discover crawling heuristics in scholarly portals, I reviewed portals from Kramer and Boseman's Crowdsourced database of 400+ tools, part of their Innovations in Scholarly Communication. I filtered the list to only include entries from the categories of publication, outreach, and assessment. To find the most popular portals, I sorted the list by Twitter followers as a proxy for popularity. I then selected the top 36 portals from this list that were not journal articles and that contained scholarly artifacts. After that I manually reviewed artifacts on each portal to find common crawling heuristics shared across portals.
Single Artifact URI and Single Hop
In Figs. 5a, 5b, and 5c below, I have drawn three different classes of web-based scholarly artifacts. The simplest class, shown in Fig. 5a, is an artifact consisting of a single artifact URI. This blog post, for example, is an artifact consisting of a single artifact URI.
Archiving single artifact URIs can be done easily in one shot with Webrecorder.io, Archive-It's One Page setting, and "save the page" functionality offered from web archives like archive.is. I will refer to this heuristic as Single Page.
|Fig. 5a: A diagram of the Single Artifact URI artifact class.|
|Fig. 5b: A diagram showing an example of the Single Hop artifact class.|
|Fig 5c: A diagram showing an example of a Multi-Hop artifact class.|
Fig. 5b shows an example artifact consisting of one entry page artifact URI and those artifact URIs linked from it. Our Figshare example above matches this second form. I will refer to this artifact class as Single Hop because all artifact URIs are available within one hop of the entry page. To capture all artifact URIs for this class, an archiving solution merely captures the entry page and any linked pages, stopping at one hop away from the entry page. Archive-It has a setting named One Page+ that addresses this. Inspired by Archive-It's terminology, I will use the + to indicate "including all linked URIs within one hop". Thus, I will refer to the heuristic for capturing this artifact class as Single Page+.
Because one hop away will also acquire menu items and other site content, Single Page+ will acquire more URIs than needed. As an optimization, our automated system can first create an Ignored URIs List, inspired in part by a dead page detection algorithm by Bar-Yossef, Broder, Kumar, and Tomkins. The automated tool would fill this list using the following method:
- Construct an invalid URI (i.e., one that produces a 404 HTTP status) for the portal.
- Capture the content at that URI and place all links from that content into the ignored URIs list.
- Capture the content from the homepage of the portal and place all links from that content into the ignored URIs list.
- Remove the entry page URI from the ignored URIs list, if present.
I will refer to this modified heuristic as Single Page+ with Ignored URIs.
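The four steps above can be sketched as follows. This is a minimal illustration using only the Python standard library; it assumes the HTML of the 404 page and the portal homepage have already been fetched, and all function and parameter names are my own:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every anchor tag in an HTML document."""
    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URI
                    self.links.add(urljoin(self.base_uri, value))

def extract_links(html, base_uri):
    parser = LinkExtractor(base_uri)
    parser.feed(html)
    return parser.links

def build_ignored_uris(error_page_html, homepage_html, portal_uri, entry_page_uri):
    """Build the Ignored URIs List: links from an invalid (404) page,
    plus links from the portal homepage, minus the entry page URI."""
    ignored = extract_links(error_page_html, portal_uri)
    ignored |= extract_links(homepage_html, portal_uri)
    ignored.discard(entry_page_uri)
    return ignored
```

A crawler applying Single Page+ with Ignored URIs would then skip any linked URI that appears in this set.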
Fig. 5c shows an example artifact of high complexity. It consists of many interlinked artifact URIs. Examples of scholarly sites fitting into this category include GitHub, Open Science Framework, and Dryad. Because multiple hops are required to reach all artifact URIs, I will refer to this artifact class as Multi-Hop. Due to its complexity, Multi-Hop breaks down into additional types that require special heuristics to acquire completely.
Software projects are stored on portals like GitHub and BitBucket. These portals host source code in repositories using a software version control system, typically Git or Mercurial. Each of these version control systems provides archivists with the ability to create a complete copy of the version control system repository. The portals provide more than just hosting for these repositories. They also provide issue tracking, documentation services, released binaries, and other content that provides additional context for the source code itself. The content from these additional services is not present in the downloaded copy of the version control system repository.
|Fig. 6: Entry page belonging to the artifact representing the Memento Damage software project.|
For these portals, the entry page URI is a substring of all artifact URIs. Consider the example GitHub source code page shown in Fig. 6. This entry page belongs to the artifact representing the Memento Damage software project. The entry page artifact URI is https://github.com/erikaris/web-memento-damage/. Artifact URIs belonging to this artifact will contain the entry page URI as a substring; here are some examples with the entry page URI substrings shown in italics:
In the case of GitHub, some artifact URIs reside on a different domain: raw.githubusercontent.com. Because these URIs are on a different domain and hence do not contain the entry page URI as a substring, they will be skipped by the Path-Based heuristic. We can amend the Path-Based heuristic to capture these additional resources by allowing the crawler to capture all linked URIs that belong to a domain different from the domain of the entry page. I refer to this heuristic as Path-Based with Externals.
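The scope test behind the Path-Based and Path-Based with Externals heuristics can be sketched as a simple prefix-and-domain check (a sketch; the function and parameter names are my own):

```python
from urllib.parse import urlparse

def in_scope_path_based(candidate_uri, entry_page_uri, allow_externals=False):
    """Path-Based heuristic: a candidate belongs to the artifact if the
    entry page URI is a prefix of the candidate URI. With
    allow_externals=True (Path-Based with Externals), linked URIs on a
    domain different from the entry page's domain are also kept."""
    if candidate_uri.startswith(entry_page_uri):
        return True
    if allow_externals:
        entry_domain = urlparse(entry_page_uri).netloc
        candidate_domain = urlparse(candidate_uri).netloc
        return candidate_domain != entry_domain
    return False
```

For the Memento Damage example, pages under https://github.com/erikaris/web-memento-damage/ pass the prefix test, while linked URIs on raw.githubusercontent.com are admitted only by the Externals variant.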
Silk allows a user to create an entire web site devoted to the data visualization and interaction of a single dataset. When a user creates a new Silk project, a subdomain is created to host that project (e.g., http://dashboard101innovations.silk.co/). Because the data and visualization are intertwined, the entire subdomain site is itself an artifact. Crawling this artifact still relies upon the path (i.e., a single slash), and hence, its related heuristic is Path-Based as well, but without the need to acquire content external to the portal.
For some portals, like Dryad, a significant string exists in the content of each object that is part of the artifact. An automated tool can acquire this significant string from the <title> element of the HTML of the entry page and a crawler can search for the significant string in the content -- not just the title, but the complete content -- of each resource discovered during the crawl. If the resource's content does not contain this significant string, then it is discarded. I refer to this heuristic as Significant String from Title.
|Fig. 7: A diagram of a Dryad artifact consisting of a single dataset, but multiple metadata pages. Red boxes outline the significant string, Data from: Cytokine responses in birds challenged with the human food-borne pathogen Campylobacter jejuni implies a Th17 response., which is found in the title of the entry page and is present in almost all objects within the artifact. Only the dataset does not contain this significant string, hence a crawler must crawl URIs one hop out from those matching the Significant String from Title heuristic and also ignore menu items and other common links, hence the Significant String from Title+ with Ignored URIs is the prescribed heuristic in this case.|
In reality, this heuristic misses the datasets linked from each Dryad page. To solve this, our automated tool can create an ignored URIs list using the techniques mentioned above. Then the crawler can crawl one hop out from each captured page, ignoring URIs in this list. I refer to this heuristic as Significant String from Title+ with Ignored URIs. Fig. 7 shows an example Dryad artifact that can make use of this heuristic.
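The membership test at the core of the Significant String from Title family can be sketched as follows (a minimal illustration; function names are my own, and a real crawler would apply the test to every resource discovered during the crawl):

```python
import re

def significant_string_from_title(entry_page_html):
    """Pull the significant string from the <title> element of the
    entry page's HTML."""
    match = re.search(r"<title>(.*?)</title>", entry_page_html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

def belongs_to_artifact(resource_content, significant_string):
    """Keep a resource only if the significant string appears anywhere
    in its content -- not just its own title."""
    return significant_string in resource_content
```

Resources failing the test are discarded; the + with Ignored URIs variant then crawls one additional hop from each kept page.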
A crawler can use this heuristic for the Open Science Framework (OSF) with one additional modification. OSF includes the string "OSF | " in all page titles, but not in the content of resources belonging to the artifact, hence an automated system needs to remove it before the title can be compared with the content of linked pages. Fig. 8 shows an example of this. I refer to this variant as Significant String from Filtered Title. An automated system can remove the portal-wide title prefix as follows:
- Capture the content of the entry page.
- Save the text from the <title> tag of the entry page.
- Capture the content of the portal homepage.
- Save the text from the <title> tag of the homepage.
- Starting from the leftmost character of each string, compare the characters of the entry page title text with the homepage title text.
- If the characters match, remove the character in the same position from the entry page title.
- Stop comparing when characters no longer match.
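The steps above amount to stripping the longest common leading characters of the two titles. A minimal sketch (the function name is my own):

```python
def filtered_title(entry_title, homepage_title):
    """Strip the portal-wide title prefix (e.g., 'OSF | ') from the
    entry page title by removing the characters it shares, position by
    position, with the homepage title."""
    i = 0
    while (i < len(entry_title) and i < len(homepage_title)
           and entry_title[i] == homepage_title[i]):
        i += 1
    return entry_title[i:]
```

The remainder serves as the significant string for comparison against the content of linked pages.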
Entry pages for the Global Biodiversity Information Facility (GBIF) contain an identification string in the path part of each URI that is present in all linked URIs belonging to the same artifact. For example, the entry page at URI http://www.gbif.org/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba contains the string 98333cb6-6c15-4add-aa0e-b322bf1500ba and its page content links to the following artifact URIs:
Discovering the significant string in the URI may also require site-specific heuristics. Discovering the longest common substring between the path elements of the entry page URI and any linked URIs may work for the GBIF portal, but it may not work for other portals, hence this heuristic may need further development before it can be applied elsewhere. I refer to this heuristic as Significant String from URI.
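One way to recover such an identifier is a longest-common-substring comparison between the path parts of two URIs, as suggested above. This sketch uses Python's difflib; the second URI in the usage example is hypothetical, invented only to illustrate the comparison:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def significant_string_from_uri(entry_page_uri, linked_uri):
    """Return the longest common substring between the path components
    of the entry page URI and a linked URI. For portals like GBIF this
    can recover the identifier shared by all artifact URIs."""
    a = urlparse(entry_page_uri).path
    b = urlparse(linked_uri).path
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]
```

For real portals the result would still need validation (e.g., checking that it looks like an identifier rather than a shared path segment).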
The table below lists the artifact classes and associated heuristics that have been covered. As noted above, even though an artifact fits into a particular class, its structure on the portal is ultimately what determines its applicable crawling heuristic.
|Artifact Class||Potential Heuristics||Potentially Adds Extra URIs Outside Artifact||Potentially Misses Artifact URIs|
|Single Artifact URI||Single Page||No||No|
|Single Hop||Single Page+||Yes||No|
|Single Hop||Single Page+ with Ignored URIs||Yes (but reduced amount compared to Single Page+)||No|
|Multi-Hop||Path-Based||Depends on Portal/Artifact||Depends on Portal/Artifact|
|Multi-Hop||Path-Based with Externals||Yes||No|
|Multi-Hop||Significant String from Title||No||Yes|
|Multi-Hop||Significant String from Title+ with Ignored URIs||Yes||No|
|Multi-Hop||Significant String from Filtered Title||No||Yes|
|Multi-Hop||Significant String from URI||No||Yes|
Comparison to Archive-It Settings
Even though the focus of this post has been to find artifact URIs with the goal of feeding them into a high resolution crawler, like Webrecorder.io, it is important to note that Archive-It has settings that match or approximate many of these heuristics. This would allow an archivist to use an entry page as a seed URI and capture all artifact URIs. The table below provides a listing of similar settings between the heuristics mentioned here and a setting in Archive-It that functions similarly.
|Crawling Heuristic from this Post||Similar Setting in Archive-It|
|Single Page||Seed Type: One Page|
|Single Page+||Seed Type: One Page+|
|Single Page+ with Ignored URIs||Seed Type: Standard, with Host Scope Rule: Block URL if it contains the text <string>|
|Path-Based||Seed Type: Standard|
|Path-Based with Externals||Seed Type: Standard+|
|Significant String from Title||None|
|Significant String from Title+ with Ignored URIs||None|
|Significant String from Filtered Title||None|
|Significant String from URI||Seed Type: Standard, with Expand Scope to Include URL if it contains the text <string>|
|Fig. 9: This is a screenshot of part of the Archive-It configuration allowing a user to control the crawling strategy for each seed.|
|Fig. 10: This is a screenshot of the Archive-It configuration allowing the user to expand the crawl scope to include URIs that contain a given string.|
|Fig. 11: This screenshot displays the portion of the Archive-It configuration allowing the user to block URIs that contain a given string.|
Archive-It does not have a setting for analyzing page content during a crawl, and hence I have not found settings that can address any member of the Significant String from Title family of heuristics.
In addition to these settings, a user will need to experiment with crawl times to capture some of the Multi-Hop artifacts due to the number of artifact URIs that must be visited.
Heuristics Used In Review of Artifacts on Scholarly Portals
While manually reviewing one or two artifacts from each of the 36 portals from the dataset, I documented the crawling heuristic I used for each artifact, shown in the table below. I focused on a single type of artifact for each portal. It is possible that different artifact types (e.g., blog post vs. forum) may require different heuristics even though they reside on the same portal.
|Portal||Artifact Type Reviewed||Artifact Class||Applicable Crawling Heuristic|
|Academic Room||Blog Post||Single Artifact URI||Single Page|
|AskforEvidence||Blog Post||Single Artifact URI||Single Page|
|Benchfly||Video||Single Artifact URI||Single Page|
|BioRxiv||Preprint||Single Hop||Single Page+ with Ignores|
|Dataverse*||Dataset||Multi-Hop||Significant String From Filtered Title (starting from title end instead of beginning like OSF)|
|Dryad||Dataset||Multi-Hop||Significant String From Title+ with Ignored URI List|
|ExternalDiffusion||Blog Post||Single Artifact URI||Single Page|
|Figshare||Dataset||Single Hop||Single Page+ with Ignores|
|GitHub||Software Project||Multi-Hop||Path-Based with Externals|
|Global Biodiversity Information Facility*||Dataset||Multi-Hop||Significant String From URI|
|HASTAC||Blog Post||Single Artifact URI||Single Page|
|Hypotheses||Blog Post||Single Artifact URI||Single Page|
|JoVe||Videos||Single Hop||Single Page+ with Ignores|
|JSTOR daily||Article||Single Artifact URI||Single Page|
|Kaggle Datasets||Dataset with Code and Discussion||Multi-Hop||Path-Based with Externals|
|MethodSpace||Blog Post||Single Artifact URI||Single Page|
|Nautilus||Article||Single Artifact URI||Single Page|
|Omeka.net*||Collection Item||Single Artifact URI||Single Page (but Depends on Installation)|
|Open Science Framework*||Non-web content, e.g., datasets and PDFs||Multi-Hop||Significant String From Filtered Title|
|PubMed Commons||Discussion||Single Artifact URI||Single Page|
|PubPeer||Discussion||Single Artifact URI||Single Page|
|ScienceBlogs||Blog Post||Single Artifact URI||Single Page|
|Scientopia||Blog Post||Single Artifact URI||Single Page|
|SciLogs||Blog Post||Single Artifact URI||Single Page|
|Silk*||Data Visualization and Interaction||Multi-Hop||Path-Based|
|SocialScienceSpace||Blog Post||Single Artifact URI||Single Page|
|SSRN||Preprint||Single Hop||Single Page+ with Ignores|
|Story Collider||Audio||Single Artifact URI||Single Page|
|The Conversation||Article||Single Artifact URI||Single Page|
|The Open Notebook||Blog Post||Single Artifact URI||Single Page|
|United Academics||Article||Single Artifact URI||Single Page|
|Wikipedia||Encyclopedia Article||Single Artifact URI||Single Page|
|Zenodo||Non-web content||Single Hop||Single Page+ with Ignores|
Five entries are marked with an asterisk (*) because they may offer additional challenges.
Omeka.net provides hosting for the Omeka software suite, allowing organizations to feature collections of artifacts and their metadata on the web. Because each organization can customize their Omeka installation, they may add features that make the Single Page heuristic no longer function. Dataverse is similar in this regard. I only reviewed artifacts from Harvard's Dataverse.
Global Biodiversity Information Facility (GBIF) contains datasets submitted by various institutions throughout the world. A crawler can acquire some metadata about these datasets, but the dataset itself cannot be downloaded from these pages. Instead, an authenticated user must request the dataset. Once the request has been processed, the portal then sends an email to the user with a URI indicating where the dataset may be downloaded. Because of this extra step, this additional dataset URI will need to be archived by a human separately. In addition, it will not be linked from content of the other captured artifact URIs.
Dataverse, Open Science Framework, and Silk offer additional challenges. A crawler cannot just use anchor tags to find artifact URIs because some content is only reachable via user interaction with page elements (e.g., buttons, dropdowns, specific <div> tags). Webrecorder.io can handle these interactive elements because a human performs the crawling. The automated system that we are proposing to aid a crawler will not be as successful unless it can detect these elements and mimic the human's actions. CLOCKSS has been working on this problem since 2009 and has developed an AJAX collector to address some of these issues.
There may be additional types of artifacts that I did not see on these portals. Those artifacts may require different heuristics. Also, there are many more scholarly portals that have not yet been reviewed, and it is likely that additional heuristics will need to be developed to address some of them. A larger study analyzing the feasibility and accuracy of these heuristics is needed.
From these 36 portals, most artifacts fall into the class of Single Artifact URI. A larger study on the distribution of classes of artifacts could indicate how well existing crawling technology can discover artifact URIs and hence archive complete artifacts.
Currently, a system would need to know which of these heuristics to use based on the portal and type of artifact. Without any prior knowledge, is there a way our system can use the entry page -- including its URI, response headers, and content -- to determine to which artifact type the entry page belongs? From there, can the system determine which heuristic can be used? Further work may be able to develop a more complex heuristic or even an algorithm applicable to most artifacts.
These solutions rely on the entry page for initial information (e.g., URI strings, content). Given any other artifact URI in the artifact, is it possible to discover the rest of the artifact URIs? If a given artifact URI references content that does not contain other URIs -- either through links or text -- then the system will not be able to discover other artifact URIs. If the content of a given artifact URI does contain other URIs, a system would need to determine which heuristic might apply in order to find the other artifact URIs.
What about artifacts that link to other artifacts? Consider again our example in Fig. 1 where a dataset links to a thesis. A crawler can save those artifact URIs to its frontier and pursue the crawl of those additional artifacts separately, if so desired. The crawler would need to determine when it had encountered a new artifact and pursue its crawl separately with the heuristics appropriate to the new artifact and portal.
I have outlined several heuristics for discovering artifact URIs belonging to an artifact. I also demonstrated that many of those heuristics can already be used with Archive-It. The heuristics offered here require that one know the entry page URI of the artifact and they expect that any system analyzing pages can work with interactive elements. Because portals provide predictable patterns, finding the boundary appears to be a tractable problem for anyone looking to archive a scholarly object.
--Shawn M. Jones
Acknowledgements: Special thanks to Mary Haberle for helping to explain Archive-It scoping rules.