Saturday, October 22, 2016

2016-10-13: Dodging The Memory Hole 2016 Trip Report (#dtmh2016)

Dodging the Memory Hole 2016, held at UCLA's Charles E. Young Research Library in Los Angeles, California, was a two-day event to discuss and highlight potential solutions to the issue of preserving born-digital news. Organized by Edward McCain (digital curator of journalism at the Donald W. Reynolds Journalism Institute and University of Missouri Libraries), the event brought together technologists, archivists, librarians, journalists, and fourteen graduate students who had won travel scholarships to attend. Among the attendees were four members of the WS-DL group (l-r): Mat Kelly, John Berlin, Dr. Michael Nelson, and Shawn Jones.

Day 1 (October 13, 2016)

Day one started off at 9am with Edward McCain welcoming everyone to the event and then turning it over to Ginny Steel, UCLA University Librarian, for opening remarks.
In her remarks, Ginny reflected on her career as a lifelong librarian and the evolution of printed news to digital, and closed by summarizing the role archiving has to play in the born-digital news era.
After opening remarks, Edward McCain went over the goals and sponsors of the event before transitioning to the first speaker Hjalmar Gislason.

In the talk, Hjalmar touched on the amount of data currently being generated, how to determine the context and importance of data, and how data lost because its importance was not recognized could mean losing someone's life work. Hjalmar ended his talk with two takeaway points: "There is more to news archiving than the web: there is mobile content" and "Television news is also content that is important to save".

After a short break, panel one, consisting of Chris Freeland, Matt Weber, Laura Wrubel, and moderator Ana Krahmer, addressed the question of "Why Save Online News?".

Matt Weber started off the discussion by talking about the interactions between web archives and news media, stating that digital-only media has no offline surrogate and that it is becoming increasingly difficult to do anything but look at it as it exists now. Following Matt Weber were Laura Wrubel and Chris Freeland, who both talked about the large share Twitter has in online news. Laura Wrubel brought up that in 2011 journalists primarily used Twitter to direct people to articles rather than for conversation. Chris Freeland stated that Twitter was the primary source of information during the Ferguson protests in St. Louis and that the local news outlets were far behind in reporting the organic story as it happened.
Following panel one was Tim Groeling (professor and former chair of the UCLA Department of Communication Studies) giving presentation one entitled "NewsScape: Preserving TV News".

The NewsScape project, led by Tim Groeling, is currently migrating analog recordings of TV news to digital for archiving. The collection contains recordings dating back to the 1950s and is the largest collection of TV news and public affairs programs, containing a mix of U-matic, Betamax, and VHS tapes.

Currently, the project is working its way through the collection's tapes, completing 36k hours of encoding this year. Tim Groeling pointed out that the VHS tapes, despite being the newest, are the most threatened.
After lunch, the attendees were broken up into fifteen groups for the first of two breakout sessions. Each group was tasked with formulating three things that could be included in a national agenda for news preservation and to come up with a project to advance the practice of online news preservation.

Each group sent up one person who briefly went over what they had come up with. Despite the diverse backgrounds of the attendees at dtmh2016, the ideas that each group came up with had a lot in common:
  • A list of tools/technologies for archiving (awesome memento)
  • Identifying broken links in news articles
  • Increase awareness of how much or how little is archived
  • Work with news organizations to increase their involvement in archiving
  • More meetups, events, hackathons that bring together technologists
    with journalists and librarians  
The final speaker of the day was Clifford Lynch giving a talk entitled "Born-digital news preservation in perspective".
In his talk, Clifford Lynch spoke about problems that plague news preservation such as link rot and the need for multiple archives.

He also spoke on the need to preserve other kinds of media like data dumps and that archival record keeping goes hand in hand with journalism.
After his talk was over, Edward McCain gave final remarks for day one and transitioned us to a reception for the scholarship winners. The scholarship winners proposed projects (to be completed by December 2016) that would aid in digital news preservation, and three of these students were WS-DL members (Shawn Jones, Mat Kelly, and John Berlin).

Day 2 (October 14, 2016)

Day two of Dodging the Memory Hole 2016 began with Sharon Farb welcoming us back.

It was followed by the first presentation of the day, given by our very own Dr. Nelson and titled "Summarizing archival collections using storytelling techniques".

The presentation highlighted the work done by Yasmin AlNoamany in her doctoral dissertation, in particular, The Dark and Stormy Archives (DSA) Framework.
Up next was Pulitzer Prize-winning journalist Peter Arnett, who presented "Writing The First Draft of History - and Saving It!", talking about his experiences covering the Vietnam War and how he saved the Associated Press's Saigon office archives.
Following Peter Arnett was the second-to-last panel of dtmh2016, "Kiss your app goodbye: the fragility of data journalism", featuring Ben Welsh, Regina Roberts, and Meredith Broussard, and moderated by Martin Klein.

Meredith Broussard spoke about how archiving of news apps has become difficult as their content does not live in a single place.
Ben Welsh was up next speaking about the work he has done at the LA Times Data Desk.
In his talk, he stressed the need for more tools to be made that allowed people like himself to make archiving and viewing of archived news content easier.
Following Ben Welsh was Regina Roberts, who spoke about the work done at Stanford on archiving and adding context to the data sets that live beside the codebases of research projects.
The last panel of dtmh2016, "The future of the past: modernizing The New York Times archive", featured members of the technology team at The New York Times, Evan Sandhaus, Jane Cotler, and Sophia Van Valkenburg, with moderator Edward McCain.

Evan Sandhaus presented The New York Times' own take on the Wayback Machine, called TimesMachine. TimesMachine allows users to view the microfilm archive of The New York Times.
Sophia Van Valkenburg spoke about how the New York Times was transitioning its news archives into a more modern system.
After Sophia Van Valkenburg was Jane Cotler, who spoke about the gotchas encountered during the migration process. Most notable was that the way in which the articles were viewed (i.e., their visual aesthetics) was not preserved in the migration in favor of a "better user experience", and that in migrating to the new system links to the old pages would no longer work.
Lightning rounds were up next.

Mark Graham of the Internet Archive was up first with a presentation on the Wayback Machine and how it would be getting site search later this year.
Jefferson Bailey also of the Internet Archive spoke on the continual efforts at the Internet Archive to get the web archives into the hands of researchers.
Terry Britt spoke about how social media over time establishes "collective memory".
Katherine Boss presented "Challenges facing the preservation of born-digital news applications" and how they end up in dependency hell.
Eva Revear presented a tool to discover the frameworks and software used to build news apps.
Cynthia Joyce talked about a book on Hurricane Katrina and its use of archived news coverage of the storm.
Jennifer Younger presented the work being done by the Catholic News Archive.
Kalev Leetaru talked about the work he and the GDELT Project are doing in web archiving.
The last presentation of the event was by Kate Zwaard, titled "Technology and community: Why we need partners, collaborators, and friends".

Kate Zwaard talked about the success of web archiving events such as the recent Collections as Data and Archives Unleashed 2.0 held at the Library of Congress, the web archive collection at the Library of Congress, how they are putting Jupyter notebooks on top of database dumps, and the diverse skill sets required of today's librarians.
The final breakout sessions of dtmh2016 consisted of four topic discussions.

Jefferson Bailey's session, Web Archiving For News, was an informal breakout where he asked the attendees about collaboration between the Internet Archive and other organizations. A notable response came from The New York Times' Evan Sandhaus, who countered with a question about whether organizations or archives should be responsible for the preservation of news content. Jefferson Bailey responded that he wished organizations were more active in practicing self-archiving. Others responded with their own organizations' approaches to self-archiving, or those of organizations they knew about.

Ben Welsh's session, News Apps, discussed issues in archiving news apps, which are online web applications providing rich data experiences. An example used to illustrate this was California's War Dead, which was archived by the Internet Archive but with diminished functionality. In spite of this "success", Ben Welsh brought up the difficulty of preserving the full experience of such an app, as web crawlers only interact with client-side code, not the server-side code that is also required. To address this issue, he suggested solutions such as the Python library django-bakery for producing flat, static versions of news apps based on database queries. These static versions can be more easily archived while still providing a fuller experience when replayed.
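As a rough illustration of the "baking" approach django-bakery takes, here is a minimal sketch of a Django view that can be rendered to flat files; the model, view, and template names are hypothetical, and the exact settings may vary by version of the library.

```python
# settings.py (excerpt) -- hypothetical configuration for django-bakery
BUILD_DIR = '/tmp/baked-site'                       # where the flat, static site is written
BAKERY_VIEWS = ('wardead.views.CasualtyListView',)  # views to render during a build

# wardead/views.py -- a hypothetical "buildable" list view
from bakery.views import BuildableListView
from wardead.models import Casualty

class CasualtyListView(BuildableListView):
    """Renders the same template the live site uses, but writes the result to disk."""
    model = Casualty
    build_path = 'war-dead/index.html'   # output path underneath BUILD_DIR
    template_name = 'casualty_list.html'
```

Running the library's build management command (python manage.py build) would then write the rendered HTML under BUILD_DIR, and that static copy is what gets published and, eventually, crawled and archived.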
Eric Weig's session, Working with CMS, started out with him sharing his experience of migrating one of the University of Kentucky Libraries Special Collections Research Center newspaper sites' CMS from a local data center using sixteen CPUs to a less powerful cloud-based solution using only two CPUs. One of the biggest performance increases came when he switched from dynamically generating pages to serving static HTML pages. Generating the static HTML pages for the eighty-two thousand issues contained in this CMS took only three hours on the two-CPU cloud-based solution. After sharing this experience, the rest of the time was used to hear from the audience about their experiences with CMSes and for an impromptu roundtable discussion.

Kalev Leetaru's session, The GDELT Project: A Look Inside The World's Largest Initiative To Understand And Archive The World's News, was a more in-depth version of the lightning talk he gave. Kalev Leetaru shared experiences The GDELT Project has had with archival crawling of non-English-language news sites, his work with the Internet Archive on monitoring news feeds and broadcasts, the untapped opportunities for exploring the Internet Archive, and A Vision Of The Role and Future Of Web Archives. He also shared two questions he is currently pondering: "Why are archives checking certain news organizations more than others?" and "How do we preserve GeoIP-generated content, especially on non-western news sites?".
The last speaker of dtmh2016 was Katherine Skinner with Alignment and Reciprocity. In her speech Katherine Skinner called for volunteers to carry out some of the actions mentioned at dtmh2016 and reflected on the past two days.
Closing out dtmh2016 were Edward McCain, who thanked everyone for coming and expressed how enjoyable the event was, especially with the graduate students in attendance, and Todd Grappone, who gave the closing remarks. In his closing remarks, Todd Grappone reminded attendees of the pressing problems in news archiving and how they require both academic and software solutions.
Video recordings of DTMH2016 can be found on the Reynolds Journalism Institute's Facebook page. Chris Aldrich recorded audio along with a transcription of days one and two. NPR's Research, Archive & Data Strategy team created a Storify page of tweets covering topics they found interesting.

-- John Berlin 

Monday, October 3, 2016

2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?

"Team Turtle" in Archive Unleashed in Washington DC
(from left to right: N. Chah, S. Marti, M. Aturban , and I. Amin)
The first presidential debate (H. Clinton v. D. Trump) took place on Monday, September 26, 2016 at Hofstra University in New York. The questions were about topics like the economy, taxes, jobs, and race. During the debate, the candidates mentioned those topics (and other issues) and, in many cases, associated a topic with a particular place or US state (e.g., shootings in Chicago, Illinois, and the crime rate in New York). This reminded me of the work we had done in the second Archives Unleashed Hackathon, held at the Library of Congress in Washington DC. I worked with "Team Turtle" (Niel Chah, Steve Marti, Mohamed Aturban, and Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the 2004 Presidential Election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in ARC format. These key web sites were maintained by the candidates, their political parties, or newspapers, and they were crawled on the days around election day (November 2, 2004). The goal of this project was to investigate "How many times did each candidate mention each state?" and "What topics were they talking about?"

In this event, we had limited time (two days) to finish our project and present findings by the end of the second day. Fortunately, we were able to make it through three main steps: (1) extract plain text from ARC files, (2) apply some techniques to extract named entities and topics, and (3) build a visualization tool to better show the results. Our processing scripts are available on GitHub.

[1] Extract textual data from ARC files:

The ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., Internet Archive's Heritrix writes what it finds on the Web into ARC files of 100MB each). ARC is the predecessor of the now more popular WARC format. We were provided with 145 ARC files, and each of these files contained hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an interesting open-source platform for managing web archives. We started by installing Warcbase by following these instructions. Then, we wrote several Apache Spark Scala scripts to iterate over all ARC files and generate a clean textual version (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content, as shown below (we hid the content of the web pages due to copyright issues). Results were collected into a single TSV file.
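Our pipeline used Warcbase with Apache Spark and Scala, not Python; purely as a small-scale illustration of the same extraction step, the sketch below reads an ARC file with the warcio library (using its arc2warc reading support), strips the HTML with BeautifulSoup, and writes one TSV row per page. The file names are placeholders.

```python
# Illustrative only: the hackathon pipeline used Warcbase + Spark/Scala, not this script.
import csv
from urllib.parse import urlparse
from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

def arc_to_tsv(arc_path, tsv_path):
    """Extract (id, crawl date, domain, URI, plain text) rows from one ARC file."""
    with open(arc_path, 'rb') as stream, open(tsv_path, 'w', newline='') as out:
        writer = csv.writer(out, delimiter='\t')
        # arc2warc=True exposes ARC records through warcio's WARC record interface
        for i, record in enumerate(ArchiveIterator(stream, arc2warc=True)):
            if record.rec_type != 'response':
                continue
            uri = record.rec_headers.get_header('WARC-Target-URI') or ''
            date = record.rec_headers.get_header('WARC-Date') or ''
            html = record.content_stream().read()
            text = BeautifulSoup(html, 'html.parser').get_text(separator=' ')
            text = ' '.join(text.split())   # collapse whitespace so each page stays on one TSV row
            writer.writerow([i, date, urlparse(uri).netloc, uri, text])

arc_to_tsv('election-2004.arc.gz', 'pages.tsv')   # placeholder file names
```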

[2] Extract named entities and topics

We used Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling, we used the following techniques:
After applying the above techniques, the results were aggregated in a text file, which was used as input to the visualization tool (described in step [3]). Part of the results is shown in the table below.

[Table: for each state and candidate (Kerry, Bush), the frequency with which the candidate mentioned the state and the most important topic associated with it; states include Mississippi, Oklahoma, Delaware, Minnesota, Illinois, Georgia, Arkansas, New Mexico, Indiana, Maryland, Louisiana, Texas, Tennessee, and Arizona.]
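For readers who want to reproduce the entity-tagging step in [2], here is a minimal sketch using NLTK's wrapper around the Stanford NER tagger (not our actual script); the classifier and jar paths are placeholders, and running it requires a local Stanford NER installation plus Java.

```python
# Sketch of tagging place names with Stanford NER via NLTK; paths are placeholders.
from nltk.tag.stanford import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    'english.all.3class.distsim.crf.ser.gz',   # 3-class model shipped with Stanford NER
    'stanford-ner.jar')                        # path to the Stanford NER jar

sentence = "Senator Kerry campaigned in New Mexico while President Bush visited Texas."
tokens = word_tokenize(sentence)

# Keep only tokens labeled LOCATION; adjacent location tokens (e.g. "New", "Mexico")
# would still need to be merged before counting state mentions.
locations = [tok for tok, label in tagger.tag(tokens) if label == 'LOCATION']
print(locations)
```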

[3]  Interactive US map 

We decided to build an interactive US map using D3.js. As shown below, the state color indicates the winning party (i.e., red for Republican and blue for Democratic), while the size of the bubbles indicates how many times the state was mentioned by each candidate. The visualization required us to provide some information manually, such as the winning party for each state. In addition, we inserted locations (latitude and longitude) to place the bubbles on the map (two circles for each state). Hovering over a bubble shows the most important topic mentioned by the candidate. An interactive version of the map is available online.
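As a sketch of the manual data assembly described above, the snippet below merges per-state mention counts with hand-entered winner and coordinate information into the JSON a D3.js map could consume; the counts, topics, and field names are made up for illustration (only the 2004 winning parties shown are real).

```python
# Illustrative data preparation for the D3.js map; counts and topics are made up.
import json

# Hand-entered metadata: 2004 winning party and a bubble anchor point per state.
state_meta = {
    'Texas':    {'party': 'Republican', 'lat': 31.0, 'lon': -99.0},
    'Illinois': {'party': 'Democratic', 'lat': 40.0, 'lon': -89.0},
}

# Output of step [2]: mention counts and the most important topic per (state, candidate).
mentions = {
    ('Texas', 'Bush'):     {'count': 12, 'topic': 'taxes'},
    ('Texas', 'Kerry'):    {'count': 7,  'topic': 'jobs'},
    ('Illinois', 'Bush'):  {'count': 3,  'topic': 'economy'},
    ('Illinois', 'Kerry'): {'count': 9,  'topic': 'health care'},
}

records = []
for (state, candidate), info in mentions.items():
    meta = state_meta[state]
    records.append({
        'state': state, 'candidate': candidate, 'party': meta['party'],
        'lat': meta['lat'], 'lon': meta['lon'],   # the real map offsets the two bubbles per state
        'mentions': info['count'], 'top_topic': info['topic'],
    })

with open('map_data.json', 'w') as f:
    json.dump(records, f, indent=2)   # loaded by the D3.js visualization
```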

Looking at the map might help us answer the research questions, but it also raises other questions, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because they are always considered "red" states? On the other hand, it is clear that they paid more attention to "swing" states like Colorado and Florida. Finally, it might be useful to revisit this analysis now, as we are close to the 2016 presidential election (H. Clinton v. D. Trump), and the same approach could be applied again to see what newspapers say about this event.

--Mohamed Aturban

2016-10-03: Summary of “Finding Pages on the Unarchived Web"

In this paper, the authors detail their approach to recovering the unarchived Web based on the links and anchors of crawled pages. The data used was from the 2012 Dutch Web archive at the National Library of the Netherlands (KB), totaling about 38 million webpages. The collection was selected by the library based on categories related to Dutch history, social, and cultural heritage, and each website is categorized using a UNESCO code. The authors address three research questions: Can we recover a significant fraction of unarchived pages? How rich are the representations of the unarchived pages? And are these representations rich enough to characterize their content?

The link extraction used Hadoop MapReduce and Apache Pig to process all archived webpages, and JSoup to extract links from their content. A second MapReduce job indexed the URLs and checked whether or not they were archived. The data was then deduplicated based on the values of year, anchor text, source, target, and hashcode (MD5). In addition, basic cleaning and processing was performed on the data set, and the resulting data set contained 11 million webpages. Both external links (inter-server links), which are links between different servers, and site-internal links (intra-server links), which occur within a server, were included in the data set. An Apache Pig script was used to aggregate the extracted links by elements such as TLD, domain, host, and file type.
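The extraction itself ran on Hadoop with Apache Pig and JSoup; purely as a small, single-machine illustration of the link-and-anchor extraction and MD5-based deduplication described above, here is a Python sketch (it is not the authors' implementation).

```python
# Toy illustration of anchor/link extraction and MD5-based dedup (the paper used Hadoop, Pig, and JSoup).
import hashlib
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(source_url, html, crawl_year):
    """Yield (year, anchor text, source, target, MD5 key) for every <a href> on a page."""
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        target = urljoin(source_url, a['href'])
        anchor = a.get_text(strip=True)
        key = hashlib.md5(f'{crawl_year}|{anchor}|{source_url}|{target}'.encode('utf-8')).hexdigest()
        yield (crawl_year, anchor, source_url, target, key)

def deduplicate(rows):
    """Keep one row per hash, mirroring dedup on (year, anchor text, source, target)."""
    seen, unique = set(), []
    for row in rows:
        if row[-1] not in seen:
            seen.add(row[-1])
            unique.append(row)
    return unique
```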

The processed file has the following fields:
(sourceURL, sourceUnesco, sourceInSeedProperty, targetURL, targetUnesco, targetInSeedProperty, anchorText, crawlDate, targetInArchiveProperty, sourceHash).

There are four main classifications of URLs found in this data set, shown in Figure 1:
  1. Intentionally archived URLs in the seed list, which make up 92% of the dataset (10.1M).
  2. Unintentionally archived URLs due to crawler configuration, which make up 8% of the dataset (0.8M).
  3. Inner Aura: unarchived URLs whose parent domain is included in the seed list (5.5M) (20% at depth 4, because 94% are links to the site).
  4. Outer Aura: unarchived URLs whose parent domain is not on the seed list (5.2M) (29.7% at depth 2).

In this work, the Aura is defined as Web documents which are not included in the archived collection but are known to have existed through references to those unarchived Web documents in the archived pages.

They analyzed the four classifications and checked unique hosts, domains, and TLDs. They found that unintentionally archived URLs have a higher percentage of unique hosts, domains, and TLDs compared to intentionally archived URLs, and that the outer Aura has a higher percentage of unique hosts, domains, and TLDs compared to the inner Aura.

When checking the Aura, they found that most of the unarchived Aura points to textual web content. The inner Aura mostly had the .nl top-level domain (95.7%), while the outer Aura had 34.7% .com, 31.1% .nl, and 18% .jp TLDs. The high percentage of the Japanese TLD is due to those pages being unintentionally archived. They also analyzed the indegree of the Aura: all target representations in the outer Aura have at least one source link, 18% have at least 3 links, and 10% have 5 links or more. In addition, the Aura was compared by the number of intra-server and inter-server links; the inner Aura had 94.4% intra-server links, while the outer Aura had 59.7% inter-server links.
The number of unique anchor text words for the inner and outer Aura was similar: 95% had at least one word describing them, 30% had at least three words, and 3% had 10 words or more.

To test the feasibility of finding missing unarchived Web pages, they took a random sample of 300 websites, 150 of which were homepages and 150 non-homepages, making sure the selected websites were either live or archived. They found that 46.7% of the target pages were found within the top-10 SERP using anchor text, while for non-homepages 46% were found using text obtained from the URLs. By combining anchor text and URL word evidence, both kinds of pages were retrieved at a high rate: 64% of the homepages and 55.3% of the deeper pages could be retrieved. Another random sample of URLs was selected to check the anchor text and words from the link, and they found that homepages can be represented with anchor text alone, while non-homepages are better represented with both anchor text and words from the link.
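To make the idea of combining anchor-text and URL-word evidence concrete, here is a toy sketch that builds a textual representation for an unarchived URI and scores it against a query by simple word overlap; it is only an illustration, not the retrieval setup evaluated in the paper.

```python
# Toy illustration: represent an unarchived page by its URL words plus aggregated anchor text.
import re
from collections import Counter

def url_words(uri):
    """Split a URI into lowercase word tokens taken from its host and path."""
    return [w for w in re.split(r'[^a-z0-9]+', uri.lower()) if w and not w.isdigit()]

def representation(uri, anchor_texts):
    """Bag of words from the URL itself plus every anchor text pointing at it."""
    words = url_words(uri)
    for anchor in anchor_texts:
        words.extend(anchor.lower().split())
    return Counter(words)

def score(query, rep):
    """Count how often the query words appear in the representation (a stand-in for a real ranker)."""
    return sum(rep[w] for w in query.lower().split())

rep = representation('http://example.nl/koninginnedag/programma.html',
                     ['Programma Koninginnedag 2012', 'programma'])
print(score('programma koninginnedag', rep))   # higher scores would rank this page earlier
```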

They found that the archived pages show evidence of a large number of unarchived pages and websites. They also found that only a few homepages have rich representations. Finally, they found that even with only a few words to describe a missing webpage, it can often be found within the first rank. Future work includes adding further information, such as surrounding text, and more advanced retrieval models.


-Lulwah M. Alkwai

Tuesday, September 27, 2016

2016-09-27: Introducing Web Archiving in the Summer Workshop

For the last few years, the Department of Computer Science at Old Dominion University has invited a group of undergraduate students from India and hosted them for the summer. They work closely with a research group on relevant projects. Additionally, researchers from different research groups in the department present their work to the guest students twice a week and introduce the various projects they are working on. The goal of this practice is to allow the students to collaborate with graduate students of the department and to encourage them to pursue research studies. The invited students also act as ambassadors who share their experience with their colleagues and spread the word when they go back to India.

This year a group of 16 students from Acharya Institute of Technology and B.N.M. Institute of Technology visited Old Dominion University, where they were hosted under the supervision of Ajay Gupta. They worked in the areas of Sensor Networks and Mobile Application Development, researching ways to integrate mobile devices with low-cost sensors to solve problems in health care-related areas and vehicular networks.

I (Sawood Alam) was selected to represent our Web Science and Digital Libraries Research Group this year, presenting on July 28. Mat and Hany represented the group in the past. I happened to be the last presenter before the students returned to India, by which time they were overloaded with scholarly information. Additionally, the students did not primarily come from a Web science or digital libraries background, so I decided to keep my talk semi-formal and engaging rather than purely scientific. The slides were inspired by my talk in Germany last year, "Web Archiving: A Brief Introduction".

I began with my presentation slides, entitled "Introducing Web Archiving and WSDL Research Group". I briefly introduced myself with the help of my academic footprint and lexical signature. I described the agenda of the talk and established the motivation for Web archiving. From there, I followed the agenda as laid out, covering topics like issues and challenges in Web archiving; existing tools, services, and research efforts; my own research on Web archive profiling; and some open research topics in the field. Then I introduced the WSDL Research Group along with all the fun things we do in the lab. Being an Indian, I was able to pull in some cultural references from India to keep the audience engaged and entertained while still staying on the agenda of the talk.

I heard encouraging words from Ajay Gupta, Ariel Sturtevant, and some of the invited students after my talk, as they acknowledged it as one of the most engaging talks of the entire summer workshop. I would like to thank all who were involved in organizing this summer workshop and who gave me the opportunity to introduce my field of interest and the WSDL Research Group.

Sawood Alam

Monday, September 26, 2016

2016-09-26: IIPC Building Better Crawlers Hackathon Trip Report

Trip Report for the IIPC Building Better Crawlers Hackathon in London, UK.                           

On September 22-23, 2016, I attended the IIPC Building Better Crawlers Hackathon (#iipchack) at the British Library in London, UK. Having been to London almost exactly 2 years ago for the Digital Libraries 2014 conference, I was excited to go back, but was more so anticipating collaborating with some folks I had long been in contact with during my tenure as a PhD student researcher at ODU.

The event was a well-organized yet loosely scheduled meeting that resembled an "Unconference" more than a Hackathon, in that the discussion topics were defined as the event progressed rather than a large portion of time being devoted to implementation (see the recent Archives Unleashed 1.0 and 2.0 trip reports). The represented organizations were:

Day 0

As everyone arrived at the event from abroad and locally, the event organizer Olga Holownia invited the attendees to an informal get-together meeting at The Skinners Arms. There the conversation was casual but frequently veered into aspects of web archiving and brain picking, which we were repeatedly encouraged to "Save for Tomorrow".

Day 1

The first day began with Andy Jackson (@anjacks0n) welcoming everyone and thanking them for coming despite the short notice, the event having been announced only over the summer. He and Gil Hoggarth (@grhggrth), both of the British Library, kept detailed notes of the conference happenings as they progressed, with Andy keeping an editable open document for other attendees to collaborate on.

Tom Cramer (@tcramer) of Stanford, who mentioned he had organized hackathons in the past, encouraged everyone in attendance (14 in number) to introduce themselves and give a synopsis of their role and previous work at their respective institutions. To stimulate conversation, he also asked how we could go about making crawling tools accessible to non-web-archiving specialists.

The responding discussion initiated a theme that ran throughout the hackathon -- that of web archiving from a web browser.

One tool to accomplish this is Brozzler from Internet Archive, which combines warcprox and Chromium to preserve HTTP content sent over the wire into the WARC format. I had previously attempted to get Brozzler (originally forked from Umbra) up and running but was not successful. Other attendees either had previously tried or had not heard of the software. This transitioned later into Noah Levitt (of Internet Archive) giving an interactive audience-participation walk through of installing, setting up, and using Brozzler.

Prior to the interactive portion of the event, however, Jefferson Bailey (@jefferson_bail) of the Internet Archive started his presentation by speaking about WASAPI (Web Archiving Systems API), a specification for defining data transfer of web archives. The specification is a collaboration with the University of North Texas, Rutgers, Stanford via LOCKSS, and other organizations. Jefferson emphasized that the specification is not implementation-specific; it does not get into issues like access control, parameters of a specific path, etc. The rationale was that the spec should not be just a preservation data transport tool but also a means of data transfer for researchers. Their in-development implementation takes WARCs, pulls out data to generate a derivative WARC, then defines a Hadoop job using Pig syntax. Noah Levitt added that the Jobs API requires you to supply an operation, like "Build CDX", and the WARCs on which you want to perform the operation.

In typical non-linear unconference fashion (also exhibited in this blog post), Noah then gave details on Brozzler (presentation slides). With a room full of Mac and Linux users, installation proved particularly challenging. One issue I had previously run into was latency in starting RethinkDB. This issue was also exhibited by Colin Rosenthal (@colinrosenthal) on Linux, while I was on Mac. Noah's machine, which he showed in a demo as having the exact same versions of all the dependencies I had installed, did not show this latency, so your mileage may vary with installation; in the end both Colin and I (and possibly others) were successful in crawling a few URIs using Brozzler.

Andy added to Noah's interactive session by referencing his effort in Dockerizing Brozzler and his other work in component-izing and Dockerizing the other roles and tasks of the web archiving process with his project Wren. While one such component is the Archival Acid Test project I had created for Digital Libraries 2014, the other sub-projects of Wren help mitigate the difficulty of configuring tools that are otherwise time-consuming to set up.

One such tool that was lauded throughout the conference was Alex Osborne's (@atosborne) tinycdxserver; Andy has also created a Dockerized version of it. This tool was new to me, but the reported statistics on CDX querying speed and storage have the potential for significant improvement for large web archives. Per Alex's description of the tool, the indexes in tinycdxserver are stored compressed using Facebook's RocksDB and are about a fifth of the size of a flat CDX file. Further, Wayback instances can simply be pointed at a tinycdxserver instance using the built-in RemoteResourceIndex field in the Wayback configuration file, which makes for easy integration.


A wholly unconference-style discussion then commenced with topics we wanted to cover in the second part of the day. After coming up with and classifying various ideas, Andy defined three groups: the Heritrix Wish List, Brozzler, and Automated QA.

Each attendee could join any of the three for further discussion. I chose "Automated QA", given that archival integrity is related to my research topic.

The Heritrix group expressed challenges that its members had encountered in transitioning from Heritrix version 1 to version 3. "The Heritrix 3 console is a step back from Heritrix 1's. Building and running scripts in Heritrix 3 is a pain" was the general sentiment from the group. Another concern was scarce documentation, which might be remedied with funded efforts to improve it, as deep knowledge of the tool's workings is needed to accurately represent its capabilities. Kristinn Sigurðsson (@kristsi), who was involved in the development of H3 (and declined to give a history documenting the non-existence of H2), has since resolved some issues. He and others encouraged me to use his fork of Heritrix 3, with my own recommendation inadvertently included.

The Brozzler group first discussed the behavior of Brozzler versus a crawler in its handling of one page or site at a time (a la WARCreate) instead of adding discovered URIs to a frontier and seeding those URIs for subsequent crawls. Per above, Brozzler's use of RethinkDB as both the crawl frontier and the CDX service makes it especially appealing and more scalable. Brozzler allows multiple workers to pull URIs from a pool and report back to a RethinkDB instance. This worked fairly well in my limited but successful testing at the hackathon.
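As a rough sketch of that shared-frontier pattern (multiple workers claiming URIs from a RethinkDB table), the snippet below uses the RethinkDB Python driver; the table layout and field names are assumptions for illustration and are not Brozzler's actual schema.

```python
# Illustration of workers atomically claiming URLs from a shared RethinkDB "frontier" table.
# Table and field names are assumptions; Brozzler's real schema differs.
import rethinkdb as r   # module-level driver API circa 2016; newer drivers use RethinkDB()

conn = r.connect(host='localhost', port=28015, db='crawl')

def claim_next_url(worker_id):
    """Fetch one unclaimed frontier row, then try to atomically mark it as ours."""
    row = (r.table('frontier')
            .filter({'claimed': False})
            .limit(1).nth(0).default(None)
            .run(conn))
    if row is None:
        return None                      # frontier is empty
    result = (r.table('frontier').get(row['id'])
               .update(r.branch(r.row['claimed'].eq(False),
                                {'claimed': True, 'worker': worker_id},
                                {}))
               .run(conn))
    # 'replaced' == 1 means we won the race; otherwise another worker claimed it first.
    return row['url'] if result.get('replaced', 0) == 1 else None

url = claim_next_url('worker-1')
if url:
    print('crawling', url)               # a real worker would browse the page and report outlinks back
```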

The Automated QA group first spoke about the National Library of Australia's Bamboo project. The tool consumes Heritrix's (and/or warcprox's) crawl output folder and provides in-progress indexes from WARC files prior to a crawl finishing. Other statistics can also be added, as well as automated generation of screenshots for comparison of the captures on-the-fly. We also highlighted some particular items that crawlers and browser-based preservation tools have trouble capturing, for example, video formats that vary in support between browsers, URIs defined in the "srcset" attribute, responsive design behaviors, etc. I also referenced my work in Ahmed AlSum's (@aalsum) Thumbnail Summarization using SimHash, as presented at the Web Archiving Collaboration meeting.

After presentations by the groups, the attendees called it a day and continued discussions at a nearby pub.

Day 2

The second day commenced with a few questions we had all decided upon and agreed to at the pub as good discussion topics for the day. These questions were:

  1. Given five engineers and two years, what would you build?
  2. What are the barriers in training for the current and future crawling software and tools?
Given Five...

Responses to the first question included something like Brozzler's frontier, but redesigned to allow for continuous crawling instead of a single URI per crawl. With a segue toward Heritrix, Kristinn considered the relationship between configurability and scalability. "You typically don't install Heritrix on a virtual machine", he said; "usually a machine for this use requires at least 64 gigabytes of RAM." Also discussed was getting the raw data for a crawl versus being able to get the data needed to replicate the experience, and the particular importance of the latter.

Additionally, there was talk of adapting the scheme used by Brozzler for an Electron application meant for browsing, with the ability to toggle archiving through warcprox (related: see the recent post on WAIL). On the flip side, Kristinn mentioned that it surprised him that we can potentially create a browser of this sort that can interact with a proxy but not build another crawler -- highlighting the lack of options for robust, Heritrix-like archival crawlers.

Barriers in Training

For the second question, those involved with institutional archives seemed to agree that if one were going to hire a crawl engineer, Java and Python experience would be a prerequisite to exposure to some of the archive-specific concepts. As for current institutional training practice, Andy stated that he turns new developers in his organization loose on ACT, which is simply a CRUD application, to introduce them to the web archiving domain. Others said it would be useful to have staff exchanges and internships for collaboration and for getting more employees familiar with web archiving.


Another topic arose from the previous conversation about future methods of collaboration. For future work on documentation, more fundamental Getting Started guides as well as test sites for tools would be welcomed. For future communication, the IIPC Slack channel as well as the newly created IIPC GitHub wiki will be the next iteration of the outdated IIPC Tools page and the COPTR initiative.

The whole-group discussion wrapped up with identifying concrete next steps from what was discussed at the event. These included creating setup guides for Brozzler, testing further use cases of Umbra versus Brozzler, future work on access control considerations as currently handled by institutions and the next steps regarding that, and a few other TODOs. A monthly online meeting is also planned to facilitate collaboration between meetings, as well as more continued interaction via Slack instead of a number of outdated, obsolete, or noisy e-mail channels.

In Conclusion...

Attendance at the IIPC Building Better Crawlers Hackathon was invaluable for establishing contacts and gaining more exposure to the field and the efforts of others. Many of the conversations were open-ended, which led to numerous other topics being discussed and opened the doors to potential new collaborations. I gained a lot of insight from discussing my research topic and others' projects and endeavors. I hope to be involved with future Hackathons-turned-Unconferences from IIPC and appreciate the opportunity I had to attend.

—Mat (@machawk1)

Kristinn Sigurðsson has also written a post about his takeaways from the event.

Tom Cramer also published his report on the Hackathon since the publication of this post.