Our IMLS proposal is titled "Combining Social Media Storytelling With Web Archives", and a PDF of the full proposal is available directly from the IMLS. The proposal is a joint effort with our partners at Archive-It and is informed by our experience in several areas, including:
- Our previous attempts at visualizing Archive-It collections, in which we ran into difficulty scaling conventional approaches (e.g., treemaps, timelines) to entire collections.
- "Storytelling" social media services: chiefly storify.com, but also pinterest.com, scoop.it, paper.li, and similar sites.
- The convergence of summarizing large personal collections with storytelling, exemplified by services such as 1 Second Everyday, Twitter's Year in Review, Timehop, and Facebook's "Memories" and now "On This Day".
The IMLS proposal will investigate several main thrusts:
- Selecting a small number (e.g., 20) of exemplary pages from a collection (which often holds hundreds of archived copies of thousands of web pages) and loading them into an existing tool such as Storify as a summarization interface (rather than a custom, unfamiliar interface). Yasmin AlNoamany has some exciting preliminary work in this area; see, for example, her TPDL 2015 paper examining what makes a "good" story on Storify, and her presentation "Using Web Archives to Enrich the Live Web Experience Through Storytelling".
- Using existing stories to generate seed URIs for collections. One problem with human-generated web archive collections is that they depend on the domain knowledge of curators. For example, the image above shows two Storify stories about the early riots in Kiev (aka Kyiv), which predated both the subsequent escalation of the crisis and much of its exposure in Western media. The Archive-It collection was not begun until the annexation of Crimea was imminent, possibly missing the URIs that document the early stages of this developing story. Our idea is to mine social media, especially stories, for semi-automated, early creation of web archive collections.
- Inspired by Martin Klein's PhD research and Hugo Huurdeman et al.'s "Finding Pages on the Unarchived Web" from JCDL 2014, we would like to see archives provide recommendations of related pages in the archive, as well as suggested "replacements" for pages that are not archived. Web archives currently return just a "yes" (200) or "no" (404) when you query for a URI -- they should be able to provide more detailed answers based on their holdings.
- We'd like to further investigate how well a page is archived. We have preliminary work from Justin Brunelle on automatically assessing the impact of missing embedded resources (typically stylesheets and images), and from Scott Ainsworth on detecting temporal violations -- combinations of HTML and images that never occurred together on the live web (see "Only One Out of Five Archived Web Pages Existed as Presented" from HT 2015).
- Related to the temporal violations above, we need a better way to visualize the temporal & archival makeup of replayed pages. For example, the LANL Time Travel service does a nice job of showing the various archives that contribute resources to a reconstruction, but questions remain about scale, as well as about describing temporal violations and their likely semantic impact. Similarly, we'd like to investigate how to convey the request environment that generated the representation you're currently viewing (see our 2013 D-Lib paper "A Method for Identifying Personalized Representations in Web Archives" for preliminary ideas on linking geoip, mobile vs. desktop, and other related representations).
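To make the seed-mining thrust concrete, here is a minimal sketch (not the proposed system) of extracting candidate seed URIs from a story's text. The story text, the URL pattern, and the list of hosts to skip are all illustrative assumptions:

```python
import re
from urllib.parse import urlparse

# Hypothetical story text; in practice this would come from a storytelling
# service's API or an exported story page.
STORY_TEXT = """
Protests continue: http://www.kyivpost.com/euromaidan live updates at
https://twitter.com/EuromaidanPR and background at
http://en.wikipedia.org/wiki/Euromaidan
"""

URL_PATTERN = re.compile(r'https?://[^\s\'"<>]+')

def mine_seed_uris(text, skip_hosts=('twitter.com', 'facebook.com')):
    """Extract candidate seed URIs from story text, dropping social-media
    hosts that a crawl would typically handle separately."""
    seeds = []
    for uri in URL_PATTERN.findall(text):
        host = urlparse(uri).netloc.lower()
        if any(host == h or host.endswith('.' + h) for h in skip_hosts):
            continue
        if uri not in seeds:  # preserve order, drop duplicates
            seeds.append(uri)
    return seeds

print(mine_seed_uris(STORY_TEXT))
```

A real pipeline would, of course, pull stories via a service's API and keep a curator in the loop before any URIs are submitted as crawl seeds.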
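The "more detailed answers" idea can likewise be sketched with a toy lookup that, on a miss, recommends the most similar archived URIs as candidate replacements. The holdings dictionary and the string-similarity ranking are illustrative assumptions; a real archive would consult its index and richer evidence such as link structure:

```python
from difflib import SequenceMatcher

# Toy stand-in for an archive's holdings: URI -> number of mementos.
HOLDINGS = {
    'http://example.com/': 42,
    'http://example.com/news': 7,
    'http://example.com/news/2014': 3,
}

def lookup(uri, holdings=HOLDINGS, top=3):
    """Answer a URI query with more than yes/no: report an exact hit,
    otherwise recommend the most similar archived URIs."""
    if uri in holdings:
        return {'archived': True, 'mementos': holdings[uri]}
    scored = sorted(holdings,
                    key=lambda held: SequenceMatcher(None, uri, held).ratio(),
                    reverse=True)
    return {'archived': False, 'suggestions': scored[:top]}

print(lookup('http://example.com/news/2015'))
```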
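The missing-resource assessment can be illustrated with a toy damage score -- the weighted fraction of embedded resources that failed to replay. The weights and the resource inventory below are assumptions for illustration, not the measures from Brunelle's actual work:

```python
# Illustrative weights (an assumption, not taken from the actual study):
WEIGHTS = {'stylesheet': 3.0, 'script': 2.0, 'image': 1.0}

def damage_score(embedded_resources):
    """Weighted fraction of embedded resources that failed to replay.
    `embedded_resources` maps URI -> (resource_type, loaded_ok)."""
    total = missing = 0.0
    for rtype, ok in embedded_resources.values():
        weight = WEIGHTS.get(rtype, 1.0)
        total += weight
        if not ok:
            missing += weight
    return missing / total if total else 0.0

page = {
    'http://example.com/main.css': ('stylesheet', False),  # absent on replay
    'http://example.com/photo.png': ('image', True),
    'http://example.com/icon.png': ('image', True),
}
print(damage_score(page))  # a missing stylesheet hurts more than a missing image
```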
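And before temporal makeup can be visualized, it has to be computed. The sketch below reports how far each embedded resource's capture time drifts from the root page's, with the URIs, timestamps, and one-day coherence window chosen purely for illustration:

```python
from datetime import datetime, timedelta

def temporal_spread(root_dt, embedded_dts, window=timedelta(days=1)):
    """Report how far embedded resources' capture times drift from the
    root HTML's, and which fall outside a coherence window."""
    drifts = {uri: abs(dt - root_dt) for uri, dt in embedded_dts.items()}
    return {
        'max_drift': max(drifts.values(), default=timedelta(0)),
        'outside_window': sorted(u for u, d in drifts.items() if d > window),
    }

# Hypothetical capture times for a composite memento:
root = datetime(2014, 2, 20, 12, 0)
embedded = {
    'http://example.com/style.css': datetime(2014, 2, 20, 12, 5),
    'http://example.com/logo.png':  datetime(2014, 3, 15, 9, 0),  # weeks off
}
report = temporal_spread(root, embedded)
print(report)
```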
* = For a more complete picture of our research vision for web archives, see also our 2014 award of $324k from the NEH for the study of personal web archiving and our 2014 award of $49k from the IIPC for profiling web archives.