Tuesday, September 9, 2014

2014-09-09: DL2014 Doctoral Consortium

After exploring London on Sunday, I attended the first DL2014 session: the Doctoral Consortium. Held in the College Building at the City University London, the Doctoral Consortium offered early-career Ph.D. students the opportunity to present their research and academic plans and receive feedback from digital libraries professors and researchers.

Edie Rasmussen chaired the Doctoral Consortium. I was a presenter at the Doctoral Consortium in 2012 with Hany SalahEldeen, but I attended this year as a Ph.D. student observer.

Session I: User Interaction was chaired by José Borbinha. Hugo Huurdeman was first to present his work entitled "Adaptive Search Systems for Web archive research". His work focuses on information retrieval and discovery in web archives. He explained the challenge of searching not only across documents but also across time.

Georgina Hibberd presented her work entitled "Metaphors for discovery: how interfaces shape our relationship with library collections." Georgina is working on digitally modeling the physical inputs library users receive when interacting with books and physical library media, so that the same information is available when interacting with digital representations of the collection. For example, how can we incorporate physical proximity and accidental discovery into digital systems, or how can we convey the frequency of use that would previously be shown in the condition of a book's spine?

Yan Ru Guo presented her work entitled "Designing and Evaluating an Affective Information Literacy Game" in which she proposes serious games to improve tertiary students' ability to perform searches and information discovery in digital environments.

After a break to re-caffeinate, Session II: Working with Digital Collections began. Dion Goh chaired the session. Vincent Barrallon presented his work entitled "Collaborative Construction of Updatable Digital Critical Editions: A Generic Approach." This work aims to establish an updatable data structure to represent the collaborative flow of annotation, especially with respect to editorial efforts. He proposes using bidirectional graphs, triple graphs, or annotated graphs as representations, along with methods of identifying graph similarity.

Hui Li finished the session with her presentation entitled "Social Network Extraction and Exploration of Historic Correspondences" in which she is working to use Named Entity Extraction to create a social network from digitized historical documents. Her effort utilizes topic modeling and event extraction to construct the network.

Due to a scheduling audible, lunch and Session III: Social Factors overlapped slightly. Ray Larson chaired this session, and Mat Kelly was able to attend after landing at LHR and navigating to our hotel. Maliheh Farrokhnia presented her work entitled "A request-based framework for FRBRoo graphical representation: With emphasis on heterogeneous cultural information needs." Her work takes user interests (through adaptive selection of target information) to present relational graphs of digital library content.

Abdulmumin Isah presented his work entitled "The Adoption and Usage of Digital Library Resources by Academic Staff in Nigerian Universities: A Case Study of University of Ilorin." His work highlights a developing country's use of digital resources in academia and cites factors influencing the success of digital libraries.

João Aguiar Castro presented his work entitled "Multi-domain Research Data Description -- Fostering the participation of researchers in an ontology based data management environment." His work with Dendro uses metadata and ontologies to aid in long-term preservation of research data.

The last hour of the consortium was dedicated to an open mic session chaired by Cathy Marshall with the goal of having the student observers present their current work. I presented first and explained my work that aims to mitigate the impact of JavaScript on archived web pages. Mat went next and discussed his work about integrating public and private web archives with tools like WAIL and WARCreate.

Alexander Ororbia presented his work on using artificial intelligence and deep learning for error-correcting crowdsourced data from scholarly texts. Md Arafat Sultan discussed his work on using natural language processing to detect similarity in text and identify text segments that adhere to set standards (e.g., educational standards). Kahyun Choi discussed her work on perceived mood in Eastern music from the point of view of Western listeners. Finally, Fawaz Alarfaj discussed his work using entity extraction, information retrieval, and natural language processing to identify experts within a specified field.

As usual, the Doctoral Consortium was full of interesting ideas, valuable recommendations, and highly motivated Ph.D. students. Tomorrow marks the official beginning of DL2014.

--Justin F. Brunelle

Tuesday, September 2, 2014

2014-09-02: WARCMerge: Merging Multiple WARC files into a single WARC file

WARCMerge is the name given to a new tool for organizing WARC files. The name describes it -- merging multiple WARC files into a single one. In web archiving, WARC files can be generated by well-known web crawlers such as Heritrix and the wget command, or by state-of-the-art tools like WARCreate/WAIL and Webrecorder.io, which were developed to support personal web archiving. WARC files contain records not only for HTTP responses and metadata elements but also all of the original HTTP requests. With these WARC files, any replay tool (e.g., the Wayback Machine) can be used to reconstruct and display the original web pages. I would emphasize here that a single WARC file may consist of records related to different web sites; in other words, multiple web sites can be archived in the same WARC file.
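
Although WARCMerge handles the parsing internally, the WARC record layout is simple enough to walk by hand. Below is a minimal sketch (not WARCMerge's actual code; the filename is illustrative) that iterates over the records in a WARC file and prints each record's type and target URI, showing how responses, requests, and metadata for many different sites can coexist in one file:

    def iter_warc_records(path):
        # Yield the headers of each WARC record. Per the WARC spec, a record
        # starts with a "WARC/1.0" version line, followed by CRLF-terminated
        # header lines, a blank line, then a block of Content-Length bytes.
        with open(path, 'rb') as f:
            while True:
                line = f.readline()
                if not line:
                    break                          # end of file
                if not line.startswith(b'WARC/'):
                    continue                       # skip record separators
                headers = {}
                for raw in iter(f.readline, b'\r\n'):
                    name, _, value = raw.decode('utf-8', 'replace').partition(':')
                    headers[name.strip()] = value.strip()
                f.seek(int(headers['Content-Length']), 1)   # skip the block
                yield headers

    for rec in iter_warc_records('20140707160041872.warc'):
        print('%-10s %s' % (rec.get('WARC-Type'), rec.get('WARC-Target-URI', '-')))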

This Python program runs in three different modes. In the first mode, the program sequentially reads records one by one from the different input WARC files and combines them into a new file, to which an extra metadata element is added to indicate when the merging operation occurred. Furthermore, a new "warcinfo" record is placed at the beginning of the resulting WARC file(s); this record contains information about the WARCMerge program along with date and time metadata.
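
To make the first mode concrete, here is a hedged sketch of what such a merge boils down to; the field names in the record body (and the merged-from element) are illustrative assumptions, not WARCMerge's exact output:

    import uuid
    from datetime import datetime

    def make_warcinfo(sources):
        # Build a "warcinfo" record whose body (application/warc-fields)
        # notes the merging tool and when the merge occurred.
        body = ('software: WARCMerge\r\n'
                'merged-from: %s\r\n' % ', '.join(sources)).encode('utf-8')
        headers = ('WARC/1.0\r\n'
                   'WARC-Type: warcinfo\r\n'
                   'WARC-Date: %s\r\n'
                   'WARC-Record-ID: <urn:uuid:%s>\r\n'
                   'Content-Type: application/warc-fields\r\n'
                   'Content-Length: %d\r\n\r\n'
                   % (datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'),
                      uuid.uuid4(), len(body)))
        return headers.encode('utf-8') + body + b'\r\n\r\n'

    # Mode 1, in essence: a fresh warcinfo record followed by every record
    # from every input WARC, copied in sequence.
    sources = ['1.warc', '2.warc']
    with open('merged.warc', 'wb') as out:
        out.write(make_warcinfo(sources))
        for name in sources:
            with open(name, 'rb') as src:
                out.write(src.read())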

The second mode is very similar to the first mode; the only difference here is the source of WARC files. In the first mode the source files are from a specific directory, while in the second mode an explicit list of WARC files is provided.

In the third mode, an existing WARC file is appended to the end of another WARC file. In this case, only one metadata element (WARC-appended-by-WARCMerge) is added to each "warcinfo" record found in the first file.

Finally, regardless of the mode, WARCMerge always performs error checking: it validates the resulting WARC files and ensures that the size of each resulting file does not exceed the maximum size limit. (The maximum size limit can be changed in the program's source code by assigning a new value to the variable MaxWarcSize.)
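
The effect of the size cap is easy to picture; the sketch below shows the assumed, simplified packing logic (not WARCMerge's actual code): inputs are grouped greedily, and a new output WARC is started whenever adding the next file would push the current one past MaxWarcSize.

    import os

    MaxWarcSize = 1024 * 1024        # 1 MB; adjustable in the source code

    def plan_outputs(warc_paths, max_size=MaxWarcSize):
        # Greedily group input WARCs so no output exceeds max_size;
        # returns a list of groups, one group per resulting WARC file.
        outputs, current, used = [], [], 0
        for path in warc_paths:
            size = os.path.getsize(path)
            if current and used + size > max_size:
                outputs.append(current)      # close the current output file
                current, used = [], 0
            current.append(path)
            used += size
        if current:
            outputs.append(current)
        return outputs

(A single input larger than the cap would need to be split at record granularity, which this sketch ignores.)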

Download WARCMerge:
  • WARCMerge's source code is available on GitHub, or by running the following command:
                   git clone https://github.com/maturban/WARCMerge.git

  • Tested on Linux Ubuntu 12.04
  • Requires Python 2.7+
  • Requires Java to run the JWAT tools for validating WARC files
Running WARCMerge:

As described above, WARCMerge can be run in three different modes; see the three examples below. (Adding the '-q' option makes the program run in quiet mode, in which it does not display any messages.)

Example 1: Merging WARC files (found in "input-directory") into new WARC file(s):

%python WARCMerge.py ./collectionExample/ my-output-dir 

   Merging the following WARC files:
   [ Yes ]./collectionExample/world-cup/20140707174317773.warc 
   [ Yes ]./collectionExample/warcs/20140707160258526.warc 
   [ Yes ]./collectionExample/warcs/20140707160041872.warc 
   [ Yes ]./collectionExample/world-cup/20140707183044349.warc 

   Validating the resulting WARC files:
     - [ valid ] my-output-dir/WARCMerge20140806040712197944.warc
Example 2: Merging all listed WARC files into new WARC file(s):

%python WARCMerge.py  1.warc  2.warc  ./dir1/3.warc  ./warc/4.warc mydir

    Merging the following WARC files:
    [ Yes ] ./warc/4.warc
    [ Yes ] ./1.warc
    [ Yes ] ./dir1/3.warc
    [ Yes ] ./2.warc 
    Validating the resulting WARC files:
    - [ valid ] mydir/WARCMerge20140806040546699431.warc
Example 3: Appending a WARC file to another WARC file. The option '-a' is used here to make sure that any change in the destination file is done intentionally:

%python WARCMerge.py -a ./test/src/1.warc ./test/dest/2.warc

      The resulting (./test/dest/2.warc) is valid WARC file
In case a user enters any incorrect command-line arguments, the following message will be shown:

usage: WARCMerge [[-q] -a <source-file> <dest-file> ]                 
                             [[-q] <input-directory> <output-directory> ]
                             [[-q] <file1> <file2> <file3> ... <output-directory> ]  
WARCMerge can be useful as an independent component, or it can be integrated with other existing tools. For example, WARCreate is a Google Chrome extension that helps users create WARC files for any web pages visited in the browser. Instead of leaving hundreds of such WARC files scattered about, WARCMerge brings all the files together in one place.

-M. Aturban

2014-09-04 Edit:

One feature of the tool that we neglected to mention in the original post is the ability to create multiple merged WARC files. We can set the maximum desired WARC file size, and if the WARCs to be merged exceed that limit, multiple merged WARCs will be created.

Example: We want to merge 4 WARC files with sizes 165 KB, 600 KB, 680 KB, and 900 KB, respectively, and have set a MaxWarcSize of 1 MB. With that cap, the first two files can share an output (165 KB + 600 KB = 765 KB), but the 680 KB and 900 KB files would each push an output past the limit, so the merge produces three WARC files.

% python WARCMerge.py ./smallCollectionExample/about-warcs/20140707160041872.warc ./smallCollectionExample/about-warcs/20140707160258526.warc ./smallCollectionExample/world-cup-2014/20140707174317773.warc ./smallCollectionExample/world-cup-2014/20140707183044349.warc ./warcs

Merging the following WARC files: 

Validating the resulting WARC files: 
[valid]  ./warcs/WARCMerge20140904030816942537.warc
[valid]  ./warcs/WARCMerge20140904030817019383.warc
[valid]  ./warcs/WARCMerge20140904030817129653.warc

Thursday, August 28, 2014

2014-08-28: InfoVis 2013 Class Projects

(Note: This is continuing a series of posts about visualizations created either by students in our research group or in our classes.)

I've been teaching the graduate Information Visualization course since Fall 2011.  In this series of posts, I'm highlighting a few of the projects from each course offering.  (Previous posts: Fall 2011, Fall 2012)

In Spring 2013, I taught an Applied Visual Analytics course that asked students to create visualizations based on the "Life in Hampton Roads" annual survey performed by the Social Science Research Center at ODU.  In Fall 2013, I taught the traditional InfoVis course that allowed students to choose their own topics.  (All class projects are listed in my InfoVis Gallery.)

Life in Hampton Roads
Created by Ben Pitts, Adarsh Sriram Sathyanarayana, Rahul Ganta

This project (currently available at https://ws-dl.cs.odu.edu/vis/LIHR/) provides a visualization of the "Life in Hampton Roads" survey for 2010-2012. Data cleansing was done with the help of tools like Excel and OpenRefine. The results of the survey are shown using maps and bar charts, making it easier to understand public opinion on a particular question. The visualizations were created using tools like D3, JavaScript, jQuery, and HTML, and are reader-driven and exploratory. The user can interactively change the filters in the top half of the page and see the data only for the selected groups below.

The video below provides a demo of the tool.

MeSH Viewer
Created by Gil Moskowitz

MeSH is a controlled vocabulary developed and maintained by the U.S. National Library of Medicine (NLM) for indexing biomedical databases, including the PubMed citation database. PubMed queries can be made more precise, returning fewer citations with higher relevance, by issuing them with specific reference to MeSH terms. The database of MeSH terms is large, with many interrelationships between terms. The MeSH viewer (currently available at https://ws-dl.cs.odu.edu/vis/MeSH-vis/) visually presents a subset of MeSH terms, specifically MeSH descriptors. The MeSH descriptors are organized into trees based on a hierarchy of MeSH numbers. This work includes a tree view to show the relationships as structured by the NLM. However, individual descriptors may have multiple MeSH numbers, hence multiple locations in the forest of trees, so this view is augmented with a force-directed node-link view of terms found in response to a user search. The terms can be selected and used to build PubMed search strings, and an estimate of the specificity of this combination of terms is displayed.
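
As a hedged illustration of the kind of search string the viewer builds (this is not the MeSH Viewer's code, and the descriptors chosen are arbitrary), MeSH descriptors are qualified with the [MeSH Terms] field, combined with boolean operators, and can be submitted to NCBI's public E-utilities esearch endpoint:

    import requests  # third-party; pip install requests

    # Two MeSH descriptors combined into a more precise PubMed query.
    params = {'db': 'pubmed',
              'term': '"asthma"[MeSH Terms] AND "child"[MeSH Terms]'}
    r = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi',
                     params=params)
    print(r.text)    # XML listing the count and IDs of matching citations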

The video below provides a demo of the tool.

Visualizing Currency Volatility
Created by Jason Long

This project (currently available at https://ws-dl.cs.odu.edu/vis/currency/) was focused on showing changes in currency values over time. The initial display shows the values of 39 currencies (including Bitcoin) as compared to the US Dollar (USD).  It's easy to see when the Euro was introduced and individual currencies in EU countries dropped off.  Clicking on a particular currency expands its chart and allows for closer inspection of its change over time.  There is also a tab for a heatmap view that allows the user to view a moving average of the difference from USD.  The color indicates whether the USD is appreciating (green) or depreciating (red).

The video below provides a demo of the tool.


Tuesday, August 26, 2014

2014-08-26: Memento 101 -- An Overview of Memento in 101 Slides

In preparation for the upcoming "Tools and Techniques for Revisiting Online Scholarly Content" tutorial at JCDL 2014, Herbert and I have revamped the canonical slide deck for Memento, and have called it "Memento 101" for the 101 slides it contains.  The previous slide deck was from May 2011 and was no longer current with the RFC (December 2013).  The slides cover Memento basic and intermediate concepts, with pointers for some of the more detailed and esoteric bits (like patterns 2, 3, and 4, as well as the special cases) of interest to only the most hard-core archive wonks. 

The JCDL 2014 tutorial will use a subset of these slides, combined with updates from the Hiberlink project and various demos. If you find yourself in need of explaining Memento, please feel free to use these slides in part or in whole (the PPT is available for download from SlideShare).

--Michael & Herbert

Thursday, August 21, 2014

2014-08-22: One WS-DL Class Offered for Fall 2014

This fall, only one WS-DL class will be offered:
This class approaches the Web as a phenomenon to be studied in its own right.  In this class we will explore a variety of tools (e.g., Python, R, D3) as well as applications (e.g., social networks, recommendations, clustering, classification) that are commonly used in analyzing the socio-technical structures of the Web. 

The class will be similar to the fall 2013 offering. 

Right now we're planning on offering in spring 2015:
  • CS 418 Web Programming 
  • CS 725/825 Information Visualization
  • CS 751/851 Introduction to Digital Libraries

Friday, July 25, 2014

2014-07-25: Digital Preservation 2014 Trip Report

Mat Kelly and Dr. Michael L. Nelson traveled to Washington, DC to report on their current research and to learn about others' work in the field.
On July 22 and 23, 2014, Dr. Michael Nelson (@phonedude_mln) and I (@machawk1) attended Digital Preservation 2014 in Washington, DC. This was my fourth consecutive NDIIPP (@ndiipp) / NDSA (@ndsa2) meeting (see trip reports from Digital Preservation 2011, 2012, 2013). With the largest attendance yet (300+) and compressed into two days, the schedule was jam-packed with interesting talks. Per usual, videos for most of the presentations are included inline below.

Day One

Micah Altman (@drmaltman) led the presentations with information about the NDSA and asked, regarding Amazon claiming 99.999999999% reliability for uptime, "What do the eleven nines mean?". "There are a number of risks that we know about [as archivists] that Amazon doesn't," he said, continuing, "No single institution can account for all the risks." Micah spoke about the updated National Agenda for Digital Stewardship, which is to have the theme of "developing evidence for collective action".

Matt Kirschenbaum (@mkirschenbaum) followed Micah with his presentation Software, It’s a Thing, with reference to the recently excavated, infamous Atari E.T. game. "Though the data on the tapes had been available for years, the uncovering was more about aura and allure". Matt spoke about George R. R. Martin's (of Game of Thrones fame) recent interview in which he stated that he still uses a word processing program called WordStar on an un-networked DOS machine and that it is his "secret weapon from distraction and viruses." In the 1980s, WordStar dominated the market until WordPerfect took the reins, followed by Microsoft Word. "A power user that has memorized all of the WordStar commands could utilize the software with the ease of picking up a pencil and starting to write."
Matt went on to talk (also see his Medium post) about software as different concepts: as an asset, as an object, as a kind of notation or score (qua music), as shrinkwrap, etc. For a full explanation of each, see his presentation:

Following Matt, Shannon Mattern (@shannonmattern) shared her presentation on Preservation Aesthetics. "One of preservationists' primary concerns is whether an item has intrinsic value," she said. Shannon then spoke about various sorts of auto-destructive art and software, including light-sensitive art and software that deletes itself on execution. In addition to recording her talk (see below), she graciously included the full text of her talk online.

The conference had a brief break and a quick summary of the poster session to come later in day 1. After the break, there was a panel titled "Stewarding Space Data". Hampapuram Ramapriyan of NASA began the panel by stating that "NASA data is available as soon as the satellites are launched." He continued, "This data is preserved... We shouldn't lose bits that we worked hard to collect and process for results." He also stated that he (and NASA) is part of the Earth Science Information Partners (ESIP) Data Stewardship Committee.

Deirdre Byrne of NOAA presented next, speaking of NOAA's dynamics and its need to document data, preserve it with provenance, provide IT support to maintain the data's integrity, and be able to work with the data in the future (a digital preservation plus). Deirdre then referenced Pathfinder, a technology that allows the visualization of sea surface temperatures, among other features like indicating coral bleaching, fish stocks, and the water quality on coasts. Citing its use now as a de facto standard means for this purpose, she described the physical setup as having 8 different satellites for its functionality along with 31 mirror satellites on standby for redundancy.

Emily Frieda Shaw (@emilyfshaw) of the University of Iowa Libraries followed Deirdre on the panel and spoke about Iowa's role in preserving the original data from the Explorer 1 launch. Upon converting and analyzing the data, her fellow researchers realized that at certain altitudes the radiation detection dropped to zero, which indicated that there were large belts of particles surrounding the Earth (later recognized as the Van Allen belts). After discovering more data about the launch in a basement, the group received a grant to recover and restore the badly damaged and moldy documentation.

Karl Nilsen (@KarlNilsen) and Robin Dasler (@rdasler) of the University of Maryland Libraries were next, with Robin first talking about her concern with data-related issues in the library realm. She referenced one project whose data still resided at the University of Hawaii's Institute for Astronomy because it was the home school of one of the original researchers. The project consisted of data measuring the distances between galaxies, compiled by combining various data sources of both observational data and standard corpora. To display the data (500 gigabytes total), the group developed a UI built on web technologies like MySQL to make the data more accessible. "Researchers were concerned about their data disappearing before they retired," she said of the original motive for increasing the data's accessibility.
Karl changed topics somewhat, stating that two different perspectives can be taken on data from a preservation standpoint (format-centric or system-centric). "The intellectual value of a database comes from the ad hoc combination of multiple tables in the form of joins and selections," he said. "Thinking about how to provide access," he continued, "is itself preservation." He followed this with approaches including utilizing virtual machines (VMs), migrating from one application to a more modern web application, and collapsing the digital preservation horizon to ten years at a time.
Following a brief Q&A for the panel was a series of lightning talks. James (Jim) A. Bradley of Ball State University started with his talk, "Beyond the Russian Doll Effect: Reflexivity and the Digital Repository Paradigm", where he spoke about promoting and sharing digital assets for reuse. Jim then talked about the Digital Media Repository (DMR), which allows information to be shared and made available at the page level. His group had the unique opportunity to tell what materials are in the library, who was using them, and when. From these patterns, grad students made 3-D models, which were then added and attributed to the original objects.
Dave Gibson (@davegilbson) of the Library of Congress followed Jim by presenting Video Game Source Disc Preservation. He stated that his group has been the "keepers of The Game" since 1996 and holds source code excerpts from a variety of unreleased games, including Duke Nukem: Critical Mass, which was released for the Nintendo DS but not the PlayStation Portable (PSP), despite being developed for both platforms. In their exploration of the game, they uncovered 28 different file formats on the discs, many of them proprietary, and wondered how they could get access to the files' contents. After using MediaCoder to convert many of the audio files and hex editors to read the ASCII, his group found source code fragments hidden within the files. From this work, they now have the ability to preserve unreleased games.
Rebecca (Becky) Katz of the Council of the District of Columbia was next with UELMA-Compliant Preservation: Questions and Answers?. She first described UELMA (the Uniform Electronic Legal Material Act), which declares that if a U.S. state passes the act and its digital legal publications are official, the state has to make sure that the publications are preserved in some way. Because of this requirement, many states are reluctant to rely solely on digital documents and instead keep paper copies in addition to the digital copies. Many of the barriers in the preservation process for the states lie in how they ought to preserve the digital documents. "For long term access," she said, "we want to be preserving our digital content." Becky also spoke about appropriate formats for preservation, noting that common format features like authentication for PDF are not open source. "All the metadata in the world doesn't mean anything if the data is not usable," she said. "We want to have a user interface that integrates with the [preservation] repository." She concluded by recommending that states adopting UELMA enter into agreements with universities or other long-standing institutions and allow users to download updates of the documents to ensure that many copies are available.
Kate Murray of the Library of Congress followed Becky with "We Want You Just the Way You Are: The What, Why and When of Fixity in Digital Preservation". She referenced the "Migrant Mother" photo and how, through moving the digital photo from one storage component to another, subtle changes crept into the photo. "I never realized that there are three children in the photo!", she said, referencing the comparison between the altered and original photos. To detect these changes, she uses fixity checks (see the related article on The Signal) across a whole collection of data, which ensures bit-level integrity.
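
For readers unfamiliar with fixity, a minimal sketch of such a check looks like the following (the manifest format here is an assumption for illustration; real workflows often use BagIt manifests):

    import hashlib
    import os

    def sha256(path, bufsize=1 << 20):
        # Hash a file in chunks so large files do not exhaust memory.
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(bufsize), b''):
                h.update(chunk)
        return h.hexdigest()

    def check_fixity(root, manifest):
        # manifest: {relative path: expected hex digest}; any mismatch
        # means the bits have changed since the digest was recorded.
        for relpath, expected in sorted(manifest.items()):
            actual = sha256(os.path.join(root, relpath))
            status = 'OK' if actual == expected else 'CHANGED'
            print('[%s] %s' % (status, relpath))
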
Following Kate, Krystyna Ohnesorge of the Swiss Federal Archives (SFA) presented "Save Your Databases Using SIARD!". "About 90 percent of specialized applications are based on relational databases," she said. SIARD is used to preserve database content for the long term so that the data can still be understood in 20 or 50 years. The system is already used in 54 countries, with 341 licenses currently held by different international organizations. "If an institution needs to archive relational databases, don't hesitate to use the SIARD suite and SIARD format!"
The conference then broke for lunch where the 2014 Innovation Awards were presented.
Following lunch, Cole Crawford (@coleinthecloud) of the Open Compute Foundation presented "Community Driven Innovation", where he spoke about Open Compute being an internationally based open source project. "Microsoft is moving all Azure data to Open Compute", he said. "Our technology requirements are different. To have an open innovation system that's reusable is important." He then emphasized that his talk would focus specifically on the storage aspects of Open Compute. He started with "FACT: The 19 inch server rack originated in the railroad industry, then propagated to the music industry, then was adopted by IT." He continued, "One of the most interesting things Facebook has done recently is move from tape storage to ten thousand Blu-ray discs for cold storage". He stated that in 2010 the Internet consisted of 0.8 zettabytes; in 2012 the number was 3.0 zettabytes; and by 2015, he claimed, it will be 40 zettabytes in size. "As stewards of digital data, you guys can be working with our project to fit your requirements. We can be a great engine for procurement. As you need more access to content, we can get that."

After Cole was another panel, titled "Community Approaches to Digital Stewardship". Fran Berman (@resdatall) of Rensselaer Polytechnic Institute started off with a reference to the Research Data Alliance. "All communities are working to develop the infrastructure that's appropriate to them," she said. "If I want to worry about asthma (referencing an earlier comment about whether asthma is more likely to be acquired in Mexico City versus Los Angeles), I don't want to wait years until the infrastructure is in place. If you have to worry about data, that data needs to live somewhere."

Bradley Daigle (@BradleyDaigle) of the University of Virginia followed Fran and spoke about the Academic Preservation Trust, a group of 17 members that takes a community-based approach, attempting to deliver not just more solutions but better ones. The group would like to create a business model based on preservation services. "If you have X amount of data, I can tell you it will take Y amount of cost to preserve that data," he said, describing an ideal model. "The AP Trust can serve as a scratch space with preservation being applied to the data."

Following Bradley on the panel, Aaron Rubinstein of the University of Massachusetts Amherst described his organization's scheme as being similar to Scooby-Doo, iterating through each character displayed on-screen and stating the name of a member organization. "One of the things that makes our institution privileged is that we have administrators that understand the need for preservation."

The last presenter on the panel, Jamie Schumacher of Northern Illinois University, started with "Smaller institutions have challenges when starting digital preservation. Instead of obtaining an implementation grant when applying to the NEH, we got a 'Figure It Out' grant. ... Our mission was to investigate a handful of digital preservation solutions that were affordable for organizations with restricted resources, like small staff sizes and those with lone rangers. We discovered that practitioners are overwhelmed. To them, digital objects are a foreign thing." Some of the roadblocks her team eliminated were the questions of which tools and services to use for preservation tasks, for which Google frequently gave poor or too many results.

Following a short break, the conference split into five different concurrent workshops and breakout sessions. I attended the session titled Next Generation: The First Digital Repository for Museum Collections where Ben Fino-Radin (@benfinoradin) of Museum of Modern Art, Dan Gillean of Artefactual Systems and Kara Van Malssen (@kvanmalssen) of AVPreserve gave a demo of their work.

As I was presenting a poster at Digital Preservation 2014, I was unable to stay for the second presentation in the session, Revisiting Digital Forensics Workflows in Collecting Institutions by Marty Gengenbach of the Gates Archive, as I was required to set up my poster. Starting at 5 o'clock, the breakout sessions ended and a reception was held with the poster session in the main area of the hotel. My poster, "Efficient Thumbnail Summarization for Web Archives", is an implementation of Ahmed AlSum's initial work published at ECIR 2014 (see his ECIR trip report).

Day Two

The second day started off with breakfast and an immediate set of workshops and breakout sessions. Among these, I attended Digital Preservation Questions and Answers from the NDSA Innovation Working Group, where Trevor Owens (@tjowens), among other group members, introduced the history and migration of an online digital preservation question-and-answer system. The current site, residing at http://qanda.digipres.org, is in the process of migration from previous attempts, including a failed try at a Digital Preservation Stack Exchange. This work, completed in part by Andy Jackson (@anjacks0n) at the UK Web Archive, began its migration with his Zombse project, which extracted all of the initial questions and data from the failed Stack Exchange into a format that would eventually be readable by another Q&A system.
Following a short break after the first long set of sessions, the conference re-split into the second set of breakout sessions for the day, where I attended the session titled Preserving News and Journalism. Aurelia Moser (@auremoser) moderated this panel-style presentation and initially showed a URI where the presentation's resources could be found (I typed bit.ly/1klZ4f2, but that seems to be incorrect).
The panel, consisting of Anne Wootton (@annewooton), Leslie Johnston (@lljohnston), and Edward McCain (@e_mccain) of the Reynolds Journalism Institute, initially asked, "What is born digital and what are news apps?". The group had sent a survey to 476 news organizations: 406 hybrid organizations (those that publish content both in print and online) and 70 online-only publications. The surveyors kept the answers open-ended so the journalists could give an accurate account of their practices. "Newspapers that are in chains are more likely to have written policies for preservation. The smaller organizations are where we're seeing data loss." At one point, Anne Wootton's group organized a "Design-a-Thon" where they gathered journalists, archivists, and developers. Regarding preservation practice, the group stated that Content Management System (CMS) vendors for news outlets hold the keys to the kingdom for newspapers in regard to preservation.

After the third breakout session of the conference, lunch was served (Mexican!) with Trevor Owens of the Library of Congress, Amanda Brennan (@continuants) of Tumblr, and Trevor Blank (@trevorjblank) of The State University of New York at Potsdam giving a preview of CURATEcamp, to occur the day after the conference concluded. While lunch proceeded, a series of lightning talks was presented.
The first lightning talk was by Kate Holterhoff (@KateHolterhoff) of Carnegie Mellon University and titled Visual Haggard and Digitizing Illustration. In her talk, she introduced Visual Haggard, a catalog of many images from public domain books and documents that attempts to provide better-quality representations of the images in these documents than other online systems like Google Books. "Digital archivists should contextualize and mediate access to digital illustrations", she said.

Michele Kimpton of DuraSpace followed Kate with DuraSpace and Chronopolis Partner to Build a Long-term Access and Preservation Platform. In her presentation she introduced tools like Chronopolis (used for dark archiving) and DuraCloud, and described her group's approach to getting various open source tools to work together to provide a more comprehensive solution for preserving digital content.

Following Michele, Ted Westervelt of the Library of Congress presented Library of Congress Recommended Formats, where he referenced the Best Edition Statement, a largely obsolete but useful document that needed supplementation to account for modern best practices and newer mediums. His group has developed the "Recommended Format Specification", a work in progress that provides this guidance without superseding the original document. The goal of the document is to set parameters for its target objects so that most content currently unhandled by the in-place specification will have directives to ensure that digital preservation of the objects is guided and easy.

After Ted, Jane Zhang of the Catholic University of America presented Electronic Records and Digital Archivists: A Landscape Review, a cursory study of employment positions for digital archivists, both formally trained and trained on the job. She attempted to answer the question "Are libraries hiring digital archivists?" by looking for patterns across one hundred job descriptions.

After the lightning talks, another panel was held, titled "Research Data and Curation". Inna Kouper (@inkouper) and Dharma Akmon (@DharmaAkmon), of Indiana University and the University of Michigan, Ann Arbor, respectively, discussed Sustainable Environment Actionable Data (SEAD, @SEADdatanet) and a Research Object Framework (ROF) for the data that will be familiar to practitioners who work with video producers. "Research projects are bundles", they said. "The ROF captures multiple aspects of working with data, including unique IDs, agents, states, relationships, and content, and how they cyclically relate. Research objects change states."
They continued, "Curation activities are happening from the beginning to the end of an object's lifecycle. An object goes through three main states," they listed. "Live objects are in a state of flux, handled by members of project teams, and their transition is initiated by the intent to publish. Curation objects consist of content packaged using the BagIt protocol with metadata and relationships via OAI-ORE maps, which are mutable but allow selective changes to metadata. Finally, publication objects are immutable and citable via a DOI, with revisions and derivations tracked."
Ixchel Faniel of OCLC Research then presented, stating that there are three perspectives on archeological practice: "The data has a lifecycle from the data collection, data sharing, and data reuse perspectives." Her investigation examined how actions in one part of the lifecycle facilitate work in other parts. She collected data over a year and a half (2012-2014) from 9 data producers, 2 repository staff, and 7 data re-users and concluded that actions in one part of the cycle influence what occurs in other stages. "Repository actions are overwhelmingly positive but cannot always reverse or explain documentation procedures."
Following Ixchel and the panel, George Oates (@goodformand), director of Good, Form & Spectacle, presented Contending with the Network. "Design can increase access to the material", she said, citing her experience with Flickr and the Internet Archive. Relating to her work with Flickr, she referenced The Commons, a program that is attempting to catalog the world's public photo archives. In her work with IA, she was most proud of the interface she designed for Understanding 9/11. She then worked for a group named MapStack and created a project called Surging Seas, an interactive tool for visualizing sea level rise. She recently started the new business Good, Form & Spectacle and proceeded on a formal mission of archiving all documents related to the mission through metadata. "It's always useful to see someone use what you've designed. They'll do stuff you anticipate and that you think is not so clear."
Following a short break, the last session of the day started with the Web Archiving panel. Stephen Abrams of the California Digital Library (CDL) started the presentation by asking, "Why web archiving? Before, the web was a giant document retrieval system. This is no longer the case. Now, the web browser is a virtual machine where the language of choice is JavaScript and not HTML." He stated that the web is a primary research data object and that we need to provide programmatic and business ways to support web archiving.
After Stephen, Martin Klein (@mart1nkle1n) of Los Alamos National Laboratory (LANL), formerly of our research group, gave an update on the state of the work done with Memento. "There's extensive Memento infrastructure in place now!", he said. New web services soon to be offered by LANL include a Memento redirect service (for example, going to http://example.com/memento/20040723150000/http://matkelly.com will automatically resolve to the closest archived copy available in the archives); a Memento list/search service to allow memento lookup through a user interface by specifying dates, times, and a URI; and finally, a Memento TimeMap service.
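
Underneath services like these sits Memento's datetime negotiation (RFC 7089): a client asks a TimeGate for the archived copy closest to a desired datetime via the Accept-Datetime header. A hedged sketch, using the Internet Archive's Wayback Machine as an illustrative TimeGate:

    import requests  # third-party; pip install requests

    # Ask the TimeGate for the memento nearest the requested datetime;
    # the TimeGate redirects to the closest available archived copy.
    r = requests.get('http://web.archive.org/web/http://matkelly.com',
                     headers={'Accept-Datetime':
                              'Fri, 23 Jul 2004 15:00:00 GMT'})
    print(r.url)                                # URI of the closest memento
    print(r.headers.get('Memento-Datetime'))    # its archival datetime
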
After Martin, Jimmy Lin (@lintool) of the University of Maryland, formerly of Twitter, presented on leveraging his big data expertise for digital preservation. "The problem", he said, "is that web archives are an important part of heritage but are vastly underused. Users can't do that much with web archives currently." His goal is to build tools to support exploration and discovery in web archives. One tool his group built, Warcbase, uses Hadoop and HBase for topic modeling.
After Jimmy, ODU WS-DL's very own Michael Nelson (@phonedude_mln) presented, starting off with "The problem is that right now, we're still in the phase of 'Hooray! It's in the web archive!' whenever something shows up. What we should be asking is, 'How well did we archive it?'" Referencing the recent publicity around the Internet Archive capturing evidence related to the plane shot down over Ukraine, Michael said, "We were just happy that we had it archived. When you click on one of the videos, however, it just sits there and hangs. We have the page archived but maybe not all the stuff archived that we would like." He then went on to describe the ways his group is assessing web archives: determining the importance of what's missing, detecting temporal violations, and benchmarking how well the tools handle the content they're made to capture.

After Dr. Nelson presented, the audience asked an extensive number of questions.
After the final panel, Dragan Espenschied (@despens) of Rhizome presented Big Data, Little Narration (see his interview & links page). In his unconventional presentation, he reiterated that some artifacts don't make sense in the archives. "Each data point needs additional data about it somewhere else to give it meaning," he said, giving a demonstration of an authentic replay of GeoCities sites in Netscape 2.0 via a browser-accessible emulator. "Every instance of digital culture is too large for an institution because it's digital and we cannot completely represent it."
As the conference adjourned, I was glad I was able to experience it, see the progress other attendees have made in the last three (or more) years, and present the status of my research.
— Mat Kelly (@machawk1)

Tuesday, July 22, 2014

2014-07-22: "Archive What I See Now" Project Funded by NEH Office of Digital Humanities

We are grateful for the continued support of the National Endowment for the Humanities and their Office of Digital Humanities for our "Archive What I See Now" project.
In 2013, we received support for 1 year through a Digital Humanities Start-Up Grant.  This week, along with our collaborator Dr. Liza Potts from Michigan State, we were awarded a 3-year Digital Humanities Implementation Grant. We are excited to be one of the seven projects selected this year.

Our project goals are two-fold:
  1. to enable users to generate files suitable for use by large-scale archives (i.e., WARC files) with tools as simple as the "bookmarking" or "save page as" approaches that they already know
  2. to enable users to access the archived resources in their browser through one of the available add-ons or through a local version of the Wayback Machine (wayback).
Our innovation is in allowing individuals to "archive what I see now". The user can create a standard web archive file ("archive") of the content displayed in the browser ("what I see") at a particular time ("now").
Our work focuses on bringing the power of institutional web archiving tools like Heritrix and wayback to humanities scholars through open-source tools for personal-scale web archiving. We are building the following tools:
  • WARCreate - A browser extension (for both Google Chrome and Firefox) that can create an archive of a single webpage in the standard WARC format and save it to local disk. It allows a user to archive pages behind authentication or pages that have been modified through user interaction.
  • WAIL (Web Archiving Integration Layer) - A stand-alone application that provides one-click installation and GUI-based configuration of both Heritrix and wayback on the user’s personal computer.
  • Mink - A browser extension (for both Google Chrome and Firefox) that provides access to archived versions of live webpages. This is an additional Memento client that can be configured to access locally stored WARC files created by WARCreate.
With these three tools, a researcher could, in her normal workflow, discover a web resource (using her browser), archive the resource as she saw it (using WARCreate in her browser), and then later index and replay the archived resource (using WAIL). Once the archived resource is indexed, it would be available for view in the researcher’s browser (using Mink).

We are looking forward to working with our project partners and advisory board members: Kristine Hanna (Archive-It), Lily Pregill (NY Art Resources Consortium), Dr. Louisa Wood Ruby (Frick Art Reference Library), Dr. Steven Schneider (SUNY Institute of Technology), and Dr. Avi Santo (ODU).

A previous post described the work we did for the start-up grant:

We've also posted previously about some of the tools (WARCreate and WAIL) that we've developed as part of this project:

See also "ODU gets largest of 11 humanities grants in Virginia" from The Virginian-Pilot.