Wednesday, October 7, 2015

2015-10-07: IMLS and NSF fund web archive research for WS-DL

In the spring and summer of 2015, the Web Science and Digital Libraries (WS-DL) group received a total of $950k of funding from the IMLS and the NSF to study various aspects of web archiving.  Although previously announced on Twitter (IMLS: 2015-03-31 & NSF: 2015-08-25), here we provide greater context for how these awards support our vision for the future of web archiving*.

Our IMLS proposal is titled "Combining Social Media Storytelling With Web Archives" and a PDF of the full proposal is available directly from the IMLS.  This proposal is joint with our partners at Archive-It and is informed by our experiences in several areas, such as:
Our most illuminating insight (somewhat obvious in retrospect) is to not try to include all of the collection's holdings in its summarization, but to only surface the exemplary components sufficient to distinguish one collection from the next.  One example we frequently use is "how do we distinguish the many 'human rights' collections available in Archive-It?"  They all have different perspectives, but they can be difficult to navigate for those without detailed knowledge of the seed URIs and the collection development policy.

The IMLS proposal will investigate two main thrusts:
  1. Selecting a small number (e.g., 20) of exemplary pages from a collection (often 100s of archived copies of 1000s of web pages) and loading them in an existing tool such as Storify as a summarization interface (instead of custom & unfamiliar interfaces).  Yasmin AlNoamany has some exciting preliminary work in this area; for example see her TPDL 2015 paper examining what makes a "good" story on Storify, and her presentation "Using Web Archives to Enrich the Live Web Experience Through Storytelling".
  2. Using existing stories to generate seed URIs for collections.  One problem for human-generated web archive collections is that they depend on the domain knowledge of curators.   For example, the image above shows two Storify stories about early riots in Kiev (aka Kyiv) which predated much of the exposure in Western media and then the subsequent escalation of the crisis.  The collection at Archive-It was not begun until the annexation of the Crimea was imminent, possibly missing the URIs that document the early stages of this developing story.  Our idea is to mine social media, especially stories, for semi-automated, early creation of web archive collections. 
The NSF proposal is titled "Increasing the Value of Existing Web Archives" and represents a shift in how we think about web archiving.  One point we've made for a while now (for example, see our 2014 presentation "Assessing the Quality of Web Archives") is that we must shift our current focus from simply piling up bits in the archive to more nuanced questions of how to make the archives more immediately useful (as opposed to just insurance for future loss) and how to assess & meaningfully convey the quality of the archived page.  This proposal will have three main research thrusts:
  1. Inspired by Martin Klein's PhD research and Hugo Huurdeman et al.'s "Finding Pages on the Unarchived Web" from JCDL 2014, we would like to see archives provide recommendations of related pages in the archive, as well as suggested "replacements" for pages that are not archived.  Web archives now just return a "yes" (200) or "no" (404) when you query for a URI -- they should be able to provide more detailed answers based on their holdings.
  2. We'd like to further investigate the various issues of how well a page is archived.  We have some preliminary work from Justin Brunelle for automatically assessing the impact of missing embedded resources (typically stylesheets and images), as well as from Scott Ainsworth on detecting temporal violations -- combinations of HTML and images that never occurred on the live web (see "Only One Out of Five Archived Web Pages Existed as Presented" from HT 2015).  
  3. Related to #2, we need to find a better way to visualize the temporal & archival makeup of replayed pages.  For example, the LANL Time Travel service does a nice job of showing the various archives that contribute resources to a reconstruction, but questions remain about scale as well as describing temporal violations and their likely semantic impact.  Similarly, we'd like to investigate how to convey the request environment that generated the representation you're viewing now (see our 2013 D-Lib paper "A Method for Identifying Personalized Representations in Web Archives" for preliminary ideas on linking various geoip, mobile vs. desktop, and other related representations). 
We have been very fortunate with respect to funding in 2015 and we look forward to continued progress on the research thrusts outlined above.  We'd like to thank everyone who made these awards possible.  We welcome any feedback on or interest in these (and other) projects as we progress.  Watch this blog and @WebSciDL for continued research updates.


* = See also our 2014 award for $324k from the NEH for the study of personal web archiving and our 2014 award for $49k from the IIPC for profiling web archives for a more complete picture of our research vision for web archives.

Wednesday, September 30, 2015

2015-09-30: Digital Preservation - Magdeburg Germany Trip Report

Dr. Herzog: This large green area on your left is Sanssouci Park. It has 11 palaces in it.
Yasmin: I want to visit this park after we are back from the university, can we?
Dr. Herzog: We sure can... I think we will be back before sunset.
Yasmin: I love beautiful things.
Dr. Herzog: Who doesn't?
Sawood: [Smiles]

The three souls were heading to the Hochschule Magdeburg-Stendal University from Potsdam, Germany in Dr. Michael Herzog's car for a lunch lecture on the topic of Digital Preservation. Yasmin and Sawood from the Web Science and Digital Libraries Research Group of Old Dominion University, Norfolk, Virginia were invited by Dr. Herzog to give the talk at his SPiRIT Research Group. The two WSDL members had presented their work at TPDL 2015 in Poznan, Poland, and on their way back home they were halted and hosted by Dr. Herzog in Germany for the lunch lecture. You may also enjoy the TPDL 2015 trip report by Yasmin.

Passing by beautiful landscapes, crossing bridges and rivers, observing renewable energy sources such as windmills and solar panels, and touching almost 200 km/h on the highway, we reached the university in Magdeburg. Because of the vacations there were not many people on campus, but the canteen was still crowded when we went there for lunch. Dr. Herzog's student, Benjamin Hatscher (who created the poster for the talk), joined us for lunch. Then we headed to the room that was reserved for the talk and started the session.

Dr. Herzog briefly introduced us, our research group, and our topics for the day to the audience. He also shared his recent memories about the time he spent at ODU and about his interactions with the WSDL members. Then he left the podium for Yasmin.

Yasmin presented her talk on the topic, "Using Web Archives to Enrich the Live Web Experience Through Storytelling". She noted that her work is supported in part by IMLS. She started her introduction with a set of interesting images. She then illustrated the importance of the time aspect in storytelling and described what storytelling looks like on the Web, especially on social media. She discussed the need to select a very small but representative subset from a big pile of resources around a certain topic to tell the story. Selecting the small representative subset is a challenging but important task. It gives a brief summary as well as an entry point to dive deep into the story and explore the remaining resources. She gave examples of how Facebook Lookback compiles a few highlights from hundreds or thousands of someone's shared items, and of 1 Second Everyday for storytelling. Then she moved on to the popular social media storytelling service Storify and described its issues, such as flat representation, bookmarking rather than preservation, and resources going off-topic over time. This led her to the description of Web archives, Memento, and Web archiving services (mainly Archive-It). Then she described the shortcomings of Web archiving services when it comes to storytelling and how they can be improved by combining Web archives and storytelling services. After that she concluded her talk by describing her approaches and policies for selecting the representative subset of resources from a collection.

I, Sawood Alam, presented my talk on the topic, "Web Archiving: A Brief Introduction". I briefly introduced myself with the help of my academic footprint and my lexical signature. The term "lexical signature" led me to touch on Martin Klein's work and how I used it to describe a person instead of a document. Then I followed the agenda for the talk and began with a description of archiving in general, the concept of Web archiving, and the differences between the two.

I then briefly talked about the purpose and importance of Web archiving on institutional and personal scales. Then I described various phases and challenges involved in Web archiving such as crawling, storage, retrieval, replay, completeness, accuracy, and credibility. This gave me the opportunity to reference various WSDL members' research work such as Justin's Two-Tiered Crawling and Scott's Temporal Violations. Then I talked about existing Web and digital archiving efforts and various tools used by Web archivists in various stages. The list included widely used tools such as Heritrix, OpenWayback, and TimeTravel as well as various tools developed by WSDL members or other individual developers such as CarbonDate, Warrick, Synchronicity, WARCreate, WAIL, Mink, MemGator, and Browsertrix. After that I briefly described the Memento protocol and the Memento aggregator.

This led me to my IIPC-funded research work on Archive Profiling. In this section of the talk I described why archive profiling is important, how it can help in Memento query routing, and what an archive profile looks like.

To motivate the audience for research in the Web archiving field I discussed various related areas that have vast research opportunities to explore.

Then I concluded my talk with an introduction to our Web Science and Digital Libraries Research Group. This was the fun part of the talk, full of pictures illustrating the lifestyle and work environment at our lab. I illustrated how we use tables in our lab for fun traditions such as bringing lots of food after a successful defense or spreading assignment submissions on the Ping Pong table for parallel evaluation. I illustrated our effective use of the white boards, from the "about:blank" state to the highly busy and annotated state, and the reserved space for the "PhD Crush" that keeps track of the progress of each WSDL member in a visual and fun way. I couldn't resist showing our Origami skills on the scale of covering an entire cubicle and every single item in it individually.

After a brief Q&A session, Dr. Herzog formally concluded the event.

From there we were all free to explore the beauty of the places around us, and we did so to the extent possible. We toured historical places in the city of Magdeburg, such as the Gothic architecture masterpiece Magdeburg Cathedral, and on our way back to Potsdam we saw the newly built Magdeburg Water Bridge, the largest canal underbridge.

By the time we reached Potsdam the sun had already set, but we still managed to see a couple of the palaces in Sanssouci Park, and they looked beautiful in that light. We even managed to take a few pictures in the low light.

Dr. Herzog invited us for dinner at his place and we had no reason or intention to say no. He was the head chef in his kitchen and prepared for us a delicious rice recipe and white asparagus (which was a new vegetable for me). Since I like cooking, I decided to join him in his kitchen and he gladly welcomed me. I did not have any plans in advance, but after a brief look inside his fridge I decided to prepare egg hearts and salad. During and after the dinner Dr. Herzog described and showed pictures of many historical places in Potsdam and made us excited to visit them the next day.

The next morning we had to head back to Berlin, but we sneaked in a couple of hours to see the beauty of Sanssouci Park and Sanssouci Palace in the bright sunlight. The long series of stairs leading from the front entrance of the palace to the water fountain, with stepped walls on both sides covered with grapevines, was mesmerizing.

Dr. Herzog dropped us at the train station (or Bahnhof in German), from where we took a train to Berlin. We got almost a day to explore Berlin and we did so to the extent possible. It is an amazing city, full of historical masterpieces and state-of-the-art architecture. At one point, we got stuck in a public demonstration and couldn't use any transport because of the road jam, although we had no idea what the demonstration was for.

Later in the evening Dr. Herzog came to Berlin to pick up his wife from the Komische Oper Berlin, where she was performing in an opera, and we got a chance to look inside this beautiful place. This way we got a few more hours for a guided tour of Berlin and had dinner in an Italian restaurant.

It was a fun trip to explore three beautiful cities of Germany immediately after exploring yet another beautiful and colorful city of Poznan, Poland. We couldn't have imagined anything better than this. I published seven photo spheres of various churches and palaces on Google Maps during this trip and got an album full of pictures.

On behalf of my university, department, research group, and myself I would like to extend my sincere thanks and regards to Dr. Herzog for his invitation, warm welcome, and hospitality, and for spending time showing us the best of Magdeburg, Potsdam, and Berlin during our stay in Germany. He is a fantastic host and tour guide. Now, turning back to the send-off conversation among the three.

Sawood: Yasmin, now you know why Dr. Herzog said, "who doesn't" when you said, "I love beautiful things".
Yasmin: [Smiles]
Dr. Herzog: [Smiles]

Sawood Alam

Monday, September 28, 2015

2015-09-28: TPDL 2015 in Poznan, Poland

The Old Market Square in Poznan
On September 15, 2015, Sawood Alam and I (Yasmin AlNoamany) attended the 2015 Theory and Practice of Digital Libraries (TPDL) Conference in Poznan, Poland. This year, WS-DL had four accepted papers in TPDL for three students: Mohamed Aturban (who could not attend the conference because of visa issues), Sawood Alam, and Yasmin AlNoamany. Sawood and I arrived in Poznan on Monday, Sept. 14. Although we were tired from travel, we could not resist walking to the best area in Poznan, the old market square. It was fascinating to see those beautiful colorful houses at night, with the reflection of the water on them after it rained, accompanied by beautiful European music from the many artists who were playing in the street.

The next morning we headed to the conference, which was held in the Poznań Supercomputing and Networking Center. The organization of the conference was amazing and the general conference co-chairs, Marcin Werla and Cezary Mazurek, were always there to answer our questions. Furthermore, the people at the conference reception were there for us the whole time to help us with transportation, especially with communicating with taxi drivers; we do not speak Polish and they do not speak English. On every day of the conference, there were coffee breaks where we had hot and cold drinks and snacks. It is worth mentioning that I had the best coffee I have ever tasted in Poland :-). The main part of the TPDL 2015 conference was streamed live and recorded. The recordings will be processed and made publicly available online on the PlatonTV portal.

Sawood (on the left) and Jose (on the right)
We met Jose Antonio Olvera, who interned in the WS-DL lab in summer 2014, at the entrance. At the conference, Jose had an accepted poster, "Evaluating Auction Mechanisms for the Preservation of Cost-Aware Digital Objects Under Constrained Digital Preservation Budgets", that was presented in the poster session on the evening of the first day. It was nice meeting him, since I was not there when he interned in our lab.
The first day of the main conference, September 15th, started with a keynote speech by David Giaretta, "Data – unbound by time or discipline – challenges and new skills needed". I was honored to speak with him many times during the conference and to have him in the audience of my presentations. At the beginning, Giaretta introduced himself with a summary of his background. His speech was mainly about data preservation and the challenges that this field faces, such as link rot, which Giaretta considered a big horror. He mentioned many examples of possible data loss. Giaretta talked about the big data world and presented the 7 (or 8 (or 9)) V's of big data: volume, velocity, variety, volatility, veracity, validity, value, variability, and visualization. I loved these quotes from his speech:
  • "Preservation is judged by continuing usability, then come value". 
  • "Libraries are gateways to knowledge". 
  • "Metadata is classification".
  • "emulate or migrate".
He talked about how valuable and expensive it is to preserve scientific data, then raised the issue of reputation for keeping things over time and of long-term funding. Funding is a big challenge in digital preservation, so he talked about vision and opportunities for funding. Giaretta concluded his keynote with the types of digital objects that need to be preserved, such as simple documents and images, scientific data, complex objects, and objects that change over time (such as annotations). He raised this question: "what questions can one ask when confronted with some completely unfamiliar digital objects?" Giaretta ended his speech with a piece of advice: "Step back and help the scientists to prepare data management plans, the current data management plan is very weak".

After the keynote we went to a coffee break, then the first session of the conference, "Social-technical perspectives of digital information", started. The session was led off by WS-DL's Sawood Alam presenting his work "Archive Profiling Through CDX Summarization", which is a product of an IIPC-funded project. He started with a brief introduction to the Memento aggregator and the need to profile the long tail of archives to improve the efficiency of the aggregator. He described two earlier profiling efforts: the complete knowledge profile by Sanderson and the minimalistic TLD-only profile by AlSum. He described the limitations of the two profiles and explored the middle ground between them for various other possibilities. He also talked about the newly introduced CDXJ serialization format for profiles and illustrated its usefulness in serializing profiles at scale, with the ability to merge and split arbitrary profiles easily. He evaluated his findings and concluded that his work so far gained up to 22% routing precision with less than 5% cost relative to the complete knowledge profile, without any false negatives. The code to generate profiles and benchmarks can be found in a GitHub repository.
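
To make the routing idea concrete, here is a minimal sketch of how a profile keyed on host segments might let an aggregator skip archives that cannot hold a URI. The key format, the `depth` parameter, and all function names are assumptions made for illustration; this is not the CDXJ format or the profiling policies evaluated in the paper.

```python
from urllib.parse import urlsplit


def uri_key(uri: str, depth: int) -> str:
    """Reverse the host segments and keep the first `depth` of them.

    depth=1 keeps only the TLD; larger depths keep more of the host.
    (Hypothetical key format, loosely inspired by reversed-host URI keys.)
    """
    host = urlsplit(uri).hostname or ""
    segments = host.lower().split(".")[::-1]
    return ",".join(segments[:depth])


def build_profile(held_uris, depth: int) -> set:
    """Summarize an archive's holdings as the set of keys it contains."""
    return {uri_key(uri, depth) for uri in held_uris}


def candidate_archives(uri: str, profiles: dict, depth: int) -> list:
    """Return the archives whose profile contains the key of `uri`.

    An archive that holds the URI always has its key in the profile,
    so routing this way cannot produce false negatives.
    """
    key = uri_key(uri, depth)
    return [name for name, profile in profiles.items() if key in profile]


if __name__ == "__main__":
    profiles = {
        "archive-a": build_profile(["http://example.com/page"], depth=2),
        "archive-b": build_profile(["http://www.bbc.co.uk/news"], depth=2),
    }
    print(candidate_archives("http://example.com/other", profiles, depth=2))
```

With `depth=1` this behaves like a TLD-only profile (small, but it routes many needless requests); increasing `depth` moves toward complete knowledge at the cost of a larger profile.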

Next, there was a switch between the second and the third presentations, and since Sawood was supposed to present on behalf of Mohamed Aturban, the chair of the session gave Sawood enough time to breathe between the two presentations.

The second presentation was "Query Expansion for Survey Question Retrieval in the Social Sciences" by Nadine Dulisch from GESIS and Andreas Oskar Kempf from ZBW. Andreas started with a case study for the usage of survey questions, which were developed by operational organizations, in social science. He presented the importance of social science survey data for social scientists.  Then, Nadine talked about the approaches they applied for query expansion retrieval. She showed that statistical-based expansion was better than intellectual-based expansion. They presented the results of their experiments based on Trec_eval. They evaluated thesaurus-based and co-occurrence-based expansion approaches for query expansion to improve retrieval quality in digital libraries and research data archives. They found that automatically expanded queries using extracted co-occurring terms could provide better results than queries manually reformulated by a domain expert.

Sawood presented "Quantifying Orphaned Annotations in". In this paper, Aturban et al. analyzed 6281 highlighted text annotations in annotation system. They also used the Memento Aggregator to look for archived versions of the annotated pages. They found that 60% of the highlighted text annotations are orphans (i.e. annotations are attached to neither the live web nor memento(s)) or in danger of being orphaned (i.e. annotations are attached to the live web but not to memento(s)). They found that if a memento exists, there is a 90% chance that it recovers the annotated webpage. Using public archives, only 3% of all highlighted text annotations were reattached, otherwise they would be orphaned. They found that for the majority of the annotations, no memento existed in the archives. Their findings highlighted the need for archiving pages at the time of annotation.
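
The categories described above can be written down as a small decision function. This is only a sketch of the definitions as summarized in this post: the two boolean inputs are assumed to come from some text-matching step against the live page and against mementos, and the enum and function names are hypothetical.

```python
from enum import Enum


class AnnotationStatus(Enum):
    ATTACHED = "attached"      # text found on the live page and in a memento
    IN_DANGER = "in danger"    # found on the live page, but in no memento
    REATTACHED = "reattached"  # gone from the live page, recovered from a memento
    ORPHAN = "orphan"          # found neither on the live page nor in any memento


def classify(found_on_live_web: bool, found_in_memento: bool) -> AnnotationStatus:
    """Map the two attachment checks to one of the four categories."""
    if found_on_live_web and found_in_memento:
        return AnnotationStatus.ATTACHED
    if found_on_live_web:
        return AnnotationStatus.IN_DANGER
    if found_in_memento:
        return AnnotationStatus.REATTACHED
    return AnnotationStatus.ORPHAN
```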

After the end of the general session, we took a lunch break where we gathered with Jose Antonio Olvera and many of the conference attendees to exchange our research ideas.

After the lunch break, we attended the second session of the day, "Multimedia information management and retrieval and digital curation". The session started with "Practice-oriented Evaluation of Unsupervised Labeling of Audiovisual Content in an Archive Production Environment", presented by Victor de Boer. In their work, Victor et al. evaluated the automatic labeling of audiovisual content to improve efficiency and inter-annotator agreement by generating annotation suggestions automatically from textual resources related to the documents to be archived. They performed pilot studies to evaluate term suggestion methods through precision and recall, taking terms assigned by archivists as "ground truth". They found that the quality of automatic term suggestion is sufficiently high.
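
As a reminder of how such an evaluation works, here is a minimal sketch of precision and recall for a set of suggested terms against archivist-assigned terms. The function name and the example terms are made up for illustration and are not taken from the paper.

```python
def precision_recall(suggested, ground_truth):
    """Return (precision, recall) of suggested terms against ground-truth terms."""
    suggested, ground_truth = set(suggested), set(ground_truth)
    hits = suggested & ground_truth
    precision = len(hits) / len(suggested) if suggested else 0.0
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical terms: two of the four suggestions match the archivist's labels.
    suggested = ["election", "parliament", "weather", "football"]
    archivist = ["election", "parliament", "coalition"]
    print(precision_recall(suggested, archivist))
```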

The second presentation was "Measuring Quality in Metadata Repositories" by Dimitris Gavrilis. He started his presentation by mentioning that this is a hard topic, then he explained why this research is important. He explained the specific criteria that determine the data quality: completeness, validity, consistency, timeliness, appropriateness, and accuracy constituents. In their paper, Dimitris et al. introduced a metadata quality evaluation model (MQEM) that provides a set of metadata quality criteria as well as contextual parameters concerning metadata generation and use. The MQEM allows the curators and the metadata designers to assess the quality of their metadata and to run queries on existing datasets. They evaluated their framework on two different use cases: application design and content aggregation.

After the session, we took a break and I fell ill, which prevented me from attending the discussion panel session, entitled "Open Access to Research Data: is it a solution or a problem?", and the poster session. I went back to the hotel to rest and prepare for the next day's presentation. I am embedding the tweets about the panel and the poster session.

The next day I felt fine, so we went early to have breakfast in the beautiful old market square, then headed to the conference. The opening of the second day was by Cezary Mazurek who introduced the sessions of the second day and thanked the sponsors of the conference. Then he left us with a beautiful soundtrack of music, which was related to the second keynote speaker.

The keynote speech was "Digital Audio Asset Archival and Retrieval: A Users Perspective" by Joseph Cancellaro, an active composer, musician, and the chair of the Interactive Art and Media Department of Columbia College in Chicago. Cancellaro started with a short bio. The first part of his presentation covered issues of audio assets and the constant problems for sound designers and non-linear environments: naming conventions (meta tags), search tools, storage (failure), retrieval (failure), DSP (digital signal processing), etc. He also mentioned how they handle these issues in his department; for example, for naming conventions, they add tags to the files. He explained the simple sound asset SR workflow. Preservation to Cancellaro is "not losing any more audio data". The second part of his presentation was about storage, retrieval, possible solutions, and content creation. He mentioned some facts about storage and retrieval:
  • The decrease in technology costs has reduced the local issues of storage capacity (this is always a concern in academia).
  • Bandwidth is still an issue in real-time production. 
  • Non-linear sound production is a challenge for linear minded composers and sound designers.
He mentioned that searching for sound objects is a blocking point for many productions, then continued "when I ask my students about the search options for the sound track they have, all what I hear are crickets". At the end, Dr. Cancellaro presented the agile concept as a solution for content management systems (CMS). He presented the basic digital audio theory: sound as a continuous analog event is captured at specific data points.

After the keynote, we took a coffee break, then the sessions of the second day started with "Influence and Interrelationships among Chinese Library and Information Science Journals in Taiwan" by Ya-Ning Chen. In this research, the authors investigated the citation relations between different journals based on a data set collected from 11 Chinese LIS journals in Taiwan (2,031 articles from 2001 to 2012). The authors measured the indexer and the indegree, outdegree, and self-feeding ratios between the journals. They also measured the degree and betweenness centrality from social network analysis (SNA) to investigate the information flow between Chinese LIS journals in Taiwan. They created an 11 × 11 matrix that expresses the journal-to-journal analysis. They created a sociogram of interrelationships among Chinese LIS journals in Taiwan, which summarized the citation relations between the journals they studied.

Next was a presentation entitled "Tyranny of Distance: Understanding Academic Library Browsing by Refining the Neighbour Effect" by Dana McKay and George Buchanan.
Dana and George explained the importance of browsing books as a part of information seeking, and how this is not well supported for e-books. They used different datasets to examine the patterns of co-borrowing. They examined different aspects of the neighbour effect on browsing behavior. Finally they presented their findings for improving the browsing of digital libraries.

The last presentation of this session was a study on Storify entitled "Characteristics of Social Media Stories" by Yasmin AlNoamany. Based upon analyzing 14,568 stories from Storify, AlNoamany et al. specified the structural characteristics of popular (i.e., receiving the most views) human-generated stories to build a template that will be used later in generating (semi-)automatic stories from archived collections. The study investigated many questions regarding the features of the stories, such as the length of the story, the number of elements, the decay rate of the stories, etc. At the end, the study differentiated the popular stories from the unpopular stories based on their main features. Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and the unpopular stories are different in terms of most of the features. Popular stories tend to have more web elements (medians of 28 vs. 21), a longer timespan (5 hours vs. 2 hours), longer editing time intervals, and a lower decay rate.
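
For readers unfamiliar with the Kruskal-Wallis test, the following is a minimal pure-Python sketch of its H statistic with the usual tie correction. The two samples are toy numbers invented for illustration, not the Storify data, and the comparison against 3.841 (the chi-square critical value for one degree of freedom at the 0.05 level) is only a rough approximation for samples this small.

```python
def kruskal_wallis_h(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for two or more samples."""
    if len(groups) < 2 or any(len(group) == 0 for group in groups):
        raise ValueError("need at least two non-empty samples")

    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j

    rank_sum_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)

    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        return 0.0  # every observation is identical, so there is nothing to rank
    return h / correction


if __name__ == "__main__":
    # Toy "number of web elements" samples (made up for this example).
    popular = [30, 28, 35, 41, 26]
    unpopular = [21, 18, 25, 20, 23]
    h = kruskal_wallis_h(popular, unpopular)
    print(h, "differs at the 0.05 level" if h > 3.841 else "no evidence of difference")
```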


After the presentation, we had lunch, during which some attendees continued the conversation about my research. It was a useful discussion regarding the future of my research, especially integrating data from the archived collections with the storytelling services.

The "user studies for and evaluation of digital library systems and applications" session started after the break with the presentation "On the Impact of Academic Factors on Scholar Popularity: A Cross-Area Study" by Marcos Gonçalves. Gonçalves et al. presented a cross-area study of how different key academic factors impact scholar popularity. They conducted their study based on scholars affiliated with different graduate programs in Brazil and internationally, with more than 1,000 scholars and 880,000 citations, over a 16-year period. They found that scholars in technological programs (e.g., Computer Science, Electrical Engineering, Bioinformatics) tend to be the most "popular" ones in their universities. They also found that international popularity is still much higher than that obtained by Brazilian scholars.

After the first presentation, there was a panel on "User studies and Evaluation" with George Buchanan, Dana McKay, and Giannis Tsakonas, moderated by Seamus Ross, as a replacement for two presentations whose presenters were absent. The panel started with a question from Seamus Ross: Are user studies in digital libraries soft? Each of the panelists presented their point of view on the importance of user studies. Buchanan said that user studies matter, then Dana followed up that we want to create something that all people can use. Tsakonas said he did studies that never developed into systems. Seamus Ross asked the panelists: what makes a person a good user study person? Dana answered with a joke: "choose someone like me". Dana works as User Experience Manager and Architect at the academic library of Swinburne University of Technology, so she has experience with user needs and user studies. I followed up on the discussion by noting that we do user studies to learn what people need or to evaluate a system, then I asked if Mechanical Turk (MTurk) experiments are a form of user study. At the end, Seamus Ross concluded the panel with some advice on conducting user studies, such as considering a feedback loop in the process of the user study.

After the panel, we had a coffee break. I had a great discussion about user evaluation in the context of my research with Brigitte Mathiak, who gave me much useful advice about evaluating the stories that will be created automatically from the web archives. Later on my European trip I gave a presentation at Magdeburg-Stendal University of Applied Sciences that gave the big picture of my research.

In the last session I attended, Brigitte Mathiak presented "Are there any Differences in Data Set Retrieval compared to well-known Literature Retrieval?". In the beginning, Mathiak explained the motivation for their work. Based on two user studies, a lab study with seven participants and telephone interviews with 46 participants, they investigated the requirements that users have for a data set retrieval system in the social sciences and in digital libraries. They found that choosing a data set is more important to researchers than choosing a piece of literature. Moreover, metadata quality and quantity are even more important for data sets.

In the evening, we had the conference dinner, which was held at Concordia Design along with beautiful music. At the dinner, the conference chairs announced two awards: the best paper award for Matthias Geel and Moira Norrie on "Memsy: Keeping Track of Personal Digital Resources across Devices and Services" and the best poster/demo award for Clare Llewellyn, Claire Grover, Beatrice Alex, Jon Oberlander and Richard Tobin on "Extracting a Topic Specific Dataset from a Twitter Archive".

The third day started early at 9:00 am with sessions about digital humanities, in which I presented my study "Detecting Off-Topic Pages in Web Archives". The paper investigates different methods for automatically detecting when an archived page goes off-topic. It presents six different methods that mainly depend on comparing an archived copy of a page (a memento) with the first memento of that page. The methods were tested on different archived collections from Archive-It. The best method we found was a combination of a textual method (cosine similarity using TF-IDF) and a structural method (word count). This combination gave an average precision of 0.92 for detecting off-topic pages on 11 different collections. The output of this research is a tool for detecting off-topic pages in the archive. The code can be downloaded and tested from GitHub, and more information can be found in my recent presentation at the Internet Archive.
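
To make the combination of the textual and structural methods more concrete, here is a minimal Python sketch that compares every later memento with the first memento of a page. It is only an illustration under stated assumptions: the function names, the tokenizer, the smoothed IDF, both threshold values, and the rule that either signal alone flags a memento are all hypothetical and are not taken from the paper or the released tool.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption) so shared terms keep a non-zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def off_topic_flags(mementos, cos_threshold=0.15, wc_threshold=-0.85):
    """Flag each later memento by comparing it with the first memento.

    Both thresholds are hypothetical placeholders. A memento is flagged when
    its cosine similarity is too low OR its word count shrank too much.
    """
    vectors = tfidf_vectors(mementos)
    first_count = len(tokenize(mementos[0]))
    flags = []
    for text, vector in zip(mementos[1:], vectors[1:]):
        similarity = cosine(vectors[0], vector)
        change = (len(tokenize(text)) - first_count) / first_count if first_count else 0.0
        flags.append(similarity < cos_threshold or change < wc_threshold)
    return flags
```

Calling off_topic_flags with a list of memento texts in chronological order returns one boolean per memento after the first. In practice the thresholds would have to be tuned on labeled collections.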


The next paper presented in the digital humanities session was "Supporting Exploration of Historical Perspectives across Collections" by Daan Odijk. In their work, Odijk et al. introduced tools for selecting, linking, and visualizing Second World War (WWII) material from the collections of the NIOD, the National Library of the Netherlands, and Wikipedia. They also link digital collections via implicit events, i.e., if two articles are close in time and similar in content, they are considered to be related. Furthermore, they provided an exploratory interface for exploring the connected collections. They used Manhattan distance for textual similarity over document terms in a TF.IDF weighted vector space and measured temporal similarity using a Gaussian decay function. They found that textual similarity performed better than temporal similarity, and that combining textual and temporal similarity improved the nDCG score.

The third paper, entitled "Impact Analysis of OCR Quality Tasks in Digital Archives", was presented by Myriam C. Traub. Traub et al. performed user studies on digital archives to classify research tasks and to describe the potential impact of OCR quality on these tasks by interviewing scholars from the digital humanities. They analyzed the questions and categorized the research tasks. Myriam said that few scholars could quantify the impact of OCR errors on their own research tasks. They found that OCR is unlikely to be perfect. They could not find solutions, but they could suggest strategies that lead toward solutions. At the end, Myriam suggested that the tools should be open-source and that there should be evaluation metrics.

At the end, I attended the last keynote speech by Costis Dallas, "The post-repository era: scholarly practice, information and systems in the digital continuum", which was about digital humanists' practices in the age of curation. Then the conference ended with the closing session, in which they announced that TPDL 2016 will be held in Hannover, Germany.

After the conference, Sawood and I took the train from Poznan to Potsdam, Germany to meet Dr. Michael A. Herzog, the Vice Dean for Research and Technology Transfer, Department of Economics and head of Research Group SPiRIT. We were invited to talk about our research in a Digital Preservation lecture at Magdeburg-Stendal University of Applied Sciences in Magdeburg. Sawood wrote a nice blog post about our talks.


Monday, September 21, 2015

2015-09-21: InfoVis Spring 2015 Class Projects

In Spring 2015, I taught Information Visualization (CS 725/825) for MS and PhD students.  This time we used Tamara Munzner's Visualization Analysis & Design textbook, which I highly recommend:
"This highly readable and well-organized book not only covers the fundamentals of visualization design, but also provides a solid framework for analyzing visualizations and visualization problems with concrete examples from the academic community. I am looking forward to teaching from this book and sharing it with my research group."
—Michele C. Weigle, Old Dominion University
I also tried a flipped-classroom model, where students read and answer homework questions before class so that class time can focus on discussion, student presentations, and in-class exercises. It worked really well -- students liked the format, and I didn't have to convert a well-written textbook into PowerPoint slides.

Here I highlight a couple of student projects from that course.  (All class projects are listed in my InfoVis Gallery.)

Chesapeake Bay Currents Dataset Exploration
Created by Teresa Updyke

Teresa is a research scientist at ODU's Center for Coastal Physical Oceanography (CCPO). This visualization (currently available online) gives a view of the metadata related to the high-frequency radar data that the CCPO collects. For all stations, users can investigate the number of data files available, station count, vector count, and average speed of the currents. The map allows users to select one of the three stations and further investigate the radial count and type collected on each day. This visualization aids researchers in quickly determining the quality of data collected at specific times and in identifying interesting areas for further investigation.

The thing I really liked about this project was that it solved a real problem and will help Teresa to do her job better. I asked Teresa how researchers previously determined what data was available.  Her reply: "They called me, and I looked it up in the log files."

In and Out Flow of DoD Contracting Dollars
Created by Kayla Henneman and Swaraj Wankhade

This project (currently available online) is a visualization of the flow of Department of Defense (DoD) contracting dollars to and from the Hampton Roads area of Virginia. The system is for those who wish to analyze how the in- and out-flow of DoD contracting dollars affects the Hampton Roads economy. The visualization consists of an interactive bubble map which shows the flow of DoD contracting dollars to and from Hampton Roads based on counties, along with line charts which show the total amount of inflow and outflow dollars.  Hovering over a county on the map shows the inflow and outflow amounts for that county over time.

Federal Contracting in Hampton Roads
Created by Valentina Neblitt-Jones and Shawn Jones

This project (currently available online) is a visualization of US federal government contracting awards in the Hampton Roads region of Virginia. The visualization consists of a choropleth map displaying different colors based on the funding each locality receives. To the right of the map is a bar chart indicating how much funding each industry received. On top of the map and the bar chart is a sparkline showing the trend in funding. The visualization allows the user to select a year, agency, or locality within the Hampton Roads area and updates the choropleth, bar chart, and sparkline as appropriate.


Thursday, September 10, 2015

2015-09-10: CDXJ: An Object Resource Stream Serialization Format

I have been working on an IIPC-funded project of profiling various web archives to summarize their holdings. The idea is to generate statistical measures of the holdings of an archive under various lookup keys, where a key can be a partial URI such as Top Level Domain (TLD), registered domain name, entire domain name along with any number of sub-domain segments, domain name and a few segments from the path, a given time, a language, or a combination of two or more of these. Such a document (or archive profile) can be used to answer queries like "how many *.edu Mementos are there in a given archive?", "how many copies of the pages are there in an archive that fall under*", or "number of copies of ** pages of 2010 in Arabic language". The archive profile can also be used to determine the overlap between two archives or to visualize their holdings in various ways. Early work from this research was presented at the Internet Archive during the IIPC General Assembly 2015 and later it was published at:
Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, and David S. H. Rosenthal, Web Archive Profiling Through CDX Summarization, Proceedings of TPDL 2015.
One of many challenges to solve in this project was to come up with a suitable profile serialization format that has the following properties:
  • scalable to handle terabytes of data
  • facilitates fast lookup of keys
  • allows partial key lookup
  • allows arbitrary split and merge
  • allows storage of arbitrary number of attributes without changing the existing data
  • supports linked data semantics (not essential, but good to have)
We were initially leaning towards the JSON format (with JSON-LD for linked data) because it has wide language and tool support and it is as expressive as XML, but less verbose. However, at a very early stage of our experiments we realized that it has scale issues. JSON, XML, and YAML (a more human-readable format) are all single root node document formats, which means a single document serialized in any of these formats cannot have multiple starting nodes; they all must be children of a single root node. This means the document has to be fully loaded in memory, which can be the bottleneck in the case of big documents. Although there are streaming algorithms to parse XML or JSON, they are slow and usually only suitable for cases where an action is to be taken while parsing the document, as opposed to frequent lookup of keys and values. Additionally, JSON and XML are not very fault tolerant, i.e., a single malformed character may make the entire document fail to parse. Also, because of the single root node, splitting and merging documents is not easy.

We also thought about using simple and flat file formats such as CSV, ARFF, or CDX (a file format used in indexing WARC files for replay). These flat formats allow sorting that can facilitate fast binary lookup of keys and the files can be split in arbitrary places or multiple files with the same fields (in the same order) can be merged easily. However, the issue with these formats is that they do not support nesting and every entry in them must have the same attributes. Additionally the CDX has limited scope of extension as all the fields are already described and reserved.
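
As a small illustration of why sorted flat files are easy to split and merge, the following Python sketch splits a sorted list of lines at an arbitrary position and merges sorted parts back into one sorted stream. The function names and the sample rows are hypothetical, and in-memory lists stand in for files on disk.

```python
import heapq


def split_lines(sorted_lines, position):
    """Split a sorted list of lines at an arbitrary position; both parts stay sorted."""
    return sorted_lines[:position], sorted_lines[position:]


def merge_sorted(*parts):
    """Lazily merge any number of already-sorted line sequences into one sorted stream."""
    return heapq.merge(*parts)


# Hypothetical rows with the same fields in the same order.
rows = ["com,cnn)/ 2013", "org,example)/ 2011", "uk,co,bbc)/ 2012"]
first, second = split_lines(rows, 1)
merged = list(merge_sorted(second, first))
```

Because heapq.merge is lazy, the same approach works on file objects that yield lines, without loading whole files into memory.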

Finally, we decided to merge the good qualities of CDX and JSON to come up with a serialization format that fulfills our requirements listed above. We call it CDXJ (or CDX-JSON). Ilya Kreymer first introduced this format in PyWB, but there was no formal description of the format. We are trying to formally introduce it and make some changes that make it extensible so that it can be utilized by the web archiving community as well as broader web communities. The general description of the format is, "a plain file format that stores key-value pairs per line in which the keys are strings that are followed by their corresponding value objects where the values are any valid JSON with the exception that the JSON block does not contain any new line characters in it (encoded newline "\n" is allowed)." Here is an example:
@context [""]
@id {"uri": ""}
@keys ["surt_uri", "year"]
@meta {"name": "Internet Archive", "year": 1996}
@meta {"updated_at": "2015-09-03T13:27:52Z"}
com,cnn)/world - {"urim": {"min": 2, "max": 9, "total": 98}, "urir": 46}
uk,ac,rpms)/ - {"frequency": 241, "spread": 3}
uk,co,bbc)/images 2013 {"frequency": 725, "spread": 1}

Lines starting with an @ sign signify special purpose keys, and the prefix makes these lines appear together at the top of the file when sorted. The first line of the above example, with the key @context, provides context to the keywords used in the rest of the document. The value of this entry can be an array of contexts or an object with named keys. In the case of an array, all the term definitions from all the contexts will be merged in the global namespace (resolving name conflicts will be the responsibility of the document creator), while in the case of a named object it will serve like an XML namespace.

The second entry, @id, holds an object that identifies the document itself and establishes relationships with other documents, such as a parent or sibling when it is split. The @keys entry specifies the names of the key fields in the data section as an array of field names in the order they appear (the primary key name appears first, then the secondary key, and so on). To add more information about the keys, each element of the @keys array can be an object. All the lines except the special keys (@-prefixed) must have exactly the number of fields described in the @keys entry. Missing fields in the keys must use the special placeholder character "-".

The @meta entries describe the aboutness of the resource and other metadata. Multiple entries of the same special key (those that start with an @ sign) should be merged at the time of consuming the document. Splitting them across multiple lines increases readability and eases updates. This means the two @meta lines can be combined into a single line or split into three different lines holding "name", "year", and "updated_at" separately. The policy for resolving name conflicts when merging such entries should be defined on a per-key basis as suitable. These policies could be "skip", "overwrite", "append" (especially for values that are arrays), or some other function to derive new value(s).
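
The merging behavior described above can be sketched in a few lines of Python. This is only an illustration: the function name merge_special and the exact semantics given to the "skip", "overwrite", and "append" policies are assumptions, since the format description leaves the policies to be defined per key.

```python
import json


def merge_special(lines, key="@meta", policy="overwrite"):
    """Merge the JSON objects of all lines that share one @-prefixed key.

    The policy names follow the text; their exact behavior here is assumed.
    """
    merged = {}
    for line in lines:
        name, _, block = line.partition(" ")
        if name != key:
            continue
        for field, value in json.loads(block).items():
            if field not in merged:
                merged[field] = value
            elif policy == "skip":
                continue  # keep the value seen first
            elif policy == "overwrite":
                merged[field] = value  # keep the value seen last
            elif policy == "append":
                old = merged[field] if isinstance(merged[field], list) else [merged[field]]
                new = value if isinstance(value, list) else [value]
                merged[field] = old + new
            else:
                raise ValueError("unknown policy: " + policy)
    return merged


# Hypothetical input: the third line deliberately conflicts on "year".
sample = [
    '@meta {"name": "Internet Archive", "year": 1996}',
    '@meta {"updated_at": "2015-09-03T13:27:52Z"}',
    '@meta {"year": 2015}',
]
```

With this sample, the "overwrite" policy keeps the later year, "skip" keeps the earlier one, and "append" collects both into an array.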

The last three lines are the data entries. The first one starts with the key com,cnn)/world (which is the SURT form of a URI) followed by a nested data structure (in JSON format) that holds some statistical distribution of the archive holdings under that key. The next line holds a different style of statistics (to illustrate the flexibility of the format) for a different key. The last line illustrates a secondary key: the primary key is the SURT form of a URI, followed by a secondary key that further divides the distribution by year.

Now, let's reproduce the above example in JSON-LD, YAML, and XML respectively for comparison:
{
  "@context": "",
  "@id": "",
  "meta": {
    "name": "Internet Archive",
    "year": 1996,
    "updated_at": "2015-09-03T13:27:52Z"
  },
  "surt_uri": {
    "com,cnn)/world": {
      "urim": {
        "min": 2,
        "max": 9,
        "total": 98
      },
      "urir": 46
    },
    "uk,ac,rpms)/": {
      "frequency": 241,
      "spread": 3
    },
    "uk,co,bbc)/images": {
      "year": {
        "2013": {
          "frequency": 725,
          "spread": 1
        }
      }
    }
  }
}

---
  "@context": ""
  "@id": ""
  meta:
    name: "Internet Archive"
    year: 1996
    updated_at: "2015-09-03T13:27:52Z"
  surt_uri:
    "com,cnn)/world":
      urim:
        min: 2
        max: 9
        total: 98
      urir: 46
    "uk,ac,rpms)/":
      frequency: 241
      spread: 3
    "uk,co,bbc)/images":
      year:
        "2013":
          frequency: 725
          spread: 1

<?xml version="1.0" encoding="UTF-8"?>
<profile xmlns="">
    <name>Internet Archive</name>
    <record surt-uri="com,cnn)/world">
    <record surt-uri="uk,ac,rpms)/">
    <record surt-uri="uk,co,bbc)/images" year="2013">

The WAT format, commonly used in the web archiving community, also uses separate JSON fragments as the value of each entry to deal with the single root document issue, but it does not restrict the use of the new-line character. As a consequence, the file cannot be sorted line-wise, which affects lookup speed. In contrast, CDXJ files can be sorted (like CDX files), which allows binary search in the files on disk and proves very efficient in lookup-heavy applications.
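
To illustrate how sorting enables fast and partial key lookup, here is a minimal Python sketch that uses binary search to find all entries whose key starts with a given prefix. The function name lookup_prefix is hypothetical, and a sorted in-memory list stands in for a sorted file on disk, where an implementation would seek to byte offsets instead of indexing a list.

```python
import bisect


def lookup_prefix(sorted_lines, prefix):
    """Return all lines starting with `prefix` from a sorted list of lines."""
    # Binary search for the first line that is not smaller than the prefix.
    start = bisect.bisect_left(sorted_lines, prefix)
    matches = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break  # sorted input: no later line can match
        matches.append(line)
    return matches


# The data entries from the CDXJ example above, already in sorted order.
entries = [
    'com,cnn)/world - {"urim": {"min": 2, "max": 9, "total": 98}, "urir": 46}',
    'uk,ac,rpms)/ - {"frequency": 241, "spread": 3}',
    'uk,co,bbc)/images 2013 {"frequency": 725, "spread": 1}',
]
uk_entries = lookup_prefix(entries, "uk,")
```

Here uk_entries holds the two lines under the uk TLD, found without scanning the lines that sort before them.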

We presented our early thoughts on the CDXJ serialization format at Stanford University during IIPC GA 2015 to seek feedback. The slides of the talk are available at:

Going forward, we are proposing to split the syntax and semantics of the format into separate specifications. The overall syntax of the file is defined as a base format, while further restrictions and semantics, such as adding meaning to the keys, making certain entries mandatory, giving meaning to the terms, enforcing a specific sort order, and defining the scope of usage for the document, are described in a separate derived specification. This practice is analogous to XML, which defines the basic syntax of the format, while other XML-based formats such as XHTML or Atom add semantics to it.

A generic format for this purpose can be defined as Object Resource Stream (ORS) that registers ".ors" file extension and "application/ors" media type. Then CDXJ extends from that to add semantics to it (as described above) which registers ".cdxj" file extension and "application/cdxj+ors" media type.

Object Resource Stream (ORS)

The above railroad diagram illustrates the grammar of the ORS format. Every entry in this format occupies one line. Empty lines are allowed and should be skipped when consuming the file. Apart from the empty lines, every line starts with a string key, followed by a single-line JSON block as the value. The keys may have optional leading or trailing whitespace (SPACE or TAB characters) for indentation, readability, or alignment purposes, which should be skipped when consuming the file. Keys can be empty strings, which means values (JSON blocks) can be present without being associated with any keys. Quoting keys is not mandatory, but if necessary one can use double quotes for the purpose. Quoted string keys preserve any leading or trailing spaces inside the quotes. Neither keys nor values are allowed to break the line (new lines should be encoded if any), as a line break starts a new entry. As mentioned before, it is allowed to have a value block without a corresponding key, but not the opposite. Because the opening square and curly brackets indicate the start of the JSON block, they must be escaped (as must the escape and double quote characters) if they appear in the keys; escaping their closing pairs is optional. An ORS parser should skip malformed lines and continue with the remaining document. Optionally, the malformed entries can be reported as warnings.
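
The rules above can be approximated by a small, tolerant parser. The following Python sketch is not a reference implementation: the function names are hypothetical, backslash is assumed to be the escape character, brackets inside a double-quoted key are assumed not to start the JSON block, and escape sequences in keys are left as written rather than decoded.

```python
import json


def split_ors_line(line):
    """Split one ORS line into (key, value); raise ValueError if malformed."""
    in_quotes = False
    escaped = False
    for index, char in enumerate(line):
        if escaped:
            escaped = False  # this character was escaped, so skip it
        elif char == "\\":
            escaped = True
        elif char == '"':
            in_quotes = not in_quotes
        elif char in "[{" and not in_quotes:
            key = line[:index].strip()  # drop indentation and alignment spaces
            if len(key) >= 2 and key[0] == '"' and key[-1] == '"':
                key = key[1:-1]  # a quoted key keeps its inner spaces
            return key, json.loads(line[index:])
    raise ValueError("no JSON block found")


def parse_ors(lines):
    """Parse lines into (key, value) entries, collecting warnings for bad lines."""
    entries, warnings = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # empty lines are allowed and skipped
        try:
            entries.append(split_ors_line(line.rstrip("\n")))
        except ValueError as error:  # also covers json.JSONDecodeError
            warnings.append((number, str(error)))
    return entries, warnings
```

A malformed line only produces a warning with its line number, so one bad entry does not prevent the rest of the stream from being consumed.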


The above railroad diagram illustrates the grammar of the CDXJ format. CDXJ is a subset of ORS, as it introduces a few extra restrictions in the syntax that are not present in the ORS grammar. In the CDXJ format the definition of the key string is strict: it does not allow leading spaces before the key or an empty string as the key. If there are spaces in the CDXJ key string, it is considered a compound key where every space-separated segment has an independent meaning. Apart from the @-prefixed special keys, every key must have the same number of space-separated fields, and empty fields use the placeholder "-". CDXJ only allows a single SPACE character to be used as the delimiter between the parts of the compound key. It also requires a SPACE character to separate the key from the JSON value block. As opposed to ORS, CDXJ does not allow the TAB character as a delimiter. Since keys cannot be empty strings in CDXJ, there must be a non-empty key associated with every value. Additionally, the CDXJ format prohibits empty lines. These restrictions are introduced in CDXJ to encourage its use as sorted files to facilitate binary search on disk. When sorting CDXJ files, byte-wise sorting is encouraged for greater interoperability (this can be achieved on Unix-like operating systems by setting the environment variable LC_ALL=C). On the semantics side, CDXJ introduces optional @-prefixed special keys to specify metadata, the @keys key to specify the field names of the data entries, and the @id and @context keys to provide linked-data semantics inspired by JSON-LD.
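
The stricter CDXJ rules can be checked with a short validation routine. The Python sketch below is an illustration under assumptions: the function names are hypothetical, keys are assumed to contain no bracket characters (escaped or otherwise), and sorting by UTF-8 bytes is used to mirror what LC_ALL=C sort would produce.

```python
import json


def parse_cdxj_line(line, field_count):
    """Parse one CDXJ line into (fields, value); raise ValueError if invalid."""
    if not line or line[0] in " \t":
        raise ValueError("empty lines and leading whitespace are not allowed")
    # Assumption: the first bracket on the line starts the JSON block.
    positions = [i for i in (line.find("{"), line.find("[")) if i != -1]
    start = min(positions, default=-1)
    if start < 1 or line[start - 1] != " ":
        raise ValueError("a non-empty key and a SPACE must precede the JSON block")
    fields = line[:start - 1].split(" ")
    if "" in fields or any("\t" in field for field in fields):
        raise ValueError("fields must be separated by exactly one SPACE")
    if not line.startswith("@") and len(fields) != field_count:
        raise ValueError("expected %d key fields" % field_count)
    return fields, json.loads(line[start:])


def sort_bytewise(lines):
    """Sort lines by their UTF-8 bytes, independent of any locale."""
    return sorted(lines, key=lambda line: line.encode("utf-8"))
```

For example, parse_cdxj_line('uk,ac,rpms)/ - {"frequency": 241, "spread": 3}', 2) yields the two key fields and the decoded JSON object, while a data line with only one key field is rejected when two are declared in @keys.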


There are many applications where a stream of JSON blocks is being used or can be used. Some of these applications even enforce the single-line JSON restriction and optionally prefix the JSON block with associated keys. However, the format is not formally standardized and it is often simply called JSON for the sake of general understanding. The following are some example applications of the ORS or CDXJ format:
  • Archive Profiler generates profiles of the web archives in CDXJ format. An upcoming service will consume profiles in the CDXJ format to produce a probabilistic rank ordered list of web archives with the holdings of a given URI.
  • PyWB accepts (and encourages the usage of) CDXJ format for the archive collection indexes and the built-in collection indexer allows generating CDXJ indexes.
  • MemGator is a Memento aggregator that I built. It can be used as a command line tool or run as a web service. The tool generates TimeMaps in CDXJ format along with Link and JSON formats. The CDXJ format response is sorted by datetime as the key and it makes it very easy and efficient to consume the data chronologically or using text processing tools to perform filtering based on partial datetime.
  • A corpus of 200 million Reddit links (posts) that I collected and archived recently (it will be made publicly available soon) is in CDXJ format (where the key is the link id), while a corpus of 1.65 billion Reddit comments is available in a format that conforms to the ORS format (although it is advertised as a series of JSON blocks delimited by new lines).
  • Docker, a very popular container technology, and Fluentd, a very popular log aggregation and unification service, use a data format that conforms to the above-described specification of ORS. Docker calls its logging driver JSON, but it actually generates a stream of single-line JSON blocks that can also have a timestamp prefix with nanosecond precision as the key for each JSON block. The Fluentd log is similar, but it can have more key fields as a prefix to each line of JSON block.
  • NoSQL databases including key-value store, tuple store, data structure server, object database, and wide column store implementations such as Redis and CouchDB can use ORS/CDXJ format to import and export their data from and to the disk.
  • Services that provide data streams and support the JSON format, such as Twitter and Reddit, can leverage ORS (or CDXJ) to avoid an unnecessary wrapper that encapsulates the data under a root node. This will allow immediate consumption of the stream of data as it arrives at the client, without waiting for the end of the stream.
In conclusion, we proposed a generic Object Resource Stream (ORS) data serialization format that is composed of single-line JSON values with optional preceding string keys per line. For this format we proposed the file extension ".ors" and the media type "application/ors". Additionally, we proposed CDXJ as a derivative format of ORS with additional semantics and restrictions. For the CDXJ format we proposed the file extension ".cdxj" and the media type "application/cdxj+ors". The two formats, ORS and CDXJ, can be very helpful in dealing with endless streams of structured data such as server logs, Twitter feeds, and key-value stores. These formats allow arbitrary information in each entry (like schema-less NoSQL databases) as opposed to fixed-field formats such as spreadsheets or relational databases. Additionally, these formats are friendly to text processing tools (such as sort, grep, and awk), which makes them very useful and efficient for file-based data stores. We have also recognized that the proposed formats are already in use on the Web and have proved their usefulness. However, they are neither formally defined nor given a separate media type.

Sawood Alam