Monday, July 27, 2015

2015-07-27: Upcoming Colloquium, Visit from Herbert Van de Sompel

On Wednesday, August 5, 2015 Herbert Van de Sompel (Los Alamos National Laboratory) will give a colloquium in the ODU Computer Science Department entitled "A Perspective on Archiving the Scholarly Web".  It will be held in the third floor E&CS conference room (r. 3316) at 11am.  Space is somewhat limited (the first floor auditorium is being renovated), but all are welcome to attend.  The abstract for his talk is:

 A Perspective on Archiving the Scholarly Web
As the scholarly communication system evolves to become natively web-based and starts supporting the communication of a wide variety of objects, the manner in which its essential functions -- registration, certification, awareness, archiving -- are fulfilled co-evolves.  Illustrations of the changing implementation of these functions will be used to arrive at a high-level characterization of a future scholarly communication system and of the objects that will be communicated. The focus will then shift to the fulfillment of the archival function for web-native scholarship. Observations regarding the status quo, which largely consists of back-office processes that have their origin in paper-based communication, suggest the need for a change. The outlines of a different archival approach inspired by existing web archiving practices will be explored.
This presentation is an evolution of ideas that followed from his time as a visiting scholar at DANS, in conjunction with Dr. Andrew Treloar (ANDS) (2014-01 & 2014-12).

Dr. Van de Sompel is an internationally recognized pioneer in the field of digital libraries and web preservation; his contributions include many of the architectural solutions that define the community: OpenURL, SFX, OAI-PMH, OAI-ORE, info URI, bX, djatoka, MESUR, aDORe, Memento, Open Annotation, SharedCanvas, ResourceSync, and Hiberlink.

Also during his time at ODU, he will review the research projects of PhD students in the Web Science and Digital Libraries group and explore new areas for collaboration with us.  This will be Dr. Van de Sompel's first trip to ODU since 2011, when he and Dr. Sanderson served as the external committee members for Martin Klein's PhD dissertation defense.

--Michael

Friday, July 24, 2015

2015-07-24: ICSU World Data System Webinar #6: Web-Centric Solutions for Web-Based Scholarship

Earlier this week Herbert Van de Sompel gave a webinar for the ICSU World Data System entitled "Web-Centric Solutions for Web-Based Scholarship".  It's a short and simple review of some of the interoperability projects we've worked on since 1999, including OAI-PMH, OAI-ORE, and Memento.  He ends with a short nod to his simple but powerful "Signposting the Scholarly Web" proposal, but the slides in the appendix give the full description. 



The main point of this presentation was to document how each project successively further embraced the web, not just as a transport protocol but fully adopting the semantics as part of the protocol.  Herbert and I then had a fun email discussion about how the web, scholarly communication, and digital libraries were different in 1999 (the time of OAI-PMH & our initial collaboration) and now.  Some highlights include:
  • Although Google existed, it was not the hegemonic force that it is today, and the search engines of the day (e.g., AltaVista, Lycos) weren't that great (in terms of both precision and recall).  
  • The Deep Web was still a thing -- search engines did not reliably find obscure resources like scholarly resources (cf. our 2006 IEEE IC study "Search Engine Coverage of the OAI-PMH Corpus" and Kat Hagedorn's 2008 follow up "Google Still Not Indexing Hidden Web URLs").
  • Related to the above, the focus in digital libraries was on repositories, not the web itself.  Everyone was sitting on an SQL database of "stuff" and HTTP was seen just as a transport in which to export the database contents.  This meant that the gateway script (ca. 1999, it was probably in Perl DBI) between the web and the database was the primary thing, not the database records or the resultant web pages (i.e., the web "resource").  
  • Focus on database scripts resulted in lots of people (not just us in OAI-PMH) tunneling ad-hoc/homemade protocols over HTTP.  In fairness, Roy Fielding's thesis defining REST only came out in 2000, and the W3C Web Architecture document was drafted in 2002 and finalized in 2004.  Yes, I suppose we should have sensed the essence of these documents in the early HTTP RFCs (2616, 2068, 1945) but... we didn't. 
  • The very existence of technologies such as SOAP (ca. 1998) nicely illustrates the prevailing mindset of HTTP as a replaceable transport. 
  • Technologies similar to OAI-PMH, such as RSS, were in flux and limited to 10 items (reflecting their news syndication origins), which made them unsuitable for digital library applications.  
  • Full-text was relatively rare, so the focus was on metadata (see table 3 in the original UPS paper; every digital library description at the time distinguished between "records" and "records with full-text links").  Even if full-text was available, downloading and indexing it was an expensive operation for everyone involved -- bandwidth was limited and storage was expensive in 1999!  Sites like xxx.lanl.gov even threatened retaliation if you downloaded their full-text (today's text on that page is less antagonistic, but I recall the phrase "we fight back!").  Credit to CiteSeer for being the first digital library to make use of full-text (DL 1998).
Eventually Google Scholar announced they were deprecating OAI-PMH support, but the truth is they never really supported it in the first place.  It was just simpler to crawl the web, and the early focus on keeping robots out of the digital library had given way to making sure that they got into the digital library (e.g., Sitemaps). 

The OAI-ORE and then Memento projects were more web-centric, as Herbert nicely explains in the slides, with OAI-ORE having a Semantic Web spin and Memento being more grounded in the IETF community.   As Herbert says at the beginning of the video, our perspective in 1999 was understandable given the practices at the time, but he goes on to say that he frequently reviews proposals about data management, scholarly communication, data preservation, etc. that continue to treat the web as a transport protocol over which the "real" protocol is deployed.  I would add that despite the proliferation of web APIs that claim to be RESTful, we're seeing a general retreat from REST/HATEOAS principles by the larger web community and not just the academic and scientific community. 

In summary, our advice would be to fully embrace HTTP, since it is our community's Fortran and it's not going anywhere anytime soon.

--Michael

Thursday, July 23, 2015

2015-07-22: I Can Haz Memento

Inspired by the "#icanhazpdf" movement and built upon the Memento service, I Can Haz Memento attempts to expand awareness of web archiving through Twitter. Given a URL (for a page) in a tweet with the hashtag "#icanhazmemento," the I Can Haz Memento service replies to the tweet with a link pointing to an archived version of the page closest to the time of the tweet. The consequence of this is that the archived version closest to the time of the tweet likely expresses the intent of the user at the time the link was shared.
Consider a scenario where Jane shares a link in a tweet to the front page of CNN about a story on healthcare. Given the fluid nature of the news cycle, at some point the healthcare story will be replaced by another fresh story; thus the link in Jane's tweet no longer represents her original intent (the healthcare story). This is where I Can Haz Memento comes into the picture. If Jane included "#icanhazmemento" in her tweet, the service would have replied to Jane's tweet with a link representing:
  • An archived version (closest to her tweet time) of the front page healthcare story on CNN, if the page had already been archived within a given temporal threshold (e.g., 24 hours), or
  • A newly archived version of the same page. In other words, the service does the archiving and returns the link to the newly archived page, if the page was not already archived.
How to use I Can Haz Memento
Method 1: In order to use the service, include the hashtag "#icanhazmemento" in the tweet with the link to the page for which you intend to retrieve (or create) an archived version. For example, consider Shawn Jones' tweet below for http://www.cs.odu.edu:
Which prompted the following reply from the service:
Method 2: In Method 1, the hashtag "#icanhazmemento" and the URL, http://www.cs.odu.edu, reside in the same tweet, but Method 2 does not impose this restriction. If someone (@acnwala) tweeted a link (e.g., arsenal.com) without "#icanhazmemento", and you (@wsdlodu) wished the request to be treated in the same manner as Method 1 (as though "#icanhazmemento" and arsenal.com were in the same tweet), all that is required is a reply to the original tweet with a tweet that includes "#icanhazmemento." Consider an example of Method 2 usage:
  1. @acnwala tweets arsenal.com without "#icanhazmemento"
  2. @wsdlodu replies to @acnwala's tweet with "#icanhazmemento"
  3. @icanhazmemento replies to @wsdlodu with the archived versions of arsenal.com
The scenario (1, 2 and 3) is outlined by the following tweet threads:
 I Can Haz Memento - Implementation

I Can Haz Memento is implemented in Python and leverages the Tweepy library for the Twitter API. The implementation is captured by the following subroutines:
  1. Retrieve links from tweets with "#icanhazmemento": This is achieved using Tweepy's api.search method. The sinceIDValue is used to keep track of already visited tweets. Also, the application sleeps in between requests in order to comply with Twitter's API rate limits, but not before retrieving the URLs from each tweet (see the sketch after this list).
  2. After the URLs in 1. have been retrieved, the following subroutine
    • Makes an HTTP request to the TimeGate in order to get the memento (archived instance of the resource) closest to the time of the tweet (the tweet time is passed as a parameter for datetime content negotiation):
    • If the page is not found in any archive, it is pushed to archive.org and archive.is for archiving:
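To make the two subroutines concrete, here is a minimal sketch of how such a bot could be wired together. This is an illustrative approximation, not the actual source linked below: it assumes Tweepy and the requests library, the public Memento aggregator TimeGate at timetravel.mementoweb.org, and placeholder credentials and archive "save" endpoints, so names and thresholds differ from the real bot.

```python
import time

import requests
import tweepy

# Assumed endpoint for illustration; the production bot's configuration may differ.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"   # Memento aggregator TimeGate

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)


def closest_memento(url, when):
    """Subroutine 2a: datetime content negotiation against a TimeGate (RFC 7089)."""
    headers = {"Accept-Datetime": when.strftime("%a, %d %b %Y %H:%M:%S GMT")}
    resp = requests.get(TIMEGATE + url, headers=headers, allow_redirects=True)
    if resp.ok and "Memento-Datetime" in resp.headers:
        return resp.url                     # URI of the memento closest to the tweet time
    return None                             # not archived (or TimeGate unreachable)


def push_to_archives(url):
    """Subroutine 2b: no memento found, so ask the archives to capture the page now."""
    resp = requests.get("https://web.archive.org/save/" + url)       # Internet Archive Save Page Now
    requests.post("https://archive.is/submit/", data={"url": url})   # archive.is submission (endpoint assumed)
    return resp.url                         # assumed to redirect to the freshly created memento


def poll_hashtag(since_id=None):
    """Subroutine 1: retrieve URLs from tweets tagged #icanhazmemento and reply to them."""
    for tweet in tweepy.Cursor(api.search, q="#icanhazmemento", since_id=since_id).items(100):
        for entity in tweet.entities.get("urls", []):
            url = entity["expanded_url"]
            memento = closest_memento(url, tweet.created_at) or push_to_archives(url)
            api.update_status(
                status="@%s %s" % (tweet.user.screen_name, memento),
                in_reply_to_status_id=tweet.id)
        since_id = max(since_id or 0, tweet.id)   # sinceIDValue: remember the last handled tweet
    time.sleep(60)                                # stay within Twitter's API rate limits
    return since_id
```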
The source code for the application is available on GitHub. We acknowledge the effort of Mat Kelly, who wrote the first draft of the application. And we hope you use #icanhazmemento.
--Nwala

Tuesday, July 7, 2015

2015-07-07: WADL 2015 Trip Report


It was the last day of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2015 when the Workshop on Web Archiving and Digital Libraries (WADL) 2015 was scheduled, and it started on time. When I entered the workshop room, I realized we needed a couple more chairs to accommodate all the participants, which was a good problem to have. The session started with a brief informal introduction of the individual participants. Without wasting any time, the lightning talks session began.

Gerhard Gossen started the lightning talk session with his presentation on "The iCrawl System for Focused and Integrated Web Archive Crawling". It was a short description of how iCrawl can be used to create archives for current events, targeted primarily at researchers and journalists. The demonstration illustrated how to search the Web and Twitter for trending topics to find good seed URLs, manually add seed URLs and keywords, extract entities, configure basic crawl policies, and finally start or schedule the crawl.

Ian Milligan presented his short talk on "Finding Community in the Ruins of GeoCities: Distantly Reading a Web Archive". He introduced GeoCities and explained why it matters. He illustrated the preliminary exploration of the data, such as extracting images, text, and topics from it. He announced plans for a Web Analytics Hackathon in Canada in 2016 based on Warcbase and is looking for collaborators. He expressed the need for documentation for researchers. Acknowledging the need for context, he said, "In an archive you can find anything you want to prove, need to contextualize to validate the results."



Zhiwu Xie presented a short talk on "Archiving the Relaxed Consistency Web". It focused on the inconsistency problem seen mainly in crawler-based archives. He described the illusion of consistency in distributed social media systems and the role of timezone differences. They found that newer content is more inconsistent; in a simulation, more than 60% of timelines were found to be inconsistent. They propose proactive redundant crawls and compensatory estimation of archival credibility as potential solutions to the issue.



Martin Lhotak and Tomas Foltyn presented their talk on "The Czech Digital Library - Fedora Commons based solution for aggregation, reuse, dissemination and archiving of digital documents". They introduced the three main digitization areas in the Czech Republic - Manuscriptorium (early printed books and manuscripts), Kramerius (modern collections from 1801), and WebArchiv (digital archive of Czech web resources). Their goal is to aggregate all digital library content from the Czech Republic under the Czech Digital Library (CDL).

Todd Suomela presented "Analytics for monitoring usage and users of Archive-It collections". The University of Alberta has been using Archive-It since 2009 and has 19 different collections, of which 15 are public. Their collections are proposed by the public, faculty, or librarians, and each proposal then goes to the Collection Development Committee for review. Todd evaluated user activity (using Google Analytics) and the collection management aspects of the UA Digital Libraries.



After the lightning talks were over, workshop participants took a break and looked at the posters and demonstrations associated with the lightning talks above.


Our colleague Lulwah Alkwai had the presentation of her "Best Student Paper" award-winning full paper, "How Well Are Arabic Websites Archived?", scheduled the same day, so we joined her in the main conference track.



During the lunch break, awards were announced, and our WSDL Research Group secured the Best Student Paper and the Best Poster awards. While some people were still enjoying their lunch, Dr. J. Stephen Downie presented the closing keynote on the HathiTrust Digital Library. I learned a lot more about the HathiTrust, its collections, how they deal with copyright and (not so) open data, and their mantra, "bring computing to the data", for the sake of fair use of copyrighted data. Finally, there were announcements about next year's JCDL conference, which will be held in Newark, NJ from 19 to 23 June 2016. After that we assembled again in the workshop room for the remaining sessions of WADL.



Robert Comer and Andrea Copeland together presented "Methods for Capture of Social Media Content for Preservation in Memory Organizations". They talked about preserving personal and community heritage. They outlined the issues and challenges in preserving the history of the social communities and the problem of preserving the social media in general. They are working on a prototype tool called CHIME (Community History in Motion Everyday).




Mohamed Farag presented his talk on "Building and Archiving Event Web Collections: A focused crawler approach". Mohamed described the current approaches to building event collections: 1) manually, which leads to high quality but requires a lot of effort, and 2) via social media, which is quick but may result in potentially low quality collections. Looking for a balance between the two approaches, they developed an Event Focused Crawler (EFC) that retrieves web pages similar to curator-selected seed URLs with the help of a topic detection model. They have made an event detection service demo available.



Zhiwu Xie presented "Server-Driven Memento Datetime Negotiation - A UWS Case". He described the Uninterruptable Web Service (UWS) architecture, which uses Memento to provide continuous service even if a server goes down. He then proposed an amendment to the workflow of the Memento protocol for server-driven content negotiation instead of an agent-driven approach to improve the efficiency of UWS.



Luis Meneses presented his talk on "Grading Degradation in an Institutionally Managed Repository". He motivated his talk by saying that degradation in data collection is like a library with books with missing pages. He illustrated examples from his testbed collection to introduce nine classes of degradation from the least damaged to the worst as 1) kind of correct, 2) university/institution pages, 3) directory listings, 4) blank pages, 5) failed redirects, 6) error pages, 7) pages in a different language, 8) domain for sale, and 9) deceiving pages.



The last speaker of the session, Sawood Alam (your author), presented "Profiling Web Archives". I briefly described the Memento Aggregator and the need for profiling the long tail of archives to improve the efficiency of the aggregator. I described various profile types and policies, analyzed their cost in terms of space and time, and measured the routing efficiency of each profile. I also discussed the serialization format and scale-related issues such as incremental updates. I took advantage of being the last presenter of the workshop and kept the participants away from their dinner longer than I was supposed to.





Thanks Mat for your efforts in recording various sessions. Thanks Martin for the poster pictures. Thanks to everyone who contributed to the WADL 2015 Group Notes, it was really helpful. Thanks to all the organizers, volunteers and participants for making it a successful event.

Resources

--
Sawood Alam

Thursday, July 2, 2015

2015-07-02: JCDL2015 Main Conference

Large, Dynamic and Ubiquitous –The Era of the Digital Library






JCDL 2015 (#JCDL2015) took place at the University of Tennessee Conference Center in Knoxville, Tennessee. The conference was four days long, June 21-25, 2015. This year three students from our WS-DL CS group at ODU had their papers accepted, as well as one poster (see trip reports for 2014, 2013, 2012, 2011). Dr. Weigle (@weiglemc), Dr. Nelson (@phonedude_mln), Sawood Alam (@ibnesayeed), Mat Kelly (@machawk1) and I (@LulwahMA) went to the conference. We drove from Norfolk, VA. Four former members of our group, Martin Klein (UCLA, CA) (@mart1nkle1n), Joan Smith (Linear B Systems, Inc., VA) (@joansm1th), Ahmed Alsum (Stanford University, CA) (@aalsum) and Hany SalahEldeen (Microsoft, Seattle) (@hanysalaheldeen), also met us there. The trip was around 8 hours. We enjoyed the mountain views and the beautiful farms. We also caught parts of a storm on our way, but it only lasted for two hours or so.

The first day of the conference (Sunday June 21, 2015) consisted of four tutorials and the Doctoral Consortium. The four tutorials were: "Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research", "Introduction to Digital Libraries", "Digital Data Curation Essentials for Data Scientists, Data Curators and Librarians", and "Automatic Methods for Disambiguating Author Names in Bibliographic Data Repositories".


Mat Kelly (ODU, VA)(@machawk1) covered the Doctoral Consortium.




The main conference started on Monday June 22, 2015 and opened with Paul Logasa Bogen II (Google, USA) (@plbogen). He started by welcoming the attendees and mentioned that this year there were 130 registered attendees from 87 different organizations, 22 states, and 19 different countries.



Then the program chairs, Geneva Henry (University Libraries, GWU, DC), Dion Goh (Wee Kim Wee School of Communication and Information, Nanyang Technological University, Singapore) and Sally Jo Cunningham (Waikato University, New Zealand), added announcements and the acceptance numbers for JCDL 2015. Of the conference submissions, 18 full research papers (30%), 30 short research papers (50%), and 18 posters and demos (85.7%) were accepted. Finally, the speaker announced the nominees for best student paper and best overall paper.

The best paper nominees were:
The best student paper nominees were:
Then Unmil Karadkar (The University of Texas at Austin, TX) presented the keynote speaker Piotr Adamczyk (Google Inc., London, UK). Piotr's talk was titled “The Google Cultural Institute: Tools for Libraries, Archive, and Museums”. He presented some of Google's attempts to add to the online cultural heritage. He introduced the Google Cultural Institute website, which consists of three main projects: the Art Project, Archive (Historic Moments) and World Wonders. He showed us the Google Art Project (Link from YouTube: Google Art Project) and then introduced an application to search museums and navigate and look at art. Next, he introduced Google Cardboard (Link from YouTube: ”Google Cardboard Tour Guide”) (Link from YouTube: “Expeditions: Take your students to places a school bus can’t”), where you can explore different museums by looking into a cardboard container that can house a user's electronic device. He mentioned that more museums are allowing Google to capture images of their museums and allowing others to explore them using Google Cardboard, and that Google would like to further engage with cultural partners. His talk was similar to a talk he gave in 2014 titled "Google Digitalizing Culture?".

Then we started off with the two simultaneous sessions "People and Their Books" and "Information Extraction". I attended the second session. The first paper was “Online Person Name Disambiguation with Constraints” presented by Madian Khabsa (PSU, PA). The goal of his work is to map name mentions to real-world people. They found that 11%-17% of the queries in search engines are personal names. He mentioned that two issues are usually not addressed: adding constraints to the clustering process and adding the data incrementally without clustering the entire database. One challenge they faced was redundant names. Constraints can be useful in a digital library where users can make corrections. Madian described a constraint-based clustering algorithm for person name disambiguation.


Sawood Alam (ODU, VA) (@ibnesayeed) followed Madian with his paper “Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages”. He mentioned that general online book readers are not suitable for scanned dictionaries. They proposed an approach of indexing scanned pages of a dictionary that enables direct access to the appropriate pages on lookup. He implemented an application called Dictionary Explorer in which he indexed monolingual and multilingual dictionaries, at a speed of over 20 pages per minute per person.



Next, Sarah Weissman (UMD, Maryland) presented “Identifying Duplicate and Contradictory Information in Wikipedia”. Sarah identified sentences in wiki articles that are identical. She randomly selected 2,000 articles and manually inspected them. She found that 45% are identical, 30% are templates, 13.15% are copy editing, 5.8% are factual drift, 0.3% are references and 4.9% are other pages.

The last presenter in this session was Min-Yen Kan (National University of Singapore, Singapore) (@knmnyn), presenting “Scholarly Document Information Extraction using Extensible Features for Efficient Higher Order Semi-CRFs”. He introduced the notion of extensible features for higher order semi-CRFs that allow memoization to speed up inference algorithms.


The papers in the other concurrent session that I was unable to attend were:

After the Research at Google-sponsored Banquet Lunch, Sally Jo Cunningham (University of Waikato, NZ) introduced the first panel, "Lifelong Digital Libraries", and then the first speaker, Cathal Gurrin (Dublin City University, Ireland) (@cathal). His presentation was titled "Rich Lifelong Libraries". He started off with using wearable devices and information loggers to automatically record your life in detail. He gave examples of devices currently on the market, such as Google Glass or Apple’s iWatch, that record every moment. He has gathered a digital memory of himself since 2006 by using a wearable camera. The talk he gave was similar to a talk he gave in 2012 titled "The Era of Personal Life Archives".

The second speaker was Taro Tezuka (University of Tsukuba, Japan). His presentation was titled "Indexing and Searching of Conversation Lifelogs". He focused on search capability, which is as important as storage capability in lifelog applications. He mentioned that clever indexing of recorded content is necessary for implementing useful lifelog search systems. He also showed LifeRecycle, a system for recording and retrieving conversation lifelogs: first record the conversation, then perform speech recognition, then store the result, and finally search and show the result. He mentioned that the challenges in getting a person to agree to being recorded are security and privacy concerns.

The last speaker of the panel was Håvard Johansen (University of Tromso, Norway). He started with definitions of lifelogs. He also discussed the use of personal data for sports analytics and understanding how to construct privacy-preserving lifelogging. After the third speaker, the audience asked about and discussed some privacy issues that may concern lifelogging.



The third and fourth sessions were simultaneous as well. The third session was "Big Data, Big Resources". The first presenter was Zhiwu Xie (Virginia Tech, VA) (@zxie) with his paper “Towards Use And Reuse Driven Big Data Management”. This work focused on integrating digital libraries and big data analytics in the cloud. They then described the system model and its evaluation.


Next, “iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling” was presented by Gerhard Gossen (L3S Research Center, Germany) (@GerhardGossen). iCrawl combines crawling of the Web and social media for a specified topic. The crawler works by collecting web and social content in a single system and exploits the stream of new social media content for guidance. The target users for this web crawling toolbox are web science and qualitative humanities researchers. The approach is to start with a topic and follow its outgoing links that are relevant.


G. Craig Murray (Verisign Labs) presented instead of Jimmy Lin (University of Maryland, College Park) (@lintool). The paper was titled “The Sum of All Human Knowledge in Your Pocket: Full-Text Searchable Wikipedia on a Raspberry Pi”. Craig discussed how useful it is to have a copy of Wikipedia that you can access without an Internet connection by connecting to a Raspberry Pi device via Bluetooth or WiFi. He passed the Raspberry Pi device around the audience and allowed them to connect to it wirelessly. The device is considered better than Google since it offers offline search and full-text access. It also offers full control over the search algorithms and private search. The advantage of the data being on a separate device instead of on the phone is that it is cheaper per unit storage and offers a full Linux stack and hardware customizability.


The last presenter in this session was Tarek Kanan (Virginia Tech, VA), presenting “Big Data Text Summarization for Events: a Problem Based Learning Course”. Problem/project Based Learning (PBL) is a student-centered teaching method, where student teams learn by solving problems. In this work, 7 teams of students (30 students in total) applied big data methods to produce corpus summaries. They found that PBL helped students in a computational linguistics class automatically build good text summaries for big data collections. The students also learned many of the key concepts of NLP.


The fourth session, which I missed, was "Working the Crowd"; Mat Kelly (ODU, VA) (@machawk1) recorded the session.



After that, Conference Banquet was served at the Foundry on the Fair Site.


On Tuesday June 23, 2015, after breakfast, the keynote was given by Katherine Skinner (Educopia Institute, GA). Her talk was titled “Moving the needle: from innovation to impact”. She discussed how to engage others to make use of digital libraries and archiving, getting out there and being as important a factor in the community as we should be. She asked what digital libraries could accomplish as a field if we shifted our focus from "innovation" to "impact".

After that, there were two other simultaneous sessions, "Non-Text Collection" and "Ontologies and Semantics". I attended the first session, where one long paper and four short papers were presented. The first speaker in this session was Yuehan Wang (Peking University, China). His paper was “WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document”. The speaker discussed the challenges of extracting mathematical formulae from different representations. They propose an upgraded Mathematical Information Retrieval system named WikiMirs 3.0. The system can extract mathematical formulas from PDFs and supports typed-in queries. The system is publicly available at: www.icst.pku.edu.cn/cpdp/wikimirs3/.


Next, Kahyun Choi (University of Illinois at Urbana-Champaign, IL) presented “Topic Modeling Users’ Interpretations of Songs to Inform Subject Access in Music Digital Libraries”. Her paper focused on whether topic modeling can discover subjects from interpretations, and on how to improve the quality of the topics automatically. Their data set was extracted from songmeanings.com and contained almost four thousand songs with at least five interpretations per song. Topic models are generated using Latent Dirichlet Allocation (LDA), and the normalization of the top ten words in each topic was calculated. For evaluation, a sample was manually assigned to six subjects, with 71% accuracy.


Frank Shipman (Texas A&M University, TX) presented “Towards a Distributed Digital Library for Sign Language Content”. In this work they try to locate content relating to sign language on the Internet. They describe the software components of a distributed digital library of sign language content, called SLaDL. This software detects sign language content in a video.




The final speaker of this session was Martin Klein (UCLA, CA) (@mart1nkle1n), presenting “Analyzing News Events in Non-Traditional Digital Library Collections”. In this work they found indicators relevant for building non-traditional collections. From the two collections, an online archive of TV news broadcasts and an archive of social media captures, they found that there is an 8 hour delay between social media and TV coverage that continues at a high frequency level for a few days after a major event. In addition, they found that news items have the potential to influence other collections.




The session I missed was "Ontologies and Semantics", the papers presented were:

After lunch, there were two other simultaneous sessions, "User Issues" and "Temporality". I attended the "Temporality" session, where there were two long papers. The first paper was presented by Thomas Bogel (Heidelberg University, Germany), titled “Time Will Tell: Temporal Linking of News Stories”. Thomas presented a framework to link news articles based on temporal expressions that occur in the articles. In this work they recover the arrangement of events covered in an article; in the big picture, a network of articles will be temporally ordered.

The second paper was “Predicting Temporal Intention in Resource Sharing”, presented by Hany SalahEldeen (ODU, VA) (@hanysalaheldeen). Links shared on Twitter could change over time and might no longer match the user's intention. In this work they enhance their prior temporal intention model by adding linguistic feature analysis and semantic similarity, and by balancing the training dataset. The current model had 77% accuracy in predicting the intention of the user.




The session I missed "User Issues" had four papers:

Next, there was a panel on “Organizational Strategies for Cultural Heritage Preservation”. Paul Logasa Bogen, II (Google, WA) (@plbogen) introduced the four speakers in this panel: Katherine Skinner (Educopia Institute, Atlanta), Stacy Kowalczyk (Dominican University, IL) (@skowalcz), Piotr Adamczyk (Google Cultural Institute, Mountain View) and Unmil Karadkar (The University of Texas at Austin, Austin) (@unmil). The panel discussed preservation goals, the challenges facing organizations that practice preservation, centralized versus decentralized preservation, and how to balance these approaches. In the final minutes there were questions from the audience regarding privacy and ownership in cultural heritage collections.

Following that was Minute Madness, a session where each poster presenter had two chances (60 seconds, then 30 seconds) to talk about their poster in an attempt to lure attendees to come by during the poster session.



The final session of the day was the "Reception and Posters", where the posters/demos were viewed and everyone in the audience got three stickers that were used to vote for the best poster/demo.


On Wednesday June 24, 2015, there was one session "Archiving, Repositories, and Content" and three different workshops: "4th International Workshop on Mining Scientific Publications (WOSP 2015)", "Digital Libraries for Musicology (DLfM)" and "Web Archiving and Digital Libraries (WADL 2015)".

The session of the day, "Archiving, Repositories, and Content", had four papers. The first paper in the last session of the conference was presented by Stacy Kowalczyk (Dominican University, IL) (@skowalcz): “Before the Repository: Defining the Preservation Threats to Research Data in the Lab”. She mentioned that lost media is a big threat and needs to be addressed by preservation. She conducted a survey to quantify the risks to the preservation of research data. With a sample of 724 National Science Foundation awardees completing the survey, she found that human error was the greatest threat to preservation, followed by equipment malfunction.




Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author) presented “How Well Are Arabic Websites Archived?”. In this work we focused on determining whether Arabic websites are archived and indexed. We collected a sample of Arabic websites and discovered that 46% of the websites are not archived and that 31% are not indexed. We also analyzed the dataset and found that almost 15% had an Arabic country code top level domain and almost 11% had an Arabic geographical location. We recommend that if you want an Arabic webpage to be archived, you should list it in DMOZ and host it outside an Arabic country.





Next, Ke Zhou (University of Edinburgh, UK) presented his paper “No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving” (from the Hiberlink project). This paper addresses the issue of encountering a 404 on a reference in a scholarly article. They found that there are two types of reference decay, content drift and link rot, and that around 30% of web references are rotten. This work suggests that authors archive links that are more likely to rot.




Then Jacob Jett (University of Illinois at Urbana-Champaign, IL) presented his paper “The Problem of “Additional Content” in Video Games". In this work they first discuss the challenges that video games face nowadays due to additional content such as modifications and downloadable content. They address these challenges by proposing a solution that captures the additional content.




After the final paper of the main conference, lunch was served along with the announcement of the best poster/demo, determined by counting the audience votes. This year there were two best poster/demo awards, which went to Ahmed Alsum (Stanford University, CA) (@aalsum) for “Reconstruction of the US First Website”, and to Mat Kelly (ODU, VA) (@machawk1) for “Mobile Mink: Merging Mobile and Desktop Archived Webs”, by Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson (learn more about Mobile Mink). 




Next, was the announcement for the awards for best student paper and best overall paper. The best student paper was awarded to Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author), Michael L. Nelson, and Michele C. Weigle for our paper “How Well Are Arabic Websites Archived?”, and the Vannevar Bush best paper was awarded to Pertti Vakkari and Janna Pöntinen for their paper “Result List Actions in Fiction Search”.

After that there was the "Closing Plenary and Keynote", where J. Stephen Downie talked about “The HathiTrust Research Center Providing Analytic Access to the HathiTrust Digital Library’s 4.7 Billion Pages”. HathiTrust is trying to preserve the cultural record. It has currently digitized 13,496,147 volumes, 6,778,492 books, and much more. There are many current projects that HathiTrust is working on, such as the HathiTrust BookWorm, with which you can search for a specific term and see the number of occurrences and their positions. This presentation was similar to a presentation titled "The HathiTrust Research Center: Big Data Analytics in Secure Data Framework" presented in 2014 by Robert McDonald.

Finally, JCDL 2016 was announced to be located in Newark, NJ, June 19-23.

After that, I attended the "Web Archiving and Digital Libraries" workshop, where Sawood Alam (ODU, VA)(@ibnesayeed) will cover the details in a blog post.





--Lulwah Alkwai

Special thanks to Mat Kelly for taking the videos and helping to edit this post.

Friday, June 26, 2015

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.
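As a toy illustration (this is not Heritrix's actual extraction code, just a regex "peek" over a hypothetical page), a static scan of the HTML below finds the literal link and even the partial string constant inside the script, but never the fully constructed Ajax URI, because that URI only exists once a browser executes the JavaScript:

```python
import re

# A hypothetical page: one literal link and one URI built at runtime by JavaScript.
html = """
<a href="http://example.com/static/page.html">static link</a>
<script>
  var base = "http://example.com/api/";
  // This URI only exists after the script runs in a browser:
  fetch(base + "models?make=" + document.getElementById("make").value);
</script>
"""

# A static "peek" (a regex over the raw bytes), roughly what a non-executing crawler can do.
found = re.findall(r'https?://[^\s"\'<>]+', html)
print(found)   # ['http://example.com/static/page.html', 'http://example.com/api/']
# The fully constructed Ajax URI (.../api/models?make=...) is never seen,
# so the resource it returns is never archived.
```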

For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data to populate the "Model" and "Year" drop down menus when the user selects an option from the "Make" menu (Figures 2-3).
Fig 1. KBB.com uses Ajax to retrieve data for the drop down menus.
Fig 2. The user selects the Make option, which initiates an Ajax request...
Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.
Using Chrome's Developer Tools, we can see the Ajax making a request for this information (Figure 4).

Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.
If we view a memento of KBB.com (Figure 5), we see that the drop downs are not operational because Heritrix was not able to run the JavaScript and capture the data needed to populate the drop downs.

Fig 5. The memento of KBB.com is not completely functional due to the reliance on Ajax to load extra-client data after the initial page load.
The overly-simplified solution to this problem is for archives to use a tool that executes JavaScript in ways the traditional archival crawlers cannot. (Our paper discussing the performance trade-offs and impact of using headless browsing vs. traditional crawling tools has been accepted for publication at iPres2015.) More specifically, the crawlers should make use of technologies that act more like (or load resources in actual) browsers. For example, Archive-It is using Umbra to overcome the difficulties introduced by JavaScript for a subset of domains.

We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as most useful to our approach.

There are other similar tools (Browsertrix, WebRecorder.io, CrawlJAX) but these are slightly outside the scope of what we want to do.  We are currently performing research that requires a tool to automatically identify interactive elements of a page, map the elements to a client-side state, and recognize and execute user interactions on the page to move between client-side states. Browsertrix uses Selenium to record HTTP traffic to create higher fidelity archives a page-at-a-time; this is an example of an implementation of Selenium, but it does not match our goal of running automatically. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), and matches our goal of replaying interactions; WebRecorder.io is another appropriate use-case for Selenium, but does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automatic Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.

In this blog post, I will discuss some of our initial findings with detecting and interacting with DOM elements and the trade-offs we have observed between the tools we have investigated.

PhantomJS is a headless browsing utility that is scripted in JavaScript. As such, it provides a tight integration between the loaded page, its DOM, and the code. This allows code to be easily injected directly into the target page, and native DOM interaction to be performed. Consequently, PhantomJS provides a better mechanism for identifying specific DOM elements and their properties.

For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the KBB.com example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop downs.

Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. There are several API bindings (Java, Python, Perl, etc.) that can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as PhantomJS does. However, it provides better utilities for automating actions through mouse movements.

Based on our experimentation, Selenium is a better tool for canned interaction: for example, a pre-scripted set of clicks, drags, etc. (a minimal scripted example follows the table below). A summary of the differences between PhantomJS, Selenium, and VisualEvent (to be explored later in this post) is presented in the table below. Note that our speed testing is based on brief observation and should be used as a relative comparison rather than a definitive measurement.

Tool                    PhantomJS                       Selenium          VisualEvent
Operation               Headless                        Full-Browser      JavaScript bookmarklet and code
Speed (seconds)         2.5-8                           4-10              < 1 (on user click)
DOM Integration         Close integration               3rd party         Close integration/embedded
DOM Event Extraction    Semi-reliable                   Semi-reliable     100% reliable
DOM Interaction         Scripted, native, on-demand     Scripted          None
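As an illustration of the kind of canned, pre-scripted interaction mentioned above, a minimal Selenium sketch (using the Python bindings) for the KBB.com drop-downs might look like the following. The element IDs and option labels are hypothetical placeholders rather than KBB's actual markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait

driver = webdriver.Firefox()              # Selenium is "headful": a full browser is launched
driver.get("http://www.kbb.com/")
wait = WebDriverWait(driver, 10)

# Canned interaction: choose a Make, then wait for the Ajax-populated Model menu.
# The IDs "make"/"model" and the option labels are placeholders for illustration.
make_menu = Select(wait.until(EC.presence_of_element_located((By.ID, "make"))))
make_menu.select_by_visible_text("Honda")

model_menu = Select(wait.until(EC.presence_of_element_located((By.ID, "model"))))
model_menu.select_by_visible_text("Accord")

# The client-side state created by the Ajax responses now exists in the browser and
# could be handed off to an archival process (e.g., by capturing driver.page_source).
print(len(driver.page_source))
driver.quit()
```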

To summarize, PhantomJS is faster and more closely integrated with the DOM (because it is headless) than Selenium (which loads a full browser). PhantomJS is more closely coupled with the browser, the DOM, and the client-side events than Selenium. However, by using a native browser, Selenium defers the responsibility of keeping up with advances in web technologies such as JavaScript to the browser rather than maintaining that responsibility within the archival tool. This will prove to be beneficial as JavaScript, HTML5, and other client-side programming languages evolve and emerge.

Sources online (e.g., Stack Overflow, Real Python, Vilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded and pre-scripted events be performed using Selenium and that adaptive or extracted events be performed in PhantomJS.

To confirm this, we tested Selenium and PhantomJS on Mat Kelly's archival acid test (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i, which is the user interaction test, but pass all others. This indicates that both Selenium and PhantomJS have difficulty identifying all events attached to all DOM elements (e.g., neither can easily detect the onchange event attached to the KBB.com drop downs).
Fig 6. The Acid Test is identical for PhantomJS and Selenium, failing the post-load interaction test.
VisualEvent is advertised as a bookmarklet-run solution for identifying client-side events, not an archival utility, but can reliably identify all of the event handlers attached to DOM elements. To improve the accuracy of the DOM Event Extraction, we have been using VisualEvent to discover the event handlers on the DOM.

VisualEvent takes the reverse approach to discovering the event handlers attached to DOM elements. Our approach -- which was ineffective -- was to use JavaScript to iterate through all DOM elements and try to discover the attached event handlers. VisualEvent starts with the JavaScript, gets all of the JavaScript functions, understands which DOM elements reference those functions, and determines whether those functions are event handlers. VisualEvent then displays the interactive elements of the DOM (Figure 7) and their associated event handler functions (Figure 8) visually through an overlay in the browser. We removed the visual aspects and leverage the JavaScript functions to extract the interactive elements of the page.

Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.

Fig 8. The event handlers of each interactive elements are pulled from the JavaScript and displayed on the page, as well.

We use PhantomJS to inject the VisualEvent code into a page, extract the interactive elements, and then use PhantomJS to interact with those elements. This discovers states on the client that traditional crawlers like Heritrix cannot capture. Using this approach, PhantomJS can capture all interactive elements on the page, including the onchange events attached to the drop-down menus on KBB.com.

So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this is a recommendation from our personal experience for our use case: automatically identifying a set of DOM interactions; other experiment conditions and goals may be better suited for Selenium and other client-side tools.

Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but we hope that our experiences are beneficial to others.

--Justin F. Brunelle