Thursday, July 2, 2015

2015-07-02: JCDL2015 Main Conference

Large, Dynamic and Ubiquitous – The Era of the Digital Library






JCDL 2015 (#JCDL2015) took place at the University of Tennessee Conference Center in Knoxville, Tennessee, June 21-25, 2015. This year three students from our WS-DL CS group at ODU had their papers accepted, as well as one poster (see trip reports for 2014, 2013, 2012, 2011). Dr. Weigle (@weiglemc), Dr. Nelson (@phonedude_mln), Sawood Alam (@ibnesayeed), Mat Kelly (@machawk1), and I (@LulwahMA) went to the conference. We drove from Norfolk, VA; the trip was around 8 hours. Four former members of our group, Martin Klein (UCLA, CA) (@mart1nkle1n), Joan Smith (Linear B Systems, Inc., VA) (@joansm1th), Ahmed Alsum (Stanford University, CA) (@aalsum), and Hany SalahEldeen (Microsoft, Seattle) (@hanysalaheldeen), also met us there. We enjoyed the mountain views and the beautiful farms. We also caught parts of a storm on our way, but it only lasted for two hours or so.

The first day of the conference (Sunday June 21, 2015) consisted of four tutorials and the Doctoral Consortium. The four tutorials were: "Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research", "Introduction to Digital Libraries", "Digital Data Curation Essentials for Data Scientists, Data Curators and Librarians", and "Automatic Methods for Disambiguating Author Names in Bibliographic Data Repositories".


Mat Kelly (ODU, VA)(@machawk1) covered the Doctoral Consortium.




The main conference started on Monday June 22, 2015 and opened with Paul Logasa Bogen II (Google, USA) (@plbogen). He welcomed the attendees and mentioned that this year there were 130 registered attendees from 87 different organizations, 22 states, and 19 different countries.



Then the program chairs, Geneva Henry (University Libraries, GWU, DC), Dion Goh (Wee Kim Wee School of Communication and Information, Nanyang Technical University, Singapore), and Sally Jo Cunningham (Waikato University, New Zealand), continued the announcements with the acceptance numbers for JCDL 2015: 18 full research papers (30% acceptance rate), 30 short research papers (50%), and 18 posters and demos (85.7%) were accepted. Finally, the speaker announced the nominees for best student paper and best overall paper.

The best paper nominees were:
The best student paper nominees were:
Then Unmil Karadkar (The University of Texas at Austin, TX) introduced the keynote speaker, Piotr Adamczyk (Google Inc, London, UK). Piotr's talk was titled “The Google Cultural Institute: Tools for Libraries, Archives, and Museums”. He presented some of Google's attempts to add to the online cultural heritage. He introduced the Google Cultural Institute website, which consists of three main projects: the Art Project, Archive (Historic Moments), and World Wonders. He showed us the Google Art Project (link from YouTube: Google Art Project) and then introduced an application to search museums and navigate and look at art. Next, he introduced Google Cardboard (links from YouTube: “Google Cardboard Tour Guide” and “Expeditions: Take your students to places a school bus can’t”), where you can explore different museums by looking into a cardboard container that holds a user's electronic device. He mentioned that more museums are allowing Google to capture images of their collections and allowing others to explore them using Google Cardboard, and that Google would like to further engage with cultural partners. His talk was similar to a talk he gave in 2014 titled "Google Digitalizing Culture?".

Then we started off with two simultaneous sessions, "People and Their Books" and "Information Extraction". I attended the second session. The first paper was “Online Person Name Disambiguation with Constraints”, presented by Madian Khabsa (PSU, PA). The goal of his work is to map mentioned names to real-world people. They found that 11%-17% of queries in search engines are personal names. He mentioned two issues that are usually not addressed: adding constraints to the clustering process and adding data incrementally without re-clustering the entire database. A challenge they faced was redundant names. When constraints are added, they can be useful in a digital library where users can make corrections. Madian described a constraint-based clustering algorithm for person name disambiguation.


Sawood Alam (ODU, VA) (@ibnesayeed) followed Madian with his paper “Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages”. He mentioned that general online book readers are not suitable for scanned dictionaries. They proposed an approach for indexing scanned pages of a dictionary that enables direct access to the appropriate pages on lookup. He implemented an application called Dictionary Explorer in which he indexed monolingual and multilingual dictionaries at a speed of over 20 pages per minute per person.



Next, Sarah Weissman (UMD, Maryland) presented “Identifying Duplicate and Contradictory Information in Wikipedia”. Sarah identified sentences in wiki articles that are identical. She randomly selected 2,000 articles and manually examined the matching sentences, finding that 45% are identical, 30% are templates, 13.15% are copy editing, 5.8% are factual drift, 0.3% are references, and 4.9% are other pages.

The last presenter in this session was Min-Yen Kan (National University of Singapore, Singapore) (@knmnyn), presenting “Scholarly Document Information Extraction using Extensible Features for Efficient Higher Order Semi-CRFs”. He introduced the notion of extensible features for higher-order semi-CRFs that allow memoization to speed up inference algorithms.


The papers in the other concurrent session that I was unable to attend were:

After the Research at Google-sponsored Banquet Lunch, Sally Jo Cunningham (University of Waikato, NZ) introduced the first panel, "Lifelong Digital Libraries", and its first speaker, Cathal Gurrin (Dublin City University, Ireland) (@cathal). His presentation was titled "Rich Lifelong Libraries". He started off with using wearable devices and information loggers to automatically record your life in detail. He gave examples of devices currently on the market, such as Google Glass or Apple's iWatch, that record every moment. He has gathered a digital memory of himself since 2006 by using a wearable camera. The talk was similar to a talk he gave in 2012 titled "The Era of Personal Life Archives".

The second speaker was Taro Tezuka (University of Tsukuba, Japan). His presentation was titled "Indexing and Searching of Conversation Lifelogs". He argued that search capability is as important as storage capability in lifelog applications, and that clever indexing of recorded content is necessary for implementing a useful lifelog search system. He also showed LifeRecycle, a system for recording and retrieving conversation lifelogs: first record the conversation, then run speech recognition, then store the result, and finally search and display the results. He mentioned that security and privacy concerns are the main challenges in getting a person to agree to be recorded.

The last speaker of the panel was Håvard Johansen (University of Tromso, Norway). He started with definitions of lifelogs and then discussed the use of personal data for sports analytics and how to construct privacy-preserving lifelogging. After the third speaker, the audience discussed some of the privacy issues that lifelogging raises.



The third and fourth sessions were simultaneous as well. The third session was "Big Data, Big Resources". The first presenter was Zhiwu Xie (Virginia Tech, VA) (@zxie) with his paper “Towards Use And Reuse Driven Big Data Management”. This work focused on integrating digital libraries and big data analytics in the cloud; he then described the system model and its evaluation.


Next, “iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling” was presented by Gerhard Gossen (L3S Research Center, Germany) (@GerhardGossen). iCrawl combines crawling of the Web and social media for a specified topic. The crawler collects web and social content in a single system and exploits the stream of new social media content for guidance. The target users for this web crawling toolbox are web science and qualitative humanities researchers. The approach is to start with a topic and follow its relevant outgoing links.


G. Craig Murray (Verisign Labs) presented instead of Jimmy Lin (University of Maryland, College Park) (@lintool). The paper was titled “The Sum of All Human Knowledge in Your Pocket: Full-Text Searchable Wikipedia on a Raspberry Pi”. Craig discussed how useful it is to have a copy of Wikipedia that you can access without the Internet by connecting to a Raspberry Pi device via Bluetooth or Wi-Fi. He passed the Raspberry Pi around the audience and let people connect to it wirelessly. Compared to Google, the device offers offline search, full-text access, full control over the search algorithms, and private search. The advantage of having the data on a separate device instead of on the phone is that it is cheaper per unit of storage and offers a full Linux stack and hardware customizability.


The last presenter in this session was Tarek Kanan (Virginia Tech, VA), presenting “Big Data Text Summarization for Events: a Problem Based Learning Course”. Problem/Project Based Learning (PBL) is a student-centered teaching method where student teams learn by solving problems. In this work, 7 teams of students (30 students in all) applied big data methods to produce corpus summaries. They found that PBL helped students in a computational linguistics class automatically build good text summaries for big data collections. The students also learned many of the key concepts of NLP.


The fourth session, which I missed, was "Working the Crowd"; Mat Kelly (ODU, VA) (@machawk1) recorded the session.



After that, the Conference Banquet was served at the Foundry on the Fair Site.


On Tuesday June 23, 2015, after breakfast, the keynote was delivered by Katherine Skinner (Educopia Institute, GA). Her talk was titled “Moving the needle: from innovation to impact”. She discussed how to engage others to make use of digital libraries and archiving, and how to get out there and be the important factor in the community that we should be. She asked what digital libraries could accomplish as a field if we shifted our focus from "innovation" to "impact".

After that, there were two more simultaneous sessions, "Non-Text Collection" and "Ontologies and Semantics". I attended the first session, where one long paper and four short papers were presented. The first speaker in this session was Yuehan Wang (Peking University, China). His paper was “WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document”. The speaker discussed the challenges of extracting mathematical formulas from their different representations. They propose an upgraded Mathematical Information Retrieval system named WikiMirs 3.0. The system can extract mathematical formulas from PDFs and supports typed-in queries. The system is publicly available at: www.icst.pku.edu.cn/cpdp/wikimirs3/.


Next, Kahyun Choi (University of Illinois at Urbana-Champaign, IL) presented “Topic Modeling Users’ Interpretations of Songs to Inform Subject Access in Music Digital Libraries”. Her paper addressed whether topic modeling can discover subjects from interpretations and how to improve the quality of topics automatically. Their dataset was extracted from songmeanings.com and contained almost four thousand songs with at least five interpretations per song. Topic models were generated using Latent Dirichlet Allocation (LDA), and the normalization of the top ten words in each topic was calculated. For evaluation, a sample was manually assigned to six subjects, and they found 71% accuracy.


Frank Shipman (Texas A&M University, TX) presented “Towards a Distributed Digital Library for Sign Language Content”. In this work they try to locate content relating to sign language on the Internet. They describe the software components of a distributed digital library of sign language content, called SLaDL, whose software detects sign language content in a video.




The final speaker of this session was Martin Klein (UCLA, CA) (@mart1nkle1n), presenting “Analyzing News Events in Non-Traditional Digital Library Collections”. In his work they found indicators relevant for building non-traditional collections. From the two collections, an online archive of TV news broadcasts and an archive of social media captures, they found that there is an 8-hour delay between social media and TV coverage that continues at a high frequency level for a few days after a major event. In addition, they found that news items have the potential to influence other collections.




The session I missed was "Ontologies and Semantics"; the papers presented were:

After lunch, there were two more simultaneous sessions, "User Issues" and "Temporality". I attended the "Temporality" session, where there were two long papers. The first paper was presented by Thomas Bogel (Heidelberg University, Germany), titled “Time Will Tell: Temporal Linking of News Stories”. Thomas presented a framework to link news articles based on the temporal expressions that occur in them. In this work they recover the arrangement of events covered in an article; in the big picture, a network of articles will be temporally ordered.

The second paper was “Predicting Temporal Intention in Resource Sharing”, presented by Hany SalahEldeen (ODU, VA) (@hanysalaheldeen). Links shared on Twitter could change over time and might no longer match the user's intention. In this work they enhance a prior temporal intention model by adding linguistic feature analysis, semantic similarity, and balancing of the training dataset. The current model achieves 77% accuracy in predicting the intention of the user.




The session I missed "User Issues" had four papers:

Next, there was a panel on “Organizational Strategies for Cultural Heritage Preservation”. Paul Logasa Bogen, II (Google, WA) (@plbogen) introduced the four speakers in this panel: Katherine Skinner (Educopia Institute, Atlanta), Stacy Kowalczyk (Dominican University, IL) (@skowalcz), Piotr Adamczyk (Google Cultural Institute, Mountain View), and Unmil Karadkar (The University of Texas at Austin, Austin) (@unmil). The panel discussed preservation goals, the challenges facing organizations that practice centralized or decentralized preservation, and how to balance these approaches. In the final minutes there were questions from the audience regarding privacy and ownership in cultural heritage collections.

Following that was the Minute Madness, a session where each poster presenter had two chances (60 seconds, then 30 seconds) to talk about their poster in an attempt to lure attendees to come by during the poster session.



The final session of the day was the "Reception and Posters", where posters and demos were viewed and everyone in the audience got three stickers to vote for the best poster/demo.


On Wednesday June 24, 2015, there was one session "Archiving, Repositories, and Content" and three different workshops: "4th International Workshop on Mining Scientific Publications (WOSP 2015)", "Digital Libraries for Musicology (DLfM)" and "Web Archiving and Digital Libraries (WADL 2015)".

The day's session, "Archiving, Repositories, and Content", had four papers. The first paper in this last session of the conference was presented by Stacy Kowalczyk (Dominican University, IL) (@skowalcz): “Before the Repository: Defining the Preservation Threats to Research Data in the Lab”. She mentioned that lost media is a big threat that preservation needs to address. She conducted a survey to quantify the risk to the preservation of research data; from a sample of 724 National Science Foundation awardees who completed the survey, she found that human error was the greatest threat to preservation, followed by equipment malfunction.




Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author) presented “How Well Are Arabic Websites Archived?”. In this work we focused on determining whether Arabic websites are archived and indexed. We collected a sample of Arabic websites and discovered that 46% of the websites are not archived and that 31% are not indexed. We also analyzed the dataset and found that almost 15% had an Arabic country code top level domain and almost 11% had an Arabic geographical location. We recommend that if you want an Arabic webpage to be archived, you should list it in DMOZ and host it outside an Arabic country.





Next, Ke Zhou (University of Edinburgh, UK) presented his paper “No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving” (from the Hiberlink project). This paper addresses the issue of getting a 404 for a reference in a scholarly article. They identified two types of reference decay, content drift and link rot, and found that around 30% of web references are rotten. This work suggests that authors pro-actively archive links that are likely to rot.




Then Jacob Jett (University of Illinois at Urbana-Champaign, IL) presented his paper “The Problem of “Additional Content” in Video Games". In this work they first discuss the challenges that video games now pose due to additional content such as modifications and downloadable content. They then propose a solution that captures this additional content.




After the final paper of the main conference, lunch was served along with the announcement of the best poster/demo, determined by counting the audience's votes. This year there were two best poster/demo awards: to Ahmed Alsum (Stanford University, CA) (@aalsum) for “Reconstruction of the US First Website”, and to Mat Kelly (ODU, VA) (@machawk1) for “Mobile Mink: Merging Mobile and Desktop Archived Webs”, by Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson (learn more about Mobile Mink).




Next, was the announcement for the awards for best student paper and best overall paper. The best student paper was awarded to Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author), Michael L. Nelson, and Michele C. Weigle for our paper “How Well Are Arabic Websites Archived?”, and the Vannevar Bush best paper was awarded to Pertti Vakkari and Janna Pöntinen for their paper “Result List Actions in Fiction Search”.

After that there was the "Closing Plenary and Keynote", where J. Stephen Downie talked about “The HathiTrust Research Center: Providing Analytic Access to the HathiTrust Digital Library’s 4.7 Billion Pages”. HathiTrust is trying to preserve the cultural record; it has currently digitized 13,496,147 volumes, 6,778,492 books, and much more. There are many current projects that the HathiTrust Research Center is working on, such as the HathiTrust BookWorm, in which you can search for a specific term and see the number of occurrences and their positions. This presentation was similar to a presentation titled "The HathiTrust Research Center: Big Data Analytics in Secure Data Framework" given in 2014 by Robert McDonald.

Finally, JCDL 2016 was announced to be located in Newark, NJ, June 19-23.

After that, I attended the "Web Archiving and Digital Libraries" workshop, where Sawood Alam (ODU, VA)(@ibnesayeed) will cover the details in a blog post.





by Lulwah Alkwai,

Special thanks to Mat Kelly for taking the videos and helping to edit this post.

Friday, June 26, 2015

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.
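To make that limitation concrete, here is a small illustration in plain Python (a hedged sketch, not Heritrix's actual Java internals): a regular-expression "peek" over an embedded script finds URIs that appear as string literals, but misses an endpoint that the script only assembles at run time and fetches via Ajax.

    import re

    # A toy stand-in for a script embedded in a crawled page.
    embedded_script = """
    var base = "/api/vehicle";
    function loadModels(make) {
        // This URI is assembled at run time and fetched via Ajax,
        // so a static "peek" never sees the final string.
        fetch(base + "/models?make=" + make);
    }
    var help = "http://www.example.com/help";  // a literal URI a crawler can extract
    """

    # A crude link-extraction "peek": pull out anything that looks like a URI literal.
    found = re.findall(r'https?://[^\s"\']+', embedded_script)
    print(found)  # ['http://www.example.com/help'] -- the Ajax endpoint is never discovered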

For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data to populate the "Model" and "Year" drop down menus when the user selects an option from the "Make" menu (Figures 2-3).
Fig 1. KBB.com uses Ajax to retrieve data for the drop down menus.
Fig 2. The user selects the Make option, which initiates an Ajax request...
Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.
Using Chrome's Developer Tools, we can see the Ajax making a request for this information (Figure 4).

Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.
If we view a memento of KBB.com (Figure 5), we see that the drop downs are not operational because Heritrix was not able to run the JavaScript and capture the data needed to populate the drop downs.

Fig 5. The memento of KBB.com is not completely functional due to the reliance on Ajax to load extra-client data after the initial page load.
The overly-simplified solution to this problem is for archives to use a tool that executes JavaScript in ways the traditional archival crawlers cannot. (Our paper discussing the performance trade-offs and impact of using headless browsing vs. traditional crawling tools has been accepted for publication at iPres2015.) More specifically, the crawlers should make use of technologies that act more like (or load resources in actual) browsers. For example, Archive-It is using Umbra to overcome the difficulties introduced by JavaScript for a subset of domains.

We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as most useful to our approach.

There are other similar tools (Browsertrix, WebRecorder.io, CrawlJAX), but these are slightly outside the scope of what we want to do. We are currently performing research that requires a tool to automatically identify interactive elements of a page, map the elements to a client-side state, and recognize and execute user interactions on the page to move between client-side states. Browsertrix uses Selenium to record HTTP traffic to create higher fidelity archives a page-at-a-time; this is an example of an implementation of Selenium, but it does not match our goal of running automatically. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), and matches our goal of replaying interactions; WebRecorder.io is another appropriate use-case for Selenium, but does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automatic Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.

In this blog post, I will discuss some of our initial findings with detecting and interacting with DOM elements and the trade-offs we have observed between the tools we have investigated.

PhantomJS is a headless browsing utility that is scripted in JavaScript. As such, it provides a tight integration between the loaded page, its DOM, and the code. This allows code to be injected directly into the target page and native DOM interactions to be performed. PhantomJS therefore provides a better mechanism for identifying specific DOM elements and their properties.

For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the KBB.com example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop downs.

Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. There are several APIs including Java, Python, Perl, etc. that can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as does PhantomJS. However, it provides better utilities for automated action through mouse movements.

Based on our experimentation, Selenium is a better tool for canned interactions -- for example, a pre-scripted set of clicks, drags, etc. (see the short Selenium sketch after the table below). A summary of the differences between PhantomJS, Selenium, and VisualEvent (to be explored later in this post) is presented in the table below. Note that our speed testing is based on brief observation and should be used as a relative comparison rather than a definitive measurement.

    Tool                  PhantomJS                    Selenium       VisualEvent
    Operation             Headless                     Full-Browser   JavaScript bookmarklet and code
    Speed (seconds)       2.5-8                        4-10           < 1 (on user click)
    DOM Integration       Close integration            3rd party      Close integration/embedded
    DOM Event Extraction  Semi-reliable                Semi-reliable  100% reliable
    DOM Interaction       Scripted, native, on-demand  Scripted       None
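As a rough illustration of what we mean by a canned Selenium interaction, the Python sketch below scripts the KBB.com scenario from above. It is only a sketch: the element locators ("make", "model") are hypothetical, since they are not taken from KBB.com's actual markup.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver = webdriver.Firefox()      # Selenium drives a full ("headful") browser
    driver.get("http://www.kbb.com/")

    # Pre-scripted ("canned") interaction: pick a make, wait for the Ajax
    # response to populate the model menu, then read the options it added.
    Select(driver.find_element(By.ID, "make")).select_by_visible_text("Honda")
    WebDriverWait(driver, 10).until(
        lambda d: len(Select(d.find_element(By.ID, "model")).options) > 1)
    models = [option.text for option in Select(driver.find_element(By.ID, "model")).options]
    print(models)

    driver.quit()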

To summarize, PhantomJS is faster (because it is headless) and more closely integrated with the DOM than Selenium (which loads a full browser). PhantomJS is more closely coupled with the browser, the DOM, and client-side events than Selenium. However, by using a native browser, Selenium defers the responsibility of keeping up with advances in web technologies such as JavaScript to the browser rather than maintaining that responsibility within the archival tool. This will prove to be beneficial as JavaScript, HTML5, and other client-side programming languages evolve and emerge.

Sources online (e.g., Stack Overflow, Real Python, Vilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded and pre-scripted events be performed using Selenium and that adaptive or extracted events be performed in PhantomJS.

To confirm this, we tested Selenium and PhantomJS on Mat Kelly's archival acid test (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i (the user interaction test) but pass all others. This indicates that both Selenium and PhantomJS have difficulty in identifying all events attached to all DOM elements (e.g., neither can easily detect the onchange event attached to the KBB.com drop downs).
Fig 6. The Acid Test is identical for PhantomJS and Selenium, failing the post-load interaction test.
VisualEvent is advertised as a bookmarklet-run solution for identifying client-side events, not an archival utility, but can reliably identify all of the event handlers attached to DOM elements. To improve the accuracy of the DOM Event Extraction, we have been using VisualEvent to discover the event handlers on the DOM.

VisualEvent takes the reverse approach to discovering the event handlers attached to DOM elements. Our approach -- which was ineffective -- was to use JavaScript to iterate through all DOM elements and try to discover the attached event handlers. VisualEvent starts with the JavaScript: it gets all of the JavaScript functions, determines which DOM elements reference those functions, and determines whether those functions are event handlers. VisualEvent then displays the interactive elements of the DOM (Figure 7) and their associated event handler functions (Figure 8) visually through an overlay in the browser. We removed the visual aspects and leverage the JavaScript functions to extract the interactive elements of the page.

Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.

Fig 8. The event handlers of each interactive element are pulled from the JavaScript and displayed on the page, as well.

We use PhantomJS to inject the VisualEvent code into a page, extract interactive elements, and then use PhantomJS to interact with those elements. This discovers states on the client that traditional crawlers like Heritrix cannot capture. Using this approach, PhantomJS can capture all interactive elements on the page, including the onchange events attached to the drop down menus on KBB.com.

So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this is a recommendation from our personal experience for our use case: automatically identifying a set of DOM interactions; other experiment conditions and goals may be better suited for Selenium and other client-side tools.

Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but we hope that our experiences are beneficial for others.

--Justin F. Brunelle

2015-06-26: JCDL 2015 Doctoral Consortium

Mat Kelly attended and presented at the JCDL 2015 Doctoral Consortium. This is his report.                           

Evaluating progress between milestones in a PhD program is difficult due to the inherent open-endedness of research. A means of evaluating whether a student's topic is sound and has merit while still early on in his career is to attend a doctoral consortium. Such an event, as the one held at the annual Joint Conference on Digital Libraries (JCDL), has previously provided a platform for WS-DL students (see 2014, 2013, 2012, and others) to network with faculty and researchers from other institutions as well as observe the approach that other PhD students at the same point in their career use to explain their respective topics.

As the wheels have turned, I have shown enough progress in my research for it to be suitable for preliminary presentation at the 2015 JCDL Doctoral Consortium -- and did so this past Sunday in Knoxville, Tennessee. Along with seven other graduate students from various universities throughout the world, I gave a twenty-minute presentation with ten to twenty minutes of feedback from an audience of the other presenting graduate students, faculty, and researchers.

Kazunari Sugiyama of National University of Singapore (where Hany SalahEldeen recently spent a semester as a research intern) welcomed everyone and briefly described the format of the consortium before getting underway. Each student was to have twenty minutes to present with ten to twenty minutes for feedback from the doctors and the other PhD students present.

The Presentations

The presentations were broken up into four topical categories. In the first section, "User's Relevance in Search", Sally Jo Cunningham introduced the two upcoming speakers. Sampath Jayarathna of Texas A&M University was the first presenter of the day with his topic, "Unifying Implicit and Explicit Feedback for Multi-Application User Interest Modeling". In his research, he asked users to type short queries, which he used to investigate methods for search optimization. He asked, "Can we combine implicit and semi-explicit feedback to create a unified user interest model based on multiple every day applications?". Using a browser-based annotation tool, users in his study were able to provide relevance feedback of the search results via explicit and implicit feedback. One of his hypotheses is that if he has a user model, he should be able to compare the model against explicit feedback that the user provides for providing better relevance of results.


After Sampath, Kathy Brennan of University of North Carolina presented her topic, "User Relevance Assessment of Personal Finance Information: What is the Role of Cognitive Abilities?". In her presentation she alluded to the similarities between buying a washer and dryer and obtaining a mortgage with respect to being an indicator of a person's cognitive abilities. "Even for really intelligent people, understanding prime and subprime rates can be a challenge," she said. One study she described analyzed rounding behavior, with stock prices as an example of the critical details observed by an individual. By psychometrically testing 69 different abilities while users analyzed documents for relevance, she found that someone with lower cognitive abilities will have a lower threshold for relevance and thus mark more documents as relevant than those with higher cognitive abilities. "However," she said, "those with a higher cognitive ability were doing a lot more in the same amount of time as those with lower cognitive abilities."

After a short coffee break, Richard Furuta of Texas A&M University introduced the two speakers of the second session, titled "Analysis and Construction of Archive". Yingying Yu of Dalian Maritime University presented first in this session with "Simulate the Evolution of Scientific Publication Repository via Agent-based Modeling". In her research, she is seeking to find candidate co-authors for academic publications based on a model that includes venue, popularity, and author importance as a partial set of parameters. "Sometimes scholars only focus on homogeneous networks," she said.


Mat Kelly (@machawk1, your author) presented second in the session with "A Framework for Aggregating Private and Public Web Archives". In my work, I described the issues of integrating private and public web archives with respect to access restrictions, privacy issues, and other concerns that would arise were the archives' results to be aggregated.


The conference then broke for boxed lunch and informal discussions amongst the attendees.


After resuming sessions after the lunch break, George Buchanan of City University of London welcomed everybody and introduced the two speakers of the third session of the day, "User Generated Contents for Better Service".


Faith Okite-Amughoro (@okitefay) of University of KwaZulu-Natal presented her topic, "The Effectiveness of Web 2.0 in Marketing Academic Library Services in Nigerian Universities: a Case Study of Selected Universities in South-South Nigeria". Faith noted that there has not been any assessment of how the libraries in her region of study have used Web 2.0 to market their services. "The real challenge is not how to manage their collection, staff and technology," she said, "but to turn these resources into services". She found that the most used Web 2.0 tools were social networking, video sharing, blogs, and generally places where users could add themselves.


Following Faith, Ziad Matni of Rutgers University presented his topic, "Using Social Media Data to Measure and Influence Community Well-Being". Ziad asked, "How can we gauge how well people are doing in their local communities through the data that they generate on social media?" He is currently looking for useful measures of the components of community well-being and their relationships with collective feelings of stress and tranquility (as he defined in his work). He is hoping to focus on one or two social indicators and to understand the factors that correlate the sentiment expressed on social media with a geographical community's well-being.


After Ziad's presentation, the group took a coffee break then started the last presentation session of the day, "Mining Valuable Contents". Kazunari Sugiyama (who welcomed the group at the beginning of the day) introduced the two speakers of the session.


The first presentation in this session was from Kahyun Choi of University of Illinois at Urbana-Champaign, who presented her work, "From Lyrics to Their Interpretations: Automated Reading between the Lines". In her work, she is trying to find the source of subject information for songs, under the assumption that machines might have difficulty analyzing songs' lyrics. She has three general research questions: the first relates lyrics and their interpretations, the second asks whether topic modeling can discover the subject of the interpretations, and the third concerns reliably obtaining the interpretations from the lyrics. She is training and testing a subject classifier for which she collected lyrics and their interpretations from SongMeanings.com. From this she obtained eight subject categories: religion, sex, drugs, parents, war, places, ex-lover, and death. With 100 songs in each category, she assigned each song only one subject. She then obtained the top ten interpretations per song to prevent the results from being skewed by songs with a large number of interpretations.


The final group presentation of the day was to come from Mumini Olatunji Omisore of Federal University of Technology with "A Classification Model for Mining Research Publications from Crowdsourced Data". Because of visa issues, he was unable to attend but planned on presenting via Skype or Google Hangouts. After changing wireless configurations, services, and many other attempts, the bandwidth at the conference venue proved insufficient and he was unable to present. A contingency was set up between him and the doctoral consortium organizers to review his slides.


Two-on-Two

Following the attempts to allow Mumini to present remotely, the consortium broke up into groups of four (two students and two doctors) for private consultations. The doctors in my group (Drs. Edie Rasmussen and Michael Nelson) provided extremely helpful feedback on both my presentation and my research objectives. Particularly valuable were their discussions of how I could go about improving the evaluation of my proposed research.

Overall, the JCDL Doctoral Consortium was a very valuable experience. By viewing how other PhD students were approaching their research and obtaining critical feedback on mine, I believe the experience to be priceless for improving the quality of one's PhD research.

— Mat (@machawk1)

Tuesday, June 9, 2015

2015-06-09: Mobile Mink merges the mobile and desktop webs

As part of my 9-to-5 job at The MITRE Corporation, I lead several STEM outreach efforts in the local academic community. One of our partnerships with the New Horizon's Governor's School for Science and Technology pairs high school seniors with professionals in STEM careers. Wes Jordan has been working with me since October 2014 as part of this program and for his senior mentorship project as a requirement for graduation from the Governor's School.

Wes has developed Mobile Mink (soon to be available in the Google Play store). Inspired by Mat Kelly's Mink add-on for Chrome, Wes adapted the functionality to an Android application. This blog post discusses the motivation for and operation of Mobile Mink.

Motivation

The growth of the mobile web has encouraged web archivists to focus on ensuring its thorough archiving. However, the mobile URIs are not as prevalent in the archives as their non-mobile (or as we will refer to them: desktop) URIs. This is apparent when we compare the TimeMaps of the Android Central site (with desktop URI http://www.androidcentral.com/ and a mobile URI http://m.androidcentral.com/).

TimeMap of the desktop Android Central URI
 The 2014 TimeMap in the Internet Archive of the desktop Android Central URI includes a large number of mementos with a small number of gaps in archival coverage.
TimeMap of the mobile Android Central URI
Alternatively, the TimeMap in the Internet Archive of the mobile Android Central URI has far fewer mementos and many more gaps in archival coverage.

This example illustrates the discrepancy between archival coverage of mobile vs desktop URIs. Additionally, as humans we can understand that these two URIs are representing content from the same site: Android Central. The connection between the URIs is represented in the live web, with mobile user-agents triggering a redirect to the mobile URI. This connection is lost during archiving.



The representations of the mobile and desktop URIs are different, even though a human will recognize the content as largely the same. Because archives commonly index by URI and archival datetime only, a machine may not be able to understand that these URIs are related.
The desktop Android Central representation
The mobile Android Central representation

Mobile Mink helps merge the mobile and desktop TimeMaps while also providing a mechanism to increase the archival coverage of mobile URIs. We detail these features in the Implementation section.

Implementation

Mobile Mink provides users with a merged TimeMap of mobile and desktop versions of the same site. We use the URI permutations detailed in McCown's work to transform desktop URIs to mobile URIs (e.g., http://www.androidcentral.com/ -> http://m.androidcentral.com/) and mobile URIs to desktop URIs (e.g., http://m.androidcentral.com/ -> http://www.androidcentral.com/). This process allows Mobile Mink to establish the connection between mobile and desktop URIs.
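To make the transformation concrete, here is a minimal Python sketch of the kind of host-name permutation involved; the prefix list and rules below are illustrative assumptions rather than the exact rule set Mobile Mink (an Android app) implements.

    from urllib.parse import urlsplit, urlunsplit

    MOBILE_PREFIXES = ("m.", "mobile.", "touch.")

    def desktop_to_mobile(uri):
        """Generate candidate mobile URIs for a desktop URI."""
        parts = urlsplit(uri)
        host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
        return [urlunsplit((parts.scheme, prefix + host, parts.path, parts.query, parts.fragment))
                for prefix in MOBILE_PREFIXES]

    def mobile_to_desktop(uri):
        """Generate candidate desktop URIs for a mobile URI."""
        parts = urlsplit(uri)
        host = parts.netloc
        for prefix in MOBILE_PREFIXES:
            if host.startswith(prefix):
                host = host[len(prefix):]
                break
        return [urlunsplit((parts.scheme, h, parts.path, parts.query, parts.fragment))
                for h in (host, "www." + host)]

    print(desktop_to_mobile("http://www.androidcentral.com/"))
    # ['http://m.androidcentral.com/', 'http://mobile.androidcentral.com/', 'http://touch.androidcentral.com/']
    print(mobile_to_desktop("http://m.androidcentral.com/"))
    # ['http://androidcentral.com/', 'http://www.androidcentral.com/']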



Merged TimeMap
With the mobile and desktop URIs identified, Mobile Mink uses Memento to retrieve the TimeMaps of both the desktop and mobile versions of the site. Mobile Mink merges all of the returned TimeMaps and sorts the mementos temporally, identifying the mementos of the mobile URIs with an orange icon of a mobile phone and the mementos of the desktop URIs with a green icon of a PC monitor.
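The Python sketch below shows the general idea (Mobile Mink itself is an Android app, so this is an analogue rather than its actual code); it assumes the Internet Archive's link-format TimeMap endpoint and uses a deliberately naive parser.

    from email.utils import parsedate_to_datetime
    import re
    import urllib.request

    TIMEMAP = "http://web.archive.org/web/timemap/link/{}"

    def mementos(uri):
        """Fetch a TimeMap and yield (datetime, memento URI, original URI) tuples."""
        with urllib.request.urlopen(TIMEMAP.format(uri)) as response:
            body = response.read().decode("utf-8", errors="replace")
        for line in body.splitlines():
            if 'memento"' not in line:          # skip the original/timegate/self links
                continue
            match = re.search(r'<([^>]+)>.*datetime="([^"]+)"', line)
            if match:
                yield (parsedate_to_datetime(match.group(2)), match.group(1), uri)

    desktop = "http://www.androidcentral.com/"
    mobile = "http://m.androidcentral.com/"

    # Merge both TimeMaps and sort temporally, tagging each memento with its
    # source URI (analogous to Mobile Mink's phone and monitor icons).
    merged = sorted(list(mementos(desktop)) + list(mementos(mobile)))
    for dt, memento_uri, original in merged:
        print("mobile " if original == mobile else "desktop", dt, memento_uri)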

To mitigate the discrepancy in archival coverage between the mobile and desktop URIs of web resources, Mobile Mink provides an option to allow users to push the mobile and desktop URIs to the Save Page Now feature at the Internet Archive and to Archive.today. This will allow Mobile Mink's users to actively archive mobile resources that may not be otherwise archived.
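On the Internet Archive side, a capture can be requested by simply dereferencing the "Save Page Now" URI for a page, as in the hedged Python sketch below; this is not Mobile Mink's actual code, and the Archive.today submission is omitted because its interface is less uniform.

    import urllib.request

    def save_page_now(uri):
        """Ask the Wayback Machine to capture a page via its Save Page Now endpoint."""
        request = urllib.request.Request(
            "https://web.archive.org/save/" + uri,
            headers={"User-Agent": "mobile-mink-sketch"})
        with urllib.request.urlopen(request) as response:
            return response.status

    # Push both the desktop and the mobile URI so each variant gets archived.
    for uri in ("http://www.androidcentral.com/", "http://m.androidcentral.com/"):
        print(uri, save_page_now(uri))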

These features mirror the functionality of Mink by providing users with a TimeMap of the site currently being viewed, but extends Mink's functionality by providing the merged mobile and desktop TimeMap. Mink also provides a feature to submit URIs to Archive.today and the Save Page Now feature, but Mobile Mink extends this functionality by submitting the mobile and desktop URIs to these two archival services.

Demonstration

The video below provides a demo of Mobile Mink. We use the Chrome browser and navigate to http://www.androidcentral.com/, which redirects us to http://m.androidcentral.com/. From the browser menu, we select the "Share" option. When we select the "View Mementos" option, Mobile Mink provides the aggregate TimeMap. Selecting the icon in the top right corner, we can access the menu to submit the mobile and desktop URIs to Archive.today and/or the Internet Archive.


Next Steps

We plan to release Mobile Mink in the Google Play store in the next few weeks. In the meantime, please feel free to download and use the app from Wes's GitHub repository (https://github.com/Thing342/MobileMemento) and provide feedback through the issues tracker (https://github.com/Thing342/MobileMemento/issues). We will continue to test and refine the software moving forward.

Wes's demo of MobileMink was accepted at JCDL2015. Because he is graduating in June and preparing to start his collegiate career at Virginia Tech, someone from the WS-DL lab will be presenting his work on his behalf. However, we hope to convince Wes to come to the Dark Side and join the WS-DL lab in the future. We have cookies.

--Justin F. Brunelle

2015-06-09: Web Archiving Collaboration: New Tools and Models Trip Report

Mat Kelly and Michele Weigle travel to and present at the Web Archiving Collaboration Conference in NYC.                           

On June 4 and 5, 2015, Dr. Weigle (@weiglemc) and I (@machawk1) traveled to New York City to attend the Web Archiving Collaboration conference held at the Columbia School of International and Public Affairs. The conference gave us an opportunity to present our work from the incentive award provided to us by Columbia University Libraries and the Andrew W. Mellon Foundation in 2014.

Robert Wolven of Columbia University Libraries started off the conference by welcoming the audience and emphasizing the variety of presentations that were to occur that day. He then introduced Jim Neal, the keynote speaker.

Jim Neal started by noting the challenges of "repository chaos", namely, which version of a document should be cited for online resources if multiple versions exist. "Born-digital content must deal with integrity", he said, "and remain as unimpaired and undivided as possible to ensure scholarly access."

Brian Carver (@brianwc) and Michael Lissner (@mlissner) of the Free Law Project (@freelawproject) followed the keynote, with Brian first stating, "Too frequently I encounter public access systems that have utterly useless tools on top of them and I think that is unfair." He described his project's efforts to make available court data from the wide variety of systems deployed by various courts on the web. "A one-size-fits-all solution cannot guarantee this across hundreds of different court websites," he stated, further explaining that each site needs its own scraping algorithm to extract content.

To facilitate the crowdsourcing of scraping algorithms, he has created a system where users can supply "recipes" to extract content from the courts' sites as they are posted. "Everything I work with is in the public domain. If anyone says otherwise, I will fight them about it," he said, regarding the demands people have brought to him upon finding their names in the now-accessible court documents. "We still find courts using WordPerfect. They can cling to old technology like no one else."

Free Law Project slides

Shailin Thomas (@shailinthomas) and Jack Cushman from the Berkman Center for Internet and Society, Harvard University, spoke next about Perma.cc. "Of the digital citations in the Harvard Law Review from the last 10 years, 73% of the online links were broken. Over 50% of the links cited by the Supreme Court are broken." They continued by describing the Perma API and the recent Memento compliance.

Perma.cc slides

After a short break, Deborah Kempe (@nyarcist) of the Frick Art Reference Library described her recent observation that there is a digital shift of art moving to the Internet. She has been working with both Archive-It and Hanzo Archives, for quality assurance of captured websites and for on-demand captures of sites that her organization found particularly challenging, respectively. One example of the latter is Wangechi Mutu's site, which has an animation on the homepage that Archive-It was unable to capture but Hanzo was.

In the same session, Lily Pregill (@technelily) of NYARC stated, "We needed a discovery system to unite NYARC arcade and our Archive-It collection. We anticipated creating yet another silo of an archive." While she stated that the user interface is still under construction, it does allow the results of her organization's archive to be supplemented with results from Archive-It.

New York Art Resources Consortium (NYARC) slides

Following Lily in the session, Anna Perricci (@AnnaPerricci) of Columbia University Libraries talked about the Contemporary Composers Web Archive, which has 11 participating curators and 56 sites currently available in Archive-It. The "Ivies Plus" collaboration has Columbia building web archives with seeds chosen by subject specialists from Ivy League universities along with a few other universities.

Ivies Plus slides

In the same session, Alex Thurman (@athurman) (also from Columbia) presented on the IIPC Collaborative Collection Initiative. He referenced the varying legal environments across member countries, with some members able to do full TLD crawling while others (namely, in the U.S.) have no protection from copyright. He spoke of the preservation of Olympics web sites from 2010, 2012, and 2014 - the latter being the first Olympics logo to contain a web address. "Though Archive-It had a higher upfront cost," he said about the initial weighing of various options for Olympic website archiving, "it was all-inclusive of preservation, indexing, metadata, replay, etc." To publish their collections, they are looking into utilizing the .int TLD, which is reserved for internationally significant information but is underutilized, in that only about 100 such sites exist, all of which have research value.

International Internet Preservation Consortium collaborative collections slides

The conference then broke for a provided lunch then started with Lightning Talks.

To start off the lightning talks, Michael Lissner (@mlissner) spoke about RECAP: what it is, what it has done, and what is next for the project. Much of the content contained within the Public Access to Court Electronic Records (PACER) system is paywalled public domain documents. Obtaining the documents costs users ten cents per page with a three dollar maximum. "To download the Lehman Brothers proceedings would cost $27,000," he said. His system leverages the user's browser via the extension framework to save a copy of each document a user downloads to the Internet Archive, and to first query the archive to see if the document has been previously downloaded.

Dragan Espenschied (@despens) gave the next lightning talk, about preserving digital art pieces, namely those on the web. He noted one particular example where the artist extensively used scrollbars, which are less commonplace in user interfaces today. To accurately re-experience the work, he fired up a browser-based MacOS 9 emulator.

Jefferson Bailey (@jefferson_bail) followed Dragan with his work investigating archive access methods that are not URI-centric. He has begun working with WATs (web archive transformations), LGAs (longitudinal graph analyses), and WANEs (web archive named entities).

Dan Chudnov (@dchud) then spoke of his work at GWU Libraries. He had developed Social Feed Manager, a Django application to collect social media data from Twitter. Previously, researchers had been copying and pasting tweets into Excel documents; his tool automates this process. "We want to 1. See how to present this stuff, 2. Do analytics to see what's in the data and 3. Find out how to document the now. What do you collect for live events? What keywords are arising? Whose info should you collect?" he said.

Jack Cushman from Perma.cc gave the next lightning talk about ToolsForTimeTravel.org, a site that is trying to make a strong dark archive. The concept would prevent archivists from reading material within until conditions are met. Examples where this would be applicable are the IRA Archive at Boston College, Hillary Clinton's e-mails, etc.

With the completion of the Lightning Talks, Jimmy Lin (@lintool) of University of Maryland and Ian Milligan (@ianmilligan1) of University of Waterloo rhetorically asked, "When does an event become history?" stating that history is written 20 to 30 years after an event has occurred. "History of the 60s was written in the 30s. Where are the Monica Lewinsky web pages now? We are getting ready to write the history of the 1990s.", Jimmy said. "Users can't do much with current web archives. It's hard to develop tools for non-existent users. We need deep collaborations between users (archivists, journalists, historians, digital humanists, etc.) and tool builders. What would a modern archiving platform built on big data infrastructure look like?" He compared his recent work in creating warcbase with the monolithic OpenWayback Tomcat application. "Existing tools are not adequate."

Warcbase: Building a scalable platform on HBase and Hadoop slides (part 1)

Ian then talked about warcbase as an open source platform for managing web archives with Hadoop and HBase. WARC data is ingested into HBase and Spark is used for text analysis and services.

Warcbase: Building a scalable platform on HBase and Hadoop slides (part 2)

Zhiwu Xie (@zxie) of Virginia Tech then presented his group's work on maintaining web site persistence when the original site is no longer available. Using an approach akin to a proxy server, the content served when the site was last available continues to be served in lieu of the live site. "If we have an archive that archives every change of that web site and the website goes down, we can use the archive to fill the downtimes," he said.

Archiving transactions towards an uninterruptible web service slides

Mat Kelly (@machawk1, your author) presented next with "Visualizing digital collections of web archives" where I described the SimHash archival summarization strategy to efficiently generate a visual representation of how a web page changed over time. In the work, I created a stand-alone interface, Wayback add-on, and embeddable service for a summarization to be generated for a live web page. At the close of the presentation, I attempted a live demo.

WS-DL's own Michele Weigle (@weiglemc) next presented Yasmin's (@yasmina_anwar) work on Detecting Off-Topic Pages. The recently accepted TPDL 2015 paper looked at how pages in Archive-It collections have changed over time and how to detect when a page is no longer relevant to what the archivist intended to capture. She used six similarity metrics and found that cosine similarity performed the best.

In the final presentation of the day, Andrea Goethals (@andreagoethals) of Harvard Library and Stephen Abrams of California Digital Library discussed difficulties in keeping up with web archiving locally, citing the outdated tools and systems. A hierarchical diagram of a potential collaborative model they showed piqued the audience's interest as being overcomplicated for smaller archives.

Exploring a national collaborative model for web archiving slides

To close out the day, Robert Wolven gave a synopsis of the challenges to come and expressed his hope that there was something for everyone.

Day 2

The second day of the conference contained multiple concurrent topical sessions that were somewhat open-ended to facilitate more group discussion. I initially attended David Rosenthal's talk, where he discussed the need for tools and APIs for integration into various systems for standardization of access. "A random URL on the web has less than 50% chance of getting preserved anywhere," he said. "We need to use resources as efficiently as possible to up that percentage."

DSHR then discussed repairing archives for bit-level integrity and LOCKSS' approach to accomplishing it. "How would we go about establishing a standard archival vocabulary?" he asked. "'Crawl scope' means something different in Archive-It vs. other systems."

I then changed rooms to catch the last half hour of Dragan Espenschied's session, where he discussed pywb (the software behind webrecorder.io) in more depth. The software allows people to create their own public and private archives and offers a pass-through mode in which it does not record login information. Further, it can capture embedded YouTube and Google Maps.

Following the first set of concurrent sessions, I attended Ian Milligan's demo of utilizing warcbase for analysis of Canadian Political Parties (a private repo as of this writing but will be public once cleaned up). He also demonstrated using Web Archives for Historical Research. In the subsequent and final presentation of day 2, Jefferson Bailey demonstrated Vinay Goel's (@vinaygo) Archive Research Services Workshop, which was created to serve as an introduction to data mining and computational tools and methods for work with web archives for researchers, developers, and general users. The system utilizes the WAT, LGA, and WANE derived data formats that Jefferson spoke of in his Day 1 Lightning talk.

After Jefferson's talk, Robert Wolven again collected everyone into a single session to go over what was discussed in each session on the second day and gave a final closing.

Overall, the conference was very interesting and very relevant to my research in web archiving. I hope to dig into some of the projects and resources I learned about and follow up with contacts I made at the Columbia Web Archiving Collaboration conference.

— Mat (@machawk1)