Monday, July 24, 2017

2017-07-24: Replacing Heritrix with Chrome in WAIL, and the release of node-warc, node-cdxj, and Squidwarc


I have written posts detailing how an archive's modifications to the JavaScript of a web page being replayed collided with the JavaScript libraries used by the page and how JavaScript plus CORS is a deadly combination during replay. Today I am here to announce the release of a suite of high-fidelity web archiving tools that help mitigate the problems a dynamic, JavaScript-powered web poses for web archiving. To demonstrate this, consider the image above: the left-hand screenshot shows today's cnn.com archived and replayed in WAIL, whereas the right-hand screenshot shows cnn.com in the Internet Archive on 2017-07-24T16:00:02.

In this post, I will be covering:
  • WAIL
  • node-warc
  • node-cdxj
  • Squidwarc


WAIL


Let me begin by announcing that WAIL has transitioned away from using Heritrix as the primary preservation method. Instead, WAIL now directly uses a full Chrome browser (provided by Electron) as the preservation crawler. WAIL does not use WARCreate, Browsertrix, brozzler, Webrecorder, or a derivation of one of these tools, but rather my own special sauce. The special sauce powering these crawls has been open sourced and made available through node-warc. Rest assured, WAIL still provides auto-configured Heritrix crawls, but the Page Only, Page + Same Domain Links, and Page + All Links crawls now use a full Chrome browser in combination with automatic scrolling of the page and the Pepper Flash Plugin. You can download this version of WAIL today. Oh, and did I mention that WAIL's browser-based crawls pass Mat Kelly's Archival Acid Test?

But I thought WAIL was already using a full Chrome browser and a modified WARCreate for the Page Only crawl? Yes, that is correct, but the key aspect here is the "modified" in modified WARCreate. WARCreate was modified for automation: to use Node.js buffers, to re-request every resource besides the fully rendered page, and to work in Electron, which is not an extension environment. What the two shared was saving both the rendered page and the request/response headers. So how did I do this, and what kind of black magic did I use to achieve it? Enter node-warc.

Before I continue


It is easy to forget which tool did this first and continues to do it extremely well. That tool is WARCreate. None of this would be possible if WARCreate had not done it first and I had not cut my teeth on Mat Kelly's projects. So look for this very same functionality to come to WARCreate in the near future, as Chrome and the Extension APIs have matured beyond what was initially available to Mat at WARCreate's inception. It still amazes me that he was able to get WARCreate to do what it does in the hostile environment that is Chrome Extensions. Thanks Mat! Now get that Ph.D. so that we come closer to not being concerned with WARCs containing cookies and other sensitive information.


node-warc


node-warc started out as a Node.js library for reading WARC files, built to address my dislike for how other libraries would crash and burn if they encountered an off-by-one error (webarchiveplayer, OpenWayback, the Pywb indexers), and to provide something more performant with a nicer API than the only other WARC library on npm, which is three years old, has seen no updates, and does not handle gzip. As I worked on making the WAIL-provided crawls awesome and on Squidwarc, it also became a good home for the browser-based preservation side of handling WARCs. node-warc is now a one-stop shop for reading and creating WARC files using Node.js.

On the reading side, node-warc supports both gzipped and non-gzipped WARCs. An example of how to get started reading WARCs using node-warc is shown below, with the API documentation available online.
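The sketch below shows roughly what that looks like. The class and event names (AutoWARCParser, 'record', 'done', 'error') are written from memory and should be treated as assumptions; the API documentation linked above is authoritative.

```javascript
// A minimal sketch, assuming node-warc exposes an AutoWARCParser that detects
// gzipped vs. non-gzipped WARCs and streams the file record by record.
// The class, event, and property names here are assumptions; check the docs.
const { AutoWARCParser } = require('node-warc')

const parser = new AutoWARCParser('path/to/archive.warc.gz')

// Emitted once per WARC record; the record object carries the WARC headers
// and the record content.
parser.on('record', record => {
  console.log(record)
})

// Emitted after the last record in the file has been parsed.
parser.on('done', () => {
  console.log('finished parsing')
})

parser.on('error', error => {
  console.error(error)
})

parser.start()
```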

How performant is node-warc? Below are performance metrics for parsing both gzipped and un-gzipped WARC files of different sizes.

un-gzipped

size       records   time   max process memory
145.9 MB   8,026     2 s    22 MiB
268 MB     852       2 s    77 MiB
2 GB       76,980    21 s   100 MiB
4.8 GB     185,662   1 m    144.3 MiB

gzipped

size       records   time     max process memory
7.7 MB     1,269     297 ms   7.1 MiB
819.1 MB   34,253    16 s     190.3 MiB
2.3 GB     68,020    45 s     197.6 MiB
5.3 GB     269,464   4 m      198.2 MiB

Now to the fun part. node-warc provides the means for archiving the web using the Chrome browser provided by Electron, or using Chrome or headless Chrome through chrome-remote-interface, a Node.js wrapper for the DevTools Protocol. If you wish to use this library for preservation with Electron, use ElectronRequestCapturer and ElectronWARCGenerator. The Electron archiving capabilities were developed in WAIL and then put into node-warc so that others can build high-fidelity web archiving tools using Electron. If you need an example to help you get started, consult wail-archiver.

For use with Chrome via chrome-remote-interface, use RemoteChromeRequestCapturer and RemoteChromeWARCGenerator. The Chrome-specific portion of node-warc came from developing Squidwarc, a high-fidelity archival crawler that uses Chrome or headless Chrome. Both the Electron and remote Chrome WARCGenerator and RequestCapturer share the same DevTools Protocol, but each has its own way of accessing that API. node-warc takes care of that for you by providing a unified API for both Electron and Chrome. The special sauce here is that node-warc retrieves the response body from Chrome/Electron simply by asking for it, and Chrome/Electron gives it to us. It is that simple. Documentation for node-warc is available via n0tan3rd.github.io/node-warc, and the library is released on GitHub under the MIT license. node-warc welcomes contributions and hopes that it will be found useful. Download it today using npm (npm install node-warc or yarn add node-warc).
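To make the Chrome path more concrete, here is a rough sketch of how a capture could be wired up with chrome-remote-interface. The DevTools Protocol calls (Network.enable, Page.enable, Page.navigate, Page.loadEventFired, and, internally, Network.getResponseBody) are real protocol methods; the constructor and method names I use on the node-warc classes are approximations from memory rather than the library's verified API, so consult the documentation above before borrowing this.

```javascript
// Sketch only: the node-warc class usage below is an approximation.
const CDP = require('chrome-remote-interface')
const { RemoteChromeRequestCapturer, RemoteChromeWARCGenerator } = require('node-warc')

CDP(async client => {
  const { Network, Page } = client
  await Network.enable()
  await Page.enable()

  // The capturer listens to Network domain events (requestWillBeSent,
  // responseReceived, ...) and remembers every request/response pair.
  const capturer = new RemoteChromeRequestCapturer(Network) // assumed constructor

  await Page.navigate({ url: 'http://example.com' })
  await Page.loadEventFired()

  // The generator asks the browser for each response body (Network.getResponseBody)
  // and writes the request and response records to a WARC file.
  const generator = new RemoteChromeWARCGenerator()
  await generator.generateWARC(capturer, Network, { warcPath: 'example.com.warc' }) // assumed signature

  await client.close()
}).on('error', err => {
  console.error('Could not connect to Chrome:', err)
})
```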

node-cdxj


The companion library to node-warc is node-cdxj (cdxj on npm), a Node.js library for parsing CDXJ files, the index format commonly used by Pywb. An example is seen below.
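Each CDXJ line pairs a SURT-ordered URL key and a 14-digit timestamp with a JSON block describing the capture. The hand-rolled parse below is only meant to illustrate that format, not node-cdxj's actual classes or methods, which are covered by the API documentation linked in the next paragraph.

```javascript
const fs = require('fs')

// Illustrative only: node-cdxj's real API is in its documentation.
// A CDXJ line looks like:
//   com,example)/ 20170724160002 {"url": "http://example.com/", "mime": "text/html", "status": "200"}
function parseCdxjLine (line) {
  const firstSpace = line.indexOf(' ')
  const secondSpace = line.indexOf(' ', firstSpace + 1)
  return {
    surt: line.slice(0, firstSpace),                    // SURT-ordered URL key
    timestamp: line.slice(firstSpace + 1, secondSpace), // 14-digit capture timestamp
    info: JSON.parse(line.slice(secondSpace + 1))       // JSON metadata block
  }
}

fs.readFileSync('index.cdxj', 'utf8')
  .split('\n')
  .filter(line => line.trim().length > 0)
  .map(parseCdxjLine)
  .forEach(record => console.log(record.surt, record.timestamp, record.info.status))
```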

node-cdxj is distributed via GitHub and npm (npm install cdxj or yarn add cdxj). Full API documentation is available via n0tan3rd.github.io/node-cdxj, and the library is released under the MIT license.

Squidwarc


Now that Vitaly Slobodin has stepped down as the maintainer of PhantomJS (it's dead, Jim) in deference to headless Chrome, it is with great pleasure that I introduce to you today Squidwarc, a high-fidelity archival crawler that uses Chrome or headless Chrome directly. Squidwarc aims to address the need for a high-fidelity crawler akin to Heritrix while remaining easy enough for the personal archivist to set up and use. Squidwarc does not seek (at the moment) to dethrone Heritrix as the queen of wide archival crawls; rather, it seeks to address Heritrix's shortcomings, namely:
  • No JavaScript execution
  • Everything is plain text
  • Requiring configuration to know how to preserve the web
  • Setup time and technical knowledge required of its users
Those are some bold (cl)aims. Yes, they are, but in comparison to other web archiving tools, using Chrome as the crawler makes sense. Plus, to quote Vitaly Slobodin:
Chrome is faster and more stable than PhantomJS. And it doesn't eat memory like crazy.
So why work hard when you can let the Chrome devs do a lot of the hard work for you? They must keep up with the crazy, fast-changing world of web development, so why shouldn't the web archiving community utilize that to our advantage? I think we should, at least, and that is why I created Squidwarc. This reminds me of the series of articles Kalev Leetaru wrote entitled Why Are Libraries Failing At Web Archiving And Are We Losing Our Digital History?, Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web and The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web? Well, sir, I present to you Squidwarc, an archival crawler that can handle the ever-changing and dynamic web. I have shown you mine; what do your crawlers look like?

Squidwarc is an HTTP/1.1 and HTTP/2, GET, POST, HEAD, OPTIONS, PUT, and DELETE preserving, JavaScript-executing, page-interacting archival crawler (just to name a few of its capabilities). And yes, it can do all that. If you doubt me, see the documentation for what Squidwarc is capable of through chrome-remote-interface and node-warc. Squidwarc is different from brozzler in that it supports both Chrome and headless Chrome right out of the box, does not require a middleman to capture the requests and create the WARC, makes full use of the DevTools Protocol thanks to being a Node.js-based crawler (Google approved), and is simpler to set up and use.

So what can be done with Squidwarc at its current stage? I created a video demonstrating the technique described by Dr. Justin Brunelle in Adapting the Hypercube Model to Archive Deferred Representations and Their Descendants, which can be viewed below. The code used in this video is on GitHub, as is Squidwarc itself.



The crawls operate in terms of a composite memento. For those who are unfamiliar with this terminology, a composite memento is a root resource, such as an HTML web page, and all of the embedded resources (images, CSS, etc.) required for a complete presentation. An example crawl configuration (currently a JSON file) is seen below; note that the annotations (comments) are not valid JSON. A non-annotated configuration file is provided in the Squidwarc GitHub repository.
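A minimal illustrative configuration is sketched below. The field names are my guess at what such a file contains based on the crawl types described next, not a verified copy of Squidwarc's schema, so treat the annotated and non-annotated examples in the repository as authoritative. As noted above, the comments are not valid JSON.

```
{
  // Whether to drive full Chrome or headless Chrome (field names assumed)
  "use": "chrome",
  "headless": false,
  // Crawl type: page-only, page-same-domain, or page-all-links (defined below)
  "mode": "page-only",
  // How many hops to follow from each seed page
  "depth": 1,
  // Seed pages; the frontier is pages, not the individual resources of a page
  "seeds": [
    "http://n0tan3rd.github.io/wail/sameDomain1"
  ],
  // The DevTools endpoint of the browser being driven
  "connect": {
    "host": "localhost",
    "port": 9222
  },
  // Where and how the resulting WARC files are written
  "warc": {
    "naming": "url",
    "output": "warcs"
  }
}
```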

The definitions for them are seen below; remember, Squidwarc crawls operate in terms of a composite memento. The frontier consists of web pages, not web pages plus the resources of a web page (Chrome retrieves those for us automatically).
  • page-only: Preserve the page so that there is no difference between replaying the page and viewing the page in a web browser at preservation time.
  • page-same-domain: The page-only option plus preservation of all links found on the page that are on the same domain as the page.
  • page-all-links: page-same-domain plus all links from other domains.
Below is a video demonstrating that the Squidwarc crawl types do in fact preserve the page only, the page + same-domain links, and the page + all links, using the initial seed n0tan3rd.github.io/wail/sameDomain1.



Squidwarc is an open source project available on GitHub. Squidwarc is not yet available via npm, but you can begin using it by cloning the repo. Let us build the next phase of web archiving together. Squidwarc welcomes all who wish to be part of its development, and if you run into any issues, feel free to open one up.

Both WAIL and Squidwarc use node-warc and a Chrome browser for preservation. If portability and zero setup are what you seek, download and start using WAIL. If you just want to use the crawler, clone the Squidwarc repository and begin preserving the web using your Chrome browser today. All the projects in this blog post welcome contributions as well as issues via GitHub. Neither the excuse "I guess the crawler was not configured to preserve this page" nor the term "unarchivable page" is valid any longer in the age of browser-based preservation.

- John Berlin

Wednesday, July 19, 2017

2017-07-19: Archives Unleashed 4.0: Web Archive Datathon Trip Report


They: Hey Sawood, nice to see you again.
Me: Hi, I am glad to see you too.
They: Did you attend all hackathons, I mean datathons?
Me: Yes, I attended all of the four Archives Unleashed events so far.
They: How did you like it?
Me: Well, there is a reason why I attended all of them, despite being a seemingly busy PhD researcher.
They: So, what is your research about?
Me: I am trying to profile various web archives to build a high-level understanding of their holdings, primarily, for the sake of efficiently routing Memento aggregation requests, but there can be many more use cases of such profiles... [and the conversation continues...]


On day zero of Archives Unleashed 4.0 in London, conversations among many familiar and unfamiliar faces started with travel- and lodging-related questions, but soon turned to mass storage challenges, scaling issues, the quality and coverage of web archives, long-term maintenance of archival tools, documentation and discovery of libraries, the exchange of research ideas, and so on. Ian and Matt were looking fresh and welcoming at the #HackArchives reception, as always. This was all familiar; it is how previous AU events started too, and it yielded great networking among web archiving community members.


Previously, the Web Science and Digital Libraries Research Group (WSDL) has been well-represented at AU events, but this time visa issues and competing events meant that only Mat and I were able to attend.


The next day, on Monday, June 12, 2017, the main event started at the British Library in the morning with the usual registration process, a welcome kit, and strange, AU-branded, 3D-printed-looking red rubber balls (that no one had any idea what to do with). Dr. Matthew Weber and Dr. Ian Milligan began with the opening remarks, described the scope of the event, and presented the available dataset and other resources.


Next was the current efforts session, for which Ian, Jefferson, Tom, and Andy were supposed to talk about Warcbase, Internet Archive APIs, National Archives Datasets, and the UK Web Archive, respectively. Since Jefferson could not make it to the event on time, Ian had to morph into Jefferson for the corresponding talk about IA APIs. All of these talks were very insightful and offered a lot to learn from.

Possibly the most interesting aspect of AU events is the phenomenon of group formation. People and idea stickers flock around the room and naturally cluster into smaller groups with similar interests to come up with more precise research questions and datasets to use. This time, a total of eight different groups formed, with a diverse set of research questions and scopes.


After the lunch break, teams settled at their tables and started worrying about task refinement, computing resources, data acquisition, and action plans. One of the most difficult issues at AU events is the problem of dataset acquisition. Advertised datasets are often not in an easy-to-get condition. Additionally, these datasets are often too large to be copied over to the respective computing instances in a feasible amount of time. Some preprocessing and sampling can be helpful. Additionally, complex (and often unknown) authentication barriers should be removed from the data acquisition process. On one hand, it is part of the learning process to acquire and understand the data and learn about other tools to create derivative data; on the other hand, I have consistently noticed that this process is difficult and limits the opportunity for actual data analysis.

Another very useful aspect of AU events is the opportunity for people to share their current projects and efforts in the field of web archiving through short lightning talks. In the past we have taken advantage of this to introduce various WSDL efforts such as MemGator, IPWB, CarbonDate, WhatDidItLookLike, and ICanHazMemento. Following the tradition, this time too there were a handful of lightning talks lined up for both days.
After the first round of five lightning talks, teams went back to their hacking tasks, mostly trying to acquire datasets, understand them, and adjust their ambitious plans to something more feasible within the short time limit. Then everyone left for dinner while discussing the ideas and scope of their work with their team members. The dinner was really good, but it did not stop people from exchanging world-shaking ideas.


The next morning many teams were talking about how much data they had processed overnight and what to do next. The next couple of hours were critical for every team to come up with something that provided answers to their proposed research questions. After another session of lightning talks, teams continued to work on their projects, but now they started thinking about the reporting and visualization of their findings as more and more results became apparent. The efforts continued during and after the short lunch break. One could see people multi-tasking to get everything done before the final presentations, which were only a coffee break away, but some people still had the courage to put everything aside for a while and go for a walk outside. Not every team was working on data analysis, but the overall experience was still generalizable. Finally, the time arrived for brief project presentations and the sharing of the findings of the "Samudra Manthan" in front of three esteemed judges from the British Library.
  • Team Portuguese Archive presented their outcome of archived image classification using TensorFlow. As a testbed they used maps to distinguish contemporary maps from historic maps.
  • Team Intersect (of which I was a member) presented the archival coverage of the Occupy Wall Street movement in various collections and social media, along with the overlap among various datasets. They found less than 1% overlap among the different datasets, which suggests that more collectors means better coverage. They also found that two-thirds of the outlinks from these collections were not archived.
  • The Olympians presented gender distribution in Olympic committees and found strong male bias.
  • Team Shipman Report analyzed text in Shipman Report and found it deadly and dark.
  • Team Links analyzed WARC files to find trends in the distribution of relative/absolute paths and absolute URLs in anchor elements, along with the distribution of HTML elements around anchors, over time.
  • Team Robots analyzed different types of robots.txt files in web archives with the intent of finding the impact on archival captures if robots.txt were honored. They found that the impact would not be huge.
  • Team Curated built a prototype of an upcoming Rhizome tool for better curation and annotation. They illustrated some wire frame prototypes of various components and workflow.
  • Team WARCs peeked inside WARC files for traces of politics and elections in the US.
While the judges were deciding the winners, Ian wrapped up the event by looking back at the past two days and briefly mentioning the highlights. He gave a vote of thanks to all the individuals and sponsoring organizations who supported the event in various ways, including data and computing resources, venue and logistics, and travel grants. The judges' verdict was in: Team Links, Team Robots, and Team Intersect were found guilty of being the best. Everyone was a winner, but some performed more efficiently than others within a very short span of time. I am sure every team had much more to show than what they could in the short five-minute presentations.

Now it was time to disperse and continue exchanging ideas over drinks and dinner while getting ready for the rest of the Web Archiving Week events.

They: So, Sawood, are you planning to continue attending all future AU events?  
Me: I hope so! ;-)


--
Sawood Alam

Thursday, July 6, 2017

2017-07-06: Web Science 2017 Trip Report

I was fortunate enough to have the opportunity to present Yasmin AlNoamany's work at Web Science 2017. Dr. Nelson offers an excellent class on Web Science, but it has been years since I took it, and I was still uncertain about the current state of the art.
Web Science 2017 took place in Troy, a small city in upstate New York that is home to Rensselaer Polytechnic Institute (RPI). The RPI team organized an excellent conference focused on a variety of Web Science topics, including cyberbullying, taxonomies, social media, and ethics.

Keynote Speakers


Day One


The opening keynote by Steffen Staab from the Institute for Web Science and Technologies (WeST) was entitled "The Web We Want". He discussed how we need to determine what values we want to meet before deciding on the web we want. Dr. Staab defined three key values: accessibility for the disabled, freedom from harassment, and a useful semantic web.

Staab detailed the MAMEM project, whose goal is to provide access to the web for the disabled, accounting for those without the ability to operate a mouse and keyboard as well as those who cannot see or hear. He mentioned that the z-stacking used by the Google search engine's textbox frustrates a lot of accessibility tools.


On the topic of harassment, Staab indicated that we need to determine the roles and structure used by people in social networks. Who is each person linked to? Are they initiators or do they join discussions later? Are they trolls? Are they contributors? Are they moderators? Can we differentiate these roles based on previous experience? He showed the procedure by which the ROBUST project classifies users into each of these roles with the goal of providing an early response to trolls, attacks, and spam.




For a useful Semantic Web, Staab stressed the importance of data that is interlinked and allows us to further describe entities. Most quality assessments for existing links don't take into account the usefulness of the data. How close are we to benchmarking this usefulness? It depends on the application. So far, we have recommender systems that work based on what someone else said was useful, but that may not fit the needs of the user under consideration. Tests of usefulness are further frustrated by the fact that people behave differently during testing the second time through.

He closed by stressing that we need to measure our achievement of these goals. As we measure the achievement of these values, it inspires our engineering and this same measurement is required to understand how well an engineering solution works.

Day Two


The second day keynote was by Jennifer Golbeck, world leader in social media research and science communication, creator of the field of social analytics, and professor at the University of Maryland. She started by talking about one of the reviewers of her paper, "A Large Labeled Corpus for Online Harassment Research", submitted to Web Science 2017. This reviewer liked the work done on social media harassment, but objected to the inclusion of harassing tweets as evidence in the paper. Dr. Golbeck objected to this idea that we should not include evidence in scientific papers, no matter how offensive it may be. She used the story of the upset reviewer throughout the rest of her talk.
Her research tries to answer questions such as who is posting harassing content and why. She also mentioned that Twitter will often not help you block someone if you report them. In order to study the phenomenon, she sought out harassing tweets on Twitter. Fortunately, there is a low density of harassing tweets in the Twitter firehose. After several unsuccessful methods, including blocklists like Block Together, she resorted to finding harassing tweets using Twitter searches that combine expletives and the names of marginalized groups. Harassment is directed at these groups because they have less power, and it is intended to silence them. Sadly, 50% of the tweets containing the word "feminist" are part of harassment on Twitter.


She discovered that there were several main groups of harassers in the data, labeled Gamergate, Trump Supporters, Alt-right, UK-based Brexit/anti-Muslim, and "Anime". This does not mean that all Trump Supporters harass people, but there is a large group of harassers that are Trump Supporters. The "Anime" group seemed to be very interested in the Japanese cartoon style, but not all Anime fans are trolls, and likewise with other descriptors.


She also highlighted the work "Trolls Just Want to Have Fun" by Buckels, Trapnell, and Paulhus. Buckels discovered that trolls exhibit a higher percentage of the following personality traits: Machiavellianism, narcissism, psychopathy, and sadism. This was contrasted with those who merely engage in debating issues. These personality traits are known in psychological circles as the Dark Tetrad of personality, identifying individuals who are more likely to cause "social distress".


In spite of some of this progress, we still have no idea of the full picture of harassment on Twitter. One would need to learn the language of the communities under study, both of harassers and victims, in order to fully discover all of the harassment going on.  This makes members of these groups -- women, minorities, etc. -- more careful about what they say on social media because they have to weigh the potential harassment before even speaking. She returned to the reviewer of her paper and stated that she included the Tweets not only as evidence, but because the more we are silent on these issues, the more they will continue.

I Presented Yasmin AlNoamany's Work


On the third day, I was fortunate enough to present Dr. AlNoamany's work on using storytelling tools to summarize Archive-It collections. She uses the example of the Egyptian Revolution, much of which was recorded online in real time, as a use case for summarizing web archive collections. Many of the web resources from the Revolution are now gone from the live web but have been preserved in web archives.


Slides: csvconfyasmin2017_05_03, from Yasmin AlNoamany, PhD

There are multiple archive collections about the revolution, and it is difficult to visualize more than potentially 1,000 different captures of potentially 1,000 seed URIs. We seek to answer questions such as "What is in this collection?" and "What makes this collection different from another?" She uses social media storytelling as an existing interface with which users are familiar. The presentation discusses, at a high level, the Dark and Stormy Archives (DSA) framework, which automatically summarizes a collection and generates the visualization in Storify.

Selected Posters


There were many excellent posters at Web Science 2017. Unfortunately, I do not have room to cover them all, so I will highlight a select few.


In "Understanding Parasocial Breakups on Twitter" (preprint), Kiran Garimella studied perceived virtual relationships, known as para-social relationships, on Twitter. This scenario erupts when a user follows a celebrity on social media and then are followed back. For some fans, this provides the illusion of a real relationship. A para-social breakup (PSB) occurs when a fan stops following the celebrity. He studied the 15 most followed celebrities from popular culture on Twitter and used a subset of their fans. He classified fans into 3 types: (1) involved, (2) casual, and (3) random. The involved fans tweet often with their chosen celebrity, but also have a higher probability of unfollowing the celebrity than casual fans who tweet with their celebrity only once per year, or a random sample of followers. Garimella's study has implications for marketing.

The mobile game Pokémon Go has a feature known as the Pokéstop, where players can gather more resources to continue playing the game. In "Pokémon Go: Impact on Yelp Restaurant Reviews" (preprint), Pavan Kondamundi evaluated whether or not the inclusion of Pokéstops in Yelp restaurant profiles had an impact on the reviews for said restaurants. His study included 100 restaurants, half of which contained Pokéstops. He found an increase in the number of reviews from 2014 to 2015, but a slight decrease for the following period.

Policy documents are used by many organizations, not just those within government. Bharat Kale's work, "Predicting Research that will be cited in Policy Documents" (preprint), attempts to determine which features increase the probability of an academic work being cited by a policy document. Using features related to online attention, he discovered that the Random Forest classifier showed the best results for predicting if an article is cited by a policy document. Mention counts on peer-review platforms, such as PubPeer, seem to be the most influential feature, and mentions in the news appear to be the least influential feature. He intends to extend the work "to predict the number of policy citations a given work is likely to receive."
As I mentioned in an earlier blog post, the problem of entity resolution, and more specifically author disambiguation, continues to confound solutions for scholarly communication. Janaína Gomide focuses on the synonym problem, where a single individual has multiple names. In her work "Consolidating identities of authors through egonet structure", rather than using content information about a given author, she is studying egonets, networks of collaborators built from co-authorship information. She is developing an algorithm that attempts to disambiguate authors based on the shape of their egonet. Preliminary results with datasets from DBLP and Google Scholar show promise for the current version of this algorithm.

There was a lot of work on social networks at the Web Science conference, and Nirmal Sivaraman's work "On Social Synchrony in Online Social Networks" was no exception. He defines social synchrony as "a kind of collective social activity associated with some event where the number of people who participate in the activity first increases exponentially and then decreases exponentially". He developed an algorithm that determines if synchrony has occurred within a dataset of social media data.

Spencer Norris won the best poster award for his "A Semantic Workflow Approach to Web Science Analytics". He highlights the use of linked data to build workflows for use in running and repeating scientific experiments.  His work focuses on the use of semantic workflows for Web Science, indicating that these workflows, because of their ease of publication and analysis, also easily allow "Web Science analyses to be recreated and recombined". He combines the Workflow INstance Generation and Specialization (WINGS) system with the existing Semantic Numeric Exploration Technology (SemNExT) framework.

Selected Papers

There were 45 papers accepted at Web Science. I will summarize a few here to convey the type of research being conducted at the conference.

The design of web pages has shifted over time, leading to differences in how we consume them. Bardia Doosti presented "A Deep Study into the History of Web Design" (copy on author's website). In their work, they point out that web design, much like paintings and architecture, can be analyzed to indicate the concepts and ideas that represent the era from which a web page comes. They develop several automated techniques for analyzing archived web pages, including the use of deep Convolutional Neural Networks, with a hope of identifying the web pages' subject areas as well as determining which web sites (such as apple.com) may influence the design of others.

Olga Zagovora presented "The gendered presentation of professions on Wikipedia" (preprint), in which she and her co-authors conducted a study comparing the number of women mentioned on the profession pages of German Wikipedia to the actual number of women in those professions, indicating that there is still a gender bias in the pages. They compared the number of images, mentioned persons, and wiki page titles. It is likely that the choices representing individuals in these professions may be made out of tradition or due to the historical preponderance of males in these fields, but this work is useful in informing further development of guidelines for the Wikipedia community. The data is available on GitHub.

Companies, celebrities, and even general users want to know what helps them acquire more Twitter followers. Juergen Mueller and his co-authors attempted to determine the influence of a user's profile information on her number of followers in "Predicting Rising Follower Counts on Twitter Using Profile Information" (preprint). Because of the rate limitations of the Twitter API, they are interested in determining what information can be predicted based on a user's profile alone. Using several classifiers, they discovered that follower count is affected by the "subjective impression" of the profile's name, indicating that follower counts are adversely affected for accounts with a name that is perceived as feminine. They also confirmed earlier research indicating that users with a given name in their name field have fewer followers.


The concept of fake news has received a lot of media attention, especially since the 2016 US Presidential Elections. In "The Fake News Spreading Plague: Was it Preventable?" (preprint), Eni Mustafaraj presented a recipe for spreading misinformation on Twitter from 2010 and one for spreading fake news on Facebook from 2016. The two recipes have the same steps, meaning that perhaps the spread of misinformation during the 2016 US Presidential elections could have been avoided. She mentioned that even though Facebook had been working on preventing the spread of hoaxes since January of 2015, they were unsuccessful.


Omar Alonso and his colleagues at Microsoft created a search engine in "What's Happening and What Happened: Searching the Social Web". They are building a growing archive of tweets to find relevant links that have been shared on social media. Their project differs from others because they also add a temporal dimension to their data gathering to show what people were talking about at a given time, which keeps the search engine fresh but also allows for some historical analysis. The system uses a concept of virality rather than just popularity for the inclusion of results. Because of this focus on virality, their system is able to filter fake news from the results. Contrary to other results, they discovered that "the total number of shares of the real links was higher than the fake links" on Twitter. The resulting search engine allows a user to search for a topic at a given date and time and discover what links were relevant to that topic at that time. The results are presented as a series of social cards rather than the "10 blue links" presented by well known web search engines. These social cards are similar to the link cards used in Storify: they contain an image, title, and short description of the link behind the card.



Alonso also presented the work "Automatic Generation of Timelines from Social Data", which attempts to determine what occurred on a given day for a given hashtag. The system evaluates the tweets by several metrics for relevance, quality, and popularity to produce a vector of relevant n-grams for that hashtag. Once this is done, links are extracted from the tweets, and titles are extracted from these links. The document link titles are evaluated using a new technique the authors name Social Pseudo Relevance Feedback which combines their existing n-gram vectors with the concept of pseudo relevance feedback from information retrieval in order to re-rank the link titles. The highest ranked title for the time period, a day or an hour, is then presented as an entry into the story. The dates can then be listed next to the title produced for that date which, when presented in this fashion, represents a timeline of events matching a given hashtag (seen for #parisattacks and #deflategate in the photos above). I thought this was quite brilliant. One could easily extend this technique by presenting the generated links in order using a tool like Storify, much like Dr. AlNoamany has done for Archive-It collections.


Web Archives are important to the research of the Old Dominion University WS-DL group, so I was intrigued by "Observing Web Archives" (university repository version) presented by Jessica Ogden. She was interested in the in-depth "day-to-day decisions, activities and processes that facilitate web archiving in practice". She used an ethnographic approach to understand the practices at the Internet Archive.





Kiran Garimella presented "The Effect of Collective Attention on Controversial Debates on Social Media" (preprint), studying polarized debates on Twitter. They analyzed four controversial topics on Twitter from 2011 to 2016. They discovered that "spikes in interest correspond to an increase in the controversy of the discussion" and that these spikes result in changes to the vocabulary being used during the discussions as each side argues its case. They want to develop a model that allows us to use "early signals" from social media to "predict the impact of an event". Kiran won best student paper for this study.

As Web Science researchers, we spend a lot of time analyzing the data available from the web. While working on the online harassment Digital Wildfire project, Helena Webb, Marina Jirotka, and their co-authors began to question the ethics of exploring a user's Twitter data without the user's consent. Even though Twitter is largely a public social network, issues arise when one considers that researchers are deriving behaviors and information about people, which has parallels with the testing of human subjects. Marina Jirotka presented "The Ethical Challenges of Publishing Twitter Data for Research Dissemination" (link to university repository). They indicated that there is indeed harm to be caused by republishing social media posts, exposing the attacker to retaliation and forcing the victim to relive the experience. Even if one were to anonymize posts, it is still difficult to fully anonymize the subject, considering the posts can be found via search engines on social media sites.

If a researcher wanted to acquire consent, how would they do so? In the case of Twitter, the social media feed is so large that many users do not view it all and may miss requests for consent. How often should the researcher attempt to contact them? Is an opt-out policy better than an opt-in policy? Echoing Jennifer Golbeck's keynote: If posts were observed, but cannot be included in published research, how do we support our findings? The study exposes many of these concerns in hopes that we can come to a consensus on how to handle them as a community. This study won best paper.


Everything Else








There was a panel discussion at the end of the first day on the ethics of web science. It echoed some of the issues brought up in Helena Webb and Marina Jirotka's paper, but also introduced some additional perspectives. The panel consisted of Jim Hendler, Jeanna Matthews, Steffen Staab, and Hans Akkermans. Each offered many different perspectives, but it was clear that the concern is that Web Science researchers need to drive the ethics discussion before groups outside of the community drive it for them.



Of course, we enjoyed the time learning from one another. Discussions over dinner were influenced by the presentations we had witnessed during the day. We were also able to educate one another about our individual projects. Memento made an appearance in some of those discussions and stickers ended up on some laptops.


I would like to thank John Erickson, Juergen Mueller, Lee Fiorio, Jim Hendler, Omar Alonso, Wendy Hall, Kiran Garimella, Jessica Ogden, Marina Jirotka, James McCusker, Deborah McGuinness, Olga Zagovora, Frederick Ayala-Gómez, Hamed Alhoori, Peter Fox, Spencer Norris, Katharina Kinder-Kurlanda, Eni Mustafaraj, Xiaogang (Marshall) Ma, and many others for fascinating insight and interesting discussions over meals and outings.

Summary



The Web Science 2017 conference was invigorating and fascinating. It has really inspired me to make Web Science an area of interest in future studies. The Web Science Trust has summarized the conference and also provided a Storify story of what happened. I am looking forward to possibly attending this conference again in Amsterdam in 2018 where I may contribute the next grains of knowledge to the discipline.

-- Shawn M. Jones

Wednesday, July 5, 2017

2017-07-04: Web Archiving and Digital Libraries (WADL) Workshop Trip Report From JCDL2017


The Web Archiving and Digital Libraries (WADL) Workshop was held after JCDL 2017, from June 22 to June 23, 2017. I live-tweeted both days, and you can follow along on Twitter with this blog post using the hashtag wadl2017 or via the notes/minutes of WADL2017. I also created a list on Twitter of the speakers'/presenters' Twitter handles; go give them a follow to keep up to date with their exciting work.

Day 1 (June 22)

WADL2017 kicked off at 2 pm with Martin Klein and Edward Fox welcoming us to the event by giving an overview and introduction to the presenters and panelists.

Keynote

The opening keynote of WADL2017 was National Digital Platform (NDP), Funding Opportunities, and Examples Of Currently Funded Projects by Ashley Sands (IMLS).
In the keynote, Sands spoke about the desired values for the national digital platform, how IMLS offers various grant categories and funding opportunities for archiving projects, and the submission procedure for grants, as well as tips for writing IMLS grant proposals. Sands also shared what a successful (funded) proposal looks like, and how to apply to become a reviewer of the proposals!

Lightning Talks

First up in the lightning talks was Ross Spenser from the New Zealand Web Archive on "HTTPreserve: Auditing Document-Based Hyperlinks" (poster)

Spenser has created a tool (httpreserve) that checks the status of a URL on the live web and whether it has been archived by the Internet Archive; it is part of a larger suite of tools under the same name. You can try it out via httpreserve.info, and the project is open to contributions from the community as well!
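This is not httpreserve's own code, but the kind of check it performs can be roughly sketched with a plain status request against the live URL plus the Internet Archive's Wayback availability endpoint (archive.org/wayback/available):

```javascript
// Generic illustration (not the httpreserve implementation): check whether a
// URL responds on the live web and whether the Internet Archive holds a copy.
const https = require('https')

const target = 'https://example.com'

// Live-web check: report the status code of the response, if any.
https.get(target, res => {
  console.log('live web status:', res.statusCode)
}).on('error', err => console.log('live web unreachable:', err.message))

// Archive check: the availability API reports the closest archived snapshot.
https.get(`https://archive.org/wayback/available?url=${encodeURIComponent(target)}`, res => {
  let body = ''
  res.on('data', chunk => { body += chunk })
  res.on('end', () => {
    const closest = JSON.parse(body).archived_snapshots.closest
    if (closest) {
      console.log('archived at', closest.url, 'on', closest.timestamp)
    } else {
      console.log('no snapshot found in the Internet Archive')
    }
  })
})
```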
The second talk was by Muhammad Umar Qasim on "WARC-Portal: A Tool for Exploring the Past". WARC-Portal is a tool that seeks to provide researchers access to browse and search through custom collections and provides tools for analyzing these collections via Warcbase.
The third talk was by Sawood Alam on "The Impact of URI Canonicalization on Memento Count". Alam spoke about the ratio of representations vs. redirects obtained from dereferencing each archived capture. For a more detailed explanation, you can read our blog post or the full technical report.

The final talk was by Edward Fox on "Web Archiving Through In-Memory Page Cache". Fox spoke about Nearline vs. Transactional Web Archiving and the advantages of using a Redis cache.

Paper Sessions

First up in the paper sessions were Ian Milligan, Nick Ruest, and Ryan Deschamps on "Building a National Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Project".
The WALK project seeks to address the issue of "To use Canadian web archives you have to really want to use them, that is you need to be an expert" by "Bringing Canadian web archives into a centralised portal with access to derivative datasets".
Enter WALK: 61 collections, 16 TB of WARC files, and a newly developed Solr front end based on Project Blacklight (with 250 million records currently indexed). The WALK workflow consists of using Warcbase and a handful of other command-line tools to retrieve data from the Internet Archive, automatically generate scholarly derivatives (visualizations, etc.), upload those derivatives to Dataverse, and ensure the derivatives are available to the research team.
To ensure that WALK can scale, the project will be building on top of Blacklight and contributing the work back to the community as WARCLight.
The second paper presentation of WADL2017 was by Sawood Alam on "Avoiding Zombies in Archival Replay Using ServiceWorker". Alam spoke about how, through the use of ServiceWorkers, URIs that were missed during rewriting, or not rewritten at all due to the dynamic nature of the web, can be rerouted dynamically by the ServiceWorker to hit the archive rather than the live web.
Slides: Avoiding Zombies in Archival Replay Using ServiceWorker, from Sawood Alam
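As a generic sketch of the idea (not Alam's implementation), a ServiceWorker registered on the replay page can intercept fetches that escaped URL rewriting and send them back into the archive's replay namespace:

```javascript
// sw.js -- a generic sketch of archival rerouting, not the paper's code.
// The replay prefix below is an assumed example of an archive's URL layout.
const REPLAY_PREFIX = '/web/20170724000000/'

self.addEventListener('fetch', event => {
  const url = new URL(event.request.url)
  // Requests already pointing at the archive are left alone.
  if (url.origin === self.location.origin) {
    return
  }
  // A leaked live-web request (a "zombie") is rerouted to the archived copy.
  const rerouted = self.location.origin + REPLAY_PREFIX + url.href
  event.respondWith(fetch(rerouted))
})
```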

Ian Milligan was up next, presenting "Topic Shifts Between Two US Presidential Administrations". One of the biggest questions Milligan noted during his talk was how to proceed with training a classifier when there is no annotated data to train it on. To address this question, Milligan used bootstrapping, starting off with bag-of-words features and keyword matching. He noted that this method works with noisy but reasonable data. The classifiers were trained to look for biases across administrations; Trump vs. Obama seems to work with dramatic differences, and the TL;DR is that the classifiers do learn the biases. For more detailed information about the paper, see Milligan's blog post about it.
Closing the first day of WADL2017 was Brenda Reyes Ayala with the final paper presentation on "Web Archives: A preliminary exploration vs reality". Ayala spoke about examining Archive-It support tickets (as XML) that were cleaned, anonymized, and then analyzed using qualitative coding and grounded theory, and she presented user expectations, and the mental models behind them, alongside the reality of working with archives.
  • Expectation: The original website had X number of documents, so it follows that the archived website also has X number of documents.
    Reality: An archived website was often much larger or smaller than the user had expected.
  • Expectation: A web archive only includes content that is closely related to the topic.
    Reality: Due to crawler settings, scoping rules, and the nature of the web, web archives often include content that is not topic-specific. This was especially the case with social media sites. Users saw the presence of this content as being of little relevance and superfluous.
  • Expectation: Content that looks irrelevant is actually irrelevant.
    Reality: A website contains pages or elements that are not obviously important but help “behind the scenes” to make other elements or pages render correctly or function properly. This is knowledge known by the partner specialist, but usually unknown or invisible to the user or creator of an archive. Partner specialists often had to explain the true nature of this seemingly irrelevant content.
  • Expectation: Domains and sub-domains are the same thing, and they do not affect the capture of a website.
    Reality: These differences usually affect how a website is captured.

Day 2 (June 23)

Day two started off with a panel featuring Emily Maemura, Dawn Walker, Matt Price, and Maya Anjur-Dietrich on "Challenges for Grassroots Web Archiving of Environmental Data". The first event they hosted took place in December in Toronto to preserve EPA data from the Obama administration during the Trump transition. The event had roughly two hundred participants and yielded hundreds of press articles, tens of thousands of URLs seeded to the Internet Archive, dozens of coders building tools, and a sustainable local community of activists interested in continuing the work. Since then, seven events have been hosted or co-hosted in Philly, NYC, Ann Arbor, Cambridge MA, Austin TX, and Berkeley, with thirty-one more planned in cities across the country.
After the panel was Tom J. Smyth on Legal Deposit, Collection Development, Preservation, and Web Archiving at Library and Archives Canada. Smyth spoke about how to start building a collection for a budding web archive that does not yet have the scale of an established one, noting that it has:
Web Archival Scoping Documents
  • What priority
  • What type
  • What are we trying to document
  • What degree are we trying to document
Controlled Collection Metadata / Controlled Vocabulary
  • Evolves over time with the collection topic
Quality Control Framework
  • Essential for setting a cut-off point for quality control
Selected Web Resources must pass four checkpoints
  • Is the resource in-scope of the collection and theme
    (when in doubt consult the Scoping Document)
  • Heritage Value: is the content unique or available in other formats,
    and in what contexts can it be used
  • Technology / Preservation
  • Quality Control

The next paper presenters were Muhammad Umar Qasim and Sam-Chin Li with "Working Together Toward a Shared Vision: Canadian Government Information Digital Preservation Network (CGI - DPN)". The Canadian Government Information Digital Preservation Network (CGI - DPN) is a project that seeks to preserve digital collections of government information and ensure the long-term viability of digital materials through geographically dispersed servers, protective measures against data loss, and forward format migration. The project will also act as a backup server in cases where the main server is unavailable, as well as a means of restoring lost data. To achieve these goals, the project is using Archive-It for the web crawls and collection building, and LOCKSS to disseminate the collections to additional peers (LOCKSS nodes).
Nick Ruest was up next, speaking on "Strategies for Collecting, Processing, and Analyzing Tweets from Large Newsworthy Events". Ruest spoke about how Twitter is big data and handling it can be difficult. He also spoke about how to handle big Twitter data in a sane manner by using tools such as Hydrator or Twarc from the DocNow project.


The final paper presentation of the day was by Saurabh Chakravarty, Eric Williamson, and Edward Fox on "Classification of Tweets using Augmented Training". Chakravarty discussed using the cosine similarity measure on Word2Vec-based vector representations of tweets and how it can be used to label unlabeled examples, how training a classifier with augmented training provides improvements in classification efficacy, and how a Word2Vec-based representation generated from a richer corpus, like Google News, provides better improvements with augmented training.
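As a generic refresher (not the authors' code), the cosine similarity that drives this kind of label propagation can be sketched as follows, where each tweet is represented by a dense vector such as an averaged Word2Vec embedding:

```javascript
// Generic illustration, not the paper's implementation.
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Augmented-training sketch: give an unlabeled example the label of its most
// similar labeled example, then add it to the training set.
function labelByNearest (unlabeledVec, labeledExamples) {
  let best = { label: null, score: -Infinity }
  for (const { vector, label } of labeledExamples) {
    const score = cosineSimilarity(unlabeledVec, vector)
    if (score > best.score) {
      best = { label, score }
    }
  }
  return best.label
}
```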

Closing Round Table On WADL

The final order of business for WADL 2017 was a round table discussion with the participants and attendees concerning next year's WADL and how to make WADL even better. A lot of great ideas and suggestions were made as the round table progressed, with the participants becoming most excited about the following:
  1. WADL 2018 (naturally of course)
  2. Seeking out additional collaboration and information sharing with those who are actively looking into web archiving but are unaware of, or did not make it to, WADL
  3. Looking into bringing proceedings to WADL, perhaps even a journal
  4. Extending the length of WADL to a full two or three day event
  5. Integration of remote site participation for those who wish to attend but can not due to geographical location or travel expenses
Until the Joint Conference on Digital Libraries 2018, June 3-7, in Fort Worth, Texas, USA
- John Berlin