Tuesday, May 31, 2016

2016-05-31: Can I find this story? API: Yes, Google: Maybe, Native Search: No

A story on Storify titled: "Lecture on Academic Freedom" (capture date: 2016-05-31)
The story on Storify titled: "Lecture on Academic Freedom" could not be found on Google (capture date: 2016-05-31)
The story on Storify titled: "Lecture on Academic Freedom" could not be found on Storify native search (capture date: 2016-05-31)
Part of our research (funded by IMLS) to build collections for stories or events involves exploring content curation sites like Storify to determine whether they hold quality (newsworthy, timely, etc.) content. Storify is a social network service used to create stories, which consist of text and multimedia content as well as content from other social media sites like Twitter, Facebook, and Instagram.
Our exploration involved collecting stories from Storify over a period of time in order to manually inspect them and determine their newsworthiness. The exploration was twofold: we collected the latest stories (across multiple topics) from the Storify API (browse/latest interface), and we also collected stories about the Ebola virus through Storify's search API. During this period we collected resources from Google (with the "site:storify.com" directive) as well. At one point in our exploration, we considered whether we could rely exclusively on Storify search to find content, or use Google's site directive to find Storify stories. In other words, how good are Storify's native search and Google search at discovering stories on Storify, compared to the Storify browse/latest API?
Storify API vs Google and Storify native search: A simple plan for measuring discovery
We focused on known item searches to avoid the problem of subjective relevance measures. This gave us a very simple way of scoring Google and Storify's native search: if Google finds a specific story (query extracted from exact title, body content and description), Google gets 1 point. On the other hand, if Storify's native search (using the same query), finds the story, Storify gets 1 point.
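The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the story URLs are hypothetical, and the sketch assumes that a story counts as found when its URL appears anywhere among the result links.

```python
def score_known_item(target_url, serp_links):
    """Return 1 if the known story appears among the result links, else 0."""
    # Ignore a trailing slash so that equivalent URLs compare equal.
    normalized = {link.rstrip("/") for link in serp_links}
    return 1 if target_url.rstrip("/") in normalized else 0


def tally(results):
    """Sum per-story points for each search service.

    `results` maps a story URL to {service name: [result links]}.
    """
    totals = {}
    for story_url, per_service in results.items():
        for service, links in per_service.items():
            points = score_known_item(story_url, links)
            totals[service] = totals.get(service, 0) + points
    return totals


# Hypothetical URLs, for illustration only.
example = {
    "https://storify.com/user-a/story-one": {
        "google": ["https://storify.com/user-a/story-one/"],
        "storify": ["https://storify.com/user-b/other-story"],
    },
    "https://storify.com/user-c/story-two": {
        "google": [],
        "storify": [],
    },
}
print(tally(example))  # {'google': 1, 'storify': 0}
```

Because each story is either found or not, no relevance judgment is needed; the totals are simple counts of hits per service.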
Our set of test stories, with corresponding queries generated from the story titles, body content, and description snippets, consisted of 10 stories created between February 2016 and March 2016 (enough time for both search services to index the stories). These stories were collected from the Storify browse/latest API interface, which allows for discovery of content but, unlike search, does not allow us to find topical content. Here is the list of stories (collected 2016-05-30) and their respective creation datetime values, as well as the results outlining stories found by Google and/or Storify's native search:

Story | Creation datetime | Found? (Google) | Found? (Storify)
Commandos 2: Men of Courage full game free pc, download, play. download Commandos 2: Men of Courage for pc | 2016-02-22T22:36:03 | Yes | No
#SJUtakeover | 2016-02-17T21:16:43 | Yes | No
Annotations for Edgar Allan Poe | 2016-03-02T19:47:31 | No | No
Lecture on Academic Freedom | 2016-02-22T22:27:08 | No | No
Hitman: Codename 47 full game free pc, download, play. download Hitman: Codename 47 for pc | 2016-02-22T22:36:26 | Yes | No
AU Game Lab at GDC 2016 | 2016-03-18T17:36:34 | Yes | No
5 Leading Onlinegames For Females Cost Free | 2016-02-22T22:37:22 | Yes | No
Sony Ericsson Z610i (Pink): newest cellular Phone With Advanced attributes | 2016-03-18T23:50:55 | No | No
Senior Research Paper | 2016-02-26T19:47:19 | Yes | No
Syracuse community reacts to NCAA Tournament win over Dayton | 2016-03-18T17:38:34 | Yes | No

We searched for the stories by issuing queries in full quotes (for exact match) to Google search (with the "site:storify.com" directive) and to Storify's native search, and counted the number of hits and misses for both. For both Google and Storify, all SERP links were included in the test. The results from Google never exceeded one page; for Storify, however, the average number of results was 20 stories.
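As a rough sketch of how such exact-match queries can be assembled, the helper below (a hypothetical name, not part of any search API) wraps the text in quotes and optionally appends the site directive. Only the query strings are built here; the search endpoints themselves are left out.

```python
from urllib.parse import urlencode


def exact_match_query(text, site=None):
    """Wrap text in double quotes and optionally restrict it to one site."""
    # Drop embedded quotes so the phrase stays a single quoted unit.
    phrase = '"' + text.replace('"', "") + '"'
    return f"{phrase} site:{site}" if site else phrase


title = "Lecture on Academic Freedom"
google_q = exact_match_query(title, site="storify.com")
storify_q = exact_match_query(title)

print(google_q)                    # "Lecture on Academic Freedom" site:storify.com
print(storify_q)                   # "Lecture on Academic Freedom"
print(urlencode({"q": google_q}))  # q=%22Lecture+on+Academic+Freedom%22+site%3Astorify.com
```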
Storify's native search finds 0/10 stories, Google finds 7/10
We expected Storify to find more stories compared to Google, since the content resides on Storify, but this was not the case: out of 10 stories, Google found 7 but Storify found none! Google found all except the following stories:
  1. Annotations for Edgar Allan Poe
  2. Lecture on Academic Freedom
  3. Sony Ericsson Z610i (Pink): newest cellular Phone With Advanced attributes
A story on Storify titled: "#SJUTakeover" (capture date: 2016-05-31)

The story on Storify titled: "#SJUTakeover" could not be found on Storify search but found on Google (capture date: 2016-05-31)
Before our test, we checked and did not find a Storify utility to exclude a story from search during the story's creation. Consequently, our test result suggests that the Storify search index is not synchronized with its browse/latest API interface. This investigation also shows the utility of using the Storify API for discovery, which contradicts some of our previous experiences where APIs provide different, limited, or stale data (e.g., Delicious API, SE APIs).
A proposal for a comprehensive study
We acknowledge that the sample size of our experiment is very small. However, because the stories were selected at random, the preliminary results could approximate those of a larger study. The curious reader may consider verifying our result through a larger test consisting of a large collection of random stories published across a wide temporal window. If this is done, kindly share your findings with us.
--Nwala

Wednesday, April 27, 2016

2016-04-27: Mementos in the Raw

While analyzing mementos in a recent experiment, we discovered problems processing archived content.  Many web archives augment the mementos they serve with additional archive-specific information, including HTML, text, and JavaScript.  We were attempting to compare content across many web archives, and had to develop custom solutions to remove these augmentations.

Most augment their mementos in order to provide additional user experience features, such as navigation to additional mementos, by rewriting links and providing additional discovery tools. From an end-user perspective, these augmented mementos enhance the usability and overall experience of web archives and are the default case for user access to mementos.  An example from the PRONI web archive is shown below, with the augmentations outlined in red.



Others have requirements to differentiate archived content from live content, because they expose archived content to web search engines. Below, we see that a Google search will return content from the UK National Archives, with one of these search results outlined in red.
To indicate the archived nature of this content, the title of the web page, outlined in red below, has been altered to indicate that this archived page is "[ARCHIVED CONTENT]".


Our experiments were adversely affected by these augmentations. We required "mementos in the raw".  In the case of our study, we needed to access the content as it had existed on the web at the time of capture.  Research by Scott Ainsworth requires accurate replay of the headers as well. These captured mementos are also invaluable to the growing number of research studies that use web archives. Captured mementos are also used by projects like oldweb.today, which need access to the original content so that it can be rendered in old browsers. That project seeks consistent content from different archives to arrive at an accurate page recreation. Fortunately, some web archives store the captured memento, but there is no uniform, standards-based way to access them across various archive implementations.

Based on the needs of these research studies and software projects:
  1. A captured memento must contain only the memento content that was present in the original document:
    • no HTML, JavaScript, CSS, or text has been added to the output
    • linked URIs are not rewritten and exist as they were in the original document (e.g., http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov/news/index.html should just be http://www.lanl.gov/news/index.html)
  2. A captured memento should also provide the original HTTP headers in some form (e.g., X-Archive-Orig-Content-Type: text/html for users desiring the original Content-Type)
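To make the two requirements concrete, here is a minimal Python sketch of what a client currently has to do by hand when an archive does not serve captured mementos. It assumes a Wayback-style memento URI (an archive prefix, a 14-digit timestamp, an optional two-letter flag, then the original URI) and the X-Archive-Orig- header prefix from the example above; the function names are hypothetical.

```python
import re

# Assumed layout: <archive prefix>/<14-digit timestamp><optional flag>/<original URI>
URI_M_PATTERN = re.compile(
    r"^(?P<prefix>.+?/)(?P<timestamp>\d{14})(?P<flag>[a-z]{2}_)?/(?P<original>.+)$"
)


def original_uri(uri_m):
    """Strip the archive prefix and timestamp from a rewritten URI."""
    match = URI_M_PATTERN.match(uri_m)
    # Leave the URI untouched if it does not follow the assumed layout.
    return match.group("original") if match else uri_m


def original_headers(headers, prefix="X-Archive-Orig-"):
    """Recover the original response headers from their prefixed copies."""
    return {
        name[len(prefix):]: value
        for name, value in headers.items()
        if name.lower().startswith(prefix.lower())
    }


rewritten = ("http://wayback.vefsafn.is/wayback/20091117131348/"
             "http://www.lanl.gov/news/index.html")
print(original_uri(rewritten))  # http://www.lanl.gov/news/index.html

served = {
    "Content-Type": "text/html; charset=utf-8",
    "X-Archive-Orig-Content-Type": "text/html",
}
print(original_headers(served))  # {'Content-Type': 'text/html'}
```

Heuristics like these are tied to one URI layout and one header convention, which is exactly the brittleness a uniform access method would remove.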
The following table provides a list of some known web archives and the status of their ability to provide captured mementos, by either unaltered content and/or the original headers. Those columns with a "Yes" indicate that the archive is able to provide access to that specific dimension of captured mementos using software-specific approaches.


Those entries with a ? and other archives not listed may or may not provide access to captured mementos. This ambiguity is part of the problem.  Those archives that run OpenWayback for serving their mementos have the capability to deliver captured mementos, as detailed in the OpenWayback Administrator Manual, by use of special URIs. In fact, the OpenWayback im_ URI flag provides the desired behavior, with original headers and original content, even though the documentation states that it is supposed to "return document as an image".

Of course, not all web archives run OpenWayback, and developers have needed to create heuristics based on the software used by each individual web archive.  For example, our archive registry uses the un-rewritten-api-url attribute to provide a pattern for accessing captured mementos. Because there is no uniform approach, these pattern-based solutions are necessary but brittle, tying them to a small set of specific implementations, and making it difficult for clients to adapt to new or changing web archive software.
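As an illustration of such a pattern-based approach, the sketch below fills in an un-rewritten-api-url template. The template string is the one that appears in the Icelandic Web Archive registry entry later in this post; the function name and the way the timestamp is derived are assumptions made for the example.

```python
from datetime import datetime, timezone

# Template copied from the un-rewritten-api-url attribute of the Icelandic entry.
UN_REWRITTEN_PATTERN = "http://wayback.vefsafn.is/wayback/{timestamp}id_/{url}"


def captured_uri(pattern, memento_datetime, url):
    """Fill a registry pattern with a 14-digit UTC timestamp and the original URI."""
    timestamp = memento_datetime.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return pattern.format(timestamp=timestamp, url=url)


when = datetime(2009, 11, 17, 13, 13, 48, tzinfo=timezone.utc)
print(captured_uri(UN_REWRITTEN_PATTERN, when, "http://www.lanl.gov/news/index.html"))
# http://wayback.vefsafn.is/wayback/20091117131348id_/http://www.lanl.gov/news/index.html
```

The client must know the template for every archive it talks to, and the template silently breaks whenever an archive changes its software or URI layout.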
We propose a solution that uses the Memento specification (RFC 7089) in its current form, while still allowing uniform, standards-based access to both augmented and captured mementos.

Proposed Solution for Accessing Augmented and Captured Mementos

We propose two parallel Memento implementations: one with a TimeGate and TimeMap for access to augmented mementos (as currently exists) and another with a TimeGate and TimeMap for access to captured mementos.  A client that desires access to a specific type of memento (captured or augmented) only needs to access the TimeGate or TimeMap that specializes in finding and returning that type of memento. These parallel Memento implementations are based on the same infrastructure, the interactions are the same, and the only difference is in the nature of the memento each serves.

Clients could use the Archive Registry for discovering these TimeGates and TimeMaps. The Registry contains entries for many public web archives and version control systems, for each detailing its TimeGate and TimeMap URIs, as well as any additional information pertinent to accessing the archives. Several tools, such as the Memento Aggregator, directly use the information in the Registry. In light of discussions on the Memento Development list, we are considering creating a curated location where improvements can be submitted by the community.

A new attribute, profile, added to the timegate and timemap elements in the Registry, would allow a client to discover the TimeGate and/or TimeMap providing the type of memento it desires. A fictional enhanced Registry entry for the Icelandic Web Archive is shown below with the new profile attributes in red. Also, information currently provided in the <archive> element would either be deprecated (e.g. un-rewritten-api-url) or relocated (e.g. inside the timegate or timemap elements).

<link id="is" longname="Icelandic Web Archive">
    <timegate uri="http://wayback.vefsafn.is/wayback/" redirect="no" profile="http://mementoweb.org/terms/augmented"/>
    <timegate uri="http://wayback.vefsafn.is/wayback/captured/" redirect="no" profile="http://mementoweb.org/terms/captured"/>
    <timemap uri="http://wayback.vefsafn.is/wayback/timemap/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/augmented" />
    <timemap uri="http://wayback.vefsafn.is/wayback/timemap/captured/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/captured" />
    <icon uri="http://vefsafn.is/favicon.ico"/>
    <calendar uri="http://wayback.vefsafn.is/wayback/*/"/>
    <memento uri="http://wayback.vefsafn.is/wayback/*/"/>
    <archive type="snapshot" rewritten-urls="yes" un-rewritten-api-url="http://wayback.vefsafn.is/wayback/{timestamp}id_/{url}" access-policy="public" memento-status="yes"/>
</link>
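A client could select the right endpoint from such an entry with a few lines of Python. This is a sketch under the assumption that the Registry adopts the profile attribute exactly as in the fictional entry above; the entry is abbreviated here and find_endpoint is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the fictional registry entry shown above.
REGISTRY_ENTRY = """
<link id="is" longname="Icelandic Web Archive">
    <timegate uri="http://wayback.vefsafn.is/wayback/" redirect="no" profile="http://mementoweb.org/terms/augmented"/>
    <timegate uri="http://wayback.vefsafn.is/wayback/captured/" redirect="no" profile="http://mementoweb.org/terms/captured"/>
    <timemap uri="http://wayback.vefsafn.is/wayback/timemap/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/augmented"/>
    <timemap uri="http://wayback.vefsafn.is/wayback/timemap/captured/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/captured"/>
</link>
"""

AUGMENTED = "http://mementoweb.org/terms/augmented"
CAPTURED = "http://mementoweb.org/terms/captured"


def find_endpoint(entry_xml, element, profile):
    """Return the uri of the first <timegate> or <timemap> with the given profile."""
    root = ET.fromstring(entry_xml.strip())
    for node in root.iter(element):
        if node.get("profile") == profile:
            return node.get("uri")
    return None  # the archive does not advertise this type of memento


print(find_endpoint(REGISTRY_ENTRY, "timegate", CAPTURED))
# http://wayback.vefsafn.is/wayback/captured/
print(find_endpoint(REGISTRY_ENTRY, "timemap", AUGMENTED))
# http://wayback.vefsafn.is/wayback/timemap/link/
```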

This solution requires no changes to the Memento protocol and allows web archives to satisfy the needs of both end-users and software applications by returning the appropriate memento for each use-case. 
In the case of OpenWayback, this capability should be easy to add. Consider the following example from the Icelandic Archive, running OpenWayback, where the following URIs refer to the mementos of http://www.lanl.gov with a Memento-Datetime of Tue, 17 Nov 2009 13:13:48 GMT:
The memento that will be selected from the archive for the requested datetime, and hence the database interactions, will be the same for augmented and captured mementos. The only difference is the memento URI to which the TimeGates will redirect and is limited to the addition of the string im_ in the captured memento's URI. The additional TimeGate only needs to add this string to its output.
This approach, fully aligned with the Memento protocol, removes the need for client heuristics and supports using syntaxes other than im_ to distinguish between captured and augmented memento URIs. A client that picks the nature of a given TimeGate or TimeMap will continue to receive that type of memento.
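The point that both TimeGates share the same selection logic can be sketched as follows. This is not OpenWayback code: the function names are hypothetical, the second capture datetime is made up for the example, and the flag is passed in as a parameter (im_ here, following the text above) so that other syntaxes can be substituted.

```python
from datetime import datetime, timezone


def closest_capture(captures, requested):
    """Pick the capture datetime nearest to the requested datetime."""
    return min(captures, key=lambda capture: abs(capture - requested))


def memento_uri(base, capture, original, flag=""):
    """Build a memento URI; only the flag differs between the two TimeGates."""
    return f"{base}{capture:%Y%m%d%H%M%S}{flag}/{original}"


captures = [
    datetime(2009, 11, 17, 13, 13, 48, tzinfo=timezone.utc),
    datetime(2010, 4, 24, 13, 14, 22, tzinfo=timezone.utc),  # hypothetical capture
]
requested = datetime(2009, 12, 1, tzinfo=timezone.utc)

# Shared step: both TimeGates select the same capture.
selected = closest_capture(captures, requested)

base = "http://wayback.vefsafn.is/wayback/"
augmented = memento_uri(base, selected, "http://www.lanl.gov")
captured = memento_uri(base, selected, "http://www.lanl.gov", flag="im_")

print(augmented)  # http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov
print(captured)   # http://wayback.vefsafn.is/wayback/20091117131348im_/http://www.lanl.gov
```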

Optional Additions


With parallel "augmented" and "captured" Memento protocol support in place, as described above, we have supplied access to different types of mementos. The following section details other optional helpful changes that a client could use to identify and locate different types of mementos.

Self-Describing TimeGates, TimeMaps, and Mementos

TimeGates, TimeMaps, and mementos can self-describe their nature with an HTTP link using a profile relation, defined by RFC 6906, and a link target (Target IRI in the RFC) that indicates their augmented or captured nature.

Example TimeGate response headers implementing this self-describing ability are shown below, with the profile relation specifying the captured nature in red.

HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:02:14 GMT
Server: Apache
Vary: accept-datetime
Location: http://arxiv.example.net/web/captured/20010321203610/http://a.example.org/
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/captured/http://a.example.org/>
      ; rel="timemap"; type="application/link-format"
      ; from="Tue, 15 Sep 2000 11:28:26 GMT"
      ; until="Wed, 20 Jan 2010 09:34:33 GMT",
    <http://mementoweb.org/terms/captured>; rel="profile"
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Connection: close
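A client can read the profile relation out of such a response with a small Link header parser. The sketch below is a simplified parser written for this example, not a complete implementation of the Link header grammar; it handles quoted attribute values that contain commas, such as the from and until datetimes above.

```python
import re

# A link-value: <target> followed by zero or more ;name=value attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,"]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')


def parse_links(value):
    """Split a Link header value into (target, attributes) pairs."""
    links = []
    for target, params in LINK_RE.findall(value):
        attrs = {}
        for name, quoted, bare in PARAM_RE.findall(params):
            attrs[name.lower()] = quoted or bare
        links.append((target, attrs))
    return links


def profile_of(link_header):
    """Return the target of the rel="profile" link, or None if absent."""
    for target, attrs in parse_links(link_header):
        if "profile" in attrs.get("rel", "").split():
            return target
    return None


link_header = (
    '<http://a.example.org/>; rel="original", '
    '<http://arxiv.example.net/timemap/captured/http://a.example.org/>'
    '; rel="timemap"; type="application/link-format"'
    '; from="Tue, 15 Sep 2000 11:28:26 GMT"'
    '; until="Wed, 20 Jan 2010 09:34:33 GMT", '
    '<http://mementoweb.org/terms/captured>; rel="profile"'
)
print(profile_of(link_header))  # http://mementoweb.org/terms/captured
```

A client that does not know about the profile relation simply ignores the extra link, which is why existing Memento clients are unaffected.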

Example TimeMap response headers implementing this relation are shown below, again with additions in red describing this TimeMap as listing augmented mementos. The profile link is placed within the Link header so that clients can discard or consume the associated entity based on their needs. The profile link is also included in the TimeMap body so that the TimeMap itself is self-describing.

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
Content-Length: 4883
Content-Type: application/link-format
Link: <http://mementoweb.org/terms/augmented>; rel="profile"
Connection: close

    <http://a.example.org>;rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org>
      ; rel="self";type="application/link-format",
    <http://mementoweb.org/terms/augmented>
      ; rel="profile",
    <http://arxiv.example.net/timegate/http://a.example.org>
      ; rel="timegate",
    <http://arxiv.example.net/web/20000620180259/http://a.example.org>
      ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
    <http://arxiv.example.net/web/20091027204954/http://a.example.org>
      ; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
    <http://arxiv.example.net/web/20000621011731/http://a.example.org>
      ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
    <http://arxiv.example.net/web/20000621044156/http://a.example.org>
      ; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT",
    ...
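For the TimeMap body, a client mostly cares about the memento entries and their datetimes. A minimal sketch, assuming that each memento entry lists rel before datetime as in the example above (attribute order can vary in practice, so a full link-format parser would be more robust):

```python
import re
from email.utils import parsedate_to_datetime

# Matches entries of the form <uri>; rel="..."; datetime="..."
MEMENTO_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]*)"\s*;\s*datetime="([^"]+)"')


def mementos(timemap_body):
    """Return (datetime, URI) pairs for every memento entry, oldest first."""
    found = []
    for uri, rel, when in MEMENTO_RE.findall(timemap_body):
        # rel may hold several values, e.g. "first memento".
        if "memento" in rel.split():
            found.append((parsedate_to_datetime(when), uri))
    return sorted(found)


body = """
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>
  ; rel="self";type="application/link-format",
<http://mementoweb.org/terms/augmented>
  ; rel="profile",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>
  ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>
  ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
"""

for when, uri in mementos(body):
    print(when.isoformat(), uri)
```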

Finally, a memento can specify whether it is captured or augmented using the same method.  Shown in red in the example below, the headers describe this resource as a captured memento.

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:15 GMT
Server: Apache-Coyote/1.1
Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/captured/http://a.example.org/>
      ; rel="timemap"; type="application/link-format",
    <http://arxiv.example.net/timegate/captured/http://a.example.org/>
      ; rel="timegate",
    <http://mementoweb.org/terms/captured>; rel="profile"
Content-Length: 25532
Content-Type: text/html;charset=utf-8
Connection: close

These additional profile relations allow archives to describe the nature of respective TimeGates, TimeMaps, and mementos without affecting existing Memento clients.

Discovery of Other TimeGates and TimeMaps via Mementos

Here we introduce an approach for a client to get from a memento to its corresponding memento of the other type. This capability is handy as such, but, as will be shown, it is also a way to get to the other type of TimeGate and TimeMap.

By including another Link relation, a machine client can find the corresponding memento of another type.  Shown below, we build upon our previous example memento headers and add this new relation, marked in red, allowing clients to find this captured memento's augmented counterpart. Here a profile attribute is used with the memento relation type in order to indicate the type of memento found at the link target. This profile attribute has been requested as part of "Signposting the Scholarly Web", and is provided by a proposed update to a draft RFC detailing "link hints". This proposed update has been informally accepted by the RFC's author.

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:15 GMT
Server: Apache-Coyote/1.1
Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/captured/http://a.example.org/>
      ; rel="timemap"; type="application/link-format",
    <http://arxiv.example.net/timegate/captured/http://a.example.org/>
      ; rel="timegate",
    <http://mementoweb.org/terms/captured>; rel="profile",
    <http://arxiv.example.net/web/20010321203610/http://a.example.org/>
      ; rel="memento"; profile="http://mementoweb.org/terms/augmented"
Content-Length: 25532
Content-Type: text/html;charset=utf-8
Connection: close
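Once the Link header has been parsed, locating the counterpart is a simple lookup. The sketch below assumes the header has already been turned into (target, attributes) pairs by some Link parser, and that the proposed profile attribute on the memento relation is present; the function names are hypothetical.

```python
def own_profile(links):
    """Return the profile URI that this resource declares for itself."""
    for target, attrs in links:
        if attrs.get("rel") == "profile":
            return target
    return None


def counterpart(links):
    """Return (URI, profile) of the linked memento of the other type, if any."""
    mine = own_profile(links)
    for target, attrs in links:
        other = attrs.get("profile")
        if attrs.get("rel") == "memento" and other is not None and other != mine:
            return target, other
    return None


# Parsed form of the Link header from the captured memento response above.
links = [
    ("http://a.example.org/", {"rel": "original"}),
    ("http://arxiv.example.net/timemap/captured/http://a.example.org/",
     {"rel": "timemap", "type": "application/link-format"}),
    ("http://arxiv.example.net/timegate/captured/http://a.example.org/",
     {"rel": "timegate"}),
    ("http://mementoweb.org/terms/captured", {"rel": "profile"}),
    ("http://arxiv.example.net/web/20010321203610/http://a.example.org/",
     {"rel": "memento", "profile": "http://mementoweb.org/terms/augmented"}),
]

print(counterpart(links))
# ('http://arxiv.example.net/web/20010321203610/http://a.example.org/', 'http://mementoweb.org/terms/augmented')
```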

From there, a client can follow the link target to the augmented memento. In the example below, we have the headers for the corresponding augmented memento.  The Memento protocol already provides the associated timegate and timemap relations, shown in bold.  A client uses these relations to discover the TimeGate/TimeMap that serves this memento, and, of course, the TimeGate/TimeMap have the same augmented nature as this memento. Note that this augmented memento also links to its captured counterpart.

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:16 GMT
Server: Apache-Coyote/1.1
Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org/>
      ; rel="timemap"; type="application/link-format",
    <http://arxiv.example.net/timegate/http://a.example.org/>
      ; rel="timegate",
    <http://mementoweb.org/terms/augmented>; rel="profile",
    <http://arxiv.example.net/web/20010321203610/captured/http://a.example.org/>
      ; rel="memento"; profile="http://mementoweb.org/terms/captured"
Content-Length: 25532
Content-Type: text/html;charset=utf-8
Connection: close

Now the client can make future requests to this TimeGate and receive responses like the one below, finding additional augmented mementos for the original resource.

HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:02:17 GMT
Server: Apache
Vary: accept-datetime
Location: http://arxiv.example.net/web/20100424131422/http://a.example.org/
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org/>
      ; rel="timemap"; type="application/link-format"
      ; from="Tue, 15 Sep 2000 11:28:26 GMT"
      ; until="Wed, 20 Jan 2010 09:34:33 GMT",
    <http://mementoweb.org/terms/augmented>; rel="profile"
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Connection: close

Likewise, a client can issue a request to the associated TimeMap to access augmented mementos for this resource. Of course, this process can start from an augmented memento and lead a client to the TimeGate/TimeMap for its captured counterpart as well.
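For completeness, here is a sketch of how a client might prepare the datetime negotiation request for a discovered TimeGate, using the Accept-Datetime header from RFC 7089. The request is only constructed, not sent; the helper name is hypothetical, and the sketch assumes that the TimeGate URI is formed by appending the original URI to the TimeGate base.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(timegate_base, original_uri, when):
    """Prepare a HEAD request asking a TimeGate for the memento nearest `when`."""
    request = Request(timegate_base + original_uri, method="HEAD")
    # Accept-Datetime takes an RFC 1123 date expressed in GMT.
    request.add_header(
        "Accept-Datetime",
        format_datetime(when.astimezone(timezone.utc), usegmt=True),
    )
    return request


request = timegate_request(
    "http://arxiv.example.net/timegate/",
    "http://a.example.org/",
    datetime(2001, 3, 21, 20, 36, 10, tzinfo=timezone.utc),
)
print(request.get_method(), request.full_url)
print(request.header_items())
```

Pointing the same function at the captured TimeGate base is all it takes to negotiate for captured mementos instead.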

Conclusion


The "captured" and "augmented" parallel Memento implementations address the problem of accessing different types of mementos in a standards-based manner.  Given that the selected memento will be the same for both the captured and augmented cases and the difference will only be in the access mechanism (URI), the solution seems straightforward to implement for web archives. Existing clients will continue to function as is, and clients desiring a specific type of memento can use the Archive Registry to find the resources that support that type of memento.

In addition, the optional profile and discovery links add further value, allowing clients to identify which type of mementos they have currently acquired as well as accessing the other types of mementos that are available.

We look forward to feedback on this proposed solution.

--
Shawn M. Jones
- and -
Herbert Van de Sompel
- and -
Michael L. Nelson

Acknowledgements: Ilya Kremer also contributed to the initial discussion of the need for a standard method of accessing captured mementos.

Sunday, April 24, 2016

2016-04-24: WWW 2016 Trip Report



I was fortunate to present a poster at the 25th International World Wide Web Conference, held from April 11, 2016 to April 15, 2016. Though my primary mission was to represent both the WS-DL and the LANL Prototyping Group, I gained a better appreciation for the state of the art of the World Wide Web.  The conference was held in Montréal, Canada at the Palais des congrès de Montréal.



SAVE-SD 2016


I began the conference at the SAVE-SD workshop, focusing on the semantics, analytics, and visualization of scholarly data.  They had 6 full research papers, 2 position papers, and 2 poster papers.  The acceptance rate for this conference is relatively high.  The conference was kicked off by Alejandra Gonzales-Beltran and Francesco Osborne. They encouraged the use of Research Articles in Simplified HTML.

Alex Wade gave us an introduction to the Microsoft Academic Service (MAS) and a sneak peek at the new features offered by Microsoft Academic, such as the Microsoft Academic Graph. They are in the process of adding semantic, rather than keyword, search with the intention of understanding academic user intent when searching for papers. They have opened up their dataset to the community and provide APIs for future community research projects.
Angelo Salatino presented "Detection of Embryonic Research Topics by Analysing Semantic Topic Networks". The study investigated the discovery of "embryonic" (i.e. emerging) topics by testing for more than 2000 topics in more than 3 million publications. The goal is to determine if we can recognize trends in research while they are happening, rather than years later. They were able to show the features of embryonic topics and their next step is to automate their detection.

Bahar Sateli presented "Semantic User Profiles: Learning Scholars’ Competences by Analyzing their Publications". The goal of this study is to mitigate the information overload associated with semantic publishing. They found that it is feasible to semantically model a user's writing history. By modeling the user, better search ranking of document results can be provided for academic researchers. It can also be used to allow researchers to find others with similar interests for the purposes of collaboration.

Francesco Ronzano presented "Knowledge Extraction and Modeling from Scientific Publications" where they propose a platform to turn data from scientific publications into RDF datasets, using the Dr. Inventor Text Mining Framework Java library.  They also generate several example interactive web visualizations of the data. In the future, they seek to improve the Text Mining Framework.

Joakim Philipson presented "Citation functions for knowledge export - a question of relevance, or, can CiTO do the trick?".  He explored the use of the CiTO ontology in order to understand knowledge export - "the transfer of knowledge from one discipline to another as documented by cross-disciplinary citations". Unfortunately, he found that CiTO is not specific enough to capture all of the information needed to understand this.

Sahar Vahdati presented "Semantic Publishing Challenge: Bootstrapping a Value Chain for Scientific Data". The study discussed "the use of Semantic Web technologies to make scholarly publications and data easier to discover, browse, and interact with".  Its goal is to use many different sources to produce linked open datasets about scholarly publications with the intent of improving scholarly communication, especially in the areas of searching and collaboration.  Their next step is to start building services on the data they have produced.

Vidas Daudaravicius presented "A framework for keyphrase extraction from scientific journals".  His framework is able to use keyphrases to define topics that can differentiate journals. Using these keyphrases, one can improve search results by comparing journals to queries, allowing users to find articles of a similar nature. It also has the benefit of noting trends in research, such as when journal topics shift. Researchers can also use the framework to identify the best journals for paper submission.

Ujwal Gadiraju presented "Analysing Structured Scholarly Data Embedded in Web Pages". They analyzed the use of microdata, microformats, and RDF used as bibliographic metadata embedded in scholarly documents with the intent of building knowledge graphs. They found that the distribution of data across providers, domains, and topics was uneven, with few providers actually providing any embedded data. They also found that Computer Science and Life Science documents were more apt to contain this metadata than other disciplines, but also admitted that their Common Crawl dataset may have been skewed in this direction. In the future, they are planning a targeted crawl with further analysis.
Shown below are participants enjoying the SAVE-SD 2016 Poster session. On the left below, Bahar Sateli presented "From Papers to Triples: An Open Source Workflow for Semantic Publishing Experiments". She showed how one could convert natural language academic papers into linked data, which could then be used to provide more specific search results for scholars. For example, the workflow allows a scholarly user to search a corpus for all contributions made in a specific topic.

On the right below, Kata Gábor demonstrated "A Typology of Semantic Relations Dedicated to Scientific Literature Analysis". Her poster shows a model for extracting facts about the state of the art for a particular research field using semantic relations derived from pattern mining and natural language processing techniques.


And shown to the left Erwin Marsi discussed his poster, "Text mining of related events from natural science literature". His study had the goal of producing aggregate facts on the concepts from articles within a corpus.  For example, it aggregates the fact that there is an increase in algae based on the text from many papers that had research results finding an increase in algae. The idea is to find trends in research papers through natural language processing.

In closing, the SAVE-SD 2016 workshop organizers mentioned that selected papers could be resubmitted to PeerJ.

TempWeb 2016


On Tuesday I attended the 6th Temporal Web Analytics Workshop, where I learned about current studies using and analyzing the temporal nature of the web. I spoke to a few of the participants about our work on Memento, and they educated me as to the new work being done.

The morning opened with a Keynote by Wolfgang Nejdl of the Alexandria Project.  Wolfgang Nejdl discussed the work at L3S and how they were trying to consider all aspects of the web, from the technical to its effects on community and society. He discussed how social media has become a powerful force, but tweets and posts link to items that can disappear, losing the context of the original post.  This reminded me of some other work I had seen in the past. He mentioned how important it was to archive these items.
He then went on to cover other aspects of searching the archived web, detailing challenges encountered by project BUDDAH, including the problem of ranking temporal search results. Seen below, he demonstrates an alternative way of visualizing temporal search results using the HistDiv project. This visualization helps in understanding the changing nature of a topic.  In this case, we see how the results of searching for the term Rudolph Giuliani change with time: as the person's career (and career aspirations) changes, so does the content of the archived pages about them. He closed by discussing the use of curated archiving collections in Archive-It in the collaborative search and sharing platform ArchiveWeb, which allows one to find archive collections pertinent to their search query.
The workshop presentations started with two different investigations into ways of creating and performing calculations on temporal graphs.  On the right, Julia Stoyanovich presents "Towards a distributed infrastructure for evolving graph analytics".  She details Portal, a query language for temporal graphs, allowing one to easily query and calculate metrics such as PageRank for a temporal graph, given a specific interval.

Matthias Steinbauer presented "DynamoGraph: A Distributed System for Large-scale, Temporal Graph Processing, its Implementation and First Observations".  DynamoGraph is a system that also allows one to query and calculate metrics on temporal graphs.
Both researchers used the following lunch to discuss temporal graphs at length.  I wondered if one could model TimeMaps in this way and use these tools to discover interesting connections between archived web pages.
Mohsen Shahriari discussed "Predictive Analysis of Temporal and Overlapping Community Structures in Social Media".  He went into detail on the evolution of communities, represented by graphs, detailing how they can grow, shrink, merge, split, or dissolve entirely.  Using datasets from Facebook, DBLP citations, and Enron emails, his experiments showed that smaller communities have a higher chance of survival and that his model had a high success rate in predicting whether a community would survive.
Aécio Santos presented "A First Study on Temporal Dynamics on the Web".  He used topical web page classifiers in a focused crawling experiment to analyze how often web pages about certain topics changed.  Pages from his two topics, ebola and movies, changed at different rates. Pages on ebola were more volatile, losing and gaining links, mostly due to changing news stories on the topic, whereas movies pages were more stable, with authors only augmenting their contents. He did find that, in spite of this volatility, pages did tend to stay on topic over time. The goal is to ensure that crawlers are informed by differences in topics and adjust their strategies accordingly.
Jannik Strötgen presented "Temponym Tagging: Temporal Scopes for Textual Phrases".  He discussed the discovery and use of temponyms to understand the temporal nature of text.  Using temponyms, machines can determine the time period that a text covers. He explained the issues with finding exact temporal intervals or times for web page topics, seeing as many pages are vague. His temponym project, HeidelTime, has been tested on the WikiWars corpus and the YAGO semantic web system.  He also presented further information on this topic, later in WWW 2016.

We then shifted into using temporal analysis for security.  Staffan Truvé from Recorded Future presented "Temporal Analytics for Predictive Cyber Threat Intelligence". His company specializes in using social media and other web sources to detect potential protests, uprisings, and cyberattacks.  He indicated that protests and hacktivism are often talked about online before they happen, allowing authorities time to respond.

In closing, Omar Alonso from Microsoft presented "Time to ship: some examples from the real-world". He highlighted some of the ways in which the carousel from the top of Bing is populated, using topic virality on social media as one of the many inputs. He talked about the concept of social signatures, derived from all of the social media posts referring to the same link.  Using this text, they are able to further determine aboutness for a given link, helping further with search results.  He switched to other topics that help with search, such as connecting place and time. Returning search results for points of interest (POI) at a given location is, in effect, an attempt to match people looking for things to do (queries) with social media posts, checkins, and reviews for a given POI.  He concluded by saying that there is much work to be done, such as allowing POI results for a given time period, for example "things to do in Montréal at night".

Keynotes


Sir Tim Berners-Lee



Sir Tim Berners-Lee spoke of the importance of decentralizing the web, ensuring that users own their own data, web security, work to standardize and improve the ease of payments on the web, and finally the Internet of Things (IoT).
Mentioning the efforts of projects like Solid, he highlighted the need to ensure that users retain their data to ensure their privacy. The idea is that a user can tell the service where to store their data and then they have ownership and responsibility over that data.
He mentioned that, in the past, the Internet had to be deployed by sending tapes through the mail, but now we are heading to a point where the web platform, because it allows you to deploy a full computing platform very quickly, may become the rollout platform for the future. Because of this ability, security is becoming more and more important and he wants to focus on a standard for security that uses the browser, rather than external systems, as the central point for asking a user for their credentials, thereby helping guard against trojans and malicious web sites. He said that the move from HTTP to HTTPS has been less easy than expected, considering many HTTPS pages are "mixed", containing references to HTTP URIs.  This results in three different worlds: those that are HTTP pages, those that are HTTPS pages, and upgrade insecure requests which still provide a mixed page, but one that is endorsed by the author.
Next, he spoke about making web payments standardized, comparing it to authentication. There are a wide variety of different solutions for web payments and there needs to be a standard interface. There is also an increasing call to allow customers to pay smaller amounts than before, which many current systems do not handle. Of course, customers will need to know when they are being phished, hence the security implications of a standardized system.
Finally, he covered the Internet of Things (IoT), indicating there are connections to data ownership, privacy, and security.
In the following Q&A session, I asked Sir Tim Berners-Lee about the steps toward browser adoption for technologies such as Memento.  He said the first step is to discuss them at conferences like WWW, then engage in working groups, workshops, and other venues.  He noted that one also needs to define the users for such new technologies so they can help with the engagement.
Later, during the student Q&A session the following day, Morgannis Graham from McGill University asked Sir Tim Berners-Lee about his thoughts on the role of web archives.  He replied that "personally, I am a pack rat and am always concerned about losing things". He highlighted that while the general web users are thinking of the present, it is the role of libraries and universities to think about the future, hence their role in archiving the web.  He stated that universities and libraries should work more closely together in archiving the web so that if one university falls, others exist having the archives of the one that was lost. He also stated that we all have a role in ensuring that legislation exists to protect archiving efforts.  Finally, he tied his answer back to one of his current projects: what happens to your data when the site you have given it to goes out of business.

Lady Martha Lane-Fox


Wednesday evening ended with an inspiring talk from Lady Martha Lane-Fox.  She works for the UK in a variety of roles advancing the use of technology in society.  She states that a country that can: (1) improve gender balance in tech, (2) improve the technical skills of the populace, and (3) improve the ability to use tech in the public sector, will be the most competitive.


She went further in explaining how the current gender balance is very depressing, noting that in spite of the freedom offered by technology, old hierarchies and structures have been re-established. She indicated that there are studies showing that companies with more diverse boards are more successful, and how we need to tackle this problem, not only from a technical, but also a social perspective.
She discussed the challenges of bringing technology to everyday lives and applauded South Korea's success while highlighting the challenges still present in the UK. She relayed stories of encounters with the citizenry, some of whom were reluctant to embrace the web, but after doing so felt they had more freedom and capability in their lives than ever before. She praised the UK for putting coding on the school curriculum and looking toward the needs of future generations.
She then talked about re-imagining public services entirely through the use of technology. The idea is to make government agencies digital by default in an effort to save money and provide more capability. She highlighted a project where a UK hospital once had 700 administrators and 17 nurses, and, through adopting technology, was then able to take the same money and hire 700 nurses to work with 17 administrators, thus providing better service to patients.
She closed by discussing her program DotEveryone, which is a new organization promoting the promise of the Internet in the UK for everyone and by everyone. Her goal is for the UK to be the most connected, most digitally literate, and most gender equivalent nation on earth. In a larger sense, she wants to kick off a race among countries to use technology to create the best countries for their citizens.

Mary Ellen Zurko


Wednesday morning started with a keynote by Mary Ellen Zurko, from Cisco. She discussed security on the web. Her first lesson: "The future will be different; so will the attacks and attackers, but only if you are wildly successful". Her point was that the success of the web has made it a target. She then covered the history of basic authentication, S-HTTP, and finally SSL/TLS in HTTPS.
She then discussed the social side of security, indicating that users are often confused about how to respond to web browser warnings about security. There is a 90% ignore rate on such warnings, and 60% of those are related to certificates. She highlighted how difficult it is for users to know whether or not a domain is legitimate and if the certificate shown is valid. She also highlighted that most users, even expert users, do not fully understand the permissions they are granting when asked, due to the cryptic and sometimes misleading descriptions given to them, mentioning that 17% of Android users actually pay attention to permissions during installation and only 3% are able to answer questions on what the security permissions mean.


Reiterating the results of a study by Google, she stated that 70% of users clicked through malware warnings in Chrome, but Firefox had more participation. The Google study found that the Firefox warnings provided a better user experience, and thus users were more apt to pay attention and understand them. Following this study, Google changed its warnings in Chrome.
She said that the open web is an equal opportunity environment for both attackers and defenders, detailing how fraudulent tech support scams are quite lucrative. This was discovered in recent work by Cisco, "Reverse Social Engineering Social Tech Support Scammers", where Cisco engineers actively bluffed tech support scammers in order to gather information on their whereabouts and identities.
Of note, she also mentioned that there is a largely unexploited partnership between web science and security.

Peter Norvig


On Friday morning, Peter Norvig gave an engaging speech on the state of the Semantic Web. He mentioned that his job is to bring information retrieval and distributed systems together. He went through a history of information retrieval, discussing WAIS and the World Wide Web, as well as ARCHIE. Before Google, several were trying to tame the nascent web at the time.
After Google, the Semantic Web was developed as a way to extract information from the many pages that existed. He talked about how Tim Berners-Lee was a proponent, whereas Cory Doctorow highlighted that there was nothing but obstacles in its path. Peter said that Cory had several reasons for why it would fail, but the main ones were (1) people lie, (2) people are lazy, and (3) people are stupid, indicating that the information gathered from such a system would consist of intentional misinformation, lack of complete information, or misinformation due to incompetence.
Peter then highlighted several instances where this came about. Initially, excellent expressiveness was produced by highly trained logicians, giving us DAML, OWL, RDFa, FOAF, etc. Unfortunately, they found a 40% page error rate in practice, indicating that Cory was correct on all 3 fronts. Peter's conclusion was that the highly trained logicians did not seem to solve the identified problems.

Peter then posited "what about a highly trained webmaster?". In 2010, search companies promoted the creation of schema.org with the idea of keeping it simple. The search engines promised that if a site were marked up, then they would show it immediately in search results. This gave users an incentive to mark up their pages and now has resulted in technologies that can better present things like hotel reservations and product information. This led most to conclude that schema.org was an unexpected success.
Peter closed by saying that obstacles still remain, seeing as most of the data comes from web site owners, still leading to misinformation in some cases. He talked about the need to be able to connect different sources together so that one can, for example, not only find a book on Amazon, but also a listing of the author's interests on Facebook. He hopes that neural networks could be combined with semantic and syntactic approaches to solve some of these large connection problems.

W3C Track


Tzviya Siegman, from John Wiley & Sons Publishing, presented "Scholarly Publishing in a Connected World". She discussed how publications of the past were immutable, and publishers did little with content once something was published. She confessed that in a world where machines are readers, too, publications are a bit behind the times. She further said that we still have an obsession with pages, citing them, marking them, and so on, when in reality the web is not bound by pages. She wants to standardize on a small set of RDFa vocabularies that would enable gathering of content by topic, whether the documents published are articles, data, or electronic notebooks. She closed by talking about how Wiley is trying to extract metadata from its own corpus to provide additional data for scholars.
Hugh McGuire presented "Opening the book: What the Web can teach books, and what books can teach the Web". He talked about how books seem to hold a special power and value, specific to the boundedness of a book. The web, by contrast, is unbounded; even a single web site is unknowable with no sense of a beginning or an end. On the web, however, anyone can publish documents and data to a global audience without any required permission. He talked about how books are a singular important node of knowledge, with the ebook business having the opposite motive of the web, making ebooks a kind of restricted, broken version of the web. He wants to be able to combine the two. For example a system can provide location-aware annotations of an ebook while also sharing those annotations freely, essentially making ebooks smarter and more open.
Ivan Herman revealed Portable Web Publications (PWP), which has serious implications for archiving. The goal is to allow people to download web publications like they do ebooks, PDFs, or other portable articles. There is a need to do so because connectivity is not yet ubiquitous. With the power of the web, one can also embed interactivity into the downloaded document. Of course, there are also additional considerations, like the form factor of the reading device and the needs of the reader. The concept is more than just creating an ebook with interactive components or a web page that can be saved offline. He highlighted the work of publishers in terms of ergonomics and aesthetics, stating that web designers for such portable publications should learn from this work. Portable Web Publications would not be suitable for social web sites, web mail, or anything that depends on real-time data. PWP requires 3 layers of addressing: (1) locating the PWP itself, (2) locating a resource within a PWP, and (3) locating a target within such a resource. In practice, locators depend on the state of the resource, creating a bit of a mess. His group is currently focusing on a manifest specification to solve these issues.

Poster Session


Of course, I was here to present a poster, "Persistent URIs Must Be Used to be Persistent", developed by Herbert Van de Sompel, Martin Klein, and me, which indicates important consequences for the use of persistent URIs such as DOIs.

In looking at the data from "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot", we reviewed 1.6 million web references from 1.8 million articles and discovered 3 things:
  1. use of web references is increasing in scholarly articles
  2. frequently authors use publisher web pages (locating URI) rather than DOIs (persistent URI) when creating references
We show on the poster that, because many use browser bookmarks or citation managers that store these locating URIs, there must be an easy way to help tools find the DOI. Our suggestion is to store this DOI in the Link header for easy access by these tools.
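As a rough illustration of how a bookmarking tool or citation manager might pick the DOI out of such a header, here is a minimal Python sketch. The relation type `identifier`, the example URIs, and the function names are assumptions made for this example; they are not taken from the poster, and the parser handles only simple, well-formed Link header values.

```python
import re

# Hypothetical Link header that a publisher landing page might return.
# The relation type "identifier" and both URIs are illustrative assumptions.
LINK_HEADER = (
    '<https://doi.org/10.1000/example>; rel="identifier", '
    '<https://publisher.example/articles/42.pdf>; rel="item"'
)


def parse_link_header(value):
    """Return (uri, [relation types]) pairs from a simple Link header value."""
    links = []
    # Split on commas that are followed by the "<" starting the next link.
    for part in re.split(r",\s*(?=<)", value.strip()):
        match = re.match(r"<([^>]*)>(.*)", part)
        if not match:
            continue
        uri, params = match.groups()
        rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params)
        rels = rel.group(1).split() if rel else []
        links.append((uri, rels))
    return links


def find_persistent_uri(value, relation="identifier"):
    """Return the first URI carrying the given relation type, or None."""
    for uri, rels in parse_link_header(value):
        if relation in rels:
            return uri
    return None


doi_uri = find_persistent_uri(LINK_HEADER)
```

A tool could run this against the response headers of whatever locating URI the user bookmarked and store the returned DOI instead.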

I appreciate the visit from Sarven Capadisli and Amy Guy who work on Solid. Many others came by to see our work, like Takeru Yokoi, Hideaki Takeda, Lee Giles, and Pieter Colpaert. Most appreciated the idea, noting it as "simple" with some asking "why don't we have this already?".

WWW Conference Presentations


Even though I attended many additional presentations, I will only detail a few of interest.
As a person who has difficulty with SPARQL, I appreciated the efforts of Gonzalo Diaz and his co-authors in "Reverse Engineering SPARQL Queries". Their goal was to reverse engineer SPARQL queries with the intent of producing better examples for new users, seeing as new users have a hard time with the precise syntax and semantics of the language. Given a database and answers, they wanted to reverse engineer the queries that produced those answers. Unfortunately, they discovered that verifying a reverse engineered SPARQL query to determine if it is the canonical query for a given database and answer is an NP-complete (intractable) problem. They were however able to perform some heuristics on a specific subset of queries to solve this problem in polynomial time.
Fernando Suarez presented "Foundations of JSON Schema". He mentioned that JSON is very popular because it is flexible, but there is no way to describe what kind of JSON response a client should expect from a web service. He discussed a proposal from the Internet Engineering Task Force to develop JSON Schema, a set of restrictions that documents must satisfy. He said the specification is in its fourth draft, but is still ambiguous. Even online validators disagree on some content, meaning that we need clear semantics for validation, and he proposes a formal grammar. His contribution is an analysis showing that the validation problem is PTIME-complete, but that determining if a document has an equivalent JSON schema is PSPACE-hard for very simple schemas. For the future, he intends to work further on integrity constraints for JSON documents and more use cases for JSON schema.
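For readers unfamiliar with the idea of a schema as "a set of restrictions that documents must satisfy", here is a minimal Python sketch. It is a hand-rolled toy that understands only three keywords (`type`, `required`, `properties`); it is not a conforming JSON Schema validator, and the example schema is made up for illustration.

```python
import json

# Checks for the JSON types this toy validator understands.
TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "null": lambda v: v is None,
}


def validate(document, schema):
    """Check a parsed JSON document against a tiny subset of schema keywords."""
    if "type" in schema and not TYPE_CHECKS[schema["type"]](document):
        return False
    if isinstance(document, dict):
        # Every required key must be present.
        for key in schema.get("required", []):
            if key not in document:
                return False
        # Keys that are present must satisfy their own sub-schema.
        for key, subschema in schema.get("properties", {}).items():
            if key in document and not validate(document[key], subschema):
                return False
    return True


# Made-up schema: an object with a mandatory string title and an optional year.
SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["title"],
  "properties": {
    "title": {"type": "string"},
    "year": {"type": "number"}
  }
}
""")

ok = validate(json.loads('{"title": "Foundations", "year": 2016}'), SCHEMA)
missing_title = validate(json.loads('{"year": 2016}'), SCHEMA)
```

Even this small subset leaves questions open, such as whether a boolean counts as a number, which hints at why validators can disagree without precise semantics.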
David Garcia presented "The QWERTY Effect on the Web; How Typing Shapes the Meaning of Words in Online-Human Communications".  He highlighted a hypothesis that words typed with more letters from the right side of the keyboard are more positive than those with more letters from the left. He tested this hypothesis on product ratings from different datasets and found that 9 out of 11 datasets show a significant QWERTY effect, which is independent of the number of views or comments on an item. He did mention that he needs to repeat the study with different languages and keyboard layouts. He closed by saying that there is no evidence yet that we can predict meanings or change evaluations based on this knowledge.
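The quantity being compared can be sketched in a few lines of Python. The left and right letter sets below follow the usual touch-typing split of a QWERTY keyboard, and the function name `right_minus_left` is my own; the paper's exact measure may differ.

```python
# Letters typed by each hand on a standard QWERTY layout (touch-typing split).
LEFT = set("qwertasdfgzxcvb")
RIGHT = set("yuiophjklnm")


def right_minus_left(word):
    """Count right-side letters minus left-side letters, ignoring other characters."""
    lowered = word.lower()
    right = sum(1 for c in lowered if c in RIGHT)
    left = sum(1 for c in lowered if c in LEFT)
    return right - left


scores = {w: right_minus_left(w) for w in ["lol", "sad", "web", "Jump!"]}
# scores == {"lol": 3, "sad": -3, "web": -3, "Jump!": 4}
```

Under the hypothesis, words with a positive score would tend to appear in more positive contexts than words with a negative score.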
Justin Cheng presented "Do Cascades Recur?", in which he analyzed how memes rise and fall multiple times across social media. Prior work shows that a cascade (meme sharing) rises, then falls, but in reality there are many rises and falls over time. He studies these different peaks and tries to determine how and why these cascades recur. Seeing as these bursts are separated among different network communities, cascades recur when people connect communities and reshare something. It turns out that a meme with high virality has less chance of recurring, but one with medium virality will recur months or perhaps years later. He would like to repeat his study with networks other than Facebook and develop improved models of recurrence based on other data.
Pramod Bhatotia presented "IncApprox: The Marriage of incremental and approximate computing". He discussed how data analytic systems transform raw data into useful information, but they need to strike a balance between low latency and high throughput. There are two computing paradigms that try to strike this balance: (1) incremental computation and (2) approximate computing. Incremental computation is motivated by the fact that we often recompute the output after only small changes in the input and can reuse memoized parts of the computation that are unaffected by the changed input. Approximate computing is motivated by the fact that an approximate answer is often good enough. With approximate computing we get the entire input dataset, but compute over only parts of the input and then produce approximate output in a low latency manner. His contribution is the combination of these two approaches.
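To show how the two ideas can sit together, here is a toy Python sketch that memoizes per-chunk results and estimates a total from a sample of chunks. It is not IncApprox's algorithm: the class name, the chunking, and the simple scale-up estimator are all assumptions made for illustration.

```python
import random


class ApproxIncrementalSum:
    """Estimate a total over chunks by sampling, reusing memoized chunk sums."""

    def __init__(self, sample_fraction=0.5, seed=0):
        self.sample_fraction = sample_fraction
        self.rng = random.Random(seed)
        self.cache = {}       # chunk contents (as a tuple) -> memoized sum
        self.cache_hits = 0

    def _chunk_sum(self, chunk):
        """Incremental part: reuse the sum of a chunk seen in an earlier run."""
        key = tuple(chunk)
        if key in self.cache:
            self.cache_hits += 1
        else:
            self.cache[key] = sum(chunk)
        return self.cache[key]

    def estimate(self, chunks):
        """Approximate part: sum a sample of chunks and scale to all chunks."""
        if not chunks:
            return 0.0
        k = max(1, round(len(chunks) * self.sample_fraction))
        sampled = self.rng.sample(chunks, k)
        partial = sum(self._chunk_sum(chunk) for chunk in sampled)
        return partial * len(chunks) / k


# With sample_fraction=1.0 every chunk is used, so the estimate is exact
# and the effect of memoization is easy to see.
job = ApproxIncrementalSum(sample_fraction=1.0)
first = job.estimate([[1, 2], [3, 4], [5, 6]])
second = job.estimate([[1, 2], [3, 4], [7, 8]])  # only the last chunk changed
```

Lowering `sample_fraction` trades accuracy for less work per run, while the cache avoids recomputing chunks that did not change between runs.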
Jessica Su presented "The Effect of Recommendations on Network Structure". She worked with Twitter on the rollout of a recommendation system that suggests new people to follow. They restricted the experiment to two weeks to avoid any noise from outside the rollout. They found that there is an effect; people's followers did increase after the rollout. They also confirmed that the "rich get richer", with those who already had many followers gaining more followers and those with few still gaining some followers. She also mentioned that people did not appear to be making friends, only following others.
Samuel Way presented "Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks". This study tried to investigate why women are not participating in computer science. He mentioned that there are conflicting results. Universities have a 2-to-1 preference for female faculty applicants, but at the same time there is a bias favoring male students. They developed a framework for modeling faculty hiring networks using a combination of CVs, social media profiles, and other sources on a subset of people currently going through the tenure process. The model shows that gender bias is not uniformly, systematically affecting all hires in the same way and that the top institutions fight over a small group of people. Women are a limited resource in this market and some institutions are better at competing for them. The result is that accounting for gender does not help predict faculty placement, leading them to conclude that the effects of gender are accounted for by other factors, such as publishing or post-doctoral training rates or the fact that some institutions appear to be better at hiring women than others. The model predicts that men and women will be hired at equal rates in Computer Science by the 2070s.

Social

Of course, I did not merely enjoy the presentations and posters. Among the Monday night SAVE-SD dinner, the Thursday night Gala, and lunch each day, I took the opportunity to acquaint myself with many field experts. Google, Yahoo!, and Microsoft were also there looking to discuss data sharing, collaboration, and employment opportunities.

I always had lunch company thanks to the efforts of Erik Wilde, Michael Nolting, Roland Gülle, Eike Von Seggern, Francesco Osborne, Bahar Sateli, Angelo Salatino, Marc Spaniol, Jannik Strötgen, Erdal Kuzey, Matthias Steinbauer,  Julia Stoyanovich, Jan Jones, and more.
Furthermore, the Gala introduced me to other attendees, like Chris LaRoche, Marc-Olivier Lamothe, Ashutosh Dhekne, Mensah Alkebu-Lan, Salman Hooshmand, Li'ang Yin, Alex Jeongwoo Oh, Graham Klyne, and Lukas Eberhard. Takeru Yokoi introduced me to Keiko Yokoi from the University of Tokyo, who was familiar with many aspects of digital libraries and quite interested in Memento. I also had a fascinating discussion about Memento and the Semantic Web with Michel Gagnon and Ian Horrocks, who suggested I read "Introduction to Description Logic" to understand more of the concepts behind the semantic web and artificial intelligence.

In Conclusion


As my first academic conference, the WWW 2016 conference was an excellent experience, bringing me in touch with paragons on the forefront of web research. I now have a much better understanding of where we are in the many aspects of the web and scholarly communications.
Even as we left the conference and said our goodbyes, I knew that many of us had been encouraged  to create a more open, secure, available, and decentralized web.