Monday, March 2, 2015

2015-03-02 Reproducible Research: Lessons Learned from Massive Open Online Courses

Source: Dr. Roger Peng (2011). Reproducible Research in Computational Science. Science 334: 122

Have you ever needed to look back at a program and research data from lab work performed last year, last month or maybe last week and had a difficult time recalling how the pieces fit together? Or, perhaps the reasoning behind the decisions you made while conducting your experiments is now obscure due to incomplete or poorly written documentation.  I never gave this idea much thought until I enrolled in a series of Massive Open Online Courses (MOOCs) offered on the Coursera platform. The courses, which I took during the period from August to December of 2014, were part of a nine course specialization in the area of data science. The various topics included R Programming, Statistical Inference and Machine Learning. Because these courses are entirely free, you might think they would lack academic rigor. That's not the case. In fact, these particular courses and others on Coursera are facilitated by many of the top research universities in the country. The courses I took were taught by professors in the biostatistics department of the Johns Hopkins Bloomberg School of Public Health. I found the work to be quite challenging and was impressed by the amount of material we covered in each four-week session. Thank goodness for the Q&A forums and the community teaching assistants as the weekly pre-recorded lectures, quizzes, programming assignments, and peer reviews required a considerable amount of effort each week.

While the data science courses are primarily focused on data collection, analysis and methods for producing statistical evidence, there was a persistent theme throughout -- this notion of reproducible research. In the figure above, Dr. Roger Peng, a professor at Johns Hopkins University and one of the primary instructors for several of the courses in the data science specialization, illustrates the gap between no replication and the possibilities for full replication when both the data and the computer code are made available. This was a recurring theme that was reinforced with the programming assignments. Each course concluded with a peer-reviewed major project where we were required to document our methodology, present findings and provide the code to a group of anonymous reviewers; other students in the course. This task, in itself, was an excellent way to either confirm the validity of your approach or learn new techniques from someone else's submission.

If you're interested in more details, the following short lecture from one of the courses (16:05), also presented by Dr. Peng, gives a concise introduction to the overall concepts and ideas related to reproducible research.

I received an introduction to reproducible research as a component of the MOOCs, but you might be wondering why this concept is important to the data scientist, analyst or anyone interested in preserving research material. Consider the media accounts in the latter part of 2014 of admonishments for scientists who could not adequately reproduce the results of groundbreaking stem cell research (Japanese Institute Fails to Reproduce Results of Controversial Stem-Cell Research) or the Duke University medical research scandal which was documented in a 2012 segment of 60 Minutes. On the surface these may seem like isolated incidents, but they’re not.  With some additional investigation, I discovered some studies, as noted in a November 2013 edition of The Economist, which have shown reproducibility rates as low as 10% for landmark publications posted in scientific journals (Unreliable Research: Trouble at the Lab). In addition to a loss of credibility for the researcher and the associated institution, scientific discoveries which cannot be reproduced can also lead to retracted publications which affect not only the original researcher but anyone else whose work was informed by possibly erroneous results or faulty reasoning. The challenge of reproducibility is further compounded by technology advances that empower researchers to rapidly and economically collect very large data sets related to their discipline; data which is both volatile and complex. You need only think about how quickly a small data set can grow when it's aggregated with other data sources.

Cartoon by Sidney Harris (The New Yorker)

So, what steps should the researcher take to ensure reproducibility? I found an article published in 2013, which lists Ten Simple Rules for Reproducible Computational Research. These rules are a good summary of the ideas that were presented in the data science courses.
  • Rule 1: For Every Result, Keep Track of How It Was Produced. This should include the workflow for the analysis, shell scripts, along with the exact parameters and input that was used.
  • Rule 2: Avoid Manual Data Manipulation Steps. Any tweaking of data files or copying and pasting between documents should be performed by a custom script.
  • Rule 3: Archive the Exact Versions of All External Programs Used. This is needed to preserve dependencies between program packages and operating systems that may not be readily available at a later date.
  • Rule 4: Version Control All Custom Scripts. Exact reproduction of results may depend upon a particular script. Archiving tools such as Subversion or Git can be used to track the evolution of code as its being developed.
  • Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. Intermediate results can reveal faulty assumptions and uncover bugs that may not be apparent in the final results.
  • Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds. Using the same random seed ensures exact reproduction of results rather than approximations.
  • Rule 7: Always Store Raw Data behind Plots. You may need to modify plots to improve readability. If raw data are stored in a systematic manner, you can modify the plotting procedure instead of redoing the entire analysis.
  • Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected. In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying any summaries.
  • Rule 9: Connect Textual Statements to Underlying Results. Statements that are connected to underlying results can include a simple file path to detailed results or the ID of a result in the analysis framework.
  • Rule 10: Provide Public Access to Scripts, Runs, and Results. Most journals allow articles to be supplemented with online material. As a minimum, you should submit the main data and source code as supplementary material and be prepared to respond to any requests for further data or methodology details by peers.
In addition to the processing rules, we were also encouraged to adopt suitable technology packages as part of our toolkit. The following list represents just a few of the many products we used to assemble a reproducible framework and also introduce literate programming and analytical techniques into the assignments.
  • R and RStudio: Integrated development environment for R.
  • Sweave: An R package that allows you to embed R code in LaTeX documents.
  • Knitr: New enhancements to the Sweave package for dynamic report generation. It supports publishing to the web using R Markdown and R HTML.
  • R Markdown: Integrates with knitr and RStudio. Allows you to execute R code in chunks and create reproducible documents for display on the web.
  • RPubs: Web publication tool for sharing R markdown files. The gallery of example documents illustrates some useful techniques.
  • Git and GitHub: Open source version control repository.
  • Apache Subversion (SVN): Open source version control repository.
  • iPython Notebook: Creates literate webpages and documents interactively in Python. You can combine code execution, text, mathematics, plots and rich media into a single document. This gallery of videos and screencasts includes tutorials and hands-on demonstrations.
  • Notebook Viewer: Web publication tool for sharing iPython notebook files.

As a result of my experience with the MOOCs, I now have a greater appreciation for the importance of reproducible research and all that it encompasses. For more information on the latest developments, you can refer to any of these additional resources or follow Dr. Peng (@rdpeng) on Twitter.

-- Corren McCoy

Tuesday, February 17, 2015

2015-02-17: Reactions To Vint Cerf's "Digital Vellum"

Don't you just love reading BuzzFeed-like articles, constructed solely of content embedded from external sources?  Yeah, me neither.  But I'm going to pull one together anyway.

Vint Cerf generated a lot of buzz last week when at an AAAS meeting he gave talk titled "Digital Vellum".  The AAAS version, to the best of my knowledge, is not online but this version of "Digital Vellum" at CMU-SV from earlier the same week is probably the same.

The media (e.g., The Guardian, The Atlantic, BBC) picked up on it, because when Vint Cerf speaks people rightly pay attention.  However, the reaction from archiving practitioners and researchers was akin to having your favorite uncle forget your birthday, mostly because Cerf's talk seemed to ignore the last 20 or so years of work in preservation.  For a thoughtful discussion of Cerf's talk, I recommend David Rosenthal's blog post.  But let's get to the BuzzFeed part...

In the wake of the media coverage, I found myself retweeting many of my favorite wry responses starting with Ian Milligan's observation:

Andy Jackson went a lot further, using his web archive (!) to find out how long we've been talking about "digital dark ages":

And another one showing how long The Guardian has been talking about it:

And then Andy went on a tear with pointers to projects (mostly defunct) with similar aims as "Digital Vellum":

Andy's dead right, of course.  But perhaps Jason Scott has the best take on the whole thing:

So maybe Vint didn't forget our birthday, but we didn't get a pony either.  Instead we got a dime kitty


2015-02-17: Fixing Links on the Live Web, Breaking Them in the Archive

On February 2nd, 2015, Rene Voorburg announced the JavaScript utility robustify.js. The robustify.js code, when embedded in the HTML of a web page, helps address the challenge with link rot by detecting when a clicked link will return an HTTP 404 and uses the Memento Time Travel Service to discover mementos of the URI-R. Robustify.js assigns an onclick event to each anchor tag in the HTML. The event occurs, robustify.js makes an Ajax call to a service to test the HTTP response code of the target URI.

When an HTTP 404 response code is detected by robustify.js, it uses Ajax to make a call to a remote server, uses the Memento Time Travel Service to find mementos of the URI-R, and uses a JavaScript alert to let the user know that JavaScript will redirect the user to the memento.

Our recent studies have shown that JavaScript -- particularly Ajax -- normally makes preservation more difficult, but robustify.js is a useful utility that is easily implemented to solve an important challenge. Along this thought process, we wanted to see how a tool like robustify.js would behave when archived.

We constructed two very simple test pages, both of which include links to Voorburg's missing page
  1. which does not use robustify.js
  2. which does use robustify.js
In robustifyTest.html, when the user clicks on the link to, an HTTP GET request is issued by robustify.js to an API that returns an existing memento of the page:

GET /services/statuscode.php? HTTP/1.1
Connection: keep-alive
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4
Accept: */*
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Fri, 06 Feb 2015 21:47:51 GMT
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.3.10-1ubuntu3.15
Access-Control-Allow-Origin: *

The resulting JSON is used by robustify.js to then redirect the user to the memento as expected.

Given this success, we wanted to understand how our test pages would behave in the archives. We also included a link to the stellingen memento in our test page before archiving to understand how a URI-M would behave in the archives. We used the Internet Archive's Save Page Now feature to create the mementos at URI-Ms and

The Internet Archive re-wrote the embedded links to be relative to the archive in the memento, converting to Upon further investigation, we noticed that robustify.js does not issue onclick events to anchor tags linking to pages within the same domain as the host page. An onclick even is not assigned to this any embedded anchor tags because all of the links point to within the Internet Archive, the host domain. Due to this design decision, robustify.js is never invoked when within the archive.

When clicking on the URI-M, the 2015-02-06 memento does not exist, so the Internet Archive redirects the user to the closest memento The user, when clicking the link, ends up at the 1999 memento because the Internet Archive understands how to redirect the user from the 2015 URI-M for a memento that does not exist to the 1999 URI-M for a memento that does exist. If the Internet Archive had no memento for the user would simply receive a 404 and not have the benefit of robustify.js using the Memento Time Travel service to search additional archives.

The robustify.js file is archived ( but its embedded URI-Rs are re-written by the Internet Archive.  The original, live web JavaScript has URI templates embedded in the code that are completed at run time by inserting the "yyymmddhhmmss" and "url" variable strings into the URI-R:


These templates are rewritten during playback to be relative to the Internet Archive:


Because the robustify.js is modified during archiving, we wanted to understand the impact of including the URI-M of robustify.js ( in our test page ( In this scenario, the JavaScript attempts to execute when the user clicks on the page's links, but the re-written URIs point to /web/20150206214020/ (since test-r.html exists on, the links are relative to instead of

Instead of issuing an HTTP GET for, robustify.js issues an HTTP GET for which returns an HTTP 404 when dereferenced.
The robustify.js script does not handle the HTTP 404 response when looking for its service, and throws an exception in this scenario. Note that the memento that references the URI-M of robustify.js does not throw an exception because the robustify.js script does not make a call to

In our test mementos, the Internet Archive also re-writes the URI-M to

This memento of a memento (in a near Yo Dawg situation) does not exist. Clicking on the apparent memento of a memento link leads to the user being told by the Internet Archive that the page is available to be archived.

We also created an memento of our robustifyTest.html page: In this memento, the functionality of the robustify script is removed, redirecting the user to which results in a HTTP 404 response from the live web. The link to the Internet Archive memento is re-written to, which results in a redirect (via a refresh) to which results in a HTTP 404 response from the live web, just as before. uses this redirect approach as standard operating procedure. However, re-writes all links to URI-Ms back to their respective URI-Rs.

This is a different path to a broken URI-M than the Internet Archive takes, but results in a broken URI-M, nonetheless.  Note that simply removes the robustify.js file from the memento, not only removing the functionality, but also removing any trace that it was present in the original page.

In an odd turn of events, our investigation into whether a JavaScript tool would behave properly in the archives has also identified a problem with URI-Ms in the archives. If web content authors continue to utilize URI-Ms to mitigate link rot or utilize tools to help discover mementos of defunct links, there is a potential that the archives may see additional challenges of this nature arising.

--Justin Brunelle

Wednesday, January 21, 2015

2015-02-05: What Did It Look Like?

Having often wondered why many popular videos on the web are time lapse videos (that is videos which capture the change of a subject over time), I came to the conclusion that impermanence gives value to the process of preserving ourselves or other subjects in photography. As though a means to defy the compulsory fundamental law of change. Just like our lives, one of the greatest products of human endeavor, the World Wide Web, was once small, but has continued to grow. So it is only fitting for us to capture the transitions.
What Did It Look Like? is a Tumblr blog which uses the Memento framework to poll various public web archives, take the earliest archived version from each calendar year, and then create an animated image that shows the progression of the site through the years.

To seed the service we randomly chose some web sites and processed them (see also the archives). In addition, everyone is free to nominate web sites to What Did It Look Like? by tweeting: "#whatdiditlooklike URL". 

In order to see how this system is achieved, consider the architecture diagram below. 

The system is implemented in Python and utilizes Tweepy and PyTumblr to access the Twitter and Python APIs respectively, and consists of the following programs:
  1. This application fetches tweets (with "#whatdiditlooklike URL" signature) by using the tweet ID of the last tweet visited as reference to know where to begin retrieving tweets. For example, if the application initially visited tweet IDs 0, 1, 2. It keeps track of the ID 2, so as to begin retrieving tweets with IDs greater than 2 in a subsequent tweet retrieval operation. Also, since Twitter rate limits the number of search operations (180 requests per 15 minute window), the application sleeps in between search operations. The snippet below outlines the basic operations of fetching tweets after the last tweet visited:
  2. This is a simple application with invokes for each nomination tweet (that a tweet with the "#whatdiditlooklike URL" signature).
  3. Given an input URL, this application utilizes PhantomJS, (a headless browser) to take screenshots and utilizes ImageMagick to create an animated GIF. It should also be noted that the GIFs created are optimized due to the snippet below in order to reduce their respective sizes to under 1MB. This ensures the animation is not deactivated by Tumblr.
  4. this application executes two primary operations:
    1. Publication of the animated GIFs of nominated URLs to Tumblr: This is done through the PyTumblr API create_photo() method as outlined by the snippet below:
    2. Notifying the referrer and making status updates on Twitter: This is achieved through Tweepy's api.update_status() method as outlined by the snippet below which tweets the status update message:However, a simple periodic Twitter status update message could result in the message to be flagged eventually as spam by Twitter. This comes in form of a 226 error code. In order to avoid this, does not post the same status update tweet message or notification tweet. Instead the application randomly selects from a suite of messages and injects a variety of attributes which ensure status update tweets are different. The randomness in execution is due to a custom cron application which randomly executes the entire stack beginning from down to

How to nominate sites onto What Did It Look Like?

If you are interested in seeing what a web site looked like through the years:
  1. Search to see if the web site already exists by using the search service in the archives page; this can be done by typing the URL of the web site and hitting submit. If the URL does not exist on the site, go ahead to step 2.
  2. Tweet "#whatdiditlooklike URL" to nominate a web site or tweet "#whatdiditlooklike URL1, URL2, ..., URLn" to nominate multiple URLs.
Tweet "#whatdiditlooklike URL" to nominate a web site or tweet "#whatdiditlooklike URL1, URL2, ..., URLn" to nominate multiple URLs.

How to explore historical posts

To explore historical posts, visit the archives page:

What Did Look Like?

What Did Look Like?
What Did Look Like?

"What Did It Look Like?" is inspired by two sources: 1) the "One Terabyte of Kilobyte Age Photo Op" Tumblr that Dragan Espenschied presented at DP 2014 (which basically demonstrates digital preservation as performance art; see also the commentary blog by Olia Lialina & Dragan), and 2) the Digital Public Library of America (DPLA) "#dplafinds" hashtag that surfaces interesting holdings that one would otherwise likely not discover.  Both sources have the idea of "randomly" highlighting resources that you would otherwise not find given the intimidatingly large collection in which they reside.

We hope you'll enjoy this service as a fun way to see how web sites -- and web site design! -- have changed through the years.


Thursday, January 15, 2015

2015-01-15: The Winter 2015 Federal Cloud Computing Summit

On January 14th-15th, I attended the Federal Cloud Computing Summit in Washington, D.C., a recurring event in which I have participated in the past. In my continuing role as the MITRE-ATARC Collaboration Session lead, I assisted the host organization, the Advanced Technology And Research Center (ATARC) in organizing and run the MITRE-ATARC Collaboration Sessions. The summit is designed to allow Government representatives to meeting and collaborate with industry, academic, and other Government cloud computing practitioners on the current challenges in cloud computing.

The collaboration sessions continue to be highly valued within the government and industry. The Winter 2015 Summit had over 400 government or academic registrants and more than 100 industry registrants. The whitepaper summarizing the Summer 2014 collaboration sessions is now available.

A discussion of FedRAMP and the future of the policies was held in a Government-only session at 11:00 before the collaboration sessions began.
At its conclusion, the collaboration sessions began, with four sessions focusing on the following topics.
  • Challenge Area 1: When to choose Public, Private, Government, or Hybrid clouds?
  • Challenge Area 2: The umbrella of acquisition: Contracting pain points and best practices
  • Challenge Area 3: Tiered architecture: Mitigating concerns of geography, access management, and other cloud security constraints
  • Challenge Area 4: The role of cloud computing in emerging technologies
Because participants are protected by Chathan House Rule, I cannot elaborate on the Government representation or discussions in the collaboration sessions. MITRE will continue its practice of releasing a summary document after the Summit (for reference, see the Summer 2014 and Winter 2013 summit whitepapers).

On January 15th, I attended the Summit which is a conference-style series of panels and speakers with an industry trade-show held before the event and during lunch. At 3:25-4:10, I moderated a panel of Government representatives from each of the collaboration sessions in a question-and-answer session about the outcomes from the previous day's collaboration sessions.

To follow along on Twitter, you can refer to the Federal Cloud Computing Summit Handle (@cloudfeds), the ATARC Handle (@atarclabs), and the #cloudfeds hashtag.

This was the fourth Federal Summit event in which I have participated, including the Winter 2013 and Summer 2014 Cloud Summits and the 2013 Big Data Summit. They are great events that the Government participants have consistently identified as high-value. The events also garner a decent amount of press in the federal news outlets and at MITRE. Please refer to the list of press for the most recent articles about the summit.

We are continuing to expand and improve the summits, particularly with respect to the impact on academia. Stay tuned for news from future summits!

--Justin F. Brunelle

Saturday, January 3, 2015

2015-01-03: Review of WS-DL's 2014

The Web Science and Digital Libraries Research Group's 2014 was even better than our 2013.  First, we graduated two PhD students and had many other students advance their status:
In April we introduced our now famous "PhD Crush" board that allows us to track students' progress through the various hoops they must jump through.  Although it started as sort of a joke, it's quite handy and popular -- I now wish we had instituted it long ago. 

We had 15 publications in 2014, including:
JCDL was especially successful, with Justin's paper "Not all mementos are created equal: Measuring the impact of missing resources" winning "best student paper" (Daniel Hasan from UFMG also won a separate "best student paper" award), and Chuck's paper "When should I make preservation copies of myself?" winning the Vannevar Bush Best Paper award.  It is truly a great honor to have won both best paper awards at JCDL this year (pictures: Justin accepting his award, and me accepting on behalf of Chuck).  In the last two years at JCDL & TPDL, that's three best paper awards and one nomination.  The bar is being raised for future students.

In addition to the conference paper presentations, we traveled to and presented at a number of conferences that do not have formal proceedings:
We were also fortunate enough to visit and host visitors in 2014:
We also released (or updated) a number of software packages for public use, including:
Our coverage in the popular press continued, with highlights including:
  • I appeared on the video podcast "This Week in Law" #279 to discuss web archiving.
  • I was interviewed for the German radio program "DRadio Wissen". 
We were more successful on the funding front this year, winning the following grants:
All of this adds up to a very busy and successful 2014.  Looking ahead to 2015, as well as continued publication and funding success, we are expecting to graduate both one MS & one Ph.D. student and host another visiting researcher (Michael Herzog, Magdeburg-Stendal University). 

Thanks to everyone that made 2014 such a great success, and here's to a great start to 2015!


Saturday, December 20, 2014

2014-12-20: Using Search Engine Queries For Reliable Links

Earlier this week Herbert brought to my attention Jon Udell's blog post about combating link rot by crafting search engine queries to "refind" content that periodically changes URIs as the hosting content management system (CMS) changes.

Jon has a series of columns for InfoWorld, and whenever InfoWorld changes their CMS the old links break and Jon has to manually refind all the new links and update his page.  For example, the old URI:

is currently:

The same content had at least one other URI as well, from at least 2009--2012:

The first reaction is to say InfoWorld should use "Cool URIs", mod_rewrite, or even handles.  In fairness, Inforworld is still redirecting the second URI to the current URI:

And it looks like they kept redirecting the original URI to the current URI until sometime in 2014 and then quit; currently the original URI returns a 404:

Jon's approach is to just give up on tracking different URIs for his 100s of articles and instead use a combination of metadata (title & author) and the "site:" operator submitted to a search engine to locate the current URI (side note: this approach is really similar to OpenURL).  For example, the link for the article above would become:

Herbert had a number of comments, which I'll summarize as:
  • This problem is very much related to Martin's PhD research, in which web archives are used to generate lexical signatures to help refind the new URIs on the live web (see "Moved but not gone: an evaluation of real-time methods for discovering replacement web pages").  
  • Throwing away the original URI is not desirable because that is a useful key for finding the page in web archives.  The above examples used the Internet Archive's Wayback Machine, but Memento TimeGates and TimeMaps could also be used (see Memento 101 for more information).   
  • One solution to linking to a SE for discovery while retaining the original URI is to use the data-* attributes from HTML (see the "Missing Link" document for more information).  
For the latter point, including the original URI (and its publishing date), the SE URI, and the archived URI would result in html that looks like:

I posted a comment saying that a search engine's robots.txt page would prevent archives like the Internet Archive from archiving the SERPs and thus not discover (and archive) the new URIs themselves.  In an email conversation Martin made the point that rewriting the link to search engine is assuming that the search engine URI structure isn't going to change (anyone want to bet how many links to or queries are still working?).  It is also probably worth pointing out that while metadata like the title is not likely to change for Jon's articles, that's not always true for general web pages, whose titles often change (see "Is This A Good Title?"). 

In summary, Jon's solution of using SERPs as interstitial pages as a way to combat link rot is an interesting solution to a common problem, at least for those who wish to maintain publication (or similar) lists.  While the SE URI is a good tactical solution, disposing of the original URI is a bad strategy for several reasons, including working against web archives instead of with them, and betting on the long-term stability of SEs.  The solution we need is a method to include > 1 URI per HTML link, such as proposed in the "Missing Link" document.