AlNoamany's work or Andy Jackson's work, because we are trying to acquire and extract text from mementos across many web archives, rather than just one.
We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that idiosyncrasies among web archives made these operations much more complex. We document our findings in a technical report entitled "Rules of Acquisition for Mementos and Their Content".
Our technical report briefly covers the following key points:
- Special techniques for acquiring mementos from the WebCite on-demand archive (http://www.webcitation.org)
- An alternative to BeautifulSoup for removing elements and extracting text from mementos
- Stripping away archive-specific additions to memento content
- An algorithm for dealing with inaccurate character encoding
- Differences in whitespace treatment between archives for the same archived page
- Control characters in HTML and their effect on DOM parsers (see the brief sketch after this list)
- DOM-corruption in various HTML pages exacerbated by how the archives present the text stored within <noscript> elements
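As a loose illustration of the control-character point above, and not the procedure from the technical report, the sketch below strips C0 control characters (other than tab, newline, and carriage return) from an HTML string before it is handed to a DOM parser:

```python
import re

# Matches C0 control characters (and DEL) except tab, newline, and
# carriage return, which are legitimate whitespace in HTML.
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def strip_control_characters(html):
    """Remove control characters that can confuse DOM parsers."""
    return CONTROL_CHARS.sub('', html)

# Example: the NUL and ESC bytes are removed before parsing.
print(strip_control_characters('<p>Hello\x00 world\x1b</p>'))
# -> <p>Hello world</p>
```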
Rather than repeating the entire technical report here, we want to focus on the two issues that may have the greatest impact on others acquiring and experimenting with mementos: acquiring mementos from WebCite and inaccurate character encoding.
Acquisition of Content from WebCite
WebCite is an on-demand archive specializing in archiving web pages used as citations in scholarly work. An example WebCite page is shown below.
For most archives, mementos can be acquired directly with the cURL data transfer tool. With this tool, one merely types the following command to save the contents of the URI http://www.example.com:
curl -o outputfile.html http://www.example.com
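The same download can also be sketched in Python with the requests library; the URI and output filename below are simply placeholders mirroring the cURL example above:

```python
import requests

# Rough Python analogue of the cURL command above; the URI and output
# filename are placeholders, not values taken from the technical report.
response = requests.get('http://www.example.com')
response.raise_for_status()
with open('outputfile.html', 'wb') as f:
    f.write(response.content)
```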
For WebCite, however, cURL returns the same HTML frameset content regardless of which URI-M is requested. We sought the actual content of each page for text extraction, so cURL alone was insufficient. An example of this frameset HTML is shown below, after a brief sketch of how frames might be followed programmatically.
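As a generic illustration only (the special techniques WebCite actually required are documented in the technical report), one might follow a frameset by requesting the pages referenced in its frame src attributes. The function below is a hypothetical sketch that assumes those src values point at the underlying content:

```python
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def fetch_frame_contents(uri_m):
    """Download a URI-M and return the bodies of any frames it references."""
    response = requests.get(uri_m)
    soup = BeautifulSoup(response.text, 'html.parser')

    contents = []
    for frame in soup.find_all(['frame', 'iframe']):
        src = frame.get('src')
        if not src:
            continue
        # Resolve a relative frame src against the URI-M before requesting it.
        frame_uri = urljoin(uri_m, src)
        contents.append(requests.get(frame_uri).text)
    return contents
```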