Saturday, December 5, 2009

2009-12-05: NFL Playoff Outlook

Week 12 of the NFL regular season is over, and with five more weeks to go the playoff picture is starting to take shape. The predictive algorithms we chose are coming along well; some are doing better than others, which is to be expected.

One of the algorithms we are using leverages Google's PageRank algorithm. We formed a directed graph where each vertex is one of the 32 NFL teams. A directed edge is added for each game played, with the loser pointing to the winner. The edge is then weighted with the margin of victory (MOV), i.e., the winner's score minus the loser's score.

PageRank is then calculated over this graph; we are using the igraph library in R to do the calculation. The teams are then ranked in order of their PageRank scores. This is similar to the approach of another group that has used PageRank to rank NFL teams.

One of the central ideas of PageRank is that good pages are pointed to by other good pages. In our case, good teams are pointed to by other good teams: a team ranks highly when the teams it has beaten are themselves highly ranked.
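For the curious, here is a rough sketch of that procedure using igraph's Python bindings (the post itself uses the R package); the games and margins of victory below are invented for illustration and are not actual Week 12 results.

# A minimal sketch of the ranking approach, using python-igraph rather than
# the R igraph package described above. Teams, results, and margins are made up.
import igraph

# Each game: (loser, winner, margin_of_victory); the loser points to the winner.
games = [
    ("NE",  "NO",  21),   # hypothetical: Saints beat Patriots by 21
    ("NE",  "IND",  1),   # hypothetical: Colts beat Patriots by 1
    ("CIN", "NO",   3),   # hypothetical
]

teams = sorted({t for game in games for t in game[:2]})
index = {t: i for i, t in enumerate(teams)}

graph = igraph.Graph(directed=True)
graph.add_vertices(len(teams))
graph.vs["name"] = teams
graph.add_edges([(index[loser], index[winner]) for loser, winner, _ in games])
graph.es["weight"] = [mov for _, _, mov in games]

# Weighted PageRank: a team ranks highly when the teams pointing to it
# (i.e., the teams it has beaten) are themselves highly ranked.
scores = graph.pagerank(weights="weight")
for team, score in sorted(zip(teams, scores), key=lambda p: p[1], reverse=True):
    print(f"{team}: {score:.3f}")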

An interesting observation we have made is the difference in rank between the New Orleans Saints and the Indianapolis Colts. After Week 12, both teams are undefeated with 11 wins each. However, most of the teams the Colts have beaten were not in the top tier of the league, and even when they were, the Colts did not beat them by many points.

One example is New England. New England is still a pretty good team this year, and they have been beaten by both the Colts and the Saints. The Colts barely won, by one point, whereas the Saints beat New England by 21 points.

Here is an animated GIF showing the PageRank of the teams week to week. The edges show which teams played whom, and the size of each node is a function of its PageRank.



Now, looking forward to the playoffs, we take a look at a still from Week 12.


The Saints are pretty much going to take the NFC, and depending on who takes the AFC North, we can see the Saints winning the Super Bowl over the Bengals.

--Greg Szalkowski

2009-12-09 Edit: The animated GIF from above has been uploaded to YouTube. -- MLN

Thursday, November 19, 2009

2009-11-19: Memento Presentation and Movie; Press Coverage

On Monday, November 16, 2009, Herbert and I went to the Library of Congress and presented slides from our Memento eprint (see the previous post for a short description of Memento). On Thursday, November 19, 2009, Herbert gave the same presentation at OCLC.

Below are the slides that were presented, as well as the supporting movie. Fortunately, the slides & movie were finished in the window between ODU sporadically losing power over the weekend due to the Nor'easter and Tuesday, when odusource.cs.odu.edu and mementoarchive.cs.odu.edu were brought down by a disk failure. Thanks to Scott Ainsworth and the ODU systems staff for their yeoman's work in getting everything back up and running.

Slides & movie from the Library of Congress Brown Bag Seminar:





2010-02-12 Edit: The recorded presentation has just been uploaded to the Library of Congress web site.

Also, Memento has enjoyed considerable press & blog coverage. It began with an article in New Scientist:

http://www.newscientist.com/article/dn18158-timetravelling-browsers-navigate-the-webs-past.html

The article was picked up and redistributed in various forms:

http://story.chinanationalnews.com/index.php/ct/9/cid/d805653303cbbba8/id/566360/cs/1/
http://www.newkerala.com/nkfullnews-1-152320.html
http://www.sbs.com.au/news/article/1133987/Time-traveling-browsers-search-web's-past
http://www.resourceshelf.com/2009/11/17/the-internet-time-machine-from-the-momento-project/
http://sciencetech-search.blogspot.com/2009/11/time-travelling-browsers-navigate-webs.html
http://story.albuquerqueexpress.com/index.php/ct/9/cid/d867a54a6fc00b3b/id/566360/cs/1/
http://trak.in/news/soon-time-travelling-browsing-technology-to-navigate-webs-past/24449/

There was also coverage in the blog of The Chronicle of Higher Education:

http://chronicle.com/blogPost/New-Web-Site-Makes-Internet/8887/

As well as a few other blog posts:

http://davidbrunton.com/2009/11/memento-and-persistence.html
http://www.niso.org/blog/?p=78

And a mention on O'Reilly Radar:

http://radar.oreilly.com/2009/11/four-short-links-18-november-2.html

--Michael

Monday, November 9, 2009

2009-11-09: Eprint released for "Memento: Time Travel for the Web"

This is a follow-up to my post on October 5, where I mentioned the availability of the Memento project web site. Herbert's team and my team, working under an NDIIPP grant, have introduced a framework where you can browse the past web (i.e., old versions of web pages) in the same manner that you browse the current web. The framework uses HTTP content negotiation as a method for requesting the version of the page you want.

Most people know little about content negotiation, and the little they think they know is often wrong (see [1-3] for more information about CN). In a nutshell, CN allows you to link to a URI "foo" without specifying, for example, its format (e.g., "foo.html" vs. "foo.pdf") or language ("foo.html.en" vs. "foo.html.es"). Your browser automatically passes preferences to the server (e.g., "I slightly prefer HTML over PDF, and I greatly prefer English to Spanish"), and the server tries to find its best representation of "foo" that matches those preferences. In fact, CN defines four dimensions in which the browser and server can negotiate the "best" representation: type, language, character set, and encoding (e.g., .gz vs. .zip).
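To make the negotiation concrete, here is a hedged illustration of what those preferences look like on the wire, sketched with Python's requests library; the URI and the q-values are made up for the example.

# Classic content negotiation: the client states format and language
# preferences with quality (q) values, and the server picks its best
# matching representation of "foo". URI and q-values are invented.
import requests

response = requests.get(
    "http://example.org/foo",
    headers={
        # "I slightly prefer HTML over PDF..."
        "Accept": "text/html, application/pdf;q=0.9",
        # "...and I greatly prefer English to Spanish."
        "Accept-Language": "en, es;q=0.2",
    },
)
print(response.headers.get("Content-Type"))
print(response.headers.get("Content-Language"))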

We define a fifth dimension for CN: Datetime. If you configure your browser to prefer to view the web as it existed at a particular time, say January 29, 2008, then you could click on:

http://en.wikipedia.org/wiki/The_Cribs

and not get the current version, but rather get an older version of the page (in this case, before Johnny Marr had joined the band).
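A Memento-style client simply adds a datetime preference to that same machinery. The sketch below is illustrative only; it assumes the preference is carried in an Accept-Datetime-style request header (see the eprint for the exact protocol details).

# A sketch of datetime content negotiation: ask for the page as it existed
# on January 29, 2008. The Accept-Datetime header shown here is an assumption
# about how the datetime preference is conveyed; consult the eprint for the
# normative protocol.
import requests

response = requests.get(
    "http://en.wikipedia.org/wiki/The_Cribs",
    headers={"Accept-Datetime": "Tue, 29 Jan 2008 00:00:00 GMT"},
)
# A Memento-aware server (or the archive it redirects to) would return a
# representation close to the requested datetime rather than the current page.
print(response.url)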

There are two kinds of "tricks" that must be addressed to make this possible:

1. The client can be configured to specify the desired Datetime. Scott Ainsworth is currently developing a Firefox add-on for us and will be releasing it "real soon now" (tm). In the meantime, you can play with a browser-based client developed by LANL just to see how it works.

2. The server must know how to "do the right thing" (tm). There are several ways to do this. First, if the server is running a content management system that keeps track of prior versions, then the server can respond with the correct older version. For example, we have a plug-in for MediaWiki that maps incoming Datetime requests to the prior versions.

Second, the production server can redirect the client to wherever it knows its pages are archived. For example, the following demo pages:

http://lanlsource.lanl.gov/hello
http://odusource.cs.odu.edu/hello

"know" about their corresponding transactional archives at http://mementoarchive.lanl.gov/ and http://mementoarchive.cs.odu.edu/, respectively, and will redirect clients to the correct archive.

Third, the server can redirect the client to an aggregator we've developed (see the simple mod_rewrite rules that perform this function). For example, this rule is installed at http://digitalpreservation.gov/; if the server there detects a Memento request, it will redirect the client to the aggregator, which will search the Internet Archive, Archive-It, and other public web archives for the best Datetime match (a sketch of this redirect logic appears below).

Finally, if the server is not configured to do any of those things, the Firefox add-on attempts to detect the server's non-compliance and redirect the client to the aggregator (for the same effect as described above).
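As an illustration of the third option above (the mod_rewrite redirect to the aggregator), here is a hypothetical Python/WSGI sketch of the logic those rules implement; the aggregator URL is a placeholder rather than the project's actual endpoint, and the header name is an assumption about how the datetime preference arrives.

# Hypothetical sketch: if a request carries a datetime preference, send the
# client to an aggregator that searches public web archives for the best match.
# The aggregator URL and header name are placeholders for illustration.
from urllib.parse import quote
from wsgiref.simple_server import make_server

AGGREGATOR = "http://aggregator.example.org/timegate/"  # placeholder endpoint

def app(environ, start_response):
    accept_dt = environ.get("HTTP_ACCEPT_DATETIME")
    requested = "http://" + environ.get("HTTP_HOST", "localhost") + environ.get("PATH_INFO", "/")
    if accept_dt:
        # Memento-style request: hand it off to the aggregator.
        location = AGGREGATOR + quote(requested, safe="")
        start_response("302 Found", [("Location", location)])
        return [b""]
    # Otherwise serve the current resource as usual (omitted in this sketch).
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"current version\n"]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()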

The above is a short description of how Memento works. More details can be found in our eprint:

Herbert Van de Sompel, Michael L. Nelson, Robert Sanderson, Lyudmila L. Balakireva, Scott Ainsworth, Harihar Shankar, "Memento: Time Travel for the Web", arXiv 0911.1112, November 2009.

Also, we have a number of upcoming presentations where you can catch us explaining Memento in more detail:
We hope to see you at one of these meetings. Let us know if you have questions or comments.

--Michael


1. Transparent Content Negotiation in HTTP, RFC 2295.
2. Content Negotiation, Apache HTTP Server Documentation.
3. ODU CS 595 Week 10 Lecture.

Sunday, November 8, 2009

2009-11-08: Back From Keynotes at WCI and RIBDA.

October was a busy travel month. On October 11-13, I attended a technical meeting for the Open Annotation Collaboration project in Berkeley, CA. From there, I traveled to Berlin, Germany to give a keynote about OAI-ORE at the Wireless Communication and Information Conference (WCI 2009). Michael Herzog was kind enough to invite me to speak there again; I had also given an invited talk at Media Production 2007, also in Berlin.


After a short week back in the US, it was off to Lima, Peru to give another keynote about OAI-ORE, this time at Reunión Interamericana de Bibliotecarios, Documentalistas y Especialistas en Información Agrícola, or RIBDA 2009. This was also another repeat performance -- I had given an invited talk about OAI-PMH in Lima in 2004, and my colleague there, Libio Huaroto, invited me back.

Slides from the keynotes are probably available on the conference web sites; however, they were both edited versions of the more detailed ORE seminar I gave at Emory University in early October. Those interested in OAI-ORE should look at those slides or the "ORE in 10 minutes" video Herbert Van de Sompel recently uploaded to YouTube.

--Michael

Monday, October 26, 2009

2009-10-26: Communications of the ACM Article Published

The article "Why Websites Are Lost (and How They're Sometimes Found)" has finally been published in the November 2009 issue of Communications of the ACM. Co-written with Frank McCown and Cathy Marshall, it was accepted for publication in the fall of 2007. Although we've had a pre-print available since 2008, it just isn't the same until you see it in print.

Except we won't be seeing this one in print; it is instead published in the "Virtual Extension" part of the CACM. So even though it has page numbers (pp. 141-145), this article won't be among those that arrive in your mailbox in a few weeks. As someone who has spent his entire career trying to transform the scholarly communication process with the web and digital libraries, I completely understand this move by the CACM, but I have to admit I'm disappointed that I won't see a printed, bound copy. Even though in the long term all discovery will come from the web (e.g., Google Scholar or personal publication lists), the short-term thrill of receiving the hard copy in the mail is hard to replace.

The article itself is a very nice summary of the problem area. The idea to write the paper came from our involvement in Warrick, a tool for reconstructing lost web sites. Warrick was very successful, and the interest in it was so high that we eventually became distracted from the mechanics of reconstruction and turned our focus to the question "why are people losing all these sites?!" We learned quite a bit.

Interested readers might also like: our paper in Archiving 2007, Frank's dissertation, or any of the several papers by Cathy on personal (digital) archiving.

--Michael

Thursday, October 15, 2009

2009-10-15: Seminars at Emory University

I recently traveled to Emory University to visit with Joan Smith (an alumna of our group -- PhD, 2008) and Rick Luce. While there, I gave two colloquia: on October 1 at the Woodruff Library on OAI-ORE, and on October 2 at the Mathematics & Computer Science Department on web preservation (specifically, based on Martin Klein's PhD research).

I've uploaded both sets of slides. The first, "OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project", is based on slides from Herbert Van de Sompel:



The second, "(Re-) Discovering Lost Web Pages", is an extended version of slides presented at the NDIIPP Partners Meeting this summer:




--Michael

Monday, October 5, 2009

2009-10-05: Web Page for the Memento Project Is Available

The Library of Congress-funded research project "Tools for a Preservation Ready Web" is coming to a close. The initial phase (2007-2008) of the project funded Joan Smith's PhD research into using the web server to inform web crawlers exactly how many valid URIs there are at a web site (the "counting problem"), as well as into server-side generation of preservation metadata at dissemination time (the "representation problem"). Several interesting papers came out of that project (e.g., WIDM 2006, D-Lib 14(1/2)), as well as the mod_oai Apache module. Joan graduated in 2008 and is now the Chief Technology Strategist for the Emory University Libraries and an adjunct faculty member in the CS department at Emory.

Since that time, Herbert and I (plus our respective teams) have been closing out this project by working on some further ideas regarding the preservation of web pages and how web archives can be integrated with the "live web". The result is the Memento Project, which has a few test pages that are collecting links from robots and interactive users; we will use those links in a description and analysis to be published shortly. In the meantime, the test pages feature some clever scripting from Rob to show Herbert and me standing next to BBC and CNN web pages, respectively. Check them out:

http://lanlsource.lanl.gov/hello
http://odusource.cs.odu.edu/hello

And here are their respective bit.ly URIs (just for fun):

http://bit.ly/4iubG
http://bit.ly/2u6uAv

I'll post a further update on WS-DL when we publish the description of how Memento works. We'd like to again thank the National Digital Information Infrastructure and Preservation Program for their support of the "Tools for a Preservation Ready Web" project.

-- Michael

Monday, September 28, 2009

2009-09-28: OAI-ORE In 10 Minutes

A significant part of my research time in 2007-2008 was spent working on the Open Archives Initiative Object Reuse & Exchange project (OAI-ORE, or simply just ORE). Producing the ORE suite of eight documents was difficult and took longer than I anticipated, but we had an excellent team and I'm extremely proud of the results. In the process, I also learned a great deal about the building blocks of ORE: the Web Architecture, Linked Data and RDF.

I'm often asked "What is ORE?" and I don't always have a good, short answer. The simplest way I like to describe ORE is "machine-readable splash pages". More formally, ORE addresses the problem of identifying Aggregations of Resources on the Web. For example, we often use the URI of an html page as the identifier of an entire collection of Resources. Consider this YouTube URI:

http://www.youtube.com/watch?v=SkJDKdOlUGQ

Technically, it identifies just the html page that is returned when that URI is dereferenced:


But we frequently (and incorrectly) use this URI to also identify all the information contained within that html page, which is actually a collection of many URIs, some of which are:

http://www.youtube.com/v/SkJDKdOlUGQ&hl=en&fs=1
http://i4.ytimg.com/vi/SkJDKdOlUGQ/default.jpg
http://m.youtube.com/watch?desktop_uri=%2Fwatch%3Fv%3DSkJDKdOlUGQ&v=SkJDKdOlUGQ
http://gdata.youtube.com/feeds/base/videos/SkJDKdOlUGQ
http://gdata.youtube.com/feeds/base/videos/SkJDKdOlUGQ/related
http://gdata.youtube.com/feeds/base/videos/SkJDKdOlUGQ/comments
http://gdata.youtube.com/feeds/base/videos/SkJDKdOlUGQ/responses


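For the technically inclined, the sketch below shows roughly what an ORE Resource Map for this example could look like, built with Python's rdflib; the Resource Map and Aggregation URIs are invented for illustration, and the ORE documents mentioned in the next paragraph remain the normative reference.

# A hedged sketch of an ORE Resource Map describing the YouTube "splash page"
# as an Aggregation of its constituent resources. The Resource Map and
# Aggregation URIs are invented for illustration only.
from rdflib import Graph, Namespace, URIRef, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

rem = URIRef("http://example.org/rem/SkJDKdOlUGQ")          # hypothetical Resource Map
agg = URIRef("http://example.org/aggregation/SkJDKdOlUGQ")  # hypothetical Aggregation

aggregated = [
    "http://www.youtube.com/watch?v=SkJDKdOlUGQ",
    "http://www.youtube.com/v/SkJDKdOlUGQ&hl=en&fs=1",
    "http://i4.ytimg.com/vi/SkJDKdOlUGQ/default.jpg",
    "http://gdata.youtube.com/feeds/base/videos/SkJDKdOlUGQ",
]

g = Graph()
g.bind("ore", ORE)
g.add((rem, RDF.type, ORE.ResourceMap))   # the Resource Map document
g.add((rem, ORE.describes, agg))          # ...describes the Aggregation
g.add((agg, RDF.type, ORE.Aggregation))
for uri in aggregated:
    g.add((agg, ORE.aggregates, URIRef(uri)))

print(g.serialize(format="turtle"))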
There is more to ORE than this, however. Interested readers can read the ORE primer and then tackle more difficult documents like the ORE Abstract Data Model. There are a variety of presentations about ORE available as well, including all the presentations we gave at the 2008 Open Day at Open Repositories 2008 at Southampton University. Herbert Van de Sompel has several presentations on SlideShare (see additional presentations with the "oaiore" tag), but some are quite lengthy (160+ slides).

Fortunately, there is now a short, gentle introduction to ORE. Herbert has just uploaded to YouTube a nice 10-minute narrated overview of ORE, prepared for the 2009 Dublin Core conference. Obviously, there is a limit to how much can be covered in a 10-minute presentation, but this should provide you with the answer to "what is ORE about?" and give you enough background to start reading the ORE suite of documents.



Thanks to Herbert for taking the time to record and upload this video.

-- Michael

Thursday, September 17, 2009

2009-09-19: Football Intelligence and Beyond

Football Intelligence (FI) is a system for gathering, storing, analyzing, and providing access to data that helps football enthusiasts discover more about the performance of their favorite pastime.

While taking Dr. Nelson's Collective Intelligence class, I became fascinated with techniques for mining useful information from the "collective intelligence" embodied in data readily available on the Internet.

We decided to apply some of the data mining techniques covered in class in an attempt to predict the 2009 NFL season. There is a plethora of data out there that could be mined, from injury reports to betting lines, but we decided to limit the scope to box score data for training and prediction.

Using box scores from 2003 to the present, we trained a number of different models, from support vector machines to multilayer perceptron networks. The model implementations we are using come from the Weka data mining software, which contains a number of tools for experimenting with and visualizing data.
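The actual training is done in Weka (which is Java-based); purely as a rough, hypothetical analogue of the idea, here is a Python/scikit-learn sketch with invented box score features and labels.

# A hypothetical analogue of the Weka experiments, sketched with scikit-learn.
# The feature vectors (a few aggregate box score statistics per matchup) and
# labels (1 = home team won) are invented for illustration.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Invented features: [home_pts_avg, away_pts_avg, home_yds_avg, away_yds_avg]
X = np.array([
    [27.5, 17.0, 380.0, 290.0],
    [14.0, 24.5, 260.0, 345.0],
    [21.0, 20.5, 310.0, 305.0],
    [30.0, 13.5, 400.0, 250.0],
    [17.5, 28.0, 280.0, 390.0],
    [24.0, 21.0, 330.0, 300.0],
])
y = np.array([1, 0, 1, 1, 0, 1])  # 1 = home team won

models = [
    ("SVM", SVC(kernel="rbf")),
    ("MLP", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)),
]
upcoming = np.array([[26.0, 18.5, 360.0, 285.0]])  # hypothetical upcoming matchup
for name, model in models:
    model.fit(X, y)
    verdict = "home win" if model.predict(upcoming)[0] == 1 else "away win"
    print(f"{name} predicts: {verdict}")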

For comparison, and to provide some controls, we have chosen a few baseline schemes such as "home team always wins," city population, and best mascot. For the best-mascot scheme, I had my daughters rank the mascots from best to worst, and that ranking will be used throughout the season. Poe of the Baltimore Ravens came out on top.






If you would like to see how we are doing, or even join us with your own predictions, we have pick'em leagues for both straight-up and against-the-spread picks.

Greg Szalkowski

Wednesday, September 16, 2009

2009-09-16: Announcing ArchiveFacebook - A Firefox Add-on for Archiving Facebook Accounts


ArchiveFacebook is a Firefox extension that helps you save web pages from Facebook and easily manage them. It saves content from Facebook directly to your hard drive so that you can view it exactly the same way you currently view it on Facebook.
Why would you want to do this? Facebook has become a very important part of our lives. Information about our friends, family, business contacts, and acquaintances is stored in Facebook with no easy way to get it out. ArchiveFacebook allows you to do just that. What guarantee do you have that Facebook won't accidentally, or in some cases intentionally, delete your account? Don't trust your data to one web site alone. Take matters into your own hands and preserve this information. Show it to your kids one day!
Currently ArchiveFacebook can save:
  • Photos
  • Messages
  • Activity Stream
  • Friends List
  • Notes
  • Events
  • Groups
  • Info
Installation:
You can download the extension from https://addons.mozilla.org/en-US/firefox/addon/13993/. Once at this page, press "Add to Firefox" and follow the prompts for installation. Firefox will prompt you to restart; once you do so, you should see a new menu called "ArchiveFB". At this point, ArchiveFacebook has been successfully installed.
It should be noted that, at the time of this writing, ScrapBook should not be enabled while ArchiveFacebook is enabled; doing so will cause instability in both programs. You can easily disable an extension from within the Firefox "Add-ons" dialog, which can be found in the "Tools" menu of Firefox.

Usage:
Logging In:
First of all, make sure you are logged into your Facebook account.  ArchiveFacebook uses Firefox to view the pages of your account and then save them.  So, you need to be logged into your account in order to have the proper authentication and authorization to archive your account.
Sidebar:
Once logged in, it is a good idea to open the sidebar. To do this, click ArchiveFB --> Show in Sidebar. The sidebar should appear on the left-hand side of the screen. The sidebar does not need to be open in order to archive, but it provides a richer interface for doing so.
Archiving:
To archive your account, press ArchiveFB --> Archive. This will redirect you to your Facebook profile page. You will then see a dialog box telling you that your activity stream will be expanded. If you press "Cancel", the archiving process will be cancelled entirely. If you press "OK", your activity stream will be expanded, displaying all activity on Facebook since your account's creation. As your activity is being retrieved, you are presented with another dialog box that shows the date of the activities currently being retrieved. You may cancel this process at any time and your account will still be archived; your activity stream will be archived up to the date at which you cancelled the retrieval.


Once the retrieval of your activity stream has completed, the archiving process will begin. You will see a window labeled "Capture"; this window drives the archiving process. Each page to be archived will be listed in its scrollable pane.

 

Browsing the Archive:
Once the archiving process has completed, you will see an entry in the sidebar that says "Facebook | username date", where username is your Facebook username and date is the current date. Click on the entry. You will see your Facebook profile page appear, and at the bottom will be an annotation bar where you can highlight text or make comments on a page for your personal records. Click through your archived Facebook pages to ensure that all pages have been archived; all the page types listed in the introduction should be present. You can tell whether a page has been archived by placing your cursor over a link and looking in the bottom left-hand corner, where Firefox will show the location of the link: if the location starts with "file://", it is on your hard drive; if it starts with "http://", it is on the web. If a page is not on your local hard drive, try to archive your account again. If the second attempt doesn't work, please notify us and we will attempt to fix the problem.

Notes:
ArchiveFacebook was developed at Old Dominion University (ODU) by Carlton Northern. Michael L. Nelson (ODU) and Frank McCown (Harding University) served as advisers. You can read more about the add-on in the research paper entitled What Happens When Facebook is Gone? presented at JCDL 2009.
ArchiveFacebook was developed by modifying code from ScrapBook. Note that ArchiveFacebook will not work correctly when ScrapBook is installed.  You will need to temporarily disable or uninstall ScrapBook before using ArchiveFacebook.
An ArchiveFacebook run may take several minutes to several hours to complete, depending on the amount of content to be archived. At the beginning of the process, Firefox will be temporarily frozen while it retrieves your activity stream. You may cancel this retrieval at any time and the rest of your account will still be archived.

User Manuals:

Presentation:


--Carlton

Friday, August 21, 2009

2009-08-21: CS 751/851 "Introduction to Digital Libraries" Postponed Until Spring 2010

CS 751/851 "Introduction to Digital Libraries" has been postponed from Fall 2009 to Spring 2010. I apologize to those who had planned to take the class this Fall.

--Michael

Thursday, July 30, 2009

2009-07-30: Position Paper Published in Educause Review

The July/August 2009 issue of Educause Review has a position paper of mine entitled "Data Driven Science: A New Paradigm?" This invited paper is essentially a cleaned-up version of my position paper at the 2007 NSF/JISC Workshop on Data-Driven Science and Scholarship, held in Arizona, April 17-19, 2007. Prior to the workshop, we were all assigned topics on which we were to write a short position paper. My topic was to address the question of whether "data-driven science is becoming a new scientific paradigm, ranking with theory, experimentation, and computational science."

You can judge my response by the original paper's cheekier title, "I Don't Know and I Don't Care". My argument can be summed up as "we've always had data-driven science at whatever was the largest feasible scale; it just happens that the scale is now very large." Scale is important; in fact, some days I might argue that scale is all there is. But partitioning into paradigms does not seem helpful -- every other dimension of our life is now at web scale, so why not our science?

Thanks to Ron Larsen for co-hosting (with Bill Arms) the workshop in 2007 and for resurrecting the paper for Educause Review.

--Michael

Thursday, July 16, 2009

2009-07-17: Technical Report "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure"



This week I uploaded a technical report, co-authored with Michael L. Nelson, to the e-print service arXiv.org. The underlying idea of this research is to utilize the web infrastructure (search engines, their caches, the Internet Archive, etc.) to rediscover missing web pages, i.e., pages that return the 404 "Page Not Found" error. We apply various methods to generate search engine queries based on the content of the web page and user-created annotations about the page. We then compare the retrieval performance of all methods and introduce a framework for combining such methods to achieve the optimal retrieval performance.
The applied methods are:
  • 5- and 7-term lexical signatures of the page
  • the title of the page
  • tags users annotated the page with on delicious.com
  • 5- and 7-term lexical signatures of the page neighborhood (up to 50 pages linking to the missing page)
We query the big three search engines (Google, Yahoo, and MSN Live) with the output of each method and analyze the result sets to investigate the retrieval performance.
We have shown in recent work (published at ECDL 2008) that lexical signatures perform very well for rediscovering missing web pages.
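As a rough illustration of the idea of a lexical signature (the real study follows the method from the ECDL 2008 paper, not this sketch), the snippet below ranks a page's terms by TF-IDF against a small background corpus and keeps the top k terms as a query; the documents and k are made up.

# A minimal, hypothetical sketch of a k-term lexical signature: rank the
# missing page's terms by TF-IDF against a small background corpus and keep
# the top k as the search engine query. Documents are invented.
from sklearn.feature_extraction.text import TfidfVectorizer

background = [
    "old dominion university computer science department news",
    "web archiving and digital preservation research group",
    "norfolk virginia weather and local events",
]
missing_page = "memento time travel for the web http content negotiation datetime"

def lexical_signature(page, corpus, k=5):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(corpus + [page])
    terms = vectorizer.get_feature_names_out()
    page_scores = tfidf.toarray()[-1]       # TF-IDF row for the target page
    top = page_scores.argsort()[::-1][:k]   # indices of the k highest-scoring terms
    return [terms[i] for i in top]

print(" ".join(lexical_signature(missing_page, background)))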

As shown in the figure on the left, we distinguish four retrieval categories: top (the URL is returned top-ranked), top10 (returned within the top 10 but not top), top100 (returned between ranks 11 and 100), and undiscovered (not returned in any of the categories above). Displayed here is the retrieval performance of 5- and 7-term lexical signatures. A somewhat binary pattern is visible, meaning the vast majority of URLs are either returned within the top 10 or remain undiscovered.

However, in this study we found that the pages' titles perform equally well. We further found that neither tags about the pages nor lexical signatures based on the page neighborhood performed satisfactorily. We should mention, though, that we were not able to obtain tags for all URLs in our data set.
Inspired by the performance of titles, we also conducted a small-scale analysis of the consistency of our titles with respect to their retrieval performance. We looked at title length in terms of the number of terms and characters, as well as the number of stop words.

Since the title of a web page is much cheaper to obtain than a lexical signature, which requires more complex computation, we recommend using the title first and the lexical signature second for URLs that were not discovered in the first step. This experiment is, for one, a follow-up to our work published at ECDL 2008 and, for two, forms the basis for a larger-scale study in the future.
--
martin

2009-07-16: The July issue of D-Lib Magazine has JCDL and InDP reports.

The July/August 2009 issue of D-Lib Magazine has just published reports for the 2009 ACM/IEEE JCDL (written by me) and InDP (written by Frank and his co-organizers), as well as several other reports for JCDL workshops and other conferences (such as Open Repositories 2009). Whereas my previous entry about JCDL & InDP was focused on our group's experiences, these reports give a broader summary of the events.

--Michael

Tuesday, July 7, 2009

2009-07-07: Hypertext 2009

From June 30th through July 1st, I attended Hypertext 2009 (HT 2009) in Torino, Italy. The conference saw a 70% increase in submissions (117 total) compared to last year but, due to an equally increased number of accepted papers (26 long and 11 short) and posters, maintained last year's acceptance rate of roughly 32%. HT 2009 also had a record 150 registered attendees.

I presented our paper titled "Comparing the Performance of US College Football Teams in the Web and on the Field" (DOI), which was joint work with Olena Hunsicker under the supervision of Michael L. Nelson. The paper describes an extensive study of the correlation between expert rankings of real-world entities and search engine rankings of their representative resources on the web.
We published a poster, "Correlation of Music Charts and Search Engine Rankings" (DOI), with the results of a similar but much smaller-scale experiment at JCDL 2009.

It was my first time attending HT, and from my point of view there were four highlights that I would like to report on (in the order of their occurrence):
1) Mark Bernstein gave a very inspiring talk, "On Hypertext Narrative", and also advertised his new book titled "Reading Hypertext". He is also the chief scientist of Eastgate Systems and the designer of Tinderbox.

2) Lada Adamic's keynote "The Social Hyperlink" (slides). She talked about various experiments with social networks, e.g., the propagation of knowledge through social networks and how assets (such as dance moves) propagate in Second Life. She argued that it is often hard to differentiate between influence and correlation in social networks.

3) I got to meet and talk to Theodor (Ted) Nelson. Ted coined the term hypertext and is the father of the Xanadu project. He has authored various books, including his most recent work, "Geeks Bearing Gifts". The best newcomer paper award at HT is named after him.

4) Ricardo Baeza-Yates' keynote "Relating Content by Web Usage", where he argued that web search is no longer about document retrieval (a sad statement for IR fanatics) but about exploiting the wisdom of the crowds, since that provides popularity, diversity, coverage, and quality. Search is moving toward identifying the user's task and enabling its completion. He made a case for search transitioning from returning web documents to returning web objects, such as people, places, and businesses, since these objects better satisfy the user's intent.

Besides the impressions I got from the conference, here are a few useless facts that I feel like sharing:
Torino seems like a nice place, but I did not get a chance to walk around and explore the city.
Italians dine (similar to the French) in several courses, so do not make the same rookie mistake I did and fill yourself up on the appetizers assuming it's all you get.
Italian cab drivers, of course, do not understand a single word of English unless it comes to how much tip you give them.
There are conference hotels on the face of this planet that do not provide irons for their guests...

Hypertext 2010 will be held in Toronto, Canada, June 14th-17th, 2010.
--
martin

Monday, June 29, 2009

2009-06-29: NDIIPP Partners Meeting

On June 24-26 I attended the 2009 NDIIPP Partners Meeting in Washington DC. Although it has grown from the early years, I believe this year's attendance of 150 people is similar to last year's.

Clay Shirky, author of "Here Comes Everybody", gave the keynote on Wednesday morning. Hopefully the Library of Congress will post a video of the keynote soon. If not, take a look at some of his other presentations -- you will find them enjoyable and informative.

On Thursday morning I presented a summary of Martin's PhD research, the tangible product of which will be a Firefox extension called "Synchronicity":



The presentation was very well received and there is a lot of interest in the extension. There were several interesting breakout sessions, but the real news came on Friday, when Martha Anderson (LC) introduced the upcoming National Digital Stewardship Alliance (NDSA). Details will be forthcoming, but it appears to be envisioned as a cooperating set of digital archives.

--Michael

2009-07-02 edit: The Library of Congress has updated their site with a summary and agenda+slides.

Monday, June 22, 2009

2009-06-22: Back From JCDL 2009

We had a good showing at the 2009 ACM/IEEE Joint Conference on Digital Libraries (JCDL) in Austin, TX last week. In total, we had 1 full paper, 3 short papers, 2 posters, 1 workshop paper and 1 doctoral consortium paper. JCDL is the flagship conference in our field and we always make a point to send as many people as possible.

Chuck Cartledge (left) presented "A Framework for Digital Object Self-Preservation" at the doctoral consortium. He also presented the related short paper "Unsupervised Creation of Small World Networks for the Preservation of Digital Objects". Chuck is planning to have his doctoral candidacy exam sometime in the early fall.

Michael presented the full paper "Using Timed-Release Cryptography to Mitigate The Preservation Risk of Embargo Periods". This paper was based on Rabia Haq's MS Thesis, which she defended in the fall of 2008. Michael also co-organized the doctoral consortium and convinced WS-DL alumna Joan Smith (PhD, 2008) to serve on the faculty committee as well.

Martin Klein (left) presented two posters, "Inter-Search Engine Lexical Signature Performance" and "Correlation of Music Charts and Search Engine Rankings", the latter of which won runner-up for best poster. That poster was based on the MS project of Olena Hunsicker, who graduated in the fall of 2008. Martin will present a full paper, also based on Olena's work, at the upcoming Hypertext 2009 conference in Italy. He also presented the paper "Investigating the Change of Web Pages' Titles Over Time" at the First International Workshop on Innovation in Digital Preservation (InDP).

The InDP workshop was co-organized by Frank McCown, an alumnus of the WS-DL group (PhD, 2007). Frank also presented two short papers: "A Framework for Describing Web Repositories" and "What Happens When Facebook is Gone?". The latter generated a lot of interest and the meeting room was standing room only. Frank, Michael and current MS student Carlton Northern are working on a project to address archiving Facebook accounts -- look for a software release soon.

--Michael