2013-03-22: NTRS, Web Archives, and Why We Should Build Collections

At the ResourceSync meeting this week, Simeon Warner brought to my attention the fact that the NASA Technical Report Server (NTRS) digital library had gone offline on March 19.  Although I have not been involved with it since about 2004, I was the creator of NTRS and it was a central part of my early career.

If you click on http://ntrs.nasa.gov/ now, you get a message saying the service is down.  Technically, you get an "HTTP/1.1 503 Service Temporarily Unavailable" response:

$ curl -I http://ntrs.nasa.gov/
HTTP/1.1 503 Service Temporarily Unavailable
Date: Sat, 23 Mar 2013 04:00:14 GMT
Server: Apache/2.2.3 (Red Hat)
Last-Modified: Fri, 22 Mar 2013 12:50:02 GMT
ETag: "720003-300-4d882e4c05280"
Accept-Ranges: bytes
Content-Length: 768
Connection: close
Content-Type: text/html; charset=UTF-8


And the body of the page says:
The NASA technical reports server will be unavailable for public access
while the agency conducts a review of the site's content to ensure that it
does not contain technical information that is subject to U.S. export control laws
and regulations and that the appropriate reviews were performed.
The site will return to service when the review is complete.
We apologize for any inconvenience this may cause
Mark Phillips described it perfectly:

Presumably the shutdown of ntrs.nasa.gov is in response to a security incident with a NASA LaRC contractor who is a Chinese national.  I won't address that issue, but I will say that shutting down a public web server in response demonstrates a profound misunderstanding of how the web works: you can't put the pdfs back in the server.  When I discovered that NTRS was down, I searched Twitter, and here is the first tweet I came across with a link to ntrs.nasa.gov:



Clicking on the link produces this image.  Note that the response stays the same and the URI is not rewritten:


Using MementoFox, I moved my slider to go back in time and I was able to find the report in Archive-It.  Here's the screen shot of the PDF in Preview overlapping the web browser:


Here's the TimeMap for those who are interested in the Memento details:

$ curl -i http://mementoproxy.cs.odu.edu/aggr/timemap/list/http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf
HTTP/1.1 200 OK
Date: Sat, 23 Mar 2013 04:24:50 GMT
Server: Apache/2.2.15 (Red Hat)
Link: <http://http://mementoproxy.cs.odu.edu/aggr/timemap/list/http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf>;rel="timemap";type="application/link-format";anchor="http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf"
Connection: close
Transfer-Encoding: chunked
Content-Type: application/link-format
 

<http://http://mementoproxy.cs.odu.edu/aggr/timemap/list/http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf>;rel="self";type="application/link-format",
 <http://mementoproxy.cs.odu.edu/aggr/timegate/http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf>;rel="timegate",
 <http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf>;rel="original",
 <http://wayback.archive-it.org/1792/20100507120934/http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf>;rel="first last memento";datetime="Fri, 07 May 2010 00:00:00 GMT"
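For anyone who wants to work with a TimeMap like this programmatically, here is a minimal Python sketch that pulls the mementos out of the link-format text.  The function names (parse_timemap, mementos) are my own illustrative choices, and the regular expressions only handle the simple quoted-parameter form shown above, so treat it as a starting point rather than a full link-format parser:

```python
import re
from email.utils import parsedate_to_datetime

# One link-value: <URI> followed by zero or more ;name="value" parameters.
# Matching whole link-values avoids splitting on the comma inside datetime="...".
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return one dict per link with its URI, list of rel values, and datetime."""
    links = []
    for uri, params in LINK_RE.findall(text):
        attrs = dict(PARAM_RE.findall(params))
        links.append({
            "uri": uri,
            "rel": attrs.get("rel", "").split(),
            "datetime": attrs.get("datetime"),
        })
    return links


def mementos(text):
    """Return (datetime, uri) pairs for every link whose rel includes 'memento'."""
    found = []
    for link in parse_timemap(text):
        if "memento" in link["rel"]:
            when = parsedate_to_datetime(link["datetime"]) if link["datetime"] else None
            found.append((when, link["uri"]))
    return found


# Sample input assembled from three of the links in the TimeMap above.
PDF = "http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf"
SAMPLE = (
    '<http://mementoproxy.cs.odu.edu/aggr/timegate/' + PDF + '>;rel="timegate",\n'
    ' <' + PDF + '>;rel="original",\n'
    ' <http://wayback.archive-it.org/1792/20100507120934/' + PDF + '>'
    ';rel="first last memento";datetime="Fri, 07 May 2010 00:00:00 GMT"\n'
)

if __name__ == "__main__":
    for when, uri in mementos(SAMPLE):
        print(when, uri)
```

For this report the sketch finds a single memento, the Archive-It copy, which matches the "first last memento" relation in the TimeMap.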


Exactly how many pdfs are available in Memento-compliant archives?  I'm not sure; it could be just a few since most general web archives prefer html pages to pdfs.  MementoFox can make the rediscovery process seamless, but the point is that the pdfs are out there and shutting down ntrs.nasa.gov won't bring them back.
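Under the hood, the rediscovery step is datetime negotiation: the client sends an Accept-Datetime header to a TimeGate, and the TimeGate redirects to the memento closest to that time.  The Python sketch below only builds such a request, using the aggregator TimeGate from the TimeMap above; it does not send it.  The function name timegate_request is hypothetical, and I am assuming that the aggregator host may not answer indefinitely:

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# TimeGate prefix taken from the TimeMap above; availability is an assumption.
TIMEGATE = "http://mementoproxy.cs.odu.edu/aggr/timegate/"


def timegate_request(original_uri, when):
    """Build (but do not send) a HEAD request asking for the copy nearest `when`.

    `when` should be a timezone-aware datetime; it is converted to GMT because
    Accept-Datetime uses the same HTTP-date format as the Date header.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(
        TIMEGATE + original_uri,
        headers={"Accept-Datetime": http_date},
        method="HEAD",
    )


if __name__ == "__main__":
    pdf = "http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19730009264_1973009264.pdf"
    req = timegate_request(pdf, datetime(2010, 5, 7, tzinfo=timezone.utc))
    print(req.get_method(), req.full_url)
    print(req.header_items())
```

Passing the request to urllib.request.urlopen would follow the TimeGate's redirect to the archived copy, assuming the service is still reachable.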

Although I can't help you find all the material NASA has taken off-line, I can point you to two resources.  First is a mirror of some NACA (1917-1958) reports that I helped set up with Paul Needham at Cranfield University in 2001:

http://naca.central.cranfield.ac.uk/

I helped establish that when the NASA websites were taken down after September 11, 2001.  That made it clear to me that NASA information was too important to be left on *.nasa.gov computers.  Although that mirror required a number of emails to coordinate, the bulk transfer was done over the web.  I'm surprised it is still up and running -- it is a testament to good, simple web design and perhaps it is proof that benign neglect can be helpful in the web as well as in the physical world.

More recent and larger in scale is the Internet Archive's "NASA Technical Documents" collection:

http://archive.org/details/nasa_techdocs

I'm not sure about the size of their collection either, but there appears to be a good deal there.

I've also just discovered that Mark Phillips has a NACA collection as well.  I'm not sure if this is related to the IA collection or is totally separate:

http://digital.library.unt.edu/explore/collections/NACA/

Returning to Mark's point, it is events like this that demonstrate the value of copying by-value and not just by-reference.  I'm not concerned about popular culture artifacts disappearing (e.g., see our TPDL 2011 paper about music redundancy in YouTube), but it is not clear that long tail content like NASA reports will enjoy that same level of uncoordinated refreshing and migration.  The moral of the story: make copies of the content, and let services like Google Scholar cluster the pdfs together (e.g., a 1994 NASA TM of mine is on at least six different hosts, none of which are *.nasa.gov).

David Rosenthal has often mentioned that preservation threats include legal, bureaucratic, and political threats.  If NTRS were a LOCKSS participant then access would be uninterrupted, but even LOCKSS assumes that the organization responsible for the content is not the primary threat to the content.

--Michael

2013-05-10 Edit: According to NASA Watch, NTRS came back online May 8, 2013 -- without 85% of its full-text reports.  That same article also pointed out that at least some of the material is available at the Aerospace Research Information Center in South Korea.
