The New Yorker tells us that the average life of a Web page is about a hundred days. Websites don’t have to be deliberately deleted to disappear. Sites hosted by corporations tend to die with their hosts. Even the Web page you are viewing now is in flux: new blog posts appear, and comments and reviews are added. Bookmarks or references you make to Web pages generally do not point to the same information you were reading when you visited the page, or when you were writing an article about it. This is very problematic in an academic context, where provenance and diplomatics are crucial for analysing documents. To point to a static version of a Web page one can use services such as the Internet Archive, Perma.cc and Archive Today. But these solutions tend to be ad hoc: there is no common strategy for referring to a static version of a Web page. In comes Memento, a protocol created by Herbert Van de Sompel and Michael Nelson which adds services on top of HTTP to travel the Web of the past.
During a two-day Hackathon event at Ghent University Library, technologists from all over Europe gathered to explore time travel using the Memento protocol, presented by Herbert Van de Sompel and Harihar Shankar from Los Alamos National Laboratory.
The slides of this event are available here.
Herbert Van de Sompel – Los Alamos National Laboratory, Harihar Shankar – Los Alamos National Laboratory, Najko Jahn – Bielefeld University, Vitali Peil – Bielefeld University, Christian Pietsch – Bielefeld University, Dries Moreels – Ghent University, Patrick Hochstenbach – Ghent University, Nicolas Steenlant – Ghent University, Nicolas Franck – Ghent University, Katrien Deroo – Ghent University, Ruben Verborgh – iMinds, Miel Vander Sande – iMinds, Snorri Briem – Lund University, Maria Hedberg – Lund University, Benoit Pauwels – Université Libre de Bruxelles, Anthony Leroy – Université Libre de Bruxelles, Benoit Erken – Université Catholique de Louvain, Laurent Dubois – Université Catholique de Louvain
Introduction into Memento
The goal of Memento is to provide a protocol for accessing historical versions of web resources. These archived versions, called Mementos, can reside in the content management system of a website or in external services such as web archives.
Take Wikipedia as an example. To view the current version of the lemma ‘Memento_Project’ one needs to visit the web resource http://en.wikipedia.org/wiki/Memento_Project. Wikipedia also provides historical versions of this resource at http://en.wikipedia.org/w/index.php?title=Memento_Project&action=history. In this case the MediaWiki platform keeps all the historical versions of a resource.
Another example is Tim Berners-Lee’s homepage at http://www.w3.org/People/Berners-Lee/. The W3C website doesn’t provide an archive of versions of this webpage, but versions are archived at the Internet Archive, Archive-It, the UK Web Archive and Archive Today.
How can a machine discover all versions of a web resource automatically?
As Gerald Sussman says: “Wishful thinking is essential to good engineering, and certainly essential to good computer science”. We might imagine any web resource (such as the Wikipedia page or Berners-Lee’s homepage above), called the original resource (URI-R), as a box that simply tells a machine where to find all its archived versions using a standard syntax, the HTTP protocol.
A machine visits the resource URI-R and requests the “2007-05-31” version. The answer should be a link to the archived version of the resource, called the Memento (URI-M). There are some complications which the Memento protocol has to consider:
- Not all websites have a content management system with a versioning system. The resource at URI-R might not know where an archived Memento is, or whether one exists at all.
- Web archives such as the Internet Archive don’t have complete coverage of the Internet. Many web archives might need to be visited to find a Memento.
- Even when a resource is available in a web archive, not all versions of it might be available: the archive contains only a fragmented history of versions.
To implement the time travel protocol, Memento introduces a service called a TimeGate (URI-G), which acts as a router for time travel requests. As input it receives the address of a resource (URI-R) and a datetime (e.g. “2007-05-31”), and as response it returns the URL of the archived resource, the Memento (URI-M).
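To make the input of a TimeGate concrete, here is a minimal Python sketch of how a client could phrase such a request. In the Memento specification (RFC 7089) the requested datetime travels in an `Accept-Datetime` HTTP header. The TimeGate base URL and the helper name `build_timegate_request` are assumptions made for illustration; the request is only built, not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumption: a TimeGate that accepts the URI-R appended to its base URL.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(uri_r: str, when: datetime,
                           timegate_base: str = TIMEGATE_BASE) -> Request:
    """Build (but do not send) a datetime negotiation request for URI-R."""
    # Accept-Datetime uses the HTTP date format and is expressed in GMT.
    when_gmt = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_gmt, usegmt=True)}
    # HEAD is enough here: only the redirect to the Memento (URI-M) matters.
    return Request(timegate_base + uri_r, headers=headers, method="HEAD")


request = build_timegate_request(
    "http://www.w3.org/People/Berners-Lee/",
    datetime(2007, 5, 31, tzinfo=timezone.utc),
)
print(request.get_method(), request.full_url)
# urllib stores header names as "Accept-datetime"; HTTP header names
# are case-insensitive, so this is the same header on the wire.
print(request.get_header("Accept-datetime"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup. A TimeGate is then expected to answer with a redirect whose target is the Memento closest to the requested datetime.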
A machine visits URI-R and requests the “2007-05-31” version. The server redirects the machine to a TimeGate (URI-G), which has a routing table that tells it where to find archived versions, or at least a version close to the requested date.
The TimeGate can be a service that runs locally and queries the local content management system, or a service on the Internet, perhaps connected to a large web archive or to a knowledge base of access routes to versioning systems such as GitHub or Wikipedia.
You might ask, how does a TimeGate (URI-G) itself know where the archived version of a particular resource lives? We can look at three cases:
- When the TimeGate is connected to the content management system of a website, it can query the local version database. Given a local URL and a date, it can find out which local versions are available. The TimeGate can even provide a complete listing of all versions of a particular local URL. This listing is known as the TimeMap (URI-T) of a resource.
- When a TimeGate needs to find an archived version of a remote URL about which nothing is known locally, it can forward the request to other well-known TimeGate servers. Typically a TimeGate running at a web archive has a huge repository of URLs for which Mementos exist. Based on this information the request can be answered.
- Or the TimeGate knows the version APIs of services such as GitHub, Wikipedia and the Internet Archive, and acts as a gateway that translates Memento requests into service-specific version requests.
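The TimeMap mentioned in the first case is usually serialised as a list of links (`application/link-format` in RFC 7089), one entry per Memento with its archival datetime. The sketch below parses such a listing into sorted `(datetime, URI-M)` pairs. The sample TimeMap, the `archive.example.org` host and the function name `parse_timemap` are made up for illustration, and the regular expressions only handle quoted attribute values.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime

# One link: <uri> followed by any number of ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

# Hypothetical TimeMap for illustration only.
SAMPLE_TIMEMAP = """\
<http://www.w3.org/People/Berners-Lee/>; rel="original",
<http://archive.example.org/timemap/http://www.w3.org/People/Berners-Lee/>; rel="self"; type="application/link-format",
<http://archive.example.org/20070601000000/http://www.w3.org/People/Berners-Lee/>; rel="last memento"; datetime="Fri, 01 Jun 2007 00:00:00 GMT",
<http://archive.example.org/20050312101500/http://www.w3.org/People/Berners-Lee/>; rel="first memento"; datetime="Sat, 12 Mar 2005 10:15:00 GMT"
"""


def parse_timemap(text: str) -> list[tuple[datetime, str]]:
    """Return (datetime, URI-M) pairs for every Memento link, oldest first."""
    mementos = []
    for uri, raw_params in LINK_RE.findall(text):
        params = dict(PARAM_RE.findall(raw_params))
        # rel may hold several tokens, e.g. "first memento".
        is_memento = "memento" in params.get("rel", "").split()
        if is_memento and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return sorted(mementos)


for when, uri_m in parse_timemap(SAMPLE_TIMEMAP):
    print(when.isoformat(), uri_m)
```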
In this example a machine requests the “2007-05-31” version of a resource from a TimeGate (URI-G). The TimeGate doesn’t know the answer, but it can query one or more remote TimeGate services (e.g. Internet Archive, Archive-It, Archive Today), each of which holds an index of Mementos at a TimeMap (URI-T), and request all versions of the resource. Some TimeGate servers might return zero results. Some might answer with a listing of all available versions. Based on this information the TimeGate server can decide which result best fits the original request.
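The last step, deciding which result best fits the request, can be sketched in a few lines of Python. This assumes that "best fit" means the Memento whose datetime is closest to the requested one, which is only one possible policy. The archive names, URLs and the function name `closest_memento` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional

Memento = tuple[datetime, str]  # (archival datetime, URI-M)


def closest_memento(requested: datetime,
                    timemaps: Iterable[list[Memento]]) -> Optional[Memento]:
    """Merge the listings of several archives and pick the closest Memento."""
    candidates = [entry for timemap in timemaps for entry in timemap]
    if not candidates:
        return None  # no archive knows the resource
    return min(candidates, key=lambda entry: abs(entry[0] - requested))


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical answers from three remote services; one has no results.
archive_a = [(utc(2006, 1, 15), "http://a.example.org/20060115/page"),
             (utc(2007, 6, 3), "http://a.example.org/20070603/page")]
archive_b = []
archive_c = [(utc(2007, 5, 20), "http://c.example.org/20070520/page")]

best = closest_memento(utc(2007, 5, 31), [archive_a, archive_b, archive_c])
print(best)
```

Here the version from 2007-06-03 wins because it is three days away from the requested date, while the one from 2007-05-20 is eleven days away.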
Include the following code snippet in the <head> of an HTML page:
<link rel="stylesheet" type="text/css" href="http://robustlinks.mementoweb.org/demo/robustlinks.css" />
Now one can add HTML5 attributes to web links. In this way it is possible to link to a particular version of a web resource. E.g. to link to the “2014-11-01” version of the LibreCat homepage one can write:
<a href="http://librecat.org" data-versiondate="2014-11-01">link</a>
This link will automatically get a menu option that leads to the archived version of the web page (using http://timetravel.mementoweb.org/ as TimeGate).
See a demonstration here: http://librecat.org/memento/demo.html
Read more on this project on the Robust Links page.
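What the `data-versiondate` attribute makes possible can also be shown outside the browser. The Python sketch below collects all links that carry a `data-versiondate` attribute and turns each into a Time Travel URL. This is not the Robust Links script itself: the URL pattern `/memento/<YYYYMMDD>/<URI-R>` is an assumption about the Time Travel service, and the class and function names are invented for this example.

```python
from html.parser import HTMLParser

# Assumed URL pattern of the Time Travel service (date without dashes).
TIMETRAVEL_PATTERN = "http://timetravel.mementoweb.org/memento/{date}/{uri}"


class VersionedLinkParser(HTMLParser):
    """Collect (href, data-versiondate) pairs from <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        version_date = attributes.get("data-versiondate")
        if href and version_date:
            self.links.append((href, version_date))


def archived_url(href: str, version_date: str) -> str:
    """Build a Time Travel URL for the given link and version date."""
    return TIMETRAVEL_PATTERN.format(date=version_date.replace("-", ""), uri=href)


parser = VersionedLinkParser()
parser.feed('<p><a href="http://librecat.org" data-versiondate="2014-11-01">link</a>'
            ' and <a href="http://example.org">a plain link</a></p>')

for href, version_date in parser.links:
    print(href, "->", archived_url(href, version_date))
```

Links without a version date are left alone, so ordinary links in a page are not affected.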
The second day was used to implement the Memento protocol in various tools and environments. All the results are available as open source projects on GitHub:
The Web is full of high-quality Linked Data, but in general it can’t be reliably queried. Public SPARQL endpoints are often unavailable because they need to answer many unique queries. The Linked Data Fragments conceptual framework makes it possible to define more lightweight interfaces, which enable client-side execution of complex queries.
During the Hackathon Miel Vander Sande and Ruben Verborgh of iMinds extended the LDF server and client to allow for Memento-based querying. A demonstrator was built in which many versions of DBpedia are made available using the Memento protocol. By adding the correct headers to queries, historical Linked Data dumps can be queried with SPARQL and compared.
In data science, R is a widely used language for data analysis and data mining. The language is known for its strong statistical and graphical support.
Najko Jahn of Bielefeld University created an R client for Memento called Timetravelr. With this tool he demonstrated how HTML tables can be extracted from websites and transformed into a dataset. Using the Memento protocol, this dataset can be tracked over time to generate a time series. In his demonstration Najko showed the evolution of conforming OAI repositories by tracking the OAI registry over time.
GitLab is a web-based Git repository manager with wiki and issue tracking features. GitLab is similar to GitHub, but GitLab has an open source version, unlike GitHub. Bielefeld University Library is using GitLab as a platform to manage source code and (soon) research data. During the Hackathon, Christian Pietsch (Bielefeld University) created a GitLab handler for the Memento TimeGate software using the GitLab Commits API.
PSGI/Plack is Perl middleware for building web applications, comparable with WSGI in Python and Rack in Ruby. Using Plack it becomes very easy to make RESTful web applications with only a few lines of Perl code. By creating Plack plugins, new functionality can be added to existing web applications without changing the application-specific code.
Nicolas Steenlant (Ghent University), Vitali Peil (Bielefeld University) and Maria Hedberg (Lund University) created a Memento plugin for Plack which turns every REST application into a Memento TimeGate if a versioning database is available. As a special case Nicolas, Vitali and Maria demonstrated with Catmandu how versioning can be added to databases such as Elasticsearch, MongoDB, CouchDB and DBI. Programmers only need to take care of the logic of the database records; Catmandu and Plack take care of the rest.
Catmandu is the ETL backbone of the LibreCat project. Using Catmandu, librarians can extract bibliographic data from various sources such as catalogs, institutional repositories, A&I databases and search engines, and transform this data with a small language called Fix. The results of these transformations can be published again into catalogs, search engines, CSV reports, Atom feeds and Linked Data.
During the Hackathon Patrick Hochstenbach (Ghent University) and Snorri Briem (Lund University) created Memento support for the Catmandu tools. As a demonstration they showed how librarians can use Catmandu as a URL checker. MARC records were exported from a catalog as input, URLs were extracted from the 856u field, and these URLs were checked against TimeGates for the availability of archived versions.
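The logic of such a URL checker is small enough to sketch. The Python example below is not Catmandu code and does not use its API: the record layout is a simplified, hypothetical stand-in for MARC, and the TimeGate lookup is passed in as a function so that the sketch runs without network access.

```python
from typing import Callable, Optional

# Hypothetical, simplified record layout: tag -> list of subfield dicts.
RECORDS = [
    {"001": "rec-1", "856": [{"u": "http://librecat.org"}]},
    {"001": "rec-2", "856": [{"u": "http://example.org/gone"}, {"z": "no URL here"}]},
    {"001": "rec-3"},  # record without an 856 field
]


def extract_urls(record: dict) -> list[str]:
    """Collect every 856$u value of one record."""
    return [field["u"] for field in record.get("856", []) if "u" in field]


def check_archived(records: list[dict],
                   find_memento: Callable[[str], Optional[str]]) -> dict:
    """Map each URL found in the records to a Memento URI, or None."""
    report = {}
    for record in records:
        for url in extract_urls(record):
            report[url] = find_memento(url)
    return report


# Stand-in for a real TimeGate lookup; the archive URL is made up.
KNOWN_MEMENTOS = {
    "http://librecat.org": "http://archive.example.org/20141101000000/http://librecat.org",
}


def stub_lookup(url: str) -> Optional[str]:
    return KNOWN_MEMENTOS.get(url)


report = check_archived(RECORDS, stub_lookup)
for url, memento in report.items():
    print(url, "->", memento or "no archived version found")
```

In a real checker `stub_lookup` would be replaced by a function that sends a request to a TimeGate and returns the URI-M it redirects to.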