A big divide is often perceived between the domain of ‘structured data’ sitting in rows and columns within databases, and that of ‘unstructured data’ (documents, images etc). In fact the two domains are related. Every document management system has a database within it to maintain the metadata about the objects it holds. There is little value in being able to migrate, extract or preserve the documents unless you can also migrate, extract or preserve the metadata held in the database.
Last Friday I had a discussion with Kevin Ashley about digital preservation and the challenges of archiving data from databases.
Archivists encountered the problem of archiving databases earlier than they faced the problem of archiving electronic documents from document management systems. This was simply because organisations used computing power for structured data earlier than they did for the creation and storage of documents. In the early 1980s Kevin was already involved in attempting to rescue data from legacy databases within a UK government research council.
In 1997 Kevin led the creation of NDAD, the National Digital Archive of Datasets. This pioneering service was set up under contract to the UK National Archives (then called the Public Record Office), and opened in 1998.
At the time it was widely thought within organisations that databases were the sphere of IT professionals and data managers, not records professionals (records managers/archivists). In fact NDAD needed the skills of all three of these professions. Kevin said that the contribution that the archivists in NDAD made was invaluable, because they knew how to draw up agreements with the contributing organisation, and how to capture the context of the dataset (who created it, why, how, what they used it for etc.).
The National Archives required NDAD to use the ISAD G standard of archival description to catalogue each dataset. I asked Kevin whether ISAD G had been difficult to adapt to structured datasets (it was written with files/documents in mind). Kevin said that ISAD G had been very useful and worked very well for the datasets.
I asked Kevin whether he thought that the databases in use by organisations today were easier or harder to archive than the databases in use in the 1990s. Kevin said that the challenges were different. An individual database in an organisation today was easier to understand and extract data from than an equivalent database in the 1990s. But the challenge today is that the databases in an organisation tend to be integrated with each other. For example all or most databases in an organisation may use the organisation’s people directory to hold information about their users. As soon as you try to archive data from one database you are faced with the challenge of archiving data from all the other databases that it drew data from.
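The dependency problem Kevin describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `directory` and `grants` tables, their columns and the sample rows are all invented, and both tables live in one in-memory SQLite database for brevity, whereas in a real organisation they would usually sit in separate systems.

```python
import sqlite3

# Hypothetical schema: 'directory' stands in for the organisation's people
# directory, 'grants' for one business database that stores only user ids.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE directory (user_id INTEGER PRIMARY KEY, name TEXT, department TEXT);
    CREATE TABLE grants (grant_id INTEGER PRIMARY KEY, title TEXT, approved_by INTEGER);
    INSERT INTO directory VALUES (1, 'A. Example', 'Research Funding');
    INSERT INTO grants VALUES (100, 'Soil survey', 1);
""")

def export_grants_alone(conn):
    """Export the grants table by itself: approved_by is an opaque number."""
    return conn.execute(
        "SELECT grant_id, title, approved_by FROM grants ORDER BY grant_id"
    ).fetchall()

def export_grants_with_context(conn):
    """Export grants with the directory data they depend on resolved."""
    return conn.execute("""
        SELECT g.grant_id, g.title, d.name, d.department
        FROM grants g
        LEFT JOIN directory d ON d.user_id = g.approved_by
        ORDER BY g.grant_id
    """).fetchall()

alone = export_grants_alone(conn)
with_context = export_grants_with_context(conn)
```

Exported alone, the grants row records only that user 1 approved it. The second export is meaningful to a future reader, but only because it also pulled data out of the directory.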
I asked Kevin whether initiatives such as the open data initiative at data.gov.uk and the whole linked data/semantic web movement would mean there is less of a role for archivists in getting government datasets into the public domain. He said that it was very rare for an organisation to make an entire database available to the public online. Usually the public would be given access to a derived database, which contains only a subset of the data in the database used by the government department itself. So there is still a role for the archivist in ensuring that the database that actually informed government decisions is preserved.
Kevin talked about one of the perennial challenges of digital preservation, the challenge of avoiding lock-in to an application and ensuring that information (whether it be structured data in a database, e-mails in an e-mail system, or documents in a document management system) can be exported from a particular application when that application is no longer used by the organisation. I was interested in the parallels here with the recently published MoReq 2010 records management standard.
MoReq 2010 was written on the premise that the content of applications is typically still of value after the application itself has fallen into disuse, and therefore a key attribute of any application must be that it can export data, objects and metadata in a form that another application can understand. After we had stopped recording we realised that the link between NDAD and MoReq 2010 comes in the person of Richard Blake, who has recently retired from the UK National Archives. Richard was strongly involved both in the creation of NDAD in the 1990s and in the writing this year (with Jon Garde, Richard Jeffrey-Cook and others) of MoReq 2010.
Kevin reported Richard as saying that one of the weaknesses of early electronic records management system specifications (such as TNA 2002 from the National Archives) was that although every compliant EDRM system would keep metadata in a way that could be exported, each different EDRMS kept metadata in a different way, so it was hard for an organisation to migrate from one vendor’s EDRM system to another. It was this experience that informed MoReq 2010’s decision to define very precisely how systems keep metadata, in order that the metadata of one MoReq 2010 compliant system could be understood by any other MoReq 2010 compliant system.
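To make the migration problem concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two vendor layouts, the field names and the `to_common` helper are invented, and none of them is taken from TNA 2002, MoReq 2010 or any real EDRM product.

```python
# Hypothetical exports: two EDRM systems describing the same record, each
# using its own field names. Neither layout comes from a real product.
VENDOR_A_TO_COMMON = {"doc_title": "title", "created": "created_date", "owner": "creator"}
VENDOR_B_TO_COMMON = {"Name": "title", "DateCreated": "created_date", "Author": "creator"}

def to_common(record, mapping):
    """Rename a vendor's fields to a shared schema.

    Fields with no mapping are kept under 'unmapped' rather than dropped,
    so the gaps in a migration stay visible.
    """
    common = {}
    unmapped = {}
    for key, value in record.items():
        if key in mapping:
            common[mapping[key]] = value
        else:
            unmapped[key] = value
    if unmapped:
        common["unmapped"] = unmapped
    return common

record_a = {"doc_title": "Board minutes", "created": "2001-03-14", "owner": "J. Smith"}
record_b = {"Name": "Board minutes", "DateCreated": "2001-03-14", "Author": "J. Smith"}

migrated_a = to_common(record_a, VENDOR_A_TO_COMMON)
migrated_b = to_common(record_b, VENDOR_B_TO_COMMON)
```

Both exports end up as the same dictionary, but only because somebody wrote a mapping table for each vendor. A specification that fixes the metadata model in advance is meant to make that per-vendor step unnecessary.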
Click on the play button below to hear the podcast. If you don’t see the play button (it needs Flash) then you can get the podcast from here. The podcast is 44 minutes long.
This podcast was recorded for the Records Management Today podcast series, hosted by Northumbria University. To see all the podcasts in the series visit Records Management Today.
Elizabeth Shepherd and Charlotte Smith have written an article on the application of ISAD G to archival datasets, which focused on NDAD’s experience: ‘The application of ISAD(G) to the description of archival datasets’, Journal of the Society of Archivists 21: 1 (April 2000): 55-86.
Patricia Sleeman, digital archivist at the University of London, wrote an article for the journal Archivaria about the National Digital Archive of Datasets. The article is available as a free pdf download from here
In the podcast Kevin discussed two datasets accessioned by NDAD:
Kevin Ashley is Director of the Digital Curation Centre. He is @kevingashley on Twitter.