Why is content migration so difficult?

Migrating content from one application to another is a problem that even now, two decades into the digital age, we have no solution for.  Migrating content is often so labour intensive and complex that it is not cost effective.  Any content migration involves compromises and omissions that result in a significant loss of quality in the metadata held about the content being migrated.
Solving the content migration problem is about to become more urgent with the growing popularity of the software-as-a-service (SaaS) variety of cloud computing.  In this model the provider not only provides the software application, they also host your content.  Imagine what would happen if your organisation decided it wanted to change from one SaaS provider to another.  It wants to change from Salesforce to a different SaaS CRM.  Or it wants to go from SharePoint Online to Box or Huddle or another collaboration/filesharing offering (or vice versa).  What do you do?  How do you migrate content from SharePoint Online to Box? They have little in common in terms of how they are architected and what entities they consist of. What is the Box equivalent of a SharePoint content type?
The vendor lock-in problem is very real.  If you can’t migrate the content you are left paying two sets of SaaS subscriptions and managing two SaaS contracts.  If you were leaving because of a breakdown of trust with your original SaaS provider then how happy would you be leaving your content locked up with them on their servers?

Content migration is a problem that affects all organisations, and which affects archivists as much as records managers

The difficulty organisations experience in migrating content from one application to another matters in many situations. It matters when an organisation wants to replace an application with a better application from a different supplier. It matters in a merger/acquisition scenario, when an organisation wants to move the acquired company onto the same applications that the rest of the group are using.
It matters to archivists, because any transfer of electronic records from an organisation to an archive is, to all intents and purposes, a content migration.  I heard the digital preservation consultant Philip Lord say at a conference that the big difference for archives of the electronic world over the paper world is that:

  • in the paper world it was possible for an archive to set up a routine process for transferring hard-copy records that it would expect all bodies contributing records to the archive to adhere to
  • in the electronic world every time an archive wishes to accept a transfer of records from a new information system it needs to work out a bespoke process for importing that content and its metadata from that particular system into its electronic archive.

Different applications keep their metadata in profoundly different ways

Migration from one application to another is extremely time consuming because you are:

  • mapping from one set of entity types to another. Entities are the types of objects the application can hold (users/groups/documents/records/files/libraries/sites/retention rules etc)
  • mapping from one set of descriptive metadata fields to another
  • mapping from one set of functions to another. Functions are the actions that users can be permitted to perform on entities in the system (for example: create an entity/amend it/rename it/move it/copy it/delete it/attach a retention rule to it/grant or deny access permissions on it)
  • mapping from one set of roles to another. Roles are simply collections of functions, grouped together to make it easier to administrate them. For example in SharePoint the role of ‘member’ of a site collects together the functions a user needs to be able to access a site and view and download content, and to contribute new content to the site, but denies them the functions they would need to administer or change the site itself.
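The four kinds of mapping above all share the same shape, which a minimal Python sketch can show for descriptive metadata fields. The field names and the mapping table are invented for illustration and do not come from any real product. The point is only that whatever has no counterpart in the target application gets dropped.

```python
# Hypothetical field mapping from application A to application B.
# None marks a source field that has no counterpart in the target.
FIELD_MAP = {
    "title": "name",
    "author": "created_by",
    "content_type": None,      # no equivalent in application B
    "retention_rule": None,    # no equivalent in application B
}


def migrate_metadata(source):
    """Return (migrated, lost): metadata carried over, and field names dropped."""
    migrated = {}
    lost = []
    for field, value in source.items():
        target = FIELD_MAP.get(field)
        if target is None:
            lost.append(field)   # unmapped or unknown field: dropped
        else:
            migrated[target] = value
    return migrated, sorted(lost)


record = {
    "title": "Board minutes",
    "author": "jsmith",
    "content_type": "Minutes",
    "retention_rule": "Keep 7 years",
}
migrated, lost = migrate_metadata(record)
```

Here `migrated` holds only the title and author under their new names, while `lost` lists the two fields that had nowhere to go. A real migration repeats this exercise for entity types, functions and roles as well.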

Let us imagine we want to migrate content from application A to application B. Application A has an export function and can export all of its content and metadata into an XML schema. That is good. We go to import the content and metadata into application B. This is where we hit problems.

Application B looks at the audit logs of application A. They contain a listing of events (actions performed by users on entities within the system, at a particular point in time). Each event listed gives you the identity of the user that performed the function, the name of the function they performed, the identity of the entity they performed it on and the date and time at which the event occurred. Application B won’t understand these event listings. It is unlikely to understand the identifiers application A uses to refer to the entities and users. It is unlikely to understand the functions performed because application A has a different set of functions to application B.
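As a rough illustration of the event listing problem, here is a small Python sketch. The `Event` structure follows the four pieces of information described above; the identifiers and function names are made up, and the set of functions known to application B is an assumption for the sake of the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    user_id: str        # identifier assigned by the originating application
    function: str       # name of the action, in the originating application's terms
    entity_id: str      # identifier of the entity acted upon
    occurred_at: datetime


# Hypothetical vocabulary of application B; application A used other names.
FUNCTIONS_KNOWN_TO_B = {"create", "update", "delete"}


def interpretable(events, known_functions):
    """Split events into those the importing application recognises and the rest."""
    understood = [e for e in events if e.function in known_functions]
    opaque = [e for e in events if e.function not in known_functions]
    return understood, opaque


log_from_a = [
    Event("A-u17", "create", "A-doc-001", datetime(2011, 3, 1, 9, 30)),
    Event("A-u17", "declare_as_record", "A-doc-001", datetime(2011, 3, 2, 14, 0)),
    Event("A-u42", "check_in", "A-doc-001", datetime(2011, 3, 5, 11, 15)),
]
understood, opaque = interpretable(log_from_a, FUNCTIONS_KNOWN_TO_B)
```

Two of the three events end up in `opaque`. Even the one that matches by name is not safe: a shared function name does not guarantee shared meaning, and the user and entity identifiers still mean nothing to application B.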

Application B looks at the access control lists of application A. Each entity in application A has an access control list that tells you which users or groups can perform what role in relation to that entity. Application B does not understand those roles, nor does it understand the functions that the roles are made up of. Therefore application B cannot understand the access control lists.

The end result is that application B cannot understand the history of the entities it is importing from application A, and it cannot understand who should be able to access/contribute to/change them.  It is also going to find it difficult to import things like retention rules, descriptive metadata fields, controlled vocabularies.

Migration reduces the quality of metadata

The process of migration is ‘lossy’.  In the world of recorded music it is said that when you move music from one format to another (LP to tape, tape to mp3 etc.) you cannot gain quality, you can only lose it.  When you migrate content from one system to another you cannot gain information about that content, you can only lose it.  There will be whole swathes of metadata in system A that it will not be cost effective for you to map to counterpart metadata in system B.  You end up migrating content without that metadata, and your knowledge about the content that you hold is poorer as a result.

The fact that content migration is so labour intensive and lossy means that many organisations opt to leave content in the original application and start from scratch in the new application.  This is a nice easy option, but there are downsides.  It means that the organisation has to maintain the original application for as long as it needs to keep the content that is locked within it.  This means paying the resultant cost of licence fees and support arrangements.  It also means a break in the memory of the organisation.  Users of the new system wishing to look back over previous years will have to go to the old system to view the content.  That is OK for a short period, during which time most colleagues will remember the old system and how to use it.  But as time goes by a larger and larger percentage of colleagues will have no knowledge or memory of the older system and how to use it.

The organisation may mitigate the impact of that by connecting the search capability of system B to the repository of system A.  The results of this are hit and miss.  The search functionality of system B will have been calibrated to the architecture of system B; it will not be calibrated to the architecture of system A.  Yes, it will return results but it will not be able to rank them very well (and you are still having to maintain system A in order that system B can run the search on it).

What can electronic records management specifications do to improve this situation?

The problem of content migration is not specific to records systems; it is a universal problem that affects any organisation wishing to move content from any kind of application to another application.

But it is a problem that is central to the concerns of records managers and archivists, because as professions we are concerned with the ability to manage records over time, and difficulties in migrating content hamper our ability to manage content over time.  We know that applications have a shelf life: after a period of years a new application comes along that can do the same job better and/or cheaper, and therefore we want to move to the new tool.  The problem is that retention periods for business records are usually longer than the shelf life of applications.  Therefore it is probably from the records management or archives world that a solution will come to this problem, if it comes at all.

The first generation of electronic records management system specifications (everything from the US DoD 5015.2 that first came out in 1998 to MoReq2 which came out in 2008) did not attempt to tackle the problem. They told vendors what types of metadata to put into their products, but they did not tell vendors how to implement that metadata.  For example these specifications would specify that records had to have a unique system identifier, but it was up to the vendor what format that identifier took. Products had to have a permissions model, but what functions and roles they set up was up to the vendor, and so on.

This lack of prescription had the benefit of sparing vendors of existing products the necessity of re-architecting the way they assign identifiers, implement a permissions model, keep event histories etc. Had existing vendors been forced to re-architect in such a way it would have proved a major disincentive for them to produce products that complied with the specification. But the disadvantage was that the electronic document and records management systems (EDRMS) that these specifications gave rise to each had their own permissions models and metadata structures. When an organisation wanted to change from one specification compliant EDRMS to another, it had the same content migration problems as it would have had when migrating content between instances of any other type of information system. An archive (for instance a national archive) wishing to accept records from different EDRM systems would need to come up with a bespoke migration procedure for each product.

MoReq2010’s attempt to facilitate content migration between MoReq2010 compliant systems

MoReq2010 marks something of a break with past electronic records management specifications.  One of its stated aims is to ensure that any compliant system can export its content together with the event history, access control list and contextual metadata of that content, in a way that any system that has the capability of importing MoReq2010 content can understand and use.

In order to do this it has had to be far more prescriptive than previous electronic records management specifications in terms of how products keep metadata.

For example

  • It tells any compliant system to give each implementation of that system a unique identifier. This means that any entity created within that implementation will be able to carry with it to subsequent systems information about the system it originated in
  • It tells every implementation of every compliant system to give each entity it creates the MoReq2010 identifier for that entity type, so that any subsequent MoReq2010 compliant system that the entity is migrated to understands what type of thing that entity is (is it a record? or an aggregation of records? or a classification class or a retention schedule? or a user? or a group? or a role?)
  • It tells every implementation of every compliant system to give every entity created within it a globally unique identifier in a MoReq2010 specified format. Each entity can carry this identifier with it to any subsequent MoReq2010 compliant system, no matter how many times it is migrated
  • It tells every implementation of every compliant system to give each entity an event history that not only records the functions performed on that entity whilst it is in the system, but which can also be carried on and added to by each subsequent system.
  • It tells each compliant system to create an access control list for each entity in the system, that governs who can do what in relation to that entity whilst it is in the system, and which can be understood, used, and added to by any subsequent compliant system that the entity is migrated to.
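To make the list above a little more concrete, here is a minimal Python sketch of an entity that carries its origin, its type, its identifier and its event history with it from one system to the next. It is an illustration only: the class and field names are my own, UUIDs are used as a stand-in for whatever identifier format the specification prescribes, and the function names "create" and "import" are placeholders.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_type: str            # stand-in for a specification-defined entity type identifier
    origin_system_id: str       # identifier of the implementation that created the entity
    entity_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_history: list = field(default_factory=list)
    access_control: list = field(default_factory=list)


def record_event(entity, system_id, function, user_id):
    """Append an event; history is only ever added to, never rewritten."""
    entity.event_history.append(
        {"system": system_id, "function": function, "user": user_id}
    )


system_a = str(uuid.uuid4())
system_b = str(uuid.uuid4())

doc = Entity(entity_type="record", origin_system_id=system_a)
record_event(doc, system_a, "create", "user-1")

# "Migration": the same entity is handed to system B, which keeps
# the identifier and origin and appends to the existing history.
id_before = doc.entity_id
record_event(doc, system_b, "import", "user-9")
```

After the hand-over the entity still has the identifier it was born with, still names system A as its origin, and has a single history with entries from both systems.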

To achieve the last two of these ambitions MoReq2010 had to get into the nitty gritty of how a system implements its permissions model.

MoReq2010 and permissions models

I recorded two podcasts with Jon Garde about the permissions model in MoReq2010:

  • episode 7 of Musing Over MoReq2010 is about the ‘user and group service’ section of the MoReq2010 specification
  • episode 8 (shortly to be published here) is about the ‘model role service’ section, the part of the MoReq2010 specification that deals with functions (the actions users can perform within the system) and roles (collections of functions).

In the latter podcast Jon said that the model role service was the part of MoReq2010 that caused him the most sleepless nights when he wrote it.  The problem was that every product on the market already has a permissions model, with its own way of describing the functions that it allows its users to perform on entities within the system.

The dilemma for Jon writing MoReq2010 was as follows:

  • If the specification prescribed a way for each system to implement its permissions model then existing systems would have to be rewritten and this would act as a major disincentive for vendors to revise their products to comply with MoReq2010
  • If the specification did not prescribe a way for each system to describe the functions that users could perform within it then subsequent systems would not be able to understand the event histories of exported entities (because it would not understand which actions had been performed on the entity concerned) or their access control lists (because it would not understand what particular users/groups of users were entitled to do to that entity)

The solution that Jon adopted was half way between these two options.  In the model role service MoReq2010 outlines its own permissions model, with definitions of a complete set of functions that a record system can allow users to perform on entities.

MoReq2010 does not insist that to be compliant a system must implement every one (or even any one) of the functions that are outlined within the model role service.  It allows products to carry on using their own permissions model.  However MoReq2010 does insist that a system must be able to export its content and metadata with the functions and roles expressed as the functions and roles outlined in the MoReq2010 specification.  In other words a product would need to map its existing permissions model (functions and roles) to MoReq2010 functions and roles.  This would mean that two MoReq2010 compliant systems with entirely different permissions models could both export their content with all of the functions in the access control lists and the event histories expressed as MoReq2010 functions.
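A sketch of what that export-time mapping might look like is given below. Both the native function names and the `SPEC_...` names are invented placeholders, not names taken from the MoReq2010 specification. The sketch only shows the principle: the product keeps its own vocabulary internally and translates it on the way out.

```python
# Hypothetical native function names of one product, mapped to placeholder
# names standing in for functions defined in the specification.
NATIVE_TO_SPEC = {
    "check_in": "SPEC_MODIFY_ENTITY",
    "rename": "SPEC_MODIFY_ENTITY",
    "bin": "SPEC_DELETE_ENTITY",
    "view": "SPEC_INSPECT_ENTITY",
}


def export_events(native_events):
    """Rewrite each event's function into the shared vocabulary for export."""
    exported = []
    for event in native_events:
        native = event["function"]
        if native not in NATIVE_TO_SPEC:
            raise KeyError(f"no mapping for native function {native!r}")
        exported.append({**event, "function": NATIVE_TO_SPEC[native]})
    return exported


native_log = [
    {"user": "u1", "entity": "e1", "function": "check_in"},
    {"user": "u2", "entity": "e1", "function": "bin"},
]
exported_log = export_events(native_log)
```

The native log is left untouched, and the exported copy speaks only the shared vocabulary. Note that several native functions can map to one shared function, so some detail is still flattened on export.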

Mapping the functions and roles in their product’s permission model to MoReq2010’s permission model is a significant body of work for vendors of existing systems, and they will obviously make a commercial judgement as to whether the benefit to them of achieving MoReq2010 compliance outweighs the cost of making those mappings and of implementing the other changes, such as the identifier formats, that MoReq2010 demands.

Because MoReq2010 is so prescriptive as to how systems keep metadata it could well be that it is easier for new entrants to the market to write new products from scratch to comply with the specification than it is for existing vendors to re-architect their products to comply. If I were a vendor writing a new document or records management system from scratch I would certainly think about simply implementing the MoReq2010 permissions model outlined in the model role service.

Why is import more complex than export?

The core modules of MoReq2010 include an export module.  Every compliant system must be able to export entities and their event histories, access control lists and contextual metadata in a MoReq2010 compliant way.   There is no import module in the core modules of MoReq2010.  Vendors can win MoReq2010 compliance for their products without their products being able to import content and its metadata from other MoReq2010 compliant systems.

The import module of MoReq2010 is being written as I write, and is scheduled for release sometime in 2012.  It will not be compulsory.  The reason why the import module is not a compulsory module of the specification is that not all records systems will need to import from other MoReq2010 compliant records systems.  For example by definition the first generation of compliant systems will not have to import from other compliant systems (because they have no predecessor compliant systems to import from!).

It will be more complex for a system to comply with the import requirements of MoReq2010 (when the module is published) than it is with the export requirements.

For example:

  • an existing product that seeks compliance with the core modules of MoReq2010 (but not the additional and optional import module) will have to map its functions (actions/permissions) and roles to the functions and roles outlined in MoReq2010.  It does not have to worry about all the functions listed in MoReq2010 – only the functions that it needs to map its own functions to
  • a product that additionally seeks to comply with the import module of MoReq2010 will need to be able to implement all of the functions listed in MoReq2010, because it needs to be able to import content from any MoReq2010 compliant system, and a MoReq2010 compliant system may choose to use any of the functions listed in MoReq2010.
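The asymmetry between the two cases above comes down to a subset check in one direction and a superset check in the other. The Python sketch below uses five placeholder names (`F1` to `F5`) to stand in for the full list of functions in the specification; the real list is not reproduced here.

```python
# Placeholder names standing in for the full list of functions in the specification.
SPEC_FUNCTIONS = {"F1", "F2", "F3", "F4", "F5"}


def can_export(native_to_spec):
    """An exporter only needs every native function mapped to some spec function."""
    return set(native_to_spec.values()) <= SPEC_FUNCTIONS


def can_import_from_any(supported):
    """An importer must handle every spec function, since any may appear."""
    return SPEC_FUNCTIONS <= set(supported)


exporter_map = {"edit": "F1", "remove": "F2"}      # uses only two spec functions
partial_importer = {"F1", "F2"}
full_importer = {"F1", "F2", "F3", "F4", "F5"}
```

The exporter passes while touching only two of the five functions. An importer that supports the same two fails, because some other compliant system may send it content that uses the remaining three.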

I put it to Jon in our podcast on the model role service that we would know that MoReq2010 had ‘arrived’ if and when someone brings to market a product that complies with the import module and is capable of importing content from MoReq2010 compliant systems.  Once you have products capable of importing from MoReq2010 compliant systems there is all of a sudden a purpose to implementing MoReq2010 compliant systems: the theoretical possibility of being able to pass content onto another system that understands the content as well or nearly as well as the originating system is turned into a practical reality.  Once you have a product that is capable of importing from MoReq2010 compliant systems it is in the interests of anyone implementing that product to influence whoever runs the applications that they wish to import from to make those applications MoReq2010 compliant. Imagine a national archives running an electronic archive with a MoReq2010 import capability.  It would be in the interests of that national archives to persuade the various parts of government who contribute records to them to implement MoReq2010 compliant systems.

Jon’s response on the podcast was to lay down a challenge to the archives world to develop a MoReq2010 compliant electronic archive system, with a MoReq2010 compliant import capability.

What are the chances of MoReq2010 catching on?

MoReq2010 is doubly ambitious.  In this post I have looked at its ambition to ensure that content can take its identifiers, event history, access control list and contextual metadata with it through its life as it migrates from one system to another.  Its other great ambition is to reach a situation where any application in use in a business is routinely expected to have record keeping functionality.   The two ambitions are related to each other.

  • MoReq2010 makes it feasible for the vendor of a line of business system to add records management functionality to their product and get it certified as being a compliant records system. The specification has done this by eliminating from the core modules any requirements that would not be necessary for every system to perform, however small and however specialised. A compliant system does not have to be able to do all the things an organisation-wide electronic records management system would have to do.  It only needs to be able to manage and export its own records. Note that MoReq2010 makes it possible for vendors of line of business systems to seek compliance, but the specification alone cannot incentivise them to do this. Incentivisation would have to come from the market or from organisations that could influence the market
  • Because MoReq2010 allows for records to be kept in multiple line of business and other systems within an organisation, the issue of migration becomes very important.  When a line of business application is replaced the organisation will need to migrate content either to the application’s replacement or to an organisational records repository or to a third party archive. Hence the ambition that any compliant records system can export content and metadata in a way that another compliant system can understand.

Being ambitious carries with it a risk.  MoReq2010 does call for existing vendors to re-architect their systems, and vendors do not like re-architecting their systems.  If too few vendors produce products that comply with the specification then MoReq2010 will go the way of its predecessor, MoReq2, which died because only one vendor felt it was commercially worthwhile to produce a product that complied with it.

In the situation that electronic records management finds itself in, being ambitious is less risky than trying to incrementally tweak previous specifications.  MoReq2 failed because by the time it was published in 2008 the bottom had fallen out of the market for the EDRM systems that it and previous electronic records management system specifications underpinned.  SharePoint had come along and pushed it over like a house of cards.

EDRM fell without so much as a whimper because no-one was prepared to defend it.  Archivists were not prepared to defend it because they had not benefited from it: it was as hard for them to accept electronic transfers from EDRM systems as from any other type of application.  Practitioners were not prepared to defend it because it had proved difficult and expensive to implement monolithic EDRM systems across whole enterprises.  The ECM vendors who had acquired EDRM products were not prepared to defend it because EDRM represented only a relatively small portion of their portfolio, and they had no stomach for a fight with Microsoft.

MoReq2010 has a chance of success.  It is not guaranteed to succeed, but it has a chance.  It has a chance because it is addressing the right two questions: how do we get records management functionality adopted by all business applications? and how do we ensure that content can be migrated easily and without significant loss of metadata from one application to another?

These questions will have to be nailed. If MoReq2010 succeeds in nailing them so much the better.  If it doesn’t, if the market isn’t ready for it, then whatever specifications come after it will have to nail them.  There is no going back to the EDRM ‘one records system-per-organisation’ model.
