Records management before and after the AI revolution

I recently recorded an IRMS podcast with Alan Pelz-Sharpe,  co-author of perhaps the first book on the use of artificial intelligence (AI) for information management purposes.  In the interview Alan said that he thought that the transition to managing information through AI would be an even greater change than the transition from analogue to digital working in the 1990s.

Recordkeeping is a continuous activity.  History suggests that once a society starts keeping records nothing short of the end of that society causes it to stop.   Recordkeeping becomes integral to the functioning of individual organisations and to society as a whole.   Fundamental change in society leads to fundamental change in recordkeeping practices.

This post maps how recordkeeping changed after both the industrial revolution and the digital revolution, and predicts how it will change after the AI revolution.  The post looks across the whole sweep of recordkeeping history to show how the digital revolution changed records management from a service function into a policy function, and how the AI revolution will change it once again, this time into a data science.

In looking over this broad sweep of recordkeeping history we will see three broad trends:

  • The ever increasing volume of records created;
  • The ever increasing dominance of structured data systems over unstructured data;
  • The ever increasing ability to re-classify and re-aggregate all records in a records system.

The AI revolution offers new powers to records managers/information governance professionals to intervene effectively within and across records systems for governance purposes. Like any power this comes with responsibility, and the need to use the power wisely and safely.

This post attempts to outline both the opportunities that AI will offer, and the questions it will pose, to the records management and information governance profession.

Recordkeeping before the industrial revolution

In the United Kingdom our National Archives has an unbroken set of government records dating back to the turn of the 13th century, the point in time at which the administration of the English Kings started keeping copies of the letters and charters that they sent out.  From the 13th century until the mid nineteenth century we can characterise recordkeeping as follows:

  • Records consist largely of correspondence. Correspondence is predominantly kept in simple chronological sequences.
  • The volumes of records are low. The size of the royal administration is very small.  The pace at which correspondence moves (on horseback on bad roads, or by water) is slow.
  • There is no need for a records management profession because the chronological sequences of correspondence do not require records management expertise to manage.
  • Some records do not take the form of correspondence but instead take the form of entries into registers, inventories, index books or ledgers. These are the beginnings of what would later be called structured data.

At this early point in the history of recordkeeping we can already see the fundamental difference between ‘structured’ and ‘unstructured’ data:

  • An item of correspondence (a letter) is unstructured data because at the point of its creation it stands and moves independent of any system or structure.   The letter therefore has to be integrated into some kind of structure with other pieces of correspondence in order for it to fully function as a component part of a record.
  • In contrast an entry into a register, inventory, index book or ledger is an example of structured data because from the moment of its creation it is already integrated within a structure with other similar entries.

One of the enduring endeavours of records management practice has been to ensure that records are consistently captured into a coherent structure.   This endeavour  is vital when organisations are predominantly creating unstructured data such as free-standing and free-moving correspondence and documents.  However it is not nearly as useful to organisations carrying out their work through structured data systems because the structure of the database is set from the outset of the system and records are captured into the structure at the moment of creation.

After the industrial revolution

The industrial revolution at the turn of the nineteenth century first started to bring large concentrations of manual workers together.   By the turn of the twentieth century large concentrations of clerical workers were being brought together in bureaucracies of ever growing government departments, businesses and other institutions.  This led to a revolution in recordkeeping:

  • The volume of records created is now much higher. The size of organisations has grown.  The pace at which correspondence moves (by motorised transport on tarmac road, by rail, steam boat and later by air) is faster.
  • Records can best be characterised as documents. These documents are kept in sophisticated filing systems in which one file (or one set of files) is created for each distinct piece of work.  The files of similar types of work are grouped into records series that can usually be managed by a single access and retention rule.
  • There is a need for a records management profession because the filing systems are sophisticated and the set of retention rules that governs how long records within each different series are kept is also sophisticated.
  • In organisations with strong recordkeeping requirements records management is set up as a service.  In UK government this service is provided by records staff working in registries.  Registries are interposed into the flow of correspondence as it moves from sender to recipient, so that the piece of correspondence is filed before it reaches the recipient.  This has the double advantage that it both ensures the item is filed, and ensures that the recipient reads it in the context of the previous correspondence on that case/matter/project/topic.
  • An organisation is able to classify its different file series to have an overall integrated structure for all its documentation.
  • The volume of structured data also increases, with more sophisticated methods of keeping structured data such as card indexes. This structured data sits outside of the main way that documentation is organised.

The nature of documentation changed after the industrial revolution.  The pre-industrial organisation had captured items of correspondence into chronological series.  The 20th century practice of dedicating a file to each specific piece of work brought into being new classes of document such as ‘file notes’.   These were documents created not primarily as a direct communication from one person/office to another, but as an addition to the file, to ensure that the file could tell the whole story of that work.  The growth in the ability to copy documents (initially through typewriters, typing pools and carbon paper, later through photocopiers) enabled a copy of a document to be placed on each different file to which it related.

After the digital revolution

The precursor to the digital revolution was the computerisation by organisations of various processes and workflows in the 1960s, 1970s and 1980s.  This computerisation was largely restricted to very predictable, high volume processes such as payroll, financial ledgers and stock control.  These processes were computerised through the construction of databases with a data model very specifically adapted to the process in question.

The digital revolution hit the large English-speaking economies in the early 1990s when a way was found of applying a data structure to general business correspondence.  The data structure in question was that contained in the email protocol, which specified the format for one internet-connected computer to send a message (an email) to another.   The spread of email resulted in the spread of computers to the desktop of every clerical worker in the large economies.

Just as a quantum particle such as an electron can be viewed as either a particle or a wave, so emails within email systems can be seen as either:

  • unstructured data – emails are stand alone items of correspondence that move from one person to another, and which should at some point be filed together with other documentation from the same type of work  OR
  • structured data – an email system is a corporate database and each new email is a new entry into the database. Like entries in any other type of database there is no need for either the sender or any of the recipients to file it because it is integrated into the structure/schema of the email system from the moment it is sent/received.
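To illustrate the second view, here is a minimal Python sketch using the standard library's email package.  The sample message, its addresses and its subject are invented for illustration, and the field names in the resulting row are my own choice.  The point is only that the header fields defined by the email format can be read as named values without anyone having filed the message.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr, parsedate_to_datetime

# A made-up message, used purely for illustration.
RAW_EMAIL = (
    "From: Alice Example <alice@example.org>\n"
    "To: Bob Example <bob@example.org>, carol@example.org\n"
    "Subject: Draft retention schedule\n"
    "Date: Mon, 04 Mar 2019 09:30:00 +0000\n"
    "\n"
    "Please find my comments on the draft below.\n"
)

def email_to_record(raw: str) -> dict:
    """Read the header fields of one email into a database-style row."""
    msg = message_from_string(raw)
    return {
        # parseaddr returns a (display name, address) pair.
        "sender": parseaddr(msg.get("From", ""))[1],
        "recipients": [addr for _, addr in getaddresses([msg.get("To", "")])],
        "subject": msg.get("Subject", ""),
        "sent": parsedate_to_datetime(msg["Date"]),
        "body": msg.get_payload().strip(),
    }

record = email_to_record(RAW_EMAIL)
print(record["sender"], record["recipients"], record["sent"].year)
```

Every value in the row comes from the structure of the message itself, which is why neither sender nor recipient has to do anything for the email to sit alongside other entries in the system.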

In the early digital age (1990s to the present day):

  • The predominant form of record is the ‘dataset’.  Organisations have multiple databases. Some are specific to a particular process or line of business, others are corporate-wide.   An email system is a database of correspondence. A content management system is a database of the content available through a website and/or intranet.  A customer relations system is a database of contacts with customers, and so on.  Some operational databases and logistics databases may hold business-critical information and key intellectual property and know-how.
  • The volume and velocity of documentation increases exponentially. The coming of email causes the time taken for a piece of correspondence to travel from sender to recipient to vanish to virtually zero.
  • There is no overall schema for organising records. Each dataset has its own separate metadata schema/data model.
  • Records management becomes a governance/policy function, setting requirements for what individual staff members should and should not do with the documentation and data they create and receive.
  • The transfer of structured data from analogue ledgers, index books, inventories, card indexes and registers to digital databases is transformational, because of new powerful ways to process and analyse data that computers bring with them.
  • Metadata fields enable machines to ‘understand’ data in structured systems. Machines can perform information management tasks when they are given rules which tell them what actions should be triggered by what value appearing in what metadata field.
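The kind of rule described in the last bullet can be sketched in a few lines of Python.  This is an illustration only: the record types, the retention periods and the names Record, RETENTION_YEARS and disposal_action are all hypothetical, and a real retention schedule would be far richer.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Record:
    record_id: str
    record_type: str   # value held in a metadata field, e.g. "invoice"
    closed_on: date    # date on which the record was closed

# Hypothetical retention schedule: record type -> years to keep after closure.
RETENTION_YEARS = {"invoice": 7, "contract": 12}

def disposal_action(record: Record, today: date) -> str:
    """Return the action triggered by the values in the record's metadata."""
    years = RETENTION_YEARS.get(record.record_type)
    if years is None:
        # No rule matches this value, so the machine cannot decide.
        return "refer to records manager"
    # Approximate age in years; good enough for a sketch.
    age_in_years = (today - record.closed_on).days / 365.25
    return "destroy" if age_in_years >= years else "retain"

today = date(2020, 1, 1)
print(disposal_action(Record("A1", "invoice", date(2010, 1, 1)), today))
print(disposal_action(Record("B2", "contract", date(2015, 1, 1)), today))
print(disposal_action(Record("C3", "memo", date(2015, 1, 1)), today))
```

The rule works only because the value in the metadata field is unambiguous.  The third record, whose type has no rule, falls straight back to a human.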

Records management is far less effective in the two decades after the digital revolution than it had been in the four decades before it.

At the start of the digital age the records management profession saw its task as being one of managing electronic documents.  This was based on the assumption that the fundamental change involved in the digital revolution was a change of format, from paper to digital.   We assumed that unstructured data would still predominate over structured data as it had for the entire history of recordkeeping before the digital revolution.   We assumed that items of correspondence would continue to function as unstructured data – free standing, free moving items that needed at some point in their trajectory to be captured and integrated into a structure and a system.

The standard records management strategy in the first decade of the digital age was to configure corporate wide records systems into which documents and correspondence could be captured and integrated with other records within a records classification/structure.

This strategy failed because most correspondence exists as emails within email systems.  An email’s only move is a handover from one email system to another if the sender is on a different email system to the recipient.  There is no point at which it needs to be filed by either the sender or any of the recipients.  It is already integrated into the structure and metadata schema of the email system of both the sender and recipient(s).

The only type of record in the digital age that acts as ‘unstructured data’ and needs consciously capturing into a structure is the document created in word processing, presentation or spreadsheet software such as Microsoft Word, PowerPoint or Excel.  Such documents behave like the documents of the paper age.  At the point they are created they are not yet integrated into a structure, and therefore the creator needs to file them somewhere.  This creates a need for document management systems.

Corporate document management systems merit attention.  They need and deserve careful management.  They act as record systems for documents created in packages such as Microsoft Word, PowerPoint and Excel.  They provide more than enough work for many practitioners.  But we cannot base a profession on them.   Our profession has less and less influence over such systems as the vast bulk of the market for them belongs to just two suppliers (Microsoft and Google).

Corporate document management systems stand in uneasy relation to email systems.  Document management systems rarely act as record systems for correspondence.  Email systems usually act as a record system for documents.  When a document needs to be communicated it is typically emailed.  The email system has a record of the date that the document was sent, who it was sent by, who it was sent to, what message was imparted along with it, and what responses were received by return.  The corporate document management system and the email system both cover all corporate activities.   Each of them holds most of the organisation’s documents, but the email system has so much more besides in terms of the decision trails around and outside of documents.

The latest generation of collaborative systems (such as MS Teams and Slack) are trying to combat this disconnect between email systems and document management systems by bringing team based communications out of email systems and into a collaborative space.  This is a better strategy than seeking to move conversations that have happened in the email environment into a document management environment.   It has a good chance of succeeding where individuals are predominantly communicating with a close knit group of people (for example within a project team).   However it tends not to work as well when individuals are working across team and organisational boundaries with a changing array of interlocutors on a shifting range of matters.  This latter category includes many of the people whose records archivists have typically wanted to see selected for permanent preservation  (policy makers, diplomats etc.).

The AI revolution

The AI revolution is happening now, at the start of the third decade of the twenty first century.  It involves a massive expansion in the scope of judgements that can be made by machine intelligence.

Before the AI revolution machines could make information management judgements only in a constrained set of circumstances, namely when each of the following three conditions was met:

  • the machine is explicitly programmed how to make the judgement;
  • the judgement can be made on the basis of values in metadata fields;
  • the values in those metadata fields were clear and unambiguous.

The AI revolution allows machines to make judgements without having been explicitly programmed to do so.  We no longer need to set out each step a machine needs to follow.  If we use a machine learning tool to identify which emails in email correspondence could be classed as ‘business’ correspondence we would in effect be using one set of algorithms (the machine learning tool) to develop another set of algorithms (the algorithms that will distinguish business from personal/trivial email on the basis of patterns observed in the data).

The most obvious way of training a machine learning tool to identify business correspondence within an email system is to feed it a training set of emails, each of which is labelled as either ‘business’ or ‘personal/trivial’.   The tool looks for the features in the set of business correspondence whose values tend to differ from those of the same features in the non-business correspondence.  It then comes up with a hypothesis algorithm, setting parameters for each data feature.  The algorithm is typically then tested by being fed a mixed set of business and non-business emails that it has not seen before, to see how accurately it distinguishes the two.
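The train-then-test cycle can be made concrete with a toy example.  The sketch below uses a naive Bayes classifier over word counts, chosen only because it fits in a few lines of standard-library Python.  It is not a claim about which model any real product uses.  The emails are invented, and six training examples is far too few for real use.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labelled_emails):
    """Learn per-label word counts from (text, label) pairs."""
    word_counts = {}          # label -> Counter of words seen under that label
    label_counts = Counter()  # label -> number of training emails
    for text, label in labelled_emails:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_emails = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Prior: how common the label was in the training set.
        score = math.log(label_counts[label] / total_emails)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not zero the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

def accuracy(model, labelled_emails):
    """Share of held-out emails that the model labels correctly."""
    correct = sum(classify(model, text) == label for text, label in labelled_emails)
    return correct / len(labelled_emails)

TRAINING_SET = [
    ("please approve the attached contract for the supplier", "business"),
    ("minutes of the project board meeting attached", "business"),
    ("budget forecast for the next quarter needs sign off", "business"),
    ("anyone fancy lunch at the cafe today", "personal/trivial"),
    ("happy birthday hope you have a lovely day", "personal/trivial"),
    ("cakes in the kitchen help yourselves", "personal/trivial"),
]
TEST_SET = [
    ("contract for the new supplier needs sign off", "business"),
    ("lunch today in the kitchen", "personal/trivial"),
]

model = train(TRAINING_SET)
print(accuracy(model, TEST_SET))
```

The word counts play the role of the parameters set for each data feature, and the held-out test set plays the role of the mixed set of emails used to check the result.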

Whereas machine based rules before the AI revolution worked on certainty, algorithms work on probability.   An informal tone in an email might increase the probability that an email is trivial (or personal), but it does not give certainty.  By taking into account other data features (the subject line of the email, the number of recipients, the roles of the recipients, the topic of the email as indicated by words in the body of the message etc.) the algorithm is able to increase its own confidence in its classification of the email as ‘trivial/personal’ (or as ‘business’).   Machine learning algorithms can tell you not only what they have classified an item as, but also the percentage certainty with which the judgement has been made. This can help the organisation set threshold certainty levels below which judgements should be checked by humans.
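A threshold of this kind is simple to express.  In the sketch below the function name route, the 0.9 threshold and the probabilities are all illustrative assumptions.  In practice the probabilities would come from whatever classifier the organisation had trained.

```python
def route(probabilities, threshold=0.9):
    """Accept the top label automatically only if it is confident enough.

    `probabilities` maps each candidate label to the probability that a
    (hypothetical) classifier assigned to it.
    """
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    decision = "auto-apply" if confidence >= threshold else "human review"
    return label, confidence, decision

print(route({"business": 0.97, "personal/trivial": 0.03}))
print(route({"business": 0.62, "personal/trivial": 0.38}))
```

Both emails receive the same label, but only the first is confident enough to be acted on without a person checking it.  Where to set the threshold is a governance decision rather than a technical one.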

The nature of recordkeeping after the AI revolution

On the basis of the previous history of recordkeeping, here are some predictions as to how recordkeeping will be shaped by, and will adapt to, the AI revolution:

  • Records management/information governance will become a data science, overseeing algorithms that apply record classifications and/or record retention and access rules.
  • The point at which we know information governance has entered the AI age is the point in time after which access and retention rules are applied to aggregations into which records have been assigned by a machine learning algorithm.
  • To an algorithm everything is data. If there are patterns in a set of data then an algorithm can learn those patterns and use its knowledge of those patterns to make distinctions.   Machines are no longer restricted to acting on highly structured metadata.  Algorithms can identify patterns in any kind of data, structured or unstructured.
  • Organisations will continue to have multiple databases. Some algorithms might use data from one database to manage data in another (for example you might use information taken from job descriptions in an HR database to assist algorithms identifying important business emails in an email system).
  • The volume and velocity of documentation and data will continue to rise, as AI algorithms generate content (for example by automated replies or automated chat bots) as well as help manage it.
  • Algorithms, like humans, tend to understand data best when they view it in the context of its originating application. Email is best understood within email systems, or within repositories that can replicate the structure and functioning of email systems. There is no longer any necessity to move content out of one structured database (such as an email system) into another system.
  • Organisations will have the technical possibility of having one overall structure/schema for organising records.  But this dream is likely to remain elusive due to the fact that data created within a structured dataset is usually much more meaningful and manageable within the structure of that dataset than it would be outside of it.  Algorithms will be used more often to make data in a dataset manageable than to break data out of its original dataset to manage it through an alternative structure.
  • AI brings with it some possibilities that humans have never had before.  For example the possibility to restructure an entire records system to enable access and retention rules to be applied to a completely different set of aggregations than were present when individual action officers created or received the documentation.  Learning whether (and if so how and when) to use this capability will be a challenge for the recordkeeping profession.



Records management in an era dominated by structured data

The rise of structured data poses a challenge to records management theory.  This theory has, for the most part, been based on the assumption that the majority of records (including correspondence and other types of documentation) are created as free standing objects (unstructured data) that move independently of any structure and therefore need at some point to be integrated into a structure.

This theory needs to be refined to enable it to adapt to the reality that since the digital revolution even correspondence is created and shared within a structured database. Such a theory would de-emphasise the importance of building record structures to integrate records into (because most records, including all email correspondence, are created within a database that already has a structure and schema).  It would instead emphasise the importance of establishing a defensible, pragmatic and consistent basis for the application of retention and access rules across the different structures and schemas of the different datasets of the organisation.

AI and the possibility to re-structure and re-aggregate entire records systems

The most far-reaching change of the AI revolution is that the ability to re-organise all the items in a record system is for the first time unconstrained by the original metadata schema of that system.   A records management/information governance team will in theory have the ability to use any relevant classification logic (any classification scheme that bears any relation to the content of the record system) to re-aggregate content in a record system.   The team will be able to apply retention and access rules through those new aggregations.  The re-aggregation could be carried out at any time in the existence of the system (meaning that a record might be reassigned to a new aggregation for governance purposes one second, one day, one month, one year, one decade or one century after its creation or receipt).
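The mechanics of re-aggregation can be sketched as follows.  Everything here is hypothetical: the records, the activity names and the retention periods are invented, and classify_activity is a keyword lookup standing in for a trained model, not machine learning in its own right.

```python
from collections import defaultdict

# Hypothetical records, each originally aggregated under an email account.
RECORDS = [
    {"id": 1, "account": "alice", "text": "invoice for supplier payment"},
    {"id": 2, "account": "alice", "text": "interview notes for recruitment panel"},
    {"id": 3, "account": "bob", "text": "supplier invoice query"},
]

# Hypothetical retention rules (in years) keyed by the NEW aggregations.
RETENTION_YEARS = {"finance": 7, "recruitment": 1, "unclassified": 2}

def classify_activity(text):
    """Stand-in for a trained model: a keyword lookup, NOT machine learning."""
    words = set(text.lower().split())
    if "invoice" in words:
        return "finance"
    if "recruitment" in words:
        return "recruitment"
    return "unclassified"

def reaggregate(records, classifier):
    """Group record ids by classifier output, ignoring the original accounts."""
    aggregations = defaultdict(list)
    for record in records:
        aggregations[classifier(record["text"])].append(record["id"])
    return dict(aggregations)

new_aggregations = reaggregate(RECORDS, classify_activity)
for activity, ids in new_aggregations.items():
    print(activity, ids, RETENTION_YEARS[activity])
```

Records 1 and 3 began life in different email accounts but end up in the same aggregation, governed by the same retention rule.  Nothing in the original schema anticipated that grouping, which is exactly what makes the capability both powerful and risky.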

This poses two fundamental questions for records management theory and practice:

  • What are the implications of the ability to re-classify, re-aggregate and/or re-label all the items in a record system for a profession whose aim has traditionally been to build and maintain governance regimes for information based on predictable access permissions and predictable retention rules being applied to predictable aggregations of records?
  • What are the implications of the possibility to assign retention and access rules to aggregations that did not exist when the records were originally created and received, and that the creators/recipients of the records would not have envisaged being used to apply access and retention rules?

To make these questions more concrete let us think of them in relation to email – the great unsolved records management challenge brought by the digital revolution.

In email systems correspondence is aggregated into email accounts and access permissions are applied to correspondence via those accounts.   AI opens up three options for the application of retention rules and access permissions to emails:

  • Ignore the existing structure/schema – bypass email accounts:  use AI to re-aggregate email correspondence (for instance by applying a corporate records classification) so that access permissions and/or retention rules are no longer applied via email accounts but instead through the records classification.
  • Stick with the existing structure/schema – make email accounts manageable:  use AI to make email accounts more manageable by identifying trivial, personal and sensitive emails within email accounts.
  • Use the existing structure and schema as a starting point – enhance email accounts and then move beyond them:  use AI to classify emails within email accounts by business activity, but continue to use email accounts as the main aggregation for the application of access permissions.   As individuals get used to the classification of their email by machine against business activity, so they could be given the option of opening up access to selected colleagues to correspondence of selected activities within their email account.

The first approach is high risk, the second approach is low benefit.  The third approach offers the possibility of providing benefits to individual email account users and their colleagues through incremental change.

We should be looking for what Dave Snowden might call ‘safe-fail’ approaches to the introduction of AI.  In such approaches machine learning classifications are first introduced alongside (or within) existing structures, then gradually begin to become more influential over the application of access permissions and retention rules as confidence in the machine learning process grows.
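One possible reading of such a staged introduction is sketched below.  The stage names, the two thresholds and the idea of measuring agreement against a sample of human-reviewed items are my own illustrative assumptions, not an established method.

```python
def agreement_rate(reviewed):
    """Share of reviewed items where the machine label matched the human one."""
    if not reviewed:
        return 0.0
    matches = sum(1 for machine, human in reviewed if machine == human)
    return matches / len(reviewed)

def rollout_stage(reviewed, advise_at=0.8, apply_at=0.95):
    """Pick how much influence machine classifications are given.

    `reviewed` is a list of (machine_label, human_label) pairs from a
    checked sample. The thresholds are illustrative assumptions.
    """
    rate = agreement_rate(reviewed)
    if rate >= apply_at:
        return "apply rules via machine classification"
    if rate >= advise_at:
        return "show suggestions to account holders"
    return "record suggestions silently"

# 17 agreements and 3 disagreements in a hypothetical review sample.
sample = [("business", "business")] * 17 + [("business", "personal/trivial")] * 3
print(rollout_stage(sample))
```

At each stage the existing structure (the email account) remains in place, so a poorly performing classifier costs little and can be withdrawn without harm.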

The theories and explanations outlined in this article have been developed during the course of my Loughborough University doctoral research project, which is looking at archival policy towards email from a realist perspective.  A paper from this project, ‘The defensible deletion of government email’, was published by the Records Management Journal in March 2019.  An open access version of this paper is available from Loughborough University’s digital repository here (once in the repository click on ‘download’ to download the pdf, or read it in the window provided).