Records management before and after the AI revolution

I recently recorded an IRMS podcast with Alan Pelz-Sharpe,  co-author of perhaps the first book on the use of artificial intelligence (AI) for information management purposes.  In the interview Alan said that he thought that the transition to managing information through AI would be an even greater change than the transition from analogue to digital working in the 1990s.

Recordkeeping is a continuous activity.  History suggests that once a society starts keeping records nothing short of the end of that society causes it to stop.   Recordkeeping becomes integral to the functioning of individual organisations and to society as a whole.   Fundamental change in society leads to fundamental change in recordkeeping practices.

This post maps how recordkeeping changed after both the industrial revolution and the digital revolution, and predicts how it will change after the AI revolution.  The post looks across the whole sweep of recordkeeping history to show how the digital revolution changed records management from a service function into a policy function, and how the AI revolution will change it once again, this time into a data science.

In looking over this broad sweep of recordkeeping history we will see three broad trends:

  • The ever increasing volume of records created;
  • The ever increasing dominance of structured data systems over unstructured data;
  • The ever increasing ability to re-classify and re-aggregate all records in a records system.

The AI revolution offers new powers to records managers/information governance professionals to intervene effectively within and across records systems for governance purposes. Like any power this comes with responsibility, and the need to use the power wisely and safely.

This post attempts to outline both the opportunities that AI will offer, and the questions it will pose, to the records management and information governance profession.

Recordkeeping before the industrial revolution

In the United Kingdom our National Archives has an unbroken set of government records dating back to the turn of the 13th century, the point at which the administration of the English Kings started keeping copies of the letters and charters that they sent out.  From the 13th century until the mid nineteenth century we can characterise recordkeeping as follows:

  • Records consist largely of correspondence. Correspondence is predominantly kept in simple chronological sequences.
  • The volumes of records are low. The size of the royal administration is very small.  The pace at which correspondence moves (on horseback on bad roads, or by water) is slow.
  • There is no need for a records management profession because the chronological sequences of correspondence do not require records management expertise to manage.
  • Some records do not take the form of correspondence but instead take the form of entries in registers, inventories, index books or ledgers. These are the beginnings of what would later be called structured data.

At this early point in the history of recordkeeping we can already see the fundamental difference between ‘structured’ and ‘unstructured’ data:

  • An item of correspondence (a letter) is unstructured data because at the point of its creation it stands and moves independent of any system or structure.   The letter therefore has to be integrated into some kind of structure with other pieces of correspondence in order for it to fully function as a component part of a record.
  • In contrast an entry in a register, inventory, index book or ledger is an example of structured data because from the moment of its creation it is already integrated within a structure with other similar entries.

One of the enduring endeavours of records management practice has been to ensure that records are consistently captured into a coherent structure.   This endeavour  is vital when organisations are predominantly creating unstructured data such as free-standing and free-moving correspondence and documents.  However it is not nearly as useful to organisations carrying out their work through structured data systems because the structure of the database is set from the outset of the system and records are captured into the structure at the moment of creation.

After the industrial revolution

The industrial revolution at the turn of the nineteenth century first started to bring large concentrations of manual workers together.   By the turn of the twentieth century large concentrations of clerical workers were being brought together in bureaucracies of ever growing government departments, businesses and other institutions.  This led to a revolution in recordkeeping:

  • The volume of records created is now much higher. The size of organisations has grown.  The pace at which correspondence moves (by motorised transport on tarmac roads, by rail, by steam boat and later by air) is faster.
  • Records can best be characterised as documents. These documents are kept in sophisticated filing systems in which one file (or one set of files) is created for each distinct piece of work.  The files of similar types of work are grouped into records series that can usually be managed by a single access and retention rule.
  • There is a need for a records management profession because the filing systems are sophisticated and the set of retention rules that governs how long records within each different series are kept is also sophisticated.
  • In organisations with strong recordkeeping requirements records management is set up as a service.  In UK government this service is provided by records staff working in registries.  Registries are interposed into the flow of correspondence as it moves from sender to recipient, so that the piece of correspondence is filed before it reaches the recipient.  This has the double advantage that it both ensures the item is filed, and ensures that the recipient reads it in the context of the previous correspondence on that case/matter/project/topic.
  • An organisation is able to classify its different file series to have an overall integrated structure for all its documentation.
  • The volume of structured data also increases, with more sophisticated methods of keeping structured data such as card indexes. This structured data sits outside of the main way that documentation is organised.

The nature of documentation changed after the industrial revolution.  The pre-industrial organisation had captured items of correspondence into chronological series.  The 20th century practice of dedicating a file to each specific piece of work brought into being new classes of document such as ‘file notes’.   These were documents created not primarily as a direct communication from one person/office to another, but as an addition to the file, to ensure that the file could tell the whole story of that work.  The growth in the ability to copy documents (initially through typewriters, typing pools and carbon paper, later through photocopiers) enabled a copy of a document to be placed on each different file to which it related.

After the digital revolution

The precursor to the digital revolution was the computerisation by organisations of various processes and workflows in the 1960s, 1970s and 1980s.  This computerisation was largely restricted to very predictable, high volume processes such as payroll, financial ledgers and stock control.  These processes were computerised through the construction of databases with a data model very specifically adapted to the process in question.

The digital revolution hit the large English-speaking economies in the early 1990s when a way was found of applying a data structure to general business correspondence.  The data structure in question was that contained in the email protocol, which specified the format in which one internet-connected computer sends a message (an email) to another.   The spread of email resulted in the spread of computers to the desktop of every clerical worker in the large economies.

Just as a quantum particle such as an electron can be viewed as either a particle or a wave, so emails within email systems can be seen as either:

  • unstructured data – emails are stand alone items of correspondence that move from one person to another, and which should at some point be filed together with other documentation from the same type of work  OR
  • structured data – an email system is a corporate database and each new email is a new entry into the database. Like entries in any other type of database there is no need for either the sender or any of the recipients to file it because it is integrated into the structure/schema of the email system from the moment it is sent/received.

In the early digital age (1990s to the present day):

  • The predominant form of record is the ‘dataset’.  Organisations have multiple databases. Some are specific to a particular process or line of business, others are corporate-wide.   An email system is a database of correspondence. A content management system is a database of the content available through a website and/or intranet.  A customer relations system is a database of contacts with customers, and so on.  Some operational databases and logistics databases may hold business critical information and key intellectual property and know-how.
  • The volume and velocity of documentation increases exponentially. The coming of email causes the time taken for a piece of correspondence to travel from sender to recipient to vanish to virtually zero.
  • There is no overall schema for organising records. Each dataset has its own separate metadata schema/data model.
  • Records management becomes a governance/policy function, setting requirements for what individual staff members should and should not do with the documentation and data they create and receive.
  • The transfer of structured data from analogue ledgers, index books, inventories, card indexes and registers to digital databases is transformational, because of new powerful ways to process and analyse data that computers bring with them.
  • Metadata fields enable machines to ‘understand’ data in structured systems. Machines can perform information management tasks when they are given rules which tell them what actions should be triggered by what value appearing in what metadata field.
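The last point above, rules that tie an action to a value in a metadata field, can be sketched in a few lines of Python. This is only an illustration: the field names, values and actions are hypothetical and are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    field: str   # metadata field to inspect
    value: str   # exact value that triggers the rule
    action: str  # action to take when the value matches


# Hypothetical rules, invented purely for illustration.
RULES = [
    Rule("record_type", "invoice", "retain_7_years"),
    Rule("status", "draft", "delete_after_1_year"),
]


def actions_for(metadata: dict[str, str], rules: list[Rule] = RULES) -> list[str]:
    """Return the actions triggered by exact matches on metadata values."""
    return [r.action for r in rules if metadata.get(r.field) == r.value]


# A record with clear metadata triggers a rule ...
print(actions_for({"record_type": "invoice", "status": "final"}))
# ... but an item with no matching field (such as a free-text email) triggers nothing.
print(actions_for({"subject": "Re: lunch?"}))
```

The second call returns an empty list, which is the limitation in miniature: a rule of this kind can only act where a clear, unambiguous value sits in a known field.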

Records management has been far less effective in the two decades since the digital revolution than it was in the four decades before it.

At the start of the digital age the records management profession saw its task as being one of managing electronic documents.  This was based on the assumption that the fundamental change involved in the digital revolution was a change of format, from paper to digital.   We assumed that unstructured data would still predominate over structured data as it had for the entire history of recordkeeping before the digital revolution.   We assumed that items of correspondence would continue to function as unstructured data: free standing, free moving items that needed at some point in their trajectory to be captured and integrated into a structure and a system.

The standard records management strategy in the first decade of the digital age was to configure corporate wide records systems into which documents and correspondence could be captured and integrated with other records within a records classification/structure.

This strategy failed because most correspondence exists as emails within email systems.  The only move an email makes is a handover from one email system to another if the sender is on a different email system to the recipient.  There is no point at which it needs to be filed by either the sender or any of the recipients.  It is already integrated into the structure and metadata schema of the email systems of both the sender and the recipient(s).

The only type of record in the digital age that acts as ‘unstructured data’ and needs to be consciously captured into a structure is the document created in word processing, presentation or spreadsheet software such as Microsoft Word, PowerPoint or Excel.  Such documents behave like the documents of the paper age.  At the point they are created they are not yet integrated into a structure, and therefore the creator needs to file them somewhere.  This creates a need for document management systems.

Corporate document management systems merit attention.  They need and deserve careful management.  They act as record systems for documents created in packages such as Microsoft Word, PowerPoint and Excel.  They provide more than enough work for many practitioners.  But we cannot base a profession on them.   Our profession has less and less influence over such systems as the vast bulk of the market for them belongs to just two suppliers (Microsoft and Google).

Corporate document management systems stand in uneasy relation to email systems.  Document management systems rarely act as record systems for correspondence.  Email systems usually act as a record system for documents.  When a document needs to be communicated it is typically emailed.  The email system has a record of the date that the document was sent, who it was sent by, who it was sent to, what message was imparted along with it, and what responses were received by return.  The corporate document management system and the email system both cover all corporate activities.   Each of them holds most of the organisation’s documents, but the email system has so much more besides in terms of the decision trails around and outside of documents.

The latest generation of collaborative systems (such as MS Teams and Slack) are trying to combat this disconnect between email systems and document management systems by bringing team based communications out of email systems and into a collaborative space.  This is a better strategy than seeking to move conversations that have happened in the email environment into a document management environment.   It has a good chance of succeeding where individuals are predominantly communicating with a close knit group of people (for example within a project team).   However it tends not to work as well when individuals are working across team and organisational boundaries with a changing array of interlocutors on a shifting range of matters.  This latter category includes many of the people whose records archivists have typically wanted to see selected for permanent preservation  (policy makers, diplomats etc.).

The AI revolution

The AI revolution is happening now, at the start of the third decade of the twenty first century.  It involves a massive expansion in the scope of judgements that can be made by machine intelligence.

Before the AI revolution machines could make information management judgements only in a constrained set of circumstances, namely when each of the following three conditions was met:

  • the machine was explicitly programmed how to make the judgement;
  • the judgement could be made on the basis of values in metadata fields;
  • the values in those metadata fields were clear and unambiguous.

The AI revolution allows machines to make judgements without having been explicitly programmed to do so.  We no longer need to set out each step a machine needs to follow.  If we use a machine learning tool to identify which emails in email correspondence could be classed as ‘business’ correspondence we would in effect be using one set of algorithms (the machine learning model) to develop another set of algorithms (the algorithms that will distinguish business from personal/trivial email on the basis of patterns observed in the data).

The most obvious way of training a machine learning tool to identify business correspondence within an email system is to feed it a training set of emails, each of which is labelled as either ‘business’ or ‘personal/trivial’.   The machine learning model looks for the features in the set of business correspondence whose values tend to differ from those of the same features in the non-business correspondence. The tool comes up with a hypothesis algorithm, setting parameters for each data feature.  The algorithm is typically then tested by being fed a mixed set of business and non-business emails that it has not seen before, to see how accurately it distinguishes the two.
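The train-then-test cycle described above can be sketched with a very small classifier. The sketch below uses a naive Bayes word-count model, which is my choice for illustration only (the text does not name a technique), and the six training emails and two test emails are invented. A real project would need far more data and a proper toolkit.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labelled_emails):
    """Learn per-label word counts from (text, label) pairs."""
    word_counts = {}          # label -> Counter of words seen under that label
    label_counts = Counter()  # label -> number of training emails
    for text, label in labelled_emails:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"words": word_counts, "labels": label_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest naive Bayes log score."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n in model["labels"].items():
        counts = model["words"][label]
        denom = sum(counts.values()) + vocab_size  # add-one smoothing
        score = math.log(n / total)
        for word in tokenize(text):
            if word in model["vocab"]:             # ignore words never seen in training
                score += math.log((counts[word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, test_set):
    """Share of held-out emails whose predicted label matches the human label."""
    correct = sum(classify(model, text) == label for text, label in test_set)
    return correct / len(test_set)


# Invented examples, for illustration only.
TRAINING = [
    ("please review the attached contract before the board meeting", "business"),
    ("budget approval needed for the project invoice", "business"),
    ("minutes of the project meeting attached", "business"),
    ("lunch on friday anyone", "personal/trivial"),
    ("happy birthday see you at the party", "personal/trivial"),
    ("cake in the kitchen help yourself", "personal/trivial"),
]
TEST = [
    ("contract and invoice for the project", "business"),
    ("birthday cake at lunch", "personal/trivial"),
]

model = train(TRAINING)
print(accuracy(model, TEST))
```

The point of the held-out TEST set is that the model is judged on emails it was not trained on; with data this small the result says nothing about real-world performance.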

Whereas machine based rules before the AI revolution worked on certainty, machine learning algorithms work on probability.   An informal tone in an email might increase the probability that an email is trivial (or personal), but it does not give certainty.  By taking into account other data features (the subject line of the email, the number of recipients, the roles of the recipients, the topic of the email as indicated by words in the body of the message etc.) the algorithm is able to increase its own confidence in its classification of the email as ‘trivial/personal’ (or as ‘business’).   Machine learning algorithms can tell you not only what they have classified an item as, but also the level of confidence, expressed as a percentage, with which the judgement has been made. This can help the organisation set threshold confidence levels below which judgements should be checked by humans.
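A rough sketch of both ideas, evidence accumulating into a probability and a threshold that decides when a human checks the result, might look like this. The feature names, the evidence weights and the 0.9 threshold are all hypothetical, and combining evidence as summed log-odds is one common way to do it rather than the only one.

```python
import math


def probability_from_evidence(log_odds_by_feature):
    """Combine per-feature evidence (expressed as log-odds) into one probability."""
    total = sum(log_odds_by_feature.values())
    return 1 / (1 + math.exp(-total))


def route(probabilities, threshold=0.9):
    """Pick the most probable label and decide whether a human should check it.

    probabilities: dict of label -> probability (assumed to sum to 1).
    threshold: hypothetical confidence level below which a human reviews.
    """
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    decision = "auto" if confidence >= threshold else "human_review"
    return label, confidence, decision


# One weak signal on its own: is this email trivial?
p_one = probability_from_evidence({"informal_tone": 0.8})
# The same signal plus two more (weights invented for illustration).
p_three = probability_from_evidence(
    {"informal_tone": 0.8, "single_recipient": 0.7, "subject_mentions_lunch": 1.5}
)

print(route({"personal/trivial": p_one, "business": 1 - p_one}))
print(route({"personal/trivial": p_three, "business": 1 - p_three}))
```

With these made-up weights the first email scores roughly 0.69 and is sent for human review, while the second scores roughly 0.95 and is handled automatically. In practice the confidence figures reported by a model are not always well calibrated, so a threshold is best set by checking it against human-reviewed samples.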

The nature of recordkeeping after the AI revolution

On the basis of the previous history of recordkeeping, here are some predictions as to how recordkeeping will be shaped by, and will adapt to, the AI revolution:

  • Records management/information governance will become a data science, overseeing algorithms that apply record classifications and/or record retention and access rules.
  • The point at which we know information governance has entered the AI age is the point in time after which access and retention rules are applied to aggregations into which records have been assigned by a machine learning algorithm.
  • To an algorithm everything is data. If there are patterns in a set of data then an algorithm can learn those patterns and use its knowledge of those patterns to make distinctions.   Machines are no longer restricted to acting on highly structured metadata.  Algorithms can identify patterns in any kind of data, structured or unstructured.
  • Organisations will continue to have multiple databases. Some algorithms might use data from one database to manage data in another (for example you might use information taken from job descriptions in an HR database to assist algorithms identifying important business emails in an email system).
  • The volume and velocity of documentation and data will continue to rise, as AI algorithms generate content (for example by automated replies or automated chat bots) as well as help manage it.
  • Algorithms, like humans, tend to understand data best when they view it in the context of its originating application. Email is best understood within email systems, or within repositories that can replicate the structure and functioning of email systems. There is no longer any necessity to move content out of one structured database (such as an email system) into another system.
  • Organisations will have the technical possibility of having one overall structure/schema for organising records.  But this dream is likely to remain elusive due to the fact that data created within a structured dataset is usually much more meaningful and manageable within the structure of that dataset than it would be outside of it.  Algorithms will be used more often to make data in a dataset manageable than to break data out of its original dataset to manage it through an alternative structure.
  • AI brings with it some possibilities that humans have never had before.  For example the possibility to restructure an entire records system to enable access and retention rules to be applied to a completely different set of aggregations than were present when individual action officers created or received the documentation.  Learning whether (and if so how and when) to use this capability will be a challenge for the recordkeeping profession.

CONCLUSIONS

Records management in an era dominated by structured data

The rise of structured data poses a challenge to records management theory.  This theory has, for the most part, been based on the assumption that the majority of records (including correspondence and other types of documentation) are created as free standing objects (unstructured data) that move independently of any structure and therefore need at some point to be integrated into a structure.

This theory needs to be refined to enable it to adapt to the reality that since the digital revolution even correspondence is created and shared within a structured database. Such a theory would de-emphasise the importance of building record structures to integrate records into (because most records, including all email correspondence, are created within a database that already has a structure and schema).  It would instead emphasise the importance of establishing a defensible, pragmatic and consistent basis for the application of retention and access rules across the different structures and schemas of the different datasets of the organisation.

AI and the possibility to re-structure and re-aggregate entire records systems

The most far-reaching change of the AI revolution is that the ability to re-organise all the items in a record system is for the first time unconstrained by the original metadata schema of that system.   A records management/information governance team will in theory have the ability to use any relevant classification logic (any classification scheme that bears any relation to the content of the record system) to re-aggregate content in a record system.   The team will be able to apply retention and access rules through those new aggregations.  The re-aggregation could be carried out at any time in the existence of the system (meaning that a record might be reassigned to a new aggregation for governance purposes one second, one day, one month, one year, one decade or one century after its creation or receipt).
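To make the idea of re-aggregation concrete, here is a minimal sketch. The records, the classification labels and the retention periods are all invented, and a simple keyword lookup stands in for what would in practice be a trained model.

```python
from collections import defaultdict

# Hypothetical retention periods in years, keyed by the new classification.
RETENTION_YEARS = {"contracts": 7, "recruitment": 2, "unclassified": 1}

# Invented records; "account" is the aggregation they were originally created in.
RECORDS = [
    {"id": "e1", "account": "alice", "text": "Draft contract for the supplier"},
    {"id": "e2", "account": "alice", "text": "Interview panel for the vacancy"},
    {"id": "e3", "account": "bob", "text": "Signed contract attached"},
    {"id": "e4", "account": "bob", "text": "Cake in the kitchen"},
]


def classify_by_activity(record):
    """Stand-in for a trained model: a keyword lookup, for illustration only."""
    text = record["text"].lower()
    if "contract" in text:
        return "contracts"
    if "interview" in text or "vacancy" in text:
        return "recruitment"
    return "unclassified"


def reaggregate(records, classifier=classify_by_activity):
    """Group record ids by the classifier's label, ignoring the original aggregation."""
    groups = defaultdict(list)
    for record in records:
        groups[classifier(record)].append(record["id"])
    return dict(groups)


def retention_by_record(records, classifier=classify_by_activity):
    """Apply retention through the new aggregations rather than the original ones."""
    return {r["id"]: RETENTION_YEARS[classifier(r)] for r in records}


print(reaggregate(RECORDS))
print(retention_by_record(RECORDS))
```

Note that the "account" field plays no part in either function: emails from two different accounts end up in the same "contracts" aggregation, which is exactly the kind of outcome their creators might not have anticipated.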

This poses two fundamental questions for records management theory and practice:

  • What are the implications of the ability to re-classify, re-aggregate and/or re-label all the items in a record system for a profession whose aim has traditionally been to build and maintain governance regimes for information based on predictable access permissions and predictable retention rules being applied to predictable aggregations of records?
  • What are the implications of the possibility to assign retention and access rules to aggregations that did not exist when the records were originally created and received, and that the creators/recipients of the records would not have envisaged being used to apply access and retention rules?

To make these questions more concrete let us think of them in relation to email – the great unsolved records management challenge brought by the digital revolution.

In email systems correspondence is aggregated into email accounts and access permissions are applied to correspondence via those accounts.   AI opens up three options for the application of retention rules and access permissions to emails:

  • Ignore the existing structure/schema and bypass email accounts: use AI to re-aggregate email correspondence (for instance by applying a corporate records classification) so that access permissions and/or retention rules are no longer applied via email accounts but instead through the records classification.
  • Stick with the existing structure/schema: use AI to make email accounts more manageable by identifying trivial, personal and sensitive emails within email accounts.
  • Use the existing structure and schema as a starting point, enhancing email accounts and then moving beyond them: use AI to classify emails within email accounts by business activity, but continue to use email accounts as the main aggregation for the application of access permissions.   As individuals get used to the classification of their email by machine against business activity, they could be given the option of opening up access to selected colleagues to correspondence of selected activities within their email account.

The first approach is high risk, the second approach is low benefit.  The third approach offers the possibility of providing benefits to individual email account users and their colleagues through incremental change.

We should be looking for what Dave Snowden might call ‘safe-fail’ approaches to the introduction of AI.  In such approaches machine learning classifications are first introduced alongside (or within) existing structures, then gradually begin to become more influential over the application of access permissions and retention rules as confidence in the machine learning process grows.
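One way to read this incremental approach in code is as a ‘shadow’ period: the machine's label is recorded alongside the existing one, and only starts to govern once enough human-checked samples agree with it. The sketch below is my own interpretation, and the minimum agreement rate and sample size are hypothetical figures.

```python
def agreement_rate(reviewed):
    """Share of reviewed items where the model's label matched the human decision.

    reviewed: list of (model_label, human_label) pairs.
    """
    if not reviewed:
        return 0.0
    matches = sum(1 for model_label, human_label in reviewed if model_label == human_label)
    return matches / len(reviewed)


def effective_label(existing_label, model_label, reviewed,
                    min_agreement=0.95, min_reviewed=100):
    """Return the label that governs the record, plus the mode in force.

    While evidence is thin or agreement is low, the existing structure governs
    and the model's label is only recorded alongside it ("shadow" mode).
    """
    trusted = (len(reviewed) >= min_reviewed
               and agreement_rate(reviewed) >= min_agreement)
    return (model_label, "model") if trusted else (existing_label, "shadow")


# Early on: only ten checked samples, so the email account still governs.
early = [("business", "business")] * 10
print(effective_label("account:alice", "activity:contracts", early))

# Later: 100 checked samples with 98 in agreement, so the model's label governs.
later = [("business", "business")] * 98 + [("business", "personal/trivial")] * 2
print(effective_label("account:alice", "activity:contracts", later))
```

If agreement later drops, the same check moves the record back under the existing structure, so a failure of the model is contained rather than catastrophic.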

The theories and explanations outlined in this article have been developed during the course of my Loughborough University doctoral research project, which is looking at archival policy towards email from a realist perspective.  A paper from this project, ‘The defensible deletion of government email’, was published by the Records Management Journal in March 2019.  An open access version of this paper is available from Loughborough University’s digital repository (once in the repository click on ‘download’ to download the PDF, or read it in the window provided).

Automation and its implications for archival policy towards email

This is the text of a talk I gave in London on 26 September 2019 to the UK Government Knowledge and Information Network.   I have revised and extended the text.

Think of all the correspondence moving into, out of and around your organisation.

Think of the structure or schema into which you would like all important items of business correspondence to be assigned so that they can be found and managed. Think of the records system that the structure/schema sits in.

Who would you like to file important items of correspondence into that structure/schema: humans or machines?

Trial no 1:  humans versus machines that can learn

Imagine you set up a trial:

  • you tell every member of staff to file important pieces of correspondence into your records system with your preferred structure/schema;
  • in parallel you set up a group of machines to look at all the correspondence coming in and out, to select important correspondence and file it into the same structure/schema as the humans.

Who would you like to win this trial: the humans or the machines?

Who would you expect to win the trial?

Most of us in the records and information management professions would want the machines to win.  If the machines win they take the filing workload off the shoulders of our colleagues.  This frees our colleagues up to focus on the job they were employed for.

We would expect the machines to win provided that:

  • the machines were capable of learning a fairly complex structure;
  • there was a feedback loop between humans and machines so that the machines had their mistakes pointed out to them;
  • the machines were learning machines that could adjust their algorithms in response to feedback;
  • the trial ran long enough for the machines to improve after many iterations.
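The feedback loop in these conditions can be illustrated with a deliberately tiny learner that changes its word weights only when a human tells it that it got something wrong. This is a perceptron-style sketch of my own, reduced to a binary important/not-important question rather than a complex filing structure, and the example emails are invented.

```python
from collections import defaultdict


def features(text):
    """Represent an email as the set of lower-cased words it contains."""
    return set(text.lower().split())


class FeedbackLearner:
    """A tiny perceptron-style classifier that adjusts only when corrected."""

    def __init__(self):
        # word -> weight; positive weights lean towards "important"
        self.weights = defaultdict(float)

    def predict(self, text):
        score = sum(self.weights.get(word, 0.0) for word in features(text))
        return "important" if score > 0 else "not_important"

    def feedback(self, text, correct_label):
        """Apply human feedback; return True if the machine had made a mistake."""
        if self.predict(text) == correct_label:
            return False  # prediction was right, nothing to adjust
        step = 1.0 if correct_label == "important" else -1.0
        for word in features(text):
            self.weights[word] += step
        return True


learner = FeedbackLearner()
print(learner.predict("ministerial submission attached"))       # untrained guess
learner.feedback("ministerial submission attached", "important")
learner.feedback("lunch attached", "not_important")
print(learner.predict("ministerial submission attached"))
print(learner.predict("lunch attached"))
```

Each correction shifts the weights a little, which is why the last condition matters: the machine only improves if the trial runs long enough for many such corrections to accumulate.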

We do not yet have the automation necessary to assign correspondence routinely to a node in the kind of complex multi-level corporate wide taxonomy/fileplan/retention schedule that records managers like to use to manage records.

The nature of automation projects currently being undertaken

The type of automation projects we are seeing in information management at the time of writing are mainly based on binary questions:

  • The legal world has been making progress with predictive coding projects that seek to use machine learning to answer the binary question ‘is this content likely to be responsive to a specific legal dispute?’;
  • In the US NARA’s Capstone policy has motivated some US Federal Agencies to use machine learning to answer the binary question ‘is this email needed as a record?’, and a similar project is being undertaken by the Nationaal Archief of the Netherlands (their report is in Dutch);
  • The Better Information for Better Government programme run by the UK Cabinet Office will shortly set up a project to develop an artificial intelligence tool that can distinguish important from non-important government emails (see the call for expressions of interest they issued in August);
  • Graham MacDonald has worked on a process for using automation to support the sensitivity review of records by using machine learning to predict whether or not any particular document is likely to be covered by one of the UK’s Freedom of Information exemptions (see his thesis).

We are going to be able to deploy machines sooner if we can find binary questions for them to resolve than if we wait until machines can assign content to nodes within complex multi-level taxonomies/fileplans/retention schedules.

The records management demands we make of human beings

For most of the twentieth century human beings succeeded in filing correspondence into what were often very sophisticated filing structures.   In the twenty first century this no longer holds true.   In the twentieth century humans filed correspondence because the correspondence had to be filed by humans.   In the twenty first century email correspondence has been filed automatically by the automation built into email systems.  Any injunction to civil servants asking them to move email correspondence into another system is in effect asking them to re-file that correspondence.

The automation built into email systems

The automation built into the proprietary email systems rolled out in the mid to late 1990s was not machine learning.  The machines in proprietary email systems could not learn; all they could do was follow rules.   Even now, two decades later, proprietary email systems only assign correspondence into a very simple structure and schema.

The reaction of the archives and records management community when email systems were introduced was to point out (quite rightly) the records management deficiencies of a system that aggregates correspondence into individual email accounts and does not distinguish between business correspondence and personal/trivial correspondence.  With some exceptions  (notably NARA in the US), the records and information management community has not accepted the structure of email systems as being a viable filing structure and in many administrations (including that of the UK) we have continued to ask human beings to re-file important items of correspondence into separate systems.

Trial no 2: humans versus machines that cannot learn

To go back to the idea of a trial with which I started this talk, we have for the past two decades been pitting human beings against machines:

  • the humans have been asked to file important items of correspondence into a preferred records system which houses our preferred records structure/schema;
  • the machines  (in the shape of email systems) have been configured to file correspondence into a simple structure that is inferior for records management purposes.

Who do you want to win this trial?   The automated filing or the human filing?

From a records/information management point of view, would you want the machines to win on the grounds that:

  • they take the workload off the shoulders of our colleagues;
  • the filing is very predictable and consistent;
  • the filing is instantaneous?

Or would you want the humans to win because they would be filing into a structure that permits a more precise application of retention rules and access rules?

Who do you think would win such a trial?

In theory the humans have more chance of winning this second trial than they did of winning the first trial.   The human filing could prevail if the human beings in the organisation found the records structure/schema so beneficial that they would be prepared:

  • to make the extra effort to file correspondence into the designated records system;
  • to use the designated records system, rather than their email account, as their main source of reference for their own correspondence;
  • to forego the possibility of simply relying on the inferior structure into which the email systems had filed the correspondence.

However, even when officials do highly value the records structure/schema, there is still a strong possibility that the machine filing will prevail.  I remember when email systems were introduced into UK government in the mid-1990s.  Government departments and the civil servants in them valued the then record systems of their organisations (hard copy registered file systems) very highly.   Everyone at the time wanted the registered file systems to survive and to make an ordered transition to the electronic world.  But within five years of the general introduction of email in UK government all of those registered file systems were in tatters, with no replacement systems in place.  The introduction of email destroyed those systems.

Why did the automated filing of email systems into a simple structure overcome the value that UK civil servants placed on the much more sophisticated structure of their registered filing systems?

The crucial advantage that the machines (email systems) had was speed.  They filed correspondence instantaneously.   The automated filing by email systems provided officials with instant access to their correspondence from the moment it left the sender’s account.   This acted to accelerate the velocity of correspondence, which in turn increased the volume of items exchanged, which in turn increased the number of items to be re-filed by the human beings.

The introduction of email increased correspondence volumes exponentially and therefore made it to all intents and purposes impossible to have human beings re-file correspondence into a complex corporate structure. In other words the machines moved the goalposts.  And won the game!  

To put it more simply:

  • human filing is a viable option when there is a low volume and low velocity of correspondence exchange;
  • if the velocity and volume of business correspondence increase exponentially then the human resource to refile it does not scale (not within public sector budgets anyway!). 
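
The scaling argument can be made concrete with a toy calculation.  The figures below are invented purely for illustration (they are not measurements from any department), and the function name is my own.  The only point is that a fixed human refiling capacity covers a shrinking share of correspondence as the volume grows.

```python
def share_refiled(items_received: int, refile_capacity: int) -> float:
    """Fraction of received items a person can re-file with a fixed daily capacity."""
    if items_received <= 0:
        return 1.0  # nothing to re-file, so nothing is left behind
    return min(1.0, refile_capacity / items_received)


# Hypothetical assumption: one official can re-file 20 items a day.
CAPACITY = 20

# Each doubling of daily volume halves the share that gets re-filed.
for daily_items in (10, 20, 40, 80, 160):
    print(daily_items, f"{share_refiled(daily_items, CAPACITY):.1%}")
```

Increasing the human capacity only moves the point at which the share starts to fall; it does not change the shape of the curve.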

 

Machine filing versus human filing – the experience of the past twenty years

The experience of UK government in relation to email over the past twenty-five years can be divided into three phases.

In the first phase (c 1995 to c 2003) human beings (civil servants) were asked to print important pieces of correspondence out and place them onto registered files whilst machines (email systems) filed correspondence into email accounts.


In the second phase (c 2003 to c 2010) civil servants were asked to file correspondence into electronic records and document management systems whilst machines (email systems) filed correspondence into email accounts.


In the third phase civil servants were asked to file correspondence into collaborative systems (such as Microsoft’s SharePoint) whereas machines (email systems) continued to file correspondence into email accounts.


 

Over the course of this twenty to twenty-five year period progress has been made in the systems into which we have been asking our colleagues to file.  We have moved from hard copy to electronic systems; we have moved from electronic records management systems with clunky corporate fileplans to more user-friendly collaborative systems.   But the result has been the same in all three phases.   In each phase a pitifully low percentage of business correspondence has been moved from email accounts into the record system concerned.  The automated filing of email into email accounts has always defeated attempts to persuade humans to get into the habit of re-filing their important correspondence somewhere else.

The policy dilemma posed by the automated filing built into email systems

Email systems have, over the past two decades, used a primitive form of rules-based automation to file emails into a simple structure/schema.  This has caused a policy dilemma:

  • email systems file email correspondence efficiently, routinely and predictably into email accounts BUT the organisation of correspondence into individual email accounts results in an inefficient and imprecise application of retention and access rules to correspondence;
  • in contrast human beings are able to re-file important items of correspondence into a structure that enables retention and access rules to be applied more precisely BUT they are likely to do this infrequently and haphazardly.

The policy dilemma exists in part because records management best practice does not tell us which of the following two policy imperatives is more important:

  • the consistent capture of correspondence into a structure/schema; OR
  • a structure/schema that supports the precise application of retention and access rules.

Records management best practice does not help us choose between these two competing imperatives because records management best practice wants both!  Records management best practice requires the consistent capture of correspondence into a structure/schema that supports the precise application of retention and access rules.

We are faced with two imperfect options.  We should choose the least imperfect.  The least imperfect option is the option whose weaknesses we are most likely to be able to correct at a future date.

We are working in a period of transition, and the transition is towards the ever greater use of ever more powerful automation, analytics and machine learning.  If the present rate of progress with machine learning/artificial intelligence is maintained then we can predict that:

  • in the medium term originating bodies will be able to deploy machines to answer binary questions that would help to mitigate the worst faults of email accounts:  namely to distinguish important from trivial mail, and personal from business mail;
  • in the long term originating organisations will be able to deploy machine intelligence to re-file correspondence into any order that they choose.
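
A minimal sketch of what answering two binary questions per message might look like is given here.  The keyword rules are crude stand-ins for a trained classifier, and every name in it (Email, is_personal, is_important, triage, the marker lists) is hypothetical rather than taken from any real product.

```python
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    subject: str
    body: str


# Hypothetical marker lists; a real tool would learn these signals from examples.
PERSONAL_MARKERS = ("lunch", "birthday", "school run")
TRIVIAL_MARKERS = ("out of office", "newsletter", "room booking")


def _contains_any(text: str, markers: tuple) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


def is_personal(email: Email) -> bool:
    """Binary question 1: is this email personal, yes or no?"""
    return _contains_any(email.subject + " " + email.body, PERSONAL_MARKERS)


def is_important(email: Email) -> bool:
    """Binary question 2: is this email important, yes or no?"""
    return not _contains_any(email.subject + " " + email.body, TRIVIAL_MARKERS)


def triage(email: Email) -> str:
    """Combine the two yes/no answers into one of three outcomes."""
    if is_personal(email):
        return "personal"
    return "business-important" if is_important(email) else "business-trivial"
```

The design point is that neither question requires a taxonomy or fileplan: each is a yes/no decision that leaves the email where it is, inside the account.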

 

Factoring the future of machine learning into present-day policy decisions

If and when we reach a point at which machine learning tools can file correspondence into any order that an organisation wishes then our policy dilemma will be resolved – we will at that point be able to consistently assign correspondence to any taxonomy, records classification and/or retention schedule that an organisation chooses.  We would also, one presumes, be able to run the machine learning over legacy correspondence and assign that correspondence to the same taxonomy/records classification/retention schedule.  We can anticipate that:

  • future machine learning tools will be able to retrospectively correct the weaknesses in the structure/schema of any email accounts that survive;
  • future machine learning tools will only be able to retrospectively correct the weaknesses in the capture of email into corporate collaboration systems/electronic records management systems if important email accounts survive.

This logic dictates that we should give a high priority now to ensuring that historically important email accounts survive in the confident hope that we will later be able to correct weaknesses and inefficiencies in the content of these accounts and in the structure and schema of those accounts.

This would require some form of protection being introduced now for the email accounts of officials playing important roles.  Business correspondence residing in the email accounts of important UK government officials does not currently enjoy any protection.    UK government departments subject email in email accounts to some kind of scheduled deletion.  The most common form of scheduled deletion is to delete the content of email accounts shortly after an individual leaves post.  This practice complies with the National Archives’ policy towards UK government email, because each department asks its officials to move important email out of email accounts to some form of corporate records system.  However, the unintended consequence of this policy is that most business correspondence ends up being subject to this deletion.

Affording some protection to the email accounts of officials occupying important roles can be seen as a protect now – process later approach.

This protect now – process later approach involves protecting historically important email accounts in the knowledge that machines are good at dealing with legacy and can at a later date be deployed to filter these records, enhance the metadata and/or overlay an alternative structure on to these records.

Such an approach would no longer require individuals to move important emails to a separate system for recordkeeping purposes (though there may well continue to be circumstances when an organisation for knowledge management/operational purposes requires some teams/areas to move important correspondence out of email systems, or seeks to divert correspondence away from email into other communication channels).

This approach is based on the realisation that deploying human effort to do something (badly) that machines are likely to be able to do (well) at a later date does not make sense in terms of either effectiveness or efficiency.
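
The protect now – process later rule can be sketched as a small change to a scheduled deletion routine.  This is an illustration under stated assumptions: the 90 day grace period, the Account fields and the function name are all hypothetical, and deciding which roles count as historically important remains a human appraisal decision that the code simply takes as an input.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed gap between an official leaving post and their account being deleted.
DELETION_GRACE = timedelta(days=90)


@dataclass
class Account:
    holder: str
    left_post: Optional[date]  # None while the holder is still in post
    protected_role: bool       # True for roles judged historically important


def due_for_deletion(account: Account, today: date) -> bool:
    """Scheduled deletion applies only to unprotected accounts of leavers."""
    if account.protected_role:
        return False  # protect now; filtering is left to later tooling
    if account.left_post is None:
        return False  # holder still in post
    return today >= account.left_post + DELETION_GRACE
```

Nothing else about the deletion schedule changes: unprotected accounts are still deleted on the same timetable as before.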

GDPR implications of a protect now – process later approach

The implication of protecting important email accounts from deletion whilst working on the development of machine learning capabilities is that some personal correspondence is likely to be retained alongside historically important correspondence.  This has data protection implications.

GDPR allows the archiving of records containing personal data provided that the preservation of the records is in the public interest, and provided that necessary safeguards are in place and the data protection rights of data subjects are respected.   The retention of the work email account of an important official is likely to be in the public interest, and is likely to be compliant with data protection law provided the following conditions are met:

  • the role that the individual played was of historic interest;
  • the individual could expect their account to be permanently preserved;
  • the individual was given the chance to flag or remove personal correspondence;
  • access to personal correspondence was prevented except in case of overriding legal need;
  • items of correspondence that are primarily personal in nature are removed once a reliable capability to identify them becomes available.

 

Conclusion

This talk recommends that government departments which use email as their main channel of communication refrain from automatically deleting correspondence from the email accounts of their most important staff, pending the development of automated tools to process the correspondence within those accounts.   In practice this is likely to only involve protecting around 5% of their email accounts (using the old archival rule of thumb that 5% of the records of an originating body are likely to be worthy of permanent preservation).

This is not an easy sell to make to government departments.  Even though the recommendation only covers around 5% of their email accounts, departments may well feel that these are the 5% that carry the highest potential reputational/political risk, and are the 5% most likely to attract freedom of information requests. 

Making such a recommendation is in no sense ‘giving up’ on the records management ambition to have business correspondence consistently assigned to structures and schema that support the use and reuse of correspondence and that support the precise application of retention and access rules.   It is simply a recognition that asking civil servants to select and move important email into a separate system has not worked for twenty years and shows no sign of working any time soon.  It is also a recognition that we need automated tools to process the material that has been automatically filed by email systems.

Most important of all, this approach of protecting important email accounts gives us a pathway for applying automated solutions to email.  It would provide an incentive and an opportunity to deploy tools that work on a binary logic (‘is this email important, yes or no?’,  ‘is this email personal, yes or no?’) to mitigate the worst flaws of email accounts from an information management point of view.  These tools are not pie in the sky; they are already being used in real-life projects.  The hope would also be that in the long term we may have tools that go beyond binary questions and could assign individual emails to a reasonably granular records classification, taxonomy and/or retention schedule.

 

 

The theories and explanations outlined in this talk have been developed during the course of my Loughborough University doctoral research project, which is a realist evaluation of archival policy towards UK government email.   A paper from this project, ‘The defensible deletion of government email’, was published by the Records Management Journal in March 2019.  An open access version of this paper is available from Loughborough University’s digital repository here (once in the repository click on ‘download’ to download the pdf,  or read it in the window provided).

James Lappin

 

 

 

 

 

 

The SharePoint records retention model in Office 365

At an IRMS public sector group meeting in London yesterday I heard Rob Bath compare the records retention model offered within on-premise SharePoint with that offered in Office 365.

Rob described the two models available in on-premise SharePoint:

  • the record centre model in which important items are moved to a records centre – the disadvantage of this model is that it rips content out of context by taking it from the SharePoint sites in which users had interacted with it;
  • the in-place record management model in which end users can click a button to identify individual items as records – the disadvantage of this model is that SharePoint gives no reporting capability for information managers to see and manage the items scattered across their SharePoint implementation that have been declared as records.

Both the record centre model and the in-place model are still available within SharePoint Online in Office 365.  However Office 365 offers another way of managing the retention of SharePoint Online content.  This new method does not sit within SharePoint Online itself.  It sits in the Office 365  Security and Compliance centre which exists to provide a means of managing content across the Office 365 family of applications.

The Security and Compliance centre provides a facility to set up retention labels and retention policies:

  • retention labels can be applied to containers within SharePoint sites such as libraries or folders;
  • retention policies can be applied at SharePoint site level.

Rob’s conclusion was that the retention labels/retention policies model offered in the Office 365 Security and Compliance centre was both a simpler and a more effective way of managing SharePoint content than the two models available within SharePoint itself.  One member of the audience asked him whether there were any circumstances in which he would recommend the use of either the record centre model or the in-place model in place of (or in conjunction with) the retention labels/retention policies model within Office 365.  Rob thought for a second and then said ‘no’.

In the rest of this post I will offer some thoughts on why it is that after so many years of coming up with such unwieldy retention offerings in SharePoint, Microsoft have come up with something so much better for Office 365.

The need for Office 365 to have a retention model that went beyond SharePoint

Microsoft wanted a records retention model for Office 365 that was not unique to any one particular Office 365 application, but which could be applied to all of the major applications within the Office 365 family.  This forced them to come up with a model that was not based on any features specific to SharePoint itself.  In particular it led them to move away from the linkage between records retention and SharePoint content types.

The need to come up with a model that could be applied to applications as diverse as SharePoint Online, the Exchange Online email system, and the OneDrive filesharing application meant that Microsoft had to look for a common denominator between the different applications.  The one common denominator between the applications is that they all aggregate content.

The records retention model in Office 365

The retention model in Office 365 is very simple.   For each application a fundamental aggregation is identified.   In Exchange it is the email account.  In SharePoint it is the site.  In OneDrive it is the OneDrive account, etc. The Office 365 Security and Compliance centre allows you to use the fact that all content in SharePoint is held within sites, all content in Exchange is held within email accounts etc. to apply your retention rules.

The Security and Compliance Centre offers essentially two different strategies for linking retention rules to content:

  • The most basic approach is that you apply retention at the level of the fundamental aggregation.   You set up retention policies in the Security and Compliance Centre and you identify which SharePoint sites (and/or which Exchange email accounts, which OneDrive accounts etc.) you wish to apply each policy to.
  • A more sophisticated approach is that you manage retention at a level below the fundamental aggregation.  In this model you set up your retention rules as retention labels (rather than retention policies) in the Security and Compliance centre, and then you target the rules at the SharePoint sites/Exchange email accounts etc. in which you want them to be available for users to apply to content.
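
The difference between the two strategies can be shown with a toy data model.  This is only a sketch of the concepts described above: the class and function names are hypothetical, the retention periods are invented, and none of it corresponds to an actual Office 365 API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class RetentionRule:
    name: str
    retain_years: int


@dataclass
class Container:
    """A fundamental aggregation: a SharePoint site, an email account, a OneDrive account."""
    kind: str
    name: str
    policy: Optional[RetentionRule] = None                               # strategy 1
    labels_available: List[RetentionRule] = field(default_factory=list)  # strategy 2


def apply_policy(rule: RetentionRule, containers: List[Container]) -> None:
    """Strategy 1: the rule governs everything held in each chosen container."""
    for container in containers:
        container.policy = rule


def publish_label(rule: RetentionRule, containers: List[Container]) -> None:
    """Strategy 2: the rule is made available inside each container for use on content."""
    for container in containers:
        if rule not in container.labels_available:
            container.labels_available.append(rule)


# Illustrative usage with invented names and periods.
mailbox = Container("email account", "finance.director")
site = Container("SharePoint site", "Finance")
apply_policy(RetentionRule("Senior correspondence", 20), [mailbox])
publish_label(RetentionRule("Invoices", 7), [site])
```

In the first strategy the decision is made once per container; in the second the container only determines which rules are on offer, and a further step is needed to attach a rule to content.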

Applying retention policies and/or retention labels to content in Exchange Online and in SharePoint Online 

It is possible to use different strategies to apply retention labels/policies in different ways in different Office 365 applications.

My preference for email is to set a retention policy on email accounts based on the business value of the correspondence (which will vary according to the role played by the individual email account holder), and in addition to allow users to use a retention label to flag up personal correspondence so that it can be subject to a shorter retention period.

After his talk I asked Rob Bath whether he preferred to apply retention policies or retention labels in the SharePoint environment.  He said that he thought that a SharePoint site was typically too big an aggregation to apply one retention policy to.  He prefers to apply retention labels to libraries and folders within sites.  The way to do this is to set up your retention rules as retention labels and then identify for each label the SharePoint sites that it is relevant to.  The result will be that most SharePoint sites will only have a small number of retention labels available to manage content within them. In the process of creating any new library or new folder the library/folder can be configured to apply one of those retention labels to all content stored within it (see here).

It is important to ignore Microsoft’s pushing of retention labels as a tool for end-users to tag individual items in SharePoint.  There is no incentive for an end user to choose a retention period for an individual document.  In general it is not good practice to ask end users to do something you know they will not do.   In the SharePoint environment you should either:

  • Set retention labels as default on libraries or folders; OR
  • Set retention policies as default on SharePoint sites (do this if setting retention labels on libraries/folders is not possible, perhaps because your SharePoint installation is too big, your number of information governance staff is too low, your roll-out schedule is too short, or you are applying retention labels/policies in a legacy environment).
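
The two defaults just listed can be combined in a simple lookup, sketched here.  The Site class, its fields and the rule names are hypothetical, and the precedence shown (library default first, site policy as the fallback) is my own simplification for illustration; it is not a description of how Office 365 itself resolves overlapping rules.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Site:
    name: str
    site_policy: Optional[str] = None                             # fallback for the whole site
    library_labels: Dict[str, str] = field(default_factory=dict)  # library -> default label


def rule_for(site: Site, library: str) -> Optional[str]:
    """Prefer the library's default label; otherwise fall back to the site policy."""
    return site.library_labels.get(library, site.site_policy)


# Illustrative usage: one library has a default label, the rest rely on the site policy.
finance = Site(
    name="Finance",
    site_policy="Retain 7 years",
    library_labels={"Contracts": "Retain 12 years"},
)
print(rule_for(finance, "Contracts"))
print(rule_for(finance, "Team notes"))
```

In neither case is the end user asked to choose a retention period for an individual document.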

 

 

Predictions for the application of machine learning to the management of email

Last year I gave two presentations, one to the DLM Forum Triennial and one to the IRMS conference, in which I developed a fictional case study of an organisation that decides to apply machine learning and analytics to email.

In my case study a public sector organisation:

  • is concerned about the low capture of email into its record system (SharePoint) and embarks on a programme to apply machine learning to remedy the shortfall;
  • uses machine learning to apply its existing policy of moving important email into a records system;
  • seeks to apply the machine learning capability on all email accounts on a corporate wide basis.

In reality I think that the attitude of public sector organisations to the application of analytics and machine learning to email will be rather different to the attitude taken by the organisation in my case study.   My predictions are that public sector organisations in the UK:

  • will be reluctant to apply machine learning to email accounts because of the risks involved;
  • will be just as concerned about the prospect that the application of machine learning might result in very large volumes of email being captured into their record system as they would be about the existing under-capture of emails as records;
  • would use machine learning (or an analytics capability) to look for certain specific types of correspondence that are valuable to the organisation in certain specific accounts rather than applying machine learning/analytics to all accounts across the business;
  • would not move emails identified as important or valuable into their corporate records system, but would instead leave them within email accounts and either place them under a hold to prevent deletion, or move them to an email archive.

Here is the video of my monologue explaining how the organisation applying machine learning to all of its email accounts got on:

 

Managing email in Office 365

What is an email account in Office 365?   It is a special type of document library, that doesn’t need version control, doesn’t need extra metadata fields and doesn’t live in SharePoint.

It now seems a little incongruous for organisations to ask their staff to move important emails out of email accounts and into a ‘corporate record system’.  If Office 365 is your corporate record system then email accounts are within it already!

One of the things that Microsoft had to do in order to make Office 365 work as a service offering was to get their SharePoint team working with their Exchange team – something that famously never happened whilst both products were predominantly on-premise offerings.  Microsoft customers implementing on-premise SharePoint alongside their on-premise Exchange email system had to deploy third-party plug-ins if they wanted staff to be able to drag and drop an email into a SharePoint document library without leaving their Outlook email client.

There are two routes Microsoft could have gone with the relationship between Exchange and SharePoint within Office 365:

  • the integration route – building in features that make it easier to move emails from Exchange to SharePoint;
  • the governance route – making common governance features available so that emails in an email account could be governed using the same policies as documents in a document library in SharePoint.

Microsoft’s choice of direction for Office 365 has implications for the policy decisions that organisations need to take on email:

  • If Microsoft were to go down the integration route then it would fit in with the records management belief that an email system is not a ‘record system’ but is instead a ‘communications tool’.  Many organisations over the course of the past decade have designated SharePoint as their corporate records system and asked staff to move important emails into SharePoint.
  • If Microsoft were to go down the governance route then it would fit with the information governance belief that distinctions between record systems and non-records systems are meaningless and unhelpful because organisations are under legal, regulatory and ethical obligations to manage all their business information systems in accordance with information governance principles.

From a marketing point of view, there are clear advantages to Microsoft from going down the governance route rather than the integration route:

  • If Microsoft went down the integration route it would imply that they viewed a SharePoint document library as a better place to store business email than an Exchange email account. This is despite the fact that Exchange was built for and designed around the storage of email messages, whereas SharePoint document libraries were not designed with email in mind.
  • By going down the governance route Microsoft can stay neutral on the question of whether an email is better stored in SharePoint or in Exchange, and can gradually remove any necessity for organisations to move emails out of Exchange and into SharePoint.

It is therefore no surprise to see Microsoft putting their emphasis on the governance route rather than the integration route.

Office 365 comes with a ‘Security & Compliance Centre’ that sits separately from SharePoint or Exchange or any of the other component parts of Office 365.   The Security & Compliance Centre gives you two different means of applying retention rules to content:

  • retention policies which are applied to the containers within which content sits (SharePoint sites, email accounts etc.);
  • retention labels which are applied to individual items of content (emails/documents etc.).

This effectively gives you three alternative options for applying retention to email:

  • apply retention policies to email accounts without applying retention labels; OR
  • ask end users to apply retention labels to emails (or automate the application of labels if and when you develop automation capability), without applying retention policies; OR
  • use a combination approach by applying a default retention policy to email accounts whilst allowing staff (or machines!) to apply a retention label to particular emails that deserve a retention rule that differs from the default.

Note that in applying retention from the Security & Compliance Centre to content in OneDrive, Office 365 groups or SharePoint you will be faced with variations of the three options listed above.   The variation relates to the type of container that you would be applying retention policies to, and the type of content that you would be applying retention labels to.

The fact that Microsoft allows an email account to be treated in the same way as a document library for retention purposes will not stop organisations wanting to apply different retention periods to email accounts than to document libraries even when they arise from a similar business function.  The cost and risk profile of an email account differs significantly from that of a document library.

However Office 365 is a game changer in two ways:

  • it brings the application of retention rules to email in email accounts firmly into the information governance domain, rather than the IT domain.  The retention policy and retention label menus in the Office 365 Security & Compliance centre can be used to apply retention policies and/or retention labels to Exchange email accounts and SharePoint sites (as well as other parts of Office 365 including Office 365 groups and OneDrive accounts);
  • it creates the possibility of applying different types of policy towards email. For example if you wanted to apply a Capstone policy towards email you could do so out of the box in Office 365 by simply:
    • setting two retention policies on email: a Capstone retention policy for application to the relatively small number of email accounts that you wish to retain permanently, and a non-Capstone retention policy for application to email accounts that you do not wish to retain permanently;
    • deploying retention labels to enable staff with Capstone email accounts to identify trivial and personal emails so that those emails are exempt from the permanent retention applied to the rest of the correspondence in their email account.
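
The logic of this Capstone set-up can be sketched in a few lines.  The retention periods and all names below are assumptions made for illustration, and the code models only the policy logic described above; it is not a representation of how Office 365 evaluates retention policies and labels internally.

```python
from dataclasses import dataclass
from typing import Optional

PERMANENT = None  # sentinel meaning "no disposal date"

# Assumed periods, for illustration only.
NON_CAPSTONE_YEARS = 7
FLAGGED_YEARS = 1
EXEMPTING_LABELS = ("personal", "trivial")


@dataclass
class Message:
    account_is_capstone: bool
    label: Optional[str] = None  # set by the account holder on a Capstone account


def retention_years(msg: Message) -> Optional[int]:
    """Return the years to retain, or PERMANENT (None) for Capstone correspondence."""
    if msg.account_is_capstone:
        if msg.label in EXEMPTING_LABELS:
            return FLAGGED_YEARS  # flagged items are exempt from permanent retention
        return PERMANENT
    return NON_CAPSTONE_YEARS
```

The account holder's only task is to flag the exceptions; everything left unflagged in a Capstone account inherits permanent retention by default.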