Automation and its implications for archival policy towards email

This is the text of a talk I gave in London on 26 September 2019 to the UK Government Knowledge and Information Network. I have revised and extended the text.

Think of all the correspondence moving into, out of and around your organisation.

Think of the structure or schema into which you would like all important items of business correspondence to be assigned so that they can be found and managed. Think of the records system that the structure/schema sits in.

Who would you like to file important items of correspondence into that structure/schema: humans or machines?

Trial no 1: humans versus machines that can learn

Imagine you set up a trial:

you tell every member of staff to file important pieces of correspondence into your records system with your preferred structure/schema;
in parallel you set up a group of machines to look at all the correspondence coming in and out, to select important correspondence and file it into the same structure/schema as the humans.

Who would you like to win this trial: the humans or the machines?

Who would you expect to win the trial?

[Image: 01-Competition]

Most of us in the records and information management professions would want the machines to win. If the machines win, they take the filing workload off the shoulders of our colleagues. This frees our colleagues to focus on the job they were employed to do.

We would expect the machines to win provided that:

the machines were capable of learning a fairly complex structure;
there was a feedback loop between humans and machines so that the machines had their mistakes pointed out to them;
the machines were learning machines that could adjust their algorithms in response to feedback;
the trial ran long enough for the machines to improve after many iterations.

However, we do not yet have the automation necessary to assign correspondence routinely to a node in the kind of complex, multi-level, corporate-wide taxonomy/fileplan/retention schedule that records managers like to use to manage records.

The nature of automation projects currently being undertaken

The automation projects we are seeing in information management at the time of writing are mainly based on binary questions:

The legal world has been making progress with predictive coding projects that seek to use machine learning to answer the binary question ‘is this content likely to be responsive to a specific legal dispute?’;
In the US, NARA’s Capstone policy has motivated some US Federal Agencies to use machine learning to answer the binary question ‘is this email needed as a record?’, and a similar project is being undertaken by the Nationaal Archief of the Netherlands (their report, in Dutch, is here);
The Better Information for Better Government programme run by the UK Cabinet Office will shortly set up a project to develop an artificial intelligence tool that can distinguish important from non-important government emails (see the call for expressions of interest they issued in August);
Graham MacDonald has worked on a process for using automation to support the sensitivity review of records by using machine learning to predict whether or not any particular document is likely to be covered by one of the UK’s Freedom of Information exemptions (see his thesis).

We are going to be able to deploy machines sooner if we can find binary questions for them to resolve than if we wait until machines can assign content to nodes within complex multi-level taxonomies/fileplans/retention schedules.
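To make the idea of a binary question with a human feedback loop concrete, here is a minimal sketch in Python. It is not the method used by any of the projects mentioned above: it is a toy perceptron-style learner, and the class name, the four example emails and their labels are all invented for illustration. The machine answers one yes/no question ('is this email needed as a record?') and adjusts its word weights only when a human reviewer tells it that it got the answer wrong.

```python
class BinaryEmailClassifier:
    """Toy learner for one yes/no question about an email (illustrative only)."""

    def __init__(self):
        self.weights = {}   # word -> weight, learned from feedback
        self.bias = 0.0

    def _tokens(self, text):
        # Crude tokenisation: lower-case words, duplicates ignored.
        return set(text.lower().split())

    def predict(self, text):
        score = self.bias + sum(self.weights.get(t, 0.0) for t in self._tokens(text))
        return score > 0

    def feedback(self, text, is_record):
        """A human supplies the right answer; weights change only after a mistake."""
        if self.predict(text) == is_record:
            return False                      # machine was right, nothing to learn
        step = 1.0 if is_record else -1.0
        for t in self._tokens(text):
            self.weights[t] = self.weights.get(t, 0.0) + step
        self.bias += step
        return True                           # machine was corrected


# Hypothetical training examples, labelled by an imagined human reviewer.
examples = [
    ("minister approved the budget submission", True),
    ("lunch on friday anyone", False),
    ("draft policy advice for the minister", True),
    ("leaving drinks on friday", False),
]

clf = BinaryEmailClassifier()
for _ in range(10):                           # many iterations of the feedback loop
    for text, label in examples:
        clf.feedback(text, label)
```

A real project would need far more training data, better features and proper evaluation. The point of the sketch is only that a single yes/no question gives the machine a small, well-defined target to learn, whereas a multi-level fileplan would require it to choose between thousands of possible nodes.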

The records management demands we make of human beings

For most of the twentieth century human beings succeeded in filing correspondence into what were often very sophisticated filing structures. In the twenty-first century this no longer holds true. In the twentieth century humans filed correspondence because the correspondence had to be filed by humans. In the twenty-first century email correspondence is filed automatically by the automation built into email systems. Any injunction to civil servants asking them to move email correspondence into another system is in effect asking them to re-file that correspondence.

The automation built into email systems

The automation built into the proprietary email systems rolled out in the mid to late 1990s was not machine learning. The machines in proprietary email systems could not learn; all they could do was follow rules. Even now, two decades later, proprietary email systems only assign correspondence into a very simple structure and schema.
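The difference between following rules and learning can be shown with a short sketch. The Python below is a deliberately simplified model of rules-based filing, not the implementation of any real email product; the `Message` class, the function names and the example addresses are hypothetical. The rule looks only at who sent and who received a message, and never at what the message is about.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    recipients: list
    subject: str


def file_by_rule(message, account):
    """Fixed rule: 'Sent' if the account holder sent it, 'Inbox' if they received it."""
    folders = []
    if message.sender == account:
        folders.append("Sent")
    if account in message.recipients:
        folders.append("Inbox")
    return folders


def deliver(message):
    """File one message into the account of every correspondent, instantly."""
    mailboxes = {}
    for account in [message.sender, *message.recipients]:
        mailboxes.setdefault(account, file_by_rule(message, account))
    return mailboxes


# Hypothetical addresses, for illustration only.
msg = Message("a@example.gov.uk", ["b@example.gov.uk"], "Budget submission")
filed = deliver(msg)
```

Because the rule ignores the subject and content entirely, a budget submission and a lunch invitation end up in exactly the same place. That is what makes the filing predictable and instantaneous, and also what makes the resulting structure so coarse.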

The reaction of the archives and records management community when email systems were introduced was to point out (quite rightly) the records management deficiencies of a system that aggregates correspondence into individual email accounts and does not distinguish between business correspondence and personal/trivial correspondence. With some exceptions (notably NARA in the US), the records and information management community has not accepted the structure of email systems as being a viable filing structure and in many administrations (including that of the UK) we have continued to ask human beings to re-file important items of correspondence into separate systems.

Trial no 2: humans versus machines that cannot learn

To go back to the idea of a trial with which I started this talk, we have for the past two decades been pitting human beings against machines:

the humans have been asked to file important items of correspondence into a preferred records system which houses our preferred records structure/schema;
the machines (in the shape of email systems) have been configured to file correspondence into a simple structure that is inferior for records management purposes.

[Image: 02-Inferior]

Who do you want to win this trial? The automated filing or the human filing?

From a records/information management point of view, would you want the machines to win on the grounds that:

they take the workload off the shoulders of our colleagues
the filing is very predictable and consistent
the filing is instantaneous?

Or would you want the humans to win because they would be filing into a structure that permits a more precise application of retention rules and access rules?

Who do you think would win such a trial?

In theory the humans have more chance of winning this second trial than they did of winning the first trial. The human filing could prevail if the human beings in the organisation found the records structure/schema so beneficial that they would be prepared:

to make the extra effort to file correspondence into the designated records system;
to use the designated records system, rather than their email account, as their main source of reference for their own correspondence;
to forego the possibility of simply relying on the inferior structure into which the email systems had filed the correspondence.

However, even when officials do highly value the records structure/schema there is still a strong possibility that the machine filing will prevail. I remember when email systems were introduced into UK government in the mid-1990s. Government departments and the civil servants in them valued the then record systems of their organisations (hard copy registered file systems) very highly. Everyone at the time wanted the registered file systems to survive and to make an ordered transition to the electronic world. But within five years of the general introduction of email in UK government all of those registered file systems were in tatters with no replacement systems in place. The introduction of email destroyed those systems.

Why did the automated filing of email systems into a simple structure overcome the value that UK civil servants placed on the much more sophisticated structure of their registered filing systems?

The crucial advantage that the machines (email systems) had was speed. They filed correspondence instantaneously. The automated filing by email systems provided officials with instant access to their correspondence from the moment it left the sender’s account. This acted to accelerate the velocity of correspondence, which in turn increased the volume of items exchanged, which in turn increased the number of items to be re-filed by the human beings.

The introduction of email increased correspondence volumes exponentially and therefore made it to all intents and purposes impossible to have human beings re-file correspondence into a complex corporate structure. In other words the machines moved the goalposts. And won the game!

To put it more simply

human filing is a viable option when there is a low volume and low velocity of correspondence exchange;
if the velocity and volume of business correspondence increase exponentially then the human resource to refile it does not scale (not within public sector budgets anyway!).
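The scaling argument can be illustrated with a few lines of arithmetic. The sketch below assumes, purely for illustration, that the volume of correspondence multiplies by a constant factor each year while the human capacity to re-file stays fixed. The function name and the figures are invented, not measurements.

```python
def refiled_share(initial_volume, growth_factor, capacity, years):
    """Share of each year's correspondence that a fixed human capacity can re-file."""
    shares = []
    volume = initial_volume
    for _ in range(years):
        shares.append(min(1.0, capacity / volume))
        volume *= growth_factor      # exponential growth in items exchanged
    return shares


# Hypothetical figures: capacity matches volume in year one, volume doubles yearly.
shares = refiled_share(initial_volume=100, growth_factor=2, capacity=100, years=4)
# shares == [1.0, 0.5, 0.25, 0.125]
```

Under these assumed figures the humans keep up in the first year and are re-filing only an eighth of the correspondence by the fourth. Keeping pace would mean doubling the filing effort every year.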

Machine filing versus human filing – the experience of the past twenty years

The experience of UK government in relation to email over the past twenty five years can be divided into three phases.

In the first phase (c 1995 to c 2003) human beings (civil servants) were asked to print important pieces of correspondence out and place them onto registered files whilst machines (email systems) filed correspondence into email accounts.

[Image: 03-registered files]

In the second phase (c 2003 to c 2010) civil servants were asked to file correspondence into electronic records and document management systems whilst machines (email systems) filed correspondence into email accounts.

[Image: 04-EDRM]

In the third phase civil servants were asked to file correspondence into collaborative systems (such as Microsoft’s SharePoint) whereas machines (email systems) continued to file correspondence into email accounts.

[Image: 05-Sharepoint]

Over the course of this twenty to twenty five year period progress has been made in the systems into which we have been asking our colleagues to file. We have moved from hard copy to electronic systems; we have moved from electronic records management systems with clunky corporate fileplans to more user-friendly collaborative systems. But the result has been the same in all three phases. In each phase a pitifully low percentage of business correspondence has been moved from email accounts into the record system concerned. The automated filing of email into email accounts has always defeated attempts to persuade humans to get into the habit of re-filing their important correspondence somewhere else.

The policy dilemma posed by the automated filing built into email systems

Email systems have, over the past two decades, used a primitive form of rules-based automation to file emails into a simple structure/schema. This has caused a policy dilemma:

email systems file email correspondence efficiently, routinely and predictably into email accounts BUT the organisation of correspondence into individual email accounts results in an inefficient and imprecise application of retention and access rules to correspondence;
in contrast human beings are able to re-file important items of correspondence into a structure that enables retention and access rules to be applied more precisely BUT they are likely to do this infrequently and haphazardly.

The policy dilemma exists in part because records management best practice does not tell us which of the following two policy imperatives is more important:

the consistent capture of correspondence into a structure/schema; OR
a structure/schema that supports the precise application of retention and access rules.

Records management best practice does not help us choose between these two competing imperatives because records management best practice wants both! Records management best practice requires the consistent capture of correspondence into a structure/schema that supports the precise application of retention and access rules.

We are faced with two imperfect options. We should choose the least imperfect. The least imperfect option is the option whose weaknesses we are most likely to be able to correct at a future date.

We are working in a period of transition, and the transition is towards the ever greater use of ever more powerful automation, analytics and machine learning. If the present rate of progress with machine learning/artificial intelligence is maintained then we can predict that:

in the medium term originating bodies will be able to deploy machines to answer binary questions that would help to mitigate the worst faults of email accounts: namely to distinguish important from trivial mail, and personal from business mail;
in the long term originating organisations will be able to deploy machine intelligence to re-file correspondence into any order that they choose.

Factoring the future of machine learning into present-day policy decisions

If and when we reach a point at which machine learning tools can file correspondence into any order that an organisation wishes then our policy dilemma will be resolved – we will at that point be able to consistently assign correspondence to any taxonomy, records classification and/or retention schedule that an organisation chooses. We would also, one presumes, be able to run the machine learning over legacy correspondence and assign that correspondence to the same taxonomy/records classification/retention schedule. We can anticipate that:

future machine learning tools will be able to retrospectively correct the weaknesses in the structure/schema of any email accounts that survive;
future machine learning tools will only be able to retrospectively correct the weaknesses in the capture of email into corporate collaboration systems/electronic records management systems if important email accounts survive.

This logic dictates that we should give a high priority now to ensuring that historically important email accounts survive in the confident hope that we will later be able to correct weaknesses and inefficiencies in the content of these accounts and in the structure and schema of those accounts.

This would require some form of protection being introduced now for the email accounts of officials playing important roles. Business correspondence residing in the email accounts of important UK government officials does not currently enjoy any protection. UK government departments subject email in email accounts to some kind of scheduled deletion. The most common form of scheduled deletion is to delete the content of email accounts shortly after an individual leaves post. This practice complies with the National Archives’ policy towards UK government email, because each department asks its officials to move important email out of email accounts to some form of corporate records system. However the unintended consequence of this policy is that most business correspondence ends up being subject to this deletion.

Affording some protection to the email accounts of officials occupying important roles can be seen as a protect now – process later approach.

This protect now – process later approach involves protecting historically important email accounts in the knowledge that machines are good at dealing with legacy and can at a later date be deployed to filter these records, enhance the metadata and/or overlay an alternative structure on to these records.

Such an approach would no longer require individuals to move important emails to a separate system for recordkeeping purposes (though there may well continue to be circumstances when an organisation for knowledge management/operational purposes requires some teams/areas to move important correspondence out of email systems, or seeks to divert correspondence away from email into other communication channels).

This approach is based on the realisation that deploying human effort to do something (badly) that machines are likely to be able to do (well) at a later date does not make sense in terms of either effectiveness or efficiency.
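In system terms, protect now – process later amounts to adding one exemption to the scheduled deletion rule. The Python below is a minimal sketch of that idea, not a description of any department's actual configuration. The `Account` fields, the `important_role` flag and the 90-day grace period are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Account:
    owner: str
    important_role: bool          # hypothetical flag set by the records team
    left_post: Optional[date]     # None while the official is still in post


def due_for_deletion(account, today, grace_days=90):
    """Scheduled deletion after leaving post, with an exemption for important roles."""
    if account.left_post is None:
        return False              # still in post: nothing is scheduled
    if account.important_role:
        return False              # protect now; process later with automated tools
    return today - account.left_post > timedelta(days=grace_days)


# Hypothetical accounts, for illustration only.
today = date(2019, 9, 26)
protected = Account("senior.official", True, date(2019, 1, 1))
ordinary = Account("former.colleague", False, date(2019, 1, 1))
```

With these example values `due_for_deletion(ordinary, today)` is true and `due_for_deletion(protected, today)` is false. The hard part in practice is not the rule itself but deciding which roles get the flag, and making sure account holders know in advance that their account will be kept.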

GDPR implications of a protect now – process later approach

The implication of protecting important email accounts from deletion whilst working on the development of machine learning capabilities is that some personal correspondence is likely to be retained alongside historically important correspondence. This has data protection implications.

GDPR allows the archiving of records containing personal data provided that the preservation of the records is in the public interest, and provided that necessary safeguards are in place and the data protection rights of data subjects are respected. The retention of the work email account of an important official is likely to be in the public interest, and is likely to be compliant with data protection law provided the following conditions are met:

the role that the individual played was of historic interest;
the individual could expect their account to be permanently preserved;
the individual was given the chance to flag or remove personal correspondence;
access to personal correspondence was prevented except in case of overriding legal need;
items of correspondence that are primarily personal in nature are removed once a reliable capability to identify them becomes available.

Conclusion

This talk recommends that government departments which use email as their main channel of communication refrain from automatically deleting correspondence from the email of their most important staff, pending the development of automated tools to process the correspondence within those accounts. In practice this is likely to only involve protecting around 5% of their email accounts (using the old archival rule of thumb that 5% of the records of an originating body are likely to be worthy of permanent preservation).

This is not an easy sell to make to government departments. Even though the recommendation only covers around 5% of their email accounts, departments may well feel that these are the 5% that carry the highest potential reputational/political risk, and are the 5% most likely to attract freedom of information requests.

Making such a recommendation is in no sense ‘giving up’ on the records management ambition to have business correspondence consistently assigned to structures and schema that support the use and reuse of correspondence and that support the precise application of retention and access rules. It is simply a recognition that asking civil servants to select and move important email into a separate system has not worked for twenty years and shows no sign of working any time soon. It is also a recognition that we need automated tools to process the material that has been automatically filed by email systems.

Most important of all, this approach of protecting important email accounts gives us a pathway for applying automated solutions to email. It would provide an incentive and an opportunity to deploy tools that work on a binary logic (‘is this email important, yes or no?’, ‘is this email personal, yes or no?’) to mitigate the worst flaws of email accounts from an information management point of view. These tools are not pie in the sky: they are already being used in real-life projects. The hope would also be that in the long term we may have tools that go beyond binary questions and could assign individual emails to a reasonably granular records classification, taxonomy and/or retention schedule.

The theories and explanations outlined in this talk have been developed during the course of my Loughborough University doctoral research project which is a realist evaluation of archival policy towards UK government email. A paper from this project ‘The defensible deletion of government email’ was published by the Records Management Journal in March 2019. An open access version of this paper is available from Loughborough University’s digital repository here (once in the repository click on ‘download’ to download the pdf, or read it in the window provided).

James Lappin

Thinking Records

James Lappin's records management blog
