
Month: October 2024
Using AI for records management purposes: seven key choices
If you were to tell me that you are going to apply artificial intelligence for records management purposes, then the first thing I would do is blink. Once I had finished blinking I would ask you some questions in order to locate your proposed intervention more precisely.
There is a range of types of AI that can be applied to take a range of different actions, on a range of different target types of content, at a range of different levels of aggregation, at a range of different stages of the records lifecycle, under a range of different types of human control, in pursuit of a range of different records management purposes.
It is the choices that you make on each of these seven aspects that define your AI intervention.
1 Choice of AI/data science technique
There is a great variety of types of artificial intelligence (and data science techniques more generally) that could be applied. You might want to:
- write your own rule set; OR
- train a supervised machine learning model using pre-labelled data; OR
- run an unsupervised machine learning model directly on your target content (and then start supervising and iterating it because the model won’t get it right first time!); OR
- prompt, fine-tune or augment a large language model (LLM).
Alternatively (or additionally) you might want to run analytics tools to gain statistical insights on your target content.
It is best to think of them as a tool set. You may need to apply them in combination. You may start within one type of intervention and have to switch to another if results are not as planned or if costs are too high.
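To make the simplest of these options concrete, here is a minimal sketch of a hand-written rule set. The patterns, the category names and the `apply_rules` function are all invented for illustration and are not drawn from any real retention schedule. Items that match no rule come back as `None`, which is the point at which you might reach for another tool in the set.

```python
import re
from typing import Optional

# Hypothetical rules: (pattern, retention category). First matching rule wins.
RULES = [
    (re.compile(r"\binvoice\b|\bpurchase order\b", re.IGNORECASE), "finance-7-years"),
    (re.compile(r"\bcontract\b|\bagreement\b", re.IGNORECASE), "legal-12-years"),
    (re.compile(r"\blunch\b|\bcar park\b", re.IGNORECASE), "trivial-1-year"),
]


def apply_rules(text: str) -> Optional[str]:
    """Return the category of the first matching rule, or None if nothing matches."""
    for pattern, category in RULES:
        if pattern.search(text):
            return category
    return None  # unmatched: a candidate for a different technique or human review


if __name__ == "__main__":
    for sample in ["Invoice 1234 attached", "Minutes of the board meeting"]:
        print(sample, "->", apply_rules(sample))
```

A rule set like this is cheap to write and easy to explain, but it only knows what you have told it, which is one reason you may end up combining it with other techniques.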
2 Choice of relationship between AI and human control
The AI/data science technique might be:
- giving decision support – by providing a records management team with summaries of content, suggested taxonomy categories, indications of the locations of sensitivities etc.;
- directly acting on content – by assigning it to a taxonomy category, assigning it to a retention category, applying protective markings etc. In such cases the human control takes the form of both the testing of the AI model prior to deployment, and the continued monitoring of the model after deployment.
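The difference between the two relationships can be sketched in a few lines. This is an illustration under my own assumptions: `Mode`, `Item`, `handle_prediction` and the audit log are hypothetical names, and the predicted label is simply passed in rather than produced by a real model. In decision support mode the prediction is stored as a suggestion for a human to accept or reject. In direct action mode it is applied, and a log entry is kept so that the model can be monitored after deployment.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Mode(Enum):
    DECISION_SUPPORT = "decision support"
    DIRECT_ACTION = "direct action"


@dataclass
class Item:
    name: str
    label: Optional[str] = None                           # label actually applied
    suggestions: List[str] = field(default_factory=list)  # labels awaiting human review


def handle_prediction(item: Item, predicted_label: str, mode: Mode,
                      audit_log: List[Tuple[str, str]]) -> None:
    """Either apply the model's label or park it as a suggestion, depending on mode."""
    if mode is Mode.DIRECT_ACTION:
        item.label = predicted_label
        # Recorded so that the team can monitor the model's decisions after deployment.
        audit_log.append((item.name, predicted_label))
    else:
        item.suggestions.append(predicted_label)


if __name__ == "__main__":
    log: List[Tuple[str, str]] = []
    doc = Item("budget.docx")
    handle_prediction(doc, "finance", Mode.DECISION_SUPPORT, log)
    print(doc, log)
```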
3 Choice of purpose
There are various purposes for which a records management team might wish to deploy the help of AI (and data science techniques more generally). These include:
- taxonomy application – the use of AI to assign content to taxonomy categories. This could involve the use of supervised learning to apply an existing taxonomy. Alternatively it could involve the use of unsupervised learning to generate a taxonomy from the target content itself, and to apply it to that target content. For the taxonomy to be useful for records management purposes it must be linked (or linkable) to retention rules;
- disposition review – the use of AI to identify material that has no ongoing value to the organisation or its stakeholders;
- appraisal and selection – the use of AI (usually by a public authority, after a period defined by statute) to identify material that has historical value for society and hence is worthy of transfer to a historical archive;
- sensitivity identification – the use of AI to identify sensitivities in material that warrants the placing of access restrictions and/or other protections.
4 Choice of AI action
In order to support these purposes there are a variety of actions that AI might be deployed to take. These actions include:
- labelling – applying a taxonomy category (or directly applying a retention category) to an item or an aggregation;
- summarising – summarising an item, aggregation or a set of aggregations. A summary might enable a records management team to quickly check assumptions about the content of that aggregation/set of aggregations. A summary might also enable a records management team to support and defend decisions made on that aggregation/set of aggregations;
- sensitivity protection – actions to restrict access to aggregations, items, or parts of items that contain information that is identified as being sensitive;
- negative appraisal – identifying personal, social, unsolicited (messages), uncommunicated (documents), and/or trivial items that are not needed within a particular aggregation or set of aggregations;
- clustering – linking together similar items, aggregations or sets of items, for the purposes of taxonomy generation, taxonomy application, or to enable similar items/aggregations to be reviewed and processed together.
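Of these actions, clustering is perhaps the easiest to show in miniature. The sketch below is a deliberately crude stand-in for a real clustering model: it compares items by the overlap of the words they contain (Jaccard similarity) and puts each item into the first cluster whose founding member is similar enough. The function names and the 0.5 threshold are my own assumptions, chosen only for illustration.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of words the two sets have in common (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(texts: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy clustering: an item joins the first cluster whose first member is
    at least `threshold` similar to it, otherwise it starts a new cluster.
    Returns clusters as lists of indexes into `texts`."""
    token_sets = [tokens(t) for t in texts]
    clusters: List[List[int]] = []
    for i, current in enumerate(token_sets):
        for members in clusters:
            if jaccard(current, token_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    titles = ["Budget forecast 2023 draft", "Budget forecast 2023 final",
              "Team lunch on Friday"]
    print(cluster(titles))
```

With the three titles above, the two budget forecasts share three of five distinct words and so fall into one cluster, while the lunch message starts its own. Similar items can then be reviewed and processed together.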
5 Choice of level of aggregation at which to act
Records consist of items (documents, messages, data entries, images, videos, etc). Items exist within aggregations (shared drives/folders, SharePoint sites/libraries, email accounts, datasets etc.).
A key decision for any AI intervention for records management purposes is the level of aggregation at which to make the decision or take the action.
If you are applying AI to apply a taxonomy, you could:
- apply the taxonomy to individual items – each document, message, etc. is assigned to a taxonomy category; OR
- apply the taxonomy to aggregations – each SharePoint site/ each email account/ each shared drive (or each high level folder or each folder) is assigned to a taxonomy category.
If you are taking destruction actions after disposition reviews you could:
- dispose of entire aggregations – deleting the entirety of a SharePoint site/ the entirety of an email account/ the entirety of a shared drive/ the entirety of content below a high level folder/ the entirety of a folder; OR
- dispose of individual items – deleting individual documents, messages etc. (without this deletion affecting the rest of the content in the aggregation of which they are part).
If you are taking selection actions after appraisal exercises you could:
- select entire aggregations for permanent preservation – selection of entire SharePoint sites, entire email accounts (minus non-business content), entire shared drives/the entirety of content below a high level folder/ entire folders; OR
- select individual items for permanent preservation – selection of certain individual documents, messages etc. without this selection affecting the rest of the content in the aggregations of which they formed part.
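The two levels need not be entirely separate. One possible bridge, which is my own assumption rather than an established method, is to label a sample of items and then roll those labels up to the aggregation by majority vote. In the sketch below `label_aggregations` is a hypothetical function that returns the most common label for each aggregation, together with the share of items that carry it. A low share is a hint that the aggregation is mixed and may not be safe to treat as a single unit.

```python
from collections import Counter
from typing import Dict, List, Tuple


def label_aggregations(
    item_labels: List[Tuple[str, str]]
) -> Dict[str, Tuple[str, float]]:
    """Take (aggregation, label) pairs for individual items and return, for each
    aggregation, its most common label and the share of items carrying it."""
    counts_by_aggregation: Dict[str, Counter] = {}
    for aggregation, label in item_labels:
        counts_by_aggregation.setdefault(aggregation, Counter())[label] += 1

    result: Dict[str, Tuple[str, float]] = {}
    for aggregation, counts in counts_by_aggregation.items():
        label, n = counts.most_common(1)[0]
        result[aggregation] = (label, n / sum(counts.values()))
    return result


if __name__ == "__main__":
    # Hypothetical item-level labels for two aggregations.
    items = [("site-A", "finance"), ("site-A", "finance"),
             ("site-A", "hr"), ("mailbox-B", "hr")]
    print(label_aggregations(items))
```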
6 Choice of stage of the records lifecycle
Another key choice is whether you plan to apply AI to:
- active content on live systems (for example content in live MS Teams/SharePoint sites, live email accounts, etc.); OR
- inactive content on live systems (for example closed/moribund Teams/SharePoint sites, email accounts of leavers etc.); OR
- inactive content on legacy systems (for example content created in legacy systems and not migrated to live systems, including legacy shared drives, legacy on-premise SharePoint implementations, legacy electronic document and records management systems, legacy email systems).
Applying AI to active content on live systems poses very different challenges from applying AI to content on legacy systems.
If you apply AI to active content in live systems you have the advantage that the AI model’s judgements could be made visible to the end-users who have created or received the content. These end-users could be offered the opportunity to challenge or confirm the model’s judgement.
However, most live environments are cloud based and are ‘evergreen’ – with frequent and rapid updates pushed through by the provider. A tenant organisation that wanted to deploy an AI tool in a cloud suite such as Microsoft 365, and to make its judgements visible in the end-user interface, would need some form of integration with the suite. Such an integration would run the risk of conflicting with future developments in the suite. This integration challenge would not arise if you use the AI capabilities provided within the suite itself, but you are then limited to the AI capabilities made available by that provider in that suite/product.
When dealing with content on legacy systems there are no end-users around to confirm or correct the judgement of the AI model. However a compensating advantage of acting on legacy systems is that there is scope to move or copy target content into an environment of your choice, and to apply AI tools and data science techniques of your choice.
7 Choice of target content
In general, most organisations will have accumulated most of their unstructured content in generic communications and document storage tools. The market for such tools over the past twenty years has been dominated by Microsoft.
In the on-premise era the Microsoft Windows environment dominated. The most widespread document storage area was network shared drives (file shares). SharePoint sites offered an alternative to network shared drives in the latter years of the on-premise era. The most common message storage medium in the on-premise era was Microsoft Exchange email accounts.
Microsoft have retained their domination of generic business document management, collaboration and communication tools in the cloud era. The Microsoft 365 cloud suite is built around the online versions of SharePoint and Exchange. The first big cloud native application that emerged in the Microsoft 365 cloud suite was MS Teams, but Teams does not have a repository of its own – it uses SharePoint and Exchange as its repositories:
- MS Teams uses SharePoint and OneDrive sites to store any documents posted or uploaded into it or sent through it (OneDrive sites use the SharePoint repository);
- MS Teams uses Exchange email accounts to store any messages or channel posts made through it.
This relative homogeneity of supplier and environment means that most of the content that most organisations are likely to need to take action on comes in the form of either:
- network shared drives (file shares) – they may be an unglamorous left-over from the on-premise era, but they still have to be dealt with;
- SharePoint sites – including sites from on-premise SharePoint implementations, sites in Microsoft 365 SharePoint implementations, sites set up to accompany MS Teams, and OneDrive sites. Furthermore many organisations have decanted/migrated the content of some or all of their legacy on-premise era document management systems, including electronic records management systems, into SharePoint sites;
- email accounts – including any on-premise email accounts, and email accounts in cloud suites. Email accounts in Microsoft Exchange in Microsoft 365 also include, as a substrate invisible to the end-user, the Teams chat messages exchanged by that user.
If approaches can be found for applying data science techniques for the management over time of content in network shared drives, SharePoint sites and email accounts, then huge swathes of most organisations’ digital heaps become manageable.
Typology of AI interventions for records management purposes
The most striking thing about these choices is that for each of them there is no right or wrong answer. These choices exist because there is no perfect form of AI, no perfect way of applying AI, and because perfection in records management is not obtainable (even with AI). There are trade-offs to be managed and different choices have different advantages and disadvantages.
Setting out these choices provides a way of placing any existing or proposed AI intervention that you come across (or think of) against the whole constellation of choices open to a records management team.
| Nature of the choice | Options |
| --- | --- |
| 1 AI/data science technique(s) | analytics tools; rule set; supervised machine learning; unsupervised machine learning; large language model |
| 2 Relationship between AI and human control | AI directly acting on content; AI providing decision support |
| 3 Purpose | taxonomy application; disposition review; appraisal and selection; sensitivity identification |
| 4 AI action | assigning a category or label; summarising; sensitivity protection; negative appraisal; clustering |
| 5 Level of aggregation at which to act | item level; aggregation level |
| 6 Stage of the records lifecycle | active content on live systems; inactive content on live systems; inactive content on legacy systems |
| 7 Target content | content in SharePoint sites; content in email accounts; content in network shared drives; other |
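If you wanted to record interventions against this typology in a consistent way, the table translates fairly directly into a small data structure. The sketch below is one possible encoding and nothing more: the `OPTIONS` dictionary copies the option names from the table, while the field names and the `Intervention` class are my own invention. The technique field takes several values because techniques may be applied in combination.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Option names copied from the typology table; the dictionary keys are hypothetical.
OPTIONS: Dict[str, Tuple[str, ...]] = {
    "technique": ("analytics tools", "rule set", "supervised machine learning",
                  "unsupervised machine learning", "large language model"),
    "human_control": ("AI directly acting on content", "AI providing decision support"),
    "purpose": ("taxonomy application", "disposition review",
                "appraisal and selection", "sensitivity identification"),
    "action": ("assigning a category or label", "summarising",
               "sensitivity protection", "negative appraisal", "clustering"),
    "aggregation_level": ("item level", "aggregation level"),
    "lifecycle_stage": ("active content on live systems",
                        "inactive content on live systems",
                        "inactive content on legacy systems"),
    "target_content": ("content in SharePoint sites", "content in email accounts",
                       "content in network shared drives", "other"),
}


@dataclass(frozen=True)
class Intervention:
    technique: Tuple[str, ...]  # one or more techniques, since they can be combined
    human_control: str
    purpose: str
    action: str
    aggregation_level: str
    lifecycle_stage: str
    target_content: str

    def __post_init__(self) -> None:
        # Reject any value that is not one of the options listed in the table.
        for name, allowed in OPTIONS.items():
            value = getattr(self, name)
            chosen = value if isinstance(value, tuple) else (value,)
            for option in chosen:
                if option not in allowed:
                    raise ValueError(f"{option!r} is not a listed option for {name}")


if __name__ == "__main__":
    example = Intervention(
        technique=("supervised machine learning",),
        human_control="AI providing decision support",
        purpose="taxonomy application",
        action="assigning a category or label",
        aggregation_level="aggregation level",
        lifecycle_stage="inactive content on legacy systems",
        target_content="content in network shared drives",
    )
    print(example)
```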