The journal Archival Science has published the latest paper from my doctoral research project into archival policy towards email. The paper is entitled ‘Rival records management models in an era of partial automation’. It is an open access paper and is free to read and to download from here.
The paper argues that:
- the adoption of email in the 1990s brought in an era of partial automation. Email systems could automatically file correspondence, but not into a structure/schema of our choosing;
- Frank Upward’s records continuum model sees a recordkeeping system as a set of processes that involve the creation, capture, organisation and sharing of records;
- in an email system the process by which correspondence is captured is optimal. Correspondence is filed instantly according to very predictable and reliable rules. Metadata values are assigned consistently and accurately;
- the structure and metadata schema of a typical email system is sub-optimal. Correspondence is not linked to the business activity it arose from, and business correspondence is not distinguished from personal, trivial, and social correspondence. This causes difficulties with the management and sharing of records (most notably when an individual leaves post and their successor cannot normally be permitted to access the business correspondence within their email account).
A significant proportion of archival and records management thought over the course of the past quarter of a century has gone into trying to specify what constitutes an optimal structure and metadata schema for a records system.
In the era before email, when we did not have the capability to automatically file correspondence, there was a level playing field between different ways of structuring a record system. An organisation had a choice of several ways it could file correspondence (chronologically, alphabetically by correspondent, or functionally by business activity). It did not take appreciably more effort for a human to file into any one of these three structures than into any of the others. It therefore made sense to choose the structure that gave optimum efficiency in terms of the management and sharing of records through time. This equates to the structure that is most efficient from the point of view of applying records retention and access rules. Theodore Schellenberg, author of the foundational text on records management, tells us that the most efficient way of organising records is by function and business activity.
The coming of email changed this equation. An email system automatically files all of an organisation’s email correspondence alphabetically and chronologically at the email system level, and all of an individual’s correspondence alphabetically and chronologically at an email account level. There is no longer a level playing field between different ways of structuring a record system. We can automatically and instantly file email correspondence chronologically and alphabetically, but if we want to file it by business activity then that would have to be done manually.
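The 'partial automation' described above can be illustrated with a small sketch. The code below (a hypothetical illustration using Python's standard `email` module; the addresses and store structure are invented for the example) shows how an email system's filing rules are instant and reliable, yet file only by correspondent and date, never by business activity:

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

def file_message(raw_message, store):
    """File a message the way an email system does: by correspondent, then date."""
    msg = message_from_string(raw_message)
    sender = parseaddr(msg["From"])[1]            # reliable: taken straight from the header
    received = parsedate_to_datetime(msg["Date"])
    # File alphabetically by correspondent, then chronologically within each correspondent.
    store.setdefault(sender, []).append((received, msg["Subject"]))
    store[sender].sort(key=lambda item: item[0])
    return store

store = {}
raw = (
    "From: Jane Smith <jane@example.org>\n"
    "Date: Mon, 04 Jan 2021 09:30:00 +0000\n"
    "Subject: Project Alpha budget\n\n"
    "Body text."
)
file_message(raw, store)
# The message is now filed under 'jane@example.org' -- instantly and
# consistently -- but nothing links it to the business activity
# ('Project Alpha') except free text in the subject line.
```

The filing is automatic and the metadata (sender, date) is assigned with complete consistency, which is exactly the strength of the email structure; the weakness is that the resulting structure is fixed rather than one of our choosing.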
In this era of partial automation we have a paradoxical situation whereby if we were to ask end-users to file email correspondence into an application with an optimal structure/schema (one that organises records by business activity) then we are likely to make the recordkeeping of the organisation less efficient and less reliable. This is because we would be using an unreliable manual process to re-file correspondence that had already been filed automatically and reliably into a sub-optimal structure/schema.
The paper therefore finds a justification within archival theory for approaches that seek to manage correspondence in place within the structure and metadata schema of a native messaging application (for example of an email system) even where that structure/schema is sub-optimal. This justification is valid in circumstances where the native application has automatically and reliably assigned correspondence to that structure/schema, and where the organisation lacks an automatic and reliable means to re-assign correspondence to an alternative (more optimal) structure/schema.
Every organisation has a records management strategy. Regardless of whether or not they make that strategy explicit in a strategy document, their strategy is implicit in the choices they make about how they go about applying retention rules to the content in their business applications:
- they might be seeking to try to move documents and messages needed as a record into one business application that is optimised for recordkeeping;
- they might be seeking to intervene in several, most or all of their business applications so that each of these applications is optimised to manage its own records;
- they might be seeking to manage records ‘in place’ within the native applications that they arise in, even when an application has a structure and a metadata schema that is sub-optimal for recordkeeping.
Each of these rival strategies has its pros and cons. Each is a legitimate records management approach.
This situation is made more complex by the arrival of the cloud suites from tech giants Microsoft and Google. Each cloud suite is a combination of several different document management, collaboration, filesharing and messaging applications. Each cloud suite embodies an implicit records management strategy. The strategy is discernible from the retention capabilities that the suite is equipped with, and from how those capabilities relate to the applications within the suite. This opens up the possibility that an organisation might be expounding one type of records management strategy, but be deploying a cloud suite that is informed by one of the rival strategies.
It is clear from the way that the Microsoft (Office) 365 cloud suite has been set up that Microsoft have adopted an ‘in place’ approach to records management. In the MS 365 suite no one application is superior for records management purposes to any other. Retention functionality sits in the Compliance centre, outside of any one application. Retention rules can be set in the Compliance centre and applied to any type of aggregation in any of the collaboration/document management/messaging/filesharing applications within the suite.
In the early years of Office 365 it may have been possible for an organisation to deploy the suite whilst continuing with a strategy of moving all documents and messages needed as a record into one specific business application that is optimised for recordkeeping. Since the coming of MS Teams this no longer appears possible.
MS Teams deployed as an organisation’s primary collaboration system
The phenomenal rise in adoption of MS Teams over the past two years is prompting some fundamental changes in the records management strategies of those organisations who have adopted it as their primary collaboration application:
- prior to the coming of MS Teams an organisation would have been able, if it so wished, to configure their main collaboration system with a records management friendly structure and metadata schema. This kind of ‘records management by design’ is not possible in MS Teams;
- prior to the coming of MS Teams most large organisations tended to deploy a document management application as their main collaboration system. Microsoft Teams is a messaging application. The organisation has therefore gone from trying to manage messages in a document management application, to trying to manage documents in a messaging tool.
These two changes are related: the reason why MS Teams is not as configurable as previous generations of collaboration systems is precisely because it is primarily a messaging system in which content is pushed to individuals and teams via Chat and Channels.
The impact of MS Teams on an organisation’s records management strategy
Let us think of how the records management strategy of a typical large organisation may have evolved over the past two decades:
- in the middle of the first decade of this century they may have implemented a corporate electronic document and records management system (EDRMS) and said to their staff ‘if you want a document or a message to be treated as a record, save it into our EDRMS’.
- near the start of the second decade of the 21st century they may have wanted to introduce a document management system that is more collaborative in nature. They may have replaced their EDRMS with a tool such as Microsoft SharePoint. They might have said to their staff ‘if you want a document or message to be treated as a record, move it into a SharePoint document library’.
- in 2019 or 2020 they may have rolled out Microsoft Teams, giving every individual a Teams Chat client and every team a Team with a set of channels into which messages and documents can be posted. They may now say to their staff ‘if you have an important document or message then post it through a Team channel’.
Here we have some elements of continuity, but also some important elements of discontinuity.
The main element of continuity is that the organisation is still encouraging staff to contribute every document or message needed as a record to one particular application. They are still making a distinction between:
- an application that they designate as being a record system (and hence will apply their retention principles and retention rules to);
- other applications that they do not regard as record systems (and hence do not commit to apply their retention principles and rules to).
The element of discontinuity lies in the nature of the rationale behind this distinction.
When the organisation had implemented an EDRMS or SharePoint as its corporate document management system and asked staff to move any documents and messages needed as a record into that system, it could argue that it was taking a ‘records management by design’ approach. It will have endeavoured to configure its corporate document management system with a logical structure that reflects (as best it could) its business processes, and to which it will have attempted to link appropriate retention rules and access rules.
They could justify the routine deletion of content in other applications by arguing that the records management by design approach relies on important documents and messages being placed into an application that has records management frameworks configured into it.
MS Teams and the decline of records management by design
For most of the past ten years SharePoint has dominated the market for corporate document management systems and hence dominated the market for systems through which an organisation could apply a records management by design approach.
SharePoint has been on a journey. It started the decade as an on-premise, stand-alone document management system which end users would directly interact with. It ends the decade as a component part in a cloud suite. Its role in that cloud suite is more and more becoming that of a back-end system supplying document management capability to MS Teams.
An organisation that configured SharePoint as their corporate document management system with a carefully designed information architecture will find that structure sidelined and deprecated by the introduction of MS Teams. Each new MS Team is linked to a new SharePoint site whose document library holds all documents posted through the Team’s channels. In effect this brings in a parallel structure of document libraries to rival those previously set up for those same users in SharePoint.
At the time of writing it does not appear possible to do records management by design with Microsoft Teams. The information architecture of MS Teams is not suited for it, and neither is the way that most organisations roll Teams out.
Information architecture of MS Teams
In MS Teams each Team is a silo. There is no overarching structure to organise the Teams. Overarching structures are useful in any sort of collaboration system, not just to serve as a navigation structure for end-users to find content, but also to place each collaboration area into some sort of context to support the ongoing management and retention of its content. There does exist, in the Microsoft 365 Teams Admin Centre, a list of the Teams in the tenancy. This list is only accessible to those who have admin rights in the tenancy. Most organisations give admin rights to only a very small number of people. The list gives an information governance/records management team precious little information about each Team. It gives a basic listing of the name of the team and the number of channels, members, owners and guests it has. (see this recent IRMS podcast with Robert Bath for a fuller discussion of these issues).
Roll out of MS Teams
In the past organisations typically used a staggered roll out to deploy their corporate document management/collaboration system (EDRMS, SharePoint etc.). Such systems would be rolled out tranche by tranche, to give time for the implementation team to configure the specific areas of the system that each team in the tranche was going to use. This was necessary in order to bridge the gaps between the organisation’s broad brush information architecture frameworks and the specific work and specific documentation types of the area concerned.
Microsoft Teams is primarily a messaging system. It is a communication tool. It gives individuals the ability to send messages to other individuals through their Chat client and to share posts with their close colleagues through a Team channel. Its aim is to speed up communications and to enable colleagues separated in space by remote working to keep closely connected. Organisations may have been happy to stagger the roll out of a document management system but they are not typically happy to stagger the roll out of a messaging system, because to do so would exclude some staff from communication flows.
There is a design element to Teams. There are choices to be made as to the optimum size for a Team, how many channels to set up in any Team, and what to set those channels up for. But there are two significant limitations to the scope of these design choices.
The first limitation arises from the fact that the design choices relate to Teams channels, whereas most MS Teams traffic tends to go through Teams Chat. There are no design choices to be made in the implementation of individual Teams Chat clients.
The second limitation arises from the fact that however you design Teams and Teams channels, you cannot overcome the fundamental architectural weakness of Teams, that each Team has to be a silo. This is the key difference between SharePoint/EDRMS systems on the one hand and MS Teams on the other.
The access model in MS Teams
SharePoint is a document management system. The access model for SharePoint sites and document libraries is flexible. You can make the site/library a silo if you wish by restricting access to a small number of people, but equally you can open up access widely. You can reduce the risk of opening access widely by confining the right to contribute, edit and delete content to a small group whilst opening view access to a wider group.
MS Teams is a messaging system first and foremost with a document management component (provided by SharePoint) as a secondary element.
You can only view content in a given Team if you are a member of that Team. Each Team member has access to every channel within the Team (apart from private channels) and to the Team’s document library. Each Team member can contribute posts to channels and can add and delete items from the document library. Broadening the membership of a Team is therefore more risky than opening up access to a document library in a SharePoint implementation, because everyone that you make a member of the Team has full edit and contribute rights in the Team.
A further disincentive to adding extra members to the Team lies in the fact that by making someone a member of the Team you are exposing the individual to the flow of messages in and out of the channels of that Team. There is a catch-22 here. The more you widen membership of a Team the more you drive message traffic away from Teams channels and into Teams Chat. This is because the wider the membership the greater the likelihood that any given message will be uninteresting, irrelevant or inappropriate for some members. If you keep the membership small you may increase the percentage of traffic going through the channel but you decrease the number of people to whom that traffic is accessible.
Microsoft Teams and the notion of ‘retrospective governance’
The first time I heard the phrase ‘retrospective governance’ was in the chat in the margins of a recent IRMS webinar. An organisation had carried out a corporate-wide roll-out of MS Teams virtually overnight in 2019, and the records manager reported having spent the subsequent year trying to put in place some ‘retrospective governance’ by identifying which organisational unit each Team belonged to, what it was being used for, whether it was needed, and what retention rule should apply to it. Numerous other participants reported similar experiences.
Retrospective governance is not exclusive to Microsoft Teams. We can for example see retrospective governance in the use of analytics/eDiscovery tools to make and process decisions on legacy file shares (shared drives).
In the past an organisation may have decided to apply retrospective governance to data in legacy applications and repositories, but such retrospective governance efforts were very much peripheral to the main thrust of their records management strategy which was centred on the application into which it had configured its recordkeeping frameworks. After the coming of MS Teams retrospective governance suddenly moves to the heart of an organisation’s records management efforts. The rise of MS Teams means that within the MS 365 suite it is not possible to configure records management frameworks into one application in such a way as to give that application records management primacy over other applications.
The in-place records management strategy
The in-place records management approach holds that at the current point in time it is not feasible to consistently move all significant business documents and messages into an application equipped with a structure/schema that is optimised for records management. There also exist many classes of business application (including email systems and all other types of messaging system) whose structure and schema simply cannot be optimised for recordkeeping. Therefore an organisation’s document management, messaging, filesharing and collaboration applications should be treated as record systems even if the structure and metadata schema of most of these applications is sub-optimal for recordkeeping.
Unless and until it becomes possible either a) to consistently and comprehensively move records into an application that is optimised for recordkeeping, or b) to configure each business application so that it has a structure and schema that is optimal for recordkeeping, then we also need c) a viable strategy for managing records in applications that have a structure and schema that is sub-optimal for recordkeeping.
Dialogue with Microsoft about their execution of the in-place records management strategy
The records management profession has put a tremendous amount of work over the past 25 years into building a knowledge base for the records management approaches that involve optimising one business application for recordkeeping. There is also a small amount of literature about the model where you configure records management frameworks into several, many or all business applications. The profession has put much less work into defining best practice for the in-place records management approach. This is understandable, because it was not our first choice of model. But its absence is especially noticeable in discussions with Microsoft about records management.
The in-place strategy embodied in the MS 365 cloud suite is one in which applications are deployed without any attempt to ensure that the structure and schema of those applications are optimised for recordkeeping. Organisations apply retention rules to the aggregations that naturally occur within those different native applications (email accounts, SharePoint sites, One Drive accounts, Teams Channels, Teams Chat accounts etc.) and ask individuals (or train machines) to identify and label any items that are exceptions to the retention rule set on the aggregation that they are housed in.
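The logic of that in-place model can be sketched in a few lines. The code below is a hypothetical illustration only (the aggregation names, label names and retention periods are invented; this is not Microsoft's API): each aggregation carries a default retention rule, and an item-level exception label, where present, overrides the default of the aggregation that houses the item.

```python
from datetime import date, timedelta

# Default retention period per aggregation (mailbox, site, channel...).
AGGREGATION_RULES = {
    "email-account:jane": timedelta(days=365 * 2),
    "teams-channel:finance": timedelta(days=365 * 7),
}
# Exception labels that individuals (or machines) apply to individual items.
EXCEPTION_LABELS = {
    "contract": timedelta(days=365 * 10),
    "transitory": timedelta(days=30),
}

def disposal_date(item):
    """Earliest date the item may be disposed of: the item-level label,
    if any, overrides the default rule of its aggregation."""
    rule = EXCEPTION_LABELS.get(item.get("label")) \
           or AGGREGATION_RULES[item["aggregation"]]
    return item["created"] + rule

item = {"aggregation": "email-account:jane",
        "created": date(2021, 1, 1),
        "label": "contract"}
print(disposal_date(item))   # the 'contract' label overrides the mailbox default
```

The design point is that no re-filing happens: items stay where the native application put them, and only the retention metadata varies.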
We can point Microsoft to a truck load of best practice standards in records management but almost all of it has been written from a point of view that assumes an organisation or a vendor is trying to design records management frameworks into one business application. There is little or no literature on the in-place strategy that Microsoft are trying to implement in their cloud suite. We have nothing against which to judge Microsoft’s execution of the in-place strategy in their suite, and nothing to guide organisations attempting to work with that strategy in their implementation of the suite.
Filling the gap in records management best practice
I am writing this on New Year’s Day. How about a collective new year’s resolution to start building a knowledge base for the in-place records management strategy? This should prompt us to start thinking through the key considerations involved in managing records across multiple document management, messaging, filesharing and collaboration applications, many of which have structures and schemas that are neither optimised nor optimisable for recordkeeping.
This should supplement but not replace the existing knowledge base we have for the other two main records management strategies.
To understand Microsoft’s strategy for document management in Office 365 it is more instructive to look at what they are doing with Delve, MS Teams and Project Cortex than it is to look at what they are doing with SharePoint.
The move to the cloud has had a massive impact on document management, despite the fact that document management systems (such as SharePoint) have changed relatively little.
What has changed is that cloud suites such as Office 365 and GSuite have created a much closer relationship between document management systems and the other corporate systems holding ‘unstructured’ data such as email systems, fileshares and IM/chat. This closer relationship is fuelling developments in the AI capabilities that the big cloud providers are including in their offerings. Project Cortex, set to come to Office 365 during 2020, is the latest example of an AI capability that is built upon an ability to map the interconnections between content in the document management system and communication behaviour in email and chat.
SharePoint in the on-premise era
In its on-premise days SharePoint was in many ways a typical corporate document management system. It was the type of system in which:
- colleagues were expected to upload documents, add metadata and place documents within some kind of overarching corporate structure.
- information managers would work to optimise the information architecture and in particular the search capability, the metadata schema and the structure of the system.
- retention rules would be held and applied to content.
It was the type of system that an organisation might call their ‘corporate records system’ on the grounds that documents within the system are likely to have better metadata, and be better governed, than documents held elsewhere.
SharePoint in the cloud era
In Office 365 SharePoint’s role is evolving differently. Its essential role is to provide document management services (through its document libraries) and (small scale) data management services (through its lists) to the other applications in the Office 365 family, and in particular to MS Teams.
SharePoint can still be configured to ask users to add metadata to documents, but users have three quicker alternatives to get a document into a document library:
- If the document library is synced with their file explorer they can drag and drop a document from anywhere on their computer drive into the document library.
- They could simply post the document to a channel in their Team in MS Teams which will place it in the document library in the SharePoint Team Site associated with the Office 365 group that underpins their Team.
- They could post it to a private channel in a Team which would cause the document to be stored in a document library within a site collection dedicated to that private channel.
SharePoint can and should still be given a logical corporate structure but MS Teams may start to reduce the coherence of this structure. Every new Team in MS Teams has to be linked to an Office 365 group. If no group exists for the Team then a new group has to be created. The creation of a new Office 365 Group provisions a SharePoint site in order to store the documents sent through the channels of that Team. Every time a private channel is created in that Team it will create another new SharePoint site of its own.
SharePoint still has a powerful Enterprise search centre within it, but it is rivalled by Delve, a personalised search tool that sits within Office 365 but outside SharePoint. Delve searches not just documents in SharePoint but also in One Drive for Business and even attachments to emails.
SharePoint can still be configured to apply retention rules to its own content through policies applied to content types or directly to libraries. However a simpler and more powerful way of applying retention rules to content in SharePoint is provided outside SharePoint, in the retention menu of the Office 365 Security and Compliance Centre. This retention menu is equally effective at applying retention rules (via Office 365 retention policies and/or labels) to SharePoint sites and libraries, Exchange email accounts, Teams, Teams chat users and other aggregations within the Office 365 environment.
Microsoft’s attitude to Metadata
Microsoft’s Office 365 is a juggernaut. It is evergreen software which means that it has regular upgrades that take effect immediately. It faces strong competitive pressures from another giant (Google). It needs to gain and hold a mass global customer base in order to achieve the economies of scale that cloud computing business models depend on.
Information architects of one sort or another are part of the ecosystem of Office 365. Like any other part of the Office 365 ecosystem information architects are impacted by shifts, advances and changes in the capabilities of the evergreen, everchanging Office 365. Suppliers in the Office 365 ecosystem look for gaps in the offering. They don’t know how long a particular gap will last, but they do know that there will always be a gap, because Microsoft are trying to satisfy the needs of a mass market, not the needs of that percentage of the market that have particularly strong needs in a particular area (governance, information architecture, records management etc.).
The niche that SharePoint information architects have hitherto occupied in the Office 365 environment will be changed (but not diminished) by Microsoft’s strategy of promoting:
- Teams as the interface and gateway to SharePoint;
- Delve as the main search tool for Office 365;
- the forthcoming Project Cortex as the main knowledge extraction tool;
- the Security and Compliance centre as the main locus of retention policies.
Microsoft’s need to win and keep a mass customer base means that they need document management to work without information architecture specialists because there are not enough information architecture specialists to help more than a minority of their customers.
Microsoft’s plans for SharePoint to be a background rather than a foreground element in Office 365 will take time to run their course, and that gives us time to think through what the next gap will be. What will be the gap for information architects after SharePoint has been reduced to a back end library and list holder for Teams, Delve, Cortex and the Microsoft Graph?
In order to come up with a proposed answer to this question this post will explore in a little bit more detail how and why Microsoft’s document management model has changed between the stand alone on premise SharePoint and SharePoint Online which is embedded in Office 365.
The on-premise corporate document management system model
On-premise corporate document management systems, up to and including on-premise SharePoint, were built on the assumption that a corporate document management system could stand separately from the systems (including email systems) that transported documents from person to person.
This assumption had been based on the idea that good metadata about a document would be captured at the point that it was entered into the system, and updated at any subsequent revision. This metadata would provide enough context about the documents held in the system to render superfluous any medium or long term retention of the messages that accompanied those documents as they were conveyed from sender to recipient(s).
The model depended on a very good information architecture to ensure that:
- every person (or machine) uploading a document to the system was faced with a set of relevant metadata fields;
- these metadata fields were backed, where necessary, by controlled vocabularies, that present a set of coherent and contextually relevant choices for metadata values.
The problem with this model is that it is not feasible to design an information architecture for a corporate-wide, stand-alone document management system that describes documents across all of an organisation’s different activities in a way that makes them understandable and manageable. You can achieve this for some parts of the system, but not for the whole system.
There are two ways you can set up an information architecture: top down or bottom up. Neither approach works on a corporate wide scale:
- In the top down approach you define a corporate controlled vocabulary for every metadata field that needs one. The trouble with this is that for any one individual user the vast majority of the values of those vocabularies would be irrelevant, and they would have to wade through all these irrelevant values every time they wanted to upload a document to the system.
- In the bottom up approach you define locally specific vocabularies. SharePoint was and is particularly good for this. For any particular document library you can define vocabularies tailored specifically to the content being put into that library. However you then get the problem that an implementation team in a medium or large organisation has not got the time to define locally-specific metadata for every single area of the business.
There is a way by which this information architecture problem can be solved. It involves:
- mapping different controlled vocabularies to each other, so that the choice of a metadata value in one field removes any conflicting values in the controlled vocabularies of any other metadata field.
- mapping metadata fields to a user’s role, so that any values in a controlled vocabulary that are irrelevant to the user are removed.
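These two mappings can be sketched concretely. The code below is a hypothetical illustration (the vocabularies, constraints and role names are invented for the example): choosing a value in one field removes conflicting values in other fields, and a user's role filters out values irrelevant to them.

```python
# Controlled vocabularies for two metadata fields.
VOCAB = {
    "directorate": {"Finance", "HR"},
    "activity": {"Budgeting", "Payroll", "Recruitment"},
}
# Field-to-field mapping: a chosen value constrains other fields' vocabularies.
CONSTRAINTS = {
    ("directorate", "Finance"): {"activity": {"Budgeting", "Payroll"}},
    ("directorate", "HR"): {"activity": {"Recruitment", "Payroll"}},
}
# Role-to-field mapping: each role only ever sees the values relevant to it.
ROLE_VIEW = {
    "finance-officer": {"directorate": {"Finance"}},
}

def permitted_values(field, chosen, role):
    """Vocabulary values still permitted for `field`, given the user's
    role and the metadata values already chosen in other fields."""
    values = VOCAB[field]
    for (other_field, value) in chosen.items():
        restriction = CONSTRAINTS.get((other_field, value), {})
        if field in restriction:
            values = values & restriction[field]
    if role in ROLE_VIEW and field in ROLE_VIEW[role]:
        values = values & ROLE_VIEW[role][field]
    return values

# A finance officer who has chosen directorate=Finance is offered only
# the activities belonging to that directorate.
print(permitted_values("activity", {"directorate": "Finance"}, "finance-officer"))
```

Even in this toy form, the vocabularies and the mappings between them behave like a network of relationships rather than a set of flat lists.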
This is already starting to look like a graph – like the Facebook social graph that drives search in Facebook, the Google Knowledge Graph that is built into Google Search, and the Microsoft Graph, which is the enterprise social graph that underpins Delve and Project Cortex in Office 365.
Enterprise social graphs
An enterprise social graph is an established set of connections between:
- information objects (such as documents);
- the people who interact with those documents;
- the actions those people perform on those documents (saving them, sending them, revising them etc.);
- the topics/entities (such as policy issues, projects, countries, regions, organisations, disciplines etc. etc.) discussed in those documents.
The deployment of a graph significantly reduces the reliance of a system on metadata added by an end user (or machine) at the time a document is uploaded into a system. The mere fact that a particular end user has uploaded a document to a particular location in a system already connects that document to the graph. The graph connects the document to other people, topics and entities connected with the person who uploaded the document.
Graphs consist of nodes (people, objects and topics) and edges (the relationships between the nodes).
The concept of the graph has enormous potential in information architecture. You could narrow down the range of permitted values for any metadata field for any document any individual contributes to a system just by ensuring that the system knows what role they occupy at the time they upload the document.
This pathway towards smart metadata also takes us away from the idea of the document management system as a standalone system.
If we see a document management system as a world unto itself we will never be able to capture accurate enough metadata to understand the documents in the system. Better to start with the idea that the documents an individual creates are just one manifestation of their work, and are interrelated and interdependent with other manifestations of their work such as their correspondence, their chats, and their contributions to various line-of-business databases.
We can also distinguish between a knowledge graph, which is built out of what an organisation formally knows, and a social graph which is built out of how people in an organisation behave in information systems. The cloud providers have started by providing us with a social graph. Over time that social graph may improve to become more like a knowledge graph, and we will see below when we look at Project Cortex that Microsoft are taking some steps in that direction. But there is still some way to go before the enterprise social graph provided by Microsoft has the precision of an ideal knowledge graph. Note the word ‘ideal’ in that sentence: I have never worked in an organisation that has managed to get a knowledge graph (as opposed to a social graph) up and functioning.
The nature of an ideal knowledge graph will vary from organisation to organisation. An engineering firm needs a different type of graph from a ministry of foreign affairs which needs a different type of graph from a bank etc. etc.
In an engineering firm an ideal knowledge graph would connect:
- the people that are employed;
- the projects that are being carried out by the company;
- the systems that are being designed, manufactured, installed and maintained;
- the structures that are being designed and built;
- the engineering disciplines that are involved.
These different datasets and vocabularies can be mapped to each other in a graph independently of any document. Once a graph is constructed a document can be mapped to any one of these features and the range of possible values for all the other features should correspondingly reduce.
In a foreign ministry an ideal knowledge graph would connect:
- the people that are employed;
- the locations they are based in;
- their generic role (desk officer, subject expert, ambassador etc.);
- the country(ies) that they deal with;
- the people they work closely with;
- the multilateral fora in which they participate;
- thematic topics;
- types of agreements/treaties.
Again these can be mapped independently of any documents. Staff can be mapped to the countries they are based in/follow or to the thematic topic they work on.
The notion of the graph (whether a knowledge graph or a social graph or a blend of the two) brings home the fact that the data, document and messaging systems of an organisation are all interdependent.
The graph becomes more powerful from a machine learning and search point of view if it is kept nourished with the events that take place in different systems. When a person emails a document to another person this either reinforces or re-calibrates the graph’s perception of who that person works with and what projects, topics or themes they are working on.
Information architects will still need to pay attention to the configuration of particular systems, and corporate document management systems bring with them more configuration choices than any other information system I can think of. They should however pay equal attention to the configuration of the enterprise social graph that the document management system, in common with the other systems of the organisation, will both contribute to and draw from.
The next section looks at why both end users and Microsoft have tended to move away from user added metadata in SharePoint.
SharePoint and user added metadata
In a recent IRMS podcast Andrew Warland reported that an organisation he worked with synched their SharePoint document libraries with Explorer, and that subsequently most users seemed to prefer accessing their SharePoint documents through the ‘Explorer View’ rather than through the browser.
This preference for using the Explorer view over the browser view is counter-intuitive. The browser view provides the full visual experience and the full functionality of SharePoint, whereas the Explorer view in effect reduces SharePoint to one big shared drive. But it is understandable when you think of the relationship between functionality and simplicity. Those purchasing and configuring information systems tend to want to maximise the functionality of the system they buy/implement. Those using it tend to want to maximise the simplicity. These things are in tension – the more powerful the functionality, the more complex the choices presented to end users. The two simplest things a document management system must do are allow users to add documents and allow them to view documents: the Explorer view supports both of these tasks and nothing else.
At this point I will add an important caveat. Andrew didn’t say that all end users preferred the Explorer view. Some sections of the organisation had more sophisticated document library set-ups that they valued, and were prepared to keep adding and using the metadata. But if the hypothesis advanced at the start of this post is correct then it is not feasible to configure targeted metadata fields with context-specific controlled vocabularies for every team in an organisation when rolling out a standalone document management system.
Graham Snow pointed out in this tweet that one disadvantage of synching document libraries with Explorer is that when a user adds a document they are not prompted to add any metadata to it. This raises two questions:
- why are Microsoft giving users a way to opt out of adding metadata when we know how important metadata is to retrieval?
- why are so many end-users seemingly uninterested in adding metadata when they would, in theory, be the biggest beneficiaries of that metadata?
Let us start by confirming that metadata is indeed important. In order to understand any particular version of any particular document you need to know three things:
- who was it shared with?
- when was it shared with them?
- why was it shared with them?
This provides a clue as to why many end-users don’t tend to add metadata to documents. If a document was shared via email then the end-user has the metadata that answers those three crucial questions, in the form of an email sitting in their email account. Their email account will have a record of who they shared it with (the recipient of the email), when (the date of the email) and why (the message of the email). One question we might ask ourselves is why have we not sought to routinely add to the metadata of each document the details which we could scrape from the email system when it is sent as an attachment? These details include the date the document was sent, the identity of the sender and the identity(ies) of the recipient(s).
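The scraping idea is straightforward with standard email tooling. A minimal sketch using Python’s standard `email` library (the sample message and addresses are invented for the demonstration):

```python
from email.message import EmailMessage
from email.utils import getaddresses, parsedate_to_datetime

def scrape_sharing_metadata(msg):
    """Pull the who/when metadata from an email that carried a document
    as an attachment. The 'why' would come from the message body."""
    return {
        "sender": msg["From"],
        "recipients": [addr for _, addr in
                       getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))],
        "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
    }

# Build a sample message to demonstrate.
msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["To"] = "bob@example.org, carol@example.org"
msg["Date"] = "Mon, 20 Jan 2020 09:30:00 +0000"
msg.set_content("Draft attached for comment before Friday's meeting.")

print(scrape_sharing_metadata(msg))
```

Metadata of this kind, written back onto the attached document, would answer the who and when questions without the end user typing anything.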
The Microsoft graph
Microsoft are trying to make Office 365 more than simply a conglomeration of standalone applications. They are trying to integrate and interrelate One Drive, Outlook, Teams, SharePoint and Exchange, to provide common experiences across these tools, which they prefer to call separate Office 365 ‘workloads’ rather than separate applications. This effort to drive increased integration is based on two main developments that span Office 365: an Office 365-wide API (called the Microsoft Graph API) and an enterprise social graph (called the Microsoft Graph).
The Microsoft Graph API provides a common API to all the workloads in Office 365. This enables developers (and Microsoft themselves) to build applications that draw on content held and events that happen in any of the Office 365 workloads.
Microsoft Graph is an enterprise social graph that is nourished by the ‘signals’ of events that happen anywhere in Office 365 (documents being uploaded to One Drive or SharePoint; documents being sent through Outlook or Teams; documents being edited, commented upon, liked, read, etc.). These signals are surfaced through the Microsoft Graph API.
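To give a flavour of what ‘surfacing signals through the API’ looks like, here is a sketch that parses an abbreviated payload in the shape of a Microsoft Graph insights response. The field names are simplified from the documented format and the payload itself is invented, so treat this as illustrative only:

```python
import json

# An abbreviated, invented payload shaped like a Microsoft Graph
# insights response (e.g. /me/insights/trending), with each item
# carrying a relevance weight derived from activity signals.
payload = json.loads("""
{
  "value": [
    {"weight": 0.85,
     "resourceVisualization": {"title": "Q3 budget.xlsx", "type": "spreadsheet"}},
    {"weight": 0.42,
     "resourceVisualization": {"title": "Team charter.docx", "type": "document"}}
  ]
}
""")

# Rank the documents by the weight the graph assigns them.
trending = sorted(payload["value"], key=lambda item: item["weight"], reverse=True)
for item in trending:
    print(item["resourceVisualization"]["title"], item["weight"])
```

The point is that the relevance ranking comes from the graph’s accumulated signals, not from any metadata the end user typed in.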
Microsoft Graph was set up to map the connections between individual staff, the documents they interact with, and the colleagues they interact with. For most of its existence the Microsoft Graph has been more of a social graph than a knowledge graph.
The forthcoming Project Cortex (announced at the Microsoft Ignite conference of November 2019) takes some steps in the direction of turning the Microsoft Graph into a knowledge graph. It will create a new class of objects in the graph called ‘knowledge entities’. Knowledge entities are the topics and entities that Cortex finds mentioned in the documents and messages that are uploaded to/exchanged within Office 365. Cortex will create these in the Microsoft Graph and link them to the documents in which they are mentioned and the people that work with those documents.
Applications built on top of the Microsoft Graph
The three most important new services that Microsoft has built within Office 365 since its inception are Delve, Microsoft Teams and Project Cortex. All three of these services are meant to act as windows into the other workloads of Office 365. They are all built on top of the Microsoft Graph, and they provide signposts as to the direction in which Microsoft wants Office 365 to go and how it sees the future of document management.
MS Teams, Delve, Cortex and the Microsoft Graph are eroding the barriers between the document management system (SharePoint), the fileshare (One Drive for Business), the email system (Outlook and Exchange) and the chat system (Teams).
Teams is primarily a chat client. But it is a chat client that stores any documents sent through it in either:
- SharePoint document libraries (if the message is sent through a Teams channel or private channel) or
- One Drive for Business (Microsoft’s cloud equivalent of a fileshare) if the document is sent through a chat.
Delve uses Microsoft Graph to personalise, security trim, filter and rank search results obtained by the Office 365 search engine. Delve pushes these personalised results to individual users so that on their individual Delve page they see:
- a list of their own recent documents. This shows documents they have interacted with (sent, received, uploaded, edited, commented on, opened or liked) in Outlook, Teams, One Drive or SharePoint.
- a list of documents they may be interested in. This is Delve acting as a recommendation engine and showing documents that the individual’s close colleagues have interacted with recently, and which relate to topics that the individual has been mentioning in their own documents.
Delve is working under certain constraints. It does not search the content of email messages, only the attachments. It does not recommend a document to an individual who does not have access to that document.
There are some cases where Delve has surfaced information architecture issues. In an IRMS podcast discussion with Andrew Warland (which is currently being prepared for publication) Andrew told me how one organisation he came into contact with had imported all their shared drives into SharePoint without changing access permissions in any way. Each team’s shared drive went to a dedicated document library. The problem came when Delve started recommending documents. Sometimes Delve would recommend documents from one part of the team to people in a different part of the team, and sometimes the document creators were not pleased that the existence of those documents had been advertised to other colleagues.
The team asked Andrew whether they could switch off Delve. His response was that they could, but that switching off Delve (or removing the document library from the scope of Delve) would not tackle the root of the problem. The underlying problem was that the whole team had access to the document library that they were saving their documents into. He suggested splitting up the big document library into smaller document libraries so that access restrictions could be set that were better tailored to the work of different parts of the team.
Delve has taken baby steps towards unlocking some of the knowledge locked in email systems that is normally only available to the individual email account holder (and to central compliance teams). Delve cannot search the content of messages, but it can search the attachments of email messages and the metadata of who sent the attachment to whom.
Project Cortex will take this one step further. It is a knowledge extraction tool. It seeks to identify items of information within the documents uploaded and the messages sent through Office 365. It is looking for the ‘nouns’ (think of the nodes on the graph) within the documents and the messages. The types of things it is looking for are the names of projects, organisations, issues etc. It seeks to create ‘topic cards’ and topic pages containing key pieces of information about these entities. A link to the topic card will appear whenever the project/organisation/issue etc. is mentioned in any Office 365 workload. Users will come across the link when they read or type the name of the entity into an email or a document. The topic cards and pages will also contain Cortex’s recommendations as to which colleagues are experts on the topic and which documents are relevant to the topic. Like Delve, Cortex will use Microsoft Graph to create these recommendations.
Project Cortex is tightly bound in with SharePoint. Its outputs manifest themselves in familiar SharePoint pages and libraries. Cortex uses the fact that SharePoint sites can serve as an intranet to generate topic pages that function like SharePoint intranet pages. Like SharePoint intranet pages you can add web parts to them, and they use document libraries to store and display documents. Project Cortex will populate the document library of a topic page with the documents that it mined to generate the information on the topic page. Colleagues who do not have access to those documents will not have access to the page.
The topic cards and pages will be editable (like wikipages). Project Cortex will link the topic pages for related topics together to form Knowledge Centres. These Knowledge centres will supplement (or rival) the organisation’s intranet.
SharePoint and machine added metadata
So far the knowledge centre/topic pages aspects of Project Cortex have got the most publicity, and they are the aspects that are likely to make the most immediate impression on end users. But I think and hope that the most useful aspects of Project Cortex will be two features that allow you to use machine learning to capture specified fields of metadata for a specified group of content in specified SharePoint document libraries.
- for structured documents (forms and other types of documents that follow a set template) Cortex provides a forms processing feature that allows you to identify particular elements of the form/template and map them to a metadata field. For any instance of that type of document the machine will enter the value found at that place in the document into the given metadata field.
- for unstructured documents you are able to train a machine learning tool to recognise certain types of content and recognise certain attributes of that content. This is done using a ‘machine teaching’ approach, where information professionals and/or subject matter experts explain to the machine the reasoning behind what features in the documents they want the machine to look for.
Project Cortex will provide a ‘Content centre’ within which Information professionals and/or subject matter experts can use machine teaching to build particular machine learning models. These models can be published out to particular SharePoint document libraries. The model can then populate metadata fields for documents uploaded to the library.
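The forms processing idea described above can be sketched in miniature. Cortex itself uses trained models rather than hand-written patterns, but the underlying concept – a fixed template element mapped to a metadata field – looks something like this (the field names and sample document are invented):

```python
import re

# Illustrative only: map elements of a known template to metadata
# fields. A real forms processing model learns these positions;
# here simple patterns stand in for the trained model.
TEMPLATE_FIELDS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "supplier": re.compile(r"Supplier:\s*(.+)"),
}

def extract_metadata(text):
    """Populate metadata fields from a structured document."""
    metadata = {}
    for field, pattern in TEMPLATE_FIELDS.items():
        match = pattern.search(text)
        if match:
            metadata[field] = match.group(1).strip()
    return metadata

doc = "Invoice No: INV-2041\nSupplier: Acme Engineering Ltd\nTotal: 1,200.00"
print(extract_metadata(doc))
```

Every document following the template gets its metadata fields populated without the uploader typing anything.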
It would seem, from what Microsoft are saying about it, that the machine teaching capability will play to the strengths of information professionals, because it will use their knowledge of the business logic behind what metadata is needed about what content. The disadvantage of the machine teaching approach is that it won’t scale corporate-wide. You will have to target the areas you want to develop machine learning models for, just as in the on-premise days you had to target the areas you would design tailored sites and libraries for.
The developments that are driving change in document management
The following four developments are driving change in document management:
- the move of email systems and corporate document management systems to the cloud;
- the emergence of cloud suites (Office 365 and G Suite) that bring both document repositories and messaging systems (email and chat) into one system;
- the development of enterprise social graphs within those suites that map people to the content that they create (and react to) and the topics that they work on;
- the development of machine learning.
These four developments are interdependent. Machine learning is only as good as the data it is trained on. Within a standalone document management system there is simply not enough activity around documents for a machine learning tool/search tool to work out which documents are relevant to which people. A machine learning tool/search tool is much more powerful when it can draw on a graph of information that includes not just the content of the documents themselves and their metadata, but also the activity around those documents in email systems and IM/chat systems.
In their on-premise days Microsoft found it extremely difficult to build shared features between Exchange and SharePoint. Now that both applications are in the cloud, both are within Office 365, both share the same API and both share the same enterprise social graph, it is much easier for Microsoft to build applications and features that work with both email and SharePoint.
The gaps that project Cortex may not be able to fill
There are four main gaps in the Office 365 metadata/information architecture model:
- There are constraints on how much use the AI can make of information it finds in emails. Delve confines itself to indexing the attachments of email, and does not attempt to use knowledge within messages. Cortex seems to push that envelope further in that it does penetrate into email messages. If an email mentions an entity (a project, organisation etc.) Cortex will turn the name of that entity in the email into a link to a topic card. However Microsoft states that Cortex will respect access restrictions, so that users will only have access to topic cards about topics that are mentioned in content that they have access to.
- Office 365’s strength is in documents and messages, not data. Most of an organisation’s structured data is likely to be held in databases outside of Office 365, and the Microsoft Graph does not draw on this knowledge.
- The Microsoft graph is geared towards the here and now. It is configured to prioritise recent activity on documents. It is not geared toward providing ongoing findability of documents over time.
- The Microsoft graph is a model that is designed for any organisation. It builds on the commonalities of all organisations (they consist of people who create, edit, receive and share documents and send and receive messages). An organisation with strong information architecture maturity and well established controlled vocabularies in key areas of its business will find that these controlled vocabularies are not utilised by the Microsoft graph. One of the most interesting aspects to watch when Cortex rolls out later this year is the extent to which it integrates with the Managed Metadata Service within SharePoint. What we would really want is a managed metadata service that has strong hooks into Microsoft Graph, so that the Graph can leverage the knowledge encoded in the controlled vocabularies and so that the Managed Metadata Service can leverage the ability of the graph to push out the controlled vocabularies to content via services such as Delve and Cortex.
These gaps provide the space in which records managers, information architects, and the supplier ecosystem in the records management and information architecture space can act.
Below are what I see as the medium to long term priorities for information professionals (and the information profession) to work on in relation to Office 365:
- Put your enterprise into your enterprise social graph. The Microsoft Graph in your Office 365 is yours. It is your data, and sits in your tenant. There is an API to it. You can get at the content. What we want is a marriage between the metadata in your enterprise and the enterprise social graph that has emerged in the Microsoft Graph on your tenant. We need a tool that would enable us to hold controlled vocabularies (or bring in master data lists held in other systems), link them to each other, and hook them into the Microsoft Graph so that the documents and people of the organisation get linked into those metadata vocabularies.
- Make the enterprise social graph persist through time. If the Microsoft social graph is needed in order for Office 365 content to be findable now, then it will still be needed to find and understand that content in five or ten years’ time. The question is how it can serve as an ongoing metadata resource when it is geared up only to act as a way of surfacing content relevant to the here and now. This challenge has both digital preservation and information architecture aspects. The digital preservation aspects concern the question of what parts of the graph we need to preserve and how we preserve them. The information architecture aspects concern what contextual information we need alongside the graph, and how we enable any application built on top of the graph to keep current the security trimming of the results it returns. Could we, for example, have some sort of succession linkages, so that successors-in-post can automatically access the same documents as their predecessors (unless personal sensitivity labels/flags had been applied)?
- Make emails more accessible. Delve and Project Cortex have come up with ingenious ways of unlocking some of the store of knowledge cooped up in email accounts without breaking the expectation of each individual that their email account is accessible only by themselves (or rather, only by themselves and their corporate compliance team). Delve does it by confining itself to attachments. Project Cortex does it by confining itself to items of fact. But this does not alter the fundamental problem that the business correspondence of most individuals is locked inside an aggregation (their email account) that is only accessible for day-to-day purposes to the individual account owner. This is acting as a barrier to day-to-day information sharing and to succession planning. There is nothing fundamentally wrong with having correspondence grouped by individual sender/recipient. People can be mapped to roles and to topics/projects etc. However the truth is that an email account is too wide an aggregation to apply a precise access permission to. What we need is the ability to assign items of correspondence within an email account to the particular topic/project/case/matter or relationship that the item relates to, so that a more suitable access permission can be applied to these sub-groupings. This seems an obvious use of AI.
So here is my wish list for the supplier ecosystem around Office 365:
- a tool that lets you keep your controlled vocabularies, link them to each other, and link them to the Microsoft Graph (or a clear methodology for how to use the Managed Metadata Service to do this);
- a digital preservation tool, or a digital preservation methodology, for preserving (and enriching) those parts of the Microsoft Graph needed for the ongoing understanding of content across Office 365;
- a machine learning tool that within each email account assigns emails to different topics/matters and allows the email account holder, once they have built up trust in the classification, to share access with a colleague (or their successor in post) to emails assigned to a particular topic/matter.
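The third wish-list item can be sketched with a deliberately naive stand-in for the machine learning component. A real tool would be trained rather than keyword-driven, and the matter names and keywords below are invented, but the shape of the task – assign each email to a matter so that access to that matter’s correspondence can later be shared – looks like this:

```python
# Deliberately naive stand-in for a trained classifier: assign
# emails to matters by keyword overlap. Matter names and keyword
# lists are hypothetical.
MATTER_KEYWORDS = {
    "harbour-bridge-project": {"harbour", "bridge", "pier"},
    "iso-certification": {"audit", "iso", "certification"},
}

def assign_matter(subject, body):
    """Return the matter whose keywords best match the email,
    or None if no matter matches at all."""
    words = set((subject + " " + body).lower().split())
    scores = {matter: len(words & kws)
              for matter, kws in MATTER_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(assign_matter("Bridge pier inspection", "Photos from the harbour visit"))
```

Once the account holder trusts the classification, granting a successor access to everything assigned to one matter becomes a single, precise permission decision.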
Sources and further reading/watching/listening
At the time of writing Project Cortex is in private preview. What information is available about it comes from presentations, podcasts, blogposts and webinars given by Microsoft.
On 14 January 2020 the monthly SharePoint Developer/Engineering update community call consisted of a 45-minute webinar from Naomi Moneypenny (Director of Content Services and Insights) on Project Cortex. A YouTube video of the call is available at https://www.youtube.com/watch?v=e0NAo6DjisU. The video includes discussion of:
- the ways that administrators can manage security and permissions around Cortex (from 15 minutes)
- the machine teaching and the form processing capabilities (from 19 minutes)
- the interaction of Cortex with the Managed Metadata Service (from 26 minutes).
The philosophy behind machine teaching is discussed in this fascinating podcast from Microsoft Research with Dr Patrice Simard (recorded May 2019) https://www.microsoft.com/en-us/research/blog/machine-teaching-with-dr-patrice-simard/
The following resources provide some background to graphs:
- In November 2018 Fredric Landqvist wrote this blogpost comparing the Microsoft Graph to the knowledge graphs/ontologies that taxonomists seek to build https://findwise.com/blog/beyond-office-365-knowledge-graphs-microsoft-graph-ai/
- This post gives an introduction to the data science behind enterprise social graphs/knowledge graphs, and shows the connection between graphs and machine learning. https://towardsdatascience.com/graph-theory-and-deep-learning-know-hows-6556b0e9891b
Microsoft Teams is an Office 365 application within which individuals belong to one or more Teams and can exchange messages through one of three different routes:
- channels – messages are visible to all their fellow Team members;
- private channels – messages are visible to a defined sub-set of their fellow Team members;
- chats and group chats – conversations on an ad hoc basis with any colleague or group of colleagues (regardless of whether or not those colleagues are part of their Team).
In a recent IRMS podcast Andrew Warland said that Teams had been adopted enthusiastically by colleagues across his organisation and the volume of communications sent via Team channels and chats had grown rapidly. However the number of emails exchanged had not seemed to fall. In contrast Graham Snow tweeted that he had seen figures of as much as an 85% reduction in email traffic as a result of the introduction of Teams.
This post looks at three questions in relation to MS Teams:
- will correspondence going through Teams be any more accessible, and/or any easier to govern than correspondence going through email?
- what proportions of an organisation’s correspondence is likely to be diverted from email to Teams?
- on what basis should we make retention decisions on correspondence going through Teams channels, private channels and chats?
Will correspondence be more manageable in MS Teams than it has been in email?
The question of whether correspondence in Teams is likely to be more or less accessible and manageable than correspondence in email depends in large part upon which of the communication routes within Teams attracts the most correspondence:
- correspondence going through ‘channels’ within Teams is likely to be both more accessible and easier to govern than email
- correspondence going through private channels, group chats and chats is unlikely to be any more governable and accessible than email.
Conversation though channels
Channels are likely to be more manageable than email because the access model is so simple. Every member of the Team can access everything in every channel of the Team. Channels may however pose something of a digital preservation headache because the conversations are stored separately from any documents/files that are shared in the channel:
- The messages/conversations in the channel are stored in MS Exchange, as a hidden set of files associated with the shared email account linked to the Office 365 Group that the Team is based on.
- Any documents that are posted to the channel are stored in the document library which sits in the SharePoint team site associated with the Team. By default there is one document library to hold the documents of all the channels of the Team.
Conversations through private channels
Private channels are a new feature of MS Teams, introduced in November 2019. They function like channels with the main difference being that the access model is more granular and hence more complex. Every private channel within a team has its own bespoke list of members able to access it.
The storage arrangements for private channels are more complex than those for channels:
- The correspondence from private channels is stored (as invisible files) in MS Exchange, attached to the email accounts of all of the participants in the private channel.
- The documents are stored in a SharePoint document library. Each new private channel creates a new site collection in SharePoint, just to accommodate the one document library it needs to store the documents uploaded to the private channel.
In the December 2019 episode of the O365Eh! podcast Dino Caputo described the angst in the SharePoint community at the fact that each new Teams private channel creates a new SharePoint site collection, and asked the Microsoft Teams product lead Roshin Lal Ramesan why it had been designed like that. Roshin said it was to protect the confidentiality expectations of the participants of a private channel by making sure that the documents they send were not visible to the owner of the Team that the private channel is based in.
Microsoft have designed the storage arrangements for private channels to take account of the fact that it is not normally necessary or recommended for a Team owner to be the most senior person in the team. Private channels give (for example) the possibility for senior managers within a Team to have a channel for communications which the team owner cannot see into.
Roshin explained that by default the Team owner becomes the site collection administrator of the SharePoint site that is automatically created when a new Team and a new Office 365 group is created. The Team owner can see all the content stored in that SharePoint site collection. When a private channel is created within the Team a further new site collection is created, to which the Team owner has no access unless they are themselves a participant in the private channel.
Private channels were introduced after a mountain of requests from customer organisations. Organisations may pay a high governance price for their request being granted. As private channels proliferate within Teams, so they will also proliferate new sites in SharePoint. In anticipation of this Microsoft have quadrupled the number of site collections that an organisation is able to have in their implementation, from 500,000 to 2 million.
Access to old private channels will degrade over time. Microsoft’s model is that when the owner of a private channel leaves, the ownership of the group defaults to another member of the private channel. Once the private channel ceases being used then access to the private channel will degrade, with fewer and fewer people being able to access it.
Implementation scenarios for MS Teams
The question of whether channelling correspondence through Teams makes that correspondence more or less useful and manageable than email depends then on the balance between channels on the one hand, and private channels and chats on the other.
We can identify two different scenarios:
- In scenario one channels are dominant, and the correspondence in MS Teams is more accessible and manageable than email correspondence.
- In scenario two private channels and chats are dominant and correspondence is harder to govern over time than correspondence in email accounts, and hardly more accessible than it would have been had it gone through email.
Scenario 1: Teams dominated by channels
In scenario 1 each organisational unit is given a Team, as are some cross-organisational projects. Each team is relatively small which means that talk in a channel can be relatively frank. Each Team defines a relatively small number of channels to cover its main areas of work. Individuals continue to use their email account as their main source of correspondence, but also use their Teams client for quick communications.
In this scenario the Team owner rarely adds colleagues from outside their organisational unit to the Team, because that would give them the ability to see all the existing correspondence in all the channels.
Scenario 2: Teams dominated by private channels and chats
In scenario 2 the organisation increases the average size of Teams, giving each individual a bigger pool of people to interact with through channels and private channels. The team still has a set of channels for matters that concern the whole team. Because the Team is so much bigger there is a need for private channels, in part to minimise the noise for individual Team members from traffic that does not relate to them, and in part to enable team members to talk frankly. The Teams client becomes more important than the email inbox for many colleagues, particularly those that are internally facing.
In this scenario some individuals external to the Team and even to the organisation can be added as members, visitors or guests so that a team can interact with them via a channel or private channel. Individuals find that their Teams client becomes more complex. In it they see not just their own Team’s channels, and the private channels that they are part of, but also the channels (and perhaps some private channels) of other Teams that they have been added to. Individuals learn to adapt to the new environment, turning notifications from different channels on and off depending on their perception of the relevance and usefulness of each channel/private channel.
In this second scenario some individuals begin to watch their Teams client more closely than their email client, and send more messages through Teams than they do through email.
The tension between governability and growth in MS Teams
A reading of the two scenarios above suggests that:
- If an MS Teams implementation stays tightly governed (as in the first scenario) then correspondence going through Teams is more accessible and manageable than it would have been had it gone through email, but the platform only takes a minority of correspondence traffic away from email.
- If MS Teams governance is loosened (as in the second scenario) then Teams has a real chance of reaching parity with email for internal correspondence, but at the price that the correspondence is barely any more accessible than it would have been had it gone through email, and is harder to manage than it would have been had it gone through email.
Predicting how much traffic MS Teams will take from email
Whether or not a particular individual experiences a fall in email traffic as a result of the introduction of MS Teams is likely to depend on who they do most of their communicating with:
- Those individuals whose correspondence is predominantly with close colleagues who belong to the same Team may experience a significant switch of correspondence from email to MS Teams.
- Those individuals whose correspondence tends to be with colleagues spread more widely across the organisation will experience less of a reduction in emails.
- Individuals whose correspondence is predominantly with people outside their organisation may notice little or no difference at all.
Even within one organisation you will see wide variation in take-up, with some internally facing teams using Teams for 80% of their communications but other externally facing teams using it for 20% or less of their correspondence.
The two main barriers that are likely to hold MS Teams back from becoming the main channel for written communications are that:
- Teams is not designed for collaboration with people outside the organisation. It is true that Teams out of the box allows the owner of a Team to add a person external to the organisation to the Team, simply by adding their email address. However once a person is added to a Team they can see all the past and present content of all the Channels in the team. This may lead some organisations to turn off entirely the ability for Team owners to add external contacts to their Team.
- Teams clients get more and more complex the more Teams an individual is a member of, the more channels and private channels they belong to, and the more chats they are part of. The AvePoint blog put it like this:
As any Microsoft Teams user knows, the “left-rail” of the Teams interface gets hard to organize once you’re part of many Teams and named group chats. “Chats” quickly fall out of view if they aren’t pinned to your left-rail and you get bombarded by chats every day (and who doesn’t?). Given that you cannot even search for named group Chats in the mobile clients, this experience can get infuriating if you’re often on the road.
Counterbalancing the above two tendencies is the fact that many people will find working out of a Teams client quicker and more effective than working out of an email account. If and when more people in the organisation prefer working out of a Teams client to working out of an email client, a tipping point may be reached at which the Teams client replaces the email account as the main source of communication for a significant portion of the internally facing staff of the organisation.
It is possible that in some organisations the volume of traffic going through Teams could approach parity with email. In order to reach this position of near parity with email, Teams will have to become loosely governed. Private channels, group chats and one-to-one chats will proliferate. These are the three types of communication in Teams that (unlike channels) offer no governance advantage over correspondence through email.
The retention of conversations in channels, private channels, and chats
I have long held the belief that whatever correspondence/messaging tool eventually overtook or reached parity with email would be harder to manage and govern than email.
This is because the long term trend is for the velocity and volume of correspondence to increase. When the velocity of correspondence increases the average value of individual messages reduces, even though the total value of all the correspondence of an organisation does not diminish.
The lower the average value of messages the harder it is to tell significant messages from insignificant messages. A one-line message in a channel, private channel, group chat or chat only makes sense in the context of the rest of the messages in that channel, private channel, group chat or chat.
Donald Henderson kicked off a debate on the Records Management UK listserv about how long to keep messages sent through MS Teams. In his post Donald describes how he had first wanted to impose a regime of deleting all posts in teams after one month, but that faced opposition so his next suggestion was six months. He went on to relate that:
It has now been suggested to me that some sections of the organisation will actually post ‘important stuff’ in chat – the example being quoted was interactions round a major capital building project, including with the contractor. My thoughts are that this sort of stuff starts to warrant retention as a record of the capital project, i.e. 25 years and possibly permanent retention depending on the project.
Donald is right. If your colleagues are using MS Teams for interactions with a contractor about a major capital project then those interactions should indeed be kept for the retention period applicable to records of major capital projects. The fact that colleagues in the organisation did not want Teams chats deleted after one month shows that Teams is serving as a record system for those interactions that are being conducted through it. Donald went on to describe the downside of retaining conversations conducted through Teams as records:
Since it is really hard to get rid of individual items of chat (only the poster can delete their own posts), this raises the spectre of retaining every item in a Team site for the entire retention period. The thought of a subject access request or, probably worse, an FOI request for all the stupid GIFs that have been posted is just a bit concerning.
When organisations are faced with a high velocity correspondence system their first reaction is usually to apply a one-size-fits-all retention policy across the entire system.
A one-size-fits-all retention policy will work for 90% of your email accounts/Teams/chats.
If you set a policy of retaining email/Teams correspondence for x years (with x being a number between two and seven) then that would work for 90% of the email accounts and Teams channels/private channels/chats that you have. The problem is that the 10% it doesn’t work for contains the most important 10% of your correspondence.
The basis of records management is that different activities have different levels of impact and importance, and this should be reflected in retention periods. We have found consistently over the past two decades that many decisions were documented only in email. If Teams really takes off and approaches parity with email as a correspondence medium then you will find that some decisions are documented only in Teams.
The pragmatic approach is to manage by exception. This way we can still set a one-size-fits-all retention period for email and Teams but we only apply it to 90% of email accounts, 90% of Teams, and 90% of individuals using Teams chat. An exception should be made for the 10% of email accounts, Teams and Teams chat accounts that are used by those people who have responsibility for our organisation’s most strategic/valuable/important/impactful work. Choosing which individuals and which Teams constitute that 10% is a matter of good records management judgement, and is the type of judgement that people in our profession are qualified to make.
To accompany such a policy we should reserve the right to use human and/or automated means to identify and separate out trivial and personal correspondence from that 10%.
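The manage-by-exception approach described above can be sketched in a few lines of code: a single default rule covers the 90%, while a named minority of strategically important accounts is held back for appraisal instead of automatic deletion. All account names, retention periods and rule labels below are invented for illustration; this is a minimal sketch of the policy logic, not any product's API.

```python
# A sketch of retention "managed by exception": one default rule for
# most accounts, with a named minority of strategically important
# accounts routed to appraisal instead of automatic deletion.
# Account names, periods and labels are hypothetical examples.

DEFAULT_RETENTION_YEARS = 5          # the one-size-fits-all rule

EXCEPTIONS = {                       # the ~10% judged most important
    "chief.executive@example.org":  "retain_for_appraisal",
    "capital.projects@example.org": "retain_for_appraisal",
}

def retention_rule(account):
    """Return the retention treatment for an email or Teams account."""
    return EXCEPTIONS.get(
        account, f"delete_after_{DEFAULT_RETENTION_YEARS}_years"
    )

print(retention_rule("ordinary.user@example.org"))    # delete_after_5_years
print(retention_rule("chief.executive@example.org"))  # retain_for_appraisal
```

The judgement embodied in the `EXCEPTIONS` list is exactly the records management judgement the paragraph above describes: the code only makes the policy explicit and auditable.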
I recently recorded an IRMS podcast with Alan Pelz-Sharpe, co-author of perhaps the first book on the use of artificial intelligence (AI) for information management purposes. In the interview Alan said that he thought that the transition to managing information through AI would be an even greater change than the transition from analogue to digital working in the 1990s.
Recordkeeping is a continuous activity. History suggests that once a society starts keeping records nothing short of the end of that society causes it to stop. Recordkeeping becomes integral to the functioning of individual organisations and to society as a whole. Fundamental change in society leads to fundamental change in recordkeeping practices.
This post maps how recordkeeping changed after both the industrial revolution and the digital revolution, and predicts how it will change after the AI revolution. The post looks across the whole sweep of recordkeeping history to show how the digital revolution changed records management from a service function into a policy function, and how the AI revolution will change it once again, this time into a data science.
In looking over this broad sweep of recordkeeping history we will see three broad trends:
- The ever increasing volume of records created;
- The ever increasing dominance of structured data systems over unstructured data;
- The ever increasing ability to re-classify and re-aggregate all records in a records system.
The AI revolution offers new powers to records managers/information governance professionals to intervene effectively within and across records systems for governance purposes. Like any power this comes with responsibility, and the need to use the power wisely and safely.
This post attempts to outline both the opportunities that AI will offer, and the questions it will pose, to the records management and information governance profession.
Recordkeeping before the industrial revolution
In the United Kingdom our National Archives has an unbroken set of government records dating back to the turn of the 13th century – the point in time at which the administration of the English kings started keeping copies of the letters and charters that they sent out. From the 13th century until the mid-nineteenth century we can characterise recordkeeping as follows:
- Records consist largely of correspondence. Correspondence is predominantly kept in simple chronological sequences.
- The volumes of records are low. The size of the royal administration is very small. The pace at which correspondence moves (on horseback on bad roads, or by water) is slow.
- There is no need for a records management profession because the chronological sequences of correspondence do not require records management expertise to manage.
- Some records do not take the form of correspondence but instead take the form of entries into registers, inventories, index books or ledgers. These are the beginnings of what would later be called structured data.
At this early point in the history of recordkeeping we can already see the fundamental difference between ‘structured’ and ‘unstructured’ data:
- An item of correspondence (a letter) is unstructured data because at the point of its creation it stands and moves independently of any system or structure. The letter therefore has to be integrated into some kind of structure with other pieces of correspondence in order for it to fully function as a component part of a record.
- In contrast an entry into a register, inventory, index book or ledger is an example of structured data because from the moment of its creation it is already integrated within a structure with other similar entries.
One of the enduring endeavours of records management practice has been to ensure that records are consistently captured into a coherent structure. This endeavour is vital when organisations are predominantly creating unstructured data such as free-standing and free-moving correspondence and documents. However it is not nearly as useful to organisations carrying out their work through structured data systems because the structure of the database is set from the outset of the system and records are captured into the structure at the moment of creation.
After the industrial revolution
The industrial revolution at the turn of the nineteenth century first started to bring large concentrations of manual workers together. By the turn of the twentieth century large concentrations of clerical workers were being brought together in bureaucracies of ever growing government departments, businesses and other institutions. This led to a revolution in recordkeeping:
- The volume of records created is now much higher. The size of organisations has grown. The pace at which correspondence moves (by motorised transport on tarmac roads, by rail, by steam boat and later by air) is faster.
- Records can best be characterised as documents. These documents are kept in sophisticated filing systems in which one file (or one set of files) is created for each distinct piece of work. The files of similar types of work are grouped into records series that can usually be managed by a single access and retention rule.
- There is a need for a records management profession because the filing systems are sophisticated and the set of retention rules that governs how long records within each different series are kept is also sophisticated.
- In organisations with strong recordkeeping requirements records management is set up as a service. In UK government this service is provided by records staff working in registries. Registries are interposed into the flow of correspondence as it moves from sender to recipient, so that the piece of correspondence is filed before it reaches the recipient. This has the double advantage that it both ensures the item is filed, and ensures that the recipient reads it in the context of the previous correspondence on that case/matter/project/topic.
- An organisation is able to classify its different file series to have an overall integrated structure for all its documentation.
- The volume of structured data also increases, with more sophisticated methods of keeping structured data such as card indexes. This structured data sits outside of the main way that documentation is organised.
The nature of documentation changed after the industrial revolution. The pre-industrial organisation had captured items of correspondence into chronological series. The 20th century practice of a file dedicated to each specific piece of work brought into being new classes of document such as ‘file notes’. These were documents created not primarily as a direct communication from one person/office to another, but as an addition to the file, to ensure that the file could tell the whole story of that work. The growth in the ability to copy documents (initially through typewriters, typing pools and carbon paper, later through photocopiers) enabled a copy of a document to be placed on each different file to which it related.
After the digital revolution
The precursor to the digital revolution was the computerisation by organisations of various processes and workflows in the 1960s, 1970s and 1980s. This computerisation was largely restricted to very predictable, high volume processes such as payroll, financial ledgers and stock control. These processes were computerised through the construction of databases with a data model very specifically adapted to the process in question.
The digital revolution hit the large English-speaking economies in the early 1990s when a way was found of applying a data structure to general business correspondence. The data structure in question was that contained in the email protocol, which specified the format for one internet connected computer to send a message (an email) to another. The spread of email resulted in the spread of computers to the desktop of every clerical worker in the large economies.
Just as a quantum particle such as an electron can be viewed as either a particle or a wave, so emails within email systems can be seen as either:
- unstructured data – emails are stand alone items of correspondence that move from one person to another, and which should at some point be filed together with other documentation from the same type of work OR
- structured data – an email system is a corporate database and each new email is a new entry into the database. Like entries in any other type of database there is no need for either the sender or any of the recipients to file it because it is integrated into the structure/schema of the email system from the moment it is sent/received.
In the early digital age (1990s to the present day):
- The predominant form of records is ‘datasets’. Organisations have multiple databases. Some are specific to a particular process or line of business, others are corporate-wide. An email system is a database of correspondence. A content management system is a database of the content available through a website and/or intranet. A customer relations system is a database of contacts with customers. Some operational databases and logistics databases may hold business critical information and key intellectual property and know-how.
- The volume and velocity of documentation increases exponentially. The coming of email causes the time taken for a piece of correspondence to travel from sender to recipient to vanish to virtually zero.
- There is no overall schema for organising records. Each dataset has its own separate metadata schema/data model.
- Records management becomes a governance/policy function, setting requirements for what individual staff members should and should not do with the documentation and data they create and receive.
- The transfer of structured data from analogue ledgers, index books, inventories, card indexes and registers to digital databases is transformational, because of new powerful ways to process and analyse data that computers bring with them.
- Metadata fields enable machines to ‘understand’ data in structured systems. Machines can perform information management tasks when they are given rules which tell them what actions should be triggered by what value appearing in what metadata field.
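The pre-AI style of machine rule described in the list above can be sketched as a simple lookup: an action fires only when an explicit metadata field carries an exact, unambiguous value, and the machine exercises no judgement of its own. The field names, values and actions below are hypothetical illustrations, not any product's configuration.

```python
# A minimal sketch of pre-AI, rule-based information management:
# actions trigger only on exact values in explicit metadata fields.
# All field names, values and actions are hypothetical examples.

RULES = [
    # (metadata field, required value, action to trigger)
    ("record_series", "invoices", "delete_after_7_years"),
    ("security_mark", "official", "restrict_access"),
    ("status",        "closed",   "move_to_archive"),
]

def actions_for(record_metadata):
    """Return every action whose rule matches this record's metadata.

    The machine makes no judgement of its own: if a metadata value is
    missing, misspelled or ambiguous, no rule fires and nothing happens.
    """
    return [
        action
        for field, value, action in RULES
        if record_metadata.get(field) == value
    ]

record = {"record_series": "invoices", "status": "closed"}
print(actions_for(record))  # ['delete_after_7_years', 'move_to_archive']
```

The brittleness of this approach is visible in the code: a record whose `record_series` field reads "Invoices" rather than "invoices" matches nothing, which is exactly the constraint the three conditions above describe.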
Records management was far less effective in the two decades after the digital revolution than it had been in the four decades before it.
At the start of the digital age the records management profession saw its task as being one of managing electronic documents. This was based on the assumption that the fundamental change involved in the digital revolution was a change of format, from paper to digital. We assumed that unstructured data would still predominate over structured data as it had for the entire history of recordkeeping before the digital revolution. We assumed that items of correspondence would continue to function as unstructured data – free standing, free moving items that needed at some point in their trajectory to be captured and integrated into a structure and a system.
The standard records management strategy in the first decade of the digital age was to configure corporate wide records systems into which documents and correspondence could be captured and integrated with other records within a records classification/structure.
This strategy failed because most correspondence exists as emails within email systems. Their only move is a handover from one email system to another if the sender is on a different email system to the recipient. There is no point at which an email needs to be filed by either the sender or any of the recipients. It is already integrated into the structure and metadata schema of the email system of both the sender and recipient(s).
The only type of record in the digital age that acts as ‘unstructured data’ and needs consciously capturing into a structure is the document created in word processing, presentation or spreadsheet software such as Microsoft Word, PowerPoint or Excel. These documents behave like the documents of the paper age. At the point they are created they are not yet integrated into a structure, and therefore the creator needs to file them somewhere. This creates a need for document management systems.
Corporate document management systems merit attention. They need and deserve careful management. They act as record systems for documents created in packages such as Microsoft Word, PowerPoint and Excel. They provide more than enough work for many practitioners. But we cannot base a profession on them. Our profession has less and less influence over such systems as the vast bulk of the market for them belongs to just two suppliers (Microsoft and Google).
Corporate document management systems stand in uneasy relation to email systems. Document management systems rarely act as record systems for correspondence. Email systems usually act as a record system for documents. When a document needs to be communicated it is typically emailed. The email system has a record of the date the document was sent, who it was sent by, who it was sent to, what message was imparted along with it, and what responses were received by return. The corporate document management system and the email system both cover all corporate activities. Each of them holds most of the organisation’s documents, but the email system has so much more besides in terms of the decision trails around and outside of documents.
The latest generation of collaborative systems (such as MS Teams and Slack) are trying to combat this disconnect between email systems and document management systems by bringing team based communications out of email systems and into a collaborative space. This is a better strategy than seeking to move conversations that have happened in the email environment into a document management environment. It has a good chance of succeeding where individuals are predominantly communicating with a close knit group of people (for example within a project team). However it tends not to work as well when individuals are working across team and organisational boundaries with a changing array of interlocutors on a shifting range of matters. This latter category includes many of the people whose records archivists have typically wanted to see selected for permanent preservation (policy makers, diplomats etc.).
The AI revolution
The AI revolution is happening now, at the start of the third decade of the twenty-first century. It involves a massive expansion in the scope of judgements that can be made by machine intelligence.
Before the AI revolution machines could make information management judgements only in a constrained set of circumstances, namely when all three of the following conditions were met:
- the machine is explicitly programmed how to make the judgement;
- the judgement can be made on the basis of values in metadata fields;
- the values in those metadata fields were clear and unambiguous.
The AI revolution allows machines to make judgements without having been explicitly programmed to do so. We no longer need to set out each step a machine needs to follow. If we use a machine learning tool to identify which emails in email correspondence could be classed as ‘business’ correspondence we would in effect be using a set of algorithms (the machine learning model) to develop another set of algorithms (the algorithms that will distinguish business from personal/trivial email on the basis of patterns observed in the data).
The most obvious way of training a machine learning tool to identify business correspondence within an email system is to feed it a training set of emails, each of which is labelled as either ‘business’ or ‘personal/trivial’. The machine learning model looks for the features in the set of business correspondence whose values tend to differ from those of the same features in the non business correspondence. The tool comes up with a hypothesis algorithm, setting parameters for each data feature. The algorithm is typically then tested by being fed with a mixed set of business and non business emails to see how accurately it distinguishes the two.
Whereas machine-based rules before the AI revolution worked on certainty, algorithms work on probability. An informal tone in an email might increase the probability that an email is trivial (or personal), but it does not give certainty. By taking into account other data features (the subject line of the email, the number of recipients, the roles of the recipients, the topic of the email as indicated by words in the body of the message etc.) the algorithm is able to increase its own confidence in its classification of the email as ‘trivial/personal’ (or as ‘business’). Machine learning algorithms can tell you not only what they have classified an item as, but also the percentage certainty with which the judgement has been made. This can help the organisation set threshold certainty levels below which judgements should be checked by humans.
The nature of recordkeeping after the AI revolution
On the basis of the previous history of recordkeeping, here are some predictions as to how recordkeeping will be shaped by, and will adapt to, the AI revolution:
- Records management/information governance will become a data science, overseeing algorithms that apply record classifications and/or record retention and access rules.
- The point at which we know information governance has entered the AI age is the point in time after which access and retention rules are applied to aggregations into which records have been assigned by machine learning algorithm.
- To an algorithm everything is data. If there are patterns in a set of data then an algorithm can learn those patterns and use its knowledge of those patterns to make distinctions. Machines are no longer restricted to acting on highly structured metadata. Algorithms can identify patterns in any kind of data, structured or unstructured.
- Organisations will continue to have multiple databases. Some algorithms might use data from one database to manage data in another (for example you might use information taken from job descriptions in an HR database to assist algorithms identifying important business emails in an email system).
- The volume and velocity of documentation and data will continue to rise, as AI algorithms generate content (for example by automated replies or automated chat bots) as well as help manage it.
- Algorithms, like humans, tend to understand data best when they view it in the context of its originating application. Email is best understood within email systems, or within repositories that can replicate the structure and functioning of email systems. There is no longer any necessity to move content out of one structured database (such as an email system) into another system.
- Organisations will have the technical possibility of having one overall structure/schema for organising records. But this dream is likely to remain elusive due to the fact that data created within a structured dataset is usually much more meaningful and manageable within the structure of that dataset than it would be outside of it. Algorithms will be used more often to make data in a dataset manageable than to break data out of its original dataset to manage it through an alternative structure.
- AI brings with it some possibilities that humans have never had before. For example the possibility to restructure an entire records system to enable access and retention rules to be applied to a completely different set of aggregations than were present when individual action officers created or received the documentation. Learning whether (and if so how and when) to use this capability will be a challenge for the recordkeeping profession.
Records management in an era dominated by structured data
The rise of structured data poses a challenge to records management theory. This theory has, for the most part, been based on the assumption that the majority of records (including correspondence and other types of documentation) are created as free standing objects (unstructured data) that move independently of any structure and therefore need at some point to be integrated into a structure.
This theory needs to be refined to enable it to adapt to the reality that since the digital revolution even correspondence is created and shared within a structured database. Such a theory would de-emphasize the importance of building record structures to integrate records into (because most records, including all email correspondence, are created within a database that already has a structure and schema). It would instead emphasise the importance of establishing a defensible, pragmatic and consistent basis for the application of retention and access rules across the different structures and schemas of the different datasets of the organisation.
AI and the possibility to re-structure and re-aggregate entire records systems
The most far-reaching change of the AI revolution is that the ability to re-organise all the items in a record system is for the first time unconstrained by the original metadata schema of that system. A records management/information governance team will in theory have the ability to use any relevant classification logic (any classification scheme that bears any relation to the content of the record system) to re-aggregate content in a record system. The team will be able to apply retention and access rules through those new aggregations. The re-aggregation could be carried out at any time in the existence of the system (meaning that a record might be reassigned to a new aggregation for governance purposes one second, one day, one month, one year, one decade or one century after its creation or receipt).
This poses two fundamental questions for records management theory and practice:
- What are the implications of the ability to re-classify, re-aggregate and/or re-label all the items in a record system for a profession whose aim has traditionally been to build and maintain governance regimes for information based on predictable access permissions and predictable retention rules being applied to predictable aggregations of records?
- What are the implications of the possibility to assign retention and access rules to aggregations that did not exist when the records were originally created and received, and that the creators/recipients of the records would not have envisaged being used to apply access and retention rules?
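The re-aggregation idea can be sketched in a few lines of Python. Everything here is hypothetical: the keyword-based `classify` function merely stands in for a real AI classifier, and the records, activity labels and retention periods are invented for illustration.

```python
from datetime import date

# Hypothetical records, originally aggregated only by email account.
records = [
    {"id": 1, "account": "alice", "text": "Minutes of the procurement board", "created": date(2015, 3, 1)},
    {"id": 2, "account": "alice", "text": "Lunch on Friday?", "created": date(2021, 6, 2)},
    {"id": 3, "account": "bob", "text": "Procurement contract award notice", "created": date(2012, 1, 15)},
]

def classify(text):
    """Stand-in for an AI classifier mapping content to a business activity.
    A real system would use a trained model; a keyword suffices for the sketch."""
    return "Procurement" if "procurement" in text.lower() else "General"

# Retention rules keyed to the NEW aggregations, not to email accounts.
retention_years = {"Procurement": 10, "General": 2}

def is_due_for_disposal(record, today=date(2024, 1, 1)):
    """Apply the retention rule of the aggregation the classifier assigns."""
    activity = classify(record["text"])
    return (today - record["created"]).days / 365.25 > retention_years[activity]

for r in records:
    print(r["id"], classify(r["text"]), is_due_for_disposal(r))
```

The point of the sketch is that disposal decisions flow from the classifier's aggregations, and the classification could be run (or re-run) at any point in the life of the record system.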
To make these questions more concrete let us think of them in relation to email – the great unsolved records management challenge brought by the digital revolution.
In email systems correspondence is aggregated into email accounts and access permissions are applied to correspondence via those accounts. AI opens up three options for the application of retention rules and access permissions to emails:
- Ignore the existing structure/schema – bypass email accounts: use AI to re-aggregate email correspondence (for instance by applying a corporate records classification) so that access permissions and/or retention rules are no longer applied via email accounts but instead through the records classification.
- Stick with the existing structure/schema – make email accounts manageable: use AI to make email accounts more manageable by identifying trivial, personal and sensitive emails within them.
- Use the existing structure and schema as a starting point – enhance email accounts and then move beyond them: use AI to classify emails within email accounts by business activity, but continue to use email accounts as the main aggregation for the application of access permissions. As individuals get used to the machine classification of their email by business activity, so they could be given the option of opening up access to correspondence on selected activities within their email account to selected colleagues.
The first approach is high risk; the second is low benefit. The third approach offers the possibility of providing benefits to individual email account users and their colleagues through incremental change.
We should be looking for what Dave Snowden might call ‘safe-fail’ approaches to the introduction of AI. In such approaches machine learning classifications are first introduced alongside (or within) existing structures, then gradually begin to become more influential over the application of access permissions and retention rules as confidence in the machine learning process grows.
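A minimal sketch of such a safe-fail arrangement, in Python, under stated assumptions: the message fields, the stand-in classifier and the confidence threshold are all invented. The existing structure keeps governing filing and access, the machine learning classification runs alongside it in 'shadow mode', and its accuracy against human review decides whether it is yet trusted to influence governance decisions.

```python
def existing_rule(email):
    """The existing structure/schema: aggregate by email account (always applied)."""
    return email["account"]

def ml_classify(email):
    """Stand-in for a machine learning classifier; a real one would be trained."""
    return "personal" if "birthday" in email["subject"].lower() else "business"

def safe_fail_filing(emails, reviewer_labels, confidence_threshold=0.9):
    """File by the existing rule, run the classifier in shadow mode, and
    report whether its accuracy against human review crosses the threshold
    at which it could start influencing access and retention decisions."""
    agreements = sum(1 for e, label in zip(emails, reviewer_labels)
                     if ml_classify(e) == label)
    accuracy = agreements / len(emails)
    filed = {e["id"]: existing_rule(e) for e in emails}  # governance unchanged
    return filed, accuracy, accuracy >= confidence_threshold

emails = [
    {"id": 1, "account": "alice", "subject": "Happy birthday!"},
    {"id": 2, "account": "bob", "subject": "Q3 budget forecast"},
]
filed, accuracy, trusted = safe_fail_filing(emails, ["personal", "business"])
```

The design choice being illustrated is that a wrong machine classification is harmless in this mode: it affects only the accuracy score, not where anything is filed or who can see it.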
The theories and explanations outlined in this article have been developed during the course of my Loughborough University doctoral research project which is looking at archival policy towards email from a realist perspective. A paper from this project ‘The defensible deletion of government email’ was published by the Records Management Journal in March 2019. An open access version of this paper is available from Loughborough University’s digital repository here (once in the repository click on ‘download’ to download the pdf, or read it in the window provided).
This is the text of a talk I gave in London on 26 September 2019 to the UK Government Knowledge and Information Network. I have revised and extended the text.
Think of all the correspondence moving into, out of and around your organisation.
Think of the structure or schema into which you would like all important items of business correspondence to be assigned so that they can be found and managed. Think of the records system that the structure/schema sits in.
Who would you like to file important items of correspondence into that structure/schema: humans or machines?
Trial no 1: humans versus machines that can learn
Imagine you set up a trial:
- you tell every member of staff to file important pieces of correspondence into your records system with your preferred structure/schema;
- in parallel you set up a group of machines to look at all the correspondence coming in and out, to select important correspondence and file it into the same structure/schema as the humans.
Who would you like to win this trial – the humans or the machines?
Who would you expect to win the trial?
Most of us in the records and information management professions would want the machines to win. If the machines win they take the filing workload off the heads of our colleagues. This frees our colleagues up to focus on the job they were employed for.
We would expect the machines to win provided that:
- the machines were capable of learning a fairly complex structure;
- there was a feedback loop between humans and machines so that the machines had their mistakes pointed out to them;
- the machines were learning machines that could adjust their algorithms in response to feedback;
- the trial ran long enough for the machines to improve after many iterations.
We do not yet have the automation necessary to assign correspondence routinely to a node in the kind of complex, multi-level, corporate-wide taxonomy/fileplan/retention schedule that records managers like to use to manage records.
The nature of automation projects currently being undertaken
The types of automation project we are seeing in information management at the time of writing are mainly based on binary questions:
- The legal world has been making progress with predictive coding projects that seek to use machine learning to answer the binary question ‘is this content likely to be responsive to a specific legal dispute?’;
- In the US, NARA’s Capstone policy has motivated some US Federal Agencies to use machine learning to answer the binary question ‘is this email needed as a record?’, and a similar project is being undertaken by the Nationaal Archief of the Netherlands (their report, in Dutch, is here);
- The Better Information for Better Government programme run by the UK Cabinet Office will shortly set up a project to develop an artificial intelligence tool that can distinguish important from non-important government emails (see the call for expressions of interest they issued in August);
- Graham MacDonald has worked on a process for using automation to support the sensitivity review of records, using machine learning to predict whether or not any particular document is likely to be covered by one of the UK’s Freedom of Information exemptions (see his thesis).
We are going to be able to deploy machines sooner if we can find binary questions for them to resolve than if we wait until machines can assign content to nodes within complex multi-level taxonomies/fileplans/retention schedules.
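To illustrate why binary questions are the tractable starting point, here is a toy naive Bayes classifier answering the binary question ‘is this email important?’. It is a sketch only: the training examples are invented, and a production tool would use a proper machine learning library, proper tokenisation and far more data.

```python
from collections import Counter
import math

def train(examples):
    """examples: list of (text, label) pairs, label in {"important", "trivial"}.
    Returns per-class word counts and document counts for naive Bayes."""
    counts = {"important": Counter(), "trivial": Counter()}
    totals = Counter()
    for text, label in examples:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def predict(counts, totals, text):
    """Pick the class with the higher log-probability, using add-one smoothing."""
    vocab = set(counts["important"]) | set(counts["trivial"])
    scores = {}
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))  # log prior
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / denom)   # log likelihood
        scores[label] = score
    return max(scores, key=scores.get)

examples = [
    ("minutes of the policy board meeting", "important"),
    ("draft submission to the minister", "important"),
    ("cake in the kitchen", "trivial"),
    ("anyone for lunch today", "trivial"),
]
counts, totals = train(examples)
```

A two-way decision like this needs far less training signal than assigning each email to one node among hundreds in a corporate fileplan, which is why the binary projects listed above are feasible today.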
The records management demands we make of human beings
For most of the twentieth century human beings succeeded in filing correspondence into what were often very sophisticated filing structures. In the twenty-first century this no longer holds true. In the twentieth century humans filed correspondence because only humans could file it. In the twenty-first century email correspondence has been filed automatically by the automation built into email systems. Any injunction asking civil servants to move email correspondence into another system is in effect asking them to re-file that correspondence.
The automation built into email systems
The automation built into the proprietary email systems rolled out in the mid to late 1990s was not machine learning. The machines in proprietary email systems could not learn; all they could do was follow rules. Even now, two decades later, proprietary email systems only assign correspondence into a very simple structure and schema.
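The contrast with learning machines can be made concrete with a sketch of rules-based filing of the kind described. The rules and field names here are invented for illustration and do not reflect any specific product’s API; the essential property is that each rule is fixed in advance, applied in order, and never adjusts itself in response to mistakes.

```python
# Fixed filing rules, evaluated in order; the first match wins. Unlike a
# learning system, these rules never change in response to feedback.
RULES = [
    (lambda m: m["sender"].endswith("@newsletter.example"), "Newsletters"),
    (lambda m: "invoice" in m["subject"].lower(), "Finance"),
]

def file_message(message, default="Inbox"):
    """Rules-based filing: predictable, instantaneous, incapable of learning."""
    for test, destination in RULES:
        if test(message):
            return destination
    return default
```

So `file_message({"sender": "news@newsletter.example", "subject": "Weekly digest"})` lands in "Newsletters", and anything unmatched lands in the default Inbox, exactly the simple, reliable, inflexible behaviour the 1990s systems offered.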
The reaction of the archives and records management community when email systems were introduced was to point out (quite rightly) the records management deficiencies of a system that aggregates correspondence into individual email accounts and does not distinguish between business correspondence and personal/trivial correspondence. With some exceptions (notably NARA in the US), the records and information management community has not accepted the structure of email systems as being a viable filing structure and in many administrations (including that of the UK) we have continued to ask human beings to re-file important items of correspondence into separate systems.
Trial no 2: humans versus machines that cannot learn
To go back to the idea of a trial with which I started this talk, we have for the past two decades been pitting human beings against machines:
- the humans have been asked to file important items of correspondence into a preferred records system which houses our preferred records structure/schema;
- the machines (in the shape of email systems) have been configured to file correspondence into a simple structure that is inferior for records management purposes.
Who do you want to win this trial? The automated filing or the human filing?
From a records/information management point of view, would you want the machines to win on the grounds that:
- they take the workload off the shoulders of our colleagues
- the filing is very predictable and consistent
- the filing is instantaneous?
…or would you want the humans to win because they would be filing into a structure that permits a more precise application of retention and access rules?
Who do you think would win such a trial?
In theory the humans have more chance of winning this second trial than they did of winning the first. The human filing could prevail if the human beings in the organisation found the records structure/schema so beneficial that they would be prepared:
- to make the extra effort to file correspondence into the designated records system;
- to use the designated records system, rather than their email account, as their main source of reference for their own correspondence;
- to forego the possibility of simply relying on the inferior structure into which the email systems had filed the correspondence.
However, even when officials do highly value the records structure/schema there is still a strong possibility that the machine filing will prevail. I remember when email systems were introduced into UK government in the mid-1990s. Government departments and the civil servants in them valued the then record systems of their organisations (hard copy registered file systems) very highly. Everyone at the time wanted the registered file systems to survive and to make an ordered transition to the electronic world. But within five years of the general introduction of email in UK government all of those registered file systems were in tatters with no replacement systems in place. The introduction of email destroyed those systems.
Why did the automated filing of email systems into a simple structure overcome the value that UK civil servants placed on the much more sophisticated structure of their registered filing systems?
The crucial advantage that the machines (email systems) had was speed. They filed correspondence instantaneously. The automated filing by email systems provided officials with instant access to their correspondence from the moment it left the sender’s account. This acted to accelerate the velocity of correspondence, which in turn increased the volume of items exchanged, which in turn increased the number of items to be re-filed by the human beings.
The introduction of email increased correspondence volumes exponentially and therefore made it to all intents and purposes impossible to have human beings re-file correspondence into a complex corporate structure. In other words the machines moved the goalposts. And won the game!
To put it more simply
- human filing is a viable option when there is a low volume and low velocity of correspondence exchange;
- if the velocity and volume of business correspondence increase exponentially then the human resource to refile it does not scale (not within public sector budgets anyway!).
Machine filing versus human filing – the experience of the past twenty years
The experience of UK government in relation to email over the past twenty-five years can be divided into three phases.
In the first phase (c. 1995 to c. 2003) human beings (civil servants) were asked to print important pieces of correspondence out and place them onto registered files whilst machines (email systems) filed correspondence into email accounts.
In the second phase (c. 2003 to c. 2010) civil servants were asked to file correspondence into electronic records and document management systems whilst machines (email systems) filed correspondence into email accounts.
In the third phase civil servants were asked to file correspondence into collaborative systems (such as Microsoft’s SharePoint) whereas machines (email systems) continued to file correspondence into email accounts.
Over the course of this twenty- to twenty-five-year period progress has been made in the systems into which we have been asking our colleagues to file. We have moved from hard copy to electronic systems; we have moved from electronic records management systems with clunky corporate fileplans to more user-friendly collaborative systems. But the result has been the same in all three phases. In each phase a pitifully low percentage of business correspondence has been moved from email accounts into the record system concerned. The automated filing of email into email accounts has always defeated attempts to persuade humans to get into the habit of re-filing their important correspondence somewhere else.
The policy dilemma posed by the automated filing built into email systems
Email systems have, over the past two decades, used a primitive form of rules-based automation to file emails into a simple structure/schema. This has caused a policy dilemma:
- email systems file email correspondence efficiently, routinely and predictably into email accounts BUT the organisation of correspondence into individual email accounts results in an inefficient and imprecise application of retention and access rules to correspondence;
- in contrast human beings are able to re-file important items of correspondence into a structure that enables retention and access rules to be applied more precisely BUT they are likely to do this infrequently and haphazardly.
The policy dilemma exists in part because records management best practice does not tell us which of the following two policy imperatives is more important:
- the consistent capture of correspondence into a structure/schema; OR
- a structure/schema that supports the precise application of retention and access rules.
Records management best practice does not help us choose between these two competing imperatives because records management best practice wants both! Records management best practice requires the consistent capture of correspondence into a structure/schema that supports the precise application of retention and access rules.
We are faced with two imperfect options. We should choose the least imperfect. The least imperfect option is the option whose weaknesses we are most likely to be able to correct at a future date.
We are working in a period of transition, and the transition is towards the ever greater use of ever more powerful automation, analytics and machine learning. If the present rate of progress with machine learning/artificial intelligence is maintained then we can predict that:
- in the medium term originating bodies will be able to deploy machines to answer binary questions that would help to mitigate the worst faults of email accounts: namely to distinguish important from trivial mail, and personal from business mail;
- in the long term originating organisations will be able to deploy machine intelligence to re-file correspondence into any order that they choose.
Factoring the future of machine learning into present-day policy decisions
If and when we reach a point at which machine learning tools can file correspondence into any order that an organisation wishes then our policy dilemma will be resolved – we will at that point be able to consistently assign correspondence to any taxonomy, records classification and/or retention schedule that an organisation chooses. We would also, one presumes, be able to run the machine learning over legacy correspondence and assign that correspondence to the same taxonomy/records classification/retention schedule. We can anticipate that:
- future machine learning tools will be able to retrospectively correct the weaknesses in the structure/schema of any email accounts that survive;
- future machine learning tools will only be able to retrospectively correct the weaknesses in the capture of email into corporate collaboration systems/electronic records management systems if important email accounts survive.
This logic dictates that we should give a high priority now to ensuring that historically important email accounts survive, in the confident hope that we will later be able to correct weaknesses and inefficiencies in the content, structure and schema of those accounts.
This would require some form of protection being introduced now for the email accounts of officials playing important roles. Business correspondence residing in the email accounts of important UK government officials does not currently enjoy any protection. UK government departments subject email in email accounts to some kind of scheduled deletion. The most common form of scheduled deletion is to delete the content of email accounts shortly after an individual leaves post. This practice complies with the National Archives’ policy towards UK government email, because each department asks its officials to move important email out of email accounts to some form of corporate records system. However the unintended consequence of this policy is that most business correspondence ends up being subject to this deletion.
Affording some protection to the email accounts of officials occupying important roles can be seen as a protect now – process later approach.
This protect now – process later approach involves protecting historically important email accounts in the knowledge that machines are good at dealing with legacy and can at a later date be deployed to filter these records, enhance the metadata and/or overlay an alternative structure on to these records.
Such an approach would no longer require individuals to move important emails to a separate system for recordkeeping purposes (though there may well continue to be circumstances when an organisation for knowledge management/operational purposes requires some teams/areas to move important correspondence out of email systems, or seeks to divert correspondence away from email into other communication channels).
This approach is based on the realisation that deploying human effort to do something (badly) that machines are likely to be able to do (well) at a later date does not make sense in terms of either effectiveness or efficiency.
GDPR implications of a protect now – process later approach
The implication of protecting important email accounts from deletion whilst working on the development of machine learning capabilities is that some personal correspondence is likely to be retained alongside historically important correspondence. This has data protection implications.
GDPR allows the archiving of records containing personal data provided that the preservation of the records is in the public interest, and provided that necessary safeguards are in place and the data protection rights of data subjects are respected. The retention of the work email account of an important official is likely to be in the public interest, and is likely to be compliant with data protection law provided the following conditions are met:
- the role that the individual played was of historic interest;
- the individual could expect their account to be permanently preserved;
- the individual was given the chance to flag or remove personal correspondence;
- access to personal correspondence was prevented except in case of overriding legal need;
- items of correspondence that are primarily personal in nature are removed once a reliable capability to identify them becomes available.
This talk recommends that government departments which use email as their main channel of communication refrain from automatically deleting correspondence from the email accounts of their most important staff, pending the development of automated tools to process the correspondence within those accounts. In practice this is likely to involve protecting only around 5% of their email accounts (using the old archival rule of thumb that 5% of the records of an originating body are likely to be worthy of permanent preservation).
This is not an easy sell to make to government departments. Even though the recommendation only covers around 5% of their email accounts, departments may well feel that these are the 5% that carry the highest potential reputational/political risk, and are the 5% most likely to attract freedom of information requests.
Making such a recommendation is in no sense ‘giving up’ on the records management ambition to have business correspondence consistently assigned to structures and schema that support the use and reuse of correspondence and that support the precise application of retention and access rules. It is simply a recognition that asking civil servants to select and move important email into a separate system has not worked for twenty years and shows no sign of working any time soon. It is also a recognition that we need automated tools to process the material that has been automatically filed by email systems.
Most important of all, this approach of protecting important email accounts gives us a pathway for applying automated solutions to email. It would provide an incentive and an opportunity to deploy tools that work on a binary logic (‘is this email important, yes or no?’, ‘is this email personal, yes or no?’) to mitigate the worst flaws of email accounts from an information management point of view. These tools are not pie in the sky; they are already being used in real-life projects. The hope would also be that in the long term we may have tools that go beyond binary questions and could assign individual emails to a reasonably granular records classification, taxonomy and/or retention schedule.
The theories and explanations outlined in this talk have been developed during the course of my Loughborough University doctoral research project which is a realist evaluation of archival policy towards UK government email. A paper from this project ‘The defensible deletion of government email’ was published by the Records Management Journal in March 2019. An open access version of this paper is available from Loughborough University’s digital repository here (once in the repository click on ‘download’ to download the pdf, or read it in the window provided).