Auto-classification – will cloud vendors get there first?

It is easy to predict that data analytics, auto-classification and the cloud will have an increasing impact on records management.   The big question is whether these three trends will act separately or in combination.

My guess is that they will have their most powerful effect when they are used in combination – when a cloud provider uses analytics on the content it holds on behalf of a great many different customers to auto-classify content for each customer.

Let's look at each of these phenomena separately and then see what happens when they come into combination with each other.

Data analytics

Analytics is the detection by machines of patterns in content/data/metadata in order to derive insights.  These insights may be designed for action by either people or machines.

You can run analytics across any size of data set, but it yields more insight when run across large data sets.

When people talk about big data they are really talking about analytics.   Big data is what makes analytics effective.   Analytics is what makes big data worth the cost of keeping.

The use of analytics in records management now

The use of analytics in records management is still in its infancy.

There are many tools (Nuix, HP Control Point, Active Navigation etc.) that offer analytics dashboards displaying insights gleaned from crawls of shared drives, SharePoint, e-mail servers and other organisational systems.  They allow an administrator to drill into the index to find duplicates, content under legal hold, ROT (redundant, outdated and trivial documentation), etc.

Records managers and information governance managers are, to the extent that they are using analytics at all, using these tools to deal with legacy data.    Content analytics tools  are put to use to reduce a shared drive in size, or prepare a shared drive for migration, or apply a legal hold across multiple repositories etc.   All worthy and good.   But it means that:

  • we are using  analytics on content with the least potential value (old stuff on shared drives) rather than on content with the most potential value (content created or received today)
  • we are using analytics to reduce the cost of storing unwanted records but not to increase access to, and usage of, valuable content
  • we have a very weak feedback loop for the accuracy of the analytics because the decision on whether something is trivial/private/significant is being made at a point in time when it has little consequence for individuals in the organisation (and hence they will neither notice nor care if a mistake is made)


Auto-classification is a sub-set of analytics.    Auto-classification uses algorithms and/or rules to assign a digital object (document/e-mail etc.) to a classification category (or a file/folder/tag) on the basis of its content, or its metadata, or the context of its usage.

Auto-classification is becoming a standard offering in the products of e-mail archive vendors like Recommind, content analytics vendors like Nuix and HP Autonomy, and enterprise content management vendors like IBM and Open Text.

There is a real opportunity for auto-classification, but to harness it we need to overcome two barriers:

  • the trust barrier – organisations are currently reluctant to use auto-classification to make decisions that would affect the access to, and retention of, a piece of content
  • the training barrier – it takes time to train an auto-classification engine to understand the categories you want it to classify content against

Overcoming the trust barrier

Imagine you are a middle manager.   Into your e-mail account every day come two hundred e-mails – some of them are innocuous but others have some sort of sensitivity attached to them.   You do not have a personal assistant.  Do you trust an auto-classification engine to run over your account once a day and assign categories to e-mails which will mean that some of them will become visible or discoverable by colleagues, and others will be sentenced to a relatively short retention period?

I have heard from two different vendors (HP and Nuix) that customers are still reluctant to trust algorithmic auto-classification to make decisions on content that change its access and retention rules.  They report that customers are happier to trust auto-classification where decisions are based on rules (for example ‘if the e-mail was sent from an address that ends in assign it to category y’)  than they are to trust decisions made on an algorithmic reading of content and metadata.   But organisations cannot possibly manually define enough rules for an auto-classification engine to make rule-based decisions on every e-mail received by every person in the organisation.
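A rule-based classifier of the kind customers say they are happier to trust can be sketched in a few lines of Python.  This is only an illustration: the address suffixes and category names are invented, and it is not how any particular vendor's product works.  What it shows is the limitation described above: any e-mail that matches no rule is simply left unclassified.

```python
from typing import Optional

# Hypothetical rules: sender-address suffix -> classification category.
RULES = {
    "@supplier.example": "Procurement",
    "@regulator.example": "Compliance",
}


def classify_by_rule(sender: str) -> Optional[str]:
    """Return the category of the first rule whose suffix matches, else None."""
    sender = sender.strip().lower()
    for suffix, category in RULES.items():
        if sender.endswith(suffix):
            return category
    return None  # no rule covers this sender, so the e-mail stays unclassified
```

Every sender outside the rule table returns None, which is why a purely rule-based approach cannot scale to every e-mail received by every person in the organisation.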

The whole point of auto-classification is to change the access and retention on content, and particularly on e-mail content.   For example the purpose of applying auto-classification to e-mail is to:

  • prevent  important e-mails staying locked in one or two people’s in-boxes inaccessible to others in the organisation
  • prevent important e-mails getting destroyed when an organisation deletes an individual’s e-mail account six months/two years/ six years after they leave employment.

The way to increase trust in auto-classification is to increase its:

  • transparency – make visible to individuals how any e-mail/document is being categorised,  who will be able to see it, and why the engine has assigned that classification
  • choice –   give individuals a way of influencing auto-classification decisions – give them warning of the categorisation before it is actioned, and let them reverse, prevent or change the classification
  • consequence – there needs to be some impact from auto-classification decisions in terms of access and retention; otherwise individuals will not act to prevent or correct categorisation mistakes
  • consistency – nothing breeds trust more surely than predictability and routine
  • use – make sure that the output of auto classification is groupings of e-mails/documents that are referred to/subscribed to/surfaced in search results etc.  This not only means that mistakes are spotted quicker, it also means that the organisation gets collaboration benefits from the auto-classification as well as information governance benefits.
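The transparency and choice points above could be modelled as a proposed classification that carries its own explanation and can be overridden before it takes effect.  The sketch below is a hypothetical data structure, not a feature of any product mentioned here; the field names and the seven-day warning period are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ProposedClassification:
    item_id: str
    category: str
    reason: str                     # transparency: why the engine chose this category
    visible_to: List[str]           # transparency: who would be able to see the item
    proposed_on: date
    review_days: int = 7            # choice: assumed warning period before actioning
    override: Optional[str] = None  # choice: category set by the individual, if any

    def final_category(self) -> str:
        """The individual's override wins over the engine's proposal."""
        return self.override or self.category

    def actionable(self, today: date) -> bool:
        """True once the warning period has passed."""
        return today >= self.proposed_on + timedelta(days=self.review_days)
```

An individual who disagrees with the engine sets `override` during the warning period, and that correction is exactly the feedback an engine would need to learn from.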

There is a bit of a chicken-and-egg situation here:

  • in order for the organisation to develop confidence in auto-classification it needs end users to be interacting constantly with the auto-classification results, so that there is a feedback loop and so the auto-classification engine can learn from end users' behaviour and reactions…
  • …but in order to get that level of interaction the auto-classification needs to be dealing with the most current and hence the most risky content.

This means that in order to overcome the trust barrier we also need to overcome the training barrier,  to get the auto-classification engine accurate enough for an organisation to start using it.

Overcoming the training barrier

An auto-classification engine needs to learn the meaning of the categories/tags/folders that the organisation wants it to assign content to.   The standard way of doing that at the moment is to prepare a training set of documents for each category.  This is time-consuming, especially if your classification is very granular.
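To make the idea of a training set concrete, here is a minimal sketch of a classifier that learns categories from labelled example documents.  It uses a simple naive Bayes word-count model with add-one smoothing and equal weight for every category; this is a stand-in chosen for brevity, and commercial engines may well work differently.  The function names are my own.

```python
import math
from collections import Counter, defaultdict


def train(training_set):
    """training_set: list of (text, category) pairs. Returns per-category word counts."""
    counts = defaultdict(Counter)
    for text, category in training_set:
        counts[category].update(text.lower().split())
    return counts


def classify(text, counts):
    """Pick the category whose training documents best explain the words in text."""
    vocab = {word for words in counts.values() for word in words}
    best, best_score = None, -math.inf
    for category, words in counts.items():
        total = sum(words.values())
        # Sum of log-probabilities, with add-one smoothing for unseen words.
        score = sum(
            math.log((words[w] + 1) / (total + len(vocab)))
            for w in text.lower().split()
        )
        if score > best_score:
            best, best_score = category, score
    return best
```

Every category needs its own pile of example documents before the model can recognise it, which is where the cost comes from when a classification has hundreds of nodes.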

For auto-classification to be viable across large sectors of the economy it needs to work without an organisation having to find a training document set for every node/category of the classification(s) it uses.

There are two ways of replacing the need for training sets.

The first is to piggy back on the training done in other organisations in the same sector.  So if one UK local authority/US state or county/Canadian Province trains its auto-classification engine against its records classification then in theory this could be used by any other local authority/state/county/province.    This may create an incentive for sectors to arrive at common classifications to be used across the sector.

The second is to use data analytics to bring into play contextual information that is not present in the documents themselves and their metadata.     Ideally an auto-classification engine would have access to analytics information about each individual in the organisation.   It would know:

  • which team they belong to
  • what their team is responsible for
  • what activities (projects/cases/relationships) the team is working on
  • where they habitually store their documents
  • who they habitually  correspond with

It would use this information to narrow down its choice of auto-classification category for each document/e-mail created or received by each individual.
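As a rough sketch of that narrowing step, assume the engine has already produced a ranked list of candidate categories and that analytics has produced an index of which categories relate to each team's activities.  Both inputs, and all of the names below, are hypothetical.

```python
def narrow_candidates(candidates, person, activity_index):
    """Keep only categories linked to activities the person's team works on.

    candidates: category names proposed by the engine, best first.
    person: dict with a "team" key.
    activity_index: dict mapping team name -> set of categories for its activities.
    Falls back to the full candidate list when the context gives no match.
    """
    allowed = activity_index.get(person["team"], set())
    narrowed = [c for c in candidates if c in allowed]
    return narrowed or candidates
```

The fallback matters: when the contextual data says nothing useful about a person, the engine should be no worse off than it was without it.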

The opportunity here is to use data analytics to support and train an auto-classification engine, and hence eliminate the need for a training document set.  I see no reason why that shouldn’t work, provided that the data sets that the data analytics runs on are big enough and relevant enough.

It follows from this that the vendors whose auto-classification engines will work the best for your organisation are the vendors:

  • with access to data arising from the content of other organisations in the same sector as yours
  • with access to the broadest range of data from your organisation – including e-mail correspondence data,  social networking data, search log data, and content analytics data from document repositories such as SharePoint and shared drives

Which category of vendor will have access to all of this data?  Cloud vendors.

The cloud is a game changer

I met Cheryl McKinnon at the IRMS2014 conference last week.  She told me that there is a cloud based e-mail archive service called ZL who advise their clients not to delete even trivial e-mails on the grounds that the data analytics runs better with a complete set of e-mails.

What analytics usage could you possibly want trivial e-mails for?   The example Cheryl gave was of a company wanting to be able to predict whether new sales staff will be high performing or not.   It might run analytics on the e-mail of sales staff to surface communication patterns that correlate with high performance.  It could then run the algorithm over the  e-mail correspondence of new staff to see whether they are exhibiting such a communication pattern.  And  trivial e-mails may be just as good an indicator of such patterns as important e-mails.

Data analytics is becoming all pervasive.  Its use will affect every walk of life.  This means that more and more data will be kept to feed the analytics.   The all-pervasive nature of data analytics means that  both cloud vendors (in this case ZL) and their clients have an interest in keeping data that would otherwise have been valueless.

Cloud vendors will acquire more and more data from inside more and more organisations.   This potentially gives them the ability to train and refine content analytics across a wide spread of organisations, and provide auto-classification as part of their cloud service back to their organisational clients.

We can predict that:

  • the relationship between an organisation and its cloud vendor will be completely different from the relationship between an organisation and its on-premise vendor
  • the nature of cloud vendors will be different from the nature of on-premise vendors – for example Microsoft the provider of Office 365 will behave in a completely different way from Microsoft the vendor of on-premise SharePoint and Exchange

Let's think about Microsoft’s strategy.  They have:

  • the leading on-premise e-mail storage software (MS Exchange)
  • the leading on-premise document storage software (MS SharePoint)
  • the leading on-premise productivity suite (MS Office).

In the on-premise world they kept these products separate.  In their cloud offering they have combined them into one (Office 365), and are charging less for the combined package than you might have predicted they would charge for any of the three on its own.  They have also announced plans for integrating enterprise social into Office 365 through ‘codename Oslo’.  This will use analytics data on who each individual interacts with to present personalised feeds of content (Microsoft call this the ‘Office graph’ in a nod to Facebook’s ‘social graph’).

What do Microsoft’s actions tell us?  They tell us that their business model for Office 365 is different from their business model for their on-premise software:

  • In the on-premise world Microsoft  wanted to upsell – by getting existing customers to buy more and more different software packages from them.  Each of their software products had its own distinct brand.
  • In the cloud world Microsoft wants to give customers all of their core products, so that they get the most content and hence the most analytics data from each customer.  They are even prepared to deprecate a brand name like ‘SharePoint’  in favour of a single ‘Office 365’ brand for their cloud package.

How long will it be before Microsoft uses the analytics data it will have gained from across their many customers, to start enhancing metadata, enhancing search, and auto-classifying content for each customer?

The questions this poses for NARA

The US National Archives (NARA) has recently put out for consultation their automated electronic records management report.  The report is part of the mandate given to them by the Presidential Records Management Directive to find ways to help the US federal government automate records management.

NARA’s report gives a good description of autocategorisation, although it is based on the assumption that the autocategorisation engine needs training packages in order to work.   It acknowledges that:

‘the required investment [in autocategorisation] may not be within reach of the smallest agencies, though hosted or subscription services may bring them within reach for many’ (page 13)

NARA is acknowledging here that cloud vendors are more likely to bring auto-classification to many agencies than they are to develop the capability themselves.   This poses some very fundamental questions:

  • Would the federal government be happy to let a cloud vendor such as Microsoft  use data analytics to auto-classify  federal e-mails and documents? OR
  • Would they rather each individual federal agency developed its own capability? OR
  • Do they think federal agencies need to club together to create a pan-government capability?

The security and information governance  issues this question raises are massive.

  • From a security and an information governance point of view the option of each agency having an individual analytics capability is clearly the best, because the cloud option and the pan-administration option create too large a concentration of data and insight about the US federal administration.
  • But from a big data/data analytics point of view the  pan-administration option or the cloud option is better, because they give a bigger base of data on which to make better auto-classification decisions.


The Ontario gas plant records deletion saga – a records management case study


The records deletion controversy in Ontario is of relevance to archivists and records managers elsewhere in the world because of the stark contrast it poses between on the one hand a very strong and complete records management governance framework:

  •  Ontario has a relatively recent piece of Archives legislation (the Archives and Recordkeeping Act 2006)
  •  Ontario has a comprehensive set of records retention schedules, all signed by the Archivist of Ontario and backed up by the Archives and Recordkeeping Act
  • one of those retention rules states that ministerial correspondence (correspondence arising from the portfolio responsibilities of a minister) should be preserved permanently

…and on the other hand:

  • the lack of any  planning as to how this retention rule on ministerial correspondence could be applied in a situation where the correspondence accumulated in the individual e-mail accounts of political staff working in ministerial offices
  • the ability of political staff to delete e-mails (whether trivial or important) from their e-mail accounts should they wish to do so
  • the operation by the Ontarian government of a routine policy of deleting entire e-mail accounts when staff leave

This tension between recordkeeping policy and e-mail practice is not unique to the government of Ontario.  It is a universal problem facing all administrations.

The US National Archives (NARA) took the step in August 2013 of issuing advice to US federal agencies that the e-mail accounts of important officials should be preserved permanently if the agency cannot find any other reliable way of capturing the significant correspondence of those individuals.   This advice is contained in Bulletin 2013-02, and has gone under the name of the ‘Capstone’ approach.

It will be interesting to see whether or not other National Archives around the world follow NARA’s lead and  intervene in the way that the e-mail accounts of important officials are managed.

The slidepack embedded in this post is a collection of all the episodes of the Ontario gas plant records deletion saga comic strip that I have published on this blog (together with a few extra slides that I have added in the middle and at the end).

The slidepack goes under a creative commons licence, so feel free to use it for non-commercial purposes.  My intention is that it serves as a case study. Accompanying the slidepack are:

  • a records guru podcast that I recorded with Jon Garde in which we discuss the saga
  • a blogpost in which I give a recordkeeping perspective on the saga