The original order of digital records

One of the founding principles of archival science is that the original order of records should be respected. If an organisation were to retrospectively restructure a set of records in a way that obscured or lost the original order, then this would risk giving a misleading picture of:

  • how content had accumulated
  • how people had worked
  • who had known what, when, in relation to the work that the records arose from

One of the great strengths of digital (as opposed to analogue) records is that they can be presented or viewed in so many different ways and orders. Does the principle of respect for the original order of records still hold true in the digital age? Is a digital repository likely to have one structure that will tend to act as the best vehicle for the application of retention rules, the taking of disposition actions and the making of appraisal decisions on content within the repository? We can use a simple thought experiment to show that the principle does indeed still hold true.

Thought experiment to show the original order of digital records

Imagine you are a records or information manager, reviewing a legacy digital repository (think of a legacy shared drive or fileshare, a legacy SharePoint system, or a legacy email system). You could reorder that repository in an unlimited number of ways, at the press of a button, with a few lines of code, or with a well-engineered prompt.

Now travel back in time a little and imagine yourself in the position of an end-user who is using a shared drive, a SharePoint site or an email account in the course of their work. They would have been able to re-order content within that particular shared drive, that particular SharePoint site, or that particular email account. But it is unlikely that they would have been able to re-order content across the entire repository of shared drives, the entire repository of SharePoint sites or the entire repository of email accounts.

In corporate systems such as shared drives, SharePoint systems, email systems and the like, end-users are given partial access to the system. They can contribute to one or more containers within the system, but not to the rest. They can access one or more containers in the system, but not the rest.

This partitioning of the system is necessary in any all-purpose system that can be used by all or most of an organisation’s staff to conduct all or most of their business activities. It is necessary because in a large organisation with a sophisticated division of labour, it is not normally advisable to allow individuals to be able to view, edit and contribute content in all parts of the system.

The importance of containers within digital repositories

The order that has the most influence on how content accumulates in a digital system is the order which determines who can contribute content where, and the order that sets default access permissions on content. To find the original order we therefore have to find the groupings on which default access permissions were set within the system. This order also strongly influences how people behave in a system. People are likely to alter their communication style, and alter the types of information they are willing to share, depending on the access permissions of the particular container they are contributing to.

Content in corporate digital systems tends to be partitioned into containers within which a defined individual, team or work group can contribute content:

  • in the on-premise world before the coming of cloud suites, a team could not work in shared drives (fileshares) until someone had provisioned them a top level folder
  • a team cannot work in SharePoint until someone has provisioned them a SharePoint site
  • an individual cannot work in an email system until someone has provisioned them an email account that they can use

Every item within a corporate all-purpose digital system has to have an access permission attached to it from the first moment that it is saved into a system. Therefore there needs to be some way of applying default access permissions to all content. Containers are the vehicle through which default permissions are applied. Every item therefore must sit within a container.

In a SharePoint system every document sits within a site. In an email system every message sits within an email account. In an on-premise shared drive every item sits underneath a top level folder. In Microsoft 365, MS Teams uses SharePoint, OneDrive and Exchange as its repositories. Every document, post or message contributed to a Teams channel or chat conversation is stored in either a SharePoint site, a OneDrive site or an Exchange email account.
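To make the idea of a container concrete for a data scientist, here is a minimal sketch for the shared drive case. It assumes items are identified by Windows UNC paths and that the first folder beneath the share root is the container. The function names and the example server, share and folder names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def top_level_folder(unc_path: str) -> str:
    """Return the first folder beneath the share root of a UNC file path."""
    parts = PureWindowsPath(unc_path).parts
    # parts[0] is the share anchor (server plus share name); parts[1] is the
    # top level folder, which this sketch treats as the container.
    if len(parts) < 3:
        raise ValueError(f"no top level folder found in {unc_path!r}")
    return parts[1]


def group_by_container(paths):
    """Group file paths by the top level folder they sit underneath."""
    groups = defaultdict(list)
    for path in paths:
        groups[top_level_folder(path)].append(path)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical paths on a hypothetical fileshare.
    example = [
        r"\\fs01\corp\Finance\2019\budget.xlsx",
        r"\\fs01\corp\Finance\notes.docx",
        r"\\fs01\corp\HR\policy.pdf",
    ]
    for container, members in group_by_container(example).items():
        print(container, len(members))
```

The same grouping step would look different for SharePoint (group by site) or email (group by account), but the principle is the same: every item resolves to exactly one container.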

This means that if a records/information manager has a means of acting on containers then they have a means of acting on all items within the repository. This makes containers a very powerful way of controlling content in live digital repositories, and of scaling up actions and decisions on content in legacy digital repositories.
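As a rough illustration of how a decision taken on a container scales to every item inside it, the sketch below applies one retention period per container and lists the items that have passed it. The `Item` class, the field names, and the rule of counting years from the last modified date are all assumptions made for this example, not a description of how any particular system works.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    name: str
    container: str        # e.g. a site, mailbox or top level folder
    last_modified: date


def items_due_for_review(items, retention_years, today):
    """Return items whose container-level retention period has elapsed.

    retention_years maps a container name to a number of years to keep
    content after its last modification (an assumed, simplified rule).
    """
    due = []
    for item in items:
        years = retention_years.get(item.container)
        if years is None:
            continue  # no rule has been set on this container
        lm = item.last_modified
        # Compare as (year, month, day) tuples to avoid building an
        # invalid date when the item was last modified on 29 February.
        if (lm.year + years, lm.month, lm.day) <= (today.year, today.month, today.day):
            due.append(item)
    return due
```

One rule set on a container here reaches every item in that container, without any item-by-item decision.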

The intuition behind this post

This is the second in a series of posts that attempt to articulate intuitions about records management for data scientists. The intuition behind this post is as follows:

Corporate all-purpose digital systems are systems that can be used by all or most members of staff to work on all or most of their activities. Examples include email systems, collaboration systems, shared drives (fileshares) etc.

Content in such systems tends to be partitioned into containers within which a defined individual, team or work group can contribute content.  All items within the repository will have been contributed to one of those containers.

The previous post in this series is ‘Intuitions about records management for data scientists’.

(all views in this post are my own)

Intuitions about records management for data scientists 

Data science techniques can be applied in any domain (medicine, psychology, marketing, baseball, records management etc.). In order for data science techniques to be used effectively, a combination is needed of:

  • sound intuitions about how these techniques work
  • sound intuitions about the domain they are being applied to

Imagine you are working with a data scientist on a project to use data science techniques for a particular records management purpose on a particular set of content. You might need their intuitions on data science. They might need your intuitions about records management.  

Both sets of intuitions would be subjective. No two data scientists and no two records managers would give the same set of intuitions about their disciplines. But this does not make them any less valuable. Not only would your intuitions give your colleague an insight into your discipline, they would also give them an insight into how you think, about what matters to you, and about what lens you will be using to look at the problem situation.

Intuitions about data science

Think of all the data science techniques that you have heard mentioned: linear regression, classification, clustering, topic modelling, regular expression matching, entity extraction, graph algorithms, language modelling etc. They are each executed by some algorithm written in some programming language. They are each underpinned by some combination of pure mathematics, statistics, probability and/or logic.

In lectures or podcasts you can hear data scientists converse about such techniques by conveying their intuitions about them.  For example the intuitions that:

  • Clustering algorithms can assign data points (customers, documents, properties, baseball players etc. etc.) to a position in a multi-dimensional virtual space. In doing so they can cluster together data points that have similar features (customers with a similar purchase history, listeners with similar musical tastes, baseball players with similar strengths etc. etc.) 
  • Graph algorithms can make connections between people, objects, and topics. For example, given an organisation’s email system and document management system as inputs, such algorithms could identify, for any given individual end-user, who that end-user most frequently communicated with, about what topics and with reference to which documents.
  • Large language models have a statistical understanding of how each language they have encountered works. They understand how frequently words occur in the language, how frequently words appear with other words and how the presence of one word or combination of words influences the likelihood that any other word or combination of words will appear. They can therefore calculate, to a high degree of probability, a good answer to any question on any subject, provided that they have been given enough relevant information in their training or at the point the question is asked.
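To give a feel for the graph intuition in the list above, here is a minimal sketch that builds a weighted graph of who corresponds with whom from a list of messages, then finds one person's most frequent correspondents. The message format (a sender plus a list of recipients) and the function names are assumptions made for illustration. A real project would more likely use a graph library and would also link people to topics and documents.

```python
from collections import Counter


def correspondence_graph(messages):
    """Count messages exchanged between each pair of people.

    messages is an iterable of (sender, recipients) tuples. Edges are
    undirected, so a message from A to B and one from B to A both add
    to the same pair.
    """
    edges = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:
                edges[frozenset((sender, recipient))] += 1
    return edges


def top_correspondents(edges, person, n=3):
    """Return the n people that `person` exchanged the most messages with."""
    counts = Counter()
    for pair, weight in edges.items():
        if person in pair:
            (other,) = pair - {person}
            counts[other] += weight
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical messages between hypothetical people.
    example = [
        ("ann", ["bob", "cy"]),
        ("bob", ["ann"]),
        ("cy", ["bob"]),
    ]
    print(top_correspondents(correspondence_graph(example), "ann"))
```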

You might need to contribute to a conversation on which data science technique(s) to use on your problem situation. If you have good enough intuitions about those techniques then you can make such a contribution without having to understand the underpinning maths or the executing code.

Intuitions about records management

In order to maximise the value of data science (and data scientists) to records management, it is important for us as individuals and as a profession to convey intuitions about the domain of records management.  

We can make a start on this by articulating some general intuitions about records in any age. The best source for this is archival science, which contains a set of intuitions that have been building up for well over a century (many people date the foundation of archival science to the publication of what is commonly called the ‘Dutch manual’ in 1898).

Here are what I consider to be the most important intuitions from archival science: 

  • Records arise when people conduct activities – in an information based society records are like water – vitally important but ubiquitous rather than special or unusual
  • Records have a lifecycle – setting access permissions and retention rules at the point of creation helps ensure records can be managed predictably and efficiently through their life. 
  • Act at the highest practical level of content grouping – acting on content groupings rather than individual items helps ensure that you keep the context around key documents and messages. It is also the natural way that archival and records management thinking can be scaled up
  • The retention rule that should be applied to content depends upon the nature of the work from which that content arose – we value records according to the extent to which the work they arose from is important.  A hastily written message from a piece of work that is still having an impact on us now is likely to be more valuable than a beautifully written report from a piece of work of short lived impact
  • Preserve the context as well as the content – imagine you had the key documents from a piece of work (the strategy, design, policy, contract, final report etc.) but not the humdrum documents/messages through which the work to arrive at those outputs was conducted. You would have limited ability to understand, question, defend, debunk or assess those key documents
  • Respect the original order in which content was created – the structures of the systems that people use to conduct and record their work will influence that work. If you retrospectively change the order/structure of a set of records, then you risk giving a false impression of how that work was conducted. You also risk making it impossible to establish who knew what and when – questions that often lie at the heart of any form of research or investigation

These intuitions are generally applicable to records at any stage of human technological progress. They were arrived at in the paper age, but they are equally applicable to the parchment age before it (when clerks working in my country were creating records on sheep skins) or the digital age after it.

However, these general intuitions are only a starting point. We should build on them to arrive at a set of intuitions that are specific to our age, in order to better inform and frame data science interventions in our domain.

A significant proportion of the digital content that has been created in organisations in countries like the UK over the past twenty-five years now exists as either:

  • legacy shared drive (fileshare) repositories
  • SharePoint repositories
  • email repositories

It would be beneficial to articulate intuitions about the structure and nature of content in these common types of repositories. In particular it would be beneficial to articulate intuitions for:

  • how the records lifecycle works in such a repository
  • what constitutes the original order/structure of content in the repository (and how we can best factor that order into the decisions we take when conducting reviews and appraisal exercises on records)
  • how content of ongoing value tends to be distributed across the repository
  • how content that has no value tends to be distributed across the repository

I will offer some of my intuitions on these questions in coming posts.

All comments are my own