Data science techniques can be applied in any domain (medicine, psychology, marketing, baseball, records management etc.). To use them effectively, you need a combination of:
- sound intuitions about how these techniques work
- sound intuitions about the domain they are being applied to
Imagine you are working with a data scientist on a project to use data science techniques for a particular records management purpose on a particular set of content. You might need their intuitions on data science. They might need your intuitions about records management.
Both sets of intuitions would be subjective. No two data scientists and no two records managers would give the same set of intuitions about their disciplines. But this does not make them any less valuable. Not only would your intuitions give your colleague an insight into your discipline, they would also give them an insight into how you think, what matters to you, and what lens you will be using to look at the problem situation.
Intuitions about data science
Think of all the data science techniques that you have heard mentioned: linear regression, classification, clustering, topic modelling, regular expression matching, entity extraction, graph algorithms, language modelling etc. Each is executed by some algorithm written in some programming language. Each is underpinned by some combination of pure mathematics, statistics, probability and/or logic.
In lectures or podcasts you can hear data scientists converse about such techniques by conveying their intuitions about them. For example, the intuitions that:
- Clustering algorithms can assign data points (customers, documents, properties, baseball players etc.) to a position in a multi-dimensional virtual space. In doing so they can cluster together data points that have similar features (customers with a similar purchase history, listeners with similar musical tastes, baseball players with similar strengths etc.).
- Graph algorithms can make connections between people, objects, and topics. For example, given an organisation’s email system and document management system as inputs, such algorithms could identify, for any given individual end-user, who that end-user most frequently communicated with, about what topics and with reference to which documents.
- Large language models have a statistical understanding of how each language they have encountered works. They understand how frequently words occur in the language, how frequently words appear with other words and how the presence of one word or combination of words influences the likelihood that any other word or combination of words will appear. They can therefore calculate, to a high degree of probability, a good answer to any question on any subject, provided that they have been given enough relevant information in their training or at the point the question is asked.
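The graph intuition above can be made concrete with a minimal sketch. It assumes that messages have already been extracted from an email system as (sender, recipient, topic) triples. The names, topics and function names below are all invented for illustration, and a real graph library would offer far more than these simple counts.

```python
from collections import Counter

# Hypothetical message log: (sender, recipient, topic).
# The people and topics are invented for illustration only.
MESSAGES = [
    ("ana", "ben", "budget"),
    ("ana", "ben", "budget"),
    ("ana", "cho", "recruitment"),
    ("ben", "ana", "budget"),
    ("cho", "ana", "recruitment"),
    ("ana", "ben", "audit"),
]


def build_graph(messages):
    """Count messages per pair of people, and per (pair, topic)."""
    edge_weights = Counter()
    edge_topics = Counter()
    for sender, recipient, topic in messages:
        # A frozenset makes the edge undirected: ana->ben and ben->ana
        # are counted as the same connection.
        pair = frozenset((sender, recipient))
        edge_weights[pair] += 1
        edge_topics[(pair, topic)] += 1
    return edge_weights, edge_topics


def top_contacts(person, edge_weights):
    """Return (contact, message_count) pairs for one person, busiest first."""
    contacts = Counter()
    for pair, weight in edge_weights.items():
        # Skip edges that do not involve this person, and self-addressed messages.
        if person in pair and len(pair) == 2:
            (other,) = pair - {person}
            contacts[other] += weight
    return contacts.most_common()


def topics_between(a, b, edge_topics):
    """Return (topic, message_count) pairs for one pair of people."""
    pair = frozenset((a, b))
    topics = Counter(
        {topic: n for (p, topic), n in edge_topics.items() if p == pair}
    )
    return topics.most_common()


edge_weights, edge_topics = build_graph(MESSAGES)
print(top_contacts("ana", edge_weights))           # who ana communicates with most
print(topics_between("ana", "ben", edge_topics))   # what ana and ben discuss
```

With the sample data, the first call ranks "ben" ahead of "cho" as a contact of "ana", and the second ranks "budget" ahead of "audit" as a topic between "ana" and "ben". Linking documents into the same graph would follow the same pattern, with documents as a further kind of node.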
You might need to contribute to a conversation on which data science technique(s) to use on your problem situation. If you have good enough intuitions about those techniques then you can make such a contribution without having to understand the underpinning maths or the executing code.
Intuitions about records management
In order to maximise the value of data science (and data scientists) to records management, it is important for us as individuals and as a profession to convey intuitions about the domain of records management.
We can make a start on this by articulating some general intuitions about records in any age. The best source for this is archival science, which contains a set of intuitions that have been building up for well over a century (many people date the foundation of archival science to the publication of what is commonly called the ‘Dutch manual’ in 1898).
Here are what I consider to be the most important intuitions from archival science:
- Records arise when people conduct activities – in an information-based society records are like water: vitally important but ubiquitous rather than special or unusual.
- Records have a lifecycle – setting access permissions and retention rules at the point of creation helps ensure records can be managed predictably and efficiently through their life.
- Act at the highest practical level of content grouping – acting on content groupings rather than individual items helps ensure that you keep the context around key documents and messages. It is also the natural way that archival and records management thinking can be scaled up.
- The retention rule that should be applied to content depends upon the nature of the work from which that content arose – we value records according to the extent to which the work they arose from is important. A hastily written message from a piece of work that is still having an impact on us now is likely to be more valuable than a beautifully written report from a piece of work of short-lived impact.
- Preserve the context as well as the content – imagine you had the key documents from a piece of work (the strategy, design, policy, contract, final report etc.) but not the humdrum documents/messages through which the work to arrive at those outputs was conducted. You would have limited ability to understand, question, defend, debunk or assess those key documents.
- Respect the original order in which content was created – the structures of the systems that people use to conduct and record their work will influence that work. If you retrospectively change the order/structure of a set of records, then you risk giving a false impression of how that work was conducted. You also risk making it impossible to establish who knew what and when – questions that often lie at the heart of any form of research or investigation.
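Some of these intuitions translate quite directly into how a data science intervention might be framed. The sketch below is a hypothetical illustration, not an established method. It applies a retention rule to a whole grouping of content, based on the kind of work the content arose from, and never judges items one by one. The schedule values, class name and field names are all invented for the example.

```python
from dataclasses import dataclass, field

# Hypothetical retention schedule: years to keep content after the work closes,
# keyed by the kind of work it arose from. The values are invented.
RETENTION_YEARS = {
    "policy development": 20,
    "contract management": 7,
    "routine administration": 2,
}


@dataclass
class Aggregation:
    """A grouping of content (e.g. a project folder) from one piece of work."""
    name: str
    activity: str      # the kind of work the content arose from
    closed_year: int   # the year the work finished
    items: list = field(default_factory=list)  # kept in their original order

    def disposal_year(self, schedule=RETENTION_YEARS):
        # The rule comes from the activity, not from the quality of any item.
        return self.closed_year + schedule[self.activity]


def review(aggregations, current_year):
    """Decide keep/dispose per aggregation, so context stays with content."""
    decisions = {}
    for agg in aggregations:
        due = current_year >= agg.disposal_year()
        decisions[agg.name] = "dispose" if due else "keep"
    return decisions


policy = Aggregation(
    "Travel policy review", "policy development", 2015,
    ["hasty message.msg", "final report.docx"],
)
admin = Aggregation(
    "Room bookings 2015", "routine administration", 2015,
    ["bookings.xlsx"],
)
decisions = review([policy, admin], current_year=2024)
print(decisions)
```

In this example the hasty message and the final report in the policy grouping share the same fate, because the decision is taken once for the grouping as a whole. The items list is never reordered, which reflects the intuition about respecting original order.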
These intuitions are generally applicable to records at any stage of human technological progress. They were arrived at in the paper age, but they are equally applicable to the parchment age before it (when clerks working in my country were creating records on sheepskins) or the digital age after it.
However, these general intuitions are only a starting point. We should build on them to arrive at a set of intuitions that are specific to our age, in order to better inform and frame data science interventions in our domain.
A significant proportion of the digital content that has been created in organisations in countries like the UK over the past twenty-five years now exists as either:
- legacy shared drive (fileshare) repositories
- SharePoint repositories
- email repositories
It would be beneficial to articulate intuitions about the structure and nature of content in these common types of repositories. In particular it would be beneficial to articulate intuitions for:
- how the records lifecycle works in such a repository
- what constitutes the original order/structure of content in the repository (and how we can best factor that order into the decisions we take when conducting reviews and appraisal exercises on records)
- how content of ongoing value tends to be distributed across the repository
- how content that has no value tends to be distributed across the repository
I will offer some of my intuitions on these questions in coming posts.
All comments are my own