So you’re a data scientist? That don’t impress me much.
The original article was published in the Royal Statistical Society's Stats Life magazine on 29 June 2015. It is reproduced here to mark the 40th anniversary of my career in 'Data Science', which started on 14th February 1983 at Reader's Digest in London.
In a recent entry on these pages, Martin Goodson talked about how analytics was done before anyone had heard of "data scientists", and one of the pioneering companies he mentioned was Reader's Digest. I'd like to fill in some of the blanks about what the company did in the early days of data-led marketing, and how those early models evolved into the data science of today.
Martin writes:
‘Reader's digest was doing data analysis on millions of households in the 1970s (but they needed a mainframe to do it). They employed programmers and statisticians but no one we would recognize as a data scientist. Instead they had 300 people who together did what a data scientist can achieve today on a single MacBook Pro.’
I was one of the team at Reader’s Digest back then and I joined shortly after graduating in maths with statistics from the University of London. I worked as a statistical programmer, developing and maintaining a suite of programs written in Fortran IV to perform regression, chi-square and many other statistical functions.
The ‘data science’ at Reader’s Digest in the 1980s
The company’s use of regression models to more accurately target marketing campaigns was seen as pioneering and was, without doubt, commercially successful. Such was the level of outside interest that a spin-off company was formed to sell the service to other organisations on a bureau basis.
However, to understand Reader's Digest's innovation, and its contribution to modern data science, we have to look beyond just the statistics. Underpinning the entire system was an in-house, data dictionary driven, hierarchical data storage system known as Axiom. Its "root and trailer" record format provided space-efficient storage of data with minimal redundancy. Flexible data storage and retrieval were achieved through the in-house developed selection languages Axsel (Axiom Select) and NSL (New Selection Language).
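To make the idea concrete, here is a minimal sketch of how a "root and trailer" layout avoids redundancy: a customer's fixed attributes sit in a single root record, and a variable number of transaction trailers hang off it. This is my own modern illustration in Python; the field names are invented and the real Axiom format is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical modern illustration of a "root and trailer" record layout.
# Field names are invented; this is not the actual Axiom format.

@dataclass
class Trailer:
    """Variable-length detail record, e.g. one promotion response."""
    campaign_code: str
    response_date: str
    order_value: float

@dataclass
class Root:
    """Fixed customer attributes, stored once per household."""
    customer_id: int
    surname: str
    postcode: str
    trailers: List[Trailer] = field(default_factory=list)  # 0..n trailers follow the root

# One hierarchical record: a single root, many trailers, no repeated root fields.
record = Root(customer_id=123456, surname="SMITH", postcode="SW1A 1AA")
record.trailers.append(Trailer("XMAS83", "1983-11-02", 12.95))
record.trailers.append(Trailer("SPRING84", "1984-03-15", 8.50))
```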
Volumes of data were very large by the standards of the time. Reader's Digest's database was held in two subsets: the inside list, which contained everyone who had interacted with the company along with their transactional history, and the outside list, which held everyone else, drawn from the Electoral Register and other prospect lists.
The hardware was basic by today's standards. Data was stored on reel-to-reel tape and analytical jobs were typically run in a 1.5MB partition on an IBM 3033 mainframe with a total of 16MB of RAM. The inside list held data on around 6 million customers, about 4.4GB in size, occupying approximately twenty-six 12-inch magnetic tape reels recorded at 6,250 bpi. The outside list held around 38 million prospect records in 7.5GB on 44 similar tapes.
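For a sense of scale, those figures work out at only a few hundred bytes per record. The arithmetic below is just a back-of-the-envelope check, taking the quoted sizes at face value.

```python
# Rough check of average record sizes from the figures quoted above.
GB = 1024 ** 3

inside_bytes_per_record = 4.4 * GB / 6_000_000     # roughly 790 bytes per customer record
outside_bytes_per_record = 7.5 * GB / 38_000_000   # roughly 210 bytes per prospect record

print(f"inside list:  ~{inside_bytes_per_record:.0f} bytes/record")
print(f"outside list: ~{outside_bytes_per_record:.0f} bytes/record")
```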
While the system was later viewed by many as a precursor to relational database management systems (RDBMS), in reality its access mechanisms bear a closer resemblance to modern NoSQL technologies, supporting semi-structured and unstructured data with the ability to create new data entities dynamically.
NSL was the mechanism through which statistics interacted with the rest of the system. The syntax of the language was based on set theory. Groups of customers were selected using standard criteria, including tests for dynamic data elements, and labelled into sets. Further functions then defined how the interactions between resultant sets were to be dealt with, for example by randomisation or by assigning all overlap to a single group, to arrive at a de-duplicated “data universe” group for analysis.
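A rough modern analogue of that selection step, written with ordinary Python sets, looks something like the sketch below. This is purely illustrative; it does not reproduce the NSL syntax, and the group definitions and customer IDs are invented.

```python
# Illustrative analogue of NSL-style set selection and de-duplication (not NSL syntax).
import random

recent_buyers = {101, 102, 103, 104, 105}        # e.g. ordered in the last 12 months
sweepstake_entrants = {104, 105, 106, 107}       # e.g. entered the latest prize draw

overlap = recent_buyers & sweepstake_entrants

# Option 1: assign all of the overlap to a single group
group_a = recent_buyers
group_b = sweepstake_entrants - overlap

# Option 2: randomise the overlap between the two groups
random.seed(42)
go_to_a = {c for c in overlap if random.random() < 0.5}
group_a_rand = (recent_buyers - overlap) | go_to_a
group_b_rand = sweepstake_entrants - go_to_a

# De-duplicated "data universe" carried forward to the analysis stage
universe = group_a | group_b
```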
Another set of tools was provided in the language to perform further tests on these groups, including frequency counts, cross-tabulations and chi-square contingency table tests, from which the statisticians would define a set of variables to go forward to the model-building stage. NSL also had a wealth of functions for creating new variables by combining and aggregating data, akin to modern MapReduce methodologies.
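In today's toolkit the equivalent steps might look something like this sketch using pandas and SciPy. The dataframe and its column names are invented for illustration and have no connection to the original system.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented example data standing in for a customer extract.
df = pd.DataFrame({
    "region":      ["North", "North", "South", "South", "North", "South"],
    "responded":   ["yes", "no", "yes", "yes", "no", "no"],
    "order_value": [12.5, 0.0, 8.0, 15.0, 0.0, 0.0],
})

# Frequency counts and a cross-tabulation
xtab = pd.crosstab(df["region"], df["responded"])

# Chi-square contingency table test on the cross-tabulation
chi2, p_value, dof, expected = chi2_contingency(xtab)

# Aggregation to derive a new variable (the MapReduce-like step)
spend_by_region = df.groupby("region")["order_value"].sum()
```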
The interface to the statistical suite was also through NSL, with keywords such as REGRESS, and the output, including logs, was returned via NSL variables. Once a statistician was happy with a model, its parameters would be defined in NSL and it would be catalogued and added to the data dictionary, effectively becoming a calculated variable which could be used in selections like any other field.
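Something similar is easy to sketch today: fit a regression, then keep only its parameters so the score can be recomputed for any record as just another derived field. The scikit-learn code below is an illustration under that assumption; the training data and variable names are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training data: two predictors and an observed response rate.
X_train = np.array([[1, 0], [2, 1], [3, 0], [4, 1], [5, 1]])
y_train = np.array([0.10, 0.25, 0.30, 0.55, 0.65])

model = LinearRegression().fit(X_train, y_train)

# "Catalogue" the model: store only its parameters, like an NSL calculated variable.
catalogued = {"intercept": model.intercept_, "coefficients": model.coef_.tolist()}

def scored_variable(row, params=catalogued):
    """Recompute the model score for one record from the stored parameters."""
    return params["intercept"] + sum(c * x for c, x in zip(params["coefficients"], row))

# The score can now be used in selections like any other field.
print(scored_variable([3, 1]))
```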
So we can see from this that the innovation at Reader's Digest in the 1980s was not solely attributable to statisticians, but was achieved through a combined approach involving experts in a number of fields: data management, programming, set theory, computer language design, marketing analysis and statistics.
Data science is a multi-discipline subject
The same lesson applies to the ongoing debate between statisticians and data scientists. To understand it, we need to step back for a minute and consider the changing nature of the data being collected by modern organisations.
I see data science, like computer science, as a multi-discipline subject. A lot of modern data cannot be analysed using statistical techniques alone; it often needs to be summarised and classified before any form of statistical analysis can take place.
For example, analysing free-format text requires Natural Language Processing techniques such as sentiment analysis, topic segmentation and relationship extraction before it can be classified.
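As a trivial illustration of the point, the toy lexicon-based sentiment scorer below turns free text into a number a statistician can work with. It is a deliberately simplistic sketch, not a production NLP pipeline, and the word lists and reviews are invented.

```python
# Toy lexicon-based sentiment scoring: free text -> a number for later analysis.
# (Illustrative only; real pipelines would use proper NLP libraries and trained models.)
POSITIVE = {"excellent", "good", "great", "love", "helpful"}
NEGATIVE = {"poor", "bad", "terrible", "slow", "broken"}

def sentiment_score(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Excellent service, very helpful staff",
    "Terrible delivery, the item arrived broken",
]
scores = [sentiment_score(r) for r in reviews]   # e.g. [2, -2]
```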
Location-based data is now business-critical for many organisations, particularly the huge volume of geo-tagged data being captured by smartphone applications. To produce meaningful analysis, we need to understand the spatial environment within which the data exists. To put spatial data into a meaningful context, we have to use reference data, such as that provided by Ordnance Survey and the ONS, to classify the urban and rural environment.
Is spatial aggregation of data just as simple as calculating averages constrained by an area boundary or radius, or is there more to it?
What defines a neighbourhood? One key aspect is topology. The interconnected nature of a road network determines how a neighbourhood grows in a way that simple radial aggregation of data cannot describe. Fortunately, geographers have developed spatial analysis techniques to classify various aspects of the environment, which can then be used as input to a statistical process.
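A minimal sketch of the naive approach, pure radial aggregation using a haversine distance, shows what gets missed: every point inside the radius counts equally, regardless of whether the road network actually connects it to the centre. The coordinates and values below are invented.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Invented geo-tagged observations: (lat, lon, value)
points = [(53.40, -2.99, 12.0), (53.41, -3.00, 8.0), (53.47, -2.90, 30.0)]
centre = (53.405, -2.995)

# Naive radial aggregation: average everything within 2 km of the centre,
# ignoring whether the road network actually links these points together.
within = [v for lat, lon, v in points
          if haversine_km(centre[0], centre[1], lat, lon) <= 2.0]
radial_mean = sum(within) / len(within) if within else None
```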
One aspect of statistics that many empirical machine learning techniques fall down on is the ability to assign a level of confidence to the reproducibility of the results of predictive models, and thereby reduce the risk of over-fitting.
I have seen many grossly over-fitted neural network models applied to business problems during my career. A far better model could have been built if some thought had been given beforehand to reducing the set of input variables and optimising the architecture of the neural network hidden layers. In addition, statistical tests applied to the output of the model when tested on a validation sample would have highlighted problems at an earlier stage.
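A minimal sketch of that discipline is shown below using scikit-learn on an invented dataset: hold out a validation sample and compare the two error rates, since a large gap between training and validation performance is the first warning of over-fitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Invented data: 500 records, 20 candidate input variables, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# A deliberately oversized hidden layer on too many inputs invites over-fitting.
model = MLPClassifier(hidden_layer_sizes=(200,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
valid_acc = accuracy_score(y_valid, model.predict(X_valid))
print(f"training accuracy {train_acc:.2f} vs validation accuracy {valid_acc:.2f}")
# A large gap between the two is the warning sign discussed above.
```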
Over the last 20 years, I have advocated a hybrid approach involving a combination of statistical and machine learning techniques to achieve the best results on complex data sets. In a lot of cases I build several models, using a variety of techniques on different subsets of the data, and compare them. Depending on the application, I sometimes deploy multiple models applied to the same problem, polling their predictions to make a decision.
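A minimal sketch of that kind of polling, using scikit-learn's VotingClassifier on invented data, is shown below; the specific component models are placeholders rather than a recommendation.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Invented data standing in for a prepared modelling dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Several models built with different techniques, polled by majority vote.
ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression()),
        ("tree", DecisionTreeClassifier(max_depth=4)),
        ("forest", RandomForestClassifier(n_estimators=50)),
    ],
    voting="hard",
)
ensemble.fit(X, y)
predictions = ensemble.predict(X[:10])
```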
Data science is essentially a multi-discipline subject in its own right which brings together:
- Geographers
- Text mining and NLP experts
- Statisticians
- Machine learning specialists
- Database developers
- Data storage and access specialists
Statisticians vs data scientists misses the point
The debate should not be about statisticians versus data scientists. Statisticians are a critical part of data science, but I believe we need to evolve our role into that of domain experts within a wider team.
The garbage in, garbage out rule still applies to 'big data', and validating against reference data remains an important role for statisticians. Just because you have a lot of data, it doesn't mean it is representative of the behaviour of the target audience you are trying to predict.
It’s time to put our differences aside. If we are to move forward as a profession we must accept the changing nature of data in modern organisations and work with experts in other fields to achieve our common goal. However, specialism remains important. It would be very counter-productive to allow ourselves to become ‘jacks of all trades’.