I’ve been building a portfolio of computational science projects for five years now, and I am constantly struck by how basic the needs of scientists are with respect to data, and how under-resourced they are when it comes to managing their own data.
I’ve worked with molecular biologists, marine biologists, neuroscientists and physicists, and the vast majority of scientists I speak to are doing basic analyses of their data, either by using the built-in features of their experimental software, or simple statistical analyses in Excel. Many of them are analysing their data by visual inspection – a fancy way of saying they’re looking at it and hoping to see meaning in the numbers.
They can, and do, achieve a lot this way, but there is so much more they could discover using more powerful and flexible computation. I recently had a sleep study and discovered that the analysis they did was so basic it rendered much of the sophisticated data collection unnecessary. They categorized gross apnoea events, but could not do much with anything that wasn’t a complete obstruction of the airway. The resulting report was simple and no doubt effective for patients with classic, textbook sleep apnoea, but almost useless for someone like me who (surprise!) doesn’t fit the normal pattern, but nonetheless has a severe (and still undiagnosed!) sleep problem.
The only report they could give me was the basic output of the proprietary software package they use. They were reluctant to give me the raw data, and assured me that I would not be able to read it without their special software. When I got the USB stick home it turned out to contain EDF files – a standard format for this sort of data, with a host of open-source packages out there that can read it, convert it to text, and muck with it in a variety of different ways. But the sleep clinic and my medical specialist had no idea this was even possible.
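To make the point concrete: EDF is an open, documented format whose 256-byte header is plain fixed-width ASCII – you don’t even need special software to see what a file contains. Here’s a minimal sketch (parsing a synthetic header built in code, since I can’t share my actual recording) showing how little magic is involved; the field names and widths come from the EDF specification:

```python
def parse_edf_header(data: bytes) -> dict:
    # The EDF header is a sequence of fixed-width ASCII fields (256 bytes total).
    fields = [
        ("version", 8), ("patient_id", 80), ("recording_id", 80),
        ("start_date", 8), ("start_time", 8), ("header_bytes", 8),
        ("reserved", 44), ("n_records", 8), ("record_duration_s", 8),
        ("n_signals", 4),
    ]
    out, pos = {}, 0
    for name, width in fields:
        out[name] = data[pos:pos + width].decode("ascii").strip()
        pos += width
    return out

# Build a synthetic header for demonstration; a real file starts the same way.
demo = "".join(value.ljust(width) for value, width in [
    ("0", 8), ("patient X", 80), ("sleep study", 80),
    ("02.01.24", 8), ("22.30.00", 8), ("768", 8),
    ("", 44), ("960", 8), ("30", 8), ("2", 4),
]).encode("ascii")

hdr = parse_edf_header(demo)
print(hdr["n_signals"], hdr["record_duration_s"])  # → 2 30
```

Open-source libraries such as MNE-Python and pyedflib go much further and read the signal data itself; the sketch just shows that the “unreadable” raw data is a published, parseable format.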
So I looked into the science of sleep and discovered that, although scientists are doing somewhat more analysis of the data than doctors, they are still, for the most part, confined to the standard reports delivered by their off-the-shelf software. They are collecting masses of sophisticated data, and doing very little with it. What they do do, they do manually. A lot of data analysis is done by eye – does that waveform look like a sleeping EEG pattern? Does this one look like an apnoea event?
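That “does this look like an apnoea event?” check is exactly the kind of visual inspection a few lines of code can automate. Here’s a toy illustration – emphatically not a clinical algorithm; the threshold, minimum span length, and the synthetic “airflow” signal are all invented for the example – of flagging spans where a signal goes flat:

```python
def flag_flat_spans(signal, threshold=0.1, min_len=10):
    """Return (start, end) index pairs where |signal| stays below
    threshold for at least min_len consecutive samples."""
    spans, start = [], None
    for i, x in enumerate(signal):
        if abs(x) < threshold:
            if start is None:
                start = i  # a flat run begins here
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))  # flat run was long enough to flag
            start = None
    if start is not None and len(signal) - start >= min_len:
        spans.append((start, len(signal)))  # flat run reached end of record
    return spans

# Synthetic trace: normal breathing, a 15-sample pause, normal breathing.
airflow = [1.0] * 20 + [0.02] * 15 + [0.9] * 20
print(flag_flat_spans(airflow))  # → [(20, 35)]
```

A real detector would need filtering, calibration and clinical validation – but the point is that the shape of the task is a short script, not a career in software engineering.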
This is why all scientists need to learn to code. Basic scripting skills could save them thousands of hours of work every year, and ensure that their work is repeatable and verifiable. I’m both shocked and awed by the magnitude of this problem, and the scale of the opportunity that now presents itself. If we are training the scientists of the future, we need to train them to make the best possible use of their data.
This means they need to know enough code to write scripts to pull out the information they need from files that may be gigabytes, terabytes, or even petabytes in size. But it also means they need to be trained in how to work out the possibilities of their datasets. Where can they look for correlations? How can they find new information in old datasets? What graphs will help them make sense of their data? What statistical analyses will create meaning from a file full of numbers? And how can they effectively compare one file of data against another?
If you have a 10-gigabyte data file, you can’t just graph those numbers as a single scatter plot. So which subsets of data do you look at? Do you look at a tenth of the data at a time, or is that still too large? Do you break it into consecutive sets or regular samples? What do you expect to see in this data? How can you check if it’s actually there? What other meaning might be worth looking for? How can you verify your results?
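The two subsetting strategies above – consecutive blocks versus regular samples – are each a few lines of code. A sketch in Python, using a small simulated series in place of a real 10-gigabyte file (which you would stream from disk rather than hold in memory):

```python
import random
import statistics

# Simulated time series standing in for a large data file.
random.seed(1)
series = [random.gauss(0, 1) for _ in range(100_000)]

def consecutive_chunks(data, n_chunks):
    """Split into consecutive blocks – good for spotting changes over time."""
    size = len(data) // n_chunks
    return [data[i * size:(i + 1) * size] for i in range(n_chunks)]

def regular_sample(data, step):
    """Take every step-th point – good for an overview of the whole record."""
    return data[::step]

# Per-chunk means would reveal drift; a regular sample keeps the overall shape
# while being small enough to plot.
chunk_means = [statistics.mean(c) for c in consecutive_chunks(series, 10)]
overview = regular_sample(series, 100)  # 1,000 points instead of 100,000
print(len(overview), len(chunk_means))
```

Which strategy to use depends on the question: consecutive blocks preserve local detail within each window, while a regular sample trades detail for coverage of the whole record.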
These are tough questions, and not always easy to answer, but the effective use of data involves skills that can be taught. We have access via the internet to a mind-boggling number and range of datasets, so there are plenty of test cases to play with, in every imaginable discipline.
I’m starting to teach these things in my year 11 Computer Science course, and we are focusing on them more and more every year. But they are not niche skills for computer scientists alone; they are fundamental to the progress of modern science. We need to recognise that all scientists need these skills, and we need to start teaching them earlier.