Show me the Data

I’ve been building a repertoire of Computational Science projects for five years now, and I am constantly struck by how basic the needs of scientists are with respect to data, and how under-resourced they are when it comes to managing their own data.

I’ve worked with molecular biologists, marine biologists, neuroscientists and physicists, and the vast majority of scientists I speak to are doing basic analyses of their data, either by using the built-in features of their experimental software, or by running simple statistical analyses in Excel. Many of them are analysing their data by visual inspection – a fancy way of saying they’re looking at it and hoping to see meaning in the numbers.

They can, and do, achieve a lot this way, but there is so much more they could discover using more powerful and flexible computation. I recently had a sleep study and discovered that the analysis they did was so basic it rendered much of the sophisticated data collection unnecessary. They categorised gross apnoea events, but could not do much with anything that wasn’t a complete obstruction of the airway. The resulting report was simple and no doubt effective for patients with classic, textbook sleep apnoea, but almost useless for someone like me who (surprise!) doesn’t fit the normal pattern, but nonetheless has a severe (and still undiagnosed!) sleep problem.

The only report they could give me was the basic output of the proprietary software package they use. They were reluctant to give me the raw data, and assured me that I would not be able to read it without their special software. When I got the USB stick home it turned out to contain EDF files – a standard format for this sort of data, with a host of open source packages out there that can read it, convert it to text, and muck with it in a variety of different ways. But the sleep clinic and my medical specialist had no idea this was even possible.
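To make the point concrete: the header of an EDF (European Data Format) file is just fixed-width ASCII, so you don’t even need a special library to peek inside one. Here’s a minimal sketch in Python – the field layout comes from the published EDF specification, and the sample bytes are synthetic, not from a real recording (in practice you’d reach for a proper library like pyedflib or MNE):

```python
# Sketch: decode the fixed 256-byte ASCII header of an EDF file.
# Field names and widths follow the published EDF specification.
# The sample header below is synthetic, built just for illustration.

def parse_edf_header(raw: bytes) -> dict:
    """Decode the fixed 256-byte EDF header into a dictionary."""
    fields = [  # (name, width in bytes), per the EDF spec
        ("version", 8), ("patient_id", 80), ("recording_id", 80),
        ("start_date", 8), ("start_time", 8), ("header_bytes", 8),
        ("reserved", 44), ("n_records", 8), ("record_duration", 8),
        ("n_signals", 4),
    ]
    header, offset = {}, 0
    for name, width in fields:
        header[name] = raw[offset:offset + width].decode("ascii").strip()
        offset += width
    return header

# Build a synthetic 256-byte header so the example is self-contained.
values = ["0", "patient X", "sleep study", "01.01.24", "22.30.00",
          "768", "", "480", "30", "2"]
widths = [8, 80, 80, 8, 8, 8, 44, 8, 8, 4]
raw = b"".join(v.ljust(w).encode("ascii") for v, w in zip(values, widths))

hdr = parse_edf_header(raw)
print(hdr["n_signals"], hdr["record_duration"])  # 2 30
```

Twenty-odd lines, and the “unreadable” file starts giving up its structure. That is exactly the kind of thing a clinic – or a patient – could do with basic scripting skills.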

So I looked into the science of sleep and discovered that, although scientists are doing somewhat more analysis of the data than doctors, they are still, for the most part, confined to the standard reports delivered by their off-the-shelf software. They are collecting masses of sophisticated data, and doing very little with it. What they do do, they do manually. A lot of data analysis is done by eye – does that waveform look like a sleeping EEG pattern? Does this one look like an apnoea event?
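Even crude automation beats scanning traces by eye. As a purely illustrative sketch – the signal, the thresholds, and the scoring rule here are all invented, and real clinical criteria are far more involved – you could flag candidate events wherever an airflow signal drops well below baseline for a sustained stretch:

```python
# Sketch: flag candidate apnoea-like events in an airflow trace by
# finding sustained drops below a fraction of baseline. The data and
# thresholds are invented for illustration; real scoring criteria
# are considerably more involved than this.

def flag_drops(signal, baseline, frac=0.5, min_len=3):
    """Return (start, end) index pairs where signal < frac * baseline
    for at least min_len consecutive samples."""
    events, start = [], None
    for i, x in enumerate(signal):
        if x < frac * baseline:
            if start is None:
                start = i          # a candidate event begins here
        else:
            if start is not None and i - start >= min_len:
                events.append((start, i))
            start = None
    if start is not None and len(signal) - start >= min_len:
        events.append((start, len(signal)))  # event runs to end of trace
    return events

airflow = [1.0, 0.9, 1.0, 0.3, 0.2, 0.25, 0.3, 1.0, 0.4, 1.0]
events = flag_drops(airflow, baseline=1.0)
print(events)  # [(3, 7)]
```

A script like this runs identically every time, over every file – which is precisely what “repeatable and verifiable” means, and what analysis by eye can never be.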

This is why all scientists need to learn to code. Basic scripting skills could save them thousands of hours of work every year, and ensure that their work is repeatable and verifiable. I’m both shocked and awed by the magnitude of this problem, and the scale of the opportunity that now presents itself. If we are training the scientists of the future, we need to train them to make the best possible use of their data.

This means they need to know enough code to write scripts to pull out the information they need from files that may be gigabytes, terabytes, or even petabytes in size. But it also means they need to be trained in how to work out the possibilities of their datasets. Where can they look for correlations? How can they find new information in old datasets? What graphs will help them make sense of their data? What statistical analyses will create meaning from a file full of numbers? And how can they effectively compare one file of data against another?
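The core trick for files too big to open is to stream them: read one record at a time and keep only running statistics, so memory use stays constant no matter how large the file grows. A minimal sketch – the file name and column layout are hypothetical stand-ins:

```python
# Sketch: compute summary statistics over a file far too big for
# memory by streaming it one row at a time. The column name and the
# tiny in-memory "file" below are hypothetical stand-ins for a real
# multi-gigabyte recording.
import csv
import io

def running_stats(lines, column):
    """Mean and count of one numeric CSV column, computed in a single
    streaming pass (constant memory regardless of file size)."""
    n, total = 0, 0.0
    for row in csv.DictReader(lines):
        total += float(row[column])
        n += 1
    return (total / n if n else 0.0), n

# Stand-in for open("huge_recording.csv") -- same interface, tiny data.
sample = io.StringIO("time,heart_rate\n0,60\n1,62\n2,61\n")
mean, n = running_stats(sample, "heart_rate")
print(mean, n)  # 61.0 3
```

Swap the `StringIO` for a real file handle and the same dozen lines work unchanged on a terabyte of data – which is exactly why this belongs in every scientist’s toolkit.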

If you have a 10 gigabyte data file, you can’t just graph those numbers as a single scatter plot. So which subsets of data do you look at? Do you look at a tenth of the data at a time, or is that still too large? Do you break it into consecutive sets or regular samples? What do you expect to see in this data? How can you check if it’s actually there? What other meaning might be worth looking for? How can you verify your results?
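The “consecutive sets or regular samples” choice from those questions is easy to express in code. A sketch, using a small stand-in sequence in place of millions of samples: a contiguous window shows you fine detail in one interval, while decimation (every k-th point) gives a coarse view of the whole run.

```python
# Sketch: two simple ways to cut a too-big dataset down to something
# you can actually plot. The data is a small stand-in sequence.

def consecutive_window(data, start, length):
    """One contiguous slice: fine detail over a specific interval."""
    return data[start:start + length]

def decimate(data, k):
    """Every k-th sample: a coarse overview of the entire run."""
    return data[::k]

data = list(range(100))  # stand-in for millions of samples

window = consecutive_window(data, 40, 5)
coarse = decimate(data, 25)
print(window)  # [40, 41, 42, 43, 44]
print(coarse)  # [0, 25, 50, 75]
```

Neither view is “right” – the point is that choosing between them is a scientific decision, and you can only make it deliberately if you can write the three lines that implement each one.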

These are tough questions, and not always easy to answer, but the effective use of data involves skills that can be taught. We have access via the internet to a mind-boggling number and range of datasets, so there are plenty of test cases to play with, in every imaginable discipline.

I’m starting to teach these things in my Year 11 Computer Science course, and we are focusing on them more and more every year. But they are not niche skills for computer scientists alone; they are fundamental to the progress of modern science. We need to recognise that all scientists need these skills, and we need to start teaching them earlier.


About lindamciver

Australian Freelance Writer, Teacher, & Computer Scientist

3 Responses to Show me the Data

  1. Joe says:

    I agree and disagree.

    Much is achieved when you marry a specialist scientist with a specialist computer scientist. I’d rather encourage more collaboration, more research departments with a “go to” team of computer scientists / coders / specialist data analysts / statisticians, rather than try to get every scientist to also learn good coding, data analysis, and statistics.

    One thing I’ve been finding in the last year and a half of choosing a Masters stream and starting on it (choosing between “big data” techniques and traditional statistical ones) is that both forms of data analysis are fraught with danger even (or *especially*) for those with just a little more than basic knowledge.

    (I’ve heard it said, for example, that the majority of regression analysis conducted in research breaks the usability limitations of applied regression analysis.)

    • lindamciver says:

      You make a good point, but the current state of science is such that scientists don’t even know what is possible. They don’t know how to look at their own data. Not all departments will be able to afford a team of data scientists, and collaboration, while close to my heart, is not always easy to achieve. I think scientists need, at a minimum, to know what to look for, even if they have to hire someone to do the looking.

      • Joe says:

        That’s the “agree” part. 😛

        I was bemoaning again today the delusion (and it’s a strong and pervasive one) that people can *efficiently* “self help” given the right tools… lodge your own helpdesk ticket, analyse your own data, fill in your own “business case” forms, and so much more. Reality is that a hand-hold between the specialist user and the specialist service is significantly more efficient. No matter how hard you try, the whole suite of assumptions and knowledge that drives the design of the tools needs to be communicated *at some level* for a proper result… and that’s rarely done. So (for example) an organisation will pay their specialist user a day of effort to do something that a hand-hold partnership would achieve in less than half the time for a much more robust result.

        But in the meantime, the organisation can say “look we have a lean spend on technology / support staff / …” and score brownie points.

        I look forward to the day we bring back a culture of respect for subject matter specialists at all levels. The “tea ladies” of old weren’t there to make people feel pampered. They were there because they could make tea a lot more cheaply than the staff they served.
