Wednesday, December 30, 2009

Teaching data management?

Data management is one of the most important skills we learn as researchers. Without it we are well and truly screwed and can spend days slogging through a mess of our own making or have to re-do experiments altogether. But other than the training I have seen either on database programs or as a small section of a course in Unix, where does training in good data management come from?

In my own experience, data management was something learned on the job that became more complex as the amount and breadth of data grew. Often this was a gradual process that allowed the trainee to scale up from a base, and each person came up with a system that worked well for them. An obvious issue with this is that multiple systems in a lab can become problematic for data sharing, but as long as the final product is in a "sharable" format this is not a major problem in most circumstances.

In my field, however, things have changed. Yes, my students no longer have to walk to work uphill both ways in the snow like I did and nothing costs a nickel anymore. Now, the sheer volume of data we produce for a project is enormous compared with the data I dealt with as a student, so the gradual building of a database is not so gradual these days. Data management is more critical than ever, but it never occurred to me until recently that simply saying "Make sure you keep your datasets organized and well labeled" about 100 times isn't enough.

So, as part of my supervisory tasks in the coming year I'm going to start sitting down with all my students to go over their data management and ensure they have a good system in place. It may take me some time, but ultimately it'll save a lot of person-hours down the road.

I'm sure others have dealt with this in the past as either the teacher or the trainee. Any particularly effective strategies? I realize that the type of data matters, but I think it's worth discussing.

11 comments:

  1. I would suggest checking out BioKM (www.biodata.com), an internet-based application designed to meet the everyday needs of researchers in an academic lab environment, allowing them to store, search, share and manage laboratory and research data. Next Generation Science recently posted a review of BioKM; you can read it at http://bit.ly/56YmmW

  2. I had a scary experience with a database from hell my sophomore year. It was ~12yrs of data on a high-profile critter, so when critters were sampled, they tested for EVERYTHING under the sun. There was a (pretty awesome) spatial component to it too...

    The main problem with the data set? Many people had access to it and were entering the data inconsistently (a huge continuity issue with units, among other things).

    I spent the majority of my internship debugging that beastie. Oh, and it was done in an ancient version of Microsoft Office.

    I'd be interested to see what other people use for databases...

  3. Our data don't lend themselves to databasing, per se, more like iterative organization at different levels. The problem is storing different versions of datasets and keeping everything straight when one has to try different ways to organize the data.

  4. For each project I manage, I rely on a series of folders to store my data. DNA sequences, gel images, and data files are all stored in descriptive folders. In my lab notebook, I'll often print out a screen capture of those directories to tape into my book so I know the path to get to them. Our data is backed up in duplicate (my own system and the center's system), so if my computer takes a dive, I can get it back (relatively) easily.

    For lab management (tracking chemicals, primers, enzymes, organisms) I rely on MS Access ... though I'm having a hard time getting my staff to keep up with it as much as I do.

    I'm trying to move to an electronic notebook format, which SHOULD allow me to more easily track file storage (grab the file and drop it into the notebook and voilà! added instantly).

  5. I keep all of the numbers in an Excel spreadsheet, one file per study, and keep all of the raw calculations, pictures, etc. in one electronic folder with sub-folders where appropriate. When the manuscript is published and the study essentially finished, I print a hardcopy of the spreadsheet and keep it with the lab book that contains all of the work for that study. My biggest fears are having all of the data in a particular file format that is later superseded and no longer accessible, and losing everything in the event of an electronic meltdown.

  6. We have just embarked on a set of studies in my lab where each experiment results in hundreds of millions of individual pieces of data. The post-doc running this project seems to know what she's doing, so I am confident she will develop an appropriate pipeline for data analysis and archiving.

  7. For the past two years we have been working with 40 labs to create the best solution for research management. So far, 130 labs are using our service and the number is growing daily. We've learned a few things along the way, but the most important thing is to have a system that all students use, something that still works for the lab in four years, when your graduate students or post-docs leave. You should be able to find the results, images, tubes or any research-related outcome you need, and shouldn't have to spend hours searching for something before eventually giving up on it or starting from scratch. BioKM does just that.

  8. I took a look at the BioKM website and while I can see it being useful for a big lab, it's not really the type of data management we need. The problem we have is the management of large datasets on people's computers. If I have two students working on different projects that include large datasets, I need to make sure that they have the data sorted in a way that lets them keep it all straight. Our problem is not linking specific results to experiments; it is more a matter of seeing the important trees in the data forest.

  9. We have labs of all sizes working with us (6-12 researchers being most common). I am not sure what you are currently researching, but our specimen module was intended for just that: seeing the important trees. I'd be happy to walk you through our system and discuss its uses.
    Also, I think it is important to stress that all the features we currently offer came from discussions aimed at understanding the day-to-day needs of professors and students engaged in research.

    jonathan(dot)gross(at)biodata(dot)com

  10. This comment has been removed by a blog administrator.
