Tuesday, May 18, 2010

Academic data plans

Charles sent me a fake letter to the NSF. Since it doesn't match my 15 years of experience in and with academic IT, I provide this edited version as a free (as in beer) template.

Dear NSF,

I am happy to respond to your request for a 2-page Data Management Plan.

First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". I heard about this at a conference in 1999, and even though it is outside my field, it seems like a subject that could generate a lot of publications, so I have been toying with the idea of putting some grad students on this and having them publish with me listed as primary investigator, since I am only working on the deep questions in CS theory and haven't touched a computer in decades except for buying things for my wife on eBay. I even have my secretary print out all my email and transcribe the answers that I give her on a Dictaphone tape anywhere from 3-6 months after I get the printout.

Since it now looks like I can get funding from the NSF as soon as the English grad I hired to write my grant proposals is done, I'll get right on it.

I probably just need one big hard drive; if we need more, we'll just daisy chain them to the SGI Indy we have here. Alternatively, I would like to suggest that we combine this with a DARPA grant and acquire a StorageTek 9985V.

The files will be named by the date they were created and the name of the grad student creating them. If more than one file gets generated per day, they will be named sequentially, e.g. charles20100501-1. The file descriptions will be sent as an Excel sheet (printed and sent as PDF weekly). Students will mark the files they need for their work on the sheet, and our department assistant will create one or more custom DVDs with the requested sets once a week and send them to the students. All results will be converted to PDF and sent to the department assistant, who will scan them and upload them to the disk(s).

The advantage of this is that the data on the server can't be corrupted by students, and the DVDs also serve as a backup. In case of a server disk failure, we will just ask them to return the DVDs and re-upload the data. Thanks to the ingenious naming scheme I invented above, there will be no file collisions, and due to the date+sequential numbering, we intrinsically have incremental backups.

If we can't avoid it, for example because of state and federal law, we will make the produced data available to other researchers under the following license:

"You are provided this data for the sole purpose of reproducing our published results. Any attempt to publish your own analyses of this data will be rejected, if necessary during the anonymous review process, by pointing out all of the data cleanup steps you forgot to do correctly in your analysis. If you succeed in publishing, I need to be named as lead investigator."

The license will be faxed to the other department and has to be returned signed via FedEx.

After receiving the signed license, we will upload the data with the name of the requester, file date and sequential number if more than one file is requested. The files will be compressed in the industry standard LZW format and encrypted. For data security, the encryption keys will not be stored electronically, but kept in a secure three-ring binder in my locked desk drawer. If a key is required, my assistant has permission to open the drawer and fax the key to the requester or tell it over the phone.

I talked to my longest-serving graduate assistant, Larry, and he suggested ISO 8859-16 as the encoding. Since he has been working in my department for 28 years, I trust his experience implicitly.

Note, we won't be using a version control system since they only add overhead. All the code will be in Python, Perl, C or FORTRAN 99 (Fujitsu/Siemens extensions), depending on the whim of the grad student. All code will be named similarly to the data sets above, and students and faculty will maintain their custom build scripts on their PCs.

Sincerely yours,

XXX
Professor of Biophysics and Department Chair
