Friday, December 2, 2011

Quick summary ARL / DLF E-Science Institute Capstone -- Atlanta

I'm waiting for my flight home from Atlanta after the ARL / DLF E-Science Institute Capstone event.  Overall, it was a very productive event, especially for discussions with Rob Olendorf (my collaborator and a data management librarian at UNM) and Dale Hendrickson (head of Library IT at UNM).  Almost all of the attendees were library personnel, and I learned a lot from my interactions and the ideas presented.  I thought I would jot down some ideas and action items.

First, action items.  We were encouraged to develop "next steps" for when we return to our institutions.  Here are some of ours:

1.  Incorporate Library interactions with the undergraduate physics course (PHYC 308L, electronics lab) I am teaching next semester.  This is a new course for me, and I won't have time or familiarity to diverge much from the very good plan that prior instructors have developed.  But I know enough that Rob, Dale, and I came up with some concrete ideas that will be great for spurring data management at UNM and with these budding scientists:

  • Guest lecture by Rob to describe data management and related library services.  I think this would be best for the second lecture period in the course.  Rob will describe issues of data management and we will announce our intention to integrate library data management into the course (below).  Rob will also give an overview of github and a quick "how to."
  • A substantial part of the course (as I understand from talking to prior instructors and students) involves developing LabVIEW code for circuit design and simulation.  I'm guessing (pretty sure) that no source code control or versioning is used.  I think this presents a good (not perfect) opportunity to teach the students how to use github for versioning and source code sharing.  I'm thinking it will be an integrated requirement for all of the coding during the semester.  The reason it's not perfect is that LabVIEW uses binary files, so some of the forking and merging functionality will not be appreciated.  Many of the students are experienced in Matlab, though, and where possible I will encourage moving to that platform.  Regardless of how this plays out, I'm sure the students will come away from the course with a fundamental knowledge of github and how wonderful it is for protecting and sharing code.  I think I will also require LaTeX for their final reports, which will work well with github.
  • Incorporate data management, using the Library Institutional Repository.  Some infrastructure and coordination with the library will be necessary here, because I don't think we've done it before at UNM.  Dale's idea is to create a "community" in the DSpace IR for our course, e.g. "Junior Lab 308L."  The students will be in charge of uploading their final data sets (testing their circuits) into permanent, curated objects in the IR.  There may be difficulties with this, but I am confident that the students will come away with a good appreciation of the power of good data management, and, hopefully, a real, curated data set as part of their career portfolio.
2. Participation in the data management "group meeting."  The library currently has some kind of regular meeting like this, and I will visit one of their upcoming meetings.

3. (Mostly for Rob)--"finish" our pilot data management project.  Rob has been working on this for a long time and it hasn't been easy.  He is working on curating and archiving one of Andy Maloney's complete kinesin gliding assay data sets.  The uncurated data can be seen on our server.  I don't really understand how Rob is doing this, but he's done a lot of coding and is close to putting a curated version of that data set into our institutional repository.  There are 500,000 images in the set, and I think Rob said that involves more than 50 million lines of (XML?) code to describe it.  I may be getting terminology and numbers wrong ("schema," etc.) but the point is Rob is writing a lot of code to do it "right."  A finished product will serve as a great example to everyone on campus (and beyond), especially researchers, of what the library can provide for data management.  I think this will be a huge step for us at UNM and in convincing more researchers to collaborate with the Library for research data management.

There were many more "next steps," but they aren't coming to mind now.  More than just next steps, there were a lot of visionary ideas presented by groups at the capstone event.  Here are some that stuck in my mind:

1.  Graduate students are key to connecting data management librarians with research groups.  What seemed the best idea to emerge was that an existing pipeline to graduate students is the general requirement for "ethics / responsible research conduct" courses as part of NIH/NSF training grants.  Good data management is often part of these courses, and in my mind is essential for responsible / ethical research.  Given how these courses are usually implemented, I think it would be fairly easy for data management librarians to obtain one or more time slots to discuss data management with the graduate students.  Best would be "hands-on" coursework, where the students are asked to bring data to the course.  This was discussed a bit on a friendfeed thread.

2.  Our institutional group and at least one other (can't remember the institution) more than once mentioned a vision for the library providing more than just data curation / preservation / storage.  I don't have a good term to capture this area, but it involves capturing / helping with workflow (especially custom software used in labs for data management / processing) and data visualization.  In my mind, a ripe area for connecting with researchers is to work backwards from the traditional publication.  Currently, many libraries have an institutional repository that allows researchers to post PDFs of research papers.  And usually that's about it (from what I can see).  Working back upstream, what I think would be very useful is to provide a computational workspace (through the library) where researchers can process data and produce the figures in those papers.  As an example, my graduate student logs into the library workspace, and uploads the data needed to produce the final figures.  The graduate student and I then use software on that workspace (maybe R, Matlab, Excel) to create the figures for the paper.  There is a versioning system to keep track of the code used to produce the figures and the many versions created.  When the paper is submitted for peer review (the current standard), it is seamless to link each figure to the data sets and the code used to generate those figures, using either permanent URLs or DOIs.  For me as a researcher, I would LOVE such a system.  And talking with Dale and Rob, it doesn't seem too much of a pipe dream.  It's a lot of work, but I think it would be a huge step and improvement in data management and data sharing in research.  Successful implementation would also be a really great way to recruit more researchers into data management partnerships with the library.
An important component of this I forgot to describe above is that there will be experts in the Library (such as Rob) who can work side-by-side (virtually) with us to develop the data visualization code and figures.

3.  Related to item 1 above, I think connections with graduate students could be greatly accelerated by a grants / data management competition.  A $1000 research grant prize, awarded directly to graduate students for "the best data management," would I think be very effective.  Compared to what we need to accomplish to transform research and the library's involvement, $1000 every so often is not a lot.  But it would mean a lot to the graduate students in the competition.

4. The NSF Data Management Plan (DMP) requirement has already done a lot to connect researchers with data management librarians.  Rob estimates more than 30 faculty connections have been made for him at UNM because of DMPs.  I think this is just one great outcome of the DMP requirement.  And it illuminates a huge opportunity that I see for researchers and libraries.  In my specific case, if I get tenure at UNM, I want to pursue a couple of training grants.  One specifically I would like to try for is an "open science" NSF REU program.  REU is "research experience for undergraduates," usually involving summer research internships for undergraduates from other institutions around the country.  I think an REU proposal with a heavy focus on "open science" and advanced data management would look very appealing to the NSF.  Of course I also think it would be very effective in training the next generation of researchers.  Importantly, though, I would need a lot of help to write this grant.  The Library's experience with DMPs can be extended to this effort, and people like Rob and others will be essential in planning, writing, and executing the grant.  Moreover, I think other people on campus who are planning other training grants would get a big "broader impacts" boost from this kind of data management or "open science" collaboration with the library.  So, hopefully, our Research office can help coordinate these endeavors.
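
The figure-linking idea in item 2 above could probably be prototyped with very little code.  Here is a minimal Python sketch, not a description of any existing library system: it builds a small record tying one figure to the data files (by checksum), the code version, and a permanent identifier.  The function names, the field names, and the example identifier are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def figure_record(figure_id, data_files, code_version, data_identifier):
    """Build a provenance record for one figure.

    The field names are hypothetical, not a repository standard.
    """
    return {
        "figure": figure_id,
        # Assumed to be something like a commit hash from the versioning system.
        "code_version": code_version,
        # Assumed to be a permanent URL or DOI assigned by the repository.
        "data_identifier": data_identifier,
        # One entry per data file, so a reader can verify they have the same data.
        "data_files": [
            {"name": Path(p).name, "sha256": sha256_of(Path(p))}
            for p in sorted(data_files)
        ],
    }


def save_record(record, out_path):
    """Write the record as JSON next to the figure so it can be archived with it."""
    Path(out_path).write_text(json.dumps(record, indent=2))
```

For example, `figure_record("fig2", ["velocities.csv"], "abc123", "doi:placeholder")` would return a dictionary that `save_record` can write out alongside the figure.  A checksum per file is used here because it lets anyone confirm that the data behind a figure has not changed since the paper was submitted.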

There are many, many more ideas, but I think I'm out of steam for now.  Overall, a great conference and I'm excited to pursue these ideas!
 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.