Friday, October 28, 2011

Open Access Week event at U. Arizona: Reproducibility, Open Data

Earlier this week I was lucky enough to participate in the Open Access Week event at the University of Arizona: The Future of Data: Open Access and Reproducibility.  The event was hosted by Chris Kollen and Dan Lee of the University of Arizona Libraries.  I am very grateful for the invitation and the opportunity to meet them, some active members of the audience, and the other speakers, Victoria Stodden and Eric Kansa.

Victoria Stodden gave an excellent talk, framed around the computational sciences, with a major point: instead of promoting "open data," we should promote "reproducibility" in science.  She argued, very convincingly, that good science requires reproducibility, and thus scientists should be easily convinced that we need very high standards for reproducible results.  For computational research, the only way to ensure reproducibility is to publish much more open data and open code than is normally done now.  If your result is computational, how can anyone hope to replicate and build upon it if you haven't provided the source code and the data sets?  They can't, yet publications without code and data are by far the most common these days.  It's a failure of science that probably has many causes.  One that comes to mind is that computational scientists have been forced to fit their "publications" into standard peer-reviewed articles, where the system is not set up to accept and/or host source code and data.  (As an aside, this is clearly a routine failure of peer review, since referees obviously are not ensuring reproducibility of the research, which should be a primary criterion for publication.)  Scientists understand that reproducibility is an essential element of research.  For example, two years in a row, my undergraduate physics majors identified reproducibility as the most important element of good science (see brainstorming 2010).  Since scientists understand this, they will naturally practice open publishing of data, code, and methods once they realize that reproducibility is missing without those elements.  As Victoria argued, demanding "open data" leads to confusion, resistance, and ultimately, probably, a lack of compliance.  In contrast, demanding "reproducible research" is already a cultural norm, and it naturally leads to open data and open code of the variety most helpful for reproducibility.  Victoria's slides can be found here.

The notes for my presentation can be found on linked mindmaps, starting here.  (Click on the tiny right arrows to navigate.)  My notes are probably not too meaningful if you weren't at the symposium.  In contrast to Victoria's high-level talk about policies that could make a major impact, I told a few stories about open data and open notebook science in our own teaching and research labs, and the successful impact we've had already.  I think (hope) it provided concrete examples of the benefits of open science.  On the one hand, I showed that open science, and especially open notebook science, strongly promotes reproducibility.  This has been seen best in the undergraduate physics lab that I teach.  Students read the notebooks of other students from prior weeks and prior years.  They build upon these previous results, which allows them to get the experiment working much more quickly and leaves more time to explore new aspects of the experiment or to develop new data analysis methods.  They are doing real science!  I showed an example of an excellent primary notebook from Alex Andrego and Anastasia Ierides.  On the other hand, I think I also showed that open data and open science make an impact beyond reproducibility: the reuse and repurposing of data.  I told two stories in which theory and research groups have already been able to use data we publicly shared on YouTube.  One group has already used our data in a theory preprint on the arXiv.  Both groups expressed delight and gratitude that our data were freely available.  There are two important features of these stories.  First, both groups used our data for a purpose that we had not (and probably would not have) imagined!  Clearly, the impact of our data was multiplied by being public.  Second, we used the easiest and simplest sharing method we could find, YouTube, yet we still made an impact.  We are currently working with Rob Olendorf, a data curation librarian at UNM, to vastly improve our sharing.  This will include permanent citation links, vastly improved metadata (at least 10x more than the data itself), hosting by the institutional repository (much safer than our lab server), and links to other data sets.  It stands to reason that if we could make an impact with the imperfect system we tried first, the impact will be much higher with the data shared via Rob and the institutional repository.

The final talk was by Eric Kansa, who described the amazing work he and his colleagues have done on Open Context, a platform for sharing and linking archaeological data.  His notes from the event can be found here, and his slides are also available: A More Open Future for the Past.  Although I am far from the field of archaeology, it was easy for me to see the vast impact that Eric and his colleagues are making via the Open Context project.  A large amount of time, sweat, and money is expended collecting archaeological data.  Without opening, curating, and linking these data, their potential impact is severely limited.  The Open Context team has developed a method for collecting these data, archiving them, and linking them to other data sets.  The method is very effective and, importantly, requires far less work than collecting the data in the first place.  To me, this seemed a clear case of the huge power of data reuse and repurposing. In contrast to computational science, the power of data reuse seemed to trump the need for open data for reproducibility.  This is not surprising, given how different the two fields are, but it was an interesting and somewhat confusing contrast for me between the needs for open data in computational research versus archaeology.

There were several engaged audience members.  One of them was Nirav Merchant, with the iPlant Collaborative.  Victoria and I were highly impressed by the computational platform that iPlant has already developed, only three years into the NSF cyberinfrastructure project.  I was simply amazed and couldn't do it justice by describing it here.  The iPlant platform's capacity to ensure reproducibility of computational research is vast.  One example is how easy it is to save an image of a virtual machine and then share this image with other users.  They demonstrated this for us, and it took only a few clicks and less than a minute.  I highly recommend reading more about iPlant at their site linked above.  The iPlant team that we met was energized, engaged, and collectively brilliant.  I'd love to know how they assembled their team, as they've clearly done an excellent job.  I intend to stay in contact with the iPlant folks and am even hoping that I can introduce the computational platform to my Junior Lab students this year.  I think the exposure to these state-of-the-art, open tools will be invaluable for their future research.

Overall, the one-day Open Access Week event was highly successful for me.  I met some amazing people and gained a lot of clarity in my thinking about the imperative for much more openness and sharing in science. Incidentally, and maybe not coincidentally, during my flights I was able to read Michael Nielsen's fantastic new book on the untapped potential of connected, open science: Reinventing Discovery.  Despite having met Michael and having heard him speak a few times, I still found the book riveting and learned a lot.  I absolutely recommend the book to anyone interested in the practice of science!
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.