BITlab: Behavior Information Technology

BITlab
404 Wilson Rd. Room 251
Communication Arts & Sciences
Michigan State University
East Lansing, MI 48824

Notes from Data Management Workshop

Emilee Rader, June 12, 2015

Introduction

  1. PhD Comic: http://phdcomics.com/comics/archive_print.php?comicid=1689
  2. What is data?
    1. “the intermediary products of research”: http://www.grad.msu.edu/researchintegrity/docs/rif01.pdf
    2. “the recorded information necessary to support or validate research findings” — rio.msu.edu—research-data
    3. Examples
      • hard copy
      • database
      • keychain drive
      • scripts and code
      • meeting whiteboard notes
      • documentation (like the BITlab wiki)
      • subject tracker
      • csv file with numbers in it
      • affinity photos
      • codebook
      • audio and video recordings (primary data)
      • transformed data
      • finalized data
      • paper drafts
      • published versions of papers (and datasets!)
  3. Who cares about data management??
    1. recent story about the gay marriage opinion survey: http://www.slate.com/blogs/browbeat/2015/06/01/gay_marriage_study_faked_how_grad_student_david_broockman_uncovered_a_huge.html
    2. story about a copy-and-paste error in an Excel spreadsheet: http://www.businessinsider.com/reinhart-and-rogoff-admit-excel-blunder-2013-4
    3. Rick’s story about a reviewer request: “The methods section is still too vague to be useful to other researchers who may want to use similar metrics when building on your work. Please define what cut offs and metrics were used to determine whether a participant was a member of the site for a long time and what metrics provided the cut off for low or high activity, etc. Once these changes are in place, please submit the revised manuscript so that we can move forward quickly.”
    4. My story about an almost-lost dissertation (backup)
    5. My story about the topic models paper (started working on it in 2011, multiple people contributed)
  4. Data Management is everyone’s responsibility!!!
    1. My goal for today is to give you some examples to think about, and give you some exercises to work on specific to your project this summer, NOT to convey everything you will ever need to know about data management. It’s a process. And it takes practice, and sometimes feels like extra work that is just a pain in the ass. But, it will absolutely save your ass someday.
  5. Topics I will cover
    1. data integrity: quality control; accuracy and consistency of data over its entire life-cycle
    2. data description: documentation of what the data are, where they came from, how they were collected, how to use them
    3. data organization: the structure and rules for how the data should be named and organized and versioned
    4. data security: protection against unauthorized access or untimely destruction/disaster
    5. data ownership: who has the rights to control/manage/destroy the data

Data Integrity

  1. Your conclusions are only as good as your data, and your data are only as good as your process
  2. Types of errors in data collection:
  3. BEFORE, DURING and AFTER you collect / produce data…
    1. Focus on preventing problems; also detecting and monitoring issues
    2. It is important to decide who is responsible for conducting what parts of the process
      • design of the protocol/process, collection/production, review of data
    3. What procedures will be necessary to ensure nothing has gone wrong?
      • software tests?
      • code and/or data reviews?
      • debrief/postmortem?
      • how to document changes to the plan? (AND deviations from the plan?)
    4. Requires communication! And knowing who to go to if you think there’s a problem.
  4. For discussion: Imagine it is the end of the REU internship, and as you’re going over your work for the summer you realize that there was a bug in your code and this caused some incorrect or invalid data to be produced. How should you proceed?
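
One way to make the "software tests" and "data reviews" above concrete is a small script that checks each row of a data file before any analysis is run. The following is only a minimal sketch, not a prescribed lab procedure: the column names (participant_id, rating), the 1 to 7 rating range, and the function name find_invalid_rows are all hypothetical and would need to match your own codebook.

```python
import csv
import io

# Hypothetical rules; in a real project these would come from the codebook.
REQUIRED_COLUMNS = ["participant_id", "rating"]
RATING_MIN, RATING_MAX = 1, 7


def find_invalid_rows(csv_file):
    """Return a list of (line_number, problem) tuples for rows that fail checks."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, f"missing columns: {', '.join(missing)}")]

    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header
    # (assumes no quoted field spans more than one line).
    for line_number, row in enumerate(reader, start=2):
        pid = (row["participant_id"] or "").strip()
        if not pid:
            problems.append((line_number, "empty participant_id"))
        elif pid in seen_ids:
            problems.append((line_number, f"duplicate participant_id {pid}"))
        seen_ids.add(pid)

        try:
            rating = int(row["rating"])
        except (TypeError, ValueError):
            problems.append((line_number, f"rating is not an integer: {row['rating']!r}"))
            continue
        if not RATING_MIN <= rating <= RATING_MAX:
            problems.append((line_number, f"rating out of range: {rating}"))
    return problems


if __name__ == "__main__":
    # Made-up sample data with one out-of-range value, one duplicate ID,
    # and one row with an empty ID and a non-numeric rating.
    sample = io.StringIO("participant_id,rating\nP01,5\nP02,9\nP01,3\n,abc\n")
    for line_number, problem in find_invalid_rows(sample):
        print(f"line {line_number}: {problem}")
```

Running a check like this every time new data arrive, and recording the result in the project notes, is one way to detect problems early instead of at the end of the summer.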

Data Description

  1. Not just WHAT, but also WHEN, WHY and HOW. You might think you’ll remember everything you need to, but I promise you, YOU WILL NOT.
  2. Project documentation
    1. At minimum, have a readme.txt file — for each major activity!
    2. Name of project, people, roles & contact information
    3. Executive summary or abstract for basic context
    4. Inventory of servers, directories, data, lab equipment, and other resources
    5. Relationships between files
  3. Process documentation
    1. Protocols, procedures, software or code settings, code comments
    2. Workflow descriptions (text) or diagrams (image)
    3. Include example scripts, inputs, outputs if applicable — AND the reasons behind your choices
    4. Aim for reproducibility of all stages of the data cleaning, processing, and analysis
    5. A great start for process documentation is a lab notebook
  4. For discussion: What are some things you might capture in a lab notebook or journal that are different from what you typically see in a research article?
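
The project documentation items listed above (name, people and roles, summary, inventory, relationships between files) can be turned into a readme.txt skeleton so that every major activity starts with the same structure. This is a sketch under stated assumptions: the function build_readme, its parameters, and the section titles are hypothetical, and all values in the usage example are made-up placeholders.

```python
from datetime import date


def build_readme(project, people, summary, inventory, relationships, today=None):
    """Return readme.txt text covering the project documentation items above.

    people: list of (name, role, contact) tuples
    inventory: list of (location, description) tuples
    relationships: list of free-text notes on how files relate to each other
    """
    today = today or date.today()
    lines = [
        f"Project: {project}",
        f"Last updated: {today.isoformat()}",
        "",
        "People, roles, and contact information",
    ]
    lines += [f"  - {name} ({role}): {contact}" for name, role, contact in people]
    lines += ["", "Summary", f"  {summary}", "", "Inventory of resources"]
    lines += [f"  - {location}: {description}" for location, description in inventory]
    lines += ["", "Relationships between files"]
    lines += [f"  - {note}" for note in relationships]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are made-up placeholders.
    text = build_readme(
        project="Example survey study",
        people=[("A. Student", "data collection", "student@example.edu")],
        summary="Pilot survey about password habits.",
        inventory=[("data/raw/", "unmodified survey exports")],
        relationships=["data/clean/responses.csv is produced from data/raw/ by clean.py"],
    )
    print(text)
```

A template only covers project information; process information (the reasons behind your choices) still has to be written by hand.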

Data Organization

  1. When you work on an interdisciplinary team, how to organize the data can be complicated
    1. (Refer back to the kinds of data from each project written on the whiteboard)
    2. Where to put ALL THE THINGS? Not in the same place…
  2. File formats
    1. platforms are often chosen for expediency rather than longevity and ease of access — this is bad
    2. Consider price, ease of use, and backwards compatibility (don’t just use the new hotness)
    3. Make an informed decision!!
  3. File organization and naming
    1. Create a file plan and discuss repeatedly, because projects change and evolve
    2. Avoid too many unsynchronized copies; it gets hard to remember which one is “authoritative”
    3. Use naming conventions to avoid confusion
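
A naming convention is easier to follow when a script can build and check file names. The sketch below assumes a hypothetical convention of the form project_description_YYYY-MM-DD_vNN.ext; this pattern and the function names make_name and follows_convention are illustrations only, not the lab's actual convention.

```python
import re
from datetime import date

# Hypothetical convention: project_description_YYYY-MM-DD_vNN.ext
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<description>[a-z0-9-]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def make_name(project, description, day, version, ext):
    """Build a file name that follows the convention above."""
    # Lowercase the description and turn runs of other characters into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"{project.lower()}_{slug}_{day.isoformat()}_v{version:02d}.{ext.lower()}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"cannot build a valid name from these parts: {name}")
    return name


def follows_convention(name):
    """Return True if the file name matches the convention."""
    return NAME_PATTERN.match(name) is not None


if __name__ == "__main__":
    print(make_name("Demo", "Interview codes", date(2015, 6, 12), 3, "CSV"))
    print(follows_convention("final_FINAL2.csv"))
```

Putting the date in ISO order (year, month, day) means that sorting file names alphabetically also sorts them chronologically.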

Data Security

  1. Part of data management is trying to anticipate and plan for consequences of different kinds of “disasters” (aka threats)
  2. There are different kinds of threats…
    1. (brainstorm)
    2. common malfunctions, like a computer “crash” or hard drive failure
    3. accidents, like broken water pipes
    4. weather events or natural disasters
    5. malicious acts like data breaches, arson, vandalism
  3. Prevention / mitigation / recovery
    1. Everyone can help with this
    2. Keep good records about equipment
    3. Access control — both for digital stuff and for space (know who and what are in your lab and in your files)
    4. Software updates and virus protection
    5. Backups!!! Good practices for creating a backup strategy:
      • Make 3 copies (original + external/local + external/remote; original + 2 formats on 2 drives in 2 locations)
      • Geographically distributed and secure
      • Local vs. remote, depending on needed recovery time
      • Know what resources are available to you: personal computers, external hard drives, and departmental or university servers may all be used
  4. For discussion: Imagine that your laptop is stolen from a coffee shop. What do you do?
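
A backup only helps if the copies actually match the original. The sketch below compares SHA-256 checksums of every file in an original directory against its copy in a backup directory, using only the Python standard library. The function names sha256_of and verify_backup are hypothetical, and the sketch assumes the backup mirrors the original directory layout.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir, backup_dir):
    """Compare every file under original_dir with its copy under backup_dir.

    Returns a dict mapping relative paths to "missing" or "differs".
    An empty dict means every original file has an identical backup copy.
    """
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    problems = {}
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems[relative.as_posix()] = "missing"
        elif sha256_of(source) != sha256_of(copy):
            problems[relative.as_posix()] = "differs"
    return problems
```

Calling verify_backup("project", "/path/to/backup/project") after each backup run (both paths are placeholders) and logging the result gives you a record that the local and remote copies were good on a given date.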

Data Ownership

  1. Ownership is complicated… because research is complicated. From http://ori.hhs.gov/education/products/n_illinois_u/datamanagement/dotopic.html
    1. There are a lot of different kinds of contributions and roles involved in a research project… and most of them imply some kind of ownership (brainstorm)
      • Creator – The party that creates or generates data
      • Consumer – The party that uses the data
      • Enterprise (University) – All data that enters the enterprise or is created within the enterprise is completely owned by the enterprise
      • Funder – The user that commissions the data creation claims ownership
      • Packager (data repository?) – The party that collects information for a particular use and adds value through formatting the information for a particular market or set of consumers
      • Subject – The subject of the data claims ownership of that data, mostly in reaction to another party claiming ownership of the same data
  2. Ownership means BOTH possession of and responsibility for data
    1. When you work on a project, you have a “stake” in the data produced, but you are also responsible for that data in all the ways we’ve already talked about.
  3. Who owns “my” data? (at MSU)
    1. MSU does! “Except where precluded by the specific terms of sponsorship or other agreements, tangible research property, including scientific data and other records of research conducted under the auspices of Michigan State University, belongs to Michigan State University. The PI should be responsible for maintenance and retention of research data.” — rio.msu.edu—research-data
    2. The PI is the “custodian” and is responsible for:
      • data management
      • setting conditions for access
      • maintaining records (in the unit where they were produced)
      • also, for things that go wrong — and for providing access in case of audit/investigation
    3. Researchers involved in group investigations have rights of access to data gathered by all members of the group, while they are part of the group. This means that ALL project work must be available to ALL members of the team.
    4. When you leave MSU… you can take copies with you (depending on confidentiality/IRB, etc.) but the originals MUST stay here
  4. For discussion: Imagine that you are using Google Docs to collaborate on a research project. Who owns the data? How can you find out? What do you need to think about before using an online/cloud service in your research project?

Exercises

  1. File organization exercise/demo
    1. Organize the securitymodels.working file
    2. How to figure out where something should go in the Dropbox or in git
    3. Where to create a file and in what format (Dropbox, Google Docs, etc.), and what to consider
    4. Naming conventions: more words means more understandable
  2. Create a README file
    1. put your name in your files, and the date (so others know who to ask with questions)
    2. inventory of locations of important files
    3. documentation: do it for someone else, not yourself
    4. project information vs. process information
  3. Code documentation
    1. the goal is both to write lots of comments and to write readable code
    2. document what the piece of code is FOR, not what it does (it should be readable enough to be self-explanatory)
    3. best practices:
      • use consistent indentation
      • comment blocks that make functional sense
      • consistent and sensible naming scheme
      • avoid repeating the same thing in multiple places
      • avoid lots of nesting (makes code hard to read)
      • separation of code and data (don’t hard-code things!!)
  4. Start a “lab notebook”/journal
    1. file format? (and why your notes should NEVER exist only on paper)
    2. where to store it?
    3. what should go in it?
      • protocols and procedures
      • software settings
      • code comments
      • workflow descriptions or diagrams (process)
      • examples
      • concepts/constructs and operationalization
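
For the code documentation practices in exercise 3, the sketch below illustrates two of them: keeping data (here, cut-off values) out of the code, and writing comments that say what the code is FOR. Everything in it is hypothetical, including the setting names long_term_days and high_activity_posts, their values, and the function names; none of it comes from an actual study.

```python
import json

# Cut-offs live in a settings document, not in the code, so that they can be
# changed, reviewed, and reported in a methods section without editing logic.
# The keys and values here are made-up placeholders.
EXAMPLE_SETTINGS = '{"long_term_days": 365, "high_activity_posts": 50}'


def load_settings(text):
    """Parse settings and fail early if a required cut-off is missing."""
    settings = json.loads(text)
    for key in ("long_term_days", "high_activity_posts"):
        if key not in settings:
            raise KeyError(f"missing setting: {key}")
    return settings


def classify_member(days_on_site, post_count, settings):
    """Label a participant for the analysis groups.

    Purpose: later analyses compare long-term and newer members at high and
    low activity levels, so every participant needs exactly one label.
    """
    tenure = "long-term" if days_on_site >= settings["long_term_days"] else "newer"
    activity = "high" if post_count >= settings["high_activity_posts"] else "low"
    return f"{tenure}/{activity}"


if __name__ == "__main__":
    settings = load_settings(EXAMPLE_SETTINGS)
    print(classify_member(400, 10, settings))
```

In a real project the settings would typically be read from a file kept alongside the data, and the reasons for each cut-off would go in the lab notebook.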