Summer 2010 — General: Data Structure
- Data Collection and Storage
- Structuring Data
I understand the goal of this part of the study group to be getting off the spreadsheet mindset. That is, I will be trying to convince you to convince yourselves that the best and most flexible format for our data is not one where we can easily visually inspect its entirety.
This advice applies mostly to researchers working with observational data.
Over-collect: When collecting data in the first place, over-collect if at all possible. The world is a very complex place, so there is no way you could cram it all into a bottle, but give it your best shot! If, during the course of your data analysis, you find that it would have been really useful to have data on, say, duration as well as formant frequencies, it becomes very costly to re-collect that data, especially if you haven't laid the proper trail for yourself.
Preserve HiD Info: If, for instance, you're collecting data on the effect of voicing on preceding vowel duration, preserve high-dimensional data coding, like Lexical Item, or the transcription of the following segment. These high-dimensional codings probably won't be too useful for your immediate analysis, but they will allow you to procedurally extract additional features from them at a later time. By preserving your high-dimensional information, you're preserving the data's usefulness for your own later reanalysis, as well as for future researchers.
Leave A Trail of Crumbs: Be sure to answer this question: How can I preserve a record of this observation in such a way that I can quickly return to it and gather more data on it if necessary? If you fail to answer this question, you'll be lost in the woods if you ever want to restudy, and the only way home is to replicate the study from scratch.
Give Meaningful Names: Give meaningful names both to predictor columns and to the labels of nominal observations. Keeping a readme describing the data is still a good idea, but at least now the data is approachable at first glance.
Distinguish between 0 and NA: Just do it. A 0 is a measured value, while NA records that no measurement exists.
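To see why the 0 versus NA distinction matters, here is a minimal sketch in Python. The counts are made up for illustration, and `None` stands in for NA; none of these names come from the study group materials.

```python
# Hypothetical counts of apples eaten by five subjects.
# 0 means "we checked, and they ate none"; None (NA) means "not recorded".
apples_eaten = [3, None, 0, 2, None]

# Wrong: treating NA as 0 invents two measurements and drags the mean down.
as_zero = [0 if n is None else n for n in apples_eaten]
mean_na_as_zero = sum(as_zero) / len(as_zero)

# Better: keep the real 0, and leave the unrecorded subjects out of the mean.
observed = [n for n in apples_eaten if n is not None]
mean_observed = sum(observed) / len(observed)

print(mean_na_as_zero, mean_observed)
```

The two means differ only because of how the missing observations were coded, not because of anything the subjects did.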
When we store data, it should be:
Raw: Raw data is the most useful data. It's impossible to move down to smaller granularity from a coarser, summarized granularity. Summary tables etc. are nice for publishing in a paper document, but raw data is what we need for asking novel research questions with old data. Also, it will make Tim Berners-Lee happy.
Open formatted: Do not use proprietary database software for long-term storage of your data. I have already heard stories about interesting data sets that are no longer accessible for research, either because the software they are stored in is defunct, or because current versions are not backwards compatible. At that point, your data is the property of Microsoft, or whoever.
Store your data as raw text, delimited in some way (I prefer tabs). I am not writing hyperbolically when I say go convert your data right now from a proprietary format to an open format.
Consistent: I think this is most important when you may have data in many separate files. Each file and its headers should be consistently named and formatted. They should also be consistently delimited and commented. There is nothing worse in a corpus than inconsistent headers and erratic comments, labels or NA characters.
Documented: Produce a readme describing the data, how it was collected and processed, and describe every variable and its possible values.
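As a rough illustration of the storage advice above, this Python sketch writes a small table as tab-delimited text with a header row and a single, consistent NA marker, then reads it back. The column names, the `write_tsv` and `read_tsv` helpers, and the data are all hypothetical; an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical raw observations; None marks a value that was never recorded.
rows = [
    {"subject": "s01", "fruit": "apple", "bought": 5, "eaten": 3},
    {"subject": "s01", "fruit": "orange", "bought": 2, "eaten": None},
]
fieldnames = ["subject", "fruit", "bought", "eaten"]


def write_tsv(rows, fieldnames, handle):
    """Write rows as tab-delimited text, spelling missing values as NA."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({k: "NA" if row[k] is None else row[k] for k in fieldnames})


def read_tsv(handle, numeric=("bought", "eaten")):
    """Read the text back, turning NA into None and counts into integers."""
    out = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in numeric:
            row[key] = None if row[key] == "NA" else int(row[key])
        out.append(dict(row))
    return out


buffer = io.StringIO()
write_tsv(rows, fieldnames, buffer)
text = buffer.getvalue()
round_trip = read_tsv(io.StringIO(text))
print(text)
```

Because the result is plain text, any tool that can split on tabs can still read it long after the software that produced it is gone.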
It is the way we structure data which will differ the most from spreadsheet-based thinking. In fact, working out how to effectively use
For any given data set there will be two kinds of variables.
ID Variables: These variables are identifiers or features of each unique observation. Essentially anything you are testing to see if it has an effect on outcomes will be an ID variable.
Measure Variables: These variables record your measurement of each unique observation.
What counts as an ID Variable or a Measure Variable will depend upon the study. For instance, in most studies sex of the subject will be an ID Variable, and something about the subject's response will be a Measure Variable. However, if you're doing a study as to whether men or women are more likely to show up to your experiment, sex of subject would be a Measure Variable.
There are two ways to organize your ID and Measure variables which I'll focus on here. I'll call them the ideal way, and the flexible way.
The ideal way to organize your table is with a row for every unique combination of ID variables. This will produce a column for every ID variable, and then a column for every kind of measurement. If you were doing a study keeping track of how many apples and oranges subjects bought, and then how many of those apples and oranges subjects ate, the ideal data format would look like Table 1.
Frequently, you will see data published looking like Table 2. This is a fine summary table format for publication in a paper, but it does not have the proper structure for rapid and easy statistical analysis, or graphical representation. Notice that the levels of the Fruit ID variable are represented in the columns, rather than in the rows.
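The difference between the two layouts can be sketched in Python. The summary rows below are only a guess at a layout with fruit levels in the columns, not a copy of Table 2, and the subject labels, counts, and the `to_ideal` helper are hypothetical.

```python
# Assumed summary-style layout: one row per subject, with the levels of the
# fruit ID variable spread across column names (fruit_measure).
summary = [
    {"subject": "s01", "apple_bought": 5, "apple_eaten": 3,
     "orange_bought": 2, "orange_eaten": 2},
    {"subject": "s02", "apple_bought": 1, "apple_eaten": 0,
     "orange_bought": 4, "orange_eaten": 1},
]


def to_ideal(summary_rows, id_column="subject"):
    """Return one row per subject and fruit, with one column per measurement."""
    ideal = {}
    for row in summary_rows:
        for column, value in row.items():
            if column == id_column:
                continue
            # Split "apple_bought" into the fruit level and the measurement name.
            fruit, measure = column.split("_", 1)
            key = (row[id_column], fruit)
            record = ideal.setdefault(
                key, {id_column: row[id_column], "fruit": fruit}
            )
            record[measure] = value
    return list(ideal.values())


ideal_rows = to_ideal(summary)
for record in ideal_rows:
    print(record)
```

After the conversion, fruit is an ordinary ID column, so every row is one unique combination of subject and fruit.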
If you already have data stored in only this summary format, or some other summary format, don't worry. By using the
The ideal data format is probably the one you will use the most for statistical analysis, graphical representation, and data storage.
However, we're going to focus a lot on what I'm calling the flexible format. For the most part, we'll be reformatting data from the ideal format to the flexible format, and then manipulating it using functions from the
The flexible data format has a row for every unique combination of ID variable and measurement type, with a column for the value of the measurement. For the apples and oranges data, it would look this way.
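As a minimal sketch of that reshaping, the Python below turns ideal-format rows into flexible-format rows. The `to_flexible` helper, the `variable` and `value` column names, and the data are hypothetical, and this is not the function the study group itself uses.

```python
def to_flexible(rows, id_vars, measure_vars):
    """Return one row per combination of ID variables and measurement type."""
    flexible = []
    for row in rows:
        for measure in measure_vars:
            # Copy the ID variables, then record which measurement this is.
            long_row = {key: row[key] for key in id_vars}
            long_row["variable"] = measure
            long_row["value"] = row[measure]
            flexible.append(long_row)
    return flexible


# Hypothetical apples and oranges data in the ideal format.
ideal = [
    {"subject": "s01", "fruit": "apple", "bought": 5, "eaten": 3},
    {"subject": "s01", "fruit": "orange", "bought": 2, "eaten": 2},
]

flexible = to_flexible(
    ideal, id_vars=["subject", "fruit"], measure_vars=["bought", "eaten"]
)
for row in flexible:
    print(row)
```

Two ideal rows with two measurements each become four flexible rows, and the single `value` column now holds both the bought and the eaten counts.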
At this point, you might be incredulous. We are, in fact, mixing different measurement types in the Value column. If you're not incredulous, this flexible format for vowel measurements might make you so.
As you can see, we are mixing values in hertz with values in milliseconds in a single column. This should offend your data sensibilities. However, with clever use of the
*This notation indicates which package a particular function comes from. Its format is