Summer 2010 — General: Data Structure

Goals
Data Collection and Storage

General Principles of Data Collection
Storing Data

Structuring Data

ID and Measure Variables
Organization

Goals

I understand the goal of this part of the study group to be getting off the spreadsheet mindset. That is, I will be trying to convince you to convince yourselves that the best and most flexible format for our data is not one where we can easily visually inspect its entirety.

Data Collection and Storage

General Principles of Data Collection

This advice applies mostly to researchers working with observational data.

Over-collect: When collecting data in the first place, over-collect if at all possible. The world is a very complex place, so there is no way you could cram it all into a bottle, but give it your best shot! If, during the course of your data analysis, you find that it would have been really useful to have data on, say, duration as well as formant frequencies, it becomes very costly to go back and collect that data, especially if you haven't laid the proper trail for yourself.

Preserve HiD Info: If, for instance, you're collecting data on the effect of voicing on preceding vowel duration, preserve high-dimensional data coding, like Lexical Item, or the transcription of the following segment. These high-dimensional codings probably won't be too useful for your immediate analysis, but they will allow you to procedurally extract additional features from them at a later time. By preserving your high-dimensional information, you're preserving the data's usefulness for your own later reanalysis, as well as for future researchers.

Leave A Trail of Crumbs: Be sure to answer this question: How can I preserve a record of this observation in such a way that I can quickly return to it and gather more data on it if necessary? If you fail to answer this question, then you'll be lost in the woods if you ever want to restudy, and the only way home is to replicate the study from scratch.

Give Meaningful Names: Give meaningful names both to predictor columns and to the labels of nominal observations. Keeping a readme describing the data is still a good idea, but at least now the data is approachable at first glance.

Distinguish between 0 and NA: Just do it. A 0 is a measurement; an NA means no measurement was made.
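To see why the 0 versus NA distinction matters, here is a minimal sketch in plain Python. The counts are made up for illustration, and `None` stands in for NA; nothing here comes from a real data set.

```python
# Hypothetical counts of apples eaten by four subjects.
# None plays the role of NA (no observation was made); 0 is a real measurement.
eaten = [2, 0, None, 4]

# Recording the missing observation as 0 drags the mean down.
as_zero = sum(0 if x is None else x for x in eaten) / len(eaten)

# Keeping it as NA lets us average only what was actually observed.
observed = [x for x in eaten if x is not None]
as_missing = sum(observed) / len(observed)

print(as_zero)     # 1.5
print(as_missing)  # 2.0
```

The two means differ only because of how the one unobserved value was coded, which is exactly the kind of error that is hard to spot later.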

Storing Data

When we store data, it should be:

Raw: Raw data is the most useful data. It's impossible to move down to smaller granularity from a coarser, summarized granularity. Summary tables etc. are nice for publishing in a paper document, but raw data is what we need for asking novel research questions with old data. Also, it will make Tim Berners-Lee happy.

Open formatted: Do not use proprietary database software for long term storage of your data. I have already heard stories about interesting data sets that are no longer accessible for research either because the software they are stored in is defunct, or current versions are not backwards compatible. At that point, your data is property of Microsoft, or whoever.

Store your data as raw text, delimited in some way (I prefer tabs). I am not writing hyperbolically when I say go convert your data right now from a proprietary format to an open format.

Consistent: I think this is most important when you have data in many separate files. Each file and its headers should be consistently named and formatted, and the files should be consistently delimited and commented. There is nothing worse than a corpus with inconsistent headers, erratic comments and labels, or varying NA characters.

Documented: Produce a readme describing the data, how it was collected and processed, and describe every variable and its possible values.
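The storage advice above can be sketched in a few lines of plain Python: write raw observations as tab-delimited text with one consistent NA marker, and read them back. The column names, the values, and the helper names `write_tsv` and `read_tsv` are assumptions made for this illustration.

```python
import csv
import io

# Hypothetical raw observations; None marks a value that was not recorded.
rows = [
    {"Person": "John", "Fruit": "Apple", "Bought": 5, "Ate": 1},
    {"Person": "Mary", "Fruit": "Apple", "Bought": 3, "Ate": None},
]
fields = ["Person", "Fruit", "Bought", "Ate"]

def write_tsv(handle, rows, fields, na="NA"):
    """Write rows as tab-delimited text, spelling missing values as `na`."""
    writer = csv.DictWriter(handle, fieldnames=fields,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({k: na if row[k] is None else row[k] for k in fields})

def read_tsv(handle, na="NA"):
    """Read tab-delimited text back, turning the `na` marker into None."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{k: None if v == na else v for k, v in row.items()}
            for row in reader]

# io.StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_tsv(buffer, rows, fields)
text = buffer.getvalue()
print(text)

restored = read_tsv(io.StringIO(text))
print(restored[1]["Ate"])  # None
```

Note that everything comes back as text (the count 5 is read as "5"), which is one more reason the readme should describe each variable and its possible values.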

Structuring Data

The way we structure data is what will differ most from spreadsheet-based thinking. In fact, working out how to use the reshape (CRAN, website) package effectively has really changed the way I think about data, and hopefully it will change yours too. I said here that the ideal data structure is one where every row is an observation. Now I'll be refining what counts as an observation.

ID and Measure Variables

For any given data set there will be two kinds of variables.

ID Variables: These variables are identifiers or features of each unique observation. Essentially anything you are testing to see if it has an effect on outcomes will be an ID variable.

Measure Variables: These variables record your measurement of each unique observation.

What counts as an ID Variable or a Measure Variable will depend upon the study. For instance, in most studies sex of the subject will be an ID Variable, and something about the subject's response will be a Measure Variable. However, if you're studying whether men or women are more likely to show up to your experiment, sex of subject would be a Measure Variable.

Organization

There are two ways to organize your ID and Measure variables which I'll focus on here. I'll call them the ideal way, and the flexible way.

The ideal way to organize your table is with a row for every unique combination of ID variables. This will produce a column for every ID variable, and then a column for every kind of measurement. If you were doing a study keeping track of how many apples and oranges subjects bought, and then how many of those apples and oranges subjects ate, the ideal data format would look like Table 1.

Table 1
Ideal data format
ID Variables		Measure Variables
Fruit	Person	Bought	Ate
Apple	John	5	1
Orange	John	5	3
Apple	Mary	3	2
Orange	Mary	4	3

Frequently, you will see data published looking like Table 2. This is a fine summary table format for publication in a paper, but it does not have the proper structure for rapid and easy statistical analysis, or graphical representation. Notice that the levels of the Fruit ID variable are represented in the columns, rather than in the rows.

Table 2
Summary Table
	Apples		Oranges
Person	Bought	Ate	Bought	Ate
John	5	1	5	3
Mary	3	2	4	3

If you already have data stored in only this summary format, or some other summary format, don't worry. By using the reshape::melt()* function, we can get from the summary format to the ideal format. I would encourage you not to store data in summary formats anymore, however.
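To make the move from summary format to ideal format concrete, here is a rough sketch in plain Python. It is not the reshape package and does not use its API. The "Fruit.Measure" column naming and the helper `summary_to_ideal` are assumptions made for this illustration.

```python
# The summary table as it might look after being read from a file: one row
# per person, with the levels of Fruit folded into the column names.
summary = [
    {"Person": "John", "Apple.Bought": 5, "Apple.Ate": 1,
     "Orange.Bought": 5, "Orange.Ate": 3},
    {"Person": "Mary", "Apple.Bought": 3, "Apple.Ate": 2,
     "Orange.Bought": 4, "Orange.Ate": 3},
]

def summary_to_ideal(rows, id_col="Person"):
    """Return one row per (Fruit, Person) with a column per measurement."""
    ideal = {}
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue
            # Split "Apple.Bought" into the Fruit level and the measurement name.
            fruit, measure = column.split(".")
            key = (fruit, row[id_col])
            new_row = ideal.setdefault(key, {"Fruit": fruit, id_col: row[id_col]})
            new_row[measure] = value
    return list(ideal.values())

ideal = summary_to_ideal(summary)
for row in ideal:
    print(row)
```

The key step is moving the Fruit levels out of the column names and into a column of their own, so that every row is one unique combination of ID variables.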

The ideal data format is probably the one you will use the most for statistical analysis, graphical representation, and data storage. However, we're going to focus a lot on what I'm calling the flexible format. For the most part, we'll be reformatting data from the ideal format to the flexible format, and then manipulating it using functions from the reshape (CRAN, website) package.

The flexible data format has a row for every unique combination of ID variable and measurement type, with a column for the value of the measurement. For the apples and oranges data, it would look this way.

**Flexible data format**
Fruit	Person	Variable	Value
Apple	John	Bought	5
Apple	John	Ate	1
Orange	John	Bought	5
Orange	John	Ate	3
Apple	Mary	Bought	3
Apple	Mary	Ate	2
Orange	Mary	Bought	4
Orange	Mary	Ate	3
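The step from the ideal format to the flexible format can be sketched in plain Python as follows. This is only an illustration of the idea, not the reshape package itself; the helper name `melt_rows` is hypothetical.

```python
# The apples and oranges data in the ideal format: one row per unique
# combination of ID variables, one column per kind of measurement.
ideal = [
    {"Fruit": "Apple", "Person": "John", "Bought": 5, "Ate": 1},
    {"Fruit": "Orange", "Person": "John", "Bought": 5, "Ate": 3},
    {"Fruit": "Apple", "Person": "Mary", "Bought": 3, "Ate": 2},
    {"Fruit": "Orange", "Person": "Mary", "Bought": 4, "Ate": 3},
]

def melt_rows(rows, id_vars, measure_vars):
    """Emit one row per combination of ID variables and measurement type."""
    long_rows = []
    for row in rows:
        for name in measure_vars:
            long_row = {key: row[key] for key in id_vars}  # copy the ID variables
            long_row["Variable"] = name                    # which measurement this is
            long_row["Value"] = row[name]                  # the measurement itself
            long_rows.append(long_row)
    return long_rows

flexible = melt_rows(ideal, id_vars=["Fruit", "Person"],
                     measure_vars=["Bought", "Ate"])
for row in flexible:
    print(row)
```

Four rows with two measurements each become eight rows, in the same order as the table above.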

At this point, you might be incredulous. We are, in fact, mixing different measurement types in the Value column. If you're not incredulous, this flexible format for vowel measurements might make you so.

**Flexible data format**
Vowel	ObsID	Variable	Value
ae	1	F1.hz	595
oh	2	F1.hz	759
ey	3	F1.hz	531
ae	1	F2.hz	2421
oh	2	F2.hz	1120
ey	3	F2.hz	2401
ae	1	Duration.msec	80
oh	2	Duration.msec	100
ey	3	Duration.msec	120

As you can see, we are mixing values in hertz with values in milliseconds in a single column. This should offend your data sensibilities. However, with clever use of the reshape::cast() function, the flexible format is one line of code away not only from the ideal format, but also from the summary format and all other manner of aggregation.
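To give a feel for why the flexible format earns its name, here is a rough plain-Python sketch of the idea, using the apples and oranges data. It is not reshape::cast() and makes no claim about how that function is implemented; the helper `cast_rows`, its arguments, and the dotted column names are assumptions made for this illustration.

```python
from collections import defaultdict

columns = ["Fruit", "Person", "Variable", "Value"]
records = [
    ("Apple", "John", "Bought", 5), ("Apple", "John", "Ate", 1),
    ("Orange", "John", "Bought", 5), ("Orange", "John", "Ate", 3),
    ("Apple", "Mary", "Bought", 3), ("Apple", "Mary", "Ate", 2),
    ("Orange", "Mary", "Bought", 4), ("Orange", "Mary", "Ate", 3),
]
flexible = [dict(zip(columns, record)) for record in records]

def cast_rows(rows, row_vars, col_vars, aggregate=sum):
    """Spread Value into columns named by col_vars, one row per row_vars combination.

    Cells that collect more than one value are reduced with `aggregate`.
    """
    cells = defaultdict(lambda: defaultdict(list))
    for row in rows:
        row_key = tuple(row[v] for v in row_vars)
        col_key = ".".join(str(row[v]) for v in col_vars)
        cells[row_key][col_key].append(row["Value"])
    table = []
    for row_key, cols in cells.items():
        out = dict(zip(row_vars, row_key))
        for col_key, values in cols.items():
            out[col_key] = aggregate(values)
        table.append(out)
    return table

# Each target format is one call away from the flexible format.
ideal = cast_rows(flexible, ["Fruit", "Person"], ["Variable"])
summary = cast_rows(flexible, ["Person"], ["Fruit", "Variable"])
totals = cast_rows(flexible, ["Fruit"], ["Variable"])  # summed over people

print(ideal[0])
print(summary[0])
print(totals)
```

The only thing that changes between the three calls is which variables identify rows and which identify columns; the aggregation over people in the last call falls out of the same mechanism.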

*This notation indicates which package a particular function comes from. Its format is package::function().