**Introductory Statistics with R**: Chapter 1
**Analyzing Linguistic Data: A practical introduction to statistics**: Chapter 1

- Why use R?
- Installing R
- Getting Help
- Customization and Expansion
- Basics of Vectors and Assignment
- Matrices

When you first start up R, this is what you see:

>

and that is very intimidating.

Here are some immediate questions you will have:

- How do I load my data?
- How do I look at my data?
- What method do I use to analyze my data?

Your immediate and natural response right now might be to question why we should all be using R. Other statistics programs exist, like SPSS, and some people have specialty scripts to do stats for them (e.g. Sociolinguistics with GoldVarb and RBrul).

My personal response overlaps quite a bit with this blog post. Quite simply, **R is the statistics software paradigm of our day**. That statement can easily tell the whole story, but I'll elaborate. First, there are intrinsic qualities to R that are driving it towards paradigm status. Specifically, it is free, open source, platform independent, has an excellent ratio of ease of use to power, and, with all the libraries available, is usually a one-stop shop for all your data analysis needs. Second, because it is the paradigm, there is a broad community of users out there to continue to expand and improve its use, and to provide support, which specialty scripts and proprietary software lack. Most academic statisticians use R, so if you go to them for help, or buy their book, they will usually provide R code or R packages to implement their advice. When we do our data analysis in R, we benefit from, and contribute to, the use of statistics in the whole of the social sciences. At some point individual disciplines may need specialized or rare statistical techniques, but these points lie on the margins. For the most part, we are all asking the same questions for different problem spaces, and when we have a unified language to ask them in, we can all assist each other despite non-overlapping theoretical backgrounds.

But what's more is that it is not necessary to suffer to reach salvation! It is possible to start using R quite effectively quite rapidly, which will be the goal of this study group. We will gloss over a lot of the guts of the system, as well as its use as a programming language, and focus exclusively on using the interactive prompt to manipulate and analyze data.

If you haven't already installed R, you can do so at the Comprehensive R Archive Network (CRAN).

Perhaps the most useful command in R is ?. At the prompt, type ? followed by the name of any command, and it will bring up the help page for that command. For example, ?plot brings up the help page for plot.

If you don't know the name of a specific command, or don't know how to do what you want, try searching the internet for it. There are a lot of blogs and mailing lists talking about R, and odds are the answer to your question will be there. "R" is a pretty poor search term, especially when dealing with statistics or mathematics, so you might try including "R language" in your search terms. There is also a custom Google search called RSeek, but I'm not always pleased with the results it gives.
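As a rough sketch of the help tools: help() is the long form of ?, and help.search() (or ??) searches the installed help pages by keyword when you don't know a command's name. apropos() is a further option, not mentioned above, that lists commands whose names match a pattern. The search terms "deviation" and "mean" are only illustrative choices.

```r
# The help viewers only make sense at the interactive prompt,
# so they are skipped when this is run as a script.
if (interactive()) {
  help("plot")               # the long form of ?plot
  help.search("deviation")   # the long form of ??deviation
}

# apropos() lists the names of available objects that match a pattern,
# which helps when you only half-remember what a command is called.
mean.like <- apropos("mean")
print(mean.like)
```

The result of apropos() is an ordinary character vector, so it can be indexed and searched like any other vector.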

R is an open source project, which means everyone and their mother has a package for everything. This means that if you have a task, there is probably an R library for it. Some caution should be exercised before you build a model that hinges on some Joe Blow's R package, but the most commonly used packages are in good shape. To install a package (if your defaults aren't changed), type install.packages("packagename").
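Here is a minimal sketch of the usual install-then-load pattern. The package name "stats" is only a placeholder, chosen because it ships with R and so nothing is actually downloaded; substitute the package you want. Installing only has to happen once, but library() has to be called in each new session before a package's functions are available.

```r
pkg <- "stats"   # placeholder name; "stats" ships with every R installation

# Install only if the package is not already available on this machine.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package so its functions can be used in this session.
library(pkg, character.only = TRUE)
```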

Thanks to Keelan for pointing out http://crantastic.org/, which is set up to rate different R libraries. There isn't much content there yet, but hopefully over time it will fill up.

R can be used for any kind of basic calculation.

> 3+3

[1] 6

> 2*4

[1] 8

> (369-1)/6

[1] 61.33333

The > symbol is the prompt for input. The output is preceded by [1] to indicate that it is the first element in a vector.

Vectors are one of the basic data structures in R. They are essentially an ordered series of data, and can contain characters, numbers, or TRUE/FALSE values. There are a number of ways to create vectors in R, and data manipulation will frequently produce subvectors of data. The table below has some commands for creating vectors.

| Command | Description |
| --- | --- |
| 1:10 | Produces a vector of integers from 1 to 10. Reversing the order of the numbers (10:1) produces a vector of decreasing values. |
| c(...) | Produces a vector of whatever is passed as an argument to c(), e.g. c(1,2,3,4). |
| seq(from,to,...) | Produces a sequence of numbers either by a given increment or evenly spaced to a given length, e.g. seq(1,10,by=0.5), seq(1,10,length=11). |
| rep(x,...) | Produces a vector of repetitions of x a given number of times, e.g. rep(1,6), rep(1:3,2), rep("hello world",4). |
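To make the commands above concrete, here is a short sketch of what each one returns. The variable names are made up for illustration. It spells out length.out in full, which is the argument that the shorter length= is matched to.

```r
up      <- 1:10                          # 1 2 3 4 5 6 7 8 9 10
down    <- 10:1                          # 10 9 8 7 6 5 4 3 2 1
listed  <- c(1, 2, 3, 4)                 # 1 2 3 4
by.half <- seq(1, 10, by = 0.5)          # 1.0 1.5 2.0 ... 10.0 (19 values)
evenly  <- seq(1, 10, length.out = 11)   # 1.0 1.9 2.8 ... 10.0 (11 values)
ones    <- rep(1, 6)                     # 1 1 1 1 1 1
cycled  <- rep(1:3, 2)                   # 1 2 3 1 2 3
hello   <- rep("hello world", 4)         # "hello world" four times

# length() reports how many elements a vector holds.
length(by.half)   # 19
length(evenly)    # 11
```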

Not that this will ever be a big issue in practical use, but vectors of mixed types are impossible in R. They will automatically be simplified to the least restrictive type. Boolean logical values will be simplified to numerical 1's and 0's in a mixed vector with numbers, and everything will be simplified to character strings in a mixed vector with characters. This problem will usually need to be addressed when loading poorly formatted data into R. (For more marginal problems of dealing with integers vs. floating points, see here.)
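A small sketch of this simplification in action. The variable names are hypothetical, and class() is used only to report which type the vector ended up as.

```r
# Logical values mixed with numbers become 1's and 0's.
logical.and.number <- c(TRUE, FALSE, 3)
logical.and.number          # 1 0 3
class(logical.and.number)   # "numeric"

# Anything mixed with characters becomes character strings.
number.and.text <- c(1, 2, "three")
number.and.text             # "1" "2" "three"
class(number.and.text)      # "character"

everything <- c(TRUE, 1, "a")
everything                  # "TRUE" "1" "a"
```

If a column of numbers unexpectedly shows up with quotation marks around every value, a stray character entry somewhere in the data is the usual suspect.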

It is usually useful to assign vectors to variables. The assignment operator in R is <- or ->. The following are equivalent assignments:

> x <- 1:10

> 1:10 ->x

> x

[1] 1 2 3 4 5 6 7 8 9 10

The variable x now represents the vector of integers from 1 to 10. Variables in R are case sensitive, so if you assign something to x, you have not assigned it to X.

Also note that functions and operations never directly modify variables. Below we discuss some vector arithmetic. The operation x*3 does not modify x, but rather produces output which can be assigned to a new variable, or back to the original variable. Without assigning the output of a function or operation to some variable, that output is not available in any way to further computation.

> x

[1] 1 2 3 4 5 6 7 8 9 10

> x*3

[1] 3 6 9 12 15 18 21 24 27 30

> x

[1] 1 2 3 4 5 6 7 8 9 10

> x*3->x

> x

[1] 3 6 9 12 15 18 21 24 27 30

Just like you can carry out arithmetic functions on numbers, you can do the same to vectors. In fact, single numbers are treated as vectors of length 1, so all arithmetic is actually vector arithmetic. The arithmetic functions in R (even * /) are all element-wise.

> x <- 1:10

> x

[1] 1 2 3 4 5 6 7 8 9 10

> x+1

[1] 2 3 4 5 6 7 8 9 10 11

> x*2

[1] 2 4 6 8 10 12 14 16 18 20

> x^2

[1] 1 4 9 16 25 36 49 64 81 100

> x/3

[1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667 2.0000000 2.3333333 2.6666667 3.0000000 3.3333333

> log(x)

[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101 2.0794415 2.1972246 2.3025851

You can also do arithmetic operations on two multilength vectors. R will repeat the shorter vector until it is the same length as the longer vector.

> x*x

[1] 1 4 9 16 25 36 49 64 81 100

> x+x

[1] 2 4 6 8 10 12 14 16 18 20

> y<-c(1,2)

> x*y

[1] 1 4 3 8 5 12 7 16 9 20

If the length of the longer vector is not a multiple of the length of the shorter vector, you will still get output, but R will print a warning.

> x<-rep(1,10)

> x

[1] 1 1 1 1 1 1 1 1 1 1

> y <- c(1,2,3)

> y

[1] 1 2 3

> x*y

[1] 1 2 3 1 2 3 1 2 3 1

Warning message:

In x * y : longer object length is not a multiple of shorter object length

Elements in a vector are indexed, and can be referenced by their index. **Note** In R, unlike many languages, indexing begins with 1.

> LETTERS

[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

> LETTERS[1]

[1] "A"

> LETTERS[1:4]

[1] "A" "B" "C" "D"

> LETTERS[c(1,3,5)]

[1] "A" "C" "E"

You can also access subvectors using a vector of TRUE or FALSE values as an index. Usually, this TRUE/FALSE vector will be the same length as the vector it is indexing.

> t.f<- rep(F,26)

> t.f

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> LETTERS[t.f]

character(0)

> t.f[3]<- T

> LETTERS[t.f]

[1] "C"

> x<-1:10

> x

[1] 1 2 3 4 5 6 7 8 9 10

> x>5

[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE

> x[x>5]

[1] 6 7 8 9 10

If you use a vector of Boolean values as an index which is shorter than the vector it is indexing, R will repeat it so it is the same length.

> x[c(T,F)]

[1] 1 3 5 7 9

> x[c(F,T)]

[1] 2 4 6 8 10

As might be expected, any vector of Boolean values, if preceded by !, will produce a vector of the opposite values.

> t.f

[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> !t.f

[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

> LETTERS[t.f]

[1] "C"

> LETTERS[!t.f]

[1] "A" "B" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

- Numeric Vectors
  - sum()
  - mean()
  - sd() (standard deviation)
  - max()
  - min()
  - median()
  - range()
- Any Vector
  - rev() (reverse)
  - unique() (list of unique vector values)
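As a quick illustration of the functions listed above, here they are applied to one small vector. The values are made up for the example.

```r
x <- c(2, 4, 4, 6, 9)

sum(x)      # 25
mean(x)     # 5
sd(x)       # about 2.645751, the square root of 7
max(x)      # 9
min(x)      # 2
median(x)   # 4
range(x)    # 2 9  (the minimum and the maximum)

rev(x)      # 9 6 4 4 2
unique(x)   # 2 4 6 9
```

Note that range() returns a vector of length 2, not the difference between the largest and smallest values.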

Just like they sound, matrices are rectangular grids of data arranged in rows and columns. Just like vectors, they can contain characters, numbers, or TRUE/FALSE values. Here are some commands for creating matrices.

| Command | Description |
| --- | --- |
| matrix(x,nrow,ncol) | Takes a vector as an argument and re-forms it as a matrix with the number of rows and columns defined by nrow and/or ncol. By default, the vector is "poured" into the matrix by columns, but it can be done by rows by setting the byrow option to T. If matrix() is called without a vector argument, but ncol and/or nrow are defined, it will generate a matrix of missing (NA) values. e.g. matrix(1:9,ncol = 3), matrix(1:9,ncol = 3, byrow = T), matrix(nrow= 3,ncol = 3) |
| cbind(...) | Binds vectors into a matrix by columns, e.g. cbind(1:3,1:3). |
| rbind(...) | Binds vectors into a matrix by rows, e.g. rbind(1:3,1:3). |
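A short sketch of what the matrix commands above produce. The variable names are hypothetical, and dim() is used only to report the number of rows and columns.

```r
# Filled by columns (the default): rows are 1 4 7 / 2 5 8 / 3 6 9
by.col <- matrix(1:9, ncol = 3)

# Filled by rows: rows are 1 2 3 / 4 5 6 / 7 8 9
by.row <- matrix(1:9, ncol = 3, byrow = TRUE)

# No data given: a 3 by 3 matrix where every cell is NA
empty <- matrix(nrow = 3, ncol = 3)

# Each vector becomes a column: 3 rows, 2 columns
cols <- cbind(1:3, 1:3)
dim(cols)   # 3 2

# Each vector becomes a row: 2 rows, 3 columns
rows <- rbind(1:3, 1:3)
dim(rows)   # 2 3
```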

As with vectors, mixed data types in one matrix is impossible. Automatic simplification to the least restrictive type happens in the same way as described above.

When a simple matrix is printed to output, the labels along its margins show exactly how it is indexed. Within brackets, the first number indexes the row, and the second the column. Leaving one of the numbers unspecified will return a vector of the row or column values.

> xmat <- matrix(1:9,ncol = 3)

> xmat

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

> row.number <-1

> col.number <-3

> xmat[row.number,col.number]

[1] 7

> xmat[row.number,]

[1] 1 4 7

> xmat[,col.number]

[1] 7 8 9

- t(): transposes a matrix
- colSums(): Sum of the column values
- rowSums(): Sum of the row values
- colMeans(): Mean of the column values
- rowMeans(): Mean of the row values
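For illustration, here are the functions above applied to the same 3 by 3 matrix used in the indexing example.

```r
xmat <- matrix(1:9, ncol = 3)   # rows are 1 4 7 / 2 5 8 / 3 6 9

t(xmat)          # rows and columns swapped: rows are 1 2 3 / 4 5 6 / 7 8 9
colSums(xmat)    # 6 15 24
rowSums(xmat)    # 12 15 18
colMeans(xmat)   # 2 5 8
rowMeans(xmat)   # 4 5 6
```

Each of the sum and mean functions returns an ordinary vector with one value per column or row.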