Dr. Mark Gardener
Using R for statistical analyses - More on data

This page is intended to be a help in getting to grips with the powerful statistical program called R. It is not intended as a course in statistics. If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

On this page learn how to create and manipulate data without using a spreadsheet. Learn more about reading data files.
R is Open Source
R is Free

What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T Bell Labs, beginning in the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995. It has quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.
read.csv() is the most useful command for entering large and complex data sets into R.

Creating data

With larger data sets the most useful method of creating and storing your information remains the use of a spreadsheet. R can read spreadsheet files in .XLS format (with the help of add-on packages) but it is probably better to use .CSV. This format is readily opened by text editors and can be easily modified. Your original data set can be kept in native spreadsheet format and you can use 'save as' to create a .CSV file for the analysis you want to run. To remind yourself about creating and reading CSV files see the introduction page.

The most useful function to read data into R is the read.csv() command. Here is a recap:

    variable = read.csv(file.choose(), header=TRUE, row.names=#)

Here # stands for the number of the column that holds the row names; replace it with a number (or leave the parameter out) before running the command. file.choose() opens an explorer-type window allowing you to select your file. This is not the only way to get data into R, as we shall find out now.
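To try the recap command without picking a file by hand, here is a minimal sketch that writes a tiny CSV to a temporary file first. The data and the names csv.file and pond are invented for illustration; with your own data you would keep file.choose() or give a real file name.

```r
# Hypothetical data, invented purely for illustration
csv.file = tempfile(fileext = ".csv")
writeLines(c("site,count,depth",
             "A,12,0.5",
             "B,7,1.2",
             "C,15,0.8"), csv.file)

# header=TRUE takes the column names from the first line;
# row.names=1 uses the 1st column (site) as the row names
pond = read.csv(csv.file, header = TRUE, row.names = 1)
pond
```

The site column has become the row labels, leaving count and depth as the two variables.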
The c() command is used extensively in R, especially as a parameter within other functions. It is also a quick way to enter small amounts of data.

Combine values command

If you wish to enter a small vector of data it may not be worthwhile creating a spreadsheet, saving it as a CSV file and then reading it into R. It would be much easier to type the data in directly. There are several ways to do this. The first one is using the c() command (c is short for combine). An example will demonstrate its use:

    data1 = c(2, 4, 5, 2, 3, 7, 8, 4)

Here we have created a variable called data1 and assigned the values in the brackets to it. We may now use the variable we created like any other. We can use the c() command to append data to an existing vector e.g.

    data1 = c(data1, 12, 14, 11, 9)

Now we have added 4 values to our existing variable.

This command is used as part of other functions in R. For example, in graphing it is possible to set the limits of the x and y axes; the c() command is called from within the plot() function like so:

    plot(data, xlim= c(lower, upper), ylim= c(lower, upper), ...other commands)

See the section on scatter plots for more information on this command.
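The two data1 commands above can be run one after the other to check the result. This is only a sketch; the name x.limits and its values are invented to show the form that xlim= and ylim= expect.

```r
# Create a vector, then append four more values to it
data1 = c(2, 4, 5, 2, 3, 7, 8, 4)
data1 = c(data1, 12, 14, 11, 9)

length(data1)   # 12 values: the original 8 plus the 4 we added
data1

# A pair of limits is just another two-item vector (hypothetical values);
# this is the form passed to xlim= or ylim= inside plot()
x.limits = c(0, 15)
x.limits
```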
Numeric values can be entered 'as is' but text values must be in "quotes" when using the c() command.

Types of data

The values we entered using our c() command were obviously numeric. We can enter text values merely by enclosing them in (double) quotes so:

    dates = c("Jan", "Feb", "Mar", "Apr", "May")

We now have a variable called dates which contains five text values. What if we were to type in the months without quotes? Let's try and see:

    month = c(Jan, Feb, Mar, Apr, May)

Oh dear. R reports an error, because it treats each unquoted name as an object and cannot find one called Jan. So, it appears that we either have to have numbers or text values in quotes. It is possible to get one other data type but we will cover that when we get to it later on.
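If you are unsure what type of data a vector holds, the class() command will tell you. The sketch below also catches the error from the unquoted months with tryCatch(), so that the commands can be pasted in one go; the names problem and mixed are invented for illustration, and the example assumes you have no objects called Jan, Feb and so on in your workspace.

```r
data1 = c(2, 4, 5, 2, 3, 7, 8, 4)
dates = c("Jan", "Feb", "Mar", "Apr", "May")

class(data1)   # "numeric"
class(dates)   # "character"

# Unquoted names are looked up as objects; tryCatch() returns the
# error message as text instead of stopping
problem = tryCatch(c(Jan, Feb, Mar, Apr, May),
                   error = function(e) conditionMessage(e))
problem

# Mixing numbers and quoted text gives an all-text vector
mixed = c(2, "Feb", 5)
class(mixed)   # "character"
```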
scan() is a useful command for adding larger amounts of data. The basic command accepts numeric values only. To read in text values we must use scan(what="char")

Typing in values using scan()

Typing in values using the c() command is fine but when you have a substantial sample size you don't necessarily want to type all the commas! R provides another way of entering data using the scan() command. In basic form the scan command works like this:

    more.data = scan()
    1:

The 1: indicates that R is waiting for you to type in the first element of your data. What we need to do now is to type in some values; this time we separate them with spaces and don't bother with the commas. You can press the enter key to spread the values over several lines. Data entry will stop when you enter a blank line e.g.

    1: 2 5 6.2 33 25 1.3 8

To see what we entered type the name of the variable e.g.

    more.data

We can see that R has appended decimals to our data so that the precision matches for all items in the vector. If we try the same thing but with text labels what happens?

    more.months = scan()

We get an error, because the basic command expects numbers. It is a real pain to enter lots of quotes so let's find a way around that. Try this:

    more.months = scan(what="char")

That's better; now we don't have to type quotes around each item, we merely type what="char" to tell the function to expect text values. The basic scan() command will not read text values unless the what parameter tells it to expect them. In addition we cannot mix text and numeric values in a single vector.
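scan() normally waits for you to type at the prompt, which makes it awkward to copy and paste as a finished example. As a sketch, the text= parameter of scan() lets us hand over the same values as a string instead, so the result can be seen straight away.

```r
# Same numbers as typed at the 1: prompt above, supplied as a string
more.data = scan(text = "2 5 6.2 33 25 1.3 8")
more.data

# Text values need what="char"; no quotes are needed around each item
more.months = scan(text = "Jan Feb Mar Apr May", what = "char")
more.months
```

Once you are happy with how it works, drop the text= part and type (or paste) your values at the prompt.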
Multiple variables

When you only have 1-2 variables to input and these are of moderate length, it may be worthwhile entering them using the scan() or c() commands. However, when you have more data it is usually better to enter the data into a spreadsheet first and then save as a CSV file for input to R. This subject was introduced earlier (see data files) but here we'll add a bit more detail.

Types of data (again)

So far we have looked at two types of data item, numeric and text. Let's get a data file to illustrate:

    twoway = read.csv(file.choose())

We have three variables, height, plant and water. This is the sort of thing you would expect to form the basis for a two-way analysis of variance. In order for R to read the variables from this data file we would attach() the main variable e.g. attach(twoway). However, it is possible to read the variables without doing this.
To access variables from within larger data sets we can use one of several methods:
attach(data.frame) allows the variables to be accessed by typing the name.
data.frame$variable reads a variable directly.
data.frame[row, col] allows you to access a specific row, column or element.

Variables inside data sets

To see the height variable we type the following:

    twoway$height
    [1] 9 11 6 14 17 19 28 31 32 7 6 5 14 17 15 44 38 37

We see the vector of numbers; it's obviously a numeric variable. Notice how we type the name of the original variable then append a dollar sign and the name of the variable within it that we wish to see. If we look at the water variable next:

    twoway$water
    [1] lo lo lo mid mid mid hi hi hi lo lo lo mid mid mid hi

This is something new; the variable doesn't appear to be text (the items are not enclosed in quotes). The first couple of lines show us the data items in the order they are in the table and then we see a line starting with "Levels:". This line shows us that there are three 'things' in the water variable: lo, mid and hi. This type of variable is a factor (as opposed to character or numeric). R assumes that all text values in your CSV file are either headings or are factors unless you specifically tell it otherwise. (This describes versions of R before 4.0.0; from 4.0.0 onward read.csv() reads text columns as character unless you add stringsAsFactors=TRUE.) We will cover this later.

A single variable is termed a vector. When we create a larger data file (e.g. as a CSV file) the resulting variable (e.g. twoway above) is called a data frame. We can display the individual variables from the data frame by using the $ symbol as we have just seen. However, there is another way. The data frame is composed of rows and columns; we can pull out individual items using the following syntax:

    data.frame[row, col]

So, to see the height variable we type:

    twoway[,1]
    [1] 9 11 6 14 17 19 28 31 32 7 6 5 14 17 15 44 38 37

Since we left the row blank all rows are displayed. If we wish to see the water variable we type:

    twoway[,3]
    [1] lo lo lo mid mid mid hi hi hi lo lo lo mid mid mid hi

We can display a single row of course:

    twoway[4,]
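Here is a small self-contained sketch of the $ and [row, col] methods. The data frame twoway.demo is invented for illustration: only the three column names follow the text, while the values and the plant names are made up. stringsAsFactors=TRUE is set so that the text columns become factors whichever version of R you use.

```r
# Hypothetical values; only the column names match the twoway example
twoway.demo = data.frame(
  height = c(9, 14, 28, 7, 17, 44),
  plant  = c("vulgaris", "vulgaris", "vulgaris", "sativa", "sativa", "sativa"),
  water  = c("lo", "mid", "hi", "lo", "mid", "hi"),
  stringsAsFactors = TRUE)

twoway.demo$height       # one variable, using the $ notation
twoway.demo[, 3]         # the same idea by position: all rows, 3rd column
twoway.demo[4, ]         # a single row, all columns
twoway.demo[2, 1]        # a single element: row 2, column 1

class(twoway.demo$water)    # "factor"
levels(twoway.demo$water)   # "hi" "lo" "mid" (sorted alphabetically)
```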
The transpose command t() is a fast way to re-arrange a data frame by switching rows and columns.

Transposing data frames

When you read a CSV file of data into R you create a data frame. Here is a simple example showing monthly mean temperatures for an Antarctic research station:

    vostok

Apart from the fact that it is decidedly chilly we can see that we have two variables, month and temp, arranged in two columns. If we wished to create a bar chart of these data it may be more useful to have the data arranged in 12 columns, one for each month, rather than the two. We can switch around a data frame using the transpose command t(). To do that we merely type t(dataname) e.g.

    t(vostok)

The data frame has now been switched around. Also we can see that all the data are enclosed in quotes as if they were text. What has happened is that R has taken the data from the data frame and made it into a matrix. This is a separate type of data item that I won't cover here. The t() function is useful for producing barplots that may contain both row and column headings as it allows you to display (and therefore graph) the data sorted by row or column.

To see an individual row or column in a matrix we cannot use the $ notation but we can use the [row, col] method e.g.

    t(vostok)[2,]

This displays the second row only (the temperatures). To see the 2nd column only we type:

    t(vostok)[,2]

Interestingly it does not display as we might expect (although it is the 2nd column). We can replace the single number in the square brackets with an expression. So if for example we wanted to see the 2nd, 3rd and 4th columns we could type:

    t(vostok)[,2:4]

The expression now reads, columns 2 to 4. For a more complex arrangement we can use the c() function that we have come across before (see creating data above and the section on scatter plots) e.g.

    t(vostok)[, c(1, 2, 6, 7)]
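The following sketch repeats the steps on a cut-down data frame. The name station and its four temperatures are invented for illustration and are not the real vostok values.

```r
# Hypothetical temperatures, not the real Vostok record
station = data.frame(month = c("Jan", "Feb", "Mar", "Apr"),
                     temp  = c(-30, -40, -55, -62))

flipped = t(station)
flipped
is.matrix(flipped)      # TRUE: no longer a data frame
is.character(flipped)   # TRUE: the numbers are now held as text

flipped[2, ]            # 2nd row (the temperatures, as text)
flipped[, 2]            # 2nd column, shown as a short labelled vector
flipped[, 2:3]          # columns 2 to 3
flipped[, c(1, 4)]      # columns 1 and 4

# Convert the temperature row back to numbers if you need to calculate
as.numeric(flipped["temp", ])
```

Because a matrix can hold only one type of data, the mixture of month names and temperatures is stored as text, which is why the quotes appear.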
Making text columns in data frames

If we create a data frame in our spreadsheet and save the result as a CSV file for reading into R we get a selection of numeric and factor variables (in versions of R before 4.0.0). However, we may wish to have R regard some of the variables as text (i.e. character variables). To do this we add a separate parameter to the read.csv function. In the example above we only had 2 columns; the file was read into R using a basic command:

    vostok = read.csv(file.choose())

Since the CSV file already contained the column headings no other parameters were required. However, if we wish to alter the 1st column (month) from a factor to a character we need to use the as.is=# parameter (where # is the column number) like so:

    vostok = read.csv(file.choose(), as.is=1)

Now the 1st column of data will be read as character rather than as a factor. If you wish to include several columns you can use syntax similar to above e.g. x:y or c(x, y, z)
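To see the difference as.is makes, the sketch below reads the same few lines twice. The data are invented, and the text= parameter is used only so that the example needs no file; the first call sets stringsAsFactors=TRUE to reproduce the factor behaviour described here on any version of R.

```r
# Hypothetical data supplied as text instead of a file
csv.text = "month,temp\nJan,-30\nFeb,-40\nMar,-55"

# Text column read as a factor
vostok.fac = read.csv(text = csv.text, stringsAsFactors = TRUE)

# as.is=1 keeps the 1st column as plain text (character)
vostok.chr = read.csv(text = csv.text, as.is = 1)

class(vostok.fac$month)   # "factor"
class(vostok.chr$month)   # "character"
```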
Missing values

A data frame is a rectangular arrangement of columns, each containing a series of data as numbers or text. If one column is shorter than the others it will be padded out with NA values. These are ignored by most stats tests but routines such as mean() or median() do take them into account, and return NA as the result. In most cases you may ignore the NA values by including the parameter na.rm= TRUE (see the section on basic stats).
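A quick sketch of the effect, using an invented vector called sample1 with one missing value:

```r
# Hypothetical sample with one missing value
sample1 = c(4, 7, NA, 5, 8)

mean(sample1)                  # NA, because of the missing value
mean(sample1, na.rm = TRUE)    # 6, the mean of the 4 remaining values
median(sample1, na.rm = TRUE)  # 6
```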
Stacking data

The data frame you are working with may contain several columns, each containing a sample of numeric data. Here is a sample data file (called sugars). Each column shows the growth of an insect fed on a particular diet. These data were used in the demonstration of one-way ANOVA:

    sugars

We can see the data are in 6 columns, each representing a sample. These are the sort of data that would likely be analysed using ANOVA. However, the aov() routine in R requires the data to be organized in a slightly different manner. What is required are two columns only, one for the growth data (i.e. the numbers) and one for the factors (i.e. the types of treatment, the sugars). Ideally you would have entered the data into your spreadsheet in the appropriate manner right at the start but, if for some reason this was not done, then all is not lost. R provides a routine to take the individual columns and stack them together to form a new data frame in the correct fashion for our ANOVA. The command is stack(data.frame) and if we perform this on our sugar data we see something like the following:

    stack(sugars)

The function creates two columns; the numbers are placed in a column entitled values whilst the factors are placed in a column entitled ind. We can now perform our analysis on the stacked data, either by assigning it to a new variable name (easiest option) or replacing the variables in the aov() expression with the stack() variables e.g.

    carbs = stack(sugars)

or...

    aov(stack(sugars)$values ~ stack(sugars)$ind)

Selecting columns

It is possible that you may want to extract only some of the columns from a data frame. The stack() command allows you to select which columns to make into the new stacked variable. In general terms the command is:

    stack(data, select= c(var1, var2))

Notice how the list of variables we wish to extract is in the c(item1, item2) format that we have come across before (see also the examples in the section on scatter plots). For the example above, if we wished to extract only "pure" sugars we might use the following command:

    sugar.st = stack(sugars, select= c(C, F, G, S))

The new data frame now contains two columns entitled values and ind as before but we have missed out the samples for F.G and test.

Naming the stacked columns

It is possible to give more meaningful names to the two columns of your new stacked data frame. To do this we use the names() command. In this instance we would type:

    names(carbs) = c("growth", "sugar")

You will notice how the names are assigned using the c() function that we came across earlier (see also the examples in the section on scatter plots).
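The whole sequence can be tried on a small made-up data set. In this sketch sugars.demo and its growth values are invented; only the column names C, F and G are borrowed from the example.

```r
# Hypothetical growth values; column names borrowed from the sugars data
sugars.demo = data.frame(C = c(50, 52, 49),
                         F = c(60, 63, 58),
                         G = c(58, 61, 56))

# Stack all columns: one column of numbers, one of sample labels
carbs.demo = stack(sugars.demo)
names(carbs.demo)    # "values" "ind"

# Stack only some of the columns
part.demo = stack(sugars.demo, select = c(C, G))

# Give the stacked columns more meaningful names, then run the ANOVA
names(carbs.demo) = c("growth", "sugar")
summary(aov(growth ~ sugar, data = carbs.demo))
```

Assigning the stacked data to a new variable, as here, keeps the aov() expression short and means stack() is only run once.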
Unstack

The opposite of stacking is unstacking! Using the example above, we have our stacked sugar/growth data and wish to extract the various samples into individual variables. We use the unstack(data.frame) command so:

    unstack(carbs)
    $C $F $F.G $G $S $test

Now we have a list of six vectors, one for each sample (i.e. sugar). To see a single sample we use the $ notation e.g.

    unstack(carbs)$F
    $F

This can be useful to extract a single sample for some other analysis.
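Here is a minimal sketch of unstacking, again with invented data. The samples are deliberately of unequal size, in which case unstack() returns a list of vectors as described above; as far as I am aware, when every sample is the same length the result comes back as a data frame instead, and the $ notation works in both cases. The formula growth ~ sugar is written out to make clear which column holds the numbers and which the labels.

```r
# Hypothetical stacked data with samples of unequal size
carbs.demo = data.frame(
  growth = c(50, 52, 49, 60, 63, 58, 61),
  sugar  = factor(c("C", "C", "C", "F", "F", "G", "G")))

# numbers ~ labels
samples = unstack(carbs.demo, growth ~ sugar)
samples

samples$F    # the F sample on its own: 60 63
```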