Dr. Mark Gardener |
Using R for statistical analyses - Basic Statistics

This page is intended to help you get to grips with the powerful statistical program called R. It is not intended as a course in statistics. If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going. On this page you can learn how to perform simple statistical tests such as the t-test, U-test, chi-squared and goodness of fit tests, as well as some basic descriptive statistical functions (e.g. mean, variance).
R is Open Source. R is Free.
What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T's Bell Labs, beginning in the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995. It has quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.
By default R keeps NA (missing) values in variables. They arise, for example, when a data set contains columns of unequal length. To make sure that you use only the real data, add na.rm = TRUE to your commands.
Basic stats

R provides a number of functions for basic statistics.

It is possible to use these and other functions in combination, as if you were using a calculator. Other functions include sqrt(variable) to determine the square root. To raise a number to a power use the caret character e.g. 2^3 gives 2 to the power of 3 (i.e. 8).

If your data set is made up of several columns they may not all be of the same length. By default R pads out the 'missing' cells with NA. If your variable contains NA values then this will affect your calculations. To get around this use na.rm = TRUE in the command e.g. mean(variable, na.rm = TRUE)
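
As a minimal sketch of the points above, the following uses a small made-up vector (the values and the name x are purely illustrative) to show how an NA affects a calculation and how na.rm = TRUE and calculator-style combinations behave:

```r
# Hypothetical sample; the NA stands in for a 'missing' cell
x <- c(4, 8, 15, 16, 23, 42, NA)

mean(x)                  # NA, because the missing value propagates
mean(x, na.rm = TRUE)    # 18
median(x, na.rm = TRUE)  # 15.5
var(x, na.rm = TRUE)     # sample variance
sd(x, na.rm = TRUE)      # standard deviation

# Functions combine like a calculator: the standard deviation
# is the square root of the variance
sqrt(var(x, na.rm = TRUE))

# Standard error of the mean, counting only the non-missing values
n <- sum(!is.na(x))
sd(x, na.rm = TRUE) / sqrt(n)

2^3  # 8
```

Note that length(x) would count the NA as well, which is why the sketch counts the non-missing values with sum(!is.na(x)).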
The t-test defaults to the Welch procedure, which does not assume that the variances are equal.
T-test

The t-test is used to determine statistical differences between two samples. There is also a version that can be used as a paired test i.e. when you have measurements collected as matched pairs.

The first stage is to arrange your data in a .CSV file. Use a column for each variable and give it a meaningful name. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore.

The second stage is to read your data file into memory and give it a sensible name.

The next stage is to attach your data set so that the individual variables can be referred to by name.

To perform a t-test you type:

> t.test(var1, var2)

Welch Two Sample t-test
data: x1 and x2

This version of the test does not assume that the variance of the two samples is equal and performs a Welch two sample t-test. The "classic" version of the t-test can be run as follows:

> t.test(var1, var2, var.equal=TRUE)

Two Sample t-test
data: x1 and x2
A paired t-test is simple to run; just add paired = TRUE to the basic command.
Now the variances of the two samples are considered equal and the basic version is performed. To run a t-test on paired data you add a new term:

> t.test(var1, var2, paired=TRUE)

Paired t-test
data: x1 and x2
T-test Step by Step
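
The stages described in this section can be sketched as one short script. The data values, the temporary file and the name my.data are hypothetical, chosen only so that the example runs on its own; with your own data you would point read.csv() at your file instead.

```r
# Step 1: a small hypothetical data set, written to a temporary .CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(var1 = c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1),
             var2 = c(6.3, 5.9, 6.8, 6.4, 7.1, 6.0, 6.2, 6.9)),
  csv_path, row.names = FALSE
)

# Step 2: read the file into memory under a sensible name
my.data <- read.csv(csv_path)

# Step 3: run the tests; with() makes the columns visible by name
welch   <- with(my.data, t.test(var1, var2))                    # default: Welch
classic <- with(my.data, t.test(var1, var2, var.equal = TRUE))  # equal variances
paired  <- with(my.data, t.test(var1, var2, paired = TRUE))     # matched pairs

# Step 4: print a full result, or pull out individual components
welch
classic$p.value
paired$p.value
```

The sketch uses with() rather than attach() so that the column names do not linger on the search path afterwards; attach(my.data) followed by t.test(var1, var2) should give the same results.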
The Mann-Whitney U-test is also known as the Wilcoxon rank sum test. If you have tied ranks R will give you a warning message.
U-test

The Mann-Whitney U-test is commonly used to test for significant differences between two samples when the data do not meet the assumptions of a parametric test. In R the test is, perhaps confusingly, called the Wilcoxon test and can be applied to two samples or paired data.

The first stage is to arrange your data in a .CSV file. Use a column for each variable and give it a meaningful name. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore.

The second stage is to read your data file into memory and give it a sensible name.

The next stage is to attach your data set so that the individual variables can be referred to by name.

The basic U-test is performed on two samples so:

> wilcox.test(var1, var2)

Wilcoxon rank sum test with continuity correction
data: x1 and x3
Warning message:
To run a paired test simply add paired = TRUE to the basic command.
If you have paired data you can run a matched pair test:

> wilcox.test(var1, var2, paired=TRUE)

Wilcoxon signed rank test with continuity correction
data: x1 and x3
Warning messages:

In the above examples we see that there are several warning messages. These typically arise because tied ranks (or, in the paired test, zero differences) prevent R from computing an exact p-value, so it uses a normal approximation instead; we can safely ignore them. Also, the test runs with continuity correction as the default. If you want to turn this off (I cannot see why you would) then add correct=FALSE to the parameters e.g.

> wilcox.test(var1, var2, correct=FALSE)
U-test Step by Step
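
The following is a minimal sketch of the U-test variants covered in this section. The two vectors are made-up values that deliberately contain ties. The exact = FALSE argument is not used in the text above; it is added here on the assumption that you would rather request the normal approximation directly than see the tied-ranks warning.

```r
# Hypothetical counts from two samples; repeated values create tied ranks
var1 <- c(3, 5, 5, 7, 8, 10, 12, 12)
var2 <- c(6, 9, 9, 11, 13, 14, 14, 16)

# Two-sample test (Mann-Whitney U, which R calls the Wilcoxon rank sum test)
u.result <- wilcox.test(var1, var2, exact = FALSE)
u.result$statistic  # R labels the statistic W
u.result$p.value

# Matched-pairs version (Wilcoxon signed rank test)
p.result <- wilcox.test(var1, var2, paired = TRUE, exact = FALSE)
p.result

# The two-sample test again, without the continuity correction
nc.result <- wilcox.test(var1, var2, exact = FALSE, correct = FALSE)
nc.result$method
```

Leaving out exact = FALSE gives the same p-values for these data, with the warning shown in addition.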
Paired tests

Most of the regular stats routines provide an option to run as a paired variant.
Chi-squared tests

Tests for association are easily performed in R. The basic function is chisq.test()

The first stage is to arrange your data in a .CSV file. Use row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore.

The second stage is to read your data file into memory and give it a sensible name. You will need to tell R that the file contains row names so that the first column is used as row labels rather than as data.

To perform the Chi-squared test you type something like the following:

> chisq.test(your.data)

Pearson's Chi-squared test
data: your.data

This gives you a basic result but you will want more than that in order to interpret the statistic. The test produces more data than is displayed; to see what you have to work with type:

> names(chisq.test(your.data))
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"

This shows us that there are other data that we can call upon to help us. It is cumbersome to run the test each time, so it would be better to assign the chi-squared test result to a variable. It's a good habit to get into when using R and means that you can use the results in further calculations. In this instance we might try:

> your.chi = chisq.test(your.data)
> names(your.chi)

To see the observed values (i.e. the original data) type:

> your.chi$observed

To see the expected values type:

> your.chi$expected

To see the residuals type:

> your.chi$residuals

The residuals calculated are the Pearson residuals i.e. (observed - expected) / sqrt(expected). You can examine these and easily pick out which are the most important associations (and the direction).

You do not actually need to type the full command to see the components of the chi-squared test. After the $ sign you can type a short version and as long as it is unique it will be interpreted e.g.

> your.chi$obs

R will produce the desired table. If you wish to extract a single value from one of these tables then you can do that by appending an extra part e.g.

> your.chi$res["row.name", "col.name"]

In other words add a square bracket and type in the row and column headings (in quotes) that define the value you wish.
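
To make the residual formula concrete, here is a small sketch on a made-up 3 x 2 table (the counts and the row and column names are hypothetical). It recomputes the Pearson residuals by hand and compares them with what chisq.test() stores.

```r
# Hypothetical 3 x 2 contingency table with row and column names
your.data <- matrix(c(30, 10,
                      20, 20,
                      10, 30),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("site.a", "site.b", "site.c"),
                                    c("present", "absent")))

your.chi <- chisq.test(your.data)

your.chi$observed
your.chi$expected
your.chi$residuals

# Recompute the Pearson residuals by hand to confirm the formula
by.hand <- (your.chi$observed - your.chi$expected) / sqrt(your.chi$expected)
all.equal(by.hand, your.chi$residuals)

# A single cell, picked out by row and column name
your.chi$residuals["site.a", "present"]
```

A positive residual marks a cell with more observations than expected and a negative one a cell with fewer, which is how the direction of an association can be read off.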
Yates Correction is appropriate for 2 x 2 contingency tables.
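Yates Correction

When using a 2 x 2 contingency table it is common practice to reduce the |O-E| differences by 0.5. In chisq.test() this is controlled by the correct parameter, which is TRUE by default, so the correction is applied to a 2 x 2 table unless you turn it off. Writing it out makes the choice explicit e.g.

> your.chi = chisq.test(your.data, correct=TRUE)

To run the test without the correction use correct=FALSE. If your table is larger then the correction will not be done (the basic test will run instead).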
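
As an illustration, the sketch below runs the test on a made-up 2 x 2 table with and without the correction. The counts and names are hypothetical. Note that correct = TRUE is the default for chisq.test(), so leaving the argument out of the first call still applies the correction.

```r
# Hypothetical 2 x 2 contingency table
two.by.two <- matrix(c(12, 5,
                        7, 16),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("treated", "control"),
                                     c("yes", "no")))

with.yates    <- chisq.test(two.by.two)                   # correction applied
without.yates <- chisq.test(two.by.two, correct = FALSE)  # correction off

# The method line reports which version was run
with.yates$method
without.yates$method

# The corrected statistic is the smaller of the two
with.yates$statistic
without.yates$statistic
```

Because the correction shrinks each |O-E| difference, the corrected test gives a smaller statistic and a larger p-value for the same table.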
Chi-Squared Step by Step
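
The stages of the chi-squared test can be sketched as a single script. The table contents, the temporary file and the habitat and species names are all hypothetical, used only so that the example is self-contained.

```r
# Step 1: write a hypothetical contingency table to a temporary .CSV file;
# the first column holds the row names
csv_path <- tempfile(fileext = ".csv")
writeLines(c("habitat,species.a,species.b,species.c",
             "woodland,25,10,15",
             "meadow,12,28,10",
             "marsh,8,12,30"),
           csv_path)

# Step 2: read it in, telling R that column 1 contains row names
your.data <- read.csv(csv_path, row.names = 1)

# Step 3: run the test and store the result
your.chi <- chisq.test(your.data)
your.chi

# Step 4: inspect the stored components
names(your.chi)
round(your.chi$residuals, 2)
```

With a real file you could replace csv_path with file.choose() to pick the file interactively.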
Goodness of Fit test

We can use the Chi-squared distribution to calculate goodness of fit to pre-determined distributions. The function is chisq.test(), which is the same as discussed above in the section on Chi-Squared tests. If you haven't already done so it is a good idea to look over that first. In this case we will have a list of observations and another list of the expected ratios, proportions or values.

The first stage is to arrange your data in a .CSV file. Use row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore. One column should contain the observed values and the other should contain the corresponding ratios, proportions or values.

The second stage is to read your data file into memory and give it a sensible name. You will need to tell R that the file contains row names so that the first column is used as row labels rather than as data.

The next stage is to attach your data set so that the individual variables can be referred to by name.

We now run the analysis using the chisq.test() function e.g.

> your.chi = chisq.test(observed.data, p=expected.values, rescale.p=TRUE)

In this case observed.data is the column of your measured data and expected.values is the column of ratios (or expected values in some form). The rescale.p=TRUE part tells R to convert the expected values so that they add up to unity. It is a good habit to add this parameter, as then it does not matter in what form your expected values come; R will convert them to proportions.

Here is an example of a goodness of fit analysis:

> gfit = read.csv(file.choose(), row.names=1)
> attach(gfit)

Chi-squared test for given probabilities
data: visit

As before we can extract the expected values and the residuals:

> gfit.g$exp
Goodness of Fit test - Step by Step
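
A minimal sketch of a goodness of fit test follows. The observed counts, the 9:3:3:1 ratio and the names gfit.demo and gfit.chi are hypothetical; the data frame is built directly rather than read from a file so that the example runs on its own.

```r
# Hypothetical data: observed counts alongside an expected 9:3:3:1 ratio
gfit.demo <- data.frame(observed = c(90, 28, 32, 10),
                        ratio    = c(9, 3, 3, 1),
                        row.names = c("type.a", "type.b", "type.c", "type.d"))

# rescale.p = TRUE converts the ratio into proportions that sum to 1
gfit.chi <- with(gfit.demo,
                 chisq.test(observed, p = ratio, rescale.p = TRUE))
gfit.chi

# Expected counts under the 9:3:3:1 ratio, and the Pearson residuals
gfit.chi$expected   # 90 30 30 10
gfit.chi$residuals
```

Without rescale.p = TRUE this call would stop with an error, because the values 9, 3, 3 and 1 do not sum to 1.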