Dr. Mark Gardener


On this page...

Basic stats (e.g. mean, median)

T-test
T-test step-by-step

U-test
U-test step-by-step

Chi-Squared
Yates' Correction
Chi-Squared step-by-step

Goodness of Fit test
G-Fit step-by-step

Using R for statistical analyses - Basic Statistics

This page is intended to be a help in getting to grips with the powerful statistical program called R. It is not intended as a course in statistics (see here for details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

I run training courses in data management, visualisation and analysis using Excel and R: The Statistical Programming Environment. From 2013 courses will be held at The Field Studies Council Field Centre at Slapton Ley in Devon. Alternatively I can come to you and provide the training at your workplace. See details on my Courses Page.

On this page you can learn how to perform simple statistical tests such as the t-test, U-test, chi-squared and goodness of fit tests, as well as some basic descriptive statistical functions (e.g. mean, variance).

See also: R Courses | R Tips, Tricks & Hints | MonogRaphs | Writer's bloc


My publications about R

See my books about R on my Publications page

Statistics for Ecologists | Beginning R | The Essential R Reference | Community Ecology

Statistics for Ecologists is available now from Pelagic Publishing. Get a 20% discount using the S4E20 code!
Beginning R is available from Wrox the publisher or see the entry on Amazon.co.uk.
The Essential R Reference is available from the publisher Wiley now (see the entry on Amazon.co.uk)!
Community Ecology is in production now and expected by the end of 2013 from Pelagic Publishing.

I have more projects in hand - visit my Publications page from time to time. You might also like my random essays on selected R topics in MonogRaphs. See also my Writer's Bloc page, which details my latest writing project, including R scripts developed for the book.



R is Open Source

R is Free

Get R from the R Project website

What is R?

R is an open-source (GPL) statistical environment modelled after S and S-Plus. The S language was developed at Bell Laboratories (AT&T). The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995, and it has quickly gained a widespread audience. It is currently maintained by the R core development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R; there you will find directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier, and you can copy and paste text into it from other applications (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.


Top

Navigation index

Introduction

Getting started with R:

Top
What is R?
Introduction
Data files
Inputting data
Seeing your data in R
What data are loaded?
Removing data sets
Help and Documentation


Data2

More about manipulating data and entering data without using a spreadsheet:

Making Data
Combine command
Types of Data
Entering data with scan()
Multiple variables
More types of data
Variables within data
Transposing data
Making text columns
Missing values
Stacking data
Selecting columns
Naming columns
Unstacking data


Help and Documentation

A short section on how to find more help with R

 

Basic Statistics

Some statistical tests:

Basic stats
Mean
Variance
Quantile
Length

T-test
Variance unequal
Variance Equal
Paired t-test
T-test Step by Step

U-test
Two sample test
Paired test
U-test Step by Step

Paired tests
T-test: see T-test
Wilcoxon: see U-test

Chi Squared
Yates Correction for 2x2 matrix
Chi-Squared Step by Step

Goodness of Fit test
Goodness of Fit Step by Step


Non-Parametric stats

Stats on multiple samples when you have non-parametric data.

Kruskal Wallis test
Kruskal-Wallis Stacked
Kruskal Post-Hoc test
Studentized Range Q
Selecting sub-sets
Friedman test
Friedman post-hoc
Rank data ANOVA

 

Correlation

Getting started with correlation and a basic graph:

Correlation
Correlation and Significance tests
Graphing the Correlation
Correlation step by step


Regression

Multiple regression analysis:

Multiple Regression
Linear regression models
Regression coefficients
Beta coefficients
R squared
Graphing the regression
Regression step by step


ANOVA

Analysis of variance:

ANOVA analysis of variance
One-Way ANOVA
Simple Post-hoc test
ANOVA Models
ANOVA Step by Step

 

Graphs

Getting started with graphs, some basic types:

Introduction
Bar charts
Multi-category
Stacked bars
Frequency plots
Horizontal bars

Histograms

Box-whisker plots
Single sample
Multi-sample
Horizontal plot


Graphs2

More graphical methods:

Scatter plot

Stem-Leaf plots

Pie charts


Graphs3

More advanced graphical methods:

Line Plots
Plot types
Time series
Custom axes


 

By default R includes NA values in calculations. These arise when a data set contains variables of unequal length. To make sure you are only using the real data, add na.rm= TRUE to your commands.

Basic stats

R provides a number of functions for basic statistics.

Basic maths/stats functions
You can perform a variety of functions on a single set of numbers. The data could be a variable that you have read in from a .CSV file or a column within a larger data set. If the latter, make sure you attach(dataset) so that the individual variables within the data set can be used by name.
The basic arithmetic mean mean(variable)
The median (middle value) median(variable)
The largest value in the variable max(variable)
The smallest value in the variable min(variable)
The standard deviation of the variable sd(variable)
The number of items in the variable length(variable)
The variance is given by this var(variable)
You can determine any quantile using this function; set the level to any value, e.g. 0.25 or 0.75, to return the appropriate quantile quantile(variable, level)

It is possible to use these and other functions in combination as if you were using a calculator. Other functions include sqrt(variable) to determine square root. To generate a power function use the caret character e.g. 2^3 gives 2 to the power of 3 (i.e. 8).

If your data set is made up of several columns they may not all be of the same length. By default R pads out the 'missing' cells with NA. If your variable contains NA values then this will affect your calculations. To get around this use na.rm= TRUE in the command e.g. mean(variable, na.rm= TRUE)
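As a quick sketch, the functions above can be tried on a small made-up vector (the numbers here are invented for illustration, and include one NA to show the effect of na.rm):

```r
# An invented example vector with one missing value (NA)
my.var = c(7.5, 8, 6.5, 9, NA, 7)

mean(my.var)                 # NA - the missing value propagates
mean(my.var, na.rm = TRUE)   # 7.6 - mean of the five real values
sd(my.var, na.rm = TRUE)     # standard deviation, ignoring the NA
length(my.var)               # 6 - length counts the NA as an item
quantile(my.var, 0.25, na.rm = TRUE)  # lower quartile
```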



 

The t-test defaults to the Welch procedure, which assumes the variances are unequal.

T-test

The t-test is used to determine statistical differences between two samples. There is also a version that can be used as a paired test, i.e. when you have measurements collected as matched pairs.

The first stage is to arrange your data in a .CSV file. Use a column for each variable and give it a meaningful name. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period.

The second stage is to read your data file into memory and give it a sensible name.

The next stage is to attach your data set so that the individual variables are read into memory.

To perform a t-test you type:

> t.test(var1, var2)

Welch Two Sample t-test

data: x1 and x2
t = 4.0369, df = 22.343, p-value = 0.0005376
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.238967 6.961033
sample estimates:
mean of x mean of y
8.733333 4.133333

>

This version of the test does not assume that the variance of the two samples is equal and performs a Welch two sample t-test. The "classic" version of the t-test can be run as follows:

> t.test(var1, var2, var.equal=T)

Two Sample t-test

data: x1 and x2
t = 4.0369, df = 28, p-value = 0.0003806
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.265883 6.934117
sample estimates:
mean of x mean of y
8.733333 4.133333

>

 

A paired t-test is simple to run; just add paired= TRUE to the basic command.

Now the variances of the two samples are considered equal and the basic version is performed. To run a t-test on paired data you add a new term:

> t.test(var1, var2, paired=T)

Paired t-test

data: x1 and x2
t = 4.3246, df = 14, p-value = 0.0006995
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.318620 6.881380
sample estimates:
mean of the differences
4.6

>


 

T-test Step by Step

T-test Step by Step
First create your data file. Use a spreadsheet and make each column a variable. Each row is a replicate but the columns do not need to contain the same number of data items (unless you want a paired test). The first row should contain the variable names. Save this as a .CSV file 
Read in your file and assign it to a variable name your.data = read.csv(file.choose())
Make the variables within the data set available to R   attach(your.data)
For a classic t-test (variances assumed equal)   t.test(var1, var2, var.equal=T)
If variances are assumed unequal use the Welch procedure   t.test(var1, var2)
If you have paired data run the paired version   t.test(var1, var2, paired=T)
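The steps above can be put together as a short script. The two samples here are invented so the block runs on its own; with your own data you would use read.csv() as in step 2:

```r
# Invented samples standing in for columns read from a .CSV file
var1 = c(8.2, 9.1, 7.8, 10.0, 8.5, 9.4)
var2 = c(4.1, 5.0, 3.8, 4.6, 4.9, 4.2)

t.test(var1, var2)                    # Welch test (variances not assumed equal)
t.test(var1, var2, var.equal = TRUE)  # classic two-sample t-test
t.test(var1, var2, paired = TRUE)     # paired version (samples must be equal length)

# The result is a list, so you can store it and pick out components
t.result = t.test(var1, var2)
t.result$p.value
```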


 

The Mann-Whitney U-test is also known as the Wilcoxon rank sum test.

If you have tied ranks R will give you a warning message.

U-test

The Mann-Whitney U-test is commonly used to test for significant differences between two samples when the data are non-parametric. In R the test is, perhaps confusingly, called the Wilcoxon test and can be applied to two samples or to paired data.

The first stage is to arrange your data in a .CSV file. Use a column for each variable and give it a meaningful name. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period.

The second stage is to read your data file into memory and give it a sensible name.

The next stage is to attach your data set so that the individual variables are read into memory.

The basic u-test is performed on two samples so:

> wilcox.test(var1, var2)

Wilcoxon rank sum test with continuity correction

data: x1 and x3
W = 63.5, p-value = 0.04244
alternative hypothesis: true mu is not equal to 0

Warning message:
cannot compute exact p-value with ties in: wilcox.test.default(x1, x3, paired = F)

>

To run a paired test simply add paired= TRUE to the basic command.

If you have paired data you can run a matched pair test:

> wilcox.test(var1, var2, paired=T)

Wilcoxon signed rank test with continuity correction

data: x1 and x3
V = 22.5, p-value = 0.06299
alternative hypothesis: true mu is not equal to 0

Warning messages:
1: cannot compute exact p-value with ties in: wilcox.test.default(x1, x3, paired = T)
2: cannot compute exact p-value with zeroes in: wilcox.test.default(x1, x3, paired = T)

>

In the above examples we see that there are several warning messages. We can safely ignore these. Also, the test runs with continuity correction as the default. If you want to turn this off (I cannot see why you would) then add correct=F to the parameters e.g.

> wilcox.test(var1, var2, correct=F)


 

U-test Step by Step

U-test Step by Step
First create your data file. Use a spreadsheet and make each column a variable. Each row is a replicate but the columns do not need to contain the same number of data items (unless you want a paired test). The first row should contain the variable names. Save this as a .CSV file 
Read in your file and assign it to a variable name your.data = read.csv(file.choose())
Make the variables within the data set available to R   attach(your.data)
For a standard two-sample U-test   wilcox.test(var1, var2)
If you have paired data run the paired version   wilcox.test(var1, var2, paired=T)
The default tests run with continuity correction, to turn this off use   wilcox.test(var1, var2, correct=F)
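As with the t-test, the steps can be sketched as a self-contained script. The samples below are invented (and have no tied ranks, so no warning appears); with real data they would come from your .CSV file:

```r
# Invented samples standing in for columns read from a .CSV file
var1 = c(12, 15, 9, 20, 14, 11, 18)
var2 = c(5, 8, 3, 7, 10, 4, 6)

wilcox.test(var1, var2)                 # two-sample U-test (rank sum)
wilcox.test(var1, var2, paired = TRUE)  # matched-pair version (signed rank)

# Store the result to use the p-value in further work
u.result = wilcox.test(var1, var2)
u.result$p.value
```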

 

Paired tests

Most of the regular stats routines provide for an option to run as a paired variant.



Chi-squared tests

Tests for association are easily performed in R. The basic function is chisq.test()

The first stage is to arrange your data in a .CSV file. Use row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period.

The second stage is to read your data file into memory and give it a sensible name. You will need to tell R that the file contains row names so that a data matrix is created.

To perform the Chi-squared test you type something like the following:

> chisq.test(your.data)

Pearson's Chi-squared test

data: your.data
X-squared = 121.5774, df = 8, p-value < 2.2e-16

>

This gives you a basic result but you will want more than that in order to interpret the statistic.

The test produces more data than is displayed; to see what you have to work with, type:

> names(chisq.test(your.data))

[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals"

>

This shows us that there are other data that we can call upon to help us. It is cumbersome to run the test each time, so it is better to assign the chi-squared test result to a variable. This is a good habit to get into when using R, and it means that you can use the results in further calculations.

In this instance we might try:

> your.chi = chisq.test(your.data)

> names(your.chi)

To see the observed values (i.e. the original data) type:

> your.chi$observed

 
       hedge river wood
pip       21    43   77
daub      23    11   32
noct      26     9   11
fruit     54    15    8
leaf      54    43    7

>

To see the expected values type:

> your.chi$expected

 
          hedge    river     wood
pip    57.82949 39.31106 43.85945
daub   27.06912 18.40092 20.52995
noct   18.86636 12.82488 14.30876
fruit  31.58065 21.46774 23.95161
leaf   42.65438 28.99539 32.35023

>

To see the residuals type:

> your.chi$residuals

 
            hedge      river       wood
pip    -4.8430734  0.5883615  5.0041253
daub   -0.7821028 -1.7253055  2.5314606
noct    1.6423555 -1.0680501 -0.8747094
fruit   3.9894462 -1.3959167 -3.2593967
leaf    1.7371868  2.6007971 -4.4570060

>

The residuals calculated are the Pearson residuals, i.e. (observed - expected) / sqrt(expected). You can examine these and easily pick out which are the most important associations (and the direction).
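You can check this formula directly from the stored components. The sketch below uses a small invented table (just the first two rows of the data above) so that it runs on its own:

```r
# A small table of counts (the first two rows of the data above)
counts = matrix(c(21, 43, 77,
                  23, 11, 32), nrow = 2, byrow = TRUE)
chi = chisq.test(counts)

# Pearson residuals are (observed - expected) / sqrt(expected)
manual = (chi$observed - chi$expected) / sqrt(chi$expected)
all.equal(manual, chi$residuals)   # TRUE
```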

You do not actually need to type the full command to see the components of the chi-squared test. After the $ sign you can type a shortened version; as long as it is unique it will be interpreted e.g.

> your.chi$obs
> your.chi$exp
> your.chi$res

R will produce the desired table. If you wish to extract a single value from one of these tables then you can do that by appending an extra part e.g.

> your.chi$res["row.name", "col.name"]

In other words add a square bracket and type in the row and column headings (in quotes) that define the value you wish.


Yates Correction is appropriate for 2 x 2 contingency tables.

Yates Correction

When using a 2 x 2 contingency table it is common practice to reduce the |O-E| differences by 0.5 (Yates' continuity correction). R applies this correction by default for 2 x 2 tables; you can make it explicit by adding correct=T to the original function e.g.

> your.chi = chisq.test(your.data, correct=T)

If your table is larger than 2 x 2 the correction will not be applied (the basic test runs instead).
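A minimal sketch with an invented 2 x 2 table, showing the corrected and uncorrected versions side by side:

```r
# Invented 2 x 2 contingency table
tab = matrix(c(20, 30,
               35, 15), nrow = 2, byrow = TRUE)

chisq.test(tab)                   # Yates' correction applied (the 2 x 2 default)
chisq.test(tab, correct = FALSE)  # correction switched off

# Here the corrected statistic is the smaller of the two,
# because each |O-E| difference has been reduced by 0.5
chisq.test(tab)$statistic < chisq.test(tab, correct = FALSE)$statistic
```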


 

Chi-Squared Step by Step

Chi-Squared Step by Step
First create your data file. Use a spreadsheet and make a table of your data. Both the rows and the columns should have headings. Save this as a .CSV file 
Read in your file and assign it to a variable name. This command tells R that the 1st column contains the row names. your.data = read.csv(file.choose(), row.names=1)
Have a look at your data to see that it contains what you expected   your.data
Run the Chi-Squared test and assign it to a variable your.chi = chisq.test(your.data)
If you need to apply Yates correction for a 2 x 2 matrix your.chi = chisq.test(your.data, correct=T)
To see the original data i.e. observed values   your.chi$obs
To see the expected values   your.chi$exp
To see the Pearson residuals (O-E)/sqrt(E)   your.chi$res
To extract a single item from the Observed, Expected or Residual tables   your.chi$res["row", "col"] (and likewise for $obs and $exp)


Goodness of Fit test

We can use the Chi-squared distribution to calculate goodness of fit to pre-determined distributions. The function is chisq.test(), which is the same as discussed above in the section on Chi-Squared tests. If you haven't already done so it is a good idea to look over that first.

In this case we will have a list of observations and another list of the expected ratios, proportions or values.

The first stage is to arrange your data in a .CSV file. Use row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period. One column should contain the observed values and the other should contain the corresponding ratios, proportions or values.

The second stage is to read your data file into memory and give it a sensible name. You will need to tell R that the file contains row names so that a data matrix is created.

The next stage is to attach your data set so that the individual variables are read into memory.

We now run the analysis using the chisq.test() function e.g:

> your.chi = chisq.test(observed.data, p=expected.values, rescale.p=T)

In this case observed.data is the column of your measured data and expected.values is the column of ratios (or expected values in some form). The rescale.p=T part tells R to rescale the expected values so that they add up to unity. It is a good habit to add this parameter: then it does not matter what form your expected values take, as R will convert them to proportions.

Here is an example of a goodness of fit analysis:

> gfit = read.csv(file.choose(), row.names=1)
> gfit

 
        ratio visit
Red      10.0   100
Blue      5.0    33
White    15.0    12
Green    10.0    16
Yellow    5.0    22
Orange    2.5     7
Pink      6.0    23
Purple   12.0    17

> attach(gfit)
> gfit.g = chisq.test(visit, p=ratio, rescale.p=T)
> gfit.g

Chi-squared test for given probabilities

data: visit
X-squared = 191.9482, df = 7, p-value < 2.2e-16

>

As before we can extract the expected values and the residuals:

> gfit.g$exp
> gfit.g$res


 

Goodness of Fit test - Step by Step

Goodness of Fit Step by Step
First create your data file. Use a spreadsheet and make a table of your data. One column should contain your observed data and another should contain the expected proportions. The expected values can be as ratios, proportions or whatever. It can be useful to have a column of row names. Save this as a .CSV file 
Read in your file and assign it to a variable name. This command tells R that the 1st column contains the row names. your.data = read.csv(file.choose(), row.names=1)
Have a look at your data to see that it contains what you expected   your.data
Make the variables within the data set available to R   attach(your.data)
Run the Goodness of Fit test and assign it to a variable your.gft = chisq.test(obs.values, p=exp.values, rescale.p=T)
To see the original data i.e. observed values   your.gft$obs
To see the expected values   your.gft$exp
To see the Pearson residuals (O-E)/sqrt(E)   your.gft$res
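Putting the steps together as a self-contained sketch, with invented counts and ratios standing in for the columns of your .CSV file:

```r
# Invented observed counts and expected ratios
visit = c(100, 33, 12, 16)   # observed visits
ratio = c(10, 5, 15, 10)     # expected ratios (need not sum to 1)

gft = chisq.test(visit, p = ratio, rescale.p = TRUE)
gft

gft$expected   # the ratios rescaled so they sum to the total observed
gft$residuals  # Pearson residuals, as before
```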
 