Dr. Mark Gardener

On this page...

Introduction to graphing

Bar charts

Histograms

Box-whisker plots

Using R for statistical analyses - Graphs 1

This page is intended to be a help in getting to grips with the powerful statistical program called R. It is not intended as a course in statistics (see here for details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

I run training courses in data management, visualisation and analysis using Excel and R: The Statistical Programming Environment. From 2013 courses will be held at The Field Studies Council Field Centre at Slapton Ley in Devon. Alternatively I can come to you and provide the training at your workplace. See details on my Courses Page.

On this page you can find out information on producing a range of graphs to illustrate your analyses. Specifically you'll find information on bar charts, histograms and box-whisker plots. For information on scatter plots, pie charts and stem and leaf plots you need to go to the graph2 page.

R is Open Source and R is Free. Get R at the R Project page.

What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T Bell Laboratories, beginning in the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995. It has quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.


Introduction to Graphing

R has great graphical power but it is not a point and click interface. This means that you must use typed commands to get it to produce the graphs you desire. This can be a bit tedious at first but once you have the hang of it you can save a list of useful commands as text that you can copy and paste into the R command line.


Bar charts

The bar chart is familiar to everyone and is a useful graphical tool that may be used in a variety of ways.

The basic function is: barplot(data)

Before you can draw a graph you need to get your data into an appropriate format. R has many ways of manipulating data but it is often easiest to assemble and manipulate your data in a spreadsheet (you can save in .CSV format).

The first stage is to arrange your data in a .CSV file. You may have your data arranged in columns or in rows. You may also have both row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore, and a name cannot start with a number.

The second stage is to read your data file into memory and give it a sensible name. When using barplots you may have both row and column names so don't forget to tell R that you are using row names if you are.
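As a rough sketch of that second stage, the commands below write a tiny made-up .CSV file to a temporary location and read it back in. The file contents, the site and season names and the variable name my.data are all invented for illustration; only read.csv(), as.matrix() and barplot() are standard R.

```r
# Hypothetical data: two seasons (rows) measured at two sites (columns).
# The values and names are made up purely for illustration.
csv.file <- tempfile(fileext = ".csv")
writeLines(c(
  ",SiteA,SiteB",
  "Spring,12,15",
  "Summer,20,18"
), csv.file)

# row.names = 1 tells R that the first column holds the row names
my.data <- read.csv(csv.file, row.names = 1)
print(my.data)

# barplot() wants a vector or a matrix, so convert the data frame first
barplot(as.matrix(my.data), beside = TRUE, legend.text = rownames(my.data))
```

In practice you would replace csv.file with the path to your own file, or use file.choose() to pick it interactively.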


Simple multi-category chart

Your data may consist of a simple row of means e.g. here are some data on death rates (per 1000 of population) in Virginia in 1940. These data come with the basic distribution of R and are called VADeaths. The means have been extracted below and assigned to the variable VADmeans.

Rural Male  Rural Female  Urban Male  Urban Female
     32.74         25.18       40.48         25.28
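The page does not show how the means were extracted. One way that reproduces the values above, assuming the means are taken down each column of VADeaths, is colMeans():

```r
# VADeaths ships with R; each column is one population group.
# Assumption: VADmeans holds the mean of each column.
VADmeans <- colMeans(VADeaths)
print(round(VADmeans, 2))

# The result is a named vector, so barplot() labels the bars for us
barplot(VADmeans)
```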

We can see that there are four categories. To create a basic bar chart we simply call the barplot() function:

barplot(VADmeans, main="Death Rates in Virginia", xlab="Categories", ylab="Mean death rate (per 1000)")

This produces a very basic plot; I have added a main title and labels for the x and y axes using fairly simple commands. When plotting a graph R opens a graphics window. If you select the window (by clicking on or in it) you may then copy to the clipboard and paste into a variety of applications.


Stacked charts or not?

The VADeaths dataset consists of a matrix of values with both column and row labels:

       Rural Male  Rural Female  Urban Male  Urban Female
50-54        11.7           8.7        15.4           8.4
55-59        18.1          11.7        24.3          13.6
60-64        26.9          20.3        37.0          19.3
65-69        41.0          30.9        54.6          35.1
70-74        66.0          54.3        71.1          50.0

If we attempt to produce a bar chart of these data we get something like the following:

barplot(VADeaths, legend= rownames(VADeaths))

This time a legend was added using the legend command along with the rownames of the dataset. We see that by default a stacked bar chart is produced. To unstack the bars and plot them alongside one another we use a new command:

barplot(VADeaths, legend= rownames(VADeaths), beside= TRUE)

This is fine but the colour scheme is kind of boring. Here is a new set of commands:

barplot(VADeaths, beside = TRUE, col = c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk"), legend = rownames(VADeaths), ylim = c(0, 100))

title(main = "Death Rates in Virginia", font.main = 4)

This is a bit better. We have specified a list of colours to use for the bars. Note how the list is in the form c(item1, item2, item3, item4). The command ylim sets the limits of the y-axis. In this case a lower limit of 0 and an upper of 100. The command is in the form ylim= c(lower, upper) and note again the use of the c(item1, item2) format. The legend takes the names from the row names of the datafile. We set the y-axis limit to accommodate the legend box.

It is possible to specify the title of the graph as a separate command, which is what was done above. The command title() achieves this but of course it only works when a graphics window is already open. The command font.main sets the typeface, 4 produces bold italic font.
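If you would rather not pick the y-axis limit by eye, you can work it out from the data. The following is only a sketch: the headroom factor of 1.4 and the names bar.cols and upper are my own choices for illustration, not anything built into R.

```r
# One colour per row (age group) of VADeaths
bar.cols <- c("lightblue", "mistyrose", "lightcyan", "lavender", "cornsilk")
stopifnot(length(bar.cols) == nrow(VADeaths))

# Leave roughly 40% headroom above the tallest bar for the legend box
# (the 1.4 is an arbitrary choice), then round up to the next 10
upper <- ceiling(max(VADeaths) * 1.4 / 10) * 10

barplot(VADeaths, beside = TRUE, col = bar.cols,
        legend.text = rownames(VADeaths), ylim = c(0, upper))
title(main = "Death Rates in Virginia", font.main = 4)
```

For these data the calculation gives the same upper limit of 100 that was typed in by hand above.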


Frequency plots

Sometimes you will have a single column of data that you wish to summarize. A common use of a bar chart is to produce a frequency plot showing the number of items in various ranges. Here is a vector of numbers:

75 67 70 75 65 71 67 67 76 68

These have been assigned to a variable called carb and we wish to make a frequency plot. Let's try:

barplot(carb)

Oops. That wasn't really what we wanted at all. What's happened is that each item has been plotted as a separate entity. We need to tabulate the frequencies. Fortunately there is an easy way to do this. We use the table() function. Let's redraw the graph but using the following:

barplot(table(carb))

This is much better. Now we have the frequencies for the data arranged in several categories (sometimes called bins). As with other graphs we can add titles to axes and to the main graph.

We can look at the table() function directly to see what it produces.

table(carb)

carb
65 67 68 70 71 75 76
 1  3  1  1  1  2  1

We can see that the function has summarised the data for us into various numerical categories.
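To try this yourself, here are the steps in one go. The values are the ten numbers listed above; the name carb.freq and the axis labels are just my choices for illustration.

```r
# The ten values from the text, combined into a vector
carb <- c(75, 67, 70, 75, 65, 71, 67, 67, 76, 68)

# table() counts how many times each distinct value occurs
carb.freq <- table(carb)
print(carb.freq)

# The category labels are stored as the names of the table
print(names(carb.freq))

# Frequency plot with axis labels added (the label text is my own)
barplot(carb.freq, xlab = "Value", ylab = "Frequency")
```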


We may wish to show the frequencies as a proportion of the total rather than as raw data. To do this we simply divide each item by the total number of items in our dataset:

barplot(table(carb)/length(carb))

This shows exactly the same pattern but now the total of all the bars adds up to one.
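A quick check that the proportions behave as described. The built-in prop.table() function is an alternative way of doing the same division; the name carb.prop is mine.

```r
carb <- c(75, 67, 70, 75, 65, 71, 67, 67, 76, 68)

# Divide each count by the number of items
carb.prop <- table(carb) / length(carb)
print(carb.prop)

# The proportions should sum to one
print(sum(carb.prop))

# prop.table() does the same job without needing length()
carb.prop2 <- prop.table(table(carb))
print(all.equal(as.vector(carb.prop), as.vector(carb.prop2)))
```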


Horizontal bar plots

It is straightforward to rotate your plot so that the bars run horizontally rather than vertically (which is the default). To produce a horizontal bar plot you add horiz= TRUE to the command e.g.

barplot(table(carb), horiz=TRUE, col="lightgreen", xlab="Frequency", ylab="Range")
title(main="Horizontal Bar Plot", font.main= 4)

This time I have used the title() command to add the main title separately. The value of 4 sets the font to bold italic (try other values).


Histograms

The barplot function can be used to create a frequency plot of sorts but it does not produce a continuous distribution along the x-axis. A true frequency distribution should have the bar categories (i.e. the x-axis) as continuous items. The frequency plot produced previously has discontinuous categories.

To create a frequency distribution chart we need a histogram, which has a continuous range along the x-axis. The command in R is:

hist(variable)

Here is a vector of numbers saved as the variable test.data:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8

To create a histogram we type:

hist(test.data)

To plot probability densities rather than the actual frequencies (so that the total area of the bars is one) we need to add the command prob= TRUE like so:

hist(test.data, prob= TRUE)
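If you want to see the numbers behind the histogram, you can ask hist() to return them without drawing anything by adding plot = FALSE. This is a small sketch using the values listed above; the name h is mine.

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8,
               1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(test.data, plot = FALSE)

print(h$breaks)   # where R decided to put the break points
print(h$counts)   # number of items in each bar
print(h$density)  # heights used when prob = TRUE

# The densities are scaled so that the total area of the bars is one
print(sum(h$density * diff(h$breaks)))
```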

This is useful but the plots are a bit basic and boring. We can change axis labels and the main title using the same commands as for the barplot function. Here is a new plot with a few enhancements:

hist(test.data, col="cornsilk", xlab="Data range", ylab="Frequency of data", main="Histogram", font.main=4)

These commands are largely self-explanatory. The 4 in the font.main command sets the font to bold italic (try some other values).

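By default R works out where to insert the breaks between the bars. You can change the number of breaks by adding a simple command. R treats a single number as a suggestion only and rounds the break points to convenient values, so you may not get exactly the number you asked for e.g.

hist(data.set,breaks=10) # about 10 breaks, or just hist(data.set, 10)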

The # tells R that what follows is a comment, useful for creating your own library of commands.

Alternatively you can be more specific and set the breaks exactly:

hist(data.set,breaks=c(0,1,2,3,4,5,10,20,max(data.set))) # specify break points exactly

Notice how the exact break points are specified in the c(x1, x2, x3) format. When the bars are of unequal width, as here, R plots densities rather than counts by default. You can manipulate the axes by changing the limits e.g. make the x-axis start at zero and run to 6 by another simple command e.g.:

hist(test.data, 10, xlim=c(0,6), ylim=c(0,10))

This suggests 10 breaks (R may choose a slightly different number) and sets the y-axis from 0-10 and the x-axis from 0-6. Notice how the commands are in the format c(lower, upper). The xlim and ylim commands are useful if you wish to prepare several histograms and want them all to have the same scale for comparison.
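When several histograms need to be directly comparable it can help to fix the break points as well as the axis limits. Here is one possible way, using seq() to generate evenly spaced breaks; the bin width of 0.5 and the name my.breaks are arbitrary choices for illustration.

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8,
               1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)

# A single number is only a suggestion, so check what R actually used
h10 <- hist(test.data, breaks = 10, plot = FALSE)
print(length(h10$counts))

# Evenly spaced break points from 0 to 6 (the 0.5 width is arbitrary).
# The breaks must cover the whole range of the data.
my.breaks <- seq(0, 6, by = 0.5)

h <- hist(test.data, breaks = my.breaks, xlim = c(0, 6), ylim = c(0, 10),
          col = "cornsilk", main = "Histogram with fixed breaks")

# With explicit break points you get exactly one bar per interval
print(length(h$counts))
```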


Box and whisker plots

Single sample plot

A box and whisker graph allows you to convey a lot of information on one simple plot. Generally they are used for data that are not normally distributed, where the median and quartiles summarise the sample better than the mean. You can plot a single sample or create a more complex plot of categories within a data set.

The basic function is boxplot()

Here is a vector of numbers saved as the variable test.data:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8

To create a box-whisker plot we type:

boxplot(test.data)

Not the most exciting graph ever but we can jazz it up later. What we see is a box with a line through it. The line represents the median of the sample. The box itself shows the upper and lower quartiles. The whiskers here reach the largest and smallest values. It is easy to see that this sample has a skewed distribution and is unlikely to be normally distributed.
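You can get the numbers that the plot is drawn from using boxplot.stats(), which is the function boxplot() uses behind the scenes. The sketch below uses the values listed above; the name b is mine.

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8,
               1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)

b <- boxplot.stats(test.data)

# Five values: lower whisker, lower hinge (quartile), median,
# upper hinge (quartile), upper whisker
print(b$stats)

# Any points that would be drawn separately as outliers (none here)
print(b$out)

# With no outliers the whiskers match the smallest and largest values
print(range(test.data))
```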

We can add axis labels, a main title and colour the box using simple commands. These commands are the same as for those used in producing barplots and histograms. For example:

boxplot(test.data, xlab="Single sample", ylab="Value axis", main="Simple Box plot", col="lightblue")

Let's make the data even more skewed and add an outlier, saving the result as test.data2:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8 12.0

We'll now redraw the graph. This time the main title will be added using a separate command:

boxplot(test.data2, xlab="Single sample", ylab="Value axis", col="lightblue")
title(main="Plot with outlier", font.main= 4)

Now we see the outlier separately. R doesn't automatically show the full range of data (as I implied earlier). We can control the range shown using a simple command range= n. If we set n to 0 then the whiskers extend to the full range. Otherwise each whisker extends to the most extreme value that lies within n x the inter-quartile range of the box, and anything beyond that is plotted as a separate point. The default is set to n = 1.5.

boxplot(test.data2, xlab="Single sample", ylab="Value axis", col="lightblue", range=0)
title(main="Plot with full-range", font.main= 4)
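To see the effect of range in numbers rather than pictures, boxplot.stats() can be used again. Its coef argument plays the same role as range in boxplot(). The variable names default.stats and full.stats are mine.

```r
# The original sixteen values plus the outlier of 12.0
test.data2 <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8,
                1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8, 12.0)

# Default (coef = 1.5): the upper whisker stops short of the outlier
default.stats <- boxplot.stats(test.data2)
print(default.stats$stats)
print(default.stats$out)

# coef = 0 corresponds to range = 0: whiskers cover the full range
full.stats <- boxplot.stats(test.data2, coef = 0)
print(full.stats$stats)
print(full.stats$out)
```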


Plotting several samples

So we can see how to represent a single sample but often we wish to compare samples. For example, we may have raised broods of flies on various sugars. We measure the size of the individual flies and record the diet for each. Our data file would consist of two columns; one for growth and one for sugar. e.g.

growth  sugar
    75  C
    72  C
    73  C
    61  F
    67  F
    64  F
    62  S
    63  S

These data are the same as we used in the example on analysis of variance. Only part of the larger data set is shown here. We have one variable, growth, and several samples (i.e. the different sugars). To plot these we use the boxplot command with slightly different syntax e.g. boxplot(y ~ x). This model syntax is used widely in R, for example when setting up ANOVA and regression analyses.

To create a summary boxplot we type something like:

boxplot(growth ~ sugar, data=fly, xlab="Sugar type", ylab="Growth", col="bisque", range=0)
title(main="Growth against sugar type", font.main= 4)

Now we can see that the different sugar treatments appear to produce differing growth in our subjects.
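If you do not have the data file to hand, you can build a small version of the fly data directly in R. Note that this sketch uses only the eight rows shown above, not the full data set, so the plot will differ from one made with the complete file.

```r
# Only the eight rows shown in the text, not the full data set
fly <- data.frame(
  growth = c(75, 72, 73, 61, 67, 64, 62, 63),
  sugar  = factor(c("C", "C", "C", "F", "F", "F", "S", "S"))
)

# growth ~ sugar reads as "growth grouped by sugar"
boxplot(growth ~ sugar, data = fly, xlab = "Sugar type", ylab = "Growth",
        col = "bisque", range = 0)
title(main = "Growth against sugar type", font.main = 4)

# The median of each group is the line drawn across each box
fly.medians <- tapply(fly$growth, fly$sugar, median)
print(fly.medians)
```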


Horizontal box plots

It is straightforward to rotate your plot so that the boxes run horizontally rather than vertically (which is the default). To produce a horizontal plot you add horizontal= TRUE to the command e.g.

boxplot(growth ~ sugar, data=fly, ylab="Sugar type", xlab="Growth", col="mistyrose", range=0, horizontal=TRUE)
title(main="Growth against sugar type - horizontal", font.main= 4)

Once again I have used the title command separately to add a main title. The 4 in the font.main command sets bold italic (try other values). The ylab and xlab instructions refer to the left and bottom axes respectively so it is important to switch these around; it is easy to forget.


 