Dr. Mark Gardener
Using R for statistical analyses - Graphs 1

This page is intended to help you get to grips with the powerful statistical program called R. It is not intended as a course in statistics. If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

I run training courses in data management, visualisation and analysis using Excel and R: The Statistical Programming Environment. From 2013 courses will be held at The Field Studies Council Field Centre at Slapton Ley in Devon. Alternatively I can come to you and provide the training at your workplace. See details on my Courses Page.

On this page you can find information on producing a range of graphs to illustrate your analyses. Specifically you'll find information on bar charts, histograms and box-whisker plots. For information on scatter plots, pie charts and stem and leaf plots you need to go to the graph2 page.
What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T's Bell Laboratories, beginning in the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995. It has quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page: to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.
Introduction to Graphing

R has great graphical power but it is not a point and click interface. This means that you must use typed commands to get it to produce the graphs you desire. This can be a bit tedious at first but once you have the hang of it you can save a list of useful commands as text that you can copy and paste into the R command line.
Bar charts

The bar chart is familiar to everyone and is a useful graphical tool that may be used in a variety of ways. The basic function is:

barplot(data)

Before you can draw a graph you need to get your data into an appropriate format. R has many ways of manipulating data but it is often easiest to assemble and manipulate your data in a spreadsheet (you can save in .CSV format).

The first stage is to arrange your data in a .CSV file. You may have your data arranged in columns or in rows. You may also have both row and column names. Don't forget that variable names in R can contain letters and numbers but the only punctuation allowed is a period or an underscore, and a name cannot start with a number.

The second stage is to read your data file into memory and give it a sensible name. When using barplots you may have both row and column names so don't forget to tell R that you are using row names if you are (e.g. with the row.names instruction of read.csv()).

Simple multi-category chart

Your data may consist of a simple row of means, e.g. here are some data on death rates in Virginia. These data come with the basic distribution of R and are called VADeaths. The means have been extracted and assigned to the variable VADmeans.
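The page does not show how the means were extracted. As a minimal sketch, assuming they were taken down each column of VADeaths, colMeans() gives one value per category:

```r
# VADeaths ships with base R (datasets package), so no file needs to be read.
# Assumption: the means are column means; the source does not show the command.
VADmeans <- colMeans(VADeaths)

# One mean per column of the matrix, i.e. one value per category
print(VADmeans)
```

If your own data come from a .CSV file instead, the same idea applies once the file has been read in with its row names.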
There are four categories, one for each column of VADeaths. To create a basic bar chart we simply call the barplot() function:

barplot(VADmeans, main="Death Rates in Virginia", xlab="Categories", ylab="Mean Deaths")

This produces a very basic plot; I have added a main title and labels for the x and y axes using the main, xlab and ylab instructions.

When plotting a graph R opens a graphics window. If you select the window (by clicking on or in it) you may then copy to the clipboard and paste into a variety of applications.
Stacked charts or not?

The VADeaths dataset consists of a matrix of values with both column and row labels.
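To see that structure for yourself, the matrix can be printed and inspected directly. This short sketch uses only base R functions; the object names vad_dim, vad_rows and vad_cols are just illustrative:

```r
# Print the whole matrix: rows are age groups, columns are population groups
print(VADeaths)

# Dimensions and labels, which barplot() uses for the bars and the legend
vad_dim  <- dim(VADeaths)       # number of rows and columns
vad_rows <- rownames(VADeaths)  # age groups
vad_cols <- colnames(VADeaths)  # population groups
```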
If we attempt to produce a bar chart of these data we get something like the following:

barplot(VADeaths, legend= rownames(VADeaths))

This time a legend was added using the legend instruction, which here takes the row names of the dataset. We see that by default a stacked bar chart is produced. To unstack the bars and plot them alongside one another we add a new instruction, beside= TRUE:

barplot(VADeaths, legend= rownames(VADeaths), beside= TRUE)

This is fine but the colour scheme is kind of boring. Here is a new set of commands:

barplot(VADeaths, beside = TRUE, col = c("lightblue", "mistyrose", "lightcyan", "lavender", "cornsilk"), legend = rownames(VADeaths), ylim = c(0, 100))
title(main = "Death Rates in Virginia", font.main = 4)

This is a bit better. We have specified a list of colours to use for the bars, one for each row of the dataset. Note how the list is in the form c(item1, item2, item3, item4). The instruction ylim sets the limits of the y-axis, in this case a lower limit of 0 and an upper limit of 100. It is in the form ylim= c(lower, upper); note again the use of the c(item1, item2) format. The legend takes its names from the row names of the dataset. We set the y-axis limit to leave room for the legend box.

It is possible to specify the title of the graph as a separate command, which is what was done above. The command title() achieves this but of course it only works when a graphics window is already open. The instruction font.main sets the typeface; 4 produces a bold italic font.
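The legend position can also be controlled directly rather than relying on the y-axis limit alone. The sketch below assumes the args.legend instruction of barplot(), which hands its contents on to legend(); the names bar_cols and bar_mids are only illustrative. It also shows that barplot() returns the positions of the bar midpoints, which is handy if you want to add text or error bars later.

```r
bar_cols <- c("lightblue", "mistyrose", "lightcyan", "lavender", "cornsilk")

# Put the legend in the top-left corner, without a box around it
bar_mids <- barplot(VADeaths, beside = TRUE, col = bar_cols,
                    legend.text = rownames(VADeaths),
                    args.legend = list(x = "topleft", bty = "n"),
                    ylim = c(0, 100))
title(main = "Death Rates in Virginia", font.main = 4)

# With beside = TRUE the midpoints come back as a matrix:
# one row per age group, one column per population group
dim(bar_mids)
```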
Frequency plots

Sometimes you will have a single column of data that you wish to summarize. A common use of a bar chart is to produce a frequency plot showing the number of items in each category. Here is a vector of numbers:

75 67 70 75 65 71 67 67 76 68

These have been assigned to a variable called carb and we wish to make a frequency plot. Let's try:

barplot(carb)

Oops. That wasn't really what we wanted at all. What has happened is that each item has been plotted as a separate bar. We need to tabulate the frequencies. Fortunately there is an easy way to do this: we use the table() function. Let's redraw the graph using the following:

barplot(table(carb))

This is much better. Now we have the frequencies for the data arranged in several categories (sometimes called bins). As with other graphs we can add titles to axes and to the main graph.

We can look at the table() function directly to see what it produces.

table(carb)
The function summarises the data for us, counting how many times each distinct value occurs.
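The steps above can be put together as a short, self-contained sketch. The values are the ones listed for carb; the name carb_freq and the axis labels are only illustrative:

```r
# The ten values listed above
carb <- c(75, 67, 70, 75, 65, 71, 67, 67, 76, 68)

# Count how often each distinct value occurs
carb_freq <- table(carb)
print(carb_freq)

# One bar per distinct value, bar height = number of occurrences
barplot(carb_freq, xlab = "Value", ylab = "Frequency")
```

Note that table() makes one category for every distinct value; it does not group neighbouring values into wider ranges.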
We may wish to show the frequencies as a proportion of the total rather than as raw counts. To do this we simply divide each frequency by the total number of items in our dataset:

barplot(table(carb)/length(carb))

This shows exactly the same pattern but now the heights of all the bars add up to one.
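Base R also has a helper for this. As a small sketch, prop.table() applied to a table of counts should give the same proportions as dividing by length(carb); the name carb_prop is only illustrative:

```r
carb <- c(75, 67, 70, 75, 65, 71, 67, 67, 76, 68)

# Convert counts to proportions of the total
carb_prop <- prop.table(table(carb))
print(carb_prop)

# Same picture as barplot(table(carb)/length(carb))
barplot(carb_prop, xlab = "Value", ylab = "Proportion")
```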
Horizontal bar plots

It is straightforward to rotate your plot so that the bars run horizontally rather than vertically (which is the default). To produce a horizontal plot you add horiz= TRUE to the command e.g.

barplot(table(carb), horiz=TRUE, col="lightgreen", xlab="Frequency", ylab="Range")

The main title can be added separately with the title() command, as before; a font.main value of 4 sets the font to bold italic (try other values).
Histograms

The barplot function can be used to create a frequency plot of sorts but it does not produce a continuous distribution along the x-axis. A true frequency distribution should have the bar categories (i.e. the x-axis) as continuous items. The frequency plot produced previously has discontinuous categories.

To create a frequency distribution chart we need a histogram, which has a continuous range along the x-axis. The command in R is:

hist(variable)

Here is a vector of numbers saved as the variable test.data:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8

To create a histogram we type:

hist(test.data)

To plot probability densities (scaled so that the areas of the bars add up to one) rather than the actual frequencies we need to add the instruction prob= TRUE like so:

hist(test.data, prob= TRUE)

This is useful but the plots are a bit basic and boring. We can change axis labels and the main title using the same instructions as for the barplot function. Here is a new plot with a few enhancements:

hist(test.data, col="cornsilk", xlab="Data range", ylab="Frequency of data", main="Histogram", font.main=4)

These instructions are largely self-explanatory. The 4 in the font.main instruction sets the font to bold italic (try some other values).

By default R works out where to insert the breaks between the bars. You can ask for a different number of breaks by adding a simple instruction (R treats the number as a suggestion and picks convenient round break points close to it) e.g.

hist(data.set, breaks=10) # 10 breaks, or just hist(data.set, 10)

The # tells R that what follows is a comment, useful for creating your own library of commands. Alternatively you can be more specific and set the breaks exactly:

hist(data.set, breaks=c(0,1,2,3,4,5,10,20,max(data.set))) # specify break points exactly

Notice how the exact break points are specified in the c(x1, x2, x3) format; they must span the whole range of the data.

You can manipulate the axes by changing the limits, e.g. make the x-axis start at zero and run to 6, by another simple instruction:

hist(test.data, 10, xlim=c(0,6), ylim=c(0,10))

This asks for 10 breaks and sets the y-axis from 0-10 and the x-axis from 0-6. Notice how the limits are in the format c(lower, upper). The xlim and ylim instructions are useful if you wish to prepare several histograms and want them all to have the same scale for comparison.
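If you want to check which break points R actually chose, hist() can return its calculations without drawing anything. This is a minimal sketch using the plot = FALSE instruction and the test.data values given above; the names h and bar_area are only illustrative:

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1,
               4.8, 1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)

# Compute the histogram without plotting it
h <- hist(test.data, plot = FALSE)

h$breaks   # the break points R chose
h$counts   # the number of values falling in each bar
h$density  # the bar heights used when prob = TRUE

# With densities, the areas of the bars (height x width) add up to one
bar_area <- sum(h$density * diff(h$breaks))
```

Comparing h$breaks for different values of breaks is a quick way to see how closely R follows the number you asked for.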
Box and whisker plots

Single sample plot

A box and whisker graph allows you to convey a lot of information on one simple plot. Generally they are used for data that are not normally distributed (i.e. where non-parametric methods are appropriate). You can plot a single sample or create a more complex plot of categories within a data set. The basic function is boxplot().

Here is a vector of numbers saved as the variable test.data:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8

To create a box-whisker plot we type:

boxplot(test.data)

Not the most exciting graph ever but we can jazz it up later. What we see is a box with a line through it. The line represents the median of the sample. The box itself shows the upper and lower quartiles. The whiskers show the range (i.e. the largest and smallest values). It is easy to see that this sample has a skewed distribution and is certainly not normally distributed.

We can add axis labels, a main title and colour the box using simple instructions. These are the same as those used in producing barplots and histograms. For example:

boxplot(test.data, xlab="Single sample", ylab="Value axis", main="Simple Box plot", col="lightblue")

Let's make the data even more skewed and add an outlier, saving the result as test.data2:

2.1 2.6 2.7 3.2 4.1 4.3 5.2 5.1 4.8 1.8 1.4 2.5 2.7 3.1 2.6 2.8 12.0

We'll now redraw the graph. This time the main title is left out of the boxplot() call so that it can be added separately with the title() command:

boxplot(test.data2, xlab="Single sample", ylab="Value axis", col="lightblue")
Now we see the outlier plotted separately. R doesn't automatically extend the whiskers to the full range of the data (as I implied earlier). We can control how far they reach using a simple instruction, range= n. If we set n to 0 then the whiskers extend to the full range. Otherwise the whiskers extend to the most extreme data point that lies within n times the inter-quartile range of the box. The default is n = 1.5.

boxplot(test.data2, xlab="Single sample", ylab="Value axis", col="lightblue", range=0)
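To see the numbers behind the two plots, the sketch below uses boxplot.stats(), whose coef instruction plays the same role as range in boxplot(). It assumes the extended data were saved as test.data2, as in the command above; the names default_stats and full_stats are only illustrative:

```r
test.data2 <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8,
                1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8, 12.0)

# Default whiskers: 1.5 times the inter-quartile range
default_stats <- boxplot.stats(test.data2)
default_stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
default_stats$out    # points drawn separately as outliers

# coef = 0 corresponds to range = 0: whiskers run to the extremes
full_stats <- boxplot.stats(test.data2, coef = 0)
full_stats$stats
full_stats$out       # empty, because nothing is treated as an outlier
```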
Plotting several samples

So we can see how to represent a single sample but often we wish to compare samples. For example, we may have raised broods of flies on various sugars. We measure the size of the individual flies and record the diet for each. Our data file would consist of two columns: one for growth and one for sugar.
These data are the same as we used in the example on analysis of variance. We have one variable, growth, and several samples (i.e. the different sugars). To plot these we use the boxplot command with slightly different syntax e.g. boxplot(y ~ x). This model syntax is used widely in R, for example when setting up ANOVA and regression analyses. To create a summary boxplot from a data set called fly we type something like:

boxplot(growth ~ sugar, data=fly, xlab="Sugar type", ylab="Growth", col="bisque", range=0)

Now we can see that the different sugar treatments appear to produce differing growth in our subjects.
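The fly data are not reproduced on this page, so the following sketch uses PlantGrowth, a dataset that ships with R, as a stand-in. This is an assumption made purely for illustration; it is not the fly data. The y ~ x syntax works in the same way: weight plays the role of growth and group the role of sugar.

```r
# PlantGrowth has a numeric column (weight) and a grouping column (group)
bp <- boxplot(weight ~ group, data = PlantGrowth,
              xlab = "Treatment group", ylab = "Dried weight",
              col = "bisque", range = 0)

# boxplot() also returns what it plotted
bp$names  # one box per level of the grouping column
bp$n      # number of observations behind each box
```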
Horizontal box plots

It is straightforward to rotate your plot so that the boxes run horizontally rather than vertically (which is the default). To produce a horizontal plot you add horizontal= TRUE to the command e.g.

boxplot(growth ~ sugar, data=fly, ylab="Sugar type", xlab="Growth", col="mistyrose", range=0, horizontal=TRUE)

Once again the main title can be added separately with the title command; a font.main value of 4 sets bold italic (try other values). The ylab and xlab instructions refer to the left and bottom axes respectively so it is important to switch these around; it is easy to forget.
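As a runnable sketch of the horizontal version with a separate title, the example below again uses the built-in PlantGrowth data as a stand-in for the fly data (an assumption for illustration only); the title text and the name bp_h are made up for the example:

```r
# Horizontal boxes: the grouping labels now belong on the left (y) axis
bp_h <- boxplot(weight ~ group, data = PlantGrowth,
                ylab = "Treatment group", xlab = "Dried weight",
                col = "mistyrose", range = 0, horizontal = TRUE)

# Add the main title afterwards; font.main = 4 gives bold italic
title(main = "Horizontal box plot", font.main = 4)
```

Because title() draws on the plot that is already open, it must come after the boxplot() command.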