Loading Data

Jonathan D. Bakker


Learning Objectives

To demonstrate how to load data in R.

To introduce the sample datasets that we will use throughout the quarter.

To introduce how to use matching (%in%) to index an object and create new objects.

Resources

Broman & Woo (2018)

RStudio cheat sheets for reference:

Data Import with readr, readxl, and googlesheets4

Introduction

There are several ways to load data into R. Small datasets can be typed directly and assigned to an object, though this is not practical for larger datasets. My preferred method for loading data is to organize it in Excel, save the file, and then read it into R. Note that these functions assume your data are organized in a rectangular format, following the recommendations of Broman & Woo (2018).
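As a minimal sketch of typing a small dataset directly, the following builds a data frame by hand. The object name (plots) and its columns are hypothetical and are not part of the course datasets.

```r
# Hypothetical example: three plots typed in by hand
plots <- data.frame(
  plot  = c("A", "B", "C"),
  year  = c(2020, 2020, 2021),
  cover = c(12.5, NA, 40.0)   # NA marks a missing measurement
)
plots
```

This is workable for a handful of values, but quickly becomes tedious and error-prone as the dataset grows.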

Key Takeaways

Data files should be organized in rectangular format. The simplest form has rows as samples and columns as data:

Each row is a unique record, such as the observations from one plot at one time.
Each column is a variable. Some of these may be explanatory (e.g., a column identifying the plot and a column identifying the time at which each sample was taken) and others may be responses (e.g., columns identifying different species in a community, or other measurements made in that plot at that time).
Each cell is the value of a given variable (column) for a given sample (row). Missing values should be indicated as ‘NA’; enter a zero only where zero is the actual value (e.g., a species that was absent from a plot).
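How a cell is coded affects later calculations. The short sketch below uses made-up numbers to compare a value recorded as NA with one recorded as zero; which coding is appropriate depends on whether the measurement is truly missing or truly zero.

```r
cover <- c(10, NA, 20)        # made-up values; NA = not measured

mean(cover)                   # NA: missing values propagate
mean(cover, na.rm = TRUE)     # 15: the missing value is dropped
mean(c(10, 0, 20))            # 10: a zero is treated as a real observation
```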

While you can read data into R by typing in the full path to the desired file, it is preferable to establish a project folder and work within it, as discussed in the ‘Reproducible Research’ chapter. When you open an R project file, the working directory is automatically set to the folder in which it resides.

You can hard-code the name of the file that you want to load. For example, if the file is saved in the ‘data’ sub-folder and is named ‘data.csv’:
dataa <- read.csv("data/data.csv")

(Note: I don’t recommend using ‘data’ as an object name, as there is a data() function within R.)
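If a hard-coded path fails, it is usually because the working directory is not where you think it is. The following sketch shows one way to check; it uses the same example folder and file names as above and simply reports whether the file can be found.

```r
# Where is R currently looking for files?
getwd()

# Build the relative path from its parts ('data' folder, 'data.csv' file)
path <- file.path("data", "data.csv")
path

# TRUE if the file exists relative to the working directory, FALSE otherwise
file.exists(path)
```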

Alternatively, you could use file.choose() to navigate to and select the desired file:
dataa <- read.csv(file.choose())

There are times when I find this helpful because it can be quick, but note that this is not automated: if you ran the script again, it would pause here until you selected the file.

Text files are the most common type of files to be loaded, but others are possible as well.

Text Files

The main function to load data is called read.table(). This is a generic function with arguments that allow you to customize the call to reflect how your data file is formatted. Some standard file formats have had the required arguments hard-coded: read.delim() for tab-delimited text files, and read.csv() for CSV files. The arguments that I find most helpful include:

file – the key required argument; the name of the file to be read.
header – whether the names of the variables are included in the first line of the file. For read.table(), the default is that they are not (header = FALSE); read.csv() and read.delim() default to header = TRUE. I almost always include variable names in the first line and therefore set header = TRUE.
row.names – the name or number of a column containing a field that you want to assign to row names. Note that this will not work unless the referenced column contains a unique value for each row.
na.strings – unique values in the dataset that you want to be replaced with ‘NA’ (Not Available). For example, missing data might have been coded as ‘999’; you obviously would not want to include these values when calculating an average.
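To see these arguments working together, here is a small self-contained sketch. It writes a made-up file to a temporary location (the plot names and values are invented for illustration) and then reads it back in with read.table().

```r
# Write a small made-up file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("plot,elev,cover",
             "P1,120,35",
             "P2,999,50",
             "P3,140,999"), tmp)

# header = TRUE:      the first line holds the variable names
# sep = ",":          fields are separated by commas
# row.names = 1:      use the first column (plot) as row names
# na.strings = "999": treat 999 as missing
demo <- read.table(tmp, header = TRUE, sep = ",",
                   row.names = 1, na.strings = "999")
demo

mean(demo$cover, na.rm = TRUE)   # 42.5; the 999 code is no longer counted as a value
```

Note that na.strings applies to every column, so only use a code such as 999 if it could never be a legitimate value anywhere in the file.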

As an example, let’s load the oak plant community dataset that is introduced below. This file has column names in the first row, so we will include the header argument. It also has a unique code for each entry in the first column, so we will include the row.names argument:
Oak <- read.csv("data/Oak_data_47x216.csv", header = TRUE, row.names = 1)

The readr package, part of the tidyverse, provides another set of functions to read rectangular data.
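A minimal sketch of the readr equivalent is shown below, assuming that a reasonably recent version of readr is installed (the code is skipped otherwise). The file contents are made up for illustration.

```r
# Assumes the readr package is installed; skipped otherwise
if (requireNamespace("readr", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".csv")
  writeLines(c("plot,cover", "P1,35", "P2,NA"), tmp)

  # read_csv() assumes a header row and returns a tibble
  demo_tbl <- readr::read_csv(tmp, show_col_types = FALSE)
  print(demo_tbl)
}
```

One difference to be aware of is that tibbles do not carry row names, so read_csv() has no counterpart to the row.names argument; the identifying column is kept as an ordinary column instead.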

Other File Types

The readxl package allows you to load data from a specific sheet of a Microsoft Excel file. Formulae within the Excel file will not be kept; just the resulting value will be loaded.
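As a sketch of how this might look, the code below reads one sheet from an example workbook that ships with readxl (not one of the course files). It assumes readxl is installed and is skipped otherwise.

```r
# Assumes the readxl package is installed; skipped otherwise
if (requireNamespace("readxl", quietly = TRUE)) {
  # Example workbook distributed with readxl
  xlsx <- readxl::readxl_example("datasets.xlsx")

  print(readxl::excel_sheets(xlsx))                  # names of the sheets in the file
  mt <- readxl::read_excel(xlsx, sheet = "mtcars")   # load one sheet by name
  print(head(mt))
}
```

The sheet argument also accepts a number, but referring to a sheet by name is less likely to break if the sheets are later re-ordered.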

Large datasets are often stored in relational databases. There are packages that allow you to build and run queries in these databases, and to export the data into R. RODBC is one example of this type of package; many others are listed here:
https://cran.r-project.org/web/views/Databases.html
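For a sense of what querying a database from R looks like, here is a small sketch. It uses the DBI and RSQLite packages rather than RODBC, purely because an in-memory SQLite database needs no external setup; the table and its contents are made up, and the code is skipped if the packages are not installed.

```r
# Assumes the DBI and RSQLite packages are installed; skipped otherwise
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")   # temporary in-memory database

  # Create a made-up table to query
  DBI::dbWriteTable(con, "stands",
                    data.frame(stand = c("S1", "S2", "S3"),
                               elev  = c(120, 95, 140)))

  # Run a query and bring the result into R as a data frame
  high <- DBI::dbGetQuery(con, "SELECT stand, elev FROM stands WHERE elev > 100")
  DBI::dbDisconnect(con)
  print(high)
}
```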

Verifying That Data Are Loaded Correctly

Loading data is (perhaps obviously) only useful if you assign the result to an object; otherwise, all you’ve done is display it in the Console.

It’s important to verify that the data were loaded correctly. There are several ways to do so by examining the resulting object:

Check the size or dimensions of the object. This information is reported in the ‘Environment’ panel (upper-right quadrant) of RStudio, but we can also display it in the Console using the dim() function. The results are always ordered as the number of rows followed by the number of columns.
View the first few records using head(). Can you guess what the tail() function does?
View a data summary using summary().
View the structure of the object using str().
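The checks listed above can be tried on any object. The sketch below applies them to a small made-up data frame that stands in for a freshly loaded dataset; the column names are hypothetical.

```r
# Made-up object standing in for a freshly loaded dataset
demo <- data.frame(
  stand = c("S1", "S2", "S3", "S4"),
  elev  = c(120, 95, 140, NA),
  QUGA  = c(60, 45, 0, 30)     # hypothetical species column
)

dim(demo)        # number of rows, then number of columns
head(demo, 2)    # first two records
tail(demo, 2)    # compare with head()
summary(demo)    # per-column summaries, including a count of NAs
str(demo)        # class of each column
```

Things to look for include an unexpected number of rows or columns, numeric variables that were read in as character, and missing-value codes that were not converted to NA.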

Sample Dataset: Oak Plant Communities

The primary dataset that we’ll use as an example throughout the course is from Quercus garryana (aka Garry oak, Oregon white oak) stands in the Willamette Valley (Thilenius 1963, 1968). These data were included as an example with PC-ORD, and I have reformatted them for R – I thank Bruce McCune for granting permission to use them!

Three files are associated with this dataset. These and all other data files are available through this book’s GitHub site (https://github.com/jon-bakker/appliedmultivariatestatistics):

Oak_Metadata.docx
Oak_data_47x216.csv
Oak_species_189x5.csv

Download these files and save them in the ‘data’ sub-folder within your SEFS 502 folder (the one that contains the R project for these classes). We’ll use them throughout the quarter.

Oak_Metadata.docx explains the other files. We will not be loading it into R, but be sure to look through it. You will get to create a similar metadata file for your class project.

Oak_data_47x216.csv contains the response and explanatory variables for each of the 47 stands. The response variables are the abundances of 189 species in each stand. The explanatory variables are 27 stand-level attributes. I call them explanatory variables for simplicity; in reality, it would be more accurate to term them ‘potential explanatory variables’. Rows correspond to plots and columns correspond to variables in this object. Most analyses assume the data are structured this way. If your data are reversed, you can re-organize them this way using the t() function to transpose the object (however, if they are reversed you may also have issues with regard to the classes of the objects).
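The caveat about classes when transposing can be illustrated with a small made-up example (the object and column names are hypothetical). Transposing a data frame that mixes numeric and character columns returns a character matrix, so the numbers are no longer stored as numbers.

```r
demo <- data.frame(
  elev = c(120, 95),
  type = c("dry", "wet"),
  row.names = c("S1", "S2")
)

flipped <- t(demo)     # variables are now rows, stands are now columns
class(flipped)         # a matrix, not a data frame
flipped["elev", ]      # elevations are now character strings
is.character(flipped)  # TRUE
```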

Oak_species_189x5.csv contains the species codes associated with this project along with some simple information about each taxon, including its scientific name and life form (tree, shrub, herb, graminoid).

Note: An alternate way to organize these data is to have the response and explanatory data in separate files. Having separate files is helpful for keeping the two datasets distinct, but can cause problems if, for example, the order of data is rearranged in one but not the other. As a result, the preferred approach is to store the species data and explanatory data together in a single file that is loaded into R and then manipulated there to create new objects containing the desired components.

Loading and Indexing the Oak Plant Community Data

We are going to be using these data throughout the quarter. Each time we do so, we will need to load them and make some initial data adjustments. We begin by loading both data files:
Oak <- read.csv("data/Oak_data_47x216.csv", header = TRUE, row.names = 1)
Oak_species <- read.csv("data/Oak_species_189x5.csv", header = TRUE)

One of the columns in Oak_species is a list of the species codes. We can select the response variables from Oak by matching the column names from Oak against the names in this list. Doing so is much easier and less error-prone than manually typing them all out. We will use an extremely useful operator, %in%, to do so. It is closely related to the match() function but is more intuitive: for each element of the object on its left, it returns TRUE if that element is present in the object on its right and FALSE if it is not. Using the resulting logical vector as an index selects the matching columns.

Oak_abund <- Oak[ , colnames(Oak) %in% Oak_species$SpeciesCode]
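To see what %in% returns on its own, and how it differs from match(), consider this sketch with two short made-up vectors (the column names and species codes are hypothetical):

```r
cols  <- c("Elev", "QUGA", "Slope", "PSME")   # hypothetical column names
codes <- c("PSME", "QUGA", "ABGR")            # hypothetical species codes

cols %in% codes          # FALSE TRUE FALSE TRUE: one logical value per element of cols
match(cols, codes)       # NA 2 NA 1: positions within codes, NA where there is no match
cols[cols %in% codes]    # "QUGA" "PSME": the elements of cols that are species codes
```

The result of %in% always has the same length as the object on its left, which is what makes it suitable for indexing that object.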

Use the functions that were introduced above (Verifying That Data Are Loaded Correctly) to explore this new object and satisfy yourself that it contains just the species data.

We can also use the %in% operator to find non-matches (i.e., column names that are not included in the list of species codes):
Oak_explan <- Oak[ , ! colnames(Oak) %in% Oak_species$SpeciesCode]

Be sure you understand what happened here! These variables should all be explanatory variables (i.e., stand-level attributes). However, if a species was not named identically in both objects then it would also be included in Oak_explan.
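The risk of a mismatched name can be demonstrated with a small made-up example in which one species column is capitalized differently from its code. All names here are hypothetical, and setdiff() is just one possible way to look for codes that found no matching column.

```r
demo  <- data.frame(Elev = c(120, 95), QUGA = c(60, 45), Psme = c(5, 0))
codes <- c("QUGA", "PSME")     # "Psme" in demo does not match "PSME" exactly

is_species <- colnames(demo) %in% codes
abund  <- demo[ , is_species, drop = FALSE]    # drop = FALSE keeps a data frame even with one column
explan <- demo[ , !is_species, drop = FALSE]

colnames(explan)                  # "Elev" "Psme": the misnamed species slipped in
setdiff(codes, colnames(demo))    # "PSME": a code with no matching column
```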

Key Takeaways

If data are being combined across objects, it is much safer to do so via matching than by assuming that the samples are in the same order in both objects.
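As a sketch of what matching looks like in practice, the example below aligns two made-up objects whose rows are in different orders before combining them. The object names and values are hypothetical.

```r
abund  <- data.frame(QUGA = c(60, 45, 0), row.names = c("S1", "S2", "S3"))
explan <- data.frame(elev = c(140, 120, 95), row.names = c("S3", "S1", "S2"))  # different order

# Position of each abund row name within explan
idx <- match(rownames(abund), rownames(explan))
explan_aligned <- explan[idx, , drop = FALSE]

# Stop with an error if the rows are still not aligned
stopifnot(identical(rownames(abund), rownames(explan_aligned)))

cbind(abund, explan_aligned)
```

Had the two objects simply been bound together without matching, stand S1 would have been paired with the elevation of stand S3.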

References

Broman, K.W., and K.H. Woo. 2018. Data organization in spreadsheets. The American Statistician 72:2-10. doi: 10.1080/00031305.2017.1375989

Thilenius, J.F. 1963. Synecology of the white-oak (Quercus garryana Douglas) woodlands of the Willamette Valley, Oregon. Ph.D. dissertation. Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR. 151 p.

Thilenius, J.F. 1968. The Quercus garryana forests of the Willamette Valley, Oregon. Ecology 49:1124-1133.

License


Applied Multivariate Statistics in R Copyright © 2024 by Jonathan D. Bakker is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.