Learning Objectives

  • Extract values from vectors and data frames.
  • Perform operations on columns in a data frame.

In this lesson you will learn how to extract and manipulate data stored in data frames in R. We will work with the E. coli metadata file that we used previously. Be sure to read this file into a dataframe named metadata, if you haven’t already done so.

metadata <- read.csv('data/Ecoli_metadata.csv')

Because the columns of a data frame are vectors, we will first learn how to extract elements from vectors and then learn how to apply this concept to select rows and columns from a data frame.

Extracting values with indexing and sequences

Vectors

Let’s create a vector containing the first ten letters of the alphabet.

ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')

In order to extract one or several values from a vector, we must provide one or several indices in square brackets, just as we do in math. R indexes start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.

So, to extract the 2nd element of ten_letters we type:

ten_letters[2]

We can extract multiple elements at a time by specifying mulitple indices inside the square brackets as a vector. Notice how you can use : to make a vector of all integers between two numbers.

ten_letters[c(1,7)]

ten_letters[3:6]

ten_letters[10:1]

ten_letters[c(2, 8:10)]

Challenge

Select every other element in ten_letters.

Solution

ten_letters[2*1:5]

What if we were dealing with a much longer vector? We can use the seq() function to quickly create sequences of numbers.

seq(1, 10, by = 2)
seq(20, 4, by = -3)

Challenge

Fill in the blank to select the even elements of ten_letters using the seq() function.

ten_letters[____________]

Solution

ten_letters[seq(2, 10, by = 2)]

Data frames

The metadata data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers (i.e. [row, column]).

metadata[1, 2]   # 1st element in the 2nd column 
metadata[1, 6]   # 1st element in the 6th column
metadata[1:3, 7] # First three elements in the 7th column
metadata[3, ]    # 3rd element for all columns
metadata[, 7]    # Entire 7th column

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. Sometimes the column number for a particular variable can change if your analysis adds or removes columns. The best practice when working with columns in a data frame is to refer to them by name. This also makes your code easier to read and your intentions clearer.

There are two ways to select a column by name from a data frame:

  • Using dataframe[ , "column_name"]
  • Using dataframe$column_name

You can do operations on a particular column, by selecting it using the $ sign. In this case, the entire column is a vector. You can use names(metadata) or colnames(metadata) to remind yourself of the column names. For instance, to extract all the strain information from our datasets:

# Select the strain column from metadata
metadata[ , "strain"]

# Alternatively...
metadata$strain

The first method allows you to select multiple columns at once. Suppose we wanted strain and clade information:

metadata[, c("strain", "clade")]

You can even access columns by column name and select specific rows of interest. For example, if we wanted the strain and clade of just rows 4 through 7, we could do:

metadata[4:7, c("strain", "clade")]

Data Carpentry, 2017. License. Contributing.
Questions? Feedback? Please file an issue on GitHub.
On Twitter: @datacarpentry