Data Types (atomic classes)

Data is stored in memory in a variety of ways. Each type of data has different properties and different memory allocations. For the vast majority of users who are not optimizing algorithms this is not a major concern. The main concern with data types is knowing how they work and how R handles them. Understanding of data types and structures will save a lot of headaches down the road, as R does not have any intuitive data type handling, only explicit.

At the smallest level there are three data types (technically there are five, but if you know you need one of the other two then you don’t need me to explain them to you).

Numeric Data

Numeric data is fairly easy to understand, it’s composed of numbers. In R, all numeric data is the same. In some languages you will see terms such as int, float, and double. If those terms mean something to you, R natively only works with int and double data types. If those don’t mean anything to you, do not worry, it’s generally unimportant to front-end users.

Numeric data can have mathmatical operations performed on them.

10-5

## [1] 5

Character Data

Character data in R is comprised of what is refered to as characters and strings. Strings are just a series of characters, and characters are any symbol (most commonlty letters and numbers). Character data cannot have mathmatical operations performed on it. This can be confusing because R will often convert data types without making it obvious. We saw this in exercise 1 when converting scientific notation unsing format().

(1/10000)*2

## [1] 2e-04

format(1/10000, scientific=FALSE)*2

## Error in format(1/10000, scientific = FALSE) * 2: non-numeric argument to binary operator

Even though the character data is composed entirely of what we know as numbers, R does not see it that way. The reason is a bit compicated, but the good news is is that we can tell R what the data type should be.

as.numeric(format(1/10000, scientific=FALSE))*2

## [1] 2e-04

R is smart enough to know that these are numbers and can be converted and treated as numeric data, if we explicitly tell it to.

Try it with letters.

as.numeric("a")

Logical Data

A third type. We saw reference to logical types when we looked at operators. Logical data is a single bit of information, TRUE or FALSE. It is binary, and you may have noticed that it is stored and treated as such when we were practicing with the operators in exercise 1. Logical data is a comparison of two objects in R.

Vector

Vectors are the most common data structure and the most important for using R. A vector is a series of data contained in a single object. Here we will use the concatenate function c() to create a vector.

our_vector <- c("a", 0, TRUE)

Vectors are atomic, they can only hold a series of data that is all of the same type. Notice though that when I created our_vector I created it with one of each atomic data type. So what happened? Call our_vector by typing it into the console.

our_vector

It converted all of the elements of our_vector to characters. Well, that was nice of it, but what if I wanted it to be numeric, we can tell R to make something a different data type, right?

as.logical(our_vector)

It does it.. but we can’t fool it. It sets all non-logical data to NA.

If we aren’t sure what atomic type our vector is there are several ways to ask R.

typeof(our_vector)
class(our_vector)

And the most common and most informative is to ask the structure str.

str(our_vector)

Let’s make a vector of each type

our_characters <- letters[1:3]
our_numbers <- 1:3
our_logic <- c(TRUE,FALSE,TRUE)

Accessing Elements of a Vector

Each element of a vector is still an individual object, and is assigned what is called and index value so that R knows where it is and you can explicitly call it. Say that we had a numeric vector

our_vector <- c(1, 0, 3)

The index values in R are assigned in increasing numerical value starting at position 1 based on order of appearance. We can call them by typing the name of the vector and then the index value contained in [].

our_vector[1]
our_vector[2]
our_vector[3]

For vectors, if we call an index that is not in the object R will assign it as NA.

our_vector[4]

This is nice because it does not error and also because it allows us to add to our vector by index. So let’s assign a value to the 4th position in our_vector.

our_vector[4] <- 5

Notice that our_vector now has 5 at the 4th index. You can always call length() on a vector to find the total number of elements it contains.

What if we skip an index?

our_vector[6] <- 3

R does what we ask it to, and increases the size of the vector to index up to that position with the unassigned positions set to NA. We can change the value of positions too using the assignment operator.

our_vector[1] <- 138

if we want to know something about the relation of multiple indices we can call them all individually.

our_vector[1]/our_vector[3]
our_vector[1]/our_vector[3] >= our_vector[4]

Or we can call a range of indices.

sum(our_vector[1:3])

Or a vector of vectors!

sum(our_vector[c(1,3,4)])

When operations are called on a whole vector it will perform that operation on each element individually.

our_vector/3

This is super powerful! But sometimes it doesn’t workout.

sum(our_vector)

It worked.. but that’s not really what we wanted.

Fortunately there are lots of ways to handle almost every issue in R. The easiest here would be the system built into the function

sum(our_vector, na.rm=TRUE)

But in this section of the tutorial, that’s cheating!

Excluding Elements of a Vector

The same way we call specific elements, we can ignore ones too by specifying minus that elements -.

our_vector[-5]
sum(our_vector[-5])

But what if we have a really large dataset stored as a vector and there are lots of NA values? There’s lots of ways to handle that too.

First, find out which values are NA using logical operators. NA is a special case, since it is an absences of data it has no value or type, so using == does not work as it would in every other case, so R has a function specifically for it is.na()

is.na(our_vector)

or which are ! not NA.

!is.na(our_vector)

If we just wanted to know which indices are equal to 3 we could use ==.

our_vector == 3

Now, we can access those specific elements of the vector using the logical values.

sum(our_vector[!is.na(our_vector)])

This can look intimidating if you are not comfortable in R, so as we talked about in exercise 1 when talking about nested functions, assigning variables can make things cleaner.

not_na <- !is.na(our_vector)
our_vector_no_na <- our_vector[not_na]
sum(our_vector_no_na)

What if we really wanted to have multiple data types in the same object, this is where a list is handy.

List

A list is exactly that. It’s similar to a vector, it contains other objects, but in their own partition so that they can each have their own data type.

our_list <- list("a", 0, TRUE)

If you call our_list you will see that it looks different than the vector before, but also each element is not forced to be a character.

A list can be made up of any object we want.

our_list <- list(c("a","b"), our_vector_no_na, our_list)

For this reason lists are both great and dangerous. Indexing in a list gives a lot of new users a lot of trouble, but if you look at how it is printed to the console it shows you how to do it. I will try to break it down to be as simple as possible.

Accessing Elements of a List

When we call our list

our_list

## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] 138   0   3   5   3
## 
## [[3]]
## [[3]][[1]]
## [1] "a"
## 
## [[3]][[2]]
## [1] 0
## 
## [[3]][[3]]
## [1] TRUE

We see numbers contained in both [[i]] and [i]. Lists are indexed with the double brackets ‘[[i]]’. If we call a list with single bracket indices

our_list[1]

It sort of looks like what we expected to have happen, and this is why it leads to a lot of confusion if you don’t know what’s happening. You see the [[1]] is still there. We know that this should be a vector, so if we check its type

str(our_list[1])

It says "list". We called the first element of the list, but not the object inside it. The index of the object in the list is inside the [[i]].

our_list[[1]]
str(our_list[[1]])

And now we have that character vector from the list. From there we can call elements of that vector by index also, the indices read from top-down, left-right, so the more indices to the right, the more specific we are being.

our_list[[2]][1]

So when we have a list inside a list, like in our_list[[3]] we have to call it by index from the first list and then from the second list.

our_list[[3]][[3]][1]

So, by now you may be wondering “well, why do I always have to specify [1] to call the vector if it’s always [1]?” And the reason for that is, becasue there are aso data structures that hold multiple vectors, namely matrices.

Matrix

Matrices can be thought of as concatenations of vertical vectors. Each vector must have the same atomic type and all be of the same length. i.e.

our_matrix <- matrix(1:9, nrow=3)

If you are creating a matrix you can specify nrow= and ncol=. Becasue we know that the length of the vertical vectors must be the same, we know that we only need to specify one or the other. Specifying both, if you miscalculate the number of each needed, will force the given data to fit it. R will give a warning if it is not correct, and then repeat the data given until it fits the explicit size.

matrix(1:9, nrow=3, ncol=4)

By default the matrix is filled in by column, it inserts values until column 1 is filled and then it moves and does the same for column 2, etc. If for some reason we want to do it by row, there is an option for that!

matrix(1:9, nrow=3, byrow=TRUE)

Notice the difference?

But also, we said that each vertical vector must be the same atomic type, so let’s demonstrate that. Check the str(our_matrix) before and after entering this into the console.

our_matrix[,1] <- our_characters

Notice, first off that the matrix has only one type, and also that it changed it to be character.

Subsetting Matrices

In the last snippet you may have noticed how I called the first column of the matrix [,1]. It’s a little different than we have seen before, and that is because its a 2-dimensional object. When sybsetting we give it [Y,X] coordinates, that is [row,column]. Indexing works the same for all R data structures, so we can give it a single number or Booleans, or a vector of numbers or Booleans (remember Booleans are binary, logical, TRUE/FALSE).

So let’s say we want the second elements of the first column.

our_matrix[2,1]

Or all of the second columm.

our_matrix[,2]

Or the third row.

our_matrix[3,]

Or a logical operation

our_matrix <- matrix(1:9, nrow=3)
our_matrix[our_matrix >= 5]

You’ll notice the result of subsetting a matrix is always a vector. You may have also notice that I didn’t include a [,] when I indexed the last example. As we said, matrices are just a concatenation of equal length vectors, and so can be thought of as a single continuous vector. Therefore each element has both a matrix coordinate [Y,X] and also a vector index [i]. The vector index goes in increasing order in the same way we said the matrix is filled by default, by column. So the last elements of our first column is our_matrix[3] and the first of the second column is our_matrix[4]. Try it.

Array

Arrays are similar to matrices but can have more than two dimensions. Let’s look.

our_array <- array(1:27, dim=c(3,3,3))

Arrays are not very common, and most tutorials I’ve seen don’t even bother discussing them. But since they exist, it is good to be aware of them. Using what we already learned about how indexing works and how to use it with other data structures, we can figure out how arrays work. Try some different things.

our_array[,,3]
our_array[,2,1]
our_array[,,3][2,1]

If it doesn’t make sense, I think the clearest way to look at it and figure it out is like this:

our_array[,,2]
our_array[,2,2]
our_array[1,2,2]

You can see that

our_array[1,2,2]

Is the same as saying

our_array[,,2][1,2]

Data Frame

A data frame is what you would get if a matrix and a list started filing joint-taxes. If you don’t know what that means, ask your parents, it’s outside the scope of this tutorial.

Data frames are one of the most common structures users implement their data in. They are 2-dimensional objects that can hold equal length vectors of different atomic types.

our_data_frame <- data.frame(our_characters, our_numbers, our_logic)

It looks like a matrix, right? But look..

typeof(our_data_frame)

R sees it as a list.

If we look at its structure

str(our_data_frame)

## 'data.frame':    3 obs. of  3 variables:
##  $ our_characters: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ our_numbers   : int  1 2 3
##  $ our_logic     : logi  TRUE FALSE TRUE

It knows it’s a data.frame, it has 3 observations (or 3 colums) with 3 variables in each. It then lists each of the vectors that we used to create the data.frame and what their structure is. You may notice that instead of chr our characters are listed as Factor.

Factors

Factors look like characters but are stored as integers. Notice the absences of "" when we call the column of our_characters.

our_data_frame[,1]

But if we try a logical comparison on one of them

our_data_frame[3,1] == "c"

## [1] TRUE

But look at this.

as.numeric("c")

## Warning: NAs introduced by coercion

## [1] NA

as.numeric(our_data_frame[3,1])

## [1] 3

How R treats factors is important because it will treat them as either nominal or ordinal when executing statistical operations or when graphing. We can see how it treats the factors ordinally using

ordered(our_data_frame$our_characters)

Subsetting Data Frames

Data frames can be subset exactly like matrices, in addition to other, more practical, ways. If you recall when we used str() to look at the data frame It listed the variable names of the object we put into out data frame. Those variable names became the column names colanames() of the data frame. Column names are denoted with $ as shown with str().

our_data_frame$our_numbers

But this only allows a single coulmn to be called by name. If we put the name inside of the index brackets we can call multiple columns by name.

our_data_frame[,c('our_logic','our_numbers')]

Note that data frames also have rownames() that are by default just the index number.

our_data_frame[c('3','1'),]

Data Table

You may see references to data tables, and in some cases tibbles. Data tables offer all the same features as data frames but with additional functionality. As a cost for gaining the functionality, the syntax of data tables is more obtuse. To try to mary the function of data tables with easy syntax Haley Wickham introduced tibbles in his tidyverse package. For general use though, data frames offer sufficient functionality to accomplish the data-analytic needs of the average scientist.

Continue: Exercise 3: Writing Scripts

Schuyler Smith
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University. Ames, IA.

Exercise 2: Data-Structures