Module 4: Data Wrangling Part 1

How to Clean and Manipulate your Data in R

Hawken Hass

University of North Carolina Wilmington

Cleaning Your Data

Cleaning your data is an essential part of data analysis. Before you can run any analysis your data must be structured in a certain way. This will ensure that your analyses will run smoothly in any program. First, let’s take a look at some messy data. This data file summarizes average monthly rainfall between the years of 2001 and 2019 in two different lakes: Lake Victoria and Lake Simiyu.

library(readxl)
read_excel("messy-data.xlsx")

New names:
* `` -> ...2
* `` -> ...3

Seasonal rainfall in Lake Victoria and Simiyu	...2	...3
NA	NA	NA
Month, period	Lake Victoria	Simiyu
Jan,2001-2019	3.176mm	2.908473684
Feb,2001-2019	3.477mm	1.8mm
Mar,2001-2019	4.687052632	2.981052632
Apr,2001-2019	7.004526316	4.753578947
May,2001-2019	9.362789474	4.077473684
Jun,2001-2019	3.430210526	1.046947368
Jul,2001-2019	1.764421053	0.1952105263
Aug,2001-2019	2.812631579	0.3336315789
Sep,2001-2019	3.978894737	1.205842105
Oct,2001-2019	5.318421053	2.454736842
Nov,2001-2019	5.118473684	3.091421053
Dec,2001-2019	4.168105263	3.890052632

You can see there are a few things wrong with this data set.

The column headers are not accurate
There is no data in row one
The headers are in row two
The month and period columns are combined into one column
The total rainfall for each lake is rounded to different decimal points in some rows
Some rows have characters (mm) while others do not
The data is in wide format

Clearly this data set needs some serious tidying. But first, what does a tidy data set look like?

Tidy Data

A tidy data set has two key features:

Each variable is saved in its own column
Each variable is saved in its own row

This structure is specifically called long format, in which each observation has its own row. For example, in our data set there are 12 observations of rainfall for each lake. Therefore, we need 24 rows in our data set. Additionally, each variable needs to be saved in its own column. Thus, we need to create a column that denotes which group (Lake Victoria or Lake Simiyu) each observation belongs to. The data frame below shows what this data will look like once it is cleaned.

read.table("clean_data.txt")

Month	Period	Lake	Rainfall_mm
Jan	2001-2019	Victoria	3.18
Feb	2001-2019	Victoria	3.48
Mar	2001-2019	Victoria	4.69
Apr	2001-2019	Victoria	7.00
May	2001-2019	Victoria	9.36
Jun	2001-2019	Victoria	3.43
Jul	2001-2019	Victoria	1.76
Aug	2001-2019	Victoria	2.81
Sep	2001-2019	Victoria	3.98
Oct	2001-2019	Victoria	5.32
Nov	2001-2019	Victoria	5.12
Dec	2001-2019	Victoria	4.17
Jan	2001-2019	Simiyu	2.91
Feb	2001-2019	Simiyu	1.80
Mar	2001-2019	Simiyu	2.98
Apr	2001-2019	Simiyu	4.75
May	2001-2019	Simiyu	4.08
Jun	2001-2019	Simiyu	1.05
Jul	2001-2019	Simiyu	0.20
Aug	2001-2019	Simiyu	0.33
Sep	2001-2019	Simiyu	1.21
Oct	2001-2019	Simiyu	2.45
Nov	2001-2019	Simiyu	3.09
Dec	2001-2019	Simiyu	3.89

This looks much better! You’ll notice all rainfall observations are in one column and each observation has its own row. Additionally, we have a column denoting group membership. This data is ready for analysis. So how did I get here? We are going to use a package called “tidyr”.

Cleaning using the tidyr package

First, I am going to read in my data and activate the tidyr package.

library(tidyr)
messy_data<-read_excel("messy-data.xlsx")
print(messy_data)

New names:
* `` -> ...2
* `` -> ...3


# A tibble: 14 x 3
   `Seasonal rainfall in Lake Victoria and Simiyu` ...2          ...3        
   <chr>                                           <chr>         <chr>       
<NA>                                            <NA>          <NA>        
Month, period                                   Lake Victoria Simiyu      
Jan,2001-2019                                   3.176mm       2.908473684 
Feb,2001-2019                                   3.477mm       1.8mm       
Mar,2001-2019                                   4.687052632   2.981052632 
Apr,2001-2019                                   7.004526316   4.753578947 
May,2001-2019                                   9.362789474   4.077473684 
Jun,2001-2019                                   3.430210526   1.046947368 
Jul,2001-2019                                   1.764421053   0.1952105263
Aug,2001-2019                                   2.812631579   0.3336315789
Sep,2001-2019                                   3.978894737   1.205842105 
Oct,2001-2019                                   5.318421053   2.454736842 
Nov,2001-2019                                   5.118473684   3.091421053 
Dec,2001-2019                                   4.168105263   3.890052632 

The first step I am going to take is to separate the Month and Period column. If you’ll notice, the month and period are both in one column, and separated by a comma. It is important that in your data set all variables have their own column. Thus, we are going to use the separate function to separate this column into two.

messy_data<-separate(messy_data, "Seasonal rainfall in Lake Victoria and Simiyu", c("Month", "Period"), ",")
print(messy_data)

# A tibble: 14 x 4
   Month Period    ...2          ...3        
   <chr> <chr>     <chr>         <chr>       
<NA>  <NA>      <NA>          <NA>        
Month " period" Lake Victoria Simiyu      
Jan   2001-2019 3.176mm       2.908473684 
Feb   2001-2019 3.477mm       1.8mm       
Mar   2001-2019 4.687052632   2.981052632 
Apr   2001-2019 7.004526316   4.753578947 
May   2001-2019 9.362789474   4.077473684 
Jun   2001-2019 3.430210526   1.046947368 
Jul   2001-2019 1.764421053   0.1952105263
Aug   2001-2019 2.812631579   0.3336315789
Sep   2001-2019 3.978894737   1.205842105 
Oct   2001-2019 5.318421053   2.454736842 
Nov   2001-2019 5.118473684   3.091421053 
Dec   2001-2019 4.168105263   3.890052632 

The separate function uses the following arguments:

The first argument is your data file
The second argument is the new names of the new columns you want to create in a vector format
The third argument is how the two values are separate in the current column, in this case it is a comma

Next, we are going to change the names of the columns to refelct more accurate names. This function will get rid of the “..2” and “..3” in columns 3 and 4. First, we are going to create a vector of all of our column names. Then assign these column names to our data file using the names function.

varname<-c("Month", "Period", "Victoria", "Simiyu")
names(messy_data)<-varname
print(messy_data)

# A tibble: 14 x 4
   Month Period    Victoria      Simiyu      
   <chr> <chr>     <chr>         <chr>       
<NA>  <NA>      <NA>          <NA>        
Month " period" Lake Victoria Simiyu      
Jan   2001-2019 3.176mm       2.908473684 
Feb   2001-2019 3.477mm       1.8mm       
Mar   2001-2019 4.687052632   2.981052632 
Apr   2001-2019 7.004526316   4.753578947 
May   2001-2019 9.362789474   4.077473684 
Jun   2001-2019 3.430210526   1.046947368 
Jul   2001-2019 1.764421053   0.1952105263
Aug   2001-2019 2.812631579   0.3336315789
Sep   2001-2019 3.978894737   1.205842105 
Oct   2001-2019 5.318421053   2.454736842 
Nov   2001-2019 5.118473684   3.091421053 
Dec   2001-2019 4.168105263   3.890052632 

The next step is to delete any missing data.

messy_data<-na.omit(messy_data)
print(messy_data)

# A tibble: 13 x 4
   Month Period    Victoria      Simiyu      
   <chr> <chr>     <chr>         <chr>       
Month " period" Lake Victoria Simiyu      
Jan   2001-2019 3.176mm       2.908473684 
Feb   2001-2019 3.477mm       1.8mm       
Mar   2001-2019 4.687052632   2.981052632 
Apr   2001-2019 7.004526316   4.753578947 
May   2001-2019 9.362789474   4.077473684 
Jun   2001-2019 3.430210526   1.046947368 
Jul   2001-2019 1.764421053   0.1952105263
Aug   2001-2019 2.812631579   0.3336315789
Sep   2001-2019 3.978894737   1.205842105 
Oct   2001-2019 5.318421053   2.454736842 
Nov   2001-2019 5.118473684   3.091421053 
Dec   2001-2019 4.168105263   3.890052632 

We also want to delete Row 1 because this row was the names of the columns in the excel file.

messy_data<-messy_data[-c(1),]
print(messy_data)

# A tibble: 12 x 4
   Month Period    Victoria    Simiyu      
   <chr> <chr>     <chr>       <chr>       
Jan   2001-2019 3.176mm     2.908473684 
Feb   2001-2019 3.477mm     1.8mm       
Mar   2001-2019 4.687052632 2.981052632 
Apr   2001-2019 7.004526316 4.753578947 
May   2001-2019 9.362789474 4.077473684 
Jun   2001-2019 3.430210526 1.046947368 
Jul   2001-2019 1.764421053 0.1952105263
Aug   2001-2019 2.812631579 0.3336315789
Sep   2001-2019 3.978894737 1.205842105 
Oct   2001-2019 5.318421053 2.454736842 
Nov   2001-2019 5.118473684 3.091421053 
Dec   2001-2019 4.168105263 3.890052632 

The ‘-c’ denotes that you want to delete a column or a row. If you wanted to delete mulitple rows you could type:

[-c(1,2,3,4),]

The argument after the comma represents columns. If you wanted to delete column 1 you could type:

[, -c(1)]

Next, we want to convert our data to the long format. In this case we want a column that denotes which lake the data point is from, and a column denoting how much rainfall. We want to make sure each group is paired with the correct data point. For this, we are going to be using the gather function.

messy_data<-gather(messy_data, key="Lake", value="Rainfall_mm", 3:4)
print(messy_data)

# A tibble: 24 x 4
   Month Period    Lake     Rainfall_mm
   <chr> <chr>     <chr>    <chr>      
Jan   2001-2019 Victoria 3.176mm    
Feb   2001-2019 Victoria 3.477mm    
Mar   2001-2019 Victoria 4.687052632
Apr   2001-2019 Victoria 7.004526316
May   2001-2019 Victoria 9.362789474
Jun   2001-2019 Victoria 3.430210526
Jul   2001-2019 Victoria 1.764421053
Aug   2001-2019 Victoria 2.812631579
Sep   2001-2019 Victoria 3.978894737
Oct   2001-2019 Victoria 5.318421053
# ... with 14 more rows

The gather function uses the following arguments:

The first argument is the data frame name
The second argument is the new name of the column you want to create for the grouping variable. I chose “Lake” because I want to take the names of each lake and repeat it in the column twelve times; the number of observations of rainfall for that lake.
The third argument is the name of the column for the value you want to pair with the grouping variable. I chose “Rainfall_mm” because that is the observation or data point I want to pair with each grouping variable.
The final argument is the range of columns you want to turn into key-value pairs

Let’s now remove some unnecessary characters.

messy_data$Rainfall_mm<-gsub('mm', '',messy_data$Rainfall_mm)
messy_data$Rainfall_mm<-as.numeric(messy_data$Rainfall_mm)
print(messy_data)

# A tibble: 24 x 4
   Month Period    Lake     Rainfall_mm
   <chr> <chr>     <chr>          <dbl>
Jan   2001-2019 Victoria        3.18
Feb   2001-2019 Victoria        3.48
Mar   2001-2019 Victoria        4.69
Apr   2001-2019 Victoria        7.00
May   2001-2019 Victoria        9.36
Jun   2001-2019 Victoria        3.43
Jul   2001-2019 Victoria        1.76
Aug   2001-2019 Victoria        2.81
Sep   2001-2019 Victoria        3.98
Oct   2001-2019 Victoria        5.32
# ... with 14 more rows

The gsub function uses the following arguments:

The first argument is the string of characters you want to remove from each data point. I want to remove the “mm: off the end data points that have it
The second argument denotes what you want to replace the “mm” character string with. In this case I put a space because I don’t want anything to replace it
The final argument is the column you want to apply this function to

When using the gsub function, the column is automatically converted to a character variable. Use the as.numeric function to convert the column back to a numeric variable.

Finally, let’s ensure that all observations in the rainfall column are rounded to the same decimal point.

messy_data[,'Rainfall_mm']=round(messy_data[,'Rainfall_mm'],2)
print(messy_data)

# A tibble: 24 x 4
   Month Period    Lake     Rainfall_mm
   <chr> <chr>     <chr>          <dbl>
Jan   2001-2019 Victoria        3.18
Feb   2001-2019 Victoria        3.48
Mar   2001-2019 Victoria        4.69
Apr   2001-2019 Victoria        7   
May   2001-2019 Victoria        9.36
Jun   2001-2019 Victoria        3.43
Jul   2001-2019 Victoria        1.76
Aug   2001-2019 Victoria        2.81
Sep   2001-2019 Victoria        3.98
Oct   2001-2019 Victoria        5.32
# ... with 14 more rows

Our data is clean now! Let’s rename it “clean data”

clean_data<-messy_data

To save it as new data file, use the write.table function. This will save the new version of your data file to your computer.

write.table(clean_data, file = "clean_data.txt")
read.table("clean_data.txt")

Month	Period	Lake	Rainfall_mm
Jan	2001-2019	Victoria	3.18
Feb	2001-2019	Victoria	3.48
Mar	2001-2019	Victoria	4.69
Apr	2001-2019	Victoria	7.00
May	2001-2019	Victoria	9.36
Jun	2001-2019	Victoria	3.43
Jul	2001-2019	Victoria	1.76
Aug	2001-2019	Victoria	2.81
Sep	2001-2019	Victoria	3.98
Oct	2001-2019	Victoria	5.32
Nov	2001-2019	Victoria	5.12
Dec	2001-2019	Victoria	4.17
Jan	2001-2019	Simiyu	2.91
Feb	2001-2019	Simiyu	1.80
Mar	2001-2019	Simiyu	2.98
Apr	2001-2019	Simiyu	4.75
May	2001-2019	Simiyu	4.08
Jun	2001-2019	Simiyu	1.05
Jul	2001-2019	Simiyu	0.20
Aug	2001-2019	Simiyu	0.33
Sep	2001-2019	Simiyu	1.21
Oct	2001-2019	Simiyu	2.45
Nov	2001-2019	Simiyu	3.09
Dec	2001-2019	Simiyu	3.89

Other Tidy R Functions

The above functions only scratch the surface of the tidyr package. The reshape functions mentioned above have their reverse counterparts. For example, if you want to spread rows across columns, you will want to use the spread function. For example, if you want two columns for rainfall in each lake, you would use the following code.

wide_data<-spread(clean_data, "Lake", "Rainfall_mm")
print(wide_data)

# A tibble: 12 x 4
   Month Period    Simiyu Victoria
   <chr> <chr>      <dbl>    <dbl>
Apr   2001-2019   4.75     7   
Aug   2001-2019   0.33     2.81
Dec   2001-2019   3.89     4.17
Feb   2001-2019   1.8      3.48
Jan   2001-2019   2.91     3.18
Jul   2001-2019   0.2      1.76
Jun   2001-2019   1.05     3.43
Mar   2001-2019   2.98     4.69
May   2001-2019   4.08     9.36
Nov   2001-2019   3.09     5.12
Oct   2001-2019   2.45     5.32
Sep   2001-2019   1.21     3.98

Like before, we used the separate function to separate the Month and Period columns, but you can reunite these columns using the unite function.

wide_data<-unite(wide_data,col="Month,Period",c("Month", "Period"), sep=",")
print(wide_data)

# A tibble: 12 x 3
   `Month,Period` Simiyu Victoria
   <chr>           <dbl>    <dbl>
Apr,2001-2019    4.75     7   
Aug,2001-2019    0.33     2.81
Dec,2001-2019    3.89     4.17
Feb,2001-2019    1.8      3.48
Jan,2001-2019    2.91     3.18
Jul,2001-2019    0.2      1.76
Jun,2001-2019    1.05     3.43
Mar,2001-2019    2.98     4.69
May,2001-2019    4.08     9.36
Nov,2001-2019    3.09     5.12
Oct,2001-2019    2.45     5.32
Sep,2001-2019    1.21     3.98

In Part 2 we will discuss how to manipulate your data using the dplyr package!