Module 4: Data Wrangling Part 2
How to Clean and Manipulate your Data in R
Hawken Hass
University of North Carolina Wilmington
Manipulating your Data
Once your data is all clean, you can manipulate your data. Data manipulation can include grouping data, combining data sets, summarizing data, and making new variables. For this, we are going to use the dplyr package.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
For this module, let’s use our data we just cleaned in the last module. I refactored the data to make sure that the levels are in the correct order.
clean_data<-read.table("clean_data.txt")
clean_data$Month<-factor(clean_data$Month, levels=unique(clean_data$Month))
clean_data
Month | Period | Lake | Rainfall_mm |
---|---|---|---|
Jan | 2001-2019 | Victoria | 3.18 |
Feb | 2001-2019 | Victoria | 3.48 |
Mar | 2001-2019 | Victoria | 4.69 |
Apr | 2001-2019 | Victoria | 7.00 |
May | 2001-2019 | Victoria | 9.36 |
Jun | 2001-2019 | Victoria | 3.43 |
Jul | 2001-2019 | Victoria | 1.76 |
Aug | 2001-2019 | Victoria | 2.81 |
Sep | 2001-2019 | Victoria | 3.98 |
Oct | 2001-2019 | Victoria | 5.32 |
Nov | 2001-2019 | Victoria | 5.12 |
Dec | 2001-2019 | Victoria | 4.17 |
Jan | 2001-2019 | Simiyu | 2.91 |
Feb | 2001-2019 | Simiyu | 1.80 |
Mar | 2001-2019 | Simiyu | 2.98 |
Apr | 2001-2019 | Simiyu | 4.75 |
May | 2001-2019 | Simiyu | 4.08 |
Jun | 2001-2019 | Simiyu | 1.05 |
Jul | 2001-2019 | Simiyu | 0.20 |
Aug | 2001-2019 | Simiyu | 0.33 |
Sep | 2001-2019 | Simiyu | 1.21 |
Oct | 2001-2019 | Simiyu | 2.45 |
Nov | 2001-2019 | Simiyu | 3.09 |
Dec | 2001-2019 | Simiyu | 3.89 |
Filtering Data by Rows
The filter function allows you to subset rows from your data frame. For example, let’s say we only want to look at rainfall in the month of January. We can do that using the filter function.
filter(clean_data, Month=="Jan")
Month | Period | Lake | Rainfall_mm |
---|---|---|---|
Jan | 2001-2019 | Victoria | 3.18 |
Jan | 2001-2019 | Simiyu | 2.91 |
Since there are only two observations for each month, this is a pretty small data frame. However, this function is quite useful for large datasets. In the second argument, you denote an operate that tells the function how you would like to filter the column. Since we wanted only values for the month of January, we used the “==” operator. There are many different operators you can use to filter your data. For example, the code below would filter out any rainfall measurement that is less than 1.
filter(clean_data, Rainfall_mm>1)
Month | Period | Lake | Rainfall_mm |
---|---|---|---|
Jan | 2001-2019 | Victoria | 3.18 |
Feb | 2001-2019 | Victoria | 3.48 |
Mar | 2001-2019 | Victoria | 4.69 |
Apr | 2001-2019 | Victoria | 7.00 |
May | 2001-2019 | Victoria | 9.36 |
Jun | 2001-2019 | Victoria | 3.43 |
Jul | 2001-2019 | Victoria | 1.76 |
Aug | 2001-2019 | Victoria | 2.81 |
Sep | 2001-2019 | Victoria | 3.98 |
Oct | 2001-2019 | Victoria | 5.32 |
Nov | 2001-2019 | Victoria | 5.12 |
Dec | 2001-2019 | Victoria | 4.17 |
Jan | 2001-2019 | Simiyu | 2.91 |
Feb | 2001-2019 | Simiyu | 1.80 |
Mar | 2001-2019 | Simiyu | 2.98 |
Apr | 2001-2019 | Simiyu | 4.75 |
May | 2001-2019 | Simiyu | 4.08 |
Jun | 2001-2019 | Simiyu | 1.05 |
Sep | 2001-2019 | Simiyu | 1.21 |
Oct | 2001-2019 | Simiyu | 2.45 |
Nov | 2001-2019 | Simiyu | 3.09 |
Dec | 2001-2019 | Simiyu | 3.89 |
Here are some more examples of operators.
== (Equal to)
!= (Not equal to)
< (Less than)
<= (Less than or equal to)
> (Greater than)
>= (Greater than or equal to)
Filtering by Columns
The select function is used to filter out specific columns. The select function also uses helper functions to filter out columns based on certain properties. Using double quotes, you can filter out columns of a certain name.
select(clean_data, contains("Lake"))
Lake |
---|
Victoria |
Victoria |
Victoria |
Victoria |
Victoria |
Victoria |
Victoria |
Victoria |
Victoria |
Victoria |
Victoria |
Victoria |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
Simiyu |
The “-“ operator can be used to denote which columns you want to exclude.
select(clean_data, -Period)
Month | Lake | Rainfall_mm |
---|---|---|
Jan | Victoria | 3.18 |
Feb | Victoria | 3.48 |
Mar | Victoria | 4.69 |
Apr | Victoria | 7.00 |
May | Victoria | 9.36 |
Jun | Victoria | 3.43 |
Jul | Victoria | 1.76 |
Aug | Victoria | 2.81 |
Sep | Victoria | 3.98 |
Oct | Victoria | 5.32 |
Nov | Victoria | 5.12 |
Dec | Victoria | 4.17 |
Jan | Simiyu | 2.91 |
Feb | Simiyu | 1.80 |
Mar | Simiyu | 2.98 |
Apr | Simiyu | 4.75 |
May | Simiyu | 4.08 |
Jun | Simiyu | 1.05 |
Jul | Simiyu | 0.20 |
Aug | Simiyu | 0.33 |
Sep | Simiyu | 1.21 |
Oct | Simiyu | 2.45 |
Nov | Simiyu | 3.09 |
Dec | Simiyu | 3.89 |
Piping
It may be unproductive to create one line of code for each data cleaning and manipulating function. Sometimes you will see programmers use a pipe. A pipe looks like this: %>%. Piping can make your code easier to read and efficient. For example, everything I just did above can be done in one line of code using a couple of pipes.
new_data<-clean_data%>%select(-Period)%>%filter(Rainfall_mm>1)%>%filter(Lake=="Victoria")
print(new_data)
Month Lake Rainfall_mm
1 Jan Victoria 3.18
2 Feb Victoria 3.48
3 Mar Victoria 4.69
4 Apr Victoria 7.00
5 May Victoria 9.36
6 Jun Victoria 3.43
7 Jul Victoria 1.76
8 Aug Victoria 2.81
9 Sep Victoria 3.98
10 Oct Victoria 5.32
11 Nov Victoria 5.12
12 Dec Victoria 4.17
You can see how piping made our code more efficient. In one line of code we were able to create a new data set that removed the period column, and outputed only values for Lake Victoria that were greater than 1 mm. This is very useful for large datasets.
Summarize Data
The dplyr function also allows you to summarize your data using different summary functions.
summarise(clean_data, avg=mean(Rainfall_mm))
avg |
---|
3.46 |
You can add more than one summary function in each argument.
summarise(clean_data, avg=mean(Rainfall_mm), n=n(), sd=sd(Rainfall_mm), var=var(Rainfall_mm),median=median(Rainfall_mm), min=min(Rainfall_mm), max=max(Rainfall_mm))
avg | n | sd | var | median | min | max |
---|---|---|---|---|---|---|
3.46 | 24 | 2.055855 | 4.226539 | 3.305 | 0.2 | 9.36 |
It may be more useful to look at the average rainfall for both groups. We can find the means of separate groups by using the group_by function and piping it to the summarise function. The code would look like this:
clean_data%>%group_by(Lake)%>%summarise(avg=mean(Rainfall_mm), sd=sd(Rainfall_mm), min=min(Rainfall_mm), max=max(Rainfall_mm))
Lake | avg | sd | min | max |
---|---|---|---|---|
Simiyu | 2.395 | 1.488236 | 0.20 | 4.75 |
Victoria | 4.525 | 2.036613 | 1.76 | 9.36 |
Creating New Variables
Dplyr also allows you to create new variables within your data set using the mutate function. In this function you are computing a new variable from an existing variable. The basic code for the function is:
- mutate(df, name of new column= formula)
For example, let’s say we want to transform the Rainfall_mm to measure rainfall in inches.
clean_data_2<-mutate(clean_data, Rainfall_inches=Rainfall_mm*0.039)
clean_data_2
Month | Period | Lake | Rainfall_mm | Rainfall_inches |
---|---|---|---|---|
Jan | 2001-2019 | Victoria | 3.18 | 0.12402 |
Feb | 2001-2019 | Victoria | 3.48 | 0.13572 |
Mar | 2001-2019 | Victoria | 4.69 | 0.18291 |
Apr | 2001-2019 | Victoria | 7.00 | 0.27300 |
May | 2001-2019 | Victoria | 9.36 | 0.36504 |
Jun | 2001-2019 | Victoria | 3.43 | 0.13377 |
Jul | 2001-2019 | Victoria | 1.76 | 0.06864 |
Aug | 2001-2019 | Victoria | 2.81 | 0.10959 |
Sep | 2001-2019 | Victoria | 3.98 | 0.15522 |
Oct | 2001-2019 | Victoria | 5.32 | 0.20748 |
Nov | 2001-2019 | Victoria | 5.12 | 0.19968 |
Dec | 2001-2019 | Victoria | 4.17 | 0.16263 |
Jan | 2001-2019 | Simiyu | 2.91 | 0.11349 |
Feb | 2001-2019 | Simiyu | 1.80 | 0.07020 |
Mar | 2001-2019 | Simiyu | 2.98 | 0.11622 |
Apr | 2001-2019 | Simiyu | 4.75 | 0.18525 |
May | 2001-2019 | Simiyu | 4.08 | 0.15912 |
Jun | 2001-2019 | Simiyu | 1.05 | 0.04095 |
Jul | 2001-2019 | Simiyu | 0.20 | 0.00780 |
Aug | 2001-2019 | Simiyu | 0.33 | 0.01287 |
Sep | 2001-2019 | Simiyu | 1.21 | 0.04719 |
Oct | 2001-2019 | Simiyu | 2.45 | 0.09555 |
Nov | 2001-2019 | Simiyu | 3.09 | 0.12051 |
Dec | 2001-2019 | Simiyu | 3.89 | 0.15171 |
Combining Data Sets
Dplyr also has the ability to combine data sets into one. Let’s read in some more lake data and combine it with our current data set!
lake_data<-read.table("lake_data.txt")
print(lake_data)
Month Period SeaHawk Randall
1 Jan 2001-2019 3.125990 10.915324
2 Feb 2001-2019 12.441331 5.148206
3 Mar 2001-2019 3.230104 5.982370
4 Apr 2001-2019 10.907835 6.179094
5 May 2001-2019 9.674827 8.532295
6 Jun 2001-2019 8.033946 6.132740
7 Jul 2001-2019 12.318558 14.437468
8 Aug 2001-2019 6.557286 11.269315
9 Sep 2001-2019 6.057994 15.059383
10 Oct 2001-2019 12.040164 10.124065
11 Nov 2001-2019 5.369271 2.171475
12 Dec 2001-2019 8.132666 12.947003
Now we have two more lakes: Lake SeaHawk and Lake Randall along with their monthly rainfall in mm. Let’s clean and tidy our data to make it look like our clean_data data frame.
library(tidyr)
lake_data<-lake_data %>% gather(key="Lake",value="Rainfall_mm",3:4)%>%mutate(Rainfall_inches=Rainfall_mm*0.039)
lake_data
Month | Period | Lake | Rainfall_mm | Rainfall_inches |
---|---|---|---|---|
Jan | 2001-2019 | SeaHawk | 3.125990 | 0.12191361 |
Feb | 2001-2019 | SeaHawk | 12.441331 | 0.48521192 |
Mar | 2001-2019 | SeaHawk | 3.230104 | 0.12597405 |
Apr | 2001-2019 | SeaHawk | 10.907835 | 0.42540556 |
May | 2001-2019 | SeaHawk | 9.674827 | 0.37731825 |
Jun | 2001-2019 | SeaHawk | 8.033946 | 0.31332390 |
Jul | 2001-2019 | SeaHawk | 12.318558 | 0.48042378 |
Aug | 2001-2019 | SeaHawk | 6.557286 | 0.25573414 |
Sep | 2001-2019 | SeaHawk | 6.057994 | 0.23626178 |
Oct | 2001-2019 | SeaHawk | 12.040164 | 0.46956640 |
Nov | 2001-2019 | SeaHawk | 5.369271 | 0.20940156 |
Dec | 2001-2019 | SeaHawk | 8.132666 | 0.31717398 |
Jan | 2001-2019 | Randall | 10.915324 | 0.42569765 |
Feb | 2001-2019 | Randall | 5.148206 | 0.20078002 |
Mar | 2001-2019 | Randall | 5.982370 | 0.23331245 |
Apr | 2001-2019 | Randall | 6.179094 | 0.24098468 |
May | 2001-2019 | Randall | 8.532295 | 0.33275951 |
Jun | 2001-2019 | Randall | 6.132740 | 0.23917686 |
Jul | 2001-2019 | Randall | 14.437468 | 0.56306127 |
Aug | 2001-2019 | Randall | 11.269315 | 0.43950329 |
Sep | 2001-2019 | Randall | 15.059383 | 0.58731592 |
Oct | 2001-2019 | Randall | 10.124065 | 0.39483852 |
Nov | 2001-2019 | Randall | 2.171475 | 0.08468754 |
Dec | 2001-2019 | Randall | 12.947003 | 0.50493311 |
Let’s combine our two data sets!
All_lakes<-bind_rows(clean_data_2,lake_data)
All_lakes$Month<-as.factor(All_lakes$Month)
All_lakes$Lake<-as.factor(All_lakes$Lake)
print(All_lakes)
Warning message in bind_rows_(x, .id):
"binding factor and character vector, coercing into character vector"Warning message in bind_rows_(x, .id):
"binding character and factor vector, coercing into character vector"
Month Period Lake Rainfall_mm Rainfall_inches
1 Jan 2001-2019 Victoria 3.180000 0.12402000
2 Feb 2001-2019 Victoria 3.480000 0.13572000
3 Mar 2001-2019 Victoria 4.690000 0.18291000
4 Apr 2001-2019 Victoria 7.000000 0.27300000
5 May 2001-2019 Victoria 9.360000 0.36504000
6 Jun 2001-2019 Victoria 3.430000 0.13377000
7 Jul 2001-2019 Victoria 1.760000 0.06864000
8 Aug 2001-2019 Victoria 2.810000 0.10959000
9 Sep 2001-2019 Victoria 3.980000 0.15522000
10 Oct 2001-2019 Victoria 5.320000 0.20748000
11 Nov 2001-2019 Victoria 5.120000 0.19968000
12 Dec 2001-2019 Victoria 4.170000 0.16263000
13 Jan 2001-2019 Simiyu 2.910000 0.11349000
14 Feb 2001-2019 Simiyu 1.800000 0.07020000
15 Mar 2001-2019 Simiyu 2.980000 0.11622000
16 Apr 2001-2019 Simiyu 4.750000 0.18525000
17 May 2001-2019 Simiyu 4.080000 0.15912000
18 Jun 2001-2019 Simiyu 1.050000 0.04095000
19 Jul 2001-2019 Simiyu 0.200000 0.00780000
20 Aug 2001-2019 Simiyu 0.330000 0.01287000
21 Sep 2001-2019 Simiyu 1.210000 0.04719000
22 Oct 2001-2019 Simiyu 2.450000 0.09555000
23 Nov 2001-2019 Simiyu 3.090000 0.12051000
24 Dec 2001-2019 Simiyu 3.890000 0.15171000
25 Jan 2001-2019 SeaHawk 3.125990 0.12191361
26 Feb 2001-2019 SeaHawk 12.441331 0.48521192
27 Mar 2001-2019 SeaHawk 3.230104 0.12597405
28 Apr 2001-2019 SeaHawk 10.907835 0.42540556
29 May 2001-2019 SeaHawk 9.674827 0.37731825
30 Jun 2001-2019 SeaHawk 8.033946 0.31332390
31 Jul 2001-2019 SeaHawk 12.318558 0.48042378
32 Aug 2001-2019 SeaHawk 6.557286 0.25573414
33 Sep 2001-2019 SeaHawk 6.057994 0.23626178
34 Oct 2001-2019 SeaHawk 12.040164 0.46956640
35 Nov 2001-2019 SeaHawk 5.369271 0.20940156
36 Dec 2001-2019 SeaHawk 8.132666 0.31717398
37 Jan 2001-2019 Randall 10.915324 0.42569765
38 Feb 2001-2019 Randall 5.148206 0.20078002
39 Mar 2001-2019 Randall 5.982370 0.23331245
40 Apr 2001-2019 Randall 6.179094 0.24098468
41 May 2001-2019 Randall 8.532295 0.33275951
42 Jun 2001-2019 Randall 6.132740 0.23917686
43 Jul 2001-2019 Randall 14.437468 0.56306127
44 Aug 2001-2019 Randall 11.269315 0.43950329
45 Sep 2001-2019 Randall 15.059383 0.58731592
46 Oct 2001-2019 Randall 10.124065 0.39483852
47 Nov 2001-2019 Randall 2.171475 0.08468754
48 Dec 2001-2019 Randall 12.947003 0.50493311
The bind_rows function adds the rows from your second argument to the dataframe in your first argument. You can also use the bind_cols function to add the columns from one dataset to the other. For this example, let’s get our data back into wide format.
clean_data_wide<-clean_data%>%spread(key="Lake",value="Rainfall_mm")
clean_data_wide
Month | Period | Simiyu | Victoria |
---|---|---|---|
Jan | 2001-2019 | 2.91 | 3.18 |
Feb | 2001-2019 | 1.80 | 3.48 |
Mar | 2001-2019 | 2.98 | 4.69 |
Apr | 2001-2019 | 4.75 | 7.00 |
May | 2001-2019 | 4.08 | 9.36 |
Jun | 2001-2019 | 1.05 | 3.43 |
Jul | 2001-2019 | 0.20 | 1.76 |
Aug | 2001-2019 | 0.33 | 2.81 |
Sep | 2001-2019 | 1.21 | 3.98 |
Oct | 2001-2019 | 2.45 | 5.32 |
Nov | 2001-2019 | 3.09 | 5.12 |
Dec | 2001-2019 | 3.89 | 4.17 |
lake_data_wide<-read.table("lake_data.txt")
All_lakes_wide<-bind_cols(clean_data_wide,lake_data_wide)
All_lakes_wide
Month | Period | Simiyu | Victoria | Month1 | Period1 | SeaHawk | Randall |
---|---|---|---|---|---|---|---|
Jan | 2001-2019 | 2.91 | 3.18 | Jan | 2001-2019 | 3.125990 | 10.915324 |
Feb | 2001-2019 | 1.80 | 3.48 | Feb | 2001-2019 | 12.441331 | 5.148206 |
Mar | 2001-2019 | 2.98 | 4.69 | Mar | 2001-2019 | 3.230104 | 5.982370 |
Apr | 2001-2019 | 4.75 | 7.00 | Apr | 2001-2019 | 10.907835 | 6.179094 |
May | 2001-2019 | 4.08 | 9.36 | May | 2001-2019 | 9.674827 | 8.532295 |
Jun | 2001-2019 | 1.05 | 3.43 | Jun | 2001-2019 | 8.033946 | 6.132740 |
Jul | 2001-2019 | 0.20 | 1.76 | Jul | 2001-2019 | 12.318558 | 14.437468 |
Aug | 2001-2019 | 0.33 | 2.81 | Aug | 2001-2019 | 6.557286 | 11.269315 |
Sep | 2001-2019 | 1.21 | 3.98 | Sep | 2001-2019 | 6.057994 | 15.059383 |
Oct | 2001-2019 | 2.45 | 5.32 | Oct | 2001-2019 | 12.040164 | 10.124065 |
Nov | 2001-2019 | 3.09 | 5.12 | Nov | 2001-2019 | 5.369271 | 2.171475 |
Dec | 2001-2019 | 3.89 | 4.17 | Dec | 2001-2019 | 8.132666 | 12.947003 |
You’ll notice that this created two new columns “Month1” and “Period1” because they share the same name in both datasets. To account for this you can use the full join function. Using the “by=” argument, you can join columns that have the same name and values.
full_join(clean_data_wide,lake_data_wide, by=c("Month","Period"))
Warning message:
"Column `Month` joining factors with different levels, coercing to character vector"
Month | Period | Simiyu | Victoria | SeaHawk | Randall |
---|---|---|---|---|---|
Jan | 2001-2019 | 2.91 | 3.18 | 3.125990 | 10.915324 |
Feb | 2001-2019 | 1.80 | 3.48 | 12.441331 | 5.148206 |
Mar | 2001-2019 | 2.98 | 4.69 | 3.230104 | 5.982370 |
Apr | 2001-2019 | 4.75 | 7.00 | 10.907835 | 6.179094 |
May | 2001-2019 | 4.08 | 9.36 | 9.674827 | 8.532295 |
Jun | 2001-2019 | 1.05 | 3.43 | 8.033946 | 6.132740 |
Jul | 2001-2019 | 0.20 | 1.76 | 12.318558 | 14.437468 |
Aug | 2001-2019 | 0.33 | 2.81 | 6.557286 | 11.269315 |
Sep | 2001-2019 | 1.21 | 3.98 | 6.057994 | 15.059383 |
Oct | 2001-2019 | 2.45 | 5.32 | 12.040164 | 10.124065 |
Nov | 2001-2019 | 3.09 | 5.12 | 5.369271 | 2.171475 |
Dec | 2001-2019 | 3.89 | 4.17 | 8.132666 | 12.947003 |
Always read any warning messages that R outputs. You’ll notice that because the month columns have different levels, R automatically converts it to a character vector. In this case, make sure you refactor your data if necessary.
There are many other functions that dplyr offers. I would recommend exploring different functions to learn how to properly manipulate your data. In the next module we will learn all about t-tests!