Module 4: Data Wrangling Part 2

How to Clean and Manipulate your Data in R

Hawken Hass

University of North Carolina Wilmington

Manipulating your Data

Once your data is all clean, you can manipulate your data. Data manipulation can include grouping data, combining data sets, summarizing data, and making new variables. For this, we are going to use the dplyr package.

library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

For this module, let’s use our data we just cleaned in the last module. I refactored the data to make sure that the levels are in the correct order.

clean_data<-read.table("clean_data.txt")
clean_data$Month<-factor(clean_data$Month, levels=unique(clean_data$Month))
clean_data

Month	Period	Lake	Rainfall_mm
Jan	2001-2019	Victoria	3.18
Feb	2001-2019	Victoria	3.48
Mar	2001-2019	Victoria	4.69
Apr	2001-2019	Victoria	7.00
May	2001-2019	Victoria	9.36
Jun	2001-2019	Victoria	3.43
Jul	2001-2019	Victoria	1.76
Aug	2001-2019	Victoria	2.81
Sep	2001-2019	Victoria	3.98
Oct	2001-2019	Victoria	5.32
Nov	2001-2019	Victoria	5.12
Dec	2001-2019	Victoria	4.17
Jan	2001-2019	Simiyu	2.91
Feb	2001-2019	Simiyu	1.80
Mar	2001-2019	Simiyu	2.98
Apr	2001-2019	Simiyu	4.75
May	2001-2019	Simiyu	4.08
Jun	2001-2019	Simiyu	1.05
Jul	2001-2019	Simiyu	0.20
Aug	2001-2019	Simiyu	0.33
Sep	2001-2019	Simiyu	1.21
Oct	2001-2019	Simiyu	2.45
Nov	2001-2019	Simiyu	3.09
Dec	2001-2019	Simiyu	3.89

Filtering Data by Rows

The filter function allows you to subset rows from your data frame. For example, let’s say we only want to look at rainfall in the month of January. We can do that using the filter function.

filter(clean_data, Month=="Jan")

Month	Period	Lake	Rainfall_mm
Jan	2001-2019	Victoria	3.18
Jan	2001-2019	Simiyu	2.91

Since there are only two observations for each month, this is a pretty small data frame. However, this function is quite useful for large datasets. In the second argument, you denote an operate that tells the function how you would like to filter the column. Since we wanted only values for the month of January, we used the “==” operator. There are many different operators you can use to filter your data. For example, the code below would filter out any rainfall measurement that is less than 1.

filter(clean_data, Rainfall_mm>1)

Month	Period	Lake	Rainfall_mm
Jan	2001-2019	Victoria	3.18
Feb	2001-2019	Victoria	3.48
Mar	2001-2019	Victoria	4.69
Apr	2001-2019	Victoria	7.00
May	2001-2019	Victoria	9.36
Jun	2001-2019	Victoria	3.43
Jul	2001-2019	Victoria	1.76
Aug	2001-2019	Victoria	2.81
Sep	2001-2019	Victoria	3.98
Oct	2001-2019	Victoria	5.32
Nov	2001-2019	Victoria	5.12
Dec	2001-2019	Victoria	4.17
Jan	2001-2019	Simiyu	2.91
Feb	2001-2019	Simiyu	1.80
Mar	2001-2019	Simiyu	2.98
Apr	2001-2019	Simiyu	4.75
May	2001-2019	Simiyu	4.08
Jun	2001-2019	Simiyu	1.05
Sep	2001-2019	Simiyu	1.21
Oct	2001-2019	Simiyu	2.45
Nov	2001-2019	Simiyu	3.09
Dec	2001-2019	Simiyu	3.89

Here are some more examples of operators.

== (Equal to)
!= (Not equal to)
< (Less than)
<= (Less than or equal to)
> (Greater than)
>= (Greater than or equal to)

Filtering by Columns

The select function is used to filter out specific columns. The select function also uses helper functions to filter out columns based on certain properties. Using double quotes, you can filter out columns of a certain name.

select(clean_data, contains("Lake"))

Lake
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu

The “-“ operator can be used to denote which columns you want to exclude.

select(clean_data, -Period)

Month	Lake	Rainfall_mm
Jan	Victoria	3.18
Feb	Victoria	3.48
Mar	Victoria	4.69
Apr	Victoria	7.00
May	Victoria	9.36
Jun	Victoria	3.43
Jul	Victoria	1.76
Aug	Victoria	2.81
Sep	Victoria	3.98
Oct	Victoria	5.32
Nov	Victoria	5.12
Dec	Victoria	4.17
Jan	Simiyu	2.91
Feb	Simiyu	1.80
Mar	Simiyu	2.98
Apr	Simiyu	4.75
May	Simiyu	4.08
Jun	Simiyu	1.05
Jul	Simiyu	0.20
Aug	Simiyu	0.33
Sep	Simiyu	1.21
Oct	Simiyu	2.45
Nov	Simiyu	3.09
Dec	Simiyu	3.89

Piping

It may be unproductive to create one line of code for each data cleaning and manipulating function. Sometimes you will see programmers use a pipe. A pipe looks like this: %>%. Piping can make your code easier to read and efficient. For example, everything I just did above can be done in one line of code using a couple of pipes.

new_data<-clean_data%>%select(-Period)%>%filter(Rainfall_mm>1)%>%filter(Lake=="Victoria")
print(new_data)

   Month     Lake Rainfall_mm
  Jan Victoria        3.18
  Feb Victoria        3.48
  Mar Victoria        4.69
  Apr Victoria        7.00
  May Victoria        9.36
  Jun Victoria        3.43
  Jul Victoria        1.76
  Aug Victoria        2.81
  Sep Victoria        3.98
 Oct Victoria        5.32
 Nov Victoria        5.12
 Dec Victoria        4.17

You can see how piping made our code more efficient. In one line of code we were able to create a new data set that removed the period column, and outputed only values for Lake Victoria that were greater than 1 mm. This is very useful for large datasets.

Summarize Data

The dplyr function also allows you to summarize your data using different summary functions.

summarise(clean_data, avg=mean(Rainfall_mm))

avg
3.46

You can add more than one summary function in each argument.

summarise(clean_data, avg=mean(Rainfall_mm), n=n(), sd=sd(Rainfall_mm), var=var(Rainfall_mm),median=median(Rainfall_mm), min=min(Rainfall_mm), max=max(Rainfall_mm))

avg	n	sd	var	median	min	max
3.46	24	2.055855	4.226539	3.305	0.2	9.36

It may be more useful to look at the average rainfall for both groups. We can find the means of separate groups by using the group_by function and piping it to the summarise function. The code would look like this:

clean_data%>%group_by(Lake)%>%summarise(avg=mean(Rainfall_mm), sd=sd(Rainfall_mm), min=min(Rainfall_mm), max=max(Rainfall_mm))

Lake	avg	sd	min	max
Simiyu	2.395	1.488236	0.20	4.75
Victoria	4.525	2.036613	1.76	9.36

Creating New Variables

Dplyr also allows you to create new variables within your data set using the mutate function. In this function you are computing a new variable from an existing variable. The basic code for the function is:

mutate(df, name of new column= formula)

For example, let’s say we want to transform the Rainfall_mm to measure rainfall in inches.

clean_data_2<-mutate(clean_data, Rainfall_inches=Rainfall_mm*0.039)
clean_data_2

Month	Period	Lake	Rainfall_mm	Rainfall_inches
Jan	2001-2019	Victoria	3.18	0.12402
Feb	2001-2019	Victoria	3.48	0.13572
Mar	2001-2019	Victoria	4.69	0.18291
Apr	2001-2019	Victoria	7.00	0.27300
May	2001-2019	Victoria	9.36	0.36504
Jun	2001-2019	Victoria	3.43	0.13377
Jul	2001-2019	Victoria	1.76	0.06864
Aug	2001-2019	Victoria	2.81	0.10959
Sep	2001-2019	Victoria	3.98	0.15522
Oct	2001-2019	Victoria	5.32	0.20748
Nov	2001-2019	Victoria	5.12	0.19968
Dec	2001-2019	Victoria	4.17	0.16263
Jan	2001-2019	Simiyu	2.91	0.11349
Feb	2001-2019	Simiyu	1.80	0.07020
Mar	2001-2019	Simiyu	2.98	0.11622
Apr	2001-2019	Simiyu	4.75	0.18525
May	2001-2019	Simiyu	4.08	0.15912
Jun	2001-2019	Simiyu	1.05	0.04095
Jul	2001-2019	Simiyu	0.20	0.00780
Aug	2001-2019	Simiyu	0.33	0.01287
Sep	2001-2019	Simiyu	1.21	0.04719
Oct	2001-2019	Simiyu	2.45	0.09555
Nov	2001-2019	Simiyu	3.09	0.12051
Dec	2001-2019	Simiyu	3.89	0.15171

Combining Data Sets

Dplyr also has the ability to combine data sets into one. Let’s read in some more lake data and combine it with our current data set!

lake_data<-read.table("lake_data.txt")
print(lake_data)

   Month    Period   SeaHawk   Randall
  Jan 2001-2019  3.125990 10.915324
  Feb 2001-2019 12.441331  5.148206
  Mar 2001-2019  3.230104  5.982370
  Apr 2001-2019 10.907835  6.179094
  May 2001-2019  9.674827  8.532295
  Jun 2001-2019  8.033946  6.132740
  Jul 2001-2019 12.318558 14.437468
  Aug 2001-2019  6.557286 11.269315
  Sep 2001-2019  6.057994 15.059383
 Oct 2001-2019 12.040164 10.124065
 Nov 2001-2019  5.369271  2.171475
 Dec 2001-2019  8.132666 12.947003

Now we have two more lakes: Lake SeaHawk and Lake Randall along with their monthly rainfall in mm. Let’s clean and tidy our data to make it look like our clean_data data frame.

library(tidyr)
lake_data<-lake_data %>% gather(key="Lake",value="Rainfall_mm",3:4)%>%mutate(Rainfall_inches=Rainfall_mm*0.039)
lake_data

Month	Period	Lake	Rainfall_mm	Rainfall_inches
Jan	2001-2019	SeaHawk	3.125990	0.12191361
Feb	2001-2019	SeaHawk	12.441331	0.48521192
Mar	2001-2019	SeaHawk	3.230104	0.12597405
Apr	2001-2019	SeaHawk	10.907835	0.42540556
May	2001-2019	SeaHawk	9.674827	0.37731825
Jun	2001-2019	SeaHawk	8.033946	0.31332390
Jul	2001-2019	SeaHawk	12.318558	0.48042378
Aug	2001-2019	SeaHawk	6.557286	0.25573414
Sep	2001-2019	SeaHawk	6.057994	0.23626178
Oct	2001-2019	SeaHawk	12.040164	0.46956640
Nov	2001-2019	SeaHawk	5.369271	0.20940156
Dec	2001-2019	SeaHawk	8.132666	0.31717398
Jan	2001-2019	Randall	10.915324	0.42569765
Feb	2001-2019	Randall	5.148206	0.20078002
Mar	2001-2019	Randall	5.982370	0.23331245
Apr	2001-2019	Randall	6.179094	0.24098468
May	2001-2019	Randall	8.532295	0.33275951
Jun	2001-2019	Randall	6.132740	0.23917686
Jul	2001-2019	Randall	14.437468	0.56306127
Aug	2001-2019	Randall	11.269315	0.43950329
Sep	2001-2019	Randall	15.059383	0.58731592
Oct	2001-2019	Randall	10.124065	0.39483852
Nov	2001-2019	Randall	2.171475	0.08468754
Dec	2001-2019	Randall	12.947003	0.50493311

Let’s combine our two data sets!

All_lakes<-bind_rows(clean_data_2,lake_data)
All_lakes$Month<-as.factor(All_lakes$Month)
All_lakes$Lake<-as.factor(All_lakes$Lake)
print(All_lakes)

Warning message in bind_rows_(x, .id):
"binding factor and character vector, coercing into character vector"Warning message in bind_rows_(x, .id):
"binding character and factor vector, coercing into character vector"

   Month    Period     Lake Rainfall_mm Rainfall_inches
  Jan 2001-2019 Victoria    3.180000      0.12402000
  Feb 2001-2019 Victoria    3.480000      0.13572000
  Mar 2001-2019 Victoria    4.690000      0.18291000
  Apr 2001-2019 Victoria    7.000000      0.27300000
  May 2001-2019 Victoria    9.360000      0.36504000
  Jun 2001-2019 Victoria    3.430000      0.13377000
  Jul 2001-2019 Victoria    1.760000      0.06864000
  Aug 2001-2019 Victoria    2.810000      0.10959000
  Sep 2001-2019 Victoria    3.980000      0.15522000
 Oct 2001-2019 Victoria    5.320000      0.20748000
 Nov 2001-2019 Victoria    5.120000      0.19968000
 Dec 2001-2019 Victoria    4.170000      0.16263000
 Jan 2001-2019   Simiyu    2.910000      0.11349000
 Feb 2001-2019   Simiyu    1.800000      0.07020000
 Mar 2001-2019   Simiyu    2.980000      0.11622000
 Apr 2001-2019   Simiyu    4.750000      0.18525000
 May 2001-2019   Simiyu    4.080000      0.15912000
 Jun 2001-2019   Simiyu    1.050000      0.04095000
 Jul 2001-2019   Simiyu    0.200000      0.00780000
 Aug 2001-2019   Simiyu    0.330000      0.01287000
 Sep 2001-2019   Simiyu    1.210000      0.04719000
 Oct 2001-2019   Simiyu    2.450000      0.09555000
 Nov 2001-2019   Simiyu    3.090000      0.12051000
 Dec 2001-2019   Simiyu    3.890000      0.15171000
 Jan 2001-2019  SeaHawk    3.125990      0.12191361
 Feb 2001-2019  SeaHawk   12.441331      0.48521192
 Mar 2001-2019  SeaHawk    3.230104      0.12597405
 Apr 2001-2019  SeaHawk   10.907835      0.42540556
 May 2001-2019  SeaHawk    9.674827      0.37731825
 Jun 2001-2019  SeaHawk    8.033946      0.31332390
 Jul 2001-2019  SeaHawk   12.318558      0.48042378
 Aug 2001-2019  SeaHawk    6.557286      0.25573414
 Sep 2001-2019  SeaHawk    6.057994      0.23626178
 Oct 2001-2019  SeaHawk   12.040164      0.46956640
 Nov 2001-2019  SeaHawk    5.369271      0.20940156
 Dec 2001-2019  SeaHawk    8.132666      0.31717398
 Jan 2001-2019  Randall   10.915324      0.42569765
 Feb 2001-2019  Randall    5.148206      0.20078002
 Mar 2001-2019  Randall    5.982370      0.23331245
 Apr 2001-2019  Randall    6.179094      0.24098468
 May 2001-2019  Randall    8.532295      0.33275951
 Jun 2001-2019  Randall    6.132740      0.23917686
 Jul 2001-2019  Randall   14.437468      0.56306127
 Aug 2001-2019  Randall   11.269315      0.43950329
 Sep 2001-2019  Randall   15.059383      0.58731592
 Oct 2001-2019  Randall   10.124065      0.39483852
 Nov 2001-2019  Randall    2.171475      0.08468754
 Dec 2001-2019  Randall   12.947003      0.50493311

The bind_rows function adds the rows from your second argument to the dataframe in your first argument. You can also use the bind_cols function to add the columns from one dataset to the other. For this example, let’s get our data back into wide format.

clean_data_wide<-clean_data%>%spread(key="Lake",value="Rainfall_mm")
clean_data_wide

Month	Period	Simiyu	Victoria
Jan	2001-2019	2.91	3.18
Feb	2001-2019	1.80	3.48
Mar	2001-2019	2.98	4.69
Apr	2001-2019	4.75	7.00
May	2001-2019	4.08	9.36
Jun	2001-2019	1.05	3.43
Jul	2001-2019	0.20	1.76
Aug	2001-2019	0.33	2.81
Sep	2001-2019	1.21	3.98
Oct	2001-2019	2.45	5.32
Nov	2001-2019	3.09	5.12
Dec	2001-2019	3.89	4.17

lake_data_wide<-read.table("lake_data.txt")
All_lakes_wide<-bind_cols(clean_data_wide,lake_data_wide)
All_lakes_wide

Month	Period	Simiyu	Victoria	Month1	Period1	SeaHawk	Randall
Jan	2001-2019	2.91	3.18	Jan	2001-2019	3.125990	10.915324
Feb	2001-2019	1.80	3.48	Feb	2001-2019	12.441331	5.148206
Mar	2001-2019	2.98	4.69	Mar	2001-2019	3.230104	5.982370
Apr	2001-2019	4.75	7.00	Apr	2001-2019	10.907835	6.179094
May	2001-2019	4.08	9.36	May	2001-2019	9.674827	8.532295
Jun	2001-2019	1.05	3.43	Jun	2001-2019	8.033946	6.132740
Jul	2001-2019	0.20	1.76	Jul	2001-2019	12.318558	14.437468
Aug	2001-2019	0.33	2.81	Aug	2001-2019	6.557286	11.269315
Sep	2001-2019	1.21	3.98	Sep	2001-2019	6.057994	15.059383
Oct	2001-2019	2.45	5.32	Oct	2001-2019	12.040164	10.124065
Nov	2001-2019	3.09	5.12	Nov	2001-2019	5.369271	2.171475
Dec	2001-2019	3.89	4.17	Dec	2001-2019	8.132666	12.947003

You’ll notice that this created two new columns “Month1” and “Period1” because they share the same name in both datasets. To account for this you can use the full join function. Using the “by=” argument, you can join columns that have the same name and values.

full_join(clean_data_wide,lake_data_wide, by=c("Month","Period"))

Warning message:
"Column `Month` joining factors with different levels, coercing to character vector"

Month	Period	Simiyu	Victoria	SeaHawk	Randall
Jan	2001-2019	2.91	3.18	3.125990	10.915324
Feb	2001-2019	1.80	3.48	12.441331	5.148206
Mar	2001-2019	2.98	4.69	3.230104	5.982370
Apr	2001-2019	4.75	7.00	10.907835	6.179094
May	2001-2019	4.08	9.36	9.674827	8.532295
Jun	2001-2019	1.05	3.43	8.033946	6.132740
Jul	2001-2019	0.20	1.76	12.318558	14.437468
Aug	2001-2019	0.33	2.81	6.557286	11.269315
Sep	2001-2019	1.21	3.98	6.057994	15.059383
Oct	2001-2019	2.45	5.32	12.040164	10.124065
Nov	2001-2019	3.09	5.12	5.369271	2.171475
Dec	2001-2019	3.89	4.17	8.132666	12.947003

Always read any warning messages that R outputs. You’ll notice that because the month columns have different levels, R automatically converts it to a character vector. In this case, make sure you refactor your data if necessary.

There are many other functions that dplyr offers. I would recommend exploring different functions to learn how to properly manipulate your data. In the next module we will learn all about t-tests!