Module 4: Data Wrangling Part 2

How to Clean and Manipulate your Data in R

Hawken Hass

University of North Carolina Wilmington

Manipulating your Data

Once your data is all clean, you can manipulate your data. Data manipulation can include grouping data, combining data sets, summarizing data, and making new variables. For this, we are going to use the dplyr package.

library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

For this module, let’s use our data we just cleaned in the last module. I refactored the data to make sure that the levels are in the correct order.

clean_data<-read.table("clean_data.txt")
clean_data$Month<-factor(clean_data$Month, levels=unique(clean_data$Month))
clean_data
MonthPeriodLakeRainfall_mm
Jan 2001-2019Victoria 3.18
Feb 2001-2019Victoria 3.48
Mar 2001-2019Victoria 4.69
Apr 2001-2019Victoria 7.00
May 2001-2019Victoria 9.36
Jun 2001-2019Victoria 3.43
Jul 2001-2019Victoria 1.76
Aug 2001-2019Victoria 2.81
Sep 2001-2019Victoria 3.98
Oct 2001-2019Victoria 5.32
Nov 2001-2019Victoria 5.12
Dec 2001-2019Victoria 4.17
Jan 2001-2019Simiyu 2.91
Feb 2001-2019Simiyu 1.80
Mar 2001-2019Simiyu 2.98
Apr 2001-2019Simiyu 4.75
May 2001-2019Simiyu 4.08
Jun 2001-2019Simiyu 1.05
Jul 2001-2019Simiyu 0.20
Aug 2001-2019Simiyu 0.33
Sep 2001-2019Simiyu 1.21
Oct 2001-2019Simiyu 2.45
Nov 2001-2019Simiyu 3.09
Dec 2001-2019Simiyu 3.89

Filtering Data by Rows

The filter function allows you to subset rows from your data frame. For example, let’s say we only want to look at rainfall in the month of January. We can do that using the filter function.

filter(clean_data, Month=="Jan")
MonthPeriodLakeRainfall_mm
Jan 2001-2019Victoria 3.18
Jan 2001-2019Simiyu 2.91

Since there are only two observations for each month, this is a pretty small data frame. However, this function is quite useful for large datasets. In the second argument, you denote an operate that tells the function how you would like to filter the column. Since we wanted only values for the month of January, we used the “==” operator. There are many different operators you can use to filter your data. For example, the code below would filter out any rainfall measurement that is less than 1.

filter(clean_data, Rainfall_mm>1)
MonthPeriodLakeRainfall_mm
Jan 2001-2019Victoria 3.18
Feb 2001-2019Victoria 3.48
Mar 2001-2019Victoria 4.69
Apr 2001-2019Victoria 7.00
May 2001-2019Victoria 9.36
Jun 2001-2019Victoria 3.43
Jul 2001-2019Victoria 1.76
Aug 2001-2019Victoria 2.81
Sep 2001-2019Victoria 3.98
Oct 2001-2019Victoria 5.32
Nov 2001-2019Victoria 5.12
Dec 2001-2019Victoria 4.17
Jan 2001-2019Simiyu 2.91
Feb 2001-2019Simiyu 1.80
Mar 2001-2019Simiyu 2.98
Apr 2001-2019Simiyu 4.75
May 2001-2019Simiyu 4.08
Jun 2001-2019Simiyu 1.05
Sep 2001-2019Simiyu 1.21
Oct 2001-2019Simiyu 2.45
Nov 2001-2019Simiyu 3.09
Dec 2001-2019Simiyu 3.89

Here are some more examples of operators.

  • == (Equal to)
  • != (Not equal to)
  • < (Less than)
  • <= (Less than or equal to)
  • > (Greater than)
  • >= (Greater than or equal to)

Filtering by Columns

The select function is used to filter out specific columns. The select function also uses helper functions to filter out columns based on certain properties. Using double quotes, you can filter out columns of a certain name.

select(clean_data, contains("Lake"))
Lake
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Victoria
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu
Simiyu

The “-“ operator can be used to denote which columns you want to exclude.

select(clean_data, -Period)
MonthLakeRainfall_mm
Jan Victoria3.18
Feb Victoria3.48
Mar Victoria4.69
Apr Victoria7.00
May Victoria9.36
Jun Victoria3.43
Jul Victoria1.76
Aug Victoria2.81
Sep Victoria3.98
Oct Victoria5.32
Nov Victoria5.12
Dec Victoria4.17
Jan Simiyu 2.91
Feb Simiyu 1.80
Mar Simiyu 2.98
Apr Simiyu 4.75
May Simiyu 4.08
Jun Simiyu 1.05
Jul Simiyu 0.20
Aug Simiyu 0.33
Sep Simiyu 1.21
Oct Simiyu 2.45
Nov Simiyu 3.09
Dec Simiyu 3.89

Piping

It may be unproductive to create one line of code for each data cleaning and manipulating function. Sometimes you will see programmers use a pipe. A pipe looks like this: %>%. Piping can make your code easier to read and efficient. For example, everything I just did above can be done in one line of code using a couple of pipes.

new_data<-clean_data%>%select(-Period)%>%filter(Rainfall_mm>1)%>%filter(Lake=="Victoria")
print(new_data)
   Month     Lake Rainfall_mm
1    Jan Victoria        3.18
2    Feb Victoria        3.48
3    Mar Victoria        4.69
4    Apr Victoria        7.00
5    May Victoria        9.36
6    Jun Victoria        3.43
7    Jul Victoria        1.76
8    Aug Victoria        2.81
9    Sep Victoria        3.98
10   Oct Victoria        5.32
11   Nov Victoria        5.12
12   Dec Victoria        4.17

You can see how piping made our code more efficient. In one line of code we were able to create a new data set that removed the period column, and outputed only values for Lake Victoria that were greater than 1 mm. This is very useful for large datasets.

Summarize Data

The dplyr function also allows you to summarize your data using different summary functions.

summarise(clean_data, avg=mean(Rainfall_mm))
avg
3.46

You can add more than one summary function in each argument.

summarise(clean_data, avg=mean(Rainfall_mm), n=n(), sd=sd(Rainfall_mm), var=var(Rainfall_mm),median=median(Rainfall_mm), min=min(Rainfall_mm), max=max(Rainfall_mm))
avgnsdvarmedianminmax
3.46 24 2.0558554.2265393.305 0.2 9.36

It may be more useful to look at the average rainfall for both groups. We can find the means of separate groups by using the group_by function and piping it to the summarise function. The code would look like this:

clean_data%>%group_by(Lake)%>%summarise(avg=mean(Rainfall_mm), sd=sd(Rainfall_mm), min=min(Rainfall_mm), max=max(Rainfall_mm))
Lakeavgsdminmax
Simiyu 2.395 1.4882360.20 4.75
Victoria4.525 2.0366131.76 9.36

Creating New Variables

Dplyr also allows you to create new variables within your data set using the mutate function. In this function you are computing a new variable from an existing variable. The basic code for the function is:

  • mutate(df, name of new column= formula)

For example, let’s say we want to transform the Rainfall_mm to measure rainfall in inches.

clean_data_2<-mutate(clean_data, Rainfall_inches=Rainfall_mm*0.039)
clean_data_2
MonthPeriodLakeRainfall_mmRainfall_inches
Jan 2001-2019Victoria 3.18 0.12402
Feb 2001-2019Victoria 3.48 0.13572
Mar 2001-2019Victoria 4.69 0.18291
Apr 2001-2019Victoria 7.00 0.27300
May 2001-2019Victoria 9.36 0.36504
Jun 2001-2019Victoria 3.43 0.13377
Jul 2001-2019Victoria 1.76 0.06864
Aug 2001-2019Victoria 2.81 0.10959
Sep 2001-2019Victoria 3.98 0.15522
Oct 2001-2019Victoria 5.32 0.20748
Nov 2001-2019Victoria 5.12 0.19968
Dec 2001-2019Victoria 4.17 0.16263
Jan 2001-2019Simiyu 2.91 0.11349
Feb 2001-2019Simiyu 1.80 0.07020
Mar 2001-2019Simiyu 2.98 0.11622
Apr 2001-2019Simiyu 4.75 0.18525
May 2001-2019Simiyu 4.08 0.15912
Jun 2001-2019Simiyu 1.05 0.04095
Jul 2001-2019Simiyu 0.20 0.00780
Aug 2001-2019Simiyu 0.33 0.01287
Sep 2001-2019Simiyu 1.21 0.04719
Oct 2001-2019Simiyu 2.45 0.09555
Nov 2001-2019Simiyu 3.09 0.12051
Dec 2001-2019Simiyu 3.89 0.15171

Combining Data Sets

Dplyr also has the ability to combine data sets into one. Let’s read in some more lake data and combine it with our current data set!

lake_data<-read.table("lake_data.txt")
print(lake_data)

   Month    Period   SeaHawk   Randall
1    Jan 2001-2019  3.125990 10.915324
2    Feb 2001-2019 12.441331  5.148206
3    Mar 2001-2019  3.230104  5.982370
4    Apr 2001-2019 10.907835  6.179094
5    May 2001-2019  9.674827  8.532295
6    Jun 2001-2019  8.033946  6.132740
7    Jul 2001-2019 12.318558 14.437468
8    Aug 2001-2019  6.557286 11.269315
9    Sep 2001-2019  6.057994 15.059383
10   Oct 2001-2019 12.040164 10.124065
11   Nov 2001-2019  5.369271  2.171475
12   Dec 2001-2019  8.132666 12.947003

Now we have two more lakes: Lake SeaHawk and Lake Randall along with their monthly rainfall in mm. Let’s clean and tidy our data to make it look like our clean_data data frame.

library(tidyr)
lake_data<-lake_data %>% gather(key="Lake",value="Rainfall_mm",3:4)%>%mutate(Rainfall_inches=Rainfall_mm*0.039)
lake_data
MonthPeriodLakeRainfall_mmRainfall_inches
Jan 2001-2019 SeaHawk 3.125990 0.12191361
Feb 2001-2019 SeaHawk 12.441331 0.48521192
Mar 2001-2019 SeaHawk 3.230104 0.12597405
Apr 2001-2019 SeaHawk 10.907835 0.42540556
May 2001-2019 SeaHawk 9.674827 0.37731825
Jun 2001-2019 SeaHawk 8.033946 0.31332390
Jul 2001-2019 SeaHawk 12.318558 0.48042378
Aug 2001-2019 SeaHawk 6.557286 0.25573414
Sep 2001-2019 SeaHawk 6.057994 0.23626178
Oct 2001-2019 SeaHawk 12.040164 0.46956640
Nov 2001-2019 SeaHawk 5.369271 0.20940156
Dec 2001-2019 SeaHawk 8.132666 0.31717398
Jan 2001-2019 Randall 10.915324 0.42569765
Feb 2001-2019 Randall 5.148206 0.20078002
Mar 2001-2019 Randall 5.982370 0.23331245
Apr 2001-2019 Randall 6.179094 0.24098468
May 2001-2019 Randall 8.532295 0.33275951
Jun 2001-2019 Randall 6.132740 0.23917686
Jul 2001-2019 Randall 14.437468 0.56306127
Aug 2001-2019 Randall 11.269315 0.43950329
Sep 2001-2019 Randall 15.059383 0.58731592
Oct 2001-2019 Randall 10.124065 0.39483852
Nov 2001-2019 Randall 2.171475 0.08468754
Dec 2001-2019 Randall 12.947003 0.50493311

Let’s combine our two data sets!

All_lakes<-bind_rows(clean_data_2,lake_data)
All_lakes$Month<-as.factor(All_lakes$Month)
All_lakes$Lake<-as.factor(All_lakes$Lake)
print(All_lakes)

Warning message in bind_rows_(x, .id):
"binding factor and character vector, coercing into character vector"Warning message in bind_rows_(x, .id):
"binding character and factor vector, coercing into character vector"

   Month    Period     Lake Rainfall_mm Rainfall_inches
1    Jan 2001-2019 Victoria    3.180000      0.12402000
2    Feb 2001-2019 Victoria    3.480000      0.13572000
3    Mar 2001-2019 Victoria    4.690000      0.18291000
4    Apr 2001-2019 Victoria    7.000000      0.27300000
5    May 2001-2019 Victoria    9.360000      0.36504000
6    Jun 2001-2019 Victoria    3.430000      0.13377000
7    Jul 2001-2019 Victoria    1.760000      0.06864000
8    Aug 2001-2019 Victoria    2.810000      0.10959000
9    Sep 2001-2019 Victoria    3.980000      0.15522000
10   Oct 2001-2019 Victoria    5.320000      0.20748000
11   Nov 2001-2019 Victoria    5.120000      0.19968000
12   Dec 2001-2019 Victoria    4.170000      0.16263000
13   Jan 2001-2019   Simiyu    2.910000      0.11349000
14   Feb 2001-2019   Simiyu    1.800000      0.07020000
15   Mar 2001-2019   Simiyu    2.980000      0.11622000
16   Apr 2001-2019   Simiyu    4.750000      0.18525000
17   May 2001-2019   Simiyu    4.080000      0.15912000
18   Jun 2001-2019   Simiyu    1.050000      0.04095000
19   Jul 2001-2019   Simiyu    0.200000      0.00780000
20   Aug 2001-2019   Simiyu    0.330000      0.01287000
21   Sep 2001-2019   Simiyu    1.210000      0.04719000
22   Oct 2001-2019   Simiyu    2.450000      0.09555000
23   Nov 2001-2019   Simiyu    3.090000      0.12051000
24   Dec 2001-2019   Simiyu    3.890000      0.15171000
25   Jan 2001-2019  SeaHawk    3.125990      0.12191361
26   Feb 2001-2019  SeaHawk   12.441331      0.48521192
27   Mar 2001-2019  SeaHawk    3.230104      0.12597405
28   Apr 2001-2019  SeaHawk   10.907835      0.42540556
29   May 2001-2019  SeaHawk    9.674827      0.37731825
30   Jun 2001-2019  SeaHawk    8.033946      0.31332390
31   Jul 2001-2019  SeaHawk   12.318558      0.48042378
32   Aug 2001-2019  SeaHawk    6.557286      0.25573414
33   Sep 2001-2019  SeaHawk    6.057994      0.23626178
34   Oct 2001-2019  SeaHawk   12.040164      0.46956640
35   Nov 2001-2019  SeaHawk    5.369271      0.20940156
36   Dec 2001-2019  SeaHawk    8.132666      0.31717398
37   Jan 2001-2019  Randall   10.915324      0.42569765
38   Feb 2001-2019  Randall    5.148206      0.20078002
39   Mar 2001-2019  Randall    5.982370      0.23331245
40   Apr 2001-2019  Randall    6.179094      0.24098468
41   May 2001-2019  Randall    8.532295      0.33275951
42   Jun 2001-2019  Randall    6.132740      0.23917686
43   Jul 2001-2019  Randall   14.437468      0.56306127
44   Aug 2001-2019  Randall   11.269315      0.43950329
45   Sep 2001-2019  Randall   15.059383      0.58731592
46   Oct 2001-2019  Randall   10.124065      0.39483852
47   Nov 2001-2019  Randall    2.171475      0.08468754
48   Dec 2001-2019  Randall   12.947003      0.50493311

The bind_rows function adds the rows from your second argument to the dataframe in your first argument. You can also use the bind_cols function to add the columns from one dataset to the other. For this example, let’s get our data back into wide format.

clean_data_wide<-clean_data%>%spread(key="Lake",value="Rainfall_mm")
clean_data_wide
MonthPeriodSimiyuVictoria
Jan 2001-20192.91 3.18
Feb 2001-20191.80 3.48
Mar 2001-20192.98 4.69
Apr 2001-20194.75 7.00
May 2001-20194.08 9.36
Jun 2001-20191.05 3.43
Jul 2001-20190.20 1.76
Aug 2001-20190.33 2.81
Sep 2001-20191.21 3.98
Oct 2001-20192.45 5.32
Nov 2001-20193.09 5.12
Dec 2001-20193.89 4.17
lake_data_wide<-read.table("lake_data.txt")
All_lakes_wide<-bind_cols(clean_data_wide,lake_data_wide)
All_lakes_wide
MonthPeriodSimiyuVictoriaMonth1Period1SeaHawkRandall
Jan 2001-20192.91 3.18 Jan 2001-2019 3.12599010.915324
Feb 2001-20191.80 3.48 Feb 2001-201912.441331 5.148206
Mar 2001-20192.98 4.69 Mar 2001-2019 3.230104 5.982370
Apr 2001-20194.75 7.00 Apr 2001-201910.907835 6.179094
May 2001-20194.08 9.36 May 2001-2019 9.674827 8.532295
Jun 2001-20191.05 3.43 Jun 2001-2019 8.033946 6.132740
Jul 2001-20190.20 1.76 Jul 2001-201912.31855814.437468
Aug 2001-20190.33 2.81 Aug 2001-2019 6.55728611.269315
Sep 2001-20191.21 3.98 Sep 2001-2019 6.05799415.059383
Oct 2001-20192.45 5.32 Oct 2001-201912.04016410.124065
Nov 2001-20193.09 5.12 Nov 2001-2019 5.369271 2.171475
Dec 2001-20193.89 4.17 Dec 2001-2019 8.13266612.947003

You’ll notice that this created two new columns “Month1” and “Period1” because they share the same name in both datasets. To account for this you can use the full join function. Using the “by=” argument, you can join columns that have the same name and values.

full_join(clean_data_wide,lake_data_wide, by=c("Month","Period"))
Warning message:
"Column `Month` joining factors with different levels, coercing to character vector"
MonthPeriodSimiyuVictoriaSeaHawkRandall
Jan 2001-20192.91 3.18 3.12599010.915324
Feb 2001-20191.80 3.48 12.441331 5.148206
Mar 2001-20192.98 4.69 3.230104 5.982370
Apr 2001-20194.75 7.00 10.907835 6.179094
May 2001-20194.08 9.36 9.674827 8.532295
Jun 2001-20191.05 3.43 8.033946 6.132740
Jul 2001-20190.20 1.76 12.31855814.437468
Aug 2001-20190.33 2.81 6.55728611.269315
Sep 2001-20191.21 3.98 6.05799415.059383
Oct 2001-20192.45 5.32 12.04016410.124065
Nov 2001-20193.09 5.12 5.369271 2.171475
Dec 2001-20193.89 4.17 8.13266612.947003

Always read any warning messages that R outputs. You’ll notice that because the month columns have different levels, R automatically converts it to a character vector. In this case, make sure you refactor your data if necessary.

There are many other functions that dplyr offers. I would recommend exploring different functions to learn how to properly manipulate your data. In the next module we will learn all about t-tests!