Data Manipulation Using the ‘dplyr’ Package


Data manipulation is an important part of any form of data analysis. When we talk about manipulating data, we mean performing some operation on a dataset we loaded or on sample data we created in R. Manipulating data is important because it helps us understand the data in hand before we do any modelling or visualization. In this article, I will explain how to use the dplyr package effectively to understand and manipulate data. The package works on data frames and data-frame-like objects. Let us see how dplyr is used with the help of the ‘murders’ dataset. So let’s start by installing the package and loading the dataset using the code below.

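install.packages("dplyr")

library(dplyr)

# Assumption: murders is the dataset of that name shipped with the dslabs package (install dslabs first if needed)
data(murders, package = "dslabs")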

Let us get an idea of the murders data by looking at its structure and its first 6 rows.

str(murders)

head(murders)

The murders dataset contains 51 rows, one per state (the 50 states plus the District of Columbia). For each state it holds the name, abbreviation, region, total population and total number of murders committed.

The dplyr package comes with 5 main data manipulation commands, plus the pipe operator that ties them together. We will look at them one by one.

1. Mutate- The mutate function is used to add a new variable to an existing data frame or data table. We will use it to find the murder rate in each state and add that variable, rate, to the dataset. The mutate function takes the dataset as the first argument and the name and value of the new variable as the second argument.

murders <- mutate(murders, rate = total / population * 100000)

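Here we manipulated the murders dataset by defining a new variable, rate, calculated as the total number of murders divided by the total population and multiplied by 100,000, which gives the number of murders per 100,000 people. The mutate function returns the data with this variable added, and we assign the result back to murders. Calling head(murders) again shows the new rate column alongside the original ones.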
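To make the idea concrete without depending on the real data, here is a minimal sketch on a small made-up data frame. The name toy and its values are hypothetical and only serve as illustration. It also shows that one mutate call can create several variables, and that a later variable can use one defined earlier in the same call.

```r
library(dplyr)

# Hypothetical toy data, NOT the real murders dataset
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  population = c(1000000, 2000000, 500000, 4000000),
  total = c(10, 50, 2, 120),
  stringsAsFactors = FALSE
)

# Add two variables at once; high_rate uses the rate column created just before it
toy_rates <- mutate(
  toy,
  rate = total / population * 100000,
  high_rate = rate > 2
)

print(toy_rates)
```

The original toy data frame is left unchanged; the new columns exist only in toy_rates.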

2. Filter- As the name suggests, the filter function keeps only the rows that satisfy a condition. The first argument, as always, is the dataset, followed by the condition as the second argument. For example, if we want to see the details of the states which have a murder rate of 0.70 or lower, we type the code below.

filter(murders, rate <= 0.70)

The output of this filter shows that there are 5 states which have a murder rate of 0.70 or lower.
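Filter conditions can also be combined. The sketch below uses a hypothetical toy data frame (made-up names and values, not the murders data): conditions separated by commas must all hold, while | keeps rows where at least one holds.

```r
library(dplyr)

# Hypothetical toy data with a precomputed rate column
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("North", "North", "South", "South"),
  rate = c(1.0, 2.5, 0.4, 3.0),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND
low_north <- filter(toy, rate <= 1, region == "North")

# Use | when either condition is enough (OR)
low_or_south <- filter(toy, rate <= 1 | region == "South")

print(low_north)
print(low_or_south)
```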

3. Select- If we want to work with fewer columns, we use the select function. In the murders dataset, there are only 6 variables to work with. But what if a dataset contains more than 50 or even 100 columns and not all of them are required for the analysis? The select function lets us pick just the columns we need. As with the other functions, the first argument is the dataset, followed by the names of the columns we want to work with, as shown in the code below.

new_table <- select(murders, state, region, rate)

Here we only want to work with 3 variables, namely state, region and rate. We stored that data in a new table (new_table) so that we can manipulate the new table without disturbing the original dataset (murders). The result keeps all 51 rows but only the state, region and rate columns.
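Listing every column by name gets tedious on wide tables, so select also accepts a few shortcuts. The sketch below, on a hypothetical toy data frame, shows three of them: dropping columns with a minus sign, selecting a range with a colon, and the starts_with helper.

```r
library(dplyr)

# Hypothetical toy data, NOT the real murders dataset
toy <- data.frame(
  state = c("A", "B"),
  region = c("North", "South"),
  population = c(1000000, 500000),
  total = c(10, 2),
  rate = c(1.0, 0.4),
  stringsAsFactors = FALSE
)

# Drop columns instead of listing the ones to keep
no_counts <- select(toy, -population, -total)

# Keep a contiguous range of columns, from state to region
first_two <- select(toy, state:region)

# Keep columns whose names start with "r"
r_cols <- select(toy, starts_with("r"))

print(names(no_counts))
print(names(first_two))
print(names(r_cols))
```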

4. Summarize- As the word suggests, this function collapses the data into a summary value, such as the mean of some variable. It can also be used to compute counts, maximums or minimums. Let us try this command on the murders dataset.

summarize(murders, mean(population))

summarize(murders, max(rate))

summarize(murders, min(rate))

summarize(murders, n())

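Each call returns a one-row data frame holding the requested value: the mean population, the highest murder rate, the lowest murder rate and the number of rows, respectively.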
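The four calls can also be combined into one, and naming each summary gives the result readable column names. Here is a minimal sketch on a hypothetical toy data frame; the column names mean_pop, max_rate, min_rate and n_states are my own choices for illustration.

```r
library(dplyr)

# Hypothetical toy data, NOT the real murders dataset
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  population = c(1000000, 2000000, 500000, 4000000),
  rate = c(1.0, 2.5, 0.4, 3.0),
  stringsAsFactors = FALSE
)

# Several named summaries in a single call give one row with four columns
overview <- summarize(
  toy,
  mean_pop = mean(population),
  max_rate = max(rate),
  min_rate = min(rate),
  n_states = n()
)

print(overview)
```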

5. Group By- The group_by function is used in conjunction with summarize when we need to split the data into groups based on some variable. In the previous section, every summary was calculated over the whole dataset. Now, if we want to find the mean of the variable total, which holds the total number of murders in each state, separately for each region, we have to use the code below.

sc <- group_by(murders, region)

summarize(sc, mean(total))

Here we first grouped the murders data by region. Each of the 51 states belongs to exactly one of the 4 regions. Then, using the summarize command, we calculated the mean of the variable ‘total’, which represents the number of murders in each state, and because the data was grouped we get one mean per region. From the output, we can see that the states in the South region have reported a higher average number of murders than the other regions.
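The grouped summary can be sketched on a hypothetical toy data frame as well (made-up regions and counts). This version also names the summary columns and counts the rows in each group with n().

```r
library(dplyr)

# Hypothetical toy data, NOT the real murders dataset
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("North", "North", "South", "South"),
  total = c(10, 50, 2, 120),
  stringsAsFactors = FALSE
)

# Group first, then summarize: the result has one row per region
by_region <- summarize(
  group_by(toy, region),
  mean_total = mean(total),
  n_states = n()
)

print(by_region)
```

With this toy data the North group averages (10 + 50) / 2 = 30 and the South group (2 + 120) / 2 = 61.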

6. Pipe Operator (%>%)

The pipe operator is used when we want to feed the output of one function into another function as its input. If you look at the example provided for the group_by function, we created an intermediate variable called ‘sc’ and passed it as the argument to the summarize function. In other words, the output of group_by was fed as input into summarize. We can simplify that operation with the pipe operator. The code looks like this.

murders %>% group_by(region) %>% summarize(mean(total))

Here the murders dataset is fed as input into the group_by function. That function groups the data by region and passes it on to the summarize function, which calculates the mean of total for each region.

As you may have noticed, this is the same output we received from the group_by and summarize commands when we used an intermediate variable.
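If you want to convince yourself that the two styles are equivalent, the following sketch compares them on a hypothetical toy data frame. It relies on dplyr making %>% available once the package is loaded.

```r
library(dplyr)

# Hypothetical toy data, NOT the real murders dataset
toy <- data.frame(
  region = c("North", "North", "South", "South"),
  total = c(10, 50, 2, 120),
  stringsAsFactors = FALSE
)

# Style 1: intermediate variable
sc <- group_by(toy, region)
with_variable <- summarize(sc, mean_total = mean(total))

# Style 2: pipe, no intermediate variable
with_pipe <- toy %>%
  group_by(region) %>%
  summarize(mean_total = mean(total))

# Compare the two results
print(all.equal(with_variable, with_pipe))
```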

We can also chain the mutate and filter functions with the pipe operator.

murders %>% mutate(rate = total / population * 100000) %>% filter(rate <= 0.71)

The above command passes the murders data into the mutate function, which adds the new variable rate, and then passes the result into filter, which keeps only the rows with rate <= 0.71.

One thing to note here is that when using the pipe operator we don’t have to pass the first argument, the name of the dataset, into each function. This is because the pipe hands the result on its left to the function on its right as that function’s first argument, so we don’t have to specify the dataset explicitly in mutate, filter, group_by or summarize.
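Putting the pieces together, here is a sketch of a longer pipeline on a hypothetical toy data frame. The threshold of 0.5 and all names and values are made up for illustration. The dataset is named once at the top, and every later step receives the previous result as its first argument.

```r
library(dplyr)

# Hypothetical toy data, NOT the real murders dataset
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("North", "North", "South", "South"),
  population = c(1000000, 2000000, 500000, 4000000),
  total = c(10, 50, 2, 120),
  stringsAsFactors = FALSE
)

result <- toy %>%
  mutate(rate = total / population * 100000) %>%  # add murders per 100,000 people
  filter(rate > 0.5) %>%                          # drop the low-rate state C
  group_by(region) %>%                            # one group per region
  summarize(mean_rate = mean(rate), n_states = n())

print(result)
```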

There are many other functions in the dplyr package which can be used for data exploration or manipulation, but for an initial analysis the functions above are the ones you will use most.

 

Vishnu Venugopal

 

 
