This article explains how to identify and remove duplicate data in R.
You’ll learn how to use the following R base and dplyr functions:
duplicated(): determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.
unique(): for extracting unique elements.
distinct() [dplyr package]: for keeping only unique/distinct rows from a data frame.
To use dplyr functions, you first need to load the tidyverse packages, which include dplyr:
> library(tidyverse)
We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.
> data <- as_tibble(iris)
> data
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
To find duplicate elements, we use the duplicated() function. It returns TRUE for every element that has already appeared earlier in the vector.
> x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)
> duplicated(x)
[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
To extract the duplicate elements, use that logical vector as an index:
> x[duplicated(x)]
[1] 1 4 3
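Note that duplicated() flags only the second and later occurrences of a value, not the first. As a minimal sketch, assuming you want every copy of a repeated value including its first occurrence, you can combine it with the fromLast = TRUE argument of duplicated():

```r
x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)

# Flag repeats scanning from the left and from the right, then combine them
all_copies <- duplicated(x) | duplicated(x, fromLast = TRUE)

x[all_copies]
# [1] 1 1 4 3 4 3
```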
Now, to remove the duplicated elements, we use !duplicated(), where ! is the logical negation (NOT) operator. This keeps only the first occurrence of each value:
> x[!duplicated(x)]
[1] 1 2 5 4 3 7
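The base function unique() mentioned at the start gives the same result for a vector in one call. A short sketch comparing the two approaches on the same vector:

```r
x <- c(1, 1, 2, 5, 4, 3, 4, 7, 3)

# unique() keeps the first occurrence of each value, in order of appearance
unique(x)
# [1] 1 2 5 4 3 7

# Same result as indexing with !duplicated()
identical(unique(x), x[!duplicated(x)])
# [1] TRUE
```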
The same approach removes duplicate rows from a data frame based on the values of one column. Only the first row for each distinct value of that column is kept.
# Remove rows with a duplicated Sepal.Width value from the iris data set
> data[!duplicated(data$Sepal.Width),]
# A tibble: 23 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 4.4 2.9 1.4 0.2 setosa
9 5.4 3.7 1.5 0.2 setosa
10 5.8 4 1.2 0.2 setosa
# ... with 13 more rows
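duplicated() also accepts a whole data frame, or a subset of its columns, in which case rows are compared as a unit. The sketch below uses a small made-up data frame (df and its columns id, grp and val are hypothetical names, chosen only for illustration):

```r
# Hypothetical toy data frame, for illustration only
df <- data.frame(
  id  = c(1, 1, 2, 2, 3),
  grp = c("a", "a", "a", "b", "b"),
  val = c(10, 10, 20, 30, 30)
)

# Whole-row duplicates: only row 2 repeats an earlier row exactly
full_dups <- duplicated(df)

# Duplicates judged on the grp and val columns only
key_dups <- duplicated(df[, c("grp", "val")])

dedup_full <- df[!full_dups, ]  # 4 rows remain
dedup_key  <- df[!key_dups, ]   # 3 rows remain
```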
The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().
> data %>% distinct()
# A tibble: 149 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 139 more rows
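The result has 149 rows rather than 150 because iris contains one row that exactly repeats an earlier one. As a rough sketch in base R, you can inspect which row was dropped, and the earlier row it matches:

```r
# Index of rows in the built-in iris data that exactly repeat an earlier row
dup_idx <- which(duplicated(iris))

# The repeated row itself
iris[dup_idx, ]

# The earlier row it duplicates, found by scanning from the end
iris[duplicated(iris, fromLast = TRUE), ]
```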
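To remove duplicate rows based on a single column, pass the column name to distinct(). The option .keep_all = TRUE keeps all the other columns in the output:
> data %>% distinct(Sepal.Length, .keep_all = TRUE)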
# A tibble: 35 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.4 2.9 1.4 0.2 setosa
8 4.8 3.4 1.6 0.2 setosa
9 4.3 3 1.1 0.1 setosa
10 5.8 4 1.2 0.2 setosa
# ... with 25 more rows
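distinct() also accepts several column names, in which case rows are compared on the combination of those columns. A minimal sketch on a made-up tibble (df, id, grp and val are hypothetical names used only for illustration):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
df <- tibble(
  id  = c(1, 1, 2, 2, 3),
  grp = c("a", "a", "a", "b", "b"),
  val = c(10, 10, 20, 30, 30)
)

# Keep the first row for each (grp, val) combination, with all columns
by_key <- df %>% distinct(grp, val, .keep_all = TRUE)

# Without .keep_all, only the columns named in distinct() are returned
keys_only <- df %>% distinct(grp, val)
```

Here by_key has three rows and all three columns, while keys_only has the same three rows but only the grp and val columns.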
To summarize, we covered how to find and remove duplicate elements from vectors and data frames using the base functions duplicated() and unique() and the dplyr function distinct().
This brings us to the end of this blog. We really appreciate your time.
Hope you liked it.
Do visit our page www.zigya.com/blog for more informative blogs on Data Science.
Keep Reading! Cheers!
Zigya Academy