library(tidyverse)
library(reticulate)
One of the most fundamental skillset in data science is the ability to pre process data and get it into a format that is suitable for analysis. In R, the tidyverse
package has made this process very easy and intuitive. In Python, the pandas
package is the equivalent of the dplyr
package.
In this post, I will go through some of the most common data manipulation tasks in pandas
from the perspective of an R user learning Pandas. I will also be using the reticulate
package to run Python code in R.
Let’s start off with reading the palmerpenguins dataset found in the below github link, and use pandas read_csv
to read in the data.
import pandas as pd
= 'https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv'
url
= pd.read_csv(url) data
’Select’ing columns
The dplyr method of selecting columns uses the select
function that allows us to select by index or by column name.
We can access the data in the python environment by using the py
object within the R environment and accessing the name of the python object using the $
operator. The using the select
function by including unquoted column names
$data |>
pyas_tibble() |>
select(species, island)
# A tibble: 344 × 2
species island
<chr> <chr>
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
5 Adelie Torgersen
6 Adelie Torgersen
7 Adelie Torgersen
8 Adelie Torgersen
9 Adelie Torgersen
10 Adelie Torgersen
# ℹ 334 more rows
The pandas equivalent uses the loc
method. The loc
method allows us to select by name. The first argument is the row and the second argument is the column.
'species', 'island']] data.loc[:,[
species island
0 Adelie Torgersen
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
.. ... ...
339 Chinstrap Dream
340 Chinstrap Dream
341 Chinstrap Dream
342 Chinstrap Dream
343 Chinstrap Dream
[344 rows x 2 columns]
On the same note, we can also use the iloc
method to select by index instead of name.
1:3] data.iloc[:,
species island
0 Adelie Torgersen
1 Adelie Torgersen
2 Adelie Torgersen
3 Adelie Torgersen
4 Adelie Torgersen
.. ... ...
339 Chinstrap Dream
340 Chinstrap Dream
341 Chinstrap Dream
342 Chinstrap Dream
343 Chinstrap Dream
[344 rows x 2 columns]
’Filter’ing rows
The dplyr method of filtering rows uses the filter
function that allows us to filter by a logical condition. Let’s define a condition to filter the data by the species Adelie
.
$data |>
pyas_tibble() |>
filter(species == 'Adelie')
# A tibble: 152 × 9
rowid species island bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 Adelie Torgersen 39.1 18.7 181
2 2 Adelie Torgersen 39.5 17.4 186
3 3 Adelie Torgersen 40.3 18 195
4 4 Adelie Torgersen NaN NaN NaN
5 5 Adelie Torgersen 36.7 19.3 193
6 6 Adelie Torgersen 39.3 20.6 190
7 7 Adelie Torgersen 38.9 17.8 181
8 8 Adelie Torgersen 39.2 19.6 195
9 9 Adelie Torgersen 34.1 18.1 193
10 10 Adelie Torgersen 42 20.2 190
# ℹ 142 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
The pandas equivalent uses the query
method. The query
method allows us to filter by a logical condition, similiar to dplyr’s filter
function.
'species == "Adelie"') data.query(
rowid species island ... body_mass_g sex year
0 1 Adelie Torgersen ... 3750.0 male 2007
1 2 Adelie Torgersen ... 3800.0 female 2007
2 3 Adelie Torgersen ... 3250.0 female 2007
3 4 Adelie Torgersen ... NaN NaN 2007
4 5 Adelie Torgersen ... 3450.0 female 2007
.. ... ... ... ... ... ... ...
147 148 Adelie Dream ... 3475.0 female 2009
148 149 Adelie Dream ... 3450.0 female 2009
149 150 Adelie Dream ... 3750.0 male 2009
150 151 Adelie Dream ... 3700.0 female 2009
151 152 Adelie Dream ... 4000.0 male 2009
[152 rows x 9 columns]
Similarly, we can also use the loc
method to filter by a logical condition.
'species'] == 'Adelie'] data.loc[data[
rowid species island ... body_mass_g sex year
0 1 Adelie Torgersen ... 3750.0 male 2007
1 2 Adelie Torgersen ... 3800.0 female 2007
2 3 Adelie Torgersen ... 3250.0 female 2007
3 4 Adelie Torgersen ... NaN NaN 2007
4 5 Adelie Torgersen ... 3450.0 female 2007
.. ... ... ... ... ... ... ...
147 148 Adelie Dream ... 3475.0 female 2009
148 149 Adelie Dream ... 3450.0 female 2009
149 150 Adelie Dream ... 3750.0 male 2009
150 151 Adelie Dream ... 3700.0 female 2009
151 152 Adelie Dream ... 4000.0 male 2009
[152 rows x 9 columns]
Reuse
Citation
@online{luu2024,
author = {Luu, Michael},
title = {Pandas {Practice} {From} the {Perspective} of an {R} {User}
{(Part} 1)},
date = {2024-01-29},
langid = {en}
}