Pandas Practice From the Perspective of an R User

python
pandas
Author

Michael Luu

Published

January 29, 2024

library(tidyverse)
library(reticulate)

One of the most fundamental skillset in data science is the ability to pre process data and get it into a format that is suitable for analysis. In R, the tidyverse package has made this process very easy and intuitive. In Python, the pandas package is the equivalent of the dplyr package.

In this post, I will go through some of the most common data manipulation tasks in pandas from the perspective of an R user learning Pandas. I will also be using the reticulate package to run Python code in R.

Let’s start off with reading the palmerpenguins dataset found in the below github link, and use pandas read_csv to read in the data.

import pandas as pd 

url = 'https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv'

data = pd.read_csv(url)

’Select’ing columns

The dplyr method of selecting columns uses the select function that allows us to select by index or by column name.

We can access the data in the python environment by using the py object within the R environment and accessing the name of the python object using the $ operator. The using the select function by including unquoted column names

py$data |> 
  as_tibble() |>
  select(species, island)
# A tibble: 344 × 2
   species island   
   <chr>   <chr>    
 1 Adelie  Torgersen
 2 Adelie  Torgersen
 3 Adelie  Torgersen
 4 Adelie  Torgersen
 5 Adelie  Torgersen
 6 Adelie  Torgersen
 7 Adelie  Torgersen
 8 Adelie  Torgersen
 9 Adelie  Torgersen
10 Adelie  Torgersen
# ℹ 334 more rows

The pandas equivalent uses the loc method. The loc method allows us to select by name. The first argument is the row and the second argument is the column.

data.loc[:,['species', 'island']]
       species     island
0       Adelie  Torgersen
1       Adelie  Torgersen
2       Adelie  Torgersen
3       Adelie  Torgersen
4       Adelie  Torgersen
..         ...        ...
339  Chinstrap      Dream
340  Chinstrap      Dream
341  Chinstrap      Dream
342  Chinstrap      Dream
343  Chinstrap      Dream

[344 rows x 2 columns]

On the same note, we can also use the iloc method to select by index instead of name.

data.iloc[:, 1:3]
       species     island
0       Adelie  Torgersen
1       Adelie  Torgersen
2       Adelie  Torgersen
3       Adelie  Torgersen
4       Adelie  Torgersen
..         ...        ...
339  Chinstrap      Dream
340  Chinstrap      Dream
341  Chinstrap      Dream
342  Chinstrap      Dream
343  Chinstrap      Dream

[344 rows x 2 columns]

’Filter’ing rows

The dplyr method of filtering rows uses the filter function that allows us to filter by a logical condition. Let’s define a condition to filter the data by the species Adelie.

py$data |> 
  as_tibble() |>
  filter(species == 'Adelie')
# A tibble: 152 × 9
   rowid species island    bill_length_mm bill_depth_mm flipper_length_mm
   <dbl> <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1     1 Adelie  Torgersen           39.1          18.7               181
 2     2 Adelie  Torgersen           39.5          17.4               186
 3     3 Adelie  Torgersen           40.3          18                 195
 4     4 Adelie  Torgersen          NaN           NaN                 NaN
 5     5 Adelie  Torgersen           36.7          19.3               193
 6     6 Adelie  Torgersen           39.3          20.6               190
 7     7 Adelie  Torgersen           38.9          17.8               181
 8     8 Adelie  Torgersen           39.2          19.6               195
 9     9 Adelie  Torgersen           34.1          18.1               193
10    10 Adelie  Torgersen           42            20.2               190
# ℹ 142 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

The pandas equivalent uses the query method. The query method allows us to filter by a logical condition, similiar to dplyr’s filter function.

data.query('species == "Adelie"')
     rowid species     island  ...  body_mass_g     sex  year
0        1  Adelie  Torgersen  ...       3750.0    male  2007
1        2  Adelie  Torgersen  ...       3800.0  female  2007
2        3  Adelie  Torgersen  ...       3250.0  female  2007
3        4  Adelie  Torgersen  ...          NaN     NaN  2007
4        5  Adelie  Torgersen  ...       3450.0  female  2007
..     ...     ...        ...  ...          ...     ...   ...
147    148  Adelie      Dream  ...       3475.0  female  2009
148    149  Adelie      Dream  ...       3450.0  female  2009
149    150  Adelie      Dream  ...       3750.0    male  2009
150    151  Adelie      Dream  ...       3700.0  female  2009
151    152  Adelie      Dream  ...       4000.0    male  2009

[152 rows x 9 columns]

Similarly, we can also use the loc method to filter by a logical condition.

data.loc[data['species'] == 'Adelie']
     rowid species     island  ...  body_mass_g     sex  year
0        1  Adelie  Torgersen  ...       3750.0    male  2007
1        2  Adelie  Torgersen  ...       3800.0  female  2007
2        3  Adelie  Torgersen  ...       3250.0  female  2007
3        4  Adelie  Torgersen  ...          NaN     NaN  2007
4        5  Adelie  Torgersen  ...       3450.0  female  2007
..     ...     ...        ...  ...          ...     ...   ...
147    148  Adelie      Dream  ...       3475.0  female  2009
148    149  Adelie      Dream  ...       3450.0  female  2009
149    150  Adelie      Dream  ...       3750.0    male  2009
150    151  Adelie      Dream  ...       3700.0  female  2009
151    152  Adelie      Dream  ...       4000.0    male  2009

[152 rows x 9 columns]

Reuse

Citation

BibTeX citation:
@online{luu2024,
  author = {Luu, Michael},
  title = {Pandas {Practice} {From} the {Perspective} of an {R} {User}
    {(Part} 1)},
  date = {2024-01-29},
  langid = {en}
}
For attribution, please cite this work as:
Luu, Michael. 2024. “Pandas Practice From the Perspective of an R User (Part 1).” January 29, 2024.