Data wrangling in Python seems clunky. Yet, it doesn’t have to be. Here is how to pipe in Python.
The problem
In R, we process data beautifully with dplyr and %>%:
suppressMessages(library(dplyr))
iris %>%
  mutate(sepal_ratio = Sepal.Width / Sepal.Length) %>%
  filter(sepal_ratio > 0.5) %>%
  select(Species, sepal_ratio) %>%
  group_by(Species) %>%
  summarise(mean_sepal_ratio = mean(sepal_ratio))
In Python’s pandas, most data wrangling code looks far less elegant, often like this:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
iris = pd.read_csv(url)
iris["sepal_ratio"] = iris["sepal_width"] / iris["sepal_length"]
iris = iris[iris["sepal_ratio"] > 0.5] # filter
iris = iris[["species", "sepal_ratio"]] # select
iris = iris.groupby("species") # group by
iris = iris.agg({"sepal_ratio": "mean"}) # summarise
iris
This is hard to read because there’s lots of repetition: the name iris shows up again and again, and every step has to be assigned back to it.
A new hope
But don’t despair! We can pipe in Python, too. Here is how:
iris = pd.read_csv(url)
(iris
  .assign(sepal_ratio = lambda x: x.sepal_width / x.sepal_length)
  .query("sepal_ratio > 0.5")
  .loc[:, ["species", "sepal_ratio"]]
  .groupby("species")
  .agg({"sepal_ratio": "mean"})
)
sepal_ratio
species
setosa 0.684248
versicolor 0.530953
virginica 0.526112
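One small difference from the dplyr version: there the result column is called mean_sepal_ratio, while .agg({"sepal_ratio": "mean"}) keeps the original column name. As a minimal sketch, pandas’ named aggregation can close that gap. The tiny data frame below is made up so the snippet runs without a download; it is not the real iris data.

```python
import pandas as pd

# Made-up stand-in for iris (illustrative values only, not the real data).
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "sepal_length": [5.0, 4.0, 4.0, 8.0],
    "sepal_width": [3.0, 3.0, 3.0, 2.0],
})

result = (toy
  .assign(sepal_ratio = lambda x: x.sepal_width / x.sepal_length)
  .query("sepal_ratio > 0.5")
  .groupby("species")
  # Named aggregation: new_column = (source_column, function)
  .agg(mean_sepal_ratio = ("sepal_ratio", "mean"))
)
print(result)
```

The keyword on the left becomes the new column name, much like the left-hand side in summarise().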
For now, there are only two main principles to remember:
- Use . instead of %>% to pipe. Put the . at the beginning of the line.
- Use () around the whole expression. Otherwise Python treats each line as a separate statement and raises a syntax error.
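To see why the parentheses matter, here is a small sketch. The data frame and its column names are made up for illustration, and compile() is used only to check whether Python can parse the unparenthesised version without actually running it.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})  # made-up data for illustration

# With parentheses, Python reads all lines as one expression.
piped = (df
  .assign(y = lambda d: d.x * 2)
  .query("y > 4")
)

# Without parentheses, the second line starts a new statement with a dot,
# which Python cannot parse.
broken = "df\n.assign(y = lambda d: d.x * 2)\n"
try:
    compile(broken, "<example>", "exec")
    parses = True
except SyntaxError:
    parses = False

print(piped)
print("Parses without parentheses:", parses)
```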
There is a pipeable method for most tasks. Sometimes, though, there isn’t, but you can still make it work.
- Use lambda to define a function on the fly (like in .assign() above).
- Use .pipe(), which takes a function as an argument and allows you to pipe it.
Here is an example of .pipe():
def sepal_ratio(df):
    return df.assign(sepal_ratio = df.sepal_width / df.sepal_length)
(iris
  .pipe(sepal_ratio)
  .query("sepal_ratio > 0.5")
  .loc[:, ["species", "sepal_ratio"]]
  .groupby("species")
  .agg({"sepal_ratio": "mean"})
)
sepal_ratio
species
setosa 0.684248
versicolor 0.530953
virginica 0.526112
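.pipe() also forwards any extra positional and keyword arguments to the function, so one helper can serve several pipelines. A minimal sketch, assuming a hypothetical helper called add_ratio and a made-up two-row data frame:

```python
import pandas as pd

def add_ratio(df, numerator, denominator, name="ratio"):
    """Hypothetical helper: add a column holding numerator / denominator."""
    return df.assign(**{name: df[numerator] / df[denominator]})

# Made-up data for illustration, not the real iris values.
toy = pd.DataFrame({
    "sepal_length": [5.0, 4.0],
    "sepal_width": [3.0, 1.0],
})

out = (toy
  # Everything after the function is passed on to add_ratio.
  .pipe(add_ratio, "sepal_width", "sepal_length", name="sepal_ratio")
  .query("sepal_ratio > 0.5")
)
print(out)
```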
That’s it! Some people don’t like piping because, they say, it’s harder to debug. I find the opposite: comment out the later steps and you can run the pipe one line at a time, which makes it very easy to debug. A pipe is also easy to read because it works like a recipe: start with a dataframe and change it step by step.
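Here is one way the debugging workflow might look. Commenting out the last steps shows the intermediate result, and a small helper passed to .pipe() can report what happens in between. The helper peek is hypothetical (it is not part of pandas), and the data is made up for illustration.

```python
import pandas as pd

def peek(df, label=""):
    """Hypothetical debugging helper: print the shape, pass the data on unchanged."""
    print(f"{label}: {df.shape[0]} rows, {df.shape[1]} columns")
    return df

# Made-up stand-in for iris (illustrative values only).
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "sepal_length": [5.0, 4.0, 4.0, 8.0],
    "sepal_width": [3.0, 3.0, 3.0, 2.0],
})

result = (toy
  .assign(sepal_ratio = lambda x: x.sepal_width / x.sepal_length)
  .pipe(peek, "after assign")
  .query("sepal_ratio > 0.5")
  .pipe(peek, "after query")
  # .groupby("species")              # commented out while debugging
  # .agg({"sepal_ratio": "mean"})
)
print(result)
```

Because peek returns the data frame untouched, it can be dropped in or removed without changing the result.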
Here’s a quick summary of the most common data wrangling tasks and their pipeable methods and functions in dplyr and pandas:
task | dplyr | pandas |
---|---|---|
filter rows | filter() | df.query() |
pick columns | select() | df.loc[] |
group by | group_by() | df.groupby() |
summarise | summarise() | df.agg() |
make new variable | mutate() | df.assign() |
join dfs | inner_join() | df.merge() |
sort df | arrange() | df.sort_values() |
rename columns | rename() | df.rename() |
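The last three rows of the table do not appear in the pipelines above, so here is a short sketch that chains them. Both tables and the color column are made up for illustration.

```python
import pandas as pd

# Made-up tables for illustration.
flowers = pd.DataFrame({
    "species": ["setosa", "virginica", "versicolor"],
    "sepal_ratio": [0.68, 0.53, 0.53],
})
colors = pd.DataFrame({
    "species": ["setosa", "versicolor"],
    "color": ["white", "blue"],
})

result = (flowers
  .merge(colors, on="species", how="inner")    # like inner_join()
  .sort_values("species", ascending=False)     # like arrange(desc(species))
  .rename(columns={"sepal_ratio": "ratio"})    # like rename()
  .reset_index(drop=True)
)
print(result)
```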
Last tip: AI is your friend. If you’re stuck, paste your pandas code into ChatGPT or GitHub Copilot and ask it to rewrite the code as a pipe. It seems to work pretty well.
Not so secret bonus: siuba
There is also the beautiful siuba package. If you come from R, this might be the way to go. But even if not, it’s still less verbose than pandas. Last time I tried it, the package didn’t quite have everything I needed, but I think it has grown a lot since then. Here is the same pipeline, but with siuba:
from siuba import _, mutate, filter, select, group_by, summarize
(iris
>> mutate(sepal_ratio = _.sepal_width / _.sepal_length)
>> filter(_.sepal_ratio > 0.5)
>> select(_.species, _.sepal_ratio)
>> group_by(_.species)
>> summarize(mean_sepal_ratio = _.sepal_ratio.mean())
)