Saying Hello to DataFrames.jl

ODSC - Open Data Science
Jun 2, 2021


A majority of data scientists use Python or R to perform data preparation tasks before moving on to modeling. The Julia language is a younger player in this field that promises to make the number-crunching-intensive parts of your pipelines fast. The question, however, is whether its data preparation tools also live up to this performance promise and are already mature enough.

In this post, I compare the performance of DataFrames.jl against data.table, Pandas, and Polars on two common tasks: data aggregation and joining tables.

I run the tests using DataFrames.jl 1.1.1, data.table 1.14.0, Pandas 1.2.4, and Polars 0.7.16. For all solutions, I use a single thread for easier comparability (it is also, I feel, a typical scenario for the laptop-oriented workflows of many users).
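As a side note (my own addition, not part of the benchmark scripts): Julia uses a single thread unless it is started with, e.g., julia --threads=4, so a minimal check that the session matches the setup described above could look like this:

julia> Threads.nthreads() # number of threads the Julia session was started with; 1 by default
1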

We start with Pandas:

>>> import numpy as np
>>> import pandas as pd
>>> import timeit
>>> df = pd.DataFrame({"id":np.random.randint(10**4, size=10**8),
... "v":np.random.rand(10**8)})
>>> t_start = timeit.default_timer()
>>> res = df.groupby("id").agg({"v":"sum"})
>>> t_stop = timeit.default_timer()
>>> t_stop - t_start
3.627904000000001
>>> weights = pd.DataFrame({"id":range(10**4),
... "w":np.random.rand(10**4)})
>>> t_start = timeit.default_timer()
>>> res = df.merge(weights, on="id")
>>> t_stop = timeit.default_timer()
>>> t_stop - t_start
18.2003761

As the df table has 10⁸ rows, the timings are not bad, but let us check whether we can do better by switching to data.table:

> library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
> setDTthreads(1)
> df <- data.table(id=sample(1:10^4, 10^8, replace=TRUE), v=runif(10^8))
> system.time(df[, sum(v), by=id])
user system elapsed
2.98 0.42 3.41
> weights <- data.table(id=1:10^4, w=runif(10^4))
> system.time(df[weights, on="id", nomatch=NULL])
user system elapsed
7.03 0.47 7.50

As you can see, data.table is slightly faster than Pandas on the aggregation task and much faster on the join task.

You are probably curious now how fast DataFrames.jl is. Here are the benchmarks:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(id=rand(1:10^4, 10^8), v=rand(10^8));

julia> @btime combine(groupby(df, :id), :v => sum);
494.459 ms (316 allocations: 763.21 MiB)

julia> weights = DataFrame(id=1:10^4, w=rand(10^4));

julia> @btime innerjoin(df, weights, on=:id);
3.739 s (239 allocations: 3.80 GiB)
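In case you are wondering how the two operations compose in DataFrames.jl, here is a minimal sketch (not benchmarked, and the :v_weighted column name is just my choice for the example) that joins the weights to the data and computes a weighted sum of v per id:

julia> joined = innerjoin(df, weights, on=:id);

julia> combine(groupby(joined, :id), [:v, :w] => ((v, w) -> sum(v .* w)) => :v_weighted);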

Finally, you might have read that Polars is a new player that is also decently fast (which indeed is the case), so I decided to check it as well. Unfortunately, I was unable to disable multi-threading to ensure comparability with Pandas (for some reason the POLARS_MAX_THREADS environment variable was ignored on my machine). The timings below are therefore for an 8-core machine:

>>> import numpy as np
>>> import polars as pl
>>> import timeit
>>> df = pl.DataFrame({"id":np.random.randint(10**4, size=10**8),
... "v":np.random.rand(10**8)})
>>> t_start = timeit.default_timer()
>>> res = df.groupby("id").agg(pl.col("v").sum())
>>> t_stop = timeit.default_timer()
>>> t_stop - t_start
5.560835634999421
>>> weights = pl.DataFrame({"id":range(10**4),
... "w":np.random.rand(10**4)})
>>> t_start = timeit.default_timer()
>>> res = df.join(weights, on="id")
>>> t_stop = timeit.default_timer()
>>> t_stop - t_start
3.7951884779995453

Even if we account for run-to-run variability, you can see that DataFrames.jl is quite fast on both tasks in this comparison.

If you feel intrigued, be sure to check out my talk, DataFrames.jl: a Perfect Sidekick for Your Next Data Science Project, during the upcoming ODSC Europe 2021 conference.

Before I finish, let me stress that with this post I do not want to claim that DataFrames.jl is uniformly the fastest across all usage scenarios. In general, however, you can expect DataFrames.jl to offer competitive performance in comparison with other mature and well-established packages for working with data frames.

https://odsc.com/europe/#register

About the author/ODSC Europe 2021 speaker on DataFrames.jl:

Bogumił Kamiński is Head of Decision Analysis and Support Unit at Warsaw School of Economics, Poland, and Adjunct Professor at Data Science Laboratory, Ryerson University, Canada. His research interests are techniques of large-scale mathematical modeling of complex systems combining simulation, optimization, and machine learning. A particular area of his expertise is agent-based simulation and modeling and analysis of complex networks.
