Dask-expr is an ongoing effort to add a logical query optimization layer to Dask DataFrames. We now have the first benchmark results to share, run against the current DataFrame implementation. | Blog
Distributed computing is hard; distributed debugging is even harder. Dask tries to simplify this process as much as possible. Coiled adds additional observability features for your Dask clusters and processes this information to help users understand their workflows better.
While it’s trivial to measure the end-to-end runtime of a Dask workload, the next logical step - breaking down this time to understand whether it could be faster - has historically been a much more arduous task that required a lot of intuition and legwork, for novice and expert users alike. We wanted to change that. [Image: populated Fine Performance Metrics dashboard]
Dask DataFrame doesn’t currently optimize your code for you (like Spark or a SQL database would), which means that users waste a lot of computation. Let’s look at a common example that looks fine at first glance but is actually pretty inefficient.
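To illustrate the kind of inefficiency a query optimizer removes, here is a hypothetical pandas-flavored sketch (the column names and data are invented): selecting only the columns you need before filtering avoids dragging unused data through the computation, which is exactly the sort of rewrite a logical optimizer like dask-expr is meant to apply automatically.

```python
import pandas as pd

# Hypothetical wide table where the computation only needs two columns.
df = pd.DataFrame({
    "id": range(6),
    "value": [10, 20, 30, 40, 50, 60],
    "unused_a": 0,  # columns the computation never touches
    "unused_b": 0,
})

# Naive version: carry every column through the filter, then select.
naive = df[df["value"] > 25][["id", "value"]]

# Hand-optimized version: project the needed columns first, then filter.
# A query optimizer would rewrite the naive form into this automatically,
# so far less data is moved and processed.
optimized = df[["id", "value"]][df["value"] > 25]

assert naive.equals(optimized)  # same result, less work on wide data
```

On a toy in-memory frame the difference is negligible, but on a 100 GB partitioned dataset, projecting two columns instead of dozens changes how much data is read and shuffled.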
Patrick Hoefler, Hendrik Makait
Patrick Hoefler
Dask makes it easy to print, whether you’re running code locally on your laptop or remotely on a cluster in the cloud. [Image: print output in worker logs]
Hendrik Makait, 2023-05-16
Miles Granger
At Coiled we develop Dask and automatically deploy it to large clusters of cloud workers (sometimes 1000+ EC2 instances at once!). In order to avoid surprises when we publish a new release, Dask needs to be covered by a comprehensive battery of tests, both for functionality and performance. [Image: nightly tests report]
Dask has deep integrations with other libraries in the PyData ecosystem like NumPy, pandas, Zarr, PyArrow, and more. Part of providing a good experience for Dask users is making sure that Dask continues to work well with this community of libraries as they push out new releases. This post walks through how Dask maintainers proactively ensure Dask continuously works with its surrounding ecosystem.
Hendrik Makait
There have been a number of engineering improvements to Dask Array, like consistent chunksizes in Xarray rolling constructs and improved efficiency in map_overlap. Notably, as of Dask version 2024.11.2, calculating quantiles is much faster and more reliable. Calculating Quantiles with Xarray: Calculating quantiles is a common operation for geospatial …| Patrick Hoefler
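At its core, the quantile operation that Xarray dispatches is a quantile reduction along one axis of an array. A minimal NumPy sketch of that shape of computation (the array, seed, and axis here are made up for illustration; real workloads use chunked Dask arrays wrapped by Xarray):

```python
import numpy as np

# Toy stand-in for a (time, y, x) geospatial stack.
rng = np.random.default_rng(42)
data = rng.random((100, 4, 4))  # 100 time steps on a 4x4 grid

# 90th percentile over the time axis: one value per grid cell.
q90 = np.quantile(data, 0.9, axis=0)

print(q90.shape)  # (4, 4)
```

The hard part at scale is that the time axis is usually split across many chunks, so the reduction has to be assembled across workers; that is where the Dask-side improvements land.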
Running large-scale GroupBy-Map patterns with Xarray that are backed by Dask arrays is an essential part of a lot of typical geospatial workloads. Detrending is a very common operation where this pattern is needed. In this post, we will explore how and why this caused so many pitfalls for Xarray …| Patrick Hoefler
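The GroupBy-Map pattern itself is easy to see in plain pandas, used here as a small stand-in for the Xarray/Dask version (the column names and values are invented): group by a key, map a function over each group, and reassemble. Subtracting each group’s mean is a simple form of detrending.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 1, 2, 2, 1, 2],
    "value": [10.0, 12.0, 20.0, 24.0, 11.0, 22.0],
})

# GroupBy-Map: subtract each group's mean from its members.
# In Xarray this pattern would look like ds.groupby(...).map(detrend_fn).
detrended = df.groupby("month")["value"].transform(lambda s: s - s.mean())

# Each group is now centered on zero.
print(detrended.groupby(df["month"]).mean())
```

With Dask-backed Xarray the same pattern gets tricky because each group’s members can be scattered across many chunks, which is what makes the large-scale version pitfall-prone.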
Intro: Dask DataFrame scales out pandas DataFrames to operate at the 100GB-100TB scale. Historically, Dask was pretty slow compared to other tools in this space (like Spark). Due to a number of improvements focused on performance, it's now pretty fast (about 20x faster than before). The new implementation moved Dask …| Patrick Hoefler
Get the most out of PyArrow support in pandas and Dask right now| phofl.github.io
Getting notified of a significant performance regression the day before release sucks, but quickly identifying and resolving it feels great!| phofl.github.io