Is AWS Batch right for data professionals? While AWS Batch is a powerful service, it requires significant infrastructure expertise that may not align with data science workflows. Discover alternatives like Coiled that prioritize ease of use for Python workloads, GPU jobs, and parallel computing.| Coiled
Looking for an AWS Batch alternative? Learn why AWS Batch can be overkill for simple Python scripts and parallel workloads. Discover easier solutions like Coiled Batch for GPU jobs, Spot instances, and embarrassingly parallel tasks, without containers or complex setups.| Coiled
Learn how to manage long-running Lambda tasks, the challenges they pose, and the workarounds for timeout and resource limits.| Coiled
Code snippet of using the coiled.function decorator to run a query with Polars on a large VM in the cloud| Blog
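A sketch of what such a snippet might look like: coiled.function and its vm_type argument are real, but the instance type, bucket path, and query below are purely illustrative.

```python
import coiled
import polars as pl

@coiled.function(vm_type="m6i.16xlarge")  # large VM; instance type is illustrative
def run_query(path: str) -> pl.DataFrame:
    # Lazily scan Parquet in object storage and aggregate on the remote VM
    return (
        pl.scan_parquet(path)
        .group_by("id")
        .agg(pl.col("value").sum())
        .collect()
    )

result = run_query("s3://my-bucket/data/*.parquet")  # hypothetical bucket
```

Only the small result travels back to your machine; the heavy scan happens next to the data.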
You can train scikit-learn models in parallel using the scikit-learn joblib interface. This allows scikit-learn to take full advantage of the multiple cores in your machine (or, spoiler alert, on your cluster) and speed up training.| Blog
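A rough sketch of the pattern (the dataset and estimator are arbitrary examples, not from the post):

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=20)

# n_jobs=-1 tells joblib to use every core on this machine
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
clf.fit(X, y)

# The spoiler: the same estimator can fan out to a Dask cluster
# via joblib's "dask" backend, registered when distributed is imported
client = Client()  # local cluster here; could be a remote/Coiled cluster
with joblib.parallel_backend("dask"):
    clf.fit(X, y)
```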
GitHub Actions lets you launch automated jobs from your GitHub repository. Coiled lets you run your Python code in the cloud. Combining the two gives you lightweight workflow orchestration for heavy...| Coiled
Run Python scripts in the cloud with AWS EC2. Explore step-by-step setup and how to simplify scaling Python workloads with less manual effort.| Coiled
AWS Lambda doesn’t support GPUs, but Coiled does. Run serverless GPU workloads for Python in your AWS account with zero infrastructure setup or timeouts.| Coiled
James Bourbeau, Matt Rocklin | 2024-09-09 | 3 min read. TL;DR: We need your help creating a geospatial benchmark suite. Please propose your workload on this GitHub discussion. People love the Xarray/Dask...| Coiled
This blog post explains how to write Parquet files with Dask using the to_parquet method.| Blog
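For example (a toy DataFrame; the output directory is arbitrary):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"letter": ["a", "b", "c", "d"], "number": [1, 2, 3, 4]})
ddf = dd.from_pandas(df, npartitions=2)

# to_parquet writes one Parquet file per partition into the output directory
ddf.to_parquet("outdir/")
```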
Kubernetes is great if you need to organize many always-on services and have in-house expertise, but can add an extra burden and abstraction when deploying a single bursty service like Dask, especially in a user environment with quickly changing needs.| Blog
Running PyData libraries on an Apple M1 machine requires you to use an ARM64-tailored version of conda-forge. This article provides a step-by-step guide to setting that up on your machine using mambaforge.| Blog
Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for Dask users, addressing:| Blog
TL;DR: unmanaged memory is RAM that the Dask scheduler is not directly aware of; it can cause workers to run out of memory and computations to hang or crash.| Blog
Updated April 18th, 2024: For Dask versions >= 2024.3.0, Dask will perform a number of the optimizations discussed in this blog post automatically. See this demo for more details.| Blog
Apache Spark has long been a popular tool for handling petabytes of data in analytics applications. It’s often used for big data and machine learning, and most organizations use it with cloud infrastructure to run models and build algorithms. Spark is no doubt a fast analytical tool that provides high-speed queries for large datasets, but recent client testimonials tell us that Dask is even faster. So, what should you keep in mind when moving from Spark to Dask?| Blog
Snowflake is a leading cloud data platform and SQL database. Many companies store their data in a Snowflake database.| Blog
This post lays out the different stages of openness in Open Source Software (OSS) and the benefits and costs of each.| Blog
This post demonstrates how to change a DataFrame index with set_index and explains when you should perform this operation.| Blog
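A minimal sketch (column names are made up for illustration):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"ts": pd.date_range("2024-01-01", periods=8, freq="D"), "x": range(8)}),
    npartitions=4,
)

# set_index shuffles and sorts the whole DataFrame by the new index,
# which is expensive, so do it once and reuse the result
ddf = ddf.set_index("ts")
```

A sorted index then makes index-based lookups and time-based joins much cheaper downstream.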
Alex Egg, Senior Data Scientist at Grubhub, joins Matt Rocklin and Hugo Bowne-Anderson to talk and code about how Dask and distributed compute are used throughout the user intent classification pipeline at Grubhub!| Blog
Data Scientists are increasingly using Python and the Python ecosystem of tools for their analysis. Combined with the growing popularity of big data, this brings the challenge of scaling data science workflows. Dask is a library built for this exact purpose: making it easy to scale your Python code and serving as a toolbox for distributed computing!| Blog
The cloud is wonderful but expensive.| Blog
This article explains how to redistribute data among partitions in a Dask DataFrame with repartitioning…| Blog
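For instance (partition counts and sizes here are arbitrary examples):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=100)

# Collapse many small partitions into fewer, larger ones
ddf = ddf.repartition(npartitions=10)

# Or target an approximate partition size instead of a count
ddf = ddf.repartition(partition_size="100MB")
```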
There’s a saying in emergency response: “slow is smooth, smooth is fast”.| Blog
Columns in Dask DataFrames are typed, which means they can only hold certain values (e.g. integer columns can’t hold string values). This post gives an overview of DataFrame datatypes (dtypes), explains how to set dtypes when reading data, and shows how to change column types.| Blog
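A short sketch of both approaches (the file path and column names are hypothetical):

```python
import dask.dataframe as dd

# Declare dtypes up front so Dask doesn't have to infer them from a sample
ddf = dd.read_csv(
    "data/*.csv",  # hypothetical path
    dtype={"user_id": "int64", "amount": "float64", "city": "object"},
)

# Or convert a column after the fact with astype
ddf["city"] = ddf["city"].astype("string")
```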
pandas 2.0 has been released! 🎉| Blog
Many people say the following to me:| Blog
This blog explains how to perform a spatial join in Python. Knowing how to perform a spatial join is an important asset in your data-processing toolkit: it enables you to join two datasets based on spatial predicates. For example, you can join a point-based dataset with a polygon-based dataset based on whether the points fall within the polygon.| Blog
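A minimal sketch with GeoPandas, assuming hypothetical point and polygon files:

```python
import geopandas as gpd

points = gpd.read_file("points.geojson")     # hypothetical point dataset
polygons = gpd.read_file("regions.geojson")  # hypothetical polygon dataset

# Keep only points that fall within a polygon,
# attaching that polygon's columns to each matching point
joined = gpd.sjoin(points, polygons, how="inner", predicate="within")
```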
Featured Posts: Geospatial Benchmarks (building a large-scale geo benchmark suite): Large Scale Geospatial Benchmarks: First Pass. GPUs (easy access to GPUs on the cloud): Machine Learning with Coiled W...| Coiled
Historically, out-of-memory errors and excessive memory requirements have frequently been a pain point for Dask users. Two of the main causes of memory-related headaches are data duplication and imbalance between workers.| Blog
This post demonstrates how to merge Dask DataFrames and discusses important considerations when making large joins.| Blog
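For example (toy frames; real large-to-large joins trigger a shuffle, which is the expensive part):

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(
    pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]}), npartitions=2
)
right = dd.from_pandas(
    pd.DataFrame({"key": [2, 3, 4], "b": [20.0, 30.0, 40.0]}), npartitions=2
)

# Joins on the index or on sorted columns avoid most of the shuffle cost
merged = dd.merge(left, right, on="key", how="inner")
```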
In this post, we will cover:| Blog
While running workloads to test Dask reliability, we noticed that some workers were freezing or dying when the OS stepped in and started killing processes when the system ran out of memory.| Blog
This post explains how to filter Dask DataFrames based on the DataFrame index and on column values using loc.| Blog
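A quick sketch of both styles of loc filtering (toy data):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"x": range(10), "y": list("aabbccddee")}),
    npartitions=2,
)

filtered = ddf.loc[ddf["x"] > 5]  # filter on column values
subset = ddf.loc[2:7]             # slice on the (sorted) index
print(filtered.compute())
```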
Along with the Cloud SaaS product, Coiled sells enterprise support for Dask. Mostly people buy this for these three things:| Blog
When you search for how to run a Python function in parallel, one of the first things that comes up is the multiprocessing module. The documentation describes parallelism in terms of processes versus threads and mentions it can side-step the infamous Python GIL (Global Interpreter Lock).| Blog
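The canonical multiprocessing example looks something like this:

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":  # required on platforms that spawn new processes
    with Pool(processes=4) as pool:
        # map distributes the inputs across 4 worker processes,
        # each with its own interpreter and its own GIL
        print(pool.map(square, range(10)))
```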
Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on Google Cloud, addressing:| Blog
Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on Azure, addressing:| Blog
Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on AWS, addressing:| Blog
Dask is a general purpose library for parallel computing. Dask can be used on its own to parallelize Python code, or with integrations to other popular libraries to scale out common workflows.| Blog
The PyData stack for scientific computing in Python is an interoperable collection of tools for data analysis, visualization, statistical inference, machine learning, interactive computing and more that is used across all types of industries and academic research. Dask, the open source package for scalable data science that was developed to meet the needs of modern data professionals, was born from the PyData community and is considered foundational in this computational stack. This post desc...| Blog
This blog post explains how to read Parquet files into Dask DataFrames. Parquet is a columnar, binary file format that has multiple advantages when compared to a row-based file format like CSV. Luckily Dask makes it easy to read Parquet files into Dask DataFrames with read_parquet.| Blog
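For example (hypothetical path and column names):

```python
import dask.dataframe as dd

# Because Parquet is columnar, only the requested columns are read from disk
ddf = dd.read_parquet("data/", columns=["name", "amount"])  # hypothetical path
print(ddf.head())
```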
This post explains how to create disk-partitioned Parquet lakes using partition_on and how to read disk-partitioned lakes with read_parquet and filters. Disk partitioning can significantly improve performance when used correctly.| Blog
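A compact sketch of the round trip (toy data; the lake path is arbitrary):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"country": ["US", "DE", "US", "FR"], "value": [1, 2, 3, 4]}),
    npartitions=2,
)

# Writes lake/country=US/, lake/country=DE/, ... as separate directories
ddf.to_parquet("lake/", partition_on=["country"])

# Filters prune whole directories, so non-matching data is never read at all
us = dd.read_parquet("lake/", filters=[("country", "==", "US")])
```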
Coiled can often save money for an organization running Dask. This article goes through the most common ways in which we see that happen.| Blog
You can use Coiled, the cloud-based Dask platform, to easily convert large JSON data into a tabular DataFrame stored as Parquet in a cloud object-store. Start off by iterating with Dask locally first to build and test your pipeline, then transfer the same workflow to Coiled with minimal code changes. We demonstrate a JSON to Parquet conversion for a 75GB dataset that runs without downloading the dataset to your local machine.| Blog
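One way such a pipeline can look with Dask Bag, assuming newline-delimited JSON and a hypothetical bucket:

```python
import json
import dask.bag as db

# Parse JSON records in parallel, straight from object storage
bag = db.read_text("s3://my-bucket/raw/*.json").map(json.loads)  # hypothetical bucket

# Flatten the records into a tabular Dask DataFrame
ddf = bag.to_dataframe()

# Persist as Parquet next to the source data, without ever downloading locally
ddf.to_parquet("s3://my-bucket/parquet/")
```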
Coiled, a Dask company, is about one year old. We’ll have a more official celebration in mid-February (official date of incorporation), but I wanted to take this opportunity to talk a little bit about the journey over the last year, where that has placed us today, and what I think comes next.| Blog
Running Dask in the cloud is easy.| Blog
Black is an amazing tool that automatically formats your Python code to a consistent style.| Blog
As of Dask 2025.1.0, Xarray workloads loading large datasets are smoother and more reliable.| Coiled
Coiled's 2024 in review. Where we started, what we did, and where we're going.| Coiled
Sarah Johnson, Florian Jetter | Bar chart comparing the relative difference in TPC-H query runtime for Dask vs. PySpark when executed on an M1 MacBook Pro with 8 cores. Orange represents queries where Dask is faster and blue where PySpark is faster.| Blog
We show a lightweight, scalable data pipeline that runs large Python jobs on a schedule in the cloud.| Blog
Jack Solomon | Line graph of forecasted sales and actual sales over time.| Blog
We show how to run existing NASA data workflows on the cloud, in parallel, with minimal code changes using Coiled. We also discuss cost optimization. Chart: comparing cost and duration between running the same workflow locally on a laptop, running on AWS, and running with cost optimizations on AWS.| Blog
Dask-expr is an ongoing effort to add a logical query optimization layer to Dask DataFrames. We now have the first benchmark results to share that were run against the current DataFrame implementation.| Blog
Distributed computing is hard, and distributed debugging is even harder. Dask tries to simplify this process as much as possible. Coiled adds additional observability features for your Dask clusters and processes, helping users understand their workflows better.| Blog
The cloud offers amazing scale, but it can be difficult for Python data developers to use. This post walks through how to use Coiled Functions to run your existing code in parallel on the cloud with minimal code changes. Chart: comparing code runtime between a laptop, a single cloud VM, and multiple cloud VMs in parallel.| Blog
We processed 250TB of geospatial cloud data in twenty minutes on the cloud with Xarray, Dask, and Coiled. We do this to demonstrate scale and to think about costs. Figure: county-level heat map of the continental US showing mean depth to soil saturation (in meters) in 2020.| Blog
While it’s trivial to measure the end-to-end runtime of a Dask workload, the next logical step - breaking down this time to understand if it could be faster - has historically been a much more arduous task that required a lot of intuition and legwork, for novice and expert users alike. We wanted to change that. Screenshot: populated Fine Performance Metrics dashboard.| Blog
Coiled Functions make it easy to improve performance and reduce costs by moving your computations next to your cloud data.| Blog
Dask DataFrame doesn’t currently optimize your code for you (like Spark or a SQL database would). This means that users waste a lot of computation. Let’s look at a common example which looks ok at first glance, but is actually pretty inefficient.| Blog
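An illustrative sketch of the kind of waste meant here (hypothetical path and column, not necessarily the post's exact example):

```python
import dask.dataframe as dd

# Inefficient: loads every column from disk, then uses only one
ddf = dd.read_parquet("data/")  # hypothetical path
result = ddf[ddf["value"] > 0]["value"].mean().compute()

# Better: push the column selection into the read,
# so the other columns are never loaded at all
ddf = dd.read_parquet("data/", columns=["value"])
result = ddf[ddf["value"] > 0]["value"].mean().compute()
```

A query optimizer would rewrite the first version into the second automatically; without one, the user has to do it by hand.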
What is the easiest way to run Python code in the cloud, especially for compute jobs?| Blog
We recently pushed out a new, experimental notebooks feature for easily launching Jupyter servers in the cloud from your local machine. We’re excited about Coiled notebooks because they:| Blog
Dask makes it easy to print output whether you’re running code locally on your laptop or remotely on a cluster in the cloud.| Blog
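A minimal sketch of how that works, assuming a recent dask.distributed, which ships a print that forwards worker output back to the client session:

```python
from dask.distributed import Client, print  # this print forwards output to the client

def work(x):
    print(f"processing {x}")  # appears in your local session, not only in worker logs
    return x + 1

if __name__ == "__main__":
    client = Client()  # local cluster here; the same code works on a remote cluster
    print(client.gather(client.map(work, range(3))))
```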
At Coiled we develop Dask and automatically deploy it to large clusters of cloud workers (sometimes 1000+ EC2 instances at once!). In order to avoid surprises when we publish a new release, Dask needs to be covered by a comprehensive battery of tests — both for functionality and performance. Screenshot: nightly tests report.| Blog
Sarah Johnson, Nat Tabris | Bar chart of AWS cost vs. processor type.| Blog
Dask has deep integrations with other libraries in the PyData ecosystem like NumPy, pandas, Zarr, PyArrow, and more. Part of providing a good experience for Dask users is making sure that Dask continues to work well with this community of libraries as they push out new releases. This post walks through how Dask maintainers proactively ensure Dask continuously works with its surrounding ecosystem.| Blog
Docker is a great tool for creating portable software environments, but we found it’s too slow for interactive exploration. We find that clusters depending on Docker images often take 5+ minutes to launch. Ouch.| Blog
A few months ago we released package sync, a feature that takes your Python environment and replicates it in the cloud with zero effort.| Blog
XGBoost is one of the most well-known libraries among data scientists, having become one of the top choices among Kaggle competitors. It is performant in a wide array of supervised machine learning problems, implements scalable training through the rabit library, and integrates with many big data processing tools, including Dask.| Blog
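A small sketch of XGBoost's Dask integration (random data for illustration; parameters are arbitrary):

```python
import xgboost as xgb
import dask.array as da
from dask.distributed import Client

client = Client()  # local cluster; swap in a remote cluster for real workloads

# Synthetic training data, chunked so it can be distributed across workers
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=100_000, chunks=10_000)

# DaskDMatrix keeps the data distributed instead of collecting it locally
dtrain = xgb.dask.DaskDMatrix(client, X, y)

output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=50,
)
booster = output["booster"]  # trained model, usable like any XGBoost booster
```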
The cloud is tricky! You might think the rules that determine which IAM permissions are required for which actions will continue to apply in the same way. You might think they’d apply the same way to different AWS accounts. Or that if these things aren’t true, at least AWS will let you know. (I did.) You’d be wrong!| Blog
Dask is widely used among data scientists and engineers proficient in Python for interacting with big data, doing statistical analysis, and developing machine learning models. Operationalizing this work has traditionally required lengthy code rewrites, which makes moving from development to production hard. This gap slows business progress and increases risk for data science and data engineering projects in an enterprise setting. The need to remove this bottleneck has prompted the emergence ...| Blog