I wanted to query a large TSV file stored in S3. To do this, I decided to convert it to Parquet and query it with DuckDB. However, I didn't want to download the full file and then convert it; instead, I wanted to stream the TSV file directly from S3 and write the output to a Parquet file. Here are a couple of approaches that worked quite nicely.

## Setup

I'm running these experiments on an EC2 instance with 8 cores and 32GB of RAM. The data is stored on a 700GB gp2 volume with 2100 IOPS.