Learn how to implement data time travel with DuckLake and Tigris to easily recover from database mishaps. This guide demonstrates how DuckLake creates snapshots for every `INSERT` or `DELETE` operation, allowing you to explore your analytics database as it was before disaster struck.| Tigris Object Storage Blog
Combine SQL and object storage data seamlessly with DuckLake, a data lakehouse solution that works with Tigris. Query big and small data without egress fees and simplify your data analytics workflows.| Tigris Object Storage Blog
Spatial Joins in DuckDB. An extremely detailed overview by Max Gabrielsson of DuckDB's new spatial join optimizations. Consider the following query, which counts the number of NYC Citi Bike trips for each of the neighborhoods defined by the NYC Neighborhood Tabulation Areas polygons and returns the top three: SELECT neighborhood, count(*) AS num_rides FROM rides JOIN hoods ON ST_Intersects(rides.start_geom, hoods.geom) GROUP BY neighborhood ORDER BY num_rides DESC LIMIT 3; The rides table contains...| Simon Willison's Weblog
Statically compiling DuckDB can improve security, improve startup time, and support offline environments.| Colin Breck
I was giving a presentation about Microsoft Fabric Python notebooks and someone asked if they scale. The short answer is yes. You can download the notebook and try it for yourself. For the long ans…| Small Data And self service
Learn Delta tables with ColumnMapping in Polars, addressing solutions, performance, and efficiency using alternative methods| Sandeep Pawar | Microsoft Fabric
Discover how Apache Iceberg, DuckDB, and open catalogs transform data lakes into powerful lakehouses. Learn to query S3 data with SQL interfaces.| Data Engineering Blog
This is not an official Microsoft benchmark, just my personal experience. Last week, I came across a new TPCH generator written in Rust. Luckily, someone ported it to Python, which makes generating large datasets possible even with a small amount of RAM. For example, it took 2 hours and 30 minutes to generate a 1 … Continue reading "Some Observations on Running TPCH 1 TB on Microsoft Fabric"| Small Data And self service
Preconditions To use the Amazon SageMaker Lakehouse with DuckDB, you first have to create a S3 Table bucket, a namespace and an actual S3 Table. All those steps are described in my other blog post “Query S3 Tables with DuckDB”, so please make sure yo...| tobilg.com
Data Lakes come in a broad variety and lots of different flavors. AWS, Azure, Google Cloud, Snowflake, Databricks, etc. all have their specialties, strengths, and weaknesses. Common among them is that most, if not all, of them use Object Storage...| tobilg.com
The General Transit Feed Specification (GTFS) is a standardized, open data format for public transportation schedules and geographic information. In practice, a GTFS feed is simply a ZIP archive of text (CSV) tables - such as stops.txt, routes.txt, a...| tobilg.com
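Since a GTFS feed is just a ZIP of CSV tables, it can be read with the Python standard library alone. The feed below is tiny hypothetical sample data, built on the fly to show the shape:

```python
import csv
import io
import os
import tempfile
import zipfile

# Build a tiny GTFS-style feed (hypothetical data): a ZIP archive of CSV text tables.
tmpdir = tempfile.mkdtemp()
feed_path = os.path.join(tmpdir, "feed.zip")
with zipfile.ZipFile(feed_path, "w") as z:
    z.writestr("stops.txt", "stop_id,stop_name\n1,Main St\n2,Elm St\n")
    z.writestr("routes.txt", "route_id,route_short_name\nA,Blue Line\n")

# Read a table straight out of the archive without extracting it.
with zipfile.ZipFile(feed_path) as z, z.open("stops.txt") as f:
    stops = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8")))

print([s["stop_name"] for s in stops])  # ['Main St', 'Elm St']
```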
I had a use case that eventually required performing IP address lookups in a given list of CIDR ranges, as I maintain an open source project that gathers IP address range data from public cloud providers, and also wrote an article in my blog about an...| tobilg.com
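The core of an IP-in-CIDR lookup can be sketched with Python's standard `ipaddress` module. The ranges below are a small hypothetical sample; real cloud-provider feeds contain thousands of prefixes, which is where a SQL engine like DuckDB becomes attractive:

```python
import ipaddress

# Hypothetical sample of provider CIDR ranges (shape only; real feeds are far larger).
cidrs = ["3.5.140.0/22", "13.34.37.64/27", "2600:1f18::/33"]
networks = [ipaddress.ip_network(c) for c in cidrs]

def matching_range(ip: str):
    """Return the first CIDR range that contains the given IP, or None."""
    addr = ipaddress.ip_address(ip)
    return next((str(n) for n in networks if addr in n), None)

print(matching_range("3.5.141.7"))  # 3.5.140.0/22
print(matching_range("8.8.8.8"))    # None
```

This linear scan is fine for small lists; for large range tables, a join on integer bounds (network start/end) lets the database index and vectorize the lookup.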
A while ago I published sql-workbench.com and the accompanying blog post called "Using DuckDB-WASM for in-browser Data Engineering". The SQL Workbench enables its users to analyze local or remote data directly in the browser. This lowers the bar rega...| tobilg.com
Introduction DuckDB, the in-process DBMS specialized in OLAP workloads, has seen very rapid growth during the last year, both in functionality and in popularity, among its users as well as among the developers who contribute many projects to the Open Sou...| tobilg.com
This article explains how gathering and analyzing public cloud provider IP address data is possible with DuckDB and Observable| tobilg.com
Using AWS Serverless services and DuckDB as near-realtime Data Lake backend infrastructure| tobilg.com
A common task in S3-based Data Lakes is to repartition data, to optimize query patterns and speed. This article describes a serverless solution using DuckDB| tobilg.com
How to run DuckDB in a serverless way on AWS Lambda, with a custom layer.| tobilg.com
Note: The blog and especially the code were written with the assistance of an LLM. TL;DR I built a simple Fabric Python notebook to orchestrate sequential SQL transformation tasks in OneLake using …| Small Data And self service
🌟 Introduction While testing the DuckDB ODBC driver, which is getting better and better (not production ready but less broken compared to two years ago), I noticed something unexpected. Running que…| Small Data And self service
A few forward-looking SQL dialects have started introducing lambda expressions to be used with functions operating on arrays.| Java, SQL and jOOQ.
When attempting to read a Delta table using Python with the deltalake library (Delta_rs, not Spark), you may encounter the following error: import deltalake DeltaTable(‘/lakehouse/default/Tab…| Small Data And self service
New dialects: jOOQ 3.20 ships with 2 new experimental dialects: ClickHouse is a fast-moving SQL dialect with a historic vendor-specific syntax that is gradually migrated to a more standards compliant alternative, which is why our support is still experimental. A lot of behaviours differ from what one would expect elsewhere, including NULL handling, which is … Continue reading jOOQ 3.20 released with ClickHouse, Databricks, and much more DuckDB support, new modules, Oracle type hierarchies, ...| Java, SQL and jOOQ.
The article discusses the evolution of business intelligence (BI) tools, questioning their longevity compared to the enduring nature of spreadsheets. It highlights how spreadsheets facilitated decision-making in the past but have become inadequate as data complexity increased. The author envisions a future "Spreadsheets 2.0," integrating advanced features, better orchestration, and AI support to revitalize the role of spreadsheets in data workflows.| DataDuel.co
Patrick Hoefler| Blog
There are significant changes happening in distributed systems.| Colin Breck
In the ever-evolving landscape of data management, DuckDB has carved out a niche for itself as a powerful analytical database designed for efficient in-process data analysis. It is particularly wel…| Shekhar Gulati
TigerEye releases open-source DuckDB.dart to simplify data-intensive application development| SiliconANGLE
Learn how to define the OneLake filesystem using fsspec and why it matters| Sandeep Pawar | Microsoft Fabric
DuckDB and the R ecosystem| josiahparry.com
We recently pushed out two new and experimental features: Coiled Jobs| phofl.github.io