“The Fastest Way to Insert Data to Postgres”: I was recently working on a PySpark pipeline in which I was using the JDBC option to write about 22 million records from a Spark DataFrame into a Postgres RDS database. Hey, why not use the built-in method provided by Spark? How bad could it be? I mean, it’s not like the creators and […]
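For context, a minimal sketch of the kind of JDBC write in question (the host, database, table, credentials, and source path below are all placeholders, not the post’s actual setup):

```python
# Minimal sketch of a Spark -> Postgres JDBC write; every connection
# detail here (host, database, table, credentials) is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-load").getOrCreate()
df = spark.read.parquet("s3://some-bucket/events/")  # hypothetical source

(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://my-rds-host:5432/mydb")
    .option("dbtable", "public.events")
    .option("user", "etl_user")
    .option("password", "...")
    .option("batchsize", 10_000)  # bigger batches mean fewer round-trips per INSERT
    .mode("append")
    .save()
)
```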
“Polars on GPU: Blazing Fast DataFrames for Engineers”: Did you know that Polars, that Rust-based DataFrame tool that is one of the fastest tools on the market today, just got faster? There is now GPU execution available in Polars that makes it 70% faster than before!
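What that looks like in practice is one keyword argument. A minimal sketch, assuming Polars is installed with GPU support (e.g. `pip install polars[gpu]`) on an NVIDIA card, with a made-up input file:

```python
# A sketch of Polars' GPU engine; the input file and columns are made up.
import polars as pl

lazy = (
    pl.scan_parquet("events.parquet")   # hypothetical input
      .group_by("user_id")
      .agg(pl.col("amount").sum())
)

# engine="gpu" asks Polars to execute the query on the GPU, falling back
# to the CPU engine for operations the GPU engine does not yet support.
df = lazy.collect(engine="gpu")
```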
“The Medallion Architecture Farce.”: I can no longer hold back the boiling and frothing mess of righteous anger that starts to rumble up from within me when I hear the words “Medallion Architecture” in the context of Data Modeling, especially when it’s used by some young Engineer who doesn’t know any better. Poor saps who have been born into a […]
“DuckDB … Merge Mismatched CSV Schemas. (also testing Polars)”: I recently encountered a problem loading a few hundred CSV files, which contained mismatched schemas due to a handful of “extra” columns. This turned out not to be an easy problem for Polars to solve, in all its Rust glory. That made me curious: how does DuckDB handle mismatched schemas of CSV files? Of course, […]
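DuckDB’s answer is essentially one flag. A minimal sketch, assuming the duckdb Python package and a hypothetical folder of CSVs whose headers overlap but don’t match exactly:

```python
# union_by_name=true aligns columns by header name across files and
# fills columns missing from a given file with NULLs instead of erroring.
import duckdb

df = duckdb.sql(
    "SELECT * FROM read_csv('data/*.csv', union_by_name=true)"
).pl()  # hand back a Polars DataFrame, since that's the other tool under test
```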
“polars.exceptions.ComputeError: schema lengths differ”: So, you are happily using the new Rust GOAT DataFrame tool Polars to munge messy data, maybe, like me, messing with 40 GB of CSV data over multiple files. You are pretty much going to run into this error:

polars.exceptions.ComputeError: schema lengths differ
This error occurred with the following context stack: [1] ‘csv scan’ [2] ‘select’ […]
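One common workaround (a sketch, not necessarily the post’s exact fix): scan each file separately and let Polars union the differing schemas by column name instead of demanding identical schema lengths.

```python
# Scan each CSV on its own, then diagonally concatenate: how="diagonal"
# unions columns by name and fills the gaps with nulls.
import glob
import polars as pl

lazy_frames = [pl.scan_csv(path) for path in glob.glob("data/*.csv")]
df = pl.concat(lazy_frames, how="diagonal").collect()
```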
I don’t know about you, but I grew up and cut my teeth in what feels like a special and Golden age of software engineering that is now relegated to the history books, a true onetime Renaissance of coding that was beautiful, bright, full of laughter and wonder, a time which has passed and will […]
SQLMesh is an open-source framework for managing, versioning, and orchestrating SQL-based data transformations. It’s in the same “data transformation” space as dbt, but with some important design and workflow differences.

What SQLMesh Is

SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and […]
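For a taste of the workflow, here is a minimal sketch of a SQLMesh Python model; SQL models are the more common entry point, but a Python example keeps the snippets in this roundup consistent. The model name, columns, and returned data are all made up for illustration:

```python
# A minimal sketch of a SQLMesh Python model; names and data are made up.
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model


@model(
    "demo.my_first_model",
    columns={"id": "int", "name": "text"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # SQLMesh calls this once per evaluated interval [start, end).
    return pd.DataFrame([{"id": 1, "name": "example"}])
```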
“Duplicates in Data and SQL”: You know, after literally multiple decades in the data space, writing code and SQL, at some point along that arduous journey, one might think this problem would have been solved by me, or by the tooling … yet alas, it is not to be. Regardless of the industry or tools used, such as Pandas, Spark, or Postgres, duplicates are […]
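A minimal sketch of the two dedup idioms those tools all share, shown in pandas; the DataFrame, key column, and sort column are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 1, 2], "updated_at": ["2024-01-01", "2024-02-01", "2024-01-15"]}
)

# Whole-row duplicates: keep the first occurrence.
deduped = df.drop_duplicates()

# "Latest record per key" duplicates: sort, then keep one row per id --
# the same pattern ROW_NUMBER() OVER (PARTITION BY ...) expresses in SQL.
latest = df.sort_values("updated_at").drop_duplicates("id", keep="last")
```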
“What Are Deletion Vectors (DLV)?”: Deletion Vectors are a soft-delete mechanism in Delta Lake that enables Merge-on-Read (MoR) behavior, letting update/delete/merge operations mark row positions as removed without rewriting the underlying Parquet files. This contrasts with the older Copy-on-Write (CoW) model, where even a single deleted record triggers rewriting of entire files. Supported since Delta Lake 2.3 (read-only), full […]
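A minimal sketch of enabling the feature on an existing Delta table; the table name is a placeholder, and this assumes a Spark session with Delta Lake support on a version where deletion-vector writes are available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Turn deletion vectors on for one table (table name is made up).
spark.sql("""
    ALTER TABLE my_schema.events
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# With the property set, a DELETE records removed row positions in a
# deletion vector instead of rewriting the Parquet files holding them.
spark.sql("DELETE FROM my_schema.events WHERE id = 42")
```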
“Solving a “Fill Forward” NULL problem with Polars”: I recently used Polars … inside an AWS Lambda … to fix a novel and somewhat obtuse CSV formatting issue. We were receiving CSV files in which certain columns were left empty whenever the value repeated from the row above, until a different value finally appeared. Let me show you.

+-----------+-----------+---------+
| City […]
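A minimal sketch of the fill-forward fix, with made-up data standing in for the truncated example above:

```python
# Repeated City values arrived as blanks on subsequent rows; fill_null
# with strategy="forward" carries the last non-null value down until a
# new value appears, reconstructing the omitted repeats.
import polars as pl

df = pl.DataFrame(
    {"City": ["Chicago", None, None, "Denver", None], "Sales": [1, 2, 3, 4, 5]}
)

filled = df.with_columns(pl.col("City").fill_null(strategy="forward"))
```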
So … Astronomer.io … who are they and what do they do? It’s funny how, every once in a while, the Data Engineering world gets dragged into the light of the real world … usually for bad things … and then gets shoved under the carpet again. Recently, because of the transgressions of the CEO […]
Lakebase: Databricks’ Bold Play to Fuse OLTP and the Lakehouse
The future never shows up quietly. Just when you think you’ve tamed the latest “must-have” technology, a fresh acronym crashes the party. I’d barely finished wrapping my head around the Lakehouse paradigm when Databricks rolled out something new at the 2025 Data & AI Summit: Lakebase, a fully managed PostgreSQL engine built directly into the […]
I’d be lying if I said a small part of me didn’t groan when I first read about SQL Scripting being released by Databricks. Don’t get me wrong—I don’t fault Databricks for giving users what they want. After all, if you don’t feed the masses, they’ll turn on you. We data engineers are gluttons for […]
I’ve been thinking about this for a few days now, and I still don’t know whether to cheer or groan. Some moments, I see DuckLake as a smart, much-needed evolution; other times, it feels like just another unnecessary entry in the ever-growing Lake House jungle. Reality, as always, is probably somewhere in between. MotherDuck and […]
Let’s be honest: working with Apache Iceberg stops being fun the moment you step off your local laptop and into anything that resembles production. The catalog system—mandatory and rigid—has long been the Achilles’ heel of an otherwise promising open data format. For a long time, you had two options: over-engineered corporate-grade solutions that require infrastructure […]
Every so often, I have to convert some .txt or .csv file over to Excel format … just because that’s how the business wants to consume or share the data. It is what it is. This means I am often on the lookout for easy-to-use, simple one-liners that I can use to […]
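One such one-liner (a sketch, not necessarily the post’s pick), using pandas with the openpyxl engine; the file names are placeholders:

```python
# Requires pandas and openpyxl; reads a CSV and writes it out as .xlsx.
import pandas as pd

pd.read_csv("report.csv").to_excel("report.xlsx", index=False)
```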
“Rethinking Object Storage: A First Look at Cloudflare R2 and Its Built-In Apache Iceberg Catalog”: Sometimes, we follow tradition because, well, it works—until something new comes along and makes us question the status quo. For many of us, Amazon S3 is that well-trodden path: the backbone of our data platforms and pipelines, used countless times each day. If […]
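A hedged sketch of what “built-in Iceberg catalog” means from the client side: pointing PyIceberg at an Iceberg REST endpoint like R2’s. The URI format, warehouse, and token below are placeholders you would pull from your own Cloudflare account, not values from the post:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "r2",
    **{
        "type": "rest",
        "uri": "https://catalog.cloudflarestorage.com/<account-id>/<bucket>",
        "warehouse": "<account-id>_<bucket>",
        "token": "<r2-api-token>",
    },
)

print(catalog.list_namespaces())
```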
Running dbt on Databricks has never been easier. The integration between dbt-core and Databricks could not be simpler to set up and run. Wondering how to approach running dbt models on Databricks with SparkSQL? Watch the tutorial below.
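If you prefer code to video, here is a hedged sketch of driving dbt-core programmatically from Python (my own illustration, not the tutorial’s steps); it assumes a dbt project already configured with the dbt-databricks adapter, and the model name is made up:

```python
# dbt-core (1.5+) exposes a programmatic entry point mirroring the CLI.
from dbt.cli.main import dbtRunner

dbt = dbtRunner()
result = dbt.invoke(["run", "--select", "my_model"])  # model name is made up
print(result.success)
```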
There are things in life that are satisfying—like a clean DAG run, a freshly brewed cup of coffee, or finally deleting 400 lines of YAML. Then there are things that make you question your life choices. Enter: setting up Apache Polaris (incubating) as an Apache Iceberg REST catalog. Let’s get one thing out of the […]
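For orientation, this is roughly what the client side looks like once Polaris is up: a hedged sketch of connecting PyIceberg to a Polaris REST endpoint, where the URI, credential, scope, and catalog name are all assumptions about a typical local setup rather than anything from the post:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "http://localhost:8181/api/catalog",   # assumed local endpoint
        "credential": "<client-id>:<client-secret>",  # OAuth client credentials
        "warehouse": "my_polaris_catalog",            # catalog name is made up
        "scope": "PRINCIPAL_ROLE:ALL",                # assumed Polaris role scope
    },
)

print(catalog.list_namespaces())
```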
Context and Motivation

dbt (Data Build Tool): a popular open-source framework that organizes SQL transformations in a modular, version-controlled, and testable way.
Databricks: a platform that unifies data engineering and data science pipelines, typically with Spark (PySpark, Scala) or SparkSQL.

The post explores whether a Databricks environment—often used for Lakehouse architectures—benefits from dbt, especially if […]