We've been working on readers/writers for our recently announced Lance v2 file format and are posting in-depth articles about writing a high-performance file reader. In the first article I talked about how we obtain parallelism without row groups. Today, I want to explain how, and why, we separate…
The API used to read files has evolved over time, from simple "full table" reads to batch reads and eventually to iterative "record batch readers". These more sophisticated APIs add power and flexibility. Lance takes this a step further and returns a "stream of read tasks". This originates from…
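To make the "stream of read tasks" idea concrete, here is a minimal sketch. The `ReadTask` and `scan_tasks` names are illustrative assumptions, not the actual Lance reader interface: the scan yields lightweight task descriptions, and the caller decides how many of them to decode in parallel.

```python
# Hypothetical sketch of a "stream of read tasks" API. The names
# (ReadTask, scan_tasks, decode) are illustrative, not the real Lance API.
import concurrent.futures
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class ReadTask:
    """A unit of work: which rows to fetch and how to decode them."""
    row_start: int
    row_count: int

    def decode(self) -> List[int]:
        # Placeholder: a real task would perform I/O and decode Arrow data.
        return list(range(self.row_start, self.row_start + self.row_count))


def scan_tasks(total_rows: int, batch_size: int) -> Iterator[ReadTask]:
    """Yield read tasks instead of decoded batches, so the caller chooses
    how much parallelism to apply."""
    for start in range(0, total_rows, batch_size):
        yield ReadTask(start, min(batch_size, total_rows - start))


# The caller picks the parallelism: four worker threads decode tasks
# concurrently while the scan itself stays a cheap iterator.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    batches = list(pool.map(lambda t: t.decode(), scan_tasks(10_000, 1_024)))
```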
Repetition and definition levels are a method of converting structural arrays into a set of buffers. The approach was made popular by Parquet and is one of the key ways Parquet, ORC, and Arrow differ. In this blog I will explain how they work by contrasting them with validity & offsets…
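As a rough preview of the contrast, the sketch below encodes the same nullable list column `[[1, 2], [], None]` both ways. It is a simplified illustration for a single level of nesting with non-null items, not how either format is actually implemented.

```python
# Simplified comparison of the two representations for a nullable
# list<int32> column with non-null items. Real Parquet/Arrow code handles
# arbitrary nesting; this sketch only handles one level of lists.
rows = [[1, 2], [], None]

# Arrow-style: validity bitmap + offsets + a flat child buffer.
validity = [row is not None for row in rows]
offsets = [0]
values_arrow = []
for row in rows:
    if row:
        values_arrow.extend(row)
    offsets.append(len(values_arrow))

# Parquet-style: repetition and definition levels + a flat value buffer.
# def 0 = null list, def 1 = empty list, def 2 = value present.
# rep 0 = starts a new row, rep 1 = continues the current list.
rep_levels, def_levels, values_parquet = [], [], []
for row in rows:
    if row is None:
        rep_levels.append(0); def_levels.append(0)
    elif len(row) == 0:
        rep_levels.append(0); def_levels.append(1)
    else:
        for i, v in enumerate(row):
            rep_levels.append(0 if i == 0 else 1)
            def_levels.append(2)
            values_parquet.append(v)

print(validity, offsets, values_arrow)        # [True, True, False] [0, 2, 2, 2] [1, 2]
print(rep_levels, def_levels, values_parquet) # [0, 1, 0, 0] [2, 2, 1, 0] [1, 2]
```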
Record shredding is a classic method used to transpose rows of potentially nested data into a flattened tree of buffers that can be written to the file. A similar technique, cascaded encoding, has recently emerged, which converts those arrays into a flattened tree of compressed buffers. In this article we…
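Here is a toy sketch of the two ideas, assuming an illustrative `{"id", "tags"}` schema and an arbitrary dictionary-plus-run-length cascade (not the encodings Lance actually chooses): shredding transposes the rows into flat per-leaf buffers, and the cascade then re-encodes those buffers into smaller ones.

```python
# Toy record shredding followed by a small encoding cascade. The schema and
# the chosen encodings are illustrative assumptions only.
rows = [
    {"id": 1, "tags": ["red", "red"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": ["red", "blue"]},
]

# Shredding: transpose the rows into one flat buffer per leaf column,
# plus offsets describing the list structure.
ids = [r["id"] for r in rows]
tag_offsets, tags = [0], []
for r in rows:
    tags.extend(r["tags"])
    tag_offsets.append(len(tags))

# Cascade: dictionary-encode the tag values, then run-length encode the
# resulting indices, so the final buffers stay columnar but get smaller.
dictionary = sorted(set(tags))
indices = [dictionary.index(t) for t in tags]
rle = []
for idx in indices:
    if rle and rle[-1][0] == idx:
        rle[-1][1] += 1
    else:
        rle.append([idx, 1])

print(ids)          # [1, 2, 3]
print(tag_offsets)  # [0, 2, 2, 4]
print(dictionary)   # ['blue', 'red']
print(rle)          # [[1, 3], [0, 1]]  -> (dictionary index, run length) pairs
```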
Streaming data applications can be tricky. When you can read data faster than you can process it, bad things tend to happen. The most common scenario is that you run out of memory and your process crashes. When the process doesn't crash, performance often suffers (e.g. swapping…
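The usual remedy is backpressure. A minimal sketch, assuming a simple thread-based reader and processor: a bounded queue blocks the fast reader whenever the slow consumer falls behind, so the amount of buffered data stays capped instead of growing without bound.

```python
# Minimal backpressure sketch: a bounded queue blocks the (fast) reader when
# the (slow) consumer falls behind. The sizes and sleeps are illustrative.
import queue
import threading
import time

buffer = queue.Queue(maxsize=4)  # at most 4 batches in flight

def reader():
    for batch_id in range(32):
        # put() blocks once the queue is full -- this is the backpressure.
        buffer.put(f"batch-{batch_id}")
    buffer.put(None)  # sentinel: no more data

def processor():
    while True:
        batch = buffer.get()
        if batch is None:
            break
        time.sleep(0.01)  # processing is slower than reading

t1, t2 = threading.Thread(target=reader), threading.Thread(target=processor)
t1.start(); t2.start()
t1.join(); t2.join()
```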
Conventional wisdom states that compression and random access do not go well together. However, there are many ways you can compress data, and some of them support random access better than others. Figuring out which compression we can use, and when, and why, has been an interesting challenge. As we've…
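One illustration of how the two can coexist (a sketch with an arbitrary block size and codec, not a description of Lance's encodings): compress fixed-size blocks independently and keep an offset index, so a point lookup only decompresses the one block it needs rather than the whole column.

```python
# Block-based compression that still allows random access: each block is
# compressed independently and an offset index locates it. The block size
# and the zlib codec are illustrative choices only.
import zlib

VALUES_PER_BLOCK = 1024
values = list(range(100_000))  # pretend this is an int32 column

# Encode: split into blocks, compress each block independently, record offsets.
blocks, offsets = bytearray(), [0]
for start in range(0, len(values), VALUES_PER_BLOCK):
    chunk = values[start:start + VALUES_PER_BLOCK]
    raw = b"".join(v.to_bytes(4, "little") for v in chunk)
    blocks += zlib.compress(raw)
    offsets.append(len(blocks))

# Random access: find the block holding row `i`, decompress only that block.
def read_row(i: int) -> int:
    block_idx = i // VALUES_PER_BLOCK
    raw = zlib.decompress(bytes(blocks[offsets[block_idx]:offsets[block_idx + 1]]))
    within = i % VALUES_PER_BLOCK
    return int.from_bytes(raw[within * 4:within * 4 + 4], "little")

assert read_row(56_789) == 56_789
```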