My name is Arseny Kapoulkine and this is my blog where I write about computer graphics, optimization, programming languages and related topics.| zeux.io
Some people have a misconception that in software engineering, skill stops mattering for code quality beyond some level of seniority, and all of the added value shifts to architecture, high level design decisions, problem setting, or guiding others. And as long as you have staff/senior engineers design a system and oversee mid-level - and, for some work, junior - engineers, the output quality is the same as if you got senior folks to write everything instead.| zeux.io
meshoptimizer implements several geometry compression algorithms that are designed to take advantage of redundancies common in mesh data and decompress quickly - targeting many gigabytes per second in decoding throughput. One of them, index decoder, has seen a significant and unexpected variance in performance across multiple compilers and compiler releases recently; upon closer investigation, the differences can mostly be attributed to the same microarchitectural detail that is not often tal...| zeux.io
Hardware accelerated raytracing, as supported by DirectX 12 and Vulkan, relies on an abstract data structure that stores scene geometry, known as “acceleration structure” and often referred to as “BVH” or “BLAS”. Unlike geometry representation for rasterization, rendering engines cannot customize the data layout; unlike texture formats, the layout is not standardized across vendors. It may seem like a trivial matter - surely, by 2025 all implementations are close to each other in...| zeux.io
I am happy to report that life after Roblox does indeed exist. When I quit, people told me I should take some time off, relax, unwind, recharge, travel…| zeux.io
The first somewhat social platform that I’ve used was LiveJournal; I used it around 2004-2010. Back then, we had posts and comments, but one of the notable features of the platform was the uni-directional friend relationships. The number of people who befriended you was somewhat of a status symbol, with a special term “тысячник” (a person with 1000+ reverse friend connections) used to denote Popular People. That said, my recollection is that people mostly wrote what was fun or i...| zeux.io
I regularly hear or read statements like this: “X is slow but this is to be expected because it needs to do a lot of work”. It can be said about an application or a component in a larger system, and can refer to resources other than time. I often find these profoundly unhelpful as they depend much more on the speaker’s intuition and understanding of the problem than on X itself.| zeux.io
In Luau, the modulo operator a % b is defined as a - floor(a / b) * b, a definition inherited from Lua 5.1. While it has some numeric issues, like its behavior for b = inf, it’s decently fast to compute so we have not explored alternatives yet. That is, it would be decently fast to compute if floor was fast.| zeux.io
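The definition above can be written down directly. This is a minimal C++ sketch of the formula only; `luau_mod` is a hypothetical name chosen for illustration and is not taken from the Luau source.

```cpp
#include <cassert>
#include <cmath>

// Floored modulo as stated in the text: a % b == a - floor(a / b) * b.
// `luau_mod` is a hypothetical name used only for this sketch.
double luau_mod(double a, double b)
{
    return a - std::floor(a / b) * b;
}
```

With this definition the result takes the sign of b, so luau_mod(-5, 3) is 1 and luau_mod(5, -3) is -1. For b = inf and finite a, a / b is zero, and zero times infinity is NaN under IEEE 754, so the formula returns NaN rather than a.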
When working with mesh shaders, the geometry needs to be split into meshlets: small geometry chunks where each meshlet has a set of vertices and triangle indices that refer to the vertices inside each meshlet. The mesh shader then has to transform all vertices and emit all transformed vertices and triangles through the shader API to the rasterizer. When viewed through the lens of a traditional vertex reuse cache, mesh shaders seemingly make the reuse explicit so you would think that vertex/triangle...| zeux.io
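To make the meshlet layout concrete, here is a naive C++ sketch that splits an index buffer into meshlets by walking triangles in order. The `Meshlet` struct, the `build_meshlets` function and the 8-bit local index are assumptions made for this example; production tools use more careful clustering than this linear scan.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical meshlet layout for this sketch.
struct Meshlet
{
    std::vector<std::uint32_t> vertices; // indices into the source vertex buffer
    std::vector<std::uint8_t> triangles; // 3 local indices per triangle, each < vertices.size()
};

// Naive split: walk triangles in order, start a new meshlet when a limit would be exceeded.
std::vector<Meshlet> build_meshlets(const std::vector<std::uint32_t>& indices,
                                    std::size_t max_vertices, std::size_t max_triangles)
{
    assert(indices.size() % 3 == 0);
    assert(max_vertices >= 3 && max_vertices <= 256 && max_triangles >= 1);

    std::vector<Meshlet> result;
    std::unordered_map<std::uint32_t, std::uint8_t> local; // global index -> local index

    for (std::size_t i = 0; i < indices.size(); i += 3)
    {
        // Count the vertices of this triangle that the current meshlet does not have yet.
        std::size_t extra = 0;
        for (std::size_t k = 0; k < 3; ++k)
        {
            bool seen = local.count(indices[i + k]) != 0;
            for (std::size_t j = 0; j < k; ++j)
                seen = seen || indices[i + j] == indices[i + k];
            extra += seen ? 0 : 1;
        }

        bool full = result.empty() ||
                    result.back().vertices.size() + extra > max_vertices ||
                    result.back().triangles.size() / 3 >= max_triangles;
        if (full)
        {
            result.emplace_back();
            local.clear();
        }

        // Append the triangle, assigning local indices to vertices seen for the first time.
        Meshlet& m = result.back();
        for (std::size_t k = 0; k < 3; ++k)
        {
            auto it = local.find(indices[i + k]);
            if (it == local.end())
            {
                it = local.emplace(indices[i + k], static_cast<std::uint8_t>(m.vertices.size())).first;
                m.vertices.push_back(indices[i + k]);
            }
            m.triangles.push_back(it->second);
        }
    }
    return result;
}
```

Each meshlet stores its own small vertex list, and its triangles refer to positions in that list rather than to the source vertex buffer, which is the explicit form of reuse the paragraph describes.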
When using std::condition_variable, there’s an easy-to-remember rule: all variables accessed in the wait predicate must be changed under a mutex. However, this is easy to accidentally violate by throwing atomics into the mix.| zeux.io
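A minimal C++ sketch of the rule, assuming a single boolean flag; `wait_for_flag` and `ready` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper for this sketch: one thread waits for `ready`, another sets it.
bool wait_for_flag()
{
    std::mutex m;
    std::condition_variable cv;
    bool ready = false; // read by the wait predicate, so only written while holding `m`

    std::thread producer([&] {
        {
            std::lock_guard<std::mutex> lock(m);
            ready = true; // modified under the mutex
        }
        cv.notify_one(); // notifying after unlocking is fine
    });

    bool observed;
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return ready; }); // predicate is evaluated with `m` held
        observed = ready;
    }

    producer.join();
    return observed;
}
```

If `ready` were a std::atomic<bool> stored without taking the mutex, the store and the notification could both happen after the waiter has evaluated the predicate but before it starts blocking, and the wakeup would be lost.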
In the process of working on calm, a minimal from-scratch fast CUDA implementation of transformer-based language model inference, a critical consideration was establishing the speed of light for the inference process, and measuring the progress relative to that speed of light. In this post we’ll cover this theoretical limit and its implications.| zeux.io
I joined Roblox in August 2012; eleven years and 4000 commits later, it’s time to say goodbye. Today was my last day.| zeux.io
Over the next few posts I’d like to write about optimizing mesh data for run-time performance (i.e. producing vertex/index buffers that accurately represent the source model and are as fast for the GPU to render as possible).| zeux.io
Whenever there is an automated process involved, such as asset/code building, unit testing, automatic version packaging, bulk log processing, etc., there often is a set of command-line tools which do their thing and return the result. Then there is a calling process (which may be as simple as a batch file, or as complex as IncrediBuild), which launches the tool and acts upon success/failure.| zeux.io
We all like it when our code is fast. Some of us like the result, but dislike the process of optimization; others enjoy the process. However, optimization for the sake of optimization is wrong, unless you’re doing it in your pet project. Optimized code is sometimes less readable and, consequently, harder to understand and modify; because of that, optimization often introduces subtle bugs.| zeux.io
Debug information is the data that allows the debugger to, uhm, debug your program. It consists of the information about all types used in the program, of source line information (what instruction originated from what source line), of variable binding information (to know where on the stack frame/in register pool each local variable is stored) and other things that help you debug your program.| zeux.io
We’re stuck with C++, at least for another console generation. C++ has many quirks that I wish were not there, but there is no real alternative as of today. While modern languages tend to adopt the bulk compilation and/or smart linkers and so can have a proper module system and eat the cake too, C++ is stuck with header files (on the other hand, C++ builds are incremental and almost embarrassingly parallel). While the strategy of dealing with header files and staying sane seems more or less...| zeux.io
Lua is a very popular scripting language in the game development industry. Many games use Lua for various scripting needs (data representation, UI scripting, AI scripting), and some go as far as writing the majority of the game in Lua. At CREAT, we used Lua for all of UI scripting, and for AI and other game logic on some projects. And, well, there were times when the game crashed - and the callstack consisted mainly of Lua functions.| zeux.io
The day has come - I’ve left CREAT Studios and started working at Saber Interactive as a PS3 (well, that was obvious) programmer (well, that was obvious too).| zeux.io
Almost a year and a half ago I blogged about several useful things that you can do with custom IDirect3DDevice9 implementations. I don’t know why I did not post the code back then, but anyway - here it is:| zeux.io
This post is about a neat trick that is certainly not of my invention, but that should really be more well-known; at least, I haven’t heard of it till I stumbled across it while reading Box2D sources.| zeux.io
The language war in game development is long over - and the winner is C++. The vast majority of code that’s going to run on the user’s side (engine code and game code) is written in C++. This is mostly not because the language is good, but because there is no better alternative.| zeux.io
Last time (I don’t blame you if you forgot, that was a year and a half ago) I described the view frustum culling solution, which involved projecting the box to clip space and testing the points against plane equations there. This is more or less the solution we used at work at the time; the production version has two additional features, size culling (after the culling operation we have the clip-space extents, so we can test if the box screen-space area is too small) and occlusion culling (...| zeux.io
Long time no see, everyone.| zeux.io
I’ve decided to take a small break from the VFC series and post something completely different. The VFC series continues next time, don’t worry.| zeux.io
I’m sorry for the lack of a real post - it was a busy week, and a somewhat busy month lies ahead - I’m attending a local game conference in May and giving a talk about the process of porting our rendering subsystem to SPU (I hope to cover this topic here some day), so some time is spent preparing slides/etc.; my pet projects demand more attention than usual; there’s some weird but nevertheless interesting stuff at work… I’ll try to keep up, but you should really expect some more wee...| zeux.io
Before getting into professional game development I spent a fair amount of time doing it for fun (in fact, I still do it now, although less intensively). The knowledge came from a variety of sources, but the only way that I knew and used to calculate frustum plane equations was as follows – get the equations in clip space (they’re really simple – (1, 0, 0, 1), (0, -1, 0, 1), etc.) and then get world space ones by transforming the planes with inverse transpose of view projection ca...| zeux.io
In the previous iteration we converted the code to SoA instead of AoS, which enabled us to transform OBB points to world space relatively painlessly, and eliminated the ugly and slow dot product, thus making the code faster. Still, the code is slow. Why?| zeux.io
Last week I tried my best to optimize the underlying functions without touching the essence of the algorithm (if there had initially been a function that filled an 8-vector array with AABB points, the optimizations from the previous post could have been done in the math library). It seems the strategy has to be changed.| zeux.io
Time’s running fast. Two weeks have passed since my post about COLLADA, and I’ve found a killer bug in FCollada TBN generation code.| zeux.io
About half a year ago, our team at work that develops the engine decided to try and switch from the proprietary Maya export plugin (it exported geometry, animation and materials) to COLLADA. The old export pipeline was somewhat obscure, lacked some useful optimizations, and (what’s most important) lacked any convenient way to set up materials. That was not a problem for platforms with more or less fixed functionality, but with next-generation consoles (or should I say current-generation alre...| zeux.io
So, yesterday we were discussing various lighting approaches in a local IRC chat, and someone said that it was impossible to make a ps.2.0 single-pass lighting shader for 8 lights with diffuse, specular and attenuation, and of course with a normal map. Well, I told him he was actually wrong, and that I could prove it. Proving it turned out to be a very fun and challenging task. I am going to share the resulting shader and the lessons learned with you.| zeux.io
While designing D3D10, a number of decisions were made to improve runtime efficiency (to reduce batch cost, basically). It’s no secret that D3D9 runtime is not exactly lean & mean – there is a lot of magic going on behind the scenes, a lot of validation, caching, patching…| zeux.io
Shadow mapping is my primary area of interest in computer graphics, so expect more posts on this topic. Today I’d like to talk about robust unit cube clipping with regard to different projection matrix building techniques.| zeux.io
A data structure that comes up fairly often when working with graphs or graph-like structures is a jagged array, or array-of-arrays. It’s very simple to build it out of standard containers but that’s often a poor choice for performance; in this post we’ll talk about a simple representation/construction code that I found useful across multiple different projects and domains. Crucially, we will focus on immutable structures - ones that you can build in one go from source data and then cont...| zeux.io
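One common flat layout for an immutable jagged array is a single data array plus an offsets array, built with a counting pass and a prefix sum. The C++ sketch below illustrates that general technique; `JaggedArray` and `build_jagged` are hypothetical names, and this is not necessarily the representation the post itself arrives at.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical flat jagged array: row i occupies data[offsets[i] .. offsets[i + 1]).
struct JaggedArray
{
    std::vector<std::size_t> offsets; // row_count + 1 entries
    std::vector<int> data;

    std::size_t row_size(std::size_t row) const { return offsets[row + 1] - offsets[row]; }
    const int* row_begin(std::size_t row) const { return data.data() + offsets[row]; }
};

// Builds the structure from (row, value) pairs in two passes: count, then fill.
JaggedArray build_jagged(std::size_t row_count,
                         const std::vector<std::pair<std::size_t, int>>& items)
{
    JaggedArray result;
    result.offsets.assign(row_count + 1, 0);

    // Pass 1: count items per row, stored one slot ahead.
    for (const auto& item : items)
    {
        assert(item.first < row_count);
        result.offsets[item.first + 1]++;
    }

    // Prefix sum turns per-row counts into start offsets.
    for (std::size_t i = 0; i < row_count; ++i)
        result.offsets[i + 1] += result.offsets[i];

    // Pass 2: scatter values using a per-row write cursor.
    std::vector<std::size_t> cursor(result.offsets.begin(), result.offsets.end() - 1);
    result.data.resize(items.size());
    for (const auto& item : items)
        result.data[cursor[item.first]++] = item.second;

    return result;
}
```

Compared to a vector of vectors, this uses two allocations in total regardless of the number of rows, and each row is a contiguous slice of one array.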
When working with mesh shaders to draw meshes, you need to split your source geometry into individual units called meshlets. Each meshlet would be processed by one mesh shader workgroup, and when compiling this mesh shader you need to specify the maximum number of triangles and vertices that the meshlet contains. These numbers are subject to some hardware limits. On current drivers, AMD, Intel and NVidia expose limits of 256 triangles and 256 vertices through EXT_mesh_shader Vulkan extension,...| zeux.io
When working with various forms of culling, it can be useful to project the object bounds to screen space. This is necessary to implement various forms of occlusion culling when using a depth pyramid, or to be able to reject objects or clusters that don’t contribute to any pixels. The same operation can also be used for level of detail selection, although it’s typically faster to approximate the projected area on screen - here we’re interested in efficient conservative projected bounds....| zeux.io
When working on vertex compressor for meshoptimizer in 2018, one of the goals was to make a bitstream that can be decompressed using (short) SIMD vector operations. This led to a host of design decisions in terms of how the data is structured, and some challenges when mapping the decoder flow onto various SIMD architectures. The most significant issue has to do with implementing an operation that often doesn’t have a straightforward implementation: byte expansion.| zeux.io
A friend recently learned about Proebsting’s law and mentioned it to me off hand. If you aren’t aware, Proebsting’s law states: Compiler Advances Double Computing Power Every 18 Years Which is to say, if you upgrade your compiler every 18 years, you would expect on average your code to double in performance on the same hardware. This is in sharp contrast to Moore’s law, and suggests that we should be cautious about the performance gains that compiler evolution brings. Proebsting write...| zeux.io
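As a quick worked figure, derived here from the statement of the law rather than quoted from it: doubling every 18 years corresponds to an annual improvement factor r that satisfies

```latex
r^{18} = 2 \quad\Longrightarrow\quad r = 2^{1/18} \approx 1.039
```

that is, roughly 4% per year from compiler improvements alone.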
I joined Roblox at the end of 2012 as a rendering engineer; I had just spent more than a year working on various titles from the FIFA franchise after years of console game development and was becoming a bit tired of the “big game development”. My work on FIFA was as a contractor and I got an offer for a full-time position, but I also had a friend who worked at Roblox reach out and offer me a chance to move to California and work on Roblox. I knew absolutely nothing about Roblox, but California was nic...| zeux.io
In 2018, I wrote an article “Writing an efficient Vulkan renderer” for the GPU Zen 2 book, which was published in 2019. In this article I tried to aggregate as much information about Vulkan performance as I could - instead of trying to focus on one particular aspect or application, it tries to cover a wide range of topics, give readers an understanding of the behavior of different APIs on real hardware and provide a range of options for each problem that needs to be solved. At the time ...| zeux.io
3 years ago, we ported our renderer to Metal. It didn’t take much time, it was a blast and it worked really well on iOS. Today Metal is in better shape than ever - and I’d like to talk a bit about that. But first, if you have not read the original article, you might want to start with that; most of that still holds today.| zeux.io
When writing a Vulkan renderer, one has to learn a lot of new concepts. Some of them are easier to deal with than others, and one of the pretty straightforward additions is the pipeline cache. To make sure pipeline creation is as efficient as possible, you need to create a pipeline cache and use it whenever you need to create a new pipeline. To make sure subsequent runs of your application don’t have to spend the time repeatedly compiling the shader microcode, you need to save the pipeline ...| zeux.io
In 2011-2012 I worked on FIFA Street, followed by FIFA EURO 2012 DLC and finally FIFA 13 - all of these games were based on the same codebase, and this codebase was HUGE. Given an unknown codebase, you need a way to quickly get around it - since you don’t know the code, you resort to search-based navigation, aka grep. Using Visual Studio Ctrl+Shift+F search on an HDD on a codebase this size means that every search takes minutes. This was frustrating and as such I decided to solve this problem.| zeux.io
When implementing vertex/index decoders in meshoptimizer, the main focus was on lean implementation and decompression performance. When your streaming source is capable of delivering hundreds of megabytes per second, as is the case with SSD drives, and you want to accelerate loading by compressing the data further, you need to be decompressing at multiple hundreds of megabytes per second, ideally a gigabyte, to make sure a small number of CPU cores can keep up with IO. Keeping implementation ...| zeux.io
During development of meshoptimizer a question that comes up relatively often is “should this algorithm use SIMD?”. The library is performance-oriented, but SIMD doesn’t always provide significant performance benefits - unfortunately, the use of SIMD can make the code less portable and less maintainable, so this tradeoff has to be resolved on a case by case basis. When performance is of utmost importance, such as vertex/index codecs, separate SIMD implementations for SSE and NEON instru...| zeux.io
A library that I work on often these days, meshoptimizer, has changed over time to use fewer and fewer C++ library features, up until the current state where the code closely resembles C even though it uses some C++ features. There have been many reasons behind the changes - dropping C++11 requirement allowed me to make sure anybody can compile the library on any platform, removing std::vector substantially improved performance of unoptimized builds, removing algorithm includes sped up compil...| zeux.io
In the last article we’ve discussed the particulars of voxel data definition and storage for voxel terrain we use at Roblox. From there on a lot of other systems read & write data from the storage and interpret it in different ways - the implementation for each system (rendering, networking, physics) is completely separate and not tied too much to decisions storage or other systems are making, so we can study them independently. While logically speaking it would make sense to look at mesher...| zeux.io
I have been working a lot on vertex cache optimization lately, exploring several algorithms from multiple axes - optimization performance, optimization efficiency, corner cases and the like. While doing so, I’ve implemented a program to verify that the algorithms actually produce results beneficial for real hardware - and today we will discuss one such algorithm, namely “Optimal Grid Rendering”.| zeux.io
It’s been almost two years since we shipped the first version of smooth voxel terrain at Roblox, and with it being live for a while and seeing a lot of incremental improvements I wanted to write about the internals of the technology - this feature required implementing serialization, network replication, collision detection, ray casting, rendering and in-memory storage support and within each area some implementation details ended up being quite interesting. Today we’ll talk about v...| zeux.io
We have successfully shipped the Metal rendering backend to millions of users, and I want to write a bit about that. There are varying opinions on Metal in the industry - some claim Metal would not have been needed if only Apple dedicated more attention to OpenGL and Vulkan, some say it’s the easiest graphics API that ever existed. Why even bother with Metal, some ask, if you can just write OpenGL or Vulkan code, and use MoltenGL or MoltenVK to the same effect? Here are my thoughts on the API.| zeux.io
Exactly ten years ago, the first version of my XML parser, pugixml, got released to the public.| zeux.io
In the last article (Approximating slerp) we discussed a need for a fast and reasonably precise quaternion interpolation method. By looking at the data we arrived at two improvements to nlerp, a less precise one and a more precise one. Let’s look at their implementations and performance!| zeux.io
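For reference, the baseline that those improvements start from is plain nlerp: linearly interpolate the components, then renormalize. The C++ sketch below shows only that baseline, not the improved variants the post discusses; the `Quat` struct and the shortest-arc sign flip are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>

struct Quat { float x, y, z, w; }; // hypothetical minimal quaternion type

// Plain nlerp: lerp the components, then renormalize to unit length.
Quat nlerp(Quat l, Quat r, float t)
{
    // Negate the contribution of r if needed so we interpolate along the shorter arc.
    float dot = l.x * r.x + l.y * r.y + l.z * r.z + l.w * r.w;
    float rt = dot < 0.f ? -t : t;
    float lt = 1.f - t;

    Quat q = { lt * l.x + rt * r.x, lt * l.y + rt * r.y, lt * l.z + rt * r.z, lt * l.w + rt * r.w };

    float len = std::sqrt(q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w);
    return { q.x / len, q.y / len, q.z / len, q.w / len };
}
```

At t = 0.5 the result lies exactly halfway along the arc by symmetry; away from the midpoint and the endpoints, the angle reached by nlerp deviates from the constant-velocity slerp result, which is the error that corrections to nlerp aim to reduce.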
Execution time in many programs is dominated by memory access time, not compute time. This is becoming increasingly true with higher instruction-level parallelism, wider SIMD, larger core counts and a lack of breakthroughs in memory access latency. A lot of performance talks now start by explaining that in a cache hierarchy the last-level cache miss is 100x or more expensive than a first-level cache hit, TLB misses are scary and contiguous data is great. But there is another beast that lurks ...| zeux.io
Today I’m going to describe a not very practical but neat experiment, the result of which is a sequence that’s awfully slow to sort using Microsoft STL implementation; additionally, the method of generating such sequence naturally extends to any other quicksort-like approach.| zeux.io
As I wrote in the previous post, there is a long way to go from first tests to the complete testing suite. Without further ado, here is the list of things I consider important for a test suite of a middleware product. Some of the items here are only relevant for the case where you want an automatic continuous integration-style testing - they’re marked with an asterisk (*).| zeux.io
Four and a half years ago, I was working on a pet game project which used XML as an intermediate storage format. Initially we used TinyXML, but I got tired of its interface and horrible parsing performance, and found pugxml. It was somewhat faster, with a somewhat better interface, but still - it was very rough. I decided to slightly change the library, improving performance and design along the way. Thus pugixml was born.| zeux.io
There are lots of data structures out there, ranging from primitive to so sophisticated that only a single person in the world understands them (these are, of course, mostly useless). The choice of the data structure is mostly specific to the problem; however, obviously some data structures are generally more popular/useful than others.| zeux.io
This may come as a surprise, but I am not dead. In fact, what you see is a new post! As usual I have a lot of interesting themes to cover, and barely enough time to spare. While I’m at it, let me tell you about NDAs. I hate NDAs with a passion – I’ve got some things to blog about that are partially covered by NDA (of course, the interesting parts are NOT); also I’ve been thinking that this is a non-issue and basically that I can blog about things that are not quite critical, but half ...| zeux.io
I can’t believe I’m writing this, it’s been what, 2 months? During that time a lot of things happened – I went to the conference and gave an hour-long talk about our SPU rendering stuff (which was more or less well received), I’ve almost completed an occlusion subsystem (rasterization-based), which is giving good results; and the financial crisis has finally hit the company I work at – some projects are frozen due to the lack of funding, and some people have been fired. It’s kin...| zeux.io
There is a bunch of small notes I’d like to share – none of them deserves a post, but I don’t want them to disappear forever.| zeux.io
Memory management is one of the (many) cornerstones of tech quality for console games. Proper memory management can decrease the number of bugs, increase product quality (for example, by eliminating desperate pre-release asset shrinking) and generally make life way easier – long term, that is. Improper memory management can wreak havoc. For example, any middleware without means to control/override memory management is, well, often not an option; any subsystem that uncontrollably allocates memory ca...| zeux.io
Last week I posted some teaser code that will be transformed several times, each time yielding a faster version - “faster” in terms of “taking fewer cycles for the test case on SPU”. A lot of you probably looked at my admittedly lame excuse for, uhm, math library and want to ask – why the hell do you use scalar code? We’re going to address the problem in this issue. This is probably a no-brainer for most of my readers, but this is a good opportunity to introduce some important poi...| zeux.io
Here I come again, back from almost a year long silence – and for some weird reason a visitor counter shows that people are still reading my blog! This was an eventful year for me – I worked on lots of things at work and on some at home, got 3 more shipped titles to put in my CV, started really programming on PS3 (including many RSX-related adventures, optimizations and, recently, SPU coding, which I happen to enjoy a lot), and, as some of you will probably guess from the code below, star...| zeux.io
Recently I was doing particle rendering for different platforms (D3D9, PS3, XBox360), and I wanted to share my experience. The method I came up with (which is more or less the same for all 3 platforms) is nothing new or complex - in fact, I know people were and are doing it already - but nevertheless I’ve never seen it described anywhere, so it might help somebody.| zeux.io
Backface culling is something we take for granted when rendering triangle meshes on the GPU. In general, an average mesh is expected to have about 50% of its triangles facing away from the camera. Unless you forget to set appropriate render states in your favorite graphics API, the hardware will reject these triangles as early in the rasterization pipeline as possible. Thus, it would seem that backface culling is a solved problem. In this post, however, we’ll explore a few alternative strat...| zeux.io
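For context, the basic geometric test behind backface culling fits in a few lines. The C++ sketch below assumes counter-clockwise front faces and a perspective camera at a known position; `Vec3` and `is_backfacing` are hypothetical names, and this is the textbook per-triangle test rather than one of the alternative strategies the post goes on to explore.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; }; // hypothetical minimal vector type

static Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// A triangle (a, b, c) with counter-clockwise front faces is backfacing
// when the camera lies on or behind the plane of the triangle.
bool is_backfacing(Vec3 a, Vec3 b, Vec3 c, Vec3 camera)
{
    Vec3 normal = cross(sub(b, a), sub(c, a)); // unnormalized face normal
    return dot(normal, sub(camera, a)) <= 0.f;
}
```

The normal does not need to be normalized because only the sign of the dot product matters.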
Machine learning is taking the world by storm. There’s amazing progress in many areas that were either considered intractable or had not reached a satisfying solution despite decades of research. A lot of results in machine learning are obtained using neural networks, but that’s just one class of algorithms. Today we’ll look at one key algorithm from meshoptimizer that was improved by getting the machine to find the best answer instead of me, the human. A necessary disclaimer: I’m no...| zeux.io
Quaternions should probably be your first choice as far as representing rotations goes. They take less space than matrices (this is important since programs are increasingly more memory bound); they’re similar in terms of performance of basic operations (slower for some, faster for others); they are much faster to normalize which is frequently necessary to combat accumulating error; and finally they’re way easier to interpolate. In this post we’ll focus on interpolation.| zeux.io