Idiomatic C# tends to be object oriented and reliant on the garbage collector. That style can make modelling certain kinds of application logic easier, but as soon as performance is a requirement, strict OOP is a path to suffering. As this is a property of the underlying memory access patterns and the limitations of hardware, this applies to deeply OOP implementations in any language- even C++.
I won't go into a full digression on data oriented design, but the short version is that memory is extremely slow compared to everything else that a CPU is doing. In the time it takes to retrieve a piece of data from main memory, a single CPU core can execute hundreds of instructions. That stall can happen every single time a new piece of memory is touched. Large idiomatic object oriented applications tend to form an enormous web of references which, when followed, require incoherent and often uncached memory accesses. It doesn't matter how good your compiler is, that's going to leave the CPU mostly idle.
Here's an example from BEPUphysics v1, a demo of a bunch of boxes connected together to behave like stiff cloth:
Before we get into how bad this is, let me advocate for v1 a little bit here:
- I disabled multithreading for this test, so all stalls remain unfilled by other thread work.
- This simulation is pretty large (at least for v1) at 14400 dynamic entities and 84734 constraints, so not everything will fit into the CPU caches even under ideal circumstances.
- I intentionally evicted everything from caches between updates. That might happen in some applications that need to do a bunch of non-physics work, but not all of them.
If you're not familiar with Intel VTune (which has a free license now!), 'CPI' refers to cycles per instruction and 'DRAM Bound' refers to the percentage of cycles which are stuck waiting on main memory requests. Breaking out DRAM Bound shows memory bandwidth versus memory latency bottlenecks.
Okay, how bad is this?
- 3-5 CPI means the CPU is doing very, very little besides waiting. Note that an ideal CPI is not 1; modern processors can execute more than one instruction per cycle in most relevant cases (see reciprocal throughput numbers on Agner Fog's always-useful instruction tables).
- While not shown above, it's worth mentioning that BEPUphysics v1 was built before the runtime supported vectorization, so in addition to executing <<1 instruction per cycle, those instructions work with at most one scalar value at a time. This particular simulation uses about 0.4% of the 3770K's floating point throughput. To be fair, reaching even 10% can be tricky in highly optimized complex workloads, but 0.4% is just... not good.
- A full 82% of cycles in the core solving function are stuck waiting on memory requests that the prefetcher could not predict, and which were not already in any cache. These requests take the form of body velocity, inertia, and constraint data, and in v1, all of them involve randomized memory accesses.
- UpdateSolverActivity is mostly bottlenecked by total bandwidth rather than latency. It can be tempting to look at a bandwidth bottleneck and shrug helplessly- after all, you need a certain amount of data to compute what you need, there's not much left to do, right? But the memory subsystem of a CPU doesn't work with 4 or 12 bytes in isolation, it works with cache lines which span 64 bytes (on AMD/Intel/many other CPUs). If you ask for a boolean flag all by itself, the CPU will also pull down the surrounding cache line. If the data you truly want is sparsely distributed, cache line utilization will be terrible and the bandwidth required to serve all memory requests will be vastly higher than it would be with a tight packing (see the sketch after this list).
- Note that the solver executes multiple iterations each frame. The second iteration and beyond will benefit from the first iteration pulling in a bunch of the required memory into L3 cache. In this case, the simulation is too big to fit all at once, but it does hold a decent chunk. If the simulation was larger or the cache got otherwise evicted between iterations, the above numbers would be even worse.
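To make that cache line point concrete, here's a small hypothetical sketch. None of these types come from BEPUphysics; the element layout, sizes, and names are made up purely to illustrate the traffic difference between pulling one field out of a fat 64 byte element versus packing that field into its own array:

using System;
using System.Runtime.CompilerServices;

//Hypothetical 64 byte element: one int we care about during this pass, plus
//60 bytes of unrelated data sharing the same cache line.
struct FatElement
{
    public int Flag;
    public int Other0, Other1, Other2, Other3, Other4, Other5, Other6, Other7,
               Other8, Other9, Other10, Other11, Other12, Other13, Other14;
}

static class CacheLineUtilizationSketch
{
    public static void Compare(int count)
    {
        var fat = new FatElement[count];
        var packed = new int[count];

        //Fat layout: each Flag sits on its own cache line, so this loop pulls
        //roughly 64 bytes of traffic for every 4 bytes it actually uses.
        long fatSum = 0;
        for (int i = 0; i < fat.Length; ++i)
            fatSum += fat[i].Flag;

        //Packed layout: 16 flags share each 64 byte cache line, so nearly every
        //byte fetched is a byte this loop needs.
        long packedSum = 0;
        for (int i = 0; i < packed.Length; ++i)
            packedSum += packed[i];

        Console.WriteLine($"Element sizes: {Unsafe.SizeOf<FatElement>()} bytes vs {sizeof(int)} bytes; " +
            $"estimated traffic for this pass: ~{(long)count * 64} vs ~{(long)count * sizeof(int)} bytes.");
        Console.WriteLine($"Sums: {fatSum}, {packedSum}");
    }
}

Both loops compute the same thing; the difference is purely how many cache lines have to travel out of DRAM to do it.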
The core issue behind all of these is pulling memory from a bunch of unpredictable places. This arises mostly from v1's OOP-y design. For example, the solver stores an array of all constraints:
RawList<SolverUpdateable> solverUpdateables = new RawList<SolverUpdateable>();
But every SolverUpdateable is a reference type:
public abstract class SolverUpdateable
An array of reference types is actually an array of pointers. Outside of rare ideal cases, it's unlikely that the data pointed to by these references will be in order and contiguous. No matter how this array is traversed, the accesses will still jump all over memory and suffer tons of cache misses.
(Side note: the v1 solver also intentionally randomizes solver order for the sake of avoiding some pathological constraint ordering which causes even more cache misses. OOPiness can't be blamed for that, but it's still something that v2 stopped doing.)
There's a lot of other subtle stuff going on here too, but the single largest takeaway is that we need a way to express memory in a contiguous, structured way. Fortunately, this feature has existed since the primordial era of C#: value types.
With value types, you can create a contiguous block of memory with values lined up in order. Take the following code as an example:
struct Snoot
{
    public int A;
    public float B;
    public long C;
}

public static void Boop()
{
    var snoots = new Snoot[4];
    for (int i = 0; i < snoots.Length; ++i)
    {
        ref var snoot = ref snoots[i];
        var value = i + 1;
        snoot.A = value;
        snoot.B = value;
        snoot.C = value;
    }
    ref var snootBytes = ref Unsafe.As<Snoot, byte>(ref snoots[0]);
    for (int i = 0; i < snoots.Length; ++i)
    {
        Console.Write($"Snoot {i} bytes: ");
        for (int j = 0; j < Unsafe.SizeOf<Snoot>(); ++j)
        {
            Console.Write($"{Unsafe.Add(ref snootBytes, i * Unsafe.SizeOf<Snoot>() + j)}, ");
        }
        Console.WriteLine();
    }
}
Executing Boop produces:
Snoot 0 bytes: 1, 0, 0, 0, 0, 0, 128, 63, 1, 0, 0, 0, 0, 0, 0, 0,
Snoot 1 bytes: 2, 0, 0, 0, 0, 0, 0, 64, 2, 0, 0, 0, 0, 0, 0, 0,
Snoot 2 bytes: 3, 0, 0, 0, 0, 0, 64, 64, 3, 0, 0, 0, 0, 0, 0, 0,
Snoot 3 bytes: 4, 0, 0, 0, 0, 0, 128, 64, 4, 0, 0, 0, 0, 0, 0, 0,
Walking sequentially through the array, we can directly observe the byte values that make up the Snoot instances. There are no indirections that need to be tracked down.
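For contrast, here's a hypothetical reference-type version of the same data (not from the original demo; the class and method names are made up). The array now stores references rather than the Snoot bytes themselves:

using System;

//Hypothetical class version of Snoot, for contrast with the struct above.
class SnootClass
{
    public int A;
    public float B;
    public long C;
}

static class ReferenceBoop
{
    public static void Boop()
    {
        //This array holds four references, not four Snoots' worth of bytes.
        var snoots = new SnootClass[4];
        for (int i = 0; i < snoots.Length; ++i)
        {
            //Each instance is a separate heap allocation; nothing guarantees the
            //four of them end up adjacent or in order in memory.
            snoots[i] = new SnootClass { A = i + 1, B = i + 1, C = i + 1 };
        }

        //Reading A, B, and C now requires following a pointer per element first.
        for (int i = 0; i < snoots.Length; ++i)
            Console.WriteLine($"Snoot {i}: {snoots[i].A}, {snoots[i].B}, {snoots[i].C}");
    }
}

In a tiny freshly allocated example like this, the GC will often happen to place the instances next to each other, but that's luck rather than a guarantee, and a long-lived application's reference web won't be nearly so tidy.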
So, if a traditional reference-heavy object oriented memory model isn't great, what is an alternative? BEPUphysics v2 shows one possibility. Using the solver as an example again, here is how v2 represents a group of constraints:
public struct TypeBatch
{
    public RawBuffer BodyReferences;
    public RawBuffer PrestepData;
    public RawBuffer AccumulatedImpulses;
    public RawBuffer Projection;
    public Buffer<int> IndexToHandle;
    public int ConstraintCount;
    public int TypeId;
}
An important note here is that the properties of a constraint- body references, prestep data, and so on- are split into different arrays. Not every stage of constraint processing needs to access every constraint property. For example, solve iterations do not need PrestepData at all; trying to bundle prestep data with the rest of the properties would just waste valuable space in cache lines during the solve. That helps with memory bandwidth.
There's no processing logic in the TypeBatch at all, though. It's raw data. Untyped, even- the RawBuffer just refers to a block of bytes. How is it used?
Each TypeBatch contains only one type of constraint, as the name may imply. TypeBatches are... processed... by a TypeProcessor. TypeProcessors are built to understand a single kind of constraint and have no state of their own; they could easily be implemented as function pointers of static functions. In the solver, you'll find:
public TypeProcessor[] TypeProcessors;
When the solver encounters a TypeBatch with a TypeId of 3, it can just do a quick array lookup to find the TypeProcessor which knows what to do with that TypeBatch's data.
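As a rough illustration, the dispatch pattern looks something like the sketch below. This is a simplification, not BEPUphysics v2's actual API; the method names, the example processor, and the SketchSolver wrapper are all made up:

//Simplified sketch of the dispatch pattern; method and type names are made up.
public abstract class TypeProcessor
{
    //One virtual call covers an entire batch of constraints of a single type.
    public abstract void SolveIteration(ref TypeBatch typeBatch);
}

public class HypotheticalHingeTypeProcessor : TypeProcessor
{
    public override void SolveIteration(ref TypeBatch typeBatch)
    {
        //Reinterpret the batch's raw buffers as hinge-specific data and run a
        //tight loop over the contiguous arrays.
    }
}

public class SketchSolver
{
    public TypeProcessor[] TypeProcessors;

    public void SolveIteration(TypeBatch[] typeBatches)
    {
        for (int i = 0; i < typeBatches.Length; ++i)
        {
            ref var batch = ref typeBatches[i];
            //One array lookup by TypeId, then one virtual call for the whole batch.
            TypeProcessors[batch.TypeId].SolveIteration(ref batch);
        }
    }
}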
(Side note: the cost of each virtual call is amortized over an entire type batch. Executing a type batch will tend to take hundreds of nanoseconds at an absolute minimum, so adding 2-4 nanoseconds for an indirection is pretty much irrelevant. Compare this to the OOPy approach of having a bunch of SolverUpdateable references in an array and doing a virtual call on every individual instance. Not a big deal compared to data cache misses, but still pointless overhead that can be avoided.)
In the end, this enables blasting through contiguous arrays of data with tight loops of code that look like this:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Solve(ref BodyVelocities velocityA, ref BodyVelocities velocityB, ref AngularHingeProjection projection, ref Vector2Wide accumulatedImpulse)
{
    Vector3Wide.Subtract(velocityA.Angular, velocityB.Angular, out var difference);
    Matrix2x3Wide.TransformByTransposeWithoutOverlap(difference, projection.VelocityToImpulseA, out var csi);
    Vector2Wide.Scale(accumulatedImpulse, projection.SoftnessImpulseScale, out var softnessContribution);
    Vector2Wide.Add(softnessContribution, csi, out csi);
    Vector2Wide.Subtract(projection.BiasImpulse, csi, out csi);
    Vector2Wide.Add(accumulatedImpulse, csi, out accumulatedImpulse);
    ApplyImpulse(ref velocityA.Angular, ref velocityB.Angular, ref projection, ref csi);
}
It's pure math with no branches and no constraint cache misses. While GPUs get the spotlight when it comes to compute throughput, CPUs are still extremely good at this, especially with instruction level parallelism. Every single operation in the above is executed on several constraints at the same time using SIMD instructions (which I'll cover in more detail later).
There are some cache misses hidden in that chunk of code, but they are fundamentally required by the nature of the solver algorithm- body velocities can't be stored alongside constraint data because there is no one to one mapping between them, so they must be incoherently gathered and scattered. (Side note: these cache misses can be partially mitigated; the library tries to group constraints by access patterns.)
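To give a rough idea of what those 'wide' types are doing, here's a minimal sketch built on System.Numerics.Vector<T>. It's an illustration of the layout, not the library's actual Vector3Wide implementation:

using System.Numerics;

//Minimal sketch of a 'wide' vector: each field stores one component for
//Vector<float>.Count different constraints at once (typically 4 or 8 lanes).
public struct Vector3WideSketch
{
    public Vector<float> X;
    public Vector<float> Y;
    public Vector<float> Z;

    //Subtracts the values of a whole bundle of constraints in a few SIMD
    //instructions, with no branching and no per-constraint overhead.
    public static void Subtract(in Vector3WideSketch a, in Vector3WideSketch b, out Vector3WideSketch result)
    {
        result.X = a.X - b.X;
        result.Y = a.Y - b.Y;
        result.Z = a.Z - b.Z;
    }
}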
How much of a difference does this make? Here's some more VTune data, this time from a comparable v2 simulation. Same processor, multithreading still disabled, cache still evicted between every frame.
0.454 cycles per instruction is a wee bit better than the 5+ we observed in v1. Main memory latency is pretty much eliminated as a problem. And on top of that, just about every one of those instructions is operating on multiple lanes of data with SIMD. This is a big part of why v2 is often an order of magnitude faster than v1.
Summary
Value types are not new to C# and certainly not new to programming in general, but controlling memory access patterns is absolutely critical for performance. Without this, all other attempts at optimization would be practically irrelevant.
The next step will be to see how painless we can make this style of programming in modern C#.