Suppose you have a function that does some complex logic surrounding some other chunk of user-supplied logic. For example, a function that takes a list of bodies and constraints and applies the constraints to the bodies, like so:
public struct Body { public float X; public float Y; public float Otherness; public float Ineffability; }

interface IConstraint
{
    int A { get; }
    int B { get; }
    void Apply(ref Body a, ref Body b);
}

struct IneffabilityConstraint : IConstraint
{
    public int A { get; set; }
    public int B { get; set; }

    public void Apply(ref Body a, ref Body b)
    {
        a.Ineffability = a.Ineffability + b.Otherness * (b.X * b.X + b.Y * b.Y);
    }
}

static void ApplyConstraintsThroughInterface(Span<Body> bodies, Span<IConstraint> constraints)
{
    for (int i = 0; i < constraints.Length; ++i)
    {
        var constraint = constraints[i];
        constraint.Apply(ref bodies[constraint.A], ref bodies[constraint.B]);
    }
}
The loop grabs the body references and hands them to the relevant constraint. Nothing too unusual here; it works, it’s fairly simple.
And it’s way slower than it needs to be.
Virtual obstacles
The constraints span contains elements of type IConstraint. Even though we only have one IConstraint implementation in the above, imagine if this were a library: there could be any number of other IConstraint implementations, and the constraints span might contain several different types. The compiler punts responsibility for calling the proper implementation to execution time.
That punt results in virtual calls. In the ApplyConstraintsThroughInterface loop body, the IL (first listing below) shows this as ‘callvirt’, and without any other help, it pretty much guarantees some heavy uninlineable function calls in the final assembly (second listing).
L_0004: ldarga.s constraints
L_0006: ldloc.0
L_0007: call instance !0& [System.Runtime]System.Span`1<class CodegenTests.GenericsAbuse/IConstraint>::get_Item(int32)
L_000c: ldind.ref
L_000d: stloc.1
L_000e: ldloc.1
L_000f: ldarga.s bodies
L_0011: ldloc.1
L_0012: callvirt instance int32 CodegenTests.GenericsAbuse/IConstraint::get_A()
L_0017: call instance !0& [System.Runtime]System.Span`1<valuetype CodegenTests.GenericsAbuse/Body>::get_Item(int32)
L_001c: ldarga.s bodies
L_001e: ldloc.1
L_001f: callvirt instance int32 CodegenTests.GenericsAbuse/IConstraint::get_B()
L_0024: call instance !0& [System.Runtime]System.Span`1<valuetype CodegenTests.GenericsAbuse/Body>::get_Item(int32)
L_0029: callvirt instance void CodegenTests.GenericsAbuse/IConstraint::Apply(valuetype CodegenTests.GenericsAbuse/Body&, valuetype CodegenTests.GenericsAbuse/Body&)
movsxd rcx,r14d
mov r15,qword ptr [rsi+rcx*8]
mov rcx,r15
mov r11,7FFA11440020h
cmp dword ptr [rcx],ecx
call qword ptr [7FFA11440020h]
cmp eax,ebp
jae 00007FFA11551E1E
movsxd rcx,eax
shl rcx,4
lea r12,[rbx+rcx]
mov rcx,r15
mov r11,7FFA11440028h
cmp dword ptr [rcx],ecx
call qword ptr [7FFA11440028h]
cmp eax,ebp
jae 00007FFA11551E1E
movsxd r8,eax
shl r8,4
add r8,rbx
mov rcx,r15
mov rdx,r12
mov r11,7FFA11440030h
cmp dword ptr [rcx],ecx
call qword ptr [7FFA11440030h]
Note that not every interface or abstract function call will result in a full virtual call in the final assembly. The JIT can perform devirtualization sometimes, but I built this example to prevent it.
Boxing
We can store any IConstraint-implementing reference type instance in the span because it just stores the references. But value type instances aren’t directly tracked by the GC and don’t have a reference by default. And, since we’re talking about a contiguous array with fixed stride, we can’t just shove a value type instance of arbitrary size in.
To make everything work, the value type instance gets ‘boxed’ into a reference type instance. The reference to that new instance is then stored in the array. If you go look at the IL generated when sticking the IneffabilityConstraints into the IConstraint array, you’ll see this:
L_00cf: box CodegenTests.GenericsAbuse/IneffabilityConstraint
In other words, sticking a value type into a slot shaped like a reference type automatically creates a reference type instance for you. It works, but now you’ve given the poor GC more work to do, and that’s rude.
Sure, you could stick only to reference types, but then you’re on the path to having a really complicated web of references for the GC to track. Frankly, that’s pretty inconsiderate too.
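To make the boxing copy concrete, here is a tiny standalone sketch (the types are illustrative, not from the engine): assigning a struct to an interface-typed variable boxes it, and the box is an independent copy.

```csharp
using System;

public interface IAccumulator { int Total { get; } void Add(int x); }

public struct SumAccumulator : IAccumulator
{
    public int Total { get; private set; }
    public void Add(int x) => Total += x;
}

public static class BoxingDemo
{
    public static void Main()
    {
        var unboxed = new SumAccumulator();
        //Assigning the struct to an interface-typed variable boxes it: a heap
        //object is allocated and the struct's bytes are copied into it.
        IAccumulator boxed = unboxed;

        //The box is an independent copy; mutating it doesn't touch the original.
        boxed.Add(5);
        Console.WriteLine(unboxed.Total); //0
        Console.WriteLine(boxed.Total);   //5
    }
}
```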
Specializing for speediness
Ideally, we could eliminate virtual calls and any other form of last-second indirection, allow inlining, and avoid any boxing.
This is possible if we’re willing to organize the incoming data into batches of the same type. This opens the door, at least conceptually, for specializing the entire loop for the specific type we’re working with. How would this look?
static void ApplyConstraintsWithGenericsAbuse<TConstraint>(Span<Body> bodies, Span<TConstraint> constraints)
    where TConstraint : IConstraint
{
    for (int i = 0; i < constraints.Length; ++i)
    {
        ref var constraint = ref constraints[i];
        constraint.Apply(ref bodies[constraint.A], ref bodies[constraint.B]);
    }
}
Almost identical! Rather than having a span over any possible IConstraint, this instead takes a span of a specific type TConstraint that implements IConstraint. If we peek at the IL and assembly of the loop body again:
L_0004: ldarga.s constraints
L_0006: ldloc.0
L_0007: call instance !0& [System.Runtime]System.Span`1<!!TConstraint>::get_Item(int32)
L_000c: stloc.1
L_000d: ldloc.1
L_000e: ldarga.s bodies
L_0010: ldloc.1
L_0011: constrained. !!TConstraint
L_0017: callvirt instance int32 CodegenTests.GenericsAbuse/IConstraint::get_A()
L_001c: call instance !0& [System.Runtime]System.Span`1<valuetype CodegenTests.GenericsAbuse/Body>::get_Item(int32)
L_0021: ldarga.s bodies
L_0023: ldloc.1
L_0024: constrained. !!TConstraint
L_002a: callvirt instance int32 CodegenTests.GenericsAbuse/IConstraint::get_B()
L_002f: call instance !0& [System.Runtime]System.Span`1<valuetype CodegenTests.GenericsAbuse/Body>::get_Item(int32)
L_0034: constrained. !!TConstraint
L_003a: callvirt instance void CodegenTests.GenericsAbuse/IConstraint::Apply(valuetype CodegenTests.GenericsAbuse/Body&, valuetype CodegenTests.GenericsAbuse/Body&)
movsxd r10,r9d
lea r10,[rax+r10*8]
mov r11d,dword ptr [r10]
cmp r11d,ecx
jae 00007FFA11632C39
movsxd r11,r11d
shl r11,4
add r11,r8
mov r10d,dword ptr [r10+4]
cmp r10d,ecx
jae 00007FFA11632C39
movsxd r10,r10d
shl r10,4
add r10,r8
vmovss xmm0,dword ptr [r11+0Ch]
vmovss xmm1,dword ptr [r10+8]
vmovss xmm2,dword ptr [r10]
vmulss xmm2,xmm2,xmm2
vmovss xmm3,dword ptr [r10+4]
vmulss xmm3,xmm3,xmm3
vaddss xmm2,xmm2,xmm3
vmulss xmm1,xmm1,xmm2
vaddss xmm0,xmm0,xmm1
vmovss dword ptr [r11+0Ch],xmm0
The IL is pretty similar. Roslyn is still outputting callvirts, except now they are ‘constrained’. This makes all the difference. Thanks to the extra type awareness, the JIT can not only eliminate the virtual call but also inline the contained logic.
That alone accounts for about a factor of 4 speedup in a microbenchmark comparing the two approaches. A larger program with more logic per virtual call would not show quite as much difference, but the cost is real.
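A minimal, self-contained reconstruction of that comparison is sketched below. This is my own sketch, not the article's benchmark harness; rather than timing the two paths, it verifies that they compute the same result while generating very different code.

```csharp
using System;

public struct Body { public float X, Y, Otherness, Ineffability; }

public interface IConstraint { int A { get; } int B { get; } void Apply(ref Body a, ref Body b); }

public struct IneffabilityConstraint : IConstraint
{
    public int A { get; set; }
    public int B { get; set; }
    public void Apply(ref Body a, ref Body b)
        => a.Ineffability += b.Otherness * (b.X * b.X + b.Y * b.Y);
}

public static class Program
{
    //Virtual dispatch per element; the constraints live as boxed heap objects.
    public static void ApplyThroughInterface(Span<Body> bodies, Span<IConstraint> constraints)
    {
        for (int i = 0; i < constraints.Length; ++i)
        {
            var c = constraints[i];
            c.Apply(ref bodies[c.A], ref bodies[c.B]);
        }
    }

    //Specialized per TConstraint; with a struct argument the calls can inline.
    public static void ApplyWithGenerics<TConstraint>(Span<Body> bodies, Span<TConstraint> constraints)
        where TConstraint : IConstraint
    {
        for (int i = 0; i < constraints.Length; ++i)
        {
            ref var c = ref constraints[i];
            c.Apply(ref bodies[c.A], ref bodies[c.B]);
        }
    }

    public static Body[] CreateBodies()
    {
        var bodies = new Body[256];
        for (int i = 0; i < bodies.Length; ++i)
            bodies[i] = new Body { X = i * 0.5f, Y = i * 0.25f, Otherness = 1f + i };
        return bodies;
    }

    public static void Main()
    {
        var structs = new IneffabilityConstraint[255];
        var boxed = new IConstraint[255];
        for (int i = 0; i < structs.Length; ++i)
        {
            structs[i] = new IneffabilityConstraint { A = i, B = i + 1 };
            boxed[i] = structs[i]; //Boxes each struct into a heap object.
        }

        var a = CreateBodies();
        var b = CreateBodies();
        ApplyThroughInterface(a, boxed);
        ApplyWithGenerics<IneffabilityConstraint>(b, structs);

        //Both paths run identical arithmetic in identical order; only the generated code differs.
        bool match = true;
        for (int i = 0; i < a.Length; ++i)
            if (a[i].Ineffability != b[i].Ineffability) match = false;
        Console.WriteLine(match); //True
    }
}
```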
But wait, hold on… What’s this? If we do nothing more than change the IneffabilityConstraint to a class instead of a struct, the assembly becomes:
mov rcx,qword ptr [r12]
mov r11,7FFA03910038h
cmp dword ptr [rcx],ecx
call qword ptr [7FFA03910038h]
cmp eax,r14d
jae 00007FFA03A2436A
movsxd rcx,eax
shl rcx,4
lea r13,[rcx+rbp]
mov rcx,qword ptr [r12]
mov r11,7FFA03910040h
cmp dword ptr [rcx],ecx
call qword ptr [7FFA03910040h]
cmp eax,r14d
jae 00007FFA03A2436A
movsxd r8,eax
shl r8,4
add r8,rbp
mov rcx,qword ptr [r12]
mov rdx,r13
mov r11,7FFA03910048h
cmp dword ptr [rcx],ecx
call qword ptr [7FFA03910048h]
The calls have returned! It’s still about 30-50% faster than the original pure interface version, but nothing is inlined anymore. The JIT should still have all the necessary type knowledge to inline… right?
Sort of. The JIT handles value types and reference types differently when it comes to generics. The JIT takes advantage of the fact that all reference types are able to share the same implementation, even if it means specific implementations can’t be inlined. (This behavior is subject to change; for all I know there might already be cases where certain reference types are specialized.)
But even if the JIT wanted to, it can’t efficiently share implementations across value types. They’re different sizes and, unless boxed, have no method table for indirection. So, rather than boxing every single value type to share the same code, it chooses to output a type-specialized implementation.
This is really useful.
In BEPUphysics v2
This compiler behavior is (ab)used all over the place in the engine. You can stretch this from simple things like efficient callbacks all the way over to almost-template-hell.
One of the first places users might run into this with BEPUphysics v2 is in narrow phase callbacks, specified when the Simulation.Create function is called:
public unsafe interface INarrowPhaseCallbacks
{
    void Initialize(Simulation simulation);
    bool AllowContactGeneration(int workerIndex, CollidableReference a, CollidableReference b);
    bool ConfigureContactManifold(int workerIndex, CollidablePair pair, ConvexContactManifold* manifold, out PairMaterialProperties pairMaterial);
    bool ConfigureContactManifold(int workerIndex, CollidablePair pair, NonconvexContactManifold* manifold, out PairMaterialProperties pairMaterial);
    bool AllowContactGeneration(int workerIndex, CollidablePair pair, int childIndexA, int childIndexB);
    bool ConfigureContactManifold(int workerIndex, CollidablePair pair, int childIndexA, int childIndexB, ConvexContactManifold* manifold);
    void Dispose();
}
These are directly inlineable callbacks from the narrow phase’s execution context. Collision pairs can be filtered arbitrarily, contact manifolds can be accessed or modified before they’re used to create constraints, materials can be defined arbitrarily. There are no assumptions about collision filtering or material blending built into the engine, and there are no collision events in the library itself. Events could, however, easily be built on top of these callbacks.
Another case: want to enumerate the set of bodies connected to a given body through constraints? No problem. In the Simulation.Bodies collection:
public void EnumerateConnectedBodies<TEnumerator>(int bodyHandle, ref TEnumerator enumerator) where TEnumerator : IForEach<int>
Where IForEach<T> allows the user to provide a callback that will be executed for each connected body:
public interface IForEach<T> { void LoopBody(T i); }
The struct-implementing-enumerator-interface pattern is used quite a few times. The ReferenceCollector, in particular, sees a lot of use. It provides the simplest ‘store a list of results’ enumerator.
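Here's a hypothetical enumerator in the spirit of ReferenceCollector, paired with a stand-in for a method like EnumerateConnectedBodies (everything except IForEach&lt;T&gt; is made up for the sketch):

```csharp
using System;
using System.Collections.Generic;

public interface IForEach<T> { void LoopBody(T i); }

//A hypothetical enumerator in the style of ReferenceCollector: it just records
//every value it's handed.
public struct CollectingEnumerator : IForEach<int>
{
    public List<int> Results;
    public void LoopBody(int i) => Results.Add(i);
}

public static class Demo
{
    //Stand-in for a method like EnumerateConnectedBodies. Because TEnumerator is
    //a struct type parameter, the JIT can inline LoopBody into this loop.
    public static void EnumerateValues<TEnumerator>(ReadOnlySpan<int> values, ref TEnumerator enumerator)
        where TEnumerator : IForEach<int>
    {
        for (int i = 0; i < values.Length; ++i)
            enumerator.LoopBody(values[i]);
    }

    public static void Main()
    {
        var enumerator = new CollectingEnumerator { Results = new List<int>() };
        EnumerateValues(new int[] { 2, 3, 5 }, ref enumerator);
        Console.WriteLine(string.Join(",", enumerator.Results)); //2,3,5
    }
}
```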
There are plenty more examples like the above, particularly around ray/query filtering and processing. Can we go a bit further than callbacks? Sure! Here’s a real-world expansion of the above toy ‘constraints’ example:
public abstract class TwoBodyTypeProcessor<TPrestepData, TProjection, TAccumulatedImpulse, TConstraintFunctions>
    : TypeProcessor<TwoBodyReferences, TPrestepData, TProjection, TAccumulatedImpulse>
    where TConstraintFunctions : struct, IConstraintFunctions<TPrestepData, TProjection, TAccumulatedImpulse>
Without copying too much spam into this blog post, this generic definition provides enough information to create a number of functions shared by all two body constraints. The SolveIteration function looks like this:
public unsafe override void SolveIteration(ref TypeBatch typeBatch, ref Buffer<BodyVelocity> bodyVelocities, int startBundle, int exclusiveEndBundle)
{
    ref var bodyReferencesBase = ref Unsafe.AsRef<TwoBodyReferences>(typeBatch.BodyReferences.Memory);
    ref var accumulatedImpulsesBase = ref Unsafe.AsRef<TAccumulatedImpulse>(typeBatch.AccumulatedImpulses.Memory);
    ref var projectionBase = ref Unsafe.AsRef<TProjection>(typeBatch.Projection.Memory);
    var function = default(TConstraintFunctions);
    for (int i = startBundle; i < exclusiveEndBundle; ++i)
    {
        ref var projection = ref Unsafe.Add(ref projectionBase, i);
        ref var accumulatedImpulses = ref Unsafe.Add(ref accumulatedImpulsesBase, i);
        ref var bodyReferences = ref Unsafe.Add(ref bodyReferencesBase, i);
        int count = GetCountInBundle(ref typeBatch, i);
        Bodies.GatherVelocities(ref bodyVelocities, ref bodyReferences, count, out var wsvA, out var wsvB);
        function.Solve(ref wsvA, ref wsvB, ref projection, ref accumulatedImpulses);
        Bodies.ScatterVelocities(ref wsvA, ref wsvB, ref bodyVelocities, ref bodyReferences, count);
    }
}
The TypeBatch, containing raw untyped buffers, is given meaning when handed to the TypeProcessor that knows how to interpret it. The individual type functions don’t have to worry about reinterpreting untyped memory or gathering velocities; that’s all handled by the shared implementation with no virtual call overhead.
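The reinterpretation half of that pattern can be shown in miniature. This is an illustrative sketch only (the types are made up, and it must be compiled with unsafe code enabled): untyped bytes are given meaning through Unsafe.AsRef and Unsafe.Add, the same tools the type processors use.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class Reinterpret
{
    public struct Impulse { public float Linear, Angular; }

    public static unsafe float WriteAndReadBack()
    {
        //Untyped storage, standing in for a TypeBatch's raw buffer.
        var raw = new byte[16];
        fixed (byte* memory = raw)
        {
            //Reinterpret the untyped memory as a typed base reference, then
            //index into it, in the style of the SolveIteration above.
            ref var impulseBase = ref Unsafe.AsRef<Impulse>(memory);
            Unsafe.Add(ref impulseBase, 1).Linear = 3f;
        }
        //Element 1 starts at byte offset sizeof(Impulse) = 8; Linear is its first field.
        return BitConverter.ToSingle(raw, 8);
    }

    public static void Main() => Console.WriteLine(WriteAndReadBack()); //3
}
```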
Can we go even further? Sure! Let’s look at the ConvexCollisionTask which handles batches of collision pairs in a vectorized way. As you might expect by now, it has some hefty generics:
public class ConvexCollisionTask<TShapeA, TShapeWideA, TShapeB, TShapeWideB, TPair, TPairWide, TManifoldWide, TPairTester> : CollisionTask
    where TShapeA : struct, IShape
    where TShapeB : struct, IShape
    where TShapeWideA : struct, IShapeWide<TShapeA>
    where TShapeWideB : struct, IShapeWide<TShapeB>
    where TPair : struct, ICollisionPair<TPair>
    where TPairWide : struct, ICollisionPairWide<TShapeA, TShapeWideA, TShapeB, TShapeWideB, TPair, TPairWide>
    where TPairTester : struct, IPairTester<TShapeWideA, TShapeWideB, TManifoldWide>
    where TManifoldWide : IContactManifoldWide
To paraphrase, this is requiring that the inputs be two vectorizable shapes and a function capable of handling those shape types. But the actual batch execution does some more interesting things than merely inlining a user supplied function. Check out the loop body:
if (pairWide.OrientationCount == 2)
{
    defaultPairTester.Test(
        ref pairWide.GetShapeA(ref pairWide),
        ref pairWide.GetShapeB(ref pairWide),
        ref pairWide.GetSpeculativeMargin(ref pairWide),
        ref pairWide.GetOffsetB(ref pairWide),
        ref pairWide.GetOrientationA(ref pairWide),
        ref pairWide.GetOrientationB(ref pairWide),
        countInBundle,
        out manifoldWide);
}
else if (pairWide.OrientationCount == 1)
{
    //Note that, in the event that there is only one orientation, it belongs to the second shape.
    //The only shape that doesn't need orientation is a sphere, and it will be in slot A by convention.
    Debug.Assert(typeof(TShapeWideA) == typeof(SphereWide));
    defaultPairTester.Test(
        ref pairWide.GetShapeA(ref pairWide),
        ref pairWide.GetShapeB(ref pairWide),
        ref pairWide.GetSpeculativeMargin(ref pairWide),
        ref pairWide.GetOffsetB(ref pairWide),
        ref pairWide.GetOrientationB(ref pairWide),
        countInBundle,
        out manifoldWide);
}
else
{
    Debug.Assert(pairWide.OrientationCount == 0);
    Debug.Assert(typeof(TShapeWideA) == typeof(SphereWide) && typeof(TShapeWideB) == typeof(SphereWide),
        "No orientation implies a special case involving two spheres.");
    //Really, this could be made into a direct special case, but eh.
    defaultPairTester.Test(
        ref pairWide.GetShapeA(ref pairWide),
        ref pairWide.GetShapeB(ref pairWide),
        ref pairWide.GetSpeculativeMargin(ref pairWide),
        ref pairWide.GetOffsetB(ref pairWide),
        countInBundle,
        out manifoldWide);
}
//Flip back any contacts associated with pairs which had to be flipped for shape order.
if (pairWide.HasFlipMask)
{
    manifoldWide.ApplyFlipMask(ref pairWide.GetOffsetB(ref pairWide), pairWide.GetFlipMask(ref pairWide));
}
Typically, you really don’t want to wait until the last second to perform a branch. So what’s up with this?
The TPairWide.OrientationCount and HasFlipMask properties both return constant values. Since the JIT is already creating a dedicated code path for the specified type parameters (they’re all value types, after all), it takes the compile-time-known values into account and prunes out the unreachable code paths. The final assembly will only include whichever orientation count path is relevant, and the flip mask chunk won’t exist at all unless required. No branches involved!
The JIT can also recognize certain kinds of type condition as constant. In other words, when using value types with generics, C# supports something similar to C++ template specialization.
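A minimal illustration of that specialization-like behavior (my own sketch, not engine code): every branch below is statically decidable once T is a known struct type, so each instantiation's generated code contains only its own return.

```csharp
using System;

public static class Specialization
{
    //With struct type arguments the JIT compiles a separate body per T and treats
    //typeof(T) == typeof(...) as a constant, pruning untaken branches entirely.
    public static string Describe<T>() where T : struct
    {
        if (typeof(T) == typeof(float))
            return "single precision path";
        if (typeof(T) == typeof(double))
            return "double precision path";
        return "general path";
    }

    public static void Main()
    {
        Console.WriteLine(Describe<float>());  //single precision path
        Console.WriteLine(Describe<double>()); //double precision path
        Console.WriteLine(Describe<int>());    //general path
    }
}
```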
If this has awakened a dark hunger within you, you might also like the convex sweep generic definition which abstracts over different shape types as well as different shape path integrations.
And why not create collection types that vary not only over element type, but also over the internal buffer type, any required comparisons, and memory pool types? What could go wrong?
Summary
C# generics might not be a Turing complete metaprogramming language, but they can still do some pretty helpful things that go beyond just having a list of a particular type. With a bit of compiler awareness, you can express all sorts of simple or complex abstractions with zero overhead.
You can also create horrible messes.