A different kind of C#: generics abuse

Suppose you have a function that wraps some complex logic around a chunk of user-supplied logic. For example, a function that takes a list of bodies and constraints and applies the constraints to the bodies, like so:

public struct Body
{
  public float X;
  public float Y;
  public float Otherness;
  public float Ineffability;
}

interface IConstraint
{
  int A { get; }
  int B { get; }
  void Apply(ref Body a, ref Body b);
}

struct IneffabilityConstraint : IConstraint
{
  public int A { get; set; }

  public int B { get; set; }

  public void Apply(ref Body a, ref Body b)
  {
    a.Ineffability = a.Ineffability + b.Otherness * (b.X * b.X + b.Y * b.Y);
  }
}

static void ApplyConstraintsThroughInterface(Span<Body> bodies, Span<IConstraint> constraints)
{
  for (int i = 0; i < constraints.Length; ++i)
  {
    var constraint = constraints[i];
    constraint.Apply(ref bodies[constraint.A], ref bodies[constraint.B]);
  }
}

The loop grabs the body references and hands them to the relevant constraint. Nothing too unusual here; it works, it’s fairly simple.

And it’s way slower than it needs to be.

Virtual obstacles

The constraints span contains elements of type IConstraint. Even though we only have one IConstraint implementation above, imagine this were a library: there could be any number of other IConstraint implementations, and the constraints span might contain a mix of different types. The compiler punts responsibility for calling the proper implementation to execution time.

That punt results in virtual calls. In the IL for the ApplyConstraintsThroughInterface loop body (first listing below), this shows up as ‘callvirt’, and without any other help, it pretty much guarantees some heavy uninlineable function calls in the final assembly (second listing).

L_0004: ldarga.s constraints
L_0006: ldloc.0 
L_0007: call instance !0& [System.Runtime]System.Span`1<class CodegenTests.GenericsAbuse/IConstraint>::get_Item(int32)
L_000c: ldind.ref 
L_000d: stloc.1 
L_000e: ldloc.1 
L_000f: ldarga.s bodies
L_0011: ldloc.1 
L_0012: callvirt instance int32 CodegenTests.GenericsAbuse/IConstraint::get_A()
L_0017: call instance !0& [System.Runtime]System.Span`1<valuetype CodegenTests.GenericsAbuse/Body>::get_Item(int32)
L_001c: ldarga.s bodies
L_001e: ldloc.1 
L_001f: callvirt instance int32 CodegenTests.GenericsAbuse/IConstraint::get_B()
L_0024: call instance !0& [System.Runtime]System.Span`1<valuetype CodegenTests.GenericsAbuse/Body>::get_Item(int32)
L_0029: callvirt instance void CodegenTests.GenericsAbuse/IConstraint::Apply(valuetype CodegenTests.GenericsAbuse/Body&, valuetype CodegenTests.GenericsAbuse/Body&)
movsxd      rcx,r14d  
mov         r15,qword ptr [rsi+rcx*8]  
mov         rcx,r15  
mov         r11,7FFA11440020h  
cmp         dword ptr [rcx],ecx  
call        qword ptr [7FFA11440020h]  
cmp         eax,ebp  
jae         00007FFA11551E1E  
movsxd      rcx,eax  
shl         rcx,4  
lea         r12,[rbx+rcx]  
mov         rcx,r15  
mov         r11,7FFA11440028h  
cmp         dword ptr [rcx],ecx  
call        qword ptr [7FFA11440028h]  
cmp         eax,ebp  
jae         00007FFA11551E1E  
movsxd      r8,eax  
shl         r8,4  
add         r8,rbx  
mov         rcx,r15  
mov         rdx,r12  
mov         r11,7FFA11440030h  
cmp         dword ptr [rcx],ecx  
call        qword ptr [7FFA11440030h]  

Note that not every interface or abstract function call will result in a full virtual call in the final assembly. The JIT can perform devirtualization sometimes, but I built this example to prevent it.
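For a sense of when devirtualization can kick in, here's a minimal sketch; SealedConstraint and ApplyLocallyKnownConstraint are illustrative types of my own, not part of the example above. When the JIT can see the exact concrete type at the call site, recent runtimes can usually resolve the interface call directly and often inline it:

sealed class SealedConstraint : IConstraint
{
  public int A { get; set; }
  public int B { get; set; }
  public void Apply(ref Body a, ref Body b)
  {
    a.Ineffability += b.Otherness;
  }
}

static void ApplyLocallyKnownConstraint(Span<Body> bodies)
{
  //The exact type is visible right here, so even though the local is typed as the
  //interface, the JIT can typically devirtualize (and often inline) the Apply call.
  IConstraint constraint = new SealedConstraint { A = 0, B = 1 };
  constraint.Apply(ref bodies[constraint.A], ref bodies[constraint.B]);
}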

Boxing

We can store any IConstraint-implementing reference type instance in the span because each slot just stores a reference. But value type instances aren’t directly tracked by the GC and have no reference to store. And, since we’re talking about a contiguous array with fixed stride, we can’t just shove a value type instance of arbitrary size into a slot.

To make everything work, the value type instance gets ‘boxed’ into a reference type instance. The reference to that new instance is then stored in the array. If you go look at the IL generated when sticking the IneffabilityConstraints into the IConstraint array, you’ll see this:

L_00cf: box CodegenTests.GenericsAbuse/IneffabilityConstraint

In other words, sticking a value type into a slot shaped like a reference type automatically creates a reference type instance for you. It works, but now you’ve given the poor GC more work to do, and that’s rude.

Sure, you could stick only to reference types, but then you’re on the path to having a really complicated web of references for the GC to track. Frankly, that’s pretty inconsiderate too.
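To see the cost without digging through IL, here’s a tiny sketch (the counts are arbitrary): assigning the struct to an interface-typed slot allocates a fresh box every time, while the strongly typed array stores the values inline.

static void BoxingSketch()
{
  var constraint = new IneffabilityConstraint { A = 0, B = 1 };

  //Every assignment below copies the struct into a brand new heap object;
  //the array itself only ever holds references.
  IConstraint[] boxed = new IConstraint[1024];
  for (int i = 0; i < boxed.Length; ++i)
  {
    boxed[i] = constraint;
  }

  //Here, the struct values live directly in the array. No boxes, no extra GC work.
  IneffabilityConstraint[] unboxed = new IneffabilityConstraint[1024];
  for (int i = 0; i < unboxed.Length; ++i)
  {
    unboxed[i] = constraint;
  }
}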

Specializing for speediness

Ideally, we could eliminate virtual calls and any other form of last-second indirection, allow inlining, and avoid any boxing.

This is possible if we’re willing to organize the incoming data into batches of the same type. This opens the door, at least conceptually, for specializing the entire loop for the specific type we’re working with. How would this look?

static void ApplyConstraintsWithGenericsAbuse<TConstraint>(Span<Body> bodies, Span<TConstraint> constraints) where TConstraint : IConstraint
{
  for (int i = 0; i < constraints.Length; ++i)
  {
    ref var constraint = ref constraints[i];
    constraint.Apply(ref bodies[constraint.A], ref bodies[constraint.B]);
  }
}

Almost identical! Rather than having a span over any possible IConstraint, this instead takes a span of a specific type TConstraint that implements IConstraint. If we peek at the IL and assembly of the loop body again:

L_0004: ldarga.s constraints
L_0006: ldloc.0 
L_0007: call instance !0& [System.Runtime]System.Span`1<!!TConstraint>::get_Item(int32)
L_000c: stloc.1 
L_000d: ldloc.1 
L_000e: ldarga.s bodies
L_0010: ldloc.1 
L_0011: constrained. !!TConstraint
L_0017: callvirt instance int32 CodegenTests.GenericsAbuse/IConstraint::get_A()
L_001c: call instance !0& [System.Runtime]System.Span`1<valuetype CodegenTests.GenericsAbuse/Body>::get_Item(int32)
L_0021: ldarga.s bodies
L_0023: ldloc.1 
L_0024: constrained. !!TConstraint
L_002a: callvirt instance int32 CodegenTests.GenericsAbuse/IConstraint::get_B()
L_002f: call instance !0& [System.Runtime]System.Span`1<valuetype CodegenTests.GenericsAbuse/Body>::get_Item(int32)
L_0034: constrained. !!TConstraint
L_003a: callvirt instance void CodegenTests.GenericsAbuse/IConstraint::Apply(valuetype CodegenTests.GenericsAbuse/Body&, valuetype CodegenTests.GenericsAbuse/Body&)
movsxd      r10,r9d  
lea         r10,[rax+r10*8]  
mov         r11d,dword ptr [r10]  
cmp         r11d,ecx  
jae         00007FFA11632C39  
movsxd      r11,r11d  
shl         r11,4  
add         r11,r8  
mov         r10d,dword ptr [r10+4]  
cmp         r10d,ecx  
jae         00007FFA11632C39  
movsxd      r10,r10d  
shl         r10,4  
add         r10,r8  
vmovss      xmm0,dword ptr [r11+0Ch]  
vmovss      xmm1,dword ptr [r10+8]  
vmovss      xmm2,dword ptr [r10]  
vmulss      xmm2,xmm2,xmm2  
vmovss      xmm3,dword ptr [r10+4]  
vmulss      xmm3,xmm3,xmm3  
vaddss      xmm2,xmm2,xmm3  
vmulss      xmm1,xmm1,xmm2  
vaddss      xmm0,xmm0,xmm1  
vmovss      dword ptr [r11+0Ch],xmm0  

The IL is pretty similar. Roslyn still outputs callvirts, except now they carry the ‘constrained.’ prefix. That prefix makes all the difference: with the extra type awareness, the JIT can not only eliminate the virtual call but also inline the contained logic.

That alone accounts for about a factor of 4 speedup in a microbenchmark comparing the two approaches. A larger program with more logic per virtual call would not show quite as much difference, but the cost is real.
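A rough sketch of how you could measure that yourself follows; this harness is mine, not the original benchmark, and the counts, iteration totals, and Stopwatch timing are arbitrary choices. It assumes the two Apply functions above are in scope and uses System.Diagnostics.Stopwatch.

static void CompareApproaches()
{
  const int bodyCount = 1024;
  const int constraintCount = 1 << 16;
  var random = new Random(5);

  var bodies = new Body[bodyCount];
  var interfaceConstraints = new IConstraint[constraintCount];
  var typedConstraints = new IneffabilityConstraint[constraintCount];
  for (int i = 0; i < constraintCount; ++i)
  {
    var constraint = new IneffabilityConstraint { A = random.Next(bodyCount), B = random.Next(bodyCount) };
    typedConstraints[i] = constraint;
    //Note that this assignment boxes; the interface array only holds references.
    interfaceConstraints[i] = constraint;
  }

  //Run both paths once so the JIT has compiled them before timing.
  ApplyConstraintsThroughInterface(bodies, interfaceConstraints);
  ApplyConstraintsWithGenericsAbuse<IneffabilityConstraint>(bodies, typedConstraints);

  var timer = Stopwatch.StartNew();
  for (int i = 0; i < 100; ++i)
    ApplyConstraintsThroughInterface(bodies, interfaceConstraints);
  Console.WriteLine($"Interface: {timer.Elapsed.TotalMilliseconds:F2} ms");

  timer.Restart();
  for (int i = 0; i < 100; ++i)
    ApplyConstraintsWithGenericsAbuse<IneffabilityConstraint>(bodies, typedConstraints);
  Console.WriteLine($"Generics:  {timer.Elapsed.TotalMilliseconds:F2} ms");
}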

But wait, hold on… What’s this? If we do nothing more than change the IneffabilityConstraint to a class instead of a struct, the assembly becomes:

mov         rcx,qword ptr [r12]  
mov         r11,7FFA03910038h  
cmp         dword ptr [rcx],ecx  
call        qword ptr [7FFA03910038h]  
cmp         eax,r14d  
jae         00007FFA03A2436A  
movsxd      rcx,eax  
shl         rcx,4  
lea         r13,[rcx+rbp]  
mov         rcx,qword ptr [r12]  
mov         r11,7FFA03910040h  
cmp         dword ptr [rcx],ecx  
call        qword ptr [7FFA03910040h]  
cmp         eax,r14d  
jae         00007FFA03A2436A  
movsxd      r8,eax  
shl         r8,4  
add         r8,rbp  
mov         rcx,qword ptr [r12]  
mov         rdx,r13  
mov         r11,7FFA03910048h  
cmp         dword ptr [rcx],ecx  
call        qword ptr [7FFA03910048h]  

The calls have returned! It’s still about 30-50% faster than the original pure interface version, but nothing is inlined anymore. The JIT should still have all the necessary type knowledge to inline… right?

Sort of. The JIT handles value types and reference types differently when it comes to generics. Since every reference is the same size and dispatch can stay indirect, all reference types can share a single compiled implementation, and the JIT takes advantage of that even though it means the specific implementations can’t be inlined. (This behavior is subject to change; for all I know there might already be cases where certain reference types are specialized.)

But even if the JIT wanted to, it can’t efficiently share implementations across value types. They’re different sizes and, unless boxed, have no method table for indirection. So, rather than boxing every single value type to share the same code, it chooses to output a type-specialized implementation.

This is really useful.

In BEPUphysics v2

This compiler behavior is (ab)used all over the place in the engine. You can stretch this from simple things like efficient callbacks all the way to almost-template-hell.

One of the first places users might run into this with BEPUphysics v2 is in narrow phase callbacks, specified when the Simulation.Create function is called:

public unsafe interface INarrowPhaseCallbacks
{
  void Initialize(Simulation simulation);

  bool AllowContactGeneration(int workerIndex, CollidableReference a, CollidableReference b);

  bool ConfigureContactManifold(int workerIndex, CollidablePair pair, ConvexContactManifold* manifold, out PairMaterialProperties pairMaterial);

  bool ConfigureContactManifold(int workerIndex, CollidablePair pair, NonconvexContactManifold* manifold, out PairMaterialProperties pairMaterial);

  bool AllowContactGeneration(int workerIndex, CollidablePair pair, int childIndexA, int childIndexB);

  bool ConfigureContactManifold(int workerIndex, CollidablePair pair, int childIndexA, int childIndexB, ConvexContactManifold* manifold);  

  void Dispose();
}

These are directly inlineable callbacks from the narrow phase’s execution context. Collision pairs can be filtered arbitrarily, contact manifolds can be accessed or modified before they’re used to create constraints, and materials can be defined however you like. The engine builds in no assumptions about collision filtering or material blending, and the library itself has no collision events. Events could, however, easily be built on top of these callbacks.
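For illustration only, a bare-minimum implementation could look something like the sketch below. It accepts every pair and leaves the material properties at their defaults just to keep the example short; real callbacks would fill in friction and spring settings with meaningful values.

public unsafe struct MinimalCallbacks : INarrowPhaseCallbacks
{
  public void Initialize(Simulation simulation)
  {
  }

  //Accept every pair and every child pair.
  public bool AllowContactGeneration(int workerIndex, CollidableReference a, CollidableReference b)
  {
    return true;
  }
  public bool AllowContactGeneration(int workerIndex, CollidablePair pair, int childIndexA, int childIndexB)
  {
    return true;
  }

  public bool ConfigureContactManifold(int workerIndex, CollidablePair pair, ConvexContactManifold* manifold, out PairMaterialProperties pairMaterial)
  {
    //A real implementation would set meaningful material properties here.
    pairMaterial = default;
    return true;
  }
  public bool ConfigureContactManifold(int workerIndex, CollidablePair pair, NonconvexContactManifold* manifold, out PairMaterialProperties pairMaterial)
  {
    pairMaterial = default;
    return true;
  }
  public bool ConfigureContactManifold(int workerIndex, CollidablePair pair, int childIndexA, int childIndexB, ConvexContactManifold* manifold)
  {
    return true;
  }

  public void Dispose()
  {
  }
}

An instance of a struct like this gets handed to Simulation.Create, and because it’s a value type supplied as a generic type parameter, every one of those returns can be inlined straight into the narrow phase’s loops.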

Another case: want to enumerate the set of bodies connected to a given body through constraints? No problem. In the Simulation.Bodies collection:

public void EnumerateConnectedBodies<TEnumerator>(int bodyHandle, ref TEnumerator enumerator) where TEnumerator : IForEach<int>

Where IForEach<T> allows the user to provide a callback that will be executed for each connected body:

public interface IForEach<T>
{
  void LoopBody(T i);
}

The struct-implementing-enumerator-interface pattern is used quite a few times. The ReferenceCollector, in particular, sees a lot of use. It provides the simplest ‘store a list of results’ enumerator.
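If ReferenceCollector doesn’t fit your needs, rolling your own takes a few lines. Here’s a sketch; ConnectedBodyCollector is hypothetical, and it assumes an ordinary List<int> is acceptable for the use case:

struct ConnectedBodyCollector : IForEach<int>
{
  public List<int> Handles;
  public void LoopBody(int connectedBodyHandle)
  {
    Handles.Add(connectedBodyHandle);
  }
}

And at the call site, assuming a simulation and bodyHandle are in scope:

var enumerator = new ConnectedBodyCollector { Handles = new List<int>() };
simulation.Bodies.EnumerateConnectedBodies(bodyHandle, ref enumerator);
//enumerator.Handles now contains the handle of every constraint-connected body.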

There are plenty more examples like the above, particularly around ray/query filtering and processing. Can we go a bit further than callbacks? Sure! Here’s a real-world expansion of the above toy ‘constraints’ example:

public abstract class TwoBodyTypeProcessor<TPrestepData, TProjection, TAccumulatedImpulse, TConstraintFunctions>
  : TypeProcessor<TwoBodyReferences, TPrestepData, TProjection, TAccumulatedImpulse>
  where TConstraintFunctions : struct, IConstraintFunctions<TPrestepData, TProjection, TAccumulatedImpulse>

Without copying too much spam into this blog post, this generic definition provides enough information to create a number of functions shared by all two body constraints. The SolveIteration function looks like this:

public unsafe override void SolveIteration(ref TypeBatch typeBatch, ref Buffer<BodyVelocity> bodyVelocities, int startBundle, int exclusiveEndBundle)
{
  ref var bodyReferencesBase = ref Unsafe.AsRef<TwoBodyReferences>(typeBatch.BodyReferences.Memory);
  ref var accumulatedImpulsesBase = ref Unsafe.AsRef<TAccumulatedImpulse>(typeBatch.AccumulatedImpulses.Memory);
  ref var projectionBase = ref Unsafe.AsRef<TProjection>(typeBatch.Projection.Memory);
  var function = default(TConstraintFunctions);
  for (int i = startBundle; i < exclusiveEndBundle; ++i)
  {
    ref var projection = ref Unsafe.Add(ref projectionBase, i);
    ref var accumulatedImpulses = ref Unsafe.Add(ref accumulatedImpulsesBase, i);
    ref var bodyReferences = ref Unsafe.Add(ref bodyReferencesBase, i);
    int count = GetCountInBundle(ref typeBatch, i);
    Bodies.GatherVelocities(ref bodyVelocities, ref bodyReferences, count, out var wsvA, out var wsvB);
    function.Solve(ref wsvA, ref wsvB, ref projection, ref accumulatedImpulses);
    Bodies.ScatterVelocities(ref wsvA, ref wsvB, ref bodyVelocities, ref bodyReferences, count);
  }
}

The TypeBatch, containing raw untyped buffers, is given meaning when handed to the TypeProcessor that knows how to interpret it. The individual type functions don’t have to worry about reinterpreting untyped memory or gathering velocities; that’s all handled by the shared implementation with no virtual call overhead.
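The same shape is easy to miniaturize outside the engine. Here’s a toy sketch (none of these types are engine types) that uses System.Runtime.CompilerServices.Unsafe just like SolveIteration above: raw bytes plus a struct-implemented function yield a fully specialized, inlined loop.

interface IBundleFunction<TBundle>
{
  void Execute(ref TBundle bundle);
}

struct ScaleFunction : IBundleFunction<float>
{
  public float Factor;
  public void Execute(ref float bundle)
  {
    bundle *= Factor;
  }
}

static class ToyProcessor
{
  //The byte buffer has no inherent meaning; the generic parameters decide how it
  //gets interpreted and what runs on each element. The JIT emits a dedicated,
  //inlined loop for every (TBundle, TFunction) pair of value types.
  public static void Process<TBundle, TFunction>(byte[] rawMemory, int count, ref TFunction function)
    where TBundle : unmanaged
    where TFunction : IBundleFunction<TBundle>
  {
    ref var baseRef = ref Unsafe.As<byte, TBundle>(ref rawMemory[0]);
    for (int i = 0; i < count; ++i)
    {
      function.Execute(ref Unsafe.Add(ref baseRef, i));
    }
  }
}

A call like ToyProcessor.Process<float, ScaleFunction>(buffer, buffer.Length / 4, ref scale) then gets its own specialized loop with the multiply inlined into it.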

Can we go even further? Sure! Let’s look at the ConvexCollisionTask, which handles batches of collision pairs in a vectorized way. As you might expect by now, it has some hefty generics:

public class ConvexCollisionTask<TShapeA, TShapeWideA, TShapeB, TShapeWideB, TPair, TPairWide, TManifoldWide, TPairTester> : CollisionTask
  where TShapeA : struct, IShape where TShapeB : struct, IShape
  where TShapeWideA : struct, IShapeWide<TShapeA> where TShapeWideB : struct, IShapeWide<TShapeB>
  where TPair : struct, ICollisionPair<TPair>
  where TPairWide : struct, ICollisionPairWide<TShapeA, TShapeWideA, TShapeB, TShapeWideB, TPair, TPairWide>
  where TPairTester : struct, IPairTester<TShapeWideA, TShapeWideB, TManifoldWide>
  where TManifoldWide : IContactManifoldWide

To paraphrase, this requires that the inputs be two vectorizable shapes and a function capable of handling those shape types. But the actual batch execution does more interesting things than merely inlining a user-supplied function. Check out the loop body:

if (pairWide.OrientationCount == 2)
{
  defaultPairTester.Test(
    ref pairWide.GetShapeA(ref pairWide),
    ref pairWide.GetShapeB(ref pairWide),
    ref pairWide.GetSpeculativeMargin(ref pairWide),
    ref pairWide.GetOffsetB(ref pairWide),
    ref pairWide.GetOrientationA(ref pairWide),
    ref pairWide.GetOrientationB(ref pairWide),
    countInBundle,
    out manifoldWide);
}
else if (pairWide.OrientationCount == 1)
{
  //Note that, in the event that there is only one orientation, it belongs to the second shape.
  //The only shape that doesn't need orientation is a sphere, and it will be in slot A by convention.
  Debug.Assert(typeof(TShapeWideA) == typeof(SphereWide));
  defaultPairTester.Test(
    ref pairWide.GetShapeA(ref pairWide),
    ref pairWide.GetShapeB(ref pairWide),
    ref pairWide.GetSpeculativeMargin(ref pairWide),
    ref pairWide.GetOffsetB(ref pairWide),
    ref pairWide.GetOrientationB(ref pairWide),
    countInBundle,
    out manifoldWide);
}
else
{
  Debug.Assert(pairWide.OrientationCount == 0);
  Debug.Assert(typeof(TShapeWideA) == typeof(SphereWide) && typeof(TShapeWideB) == typeof(SphereWide), "No orientation implies a special case involving two spheres.");
  //Really, this could be made into a direct special case, but eh.
  defaultPairTester.Test(
    ref pairWide.GetShapeA(ref pairWide),
    ref pairWide.GetShapeB(ref pairWide),
    ref pairWide.GetSpeculativeMargin(ref pairWide),
    ref pairWide.GetOffsetB(ref pairWide),
    countInBundle,
    out manifoldWide);
}

//Flip back any contacts associated with pairs which had to be flipped for shape order.
if (pairWide.HasFlipMask)
{
  manifoldWide.ApplyFlipMask(ref pairWide.GetOffsetB(ref pairWide), pairWide.GetFlipMask(ref pairWide));
}

Typically, you really don’t want to wait until the last second to perform a branch. So what’s up with this?

For any particular TPairWide, the OrientationCount and HasFlipMask properties both return constants. Since the JIT is already creating a dedicated code path for the specified type parameters (they’re all value types, after all), it folds those known values in and prunes the unreachable code paths. The final assembly will only include whichever orientation count path is relevant, and the flip mask chunk won’t exist at all unless it’s required. No branches involved!

The JIT can also recognize certain kinds of type check, like a typeof comparison against a generic parameter, as constant. In other words, when using value types with generics, C# supports something similar to C++ template specialization.
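Here’s an illustrative sketch of both tricks; these types are made up for the example. With a value type TConfig, the constant property and the typeof comparison are both known when the specialized code is generated, so only the live branches survive.

interface IPathConfig
{
  bool HasFlipMask { get; }
}

struct NoFlip : IPathConfig { public bool HasFlipMask => false; }
struct WithFlip : IPathConfig { public bool HasFlipMask => true; }

static float Process<TConfig>(float value, TConfig config) where TConfig : struct, IPathConfig
{
  //For a given TConfig, HasFlipMask is a constant; the specialized code keeps only one side of this branch.
  if (config.HasFlipMask)
  {
    value = -value;
  }
  //Type checks against a value type parameter fold to constants too. Only the
  //WithFlip specialization keeps this block; every other specialization drops it entirely.
  if (typeof(TConfig) == typeof(WithFlip))
  {
    value *= 2;
  }
  return value;
}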

If this has awakened a dark hunger within you, you might also like the convex sweep generic definition which abstracts over different shape types as well as different shape path integrations.

And why not create collection types that vary not only over element type, but also over the internal buffer type, any required comparisons, and memory pool types? What could go wrong?
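As a tiny taste of what that looks like (a sketch, not the engine’s actual collection types), pushing the comparison into a struct type parameter lets every lookup inline it:

interface IRefComparer<T>
{
  bool Equals(ref T a, ref T b);
}

struct IntComparer : IRefComparer<int>
{
  public bool Equals(ref int a, ref int b)
  {
    return a == b;
  }
}

//Deliberately minimal; the engine's collections also abstract over the backing
//buffer type and the memory pool that supplies it.
struct TinyList<T, TComparer> where TComparer : struct, IRefComparer<T>
{
  public T[] Buffer;
  public int Count;

  public int IndexOf(ref T item)
  {
    var comparer = default(TComparer);
    for (int i = 0; i < Count; ++i)
    {
      //This is a specialized, inlineable call, not an interface dispatch or delegate invocation.
      if (comparer.Equals(ref item, ref Buffer[i]))
        return i;
    }
    return -1;
  }
}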

Summary

C# generics might not be a Turing-complete metaprogramming language, but they can still do some pretty helpful things beyond just having a list of a particular type. With a bit of compiler awareness, you can express all sorts of simple or complex abstractions with zero overhead.

You can also create horrible messes.
