A different kind of C#: watching your step

In the previous post, we saw that the JIT does a reasonably good job generating efficient code.

Does that mean we can simply trust the compiler to do the right thing all the time? Let's find out!

We'll investigate using a very simple microbenchmark: adding vectors in three different representations, each in three different ways. Here are the competitors:

struct VFloat
{
  public float X;
  public float Y;
  public float Z;
}
struct VNumerics3
{
  public Vector<float> X;
  public Vector<float> Y;
  public Vector<float> Z;
}
struct VAvx
{
  public Vector256<float> X;
  public Vector256<float> Y;
  public Vector256<float> Z;
}

The first type, VFloat, is the simplest representation for three scalars. It’s not a particularly fair comparison: the Vector{T} type contains 4 scalars on the tested 3770K and the Vector256{float} contains 8, so the wider types are conceptually doing way more work. Despite that, comparing them will reveal some interesting compiler and processor properties.
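If you want to sanity check those widths on your own hardware, both types report a lane count. A quick sketch (the wrapper class and method names are purely illustrative):

using System;
using System.Numerics;
using System.Runtime.Intrinsics;

static class WidthCheck
{
  public static void Report()
  {
    // On the tested 3770K (AVX, but no AVX2), Vector<float> falls back to 128 bits
    // and reports 4 lanes; Vector256<float> is always 8 lanes of 32-bit floats.
    Console.WriteLine($"Vector<float> lanes: {Vector<float>.Count}");
    Console.WriteLine($"Vector256<float> lanes: {Vector256<float>.Count}");
  }
}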

The three Add implementations tested will be a manually inlined version, a static function with in/out parameters, and an operator. Here’s how the function and operator look for VFloat; I’ll omit the manually inlined implementation and the other types for brevity (but you can see them on GitHub):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Add(in VFloat a, in VFloat b, out VFloat result)
{
  result.X = a.X + b.X;
  result.Y = a.Y + b.Y;
  result.Z = a.Z + b.Z;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VFloat operator +(in VFloat a, in VFloat b)
{
  VFloat result;
  result.X = a.X + b.X;
  result.Y = a.Y + b.Y;
  result.Z = a.Z + b.Z;
  return result;
}
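
The VAvx versions follow the same shape, except that the Vector256 components go through the Avx.Add intrinsic rather than an operator. As a rough sketch (not necessarily identical to the repository code), the VAvx members look something like this:

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Add(in VAvx a, in VAvx b, out VAvx result)
{
  result.X = Avx.Add(a.X, b.X);
  result.Y = Avx.Add(a.Y, b.Y);
  result.Z = Avx.Add(a.Z, b.Z);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VAvx operator +(in VAvx a, in VAvx b)
{
  VAvx result;
  result.X = Avx.Add(a.X, b.X);
  result.Y = Avx.Add(a.Y, b.Y);
  result.Z = Avx.Add(a.Z, b.Z);
  return result;
}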

Each Add implementation will be called several times in a loop. Some adds are independent, some are dependent. The result of each iteration gets stored into an accumulator to keep the loop from being optimized into nonexistence. Something like this:

for (int j = 0; j < innerIterationCount; ++j)
{
  ref var value = ref Unsafe.Add(ref baseValue, j);
  // Four independent adds...
  var r0 = value + value;
  var r1 = value + value;
  var r2 = value + value;
  var r3 = value + value;
  // ...followed by a dependent reduction into the accumulator.
  var i0 = r0 + r1;
  var i1 = r2 + r3;
  var i2 = i0 + i1;
  accumulator += i2;
}

Historically, using operators for value types implied a great deal of copying for both the parameters and returned value even when the function was inlined. (Allowing ‘in’ on operator parameters helps with this a little bit, at least in cases where the JIT isn’t able to eliminate the copies without assistance.)

To compensate, many performance-sensitive C# libraries, like XNA and its progeny, offered ref/out overloads. That helped, but not being able to use operators efficiently always hurt readability. Having refs spewed all over your code wasn’t too great either, but in parameters (which require no call site decoration) have saved us from that in most cases.
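
As a reminder of what that call site difference looks like, here’s a hypothetical sketch; AddRef is just an illustrative name, and both methods are assumed to be static members of VFloat:

// Old XNA-style signature: by-reference, but every call site needs decoration.
public static void AddRef(ref VFloat a, ref VFloat b, out VFloat result)
{
  result.X = a.X + b.X;
  result.Y = a.Y + b.Y;
  result.Z = a.Z + b.Z;
}

static void CallSiteComparison(ref VFloat a, ref VFloat b)
{
  VFloat.AddRef(ref a, ref b, out var oldStyle); // ref spam at the call site
  VFloat.Add(a, b, out var newStyle);            // 'in' version shown earlier: no decoration
}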

But for maximum performance, you had to bite the bullet and manually inline. It was a recipe for an unmaintainable mess, but it was marginally faster!
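
For reference, “manually inlined” here just means writing the component math out by hand at the call site. One of the benchmark’s ‘value + value’ additions, manually inlined for VFloat, looks something like this (sketch):

VFloat r0;
r0.X = value.X + value.X;
r0.Y = value.Y + value.Y;
r0.Z = value.Z + value.Z;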

Focusing on VFloat for a moment, how does that situation look today? Testing on the .NET Core 3.0.0-preview1-26829-01 alpha runtime:

[Chart: VFloat Add benchmark results for the manually inlined, function, and operator implementations]

All effectively the same! The resulting assembly:

Manually Inlined: 
vmovss      xmm3,dword ptr [r8]  
vaddss      xmm3,xmm3,xmm3  
vmovss      xmm4,dword ptr [r8+4]  
vaddss      xmm4,xmm4,xmm4  
vmovaps     xmm5,xmm4  
vmovss      xmm6,dword ptr [r8+8]  
vaddss      xmm6,xmm6,xmm6  
vmovaps     xmm7,xmm6  
vmovaps     xmm8,xmm3  
vaddss      xmm4,xmm4,xmm5  
vmovaps     xmm5,xmm4  
vaddss      xmm6,xmm6,xmm7  
vmovaps     xmm7,xmm6  
vaddss      xmm3,xmm3,xmm8  
vmovaps     xmm8,xmm3  
vaddss      xmm4,xmm4,xmm5  
vmovaps     xmm5,xmm7  
vaddss      xmm5,xmm5,xmm6  
vaddss      xmm3,xmm3,xmm8  
vaddss      xmm0,xmm0,xmm3  
vaddss      xmm1,xmm1,xmm4  
vaddss      xmm2,xmm2,xmm5  
Operator:
vmovss      xmm3,dword ptr [r8]  
vaddss      xmm3,xmm3,xmm3  
vmovaps     xmm4,xmm3  
vmovss      xmm5,dword ptr [r8+4]  
vaddss      xmm5,xmm5,xmm5  
vmovaps     xmm6,xmm5  
vmovss      xmm7,dword ptr [r8+8]  
vaddss      xmm7,xmm7,xmm7  
vmovaps     xmm8,xmm7  
vaddss      xmm3,xmm3,xmm4  
vmovaps     xmm4,xmm3  
vaddss      xmm5,xmm5,xmm6  
vmovaps     xmm6,xmm5  
vaddss      xmm7,xmm7,xmm8  
vmovaps     xmm8,xmm7  
vaddss      xmm3,xmm3,xmm4  
vmovaps     xmm4,xmm6  
vaddss      xmm4,xmm4,xmm5  
vmovaps     xmm5,xmm8  
vaddss      xmm5,xmm5,xmm7  
vaddss      xmm0,xmm0,xmm3  
vaddss      xmm1,xmm1,xmm4  
vaddss      xmm2,xmm2,xmm5  

The manually inlined version and the operator version differ by a single instruction. That’s good news: using operators is, at least in some cases, totally fine now! Also, note that there are only 12 vaddss instructions, cutting out the other 12 redundant adds. Some cleverness!

Now let’s see how things look across all the test cases…

[Chart: 3.0.0-preview1-26829-01 Add benchmark results for all type and implementation combinations]

Oh, dear. The preview nature of this runtime has suddenly become relevant. Using an operator for the VAvx type is catastrophic. Comparing the manually inlined version to the operator version:

Manually Inlined:
vmovups     ymm3,ymmword ptr [r9]  
cmp         dword ptr [r8],r8d  
lea         r9,[r8+20h]  
vmovups     ymm4,ymmword ptr [r9]  
cmp         dword ptr [r8],r8d  
add         r8,40h  
vaddps      ymm5,ymm3,ymm3  
vaddps      ymm6,ymm4,ymm4  
vmovups     ymm7,ymmword ptr [r8]  
vaddps      ymm8,ymm7,ymm7  
vaddps      ymm9,ymm3,ymm3  
vaddps      ymm10,ymm4,ymm4  
vaddps      ymm11,ymm7,ymm7  
vaddps      ymm12,ymm3,ymm3  
vaddps      ymm13,ymm4,ymm4  
vaddps      ymm14,ymm7,ymm7  
vaddps      ymm3,ymm3,ymm3  
vaddps      ymm4,ymm4,ymm4  
vaddps      ymm7,ymm7,ymm7  
vaddps      ymm6,ymm6,ymm10  
vaddps      ymm8,ymm8,ymm11  
vaddps      ymm3,ymm12,ymm3  
vaddps      ymm4,ymm13,ymm4  
vaddps      ymm7,ymm14,ymm7  
vaddps      ymm4,ymm6,ymm4  
vaddps      ymm6,ymm8,ymm7  
vaddps      ymm5,ymm5,ymm9  
vaddps      ymm3,ymm5,ymm3  
vaddps      ymm0,ymm3,ymm0  
vaddps      ymm1,ymm4,ymm1  
vaddps      ymm2,ymm6,ymm2  
Operator:
lea         rdx,[rsp+2A0h]  
vxorps      xmm0,xmm0,xmm0  
vmovdqu     xmmword ptr [rdx],xmm0  
vmovdqu     xmmword ptr [rdx+10h],xmm0  
vmovdqu     xmmword ptr [rdx+20h],xmm0  
vmovdqu     xmmword ptr [rdx+30h],xmm0  
vmovdqu     xmmword ptr [rdx+40h],xmm0  
vmovdqu     xmmword ptr [rdx+50h],xmm0  
vmovupd     ymm0,ymmword ptr [rbp]  
vaddps      ymm0,ymm0,ymmword ptr [rbp]  
lea         rdx,[rsp+2A0h]  
vmovupd     ymmword ptr [rdx],ymm0  
vmovupd     ymm0,ymmword ptr [rbp+20h]  
vaddps      ymm0,ymm0,ymmword ptr [rbp+20h]  
lea         rdx,[rsp+2C0h]  
vmovupd     ymmword ptr [rdx],ymm0  
vmovupd     ymm0,ymmword ptr [rbp+40h]  
vaddps      ymm0,ymm0,ymmword ptr [rbp+40h]  
lea         rdx,[rsp+2E0h]  
vmovupd     ymmword ptr [rdx],ymm0  
lea         rcx,[rsp+540h]  
lea         rdx,[rsp+2A0h]  
lea         rdx,[rsp+2A0h]  
mov         r8d,60h  
call        00007FFC1961C290  
  ... repeat another 7 times...

The manually inlined variant does pretty well, producing a tight sequence of 24 vaddps instructions operating on ymm registers. Without optimizing away the redundant adds, that’s about as good as you’re going to get.

The operator version is… less good. Clearing a bunch of memory, unnecessary loads and stores, capped off with a curious function call. Not surprising that it’s 50 times slower.

Clearly something wonky is going on there, but let’s move on for now. Zooming in a bit so we can see the other results:

[Chart: 3.0.0-preview1-26829-01 Add benchmark results, clamped to make the non-AVX-operator cases visible]

Both Vector{T} and AVX are slower than VFloat when manually inlined, but that’s expected given that half of the VFloat adds were optimized away. Unfortunately, it looks like even non-operator functions take a hit relative to the manually inlined implementation.

When manually inlined, 8-wide AVX is also a little faster than 4-wide Vector{T}. On a 3770K, the relevant 4-wide and 8-wide instructions have the same throughput, so being pretty close is expected. The marginal slowdown arises from the Vector{T} implementation using extra vmovupd instructions to load input values. Manually caching the values in a local variable actually helps some.
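
Manually caching just means copying the element into a local once instead of re-reading it through the reference on every use. A sketch of that tweak, assuming the same loop structure as the earlier example:

for (int j = 0; j < innerIterationCount; ++j)
{
  // Copy once into a local; the adds then work on the local copy instead of
  // repeatedly loading through the reference.
  var value = Unsafe.Add(ref baseValue, j);
  var r0 = value + value;
  var r1 = value + value;
  // ...remaining adds and the accumulator as before...
}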

Focusing on the function and operator slowdown, here’s the assembly generated for the Vector{T} function and operator cases:

Vector<T> add function:
vmovupd     xmm0,xmmword ptr [r8]  
vmovupd     xmm1,xmmword ptr [r8]  
vaddps      xmm0,xmm0,xmm1  
vmovapd     xmmword ptr [rsp+110h],xmm0  
vmovupd     xmm0,xmmword ptr [r8+10h]  
vmovupd     xmm1,xmmword ptr [r8+10h]  
vaddps      xmm0,xmm0,xmm1  
vmovapd     xmmword ptr [rsp+100h],xmm0  
vmovupd     xmm0,xmmword ptr [r8+20h]  
vmovupd     xmm1,xmmword ptr [r8+20h]  
vaddps      xmm0,xmm0,xmm1  
vmovapd     xmmword ptr [rsp+0F0h],xmm0  
... repeat...
Vector<T> operator:
vmovapd     xmm3,xmmword ptr [rsp+170h]  
vmovapd     xmm4,xmmword ptr [rsp+160h]  
vmovapd     xmm5,xmmword ptr [rsp+150h]  
vmovupd     xmm6,xmmword ptr [r8]  
vmovupd     xmm7,xmmword ptr [r8]  
vaddps      xmm6,xmm6,xmm7  
vmovapd     xmmword ptr [rsp+140h],xmm6  
vmovupd     xmm6,xmmword ptr [r8+10h]  
vmovupd     xmm7,xmmword ptr [r8+10h]  
vaddps      xmm6,xmm6,xmm7  
vmovapd     xmmword ptr [rsp+130h],xmm6  
vmovupd     xmm6,xmmword ptr [r8+20h]  
vmovupd     xmm7,xmmword ptr [r8+20h]  
vaddps      xmm6,xmm6,xmm7  
vmovapd     xmmword ptr [rsp+120h],xmm6  
... repeat...

Nothing crazy happening, but there’s clearly a lot of register juggling that the earlier manually inlined AVX version didn’t do. The add function versus manual inlining difference is more pronounced in the AVX case, but the cause is similar (with some more lea instructions).

But this is an early preview version. What happens if we update to a daily build from a few weeks after the one tested above?

[Chart: Add benchmark results on the newer daily build, clamped]

A little better on function AVX, and more than 17 times faster on operator AVX. Not ideal, perhaps, but much closer to reasonable.

(If you’re wondering why the AVX path seems to handle things differently than the Vector{T} paths, Vector{T} came first and has its own set of JIT intrinsic implementations. The two may become unified in the future, on top of some additional work to avoid quality regressions.)

Microbenchmarks are one thing; how do these kinds of concerns show up in actual use? As an example, consider the box-box contact test. To avoid a book-length post, I’ll omit the generated assembly.

Given that manual inlining isn’t exactly a viable option in most cases, v2 usually uses static functions with in/out parameters. As expected, the generated code looks similar to the microbenchmark with the same kind of function usage. Here’s a VTune snapshot of the results:

[VTune snapshot of the box-box contact test results]

The CPI isn’t horrible, but most of the bottleneck is related to the load and move instructions. The snapshot above breaks out the 37.4% of cycles that are stalled on front-end bottlenecks; instruction cache misses and delivery inefficiencies become relevant when there are no memory bottlenecks to hide them. With deeper analysis, many moves and loads/stores could be eliminated, and this could get a nice boost.

Another fun note, from the header of BoxPairTester.Test when inlining the function is disabled:

mov         ecx,2AAh  
xor         eax,eax  
rep stos    dword ptr [rdi] 

CoreCLR aggressively clears locally allocated variables if the IL locals init flag is set. Given that the flag is almost always set, it’s possible to spend a lot of time pointlessly zeroing memory. Here, the rep stos instruction performs 2AAh = 682 iterations. Each iteration sets 4 bytes of memory to the value of the just-zeroed eax register, so this zeroes out 2728 bytes of stack space every single time the function is called.

In practice, many such clears are amortized over multiple function calls by forcing inlining, but unless the locals init flag is stripped, they’ll still happen. When compiled under the ReleaseStrip configuration, v2 uses a post-build step to strip the locals init flag (and in the future there will likely be other options). Some simulations can improve by over 10% with the clearing stripped.
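
To make the cost concrete, here’s a hypothetical sketch (not taken from the v2 source) of the kind of method that triggers a clear like the one above. With locals init set, the stack buffer is zeroed on entry even though the code overwrites it before ever reading it; with the flag stripped, only the explicit writes remain.

using System;

static class LocalsInitExample
{
  static float FillAndSum()
  {
    // 682 floats = 2728 bytes, the same amount of stack as the clear shown earlier.
    Span<float> scratch = stackalloc float[682];
    float sum = 0;
    for (int i = 0; i < scratch.Length; ++i)
    {
      scratch[i] = i;
      sum += scratch[i];
    }
    return sum;
  }
}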

Summary

If you’re writing the kind of code where the generated assembly quality actually matters and isn’t bottlenecked by something else like memory, you should probably sanity test the performance occasionally or peek at the generated assembly to check things out. The JIT is improving, but there are limits to how much deep analysis can be performed on the fly without interfering with user experience.

And if you’re trying to use preview features that are still under active development, well, you probably know what you’re getting into.