In the previous post, we saw that the JIT does a reasonably good job generating efficient code.
Does that mean we can simply trust the compiler to do the right thing all the time? Let's find out!
We'll investigate using a very simple microbenchmark: add three different vector representations together in three different ways each. Here are the competitors:
struct VFloat { public float X; public float Y; public float Z; }
struct VNumerics3 { public Vector<float> X; public Vector<float> Y; public Vector<float> Z; }
struct VAvx { public Vector256<float> X; public Vector256<float> Y; public Vector256<float> Z; }
The first type, VFloat, is the simplest representation for three scalars. It's not a particularly fair comparison: the Vector{T} type contains 4 scalars on the tested 3770K and Vector256{float} contains 8, so the wider types are conceptually doing far more work per add. Despite that, comparing them will reveal some interesting compiler and processor properties.
The three Add implementations tested will be a manually inlined version, a static function with in/out parameters, and an operator. Here’s how the function and operator look for VFloat; I’ll omit the manually inlined implementation and other types for brevity (but you can see them on github):
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Add(in VFloat a, in VFloat b, out VFloat result)
{
    result.X = a.X + b.X;
    result.Y = a.Y + b.Y;
    result.Z = a.Z + b.Z;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VFloat operator +(in VFloat a, in VFloat b)
{
    VFloat result;
    result.X = a.X + b.X;
    result.Y = a.Y + b.Y;
    result.Z = a.Z + b.Z;
    return result;
}
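The wide types follow the same pattern. As a rough sketch of what the omitted VAvx versions might look like (the real implementations are on github and may differ in detail):

// Sketch only. Avx.Add comes from System.Runtime.Intrinsics.X86.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Add(in VAvx a, in VAvx b, out VAvx result)
{
    result.X = Avx.Add(a.X, b.X);
    result.Y = Avx.Add(a.Y, b.Y);
    result.Z = Avx.Add(a.Z, b.Z);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static VAvx operator +(in VAvx a, in VAvx b)
{
    VAvx result;
    result.X = Avx.Add(a.X, b.X);
    result.Y = Avx.Add(a.Y, b.Y);
    result.Z = Avx.Add(a.Z, b.Z);
    return result;
}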
Each addition will be called several times in a loop. Some adds are independent, some are dependent. The result of each iteration gets stored into an accumulator to keep the loop from being optimized into nonexistence. Something like this:
for (int j = 0; j < innerIterationCount; ++j)
{
    ref var value = ref Unsafe.Add(ref baseValue, j);
    var r0 = value + value;
    var r1 = value + value;
    var r2 = value + value;
    var r3 = value + value;
    var i0 = r0 + r1;
    var i1 = r2 + r3;
    var i2 = i0 + i1;
    accumulator += i2;
}
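For comparison, the manually inlined variant writes the component math directly into the loop body. For VFloat, it looks roughly like this (a sketch of the idea using the same locals as above, not the exact benchmark code):

for (int j = 0; j < innerIterationCount; ++j)
{
    ref var value = ref Unsafe.Add(ref baseValue, j);
    // Every add spelled out by hand; no operators, no helper calls.
    VFloat r0, r1, r2, r3, i0, i1, i2;
    r0.X = value.X + value.X; r0.Y = value.Y + value.Y; r0.Z = value.Z + value.Z;
    r1.X = value.X + value.X; r1.Y = value.Y + value.Y; r1.Z = value.Z + value.Z;
    r2.X = value.X + value.X; r2.Y = value.Y + value.Y; r2.Z = value.Z + value.Z;
    r3.X = value.X + value.X; r3.Y = value.Y + value.Y; r3.Z = value.Z + value.Z;
    i0.X = r0.X + r1.X; i0.Y = r0.Y + r1.Y; i0.Z = r0.Z + r1.Z;
    i1.X = r2.X + r3.X; i1.Y = r2.Y + r3.Y; i1.Z = r2.Z + r3.Z;
    i2.X = i0.X + i1.X; i2.Y = i0.Y + i1.Y; i2.Z = i0.Z + i1.Z;
    accumulator.X += i2.X; accumulator.Y += i2.Y; accumulator.Z += i2.Z;
}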
Historically, using operators for value types implied a great deal of copying for both the parameters and returned value even when the function was inlined. (Allowing ‘in’ on operator parameters helps with this a little bit, at least in cases where the JIT isn’t able to eliminate the copies without assistance.)
To compensate, many C# libraries with any degree of performance sensitivity like XNA and its progeny offered ref/out overloads. That helped, but not being able to use operators efficiently always hurt readability. Having refs spewed all over your code wasn’t too great either, but in parameters (which require no call site decoration) have saved us from that in most cases.
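As a quick illustration of the tradeoff, compare the two call site styles using XNA's Vector3, which shipped exactly this kind of ref/out overload:

Vector3 a = new Vector3(1, 2, 3), b = new Vector3(4, 5, 6);

// Operator form: readable, but historically implied parameter/return copies.
Vector3 sum = a + b;

// ref/out overload form: avoided the copies at the cost of call site noise.
Vector3.Add(ref a, ref b, out Vector3 refSum);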
But for maximum performance, you had to bite the bullet and manually inline. It was a recipe for an unmaintainable mess, but it was marginally faster!
Focusing on VFloat for a moment, how does that situation look today? Testing on the .NET Core 3.0.0-preview1-26829-01 alpha runtime:
All effectively the same! The resulting assembly:
Manually Inlined:
vmovss xmm3,dword ptr [r8]
vaddss xmm3,xmm3,xmm3
vmovss xmm4,dword ptr [r8+4]
vaddss xmm4,xmm4,xmm4
vmovaps xmm5,xmm4
vmovss xmm6,dword ptr [r8+8]
vaddss xmm6,xmm6,xmm6
vmovaps xmm7,xmm6
vmovaps xmm8,xmm3
vaddss xmm4,xmm4,xmm5
vmovaps xmm5,xmm4
vaddss xmm6,xmm6,xmm7
vmovaps xmm7,xmm6
vaddss xmm3,xmm3,xmm8
vmovaps xmm8,xmm3
vaddss xmm4,xmm4,xmm5
vmovaps xmm5,xmm7
vaddss xmm5,xmm5,xmm6
vaddss xmm3,xmm3,xmm8
vaddss xmm0,xmm0,xmm3
vaddss xmm1,xmm1,xmm4
vaddss xmm2,xmm2,xmm5
Operator:
vmovss xmm3,dword ptr [r8]
vaddss xmm3,xmm3,xmm3
vmovaps xmm4,xmm3
vmovss xmm5,dword ptr [r8+4]
vaddss xmm5,xmm5,xmm5
vmovaps xmm6,xmm5
vmovss xmm7,dword ptr [r8+8]
vaddss xmm7,xmm7,xmm7
vmovaps xmm8,xmm7
vaddss xmm3,xmm3,xmm4
vmovaps xmm4,xmm3
vaddss xmm5,xmm5,xmm6
vmovaps xmm6,xmm5
vaddss xmm7,xmm7,xmm8
vmovaps xmm8,xmm7
vaddss xmm3,xmm3,xmm4
vmovaps xmm4,xmm6
vaddss xmm4,xmm4,xmm5
vmovaps xmm5,xmm8
vaddss xmm5,xmm5,xmm7
vaddss xmm0,xmm0,xmm3
vaddss xmm1,xmm1,xmm4
vaddss xmm2,xmm2,xmm5
The manually inlined version and the operator version differ by a single instruction. That's good news: using operators is, at least in some cases, totally fine now! Also note that there are only 12 vaddss instructions; the JIT recognized that the four value + value results are identical and cut out the other 12 redundant adds. Some cleverness!
Now let’s see how things look across all the test cases…
Oh, dear. The preview nature of this runtime has suddenly become relevant. Using an operator for the VAvx type is catastrophic. Comparing the manually inlined version to the operator version:
Manually Inlined:
vmovups ymm3,ymmword ptr [r9]
cmp dword ptr [r8],r8d
lea r9,[r8+20h]
vmovups ymm4,ymmword ptr [r9]
cmp dword ptr [r8],r8d
add r8,40h
vaddps ymm5,ymm3,ymm3
vaddps ymm6,ymm4,ymm4
vmovups ymm7,ymmword ptr [r8]
vaddps ymm8,ymm7,ymm7
vaddps ymm9,ymm3,ymm3
vaddps ymm10,ymm4,ymm4
vaddps ymm11,ymm7,ymm7
vaddps ymm12,ymm3,ymm3
vaddps ymm13,ymm4,ymm4
vaddps ymm14,ymm7,ymm7
vaddps ymm3,ymm3,ymm3
vaddps ymm4,ymm4,ymm4
vaddps ymm7,ymm7,ymm7
vaddps ymm6,ymm6,ymm10
vaddps ymm8,ymm8,ymm11
vaddps ymm3,ymm12,ymm3
vaddps ymm4,ymm13,ymm4
vaddps ymm7,ymm14,ymm7
vaddps ymm4,ymm6,ymm4
vaddps ymm6,ymm8,ymm7
vaddps ymm5,ymm5,ymm9
vaddps ymm3,ymm5,ymm3
vaddps ymm0,ymm3,ymm0
vaddps ymm1,ymm4,ymm1
vaddps ymm2,ymm6,ymm2
Operator:
lea rdx,[rsp+2A0h]
vxorps xmm0,xmm0,xmm0
vmovdqu xmmword ptr [rdx],xmm0
vmovdqu xmmword ptr [rdx+10h],xmm0
vmovdqu xmmword ptr [rdx+20h],xmm0
vmovdqu xmmword ptr [rdx+30h],xmm0
vmovdqu xmmword ptr [rdx+40h],xmm0
vmovdqu xmmword ptr [rdx+50h],xmm0
vmovupd ymm0,ymmword ptr [rbp]
vaddps ymm0,ymm0,ymmword ptr [rbp]
lea rdx,[rsp+2A0h]
vmovupd ymmword ptr [rdx],ymm0
vmovupd ymm0,ymmword ptr [rbp+20h]
vaddps ymm0,ymm0,ymmword ptr [rbp+20h]
lea rdx,[rsp+2C0h]
vmovupd ymmword ptr [rdx],ymm0
vmovupd ymm0,ymmword ptr [rbp+40h]
vaddps ymm0,ymm0,ymmword ptr [rbp+40h]
lea rdx,[rsp+2E0h]
vmovupd ymmword ptr [rdx],ymm0
lea rcx,[rsp+540h]
lea rdx,[rsp+2A0h]
lea rdx,[rsp+2A0h]
mov r8d,60h
call 00007FFC1961C290
repeat another 7 times
The manually inlined variant does pretty well, producing a tight sequence of 24 vaddps instructions operating on ymm registers. Without optimizing away the redundant adds, that’s about as good as you’re going to get.
The operator version is… less good. Clearing a bunch of memory, unnecessary loads and stores, capped off with a curious function call. Not surprising that it’s 50 times slower.
Clearly something wonky is going on there, but let’s move on for now. Zooming in a bit so we can see the other results:
Both Vector{T} and AVX are slower than VFloat when manually inlined, but that's expected given that half of VFloat's adds got optimized away. Unfortunately, it looks like even non-operator functions take a hit relative to the manually inlined implementation.
When manually inlined, 8-wide AVX is also a little faster than 4-wide Vector{T}. On a 3770K, the relevant 4-wide and 8-wide instructions have the same throughput, so being pretty close is expected. The marginal slowdown arises from the Vector{T} implementation using extra vmovupd instructions to load input values. Manually caching the values in a local variable actually helps some.
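For example, a hypothetical tweak to the earlier loop: copy the memory-backed value into a local once so the later adds can work out of registers instead of reloading for each operand (a sketch only; how much the JIT takes advantage depends on the version):

for (int j = 0; j < innerIterationCount; ++j)
{
    // One copy out of memory per iteration; the adds then reuse the register-resident local
    // rather than triggering a fresh vmovupd for every operand.
    var v = Unsafe.Add(ref baseValue, j);
    var r0 = v + v;
    var r1 = v + v;
    var r2 = v + v;
    var r3 = v + v;
    accumulator += (r0 + r1) + (r2 + r3);
}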
Focusing on the function and operator slowdown, here’s the assembly generated for the Vector{T} function and operator cases:
Vector<T> add function:
vmovupd xmm0,xmmword ptr [r8]
vmovupd xmm1,xmmword ptr [r8]
vaddps xmm0,xmm0,xmm1
vmovapd xmmword ptr [rsp+110h],xmm0
vmovupd xmm0,xmmword ptr [r8+10h]
vmovupd xmm1,xmmword ptr [r8+10h]
vaddps xmm0,xmm0,xmm1
vmovapd xmmword ptr [rsp+100h],xmm0
vmovupd xmm0,xmmword ptr [r8+20h]
vmovupd xmm1,xmmword ptr [r8+20h]
vaddps xmm0,xmm0,xmm1
vmovapd xmmword ptr [rsp+0F0h],xmm0
repeat
Vector<T> operator:
vmovapd xmm3,xmmword ptr [rsp+170h]
vmovapd xmm4,xmmword ptr [rsp+160h]
vmovapd xmm5,xmmword ptr [rsp+150h]
vmovupd xmm6,xmmword ptr [r8]
vmovupd xmm7,xmmword ptr [r8]
vaddps xmm6,xmm6,xmm7
vmovapd xmmword ptr [rsp+140h],xmm6
vmovupd xmm6,xmmword ptr [r8+10h]
vmovupd xmm7,xmmword ptr [r8+10h]
vaddps xmm6,xmm6,xmm7
vmovapd xmmword ptr [rsp+130h],xmm6
vmovupd xmm6,xmmword ptr [r8+20h]
vmovupd xmm7,xmmword ptr [r8+20h]
vaddps xmm6,xmm6,xmm7
vmovapd xmmword ptr [rsp+120h],xmm6
repeat
Nothing crazy happening, but there’s clearly a lot of register juggling that the earlier manually inlined AVX version didn’t do. The add function versus manual inlining difference is more pronounced in the AVX case, but the cause is similar (with some more lea instructions).
But this is an early preview version. What happens if we update to a daily build from a few weeks after the one tested above?
A little better on function AVX, and more than 17 times faster on operator AVX. Not ideal, perhaps, but much closer to reasonable.
(If you’re wondering why the AVX path seems to handle things differently than the Vector{T} paths, Vector{T} came first and has its own set of JIT intrinsic implementations. The two may become unified in the future, on top of some additional work to avoid quality regressions.)
Microbenchmarks are one thing; how do these kinds of concerns show up in actual use? As an example, consider the box-box contact test. To avoid a book-length post, I’ll omit the generated assembly.
Given that manual inlining isn’t exactly a viable option in most cases, v2 usually uses static functions with in/out parameters. As expected, the generated code looks similar to the microbenchmark with the same kind of function usage. Here’s a VTune snapshot of the results:
The CPI isn’t horrible, but most of the bottleneck is related to the loading instructions. The above breaks out the 37.4% of cycles which are stalled on front-end bottlenecks. The instruction cache misses and delivery inefficiencies become relevant when there are no memory bottlenecks to hide them. With deeper analysis, many moves and loads/stores could be eliminated and this could get a nice boost.
Another fun note, from the header of BoxPairTester.Test when inlining the function is disabled:
mov ecx,2AAh
xor eax,eax
rep stos dword ptr [rdi]
CoreCLR aggressively clears locally allocated variables if the IL locals init flag is set. Given that the flag is almost always set, it's possible to spend a lot of time pointlessly zeroing memory. Here, the rep stos instruction performs 2AAh = 682 iterations. Each iteration sets 4 bytes of memory to the value of the just-zeroed eax register, so this zeroes out 2728 bytes of stack space every single time the function is called.
In practice, many such clears are amortized over multiple function calls by forcing inlining, but unless the locals init flag is stripped, they’ll still happen. When compiled under ReleaseStrip configuration, v2 uses a post-build step to strip the locals init flag (and in the future there will likely be other options). Some simulations can improve by over 10% with the clearing stripped.
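The stripping itself is conceptually simple. Here's a minimal sketch of a post-build step using Mono.Cecil; this is illustrative, and not necessarily the tooling v2's build step actually uses:

// Sketch: clear the locals init flag on every method body in an assembly.
using Mono.Cecil;

static class InitLocalsStripper
{
    public static void Strip(string assemblyPath)
    {
        var parameters = new ReaderParameters { ReadWrite = true };
        using (var module = ModuleDefinition.ReadModule(assemblyPath, parameters))
        {
            foreach (var type in module.GetTypes())
                foreach (var method in type.Methods)
                    if (method.HasBody)
                        method.Body.InitLocals = false; // the JIT no longer has to zero the whole frame
            module.Write();
        }
    }
}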
Summary
If you’re writing the kind of code where the generated assembly quality actually matters and isn’t bottlenecked by something else like memory, you should probably sanity test the performance occasionally or peek at the generated assembly to check things out. The JIT is improving, but there are limits to how much deep analysis can be performed on the fly without interfering with user experience.
And if you’re trying to use preview features that are still under active development, well, you probably know what you’re getting into.