A different kind of C#: the JIT doesn't really need demonic empowerment

It would be a little silly to write a series on performance in C# without mentioning the Just-In-Time (JIT) compiler. Unlike an offline toolchain that precompiles assemblies for specific platforms ahead of time (AOT), many C# applications compile on demand on the end user's device. While this theoretically gives a JIT more knowledge about the target system, it also constrains how much time is available to compile. Most users won't tolerate a 45-second startup time even if it does make everything run 30% faster afterwards.

It's worth mentioning that there are AOT compilation paths, and some platforms require AOT. Mono has historically provided such a path, .NET Native is used for UWP apps, and the newer CoreRT is moving along steadily. AOT does not always imply deep offline optimization, but the relaxation of time constraints at least theoretically helps. There's also ongoing work on tiered compilation, which could eventually lead to higher optimization tiers.

One common concern is that compiling through any of today's JITs will result in inferior optimizations, rendering C# a dead end for truly high performance code. It's definitely true that the JIT is not able to optimize as deeply as an offline process, and this can show up in a variety of use cases.
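For a taste of what that gap can look like, here's a little sketch of my own (neither function is from the engine). RyuJIT has historically not autovectorized scalar loops, so getting SIMD out of the JIT means asking for it explicitly through System.Numerics, while an offline C++ compiler would often vectorize the scalar version on its own:

using System.Numerics;

//RyuJIT won't autovectorize this scalar loop; many offline compilers would.
//Both functions assume a and b have equal lengths.
static float DotScalar(float[] a, float[] b)
{
    float sum = 0;
    for (int i = 0; i < a.Length; ++i)
        sum += a[i] * b[i];
    return sum;
}

//Getting SIMD from the JIT requires requesting it explicitly.
static float DotWide(float[] a, float[] b)
{
    var wideSum = Vector<float>.Zero;
    int i = 0;
    for (; i + Vector<float>.Count <= a.Length; i += Vector<float>.Count)
        wideSum += new Vector<float>(a, i) * new Vector<float>(b, i);
    //Horizontal sum of the wide accumulator, then a scalar tail for leftovers.
    var sum = Vector.Dot(wideSum, Vector<float>.One);
    for (; i < a.Length; ++i)
        sum += a[i] * b[i];
    return sum;
}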

But before diving into that, I would like to point out some important context. Consider the following simulation, a modified version of the ClothLatticeDemo. It's 65536 spheres connected by 260610 ball socket joints, plus whatever collision-related constraints occur on impact with the table-ball-thing.
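If you're wondering where those numbers come from, the lattice is just a 256x256 grid with orthogonal and diagonal neighbor links. Schematically, something like the following, where AddSphere and AddBallSocket are hypothetical stand-ins rather than the engine's actual API:

//Hypothetical sketch of the lattice construction; AddSphere and AddBallSocket
//stand in for the real demo setup code.
const int width = 256;
var handles = new int[width, width];
for (int i = 0; i < width; ++i)
    for (int j = 0; j < width; ++j)
        handles[i, j] = AddSphere(i, j); //256 * 256 = 65536 spheres.
for (int i = 0; i < width; ++i)
{
    for (int j = 0; j < width; ++j)
    {
        //Orthogonal links: 2 * 256 * 255 = 130560.
        if (i + 1 < width) AddBallSocket(handles[i, j], handles[i + 1, j]);
        if (j + 1 < width) AddBallSocket(handles[i, j], handles[i, j + 1]);
        //Diagonal links: 2 * 255 * 255 = 130050, for 260610 joints total.
        if (i + 1 < width && j + 1 < width)
        {
            AddBallSocket(handles[i, j], handles[i + 1, j + 1]);
            AddBallSocket(handles[i + 1, j], handles[i, j + 1]);
        }
    }
}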

clothlattice256x256.png

On my 3770K, it runs at about 30 ms per frame prior to impact, and about 45 ms per frame after impact. The vast majority of that time is spent in the solver executing code that looks like this (from BallSocket.cs):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Solve(ref BodyVelocities velocityA, ref BodyVelocities velocityB, ref BallSocketProjection projection, ref Vector3Wide accumulatedImpulse)
{
    //Compute the constraint space velocity error: the relative velocity of the anchor points
    //(linear plus angular contributions), subtracted from the bias velocity.
    Vector3Wide.Subtract(velocityA.Linear, velocityB.Linear, out var csv);
    Vector3Wide.CrossWithoutOverlap(velocityA.Angular, projection.OffsetA, out var angularCSV);
    Vector3Wide.Add(csv, angularCSV, out csv);
    Vector3Wide.CrossWithoutOverlap(projection.OffsetB, velocityB.Angular, out angularCSV);
    Vector3Wide.Add(csv, angularCSV, out csv);
    Vector3Wide.Subtract(projection.BiasVelocity, csv, out csv);

    //Transform the velocity error by the effective mass to get a corrective impulse,
    //soften it against the accumulated impulse, and accumulate the result.
    Symmetric3x3Wide.TransformWithoutOverlap(csv, projection.EffectiveMass, out var csi);
    Vector3Wide.Scale(accumulatedImpulse, projection.SoftnessImpulseScale, out var softness);
    Vector3Wide.Subtract(csi, softness, out csi);
    Vector3Wide.Add(accumulatedImpulse, csi, out accumulatedImpulse);

    ApplyImpulse(ref velocityA, ref velocityB, ref projection, ref csi);
}

It's a whole bunch of math in pretty tight loops. Exactly the kind of situation where you might expect a better optimizer to provide significant wins. And without spoiling much, I can tell you that the JIT could do better with the generated assembly here.
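For context, those Vector3Wide calls are thin wrappers around System.Numerics SIMD types. A simplified sketch of the idea (the real type in the engine has many more operations):

using System.Numerics;
using System.Runtime.CompilerServices;

//Simplified sketch: each Vector3Wide packs Vector<float>.Count 3d vectors together,
//so every operation in Solve works on 4 or 8 lanes at once depending on hardware.
public struct Vector3Wide
{
    public Vector<float> X, Y, Z;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Subtract(in Vector3Wide a, in Vector3Wide b, out Vector3Wide result)
    {
        result.X = a.X - b.X;
        result.Y = a.Y - b.Y;
        result.Z = a.Z - b.Z;
    }
}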

Now imagine someone at Microsoft (or maybe even you, it's open source after all!) receives supernatural knowledge in a fever dream and trades their soul to empower RyuJIT. Perversely blessed by the unfathomable darkness below, RyuJIT v666 somehow makes your CPU execute all instructions in 0 cycles. Instructions acting only upon registers are completely free with infinite throughput, and the only remaining cost is waiting on data from cache and memory.

How much faster would this simulation run on my 3770K when compiled with RyuJIT v666?

Take a moment and make a guess.

Infinitely faster can be ruled out: even L1 cache wouldn't be able to keep up with this demonically empowered CPU. But maybe the cost would drop from 45 milliseconds to 1 millisecond? Maybe 5 milliseconds?

From VTune:

clothlatticememorybandwidth.png

The maximum effective bandwidth of the 3770K in the measured system is about 23 GBps. Prior to impact, the simulation is consuming 18-19 GBps of that. Post-impact, it hovers around 15 GBps, somewhat reduced thanks to the time spent in the less bandwidth-heavy collision detection phase. (There's also a bandwidth usage dip hidden by that popup box that corresponds to increased bookkeeping when all the collisions are being created, but it levels back out to around 15 GBps pretty quickly.)
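If you don't have VTune handy, a streaming loop can give a ballpark read on effective bandwidth. This is a crude single-threaded sketch of my own; it will usually undershoot what a profiler reports as the platform peak:

using System;
using System.Diagnostics;
using System.Numerics;

class BandwidthEstimate
{
    static void Main()
    {
        //Use a buffer far larger than L3 so reads mostly come from main memory.
        var data = new float[256 * 1024 * 1024 / sizeof(float)]; //256 MB
        var sum = Vector<float>.Zero;
        var timer = Stopwatch.StartNew();
        //Multiple passes amortize the page faults paid on the first touch.
        const int passes = 8;
        for (int pass = 0; pass < passes; ++pass)
            for (int i = 0; i < data.Length; i += Vector<float>.Count)
                sum += new Vector<float>(data, i);
        timer.Stop();
        var gigabytes = passes * (double)data.Length * sizeof(float) / (1 << 30);
        //The sum is printed so the loop can't be treated as dead code.
        Console.WriteLine($"~{gigabytes / timer.Elapsed.TotalSeconds:F1} GBps (sum {Vector.Dot(sum, Vector<float>.One)})");
    }
}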

If we assume that the only bottleneck is memory bandwidth, the speedup is at most about 1.25x before impact and 1.55x after. In other words, the frame times would drop to no better than roughly 24 and 29 milliseconds respectively. Realistically, stalls caused by memory latency would prevent even those ideal speedups from being achieved.
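Spelled out, the bound is just a ratio of bandwidths:

//Upper bound on speedup when memory bandwidth is the only bottleneck:
//speedup <= peak bandwidth / consumed bandwidth.
var preImpactSpeedup = 23.0 / 18.5;   //~1.24x
var postImpactSpeedup = 23.0 / 15.0;  //~1.53x
var preImpactFrameBound = 30.0 / preImpactSpeedup;   //~24 ms
var postImpactFrameBound = 45.0 / postImpactSpeedup; //~29 ms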

RyuJIT v666, and by extension all earthly optimizing compilers, can't speed this simulation up much. Even if I rewrote it all in C, it would be unwise to expect more than a few percent. Further, given that compute performance tends to improve faster than memory bandwidth, newer processors will tend to benefit even less from demons.

Of course, not every simulation is quite so memory bandwidth bound. Simulations that involve complex colliders like meshes will tend to have more room for magic compilers to work. It just won't ever be that impressive.

So, could the JIT-generated assembly be better? Absolutely, and it is getting better, rapidly. Could there sometimes be serious issues for very specific kinds of code, particularly when unbound by memory bottlenecks? Yes.

But is it good enough to create many complex speedy applications? Yup!