BEPUphysics v1.5.0 now on GitHub!

I decided to move the demos over to MonoGame after many years of procrastination, then looked back on all the commits and said, hey, I might as well package these up for a release.

Then CodePlex was temporarily refusing my pushes, so I said, hey, might as well put it on GitHub.

Check out the newness on GitHub!

(Certain observers may note that BEPUphysics v2 is, in fact, not yet out, nor is it even being actively developed yet. Don't worry, it's not dead or anything, I just have to get this other semirelated project out of the way. I'm hoping to get properly started on v2 in very early 2017, but do remember how good I am at estimating timelines.)

Blazing Fast Trees

The prototype for the first big piece of BEPUphysics v2.0.0 is pretty much done: a tree.

This tree will (eventually) replace all the existing trees in BEPUphysics and act as the foundation of the new broad phase.

So how does the current prototype compare with v1.4.0's broad phase?

It's a lot faster.

The measured 'realistic' test scene includes 65536 randomly positioned cubic leaves ranging from 1 to 100 units across, with leaf size given by 1 + 99 * X^10, where X is a uniform random value from 0 to 1. In other words, there are lots of smaller objects and a few big objects, and the average size is 10 units. All leaves are moving in random directions with speeds given by 10 * X^10, where X is a uniform random value from 0 to 1, and they bounce off the predefined world bounds (a large cube) so that they stay in the same volume. The number of overlaps ranges between 65600 and 66300.
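For concreteness, here's a quick sketch of how those distributions can be sampled. This is illustrative only, not the actual benchmark generator; note that the expected leaf size works out to 1 + 99/11 = 10, matching the quoted average.

```csharp
using System;

//Illustrative sampling of the benchmark's distributions, not the real test code.
//Both X^10 curves bias heavily toward small values: lots of small, slow leaves
//and a few large, fast ones. Expected size is 1 + 99/11 = 10 units.
var random = new Random(5);
Func<double> sampleLeafSize = () => 1 + 99 * Math.Pow(random.NextDouble(), 10);
Func<double> sampleLeafSpeed = () => 10 * Math.Pow(random.NextDouble(), 10);
```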

Both simulations are multithreaded with 8 threads on a 3770K at 4.5 GHz. Notably, the benchmarking environment was not totally clean. The small spikes visible in the new implementation do not persist between runs and are just other programs occasionally interfering.

So, the first obvious thing you might notice is that the old version spikes like crazy. Those spikes were a driving force behind this whole rewrite. What's causing them, and how bad can they get?


The answers are refinement and really bad. Each one of those spikes represents a reconstruction of part of the tree which has expanded beyond its optimal size. Those reconstructions aren't cheap, and more importantly, they are unbounded. If a reconstruction starts near the root, it may force a reconstruction of large fractions of the tree. If you're really unlucky, it will be so close to the root that the main thread has to do it. In the worst case, the root itself might get reconstructed- see that spike on frame 0? The graph is actually cut off; it took 108ms. While a full root reconstruction usually only happens on the first frame, the other reconstructions are clearly bad enough. These are multi-frame spikes that a user can definitely notice if they're paying attention. Imagine how that would feel in VR.

To be fair to the old broad phase, this test is a lot more painful than most simulations. The continuous divergent motion nearly maximizes the amount of reconstruction required. 

But there's something else going on, and it might be even worse. Notice that slow upward slope in the first graph? The new version doesn't have it at all, so it's not a property of the scene itself. What does the tree quality look like?


This graph represents the computed cost of the tree. If you've heard of surface area heuristic tree builders in raytracing, this is basically the same thing except the minimized metric is volume instead of surface area. (Volume queries and self collision tests have a probability of overlap proportional to volume, while ray-AABB intersection probability is proportional to surface area. They usually produce pretty similar trees, though.)
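To make the metric concrete, here's a rough sketch assuming the cost is simply the sum of node bounding box volumes normalized by the root's volume. The real tree's metric may differ in details like leaf handling and weighting; the NodeBounds helper is purely illustrative and gets reused by later sketches.

```csharp
using System.Numerics;

public struct NodeBounds
{
    public Vector3 Min, Max;

    public float Volume()
    {
        var span = Max - Min;
        return span.X * span.Y * span.Z;
    }

    public static NodeBounds Merge(NodeBounds a, NodeBounds b)
    {
        return new NodeBounds { Min = Vector3.Min(a.Min, b.Min), Max = Vector3.Max(a.Max, b.Max) };
    }
}

public static class TreeCost
{
    //Sum of node volumes divided by the root's volume: the volume analogue of
    //the surface area heuristic. Lower is better; a full rebuild minimizes it.
    public static float Compute(NodeBounds[] nodeBounds, int rootIndex)
    {
        float sum = 0;
        for (int i = 0; i < nodeBounds.Length; ++i)
        {
            if (i != rootIndex)
                sum += nodeBounds[i].Volume();
        }
        return sum / nodeBounds[rootIndex].Volume();
    }
}
```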

The new tree starts with poor quality since the tree was built using incremental insertion, but the new refinement process quickly reduces cost. It gets to around 37.2, compared to a full sweep rebuild of around 31.9.

The old tree starts out better since the first frame's root reconstruction does a full median split build. But what happens afterward? That doesn't look good. What happens if the tree churns faster? How about a bunch of objects moving 10-100 instead of 0-10 units per second, with the same distribution?


Uh oh. The cost increases pretty quickly, and the self test cost rises in step. By the end, the new version is well over 10 times as fast. As you might expect, faster leaf speeds are even worse. I neglected to fully benchmark that since a cost metric 10000 times higher than it should be slows things down a little.

What's happening?

The old tree reconstructs nodes when their volume goes above a certain threshold. After the reconstruction, a new threshold is computed based on the result of the reconstruction. Unfortunately, that new threshold lets the tree degrade further next time around. Eventually, the threshold ratchets high enough that very few meaningful refinements occur. Note in the graph that the big refinement time spikes are mostly gone after frame 1000. If enough objects are moving chaotically for long periods of time, this problem could show up in a real game.

This poses a particularly large problem for long-running simulations like those on a persistent game server. The good news is that the new version has no such problem; the bad news is that there is no good workaround for the old version. For now, if you run into this problem, try periodically calling DynamicHierarchy.ForceRebuild (or look for the internal ForceRevalidation in older versions). As the name implies, it will reset the tree quality, but at a hefty price. Expect to drop multiple frames.
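Something along these lines can serve as a stopgap. This is only a sketch: the interval is arbitrary and needs tuning per simulation, the namespaces are from memory, and it assumes the Space is using the default DynamicHierarchy broad phase.

```csharp
using BEPUphysics;
using BEPUphysics.BroadPhaseSystems.Hierarchies;

//Workaround sketch for v1.x: periodically reset broad phase tree quality.
//The interval is arbitrary; expect a multi-frame hitch whenever it fires.
public class BroadPhaseRebuilder
{
    private readonly Space space;
    private int timeStepsSinceRebuild;
    private const int RebuildInterval = 2000; //Tune for your simulation.

    public BroadPhaseRebuilder(Space space)
    {
        this.space = space;
    }

    public void Update(float dt)
    {
        space.Update(dt);
        if (++timeStepsSinceRebuild >= RebuildInterval)
        {
            timeStepsSinceRebuild = 0;
            var hierarchy = space.BroadPhase as DynamicHierarchy;
            if (hierarchy != null)
                hierarchy.ForceRebuild(); //Resets tree quality at the cost of a big spike.
        }
    }
}
```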

(This failure is blindingly obvious in hindsight, and I don't know how I missed it when designing it, benchmarking it, or using it. I'm also surprised no one's reported it to my knowledge. Oops!)

So, how about if nothing is moving?


The old version manages to maintain a constant slope, though it still has some nasty spikes. Interestingly, those aren't primarily from refinement, as we'll see in a moment.

This is also a less favorable comparison for the new tree, "only" being 3 times as fast.

Splitting the time contributions helps explain both observations:

The old version's spikes can't be reconstructions given that everything is totally stationary, and the self test shows them too. I didn't bother fully investigating this, but one possible source is poor load balancing. It uses a fairly blind work collector, making it very easy to end up with one thread overworked. The new version, in contrast, is smarter about selecting subtasks of similar size and also collects more of them.

So why is the new refinement only a little bit faster if the self test is 3.5 times faster? Two reasons. First, the new refinement is never satisfied with doing no work, so in this kind of situation it does a bit too much. Second, I just haven't spent much time optimizing the refinement blocks for low work situations like this. These blocks are fairly large compared to the needs of a totally stationary tree, so very few of them need to be dispatched. In this case, there were only 2. The other threads sit idle during that particular subphase. In other words, the new tree is currently tuned for harder workloads.

Now, keeping leaves stationary, what happens when the density of leaves is varied? First, a sparse distribution with 8 times the volume (and so about one eighth the overlaps):


A bit over twice as fast. A little disappointing, but this is another one of those 'easy' cases where the new refinement implementation doesn't really adapt to very small workloads, providing marginal speedups.

How about the reverse? 64 times more dense than the above, with almost 500000 overlaps. With about 8 overlaps per leaf, this is roughly the density of a loose pile.


Despite the fact that the refinement suffers from the same 'easy simulation' issue, the massive improvement in test times brings the total speedup to over 5 times faster. The new tree's refinement takes less than a millisecond on both the sparse and dense cases, but the dense case stresses the self test vastly more. And the old tree is nowhere near as fast at collision tests.

Next up: while maintaining the same medium density of leaves (about one overlap per leaf), vary the number. Leaves are moving at the usual 0-10 speed again for these tests.  First, a mere 16384 leaves instead of 65536:

Only about 2.5 times faster near the end. The split timings are interesting, though: 

The self test marches along at around 3.5 times as fast near the end, but the refinement is actually slower... if you ignore the enormous spikes of the old version. Once again, there's just not enough work to do and the work chunks are too big at the moment. 400 microseconds is pretty okay, though.

How about a very high leaf count, say, 262144 leaves? 

Around 4 times as fast. Refinement has enough to chomp on.

Refinement alone hangs around 2.5-2.75 times as fast, which is pretty fancy considering how much more work it's doing. As usual, the self test is super speedy, only occasionally dropping below 4.20 times as fast.

How about multithreaded scaling? I haven't investigated higher core counts yet, but here are the new tree's results for single threaded versus full threads on the 3770K under the original 65536 'realistic' case:


Very close to exactly 4 times as fast total. Self tests float around 4.5 times faster. As described earlier, this kind of 'easy' simulation results in fairly low scaling in refinement- only about 2.3 times faster. If everything were flying around at higher speeds, refinement would be stressed more, and more work would be available.

For completeness, here's the new tree versus the old tree, single threaded, in the same simulation:


Refinement is about 3 times faster (ignoring spikes), and overall it's about 4.5 times faster.

How does it work?

The biggest conceptual change is the new refinement phase. It has three subphases:

1) Refit

As objects move, the node bounds must adapt. Rather than doing a full tree reconstruction every frame, the node bounds are recursively updated to contain all their children.

During the refit traversal, two additional pieces of information are collected. First, nodes with a child leaf count below a given threshold are added to the 'refinement candidates' set. These candidates are the roots of a bunch of parallel subtrees. Second, the change in volume of every node is computed. The sum of every node's change in volume divided by the root's volume provides the change in the cost metric of the tree for this frame.
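Here's a simplified, single threaded sketch of that refit pass, reusing the NodeBounds helper from the cost metric sketch above. The real tree's flattened node layout, threshold value, and bookkeeping differ; this just shows the shape of the traversal.

```csharp
using System.Collections.Generic;

//Simplified, single threaded refit sketch. Leaf bounds are assumed to have
//already been updated by their owners before this pass runs.
public class RefitNode
{
    public NodeBounds Bounds;
    public RefitNode ChildA, ChildB; //Both null for leaf nodes in this sketch.
    public int LeafCount;
}

public static class Refitter
{
    const int CandidateLeafThreshold = 256; //Illustrative value.

    public static NodeBounds Refit(RefitNode node, bool ancestorIsCandidate,
        List<RefitNode> refinementCandidates, ref float volumeChange)
    {
        if (node.ChildA == null)
            return node.Bounds;

        //The topmost subtrees at or below the leaf count threshold become
        //refinement candidates; deeper nodes are already covered by an ancestor.
        bool isCandidate = !ancestorIsCandidate && node.LeafCount <= CandidateLeafThreshold;
        if (isCandidate)
            refinementCandidates.Add(node);

        var a = Refit(node.ChildA, ancestorIsCandidate || isCandidate, refinementCandidates, ref volumeChange);
        var b = Refit(node.ChildB, ancestorIsCandidate || isCandidate, refinementCandidates, ref volumeChange);

        float oldVolume = node.Bounds.Volume();
        node.Bounds = NodeBounds.Merge(a, b);
        //Summing every node's change in volume and dividing by the root volume
        //gives the frame's change in the cost metric.
        volumeChange += node.Bounds.Volume() - oldVolume;
        return node.Bounds;
    }
}
```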

2) Binned Refine

A subset of the refinement candidates collected by the refit traversal are selected. The number of selected candidates is based on the refit's computed change in cost; a bigger increase means more refinements. The frame index is used to select different refinement candidates as time progresses, guaranteeing that the whole tree eventually gets touched.

The root always gets added as a refinement target. However, the refinement is bounded. All of these refinements tend to be pretty small. Currently, any individual refinement in a tree with 65536 leaves will collect no more than 768 subtrees, a little over 1%. That's why there are no spikes in performance.
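A sketch of what that selection can look like, reusing RefitNode from the refit sketch; the scaling constant and stride logic are arbitrary stand-ins for the real tuning.

```csharp
using System;
using System.Collections.Generic;

public static class RefinementScheduler
{
    //Pick refinement targets from the refit-collected candidates. The number
    //scales with the cost change, and the frame index rotates through the
    //candidate list so everything is eventually refined.
    public static List<RefitNode> SelectTargets(List<RefitNode> candidates, RefitNode root,
        float costChange, long frameIndex)
    {
        var targets = new List<RefitNode>();
        if (candidates.Count > 0)
        {
            float aggressiveness = Math.Max(0, costChange) * 16; //Arbitrary scaling.
            int targetCount = Math.Min(candidates.Count, 1 + (int)(candidates.Count * aggressiveness));
            float stride = (float)candidates.Count / targetCount;
            int start = (int)(frameIndex % candidates.Count);
            //Occasional duplicates from rounding are ignored in this sketch.
            for (int i = 0; i < targetCount; ++i)
                targets.Add(candidates[(start + (int)(i * stride)) % candidates.Count]);
        }
        //The root is always refined, but with a bounded subtree collection.
        targets.Add(root);
        return targets;
    }
}
```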

Here's an example of candidates and targets in a tree with 24 leaves:

The number within each node is the number of leaves in the children of that node. Green circles are leaf nodes, purple circles are refinement candidates that weren't picked, and red circles are the selected refinement targets. In this case, the maximum number of subtrees for any refinement was chosen as 8.

Since the root has so many potential nodes available, it has options about which nodes to refine. Rather than just diving down the tree a fixed depth, it seeks out the largest nodes by volume. Large nodes tend to be a high-leverage place to spend refinement time. Consider a leaf node that's moved far enough from its original position that it should be in a far part of the tree. Its parents will tend to have very large bounds, and refinement will see that.

For multithreading, refinement targets are marked (only the refinement treelet root, though- no need to mark every involved node). Refinement node collection will avoid collecting nodes beyond any marked node, allowing refinements to proceed in parallel.

The actual process applied to each refinement target is just a straightforward binned builder that operates on the collected nodes. (For more about binned builders, look up "On fast Construction of SAH-based Bounding Volume Hierarchies" by Ingo Wald.)
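For flavor, here's roughly what the split selection at the core of a binned build step looks like when the cost uses volume instead of surface area. This is a standalone sketch rather than the refinement's actual code; the binning resolution, cost weighting, and node allocation all differ, and degenerate cases are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class BinnedSplitter
{
    //One split step of a binned build over collected subtree bounds, using a
    //count * volume cost (the volume analogue of binned SAH). Degenerate cases
    //(e.g., all centroids identical) are not handled here.
    public static void Split(List<NodeBounds> subtrees, List<NodeBounds> left, List<NodeBounds> right)
    {
        const int binCount = 16;

        //Bin along the longest axis of the centroid bounds.
        Vector3 min = new Vector3(float.MaxValue), max = new Vector3(float.MinValue);
        foreach (var s in subtrees)
        {
            var centroid = (s.Min + s.Max) * 0.5f;
            min = Vector3.Min(min, centroid);
            max = Vector3.Max(max, centroid);
        }
        var span = max - min;
        int axis = span.X >= span.Y ? (span.X >= span.Z ? 0 : 2) : (span.Y >= span.Z ? 1 : 2);
        float axisMin = axis == 0 ? min.X : axis == 1 ? min.Y : min.Z;
        float axisSpan = Math.Max(1e-7f, axis == 0 ? span.X : axis == 1 ? span.Y : span.Z);

        //Assign each subtree to a bin by its centroid along the chosen axis.
        var bins = new int[subtrees.Count];
        for (int i = 0; i < subtrees.Count; ++i)
        {
            var centroid = (subtrees[i].Min + subtrees[i].Max) * 0.5f;
            float c = axis == 0 ? centroid.X : axis == 1 ? centroid.Y : centroid.Z;
            bins[i] = Math.Min(binCount - 1, (int)((c - axisMin) / axisSpan * binCount));
        }

        //Try every bin boundary and keep the partition with the lowest cost.
        float bestCost = float.MaxValue;
        int bestBoundary = binCount / 2;
        for (int boundary = 1; boundary < binCount; ++boundary)
        {
            NodeBounds? a = null, b = null;
            int countA = 0, countB = 0;
            for (int i = 0; i < subtrees.Count; ++i)
            {
                if (bins[i] < boundary) { a = a == null ? subtrees[i] : NodeBounds.Merge(a.Value, subtrees[i]); ++countA; }
                else { b = b == null ? subtrees[i] : NodeBounds.Merge(b.Value, subtrees[i]); ++countB; }
            }
            if (countA == 0 || countB == 0)
                continue;
            float cost = countA * a.Value.Volume() + countB * b.Value.Volume();
            if (cost < bestCost) { bestCost = cost; bestBoundary = boundary; }
        }
        for (int i = 0; i < subtrees.Count; ++i)
            (bins[i] < bestBoundary ? left : right).Add(subtrees[i]);
    }
}
```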

3) Cache Optimize 

The old tree allocated nodes as reference types and left them scattered through memory. Traversing the tree was essentially a series of guaranteed cache misses. This is not ideal.

The new tree is just a single contiguous array. While adding/removing elements and binned refinements can scramble the memory order relative to tree traversal order, it's possible to cheaply walk through parts of the tree and shuffle nodes around so that they're in the correct relative positions. A good result only requires optimizing a fraction of the tree; 3% to 5% works quite well when things aren't moving crazy fast. The fraction of cache optimized nodes scales with refit-computed cost change as well, so it compensates for the extra scrambling effects of refinement. In most cases, the tree will sit at 80-95% of cache optimal. (Trees with only a few nodes, say less than 4096, will tend to have a harder time keeping up right now, but they take microseconds anyway.)
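To make 'cache optimal' concrete, the sketch below copies an array-backed tree into depth-first order in one shot. The real tree instead nudges a small fraction of nodes toward that layout in place each frame, which is far cheaper but fiddlier; the node format here is illustrative.

```csharp
public class ArrayTree
{
    public struct ArrayNode
    {
        public NodeBounds Bounds;
        public int Parent;
        public int ChildA, ChildB; //Negative values encode leaves in this sketch.
    }

    public ArrayNode[] Nodes;

    //Copy the whole tree into depth-first order so traversal touches mostly
    //adjacent memory. The incremental optimizer approximates this layout by
    //shuffling a few percent of the nodes toward their ideal slots per frame.
    public ArrayNode[] BuildDepthFirstLayout(int rootIndex)
    {
        var optimized = new ArrayNode[Nodes.Length];
        int count = 0;
        Copy(rootIndex, -1, optimized, ref count);
        return optimized;
    }

    int Copy(int sourceIndex, int newParentIndex, ArrayNode[] optimized, ref int count)
    {
        int newIndex = count++; //Reserve this slot; children land right after it.
        var node = Nodes[sourceIndex];
        node.Parent = newParentIndex;
        if (node.ChildA >= 0)
            node.ChildA = Copy(node.ChildA, newIndex, optimized, ref count);
        if (node.ChildB >= 0)
            node.ChildB = Copy(node.ChildB, newIndex, optimized, ref count);
        optimized[newIndex] = node;
        return newIndex;
    }
}
```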

Cache optimization can double performance all by itself, so it's one of the most important improvements.

As for the self test phase that comes after refinement, it's pretty much identical to the old version in concept. It's just made vastly faster by a superior node memory layout, cache friendliness, greater attention to tiny bits, and no virtual calls. 

Interestingly, SIMD isn't a huge part of the speedup. It's used here and there (mainly refit), but not to its full potential. The self test in particular, despite being the dominant cost, doesn't use SIMD at all. 

Future work

1) Solving the refinement scaling issue for 'easy' simulations would be nice.

2) SIMD is a big potential area for improvement. As mentioned, this tree is mostly scalar in nature. At best, refit gets decent use of 3-wide operations. My attempts at creating fully vectorized variants tended to do significantly better than the old one, but they incurred too much overhead in many phases and couldn't beat the mostly scalar new version. I'll probably fiddle with it some more when a few more SIMD instructions are exposed, like shuffles; it should be possible to get at least another 1.5 to 2 times speedup.

3) Refinement currently does some unnecessary work on all the non-root treelets. They actually use the same sort of priority queue selection, even though they are guaranteed to eat the whole subtree by the refinement candidate collection threshold. Further, it should be possible to improve the node collection within refinement by taking into account the change in subtree volume on a per-node level. The root refinement would seek out high entropy parts of the tree. Some early testing implied this would help, but I removed it due to memory layout conflicts. 

4) I suspect there are some other good options for the choice of refinement algorithm. I already briefly tried agglomerative and sweep refiners (which were too slow relative to their quality advantage), but I didn't get around to trying things like brute forcing small treelet optimization (something like "Fast Parallel Construction of High-Quality Bounding Volume Hierarchies"). I might revisit this when setting up the systems of the next point.

5) It should be possible to improve the cache optimization distribution. Right now, the multithreaded version is forced into a suboptimal optimization order and suffers from overhead introduced by lots of atomic operations. Some experiments with linking cache optimization to the subtrees being refined showed promise. It converged with little effort, but it couldn't handle the scrambling effect of root refinement. I think this is solvable, maybe in combination with #4.

6) Most importantly, all of the above assumes a bunch of dynamic leaves. Most simulations have tons of static or inactive objects. The benchmarks show that the new tree doesn't do a bad job on these by any means, but imagine all the leaves were static meshes. There's no point in being aggressive with refinements or cache optimizations because nothing is moving or changing, and there's no need for any collision self testing if static-static collisions don't matter.

This is important because the number of static objects can be vastly larger than the number of dynamic objects. A scene big enough to have 5000 active dynamic objects might have hundreds of thousands of static/inactive objects. The old broad phase would just choke and die completely, requiring extra work to use a StaticGroup or something (which still wouldn't provide optimal performance for statics, and does nothing for inactive dynamics). In contrast, a new broad phase that has a dedicated static/inactive tree could very likely handle it with very little overhead.

When I have mentioned big planned broad phase speedups in the past ("over 10 times on some scenes"), this is primarily what I was referring to. The 4 times speedup of the core rewrite was just gravy.

Now what?

If you're feeling adventurous, you can grab the tree inside of the new scratchpad repository on GitHub. Beware, it's extremely messy and not really packaged in any way. There are thousands of lines of dead code and diagnostics, a few dependencies are directly referenced .dlls rather than nice NuGet packages, and there's no documentation. The project also contains some of the vectorized trees (with far fewer features) and some early vectorized solver prototyping. Everything but the Trees/SingleArray tree variant is fairly useless, but it might be interesting to someone.

In the future, the scratchpad repo will be where I dump incomplete code scribblings, mostly related to BEPUphysics.

I'm switching developmental gears to some graphics stuff that will use the new tree. It will likely get cleaned up over time and turned into a more usable form over the next few months. A proper BEPUphysics v2.0.0 repository will probably get created sometime in H1 2016, though it will remain incomplete for a while after that.

BEPUphysics in a CoreCLR World

A lot of exciting stuff has happened in the .NET world over the last year, and BEPUphysics is approaching some massive breaking changes. It seems like a good time to condense the plans in one spot.

First, expect v1.4.0 to get packaged up as a stable release in the next couple of months. At this time, I expect that v1.4.0 will likely be the last version designed with XNA platform compatibility in mind.

Following what seems to be every other open source project in existence, BEPUphysics will probably be moving to GitHub after v1.4.0 is released.

Now for the fun stuff:


BEPUphysics v2.0.0

High Level Overview:

Performance drives almost everything in v2.0.0. Expect major revisions; many areas will undergo total rewrites. Applications may require significant changes to adapt. The revisions follow the spirit of the DX11/OpenGL to DX12/Vulkan shift. The engine will focus on providing the highest possible performance with a minimal API.

Expect the lowest level engine primitives like Entity to become much 'dumber', behaving more like simple opaque data blobs instead of a web of references, interfaces, and callbacks. The lowest layer will likely assume the user knows what they're doing. For example, expect a fundamental field like LinearVelocity to be exposed directly and without any automatic activation logic. "Safe" layers that limit access and provide validation may be built above this to give new users fewer ways to break everything.

Features designed for convenience will be implemented at a higher level explicitly separated from the core simulation or the responsibility will be punted to the user.

Some likely victims of this redesign include:
-Internal timestepping. There is really nothing special about internal timestepping- it's just one possible (and very simple) implementation of fixed timesteps that could, and probably should, be implemented externally.
-Space-resident state buffers and state interpolation. Users who need these things (for asynchronous updates or internal timestepping) have to opt in anyway, and there's no reason to have them baked into the engine core.
-All deferred collision events, and many immediate collision events. The important degrees of access will be retained to enable such things to be implemented externally, but the engine will do far less.
-'Prefab' entity types like Box, Sphere, and so on are redundant and only exist for legacy reasons. Related complicated inheritance hierarchies and generics to expose typed fields in collidables will also likely go away.
-'Fat' collision filtering. Some games can get by with no filtering, or just bitfields. The engine and API shouldn't be hauling around a bunch of pointless dictionaries for such use cases.
And more. 

Platform Support:

Expect older platforms like Xbox360 and WP7 to be abandoned. The primary target will be .NET Core. RyuJIT and the new SIMD-accelerated numeric types will be assumed. Given the new thriving open source initiative, I think this is a safe bet.

Going forward, expect the engine to adopt the latest language versions and platform updates more rapidly. The latest version of VS Community edition will be assumed. Backwards compatibility will be limited to snapshots, similar to how v1.4.0 will be a snapshot for the XNA-era platforms.

Areas of Focus:

1) Optimizing large simulations with many inactive or static objects

In v1.4.0 and before, a common recommendation is to avoid broadphase pollution. Every static object added to the Space is one more object to be dynamically handled by the broad phase. To mitigate this issue, bundling many objects into parent objects like StaticGroups is recommended. However, StaticGroups require explicit effort, lack dynamic flexibility, and are not as efficient as they could be.

Inactive objects are also a form of broadphase pollution, but unlike static objects, they cannot be bundled into StaticGroups. Further, these inactive objects pollute most of the other stages. In some cases, the Solver may end up spending vastly more time testing activity states than actually solving anything.

Often, games with these sorts of simulations end up implementing some form of entity tracking to remove objects outside of player attention for performance reasons. While it works in many cases, it would be better to not have to do it at all.

Two large changes are required to address these problems:
-The BroadPhase will be aware of the properties of static and inactive objects. In the normal case, additional static or inactive objects will incur almost no overhead. (In other words, expect slightly less overhead than the StaticGroup incurs, while supporting inactive dynamic objects.)
-Deactivation will be redesigned. Persistent tracking of constraint graphs will be dropped in favor of incremental analysis of the active set, substantially reducing deactivation maintenance overhead. Stages will only consider the active set, rather than enumerating over all objects and checking activity after the fact.

On the type of simulations hamstrung by the current implementation, these changes could improve performance hugely. In extreme cases, a 10x speedup without considering the other implementation improvements or SIMD should be possible.

2) Wide parallel scaling for large server-style workloads

While the engine scales reasonably well up to around 4 to 6 physical cores, there remain sequential bottlenecks and lock-prone bits of code. The NarrowPhase's tracking of obsolete collision pairs is the worst sequential offender. More speculatively, the Solver's locking may be removed in favor of a batching model if some other changes pan out.

The end goal is decent scaling on 16-64 physical cores for large simulations, though fully achieving this will likely require some time.

3) SIMD

With RyuJIT's support for SIMD types comes an opportunity for some transformative performance improvements. However, the current implementation would not benefit significantly from simply swapping out the BEPUutilities types for the new accelerated types. Similarly, future offline optimizing/autovectorizing compilers don't have much to work with under the current design. As it is, these no-effort approaches would probably end up providing an incremental improvement of 10-50% depending on the simulation.

To achieve big throughput improvements, the engine needs cleaner data flow, and that means a big redesign. The solver is the most obvious example. Expect constraints to undergo unification and a shift in data layout. The Entity object's data layout will likely be affected by these changes. The BroadPhase will also benefit, though how much is still unclear since the broad phase is headed for a ground up rewrite.

The NarrowPhase is going to be the most difficult area to adapt; there are a lot of different collision detection routines with very complicated state. There aren't as many opportunities for unification, so it's going to be a long case-by-case struggle to extract as much performance as possible. The most common few collision types will most likely receive in-depth treatment, and the remainder will be addressed as required.

Miscellaneous Changes:

-The demos application will move off of XNA, eliminating the need for an XNA Game Studio install. The drawer will be rewritten, and will get a bit more efficient. Expect the new drawer to use DX11 (feature level 11_0) through SharpDX. Alternate rendering backends for OpenGL (or hopefully Vulkan, should platform and driver support be promising at the time) may be added later for use in cross platform debugging.

-As alluded to previously, expect a new broad phase with a much smoother (and generally lower) runtime profile. It focuses on incremental refinement; the final tree quality may actually end up higher than that of the current 'offline' hierarchies offered by BEPUphysics.

-StaticGroup will likely disappear in favor of the BroadPhase just handling it automatically, but the non-BroadPhase hierarchies used by other types like the StaticMesh should still get upgraded to at least match the BroadPhase's quality.

-Collision pair handlers are a case study in inheritance hell. Expect something to happen here, but I'm not yet sure what.

-Wider use of more GC-friendly data structures like the QuickList/QuickSet to avoid garbage and heap complexity.

-Convex casts should use a proper swept test against the broad phase acceleration structure. Should make long unaligned casts much faster.

-More continuous collision detection options. Motion clamping CCD is not great for all situations- particularly systems of lots of dynamic objects, like passengers on a plane or spaceship. The existing speculative contacts implementation helps a little to stabilize things, but its powers are limited. Granting extra power to speculative contacts while limiting ghost collisions would be beneficial.

-The CompoundShape could use some better flexibility. The CompoundHelper is testament to how difficult it can be to do some things efficiently with it.

Schedule Goals:

Variable. Timetable depends heavily on what else is going on in development. Be very suspicious of all of these targets.

Expect the earliest changes to start showing up right after v1.4.0 is released. The first changes will likely be related to the debug drawer rewrite.

The next chunk may be CCD/collision pair improvements and the deactivation/broadphase revamp for large simulations. The order of these things is uncertain at this time because there may turn out to be some architectural dependencies. This work will probably cover late spring to mid summer 2015.

Early attempts at parallelization improvements will probably show up next. Probably later in summer 2015.

SIMD work will likely begin at some time in late summer 2015. It may take a few months to adapt the Solver and BroadPhase.

The remaining miscellaneous changes, like gradual improvements to collision detection routines, will occur over the following months and into 2016. I believe all the big changes should be done by some time in spring 2016.

This work won't be contiguous; I'll be hopping around to other projects throughout.

Future Wishlist:

-The ancient FluidVolume, though slightly less gross than it once was, is still very gross. It would be nice to fix it once and for all. This would likely involve some generalizations to nonplanar water- most likely procedural surfaces that would be helpful in efficiently modeling waves, but maybe to simple dynamic heightfields if the jump is short enough.

-Fracture simulation. This has been on the list for a very long time, but there is still a chance it will come up. It probably won't do anything fancy like runtime carving or voronoi shattering. More likely, it will act on some future improved version of CompoundShapes, providing different kinds of simple stress simulation that respond to collisions and environmental effects to choose which parts get fractured. (This isn't a very complicated feature, and as mentioned elsewhere on the forum, I actually implemented something like it once before in a spaceship game prototype- it just wasn't quite as efficient or as clean as a proper release would require.)

On GPU Physics:

In the past, I've included various kinds of GPU acceleration on the development wishlist. Now, however, I do not expect to release any GPU-accelerated rigid body physics systems in the foreseeable future; BEPUphysics itself will stay exclusively on the CPU.

I've revisited the question of GPU accelerated physics a few times over the last few years, including a few prototypes. However, GPU physics in games is still primarily in the realm of decoration. It's not impossible to use for game logic, but having all of the information directly accessible in main memory with no latency is just a lot easier. 

And implementing individually complicated objects like the CharacterController would be even more painful in the coherence-demanding world of GPUs. (I would not be surprised if a GPU version of a bunch of full-featured CharacterControllers actually ran slower due to the architectural mismatch.) There might be a hybrid approach somewhere in here, but the extra complexity is not attractive.

And CPUs can give pretty-darn-decent performance. BEPUphysics is already remarkably quick for how poorly it uses the capabilities of a modern CPU.

And our own game is not a great fit for GPU simulation, so we have no strong internal reason to pursue it. Everything interacts heavily with game logic, there are no deformable objects, there are no fluids, any cloth is well within the abilities of CPU physics, and the clients' GPUs are going to be busy making pretty pictures.

This all makes implementing runtime GPU simulation a bit of a hard sell.

That said, there's a small chance that I'll end up working on other types of GPU accelerated simulation. For example, one of the GPU prototypes was a content-time tool to simulate flesh and bone in a character to automatically generate vertex-bone weights and pose-specific morph targets. We ended up going another direction in the end, but it's conceivable that other forms of tooling (like BEPUik) could end up coming out of continued development.


Have some input? Concerned about future platform support? Want to discuss the upcoming changes? Post on the forum thread this was mirrored from, or just throw tweets at me.

BEPUphysics v1.3.0 released!

Grab the new version! Check out the new stuff!

With this new version comes some changes to the forks. I've dropped the SlimDX and SharpDX forks in favor of focusing on the dependency free main fork.

The XNA fork will stick around for now, but the XNA fork's library will use BEPUutilities math instead of XNA math. Going forward, the fork will be used to maintain project configuration and conditional compilation requirements for XNA platforms.

Pointers!

I recently posted a couple of suggestions on UserVoice that would save me from some future maintenance nightmares and improve the performance and memory usage of BEPUphysics in the long term.

First, it would be really nice to have wide support for unsafe code so that I wouldn't have to maintain a bunch of separate code paths. One of the primary targets of BEPUphysics, Windows Phone, does not currently support it. It would be nice if it did! If you'd like to help, vote and share this suggestion:
http://wpdev.uservoice.com/forums/110705-dev-platform/suggestions/4715894-allow-unsafe-code

Second, generic pointers in C# would make dealing with low level memory management more modular and much less painful. I could do all sorts of fun optimizations in BEPUphysics without totally forking the codebase! Here's the related suggestion:
http://visualstudio.uservoice.com/forums/121579-visual-studio/suggestions/4716089-unmanaged-generic-type-constraint-generic-pointe

And, as always, if you want to see the above notifications and others in a slightly less readable format, there's my twitter account.

Hey, where'd the dependency free version/XNA version go?

I've moved the Dependency Free fork into the main branch. Primary development from here on out will be on this dependency free version. Other forks will have changes merged in periodically (at least as often as major version releases).

If you're looking for the XNA version of BEPUphysics, head to the newly created XNA fork.

I still haven't gotten around to MonoGame-izing the BEPUphysicsDemos and BEPUphysicsDrawer, though, so they still rely on XNA.

BEPUphysics and XNA

As some of you may have noticed, there's been a recent uptick in the discussion of XNA's slow fading. As the main branch of BEPUphysics is still based on XNA, I think it would be appropriate to discuss the path of BEPUphysics.

For those of you who are not aware, BEPUphysics has long had multiple forks based on different libraries for easy math interoperation. The official forks include SlimDX, SharpDX, and the dependency free fork (and of course the main development fork, which is currently XNA). The dependency free fork has its own math systems and does not depend on anything beyond .NET or Mono's libraries.

As mentioned/hidden away in the version roadmap, one of the next two major packaged releases of BEPUphysics will move over to the dependency free fork. Already, you can see progress in this direction; the dependency free math utilities have been reorganized and expanded. Internally, we are already using the dependency free fork utilities for our projects.

Expect the swap to occur in the next six months as my procrastination is overridden by 1) internal development and 2) the fact that I haven't released a proper packaged version since May 2012. (The latter of which has led to some confusion about whether development has halted- development continues and will continue!)

The most likely target frameworks for the rewritten demos will be either SharpDX or, for wider use, MonoGame.

There have also been some discussions about WinRT versions of BEPUphysics. WinRT is not a targeted platform for our internal projects, so it's a lower priority. However, there are a variety of little annoyances and API changes (particularly with threading) which make it nontrivial for people to pull it into WinRT projects; it would be nice to solve this in one spot so there isn't a bunch of redundant work being done. I will likely get around to it eventually- assuming someone else doesn't maintain a fork for it :) Anyone? :) :)   :)   :) :) :)      :) :) :)    <:)


Don't forget to follow me on twitter so you can see a notification on twitter about this blog post!

BEPUik for Blender now available!

An early version of BEPUik, the full body inverse kinematics add-on for Blender, is now available.

Check out the announcement thread over on Blender Artists for more.

BEPUik started out as an experimental addition to BEPUphysics (currently available in the development fork):

This system was then ported and integrated with Blender. The result:

Finally and least importantly, I have a twitter account which has more than zero tweets on it now. Squashwell has one too.

BEPUphysics and hyperthreading

BEPUphysics v1.2.0 includes some improvements to multithreaded scaling, particularly in the DynamicHierarchy broad phase with certain core counts. Given that I recently/finally upgraded my CPU, it seemed appropriate to investigate scaling in the presence of hyperthreading.

3770K

[Before continuing, note that all multithreading comparisons in this article are slightly biased towards the single threaded case. Single threaded cases can bypass the overhead of dispatching work entirely, so the jump from one to two threads isn't entirely apples to apples...

...And I may have forgotten to turn off turbo boost in certain tests using the 3770K; oops. Fortunately, the base clock is 4.5 GHz and the boosted frequency is 4.6 GHz, so the difference should be no more than 2% or so on the 1 core and sometimes 2 core cases.]

Given the improvements, it seems natural to start with the broad phase. This test uniformly distributes thousands of boxes in a volume and then runs broad phase updates many thousands of times for different thread counts. This is running on the quad core hyperthreaded 3770K.

That's almost five times faster on a processor with four physical cores. A decent improvement over the old results in v0.16.0 on my Q6600!

After four threads, the gains slow down as expected. However, using all 8 available threads still manages to be 41% faster than 4 threads.

How about running the same test on the 3770K with hyperthreading disabled?

It's roughly equivalent to the first four hyperthreaded results, within the error imposed by differing process environments. That's good; the thread scheduling is handled effectively by Windows 7 when hyperthreading is enabled. The final non-hyperthreading speedup is around 3.5 times. Not bad for 4 threads, but not as good as hyperthreading.

Now for some full simulations! The following test measures a few hundred/thousand time steps of three different configurations. The full time for each simulation in milliseconds is recorded.

The first simulation is 5000 boxes falling out of the sky onto a pile of objects. This starts out primarily stressing the broad phase before the pile-up begins.  Then, as the pile grows and thousands of collision pairs are created, it stresses some bookkeeping code's potential sequential bottlenecks, the narrow phase, and the solver. It runs for 700 time steps.

The second simulation is a big wall built of 4000 boxes. This mostly stresses the solver, with narrow phase second and broad phase third. It runs for 800 time steps.

The final simulation is 15625 boxes orbiting a planet (like the PlanetDemo) for 3000 time steps.

All of the simulations can run in real time, but the tests simulate the time steps as fast as they can calculate without any frame delays. 

First, with hyperthreading: 


Somewhat surprisingly, they all scale similarly. I was expecting the solver-heavy simulations to suffer a bit due to the added contention. Something is stopping the scaling from doing quite as well as the broad phase alone, though; the scaling ranges from 3.5 (Planet) to 3.8 (Pile).  The Planet scaling being lower is quite interesting since a large portion of that simulation should be the high-scaling broad phase.

Without hyperthreading:

The scalings range from 2.91 to 3.08 times faster at 4 threads. That's about the same as the 4 threads in hyperthreading. Once again, the Pile has the best scaling and the Planet the worst scaling for currently unknown reasons.

The consistency in scaling between different simulations is promising; it is evidence that there aren't any huge potholes in simulation type to watch out for when going for maximum performance.

Xbox360

We'll start with DynamicHierarchy tests again for the Xbox360. If I remember correctly, these results were on a smaller set of objects than was run on the 3770K (since a 3770K core is far faster than an Xbox360 core).

Not quite as impressive as the 3770K's scaling, but it still gets to around twice as fast. The important thing to note here is the speed boost offered by the usage of that final hardware thread. (The Xbox360 only has 3 physical cores with 6 hardware threads distributed between them. Two hardware threads, one on each of the first two physical cores, are reserved and cannot be used from within XNA projects.)

Despite theoretically stressing the load balancer more, loading up that last physical core with both hardware threads appears to be a win.

How about the full simulations? (Note that these were reduced in size a bit because the Xbox360 had a habit of getting too hot and trying to take a nap after thousands of reruns.)

Once again not quite as great as the 3770K's scaling. However, the Wall manages a respectable 2.3 times faster, providing more evidence that the parallel solver scales better than expected.

The Pile reverses its performance on the 3770K and shows a case where throwing as many threads as possible at a problem isn't the best option in all cases. It's within error (taking a little less than 4% longer to complete), but it's obviously not a clear victory. This suggests it's worth testing both 3 and 4 threads to see what behaves better for your simulation.

3930K

To the next platform: a 3930K running at 4.3 GHz! Once again, we'll start with the DynamicHierarchy test. This is the same test that ran on the 3770K.

While we don't see scaling higher than the physical core count this time, it still gets to a decent 5.75 times faster. Interestingly, the single threaded time falls right in line with expectations relative to the 3770K; it takes 7% longer due to the 300 MHz speed difference, and another 5% or so due to the architecture improvements in Ivy Bridge over Sandy Bridge E.

Part of the difficulty here in getting the same kind of usage as 8 cores did on the 3770K may be the binary form of the dynamic hierarchy.  It naturally runs better on systems with core counts that are a power of two. v1.2.0 tries to compensate for this, but it can only help so much before single threaded bottlenecking eliminates any threading gains. Does anyone out there have an 8 core hyperthreaded Xeon or dual processor setup to test this theory? :)

Now for the most interesting test:

What is going on here? The solver-heavy Wall once again is the unexpected bastion of stable and robust scaling, reaching 4.2 times faster. It doesn't benefit from the final two threads, though. The other two simulations seem to have a seizure after 8 threads.

When I first saw these numbers, I assumed there had to be some environmental cause, but I couldn't locate any. Unfortunately, since the 3930K is not mine, I was unable to spend a long time searching. Hopefully I'll get some more testing time soon.

Wrap Up

Try using all the available threads. It seems to help most of the time. If the platform has a huge number of cores, tread carefully, test, and tell me how it looks!

I don't have access to any recent AMD processors, so if any of you Bulldozers/soon-to-be-Piledrivers want to give these tests a shot, it could be informative. The tests used to generate the data in this post are in the BEPUphysicsDemos (MultithreadedScalingTestDemo and BroadPhaseMultithreadingTestDemo).

I still suspect some type of shenanigans on the 3930K simulations test, but the evidence has shown there are some aspects of threading behavior in BEPUphysics that I did not model properly. The solver seems to be a consistently good scaler. The question then becomes, if the DynamicHierarchy and Solver both scale well, what else in the largely embarrassingly-parallel loopfest of the engine is keeping things from scaling higher?

The answer to that will have to wait for more investigation and another post, to be completed hopefully before hexacores, octocores, and beyond are all over the place!

Questions or comments? Head to the forums.

Shuffling some webstuff around

We'll be shuffling some data and domains to a new host over the coming weeks.

If the forum is inaccessible, try accessing it directly at http://bepu.nfshost.com/forum.

If this blog is inaccessible, try accessing it directly at /.

The CodePlex page at bepuphysics.codeplex.com will be free of any major changes (apart from engine updates, of course!).

BEPUphysics v1.0.0: XNA, SlimDX, and everything else!

As always, grab the newest version on CodePlex.

In addition to the regular upgrades, this version comes with updates to the previously available SlimDX fork.

Don't want to use XNA or SlimDX? No problem! There's now a dependency free version as well! The BEPUphysicsDemos project used to test the dependency-free library still relies on XNA, but the library itself is completely independent.

Here's some jazzy boxswarm-related entertainment, filmed in v1.0.0: