GPUs are fast...

August 23, 2019, 2:23 PM

You are one of the four founding members of Toilet Toilers Interactive, a development studio riding a wave of popularity after a successful paid alpha launch.

Punching and Whatnot, your genre-fusing open world exploration game about physically punching things, attracted an early and devoted audience on the promise of its marketed-as-revolutionary technology stack.

Development videos of many thousands of punchings simulated in real time spread virally through the latest and most confusing forms of social media.

Critics hail your game as 'much better than it sounds'. Development costs have largely been recouped.

Your team, a group of technological pioneers in the field of simulated pugnacity, have the attention of the gaming world.

And this is your job.

"It's really not that hard, you just gotta..." you say, carefully manipulating a plate covered in crumbs.

"You're using your thumb," says your cofounder Minji, leaning back in a chair across from you with her arms crossed in skepticism.

"No I'm not," you say, putting the plate back down on the desk.

"Yes you were, I saw it."

"It was touching, but the thumb was barely even acting as a stabilizer."

"A person without thumbs doesn't get to use their thumbs as barely-stabilizers. Do it again, no thumb."

"Fine," you huff, making a show of extending your thumb as far as possible before grabbing the plate between your forefingers. Your phone vibrates, but this task demands your whole focus.

Minji waits.

The plate rises, slowly, carefully... and your phone vibrates again.

Loud obscenities ring through the office, grabbing Minji's attention. You quickly let the plate down, sending a few unstuck mealsquare crumbs onto your desk.

"See, it's easy," you begin, but are interrupted by the sound of rapid British-sounding footsteps approaching the doorway.

"ALL THE SERVERS ARE DOWN!"

 

August 23, 2019, 3:13 PM

You are all quite proud of the infrastructure you've built. Almost everything that could be automated, is. Need to update the servers? Zero downtime rolling automated installs. Crashes or hardware failure? Automatic load balancing fills in the gap as needed. Want to check server status? Fancy graphs on your phone.

All without using any off the shelf technologies. Everything from scratch. (It added 15 months to the project, which, as you frequently tell yourself, isn't really that long and it was definitely probably worth it.)

And, as a testament to just how robust it all is, it took three whole weeks after release to hit your first catastrophic failure and multi-hour outage.

"The servers are up on the previous deployment. Backups were untouched, all data retained," Minji says from across the conference table.

You smile smugly, nodding to yourself. The backup system had been your baby. To your right, Geir (whose name is pronounced something like Guy, but who asserts that you are still saying it wrong, but that's his fault for having such a British name) notices your self-satisfaction and gives you a look which you take as affirmation of your work's quality.

"Nominal levels of forum angst. I don't think we'll have to do more than offer a brief explanation," adds Kazimiera from your left, who had settled into a sort-of-community-manager role in addition to the rest of her regular developmental responsibilities.

"So, unless someone else snuck something into that last image, the only change was a new graphics driver. There's a new WHQL release that apparently fixed some of the tiled resource slowdowns we had to work around. I put it on the test servers and it came back all green," Geir said.

You furrow your brow. Kazimiera types furiously on her laptop.

"I thought we fixed that -" you begin, vague memories mixing with rising unease in your stomach.

"All test servers are dead," says Kazimiera.

"We didn't fix that."

You lean back in your chair.

"So, it's possible that we may have forgotten to address a small issue wherein the tests can sometimes, under complex conditions involving driver freezes, time out and return success, instead of not success," you explain to Geir, who had missed the fun. He squints at you.

"How does the tech support forum look?" asks Minji.

Kazimiera clicks around.

"A couple of 'server down' posts... Oh, someone is actually playing on integrated, and it's working! Except all skinned characters have little flickering black dots at every vertex."

"As it should be," says Minji, nodding.

"And about four 'help my game is completely broken' in the last few days, which is two or three more than usual."

"It went WHQL a few days ago," offers Geir.

Everyone sits quietly for a moment, staring into space.

"Alright," starts Minji, slapping the table. "It looks like we need some proactive action, I'm gonna be the scrummer. The scrum lord. It's time to get agile."

"But are you a certified scrum lord?" you ask.

"Kazi, kindly draw us a burndown chart," she says, ignoring you.

Kazimiera gets up and walks toward the whiteboard.

"British graphics expert," she says, pointing at the sighing Geir. "The surface space allocator uses sparse textures right? Would you confirm that everything is broken on the client?"

Geir nods.

Kazimiera is finishing up the labels on her chart. A line starts in the upper left and heads to the lower right, carefully dodging an older drawing of a cat. The line spikes upward to a point above the starting height a few times. The horizontal axis is labelled TIME, and the vertical axis CATASTROPHIC BUGS.

"Thank you, Kazi. I feel adequately scrummed. I'd like your help actually-fixing the test harness. And we should probably post some warnings."

Kazimiera caps her marker and returns to her seat.

Minji turns to you.

"There aren't a lot of GPU jobs on the backend, and you happen to have written all of them," she says.

You make a small noise of acknowledgement tinged by indignity.

 

August 23, 2019, 5:56 PM

The backend physics was, in fact, not working on the new driver version. Your workstation's card, a slightly older generation with an architecture named after a different dead scientist, exhibits the same problem. The reference device works fine.

With some effort and bisection, you've gotten to the point where things do not instantly crash. On the downside, the only active stage remaining is a single piece of the broad phase pair flush. And, given dummy input, it's producing nonsense results.

You launch the program again with a VS capture running, and it immediately crashes.

Pursing your lips, you start up NTRIGUE, a vendor created tool that was recently rebranded for unclear reasons. It looks like it's working, but starting a frame capture immediately crashes.

You take a deep breath and exhale slowly.

 

August 26, 2019, 1:52 PM

"As scrum lord, I hereby convene this scrummoot. Geir, care to begin?"

"Lots of things are busted. For example, with anisotropic filtering x8 or higher enabled, while indexing descriptor arrays in the pixel shader, in the presence of sufficiently steep slopes at over one hundred kilometers from the camera, the device hangs. I am not yet sure how to narrow down the repro further. Also I'm not sure if tiled resources should be considered to exist anymore."

"Sounds good. And the physics?" Minji asks, looking at you.

"I fixed a bug!"

"Care to elaborate?"

"Mm, no."

Kazimeria clicks a few times. "Group memory barriers added to a parallel reduction. Something something shouldn't be necessary, something something memory writes in thread groups below hardware threshold should proceed in order, but added them anyway."

Geir gives you a look.

"Does everything work now?' asks Minji.

"Mmmmm, no."

Kazimiera clicks a few more times. "Only one dispatch is actually completing without hanging the device," she explains. "Your commit descriptions are admirably thorough. Except this one of an ascii butt."

"On the upside, we should have the test suite fixed today or tomorrow," says Minji.

"Content pipeline's fine, by the way. Auto-animation net is still working flawlessly," adds Kazimiera.

"The Demon Prince of Drivers is patient," you warn.

 

August 27, 2019, 2:40 PM

You slump into your chair. Geir, arriving sooner, has already slid far enough down that only his shoulders and head are visible above the conference table.

"Praise be to scrum," chants Minji.

"My turn to start," you say. "FFFfffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff," continuing until you run out of breath.

"How many points is that so I can mark it in my scrumbook?"

"...Seven...teen."

"I 'fixed' the terrain thing by detecting the vendor hardware and clamping maximum render distance. I submitted a bug report, but the repro sucks. Also, there's still some other bug where root sampler state gets corrupted- previous data gets reused, stuff explodes." says Geir.

"Good work, I award you one thousand points."

"That's no fair!" you whine.

"Well, have you submitted any bug reports yet?"

"Do you think they would fix it if I just submit a bug report that is the sound of frustrated yelling?" you ask, sliding down your chair a little.

"That may be a significant fraction of the total reports, you should put in some more effort so it will get noticed. Maybe try yelling louder," offers Kazimiera.

"That's a promising idea," you say, rubbing your chin. "Alright, well, under some conditions I have not yet narrowed down, UAV barriers seem to do nothing, and the physics is a series of compute dispatches, so, hooray. For clients, I may be able to work around it by forcing some sort of useless transition or other stall. But it will be a pretty big performance hit and there's not enough headroom for it to be used on the backend, so we can't update the drivers there."

"Interesting. Four more agile points for you. As for me, I've been tracking down some of the non-crashy problems on the client. It looks like all indirect executions now cost a minimum of about max command count times one or two microseconds, regardless of the actual count in the count buffer. I should be able to have that reported today. For this, I earn many points," says Minji.

"And I've just finished a little optimization on the proxy server that cuts minimum latency by half a millisecond," says Kazimiera.

"How does it feel, up in your ivory tower? Frolicking lightly above the clouds, free to make forward progress?" you ask.

Kazimiera smugs.

 

August 28, 2019, 4:33 PM

"So, I worked around the missing barriers at a slowdown of about four hundred percent," you explain. "To mitigate this slowdown, I propose spending three hundred and fifty thousand dollars on new hardware instead of continuing to work on it because I do not want to work on it."

"Any response from the vendor?" asks Minji.

Kazimiera hits F5. "No."

"Cool, cool, cool, cool, cool, cool, cool, cool, cool," says Minji, tapping a pen against the conference table to match. She sinks sideways, propping herself up with one elbow. "Lord British," she says, pointing the pen at Geir without turning, "tell me what you got."

"Sometimes I wonder if it was a good idea to leave Norway."

"Oh, my aunt went there once! Pretty long drive out, I heard. That's like a hundred miles northeast of London, isn't it?" asks Minji earnestly.

Geir shook his head, badly suppressing a smile.

"You know, we're semifamous now," starts Geir. "And a whole lot of their customers are immersed in our completely broken mess. They will probably be a little more motivated to work with-"

"Um," interrupts Kazimiera.

Everyone looks over.

"Windows 10 Creative Pool Summer Hangout update is rolling out. Staged, but should expect a lot of users to be on it within a week or two. It apparently changes some bits of the graphics stack."

Geir's eyes widen.

"It autoupdates to the latest WHQL driver."

 

This story is fictional. The bugs are not :(