Forum

drawspans optimizations in c?

Discuss programming topics for the various GPL'd game engine sources.

Moderator: InsideQC Admins

drawspans optimizations in c?

Postby qbism » Tue Oct 26, 2010 5:05 pm

Just wondered if anyone has tried improving performance of drawspans in c code (D_DrawSpans8). I'm going to try D_DrawSpans16, looks easy to do.

Profiling WinQuake w/ gfprof after compling with mingw, d_drawspans comes in at about 40% of CPU time. This is using the c version, not asm. Looking at several source bases (bjp's WinQuake, ToChris,.... ) the function is generally untouched.
User avatar
qbism
 
Posts: 1236
Joined: Thu Nov 04, 2004 5:51 am

Postby leileilol » Tue Oct 26, 2010 5:53 pm

I tried earlier tonight. No luck, and things weren't even faster when I removed all of the stepping.

Isn't DrawSpans16 in assembler?
i should not be here
leileilol
 
Posts: 2783
Joined: Fri Oct 15, 2004 3:23 am

Postby mh » Tue Oct 26, 2010 6:51 pm

This gets maybe 5% extra from D_DrawSpans8:
Code: Select all
         pdest += spancount;

         switch (spancount)
         {
         case 8: pdest[-8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 7: pdest[-7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 6: pdest[-6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 5: pdest[-5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 4: pdest[-4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 3: pdest[-3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 2: pdest[-2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 1: pdest[-1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         }

Replace the innermost do...while with it. ;)

Generally with software Quake I find that array indexing is faster than "*variable++" so if you can unroll your loops this way it can go much faster than ID's code - plenty of scope for that all over the place. You could probably precalculate a lot of the pbase indexing here too.

(Edited: indexes should start at -8, not -7).
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
User avatar
mh
 
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Postby qbism » Tue Oct 26, 2010 11:11 pm

DrawSpan16 was only in asm and only for i386. Here's c version, seems to work
Code: Select all
/*
=============
D_DrawSpans16
=============
*/
void D_DrawSpans16 (espan_t *pspan) //qbism- up it from 8 to 16
{
   int            count, spancount;
   unsigned char   *pbase, *pdest;
   fixed16_t      s, t, snext, tnext, sstep, tstep;
   float         sdivz, tdivz, zi, z, du, dv, spancountminus1;
   float         sdivzstepu, tdivzstepu, zistepu;

   sstep = 0;   // keep compiler happy
   tstep = 0;   // ditto

   pbase = (unsigned char *)cacheblock;

   sdivzstepu = d_sdivzstepu * 16;
   tdivzstepu = d_tdivzstepu * 16;
   zistepu = d_zistepu * 16;

   do
   {
      pdest = (unsigned char *)((byte *)d_viewbuffer +
            (screenwidth * pspan->v) + pspan->u);

      count = pspan->count;

   // calculate the initial s/z, t/z, 1/z, s, and t and clamp
      du = (float)pspan->u;
      dv = (float)pspan->v;

      sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
      tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
      zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
      z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

      s = (int)(sdivz * z) + sadjust;
      if (s > bbextents)
         s = bbextents;
      else if (s < 0)
         s = 0;

      t = (int)(tdivz * z) + tadjust;
      if (t > bbextentt)
         t = bbextentt;
      else if (t < 0)
         t = 0;

      do
      {
      // calculate s and t at the far end of the span
         if (count >= 16)
            spancount = 16;
         else
            spancount = count;

         count -= spancount;

         if (count)
         {
         // calculate s/z, t/z, zi->fixed s and t at far end of span,
         // calculate s and t steps across span by shifting
            sdivz += sdivzstepu;
            tdivz += tdivzstepu;
            zi += zistepu;
            z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

            snext = (int)(sdivz * z) + sadjust;
            if (snext > bbextents)
               snext = bbextents;
            else if (snext <= 16)
               snext = 16;   // prevent round-off error on <0 steps from
                        //  from causing overstepping & running off the
                        //  edge of the texture

            tnext = (int)(tdivz * z) + tadjust;
            if (tnext > bbextentt)
               tnext = bbextentt;
            else if (tnext < 16)
               tnext = 16;   // guard against round-off error on <0 steps

            sstep = (snext - s) >> 4;
            tstep = (tnext - t) >> 4;
         }
         else
         {
         // calculate s/z, t/z, zi->fixed s and t at last pixel in span (so
         // can't step off polygon), clamp, calculate s and t steps across
         // span by division, biasing steps low so we don't run off the
         // texture
            spancountminus1 = (float)(spancount - 1);
            sdivz += d_sdivzstepu * spancountminus1;
            tdivz += d_tdivzstepu * spancountminus1;
            zi += d_zistepu * spancountminus1;
            z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
            snext = (int)(sdivz * z) + sadjust;
            if (snext > bbextents)
               snext = bbextents;
            else if (snext < 16)
               snext = 16;   // prevent round-off error on <0 steps from
                        //  from causing overstepping & running off the
                        //  edge of the texture

            tnext = (int)(tdivz * z) + tadjust;
            if (tnext > bbextentt)
               tnext = bbextentt;
            else if (tnext < 16)
               tnext = 16;   // guard against round-off error on <0 steps

            if (spancount > 1)
            {
               sstep = (snext - s) / (spancount - 1);
               tstep = (tnext - t) / (spancount - 1);
            }
         }

         do
         {
            *pdest++ = *(pbase + (s >> 16) + (t >> 16) * cachewidth);
            s += sstep;
            t += tstep;
         } while (--spancount > 0);

         s = snext;
         t = tnext;

      } while (count > 0);

   } while ((pspan = pspan->pnext) != NULL);
}
I haven't tested performance. Will try mh unroll next!
User avatar
qbism
 
Posts: 1236
Joined: Thu Nov 04, 2004 5:51 am

Postby Spike » Tue Oct 26, 2010 11:36 pm

yay duff's device!
Spike
 
Posts: 2892
Joined: Fri Nov 05, 2004 3:12 am
Location: UK

Postby mh » Wed Oct 27, 2010 12:02 am

Spike wrote:yay duff's device!

The old ones never die. :lol:

I'm especially proud of the negative array indexes.

Edit: goes maybe 10% faster with 16.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
User avatar
mh
 
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Postby qbism » Wed Oct 27, 2010 12:43 am

Negative indices? Duff? Google...

I didn't grasp that it was "climbing down" until after commenting out //s += sstep; t += tstep;

Performance: In a roughly ideal test*, FPS increased from 106.5 to 112 going from span8 to span16. FPS increased again to 118 after unrolling span16. So that unroll is netting 5%+ in the test case. Combined effect is about 10%

*Facing the blank wall at DM6 dead-end. In contrast, facing complex geometry dilutes the FPS gain from this optimization. Server was throttled at 72 fps, and there wasn't much for the server to do.
User avatar
qbism
 
Posts: 1236
Joined: Thu Nov 04, 2004 5:51 am

Postby qbism » Wed Oct 27, 2010 12:48 am

Just noticed mh's edit, "10% faster". Corraboration! :D
User avatar
qbism
 
Posts: 1236
Joined: Thu Nov 04, 2004 5:51 am

Postby leileilol » Wed Oct 27, 2010 1:51 am

486 benchmarks!

DISCLAIMER: i'm not using the cleanest quake source in the world.

timedemo demo1, 320x200, no sound, pure DOS

id386 asm implementation:
span8: 16.0FPS
span16: 17.0FPS

C implementation:
span8: 11.5fps
span16 qbism: 12.4FPS
span16 qbism with mh's unrolls: 12.8FPS

Totally off-topic desperation:
span16 qbism with mh's unrolls and d_mipcap 4: 14.6FPS
span16 qbism with mh's unrolls and d_mipcap 4 with that awesome r_lowdetail feature in my awesome engine: 21.4FPS


I made mh's span stuff a seperate function turned on by the 2 value for comparison.

also off-topic - i need to try hexen2's span asm code to see if rjmonroe's optimisations are bull (EDIT: all that trouble to inline some NASM leads to no speed gain!)

more offtopic - I tried making a c drawspans32 and 64 out of it and it turned out crappy looking (to do with the cacheblock i think) and not even much faster
i should not be here
leileilol
 
Posts: 2783
Joined: Fri Oct 15, 2004 3:23 am

Postby qbism » Wed Oct 27, 2010 6:23 am

17fps vs 12.8fps: That's a lot of FPS still on the table. I wonder if any typical compiler settings could be screwing up the unroll benefit? Possible that unoptimized could be faster in this loop. Or maybe related to 486 instruction set?
User avatar
qbism
 
Posts: 1236
Joined: Thu Nov 04, 2004 5:51 am

Postby mh » Wed Oct 27, 2010 7:39 am

I think it's just the 486; I've run Quake on one of those before and it's a miracle to be getting FPS above single digits.

My benchmarks came from a timedemo demo1 by the way.

It's not really Duff's either as it doesn't have a do...while embedded in the switch, but then again since we already know there's an absolute maximum on spancount it doesn't need to.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
User avatar
mh
 
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Postby frag.machine » Wed Oct 27, 2010 11:11 am

leileilol wrote:also off-topic - i need to try hexen2's span asm code to see if rjmonroe's optimisations are bull (EDIT: all that trouble to inline some NASM leads to no speed gain!)


Inline asm into tight C loops are unlikely to bring any real gain in performance due the intensive stack handling overhead.
I know FrikaC made a cgi-bin version of the quakec interpreter once and wrote part of his website in QuakeC :) (LordHavoc)
User avatar
frag.machine
 
Posts: 2090
Joined: Sat Nov 25, 2006 1:49 pm

Postby qbism » Wed Oct 27, 2010 4:55 pm

User avatar
qbism
 
Posts: 1236
Joined: Thu Nov 04, 2004 5:51 am

Postby frag.machine » Wed Oct 27, 2010 6:23 pm

Just an idea occured me. I wonder if we could squeeze any performance by skipping the right shifts with the use of some sort of union:

mh code:
Code: Select all
// typedef   int   fixed16_t;
fixed16_t      s, t, snext, tnext, sstep, tstep;

(....)

switch (spancount)
{
         case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;


my idea:
Code: Select all
// union fixed32 {
//  int i;
// short s[2];
// }fixed32_t;

fixed32_t      s, t, snext, tnext, sstep, tstep;

(....)

switch (spancount)
{
         case 16: pdest[-16] = pbase[(s.s[1]) + (t.t[1]) * cachewidth]; s.i += sstep.i; t.i += tstep.i;


Can't check it right now, but seems feasible to me.
EDIT: of course, I am assuming big endians here, but the right thing to do would be to check for it out of the loop and use the correct index value for the high short in the union.
I know FrikaC made a cgi-bin version of the quakec interpreter once and wrote part of his website in QuakeC :) (LordHavoc)
User avatar
frag.machine
 
Posts: 2090
Joined: Sat Nov 25, 2006 1:49 pm

Postby mh » Wed Oct 27, 2010 9:41 pm

There's some floating point division further up in the loop that I'd like to have a pop at doing something about too. I suspect that might be causing more performance issues.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
User avatar
mh
 
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Next

Return to Engine Programming

Who is online

Users browsing this forum: No registered users and 1 guest