Drawspans optimization in C for software Quake

Postby qbism » Wed Oct 27, 2010 4:53 pm

Piece-of-cake sw quake performance increase...

This assumes an unmodified D_DrawSpans8. I've looked through the code of several software Quake projects and D_DrawSpans8 was unchanged in all of them. If you've got a modified one, please post about it here!

Cut and paste the new D_DrawSpans16 below for a 5% to 10% FPS boost, and replace every call to D_DrawSpans8 with a call to D_DrawSpans16. Keep the old D_DrawSpans8 function in your code for comparison if you want, perhaps behind an #ifdef if you're timing it.
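
One way to keep both routines around for timing, as a minimal sketch: a compile-time switch picks which one the call sites use. The macro names USE_DRAWSPANS16 and D_DRAWSPANS, the stub functions and draw_surface_spans are all hypothetical; the real routines take an espan_t * and write pixels.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real span drawers. Each stub just returns
   its subdivision length so the effect of the switch is visible. */
int D_DrawSpans8_stub(void)  { return 8; }
int D_DrawSpans16_stub(void) { return 16; }

/* Assumed build switch (not part of the Quake source): remove this define
   to time the original 8-pixel routine instead. */
#define USE_DRAWSPANS16

#ifdef USE_DRAWSPANS16
#define D_DRAWSPANS D_DrawSpans16_stub
#else
#define D_DRAWSPANS D_DrawSpans8_stub
#endif

/* Call sites go through the macro, so switching versions needs no other edits. */
int draw_surface_spans(void)
{
    int len = D_DRAWSPANS();
    assert(len == 8 || len == 16);
    return len;
}
```

Run the same timedemo with and without the define to compare the two.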

This contains two optimizations: unrolling the inner loop with a modified Duff's Device, and increasing the span subdivision length from 8 to 16 pixels (as the i386 asm did). Thanks to mh for the hard part!
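
For anyone who hasn't met the fall-through switch trick, here is a cut-down sketch with 4 cases instead of 16. The function name copy_run is made up for illustration. It only copies bytes, but the control flow is the same idea: jump to case n and let exactly n statements run.

```c
#include <assert.h>

/* Copy n bytes (0 <= n <= 4) from src to dst without a loop counter.
   Entering the switch at case n executes exactly n stores. */
void copy_run(unsigned char *dst, const unsigned char *src, int n)
{
    assert(n >= 0 && n <= 4);

    dst += n;   /* point one past the run, as the span code does with pdest */
    src += n;

    switch (n)
    {
    case 4: dst[-4] = src[-4]; /* fall through */
    case 3: dst[-3] = src[-3]; /* fall through */
    case 2: dst[-2] = src[-2]; /* fall through */
    case 1: dst[-1] = src[-1];
    }
}
```

With n == 0 no case matches and nothing is written, which is why the negative indices never reach outside the run.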

Code:
/*
=============
D_DrawSpans16
=============
*/

void D_DrawSpans16 (espan_t *pspan) //qbism up it from 8 to 16.  This + unroll = big speed gain!
{
   int            count, spancount;
   unsigned char   *pbase, *pdest;
   fixed16_t      s, t, snext, tnext, sstep, tstep;
   float         sdivz, tdivz, zi, z, du, dv, spancountminus1;
   float         sdivzstepu, tdivzstepu, zistepu;

   sstep = 0;   // keep compiler happy
   tstep = 0;   // ditto

   pbase = (unsigned char *)cacheblock;

   sdivzstepu = d_sdivzstepu * 16;
   tdivzstepu = d_tdivzstepu * 16;
   zistepu = d_zistepu * 16;

   do
   {
      pdest = (unsigned char *)((byte *)d_viewbuffer +
            (screenwidth * pspan->v) + pspan->u);

      count = pspan->count;

   // calculate the initial s/z, t/z, 1/z, s, and t and clamp
      du = (float)pspan->u;
      dv = (float)pspan->v;

      sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
      tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
      zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
      z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

      s = (int)(sdivz * z) + sadjust;
      if (s > bbextents)
         s = bbextents;
      else if (s < 0)
         s = 0;

      t = (int)(tdivz * z) + tadjust;
      if (t > bbextentt)
         t = bbextentt;
      else if (t < 0)
         t = 0;

      do
      {
      // calculate s and t at the far end of the span
         if (count >= 16)
            spancount = 16;
         else
            spancount = count;

         count -= spancount;

         if (count)
         {
         // calculate s/z, t/z, zi->fixed s and t at far end of span,
         // calculate s and t steps across span by shifting
            sdivz += sdivzstepu;
            tdivz += tdivzstepu;
            zi += zistepu;
            z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

            snext = (int)(sdivz * z) + sadjust;
            if (snext > bbextents)
               snext = bbextents;
            else if (snext < 16)
               snext = 16;   // prevent round-off error on <0 steps from
                        //  causing overstepping & running off the
                        //  edge of the texture

            tnext = (int)(tdivz * z) + tadjust;
            if (tnext > bbextentt)
               tnext = bbextentt;
            else if (tnext < 16)
               tnext = 16;   // guard against round-off error on <0 steps

            sstep = (snext - s) >> 4;
            tstep = (tnext - t) >> 4;
         }
         else
         {
         // calculate s/z, t/z, zi->fixed s and t at last pixel in span (so
         // can't step off polygon), clamp, calculate s and t steps across
         // span by division, biasing steps low so we don't run off the
         // texture
            spancountminus1 = (float)(spancount - 1);
            sdivz += d_sdivzstepu * spancountminus1;
            tdivz += d_tdivzstepu * spancountminus1;
            zi += d_zistepu * spancountminus1;
            z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
            snext = (int)(sdivz * z) + sadjust;
            if (snext > bbextents)
               snext = bbextents;
            else if (snext < 16)
               snext = 16;   // prevent round-off error on <0 steps from
                        //  causing overstepping & running off the
                        //  edge of the texture

            tnext = (int)(tdivz * z) + tadjust;
            if (tnext > bbextentt)
               tnext = bbextentt;
            else if (tnext < 16)
               tnext = 16;   // guard against round-off error on <0 steps

            if (spancount > 1)
            {
               sstep = (snext - s) / (spancount - 1);
               tstep = (tnext - t) / (spancount - 1);
            }
         }

         // qbism - Duff's Device loop unroll per mh; every case falls through on purpose.
         pdest += spancount;
         switch (spancount)
         {
         case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 9: pdest[-9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 8: pdest[-8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 7: pdest[-7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 6: pdest[-6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 5: pdest[-5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 4: pdest[-4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 3: pdest[-3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 2: pdest[-2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 1: pdest[-1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         }

         s = snext;
         t = tnext;

      } while (count > 0);

   } while ((pspan = pspan->pnext) != NULL);
}
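
The fixed-point stepping inside one block can be sketched on its own, assuming a 32-bit int for fixed16_t as in the Quake source. The helper name step_block16 is hypothetical. The exact s at the block's first pixel and at the next block's first pixel come from the perspective divide; the 16 pixels in between are plain additions.

```c
#include <assert.h>

typedef int fixed16_t;   /* 16.16 fixed point */

/* Fill out[0..15] with the integer texel coordinate at each pixel of one
   16-pixel block, stepping linearly from s toward snext. */
void step_block16(fixed16_t s, fixed16_t snext, int out[16])
{
    fixed16_t sstep;
    int i;

    assert(s >= 0 && snext >= 0);   /* the span code clamps both beforehand */

    sstep = (snext - s) >> 4;       /* delta / 16 done as a shift */
    for (i = 0; i < 16; i++)
    {
        out[i] = s >> 16;           /* integer part selects the texel */
        s += sstep;
    }
}
```

Going from 8 to 16 pixels per block halves the number of float divides per span, at the cost of interpolating linearly over a longer run.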
qbism
 
Posts: 1235
Joined: Thu Nov 04, 2004 5:51 am

Postby mh » Wed Oct 27, 2010 9:39 pm

I normally rename these functions with an "_C" after the name, so we'd have "D_DrawSpans16_C". Makes it clear which version you're using. ;)
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
mh
 
Posts: 2286
Joined: Sat Jan 12, 2008 1:38 am

Postby mrmmaclean » Fri Oct 29, 2010 10:11 pm

Excellent work! Thank you!
mrmmaclean
 
Posts: 33
Joined: Sun Aug 22, 2010 2:49 am

Postby mankrip » Wed Nov 10, 2010 11:41 pm

Replacing this
Code:
      // calculate s and t at the far end of the span
         if (count >= 16)
            spancount = 16;
         else
            spancount = count;

with this
Code:
      // calculate s and t at the far end of the span
         spancount = count % 17;

may also help a bit.

I'm still applying your changes to all my versions of D_DrawSpans (_Stippled, _Blended, _BlendedBackwards, _ColorKeyed), so I can't test this at the moment, but it should probably work.

[edit] Actually, it probably won't, since it will return zero for multiples of 17. Eew.
Ph'nglui mglw'nafh mankrip Hell's end wgah'nagl fhtagn.
==-=-=-=-=-=-=-=-=-=-==
Dev blog / Twitter / YouTube
mankrip
 
Posts: 914
Joined: Fri Jul 04, 2008 3:02 am

Postby mh » Thu Nov 11, 2010 12:54 am

Code:
spancount = count > 15 ? 16 : count;
should be OK though. I haven't checked the generated asm.
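
A quick sketch comparing the two ideas side by side. The function names next_spancount and next_spancount_mod17 are made up for the comparison.

```c
#include <assert.h>

/* Length of the next block: 16 pixels if that many are left,
   otherwise whatever remains. */
int next_spancount(int count)
{
    assert(count >= 0);
    return count > 15 ? 16 : count;
}

/* The modulo idea, kept only to show where it goes wrong. */
int next_spancount_mod17(int count)
{
    assert(count >= 0);
    return count % 17;
}
```

The two agree only while count is 16 or less. For count = 17 the modulo form gives 0, and for count = 40 it gives 6 instead of 16.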

Postby mankrip » Thu Nov 11, 2010 12:55 am

Okay, this version should be a bit faster:
Code:
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
void D_DrawSpans16 (espan_t *pspan) //qbism up it from 8 to 16.  This + unroll = big speed gain!
{
   int         count, spancount;
   byte      *pbase, *pdest;
   fixed16_t   s, t, snext, tnext, sstep, tstep;
   float      sdivz, tdivz, zi, z, du, dv, spancountminus1;
   float      sdivzstepu, tdivzstepu, zistepu; // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )

   sstep = 0;   // keep compiler happy
   tstep = 0;   // ditto

   pbase = (byte *)cacheblock;

   // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
   sdivzstepu = d_sdivzstepu * 16;
   tdivzstepu = d_tdivzstepu * 16;
   zistepu = d_zistepu * 16;
   // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

   do
   {
      pdest = (byte *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);

      // Manoel Kasimier - begin
      count = pspan->count / 16;
      spancount = pspan->count % 16;
      // Manoel Kasimier - end

      // calculate the initial s/z, t/z, 1/z, s, and t and clamp
      du = (float)pspan->u;
      dv = (float)pspan->v;

      sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
      tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
      zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
      z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

      s = (int) (sdivz * z) + sadjust;
      if (s > bbextents)
         s = bbextents;
      else if (s < 0)
         s = 0;

      t = (int) (tdivz * z) + tadjust;
      if (t > bbextentt)
         t = bbextentt;
      else if (t < 0)
         t = 0;

      while (count-- > 0) // Manoel Kasimier
      {
         // calculate s/z, t/z, zi->fixed s and t at far end of span,
         // calculate s and t steps across span by shifting
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         sdivz += sdivzstepu;
         tdivz += tdivzstepu;
         zi += zistepu;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
         z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

         snext = (int) (sdivz * z) + sadjust;
         if (snext > bbextents)
            snext = bbextents;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         else if (snext <= 16)
            snext = 16;   // prevent round-off error on <0 steps causing overstepping & running off the edge of the texture
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

         tnext = (int) (tdivz * z) + tadjust;
         if (tnext > bbextentt)
            tnext = bbextentt;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         else if (tnext < 16)
            tnext = 16;   // guard against round-off error on <0 steps

         sstep = (snext - s) >> 4;
         tstep = (tnext - t) >> 4;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

         // Manoel Kasimier - begin
         pdest += 16;
         pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         pdest[ -1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         // Manoel Kasimier - end

         s = snext;
         t = tnext;
         // Manoel Kasimier - begin
      }
      if (spancount > 0)
      {
         // Manoel Kasimier - end

         // calculate s/z, t/z, zi->fixed s and t at last pixel in span (so can't step off polygon),
         // clamp, calculate s and t steps across span by division, biasing steps low so we don't run off the texture
         spancountminus1 = (float)(spancount - 1);
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         sdivz += d_sdivzstepu * spancountminus1;
         tdivz += d_tdivzstepu * spancountminus1;
         zi += d_zistepu * spancountminus1;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
         z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
         snext = (int)(sdivz * z) + sadjust;
         if (snext > bbextents)
            snext = bbextents;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         else if (snext < 16)
            snext = 16;   // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

         tnext = (int)(tdivz * z) + tadjust;
         if (tnext > bbextentt)
            tnext = bbextentt;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         else if (tnext < 16)
            tnext = 16;   // guard against round-off error on <0 steps
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

         if (spancount > 1)
         {
            sstep = (snext - s) / (spancount - 1);
            tstep = (tnext - t) / (spancount - 1);
         }

         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         //qbism- Duff's Device loop unroll per mh.
         pdest += spancount;
         switch (spancount)
         {
            case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  9: pdest[ -9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  8: pdest[ -8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  7: pdest[ -7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  6: pdest[ -6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  5: pdest[ -5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  4: pdest[ -4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  3: pdest[ -3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  2: pdest[ -2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            case  1: pdest[ -1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
            break;
         }
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
      }

   } while ( (pspan = pspan->pnext) != NULL);
}
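
The loop structure above, stripped of the texture math, looks roughly like this sketch. The name sum_in_blocks is hypothetical, and summing bytes stands in for the pixel stores.

```c
#include <assert.h>

/* Walk n bytes the way the reworked span loop walks pixels:
   full 16-pixel blocks first, then the 0..15 pixel remainder. */
int sum_in_blocks(const unsigned char *p, int n)
{
    int blocks, rest, i, total = 0;

    assert(n >= 0);

    blocks = n >> 4;    /* number of full 16-pixel blocks */
    rest = n & 15;      /* leftover pixels; equals n % 16 for n >= 0 */

    while (blocks-- > 0)
    {
        for (i = 0; i < 16; i++)   /* stands in for the 16 unrolled stores */
            total += p[i];
        p += 16;
    }

    for (i = 0; i < rest; i++)     /* stands in for the fall-through switch */
        total += p[i];

    return total;
}
```

One thing this makes visible: the remainder is always 0 to 15, so the case 16 label in the switch can no longer be reached. It also means a span whose length is an exact multiple of 16 finishes in the full-block path, which steps to one pixel past the span end, where the first version used the "last pixel" path for its final block. The clamps on snext and tnext still apply, so this may well be harmless, but it is a behavior change worth testing.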

I'm (again) applying these changes to the other versions of this function, so I can't benchmark it right now.

Postby mh » Thu Nov 11, 2010 1:04 am

Replace
Code:
count = pspan->count / 16;
with
Code:
count = pspan->count >> 4;

I got a few extra frames for that.
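
A likely reason the shift can win, sketched under the assumption that count is a signed int: signed division in C truncates toward zero, so the compiler has to handle negative inputs before it can shift. The helper names div16 and shift4 are hypothetical.

```c
#include <assert.h>

/* Signed division truncates toward zero, so -1 / 16 is 0.
   A compiler needs a small fix-up for negative x to implement this with a shift. */
int div16(int x)
{
    return x / 16;
}

/* The caller promises x >= 0, where a plain shift gives the same result. */
int shift4(int x)
{
    assert(x >= 0);
    return x >> 4;
}
```

Since a span's pixel count is never negative, the two give the same answer here. How much time it saves depends on the compiler and the optimization level.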

Postby mh » Thu Nov 11, 2010 1:16 am

Replacing the ifs for calculating s, t, snext and tnext with ? : squeezes out some more.

Sample:
Code:
s = s > bbextents ? bbextents : (s < 0 ? 0 : s);
was
Code:
      if (s > bbextents)
         s = bbextents;
      else if (s < 0)
         s = 0;
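
Both forms wrapped as functions so they can be checked against each other. This is a sketch: clamp_if and clamp_ternary are hypothetical names, and hi stands for bbextents or bbextentt.

```c
#include <assert.h>

typedef int fixed16_t;   /* 16.16 fixed point, assumed to be a 32-bit int */

/* if/else form, as in the original span code */
fixed16_t clamp_if(fixed16_t s, fixed16_t hi)
{
    assert(hi >= 0);
    if (s > hi)
        s = hi;
    else if (s < 0)
        s = 0;
    return s;
}

/* conditional-expression form */
fixed16_t clamp_ternary(fixed16_t s, fixed16_t hi)
{
    assert(hi >= 0);
    return s > hi ? hi : (s < 0 ? 0 : s);
}
```

The results are identical. Whether either one compiles to conditional moves or to branches is up to the compiler, so only a timedemo can say which is faster on a given build.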

Postby mankrip » Thu Nov 11, 2010 1:29 am

Won't it be a bit slower, since it assigns s to itself even in cases where s shouldn't change?
Joined: Fri Jul 04, 2008 3:02 am

Postby mh » Thu Nov 11, 2010 1:44 am

It doesn't really seem to make much difference actually. It's generating more instructions but they're faster instructions.

Postby mankrip » Thu Nov 11, 2010 2:53 am

Just for a laugh, here's the stippled version:
Code:
void D_DrawSpans16_Stipple (espan_t *pspan) // Manoel Kasimier
{
   int         count, spancount;
   byte      *pbase, pcolor, *pdest; // Manoel Kasimier
   fixed16_t   s, t, snext, tnext, sstep, tstep, sstep2, tstep2; // Manoel Kasimier
   float      sdivz, tdivz, zi, z, du, dv, spancountminus1; // zi = 1/z (inverse depth); du, dv = span start u, v as floats
   float      sdivzstepu, tdivzstepu, zistepu; // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
   int         izi, izistep, izistep2, stipple; // Manoel Kasimier
   short      *pz; // Manoel Kasimier

   pbase = (byte *)cacheblock;

   // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
   sdivzstepu = d_sdivzstepu * 16;
   tdivzstepu = d_tdivzstepu * 16;
   zistepu = d_zistepu * 16;
   // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

   // Manoel Kasimier - begin
   // we count on FP exceptions being turned off to avoid range problems
   izistep = (int)(d_zistepu * 0x8000 * 0x10000);
   izistep2 = izistep*2;
   // Manoel Kasimier - end

   do
   {
      pdest = (byte *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);
      stipple = ( ( (long) pdest - ( (long) pdest - (long) d_viewbuffer) / (long) screenwidth)) & 1; // Manoel Kasimier
      pz = d_pzbuffer + (d_zwidth * pspan->v) + pspan->u; // Manoel Kasimier

      count = pspan->count >> 4; // mh
      spancount = pspan->count % 16; // Manoel Kasimier

      // calculate the initial s/z, t/z, 1/z, s, and t and clamp
      du = (float)pspan->u;
      dv = (float)pspan->v;

      sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
      tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
      zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
      z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
      // we count on FP exceptions being turned off to avoid range problems // Manoel Kasimier
      izi = (int) (zi * 0x8000 * 0x10000); // Manoel Kasimier

      s = (int)(sdivz * z) + sadjust;
      if (s > bbextents)
         s = bbextents;
      else if (s < 0)
         s = 0;

      t = (int)(tdivz * z) + tadjust;
      if (t > bbextentt)
         t = bbextentt;
      else if (t < 0)
         t = 0;

      while (count-- > 0) // Manoel Kasimier
      {
         // calculate s/z, t/z, zi->fixed s and t at far end of span,
         // calculate s and t steps across span by shifting
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         sdivz += sdivzstepu;
         tdivz += tdivzstepu;
         zi += zistepu;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
         z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

         snext = (int) (sdivz * z) + sadjust;
         if (snext > bbextents)
            snext = bbextents;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         else if (snext <= 16)
            snext = 16;   // prevent round-off error on <0 steps causing overstepping & running off the edge of the texture
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

         tnext = (int) (tdivz * z) + tadjust;
         if (tnext > bbextentt)
            tnext = bbextentt;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         else if (tnext < 16)
            tnext = 16;   // guard against round-off error on <0 steps

         sstep = (snext - s) >> 4;
         tstep = (tnext - t) >> 4;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
         // Manoel Kasimier - begin
         sstep2 = sstep * 2;
         tstep2 = tstep * 2;

         pdest += 16;
         pz += 16;
         if (stipple)
         {
            s += sstep; t += tstep; izi += izistep;
            if (pz[-15] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-15] = pcolor; pz[-15] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[-13] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-13] = pcolor; pz[-13] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[-11] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-11] = pcolor; pz[-11] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -9] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -9] = pcolor; pz[ -9] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -7] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -7] = pcolor; pz[ -7] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -5] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -5] = pcolor; pz[ -5] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -3] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -3] = pcolor; pz[ -3] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -1] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -1] = pcolor; pz[ -1] = izi >> 16; } izi += izistep;
         }
         else
         {
            if (pz[-16] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-16] = pcolor; pz[-16] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[-14] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-14] = pcolor; pz[-14] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[-12] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-12] = pcolor; pz[-12] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[-10] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-10] = pcolor; pz[-10] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -8] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -8] = pcolor; pz[ -8] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -6] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -6] = pcolor; pz[ -6] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -4] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -4] = pcolor; pz[ -4] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
            if (pz[ -2] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -2] = pcolor; pz[ -2] = izi >> 16; } izi += izistep2;
         }
         // Manoel Kasimier - end

         s = snext;
         t = tnext;
         // Manoel Kasimier - begin
      }
      if (spancount > 0)
      {
         // Manoel Kasimier - end

         // calculate s/z, t/z, zi->fixed s and t at last pixel in span (so can't step off polygon),
         // clamp, calculate s and t steps across span by division, biasing steps low so we don't run off the texture
         spancountminus1 = (float)(spancount - 1);
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         sdivz += d_sdivzstepu * spancountminus1;
         tdivz += d_tdivzstepu * spancountminus1;
         zi += d_zistepu * spancountminus1;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
         z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
         snext = (int)(sdivz * z) + sadjust;
         if (snext > bbextents)
            snext = bbextents;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         else if (snext < 16)
            snext = 16;   // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

         tnext = (int)(tdivz * z) + tadjust;
         if (tnext > bbextentt)
            tnext = bbextentt;
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
         else if (tnext < 16)
            tnext = 16;   // guard against round-off error on <0 steps
         // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

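         if (spancount > 1)
         {
            sstep = (snext - s) / (spancount - 1);
            tstep = (tnext - t) / (spancount - 1);
            // recompute the doubled steps; otherwise the switch below reuses
            // stale (or uninitialized) values from the full-block loop
            sstep2 = sstep * 2;
            tstep2 = tstep * 2;
         }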

         //qbism- Duff's Device loop unroll per mh.
         pdest += spancount;
         // Manoel Kasimier - begin
         pz += spancount;
         if (stipple)
            switch (spancount)
            {
               case 15: s += sstep; t += tstep; izi += izistep;
               if (pz[-15] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-15] = pcolor; pz[-15] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case 13: if (pz[-13] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-13] = pcolor; pz[-13] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case 11: if (pz[-11] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-11] = pcolor; pz[-11] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  9: if (pz[ -9] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -9] = pcolor; pz[ -9] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  7: if (pz[ -7] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -7] = pcolor; pz[ -7] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  5: if (pz[ -5] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -5] = pcolor; pz[ -5] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  3: if (pz[ -3] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -3] = pcolor; pz[ -3] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  1: if (pz[ -1] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -1] = pcolor; pz[ -1] = izi >> 16; }
               break;
            }
         else
            switch (spancount)
            {
               case 16: if (pz[-16] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-16] = pcolor; pz[-16] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case 14: if (pz[-14] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-14] = pcolor; pz[-14] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case 12: if (pz[-12] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-12] = pcolor; pz[-12] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case 10: if (pz[-10] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-10] = pcolor; pz[-10] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  8: if (pz[ -8] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -8] = pcolor; pz[ -8] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  6: if (pz[ -6] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -6] = pcolor; pz[ -6] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  4: if (pz[ -4] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -4] = pcolor; pz[ -4] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
               case  2: if (pz[ -2] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -2] = pcolor; pz[ -2] = izi >> 16; }
               break;
            }
      }
         // Manoel Kasimier - end
   } while ((pspan = pspan->pnext) != NULL);
}
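
The stipple idea in isolation, as a sketch. It assumes the intent is a screen-space checkerboard where a pixel is drawn only when (x + y) is even; which parity actually gets drawn in the function above depends on how the stipple flag is derived. The helper names stipple_draws and stippled_pixels are hypothetical.

```c
#include <assert.h>

/* Assumed checkerboard rule: draw the pixel only when (x + y) is even. */
int stipple_draws(int x, int y)
{
    assert(x >= 0 && y >= 0);
    return ((x + y) & 1) == 0;
}

/* Number of pixels a stippled span of `count` pixels starting at (u, v) draws. */
int stippled_pixels(int u, int v, int count)
{
    int first_skipped;

    assert(count >= 0);

    first_skipped = !stipple_draws(u, v);   /* 1 when the span starts on a skipped pixel */
    return (count + 1 - first_skipped) / 2;
}
```

Only about half of the pixels need a texture lookup and a z test, which is why stepping s, t and izi by two pixels at a time pays off.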

Postby mankrip » Thu Nov 11, 2010 3:04 am

mh wrote:It doesn't really seem to make much difference actually. It's generating more instructions but they're faster instructions.

It may be compiler/processor dependent then, I'll do some benchmarking in both the PC and Dreamcast versions to see how it goes.

Postby Sajt » Thu Nov 11, 2010 4:44 am

Having fun optimizing your debug builds? :P
F. A. Špork, an enlightened nobleman and a great patron of art, had a stately Baroque spa complex built on the banks of the River Labe.
Sajt
 
Posts: 1215
Joined: Sat Oct 16, 2004 3:39 am

Postby mankrip » Thu Nov 11, 2010 5:01 am

:lol: ROFL!

I'm quite interested in seeing how much faster the Dreamcast (and qbism's Flash) version will get with these optimizations. In particular, mh's unrolling code may have helped me finally figure out how to optimize the stippled water the way I wanted (preemptively skipping the unnecessary pixels), which is something I've wanted to figure out for years.

Postby leileilol » Mon Nov 15, 2010 1:34 am

I've tested the mk+mh spans with a release build on Pentium II, and they're actually slower than the Qbism/mh span16 up here. I keep getting 40.2/40.6 with the mk/mh version with timedemo demo1 compared to the qbism version which had 41.2.

Can't say if the optimizations are beneficial for the dreamcast, since that's an overrated piece of hitachi-powered crap
i should not be here
leileilol
 
Posts: 2783
Joined: Fri Oct 15, 2004 3:23 am
