Drawspans optimization in C for software Quake
Posted: Wed Oct 27, 2010 4:53 pm
by qbism
Piece-of-cake sw quake performance increase...
This assumes an unmodified D_DrawSpans8. I've looked through the code of several sw Quake projects and they were all unchanged... If you've got a modified one, please post about it here!
Cut-and-paste the new D_DrawSpans16 below for a 5% to 10% FPS boost, then replace the calls to D_DrawSpans8 with D_DrawSpans16. Keep the old D_DrawSpans8 function in your code for comparison if you want, maybe behind some ifdefs if you're timing it.
This contains two optimizations: unrolling the inner loop with a modified Duff's Device, and increasing the span length between perspective divides from 8 to 16 pixels (like the i386 asm did). Thanks to mh for the hard part!
Code: Select all
/*
=============
D_DrawSpans16
=============
*/
void D_DrawSpans16 (espan_t *pspan) //qbism up it from 8 to 16. This + unroll = big speed gain!
{
int count, spancount;
unsigned char *pbase, *pdest;
fixed16_t s, t, snext, tnext, sstep, tstep;
float sdivz, tdivz, zi, z, du, dv, spancountminus1;
float sdivzstepu, tdivzstepu, zistepu;
sstep = 0; // keep compiler happy
tstep = 0; // ditto
pbase = (unsigned char *)cacheblock;
sdivzstepu = d_sdivzstepu * 16;
tdivzstepu = d_tdivzstepu * 16;
zistepu = d_zistepu * 16;
do
{
pdest = (unsigned char *)((byte *)d_viewbuffer +
(screenwidth * pspan->v) + pspan->u);
count = pspan->count;
// calculate the initial s/z, t/z, 1/z, s, and t and clamp
du = (float)pspan->u;
dv = (float)pspan->v;
sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
s = (int)(sdivz * z) + sadjust;
if (s > bbextents)
s = bbextents;
else if (s < 0)
s = 0;
t = (int)(tdivz * z) + tadjust;
if (t > bbextentt)
t = bbextentt;
else if (t < 0)
t = 0;
do
{
// calculate s and t at the far end of the span
if (count >= 16)
spancount = 16;
else
spancount = count;
count -= spancount;
if (count)
{
// calculate s/z, t/z, zi->fixed s and t at far end of span,
// calculate s and t steps across span by shifting
sdivz += sdivzstepu;
tdivz += tdivzstepu;
zi += zistepu;
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (int)(sdivz * z) + sadjust;
if (snext > bbextents)
snext = bbextents;
else if (snext <= 16)
snext = 16; // prevent round-off error on <0 steps from
// causing overstepping & running off the
// edge of the texture
tnext = (int)(tdivz * z) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps
sstep = (snext - s) >> 4;
tstep = (tnext - t) >> 4;
}
else
{
// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so
// can't step off polygon), clamp, calculate s and t steps across
// span by division, biasing steps low so we don't run off the
// texture
spancountminus1 = (float)(spancount - 1);
sdivz += d_sdivzstepu * spancountminus1;
tdivz += d_tdivzstepu * spancountminus1;
zi += d_zistepu * spancountminus1;
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (int)(sdivz * z) + sadjust;
if (snext > bbextents)
snext = bbextents;
else if (snext < 16)
snext = 16; // prevent round-off error on <0 steps from
// causing overstepping & running off the
// edge of the texture
tnext = (int)(tdivz * z) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps
if (spancount > 1)
{
sstep = (snext - s) / (spancount - 1);
tstep = (tnext - t) / (spancount - 1);
}
}
//qbism- Duff's Device loop unroll per mh.
pdest += spancount;
switch (spancount)
{
case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 9: pdest[-9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 8: pdest[-8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 7: pdest[-7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 6: pdest[-6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 5: pdest[-5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 4: pdest[-4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 3: pdest[-3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 2: pdest[-2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 1: pdest[-1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
}
s = snext;
t = tnext;
} while (count > 0);
} while ((pspan = pspan->pnext) != NULL);
}
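For anyone who hasn't met the switch trick before, here is a stripped-down sketch of the same idea, cut to 4 pixels and a single texture coordinate. fill_span4 and its parameters are made up for illustration and are not engine code; the 16.16 stepping mirrors the "s >> 16" / "s += sstep" pattern in the listing above.

```c
#include <assert.h>

/* Hypothetical helper, not engine code: write up to 4 texels from a 1-D
   "texture" into dest, stepping a 16.16 fixed-point coordinate s. */
void fill_span4(unsigned char *dest, const unsigned char *tex,
                int spancount, int s, int sstep)
{
    assert(spancount >= 0 && spancount <= 4);

    dest += spancount;  /* point one past the end, then index backwards */
    switch (spancount)
    {
    case 4: dest[-4] = tex[s >> 16]; s += sstep; /* fall through */
    case 3: dest[-3] = tex[s >> 16]; s += sstep; /* fall through */
    case 2: dest[-2] = tex[s >> 16]; s += sstep; /* fall through */
    case 1: dest[-1] = tex[s >> 16];             /* last pixel, no step needed */
    }
}
```

The switch jumps straight to the case matching the pixel count and falls through the rest, so there is no per-pixel loop counter or branch. A spancount of 0 matches no case and writes nothing.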
Posted: Wed Oct 27, 2010 9:39 pm
by mh
I normally rename these functions with an "_C" after the name, so we'd have "D_DrawSpans16_C". Makes it clear which version you're using.
Posted: Fri Oct 29, 2010 10:11 pm
by mrmmaclean
Excellent work! Thank you!
Re: Drawspans optimization in C for software Quake
Posted: Wed Nov 10, 2010 11:41 pm
by mankrip
Replacing this
Code: Select all
// calculate s and t at the far end of the span
if (count >= 16)
spancount = 16;
else
spancount = count;
with this
Code: Select all
// calculate s and t at the far end of the span
spancount = count % 17;
may also help a bit.
I'm still applying your changes to all my versions of D_DrawSpans (_Stippled, _Blended, _BlendedBackwards, _ColorKeyed), so I can't test this at the moment, but it should probably work.
[edit] Actually, it probably won't, since it will return zero for multiples of 17. Eew.
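To make the failure concrete, here's a small sketch comparing the original clamp with the modulo shortcut. The function names are just for illustration.

```c
#include <assert.h>

/* Original logic: take at most 16 pixels from what is left of the span. */
int spancount_clamp(int count)
{
    assert(count >= 0);
    return (count >= 16) ? 16 : count;
}

/* The modulo shortcut: only agrees with the clamp while count <= 16. */
int spancount_mod17(int count)
{
    assert(count >= 0);
    return count % 17;
}
```

For count = 17 the clamp gives 16 but the modulo gives 0, so "count -= spancount" would never make progress in the inner do/while. For count = 20 the modulo gives 3, so the first chunk would be 3 pixels instead of 16.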
Posted: Thu Nov 11, 2010 12:54 am
by mh
Code: Select all
spancount = count > 15 ? 16 : count;
should be OK though. I haven't checked the generated asm.
Posted: Thu Nov 11, 2010 12:55 am
by mankrip
Okay, this version should be a bit faster:
Code: Select all
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
void D_DrawSpans16 (espan_t *pspan) //qbism up it from 8 to 16. This + unroll = big speed gain!
{
int count, spancount;
byte *pbase, *pdest;
fixed16_t s, t, snext, tnext, sstep, tstep;
float sdivz, tdivz, zi, z, du, dv, spancountminus1;
float sdivzstepu, tdivzstepu, zistepu; // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
sstep = 0; // keep compiler happy
tstep = 0; // ditto
pbase = (byte *)cacheblock;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
sdivzstepu = d_sdivzstepu * 16;
tdivzstepu = d_tdivzstepu * 16;
zistepu = d_zistepu * 16;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
do
{
pdest = (byte *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);
// Manoel Kasimier - begin
count = pspan->count / 16;
spancount = pspan->count % 16;
// Manoel Kasimier - end
// calculate the initial s/z, t/z, 1/z, s, and t and clamp
du = (float)pspan->u;
dv = (float)pspan->v;
sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
s = (int) (sdivz * z) + sadjust;
if (s > bbextents)
s = bbextents;
else if (s < 0)
s = 0;
t = (int) (tdivz * z) + tadjust;
if (t > bbextentt)
t = bbextentt;
else if (t < 0)
t = 0;
while (count-- > 0) // Manoel Kasimier
{
// calculate s/z, t/z, zi->fixed s and t at far end of span,
// calculate s and t steps across span by shifting
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
sdivz += sdivzstepu;
tdivz += tdivzstepu;
zi += zistepu;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (int) (sdivz * z) + sadjust;
if (snext > bbextents)
snext = bbextents;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
else if (snext <= 16)
snext = 16; // prevent round-off error on <0 steps causing overstepping & running off the edge of the texture
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
tnext = (int) (tdivz * z) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps
sstep = (snext - s) >> 4;
tstep = (tnext - t) >> 4;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
// Manoel Kasimier - begin
pdest += 16;
pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
pdest[ -1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
// Manoel Kasimier - end
s = snext;
t = tnext;
// Manoel Kasimier - begin
}
if (spancount > 0)
{
// Manoel Kasimier - end
// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so can't step off polygon),
// clamp, calculate s and t steps across span by division, biasing steps low so we don't run off the texture
spancountminus1 = (float)(spancount - 1);
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
sdivz += d_sdivzstepu * spancountminus1;
tdivz += d_tdivzstepu * spancountminus1;
zi += d_zistepu * spancountminus1;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (int)(sdivz * z) + sadjust;
if (snext > bbextents)
snext = bbextents;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
else if (snext < 16)
snext = 16; // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
tnext = (int)(tdivz * z) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
if (spancount > 1)
{
sstep = (snext - s) / (spancount - 1);
tstep = (tnext - t) / (spancount - 1);
}
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
//qbism- Duff's Device loop unroll per mh.
pdest += spancount;
switch (spancount)
{
case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 9: pdest[ -9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 8: pdest[ -8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 7: pdest[ -7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 6: pdest[ -6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 5: pdest[ -5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 4: pdest[ -4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 3: pdest[ -3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 2: pdest[ -2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
case 1: pdest[ -1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
break;
}
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
}
} while ( (pspan = pspan->pnext) != NULL);
}
I'm (again) applying these changes to the other versions of this function, so I can't benchmark it right now.
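One thing worth checking when benchmarking: the divide/modulo split doesn't chunk a span exactly the way the original loop does. The sketch below only does the bookkeeping (no drawing), and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical bookkeeping only: how many 16-pixel chunks take the
   "step to the far end" path, and how many pixels are left for the
   "stop at the last pixel" path. */
typedef struct { int full_chunks; int tail_pixels; } span_split_t;

/* Original loop: the final chunk (even a full 16) uses the last-pixel path. */
span_split_t split_clamp(int count)
{
    span_split_t r = {0, 0};
    assert(count > 0);
    while (count > 16) { r.full_chunks++; count -= 16; }
    r.tail_pixels = count;      /* 1..16 */
    return r;
}

/* Divide/modulo version: a multiple of 16 leaves no tail at all. */
span_split_t split_divmod(int count)
{
    span_split_t r;
    assert(count > 0);
    r.full_chunks = count / 16;
    r.tail_pixels = count % 16; /* 0..15 */
    return r;
}
```

Both cover the same pixels. The difference is that for a span of exactly 16 (or 32, 48, ...) pixels the divide/modulo version does its last perspective divide one pixel past the end of the span instead of at its last pixel. Whether that ever shows up on screen would need testing.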
Posted: Thu Nov 11, 2010 1:04 am
by mh
Replace
with
I got a few extra frames for that.
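The snippets from this post aren't shown here. Going by the "// mh" comment on "count = pspan->count >> 4;" in the stippled listing further down, the change was presumably swapping the divide by 16 for a shift. A sketch of that equivalence, assuming a non-negative count (the function names are made up, and the mask version of the remainder is an extra, not something from the thread):

```c
#include <assert.h>

/* Number of full 16-pixel chunks, written as a divide and as a shift.
   The two agree for count >= 0; for negative values a right shift rounds
   differently from integer division (and is implementation-defined). */
int chunks_div(int count)   { assert(count >= 0); return count / 16; }
int chunks_shift(int count) { assert(count >= 0); return count >> 4; }

/* Leftover pixels, as a modulo and as a mask (same caveat applies). */
int tail_mod(int count)     { assert(count >= 0); return count % 16; }
int tail_mask(int count)    { assert(count >= 0); return count & 15; }
```

For a signed int the compiler has to keep a small fix-up for negative values when it turns the divide into a shift, which the explicit shift skips. How much that is worth probably depends on the compiler and build settings.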
Posted: Thu Nov 11, 2010 1:16 am
by mh
Replacing the ifs for calculating s, t, snext and tnext with ? : squeezes out some more.
Sample:
Code: Select all
s = s > bbextents ? bbextents : (s < 0 ? 0 : s);
was
Code: Select all
if (s > bbextents)
s = bbextents;
else if (s < 0)
s = 0;
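A quick sketch to check that the two forms agree. clamp_if and clamp_ternary are illustrative names, and max stands in for bbextents. Note that for snext and tnext the lower bound in the listings above is 16, not 0.

```c
#include <assert.h>

/* if/else form, as in the original listing */
int clamp_if(int s, int max)
{
    assert(max >= 0);
    if (s > max)
        s = max;
    else if (s < 0)
        s = 0;
    return s;
}

/* nested conditional form */
int clamp_ternary(int s, int max)
{
    assert(max >= 0);
    return s > max ? max : (s < 0 ? 0 : s);
}
```

Whether the second form is actually faster depends on what the compiler emits for it (conditional moves versus branches), so it's worth timing on the target.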
Posted: Thu Nov 11, 2010 1:29 am
by mankrip
Won't it be a bit slower, since it assigns s to itself even in cases where s shouldn't change?
Posted: Thu Nov 11, 2010 1:44 am
by mh
It doesn't really seem to make much difference actually. It's generating more instructions but they're faster instructions.
Posted: Thu Nov 11, 2010 2:53 am
by mankrip
Just for a laugh, here's the stippled version:
Code: Select all
void D_DrawSpans16_Stipple (espan_t *pspan) // Manoel Kasimier
{
int count, spancount;
byte *pbase, pcolor, *pdest; // Manoel Kasimier
fixed16_t s, t, snext, tnext, sstep, tstep, sstep2, tstep2; // Manoel Kasimier
float sdivz, tdivz, zi, z, du, dv, spancountminus1; // zi = 1/z; du, dv = span start u, v as floats
float sdivzstepu, tdivzstepu, zistepu; // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
int izi, izistep, izistep2, stipple; // Manoel Kasimier
short *pz; // Manoel Kasimier
pbase = (byte *)cacheblock;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
sdivzstepu = d_sdivzstepu * 16;
tdivzstepu = d_tdivzstepu * 16;
zistepu = d_zistepu * 16;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
// Manoel Kasimier - begin
// we count on FP exceptions being turned off to avoid range problems
izistep = (int)(d_zistepu * 0x8000 * 0x10000);
izistep2 = izistep*2;
// Manoel Kasimier - end
do
{
pdest = (byte *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);
stipple = ( ( (long) pdest - ( (long) pdest - (long) d_viewbuffer) / (long) screenwidth)) & 1; // Manoel Kasimier
pz = d_pzbuffer + (d_zwidth * pspan->v) + pspan->u; // Manoel Kasimier
count = pspan->count >> 4; // mh
spancount = pspan->count % 16; // Manoel Kasimier
// calculate the initial s/z, t/z, 1/z, s, and t and clamp
du = (float)pspan->u;
dv = (float)pspan->v;
sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
// we count on FP exceptions being turned off to avoid range problems // Manoel Kasimier
izi = (int) (zi * 0x8000 * 0x10000); // Manoel Kasimier
s = (int)(sdivz * z) + sadjust;
if (s > bbextents)
s = bbextents;
else if (s < 0)
s = 0;
t = (int)(tdivz * z) + tadjust;
if (t > bbextentt)
t = bbextentt;
else if (t < 0)
t = 0;
while (count-- > 0) // Manoel Kasimier
{
// calculate s/z, t/z, zi->fixed s and t at far end of span,
// calculate s and t steps across span by shifting
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
sdivz += sdivzstepu;
tdivz += tdivzstepu;
zi += zistepu;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (int) (sdivz * z) + sadjust;
if (snext > bbextents)
snext = bbextents;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
else if (snext <= 16)
snext = 16; // prevent round-off error on <0 steps causing overstepping & running off the edge of the texture
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
tnext = (int) (tdivz * z) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps
sstep = (snext - s) >> 4;
tstep = (tnext - t) >> 4;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
// Manoel Kasimier - begin
sstep2 = sstep * 2;
tstep2 = tstep * 2;
pdest += 16;
pz += 16;
if (stipple)
{
s += sstep; t += tstep; izi += izistep;
if (pz[-15] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-15] = pcolor; pz[-15] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[-13] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-13] = pcolor; pz[-13] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[-11] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-11] = pcolor; pz[-11] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -9] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -9] = pcolor; pz[ -9] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -7] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -7] = pcolor; pz[ -7] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -5] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -5] = pcolor; pz[ -5] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -3] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -3] = pcolor; pz[ -3] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -1] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -1] = pcolor; pz[ -1] = izi >> 16; } izi += izistep;
}
else
{
if (pz[-16] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-16] = pcolor; pz[-16] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[-14] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-14] = pcolor; pz[-14] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[-12] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-12] = pcolor; pz[-12] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[-10] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-10] = pcolor; pz[-10] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -8] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -8] = pcolor; pz[ -8] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -6] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -6] = pcolor; pz[ -6] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -4] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -4] = pcolor; pz[ -4] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
if (pz[ -2] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -2] = pcolor; pz[ -2] = izi >> 16; } izi += izistep2;
}
// Manoel Kasimier - end
s = snext;
t = tnext;
// Manoel Kasimier - begin
}
if (spancount > 0)
{
// Manoel Kasimier - end
// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so can't step off polygon),
// clamp, calculate s and t steps across span by division, biasing steps low so we don't run off the texture
spancountminus1 = (float)(spancount - 1);
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
sdivz += d_sdivzstepu * spancountminus1;
tdivz += d_tdivzstepu * spancountminus1;
zi += d_zistepu * spancountminus1;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
z = (float)0x10000 / zi; // prescale to 16.16 fixed-point
snext = (int)(sdivz * z) + sadjust;
if (snext > bbextents)
snext = bbextents;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
else if (snext < 16)
snext = 16; // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
tnext = (int)(tdivz * z) + tadjust;
if (tnext > bbextentt)
tnext = bbextentt;
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
else if (tnext < 16)
tnext = 16; // guard against round-off error on <0 steps
// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
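if (spancount > 1)
{
sstep = (snext - s) / (spancount - 1);
tstep = (tnext - t) / (spancount - 1);
// recompute the doubled steps for this partial span; otherwise the switch
// below uses values left over from the 16-pixel loop (or uninitialized ones)
sstep2 = sstep * 2;
tstep2 = tstep * 2;
}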
//qbism- Duff's Device loop unroll per mh.
pdest += spancount;
// Manoel Kasimier - begin
pz += spancount;
if (stipple)
switch (spancount)
{
case 15: s += sstep; t += tstep; izi += izistep;
if (pz[-15] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-15] = pcolor; pz[-15] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 13: if (pz[-13] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-13] = pcolor; pz[-13] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 11: if (pz[-11] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-11] = pcolor; pz[-11] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 9: if (pz[ -9] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -9] = pcolor; pz[ -9] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 7: if (pz[ -7] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -7] = pcolor; pz[ -7] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 5: if (pz[ -5] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -5] = pcolor; pz[ -5] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 3: if (pz[ -3] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -3] = pcolor; pz[ -3] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 1: if (pz[ -1] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -1] = pcolor; pz[ -1] = izi >> 16; }
break;
}
else
switch (spancount)
{
case 16: if (pz[-16] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-16] = pcolor; pz[-16] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 14: if (pz[-14] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-14] = pcolor; pz[-14] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 12: if (pz[-12] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-12] = pcolor; pz[-12] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 10: if (pz[-10] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-10] = pcolor; pz[-10] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 8: if (pz[ -8] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -8] = pcolor; pz[ -8] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 6: if (pz[ -6] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -6] = pcolor; pz[ -6] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 4: if (pz[ -4] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -4] = pcolor; pz[ -4] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
case 2: if (pz[ -2] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -2] = pcolor; pz[ -2] = izi >> 16; }
break;
}
}
// Manoel Kasimier - end
} while ((pspan = pspan->pnext) != NULL);
}
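Stripped of texturing and z-testing, the checkerboard idea looks something like this. It's only a sketch: the names are invented, and it takes the phase from (u + v) & 1 rather than from the pointer arithmetic used in the listing.

```c
#include <assert.h>

/* Hypothetical: 0 or 1 depending on which checkerboard colour pixel
   (u, v) falls on. */
int stipple_phase(int u, int v)
{
    assert(u >= 0 && v >= 0);
    return (u + v) & 1;
}

/* Write colour into every other pixel of a row segment starting at u,
   skipping the first pixel when the phase is 1. Returns pixels written. */
int draw_stippled_row(unsigned char *row, int u, int v, int count,
                      unsigned char colour)
{
    int i, written = 0;
    assert(count >= 0);
    for (i = stipple_phase(u, v); i < count; i += 2)
    {
        row[u + i] = colour;
        written++;
    }
    return written;
}
```

The point is that the skipped pixels are never visited at all, which is the same thing the unrolled listing does by stepping s, t and izi two pixels at a time.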
Posted: Thu Nov 11, 2010 3:04 am
by mankrip
mh wrote: It doesn't really seem to make much difference actually. It's generating more instructions but they're faster instructions.
It may be compiler/processor dependent then, I'll do some benchmarking in both the PC and Dreamcast versions to see how it goes.
Posted: Thu Nov 11, 2010 4:44 am
by Sajt
Having fun optimizing your debug builds?
Posted: Thu Nov 11, 2010 5:01 am
by mankrip
ROFL!
I'm quite interested in seeing how much faster the Dreamcast (and qbism's Flash) version will get with these optimizations. In particular, mh's unrolling code may have helped me finally figure out how to optimize the stippled water the way I wanted (preemptively skipping the unnecessary pixels), and this is something I've wanted to figure out for years.
Posted: Mon Nov 15, 2010 1:34 am
by leileilol
I've tested the mk+mh spans with a release build on a Pentium II, and they're actually slower than the qbism/mh D_DrawSpans16 posted above. I keep getting 40.2/40.6 with the mk/mh version with timedemo demo1, compared to the qbism version which had 41.2.
Can't say if the optimizations are beneficial for the dreamcast, since that's an overrated piece of hitachi-powered crap