Drawspans optimization in C for software Quake

Posted: Wed Oct 27, 2010 4:53 pm
by qbism
Piece-of-cake sw quake performance increase...

This assumes an unmodified D_DrawSpans8. I've looked through the code of several sw Quake projects and they were all unchanged... If you've got a modified one, please post about it here!

Cut and paste the new D_DrawSpans16 for a 5% to 10% FPS boost. Obviously, replace the calls to D_DrawSpans8 with D_DrawSpans16. Keep the old D_DrawSpans8 function in your code for comparison if you want, maybe behind some ifdefs if you're timing it.

This contains two optimizations: unrolling the inner loop with a modified Duff's Device, and increasing the span subdivision from 8 to 16 pixels (like the i386 asm did). Thanks to mh for the hard part!
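
For anyone who hasn't met the trick before, here is a minimal sketch of the switch fall-through unroll on its own, cut down to 4 cases and a plain byte copy. `copy_upto4` is a made-up name for illustration and is not part of the Quake source.

```c
#include <assert.h>

/* Hypothetical helper: copies n (0..4) bytes using switch fall-through.
   Entering at "case n" runs exactly n stores, with no loop counter. */
void copy_upto4 (unsigned char *dst, const unsigned char *src, int n)
{
	assert (n >= 0 && n <= 4);

	dst += n;	// point past the end, then index backwards
	src += n;
	switch (n)
	{
	case 4: dst[-4] = src[-4];	// fall through
	case 3: dst[-3] = src[-3];	// fall through
	case 2: dst[-2] = src[-2];	// fall through
	case 1: dst[-1] = src[-1];	// fall through
	case 0: break;
	}
}
```

Calling `copy_upto4 (dst, src, 3)` enters at `case 3` and performs three stores, which is the same shape as the 16-case switch in the function, just with a texture lookup in place of the copy.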

Code: Select all

/*
=============
D_DrawSpans16
=============
*/

void D_DrawSpans16 (espan_t *pspan) //qbism up it from 8 to 16.  This + unroll = big speed gain!
{
	int				count, spancount;
	unsigned char	*pbase, *pdest;
	fixed16_t		s, t, snext, tnext, sstep, tstep;
	float			sdivz, tdivz, zi, z, du, dv, spancountminus1;
	float			sdivzstepu, tdivzstepu, zistepu;

	sstep = 0;	// keep compiler happy
	tstep = 0;	// ditto

	pbase = (unsigned char *)cacheblock;

	sdivzstepu = d_sdivzstepu * 16;
	tdivzstepu = d_tdivzstepu * 16;
	zistepu = d_zistepu * 16;

	do
	{
		pdest = (unsigned char *)((byte *)d_viewbuffer +
				(screenwidth * pspan->v) + pspan->u);

		count = pspan->count;

	// calculate the initial s/z, t/z, 1/z, s, and t and clamp
		du = (float)pspan->u;
		dv = (float)pspan->v;

		sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
		tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
		zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
		z = (float)0x10000 / zi;	// prescale to 16.16 fixed-point

		s = (int)(sdivz * z) + sadjust;
		if (s > bbextents)
			s = bbextents;
		else if (s < 0)
			s = 0;

		t = (int)(tdivz * z) + tadjust;
		if (t > bbextentt)
			t = bbextentt;
		else if (t < 0)
			t = 0;

		do
		{
		// calculate s and t at the far end of the span
			if (count >= 16)
				spancount = 16;
			else
				spancount = count;

			count -= spancount;

			if (count)
			{
			// calculate s/z, t/z, zi->fixed s and t at far end of span,
			// calculate s and t steps across span by shifting
				sdivz += sdivzstepu;
				tdivz += tdivzstepu;
				zi += zistepu;
				z = (float)0x10000 / zi;	// prescale to 16.16 fixed-point

				snext = (int)(sdivz * z) + sadjust;
				if (snext > bbextents)
					snext = bbextents;
				else if (snext < 16)
					snext = 16;	// prevent round-off error on <0 steps from
								//  causing overstepping & running off the
								//  edge of the texture

				tnext = (int)(tdivz * z) + tadjust;
				if (tnext > bbextentt)
					tnext = bbextentt;
				else if (tnext < 16)
					tnext = 16;	// guard against round-off error on <0 steps

				sstep = (snext - s) >> 4;
				tstep = (tnext - t) >> 4;
			}
			else
			{
			// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so
			// can't step off polygon), clamp, calculate s and t steps across
			// span by division, biasing steps low so we don't run off the
			// texture
				spancountminus1 = (float)(spancount - 1);
				sdivz += d_sdivzstepu * spancountminus1;
				tdivz += d_tdivzstepu * spancountminus1;
				zi += d_zistepu * spancountminus1;
				z = (float)0x10000 / zi;	// prescale to 16.16 fixed-point
				snext = (int)(sdivz * z) + sadjust;
				if (snext > bbextents)
					snext = bbextents;
				else if (snext < 16)
					snext = 16;	// prevent round-off error on <0 steps from
								//  causing overstepping & running off the
								//  edge of the texture

				tnext = (int)(tdivz * z) + tadjust;
				if (tnext > bbextentt)
					tnext = bbextentt;
				else if (tnext < 16)
					tnext = 16;	// guard against round-off error on <0 steps

				if (spancount > 1)
				{
					sstep = (snext - s) / (spancount - 1);
					tstep = (tnext - t) / (spancount - 1);
				}
			}

//qbism- Duff's Device loop unroll per mh.
         pdest += spancount;
         switch (spancount)
         {
         case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 9: pdest[-9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 8: pdest[-8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 7: pdest[-7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 6: pdest[-6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 5: pdest[-5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 4: pdest[-4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 3: pdest[-3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 2: pdest[-2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 1: pdest[-1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         }

			s = snext;
			t = tnext;

		} while (count > 0);

	} while ((pspan = pspan->pnext) != NULL);
}
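
The texture coordinates in the inner loop are 16.16 fixed-point: the integer texel index is `s >> 16`, and the per-pixel step across a full 16-pixel span is `(snext - s) >> 4`. Below is a small sketch of just that arithmetic. `fixed16_t` is redeclared as a plain int and `step_span16` is a hypothetical name, so treat it as an illustration rather than engine code.

```c
#include <assert.h>

typedef int fixed16_t;	// assumption: 32-bit int, standing in for the Quake typedef

/* Hypothetical helper: linearly steps s from s_start towards s_next over
   16 pixels and stores the integer texel column for each pixel. */
void step_span16 (fixed16_t s_start, fixed16_t s_next, int texel_out[16])
{
	fixed16_t	s = s_start;
	fixed16_t	sstep;
	int			i;

	assert (s_next >= s_start);	// keep the sketch to non-negative steps

	sstep = (s_next - s_start) >> 4;	// divide by 16 with a shift
	for (i = 0; i < 16; i++)
	{
		texel_out[i] = s >> 16;	// drop the 16 fractional bits
		s += sstep;
	}
}
```

With `s_start = 0` and `s_next = 8 << 16` the step is half a texel, so each texel column is used for two consecutive pixels.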

Posted: Wed Oct 27, 2010 9:39 pm
by mh
I normally rename these functions with an "_C" after the name, so we'd have "D_DrawSpans16_C". Makes it clear which version you're using. ;)

Posted: Fri Oct 29, 2010 10:11 pm
by mrmmaclean
Excellent work! Thank you!

Posted: Wed Nov 10, 2010 11:41 pm
by mankrip
Replacing this

Code: Select all

		// calculate s and t at the far end of the span
			if (count >= 16)
				spancount = 16;
			else
				spancount = count;
with this

Code: Select all

		// calculate s and t at the far end of the span
			spancount = count % 17;
may also help a bit.

I'm still applying your changes to all my versions of D_DrawSpans (_Stippled, _Blended, _BlendedBackwards, _ColorKeyed), so I can't test this at the moment, but it should probably work.

[edit] Actually, it probably won't, since it will return zero for multiples of 17. Eew.

Posted: Thu Nov 11, 2010 12:54 am
by mh

Code: Select all

spancount = count > 15 ? 16 : count;
should be OK though. I haven't checked the generated asm.
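
A quick way to compare the candidates side by side, as a sketch. The three function names are made up for the comparison; only the expressions inside them come from the posts above.

```c
#include <assert.h>

/* The original if/else from D_DrawSpans16. */
int spancount_if (int count)
{
	assert (count >= 0);
	if (count >= 16)
		return 16;
	return count;
}

/* The ternary form. */
int spancount_ternary (int count)
{
	assert (count >= 0);
	return count > 15 ? 16 : count;
}

/* The modulo idea, kept only to make the failure visible. */
int spancount_mod17 (int count)
{
	assert (count >= 0);
	return count % 17;
}
```

The if/else and ternary forms agree for every non-negative count. The modulo form matches them only while count is 16 or less: 17 gives 0, and 18 gives 1 where the other two give 16.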

Posted: Thu Nov 11, 2010 12:55 am
by mankrip
Okay, this version should be a bit faster:

Code: Select all

// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
void D_DrawSpans16 (espan_t *pspan) //qbism up it from 8 to 16.  This + unroll = big speed gain!
{
	int			count, spancount;
	byte		*pbase, *pdest;
	fixed16_t	s, t, snext, tnext, sstep, tstep;
	float		sdivz, tdivz, zi, z, du, dv, spancountminus1;
	float		sdivzstepu, tdivzstepu, zistepu; // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )

	sstep = 0;   // keep compiler happy
	tstep = 0;   // ditto

	pbase = (byte *)cacheblock;

	// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
	sdivzstepu = d_sdivzstepu * 16;
	tdivzstepu = d_tdivzstepu * 16;
	zistepu = d_zistepu * 16;
	// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

	do
	{
		pdest = (byte *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);

		// Manoel Kasimier - begin
		count = pspan->count / 16;
		spancount = pspan->count % 16;
		// Manoel Kasimier - end

		// calculate the initial s/z, t/z, 1/z, s, and t and clamp
		du = (float)pspan->u;
		dv = (float)pspan->v;

		sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
		tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
		zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
		z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

		s = (int) (sdivz * z) + sadjust;
		if (s > bbextents)
			s = bbextents;
		else if (s < 0)
			s = 0;

		t = (int) (tdivz * z) + tadjust;
		if (t > bbextentt)
			t = bbextentt;
		else if (t < 0)
			t = 0;

		while (count-- > 0) // Manoel Kasimier
		{
			// calculate s/z, t/z, zi->fixed s and t at far end of span,
			// calculate s and t steps across span by shifting
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			sdivz += sdivzstepu;
			tdivz += tdivzstepu;
			zi += zistepu;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

			snext = (int) (sdivz * z) + sadjust;
			if (snext > bbextents)
				snext = bbextents;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (snext < 16)
				snext = 16;   // prevent round-off error on <0 steps causing overstepping & running off the edge of the texture
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			tnext = (int) (tdivz * z) + tadjust;
			if (tnext > bbextentt)
				tnext = bbextentt;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (tnext < 16)
				tnext = 16;   // guard against round-off error on <0 steps

			sstep = (snext - s) >> 4;
			tstep = (tnext - t) >> 4;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			// Manoel Kasimier - begin
			pdest += 16;
			pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			// Manoel Kasimier - end

			s = snext;
			t = tnext;
			// Manoel Kasimier - begin
		}
		if (spancount > 0)
		{
			// Manoel Kasimier - end

			// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so can't step off polygon),
			// clamp, calculate s and t steps across span by division, biasing steps low so we don't run off the texture
			spancountminus1 = (float)(spancount - 1);
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			sdivz += d_sdivzstepu * spancountminus1;
			tdivz += d_tdivzstepu * spancountminus1;
			zi += d_zistepu * spancountminus1;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
			snext = (int)(sdivz * z) + sadjust;
			if (snext > bbextents)
				snext = bbextents;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (snext < 16)
				snext = 16;   // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			tnext = (int)(tdivz * z) + tadjust;
			if (tnext > bbextentt)
				tnext = bbextentt;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (tnext < 16)
				tnext = 16;   // guard against round-off error on <0 steps
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			if (spancount > 1)
			{
				sstep = (snext - s) / (spancount - 1);
				tstep = (tnext - t) / (spancount - 1);
			}

			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			//qbism- Duff's Device loop unroll per mh.
			pdest += spancount;
			switch (spancount)
			{
				case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  9: pdest[ -9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  8: pdest[ -8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  7: pdest[ -7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  6: pdest[ -6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  5: pdest[ -5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  4: pdest[ -4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  3: pdest[ -3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  2: pdest[ -2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  1: pdest[ -1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				break;
			}
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
		}

	} while ( (pspan = pspan->pnext) != NULL);
}
I'm (again) applying these changes to the other versions of this function, so I can't benchmark it right now.
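
The restructuring boils down to "N full 16-pixel blocks, then one shorter tail". Here is a stripped-down sketch of that control flow with the texture mapping replaced by a constant fill; `fill_span` is a hypothetical helper, not something from the engine.

```c
#include <assert.h>

/* Hypothetical helper: writes `value` into count pixels, as full 16-pixel
   blocks followed by one shorter tail. Returns the number of pixels written. */
int fill_span (unsigned char *pdest, int count, unsigned char value)
{
	int	blocks, tail, i;
	int	written = 0;

	assert (count >= 0);

	blocks = count >> 4;	// full 16-pixel blocks
	tail = count & 15;		// leftover pixels, 0..15 (same as % 16 for count >= 0)

	while (blocks-- > 0)
	{
		for (i = 0; i < 16; i++)	// the real code unrolls this by hand
			pdest[i] = value;
		pdest += 16;
		written += 16;
	}

	for (i = 0; i < tail; i++)	// the real code uses the switch here
		pdest[i] = value;
	written += tail;

	return written;
}
```

One thing worth noting about this split: when the span length is an exact multiple of 16, the tail is empty, so the last 16 pixels go through the full-block path. In the earlier version the final block of a span always took the "last pixel in span" branch, so the two versions do not compute identical steps for that last block.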

Posted: Thu Nov 11, 2010 1:04 am
by mh
Replace

Code: Select all

count = pspan->count / 16;
with

Code: Select all

count = pspan->count >> 4;
I got a few extra frames for that.
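
A possible reason for the difference, offered as a guess rather than a measurement: in C, `/` on signed ints rounds toward zero, while `>>` on a negative value is implementation-defined, so unless the compiler can prove the value is non-negative it has to emit extra fix-up code for a signed `/ 16`. The sketch below just shows where the two forms agree; the function names are hypothetical.

```c
#include <assert.h>

/* Number of full 16-pixel blocks, written as a signed division. */
int blocks_div (int count)
{
	return count / 16;	// rounds toward zero, so negatives need a fix-up
}

/* Same thing as a shift; only used here for non-negative counts. */
int blocks_shift (int count)
{
	assert (count >= 0);	// right-shifting a negative int is implementation-defined
	return count >> 4;
}

/* Unsigned version: here / 16 and >> 4 are always the same operation. */
unsigned blocks_unsigned (unsigned count)
{
	return count / 16u;
}
```

Since a span count is never negative, both forms give the same block count in this function. Declaring the value unsigned is another way to let the compiler pick the shift by itself.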

Posted: Thu Nov 11, 2010 1:16 am
by mh
Replacing the ifs for calculating s, t, snext and tnext with ? : squeezes out some more.

Sample:

Code: Select all

s = s > bbextents ? bbextents : (s < 0 ? 0 : s);
was

Code: Select all

      if (s > bbextents)
         s = bbextents;
      else if (s < 0)
         s = 0;

Posted: Thu Nov 11, 2010 1:29 am
by mankrip
Won't it be a bit slower, since it still assigns s to itself in the cases where s shouldn't change?

Posted: Thu Nov 11, 2010 1:44 am
by mh
It doesn't really seem to make much difference actually. It's generating more instructions but they're faster instructions.
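
Both forms compute the same clamp. Whether either one ends up as branches or as conditional moves depends on the compiler and its flags, so the generated asm is the thing to check. A sketch for comparing them follows; the helper names are hypothetical, and the lower bound is passed in as `lo` so the same code covers both the 0 and the 16 clamps.

```c
#include <assert.h>

typedef int fixed16_t;	// assumption: stand-in for the Quake typedef

/* Clamp written with the original if / else if chain. */
fixed16_t clamp_if (fixed16_t s, fixed16_t lo, fixed16_t hi)
{
	assert (lo <= hi);
	if (s > hi)
		s = hi;
	else if (s < lo)
		s = lo;
	return s;
}

/* Clamp written with nested conditional expressions. */
fixed16_t clamp_ternary (fixed16_t s, fixed16_t lo, fixed16_t hi)
{
	assert (lo <= hi);
	return s > hi ? hi : (s < lo ? lo : s);
}
```

Compiling both with the same optimization flags and diffing the output is a cheap way to see whether the rewrite changes anything on a given compiler.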

Posted: Thu Nov 11, 2010 2:53 am
by mankrip
Just for a laugh, here's the stippled version:

Code: Select all

void D_DrawSpans16_Stipple (espan_t *pspan) // Manoel Kasimier
{
	int			count, spancount;
	byte		*pbase, pcolor, *pdest; // Manoel Kasimier
	fixed16_t	s, t, snext, tnext, sstep, tstep, sstep2, tstep2; // Manoel Kasimier
	float		sdivz, tdivz, zi, z, du, dv, spancountminus1; // zi = 1/z (inverse depth); du, dv = span start u, v as floats
	float		sdivzstepu, tdivzstepu, zistepu; // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
	int			izi, izistep, izistep2, stipple; // Manoel Kasimier
	short		*pz; // Manoel Kasimier

	pbase = (byte *)cacheblock;

	// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
	sdivzstepu = d_sdivzstepu * 16;
	tdivzstepu = d_tdivzstepu * 16;
	zistepu = d_zistepu * 16;
	// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

	// Manoel Kasimier - begin
	// we count on FP exceptions being turned off to avoid range problems
	izistep = (int)(d_zistepu * 0x8000 * 0x10000);
	izistep2 = izistep*2;
	// Manoel Kasimier - end

	do
	{
		pdest = (byte *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);
		stipple = ( ( (long) pdest - ( (long) pdest - (long) d_viewbuffer) / (long) screenwidth)) & 1; // Manoel Kasimier
		pz = d_pzbuffer + (d_zwidth * pspan->v) + pspan->u; // Manoel Kasimier

		count = pspan->count >> 4; // mh
		spancount = pspan->count % 16; // Manoel Kasimier

		// calculate the initial s/z, t/z, 1/z, s, and t and clamp
		du = (float)pspan->u;
		dv = (float)pspan->v;

		sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
		tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
		zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
		z = (float)0x10000 / zi;	// prescale to 16.16 fixed-point
		// we count on FP exceptions being turned off to avoid range problems // Manoel Kasimier
		izi = (int) (zi * 0x8000 * 0x10000); // Manoel Kasimier

		s = (int)(sdivz * z) + sadjust;
		if (s > bbextents)
			s = bbextents;
		else if (s < 0)
			s = 0;

		t = (int)(tdivz * z) + tadjust;
		if (t > bbextentt)
			t = bbextentt;
		else if (t < 0)
			t = 0;

		while (count-- > 0) // Manoel Kasimier
		{
			// calculate s/z, t/z, zi->fixed s and t at far end of span,
			// calculate s and t steps across span by shifting
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			sdivz += sdivzstepu;
			tdivz += tdivzstepu;
			zi += zistepu;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

			snext = (int) (sdivz * z) + sadjust;
			if (snext > bbextents)
				snext = bbextents;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (snext < 16)
				snext = 16;   // prevent round-off error on <0 steps causing overstepping & running off the edge of the texture
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			tnext = (int) (tdivz * z) + tadjust;
			if (tnext > bbextentt)
				tnext = bbextentt;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (tnext < 16)
				tnext = 16;   // guard against round-off error on <0 steps

			sstep = (snext - s) >> 4;
			tstep = (tnext - t) >> 4;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			// Manoel Kasimier - begin
			sstep2 = sstep * 2;
			tstep2 = tstep * 2;

			pdest += 16;
			pz += 16;
			if (stipple)
			{
				s += sstep; t += tstep; izi += izistep;
				if (pz[-15] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-15] = pcolor; pz[-15] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-13] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-13] = pcolor; pz[-13] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-11] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-11] = pcolor; pz[-11] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -9] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -9] = pcolor; pz[ -9] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -7] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -7] = pcolor; pz[ -7] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -5] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -5] = pcolor; pz[ -5] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -3] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -3] = pcolor; pz[ -3] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -1] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -1] = pcolor; pz[ -1] = izi >> 16; } izi += izistep;
			}
			else
			{
				if (pz[-16] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-16] = pcolor; pz[-16] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-14] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-14] = pcolor; pz[-14] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-12] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-12] = pcolor; pz[-12] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-10] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-10] = pcolor; pz[-10] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -8] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -8] = pcolor; pz[ -8] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -6] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -6] = pcolor; pz[ -6] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -4] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -4] = pcolor; pz[ -4] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -2] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -2] = pcolor; pz[ -2] = izi >> 16; } izi += izistep2;
			}
			// Manoel Kasimier - end

			s = snext;
			t = tnext;
			// Manoel Kasimier - begin
		}
		if (spancount > 0)
		{
			// Manoel Kasimier - end

			// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so can't step off polygon),
			// clamp, calculate s and t steps across span by division, biasing steps low so we don't run off the texture
			spancountminus1 = (float)(spancount - 1);
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			sdivz += d_sdivzstepu * spancountminus1;
			tdivz += d_tdivzstepu * spancountminus1;
			zi += d_zistepu * spancountminus1;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
			snext = (int)(sdivz * z) + sadjust;
			if (snext > bbextents)
				snext = bbextents;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (snext < 16)
				snext = 16;   // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			tnext = (int)(tdivz * z) + tadjust;
			if (tnext > bbextentt)
				tnext = bbextentt;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (tnext < 16)
				tnext = 16;   // guard against round-off error on <0 steps
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

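			if (spancount > 1)
			{
				sstep = (snext - s) / (spancount - 1);
				tstep = (tnext - t) / (spancount - 1);
				// the stippled cases below step two pixels at a time, so refresh the doubled steps too
				sstep2 = sstep * 2;
				tstep2 = tstep * 2;
			}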

			//qbism- Duff's Device loop unroll per mh.
			pdest += spancount;
			// Manoel Kasimier - begin
			pz += spancount;
			if (stipple)
				switch (spancount)
				{
					case 15: s += sstep; t += tstep; izi += izistep;
					if (pz[-15] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-15] = pcolor; pz[-15] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 13: if (pz[-13] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-13] = pcolor; pz[-13] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 11: if (pz[-11] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-11] = pcolor; pz[-11] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  9: if (pz[ -9] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -9] = pcolor; pz[ -9] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  7: if (pz[ -7] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -7] = pcolor; pz[ -7] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  5: if (pz[ -5] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -5] = pcolor; pz[ -5] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  3: if (pz[ -3] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -3] = pcolor; pz[ -3] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  1: if (pz[ -1] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -1] = pcolor; pz[ -1] = izi >> 16; }
					break;
				}
			else
				switch (spancount)
				{
					case 16: if (pz[-16] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-16] = pcolor; pz[-16] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 14: if (pz[-14] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-14] = pcolor; pz[-14] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 12: if (pz[-12] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-12] = pcolor; pz[-12] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 10: if (pz[-10] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-10] = pcolor; pz[-10] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  8: if (pz[ -8] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -8] = pcolor; pz[ -8] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  6: if (pz[ -6] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -6] = pcolor; pz[ -6] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  4: if (pz[ -4] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -4] = pcolor; pz[ -4] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  2: if (pz[ -2] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -2] = pcolor; pz[ -2] = izi >> 16; }
					break;
				}
		}
			// Manoel Kasimier - end
	} while ((pspan = pspan->pnext) != NULL);
}
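
The core idea of the stippled version is to visit only every other pixel and step the texture coordinates by two pixels at a time. Here is a minimal sketch of the pixel-skipping part alone. `draw_stippled_span` is a hypothetical helper, and it takes the parity from `(u + v) & 1`, which is an assumption of this sketch and not how the function above derives `stipple`.

```c
#include <assert.h>

/* Hypothetical helper: draws every other pixel of one span so that
   adjacent rows are offset, giving a checkerboard. Returns pixels written. */
int draw_stippled_span (unsigned char *row, int u, int v, int count, unsigned char color)
{
	int	first, i;
	int	written = 0;

	assert (u >= 0 && v >= 0 && count >= 0);

	first = (u + v) & 1;	// assumption: parity taken from screen coordinates

	for (i = first; i < count; i += 2)	// skip the unwanted pixels up front
	{
		row[u + i] = color;
		written++;
	}
	return written;
}
```

With the parity taken from the screen position, a pixel at column x on row v is drawn only when x + v is even, whatever the span's start or length.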

Posted: Thu Nov 11, 2010 3:04 am
by mankrip
mh wrote: It doesn't really seem to make much difference actually. It's generating more instructions but they're faster instructions.
It may be compiler/processor dependent then, I'll do some benchmarking in both the PC and Dreamcast versions to see how it goes.

Posted: Thu Nov 11, 2010 4:44 am
by Sajt
Having fun optimizing your debug builds? :P

Posted: Thu Nov 11, 2010 5:01 am
by mankrip
:lol: ROFL!

I'm quite interested in seeing how much faster the Dreamcast (and qbism's Flash) version will get with these optimizations. In particular, mh's unrolling code may have helped me to finally figure out how to optimize the stippled water the way I wanted (preemptively skipping the unnecessary pixels), and this is something I've wanted to figure out for years.

Posted: Mon Nov 15, 2010 1:34 am
by leileilol
I've tested the mk+mh spans with a release build on a Pentium II, and they're actually slower than the qbism/mh span16 posted above. I keep getting 40.2/40.6 fps with the mk/mh version in timedemo demo1, compared to 41.2 fps for the qbism version.

Can't say if the optimizations are beneficial for the Dreamcast, since that's an overrated piece of Hitachi-powered crap.