Drawspans optimization in C for software Quake

qbism
Posts: 1236
Joined: Thu Nov 04, 2004 5:51 am

Post by qbism »

A piece-of-cake performance increase for software (sw) Quake...

This assumes an unmodified D_DrawSpans8. I've looked through the code of several sw Quake projects and they were all unchanged... If you've got a modified one, please post about it here!

Cut and paste the new D_DrawSpans16 for a 5% to 10% FPS boost, and replace calls to D_DrawSpans8 with it. Keep the old D_DrawSpans8 function in your code for comparison if you want, maybe behind some #ifdefs if you're timing it.
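One way to keep both routines around for timing is a compile-time switch. This is only a sketch: USE_DRAWSPANS16 and the D_DrawSpans alias are made-up names, espan_t is cut down to a few fields, and the two functions are stubs standing in for the real bodies.

```c
#include <stddef.h>

/* Hypothetical build flag (not in the Quake source): set to 0 to time the old routine. */
#ifndef USE_DRAWSPANS16
#define USE_DRAWSPANS16 1
#endif

/* Cut-down stand-in for the engine's espan_t so the sketch compiles alone. */
typedef struct espan_s
{
	int				u, v, count;
	struct espan_s	*pnext;
} espan_t;

int last_span_length;	/* records which stub ran, for illustration only */

/* Stubs: the real D_DrawSpans8 / D_DrawSpans16 bodies would go here. */
void D_DrawSpans8 (espan_t *pspan)  { (void)pspan; last_span_length = 8; }
void D_DrawSpans16 (espan_t *pspan) { (void)pspan; last_span_length = 16; }

/* Callers use the alias, so switching routines is a one-line change. */
#if USE_DRAWSPANS16
#define D_DrawSpans D_DrawSpans16
#else
#define D_DrawSpans D_DrawSpans8
#endif
```

Building once with -DUSE_DRAWSPANS16=0 and once without gives two executables to run the same timedemo on.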

This contains two optimizations: unrolling the inner loop with a modified Duff's Device, and increasing the span subdivision length from 8 to 16 pixels (like the i386 asm did). Thanks to mh for the hard part!
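For anyone who hasn't met the unrolling trick, here is a stripped-down sketch of the same pattern on a plain byte copy. copy_unrolled and BLOCK are illustrative names, and the block size is 4 instead of 16 only to keep it short: the pointers are advanced first, then the switch jumps into the unrolled body and falls through to the end.

```c
#include <assert.h>

#define BLOCK 4	/* the real routine uses 16; 4 keeps the sketch short */

/* Copy count bytes in blocks of up to BLOCK, with the body unrolled
   via switch fall-through (hypothetical helper, not engine code). */
static void copy_unrolled (unsigned char *pdest, const unsigned char *psrc, int count)
{
	int spancount;

	assert (count >= 0);

	while (count > 0)
	{
		spancount = count > BLOCK ? BLOCK : count;
		count -= spancount;

		/* advance first, then index backwards from the new position */
		pdest += spancount;
		psrc += spancount;

		switch (spancount)
		{
		case 4: pdest[-4] = psrc[-4];	/* fall through */
		case 3: pdest[-3] = psrc[-3];	/* fall through */
		case 2: pdest[-2] = psrc[-2];	/* fall through */
		case 1: pdest[-1] = psrc[-1];
		}
	}
}
```

The missing break statements are deliberate: entering at case N runs exactly N of the unrolled statements.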

Code: Select all

/*
=============
D_DrawSpans16
=============
*/

void D_DrawSpans16 (espan_t *pspan) // qbism: upped it from 8 to 16.  This + unroll = big speed gain!
{
	int				count, spancount;
	unsigned char	*pbase, *pdest;
	fixed16_t		s, t, snext, tnext, sstep, tstep;
	float			sdivz, tdivz, zi, z, du, dv, spancountminus1;
	float			sdivzstepu, tdivzstepu, zistepu;

	sstep = 0;	// keep compiler happy
	tstep = 0;	// ditto

	pbase = (unsigned char *)cacheblock;

	sdivzstepu = d_sdivzstepu * 16;
	tdivzstepu = d_tdivzstepu * 16;
	zistepu = d_zistepu * 16;

	do
	{
		pdest = (unsigned char *)((byte *)d_viewbuffer +
				(screenwidth * pspan->v) + pspan->u);

		count = pspan->count;

	// calculate the initial s/z, t/z, 1/z, s, and t and clamp
		du = (float)pspan->u;
		dv = (float)pspan->v;

		sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
		tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
		zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
		z = (float)0x10000 / zi;	// prescale to 16.16 fixed-point

		s = (int)(sdivz * z) + sadjust;
		if (s > bbextents)
			s = bbextents;
		else if (s < 0)
			s = 0;

		t = (int)(tdivz * z) + tadjust;
		if (t > bbextentt)
			t = bbextentt;
		else if (t < 0)
			t = 0;

		do
		{
		// calculate s and t at the far end of the span
			if (count >= 16)
				spancount = 16;
			else
				spancount = count;

			count -= spancount;

			if (count)
			{
			// calculate s/z, t/z, zi->fixed s and t at far end of span,
			// calculate s and t steps across span by shifting
				sdivz += sdivzstepu;
				tdivz += tdivzstepu;
				zi += zistepu;
				z = (float)0x10000 / zi;	// prescale to 16.16 fixed-point

				snext = (int)(sdivz * z) + sadjust;
				if (snext > bbextents)
					snext = bbextents;
				else if (snext < 16)
					snext = 16;	// prevent round-off error on <0 steps from
								//  causing overstepping & running off the
								//  edge of the texture

				tnext = (int)(tdivz * z) + tadjust;
				if (tnext > bbextentt)
					tnext = bbextentt;
				else if (tnext < 16)
					tnext = 16;	// guard against round-off error on <0 steps

				sstep = (snext - s) >> 4;
				tstep = (tnext - t) >> 4;
			}
			else
			{
			// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so
			// can't step off polygon), clamp, calculate s and t steps across
			// span by division, biasing steps low so we don't run off the
			// texture
				spancountminus1 = (float)(spancount - 1);
				sdivz += d_sdivzstepu * spancountminus1;
				tdivz += d_tdivzstepu * spancountminus1;
				zi += d_zistepu * spancountminus1;
				z = (float)0x10000 / zi;	// prescale to 16.16 fixed-point
				snext = (int)(sdivz * z) + sadjust;
				if (snext > bbextents)
					snext = bbextents;
				else if (snext < 16)
					snext = 16;	// prevent round-off error on <0 steps from
								//  causing overstepping & running off the
								//  edge of the texture

				tnext = (int)(tdivz * z) + tadjust;
				if (tnext > bbextentt)
					tnext = bbextentt;
				else if (tnext < 16)
					tnext = 16;	// guard against round-off error on <0 steps

				if (spancount > 1)
				{
					sstep = (snext - s) / (spancount - 1);
					tstep = (tnext - t) / (spancount - 1);
				}
			}

//qbism- Duff's Device loop unroll per mh.
         pdest += spancount;
         switch (spancount)
         {
         case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 9: pdest[-9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 8: pdest[-8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 7: pdest[-7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 6: pdest[-6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 5: pdest[-5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 4: pdest[-4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 3: pdest[-3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 2: pdest[-2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         case 1: pdest[-1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
         }

			s = snext;
			t = tnext;

		} while (count > 0);

	} while ((pspan = pspan->pnext) != NULL);
}
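The inner loop leans on 16.16 fixed-point stepping: s and t carry the texel coordinate in the upper 16 bits, and the per-pixel step across a full 16-pixel span is the delta shifted right by 4. A minimal one-axis sketch of that arithmetic, assuming a 32-bit int (texel_of, step16 and sample_span16 are made-up helper names):

```c
typedef int fixed16_t;	/* 16.16 fixed point, as in Quake; assumes a 32-bit int */

/* Texel coordinate (integer part) of a 16.16 value. */
static int texel_of (fixed16_t f)
{
	return f >> 16;
}

/* Per-pixel step across a 16-pixel span: divide the delta by 16 with a shift.
   Right-shifting a negative int is implementation-defined in C; this sketch
   assumes the usual arithmetic shift. */
static fixed16_t step16 (fixed16_t s, fixed16_t snext)
{
	return (snext - s) >> 4;
}

/* Fill out[0..15] with the texel column sampled at each pixel of the span. */
static void sample_span16 (fixed16_t s, fixed16_t snext, int out[16])
{
	fixed16_t	sstep = step16 (s, snext);
	int			i;

	for (i = 0; i < 16; i++)
	{
		out[i] = texel_of (s);
		s += sstep;
	}
}
```

The expensive perspective divide happens once per span; everything between the span ends is integer adds and shifts.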
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Post by mh »

I normally rename these functions with an "_C" after the name, so we'd have "D_DrawSpans16_C". Makes it clear which version you're using. ;)
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
mrmmaclean
Posts: 33
Joined: Sun Aug 22, 2010 2:49 am

Post by mrmmaclean »

Excellent work! Thank you!
mankrip
Posts: 924
Joined: Fri Jul 04, 2008 3:02 am

Post by mankrip »

Replacing this

Code: Select all

		// calculate s and t at the far end of the span
			if (count >= 16)
				spancount = 16;
			else
				spancount = count;
with this

Code: Select all

		// calculate s and t at the far end of the span
			spancount = count % 17;
may also help a bit.

I'm still applying your changes to all my versions of D_DrawSpans (_Stippled, _Blended, _BlendedBackwards, _ColorKeyed), so I can't test this at the moment, but it should probably work.

[edit] Actually, it probably won't, since it will return zero for multiples of 17. Eew.
Ph'nglui mglw'nafh mankrip Hell's end wgah'nagl fhtagn.
==-=-=-=-=-=-=-=-=-=-==
Dev blog / Twitter / YouTube
mh

Post by mh »

Code: Select all

spancount = count > 15 ? 16 : count;
should be OK though. I haven't checked the generated asm.
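A quick side-by-side of the two ideas, as a sketch with made-up function names: the modulo wraps around at 17 where the span code needs the value to saturate at 16.

```c
/* The modulo idea: wraps around (17 -> 0, 18 -> 1, ...) instead of saturating. */
static int spancount_mod17 (int count)
{
	return count % 17;
}

/* The conditional form: anything above 15 becomes a full 16-pixel span. */
static int spancount_clamp (int count)
{
	return count > 15 ? 16 : count;
}
```

The two agree for counts from 0 to 16 and diverge from 17 onwards, which is exactly where a zero spancount would stall the loop.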
mankrip

Post by mankrip »

Okay, this version should be a bit faster:

Code: Select all

// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
void D_DrawSpans16 (espan_t *pspan) //qbism up it from 8 to 16.  This + unroll = big speed gain!
{
	int			count, spancount;
	byte		*pbase, *pdest;
	fixed16_t	s, t, snext, tnext, sstep, tstep;
	float		sdivz, tdivz, zi, z, du, dv, spancountminus1;
	float		sdivzstepu, tdivzstepu, zistepu; // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )

	sstep = 0;   // keep compiler happy
	tstep = 0;   // ditto

	pbase = (byte *)cacheblock;

	// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
	sdivzstepu = d_sdivzstepu * 16;
	tdivzstepu = d_tdivzstepu * 16;
	zistepu = d_zistepu * 16;
	// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

	do
	{
		pdest = (byte *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);

		// Manoel Kasimier - begin
		count = pspan->count / 16;
		spancount = pspan->count % 16;
		// Manoel Kasimier - end

		// calculate the initial s/z, t/z, 1/z, s, and t and clamp
		du = (float)pspan->u;
		dv = (float)pspan->v;

		sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
		tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
		zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
		z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

		s = (int) (sdivz * z) + sadjust;
		if (s > bbextents)
			s = bbextents;
		else if (s < 0)
			s = 0;

		t = (int) (tdivz * z) + tadjust;
		if (t > bbextentt)
			t = bbextentt;
		else if (t < 0)
			t = 0;

		while (count-- > 0) // Manoel Kasimier
		{
			// calculate s/z, t/z, zi->fixed s and t at far end of span,
			// calculate s and t steps across span by shifting
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			sdivz += sdivzstepu;
			tdivz += tdivzstepu;
			zi += zistepu;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

			snext = (int) (sdivz * z) + sadjust;
			if (snext > bbextents)
				snext = bbextents;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (snext < 16)
				snext = 16;   // prevent round-off error on <0 steps causing overstepping & running off the edge of the texture
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			tnext = (int) (tdivz * z) + tadjust;
			if (tnext > bbextentt)
				tnext = bbextentt;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (tnext < 16)
				tnext = 16;   // guard against round-off error on <0 steps

			sstep = (snext - s) >> 4;
			tstep = (tnext - t) >> 4;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			// Manoel Kasimier - begin
			pdest += 16;
			pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			pdest[ -1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
			// Manoel Kasimier - end

			s = snext;
			t = tnext;
			// Manoel Kasimier - begin
		}
		if (spancount > 0)
		{
			// Manoel Kasimier - end

			// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so can't step off polygon),
			// clamp, calculate s and t steps across span by division, biasing steps low so we don't run off the texture
			spancountminus1 = (float)(spancount - 1);
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			sdivz += d_sdivzstepu * spancountminus1;
			tdivz += d_tdivzstepu * spancountminus1;
			zi += d_zistepu * spancountminus1;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
			snext = (int)(sdivz * z) + sadjust;
			if (snext > bbextents)
				snext = bbextents;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (snext < 16)
				snext = 16;   // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			tnext = (int)(tdivz * z) + tadjust;
			if (tnext > bbextentt)
				tnext = bbextentt;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (tnext < 16)
				tnext = 16;   // guard against round-off error on <0 steps
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			if (spancount > 1)
			{
				sstep = (snext - s) / (spancount - 1);
				tstep = (tnext - t) / (spancount - 1);
			}

			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			//qbism- Duff's Device loop unroll per mh.
			pdest += spancount;
			switch (spancount)
			{
				case 16: pdest[-16] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 15: pdest[-15] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 14: pdest[-14] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 13: pdest[-13] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 12: pdest[-12] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 11: pdest[-11] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case 10: pdest[-10] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  9: pdest[ -9] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  8: pdest[ -8] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  7: pdest[ -7] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  6: pdest[ -6] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  5: pdest[ -5] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  4: pdest[ -4] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  3: pdest[ -3] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  2: pdest[ -2] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				case  1: pdest[ -1] = pbase[(s >> 16) + (t >> 16) * cachewidth]; s += sstep; t += tstep;
				break;
			}
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
		}

	} while ( (pspan = pspan->pnext) != NULL);
}
I'm (again) applying these changes to the other versions of this function, so I can't benchmark it right now.
mh

Post by mh »

Replace

Code: Select all

count = pspan->count / 16;
with

Code: Select all

count = pspan->count >> 4;
I got a few extra frames for that.
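Why this can matter: count is a signed int, and signed division has to round toward zero, so unless the compiler knows the value is non-negative it may emit a shift plus fix-up code for / 16. An optimizing build will often do the rewrite itself. The sketch below (function names are made up, and the & 15 form for the remainder is my own extrapolation, not something from this thread) shows the forms agree for the non-negative counts a span can have.

```c
/* Number of full 16-pixel blocks in a span. */
static int blocks_div (int count)   { return count / 16; }
static int blocks_shift (int count) { return count >> 4; }

/* Leftover pixels after the full blocks. */
static int rest_mod (int count)  { return count % 16; }
static int rest_mask (int count) { return count & 15; }	/* assumption: same trick applied to % */
```

For negative inputs the pairs differ (for example -1 / 16 is 0 and -1 % 16 is -1 in C99), so the shift and mask are only safe because span counts never go below zero.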
mh

Post by mh »

Replacing the ifs that clamp s, t, snext and tnext with the ?: operator squeezes out some more.

Sample:

Code: Select all

s = s > bbextents ? bbextents : (s < 0 ? 0 : s);
was

Code: Select all

      if (s > bbextents)
         s = bbextents;
      else if (s < 0)
         s = 0;
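Wrapped up as helpers, the two forms look like this. It's a sketch with hypothetical names (clamp_if, clamp_ternary); the lower bound is a parameter because s and t clamp to 0 while snext and tnext clamp to 16. Whether the ?: form actually compiles to branch-free code depends on the compiler and target.

```c
typedef int fixed16_t;	/* 16.16 fixed point, as in Quake; assumes a 32-bit int */

/* if/else form, as in the original code */
static fixed16_t clamp_if (fixed16_t v, fixed16_t lo, fixed16_t hi)
{
	if (v > hi)
		v = hi;
	else if (v < lo)
		v = lo;
	return v;
}

/* nested ?: form */
static fixed16_t clamp_ternary (fixed16_t v, fixed16_t lo, fixed16_t hi)
{
	return v > hi ? hi : (v < lo ? lo : v);
}
```

Both return the same value for every input, so any speed difference comes purely from the generated code.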
mankrip

Post by mankrip »

Won't it be a bit slower, since it no longer avoids assigning s to itself in cases where s shouldn't change?
mh

Post by mh »

It doesn't really seem to make much difference actually. It's generating more instructions but they're faster instructions.
mankrip

Post by mankrip »

Just for a laugh, here's the stippled version:

Code: Select all

void D_DrawSpans16_Stipple (espan_t *pspan) // Manoel Kasimier
{
	int			count, spancount;
	byte		*pbase, pcolor, *pdest; // Manoel Kasimier
	fixed16_t	s, t, snext, tnext, sstep, tstep, sstep2, tstep2; // Manoel Kasimier
	float		sdivz, tdivz, zi, z, du, dv, spancountminus1; // zi = 1/z (inverse depth); du, dv = span start u, v as floats
	float		sdivzstepu, tdivzstepu, zistepu; // qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 )
	int			izi, izistep, izistep2, stipple; // Manoel Kasimier
	short		*pz; // Manoel Kasimier

	pbase = (byte *)cacheblock;

	// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
	sdivzstepu = d_sdivzstepu * 16;
	tdivzstepu = d_tdivzstepu * 16;
	zistepu = d_zistepu * 16;
	// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

	// Manoel Kasimier - begin
	// we count on FP exceptions being turned off to avoid range problems
	izistep = (int)(d_zistepu * 0x8000 * 0x10000);
	izistep2 = izistep*2;
	// Manoel Kasimier - end

	do
	{
		pdest = (byte *)((byte *)d_viewbuffer + (screenwidth * pspan->v) + pspan->u);
		stipple = ( ( (long) pdest - ( (long) pdest - (long) d_viewbuffer) / (long) screenwidth)) & 1; // Manoel Kasimier
		pz = d_pzbuffer + (d_zwidth * pspan->v) + pspan->u; // Manoel Kasimier

		count = pspan->count >> 4; // mh
		spancount = pspan->count % 16; // Manoel Kasimier

		// calculate the initial s/z, t/z, 1/z, s, and t and clamp
		du = (float)pspan->u;
		dv = (float)pspan->v;

		sdivz = d_sdivzorigin + dv*d_sdivzstepv + du*d_sdivzstepu;
		tdivz = d_tdivzorigin + dv*d_tdivzstepv + du*d_tdivzstepu;
		zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;
		z = (float)0x10000 / zi;	// prescale to 16.16 fixed-point
		// we count on FP exceptions being turned off to avoid range problems // Manoel Kasimier
		izi = (int) (zi * 0x8000 * 0x10000); // Manoel Kasimier

		s = (int)(sdivz * z) + sadjust;
		if (s > bbextents)
			s = bbextents;
		else if (s < 0)
			s = 0;

		t = (int)(tdivz * z) + tadjust;
		if (t > bbextentt)
			t = bbextentt;
		else if (t < 0)
			t = 0;

		while (count-- > 0) // Manoel Kasimier
		{
			// calculate s/z, t/z, zi->fixed s and t at far end of span,
			// calculate s and t steps across span by shifting
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			sdivz += sdivzstepu;
			tdivz += tdivzstepu;
			zi += zistepu;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point

			snext = (int) (sdivz * z) + sadjust;
			if (snext > bbextents)
				snext = bbextents;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (snext < 16)
				snext = 16;   // prevent round-off error on <0 steps causing overstepping & running off the edge of the texture
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			tnext = (int) (tdivz * z) + tadjust;
			if (tnext > bbextentt)
				tnext = bbextentt;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (tnext < 16)
				tnext = 16;   // guard against round-off error on <0 steps

			sstep = (snext - s) >> 4;
			tstep = (tnext - t) >> 4;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			// Manoel Kasimier - begin
			sstep2 = sstep * 2;
			tstep2 = tstep * 2;

			pdest += 16;
			pz += 16;
			if (stipple)
			{
				s += sstep; t += tstep; izi += izistep;
				if (pz[-15] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-15] = pcolor; pz[-15] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-13] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-13] = pcolor; pz[-13] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-11] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-11] = pcolor; pz[-11] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -9] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -9] = pcolor; pz[ -9] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -7] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -7] = pcolor; pz[ -7] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -5] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -5] = pcolor; pz[ -5] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -3] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -3] = pcolor; pz[ -3] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -1] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -1] = pcolor; pz[ -1] = izi >> 16; } izi += izistep;
			}
			else
			{
				if (pz[-16] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-16] = pcolor; pz[-16] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-14] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-14] = pcolor; pz[-14] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-12] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-12] = pcolor; pz[-12] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[-10] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-10] = pcolor; pz[-10] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -8] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -8] = pcolor; pz[ -8] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -6] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -6] = pcolor; pz[ -6] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -4] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -4] = pcolor; pz[ -4] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
				if (pz[ -2] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -2] = pcolor; pz[ -2] = izi >> 16; } izi += izistep2;
			}
			// Manoel Kasimier - end

			s = snext;
			t = tnext;
			// Manoel Kasimier - begin
		}
		if (spancount > 0)
		{
			// Manoel Kasimier - end

			// calculate s/z, t/z, zi->fixed s and t at last pixel in span (so can't step off polygon),
			// clamp, calculate s and t steps across span by division, biasing steps low so we don't run off the texture
			spancountminus1 = (float)(spancount - 1);
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			sdivz += d_sdivzstepu * spancountminus1;
			tdivz += d_tdivzstepu * spancountminus1;
			zi += d_zistepu * spancountminus1;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end
			z = (float)0x10000 / zi;   // prescale to 16.16 fixed-point
			snext = (int)(sdivz * z) + sadjust;
			if (snext > bbextents)
				snext = bbextents;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (snext < 16)
				snext = 16;   // prevent round-off error on <0 steps from causing overstepping & running off the edge of the texture
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

			tnext = (int)(tdivz * z) + tadjust;
			if (tnext > bbextentt)
				tnext = bbextentt;
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - begin
			else if (tnext < 16)
				tnext = 16;   // guard against round-off error on <0 steps
			// qbism ( http://forums.inside3d.com/viewtopic.php?t=2717 ) - end

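			if (spancount > 1)
			{
				sstep = (snext - s) / (spancount - 1);
				tstep = (tnext - t) / (spancount - 1);
				// refresh the doubled steps; otherwise they are stale from the 16-pixel loop above
				sstep2 = sstep * 2;
				tstep2 = tstep * 2;
			}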

			//qbism- Duff's Device loop unroll per mh.
			pdest += spancount;
			// Manoel Kasimier - begin
			pz += spancount;
			if (stipple)
				switch (spancount)
				{
					case 15: s += sstep; t += tstep; izi += izistep;
					if (pz[-15] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-15] = pcolor; pz[-15] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 13: if (pz[-13] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-13] = pcolor; pz[-13] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 11: if (pz[-11] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-11] = pcolor; pz[-11] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  9: if (pz[ -9] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -9] = pcolor; pz[ -9] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  7: if (pz[ -7] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -7] = pcolor; pz[ -7] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  5: if (pz[ -5] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -5] = pcolor; pz[ -5] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  3: if (pz[ -3] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -3] = pcolor; pz[ -3] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  1: if (pz[ -1] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -1] = pcolor; pz[ -1] = izi >> 16; }
					break;
				}
			else
				switch (spancount)
				{
					case 16: if (pz[-16] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-16] = pcolor; pz[-16] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 14: if (pz[-14] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-14] = pcolor; pz[-14] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 12: if (pz[-12] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-12] = pcolor; pz[-12] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case 10: if (pz[-10] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[-10] = pcolor; pz[-10] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  8: if (pz[ -8] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -8] = pcolor; pz[ -8] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  6: if (pz[ -6] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -6] = pcolor; pz[ -6] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  4: if (pz[ -4] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -4] = pcolor; pz[ -4] = izi >> 16; } s += sstep2; t += tstep2; izi += izistep2;
					case  2: if (pz[ -2] <= (izi >> 16)) if ( (pcolor = *(pbase + (s >> 16) + (t >> 16) * cachewidth)) != 255) { pdest[ -2] = pcolor; pz[ -2] = izi >> 16; }
					break;
				}
		}
			// Manoel Kasimier - end
	} while ((pspan = pspan->pnext) != NULL);
}
mankrip

Post by mankrip »

mh wrote: It doesn't really seem to make much difference actually. It's generating more instructions but they're faster instructions.
It may be compiler/processor dependent then. I'll do some benchmarking on both the PC and Dreamcast versions to see how it goes.
Sajt
Posts: 1215
Joined: Sat Oct 16, 2004 3:39 am

Post by Sajt »

Having fun optimizing your debug builds? :P
F. A. Špork, an enlightened nobleman and a great patron of art, had a stately Baroque spa complex built on the banks of the River Labe.
mankrip

Post by mankrip »

:lol: ROFL!

I'm quite interested in seeing how much faster the Dreamcast (and qbism's Flash) version will get with these optimizations. In particular, mh's unrolling code may have helped me to finally figure out how to optimize the stippled water the way I wanted (preemptively skipping the unnecessary pixels), and this is something I've wanted to figure out for years.
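The skipping idea boils down to this, shown as a simplified sketch rather than the real routine: one texture axis only, no z-buffer or colorkey test, and draw_stippled is a made-up name. Instead of visiting every pixel and testing whether to draw it, start on the first pixel of the checkerboard phase and advance by two pixel-steps at a time.

```c
#include <assert.h>

typedef int fixed16_t;	/* 16.16 fixed point, as in Quake; assumes a 32-bit int */

/* Draw only every other pixel of an n-pixel run. phase (0 or 1) selects
   which half of the checkerboard this row uses. */
static void draw_stippled (unsigned char *pdest, const unsigned char *ptex,
	fixed16_t s, fixed16_t sstep, int n, int phase)
{
	fixed16_t	sstep2 = sstep * 2;
	int			i;

	assert (phase == 0 || phase == 1);

	s += phase * sstep;	/* move to the first pixel that is actually drawn */

	for (i = phase; i < n; i += 2)
	{
		pdest[i] = ptex[s >> 16];
		s += sstep2;	/* jump over the skipped pixel */
	}
}
```

Skipped pixels cost nothing at all here, which is the whole point: half the texture fetches and half the writes.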
leileilol
Posts: 2783
Joined: Fri Oct 15, 2004 3:23 am

Post by leileilol »

I've tested the mk+mh spans with a release build on a Pentium II, and they're actually slower than the qbism/mh span16 posted above. I keep getting 40.2/40.6 fps with the mk/mh version in timedemo demo1, compared to 41.2 fps for the qbism version.

Can't say if the optimizations are beneficial for the Dreamcast, since that's an overrated piece of Hitachi-powered crap.
i should not be here