R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Discuss programming topics for the various GPL'd game engine sources.
qbism
Posts: 1236
Joined: Thu Nov 04, 2004 5:51 am

R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by qbism »

Code:

static byte    *vidcolmap;
...

vidcolmap = (byte *)vid.colormap;
...

void R_DrawSurfaceBlock8_mip0 (void)
{
    int	v, i, lightstep, light;
    byte	*psource, *prowdest;

    psource = pbasesource;
    prowdest = prowdestbase;

    for (v=0 ; v<r_numvblocks ; v++)
    {
        lightleft = r_lightptr[0];
        lightright = r_lightptr[1];
        r_lightptr += r_lightwidth;
        lightleftstep = (r_lightptr[0] - lightleft) >> 4;
        lightrightstep = (r_lightptr[1] - lightright) >> 4;

        for (i=0 ; i<16 ; i++)
        {
            lightstep = (lightleft - lightright) >> 4;
            light = lightright;

            prowdest[15] = vidcolmap[((light += lightstep) & 0xFF00) + psource[15]];
            prowdest[14] = vidcolmap[((light += lightstep) & 0xFF00) + psource[14]];
            prowdest[13] = vidcolmap[((light += lightstep) & 0xFF00) + psource[13]];
            prowdest[12] = vidcolmap[((light += lightstep) & 0xFF00) + psource[12]];
            prowdest[11] = vidcolmap[((light += lightstep) & 0xFF00) + psource[11]];
            prowdest[10] = vidcolmap[((light += lightstep) & 0xFF00) + psource[10]];
            prowdest[9] = vidcolmap[((light += lightstep) & 0xFF00) + psource[9]];
            prowdest[8] = vidcolmap[((light += lightstep) & 0xFF00) + psource[8]];
            prowdest[7] = vidcolmap[((light += lightstep) & 0xFF00) + psource[7]];
            prowdest[6] = vidcolmap[((light += lightstep) & 0xFF00) + psource[6]];
            prowdest[5] = vidcolmap[((light += lightstep) & 0xFF00) + psource[5]];
            prowdest[4] = vidcolmap[((light += lightstep) & 0xFF00) + psource[4]];
            prowdest[3] = vidcolmap[((light += lightstep) & 0xFF00) + psource[3]];
            prowdest[2] = vidcolmap[((light += lightstep) & 0xFF00) + psource[2]];
            prowdest[1] = vidcolmap[((light += lightstep) & 0xFF00) + psource[1]];
            prowdest[0] = vidcolmap[(light & 0xFF00) + psource[0]];

            psource += sourcetstep;
            lightright += lightrightstep;
            lightleft += lightleftstep;
            prowdest += surfrowbytes;
        }

        if (psource >= r_sourcemax)
            psource -= r_stepback;
    }
}
[edit: removed an unused += lightstep]
[edit2: declared vidcolmap 'static']
Last edited by qbism on Fri Feb 15, 2013 5:32 pm, edited 4 times in total.
taniwha
Posts: 401
Joined: Thu Jan 14, 2010 7:11 am

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by taniwha »

Just a comment on unrolling: about twelve years ago, I got into an argument with another developer about loop unrolling. I insisted a certain loop should be hand-unrolled; he insisted it would be faster to let gcc do it (using dead simple code like "*dst++ = *src++;"). Well, there was only one thing to do: try both and see what happened. I "lost". The version with the stupid code, where gcc was allowed to do the unrolling, was noticeably faster. When I investigated, it turned out gcc was unrolling both versions, but the hand-unrolled version wound up with N increments (N = the number of times gcc unrolled the loop) vs. the one increment. Also, there was N times as much code (nasty for the cache), and I seem to remember there being some nasty register pressure.

In short, you are generally far better off using a decent compiler (remember, that was gcc as of 12 years ago), writing clean, simple code, and letting the compiler choose the optimal code, especially if you can tune the compiler to your intended target CPU. If the compiler you're using can't do a decent job of loop unrolling, burn it and get a new one.
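For illustration, the "dead simple" loop from that argument looks something like this (a generic sketch, not the actual code from back then); built with something like gcc -O2 -funroll-loops, the compiler picks the unroll factor itself:

```c
#include <stddef.h>

/* Simplest possible byte copy: one increment per pointer per iteration,
 * and the compiler is free to unroll it however suits the target CPU. */
void copy_bytes (unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```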

That said, other than loop unrolling, your code does look clean :).
Leave others their otherness.
http://quakeforge.net/

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by qbism »

A good point, and I unrolled this without a clue regarding performance. But it helped me visualize the loop. Next I may post the rolled-back-up version w/ its assembly. Unrolled gcc assembly:

Code:

_R_DrawSurfaceBlock8_mip0:
	pushl	%ebp
	xorl	%ebp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$8, %esp
	movl	_r_numvblocks, %eax
	movl	_pbasesource, %edx
	movl	_prowdestbase, %ebx
	movl	%ebp, 4(%esp)
	testl	%eax, %eax
	jle	L38
	.p2align 4,,15
L44:
	movl	_r_lightptr, %ecx
	movl	_r_lightwidth, %edi
	movl	$16, (%esp)
	movl	(%ecx), %eax
	movl	4(%ecx), %esi
	leal	(%ecx,%edi,4), %edi
	movl	%edi, _r_lightptr
	movl	%eax, _lightleft
	movl	%esi, _lightright
	movl	(%edi), %ecx
	subl	%eax, %ecx
	shrl	$4, %ecx
	movl	%ecx, _lightleftstep
	movl	4(%edi), %ecx
	subl	%esi, %ecx
	shrl	$4, %ecx
	movl	%ecx, _lightrightstep
	.p2align 4,,15
L40:
	subl	%esi, %eax
	movl	_vidcolmap, %ecx
	sarl	$4, %eax
	movzbl	15(%edx), %edi
	addl	%eax, %esi
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 15(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	14(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 14(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	13(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 13(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	12(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 12(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	11(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 11(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	10(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 10(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	9(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 9(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	8(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 8(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	7(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 7(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	6(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	andl	$65280, %ebp
	addl	%eax, %esi
	movb	%cl, 6(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	5(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 5(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	4(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 4(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	3(%edx), %edi
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 3(%ebx)
	movl	%esi, %ecx
	addl	%eax, %esi
	movl	_vidcolmap, %ebp
	andl	$65280, %ecx
	addl	%esi, %eax
	movzbl	2(%edx), %edi
	andl	$65280, %eax
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 2(%ebx)
	movl	%esi, %ecx
	movl	_vidcolmap, %ebp
	andl	$65280, %ecx
	movzbl	1(%edx), %edi
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 1(%ebx)
	movl	_vidcolmap, %esi
	movzbl	(%edx), %ecx
	addl	%esi, %eax
	movzbl	(%eax,%ecx), %eax
	movb	%al, (%ebx)
	movl	_lightright, %esi
	movl	_lightrightstep, %ebp
	movl	_lightleft, %eax
	movl	_lightleftstep, %ecx
	movl	_sourcetstep, %edi
	addl	%ebp, %esi
	movl	%esi, _lightright
	addl	%ecx, %eax
	addl	%edi, %edx
	movl	%eax, _lightleft
	movl	_surfrowbytes, %edi
	addl	%edi, %ebx
	decl	(%esp)
	jne	L40
	movl	_r_stepback, %edi
	movl	%edx, %eax
	subl	%edi, %eax
	cmpl	%edx, _r_sourcemax
	cmovbe	%eax, %edx
	incl	4(%esp)
	movl	4(%esp), %eax
	cmpl	%eax, _r_numvblocks
	jg	L44
L38:
	addl	$8, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by qbism »

This is the clean, rolled-up version:

Code:

    void R_DrawSurfaceBlock8_mip0 (void)
    {
        int   v, i, b, lightstep, light;
        byte   *psource, *prowdest;

        psource = pbasesource;
        prowdest = prowdestbase;

        for (v=0 ; v<r_numvblocks ; v++)
        {
            lightleft = r_lightptr[0];
            lightright = r_lightptr[1];
            r_lightptr += r_lightwidth;
            lightleftstep = (r_lightptr[0] - lightleft) >> 4;
            lightrightstep = (r_lightptr[1] - lightright) >> 4;

            for (i=0 ; i<16 ; i++)
            {
                lightstep = (lightleft - lightright) >> 4;
                light = lightright;
                for (b=15; b>=0; b--)
                {
                    prowdest[b] = vidcolmap[(light & 0xFF00) + psource[b]];
                    light += lightstep;
                }
                psource += sourcetstep;
                lightright += lightrightstep;
                lightleft += lightleftstep;
                prowdest += surfrowbytes;
            }

            if (psource >= r_sourcemax)
                psource -= r_stepback;
        }
    }


Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by qbism »

This is the optimized assembly output of the clean, rolled-up version. The compiler decided how to unroll the loop:

Code:

_R_DrawSurfaceBlock8_mip0:
	pushl	%ebp
	xorl	%ebp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	_r_numvblocks, %edx
	movl	_pbasesource, %eax
	movl	_prowdestbase, %esi
	movl	%ebp, 8(%esp)
	testl	%edx, %edx
	jle	L38
	.p2align 4,,15
L44:
	movl	_r_lightptr, %ecx
	movl	_r_lightwidth, %edi
	movl	(%ecx), %edx
	movl	4(%ecx), %ebx
	leal	(%ecx,%edi,4), %edi
	movl	%edi, _r_lightptr
	movl	%edx, _lightleft
	movl	%ebx, _lightright
	movl	(%edi), %ecx
	movl	%ebx, (%esp)
	subl	%edx, %ecx
	shrl	$4, %ecx
	movl	%ecx, _lightleftstep
	movl	4(%edi), %ecx
	movl	$16, %edi
	movl	%edi, 4(%esp)
	subl	%ebx, %ecx
	shrl	$4, %ecx
	movl	%ecx, _lightrightstep
	.p2align 4,,15
L40:
	movl	(%esp), %ebp
	movl	_vidcolmap, %ecx
	movzbl	15(%eax), %edi
	subl	%ebp, %edx
	movl	(%esp), %ebp
	sarl	$4, %edx
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 15(%esi)
	movl	(%esp), %ebx
	movzbl	14(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 14(%esi)
	movl	(%esp), %ebx
	movzbl	13(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 13(%esi)
	movl	(%esp), %ebx
	movzbl	12(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 12(%esi)
	movl	(%esp), %ebx
	movzbl	11(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 11(%esi)
	movl	(%esp), %ebx
	movzbl	10(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 10(%esi)
	movl	(%esp), %ebx
	movzbl	9(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 9(%esi)
	movl	(%esp), %ebx
	movzbl	8(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 8(%esi)
	movl	(%esp), %ebx
	movzbl	7(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 7(%esi)
	movl	(%esp), %ebx
	movzbl	6(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 6(%esi)
	movl	(%esp), %ebx
	movzbl	5(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 5(%esi)
	movl	(%esp), %ebx
	movzbl	4(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 4(%esi)
	movl	(%esp), %ebx
	movzbl	3(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	addl	%edx, %ebx
	andl	$65280, %ebp
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 3(%esi)
	movl	%ebx, %ecx
	addl	%edx, %ebx
	movl	_vidcolmap, %ebp
	andl	$65280, %ecx
	addl	%ebx, %edx
	movzbl	2(%eax), %edi
	andl	$65280, %edx
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 2(%esi)
	movl	%ebx, %ecx
	movl	_vidcolmap, %ebp
	andl	$65280, %ecx
	movzbl	1(%eax), %edi
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 1(%esi)
	movl	_vidcolmap, %ebx
	movzbl	(%eax), %ecx
	addl	%ebx, %edx
	movzbl	(%edx,%ecx), %edx
	movb	%dl, (%esi)
	movl	_lightright, %edx
	movl	_lightrightstep, %ebx
	movl	_sourcetstep, %edi
	movl	_lightleftstep, %ebp
	movl	_surfrowbytes, %ecx
	addl	%ebx, %edx
	movl	%edx, (%esp)
	addl	%edi, %eax
	movl	%edx, _lightright
	movl	_lightleft, %edx
	addl	%ecx, %esi
	addl	%ebp, %edx
	decl	4(%esp)
	movl	%edx, _lightleft
	jne	L40
	movl	_r_stepback, %edi
	movl	%eax, %edx
	subl	%edi, %edx
	cmpl	_r_sourcemax, %eax
	cmovae	%edx, %eax
	incl	8(%esp)
	movl	8(%esp), %edx
	cmpl	%edx, _r_numvblocks
	jg	L44
L38:
	addl	$12, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
Loop comparison:

Unrolled by hand:

Code:

   movl   _vidcolmap, %ecx
   movzbl   14(%edx), %edi
   addl   %ecx, %ebp
   movzbl   0(%ebp,%edi), %ecx
   movl   %esi, %ebp
   addl   %eax, %esi
   andl   $65280, %ebp
   movb   %cl, 14(%ebx)
Unrolled by compiler:

Code:

	movzbl	14(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 14(%esi)
	movl	(%esp), %ebx
So the compiler took 9 instructions per pixel vs. 8 by hand (I did not look at the code outside the loop). Boiling the difference down, and assuming all registers are equally fast:

By hand:
movl _vidcolmap, %ecx

Machine:
movl %ebx, (%esp)
movl (%esp), %ebx

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by qbism »

If vidcolmap is declared static, "movl _vidcolmap, %ecx" becomes "movl (%esp), %ebx":

Code:

	movzbl	14(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 14(%edx)
	movl	(%esp), %ebx

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by taniwha »

I'd say there's a "push" somewhere (or maybe "mov whatever,(%esp)"). However, if that is outside the loop, then it will be a gain as it removes (code) cache pressure from the loop.

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by qbism »

Yes, "movl _vidcolmap, %ebx" is outside the loop. Here's the whole thing:

Code:

	.p2align 4,,15
	.globl	_R_DrawSurfaceBlock8_mip0
	.def	_R_DrawSurfaceBlock8_mip0;	.scl	2;	.type	32;	.endef
_R_DrawSurfaceBlock8_mip0:
	movl	_r_numvblocks, %ecx
	movl	_pbasesource, %eax
	movl	_prowdestbase, %edx
	testl	%ecx, %ecx
	jle	L43
	pushl	%ebp
	xorl	%ecx, %ecx
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	_vidcolmap, %ebx
	movl	%ecx, 8(%esp)
	.p2align 4,,15
L38:
	movl	_r_lightptr, %edi
	movl	_r_lightwidth, %ebp
	movl	%ebx, (%esp)
	movl	(%edi), %ecx
	movl	4(%edi), %esi
	leal	(%edi,%ebp,4), %ebp
	movl	%ebp, _r_lightptr
	movl	%ecx, _lightleft
	movl	%esi, _lightright
	movl	0(%ebp), %edi
	subl	%ecx, %edi
	shrl	$4, %edi
	movl	%edi, _lightleftstep
	movl	4(%ebp), %edi
	movl	$16, %ebp
	movl	%ebp, 4(%esp)
	subl	%esi, %edi
	shrl	$4, %edi
	movl	%edi, _lightrightstep
	.p2align 4,,15
L36:
	subl	%esi, %ecx
	movl	(%esp), %ebx
	sarl	$4, %ecx
	movzbl	15(%eax), %edi
	addl	%ecx, %esi
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 15(%edx)
	movl	(%esp), %ebx
	movzbl	14(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 14(%edx)
	movl	(%esp), %ebx
	movzbl	13(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 13(%edx)
	movl	(%esp), %ebx
	movzbl	12(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 12(%edx)
	movl	(%esp), %ebx
	movzbl	11(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 11(%edx)
	movl	(%esp), %ebx
	movzbl	10(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 10(%edx)
	movl	(%esp), %ebx
	movzbl	9(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 9(%edx)
	movl	(%esp), %ebx
	movzbl	8(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 8(%edx)
	movl	(%esp), %ebx
	movzbl	7(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 7(%edx)
	movl	(%esp), %ebx
	movzbl	6(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	andl	$65280, %ebp
	addl	%ecx, %esi
	movb	%bl, 6(%edx)
	movl	(%esp), %ebx
	movzbl	5(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 5(%edx)
	movl	(%esp), %ebx
	movzbl	4(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	addl	%esi, %ecx
	andl	$65280, %ecx
	movb	%bl, 4(%edx)
	movl	(%esp), %ebx
	movzbl	3(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	andl	$65280, %ebp
	movb	%bl, 3(%edx)
	movl	(%esp), %ebx
	movzbl	2(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	(%esp), %edi
	movb	%bl, 2(%edx)
	movzbl	1(%eax), %esi
	addl	%edi, %ecx
	movzbl	(%ecx,%esi), %ebx
	movb	%bl, 1(%edx)
	movzbl	(%eax), %esi
	movzbl	(%ecx,%esi), %ecx
	movb	%cl, (%edx)
	movl	_lightright, %esi
	movl	_lightrightstep, %ecx
	movl	_lightleftstep, %ebx
	movl	_sourcetstep, %ebp
	addl	%ecx, %esi
	movl	_lightleft, %ecx
	movl	%esi, _lightright
	addl	%ebp, %eax
	addl	%ebx, %ecx
	movl	%ecx, _lightleft
	movl	_surfrowbytes, %edi
	addl	%edi, %edx
	decl	4(%esp)
	jne	L36
	movl	_r_stepback, %edi
	movl	%eax, %ecx
	movl	(%esp), %ebx
	subl	%edi, %ecx
	cmpl	%eax, _r_sourcemax
	cmovbe	%ecx, %eax
	incl	8(%esp)
	movl	8(%esp), %ecx
	cmpl	%ecx, _r_numvblocks
	jg	L38
	addl	$12, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
L43:
	ret
leileilol
Posts: 2783
Joined: Fri Oct 15, 2004 3:23 am

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by leileilol »

What would Duff's-devicing the innermost loop do?
i should not be here

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by taniwha »

Yeah, with vidcolmap static, the compiler knows nothing else should be able to change it unexpectedly, and thus can do such optimizations.

My general rule (for other reasons, actually) is: if it doesn't need to be accessible from other files, make it static (whether variable or function). Also, as a related rule: the "extern" keyword should not appear in a .c file (limit its use to .h files). The reason for this is that there was once a bug (possibly only in QF, but I don't know) where a variable was declared as int in one .c file and short in another. The compiler didn't catch the problem because one of the .c files was using extern, and for some reason the linker didn't complain.

While places like R_DrawSurfaceBlock8_mip0 are obvious places to optimize (or at least likely places), always follow Abrash's rule: profile, profile, reconsider the algorithm, profile, profile, then optimize. There's little point in optimizing code that has little impact on performance, and there's no point optimizing an O(N**2) algorithm when there's an O(N) (or even the holy grail, O(1)*) algorithm available.

* Yes, for some problem spaces, O(1) exists, but they're rare. Sure, they're sometimes more expensive for small N, but since they scale so nicely, who cares? :) eg, using a hash table will be slower for small N (hash string, compare a string or two vs just comparing a few strings in a simple linked list), but get to a few thousand (eg, compiling frikbot with qcc) and the hash table wins hands down.
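To illustrate the hash-table point, here is a minimal chained string hash table (a sketch with made-up keys; nothing like QuakeForge's actual implementation). Lookup cost stays roughly constant as N grows because only one bucket's chain is ever walked:

```c
#include <string.h>

#define NUM_BUCKETS 64

typedef struct entry_s {
    const char     *key;
    int             value;
    struct entry_s *next;
} entry_t;

static entry_t *buckets[NUM_BUCKETS];

static unsigned hash_string (const char *s)
{
    unsigned h = 5381;                  /* djb2-style string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NUM_BUCKETS;
}

void hash_insert (entry_t *e)
{
    unsigned h = hash_string (e->key);
    e->next = buckets[h];               /* push onto the bucket's chain */
    buckets[h] = e;
}

entry_t *hash_find (const char *key)
{
    entry_t *e = buckets[hash_string (key)];
    for (; e; e = e->next)              /* walk only this one chain */
        if (!strcmp (e->key, key))
            return e;
    return 0;
}
```

For a handful of entries a plain linked list would beat this (no hashing overhead), but with thousands of names the short per-bucket chains win easily.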

Anyway, if the code is fast enough (thus the need for profiling), clean, easily read (and modified!) code is much more important than "fast" code. Also, compiler writers put their efforts into making the compiler produce optimal output for clean, easily read code. Hand-optimizing the code can wind up fighting the compiler's optimization algorithms and producing sub-optimal code (though probably still better than the clean code with compiler optimizations turned off).

Now, the best thing to do for "hand optimizing" code is actually to give the compiler hints (and as a fringe benefit, improve the readability of the code and let the compiler help catch errors). Things like "static" and "const" wherever possible can make a big difference. Just making vec3_t params "const" where possible made a noticeable difference!
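A sketch of what those hints look like in practice (the function and accessor names here are hypothetical; vec3_t is as in Quake):

```c
typedef float vec3_t[3];

/* static: visible only to this .c file, so the compiler may assume no
 * other translation unit modifies it (and can keep it in a register) */
static unsigned char *vidcolmap;

/* hypothetical accessor, just so the example is self-contained */
unsigned char *get_colormap (void)
{
    return vidcolmap;
}

/* const on array/pointer params: the callee promises not to write
 * through them, which helps the optimizer keep loaded values around */
float DotProduct_c (const vec3_t a, const vec3_t b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```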

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by taniwha »

leileilol: Duff's device is useless when the number of iterations is fixed (which, unless I'm mistaken, it appears to be*). Odds are high that Duff's device will produce slower code for fixed iterations, depending on the compiler's aggressiveness with constant folding, common subexpression elimination, and dead code removal.

* In fact, it seems only the outermost loop is variable.

Duff's device is cool, but it's no hammer and the problem doesn't even look like a nail ;).

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by leileilol »

okay.

For some odd reason, in vc6, using "Prefer small code" in the optimizations for r_surf.c makes it faster for me

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by taniwha »

My guess is cache usage for the code itself. ie, keep the code small and a lot of r_surf.c might fit in L1 cache.
frag.machine
Posts: 2126
Joined: Sat Nov 25, 2006 1:49 pm

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by frag.machine »

leileilol wrote: okay.

For some odd reason, in vc6, using "Prefer small code" in the optimizations for r_surf.c makes it faster for me
smaller code == (usually) more L1/L2 cache hits ?

EDIT: Gahh, ninja'd by taniwha :P
I know FrikaC made a cgi-bin version of the quakec interpreter once and wrote part of his website in QuakeC :) (LordHavoc)

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Post by leileilol »

Probably, though I've only tried that small-code thing with my Duff-deviced RGB surfaceblocks. I've yet to test it on the plain unrolled ones I've had before.

I should compile some build that lets me choose between the duff blocks, the unrolled blocks, and combined duff-and-unrolled blocks (in one function), then test that build on a Pentium II, trying the small-code and fast-code optimizations in two different builds

EDIT: Unrolled > Duff in a surfaceblock. Working out a duff was a waste of my time :D