
R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Tue Feb 12, 2013 12:49 am
by qbism

Code:

static byte    *vidcolmap;
...

vidcolmap = (byte *)vid.colormap;
...

void R_DrawSurfaceBlock8_mip0 (void)
{
    int	v, i, lightstep, light;
    byte	*psource, *prowdest;

    psource = pbasesource;
    prowdest = prowdestbase;

    for (v=0 ; v<r_numvblocks ; v++)
    {
        lightleft = r_lightptr[0];
        lightright = r_lightptr[1];
        r_lightptr += r_lightwidth;
        lightleftstep = (r_lightptr[0] - lightleft) >> 4;
        lightrightstep = (r_lightptr[1] - lightright) >> 4;

        for (i=0 ; i<16 ; i++)
        {
            lightstep = (lightleft - lightright) >> 4;
            light = lightright;

            prowdest[15] = vidcolmap[((light += lightstep) & 0xFF00) + psource[15]];
            prowdest[14] = vidcolmap[((light += lightstep) & 0xFF00) + psource[14]];
            prowdest[13] = vidcolmap[((light += lightstep) & 0xFF00) + psource[13]];
            prowdest[12] = vidcolmap[((light += lightstep) & 0xFF00) + psource[12]];
            prowdest[11] = vidcolmap[((light += lightstep) & 0xFF00) + psource[11]];
            prowdest[10] = vidcolmap[((light += lightstep) & 0xFF00) + psource[10]];
            prowdest[9] = vidcolmap[((light += lightstep) & 0xFF00) + psource[9]];
            prowdest[8] = vidcolmap[((light += lightstep) & 0xFF00) + psource[8]];
            prowdest[7] = vidcolmap[((light += lightstep) & 0xFF00) + psource[7]];
            prowdest[6] = vidcolmap[((light += lightstep) & 0xFF00) + psource[6]];
            prowdest[5] = vidcolmap[((light += lightstep) & 0xFF00) + psource[5]];
            prowdest[4] = vidcolmap[((light += lightstep) & 0xFF00) + psource[4]];
            prowdest[3] = vidcolmap[((light += lightstep) & 0xFF00) + psource[3]];
            prowdest[2] = vidcolmap[((light += lightstep) & 0xFF00) + psource[2]];
            prowdest[1] = vidcolmap[((light += lightstep) & 0xFF00) + psource[1]];
            prowdest[0] = vidcolmap[(light & 0xFF00) + psource[0]];

            psource += sourcetstep;
            lightright += lightrightstep;
            lightleft += lightleftstep;
            prowdest += surfrowbytes;
        }

        if (psource >= r_sourcemax)
            psource -= r_stepback;
    }
}
[edit: removed an unused += lightstep]
[edit2: declare vidcolmap 'static' ]

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Tue Feb 12, 2013 2:26 am
by taniwha
Just a comment on unrolling: about twelve years ago, I got into an argument with another developer about loop unrolling. I insisted a certain loop should be hand-unrolled; he insisted it would be faster to let gcc do it (using dead-simple code like "*dst++ = *src++;"). There was only one thing to do: try both and see what happened. I "lost". The version with the simple code, where gcc was allowed to do the unrolling, was noticeably faster. When I investigated, it turned out gcc was unrolling both versions, but the hand-unrolled version wound up with N pointer increments per pass (N = the number of times gcc unrolled the loop) versus just one. There was also N times as much code (nasty for the cache), and I seem to remember some nasty register pressure as well.

In short, you are generally far better off using a decent compiler (remember, that was gcc as of twelve years ago), writing clean, simple code, and letting the compiler choose the optimal instructions, especially if you can tune it for your intended target CPU. If the compiler you're using can't do a decent job of loop unrolling, burn it and get a new one.
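For reference, the "dead simple" loop described above would be something like this sketch; with e.g. gcc -O2 -funroll-loops (or -O3) the compiler performs the unrolling itself, while the source keeps a single increment per element:

```c
#include <assert.h>

/* Dead-simple byte copy: one increment per pointer per element.
   Left rolled on purpose; the compiler decides how far to unroll it. */
void copy_bytes(unsigned char *dst, const unsigned char *src, int n)
{
    while (n-- > 0)
        *dst++ = *src++;
}
```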

That said, other than loop unrolling, your code does look clean :).

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Tue Feb 12, 2013 3:44 am
by qbism
A good point, and I unrolled this without a clue regarding performance. But it helped me visualize the code. Next I may post the rolled-back-up version w/ assembly. Unrolled gcc assembly:

Code:

_R_DrawSurfaceBlock8_mip0:
	pushl	%ebp
	xorl	%ebp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$8, %esp
	movl	_r_numvblocks, %eax
	movl	_pbasesource, %edx
	movl	_prowdestbase, %ebx
	movl	%ebp, 4(%esp)
	testl	%eax, %eax
	jle	L38
	.p2align 4,,15
L44:
	movl	_r_lightptr, %ecx
	movl	_r_lightwidth, %edi
	movl	$16, (%esp)
	movl	(%ecx), %eax
	movl	4(%ecx), %esi
	leal	(%ecx,%edi,4), %edi
	movl	%edi, _r_lightptr
	movl	%eax, _lightleft
	movl	%esi, _lightright
	movl	(%edi), %ecx
	subl	%eax, %ecx
	shrl	$4, %ecx
	movl	%ecx, _lightleftstep
	movl	4(%edi), %ecx
	subl	%esi, %ecx
	shrl	$4, %ecx
	movl	%ecx, _lightrightstep
	.p2align 4,,15
L40:
	subl	%esi, %eax
	movl	_vidcolmap, %ecx
	sarl	$4, %eax
	movzbl	15(%edx), %edi
	addl	%eax, %esi
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 15(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	14(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 14(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	13(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 13(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	12(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 12(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	11(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 11(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	10(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 10(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	9(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 9(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	8(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 8(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	7(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 7(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	6(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	andl	$65280, %ebp
	addl	%eax, %esi
	movb	%cl, 6(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	5(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 5(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	4(%edx), %edi
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ecx
	movl	%esi, %ebp
	addl	%eax, %esi
	andl	$65280, %ebp
	movb	%cl, 4(%ebx)
	movl	_vidcolmap, %ecx
	movzbl	3(%edx), %edi
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 3(%ebx)
	movl	%esi, %ecx
	addl	%eax, %esi
	movl	_vidcolmap, %ebp
	andl	$65280, %ecx
	addl	%esi, %eax
	movzbl	2(%edx), %edi
	andl	$65280, %eax
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 2(%ebx)
	movl	%esi, %ecx
	movl	_vidcolmap, %ebp
	andl	$65280, %ecx
	movzbl	1(%edx), %edi
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 1(%ebx)
	movl	_vidcolmap, %esi
	movzbl	(%edx), %ecx
	addl	%esi, %eax
	movzbl	(%eax,%ecx), %eax
	movb	%al, (%ebx)
	movl	_lightright, %esi
	movl	_lightrightstep, %ebp
	movl	_lightleft, %eax
	movl	_lightleftstep, %ecx
	movl	_sourcetstep, %edi
	addl	%ebp, %esi
	movl	%esi, _lightright
	addl	%ecx, %eax
	addl	%edi, %edx
	movl	%eax, _lightleft
	movl	_surfrowbytes, %edi
	addl	%edi, %ebx
	decl	(%esp)
	jne	L40
	movl	_r_stepback, %edi
	movl	%edx, %eax
	subl	%edi, %eax
	cmpl	%edx, _r_sourcemax
	cmovbe	%eax, %edx
	incl	4(%esp)
	movl	4(%esp), %eax
	cmpl	%eax, _r_numvblocks
	jg	L44
L38:
	addl	$8, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Tue Feb 12, 2013 3:52 am
by qbism
This is the clean, rolled-up version:

Code:

    void R_DrawSurfaceBlock8_mip0 (void)
    {
        int   v, i, b, lightstep, light;
        byte   *psource, *prowdest;

        psource = pbasesource;
        prowdest = prowdestbase;

        for (v=0 ; v<r_numvblocks ; v++)
        {
            lightleft = r_lightptr[0];
            lightright = r_lightptr[1];
            r_lightptr += r_lightwidth;
            lightleftstep = (r_lightptr[0] - lightleft) >> 4;
            lightrightstep = (r_lightptr[1] - lightright) >> 4;

            for (i=0 ; i<16 ; i++)
            {
                lightstep = (lightleft - lightright) >> 4;
                light = lightright;
                for (b=15; b>=0; b--)
                {
                    prowdest[b] = vidcolmap[(light & 0xFF00) + psource[b]];
                    light += lightstep;
                }
                psource += sourcetstep;
                lightright += lightrightstep;
                lightleft += lightleftstep;
                prowdest += surfrowbytes;
            }

            if (psource >= r_sourcemax)
                psource -= r_stepback;
        }
    }
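The lookup in the inner loop is an 8.8 fixed-point shade: the high byte of `light` selects one of the colormap's 256-entry rows. A minimal sketch of just that step (the 64-row layout matches Quake's vid.colormap, but the table here is a zeroed stand-in for illustration):

```c
/* 8.8 fixed-point colormap lookup, as in the inner loop above.
   Quake's colormap is 64 rows of 256 bytes; this table is a
   placeholder so the sketch is self-contained. */
#include <assert.h>

static unsigned char colormap[64 * 256];

static unsigned char shade(int light, unsigned char pix)
{
    /* The high byte of 'light' is the shade level; masking with 0xFF00
       yields (level * 256), i.e. the byte offset of that colormap row,
       so no shift is needed before adding the pixel index. */
    return colormap[(light & 0xFF00) + pix];
}
```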


Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Tue Feb 12, 2013 4:21 am
by qbism
This is the optimized assembly output for the clean, rolled-up version. The compiler decided how to unroll the loop.

Code:

_R_DrawSurfaceBlock8_mip0:
	pushl	%ebp
	xorl	%ebp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	_r_numvblocks, %edx
	movl	_pbasesource, %eax
	movl	_prowdestbase, %esi
	movl	%ebp, 8(%esp)
	testl	%edx, %edx
	jle	L38
	.p2align 4,,15
L44:
	movl	_r_lightptr, %ecx
	movl	_r_lightwidth, %edi
	movl	(%ecx), %edx
	movl	4(%ecx), %ebx
	leal	(%ecx,%edi,4), %edi
	movl	%edi, _r_lightptr
	movl	%edx, _lightleft
	movl	%ebx, _lightright
	movl	(%edi), %ecx
	movl	%ebx, (%esp)
	subl	%edx, %ecx
	shrl	$4, %ecx
	movl	%ecx, _lightleftstep
	movl	4(%edi), %ecx
	movl	$16, %edi
	movl	%edi, 4(%esp)
	subl	%ebx, %ecx
	shrl	$4, %ecx
	movl	%ecx, _lightrightstep
	.p2align 4,,15
L40:
	movl	(%esp), %ebp
	movl	_vidcolmap, %ecx
	movzbl	15(%eax), %edi
	subl	%ebp, %edx
	movl	(%esp), %ebp
	sarl	$4, %edx
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 15(%esi)
	movl	(%esp), %ebx
	movzbl	14(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 14(%esi)
	movl	(%esp), %ebx
	movzbl	13(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 13(%esi)
	movl	(%esp), %ebx
	movzbl	12(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 12(%esi)
	movl	(%esp), %ebx
	movzbl	11(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 11(%esi)
	movl	(%esp), %ebx
	movzbl	10(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 10(%esi)
	movl	(%esp), %ebx
	movzbl	9(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 9(%esi)
	movl	(%esp), %ebx
	movzbl	8(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 8(%esi)
	movl	(%esp), %ebx
	movzbl	7(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 7(%esi)
	movl	(%esp), %ebx
	movzbl	6(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 6(%esi)
	movl	(%esp), %ebx
	movzbl	5(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 5(%esi)
	movl	(%esp), %ebx
	movzbl	4(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 4(%esi)
	movl	(%esp), %ebx
	movzbl	3(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	addl	%edx, %ebx
	andl	$65280, %ebp
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 3(%esi)
	movl	%ebx, %ecx
	addl	%edx, %ebx
	movl	_vidcolmap, %ebp
	andl	$65280, %ecx
	addl	%ebx, %edx
	movzbl	2(%eax), %edi
	andl	$65280, %edx
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 2(%esi)
	movl	%ebx, %ecx
	movl	_vidcolmap, %ebp
	andl	$65280, %ecx
	movzbl	1(%eax), %edi
	addl	%ebp, %ecx
	movzbl	(%ecx,%edi), %ecx
	movb	%cl, 1(%esi)
	movl	_vidcolmap, %ebx
	movzbl	(%eax), %ecx
	addl	%ebx, %edx
	movzbl	(%edx,%ecx), %edx
	movb	%dl, (%esi)
	movl	_lightright, %edx
	movl	_lightrightstep, %ebx
	movl	_sourcetstep, %edi
	movl	_lightleftstep, %ebp
	movl	_surfrowbytes, %ecx
	addl	%ebx, %edx
	movl	%edx, (%esp)
	addl	%edi, %eax
	movl	%edx, _lightright
	movl	_lightleft, %edx
	addl	%ecx, %esi
	addl	%ebp, %edx
	decl	4(%esp)
	movl	%edx, _lightleft
	jne	L40
	movl	_r_stepback, %edi
	movl	%eax, %edx
	subl	%edi, %edx
	cmpl	_r_sourcemax, %eax
	cmovae	%edx, %eax
	incl	8(%esp)
	movl	8(%esp), %edx
	cmpl	%edx, _r_numvblocks
	jg	L44
L38:
	addl	$12, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
Loop body comparison. Unrolled by hand:

Code:

   movl   _vidcolmap, %ecx
   movzbl   14(%edx), %edi
   addl   %ecx, %ebp
   movzbl   0(%ebp,%edi), %ecx
   movl   %esi, %ebp
   addl   %eax, %esi
   andl   $65280, %ebp
   movb   %cl, 14(%ebx)
unrolled by compiler:

Code:

	movzbl	14(%eax), %edi
	addl	%edx, %ebx
	movl	%ebx, %ebp
	andl	$65280, %ebp
	addl	%ecx, %ebp
	movl	%ebx, (%esp)
	movzbl	0(%ebp,%edi), %ebx
	movb	%bl, 14(%esi)
	movl	(%esp), %ebx
So the compiler took 9 instructions per element vs. 8 by hand. I did not look at the code outside the loop. Boiling down the differences, assuming all registers are equally fast:

By hand:
movl _vidcolmap, %ecx

By compiler:
movl %ebx, (%esp)
movl (%esp), %ebx

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Thu Feb 14, 2013 5:55 am
by qbism
If vidcolmap is declared static, "movl _vidcolmap, %ecx" becomes "movl (%esp), %ebx":

Code:

	movzbl	14(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 14(%edx)
	movl	(%esp), %ebx

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 3:19 am
by taniwha
I'd say there's a "push" somewhere (or maybe a "mov whatever, (%esp)"). However, if that is outside the loop, it's a net gain, as it removes (code) cache pressure from the loop.

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 4:41 am
by qbism
Yes, "movl _vidcolmap, %ebx" is outside the loop. Here's the whole thing:

Code:

	.p2align 4,,15
	.globl	_R_DrawSurfaceBlock8_mip0
	.def	_R_DrawSurfaceBlock8_mip0;	.scl	2;	.type	32;	.endef
_R_DrawSurfaceBlock8_mip0:
	movl	_r_numvblocks, %ecx
	movl	_pbasesource, %eax
	movl	_prowdestbase, %edx
	testl	%ecx, %ecx
	jle	L43
	pushl	%ebp
	xorl	%ecx, %ecx
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	_vidcolmap, %ebx
	movl	%ecx, 8(%esp)
	.p2align 4,,15
L38:
	movl	_r_lightptr, %edi
	movl	_r_lightwidth, %ebp
	movl	%ebx, (%esp)
	movl	(%edi), %ecx
	movl	4(%edi), %esi
	leal	(%edi,%ebp,4), %ebp
	movl	%ebp, _r_lightptr
	movl	%ecx, _lightleft
	movl	%esi, _lightright
	movl	0(%ebp), %edi
	subl	%ecx, %edi
	shrl	$4, %edi
	movl	%edi, _lightleftstep
	movl	4(%ebp), %edi
	movl	$16, %ebp
	movl	%ebp, 4(%esp)
	subl	%esi, %edi
	shrl	$4, %edi
	movl	%edi, _lightrightstep
	.p2align 4,,15
L36:
	subl	%esi, %ecx
	movl	(%esp), %ebx
	sarl	$4, %ecx
	movzbl	15(%eax), %edi
	addl	%ecx, %esi
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 15(%edx)
	movl	(%esp), %ebx
	movzbl	14(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 14(%edx)
	movl	(%esp), %ebx
	movzbl	13(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 13(%edx)
	movl	(%esp), %ebx
	movzbl	12(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 12(%edx)
	movl	(%esp), %ebx
	movzbl	11(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 11(%edx)
	movl	(%esp), %ebx
	movzbl	10(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 10(%edx)
	movl	(%esp), %ebx
	movzbl	9(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 9(%edx)
	movl	(%esp), %ebx
	movzbl	8(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 8(%edx)
	movl	(%esp), %ebx
	movzbl	7(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 7(%edx)
	movl	(%esp), %ebx
	movzbl	6(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	andl	$65280, %ebp
	addl	%ecx, %esi
	movb	%bl, 6(%edx)
	movl	(%esp), %ebx
	movzbl	5(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	movb	%bl, 5(%edx)
	movl	(%esp), %ebx
	movzbl	4(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	addl	%ecx, %esi
	andl	$65280, %ebp
	addl	%esi, %ecx
	andl	$65280, %ecx
	movb	%bl, 4(%edx)
	movl	(%esp), %ebx
	movzbl	3(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	%esi, %ebp
	andl	$65280, %ebp
	movb	%bl, 3(%edx)
	movl	(%esp), %ebx
	movzbl	2(%eax), %edi
	addl	%ebx, %ebp
	movzbl	0(%ebp,%edi), %ebx
	movl	(%esp), %edi
	movb	%bl, 2(%edx)
	movzbl	1(%eax), %esi
	addl	%edi, %ecx
	movzbl	(%ecx,%esi), %ebx
	movb	%bl, 1(%edx)
	movzbl	(%eax), %esi
	movzbl	(%ecx,%esi), %ecx
	movb	%cl, (%edx)
	movl	_lightright, %esi
	movl	_lightrightstep, %ecx
	movl	_lightleftstep, %ebx
	movl	_sourcetstep, %ebp
	addl	%ecx, %esi
	movl	_lightleft, %ecx
	movl	%esi, _lightright
	addl	%ebp, %eax
	addl	%ebx, %ecx
	movl	%ecx, _lightleft
	movl	_surfrowbytes, %edi
	addl	%edi, %edx
	decl	4(%esp)
	jne	L36
	movl	_r_stepback, %edi
	movl	%eax, %ecx
	movl	(%esp), %ebx
	subl	%edi, %ecx
	cmpl	%eax, _r_sourcemax
	cmovbe	%ecx, %eax
	incl	8(%esp)
	movl	8(%esp), %ecx
	cmpl	%ecx, _r_numvblocks
	jg	L38
	addl	$12, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
L43:
	ret

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 7:46 am
by leileilol
What would applying Duff's device to the innermost loop do?

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 8:16 am
by taniwha
Yeah, with vidcolmap static, the compiler knows nothing else should be able to change it unexpectedly, and thus can do such optimizations.

My general rule (for other reasons, actually) is: if it doesn't need to be accessible from other files, make it static (whether variable or function). A related rule: the "extern" keyword should not appear in a .c file (limit its use to .h files). The reason is that there was once a bug (possibly only in QF, but I don't know) where a variable was declared as int in one .c file and short in another. The compiler didn't catch the problem because one of the .c files was using extern, and for some reason the linker didn't complain either.
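A minimal single-file sketch of the "make it static" rule (the two-file int/short extern bug can't be reproduced in one file, so it's only described in the comment; the names here are hypothetical):

```c
#include <assert.h>

/* Because call_count is static, no other .c file can write
   'extern short call_count;' and silently alias it at the wrong type
   (the linker bug described above), and the compiler knows no other
   translation unit can touch it, so it may cache it in a register. */
static int call_count;

static int bump(void)
{
    return ++call_count;
}
```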

While places like R_DrawSurfaceBlock8_mip0 are obvious (or at least likely) places to optimize, always follow Abrash's rule: profile, profile, reconsider the algorithm, profile, profile, then optimize. There's little point in optimizing code that has little impact on performance, and there's no point optimizing an O(N**2) algorithm when an O(N) (or even the holy grail, O(1)*) algorithm is available.

* Yes, for some problem spaces O(1) exists, but they're rare. Sure, such structures are sometimes more expensive for small N, but since they scale so nicely, who cares? :) eg, a hash table will be slower for small N (hash the string, then compare a string or two, versus just comparing a few strings in a simple linked list), but get to a few thousand entries (eg, compiling frikbot with qcc) and the hash table wins hands down.
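The hash-table point can be sketched like this: hashing costs more per lookup for tiny N, but lookups stay roughly O(1) as the table fills. A hypothetical minimal open-addressing table, not QF's or qcc's actual implementation:

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 256

static const char *table[TABLE_SIZE];   /* NULL means empty slot */

static unsigned hash(const char *s)
{
    unsigned h = 5381;                  /* djb2-style string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

static void insert(const char *s)
{
    unsigned i = hash(s);
    while (table[i])                    /* linear probe on collision */
        i = (i + 1) % TABLE_SIZE;
    table[i] = s;
}

static int lookup(const char *s)
{
    unsigned i = hash(s);
    while (table[i]) {
        if (strcmp(table[i], s) == 0)
            return 1;
        i = (i + 1) % TABLE_SIZE;
    }
    return 0;
}
```

The up-front hash is the "more expensive for small N" part; the payoff is that a lookup probes one or two slots regardless of how many thousands of names are stored.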

Anyway, if the code is fast enough (thus the need for profiling), clean, easily read (and modified!) code is much more important than "fast" code. Also, compiler writers put their effort into making the compiler produce optimal code for clean, easily read input. Hand-optimizing can wind up fighting the compiler's optimization algorithms and producing sub-optimal code (though probably still better than the clean code with compiler optimizations turned off).

Now, the best thing to do for "hand optimizing" code is actually to give the compiler hints (and as a fringe benefit, improve the readability of the code and let the compiler help catch errors). Things like "static" and "const" wherever possible can make a big difference. Just making vec3_t params "const" where possible made a noticeable difference!

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 8:22 am
by taniwha
leileilol: Duff's device is useless when the number of iterations is fixed (which, unless I'm mistaken, it is here*). Odds are high that Duff's device will produce slower code for fixed iteration counts (depending on the compiler's aggressiveness at constant folding, common subexpression elimination, and dead code removal).

* In fact, it seems only the outermost loop is variable.

Duff's device is cool, but it's no hammer and the problem doesn't even look like a nail ;).
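For readers who haven't seen it, Duff's device interleaves a switch with a do-while so the remainder iterations are handled by jumping into the middle of an unrolled loop. A sketch (a generic byte copy, not anything from the renderer); note that it only helps when count varies at runtime, which the fixed 16-pixel inner loop above does not:

```c
#include <assert.h>
#include <stddef.h>

/* Duff's device: unroll by 8, entering mid-loop to absorb count % 8. */
static void duff_copy(unsigned char *to, const unsigned char *from, size_t count)
{
    size_t n = (count + 7) / 8;         /* number of do-while passes */

    if (count == 0)
        return;

    switch (count % 8) {                /* jump into the unrolled body */
    case 0: do { *to++ = *from++;       /* fall through */
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

With a compile-time-constant count, the switch and the remainder handling are pure overhead, which is why plain code plus compiler unrolling wins here.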

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 8:40 am
by leileilol
okay.

For some odd reason, in vc6, using "Prefer small code" in the optimizations for r_surf.c makes it faster for me

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 9:27 am
by taniwha
My guess is cache usage for the code itself. ie, keep the code small and a lot of r_surf.c might fit in L1 cache.

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 2:43 pm
by frag.machine
leileilol wrote:okay.

For some odd reason, in vc6, using "Prefer small code" in the optimizations for r_surf.c makes it faster for me
smaller code == (usually) more L1/L2 cache hits ?

EDIT: Gahh, ninja'd by taniwha :P

Re: R_DrawSurfaceBlock8_mip0 cleaned up and unrolled

Posted: Fri Feb 15, 2013 7:35 pm
by leileilol
Probably, though I've only tried that small-code setting with my Duff-deviced RGB surfaceblocks. I've yet to test it on the plain unrolled ones I had before.

I should compile a build that lets me choose the Duff blocks, unrolled blocks, and combined Duff-and-unrolled blocks (in one function), then test it on a Pentium II, trying the small-code and fast-code optimizations in two different builds.

EDIT: Unrolled > Duff in a surfaceblock. Working out a Duff's device was a waste of my time :D