Laaca
Czech republic, 16.02.2009, 21:20 |
MMX moves (Developers) |
Am I doing something wrong, or are simple MMX moves slower than normal 386 moves?
I measured that with "block A" switched on, the copy is about 8% slower than with it switched off.
Both DS:ESI and ES:EDI point to normal RAM (not VRAM or ROM).
My code:
{DS:ESI = source; ES:EDI = destination; ECX = number of bytes to copy}
{block A}
@mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop
{block B}
shr ecx,1;pushf;shr ecx,1;rep movsd;adc ecx,ecx
rep movsw;popf;adc ecx,ecx;rep movsb --- DOS-u-akbar! |
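For readers less used to the carry-flag trick in block B, here is a rough C sketch of the same decomposition: ECX/4 dwords, then one word if bit 1 of the count is set, then one byte if bit 0 is set. The names split_count and copy_like_block_b are invented for this illustration and are not from Laaca's code; memcpy merely stands in for the rep movs instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
typedef struct {
    uint32_t dwords; /* count used by rep movsd         */
    uint32_t words;  /* 0 or 1, count used by rep movsw */
    uint32_t bytes;  /* 0 or 1, count used by rep movsb */
} tail_split;

static tail_split split_count(uint32_t n)
{
    tail_split s;
    s.bytes  = n & 1u;        /* carry out of the first  shr ecx,1 (saved by pushf) */
    s.words  = (n >> 1) & 1u; /* carry out of the second shr ecx,1                  */
    s.dwords = n >> 2;        /* what remains in ecx for rep movsd                  */
    return s;
}

static void copy_like_block_b(void *dst, const void *src, uint32_t n)
{
    tail_split s = split_count(n);
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *p = (const unsigned char *)src;

    assert(dst != NULL && src != NULL);

    memcpy(d, p, (size_t)s.dwords * 4u);  /* rep movsd */
    d += (size_t)s.dwords * 4u;
    p += (size_t)s.dwords * 4u;

    memcpy(d, p, (size_t)s.words * 2u);   /* rep movsw, at most one word */
    d += (size_t)s.words * 2u;
    p += (size_t)s.words * 2u;

    memcpy(d, p, (size_t)s.bytes);        /* rep movsb, at most one byte */
}
```

The three pieces always add up to the original count, so no byte is copied twice or skipped.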
rr
Berlin, Germany, 16.02.2009, 21:34
@ Laaca
|
MMX moves |
> I measured that with "block A" switched on, the copy is about 8% slower than
> with it switched off.
On which CPU?
> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop
I think those "[eXi]" require some slow effective address calculations on each iteration. Maybe some ASM coder knows more? Japheth, mht? --- Forum admin |
Japheth
Germany (South), 16.02.2009, 22:05
@ Laaca
|
MMX moves |
There's no speed to gain simply by using MMX registers instead of the standard ones. You'll at least have to use the MOVNTQ instruction to get a significant boost.
A pretty good pdf about this topic, which you hopefully can find via Google, is: gdc_2002_amd.pdf
It shows how to achieve faster memcopy speed by using several "tricks" (MMX, XMM, "prefetch"). IIRC I once wrote a small tool which implemented most of the strategies mentioned in this document. There's a small chance that I'll be able to remember what I named it. --- MS-DOS forever! |
rr
Berlin, Germany, 16.02.2009, 22:11
@ Japheth
|
MMX moves |
> There's no speed to gain with MMX registers.
>
> A pretty good pdf about this topic, which you hopefully can find via
> Google, is: gdc_2002_amd.pdf
http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or memcpy_article.zip) also looks interesting. --- Forum admin |
FFK
16.02.2009, 23:04
@ Laaca
|
MMX moves |
Try copying 8-byte-aligned 64-byte blocks with this:
movq mm0,ds:[esi]
movq mm1,ds:[esi+32]
movq mm2,ds:[esi+8]
movq mm3,ds:[esi+40]
movq mm4,ds:[esi+16]
movq mm5,ds:[esi+48]
movq mm6,ds:[esi+24]
movq mm7,ds:[esi+56]
movq es:[edi],mm0
movq es:[edi+32],mm1
movq es:[edi+8],mm2
movq es:[edi+40],mm3
movq es:[edi+16],mm4
movq es:[edi+48],mm5
movq es:[edi+24],mm6
movq es:[edi+56],mm7
At least one of ESI and EDI should be 8-byte aligned; if both are aligned you will get maximum speed. |
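As a rough aid for the alignment condition above, this C sketch shows how one might check whether a pointer is 8-byte aligned and how many bytes to copy singly before the first aligned address. The function names are hypothetical and chosen only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, for illustration only. */

/* Nonzero if p sits on an 8-byte boundary (low three address bits are 0). */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Number of bytes (0..7) to advance p to reach the next 8-byte boundary. */
static size_t bytes_to_align8(const void *p)
{
    return (size_t)((8u - ((uintptr_t)p & 7u)) & 7u);
}
```

A copy routine could move bytes_to_align8(dst) bytes one at a time first, then switch to the 64-byte block loop.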
Rugxulo
Usono, 17.02.2009, 00:54
@ Laaca
|
MMX moves |
Caveat: I am far from an expert! You'll be hard-pressed to find anyone who can 100% tell you about this stuff. (I've looked, it's complex! Different advice is found everywhere!)
> Am I doing something wrong, or are simple MMX moves slower than normal 386
> moves?
>
> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop
Seg overrides are always slower, and DS: is default (even though some dumb assemblers will stick it in there anyways, wasting space). I think something like this only really helps on large data. (You could also try putting 8 into a spare register and using that instead of the immediate value. Not sure how much that'd help, though.)
> {block B}
> shr ecx,1;pushf;shr ecx,1;rep movsd;adc ecx,ecx
> rep movsw;popf;adc ecx,ecx;rep movsb
Since you're almost certainly writing this for a 686, their internal register renaming helps a lot (even for the stack and flags). So it will be more difficult to beat their default "rep movsb" (which is fairly fast on semi-modern, superscalar, out-of-order machines). |
Japheth
Germany (South), 17.02.2009, 13:52
@ rr
|
MMX moves |
> > There's no speed to gain with MMX registers.
> >
> > A pretty good pdf about this topic, which you hopefully can find via
> > Google, is: gdc_2002_amd.pdf
>
> http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or
> memcpy_article.zip)
> also looks interesting.
Thanks! Apparently this is even slightly better.
I remembered the name of the tool mentioned in my last post and uploaded it: http://www.japheth.de/Download/memspeed.zip
It needs HXRT to run. It can be assembled with JWasm, but one will also need the HXDEV package. --- MS-DOS forever! |
Laaca
Czech republic, 17.02.2009, 16:29
@ Japheth
|
MMX moves |
Thanks guys! I will look at it.
However, far more interesting for me is a Mem_copy variant that skips zero values (for transparent sprites in Hicolor mode). --- DOS-u-akbar! |
Japheth
Germany (South), 17.02.2009, 17:16
@ Laaca
|
MMX moves |
> However, far more interesting for me is a Mem_copy variant that skips
> zero values (for transparent sprites in Hicolor mode).
But that's a totally different animal. And you didn't mention this "variant" in your previous post. Also, if video memory is the destination, forget all strategies which try to achieve gains by doing "cache tricks". --- MS-DOS forever! |
Laaca
Czech republic, 17.02.2009, 18:56
@ Japheth
|
MMX moves |
> But that's a totally different animal. And you didn't mention this
> "variant" in your previous post. Also, if video memory is the destination,
> forget all strategies which try to achieve gains by doing "cache tricks".
Yes, I know - completely different animal.
I just want to optimize both of my PutSprites, the normal one and the transparent one. --- DOS-u-akbar! |
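To make the transparent variant concrete, here is a minimal C sketch of a copy that skips zero pixels. It assumes 16-bit Hicolor pixels with the value 0 meaning "transparent", as described above; the function name put_sprite_transparent is hypothetical and this is not Laaca's actual routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copy `pixels` 16-bit pixels from src to dst,
   leaving dst untouched wherever the source pixel is 0 (transparent).
   Returns the number of pixels actually written. */
static size_t put_sprite_transparent(uint16_t *dst, const uint16_t *src,
                                     size_t pixels)
{
    size_t written = 0;
    for (size_t i = 0; i < pixels; i++) {
        if (src[i] != 0) {      /* opaque pixel: overwrite destination */
            dst[i] = src[i];
            written++;
        }                       /* zero pixel: keep the background */
    }
    return written;
}
```

The per-pixel branch is what makes this a different problem from a plain block copy: the work depends on the data, not only on the byte count.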
FFK
17.02.2009, 21:37
@ Laaca
|
MMX moves |
>
> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop
Here is a faster version of {block A}:
shr ecx,3
@mmxloop:
movq mm0,ds:[esi]
dec ecx
lea esi,[esi+8]
movq es:[edi],mm0
lea edi,[edi+8]
jnz @mmxloop |
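One thing to keep in mind with the shr ecx,3 form: the loop runs ECX/8 times, the remainder (ECX and 7) is no longer in ECX afterwards, and a count below 8 gives ECX = 0 on entry, so dec ecx wraps around and the loop runs 2^32 times unless it is skipped. The C sketch below just models these counts; the function names are made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical helpers that model the loop arithmetic only. */

/* Iterations the movq loop should run for n bytes (shr ecx,3). */
static uint32_t qword_iterations(uint32_t n)
{
    return n >> 3;
}

/* Bytes left over for a tail copy; must be saved before shr ecx,3. */
static uint32_t tail_bytes(uint32_t n)
{
    return n & 7u;
}

/* Iterations a bottom-tested `dec ecx / jnz` loop performs when it is
   entered with the given count. A count of 0 wraps to 0xFFFFFFFF on the
   first dec, so the loop body runs 2^32 times. */
static uint64_t dec_jnz_iterations(uint32_t count)
{
    return count != 0 ? (uint64_t)count : (UINT64_C(1) << 32);
}
```

So a caller would presumably want to test for a zero quadword count before entering the loop, and keep the original count around for the tail.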
Rugxulo
Usono, 18.02.2009, 00:48
@ FFK
|
MMX moves |
> movq mm0,ds:[esi];
You don't need the "ds:" part, but it should still work the same.
|
Japheth
Germany (South), 18.02.2009, 09:08
@ FFK
|
MMX moves |
> Here is a faster version of {block A}:
>
> shr ecx,3
> @mmxloop:
> movq mm0,ds:[esi]
> dec ecx
> lea esi,[esi+8]
> movq es:[edi],mm0
> lea edi,[edi+8]
> jnz @mmxloop
That's probably true, but in reality the effect will be somewhere between "zero" and "virtually zero". --- MS-DOS forever! |
FFK
18.02.2009, 09:44
@ Japheth
|
MMX moves |
> That's probably true, but in reality the effect will be somewhere between
> "zero" and "virtually zero".
In practice it's very dependent on CPU, FSB, destination and source RAM.
But theoretically, with this code we are saving about 3 CPU cycles for each copy. And I think that 3 CPU cycles are not "zero" or "virtually zero".
Anyway, we can test and see what happens. |
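For anyone who wants to "test and see", a very small timing harness could look like the C sketch below. It is only an outline under several assumptions: clock() is coarse, the names copy_fn, copy_bytewise, copy_libc and time_copy are invented here, and an optimizing compiler may shorten the work, so results should be treated with care. The copy routines shown are placeholders, not the MMX loops from this thread.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical harness, for illustration only. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Placeholder candidate 1: plain byte loop. */
static void copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Placeholder candidate 2: the C library's memcpy. */
static void copy_libc(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Runs fn `repeats` times and returns the elapsed CPU time in seconds. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        fn(dst, src, n);
    return (double)(clock() - start) / (double)CLOCKS_PER_SEC;
}
```

Running each candidate with the same buffer size and repeat count, on the target CPU, gives a first impression; the buffer size matters a lot because of the caches.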
marcov
18.02.2009, 09:55
@ rr
|
MMX moves |
> > There's no speed to gain with MMX registers.
> >
> > A pretty good pdf about this topic, which you hopefully can find via
> > Google, is: gdc_2002_amd.pdf
>
> http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or
> memcpy_article.zip)
> also looks interesting.
I use routines from
http://fastcode.sourceforge.net/challenge_content/FastMove.html
at work.
It comes with a little test suite that allows you to quickly assess which routine is faster with a given workload for a given CPU. |