DOS ain't dead

Laaca Czech republic, 16.02.2009, 21:20	MMX moves (Developers)
Am I doing something wrong, or are simple MMX moves slower than normal 386 moves? I measured that if I switch on "block A", it is about 8% slower than if I switch it off. Both DS:ESI and ES:EDI point to normal RAM (not VRAM or ROM). My code:

{DS:ESI = source; ES:EDI = destination; ECX = number of bytes to copy}
{block A}
@mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop
{block B}
shr ecx,1;pushf;shr ecx,1;rep movsd;adc ecx,ecx
rep movsw;popf;adc ecx,ecx;rep movsb
--- DOS-u-akbar!
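For readers who find the two blocks hard to follow, here is a minimal C sketch of the same structure: an 8-bytes-per-pass loop followed by a tail that handles the remainder. The function name `copy_chunked` is hypothetical and the sketch says nothing about speed; it only mirrors the control flow. One difference is deliberate: the loop as posted tests ECX only at the bottom, so it assumes at least 8 bytes on entry, whereas the sketch checks the count first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies n bytes from src to dst (buffers must not overlap). */
static void copy_chunked(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(d != NULL && s != NULL);

    /* Stage 1 (like "block A"): 8 bytes per pass. The count is checked
       before the first pass, so short buffers go straight to stage 2. */
    while (n >= 8) {
        uint64_t q;
        memcpy(&q, s, 8);   /* typically compiled to one 64-bit load  */
        memcpy(d, &q, 8);   /* typically compiled to one 64-bit store */
        s += 8;
        d += 8;
        n -= 8;
    }

    /* Stage 2 (like "block B"): at most 7 bytes remain, handled as
       one optional dword, one optional word and one optional byte. */
    if (n & 4) { memcpy(d, s, 4); s += 4; d += 4; }
    if (n & 2) { memcpy(d, s, 2); s += 2; d += 2; }
    if (n & 1) { *d = *s; }
}
```

Calling `copy_chunked(dst, src, 23)` runs the 8-byte loop twice and then copies the remaining 7 bytes as 4 + 2 + 1.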
rr Berlin, Germany, 16.02.2009, 21:34 @ Laaca	MMX moves
> I measured that if I switch on "block A", it is about 8% slower than if
> I switch it off.

Which CPU is this on?

> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

I think those "[eXi]" operands require some slow effective address calculations on each iteration. Maybe some ASM coder knows more? Japheth, mht?
--- Forum admin
Japheth Germany (South), 16.02.2009, 22:05 @ Laaca	MMX moves
There's no speed to gain simply by using MMX registers instead of the standard ones. At the very least you'll have to use the MOVNTQ instruction to get a significant boost.

A pretty good PDF about this topic, which you can hopefully find via Google, is gdc_2002_amd.pdf. It shows how to achieve faster memcopy speed by using several "tricks" (MMX, XMM, "prefetch").

IIRC I once wrote a small tool which implemented most of the strategies mentioned in this document. There's a small chance that I'll be able to remember what I named it.
--- MS-DOS forever!
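To make the MOVNTQ suggestion concrete, here is a hedged C sketch that uses the compiler intrinsic `_mm_stream_pi`, which maps to MOVNTQ. It is not taken from the PDF or from any tool mentioned in this thread. The helper name `copy_nontemporal` and the `HAVE_MOVNTQ` macro are hypothetical, and the sketch assumes 8-byte-aligned, non-overlapping buffers whose size is a multiple of 8. Where the compiler does not enable MMX and SSE, it falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__) && defined(__SSE__)
#include <xmmintrin.h>   /* _mm_stream_pi (MOVNTQ), _mm_sfence, _mm_empty */
#define HAVE_MOVNTQ 1
#else
#define HAVE_MOVNTQ 0    /* hypothetical macro: no MMX/SSE intrinsics here */
#endif

/* Hypothetical helper. Assumptions: dst and src are 8-byte aligned,
   do not overlap, and n is a multiple of 8. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 7) == 0);
    assert(((uintptr_t)src & 7) == 0);
    assert((n & 7) == 0);
#if HAVE_MOVNTQ
    __m64 *d = (__m64 *)dst;
    const __m64 *s = (const __m64 *)src;
    for (size_t i = 0; i < n / 8; i++)
        _mm_stream_pi(&d[i], s[i]);  /* MOVNTQ: store that bypasses the cache */
    _mm_sfence();  /* order the streamed stores before later stores */
    _mm_empty();   /* EMMS: release the MMX registers for x87 use */
#else
    memcpy(dst, src, n);  /* fallback when the intrinsics are unavailable */
#endif
}
```

Non-temporal stores tend to pay off only for large copies whose destination will not be read again soon; for small blocks that stay in cache they can be slower, so measuring on the target CPU is the only reliable guide.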
rr Berlin, Germany, 16.02.2009, 22:11 @ Japheth	MMX moves
> There's no speed to gain with MMX registers.
>
> A pretty good pdf about this topic, which you hopefully can find via
> Google, is: gdc_2002_amd.pdf

http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or memcpy_article.zip) also looks interesting.
--- Forum admin
FFK 16.02.2009, 23:04 @ Laaca	MMX moves
Try copying 8-byte-aligned 64-byte blocks with this:

movq mm0,ds:[esi];
movq mm1,ds:[esi+32];
movq mm2,ds:[esi+8];
movq mm3,ds:[esi+40];
movq mm4,ds:[esi+16];
movq mm5,ds:[esi+48];
movq mm6,ds:[esi+24];
movq mm7,ds:[esi+56];
movq es:[edi],mm0
movq es:[edi+32],mm1
movq es:[edi+8],mm2
movq es:[edi+40],mm3
movq es:[edi+16],mm4
movq es:[edi+48],mm5
movq es:[edi+24],mm6
movq es:[edi+56],mm7

At least one of ESI and EDI should be 8-byte aligned; if both are aligned, you will get maximum speed.
Rugxulo Usono, 17.02.2009, 00:54 @ Laaca	MMX moves
Caveat: I am far from an expert! You'll be hard-pressed to find anyone who can 100% tell you about this stuff. (I've looked, it's complex! Different advice is found everywhere!)

> Am I doing something wrong, or are simple MMX moves slower than normal 386
> moves?
>
> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

Seg overrides are always slower, and DS: is the default (even though some dumb assemblers will stick it in there anyway, wasting space). I think something like this only really helps on large data. (You could also try putting 8 into a spare register and using that instead of the immediate value. Not sure how much that'd help, though.)

> {block B}
> shr ecx,1;pushf;shr ecx,1;rep movsd;adc ecx,ecx
> rep movsw;popf;adc ecx,ecx;rep movsb

Since you're almost certainly writing this for a 686, their internal register renaming helps a lot (even for the stack and flags). So it will be more difficult to beat their default "rep movsb" (which is fairly fast on semi-modern, superscalar, out-of-order machines).
Japheth Germany (South), 17.02.2009, 13:52 @ rr	MMX moves
> > There's no speed to gain with MMX registers.
> >
> > A pretty good pdf about this topic, which you hopefully can find via
> > Google, is: gdc_2002_amd.pdf
>
> http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or
> memcpy_article.zip)
> also looks interesting.

Thanks! Apparently this is even slightly better.

I remembered the name of the tool mentioned in my last post and uploaded it:

http://www.japheth.de/Download/memspeed.zip

It needs HXRT to run. It can be assembled with JWasm, but one will also need the HXDEV package.
--- MS-DOS forever!
Laaca Czech republic, 17.02.2009, 16:29 @ Japheth	MMX moves
Thanks guys! I will look at it. However, far more interesting for me is the Mem_copy variant which will skip the zero values (for transparent sprites in Hicolor mode).
--- DOS-u-akbar!
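As a point of reference for the transparent variant, here is a minimal C sketch of a copy that skips zero pixels. It assumes 16-bit Hicolor pixels and that the value 0 marks a transparent pixel, as described in the post; the function name `put_row_transparent` is hypothetical and this is not Laaca's actual PutSprite code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `count` 16-bit pixels from src to dst,
   leaving dst untouched wherever the source pixel is 0 (transparent). */
static void put_row_transparent(uint16_t *dst, const uint16_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        if (src[i] != 0)
            dst[i] = src[i];   /* opaque pixel: overwrite the background */
    }
}
```

A branch-free MMX version would typically build a mask with PCMPEQW against zero and merge source and destination with PAND and POR, but that requires reading the destination, which is expensive when the destination is video memory.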
Japheth Germany (South), 17.02.2009, 17:16 @ Laaca	MMX moves
> However, far more interesting for me is the Mem_copy variant which will
> skip the zero values (for transparent sprites in Hicolor mode).

But that's a totally different animal. And you didn't mention this "variant" in your previous post. Also, if video memory is the destination, forget all strategies which try to achieve gains by doing "cache tricks".
--- MS-DOS forever!
Laaca Czech republic, 17.02.2009, 18:56 @ Japheth	MMX moves
> But that's a totally different animal. And you didn't mention this
> "variant" in your previous post. Also, if video memory is the destination,
> forget all strategies which try to achieve gains by doing "cache tricks".

Yes, I know, it's a completely different animal. I just want to optimize both of my PutSprites: the normal one and the transparent one.
--- DOS-u-akbar!
FFK 17.02.2009, 21:37 @ Laaca	MMX moves
>
> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

Here is a faster version of {block A}:

shr ecx,3
@mmxloop:
movq mm0,ds:[esi];
dec ecx
Lea esi,[esi+8]
movq es:[edi],mm0
lea edi,[edi+8];
jnz @mmxloop
Rugxulo Usono, 18.02.2009, 00:48 @ FFK	MMX moves
`movq mm0,ds:[esi];`

You don't need the "ds:" part, but it should still work the same.
Japheth Germany (South), 18.02.2009, 09:08 @ FFK	MMX moves
> Here a faster version of {block A}
>
> shr ecx,3
> @mmxloop:
> movq mm0,ds:[esi];
> dec ecx
> Lea esi,[esi+8]
> movq es:[edi],mm0
> lea edi,[edi+8];
> jnz @mmxloop

That's probably true, but in reality the effect will be somewhere between "zero" and "virtually zero".
--- MS-DOS forever!
FFK 18.02.2009, 09:44 @ Japheth	MMX moves
> That's probably true, but in reality the effect will be somewhere between
> "zero" and "virtually zero".

In practice it's very dependent on the CPU, the FSB, and the destination and source RAM. But theoretically, with this code we are saving about 3 CPU cycles for each copy, and I think that 3 CPU cycles are not "zero" or "virtually zero".

Anyway, we can test and see what happens.
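For anyone who wants to "test and see", a rough timing harness could look like the C sketch below. It is only an assumption about how one might measure; it is not the memspeed tool or any test suite from this thread. The names `copy_fn`, `copy_memcpy` and `time_copy` are hypothetical, and `clock()` has coarse resolution, so a large repetition count is needed for meaningful numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by all copy routines under test. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline candidate: the C library's memcpy behind the common signature. */
static void copy_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Returns the CPU time in seconds spent on `reps` copies of n bytes. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, int reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(dst, src, n);   /* call through the pointer on every pass */
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Timing each candidate with the same buffers, sizes and repetition count, and repeating the run a few times, gives a first impression of whether a difference of a few cycles per iteration shows up at all on a given machine.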
marcov 18.02.2009, 09:55 @ rr	MMX moves
> > There's no speed to gain with MMX registers.
> >
> > A pretty good pdf about this topic, which you hopefully can find via
> > Google, is: gdc_2002_amd.pdf
>
> http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or
> memcpy_article.zip)
> also looks interesting.

I use routines from http://fastcode.sourceforge.net/challenge_content/FastMove.html at work. It comes with a little test suite that allows you to quickly assess which routine is faster with a given workload for a given CPU.
