DOS ain't dead - MMX moves

Laaca

Czech republic,
16.02.2009, 21:20

MMX moves (Developers)

Am I doing something wrong, or are simple MMX moves slower than normal 386 moves?
I measured that with "block A" switched on, the copy is about 8% slower than with it switched off.
Both DS:ESI and ES:EDI point to normal RAM (not VRAM or ROM).

My code:

{DS:ESI = source; ES:EDI = destination; ECX = number of bytes to copy}

{block A}
@mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

{block B}
shr ecx,1;pushf;shr ecx,1;rep movsd;adc ecx,ecx
rep movsw;popf;adc ecx,ecx;rep movsb
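
For readers who do not read x86 flags tricks fluently, here is a rough C sketch of what {block B} computes. The function name `copy_block_b` is made up for illustration, and the sketch only mirrors the size arithmetic (dwords first, then at most one word, then at most one byte); it says nothing about speed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring {block B}: copy n bytes as (n / 4) dwords,
   then one word if bit 1 of n is set, then one byte if bit 0 of n is set. */
void copy_block_b(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    size_t dwords = n >> 2;        /* the two "shr ecx,1" combined */
    int has_word  = (n >> 1) & 1;  /* carry out of the second shift */
    int has_byte  = n & 1;         /* carry out of the first shift, saved by pushf */

    for (size_t i = 0; i < dwords; i++) {  /* rep movsd */
        memcpy(dst, src, 4);
        dst += 4;
        src += 4;
    }
    if (has_word) {                        /* rep movsw with ECX = 0 or 1 */
        memcpy(dst, src, 2);
        dst += 2;
        src += 2;
    }
    if (has_byte) {                        /* rep movsb with ECX = 0 or 1 */
        *dst = *src;
    }
}
```

Calling `copy_block_b(dst, src, 23)` would copy 5 dwords, 1 word and 1 byte, which is the same split the ADC ECX,ECX sequence produces.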

---
DOS-u-akbar!

rr

Berlin, Germany,
16.02.2009, 21:34

@ Laaca

> I measured that if I switch on the "block A" it is about 8% slower than if
> I switch it off.

On what CPU?

> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

I think those "[eXi]" require some slow effective address calculations on each iteration. Maybe some ASM coder knows more? Japheth, mht?

---
Forum admin

Japheth

Germany (South),
16.02.2009, 22:05

@ Laaca

There's no speed to gain simply by using MMX registers instead of the standard ones. You'll have to use at least the MOVNTQ instruction to get a significant boost.

A pretty good pdf about this topic, which you hopefully can find via Google, is: gdc_2002_amd.pdf

It shows how to achieve faster memcopy speed by using several "tricks" (MMX, XMM, "prefetch"). IIRC I once wrote a small tool which implemented most of the strategies mentioned in this document. There's a small chance that I'll be able to remember what I named it.
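
To make the non-temporal store idea a bit more concrete, here is a hedged C sketch. It is not taken from the PDF or from any tool mentioned in this thread. It swaps in the SSE2 sibling of MOVNTQ (MOVNTDQ, via the `_mm_stream_si128` intrinsic), because MMX intrinsics are not available on every 64-bit compiler, and it falls back to plain `memcpy` when SSE2 is not available. The name `copy_nontemporal` and the `HAVE_NT_STORES` macro are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

/* Hypothetical helper: copy n bytes, using streaming (cache-bypassing)
   stores for the bulk of the data when SSE2 is available. */
void copy_nontemporal(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(d != NULL && s != NULL);

#if HAVE_NT_STORES
    /* Head: plain byte copies until the destination is 16-byte aligned,
       which _mm_stream_si128 requires. */
    while (n > 0 && ((uintptr_t)d & 15u) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Bulk: unaligned load, non-temporal store (MOVNTDQ). */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
    _mm_sfence();  /* make the streaming stores visible before returning */
#endif
    memcpy(d, s, n);  /* tail, or the whole copy without SSE2 */
}
```

Whether this beats a cached copy depends heavily on the block size and the CPU; streaming stores mainly pay off for blocks much larger than the cache, so treat it as something to measure rather than a guaranteed win.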

---
MS-DOS forever!

rr

Berlin, Germany,
16.02.2009, 22:11

@ Japheth

> There's no speed to gain with MMX registers.
>
> A pretty good pdf about this topic, which you hopefully can find via
> Google, is: gdc_2002_amd.pdf

http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or memcpy_article.zip) also looks interesting.

---
Forum admin

Japheth

Germany (South),
17.02.2009, 13:52

@ rr

> > There's no speed to gain with MMX registers.
> >
> > A pretty good pdf about this topic, which you hopefully can find via
> > Google, is: gdc_2002_amd.pdf
>
> http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or
> memcpy_article.zip)
> also looks interesting.

Thanks! Apparently this is even slightly better.

I remembered the name of the tool mentioned in my last post and uploaded it: http://www.japheth.de/Download/memspeed.zip

It needs HXRT to run. It can be assembled with JWasm, but one will also need the HXDEV package.

---
MS-DOS forever!

Laaca

Czech republic,
17.02.2009, 16:29

@ Japheth

Thanks guys! I will look at it.

However, far more interesting for me is the Mem_copy variant which skips the zero values (for transparent sprites in Hicolor mode).
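
As a reference point for that variant, here is a minimal C sketch of a "skip zero" copy for 16-bit (hicolor) pixels. Both function names are made up for illustration. The second version uses a mask instead of a branch, which is roughly the shape a packed compare-and-merge would take; note that it reads the destination, which is usually a bad idea if the destination is video memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `count` 16-bit pixels, leaving the destination
   untouched wherever the source pixel is 0 (treated as transparent). */
void put_row_transparent(uint16_t *dst, const uint16_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        uint16_t p = src[i];
        if (p != 0)
            dst[i] = p;
    }
}

/* Same result without a branch: build an all-ones mask for opaque pixels
   and merge source and destination through it. */
void put_row_transparent_masked(uint16_t *dst, const uint16_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        uint16_t p = src[i];
        uint16_t mask = (uint16_t)(p != 0 ? 0xFFFFu : 0x0000u);
        dst[i] = (uint16_t)((p & mask) | (dst[i] & (uint16_t)~mask));
    }
}
```

Both functions handle one row; a sprite routine would call one of them per scanline with the appropriate pitch.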

---
DOS-u-akbar!

Japheth

Germany (South),
17.02.2009, 17:16

@ Laaca

> However far more interesting for me is the Mem_copy variant which will
> skip the zero values (for transparent sprites (in Hicolor mode))

But that's a totally different animal. And you didn't mention this "variant" in your previous post. Also, if video memory is the destination, forget all strategies which try to achieve gains by doing "cache tricks".

---
MS-DOS forever!

Laaca

Czech republic,
17.02.2009, 18:56

@ Japheth

> But that's a totally different animal. And you didn't mention this
> "variant" in your previous post. Also, if video memory is the destination,
> forget all strategies which try to achieve gains by doing "cache tricks".

Yes, I know - completely different animal.
I just want to optimize both of my PutSprites, the normal one and the transparent one.

---
DOS-u-akbar!

marcov

18.02.2009, 09:55

@ rr

> > There's no speed to gain with MMX registers.
> >
> > A pretty good pdf about this topic, which you hopefully can find via
> > Google, is: gdc_2002_amd.pdf
>
> http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or
> memcpy_article.zip)
> also looks interesting.

I use routines from

http://fastcode.sourceforge.net/challenge_content/FastMove.html

at work.

It comes with a little test suite that allows you to quickly assess which routine is faster with a given workload for a given CPU.

FFK

16.02.2009, 23:04

@ Laaca

Try copying 8-byte-aligned blocks of 64 bytes with this:

movq mm0,ds:[esi];
movq mm1,ds:[esi+32];
movq mm2,ds:[esi+8];
movq mm3,ds:[esi+40];
movq mm4,ds:[esi+16];
movq mm5,ds:[esi+48];
movq mm6,ds:[esi+24];
movq mm7,ds:[esi+56];

movq es:[edi],mm0
movq es:[edi+32],mm1
movq es:[edi+8],mm2
movq es:[edi+40],mm3
movq es:[edi+16],mm4
movq es:[edi+48],mm5
movq es:[edi+24],mm6
movq es:[edi+56],mm7

At least one of ESI or EDI should be 8-byte aligned, and if both are aligned you will get max speed :-)
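
The same pattern (eight loads, then eight stores, in the interleaved offset order used above) can be sketched in C, together with the alignment check. The names `is_aligned8` and `copy_blocks64` are made up for illustration, and a C compiler is free to reorder or vectorize this, so it only documents the access pattern rather than reproducing the exact instruction sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero if p is on an 8-byte boundary. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Hypothetical helper: copy `blocks` blocks of 64 bytes each.
   Per block: 8 qword loads first, then 8 qword stores, using the
   same offset order as the listing (0, 32, 8, 40, 16, 48, 24, 56). */
void copy_blocks64(unsigned char *dst, const unsigned char *src, size_t blocks)
{
    static const int order[8] = {0, 32, 8, 40, 16, 48, 24, 56};
    assert(dst != NULL && src != NULL);

    while (blocks-- > 0) {
        uint64_t q[8];                      /* stands in for mm0..mm7 */
        for (int i = 0; i < 8; i++)
            memcpy(&q[i], src + order[i], 8);
        for (int i = 0; i < 8; i++)
            memcpy(dst + order[i], &q[i], 8);
        src += 64;
        dst += 64;
    }
}
```

A caller would copy `n / 64` blocks this way and handle the remaining `n % 64` bytes separately, for example with the plain REP MOVS style tail from the first post.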

Rugxulo

Usono,
17.02.2009, 00:54

@ Laaca

Caveat: I am far from an expert! You'll be hard-pressed to find anyone who can 100% tell you about this stuff. (I've looked, it's complex! Different advice is found everywhere!)

> Am I doing something wrong, or are simple MMX moves slower than normal 386
> moves?
>
> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

Seg overrides are always slower, and DS: is default (even though some dumb assemblers will stick it in there anyways, wasting space). I think something like this only really helps on large data. (You could also try putting 8 into a spare register and using that instead of the immediate value. Not sure how much that'd help, though.)

> {block B}
> shr ecx,1;pushf;shr ecx,1;rep movsd;adc ecx,ecx
> rep movsw;popf;adc ecx,ecx;rep movsb

Since you're almost certainly writing this for a 686, their internal register renaming helps a lot (even for the stack and flags). So it will be more difficult to beat their default "rep movsb" (which is fairly fast on semi-modern, superscalar, out-of-order machines).

FFK

17.02.2009, 21:37

@ Laaca

> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

Here is a faster version of {block A}:

shr ecx,3
@mmxloop:
movq mm0,ds:[esi];
dec ecx
Lea esi,[esi+8]
movq es:[edi],mm0
lea edi,[edi+8];
jnz @mmxloop
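
For comparison, here is a hedged C rendering of that count-down loop. The function name `copy_qwords_countdown` is made up for illustration. Two things are worth noticing when reading the asm next to it: with fewer than 8 bytes, SHR ECX,3 yields 0 and DEC ECX then wraps around, so the sketch adds a guard that the asm does not have; and the loop leaves ECX at 0, so the byte count modulo 8 has to be saved beforehand if a tail copy such as {block B} is meant to follow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy whole qwords only, counting down to zero.
   Returns the number of bytes copied; the caller handles nbytes & 7. */
size_t copy_qwords_countdown(unsigned char *dst, const unsigned char *src,
                             size_t nbytes)
{
    assert(dst != NULL && src != NULL);

    size_t qwords = nbytes >> 3;       /* shr ecx,3 */
    size_t copied = qwords * 8;

    if (qwords == 0)                   /* guard: not present in the asm loop */
        return 0;

    do {
        memcpy(dst, src, 8);           /* movq mm0,[esi] / movq es:[edi],mm0 */
        src += 8;                      /* lea esi,[esi+8] */
        dst += 8;                      /* lea edi,[edi+8] */
    } while (--qwords != 0);           /* dec ecx / jnz @mmxloop */

    return copied;
}
```

The saving compared with the original loop is that the separate CMP disappears: the decrement already sets the flag the branch tests, and LEA advances the pointers without touching the flags.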

Rugxulo

Usono,
18.02.2009, 00:48

@ FFK

> movq mm0,ds:[esi];

You don't need the "ds:" part, but it should still work the same.

Japheth

Germany (South),
18.02.2009, 09:08

@ FFK

> Here is a faster version of {block A}
>
> shr ecx,3
> @mmxloop:
> movq mm0,ds:[esi];
> dec ecx
> Lea esi,[esi+8]
> movq es:[edi],mm0
> lea edi,[edi+8];
> jnz @mmxloop

That's probably true, but in reality the effect will be somewhere between "zero" and "virtually zero".

---
MS-DOS forever!

FFK

18.02.2009, 09:44

@ Japheth

> That's probably true, but in reality the effect will be somewhere between
> "zero" and "virtually zero".

In practice it's very dependent on the CPU, the FSB, and the destination and source RAM. But theoretically, with this code we are saving about 3 CPU cycles for each copy, and I think that 3 CPU cycles are not "zero" or "virtually zero". Anyway, we can test and see what happens.
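
A test along those lines could be as simple as the following C sketch. Everything here is an assumption made for illustration: the names `copy_fn`, `copy_libc`, `copy_bytes` and `time_copy` are invented, and `clock()` is a coarse timer, so real measurements would need many repetitions, a warm-up run, and some care that the compiler does not optimize the copies away.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature for the copy routines under test. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline: whatever the C library's memcpy does. */
void copy_libc(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Deliberately naive candidate: one byte at a time. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n-- > 0)
        *d++ = *s++;
}

/* Run fn `reps` times over the same buffers and return elapsed
   processor time in seconds, as reported by clock(). */
double time_copy(copy_fn fn, void *dst, const void *src, size_t n, int reps)
{
    assert(fn != NULL && reps > 0);

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    clock_t t1 = clock();

    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Timing each candidate with the same buffers, sizes and repetition count, and then repeating the run with different block sizes and alignments, gives a first impression of whether a difference of a few cycles per iteration is visible at all on a given machine.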