DOS ain't dead

Laaca

Czech republic,
16.02.2009, 21:20
 

MMX moves (Developers)

Am I doing something wrong, or are simple MMX moves slower than normal 386 moves?
I measured that if I switch on "block A", it is about 8% slower than if I switch it off.
Both DS:ESI and ES:EDI point to normal RAM (not VRAM or ROM).

My code:

{DS:ESI = source; ES:EDI = destination; ECX = number of bytes to copy}

{block A}
@mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

{block B}
shr ecx,1;pushf;shr ecx,1;rep movsd;adc ecx,ecx
rep movsw;popf;adc ecx,ecx;rep movsb
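
For anyone who finds the flag juggling in block B hard to follow, here is a minimal C sketch of what the two blocks are meant to compute together. It models the logic only, not the timing, and the name `copy_like_blocks_ab` is made up for illustration. One thing it does differently: as far as I can tell, block A tests ECX only at the bottom of the loop, so it moves 8 bytes even when fewer than 8 were requested. The sketch checks the count first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: models block A followed by block B in plain C. */
static void copy_like_blocks_ab(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* Block A: 8-byte moves while at least 8 bytes remain. */
    while (n >= 8) {
        memcpy(dst, src, 8);
        dst += 8; src += 8; n -= 8;
    }

    /* Block B: the two SHRs peel bit 0 and bit 1 off the count. */
    size_t dwords   = n >> 2;        /* count used by rep movsd       */
    size_t has_word = (n >> 1) & 1;  /* carry out of the second SHR   */
    size_t has_byte = n & 1;         /* carry out of the first SHR    */

    memcpy(dst, src, dwords * 4);    /* rep movsd */
    dst += dwords * 4; src += dwords * 4;
    memcpy(dst, src, has_word * 2);  /* rep movsw: zero or one word   */
    dst += has_word * 2; src += has_word * 2;
    memcpy(dst, src, has_byte);      /* rep movsb: zero or one byte   */
}
```

After block A the remaining count is at most 7, so block B copies at most one dword, one word and one byte.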

---
DOS-u-akbar!

rr

Berlin, Germany,
16.02.2009, 21:34

@ Laaca
 

MMX moves

> I measured that if I switch on "block A", it is about 8% slower than if
> I switch it off.

On what CPU?

> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

I think those "[eXi]" require some slow effective address calculations on each iteration. Maybe some ASM coder knows more? Japheth, mht?

---
Forum admin

Japheth

Germany (South),
16.02.2009, 22:05

@ Laaca
 

MMX moves

There's no speed to gain simply by using MMX registers instead of the standard ones. At least you'll have to use the MOVNTQ instruction to get a significant boost.

A pretty good pdf about this topic, which you hopefully can find via Google, is: gdc_2002_amd.pdf

It shows how to achieve faster memcopy speed by using several "tricks" (MMX, XMM, "prefetch" ). IIRC I once wrote a small tool which implemented most of the strategies mentioned in this document. There's a small chance that I'll be able to remember how I named it.
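
To make the MOVNTQ idea concrete, here is a rough C sketch using compiler intrinsics. It is not taken from the PDF or from my tool. It assumes a GCC- or Clang-style compiler targeting x86 with MMX and SSE enabled, and falls back to plain `memcpy` everywhere else. The name `copy_streaming` is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__MMX__) && defined(__SSE__)
#include <xmmintrin.h>   /* _mm_stream_pi, _mm_sfence, _mm_empty */
#define HAVE_MOVNTQ 1
#else
#define HAVE_MOVNTQ 0
#endif

/* Hypothetical helper: copies n bytes, streaming the 8-byte body
   with MOVNTQ where the target supports it. */
static void copy_streaming(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(d != NULL && s != NULL);

#if HAVE_MOVNTQ
    while (n >= 8) {
        __m64 q;
        memcpy(&q, s, 8);              /* load 8 bytes from the source       */
        _mm_stream_pi((__m64 *)d, q);  /* MOVNTQ: store that bypasses cache  */
        d += 8; s += 8; n -= 8;
    }
    _mm_sfence();  /* order the streamed stores before later stores */
    _mm_empty();   /* EMMS: hand the MMX registers back to the FPU  */
#endif

    memcpy(d, s, n);  /* the tail, or everything if MOVNTQ is unavailable */
}
```

Non-temporal stores tend to pay off only for blocks much larger than the cache, so measure before adopting anything like this.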

---
MS-DOS forever!

rr

Berlin, Germany,
16.02.2009, 22:11

@ Japheth
 

MMX moves

> There's no speed to gain with MMX registers.
>
> A pretty good pdf about this topic, which you hopefully can find via
> Google, is: gdc_2002_amd.pdf

http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or memcpy_article.zip) also looks interesting.

---
Forum admin

Japheth

Germany (South),
17.02.2009, 13:52

@ rr
 

MMX moves

> > There's no speed to gain with MMX registers.
> >
> > A pretty good pdf about this topic, which you hopefully can find via
> > Google, is: gdc_2002_amd.pdf
>
> http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or
> memcpy_article.zip)
> also looks interesting.

Thanks! Apparently this is even slightly better.

I remembered the name of the tool mentioned in my last post and uploaded it: http://www.japheth.de/Download/memspeed.zip

It needs HXRT to run. It can be assembled with JWasm, but one will also need the HXDEV package.

---
MS-DOS forever!

Laaca

Czech republic,
17.02.2009, 16:29

@ Japheth
 

MMX moves

Thanks guys! I will look at it.

However, far more interesting for me is the Mem_copy variant that skips zero values (for transparent sprites in Hicolor mode).
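
To show what I mean, here is a small C sketch of such a transparent copy for 16-bit pixels, with 0 treated as the transparent colour. It is only a model of the idea, not my actual PutSprite, and the name `put_sprite_transparent` is made up. The mask trick mirrors what an MMX version could do with PCMPEQW against zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies 16-bit pixels, leaving the destination
   untouched wherever the source pixel is 0 (transparent). */
static void put_sprite_transparent(uint16_t *dst, const uint16_t *src, size_t pixels)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pixels; i++) {
        /* 0xFFFF where the source pixel is 0, otherwise 0 */
        uint16_t mask = (uint16_t)-(src[i] == 0);
        /* keep the old pixel under the mask; src contributes 0 there */
        dst[i] = (uint16_t)((dst[i] & mask) | src[i]);
    }
}
```

Note that this version reads the destination. Reading back from video memory is typically slow, so a branching version that only writes the non-zero pixels may do better there.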

---
DOS-u-akbar!

Japheth

Germany (South),
17.02.2009, 17:16

@ Laaca
 

MMX moves

> However, far more interesting for me is the Mem_copy variant that skips
> zero values (for transparent sprites in Hicolor mode).

But that's a totally different animal. And you didn't mention this "variant" in your previous post. Also, if video memory is the destination, forget all strategies which try to achieve gains by doing "cache tricks".

---
MS-DOS forever!

Laaca

Czech republic,
17.02.2009, 18:56

@ Japheth
 

MMX moves

> But that's a totally different animal. And you didn't mention this
> "variant" in your previous post. Also, if video memory is the destination,
> forget all strategies which try to achieve gains by doing "cache tricks".

Yes, I know - completely different animal.
I just want to optimize both of my PutSprites: the normal one and the transparent one.

---
DOS-u-akbar!

marcov

18.02.2009, 09:55

@ rr
 

MMX moves

> > There's no speed to gain with MMX registers.
> >
> > A pretty good pdf about this topic, which you hopefully can find via
> > Google, is: gdc_2002_amd.pdf
>
> http://www.stud.uni-karlsruhe.de/~urkt/memcpy.pdf (or
> memcpy_article.zip)
> also looks interesting.

I use routines from

http://fastcode.sourceforge.net/challenge_content/FastMove.html

at work.

It comes with a small test suite that lets you quickly assess which routine is faster with a given workload on a given CPU.
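
If you only want a quick comparison without that suite, a minimal timing harness could look like the C sketch below. It is not part of FastMove. `copy_fn` and `time_copy` are names invented here, and `clock()` is a coarse timer, so use large repeat counts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy's signature can be plugged in. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical harness: CPU seconds spent on `reps` copies of n bytes. */
static double time_copy(copy_fn fn, void *dst, const void *src, size_t n, int reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call it once per candidate routine with the same buffers and sizes, for example `time_copy(memcpy, dst, src, 4096, 100000)`, and compare the numbers. An optimizing compiler may shorten such a loop, so check the results against a second method.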

FFK

16.02.2009, 23:04

@ Laaca
 

MMX moves

Try copying 8-byte-aligned 64-byte blocks with this:

movq mm0,ds:[esi];
movq mm1,ds:[esi+32];
movq mm2,ds:[esi+8];
movq mm3,ds:[esi+40];
movq mm4,ds:[esi+16];
movq mm5,ds:[esi+48];
movq mm6,ds:[esi+24];
movq mm7,ds:[esi+56];

movq es:[edi],mm0
movq es:[edi+32],mm1
movq es:[edi+8],mm2
movq es:[edi+40],mm3
movq es:[edi+16],mm4
movq es:[edi+48],mm5
movq es:[edi+24],mm6
movq es:[edi+56],mm7

At least one of ESI and EDI should be 8-byte aligned; if both are aligned you will get maximum speed :-)
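
One way to get that alignment is to copy a few bytes first until the pointer sits on an 8-byte boundary, then run the 64-byte blocks. Here is a rough C sketch of the bookkeeping. Aligning the destination rather than the source is just my assumption for the example, and both function names are made up. The inner `memcpy` of 64 bytes stands in for the eight MOVQ loads and stores above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes to copy singly before p reaches an 8-byte boundary (0..7). */
static size_t head_to_align8(const void *p)
{
    return (size_t)(-(uintptr_t)p & 7u);
}

/* Hypothetical helper: unaligned head, 64-byte blocks, then the tail. */
static void copy_aligned_dst(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    size_t head = head_to_align8(dst);
    if (head > n)
        head = n;                 /* short copies may never reach alignment */
    memcpy(dst, src, head);
    dst += head; src += head; n -= head;

    while (n >= 64) {             /* dst is 8-byte aligned from here on */
        memcpy(dst, src, 64);
        dst += 64; src += 64; n -= 64;
    }
    memcpy(dst, src, n);          /* fewer than 64 bytes remain */
}
```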

Rugxulo

Usono,
17.02.2009, 00:54

@ Laaca
 

MMX moves

Caveat: I am far from an expert! You'll be hard-pressed to find anyone who can 100% tell you about this stuff. (I've looked, it's complex! Different advice is found everywhere!)

> Am I doing something wrong, or are simple MMX moves slower than normal 386
> moves?
>
> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop

Seg overrides are always slower, and DS: is default (even though some dumb assemblers will stick it in there anyways, wasting space). I think something like this only really helps on large data. (You could also try putting 8 into a spare register and using that instead of the immediate value. Not sure how much that'd help, though.)

> {block B}
> shr ecx,1;pushf;shr ecx,1;rep movsd;adc ecx,ecx
> rep movsw;popf;adc ecx,ecx;rep movsb

Since you're almost certainly writing this for a 686, their internal register renaming helps a lot (even for the stack and flags). So it will be more difficult to beat their default "rep movsb" (which is fairly fast on semi-modern, superscalar, out-of-order machines).

FFK

17.02.2009, 21:37

@ Laaca
 

MMX moves

>
> {block A}
> @mmxloop: movq mm0,ds:[esi];movq es:[edi],mm0
> add esi,8;sub ecx,8;add edi,8;cmp ecx,8;jge @mmxloop


Here is a faster version of {block A}:

shr ecx,3
@mmxloop:
movq mm0,ds:[esi];
dec ecx
Lea esi,[esi+8]
movq es:[edi],mm0
lea edi,[edi+8];
jnz @mmxloop
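
Two things to watch with this loop, sketched in C below. First, SHR ECX,3 throws away the low three bits of the count, so save them beforehand if block B should still copy the remainder. Second, if ECX starts below 8, the count becomes 0 and DEC wraps it around, so the loop needs a guard. The function name `copy_qwords_counted` is invented for the example, and the sketch models the control flow only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the SHR / DEC / JNZ loop with a zero guard.
   Returns the number of leftover bytes (0..7) for a tail copy. */
static size_t copy_qwords_counted(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    size_t tail   = n & 7;   /* save the remainder before the shift */
    size_t blocks = n >> 3;  /* shr ecx,3                           */

    if (blocks != 0) {       /* without this, DEC on 0 wraps around */
        do {
            memcpy(dst, src, 8);      /* movq load + movq store */
            dst += 8; src += 8;       /* the two LEAs           */
        } while (--blocks != 0);      /* dec ecx / jnz          */
    }
    return tail;
}
```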

Rugxulo

Usono,
18.02.2009, 00:48

@ FFK
 

MMX moves

movq mm0,ds:[esi];

You don't need the "ds:" part, but it should still work the same.

Japheth

Germany (South),
18.02.2009, 09:08

@ FFK
 

MMX moves

> Here is a faster version of {block A}:
>
> shr ecx,3
> @mmxloop:
> movq mm0,ds:[esi];
> dec ecx
> Lea esi,[esi+8]
> movq es:[edi],mm0
> lea edi,[edi+8];
> jnz @mmxloop

That's probably true, but in reality the effect will be somewhere between "zero" and "virtually zero".

---
MS-DOS forever!

FFK

18.02.2009, 09:44

@ Japheth
 

MMX moves

> That's probably true, but in reality the effect will be somewhere between
> "zero" and "virtually zero".

In practice it depends heavily on the CPU, the FSB, and the destination and source RAM.
But theoretically, with this code we are saving about 3 CPU cycles for each copy. And I think that 3 CPU cycles are not "zero" or "virtually zero" ;-)
Anyway, we can test and see what happens.
