Fastest possible multiplication routine?

ページ 4/7
1 | 2 | 3 | | 5 | 6 | 7

By wouter_

Hero (525)

wouter_ さんの画像

14-03-2011, 11:41

Do you know how is the C flag after muluw_hl_bc ?
maybe I could also skip the and a 's if I were sure C us reset....

Unfortunately this is not possible. Also from the openMSX source code:

Verified on real R800:
    YHXN flags are unchanged
    SV   flags are reset
    Z    flag is set when result is zero
    C    flag is set when result doesn't fit in 16-bit

In other words C-flag is set when register DE is non-zero.

BTW what do you think about point 2)?

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

14-03-2011, 12:41

Do you know how is the C flag after muluw_hl_bc ?
maybe I could also skip the and a 's if I were sure C us reset....

Unfortunately this is not possible. Also from the openMSX source code:

Verified on real R800:
    YHXN flags are unchanged
    SV   flags are reset
    Z    flag is set when result is zero
    C    flag is set when result doesn't fit in 16-bit

In other words C-flag is set when register DE is non-zero.

good to know in any case, thanks


BTW what do you think about point 2)?

I canot figure out your proposal. How do you mean to change the two sections?

Same, about point 4), what is the alternative?

By wouter_

Hero (525)

wouter_ さんの画像

14-03-2011, 12:53

I canot figure out your proposal. How do you mean to change the two sections?

Currently for the case bc_neg_hl_pos you have

	push	hl
	muluw_hl_bc		; result in DEHL
	ex	de,hl		; result in HLDE
	pop	bc
	and	a
	sbc	hl,bc		; hl*bc + -1*hl<<16
	ret

I think the following is a bit faster: it replaces the expensive PUSH/POP instructions with more but cheaper instructions. If I counted correctly it should be 6 cycles faster.

	ld	d,b
	ld	e,c
	ex	de,hl
	ld	b,d
	ld	c,e
	muluw_hl_bc		; result in DEHL
	ex	de,hl		; result in HLDE
	and	a
	sbc	hl,bc		; hl*bc + -1*hl<<16
	ret

By Metalion

Paragon (1625)

Metalion さんの画像

14-03-2011, 13:15

I think that a generic multiplication routine is not the best answer in ASM.
It is far better to find the fastest routine that performs a specific multiplication (or a family of those).
It is quite rare to have the need for a generic multiplication routine in ASM, apart from real time generated graphics using trigonometrics (3D, ...).

But it is a very good mental exercise, nonetheless Tongue

Keep up the optimization LOL!

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

14-03-2011, 13:24

Good, in this way I can merge two branches

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;	Signed int 16 bit x 16 bit -> 32 bit multiplication for r800
;	Called with 1st arg in DE, 2nd arg in BC. Returns with
;	HLDE = DE*BC

	global _imul

_imul:
	ex	de,hl				; HL*BC

	bit	7,b
	jr	nz,bc_is_neg
	bit	7,h
	jr	nz,hl_is_neg

bc_hl_are_pos:
	                                        ; HL>=0 and BC>=0
	muluw_hl_bc				; result in DEHL
	ex		de,hl			; result in HLDE
	ret

bc_is_neg:
	bit	7,h
	jr	nz,bc_hl_are_neg
	                                        ; HL>=0 and BC<0
        ex_bc_hl                                ; swap bc and hl

hl_is_neg:					; HL<0 and BC>=0
	muluw_hl_bc				; result in DEHL
	ex		de,hl			; result in HLDE
	and		a
	sbc		hl,bc			; hl*bc + -1*bc<<16
	ret


bc_hl_are_neg:
                                                ; HL<0 and BC<0
	push	        hl
	muluw_hl_bc				; result in DEHL
	ex		de,hl			; result in HLDE
	and		a
	sbc		hl,bc			; hl*bc + -1*hl<<16
	pop		bc
	and		a
	sbc		hl,bc			; hl*bc + -1*hl<<16 + -1*bc<<16
	ret

where:

muluw_hl_bc macro
            db 0xed,0xc3
            endm

ex_bc_hl macro
	ld	d,b
	ld	e,c
	ex	de,hl
	ld	b,d
	ld	c,e
        endm

Thanks!

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

14-03-2011, 13:30

I think that a generic multiplication routine is not the best answer in ASM.
It is far better to find the fastest routine that performs a specific multiplication (or a family of those).
It is quite rare to have the need for a generic multiplication routine in ASM, apart from real time generated graphics using trigonometrics (3D, ...).

But it is a very good mental exercise, nonetheless Tongue

Keep up the optimization LOL!

This is the case of maze3d.The code is used in this game:
https://sites.google.com/site/testmsx/msx2-doom
The latest M3D.BIN works as SCC rom under a TR

By wouter_

Hero (525)

wouter_ さんの画像

14-03-2011, 13:37

Same, about point 4), what is the alternative?

Point 4 is very much a detail.

There are 4 possible control flows through your routine (the 2 inputs can each be either positive or negative). Each of these flows takes a certain amount of cycles to execute. The fastest path is for both inputs positive. The slowest is probably when both are negative (I didn't actually check). Point 4) is not about making the (average) execution time faster, but about making the difference between the fastest and slowest path smaller.

On R800 a conditional JR or JP instruction that does not actually jump takes less cycles to execute than one that does jump (on Z80 this is only the case for JR instructions). You can use this to arrange your code so that the slowest subroutine (both inputs negative) is reached without actually jumping, while to reach the fastest subroutine (both positive) you have to execute 2 actually taken jump instructions.

By Edwin

Paragon (1182)

Edwin さんの画像

14-03-2011, 23:38

Funny how this subject resurfaces at this point in time. I was just messing with it this weekend. Yes, it is for some unknown secret project Tongue

However, I have the multiplication side covered. I only needed 8bit x 8bit = 8bit (clipped result) and 16bit x 16bit = 16bit. Which works both for signed and unsigned numbers.

However, the same cannot be said for divisions. The net is full of good examples of 8 and 16 bit unsigned divisions, but none for signed. The only thought I have had so far is to do the same as proposed earlier: xor the two high bits and and make the numbers positive, do the division, and make it negative if the xorred bit is one.

If anybody has some faster implementation for signed divisions, I'm all ears. Note that they will end up in something which will be released including source.

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

15-03-2011, 00:05

small update at
https://sites.google.com/site/testmsx/msx2-doom
using the code from this post

run M3D.BIN in maze3d_20110315.rar
on a openmsx TR as scc rom

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

15-03-2011, 15:15

Funny how this subject resurfaces at this point in time. I was just messing with it this weekend. Yes, it is for some unknown secret project Tongue

However, I have the multiplication side covered. I only needed 8bit x 8bit = 8bit (clipped result) and 16bit x 16bit = 16bit. Which works both for signed and unsigned numbers.

However, the same cannot be said for divisions. The net is full of good examples of 8 and 16 bit unsigned divisions, but none for signed. The only thought I have had so far is to do the same as proposed earlier: xor the two high bits and and make the numbers positive, do the division, and make it negative if the xorred bit is one.

If anybody has some faster implementation for signed divisions, I'm all ears. Note that they will end up in something which will be released including source.

Do you need signed division 16 bit vs 16 bit, 16 bit vs 8 bit or 8bit vs 8 bit ?
z80 or r800?

ページ 4/7
1 | 2 | 3 | | 5 | 6 | 7