Fastest possible multiplication routine?

ページ 5/7
1 | 2 | 3 | 4 | | 6 | 7

By Edwin

Paragon (1182)

Edwin さんの画像

15-03-2011, 20:01

I need full 8 bit or full 16 bit. No mixed versions needed. Actually, I need the remainder as well, but it doesn't have to be from the same routine as the unsigned version gives. In z80 asm. Although I'm not so sure that r800 will make any difference.

By GhostwriterP

Paladin (683)

GhostwriterP さんの画像

15-03-2011, 20:20

All I can think off is that old school rule devision is multiplying with the inverse... For a 16 bit value this would imply a 64kb Wink table to quickly convert a number X to 1/X and subsequently use the the above singed multiplication routine. Possibly only for use on r800.

I did use muluw for some fixed devision (unsigned) some time ago... let me look it up...

found it!

;---------------------------------------------------
; 16 bit - div 12 ;mod 12
;
; valid value range 0 - 12288
;
; in: hl    uit: hl = div 12 , e = mod 12
;
; verandert: a,bc,de,hl


divmod12

  ld bc,5461   ; 65536 / 12

  muluw hl,bc

  push de      ; >>> bewaar div 12
/*
  ld a,21      ; afronding fout correctie grote getallen
  add a,h
  ld h,a
*/
  ld bc,12     ; rest x 12 = mod

  muluw hl,bc

  pop hl       ; <<<

  ret

Does also calculate the reminder. Does not help you though. Tongue

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

15-03-2011, 22:32

Untested, but basically it is GosthwriterP's code on 8 bits


; fastdiv
; input: E=dividend, C=divisor
; output: E=quotient, H=rest/divisor*256

fastdiv:
	xor	a
	ld	d,a
	ld	b,a
	ld	hl,HelpTable-1
	add	hl,bc
	ld	c,(hl)
	inc	hl
	ld	b,(hl)
	ex	de,hl
	muluw hl,bc
	ret

HelpTable: 		; dw	2^16/i = 1..255
  dw 65536 
  dw 32768 
  dw 21845 
  dw 16384 
  dw 13107 
  dw 10923 
  dw 9362 
  dw 8192 
  dw 7282 
  dw 6554 
  dw 5958 
  dw 5461 
  dw 5041 
  dw 4681 
  dw 4369 
  dw 4096 
  dw 3855 
  dw 3641 
  dw 3449 
  dw 3277 
  dw 3121 
  dw 2979 
  dw 2849 
  dw 2731 
  dw 2621 
  dw 2521 
  dw 2427 
  dw 2341 
  dw 2260 
  dw 2185 
  dw 2114 
  dw 2048 
  dw 1986 
  dw 1928 
  dw 1872 
  dw 1820 
  dw 1771 
  dw 1725 
  dw 1680 
  dw 1638 
  dw 1598 
  dw 1560 
  dw 1524 
  dw 1489 
  dw 1456 
  dw 1425 
  dw 1394 
  dw 1365 
  dw 1337 
  dw 1311 
  dw 1285 
  dw 1260 
  dw 1237 
  dw 1214 
  dw 1192 
  dw 1170 
  dw 1150 
  dw 1130 
  dw 1111 
  dw 1092 
  dw 1074 
  dw 1057 
  dw 1040 
  dw 1024 
  dw 1008 
  dw 993 
  dw 978 
  dw 964 
  dw 950 
  dw 936 
  dw 923 
  dw 910 
  dw 898 
  dw 886 
  dw 874 
  dw 862 
  dw 851 
  dw 840 
  dw 830 
  dw 819 
  dw 809 
  dw 799 
  dw 790 
  dw 780 
  dw 771 
  dw 762 
  dw 753 
  dw 745 
  dw 736 
  dw 728 
  dw 720 
  dw 712 
  dw 705 
  dw 697 
  dw 690 
  dw 683 
  dw 676 
  dw 669 
  dw 662 
  dw 655 
  dw 649 
  dw 643 
  dw 636 
  dw 630 
  dw 624 
  dw 618 
  dw 612 
  dw 607 
  dw 601 
  dw 596 
  dw 590 
  dw 585 
  dw 580 
  dw 575 
  dw 570 
  dw 565 
  dw 560 
  dw 555 
  dw 551 
  dw 546 
  dw 542 
  dw 537 
  dw 533 
  dw 529 
  dw 524 
  dw 520 
  dw 516 
  dw 512 
  dw 508 
  dw 504 
  dw 500 
  dw 496 
  dw 493 
  dw 489 
  dw 485 
  dw 482 
  dw 478 
  dw 475 
  dw 471 
  dw 468 
  dw 465 
  dw 462 
  dw 458 
  dw 455 
  dw 452 
  dw 449 
  dw 446 
  dw 443 
  dw 440 
  dw 437 
  dw 434 
  dw 431 
  dw 428 
  dw 426 
  dw 423 
  dw 420 
  dw 417 
  dw 415 
  dw 412 
  dw 410 
  dw 407 
  dw 405 
  dw 402 
  dw 400 
  dw 397 
  dw 395 
  dw 392 
  dw 390 
  dw 388 
  dw 386 
  dw 383 
  dw 381 
  dw 379 
  dw 377 
  dw 374 
  dw 372 
  dw 370 
  dw 368 
  dw 366 
  dw 364 
  dw 362 
  dw 360 
  dw 358 
  dw 356 
  dw 354 
  dw 352 
  dw 350 
  dw 349 
  dw 347 
  dw 345 
  dw 343 
  dw 341 
  dw 340 
  dw 338 
  dw 336 
  dw 334 
  dw 333 
  dw 331 
  dw 329 
  dw 328 
  dw 326 
  dw 324 
  dw 323 
  dw 321 
  dw 320 
  dw 318 
  dw 317 
  dw 315 
  dw 314 
  dw 312 
  dw 311 
  dw 309 
  dw 308 
  dw 306 
  dw 305 
  dw 303 
  dw 302 
  dw 301 
  dw 299 
  dw 298 
  dw 297 
  dw 295 
  dw 294 
  dw 293 
  dw 291 
  dw 290 
  dw 289 
  dw 287 
  dw 286 
  dw 285 
  dw 284 
  dw 282 
  dw 281 
  dw 280 
  dw 279 
  dw 278 
  dw 277 
  dw 275 
  dw 274 
  dw 273 
  dw 272 
  dw 271 
  dw 270 
  dw 269 
  dw 267 
  dw 266 
  dw 265 
  dw 264 
  dw 263 
  dw 262 
  dw 261 
  dw 260 
  dw 259 
  dw 258 
  dw 257 

Thinking to the signed version (for z80)
Tongue

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

16-03-2011, 08:18

Signed and unsigned 16 bit divisions (and modulus % operator)


;	16 bit divide and modulus routines

;	called with dividend in hl and divisor in de

;	returns with result in hl.

;	adiv (amod) is signed divide (modulus), ldiv (lmod) is unsigned

	global	adiv,ldiv,amod,lmod
	psect	text,class=CODE

amod:
	call	adiv
	ex	de,hl		;put modulus in hl
	ret

lmod:
	call	ldiv
	ex	de,hl
	ret

ldiv:
	xor	a
	push	af
	ex	de,hl
	jr	dv1


adiv:
	ld	a,h
	xor	d		;set sign flag for quotient
	ld	a,h		;get sign of dividend
	push	af
	call	negif
	ex	de,hl
	call	negif
dv1:	ld	b,1
	ld	a,h
	or	l
	jr	nz,dv8
	pop	af
	ret

dv8:	push	hl
	add	hl,hl
	jr	c,dv2
	ld	a,d
	cp	h
	jr	c,dv2
	jp	nz,dv6
	ld	a,e
	cp	l
	jr	c,dv2
dv6:	pop	af
	inc	b
	jp	dv8

dv2:	pop	hl
	ex	de,hl
	push	hl
	ld	hl,0
	ex	(sp),hl

dv4:	ld	a,h
	cp	d
	jr	c,dv3
	jp	nz,dv5
	ld	a,l
	cp	e
	jr	c,dv3

dv5:	sbc	hl,de
dv3:	ex	(sp),hl
	ccf
	adc	hl,hl
	srl	d
	rr	e
	ex	(sp),hl
	djnz	dv4
	pop	de
	ex	de,hl
	pop	af
	call	m,negat
	ex	de,hl
	or	a			;test remainder sign bit
	call	m,negat
	ex	de,hl
	ret

negif:	bit	7,h
	ret	z
negat:	ld	b,h
	ld	c,l
	ld	hl,0
	or	a
	sbc	hl,bc
	ret

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

16-03-2011, 08:27

PS
Taken from sources of the Hi-Tech crosscompiler for windows (v7.8)

By wouter_

Hero (525)

wouter_ さんの画像

16-03-2011, 10:14

Untested, but basically it is GosthwriterP's code on 8 bits

Unfortunately it doesn't work .. it gives off-by-one errors for many inputs. For example: 3 / 3 will result in 0 instead of 1 (3 * 21845 = 0x0000ffff, upper 16-bits are zero). But if you add 1 to all the entries in your lookup table the result will be correct (because you use 16-bit multiplication).

If off-by-one errors are acceptable for your application, there's a much faster routine possible, using 8-bit multiplication. I also optimized the LUT by aligning it on a 256-byte boundary (a similar optimization can be done on your original routine. Actually your routine has a bug: each entry in your table is 2 bytes long but you forgot to multiply the index by 2):

; input:    [A]=dividend, [L]=divisor
; output:   [H]=quotient
; modifies: [B][L]
; remark:   The calculated quotient can be off-by-one!!!

approx_udiv8:
	ld h,LUT_recip8 / 256 ; high byte of the table
	ld b,(hl)             ; B = 256 / L
	mulub a,b             ; H = (A * (256 / L)) / 256 ~= A / L (with rounding errors)
	ret

	.align 256            ; alternative: ds (256 - ($ and 255)) and 255
LUT_recip8:
	[ table with 256 entries generated by the formula
	    '256 / i'   or   '(256 / i) + 1'
	  both formulas result in off-by-one errors ]

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

16-03-2011, 14:00

Untested, but basically it is GosthwriterP's code on 8 bits

Unfortunately it doesn't work .. it gives off-by-one errors for many inputs. For example: 3 / 3 will result in 0 instead of 1 (3 * 21845 = 0x0000ffff, upper 16-bits are zero). But if you add 1 to all the entries in your lookup table the result will be correct (because you use 16-bit multiplication).

If off-by-one errors are acceptable for your application, there's a much faster routine possible, using 8-bit multiplication. I also optimized the LUT by aligning it on a 256-byte boundary (a similar optimization can be done on your original routine. Actually your routine has a bug: each entry in your table is 2 bytes long but you forgot to multiply the index by 2):

; input:    [A]=dividend, [L]=divisor
; output:   [H]=quotient
; modifies: [B][L]
; remark:   The calculated quotient can be off-by-one!!!

approx_udiv8:
	ld h,LUT_recip8 / 256 ; high byte of the table
	ld b,(hl)             ; B = 256 / L
	mulub a,b             ; H = (A * (256 / L)) / 256 ~= A / L (with rounding errors)
	ret

	.align 256            ; alternative: ds (256 - ($ and 255)) and 255
LUT_recip8:
	[ table with 256 entries generated by the formula
	    '256 / i'   or   '(256 / i) + 1'
	  both formulas result in off-by-one errors ]

I did not test the table agains the rounding errors. I confirm also the DW issue an extra add hl,bc is needed. Thanks! !
Anyway, even fixing both problems, your solution using 8 bit data is much more efficient and should give the very same results than mine using less space and time.

By wouter_

Hero (525)

wouter_ さんの画像

19-03-2011, 20:25

First of all sorry for the long post. I hope some of you will find it interesting nevertheless.

PS
Taken from sources of the Hi-Tech crosscompiler for windows (v7.8)

Ugh, that's very ugly code ;-) There has to be a better way...

I started from this routine I found via google (I made some very minor changes). It does 8-bit by 8-bit unsigned division.

; -- unsigned 8-bit / 8-bit = 8-bit  (8-bit remainder)
; input:  [L] = dividend, [H] = divisor
; output: [L] = quotient, [A] = remainder
;         ([C],[DE],[H] unchanged, [B] = 0)
; cycles Z80:  395-435 (If I didn't make any mistakes!!!)
;        R800:  87-95
; size: 14 bytes
                    ; Z80 R800
udiv8:	xor  a      ;  5    1
	ld   b,8    ;  8    2
loop:	sla  l      ; 10    2
	rla         ;  5    1
	cp   h      ;  5    1
	jr   c,next ; 13/8 3/2
	sub  h      ;  5    1
	inc  l      ;  5    1
next:   djnz loop   ; 14/9 3/2
	ret         ; 11    5

After some tweaking I got this. It has one instruction less in the inner loop. But the setup code before the loop is now more expensive, so the total speed gain isn't that much. However negating the divisor is also something that needs to happen in a signed division routine. So I hope this setup cost will partly disappear in a signed division routine. I didn't try to write such routine yet. Note that the worst and best-case execution times are now the same.

; -- unsigned 8-bit / 8-bit = 8-bit  (8-bit remainder)
; input:  [L] = dividend, [A] = divisor
; output: [L] = quotient, [A] = remainder
;         ([C],[DE] unchanged, [B] = 0)
; cycles Z80:  420 (If I didn't make any mistakes!!!)
;        R800:  92
; size: 18 bytes
                     ; Z80 R800
udiv8:	neg          ; 10    2
	ld   h,a     ;  5    1     ; h = -divisor
	xor  a       ;  5    1
	ld   b,8     ;  8    2
	rl   l       ; 10    2
loop:	rla          ;  5    1
	add  h       ;  5    1     ; a += -divisor
	jr   c,next  ; 13/8 3/2
	sub  h       ;  5    1     ; after this instr C-flag is always clear
next:	rl   l       ; 10    2
	djnz loop    ; 14/9 3/2
	ret          ; 11    5

If you really care about speed (and less about code size), you can unroll the loop. This has the extra advantage of not using the B register.

; -- 8-bit / 8-bit = 8-bit  (8-bit remainder)
; input:  [L] = dividend, [A] = divisor
; output: [L] = quotient, [A] = remainder
;         ([BC],[DE] unchanged)
; cycles Z80:  305 (If I didn't make any mistakes!!!)
;        R800:  67
; size: 63 bytes
                     ; Z80 R800
udiv8:	neg          ; 10    2
	ld   h,a     ;  5    1
	xor  a       ;  5    1
	rl   l       ; 10    2
	rla          ;  5    1
	add  h       ;  5    1
	jr   c,next1 ; 13/8 3/2
	sub  h       ;  5    1
next1:	rl   l       ; 10    2
	rla          ;  5    1
	add  h       ;  5    1
	jr   c,next2 ; 13/8 3/2
	sub  h       ;  5    1
next2:	rl   l       ; 10    2
	rla          ;  5    1
	add  h       ;  5    1
	jr   c,next3 ; 13/8 3/2
	sub  h       ;  5    1
next3:	rl   l       ; 10    2
	rla          ;  5    1
	add  h       ;  5    1
	jr   c,next4 ; 13/8 3/2
	sub  h       ;  5    1
next4:	rl   l       ; 10    2
	rla          ;  5    1
	add  h       ;  5    1
	jr   c,next5 ; 13/8 3/2
	sub  h       ;  5    1
next5:	rl   l       ; 10    2
	rla          ;  5    1
	add  h       ;  5    1
	jr   c,next6 ; 13/8 3/2
	sub  h       ;  5    1
next6:	rl   l       ; 10    2
	rla          ;  5    1
	add  h       ;  5    1
	jr   c,next7 ; 13/8 3/2
	sub  h       ;  5    1
next7:	rl   l       ; 10    2
	rla          ;  5    1
	add  h       ;  5    1
	jr   c,next8 ; 13/8 3/2
	sub  h       ;  5    1
next8:	rl   l       ; 10    2
	ret          ; 11    5


I also wrote a 16-bit version.

; -- 16-bit / 16-bit = 16-bit  (16-bit remainder)
; input:  [DE] = dividend, [BC] = divisor
; output: [DE] = quotient, [HL] = remainder
;         ([BC] = unchanged, [A] = 0)
; cycles Z80:  1438-1470 (If I didn't make any mistakes!!!)
;        R800:  265-281  (Can be tweaked by changing jp->jr
;                         but that's worse for Z80.)
; size: 27 bytes
                      ; Z80 R800
udiv16:	ld   hl,0     ; 11    3
	ld   a,16     ;  8    2
loop:	sla  e        ; 10    2
	rl   d        ; 10    2
	adc  hl,hl    ; 17    2    after this instr C-flag=0
	sbc  hl,bc    ; 17    2
	jr   nc,next1 ; 13/8 3/2   z80 and r800: jr here is faster than jp
next0:	add  hl,bc    ; 12    1
	dec  a        ;  5    1
	jp   nz,loop  ; 11   4/3   z80: jp is faster  r800: jr is faster
	ret           ; 11    5
next1:  inc  e        ;  5    1    before this instr e.0 was 0
	dec  a        ;  5    1
	jp   nz,loop  ; 11   4/3   z80: jp is faster  r800: jr is faster
	ret           ; 11    5

A similar tweak as for the 8-bit version can be done here. Though in this case it not a clear win: the best-case execution time got faster, but worst-case got slower (at least on Z80, on R800 it is a clear win). The inner loop has the same worst-case speed, but the setup code is more expensive. When this code is extended to signed division I again expect that the setup-cost will be reduced compared to the version above. This version is also easier to unroll if you need extra speed (unrolled version not shown).

; -- 16-bit / 16-bit = 16-bit  (16-bit remainder)
; input:  [BC] = dividend, [DE] = divisor
; output: [BC] = quotient, [HL] = remainder
;         ([A] = 0)
; cycles Z80:  1334-1526 (If I didn't make any mistakes!!!)
;        R800:  258-274  (Can be tweaked by changing jp->jr
;                         but that's worse for Z80.)
; size: 30 bytes
                      ; Z80 R800
udiv16:	xor  a        ;  5    1    C-flag = 0
	ld   h,a      ;  5    1
	ld   l,a      ;  5    1    HL = 0
	sbc  hl,de    ; 17    2 
	ex   de,hl    ;  5    1    DE = -divisor
	ld   h,a      ;  5    1
	ld   l,a      ;  5    1    HL = 0
	ld   a,16     ;  8    2
	rl   c        ; 10    2
	rl   b        ; 10    2
loop:	adc  hl,hl    ; 17    2
	add  hl,de    ; 12    1    HL += -divisor
	jr   c,next   ; 13/8 3/2   z80 and r800: jr here is faster than jp
	sbc  hl,de    ; 17    2    after this instr C-flag is clear
next	rl   c        ; 10    2
        rl   b        ; 10    2
	dec  a        ;  5    1
	jp   nz,loop  ; 11   4/3   z80: jp is faster  r800: jr is faster
	ret           ; 11    5


BTW: when comparing the cycle count of Z80 to R800, remember that Z80 runs at 3.5MHz while R800 runs at 7MHz (so 2 R800 cycles take the same amount of time as 1 Z80 cycle).

By WORP3

Paladin (864)

WORP3 さんの画像

19-03-2011, 22:56

If you don't care about code size you could even use loop up tables instead of loops !

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

20-03-2011, 14:43

well done, even if on r800 the 1/x table is by far faster

ページ 5/7
1 | 2 | 3 | 4 | | 6 | 7