I need full 8 bit or full 16 bit. No mixed versions needed. Actually, I need the remainder as well, but it doesn't have to be from the same routine as the unsigned version gives. In z80 asm. Although I'm not so sure that r800 will make any difference.
All I can think off is that old school rule devision is multiplying with the inverse... For a 16 bit value this would imply a 64kb table to quickly convert a number X to 1/X and subsequently use the the above singed multiplication routine. Possibly only for use on r800.
I did use muluw for some fixed devision (unsigned) some time ago... let me look it up...
found it!
;--------------------------------------------------- ; 16 bit - div 12 ;mod 12 ; ; valid value range 0 - 12288 ; ; in: hl uit: hl = div 12 , e = mod 12 ; ; verandert: a,bc,de,hl divmod12 ld bc,5461 ; 65536 / 12 muluw hl,bc push de ; >>> bewaar div 12 /* ld a,21 ; afronding fout correctie grote getallen add a,h ld h,a */ ld bc,12 ; rest x 12 = mod muluw hl,bc pop hl ; <<< ret
Does also calculate the reminder. Does not help you though.
Untested, but basically it is GosthwriterP's code on 8 bits
; fastdiv ; input: E=dividend, C=divisor ; output: E=quotient, H=rest/divisor*256 fastdiv: xor a ld d,a ld b,a ld hl,HelpTable-1 add hl,bc ld c,(hl) inc hl ld b,(hl) ex de,hl muluw hl,bc ret HelpTable: ; dw 2^16/i = 1..255 dw 65536 dw 32768 dw 21845 dw 16384 dw 13107 dw 10923 dw 9362 dw 8192 dw 7282 dw 6554 dw 5958 dw 5461 dw 5041 dw 4681 dw 4369 dw 4096 dw 3855 dw 3641 dw 3449 dw 3277 dw 3121 dw 2979 dw 2849 dw 2731 dw 2621 dw 2521 dw 2427 dw 2341 dw 2260 dw 2185 dw 2114 dw 2048 dw 1986 dw 1928 dw 1872 dw 1820 dw 1771 dw 1725 dw 1680 dw 1638 dw 1598 dw 1560 dw 1524 dw 1489 dw 1456 dw 1425 dw 1394 dw 1365 dw 1337 dw 1311 dw 1285 dw 1260 dw 1237 dw 1214 dw 1192 dw 1170 dw 1150 dw 1130 dw 1111 dw 1092 dw 1074 dw 1057 dw 1040 dw 1024 dw 1008 dw 993 dw 978 dw 964 dw 950 dw 936 dw 923 dw 910 dw 898 dw 886 dw 874 dw 862 dw 851 dw 840 dw 830 dw 819 dw 809 dw 799 dw 790 dw 780 dw 771 dw 762 dw 753 dw 745 dw 736 dw 728 dw 720 dw 712 dw 705 dw 697 dw 690 dw 683 dw 676 dw 669 dw 662 dw 655 dw 649 dw 643 dw 636 dw 630 dw 624 dw 618 dw 612 dw 607 dw 601 dw 596 dw 590 dw 585 dw 580 dw 575 dw 570 dw 565 dw 560 dw 555 dw 551 dw 546 dw 542 dw 537 dw 533 dw 529 dw 524 dw 520 dw 516 dw 512 dw 508 dw 504 dw 500 dw 496 dw 493 dw 489 dw 485 dw 482 dw 478 dw 475 dw 471 dw 468 dw 465 dw 462 dw 458 dw 455 dw 452 dw 449 dw 446 dw 443 dw 440 dw 437 dw 434 dw 431 dw 428 dw 426 dw 423 dw 420 dw 417 dw 415 dw 412 dw 410 dw 407 dw 405 dw 402 dw 400 dw 397 dw 395 dw 392 dw 390 dw 388 dw 386 dw 383 dw 381 dw 379 dw 377 dw 374 dw 372 dw 370 dw 368 dw 366 dw 364 dw 362 dw 360 dw 358 dw 356 dw 354 dw 352 dw 350 dw 349 dw 347 dw 345 dw 343 dw 341 dw 340 dw 338 dw 336 dw 334 dw 333 dw 331 dw 329 dw 328 dw 326 dw 324 dw 323 dw 321 dw 320 dw 318 dw 317 dw 315 dw 314 dw 312 dw 311 dw 309 dw 308 dw 306 dw 305 dw 303 dw 302 dw 301 dw 299 dw 298 dw 297 dw 295 dw 294 dw 293 dw 291 dw 290 dw 289 dw 287 dw 286 dw 285 dw 284 dw 282 dw 281 dw 280 dw 279 dw 278 dw 277 dw 275 dw 274 dw 273 dw 272 dw 271 dw 270 dw 269 dw 267 dw 266 dw 265 dw 264 dw 263 dw 262 dw 261 dw 260 dw 259 dw 258 dw 257
Thinking to the signed version (for z80)
Signed and unsigned 16 bit divisions (and modulus % operator)
; 16 bit divide and modulus routines ; called with dividend in hl and divisor in de ; returns with result in hl. ; adiv (amod) is signed divide (modulus), ldiv (lmod) is unsigned global adiv,ldiv,amod,lmod psect text,class=CODE amod: call adiv ex de,hl ;put modulus in hl ret lmod: call ldiv ex de,hl ret ldiv: xor a push af ex de,hl jr dv1 adiv: ld a,h xor d ;set sign flag for quotient ld a,h ;get sign of dividend push af call negif ex de,hl call negif dv1: ld b,1 ld a,h or l jr nz,dv8 pop af ret dv8: push hl add hl,hl jr c,dv2 ld a,d cp h jr c,dv2 jp nz,dv6 ld a,e cp l jr c,dv2 dv6: pop af inc b jp dv8 dv2: pop hl ex de,hl push hl ld hl,0 ex (sp),hl dv4: ld a,h cp d jr c,dv3 jp nz,dv5 ld a,l cp e jr c,dv3 dv5: sbc hl,de dv3: ex (sp),hl ccf adc hl,hl srl d rr e ex (sp),hl djnz dv4 pop de ex de,hl pop af call m,negat ex de,hl or a ;test remainder sign bit call m,negat ex de,hl ret negif: bit 7,h ret z negat: ld b,h ld c,l ld hl,0 or a sbc hl,bc ret
PS
Taken from sources of the Hi-Tech crosscompiler for windows (v7.8)
Untested, but basically it is GosthwriterP's code on 8 bits
Unfortunately it doesn't work .. it gives off-by-one errors for many inputs. For example: 3 / 3 will result in 0 instead of 1 (3 * 21845 = 0x0000ffff, upper 16-bits are zero). But if you add 1 to all the entries in your lookup table the result will be correct (because you use 16-bit multiplication).
If off-by-one errors are acceptable for your application, there's a much faster routine possible, using 8-bit multiplication. I also optimized the LUT by aligning it on a 256-byte boundary (a similar optimization can be done on your original routine. Actually your routine has a bug: each entry in your table is 2 bytes long but you forgot to multiply the index by 2):
; input: [A]=dividend, [L]=divisor ; output: [H]=quotient ; modifies: [B][L] ; remark: The calculated quotient can be off-by-one!!! approx_udiv8: ld h,LUT_recip8 / 256 ; high byte of the table ld b,(hl) ; B = 256 / L mulub a,b ; H = (A * (256 / L)) / 256 ~= A / L (with rounding errors) ret .align 256 ; alternative: ds (256 - ($ and 255)) and 255 LUT_recip8: [ table with 256 entries generated by the formula '256 / i' or '(256 / i) + 1' both formulas result in off-by-one errors ]
Untested, but basically it is GosthwriterP's code on 8 bits
Unfortunately it doesn't work .. it gives off-by-one errors for many inputs. For example: 3 / 3 will result in 0 instead of 1 (3 * 21845 = 0x0000ffff, upper 16-bits are zero). But if you add 1 to all the entries in your lookup table the result will be correct (because you use 16-bit multiplication).
If off-by-one errors are acceptable for your application, there's a much faster routine possible, using 8-bit multiplication. I also optimized the LUT by aligning it on a 256-byte boundary (a similar optimization can be done on your original routine. Actually your routine has a bug: each entry in your table is 2 bytes long but you forgot to multiply the index by 2):
; input: [A]=dividend, [L]=divisor ; output: [H]=quotient ; modifies: [B][L] ; remark: The calculated quotient can be off-by-one!!! approx_udiv8: ld h,LUT_recip8 / 256 ; high byte of the table ld b,(hl) ; B = 256 / L mulub a,b ; H = (A * (256 / L)) / 256 ~= A / L (with rounding errors) ret .align 256 ; alternative: ds (256 - ($ and 255)) and 255 LUT_recip8: [ table with 256 entries generated by the formula '256 / i' or '(256 / i) + 1' both formulas result in off-by-one errors ]
I did not test the table agains the rounding errors. I confirm also the DW issue an extra add hl,bc is needed. Thanks! !
Anyway, even fixing both problems, your solution using 8 bit data is much more efficient and should give the very same results than mine using less space and time.
First of all sorry for the long post. I hope some of you will find it interesting nevertheless.
PS
Taken from sources of the Hi-Tech crosscompiler for windows (v7.8)
Ugh, that's very ugly code ;-) There has to be a better way...
I started from this routine I found via google (I made some very minor changes). It does 8-bit by 8-bit unsigned division.
; -- unsigned 8-bit / 8-bit = 8-bit (8-bit remainder) ; input: [L] = dividend, [H] = divisor ; output: [L] = quotient, [A] = remainder ; ([C],[DE],[H] unchanged, [B] = 0) ; cycles Z80: 395-435 (If I didn't make any mistakes!!!) ; R800: 87-95 ; size: 14 bytes ; Z80 R800 udiv8: xor a ; 5 1 ld b,8 ; 8 2 loop: sla l ; 10 2 rla ; 5 1 cp h ; 5 1 jr c,next ; 13/8 3/2 sub h ; 5 1 inc l ; 5 1 next: djnz loop ; 14/9 3/2 ret ; 11 5
After some tweaking I got this. It has one instruction less in the inner loop. But the setup code before the loop is now more expensive, so the total speed gain isn't that much. However negating the divisor is also something that needs to happen in a signed division routine. So I hope this setup cost will partly disappear in a signed division routine. I didn't try to write such routine yet. Note that the worst and best-case execution times are now the same.
; -- unsigned 8-bit / 8-bit = 8-bit (8-bit remainder) ; input: [L] = dividend, [A] = divisor ; output: [L] = quotient, [A] = remainder ; ([C],[DE] unchanged, [B] = 0) ; cycles Z80: 420 (If I didn't make any mistakes!!!) ; R800: 92 ; size: 18 bytes ; Z80 R800 udiv8: neg ; 10 2 ld h,a ; 5 1 ; h = -divisor xor a ; 5 1 ld b,8 ; 8 2 rl l ; 10 2 loop: rla ; 5 1 add h ; 5 1 ; a += -divisor jr c,next ; 13/8 3/2 sub h ; 5 1 ; after this instr C-flag is always clear next: rl l ; 10 2 djnz loop ; 14/9 3/2 ret ; 11 5
If you really care about speed (and less about code size), you can unroll the loop. This has the extra advantage of not using the B register.
; -- 8-bit / 8-bit = 8-bit (8-bit remainder) ; input: [L] = dividend, [A] = divisor ; output: [L] = quotient, [A] = remainder ; ([BC],[DE] unchanged) ; cycles Z80: 305 (If I didn't make any mistakes!!!) ; R800: 67 ; size: 63 bytes ; Z80 R800 udiv8: neg ; 10 2 ld h,a ; 5 1 xor a ; 5 1 rl l ; 10 2 rla ; 5 1 add h ; 5 1 jr c,next1 ; 13/8 3/2 sub h ; 5 1 next1: rl l ; 10 2 rla ; 5 1 add h ; 5 1 jr c,next2 ; 13/8 3/2 sub h ; 5 1 next2: rl l ; 10 2 rla ; 5 1 add h ; 5 1 jr c,next3 ; 13/8 3/2 sub h ; 5 1 next3: rl l ; 10 2 rla ; 5 1 add h ; 5 1 jr c,next4 ; 13/8 3/2 sub h ; 5 1 next4: rl l ; 10 2 rla ; 5 1 add h ; 5 1 jr c,next5 ; 13/8 3/2 sub h ; 5 1 next5: rl l ; 10 2 rla ; 5 1 add h ; 5 1 jr c,next6 ; 13/8 3/2 sub h ; 5 1 next6: rl l ; 10 2 rla ; 5 1 add h ; 5 1 jr c,next7 ; 13/8 3/2 sub h ; 5 1 next7: rl l ; 10 2 rla ; 5 1 add h ; 5 1 jr c,next8 ; 13/8 3/2 sub h ; 5 1 next8: rl l ; 10 2 ret ; 11 5
I also wrote a 16-bit version.
; -- 16-bit / 16-bit = 16-bit (16-bit remainder) ; input: [DE] = dividend, [BC] = divisor ; output: [DE] = quotient, [HL] = remainder ; ([BC] = unchanged, [A] = 0) ; cycles Z80: 1438-1470 (If I didn't make any mistakes!!!) ; R800: 265-281 (Can be tweaked by changing jp->jr ; but that's worse for Z80.) ; size: 27 bytes ; Z80 R800 udiv16: ld hl,0 ; 11 3 ld a,16 ; 8 2 loop: sla e ; 10 2 rl d ; 10 2 adc hl,hl ; 17 2 after this instr C-flag=0 sbc hl,bc ; 17 2 jr nc,next1 ; 13/8 3/2 z80 and r800: jr here is faster than jp next0: add hl,bc ; 12 1 dec a ; 5 1 jp nz,loop ; 11 4/3 z80: jp is faster r800: jr is faster ret ; 11 5 next1: inc e ; 5 1 before this instr e.0 was 0 dec a ; 5 1 jp nz,loop ; 11 4/3 z80: jp is faster r800: jr is faster ret ; 11 5
A similar tweak as for the 8-bit version can be done here. Though in this case it not a clear win: the best-case execution time got faster, but worst-case got slower (at least on Z80, on R800 it is a clear win). The inner loop has the same worst-case speed, but the setup code is more expensive. When this code is extended to signed division I again expect that the setup-cost will be reduced compared to the version above. This version is also easier to unroll if you need extra speed (unrolled version not shown).
; -- 16-bit / 16-bit = 16-bit (16-bit remainder) ; input: [BC] = dividend, [DE] = divisor ; output: [BC] = quotient, [HL] = remainder ; ([A] = 0) ; cycles Z80: 1334-1526 (If I didn't make any mistakes!!!) ; R800: 258-274 (Can be tweaked by changing jp->jr ; but that's worse for Z80.) ; size: 30 bytes ; Z80 R800 udiv16: xor a ; 5 1 C-flag = 0 ld h,a ; 5 1 ld l,a ; 5 1 HL = 0 sbc hl,de ; 17 2 ex de,hl ; 5 1 DE = -divisor ld h,a ; 5 1 ld l,a ; 5 1 HL = 0 ld a,16 ; 8 2 rl c ; 10 2 rl b ; 10 2 loop: adc hl,hl ; 17 2 add hl,de ; 12 1 HL += -divisor jr c,next ; 13/8 3/2 z80 and r800: jr here is faster than jp sbc hl,de ; 17 2 after this instr C-flag is clear next rl c ; 10 2 rl b ; 10 2 dec a ; 5 1 jp nz,loop ; 11 4/3 z80: jp is faster r800: jr is faster ret ; 11 5
BTW: when comparing the cycle count of Z80 to R800, remember that Z80 runs at 3.5MHz while R800 runs at 7MHz (so 2 R800 cycles take the same amount of time as 1 Z80 cycle).
If you don't care about code size you could even use loop up tables instead of loops !
well done, even if on r800 the 1/x table is by far faster