well done, even if on r800 the 1/x table is by far faster

Yes, but only if the off-by-one errors are acceptable. Otherwise you need either a 16-bit multiplication for 8-bit division (like your routine, but with fixes in the table). Or use 8-bit multiplication, but with extra shifts (not always over the same distance). Google for 'division by constant using multiplication' for more info.

If you don't care about code size you could even use loop up tables instead of loops !

True, but only up to a certain point. A full LUT for 8-bit division already takes 64kB. For 16-bit division it would take 8GB!! The trick of only storing 1/x in a LUT works for 8-bit, but for 16-bit it's again not practical (128kB).

But of course it all depends on the situation. In some (rare?) cases a huge LUT to speedup one single routine might be a good choice.

well done, even if on r800 the 1/x table is by far faster

Yes, but only if the off-by-one errors are acceptable. Otherwise you need either a 16-bit multiplication for 8-bit division (like your routine, but with fixes in the table). Or use 8-bit multiplication, but with extra shifts (not always over the same distance). Google for 'division by constant using multiplication' for more info.

Maybe one could tune the single entries in the table where the errors occur

The major issue is that there is no reminder (Edwin needs it) and the fact that

it can work only on 8 bits or you will get 64Kbyte of table

PS

the seigned version of the div code seems to me brute force change of sign on input and output

I do not see clever tricks this time

Maybe one could tune the single entries in the table where the errors occur

For N-bit division it's not possible to find a N-bit multiplication factor that gives the correct result for all possible dividends (but it can be fixed with extra shifts and/or additions). See this paper for details: gmplib.org/~tege/divcnst-pldi94.pdf (warning: the mathematical proofs are not that easy to follow).

Ok. But using more bits in the multiplication it is:

from the same author

http://gmplib.org/~tege/division-paper.pdf

Using 16 bit moltiplication should avoid rounding errors on 8 bit divisions : only we need to find a good scale factor for computing the table

(that will be 512 bytes) and shift the result of the multiplication accordingly in order to get the right 8 bits

I haven't checked all the routines yet, but what wouter_ has is similar to what I have now. Which is good for my purpose (I don't really want table driven routines or even unrolled ones). Currently signed and unsigned multiplication and unsigned division are similar in complexity, it is just the signed division that requires significantly more due to sign changes before and after the routine. It would have been nice if that one could also be made in a similar fashion.

But the topic in itself is more interesting than what I need, so don't give up yet.

Sorry for negroposting, but this subject newer gets out of fashion.

I did have a talk with Proton from C64 scene and after that I came up with this version, that tries to have "good balance" with speed and size:

; Multiplication similar to MULUB A,B but for Z80 (NYYRIKKI) ; Tested on SjASMPlus ORG #C000 TABLE: EQU $/256 ;Needs to be 256-byte alligned! REPT 256 DB (($&255)*($&255)/4)/256 ENDR REPT 256 DB (($&255)*($&255)/4)&255 ENDR DB #40 REPT 255 DB ((512-($&255))*(512-($&255))/4)/256 ENDR MULT: ; IN (8bit unsigned): A , B ; OUT (16bit unsigned): HL = A*B LD C,A SUB B JR NC,$+4 NEG LD H,TABLE LD L,A LD D,(HL) INC H LD E,(HL) LD A,B ADD A,C JR NC,SMALLNUM NEG LD L,A LD A,(HL) INC H LD H,(HL) LD L,A AND A SBC HL,DE RET SMALLNUM: LD L,A LD A,(HL) DEC H LD H,(HL) LD L,A SBC HL,DE RET

Yes, always an interesting topic!

I compared yours (29,5µS average on CPC) with this (29µS average on CPC):

https://www.cpcwiki.eu/index.php/Programming:Integer_Multipl...

If I didn't count in a wrong way, for the CPC the last one is 0,5 microseconds faster, haha. For the MSX it will be not the same as it has other wait states.

Crazy little difference, but yours wins because of the smaller code/memory usage.

the msx had 1 wait state @3.5Mhz

the cpc had 4Mhz clock however there is a rounding in n. of cycles to 4 due to the 6502 a-like schema.

Hard to see what is the difference, without testing.....

Very cool!!!

One question, I just tested it with MDL, and it tells me that if the result of the multiplication is smaller than 256, the "H" register is not modified, is this right? This can cause problems if H happens not to be 0 before you call the routine.

The way I tested it is like this, I created an assembler file like this:

(here I copy pasted the code from above, verbatim) START: ld a, 7 ld b, 17 call MULT END:

And then called MDL like this:

`java -jar mdl.jar test.asm -dialect sjasmplus -e:u START END`

It reports that for this particular case (7 * 17), execution time is 178 t-states (including the initial register assignment and call) when using an MSX z80. It also reports that only registers A, F, B, C, E, L and R were modifier (but not H),

If you add the "-cpu z80cpc" flag, it reports 43 nops on an Amstrad CPC z80.

*santiontanon*wrote:

It reports that for this particular case (7 * 17), execution time is 178 t-states (including the initial register assignment and call) when using an MSX z80. It also reports that only registers A, F, B, C, E, L and R were modifier (but not H),

I can only imagine that it must hate my "JR NC,$+4" formatting... Try instead:

JR NC,SKIP NEG SKIP: LD H,TABLE

... I don't know why it misses D-register though, but it might be related.