Fastest possible multiplication routine?

ページ 6/7
1 | 2 | 3 | 4 | 5 | | 7

By wouter_

Hero (525)

wouter_ さんの画像

20-03-2011, 22:49

well done, even if on r800 the 1/x table is by far faster
Yes, but only if the off-by-one errors are acceptable. Otherwise you need either a 16-bit multiplication for 8-bit division (like your routine, but with fixes in the table). Or use 8-bit multiplication, but with extra shifts (not always over the same distance). Google for 'division by constant using multiplication' for more info.

If you don't care about code size you could even use loop up tables instead of loops !
True, but only up to a certain point. A full LUT for 8-bit division already takes 64kB. For 16-bit division it would take 8GB!! The trick of only storing 1/x in a LUT works for 8-bit, but for 16-bit it's again not practical (128kB).
But of course it all depends on the situation. In some (rare?) cases a huge LUT to speedup one single routine might be a good choice.

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

20-03-2011, 23:27

well done, even if on r800 the 1/x table is by far faster
Yes, but only if the off-by-one errors are acceptable. Otherwise you need either a 16-bit multiplication for 8-bit division (like your routine, but with fixes in the table). Or use 8-bit multiplication, but with extra shifts (not always over the same distance). Google for 'division by constant using multiplication' for more info.

Maybe one could tune the single entries in the table where the errors occur
The major issue is that there is no reminder (Edwin needs it) and the fact that
it can work only on 8 bits or you will get 64Kbyte of table

PS
the seigned version of the div code seems to me brute force change of sign on input and output
I do not see clever tricks this time

By wouter_

Hero (525)

wouter_ さんの画像

20-03-2011, 23:44

Maybe one could tune the single entries in the table where the errors occur
For N-bit division it's not possible to find a N-bit multiplication factor that gives the correct result for all possible dividends (but it can be fixed with extra shifts and/or additions). See this paper for details: gmplib.org/~tege/divcnst-pldi94.pdf (warning: the mathematical proofs are not that easy to follow).

By ARTRAG

Enlighted (6935)

ARTRAG さんの画像

21-03-2011, 00:38

Ok. But using more bits in the multiplication it is:
from the same author
http://gmplib.org/~tege/division-paper.pdf
Using 16 bit moltiplication should avoid rounding errors on 8 bit divisions : only we need to find a good scale factor for computing the table
(that will be 512 bytes) and shift the result of the multiplication accordingly in order to get the right 8 bits

By Edwin

Paragon (1182)

Edwin さんの画像

21-03-2011, 01:01

I haven't checked all the routines yet, but what wouter_ has is similar to what I have now. Which is good for my purpose (I don't really want table driven routines or even unrolled ones). Currently signed and unsigned multiplication and unsigned division are similar in complexity, it is just the signed division that requires significantly more due to sign changes before and after the routine. It would have been nice if that one could also be made in a similar fashion.

But the topic in itself is more interesting than what I need, so don't give up yet. Wink

By NYYRIKKI

Enlighted (6067)

NYYRIKKI さんの画像

03-01-2023, 17:46

Sorry for negroposting, but this subject newer gets out of fashion. Smile

I did have a talk with Proton from C64 scene and after that I came up with this version, that tries to have "good balance" with speed and size:

; Multiplication similar to MULUB A,B but for Z80 (NYYRIKKI)
; Tested on SjASMPlus

        ORG #C000

TABLE:  EQU $/256  ;Needs to be 256-byte alligned!
        REPT 256
        DB (($&255)*($&255)/4)/256
        ENDR

        REPT 256
        DB (($&255)*($&255)/4)&255
        ENDR

        DB #40
        REPT 255
        DB ((512-($&255))*(512-($&255))/4)/256
        ENDR

MULT:
        ; IN  (8bit unsigned):    A , B
        ; OUT (16bit unsigned):   HL = A*B

        LD C,A
        SUB B
        JR NC,$+4
        NEG
        LD H,TABLE
        LD L,A
        LD D,(HL)
        INC H
        LD E,(HL)

        LD A,B
        ADD A,C
        JR NC,SMALLNUM
        NEG
        LD L,A
        LD A,(HL)
        INC H
        LD H,(HL)
        LD L,A
        AND A
        SBC HL,DE
        RET

SMALLNUM:
        LD L,A
        LD A,(HL)
        DEC H
        LD H,(HL)
        LD L,A
        SBC HL,DE
        RET

By Prodatron

Paragon (1843)

Prodatron さんの画像

03-01-2023, 18:40

Yes, always an interesting topic! Smile
I compared yours (29,5µS average on CPC) with this (29µS average on CPC):
https://www.cpcwiki.eu/index.php/Programming:Integer_Multipl...
If I didn't count in a wrong way, for the CPC the last one is 0,5 microseconds faster, haha. For the MSX it will be not the same as it has other wait states.
Crazy little difference, but yours wins because of the smaller code/memory usage.

By PingPong

Enlighted (4138)

PingPong さんの画像

03-01-2023, 21:20

the msx had 1 wait state @3.5Mhz
the cpc had 4Mhz clock however there is a rounding in n. of cycles to 4 due to the 6502 a-like schema.

Hard to see what is the difference, without testing.....

By santiontanon

Paragon (1805)

santiontanon さんの画像

03-01-2023, 22:14

Very cool!!!

One question, I just tested it with MDL, and it tells me that if the result of the multiplication is smaller than 256, the "H" register is not modified, is this right? This can cause problems if H happens not to be 0 before you call the routine.

The way I tested it is like this, I created an assembler file like this:

(here I copy pasted the code from above, verbatim)

START:
        ld a, 7
        ld b, 17
        call MULT
END:

And then called MDL like this:

java -jar mdl.jar test.asm -dialect sjasmplus -e:u START END

It reports that for this particular case (7 * 17), execution time is 178 t-states (including the initial register assignment and call) when using an MSX z80. It also reports that only registers A, F, B, C, E, L and R were modifier (but not H),

If you add the "-cpu z80cpc" flag, it reports 43 nops on an Amstrad CPC z80.

By NYYRIKKI

Enlighted (6067)

NYYRIKKI さんの画像

03-01-2023, 23:56

santiontanon wrote:

It reports that for this particular case (7 * 17), execution time is 178 t-states (including the initial register assignment and call) when using an MSX z80. It also reports that only registers A, F, B, C, E, L and R were modifier (but not H),

I can only imagine that it must hate my "JR NC,$+4" formatting... Try instead:

        JR NC,SKIP
        NEG
SKIP:
        LD H,TABLE

... I don't know why it misses D-register though, but it might be related.

ページ 6/7
1 | 2 | 3 | 4 | 5 | | 7