ah, scrap the "H" is not modified. It's just me not remembering past-Santi decisions. I made MDL only print register modifications if they were modified to a DIFFERENT value than before (I guess that made sense to me when I coded it). Indeed, H, and D are modified (but just set to 0), haha. So, false alarm!
In this case MDL was interpreting "jr nc, $+4" fine. In this case it's fine, but I would avoid the use of "$". Some assemblers have different semantics for "$". For example, in standard z80 notation: "org 0; db $, $" should be the same as "db 0, 0", but in some assemblers, this will translate to "db 0, 1". So, if you want portable code to different assemblers, avoid the use of "$" as much as you can
Sorry for negroposting, but this subject newer gets out of fashion.
I did have a talk with Proton from C64 scene and after that I came up with this version, that tries to have "good balance" with speed and size:
That looks a lot like the code described here
Multiplication on a Z80
If you create the table slightly different you can use this for signed multiplications also. If you have your two inputs in 2-complements you do not even have to check your carry after adding and subtracting the values.
Fantastic, this is even faster (CPC: 24µS) than NYYRIKKIS/Protons (CPC: 29.5µS) and the one I mentioned from CPCWiki (CPC: 29µS).
That looks a lot like the code described here
Multiplication on a Z80
Oh, I'm even quite sure that I've seen this one before. I just didn't pay much attention, but yes this is indeed a good one! It can be made even a little bit faster using the trick from code that Prodatron posted:
; IN : A and D are to be multiplied ; OUT: HL is result ; CHANGES : AF,BC,E,HL ; SUB D LD H,MULTAB/256 LD L,A LD C,(HL) INC H LD B,(HL) ADD A,D ADD A,D LD L,A LD E,(HL) DEC H LD L,(HL) LD H,E OR A SBC HL,BC
Edit: ...and this actually uses only 512-byte tables... How ever the reason for smaller size tables and faster speed is that this is signed multiply and not unsigned, and it does not cover full 7bit+sign * 7bit+sign area. (only 7bit*6bit)
Another routine to multiply two 8 bit integers. It take between double and triple the time than fastest one but it uses no table and has no constraint.
;Effect: HL = D x E (integers) ;Modifies: AF, DE, HL. DxCtoHL: ld hl,0 ld a,d rra ld d,h jr nc,bit0 ld l,e bit0: or a rl e rl d rra jr nc,bit1 add hl,de bit1: or a rl e rl d rra jr nc,bit2 add hl,de bit2: or a rl e rl d rra jr nc,bit3 add hl,de bit3: or a rl e rl d rra jr nc,bit4 add hl,de bit4: or a rl e rl d rra jr nc,bit5 add hl,de bit5: or a rl e rl d rra jr nc,bit6 add hl,de bit6: or a rl e rl d rra ret nc ; Back if bi7 = 0 add hl,de ret
I think it can be optimized.
Sorry I didn't follow all the thread, but as far as I remember the Russian Peasant Algorithm uses simple and quick shifts (registers) to multiply 2 numbers:
https://www.wikihow.com/Multiply-Using-the-Russian-Peasant-M...
I think it can be optimized.
This sequence:
or a rl e
can be replaced by a single
sla e
There are probably other chances for optimization but that's an outstanding one.
Oh that's right ! It significantly reduces the size and speeds up the operation. And if we use A as entry instead of D, it saves one more instruction (ld a,d).
;Effect: HL = E x A (integers) ;Modifies: AF, DE, HL. ExAtoHL: ld hl,0 ld d,h rra jr nc,bit0 ld l,e bit0: sla e rl d rra jr nc,bit1 add hl,de bit1: sla e rl d rra jr nc,bit2 add hl,de bit2: sla e rl d rra jr nc,bit3 add hl,de bit3: sla e rl d rra jr nc,bit4 add hl,de bit4: sla e rl d rra jr nc,bit5 add hl,de bit5: sla e rl d rra jr nc,bit6 add hl,de bit6: sla e rl d rra ret nc ; Back if bit7 = 0 add hl,de ret
@gdx That can be made faster by putting the 8-bit values in h
and e
.
http://map.grauw.nl/articles/mult_div_shifts.php#lrmultr
; ; Multiply 8-bit values ; In: Multiply H with E ; Out: HL = result ; Mult8: ld d,0 ld l,d ld b,8 Mult8_Loop: add hl,hl jr nc,Mult8_NoAdd add hl,de Mult8_NoAdd: djnz Mult8_Loop ret
It combines the sla e
, rl d
and rra
in your code above into a single add hl,hl
.
Just unroll the loop eight times.
I didn't use a loop to make it a little faster and to avoid modifying one more register but anyway, your optimization is great!
Edit: ...and this actually uses only 512-byte tables...
You have forgotten it.