C Benchmarks

Pagina 1/2
| 2

Door Alcoholics_Anonymous

Resident (39)

afbeelding van Alcoholics_Anonymous

03-03-2017, 21:03

I'm putting together a set of benchmarks for running on various z80 C compilers.

The main reason for posting here is that msx users are actively using C compilers not used by anyone else so I was hoping we could get some numbers for you.

The current set of benchmarks are:

Dhrystone 2.1
Standard synthetic benchmark for integer programs.

[url=https://en.wikipedia.org/wiki/Whetstone_(benchmark)]Whetstone 1.2[/url]
Standard synthetic benchmark for floating point.

Sieve of Erastothenes
Common for small uP compilers because it's compilable by just about every c compiler. Finds prime numbers up to 8000.

Pi
Computes pi to some number of decimal places. Tests 32-bit integer math.

I'd also like to add coremark but for that one they do not allow distribution of source code. Instead you have to create an account at their website and promise not to make source available publicly.

If you can think of any other benchmarks that might be useful I'm open to that too.

I plan on compiling with sccz80 (z88dk), zsdcc (z88dk), sdcc, hitech cpm 309, hitech msdos 750, and hopefully I can get Zilog's tools to produce something. Any other submissions are welcome or if you are experienced with the hitech c compilers, a submission there would be appreciated since it's possible you may be able to get better results than me if you are more familiar with optimizations.

Souce code can be found here. More comments will be placed in those directories as I run the benchmarks. Currently the Dhrystone 2.1 and Whetstone 1.2 tests have been run for z88dk only with results in the z88dk subdirectories.

I'm using a z80 simulator called TICKS to time execution exactly in z80 cycles. If you can provide a binary with printing disabled, org indicated and binary address given, I can likely perform the time measurement with TICKS.

Aangemeld of registreer om reacties te plaatsen

Van AxelStone

Prophet (3199)

afbeelding van AxelStone

03-03-2017, 21:23

It should be interesting to have MSX-C results too since some of us are using it Wink

Van Sandy Brand

Champion (309)

afbeelding van Sandy Brand

03-03-2017, 22:54

Super cool research project! Smile

I guess these are more to measure the 'raw' speed of the code that these compilers generate.

What, in my experience, are more relevant questions are things such as these:

  • What is faster, using chars (8 bits) or ints (16 bit)? (Can the compiler properly optimize function calls through registers?)
  • Can the compiler optimize the usage of 'boolean' logic? (Reduce expression logic, use flag registers (also for function return values), etc).
  • Which compiler generates the smallest executables (dead-code detection, optimization, etc).
  • Which compiler is faster for random access in an array? (byte alignment, etc).
  • Some old compilers date all the way back to CP/M, so do the actually compilers use the full range of available Z80 feature set (for example: index registers).
  • Can the compiler generate code that uses memory page swapping for programs that are larger than 64Kb?

Would be awesome if we could deduce some good 'rule of thumbs' on how to write more efficient code for these compilers Smile

Van Alcoholics_Anonymous

Resident (39)

afbeelding van Alcoholics_Anonymous

04-03-2017, 03:38

AxelStone wrote:

It should be interesting to have MSX-C results too since some of us are using it Wink

Yes I am hoping someone will provide some results for that! I'm especially interested in c compilers that are in actual use now. I can run some benchmarks for compilers in the heyday but most of them are quite poor in comparison to the ones I listed and no one is using them seriously anyway. In terms of compilers that run on actual z80 hardware, hitech c v309 for cpm is quite good so that's probably msx c's main competitor in a direct sense.

Quote:

Super cool research project!
I guess these are more to measure the 'raw' speed of the code that these compilers generate.

Yes these are standard benchmarks that are supposed to reflect how people write generic c rather than taking advantage of any special features of a particular compiler.

I will include program sizes as well to make sure no one implementation is achieving outlier results by being impractical with memory use. It's hard to compare memory use directly for small/medium size programs since some implementations are more complete than others and can have more environment code present (some may fill in im2 vector table, restarts, create atexit / atquickexit stacks, auto size a heap, set up FILE*, etc) while others do nothing. I may try to quantify that by examining a disassembly of the output binaries.

Quote:

What, in my experience, are more relevant questions are things such as these:

Which compiler generates the smallest executables (dead-code detection, optimization, etc).

I find this is one of the most important performance criteria for a compiler. A number of large games have been written in c for the spectrum and memory space is always a central problem.

I also plan to compare compile sizes for a number of medium sized programs (~30-40k). I'll be using cpm as a target since it seems to be a common denominator and will allow the platform-specific code variable to be controlled. The problem here is a lot of the compilers of interest are not very complete and cannot compile the sort of programs you'd like to. In the past, I've been confined to comparing z88dk and the two hitech c compilers since nothing else would be able to compile the programs without a lot of help. But we'll see how things turn out this time.

Van hit9918

Prophet (2932)

afbeelding van hit9918

04-03-2017, 19:40

games need code like this

#define T char
struct obj { T id; T x; T y; T vx; T vy; T anim; };

void moveturtles(struct obj **arr, unsigned char arrlen) {
        unsigned char i = 0;
        struct obj *p = 0;
        for(i = 0; i < arrlen; i++) {
                p = *arr++;
                //and then many ops like this:
                p->x = p->x + p->vx;
                p->y = p->y + p->vy;
                p->anim++;
        }
}

offsetted acess to always the same struct, needs no tricky register allocator.

        ld a,(ix+x) : add (ix+vx) : ld (ix+x),a
        ld a,(ix+y) : add (ix+vy) : ld (ix+y),a
        inc (ix+anim)

it would also be ok when the compiler uses HL and INC HL DEC HL forth and back.

        ld a,(hl) : inc hl : inc hl : add (hl) : dec hl : dec hl : ld (hl),a      ;p->x = p->x + p->vx;
        inc hl 
        ld a,(hl) : inc hl : inc hl : add (hl) : dec hl : dec hl : ld (hl),a      ;p->y = p->y + p->vy;
        inc hl : inc hl : inc hl
        inc (hl)

This looks like more code but is actualy less.

So, this is the z80 gamer benchmark.
any compiler who comes close to that asm?

Van AxelStone

Prophet (3199)

afbeelding van AxelStone

04-03-2017, 21:43

We'll follow your tests, it seems really interesting Smile

Van Alcoholics_Anonymous

Resident (39)

afbeelding van Alcoholics_Anonymous

05-03-2017, 18:27

hit9918 wrote:

games need code like this

#define T char
struct obj { T id; T x; T y; T vx; T vy; T anim; };

void moveturtles(struct obj **arr, unsigned char arrlen) {
        unsigned char i = 0;
        struct obj *p = 0;
        for(i = 0; i < arrlen; i++) {
                p = *arr++;
                //and then many ops like this:
                p->x = p->x + p->vx;
                p->y = p->y + p->vy;
                p->anim++;
        }
}

So, this is the z80 gamer benchmark.
any compiler who comes close to that asm?

I'll look into this after the current benchmarks are done.

I'll point out that, in zx spectrum c games, the game engines themselves have subroutines that update the sprites so this part is often done in machine code (sprite manipulation tends to be more difficult as there is no hardware assist so more functionality tends to be built into the engine itself).

But another pattern is to have these structures allocated at global scope in a fixed size array to avoid referring to them through a pointer. The fixed size represents the maximum number of objects allowed, for which memory must be allocated anyway either statically or dynamically. Then those sprite members are accessed through direct memory addresses as in "ld hl,(spr0+offset_x)" etc. A flag can be used to indicate whether a particular sprite is active or not.

In a situation where there are multiple disjoint levels, overlaying is sometimes done so that the same memory is reused for a possibly varying set of sprite objects and level data. In z88dk, at least, this is done by creating individual sections for each level using the same ORG address and assigning to each any level/sprite specific data to the corresponding section. Then required level data for each level will overlap the same address range which solves memory issues brought about by allocating data for levels without overlap. If any level needs non-zero initialization it's normally handled by decompression at the start of the level.

Another approach is possibly to use an obstack to dynamically create needed objects and destroy all at the end of a level. It's a lightweight dynamic memory allocator with zero overhead and fast (wikipedia has a poor explanation, perhaps a better one in #3 of this comment). And then you have the same sort of code as you've shown since each sprite object is again referred to via pointer.

Quote:

We'll follow your tests, it seems really interesting Smile

Someone volunteered with access to the commercial IAR compiler which is great. If I can get Zilog's ZDS to work and someone steps forward with a copy of Softools, the main commercial entities will be covered.

Van hit9918

Prophet (2932)

afbeelding van hit9918

05-03-2017, 21:43

Quote:

But another pattern is to have these structures allocated at global scope in a fixed size array to avoid referring to them through a pointer.

that extra pointer fetch is not about the allocation of objects but about phantastic collision detection speeds.
it is only a fraction of the other ops.

Quote:

Then those sprite members are accessed through direct memory addresses as in "ld hl,(spr0+offset_x)"

this way 10 things onscreen need 10 times code size.

Van Alcoholics_Anonymous

Resident (39)

afbeelding van Alcoholics_Anonymous

08-03-2017, 17:04

hit9918 wrote:
Quote:

Then those sprite members are accessed through direct memory addresses as in "ld hl,(spr0+offset_x)"

this way 10 things onscreen need 10 times code size.

The idea is to use array access to compute a fixed address at compile time plus an offset determined by index at runtime.

Something like:
p->x = p->x + p->vx;

Would get turned into this:
sprite[i].x += sprite[i].vx;

The sprite.x and sprite.vx offsets can be computed at compile time so what is left is an offset i*sizeof(struct sprite) that is computed at runtime. If the loop is in terms of bytes (i+=sizeof(struct sprite) as the step size) then the multiplication is gone.

I'm not sure if this would lead to better code or not and it smacks of low level byte adjustment and casting which is not as natural as a pointer solution.

I've found a number of more standard benchmarks that I will include but this idea of a "gamer's benchmark" is a really good one given that that's mostly what people do with z80s these days.

If you have any other small snippets that may be useful to test, what I'd like to do is gather a collection of these short subroutines into kernels (read: subroutines) and run them from a driver (read: main()) which can be tested for speed and size. Emphasis can be placed on each kernel by selecting the number of times each kernel is executed according to how each kernel might be called in a typical game.

In the zx scene, there are a relatively large number of games written in c (I think about 40) but none of them are open source except for a few simple ones, with one exception being the full sized game Pietro Bros. A number of games are written with a particular game engine ("Churerra") which is open source though. I'll try to take a look at these sources and see if something can be extracted from them.

Van hit9918

Prophet (2932)

afbeelding van hit9918

09-03-2017, 08:09

ok I added a setup to call it in main() Smile

Quote:

The idea is to use array access to compute a fixed address at compile time plus an offset determined by index at runtime.

the thing is that this is about compilers weakness
what you describe is like this

        ld hl,sprite+x  ;the trick of including the offset in one instruction
        add hl,bc       ;bc = index i
        ld a,(hl)       ;acess sprite[i].x

now, if we look at this

        sprite[i].x = sprite[i].x + sprite[i].vx;
        sprite[i].y = sprite[i].y + sprite[i].vy;

the address &sprite[i], the location of that struct, it does never change across these 6 ops.
it could be in a register
like this

        ld hl,x         ;too freely select offset
        add hl,bc       ;bc = the &sprite[i] that is constant across 6 ops
        ld a,(hl)       ;acess sprite[i].x

the same instructions as above. but this time it does not need static arrays.
and with IX usage it is

        ld a,(ix+x)     ;IX = the &sprite[i] that is constant across 6 ops

the code (of below benchmark code) could look like this

;p = *arr++;
        ld e,(hl)
        inc hl
        ld d,(hl)
        inc hl
        ld ixl,e
        ld ixh,d        ;IX = fixed thru all 6 ops.
;p->x += p->vx
        ld a,(ix+ x+0)
        add  (ix+vx+0)
        ld   (ix+ x+0),a
        ld a,(ix+ x+1)
        adc  (ix+vx+1)
        ld   (ix+ x+1),a
;p->x += p->vy
        ld a,(ix+ y+0)
        add  (ix+vy+0)
        ld   (ix+ y+0),a
        ld a,(ix+ y+1)
        adc  (ix+vy+1)
        ld   (ix+ y+1),a
;p->anim++
        inc  (ix+anim)

with such compiler one could code nemesis in C Big smile
this is what the benchmark is poking at.
it also is ok when it uses HL and INC HL or DEC HL or ADD HL,BC for offsetting. but without needing to recalculate things from an array base every time.

this is what SDCC makes

;app.c:9: p = *arr++;
	ld	l, c
	ld	h, b
	ld	e,(hl)
	inc	hl
	ld	d,(hl)
	inc	bc
	inc	bc
;app.c:11: p->x = p->x + p->vx;
	push	de
	pop	iy
	inc	iy
	inc	iy
	ld	a,0 (iy)
	ld	-2 (ix),a
	ld	a,1 (iy)
	ld	-1 (ix),a
	ld	l, e
	ld	h, d
	push	bc
	ld	bc, #0x0006
	add	hl, bc
	pop	bc
	ld	a, (hl)
	inc	hl
	ld	h,(hl)
	ld	l,a
	ld	a,-2 (ix)
	add	a, l
	ld	l,a
	ld	a,-1 (ix)
	adc	a, h
	ld	0 (iy), l
	ld	1 (iy), a
;app.c:12: p->y = p->y + p->vy;
	ld	iy,#0x0004
	add	iy, de
	ld	a,0 (iy)
	ld	-2 (ix),a
	ld	a,1 (iy)
	ld	-1 (ix),a
	ld	l, e
	ld	h, d
	push	bc
	ld	bc, #0x0008
	add	hl, bc
	pop	bc
	ld	a, (hl)
	inc	hl
	ld	h,(hl)
	ld	l,a
	ld	a,-2 (ix)
	add	a, l
	ld	l,a
	ld	a,-1 (ix)
	adc	a, h
	ld	0 (iy), l
	ld	1 (iy), a
;app.c:13: p->anim++;
	ld	hl,#0x000a
	add	hl,de
	ld	e,(hl)
	inc	e
	ld	(hl),e

SDCC : mcs51/z80/z180/r2k/r3ka/gbz80/tlcs90/ds390/pic16/pic14/TININative/ds400/hc08/s08/stm8 3.6.0 #9615 (MINGW64)

here the code


#define T int
struct obj { T id; T x; T y; T vx; T vy; char anim; char hit; char dummy[16]; /* objects got some size */ };

void moveturtles(struct obj **arr, unsigned char arrlen) {
        unsigned char i = arrlen;
        struct obj *p = 0;
        do {
                p = *arr++;
                //and then many p->foo ops like this:
                p->x = p->x + p->vx;
                p->y = p->y + p->vy;
                p->anim++;
        } while (--i);
}


void boundingbox(int x, int y, struct obj **arr, unsigned char arrlen) {
        unsigned char i = arrlen;
        struct obj *p = 0;
        do {
                p = *arr++;
                //and then many p->foo ops like this:
                if (((x + 16) > p->x) && (x < (p->x + 16))) {
                        if (((y + 16) > p->y) && (y < (p->y + 16))) {
                                p->hit = 1;
                        }
                }
        } while (--i);
}


//game object allocation pool. all objects go in there.
struct obj pool[32];

//example plot, a list of turtles in the object pool
int turtleslistlen = 5;
struct obj* turtleslist[32] = { &pool[23], &pool[17], &pool[4], &pool[7], &pool[19] };



void main() {
        int i = 0;
        for (i = 0; i < 32; i++) { pool[i].x = i*4; pool[i].y = i*4; pool[i].vx = 0; pool[i].vy = 0; } 
        for (i = 0; i < 10000; i++) { //benchmark it
                int playerx = 32; int playery = 128;
                moveturtles(turtleslist, turtleslistlen);
                boundingbox(playerx, playery, turtleslist, turtleslistlen);
        }
}

Van DarkSchneider

Paragon (1030)

afbeelding van DarkSchneider

17-04-2017, 10:25

That is why I use programming tips, which are really needed when programming small 8-bit machines as I noticed.
In that case I always use a pointer, assign it and then use it for accessing. Instead "sprite[i]." create a *sprite, sprite = &sprite[i], and then "sprite->". You can also assign the first element of the array to *srpite and then do sprite++ on each loop.

Another tip is to use always unsigned arithmetic unless mandatory. As signed arithmetic is handled by C-runtime instead direct instructions. There is really no problem because we can operate in this way perfectly: 16-bit unsigned "-1" is 65535, if we add unsigned 10000 + 65535 = 9999 with overflow, so it is really as substracting 1.
C compilers allow to define them i.e. unsigned x = (unsigned)-1.

Pagina 1/2
| 2