Eero Tamminen wrote: OK. Looking at the (attached) profile assembly, "xx_im0" (which takes more cycles than "cmd_xx") is actually the latter part of the "cmd_xx" subroutine, so looking at the whole does indeed make sense.
Btw. Have you looked at the callgraphs yet?
Yeah! Finally. I've been a bit too busy with other stuff. I'm not surprised by how it looks, though. A CPU emulator is bound to generate an extremely yucky callgraph, but then again there are some nice clues in there. I'll definitely start using these tools myself now, they're pretty awesome.
Do e.g. calls to "cmd_xx" look sensible, or does it get called more through some route than expected?
cmd_xx() handles a range of Z80 opcodes and basically carries on to different handlers using a jump table, so it's expected to be called very often. I may be able to shave off a couple of cycles there (but not much). To be sure, one probably needs to test several speccy binaries when profiling. People wrote everything in assembly language, and all coders have their darlings.
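For anyone following along who hasn't seen this kind of dispatch, here is a minimal sketch in C of what "carrying on to different handlers using a jump table" can look like. It is not the emulator's actual code (which is 68k assembly); the Cpu struct, the handler names and the three opcodes shown are assumptions picked for illustration only.

```c
#include <stdint.h>

/* Hypothetical, heavily reduced CPU state; not the emulator's real layout. */
typedef struct {
    uint8_t a, b;
    uint16_t pc;
    unsigned long cycles;
} Cpu;

typedef void (*OpHandler)(Cpu *cpu);

/* Illustrative handlers; each adds the 4 T-states these opcodes take on a Z80. */
static void op_nop(Cpu *cpu)   { cpu->cycles += 4; }
static void op_inc_b(Cpu *cpu) { cpu->b++; cpu->cycles += 4; }
static void op_inc_a(Cpu *cpu) { cpu->a++; cpu->cycles += 4; }

/* The jump table: one handler pointer per opcode byte. */
static OpHandler handlers[256];

static void init_handlers(void) {
    /* Everything not implemented in this sketch falls back to NOP. */
    for (int i = 0; i < 256; i++)
        handlers[i] = op_nop;
    handlers[0x04] = op_inc_b;  /* Z80: INC B */
    handlers[0x3C] = op_inc_a;  /* Z80: INC A */
}

/* Fetch one opcode and dispatch through the table. */
static void step(Cpu *cpu, const uint8_t *mem) {
    uint8_t opcode = mem[cpu->pc++];
    handlers[opcode](cpu);
}
```

Since every emulated instruction passes through the fetch-and-dispatch step, that routine will always show up near the top of a profile, whatever the handlers themselves cost.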
Also, "loop8" causes half of the instruction cache misses in the whole profile. Would it be possible to split that large loop into smaller loops that would fit within i-cache?
Quite possibly! This is the screen conversion code. I have no good way of detecting Z80 writes to screen memory without slowing down the core, hence I need to batch-convert as much as I can - or a minimum amount - depending on CPU load. I can probably make it fit into the I-cache, and I should probably disable the D-cache while doing this (because it's using a huge lookup table to convert ZX graphics -> ST bitplane format).
If anyone has ideas for an efficient way to convert speccy screen memory to Atari ditto, I'm all ears.
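To make the conversion problem concrete, here is a rough C sketch of turning one ZX pixel byte plus its attribute byte into four bitplane bytes. It is not the emulator's routine: it swaps the huge lookup table for two small per-attribute mask tables, it assumes the ST palette is set up so that colour index = ink/paper number plus 8 for bright, and it ignores the flash bit. All names (ink_mask, paper_mask, zx_to_planes) are made up for the example.

```c
#include <stdint.h>

/* ZX attribute layout: bits 0-2 ink, bits 3-5 paper, bit 6 bright.
   Flash (bit 7) is ignored in this sketch. */
static uint32_t ink_mask[128];
static uint32_t paper_mask[128];

/* Expand a 4-bit colour index into four plane bytes:
   byte p of the result is 0xFF if bit p of the colour is set. */
static uint32_t colour_to_mask(unsigned colour) {
    uint32_t m = 0;
    for (int p = 0; p < 4; p++)
        if (colour & (1u << p))
            m |= (uint32_t)0xFF << (8 * p);
    return m;
}

static void init_masks(void) {
    for (unsigned attr = 0; attr < 128; attr++) {
        unsigned bright = (attr >> 6) & 1;
        unsigned ink    = (attr & 7) | (bright << 3);
        unsigned paper  = ((attr >> 3) & 7) | (bright << 3);
        ink_mask[attr]   = colour_to_mask(ink);
        paper_mask[attr] = colour_to_mask(paper);
    }
}

/* Convert 8 ZX pixels to 4 plane bytes.
   Plane 0 ends up in bits 0-7, plane 3 in bits 24-31. */
static uint32_t zx_to_planes(uint8_t pixels, uint8_t attr) {
    uint32_t set = pixels * 0x01010101u;  /* copy the pixel byte into all 4 bytes */
    attr &= 0x7F;                         /* drop the flash bit */
    return (set & ink_mask[attr]) | (~set & paper_mask[attr]);
}
```

On the ST each plane is stored as a 16-bit word covering 16 pixels, so a real routine would still have to combine two neighbouring ZX bytes per plane word. Whether two small tables plus a few logic operations beat one big table on a 68k depends on the cache behaviour, so this is something to measure with the profiler rather than assume.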
Note that the profiler can be used to validate optimizations too. Just set breakpoints before and after the part you want to profile. Once the code hits the latter breakpoint, the debugger shows how many instructions and cycles were spent between the debugger invocations.
And if things actually took longer than before the optimizations, you have the profile & annotated disassembly ("profile save profile.txt") where you can check in detail what went wrong... Often I just check the instruction count stats in the disassembly to see the typical flow within the code.
I really need to learn this. I've written a PC/~286 emulator too, and optimisations are badly needed (obviously).