programming for Falcon questions

Zamuel_a
Atari God
Posts: 1234
Joined: Wed Dec 19, 2007 8:36 pm
Location: Sweden

Postby Zamuel_a » Fri Nov 28, 2014 10:55 pm

When I program something critical on a normal ST, I try to unroll loops, but maybe that's not necessary or good to do at all on the Falcon.
For example if I have a routine that copies something to the screen, like this:

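Code:

REPT 16                     ; the assembler repeats the body 16 times (fully unrolled)
movem.l (a0)+,d0-d7         ; read 8 longwords from the source, a0 advances by 32
movem.l d0-d7,(a1)          ; write them to the destination, a1 is unchanged
lea xxx(a0),a0              ; advance the source pointer by xxx
lea yyy(a1),a1              ; advance the destination pointer by yyy
ENDR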


Is it better to put it in a loop? I guess it depends on how big the loop turns out to be.
I guess a loop is better for the instruction cache.

The data cache seems to be a bit "useless". It's only good if I need the same data over and over again. But if I have something like a sprite routine where each sprite has its own data, I can't see any use for the data cache. Or am I missing something?
ST / STFM / STE / Mega STE / Falcon / TT030 / Portfolio / 2600 / 7800 / Jaguar / 600xl / 130xe

AtariZoll
Fuji Shaped Bastard
Posts: 2978
Joined: Mon Feb 20, 2012 4:42 pm

Postby AtariZoll » Sat Nov 29, 2014 8:27 am

The best approach is to run speed tests on different versions of the code.
Because the 68030 can execute instructions from cache in parallel with bus transfers, the data cache can gain some speed too. The real problem is that the caches are pretty small, only 256 bytes each, which I guess is only effective for simpler loops.
Famous Schrodinger's cat hypothetical experiment says that cat is dead or alive until we open box and see condition of poor animal, which deserved better logic. Cat is always in some certain state - regardless from is observer able or not to see what the state is.

Postby Zamuel_a » Sat Nov 29, 2014 11:20 am

How can I test the speed of a piece of code in Hatari? On an ST I have usually drawn raster lines, which I can see very easily, but it seems Hatari only updates the color registers once per VBL, so that's not possible.

Eero Tamminen
Atari God
Posts: 1949
Joined: Sun Jul 31, 2011 1:11 pm

Postby Eero Tamminen » Sat Nov 29, 2014 11:46 am

Zamuel_a wrote:How can I test the speed of a piece of code in Hatari? On an ST I have usually drawn raster lines, which I can see very easily, but it seems Hatari only updates the color registers once per VBL, so that's not possible.


Hatari's Falcon emulation doesn't support palette changes between VBLs (as the ST/STE emulation does). To measure and profile performance, you can use the profiler:
http://hg.tuxfamily.org/mercurialroot/h ... #Profiling

Just note that Hatari's Falcon emulation isn't completely accurate. While DSP emulation should be accurate, 030 emulation isn't yet: the data cache isn't emulated (code that uses it well runs faster on a real Falcon) and FPU operations are about 2x too fast in Hatari.

Postby Zamuel_a » Sat Nov 29, 2014 12:06 pm

Will Hatari support palette changes and video address updates between VBLs in the future, or would that take too much CPU power?

dml
Fuji Shaped Bastard
Posts: 3474
Joined: Sat Jun 30, 2012 9:33 am

Postby dml » Sat Nov 29, 2014 12:26 pm

Zamuel_a wrote:Will Hatari support palette changes and video address updates between VBLs in the future, or would that take too much CPU power?


For small fragments of code, use a timing harness. You can write one, or wait until I post the source for mine :) It's easy enough to write if you just want to plug in your own instruction sequences. The one I wrote is more general, for timing different kinds of things (like the blitter or DSP), so it's a bit more complicated than you need.

The harness just runs a fragment of code in a loop of size X, and calls the fragment Y times, so you have several seconds of execution in total (X * Y iterations), long enough to get accurate timing.

You then count Timer C events over the whole run, convert the count to elapsed time, multiply by the clock speed of the machine and divide by the total number of iterations. This gives you the number of cycles per loop iteration (including the loop overhead). With this you can figure out a great deal, and you can get very accurate results. Small changes to instruction ordering, loop overhead and cache hits can all be tested this way. You will not get this information from Hatari - the emulation cores are still not accurate (but they improve with each release).

You can find the timing for the fragment MINUS loop overhead by also timing an empty version of the loop and subtracting it. Alternatively, time an unrolled version with only 1 unroll, subtract it from an unrolled version with, say, 5 unrolls, and divide by 4 to get the time per unrolled fragment. This gives extremely accurate results if you're careful - to the cycle.
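
The arithmetic behind such a harness can be sketched in a few lines of Python, run on a PC with numbers read off the Falcon. This is only an illustration, not the harness itself: the 200 Hz Timer C rate (the usual TOS system tick) and the nominal 16 MHz CPU clock are assumptions to check on your own machine, the function names are made up, and the tick counts are invented rather than measured.

```python
TIMER_C_HZ = 200          # assumed: standard TOS 200 Hz system tick
CPU_HZ = 16_000_000       # assumed: nominal Falcon 68030 clock


def cycles_per_iteration(ticks, inner_loops, outer_calls):
    """Convert a Timer C tick count into CPU cycles per loop iteration."""
    seconds = ticks / TIMER_C_HZ
    total_cycles = seconds * CPU_HZ
    return total_cycles / (inner_loops * outer_calls)


def cycles_per_fragment(cycles_few, cycles_many, unrolls_few=1, unrolls_many=5):
    """Cancel the loop overhead by differencing two unroll counts."""
    return (cycles_many - cycles_few) / (unrolls_many - unrolls_few)


# Invented example: X = 1000 iterations per call, Y = 4000 calls.
one = cycles_per_iteration(ticks=1400, inner_loops=1000, outer_calls=4000)
five = cycles_per_iteration(ticks=4920, inner_loops=1000, outer_calls=4000)
print(f"1 unroll : {one:.1f} cycles per iteration")
print(f"5 unrolls: {five:.1f} cycles per iteration")
print(f"fragment : {cycles_per_fragment(one, five):.1f} cycles")
```

Under those assumptions one tick is worth 80000 CPU cycles, which is why the run has to last several seconds before the per-iteration figure becomes trustworthy.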

For optimizing a big program with lots of different routines (like a game or demo), use Hatari's profiler. It's a lot better for that kind of thing than anything you'll be able to write yourself.

Timing harness = tiny fragments, cpu-specific optimizations
Hatari profiler = big picture, algorithm and large code optimizations

Postby AtariZoll » Sat Nov 29, 2014 4:04 pm

Zamuel_a wrote:How can I test the speed of a piece of code in Hatari? On an ST I have usually drawn raster lines, which I can see very easily, but it seems Hatari only updates the color registers once per VBL, so that's not possible.


I guess that you need a real Falcon for accurate results. Hatari does not emulate the caches or the PMMU well. I do speed tests the way dml described - mostly using Timer C, and repeating short code many times for a better result.
If you can post some code segments, I can do speed tests on a real Falcon.

Postby dml » Sat Nov 29, 2014 4:29 pm

Zamuel_a wrote:When I program something critical on a normal ST, I try to unroll loops, but maybe that's not necessary or good to do at all on the Falcon.
For example if I have a routine that copies something to the screen, like this:

Code:

REPT 16
movem.l (a0)+,d0-d7
movem.l d0-d7,(a1)
lea xxx(a0),a0
lea yyy(a1),a1
ENDR


Is it better to put it in a loop? I guess it depends on how big the loop turns out to be.
I guess a loop is better for the instruction cache.

The data cache seems to be a bit "useless". It's only good if I need the same data over and over again. But if I have something like a sprite routine where each sprite has its own data, I can't see any use for the data cache. Or am I missing something?



On Falcon, unroll as far as you can without getting too close to 256 bytes for the full loop size. 240 bytes seems to be the reliable limit in practice - go over that and you can start to get cache misses for various reasons.

If you have a double loop (width, height) then you had better make sure both loops lie within 240-250 bytes.
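
A quick size check for the copy loop in the question, as a hedged Python sketch. The instruction sizes assume 16-bit displacements on the two `lea` instructions (so `xxx` and `yyy` must fit in a signed word) and a `dbra` to close the loop; the 240-byte budget is the practical limit quoted above. A rolled loop also needs a counter register that the `movem` does not overwrite, for example by swapping one data register in the list for a spare address register.

```python
# Bytes per instruction (opcode word plus extension words).
SIZES = {
    "movem.l (a0)+,d0-d7": 4,   # opcode + register mask
    "movem.l d0-d7,(a1)": 4,    # opcode + register mask
    "lea xxx(a0),a0": 4,        # opcode + 16-bit displacement (assumed)
    "lea yyy(a1),a1": 4,        # opcode + 16-bit displacement (assumed)
}
DBRA_BYTES = 4                  # opcode + 16-bit branch displacement
BUDGET = 240                    # practical i-cache limit quoted in this post

body = sum(SIZES.values())
full_unroll = 16 * body
max_unrolls = (BUDGET - DBRA_BYTES) // body

print(f"one copy step: {body} bytes")
print(f"REPT 16 block: {full_unroll} bytes")
print(f"copy steps that fit alongside a dbra: {max_unrolls}")
```

So the REPT 16 version already fills all 256 bytes before any surrounding loop code is counted, while something like 8 copy steps executed twice would stay well inside the budget.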

The data cache is not 'useless', but it is harder to use it effectively. In many cases you won't need it. It can be useful when indirecting tables or reading byte-sequences (since 4 contiguous bytes will be fetched as 2 words over the bus as soon as the first byte is requested). With care you can also use it for read-only stack (or offset(An)) variables if you run out of registers.

When counting cycles on Falcon, don't just look at cycle counts. Look at instruction sizes. Smaller instructions mean more can fit in the cache. Sometimes spending a few cycles to remove 6 bytes from a loop will allow you to cache the loop fully and it is faster overall. (reducing size can sometimes mean using larger instructions, if fewer are needed)

For uncached code (too long to cache), small instructions can offer even more gain. Even better, break long sequences into several loops, with an efficient queue between them.

If you align your stack, enable write-allocate cache mode and operate with only longwords, you can write fast data-recursive routines which have buffered writes and fully cached re-reads from the stack. It can be faster than just trying to use words, because words don't write-allocate and won't cache back as reads. Obvious things are less obvious on Falcon :) have fun.
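
For reference, the cache modes mentioned here are selected through the 68030's CACR register. The Python snippet below only builds the constants you would then load with `movec` in supervisor mode; the bit positions follow the MC68030 User's Manual, but check them against your own copy before relying on the values. The helper name `cacr` is made up for the example, and the burst-enable bits are left out on purpose.

```python
# 68030 CACR bit positions (per the MC68030 User's Manual).
EI, FI, CEI, CI, IBE = 0, 1, 2, 3, 4               # instruction cache
ED, FD, CED, CD, DBE, WA = 8, 9, 10, 11, 12, 13    # data cache


def cacr(*bits):
    """OR the given bit positions together into a CACR value."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    return value


both_on_write_allocate = cacr(EI, ED, WA)   # both caches on, write-allocate
flush_both = cacr(CI, CD)                   # one-shot clear of both caches
icache_only = cacr(EI)                      # i-cache on, d-cache off

for name, value in [("i+d cache, write-allocate", both_on_write_allocate),
                    ("clear both caches", flush_both),
                    ("i-cache only", icache_only)]:
    print(f"{name}: ${value:04x}")
```

In practice the clear bits would be ORed into whichever enable value you are using, so the caches are flushed and left switched on in a single write.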

mc6809e
Captain Atari
Posts: 159
Joined: Sun Jan 29, 2012 10:22 pm

Postby mc6809e » Sat Nov 29, 2014 9:13 pm

The biggest advantage of data caches is their ability to prefetch large chunks of data while other operations proceed concurrently. Typical optimizations involve constructing loops that operate on previously loaded data while triggering a cache miss with a line fill to get new data ready for the next iteration (a form of software pipelining).

Unfortunately this ability is disabled on the Falcon, making the data cache much less useful.

There are a few benefits, though.

Large constants used in loops can be stored in a fast-access table, for example. This helps reduce register pressure since you don't have to keep a constant in a register. It can also reduce instruction size, since large embedded constants can be replaced with d(An)-type accesses to get the value, and keeping loops in the instruction cache is very important for speed.

Another trick is to take advantage of the cache's longword granularity. Each entry in the cache is 32 bits wide, so accessing a 16-bit value will also load its neighbor into the cache. This is handy since the Falcon's bus is 16 bits wide. You can get another 16-bit access in the background while you're processing the first 16-bit value. This is a kind of pipelining on a very small scale.
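
To put rough numbers on the longword granularity, here is a toy Python count of 16-bit bus reads for a run of sequential accesses. It assumes aligned byte or word accesses to cacheable RAM, an initially empty cache, and that every miss fills one longword as two word reads; it says nothing about how many cycles each read costs or how much of the second read is hidden behind other work. The function name is invented for the example.

```python
def bus_reads(start, count, size, dcache):
    """Count 16-bit bus reads for `count` sequential accesses of `size` bytes.

    size must be 1 or 2, and accesses are assumed to be naturally aligned.
    """
    reads = 0
    cached = set()                  # longword addresses already in the cache
    for i in range(count):
        addr = start + i * size
        if not dcache:
            reads += 1              # every access goes out to the bus
            continue
        longword = addr & ~3        # the entry that holds this address
        if longword not in cached:
            cached.add(longword)
            reads += 2              # the entry is filled as two word reads
    return reads


for size, label in [(1, "bytes"), (2, "words")]:
    off = bus_reads(0x1000, 64, size, dcache=False)
    on = bus_reads(0x1000, 64, size, dcache=True)
    print(f"64 sequential {label}: {off} bus reads uncached, {on} cached")
```

In this model the number of bus reads only drops for byte sequences; for word sequences the count is unchanged, and any gain comes from the neighboring word arriving in the background as described above.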

Postby Eero Tamminen » Sat Nov 29, 2014 10:06 pm

AtariZoll wrote:I guess that you need a real Falcon for accurate results. Hatari does not emulate the caches or the PMMU well.


Better support for those is coming in the next Hatari version, after Nicolas finishes updating the Hatari CPU core to the new WinUAE CPU core version. MMU emulation started in Aranym for the 040, then it was modified in the Hatari/NeXT emulator for (more complex) 030 MMU emulation, and those changes were eventually adopted in WinUAE, with the remaining issues fixed by Toni for the forthcoming WinUAE release. Now it's coming back to Hatari, but it's taking time because porting a WinUAE CPU core that is 2 years newer is a lot of work.

Postby Zamuel_a » Sun Nov 30, 2014 12:50 pm

In my example, I'm reading data from one buffer and copying it to another (screen copy), so maybe the cache is not so helpful here. Each instruction needs to read from the bus anyway. The two lea instructions and the loop itself might benefit, but I guess it won't make a great difference.

Do I need to "reset" the cache before a loop since it will be full of other stuff, or will it correct itself after one pass through the loop?

Postby dml » Sun Nov 30, 2014 1:01 pm

Zamuel_a wrote:In my example, I'm reading data from one buffer and copying it to another (screen copy), so maybe the cache is not so helpful here. Each instruction needs to read from the bus anyway. The two lea instructions and the loop itself might benefit, but I guess it won't make a great difference.


The data cache won't help you at all for a simple copy-style operation. But that is the most trivial type of operation, with no redundancy, so that shouldn't be too surprising.

The instruction cache is always useful. It can always reduce 3 bus fetches to 2, for a trivial copy operation.


If you are doing more than a simple copy (playing with bits, shifting, masking etc) you should pay attention to the 68030 write buffer.

The 68030 can write data and immediately be released to execute the next instruction(s), with the write physically occurring at some time later when the bus becomes free to do so. The write buffer is only 1 deep, so writes should be spaced out. There are a couple of other hazards with it but better read the 030 manual to understand them properly.

Reads work differently - the CPU must halt until the data is fetched, always. Sequences of reads and writes mixed among other instructions should therefore be laid out with care. Typically place a write as close behind the last read as possible, followed by any other intermediate instructions. Again there are a couple of exceptions to this but it's a good rule of thumb.

The restriction on reading is also a good reason why you should not ignore the data cache in cases where it can be useful, because coupling the data cache with the write buffer can free up the CPU for stall-free execution.

Postby dml » Sun Nov 30, 2014 1:06 pm

Zamuel_a wrote:Do I need to "reset" the cache before a loop since it will be full of other stuff, or will it correct itself after one loop?


The cache is easy to understand, especially with no burst mode (no lines of 16 bytes to think about).

The lower 8 bits (really 6 bits - bottom 2 are zero) of the address form the index into the cache. This cache slot gets replaced by whatever was last read. If you read it again, it will be served from the cache. If you read the same address + 256, it will replace that same slot with the new data, because it has the same 8-bit address. Fetching the old address will have to re-read from the bus because it will again have to replace the cache slot.
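
The slot behaviour described above can be played through with a toy model in Python. It follows this simplified description (64 longword slots, each with its own tag) rather than the real chip, which as far as I know keeps one tag per 16-byte line plus a valid bit per longword; the class name and the address are made up.

```python
class DirectMappedCache:
    """Toy model: 64 longword slots, indexed by address bits 7..2."""

    def __init__(self):
        self.tags = [None] * 64

    def read(self, addr):
        """Return True on a hit; on a miss, replace the slot's contents."""
        slot = (addr >> 2) & 0x3F   # bits 7..2 pick the slot
        tag = addr >> 8             # the upper bits identify what is cached
        if self.tags[slot] == tag:
            return True
        self.tags[slot] = tag
        return False


cache = DirectMappedCache()
a = 0x20000
trace = [
    cache.read(a),          # first touch: miss, slot filled
    cache.read(a),          # same address again: hit
    cache.read(a + 256),    # same slot, different tag: miss, slot replaced
    cache.read(a),          # original entry was evicted: miss again
]
print(trace)   # [False, True, False, False]
```

This is also why the cache does not need resetting before a loop: stale entries simply miss once and are replaced on the first pass.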

You can do some things to lock (freeze) stuff in the cache, and otherwise manipulate data to avoid regular collisions on the same slots, but 99% of the time you just keep your loops within 240-250 bytes and don't worry excessively about how it works. It makes sense to place loop starts on 4-byte boundaries if the loop is approaching the 256 limit, to avoid wasting cache slots, but otherwise even that isn't necessary.
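
The alignment point is easy to check with a little arithmetic. A hedged Python sketch, counting how many longword slots a loop occupies for a given start address and size (the addresses are arbitrary examples and the function name is made up):

```python
CACHE_SLOTS = 64    # 256 bytes / 4 bytes per longword slot


def slots_touched(start, size):
    """Number of distinct longwords covered by bytes start..start+size-1."""
    first = start >> 2
    last = (start + size - 1) >> 2
    return last - first + 1


for start, size in [(0x1000, 256), (0x1002, 256), (0x1002, 240)]:
    used = slots_touched(start, size)
    verdict = "fits" if used <= CACHE_SLOTS else "wraps onto itself"
    print(f"start ${start:04x}, {size} bytes: {used} slots, {verdict}")
```

A 256-byte loop only fits when it starts on a longword boundary, whereas a 240-byte loop fits either way, which matches the advice to stay a little under the limit.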

Postby Zamuel_a » Sun Nov 30, 2014 1:16 pm

Typically place a write as close behind the last read as possible, followed by any other intermediate instructions.


In my example, is it better to use normal move.l instructions to read and write the data instead of reading several longwords at once with the movem instruction, so that reads and writes are mixed?

Postby dml » Sun Nov 30, 2014 1:55 pm

Zamuel_a wrote:
Typically place a write as close behind the last read as possible, followed by any other intermediate instructions.


In my example, is it better to use normal move.l instructions to read and write the data instead of reading several longwords at once with the movem instruction, so that reads and writes are mixed?


I'm a bit busy now, but if you can wait a while I'll post some actual timings comparing the two so you can see the difference.

Postby dml » Sun Nov 30, 2014 3:02 pm

Comparing the two sequences:

move.l (a0)+,(a1)
...

movem.l (a0)+,d0-d7
movem.l d0-d7,(a1)
...

Notice that the destination address is not incremented in either case. For the movem case you'd need to account for the time to increment the destination, which works against it a little. Also note that this compares a transfer of 8 longwords using two different methods - movem can be asked to move more per instruction. Basically, you need to be moving more than 8 registers for movem to be worth using to copy data.


On a real Falcon...

The move.l measures (per longword transferred):

17.6 cycles with i-cache on, d-cache off
18.6 cycles with i-cache on, d-cache on
20.9 cycles with i-cache off, d-cache off

The movem.l measures (per longword transferred):

18.3 cycles with i-cache on, d-cache off
19.1 cycles with i-cache on, d-cache on
20.4 cycles with i-cache off, d-cache off


In Hatari 1.8 the results are a bit different.

The move.l measures (per longword transferred):

20.0 cycles with i-cache on
24.0 cycles with i-cache off

The movem.l measures (per longword transferred):

19.0 cycles with i-cache on
20.0 cycles with i-cache off
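
To see how far apart the two sets of numbers are, here is a small Python comparison. It pairs each Hatari 1.8 figure with the real-Falcon figure for the same i-cache setting and d-cache off, on the assumption that this is the fair match since Hatari does not emulate the data cache; the i-cache-off figures line up directly.

```python
# Cycles per longword, copied from the measurements above.
real = {("move.l", "i-cache on"): 17.6, ("move.l", "i-cache off"): 20.9,
        ("movem.l", "i-cache on"): 18.3, ("movem.l", "i-cache off"): 20.4}
hatari = {("move.l", "i-cache on"): 20.0, ("move.l", "i-cache off"): 24.0,
          ("movem.l", "i-cache on"): 19.0, ("movem.l", "i-cache off"): 20.0}


def error_percent(key):
    """How far Hatari's figure is from the real Falcon, in percent."""
    return (hatari[key] - real[key]) / real[key] * 100.0


for key in real:
    print(f"{key[0]:8s} {key[1]:12s} {error_percent(key):+5.1f}%")
```

On these figures Hatari 1.8 overestimates move.l by roughly 14-15% and, with the i-cache on, ranks movem ahead of move.l, which is the opposite of the real machine.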

Postby dml » Sun Nov 30, 2014 3:12 pm

Repeated the experiment with 10 longwords per event, instead of 8.

(On a real Falcon)

The move.l measures (per longword transferred):

17.6 cycles with i-cache on, d-cache off
18.6 cycles with i-cache on, d-cache on
20.9 cycles with i-cache off, d-cache off

[obviously, no change here but measured anyway with a reconfigured loop size to avoid accidents]


The movem.l measures (per longword transferred):

17.9 cycles with i-cache on, d-cache off
18.8 cycles with i-cache on, d-cache on
19.6 cycles with i-cache off, d-cache off

...so movem still hasn't reached breakeven. I guess you need at least 12 registers to get there. The main benefit would be saving instruction space for the sake of code in an outer loop. The copy speed isn't much improved, if at all.
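
The "at least 12" guess can be sanity-checked with a back-of-envelope fit in Python. The model is an assumption: each movem pair costs a fixed overhead plus a constant amount per longword, so cost(n) = per_long + fixed / n. Two data points pin down two unknowns exactly, so this is an extrapolation rather than a validated fit, and it still ignores the cost of incrementing the destination.

```python
MOVE_L = 17.6                   # cycles/longword, i-cache on, d-cache off
MOVEM = {8: 18.3, 10: 17.9}     # registers per movem -> cycles/longword

# Assumed model: cost(n) = per_long + fixed / n
(n1, c1), (n2, c2) = sorted(MOVEM.items())
fixed = (c1 - c2) / (1 / n1 - 1 / n2)
per_long = c1 - fixed / n1
breakeven = fixed / (MOVE_L - per_long)

print(f"fixed overhead per movem pair: {fixed:.1f} cycles")
print(f"cost per longword without it:  {per_long:.1f} cycles")
print(f"breakeven at about {breakeven:.1f} registers")
```

With the measured values this lands at a little over 12 registers, in line with the estimate above.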

Postby Zamuel_a » Sun Nov 30, 2014 3:26 pm

OK, so it doesn't matter much whether I use movem or not.

Postby dml » Sun Nov 30, 2014 3:31 pm

Zamuel_a wrote:OK, so it doesn't matter much whether I use movem or not.


Doesn't make a big difference to copy speed, no - but it does use less space, while trashing more registers.

I find register trashing is more often a problem. It's a common error to optimize the inner loop to be 2% faster while making the outer loop 20% slower. That can be an invisible net loss if you're not measuring and checking things like that. Both compilers and programmers make this kind of mistake. Knowing the correct balance for different levels of loop takes some care.
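
Whether that trade is a net loss depends on how the time is split between the two loops. A small Python illustration, taking "2% faster" and "20% slower" as changes in run time; the 95/90/80% splits are invented purely for the example.

```python
def net_time(inner_share, inner_speedup, outer_slowdown):
    """Relative run time after a change (1.0 means unchanged).

    inner_share     fraction of the original time spent in the inner loop
    inner_speedup   e.g. 0.02 for an inner loop that got 2% faster
    outer_slowdown  e.g. 0.20 for outer-loop code that got 20% slower
    """
    outer_share = 1.0 - inner_share
    return (inner_share * (1.0 - inner_speedup)
            + outer_share * (1.0 + outer_slowdown))


for share in (0.95, 0.90, 0.80):
    print(f"inner loop {share:.0%} of time -> {net_time(share, 0.02, 0.20):.3f}")
```

With these numbers the change only pays off if the inner loop accounts for more than about 91% of the total time, so measuring both levels really does matter.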

