Software sprites


Zamuel_a
Atari God
Posts: 1234
Joined: Wed Dec 19, 2007 8:36 pm
Location: Sweden

Postby Zamuel_a » Sat Sep 27, 2014 8:16 pm

I have started to code a little for the Falcon and need a good sprite routine. On the STE I have always used the blitter, but I have read that it's slower than the CPU on the Falcon, so maybe a software routine is better. The question is just how to write it. :wink: I could probably figure something out, but it might not be the fastest. Everything I have found about software sprites for the Atari uses preshifted ones, but I don't want that since it takes a lot of memory.
Maybe the blitter is still faster than the CPU for sprites, even on the Falcon? ANDing a mask, ORing the sprite data and rotating the bits for each bitplane doesn't sound fast in software. Doing a 1:1 memory copy is of course much faster with the CPU than with the blitter, but that's another story.
This is of course for the 8 bitplane mode. TC would be another matter...
ST / STFM / STE / Mega STE / Falcon / TT030 / Portfolio / 2600 / 7800 / Jaguar / 600xl / 130xe

dml
Fuji Shaped Bastard
Posts: 3474
Joined: Sat Jun 30, 2012 9:33 am

Postby dml » Sat Sep 27, 2014 10:31 pm

Zamuel_a wrote:Maybe the blitter is still faster than the CPU for sprites, even on the Falcon? ANDing a mask, ORing the sprite data and rotating the bits for each bitplane doesn't sound fast in software. Doing a 1:1 memory copy is of course much faster with the CPU than with the blitter, but that's another story.
This is of course for the 8 bitplane mode. TC would be another matter...


It's certainly easier to write a blitter sprite routine. So if you're more interested in productivity you could start with that and go back later to do some experiments in software.

Optimizing code for 030 is time consuming. There are more factors to take into account, which complicate optimization (and provide more routes to optimize). On plain 68k the most common solutions to any problems were unrolling code as much as possible, using big tables, or generating code sequences to do static jobs. These can't be taken for granted on 030 - you need to examine each case and decide the tradeoffs. Even getting a clear picture of timings is a challenge.

So it is far less likely you'll find a 'best' routine. I don't think everything has been tried either - I am still finding small things which make a difference.

I think you can probably beat the blitter with CPU for sprites on the Falcon but it will take time to fill in all the cases and discover the sweet spots. If you don't mind spending that time it's probably the way to go.

Postby dml » Sat Sep 27, 2014 10:47 pm

A few examples of stuff you might make use of, if optimizing a 'software blitter'. It might also give you an idea why there is no 'perfect' sprite routine on Falcon. Too many combinations of stuff to play with.

- 030 has a barrel shifter. shifting has constant cost, no matter the number of bits

- 030 has instruction cache, which means you need to keep loops under 256 bytes if you want to save a memory read on every instruction

- 030 has a data cache. it is difficult to lever this effectively in linear memory operations but it is important to learn how it works so you know how to use it when you need it. at the very least, understand that it normally only caches reads (unless instructed otherwise) and is longword-based, so keeping it enabled means you are fetching longs from the bus all the time. this incurs no obvious cost for linear copies etc. but does cost for random access (wasted reads) if the accesses are close together in terms of cycles. the d-cache can be configured to cache writes, but only for 4-byte aligned longs. anything that isn't a cache 'hit' is a 'miss', and will delete a cache entry or replace it so take care if caching unaligned (or short) writes, which both incur misses.

- 030 has a writebuffer, so when you write a word to ram (and I think also a longword), the cpu does not wait until that write completes. it coasts on executing trivial ops until it either completes the pending write, or another memory op occurs (write or read), at which point it will stall. this often lets you hide the cost of writes by spreading them out among other ops.

- 030 has bitfield operators. they are expensive (10 cycles iirc, versus 2 for and/or, 4 for shift) but they effectively can perform move+shift+and+or operations *simultaneously* on 32bit values. seems like it might have a use in sprites? maybe, if you're lucky.

- high bandwidth graphics modes inflate the cost of read/write operations, versus cpu clocks. this means you can do more trivial operations per available read/write (on average), which is good news for sprites
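
To make the bitfield point in the list above a bit more concrete, here is a small C model (host-side, not 68k code) of what an extract (BFEXTU) and an insert (BFINS) do to a 32-bit register: the shift, the mask and the merge all happen in one step. The function names are made up for illustration, and the model assumes the field does not wrap around the end of the register. The 68020+ bitfield instructions count the offset from the most significant bit, which the model mirrors.

```c
#include <assert.h>
#include <stdint.h>

/* Model of BFEXTU on a 32-bit register: extract `width` bits starting
   `offset` bits below the MSB. Assumes offset + width <= 32 (no wrap). */
static uint32_t bf_extract(uint32_t reg, unsigned offset, unsigned width)
{
    assert(width >= 1 && offset + width <= 32);
    uint32_t shifted = reg >> (32 - offset - width);
    return width == 32 ? shifted : shifted & ((1u << width) - 1u);
}

/* Model of BFINS: replace the same field with the low `width` bits of
   `value`, leaving every other bit of `reg` untouched. */
static uint32_t bf_insert(uint32_t reg, uint32_t value, unsigned offset, unsigned width)
{
    assert(width >= 1 && offset + width <= 32);
    uint32_t field = width == 32 ? 0xFFFFFFFFu : ((1u << width) - 1u);
    unsigned pos = 32 - offset - width;   /* distance of the field from the LSB */
    return (reg & ~(field << pos)) | ((value & field) << pos);
}
```

For example, bf_insert(0x12345678, 0xAB, 8, 8) replaces the second byte from the top and gives 0x12AB5678. Whether the real instruction pays for itself at around 10 cycles is a separate question that only timing on hardware can answer.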

Postby dml » Sun Sep 28, 2014 9:28 am

Sorry I haven't tested this at all but I suppose it should work. If not, I guess you can figure out why not :)

Assuming we ignore bitfields for now, you can scroll data with the CPU reasonably efficiently if done 2 planes at a time. The sample below should work on a 2-plane display, but can be adapted for 8 planes by a small modification and repeating it 4 times. It does not perform masking - that will depend on the format of your source data. You can also avoid masking if you know parts of the sprite have a mask of all 1's (but that's a separate problem again).

I split it into alternating even/odd plane words because it saves a register exchange for the carry data between words. The eor.l can be done as another and.l (but would require another register for the inverted mask). Anyway you can probably find a better version of this done by somebody who had more time to research. This is back-of-envelope stuff.

Code:

;   d7   scroll
;   d6   scroll mask ($ffff >> scroll, in both words)
;   d1/d3   carry data
;   d3 must be preloaded, or zeroed, for the first column of the sprite

.even   move.l   (src)+,d1   ;   ab.cd:ef.gh
   ror.l   d7,d1      ;   gh.ab:cd.ef
   move.l   d1,d2
   and.l   d6,d2      ;   00.ab:00.ef
   eor.l   d2,d1      ;   gh.00:cd.00
   swap   d1      ;   cd.00:gh.00 (ror.l crossed the planes, so swap the carry back)
   or.l   d3,d2      ;   CD.ab:GH.ef
   move.l   d2,(dst)+

.odd   move.l   (src)+,d3   ;   ab.cd:ef.gh
   ror.l   d7,d3      ;   gh.ab:cd.ef
   move.l   d3,d2
   and.l   d6,d2      ;   00.ab:00.ef
   eor.l   d2,d3      ;   gh.00:cd.00
   swap   d3      ;   cd.00:gh.00 (same swap for the odd column)
   or.l   d1,d2      ;   CD.ab:GH.ef
   move.l   d2,(dst)+
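
Since the fragment is untested, a quick way to check the idea is to model it in C on a host machine and compare it with a plain one-plane-at-a-time shift. The sketch below is such a model, not 68k code, and the function names are made up. One thing it shows: a 32-bit rotate pushes each plane's spilled bits into the other plane's word, so the model swaps the two halves of the carry before merging it into the next column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t ror32(uint32_t v, unsigned n)
{
    n &= 31;
    return n ? (v >> n) | (v << (32 - n)) : v;
}

/* Scroll one line of 2-plane data right by `shift` pixels (0..15).
   Each longword holds plane 0 in the high word and plane 1 in the low word,
   the way a move.l from interleaved screen memory loads it.
   `dst` needs room for count + 1 longwords (the last one takes the spill). */
static void scroll_line_2planes(const uint32_t *src, uint32_t *dst,
                                size_t count, unsigned shift)
{
    assert(shift < 16);
    uint32_t keep = (0xFFFFu >> shift) * 0x00010001u; /* $ffff >> shift, both words */
    uint32_t carry = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t r = ror32(src[i], shift);
        uint32_t body = r & keep;
        uint32_t spill = r ^ body;   /* top bits, sitting in the other plane's word */
        dst[i] = body | carry;
        carry = (spill >> 16) | (spill << 16);   /* put each plane's bits back */
    }
    dst[count] = carry;
}

/* Reference for a single plane: shift the 32-bit pair prev:cur right by
   `shift` and keep the low word. */
static uint16_t scroll_word(uint16_t prev, uint16_t cur, unsigned shift)
{
    return (uint16_t)((((uint32_t)prev << 16) | cur) >> shift);
}
```

With src = { 0xABCDEF01, 0x23456789 } and a shift of 4, the second output longword is 0xD2341678, which matches the per-plane reference (0xD234 for plane 0, 0x1678 for plane 1).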



If you want to optimize this for 030, you probably want to consider how the reads/writes are distributed. Keep them apart - they cause stalls if done too close together. If you can spread them out by 1 or 2 trivial ALU ops then you'll hide those cycles. movem is attractive but isn't always a win - read/write scheduling is more effective esp. if you're not fetching a lot of data at a time.

You can hardcode a jump table for wide sprites, at least for the central columns, but you could also codegen for all sizes (with i-caching for internal columns).

You probably also want to consider how masking is combined with scrolling - whether it's a separate pass (uses fewer registers, allows better optimization of each pass, but uses more memory operations) or integrated together (more register pressure, but an opportunity to avoid memory ops, or even to use the data cache if there are not enough registers).

First and last column can obviously be optimized vs interior columns. Special hardcoded cases for 1- and 2- and perhaps even 3-word destination widths. Special case for scroll=0. etc. etc.

Have fun.

Postby Zamuel_a » Sun Sep 28, 2014 9:43 am

Thanks for the information! I will try it and play around a little.
The question is whether it will be faster than, or about the same as, doing the same operations with the blitter, especially if the AND and OR operations are included.
Of course there are special cases that can be faster with the CPU, but I mean a general routine, for a 32x32 sprite for example.

How do I make sure some code runs efficiently from the cache? Is there a way to "reset" the cache, or does it take care of itself?
For example, I have a loop that takes < 256 bytes, but maybe the cache is already full with 200 bytes of other code. Then it will take some iterations before only my code is in there. So if the loop only runs 2 times, the second pass will most likely not come from the cache anyway, or how does it work?

Is it generally best not to unroll loops anymore? (OK, maybe a loop that runs 2-3 times is still best unrolled.)

Anima
Atari Super Hero
Posts: 667
Joined: Fri Mar 06, 2009 9:43 am

Postby Anima » Sun Sep 28, 2014 9:44 am

Zamuel_a wrote:Maybe the blitter is still faster than the CPU for sprites, even on the Falcon? ANDing a mask, ORing the sprite data and rotating the bits for each bitplane doesn't sound fast in software. Doing a 1:1 memory copy is of course much faster with the CPU than with the blitter, but that's another story.
This is of course for the 8 bitplane mode. TC would be another matter...

For bitplane sprites I would definitely use the Blitter. The biggest advantage of using it is getting "intelligent" data moving/manipulation for free while the CPU calculates the parameters for the next line. Recently I did some tests with the blitter and it works fine for drawing (bitplane) polygons/circles and sprites. Probably I'll put that in a new thread but the sprite drawing routine is about twice as fast as the standard "and mask + or sprite" method (32 x 32 pixels sprites with 256 colours).

exxos
Fuji Shaped Bastard
Posts: 4933
Joined: Fri Mar 28, 2003 8:36 pm
Location: England

Postby exxos » Sun Sep 28, 2014 9:57 am

I don't know about this kind of stuff, but could the DSP be used for something, as it runs pretty fast?

Postby dml » Sun Sep 28, 2014 10:08 am

Zamuel_a wrote:Thanks for the information! I will try it and play around a little.
The question is whether it will be faster than, or about the same as, doing the same operations with the blitter, especially if the AND and OR operations are included.
Of course there are special cases that can be faster with the CPU, but I mean a general routine, for a 32x32 sprite for example.

How do I make sure some code runs efficiently from the cache? Is there a way to "reset" the cache, or does it take care of itself?
For example, I have a loop that takes < 256 bytes, but maybe the cache is already full with 200 bytes of other code. Then it will take some iterations before only my code is in there. So if the loop only runs 2 times, the second pass will most likely not come from the cache anyway, or how does it work?

Is it generally best not to unroll loops anymore? (OK, maybe a loop that runs 2-3 times is still best unrolled.)


Lots of questions here. I'd say CPU if anything allows the chance to fast-path special cases. The blitter has a fairly constant speed and is a function of words read/written, for the most part. It's programmed at the level of lines and passes. IIRC for sprites, plane-at-a-time with masking done as a set of new planes.

CPU can perhaps optimize a bunch of cases which the blitter is forced to do in a more literal, brute way.

Cache... you still need to unroll, but keep the main executing body contiguous and under 256 bytes. The cache looks after itself and keeps the last 64 longwords read from memory. It uses the lower 8 bits of the address (masked with $FC) to map memory to cache slots - this is why code should be contiguous, so it fits in the cache without kicking out entries while looping.
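
As a rough illustration of that mapping, here is a small C model of the 68030 instruction cache layout: 256 bytes, direct mapped, organised as 16 lines of 4 longwords with one tag per line. The helper names are invented, and the model ignores prefetch and interrupts, so treat it as a sketch for reasoning about loop placement rather than a timing tool.

```c
#include <assert.h>
#include <stdint.h>

/* 68030 instruction cache: 16 direct-mapped lines of 16 bytes (4 longwords). */
enum { CACHE_LINES = 16, LINE_BYTES = 16 };

/* Cache line an address maps to: address bits 7..4. Two addresses that are
   256 bytes apart land on the same line and evict each other. */
static unsigned cache_line_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % CACHE_LINES;
}

/* Nonzero if a contiguous loop of `length` bytes starting at `start` touches
   at most 16 lines, so none of its own code evicts other parts of it. */
static int loop_fits_cache(uint32_t start, uint32_t length)
{
    assert(length > 0);
    uint32_t first_line = start / LINE_BYTES;
    uint32_t last_line = (start + length - 1) / LINE_BYTES;
    return last_line - first_line < CACHE_LINES;
}
```

Under this model a 256-byte loop only fits if it starts on a 16-byte boundary, while a loop of about 240 bytes fits at any word alignment. Code that ran before the loop simply gets replaced on the first pass, so the cost is one round of misses and not a permanent penalty.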

Postby dml » Sun Sep 28, 2014 10:10 am

Anima wrote:
Zamuel_a wrote:Maybe the blitter is still faster than the CPU for sprites, even on the Falcon? ANDing a mask, ORing the sprite data and rotating the bits for each bitplane doesn't sound fast in software. Doing a 1:1 memory copy is of course much faster with the CPU than with the blitter, but that's another story.
This is of course for the 8 bitplane mode. TC would be another matter...

For bitplane sprites I would definitely use the Blitter. The biggest advantage of using it is getting "intelligent" data moving/manipulation for free while the CPU calculates the parameters for the next line. Recently I did some tests with the blitter and it works fine for drawing (bitplane) polygons/circles and sprites. Probably I'll put that in a new thread but the sprite drawing routine is about twice as fast as the standard "and mask + or sprite" method (32 x 32 pixels sprites with 256 colours).


This works very well for polys/spans because each one has a unique width, and running the chips in parallel lets you hide that. Sprites are simpler, with constant width/setup, so the comparison should be closer. However, yes - it could well be that the blitter always wins, or at least in the most common cases. I'm interested in the fast-path cases and will look closer.

Postby Anima » Sun Sep 28, 2014 10:11 am

Zamuel_a wrote:How do I make sure some code runs efficiently from the cache? Is there a way to "reset" the cache, or does it take care of itself?
For example, I have a loop that takes < 256 bytes, but maybe the cache is already full with 200 bytes of other code. Then it will take some iterations before only my code is in there. So if the loop only runs 2 times, the second pass will most likely not come from the cache anyway, or how does it work?
There are a few more things to keep in mind to get the best optimised code for the cache (like cache line size, cache line alignment, interrupts, etc.). I would suggest making the loop fit within 240 bytes of code and you're good to go.

Zamuel_a wrote:Is it generally best not to unroll loops anymore? (OK, maybe a loop that runs 2-3 times is still best unrolled.)
Yes and no. For very tight loops you should consider unrolling as long as the routine still fits into the cache.

Postby Anima » Sun Sep 28, 2014 10:23 am

exxos wrote:I don't know about this kind of stuff, but could the DSP be used for something, as it runs pretty fast?

Unfortunately no. The sprite drawing operations are "too simple" for the DSP so that the communication overhead would be too high. Also the DSP memory is way too small to store sprites with a higher colour count.

Postby Zamuel_a » Sun Sep 28, 2014 10:29 am

dml wrote:I split it into alternating even/odd plane words because it saves a register exchange for the carry data between words.


I don't see any difference between the even and odd routine? Only some data registers are different, but the same function?

Postby dml » Sun Sep 28, 2014 10:43 am

Zamuel_a wrote:
I don't see any difference between the even and odd routine? Only some data registers are different, but the same function?


Look closer ;) it's how they pass data to each other that matters.

The fragments are meant to be run alternately along a line, each scrolling bits into the next. The carry happens via d1/d3 and alternates for the odd/even cases. It has nothing to do with starting on an odd/even address. It's just an alternation strategy to save copying registers each time.

Postby Zamuel_a » Sun Sep 28, 2014 10:51 am

dml wrote:
Zamuel_a wrote:
I don't see any difference between the even and odd routine? Only some data registers are different, but the same function?


Look closer ;) it's how they pass data to each other that matters.


ah ok!
The last d1 is from the even routine

Postby dml » Sun Sep 28, 2014 12:25 pm

Again, not had time to test this but it demonstrates an interesting 'shortcut' for software sprites, vs blitter.

When masking typical sprite data you need to AND off the background planes before OR combining the source. With the blitter this is a whole extra pass which means 2x the number of memory operations (3 memory ops per pass * 2 passes * 8 planes = 48 bus events per 16 pixels). If you know the sprite has no holes - just endmasks - then you can optimize the blitter case, but for arbitrary sources it's a different matter.

But what happens if you pre-shift the 1-plane mask? [EDIT: oops - I suppose it's really using a 2-plane mask in the sample] That's not too expensive to store. Here's a CPU sample which combines scrolling with a preshifted mask. It requires 4 memory ops per plane * 8 planes = 32 bus events per 16 pixels, which is considerably less than the blitter. I'm not claiming this will actually be faster in practice, but assuming many of the ALU ops are absorbed it must be getting close. With 8 planes feeding VIDEL it might actually overtake, since the bus will create longer stalls for both CPU and blitter. I guess it needs to be tried.

Note the scheduling of reads/writes to give the write buffer time to commit the longword. This is probably not optimal but a fair start.


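Code:

;   each store is delayed by one fragment, so dst starts one longword
;   before the first target column; skip the first delayed store (or
;   preload d4) on entry and flush the last result after the loop

.even   move.l   (src)+,d1   ;   ab.cd:ef.gh
   ror.l   d7,d1      ;   gh.ab:cd.ef
   move.l   d1,d2
   move.l   (msk)+,d5   ;   fetch preshifted mask
   and.l   d6,d2      ;   00.ab:00.ef
   and.l   4(dst),d5   ;   mask BG (read from the slot this result is stored to)
   eor.l   d2,d1      ;   gh.00:cd.00 (carry new)
   swap   d1      ;   cd.00:gh.00 (ror.l crossed the planes, so swap the carry back)
   move.l   d4,(dst)+   ;   *scheduled from odd*
   or.l   d3,d2      ;   CD.ab:GH.ef (merge old carry)
   or.l   d5,d2      ;   combine BG

.odd   move.l   (src)+,d3   ;   ab.cd:ef.gh
   ror.l   d7,d3      ;   gh.ab:cd.ef
   move.l   d3,d4   
   move.l   (msk)+,d5   ;   fetch preshifted mask
   and.l   d6,d4      ;   00.ab:00.ef
   and.l   4(dst),d5   ;   mask BG (read from the slot this result is stored to)
   eor.l   d4,d3      ;   gh.00:cd.00 (carry new)
   swap   d3      ;   cd.00:gh.00 (same swap for the odd column)
   move.l   d2,(dst)+   ;   *scheduled from even*
   or.l   d1,d4      ;   CD.ab:GH.ef (merge old carry)
   or.l   d5,d4      ;   combine BG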
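
For anyone who wants to check the masking part on a host machine first, here is a C model of the preshifted mask idea. It assumes a 1-plane mask where a set bit means "keep the background", duplicates each shifted mask word into both halves of a longword so that one AND covers two planes, and then does the usual (background AND mask) OR sprite. The function names are invented, and the sprite data is assumed to be scrolled to the same shift already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Preshift one line of a 1-plane mask (1 = keep background) right by `shift`
   pixels and duplicate it into both words of each longword.
   `out` needs count + 1 longwords, because the shift spills into one more. */
static void preshift_mask_line(const uint16_t *mask, size_t count,
                               unsigned shift, uint32_t *out)
{
    assert(shift < 16);
    uint32_t prev = 0xFFFFu;   /* left of the sprite everything is background */
    for (size_t i = 0; i <= count; i++) {
        uint32_t cur = (i < count) ? mask[i] : 0xFFFFu;   /* same on the right */
        uint32_t word = (((prev << 16) | cur) >> shift) & 0xFFFFu;
        out[i] = word * 0x00010001u;   /* same mask for both planes */
        prev = cur;
    }
}

/* dst = (dst AND mask) OR sprite, one longword (two planes) at a time. */
static void draw_masked(uint32_t *dst, const uint32_t *sprite,
                        const uint32_t *mask, size_t longs)
{
    for (size_t i = 0; i < longs; i++)
        dst[i] = (dst[i] & mask[i]) | sprite[i];
}
```

The memory cost is one extra longword per 16 source pixels per line for each of the 16 shifts, which is much less than preshifting all 8 planes of pixel data.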


Anyway I'm just curious. As I say I haven't tested this at all, or done any comparisons. I may do that later with my fragment cycle counter.

Postby dml » Sun Sep 28, 2014 12:37 pm

Anima wrote:Recently I did some tests with the blitter and it works fine for drawing (bitplane) polygons/circles and sprites. Probably I'll put that in a new thread but the sprite drawing routine is about twice as fast as the standard "and mask + or sprite" method (32 x 32 pixels sprites with 256 colours).


I guess any strategy you can use to avoid ANDing every plane will keep the blitter in front of any CPU version. This would usually involve analysing the sprite data though to take advantage of static knowledge of masks. A CPU version can benefit a lot from that but probably wouldn't keep up with a blitter equivalent if it is general enough.

But as we all know, it is always better to design the data to fit what you know will be fast at draw time ;)

Postby Anima » Sun Sep 28, 2014 1:27 pm

dml wrote:When masking typical sprite data you need to AND off the background planes before OR combining the source. With the blitter this is a whole extra pass which means 2x the number of memory operations (3 memory ops per pass * 2 passes * 8 planes = 48 bus events per 16 pixels). If you know the sprite has no holes - just endmasks - then you can optimize the blitter case, but for arbitrary sources it's a different matter.

Doing the standard AND and OR operations with the Blitter is simply too wasteful for sprites. It's even worse when you look at the cycle costs of the AND (OP = 4) and OR (OP = 7) operations (with HOP = 2, source):

Code:

    The BLiTTER is (unfortunately) not restricted to fixed execution
    times. Its effective execution times "per (16-Bit) copy" heavily
    depend on the chosen HOP- and OP-modes, in clock cycles:

          OP
    HOP    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
       0   4  8  8  4  8  8  8  8  8  8  8  8  4  8  8  4
       1   4  8  8  4  8  8  8  8  8  8  8  8  4  8  8  4
       2   4 12 12  8 12  8 12 12 12 12  8 12  8 12 12  4
       3   4 12 12  8 12  8 12 12 12 12  8 12  8 12 12  4
   
    (Actually, ENDMASKs differing from $FFFF are rumoured to also
     delay the BLiTTER, i have not yet verified this.)
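
To put rough numbers on the table, the C sketch below transcribes it into a lookup and compares the two-pass AND + OR method with a single OP = 3 pass. The word count is an assumption for illustration: a shifted 32-pixel sprite touching 3 destination words per plane per line, times 8 planes and 32 lines, gives 768 words. Register setup, bus contention with the video hardware and any ENDMASK penalty are ignored, so this only bounds the raw copy cost.

```c
#include <assert.h>

/* Cycles per 16-bit copy, indexed [HOP][OP], transcribed from the table. */
static const unsigned char blit_cycles[4][16] = {
    { 4,  8,  8, 4,  8, 8,  8,  8,  8,  8, 8,  8, 4,  8,  8, 4 },
    { 4,  8,  8, 4,  8, 8,  8,  8,  8,  8, 8,  8, 4,  8,  8, 4 },
    { 4, 12, 12, 8, 12, 8, 12, 12, 12, 12, 8, 12, 8, 12, 12, 4 },
    { 4, 12, 12, 8, 12, 8, 12, 12, 12, 12, 8, 12, 8, 12, 12, 4 },
};

/* Raw cycle cost of one blitter pass over `words` destination words. */
static unsigned pass_cycles(unsigned hop, unsigned op, unsigned words)
{
    assert(hop < 4 && op < 16);
    return blit_cycles[hop][op] * words;
}

/* Standard method: one OP 4 (mask) pass plus one OP 7 (sprite) pass. */
static unsigned and_or_cycles(unsigned words)
{
    return pass_cycles(2, 4, words) + pass_cycles(2, 7, words);
}
```

For 768 words this gives 18432 cycles for AND + OR against 6144 for a single OP = 3 pass, i.e. 24 versus 8 cycles per word by the table alone.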

The problem here is that the Blitter already has a mask facility, but the standard method uses it for something else: the ENDMASKs. So if there's a way to use the ENDMASK registers as the real sprite mask, we can use OP = 3 (destination = source) as the only operation, which is even faster than either AND or OR alone (of course this also depends on the ENDMASK data).

The standard way to draw a 32x32 pixels sprite with 256 colours would look like this (register setup omitted):

Code:

   | Mask.

   move   #2,0x8a20.w | Source X Increment.
   move   #2,0x8a22.w | Source Y Increment.

   move.b   #4,0x8a3b.w | OP: 4 (destination = not source and destination).

   move.l   a3,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   d3,0x8a36.w | X Count.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a1
   move.l   a3,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a1
   move.l   a3,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a1
   move.l   a3,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a1
   move.l   a3,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a1
   move.l   a3,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a1
   move.l   a3,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a1
   move.l   a3,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   | Sprite.

   lea   sprite_image,a1

   move   #16,0x8a20.w | Source X Increment.
   move   #320-2*16+16,0x8a22.w | Source Y Increment.

   move.b   #7,0x8a3b.w | OP: 7 (destination = source or destination).

   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a0
   addq.l   #2,a1
   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a0
   addq.l   #2,a1
   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a0
   addq.l   #2,a1
   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a0
   addq.l   #2,a1
   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a0
   addq.l   #2,a1
   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a0
   addq.l   #2,a1
   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.

   addq.l   #2,a0
   addq.l   #2,a1
   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a2,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.


However, the following code is about two times faster:

Code:

   move.l   (a2)+,d1

   move   #32-1,d7
loop:
   move   d1,d2
   swap   d2
   clr      d2
   lsr.l   d0,d1
   lsr.l   d0,d2

   tst      d0
   jne      no_shift

   move   d1,d2
no_shift:
   move.l   d1,0x8a28.w | Endmask 1 + 2.
   move   d2,0x8a2c.w | Endmask 3.
   move.l   a0,0x8a24.w | Source Address.
   move.l   a1,0x8a32.w | Destination Address.
   move   a3,0x8a38.w | Y Count.
   move.b   d5,0x8a3c.w | Busy, HOG, Smudge, Line Number.
   move.l   (a2)+,d1

   add.l   #320,a0
   add.l   #512,a1

   dbf      d7,loop

As you can see in this code, the CPU is preparing the ENDMASK values for the next line while the Blitter is drawing the current line. Another advantage is that the Blitter uses no RMW cycles when the ENDMASK is 0xffff, so it's quite optimal with respect to bandwidth costs.
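
The ENDMASK preparation in that loop can be checked on a host machine with a small C model. This is only a sketch of what the register shuffling computes, with an invented function name and struct, and it assumes a 32-pixel mask row in which a set bit means "draw this pixel". The zero-shift case reuses the second word for ENDMASK 3, as the assembly does with its extra move.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t endmask1, endmask2, endmask3;
} Endmasks;

/* Spread a 32-pixel mask row over the three ENDMASK registers after
   shifting it right by `shift` pixels (0..15). */
static Endmasks endmasks_for_row(uint32_t mask_row, unsigned shift)
{
    assert(shift < 16);
    uint32_t head = mask_row >> shift;                      /* lsr.l d0,d1 */
    uint32_t tail = ((mask_row & 0xFFFFu) << 16) >> shift;  /* swap, clr, lsr.l on d2 */
    Endmasks e;
    e.endmask1 = (uint16_t)(head >> 16);
    e.endmask2 = (uint16_t)head;
    /* With no shift there is no third word, so the last mask is the second. */
    e.endmask3 = (shift == 0) ? (uint16_t)head : (uint16_t)tail;
    return e;
}
```

For a fully opaque row (0xFFFFFFFF) shifted by 4 this gives 0x0FFF, 0xFFFF and 0xF000, so only the two edge words need a read-modify-write.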

I think it's important to avoid preshifting at all costs, because you will get into trouble with memory consumption really fast, especially when using up to 256 colours for sprites. What you need in addition to the sprite data is of course a one-bitplane mask, but that doesn't need to be preshifted either.

I should move this to another thread since the headline doesn't quite fit. ;)

EvilFranky
Atari Super Hero
Posts: 868
Joined: Thu Sep 11, 2003 10:49 pm
Location: UK

Postby EvilFranky » Sun Sep 28, 2014 1:55 pm

Are us mere mortals going to see some graphical examples of the 2 ways of rendering these 256 colour sprites? :angel: :mrgreen:

A before and after? :)

Postby dml » Sun Sep 28, 2014 2:03 pm

Anima wrote:As you can see in this code, the CPU is preparing the ENDMASK values for the next line while the Blitter is drawing the current line. Another advantage is that the Blitter uses no RMW cycles when the ENDMASK is 0xffff, so it's quite optimal with respect to bandwidth costs.


For sizes up to 32 source pixels wide that probably is best, yes. Doing all planes in one blit, line at a time is always the most efficient mode (it's what I was using for plane polys previously, for the reasons you already gave). The important/interesting part is the loading of the new endmask(s) on each line for sprites...

Anima wrote:I think it's important to avoid preshifting at all costs, because you will get into trouble with memory consumption really fast, especially when using up to 256 colours for sprites. What you need in addition to the sprite data is of course a one-bitplane mask, but that doesn't need to be preshifted either.


I wasn't suggesting preshifting the pixel data (!), just the mask. But it's better if you don't need to preshift anything.

If I get some time I'll try speed tests for the two cases to compare. Source sizes greater than 32 pixels also don't work in that mode - but that's probably not a limiting factor. I suppose you could even tile bigger sprites to conform?

Postby dml » Sun Sep 28, 2014 2:07 pm

BTW: a little trick which might help speed things further with the blit.... (I have also used this before, when it can be afforded)

move.l a0,0x8a24.w | Source Address.
move.l a1,0x8a32.w | Destination Address.

It is possible to update just the LSW of the source or dest address, if your targets stay within a single 64k-aligned bank. This is easier for source data in most cases, since it can be packed so that it doesn't cross a 64k boundary. Destination management is harder but can be made to work if you bank the destination. Whether it is worth doing depends on how many blit state reloads you need to do per frame - probably more applicable to polygons tbh. Still, nice to keep in mind.
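To make the 64k condition concrete, here is a small C sketch of the check a packing tool (or a debug assert) might do before relying on the trick. The function names are hypothetical. The only hardware fact assumed is the big-endian register layout: with the long registers at 0x8a24 and 0x8a32 as in the two moves above, the low words would sit at 0x8a26 and 0x8a34.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the block [addr, addr + size) lies inside one 64k bank, i.e. the
   upper 16 bits of every address in it are identical. size must be > 0. */
bool fits_one_64k_bank(uint32_t addr, uint32_t size)
{
    return (addr >> 16) == ((addr + size - 1u) >> 16);
}

/* The part of the address that still has to be written for each blit,
   once the high word has been set up for the bank. */
uint16_t address_low_word(uint32_t addr)
{
    return (uint16_t)(addr & 0xFFFFu);
}
```

If the check fails for a sprite, the packer can simply move that sprite to the start of the next bank. Whether the saved bus cycles are noticeable depends on the number of reloads per frame, as noted above.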

dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3474
Joined: Sat Jun 30, 2012 9:33 am

Re: Software sprites

Postby dml » Sun Sep 28, 2014 2:24 pm

EvilFranky wrote:Are us mere mortals going to see some graphical examples of the 2 ways of rendering these 256 colour sprites? :angel: :mrgreen:
A before and after? :)


Judging the two methods on merit alone, I expect the blitter will be faster, all else being equal, for 32x32 scrolled sprites, for which it is optimal (for other sizes it's less clear, but maybe still ahead).

The situation might be different for fewer planes, since the cost to initiate the blitter for each sprite line is constant, and involves a bunch of memory operations. But the point of diminishing returns is still probably 3 or 4 planes (guesswork), so it may just always be faster.

In any case, time permitting I'll see what the CPU can do when optimized in case it's still worth looking at. It's fun to try.

[EDIT]

I'm having a lot of trouble posting the last couple of days. Not sure if it's me or AF is having problems again.

EvilFranky
Atari Super Hero
Atari Super Hero
Posts: 868
Joined: Thu Sep 11, 2003 10:49 pm
Location: UK
Contact:

Re: Software sprites

Postby EvilFranky » Sun Sep 28, 2014 2:54 pm

Can I just ask then, in my limited knowledge and only from what I have read...the blitter is just regarded as something to use in very special cases when it comes to sprites (on STE).

Is this routine basically only faster on a Falcon due to double the number of bitplanes? With 4 bitplanes there would be no benefit?

EvilFranky
Atari Super Hero
Atari Super Hero
Posts: 868
Joined: Thu Sep 11, 2003 10:49 pm
Location: UK
Contact:

Re: Software sprites

Postby EvilFranky » Sun Sep 28, 2014 2:54 pm

I couldn't access AF for about 20 mins earlier...back to normal now.

dml
Fuji Shaped Bastard
Fuji Shaped Bastard
Posts: 3474
Joined: Sat Jun 30, 2012 9:33 am

Re: Software sprites

Postby dml » Sun Sep 28, 2014 2:58 pm

EvilFranky wrote:Can I just ask then, in my limited knowledge and only from what I have read...the blitter is just regarded as something to use in very special cases when it comes to sprites (on STE).

Is this routine basically only faster on a Falcon due to double the number of bitplanes? With 4 bitplanes there would be no benefit?


I did polygon filling this way on STE because it was faster (with some adaptation) so it might be the same situation for sprites too. I just haven't tried it - maybe Anima has.

Anima
Atari Super Hero
Atari Super Hero
Posts: 667
Joined: Fri Mar 06, 2009 9:43 am
Contact:

Re: Software sprites

Postby Anima » Sun Sep 28, 2014 3:12 pm

EvilFranky wrote:Are us mere mortals going to see some graphical examples of the 2 ways of rendering these 256 colour sprites? :angel: :mrgreen:

A before and after? :)

Yes, you can try it yourself (requires a real Atari Falcon with RGB/TV): 9_STD.TOS (9 sprites using the standard "AND and OR" blits) and 15_FAST.TOS (using the technique described above). Note: so far it does not work properly in an emulator like Hatari, so there's a video as well. ;)

