Just another Blitter demo


Cyprian
Atari God
Posts: 1404
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Re: Just another Blitter demo

Postby Cyprian » Wed May 25, 2016 2:10 pm

Anima wrote:The only "proof" I have is that you can modify any Blitter register after starting it. That shouldn't work either if any memory access is blocked.

Nice idea, I only did CPU/BLiTTER vs video raster measurements.
Can you please post some example code?

leonard wrote:Anyway AMIGA blitter is still even better. It can run on 3 operands at once ( mask + or in a single pass ). I did an amiga sprite test long time ago, far from being optimized to death, but I just checked I put 36 sprites ( 32*31, 3 bitplans ).

yep, amiga blitter has a nice cookie-cut function. It needs 4 memory accesses per one sprite word, where ST needs 6 memory accesses (3 for masking and 3 for sprites).

leonard wrote:But It was on a amiga 1200 ( don't know if the blitter is running at the same speed on a plain amiga 500)

Could be interesting to know how many 32*31, 3pl sprites can run on an normal amiga 500.

The a1200 blitter is exactly the same as the a500 one from a performance point of view. The only difference is that on the a1200 the video chip steals fewer memory accesses, thanks to the 64bit video memory path.
Jaguar / TT030 / Mega STe / 800 XL / 1040 STe / Falcon030 / 65 XE / 520 STm / SM124 / SC1435
SDrive / PAK68/3 / CosmosEx / SatanDisk / UltraSatan / USB Floppy Drive Emulator / Eiffel / SIO2PC / Crazy Dots / PAM Net
Hatari / Aranym / Steem / Saint
http://260ste.appspot.com/

alexh
Fuji Shaped Bastard
Posts: 2594
Joined: Wed Oct 20, 2004 1:52 pm
Location: UK - Oxford

Re: Just another Blitter demo

Postby alexh » Wed May 25, 2016 3:38 pm

Cyprian wrote:The only difference is that on the a1200 the video chip steals fewer memory accesses, thanks to the 64bit video memory path.

32-bit. And I think Chip RAM is 2x the speed (14MHz instead of 7MHz)
Last edited by alexh on Wed May 25, 2016 3:40 pm, edited 1 time in total.

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Wed May 25, 2016 3:39 pm

leonard wrote:This is really nice code! I love sprites record :)

Thanks! I know because I like your 16 x 16 record demo(s). ;)

leonard wrote:Since I wrote the "We Were @" demo, my opinion about blitter has changed a bit :) Blitter is efficient because you don't have instruction prefetching penalties. For 32*32 sprites, because of the "mask set each scanline" trick, I realize blitter is faster than CPU for 32*32. Maybe that's not the case for smaller sprites.

Correct. Unfortunately the setup costs are higher for smaller sprites and fewer bitplanes. Btw.: "We Were @" is an awesome work. :cheers:

leonard wrote:Could be interesting to know how many 32*31, 3pl sprites can run on an normal amiga 500.

Maybe Frank B. can give us some numbers? ;)

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Wed May 25, 2016 4:09 pm

Cyprian wrote:Nice idea, I only did CPU/BLiTTER vs video raster measurements.
Can you please post some example code?

The current general case looks like this:

   move.l   (a2)+,(a0)   ; Endmask 1 + 2.
   move     (a2)+,(a1)   ; Endmask 3.
   move.l   a3,(a4)      ; Destination Address.
   move     d6,(a5)      ; Y Count.
   move.b   d5,(a6)      ; Busy, HOG, Smudge, Line Number. Start the Blitter.
   add      d4,a3        ; New Destination Address. Will be executed before start.

   ; next line

   move.l   (a2)+,(a0)   ; Endmask 1 + 2.
   move     (a2)+,(a1)   ; Endmask 3.
   move.l   a3,(a4)      ; Destination Address.
   move     d6,(a5)      ; Y Count.
   move.b   d5,(a6)      ; Busy, HOG, Smudge, Line Number. Start the Blitter.
   add      d4,a3        ; New Destination Address. Will be executed before start.

[...]

And can theoretically be optimised to only five instructions per line:

   move.l   <destination_address>,(a4)   ; Destination Address for the first line.

   move.l   (a2)+,(a0)   ; Endmask 1 + 2.
   move     (a2)+,(a1)   ; Endmask 3.
   move     d6,(a5)      ; Y Count.
   move.b   d5,(a6)      ; Busy, HOG, Smudge, Line Number. Start the Blitter.
   add.l    d4,(a4)      ; New Destination Address. Memory access blocked so this will be executed after the Blitter finished.

   ; next line

   move.l   (a2)+,(a0)   ; Endmask 1 + 2.
   move     (a2)+,(a1)   ; Endmask 3.
   move     d6,(a5)      ; Y Count.
   move.b   d5,(a6)      ; Busy, HOG, Smudge, Line Number. Start the Blitter.
   add.l    d4,(a4)      ; New Destination Address. Memory access blocked so this will be executed after the Blitter finished.

[...]

But in fact the second, optimised version doesn't work: the add.l reaches the destination address register before the Blitter really starts, so the Blitter has already been updated with the next line's destination address.

Cyprian wrote:yep, amiga blitter has a nice cookie-cut function. It needs 4 memory accesses per one sprite word, where ST needs 6 memory accesses (3 for masking and 3 for sprites).

Please keep in mind that the Atari Blitter is quite efficient where the mask is $ffff. So in this case it copies the data instead of doing expensive RMW operations (with the new approach).

Edit:

Some numbers: in general the Amiga Blitter needs exactly half as many clock cycles for each memory operation. So the "cookie cut" method uses a total of 8 cycles per output word (4 memory accesses x 2 cycles). With the new blitting method above, the Atari can only reach this speed on sprite areas where the ENDMASK is $ffff.

It would be nice to see what's left when we add the Blitter setup overhead and the speed penalty where ENDMASK != $ffff, and keep the faster CPU (8 MHz vs. ~7.1 MHz) in mind. In other words: does the CPU speed advantage compensate for the setup and mask penalty costs?

2nd edit:

I forgot that the Atari Blitter also has an NFSR option. It seems that could be helpful as well. :D

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Wed May 25, 2016 6:09 pm

Ok a first calculation attempt for drawing a 32 x 32 pixels sprite in 16 colours (4 bitplanes):

Amiga Blitter "cookie cut" (without Blitter setup cost per sprite):

Total words to be processed: 3 words x 4 bitplanes x 32 lines = 384.
Result: 8 clock cycles per word for the cookie cut operation results in a total of: 384 x 8 = 3072 CPU cycles.

Atari Blitter (new method, assuming 50% mask compression and that on average 80% of the destination words have ENDMASK != $ffff):

Total words to be processed: (2 words wide source "NFSR" + 3 words wide destination) x 4 planes x 32 lines = 640 memory accesses.
Result for "copy only": 4 x 640 = 2560 CPU cycles.

"80% ENDMASK != $ffff" adds another 4(?) cycles for each affected destination word, so in addition we have: 3 x 4 x 32 x 0.8 x 4 = 1229 CPU cycles (rounded).

Blitter setup with mask: 20 + 12 + 8 + 8 + 8 = 56 CPU cycles per line.
Blitter setup without mask: 8 + 8 + 8 = 24 CPU cycles per line.
50% mask compression = (56 + 24) / 2 = 40 cycles per line.
Result for the setup: 32 x 40 = 1280 CPU cycles.

Total result: 2560 + 1280 + 1229 = 5069 CPU cycles.

However, compared to the higher system clock of the Atari the Amiga Blitter timing would result in having 3072 x 8 / 7.1 ~= 3461 Atari CPU cycles "wasted".

So on a rough estimate the Amiga 500 is still about 46% faster. This number is probably lower due to the DMA bandwidth usage by other devices!?

Interestingly without the setup cost for each line the number for the Atari would be: 2560 + 1229 = 3789 CPU cycles. Compared to the equivalent 3461 CPU cycles for the Amiga version that's not bad at all!

Did I miss something? Comments?

Edit: the Blitter setup numbers are wrong. Forget everything in this post. Argh...
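
For anyone who wants to redo these sums with other assumptions, here is a minimal Python sketch of the estimate. It uses the figures exactly as posted above, including the per-line setup costs that the edit flags as wrong, so it only reproduces the arithmetic and is not a measurement. All function and parameter names are made up for illustration.

```python
# Rough cycle estimate for one 32 x 32, 4-bitplane sprite.
# Every constant below is taken from the post; none of it is measured.

LINES = 32
PLANES = 4


def amiga_cookie_cut_cycles(dest_words=3, cycles_per_word=8):
    """Amiga cookie cut: 8 cycles per output word, no setup cost included."""
    return dest_words * PLANES * LINES * cycles_per_word


def atari_cycles(src_words=2, dest_words=3, masked_fraction=0.8,
                 setup_masked=56, setup_unmasked=24, include_setup=True):
    """Atari Blitter: plain copy + penalty for masked words + per-line setup."""
    copy = (src_words + dest_words) * PLANES * LINES * 4
    penalty = round(dest_words * PLANES * LINES * masked_fraction * 4)
    # 50% mask compression: average of the masked and unmasked setup cost.
    setup = LINES * (setup_masked + setup_unmasked) // 2 if include_setup else 0
    return copy + penalty + setup


def amiga_in_atari_cycles(cycles, atari_mhz=8.0, amiga_mhz=7.1):
    """Express Amiga cycles as the number of Atari CPU cycles that pass meanwhile."""
    return round(cycles * atari_mhz / amiga_mhz)


if __name__ == "__main__":
    amiga = amiga_in_atari_cycles(amiga_cookie_cut_cycles())
    atari = atari_cycles()
    print("Amiga (in Atari cycles):", amiga)
    print("Atari with setup:       ", atari)
    print("Atari without setup:    ", atari_cycles(include_setup=False))
    print("Atari / Amiga ratio:     %.2f" % (atari / amiga))
```

Changing `masked_fraction` or the two setup costs is the quickest way to see how sensitive the comparison is to those guesses.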

Cyprian
Atari God
Posts: 1404
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Re: Just another Blitter demo

Postby Cyprian » Thu May 26, 2016 11:54 pm

alexh wrote:32-bit. And I think Chip RAM is 2x the speed (14MHz instead of 7MHz)

AGA has 64bit data access, the 68020 32bit and the blitter 16bit access.
a1200 chip ram works in exactly the same manner as in the a500 - it has ~224 memory access cycles per PAL scanline.
14MHz with a 32bit bus should offer 56MB/s of bandwidth, but on the a1200 you can get at most 4.5MB/s for read and 6.9MB/s for write.

Anima wrote:Amiga Blitter "cookie cut" (without Blitter setup cost per sprite):

Total words to be processed: 3 words x 4 bitplanes x 32 lines = 384.
Result: 8 clock cycles per word for the cookie cut operation results in a total of: 384 x 8 = 3072 CPU cycles.

When all amiga DMAs are OFF, the cookie cut takes 8 clock cycles per word. In 4 bitplane mode it takes 16 cycles. IMO it is better to base the calculations on available memory slots.
The a500/a1200 has 224 available memory slots per PAL scanline when screen and sound are off. For a 320px line, 5 bitplanes need 100 memory slots and 4 bitplanes need 80 memory slots per line.

Let's count the available memory slots per PAL frame:
- amiga 320x200 4 bitplanes: 200 * 144 + 113 * 224 = 54112
- amiga 320x256 5 bitplanes: 256 * 124 + 57 * 224 = 44512
- Atari: 313*128 = 40064
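
The three budgets above can be recomputed with a short Python sketch. It only restates the assumptions from this post (313 PAL lines, 224 slots per Amiga line, 128 per Atari line, 20 words per bitplane for a 320px line); the names are illustrative.

```python
PAL_LINES = 313        # scanlines per PAL frame, as used in the post
AMIGA_SLOTS = 224      # Amiga memory slots per scanline, screen and sound off
ATARI_SLOTS = 128      # Atari memory slots per scanline, as used in the post
WORDS_PER_LINE = 20    # 320 pixels / 16 pixels per word, per bitplane


def amiga_free_slots(display_lines, bitplanes):
    """Slots left per frame after bitplane DMA has taken its share."""
    display_cost = bitplanes * WORDS_PER_LINE              # 80 for 4 bpl, 100 for 5 bpl
    visible = display_lines * (AMIGA_SLOTS - display_cost)
    border = (PAL_LINES - display_lines) * AMIGA_SLOTS     # lines without bitplane DMA
    return visible + border


def atari_slots():
    """All slots in a frame, following the post's 313 * 128 figure."""
    return PAL_LINES * ATARI_SLOTS


if __name__ == "__main__":
    print("amiga 320x200 4 bitplanes:", amiga_free_slots(200, 4))
    print("amiga 320x256 5 bitplanes:", amiga_free_slots(256, 5))
    print("Atari:                    ", atari_slots())
```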

amiga's cookie cut needs 4 memory accesses (source, mask, background, destination) per one output word:
- source/mask 32px * 32 lines * 4 bitplanes = 256 words
- background/destination 48px * 32 lines * 4 bitplanes = 384 words
- clear (2 memory accesses) 384 * 2 = 768
Result: 1280 words per sprite (256 + 256 + 384 + 384)
Result with clearing: 2048 words per sprite (256 + 256 + 384 + 384 + 768)

Atari in case of mask in Endmask needs 3 memory accesses (source, background, destination) per one output word.
- source 32px * 32 lines * 4 bitplanes = 256 words
- background/destination 48px * 32 lines * 4 bitplanes = 384 words
- clear (1 memory access) 384
Result: 1024 words per sprite (256 + 384 + 384)
Result with clearing: 1408 words per sprite (256 + 384 + 384 + 384)

In case of Atari we need to setup Endmask/Destination address every line. Your calculations were:
- 50% mask compression = (56 + 24) / 2 = 40 cycles per line
- Result for the setup: 32 x 40 = 1280 CPU cycles
1280 CPU cycles equals 320 memory slots

Therefore 1024 words per sprite + 320 memory slots per setup = 1344
With clearing: 1408 +320 = 1728
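
As a cross-check of the per-sprite totals, here is a small Python sketch of the same bookkeeping. It assumes, as above, a 32px source, a 48px destination, 32 lines, 4 bitplanes and 4 CPU cycles per memory slot for the setup conversion. The `src_pixels` parameter is my own addition so that the source width can be varied.

```python
LINES = 32
PLANES = 4


def words(pixels):
    """16-bit words needed for a block of the given pixel width."""
    return (pixels // 16) * LINES * PLANES


def amiga_cost(clear=False, src_pixels=32, dest_pixels=48):
    """Cookie cut: source + mask reads, background read, destination write."""
    src, dest = words(src_pixels), words(dest_pixels)
    total = 2 * src + 2 * dest
    if clear:
        total += 2 * dest          # clearing counted as 2 accesses per word
    return total


def atari_cost(clear=False, setup_cycles=1280, src_pixels=32, dest_pixels=48):
    """Mask in ENDMASK: source read, background read, destination write, plus setup."""
    src, dest = words(src_pixels), words(dest_pixels)
    total = src + 2 * dest
    if clear:
        total += dest              # clearing counted as 1 access per word
    return total + setup_cycles // 4   # 4 CPU cycles per memory slot


if __name__ == "__main__":
    print("amiga:", amiga_cost(), "/ with clearing:", amiga_cost(clear=True))
    print("atari:", atari_cost(), "/ with clearing:", atari_cost(clear=True))
```

Calling `amiga_cost(src_pixels=48)` shows how the Amiga total changes if the source and mask have to be read at the full destination width.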

Sprites 32x32 4bitplanes without clearing:
- Atari: 40064 / 1344 = 29,8
- amiga 320x256 5 bitplanes: 44512/ 1280 = 34,8 (is 16% faster than ST)
- amiga 320x200 4 bitplanes: 54112 / 1280 = 42,3 (is 41% faster than ST)

Sprites 32x32 4bitplanes with clearing:
- Atari: 40064 / 1728 = 23,2
- amiga 320x256 5 bitplanes: 44512 / 2048 = 21,7 (is 6% slower than ST)
- amiga 320x200 4 bitplanes: 54112 / 2048 = 26,4 (is 14% faster than ST)

One important remark: these calculations don't include the amiga blitter setup, so the figures for amiga will be a bit worse.

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Fri May 27, 2016 8:04 am

Thanks for the numbers.

Cyprian wrote:amiga's cookie cut needs 4 memory accesses (source, mask, background, destination) per one output word:
- source/mask 32px * 32 lines * 4 bitplanes = 256 words
- background/destination 48px * 32 lines * 4 bitplanes = 384 words
Result: 1280 words per sprite (256 + 256 + 384 + 384)

One question here, though: AFAIK the Amiga Blitter doesn't have NFSR functionality, so if "cookie cut needs 4 memory accesses per one output word" is true, wouldn't the calculation really look like this:
Result: 1536 words per sprite (4 x 384)

alexh
Fuji Shaped Bastard
Posts: 2594
Joined: Wed Oct 20, 2004 1:52 pm
Location: UK - Oxford

Re: Just another Blitter demo

Postby alexh » Fri May 27, 2016 10:39 am

Cyprian wrote:AGA has 64bit data access

Interesting. Double CAS mode. Never knew that existed.

ctirad
Captain Atari
Posts: 216
Joined: Sun Jul 15, 2012 9:44 pm

Re: Just another Blitter demo

Postby ctirad » Fri May 27, 2016 11:29 am

I wouldn't call it 64bit mode, though. 64bit implies a 64bit bus, but this is rather a kind of burst mode.

Cyprian
Atari God
Posts: 1404
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Re: Just another Blitter demo

Postby Cyprian » Fri May 27, 2016 11:37 am

alexh wrote:
Cyprian wrote:AGA has 64bit data access

Interesting. Double CAS mode. Never knew that existed.

Good point. I saw that some coders use the term "64-bit bitplane fetch", and I also saw notes about "8 byte alignment" for AGA data.
Now I see that "64-bit bitplane fetch" refers to the double CAS / 32bit mode in FMODE.

Anima wrote:One question here, though: AFAIK the Amiga Blitter doesn't have a NFSR functionality so wouldn't the calculation really look like this:
Result: 1536 words per sprite (4 x 384)
as this is correct when "cookie cut needs 4 memory accesses per one output word" is true?


Actually I thought that the BLTAFWM / BLTALWM registers prevent the first/last word from being read, but it seems that they don't.

amiga's cookie cut needs 4 memory accesses (source, mask, background, destination) per one output word:
- source/mask 48px (full destination width, since no word is skipped) * 32 lines * 4 bitplanes = 384 words
- background/destination 48px * 32 lines * 4 bitplanes = 384 words
- clear (2 memory accesses) 384 * 2 = 768

Result for one 32x32 4bitplanes masked sprites:
- without clearing: 1536 words per sprite (384 + 384 + 384 + 384)
- with clearing: 2304 words per sprite (384 + 384 + 384 + 384 + 768)


Sprites 32x32 4bitplanes without clearing:
- Atari: 40064 / 1344 = 29,8
- amiga 320x256 5 bitplanes: 44512 / 1536 = 29,0 (3% slower than ST)
- amiga 320x200 4 bitplanes: 54112 / 1536 = 35,2 (18% faster than ST)

Sprites 32x32 4bitplanes with clearing:
- Atari: 40064 / 1728 = 23,2
- amiga 320x256 5 bitplanes: 44512 / 2304 = 19,3 (17% slower than ST)
- amiga 320x200 4 bitplanes: 54112 / 2304 = 23,5 (more or less equal)
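
A minimal Python sketch of the division behind these lists, using the slot budgets and the revised per-sprite costs quoted in this post. The dictionary keys and function names are only illustrative.

```python
BUDGET = {                      # memory slots per PAL frame
    "atari": 40064,
    "amiga_256x5": 44512,       # 320x256, 5 bitplanes
    "amiga_200x4": 54112,       # 320x200, 4 bitplanes
}

COST = {                        # memory slots per 32x32, 4-bitplane sprite
    "atari": {"plain": 1344, "clear": 1728},
    "amiga": {"plain": 1536, "clear": 2304},
}


def sprites_per_frame(machine, mode):
    """Slot budget divided by the per-sprite cost (fractional result)."""
    family = "atari" if machine == "atari" else "amiga"
    return BUDGET[machine] / COST[family][mode]


def percent_vs_st(machine, mode):
    """Positive means the Amiga setup is faster than the ST, negative slower."""
    ratio = sprites_per_frame(machine, mode) / sprites_per_frame("atari", mode)
    return (ratio - 1.0) * 100.0


if __name__ == "__main__":
    for mode in ("plain", "clear"):
        for machine in BUDGET:
            print("%-12s %-5s %5.1f sprites  %+6.1f%% vs ST" % (
                machine, mode,
                sprites_per_frame(machine, mode),
                percent_vs_st(machine, mode)))
```

None of this accounts for Amiga blitter setup or for CPU time, so the percentages are only as good as the slot model itself.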


---EDIT---
Corrected clearing part, due to bug in cleaning calculation.
Clearing on ST is faster than on amiga: ~80000 vs ~70000 bytes per PAL frame. This is 14% in favor of ST.

calimero
Fuji Shaped Bastard
Posts: 2063
Joined: Thu Sep 15, 2005 10:01 am
Location: STara Pazova, Serbia

Re: Just another Blitter demo

Postby calimero » Sat May 28, 2016 6:20 am

leonard wrote:Could be interesting to know how many 32*31, 3pl sprites can run on an normal amiga 500.


I love Amiga vs Atari records :)

but not like fake records as in: http://www.pouet.net/prod.php?which=65376 :(
using Atari since 1986. http://wet.atari.org http://milan.kovac.cc/atari/software/ ・ Atari Falcon030/CT63/SV ・ Atari STe ・ Atari Mega4/MegaFile30/SM124 ・ Amiga 1200/PPC ・ Amiga 500 ・ C64 ・ ZX Spectrum ・ RPi ・ MagiC! ・ MiNT 1.18 ・ OS X

Cyprian
Atari God
Posts: 1404
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Re: Just another Blitter demo

Postby Cyprian » Sat May 28, 2016 12:10 pm

calimero wrote:I love Amiga vs Atari records :)


me too :)


leonard wrote:I did an amiga sprite test long time ago, far from being optimized to death, but I just checked I put 36 sprites ( 32*31, 3 bitplans ). But It was on a amiga 1200 ( don't know if the blitter is running at the same speed on a plain amiga 500)

Could be interesting to know how many 32*31, 3pl sprites can run on an normal amiga 500.

Can you share your test program? We can check it in WinUAE, which has cycle-exact a500 emulation.

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Sun May 29, 2016 4:51 pm

New reordering of the code results in 21 sprites (with background clearing "only")... :D
https://www.youtube.com/watch?v=0kV3PLvBMNs



Edit: clarify.


Ragstaff
Atari Super Hero
Posts: 610
Joined: Mon Oct 20, 2003 3:39 am
Location: Melbourne Australia

Re: Just another Blitter demo

Postby Ragstaff » Mon May 30, 2016 9:02 am

Very impressive result!

Cyprian
Atari God
Posts: 1404
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Re: Just another Blitter demo

Postby Cyprian » Mon May 30, 2016 10:52 am

Anima wrote:New reordering of the code results in 21 sprites (with background clearing "only")... :D

Edit: clarify.


great!

need more!

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Mon May 30, 2016 1:53 pm

Ragstaff wrote:Very impressive result!

Cyprian wrote:great!

need more!

Thanks both.

The code is still viable for game development. I am about to write a tool to create optimized sprites from an image so that you can include them in your source. Stay tuned.

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Wed Jun 01, 2016 4:58 pm

Small update: here's the program to try out on Hatari or your Atari STE. Should work on the Mega STE and Falcon030 as well.

Please run it at 50 Hz. ;)

frost
Captain Atari
Posts: 346
Joined: Sun Dec 01, 2002 2:50 am
Location: Limoges

Re: Just another Blitter demo

Postby frost » Tue Jun 21, 2016 9:23 pm

well, I didn't go to the second page, sorry for the noise :oops:
My blog, mostly about Atari and demo stuff.

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Just another Blitter demo

Postby dml » Fri Aug 26, 2016 1:18 pm

Anima/Leonard...

I had a shot at this myself (the code-generating variant) and will have some conclusions soon. Haven't quite tested it properly - just internal debug-simulation so far and checking code quality by eye. However, the output does look fairly interesting. Will report something here when I know for sure it's all working ok... ;)

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Sat Aug 27, 2016 3:25 pm

dml wrote:Will report something here when I know for sure it's all working ok... ;)

Sounds interesting. Looking forward to your report. :coffe:

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Just another Blitter demo

Postby dml » Thu Sep 01, 2016 10:15 am

Have been too busy to post much, but in summary: the line-reordering optimization seems to help a lot with the amount of data that needs to be loaded/stored and consequently with the registers allocated. Just need to make sure the line order is optimized for all mask preshifts simultaneously (and not one at a time), since it requires reordering the colour data at the same time - and preferably only one copy! :)

I'll have a lot more notes to add later when I have more time - for other types of optimizations - but this part was worth a mention since it definitely has an obvious impact in my tests.

[EDIT]

Forgot to clarify - in this case I'm generating custom code per preshift, where they share an optimized line order. So it takes more space than the version you describe. The only difference is that the line order is optimized for delta minimization per individual endmask register, since it's different code per preshift per line. Other than that it's basically the same sort of thing...

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Just another Blitter demo

Postby dml » Thu Sep 01, 2016 10:54 am

Here is the analysis of the preshifts from a real sprite used in my game demo:

The mask preview from preshift #13:

preshift 13 preview:
000000000000000000000000000111110000000000000000
000000000000000000000011101111111111100000000000
000000000000000000000111111111111111110000000000
000000000000000000001111111111111111111100000000
000000000000000000001111111111111111111110000000
000000000000000011111111111111111111111111000000
000000000000000111111111111111111111111111000000
000000000000001111111111111111111111111111100000
000000000000001111111111111111111111111111100000
000000000000001111111111111111111111111111100000
000000000000001111111111111111111111111111110000
000000000000000111111111111111111111111111110000
000000000000000011111111111111111111111111110000
000000000000000011111111111111111111111111110000
000000000000000111111111111111111111111111110000
000000000000001111111111111111111111111111110000
000000000000011111111111111111111111111111111000
000000000000011111111111111111111111111111111000
000000000000011111111111111111111111111111111000
000000000000001111111111111111111111111111111000
000000000000001111111111111111111111111111110000
000000000000000111111111111111111111111111100000
000000000000000111111111111111111111111111000000
000000000000000111111111111111111111111111000000
000000000000000111111111111111111111111111000000
000000000000000011111111111111111111111111000000
000000000000000001111111111111111111111110000000
000000000000000000001111111111111111100000000000
000000000000000000000001111110011111000000000000
000000000000000000000000111000000000000000000000


Delta analysis for this preshift, without line reordering:

surviving datawords...
000000000000000000000000000111110000000000000000
----------------00000011101111111111100000000000
----------------00000111111111111111110000000000
----------------00001111111111111111111100000000
--------------------------------1111111110000000
----------------11111111111111111111111111000000
0000000000000001--------------------------------
0000000000000011----------------1111111111100000
--------------------------------1111111111110000
0000000000000001--------------------------------
0000000000000000--------------------------------
0000000000000111----------------1111111111111000
0000000000000011--------------------------------
0000000000000001----------------1111111111100000
----------------01111111111111111111111110000000
----------------00001111111111111111100000000000
----------------00000001111110011111000000000000
----------------00000000111000000000000000000000
unique = 21
stores = 31


Analysis, with line reordering:

surviving datawords...
000000000000000000000000111000000000000000000000
----------------0000000000011111----------------
----------------00000011101111111111100000000000
----------------0000111111111111----------------
--------------------------------1111111100000000
--------------------------------1111111110000000
----------------0111111111111111----------------
000000000000001111111111111111111111111111100000
--------------------------------1111111111110000
0000000000000001--------------------------------
0000000000000000--------------------------------
0000000000000111----------------1111111111111000
0000000000000011--------------------------------
0000000000000001--------------------------------
--------------------------------1111111111000000
0000000000000000--------------------------------
----------------00000111111111111111110000000000
----------------00000001111110011111000000000000
unique = 21
stores = 26


So that's roughly a 16% saving in EM writes (26 stores instead of 31) just using a common-optimized line order.
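
The store totals can be re-counted from the two "surviving datawords" listings with a few lines of Python. This is only a sketch: each listing line is abbreviated by hand to one character per 16-bit endmask word (`X` = word shown, `.` = dashed out), and the helper names are made up.

```python
def pattern(line, word_bits=16):
    """Abbreviate one listing line: 'X' for a stored word, '.' for a dashed one."""
    chunks = [line[i:i + word_bits] for i in range(0, len(line), word_bits)]
    return "".join("." if set(chunk) == {"-"} else "X" for chunk in chunks)


# The 18 lines of each listing, abbreviated with pattern().
NO_REORDER = ("XXX .XX .XX .XX ..X .XX X.. X.X ..X "
              "X.. X.. X.X X.. X.X .XX .XX .XX .XX").split()
REORDERED = ("XXX .X. .XX .X. ..X ..X .X. XXX ..X "
             "X.. X.. X.X X.. X.. ..X X.. .XX .XX").split()


def stores(patterns):
    """Number of endmask words that still have to be written."""
    return sum(p.count("X") for p in patterns)


def saving_percent(before, after):
    return (before - after) * 100.0 / before


if __name__ == "__main__":
    example = "0000000000000011" + "-" * 16 + "1111111111100000"
    print(pattern(example))                       # one line from the first listing
    before, after = stores(NO_REORDER), stores(REORDERED)
    print(before, after, "%.1f%%" % saving_percent(before, after))
```

The exact percentage depends on whether you divide by the count before or after reordering; the sketch divides by the count before.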

The EM store saving is obviously greater if you use a dedicated line order per preshift, but the cost of doing so makes it fairly useless - might as well just preshift the colour in that case :)

Anyway, just thought this might be interesting info, following on from your own dest-skip modification.

My sprite compiler does other things but I'll wait a bit longer before explaining any of that!

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Just another Blitter demo

Postby dml » Thu Sep 01, 2016 1:05 pm

Oh forgot one other detail on this part - you can also add a small extra penalty for line orders which unnecessarily separate deltas on subsequent lines i.e. two EM stores on two output lines are not adjacent. This increases the number of cases where you can use postinc/predec addressing to update the EMs with a single register / using fewer offsets.

so.. (e.g.)

FFFF FFFF 0000
FFFF 0000 0000
0000 0000 FFFF <- EM store not adjacent with previous line

..can become...

0000 0000 FFFF
FFFF FFFF 0000
FFFF 0000 0000

Of course this penalty has lower significance than total delta counts, but higher than zero-gain reordering.
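
A minimal Python sketch of how such an adjacency check could look, run on the three-line example above. It is not the actual sprite compiler: it assumes three endmask words per line, treats the first line as writing everything, and all names are invented.

```python
def changed(prev, cur):
    """Indices of the endmask words that differ from the previous line."""
    return [i for i, (a, b) in enumerate(zip(prev, cur)) if a != b]


def is_adjacent(indices):
    """True if the changed words form one contiguous run (or nothing changed)."""
    return not indices or indices[-1] - indices[0] + 1 == len(indices)


def non_adjacent_lines(rows):
    """Line numbers (0-based) whose EM stores cannot be done as one postinc run."""
    bad = []
    for n in range(1, len(rows)):          # line 0 writes all words anyway
        if not is_adjacent(changed(rows[n - 1], rows[n])):
            bad.append(n)
    return bad


ORIGINAL = [
    (0xFFFF, 0xFFFF, 0x0000),
    (0xFFFF, 0x0000, 0x0000),
    (0x0000, 0x0000, 0xFFFF),   # changes words 0 and 2, word 1 is skipped
]

REORDERED = [
    (0x0000, 0x0000, 0xFFFF),
    (0xFFFF, 0xFFFF, 0x0000),
    (0xFFFF, 0x0000, 0x0000),
]

if __name__ == "__main__":
    print("original: ", non_adjacent_lines(ORIGINAL))
    print("reordered:", non_adjacent_lines(REORDERED))
```

The length of the returned list could serve as the small extra penalty term, weighted below the total delta count.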

Anima
Atari Super Hero
Posts: 654
Joined: Fri Mar 06, 2009 9:43 am

Re: Just another Blitter demo

Postby Anima » Fri Sep 02, 2016 9:19 am

Thanks for the explanation.

As far as I understand you are optimising the masks for each shifting step, right? So do you have a universal routine to draw the sprites or do you have a different (precompiled) code for each sprite and shifting step?

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Just another Blitter demo

Postby dml » Fri Sep 02, 2016 10:59 am

Anima wrote:Thanks for the explanation.

As far as I understand you are optimising the masks for each shifting step, right? So do you have a universal routine to draw the sprites or do you have a different (precompiled) code for each sprite and shifting step?


Yes exactly - precompiled code per preshift - so it does consume a lot more space than the version you describe, and is probably less appropriate for something like animated large sprites. But it should be faster for equivalently sized sprites which don't need many frames (like a single bob). I estimate around 9k per 32x32 sprite image including colour data. I guess yours will be closer to 1.5-2k including colour data - something like that.

So basically your method looks like the best *general case* method for something like games - which I think was your main criterion. I'm just playing around with alternatives to see what else is worth trying for the most optimized case, and what may lie in between.

BTW I do see a way to make a sort of hybrid which has some benefits of both (midway between speed and cost), but it's a bit more complicated and I haven't really sorted out all the details yet. It involves building a more conservative 'dedicated routine' which serves more than one preshift at a time, while wasting some redundant stores on each one. You'd then be able to control the speed vs storage cost in meaningful steps. Not sure if it's worth the hassle to write, but it's a valid thought experiment at least.

