little Blitter benchmark demo

All about demos on the Falcon, TT & clones
exxos
Fuji Shaped Bastard
Posts: 4933
Joined: Fri Mar 28, 2003 8:36 pm
Location: England

Re: little Blitter benchmark demo

Post by exxos »

Frank B wrote: More bugs to fix I think :)

The speed up is just because the teletype is paused and not printing chars just before the fade :) In that case it does 27 bobs a frame :D
Yeah, always happy to break stuff :lol:

Oh yep, I see the VBL count go back to 1 VBL just before the text fades :)
4MB STFM 1.44 FD- VELOCE+ 020 STE - Falcon 030 CT60 - Atari 2600 - Atari 7800 - Gigafile - SD Floppy Emulator - PeST - various clutter

http://www.exxoshost.co.uk/atari/ All my hardware guides - mods - games - STOS
http://www.exxoshost.co.uk/atari/last/storenew/ - All my hardware mods for sale - Please help support by making a purchase.
http://ataristeven.exxoshost.co.uk/Steem.htm Latest Steem Emulator
Eero Tamminen
Fuji Shaped Bastard
Posts: 3993
Joined: Sun Jul 31, 2011 1:11 pm

Re: little Blitter benchmark demo

Post by Eero Tamminen »

Latest version (v7) works fine with EmuTOS too.
leonard
Moderator
Posts: 683
Joined: Thu May 23, 2002 10:48 pm

Re: little Blitter benchmark demo

Post by leonard »

Frank B wrote:Possibly but we're comparing general purpose code against demo code ;) I bet there is some serious cheating going on
I agree we are comparing two very different things. But my point is that you already compare generic BLITTER code with a PRE-SHIFT CPU routine. To me, "preshift" is already not so "generic" (because of the amount of memory needed, especially with sprite animation).
So as soon as your benchmark compares blitter vs "preshift" CPU, I thought it would be interesting to add a comparison with generated code, which in my opinion is just another form of "preshift".
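To put a rough number on the memory cost of pre-shifting, here is a small sketch in Python (used only for illustration, not taken from any of the demos). It assumes one mask plane per sprite, one extra "spill" word per line for the shifted pixels, and all 16 pixel offsets stored; `sprite_bytes` is a hypothetical helper name.

```python
def sprite_bytes(width_px, height, planes, preshifted, mask_planes=1):
    """Bytes needed for one animation frame of a sprite plus its mask."""
    words_per_line = (width_px + 15) // 16   # 16 pixels per 16-bit word
    copies = 1
    if preshifted:
        words_per_line += 1                  # assumed spill word for shifted pixels
        copies = 16                          # one copy per pixel offset 0..15
    bytes_per_plane = words_per_line * 2 * height
    return copies * bytes_per_plane * (planes + mask_planes)


plain = sprite_bytes(32, 32, 4, preshifted=False)
shifted = sprite_bytes(32, 32, 4, preshifted=True)
print(plain, shifted, shifted // plain)      # 640 15360 24
```

Under these assumptions a single 32*32 4-bitplane frame grows from 640 bytes to 15360 bytes, and that is multiplied again by the number of animation frames.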

Another interesting thing to add (but it takes time to code :( ) is the "generic" case in a game: save & restore background. In that case, and in 4 bitplanes, maybe the CPU is slightly better, just because the background read is combined with the register load used to apply the mask (1 read instead of 2 for the blitter).

Anyway, I just want to say that the blitter is faster than the CPU for many things. But people have to be aware of one thing: if you have to write a sprite demo, depending on many parameters (sprite size, shape etc.), you HAVE to choose carefully between CPU & blitter, because the blitter won't be the best in all cases!

To get a real comparison with 4 bitplanes, there is the nice reset demo in the Decade Demo, showing 11 32*32 4-bitplane sprites, using pre-shifted data (but no generated code) and a real-time waveform. I drew 18 of exactly the same sprite using generated code in the genius demo, back in 1990 :)
Leonard/OXYGENE.
Frank B
Atari God
Posts: 1060
Joined: Wed Jan 04, 2006 1:28 am
Location: Glasgow

Re: little Blitter benchmark demo

Post by Frank B »

I'll maybe do another one with a 3 or 4 plane sprite, including restoral of the background. :)
First of all I want to add a Falcon 030 renderer and proper blitter detection, fix the delayed write of the video counters and add a HOP filtering trick (5 bitplane scroller ;) ). I'd also like to compare a CPU-shifted scroller vs the blitter and port both to the Amiga. Might be interesting to have a "restart method" blitter renderer too.

This is in addition to optimising the code. I'm still resistant to adding certain optimisations at the expense of the API though.
I'll publish the code for others to play with :)
AtariZoll
Fuji Shaped Bastard
Posts: 2978
Joined: Mon Feb 20, 2012 4:42 pm

Re: little Blitter benchmark demo

Post by AtariZoll »

I started work on a benchmark based on my own ideas: something close to a game sprite drawing system. So, full background restore, low res ... It will not display the VBL count for a given number of sprites, but the time needed to draw (and undraw) a certain number of sprites, at any screen location. To start with there will be no clipping support, so all sprites will be drawn fully, which means that positions will not go near the edges. But a real test should include drawing with clipping too.
I don't see that we need special code for the Falcon when ST low res is used.

Regarding a CPU-based fine scroller: that's fast enough only on the TT; on the Falcon it is slower than a blitter-based scroll on an ST(E). So I guess it will be OK on an Amiga 1200, although I don't see the point. I used pretty long code, with separate routines for all shifts 1-15.
Famous Schrodinger's cat hypothetical experiment says that cat is dead or alive until we open box and see condition of poor animal, which deserved better logic. Cat is always in some certain state - regardless from is observer able or not to see what the state is.
Frank B
Atari God
Posts: 1060
Joined: Wed Jan 04, 2006 1:28 am
Location: Glasgow

Re: little Blitter benchmark demo

Post by Frank B »

AtariZoll wrote: I started work on a benchmark based on my own ideas: something close to a game sprite drawing system. So, full background restore, low res ... It will not display the VBL count for a given number of sprites, but the time needed to draw (and undraw) a certain number of sprites, at any screen location. To start with there will be no clipping support, so all sprites will be drawn fully, which means that positions will not go near the edges. But a real test should include drawing with clipping too.
I don't see that we need special code for the Falcon when ST low res is used.

Regarding a CPU-based fine scroller: that's fast enough only on the TT; on the Falcon it is slower than a blitter-based scroll on an ST(E). So I guess it will be OK on an Amiga 1200, although I don't see the point. I used pretty long code, with separate routines for all shifts 1-15.
Cool :) My CPU realtime shifter can draw nine 32 * 30 2-plane objects a frame, including the teletype and frame counter drawing. Reasonably fast for the CPU. It'll slow down if I have to restore rather than clear the background.

Mine also has a raster timer btw. You can enable it with the 6 key and see how much time each component needs.
If we're going to compare mine to yours we should set some constraints first, i.e. the API and the format of the input data.
AtariZoll
Fuji Shaped Bastard
Posts: 2978
Joined: Mon Feb 20, 2012 4:42 pm

Re: little Blitter benchmark demo

Post by AtariZoll »

I don't see why the API is relevant. Regarding input data: 4 bpp. ST low res is what is most used where sprites are concerned, and with the blitter in games and demos overall.
The most interesting way of drawing sprites for games is the following: first save the background for the later sprite undraw, then apply the mask to the background using an AND operation with the mask data. Then draw the sprite itself with OR. Undraw is a simple copy of the saved background data.
A little speed can be gained if the complete background is stored separately, so there is no need to save it before drawing the sprite, at the price of some RAM.
There is a way that needs no background save and restore: constantly updating the whole background, so sprites are "automatically" removed.
But that costs a lot of CPU or blitter time. However, it is necessary if we want a scrolling background. It may also be good with many sprites, because it saves a lot by not doing background save and restore.
Because of masking, I concluded that colour index 0 in the sprite data should be transparent, i.e. show the background colour. Then masking can be done fastest.
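A minimal sketch of the save / AND / OR / restore sequence described above, written in Python rather than 68000 assembler purely for illustration. It assumes the ST low res layout (80 words per scan line, four interleaved plane words per 16 pixels), a 16-pixel-wide sprite on a word boundary, and colour index 0 as transparent. The names `make_mask`, `draw_sprite` and `undraw_sprite` are hypothetical and not taken from anyone's benchmark code.

```python
WORDS_PER_LINE = 80   # ST low res: 160 bytes per scan line
PLANES = 4            # four interleaved bit planes


def make_mask(plane_words):
    """A pixel is opaque when its colour index is non-zero in any plane."""
    mask = 0
    for word in plane_words:
        mask |= word
    return mask


def draw_sprite(screen, start, sprite_lines):
    """Draw a 16-pixel-wide sprite; return the saved background words."""
    saved = []
    for y, planes in enumerate(sprite_lines):
        mask = make_mask(planes)
        base = start + y * WORDS_PER_LINE
        for p in range(PLANES):
            saved.append(screen[base + p])                    # save background
            hole = screen[base + p] & ~mask & 0xFFFF          # AND with inverted mask
            screen[base + p] = hole | planes[p]               # OR in the sprite
    return saved


def undraw_sprite(screen, start, saved):
    """Copy the saved background words back to the screen."""
    for i, word in enumerate(saved):
        y, p = divmod(i, PLANES)
        screen[start + y * WORDS_PER_LINE + p] = word
```

Because the mask is just the OR of the four plane words, it could be computed once in advance instead of per draw; the sketch recomputes it only to keep the data format small.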
I did a basic speed comparison by simply drawing 32x32 px sprites using the CPU or the blitter. The blitter way is some 20% faster, but it needs 4x more space for mask data, because the blitter works in only 2 dimensions, so it cannot use the same mask for all 4 bit planes. On a Mega STE set to 16 MHz the CPU way is faster, no surprise :D . My test program is still at a very early stage, so I will not post it yet.
Tests on the Falcon are a little strange: with CPU and blitter set to 16 MHz, the CPU way is some 30% faster.
Steven Seagal
Fuji Shaped Bastard
Posts: 2018
Joined: Sun Dec 04, 2005 9:12 am
Location: Undisclosed

Re: little Blitter benchmark demo

Post by Steven Seagal »

Next version of Steem SSE will correctly clear "hog" bit, but there's also a timing problem, Steem is too quick (up to $14 blobs/VBL).
I found a bug where Steem didn't count the reading cycles on "FXSR", but that's not enough. This program is happy only when we do count cycles for "NFSR", which you shouldn't ... further investigations required.
So see how useful your little bug-ridden program turns out to be.
In the CIA we learned that ST ruled
Steem SSE: http://sourceforge.net/projects/steemsse
AtariZoll
Fuji Shaped Bastard
Posts: 2978
Joined: Mon Feb 20, 2012 4:42 pm

Re: little Blitter benchmark demo

Post by AtariZoll »

When not using skew and those source re-read registers, blitter speed is the same as on real HW. I will soon check with skew too.
However, I noticed a strange thing in 3.2: it says 14 T-states for move.l d0,(a1)+, but that's 12 for sure. And 14 will be expanded to 16 in the end.
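The "14 expanded to 16" remark follows the commonly cited rule of thumb that instruction timings on the ST effectively round up to the next multiple of 4 cycles. A tiny Python sketch of that arithmetic, assuming the rule applies to the whole instruction (real behaviour depends on where the bus accesses fall inside it); `st_cycles` is a hypothetical helper name.

```python
def st_cycles(nominal):
    """Round a nominal 68000 cycle count up to a multiple of 4 (rule of thumb)."""
    return (nominal + 3) // 4 * 4


print(st_cycles(12))   # 12: already a multiple of 4
print(st_cycles(14))   # 16: a reported 14 would end up costing 16
```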
Frank B
Atari God
Posts: 1060
Joined: Wed Jan 04, 2006 1:28 am
Location: Glasgow

Re: little Blitter benchmark demo

Post by Frank B »

AtariZoll wrote:I don't see why API is relevant.
Because I want the code to be usable and not a hacky mess. The way the code is now it's trivial to adapt it to any number of planes. It's also trivial to add background saving and restoral.
AtariZoll wrote: Blitter way is faster some 20%, but it needs 4x more space for mask data, because blitter works only with 2 dimensions, so can not use same mask for all 4 bit planes.
Why do you need to duplicate the mask? You can simply blit the same mask (Not SRC AND DST) 4 times to the screen. The set up time is negligible. The only reason for duplicating the masks is to save reloading the source address.
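As a rough illustration of the two options, the Python sketch below applies NOT SRC AND DST either in four passes that share one mask, or in a single pass over a mask duplicated for every plane. It models the screen as a flat list of 16-bit words in interleaved 4-plane order and ignores blitter set-up cost entirely; all function names are hypothetical.

```python
PLANES = 4   # interleaved planes: p0, p1, p2, p3 for every 16 pixels


def mask_four_passes(screen, mask):
    """Four passes, one per plane, all reading the same single-plane mask."""
    for plane in range(PLANES):
        for col, m in enumerate(mask):
            i = col * PLANES + plane          # step over the other three planes
            screen[i] = ~m & screen[i] & 0xFFFF   # NOT SRC AND DST


def duplicate_mask(mask):
    """Repeat every mask word once per plane (4x the storage)."""
    return [m for m in mask for _ in range(PLANES)]


def mask_one_pass(screen, wide_mask):
    """One pass over a mask that already matches the interleaved layout."""
    for i, m in enumerate(wide_mask):
        screen[i] = ~m & screen[i] & 0xFFFF


a = [0xFFFF] * 8
b = list(a)
mask = [0x0FF0, 0xF000]
mask_four_passes(a, mask)
mask_one_pass(b, duplicate_mask(mask))
print(a == b)   # True: both approaches punch the same hole
```

The result is identical either way; the trade-off is three extra blitter starts per sprite against four times the mask storage.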
I'd also advise being careful about drawing conclusions about the speed on the Mega STe. Remember that you're not likely to be blitting the same sprite frame repeatedly in a game. Try cycling through various frames to reduce the cache hits on the source.
Last edited by Frank B on Mon Aug 31, 2015 6:59 am, edited 5 times in total.
Frank B
Atari God
Posts: 1060
Joined: Wed Jan 04, 2006 1:28 am
Location: Glasgow

little Blitter benchmark demo

Post by Frank B »

Steven Seagal wrote:Next version of Steem SSE will correctly clear "hog" bit, but there's also a timing problem, Steem is too quick (up to $14 blobs/VBL).
I found a bug where Steem didn't count the reading cycles on "FXSR", but that's not enough. This program is happy only when we do count cycles for "NFSR", which you shouldn't ... further investigations required.
So see how useful your little bug-ridden program turns out to be.
:) It wouldn't have had those bugs if I'd tested on real hardware :) In any case that behaviour of the hog bit wasn't documented. We all got it wrong. :( The positive thing is we found out something new about the hw behaviour :)
I'm relying on NFSR as an optimisation on the src data.
Last edited by Frank B on Mon Aug 31, 2015 4:15 pm, edited 1 time in total.
AtariZoll
Fuji Shaped Bastard
Posts: 2978
Joined: Mon Feb 20, 2012 4:42 pm

Re: little Blitter benchmark demo

Post by AtariZoll »

Frank B wrote: :) It wouldn't have had those bugs if I'd developed on real hardware :) In any case that behaviour of the hog bit wasn't documented. We all got it wrong! The positive thing is we found out something new about the hw behaviour :)
I'm relying on NFSR as an optimisation on the src data.
What I wrote about it was just the opposite. I said that coders simply set the HOG bit at every blitter start on the basis of "why not, it costs nothing" :D
Frank B wrote: Because I want the code to be usable and not a hacky mess. The way the code is now it's trivial to adapt it to any number of planes. It's also trivial to add background saving and restoral.
My idea was to do things my way, and then we can compare results, or at least speed ratios between different ways of drawing sprites. I don't think that anything is trivial if you want to get the maximum possible speed. And of course, I don't have a clue about your code, and actually I don't want to know. It's always better if multiple people do it, then we can see which code is faster. I'm not sure that it's the best idea to make universal code that is "trivial" to adapt to any number of planes. My goal is the fastest possible code. But not by using dirty tricks, just efficient code.
So, I do it in pure ASM, using Devpac 3. As a test "platform" I use the Steem Debugger, where it is easy to trace and spot bugs. After that I may test on real HW. And as expected, it all works fine on the Falcon, because the code is not messy or hacky. For me C is a mess, and I really don't see the need for it in such software, where besides the test code you need only a few simple tasks such as going into supervisor mode, loading a file, setting the screen ...
Frank B wrote: Why do you need to duplicate the mask?
I'd also advise being careful about drawing conclusions about the speed on the Mega STe. Remember that you're not likely to be blitting the same sprite frame repeatedly in a game. Try cycling through various frames to reduce the cache hits on the source.
Just to reach the maximum possible speed. Setting up and starting the blitter for only 8 bytes of sprite data seems a pretty bad idea, which would cost some 5-10% in speed. For wide sprites, over 60 px wide, it may be worth it.
On the Mega STE some 15% speed gain comes just from faster loading of opcodes and internal CPU operations in this case. The 16 KB cache can hold a number of sprites too :D So I would say that the CPU with some aid is good, especially because the poor blitter was never helped by a cache.
Anyway, if you don't like my approach I can start a new thread in the coding section.
Frank B
Atari God
Posts: 1060
Joined: Wed Jan 04, 2006 1:28 am
Location: Glasgow

Re: little Blitter benchmark demo

Post by Frank B »

The more the merrier. Let's just be careful when doing comparisons. :) My code is all assembler too btw. It could easily be made callable from C, however.
Frank B
Atari God
Posts: 1060
Joined: Wed Jan 04, 2006 1:28 am
Location: Glasgow

Re: little Blitter benchmark demo

Post by Frank B »

Here's a much older build from 1993 or so. It's not a benchmark, just a little intro. Does 30 in a frame though :)
There's one wee cheat I'm using which I'm not using in the benchmark rewrite. Remember my goals aren't necessarily the same as Pepera's or anyone else's.
I'm interested in putting several styles of renderer up against the blitter and benching the result.
Frank B
Atari God
Posts: 1060
Joined: Wed Jan 04, 2006 1:28 am
Location: Glasgow

Re: little Blitter benchmark demo

Post by Frank B »

Hi. I've created a new build of the benchmark with some new features.

1)
It will run on machines with no blitter present. If there is no blitter you will be restricted to CPU modes.
I've used the OS for this rather than trap a bus access. I might add that later.

2)
It has a new blitter "op" draw mode. This can be activated by pressing the 7 key. Key 8 will cycle through each draw mode.
It's fun to see xor and or draw modes :)

3)
It supports the VDI. You can access VDI renderer mode with the 9 key.
In this mode you can press b on the keyboard to toggle the blitter on and off.

4) The teletype no longer requires the blitter. It'll run on a normal ST.
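For readers unfamiliar with the "op" draw modes mentioned in item 2, here is a small Python sketch of how a 4-bit logic-op code can combine a source and a destination word. The numbering follows my reading of the blitter's OP table (3 = source, 4 = NOT source AND destination, 6 = XOR, 7 = OR) and should be checked against the hardware documentation; `blit_op` is a hypothetical name, and which modes key 8 actually cycles through is not stated in the post.

```python
def blit_op(op, src, dst):
    """Combine two 16-bit words using a 4-bit truth table.

    Assumed bit meaning: bit 0 = S AND D, bit 1 = S AND NOT D,
    bit 2 = NOT S AND D, bit 3 = NOT S AND NOT D.
    """
    s, d = src & 0xFFFF, dst & 0xFFFF
    ns, nd = ~s & 0xFFFF, ~d & 0xFFFF
    out = 0
    if op & 1:
        out |= s & d
    if op & 2:
        out |= s & nd
    if op & 4:
        out |= ns & d
    if op & 8:
        out |= ns & nd
    return out


print(hex(blit_op(6, 0x0F0F, 0x00FF)))   # 0xff0  (XOR)
print(hex(blit_op(7, 0x0F0F, 0x00FF)))   # 0xfff  (OR)
```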

The VDI renderer is hideously slow. The reason for this is that the number of planes has to match on the source and destination.
I had to move to a 4 plane object in this mode and use "no-op" planes to leave the text as is.

I.e. it has to do a logical AND with 0,0,-1,-1 rather than a clear. It has to use planes 3 and 4 with zeros for the OR.
Might be useful to benchmark different screen accelerators :)
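The "no-op" planes trick can be checked with a few lines of Python: AND with all ones and OR with all zeros both leave a plane untouched, so planes 3 and 4 (holding the text) survive while planes 1 and 2 are cleared and redrawn. This is only a sketch of the arithmetic on one group of four plane words; the names and sample values are made up and it does not call the VDI.

```python
CLEAR = [0x0000, 0x0000, 0xFFFF, 0xFFFF]   # 0,0,-1,-1 as 16-bit words


def and_planes(dst, src):
    """Word-wise AND of two 4-plane groups."""
    return [d & s & 0xFFFF for d, s in zip(dst, src)]


def or_planes(dst, src):
    """Word-wise OR of two 4-plane groups."""
    return [(d | s) & 0xFFFF for d, s in zip(dst, src)]


screen = [0x1234, 0x5678, 0x9ABC, 0xDEF0]   # planes 3 and 4 hold the text
bob = [0x0FF0, 0x00FF, 0x0000, 0x0000]      # 2-plane bob padded with zero planes

cleared = and_planes(screen, CLEAR)
drawn = or_planes(cleared, bob)
print([hex(w) for w in drawn])   # ['0xff0', '0xff', '0x9abc', '0xdef0']
```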

Other keys are as follows: 1/2 increase and decrease bob count, 3 blitter renderer, 4 CPU pre-shift, 5 CPU realtime shift, 6 raster timer, 7 blitter op mode, 8 to flip op draw mode, 9 for the VDI mode and b to toggle the blitter in VDI mode.
You can also change the waves with function keys.

I'll likely add 030 renderer and Line-A renderer modes next. I've attached screenshots of all modes with 256 objects on screen.

EDIT: Looks like I broke the op renderer..
EDIT: BLITB.TOS contains the fix.