Falcon Doom


dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Mon Jan 21, 2013 3:26 pm

Eero Tamminen wrote:Ah, you meant the drive head alignment or something similar. That's indeed annoying.


Yes just another example of my bad luck with transfer media!

Eero Tamminen wrote:Using zmodem over serial could work and might even be faster than using floppies.


'Anima' has described a nice version of this which he uses for Falcon dev on Galaga88 etc. When I get more time to play with kit and software I'll try moving to that.

Eero Tamminen wrote:If you look to the end of the src/cpu/newcpu.c file, there are functions for both instruction and data cache, for 020, 030 and 040. And complaints about trickiness of implementing data cache for 030...
Note that these are enabled only when you've enabled cycle exact mode (which is default in latest Hatari WinUAE configuration).


Strangely, the behaviour I am seeing is exactly as if the data cache is turned off all the time. I made a test for this effect and it seems to prove the d-cache is getting 100% miss rate, but I haven't studied the reasons why yet. I also see significantly lower FPS in Hatari vs a real Falcon and the data cache might be the reason.

Or perhaps the d-cache is just off by accident due to a UAE config issue, or a Falcon/TOS emulation problem?

Eero Tamminen wrote:Looking at the sources, I'm not completely sure this happens when you have MMU enabled, but I would assume the code *eventually* to fall into going through the prefetch stuff that does icache check. :-)


I'm still using Hatari 1.6.2 on Windows and have all CPU param checkboxes ticked except '040 MMU emulation'. However it is the WinUAE build (hatari_falcon.exe) so presumably this has the 030 MMU support enabled?

[edit]

...confirmed by printing CACR inside the app (ienab/denab bits enabled), a test with denab turned on and normal textures vs 1-pixel textures and a separate test with denab bit forcibly disabled... all cases result in exactly the same render time so the d-cache emulation must be off/bypassed somewhere in UAE...

EvilFranky
Atari Super Hero
Posts: 846
Joined: Thu Sep 11, 2003 10:49 pm
Location: UK

Re: Falcon Doom

Postby EvilFranky » Mon Jan 21, 2013 4:16 pm

Hi Doug,

Sorry for the late response...got tied up when I was hoping to test this yesterday!

VGA - (My original test was also on VGA)
BMOPT1.TTP - 5.6421 FPS
BMOPT1B.TTP - 5.3543 FPS
BMOPT1C.TTP - 5.2904 FPS
BMOPT1D.TTP - 5.5557 FPS

Will do RGB soon...

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Mon Jan 21, 2013 4:53 pm

EvilFranky wrote:Hi Doug,

Sorry for the late response...got tied up when I was hoping to test this yesterday!

VGA - (My original test was also on VGA)
BMOPT1.TTP - 5.6421 FPS
BMOPT1B.TTP - 5.3543 FPS
BMOPT1C.TTP - 5.2904 FPS
BMOPT1D.TTP - 5.5557 FPS

Will do RGB soon...


Very weird - the only change I made between BMOPT1 and BMOPT1b was to fix the WAD path issue... the other changes were in (c) - disabling a minor opt, and (d) - single-pixel textures. Did the FPS settle down properly before you took the value?

I might have to look into that later, it's quite a big change for no obvious reason. Maybe the performance is affected by code/memory alignment or something.

EvilFranky
Atari Super Hero
Posts: 846
Joined: Thu Sep 11, 2003 10:49 pm
Location: UK

Re: Falcon Doom

Postby EvilFranky » Mon Jan 21, 2013 4:55 pm

Yeah I always let the FPS settle for about 1 minute before recording the value.

Very strange indeed...

calimero
Fuji Shaped Bastard
Posts: 2064
Joined: Thu Sep 15, 2005 10:01 am
Location: Stara Pazova, Serbia

Re: Falcon Doom

Postby calimero » Mon Jan 21, 2013 7:43 pm

VGA (FPS)
File          EvilFranky   calimero
BMOPT1.TTP    5.6421       NA (didn't try)
BMOPT1B.TTP   5.3543       5.3543
BMOPT1C.TTP   5.2904       5.3328
BMOPT1D.TTP   5.5557       5.6025

RGB

BMOPT1.TTP - NA (didn't try)
BMOPT1B.TTP - 5.8404
BMOPT1C.TTP - 5.8233
BMOPT1D.TTP - 6.0347


@dml you can buy NetUSBEE ROM-port LAN adapter for £50 with shipping
check: http://dhs.nu/bbs-trade/index.php?request=1493
using Atari since 1986. http://wet.atari.org http://milan.kovac.cc/atari/software/ ・ Atari Falcon030/CT63/SV ・ Atari STe ・ Atari Mega4/MegaFile30/SM124 ・ Amiga 1200/PPC ・ Amiga 500 ・ C64 ・ ZX Spectrum ・ RPi ・ MagiC! ・ MiNT 1.18 ・ OS X

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Mon Jan 21, 2013 7:53 pm

calimero wrote:RGB
@dml you can buy NetUSBEE ROM-port LAN adapter for £50 with shipping
check: http://dhs.nu/bbs-trade/index.php?request=1493


thx... I was going to pick one up from Lotharek as I need UltraS as well. Hmm, hopefully he doesn't change his mind about producing them :)

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Mon Jan 21, 2013 8:16 pm

As of tonight I think I can see what the invisible primary cost is, apart from drawing.

I'll have to go away for a bit and think about it. Difficult problem but might be solved with a bit of planning.

It's a nasty interaction between ssector segment/wall generation and the way the floors are filled out by wall segments one wall at a time - which leaves a lot of stuff fragmented in the floor. I remember something about this now from the early days on BM and took it as far as I could without reworking the Doom floor approach completely.

It may be that I left it because of a lack of memory on the DSP to hold everything needed at once, or the order ssectors are being processed causing problems for the defragmentation step. In any case I decided to leave it alone at the time...


Below is a screenshot of the final room in e1m1, which is just 4 walls, a floor, a ceiling and a door inset. The floor near the door is fractured into lots of columns, as the segs from different ssectors are responsible for those floor areas and are linked to the inset, not the main room. The defrag routine fixes fractures within convex ssectors but not between them, so this happens. That is approx 6x the number of floor spans actually required to fill the floor :( Each span costs quite a lot to set up for filling, so fractures are expensive - small ones disproportionately so.

floor-problem.png
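
To put rough numbers on why small fractures hurt disproportionately, here is a minimal sketch of the per-pixel cost of a span. The function name and both cycle figures are assumptions made up for illustration, not measurements from BadMood or a real Falcon:

```python
def cycles_per_pixel(span_length, setup_cycles, pixel_cycles):
    """Average cost of one pixel when a span of span_length pixels is filled."""
    return (setup_cycles + span_length * pixel_cycles) / span_length

# Hypothetical costs, for illustration only (not measured on a Falcon).
SETUP, PER_PIXEL = 200, 12

for length in (1, 4, 16, 64):
    print(length, cycles_per_pixel(length, SETUP, PER_PIXEL))
```

With a fixed set-up cost, a 1-pixel span pays the whole set-up for a single pixel, while a long span spreads it out until the per-pixel term dominates.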


Anyway, I'll have to study the code again and think about it a while to see if it can be dealt with some other way. I may not have much to say about progress until I come to a conclusion on that one.


I did manage to shave 4 cycles off the texture addressing step for floor/ceiling textures, so it's down from 16 to 12 cycles. It won't help very much until I fix the other problem above as the savings made are offset by extra set-up code for the faster method and there are still too many spans needing that set-up.

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Mon Jan 21, 2013 9:30 pm

dml wrote:Strangely, the behaviour I am seeing is exactly as if the data cache is turned off all the time. I made a test for this effect and it seems to prove the d-cache is getting 100% miss rate, but I haven't studied the reasons why yet. I also see significantly lower FPS in Hatari vs a real Falcon and the data cache might be the reason.


According to Laurent, although WinUAE CPU has the code for cache handling and it even seems to be used, the cycle counting doesn't take that into account. The code uses Laurent's static cycles table and when MMU is enabled, not even that.

So, after I've added some global value that has the hit/miss information, output that with the other profiling information and verified with you that it seems to be working correctly, Laurent can start taking hits/misses into account in cycle counting.

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Mon Jan 21, 2013 9:48 pm

Eero Tamminen wrote:According to Laurent, although WinUAE CPU has the code for cache handling and it even seems to be used, the cycle counting doesn't take that into account. The code uses Laurent's static cycles table and when MMU is enabled, not even that.

So, after I've added some global value that has the hit/miss information, output that with the other profiling information and verified with you that it seems to be working correctly, Laurent can start taking hits/misses into account in cycle counting.


Ok I see - that does make sense. Great news! It will be interesting to see the gap between a real Falcon and Hatari shrink after the changes - it's the only thing which has stood out so far in the tests I have done.

I got Hatari built on OSX today from Mercurial but for some reason it didn't install properly. Will look at it again later in the week so I can experiment with the cycle counting output.

Thanks again!
Doug.

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Mon Jan 21, 2013 10:20 pm

dml wrote:Anyway, I'll have to study the code again and think about it a while to see if it can be dealt with some other way.


What if you calculate an average value for each texture and use that if the span fill area is small enough? Can you somehow know in an earlier phase (at least approximately) how large they will be?

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Mon Jan 21, 2013 10:30 pm

Eero Tamminen wrote:Can you somehow know in an earlier phase (at least approximately) how large they will be?


As the spans are either vertical or horizontal, if you know that the area will be about one pixel wide or high, there's no need for individual pixel perspective calculations etc. even if the span is long, as everything in that pixel strip is at the same distance from the camera. Are optimizations like that already done, or if not, are they possible to do and would they help?

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Mon Jan 21, 2013 11:08 pm

Eero Tamminen wrote:As the spans are either vertical or horizontal, if you know that the area will be about one pixel wide or high, there's no need for individual pixel perspective calculations etc. even if the span is long, as everything in that pixel strip is at the same distance from the camera. Are optimizations like that already done, or if not, are they possible to do and would they help?


It is possible to do those kinds of 'late stage' optimizations to cut the cost of existing spans. There is enough info to do that. There is a limit to the savings which can be made this way and it probably only works for 1-pixel columns. If you lose even the fraction of the texture coordinate (never mind averaging the texture down to a colour) the human eye picks it up as visible lines in the texture which don't follow the floor movement.

I will try to find some way to join the spans for a given display line before they ever leave the DSP, or worst case, as they arrive from the DSP.

I am remembering more now about how it works, and why the defragmentor can only work within a convex 'ssector'. It is because the spans on the DSP have no notion of texture/lighting attributes so it can only defragment spans against other spans known to have same attribute. The largest primitive Doom provides with a constant attribute is a ssector, and ssectors can be large or very tiny. Sometimes many ssectors are used adjacently to create insets or other more complex structures. They may share the same texture but the defragmentor can't take advantage of that since it has no differentiation criteria for spans from different ssectors, and would happily join spans with different textures given the chance.
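
A rough sketch of that limitation, in Python rather than DSP code. Everything here is hypothetical and only for illustration: the `Span` fields, the `merge_spans` helper and the texture key are assumptions, not BadMood's actual data structures. The same merge routine is run twice on one display line, once keyed on the owning ssector (roughly what the defragmentor can rely on today) and once keyed on a shared attribute (what committing groups of ssectors would allow):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    y: int        # display line
    x0: int       # first pixel (inclusive)
    x1: int       # end pixel (exclusive)
    ssector: int  # hypothetical id of the ssector that produced the span
    attr: str     # hypothetical texture/lighting key

def merge_spans(spans, key):
    """Join spans on the same line that touch end-to-start and share key(span)."""
    merged = []
    for s in sorted(spans, key=lambda s: (s.y, s.x0)):
        last = merged[-1] if merged else None
        if (last is not None and last.y == s.y
                and last.x1 == s.x0 and key(last) == key(s)):
            # Extend the previous span; it keeps its original ssector id.
            merged[-1] = Span(last.y, last.x0, s.x1, last.ssector, last.attr)
        else:
            merged.append(s)
    return merged

# One display line crossing three ssectors that all use the same floor texture.
line = [
    Span(100, 0, 40, ssector=1, attr="FLOOR4_8"),
    Span(100, 40, 43, ssector=2, attr="FLOOR4_8"),
    Span(100, 43, 160, ssector=3, attr="FLOOR4_8"),
]

per_ssector = merge_spans(line, key=lambda s: s.ssector)
per_attr = merge_spans(line, key=lambda s: s.attr)
print(len(per_ssector), len(per_attr))
```

Keyed on the ssector, the three spans stay separate; keyed on the attribute, they collapse into a single span covering the whole line, and spans with different textures are still kept apart.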

So I am now looking at ways to commit groups of ssectors with the same attribute, without breaking BSP and sprite overlay order.

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Tue Jan 22, 2013 12:58 pm

Woke up today with a better understanding of why the floor is fractured into those columns in the first place, how the DSP/texturemapped version of BadMood inherited this from the original DVIEW WAD-viewer project it was based on, and why DVIEW probably inherited it from the original Carmack/Abrash version.

I have a bad dose of flu currently and heavy work pressure in between >coughs< so won't be doing much with the code but there is progress in terms of ideas.

In short I think the 'columns' thing in the floors was an 'optimisation' which suits particular engines/platforms but is a point of pain for our DSP/host version on the Falcon. The 'optimisation' was to leave polygon edges 'open' at ssector boundaries (invisible BSP boundaries between regions) which otherwise had identical attributes and elevation i.e. continuation of the same floor. If you look at real 'step' boundaries (Doom 'windows' if you like) those floor edges are properly closed because otherwise it would be a disaster. But the logic for closing a window edge is no different from closing edges in floors - it's like a window where the height on both sides is equal, after all. So ssector edges have simply been left open - from a rendering perspective the result 'looks' the same because the Doom engine ensures all gaps get filled and the presence of the physical edge is irrelevant.

These edges were likely left open because there are lots of them (they get mostly created by the BSP pass, not the editor) and closing them is generally not free.

On the PC, spans belonging to these nasty 'columns' cost less than spans belonging to formal polygon edges, because the edge generation is not free - there is complex occlusion stuff happening for all edges that get made - Doom rendering knows to stop drawing when the last pixel gets filled and must track occlusion and clip overlaps perfectly to work properly.

On the Falcon, spans belonging to 'columns' cost roughly the same as those from formal polygon edges because the DSP carries a lot of the work for free, but there are more spans and the cost per span is relatively high due to a) span texture run setup costs and b) DSP<->CPU exchanges to extract the span texturing terms.

So it's not mandatory to have these columns for Doom rendering to work and it is hurting the Falcon version. I think that's good news. I just need to prove it's accurate with some tests.

It's especially 'wrong' (to my mind, as a past 3D gfx engineer) that tiny ssectors in the distance will fracture up large areas near the viewer - it seems a bit nuts to me because that's a potentially uncontrolled cost. It could get really bad in places and would probably hurt the PC as well. So I figure just closing the polygon edge at ssector boundaries will solve the problem - at some cost, but likely much smaller cost than initiating all those fractured spans. The trick will be to do it cheaply enough that it doesn't just move the cost elsewhere.

A solution like this won't produce optimal floors BTW - it should improve things in the worst case and may not make a huge difference in some other areas, since the floor will still be separated into polygon regions representing ssectors. But at least small ssectors far away will not interfere with 90% of the floor area.


There is perhaps another solution which could generate optimal floors with no interior edges and zero fragmentation, but it will take me longer to figure out the details and I'm not certain it can work yet. Worth a look though - it would be a big win if possible. The whole floor potentially becomes a single fill event in that case (within a texture state).

That's my braindump for today.

One last thing - I kind of broadcast a call for assistance in the demoscene space over at DHS for texturing optimisations and have been getting useful feedback from the right people. Nothing I can use directly, yet - but I would not be surprised if that ends up leading somewhere too. Many brains better than one (esp. mine sometimes). And if anyone can beat my efforts it is certainly those guys ;)

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Wed Jan 23, 2013 11:52 am

Had another idea about performance, triggered by the Jaguar 'optimised' WAD discussion earlier (which still needs tried BTW).

I did a lot of work on BSPs around y2000 and occasionally since then. I know that the 'best wisdom' at the time Doom was made is not the 'best wisdom' now, regarding the production of BSPs for static visibility sets and detail management.

I'm wondering if somebody has built a better-performing BSP tool for Doom WADs which takes all of that newer stuff into account? Not sure how easy it is to find this out - a 'good' BSP means different things to different people so references to the method used are probably the only way to recognise it. Having the word 'optimised' in the tool docs/description is encouraging but won't really cover it.

Another trawl of the Doom sites might indicate if it's worth re-BSPing the original WADs with newer tools.

There's a good chance this 'upgrade' never happened because for most people who care, the BSP stuff isn't any kind of bottleneck and hasn't been for a long time. It's really only interesting if you're trying to make it work on a much much smaller system.

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Wed Jan 23, 2013 1:41 pm

Regarding doing something else than Doom...

What about Heretic?

It's a derivative of the Doom sources and Patrice Mandin has ported it to Atari, but that port is unusably slow on something like a Falcon (because it uses only the CPU?):
http://patrice.mandin.pagesperso-orange ... -jeux.html

Downside of Heretic is the lack of extra WADs and that there are extra features that will most likely need optimizing: flight, sectors that can push the player, an inventory system, ambient sounds, translucency, looking up and down and pushable objects.

Hexen adds even more features:
http://en.wikipedia.org/wiki/Hexen:_Bey ... evelopment

(Regarding WADs, even Doom hasn't that many of them compared to Doom II, which has a huge amount of them...)

Descent would also be interesting, but that doesn't use BSPs:
http://en.wikipedia.org/wiki/Descent_(video_game)#Development

Btw. Has anybody tried a source port of "Rise of the Triad"? That should be lighter on the CPU than Doom as it's based on an enhanced Wolfenstein3D engine:
http://en.wikipedia.org/wiki/Rise_of_the_Triad#Engine

(On the other hand optimizing Wolfenstein3D has no point as Ray already did a version that works fast enough on Atari ST/E and looks very good.)

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Wed Jan 23, 2013 1:56 pm

Eero Tamminen wrote:Regarding doing something else than Doom...

What about Heretic?
...


Although I haven't tested it in a while, the BadMood engine should (?) be able to display scenes from any Doom derivative game (Doom2, Heretic, Hexen etc) - but not Descent (which was an early portal-based method IIRC). It may not display them correctly however, as 'bugfixes' were mainly applied against Doom and Doom 2 so far. I think the 'H' games made more use of transparency and that's implemented poorly in BM - (did it in a hurry towards the end and will need redone again but it can be fixed).

As for porting those games - if the source is available then tacking something like BM onto them is probably a similar process for each. However it will probably be easier to try with Doom which is kind of the 'reference' version for all of them.

I remember playing ROTT long ago it was pretty good fun :)

It's maybe not all that interesting right now but if you build a Doom level for BM where the floor and ceiling height is constant everywhere and there are very few texture changes going on in a scene, it will run much faster - especially the newer builds I'll be releasing next. Doing this removes many of the costs associated with Doom vs Wolf (not all for sure, but it does get faster).

Eero Tamminen wrote:(On the other hand optimizing Wolfenstein3D has no point as Ray already did a version that works fast enough on Atari ST/E and looks very good.)


Yes getting Wolf to go on an ST is quite a feat and a good project! :) I think Doom is something like 'that project' for the Falcon...

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Wed Jan 23, 2013 11:02 pm

Regarding instruction cache miss profiling in Hatari...

Hatari's WinUAE code can in some cases make several calls to the fill-cache function per prefetch function call. I checked whether the code part marked as cache miss actually gets called several times during (at least some of) the instructions, and that indeed is the case: there can be 0-3 "cache misses" per executed instruction.

So... If a prefetch for a given PC address causes several "icache misses" and those are summed together in the profiling data output for the corresponding PC addresses, is that the information you were interested in, or did you want something else?

(There's also something weird in the cycle counts, they look fine for old UAE core, but for WinUAE core they are all zero. I suspect WinUAE core might be calling debugger in the wrong part of the instruction emulation as the debugger integration to WinUAE core is just preliminary.)

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Wed Jan 23, 2013 11:22 pm

Eero Tamminen wrote:So... If a prefetch for a given PC address causes several "icache misses" and those are summed together in the profiling data output for the corresponding PC addresses, is that the information you were interested in, or did you want something else?


I'm assuming the extra "cache misses" are extra instructions that get read into cache at the same time...?

If my assumption is right, I could also mark that in the profiler as a single cache miss for the instruction at the PC address. Which one would you prefer?

dml
Fuji Shaped Bastard
Posts: 3472
Joined: Sat Jun 30, 2012 9:33 am

Re: Falcon Doom

Postby dml » Thu Jan 24, 2013 9:40 am

Eero Tamminen wrote:I'm assuming the extra "cache misses" are extra instructions that get read into cache at the same time...?

If my assumption is right, I could also mark that in the profiler as a single cache miss for the instruction at the PC address. Which one would you prefer?


I suspect that the *average* miss rate per instruction is the most useful bit of information. If only the peak miss rate is recorded the analysis will be misleading for a programmer.


    Under idealized circumstances a normal cache miss simply means the cache was searched for a longword address and the search failed (tag not valid) so the tag is typically allocated and the longword placed in the cache (assuming the cache is not frozen - in which case it will leave the cache entry as it was).

    In 'burst' mode the cache is tested the same way but a whole 'line' of longwords is tested and filled in circular fashion - that is 4 contiguous longwords, wrapping on the end of a 16-byte 'line'. However the Falcon architecture does not support the necessary burst signals on the bus so burst mode has no effect and the burst bit in CACR can be ignored.

    BTW this is something that might need attention in WinUAE - is it possible that WinUAE is assuming burst mode all the time? In this case I would expect 0-4 cache misses (peak) but 0-1 cache miss (average) per instruction since burst mode is a kind of extended prefetch purely for cache filling, with a limit of 4. I think the burst bits are usually off on the Falcon but if enabled WinUAE might emulate incorrectly. It's also worth noting that a 'miss' on the 68030 implies a longword fetch - i.e. 2 individual 16bit memory fetches on the Falcon. Normally it would be just one fetch - another thing WinUAE might not know about!


    Assuming burst mode is off, there should be a maximum of 1 miss per longword per instruction. So if an instruction requires 1 word, that's 1 miss max. If a lengthy instruction needs 2 longwords that could be 2 misses, but if the same instruction is not longword-aligned that would be 3 misses. This is more likely the reason. In your experiments, is it the case that most instructions have 0-1 misses, but very occasional larger instructions have 2 or 3? I think it would be worth checking the size and alignment of the 3-miss cases.
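
The size/alignment rule above can be checked with a few lines of Python. This is only a sketch of the counting argument (burst mode off, instruction fetch only); `longwords_touched` is a hypothetical helper name and the addresses are arbitrary examples:

```python
def longwords_touched(addr, size_bytes):
    """Number of distinct 32-bit longwords covered by an instruction of
    size_bytes starting at byte address addr (68k opcodes are word-aligned)."""
    first = addr // 4
    last = (addr + size_bytes - 1) // 4
    return last - first + 1

print(longwords_touched(0x1000, 2))  # one word, longword-aligned
print(longwords_touched(0x1000, 8))  # two longwords, longword-aligned
print(longwords_touched(0x1002, 8))  # same size, not longword-aligned
```

The three calls give 1, 2 and 3 longwords respectively, which is the upper bound on instruction-cache misses for a single execution under these assumptions.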

Anyway back to the question - for maximum usefulness I think *average* misses per instruction (0.0-3.0?) is the best overall summary (i.e. total misses for instruction / total incidence of instruction over entire session).

Total summed misses per instruction over the entire session is probably also quite useful as it would let you see whether a recognized hotspot was caused by cache misses or just by incidence at a glance. Whether it's better just to show the large 'sum' itself or to show it as a relative cost (total misses per instruction / total misses from whole program over whole session) I don't know - whatever makes it easier to see a cache-miss hotspot against the background of other code.
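
As a minimal sketch of those two views, assuming the profiler can hand over an execution count and a miss sum per address (the `miss_summary` name, the dictionary layout and all the counts below are made up for illustration):

```python
def miss_summary(profile):
    """profile maps address -> (executions, misses).
    Returns address -> (average misses per execution, share of all misses)."""
    total_misses = sum(misses for _, misses in profile.values())
    summary = {}
    for addr, (count, misses) in profile.items():
        average = misses / count if count else 0.0
        share = misses / total_misses if total_misses else 0.0
        summary[addr] = (average, share)
    return summary

# Made-up counts for illustration only.
profile = {0x1000: (1000, 10), 0x1004: (1000, 990), 0x1008: (2, 6)}

for addr, (avg, share) in miss_summary(profile).items():
    print(f"${addr:06x}  avg {avg:.2f} misses/exec  {share:.1%} of all misses")
```

In this made-up data the instruction at $001008 has the worst average (3 misses per execution) but contributes almost nothing to the total, while $001004 misses about once per execution and accounts for nearly all misses. That is why the two numbers are more useful together than either one alone.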

I can imagine other possible views but for the sake of simplicity these two seem to provide the most useful information.

I think the peak/worst case misses per instruction is of limited use (except maybe to find large mis-aligned instructions) since all code must fill the cache at some point, so by the 2nd visit the information would be unlikely to change, and estimating cache activity from that would be impossible.

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Thu Jan 24, 2013 10:25 am

In case you want to try the cache miss profiling, a patch for a test version is here:
http://listengine.tuxfamily.org/lists.t ... 00025.html

As to Hatari build issues on OSX, please subscribe to the "hatari-devel" mailing list. None of the Hatari core developers has OSX; patches for OSX support are only occasionally contributed by Hatari devel list members (mainly Jerome Vernet). See here for subscription instructions: http://hatari.tuxfamily.org/contact.html

(Windows is slightly better supported, Nicolas has a cross-compiling setup to build Windows binaries.)

If possible, I would suggest building & running Hatari on Linux as that's the only fully supported platform.

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Thu Jan 24, 2013 11:58 pm

dml wrote:
Eero Tamminen wrote:I'm assuming the extra "cache misses" are extra instructions that get read into cache at the same time...?

If my assumption is right, I could also mark that in the profiler as a single cache miss for the instruction at the PC address. Which one would you prefer?


I suspect that the *average* miss rate per instruction is the most useful bit of information. If only the peak miss rate is recorded the analysis will be misleading for a programmer.


I wasn't clear about what I was asking: it was about what should happen on *each* executed instruction. Should I record just that there was a miss, or how many misses executing the current instruction at this point (once) incurred, i.e. should I increase the sum of misses for the given address by 0-1 or by 0-3?

Also, does the DSP have any kind of cache / cache misses?

(At least Hatari DSP emulation doesn't support such a thing, but if it did, adding it to the profiler would be trivial :-))


dml wrote:BTW this is something that might need attention in WinUAE - is it possible that WinUAE is assuming burst mode all the time? In this case I would expect 0-4 cache misses (peak) but 0-1 cache miss (average) per instruction since burst mode is a kind of extended prefetch purely for cache filling, with a limit of 4. I think the burst bits are usually off on the Falcon but if enabled WinUAE might emulate incorrectly. It's also worth noting that a 'miss' on the 68030 implies a longword fetch - i.e. 2 individual 16bit memory fetches on the Falcon. Normally it would be just one fetch - another thing WinUAE might not know about!


What about TT, does that also have burst mode normally off?

The latter point affects just timings, right? Anyway, that's more for Laurent than me... :-)


dml wrote:Assuming burst mode is off, there should be a maximum of 1 miss per longword per instruction. So if an instruction requires 1 word, that's 1 miss max. If a lengthy instruction needs 2 longwords that could be 2 misses, but if the same instruction is not longword-aligned that would be 3 misses. This is more likely the reason. In your experiments, is it the case that most instructions have 0-1 misses, but very occasional larger instructions have 2 or 3? I think it would be worth checking the size and alignment of the 3-miss cases.


After running this 4kB demo for a while:
http://pouet.net/prod.php?which=1015

I get the following statistics:
- 0: 29282934
- 1: 43281
- 2: 3323
- 3: 6510

The first number is the number of misses and the second one is the number of executed instructions having that number of misses.

Does that look reasonable?
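
For a quick sanity check, the histogram can be reduced to totals. This is just arithmetic on the four numbers above; the variable names are arbitrary:

```python
# Histogram from the post: misses per executed instruction -> number of instructions.
histogram = {0: 29282934, 1: 43281, 2: 3323, 3: 6510}

executed = sum(histogram.values())
total_misses = sum(n * count for n, count in histogram.items())
with_miss = executed - histogram[0]

print(executed)      # 29336048
print(total_misses)  # 69457
print(f"{total_misses / executed:.4%} misses per executed instruction")
print(f"{with_miss / executed:.4%} of executed instructions missed at least once")
```

That is about 0.24 misses per 100 executed instructions. The total of 69457 also agrees with the profile listing, where 8233 misses at one address are reported as 11.85% of all misses.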

Profile results look like this (also with the patch in the earlier mail):


> profile misses 8
addr:           misses:
0x01d1b0        11.85%  8233
0x01d1c0        7.70%   5346
0xe03cac        3.94%   2737
0xe1065c        3.46%   2406
0xe1c288        3.46%   2406
0xe03ca0        2.31%   1604
0xe03ca2        2.31%   1604
0xe356f0        2.31%   1604
8 CPU addresses listed.
> d 0x01d1aa
$01d1aa : 3c3c 0031                            move.w    #$31,d6                    0.00% (223, 0, 0)
$01d1ae : 2448                                 movea.l   a0,a2                      0.04% (11120, 0, 0)
$01d1b0 : 2649                                 movea.l   a1,a3                      0.04% (11120, 0, 8233)
$01d1b2 : 41e8 0258                            lea       $258(a0),a0                0.04% (11120, 0, 4)
$01d1b6 : 43e9 fda8                            lea       $fda8(a1),a1               0.04% (11120, 0, 0)
$01d1ba : 3e3c 0063                            move.w    #$63,d7                    0.04% (11120, 0, 0)
> profile addresses
[...]
$e0f6b6 :             add.w     d4,d4                      0.00% (1, 0, 1)
$e0f6b8 :             movea.l   a1,a2                      0.00% (1, 0, 3)
$e0f6ba :             adda.w    d4,a2                      0.00% (1, 0, 0)
$e0f6bc :             move.w    (a1)+,(a0)+                0.00% (16, 0, 2)
$e0f6be :             move.w    (a2)+,(a0)+                0.00% (16, 0, 0)
$e0f6c0 :             move.w    (a1)+,(a0)+                0.00% (16, 0, 1)
$e0f6c2 :             move.w    (a2)+,(a0)                 0.00% (16, 0, 0)
$e0f6c4 :             adda.w    d3,a5                      0.00% (16, 0, 1)
$e0f6c6 :             movea.l   a5,a0                      0.00% (16, 0, 0)
$e0f6c8 :             dbra      d2,$e0f6bc                 0.00% (16, 0, 1)
$e0f6cc :             rts                                  0.00% (1, 0, 0)
[...]
$e1054a :             rts                                  0.00% (802, 0, 0)
[...]
$e1065a :             subq.l    #4,sp                      0.00% (802, 0, 683)
$e1065c :             movem.l   d0-d7/a0-a6,-(sp)          0.00% (802, 0, 2406)
$e10660 :             movea.l   $e4d17e,a1                 0.00% (802, 0, 802)
$e10666 :             movea.l   $ffbe(a1),a0               0.00% (802, 0, 0)
$e1066a :             move.l    $ffc2(a1),$3c(sp)          0.00% (802, 0, 0)
$e10670 :             jsr       (a0)                       0.00% (802, 0, 802)
$e10672 :             movem.l   (sp)+,d0-d7/a0-a6          0.00% (802, 0, 1230)
$e10676 :             rts                                  0.00% (802, 0, 0)
[...]
$e1b67e :             link      a6,#$fffe                  0.00% (1, 0, 0)
$e1b682 :             move.l    #$3000,d0                  0.00% (1, 0, 0)
$e1b688 :             unlk      a6                         0.00% (1, 0, 1)
$e1b68a :             rts


The numbers in parentheses are the instruction count, the cycle count (buggy in the WinUAE core, so always 0), and the sum of misses for that address.
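For anyone who wants to post-process this output outside the debugger, here is a minimal sketch that pulls the three parenthesized values out of one disassembly line. It assumes the line layout shown in the listing above; the function name `parse_profile_line` and the regular expression are illustrative only and are not part of Hatari.

```python
import re

# Assumed layout (taken from the listing above):
#   $01d1b0 : 2649   movea.l a1,a3   0.04% (11120, 0, 8233)
# The trailing triple is (instruction count, cycle count, sum of misses).
PROFILE_RE = re.compile(
    r"^\$(?P<addr>[0-9a-fA-F]+)\s*:.*"
    r"\((?P<count>\d+),\s*(?P<cycles>\d+),\s*(?P<misses>\d+)\)\s*$"
)


def parse_profile_line(line):
    """Return (address, count, cycles, misses), or None if no triple is found."""
    match = PROFILE_RE.match(line.strip())
    if match is None:
        return None
    return (
        int(match.group("addr"), 16),
        int(match.group("count")),
        int(match.group("cycles")),
        int(match.group("misses")),
    )


if __name__ == "__main__":
    sample = "$01d1b0 : 2649    movea.l   a1,a3    0.04% (11120, 0, 8233)"
    print(parse_profile_line(sample))
```

Lines without the trailing triple (such as the last `rts` in the listing) simply return `None`, so they can be skipped when reading a whole dump.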


dml wrote:Anyway back to the question - for maximum usefulness I think *average* misses per instruction (0.0-3.0?) is the best overall summary (i.e. total misses for instruction / total incidence of instruction over entire session).

Total summed misses per instruction over the entire session is probably also quite useful as it would let you see whether a recognized hotspot was caused by cache misses or just by incidence at a glance. Whether it's better just to show the large 'sum' itself or to show it as a relative cost (total misses per instruction / total misses from whole program over whole session) I don't know - whatever makes it easier to see a cache-miss hotspot against the background of other code.


The code in the patch linked above shows both the percentage of the total and the sum of the 0-3 hit values per executed instruction.
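To make the two summaries discussed above concrete, here is a small sketch of how they could be computed from (address, instruction count, summed misses) values. The three sample entries are copied from the listing earlier in this post; the function names are hypothetical, and the share is computed only against the listed entries because the session-wide total is not shown here.

```python
def average_misses(count, misses):
    """Average misses per execution: summed misses / instruction count."""
    return misses / count if count else 0.0


def miss_summary(samples):
    """Return (addr, average misses, share in %) rows, largest share first.

    The share is relative to the misses of the given samples only,
    not to the whole session.
    """
    total = sum(misses for _, _, misses in samples)
    rows = []
    for addr, count, misses in samples:
        share = 100.0 * misses / total if total else 0.0
        rows.append((addr, average_misses(count, misses), share))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# (address, instruction count, summed misses) from the listing above
samples = [
    (0x01D1B0, 11120, 8233),
    (0xE1065C, 802, 2406),
    (0xE10660, 802, 802),
]

if __name__ == "__main__":
    for addr, avg, share in miss_summary(samples):
        print(f"${addr:06x}  avg {avg:.2f} misses/instr  {share:5.2f}% of listed misses")
```

With these numbers the `movem.l` at $e1065c averages exactly 3 misses per execution, while the much hotter $01d1b0 averages under 1, which is the kind of distinction the average is meant to expose next to the raw sum.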

Eero Tamminen
Atari God
Posts: 1559
Joined: Sun Jul 31, 2011 1:11 pm

Re: Falcon Doom

Postby Eero Tamminen » Fri Jan 25, 2013 12:03 am

dml wrote:Although I haven't tested it in a while, the BadMood engine should (?) be able to display scenes from any Doom derivative game (Doom2, Heretic, Hexen etc) - but not Descent (which was an early portal-based method IIRC). It may not display them correctly however, as 'bugfixes' were mainly applied against Doom and Doom 2 so far. I think the 'H' games made more use of transparency and that's implemented poorly in BM - (did it in a hurry towards the end and will need redone again but it can be fixed).

As for porting those games - if the source is available then tacking something like BM onto them is probably a similar process for each. However it will probably be easier to try with Doom which is kind of the 'reference' version for all of them.


Probably the most interesting extra thing in 'H' variants is network play, but I don't see how one would easily add that to Falcon versions as Falcon doesn't have networking except with MiNT and extra HW (or with serial or parallel when using slip/plip)...

I also just noticed that the original Doom is supposed to use MIDI music, whereas later versions use sampled music. As the Atari has MIDI ports and Hatari can emulate MIDI fine (e.g. when connected to a software synthesizer [1]), eventually having that working in the Atari versions too would be great. :-)

[1] http://hg.tuxfamily.org/mercurialroot/h ... -linux.txt

Hippy Dave
Atari Super Hero
Posts: 515
Joined: Sat Jan 10, 2009 5:40 am

Re: Falcon Doom

Postby Hippy Dave » Fri Jan 25, 2013 3:05 am

Eero Tamminen wrote:
Also, does the DSP have any kind of cache / cache misses?

(At least Hatari DSP emulation doesn't support such thing, but if it did, adding it to profiler would be trivial :-))


The dsp56000 implements a three-stage (prefetch, decode, execute) pipeline, allowing most instructions to execute at a rate of one instruction every instruction cycle. There is no cache. Instruction cycle calculation is simple, especially at run-time.

Omikronman
Atari Super Hero
Posts: 525
Joined: Wed Dec 01, 2004 12:13 am
Location: Germany

Re: Falcon Doom

Postby Omikronman » Fri Jan 25, 2013 4:06 am

BadMood engine should (?) be able to display scenes from any Doom derivative game (Doom2, Heretic, Hexen etc)


Yes, it is. I ran a Doom file and also a Heretic file with Bad Mood some time ago!

LaurentS
Captain Atari
Posts: 256
Joined: Mon Jan 05, 2009 5:41 pm

Re: Falcon Doom

Postby LaurentS » Fri Jan 25, 2013 8:37 am

Hi,

> The dsp56000 implements a three-stage (prefetch, decode, execute) pipeline, allowing most instructions to execute at a rate of one instruction every instruction cycle. There is no cache. Instruction cycle calculation is simple, especially at run-time.

I confirm. Cycle counting for the DSP is really easy compared to the 68030.
I'm pretty sure the DSP cycles are correct in Hatari. (I've verified them many times, but a second pair of eyes would be good.)

For the 68030 part, it's much more complex.
First, I'm not sure that the new CPU core is really well implemented for a Falcon (I mean, correctly taking into account the 16-bit memory access and the memory alignment for cycle counting).
Then, the instruction cycle timing table is incomplete: there are no MMU timings, no FPU timings, and some timings are not accurate (MUL or DIV, for example).
Then, there are the Videl timings (the Videl can take the bus for a certain amount of time).
And finally, the DMAs (disks, crossbar, ...).
In the CPU, there's also the cached/non-cached cycle counting.

All of this makes for a huge number of cases to take into account.
Of course, it's feasible. :)

But I think Hatari should first merge the cycle-exact CPU and the MMU CPU so that there is only one cycle-counting path (currently, if you use the 68030 + MMU CPU, every instruction counts as 2 cycles).
The cycle-exact CPU gives better cycle counts, but it is still not accurate.

You can also use the old CPU core (hatari.exe), which uses Atari ST timings for the instructions.
Sometimes programs run better with this one than with the new core (e.g. Beats of Rage ;) or MOAI96 by Mikro).

I'm not a great specialist of the 68030 (I think some people here are better than me on this).
Any help on Hatari would always be welcome and would increase the emulation accuracy.

I think the instruction cycle counting itself is fairly correct.
But if I run a benchmark, there are big cycle differences between the real hardware and Hatari, which means that many cycles are not counted correctly.
It is important to fix this problem first, before any fine-tuning.

Best regards
Laurent

