Lance 12.5 / 25 / 50 KHz routine for STE (V13)


ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE

Postby ljbk » Wed Apr 17, 2013 7:04 am

Eero Tamminen wrote:
I guess for that one needs a short worst case test MOD and to add some performance measurements to the code. Then one might use Hatari for automating finding out where the time goes in the worst cases...


The performance measurement is already there. Please check the file "read_me.txt".
The worst-case MOD plays a B-3 with a +7 finetune on all 4 voices at the same time: divider 108 on every voice.
This means reading 657 bytes per voice and per VBL with the Amiga PAL clock, and 663 bytes with the Amiga NTSC clock.
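
As a rough check of these figures: the bytes read per voice and per VBL follow from the note's sample rate, which is the Paula clock divided by the period divider. The sketch below assumes the usual Paula clock constants and a nominal 50 Hz VBL, neither of which is stated in the post, and the function name is only illustrative.

```python
# Rough check: bytes fetched per voice and per VBL for a given period divider.
# Assumptions (not from the post): Paula clock constants and a nominal 50 Hz VBL.
PAULA_CLOCK_PAL = 3546895    # Hz
PAULA_CLOCK_NTSC = 3579545   # Hz
VBL_HZ = 50

def bytes_per_vbl(divider, paula_clock=PAULA_CLOCK_PAL, vbl_hz=VBL_HZ):
    """Sample bytes one voice consumes during one VBL."""
    sample_rate = paula_clock / divider   # bytes per second for this note
    return sample_rate / vbl_hz

worst_pal = round(bytes_per_vbl(108))
worst_ntsc = round(bytes_per_vbl(108, PAULA_CLOCK_NTSC))
print(worst_pal, worst_ntsc)
```

A smaller divider means a higher note and more bytes to fetch, which is why the highest note with the highest finetune is the worst case.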

Actual performance for Amiga PAL CPU clock with 8 trash buffers and volume control active (worst case):
41.70% @ 50 KHz stereo
34.36% @ 25 KHz stereo
22.75% @ 12.5 KHz stereo


Paulo.

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE

Postby ljbk » Wed Apr 17, 2013 8:27 am

As a curiosity, here is a test with the best possible note combination: 4 times a C-1 with a -8 finetune or, if you prefer, a 907 divider on all 4 voices. That means reading about 78 bytes per voice and per VBL with an Amiga PAL CPU clock.
Actual performance for Amiga PAL CPU clock with 8 trash buffers and volume control active (best case):
20.94% @ 50 KHz stereo
15.93% @ 25 KHz stereo
13.45% @ 12.5 KHz stereo

As you can see there is a very variable response depending on the notes played.

If we take the average mod to be a weighted mix of 3 parts worst case and 1 part best case, then we get:
36.51% @ 50 KHz stereo
29.75% @ 25 KHz stereo
20.43% @ 12.5 KHz stereo

If we take the average mod to be a weighted mix of 3 parts worst case and 2 parts best case, then we get:
33.40% @ 50 KHz stereo
26.99% @ 25 KHz stereo
19.03% @ 12.5 KHz stereo
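
The two estimates above are plain weighted means of the worst-case and best-case figures. A small sketch to reproduce them, using only the percentages quoted in this thread (the function name is made up for the example):

```python
# Weighted mix of the worst-case and best-case CPU loads (percent) per output rate.
WORST = {"50 KHz": 41.70, "25 KHz": 34.36, "12.5 KHz": 22.75}
BEST = {"50 KHz": 20.94, "25 KHz": 15.93, "12.5 KHz": 13.45}

def mixed_load(n_worst, n_best):
    """Mean load when n_worst parts worst case are mixed with n_best parts best case."""
    total = n_worst + n_best
    return {rate: (n_worst * WORST[rate] + n_best * BEST[rate]) / total
            for rate in WORST}

for weights in ((3, 1), (3, 2)):
    loads = mixed_load(*weights)
    print(weights, {rate: f"{value:.2f}%" for rate, value in loads.items()})
```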

For most of the mods, the response should be close to the above values.
Here are the results for the well known "Stardust Memories":
35.64% @ 50 KHz stereo
29.20% @ 25 KHz stereo
22.54% @ 12.5 KHz stereo

The mod I included with version 6, spatism.mod from the Amiga demo shown at Revision 2013, is one of the worst cases, requiring about 40% CPU @ 50 KHz.


Paulo.

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE

Postby ljbk » Wed Apr 17, 2013 10:39 am

To complete these tests, here is some more info on the "trash" play influence.

Here are the results with no "trash" play for the worst case 4 voices mod:
47.86% @ 50 KHz stereo
40.52% @ 25 KHz stereo
28.91% @ 12.5 KHz stereo

Here I repeat the results with 8 "trash" buffers:
41.70% @ 50 KHz stereo
34.36% @ 25 KHz stereo
22.75% @ 12.5 KHz stereo

And now the same but with 32 "trash" buffers:
40.91% @ 50 KHz stereo
33.57% @ 25 KHz stereo
21.94% @ 12.5 KHz stereo

The "trash" play averages out the time the data "feeder" needs to fetch new notes or effects.
So its influence depends on the rate at which new data is read. For a mod tempo of 1 (very fast tempo) there will be no change, while for a mod tempo of 31 the effect will be very significant. As the default tempo is 6 and mods with a tempo above 8 are rare, the best compromise to save memory is to have 8 trash buffers.
The number of notes read also matters. In this case ("trash" play), a real worst-case test should have a note read at every pattern position on every voice, with the effect that takes the maximum CPU time.

But I think that this is enough to get an idea of the CPU time saved by "trash" play.
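
To picture why the tempo matters, here is a toy model, not the real player. It assumes that every VBL costs a fixed mixing time, that the feeder adds an extra cost once every `tempo` VBLs, and that with N buffers the peak is the worst average over N-1 consecutive VBLs. The cost values are invented for the illustration.

```python
# Toy model of "trash" play smoothing. All cost figures are invented.
def vbl_costs(tempo, n_vbls=256, mix=30.0, feeder=10.0):
    """Per-VBL cost: constant mixing plus a feeder spike every `tempo` VBLs."""
    return [mix + (feeder if i % tempo == 0 else 0.0) for i in range(n_vbls)]

def peak_load(costs, n_buffers):
    """Worst average over n_buffers - 1 consecutive VBLs (0 buffers = no trash play)."""
    window = max(1, n_buffers - 1)
    averages = [sum(costs[i:i + window]) / window
                for i in range(len(costs) - window + 1)]
    return max(averages)

for tempo in (1, 6, 31):
    costs = vbl_costs(tempo)
    print(tempo, peak_load(costs, 0), round(peak_load(costs, 8), 2))
```

At tempo 1 every VBL carries the feeder cost, so averaging changes nothing; the slower the tempo, the fewer spikes fall inside one averaging window and the lower the smoothed peak.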

Paulo.

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE

Postby ljbk » Wed Apr 17, 2013 12:04 pm

To complete the previous post, here are the results with the worst mod that now has a note B-3 at every position for every voice.

Here are the results with no "trash" (no change as expected):
47.86% @ 50 KHz stereo
40.52% @ 25 KHz stereo
28.91% @ 12.5 KHz stereo

Here I repeat the results with 8 "trash" buffers:
42.84% @ 50 KHz stereo
35.49% @ 25 KHz stereo
23.75% @ 12.5 KHz stereo

And now the same but with 32 "trash" buffers:
41.96% @ 50 KHz stereo
34.62% @ 25 KHz stereo
23.00% @ 12.5 KHz stereo


Paulo.

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE

Postby ljbk » Wed Apr 17, 2013 3:59 pm

Well, version 8 is available.


This update was not planned.

It solves one of the problems Lance mentions in the file "lance.txt" included in the Octalyser 0.96r package:
"
o My technique has problem with some certain samples and
make them sound strange. This is not many samples and
I believe it is just those samples with very short period.
Most samples sound very good.
"
The problem occurs with samples where the repeat start value is set to 0.
When the feeder used by Lance (but not the one in Octalyser) finds this together with a loop size > 1, it sets the repeat start to 1 and decrements the loop size by 1 (all values are in words). This is because his feeder uses the repeat start value as the key to decide whether the sample loops: repeat start = 0 means no loop, anything else means loop. If you have a loop size of 8, like one sample in the example mod annamull.mod, this messes up the sound completely.
The solution is not to do that, and to use the loop size as the key value that decides whether the sample loops.
If loopsize <= 1 => no loop otherwise loop.
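
A minimal sketch of the two decision rules, as I read the description above. The function names are made up, and the real code is 68000 assembly working on word values inside the feeder, so take this only as an illustration of the logic.

```python
# Sketch of the loop decision before and after the fix (values in words).
def lance_loop_rule(repeat_start, loop_size):
    """Old rule: repeat_start doubles as the 'sample loops' flag."""
    if repeat_start == 0 and loop_size > 1:
        # Forces the flag on, but shifts the loop start and shortens the loop.
        repeat_start, loop_size = 1, loop_size - 1
    return repeat_start != 0, repeat_start, loop_size

def fixed_loop_rule(repeat_start, loop_size):
    """New rule: loop_size is the flag, and the loop values are left untouched."""
    return loop_size > 1, repeat_start, loop_size

print(lance_loop_rule(0, 8))   # loop altered by the old rule
print(fixed_loop_rule(0, 8))   # loop kept as stored in the mod
```

With a loop of only 8 words, losing one word and starting one word late changes the pitch and shape of the looped part, which fits the strange sound described.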

I found this misbehaviour while trying to add a new feature. As that feature needs a big revamp of the original code, I decided to publish this fix as a new version first, so that there is a solid base with the correction included before I attempt those big changes.


Paulo.

PS:
Of course a nice way to check the change is to use that annamull.mod with one of the previous versions and compare with Winamp or Milky.

RA_pdx
Captain Atari
Posts: 215
Joined: Sun Feb 02, 2003 12:01 pm
Location: Nuernberg/GERMANY

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V8)

Postby RA_pdx » Fri Apr 19, 2013 7:39 am

Great that you are still improving the Lance replayer! :cheers:
>> > raZen/Paradox < <<

Atari 1040STE, TOS 2.06, 4MB, MC68010, IDE 8GB SSD, Gigafile

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9)

Postby ljbk » Fri Apr 19, 2013 8:54 am

RA_pdx wrote:Great that you are still improving the Lance replayer! :cheers:


Thanks.

But, it is now time to start to beat Lance ... :)

V9 is now available and it gains around 1.5% CPU for all cases.
From this version on, I will no longer try to preserve the Lance strategy, names, data structures ...
If you don't trust the changes then keep using version 8 or the original Lance code.
More peak CPU reduction should now only be possible by changing (or rewriting) the messy Protracker handler code (that was copy/pasted from the Amiga by Lance). That code can take almost 10% CPU time. As that code is very dangerous to change, due to its messy structure, I separated it from the main part so that it can be updated independently. If a new version introduces an error, we can fall back to the first version included in this V9 package.
More average CPU time can still be gained, but this only matters for demos or programs that take more than 1 VBL per frame, like 3D, and in most cases it will need more memory.

For the worsemod.mod case the gains are a bit smaller, as follows:

Volume control + no trash:
V8:
47.86% @ 50 KHz stereo
40.52% @ 25 KHz stereo
28.91% @ 12.5 KHz stereo
V9:
46.52% @ 50 KHz stereo
39.17% @ 25 KHz stereo
27.58% @ 12.5 KHz stereo

Volume control + 8 trash buffers:
V8:
42.84% @ 50 KHz stereo
35.49% @ 25 KHz stereo
23.75% @ 12.5 KHz stereo
V9:
41.67% @ 50 KHz stereo
34.33% @ 25 KHz stereo
22.60% @ 12.5 KHz stereo


Enjoy,
Paulo.

mc6809e
Captain Atari
Posts: 159
Joined: Sun Jan 29, 2012 10:22 pm

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9)

Postby mc6809e » Fri Apr 19, 2013 8:46 pm

Getting a 50KHz output rate using just 40% CPU time is impressive.

I remember writing a 4-voice polyphonic synth for a 6809 machine years ago and shaving away every cycle I could just to get an 11KHz output rate. I bet you're really enjoying yourself :).

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby ljbk » Mon Apr 22, 2013 10:19 am

mc6809e wrote:Getting a 50KHz output rate using just 40% CPU time is impressive.

I remember writing a 4-voice polyphonic synth for a 6809 machine years ago and shaving away every cycle I could just to get an 11KHz output rate. I bet you're really enjoying yourself :).


Yes, if I were not enjoying it I would stop at once :) .

You had one thing right. Hacking Lance is now right at the 40% limit for all mods @ 50 KHz with the new version 9A.
This version is almost identical to version 9 except for the Protracker handler.
That piece of code was improved in a way that could also be done on Amiga:
- organize CPU registers and give each of them one general task;
- take out all mulu/muls/divu/divs;
- take out all possible sp save/restore of registers;
- improve the finetune note index search;
- use a datastruct for the main mod variables;
- keep in the main branches of the code flow only the possible effect cases.
Finally, the Amiga-specific parts were commented out: Filter and FunkIt (because of the way loops are played on the ST that code would not work as it is, and anyway even Milky Tracker does not support it).
The zip also contains version 9 just in case there was a mistake in the Protracker handler update.
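
The post does not say how the finetune note index search was improved. Purely as an illustration of the kind of change that item could mean, the sketch below compares a linear scan with a binary search over the standard Protracker period table for finetune 0. This is an assumption made for the example, not the method used in version 9A.

```python
from bisect import bisect_left

# Standard Protracker periods for finetune 0, C-1 down to B-3 (descending order).
PERIODS_FT0 = [856, 808, 762, 720, 678, 640, 604, 570, 538, 508, 480, 453,
               428, 404, 381, 360, 339, 320, 302, 285, 269, 254, 240, 226,
               214, 202, 190, 180, 170, 160, 151, 143, 135, 127, 120, 113]
LAST = len(PERIODS_FT0) - 1

def note_index_linear(period):
    """Walk the table until an entry not larger than the period is found."""
    for index, table_period in enumerate(PERIODS_FT0):
        if period >= table_period:
            return index
    return LAST

# bisect needs ascending order, so search the negated table instead.
_NEGATED = [-p for p in PERIODS_FT0]

def note_index_bisect(period):
    """Same result with at most about 6 comparisons instead of up to 36."""
    return min(bisect_left(_NEGATED, -period), LAST)

print(note_index_linear(428), note_index_bisect(428))
```

On a 68000 a direct lookup table indexed by period would be another option, trading memory for speed.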

Some improvements to this Protracker handler are still possible but the gains would be very limited.
The mixer can gain CPU time by changing the number of frames it divides the VBL into, but memory requirements and output quality would change.
So I guess in terms of peak CPU usage this is almost it.
Average CPU requirements, important for 3D or multi-VBL demos, can still be improved but more memory will be required.

So here are the new benchmarks for the worsemod.mod case:

Volume control + no trash:
V8:
47.86% @ 50 KHz stereo
40.52% @ 25 KHz stereo
28.91% @ 12.5 KHz stereo
V9:
46.52% @ 50 KHz stereo
39.17% @ 25 KHz stereo
27.58% @ 12.5 KHz stereo
V9A:
41.56% @ 50 KHz stereo
34.23% @ 25 KHz stereo
22.60% @ 12.5 KHz stereo

Volume control + 8 trash buffers:
V8:
42.84% @ 50 KHz stereo
35.49% @ 25 KHz stereo
23.75% @ 12.5 KHz stereo
V9:
41.67% @ 50 KHz stereo
34.33% @ 25 KHz stereo
22.60% @ 12.5 KHz stereo
V9A:
40.03% @ 50 KHz stereo
32.71% @ 25 KHz stereo
21.06% @ 12.5 KHz stereo


Enjoy,
Paulo.

PS:

The above benchmarks, and all those published before, are derived from the measured free CPU time.
When I say that 40% is used, what was actually measured is that 60% of the CPU time was left.
Those 40% include the time taken by the measurement itself and by the TOS MFP interrupts.
If you remove the TOS interrupts by uncommenting 3 instructions in the instmfp subroutine, you will gain more than 1.5% CPU time:

Volume control + 8 trash buffers + no TOS MFP interrupts:
V9A:
38.49% @ 50 KHz stereo
31.14% @ 25 KHz stereo
19.52% @ 12.5 KHz stereo

Zorro 2
Administrator
Posts: 2194
Joined: Tue May 21, 2002 12:44 pm
Location: Saint Cloud (France)

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby Zorro 2 » Tue Apr 23, 2013 3:44 pm

Hi mister Paulo,

I have been following your posts for a while and decided to take some time to test your investigations.

The Lance routine is not the best replayer, but it is very interesting for playing Amiga modules, especially on an Atari STE at 50 KHz :mrgreen:

I suggest testing some Amiga modules with v8, v9 and your other main routines, to check the quality and catch issues in the last release. Example: the module "DISCONN.MOD" lost some loop replay in your last version. Maybe some fixes need to be re-evaluated...

I like your hacking job :wink:
Member of NoExtra Team

Eero Tamminen
Atari God
Posts: 1695
Joined: Sun Jul 31, 2011 1:11 pm

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby Eero Tamminen » Tue Apr 23, 2013 4:52 pm

ljbk wrote:Hacking Lance is now right at the 40% limit for all mods @ 50 KHz with the new version 9A.


Here's a Hatari profile of playing ILLUSION.MOD for a bit less than 1min:

Code:

> profile cycles 26
addr:           cycles:
0x018daa        45.95%  178241000       dbeq      d2,$18da8
0x018da8        30.66%  118957236       move.b    (a0),d0
0xe0370e         0.08%  319176  movem.l   (sp)+,d0-d7/a0-a6
0xe0717a         0.08%  319176  movem.l   (sp)+,d0-d7/a0-a6
0xe036ca         0.08%  309504  movem.l   d0-d7/a0-a6,-(sp)
0xe0716e         0.08%  309504  movem.l   d0-d7/a0-a6,-(sp)
0xe036bc         0.07%  270844  addq.l    #1,$4ba
0xe03718         0.05%  195048  rte       
0xe036c2         0.05%  193460  rol       $10f0
0xe03712         0.05%  193460  move.b    #$df,$fffffa11.w
0x01bcc0         0.04%  160748  move.w    $10(a6),$e(a5)
0x01bcc6         0.03%  128632  rts       
0x01aebe         0.03%  106480  movem.l   (sp)+,d0-d1/a0-a1
0xe036c8         0.03%  106404  bpl.s     $e03712
0x01c0be         0.03%  96992   move.w    2(a6),d0
0x01ae8c         0.02%  96856   movem.l   d0-d1/a0-a1,-(sp)
0x01c0c6         0.02%  96768   beq       $1bcc0
0x01b58a         0.02%  87176   trap      #0
0xe0379e         0.02%  87048   movem.l   (sp)+,d0-d1/a0
0xe0371a         0.02%  77376   movem.l   d0-d1/a0,-(sp)
0xe0717e         0.02%  67704   move.l    $28c6,-(sp)
0xe0faf0         0.02%  67704   move.l    $31d4,-(sp)
0xe15954         0.02%  67704   add.l     d0,$672a
0xe227ee         0.02%  67704   addq.l    #1,$7188
0xe22836         0.02%  67704   move.l    $717c,-(sp)
0x018d94         0.02%  65996   cmpi.b    #$b9,$fffffc02.w
26 CPU addresses listed.


Interestingly, except for the player "idle loop", all the instructions taking most of the CPU time are in TOS (v2)...

Could you do a build with GST debug symbols i.e. using:

Code:

OPT D+,X+


So that I can provide also symbol level profile and callgraph?


PS. It seems that Lance player doesn't work properly with EmuTOS, it plays things at wrong speed. Is this an issue in the player or in EmuTOS?

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby ljbk » Tue Apr 23, 2013 7:46 pm

Hi !

This player is not like Hextracker.
The Lance strategy is ONLY prepared to play in PAL @ 50 Hz. It generates 1000 DMA updates per VBL @ 50 KHz.
The exact value for an 8.021247 MHz CPU clock would be 1000.26, so that is acceptable: 50066 / (8021247 / 160256 VBL cycles) = 1000.26.
Of course an MFP interrupt with a 245.5 divider could be set up, but that is not part of his strategy. If this is included in any program then it is up to the user to decide whether 60 Hz or Mono are supported.
So if EmuTOS starts in NTSC / 60 Hz, it will sound speedy :)
The 25 KHz version generates 500 updates and the 12.5 KHz version generates 250 updates.
This does not work on the Falcon because of the LMC usage. But even without that, as the Falcon has a DMA rate of 49170 Hz, it would need only 982 updates at the same VBL rate @ 50 KHz.
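
The update counts above can be re-derived from the numbers given in this post alone. A small sketch (the function name is only illustrative):

```python
# DMA sample updates needed per VBL, using the figures quoted in the post.
CPU_CLOCK_HZ = 8021247     # ST CPU clock
CYCLES_PER_VBL = 160256    # CPU cycles in one PAL 50 Hz VBL

def updates_per_vbl(dma_rate_hz):
    """How many samples the sound DMA consumes during one VBL."""
    vbl_hz = CPU_CLOCK_HZ / CYCLES_PER_VBL   # slightly above 50 Hz
    return dma_rate_hz / vbl_hz

for name, rate in (("STE @ 50 KHz", 50066), ("Falcon", 49170)):
    print(f"{name}: {updates_per_vbl(rate):.2f} updates per VBL")
```

The player rounds this to a fixed 1000 (or 500, or 250) per VBL, which is why it only stays in step with a 50 Hz display.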

Illusion.mod takes much less than the 40% because it does not use volume control.
The source is included in the zip so anyone can rebuild and change it.

The idle loop is as follows:
wsync1:
        cmp.b   (a0),d0
        dbne    d2,wsync1               Wait until it changes

So i can not understand:
- why you have dbeq ?
- why don't you have cmp.b (a0),d0 ?
- why you have so much move.b (a0),d0 ?

Again, minimum free VBL time is printed at the end (3rd word) in HEX. Just get the number, multiply by 20 (idle loop cycles), and divide by 1602.56 and you get the free VBL time in %. 100 - that value will give you the used part.


Paulo.

Cyprian
Atari God
Posts: 1477
Joined: Fri Oct 04, 2002 11:23 am
Location: Warsaw, Poland

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9)

Postby Cyprian » Tue Apr 23, 2013 9:21 pm

ljbk wrote:More peak CPU reduction should now only be possible by changing (or rewriting) the messy Protracker handler code (that was copy/pasted from the Amiga by Lance). That code can take almost 10% CPU time.

10% is equal to about 31 raster scanlines.
There is a "super optimized" MOD playroutine for the Amiga, Player61A. Its Guide file mentions that: "The maximum rastertime taken is under 6 lines on a normal 68000 Amiga":

http://www.amiga-stuff.com/text/mod-pac ... r61A.guide
http://eab.abime.net/showthread.php?t=52014

It uses a slightly different data format than a MOD file, but maybe it would be worth replacing the Protracker handler with Player61A in the Lance routine?
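
For reference, the conversion between CPU percentage and scanlines on the ST side can be sketched as follows. It assumes the usual 512 CPU cycles per PAL scanline on an ST, which together with the 160256 cycles per VBL quoted earlier in the thread gives 313 lines. An Amiga raster line is not the same number of CPU cycles, so the "6 lines" figure is only roughly comparable.

```python
# CPU percentage <-> raster scanlines for a PAL 50 Hz ST frame.
CYCLES_PER_VBL = 160256       # figure quoted earlier in the thread
CYCLES_PER_SCANLINE = 512     # assumption: standard PAL ST scanline length

LINES_PER_VBL = CYCLES_PER_VBL // CYCLES_PER_SCANLINE

def percent_to_scanlines(percent):
    return LINES_PER_VBL * percent / 100

def scanlines_to_percent(lines):
    return 100 * lines / LINES_PER_VBL

print(LINES_PER_VBL, percent_to_scanlines(10), round(scanlines_to_percent(6), 2))
```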
Jaugar / TT030 / Mega STe / 800 XL / 1040 STe / Falcon030 / 65 XE / 520 STm / SM124 / SC1435
SDrive / PAK68/3 / CosmosEx / SatanDisk / UltraSatan / USB Floppy Drive Emulator / Eiffel / SIO2PC / Crazy Dots / PAM Net
Hatari / Aranym / Steem / Saint
http://260ste.appspot.com/

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9)

Postby ljbk » Tue Apr 23, 2013 9:41 pm

Cyprian wrote:
ljbk wrote:More peak CPU reduction should now only be possible by changing (or rewriting) the messy Protracker handler code (that was copy/pasted from the Amiga by Lance). That code can take almost 10% CPU time.

10% is equal to about 31 raster scanlines.
There is a "super optimized" MOD playroutine for the Amiga, Player61A. Its Guide file mentions that: "The maximum rastertime taken is under 6 lines on a normal 68000 Amiga":

http://www.amiga-stuff.com/text/mod-pac ... r61A.guide
http://eab.abime.net/showthread.php?t=52014

It uses a slightly different data format than a MOD file, but maybe it would be worth replacing the Protracker handler with Player61A in the Lance routine?


Hi !

Thanks for the info.
Yes, indeed there is that possibility. But since version 9A the PT handler takes a much smaller share, and with trash mode active I think it is not worth the effort of adding that.
Actually I do not need it at all, as I can use the "pre-play" feature I included in YM50K (2005). It costs at most 600 cycles per voice but, as the name says, it pre-plays the mod at start, buffering pointers, periods, volumes and their changes in a packed way, so it needs a large amount of memory. But as you could hear on the 5 YM50K disks, I managed to fit a bunch of big, long-playing mods on a 1MB STF.

Paulo.

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby ljbk » Tue Apr 23, 2013 9:42 pm

Zorro 2 wrote:Hi mister Paulo,

I have been following your posts for a while and decided to take some time to test your investigations.

The Lance routine is not the best replayer, but it is very interesting for playing Amiga modules, especially on an Atari STE at 50 KHz :mrgreen:

I suggest testing some Amiga modules with v8, v9 and your other main routines, to check the quality and catch issues in the last release. Example: the module "DISCONN.MOD" lost some loop replay in your last version. Maybe some fixes need to be re-evaluated...

I like your hacking job :wink:


Thanks for the info.
I will check what happened and give you some feedback.

Edit: Just found the problem.
Going to V9, I rewrote mt_init to allow the detection of unsupported tags, such as any number of voices other than 4, and to add support for the old 15-sample mods.
So I request the necessary memory, copy the mod data to the top of it and then copy it back to the right places.
The copy back of the pattern data is wrong. What I wonder is how it was possible that a lot of mods worked ...
Anyway, the correction is valid for both V9 and 9A and is (swap a2 and a1):

old:

l_mt_init16
        move.l  (a1)+,(a2)+
        dbf     d1,l_mt_init16

new:

l_mt_init16
        move.l  (a2)+,(a1)+
        dbf     d1,l_mt_init16

Corrected zip uploaded.
Sorry for that register swapping ... :lol:
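
For readers wondering what the tag detection in the rewritten mt_init involves, here is a rough sketch of such a check. The tag lists and the fallback to the 15-sample layout are assumptions made for the example; the actual mt_init may accept a different set.

```python
# Sketch of a MOD format check. Tag lists and fallback rule are assumptions,
# not taken from the actual mt_init source.
TAG_OFFSET = 1080   # 20 title + 31 * 30 sample headers + 2 + 128 order bytes
FOUR_VOICE_TAGS = {b"M.K.", b"M!K!", b"FLT4", b"4CHN"}
OTHER_VOICE_TAGS = {b"6CHN", b"8CHN", b"FLT8", b"OCTA", b"CD81"}

def classify_mod(data: bytes) -> str:
    """Guess the module layout from the 4-byte tag, if there is one."""
    tag = data[TAG_OFFSET:TAG_OFFSET + 4]
    if tag in FOUR_VOICE_TAGS:
        return "31 samples, 4 voices"
    if tag in OTHER_VOICE_TAGS:
        return "unsupported voice count"
    # Old 15-sample modules carry no tag; pattern data sits here instead.
    return "15 samples (no tag)"

print(classify_mod(bytes(TAG_OFFSET) + b"M.K."))
```

Because a 15-sample module has ordinary data where the tag would be, the fallback is only a heuristic.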

Paulo.

PS:
It is good to know that someone is testing ! :wink:

Eero Tamminen
Atari God
Posts: 1695
Joined: Sun Jul 31, 2011 1:11 pm

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby Eero Tamminen » Tue Apr 23, 2013 11:29 pm

ljbk wrote:The idle loop is as follows:
wsync1:
        cmp.b   (a0),d0
        dbne    d2,wsync1               Wait until it changes

So i can not understand:
- why you have dbeq ?
- why don't you have cmp.b (a0),d0 ?
- why you have so much move.b (a0),d0 ?


I assume you have optimizations enabled in your assembler and that move.b & dbeq use less cycles than cmp.b & dbne?

ljbk wrote:Again, minimum free VBL time is printed at the end (3rd word) in HEX. Just get the number, multiply by 20 (idle loop cycles), and divide by 1602.56 and you get the free VBL time in %. 100 - that value will give you the used part.


I think 100% - "idle loop percentage" (= 23.5%) is a good approximation for this. :-)


hackl09a.s didn't build (references to missing symbols), so I built hackl00e.s. Attached is a callgraph, including also interrupts (any switch to an address with a label is recorded as some kind of call by Hatari profiler), you can view it with "dotty" from GraphViz package, or with XDot (python) application.

Profile overview looks like this:

Code:

CPU profile information from 'lance.txt':
- Hatari v1.6.2+ (Apr 22 2013), OldUAE CPU core

Time spent in profile = 66.33007s.

Calls:
- max = 20188959, in l_main_trash_lp0 at 0x12a9c, on line 17
- 20407789 in total
Executed instructions:
- max = 20188960, in l_main_trash_lp0+2 at 0x12a9e, on line 19
- 56415891 in total
Used cycles:
- max = 242603536, in l_main_trash_lp0+2 at 0x12a9e, on line 19
- 532049912 in total

Calls:
  98.93%            20188959            l_main_trash_lp0

Executed instructions:
  71.64%                    40418174                      l_main_trash_lp0
  23.96%                    13517157                      mt_funkend
   0.90%                      509830                      l_mt_return03
   0.90%                      508495                      l_mt_return12
   0.46%                      258926                      ROM_TOS
   0.28%                      156040                      l_mt_st_mix03
   0.28%                      156040                      l_mt_st_mix12
   0.13%                       75578                      mt_ftuloop
   0.12%   0.17%   0.17%       66400     93649     93649  mt_updatefunk
   0.11%   0.18%   0.18%       63080    103491    103491  q_check_ftime

Used cycles:
  76.12%                   405014368                      l_main_trash_lp0
  18.32%                    97483896                      mt_funkend
   1.17%                     6214640                      ROM_TOS
   0.89%                     4730712                      l_mt_return03
   0.89%                     4714040                      l_mt_return12
   0.25%                     1342960                      l_mt_st_mix03
   0.25%                     1342848                      l_mt_st_mix12
   0.17%   0.28%   0.28%      904664   1506280   1506280  mt_updatefunk
   0.16%                      837668                      mt_ftuloop
   0.11%   0.21%   0.21%      598784   1091052   1091052  q_check_ftime
   0.11%   0.14%   0.32%      583756    748380   1688852  mt_checkefx


It would look a bit more interesting with code which idle loop doesn't take 75% of CPU. :-)

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby ljbk » Wed Apr 24, 2013 7:23 am

Eero Tamminen wrote:
ljbk wrote:The idle loop is as follows:
wsync1:
        cmp.b   (a0),d0
        dbne    d2,wsync1               Wait until it changes

So i can not understand:
- why you have dbeq ?
- why don't you have cmp.b (a0),d0 ?
- why you have so much move.b (a0),d0 ?


I assume you have optimizations enabled in your assembler and that move.b & dbeq use less cycles than cmp.b & dbne?


No: dbeq and dbne are different and lead to different behaviour as move.b and cmp.b do.

Eero Tamminen wrote:
ljbk wrote:Again, minimum free VBL time is printed at the end (3rd word) in HEX. Just get the number, multiply by 20 (idle loop cycles), and divide by 1602.56 and you get the free VBL time in %. 100 - that value will give you the used part.


I think 100% - "idle loop percentage" (= 23.5%) is a good approximation for this. :-)


As a matter of fact, the used CPU for Illusion.mod under Hackl59a.prg with TOS interrupts (no change to provided code) is 28% : 3rd word = 0x1689.
0x1689 = 5769
5769 x 20 = 115380 cycles (72% CPU)
100% - 72% = 28%
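
The same calculation as a small sketch, using the 20 cycles per idle loop and the 160256 cycles per VBL given in this thread (dividing by 1602.56 is the same as dividing by 160256 and multiplying by 100). The function name is only illustrative.

```python
# Convert the printed "minimum free VBL time" word into used CPU percent.
IDLE_LOOP_CYCLES = 20      # cycles per idle loop iteration
CYCLES_PER_VBL = 160256    # CPU cycles in one VBL

def used_cpu_percent(free_word):
    """free_word is the 3rd word printed at exit, i.e. idle loop iterations."""
    free_cycles = free_word * IDLE_LOOP_CYCLES
    free_percent = 100 * free_cycles / CYCLES_PER_VBL
    return 100 - free_percent

print(f"{used_cpu_percent(0x1689):.1f}% used")
```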


Eero Tamminen wrote:hackl09a.s didn't build (references to missing symbols), so I built hackl00e.s. Attached is a callgraph, including also interrupts (any switch to an address with a label is recorded as some kind of call by Hatari profiler), you can view it with "dotty" from GraphViz package, or with XDot (python) application.


Sorry, but I get no errors building it with Devpac 1.24; see the attachment.
You can also fetch Devpac 3.10 here:
http://dhs.nu/files.php?t=single&ID=5

Eero Tamminen wrote:It would look a bit more interesting with code which idle loop doesn't take 75% of CPU. :-)



You can either change the MOD to the previously provided worsemod.mod for example or set z_autovoldetect to $0000 (automatic "no volume control" detection not allowed).



Paulo.

Eero Tamminen
Atari God
Posts: 1695
Joined: Sun Jul 31, 2011 1:11 pm

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby Eero Tamminen » Wed Apr 24, 2013 9:23 am

ljbk wrote:No: dbeq and dbne are different and lead to different behaviour as move.b and cmp.b do.


They're from here:

Code:

$ grep -A1 l_main_trash_lp0 hackl00e.s
l_main_trash_lp0
        move.b  (a0),d0                         buffer is free ?
        dbeq    d2,l_main_trash_lp0             yes / no
        not     d2                              invert free time value


Which shows up in profile as:

Code:

$ grep -A3 l_main_trash_lp0 lance.txt
l_main_trash_lp0:
$012a9c :             move.b    (a0),d0         35.79% (20188959, 161941984, 0)
$012a9e :             dbeq      d2,$12a9c       35.79% (20188960, 242603536, 0)
$012aa2 :             not.w     d2               0.01% (3320, 13280, 0)


This isn't expected?


ljbk wrote:
Eero Tamminen wrote:hackl09a.s didn't build (references to missing symbols), so I built hackl00e.s. Attached is a callgraph, including also interrupts (any switch to an address with a label is recorded as some kind of call by Hatari profiler), you can view it with "dotty" from GraphViz package, or with XDot (python) application.


As you can see, "mt_samplestarts" isn't defined in "hackl09a.s", it's only referenced there:

Code:

$ grep mt_samplestarts  *.s
hackl00e_nopt_004.s:    lea     mt_samplestarts(pc),a4
hackl00e_nopt_004.s:    move.l  mt_samplestarts(pc),a0
hackl00e_nopt_004.s:    lea     mt_samplestarts(pc),a1
hackl00e.s:     lea     mt_samplestarts(pc),a4
hackl00e.s:     move.l  mt_samplestarts(pc),a0
hackl00e.s:     lea     mt_samplestarts(pc),a1
hackl00e.s:     lea     mt_samplestarts(pc),a1
hackl00e.s:mt_samplestarts
hackl09a.s:     lea     mt_samplestarts(pc),a4
hackl09a.s:     move.l  mt_samplestarts(pc),a0
hackl09a.s:     lea     mt_samplestarts(pc),a1
hackl09a.s:     lea     mt_samplestarts(pc),a1  sample pointers
pthdl000.s:     lea     mt_samplestarts(pc),a1
pthdl000.s:mt_samplestarts
pthdl004.s:     lea     mt_samplestarts(pc),a1  sample pointers
pt_src50_2013.s:        lea     mt_samplestarts(pc),a1
pt_src50_2013.s:        move.l  mt_samplestarts,a0
pt_src50_2013.s:        lea     mt_samplestarts(pc),a1
pt_src50_2013.s:mt_samplestarts

It's defined in "pt_src50_2013.s", "pthdl000.s" and "hackl00e.s"?

I have last used Devpac nearly 20 years ago, maybe I'm doing something wrong...

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby ljbk » Wed Apr 24, 2013 10:07 am

Eero Tamminen wrote:
ljbk wrote:No: dbeq and dbne are different and lead to different behaviour as move.b and cmp.b do.


They're from here:

Code:

$ grep -A1 l_main_trash_lp0 hackl00e.s
l_main_trash_lp0
        move.b  (a0),d0                         buffer is free ?
        dbeq    d2,l_main_trash_lp0             yes / no
        not     d2                              invert free time value


Which shows up in profile as:

Code:

$ grep -A3 l_main_trash_lp0 lance.txt
l_main_trash_lp0:
$012a9c :             move.b    (a0),d0         35.79% (20188959, 161941984, 0)
$012a9e :             dbeq      d2,$12a9c       35.79% (20188960, 242603536, 0)
$012aa2 :             not.w     d2               0.01% (3320, 13280, 0)


This isn't expected?



You are right here.
I was talking about the "no trash" idle loop and you were talking about the "trash play" idle loop.
Sorry about that.
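
How the "trash play" idle loop quoted above turns into a free-time count can be sketched like this. It assumes d2 is loaded with $FFFF before the loop, which is not shown in the excerpt, and it models the flag byte at (a0) as a list of successive reads.

```python
# Simulation of: move.b (a0),d0 / dbeq d2,loop / not.w d2  on a 16-bit d2.
def idle_poll_count(flag_reads, d2=0xFFFF):
    """Return the value left in d2 after not.w, i.e. the number of busy polls.

    Assumption: d2 starts at $FFFF (not visible in the quoted code).
    """
    for value in flag_reads:
        if value == 0:             # move.b set Z, so dbeq falls through at once
            break
        d2 = (d2 - 1) & 0xFFFF     # condition false: dbeq decrements the counter
        if d2 == 0xFFFF:           # counter reached -1: dbeq stops looping
            break
    return (~d2) & 0xFFFF          # not.w d2

print(idle_poll_count([1, 1, 1, 0]))
```

In the profile quoted above, the two loop instructions cost about 8 and 12 cycles per iteration (161941984 / 20188959 and 242603536 / 20188960), which matches the 20 cycles per idle loop used in the free-time formula.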

Eero Tamminen wrote:
ljbk wrote:
Eero Tamminen wrote:hackl09a.s didn't build (references to missing symbols), so I built hackl00e.s. Attached is a callgraph, including also interrupts (any switch to an address with a label is recorded as some kind of call by Hatari profiler), you can view it with "dotty" from GraphViz package, or with XDot (python) application.


As you can see, "mt_samplestarts" isn't defined in "hackl09a.s", it's only referenced there:

Code:

$ grep mt_samplestarts  *.s
hackl00e_nopt_004.s:    lea     mt_samplestarts(pc),a4
hackl00e_nopt_004.s:    move.l  mt_samplestarts(pc),a0
hackl00e_nopt_004.s:    lea     mt_samplestarts(pc),a1
hackl00e.s:     lea     mt_samplestarts(pc),a4
hackl00e.s:     move.l  mt_samplestarts(pc),a0
hackl00e.s:     lea     mt_samplestarts(pc),a1
hackl00e.s:     lea     mt_samplestarts(pc),a1
hackl00e.s:mt_samplestarts
hackl09a.s:     lea     mt_samplestarts(pc),a4
hackl09a.s:     move.l  mt_samplestarts(pc),a0
hackl09a.s:     lea     mt_samplestarts(pc),a1
hackl09a.s:     lea     mt_samplestarts(pc),a1  sample pointers
pthdl000.s:     lea     mt_samplestarts(pc),a1
pthdl000.s:mt_samplestarts
pthdl004.s:     lea     mt_samplestarts(pc),a1  sample pointers
pt_src50_2013.s:        lea     mt_samplestarts(pc),a1
pt_src50_2013.s:        move.l  mt_samplestarts,a0
pt_src50_2013.s:        lea     mt_samplestarts(pc),a1
pt_src50_2013.s:mt_samplestarts

It's defined in "pt_src50_2013.s", "pthdl000.s" and "hackl00e.s"?

I have last used Devpac nearly 20 years ago, maybe I'm doing something wrong...


This seems to be a case sensitive issue with your assembler.
The same should happen with mt_SongDataPtr versus mt_songdataptr.
mt_samplestarts is defined at the end of the file (copied from the Amiga source) as mt_SampleStarts:

Code: Select all

mt_SampleStarts
      dc.l   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
      dc.l   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


If you change the two upper case letters to lower case, then you should have no problem.
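For anyone who wants to find all such labels in one go, here is a rough sketch in Python (not part of the replay sources). It lists every name that appears with more than one spelling when case is ignored. The identifier pattern and the comment handling are assumptions, not Devpac's real grammar, so trailing comments without a ";" can give false positives.

```python
import re
from collections import defaultdict

# Identifier pattern for labels/symbols; an assumption, not Devpac's exact grammar.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def case_variants(source_text):
    """Return {lowercased name: sorted spellings} for names spelled more than one way."""
    spellings = defaultdict(set)
    for line in source_text.splitlines():
        # Skip full-line '*' comments and drop anything after ';'.
        if line.lstrip().startswith("*"):
            continue
        code = line.split(";", 1)[0]
        for name in IDENT.findall(code):
            spellings[name.lower()].add(name)
    return {key: sorted(names) for key, names in spellings.items() if len(names) > 1}

# Tiny sample modelled on the mismatch discussed above.
sample = """\
        lea     mt_samplestarts(pc),a1
mt_SampleStarts
        dc.l    0,0,0,0
"""

for lowered, names in case_variants(sample).items():
    print(lowered, "->", ", ".join(names))
```

Running it over the whole source file instead of the sample should point at the mixed-case labels to normalise.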


About CPU load, the values I referred to are the peak values and not the average values.
If "no trash play" is used, then the peak value will be the worst VBL case.
If "trash play" is used and N buffers are available, then the peak case will be the worst average of N-1 VBLs. That is the advantage of "trash play".
But another thing is the average load.
So for version 9A as it is and Illusion.mod we get:
- 28% peak CPU load (from 3rd word);
- 23.06% average CPU load (from 1st word);

As I understand it, you are measuring the average value with that Hatari feature.
If you are doing a 50fps demo the peak value is the important one.
If you are doing a 3D demo at 10fps then you can take the average value into account but the frame rate will vary with time with the worst case related to the peak value.
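To make the three figures concrete, here is a small Python sketch. It takes the statement "worst average of N-1 VBLs" literally as a sliding window over per-VBL loads, which is a simplification of the real buffering; the load values are made up for illustration.

```python
def no_trash_peak(loads):
    """Peak load without trash play: simply the worst single VBL."""
    return max(loads)

def trash_peak(loads, n_buffers):
    """Peak load with trash play, read as the worst average over N-1 consecutive VBLs."""
    window = n_buffers - 1
    if window < 1 or window > len(loads):
        raise ValueError("need 1 <= n_buffers - 1 <= number of VBLs")
    return max(
        sum(loads[i:i + window]) / window
        for i in range(len(loads) - window + 1)
    )

def average_load(loads):
    """Average load over the whole run."""
    return sum(loads) / len(loads)

# Hypothetical per-VBL CPU loads in percent, with one expensive VBL.
loads = [20, 40, 20, 20, 30, 20, 20, 20]

print("no trash peak:", no_trash_peak(loads))
print("trash peak (3 buffers):", trash_peak(loads, 3))
print("average:", average_load(loads))
```

With more buffers the window gets longer, so a single expensive VBL weighs less in the peak, while the average stays the same.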


Paulo.

Eero Tamminen
Atari God
Posts: 1695
Joined: Sun Jul 31, 2011 1:11 pm

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby Eero Tamminen » Wed Apr 24, 2013 3:57 pm

ljbk wrote:This seems to be a case sensitive issue with your assembler.


Ah, I see. By default labels are case-sensitive in Devpac.

Other labels with the same problem were: mt_songdataptr, mt_PBreakflag. All of these variables are accessed both in low and mixed case.

ljbk wrote:As i understand it, you are measuring the average value with that Hatari feature.
If you are doing a 50fps demo the peak value is the important one.
If you are doing a 3D demo at 10fps then you can take the average value into account but the frame rate will vary with time with the worst case related to the peak value.


It's easiest to get average values with Hatari profiler. One just needs to enable profiling ("profile on") and profile a bit longer and you see average values when you invoke the Hatari debugger next time.

By setting a breakpoint to some address/label which is called exactly once per frame, one can get frame specific profile data.

As to getting a profile for the worst frame with Hatari... that requires support from the profiled program. It needs to provide a number which e.g. gets increased each time the latest frame was worse than any previous one.

With that and some breakpoint chaining/scripting, Hatari should then be able to automatically save a profile for the (new) worst frame, whenever that happens. I haven't yet tried how well that works in practice, but I've thought of doing it for Doug's Doom code, after he has added the required counter to his code. :)
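The counter logic the program would need is tiny. Here it is sketched in Python just to show the idea; in the real program it would be a few 68000 instructions and one memory word for the debugger to watch. All names are hypothetical.

```python
class WorstFrameCounter:
    """Counts how many times a frame was worse than every frame before it."""

    def __init__(self):
        self.worst_cost = 0   # cost of the worst frame seen so far
        self.counter = 0      # the value a debugger breakpoint could watch

    def end_of_frame(self, cost):
        """Call once per frame; returns True when a new worst frame was seen."""
        if cost > self.worst_cost:
            self.worst_cost = cost
            self.counter += 1
            return True
        return False

# Hypothetical per-frame costs (e.g. cycles used by the replay routine).
frame_costs = [100, 90, 120, 120, 150, 80]

tracker = WorstFrameCounter()
new_worst_frames = [
    index for index, cost in enumerate(frame_costs) if tracker.end_of_frame(cost)
]
print("counter:", tracker.counter)
print("frames that set a new worst case:", new_worst_frames)
```

A breakpoint that triggers whenever the counter value changes would then fire only on frames that set a new worst case.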

Eero Tamminen
Atari God
Posts: 1695
Joined: Sun Jul 31, 2011 1:11 pm

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby Eero Tamminen » Wed Apr 24, 2013 5:22 pm

Here's new profile with worsemod and re-built hackl09a.s:

Code: Select all

CPU profile information from 'hackl09a-profile.txt':
- Hatari v1.6.2+ (Apr 19 2013), OldUAE CPU core

Time spent in profile = 75.20072s.

Calls:
- max = 18262598, in l_main_trash_lp0 at 0x12a9c, on line 17
- 18620119 in total
Executed instructions:
- max = 18262598, in l_main_trash_lp0 at 0x12a9c, on line 18
- 67464789 in total
Used cycles:
- max = 219509428, in l_main_trash_lp0+2 at 0x12a9e, on line 19
- 603203572 in total

Calls:
  98.05%    18256341  l_main_trash_lp0
   0.73%      136708  zVibratoTabPtrs

Executed instructions:
  54.21%    36570832  l_main_trash_lp0
  41.31%    27871368  zVibratoTabPtrs
   0.85%      575892  l_mt_return12
   0.85%      575892  l_mt_return03
   0.44%      293508  ROM_TOS
   0.26%      176908  l_mt_st_mix12
   0.26%      176908  l_mt_st_mix03
   0.16%      106658  mt_CheckEfx
   0.14%       92796  mt_plvskip
   0.12%       82808  l_mt_no_swap12
   0.12%       82808  l_mt_no_swap03
   0.11%       71516  q_check_ftime

Used cycles:
  60.76%   366520572  l_main_trash_lp0
  33.48%   201971280  zVibratoTabPtrs
   1.17%     7045028  ROM_TOS
   0.88%     5337552  l_mt_return03
   0.88%     5337280  l_mt_return12
   0.25%     1523008  l_mt_st_mix12
   0.25%     1522840  l_mt_st_mix03
   0.19%     1118060  mt_CheckEfx
   0.15%      894080  mt_plvskip
   0.12%      753808  l_mt_no_swap03
   0.12%      753752  l_mt_no_swap12
   0.11%      678360  q_check_ftime


Percentages & values here are summed from the costs in the profile disassembly and assigned to whatever label happens to precede them. They don't take into account anything the code calls through subroutine calls etc.[1]
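As a rough illustration of that attribution rule (a sketch, not the actual post-processing code), the following Python assigns each address's cycles to the nearest label at or below it. The address used for zVibratoTabPtrs is made up; the other addresses and cycle counts only loosely mirror this profile.

```python
from bisect import bisect_right
from collections import defaultdict

def attribute_costs(symbols, costs):
    """Sum per-address costs under the closest preceding symbol.

    symbols: {name: address}, costs: {address: cycles}.
    """
    ordered = sorted((addr, name) for name, addr in symbols.items())
    addresses = [addr for addr, _ in ordered]
    totals = defaultdict(int)
    for addr, cycles in costs.items():
        index = bisect_right(addresses, addr) - 1
        name = ordered[index][1] if index >= 0 else "(no symbol)"
        totals[name] += cycles
    return dict(totals)

symbols = {
    "mt_Return": 0x015E16,
    "zVibratoTabPtrs": 0x016000,  # hypothetical address of the last label
}
costs = {
    0x015E16: 50192,
    0x05AAFC: 1095344,  # far beyond the last label, e.g. generated code
    0x05AAFE: 1095064,
}

for name, cycles in attribute_costs(symbols, costs).items():
    print(f"{name}: {cycles}")
```

Everything located after the last label ends up under that label, however far away it is.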

The costs that get assigned to the "zVibratoTabPtrs" label (the last one in the binary) come from code way after "mt_Return", as can be seen here:

Code: Select all

mt_Return:
$015e16 :             rts                                  0.00% (3137, 50192, 0)
[...]
$05aafc :             move.b    d2,(sp)+                   0.20% (136708, 1095344, 0)
$05aafe :             move.b    (a0)+,d0                   0.20% (136708, 1095064, 0)


I'm not really sure what it is...?


[1] I tried getting subroutine call costs, but this thing confuses it:

Code: Select all

mt_mixer:
        lea     mt_save_SSP(pc),a0
        move.l  sp,(a0)                 Save Supervisor Stack Pointer
        pea     mt_emulate(pc)          mt_emulate for return address
        move    sr,d0
        andi    #$0FFF,d0               Status Register with User mode set
        move    d0,-(sp)                new Status Register for next code
        rte                             Load new SR and mt_emulate as PC
*                                       So from here we go to mt_emulate
*                                       in User mode to allow interrupts


Is this always called in an exception? During MOD playback it would seem to be just BSRed from "l_main_trash_lp0".

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby ljbk » Wed Apr 24, 2013 7:17 pm

Hi !

The answers to your questions are:

1- zVibratoTabPtrs => generated code is a huge part of this program and of any speedy thing on the Atari ST :) ! I have tons of it in Hextracker ! :D

2- mt_emulate (the mixer) has to be called in User mode, which is why it is entered via the rte. It ends with a trap #0 call to go back to Supervisor mode, and that routine ends with an rts so it returns to the calling proc.
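For readers less familiar with the 68000 status register, this small Python sketch shows what the andi #$0FFF mask in mt_mixer does to the SR value before it is pushed for the rte. The bit positions are the standard 68000 ones; the sample SR values are just examples.

```python
SR_TRACE = 0x8000        # bit 15: trace mode
SR_SUPERVISOR = 0x2000   # bit 13: supervisor state
SR_IPL_MASK = 0x0700     # bits 8-10: interrupt priority mask

def user_mode_sr(sr):
    """Apply the same mask as 'andi #$0FFF,d0': clear bits 12-15, keep the rest."""
    return sr & 0x0FFF

# Example: supervisor mode with interrupt priority level 3.
sr = 0x2300
new_sr = user_mode_sr(sr)

print(f"before: ${sr:04X}  supervisor={bool(sr & SR_SUPERVISOR)}")
print(f"after:  ${new_sr:04X}  supervisor={bool(new_sr & SR_SUPERVISOR)}")
print(f"interrupt mask kept: {(sr & SR_IPL_MASK) == (new_sr & SR_IPL_MASK)}")
```

The rte then loads this word as the new SR, so execution continues at mt_emulate with the supervisor bit cleared.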


Paulo.

Eero Tamminen
Atari God
Posts: 1695
Joined: Sun Jul 31, 2011 1:11 pm

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V9A)

Postby Eero Tamminen » Wed Apr 24, 2013 9:22 pm

ljbk wrote:1- zVibratoTabPtrs => generated code is a huge part of this program and of any speedy thing on the Atari ST :) ! I have tons of it in Hextracker !


Ok, in that case it's not possible to get good enough symbol coverage for it (except by manually adding symbols, but that's a lot of work and error prone), so post-processed profiles aren't that interesting.

ljbk wrote:2- mt_emulate (the mixer) has to be called in User mode and via the rte and ends with a trap #0 call to go back to Supervisor mode and that routine ends with an rts so it returns to the calling proc.


And 1) coupled with 2) + the lack of subroutine calls makes tracing of subroutine calls for callgraph generation not very useful.

So, as a summary, only the regular (low level) profiling with Hatari makes sense for this, and somebody who knows the (generated) code well, looking at the profile disassembly for the ~40% part...

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V10)

Postby ljbk » Thu Apr 25, 2013 4:36 pm

Hi !

Version 10 is now available.
It was tested with 15 mods.
Test results can be found in the file hacking_lance_10.xls .
V10 gains 2.3 percentage points of CPU in peaks and 3 percentage points on average compared to V8 for those 15 mods.
In relative terms, this means a reduction of 7% in peak CPU load and of 11% in average CPU load.
Part of the "average" gain needs a lot more memory and is optional (z_improve_avg $0000/$0001).

That file will also show the details about those 15 mods:
- Lance code (V8) takes 39.50% in peaks and 27.67% in average @ 50 KHz;
- Lance code (V8) with 8 trash buffers goes down to 35.22% in peaks and up to 27.76% in average @ 50 KHz;
- V10 with 8 trash buffers goes down to 32.67% in peaks and down to 24.68% in average @ 50 KHz;
- Going down to 25 KHz saves around 6.7% in peaks and 5.7% in average;
- Going down to 12.5 KHz saves 13.5% in peaks and 10% in average;
- From Lance 50 KHz with no trash to V10 25 KHz with 8 trash buffers, 13.5% are gained in peaks and 8.7% in average;
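The difference between the absolute gains and the relative reductions can be checked with a few lines of Python, using only the 8 trash buffer figures from the list above (V8 versus V10 @ 50 KHz). The helper name is hypothetical.

```python
def relative_reduction(before, after):
    """Reduction from 'before' to 'after', as a percentage of 'before'."""
    return (before - after) / before * 100.0

# CPU load in percent with 8 trash buffers @ 50 KHz, taken from the list above.
peak_v8, peak_v10 = 35.22, 32.67
avg_v8, avg_v10 = 27.76, 24.68

print("peak:    %.2f points, %.1f%% relative"
      % (peak_v8 - peak_v10, relative_reduction(peak_v8, peak_v10)))
print("average: %.2f points, %.1f%% relative"
      % (avg_v8 - avg_v10, relative_reduction(avg_v8, avg_v10)))
```

The relative values come out at roughly 7% for peaks and 11% for the average, in line with the reductions quoted above.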

Enjoy,
Paulo.

ljbk
Atari Super Hero
Posts: 514
Joined: Thu Feb 19, 2004 4:37 pm
Location: Estoril, Portugal

Re: Lance 12.5 / 25 / 50 KHz routine for STE (V11)

Postby ljbk » Wed May 01, 2013 8:06 pm

Hi !

Version 11 is available.
On the menu, we have:
- minor speed improvements and bug corrections;
- a realtime global volume control (value from $0000 to $FFFF), but LCM precision is much smaller :) ;
- an "octave 3 cheat" option that lets demo makers trade a bit of quality for some more CPU time;

I wish to thank Evil / DHS for the exchange of ideas we had on possible features to improve this program.
The "octave 3 cheat" was designed and coded by me but the concept was his idea.

If you don't care about BPM (fine tempo), are happy with the LCM volume control and only wish to handle 4 voices, then this is probably the last version (and the fastest) for you unless some bugs are found.
For the others, who knows, maybe some more features will be added later.

Enjoy,
Paulo.

