Pepera the demo uses preshifting. The blitter is still substantially faster there. Even using preshifting you can't beat the blitter for a true masking sprite. As for code gen, well, are we sure there is no cheating going on there?

How much cheating would I be allowed to do?
Could I
1) limit the window size and clear all bobs in one pass?
2) copy and not mask the first sprite?
3) inline every function call?
4) precalculate all the coordinate transforms?
5) align all the sprite data on a 64 KB boundary to save setting the high word of the source address?
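To illustrate point 5, here is a minimal sketch in Python of the address arithmetic behind it. It is not ST code: the helper names (`high_word`, `align_up`, `fits_in_one_page`) and the example addresses are made up for illustration. The idea is that if all the sprite data sits inside one 64 KB-aligned block, the high word of every source address is the same, so only the low word needs to be written per sprite.

```python
PAGE_SIZE = 0x10000  # 64 KB

def high_word(addr):
    """Upper 16 bits of a 32-bit address."""
    return (addr >> 16) & 0xFFFF

def low_word(addr):
    """Lower 16 bits of a 32-bit address."""
    return addr & 0xFFFF

def align_up(addr, page=PAGE_SIZE):
    """Round addr up to the next multiple of page (page must be a power of two)."""
    return (addr + page - 1) & ~(page - 1)

def fits_in_one_page(base, size):
    """True if every byte of [base, base + size) shares the same high word."""
    return high_word(base) == high_word(base + size - 1)

if __name__ == "__main__":
    base = align_up(0x12345)           # hypothetical buffer start
    print(hex(base))                   # start of the aligned block
    print(fits_in_one_page(base, PAGE_SIZE))
```

The catch is that the whole sprite set, masks included, has to fit in 64 KB for the high word to stay constant.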
I've got an older version with 2 and 3 done. It does 31 bobs a frame but the code is a mess. I've calculated the max throughput for that size of sprite is 35 bobs a frame excluding bus arbitration. Here are the figures for a 3 plane 32*32 sprite excluding bus arbitration.
Source data is 3 words wide, 3 planes deep and 32 lines high.
That's 3 columns wide
r/m/w | r/m/w | m/r
3 + 3 + 2 = 8 bus cycles per plane per line for mask.
3 + 3 + 2 = 8 bus cycles per plane per line for sprite data
16 in total for masking
* 3 planes = 16 * 3 cycles per line = 48 bus cycles
* 32 high = 1,536 bus cycles per bob
The clear is 3 word writes per plane per line:
w / w / w = 3 bus cycles per plane per line
3 columns * 3 planes * 32 lines = 288 bus cycles per bob
So excluding set up time we have a total cost of
288 + 1,536 = 1,824 bus cycles per bob
There are 40 thousand bus cycles per frame:
160,000 CPU cycles / 4 = 40,000 bus cycles
40,000 / 1,824 = 21.9 bobs a frame
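For anyone who wants to rerun the figures with a different sprite size, here is a rough Python sketch of the same arithmetic. The per-line cycle counts and the 160,000 CPU cycles per frame are taken straight from the figures above; the constant and function names are just mine, and set up time and bus arbitration are still ignored.

```python
COLUMNS = 3                       # words written per plane per line
PLANES = 3
LINES = 32

MASK_CYCLES_PER_LINE = 3 + 3 + 2  # r/m/w | r/m/w | m/r, from the post
DATA_CYCLES_PER_LINE = 3 + 3 + 2  # same pattern for the sprite data

CPU_CYCLES_PER_FRAME = 160_000    # figure used in the post
CPU_CYCLES_PER_BUS_CYCLE = 4

def draw_cycles():
    """Bus cycles to mask and draw one bob."""
    per_plane_line = MASK_CYCLES_PER_LINE + DATA_CYCLES_PER_LINE
    return per_plane_line * PLANES * LINES

def clear_cycles():
    """Bus cycles to clear one bob (one write per word)."""
    return COLUMNS * PLANES * LINES

def bus_cycles_per_frame():
    return CPU_CYCLES_PER_FRAME // CPU_CYCLES_PER_BUS_CYCLE

def bobs_per_frame():
    """Upper bound, excluding set up time and bus arbitration."""
    return bus_cycles_per_frame() / (draw_cycles() + clear_cycles())

if __name__ == "__main__":
    print(draw_cycles(), clear_cycles(), bus_cycles_per_frame())
    print(f"{bobs_per_frame():.1f} bobs a frame")
```

Changing `PLANES`, `LINES` or the per-line counts gives the ceiling for other sprite formats under the same assumptions.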