Smaller than smallest

So, I realized that g++ has a bunch of internal parameters that can be set with the --param flag. The version of g++ we use has 300 of them.

So I did what any normal person would do, and wrote a program that would figure out which parameters can make the program smaller.
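For anyone curious what such a search might look like, here is a minimal sketch in Python. It is not the program described above: it assumes a simple greedy strategy, and `measure` is a hypothetical callback that would build the firmware with the given parameters and return the binary size in bytes.

```python
from typing import Callable, Dict, List, Sequence


def flags_for(params: Dict[str, int], base: str = "-Os") -> List[str]:
    """Render a parameter assignment as g++ command-line flags."""
    return [base] + [f"--param={name}={value}" for name, value in sorted(params.items())]


def greedy_search(
    candidates: Dict[str, Sequence[int]],
    measure: Callable[[Dict[str, int]], int],
) -> Dict[str, int]:
    """Keep every single-parameter change that makes the measured size smaller.

    `candidates` maps a parameter name to the values worth trying.
    `measure` is assumed to build with the given parameters and return a size.
    """
    chosen: Dict[str, int] = {}
    best = measure(chosen)  # baseline: no --param overrides at all
    improved = True
    while improved:  # repeat until a full pass finds nothing better
        improved = False
        for name, values in candidates.items():
            for value in values:
                if chosen.get(name) == value:
                    continue  # already using this value
                trial = {**chosen, name: value}
                size = measure(trial)
                if size < best:  # strictly smaller, so the loop terminates
                    chosen, best, improved = trial, size, True
    return chosen
```

A greedy pass like this only finds a local optimum, since parameters can interact, which may be one reason a longer run keeps finding more savings.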

So far, my program has found options that will save about 4k compared to the “smallest size” option by applying these parameters:

-Os --param=case-values-threshold=0 --param=early-inlining-insns=0 --param=gcse-cost-distance-ratio=1 --param=gcse-unrestricted-cost=2 --param=ipa-cp-value-list-size=9 --param=ipa-sra-max-replacements=16 --param=iv-always-prune-cand-set-bound=20 --param=jump-table-max-growth-ratio-for-size=299 --param=large-function-insns=5398 --param=large-stack-frame=512 --param=large-stack-frame-growth=500 --param=lto-min-partition=1001 --param=max-combine-insns=4 --param=max-completely-peeled-insns=201 --param=max-crossjump-edges=201 --param=max-cse-insns=1001 --param=max-cse-path-length=20 --param=max-early-inliner-iterations=1 --param=max-hoist-depth=300 --param=max-inline-functions-called-once-insns=1000 --param=max-inline-functions-called-once-loop-depth=3 --param=max-inline-insns-single=1399 --param=max-jump-thread-duplication-stmts=1 --param=max-jump-thread-paths=6 --param=max-predicted-iterations=1010 --param=max-stores-to-merge=1300 --param=max-tail-merge-comparisons=100 --param=max-tail-merge-iterations=0 --param=min-crossjump-insns=1 --param=partial-inlining-entry-probability=35 --param=scev-max-expr-size=10 --param=sched-autopref-queue-depth=0 --param=sink-frequency-threshold=100 --param=sra-max-propagations=8 --param=sra-max-scalarization-size-Osize=200 --param=switch-conversion-max-branch-ratio=1 --param=tree-reassoc-width=2 --param=uninlined-function-insns=1 --param=uninlined-thunk-insns=20

Some of these parameters probably do nothing, and some of them may have a large impact on program speed, but at least we have some options. Now we just have to figure out what does what.

So, 4k isn’t bad, but I also had a go at reducing the size for “-O3”, which so far has produced these flags:

-O3 --param=case-values-threshold=0 --param=dse-max-alias-queries-per-store=257 --param=early-inlining-insns=5 --param=inline-heuristics-hint-percent=100 --param=inline-unit-growth=2 --param=ipa-cp-value-list-size=16 --param=max-completely-peeled-insns=20 --param=max-inline-insns-single=7 --param=max-jump-thread-duplication-stmts=1 --param=max-rtl-if-conversion-unpredictable-cost=80

These flags save a whopping 142k compared to plain -O3. (But may also make the program a lot slower of course.)

These parameters open up a lot of options for tweaking. However, figuring out which options are good ones and which ones hurt performance a lot is going to take some work.

If anybody wants to experiment with these options, you can go to your arduino15 directory, then go packages → proffieboard → hardware → stm32l4 → 4.6 and then open up boards.txt. In there you can find lines which have “.optimize” in them; these lines specify which flags to use. There are individual lines for each board, and also individual lines for each option in the optimization menu.
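To see which lines those are before editing anything, a small script can list them. This is a sketch that assumes boards.txt uses the usual `key=value` layout with `#` comments; the function names are made up for illustration.

```python
from pathlib import Path
from typing import List, Tuple


def optimize_lines(boards_txt_text: str) -> List[Tuple[str, str]]:
    """Return (key, value) pairs for every entry whose line contains '.optimize'."""
    entries = []
    for line in boards_txt_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a key=value entry.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if ".optimize" not in line:
            continue
        # Split on the first '=' only, because the flags themselves contain '='.
        key, value = line.split("=", 1)
        entries.append((key, value))
    return entries


def optimize_lines_in_file(path: str) -> List[Tuple[str, str]]:
    """Same as optimize_lines, but reads the given boards.txt from disk."""
    return optimize_lines(Path(path).read_text())
```

Call `optimize_lines_in_file` with the path to your own boards.txt and print the result to get one entry per board and menu option.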

If you do experiment with it, please report your results here. Also, I recommend using the “top” command to see if there is an impact on the speed of the code or not.

Running the program overnight came up with the following parameters:

-O3 --param=avg-loop-niter=1 --param=case-values-threshold=0 --param=dse-max-alias-queries-per-store=257 --param=early-inlining-insns=2 --param=hot-bb-frequency-fraction=1 --param=inline-heuristics-hint-percent=100 --param=inline-min-speedup=100 --param=inline-unit-growth=0 --param=ipa-cp-unit-growth=11 --param=ipa-cp-value-list-size=16 --param=ipa-max-param-expr-ops=20 --param=iv-always-prune-cand-set-bound=20 --param=large-function-growth=50 --param=large-stack-frame-growth=999 --param=lim-expensive=40 --param=max-completely-peel-times=160 --param=max-completely-peeled-insns=20 --param=max-inline-functions-called-once-insns=1999 --param=max-inline-insns-auto=14 --param=max-inline-insns-single=0 --param=max-jump-thread-duplication-stmts=1 --param=max-rtl-if-conversion-unpredictable-cost=80 --param=max-sched-extend-regions-iters=0 --param=max-sched-region-insns=99 --param=max-stores-to-merge=128 --param=max-tail-merge-comparisons=11 --param=max-unroll-times=80 --param=min-crossjump-insns=1 --param=partial-inlining-entry-probability=35 --param=sra-max-scalarization-size-Ospeed=10 --param=tree-reassoc-width=2 --param=uninlined-function-insns=1

which saves 173k. That makes it only ~12k larger than the “smallest size” binary.

Does it need to be the 4.6 plugin?
I tried compiling “Fastest” with your last post’s params with 3.6 and get

cc1plus: error: invalid --param name 'inline-heuristics-hint-percent'
cc1plus: error: invalid --param name 'ipa-cp-unit-growth'; did you mean 'ipcp-unit-growth'?
cc1plus: error: invalid --param name 'ipa-max-param-expr-ops'
cc1plus: error: invalid --param name 'max-inline-functions-called-once-insns'

Yes it does. I would not expect the parameters to be the same in other versions of g++.

report

Serial + Mass Storage + WebUSB:
177896 3.6.0 smallest
176776 4.6.0 smallest
247856 4.6.0 fastest before new optimization
182824 4.6.0 fastest after new optimization
65,032 saved
6,048 over smallest

Serial only:
171920 4.6.0 smallest
241704 4.6.0 fastest before new optimization
178032 4.6.0 fastest after new optimization
63,672 saved
6,112 over smallest.
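The "saved" and "over smallest" figures above are plain differences of the reported sizes. A short Python sketch that reproduces them from the 4.6.0 numbers in this post (the dictionary keys are made up for illustration):

```python
# Binary sizes in bytes, copied from the 4.6.0 rows of the report above.
sizes = {
    "Serial + Mass Storage + WebUSB": {
        "smallest": 176776,
        "fastest_before": 247856,
        "fastest_after": 182824,
    },
    "Serial only": {
        "smallest": 171920,
        "fastest_before": 241704,
        "fastest_after": 178032,
    },
}


def summarize(s):
    """Return (bytes saved, bytes over smallest, percent saved) for one config."""
    saved = s["fastest_before"] - s["fastest_after"]
    over_smallest = s["fastest_after"] - s["smallest"]
    percent_saved = 100.0 * saved / s["fastest_before"]
    return saved, over_smallest, percent_saved


for name, s in sizes.items():
    saved, over, pct = summarize(s)
    print(f"{name}: {saved} saved ({pct:.1f}%), {over} over smallest")
```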

Top before new optimization:
Audio DMA: 0.00%
Wav reading: 0.00%
Pixel DMA: 5.98%
LOOP: 7.44%
Motion: 4.80%
Global loops / second: 9407.43
High frequency loops / second: 11107.83
blade fps: 188.77
Acceleration measurements per second: 1607.44
Hybrid Font loop: 0.66%
WS2811_Blade loop: 9.01%
SoundQueue loop: 0.93%
ClockControl loop: 5.53%
Booster loop: 9.14%
SDCard loop: 5.39%
Amplifier loop: 2.85%
LSM6DS3H loop: 2.45%
I2CBus loop: 0.90%
Parser loop: 1.40%
aux loop: 2.70%
pow loop: 1.71%
SaberBCButtons loop: 13.08%
BatteryMonitor loop: 5.52%
Fusor loop: 18.71%
AudioDynamicMixer loop: 0.89%
MonitorHelper loop: 0.91%

Top after new optimization:
Audio DMA: 0.00%
Wav reading: 0.00%
Pixel DMA: 6.09%
LOOP: 10.78%
Motion: 0.00%
Global loops / second: 6787.88
High frequency loops / second: 8398.19
blade fps: 187.66
Acceleration measurements per second: 1521.39
Hybrid Font loop: 0.79%
WS2811_Blade loop: 14.56%
SoundQueue loop: 1.17%
ClockControl loop: 5.55%
Booster loop: 12.91%
SDCard loop: 7.92%
Amplifier loop: 5.30%
LSM6DS3H loop: 1.77%
I2CBus loop: 1.14%
Parser loop: 1.69%
aux loop: 2.72%
pow loop: 1.66%
SaberBCButtons loop: 13.18%
BatteryMonitor loop: 5.38%
Fusor loop: 5.62%
AudioDynamicMixer loop: 0.92%
MonitorHelper loop: 0.85%

Went back to smallest to take top readings:

Audio DMA: 0.00%
Wav reading: 0.00%
Pixel DMA: 6.34%
LOOP: 6.59%
Motion: 6.16%
Global loops / second: 4492.49
High frequency loops / second: 6167.20
blade fps: 186.03
Acceleration measurements per second: 1606.42
Hybrid Font loop: 0.56%
WS2811_Blade loop: 31.86%
SoundQueue loop: 0.64%
ClockControl loop: 3.34%
Booster loop: 7.07%
SDCard loop: 4.57%
Amplifier loop: 2.60%
LSM6DS3H loop: 1.66%
I2CBus loop: 0.62%
Parser loop: 1.07%
aux loop: 1.58%
pow loop: 0.93%
SaberBCButtons loop: 9.11%
BatteryMonitor loop: 3.97%
Fusor loop: 10.18%
AudioDynamicMixer loop: 0.64%
MonitorHelper loop: 0.52%

I think I might need to code something new for measuring speed.
“top” is helpful, but it’s just too much work to figure out what’s going on.
Maybe if I make top report the number of cycles per call in addition to the percentage…
I have to be careful though, because some calls (like the WS2811_Blade loop) do different amounts of work on each iteration, so reporting the number of cycles per call for those is just not that helpful…
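Short of changing top itself, two pasted readings can at least be compared mechanically. The sketch below is not the cycles-per-call idea; it only assumes the `Name: value%` layout seen in the readings posted in this thread, and the function names are made up for illustration.

```python
import re

# Matches lines such as "Fusor loop: 18.71%" or "blade fps: 188.77".
LINE = re.compile(r"^(.+?):\s*([0-9]+(?:\.[0-9]+)?)(%?)\s*$")


def parse_top(text):
    """Split pasted top output into ({name: percent}, {name: plain number})."""
    percents, rates = {}, {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # ignore anything that is not a "name: number" line
        name, value, is_percent = m.group(1), float(m.group(2)), m.group(3)
        (percents if is_percent else rates)[name] = value
    return percents, rates


def biggest_changes(before, after):
    """Rows of (name, before, after, delta), largest absolute change first."""
    rows = [(n, before[n], after[n], after[n] - before[n]) for n in before if n in after]
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)
```

Feeding the "before" and "after" percentages into `biggest_changes` puts the entries that moved the most at the top, which is where to start looking.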

From what I can tell from your test NoSloppy, it seems like -O3 + new params is somewhere between -O3 and -Os in terms of speed, while being a lot closer to -Os than -O3 in terms of size, so it seems like a potentially good tradeoff.

I can’t offer much as far as anything other than “smallest code” goes. That’s all I use.
So I’m not sure what benefits “faster” code even offers.
I mostly assume that the SD card is the bottleneck and negates anything running “faster” in the code anyway. “Smallest code” runs without any sort of noticeable slowness when the SD is good.
I can also assume that to the average user, the most essential thing is memory.

I’m also thinking that I should rename the optimization menu options.
The current options really don’t make a lot of sense.

Maybe something like this:

  • extra small (-Os + custom params)
  • small (-Os)
  • unoptimized (-O0)
  • medium (-O1)
  • fast (-O2)
  • extra fast (-O3)
  • tailored (-O3 + custom params)

Although we may end up with multiple “tailored” options…

Most of the time, “smallest” is fine.
However, I have found that with zigzag blades, or other props which use a lot of pixels, “faster/fastest” makes a difference. The same goes for color displays.

The SD card bottleneck and CPU bottleneck are actually the same thing, because currently ProffieOS is not actually able to do anything while waiting for the SD card. A slow SD card means less CPU time for other things. This also means that V3 boards are all-around faster, because even though the CPU runs at the same speed as a V2 board, SD card access takes less time, which frees up CPU time.

Ultimately though, the “extra small” setting might be the one that matters the most. Saving a few KB can add space for another preset. 🙂
