Avoid reevaluating arguments in _round, _constrain, _sign. #529

Open
jaguilar wants to merge 1 commit into simplefoc:dev from jaguilar:opt

@jaguilar

Description

_round, _constrain and _sign currently reevaluate their arguments, which wastes CPU cycles. In torque/foc_current mode, this fix saves about 300ns/foc loop (about 2% of total FOC CPU usage). In other modes, especially estimated_current, it saves more.

Type of change

  • Optimization

How Has This Been Tested?

I tested this on my STM32G474 FOC bench setup.

Test Configuration/Setup:

  • Hardware: STM32G474, custom 3BLDC driver
  • IDE: VSCode & arm-none-eabi-gcc
  • MCU package version (stm32duino/arduino-esp32/..)

@runger1101001
Member

Nice.
You're saying the ternary operator isn't optimised as well as the simpler structure in the template function?

Would you have time to compare __builtin_round() with the self-implemented one? That might be another possible option?

@jaguilar
Author

I could try it.

The point of the template function is to make sure that each expression is only evaluated once. So for example if you have _constrain(complicated_expr, low, high), you don't want it to expand to (complicated_expr < low) ? low : (complicated_expr > high) ? high : complicated_expr, since in the non-clamped case the expression gets evaluated three times.
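To make the evaluation count concrete, here is a minimal sketch. The names `CONSTRAIN_MACRO`, `constrain_fn` and `expensive_expr` are hypothetical stand-ins, not the library's identifiers, and the macro only assumes the ternary shape described above. A counter records how often the argument expression runs.

```cpp
#include <cassert>

// Hypothetical macro with the ternary shape discussed above: `amt` appears three times.
#define CONSTRAIN_MACRO(amt, low, high) \
  ((amt) < (low) ? (low) : ((amt) > (high) ? (high) : (amt)))

// Function-template version: the argument is evaluated once, when it is bound to `amt`.
template <typename T>
constexpr T constrain_fn(T amt, T low, T high) {
  return (amt < low) ? low : (amt > high) ? high : amt;
}

static int g_calls = 0;  // counts how often the "expensive" expression runs

// Stand-in for a complicated expression; returns a value inside [0, 1].
inline float expensive_expr() {
  ++g_calls;
  return 0.5f;
}

// Number of times expensive_expr() runs when passed through the macro.
inline int calls_with_macro() {
  g_calls = 0;
  float r = CONSTRAIN_MACRO(expensive_expr(), 0.0f, 1.0f);
  (void)r;
  return g_calls;
}

// Number of times expensive_expr() runs when passed through the function.
inline int calls_with_function() {
  g_calls = 0;
  float r = constrain_fn(expensive_expr(), 0.0f, 1.0f);
  (void)r;
  return g_calls;
}
```

In the non-clamped case `calls_with_macro()` reports three evaluations and `calls_with_function()` reports one. For a pure, cheap expression the optimizer may remove the repeats on its own; the difference matters most for expressions it cannot prove side-effect free.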

@jaguilar
Author

jaguilar commented Apr 29, 2026

Okay, replacing with builtins gives another 600ns speedup on a baseline of 9500ns for loopFOC (for position control mode in foc_current torque control). These implementations should be C++11 compatible. Let me know if it is acceptable and I will amend the commit.

Note that I have no way of testing that this works across different compilers so if you have a test suite to hand please feel welcome to run it.

#include <type_traits>

// Note: __builtin_signbit returns nonzero when the sign bit is set and 0 otherwise, not -1/0/+1.
template<typename T>
constexpr inline int _sign(T val) {
  return __builtin_signbit(val);
}

// `typename` is required before the dependent ::type for C++11 through C++17.
#ifndef _round
template<typename T>
constexpr inline typename std::enable_if<std::is_same<T, double>::value, long>::type _round(T x) {
  return __builtin_round(x);
}
template<typename T>
constexpr inline typename std::enable_if<std::is_same<T, float>::value, long>::type _round(T x) {
  return __builtin_roundf(x);
}
#endif

// Use enable_if to select the fastest implementation according to the amt type.
// Using __builtin_fXf is measurably faster than using the ternary approach.
template<typename T, typename L, typename H>
constexpr inline typename std::enable_if<std::is_integral<T>::value, T>::type _constrain(T amt, L low, H high) {
  return (amt < low) ? low : (amt > high) ? high : amt;
}
template<typename T, typename L, typename H>
constexpr inline typename std::enable_if<std::is_same<T, float>::value, T>::type _constrain(T amt, L low, H high) {
  return __builtin_fmaxf(low, __builtin_fminf(high, amt));
}
template<typename T, typename L, typename H>
constexpr inline typename std::enable_if<std::is_same<T, double>::value, T>::type _constrain(T amt, L low, H high) {
  return __builtin_fmax(low, __builtin_fmin(high, amt));
}
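One point that may be worth checking before adopting the builtin for `_sign`: `__builtin_signbit` reports whether the sign bit is set, which is not the same as a three-way -1/0/+1 result. Whether existing callers depend on the three-way form is an assumption not verified here. The sketch below contrasts the two behaviours; `sign3` and `signbit01` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Three-way sign: -1, 0 or +1. `val` is a function parameter, so it is evaluated once.
template <typename T>
constexpr int sign3(T val) {
  return (T(0) < val) - (val < T(0));
}

// Sign-bit test (GCC/Clang builtin): 1 when the sign bit is set
// (negative values, including -0.0f), otherwise 0.
inline int signbit01(float val) {
  return __builtin_signbit(val) ? 1 : 0;
}
```

If the three-way result is what callers expect, the comparison-based form keeps the single-evaluation property without changing return values.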

@runger1101001
Member

runger1101001 commented Apr 29, 2026

Wow, thank you so much for trying it - like I thought, even better :-)

Personally I’m very tempted to do it like that… what do you think?
The builtins are GCC-specific. But I’m not aware of any Arduino platform that does not use GCC… so I think we’re fine for compatibility in our ecosystem and it’s only people porting to CubeIDE and Keil compilers that might have trouble…
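If portability to other toolchains becomes a concern, one possible approach is a preprocessor guard that falls back to the standard `<cmath>` functions. This is only a sketch: `constrain_f` is a hypothetical helper name, and which macros a given vendor compiler defines would need to be checked against that compiler's documentation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: use the builtins where GCC or Clang is detected,
// otherwise the standard float overloads of std::fmax / std::fmin.
inline float constrain_f(float amt, float low, float high) {
#if defined(__GNUC__) || defined(__clang__)
  return __builtin_fmaxf(low, __builtin_fminf(high, amt));
#else
  return std::fmax(low, std::fmin(high, amt));
#endif
}
```

Both branches clamp `amt` into `[low, high]`; only the spelling of the min/max calls differs.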

Although very cool I think I’d leave out the template magic that selects the fastest one. But also up for discussion.

@jaguilar
Author

jaguilar commented Apr 29, 2026

> Although very cool I think I’d leave out the template magic that selects the fastest one. But also up for discussion.

I don't think we can, because if anyone is compiling without -ffast-math (which is probably a lot of people because it's not on by default), the __builtin_fmax/_fmin/_round calls would add double promotion and software double rounding. (Will await your agreement or further feedback on this point before continuing.)
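The type-level part of this concern can be checked at compile time. The sketch below, which assumes a GCC or Clang toolchain, shows that the unsuffixed builtins are declared on `double`, so `float` arguments are converted to `double` at the call, while the `f`-suffixed builtins stay in `float`. Whether the optimizer later narrows the computation back depends on flags and target and is not shown here. `promotes_to_double` is a hypothetical helper name.

```cpp
#include <type_traits>

// The unsuffixed builtin takes and returns double, even when given float arguments.
static_assert(std::is_same<decltype(__builtin_fmax(1.0f, 2.0f)), double>::value,
              "__builtin_fmax returns double");

// The f-suffixed builtins stay in single precision.
static_assert(std::is_same<decltype(__builtin_fmaxf(1.0f, 2.0f)), float>::value,
              "__builtin_fmaxf returns float");
static_assert(std::is_same<decltype(__builtin_roundf(1.5f)), float>::value,
              "__builtin_roundf returns float");

// Hypothetical helper exposing the first check as a value.
constexpr bool promotes_to_double() {
  return std::is_same<decltype(__builtin_fmax(1.0f, 2.0f)), double>::value;
}
```

Dispatching on the argument type, as the `enable_if` overloads do, is one way to make sure `float` inputs reach the `f`-suffixed variants.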
