[CMSIS-NN] Fix stateful execution and batch-major striding for CMSIS-NN LSTM by veblush · Pull Request #3564 · tensorflow/tflite-micro

veblush · 2026-05-21T18:24:41Z

Problem

The current CMSIS-NN LSTM wrapper uses arm_lstm_unidirectional_s8 and arm_lstm_unidirectional_s16. These CMSIS-NN functions are designed for stateless sequence evaluation: they explicitly wipe the cell state at t=0 and ignore any initial hidden state, returning only the sequence outputs.

This breaks TFLM's streaming/embedded ML workloads which rely on stateful LSTMs where the CellStateTensor and HiddenStateTensor persist as variable tensors across Invoke() calls.

Furthermore, CMSIS-NN's internal implementation for batch-major tensors (time_major=false with batch_size > 1) incorrectly jumps memory by time_steps, causing an out-of-bounds read on the contiguous hidden_state buffer.
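The stride problem described above can be checked with simple offset arithmetic. The following is a minimal sketch, not CMSIS-NN code: `BatchReadOffset` and `ReadIsInBounds` are hypothetical helpers, and the sketch assumes the matrix-vector kernel advances its input pointer by `cols * batch_offset` elements per batch entry.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: element offset at which a matrix-vector kernel would
// start reading batch entry `b`, ASSUMING it advances its input pointer by
// `cols * batch_offset` elements per batch entry.
constexpr std::size_t BatchReadOffset(std::size_t b, std::size_t cols,
                                      std::size_t batch_offset) {
  return b * cols * batch_offset;
}

// True if reading `cols` elements for batch entry `b` stays inside a buffer
// that holds `buffer_elems` elements.
constexpr bool ReadIsInBounds(std::size_t b, std::size_t cols,
                              std::size_t batch_offset,
                              std::size_t buffer_elems) {
  return BatchReadOffset(b, cols, batch_offset) + cols <= buffer_elems;
}

// Example shapes: batch_size = 2, time_steps = 4, hidden_size = 8.
// The hidden state buffer is contiguous and holds 2 * 8 = 16 elements.
static_assert(!ReadIsInBounds(1, 8, 4, 16),
              "batch_offset = time_steps overruns the hidden state buffer");
static_assert(ReadIsInBounds(1, 8, 1, 16),
              "batch_offset = 1 stays inside the hidden state buffer");
```

With these example shapes, a stride of `time_steps` is valid for a batch-major input of shape [batch, time, features] but places the second batch entry of the hidden state at element 32 of a 16-element buffer.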

Solution

- Fall back to explicit looping: Implemented a manual time/batch loop within CMSIS_NN_EvalInteger8x8_16Lstm and CMSIS_NN_EvalInteger16x8_16Lstm that bypasses the stateless sequence evaluator and instead iteratively calls the single-step CMSIS-NN kernels (arm_nn_lstm_step_s8 and arm_nn_lstm_step_s16).
- State persistence: The fallback loop preserves the CellStateTensor and HiddenStateTensor across timesteps and invocations.
- Stride bug bypass: For time_major=false, the loop evaluates one batch entry at a time (batch_size=1 passed to the kernel), which guarantees contiguous, cache-friendly memory reads and avoids CMSIS-NN's batch striding bug entirely.
- Future-proofing: Introduced #ifdef CMSIS_NN_STATEFUL_LSTM. Once ARM merges an upstream fix that supports the optional hidden_state context pointer, this flag will switch back to the native CMSIS-NN sequence evaluator. (Fixed LSTM ARM-software/CMSIS-NN#219)
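The fallback loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `ToyLstmStep` is a toy stand-in for the single-step kernel (it is not `arm_nn_lstm_step_s8`), and `ToyLstmState` and `ToyEvalBatchMajor` are hypothetical names. The sketch only shows the batch-major loop shape and the fact that state survives between calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy single-step "kernel" (NOT a CMSIS-NN function): for one batch entry and
// one time step, cell += input + hidden_in and hidden_out = cell.
void ToyLstmStep(const int32_t* input, const int32_t* hidden_in, int32_t* cell,
                 int32_t* hidden_out, int size) {
  for (int i = 0; i < size; ++i) {
    cell[i] += input[i] + hidden_in[i];
    hidden_out[i] = cell[i];
  }
}

// Hypothetical persistent state, standing in for the variable tensors.
struct ToyLstmState {
  std::vector<int32_t> hidden;  // [batch_size * size], contiguous
  std::vector<int32_t> cell;    // [batch_size * size], contiguous
};

// Batch-major evaluation: input and output are [batch_size, time_steps, size].
// Each batch entry is processed on its own, one time step at a time, so the
// step function never needs a batch stride.
void ToyEvalBatchMajor(const std::vector<int32_t>& input,
                       std::vector<int32_t>& output, ToyLstmState& state,
                       int batch_size, int time_steps, int size) {
  const std::size_t seq_elems =
      static_cast<std::size_t>(batch_size) * time_steps * size;
  assert(input.size() == seq_elems && output.size() == seq_elems);
  assert(state.hidden.size() == static_cast<std::size_t>(batch_size) * size);
  assert(state.cell.size() == state.hidden.size());

  for (int b = 0; b < batch_size; ++b) {
    int32_t* hidden = state.hidden.data() + b * size;
    int32_t* cell = state.cell.data() + b * size;
    for (int t = 0; t < time_steps; ++t) {
      const int offset = (b * time_steps + t) * size;
      ToyLstmStep(input.data() + offset, hidden, cell, output.data() + offset,
                  size);
      // Carry the new hidden state into the next step (and the next call).
      std::copy_n(output.data() + offset, size, hidden);
    }
  }
}
```

Calling `ToyEvalBatchMajor` twice on the same `ToyLstmState` continues from where the first call stopped, which is the behavior a streaming model expects across `Invoke()` calls.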

BUG=N/A

This PR fixes two critical issues in `arm_lstm_unidirectional_s8` and `arm_lstm_unidirectional_s16` that prevent state persistence in streaming models and cause out-of-bounds reads during non-time-major inference. These issues are closely related to tensorflow/tflite-micro#3564.

Problem:
- State Wiping: By default, `arm_lstm_unidirectional_*` unconditionally sets `hidden_in` to `NULL` and memsets `cell_state` to 0. This discards the `HiddenStateTensor` and `CellStateTensor` that TFLM relies on to persist state across `Invoke()` calls for streaming models.
- Striding Bug: In the `time_major = false` block of `arm_lstm_unidirectional_*`, CMSIS-NN attempts to jump between batches by passing `batch_offset = params->time_steps` to `arm_nn_lstm_step_*`. However, `arm_nn_lstm_step_*` forwards this `batch_offset` to `arm_nn_vec_mat_mul_result_acc_s8_s16` for both the `data_in` and `hidden_in` pointers. Since the `hidden_state` buffer is contiguous (stride 1) and not strided like `data_in`, passing `batch_offset = params->time_steps` causes out-of-bounds reads on the `hidden_in` buffer at timestep t=0.

Solution:
- Adding a `hidden_state` pointer to `cmsis_nn_lstm_context`.
- Forwarding this `hidden_state` as `hidden_in` when present, and skipping the `cell_state` wipe in that case.
- Explicitly iterating over `batch_size` in the `time_major = false` case when computing step sizes, which forces `batch_offset = 1` and avoids the out-of-bounds stride entirely while still writing to the final memory buffer sequentially.
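The optional-pointer behavior described above might look roughly like this. It is a sketch under assumptions: `ToyLstmContext` and `ToyPrepareInitialState` are hypothetical and are not copied from the CMSIS-NN headers; only the rule "a non-null `hidden_state` is forwarded and suppresses the cell state wipe" comes from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical context struct; field names are illustrative only.
struct ToyLstmContext {
  int16_t* cell_state;         // always required
  const int8_t* hidden_state;  // optional: nullptr selects stateless behavior
};

// Returns the pointer the first time step should use as hidden_in.
// The cell state is cleared only when no persisted hidden state is supplied.
const int8_t* ToyPrepareInitialState(const ToyLstmContext& ctx,
                                     std::size_t cell_elems) {
  assert(ctx.cell_state != nullptr);
  if (ctx.hidden_state == nullptr) {
    // Stateless path: start every sequence from a zeroed cell state.
    std::memset(ctx.cell_state, 0, cell_elems * sizeof(int16_t));
    return nullptr;
  }
  // Stateful path: keep the caller's cell state and forward the hidden state.
  return ctx.hidden_state;
}
```

Keeping `nullptr` as the stateless default means existing callers that never set the new field would see unchanged behavior, which is one plausible reason to make the pointer optional.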

ddavis-2015 · 2026-06-18T19:50:54Z

+    if (params.time_steps > 0) {
+      std::copy_n(step_hidden_in, params.batch_size * params.hidden_size,
+                  hidden_state);
+    }


Not sure why this is here. With the greedy memory planner, the hidden_state may be overwritten by a subsequent operator's outputs. See the next comment for more info.

ddavis-2015 · 2026-06-18T19:54:11Z

+      if (params.time_steps > 0) {
+        std::copy_n(step_hidden_in, params.hidden_size,
+                    hidden_state + b * params.hidden_size);
+      }


Same as the above comment, with this additional info: I have not been able to produce a Colab where the converter will produce a stateful, fused LSTM operation with quantization. The converter (and the Colab session) crashes every time. The only stateful LSTM I can make in Colab is always an unfused LSTM.

ddavis-2015 · 2026-06-18T19:57:50Z

+      // Update hidden state for next step
+      std::copy_n(hidden_out, params.batch_size * params.hidden_size,
+                  hidden_state);


Don't understand why this is inside the step loop. Why not just update the hidden state input pointer as was done in the s8 code?

ddavis-2015 · 2026-06-18T20:01:15Z

+        // Update hidden state for next step
+        std::copy_n(hidden_out, params.hidden_size, current_hidden);


Don't understand why this is inside the step loop. Why not just update the hidden state input pointer as was done in the s8 code?
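For illustration, the pointer-update alternative raised in this comment could look roughly like the sketch below. It is not the PR's code: `ToyStep` is a toy stand-in for the step kernel and `ToyRunSequence` is a hypothetical name. The single copy after the loop mirrors the hunk questioned earlier in the thread; the sketch takes no position on whether that copy is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy step (NOT a CMSIS-NN kernel): out[i] = in[i] + hidden_in[i].
// hidden_in and out must not alias.
void ToyStep(const int32_t* in, const int32_t* hidden_in, int32_t* out,
             int size) {
  for (int i = 0; i < size; ++i) {
    out[i] = in[i] + hidden_in[i];
  }
}

// Runs `time_steps` steps for one batch entry. `input` and `output` hold
// [time_steps, size]; `hidden_state` holds [size] and persists across calls.
void ToyRunSequence(const int32_t* input, int32_t* output,
                    int32_t* hidden_state, int time_steps, int size) {
  assert(time_steps >= 0 && size > 0);
  const int32_t* hidden_in = hidden_state;
  for (int t = 0; t < time_steps; ++t) {
    int32_t* out = output + t * size;
    ToyStep(input + t * size, hidden_in, out, size);
    // The next step reads the slot just written, so no per-step copy is made.
    hidden_in = out;
  }
  if (time_steps > 0) {
    // Persist the final hidden state once, after the loop.
    std::copy_n(hidden_in, size, hidden_state);
  }
}
```

This only works when each step writes its hidden output to a distinct slot of the output buffer, which is the case here because the output is laid out as [time_steps, size].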

suleshahid · 2026-06-29T16:19:09Z

Could we add a test case that fails before the fix and passes after it?

veblush requested a review from a team as a code owner May 21, 2026 18:24

veblush added the ci:full Triggers the comprehensive cross-platform test suite. label May 21, 2026

veblush mentioned this pull request May 21, 2026

Fixed LSTM ARM-software/CMSIS-NN#219

Merged

Fixed unidirectional_sequence_lstm

d2fd6ae

veblush force-pushed the cm-lstm branch from fef2465 to d2fd6ae Compare May 21, 2026 22:19

veblush enabled auto-merge June 16, 2026 23:29

ddavis-2015 reviewed Jun 18, 2026

veblush wants to merge 1 commit into tensorflow:main from veblush:cm-lstm
