Fix latency feature extraction #122

Closed
jacobbeierle wants to merge 1 commit into main from first_last_latency_bug

Conversation

@jacobbeierle
Contributor

Fix: Latency First/Last Prediction Cumulative Sum Bug

Summary

support_code/behavior_summaries.py

  1. Before the groupby().sum(), the latency values are now extracted from the per-bin filtered_data using agg(lambda s: s.iloc[0]) and agg(lambda s: s.iloc[-1]). These return a Series indexed by MouseID, preserve NaN, and do not sum across bins.
  2. The old .head(1) / .tail(1) calls (which operated on already-summed, post-groupby data) are replaced with direct assignment from these pre-computed Series.

tests/support_code/test_behavior_summaries.py — 7 new tests covering:

  1. First bin with behavior → correct value returned
  2. First bin without behavior → NaN (does not bleed from a later bin)
  3. Last bin with behavior → correct value returned
  4. Last bin without behavior → NaN (does not bleed from a prior bin)
  5. Single-bin edge case
  6. Multi-mouse correctness (each mouse gets its own values)

Context

bin_first_XX.{behavior}_latency_first_prediction and bin_last_XX.{behavior}_latency_last_prediction in the output feature CSVs are wrong. Instead of reporting the latency to the first/last prediction within the analysis window, they report a cumulative sum of per-bin latency values. For example, for the video 041345_B6J_M_42462_trimmed.avi:

  • bin_first_15.Jumping_latency_first_prediction = 30000 (16.67 min) — outside the 0–15 min window
  • bin_first_60.Jumping_latency_first_prediction = 256925 (142.7 min) — far outside the 0–60 min window
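The minute figures above can be reproduced with a quick conversion. This is only a sanity-check sketch: the 30 fps frame rate is an assumption inferred from the quoted numbers (30000 frames ≈ 16.67 min), not something stated in this PR, and `frames_to_minutes` is a hypothetical helper.

```python
FPS = 30  # assumption: inferred from the figures quoted above, not stated in the PR


def frames_to_minutes(frames: float, fps: int = FPS) -> float:
    """Convert a frame count to minutes at the given frame rate."""
    return frames / fps / 60


# Reported values keyed by the analysis window length in minutes.
reported = {15: 30000, 60: 256925}

for window_min, frames in reported.items():
    minutes = frames_to_minutes(frames)
    inside = minutes <= window_min
    print(f"bin_first_{window_min}: {minutes:.2f} min, inside window: {inside}")
```

Both reported values fall outside their windows, which a latency to the first prediction within the window cannot do.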

The bug is in support_code/behavior_summaries.py.

Root Cause

In support_code/behavior_summaries.py, aggregate_data_by_bin_size() (line 117):

Line 136 sums ALL numeric columns per MouseID, including the latency columns:

aggregated = filtered_data.groupby("MouseID")[numeric_cols].sum()

This collapses filtered_data to one row per MouseID with latency values summed across bins (e.g., bins 0–5, 5–10, 10–15 each have a latency, and they get added together).

Lines 183–188 then try to extract first/last from the already-collapsed result:

aggregated[f"bin_first_{bin_size * 5}.{behavior}_latency_first_prediction"] = (
    aggregated[f"{behavior}_latency_to_first_prediction"].head(1)
)
aggregated[f"bin_last_{bin_size * 5}.{behavior}_latency_last_prediction"] = (
    aggregated[f"{behavior}_latency_to_last_prediction"].tail(1)
)

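After the groupby, there is one row per MouseID, so .head(1) and .tail(1) each return a single mouse's summed value (the first and the last mouse, respectively). For a single-mouse run this silently produces the wrong (summed) number. For multi-mouse runs, index alignment on assignment additionally leaves NaN for every mouse except the first (for the first-prediction column) or the last (for the last-prediction column).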
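Both failure modes can be reproduced on a toy frame. The sketch below uses synthetic latency values and a stand-in output column name (`first`); it is not the project's code, only the same two pandas operations in isolation.

```python
import numpy as np
import pandas as pd

col = "Jumping_latency_to_first_prediction"

# Synthetic per-bin data: three bins for each of two mice.
filtered_data = pd.DataFrame(
    {
        "MouseID": ["A", "A", "A", "B", "B", "B"],
        col: [2506.0, 9412.0, 38222.0, 5000.0, 12000.0, 20000.0],
    }
)

# Failure 1: sum() adds the per-bin latencies together.
aggregated = filtered_data.groupby("MouseID")[[col]].sum()
print(aggregated[col])

# Failure 2: head(1) keeps only mouse A's row, so index alignment on
# assignment leaves NaN for mouse B.
aggregated["first"] = aggregated[col].head(1)
print(aggregated["first"])
```

Mouse A ends up with the sum of its three bins rather than the first bin's 2506, and mouse B gets NaN even though every one of its bins has a latency.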

What the Values Should Be

The per-bin latency_to_first_prediction values are absolute frame numbers from the start of the video (not relative to the bin start), so:

  • latency_first_prediction for a window = the first non-NaN latency_to_first_prediction across all bins in that window per MouseID
  • latency_last_prediction for a window = the last non-NaN latency_to_last_prediction across all bins in that window per MouseID

Fix

File: support_code/behavior_summaries.py

In aggregate_data_by_bin_size(), before the groupby().sum() on line 136, extract the latency values from the per-bin filtered_data using nth(), which preserves NaN (unlike first()/last() which skip NaN and would carry a prior-bin value forward):

latency_first_col = f"{behavior}_latency_to_first_prediction"
latency_last_col = f"{behavior}_latency_to_last_prediction"
latency_first = filtered_data.groupby("MouseID")[latency_first_col].nth(0)
latency_last = filtered_data.groupby("MouseID")[latency_last_col].nth(-1)

  • nth(0): returns the first bin's value per MouseID; if that bin has no behavior the result is NaN
  • nth(-1): returns the last bin's value per MouseID; if that bin has no behavior the result is NaN, rather than falling back to a previous bin's value
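The difference between NaN-skipping and NaN-preserving extraction can be seen on synthetic data. This sketch uses the `agg(lambda s: s.iloc[0])` form mentioned in the Summary rather than `nth()`, because, as far as I know, the index returned by `nth()` changed in pandas 2.0 (it acts as a filter and keeps the original row index), so the positional `agg` form is the safer one to demonstrate across versions.

```python
import numpy as np
import pandas as pd

col = "Jumping_latency_to_first_prediction"

# Synthetic per-bin data: mouse B has no behavior in its first and last bins.
filtered_data = pd.DataFrame(
    {
        "MouseID": ["A"] * 4 + ["B"] * 4,
        col: [2506, 9412, np.nan, 38222, np.nan, 5000, 12000, np.nan],
    }
)
grouped = filtered_data.groupby("MouseID")[col]

# first()/last() skip NaN, so a value from another bin is carried over.
skip_first = grouped.first()
skip_last = grouped.last()

# Positional access keeps whatever the first/last bin holds, NaN included.
keep_first = grouped.agg(lambda s: s.iloc[0])
keep_last = grouped.agg(lambda s: s.iloc[-1])

print(pd.DataFrame({"first()": skip_first, "iloc[0]": keep_first}))
print(pd.DataFrame({"last()": skip_last, "iloc[-1]": keep_last}))
```

For mouse B, `first()` reports 5000 from the second bin and `last()` reports 12000 from the third, while the positional form reports NaN for both.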

Then replace lines 183–188 with:

aggregated[f"bin_first_{bin_size * 5}.{behavior}_latency_first_prediction"] = latency_first
aggregated[f"bin_last_{bin_size * 5}.{behavior}_latency_last_prediction"] = latency_last

Since both aggregated and latency_first/latency_last are indexed by MouseID after the respective groupby operations, pandas will align them correctly for all mice.
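A minimal illustration of that alignment, assuming both objects really are indexed by MouseID: assignment matches on index labels, not on row order. The `Jumping_time_behavior` column and the `bin_first_20` prefix are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Stand-in for the summed frame, indexed by MouseID.
aggregated = pd.DataFrame(
    {"Jumping_time_behavior": [120.0, 80.0]},
    index=pd.Index(["A", "B"], name="MouseID"),
)

# Latency Series indexed by MouseID, deliberately in a different order.
latency_first = pd.Series(
    [np.nan, 2506.0],
    index=pd.Index(["B", "A"], name="MouseID"),
)

out_col = "bin_first_20.Jumping_latency_first_prediction"
aggregated[out_col] = latency_first  # aligned by label, not position
print(aggregated)
```

If the Series were indexed by the original row numbers instead of MouseID, the same assignment would find no matching labels and fill the column with NaN, so the index of the extracted Series is worth checking.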

Verification

No existing tests cover behavior_summaries.py. After the fix, add a pytest test in tests/ (or support_code/tests/) that:

  1. Constructs a small synthetic per-bin DataFrame for two mice with 4 bins each:
    • Mouse A: latency_first bins = [2506, 9412, NaN, 38222]; latency_last bins = [4900, 11000, NaN, 45000]
    • Mouse B: latency_first bins = [NaN, 5000, 12000, NaN]; latency_last bins = [NaN, 8000, 15000, NaN]
  2. Calls aggregate_data_by_bin_size() with bin_size=4
  3. Asserts for latency_first_prediction (first bin's value):
    • Mouse A == 2506 (first bin has behavior)
    • Mouse B is NaN (first bin has no behavior; must NOT fall back to 5000)
  4. Asserts for latency_last_prediction (last bin's value):
    • Mouse A == 45000 (last bin has behavior)
    • Mouse B is NaN (last bin has no behavior; must NOT fall back to 15000 from the previous bin)

NaN never compares equal to itself, so the NaN cases need an explicit NaN check (for example np.isnan or pd.isna) rather than ==.
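The test plan above might look roughly like the following pytest sketch. The signature of `aggregate_data_by_bin_size()` is not shown in this PR, so the sketch exercises a hypothetical stand-in, `extract_window_latencies`, that applies only the NaN-preserving extraction step; a real test would call the actual function and check the `bin_first_XX` / `bin_last_XX` output columns.

```python
import numpy as np
import pandas as pd

FIRST = "Jumping_latency_to_first_prediction"
LAST = "Jumping_latency_to_last_prediction"


def make_per_bin_frame() -> pd.DataFrame:
    """Synthetic per-bin data for two mice with four bins each."""
    return pd.DataFrame(
        {
            "MouseID": ["A"] * 4 + ["B"] * 4,
            FIRST: [2506, 9412, np.nan, 38222, np.nan, 5000, 12000, np.nan],
            LAST: [4900, 11000, np.nan, 45000, np.nan, 8000, 15000, np.nan],
        }
    )


def extract_window_latencies(per_bin: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the extraction step inside the real function."""
    grouped = per_bin.groupby("MouseID")
    return pd.DataFrame(
        {
            "latency_first_prediction": grouped[FIRST].agg(lambda s: s.iloc[0]),
            "latency_last_prediction": grouped[LAST].agg(lambda s: s.iloc[-1]),
        }
    )


def test_first_and_last_bin_values_are_not_summed_or_carried_over():
    result = extract_window_latencies(make_per_bin_frame())

    # First bin's value: A has behavior, B does not.
    assert result.loc["A", "latency_first_prediction"] == 2506
    assert np.isnan(result.loc["B", "latency_first_prediction"])

    # Last bin's value: A has behavior, B does not.
    assert result.loc["A", "latency_last_prediction"] == 45000
    assert np.isnan(result.loc["B", "latency_last_prediction"])
```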

@jacobbeierle
Contributor Author

I am closing this pull request because I discovered that this bug and the average bout length bug share a root cause: the same cumulative sum error. I will shortly submit a new pull request on this branch that addresses both bugs.
