Document Tentacle script abandonment#3175

Draft
jimmyp wants to merge 3 commits into
mainfrom
jimpelletier/eft-3295-document-stuck-script-recovery

@jimmyp commented May 26, 2026

Summary

Adds a new standalone page at src/pages/docs/infrastructure/deployment-targets/tentacle/tentacle-script-abandonment.md covering Tentacle script abandonment, the product behaviour that lets Tentacle abandon a deployment script when it can't run normally, releasing the per-target mutex so the next deployment in the queue can start.

The page covers both triggers under one product term:

  • PowerShell startup detection (EFT-365, shipped April 2026). Fires when powershell.exe doesn't start executing the script body within 5 minutes. Task ends Failed with exit code -47. Windows + PowerShell only.
  • Cancellation timeout (EFT-3295, in flight). Fires when a cancellation hasn't taken effect on the Tentacle within 2 minutes. Task ends Cancelled. Any script on Tentacle (Windows or Linux); SSH and Kubernetes agent not in scope.
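
As a rough illustration of how the two triggers differ, here is a minimal Python sketch. It is not Tentacle's implementation: `ScriptState`, `abandonment_outcome`, and the field names are hypothetical, the timeouts are the values quoted in this description (the 5-minute figure is still an open question), and the order in which the triggers are checked is an arbitrary choice for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Values quoted in this PR description; treat them as assumptions, not confirmed defaults.
STARTUP_TIMEOUT_S = 5 * 60       # PowerShell startup detection
CANCEL_TIMEOUT_S = 2 * 60        # cancellation timeout
STARTUP_ABANDON_EXIT_CODE = -47  # exit code reported for startup detection


class Outcome(Enum):
    FAILED = "Failed"
    CANCELLED = "Cancelled"


@dataclass
class ScriptState:
    """Hypothetical snapshot of a script running on a Tentacle."""
    is_windows_powershell: bool
    seconds_since_launch: float
    body_started: bool
    seconds_since_cancel: Optional[float] = None  # None: no cancellation requested


def abandonment_outcome(state: ScriptState) -> Optional[Outcome]:
    """Return the task outcome if either trigger fires, else None."""
    # Trigger 1: PowerShell startup detection (Windows + powershell.exe only).
    if (state.is_windows_powershell
            and not state.body_started
            and state.seconds_since_launch >= STARTUP_TIMEOUT_S):
        return Outcome.FAILED
    # Trigger 2: cancellation timeout (any script on Tentacle).
    if (state.seconds_since_cancel is not None
            and state.seconds_since_cancel >= CANCEL_TIMEOUT_S):
        return Outcome.CANCELLED
    return None
```

For example, `abandonment_outcome(ScriptState(True, 301, False))` evaluates to `Outcome.FAILED`, while the same state for a non-PowerShell script returns `None` because startup detection does not apply to it.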

The page also covers why these failures happen (target-side antivirus/EDR interference) with a link to OctopusTentacle#1208 for the stack traces and CrowdStrike + Rapid7 deadlock analysis, and what customers should do about it (whitelist Tentacle paths, cross-linked to the existing antivirus section on the troubleshooting page).

The existing Troubleshooting failed or hanging tasks page gets a short cross-link pointer to the new page.

Test plan

  • Spell check passes
  • Astro build doesn't break links
  • Cross-link to #anti-virus-software on the troubleshooting page resolves (anchor exists on the existing page)
  • navOrder 58 slots the new page between agent-vs-agentless (55) and troubleshooting-tentacles (60) in the Tentacle sidebar

Open questions for review

@lucyjspence @LukeButters, three things in the new page worth a second eye on:

  1. EFT-365 startup-detection timeout = 5 minutes. PR #1200 lists 5 min as the default. Clare quoted 13 min in her May 18 email to Philips. Is the actual rolled-out value 5, or did the rollout configure 13?
  2. Cancellation timeout version requirements. Placeholder text reads "to be confirmed when the work ships". After EFT-3295 lands and we know the Tentacle version that publishes AbandonScript on IScriptServiceV2, update before merging.
  3. PowerShell startup detection scope. Is Linux pwsh support on a roadmap, or does this stay Windows + powershell.exe only indefinitely? Affects whether to soften that paragraph or leave it firm.

Reducing risk

  • Honest framing throughout: both triggers are mitigation, not cure. The page says explicitly that the underlying problem is on the target machine and the customer still needs to whitelist antivirus paths.
  • The abandonment subsections explicitly tell the customer the runaway process may still be running on the target.
  • Don't-oversell: no claims of robustness, no marketing language.
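
To make the "mitigation, not cure" point concrete, here is a toy Python model of abandonment. It is a sketch under stated assumptions, not Tentacle code: `TargetLock` and its methods are hypothetical stand-ins for the per-target mutex, and `running_processes` stands in for OS processes on the target. Abandoning releases the lock so the next script can start, but nothing removes the first script's process.

```python
import threading
from typing import Optional


class TargetLock:
    """Toy stand-in for a per-target mutex: one script holds it at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.holder: Optional[str] = None

    def acquire(self, script_id: str) -> bool:
        """Try to take the lock without blocking; return True on success."""
        if self._lock.acquire(blocking=False):
            self.holder = script_id
            return True
        return False

    def abandon(self, script_id: str) -> None:
        """Release the lock for script_id without touching its process."""
        if self.holder == script_id:
            self.holder = None
            self._lock.release()


# Pretend OS processes on the target (hypothetical identifiers).
running_processes = {"script-1"}

lock = TargetLock()
first = lock.acquire("script-1")    # script-1 takes the lock
blocked = lock.acquire("script-2")  # script-2 cannot start yet
lock.abandon("script-1")            # abandonment releases the lock only
second = lock.acquire("script-2")   # script-2 can now start
```

After this runs, `"script-1"` is still in `running_processes`, which mirrors the warning that the runaway process may still be running on the target.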

Refs: EFT-365, EFT-3295

[JIM_BOT.EXE v2.13]

Adds a new section to the "Troubleshooting failed or hanging tasks" page
covering the two automatic recoveries that ship with EFT-365 (PowerShell
startup detection) and EFT-3295 (cancel-abandon escape hatch).

The section sits between the existing "Automatic failure of hanging tasks"
subsection (Hung Deployment Detection — a different feature) and the
"Antivirus software" subsection (operator-side fix), so the page flows
from deployment-level detection, to script-level recovery, to the
underlying fix the customer still needs to make.

Honest framing throughout: both recoveries are mitigation, the underlying
problem is on the target machine, and the abandon path explicitly does
not kill the runaway script process.

Refs: EFT-365, EFT-3295

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@team-marketing-branch-protections

Pull request environment is available at https://stoctodocspr3175.z22.web.core.windows.net.

You can view the ephemeral environment status in Octopus Deploy.

This environment will be automatically deprovisioned when the pull request is closed, or after 7 days of inactivity.

@jimmyp marked this pull request as draft May 26, 2026 23:16
…page

Pivots away from the troubleshooting-embedded approach in the previous commit.
The customer-facing product term is Tentacle script abandonment (per the
Naming Playbook applied to the Linear ticket title, code, FT name, and
Tentacle's own log line). It deserves a standalone page.

What changed:
- New page at src/pages/docs/infrastructure/deployment-targets/tentacle/tentacle-script-abandonment.md
  Covers both triggers under one product term:
    - PowerShell startup detection (EFT-365)
    - Cancellation timeout (EFT-3295)
  Explains the underlying cause (target-side AV/EDR interference) with a
  link to OctopusTentacle#1208 for stack traces and the CrowdStrike +
  Rapid7 deadlock analysis. Cross-links to the existing antivirus
  exclusion list rather than duplicating it.
- Troubleshooting page: the embedded "Recovering from stuck PowerShell
  scripts on Tentacle" section is removed and replaced with a short
  cross-link to the new page. The troubleshooting page stays symptom-led;
  the new page is product-term-led.

Language discipline: no colloquial "stuck", "hung", "hanging", "frozen"
anywhere in the customer-facing prose. "script" (not "PowerShell script")
for the umbrella behaviour; "PowerShell script" only inside the
PowerShell-specific trigger section.

Refs: EFT-365, EFT-3295

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jimmyp changed the title from "Document automatic recovery from stuck PowerShell scripts on Tentacle" to "Document Tentacle script abandonment" May 27, 2026
navOrder: 58
---

Octopus Tentacle can abandon a deployment script when the script can't run normally on the target. Abandonment releases the Tentacle's per-target mutex so the next deployment in the queue can start, even though the script's underlying process may still be running on the target.
Author

Can we add a link to documentation about the per-target mutex?


When Tentacle abandons a script:

- The Tentacle's per-target mutex is released. The next deployment in the queue for that target can start immediately.
Author

Not in the queue. Tentacle will begin executing the next script it has for that target.

When Tentacle abandons a script:

- The Tentacle's per-target mutex is released. The next deployment in the queue for that target can start immediately.
- The Tentacle-side runtime locks holding state for the script are dropped.
Author

How is this different to the mutex?


If `powershell.exe` doesn't reach the first instruction of your script in 5 minutes, Tentacle marks the task as `Failed` with exit code `-47` and prevents the script body from running, even if PowerShell wakes up later. Tentacle records a log line like:

```
Author

Specify this code block's language as whatever the most generic option is. It's a block that contains sample logs.


Tentacle abandons a script in response to one of two triggers.

### PowerShell startup detection
Author

@LukeButters did we really do this for PowerShell only? And if so, was that just because that's the only thing we'd observed the behaviour for?


Yes, only for PowerShell.

AFAIK we have never seen this issue affect Bash (on Linux).

PowerShell startup detection: PowerShell did not start within 5 minutes for task <task ID>
```

Version requirements:
Author

Is this something we do as standard? If so, let's keep it as is. If not, let's make this less of a table format and more of an explanation.


The server-side task log records:

```
Author

Again, give this a language; use a generic one.

Author

Add this to all code blocks

Tentacle abandoned the script.
```

If the script had already completed by the time abandonment was attempted, the second line reads:
Author

Can you confirm we have unit tests on these log lines that include a comment telling people there is documentation dependent on them? If not, ask me to go add them.
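
A sketch of what such a guard might look like, written in Python purely for illustration (the real tests would live alongside the Tentacle code). The three log strings are the ones quoted in this pull request; `classify`, the label names, and the `ServerTasks-123` task ID format are hypothetical.

```python
import re
from typing import Optional

# DOCS DEPENDENCY: the Tentacle script abandonment docs page quotes these log
# lines. If any of them change, update the documentation as well.
STARTUP_LINE = re.compile(
    r"^PowerShell startup detection: PowerShell did not start "
    r"within (\d+) minutes for task (\S+)$"
)
ABANDONED_LINE = "Tentacle abandoned the script"
ALREADY_COMPLETED_LINE = "Script had already completed before abandon was needed"


def classify(line: str) -> Optional[str]:
    """Map a log line to a label, or None if it is not an abandonment line."""
    text = line.strip()
    if STARTUP_LINE.match(text):
        return "startup-detection"
    # Tolerate a trailing full stop, since the line is quoted both ways here.
    bare = text.rstrip(".")
    if bare == ABANDONED_LINE:
        return "abandoned"
    if bare == ALREADY_COMPLETED_LINE:
        return "already-completed"
    return None
```

Pinning the exact strings in one place, next to a comment naming the docs dependency, means a wording change fails a test before the documentation drifts out of date.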


For a worked example with stack traces and a detailed analysis of a CrowdStrike + Rapid7 deadlock on a customer's target, see [OctopusTentacle issue #1208](https://github.com/OctopusDeploy/OctopusTentacle/issues/1208).

Multiple security agents installed on the same host are the most common pattern. The fix is on the target machine.
Author

"The fix is on the target machine" is an AIism. Make it sound like me.

Both abandonment triggers are mitigation, not a fix. The underlying problem is on the target machine, and you're best placed to fix it. Three steps, in order:

1. **Configure your antivirus or endpoint-protection software to exclude Tentacle's working directories.** Specifically `<Tentacle Home>\Tools` and `<Tentacle Home>\Work`. The full exclusion list and additional directories you can include if you're still seeing issues are documented in [Troubleshooting failed or hanging tasks: Antivirus software](/docs/support/troubleshooting-failed-or-hanging-tasks#anti-virus-software).
2. **Keep target-side security tooling updated.** Known interactions between specific CrowdStrike and Rapid7 versions cause the deadlock; vendor updates have addressed similar issues before.
Author

Where is the example where vendors have addressed similar issues before?


This is generally indicative of an internal error in Octopus. In Octopus Cloud we actively monitor for these issues, but please reach out to support for further assistance, especially if the problem persists.

### Tentacle script abandonment
Author

@jimmyp commented May 27, 2026

This sounds like it would cause a hanging task, when really it's the solution to many of these problems.

- Link "per-target mutex" to /docs/administration/managing-infrastructure/run-multiple-processes-on-a-target-simultaneously
- Drop "queue" framing: Tentacle picks up the next script it has for that target, not a notion of a queue.
- Remove the redundant "runtime locks" bullet (covered by the mutex bullet).
- Add `text` language fence to all sample-log code blocks.
- PowerShell startup detection scope reworded: PowerShell-only because that's the only context where the failure has been observed (confirmed by @LukeButters in review).
- Convert bulleted "Version requirements" blocks to single-sentence prose per existing docs convention.
- Drop the "Cancellation hasn't taken effect..." dispatch log line. The current implementation only emits the two outcome lines (`Tentacle abandoned the script` or `Script had already completed before abandon was needed`); the dispatch line was a spec aspiration that isn't in the code yet.
- Replace the AIism "The fix is on the target machine." with two concrete sentences naming Octopus's limit and where the fix lives.
- Remove the unsupported "vendor updates have addressed similar issues before" claim. Replace with a directive to check the vendor's release notes.
- Reword the cross-link section on the troubleshooting page from "Tentacle script abandonment" to "Automatic recovery for hanging tasks" so the heading reads as the solution, not the problem. The canonical product term stays in the link text.

Refs: EFT-365, EFT-3295

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## What to do about it

Both abandonment triggers are mitigation, not a fix. The underlying problem is on the target machine, and you're best placed to fix it. Three steps, in order:
Author

Suggested change
Both abandonment triggers are mitigation, not a fix. The underlying problem is on the target machine, and you're best placed to fix it. Three steps, in order:
Both abandonment triggers are mitigation, not a fix. The underlying problem is on the target machine. To fix it, consider:


For a worked example with stack traces and a detailed analysis of a CrowdStrike + Rapid7 deadlock on a customer's target, see [OctopusTentacle issue #1208](https://github.com/OctopusDeploy/OctopusTentacle/issues/1208).

Multiple security agents installed on the same host are the most common pattern. Octopus can't reach inside that interaction to fix it. The fix lives in your target-side antivirus configuration.
Author

Suggested change
Multiple security agents installed on the same host are the most common pattern. Octopus can't reach inside that interaction to fix it. The fix lives in your target-side antivirus configuration.
Multiple security agents installed on the same host are the most common pattern. Octopus can't reach inside that interaction to resolve this situation.
