fix: add SSRF protection to FileUrl in crewai-files#5844
fix: add SSRF protection to FileUrl in crewai-files#5844devin-ai-integration[bot] wants to merge 4 commits into
Conversation
- Add security.py module with DNS-resolving URL validation that blocks private/reserved IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16, 0.0.0.0/32) and IPv6 equivalents - Update FileUrl._validate_url() to use the new validate_url() function - Disable follow_redirects in read()/aread() to prevent redirect-based SSRF - Add CREWAI_FILES_ALLOW_UNSAFE_URLS env var escape hatch - Add comprehensive tests for SSRF protection Fixes #5843 Co-Authored-By: João <joao@crewai.com>
|
Prompt hidden (unlisted session) |
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds SSRF protection by resolving hostnames and blocking private/reserved IPs via a new validate_url(), integrates it into FileUrl, disables HTTP redirects for fetches, and provides tests covering helpers, validate_url behavior, FileUrl integration, DNS rebinding, escape hatch, and redirect regression checks. ChangesSSRF Protection and Redirect Hardening
🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Co-Authored-By: João <joao@crewai.com>
Co-Authored-By: João <joao@crewai.com>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/crewai-files/src/crewai_files/core/security.py`:
- Around line 22-36: The denylist in _BLOCKED_IPV4_NETWORKS and
_BLOCKED_IPV6_NETWORKS is too narrow; extend both lists to include other
non-public/special-use ranges (e.g. IPv4: 100.64.0.0/10, 198.18.0.0/15,
192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24, 224.0.0.0/4 (multicast),
240.0.0.0/4 (reserved/experimental), 255.255.255.255/32; IPv6: ff00::/8
(multicast), 2001:db8::/32 (documentation), 2001:10::/28 (ORCHID), and other
IANA reserved blocks). Locate the lists declared as _BLOCKED_IPV4_NETWORKS and
_BLOCKED_IPV6_NETWORKS and add ipaddress.ip_network(...) entries for each of
these CIDRs; update any unit tests that validate hostname/IP blocking to cover
the new ranges. Ensure you import or use ipaddress.ip_network strings
consistently with existing entries.
- Around line 77-81: The warning currently logs the full URL variable in
logger.warning; change it to redact sensitive parts before logging by parsing
the url (e.g., using urllib.parse.urlparse) and removing userinfo, query, and
fragment or replacing them with "<redacted>", then log only the sanitized
representation along with _UNSAFE_URLS_ENV; update the call site that references
url in the logger.warning invocation (the code that emits "%s is enabled -
skipping URL validation for: %s") to pass the redacted URL instead of the raw
url.
In `@lib/crewai-files/src/crewai_files/core/sources.py`:
- Around line 451-456: The URL is only validated in _validate_url at
construction, leaving a TOCTOU gap; modify the fetch paths to re-run the same
validation right before network access by calling validate_url(self.url) (from
crewai_files.core.security) at the start of the synchronous read() method and
the asynchronous aread() method (and any other fetch-related methods around the
475-493 region), so that each fetch rechecks the hostname and raises/aborts if
validation fails.
In `@lib/crewai-files/tests/test_ssrf_protection.py`:
- Around line 117-124: The tests test_allows_public_https_url and
test_allows_public_http_url call validate_url but perform real DNS lookups via
example.com; make them deterministic by mocking DNS/resolver calls used inside
validate_url (e.g., monkeypatch or unittest.mock of socket.getaddrinfo /
dns.resolver.resolve or the module helper used) to return a canned public IP,
then assert the same URL is returned; apply the same mocking pattern to the
other similar tests around the 178-185 region (the other test_* functions that
validate public URLs) so no real network/DNS resolution occurs during CI.
- Around line 27-29: The literal "0.0.0.0" in the test fixture triggers ruff
S104; update the test to add an inline per-line suppression directly on the
offending literal (the "0.0.0.0" entry) such as adding a trailing comment like
"# noqa: S104 # intentional test vector for SSRF protection" so the linter
permits the case while documenting intent; keep the suppression next to the
"0.0.0.0" string in test_ssrf_protection.py so it's clearly associated with that
test value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5376ccf9-b2da-4713-9c90-d46bc9a61a90
📒 Files selected for processing (4)
lib/crewai-files/src/crewai_files/core/security.pylib/crewai-files/src/crewai_files/core/sources.pylib/crewai-files/tests/test_file_url.pylib/crewai-files/tests/test_ssrf_protection.py
| _BLOCKED_IPV4_NETWORKS = [ | ||
| ipaddress.ip_network("10.0.0.0/8"), | ||
| ipaddress.ip_network("172.16.0.0/12"), | ||
| ipaddress.ip_network("192.168.0.0/16"), | ||
| ipaddress.ip_network("127.0.0.0/8"), | ||
| ipaddress.ip_network("169.254.0.0/16"), # Link-local / cloud metadata | ||
| ipaddress.ip_network("0.0.0.0/32"), | ||
| ] | ||
|
|
||
| _BLOCKED_IPV6_NETWORKS = [ | ||
| ipaddress.ip_network("::1/128"), | ||
| ipaddress.ip_network("::/128"), | ||
| ipaddress.ip_network("fc00::/7"), # Unique local addresses | ||
| ipaddress.ip_network("fe80::/10"), # Link-local IPv6 | ||
| ] |
There was a problem hiding this comment.
Broaden blocked IP coverage beyond the current RFC1918-focused lists.
The current denylist omits several non-public/special-use ranges (for example 100.64.0.0/10, 198.18.0.0/15, multicast/experimental blocks). A hostname resolving to one of those can still pass validation, which weakens SSRF protection.
Suggested patch
_BLOCKED_IPV4_NETWORKS = [
ipaddress.ip_network("10.0.0.0/8"),
ipaddress.ip_network("172.16.0.0/12"),
ipaddress.ip_network("192.168.0.0/16"),
ipaddress.ip_network("127.0.0.0/8"),
ipaddress.ip_network("169.254.0.0/16"), # Link-local / cloud metadata
+ ipaddress.ip_network("100.64.0.0/10"), # Carrier-grade NAT
+ ipaddress.ip_network("192.0.0.0/24"), # IETF protocol assignments
+ ipaddress.ip_network("198.18.0.0/15"), # Benchmark/testing
+ ipaddress.ip_network("224.0.0.0/4"), # Multicast
+ ipaddress.ip_network("240.0.0.0/4"), # Reserved/experimental
+ ipaddress.ip_network("255.255.255.255/32"),
ipaddress.ip_network("0.0.0.0/32"),
]
_BLOCKED_IPV6_NETWORKS = [
ipaddress.ip_network("::1/128"),
ipaddress.ip_network("::/128"),
ipaddress.ip_network("fc00::/7"), # Unique local addresses
ipaddress.ip_network("fe80::/10"), # Link-local IPv6
+ ipaddress.ip_network("ff00::/8"), # Multicast
]Also applies to: 44-55
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/crewai-files/src/crewai_files/core/security.py` around lines 22 - 36, The
denylist in _BLOCKED_IPV4_NETWORKS and _BLOCKED_IPV6_NETWORKS is too narrow;
extend both lists to include other non-public/special-use ranges (e.g. IPv4:
100.64.0.0/10, 198.18.0.0/15, 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24,
224.0.0.0/4 (multicast), 240.0.0.0/4 (reserved/experimental),
255.255.255.255/32; IPv6: ff00::/8 (multicast), 2001:db8::/32 (documentation),
2001:10::/28 (ORCHID), and other IANA reserved blocks). Locate the lists
declared as _BLOCKED_IPV4_NETWORKS and _BLOCKED_IPV6_NETWORKS and add
ipaddress.ip_network(...) entries for each of these CIDRs; update any unit tests
that validate hostname/IP blocking to cover the new ranges. Ensure you import or
use ipaddress.ip_network strings consistently with existing entries.
| logger.warning( | ||
| "%s is enabled - skipping URL validation for: %s", | ||
| _UNSAFE_URLS_ENV, | ||
| url, | ||
| ) |
There was a problem hiding this comment.
Avoid logging full URLs in escape-hatch warnings.
This warning currently emits the full URL, which can leak tokens/credentials in query strings or userinfo to logs. Prefer redacting sensitive parts before logging.
Suggested patch
+from urllib.parse import urlsplit, urlunsplit
@@
if _is_escape_hatch_enabled():
+ parts = urlsplit(url)
+ safe_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
logger.warning(
"%s is enabled - skipping URL validation for: %s",
_UNSAFE_URLS_ENV,
- url,
+ safe_url,
)
return url📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| logger.warning( | |
| "%s is enabled - skipping URL validation for: %s", | |
| _UNSAFE_URLS_ENV, | |
| url, | |
| ) | |
| if _is_escape_hatch_enabled(): | |
| parts = urlsplit(url) | |
| safe_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", "")) | |
| logger.warning( | |
| "%s is enabled - skipping URL validation for: %s", | |
| _UNSAFE_URLS_ENV, | |
| safe_url, | |
| ) | |
| return url |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/crewai-files/src/crewai_files/core/security.py` around lines 77 - 81, The
warning currently logs the full URL variable in logger.warning; change it to
redact sensitive parts before logging by parsing the url (e.g., using
urllib.parse.urlparse) and removing userinfo, query, and fragment or replacing
them with "<redacted>", then log only the sanitized representation along with
_UNSAFE_URLS_ENV; update the call site that references url in the logger.warning
invocation (the code that emits "%s is enabled - skipping URL validation for:
%s") to pass the redacted URL instead of the raw url.
| def _validate_url(self) -> FileUrl: | ||
| """Validate URL format.""" | ||
| if not self.url.startswith(("http://", "https://")): | ||
| raise ValueError(f"Invalid URL scheme: {self.url}") | ||
| """Validate URL format and ensure it does not target internal networks.""" | ||
| from crewai_files.core.security import validate_url | ||
|
|
||
| validate_url(self.url) | ||
| return self |
There was a problem hiding this comment.
Revalidate immediately before each fetch to reduce DNS-rebinding exposure.
Validation only at object creation leaves a gap before read()/aread() where hostname resolution can change. Re-running validation at fetch time materially reduces that TOCTOU window.
Suggested patch
def read(self) -> bytes:
"""Fetch content from URL (for providers that don't support URL references)."""
if self._content is None:
import httpx
+ from crewai_files.core.security import validate_url
+ validate_url(self.url)
response = httpx.get(self.url, follow_redirects=False)
response.raise_for_status()
self._content = response.content
@@
async def aread(self) -> bytes:
"""Async fetch content from URL."""
if self._content is None:
import httpx
+ from crewai_files.core.security import validate_url
+ validate_url(self.url)
async with httpx.AsyncClient() as client:
response = await client.get(self.url, follow_redirects=False)Also applies to: 475-493
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/crewai-files/src/crewai_files/core/sources.py` around lines 451 - 456,
The URL is only validated in _validate_url at construction, leaving a TOCTOU
gap; modify the fetch paths to re-run the same validation right before network
access by calling validate_url(self.url) (from crewai_files.core.security) at
the start of the synchronous read() method and the asynchronous aread() method
(and any other fetch-related methods around the 475-493 region), so that each
fetch rechecks the hostname and raises/aborts if validation fails.
| def test_allows_public_https_url(self): | ||
| result = validate_url("https://example.com/image.png") | ||
| assert result == "https://example.com/image.png" | ||
|
|
||
| def test_allows_public_http_url(self): | ||
| result = validate_url("http://example.com/file.pdf") | ||
| assert result == "http://example.com/file.pdf" | ||
|
|
There was a problem hiding this comment.
Make public-URL allow tests deterministic by mocking DNS.
These tests currently rely on real resolver/network behavior (example.com lookup). That can fail in sandboxed/offline CI and produce flaky results unrelated to SSRF logic.
Suggested patch
def test_allows_public_https_url(self):
- result = validate_url("https://example.com/image.png")
- assert result == "https://example.com/image.png"
+ fake_addrinfo = [(2, 1, 6, "", ("93.184.216.34", 443))]
+ with patch("socket.getaddrinfo", return_value=fake_addrinfo):
+ result = validate_url("https://example.com/image.png")
+ assert result == "https://example.com/image.png"
@@
def test_allows_public_http_url(self):
- result = validate_url("http://example.com/file.pdf")
- assert result == "http://example.com/file.pdf"
+ fake_addrinfo = [(2, 1, 6, "", ("93.184.216.34", 80))]
+ with patch("socket.getaddrinfo", return_value=fake_addrinfo):
+ result = validate_url("http://example.com/file.pdf")
+ assert result == "http://example.com/file.pdf"
@@
def test_accepts_public_https_url(self):
- url = FileUrl(url="https://example.com/image.png")
- assert url.url == "https://example.com/image.png"
+ fake_addrinfo = [(2, 1, 6, "", ("93.184.216.34", 443))]
+ with patch("socket.getaddrinfo", return_value=fake_addrinfo):
+ url = FileUrl(url="https://example.com/image.png")
+ assert url.url == "https://example.com/image.png"Also applies to: 178-185
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/crewai-files/tests/test_ssrf_protection.py` around lines 117 - 124, The
tests test_allows_public_https_url and test_allows_public_http_url call
validate_url but perform real DNS lookups via example.com; make them
deterministic by mocking DNS/resolver calls used inside validate_url (e.g.,
monkeypatch or unittest.mock of socket.getaddrinfo / dns.resolver.resolve or the
module helper used) to return a canned public IP, then assert the same URL is
returned; apply the same mocking pattern to the other similar tests around the
178-185 region (the other test_* functions that validate public URLs) so no real
network/DNS resolution occurs during CI.
Co-Authored-By: João <joao@crewai.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/crewai-files/src/crewai_files/core/security.py`:
- Around line 1-19: The file formatting for crewai_files.core.security is not
ruff-compliant; run the formatter or reformat the module so imports and
docstring conform to ruff style (e.g., fix blank lines, import ordering, and
spacing around the module docstring and imports) and re-run `uv run ruff format
lib/crewai-files/src/crewai_files/core/security.py`; ensure the top-level
docstring, the `from __future__` line, imports (ipaddress, logging, os, socket,
urlparse) and the `logger = logging.getLogger(__name__)` line follow ruff/PEP8
formatting rules.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 68236ecc-7ead-449e-8f60-2153e2de290b
📒 Files selected for processing (2)
lib/crewai-files/src/crewai_files/core/security.pylib/crewai-files/tests/test_ssrf_protection.py
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/crewai-files/tests/test_ssrf_protection.py
Summary
Addresses #5843 —
FileUrl._validate_url()only checked the URL scheme (http:///https://) but never resolved DNS or validated the target IP, allowing SSRF to internal networks and cloud metadata endpoints (e.g.http://169.254.169.254/).Changes:
New
security.pymodule (lib/crewai-files/src/crewai_files/core/security.py): Addsvalidate_url()which resolves DNS viasocket.getaddrinfo()and blocks requests to private/reserved IP ranges (RFC 1918, loopback, link-local/cloud metadata, IPv4-mapped IPv6). Mirrors the existing pattern increwai-tools/security/safe_path.py. IncludesCREWAI_FILES_ALLOW_UNSAFE_URLSenv var escape hatch.Updated
FileUrl._validate_url(): Now delegates to the newvalidate_url()instead of a simple string prefix check.Disabled
follow_redirectsin bothread()andaread(): Changed fromfollow_redirects=Truetofollow_redirects=Falseto prevent redirect-based SSRF bypasses.Review & Testing Checklist for Human
follow_redirects=Falseis a breaking behavioral change. AnyFileUrlpointing to a URL that returns a 301/302 redirect (common with CDNs, URL shorteners, S3 presigned URLs) will now fail with an HTTP error instead of following the redirect. Verify this is acceptable, or consider implementing redirect-following with per-hop IP validation instead.FileUrlconstruction time (inside Pydantic'smodel_validator). This means creating aFileUrlnow requires network access and may be slow or raiseValueErrorin offline/restricted environments. Confirm this is acceptable for all call sites.read()/aread()makes a separate HTTP request later. A DNS rebinding attack could change the resolved IP between these two steps. Thefollow_redirects=Falsechange helps but doesn't fully close this gap.crewai-tools: The blocked IP ranges and validation logic are copied fromcrewai_tools.security.safe_path. If that list is updated, this copy won't be in sync. Consider whether a shared dependency or extraction tocrewai-corewould be preferable.Suggested test plan: Create a
FileUrltargetinghttp://169.254.169.254/latest/meta-data/andhttp://127.0.0.1/and confirm both raiseValueError. Also test a legitimate public URL (e.g.https://example.com/image.png) still works end-to-end throughread().Notes
test_resolve_url_source_bedrock_fetches_contentwas already failing onmain(unrelated to this PR).test_ssrf_protection.pyadds 50+ test cases covering IP validation, URL validation, FileUrl integration, DNS rebinding simulation, and the escape hatch.Link to Devin session: https://app.devin.ai/sessions/e7cf94a18fe442c3a39265dd5fc605ef
Summary by CodeRabbit
New Features
Bug Fixes
Tests