Python: Remove imprecise container steps #2#21888
Conversation
- remove `tupleStoreStep` and `dictStoreStep` from `containerStep` These are imprecise compared to the content being precise. - add implicit reads to recover taint at sinks - add implicit read steps for decoders to supplement the `AdditionalTaintStep` that now only covers when the full container is tainted.
We now find an alert on this line as we hope to It is not an alert for _full_ SSRF, though, since that configuration cannot handle multiple substitutions.
and adjust collection test
1d66c7b to
20fadc8
Compare
20fadc8 to
ec13e1b
Compare
|
The first DCA run showed major performance problems, as expected from the original PR. I tracked this down to the problem that |
There was a problem hiding this comment.
Pull request overview
This PR refines Python’s new dataflow/taint tracking for containers by removing imprecise “element taints whole container” steps for tuples and dictionaries (where content is tracked precisely), and compensates by adding implicit-read mechanisms so taint can still be recovered at sinks and through certain conversions/decoders. It updates flow-summary/content-set plumbing to support wildcard content sets for better scalability and adjusts a broad set of query and library tests accordingly.
Changes:
- Remove imprecise container bubbling for tuple/dict stores and introduce wildcard
ContentSets plus implicit taint reads at sinks. - Add/adjust conversion-related read steps (e.g., decoders,
%formatting,format_map) to preserve intended taint behavior without container-wide tainting. - Update query-test
.expectedbaselines and library tests to reflect the new, more precise paths/alerts.
Show a summary per file
| File | Description |
|---|---|
| python/ql/lib/semmle/python/dataflow/new/internal/TaintTrackingPrivate.qll | Adds default implicit read policy (wildcard tuple/dict element reads) and removes tuple/dict store bubbling from containerStep. |
| python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPublic.qll | Introduces wildcard-capable ContentSet representation (AnyTupleElement/AnyDictionaryElement) and singleton wrapper helper. |
| python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll | Refactors read/store/clear steps to operate via singleton/wildcard ContentSets and adds conversion read steps. |
| python/ql/lib/semmle/python/dataflow/new/internal/FlowSummaryImpl.qll | Updates summary encoding/summary-components to use singleton ContentSets and adds encoding for wildcard sets. |
| python/ql/lib/semmle/python/dataflow/new/internal/TypeTrackingImpl.qll | Wraps tracked content in singleton ContentSets when delegating to summary flow. |
| python/ql/lib/semmle/python/frameworks/Stdlib.qll | Adjusts stdlib summary behavior to taint both list element content and (imprecisely) the list where appropriate. |
| python/ql/src/Variables/LoopVariableCapture/LoopVariableCaptureQuery.qll | Updates implicit read allowlist logic to use wildcard tuple/dict element checks and store-content inspection. |
| python/ql/consistency-queries/DataFlowConsistency.ql | Excludes new conversion read steps from consistency checks where appropriate. |
| python/ql/test/library-tests/frameworks/tornado/taint_test.py | Updates tornado taint expectations around comprehensions/element reads under the new container semantics. |
| python/ql/test/library-tests/frameworks/stdlib/test_re.py | Updates re modeling expectations to rely on implicit reads at sinks for Match object content. |
| python/ql/test/library-tests/dataflow/tainttracking/defaultAdditionalTaintStep/test_unpacking.py | Removes a now-fixed spurious taint expectation for tuple/list unpacking. |
| python/ql/test/library-tests/dataflow/tainttracking/defaultAdditionalTaintStep/test_collections.py | Adjusts dict .items() taint expectations to match new dict precision semantics. |
| python/ql/test/library-tests/dataflow/sensitive-data/test.py | Updates sensitive-data expectations for non-sensitive dict entries with precise dict content. |
| python/ql/test/query-tests/Security/CWE-943-NoSqlInjection/NoSqlInjection.expected | Updates expected NoSQL injection path output to reflect refined dict/tuple content steps. |
| python/ql/test/query-tests/Security/CWE-918-ServerSideRequestForgery/PartialServerSideRequestForgery.expected | Updates expected SSRF partial-path edges/nodes under new tuple content behavior. |
| python/ql/test/query-tests/Security/CWE-312-CleartextLogging/CleartextLogging.expected | Updates expected cleartext logging results (removes prior spurious dict cross-talk path). |
| python/ql/test/query-tests/Security/CWE-209-StackTraceExposure/StackTraceExposure.expected | Updates expected stack trace exposure path with refined dict/str conversion steps. |
| python/ql/test/query-tests/Security/CVE-2018-1281/BindToAllInterfaces.expected | Updates expected bind-to-all-interfaces results to reflect tuple element precision. |
| python/ql/test/experimental/query-tests/Security/CWE-1427-PromptInjection/PromptInjection.expected | Updates expected prompt-injection results to reflect reduced container-wide tainting. |
| python/ql/test/experimental/query-tests/Security/CWE-1427-PromptInjection/openai_test.py | Adjusts inline alert annotation expectations to match updated analysis behavior. |
Copilot's findings
- Files reviewed: 20/20 changed files
- Comments generated: 3
| [(k, v) for (k, v) in request.headers.get_all()], # The comprehension is not tainted, only the elements | ||
| list([(k, v) for (k, v) in request.headers.get_all()]), # Here, all the elements of the list are tainted, but the list is not. |
| # since we have taint-step from store of `password`, we will consider any item in the | ||
| # dictionary to be a password :( | ||
| print(_config["sleep_timer"]) # $ SPURIOUS: SensitiveUse=password | ||
| # since we have precise dictionary content, other items of the config are not tainted |
| # returns Match object, which is tested properly below. (note: the match objects contain | ||
| # tainted values but are not themselves tainted - this test relies on implicit reads at sinks). |
Supersedes #17493.
We used to have taint steps from any element of a collection to the entire collection (see here).
These are very imprecise, leading to false positives (e.g. seen #17008 (comment) and #16976).
They are also at odds with how other languages treat collections, see https://github.com/github/codeql-python-team/issues/728?reload=1 about this.
We wish to keep the semantics, that if a collection is tainted, then all elements are considered tainted. Therefor we now try to not taint collections, if we have precise information about which elements are tainted.
For a list, if an element is tainted, we do not know which one, so any read is potentially reading tainted information.
There is not much difference between the list having content and the list being tainted.
But for a dictionary, if an entry is tainted and we know which one, only reads of the appropriate key is reading tainted information. All other reads should ideally be considered safe (they used to not be). If we do not know that other keys are safe, e.g. if the collection came from an untrusted source, we can taint the collection itself, and all reads will be considered dangerous. So for collections with precise content, there is a big difference between having content and the collection being tainted.
Thus we wish to remove these imprecise taint steps for tuples and dictionaries, where we track content precisely (we keep them for lists and sets, where content is imprecise anyway).
Changes
In this PR we do the following:
Status:
Potential confusions:
Improvements:
Closes #17493.