
Some optimizations for Raphael#16

Open
hxsyzl wants to merge 188 commits into crdroidandroid:16.0-raphael from hxsyzl:16.0-raphael

Conversation

Contributor

@hxsyzl hxsyzl commented Apr 17, 2026

lz4 zram cgroupv2 binder thermal psi drm

Demon000 and others added 30 commits April 17, 2026 01:17
Change-Id: I066cd75e12e3af33c356ed79d1ac50b406fbdb99
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Some DT devices, mainly smartphones, do need more trip points to
allow finer-grained thermal mitigation, hence allowing a better user
experience (and overall performance), for example by lowering the CPU
clocks just a little at each temperature step.

Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Currently, when lmh interrupt fires we notify the scheduler about
thermal pressure. The scheduler reduces the capacity of the cpu(s)
for which the interrupt fired, while we requeue our deferrable work.

This leads to an interesting deadlock, since the cpu is running with
reduced capacity, the scheduler does not put any task on it,
extending the idle time and causing it to remain at reduced capacity -
leading to severe underutilization of the cpu(s).

Switch to using delayed work instead of deferrable work. This causes
the cpu to exit idle and keep servicing the work until the thermal
condition is recovered.

Change-Id: I4c27c9a952ed556336297eb25e815c848b8266eb
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
[Helium-Studio: Apply to msm_lmh_dcvs driver]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
Signed-off-by: Nauval Rizky <enuma.alrizky@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Don't take priority over other workqueues.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
The step wise algorithm ignores the stable temperature trend, which
can cause a case where step wise will not clear the mitigation when
the temperature is lower than the hysteresis and stays stable
throughout.

In order to avoid this, consider temperature stable trend similar to
temperature dropping trend. This will lower the mitigation when the
temperature is less than hysteresis.

Change-Id: Ic7d1d35c74d732562193195e46af945ef41912c2
Signed-off-by: Ram Chandrasekar <rkumbako@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: PrimoDev23 <lexx.ps2711@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 1def1bd05d3d9d317a537039434ab900db4239fe)
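The stable-trend handling described above can be sketched in plain C. This is a userspace toy under stated assumptions: the trend enum and next_target() are illustrative, not the actual step_wise driver API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: a stable trend is treated like a dropping
 * trend, so the mitigation level steps down once the temperature
 * sits below the hysteresis threshold, instead of staying throttled
 * forever while the temperature is stable. */
enum trend { TREND_RAISING, TREND_DROPPING, TREND_STABLE };

static int next_target(enum trend t, int cur_level, bool below_hyst)
{
    if (t == TREND_RAISING)
        return cur_level + 1;                 /* throttle harder */
    /* stable treated like dropping when below hysteresis */
    if ((t == TREND_DROPPING || t == TREND_STABLE) &&
        below_hyst && cur_level > 0)
        return cur_level - 1;                 /* relax mitigation */
    return cur_level;                         /* hold current level */
}
```

Without the TREND_STABLE clause, a zone whose temperature drops below hysteresis and then flatlines would hold its mitigation level indefinitely.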
There is a possibility that a zone will have a polling delay to evaluate
the temperature and determine the mitigation level. The cooling device
mitigated by this zone might also be mitigated by another zone. The
non-mitigating, polling-type thermal zone will take the cooling
device's current mitigation, reduce it by one level, and keep that
level active until the other mitigation clears. Once the other
mitigation clears, the vote from this non-mitigating thermal zone is
reduced step by step. Thus the mitigation is not cleared right away;
it instead takes multiple polling iterations to clear, which can
impact performance.

To avoid this scenario in step-wise algorithm, skip the evaluation of
mitigation when the trip is not reached and there is no previous
mitigation vote. This optimization will prevent the unnecessary
mitigation vote from the non-mitigating thermal zone.

Change-Id: I3939cb53937b650ecf432989f065cfa619297397
Signed-off-by: Ram Chandrasekar <rkumbako@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: PrimoDev23 <lexx.ps2711@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 4dea2c1decfe8e99121e8b43cdcca10ab1f543ee)
Step wise algorithm will always determine the next mitigation action
based on the current operating level. When a cooling device is mitigated
by two rules simultaneously, each rule will have its own mitigation value.
When one rule goes below the clear threshold, step wise algorithm will
still decide to mitigate, since the next mitigation is determined based
on the current operating level. For rules that don't have passive or
active polling, this mitigation action will stay until the next time
the rule triggers the trip threshold.

In order to avoid this, during downward trend and when no throttling,
consider the last mitigation action sent from the rule. If it is already
at the lowest mitigation level, clear any mitigation.

Change-Id: I4ffd5733097393e5128260512b12576a14be8136
Signed-off-by: Ram Chandrasekar <rkumbako@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: PrimoDev23 <lexx.ps2711@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit d5a4c73b9010024a83f8d55b4316682a49ae2203)
The qmi APIs return a positive transaction result value on a
successful qmi message update for all qmi cooling devices. The qmi
cooling devices return the same value for a cooling device state
update request from the thermal framework. But the thermal framework
expects zero as the return value for a successful cdev update, and
only then will it update the cooling device stats for that cooling
device.

Update qmi cooling device driver to return zero on successful cooling
device state update to thermal framework.

Change-Id: Id9d59ed2c88c7e95dc2746a061e0ebadf0477a30
Signed-off-by: Manaf Meethalavalappu Pallikunhi <manafm@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: PrimoDev23 <lexx.ps2711@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 946d978ed0eb3d4be5a60a8a4a2548b7e8c6a4a8)
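A minimal sketch of the return-value fix, assuming a hypothetical helper that receives the raw qmi transaction result (positive on success, negative errno on failure):

```c
#include <assert.h>

/* Hypothetical wrapper: the thermal framework treats only 0 as
 * success, so collapse positive qmi transaction results to 0 and
 * pass real errors through unchanged. */
static int qmi_cdev_set_state(int qmi_ret)
{
    return qmi_ret > 0 ? 0 : qmi_ret;
}
```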
It is unnecessary to update disabled thermal zones post suspend, and
doing so sometimes leads to errors/warnings in badly behaved thermal
drivers.

Bug: 129435616
Change-Id: If5d3bfe84879779ec1ee024c0cf388ea3b4be2ea
Signed-off-by: Wei Wang <wvw@google.com>
Before patch, with "echo 50000 > /sys/class/thermal/tz-by-name/sdm-therm/emul_temp":
com.android.uibench.janktests.UiBenchJankTests#testInvalidateTree: PASSED (02m6.247s)
        gfx-avg-slow-ui-thread: 0.07110321338664297
        gfx-avg-missed-vsync: 0.0
        gfx-avg-high-input-latency: 74.25140826299423
        gfx-max-frame-time-50: 12
        gfx-min-total-frames: 2250
        gfx-avg-frame-time-99: 11.8
        gfx-avg-num-frame-deadline-missed: 1.6
        gfx-avg-frame-time-50: 9.6
        gfx-max-high-input-latency: 99.86666666666667
        gfx-avg-frame-time-90: 11.0
        gfx-avg-frame-time-95: 11.0
        gfx-max-frame-time-95: 13
        gfx-max-frame-time-90: 13
        gfx-max-slow-draw: 0.0
        gfx-max-frame-time-99: 13
        gfx-avg-slow-draw: 0.0
        gfx-max-total-frames: 2251
        gfx-avg-jank: 43.678000000000004
        gfx-max-slow-bitmap-uploads: 0.0
        gfx-max-missed-vsync: 0.0
        gfx-avg-total-frames: 2250
        gfx-max-jank: 96.67
        gfx-max-slow-ui-thread: 0.13333333333333333
        gfx-max-num-frame-deadline-missed: 3
        gfx-avg-slow-bitmap-uploads: 0.0

After patch, with "echo 50000 > /sys/class/thermal/tz-by-name/sdm-therm/emul_temp":
google/perf/jank/UIBench/UIBench (1 Test)
----------------------------------------
[1/1] com.android.uibench.janktests.UiBenchJankTests#testInvalidateTree: PASSED (02m7.027s)
        gfx-avg-slow-ui-thread: 0.0
        gfx-avg-missed-vsync: 0.0
        gfx-avg-high-input-latency: 11.53777777777778
        gfx-max-frame-time-50: 7
        gfx-min-total-frames: 2250
        gfx-avg-frame-time-99: 8.0
        gfx-avg-num-frame-deadline-missed: 0.0
        gfx-avg-frame-time-50: 7.0
        gfx-max-high-input-latency: 41.15555555555556
        gfx-avg-frame-time-90: 7.2
        gfx-avg-frame-time-95: 7.8
        gfx-max-frame-time-95: 8
        gfx-max-frame-time-90: 8
        gfx-max-slow-draw: 0.0
        gfx-max-frame-time-99: 8
        gfx-avg-slow-draw: 0.0
        gfx-max-total-frames: 2250
        gfx-avg-jank: 0.0
        gfx-max-slow-bitmap-uploads: 0.0
        gfx-max-missed-vsync: 0.0
        gfx-avg-total-frames: 2250
        gfx-max-jank: 0.0
        gfx-max-slow-ui-thread: 0.0
        gfx-max-num-frame-deadline-missed: 0
        gfx-avg-slow-bitmap-uploads: 0.0

Bug: 143162654
Test: use emul_temp to change thermal condition and see capacity changed
Change-Id: Idbf943f9c831c288db40d820682583ade3bbf05e
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Whenever BCL interrupt triggers, it notifies thermal framework.
The framework disables the BCL interrupt and initiates a passive
polling to monitor the clear threshold. But BCL peripheral interrupts
are lazily disabled by default: even after BCL initiates the interrupt
disable, it may take some time to take effect in hardware, and during
this window the hardware can trigger the interrupt again. The BCL
driver assumes it is a spurious interrupt and disables it again, which
permanently disables that interrupt.

If the BCL interrupt triggers again after it has been disabled, just
ignore it to avoid nested interrupt disablement. In the above scenario
BCL is already in polling mode, so ignoring this spurious interrupt
doesn't cause any issue.

Bug: 118493676
Change-Id: Ia77fc66eaf66f97bacee96906cc6a5735a6ed158
Signed-off-by: Manaf Meethalavalappu Pallikunhi <manafm@codeaurora.org>
Signed-off-by: Wei Wang <wvw@google.com>
The routines in this driver are required to enable stable performance at
all supported CPU frequencies. Instead of allowing the driver to be
removed completely, always compile it and simply stub out the actual CPU
throttle components, in order to enable full CPU performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Carlos Ayrton Lopez Arroyo <15030201@itcelaya.edu.mx>
Returning before thermal trips due to low battery should fix the lag
caused by restricted thermals. This is what BCL does.

test: check 2-3 cycles of battery percentage under 10%
result: Fixes device lag completely. Maintains constant performance

Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit ecc53270590cc372ed7c509fcefe3de7fe7269d9)

Change-Id: I51eca8a7245e62178054d2ab73960d8fab9d2d28
Cherry-picked from commit 691eceb7d1b6 ("thermal: Create softlink by name
for thermal_zone and cooling device").
Manually tweaked due to code changes.

Bug: 115763013
Change-Id: Ic2d265dcec4caba228b4e86dc1163c13952d8d3a
Signed-off-by: Vincent Palomares <paillon@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Android atrace and catapult's systrace has clock_set_rate tracing
support which can be used for lmh tracing support

Bug: 73656525
Bug: 116457174
Test: Build
Change-Id: I29d1dc4491e7ac077727989eefda824c3f27a64c
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Bug: 149660093
Bug: 150825703
Change-Id: I5608b70c29f8a3aa4a3646436d1059ca155b8c29
Signed-off-by: Wilson Sung <wilsonsung@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Bug: 118439547
Test: tz and cdev softlink can be created at sys/class/thermal
Change-Id: Ied14db5b1b8f96d746fec3db7a65161d14a6e763
Signed-off-by: TeYuan Wang <kamewang@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Bug: 129787335
test: verified thermal-engine throttling by emul temp
Change-Id: I3b687c9205387df2365e41af51fbbaf107c47f56
Signed-off-by: TeYuan Wang <kamewang@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
VIRT_COUNT_THRESHOLD added to qti_virtual_sensor for combining the
triggered conditions of multiple sensors.

Bug: 128625129
Change-Id: Ib0dcb753180f0cec72b1cc4e5c3b2b899abbb9ea
Signed-off-by: George Lee <geolee@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
thermal_zone_get_cdev_by_name added to thermal_core for cooling device
querying by name.

Bug: 128625129
Change-Id: I3253e61800f8514769d1efc33111dec1e9000e26
Signed-off-by: George Lee <geolee@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
This reverts commit 62dc88e.

Bug: 152162350
Signed-off-by: George Lee <geolee@google.com>
Change-Id: I2c7d8f4197eb55df04740f73e6333052ebb15218
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
SOC, VBat, Temperature, and Battery cycle are monitored via thermal
core as part of the sensor fusion to trigger BCL throttling under
extreme conditions.

Bug: 128625129
Change-Id: Icd55a562aabb56698a2806e6d4f1be6e3382816a
Signed-off-by: George Lee <geolee@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Default threshold of 2.989V is causing brownout at low SOC under highly
stressful load.  VBat threshold can be changed via device tree.

Bug: 135629692
Change-Id: Ic24cd9c25bf7baee168776741f318aebdbdb6e0d
Signed-off-by: George Lee <geolee@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
… string

Change the argument similarly to thermal_zone_device_register. This
helps register names obtained from of_property_read_string during
probe.

LKML link: https://lkml.org/lkml/2019/5/14/631

Bug: 128625129
Test: Build
Change-Id: I966fbe10ac84705f5dd4d26af4d0a848a2f36b68
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
When the thermal_zone property is set to “tracks-low” we should init with
THERMAL_TEMP_INVALID_LOW (274000), not THERMAL_TEMP_INVALID (-274000).

Bug: 111683694
Test: Build and thermal mitigation works properly.
Change-Id: I2824809fdfb47c544d30fde6e4866c3ee4e7b25c
Signed-off-by: davidchao <davidchao@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
The lmh_dcvs limits' initial value is U32_MAX, which is unreasonably
high. Set the initial limits value to the cluster maximum frequency,
which is a sensible value to read.

Bug: 130617766
Test: adb shell cat /sys/bus/platform/drivers/msm_lmh_dcvs/18*/lmh*

Change-Id: I3cff40fa70d51876f66f79c2936a37a02d9cbcf3
Signed-off-by: YiHo Cheng <yihocheng@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
There is a small possibility that the usbc-therm thermistor will read
the voltage as zero. This would falsely trigger the usb-port cooling
device protection, so return an error, or skip notifying of_thermal,
when this issue happens.

Bug: 134517322
Test: reboot test 300 times

Change-Id: I7fa8c9a64a33e590e46e7cfb445f28ad60a79cf6
Signed-off-by: TeYuan Wang <kamewang@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Bug: 130391839
Test: Verified CPU/GPU throttling via emul temp
      run pts -m PtsThermalHalTestCases

Change-Id: I2b42780166ca186124f73cec68953d6ca477b5da
Signed-off-by: TeYuan Wang <kamewang@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
hnaz and others added 16 commits April 17, 2026 01:26
Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") updates only cgroups whose state actually changes during a
task switch, and only in the task preempt case, not in the task sleep
case.

We actually don't need to clear and set the TSK_ONCPU state for common
cgroups of the next and prev tasks in the sleep case; that saves many
psi_group_change() calls, especially when most activity comes from one
leaf cgroup.

sleep before:
psi_dequeue()
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
  while ((group = iterate_groups(next)))  # all ancestors
    psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
  nop
psi_task_switch()
  while ((group = iterate_groups(next)))  # until (prev & next)
    psi_group_change(next, .set=TSK_ONCPU)
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
We noticed that the cost of psi increases with the increase in the
levels of the cgroups. Particularly the cost of cpu_clock() sticks out
as the kernel calls it multiple times as it traverses up the cgroup
tree. This patch reduces the calls to cpu_clock().

Performed perf bench on Intel Broadwell with 3 levels of cgroup.

Before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.747 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 3.516 [sec]

       3.516689 usecs/op
         284358 ops/sec

After the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.640 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 3.329 [sec]

       3.329820 usecs/op
         300316 ops/sec

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shakeelb@google.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
psi_group_cpu->tasks, an unsigned int, stores the number of tasks
that could be stalled on a psi resource (io/mem/cpu). Decrementing
these counters at zero leads to wrapping, which in turn leads to
psi_group_cpu->state_mask being set with the respective pressure
state. This can result in unnecessary time sampling for that pressure
state and thus cause spurious psi events, which can further lead to
wrong actions being taken in userland based on those events.

Though psi_bug is set under these conditions, that is just for
debugging. Fix it by decrementing the ->tasks count only when it is
non-zero.

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1618585336-37219-1-git-send-email-charante@codeaurora.org
[Helium-Studio: Simplify the scope in the commit message to `psi` only]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
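The guarded decrement can be illustrated with a small userspace sketch (the function name is hypothetical; the kernel operates on the per-cpu ->tasks array in place):

```c
#include <assert.h>

/* The ->tasks counters are unsigned, so decrementing at zero would
 * wrap to UINT_MAX and fake a stalled-task count. Decrement only
 * when the counter is non-zero; the kernel additionally records
 * psi_bug for debugging when this case is hit. */
static unsigned int psi_task_dec(unsigned int tasks)
{
    if (tasks)
        tasks--;
    return tasks;
}
```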
4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:

  psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
  (schedule() decreasing RUNNING and ONCPU, both of which are 0)

  psi: inconsistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
  (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)

What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.

However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued, but the psi updates are still pending. Since it
looks at the task's scheduler state, it doesn't move everything to the
new cgroup that the task switch that follows is about to clear from
it. cgroup_move_task() will leak the TSK_RUNNING count in the old
cgroup, and psi_sched_switch() will underflow it in the new cgroup.

A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p->in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p->in_iowait task
and move a non-existent TSK_IOWAIT. This results in the inconsistent
task state warning, as well as a counter underflow that will result in
permanent IO ghost pressure being reported.

Fix this bug by making cgroup_move_task() use task->psi_flags instead
of looking at the potentially mismatching scheduler state.

[ We used the scheduler state historically in order to not rely on
  task->psi_flags for anything but debugging. But that ship has sailed
  anyway, and this is simpler and more robust.

  We previously already batched TSK_ONCPU clearing with the
  TSK_RUNNING update inside the deactivation call from schedule(). But
  that ordering was safe and didn't result in TSK_ONCPU corruption:
  unlike most places in the scheduler, cgroup_move_task() only checked
  task_current() and handled TSK_ONCPU if the task was still queued. ]

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
A race was detected between psi_trigger_destroy/create as shown below,
which causes a panic by accessing an invalid
psi_system->poll_wait->wait_queue_entry and
psi_system->poll_timer->entry->next. With this modification, the race
window is removed by initialising poll_wait and poll_timer in
group_init(), which is executed only once at the beginning.

  psi_trigger_destroy()                   psi_trigger_create()

  mutex_lock(trigger_lock);
  rcu_assign_pointer(poll_task, NULL);
  mutex_unlock(trigger_lock);
					  mutex_lock(trigger_lock);
					  if (!rcu_access_pointer(group->poll_task)) {
					    timer_setup(poll_timer, poll_timer_fn, 0);
					    rcu_assign_pointer(poll_task, task);
					  }
					  mutex_unlock(trigger_lock);

  synchronize_rcu();
  del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
                                  psi_trigger_create()

So, trigger_lock/RCU correctly protects destruction of
group->poll_task but misses this race affecting poll_timer and
poll_wait.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Co-developed-by: ziwei.dai <ziwei.dai@unisoc.com>
Signed-off-by: ziwei.dai <ziwei.dai@unisoc.com>
Co-developed-by: ke.wang <ke.wang@unisoc.com>
Signed-off-by: ke.wang <ke.wang@unisoc.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
We've noticed cases where tasks in a cgroup are stalled on memory but
there is little memory FULL pressure since tasks stay on the runqueue
in reclaim.

A simple example involves a single threaded program that keeps leaking
and touching large amounts of memory. It runs in a cgroup with swap
enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
there is significant CPU pressure and memory SOME, there is barely any
memory FULL since the task enters reclaim and stays on the runqueue.
However, this memory-bound task is effectively stalled on memory and
we expect memory FULL to match memory SOME in this scenario.

The code is confused about memstall && running, thinking there is a
stalled task and a productive task when there's only one task: a
reclaimer that's counted as both. To fix this, we redefine the
condition for PSI_MEM_FULL to check that all running tasks are in an
active memstall instead of checking that there are no running tasks.

        case PSI_MEM_FULL:
-               return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
+               return unlikely(tasks[NR_MEMSTALL] &&
+                       tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);

This will capture reclaimers. It will also capture tasks that called
psi_memstall_enter() and are about to sleep, but this should be
negligible noise.

Signed-off-by: Brian Chen <brianchen118@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
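The redefined PSI_MEM_FULL test quoted in the diff above can be lifted into a standalone helper for illustration (the helper itself is hypothetical; the kernel indexes a per-cpu tasks array):

```c
#include <assert.h>
#include <stdbool.h>

/* Memory is FULL when there are memstalled tasks and every running
 * task is itself in an active memstall (i.e. a reclaimer), rather
 * than requiring that no task is running at all. */
static bool mem_full(unsigned int running, unsigned int memstall,
                     unsigned int memstall_running)
{
    return memstall && running == memstall_running;
}
```

The single-threaded reclaimer from the example above (one task, both running and memstalled) now counts as FULL, whereas the old `memstall && !running` test missed it.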
Martin found it confusing when looking at the /proc/pressure/cpu
output, and found no hint about the CPU "full" line in the psi
documentation.

% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277

The PSI_CPU_FULL state was introduced by commit e7fcd7622823
("psi: Add PSI_CPU_FULL state"), mainly for the cgroup level, but it
is also counted at the system level as a side effect.

Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
schedule latency. For example, t1 is the time when a task wakes up
on an idle CPU, and t2 is the time when the CPU picks it and switches
to it. The delta (t2 - t1) will be in the CPU_FULL state.

Another case where all processes can be stalled is when all cgroups
have been throttled at the same time, which is unlikely to happen.

Anyway, the CPU_FULL metric is meaningless and confusing at the
system level. So this patch reports zeroes for CPU full at the system
level, and updates the psi documentation accordingly.

Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
[Helium-Studio: Simplify the scope in the commit message to `psi` only]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
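The system-level zeroing can be sketched as follows (names are illustrative, not the kernel's; the real code special-cases the psi_system group when formatting /proc/pressure/cpu):

```c
#include <assert.h>
#include <stdbool.h>

/* Report zero for the CPU "full" total at the system level, while
 * cgroup-level groups keep their real accumulated value. */
static long long cpu_full_total(bool system_level, long long raw_total)
{
    return system_level ? 0 : raw_total;
}
```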
Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:

poll_timer_fn
  wake_up_interruptible(&group->poll_wait);

psi_poll_worker
  wait_event_interruptible(group->poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&group->poll_timer)) return;
      ...
      mod_timer(&group->poll_timer, jiffies + delay);

Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes increased number of wakeups and our partners
report visible power regression of ~10mA after applying 461daba06bdc.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change however it will limit it to the
worst case scenario of one extra wakeup per every tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:

Before the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           684365   1385156    1261240
Immediate reschedules blocked:             682846   1381654    1258682
Immediate reschedules (delta):             1519     3502       2558
Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%

After the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           882244   770298    426218
Immediate reschedules blocked:             881996   769796    426074
Immediate reschedules (delta):             248      502       144
Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%

The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <yt.chang@mediatek.com>
Reported-by: Wenju Xu <wenju.xu@mediatek.com>
Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: SH Chen <show-hong.chen@mediatek.com>
Link: https://lore.kernel.org/r/20221028194541.813985-1-surenb@google.com
[Helium-Studio: Simplify the scope in the commit message to `psi` only]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
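The poll_scheduled gate can be sketched with C11 atomics as a userspace toy (function names are hypothetical; the kernel uses its own atomic_t API, and the real reschedule path also involves the timer and wait queue):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* atomic_exchange returns the previous value and is a full barrier,
 * so only the caller that flips 0 -> 1 actually schedules the poll
 * work, and the changed_states update is ordered against the flag. */
static atomic_int poll_scheduled;

static bool try_schedule_poll(void)
{
    return atomic_exchange(&poll_scheduled, 1) == 0;
}

/* Reset only when the polling window expires, which is what narrows
 * the race window described above; returns whether a poll was
 * actually pending. */
static bool polling_window_expired(void)
{
    return atomic_exchange(&poll_scheduled, 0) == 1;
}
```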
Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Helium-Studio: Simplify the scope in the commit message to `psi` only]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: neobuddy89 <neobuddy89@gmail.com>
Same concept as here: kerneltoast/android_kernel_google_wahoo@fe23bc0
Extended version that covers more cases.

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
[kdrag0n: Fixed compile error in Adreno driver when debugfs is enabled]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: neobuddy89 <neobuddy89@gmail.com>
Change-Id: Icfbc59fff5647b28631a670facc93ff837687d4c
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
The current brightness level mapping does not correctly map the
brightness level range from user space to the range supported by the
panel.

For example if the max user brightness reported is 4095, and panel
backlight range is 0-255. Then user is expected to be able to set
brightness in range from 0-4095, but current logic truncates at
bl-max (255). Moreover it doesn't take into account bl-min.

Fix logic such that the brightness range set by user correctly scales to
the backlight level from panel.

Bug: 139263611
Change-Id: Ic70909af63fb5b66ebc1434477f2fc41a785ce1f
Signed-off-by: Adrian Salido <salidoa@google.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
We have cgroup v2, so use the PSI pressure interface.
@hxsyzl hxsyzl marked this pull request as draft April 18, 2026 12:46
@hxsyzl
Contributor Author

hxsyzl commented Apr 18, 2026

I'm trying to backport the newest thermal changes.

@firebird11 firebird11 self-assigned this Apr 18, 2026
@firebird11 firebird11 marked this pull request as ready for review April 18, 2026 21:05
hxsyzl and others added 6 commits April 19, 2026 08:06
We already have it, so this pick is unnecessary.
This reverts commit 8333774.
…stantaneous thermal pressure

Add architecture specific APIs to update and track thermal pressure on a
per CPU basis. A per CPU variable thermal_pressure is introduced to keep
track of instantaneous per CPU thermal pressure. Thermal pressure is the
delta between maximum capacity and capped capacity due to a thermal event.

topology_get_thermal_pressure can be hooked into the scheduler specified
arch_scale_thermal_pressure to retrieve instantaneous thermal pressure of
a CPU.

arch_set_thermal_pressure can be used to update the thermal pressure.

Considering topology_get_thermal_pressure reads thermal_pressure and
arch_set_thermal_pressure writes into thermal_pressure, one can argue for
some sort of locking mechanism to avoid a stale value.  But considering
topology_get_thermal_pressure can be called from a system critical path
like scheduler tick function, a locking mechanism is not ideal. This means
that it is possible the thermal_pressure value used to calculate average
thermal pressure for a CPU can be stale for up to 1 tick period.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-4-thara.gopinath@linaro.org
…forms

Hook up topology_get_thermal_pressure to arch_scale_thermal_pressure thus
enabling scheduler to retrieve instantaneous thermal pressure of a CPU.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-5-thara.gopinath@linaro.org
Hook up topology_get_thermal_pressure to arch_scale_thermal_pressure thus
enabling scheduler to retrieve instantaneous thermal pressure of a CPU.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-6-thara.gopinath@linaro.org
Thermal governors can request for a CPU's maximum supported frequency to
be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular CPU is
reduced. Delta between the original maximum capacity and capped maximum
capacity is known as thermal pressure. Enable cpufreq cooling device to
update the thermal pressure in event of a capped maximum frequency.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-9-thara.gopinath@linaro.org
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
The thermal pressure signal gives information to the scheduler about
reduced CPU capacity due to thermal. It is based on a value stored in a
per-cpu 'thermal_pressure' variable. The online CPUs will get the new
value there, while the offline won't. Unfortunately, when the CPU is back
online, the value read from per-cpu variable might be wrong (stale data).
This might affect the scheduler decisions, since it sees the CPU capacity
differently than what is actually available.

Fix it by making sure that all online+offline CPUs would get the proper
value in their per-cpu variable when thermal framework sets capping.

Fixes: f12e4f66ab6a3 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Link: https://lore.kernel.org/all/20210614191030.22241-1-lukasz.luba@arm.com/
Bug: 199501011
Change-Id: I10cceb48b72ccce1f51cfc0a7ecfa8d8e67d4394
(cherry picked from commit 2ad8ccc17d1e4270cf65a3f2a07a7534aa23e3fb)
Signed-off-by: Ram Chandrasekar <quic_rkumbako@quicinc.com>
@hxsyzl
Contributor Author

hxsyzl commented Apr 19, 2026

I have completed this work, and compiling it in my environment does not produce any errors. You can check the build results of the test branch (https://github.com/hxsyzl/android_kernel_xiaomi_sm8150/tree/test)

@kondors1995

These thermal changes are not really useful if you are not using a modern sched or the MIUI thermal HAL.

@hxsyzl
Contributor Author

hxsyzl commented Apr 19, 2026


crDroid uses the Mi thermal HAL in A16.

@kondors1995


Yes, Mi thermals only need the qcom changes and the Xiaomi stuff.

So the topology thermal-pressure changes you picked are not needed, since they are for evdf.
