
Some optimizations for Raphael#16

Open
hxsyzl wants to merge 188 commits into crdroidandroid:16.0-raphael from hxsyzl:16.0-raphael

Conversation

Contributor

@hxsyzl hxsyzl commented Apr 17, 2026

lz4 zram cgroupv2 binder thermal psi drm

Demon000 and others added 30 commits April 17, 2026 01:17
Change-Id: I066cd75e12e3af33c356ed79d1ac50b406fbdb99
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Some DT devices, mainly smartphones, do need more trip points to
allow finer-grained thermal mitigation, hence allowing a better user
experience (and overall performance), for example by lowering the CPU
clocks just a little at each temperature step.

Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Currently, when lmh interrupt fires we notify the scheduler about
thermal pressure. The scheduler reduces the capacity of the cpu(s)
for which the interrupt fired, while we requeue our deferrable work.

This leads to an interesting deadlock, since the cpu is running with
reduced capacity, the scheduler does not put any task on it,
extending the idle time and causing it to remain at reduced capacity -
leading to severe underutilization of the cpu(s).

Switch to using delayed work instead of deferrable work. This causes
the cpu to exit idle and keep servicing the work until the thermal
condition is recovered.

Change-Id: I4c27c9a952ed556336297eb25e815c848b8266eb
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
[Helium-Studio: Apply to msm_lmh_dcvs driver]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
Signed-off-by: Nauval Rizky <enuma.alrizky@gmail.com>
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Don't take priority over other workqueues.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
The step wise algorithm ignores the stable temperature trend, which
can cause a case where step wise will not clear the mitigation when
the temperature is lower than the hysteresis and stays stable
throughout.

In order to avoid this, consider temperature stable trend similar to
temperature dropping trend. This will lower the mitigation when the
temperature is less than hysteresis.

Change-Id: Ic7d1d35c74d732562193195e46af945ef41912c2
Signed-off-by: Ram Chandrasekar <rkumbako@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: PrimoDev23 <lexx.ps2711@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 1def1bd05d3d9d317a537039434ab900db4239fe)
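The stable-trend handling described above can be sketched in plain C. This is a userspace toy under stated assumptions: the trend enum and next_target() are illustrative, not the actual step_wise driver API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: a stable trend is treated like a dropping
 * trend, so the mitigation level steps down once the temperature
 * sits below the hysteresis threshold, instead of staying throttled
 * forever while the temperature is stable. */
enum trend { TREND_RAISING, TREND_DROPPING, TREND_STABLE };

static int next_target(enum trend t, int cur_level, bool below_hyst)
{
    if (t == TREND_RAISING)
        return cur_level + 1;                 /* throttle harder */
    /* stable treated like dropping when below hysteresis */
    if ((t == TREND_DROPPING || t == TREND_STABLE) &&
        below_hyst && cur_level > 0)
        return cur_level - 1;                 /* relax mitigation */
    return cur_level;                         /* hold current level */
}
```

Without the TREND_STABLE clause, a zone whose temperature drops below hysteresis and then flatlines would hold its mitigation level indefinitely.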
There is a possibility that a zone will have a polling delay to evaluate
the temperature and determine the mitigation level. The cooling device
mitigated by this zone might also be mitigated by another zone. The
non-mitigating, polling-type thermal zone will take the cooling
device's current mitigation, reduce it by one level, and keep that
level active until the other mitigation clears. Once the other
mitigation clears, the vote from this non-mitigating thermal zone is
reduced step by step. Thus the mitigation is not cleared right away;
it instead takes multiple polling iterations to clear, which can
impact performance.

To avoid this scenario in step-wise algorithm, skip the evaluation of
mitigation when the trip is not reached and there is no previous
mitigation vote. This optimization will prevent the unnecessary
mitigation vote from the non-mitigating thermal zone.

Change-Id: I3939cb53937b650ecf432989f065cfa619297397
Signed-off-by: Ram Chandrasekar <rkumbako@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: PrimoDev23 <lexx.ps2711@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 4dea2c1decfe8e99121e8b43cdcca10ab1f543ee)
Step wise algorithm will always determine the next mitigation action
based on the current operating level. When a cooling device is mitigated
by two rules simultaneously, each rule will have its own mitigation value.
When one rule goes below the clear threshold, step wise algorithm will
still decide to mitigate, since the next mitigation is determined based
on the current operating level. For rules that don't have passive or
active polling, this mitigation action will stay until the next time
the rule triggers the trip threshold.

In order to avoid this, during downward trend and when no throttling,
consider the last mitigation action sent from the rule. If it is already
at the lowest mitigation level, clear any mitigation.

Change-Id: I4ffd5733097393e5128260512b12576a14be8136
Signed-off-by: Ram Chandrasekar <rkumbako@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: PrimoDev23 <lexx.ps2711@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit d5a4c73b9010024a83f8d55b4316682a49ae2203)
The qmi APIs return a positive transaction result value on a
successful qmi message update for all qmi cooling devices. The qmi
cooling devices return the same value for a cooling device state
update request from the thermal framework. But the thermal framework
expects zero as the return value for a successful cdev update, and
only then will it update the cooling device stats for that cooling
device.

Update qmi cooling device driver to return zero on successful cooling
device state update to thermal framework.

Change-Id: Id9d59ed2c88c7e95dc2746a061e0ebadf0477a30
Signed-off-by: Manaf Meethalavalappu Pallikunhi <manafm@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: PrimoDev23 <lexx.ps2711@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 946d978ed0eb3d4be5a60a8a4a2548b7e8c6a4a8)
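A minimal sketch of the return-value fix, assuming a hypothetical helper that receives the raw qmi transaction result (positive on success, negative errno on failure):

```c
#include <assert.h>

/* Hypothetical wrapper: the thermal framework treats only 0 as
 * success, so collapse positive qmi transaction results to 0 and
 * pass real errors through unchanged. */
static int qmi_cdev_set_state(int qmi_ret)
{
    return qmi_ret > 0 ? 0 : qmi_ret;
}
```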
It is unnecessary to update disabled thermal zones post suspend, and
doing so sometimes leads to errors/warnings in badly behaved thermal
drivers.

Bug: 129435616
Change-Id: If5d3bfe84879779ec1ee024c0cf388ea3b4be2ea
Signed-off-by: Wei Wang <wvw@google.com>
Before patch, with "echo 50000 > /sys/class/thermal/tz-by-name/sdm-therm/emul_temp":
com.android.uibench.janktests.UiBenchJankTests#testInvalidateTree: PASSED (02m6.247s)
        gfx-avg-slow-ui-thread: 0.07110321338664297
        gfx-avg-missed-vsync: 0.0
        gfx-avg-high-input-latency: 74.25140826299423
        gfx-max-frame-time-50: 12
        gfx-min-total-frames: 2250
        gfx-avg-frame-time-99: 11.8
        gfx-avg-num-frame-deadline-missed: 1.6
        gfx-avg-frame-time-50: 9.6
        gfx-max-high-input-latency: 99.86666666666667
        gfx-avg-frame-time-90: 11.0
        gfx-avg-frame-time-95: 11.0
        gfx-max-frame-time-95: 13
        gfx-max-frame-time-90: 13
        gfx-max-slow-draw: 0.0
        gfx-max-frame-time-99: 13
        gfx-avg-slow-draw: 0.0
        gfx-max-total-frames: 2251
        gfx-avg-jank: 43.678000000000004
        gfx-max-slow-bitmap-uploads: 0.0
        gfx-max-missed-vsync: 0.0
        gfx-avg-total-frames: 2250
        gfx-max-jank: 96.67
        gfx-max-slow-ui-thread: 0.13333333333333333
        gfx-max-num-frame-deadline-missed: 3
        gfx-avg-slow-bitmap-uploads: 0.0

After patch, with "echo 50000 > /sys/class/thermal/tz-by-name/sdm-therm/emul_temp":
google/perf/jank/UIBench/UIBench (1 Test)
----------------------------------------
[1/1] com.android.uibench.janktests.UiBenchJankTests#testInvalidateTree: PASSED (02m7.027s)
        gfx-avg-slow-ui-thread: 0.0
        gfx-avg-missed-vsync: 0.0
        gfx-avg-high-input-latency: 11.53777777777778
        gfx-max-frame-time-50: 7
        gfx-min-total-frames: 2250
        gfx-avg-frame-time-99: 8.0
        gfx-avg-num-frame-deadline-missed: 0.0
        gfx-avg-frame-time-50: 7.0
        gfx-max-high-input-latency: 41.15555555555556
        gfx-avg-frame-time-90: 7.2
        gfx-avg-frame-time-95: 7.8
        gfx-max-frame-time-95: 8
        gfx-max-frame-time-90: 8
        gfx-max-slow-draw: 0.0
        gfx-max-frame-time-99: 8
        gfx-avg-slow-draw: 0.0
        gfx-max-total-frames: 2250
        gfx-avg-jank: 0.0
        gfx-max-slow-bitmap-uploads: 0.0
        gfx-max-missed-vsync: 0.0
        gfx-avg-total-frames: 2250
        gfx-max-jank: 0.0
        gfx-max-slow-ui-thread: 0.0
        gfx-max-num-frame-deadline-missed: 0
        gfx-avg-slow-bitmap-uploads: 0.0

Bug: 143162654
Test: use emul_temp to change thermal condition and see capacity changed
Change-Id: Idbf943f9c831c288db40d820682583ade3bbf05e
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Whenever BCL interrupt triggers, it notifies thermal framework.
The framework disables the BCL interrupt and initiates a passive
polling to monitor the clear threshold. But BCL peripheral interrupts
are lazily disabled by default: even after BCL initiates the interrupt
disable, it may take some time to take effect in hardware, and during
this window the hardware can trigger the interrupt again. The BCL
driver assumes it is a spurious interrupt and disables it again, which
permanently disables that interrupt.

If the BCL interrupt triggers again after it has been disabled, just
ignore it to avoid nested interrupt disablement. In the above scenario
BCL is already in polling mode, so ignoring this spurious interrupt
doesn't cause any issue.

Bug: 118493676
Change-Id: Ia77fc66eaf66f97bacee96906cc6a5735a6ed158
Signed-off-by: Manaf Meethalavalappu Pallikunhi <manafm@codeaurora.org>
Signed-off-by: Wei Wang <wvw@google.com>
The routines in this driver are required to enable stable performance at
all supported CPU frequencies. Instead of allowing the driver to be
removed completely, always compile it and simply stub out the actual CPU
throttle components, in order to enable full CPU performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Carlos Ayrton Lopez Arroyo <15030201@itcelaya.edu.mx>
Returning before thermal trips due to low battery should fix the lag
caused by restricted thermals. This is what BCL does.

test: check 2-3 cycles of battery percentage under 10%
result: Fixes device lag completely. Maintains constant performance

Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit ecc53270590cc372ed7c509fcefe3de7fe7269d9)

Change-Id: I51eca8a7245e62178054d2ab73960d8fab9d2d28
Cherry-picked from commit 691eceb7d1b6 ("thermal: Create softlink by name
for thermal_zone and cooling device").
Manually tweaked due to code changes.

Bug: 115763013
Change-Id: Ic2d265dcec4caba228b4e86dc1163c13952d8d3a
Signed-off-by: Vincent Palomares <paillon@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Android atrace and catapult's systrace has clock_set_rate tracing
support which can be used for lmh tracing support

Bug: 73656525
Bug: 116457174
Test: Build
Change-Id: I29d1dc4491e7ac077727989eefda824c3f27a64c
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Bug: 149660093
Bug: 150825703
Change-Id: I5608b70c29f8a3aa4a3646436d1059ca155b8c29
Signed-off-by: Wilson Sung <wilsonsung@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Bug: 118439547
Test: tz and cdev softlink can be created at sys/class/thermal
Change-Id: Ied14db5b1b8f96d746fec3db7a65161d14a6e763
Signed-off-by: TeYuan Wang <kamewang@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Bug: 129787335
test: verified thermal-engine throttling by emul temp
Change-Id: I3b687c9205387df2365e41af51fbbaf107c47f56
Signed-off-by: TeYuan Wang <kamewang@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
VIRT_COUNT_THRESHOLD added to qti_virtual_sensor for combining the
triggered conditions of multiple sensors.

Bug: 128625129
Change-Id: Ib0dcb753180f0cec72b1cc4e5c3b2b899abbb9ea
Signed-off-by: George Lee <geolee@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
thermal_zone_get_cdev_by_name added to thermal_core for cooling device
querying by name.

Bug: 128625129
Change-Id: I3253e61800f8514769d1efc33111dec1e9000e26
Signed-off-by: George Lee <geolee@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
This reverts commit 62dc88e.

Bug: 152162350
Signed-off-by: George Lee <geolee@google.com>
Change-Id: I2c7d8f4197eb55df04740f73e6333052ebb15218
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
SOC, VBat, Temperature, and Battery cycle are monitored via thermal
core as part of the sensor fusion to trigger BCL throttling under
extreme conditions.

Bug: 128625129
Change-Id: Icd55a562aabb56698a2806e6d4f1be6e3382816a
Signed-off-by: George Lee <geolee@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Default threshold of 2.989V is causing brownout at low SOC under highly
stressful load.  VBat threshold can be changed via device tree.

Bug: 135629692
Change-Id: Ic24cd9c25bf7baee168776741f318aebdbdb6e0d
Signed-off-by: George Lee <geolee@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
… string

Change the argument similarly to thermal_zone_device_register. This
helps register names obtained from of_property_read_string during
probe.

LKML link: https://lkml.org/lkml/2019/5/14/631

Bug: 128625129
Test: Build
Change-Id: I966fbe10ac84705f5dd4d26af4d0a848a2f36b68
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
When the thermal_zone property is set to “tracks-low” we should init with
THERMAL_TEMP_INVALID_LOW (274000), not THERMAL_TEMP_INVALID (-274000).

Bug: 111683694
Test: Build and thermal mitigation works properly.
Change-Id: I2824809fdfb47c544d30fde6e4866c3ee4e7b25c
Signed-off-by: davidchao <davidchao@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
The lmh_dcvs limits' initial value is U32_MAX, which is unreasonably
high. Set the initial limits value to the cluster maximum frequency,
which is a sensible value to read.

Bug: 130617766
Test: adb shell cat /sys/bus/platform/drivers/msm_lmh_dcvs/18*/lmh*

Change-Id: I3cff40fa70d51876f66f79c2936a37a02d9cbcf3
Signed-off-by: YiHo Cheng <yihocheng@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
There is a small possibility that the usbc-therm thermistor will read
the voltage as zero. This would falsely trigger the usb-port cooling
device protection, so return an error, or skip notifying of_thermal,
when this issue happens.

Bug: 134517322
Test: reboot test 300 times

Change-Id: I7fa8c9a64a33e590e46e7cfb445f28ad60a79cf6
Signed-off-by: TeYuan Wang <kamewang@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Bug: 130391839
Test: Verified CPU/GPU throttling via emul temp
      run pts -m PtsThermalHalTestCases

Change-Id: I2b42780166ca186124f73cec68953d6ca477b5da
Signed-off-by: TeYuan Wang <kamewang@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
hnaz and others added 16 commits April 17, 2026 01:26
Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") updates only cgroups whose state actually changes during a
task switch, and only in the task preempt case, not in the task sleep
case.

We actually don't need to clear and set the TSK_ONCPU state for common
cgroups of the next and prev tasks in the sleep case; that saves many
psi_group_change() calls, especially when most activity comes from one
leaf cgroup.

sleep before:
psi_dequeue()
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
  while ((group = iterate_groups(next)))  # all ancestors
    psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
  nop
psi_task_switch()
  while ((group = iterate_groups(next)))  # until (prev & next)
    psi_group_change(next, .set=TSK_ONCPU)
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
We noticed that the cost of psi increases with the increase in the
levels of the cgroups. Particularly the cost of cpu_clock() sticks out
as the kernel calls it multiple times as it traverses up the cgroup
tree. This patch reduces the calls to cpu_clock().

Performed perf bench on Intel Broadwell with 3 levels of cgroup.

Before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.747 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 3.516 [sec]

       3.516689 usecs/op
         284358 ops/sec

After the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.640 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 3.329 [sec]

       3.329820 usecs/op
         300316 ops/sec

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shakeelb@google.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
psi_group_cpu->tasks, an unsigned int, stores the number of tasks
that could be stalled on a psi resource (io/mem/cpu). Decrementing
these counters at zero leads to wrapping, which in turn leads to
psi_group_cpu->state_mask being set with the respective pressure
state. This can result in unnecessary time sampling for that pressure
state and thus cause spurious psi events, which can further lead to
wrong actions being taken in userland based on those events.

Though psi_bug is set under these conditions, that is just for
debugging. Fix it by decrementing the ->tasks count only when it is
non-zero.

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1618585336-37219-1-git-send-email-charante@codeaurora.org
[Helium-Studio: Simplify the scope in the commit message to `psi` only]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
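The guarded decrement can be illustrated with a small userspace sketch (the function name is hypothetical; the kernel operates on the per-cpu ->tasks array in place):

```c
#include <assert.h>

/* The ->tasks counters are unsigned, so decrementing at zero would
 * wrap to UINT_MAX and fake a stalled-task count. Decrement only
 * when the counter is non-zero; the kernel additionally records
 * psi_bug for debugging when this case is hit. */
static unsigned int psi_task_dec(unsigned int tasks)
{
    if (tasks)
        tasks--;
    return tasks;
}
```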
4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:

  psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
  (schedule() decreasing RUNNING and ONCPU, both of which are 0)

  psi: inconsistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
  (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)

What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.

However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued, but the psi updates are still pending. Since it
looks at the task's scheduler state, it doesn't move everything to the
new cgroup that the task switch that follows is about to clear from
it. cgroup_move_task() will leak the TSK_RUNNING count in the old
cgroup, and psi_sched_switch() will underflow it in the new cgroup.

A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p->in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p->in_iowait task
and move a non-existent TSK_IOWAIT. This results in the inconsistent
task state warning, as well as a counter underflow that will result in
permanent IO ghost pressure being reported.

Fix this bug by making cgroup_move_task() use task->psi_flags instead
of looking at the potentially mismatching scheduler state.

[ We used the scheduler state historically in order to not rely on
  task->psi_flags for anything but debugging. But that ship has sailed
  anyway, and this is simpler and more robust.

  We previously already batched TSK_ONCPU clearing with the
  TSK_RUNNING update inside the deactivation call from schedule(). But
  that ordering was safe and didn't result in TSK_ONCPU corruption:
  unlike most places in the scheduler, cgroup_move_task() only checked
  task_current() and handled TSK_ONCPU if the task was still queued. ]

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
A race was detected between psi_trigger_destroy/create as shown below,
which causes a panic by accessing an invalid
psi_system->poll_wait->wait_queue_entry and
psi_system->poll_timer->entry->next. With this modification, the race
window is removed by initialising poll_wait and poll_timer in
group_init(), which is executed only once at the beginning.

  psi_trigger_destroy()                   psi_trigger_create()

  mutex_lock(trigger_lock);
  rcu_assign_pointer(poll_task, NULL);
  mutex_unlock(trigger_lock);
					  mutex_lock(trigger_lock);
					  if (!rcu_access_pointer(group->poll_task)) {
					    timer_setup(poll_timer, poll_timer_fn, 0);
					    rcu_assign_pointer(poll_task, task);
					  }
					  mutex_unlock(trigger_lock);

  synchronize_rcu();
  del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
                                  psi_trigger_create()

So, trigger_lock/RCU correctly protects destruction of
group->poll_task but misses this race affecting poll_timer and
poll_wait.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Co-developed-by: ziwei.dai <ziwei.dai@unisoc.com>
Signed-off-by: ziwei.dai <ziwei.dai@unisoc.com>
Co-developed-by: ke.wang <ke.wang@unisoc.com>
Signed-off-by: ke.wang <ke.wang@unisoc.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
We've noticed cases where tasks in a cgroup are stalled on memory but
there is little memory FULL pressure since tasks stay on the runqueue
in reclaim.

A simple example involves a single threaded program that keeps leaking
and touching large amounts of memory. It runs in a cgroup with swap
enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
there is significant CPU pressure and memory SOME, there is barely any
memory FULL since the task enters reclaim and stays on the runqueue.
However, this memory-bound task is effectively stalled on memory and
we expect memory FULL to match memory SOME in this scenario.

The code is confused about memstall && running, thinking there is a
stalled task and a productive task when there's only one task: a
reclaimer that's counted as both. To fix this, we redefine the
condition for PSI_MEM_FULL to check that all running tasks are in an
active memstall instead of checking that there are no running tasks.

        case PSI_MEM_FULL:
-               return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
+               return unlikely(tasks[NR_MEMSTALL] &&
+                       tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);

This will capture reclaimers. It will also capture tasks that called
psi_memstall_enter() and are about to sleep, but this should be
negligible noise.

Signed-off-by: Brian Chen <brianchen118@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
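The redefined PSI_MEM_FULL test quoted in the diff above can be lifted into a standalone helper for illustration (the helper itself is hypothetical; the kernel indexes a per-cpu tasks array):

```c
#include <assert.h>
#include <stdbool.h>

/* Memory is FULL when there are memstalled tasks and every running
 * task is itself in an active memstall (i.e. a reclaimer), rather
 * than requiring that no task is running at all. */
static bool mem_full(unsigned int running, unsigned int memstall,
                     unsigned int memstall_running)
{
    return memstall && running == memstall_running;
}
```

The single-threaded reclaimer from the example above (one task, both running and memstalled) now counts as FULL, whereas the old `memstall && !running` test missed it.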
Martin found it confusing when looking at the /proc/pressure/cpu
output, and found no hint about the CPU "full" line in the psi
documentation.

% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277

The PSI_CPU_FULL state was introduced by commit e7fcd7622823
("psi: Add PSI_CPU_FULL state"), mainly for the cgroup level, but it
is also counted at the system level as a side effect.

Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
schedule latency. For example, t1 is the time when a task wakes up
on an idle CPU, and t2 is the time when the CPU picks it and switches
to it. The delta (t2 - t1) will be in the CPU_FULL state.

Another case where all processes can be stalled is when all cgroups
have been throttled at the same time, which is unlikely to happen.

Anyway, the CPU_FULL metric is meaningless and confusing at the
system level. So this patch reports zeroes for CPU full at the system
level, and updates the psi documentation accordingly.

Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
[Helium-Studio: Simplify the scope in the commit message to `psi` only]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
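The system-level zeroing can be sketched as follows (names are illustrative, not the kernel's; the real code special-cases the psi_system group when formatting /proc/pressure/cpu):

```c
#include <assert.h>
#include <stdbool.h>

/* Report zero for the CPU "full" total at the system level, while
 * cgroup-level groups keep their real accumulated value. */
static long long cpu_full_total(bool system_level, long long raw_total)
{
    return system_level ? 0 : raw_total;
}
```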
Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:

poll_timer_fn
  wake_up_interruptible(&group->poll_wait);

psi_poll_worker
  wait_event_interruptible(group->poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&group->poll_timer)) return;
      ...
      mod_timer(&group->poll_timer, jiffies + delay);

Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes increased number of wakeups and our partners
report visible power regression of ~10mA after applying 461daba06bdc.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change however it will limit it to the
worst case scenario of one extra wakeup per every tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:

Before the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           684365   1385156    1261240
Immediate reschedules blocked:             682846   1381654    1258682
Immediate reschedules (delta):             1519     3502       2558
Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%

After the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           882244   770298    426218
Immediate reschedules blocked:             881996   769796    426074
Immediate reschedules (delta):             248      502       144
Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%

The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <yt.chang@mediatek.com>
Reported-by: Wenju Xu <wenju.xu@mediatek.com>
Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: SH Chen <show-hong.chen@mediatek.com>
Link: https://lore.kernel.org/r/20221028194541.813985-1-surenb@google.com
[Helium-Studio: Simplify the scope in the commit message to `psi` only]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
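The poll_scheduled gate can be sketched with C11 atomics as a userspace toy (function names are hypothetical; the kernel uses its own atomic_t API, and the real reschedule path also involves the timer and wait queue):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* atomic_exchange returns the previous value and is a full barrier,
 * so only the caller that flips 0 -> 1 actually schedules the poll
 * work, and the changed_states update is ordered against the flag. */
static atomic_int poll_scheduled;

static bool try_schedule_poll(void)
{
    return atomic_exchange(&poll_scheduled, 1) == 0;
}

/* Reset only when the polling window expires, which is what narrows
 * the race window described above; returns whether a poll was
 * actually pending. */
static bool polling_window_expired(void)
{
    return atomic_exchange(&poll_scheduled, 0) == 1;
}
```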
Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Helium-Studio: Simplify the scope in the commit message to `psi` only]
Signed-off-by: Helium-Studio <67852324+Helium-Studio@users.noreply.github.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: neobuddy89 <neobuddy89@gmail.com>
Same concept as here: kerneltoast/android_kernel_google_wahoo@fe23bc0
Extended version that covers more cases.

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
[kdrag0n: Fixed compile error in Adreno driver when debugfs is enabled]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: neobuddy89 <neobuddy89@gmail.com>
Change-Id: Icfbc59fff5647b28631a670facc93ff837687d4c
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
The current brightness level mapping does not correctly map the
brightness level range from user space to the range supported by the
panel.

For example if the max user brightness reported is 4095, and panel
backlight range is 0-255. Then user is expected to be able to set
brightness in range from 0-4095, but current logic truncates at
bl-max (255). Moreover it doesn't take into account bl-min.

Fix logic such that the brightness range set by user correctly scales to
the backlight level from panel.

Bug: 139263611
Change-Id: Ic70909af63fb5b66ebc1434477f2fc41a785ce1f
Signed-off-by: Adrian Salido <salidoa@google.com>
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
We have cgroup v2, so use the PSI pressure interface.
@hxsyzl hxsyzl marked this pull request as draft April 18, 2026 12:46
@hxsyzl
Contributor Author

hxsyzl commented Apr 18, 2026

I'm trying to backport the newest thermal changes.

@firebird11 firebird11 self-assigned this Apr 18, 2026
@firebird11 firebird11 marked this pull request as ready for review April 18, 2026 21:05
hxsyzl and others added 6 commits April 19, 2026 08:06
We already have it, so this pick is unnecessary.
This reverts commit 8333774.
…stantaneous thermal pressure

Add architecture specific APIs to update and track thermal pressure on a
per CPU basis. A per CPU variable thermal_pressure is introduced to keep
track of instantaneous per CPU thermal pressure. Thermal pressure is the
delta between maximum capacity and capped capacity due to a thermal event.

topology_get_thermal_pressure can be hooked into the scheduler specified
arch_scale_thermal_pressure to retrieve instantaneous thermal pressure of
a CPU.

arch_set_thermal_pressure can be used to update the thermal pressure.

Considering topology_get_thermal_pressure reads thermal_pressure and
arch_set_thermal_pressure writes into thermal_pressure, one can argue for
some sort of locking mechanism to avoid a stale value.  But considering
topology_get_thermal_pressure can be called from a system critical path
like scheduler tick function, a locking mechanism is not ideal. This means
that it is possible the thermal_pressure value used to calculate average
thermal pressure for a CPU can be stale for up to 1 tick period.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-4-thara.gopinath@linaro.org
…forms

Hook up topology_get_thermal_pressure to arch_scale_thermal_pressure thus
enabling scheduler to retrieve instantaneous thermal pressure of a CPU.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-5-thara.gopinath@linaro.org
Hook up topology_get_thermal_pressure to arch_scale_thermal_pressure thus
enabling scheduler to retrieve instantaneous thermal pressure of a CPU.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-6-thara.gopinath@linaro.org
Thermal governors can request for a CPU's maximum supported frequency to
be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular CPU is
reduced. Delta between the original maximum capacity and capped maximum
capacity is known as thermal pressure. Enable cpufreq cooling device to
update the thermal pressure in event of a capped maximum frequency.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-9-thara.gopinath@linaro.org
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
The thermal pressure signal gives information to the scheduler about
reduced CPU capacity due to thermal. It is based on a value stored in a
per-cpu 'thermal_pressure' variable. The online CPUs will get the new
value there, while the offline won't. Unfortunately, when the CPU is back
online, the value read from per-cpu variable might be wrong (stale data).
This might affect the scheduler decisions, since it sees the CPU capacity
differently than what is actually available.

Fix it by making sure that all online+offline CPUs would get the proper
value in their per-cpu variable when thermal framework sets capping.

Fixes: f12e4f66ab6a3 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Link: https://lore.kernel.org/all/20210614191030.22241-1-lukasz.luba@arm.com/
Bug: 199501011
Change-Id: I10cceb48b72ccce1f51cfc0a7ecfa8d8e67d4394
(cherry picked from commit 2ad8ccc17d1e4270cf65a3f2a07a7534aa23e3fb)
Signed-off-by: Ram Chandrasekar <quic_rkumbako@quicinc.com>
@hxsyzl
Contributor Author

hxsyzl commented Apr 19, 2026

I have completed this work, and compiling it in my environment does not produce any errors. You can check the build results of the test branch (https://github.com/hxsyzl/android_kernel_xiaomi_sm8150/tree/test)

@kondors1995

These thermal changes are not really useful if you are not using a modern sched or the MIUI thermal HAL.

@hxsyzl
Contributor Author

hxsyzl commented Apr 19, 2026


crDroid uses the Mi thermal HAL in A16.

@kondors1995


Yes, Mi thermals only need the qcom changes and the Xiaomi stuff.

So the topology thermal-pressure changes you picked are not needed, since they are for evdf.
