Increase cooperation of cpufreq with the scheduler

Registered by Amit Kucheria on 2012-10-05

Based on preliminary discussions at LPC, start prototyping possible solutions for increasing cooperation between cpufreq and the scheduler.

DVFS (cpufreq) is currently triggered by a governor (ondemand,
interactive, etc.) that scales the frequency based on certain
heuristics about the load on the system. In doing so, it maintains a
lot of statistics to track system load.

There are two problems with it which affect both the power and the
performance of our system:
- We evaluate all these statistics in cpufreq, although this
  information is already available to the scheduler as it moves tasks
  around.
- Cpufreq's response to a load change is a reaction and not really an
  action, i.e. cpufreq reacts only after at least a few milliseconds
  (~20-100), because its governors schedule timers to monitor the
  current load on the cpus, and that happens at the governor's
  sampling rate.

If these decisions or inputs are taken directly from the scheduler,
then we can change the frequency according to the load just before a
task is moved to a cpu's runqueue, and hence get better performance
and power too.

There is an agreement that DVFS could be more efficient if driven by
the scheduler[1].

Getting rid of cpufreq could end up being a multi-step process:
- Add missing statistics required by the governors to the scheduler
- Convert the governors over to use scheduler statistics instead of
  their own heuristics
  OR
  Write a new 'sched' governor that uses statistics from the
  scheduler to drive DVFS
- Get rid of governors and just drive the DVFS transition from the
  scheduler if it works with older platforms too.

[1] https://lkml.org/lkml/2012/2/7/504
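
The reaction delay described above (~20-100 ms) can be illustrated with a small user-space sketch. This is not kernel code: the function names are hypothetical, the 20 ms sampling period is an assumed value, and the scheduler-driven path is assumed to request a frequency change with no delay at enqueue time. It only shows why a timer-driven governor notices a load change late.

```python
def sampled_reaction_delay_ms(load_change_ms, sampling_period_ms):
    """Delay until a timer-driven governor's next sample sees a load change.

    The (simulated) governor samples at t = 0, P, 2P, ...; a change at
    time t is first observed at the next sample boundary at or after t.
    """
    remainder = load_change_ms % sampling_period_ms
    return 0 if remainder == 0 else sampling_period_ms - remainder


def scheduler_driven_delay_ms(load_change_ms):
    """Delay when the request is issued at enqueue time (assumed to be 0)."""
    return 0


if __name__ == "__main__":
    period = 20  # ms, assumed governor sampling period
    for t in (1, 5, 19, 20, 33):
        print(t, sampled_reaction_delay_ms(t, period),
              scheduler_driven_delay_ms(t))
```

In this model the worst case for the sampled governor is just under one full sampling period. A real governor may take longer, since it also has to accumulate load statistics over the sampling window before deciding.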

Blueprint information

Status:
Complete
Approver:
Amit Kucheria
Priority:
Medium
Drafter:
None
Direction:
Needs approval
Assignee:
viresh kumar
Definition:
Obsolete
Series goal:
Accepted for trunk
Implementation:
Slow progress
Milestone target:
2013.08
Started by
David Zinman on 2012-10-09
Completed by
Serge Broslavsky on 2013-09-28

Related branches

Sprints

Whiteboard

Meta:
Roadmap id: CARD-190
Headline: Increase cooperation of cpufreq/cpuidle with the scheduler
Acceptance: TBD

Design: The following is what I always had in mind regarding how to do it:
- Move {affected|related}_cpus out of cpufreq into scheduler

https://blueprints.launchpad.net/linaro-power-kernel/+spec/scheduler-better-cpu-topology

- Keep the policy/governor part as is in cpufreq and don't unnecessarily overcomplicate the scheduler. Instead, remove the periodic timer interrupts from the cpufreq governors and register a notifier with the scheduler, which gets called as soon as the load changes by a certain percentage. Cpufreq can then decide on the target frequency with the existing governors, like ondemand/conservative, based on the tunables exported to sysfs.
- No need to create any new governor for now but a sched governor might be useful later on.
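
The notifier idea in the design notes above could look roughly like the following user-space sketch. It is only an illustration under stated assumptions: `LoadNotifier` and `OndemandLikeGovernor` are hypothetical names, not kernel APIs; "changes by a certain percentage" is interpreted here as a change in percentage points since the last notification; and the frequency choice is a simplified stand-in loosely modelled on an `up_threshold` style tunable, not the actual ondemand algorithm.

```python
class LoadNotifier:
    """Hypothetical scheduler-side hook: calls back when load moves enough."""

    def __init__(self, threshold_pct):
        self.threshold_pct = threshold_pct  # assumed: percentage points
        self.callbacks = []
        self.last_reported = 0  # load (0-100) at the last notification

    def register(self, callback):
        self.callbacks.append(callback)

    def update_load(self, load_pct):
        """Called by the (simulated) scheduler whenever cpu load changes."""
        if abs(load_pct - self.last_reported) < self.threshold_pct:
            return False  # change too small, the governor is not called
        self.last_reported = load_pct
        for callback in self.callbacks:
            callback(load_pct)
        return True


class OndemandLikeGovernor:
    """Simplified governor: no timer, it only reacts to notifications."""

    def __init__(self, freqs_khz, up_threshold=80):
        self.freqs_khz = sorted(freqs_khz)
        self.up_threshold = up_threshold  # stand-in for a sysfs tunable
        self.target_khz = self.freqs_khz[0]

    def on_load_change(self, load_pct):
        if load_pct > self.up_threshold:
            # Heavy load: jump straight to the highest frequency.
            self.target_khz = self.freqs_khz[-1]
            return
        # Otherwise pick the lowest frequency covering the load,
        # expressed as a share of the maximum frequency.
        wanted = self.freqs_khz[-1] * load_pct / 100
        self.target_khz = next(f for f in self.freqs_khz if f >= wanted)


if __name__ == "__main__":
    notifier = LoadNotifier(threshold_pct=10)
    governor = OndemandLikeGovernor([200000, 500000, 1000000])
    notifier.register(governor.on_load_change)
    for load in (30, 32, 90, 5):
        notified = notifier.update_load(load)
        print(load, notified, governor.target_khz)
```

The point of the split is that the policy (thresholds, frequency choice) stays on the cpufreq side, while the scheduler only reports load changes that are large enough to matter.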

[Viresh, Oct 8, 2012] Didn't find much about this blueprint in the LPC slides or TI's proposal
[dzin, Oct 10, 2012] Please add a Headline, Acceptance and Roadmap id.
[dzin, Nov 24, 2012] Viresh has been pulled into the big.LITTLE IKS project and therefore this is not currently being worked on

Other Work Items Done during this period:
- Prepare Demo for Connect: Task placement with/without HMP patches: DONE
- Test HMP patches from Pantelis Antoniou (TI) on TC2: DONE

01/2013:
- Simplify cpufreq_add_dev and __cpufreq_remove_dev in cpufreq core: DONE

Notes on core work item:
- 2012.11: Trying to understand CFS and other scheduler parts. Will take some time.

- 2013.06: Finally back on this activity after a long, long gap.
  - Started with Vincent's scheduler presentations to understand the scheduler: watched 1 hour of video so far.
  - It talked mostly about scheduling domains, so I stopped the video and went to understand how scheduling domains are created by Linux (ongoing).
  - Found some issues and sent fixes for them, applied by Ingo:
   545e0b4 sched: Remove unused params of build_sched_domain()
   fc47352 sched: Optimize build_sched_domains() for saving first SD node for a cpu
   62b4f88 sched: Rename sched.c as sched/core.c in comments and Documentation

- 2013.06: A few more patches went in:
e3e5340 sched: remove WARN_ON(!sd) from init_sched_groups_power()
fa73208 sched: Fix memory leakage in build_sched_groups()
834ecd5 sched: Use cached value of span instead of calling sched_domain_span()
e0e0454 sched: Create for_each_sd_topology()
f4753a1 sched: don't set sd->child to NULL when it is already NULL
df57734 sched: don't initialize alloc_state in build_sched_domains

This activity is stopped now due to LNG move.


Work Items

Work items for 2012.11:
Review slides from LPC and proposal from TI: DONE
Fix minor issues in CPUFREQ-sysfs interface and vexpress cpufreq-driver: DONE
Fix code duplication within governors, patch sent for review: DONE
Fix sparse warnings for CPUFREQ framework: DONE
Go through CPUFREQ kernel framework: DONE
Go through CPUFREQ Ondemand Governor code: DONE
Submit Kernel patch for taking common parts of governors to a separate file: DONE
Test above patch with cpufreq-bench: DONE

Work items for 2013.06:
Try to understand how sched domains and groups work: DONE
Clean code related to sched domains/groups in case something wrong is found: DONE

Work items for 2013.07:
Try to understand existing scheduler code: INPROGRESS
Take inputs from scheduler for switching frequencies: TODO

Dependency tree

* Blueprints in grey have been implemented.