Commit 4966c3b8 authored by aumgn

Progress on the load_balancing analysis

parent e962bc75
+ __schedule -> ... -> detach_task
+ ... -> try_to_wake_up
* Scheduling domains, groups & capacity
* Hmmm
Code in <linux#v4.9>/include/linux/cpumask.h#772 :
#define to_cpumask(bitmap) \
((struct cpumask *)(1 ? (bitmap) \
: (void *)sizeof(__check_is_bitmap(bitmap))))
static inline int __check_is_bitmap(const unsigned long *bitmap)
{
	return 1;
}
Used to statically check the type of the macro parameter, thanks to the type of the function argument.
This function is never called at runtime and is probably optimized away by the compiler.
* Next steps
+ Analysis of the code of the main load balancing algorithm
+ Analysis of the scheduling domain fields and the way they are constructed
+ Analysis of the metrics used for load balancing
* Scheduling domains, groups et al.
** Scheduling domains (struct sched_domain)
+ The configuration of the scheduling domains for the current machine is available through the /proc/schedstat interface
+ Each sched_domain represents a set of shared resources (SMT, cache, NUMA memory controller)
+ Each contains a list of scheduling groups
+ Scheduling domains are stored per CPU (i.e.: copied for each CPU)
+ A sched_domain contains a child field which points to the next lower scheduling domain for the CPU it is stored on
** TODO Scheduling groups (struct sched_group)
+ A scheduling group represents a subset of related CPUs inside a scheduling domain:
=> The intent of this mask is not clear at all
=> Modified in <linux#v4.9>/kernel/sched/core.c#6114
* Next steps
+ Analysis of the scheduling domain construction
+ Analysis of the load/busy metric used for load balancing
+ Look at the SD_WAKE_IDLE scheduling domain flag for experiment
* Criteria
** rebalance_domains -> ... -> detach_task
*** TODO rebalance_domains
+ Each scheduling domain has its own load balancing interval
=> TODO check how it is defined
+ Next balance is the maximum between:
- Current time + 1 minute
- sd->last_balance + get_sd_balance_interval(sd, idle != CPU_IDLE)
*** TODO get_sd_balance_interval(sd, cpu_busy)
+ The interval is stored as a field inside the scheduling domain
+ Supposedly called each time the number of CPUs changes
+ max_load_balance_interval = HZ * num_online_cpus() / 10
*** TODO rebalance_domains
+ Iterates over the scheduling domain hierarchy of the executing CPU
+ Only considers sched_domains whose SD_LOAD_BALANCE flag is set
+ Each scheduling domain has its own load balancing period
+ If the SD_SERIALIZE flag is set, uses a global spinlock `balancing`
=> Used to ensure only a single load balancing run happens for each such domain
=> Seems to be set only for domains whose SD_NUMA flag is also set
=> TODO Does it mean multiple load balancings can happen at the same time for different levels?
+ TODO Uses some kind of decay heuristic related to:
- sched_domain.next_decay_max_lb_cost
- sched_domain.max_newidle_lb_cost
+ Calls load_balance to actually do the balancing for the current CPU and each relevant domain
*** TODO load_balance
+ The cpus variable stores a pointer to the global per-cpu variable load_balance_mask
and the content of cpu_active_mask is directly copied into it
=> TODO Why use a predefined global variable?
=> cpu_active_mask is documented in (<linux#v4.9>/source/include/linux/cpumask.h#52) as part of the cpumask hierarchy:
+ cpu_possible_mask - has bit 'cpu' set iff cpu is populatable
+ cpu_present_mask - has bit 'cpu' set iff cpu is populated
+ cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
+ cpu_active_mask - has bit 'cpu' set iff cpu available to migration
+ If the load balancing is of type NEWLY_IDLE, it doesn't consider CPUs in the same group
+ Calls should_we_balance to check if balancing is actually necessary
*** TODO should_we_balance
+ When the balancing type is NEWLY_IDLE, always returns true
+ Iterates over the CPUs available inside the scheduling group which match the cpumask load_balance_mask
=> TODO load_balance_mask seems to always be 0!
+ If nothing is found, will return true if:
current cpu == cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg))
with : sg, the current sched_group
sched_group_cpus(sg), the cpumask of the current sched_group
sched_group_mask(sg), the cpumask of the current sched_group_capacity
TODO No clue what this is meant to do.
+ From what I can deduce, this function seems to be meant as a filter which keeps only
one CPU: the one elected in each group to actually do the load balancing.
+ Looks for the first idle CPU which appears in the masks:
- sched_group->cpumask
- sched_group_capacity->cpumask
- load_balance_mask (i.e.: cpu_active_mask)
+ If none is found, will use the first CPU which matches the masks:
- sched_group->cpumask
- sched_group_capacity->cpumask
+ TODO Not sure what the point of sched_group_capacity->cpumask is in all this
+ TODO Why is cpu_active_mask considered in the first case but not the second?
+ Will return true if the current CPU is equal to the CPU found.
+ TLDR; This function is a filter which allows only one CPU to continue the
load balancing, prioritizing by idleness and then by CPU id
=> TODO Check whether this is coherent with SD_SERIALIZE allowing multiple simultaneous
load balancings for the same domain
*** load_balance
+ Calls find_busiest_group
*** find_busiest_group
+ Starts by computing stats inside a two-level struct sd_lb_stats which contains struct sg_lb_stats fields.
Stats are initialized with a call to update_sd_lb_stats.
*** TODO update_sd_lb_stats
+ TODO Some really weird logic about preferring siblings
+ Uses the fields sched_domain->{busy,idle,newidle}_idx, which are not really documented
except for "@load_idx: Load index of sched_domain of this_cpu for load calc."
=> They serve as an index into rq->cpu_load
+ TODO rq->cpu_load is updated in the function cpu_load_update. It contains a decaying
average of the CPU load
*** find_busiest_group
+ Calls check_asym_packing.
*** TODO check_asym_packing
+ Related to the POWER7 family of processors
+ "asym" stands for "Asymmetric SMT scheduling"
+ The documentation of the function states that : """
This is primarily intended to used at the sibling level. Some
cores like POWER7 prefer to use lower numbered SMT threads. In the
case of POWER7, it can move to lower SMT modes only when higher
threads are idle. When in lower SMT modes, the threads will
perform better since they share less core resources. Hence when we
have idle threads, we want them to be the higher ones.
This packing function is run on idle threads. It checks to see if
the busiest CPU in this domain (core in the P7 case) has a higher
CPU number than the packing function is being run on. Here we are
assuming lower CPU number will be equivalent to lower a SMT thread
"""
+ "thread" in this context refers to some kind of hyperthreaded core
+ "SMT modes" refers to a capability of POWER7 where the number of hyperthreaded CPU can be
choosen from 1 to 4.
+ TODO "the threads will perform better since they share less core resources" ?
+ Introduced in this patchset:
+ TLDR; An optimization specific to the POWER7 processor: it checks whether the current CPU is
idle and needs to steal the load of a CPU with a higher id inside the same
SMT domain, so that the SMT mode can be scaled down (CPUs with higher ids disabled) and
performance improved