Commit 74c43834 authored by aumgn

Progress on the thread placement analysis

parent 447f63c2
migrations/callgraph.png changed: 219 KB -> 266 KB
@@ -6,17 +6,24 @@
+ task_struct (sched.h): Represents a thread
+ lb_env (fair.c): Contains information about a possible migration
* set_task_cpu/__set_task_cpu (core.c):
** set_task_cpu
+ calls the callback of the scheduling class to actually migrate the thread
+ calls __set_task_cpu
+ triggers tracepoint and perf_event for migrations
+ Always used for migrations (as opposed to initial placement of thread)
** __set_task_cpu
+ Called to update the mapping thread<->cpu independently of the scheduling class
+ Called by set_task_cpu for migrations
+ Called by sched_fork/wake_up_new_task for initial placement
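To make the split concrete, here is a standalone C model of the two entry points. The
names mirror <linux#v4.9>/kernel/sched/core.c, but the types and the tracing call are
simplified stand-ins, not the kernel's code:
#+BEGIN_SRC C
#include <stdio.h>

struct task_struct;

/* stand-in for the kernel's scheduling-class vtable */
struct sched_class {
        void (*migrate_task_rq)(struct task_struct *p);
};

struct task_struct {
        int cpu;                              /* the thread<->cpu mapping */
        const struct sched_class *sched_class;
};

/* Class-independent part: only updates the mapping. Also used by
 * sched_fork/wake_up_new_task for the initial placement. */
static void __set_task_cpu(struct task_struct *p, int cpu)
{
        p->cpu = cpu;
}

/* Migration-only entry point: lets the scheduling class react, fires
 * the tracepoint/perf notifications, then updates the mapping. */
static void set_task_cpu(struct task_struct *p, int new_cpu)
{
        if (p->cpu != new_cpu) {
                if (p->sched_class->migrate_task_rq)
                        p->sched_class->migrate_task_rq(p);
                /* stands in for trace_sched_migrate_task + perf_event */
                printf("migrate: %d -> %d\n", p->cpu, new_cpu);
        }
        __set_task_cpu(p, new_cpu);
}

int main(void)
{
        const struct sched_class fair = { .migrate_task_rq = 0 };
        struct task_struct t = { .cpu = 0, .sched_class = &fair };
        set_task_cpu(&t, 2);                  /* migration path */
        __set_task_cpu(&t, 3);                /* initial-placement path */
        return 0;
}
#+END_SRC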
* Call Graph
#+BEGIN_SRC dot :file callgraph.png
digraph call_graph {
"set_task_cpu";
"set_task_cpu" -> "__set_task_cpu";
subgraph cluster1 {
label = "CFS load balancing";
@@ -34,11 +41,15 @@
"idle_balance" -> "load_balance" [style=solid];
}
"nohz_balancer_kick" -> "scheduler_ipi" [style=dotted, label=""];
"detach_task" -> "set_task_cpu" [style=solid];
"pick_next_task_fair" -> "idle_balance" [style=solid];
"sched_fork" -> "__set_task_cpu" [style=solid];
"wake_up_new_task" -> "__set_task_cpu" [style=solid];
"__schedule" -> "pick_next_task" [style=solid];
"scheduler_tick" -> "trigger_load_balance" [style=solid];
"scheduler_ipi" -> "run_rebalance_domains" [style=dotted, label="SCHED_SOFTIRQ"];
@@ -99,6 +110,9 @@
+ scheduler_ipi -> ... -> detach_task
+ __schedule -> ... -> detach_task
+ ... -> try_to_wake_up
+ ... -> sched_fork
+ ... -> wake_up_new_task
+ sched_exec -> ... -> move_queued_task
* Hmmm
@@ -134,30 +148,7 @@
+ Scheduling domains are stored per CPU (i.e.: copied for each CPU).
+ A sched_domain contains a child field which points to the lower scheduling domain for
the storing CPU
+ Contains obscure fields {busy,idle,newidle}_idx (see update_sd_lb_stats)
+ Among the flags, one is named SD_OVERLAP and documented as :
"sched_domains of this level overlap"
@@ -165,7 +156,7 @@
+ Scheduling group represents a subset of related CPUs inside a scheduling domain:
- one sched_group = one CPU, for the lowest level of domains
- one sched_group = one scheduling domain of the underlying level, for the other levels
+ Stored per CPU
+ Contains a mask of CPUs that this group spans
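A hedged structural sketch of this relationship (field names loosely follow
<linux#v4.9>/kernel/sched/sched.h; most fields omitted):
#+BEGIN_SRC C
/* Simplified stand-ins, not the kernel's definitions. */
struct sched_group {
        struct sched_group *next;      /* groups of a domain form a circular list */
        unsigned long cpumask[1];      /* the CPUs this group spans */
};

struct sched_domain {
        struct sched_domain *parent;   /* next (larger) level, NULL at the top */
        struct sched_domain *child;    /* lower level, as seen from the storing CPU */
        struct sched_group *groups;    /* one group per CPU (lowest level) or per
                                        * child domain (other levels) */
        unsigned long span[1];         /* the CPUs this domain spans */
};
#+END_SRC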
** TODO Scheduling group capacity (struct sched_group_capacity)
@@ -182,6 +173,8 @@
"""
+ Seems to be only modified with atomic operations.
+ Would probably be better named 'sched_group_shared'
=> contains fields irrelevant to capacity, kept here only for the memory sharing
aspect
+ TODO Also contains a mask of CPUs (described as an "iteration mask")
=> Introduced in c1174876874dcf: """
sched: Fix domain iteration
@@ -195,14 +188,99 @@
[...]
"""
=> Set up in <linux#v4.9>/kernel/sched/core.c#6114 on initialization or update of
the sched_domains. (Updated because of hotplug of a CPU for example)
+ From "Critical Blue", capacity is used to reflect the fact that two hyperthreaded
cores do not provide the same computational value than two not hyperthreaded.
+ TL;DR: struct which contains additional information for sched_group, split apart
for some CPU sharing reason.
* Scheduling domains example
#+BEGIN_SRC sh
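# Dump each scheduling domain of CPU 0: its name, its decoded flags, and its
# load-index tunables. The flag values below match <linux#v4.9>/include/linux/sched.h.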
declare -A flags_decl
flags_decl[SD_LOAD_BALANCE]=0x0001
flags_decl[SD_BALANCE_NEWIDLE]=0x0002
flags_decl[SD_BALANCE_EXEC]=0x0004
flags_decl[SD_BALANCE_FORK]=0x0008
flags_decl[SD_BALANCE_WAKE]=0x0010
flags_decl[SD_WAKE_AFFINE]=0x0020
flags_decl[SD_ASYM_CPUCAPACITY]=0x0040
flags_decl[SD_SHARE_CPUCAPACITY]=0x0080
flags_decl[SD_SHARE_POWERDOMAIN]=0x0100
flags_decl[SD_SHARE_PKG_RESOURCES]=0x0200
flags_decl[SD_SERIALIZE]=0x0400
flags_decl[SD_ASYM_PACKING]=0x0800
flags_decl[SD_PREFER_SIBLING]=0x1000
flags_decl[SD_OVERLAP]=0x2000
flags_decl[SD_NUMA]=0x4000
for domain in /proc/sys/kernel/sched_domain/cpu0/domain*; do
echo -n "$(basename $domain) : "
cat $domain/name
flags=$(cat $domain/flags)
printf " Flags :\n"
for k in "${!flags_decl[@]}"; do
if [[ $(( $flags & ${flags_decl[$k]} )) -ne 0 ]]; then
printf " $k\n"
fi
done
printf " Indexes :\n"
printf " | busy | idle | newidle | wake | forkexec |\n"
printf " | %3d | %3d | %3d | %3d | %3d |\n" \
$(cat $domain/busy_idx) \
$(cat $domain/idle_idx) \
$(cat $domain/newidle_idx) \
$(cat $domain/wake_idx) \
$(cat $domain/forkexec_idx)
done
#+END_SRC
:RESULTS:
On mc2
domain0 : MC
Flags :
SD_BALANCE_EXEC
SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES
SD_LOAD_BALANCE
SD_BALANCE_FORK
SD_BALANCE_NEWIDLE
Indexes :
| busy | idle | newidle | wake | forkexec |
| 2 | 0 | 0 | 0 | 0 |
domain1 : NUMA
Flags :
SD_BALANCE_EXEC
SD_WAKE_AFFINE
SD_NUMA
SD_OVERLAP
SD_SERIALIZE
SD_LOAD_BALANCE
SD_BALANCE_FORK
SD_BALANCE_NEWIDLE
Indexes :
| busy | idle | newidle | wake | forkexec |
| 3 | 2 | 0 | 0 | 0 |
domain2 : NUMA
Flags :
SD_BALANCE_EXEC
SD_WAKE_AFFINE
SD_NUMA
SD_OVERLAP
SD_SERIALIZE
SD_LOAD_BALANCE
SD_BALANCE_FORK
SD_BALANCE_NEWIDLE
Indexes :
| busy | idle | newidle | wake | forkexec |
| 3 | 2 | 0 | 0 | 0 |
:END:
* Notes per function
** rebalance_domains -> load_balance
@@ -246,7 +324,7 @@
and directly copy the content of cpu_active_mask into it
=> TODO Why use a predefined global variable ?
+ Probably the code section is not interruptible, hence the variable is ensured not
to be concurrently accessed
+ Maybe some kind of memory optimization
=> cpu_active_mask is defined as (<linux#v4.9>/source/include/linux/cpumask.h#52) :
+ cpu_possible_mask - has bit 'cpu' set iff cpu is populatable
@@ -255,6 +333,7 @@
+ cpu_active_mask - has bit 'cpu' set iff cpu available to migration
TODO Is there a meaningful difference between available to the scheduler and available to migration ?
+ If load_balancing is of type NEWLY_IDLE, it doesn't consider CPUs in the same group
=> Explained in cfc0311804717
+ Calls should_we_balance to check if balancing is actually necessary
*** TODO should_we_balance
@@ -277,6 +356,7 @@
*** load_balance
+ Calls find_busiest_group
+ Calls find_busiest_queue
*** find_busiest_group
+ Starts by computing stats inside a two level struct sd_lb_stats with field
@@ -284,8 +364,8 @@
+ Calls check_asym_packing.
+ Differentiates between a group which is (enum group_type fair.c#6715):
- Overloaded, i.e. its load is greater than its capacity
=> Slight difference between being overloaded and having no capacity left
(when the load is equal to the capacity)
- TODO Imbalanced, from field sched_group_capacity->imbalance
which may have been set by a previous call to load_balance
(set in function load_balance line 7824 & unset line 7914)
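For reference, the classification enum as defined in v4.9:
#+BEGIN_SRC C
/* <linux#v4.9>/kernel/sched/fair.c, enum group_type; the order matters:
 * update_sd_pick_busiest prefers groups with a higher group_type. */
enum group_type {
        group_other = 0,
        group_imbalanced,
        group_overloaded,
};
#+END_SRC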
@@ -357,33 +437,140 @@
performance improved
* Thread placement
*** find_busiest_queue
+
*** select_task_rq_fair
+ In the regular case (i.e. not SD_WAKE_AFFINE) :
- It finds the highest sched_domain which has the flags SD_LOAD_BALANCE and
SD_BALANCE_{FORK,EXEC,WAKE} set.
+ TODO the iteration will actually end as soon as one of the scheduling domains
does not have one of these flags.
+ If none is found, then fall back to new_cpu, except for SD_BALANCE_WAKE
which will look for an idle sibling (select_idle_sibling).
- It will look for the idlest group in this domain (find_idlest_group)
=> If not found (return value is NULL) retry with child sched_domain
- It will look for the idlest CPU in this group (find_idlest_cpu)
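As a standalone control-flow sketch (simplified stand-in types and stub helpers,
mirroring the description above rather than the exact kernel code):
#+BEGIN_SRC C
#include <stddef.h>

#define SD_LOAD_BALANCE 0x0001           /* value as in v4.9 sched.h */

struct task_struct;
struct sched_group;

struct sched_domain {
        int flags;
        struct sched_domain *parent;     /* larger domain */
        struct sched_domain *child;      /* smaller domain */
};

/* Trivial stubs standing in for the helpers analysed in the next
 * sections; the real ones compute group/CPU loads. */
static struct sched_group *find_idlest_group(struct sched_domain *sd,
                                             struct task_struct *p, int cpu)
{ (void)sd; (void)p; (void)cpu; return NULL; }
static int find_idlest_cpu(struct sched_group *group,
                           struct task_struct *p, int cpu)
{ (void)group; (void)p; (void)cpu; return cpu; }

int select_task_rq_sketch(struct task_struct *p, int prev_cpu,
                          struct sched_domain *lowest, int sd_flag)
{
        struct sched_domain *tmp, *sd = NULL;
        int new_cpu = prev_cpu;          /* fallback if nothing better */

        /* 1. Keep the highest domain that has both SD_LOAD_BALANCE and
         *    the requested SD_BALANCE_* flag; per the note above, the
         *    walk stops at the first domain missing one of them. */
        for (tmp = lowest; tmp; tmp = tmp->parent) {
                if (!(tmp->flags & SD_LOAD_BALANCE) || !(tmp->flags & sd_flag))
                        break;
                sd = tmp;
        }

        /* 2. Descend: idlest group, then idlest CPU inside it; a NULL
         *    group means "retry one level lower". */
        while (sd) {
                struct sched_group *group = find_idlest_group(sd, p, new_cpu);
                if (!group) {
                        sd = sd->child;
                        continue;
                }
                new_cpu = find_idlest_cpu(group, p, new_cpu);
                sd = sd->child;          /* keep refining at lower levels */
        }
        return new_cpu;
}
#+END_SRC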
*** find_idlest_group
+ Computes the average load for each group (using the scheme described for
find_busiest_group)
+ Will return the group whose average is minimum (NULL otherwise)
if its average load satisfies :
=> 100 * this_load >= (100 + (sd->imbalance_pct-100)/2) * min_load.
Where this_load is the load of the executing CPU.
Usually:
100 * this_load >= [105..112] * min_load
this_load - min_load >= [5%..12%] * min_load
=> The difference of load has to be more than 5-12% of the min load, otherwise the
old CPU is considered better
TODO executing CPU and old_cpu might be different ?
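A worked instance of that test; the 110 and 125 values for sd->imbalance_pct are what
v4.9 appears to use for SMT and default domains respectively, which yields the
[105..112] factor range above:
#+BEGIN_SRC C
#include <stdio.h>

/* find_idlest_group acceptance test quoted above:
 * 100 * this_load >= (100 + (imbalance_pct - 100) / 2) * min_load */
static int remote_group_wins(long this_load, long min_load, int imbalance_pct)
{
        /* the remote group wins only if it is at least ~5-12% idler */
        return 100 * this_load >= (100 + (imbalance_pct - 100) / 2) * min_load;
}

int main(void)
{
        /* this_load = 110, min_load = 100: accepted with the SMT factor
         * 105 (pct = 110), rejected with the default factor 112 (pct = 125). */
        printf("%d\n", remote_group_wins(110, 100, 110)); /* prints 1 */
        printf("%d\n", remote_group_wins(110, 100, 125)); /* prints 0 */
        return 0;
}
#+END_SRC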
*** find_idlest_cpu
+ Might return -1 (i.e.: no CPU found)
+ Looks at all potential destination CPUs using the following criteria, from most
important to least :
- Idle and exit_latency is minimum and idle_stamp is maximum
- Idlest, i.e.: whose load is minimum.
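A standalone sketch of that selection order (stand-in types, not the kernel's): any
idle candidate beats any busy one; idle CPUs are compared by (exit_latency, idle_stamp),
busy CPUs only by load.
#+BEGIN_SRC C
struct cpu_stat {
        int idle;                        /* currently idle? */
        unsigned int exit_latency;       /* of the idle state it sits in */
        unsigned long long idle_stamp;   /* when it went idle */
        unsigned long load;
};

int find_idlest_cpu_sketch(const struct cpu_stat *cpus, int n)
{
        int best = -1, best_is_idle = 0;
        unsigned int best_latency = ~0u;
        unsigned long long best_stamp = 0;
        unsigned long best_load = ~0ul;

        for (int i = 0; i < n; i++) {
                if (cpus[i].idle) {
                        /* prefer the shallowest idle state, then the most
                         * recently idled CPU (cache likely still warm) */
                        if (!best_is_idle ||
                            cpus[i].exit_latency < best_latency ||
                            (cpus[i].exit_latency == best_latency &&
                             cpus[i].idle_stamp > best_stamp)) {
                                best = i;
                                best_is_idle = 1;
                                best_latency = cpus[i].exit_latency;
                                best_stamp = cpus[i].idle_stamp;
                        }
                } else if (!best_is_idle && cpus[i].load < best_load) {
                        best = i;        /* fall back to the least loaded CPU */
                        best_load = cpus[i].load;
                }
        }
        return best;                     /* -1 if there is no candidate at all */
}
#+END_SRC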
*** wake_wide
*** wake_cap
*** wake_affine
*** select_idle_sibling
The commit which introduced a rewrite of this function explains its design fairly
well: https://github.com/torvalds/linux/commit/10e2f1acd0106c05229f94c70a344ce3a2c8008b
* Occasions for thread placement
Thread placement is done at multiple occasions in Linux. It is done per-thread, for
example when the thread is created or when it wakes up. Linux also uses a global
load balancing algorithm which can move multiple threads if necessary.
** Per-thread placement
On some occasions, a call to select_task_rq will be made to define the placement
of one thread. These occasions are :
- The thread just got created (process fork, pthread_create, clone, kernel thread)
- The thread calls one of the syscalls from the exec family
- The thread is about to wake up
For all these occasions, the main placement logic is factored into the select_task_rq
function, which is implemented differently depending on the scheduling class.
Some variations are introduced depending on the occasion with a flags argument
(SD_BALANCE_FORK, SD_BALANCE_EXEC, SD_BALANCE_WAKE)
*** Selection of the destination CPU (function select_task_rq_fair)
This function will try to find a viable destination CPU or fall back to its prev_cpu
argument. Except for one subtlety (SD_WAKE_AFFINE, see below), it will look for the idlest
CPU among the scheduling domains which have the SD_LOAD_BALANCE and SD_BALANCE_{FORK,EXEC,WAKE}
flags set. The idea is to spread the threads evenly to maximize fairness and CPU utilization.
Discussion:
+ For forked and exec'd processes, this seems quite sane.
+ TODO For newly created threads this may lead to some memory distance issues ?
SD_WAKE_AFFINE case. Specifically for the case SD_BALANCE_WAKE, select_task_rq_fair might
run with a totally different underlying algorithm. If it is assumed to be better to wake
up the thread close to its waker (because they have a close relationship) and the
destination is not overloaded, then the algorithm will choose the CPU based on the result
of the function select_idle_sibling.
** Global Load balancing
A CPU runs the load balancing algorithm to try to pull some tasks for itself.
Three triggers for the global load balancing algorithm :
+ Periodically triggered on the idlest CPU for each scheduling domain.
Period depends on the scheduling domain.
+ Newly idle, i.e., triggered by a CPU about to become idle when trying to schedule a new
task but the runqueue is empty.
+ NOHZ idle balancing. Sometimes a CPU realizes that it is busy and some cores are tickless
(NOHZ; meaning no scheduler_tick, hence they won't have a chance to run the periodic load
balancing).
In that case the busy CPU will kick (IPI) one of the tickless CPUs.
Similarly to per-thread placement, these types of balancing rely on the same code (function
load_balance), while a flags argument (CPU_IDLE, CPU_NEWLY_IDLE, CPU_NOT_IDLE) adds some variation.
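These flag values come from enum cpu_idle_type, as defined in v4.9:
#+BEGIN_SRC C
/* <linux#v4.9>/include/linux/sched.h */
enum cpu_idle_type {
        CPU_IDLE,
        CPU_NOT_IDLE,
        CPU_NEWLY_IDLE,
        CPU_MAX
};
#+END_SRC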
*** Principle of the load balancing
The idlest CPU of each scheduling domain will run the load balancing periodically (load
balancing happens less frequently for larger scheduling domains). The CPU will look for
the busiest group of the scheduling domain, then for the busiest CPU inside this group.
The busiest CPU is the one whose load average is the highest.
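A control-flow sketch of one such run (standalone C; the helper names mirror the
kernel's, but the signatures, bodies, and detach_and_attach_tasks are simplified
stand-ins):
#+BEGIN_SRC C
#include <stddef.h>

struct sched_domain;
struct sched_group;
struct rq;                               /* a per-CPU runqueue */

/* Trivial stubs so the sketch compiles; the real logic is analysed in
 * the sections above. detach_and_attach_tasks stands in for the
 * detach_tasks/attach_tasks pair, which ends up in set_task_cpu. */
static int should_we_balance(int cpu, struct sched_domain *sd)
{ (void)cpu; (void)sd; return 1; }
static struct sched_group *find_busiest_group(struct sched_domain *sd)
{ (void)sd; return NULL; }
static struct rq *find_busiest_queue(struct sched_group *group)
{ (void)group; return NULL; }
static int detach_and_attach_tasks(struct rq *busiest, int this_cpu)
{ (void)busiest; (void)this_cpu; return 0; }

/* One balancing attempt by this_cpu on one of its scheduling domains. */
int load_balance_sketch(int this_cpu, struct sched_domain *sd)
{
        /* Only the designated CPU of the domain (e.g. the first idle
         * one) actually balances, to avoid a thundering herd. */
        if (!should_we_balance(this_cpu, sd))
                return 0;

        struct sched_group *busiest_group = find_busiest_group(sd);
        if (!busiest_group)
                return 0;                /* domain considered balanced */

        struct rq *busiest = find_busiest_queue(busiest_group);
        if (!busiest)
                return 0;

        /* pull tasks from the busiest runqueue toward this_cpu */
        return detach_and_attach_tasks(busiest, this_cpu);
}
#+END_SRC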
* Criteria
** Per thread placement
*** select_task_rq_fair(SD_BALANCE_WAKE) -> place close to previous CPU and waker
+ Criteria : Previous CPU and waker CPU are both part of a scheduling domain whose
flag SD_WAKE_AFFINE is set.
Motivation : Consolidation of waker/wakee on shared cache ?
Note : TODO Motivation not clear. The flag SD_WAKE_AFFINE is set on all domains
on the mc2, even NUMA domains.
+ Criteria : Number of wakees is less than number of CPUs in llc domain of waker
Motivation : Consolidation, keep waker/wakee with a close relationship together
Spread wakee threads when waker:wakees relationship is 1:N.
+ Criteria : Capacity of waking CPU fits the waking thread utilization
Motivation : Avoid consolidation on a CPU whose capacity is not good enough
Meant for asymmetric architectures
+ Criteria : if load of waker CPU is less than load of previous CPU, then wake up
close to waker rather than previous CPU
Motivation : Quote 62470419e99 : """
In theory this should be beneficial if
the waker's CPU caches hot data for the wakee, and it's also beneficial
in the extreme ping-pong high context switch rate case.
"""
*** select_task_rq_fair(SD_BALANCE_{FORK,EXEC,WAKE}) -> place on idlest
+ Criteria : Idlest CPU of highest domain whose flags SD_LOAD_BALANCE and
SD_BALANCE_{FORK,EXEC,WAKE} are set
Motivation : Spread the workload for fairness
+ Criteria : If the idlest CPU(s) are actually idle, take the CPU whose exit_latency is lowest
Motivation : Use an idle CPU which will be quicker to wake up
+ Criteria : Difference of load has to be more than 5-12% of the min load
Motivation : ???
** Load Balancing