c - Cost of context switch between threads of same process, on Linux
Is there empirical data on the cost of context switching between threads of the same process on Linux (x86 and x86_64 are mainly of interest)? I'm talking about the number of cycles or nanoseconds between the last instruction one thread executes in userspace before being put to sleep, voluntarily or involuntarily, and the first instruction a different thread of the same process executes after waking up on the same CPU/core.
I wrote a quick test program that performs rdtsc in two threads assigned to the same CPU/core, stores the result in a volatile variable, and compares it to the sister thread's corresponding volatile variable. The first time it detects a change in the sister thread's value, it prints the difference and goes back to looping. I'm getting minimum/median counts of about 8900/9600 cycles this way on an Atom D510 CPU. Does this procedure seem reasonable, and do the numbers seem believable?
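For reference, a minimal sketch of the kind of test described above, with both threads pinned to CPU 0 (the iteration count, output format, and affinity details here are illustrative, not the exact original program):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

static volatile uint64_t stamp[2];          /* one timestamp slot per thread */

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static void *worker(void *arg)
{
    int me = (int)(intptr_t)arg, sibling = 1 - me;

    /* Pin both threads to CPU 0 so every handoff is a real context switch. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    uint64_t last_seen = 0;
    for (int samples = 0; samples < 1000; ) {
        uint64_t now = rdtsc();
        stamp[me] = now;
        uint64_t s = stamp[sibling];
        if (s != last_seen && last_seen != 0) {
            /* Sibling ran since we last looked: 'now - s' brackets a switch. */
            printf("%llu\n", (unsigned long long)(now - s));
            samples++;
        }
        last_seen = s;
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```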
My goal is to estimate whether, on modern systems, the thread-per-connection server model could be competitive with, or even outperform, select-type multiplexing. This seems plausible in theory, since the transition from performing IO on fd X to fd Y involves merely going to sleep in one thread and waking up in another, rather than multiple syscalls, but it depends on the overhead of context switching.
(Disclaimer: this isn't a direct answer to the question, just some suggestions that I hope will be helpful.)
Firstly, the numbers you're getting certainly sound like they're within the ballpark. Note, however, that interrupt/trap latency can vary a lot among different CPU models implementing the same ISA. It's also a different story if your threads have used floating point or vector operations, because if they haven't, the kernel avoids saving/restoring the floating point and vector unit state.
You should be able to get more accurate numbers using the kernel tracing infrastructure; perf sched in particular is designed to measure and analyse scheduler latency.
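For example, something along these lines (run as root or with suitable perf_event permissions; the 5-second window is arbitrary):

```sh
# Record scheduler events while an arbitrary workload runs...
perf sched record -- sleep 5
# ...then summarize per-task scheduling/wakeup latencies.
perf sched latency
```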
If your goal is to model thread-per-connection servers, then you probably shouldn't be measuring involuntary context switch latency: in such a server, the majority of context switches will be voluntary, as a thread blocks in read() waiting for more data from the network. Therefore, a better testbed might involve measuring the latency from one thread blocking in read() to another being woken up from the same; see the sketch after this paragraph.
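A minimal sketch of such a testbed, assuming a pipe as the blocking fd and both threads pinned to one core so that each wakeup forces a context switch (the pipe-based setup and counts are my own illustrative choices):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int ab[2], ba[2];                    /* pipes: main->reader, reader->main */

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static void pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *reader(void *unused)
{
    (void)unused;
    pin_to_cpu0();
    uint64_t sent;
    while (read(ab[0], &sent, sizeof sent) == (ssize_t)sizeof sent) {
        uint64_t woke = rdtsc();            /* first work after being woken */
        printf("%llu\n", (unsigned long long)(woke - sent));
        write(ba[1], &woke, sizeof woke);   /* hand the token back */
    }
    return NULL;
}

int main(void)
{
    pipe(ab);
    pipe(ba);
    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);
    pin_to_cpu0();

    uint64_t ack;
    for (int i = 0; i < 10000; i++) {
        uint64_t now = rdtsc();             /* last work before we block */
        write(ab[1], &now, sizeof now);
        read(ba[0], &ack, sizeof ack);      /* block until the reader has run */
    }
    close(ab[1]);                           /* reader's read() returns 0, loop ends */
    pthread_join(t, NULL);
    return 0;
}
```

The measured interval includes the write() and read() syscall overhead as well as the switch itself, which is exactly the per-transition cost the thread-per-connection model would pay.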
Note that in a well-written multiplexing server under heavy load, the transition from fd X to fd Y will often involve the same single system call, as the server iterates over the list of active file descriptors returned by a single epoll_wait() (a pattern sketched below). One thread also ought to have a smaller cache footprint than multiple threads, through having only one stack. I suspect the only way to settle the matter (for some definition of "settle") might be to have a benchmark shootout...
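For concreteness, the multiplexing pattern referred to above looks roughly like this; handle_io() is a hypothetical per-connection handler, not something from the original post:

```c
#include <sys/epoll.h>

extern void handle_io(int fd);   /* hypothetical per-connection handler */

void event_loop(int epfd)
{
    struct epoll_event events[64];
    for (;;) {
        /* One syscall yields every ready connection... */
        int n = epoll_wait(epfd, events, 64, -1);
        /* ...and moving from fd X to fd Y is just the next loop iteration. */
        for (int i = 0; i < n; i++)
            handle_io(events[i].data.fd);
    }
}
```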