Tuesday, October 28, 2014

The Internet is running in debug mode

With the rise of the Web, textual encodings like XML/JSON have become very popular. Of course, textual message encoding has many advantages from a developer's perspective. What most developers are not aware of is how expensive encoding and decoding textual messages really is compared to a well-defined binary protocol.

It's common to define a system's behaviour by its protocol. Actually, a protocol mixes up two distinct aspects of communication, namely:

  • Encoding of messages
  • Semantics and behavior (request/response, signals, state transitions of the communicating parties ..)

Frequently (not always), these two very distinct aspects are mixed up without need. So we are forced to run the whole Internet in "debug mode", as 99% of web service and web app communication is done using textual protocols.

The overhead in CPU consumption compared to a well-defined binary encoding is a factor of ~3-5 (JSON) up to >10-20 (XML). The unnecessary waste of bandwidth also adds to that greatly (yes, you can zip, but this in turn will waste even more CPU).
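To make the asymmetry concrete, here is a minimal, illustrative sketch (plain JDK, not the source of the numbers above) that encodes the same message once as a JSON-style string and once with a fixed binary layout. The textual form is larger and has to be parsed character by character on decode, while the binary form is read back with a handful of typed reads:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    public static void main(String[] args) throws IOException {
        String symbol = "ACME";
        long timestamp = 1414483200000L;
        double price = 42.125;

        // textual encoding: field names repeated, numbers rendered as text
        String json = "{\"symbol\":\"" + symbol + "\",\"timestamp\":" + timestamp
                + ",\"price\":" + price + "}";

        // binary encoding: fixed field order, no field names, no text-to-number parsing on decode
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bout);
        out.writeUTF(symbol);
        out.writeLong(timestamp);
        out.writeDouble(price);
        out.flush();

        System.out.println("JSON bytes:   " + json.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("binary bytes: " + bout.size());

        // decoding the binary form: three typed reads, no parsing
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bout.toByteArray()));
        System.out.println(in.readUTF() + " " + in.readLong() + " " + in.readDouble());
    }
}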

I haven't calculated the numbers, but this is environmental pollution at a big scale. Unnecessary CPU consumption at this level wastes a lot of energy (global warming, anyone?).

The solution is easy:
  • Standardize on a few simple encodings (pure binary, self-describing binary ("binary JSON"), textual)
  • Define the behavioral part of a protocol separately (excluding encoding)
  • Use textual encoding during development, binary in production.
Man, we could save a lot of web server hardware if only HTTP headers could be binary encoded ..

Friday, October 17, 2014

Follow up: Executors and Cache Locality Experiment

Thanks to Jean Philippe Bempel, who challenged my results (for good reason), I discovered an issue in the last post: code completion let me accidentally choose Executors.newSingleThreadScheduledExecutor() instead of Executors.newSingleThreadExecutor(), so the pinned-to-thread-actor results are actually even better than reported previously. The big picture has not changed that much, but it's still worth reporting.

On a second note: there are many other aspects to concurrent scheduling, such as queue implementations etc. Especially if there is no "beef" inside the message processing, these differences become more dominant than cache misses, but that is a separate problem which has been covered extensively and in depth by other people (e.g. Nitsan Wakart).

The focus of this experiment is locality/cache misses; keep in mind that the executors' different queueing implementations certainly add noise/bias.

As requested, I have added results from the Linux "perf" tool to show there are significant differences in cache misses caused by the random thread-to-actor assignment done by ThreadPoolExecutor and WorkStealingExecutor.

Check out my recent post for a description of the test case.

Results with the adjusted SingleThreadExecutor (XEON, 2 sockets, 6 cores each, no HT)


As in the previous post, "Dedicated" (actor pinned to thread) performs best. For very small local state there are only a few cache misses, so the differences are small, but they widen once a bigger chunk of memory is accessed by each actor. Note that ThreadPool is hampered by its internal scheduling/queuing mechanics; regardless of locality, it performs weakly.


When increasing the number of Actors to 8000 (so 1000 actors per thread), "WorkStealing" and "Dedicated" perform similarly. Reason: executing 8000 actors round-robin creates cache misses for both executors. Note that in a real-world server it's likely that there are active and inactive actors, so I'd expect "Dedicated" to perform slightly better than in this synthetic test.

"perf stat -e" and "perf stat -cs" results

(only the 2000, 4000, 8000 local size tests were run)

Dedicated/actor-pinned-to-thread:
333,669,424 cache-misses                                                
19.996366007 seconds time elapsed
185,440 context-switches                                            
20.230098005 seconds time elapsed
=> 9,300 context switches per second

workstealing:
2,524,777,488 cache-misses                                                
39.610565607 seconds time elapsed
381,385 context-switches                                            
39.831169694 seconds time elapsed
=> 9,500 context switches per second

fixedthreadpool:
3,213,889,492 cache-misses                                                
92.141264115 seconds time elapsed
25,387,972 context-switches                                            
87.547306379 seconds time elapsed
=>290,000 context switches per second

A quick test with a more realistic test method

In order to get a more realistic impression, I replaced the synthetic int iteration with some dirty "real world" dummy work (some allocation plus HashMap put/get). Instead of increasing the size of the "localstate" int array, I increase the HashMap size (which should also have a negative impact on locality).
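The dummy work is roughly along these lines (a hypothetical sketch of the described test method, not the actual benchmark code):

import java.util.HashMap;

// Simulated actor whose per-message work does some allocation plus HashMap put/get
// on actor-local state. The map size replaces the int-array size as the locality knob.
class RealWorldDummyActor {
    private final HashMap<Integer, String> localState = new HashMap<>();
    private final int mapSize;
    private int counter;
    private long sink;                             // consume reads so the JIT can't drop them

    RealWorldDummyActor(int mapSize) {
        this.mapSize = mapSize;
        for (int i = 0; i < mapSize; i++)
            localState.put(i, "init-" + i);        // pre-fill so gets hit existing entries
    }

    // called once per simulated message, from whichever thread the executor picked
    void onMessage() {
        int key = (counter++ * 31) % mapSize;      // pseudo-random key within the map
        String previous = localState.get(key);     // read actor-local state
        if (previous != null) sink += previous.length();
        localState.put(key, "v" + counter);        // allocate a fresh String and write it back
    }
}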


Note that this is rather short processing, so queue implementations and executor-internal mechanics might dominate locality effects here. This test is run on an Opteron 8c/16t * 2 sockets, a processor with only 8 KB of L1 cache. (BTW: the impl is extra dirty, so no performance-optimization comments please, thx)


As ThreadPoolExecutor is abnormally bad in this test/processor combination, here are the plain numbers:

               64 HMap    256 HMap   2000 HMap   4000 HMap   32k HMap   320k HMap
               entries    entries    entries     entries     entries    entries
WorkStealing   1070       1071       1097        1129        1238       1284
Dedicated      656        646        661         649         721        798
ThreadPool     8314       8751       9412        9492        10269      10602

Conclusions basically stay the same as in the original post. Remember that cache misses are only one factor of overall runtime performance, so there are workloads where results might look different. Quality/specialization of the queue implementation will have a huge impact in case the processing consists of only a few lines of code.

Finally, my result:
Pinning actors to threads produced the lowest cache miss rates in every case tested.

Tuesday, October 14, 2014

Experiment: Cache effects when scheduling Actors with F/J, Threadpool, Dedicated Threads

Update: I accidentally used newSingleThreadScheduledExecutor instead of newFixedThreadPool(1) for the "Dedicated" test case [IDE code completion ..]. With this corrected, "Dedicated" outperforms even more. See the follow-up post for updated results + "perf" tool cache miss measurements (they do not really change the big picture).

The experiment in my last post had a serious flaw: in an actor system, operations on a single actor are executed one after the other. However, by naively adding message-processing jobs to executors, private actor state was accessed concurrently, leading to false sharing and cache-coherency-related costs, especially for small local state sizes.

Therefore I modified the test: for each Actor, the next message-processing job is scheduled only once the previous one has finished, so the experiment correctly resembles the behaviour of typical actors (or lightweight processes/tasks/fibers) without concurrent access to a memory region.
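A minimal sketch of this chained scheduling (assumed structure, not the actual benchmark source):

import java.util.concurrent.ExecutorService;

// The next message of an actor is only submitted once the previous one has finished,
// so no two threads ever access the same actor's private state concurrently.
class ChainedScheduling {
    static void scheduleNext(ExecutorService executor, Runnable processOneMessage, int remaining) {
        if (remaining == 0)
            return;
        executor.execute(() -> {
            processOneMessage.run();                                  // process exactly one message
            scheduleNext(executor, processOneMessage, remaining - 1); // only then enqueue the next one
        });
    }
}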

Experiment roundup:

Several million messages are scheduled onto several "Actor"-simulating classes. Message processing is simulated by reading and writing the private, actor-local state in random order. There are more Actors (24-8000) than threads (6-8). Note that any results established will also hold true for other lightweight concurrency schemes like goroutines, fibers, tasks ...

The test is done with

  • ThreadPoolExecutor
  • WorkStealingExecutor
  • Dedicated Thread (Each Actor has a fixed assignment to a worker thread)

Simulating an Actor accessing local state:
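(The embedded source is not reproduced here; the following is a hedged sketch of what such an actor-simulating class might look like, assuming an int[] as local state.)

import java.util.concurrent.ThreadLocalRandom;

// Sketch of an "actor": message processing reads and writes private,
// actor-local state at random offsets.
class SimulatedActor {
    private final int[] localState;                 // private, unshared actor state

    SimulatedActor(int sizeBytes) {
        localState = new int[Math.max(1, sizeBytes / 4)];
    }

    // simulate processing of one message: memAccesses random reads+writes on local state
    long processMessage(int memAccesses) {
        long sum = 0;
        for (int i = 0; i < memAccesses; i++) {
            int idx = ThreadLocalRandom.current().nextInt(localState.length);
            sum += localState[idx]++;               // touches a cache line of this actor's state
        }
        return sum;
    }
}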


Full Source of Benchmark

Suspicion:
As ThreadPoolExecutor and WorkStealingExecutor schedule each message on a random thread, they will produce more cache misses compared to pinning each actor onto a fixed thread. The speculation is that work stealing cannot make up for the costs caused by cache misses.


(Some) Variables:
  • Number of worker threads
  • Number of actors
  • Amount of work per message
  • Locality / Size of private unshared actor state


8 Threads 24 actors 100 memory accesses (per msg)


Interpretation:

For this particular load, fixed-assigned threads outperform the executors. Note: the larger the local state of an actor, the higher the probability of a prefetch failing => cache miss. In this scenario my suspicion holds true: work stealing cannot make up for the amount of cache misses. Fixed-assigned threads profit because it's likely that some state of a previously processed message still resides in cache once a new message is processed on an actor.
It's remarkable how badly ThreadPoolExecutor performs in this experiment.

This is a scenario typical for a backend-type service: there are few actors with high load. When running a front-end server with many clients, there are probably more actors, as typically there is one actor per client session. Therefore let's push the number of actors up to 8000:

8 Threads 8000 actors 100 memory accesses (per msg)



Interpretation:

With this number of actors, all execution schemes suffer from cache misses, as the accumulated state of 8000 actors is too big to fit into the L1 cache. Therefore the cache advantage of fixed-assigned threads ("Dedicated") does not make up for the lack of work stealing. The work-stealing executor outperforms any other execution scheme if a large amount of state is involved.
This is a somewhat unrealistic scenario, as in a real server application client requests probably do not arrive "round-robin"; some clients are more active than others. So in practice I'd expect "Dedicated" to at least retain some advantage from higher cache hit rates. Anyway: when serving many (stateful) clients, WorkStealing can be expected to outperform.

Just to get a third variant: same test with 240 actors:


These results complete the picture: with fewer actors, cache effects supersede work stealing. The higher the number of actors, the higher the number of cache misses, so work stealing starts outperforming dedicated threads.


Modifying other variables

Number of memory accesses

If message processing does fewer memory accesses, work stealing improves compared to the other two. Reason: fewer memory accesses mean fewer cache misses, so the work-stealing advantage becomes more significant in the overall result.

 ************** Worker Threads:8 actors:24 #mem accesses: 20
local state bytes: 64 WorkStealing avg:505
local state bytes: 64 ThreadPool avg:2001
local state bytes: 64 Dedicated avg:557
local state bytes: 256 WorkStealing avg:471
local state bytes: 256 ThreadPool avg:1996
local state bytes: 256 Dedicated avg:561
local state bytes: 2000 WorkStealing avg:589
local state bytes: 2000 ThreadPool avg:2109
local state bytes: 2000 Dedicated avg:600
local state bytes: 4000 WorkStealing avg:625
local state bytes: 4000 ThreadPool avg:2096
local state bytes: 4000 Dedicated avg:600
local state bytes: 32000 WorkStealing avg:687
local state bytes: 32000 ThreadPool avg:2328
local state bytes: 32000 Dedicated avg:640
local state bytes: 320000 WorkStealing avg:667
local state bytes: 320000 ThreadPool avg:3070
local state bytes: 320000 Dedicated avg:738
local state bytes: 3200000 WorkStealing avg:1341
local state bytes: 3200000 ThreadPool avg:3997
local state bytes: 3200000 Dedicated avg:1428


Fewer worker threads

Fewer worker threads (e.g. 6) increase the probability of an actor message being scheduled to the "right" thread "by accident", so the cache miss penalty is lower, which lets work stealing perform better than "Dedicated" (the fewer threads used, the lower the cache advantage of fixed-assigned "Dedicated" threads). Vice versa: if the number of cores involved increases, fixed thread assignment gets ahead.

Worker Threads:6 actors:18 #mem accesses: 100
local state bytes: 64 WorkStealing avg:2073
local state bytes: 64 ThreadPool avg:2498
local state bytes: 64 Dedicated avg:2045
local state bytes: 256 WorkStealing avg:1735
local state bytes: 256 ThreadPool avg:2272
local state bytes: 256 Dedicated avg:1815
local state bytes: 2000 WorkStealing avg:2052
local state bytes: 2000 ThreadPool avg:2412
local state bytes: 2000 Dedicated avg:2048
local state bytes: 4000 WorkStealing avg:2183
local state bytes: 4000 ThreadPool avg:2373
local state bytes: 4000 Dedicated avg:2130
local state bytes: 32000 WorkStealing avg:3501
local state bytes: 32000 ThreadPool avg:3204
local state bytes: 32000 Dedicated avg:2822
local state bytes: 320000 WorkStealing avg:3089
local state bytes: 320000 ThreadPool avg:2999
local state bytes: 320000 Dedicated avg:2543
local state bytes: 3200000 WorkStealing avg:6579
local state bytes: 3200000 ThreadPool avg:6047
local state bytes: 3200000 Dedicated avg:6907

Machine tested:

(real cores no HT)
$ lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               3067.058
BogoMIPS:              6133.20
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     1,3,5,7,9,11
NUMA node1 CPU(s):     0,2,4,6,8,10


Conclusion
  • Performance of executors depends heavily on the use case. There are workloads where cache locality dominates, giving pinned ("Dedicated") threads an advantage of up to 30% over the work-stealing executor
  • Performance of executors varies across different CPU types and models (L1 cache size + cost of a cache miss matter here)
  • WorkStealing can be viewed as the better overall solution, especially if a lot of L1 cache misses are to be expected anyway.
  • The ideal executor would be work stealing with a soft actor-to-thread affinity. This would combine the strengths of both execution schemes and would yield significant performance improvements for many workloads
  • Vanilla thread pools without work stealing and actor-to-thread affinity perform significantly worse and should not be used to execute lightweight processes.
Source of Benchmark

Sunday, October 12, 2014

Alternatives to Executors when scheduling Tasks/Actors



Executors work well when fed with short units of stateless work, e.g. dividing a computation onto many CPUs. However, they are sub-optimal for scheduling jobs which are part of an ongoing, larger unit of work, e.g. scheduling the messages of an actor or lightweight process.

Many actor frameworks (or similar message-passing concurrency frameworks) schedule batches of messages using an executor service. As the executor service is not context aware, this spreads processing of a single Actor/Task's messages across several threads/CPUs.

This can lead to many cache misses when accessing the state of an Actor/Process/Task, as the processing thread changes frequently.
Even worse, this way CPU caches cannot stabilize, as each new "Runnable" washes out the cached memory of previously processed tasks.

A second issue arises when using busy-spin polling. If a framework reads its queues using busy-spin, it generates 100% CPU load for each processing thread. So one would like to add a second thread/core only if absolutely required, which makes a one-thread-per-actor policy infeasible.

With Kontraktor 2.0 I implemented a different scheduling mechanism, which achieves horizontal scaling using a very simple metric to measure the actual CPU requirements of the application, without randomly spreading the processing of an actor onto different cores.


An actor has a fixed assignment to a worker thread ("DispatcherThread"). The scheduler periodically reschedules Actors by moving them to another worker if necessary.

Since overly sophisticated algorithms tend to introduce high runtime costs, the actual scheduling is done in a very simple manner (a rough sketch follows the list below):

1) A Thread is observed as being overloaded if the consume loop (pulling messages from the actor queue) has not been idle for N messages (currently N=1000).
2) If a Thread has been marked "overloaded", the Actors with the largest mailbox sizes are migrated to a newly started Thread (until #Threads == ThreadMax), as long as SUM_QUEUED_MSG(Actors scheduled on Thread A) > SUM_QUEUED_MSG(Actors scheduled on newly created Thread B).
3) In case #Threads == ThreadMax, actors are rebalanced based on the sum of queued messages and "overload" observations.
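A rough, hypothetical sketch of this heuristic (class and field names like Dispatcher, ActorRef and mailboxSize are illustrative, not Kontraktor's actual API):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SchedulingHeuristicSketch {
    static final int N = 1000;        // consume-loop polls without idling => thread counts as overloaded
    static final int THREAD_MAX = 8;  // assumed upper bound on worker threads

    static class ActorRef { long mailboxSize; }
    static class Dispatcher { List<ActorRef> actors = new ArrayList<>(); }

    static long queuedMessages(Dispatcher d) {
        return d.actors.stream().mapToLong(a -> a.mailboxSize).sum();
    }

    // invoked when a dispatcher thread has not been idle for N consecutive messages
    static void onOverload(Dispatcher overloaded, List<Dispatcher> all) {
        if (all.size() < THREAD_MAX) {
            Dispatcher fresh = new Dispatcher();
            all.add(fresh);
            // migrate the actors with the largest mailboxes while the overloaded thread
            // still has more queued messages than the freshly started one
            while (queuedMessages(overloaded) > queuedMessages(fresh) && !overloaded.actors.isEmpty()) {
                ActorRef busiest = overloaded.actors.stream()
                        .max(Comparator.comparingLong(a -> a.mailboxSize)).get();
                overloaded.actors.remove(busiest);
                fresh.actors.add(busiest);
            }
        } else {
            // at THREAD_MAX: rebalance actors across the existing dispatchers based on
            // queued-message sums and overload observations (omitted in this sketch)
        }
    }
}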

Problem areas: 
  • If the processing time of messages varies a lot, the sum of queued messages is misleading. An improvement would be to add a weight to each message type by profiling periodically. A simpler option would be to give an Actor an additional weight to be multiplied with its queue size.
  • For load bursts, there is a latency until all of the available CPUs are actually used.
  • The delay until the JIT kicks in produces bad profiling data and leads to false scale-ups (this heals over time, so it's not that bad).

An experiment

Update: Woke up this morning and it came to my mind that this experiment has a flaw, as the jobs per work item are scheduled in parallel for the executor tests, so I am probably measuring the effects of false sharing. Results below removed; check the follow-up post.



Performance cost of adaptive Scaling


To measure the cost of auto-scheduling vs. explicit Actor=>Thread assignment, I run the Computing-Pi test (see previous posts).
These numbers do not show locality effects, as the test compares explicit scheduling with automatic scheduling.

Test 1 manually assigns a Thread to each Pi computing Actor,
Test 2 always starts with one worker and needs to scale up automatically once it detects actual load.


(note: the example requires >= kontraktor2.0-beta-2; if the 'parkNanos' stuff is uncommented, it scales up to 2-3 threads only)

The test is run 8 times, increasing thread_max by one with each run.

results:

Kontraktor Autoscale (always start with 1 thread, then scale up to N threads)

1 threads : 1527
2 threads : 1273
3 threads : 718
4 threads : 630
5 threads : 521
6 threads : 576
7 threads : 619
8 threads : 668

Kontraktor with dedicated assignment of threads to Actors (see commented line in source above)

1 threads : 1520
2 threads : 804
3 threads : 571
4 threads : 459
5 threads : 457
6 threads : 534
7 threads : 615
8 threads : 659


Conclusion #2

Differences in runtimes can be attributed mostly to the delay in scaling up. For deterministic load problems, prearranged Actor scheduling is of course more efficient. However, thinking of a server receiving varying request loads over time, automatic scaling is a valid option.