Updated January 31, 2004
© Copyright 2004 by CyberLife Labs, LLC
All rights reserved
This page exists to document a minor research project whose purpose is to understand exactly why two very similar computers are exhibiting such wildly different context-switching performance characteristics.
CyberLife Labs uses FreeBSD almost exclusively for its scientific, production and administrative needs. One machine in particular (appserver.geo) has always been known to run very fast, but it was never given much thought until recently. As an exercise, the BYTE Benchmark suite was installed and run on that unit along with another, nearly identical but slightly faster machine (beastie.lab). Most of the results were as expected, with beastie.lab edging out appserver.geo by a narrow margin. However, the context-switching results clocked appserver.geo at nearly four times the rate of its slightly faster cousin. Considering the types of jobs these two machines typically run (very process-switching oriented), that difference explained why appserver.geo always seemed to get things done quicker. The only question was why.
The following table lists the configuration details of the two systems:
| Component | appserver.geo | beastie.lab |
|---|---|---|
| Motherboard | ASUS A7V333 | ASUS A7V333 |
| CPU | AMD Athlon XP 2100+ (1.73 GHz) | AMD Athlon XP 2200+ (1.8 GHz) |
| RAM | PC2100 1024MB | PC2100 512MB |
| Chipset | VIA KT333 | VIA KT333 |
| ATA Controller | VIA 8233 ATA133 | VIA 8233 ATA133 |
| System Drive | Western Digital UDMA100 100GB | Western Digital UDMA33 30GB |
| Secondary Drive | ---- | Western Digital UDMA66 10GB |
| CD-ROM Drive | ---- | Creative WDMA2 52X |
| Operating System | FreeBSD 4.9-RELEASE | FreeBSD 4.9-RELEASE |
Both systems were run with the exact same kernel configuration, compiled with the same compiler settings.
The following table lists the raw figures output from the BYTE Benchmark suite:
| Test | appserver.geo | beastie.lab |
|---|---|---|
| Dhrystone 2 without register variables | 3880692.0 | 4011065.9 |
| Dhrystone 2 using register variables | 3860733.5 | 4009818.1 |
| Arithmetic Test (type = arithoh) | 7851836.7 | 8163230.7 |
| Arithmetic Test (type = register) | 356915.5 | 370548.3 |
| Arithmetic Test (type = short) | 342698.9 | 355824.8 |
| Arithmetic Test (type = int) | 356759.3 | 370485.0 |
| Arithmetic Test (type = long) | 350642.9 | 370560.9 |
| Arithmetic Test (type = float) | 794357.5 | 826194.1 |
| Arithmetic Test (type = double) | 794815.8 | 826337.9 |
| System Call Overhead Test | 755378.3 | 740416.6 |
| Pipe Throughput Test | 899956.5 | 985937.3 |
| Pipe-based Context Switching Test | 340616.2 | 89968.4 |
| Process Creation Test | 9790.1 | 8763.2 |
| Execl Throughput Test | 1338.0 | 276.6 |
| File Read (10 seconds) | 1783104.0 | 1851455.0 |
| File Write (10 seconds) | 40943.0 | 11400.0 |
| File Copy (10 seconds) | 36109.0 | 11050.0 |
| File Read (30 seconds) | 1797987.0 | 1838905.0 |
| File Write (30 seconds) | 38704.0 | 11333.0 |
| File Copy (30 seconds) | 36126.0 | 10130.0 |
| C Compiler Test | 2445.4 | 333.7 |
| Shell scripts (1 concurrent) | 3522.0 | 337.0 |
| Shell scripts (2 concurrent) | 1763.3 | 168.0 |
| Shell scripts (4 concurrent) | 889.0 | 84.0 |
| Shell scripts (8 concurrent) | 444.7 | 40.0 |
| Dc: sqrt(2) to 99 decimal places | 290604.9 | 118030.2 |
| Recursion Test--Tower of Hanoi | 56435.2 | 58651.8 |
All tests were run in single-user mode to eliminate any chance of interference from other processes. The first nine tests all measure CPU performance. As expected, beastie.lab performed about 4% faster than appserver.geo. The pipe-throughput, file-read and recursion tests are highly CPU dependent and thus show a similar small margin in beastie.lab's favor (roughly 2 to 10%). The file-write and file-copy tests rate significantly higher on appserver.geo, but this is due to its ATA100 disk versus the ATA33 disk on beastie.lab. (This was verified by forcing appserver.geo down to ATA33 and re-running the benchmark, at which point the file-write and file-copy results were identical between the systems.) Most of the remaining tests are highly dependent on context-switching and thus rate much higher on appserver.geo, by factors of roughly 2.5 to 11. The exceptions are the system-call overhead and process-creation tests, where appserver.geo leads by only about 2% and 12% respectively.
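The comparison in the paragraph above comes down to dividing one column of the BYTE table by the other. The following is a minimal sketch of that arithmetic, using a handful of rows copied from the table; the names `RESULTS`, `ratio` and `report` are this sketch's own and are not part of the benchmark suite.

```python
# Raw BYTE scores copied from the table above: (appserver.geo, beastie.lab).
# Only a representative subset of rows is listed here.
RESULTS = {
    "Dhrystone 2 without register variables": (3880692.0, 4011065.9),
    "Arithmetic Test (type = double)": (794815.8, 826337.9),
    "Pipe Throughput Test": (899956.5, 985937.3),
    "Pipe-based Context Switching Test": (340616.2, 89968.4),
    "Execl Throughput Test": (1338.0, 276.6),
    "C Compiler Test": (2445.4, 333.7),
    "Shell scripts (1 concurrent)": (3522.0, 337.0),
}


def ratio(appserver, beastie):
    """Return appserver.geo's score relative to beastie.lab's (1.0 = equal)."""
    return appserver / beastie


def report(results):
    """Format one line per test, sorted from smallest to largest ratio."""
    rows = sorted(results.items(), key=lambda item: ratio(*item[1]))
    return [f"{name:42s} {ratio(*scores):6.2f}x" for name, scores in rows]


if __name__ == "__main__":
    print("\n".join(report(RESULTS)))
```

Run as a script, the CPU-bound rows come out slightly below 1.0 (beastie.lab ahead), while the context-switching row comes out near 3.8 and the process-heavy rows higher still.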
So why does appserver.geo task-switch so much better? It's not a CPU issue, as the benchmarks correctly show beastie.lab to be the faster system. It's not a disk issue since, again, the benchmarks show beastie.lab to be at least as fast (when both run UDMA33). Even if there were a large difference in disk performance, the context-switching tests don't use the disk subsystem at all. It's not a network issue since the NICs were completely shut down. It's not a memory-capacity issue either: appserver.geo has twice the memory of beastie.lab, but the context-switching tests are not memory-intensive. That leaves only the speed of the memory and cache subsystems.
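To see why a pipe-based context-switching test touches neither the disk nor much memory, it helps to look at what such a test does: two processes bounce a tiny token back and forth over a pair of pipes. The sketch below illustrates that idea under stated assumptions. It is not the BYTE suite's own source; the function name `pipe_ping_pong` and the default round-trip count are hypothetical. Interpreter overhead also means its numbers are not comparable with the table above.

```python
import os
import time


def pipe_ping_pong(round_trips=10000):
    """Bounce a 1-byte token between parent and child over two pipes.

    Returns round trips per second. Each side blocks in read() until the
    other has written, so on a single-CPU machine every round trip needs
    at least two context switches.
    """
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: close the ends it does not use, then echo every byte
        # back until the parent closes its write end (read returns b"").
        os.close(to_child_w)
        os.close(to_parent_r)
        while True:
            token = os.read(to_child_r, 1)
            if not token:
                break
            os.write(to_parent_w, token)
        os._exit(0)

    # Parent: send the token, wait for the echo, repeat.
    os.close(to_child_r)
    os.close(to_parent_w)
    start = time.perf_counter()
    for _ in range(round_trips):
        os.write(to_child_w, b"x")
        os.read(to_parent_r, 1)
    elapsed = time.perf_counter() - start

    os.close(to_child_w)  # EOF tells the child to exit
    os.close(to_parent_r)
    os.waitpid(pid, 0)
    return round_trips / elapsed


if __name__ == "__main__":
    print(f"{pipe_ping_pong():.0f} round trips per second")
```

The only data that moves is a single byte through kernel pipe buffers, so the measured rate is dominated by how quickly the kernel can block one process and resume the other.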
To determine if memory or cache was affecting the results, Alasir's RAMSPEED benchmark was used to directly stress-test both subsystems. The following table shows the results (in KB/s):
| Block Size (KB) | appserver.geo | appserver.geo | beastie.lab | beastie.lab |
|---|---|---|---|---|
| 1 | 11650.8 | 7943.8 | 12192.7 | 8066.0 |
| 2 | 11915.6 | 7825.2 | 12483.1 | 8192.0 |
| 4 | 11650.8 | 7943.8 | 12192.7 | 8192.0 |
| 8 | 11915.6 | 7825.2 | 12192.7 | 8192.0 |
| 16 | 11915.6 | 7943.8 | 12483.1 | 8192.0 |
| 32 | 12192.7 | 7825.2 | 12787.5 | 8456.3 |
| 64 | 11915.6 | 7943.8 | 12483.1 | 8322.0 |
| 128 | 4519.7 | 4333.0 | 5349.9 | 4443.1 |
| 256 | 4481.1 | 4228.1 | 4559.0 | 4333.0 |
| 512 | 853.9 | 574.9 | 849.7 | 546.7 |
| 1024 | 852.5 | 576.1 | 841.6 | 547.9 |
| 2048 | 848.4 | 578.1 | 838.9 | 550.7 |
| 4096 | 844.3 | 578.1 | 834.9 | 548.4 |
| 8192 | 844.3 | 581.3 | 834.9 | 543.3 |
| 16384 | 844.3 | 581.9 | 833.5 | 545.6 |
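Once again, beastie.lab shows an advantage of roughly 4% over appserver.geo with cache hits (block sizes <= 64 KB), just as it should. With cache misses (block sizes > 64 KB) the picture is mixed: beastie.lab still leads at 128 KB and 256 KB, while appserver.geo is ahead by roughly 1 to 7% from 512 KB upward. None of these gaps comes anywhere near the fourfold difference in the context-switching test, which suggests that neither memory nor cache is the culprit.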
We've tested virtually everything we can think of and are still no closer to the answer than when we started. We figure the next step is to track down some of the FreeBSD kernel developers and find out exactly what affects context-switching performance. Hopefully that will give us an idea of where we should be looking.