Java threads may not use all your CPUs - JavaWorld August 2000
By Patrick Killelea
Simple testing reveals a major flaw in Java 1.1.7 on Solaris 2.6
Enterprise-level applications often need more CPU power than a single-CPU machine can provide, but getting applications to use multiple CPUs effectively can be tricky. If you're lucky, you may be able to run an application on separate single-CPU machines; Web servers scale this way. But many applications, such as databases, need to run on a single machine, so scaling them requires adding CPUs to that machine.
In this article, we'll examine tests that show that Java threads in Sun's JRE 1.1.7 do not use more than two processors by default, regardless of how many are in the machine. Java is advertised as automatically scaling to use all available processors, but we'll see below that this is not entirely true. We'll also look at some simple ways to maximize Java multithreading performance on multi-CPU machines.
Note: My tests indicate that JRE 1.1.7 does not use more than two CPUs when running pure Java programs. Under other conditions, such as with native code that has native OS-level threads, JRE 1.1.7 may utilize all of the CPUs.
The scoop on symmetric multiprocessing
Getting multiple equivalent CPUs to cooperate in one machine is known as symmetric multiprocessing (SMP); running a single-CPU machine is known as uniprocessing. SMP machines' CPU modules usually plug into a bus. In theory, you add capacity by plugging in more CPUs.
Unfortunately, because coordinating CPU activity is difficult and most applications spend a lot of time waiting for I/O rather than using the CPU, using two processors doesn't necessarily double your throughput. Performance may even degrade with multiple CPUs. Indeed, for software not designed for an SMP system, adding a CPU may increase performance by no more than 30 percent.
For CPU-intensive programs designed for SMP, a second CPU may increase performance by 80 or 90 percent, depending on what the application is doing. CPU-intensive programs that can run as parallel processes or threads benefit more from additional CPU power than programs that spend most of their CPU time idle, waiting for a network or disk.
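To make these percentages concrete, the following C sketch converts a percentage gain into a speedup factor and a per-CPU efficiency. The function names are illustrative and do not come from the article.

```c
#include <assert.h>

/* Speedup factor implied by a percentage gain: a 30 percent gain
   means the work completes 1.3 times as fast as on one CPU. */
double speedup_from_gain(double percent_gain)
{
    return 1.0 + percent_gain / 100.0;
}

/* Fraction of the ideal linear speedup actually achieved on ncpus CPUs. */
double efficiency(double speedup, int ncpus)
{
    assert(ncpus > 0);
    return speedup / ncpus;
}
```

With two CPUs, a 30 percent gain corresponds to a speedup of 1.3 and an efficiency of 0.65, while a 90 percent gain corresponds to a speedup of 1.9 and an efficiency of 0.95.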
SMP options
One way to use multiple CPUs on a single machine is to run multiple processes that communicate via the various forms of interprocess communication, such as semaphores and shared memory. The operating system will automatically allocate different CPUs to different processes. Interprocess communication techniques are not especially portable, though. One alternative is using native thread programming in C or C++: this is difficult, but is a way to use multiple CPUs in a single process with better portability.
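As a rough illustration of the native-thread approach, here is a minimal C sketch. The article does not name a threading API, so this assumes POSIX threads (pthreads); the function names are hypothetical. Each thread runs a CPU-bound counting loop, and whether the threads actually land on different CPUs depends on the operating system and its thread library.

```c
#include <pthread.h>
#include <stdlib.h>

/* Thread body: count up to the limit passed in, doing no other work. */
static void *count_to(void *arg)
{
    unsigned long limit = *(unsigned long *)arg;
    volatile unsigned long i;  /* volatile keeps the loop from being optimized away */

    for (i = 0; i < limit; i++)
        ;
    return NULL;
}

/* Start nthreads counting threads and wait for all of them.
   Returns the number of threads joined, or -1 if memory runs out. */
int run_counting_threads(int nthreads, unsigned long limit)
{
    pthread_t *tids;
    int started = 0, joined = 0;

    if (nthreads <= 0)
        return 0;
    tids = malloc(sizeof *tids * (size_t)nthreads);
    if (tids == NULL)
        return -1;

    /* Stop creating threads at the first failure. */
    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, count_to, &limit) != 0)
            break;

    /* limit stays valid because we join every thread before returning. */
    for (int k = 0; k < started; k++)
        if (pthread_join(tids[k], NULL) == 0)
            joined++;

    free(tids);
    return joined;
}
```

With gcc this would typically be built with the -pthread option. Even this small example has to manage thread handles, argument lifetimes, and error paths by hand, which hints at why the C route is considered difficult.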
Java thread programming is simple by comparison and provides high portability. Java threads have also been advertised as scaling well on multi-CPU machines. You can run Java threads in parallel on multiple CPUs, but only if you use a native-threads library for Java instead of the green threads option. (The green thread implementation does all thread scheduling within the JVM. On Unix, that means all threads will run in a single process and never use multiple CPUs.)
Native-threads packages schedule threads in the platform's own threading library, and sometimes in the kernel itself. An environment variable or command line switch is normally used to activate native threads. Even when using native threads, Java does not necessarily use all available CPUs; it depends on how the particular JVM was written.
SMP support: Unix vs. Windows NT
Unix is far superior to Windows NT in the SMP arena. Windows NT can currently handle a maximum of four CPUs -- even if a machine has slots for more than four, performance is likely to decrease beyond that number because Windows NT is not efficient at partitioning work among more than four CPUs. Beware of demonstrations where the vendor sets up several machines side by side and claims good scalability. It is extremely difficult to partition most enterprise applications among several independent machines.
Solaris currently scales up to 64 CPUs, theoretically benefiting from each one. (I was able to test up to 12 CPUs, and found this to be true.) Solaris's CPU scalability lets you start with a low-end Solaris machine and add more CPUs as needed without reworking your applications or architecture. Of course, your application must be designed to use SMP.
Give the CPU a workout
I ran some tests in an attempt to prove that adding CPUs to a large Sun machine would increase performance. In a research note about SMP scalability on Linux (see Resources), Cameron MacKinnon explained how he tested Linux SMP scalability by running multiple processes that each did nothing but count to 1 billion. This should only test the CPU -- it's not accessing the disk or memory, and maybe not even the CPU cache. In that spirit, here is an example C program that should give your CPU a workout:
int main(void) { unsigned long i; for (i = 0; i < 1000000000; i++); return 0; }
I compiled this program with maximum optimization (gcc -O3 -o loop loop.c) and ran it on a 500-MHz Dell Optiplex GX1 PCI-bus PC with 512-KB cache running Linux 2.2. It takes about four seconds to run:
% time loop
4.02user 0.00system 0:04.02elapsed 99%CPU
Note: some output removed for clarity
Our 500-MHz CPU ticks off 1 billion clock cycles in two seconds. Since our program runs in about four seconds, it uses about two clock cycles per increment of the variable i.
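The same arithmetic can be written as a one-line C function. The function name is illustrative; the inputs are the clock rate, the measured user time, and the loop count quoted above.

```c
/* Average clock cycles spent per loop iteration, given the CPU clock
   rate in Hz, the measured CPU time in seconds, and the iteration count. */
double cycles_per_iteration(double clock_hz, double cpu_seconds, double iterations)
{
    return clock_hz * cpu_seconds / iterations;
}
```

Calling cycles_per_iteration(500e6, 4.02, 1e9) gives roughly 2.01, which matches the estimate of about two cycles per increment.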
What if we try to simultaneously run two copies of this program on a single-CPU Linux PC? They won't run at exactly the same time because only one can use the CPU at any given moment. The kernel will schedule the processes, making them seem to run at the same time, but it will take twice as long. The elapsed time is directly proportional to the number of processes we're running. I ran as many as 24 processes with the following Perl script while the machine was otherwise idle:
#!/usr/local/bin/perl
$| = 1;     # Do not buffer output.
$\ = "\n";  # Add newline to print statements.
for ($procs = 1; $procs <= 24; $procs++) {
    # Start $procs copies of the loop program.
    for ($i = 1; $i <= $procs; $i++) {
        if ($pid = fork) {}           # Parent: go on to fork the next child.
        elsif (defined $pid) {
            exec 'loop';              # Child: become the loop program.
        }
        else {
            die "cannot fork: $!\n";
        }
    }
    # Time how long it takes for all of the copies to finish.
    $start = time();
    for ($i = 1; $i <= $procs; $i++) { wait; }
    $end = time();
    $latency = $end - $start;
    print "$procs $latency";
}
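For readers who prefer C to Perl, the same experiment can be sketched with fork() and wait(). This is a hedged approximation, not the author's harness: the function name run_batch is hypothetical, and the children count inline instead of exec'ing the separate loop binary.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork `procs` children that each count to `limit`, wait for all of
   them, and store the elapsed wall-clock seconds in *elapsed.
   Returns the number of children reaped, or -1 if a fork failed. */
int run_batch(int procs, unsigned long limit, long *elapsed)
{
    time_t start = time(NULL);
    int started = 0, reaped = 0, failed = 0;

    for (; started < procs; started++) {
        pid_t pid = fork();
        if (pid < 0) {                   /* could not fork: stop starting more */
            failed = 1;
            break;
        }
        if (pid == 0) {                  /* child: burn CPU, then exit */
            volatile unsigned long i;    /* volatile keeps the loop alive */
            for (i = 0; i < limit; i++)
                ;
            _exit(0);
        }
    }

    /* Parent: reap every child that was actually started. */
    while (reaped < started && wait(NULL) > 0)
        reaped++;

    *elapsed = (long)(time(NULL) - start);
    return failed ? -1 : reaped;
}
```

Calling run_batch(procs, 1000000000UL, &secs) for procs from 1 to 24 and printing procs and secs would produce the same kind of two-column output as the Perl script, although the absolute times will differ by machine and compiler.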
The results can be seen in Table 1.
Table 1. Scalability of single-CPU machine
Processes   Time (in seconds)
1           4
2           9
3           12
4           16
5           20
6           24
7           28
8           32
9           37
10          38
11          44
12          48
13          53
14          56
15          61
16          64
17          69
18          71
19          76
20          80
21          83
22          89
23          89
24          93
Figure 1, plotted with Gnuplot, illustrates the results.
Figure 1. Scalability of single-CPU PC hardware running Linux
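One way to check how linear these results are is to fit a straight line to the Table 1 data. The sketch below computes the ordinary least-squares slope, that is, the seconds added per extra process. The array and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Elapsed times in seconds from Table 1; entry k is for k+1 processes. */
static const double table1_seconds[] = {
     4,  9, 12, 16, 20, 24, 28, 32, 37, 38, 44, 48,
    53, 56, 61, 64, 69, 71, 76, 80, 83, 89, 89, 93
};

/* Least-squares slope of y against x = 1..n. */
double seconds_per_process(const double *y, size_t n)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;

    assert(n >= 2);  /* a slope needs at least two points */
    for (size_t k = 0; k < n; k++) {
        double x = (double)(k + 1);
        sx  += x;
        sy  += y[k];
        sxx += x * x;
        sxy += x * y[k];
    }
    return ((double)n * sxy - sx * sy) / ((double)n * sxx - sx * sx);
}
```

For the 24 rows of Table 1 the slope comes out a little under 4 seconds per process, in line with the roughly four-second running time of a single copy.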
Compilers do the funniest things
This test demonstrates that as you run more simultaneous processes, it takes proportionally longer for them to complete, which is normal for CPU-bound processes on a single-CPU machine. I have done similar experiments on Sun hardware, turning off all but one CPU, and saw the same linear scaling, but from a different starting point. That starting point for a single copy of the loop.c program is about 6.7 seconds on a Sun E450 with one 250-MHz CPU.
At this point, a common reaction is, "Now wait a minute, the Sun hardware is more expensive than the PC hardware, but runs this test more slowly?" Yes, in this specific case, the test runs slower on Sun hardware, but you can't draw too many conclusions from that. The Sun machine is designed to scale up to many processors; the PC hardware is not. This adds some overhead. The Sparc CPU runs at half the clock rate of the Intel CPU and uses a different instruction set; the Sparc CPU is RISC, while the Intel CPU is CISC.
To see how random some simple test results can be, simply count backwards in loop.c, going from 1 billion to zero rather than from zero to 1 billion. You will find that the Sun machine takes 3.35 seconds, exactly half the time it did before; the PC counts down in exactly the same time, 4.02 seconds. Now the Sun machine looks faster. What's going on here?
When counting down, the gcc compiler generates fewer instructions on the Sun machine than it does when counting up, but generates the same number of instructions on the Intel platform. Remember, the Sparc and Intel chips use different instruction sets. The gcc compiler is able to find a slightly more efficient way to count down on the Sparc CPU. (You can dump and examine the assembly language generated by gcc by using its -S option.) Compiler optimizations can be especially confusing.
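The two loop directions might look like the following in C. The article does not show the countdown source, so the exact form of count_down is an assumption, and both function names are illustrative. Compiling a file containing these functions with gcc -O3 -S lets you compare the generated assembly for each; a newer compiler may well replace either loop with its final value.

```c
/* Count up from zero to limit, as in the original loop.c. */
unsigned long count_up(unsigned long limit)
{
    unsigned long i;

    for (i = 0; i < limit; i++)
        ;
    return i;
}

/* Count down from limit to zero: the loop test becomes a comparison
   against zero instead of against a large constant. */
unsigned long count_down(unsigned long limit)
{
    unsigned long i;

    for (i = limit; i > 0; i--)
        ;
    return i;
}
```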
Changing the lower limit of the counting can also affect the results. For example, if you count from 1 billion down to 4,095 on the Sun machine, it is about as fast as counting down to zero. But if you count from 1 billion down to 4,096 or more, the running time doubles. When the lower limit is 4,095 or less, the compiler generates a single opcode. But when the lower limit is 4,096 or more, the compiler generates multiple opcodes for the same function. A really clever compiler would just set our loop counter to its end value and skip the loop entirely, because the compiler can see that the loop has no contents. Then the loop would seem to run instantly and the whole test would be meaningless.
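A likely explanation for the 4,095 boundary, which the article does not spell out, is that SPARC instructions carry constants in a 13-bit signed immediate field, whose range is -4,096 to 4,095. A loop bound inside that range can be encoded directly in the compare instruction, while a larger one has to be built in a register first. The check below is a sketch of that rule; the function name is illustrative.

```c
/* Nonzero if value fits in a 13-bit signed immediate field (-4096..4095). */
int fits_simm13(long value)
{
    return value >= -4096 && value <= 4095;
}
```

Here fits_simm13(4095) is true and fits_simm13(4096) is false, which lines up with the point at which the running time changes.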
Give a set of CPUs a workout
I mentioned that Sun hardware is designed to scale up to 64 CPUs: does it really use all those CPUs effectively? This is an important question -- CPUs are expensive. Let's start with processes. On a 12-CPU Sun machine, the total running time is not affected until we exceed 12 loop processes. Then, the total running time increases with each additional process because we no longer have enough CPUs to run all the processes in parallel.
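This behavior matches a simple idealized model: as long as there are at least as many CPUs as processes, the total time equals one process's running time; beyond that, the CPUs are shared evenly and the time grows in proportion. The sketch below encodes that model. The function name is illustrative, and the model ignores scheduling overhead and memory contention.

```c
#include <assert.h>

/* Idealized elapsed time for `procs` identical CPU-bound processes on
   `cpus` CPUs, where one process alone takes `single` seconds. */
double expected_elapsed(int procs, int cpus, double single)
{
    assert(procs > 0 && cpus > 0);
    if (procs <= cpus)
        return single;             /* every process gets its own CPU */
    return single * procs / cpus;  /* CPUs are time-shared evenly */
}
```

With 12 CPUs, the model gives the single-process time for up to 12 processes and twice that time for 24 processes; each process past the twelfth adds one-twelfth of the single-process time.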
Every additional process adds about one-twelfth of one process's running time. This increase shows tha