Author: (c)Kris Kaspersky
From now on and throughout this book, we will use the term profiling to mean measuring the performance of both the entire program and its individual fragments in order to detect so-called hotspots - the program sections that take the longest time to execute.
According to the "10/90" rule, ten percent of the code consumes ninety percent of the system resources (much like the well-known statistic according to which ten percent of the world's population drinks ninety percent of all the beer brewed). If we plot the time required to execute each machine instruction against its linear address, the resulting diagram will contain several towering peaks dominating a practically flat plain dotted with a multitude of low "hills" (see Fig. 1.6). These peaks are the hotspots.
Why is the "temperature" of different program sections so dissimilar? Because most computational algorithms reduce to loops, i.e., to multiple repetitions of the same code fragment. Quite often these loops are not executed one after another but form hierarchies with multiple levels of nesting. As a result, the loops with the deepest nesting level take the lion's share of the program's execution time, and it is the optimization of these loops that yields the greatest performance gain!
Note that optimizing bulky and slow but rarely called functions is practically useless, since the resulting performance gain will be negligible (except, perhaps, when such functions are written in an exceptionally clumsy and awkward manner).
If the program's algorithm is simple and its source code fits within a couple of hundred lines, the hotspots can easily be located by visually examining the listing. However, as the amount of source code grows, this task becomes harder and harder. When analyzing a program comprising several thousand intricate functions interacting with one another in complicated ways (some of which may be called from external libraries or from the operating system's API), one cannot easily tell which of them is responsible for the poor application performance. In this situation, the natural approach is to use specialized software tools.
The profiler is the main instrument of anyone whose goal is program optimization. "Blind" optimization rarely produces good results. Remember the well-known proverb according to which the slowest camel determines the speed of the caravan? Software code behaves similarly: application performance is determined by its "bottleneck". Sometimes a single machine instruction is to blame (for example, a division instruction executed repeatedly inside a deeply nested loop). A programmer who has spent enormous effort improving other sections of the code may be very surprised to find that the application's performance has barely increased by ten or fifteen percent.
Hence, the first rule: eliminating hotspots that are not the "hottest" does practically nothing to improve application performance. Urging on the next-to-last camel will not make the caravan move any faster. (The case when the next-to-last camel slows down the last one is a topic for a separate discussion; since it requires a sound knowledge of profiling techniques, it will not be considered in this book.)
The main goal of profiling is to investigate the application's run-time behavior in all its spots. Depending on the level of detail, a "spot" may be a single machine command or an entire fragment of high-level code (for example, a function, a loop, or a single line of source code).
Most contemporary profilers support the following set of basic operations:
Gathering information about the time the application spends executing each spot allows the "hottest" sections to be detected. At this point, however, note that a direct measurement will show that at least 99.99% of the total execution time is consumed by the main function. Obviously, main is not the real hotspot; the real hotspots are the functions it calls! To avoid confusing programmers, profilers usually subtract the time spent in child functions from the total execution time of each function in the program.
For example, let us consider the profiling results obtained with the profile.exe profiler supplied with the Microsoft Visual C++ compiler (Listing 1.1).
Listing 1.1 Example of profiling results produced by the tool supplied as part of Microsoft Visual C++
The second column (Func + Child Time) lists the total execution time of each function. The lion's share of it is taken by the main function, which is an expected result; the second place belongs to the gen_pswd function (99.6%), then comes the do_pswd function with 98.9%, and, finally, the CheckCRC function, which takes about 3.0%. As for CalculateCRC, which humbly takes 1.6%, at first glance it can simply be neglected. Thus, according to the collected data, we seem to have three hotspots: main, gen_pswd and do_pswd (Fig. 1.1).
Fig. 1.1. The diagram showing the total execution time of each function. At first glance there appear to be three hotspots here; in practice, however, this is not so.
Actually, the main function can be dismissed at once, since it obviously is not responsible for the performance degradation, which leaves gen_pswd and do_pswd under consideration. If they were two completely independent functions, there would indeed be two hotspots, but this is not the case: if we subtract the total execution time of the child function (do_pswd) from the total execution time of its parent (gen_pswd), the parent is left with only 0.8%. Yes - less than one percent of the total execution time!
Now let us look at the leftmost column of the profiler table (Func Time) to confirm this conclusion. In fact, there is only one hotspot in this program - do_pswd - and the only way to significantly improve the application's performance is to optimize it (Fig. 1.2).
Fig. 1.2. The diagram illustrating the net execution time of each function (i.e., without child functions). As it shows, the program being profiled has only one hotspot, but quite an active one.
Well, let us assume that the "hottest" function has been found. Now we need to optimize it, and to do that it would be useful to locate the hotspots within the function itself. Unfortunately, the profile.exe profiler (like other similar tools) cannot help us here, since its resolution is limited to whole functions.
Fortunately, there are other, more advanced profilers capable of resolving individual lines of code and even individual machine commands! Among such tools is, for example, the VTune profiler from Intel. Let us use it to investigate the do_pswd function (more details on the techniques of working with this profiler will be given later in this chapter; see "Practical Profiling Session Using VTune"). The results of this investigation are shown in Listing 1.2.
Listing 1.2 "Temperature" distribution within the do_pswd function generated using VTune.
Now that's quite another matter! It immediately becomes obvious which code fragments need to be optimized and which parts of the program are already fine-tuned. The hotspots are mainly concentrated around the pswd[p] construction, which runs too slowly. Why? The source code doesn't answer this question, so it is not clear what should be done to lower the "temperature" of these hotspots.
For this reason, the investigation has to be carried down to the level of "pure" machine commands (fortunately, VTune provides this capability). Consider, for example, what the compiler has done to the innocent-looking assignment pswd[p] = '!' (Listing 1.3).
Listing 1.3. Investigation of machine commands within the pswd[p]='!' construction
Just look! A single line of source code causes three memory accesses: first the pswd pointer is loaded into the EDX register, then it is added to the p variable, which is also located in memory, and only after that is the '!' constant (021h) written to memory.
Still, it remains unclear why loading the pswd pointer takes so long. Perhaps some other process constantly evicts pswd from the cache, forcing the processor to access RAM? No, nothing of the sort! The program under consideration works with a small number of variables, which certainly fit within the L2 cache.
If the execution time of some spot in a program is not constant but varies within certain limits (for example, depending on the type of data being processed), the profiling results become ambiguous and the result itself unreliable. To obtain more definite results it is necessary to: a) find out whether the program actually contains such "floating" spots and, if so, b) determine their execution times in the best, average and worst cases.
Profilers capable of measuring the specific execution times of individual machine commands (a technique also known as dynamic assembly analysis) are not numerous; fortunately, VTune is one of them. Let us examine the log of dynamic assembly analysis it has generated (Listing 1.4). Perhaps it will help us solve the mystery of the slow loading of the pswd pointer.
Listing 1.4. Specific execution times of machine commands within the profiled code fragment
Once again we run into the same annoyance! Loading the pswd pointer has consumed 13 processor clock cycles, while all the other commands fit within a couple of clocks; some of them even manage to complete together with the previous instruction, thus requiring 0 clocks.
With the exception of the command that loads the contents of the y variable into the ECX register, the execution time of all commands is constant, i.e., it does not vary from run to run. The command in question, however, can take up to 80 clock cycles, depending on some unclear circumstances, which of course makes it the hottest spot of this fragment. Eighty clocks will not do at any rate! Even though its average execution time is only 7 clocks and its minimum is actually zero, we have to find out why, for what purpose and under which circumstances so many CPU cycles are required.
The very existence of a hotspot is evidence that something is wrong with the program. We are lucky if it is a purely algorithmic error that can easily be detected without specialized tools (for example, when bubble sorting turns out to be the application's bottleneck). It is much worse if CPU time is consumed without any apparent reason. In the worst case, such leaks do not occur randomly but appear under specific conditions that we are unable to figure out.
Now let us return to the question we have not answered yet: why does the pswd pointer load so slowly? Under which circumstances does its loading time increase so dramatically - from 7 up to 80 clock cycles? Most probably something is wrong with these two machine commands, and for this reason the processor has penalized them by temporarily delaying their execution. Is it possible to find out when this happened and for what reason? Without fully emulating the processor it is hardly possible (although contemporary x86 processors do make it possible to collect this information, with some minor limitations).
Leading manufacturers such as Intel and AMD released platform-specific profilers long ago, and these include full-featured emulators of their respective processors, allowing the programmer to see the finest details of the execution of each machine instruction.
For example, when working with VTune it is sufficient to click the "mov ecx, DWORD PTR [ebp-28]" line, and the profiler will provide the following information (Listing 1.5):
Listing 1.5. Penalty information displayed by VTune for the "mov ecx, DWORD PTR [ebp-28]" instruction
Thus, our investigation turns into a detective story, even more intricate than the ones by, say, Agatha Christie. If we process the obtained results with even the simplest mathematical methods, they look very much like nonsense. Judge for yourself: the total number of CPU cycles consumed by the instruction is 80, and at the same time it is known to have been executed 11 times (see the count column of the profiler's report). On the other hand, in the worst case the instruction requires 80 clocks to execute! And the worst-case decoding time is even longer - 86 clocks! This means that the worst-case decoding time of the instruction exceeds its total execution time. Furthermore, as if these discrepancies were not enough, our instruction loses at least one clock per iteration (this occurred 10 times) because the address block is busy. Yes, this can certainly drive you mad!
The reason for such discrepancies lies in the relativity of the very concept of time. Do you really think that time is relative only in Einstein's theory of relativity? If so, you are mistaken! In pipelined processors (Pentium and AMD K6/Athlon in particular) there is no such concept as "instruction execution time" at all (see "Fundamental Profiling Problems, Pipelining and Throughput vs. Latency"). Since several instructions can be executed in parallel, a simple algebraic sum of their execution times produces a significantly larger value than the instructions actually require.
Well, we will cover this topic in more detail later. For the moment, note that because our instruction is missing from the cache (actually, it straddles the boundary of two cache lines), it has to be loaded from main memory. Therefore, during the first iteration of the loop it executes much more slowly than during all subsequent iterations. Hence the 80 cycles.
When there are more iterations (and in the real world their number is usually quite large), the time of the initial loading can be neglected, but... Stop! The profiler executed the body of this loop only 11 times. As a result, the average execution time of this instruction came out as 7.3 cycles, which doesn't correspond to reality at all!
Hmm! This "hotspot" is not a hotspot at all, and there is nothing to optimize here. If we increase the number of profiler runs at least four times, the average run time of our instruction drops to 1.8 clocks, making it one of the "coldest" spots of the program! Actually, this spot drops to absolute zero, since the effective run time of this instruction is 0 clocks (i.e., it completes simultaneously with the previous machine command). In other words, we have failed completely.
Hence, we have the following rule: before starting optimization, make sure that the number of runs is sufficient to mask the initial loading overhead.
Now let us briefly consider why we ran into such a situation. By default, VTune runs the profiled fragment 1,000 times. Is that too much? Don't hurry with the answer. Our intricate loop is organized so that the instruction in question gets control only once every 'z' - '!' == 0x59 (89) iterations. Thus, over the whole analysis interval it executes only 1,000/89 == 11 times (Listing 1.6)! Note that this example is not artificial; on the contrary, such situations are rather frequent.
Listing 1.6. Demonstration of the code containing infrequently run fragments, which distort the profiling results
Therefore, having detected a hotspot for the first time, make sure that it is executed a sufficient number of times; otherwise the result is likely to be doubtful. At this point, it is time to move on to the discussion of the number of calls to each spot of the program.
However, we still have the second hotspot to deal with - the surprisingly slow loading of the pswd pointer. Experienced programmers have probably already guessed the actual reason for this delay.
Actually, the pswd[p] = '!' line is the first line of the loop body. It gets control only once every 0x59 iterations, which far exceeds the resolution of the dynamic branch prediction algorithm used by the processor to keep the computational pipeline from stalling.
Consequently, this branch is always predicted incorrectly, and the processor has to start executing this instruction from scratch. Note that the processor pipeline is rather long, and one must wait while it fills up... Actually, it is not the "mov edx, DWORD PTR [ebp+0ch]" command that is responsible for the performance drop - any other command in its place would run just as inefficiently! The cause that, like a blowtorch, makes this spot so "red-hot" actually lies in quite a different place!
Let us move up to the conditional jump instruction immediately preceding this command and double-click it. VTune will output the following information (Listing 1.7):
Listing 1.7. Penalty information displayed by VTune for the conditional jump instruction preceding the "mov edx, DWORD PTR [ebp+0ch]" command
Our hypothesis is fully confirmed: as VTune testifies, this branch was predicted incorrectly thirteen times. Why thirteen, you may ask, if the loop body is executed only 11 times? Yes, it runs eleven times, but the processor couldn't know that beforehand: it twice attempted to pass control to this loop, and only after those attempts failed did it stop trying to pass control to this branch.
OK. It is certainly pleasing when puzzles get solved. However, a more important question remains: how are we going to solve them in general? In our situation we were lucky that the mispredicted conditional jump was so close to the hotspot, but this is not always the case. On some occasions (which are not too rare) the "responsible" instruction may even reside in a different module of the program! No comment... except for one recommendation: when profiling, always take a comprehensive approach and use your little gray cells. Perhaps no one can give you better advice.
As we have just demonstrated, determining the number of calls to the hotspot being profiled is necessary, first of all, for judging whether the measured data are reliable. Secondly, a hotspot's "temperature" can be evaluated by the frequency of calls as well as by its run time.
For example, consider a situation when we have two hotspots, both consuming the same amount of processor time, but the first one is called 100 times while the second is called 100 thousand times. It is not difficult to guess that optimizing the latter hotspot by even 1% may yield a noticeable performance gain, whereas even cutting the run time of the first one in half will speed up the program by only 25 percent.
Quite often it makes sense to in-line frequently called functions, i.e., to insert their code directly into the body of the calling function, which saves a certain amount of time on every call.
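As an illustration (the names and bodies below are hypothetical, not code from the book), compare a loop that calls a tiny helper on every iteration with the same loop after the helper has been in-lined by hand:

    /* a tiny helper called from the innermost loop */
    static int next_char(int c)
    {
        return (c == 'z') ? '!' : c + 1;
    }

    void step_all(char *pswd, int len)
    {
        int p;
        for (p = 0; p < len; p++)
            pswd[p] = (char)next_char(pswd[p]);   /* call overhead on every iteration */
    }

    /* the same loop with the helper's body inserted directly into the caller */
    void step_all_inlined(char *pswd, int len)
    {
        int p;
        for (p = 0; p < len; p++)
            pswd[p] = (pswd[p] == 'z') ? '!' : (char)(pswd[p] + 1);
    }

Many compilers will perform this substitution themselves when asked to (for example, for functions declared inline), but the effect is the same: the call and return overhead disappears from the hot loop.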
Practically all profilers can determine the number of calls, so this problem does not deserve special consideration.
Strictly speaking, determining coverage has no direct relationship to program optimization; rather, it is a secondary function of profilers. Still, since it is implemented, we should consider it, at least briefly.
Coverage is the percentage of the program code that was actually executed during the profiling session. Who might need this information? Above all, testers, since they need to make sure that the application code has been fully exercised and that no "dark" spots remain in it.
On the other hand, when you optimize a program, it is important to know for sure which parts of it were actually profiled and which were not. Otherwise, you might overlook most of the hotspots simply because their respective branches never got control!
For example, let us see what the function coverage report generated by profile.exe might look like (Listing 1.8) for our test example pswd.exe (more details on the test example itself are provided in the section titled "Practical Profiling Session Using VTune").
Listing 1.8. The function coverage report generated by the profile.exe profiler
This listing shows that only 60% of the functions got control during program execution, while the remaining 40% were never called at all! Surely it makes sense to find out whether these functions are ever called, or whether they represent so-called "dead" code that can be deleted from the program, reducing its size almost by half.
If these functions do get control under specific circumstances, it is necessary to analyze the source code in order to find out which circumstances lead to the calls, and then to reproduce those circumstances so that the profiler can exercise the remaining sections of the program. The names of covered and uncovered functions are listed in the Covered Function section: covered functions are marked with an "*", uncovered ones with a ".".
Generally speaking, there are many specialized applications (such as NuMega Code Coverage) designed specifically for this task, and they solve the code coverage problem much better than any profiler does.
From now on, we will use the term microprofiling to mean measuring the run time of small code fragments (sometimes even of individual commands).
Microprofiling has a number of inherent problems that can hardly be eliminated. A lack of understanding of these problems often results in serious errors when interpreting the profiling results and, consequently, in wasted time and poor-quality optimization.
Let us begin by stating that in pipelined systems there is no such concept as the "execution time of an individual command". To illustrate this, consider the following analogy. Suppose some factory produces 600 memory chips per hour. Now answer the question: how long does it take to produce a single chip? Six seconds? No! The complete production cycle takes months rather than days, to say nothing of seconds. We fail to notice this obvious fact only because of the manufacturing pipeline: the whole process is subdivided into individual stages, and at every instant at least one memory chip is passing through each of them.
The number of items produced by the pipeline per unit of time is known as its throughput. It is not difficult to show that, in the general case, the throughput is inversely proportional to the duration of a single production stage: the shorter each step, the more products are manufactured per unit of time. The number of steps (the length of the pipeline), on the other hand, plays no role at all. By the way, note that in practically every factory each stage is an elementary operation - not because human beings perform routine work better than complex work (on the contrary, they hate it!), but because elementary operations take very little time and thus ensure the maximum throughput of the pipeline.
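In symbols (my own summary of the analogy, not a formula from the original text): for a pipeline of n stages, each taking a time t,

\[ \text{throughput} = \frac{1}{t}, \qquad \text{latency} = n \cdot t \]

In the idealized case, splitting every stage in two halves t and doubles the throughput, while the latency n * t stays the same - exactly the trade-off behind the ever longer pipelines described next.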
Processor manufacturers exploit the same approach, and there is a clearly visible trend: pipelines become longer with each new model. In the first Pentium processors, for example, the pipeline was only five stages long, in the Pentium II it reached 14 stages, and in the Pentium 4 it has grown to 20. This trend was driven by the rapid growth of processor clock speeds and, consequently, by the need to make the pipeline operate at those frequencies.
On the one hand, this is all very well - the pipeline runs at a crazy speed, processing up to 6 micro-operations per clock. Does its length matter at all, then? It certainly does! Let us return to our factory analogy. Suppose we want to put a new model into production. When will we be able to ship the first items? (Let us neglect bureaucratic delays.) The answer is obvious: not after six seconds and not after an hour of waiting - we will have to wait until the whole production cycle completes.
Latency - the complete time interval needed to pass through the production cycle - may have no significant influence on the manufacturing process itself (after all, new chip models don't appear every day). However, it has a significant influence on the performance of a CPU processing heterogeneous code made up of fragments dissimilar in nature. The advancement of machine instructions along the pipeline runs into several fundamental obstacles: sometimes the operands are not ready, sometimes the execution unit is busy, sometimes a mispredicted conditional jump is encountered (which, by the way, is similar to re-tooling our factory for a new model), and so on. As a result, instead of continuous and smooth operation we get lengthy stalls, only occasionally interrupted by short bursts of activity that are immediately followed by further idleness.
In the best case, the time required to execute a single instruction is determined by the pipeline's throughput; in the worst case, by the pipeline's latency. Since throughput and latency are values of a different order, it makes no sense to speak of the average run time of an instruction: if you calculate such a value, it will not correspond to any physical reality. An unpleasant consequence is that the real run time of a compact code section cannot be determined (unless, of course, a processor emulator is used). Until the execution time of a code section exceeds the pipeline's latency (30 clocks for the P6), we cannot say anything definite about the code, its execution time, or the pipeline!
One of the fundamental differences between digital and analog measuring devices is that the accuracy of a digital measurement is limited by the resolution of the instrument itself, and no number of repetitions will push it beyond that limit (the precision of an analog measurement, on the contrary, grows with the number of measurements taken).
How can we measure the run time of individual sections of a program? The IBM PC AT family provides two such mechanisms: the system timer (default rate: 18.2 ticks per second, i.e., a period of 55 msec; maximum rate: 1,193,180 ticks per second, i.e., a period of about 0.84 µsec) and the real-time clock (a rate of 1,024 ticks per second, i.e., a period of about 0.98 msec). In addition, processors of the Pentium family have the so-called Time Stamp Counter, which is incremented by 1 on every tick of the processor core.
Now let us consider these facilities in more detail. The system timer (taking into account the time required to read its counter) provides an accuracy of about 5 µsec, which amounts to more than two thousand clock cycles even on a 500 MHz system! For the processor this is an eternity: during this time it can process tons of data - sorting, say, about 150 numbers. For this reason the system timer is unsuitable for profiling individual functions. In particular, do not hope to find the bottlenecks of a quick sort function with it; when the number of processed values is small, it cannot reliably determine even the total sorting time, let alone the bottlenecks.
Moreover, the system timer is not directly accessible under normal operating systems, and the minimum time interval that can be measured with the standard C function clock() is only 1/100 of a second - an interval suitable only for measuring the run time of the program as a whole.
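For example, a minimal timing harness built on nothing but clock() might look like this (a sketch of the standard-library approach, not one of the book's listings):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        clock_t start = clock();

        /* ... the whole workload being measured goes here ... */

        printf("elapsed: %.2f s\n",
               (double)(clock() - start) / CLOCKS_PER_SEC);
        return 0;
    }

With a 1/100 s granularity, anything shorter than a few seconds of total work yields essentially meaningless figures.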
The precision of the real-time clock cannot compete with that of the system timer (reprogrammed to its maximum rate, of course).
So the Time Stamp Counter remains our only hope. On first acquaintance with it, most people are simply delighted: "finally, Intel has given us what we have dreamed of for so long!" Judge for yourself: first, operating systems of the Windows family (including Windows NT) provide unrestricted access to the RDTSC machine command, which reads the contents of this register. Secondly, since the register is incremented on every clock tick, the programmer gets the false impression that it allows the run time of every processor command to be determined exactly. Alas, this is not so!
Let us begin with the fact that in pipelined systems (as we already know) there is no such thing as the run time of a command - no such concept at all! Therefore, we are dealing either with throughput or with latency, and the question immediately arises: which of the two does RDTSC actually measure? Intel's original documentation doesn't answer this question directly. Presumably, according to the available information, RDTSC reads the contents of the counter at the instant it passes through the corresponding execution unit. At the same time, RDTSC is not a serializing (ordered) command, i.e., it can complete even before the commands that precede it - for example, when a preceding command is stalled waiting for its operands.
Let us consider an extreme case (Listing 1.9), in which we measure the run time of the smallest possible portion of code - the single machine command needed to save the value read by the first RDTSC:
Listing 1.9. An attempt of measuring execution time of a single machine command
When this example is run on a P-III, the result is 32 clocks, which immediately raises a reasonable question: why so many? Let us put off the question of where these clock cycles leak to and, for the moment, try to measure the run time of some machine command - for example INC EAX, which increments the value in the EAX register by 1. Let us place this command between the two RDTSC instructions and recompile the program.
Surprisingly, the test run produces the same result - 32 clocks. Strange! Let us add one more INC EAX command. Once again, the test produces the same result - 32 clocks. How can that be? However, if a third INC EAX is added, the run time grows by one clock - now the result is 33. Four and five INC EAX instructions produce the same result, and only after the sixth instruction is added does the measured value increase by one again.
But wait - these INC EAX commands form a dependency chain, each one needing the result of the previous one, so the processor cannot possibly execute them at a rate of three per clock! The obtained result - three commands per clock - therefore reflects the rate of their decoding rather than the rate of their execution. To confirm this, let us run the following loop (Listing 1.10):
Listing 1.10. Measuring the time required to execute 6x1000 INC machine commands
When this loop is run on a P-III, it turns out to take 6,781 clock cycles rather than the ~2,000 we expected! It follows that in the previous case, when we were measuring a handful of machine commands, the RDTSC instruction reported its result prematurely - before the commands preceding it had actually completed.
So it would be good to have some way of delaying the execution of RDTSC until all preceding machine instructions have completed... And such a way exists! It is sufficient to place one of the serializing (ordered-execution) commands in front of RDTSC. A serializing command begins to execute only after the last of the preceding commands has left the pipeline, and all the commands that follow it have to wait until it completes.
Most serializing commands are privileged (the I/O port read instructions, for example); only a few of them are available at the application level. In particular, the CPUID instruction belongs to this small set.
Most manuals (for example, "How to optimize for the Pentium family of microprocessors" by Agner Fog, and Intel's official document "Using the RDTSC Instruction for Performance Monitoring") therefore suggest using the following code (Listing 1.11):
Listing 1.11. Officially recommended method of calling RDTSC to measure run time
Unfortunately, even this officially recommended code is not suitable for measuring the run time of individual instructions, because the result it produces represents the full run time of the instruction, i.e., its latency, rather than the throughput that interests us most.
(By the way, both Intel and Fog have made a blunder here: their examples do not explicitly load the argument that CPUID expects to find in the EAX register before it executes. Since the run time of CPUID depends on its argument, the run time of the profiled fragment is not constant; it depends on the state of the registers at the input and output. The example provided below performs an explicit initialization of EAX, thus avoiding the side effects that could distort the measurement.)
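The resulting pattern looks roughly like this (a minimal sketch in 32-bit Microsoft Visual C++ inline-assembly syntax, assuming the inline assembler recognizes the CPUID and RDTSC mnemonics; it is not the author's original listing, and the variable names are mine):

    #include <stdio.h>

    int main(void)
    {
        unsigned int t_start, t_stop;

        __asm {
            xor   eax, eax      ; explicit CPUID argument
            cpuid               ; serialize: wait for all preceding commands to retire
            rdtsc               ; read the Time Stamp Counter
            mov   t_start, eax  ; keep the low 32 bits
        }

        /* ... the fragment being measured goes here ... */

        __asm {
            xor   eax, eax
            cpuid               ; serialize again before the second reading
            rdtsc
            mov   t_stop, eax
        }

        printf("%u clocks (including the CPUID/RDTSC overhead itself)\n",
               t_stop - t_start);
        return 0;
    }

In practice the constant overhead of the CPUID/RDTSC pair itself has to be measured once with an empty body and subtracted; even then, as the text explains, the figure obtained is a latency, trustworthy only for fragments of at least 50 - 100 clocks.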
There is yet another problem, even more serious than the first. You certainly remember the principle of quantum physics stating that any measurement of an object's properties inevitably changes the object itself, and those changes in turn distort the result of the measurement. What makes things even more complicated is that such distortions cannot be removed by simple calibration, because they can be qualitative as well as quantitative.
If the profiled code uses the same processor units as the RDTSC/CPUID commands, its measured run time can differ significantly from the real one, and no sophistication will bring the measurement precision down to one or two clocks!
Because of these factors, in the author's personal experience, the minimum time interval that can be trusted is no less than 50 - 100 clocks.
Hence, it is impossible to measure the execution time of an individual command using the standard facilities of the processor.
Let us stop discussing computers for a moment and try to answer the following question: is it possible to measure the thickness of a sheet of ordinary paper with an ordinary ruler? At first glance, the task cannot be solved without special tools... but take a stack of about 50 sheets, and you have certainly already understood what I mean. Even with a measurement error of ±1 mm, the precision obtained for each individual sheet is no worse than ±0.02 mm, which is sufficient for most everyday purposes!
Why not use the same technique for measuring the run time of individual machine commands? Indeed, the run time of a single command is too small to be measured directly (see "Measurement errors"), but if we take a couple of hundred, or even a hundred thousand, such commands, then... Alas! Machine commands behave quite differently from sheets of paper: the non-uniformity of the pipeline makes the dependence between the number of commands and their total run time non-linear.
Besides, contemporary processors are too intelligent to interpret the code being processed literally - they are quite creative and inventive about it. Consider, for example, the situation when the processor encounters the following sequence of commands:

MOV EAX, 1
MOV EAX, 1
MOV EAX, 1

Each command places the value 1 into the EAX register. Do you think the processor is so dumb that it executes all three of them? Not at all! Since the results of the first two assignments are never used, the processor can dismiss these commands as unneeded: it spends time only on decoding them and saves considerably on their execution.
Code optimization performed by the processor at the hardware level significantly improves overall system performance by lowering the actual run time of machine commands. We can, of course, measure quite precisely how long it took the processor to execute a fragment comprising a thousand specific commands; however, the result must be interpreted very carefully when it comes to judging the run time of an individual command.
Bearing in mind that the throughput of most instructions is just one clock, while the minimum measurable time interval is about 50 - 100 clocks, the resolution limit of profilers that do not use emulation is no better than roughly 50 commands.
From now on, by "resolution" we will mean the minimum length of a hotspot that the profiler can more or less reliably recognize.
Strictly speaking, profilers without emulation detect some lengthy area to which the hotspot belongs rather than the hotspot as such.
From now on, we will use the term macroprofiling to designate measuring the run times of structural program units (functions, repeatedly executed loops, etc.) and, sometimes, of the whole program.
Macroprofiling has inherent problems of its own: inconsistency of run times, the "second pass" problem, the need to take various side effects into account... In other words, quite a bunch of problems!
If you have any previous experience with profiling applications, you have certainly encountered run-time inconsistency: the results of run-time measurements vary from run to run, sometimes quite significantly.
There are at least two reasons for such inconsistency: software inconsistency, related to the fact that in multitasking operating systems (Windows in particular) the profiled program is strongly affected by an extremely variable environment, and hardware inconsistency, caused by the internal "multitasking" of the hardware itself.
Both causes will be covered in detail because of their significant impact on profiling results.
In a multitasking environment - and Windows, the most popular of today's operating systems, is a good example - no program has exclusive access to all system resources; it has to share them with other tasks. This means that the execution speed of the profiled program is not constant but depends heavily on the current environment. In practice, the scatter of the results can reach 10% - 15%, sometimes even more, especially if other resource-hungry tasks are running at the same time.
Still, this does not present a particular problem. It is sufficient to make several test runs within each profiling session and then pick the measurement with the shortest run time. The idea behind this approach is as follows: performance measurements are not ordinary instrumental measurements, and the usual rules of metrology do not apply here. The processor never makes errors, and every result it produces is accurate; it may be more or less distorted by side effects, but no side effect will ever make the program run faster than it actually does. Therefore, the fastest run (the one with the minimum run time) is the result least distorted by side effects.
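A minimal sketch of this "take the fastest run" approach (my own illustration; measure() stands for whatever routine performs one complete timed run):

    #include <limits.h>

    /* run the measured fragment n_runs times and keep the fastest result -
       the one least distorted by other tasks, interrupts and the like */
    unsigned int best_of(unsigned int (*measure)(void), int n_runs)
    {
        unsigned int best = UINT_MAX;
        unsigned int t;
        int i;

        for (i = 0; i < n_runs; i++) {
            t = measure();              /* one complete timed run          */
            if (t < best)
                best = t;               /* keep the minimum, not the mean  */
        }
        return best;
    }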
By the way, most manuals recommend that before profiling you disconnect from the network ("so the machine won't receive network packets"), close all applications except the profiler, and even reboot "just in case". This, of course, is pure nonsense! I quite often profile programs while working in Microsoft Word, receiving mail, downloading files from the Internet, and so on, and the profiling results obtained during these activities have always remained satisfactory. Of course, you shouldn't heavily overload the machine without real need, and it certainly makes sense to close applications you don't need; but there is no reason to take things to the absurd by trying to ensure an absolutely "sterile" environment.
Strange as it may seem, at the hardware level the run time of the same operation is not constant either. It is subject to variations that are sometimes significant and can even exceed the software inconsistency. Unlike software inconsistency, which can at least in theory be eliminated (for example, by running the program in single-user mode), hardware inconsistency cannot be eliminated in principle.
Why does this problem arise at all? There are many reasons. For example, if the system bus frequency doesn't match the frequency of the RAM chips, the chipset has to wait a random interval of time for the rising edge of the next clock pulse. Considering that, depending on the type of RAM chips, each burst exchange cycle takes approximately 5 - 9 clocks, and that both its start and its end must be synchronized, it is not difficult to calculate that in the worst case we get a discrepancy of about 25% - 40%.
The most interesting fact is that hardware inconsistency can differ significantly from system to system. Unfortunately, I did not manage to determine the exact reason for this phenomenon. However, I can say for sure that, for example, on a P-III 733/133/100/i815EP the hardware inconsistency is rather insignificant and hardly exceeds 1% - 2%, despite the difference between the system bus frequency and the RAM frequency; in this case it can be neglected.
On the other hand, the situation is quite different on an AMD Athlon 1050/100/100/VIA KT133! Here we are dealing with an enormous discrepancy, which in operations with RAM sometimes exceeds a factor of two; on such a system it is practically impossible to profile programs at all. In particular, sequential measurements of the time required to copy a 16 MB memory block might look as shown below (after discarding the extreme values):
Listing 1.12. Sequential measurements of the time required to copy 16 MB memory block
Just look: the discrepancy between the minimum and maximum run times is no less than 36%! This means that you will be unable to detect hotspots of smaller magnitude; furthermore, you won't be able to evaluate the effect of a particular optimization unless it provides at least a twofold performance gain!
Hence, two conclusions: I) not every system is suitable for profiling and optimizing applications, and II) if sequential measurements produce significant discrepancies, simply change the system. (Here, the term "system" means the computer hardware rather than the operating system.)
As was shown earlier, "raw" run-time measurements are not suitable for anything useful; obviously, they require pre-processing before use. At the very least, it is necessary to discard the outliers caused by unexpected side effects (consider, for example, the situation when the OS decides to flush data to disk at the most important moment of the profiling run), and then... then we face a difficult choice. We must either take the result with the shortest run time, as the one least affected by the negative influence of multitasking, or calculate the most typical run time, as the real-world result in contrast to one obtained under ideal laboratory conditions.
I have successfully tested and used a mixed method that combines both approaches. Essentially, I suggest working with an averaged minimal run time. In general, the algorithm is as follows: perform N runs of the profiled program, then discard the N/3 largest and the N/3 smallest measurements, and average the remaining N/3 measurements; that average is taken as the main result. The value of N varies with the situation; normally, however, 9 - 12 runs are sufficient, since further increasing this number does not improve the precision of the result.
One of the possible implementations of this algorithm is shown below (Listing 1.13):
Listing 1.13. Calculating typical average execution time
To achieve acceptable precision, the profiled application must be run at least 9 - 12 times (see "Run time inconsistency. Processing of the measurement results"), and each run must be performed under identical conditions. Alas! Without developing a full-featured emulator of the whole system, this requirement practically cannot be met: the disk cache, both levels of the processor cache, the physical page buffers and the branch history significantly complicate profiling, because repeated runs of the same code execute noticeably faster than the first one.
If we are profiling a loop that is repeated many times, we can neglect this factor, since the time required to load the data and/or code into the cache has practically no effect on the total run time of the loop. Unfortunately, this is not always the case (one such case was considered in "Goals and problems of profiling. Penalty information").
On the other hand, it might be the application's initialization that we wish to optimize! Even though it is executed only once per session, no user will be pleased if program startup takes minutes (or even tens of minutes). Of course, it is possible to reboot the system before every measurement, but then... how long would the whole profiling take?
Well, clearing the data cache is an easy task: it is sufficient to read a large memory block that considerably exceeds the cache size. It also makes sense to write a large block in order to flush all the write buffers (see "Cache"). By the way, this operation also clears the TLB (Translation Lookaside Buffer) - the buffer that caches page mappings for quick access (see "Prefetching"). The code cache/TLB can be cleared in a similar way: it is sufficient to generate and call a large function, about 1 - 4 MB in size, that does nothing useful (for example, it can be filled with NOP commands). All these steps reduce the negative impact of the effects described above; unfortunately, they cannot eliminate it completely, since there are things in this world that cannot be controlled either directly or indirectly (at least not from the application level).
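A sketch of such a "cache polluter" might look as follows (the 8 MB figure and the names are my own assumptions; the block merely has to be much larger than the L2 cache of the machine in question):

    #define POLLUTE_SIZE (8 * 1024 * 1024)     /* must well exceed the L2 cache */

    static char pollution[POLLUTE_SIZE];

    void pollute_data_cache(void)
    {
        volatile char sink = 0;
        int i;

        /* read the whole block to evict previously cached data (and most TLB entries) */
        for (i = 0; i < POLLUTE_SIZE; i++)
            sink += pollution[i];

        /* write the whole block to flush the write buffers as well */
        for (i = 0; i < POLLUTE_SIZE; i++)
            pollution[i] = (char)i;
    }

Calling it immediately before each measured run puts the function being profiled back into approximately "first run" conditions.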
On the other hand, if we are optimizing an individual function (let us take a string reverse function as an example), its first run will not by itself produce any useful results, because in this case the performance is determined not by the efficiency of the function's code and algorithm, but by the overhead of loading its machine instructions into the cache, allocating and mapping pages by the operating system, and so on. In a real program this overhead has usually already been paid (even if the function is called only once).
Let us conduct the following experiment: take the function provided below (Listing 1.14) and run it 10 times in a row, measuring its execution time on each run.
Listing 1.14. The example of function that accesses each cell loaded into the cache only once
For memory blocks that completely fit within the L1 cache, the following results are obtained on a P-III 733/133/100/i815EP (Listing 1.15):
Listing 1.15. Results shown by the function (see Listing 1.14) for memory blocks that fit within the L1 cache
Pay special attention to the following fact: the first run of the function (do not confuse it with the first iteration of the loop!) takes practically four times as long as all subsequent ones, and its result varies unpredictably from 62,190 to 91,873 clocks, which corresponds to an error of about 50%. Does this mean that if in reality this loop is executed only once, there is no point in trying to optimize it? Certainly not! For example, let us get rid of the XOR trickery and exchange the two array elements through a temporary variable, as most practical people would do. This reduces the time of the first run to 47,603 - 65,577 clocks, improving the efficiency by 20% - 40%!
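For clarity, the two exchange variants under discussion look roughly like this (a hypothetical in-place reverse of a character buffer, not the author's Listing 1.14):

    /* exchange through XOR operations - no temporary variable */
    void reverse_xor(char *p, int n)
    {
        int a, b;
        for (a = 0, b = n - 1; a < b; a++, b--) {
            p[a] ^= p[b];
            p[b] ^= p[a];
            p[a] ^= p[b];
        }
    }

    /* exchange through a temporary variable - the "practical people" version */
    void reverse_tmp(char *p, int n)
    {
        int a, b;
        char t;
        for (a = 0, b = n - 1; a < b; a++, b--) {
            t    = p[a];
            p[a] = p[b];
            p[b] = t;
        }
    }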
Still, the results begin to repeat more or less stably only from the third run onward! The reason the first run is so slow is obvious (loading the data into the cache, etc.), but why is the second run also slow, and what prevents it from being faster? The answer is: branches. During the first run the dynamic branch prediction unit has not yet collected enough information, so during the second run it still makes mistakes; starting with the third run, however, it has finally "understood" the situation and begins to work as expected.
The following experiment confirms that the branches are indeed responsible for this behavior: let us define the __OVER_BRANCH__ symbol and see how it affects the results. Indeed, the difference between the first and the third runs drops from 0.3% to 0.1%. Naturally, if our algorithm contained more branches, three runs would not be enough to accumulate statistically reliable results; on the other hand, the error introduced by the branches is relatively small and can be neglected. By the way, note that the run time of all subsequent runs is constant to within one clock!
Thus, when profiling functions that run repeatedly, discard the results of the first two or three runs; at the very least, never use them when calculating the average value.
On the contrary, when profiling functions that run only once within the profiled program, pay attention only to the first run and discard all the others. In addition, before each subsequent run, clear all the caches and buffers that distort the actual performance.
When correcting one error, we always risk introducing another. Therefore, after any code modification, however insignificant it may seem, the whole profiling session must be repeated.
The following example illustrates this. Consider the situation when the program being optimized contains a function similar to the one presented below (Listing 1.16):
Listing 1.16. Code fragment demonstrating the situation when discarding redundant code might result in significant and unexpected performance degradation
Obviously, if a zero or negative value is passed to this function, the for_a loop will never execute, and consequently the explicit argument check (highlighted in bold in the listing) makes no sense! Certainly, when the values of foo are large enough, its overhead is negligible. But can we at least be sure that removing this line won't slow the function down?
This question is neither crazy nor senseless, and it is not a purely theoretical abstraction. It is quite possible that removing this line will produce an effect exactly opposite to the expected one: instead of optimizing the function, we will significantly slow it down!
How can that be? Quite easily. If the compiler doesn't align loops in memory (and MS VC doesn't), then it is highly probable that we'll run into a cache conflict (see "Part II. The Cache"), which carries a penalty; on the other hand, the conflict may just as well not occur - it is an entirely random matter. Perhaps it was this absolutely senseless argument check (executed, mind you, only once) that saved the whole loop from penalty delays arising on every iteration.
Everything said above applies not only to the elimination of redundant code but to any modification that changes the code size. In the course of optimization, performance can change in the most exotic ways: it can degrade or improve unexpectedly, without any visible reason. The most inconvenient circumstance is that most compilers give the programmer no control over code alignment. Thus, if a loop lands at memory addresses that are inconvenient from the processor's point of view, all we can do is add an extra code fragment to the program so as to push the loop out of the unfavorable position, or simply keep changing the code until a better combination is found.
"This is idiotic," you will probably say, and you will be right. Fortunately, compilers such as MS VC align function entry points on addresses that are multiples of 0x20 (which corresponds to the length of a cache line in processors such as the P6 and K6). This eliminates the mutual influence of functions and confines the effect of code changes to a single function.
The same applies to the size of the processed data blocks, the number and types of variables, and so on. Quite often a decrease in the amount of required memory provokes conflicts of one kind or another, which in turn degrade performance. Note also that when global and/or dynamic variables are involved, the effect is not confined to a single function but impacts the program as a whole! (See "DRAM Banks Conflicts".)
Let us formulate three rules that must be observed when profiling large programs, especially those developed jointly by several people. Just imagine the situation: you notice that after several "improvements" the performance of your code fragment drops significantly and unexpectedly, and whatever you do afterwards (even rolling back to the previous version) you cannot regain the performance of the earlier versions; tomorrow, however, the performance recovers all by itself, for no visible reason. This may well mean that a colleague has also changed his module, and that this has affected your part of the program.
These rules are as follows:
Most programmers, especially freelancers, usually have only a couple of machines at their disposal, on which the whole development cycle takes place: from initial design to debugging and optimization. Meanwhile, other computers have different hardware configurations, and code optimized for one platform can be inefficient on another. Data flow planning (see "Data Flow Planning") is a good example illustrating this statement. Recall, for instance, the peculiarities of data prefetching in the VIA KT133 chipset, which lead to an unexpected performance drop when several streams of data located close to one another are processed in parallel. This unpleasant fact is not reflected in the documentation and cannot be predicted by calculation; it can only be discovered experimentally.
Profiling the program on a single machine will not allow you to detect all the "bottlenecks" of the algorithm. It is necessary to cover at least three or four typical configurations, paying attention not only to the processors but to the chipsets as well. By doing so, you get some insurance against "surprises" such as the above-described "feature" of the VIA KT133 chipset.
It is much harder to find a combined solution that suits all existing platforms best. To be more precise, it isn't simply "harder" - it is impossible.
Existing profilers are not numerous, so programmers do not face any particular problem when choosing one.
If optimization isn't your main field of interest and efficiency as such isn't critical for you, then practically any available profiler will satisfy you - for example, the one supplied with your compiler of choice. More sophisticated profilers would in this case only be a burden, because you wouldn't be using their full capabilities and potential, which require a sound knowledge of the processor architecture and of computer hardware in general.
On the other hand, if the performance of your programs and the quality of their optimization is a matter of primary importance for you, and you plan to dedicate significant time to profiling, no other profilers will do except products such as Intel VTune and AMD Code Analyst. Pay attention to the way this statement is formulated: "Intel VTune and AMD Code Analyst", rather than "Intel VTune or AMD Code Analyst". Each of these profilers supports only its "native" processor, so using only one of them won't allow you to optimize the code for both platforms (i.e., each of them will do no more than half of the job).
Since users acquainted with both of these products are not numerous, I'll provide a brief description of their functionality.
Fig. 1.3. VTune Profiler Logo
Obviously, VTune is the most powerful among all profilers that have ever existed (at least for the IBM PC platform). Actually, one can't classify it as simply a "profiler", since VTune is an intellectual tool whose functionality isn't limited to hotspot detection. Besides detecting hotspots, VTune is capable of providing rather helpful tips and advice on their elimination. Additionally, VTune provides a rather powerful code optimizer, which increases the performance of programs compiled with Microsoft Visual C++ 6.0 by approximately 20% - surely, this improvement is always welcome!
As a matter of fact, the list of advantages provided by VTune is so comprehensive that you should consider it a leader with practically no alternative! Within the current overview, I'd rather discuss some of its shortcomings (after all, no program is absolutely free of drawbacks).
One of the main disadvantages characteristic of VTune is its exceedingly large size (version 6 - the newest at the moment of this writing - "weighs" no less than 150 MB). Its cost is also rather impressive.
Another shortcoming is that VTune isn't always stable - sometimes it can "hang" the system. For example, I quite often encounter a situation where, after activating some performance counters, Windows crashes into a "blue screen" (the most common error message being IRQL_NOT_LESS_OR_EQUAL). On the other hand, if you carefully think over each of your actions, you'll be quite satisfied with VTune, especially considering that this product is the best one and has no alternative.
Furthermore, VTune is rather complicated. Sometimes it might seem impossible to master it and achieve a sound understanding of all its features. Even the built-in help system takes more than a thousand pages (A4 format)! Even careful reading will take you at least a whole month! On the other hand, let us consider this problem from another point of view. What instrument do you need - a powerful tool or a toy for beginners? If you need a real tool, then, by definition, the more powerful the instrument, the more sophisticated it is. In my personal opinion, VTune is no more complicated than, say, most popular development environments, such as Visual C++ or Delphi. Hence, the problem is due to the lack of comprehensive literature on profiling in general and this product in particular, rather than to the fact that VTune is complex and difficult to master. Mainly because of this, this book includes a short self-study guide on VTune, which you will find in the chapter titled "Practical Profiling Session Using VTune". I hope that this information will be helpful to you.
Fig. 1.4. The AMD Code Analyst Logo
AMD Code Analyst is beaten by its nearest competitor - VTune - by at least two orders of magnitude. I'd never recommend using it as your main profiler. Although I'll probably expose myself to the risk of being beaten by aggressive AMD fans, I'll still list the main drawbacks of AMD Code Analyst:
Despite these drawbacks, Code Analyst also has some advantages. For example, it is rather small (at the moment of this writing, its latest version - 1.1.0 - required only 16 MB, which is approximately 10 times smaller than VTune), reliable and stable, and, most importantly, it provides a fully functional emulator for AMD processors such as the K6-II, Athlon (with external and integrated cache types), and Duron, including their mobile implementations. Furthermore, it provides the capability of manually selecting bus and core frequencies. This feature is certainly useful for assessing the influence of the bus frequency on performance, which is especially important for applications that work intensely with RAM (unfortunately, VTune lacks this feature). Finally, Code Analyst has a rather clearly written help system, which describes the processor's bottlenecks quite briefly, but in clear and readily understandable language. And - the most advantageous circumstance - in contrast to VTune, it is freeware (at least, currently...)!
In the long run, however, it doesn't matter whether you like or dislike this profiler, since you are obliged to use it for optimizing your applications to run on Athlon processors, which take a significant segment of the market. This fact shouldn't be neglected!
Fig. 1.5. Copyright information displayed by the Microsoft Profile.exe
The next profiler that we are going to consider, Microsoft's profile.exe, is so simple and unsophisticated that it doesn't even have a product name. Throughout this book, we will name it after its executable file. Profile.exe is a very small profiler providing only minimal functionality. It is considered in this overview only because it is included with Microsoft Visual C++ compilers (Professional and Enterprise Editions). Thus, most programmers using Microsoft's Visual C++ compilers get this profiler along with their development environment, while other profilers must be purchased or downloaded from the Internet.
Why would you need to develop your own custom profiler if such tools are supplied with practically every compiler? And, if the functionality provided by the built-in profiler proves to be insufficient, you can use AMD Code Analyst and Intel VTune.
Unfortunately, the built-in profiler supplied with Microsoft Visual Studio (like most other similar profilers) uses the system timer for time measurements, and, therefore, its sensitivity is insufficient for most of our purposes. Intel VTune is too large and rather expensive, while the freeware AMD Code Analyst is inconvenient (and, besides, there is no guarantee that it won't cease to be freeware in the future). All these factors prevent using the above-listed profilers as main instruments (at least for the author of this book).
The program suggested by the author - DoCPU Clock - can't be classified as a "profiler" in the general sense of this term. It doesn't detect hotspots, doesn't calculate the number of calls, and, furthermore, it can't even work with executable files. DoCPU Clock is nothing more than a small set of macros intended for inclusion into the source code of the program. These macros determine the time required to execute the profiled code fragments. For the purposes of this book, these limited capabilities are sufficient, since all that we will need is an evaluation of the influence of specific optimizing algorithms on program performance.
A detailed description of the DoCPU Clock profiler is contained within its source file. Therefore, for the sake of brevity, I will not reproduce it here.
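For readers who want the general idea without opening that file, here is a minimal sketch of such timing macros - based on the standard clock() function rather than on whatever counters DoCPU Clock actually uses, and with hypothetical names - so take it as an illustration only:

/* Sketch only (not the actual DoCPU Clock source): a pair of macros that
   measure the time spent in the fragment placed between them. */
#include <stdio.h>
#include <time.h>

#define PROFILE_BEGIN(slot)  clock_t _t_##slot = clock();
#define PROFILE_END(slot)    printf("fragment " #slot ": %.0f ms\n", \
                                 (double)(clock() - _t_##slot) * 1000.0 / CLOCKS_PER_SEC);

int main(void)
{
    volatile int i, x = 0;

    PROFILE_BEGIN(loop1)
    for (i = 0; i < 10000000; i++)   /* the fragment being measured */
        x += i;
    PROFILE_END(loop1)

    return 0;
}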
Practically any instrument, even the most perfect and flawless one, will be useless if you don't know how to work with it. The VTune profiler can't be classified as an "intuitive" product that you can master simply by trying to use it. VTune is a professional instrument, the correct usage of which is impossible without special training. Otherwise, most of its capabilities will remain unnoticed, leaving the developer wondering: "what have the others found in this VTune?"
The current chapter isn't a detailed guide, but the author still hopes that it will help you make your first steps in mastering VTune and get acquainted with its main functional capabilities. Besides this, you are expected to make a decision: are you going to work with VTune, or would you rather choose another profiler - one that is simpler and easier to use?
As a test program for our experiments with profiling and optimization, we will use a simple password cracker. Why a password cracker, you might ask? First, because it represents an illustrative and realistic example. Second, because such applications place the highest demands on performance. Anticipating possible indignation from some readers, I'd like to mention that in this example we are not dealing with cracking real ciphers and passwords! The algorithm implemented in this program is never used in real life. Besides, it is vulnerable to a quick and efficient attack that cracks the ciphertext within a matter of seconds (practically immediately).
First of all, we are interested in hotspot detection and elimination rather than in password cracking. Because of this, even those who consider all investigators to be hackers, and all hackers to be criminals, can type in the following source code without injuring their puritan morals (Listing 1.17):
Listing 1.17. [Profile/pswd.c] Password cracker before optimization
Let us compile this code with maximum optimization and run it, in order to check how well the machine optimizer did its job.
A test run of this program on a P-III 733 produces a cracking speed of about 30 thousand passwords per second! This is as good as nothing. How long will it take us to crack the ciphertext at this rate? Where do the processor cycles get lost?
Let us use the VTune profiler to detect the bottlenecks of this program. So, let us start it (bear in mind that under Windows 2000 and Windows NT you must log in as an administrative user for this purpose) and, while the PC is blinking its hard-disk LED, create the symbol table (don't confuse it with the debug information!). Without this table, no profiler will be able to determine which part of the executable code relates to which function. To create the symbol table, it is sufficient to specify the /profile switch in the linker command line. For example, this might look as follows: "link /profile pswd.obj". If everything was done correctly, the pswd.map file will be generated, containing approximately the following (Listing 1.18):
Listing 1.18. [Profile/pswd.c] Contents of the pswd.map file
Now, VTune is ready for profiling and waits for our instructions. We have the following options: either open an existing project (but currently we have nothing to open), or start the New Project Wizard (this suits our purposes, but, on the other hand, are we able to understand all the settings of this Wizard?), or, alternatively, perform a quick analysis of the application's performance by choosing the Quick Performance Analysis option. Well, select the latter option! In the next dialog box, specify the path to the "pswd.exe" file and click the "GO" button. VTune will automatically start the profiled application and begin to collect the run-time information for each of the program's points. The progress indicator will entertain you during the process. If you are lucky and the system doesn't hang, then after a couple of seconds VTune will open a full-screen window filled with various types of information - some of it valuable, some useless. Let us consider this data in more detail (see Fig. 1.6). In the left part of the screen you'll see the Project Navigator, which enables the user to quickly navigate between various parts of the project. For the moment, we don't need it. Therefore, let us concentrate our attention on the central part of the screen, displaying the diagram windows.
The top pane of the window displays how long it took to execute each point, thus allowing us to detect hotspots - i.e., code fragments that required the longest time to execute. In our example, the profiler has detected 187 hotspots, about which it has informed us (see the right pane of the profiler window). Pay special attention to the two peaks located to the right of the screen center. These two peaks aren't simply "hot". Actually, they are "white-hot", since they consume the greater part of the program's run time. We must begin by optimizing these points!
Move the cursor to the highest peak - VTune will immediately inform you that it belongs to the out function. Stop! What out?! We didn't call any such function or anything similar to it! Who called this ugly function? (Certainly, you have already guessed that it was the printf function, but let us pretend that we don't know anything about it, especially since in other cases it isn't so simple to find the party actually responsible for such an occurrence.)
Fig. 1.6. The contents of the VTune window immediately after initial analysis of the application. First of all, we are interested in the top pane, which maps the detected hotspots to their addresses. The bottom pane contains information on the relative run times of all modules comprising the system. Note that it isn't the pswd.exe module (marked by an arrow in the diagram) that takes the first position in this list. The main part of resources is consumed by another module. Thus, we get a false impression of the uselessness of optimizing pswd.exe...
To avoid uselessly viewing the entire code, let us employ another instrument provided by the profiler - the Call Graph, which displays the call hierarchy in a user-friendly form.
From the Run menu, select the Win32* Call Graph Profiling Session command and relax while VTune profiles the application. When profiling is complete, two more windows will appear. The contents of the top window - a kind of spreadsheet - are intuitive, and, therefore, we won't discuss it in detail. As for the bottom window, let us consider it more carefully. On the light yellow background, there are two bright peaks, marked with the "Thread 400" and "mainCRTStartup" labels. Let us double-click the latter one, and VTune will immediately display a long list of child functions called by the application's source code. Among them, find the main function (this will be easy, since only this function is highlighted in red) and double-click it again... Proceed the same way until you open all child functions of the main procedure.
As a result, you will find out that it is the printf function that calls the out function. The printf function, in turn, is called from the do_pswd function. Well, now you'll recall that this function was used to output the current password! What a silly idea! Now it is clear who caused the performance degradation!
Fig. 1.7. Hierarchy of "hot" functions generated by the Call Graph wizard. The color symbolizes the function's "temperature", and the numeric label specifies how many times the function was called
Certainly, it is impractical to completely eliminate the display of the program's current state (after all, the user really needs to know how many passwords have been checked so far, and, on the other hand, it is necessary to have some control over the machine just to make sure that it hasn't hung). Still, rather than displaying each password, you can carry out this operation for every 600th, or, better still, every 6000th password. In this case, the overhead caused by calling printf will be reduced significantly and, possibly, will even drop to zero.
Let us rewrite the code fragment responsible for output of the current state as follows (Listing 1.19):
Listing 1.19. Reducing the number of calls to the printf function
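The listing itself is in the accompanying sources; a hedged sketch of the idea (the names report_progress and iteration are mine, not taken from the original code) might look as follows:

/* Sketch only: instead of printing every candidate password, report
   progress once per 6000 iterations, so that printf overhead becomes
   negligible. */
#include <stdio.h>

static void report_progress(const char *pswd, long iteration)
{
    if (iteration % 6000 == 0)
        printf("Current password: %s\r", pswd);
}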
Oh, Good Lord! After recompiling the program, we get a cracking speed that exceeds 1.5 million passwords per second! This means that the program's performance has increased more than five times! Actually, the program became so fast that the "sensitivity" of the clock function becomes insufficient, and it is necessary to increase the number of iterations 100 times! As you will learn very soon, this still isn't the limit of performance growth!
Re-running the updated program under the profiler shows that the number of hotspots has decreased from 187 to 106. This is good, but, on the other hand, there are still other hotspots! If you go to the View pane located in the top right corner of the Hot Spots window and set the Hotspots by function radio button, you'll learn that about ~80% of the total run time of our program is consumed by the CalculateCRC function, followed (with a significant gap) by gen_pswd (~12%). The next positions in this list (~3%) are shared between the CheckCRC and do_pswd functions.
This is a no-go! A single minor function (CalculateCRC) consumes practically all of the program's run time! If only we could find out which of its parts is most responsible for this performance degradation... Well, VTune allows us to detect it!
Double-click the red rectangle to display it in full-screen mode. Now you'll find out that the CalculateCRC function contains 18 hotspots, three of which are especially "hot" - ~30%, ~25% and ~10%, respectively (see Fig. 1.8). Let us start with the hottest one. Double-click the highest rectangle, and... VTune will beep and inform you that "No source found for offset 0x69 into F:\.OPTIMIZE\src\Profil\pswd.exe. Proceed with disassembly only?" Actually, since the program was compiled without debug information, VTune is unable to find out which byte of assembly code corresponds to which source line, and the compiler cannot provide this information, because in the optimized program the correspondence between the source code and the machine code generated by the compiler isn't unambiguous.
Of course, you can profile the non-optimized program, but what is the use of that? It would be another program with different hotspots! In any case, high-quality optimization is impossible without knowledge of assembly language; therefore, click OK to work with the assembly code without source code.
Fig. 1.8. "Temperature" distribution within the CalculateCRC function (the snapshot was taken with high resolution)
VTune will immediately draw our attention to the REPNE SCASB instruction. You don't need to be a guru to recognize the strlen core in it. Have we used strlen in the source code of the program? Yes, we have! Look here (Listing 1.20):
Listing 1.20. The call to strlen in the loop header has a negative effect. The compiler didn't recognize it as an invariant and didn't move it outside the loop, so the length of the same string is calculated on every iteration
It seems that the compiler didn't move the call to strlen outside the loop body, although its argument - the pswd variable - isn't modified within the loop! Well, if the compiler didn't do it, let us do it ourselves and rewrite this code fragment as follows (Listing 1.21):
Listing 1.21. Moving the strlen function outside the loop
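The original listing isn't reproduced here; a sketch of the transformation - with the loop body mirroring the CRC code quoted elsewhere in this chapter and everything else assumed - might look like this:

/* Sketch only: the invariant strlen() call is hoisted out of the loop. */
#include <string.h>

int CalculateCRC_hoisted(char *pswd)
{
    int a, x = -1;
    int length = strlen(pswd);          /* invariant, computed once       */

    for (a = 0; a <= length; a++)       /* was: a <= strlen(pswd)         */
        x += *(int *)((int)pswd + a);   /* same idiom as in the original  */

    return x;
}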
After recompiling the program, you will notice with satisfaction that its performance has now grown to 3.5 million passwords per second, i.e. it has increased more than 2.5 times compared to the previous version.
Still, profiling shows that the number of hotspots didn't decrease. On the contrary, an extra hotspot appeared! Why? The problem is that the algorithm used for counting hotspots takes into account the relative run times of program fragments rather than absolute values. Gradually, as we eliminate the highest peaks, smaller "ripples" appear on the diagram.
Despite the optimization, the CalculateCRC function remains the "leader" in resource consumption and still takes more than 50% of the total run time of the entire program. Now, the following two commands become the "hottest" spot:
Listing 1.22. The "hottest" spot of the profiled program after moving the strlen function outside the loop
Well, what's wrong with these commands? Surely, here we are dealing with memory access (x += *(int *)((int)pswd + a)), but, in principle, the password under test must reside in the L1 cache. Consequently, accessing it must take only one clock. Has someone evicted this data from the cache? Or, perhaps, some conflict has occurred? It is rather difficult to guess! Actually, guessing might be endless - it isn't this code that causes the problem. Rather, it is a quite different branch of the program...
Now, it is time to employ one of the most powerful capabilities of VTune - dynamic analysis, which allows us not only to detect the leak, but also to find out its true reason. Moreover, dynamic analysis isn't carried out on the "live" processor, but rather on its software emulator. This allows you to save on financial investments! To perform software optimization, you don't need to purchase the whole line of processors - from the Intel 486 to the Pentium 4 - you only need to purchase VTune, and after that you'll be able to optimize your programs for the Pentium 4 while having only a Pentium II or Pentium III at your disposal.
Before starting dynamic analysis, you'll need to specify which part of the program you are going to profile. In particular, it is possible to choose whether you'll profile the entire CalculateCRC function or only a specific hotspot of that function. Since our function contains quite a lot of hotspots, let us choose the first approach.
Scroll the window up and move the cursor to the line labeled with the CalculateCRC string (labels are displayed in the second column from the left). If you don't find such a line, find the Scroll to Previous Portal button on the toolbar (it is labeled with an icon that looks like a blue triangle pointing upwards) and click it. Now let us establish the Dynamic Analysis Entry Point, which is set by clicking the button labeled with the yellow right arrow. Proceed in a similar way to specify the Dynamic Analysis Exit Point. To accomplish this, scroll the screen down, find the last line of the CalculateCRC function (it contains a single command - ret), click it, and then click the toolbar button labeled with the yellow left arrow. Now choose the following commands from the menu: "Run\Dynamic Analysis Session". In the dialog box that appears next, select the processor model to be emulated (P-III in our example) and click Start. Well, here goes!
The profiler will start the program once again and, after running it for a couple of minutes, will display a screen similar to the one shown in Fig. 1.9.
Fig. 1.9. Dynamic analysis of the program is capable of determining the "temperature" of each machine instruction and explaining the reasons for its "heating"
Well, here is our hotspot (in the screenshot it is highlighted by the cursor). If you double-click it, VTune will display a dialog providing a detailed description of the problem:
Listing 1.23. Dynamic analysis results for the chosen hotspot
What a Klondike of information! It turns out that the cache actually has nothing to do with the problem (a cache miss occurred only once). The main cause of the problem is access to misaligned data (this occurred 2000 times, which equals the number of runs of the profiled fragment). This means that such an event took place on each iteration of the loop - hence the performance degradation.
Let us find out where within the program the pswd pointer is initialized. Here is the fragment of the main function (now it is clear why static analysis of the CalculateCRC function was unable to provide anything useful):
Listing 1.24. Alignment of the password buffer in order to avoid penalty from the processor
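The listing isn't reproduced here; a hedged sketch of the fragment in question (only the pswd += 62 offset is taken from the text, while the buffer declaration and the initial password are assumptions) might look as follows:

/* Sketch only: the password buffer is taken from a larger block with a
   deliberate odd offset, which leaves pswd misaligned. */
#include <string.h>

int main(void)
{
    static char buf[512];        /* hypothetical declaration              */
    char *pswd = buf;

    pswd += 62;                  /* deliberate misalignment; deleting this
                                    line keeps pswd 4-byte aligned and
                                    removes the per-iteration penalty     */
    strcpy(pswd, "!");           /* hypothetical initial password         */
    /* ... cracking loop ... */
    return 0;
}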
Let us remove the "pswd += 62" line and recompile the program. Four and a half millions of passwords per second! Catch the tiger by the tail!
Let us return to Fig. 1.9 and note that access to misaligned data is not the only hotspot present in the CalculateCRC function. With a small gap, it is followed by the PUSH instruction, which temporarily stores registers on the stack, and... once again the strlen function, which we have already encountered.
Actually, in terms of duration, calculating the password length is comparable to calculating its CRC. If only we could avoid it... Well, why do we need to calculate the length of each password at all? Password checking isn't chaotic. On the contrary, passwords are generated according to a strictly ordered scheme. Within this scheme, the password length is incremented by one from time to time, and this doesn't happen frequently. Thus, perhaps it would be better to delegate this task to the gen_pswd function? Let this function determine the length of the initial password and then, whenever the string grows, increment the global variable length by 1. Rewrite the gen_pswd code as follows:
Listing 1.25. Elimination of the strlen function by means of "manually" incrementing the password length by one character
The code of the CalculateCRC function must be rewritten as follows:
Listing 1.26. Using global variable for determining password length
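The listings themselves aren't reproduced here; a hedged sketch of the idea behind Listings 1.25 and 1.26 (the helper names other than CalculateCRC are mine, not the original code) might look as follows:

/* Sketch only: the password length is tracked in a global variable
   instead of being recomputed by strlen() for every candidate. */
#include <string.h>

int length;                        /* global: current password length     */

void init_length(char *pswd)
{
    length = strlen(pswd);         /* computed once, for the initial password */
}

void grow_password(char *pswd, int p)
{
    pswd[p]     = '!';             /* append one more character...        */
    pswd[p + 1] = 0;
    length++;                      /* ...and bump the length by hand      */
}

int CalculateCRC(char *pswd)
{
    int a, x = -1;
    for (a = 0; a <= length; a++)  /* no strlen() call at all any more    */
        x += *(int *)((int)pswd + a);
    return x;
}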
As a result of these simple transformations, we manage to achieve a speed of... eight million passwords per second. Is this sufficient? Just wait! The most interesting part of our job hasn't even begun yet...
Now, the gen_pswd function, which consumes more than 50% of the total program run time, becomes the leader.
The gen_pswd function is followed, with a significant gap, by two other functions - CalculateCRC (~21%) and CheckCRC (~15%). At the same time, ~40% of the total time required to execute gen_pswd is concentrated in a single hotspot. This is too bad! Certainly, this function requires optimization!
Double-click the highest peak, and you'll come to the IDIV instruction, which performs integer division. Wait, but where in gen_pswd did we use integer division? Here it is:
Listing 1.27. Integer division operation in the gen_pswd function
These lines of code calculate the percentage of the work already done. Curiously enough, this operation takes approximately the same time as the work itself! Well, since we have discovered that this is so, let us get rid of these "bells and whistles"! Remove the division command, substitute "0" or any other number for the progress indicator, recompile the program, and... Now its speed is fourteen and a half million passwords per second!
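For reference, the construction just removed presumably amounted to something like the following (the variable and function names here are assumptions, not the original code), and the "fix" is simply not to compute the percentage at all:

/* Sketch only: the progress report costs an IDIV on every call...        */
#include <stdio.h>

static void report(long a, long max_iter)
{
    printf("%ld%%\r", a * 100 / max_iter);   /* integer division per call */
}

/* ...whereas the cheap replacement stops computing it altogether. */
static void report_cheap(void)
{
    printf("%d%%\r", 0);                     /* constant placeholder      */
}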
Despite the great checking speed, the gen_pswd function still takes about ~22% of the total execution time, which certainly isn't good.
Double-click this function, and you'll notice that a single peak highlighted in red dominates a row of bars of approximately equal height. Let us investigate it!
Disassembly reveals that this hotspot hides a construction that we have already encountered:
Listing 1.28. A well-known construction hidden behind the most active hotspot within the gen_pswd function
Well, for the sake of performance we can discard monitoring of the current state. Let us discard this code fragment altogether.
As a result, the speed increases by five million passwords per second. Well, this result isn't too impressive, so the necessity of this step is rather doubtful...
Now we deal with a new hotspot - saving the ESI register somewhere deep within the CalculateCRC function. The compiler carefully prevents any modifications from being introduced into it. Despite the fact that the number of variables used in this program is not large and, in principle, we could avoid memory accesses by placing all variables in registers, the compiler is unable to do so, since it optimizes each function separately.
Well, let us sacrifice the structured architecture and join all the intensely used functions (gen_pswd, do_pswd, CheckCRC and CalculateCRC) into a single "super-function".
Its implementation might look as follows:
Listing 1.29. Joining the gen_pswd, do_pswd, CheckCRC and CalculateCRC functions into a single super-function
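The full listing isn't reproduced here; a compact, heavily simplified sketch of the idea - one function that generates, checks and counts the passwords, so that the hot variables never cross a function boundary and the compiler is free to keep them in registers - might look roughly as follows (a sketch only, with details that differ from the actual listing):

/* Sketch only: gen_pswd, do_pswd, CheckCRC and CalculateCRC merged into
   a single function. */
#include <stdio.h>
#include <string.h>

long crack_all_in_one(char *pswd, int validCRC, long max_iter)
{
    long a;
    int  b, x;
    int  length = strlen(pswd);

    for (a = 0; a < max_iter; a++)
    {
        /* --- former CalculateCRC --- */
        x = -1;
        for (b = 0; b <= length; b++)
            x += *(int *)((int)pswd + b);

        /* --- former CheckCRC / do_pswd --- */
        if (x == validCRC)
            printf("probable password: %s\n", pswd);

        /* --- former gen_pswd: advance to the next candidate --- */
        b = 0;
        while (++pswd[b] > 'z')          /* carry to the next character   */
        {
            pswd[b++] = '!';
            if (!pswd[b])                /* password grew by one character */
            {
                pswd[b]     = '!';
                pswd[b + 1] = 0;
                length++;
                break;
            }
        }
    }
    return a;
}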
Compile and run our redesigned program... Well, you won't believe your eyes - we have now achieved a speed of thirty-five million passwords per second! And before this, it seemed that we had no more performance reserves. Well, now who will dare to say that the Pentium is a slow processor? Generating the next password, checking it and calculating its CRC fits within some 20 clocks...
Well, 20 clocks?! We still have something to work on!
As profiling shows, the main part of the hotspots is now concentrated within the CRC calculation loop - it consumes more than 80% of the total program run time, of which 50% is used by the conditional jump that closes the loop (Pentium processors just "can't stand" short loops and conditional jumps). The remaining 50% is used for accessing cache memory. This fact needs to be clarified.
There is a common opinion that reading non-split data from cache memory takes one clock - the same as reading data from a register. This is really so; however, upon careful investigation, one discovers that "one cell per clock" actually represents the throughput of the cache memory, while the total time required to load the data (taking latency into account) is at least three clocks. When reading dependent data from the cache (as in our case), the total time required to access a cell is determined by latency rather than by throughput. Besides this, P6 processors are equipped with a single data-loading unit, and, therefore, even under favorable conditions they can load only one cell per clock. By contrast, data stored in registers is not subject to this limitation.
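The difference between latency and throughput can be illustrated with a sketch like this (an illustration only, not code from the cracker): in chase() every load depends on the value just read, so each access pays the full load latency, while in sum() the addresses are known in advance and the loads can be pipelined:

/* Latency-bound: the address of each load comes from the previous load. */
long chase(long **p, long n)
{
    while (n--)
        p = (long **)*p;          /* next pointer is fetched from memory  */
    return (long)p;
}

/* Throughput-bound: independent loads, one per clock at best. */
long sum(long *a, long n)
{
    long i, s = 0;
    for (i = 0; i < n; i++)
        s += a[i];                /* addresses don't depend on loaded data */
    return s;
}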
Thus, to increase performance we need to get rid of the loop and reduce the number of memory accesses to a minimum. Unfortunately, we cannot efficiently get rid of the loop, since we don't know the required number of iterations beforehand. The situation with variables is similar: when programming in assembly language, we could easily place the password buffer in general-purpose registers (a 16-character password - practically the longest one that could realistically be guessed by a password cracker - fits within four registers, while the other variables fit within the remaining three). However, this way isn't available to application programmers who have mastered only a high-level programming language, and they have to search for other solutions.
And such solutions really exist! Up to this moment, we have increased the program speed by restricting or discarding the most resource-consuming operations. We didn't change anything in the basic algorithms. This approach has helped us to significantly improve the performance, but it has now reached its natural limit. Further optimization is only possible at the algorithmic level.
As far as algorithmic optimization is concerned, there are no universal approaches or commonly accepted solutions - each case must be considered individually, in the context of its specific environment. Returning to our problem, let us think over the following question: is it actually necessary to calculate the checksum of each new password from scratch? Because of the obvious weakness of the CRC calculation algorithm used in this program, we can replace it with an equivalent but faster one.
Actually, since the lower byte of the password is summed only once, when going to the next password the CRC in most cases increases by one. I say "in most cases", because when the second and subsequent bytes of the password change, the changed byte is summed at least twice. This complicates our algorithm somewhat. However, it is still much better than the "dumb" method of constantly recalculating the CRC, which we used earlier.
In other words, the improved implementation of the password cracker might look as follows:
Listing 1.30. Algorithmic optimization of the CRC calculation algorithm
char *buff;
int y = 0;
int k;
int length = strlen(pswd);
int mask;

x = -1;
for (b = 0; b <= length; b++)           // CRC of the initial password
    x += *(int *)((int)pswd + b);

for (a = 0; a < max_iter; a++)
{
    if (x == validCRC)                  // candidate CRC matches - try to decrypt
    {
        buff = (char *) malloc(strlen(crypteddata));
        strcpy(buff, crypteddata);
        DeCrypt(pswd, buff);
        printf("CRC %8X: try to decrypt: \"%s\"\n", validCRC, buff);
    }

    y = 1;
    k = 'z' - '!';
    while ((++pswd[p]) > 'z')           // advance to the next password with carry
    {
        pswd[p] = '!';
        // next character
        y = y | y << 8;
        x -= k;
        k = k << 8;
        k += ('z' - '!');
        p++;
        if (!pswd[p])                   // the password grew by one character -
        {                               // recompute the CRC from scratch
            pswd[p] = '!';
            pswd[p + 1] = 0;
            length++;
            x = -1;
            for (b = 0; b <= length; b++)
                x += *(int *)((int)pswd + b);
            y = 0;
            pswd[p] = ' ';
        }
        //printf("%x\n", y);
    } // end while(pswd)
    p = 0;
    x += y;
} // end for(a)
return 0;
}
What is the result of the algorithmic optimization? You would never guess it, but here it is - eighty-three million passwords per second, or about 1/10 of a password per clock. Fantastic!
And, at the same time, the program is written in pure C! And, what's most interesting, there still remains quite a good performance reserve!
Now let us discuss a less-known feature of the VTune profiler - the Coach (which can also be understood as "trainer" or "teacher").
Essentially, the Coach is nothing less than a high-class interactive optimizer that supports quite a wide range of programming languages, including C, C++, FORTRAN and Java. It analyzes the source code of the program, searches for weak points, and, having detected them, provides detailed instructions on their elimination.
Naturally, the IQ of this program can't even be compared to that of a human programmer. Furthermore, as we will discover later, the Coach is not as smart as it might seem (some might even say that it is dumb rather than smart). Still, it is worth considering it more closely.
Despite the fact that the Coach is mainly oriented towards beginners (this is obvious from the "simple" style of its help system), it can sometimes prove to be useful even for professionals, especially when it is necessary to optimize code written by someone else and you can't afford a detailed investigation of that code (or are simply short of time).
Well, here is some bad news (which, however, should be expected): when debugging information is lacking, the Coach cannot work with the source code and drops down to the level of pure assembly language (see "Step Ten"). This circumstance doesn't present serious inconveniences, though, since the source code is analyzed rather than profiled. Because of this, don't be surprised that including debugging information in the executable file automatically "switches off" all optimizing options of the compiler - the Coach works with the source code and never even touches the compiled machine code, so the analysis is unaffected!
Thus, let us recompile our demonstration example with the "/Zi" option in the compiler's command line, and link it with the "/DEBUG" command-line option of the linker. Then, let us load the resulting file into VTune, wait until the profiler displays the "Hot Spots" diagram, and then double-click the highest rectangle, which, as you already know, corresponds to the gen_pswd function, where our program spends most of its run time.
Now, holding the left mouse button, highlight the fragment that you want to analyze (it would be most logical to select the entire gen_pswd function). After that, locate the toolbar button labeled with an icon depicting a teacher and click it. The Coach will request information about the file used to compile the program. You can choose between the following options: a make-file, a pre-processor file and, as in any professional program, the possibility of manual input. Since we didn't set any specific compiler options and, in any case, have no make-file, let us choose the Manual Entry option and click OK. Ignore the "No source option were specified" message displayed on the screen and then click OK once again.
VTune will immediately start the analysis and after several seconds will report by displaying the following message: "There are 9 recommendations identified for the selected code. Double-click on any advice for additional information". It turns out that our program (which is already optimized) contains no fewer than 9 weak points! What's more, these weak points are so weak that they can be detected by a trivial template search! Well, no comment - we have done a "good" job! However, let us discover what VTune has found here. The Coach's recommendations are written directly over the program code, and the first one is as follows (Listing 1.31):
Listing 1.31. The first recommendation of the Coach
Well, VTune isn't quite right! This loop is executed only once per program run. Therefore, optimizing it will not have any significant influence on the program speed! By the way, what is "loop unrolling" and how is it accomplished? Double-click the highlighted line, and the Coach will display the following information:
Well, this is quite a detailed instruction on loop unrolling! And if you still can't understand how the loop unrolling is actually accomplished, you can click the "Examples" link and view a specific example written in C, Java or Fortran. Let us choose "C" and view further instructions provided by VTune:
Listing 1.32. Further instructions provided by VTune (C language example)
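For reference, a generic sketch of loop unrolling in C (my own example, not the profiler's) looks like this:

/* Sketch only: unrolling a summation loop by a factor of four reduces the
   loop overhead (counter increment, compare, branch) per useful operation. */
int sum_rolled(int *a, int n)
{
    int i, s = 0;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

int sum_unrolled(int *a, int n)
{
    int i, s = 0;
    for (i = 0; i + 3 < n; i += 4)          /* main, unrolled part         */
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)                      /* remainder of the iterations */
        s += a[i];
    return s;
}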
Despite VTune's recommendations, we will not unroll this loop. Rather, let us go further and consider the second tip. This tip also recommends unrolling a loop, but this time, the loop to be unrolled is nested within the while loop. Since this loop gets control only when the checked password is lengthened by one character (which doesn't happen frequently), it, like the previous one, has practically no influence on the program performance. For these reasons, let us send both of these recommendations to /dev/null.
The third tip criticizes the innocent-looking instruction p++, which increases the p variable by one:
Listing 1.33. The third tip provided by VTune
The loop whose index is incremented at line 114 should be interchanged with the loop whose index is incremented at line 121, for more efficient memory access
Obviously, the Coach went crazy because of processor overheating. These are different loops. They have different indexes. And they are not interrelated in any way. Furthermore, the loop that starts at line 121 is executed so rarely that it is quite an enigma why VTune doesn't like it.
Perhaps additional information provided by the Coach will explain everything? Double-click line 114 to read the following:
What? Where are the multi-dimensional loops or cache misses here? We are not dealing with anything of the sort! Presumably, we have encountered a Coach error (after all, a template search is no more than a template search!). However, let us consider the example provided by the Coach. One must always look for one's own errors before those of someone else. Perhaps we have misunderstood something...
Listing 1.34. The example provided by the Coach for the third tip
Well, this is correct. The code fragment provided by VTune illustrates that it is better to process two-dimensional arrays by rows rather than by columns (see "Cache"). However, we have no two-dimensional arrays - therefore, there is no need to follow the instructions provided by the Coach in this case.
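For reference, the point of the Coach's example can be summarized by a sketch like the following (again, my own code, not the profiler's): row-by-row traversal touches memory sequentially, while column-by-column traversal jumps by a whole row at each step and defeats the cache.

#define N 1024

/* Sketch only: row-major traversal is cache-friendly in C... */
long sum_by_rows(int a[N][N])
{
    long i, j, s = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            s += a[i][j];            /* consecutive addresses            */
    return s;
}

/* ...while column-major traversal strides by a whole row each step. */
long sum_by_cols(int a[N][N])
{
    long i, j, s = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            s += a[i][j];            /* N*sizeof(int) bytes per step      */
    return s;
}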
Tip number four - once again it relates to the CRC calculation loop. What has attracted the Coach's attention now? Let us consider this in more detail...
Listing 1.35. The fourth tip provided by the Coach relates to the CRC calculation loop
Quite good and useful recommendations, after all! Although the recommendation to simplify the addressing (which is already nearly primitive) is repeated twice! And at the same time, you won't succeed in efficiently vectorizing this loop even with Intel C/C++, not to mention all the other compilers.
Still, let us view the built-in help system - perhaps it will provide something interesting!
Vectorization is certainly a useful approach, which really does allow the program speed to be increased significantly (manyfold). However, there are two obstacles preventing it from being widely adopted: first, most x86 compilers are unable to vectorize code, and migrating to Intel's compiler isn't always acceptable. Second, vectorization is really efficient only if the program was initially designed with this technology in mind. And, although code vectorization is well known in the world of mainframes, for x86 programmers it still remains something exotic!
Tip number five, or just another blunder of the Coach. Let us consider what the Coach has produced this time.
Listing 1.36. Tip number five is formally correct, but impractical
Generally speaking, the Coach is right from a formal point of view. Moving invariant functions outside the loop body is good programming practice: when such a function resides within the loop body, it is called many times, but, being independent of the loop parameters, it returns the same result every time. Isn't it simpler to allocate memory only once, when entering the function, and then simply store the pointer returned by malloc in a special variable for future use?
The objections to this approach are as follows. First: what will we gain as a result? This branch is called only when the checksum of the current password matches the reference checksum, which is a rare event. Actually, this happens only a few times during the whole run of the program (at best).
Objection number two: the branch of the program that allocated the memory block must also release it, provided that this doesn't result in unexpected performance degradation.
Thus, we will not move anything outside the loop, despite the tips provided by the Coach.
Tip number six. This recommendation is practically identical to the previous one; however, this time the Coach recommends moving the DeCrypt function. Yes! It has considered this function invariant and advised moving it outside the loop, despite the following facts: a) the code of this function was, in principle, at its disposal ("in principle" because we instructed the Coach to analyze only the gen_pswd function); b) the DeCrypt function takes the pswd pointer as an argument, and this pointer is explicitly changed within the loop! Because of this, DeCrypt cannot be invariant! Isn't the Coach ashamed of such advice? I hope it is - otherwise why should it be so red?
Tip number seven. Now the Coach draws our attention to the following fact: "The value returned by DeCrypt() on line 98 is not used..." and provides the following recommendation: "If the return value is being ignored, write an alternate version of the function which returns void".
This recommendation is based on the Coach's assumption that a function that doesn't return any value will execute faster than one that does. Actually, this statement is rather disputable. Returning a value doesn't take long, and creating two instances of the same function is, in reality, much more "expensive" than the overhead of returning an unnecessary value.
Therefore, let us ignore this tip and proceed further.
Tip number eight. Now the Coach considers another function - printf, which prints the contents of the buffer buff just returned by the DeCrypt function - to be invariant. Hmm... didn't the VTune developers inform the Coach about the semantics of at least the main library functions? I say this because the printf function can never be moved outside the loop, no matter whether it is invariant or not! I suppose it isn't necessary to explain why.
Tip number nine. ...The value returned by the printf function isn't used, therefore...
Well! Such recommendations cannot be considered satisfactory. We didn't use a single tip out of the nine, because none of them would have increased the program speed. Still, the Coach must by no means be dismissed as a dumb, useless program, since it explains really interesting and efficient optimization techniques, most of which are unknown to beginners.
Perhaps some experienced programmers will argue that we obtained such a result because we fed the Coach an already highly optimized program, so it had no other choice than to dig into minor details. Well, let us load the initial version of the program into the Coach and make it analyze the entire source code. It will produce 33 warnings, none of them useful!
All the remaining 17 hotspots represent the overhead of accessing cache memory and... penalties for command grouping that is unsuitable from the processor's point of view. Well, let us leave memory access alone or, truly speaking, leave it to the readers (think over the following problem: why do we need to generate the passwords at all, if their checksums are calculated without accessing them?) and concentrate on planning an optimal command flow.
Let us employ another powerful tool provided by the VTune profiler - the automatic optimizer, which bears the rather pretentious name of "Assembly Coach" (don't confuse it with the Coach!). Press and hold the left mouse button, highlight the entire gen_pswd function, then find the Coach button on the toolbar and click it.
You'll have to choose between three types of optimization in the Mode of Operation list box: Automatic Optimization, Single Step Optimization and Interactive Optimization. The first two modes are of no particular interest to us. However, the quality of the third mode - Interactive Optimization - is very high. Let us select interactive optimization and click the Next button to the right of the drop-down list.
The screen contents will change immediately (see Fig. 1.10): the left pane will display the source assembly code, while the right pane will display the code being optimized. The bottom pane will display so-called assumptions, which the optimizer asks the programmer to resolve. For the moment, this window contains the following assumption: "Offset: 0x55 & 0x72: Instructions Reference to Same Memory". Let us view the instructions residing at these offsets:
Listing 1.37. Instructions residing at the offsets specified by VTune
Despite the visible difference in operands, they actually address the same variable, because there are two PUSH machine commands between them, which decrease the value of the ESP register by 8. Thus, this assumption is correct, and we confirm it by clicking Apply.
Now, let us concentrate on the instructions highlighted in red and marked by the "red" traffic light on the left. These are poorly scheduled instructions penalized by the processor.
Fig. 1.10. Using Assembly Coach for optimization of the machine commands planning
Click the "traffic-light" located at the bottom and view, how VTune re-groups our commands... Now, the things have changed to the better! All instructions are highlighted by light-yellow, which indicates that there are no conflicts and no penalties. What has changed in the optimized code? First, PUSH commands (which actually push registers into the stack) are separate from the command modifying the register pointing to the stack top, which eliminates the parasite dependence by data (actually, it is not correct to push new data into the stack until you know the position of its top).
Second, arithmetic commands are now evenly distributed and interleaved with register read/write commands; since Pentium processors have only one ALU (Arithmetic Logical Unit), this measure practically doubles the performance.
Third, Pentium processors contain only one full-featured x86 decoder, and, therefore, the declared decoding speed (three instructions per clock) is achieved only when instructions follow in a strict order: instructions that can be decoded only by the full-featured decoder should occupy the starting position of each triplet, while the "tail" of the triplet can be filled with commands that can be decoded by the other two decoders. As can easily be noticed, MS VC generates code that is rather far from optimal (from the Pentium processor's point of view), and VTune recombines the commands to optimize the code.
Listing 1.38. Assembly code generated by the Microsoft Visual C++ 6.0 compiler in maximum optimization mode (left) and its improved version optimized by VTune (right)
Let us click Next once again and proceed with the analysis of the next block of instructions. Now VTune eliminates a data dependence by separating the commands that read and add the ESI register from the command that increments the EAX register.
Listing 1.39. VTune eliminates data dependence
We must proceed the same way until the entire code is optimized. And at this point, we encounter a new problem. Unfortunately, VTune doesn't allow you to place the optimized code back into the executable file. Presumably, it assumes that an assembly programmer will easily type this code in from the keyboard. However, not everyone is a qualified assembly programmer! (For example, among the readers of this book there are individuals who are not assembly programmers.)
And, on the other hand, where should we retype the code to? Into the "live" binary file? Of course not! Let us do the following (this approach is certainly far from perfect, but, for the moment, I haven't managed to find a better one). Move the cursor to the pane containing the optimized code, and then select the Print command from the File menu. In the Field Selection window, clear all checkboxes except Labels and Instructions, and then print the code to a file or to the clipboard.
While printing goes on, let us prepare an assembly listing of our program by specifying the "/FA" option in the compiler command line (this option might differ for other compilers). As a result, we will get the pswd.asm file. This file can be assembled ("ml /c /coff pswd.asm"), linked ("link /SUBSYSTEM:CONSOLE pswd.obj LIBC.LIB") and run. What a nuisance! We achieve a speed of only about ~65 million passwords per second (compare this result to the 83 million achieved when proceeding the normal way). As it turns out, MS VC simply doesn't insert alignment directives into the assembly code! This complicates evaluating the quality of the code optimization achieved by VTune. Well, let us take these 65 million as a baseline and check how VTune improves this result.
Open the file created by the profiler, and you'll encounter another problem! Its syntax is absolutely incompatible with the syntax of the popular assembly translators!
Listing 1.40. Fragment of assembly file generated by VTune
First, labels are not followed by a colon. Second, an invalid "plus" sign occurs in labels. Third, conditional jumps contain an extra trailing address enclosed in braces.
In other words, we have a lot of manual work to do. After we accomplish it, the code fragment will look as follows:
Listing 1.41. Corrected fragment of the file generated by VTune is now suitable for translation by TASM or MASM assembler
Now it only remains to enclose it in the following construction and translate it with the TASM or MASM assembler - whichever you prefer:
Listing 1.42. "Enclosing" code for assembly file, into which it is necessary to nest the optimized code of the _gen_pswd function for further translation
As for the pswd.c program, it is necessary to declare gen_pswd as an external function. This can be done as follows:
Listing 1.43. Declaring the external gen_pswd function in C program
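The declaration presumably amounts to something like the following one-liner (the exact prototype of gen_pswd is an assumption based on how the function is used in the text):

/* Sketch only: tell the C compiler that gen_pswd now lives in the
   separately assembled module. The parameter list is assumed. */
extern int gen_pswd(char *pswd, char *crypteddata, long max_iter, int validCRC);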
Now it is possible to assemble the entire project:
Listing 1.44. Final build of the pswd project
A test run of the optimized program shows a speed of ~78 million passwords per second, which is ~20% better than before the optimization. Very well! VTune provides rather good code optimization! Still, the result we have achieved falls short of the speed we managed to reach at the previous step. Certainly, this is due to the compiler rather than the profiler, but that is no relief to us.
On the other hand, this problem has no impact on optimization of assembly programs.
Now it is time to analyze the results obtained. If we discard the first version of our program with its repeated calls to printf, we can state that we have improved the performance of our program from one and a half to eighty-four million passwords per second - in other words, by approximately two orders of magnitude! And we can certainly be proud of it! Although we are far from the theoretical limit (for example, it is possible to check several passwords in parallel using vector MMX commands), application profiling is obviously the best way to avoid performance overhead!
Curiously enough, every step in the optimization process resulted in practically exponential growth of the application performance (see Fig. 1.11). Of course, you cannot expect exponential performance growth in all cases (perhaps we were simply lucky), but the common trend is obvious - the most difficult problems are deeply nested, and it is practically impossible to solve them without digging into the code.
In this chapter we have considered only the basic functional capabilities of VTune (and even that was done briefly). However, its capabilities are not limited to these features! Actually, VTune represents an entire world, which, among other useful features, has a built-in programming language and even its own API, allowing the programmer to call VTune functions from add-on custom modules (i.e., to create custom plug-ins for this profiler).
In any case, by carefully studying this example, you will be able to make the first step in mastering VTune. Later on, you'll be able to study its features on your own. Finally, I'd like to give you a useful piece of advice. The built-in help system of VTune is very large and rather inconvenient to use; therefore, most of its topics remain unknown to most users. Because of this, it is better to use a help decompiler to convert the hlp-file to rtf format. The resulting rtf-file can then be opened using any word processor (MS Word, for example) and printed (or, perhaps, you might prefer to read it on-screen, since the help file comprises more than 1000 pages).
Fig. 1.11. Chronology of the password cracker optimization
At this point, we can consider the introductory part of the book completed.