A Performance Optimization for C/C++ Systems That Employ Time-Stamping
2011-05-26 13:51
This article describes a performance optimization for the time(2) system call in the Solaris Operating System. This optimization applies especially to the financial market, and is based on our work with a number of independent software vendors (ISVs).

We have observed that the common practice of "time-stamping" messages, transactions, or other objects in a system can consume more resources than the developer might expect. In these systems, the time(2) system call is used to obtain the current time with which to stamp each message or object. (The time(2) system call returns the value of time in seconds since 00:00:00 UTC, January 1, 1970.) With many active objects in a typical enterprise system, often thousands or tens of thousands, this can lead to excessively high use of system CPU cycles. We have observed systems processing thousands of transactions or messages every second, each of which requires a time stamp every time it is acted upon. Such systems can end up calling time(2) several thousand times per second, incurring significant overhead in system resources.

Two ways are available to reduce time(2) system call overhead. The first is to use our proposed optimized time(2) replacement, which uses caching to reduce the time(2) system call frequency. The second is to reduce the frequency of time(2) system calls in the application code. The suggested quick solution employs interposed libraries, so there is no need to change the original application code.

As an example, we have taken a sample application that performs data distribution for analysis. The application handles thousands of messages every second. Each message is time stamped with the current time, using the time(2) system call.

One way to find out the frequency of use of time(2), or any other system call, is to use the truss(1) command, a utility in the Solaris OS that traces system calls and signals. For example:

    % truss -c -p pid

Here pid is the process ID for the sample application, and the -c option is used to count traced system calls, faults, and signals (rather than displaying the trace line by line, which is the default behavior). A summary report is produced after the traced command terminates or when truss is interrupted by Ctrl-C.

In Code Sample 1, we see an example truss output for the sample application process (whose pid was 1365). In this case, the truss command was terminated by Ctrl-C after a sufficiently long sample interval.

Code Sample 1: truss Output Before Any Optimization

In the sample time of 84.98 seconds, time(2) was called nearly 10,000 times every second. A large amount of system time (10.045 seconds) was devoted to servicing these calls. Since the time(2) call has a one-second granularity, making this call several thousand times per second is certainly unnecessary.

We can optimize the use of time(2) for the purposes of time stamping by implementing a local time() function that caches the current time and only makes a system call when enough time has elapsed between calls. If insufficient time has elapsed since the last call to our local time function, we simply return the cached value. We can do this because we have, in the Solaris OS, access to another time function that is substantially faster than time(2), namely gethrtime(3C). (See "Measuring Execution Time in POSIX Compliant Programs and UNIX" in the References section.)

The book Inside Solaris, by Richard McDougall and Jim Mauro, says the following about gethrtime(3C):

    gethrtime(3C) is known as a fast trap system call. This means that an invocation of gethrtime(3C) does not incur the normal overhead of a typical system call. Rather, it generates a fast trap into the kernel, which reads the hardware TICK register value and returns. While many system calls may take microseconds to execute (non-I/O system calls, that is; I/O system calls will be throttled by the speed of the device they're reading or writing), gethrtime(3C) takes a few hundred nanoseconds on a 300 MHz UltraSPARC processor. It's about 1,000 times faster than a typical system call.

The source code for the shared library (libfasttime.so) is given below. In this module, the symbol for time(2) is interposed to execute the optimized, caching time() library function. Thus, code changes in the rest of the application are unnecessary. The new function obtains the current high-resolution time (in nanoseconds) using gethrtime(3C), and compares it to the cached value recorded when the system call was last made. If the call was issued within a certain delta of that value, defined in the code below to be 1 millisecond, the cached time is returned, and no time-consuming system call is made. Once sufficient time has elapsed between the original call to time() and the current one, the system call is made, the cached value is reset, and the process starts over.

To compile the time.c file to build a libfasttime.so library, use:

    % cc -G -Kpic -o libfasttime.so -xO3 -xarch=v8plus time.c

For quick performance testing, this library can be preloaded for the purposes of linking with an application by setting the following (in bash):

    LD_FLAGS_32=preload=/tmp/libfasttime.so

However, the preferred way is to link the libfasttime.so library during the build of your application.

Note: This library can also be compiled in 64-bit mode for 64-bit applications by using:

    % cc -G -Kpic -o libfasttime.so -xO3 -xarch=v9 time.c

The 64-bit library can likewise be preloaded by setting the following (in bash):

    LD_FLAGS_64=preload=/tmp/libfasttime.so

In Code Sample 2, we provide the source code for the time(2) wrapper.

Code Sample 2: Source Code for time(2) Wrapper (File time.c)

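As a rough illustration of the caching scheme described above (not the authors' time.c), a minimal sketch might look like the following. The names cached_time, cached_time_refreshes, now_ns, and DELTA_NS are hypothetical. On systems other than Solaris the sketch assumes clock_gettime(CLOCK_MONOTONIC) as a stand-in for gethrtime(3C). An interposing library would export the function under the name time and would have to reach the real call some other way (for example through dlsym(RTLD_NEXT, "time")), which this sketch does not attempt. The static state is not protected against concurrent access.

```c
/* Sketch of a caching time() replacement; an illustration, not the original time.c. */
#if !defined(__sun)
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime() under strict C modes */
#endif

#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__sun)
#include <sys/time.h>             /* gethrtime(3C), the fast-trap clock */
#endif

/* Refresh interval in nanoseconds: 1 millisecond, as described in the text. */
#define DELTA_NS INT64_C(1000000)

/* Assumption: plain statics, so this sketch is NOT thread-safe. */
static int64_t       last_refresh_ns;    /* high-resolution time of last refresh  */
static time_t        cached_seconds;     /* result of the last real time() call   */
static int           have_cached_value;  /* zero until the first refresh          */
static unsigned long refresh_count;      /* hypothetical counter, for inspection  */

/* Cheap high-resolution clock, in nanoseconds. */
static int64_t now_ns(void)
{
#if defined(__sun)
    return (int64_t)gethrtime();
#else
    /* Assumption: stand-in for gethrtime() on non-Solaris systems. */
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * INT64_C(1000000000) + (int64_t)ts.tv_nsec;
#endif
}

/* Same contract as time(): returns seconds since the Epoch and, if out is
 * non-NULL, also stores the value there. The real time() is called only
 * when at least DELTA_NS has elapsed since the previous refresh. */
time_t cached_time(time_t *out)
{
    int64_t now = now_ns();

    if (!have_cached_value || now - last_refresh_ns >= DELTA_NS) {
        cached_seconds = time(NULL);     /* the expensive call */
        last_refresh_ns = now;
        have_cached_value = 1;
        refresh_count++;
    }
    if (out != NULL)
        *out = cached_seconds;
    return cached_seconds;
}

/* Number of times the real time() has been called so far. */
unsigned long cached_time_refreshes(void)
{
    return refresh_count;
}
```

Calling cached_time(NULL) in a tight loop triggers the real time() call at most about once per DELTA_NS, so roughly 1,000 refreshes per second with the 1 millisecond setting, no matter how often the caller asks.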
truss Output After Linking With Optimized fasttime Library

The number of times time(2) was called decreased by 90 percent, and the system time was reduced by 60 percent. This improved the performance of the sample data distribution application overall. The sample application was able to provide noticeably more throughput per second than when it was running without the libfasttime.so library.

Since sampling theory tells us that to completely capture a signal we need only sample at twice the rate of the highest frequency, DELTA in Code Sample 2 could be changed to 500 milliseconds with potentially even more time savings. At the one-second granularity of time(2) this changes little, although a cached value can lag the clock by up to DELTA.

So if you have a system that makes extensive use of time stamping, or otherwise makes frequent calls to the time(2) function, try the optimization we have outlined here.

References

Inside Solaris, by Richard McDougall and Jim Mauro (reprinted with author's permission)