A Performance Optimization for C/C++ Systems That Employ Time-Stamping

By Amjad Khan and Neelakanth Nadgir, November 23, 2004

This article describes how to optimize the performance of enterprise systems that employ extensive time-stamping using the time() system call in the Solaris Operating System. This optimization applies especially to the financial market, and is based on our work with a number of different independent software vendors (ISVs).
We have observed that the common practice of "time-stamping" messages, transactions, or other objects in a system can consume more resources than the developer might expect. In these systems, the time() system call is used to obtain the current time with which to stamp each message or object. (The time() system call returns the value of time in seconds since 00:00:00 UTC, January 1, 1970.)
With many (often thousands, or tens of thousands) of active objects in a typical enterprise system, this can lead to excessively high use of system CPU cycles. We have observed systems processing thousands of transactions or messages every second, each of which requires a time stamp every time it is acted upon. Such systems can end up calling time() several thousand times per second, incurring a significant overhead in system resources.
Two ways are available to reduce time() system call overhead. The first is to use our proposed optimized time() replacement, which uses caching to reduce the frequency of time() system calls. The second is to reduce the number of time() calls made in the application code itself. The quick solution suggested here employs an interposed library, so there is no need to change the original application code.
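The second approach, calling time() less often in the application itself, might look like the following minimal sketch. It is an illustration only and not code from the sample application: the message structure and the stamp_batch() function are hypothetical names. The idea is to read the clock once per batch of messages rather than once per message, which loses nothing when the batch is processed well within one second.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical message type; a real application has its own. */
struct message {
    int    id;
    time_t stamp;
};

/*
 * Stamp every message in a batch with the result of a single time() call.
 * Returns the time stamp that was applied.
 */
time_t stamp_batch(struct message *msgs, size_t count)
{
    assert(msgs != NULL || count == 0);

    time_t now = time(NULL);        /* one call for the whole batch */
    for (size_t i = 0; i < count; i++)
        msgs[i].stamp = now;
    return now;
}
```

Unlike the interposed library, this approach requires touching the application source, which is why the article concentrates on the first approach.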
As an example, we have taken a sample application that performs data distribution for analysis. The application handles thousands of messages every second. Each message is time stamped with the current time, using the time() system call. One way to find out how frequently time(), or any other system call, is used is the truss command, a utility in the Solaris OS that traces system calls and signals. For example:
% truss -c -p pid

Here pid is the process ID of the sample application, and the -c option is used to count traced system calls, faults, and signals (rather than displaying the trace line by line, which is the default behavior). A summary report is produced after the traced command terminates or when truss is interrupted by Ctrl-C.
In Code Sample 1, we see an example of truss output for the sample application process (whose pid was 1365). In this case, the truss command was terminated by Ctrl-C after a sufficiently long sample interval.
Code Sample 1: truss Output Before Any Optimization

% truss -c -p 1365

syscall               seconds   calls  errors
read                     .639   18636     956
time                    8.376  785118
semop                    .007     544     170
poll                     .362   23378
writev                   .627   32191
recv                     .000      14
sendmsg                  .031    1028
                       ------  ------    ----
sys totals:            10.045  860909    1126
usr time:              39.000
elapsed:               84.980
The results show that 785,118 calls were made to time() in the sample time of 84.98 seconds. That is more than 9,000 calls to time() every second. Most of the system time (8.376 of 10.045 seconds) was devoted to servicing these calls.
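The rate and the share of system time follow directly from the truss totals in Code Sample 1. The short sketch below shows the arithmetic; the function names calls_per_second() and percent_of() are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>

/* Average number of calls per second over a sampling interval. */
double calls_per_second(long calls, double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);
    return (double)calls / elapsed_seconds;
}

/* Percentage of a total that one part represents. */
double percent_of(double part, double total)
{
    assert(total > 0.0);
    return 100.0 * part / total;
}
```

With the figures from Code Sample 1, calls_per_second(785118, 84.98) comes to roughly 9,239 calls per second, and percent_of(8.376, 10.045) comes to roughly 83 percent of all system time.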
Since the time() call has a one-second granularity, making this call several thousand times per second is certainly unnecessary. We can optimize the use of time() for the purposes of time stamping by implementing a local time() function that caches the current time and makes a system call only when enough time has elapsed since the previous one. If insufficient time has elapsed since the last call to our local time function, we simply return the cached value. We can do this because the Solaris OS gives us access to another time function that is substantially faster than time(), namely gethrtime(). (See "Measuring Execution Time in POSIX Compliant Programs and UNIX" in the References section.)
The book Inside Solaris, by Richard McDougall and Jim Mauro, says the following about gethrtime():
"gethrtime() is known as a fast trap system call. This means that an invocation of gethrtime() does not incur the normal overhead of a typical system call. Rather, it generates a fast trap into the kernel, which reads the hardware TICK register value and returns. While many system calls may take microseconds to execute (non-I/O system calls, that is; I/O system calls will be throttled by the speed of the device they're reading or writing), gethrtime() takes a few hundred nanoseconds on a 300 MHz UltraSPARC processor. It's about 1,000 times faster than a typical system call."
The source code for the shared library (libfasttime.so) is given below. In this module, the symbol for time() is interposed so that callers execute the optimized, caching time() library function. Thus, code changes in the rest of the application are unnecessary. The new function obtains the current high-resolution time (in nanoseconds) using gethrtime(), and compares it to the cached value recorded when the system call was last made. If the call was issued within a certain delta, defined in the code below as 1 millisecond, the cached value is returned and no time-consuming system call is made. Once sufficient time has elapsed between the last real call to time() and the current one, the system call is made, the cached value is reset, and the process starts over.
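The caching logic described above can also be sketched in a portable form for experimentation outside Solaris. This is an illustration under stated assumptions, not the article's library: it substitutes the POSIX clock_gettime(CLOCK_MONOTONIC) call for the Solaris-specific gethrtime(), it does not interpose on time(), and the names cached_time(), now_ns(), CACHE_NS, and slow_calls are hypothetical. Like the library, it is not thread safe.

```c
#define _POSIX_C_SOURCE 199309L     /* for clock_gettime() under strict modes */

#include <assert.h>
#include <stdint.h>
#include <time.h>

#define CACHE_NS 1000000LL          /* reuse the cached value for 1 ms */

static long slow_calls = 0;         /* counts real time() calls, for illustration */

/* Monotonic clock in nanoseconds; stands in for Solaris gethrtime(). */
static int64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Return wall-clock seconds, refreshing the value at most once per CACHE_NS. */
time_t cached_time(void)
{
    static time_t  cached;
    static int64_t last_ns;
    static int     valid = 0;

    int64_t now = now_ns();
    if (!valid || now - last_ns > CACHE_NS) {
        cached = time(NULL);        /* the comparatively expensive call */
        slow_calls++;
        last_ns = now;
        valid = 1;
    }
    return cached;
}
```

Calling cached_time() in a tight loop and then inspecting slow_calls shows how few real time() calls are made. Whether this saves anything on a given platform depends on how cheap the monotonic clock is relative to time() there, which is an assumption to measure rather than take for granted.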
To compile the time.c file to build the libfasttime.so library, use:
% cc -G -Kpic -o libfasttime.so -xO3 -xarch=v8plus time.c

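For quick performance testing, this library can be preloaded into an application without relinking by setting the LD_PRELOAD environment variable to the path of libfasttime.so in the shell (for example, with export in bash) before starting the application.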
However, the preferred way is to link the libfasttime.so library during the build of your application.
Note: This library can also be compiled in 64-bit mode for 64-bit applications by using:
% cc -G -Kpic -o libfasttime.so -xO3 -xarch=v9 time.c

The 64-bit library can be preloaded in the same way. (Solaris also provides the LD_PRELOAD_64 environment variable, which applies only to 64-bit processes.)
In Code Sample 2, we provide the source code for the time() wrapper.
Code Sample 2: Source Code for the time() Wrapper (File time.c)

/*
 * Copyright 2004 Sun Microsystems, Inc.
 * 4150 Network Circle, Santa Clara, CA 95054
 * All Rights Reserved.
 * This software is the proprietary information of Sun Microsystems, Inc.
 * This code is provided by Sun "as is" and "with all faults." Sun
 * makes no representations or warranties concerning the quality, safety
 * or suitability of the code, either express or implied, including
 * without limitation any implied warranties of merchantability, fitness
 * for a particular purpose, or non-infringement. In no event will Sun
 * be liable for any direct, indirect, punitive, special, incidental
 * or consequential damages arising from the use of this code. By
 * downloading or otherwise utilizing this code, you agree that you
 * have read, understood, and agreed to these terms.
 */

/* to compile, use cc -G -Kpic -o libfasttime.so -xO3 -xarch=v8plus time.c */

#include <dlfcn.h>      /* dlsym(), RTLD_NEXT */
#include <stdio.h>      /* fprintf() */
#include <stdlib.h>     /* exit() */
#include <sys/time.h>   /* gethrtime(), hrtime_t (Solaris) */
#include <time.h>       /* time_t */

/* time in nanoseconds to cache the time system call */
#define DELTA 1000000   /* 1 millisecond */

/* pointer to the real time(), resolved in init_func() */
static time_t (*func) (time_t *);

/* interposed time(); not thread safe, because the cache is unsynchronized */
time_t time(time_t *tloc)
{
    static time_t global = 0;
    static hrtime_t old = 0;

    hrtime_t new = gethrtime();
    if (new - old > DELTA) {
        /* cache expired: make the real call and remember when we did */
        global = func(NULL);
        old = new;
    }
    /* honor the tloc argument on cached returns as well */
    if (tloc)
        *tloc = global;
    return global;
}

/* run init_func() when the library is loaded (Sun compiler pragma) */
#pragma init (init_func)
void init_func()
{
    func = (time_t (*) (time_t *)) dlsym (RTLD_NEXT, "time");
    if (!func) {
        fprintf(stderr, "Error initializing library\n");
        exit(1);
    }
}

Code Sample 3: truss Output After Linking With the Optimized time() Library

% truss -c -p 1701

syscall               seconds   calls  errors
read                    1.205   36702    2766
time                     .762   71953
semop                    .006     541     169
poll                     .672   44705
writev                  1.204   59945
recv                     .000      12
sendmsg                  .003      84
                       ------  ------    ----
sys totals:             3.855  213942    2935
usr time:              62.183
elapsed:               84.700
These code samples show that the number of time() system calls fell by about 91 percent (from 785,118 to 71,953) and total system time fell by about 62 percent (from 10.045 to 3.855 seconds) over nearly identical sampling intervals. This improved the overall performance of the sample data distribution application, which delivered noticeably more throughput per second than it did without the libfasttime.so library. Because time() has one-second granularity, sampling theory suggests that the clock needs to be read only twice per second to observe every change, so DELTA in Code Sample 2 could be raised to as much as 500 milliseconds for potentially even greater savings. The trade-off is that a returned value can then lag the true time by up to DELTA.
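The reductions can be checked against Code Samples 1 and 3 with a few lines of arithmetic. The sketch below also computes an upper bound on how many real time() calls the wrapper can make in a given interval, since it refreshes at most once per DELTA. The function names percent_reduction() and max_refreshes() are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>

/* Percentage by which a quantity fell from 'before' to 'after'. */
double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* Upper bound on cache refreshes in an interval, given DELTA in nanoseconds. */
double max_refreshes(double elapsed_seconds, double delta_ns)
{
    assert(delta_ns > 0.0);
    return elapsed_seconds * 1e9 / delta_ns;
}
```

With the measured figures, percent_reduction(785118, 71953) is about 90.8 and percent_reduction(10.045, 3.855) is about 61.6. The bound max_refreshes(84.700, 1000000) is about 84,700, which is consistent with the 71,953 calls that truss still observed with DELTA set to 1 millisecond.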
So if you have a system that makes extensive use of time stamping, or otherwise makes frequent calls to the time() function, try the optimization we have outlined here.

References

Measuring Execution Time in POSIX Compliant Programs and UNIX

Inside Solaris, by Richard McDougall and Jim Mauro (reprinted with author's permission)
