
SoC performance benchmark

2015-10-10 11:27


Preface

This article describes the programs I use to benchmark SoC performance (including SMP systems), along with the steps to build and run them. At the end, I provide two scripts that make the benchmarking work more efficient.

These benchmark programs measure integer and floating-point performance, as well as the latency of the L1 and L2 caches. All of them can be downloaded from the Internet, and some come from lmbench. For lmbench, see my previous blog post (in Chinese), ARM Linux BenchMark, and the GitHub repo that accompanies it:
https://github.com/tonyho/ARM_BenchMark
Besides, if you want to compare the SoC in a phone with an ARM Linux board, you can do the following:

①Install the benchmark APKs on the Android phone and run them (Roy Longbottom has collected and modified many benchmark tools for Android)

②Then use the tools in the repo below to run the benchmarks on the ARM Linux board:
https://github.com/tonyho/ARM-MP-BenchMark
③Compare the results


1. Integer BenchMark: CoreMark(version:1.01)


compile:

Download CoreMark from http://www.eembc.org/
①Compile the source code for a single-core CPU:

arm-poky-linux-gnueabi-gcc -c -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a15 -I./ -Isimple -DITERATIONS=0 -DSEED_METHOD=SEED_ARG -DCOMPILER_FLAGS=\""-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15-Os\"" -Os core_main.c core_list_join.c core_matrix.c core_state.c core_util.c simple/core_portme.c


Link:
arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark -lc


For a static link:
arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark.static -lc -static


②Compile the source code for a multicore CPU:

cp linux/ -r arm_ti


#Set CC and LD to the cross-compile toolchain's gcc
gvim arm_ti/core_portme.mak


#build the coremark:
make PORT_DIR=./arm_ti/ XCFLAGS="-DMULTITHREAD=4 -DUSE_FORK=1"
make PORT_DIR=./arm_ti/ REBUILD=1


③Toolchain problem

Some toolchains, such as the one built by Yocto 1.6.1, cannot pass a string macro that contains spaces. For these, prepare the port directory as before:
cp linux/ -r arm_ti


#Set CC and LD to the cross-compile toolchain's gcc
gvim arm_ti/core_portme.mak


Build the source code; the output executable is coremark.exe:
make clean && arm-poky-linux-gnueabi-gcc -O2 -I./arm_ti/ -I. -DFLAGS_STR=\""-O2-DMULTITHREAD=2-DUSE_FORK=1-DPERFORMANCE_RUN=1-lrt"\" -DITERATIONS=0 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 core_list_join.c core_main.c core_matrix.c core_state.c core_util.c ./arm_ti//core_portme.c -o ./coremark.exe -lrt
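
The FLAGS_STR value in the command above is the flag list with its spaces removed, which is how it avoids the string-macro problem. A minimal sketch of deriving that string from the flags, so the two stay in sync (nospace_flags is a hypothetical helper name, not part of CoreMark):

```shell
#!/bin/sh
# nospace_flags: hypothetical helper; prints its argument with all spaces removed.
nospace_flags() {
    printf '%s\n' "$1" | tr -d ' '
}

# The flag list used in the gcc command above.
FLAGS="-O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt"
FLAGS_STR=$(nospace_flags "$FLAGS")
echo "$FLAGS_STR"
```

The printed string is the one to place between the escaped quotes of -DFLAGS_STR.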


usage:


1. Copy coremark (coremark.exe for the multicore build) to /usr/bin

cp coremark/coremark.exe ...


2. Run coremark


Replace ITER_PROFILE with a number large enough that coremark runs for at least 1 minute.
time coremark/coremark.exe 0x0 0x0 0x66 ITER_PROFILE 7 1 2000
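
One way to choose ITER_PROFILE is to do a short trial run, read its Iterations/Sec, and scale that rate up to the target duration. A minimal sketch of the arithmetic, assuming awk is available (min_iterations is a hypothetical function name; the rate used is the single-core Iterations/Sec from the example log in this article):

```shell
#!/bin/sh
# min_iterations RATE [SECONDS]: smallest whole iteration count that keeps
# a run going for at least SECONDS (default 60) at RATE iterations/sec.
min_iterations() {
    awk -v rate="$1" -v secs="${2:-60}" 'BEGIN {
        n = rate * secs
        i = int(n)
        if (i < n) i++      # round up to a whole iteration
        print i
    }'
}

min_iterations 1595.215133
```

At that rate the function prints 95713, so an ITER_PROFILE of roughly 100000 or more keeps a run on that core going for at least a minute.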


3. Get the average result


After coremark prints the result, rerun it several times, take the Iterations/Sec value from each run, average them, and fill in the table. E.g.:
time coremark 0x0 0x0 0x66 400000 7 1 2000


①single core result log example

2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 250749878
Total time (secs): 250.749878
Iterations/Sec : 1595.215133
Iterations : 400000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x65c5
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 1595.215133 / GCC4.8.3 20140401 (prerelease) arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15 / STACK

real 4m10.831s
user 4m10.750s
sys 0m0.000s


②multicore/multithread result log example


2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 58661
Total time (secs): 58.661000
Iterations/Sec : 9546.376639
Iterations : 560000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt
Parallel Fork : 2
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[1]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[1]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[1]crcstate : 0x8e3a
[0]crcfinal : 0xbd59
[1]crcfinal : 0xbd59
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 9546.376639 / GCC4.8.3 20140401 (prerelease) -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt / Heap / 2:Fork
real 0m58.670s
user 1m57.260s
sys 0m0.000s
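
When collecting scores from logs like the two above, it helps to accept a score only if the run reported "Correct operation validated". A minimal sketch, assuming each log file holds a single run (coremark_score is a hypothetical function name):

```shell
#!/bin/sh
# coremark_score LOGFILE: print the Iterations/Sec value of a single-run log,
# but only if the run was validated; otherwise complain and return 1.
coremark_score() {
    if ! grep -q 'Correct operation validated' "$1"; then
        echo "not validated: $1" >&2
        return 1
    fi
    # Take the text after the colon and strip the surrounding blanks.
    awk -F: '/^Iterations\/Sec/ { gsub(/[ \t]/, "", $2); print $2; exit }' "$1"
}

# Example (hypothetical file name): coremark_score coremark_run1.log
```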


For more details, refer to the ARM document: CoreMark Benchmarking for ARM Cortex Processors


2. Float BenchMark

Use lat_ops from lmbench (version 3.0), a single-core test program


1. Program location

The program is at lmbench/bin/lat_ops. Copy lmbench to the target board:
cp -r lmbench /


2. Run

Change the working directory to lmbench/bin/arm-linux, run lat_ops several times, and take the average as the result.

For example:
root@xxx:/# cd /lmbench/bin/arm-linux/
root@xxx:/lmbench/bin/arm-linux# ./lat_ops
integer bit: 0.67 nanoseconds
integer add: 0.67 nanoseconds
integer mul: 2.08 nanoseconds
integer div: 57.43 nanoseconds
integer mod: 8.11 nanoseconds
int64 bit: 0.68 nanoseconds
uint64 add: 0.74 nanoseconds
int64 mul: 3.36 nanoseconds
int64 div: 90.15 nanoseconds
int64 mod: 62.60 nanoseconds
float add: 3.36 nanoseconds
float mul: 4.04 nanoseconds
float div: 12.14 nanoseconds
double add: 3.36 nanoseconds
double mul: 4.04 nanoseconds
double div: 21.52 nanoseconds
float bogomflops: 10.77 nanoseconds
double bogomflops: 20.20 nanoseconds
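
Averaging several lat_ops runs by hand is tedious, so here is a minimal sketch that averages every row of the output format shown above. It is not the averaging script mentioned later in this article; avg_lat_ops is a hypothetical name, and awk is assumed to be available wherever it runs.

```shell
#!/bin/sh
# avg_lat_ops: read one or more lat_ops outputs on stdin and print the
# per-row average, keeping the rows in their first-seen order.
avg_lat_ops() {
    awk -F': *' '
        /nanoseconds/ {
            if (!($1 in n)) order[++k] = $1   # remember first-seen order
            sum[$1] += $2                     # "0.67 nanoseconds" counts as 0.67
            n[$1]++
        }
        END {
            for (i = 1; i <= k; i++)
                printf "%s = %.6f\n", order[i], sum[order[i]] / n[order[i]]
        }'
}

# Example: for i in 1 2 3; do ./lat_ops 2>&1; done | avg_lat_ops
```

The 2>&1 in the example is there in case the tool writes its results to stderr rather than stdout.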


3. L1 L2 Cache Latency BenchMark

Use lat_mem_rd from lmbench (version 3.0), a single-core test program


1. Prepare

Program location: lmbench/bin/lat_mem_rd. Copy lmbench to the target board:
cp -r lmbench /


2. Run

Change the working directory to lmbench/bin/arm-linux, run lat_mem_rd several times, and take the average as the result.
./lat_mem_rd 1M


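In the program output, the first column is the array size in MB and the second column is the load latency in nanoseconds. The rows at the following sizes are taken as the cache latency values:

0.00098-->L1 Cache

0.12500-->L2 Cache

E.g.: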
root@xxx:/lmbench/bin/arm-linux# ./lat_mem_rd 1M
"stride=128
0.00049 2.687
0.00098 2.688
0.00195 2.688
0.00293 2.688
0.00391 2.669
0.00586 2.669
0.00781 2.669
0.01172 2.669
0.01562 2.669
0.02344 8.708
0.03125 7.198
0.04688 13.687
0.06250 13.189
0.09375 14.683
0.12500 14.683
0.18750 14.746
0.25000 14.746
0.37500 14.783
0.50000 14.933
0.75000 27.538
1.00000 70.250
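
To pull the two rows of interest out of a saved log and express them in core clock cycles, something like the following can be used. This is a sketch under assumptions: lat_at and ns_to_cycles are hypothetical names, the log is assumed to contain one run, and the clock frequency has to be supplied by you (1500 MHz is the value used in the Dhrystone section of this article).

```shell
#!/bin/sh
# lat_at SIZE_MB LOGFILE: print the latency (ns) of the row whose first
# column equals SIZE_MB; header lines such as "stride=128 never match.
lat_at() {
    awk -v size="$1" '$1 == size { print $2; exit }' "$2"
}

# ns_to_cycles NS MHZ: convert a latency in nanoseconds to clock cycles.
ns_to_cycles() {
    awk -v ns="$1" -v mhz="$2" 'BEGIN { printf "%.2f\n", ns * mhz / 1000 }'
}

# Example, assuming the output above was saved as lat_mem_rd.log:
#   ns_to_cycles "$(lat_at 0.00098 lat_mem_rd.log)" 1500
#   ns_to_cycles "$(lat_at 0.12500 lat_mem_rd.log)" 1500
```

For the log above and a 1500 MHz clock, this works out to roughly 4 cycles for the L1 row and roughly 22 cycles for the L2 row.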


4. DMIPS BenchMark

Use Dhrystone (version 2.1), a single-core test program


1. Get the source

Get the source from: http://www.roylongbottom.org.uk/linux%20benchmarks.htm#anchor4
wget 'http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz'
wget 'http://linux-sunxi.org/images/a/a1/Classic_benchmarks.patch'
tar -xzf classic_benchmarks.tar.gz
patch -p0 < Classic_benchmarks.patch
cd classic_benchmarks/source_code/



2. Set the tuning options

Change the toolchain path and the tuning options:
gvim Makefile

CC=gcc-4.7 ==> CC=XXXX-gcc
CFLAGS=-static -O3 -mcpu=cortex-A8 -mtune=cortex-A8 -mfpu=neon -funroll-loops ==>
CFLAGS=-static -O3 -mcpu=cortex-A15 -mtune=cortex-A15 -mfpu=neon -funroll-loops
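
The same Makefile change can be applied without an editor. A minimal sketch using sed, assuming the Makefile contains the CC= and CFLAGS= lines exactly as shown above (retarget_makefile is a hypothetical name, and XXXX-gcc remains a placeholder for your cross compiler):

```shell
#!/bin/sh
# retarget_makefile FILE CC CPU: rewrite the CC line and the values of
# -mcpu and -mtune in FILE. CC and CPU must not contain "|" or "&".
retarget_makefile() {
    tmp="$1.tmp"
    sed -e "s|^CC=.*|CC=$2|" \
        -e "s|-mcpu=[^ ]*|-mcpu=$3|" \
        -e "s|-mtune=[^ ]*|-mtune=$3|" "$1" > "$tmp" && mv "$tmp" "$1"
}

# Example: retarget_makefile Makefile XXXX-gcc cortex-A15
```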


3. Change the SoC type string and CPU frequency

gvim common_32bit/cpuidc.c


Change the string and the SoC frequency:
strcpy(idString1, "Cortex A8"); ==> strcpy(idString1, "Cortex A15");
megaHz = 1000; ==> megaHz = 1500;
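
If you rebuild for several boards, this source edit can also be scripted. A minimal sketch, assuming the two statements appear in the file exactly in the form shown above (set_cpu_id is a hypothetical name):

```shell
#!/bin/sh
# set_cpu_id FILE NAME MHZ: replace the idString1 text and the megaHz value.
# NAME must not contain "|", "&" or double quotes.
set_cpu_id() {
    tmp="$1.tmp"
    sed -e "s|strcpy(idString1, \"[^\"]*\");|strcpy(idString1, \"$2\");|" \
        -e "s|megaHz = [0-9]*;|megaHz = $3;|" "$1" > "$tmp" && mv "$tmp" "$1"
}

# Example: set_cpu_id common_32bit/cpuidc.c "Cortex A15" 1500
```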


4. Build the program

make


5. Run the dhry2 test program

1. Copy dhry2 to the target board, make it executable, and run it:
cp dhry2 XXXX
chmod a+x ./dhry2
./dhry2


2. The VAX MIPS rating is the DMIPS value. Rerun several times and take the average as the result.

E.g.:
root@xxx:/# dhry2
####################################################
getDetails and MHz

Assembler CPUID and RDTSC
CPU Cortex A8, Features Code 00000000, Model Code 00000000

Measured - Minimum 1500 MHz, Maximum 1500 MHz
Linux Functions
get_nprocs() - CPUs 2, Configured CPUs 2
get_phys_pages() and size - RAM Size 1.97 GB, Page Size 4096 Bytes
uname() - Linux, saturn15, 3.10.31-ltsi
#1 SMP PREEMPT Tue Dec 9 13:39:16 JST 2014, armv7l

##########################################

Dhrystone Benchmark, Version 2.1 (Language: C or C++)

Optimisation Opt 3 64 Bit
Register option not selected

40000 runs 0.00 seconds
400000 runs 0.05 seconds
4000000 runs 0.49 seconds
8000000 runs 0.97 seconds
16000000 runs 1.94 seconds
32000000 runs 3.89 seconds

Final values (* implementation-dependent):

Int_Glob: O.K. 5 Bool_Glob: O.K. 1
Ch_1_Glob: O.K. A Ch_2_Glob: O.K. B
Arr_1_Glob[8]: O.K. 7 Arr_2_Glob8/7: O.K. 32000010
Ptr_Glob-> Ptr_Comp: * 610704
Discr: O.K. 0 Enum_Comp: O.K. 2
Int_Comp: O.K. 17 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob-> Ptr_Comp: * 610704 same as above
Discr: O.K. 0 Enum_Comp: O.K. 1
Int_Comp: O.K. 18 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: O.K. 5 Int_2_Loc: O.K. 13
Int_3_Loc: O.K. 7 Enum_Loc: O.K. 1
Str_1_Loc: O.K. DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc: O.K. DHRYSTONE PROGRAM, 2'ND STRING

Microseconds for one run through Dhrystone: 0.12
Dhrystones per Second: 8232458
VAX MIPS rating = 4685.52

Press Enter
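
The VAX MIPS rating in the log is the Dhrystones per Second figure divided by 1757, the Dhrystone score of the VAX 11/780 reference machine. Dividing the rating by the clock frequency gives DMIPS/MHz, which is convenient for comparing cores that run at different clocks. A minimal sketch of both calculations (the function names are hypothetical):

```shell
#!/bin/sh
# dmips DHRYSTONES_PER_SEC: convert Dhrystones/sec to DMIPS (VAX MIPS).
dmips() {
    awk -v d="$1" 'BEGIN { printf "%.2f\n", d / 1757 }'
}

# dmips_per_mhz DMIPS MHZ: normalise a DMIPS value by the clock frequency.
dmips_per_mhz() {
    awk -v m="$1" -v f="$2" 'BEGIN { printf "%.2f\n", m / f }'
}

dmips 8232458                # Dhrystones per Second from the log above
dmips_per_mhz 4685.52 1500   # VAX MIPS rating and the 1500 MHz clock
```

With the values from the log this prints 4685.52 and 3.12.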


6. Scripts

When benchmarking, we usually run each test several times and then average the results to get a final figure. I have written two scripts to do this.

The two scripts are in my Bitbucket snippet, CPU_BenchMark_Scripts:
CPUBenchMark_Average.sh: runs on the host, or on a target board that has bash, awk and grep
CPU_RunBenchMark.sh: runs on the target

CPU_RunBenchMark.sh runs the benchmark programs and stores the results in PROGRAM_NAME.log, where PROGRAM_NAME is the name of the program, e.g. coremark.

CPUBenchMark_Average.sh averages the results stored in the PROGRAM_NAME.log files.

The steps to use the scripts are:

①Copy the benchmark programs (coremark.exe, dhry2, lat_ops, lat_mem_rd) to the target board

②Copy CPU_RunBenchMark.sh and CPUBenchMark_Average.sh to the same directory as the benchmark programs

③Modify CPU_RunBenchMark.sh to suit your directory layout:
runTest coremark_v1.0 'time ./coremark.exe 0x0 0x0 0x66 200000 7 1 2000' coremark.log
runTest classic_benchmarks/source_code 'echo | ./dhry2' dhry2.log 10
runTest lmbench/bin/arm-linux './lat_ops' lat_ops.log
runTest lmbench/bin/arm-linux './lat_mem_rd 1M' lat_mem_rd.log


The runTest shell function runs a program ($2) located in the directory $1.

④Modify the for loop to set how many times each benchmark program runs:
for i in 1 2 3 4 5 6 7 8 9 10;do
eval "$2" 2>&1 | tee -a $3
done
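
For readers who do not have the snippet at hand, here is a minimal sketch of what a function with the runTest calling convention could look like. It is a reconstruction from the calls and the loop shown above, not the actual script. Treating the optional fourth argument as a run count (default 10) is an assumption, as is writing the log inside the program's directory.

```shell
#!/bin/sh
# runTest DIR CMD LOG [RUNS]: run CMD inside DIR, RUNS times (default 10),
# appending stdout and stderr of every run to LOG (relative to DIR).
# The body is a subshell, so the caller's working directory is unchanged.
runTest() (
    dir=$1; cmd=$2; log=$3; runs=${4:-10}
    cd "$dir" || exit 1
    i=0
    while [ "$i" -lt "$runs" ]; do
        eval "$cmd" 2>&1 | tee -a "$log"
        i=$((i + 1))
    done
)

# Example: runTest lmbench/bin/arm-linux './lat_ops' lat_ops.log 5
```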


⑤Average the results

If the target board ships grep and awk, just run CPUBenchMark_Average.sh there. If it does not have these tools, copy the logs and scripts to the host PC and run it there. The script prints the results to STDOUT, e.g.:
$ sh average.sh
===========CoreMark================================
Iterations/Sec = 9569.107810
===========Dhry2===================================
VAX MIPS rating = 4685.468000
===========L1 Lat==================================
0.00098 = 2.669300
===========L2 Lat==================================
0.12500 = 14.684400
===========integer=================================
integer bit = 0.670000
integer add = 0.670000
integer mul = 2.070000
integer div = 56.908000
integer mod = 8.044000
===========int64==================================
int64 bit = 0.670000
uint64 add = 0.710000
int64 mul = 3.340000
int64 div = 89.491000
int64 mod = 62.155000
===========float==================================
float add = 3.340000
float mul = 4.009000
float div = 12.022000
===========double=================================
double add = 3.340000
double mul = 4.010000
double div = 21.372000
===========float/double bogo======================
float bogomflops = 10.688000
double bogomflops = 20.038000
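
The averaging itself is simple enough to sketch in a few lines of awk. The function below is an illustration under assumptions, not the contents of CPUBenchMark_Average.sh: avg_last_field is a hypothetical name, and it only handles lines whose last field is the number to average, which covers the CoreMark, Dhrystone and lat_mem_rd logs but not the lat_ops lines that end in "nanoseconds".

```shell
#!/bin/sh
# avg_last_field REGEX LOGFILE: average the last field of every line in
# LOGFILE that matches REGEX. Returns 1 if no line matches.
avg_last_field() {
    awk -v re="$1" '
        $0 ~ re { sum += $NF; n++ }
        END {
            if (n == 0) exit 1
            printf "%.6f\n", sum / n
        }' "$2"
}

# Examples:
#   avg_last_field '^Iterations/Sec' coremark.log
#   avg_last_field 'VAX MIPS rating' dhry2.log
#   avg_last_field '^0[.]12500 '     lat_mem_rd.log
```

The dot in the last pattern is written as [.] so that no backslash has to pass through awk's -v assignment.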


If this article has formatting problems, please see: http://www.hexiongjun.com/?p=174

Please credit the source when reposting. Author: TonyHo, hexiongjun.com