Wow! Even Microsoft uses AutoESL's C synthesis to speed up its SW

2010-06-10 00:17
We purchased AutoESL's AutoPilot in 2008 to implement some of the time-

consuming cores in our software into FPGA hardware for the runtime speed-up

improvements.  We found this can often accelerate our SW runtimes by 2-3

orders of magnitude.  The AutoESL C-to-RTL synthesis tool claims to support

both Altera and Xilinx FPGAs, as well as ASICs, but we only tried it on

Altera Stratix II's.  Our software:

1. RankBoost - a machine-learning algorithm used in the dynamic ranking

of search engines.  RankBoost is several thousand lines of ANSI C

with the synthesizable time-consuming core being 149 lines.

We used AutoPilot to generate RankBoost's core computation logic and

integrate it with existing interface IP cores like the DDR2 controller

IP core.  AutoESL utilized common megafunctions for the target devices

and automatically generated the Avalon bus interface.  The final

implementation had about 12000 ALUTs on the Altera Stratix II FPGA.

2. Sorting Algorithm - also several thousand lines of OO C++ code with 138

lines that needed speeding up.  AutoESL again utilized megafunctions

for our Stratix II's.  Additionally, we used AutoPilot's "APInt", or

arbitrary precision integers.  AutoPilot has a source-level simulation

utility for APInt, and the resource usage depends on the size of the

array in the processing engine used.  For example, when the sorting

processing engine array size is 128, the synthesis result shows a total of

11346 ALUTs.

In general, AutoPilot takes a high level description of a design in ANSI C,

OO C++ or SystemC as input, and synthesizes it into Verilog/VHDL RTL code.

AutoPilot also automatically generates a vector-based testbench from the

C/C++ level testbench for users to use to verify the design.

ANSI C vs. OO C++ vs. SystemC

AutoPilot supported all the ANSI C and C++ language constructs we needed

to implement our algorithms in hardware.  Our standard C/C++

function parameters were synthesized into various handshaking, memory,

streaming, and bus interfaces.  We didn't test all the language features

that AutoESL claimed to support, but I believe our two cases covered the

most commonly used language features that we may use as input to AutoPilot.

For our RankBoost, we used ANSI C.  In the Sorting Algorithm, we used C++.

1. We wrote our RankBoost design in ANSI-C. (C is simple and compact).

2. We wanted to implement the Sorting Algorithm using an object oriented

style code; since ANSI-C is not object oriented, we used C++ for it.

We wrote the Sorting Algorithm to take advantage of a couple of C++

features, including classes and templates, so that the code itself

would be more generic and reusable.  For example, our data element

types could be easily configured with template parameters.

3. We used AutoESL's APInt data type (arbitrary precision integer data

type) for the Sorting Algorithm.  APInt is supported in both C and C++,

but the implementation of APInt in C++ using templates was easier,

since AutoESL's C++ APInt is also a templatized class.
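To make the idea of an arbitrary precision integer concrete, here is a minimal plain-C sketch of what an N-bit unsigned value means: arithmetic that wraps at 2^N.  The helper names wrap_unsigned and add_n_bits are hypothetical and are not part of AutoPilot's APInt API; they only emulate the bit-width behavior on an ordinary compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an AutoPilot API): keep only the low `bits`
 * bits of `value`, emulating an unsigned integer that is `bits` wide. */
uint64_t wrap_unsigned(uint64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    if (bits == 64)
        return value;                          /* nothing to mask off */
    return value & ((UINT64_C(1) << bits) - 1);
}

/* Hypothetical helper: add two values as if both were stored in
 * `bits`-wide registers, so the result wraps at 2^bits. */
uint64_t add_n_bits(uint64_t a, uint64_t b, unsigned bits)
{
    return wrap_unsigned(a + b, bits);
}
```

For example, add_n_bits(15, 1, 4) wraps to 0, which is what a 4-bit hardware adder would produce.  Narrower types like this are presumably why the bit width matters for resource usage.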

We never tested the object-oriented C++ code in AutoESL; we had committed to

one particular Sorting Algorithm (odd-even sort) with fixed data type for

the implementation, and OO was not a must-have for this purpose.
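For readers unfamiliar with odd-even sort, the following is a textbook odd-even transposition sort in plain C.  It is only a software sketch of the general algorithm, not our synthesized code; the function names and the int element type are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook odd-even transposition sort, ascending order.
 * Each phase compares disjoint neighbor pairs, so in hardware all the
 * compare-and-swap steps of one phase could run in parallel. */
void odd_even_sort(int *data, size_t n)
{
    assert(data != NULL || n == 0);
    for (size_t phase = 0; phase < n; ++phase) {
        /* Even phases start at index 0, odd phases at index 1. */
        for (size_t i = phase % 2; i + 1 < n; i += 2) {
            if (data[i] > data[i + 1]) {
                int tmp = data[i];
                data[i] = data[i + 1];
                data[i + 1] = tmp;
            }
        }
    }
}

/* Returns 1 if data[0..n-1] is in ascending order, 0 otherwise. */
int is_sorted(const int *data, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        if (data[i - 1] > data[i])
            return 0;
    return 1;
}
```

Because the pairs compared within one phase never overlap, the algorithm maps naturally onto an array of identical processing elements, which fits the processing engine array described earlier.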

Design exploration:

One aspect of AutoPilot is that its fast runtime allowed us to do in-depth

explorations of the design space.  For RankBoost's core computation logic,

we investigated different performance/area tradeoffs while doing quick

retargeting from one ANSI C source to 2 different FPGA tech libraries in an

Altera Stratix-II:

Design and library                                   Mem      FP    Reg  Logic   Latency
                                                  (bits)   adder    FFs  ALUTs    cycles     MHz
------------------------------------------------  ------  ------  -----  -----  --------  ------
AutoPilot (8 PEs, XtremeData floating point lib)    128K       8   7911   5886       19M  140.55
AutoPilot (8 PEs, Altera FPU lib)                   128K       8   6295   5295       19M  107.03
AutoPilot (12 PEs, Altera FPU lib)                  144K      12   9999   9706       14M  105.49

Hand-generated code vs. AutoPilot generated code:

Hand-coded RTL                                      128K       8   5373   5523       19M  125.00
AutoPilot (final design)                            128K       9   5453   5316       19M  125.00

AutoPilot's RTL code generation time for the core SW in RankBoost was only

about 1.5 minutes -- near-zero compared to our time to hand-code RTL.

Because of AutoPilot's fast synthesis time, we did additional design space

exploration to select the best configuration.

We were able to get a QoR comparable to hand-coded RTL yet we still saw a

75% project time savings.

Manual RTL creation time, including verification: 2 months

AutoPilot RTL creation time, including verification: 2 weeks

The above time to create RTL with AutoPilot included 5 major revisions of

our C code for RankBoost.  We had cropped the initial code from RankBoost's

software implementation, and found the original coding style could be more

efficiently written for C synthesis implementation and optimization.  We

had two kinds of modifications on RankBoost:

1. Modifying the ANSI C code for better C synthesis.  For example, the

major body of our code was initially written in the main() function.

For synthesis, we wrapped the code into a separate function called from main(),

with this new function specified as the top module to be synthesized.

We also made changes to the parameters of the function and assigned

the interface type to the input and output as the following shows:

    void foo(float * mem_data,
             volatile uint64 * input_dataport1,
             volatile float * input_dataport2,
             int size,
             volatile float *output)
    {
    #pragma AUTOPILOT INTERFACE fifo port=input_dataport1
    #pragma AUTOPILOT INTERFACE fifo port=input_dataport2
    #pragma AUTOPILOT INTERFACE fifo port=output

        //major body of the code here
    }

Note: The "volatile" pointer type is needed to specify a FIFO.  If a

pointer is marked as volatile, the compiler won't optimize the number

and order of its read and write accesses.
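The effect of "volatile" can be seen in a small plain-C sketch.  The function sum_stream and its parameters are hypothetical, and the interface pragmas appear only as comments so that an ordinary compiler ignores them.  In software the input pointer refers to a single variable, so every read returns the same value; in hardware each read would presumably pop one item from the FIFO.

```c
#include <assert.h>

/* Sketch of a FIFO-style top function.  Every `*in` below is a separate
 * read that the compiler must keep, because `in` points to volatile data. */
void sum_stream(volatile float *in, int size, volatile float *out)
{
    /* #pragma AUTOPILOT INTERFACE fifo port=in  */
    /* #pragma AUTOPILOT INTERFACE fifo port=out */
    assert(size >= 0);
    float acc = 0.0f;
    for (int i = 0; i < size; ++i)
        acc += *in;      /* one read per iteration, never merged or hoisted */
    *out = acc;          /* exactly one write */
}
```

Without "volatile", a compiler would be free to read *in once and reuse the value, which would change the number of FIFO accesses.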

2. Modifying C code for improving code optimization.  For example, in our

initial code, we had

    for (j = 0; j < 255; ++j)
    {
        k = 255 - j;
        fHisto[k - 1] += fHisto[k];
    }

This piece of code was used to build an integral histogram from a

256-bin histogram.  We had thousands of histograms to be processed.

Each histogram is stored in an array declared as

float fHisto [256];
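As a software reference, the loop above can be wrapped in a small function and checked on a known input.  This is a minimal sketch; the function name integrate_histogram and the BINS macro are assumptions made for illustration.

```c
#include <assert.h>

#define BINS 256

/* Turn a histogram into its integral form in place: after the call,
 * h[k] holds the sum of the original bins k..BINS-1. */
void integrate_histogram(float h[BINS])
{
    assert(h != 0);
    for (int j = 0; j < BINS - 1; ++j) {
        int k = BINS - 1 - j;   /* k runs from 255 down to 1 */
        h[k - 1] += h[k];       /* needs the result of the previous iteration */
    }
}
```

With every bin set to 1, the first bin ends up as 256 and the last bin stays 1.  Each addition reads the value written by the one before it, which is the read-after-write dependency discussed next.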

Since the floating point adder in Altera's megafunction library needs

7 to 8 cycles to output the result and there is a read-after-write

dependency, the addition operation could not be fully pipelined in the

above code.  To remove bubbles in the pipeline, we put 16 histograms

together:

float fHisto [16][256];

And then processed them in an interleaved manner:

    for (j = 0; j < 255; ++j)
    {
        for (i = 0; i < 16; ++i)
        {
    #pragma AUTOPILOT pipeline II=1
            k = 255 - j;
            fHisto[i][k - 1] += fHisto[i][k];
        }
    }

Notice that we used a pragma to specify the loop pipelining interval.
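A plain-C sketch of the interleaved loop shows why the reordering is safe: each histogram still sees its additions in the same order, only with 15 additions on other histograms in between.  The names below and the can_hide_latency check are illustrative assumptions, and the pragma is kept as a comment so an ordinary compiler ignores it.

```c
#include <assert.h>

#define GROUP 16   /* histograms processed together */
#define BINS  256

/* Interleaved integral histograms: the inner loop touches GROUP different
 * histograms, so consecutive additions are independent of each other. */
void integrate_interleaved(float h[GROUP][BINS])
{
    assert(h != 0);
    for (int j = 0; j < BINS - 1; ++j) {
        int k = BINS - 1 - j;
        for (int i = 0; i < GROUP; ++i) {
            /* #pragma AUTOPILOT pipeline II=1 would go here for synthesis */
            h[i][k - 1] += h[i][k];
        }
    }
}

/* Simplified model: a dependent addition on the same histogram is issued
 * `group` iterations later, so the adder latency is hidden when
 * group >= adder_latency. */
int can_hide_latency(int group, int adder_latency)
{
    return group >= adder_latency;
}
```

Under this simplified model, a group of 16 covers the 7 to 8 cycle adder latency with room to spare, while a single histogram does not.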

To boost data-level parallelism, we implemented 8 more pipelines since

the histograms are independent of each other:

    float fHisto[8][16][256];

    for (j = 0; j < 255; ++j)
    {
        for (i = 0; i < 16; ++i)
        {
    #pragma AUTOPILOT pipeline II=1
            for (k = 0; k < 8; ++k)
            {
    #pragma AUTOPILOT unroll
                fHisto[k][i][255 - j - 1] += fHisto[k][i][255 - j];
            }
        }
    }

This code was then synthesized to an 8-way SIMD (Single Instruction,

Multiple Data) engine.  Through these code changes, we avoided the

bubbles in the RankBoost pipeline, reduced the latency, and fully

utilized data parallelism with an 8-way SIMD architecture.

I would like to mention that we could easily change it to a 16-way SIMD

by simply adding and modifying a few lines in the RankBoost C code.
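To get a feel for what interleaving and SIMD buy, here is a rough back-of-the-envelope cycle model in C.  It is an assumption-laden sketch, not AutoPilot output: it ignores memory traffic and control overhead, charges a single pipeline fill, and all function and parameter names are hypothetical.

```c
#include <assert.h>

/* Rough model, not tool output: with a loop-carried dependency each
 * addition waits `latency` cycles, so every step costs `latency`. */
long cycles_serial(long histograms, long steps, long latency)
{
    assert(histograms >= 0 && steps >= 0 && latency >= 1);
    return histograms * steps * latency;
}

/* Rough model: `group` histograms are interleaved at II=1 in each of
 * `lanes` parallel pipelines.  A batch of group * lanes histograms costs
 * steps * group cycles; one pipeline fill of `latency` cycles is added. */
long cycles_interleaved(long histograms, long steps, long latency,
                        long group, long lanes)
{
    assert(group >= latency && lanes >= 1);
    long per_batch = group * lanes;
    long batches = (histograms + per_batch - 1) / per_batch;  /* round up */
    return batches * steps * group + latency;
}
```

For 128 histograms, 255 steps, and an 8-cycle adder, this model gives 261120 cycles for the original loop and 4088 cycles for the interleaved 8-lane version.  With 16 lanes, 256 histograms fit in the same 4088 cycles, again only under these simplifying assumptions.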

On our Sorting Algorithm, the generated logic from AutoPilot was so close to

our theoretically optimal results that we saw no reason to implement it

manually for comparison purposes.  We just used AutoPilot's RTL.  So I don't

have hand-coded RTL vs. AutoPilot RTL data for the Sorting Algorithm.

The set-up and learning curve for AutoPilot:

It took us less than 1 day to set up the AutoPilot environment for the first

time, and only several minutes for the follow-on designs.

In the early stages, our design methodology was an iterative loop between

constraining AutoPilot synthesis and results analysis with its built-in

Control Data Flow Graph (CDFG).  Later, we started with the targeted micro

architecture in mind and then we created the C/C++ code plus corresponding

synthesis directives.  So it was important to our implementation to be

familiar with AutoPilot's directives.  Here's our ramp-up for AutoESL:

- 1 to 2 days for onsite training on AutoPilot: basics, methodologies,

tool setup, hands-on tutorials.

- 1 to 2 weeks to begin with your own design and learn by doing.  In

our case, we did this with our RankBoost project.

- 3 to 4 additional weeks to try out AutoPilot's other advanced features

like: simulation, integration with SoPC, customized IP, floating point,

advanced language optimization, etc.  This process may take some time;

I again prefer a "learning by doing" style, because some advanced

features are only needed in special cases.  We did this with our

Sorting Algorithm project.

So, overall a hardware designer experienced in RTL simulation and synthesis

should expect to spend 6 to 7 weeks getting ramped up on AutoESL.  Much of

this depends on how deeply they want to learn its advanced features:

- Controls.  Our users control results in several ways, including adding

synthesis directives to control pipelining, interfaces, and memory

using Tcl commands or pragmas or the GUI.

- GUI.  AutoPilot has a GUI for users to understand the generated logic.

For example, it has a schedule viewer to visualize the scheduling

result and a report view so you can easily compare QoRs for different

implementations.

- Floating point synthesis.  We used single-precision float type and

floating point adders for RankBoost; AutoPilot fully supports these

standard single- and double-precision floating point data types for

Altera platforms.  We could directly synthesize common floating-point

math routines such as square root, exponentiation, logarithm, etc.

- Loop and hierarchical function pipelining.  AutoESL's loop pipelining

allows multiple successive iterations of a loop to execute in parallel

by initiating one iteration before the previous one has completed.

This can optimize the design for both loop throughput and latency.

- Power reduction.  AutoPilot's optimization also includes various

transformations for power reduction, including Operation Gating, MUX

optimization and reduction, FSM coding, pipeline register gating, clock

gating as well as using given Multi-Vdd assignment.  We don't pay much

attention to power consumption with our current FPGAs so we didn't use

this power functionality, but it's an important feature and we would

like to try it out in the near future.

- Interface Synthesis.  Our designers use AutoPilot's standard function

parameters to infer the desired inputs and outputs to the environment

rather than hand code any target-specific interface timing behaviors

into our C/C++ source.  AutoPilot's interface synthesis converts the

parameter reads and writes into the actual interface accesses.  The

direction of the data transfer is inferred from the way a parameter is

used in the function body.

For example, based on the specified communication interfaces in the

platform library, a store operation on a scalar pointer (e.g., *p = x)

can be turned into a direct wire connection, or a FIFO write, or even

a bus write transfer.  This helps tremendously to keep our designers

away from the "devil-is-in-the-details" of the target platform and

focus more on developing the functional/algorithmic part of the design.

Currently, AutoPilot supports the following types of interfaces:

- Wire interface,

- Buffer interface,

- Memory interface,

- FIFO interface,

- Bus interface

The user can control the selection of interface with a few pragmas.

AutoPilot's negatives:

- It needs a better user interface for its CDFG.

- Needs better 3rd party tool chain support.  It took us a while to

set up the whole tool chain including our ModelSim RTL simulator and

Altera Quartus II FPGA implementation tools.

- AutoESL claims that AutoPilot supports Altera's SOPC builder tool

and Avalon bus interconnects.  However, we did not test these.

- It needs better Verilog support.  AutoPilot includes some libraries

written only in VHDL, for example a few platform-specific bus

interface adaptors are generated only in VHDL.  It would be better

if a Verilog version were generated as well.

AutoESL's technical support was professional and they covered the product,

integration into the design flow, and language.  We gave AutoPilot a first

look in 2007 and it's been delivering major features for FPGA-based design

in its recent releases.  This tool can produce very acceptable results in

a very short time.

I give AutoPilot a score of 4 out of 5 possible and would strongly recommend

it to others.

Wow! Even Microsoft uses AutoESL's C synthesis - DeepChip Homepage