Wow! Even Microsoft uses AutoESL's C synthesis to speed up its SW
2010-06-10 00:17
We purchased AutoESL's AutoPilot in 2008 to move some of the time-consuming cores of our software into FPGA hardware and speed up runtime. We found this can often accelerate our SW runtimes by 2-3 orders of magnitude. The AutoESL C-to-RTL synthesis tool claims to support both Altera and Xilinx FPGAs, as well as ASICs, but we only tried it on Altera Stratix IIs.

Our software:

1. RankBoost - a machine-learning algorithm used in the dynamic ranking of search engines. RankBoost is several thousand lines of ANSI C, of which the synthesizable time-consuming core is 149 lines. We used AutoPilot to generate RankBoost's core computation logic and integrate it with existing interface IP cores like the DDR2 controller IP core. AutoESL utilized common megafunctions for the target devices and automatically generated the Avalon bus interface. The final implementation used about 12000 ALUTs on the Altera Stratix II FPGA.

2. Sorting Algorithm - also several thousand lines of OO C++ code with 138 lines that needed speeding up. AutoESL again utilized megafunctions for our Stratix IIs. Additionally, we used AutoPilot's "APInt", or arbitrary precision integers. AutoPilot has a source-level simulation utility for APInt, and the resource usage depends on the size of the array in the processing engine used. For example, when the sorting processing engine array is 128, the synthesis result shows a total of 11346 ALUTs.

In general, AutoPilot takes a high level description of a design in ANSI C, OO C++ or SystemC as input, and synthesizes it into Verilog/VHDL RTL code. AutoPilot also automatically generates a vector-based testbench from the C/C++ level testbench, which users can use to verify the design.

ANSI C vs. OO C++ vs. SystemC:

AutoPilot supported all the ANSI C and C++ language constructs we needed to implement our algorithms in hardware. Our standard C/C++ function parameters were synthesized into various handshaking, memory, streaming, and bus interfaces.
We didn't test all the language features that AutoESL claimed to support, but I believe our two cases covered the language features we are most likely to use as input to AutoPilot. For our RankBoost, we used ANSI C. In the Sorting Algorithm, we used C++.

1. We wrote our RankBoost design in ANSI C. (C is simple and compact.)

2. We wanted to implement the Sorting Algorithm in an object-oriented style; since ANSI C is not object oriented, we used C++ for it. We wrote the Sorting Algorithm to take advantage of a couple of C++ features, including classes and templates, so that the code itself would be more generic and reusable. For example, our data element types could be easily configured with template parameters.

3. We used AutoESL's APInt data type (arbitrary precision integer data type) for the Sorting Algorithm. APInt is supported in both C and C++, but using APInt in C++ with templates was easier, since AutoESL's C++ APInt is also a templatized class.

We never tested the object-oriented C++ code in AutoESL; we had committed to one particular Sorting Algorithm (odd-even sort) with a fixed data type for the implementation, and OO was not a must-have for this purpose.

Design exploration:

One strength of AutoPilot is that its fast runtime allowed us to do in-depth explorations of the design space. For RankBoost's core computation logic, we investigated different performance/area tradeoffs while quickly retargeting one ANSI C source to 2 different FPGA tech libraries on an Altera Stratix II:

    Design         Mem     FP     Reg    Logic   Latency
    and Lib       (bits)  adder   FFs    ALUTs   cycles     MHz
    ---------     ------  -----   ----   -----   -------   ------
    AutoPilot      128K     8     7911   5886      19M     140.55
     (8 PEs, XtremeData floating point lib)
    AutoPilot      128K     8     6295   5295      19M     107.03
     (8 PEs, Altera FPU lib)
    AutoPilot      144K    12     9999   9706      14M     105.49
     (12 PEs, Altera FPU lib)

Hand-generated code vs. AutoPilot generated code:

    Design         Mem     FP     Reg    Logic   Latency
    and Lib       (bits)  adder   FFs    ALUTs   cycles     MHz
    ---------     ------  -----   ----   -----   -------   ------
    Hand-coded     128K     8     5373   5523      19M     125.00
     RTL
    AutoPilot      128K     9     5453   5316      19M     125.00
     (final design)

AutoPilot's RTL code generation time for the core SW in RankBoost was only about 1.5 minutes -- near-zero compared to our time to hand-code RTL. Because of AutoPilot's fast synthesis time, we did additional design space exploration to select the best configuration. We got a QoR comparable to hand-coded RTL, yet we still saw a 75% project time savings.

    Manual RTL creation time, including verification:     2 months
    AutoPilot RTL creation time, including verification:  2 weeks

The above time to create RTL with AutoPilot included 5 major revisions of our C code for RankBoost. We had cropped the initial code from RankBoost's software implementation, and found that the original coding style could be rewritten to be more efficient for C synthesis and optimization. We made two kinds of modifications to RankBoost:

1. Modifying the ANSI C code for better C synthesis. For example, the major body of our code was initially written in the main() function. For synthesis, we moved the code into a separate function called from main(), with this new function specified as the top module to be synthesized. We also changed the parameters of the function and assigned interface types to the inputs and outputs, as the following shows:

    void foo(float *mem_data,
             volatile uint64 *input_dataport1,
             volatile float *input_dataport2,
             int size,
             volatile float *output)
    {
    #pragma AUTOPILOT INTERFACE fifo port=input_dataport1
    #pragma AUTOPILOT INTERFACE fifo port=input_dataport2
    #pragma AUTOPILOT INTERFACE fifo port=output
        // major body of the code here
    }

Note: The "volatile" pointer type is needed to specify a FIFO. If a pointer is marked as volatile, the compiler won't optimize the number and order of its read and write accesses.

2. Modifying the C code to improve optimization.
For example, in our initial code, we had

    for (j = 0; j < 255; ++j) {
        k = 255 - j;
        fHisto[k - 1] += fHisto[k];   // each bin depends on the bin just updated
    }

This piece of code builds an integral histogram from a 256-bin histogram. We had thousands of histograms to process. Each histogram is stored in an array declared as

    float fHisto[256];

Since the floating point adder in Altera's megafunction library needs 7 to 8 cycles to output its result and there is a read-after-write dependency between consecutive iterations, the addition could not be fully pipelined in the above code. To remove bubbles in the pipeline, we put 16 histograms together:

    float fHisto[16][256];

and then processed them in an interleaved manner:

    for (j = 0; j < 255; ++j) {
        for (i = 0; i < 16; ++i) {
    #pragma AUTOPILOT pipeline II=1
            k = 255 - j;
            fHisto[i][k - 1] += fHisto[i][k];
        }
    }

Consecutive iterations of the inner loop now touch different histograms, so two dependent additions on the same histogram are 16 iterations apart, which covers the adder latency. Notice that we used a pragma to specify the loop pipelining initiation interval (II). To boost data-level parallelism, we implemented 8 parallel pipelines, since the histograms are independent of each other:

    float fHisto[8][16][256];

    for (j = 0; j < 255; ++j) {
        for (i = 0; i < 16; ++i) {
    #pragma AUTOPILOT pipeline II=1
            for (k = 0; k < 8; ++k) {
    #pragma AUTOPILOT unroll
                fHisto[k][i][255 - j - 1] += fHisto[k][i][255 - j];
            }
        }
    }

This code was then synthesized to an 8-way SIMD (Single Instruction, Multiple Data) engine.

Through these code changes, we avoided the bubbles in the RankBoost pipeline, reduced the latency, and fully utilized data parallelism with an 8-way SIMD architecture. We could easily change it to a 16-way SIMD by adding and modifying a few lines in the RankBoost C code.

On our Sorting Algorithm, the generated logic from AutoPilot was so close to our theoretically optimal results that we saw no reason to implement it manually for comparison purposes. We just used AutoPilot's RTL, so I don't have hand-coded RTL vs. AutoPilot RTL data for the Sorting Algorithm.
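The histogram restructuring described above only reorders work across independent histograms, so it should give exactly the same numbers as the original single-histogram loop. The following is a minimal software-only sketch of how one might check that in plain C before synthesis. The AutoPilot pragmas are left out, and the names integrate_reference, integrate_interleaved, sample, count_mismatches and self_check, as well as the test data, are illustrative assumptions rather than anything taken from the RankBoost source.

```c
#include <assert.h>

#define BINS  256  /* bins per histogram, as in the review */
#define GROUP 16   /* histograms interleaved in one pipeline */
#define LANES 8    /* parallel pipelines (the 8-way SIMD width) */

/* Reference: the original single-histogram loop. */
void integrate_reference(float h[BINS])
{
    for (int j = 0; j < BINS - 1; ++j) {
        int k = BINS - 1 - j;
        h[k - 1] += h[k];
    }
}

/* Restructured loop nest from the review, without the AutoPilot pragmas. */
void integrate_interleaved(float h[LANES][GROUP][BINS])
{
    for (int j = 0; j < BINS - 1; ++j)
        for (int i = 0; i < GROUP; ++i)
            for (int k = 0; k < LANES; ++k)
                h[k][i][BINS - 1 - j - 1] += h[k][i][BINS - 1 - j];
}

/* Hypothetical test data: small integers, so every partial sum is exact in float. */
static float sample(int lane, int hist, int bin)
{
    return (float)((lane * 31 + hist * 7 + bin) % 8);
}

/* Runs both versions on identical data and returns the number of differing bins. */
int count_mismatches(void)
{
    static float ref[LANES][GROUP][BINS];
    static float opt[LANES][GROUP][BINS];
    int mismatches = 0;

    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i)
            for (int b = 0; b < BINS; ++b)
                ref[k][i][b] = opt[k][i][b] = sample(k, i, b);

    /* Reference path: one histogram at a time. */
    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i)
            integrate_reference(ref[k][i]);

    /* Restructured path: all 8 x 16 histograms interleaved. */
    integrate_interleaved(opt);

    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i)
            for (int b = 0; b < BINS; ++b)
                if (ref[k][i][b] != opt[k][i][b])
                    ++mismatches;

    return mismatches;
}

/* Aborts via assert() if the two versions ever disagree. */
void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

Calling self_check() from an existing C testbench is one way to use this. The test data is limited to small integers so that the comparison is exact; with arbitrary floating point inputs the two versions still perform the same additions in the same order per histogram, but an exact comparison would then depend on the compiler's floating point settings.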
The set-up and learning curve for AutoPilot:

It took us less than 1 day to set up the AutoPilot environment for the first time, and only several minutes for the follow-on designs. In the early stages, our design methodology was an iterative loop between constraining AutoPilot synthesis and analyzing the results with its built-in Control Data Flow Graph (CDFG). Later, we started with the targeted micro architecture in mind and then created the C/C++ code plus corresponding synthesis directives. So it was important for our implementation that we were familiar with AutoPilot's directives.

Here's our ramp-up for AutoESL:

- 1 to 2 days for onsite training on AutoPilot: basics, methodologies, tool setup, hands-on tutorials.

- 1 to 2 weeks to begin with your own design and learn by doing. In our case, we did this with our RankBoost project.

- 3 to 4 additional weeks to try out AutoPilot's other advanced features like: simulation, integration with SOPC, customized IP, floating point, advanced language optimization, etc. This process may take some time, and I prefer a "learning by doing" style because some advanced features are only needed in special cases. We did this with our Sorting Algorithm project.

So, overall a hardware designer experienced in RTL simulation and synthesis should expect to spend 6 to 7 weeks getting ramped up on AutoESL. Much of this depends on how deeply they want to learn its advanced features:

- Controls. Our users control results in several ways, including adding synthesis directives to control pipelining, interfaces, and memory using Tcl commands, pragmas, or the GUI.

- GUI. AutoPilot has a GUI that helps users understand the generated logic. For example, it has a schedule viewer to visualize the scheduling result and a report view so you can easily compare QoRs for different implementations.

- Floating point synthesis.
We used the single-precision float type and floating point adders for RankBoost; AutoPilot fully supports the standard single- and double-precision floating point data types for Altera platforms. We could directly synthesize common floating-point math routines such as square root, exponentiation, logarithm, etc.

- Loop and hierarchical function pipelining. AutoESL's loop pipelining allows multiple successive iterations of a loop to execute in parallel by initiating one iteration before the previous one has completed. This can optimize the design for both loop throughput and latency.

- Power reduction. AutoPilot's optimization also includes various transformations for power reduction, including operation gating, MUX optimization and reduction, FSM coding, pipeline register gating, clock gating, and use of a given multi-Vdd assignment. We don't pay much attention to power consumption with our current FPGAs, so we didn't use this power functionality, but it's an important feature and we would like to try it out in the near future.

- Interface Synthesis. Our designers use standard function parameters to describe the desired inputs and outputs to the environment rather than hand-coding any target-specific interface timing behaviors into our C/C++ source. AutoPilot's interface synthesis converts the parameter reads and writes into the actual interface accesses. The direction of the data transfer is inferred from the way a parameter is used in the function body. For example, based on the specified communication interfaces in the platform library, a store operation on a scalar pointer (e.g., *p = x) can be turned into a direct wire connection, a FIFO write, or even a bus write transfer. This helps tremendously to shield our designers from the details of the target platform and lets them focus on developing the functional/algorithmic part of the design.
Currently, AutoPilot supports the following types of interfaces:

    - Wire interface
    - Buffer interface
    - Memory interface
    - FIFO interface
    - Bus interface

The user can control the selection of interface with a few pragmas.

AutoPilot's negatives:

- It needs a better user interface for its CDFG.

- It needs better 3rd party tool chain support. It took us a while to set up the whole tool chain, including our ModelSim RTL simulator and Altera Quartus II FPGA implementation tools.

- AutoESL claims that AutoPilot supports Altera's SOPC Builder tool and Avalon bus interconnects. However, we did not test these.

- It needs better Verilog support. AutoPilot includes some libraries written only in VHDL; for example, a few platform-specific bus interface adaptors are generated only in VHDL. It would be better if a Verilog version were generated as well.

AutoESL's technical support was professional, and they covered the product, integration into the design flow, and language. We gave AutoPilot a first look in 2007, and it has been delivering major features for FPGA-based design in its recent releases. This tool can produce very acceptable results in a very short time. I give AutoPilot a score of 4 out of 5 and would strongly recommend it to others.

AutoPilot's customer in China
Wow! Even Microsoft uses AutoESL's C synthesis - DeepChip Homepage