Warp Processors

Warp processors dynamically and transparently optimize an executing software binary by moving software kernels to on-chip configurable logic, improving performance by 7.4X and reducing energy consumption by 49% on average. Extensive research has shown that hardware/software partitioning, the process of dividing an application between software executing on a microprocessor and hardware coprocessors, can both speed up software and reduce system energy. By combining a custom CAD-oriented field programmable gate array (FPGA) with lean on-chip decompilation, partitioning, and just-in-time (JIT) FPGA compilation tools, warp processors provide the performance and energy benefits of hardware/software partitioning without any designer effort, expertise, or knowledge. Using warp processors, a designer can use any programming language and compiler to create a truly portable application binary. With future extensions and enhancements, a warp processor will determine how best to execute the portable binary: as software executing on a processor, entirely in hardware using configurable logic, or partitioned between software and hardware.
Low Power Warp Processing

Our original warp processor design was primarily performance-driven and did not focus on power consumption, which is becoming an increasingly important design constraint. A low-power warp processor instead combines the dynamic partitioning benefits of warp processors with the power savings of voltage and frequency scaling to create a high-performance embedded architecture capable of dynamically reducing power consumption without degrading performance. By focusing on reducing power consumption, our low-power design achieves an average power reduction of 74%.
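
The reasoning behind combining partitioning with voltage and frequency scaling can be illustrated with a rough back-of-the-envelope sketch. It assumes the standard first-order CMOS relation that dynamic power is proportional to C * V^2 * f, ignores static power, and assumes execution time scales inversely with clock frequency. The function name and the example values are hypothetical and are not taken from the warp processor design or its reported results.

```python
def dynamic_power_ratio(speedup, voltage_ratio):
    """Return scaled dynamic power relative to the unscaled design.

    Assumption: dynamic power is proportional to C * V^2 * f, so if
    partitioning gives a speedup S, the clock can be lowered to f / S
    while keeping the original software-only execution time.
    """
    if speedup <= 0 or voltage_ratio <= 0:
        raise ValueError("speedup and voltage_ratio must be positive")
    frequency_ratio = 1.0 / speedup  # slow the clock by the speedup factor
    return voltage_ratio ** 2 * frequency_ratio


# Hypothetical numbers: a 2X speedup, with the supply voltage lowered to 80%.
example_ratio = dynamic_power_ratio(speedup=2.0, voltage_ratio=0.8)
print(f"relative dynamic power: {example_ratio:.2f}")
```

Under these assumed values, the quadratic voltage term and the lower clock frequency multiply, which is why a speedup can be traded for a disproportionate power saving.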
Non-Intrusive Dynamic Application Profiling

Our non-intrusive dynamic application profiler (DAProf) profiles an executing application by monitoring the application's short backwards branches, function calls, and function returns. The resulting profile information accurately characterizes the frequently executed loops within the application, including a breakdown of loop executions versus loop iterations per execution. DAProf achieves excellent profiling accuracy, with an average accuracy of 97% for loop executions, 97% for average iterations per execution, and 95% for percentage of total application execution time. In addition, DAProf incurs only an 11% area overhead compared to an ARM9 microprocessor. DAProf is ideally suited for rapidly profiling software applications and for dynamic optimization approaches such as warp processing, in which detailed loop execution information is needed to provide accurate performance estimates.
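
The distinction between loop executions and iterations per execution can be sketched in software. The following is a simplified model, not DAProf's hardware design: it looks only at conditional branches (function calls and returns are omitted), treats a short backward branch as a loop's closing branch, counts every pass through that branch as one iteration, and counts each fall-through as the end of one execution. The trace format, the `MAX_LOOP_SPAN` cutoff, and all names are assumptions made for illustration.

```python
from collections import defaultdict

MAX_LOOP_SPAN = 256  # assumed cutoff (in bytes) for a "short" backward branch


def profile_loops(trace, max_span=MAX_LOOP_SPAN):
    """Summarize loops from a trace of (pc, target, taken) branch events."""
    stats = defaultdict(lambda: {"executions": 0, "iterations": 0})
    for pc, target, taken in trace:
        # Only short backward branches are treated as loop-closing branches.
        if not (target < pc and pc - target <= max_span):
            continue
        loop = stats[pc]
        loop["iterations"] += 1      # each pass through the closing branch
        if not taken:
            loop["executions"] += 1  # falling through ends one execution

    report = {}
    for pc, s in stats.items():
        avg = s["iterations"] / s["executions"] if s["executions"] else None
        report[pc] = {
            "executions": s["executions"],
            "iterations": s["iterations"],
            "avg_iterations": avg,
        }
    return report


def run_loop(iterations):
    """Hypothetical events for one execution of a loop closing at 0x120."""
    return [(0x120, 0x100, True)] * (iterations - 1) + [(0x120, 0x100, False)]


# One forward branch and one long backward branch are ignored by the filter.
example_trace = [(0x50, 0x80, True)] + run_loop(3) + [(0x5000, 0x100, True)] + run_loop(5)
print(profile_loops(example_trace))
```

In this example the loop closing at address 0x120 runs twice, for 3 and 5 iterations, so the sketch reports 2 executions, 8 iterations, and an average of 4 iterations per execution.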
Hardware/Software Partitioning of Floating-Point Applications

While hardware/software partitioning has been shown to provide significant performance gains, most hardware/software partitioning approaches are limited to partitioning computational kernels utilizing integers or fixed point implementations. Software developers often initially develop an application using built-in floating point representations and later convert the application to a fixed point representation – a potentially time consuming process. We are currently developing a hardware/software partitioning approach for floating point applications that eliminates the need for developers to rewrite software applications for fixed point implementations. Instead, the proposed approach incorporates efficient, configurable floating point to fixed point and fixed point to floating point hardware converters at the boundary between the hardware coprocessors and memory. This effectively separates the system into a floating point domain consisting of the microprocessor and memory subsystem and a fixed point computing domain consisting of the partitioned hardware coprocessors, thereby providing an efficient and rapid method for implementing fixed point hardware coprocessors. Our hardware/software partitioning approach for a floating point application provides application speedups of 4.3X on average without requiring any designer effort to re-implement software with a fixed point representation.
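
The role of the boundary converters can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the converter hardware itself: it assumes a signed two's-complement fixed point format with a configurable number of fractional bits, round-to-nearest conversion, and saturation on overflow. All function names and default widths are hypothetical.

```python
def float_to_fixed(x, frac_bits=16, total_bits=32):
    """Convert a float to a signed fixed point integer, saturating on overflow."""
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def fixed_to_float(q, frac_bits=16):
    """Convert a signed fixed point integer back to a float."""
    return q / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=16):
    """Multiply two fixed point values, rescaling the double-width product."""
    return (a * b) >> frac_bits


# Floating point values cross into the fixed point domain, are multiplied
# there, and the result is converted back on the way out.
a = float_to_fixed(1.5)
b = float_to_fixed(2.25)
product = fixed_to_float(fixed_mul(a, b))
print(product)
```

Because the conversions happen only at the boundary, code on the microprocessor side of this model continues to see ordinary floating point values, while the arithmetic in between uses integers only.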
Application Specific FPGAs
The inclusion of field programmable gate arrays (FPGAs) within a system-on-a-chip (SOC) design offers programmability, flexibility, and reconfigurability not possible with application specific integrated circuits (ASICs) or full-custom implementations. However, these benefits come at the expense of significant area, performance, and power consumption overheads compared to ASIC or full-custom circuits. Because a typical SOC design requires fabrication of the final integrated circuit, an FPGA integrated within an SOC design need not rely on a generic FPGA architecture and can instead be optimized for the specific intended application. We are currently investigating design space exploration methodologies for generating application specific FPGAs (ASFPGAs) by tailoring several FPGA architectural features to a specific hardware circuit, both to improve area, delay, or energy consumption compared to traditional FPGA designs and to reduce the overheads of using an FPGA compared to ASIC and full-custom implementations. Our preliminary efforts have demonstrated that an ASFPGA optimized for a particular design metric provides a 70% improvement in area, energy, or delay compared to a generic FPGA architecture, with a minimum and maximum improvement of 20% and 99% for specific hardware circuits.
Just-in-Time (JIT) FPGA Compilation
Just-in-time (JIT) FPGA compilation takes a netlist in a standard binary netlist format and executes technology mapping, placement, and routing. We have developed a JIT compiler consisting of lean versions of technology mapping, placement, and routing algorithms that require an order of magnitude less execution time and memory than their desktop-based counterparts while producing hardware circuits of acceptable quality.
