Customization of an embedded RISC CPU with SIMD extensions for video encoding: A case study

Title: Customization of an embedded RISC CPU with SIMD extensions for video encoding: A case study
Authors: V.A. Chouliaras, V.M. Dwyer, S. Agha, J.L. Nunez-Yanez, D. Reisis, K. Nakos, K. Manolopoulos
Pages: 18



Description

ARTICLE IN PRESS

INTEGRATION, the VLSI journal 41 (2008) 135–152 www.elsevier.com/locate/vlsi

Customization of an embedded RISC CPU with SIMD extensions for video encoding: A case study

V.A. Chouliaras (a,*), V.M. Dwyer (a), S. Agha (a), J.L. Nunez-Yanez (b), D. Reisis (c), K. Nakos (c), K. Manolopoulos (c)

(a) Department of Electronic and Electrical Engineering, Loughborough University, UK
(b) Department of Electronic Engineering, Bristol University, UK
(c) Department of Physics, University of Athens, Greece

Received 24 October 2005; received in revised form 13 February 2007; accepted 13 February 2007

Abstract

This work presents a detailed case study in customizing a configurable, extensible, 32-bit RISC processor with vector/SIMD instruction extensions for the efficient execution of block-based video-coding algorithms utilizing a proprietary co-design environment. In addition to the default Full-Search motion estimation of the MPEG-2 Test Model 5, fourteen fast ME algorithms were implemented in both scalar and vector form. Results demonstrate a reduction of up to 68% in the dynamic instruction count of the full-search-based encoder, whereas the fast motion estimation algorithms achieved a reduction in instruction count of nearly 90%, both accelerated via three 128-bit vector/SIMD instructions when compared to the scalar, reference implementation of the standard. We address in detail the profiling, vectorization and the development of these vector instruction set extensions, discuss in depth the implementation of a parametric vector accelerator that implements these instructions and show the introduction of that accelerator into a 32-bit RISC processor pipeline, in a closely-coupled configuration. © 2007 Elsevier B.V. All rights reserved.

Keywords: Configurable, extensible CPUs; SIMD; Coprocessors; System-on-Chip; Video coding

1. Introduction

Vector and single-instruction, multiple-data (SIMD) architectures are the most effective means for exploiting the abundant data-level parallelism present in current and emerging embedded workloads [1–3]. Ever-increasing transistor budgets have permitted the use of complete, short-vector units to satisfy the performance requirements both in desktop [4] and in embedded processors. In the embedded world in particular, orders-of-magnitude improvements in 3D geometry processing capability have been demonstrated by implementing highly targeted system-on-chip (SoC) vector processors [5], compared to the previous-generation hardwired ASIC [6]. This work details the development of a reduced instruction set computer (RISC) processor/custom SIMD coprocessor combination utilizing a proprietary software–hardware co-design environment, partially based on open-source simulators and silicon intellectual property (IP) but primarily on internally developed methodologies, tools and IP. We describe the instruction-set-architecture (ISA) specification and, subsequently, the hardware implementation of a parametric vector coprocessor for the efficient execution of the reference MPEG-2 video coding standard TM5 encoder [7], as well as performance-optimized variants of that workload which utilize a number of fast motion estimation algorithms.

[Corresponding author. Tel.: +44 0 1509 227 113; fax: +44 0 1509 227 014. E-mail address: [email protected] (V.A. Chouliaras).]

[0167-9260/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2007.02.003]

System-on-Chip platforms suitable for video coding workloads are typically based on combinations of processing cores and hardwired engines, with the former belonging primarily to two architectural categories: configurable and reconfigurable CPUs. Configurable, extensible processors [8,9] are a mature and potent technology. Their customization is primarily a manual process, typically


carried out during the architecture definition phase of the integrated system. An increasing body of research investigating automatic instruction set design [10–12] has proposed semi-automated ISA-definition tools, some of which have become commercial products [13]. Very high performance embedded systems (particularly systems that are streaming in nature) increasingly utilize one or more such processors, which allow for the fine-tuning of a standard ISA to the algorithmic requirements of the workload. Such processors constitute a proven solution to the performance requirements of most current and emerging streaming applications. However, this comes at the expense of design customization and the associated high cost of silicon masks; there are few instances of high-performance consumer-product application-specific integrated circuits (ASICs) incorporating such processors being reused in subsequent products. In the domain of custom 'reconfigurable' CPUs, the customization of the processor subsystem is done on-line (at run-time), with the configuration bitstream generated off-line and stored in the system memory. This functionality is achieved through the introduction of a field-programmable logic fabric of varying granularity [14] in close physical proximity to a controlling CPU. The fabric can be re-configured hundreds or even thousands of times per second, satisfying the temporal computation requirements of the executing algorithm. Commercial offerings in this area include Tensilica's Xtensa LX core [15], a high-performance very-long-instruction-word (VLIW)-style processor executing designer-defined variable-width FLIX instructions, each encapsulating multiple operations. The proprietary toolset allows the designer to augment the execution data paths, I/O ports and registered state (both programmer-visible and microarchitectural state) to achieve highly optimized custom solutions.
ARM offers a family of embedded streaming data engines known as the OptimoDE architecture [16], originally developed by Adelante Technologies. OptimoDE is a VLIW architecture with a fully user-definable data path specifically dedicated to the data-intensive part of an application. Stretch Incorporated follows a slightly different route and produces the S5000 processor chip [17], which incorporates the Xtensa processor core and a software-configurable data path (ISEF) based on Stretch's proprietary programmable logic. The ISEF allows system designers to extend the processor instruction set and define the new instructions in the C/C++ domain. Other vendors in the reconfigurable processor market include Elixent and Quicksilver: Elixent's D-fabrix processing array [18] uses an array of 4-bit arithmetic logic units (ALUs), registers and embedded RAMs connected via switchboxes to a routing network. A RISC processor controls the D-fabrix arrays, forwarding data and collecting results. Quicksilver calls its solution the Adaptive Computing Machine (ACM) [19]. Their fractal architecture involves the replication of four basic node types (arithmetic, bit manipulation, scalar, finite-state-machine) at different

levels to form more complex processing nodes. The PACT Corp. XPP64-A [20] is built from an 8 × 8 array of ALU-PAEs (processing array elements) with two rows of random-access memory (RAM) PAEs at the edges. The PAEs contain three different logic components: the ALU component performs 24 different arithmetic, logic and shift operations, while the other two component types perform barrel-shifting operations, extra arithmetic and data routing between each of the 64 PAEs. Though desktop short-vector architectures for accelerating media workloads are not a new technology, as already mentioned [4], the detailed account of combining a configurable, extensible RISC CPU and a targeted, parametric vector accelerator is of value to the embedded systems research community. In that respect, there are a number of novel aspects to our work: (a) we extended our base-case simulation environment [21] to allow the introduction of extra (machine) state, thereby enabling the modeling of our proposed architecture extensions. During this process, we created C- and script-based infrastructure and a systematic methodology to add extra instructions to the simulator and, most importantly, re-architected a major part of the core simulator engine to allow for multiple CPU contexts [22]; (b) we fully vectorized the reference MPEG-2 TM5 code, created both scalar and vector versions of the default Full-Search motion estimation (FSME) as well as a further fourteen fast ME algorithms, and measured the performance improvement, due to the vector extensions, in both the TM5 and the optimized versions of the encoder; (c) on the architecture side, we propose two additional vector instructions (for sub-pel accuracy in ME) targeted particularly at the MPEG-2 workload. Though short-vector sum-of-absolute-differences (SAD) operations provide excellent acceleration potential, we chose to precisely quantify this for a number of test video sequences and search window ranges.
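A short-vector SAD operation of the kind discussed in item (c) can be modelled behaviourally in plain C. The sketch below assumes a 128-bit vector register holding sixteen 8-bit pels; `vsad_acc` and `VLEN` are illustrative names, not the paper's actual ISA mnemonics.

```c
#include <stdint.h>
#include <stdlib.h>

/* Behavioural model of one vector SAD instruction: absolute-difference
 * the 8-bit lanes of two vector registers and add the lane sums into a
 * scalar accumulator. Lane count and names are illustrative only. */
enum { VLEN = 16 };  /* 16 x 8-bit pels per 128-bit vector register */

uint32_t vsad_acc(const uint8_t cur[VLEN], const uint8_t ref[VLEN],
                  uint32_t acc)
{
    for (int lane = 0; lane < VLEN; lane++)
        acc += (uint32_t)abs((int)cur[lane] - (int)ref[lane]);
    return acc;
}
```

Chaining such an instruction over the rows of a macroblock accumulates the full block error in a handful of issues, which is the source of the acceleration quantified in the paper.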
These two additional instructions provide added benefit particularly at lower search ranges. We believe the reason such instructions have not been used in the past is their source and destination operand requirements (three and five vector source operands, respectively, with one scalar accumulator destination), which put very high pressure on the register files of existing implementations; (d) we systematically added extra state to an open-source, Sparc V8-compliant microarchitecture available under the library general public license (LGPL) [23] to allow configurable vector coprocessors to be introduced into the scalar processing pipeline [24]. We extended that processor with a special hardware-based multi-processing barrier synchronization mechanism, to allow for efficient barrier synchronization in symmetric multiprocessing (SMP) configurations, and extended the programmer-visible state (Sparc V8 ISA) to include a processor context ID register; (e) we defined a high-bandwidth (one vector load/store operation per cycle) vector load/store unit (VLSU) which is compile-time configurable to behave as a scratchpad RAM (DMA-based, software-controlled


cache) or as a configurable level 1 data cache; (f) we designed and implemented a custom, high-speed coprocessor interface between the main multiprocessing-capable CPU core and our default, highly parametric vector pipeline, which we have utilized in all our work on vector accelerators [25,26]. A particularly important aspect of our work is the choice of configurability along both the microarchitecture and the architecture axes. Finally, this is our first work in which we report on an actual very large scale integration (VLSI) macro developed using our methodology, consisting of our heavily modified RISC CPU and our video-coding-targeted vector coprocessor. The next section provides additional background on the target workloads and our experimental techniques. Section 3 addresses the coprocessor programmer's model and instruction set extensions, and Section 4 identifies the benefits of the proposed instruction extensions, in terms of the dynamic instruction count metric, for a variety of video sequences, search methods and vector pipeline widths. An extended overview of the microarchitecture of the modified Leon2 processor and our configurable vector coprocessor is presented in Section 5. Finally, Section 6 presents the VLSI implementation of the complete video processing subsystem, consisting of the combined processor and vector coprocessor, and elaborates on the power consumption of the VLSI macrocell after a synthesis/power analysis campaign. This is followed by the conclusions drawn from this work and a discussion of future work.

2. Experimental procedure

The basic workload in this study is the MPEG-2 TM5 video coding standard.
In the TM5 reference implementation, full-search motion estimation (ME) is used in interframe (temporal) prediction and is the most computationally expensive operation performed, as it exhaustively scans all locations within a search window in the reference frame and finds the best match (minimum error) for any given macroblock in the predicted (current) frame. To reduce the ME computational load, a number of fast algorithms have been devised, including 2D logarithmic search [27], cross search [28], three-step search [29], four-step search [30], gradient-descent search [31] and diamond search [32]. A common characteristic of the above ME methods is the computation of an error term that identifies how well the predicted macroblock maps to a reference macroblock. These low-level computations are independent of one another and can be performed in parallel by independent processors, thus exploiting the thread-level parallelism (TLP) of the video coding standard. A second and equally important observation applies to the computations performed within each macroblock; these are characterized by data-level parallelism (DLP), manifested as the ability to process multiple picture elements (pels) independently, and can thus be efficiently exploited via vector/SIMD architectures. A software-based video coder implementation should target those data-parallel, low-level error computations in creating custom vector instructions, since the performance benefits realized by SIMD acceleration can be utilized by all the above search methods. We therefore target that DLP and quantify the performance benefit for the whole encoding process (dynamic instruction count reduction) through vectorizing these low-level computations. The TLP aspect of the workload is currently under investigation and will be presented in another paper.

The experimental methodology starts with the profiling of the unmodified MPEG-2 TM5 reference video encoder in native mode (Linux x86) as well as on the simulated processor toolset (Simplescalar). This step provides valuable statistics such as the per-function dynamic instruction count. The simulated processor architecture is based on the Simplescalar PISA [21], an experimental virtual machine for architecture research and algorithmic optimization. It can best be described as a 32-bit RISC architecture with 64-bit opcodes. All the vector extension instructions discussed in this work were allocated from the NOP opcode space, under annotation 1. The compiler was GCC 2.7.3 with optimizations (CFLAGS = -O3), and strict video bitstream equivalency across the scalar (original TM5 implementation) and vector versions of the workloads was maintained throughout this process. Figs. 1 and 2 depict the profiling results of the unmodified MPEG-2 TM5 and the MPEG-2 implementation incorporating the fourteen additional fast ME methods, respectively, when processing 12 frames of the Paris sequence. As shown in Fig. 1, the most CPU-intensive function is the inner-loop function of ME (DIST1), which computes the error of the current macroblock over all macroblocks in the search window of the reference frame. In particular, the DIST1 dynamic instruction count fraction ranges from 0.1 to 0.47 of the total executed instructions for a search window of 6–126 pels, respectively.
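The error term DIST1 computes is essentially a sum of absolute differences (SAD) over a 16 × 16 macroblock. A minimal scalar sketch of such a routine follows; the identifiers are illustrative, not the actual TM5 source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar sketch of a DIST1-style macroblock error term: the SAD between
 * a 16x16 current macroblock and a candidate block in the reference
 * frame. 'stride' is the frame width in pels. Names are illustrative. */
uint32_t block_sad(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t sad = 0;
    for (int y = 0; y < 16; y++)          /* 16 rows */
        for (int x = 0; x < 16; x++)      /* 16 pels per row */
            sad += (uint32_t)abs((int)cur[y * stride + x] -
                                 (int)ref[y * stride + x]);
    return sad;
}
```

Full-search ME evaluates this 256-pel kernel at every candidate position in the search window, which is why its fractional instruction count grows so quickly with search range.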
The situation is slightly different in the case of the fast ME algorithms, shown in Fig. 2, which are search-range independent. As a result, the contribution of the DIST1 function is approximately constant, in terms of the fractional dynamic instruction count, and ranges between 0.166 and 0.176 of the total instructions executed. The second major contributor to the instruction count is the forward discrete cosine transform computation (FDCT). As shown in Fig. 1, the fractional dynamic instruction count of this function ranges between 0.23 and 0.03 for full-search ME over the search window range, due to the corresponding increase in the DIST1 contribution. It should be made clear at this point that both figures depict relative values (the reference point is the total instruction count of the MPEG-2 TM5 reference implementation when encoding 12 frames of the same video sequence), as it is known from the theory of operation of the MPEG-2 standard that the FDCT transform is independent of the search range of the FSME. This explains the decreasing contribution of the FDCT function since, with increasing search range, the computational dynamic instruction count


[Figure: bar chart "MPEG-2 FSME Performance Profiling, Bowing Sequence, 12 Frames, 12 F/GoP, CIF", showing the fractional dynamic instruction count of the functions fdct, dist1, FullSearch, library and Remaining for search windows of 6, 24, 46, 80, 112 and 126 pels; the individual bar values are not recoverable from the extracted text.]

Fig. 1. MPEG2 profiling (FSME).

[Figure: bar chart "MPEG-2 Sub-sampling ME Performance Profiling, Bowing Sequence, 12 Frames, 12 F/GoP, CIF", showing the fractional dynamic instruction count (fractional DIC) of the functions fdct, dist1, putbits, idctcol, library and Remaining for each sub-sampling algorithm (TSS, NTSS, FSS, 2D_LOG, DS, HDS, CENB_DS, CROSS, SPIRAL, CONJ, GRADIENT, PHOD, ORTHO, LDS); the individual bar values are not recoverable from the extracted text.]

Fig. 2. MPEG2 profiling (fast ME).

(DIC) metric increases substantially and so does the contribution of the DIST1 function whereas the FDCT contribution remains constant. The distribution is different in fast ME with the FDCT fraction ranging between 0.178 and 0.298 of the total instruction count. It is important to emphasize that we studied only the floating-point-based FDCT of the MPEG2-TM5 and no attempt was made to derive

optimized, integer FDCT implementations as the principal aim of this work is to demonstrate the step-by-step development of a parametric vector ISA, parametric vector coprocessor microarchitecture and the encapsulation of the coprocessor into the RISC processor. In addition, the reference FDCT algorithm was vectorizable through a loop-transformation (loop interchange). For a detailed account of integer-based DCT algorithms, as would be


...
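The loop interchange that made the reference FDCT vectorizable can be illustrated generically: moving the unit-stride loop innermost lets a vector unit process consecutive pels. The sketch below uses an 8 × 8 pass with a stand-in scaling operation, not the TM5 FDCT arithmetic; names and the constant are illustrative.

```c
#include <stddef.h>

/* Loop-interchange sketch. A column-wise pass with the row loop
 * innermost strides through memory by N; interchanging the loops makes
 * the inner loop unit-stride, which SIMD units can exploit directly. */
enum { N = 8 };

/* Before interchange: inner loop strides by N (walks down a column). */
void col_pass_strided(double out[N][N], double in[N][N])
{
    for (size_t c = 0; c < N; c++)
        for (size_t r = 0; r < N; r++)
            out[r][c] = in[r][c] * 0.5;   /* stand-in for the DCT math */
}

/* After interchange: inner loop walks a row, unit stride, vectorizable.
 * The result is identical because the loop body is independent per element. */
void col_pass_interchanged(double out[N][N], double in[N][N])
{
    for (size_t r = 0; r < N; r++)
        for (size_t c = 0; c < N; c++)
            out[r][c] = in[r][c] * 0.5;
}
```

The transformation is legal here because every iteration writes a distinct element; auto-vectorizers perform exactly this interchange when they can prove the same independence.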

