<register list> := {Dn[x], Dn+1[x], Dn+2[x], Dn+3[x]} or
                   {Dn[x], Dn+2[x], Dn+4[x], Dn+6[x]} (not available when <dt> = 8)
Examples

VLD4.8 {D0, D1, D2, D3}, [R1]!
VLD4.16 {D2, D3, D4, D5}, #1, [R3]
VLD4.16 {D2[1], D4[1], D6[1], D8[1]}, [R3], R4
VST4.32 {D20, D22, D24, D26}, [R7]
VST4.8 {D20[5], D21[5], D22[5], D23[5]}, [R1], R4
In one embodiment, the reordering logic 24 of Figure 1 takes the form illustrated in Figure 23. The logic of Figure 23 includes two multiplexers 350, 355 at its inputs, which in the event of a load instruction are arranged to receive data from a load FIFO 23 associated with the LSU 22 illustrated in Figure 1, or in the event of a store instruction are arranged to receive data from the SIMD register store 20. Further, in some situations, a load instruction may also cause the logic of Figure 23 to receive data from the SIMD register store 20. The multiplexers 350, 355 are controlled to choose between the different inputs, and to route the chosen inputs to the associated input registers 360, 365. In one embodiment, each input register is able to store 64 bits of data. The data stored in the input registers is then read through the crossbar multiplexer 375 into the register cache 380, the crossbar control register 370 providing drive signals to the crossbar multiplexer to direct individual bytes of data received from the input registers to desired byte locations within the register cache. The values in the control register 370 are derived by the instruction decoder.

As shown in Figure 23, the register cache 380 can be considered as consisting of four registers, and in one embodiment each register is 64 bits in length.
After data has been stored in the register cache 380, it can then be read via the output multiplexers 385 to either the store data FIFO 23' associated with the LSU 22 (in the event of a store instruction), or the SIMD register file 20 (in the event of a load instruction).

Whilst the byte crossbar multiplexer 375 can read the input registers at byte granularity and write into the register cache at byte granularity, the write multiplexers 385 read from the register cache at 64-bit granularity.
The reordering logic 24 is largely autonomous from the rest of the SIMD processing logic 18, but is given instructions in program order in the same fashion as other functional units within the integrated circuit. In one embodiment, it has two register file read ports and two write ports which it controls itself. In order that hazards are detected and avoided, the reordering logic 24 may be arranged to communicate with some interlock logic (not shown) using scoreboards.
Store instructions from the SIMD register file 20 are performed out-of-order with respect to other SIMD instructions, but remain in-order with respect to other store instructions from the SIMD register file. Pending stores are kept in a queue, and when the store data is ready it is read and passed into the store FIFO 23' associated with the LSU 22 via the reordering logic 24.
In one embodiment, all data passing between memory and the SIMD register file 20 is routed via the reordering logic 24. However, in an alternative embodiment, a bypass path around the reordering logic 24 may be provided for situations where it is determined that no reordering is required.
The register cache 380 is referred to as a "cache" since under certain conditions it caches register values before they are written to the SIMD register file 20. The register cache holds data in the format in which the data is to be output from the reordering logic 24.
Figures 24A to 24C illustrate the operation of the reordering logic 24 to implement the necessary reordering required when performing an instruction of the type VLD3.16 {D0, D1, D2}, [r1].
Once the data has been loaded by the LSU 22, then in a first cycle (as shown in Figure 24A) 64 bits of the retrieved data are loaded via the multiplexer 350 into the input register 360, whilst the next 64 bits are loaded via the multiplexer 355 into the input register 365. In the example illustrated in Figures 24A to 24C, it is assumed that the structure format represents a 3D vector having components x, y, z. In the next cycle, as shown in Figure 24B, the 16-bit data elements within the input registers are read into the register cache 380 via the byte crossbar multiplexer 375, which reorders the data so that any data elements relating to x components are placed in a first register, any data elements relating to y components are placed in a second register, and any data elements relating to z components are placed in a third register of the register cache. Also during this cycle, the next 64 bits of data from the load FIFO 23 are loaded via the multiplexer 350 into the input register 360.
In the next cycle, as shown in Figure 24C, the data elements from the input register 360 are routed through the byte crossbar multiplexer into the register cache, with the x, y and z components being de-interleaved as discussed earlier. As shown in Figure 24C, this results in the register cache containing four x components in a first register, four y components in a second register, and four z components in a third register. The contents of the register cache can then be output via the write multiplexers 385, two registers at a time, to the registers specified by the load instruction.
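The de-interleaving behaviour of this VLD3.16 example can be modelled in software. The following Python sketch is an illustration only, not the hardware implementation; the function name and list-based register model are assumptions introduced here:

```python
def vld3_deinterleave(memory):
    """Model of VLD3: split an interleaved x,y,z,... stream into three
    component lists, one per destination register."""
    x = memory[0::3]   # x components go to the first register
    y = memory[1::3]   # y components go to the second register
    z = memory[2::3]   # z components go to the third register
    return x, y, z

# Twelve 16-bit elements as stored in memory, structure by structure.
mem = ["x0", "y0", "z0", "x1", "y1", "z1",
       "x2", "y2", "z2", "x3", "y3", "z3"]
d0, d1, d2 = vld3_deinterleave(mem)
```

After the call, each destination register holds four like components, matching the register-cache contents shown in Figure 24C.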
Figures 25A to 25D illustrate a second example of the flow of data through the reordering logic in order to perform the necessary reordering required when executing the instruction VLD3.16 {D0[1], D1[1], D2[1]}, [r1]. In accordance with this instruction, data is going to be loaded into a particular lane of the registers D0, D1 and D2, namely the second 16-bit wide lane of four 16-bit wide lanes within those registers. Before a data element can be stored in a particular lane of a register, the current contents of the register need to be retrieved, so that when the register is subsequently written to, the contents of the register are written as a whole. This feature avoids the need to provide for any writing to only a portion of a register in the SIMD register file 20. Accordingly, during a first cycle, as shown in Figure 25A, the current contents of the registers D0 and D1 are read from the SIMD register file via the multiplexers 350, 355 into the input registers 360, 365. In the next cycle, as shown in Figure 25B, these contents are read into the register cache 380 through the crossbar multiplexer 375, with the contents of D0 being placed in a first register and the contents of D1 being placed in a second register of the register cache. During the same cycle, the contents of the register D2 are retrieved from the SIMD register file via the multiplexer 350 and stored in the input register 360.
In the next cycle, as shown in Figure 25C, the contents of the register D2 are read into the register cache 380 via the crossbar multiplexer 375, such that they are stored in a third register of the register cache. During the same cycle, the data structure the subject of the load, which typically will already have been retrieved by the LSU, is read from the load FIFO 23 via the multiplexer 350 into the input register 360. In the example illustrated in Figure 25C, it is again considered that the structure in memory represents 3D vector data with components x, y and z. In the next cycle, as shown in Figure 25D, the x, y and z components are read into the second lane of data elements via the crossbar multiplexer 375, so that the data element x0 overwrites within the register cache the previous contents of the second lane of register D0, the component y0 overwrites within the register cache the data element previously in the second lane of the register D1, and the component z0 overwrites within the register cache the data element previously stored in the second lane of the register D2.

It will be appreciated that at this point the actual contents of the registers D0, D1 and D2 in the SIMD register file have not yet changed. However, the data stored in the register cache can now be output via the write multiplexers 385 back to the registers D0, D1, D2 to overwrite the previous contents. As a result, it can be seen that a single load instruction can be used to load the components of a particular structure from memory, and to then insert the individual components of that structure into different registers at a chosen lane location.
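The lane-insert behaviour described above can be sketched as follows. This is a minimal Python model for illustration, assuming a list-per-register representation; the read-before-write step mirrors the requirement that each register be written back as a whole:

```python
def vld3_lane_insert(regs, structure, lane):
    """Model of a lane-insert VLD3: read the current register contents,
    overwrite only the chosen lane with the x,y,z components, and return
    whole registers ready to be written back."""
    new_regs = [list(r) for r in regs]          # read current contents first
    for reg, component in zip(new_regs, structure):
        reg[lane] = component                   # overwrite the chosen lane only
    return new_regs

regs = [["a0", "a1", "a2", "a3"],               # D0
        ["b0", "b1", "b2", "b3"],               # D1
        ["c0", "c1", "c2", "c3"]]               # D2
out = vld3_lane_insert(regs, ("x0", "y0", "z0"), 1)
```

The source registers are untouched until the whole-register write-back, matching the observation that D0 to D2 do not change until the write multiplexers drain the register cache.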
Figures 25E to 25H illustrate a third example of a flow of the data through the reordering logic in order to perform the necessary reordering required when executing the complementary store instruction to the load instruction that was discussed earlier with reference to Figures 25A to 25D. Accordingly, Figures 25E to 25H illustrate the steps required to perform the necessary reordering when executing the instruction VST3.16 {D0[1], D1[1], D2[1]}, [r1]. Hence, in accordance with this instruction, data is going to be stored from the second 16-bit wide lane of the registers D0, D1 and D2 back to memory. As shown in Figure 25E, during a first cycle, the current contents of the registers D0 and D1 are read from the SIMD register file via the multiplexers 350, 355 into the input registers 360, 365. In the next cycle, as shown in Figure 25F, the data elements in the second lane, i.e. the values x0 and y0, are read into a first register of the register cache 380 through the crossbar multiplexer 375. During the same cycle, the contents of the register D2 are retrieved from the SIMD register file via the multiplexer 350 and stored in the input register 360.
In the next cycle, as shown in Figure 25G, the data element in the second lane of register D2 is read into the first register of the register cache 380 via the crossbar multiplexer 375. Then, in the next cycle, as shown in Figure 25H, the x, y and z components can now be output by the write multiplexers 385 to the LSU for storing back to memory. It will be appreciated that at this stage the data elements have now been reordered into the structure format required for storage in memory.
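The complementary store reordering is simply the gather of one lane from each register back into structure order. A minimal sketch, again using an assumed list-per-register model:

```python
def vst3_lane_store(regs, lane):
    """Model of a lane VST3: collect the chosen lane of each register
    into structure order x,y,z for writing back to memory."""
    return [reg[lane] for reg in regs]

regs = [["a0", "x0", "a2", "a3"],   # D0
        ["b0", "y0", "b2", "b3"],   # D1
        ["c0", "z0", "c2", "c3"]]   # D2
structure = vst3_lane_store(regs, 1)
```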
Figures 26A to 26E illustrate the reordering that takes place within the reordering logic during execution of the following sequence of four instructions:

VLD3.16 {D0, D1, D2}, #1, [r1]
VLD3.16 {D0[1], D1[1], D2[1]}, [r2]
VLD3.16 {D0[2], D1[2], D2[2]}, [r3]
VLD3.16 {D0[3], D1[3], D2[3]}, [r4]
Once the data identified by the first load instruction has been retrieved by the LSU, it is read via the multiplexer 350 into the input register 360 during a first cycle (see Figure 26A). In the next cycle, it is read into the register cache 380 via the crossbar multiplexer 375, such that the x, y and z components are placed in different registers of the register cache. The "#1" within the first instruction signifies that each data element should be placed in the least significant data lane of each register, and that the remaining lanes should be filled with logic 0 values, this being shown in Figure 26B. Also during this cycle, the data elements identified by the second load instruction are retrieved into the input register 360. During the next cycle (see Figure 26C), the data elements stored in the input register 360 are moved into the register cache 380 via the crossbar multiplexer 375, where they are stored in the second lane. Also during this cycle, the data elements of the third load instruction are placed within the input register 360.
In the next cycle, the contents of the input register 360 are routed via the crossbar multiplexer 375 into the third lane of the register cache, whilst the data elements the subject of the fourth load instruction are retrieved into the input register 360. This is shown in Figure 26D.

Finally, as shown in Figure 26E, in the next cycle these data elements are routed via the crossbar multiplexer 375 into the register cache 380, where they are stored in the fourth lane. Thereafter, the 64-bit wide chunks of data in each register of the register cache can be output to the specified registers of the SIMD register file.
It should be noted that, in contrast to the approach taken in Figures 25A to 25D, the use of the first VLD instruction illustrated with reference to Figures 26A to 26E, whereby once the data elements have been placed in a particular lane the remaining lanes are filled with 0 values, avoids the need to retrieve from the SIMD register file the current contents of any of the registers D0 to D2 before any updates are made. From a review of Figures 26A to 26E, it can be seen that the register cache 380 in this instance acts as a "write-through cache", since it caches the data elements for a sequence of load instructions, and when each instruction is completed, writes the data to the relevant registers of the SIMD register file. However, the register file does not typically need to be read from whilst each subsequent instruction in the sequence is being performed.
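The effect of the four-instruction sequence can be modelled as a zero-filling first load followed by three lane inserts. This Python sketch is illustrative only; the helper names are assumptions:

```python
def vld3_zero_fill(struct, lanes=4):
    """Model of the '#1' form: each component goes to the least
    significant lane and all remaining lanes are filled with 0."""
    return [[c] + [0] * (lanes - 1) for c in struct]

def vld3_lane(regs, struct, lane):
    """Model of the lane-insert form: overwrite only the chosen lane."""
    for reg, c in zip(regs, struct):
        reg[lane] = c
    return regs

regs = vld3_zero_fill(("x0", "y0", "z0"))        # first instruction
for lane, s in enumerate([("x1", "y1", "z1"),    # remaining three
                          ("x2", "y2", "z2"),
                          ("x3", "y3", "z3")], start=1):
    regs = vld3_lane(regs, s, lane)
```

Because the first instruction zeroes the unused lanes, no read of D0 to D2 is ever needed, which is exactly the property the text attributes to the "#1" form.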
It is often required in data processing to reduce a so-called vector of elements to a single element by applying a commutative and associative operator 'op' between all the elements. This will be described as a folding operation. Typical examples of folding operations are to sum the elements of a vector, or to find the maximum value of the elements in a vector.
In a parallel processing architecture, one known approach used to perform such a folding operation is described with reference to Figure 27. The data elements [0] to [3] to be folded are contained in a register r1. It will be appreciated that a benefit of parallel processing architectures is that they can enable the same operation to be performed concurrently on multiple data elements. This concept can be more clearly understood with reference to so-called parallel processing lanes. In this example, each parallel processing lane contains one of the data elements [0] to [3].

Firstly, at step A, a first instruction is issued which causes rotation of the data elements by two places to form rotated data elements in register r2. This places different data elements in each processing lane so that a Single Instruction Multiple Data (SIMD) operation can be applied at step B.

Thereafter, at step B, a second instruction is issued which causes a SIMD operation to be performed on the data elements in each lane. In this example, the resultant data elements of these multiple parallel operations are stored in register r3. Accordingly, it can be seen that entries in r3 now contain the results of the combination of half of the data elements of the register r1 (i.e. r3 contains: [0] op [2]; [1] op [3]; [2] op [0]; and [3] op [1]).
Next, a third instruction is issued which causes the results stored in the register r3 to be rotated by one parallel processing lane at step C and stored in the register r4. Once again, the rotation of the data elements stored in r3 with respect to those of r4 enables different data elements to occupy the same parallel processing lane.

Finally, at step D, a fourth instruction is issued which causes a further single instruction multiple data operation to be performed on the data elements stored in each lane, and the results are stored in register r5.

Accordingly, it can be seen that by using just four instructions all the data elements across the register can be combined and the results stored in each entry in the register r5 (i.e. each entry in r5 contains: [0] op [1] op [2] op [3]). The resultant data element can be read as required from any of the four entries in the register r5.
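The four-step rotate-and-combine scheme of Figure 27 can be checked with a short Python model. This is a sketch of the prior-art approach, not of any particular instruction set; the function names are assumptions:

```python
def rotate(v, n):
    # Rotate so that lane i receives element (i + n) mod len(v).
    return v[n:] + v[:n]

def fold_by_rotation(v, op):
    """Prior-art fold: rotate by two, combine, rotate by one, combine."""
    r2 = rotate(v, 2)                            # step A: rotate by two lanes
    r3 = [op(a, b) for a, b in zip(v, r2)]       # step B: SIMD op per lane
    r4 = rotate(r3, 1)                           # step C: rotate by one lane
    return [op(a, b) for a, b in zip(r3, r4)]    # step D: final SIMD op
```

For any commutative and associative `op`, every entry of the result equals [0] op [1] op [2] op [3], as the text states.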
Figure 28 illustrates the principle of a folding instruction of one embodiment. Unlike the conventional arrangement of parallel processing lanes (which is described with reference to Figure 27), in which each parallel processing lane has a fixed width throughout the lane which is equal to the width of one data element, in this embodiment the arrangement of the parallel processing lanes differs. In this new arrangement, the width of each parallel processing lane at its input is equal to the width of at least two source data elements, and at its output is generally equal to the width of one resultant data element. It has been found that arranging the parallel processing lanes in this way provides significant advantages over prior art arrangements, since groups of data elements (for example pairs of data elements) within a single register can be the subject of parallel processing operations. As will be clear from the discussion below, this obviates the need to perform the data manipulation operations of the prior art arrangements (i.e. the rotation operations), since there is no need to arrange data elements in the correct entry locations in further registers in order to enable multiple operations to occur in parallel.
Accordingly, source data elements d[0] to d[3] are provided in respective entries in a register. The adjacent source data elements d[0] and d[1] can be considered as a pair of source data elements. The source data elements d[2] and d[3] can also be considered as a pair of source data elements. Hence, in this example, there are two pairs of source data elements.
At step (A) an operation is performed on each pair of source data elements within the register in order to generate a resultant data element, the same operation occurring on each adjacent pair of source data elements.

Hence, it will be appreciated that the pair of source data elements and the corresponding resultant data element all occupy the same lane of parallel processing. It can be seen that after step (A) the number of resultant data elements is half the number of source data elements. The data elements d[2] op d[3] and d[0] op d[1] can also be considered as a pair of source data elements.
At step (B) a further identical operation is performed on a pair of source data elements in order to generate a resultant data element d[0] op d[1] op d[2] op d[3]. It can be seen that after step (B) the number of resultant data elements is also half the number of source data elements. As mentioned previously, the operations are commutative and associative operations, and so the same resultant data elements are generated irrespective of the exact order of combination of the source data elements.
Hence, it can be seen that the number of source data elements can be halved as a result of each operation, and that the same operation can be performed on those source data elements in order to produce the required result. Accordingly, it can be seen that the required resultant data element can be generated in just two operations, whereas the prior art arrangement of Figure 27 needed to perform at least four operations. It will be appreciated that this improvement in efficiency is achieved through performing parallel processing operations on groups of data elements within a source register. Although just two pairs of source data elements have been illustrated for reasons of clarity, it will be appreciated that any number of pairs of source data elements could have been the subject of the operation. Also, whilst operations on pairs of source data elements have been illustrated for reasons of clarity, it will be appreciated that any number of source data elements (e.g. three, four or more) could have been the subject of the operation.
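The adjacent-pair folding of Figure 28 reduces a vector of n elements in log2(n) steps. A minimal Python sketch of one folding step (illustrative only; the function name is an assumption):

```python
def fold_pairs(v, op):
    """One folding step: combine each adjacent pair of elements,
    halving the number of data elements."""
    return [op(v[i], v[i + 1]) for i in range(0, len(v), 2)]

v = [1, 2, 3, 4]
step_a = fold_pairs(v, lambda a, b: a + b)        # step (A)
step_b = fold_pairs(step_a, lambda a, b: a + b)   # step (B)
```

Two applications reduce four elements to the single result d[0] op d[1] op d[2] op d[3], versus the four instructions needed by the rotation scheme of Figure 27.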
In practice, for efficiency reasons, the folding instruction is arranged to perform parallel operations on a minimum number of data elements, determined by the smallest supported register size in the register data file 20. Figure 29 illustrates an implementation which generates the same number of resultant data elements as the number of source data elements.
Source data elements d[0] to d[3] are provided in a register Dn. In order to generate the same number of resultant data elements, the source data elements d[0] to d[3] are also provided in a register Dm. It will be appreciated that the registers Dn and Dm are likely to be the same register, with the SIMD processing logic 18 reading each source data element from the register Dn twice in order to generate duplicated resultant data elements.
At step (A), a single SIMD instruction is issued, each pair of source data elements has an operation performed thereon, and a corresponding resultant data element is generated.

At step (B), another single SIMD instruction is issued to cause each pair of source data elements to have an operation performed thereon in order to generate a corresponding resultant data element.

Accordingly, it can be seen that all the source data elements have been combined to produce resultant data elements.
Figures 30a to 30d illustrate the operation of various folding instructions which follow the same syntax described elsewhere. It will be appreciated that where two source registers are indicated, these may be the same register. Also, it will be appreciated that each source register could be specified as the destination register in order to reduce the amount of register space utilised.
Figure 30a illustrates the operation of a SIMD folding instruction whereby pairs of source data elements from the same register, represented by 'n' bits, have an operation performed thereon in order to generate resultant data elements represented by 2n bits. Promoting the resultant data elements to have 2n bits reduces the probability that an overflow will occur. When promoting the resultant data elements, they are typically sign-extended or padded with 0s. The following example summing folding instructions support such an operation:
Mnemonic   Data Type   Operand Format   Description
VSUM       .S16.S8     Dd, Dm           (add adjacent pairs of elements
           .S32.S16    Qd, Qm           and promote)
           .S64.S32
           .U16.U8
           .U32.U16
           .U64.U32
In the particular example shown in Figure 30a (VSUM.S32.S16 Dd, Dm), a 64-bit register Dm containing four 16-bit data elements is folded and the results stored in a 64-bit register Dd containing two 32-bit resultant data elements.
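The promoting sum fold can be sketched in Python. The range check here illustrates why promotion to 2n bits prevents overflow; the function name and bit-width parameter are assumptions for illustration:

```python
def vsum_promote(elems, src_bits=16):
    """Pairwise sum with promotion: n-bit sources produce 2n-bit results,
    so the sum of two n-bit values can never overflow."""
    lo, hi = -(1 << (src_bits - 1)), (1 << (src_bits - 1)) - 1
    assert all(lo <= e <= hi for e in elems)   # inputs fit in n bits
    # Each result needs at most n+1 bits, well within the 2n-bit lane.
    return [elems[i] + elems[i + 1] for i in range(0, len(elems), 2)]

# Four 16-bit elements of Dm fold to two 32-bit elements of Dd.
dd = vsum_promote([32767, 32767, -32768, -1])
```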
Figure 30b illustrates the operation of a SIMD folding instruction whereby pairs of source data elements from different registers, represented by 'n' bits, have an operation performed thereon in order to generate resultant data elements also represented by 'n' bits. The following example summing, maximum and minimum instructions support such an operation:

Mnemonic   Data Type   Operand Format   Description
VSUM       .I8         Dd, Dn, Dm       (add adjacent pairs of elements)
           .I16
           .I32
           .F32
Mnemonic   Data Type   Operand Format   Description
VFMX       .S8         Dd, Dn, Dm       (take maximum of adjacent pairs)
           .S16
           .S32
           .U8
           .U16
           .U32
           .F32
Mnemonic   Data Type   Operand Format   Description
VFMN       .S8         Dd, Dn, Dm       (take minimum of adjacent pairs)
           .S16
           .S32
           .U8
           .U16
           .U32
           .F32
In the particular example shown in Figure 30b (VSUM.I16 Dd, Dn, Dm), two 64-bit registers Dm, Dn, each containing four 16-bit data elements, are folded and the results stored in a 64-bit register Dd containing four 16-bit resultant data elements.
Figure 30c illustrates the operation of a SIMD folding instruction whereby pairs of source data elements from the same register, represented by 'n' bits, have an operation performed thereon in order to generate resultant data elements also represented by 'n' bits. In the particular example shown in Figure 30c, a 128-bit register Qm containing eight 16-bit data elements is folded and the results stored in a 64-bit register Dd containing four 16-bit resultant data elements.
Figure 30d illustrates the operation of a SIMD folding instruction similar to Figure 30b, but where Dm = Dn, which causes the resultant data values to be duplicated in the destination register. Pairs of source data elements from the same register, represented by 'n' bits, have an operation performed thereon in order to generate resultant data elements also represented by 'n' bits, each of which is duplicated in another entry in the register. In the particular example shown in Figure 30d, a 64-bit register Dm containing four 16-bit data elements is folded and the results stored in a 64-bit register Dd containing two sets of two 16-bit resultant data elements.
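The two-register fold of Figure 30b, and the duplicated result of Figure 30d when Dm = Dn, can be modelled together. A Python sketch (the function name and the low-half/high-half ordering are assumptions for illustration):

```python
def fold_two_registers(dn, dm, op):
    """Model of Figure 30b: adjacent pairs from Dm fill the low half of Dd
    and adjacent pairs from Dn fill the high half."""
    pairs = lambda v: [op(v[i], v[i + 1]) for i in range(0, len(v), 2)]
    return pairs(dm) + pairs(dn)

dm = [1, 2, 3, 4]
dn = [5, 6, 7, 8]
dd_30b = fold_two_registers(dn, dm, lambda a, b: a + b)   # distinct registers
dd_30d = fold_two_registers(dm, dm, lambda a, b: a + b)   # Dm == Dn
```

With Dm = Dn the two halves of Dd are identical, which is the duplication Figure 30d describes.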
Figure 31 illustrates schematically example SIMD folding logic which can support folding instructions and which is provided as part of the SIMD processing logic 18. For the sake of clarity, the logic shown is used to support instructions which select the maximum of each adjacent pair. However, it will be appreciated that the logic can be readily adapted to provide support for other operations, as will be described in more detail below.
The logic receives source data elements (Dm[0] to Dm[3]) from the register Dm, optionally together with source data elements (Dn[0] to Dn[3]) from the register Dn. Alternatively, the logic receives source data elements (Qm[0] to Qm[7]) from the register Qm. Each pair of adjacent source data elements is provided to an associated folding operation logic unit 400. Each folding operation logic unit 400 has an arithmetic unit 410 which subtracts one source data element from the other and provides an indication of which was the greater over the path 415 to a multiplexer 420. Based upon the indication provided over the path 415, the multiplexer outputs the greater-value source data element from the operation logic unit 400. Hence, it can be seen that each folding operation logic unit 400 is arranged to output the maximum of the associated adjacent pair of data elements over the respective paths 425, 435, 445, 455.
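The subtract-then-select behaviour of one folding operation logic unit can be sketched in a few lines. This is a software model for illustration, not the circuit itself:

```python
def folding_max_unit(a, b):
    """Model of one folding operation logic unit 400 configured for maximum:
    the arithmetic unit 410 subtracts one input from the other, and the
    sign of the difference (path 415) drives multiplexer 420."""
    a_is_greater = (a - b) > 0     # indication on path 415
    return a if a_is_greater else b   # multiplexer 420 selects the larger input
```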
Selection and distribution logic 450 receives the resultant data elements and provides these as required over paths 431 to 434 for storage in entries of a register Dd in the SIMD register data file 20 in support of the above-mentioned instructions. The operation of the selection and distribution logic 450 will now be described.
In order to support the instruction illustrated in Figure 30a, source data elements Dm[0] to Dm[3] are provided to the lower two folding operation logic units 400. The folding operation logic units 400 output data elements over the paths 425 and 435. The paths 431 and 432 will provide Dm[0] op Dm[1] in a sign-extended or zero-extended format, whilst paths 433 and 434 will provide Dm[2] op Dm[3] in a sign-extended or zero-extended format. This is achieved by signals being generated by the SIMD decoder 16 in response to the folding instruction which cause the multiplexers 470 to select their B input, the multiplexers 460 to select either sign-extension or zero-extension, the multiplexers 490 to select their E input and the multiplexer 480 to select its D input.
In order to support the instruction illustrated in Figure 30b, source data elements Dm[0] to Dm[3] are provided to the lower two folding operation logic units 400, whilst source data elements Dn[0] to Dn[3] are provided to the upper two folding operation logic units 400. The folding operation logic units 400 output data elements over the paths 425, 435, 445 and 455. Path 431 will provide Dm[0] op Dm[1], path 432 will provide Dm[2] op Dm[3], path 433 will provide Dn[0] op Dn[1], and path 434 will provide Dn[2] op Dn[3]. This is achieved by signals being generated by the SIMD decoder 16 in response to the folding instruction which cause the multiplexers 470 to select their A input, the multiplexer 480 to select its C input and the multiplexers 490 to select their E input.
In order to support the instruction illustrated in Figure 30c, source data elements Qm[0] to Qm[7] are provided to the folding operation logic units 400. The folding operation logic units 400 output data elements over the paths 425, 435, 445 and 455. Path 431 will provide Qm[0] op Qm[1], path 432 will provide Qm[2] op Qm[3], path 433 will provide Qm[4] op Qm[5], and path 434 will provide Qm[6] op Qm[7]. This is achieved by signals being generated by the SIMD decoder 16 in response to the folding instruction which cause the multiplexers 470 to select their A input, the multiplexer 480 to select its C input and the multiplexers 490 to select their E input.
In order to support the instruction illustrated in Figure 30d, source data elements Dm[0] to Dm[3] are provided to the lower two folding operation logic units 400. The folding operation logic units 400 output data elements over the paths 425 and 435. Path 431 will provide Dm[0] op Dm[1], path 432 will provide Dm[2] op Dm[3], path 433 will provide Dm[0] op Dm[1], and path 434 will provide Dm[2] op Dm[3]. This is achieved by signals being generated by the SIMD decoder 16 in response to the folding instruction which cause the multiplexers 470 to select their A input, the multiplexer 480 to select its D input and the multiplexers 490 to select their F input. Alternatively, it will be appreciated that the source data elements could instead also have been provided to the upper two folding operation logic units 400, and the same operation as that illustrated with reference to Figure 30b could have been performed, which would reduce the complexity of the selection and distribution logic 450.
Accordingly, it can be seen that this logic enables a resultant data element to be generated from two adjacent source data elements in a single operation directly from the source data elements.
As mentioned above, the folding operation logic unit 400 may be arranged to perform any number of operations on the source data elements. For example, further logic could readily be provided to selectively enable the multiplexer 420 to supply the minimum of the source data elements over the path 425. Alternatively, the arithmetic unit 410 could be arranged to selectively add, subtract, compare or multiply the source data elements and to output the resultant data element. Hence, it will be appreciated that the approach of the present embodiment advantageously provides a great deal of flexibility in the range of folding operations that can be performed using this arrangement.

Also, it will be appreciated that whilst the logic described with reference to Figure 31 supports 16-bit operations, similar logic could be provided in order to support 32-bit or 8-bit operations, or indeed any other sizes.
Figure 32 illustrates the operation of a vector-by-scalar SIMD instruction. The SIMD instructions follow the same syntax described elsewhere. It will be appreciated that, as before, where two source registers are indicated, these may be the same register. Also, each source register could be specified as the destination register in order to reduce the amount of register space utilised and to enable efficient recirculation of data elements.
A register Dm stores a number of data elements Dm[0] to Dm[3]. Each of these data elements represents a selectable scalar operand. The vector-by-scalar SIMD instruction specifies one of the data elements as the scalar operand and performs an operation using that scalar operand in parallel on all the data elements in another register Dn, the results of which are stored in a corresponding entry in the register Dd. It will be appreciated that the data elements stored in the registers Dm, Dn and Dd could all be of differing sizes. In particular, the resultant data elements may be promoted with respect to the source data elements. Promoting may involve zero padding or sign extending to convert from one data type to another. This may have the additional advantage of guaranteeing that an overflow cannot occur.
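The vector-by-scalar semantics Vd[i] = Vn[i] * Vm[x] (and the accumulating form) can be sketched directly. This Python model is illustrative only; the function names are assumptions:

```python
def vmul_scalar(vn, vm, x):
    """Model of multiply by scalar: Vd[i] = Vn[i] * Vm[x].
    One element of Vm acts as the scalar applied across every lane of Vn."""
    scalar = vm[x]
    return [e * scalar for e in vn]

def vmla_scalar(vd, vn, vm, x):
    """Model of multiply accumulate by scalar: Vd[i] = Vd[i] + (Vn[i] * Vm[x])."""
    scalar = vm[x]
    return [d + e * scalar for d, e in zip(vd, vn)]

dd = vmul_scalar([1, 2, 3, 4], [10, 20, 30, 40], 2)   # scalar is Vm[2] = 30
```

Because the scalar is just a lane index into Vm, several scalar operands can sit in one register and be selected per instruction, which is the matrix-processing advantage described below.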
Being able to select one scalar operand for a SIMD operation is particularly efficient in situations involving matrices of data elements. Different scalar operands can be written to the SIMD register file 20 and then readily selected for different vector-by-scalar operations without the need to re-write data elements or move data elements around. The following example multiplication instructions support such an operation:
Multiply by Scalar

Mnemonic  Data Type  Operand Format  Description
VMUL      .I16       Dd, Dn, Dm[x]   (Vd[i] = Vn[i] * Vm[x])
          .I32       Qd, Qn, Dm[x]
          .F32
          .S32.S16   Qd, Dn, Dm[x]
          .S64.S32
          .U32.U16
          .U64.U32
Multiply Accumulate by Scalar

Mnemonic  Data Type  Operand Format  Description
VMLA      .I16       Dd, Dn, Dm[x]   (Vd[i] = Vd[i] + (Vn[i] * Vm[x]))
          .I32       Qd, Qn, Dm[x]
          .F32
          .S32.S16   Qd, Dn, Dm[x]
          .S64.S32
          .U32.U16
          .U64.U32
Multiply Subtract by Scalar

Mnemonic  Data Type  Operand Format  Description
VMLS      .I16       Dd, Dn, Dm[x]   (Vd[i] = Vd[i] - (Vn[i] * Vm[x]))
          .I32       Qd, Qn, Dm[x]
          .F32
          .S32.S16   Qd, Dn, Dm[x]
          .S64.S32
          .U32.U16
          .U64.U32
Vd, Vn and Vm describe vectors of elements constructed from the chosen
register format and chosen data type. Elements within this vector are selected using
the array notation [x]. For example, Vd[0] selects the lowest element in the vector
Vd.

An iterator i is used to allow a vector definition; the semantics hold for all
values of i where i is less than the number of elements within the vector. The
instruction definitions provide 'Data Type' and 'Operand Format' columns; a valid
instruction is constructed by taking one from each column.
Figure 33 illustrates an arrangement of scalar operands H0 to H31 in the
SIMD register file 20. As mentioned elsewhere, the preferred number of bits used in
a field of the instruction to specify the location of a data element in the SIMD register
file 20 is 5 bits. This enables 32 possible locations to be specified. It will be
appreciated that one possible way to map the scalar operands onto the SIMD register
file 20 would have been to place each operand in the first entry in each of the
registers D0 to D31. However, the SIMD register file 20 is instead arranged to map or
alias the selectable scalar operands to the first 32 logical entries in the SIMD register
file 20. Mapping the scalar operands in this way provides significant advantages.
Firstly, locating the scalar operands in contiguous entries minimises the number of
D registers used to store the scalar operands, which in turn maximises the number of D
registers available to store other data elements. Storing the scalar operands
in contiguous entries also enables all scalar operands within a vector to be accessed, which
is particularly beneficial when performing matrix or filter operations. For example, a
matrix-by-vector multiplication requires a vector-by-scalar operation to be performed
for each scalar chosen from the vector. Furthermore, storing the selectable scalar
operands in this way enables, from at least some of the registers, all the scalar
operands to be selected from those registers.
Figure 34 illustrates schematically logic arranged to perform a vector-by-scalar
operation of an embodiment.
The source data elements (Dm[0] to Dm[3]) are provided from the register Dm.
Each source data element is provided to scalar selection logic 510 which comprises a
number of multiplexers 500. Each source data element is provided to one input of
each multiplexer 500 (i.e. each multiplexer receives source data elements Dm[0] to
Dm[3]). Hence, it can be seen that each multiplexer can output any of the source data
elements Dm[0] to Dm[3]. In this embodiment, each multiplexer is arranged to output
the same source data element. Hence, the scalar selection logic 510 can be arranged
to select and output one scalar operand. This is achieved by signals being generated
by the SIMD decoder 16 in response to the vector-by-scalar instruction which cause
the multiplexers to output one of the source data elements Dm[0] to Dm[3] as the
selected scalar operand.
Vector-by-scalar operation logic 520 receives the selected scalar operand and
also receives source data elements Dn[0] to Dn[3] provided from the register Dn. Each
source data element is provided to the vector-by-scalar operation logic 520 which
comprises a number of operation units 530. Each source data element is provided to
one of the operation units 530 (i.e. each operation unit receives one of the source data
elements Dn[0] to Dn[3] and the selected scalar operand). The vector-by-scalar
operation logic 520 performs an operation on the two data elements and outputs a
resultant data element for storage in respective entries of a register in the SIMD
register data file 20 in support of the above-mentioned instructions. This is achieved
by signals generated by the SIMD decoder 16 in response to the vector-by-scalar
instruction which cause the operation units 530 to perform the required
operation on the received data elements.
Accordingly, it can be seen that this logic enables one data element of a
source register to be selected as a scalar operand, and the vector-by-scalar
operations to be performed using the same scalar operand on all source data elements
from another register.
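The lane-wise semantics of the vector-by-scalar operation described above can be sketched as a short Python model of the definition Vd[i] := Vn[i] * Vm[x] used in the instruction tables (the function name is illustrative only, not part of the instruction set):

```python
def vmul_by_scalar(dn, dm, x):
    """Model of VMUL Dd, Dn, Dm[x]: multiply every lane of Dn by the
    single scalar lane Dm[x]. A sketch of the lane semantics, not of
    the hardware of Figure 34."""
    scalar = dm[x]                      # scalar selection logic: one lane of Dm
    return [v * scalar for v in dn]     # same scalar applied to all lanes

# Four 16-bit lanes per 64-bit D register
dn = [1, 2, 3, 4]
dm = [10, 20, 30, 40]
print(vmul_by_scalar(dn, dm, 2))  # lane Dm[2] == 30, giving [30, 60, 90, 120]
```

As in the text above, changing only the index x reuses the same vector Dn with a different scalar, which is what makes matrix-by-vector work efficient.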
Figure 35 shows a known way of dealing with a shift and narrow operation
during SIMD processing. As can be seen, three separate instructions (SHR, SHR and
PACK LO) are required to perform this operation. Intermediate values are shown
with dotted lines for clarity in Figure 35 and in Figures 36 and 38.
Figure 36 shows a shift right and narrow operation according to the present
technique. The architecture of the present embodiment is particularly well adapted to
process shift and narrow operations and can do so in response to a single instruction.
The instruction is decoded by an instruction decoder within SIMD decoder 16 (see
Figure 1). In this example the data in register Qn, located in SIMD register file 20
(see Figure 1), is shifted right by 5 bits, the remaining data is rounded, and then
the 16 right-hand-side bits are transferred across to the destination register Dd, also
located in SIMD register file 20. The hardware is able to optionally support rounding
and/or saturation of the data depending on the instruction. Generally, shifting-right
instructions do not require saturation, as when dealing with integers shifting right
generally produces a smaller number. However, when shifting right and narrowing,
saturation may be appropriate.
Saturation is a process that can be used to restrict a data element to a certain
range by choosing the closest allowable value. For example, if two unsigned 8-bit
integers are multiplied using 8-bit registers, the result may overflow. In this case the
most accurate result that could be given is binary 11111111, and thus the number will
be saturated to give this value. A similar problem may arise when shifting and
narrowing, whereby a number that is narrowed cannot fit into the narrower space. In
this case, for an unsigned number, when any of the bits that are discarded in
the shift step are not zero the number is saturated to the maximum allowable
value. In the case of a signed number the problem is more complicated. Here
the number must be saturated to the maximum allowable positive number or the
maximum allowable negative number when the most significant bit is different from
any of the discarded bits.
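The clamping behaviour described above can be sketched with a small Python helper (an illustrative model of the sat() step used in the tables below, not the hardware logic):

```python
def saturate(value, signed, bits):
    """Clamp value to the representable range of a 'bits'-wide signed or
    unsigned integer, choosing the closest allowable value as described
    above."""
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    return max(lo, min(hi, value))

print(saturate(300, signed=False, bits=8))   # overflows U8, clamps to 255
print(saturate(-7, signed=False, bits=8))    # negative to unsigned, clamps to 0
print(saturate(-200, signed=True, bits=8))   # below the S8 minimum, clamps to -128
```

The second call illustrates the signed-to-unsigned case discussed next: a slightly negative luminance value saturates to zero rather than wrapping to a large unsigned number.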
Saturation can also occur where the type of the input data element is different to
that output; e.g. a signed value may be shifted and narrowed, saturated, and an
unsigned value output. The ability to output different data types can be very useful.
For example, in pixel processing luminance is an unsigned value; however, during
processing it may be appropriate to process it as a signed value. Following
processing an unsigned value should be output, but simply switching from a
signed to an unsigned value could cause problems unless the ability to saturate the
value is provided. For example, if during processing, due to slight inaccuracies, the
luminance value has dropped to a negative number, simply outputting this negative
signed value as an unsigned value would be a nonsense. Thus, the ability to saturate
any negative number to zero prior to outputting the unsigned value is a very useful
tool.
Examples of possible formats for different shift instructions are given below in
Tables 6 and 7. As can be seen, an instruction specifies that it is a vector instruction by
having a V at the front; a shift is then specified with SH and, in the case of shifting
by immediates, the direction right or left is then indicated by an R or L. The
instruction then comprises two type fields, as in Table 0, the first being the size of the data
elements in the destination register and the second being the size of the elements in the
source register. The next information comprises the name of the destination register
and of the source register, and then an immediate value may be given; this value
indicates the number of bits that the data is to be shifted and is preceded by a #.
Modifiers to the general format of the instruction may be used: a Q is used to indicate
that the operation uses saturating integer arithmetic and an R is used to indicate that the
operation performs rounding. More details of the format of the instructions are given
earlier in the description, for example in Table 0.
Table 7 shows instructions for shifting by signed variables. These instructions are the same
as the shift left by immediates, but instead of providing an immediate with the
instruction, a register address indicating where a vector of signed variables is stored is
provided with the instruction. In this case a negative number indicates a right-hand shift.
As the numbers of bits to be shifted are stored in a vector, a different signed variable can
be stored for each data element so that they can each be shifted by different amounts.
This process is shown in more detail in Figure 39.
TABLE 6

Shift by Immediate

Immediate shifts use an immediate value encoded within the instruction to shift all
elements of the source vector by the same amount. Narrowing versions allow casting
down of values, which can include saturation, while Long versions allow casting up
with any fixed point.

Shift with accumulate versions are provided to support the efficient scaling and
accumulation found in many DSP algorithms. Right shift instructions also provide
rounding options. Rounding is performed by in effect adding a half to the number to
be rounded. Thus, when shifting right by n places, 2^(n-1) is added to the value prior to
shifting it. Thus, in the following tables round(n) = 2^(n-1) if n > 0, or 0 if n <= 0.

Bitwise extract instructions are included to allow efficient packing of data.
Mnemonic  Data Type  Operand Format  Description

VSHR      .S8        Dd, Dn, #UIMM   Shift Right by Immediate
          .S16       Qd, Qn, #UIMM   Vd[i] := Vn[i] >> UIMM
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64
          .S8.S16    Dd, Qn, #UIMM   Shift Right by Immediate and Narrow
          .S16.S32                   Vd[i] := Vn[i] >> UIMM
          .S32.S64
          .U8.U16
          .U16.U32
          .U32.U64

VRSHR     .S8        Dd, Dn, #UIMM   Shift Right by Immediate with Rounding
          .S16       Qd, Qn, #UIMM   Vd[i] := (Vn[i] + round(UIMM)) >> UIMM
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64
          .S8.S16    Dd, Qn, #UIMM   Shift Right by Immediate and Narrow with Rounding
          .S16.S32                   Vd[i] := (Vn[i] + round(UIMM)) >> UIMM
          .S32.S64
          .U8.U16
          .U16.U32
          .U32.U64

VQSHR     .S8.S16    Dd, Qn, #UIMM   Saturating Shift Right by Immediate and Narrow
          .S16.S32                   Vd[i] := sat(Vn[i] >> UIMM)
          .S32.S64
          .U8.U16
          .U16.U32
          .U32.U64
          .U8.S16
          .U16.S32
          .U32.S64

VQRSHR    .S8.S16    Dd, Qn, #UIMM   Saturating Shift Right by Immediate and Narrow with Rounding
          .S16.S32                   Vd[i] := sat((Vn[i] + round(UIMM)) >> UIMM)
          .S32.S64
          .U8.U16
          .U16.U32
          .U32.U64
          .U8.S16
          .U16.S32
          .U32.S64

VSRA      .S8        Dd, Dn, #UIMM   Shift Right by Immediate and Accumulate
          .S16       Qd, Qn, #UIMM   Vd[i] := Vd[i] + (Vn[i] >> UIMM)
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64

VQSRA     .S8        Dd, Dn, #UIMM   Saturating Shift Right by Immediate and Accumulate
          .S16       Qd, Qn, #UIMM   Vd[i] := sat(Vd[i] + (Vn[i] >> UIMM))
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64

VRSRA     .S8        Dd, Dn, #UIMM   Shift Right by Immediate and Accumulate with Rounding
          .S16       Qd, Qn, #UIMM   Vd[i] := Vd[i] + ((Vn[i] + round(UIMM)) >> UIMM)
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64

VQRSRA    .S8        Dd, Dn, #UIMM   Saturating Shift Right by Immediate and Accumulate with Rounding
          .S16       Qd, Qn, #UIMM   Vd[i] := sat(Vd[i] + ((Vn[i] + round(UIMM)) >> UIMM))
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64

VSHL      .I8        Dd, Dn, #UIMM   Shift Left by Immediate
          .I16       Qd, Qn, #UIMM   Vd[i] := Vn[i] << UIMM
          .I32
          .I64
          .S16.S8    Qd, Dn, #UIMM   Shift Left Long by Immediate
          .S32.S16                   Vd[i] := Vn[i] << UIMM
          .S64.S32
          .U16.U8
          .U32.U16
          .U64.U32

VQSHL     .S8        Dd, Dn, #UIMM   Saturating Shift Left by Immediate
          .S16       Qd, Qn, #UIMM   Vd[i] := sat(Vn[i] << UIMM)
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64
          .U8.S8
          .U16.S16
          .U32.S32
          .U64.S64
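The rounding and saturating narrow variants above combine the three steps already discussed: add round(n), shift right, then saturate into the narrower type. A minimal Python sketch of a VQRSHR-style lane operation (unsigned 16-bit to unsigned 8-bit, with illustrative helper names):

```python
def vqrshr(vn, shift, dst_bits=8, signed=False):
    """Sketch of VQRSHR (e.g. .U8.U16 Dd, Qn, #UIMM): per lane, add the
    rounding constant round(n) = 2^(n-1), shift right by n, then
    saturate into the narrower destination type."""
    def sat(v):
        lo = -(1 << (dst_bits - 1)) if signed else 0
        hi = (1 << (dst_bits - 1)) - 1 if signed else (1 << dst_bits) - 1
        return max(lo, min(hi, v))
    rnd = 1 << (shift - 1) if shift > 0 else 0   # round(n) from Table 6
    return [sat((v + rnd) >> shift) for v in vn]

print(vqrshr([0x1234, 0x00FF, 0x0010, 0x0008], 5))  # gives [146, 8, 1, 0]
```

Note how the last lane rounds up before the shift: (8 + 16) >> 5 is 0, whereas a plain 8 >> 5 would also be 0, but (0x10 + 16) >> 5 becomes 1 where truncation alone would give 0.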
TABLE 7

Shift by Signed Variable

Shifts in this section perform shifts on one vector of elements controlled by the signed
shift amounts specified in a second vector. Supporting signed shift amounts allows
support for shifting by exponent values, which may reasonably be negative; a
negative control value will perform a shift right. Vector shifts allow each element to
be shifted by a different amount, but can be used to shift all lanes by the same amount
by duplicating the shift control operand to all lanes of a vector before performing the
shift. The signed shift control value is an element the same size as the smallest
operand element size of the operand to be shifted. However, the shifter variable is
interpreted using only the bottom 8 bits of each lane to determine the shift amount.
Rounding and saturation options are also available.
Mnemonic  Data Type  Operand Format  Description

VSHL      .S8        Dd, Dn, Dm      Shift Left by Signed Variable
          .S16       Qd, Qn, Qm      Vd[i] := Vn[i] << Vm[i]
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64

VQSHL     .S8        Dd, Dn, Dm      Saturating Shift Left by Signed Variable
          .S16       Qd, Qn, Qm      Vd[i] := sat(Vn[i] << Vm[i])
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64

VRSHL     .S8        Dd, Dn, Dm      Rounding Shift Left by Signed Variable
          .S16       Qd, Qn, Qm      Vd[i] := (Vn[i] + round(-Vm[i])) << Vm[i]
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64

VQRSHL    .S8        Dd, Dn, Dm      Saturating Rounding Shift Left by Signed Variable
          .S16       Qd, Qn, Qm      Vd[i] := sat((Vn[i] + round(-Vm[i])) << Vm[i])
          .S32
          .S64
          .U8
          .U16
          .U32
          .U64
Thus, as can be seen, the hardware supports instructions that are able to specify
both the size of the source data element and the resultant data element and also, in some cases,
the number of places that the data is to be shifted. This makes it an extremely
adaptable and powerful tool.
The shift right and narrow operation shown in Figure 36 has a number of
possible applications. For example, in calculations involving fixed-point numbers
where a certain accuracy is required, it may be appropriate to place a, say, 16-bit
number somewhere towards the centre of a 32-bit data value to reduce the risk of data
overflow or underflow while calculations are performed. At the end of the calculations a
16-bit number may be required, and thus a shift and narrow operation as shown in
Figure 36 would be appropriate. The possibility envisaged by the present technique
of using different sized source and destination registers is particularly effective here
and allows different sized data to remain in a particular lane during SIMD processing.
A further use of the shift and narrow operation similar to that illustrated in
Figure 36 could be in the processing of colour pixel data. SIMD processing is
particularly appropriate for video data, as video data comprises many pixels that all
require the same operation to be performed upon them. Thus, different pixel data can
be in different lanes in a register and a single instruction can perform the same
operations on all of the data. Often, video data may come as red, green and blue data.
This needs to be separated out before meaningful operations can be performed upon
it. Figure 37 shows a typical example of red, green and blue data being present in a
16-bit data element. In the example shown the blue data could be extracted by a shift
left by 3 bits and narrow operation. The shift left by 3 places sends the blue data to
the right of the middle of the data element, as is shown schematically by the dotted-line
register (representing an intermediate value); three zeros fill in the three empty
positions at the right of the data value caused by the shift left of the data. The narrow
operation results in the blue data and the three zeros being transferred to the resultant
8-bit data element.
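A Python sketch of this blue-channel extraction, assuming (as the Figure 37 discussion suggests) that blue occupies the low 5 bits of each 16-bit pixel; the function name and pixel values are illustrative only:

```python
def extract_blue(pixels):
    """Model of a shift-left-by-3-and-narrow on 16-bit colour pixels:
    the left shift places the 5 blue bits in the low byte with three
    zero bits appended, and narrowing keeps only that byte, discarding
    the red and green bits shifted above bit 7."""
    return [(p << 3) & 0xFF for p in pixels]   # narrow to 8 bits

# blue component 0b10110 (22) in the bottom five bits of each pixel,
# with arbitrary red/green bits above it
pixels = [0xF800 | 22, 0x07C0 | 22]
print(extract_blue(pixels))  # 22 << 3 == 176 in both lanes
```

The mask plays the role of the narrow step: only bits that land in the destination byte survive, so the red and green fields cannot contaminate the result.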
In addition to shifting and narrowing, the present technique can also be used to
cast up and shift; this process is shown in Figure 38. In this case, the casting up is
performed followed by a shift left. This operation can be used to, for example, transfer
a 32-bit value to a 64-bit value, the 32-bit value being placed in an appropriate
position within the 64-bit value. In the example shown, two 32-bit values are
transferred to 64-bit values by being placed at the most significant bits in the lane, with
zeros being added as least significant bits.
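The Figure 38 example can be sketched in a few lines of Python; the helper name is illustrative, and the placement (value in the most significant half, zeros below) follows the description above:

```python
def cast_up_and_shift(lanes, src_bits=32):
    """Sketch of cast-up-then-shift-left: each 32-bit lane is widened to
    64 bits and shifted left by 32, so the value occupies the most
    significant half of its widened lane with zeros as the least
    significant bits."""
    mask = (1 << (2 * src_bits)) - 1          # one 64-bit lane
    return [(v << src_bits) & mask for v in lanes]

print([hex(v) for v in cast_up_and_shift([0x12345678, 0x1])])
# gives ['0x1234567800000000', '0x100000000']
```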
Figure 39 shows the possibility of using a vector of values indicating the
number of places each data element should be shifted, the values being signed
integers, with negative numbers indicating a shift right. A register holding a value for each
data element is used and each data element is shifted by the amount specified by the
value located in its lane. The instructions for such operations are set out previously in
Table 7.
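The per-lane behaviour of Figure 39 can be sketched as follows (a model of the Table 7 definition Vd[i] := Vn[i] << Vm[i], where a negative Vm[i] shifts right; the function name is illustrative):

```python
def vshl_signed(vn, vm):
    """Sketch of VSHL by signed variable: each lane of Vn is shifted by
    the signed amount in the matching lane of Vm; negative amounts
    shift right."""
    return [v << s if s >= 0 else v >> -s for v, s in zip(vn, vm)]

print(vshl_signed([8, 8, 8, 8], [1, 0, -1, -2]))  # gives [16, 8, 4, 2]
```

Each lane carries its own shift amount, so one instruction scales every element by a different power of two, which is what makes shifting by per-element exponents possible.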
Figure 40 schematically shows a simple multiplexing operation. In this
multiplexing operation, multiplexer 700 selects either value a or value b to be output
at D depending on the value of the control bit c. c is used to select the output between
a and b, and is often based upon the result of a decision such as: is a > b? Embodiments
of the architecture provide the ability to perform multiplexing operations during
SIMD processing. SIMD processing is not suitable for performing branch operations
and thus multiplexing cannot be performed using standard if-then-else instructions;
rather, a mask is created, the mask being used to indicate which parts of two source
registers a and b are to be selected.
This mask consists of control values that are used to indicate which parts of the
two source registers a and b are to be selected. In some embodiments a one in a
certain position may indicate that a certain section of b is to be selected, while a zero
in that position would indicate that a corresponding section of a is to be selected. This
mask is stored in a general-purpose register, thereby reducing the need for special-purpose
registers.
Generation of the mask is dependent on the multiplexing operation to be
performed and is created in response to this operation. For example, in the case given
above, a comparison of a and b is performed. This can be done on a portion-by-portion
basis; for example, corresponding data elements in the SIMD processing are
compared. Corresponding data elements of b and a are compared and a value is
written to the portion of the general-purpose register that is being used to store the
control values, depending on whether b is greater than a, or b is equal to or less than a.
This can be done using a compare-greater-than instruction VCGT on all of the data
elements in parallel. This instruction is provided in the instruction set of embodiments
of the system. Table 8 below shows some of the wide range of comparison
instructions that are provided by embodiments of the architecture.
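A short Python sketch of this mask generation, modelling the Table 8 definition of VCGT (all-ones where the comparison holds, all-zeros otherwise; the function name is illustrative):

```python
def vcgt_s8(a, b):
    """Sketch of VCGT.S8 c, a, b: lane-wise signed compare producing an
    all-ones byte (0xFF) where a > b and an all-zeros byte otherwise;
    the result is the mask used for the later bitwise selection."""
    return [0xFF if x > y else 0x00 for x, y in zip(a, b)]

a = [5, -3, 7, 0]
b = [2, -1, 7, -4]
print(vcgt_s8(a, b))  # gives [255, 0, 0, 255]
```

Producing all-ones rather than a single 1 bit is deliberate: it lets the subsequent selection operate bitwise, independent of data type and width.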
TABLE 8

Comparison and Selection

Comparisons and tests of variables to generate masks can be performed, which can be
used to provide data plane selection and masking. The architecture also provides instructions to
select the maximum and minimum, including folding versions which can be used at
the end of vectorised code to find the maximum or minimum within a vector.
Mnemonic  Data Type  Operand Format  Description

VCEQ      .I8        Dd, Dn, Dm      Compare Equal
          .I16       Qd, Qn, Qm      Vd[i] := (Vn[i] == Vm[i]) ? ones : zeros
          .I32
          .F32

VCGE      .S8        Dd, Dn, Dm      Compare Greater-than or Equal
          .S16       Qd, Qn, Qm      Vd[i] := (Vn[i] >= Vm[i]) ? ones : zeros
          .S32
          .U8
          .U16
          .U32
          .F32

VCGT      .S8        Dd, Dn, Dm      Compare Greater-than
          .S16       Qd, Qn, Qm      Vd[i] := (Vn[i] > Vm[i]) ? ones : zeros
          .S32
          .U8
          .U16
          .U32
          .F32

VCAGE     .F32       Dd, Dn, Dm      Compare Absolute Greater-than or Equal
                     Qd, Qn, Qm      Vd[i] := (|Vn[i]| >= |Vm[i]|) ? ones : zeros

VCAGT     .F32       Dd, Dn, Dm      Compare Absolute Greater-than
                     Qd, Qn, Qm      Vd[i] := (|Vn[i]| > |Vm[i]|) ? ones : zeros

VCEQZ     .I8        Dd, Dm          Compare Equal to Zero
          .I16       Qd, Qm          Vd[i] := (Vm[i] == 0) ? ones : zeros
          .I32
          .F32

VCGEZ     .S8        Dd, Dm          Compare Greater-than or Equal to Zero
          .S16       Qd, Qm          Vd[i] := (Vm[i] >= 0) ? ones : zeros
          .S32
          .F32

VCGTZ     .S8        Dd, Dm          Compare Greater-than Zero
          .S16       Qd, Qm          Vd[i] := (Vm[i] > 0) ? ones : zeros
          .S32
          .F32

VCLEZ     .F32       Dd, Dm          Compare Less-than or Equal to Zero
                     Qd, Qm          Vd[i] := (Vm[i] <= 0) ? ones : zeros
                                     Note: Integer a <= 0 == !(a > 0)

VCLTZ     .F32       Dd, Dm          Compare Less-than Zero
                     Qd, Qm          Vd[i] := (Vm[i] < 0) ? ones : zeros
                                     Note: Integer a < 0 == !(a >= 0)

VTST      .I8        Dd, Dn, Dm      Test Bits
          .I16       Qd, Qn, Qm      Vd[i] := ((Vn[i] & Vm[i]) != 0) ? ones : zeros
          .I32

VMAX      .S8        Dd, Dn, Dm      Maximum
          .S16       Qd, Qn, Qm      Vd[i] := (Vn[i] >= Vm[i]) ? Vn[i] : Vm[i]
          .S32
          .U8
          .U16
          .U32
          .F32

VMIN      .S8        Dd, Dn, Dm      Minimum
          .S16       Qd, Qn, Qm      Vd[i] := (Vn[i] >= Vm[i]) ? Vm[i] : Vn[i]
          .S32
          .U8
          .U16
          .U32
          .F32
Once the mask has been created, a single instruction can be used to select
either a or b using the general-purpose register containing this mask, the control
register C. Thus, the data processor is controlled by C to perform the multiplexing
operation of selecting either a or b.
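The mask-driven selection itself reduces to bitwise logic, which can be sketched per byte in Python (an illustrative model of the selection step, not a specific instruction):

```python
def select_with_mask(mask, b, a):
    """Byte-wise model of mask-driven multiplexing: where a mask byte
    is all ones the result byte comes from b, where it is all zeros it
    comes from a."""
    return [((m & y) | (~m & x)) & 0xFF for m, x, y in zip(mask, a, b)]

mask = [0xFF, 0x00, 0xFF, 0x00]
b = [10, 20, 30, 40]
a = [1, 2, 3, 4]
print(select_with_mask(mask, b, a))  # gives [10, 2, 30, 4]
```

Because the AND/OR operations act on every bit independently, the same logic works whatever the lane width, which is the bitwise-selection property discussed next for Figure 41.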
Figure 41 schematically shows an embodiment of the system wherein the
selection of source values a or b is done on a bitwise basis. In this case the control
register 730 has been filled with data by comparing data elements in registers a 710
and b 720. Thus, data element a0, which is, say, eight bits wide, is compared with data
element b0 having the same size. In this case a is less than or equal to b and thus
eight zeros are inserted into the corresponding portion of the control register 730. If a
is greater than b, eight ones are inserted into the corresponding portion of the control
register 730. A similar comparison is performed in parallel for all the data elements
and corresponding control bits produced. The comparison operation that generates the
control vector corresponds to the instruction VCGT.S8 c, a, b. Selection can then be
performed very simply on a bit-by-bit basis by performing simple logical operations
between the bits stored in the source registers and the corresponding bits stored in the
control register, each resultant bit being written to a destination register, which in this
example is register 730, i.e. the results overwrite the control values. The advantage of
this bitwise selection is that it is independent of data type and width, and if appropriate
different sized data elements can be compared.
Figure 42 shows an alternative embodiment where the control is not done on a
bitwise basis but on a data element basis. In the embodiment shown, if a data
element in the control register C 730 is greater than or equal to zero then a
corresponding data element in source register b 720 is written to the destination
register (in this case register 720). If, as in this example, C is a signed integer, then
only the most significant bit of C needs to be considered when deciding which of a or
b to select.
In other embodiments other properties of C can be used to determine whether
a data element from register a 710 is to be selected, or one from data register b 720.
Examples of such properties include whether C is odd or even, where again only one
bit of the control value needs to be considered, in this case the least significant bit, or
whether C is equal to zero, not equal to zero, or greater than zero.
Generally, ARM instructions, and in fact many other RISC instructions, only
provide three operands with any instruction. Multiplexing operations in general
require four operands to specify two source registers a and b, a control register C and
a destination register D. Embodiments of the present system take advantage of the
fact that, generally, following a multiplexing operation at least one of the two sets of
source data or the control data is no longer required. Thus, the destination register is
chosen to be either one of the two source registers or the control register. This only
works because the control register is a general-purpose register and not a special register.
In embodiments of the system, three different instructions are provided in the
instruction set: an instruction specific to writing back to one source register, another
instruction for writing back to the other source register, and a third instruction for
writing to the control register. Each instruction requires just three operands,
specifying two source registers and a control register. These three instructions are
specified in Table 9 below.
TABLE 9

Logical and Bitwise Selection

Mnemonic  Data Type  Operand Format  Description

VBIT      none       Dd, Dn, Dm      Bitwise Insert if True
                     Qd, Qn, Qm      Vd := (Vm) ? Vn : Vd

VBIF      none       Dd, Dn, Dm      Bitwise Insert if False
                     Qd, Qn, Qm      Vd := (Vm) ? Vd : Vn

VBSL      none       Dd, Dn, Dm      Bitwise Select
                     Qd, Qn, Qm      Vd := (Vd) ? Vn : Vm
Figure 43 schematically shows three examples of multiplexer arrangements
corresponding to the three multiplexing instructions provided by the system. Figure
43a shows multiplexer 701 wired to perform the instruction bitwise select VBSL. In
this example, contrary to the example illustrated in Figures 41 and 42, A is selected
when C is false (0), and B is selected when C is true (1). In the embodiment
illustrated, the destination register is the same as the control register, so that the
resultant values overwrite the control values. If the reverse selection was required,
i.e. A is selected when C is true and B when C is false, the same circuit could be used
by simply swapping the operands A and B.
Figure 43b shows a multiplexer 702 corresponding to the instruction VBIT,
bitwise insert if true, which results in source register A acting as both source and
destination register and being overwritten with the result data. In this example B is
written into A when C is true, while if C is false the value present in register A
remains unchanged. In this embodiment, if the reverse selection is required, i.e. it is
desired to write B to the destination register if C is false rather than true, it is not
possible to simply switch the registers around, as the device does not have the
symmetry of multiplexer 701.
Figure 43c shows a multiplexer 703 that is set up to correspond to the reverse
selection of Figure 43b, i.e. the instruction VBIF, bitwise insert if false. In this
embodiment the value in register A is written into register B when C is false, while
when C is true the value in register B remains unchanged. As in Figure 43b, there is no
symmetry in this system.
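The three Table 9 selections can be sketched on 64-bit register values in Python, modelling the definitions Vd := (Vm) ? Vn : Vd and its variants (the argument ordering here mirrors the semantics, not the instruction encoding):

```python
MASK = (1 << 64) - 1  # one 64-bit D register

def vbsl(c, n, m):
    """VBSL: the destination (holding control C) takes bits of N where
    C is 1 and bits of M where C is 0, as in Figure 43a."""
    return ((c & n) | (~c & m)) & MASK

def vbit(d, n, c):
    """VBIT: insert bits of N into D where C is 1; D is unchanged where
    C is 0, as in Figure 43b."""
    return ((c & n) | (~c & d)) & MASK

def vbif(d, n, c):
    """VBIF: insert bits of N into D where C is 0; D is unchanged where
    C is 1, as in Figure 43c."""
    return ((c & d) | (~c & n)) & MASK

c = 0xFFFF0000FFFF0000
print(hex(vbsl(c, 0x1111111111111111, 0x2222222222222222)))
# gives 0x1111222211112222
```

All three compute the same AND/OR network; they differ only in which operand the destination register reuses, which is how a four-operand multiplex fits a three-operand instruction.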
Figure 44 schematically illustrates a sequence of bytes of data B0 to B7 stored
within a memory. These bytes are stored in accordance with byte-invariant addressing,
whereby the same byte of data will be returned in response to reading of a given
memory address irrespective of the current endianness mode. The memory also
supports unaligned addressing, whereby half words, words or larger multi-byte data
elements may be read from the memory starting at an arbitrary memory byte address.

When the eight bytes of data B0 to B7 are read from the memory with the
system in little endian mode, then the bytes B0 to B7 are laid out within a register 800
in the order shown in Figure 44. The register 800 contains four data elements each
comprising a half word of sixteen bits. Figure 44 also shows the same eight bytes of
data B0 to B7 being read out into a register 802 when the system is operating in big
endian mode.
In this example, the data, once read out from memory into the respective SIMD
register 800, 802, is subject to a squaring operation which results in a doubling of the
data element size. Accordingly, the result is written in two destination SIMD
registers 804, 806. As will be seen from Figure 44, the result values written
respectively in the first or second of these register pairs 804, 806 vary depending upon
the endianness mode in which the data has been read from the memory. Accordingly, a
SIMD computer program which is to further manipulate the squared result values may
need to be altered to take account of the different layout of the data depending upon
the endianness mode. This disadvantageously results in the need to produce two
different forms of the computer program to cope with different endianness in the way
that the data has been stored within the memory.
Figure 45 addresses this problem by the provision of reordering logic 808.
The data processing system includes memory accessing logic 810 which serves to
read the eight bytes of data B0 to B7 from the memory starting at a specified memory
address and utilising the byte-invariant addressing characteristic of the memory. The
output of the memory accessing logic 810 accordingly presents bytes read from a
given memory address at the same output lane irrespective of the endianness mode.
Thus, in the example illustrated in which the data elements are half words, a byte
recovered from a particular memory address may be the most significant portion of a
half word in one endianness mode whilst it is the least significant portion of a half word in the other endianness mode.
The data element reordering logic 808 is responsible for reordering the data
elements retrieved from the memory by the memory access logic 810 such that the
data elements which are loaded into the SIMD register 812 will be in a form
consistent with the data having been stored in a little endian form and loaded without
rearrangement, irrespective of the endianness mode being used within the memory
system. In the case of a little endian mode being used within the memory system, the
data element reordering logic 808 will not reorder the bytes and will pass these
through unaltered. However, in the case of the data being stored in a big endian form
within the memory system, the data element reordering logic 808 serves to reverse the
order of the bytes read from the memory within each half word, so that the half word
data element will appear in little endian form within the SIMD register 812. In this
way, a single SIMD computer program can perform the correct data processing
operations upon the data elements transferred into the SIMD register irrespective of
the endianness mode in which these were stored within the memory. It will be seen
from Figure 45 that the data element reordering logic 808 is responsive to a signal
indicating the endianness mode being used by the memory and a signal indicating the
size of the data elements concerned. The endianness mode being used will control
whether or not any reordering is required and the size will control the nature of the
reordering applied if it is required. It will be seen that when the data is stored within
the memory in little endian mode and the SIMD registers are little endian, then no
reordering is required. Conversely, if the SIMD registers assumed a big endian form
then no reordering would be required when the data was stored in big endian form
within the memory, but reordering would be required when the data was stored in
a little endian form within the memory.
Figure 46 illustrates an example similar to that of Figure 45 except that in this
example the data elements are 32-bit data words. As will be seen, when these data
words are stored within the memory in a big endian form, the reordering applied by
the data element reordering logic 808 reverses the byte order of four byte data
elements as retrieved by the memory accessing logic 810 so that these are stored into
the SIMD register 812 in a form consistent with the data having been stored in a little
endian form in the memory and loaded without rearrangement.
It will be appreciated that in the context of the processor system as a whole
described herein, the memory accessing logic 810 and the data element reordering
element 808 may form part of the previously described load store unit. The data
element reordering logic 808 may also be used to compensate for memory system
endianness when reading data into the scalar registers when a particular endianness is
being assumed for the data within the scalar registers.
Figure 47 illustrates the data element reordering logic 808 in more detail. It
will be seen that this is formed as three levels of multiplexers controlled by respective
control signals Z, Y and X. These three layers are respectively responsible for
reversing positions of adjacent bytes, adjacent half words and adjacent words of data.
The control signals X, Y and Z are decoded from an endianness signal which when
asserted indicates big endian mode and a size signal indicating respectively 64, 32 or
16 bit data element size as is illustrated in Figure 47. It will be appreciated that many
other forms of data element reordering logic could be used to achieve the same
functional result as is illustrated in Figures 45 and 46.
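The net effect of the reordering logic of Figures 45 to 47 can be summarised in a short sketch. This is an illustrative Python model rather than the hardware implementation; the function name and the bytes-in/bytes-out interface are assumptions made for the example:

```python
def reorder_for_endianness(data: bytes, element_size: int, big_endian: bool) -> bytes:
    """Model of the data element reordering logic 808: when the memory
    system is big endian, reverse the byte order within each data element
    so the SIMD register always sees a little endian layout.
    element_size is the data element size in bytes (2, 4 or 8)."""
    if not big_endian:
        return data  # little endian memory: pass bytes through unaltered
    out = bytearray()
    for i in range(0, len(data), element_size):
        out += data[i:i + element_size][::-1]  # reverse bytes in this element
    return bytes(out)
```

In hardware this per-element reversal is built from the three swap layers of Figure 47 (Z swapping adjacent bytes, Y adjacent half words, X adjacent words), asserted in combinations decoded from the endianness and size signals.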
The memory access instruction which is used to perform the byte invariant
addressing of the memory conveniently uses a memory address pointer which is held
within a register of a scalar register bank of the processor. The processor supports
data processing instructions which change the data element size as well as data
processing instructions which operate on selected ones of data elements within a
SIMD register.
Figure 48 illustrates a register data store 900 which includes a list of registers
D0, D1 each serving as a table register, an index register D7 and a result register D5.
It will be seen that the table registers D0, D1 are contiguously numbered registers
within the register data store 900. The result register D5 and the index register D7 are
arbitrarily positioned relative to the table registers and each other. The syntax of the
instruction corresponding to this data manipulation is shown in the figure.
Figure 49 schematically illustrates the action of a table lookup extension
instruction. This instruction specifies a list of registers to be used as a block of table
registers, such as by specifying the first register in the list and the number of registers
in the list (e.g. one to four). The instruction also specifies a register to be used as the
index register D7 and a register to be used as the result register D5. The table lookup
extension instruction further specifies the data element size of the data elements
stored within the table registers D0, D1 and to be selected and written into the result
register D5. In the example illustrated, the table registers D0, D1 each contain eight
data elements. Accordingly, the index values have an in-range span of 0 to 15. Index
values outside of this predetermined range will not result in a table lookup and instead
the corresponding position within the result register D5 will be left unchanged. As
illustrated, the fourth and sixth index values are out-of-range in this way. The other
index values point to respective data elements within the table registers D0, D1 and
these data elements are then stored into the corresponding positions within the result
register D5. There is a one-to-one correspondence between index value position
within the index register D7 and data element position within the result register D5.
The values marked "U" in the result register D5 indicate that the values stored at
those locations are preserved during the action of the table lookup extension
instruction. Thus, whatever bits were stored in those locations prior to execution of
the instruction are still stored within those positions following the execution of the
instruction.
Figure 50 illustrates the index values from Figure 49 which are then subject to
a SIMD subtraction operation whereby an offset of sixteen is applied to each of the
index values. This takes the previously in-range index values to out-of-range values.
The previously out-of-range values are now moved in-range. Thus, when the index
register D7 containing the now modified index values is reused in another table
lookup extension instruction, the fourth and sixth index values are now in-range and
result in table lookups being performed in table registers D0, D1 (or other different
registers which may be specified in the second table lookup extension instruction)
which have also been reloaded prior to the execution of a second table lookup
extension instruction. Thus, a single set of index values within an index register D7
may be subject to an offset and then reused with reloaded table registers D0, D1 to
give the effect of a larger table being available.
Figure 51 further illustrates a table lookup instruction which may be provided
in addition to the table lookup extension instruction. The difference between these
instructions is that when an out-of-range index value is encountered in a table lookup
instruction, the location within the result register D5 corresponding to that index value
is written with zero values rather than being left unchanged. This type of behaviour
is useful in certain programming situations. The example of Figure 51 illustrates three
table registers rather than two table registers. The first, third, fourth, sixth and
seventh index values are out-of-range. The second, fifth and eighth index values are
in-range and result in table lookups of corresponding data elements within the table
registers.
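The two lookup behaviours of Figures 49 to 51 differ only in how out-of-range indices are handled. The following Python sketch models both; the function name and list-based interface are assumptions made for illustration, not the instruction encoding:

```python
def table_lookup(table, indices, dest, zero_out_of_range):
    """Model of the table lookup instructions of Figures 49-51.
    'table' holds the concatenated data elements of the table registers;
    'indices' come from the index register and 'dest' gives the prior
    contents of the result register.  An out-of-range index either zeroes
    the result slot (table lookup) or leaves it unchanged, the "U" values
    of Figure 49 (table lookup extension)."""
    result = list(dest)
    for pos, idx in enumerate(indices):
        if 0 <= idx < len(table):
            result[pos] = table[idx]        # in-range: fetch the element
        elif zero_out_of_range:
            result[pos] = 0                 # table lookup: write zeros
        # else: table lookup extension preserves the previous value
    return result
```

The offset-and-reuse technique of Figure 50 follows directly: subtract the table length from every index, reload the table registers, and run the extension form again, so the preserved slots from the first pass are filled on the second, giving the effect of a larger table.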
As mentioned earlier, load and store instructions are provided for moving data
between the SIMD register file 20 (see Figure 1) and memory. Each such load and
store instruction will specify a start address identifying the location within the
memory from which the access operation (whether that be a load operation or a store
operation) should begin. In accordance with the load and store instructions of
embodiments, the amount of data that is the subject of that load or store instruction
can be varied on a per instruction basis. In particular embodiments, the amount of
data is identified by identifying the data type "dt" (i.e. the size of each data element)
and identifying the number of data elements to be accessed by identifying the SIMD
register list and optionally the number of structures to be accessed.
When performing SIMD processing, it is often the case that the access
operations performed with respect to the necessary data elements are unaligned
accesses (also referred to herein as byte aligned accesses). In other words, the start
address will often be unaligned, and in such situations the LSU 22 needs to allocate to
the access operation the maximum number of accesses that may be required to enable
the access operation to complete.
Whilst in a possible implementation the LSU 22 could be arranged to assume
that every access is unaligned, this means that the LSU 22 is unable to improve the
efficiency of the access operations in situations where the start address is in fact
aligned with a certain multiple number of bytes.
Whilst the LSU 22 would be able to determine from the start address whether
the start address has a predetermined alignment, the LSU 22 typically has to commit
to the number of accesses for the access operation before the start address has
actually been computed. In a particular embodiment, the LSU 22 has a pipelined
architecture, and the number of accesses to be used to perform any particular access
operation is determined by the LSU in the decode stage of the pipeline. However,
often the start address is computed in a subsequent execute stage of the pipeline, for
example by adding an offset value to a base address, and accordingly the LSU 22 is
unable to await determination of the start address before determining how many
accesses to allocate to the access operation.
In accordance with an embodiment, this problem is alleviated by providing an
alignment specifier field within the access instruction, also referred to herein as an
alignment qualifier. In one particular embodiment, the alignment qualifier can take a
first value which indicates that the start address is to be treated as byte aligned, i.e.
unaligned. It will be appreciated that this first value could be provided by any
predetermined encoding of the alignment specifier field. In addition, the alignment
qualifier can take any one of a plurality of second values indicating different
predetermined alignments that the start address is to be treated as conforming to, and
in one particular embodiment, the plurality of available second values are as indicated
in the following table:
Alignment   Start Address   Promise and Availability
Qualifier   Format

@16         ..xxxxxxx0      The start address is to be considered to be a multiple
                            of 2 bytes. Available to instructions that transfer
                            exactly 2 bytes.

@32         ..xxxxxx00      The start address is to be considered to be a multiple
                            of 4 bytes. Available to instructions that transfer
                            exactly 4 bytes.

@64         ..xxxxx000      The start address is to be considered to be a multiple
                            of 8 bytes. Available to instructions that transfer a
                            multiple of 8 bytes.

@128        ..xxxx0000      The start address is to be considered to be a multiple
                            of 16 bytes. Available to instructions that transfer a
                            multiple of 16 bytes.

@256        ..xxx00000      The start address is to be considered to be a multiple
                            of 32 bytes. Available to instructions that transfer a
                            multiple of 32 bytes.

Table 10
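The rules of Table 10 can be captured in a small sketch. The dictionary encoding and function names below are assumptions made for illustration; they simply restate the table's alignment promises and availability rules in executable form:

```python
# Hypothetical encoding of Table 10: qualifier -> (alignment in bytes, rule).
# "exact" qualifiers are available only to instructions transferring exactly
# that many bytes; "multiple" qualifiers to transfers of a whole multiple.
QUALIFIERS = {
    "@16":  (2,  "exact"),
    "@32":  (4,  "exact"),
    "@64":  (8,  "multiple"),
    "@128": (16, "multiple"),
    "@256": (32, "multiple"),
}

def qualifier_available(qualifier: str, transfer_bytes: int) -> bool:
    """Whether the qualifier may be attached to an instruction that
    transfers transfer_bytes of data (the Availability column)."""
    align, rule = QUALIFIERS[qualifier]
    if rule == "exact":
        return transfer_bytes == align
    return transfer_bytes % align == 0

def obeys_promise(qualifier: str, start_address: int) -> bool:
    """Whether a computed start address honours the qualifier's promise;
    if not, the access would raise an alignment fault at execution time."""
    align, _ = QUALIFIERS[qualifier]
    return start_address % align == 0
```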
The manner in which this alignment specifier information is used in one
embodiment will now be described with reference to Figure 52. As shown in Figure
52, the LSU 22 will typically be connected to a memory system via a data bus of a
predetermined width. Often the memory system will consist of a number of different
levels of memory, and the first level of memory is often a cache, this being the level
of memory with which the LSU communicates via the data bus. Accordingly, as
shown in Figure 52, the LSU 22 is arranged to communicate with a level 1 cache
1010 of the memory via a data bus 1020, in this particular example the data bus being
considered to have a width of 64 bits. In the event of a cache hit the access takes
place with respect to the contents of the level 1 cache, whereas in the event of a cache
miss, the level 1 cache 1010 will then communicate with other parts of the memory
system 1000 via one or more further buses 1030.
The various parts of the memory system may be distributed, and in the
example illustrated in Figure 52, it is assumed that the level 1 cache 1010 is provided
on-chip, i.e. is incorporated within the integrated circuit 2 of Figure 1, whilst the rest
of the memory system 1000 is provided off-chip. The delimitation between on-chip
and off-chip is indicated by the dotted line 1035 in Figure 52. However, it will be
appreciated by those skilled in the art that other configurations may be used, and so
for example all of the memory system may be provided off-chip, or some other
delimitation between the on-chip parts of the memory system and the off-chip parts of
the memory system may be provided.
The LSU 22 is also arranged to communicate with a memory management unit
(MMU) 1005, which typically incorporates a Translation Lookaside Buffer (TLB)
1015. As will be appreciated by those skilled in the art, an MMU is used to perform
certain access control functions, for example conversion of virtual to physical
addresses, determination of access permissions (i.e. whether the access can take
place), etc. To do this, the MMU stores within the TLB 1015 descriptors obtained
from page tables in memory. Each descriptor defines for a corresponding page of
memory the necessary access control information relevant to that page of memory.
The LSU 22 is arranged to communicate certain details of the access to both
the level 1 cache 1010 and the MMU 1005 via a control path 1025. In particular, the
LSU 22 is arranged to output to the level 1 cache and the MMU a start address and an
indication of the size of the block of data to be accessed. Furthermore, in accordance
with one embodiment, the LSU 22 also outputs alignment information derived from
the alignment specifier. The manner in which the alignment specifier information is
used by the LSU 22 and/or by the level 1 cache 1010 and the MMU 1005 will now be
described further with reference to Figures 53A to 54B.
Figure 53A illustrates a memory address space, with each solid horizontal line
indicating a 64-bit alignment in memory. If the access operation specifies the 128-bit
long data block 1040, which for the sake of argument we will assume has a start
address of 0x4, then the LSU 22 needs to determine the number of separate accesses
over the 64-bit data bus 1020 to allocate to the access operation. Further, as discussed
earlier, it will typically need to make this determination before it knows what the start
address is. In the embodiment envisaged with respect to Figure 52, the LSU 22 is
arranged to use the alignment specifier information when determining the number of
accesses to allocate.
In the example of Figure 53A, the start address is 32-bit aligned, and the
alignment specifier may have identified this alignment. In that instance, as can be
seen from Figure 53A, the LSU 22 has to assume the worst case scenario, and hence
assume that three separate accesses will be required in order to perform the necessary
access operation with regard to the data block 1040. This is the same number of
accesses that would have to be allocated for an unaligned access.
However, if we now consider the similar example illustrated in Figure 53B, it
can be seen that again a 128-bit data block 1045 is to be accessed, but in this instance
the start address is 64-bit aligned. If the alignment specifier information identifies
this 64-bit alignment, or indeed identifies the data as being 128-bit aligned, then in
this case the LSU 22 only needs to allocate two separate accesses to the access
operation, thereby providing a significant improvement in efficiency. If, however, the
data bus were 128 bits wide, then if the alignment specifier indicated 128-bit
alignment rather than 64-bit alignment, the LSU 22 would only need to allocate a
single access.
Considering now the example in Figure 53C, here it can be seen that a 96-bit
size data block 1050 needs to be accessed, and in this instance it is assumed that the
alignment specifier identifies that the start address is 32-bit aligned. Again, in this
example, even though the LSU 22 will not actually have calculated the start address at
the time the number of accesses needs to be committed, the LSU 22 can still assume
that only two accesses need to be allocated to the access operation. Figure 53D
illustrates a fourth example in which an 80-bit data block 1055 is to be accessed, and
in which the alignment specifier identifies that the start address is 16-bit aligned.
Again, the LSU 22 only needs to allocate two accesses to the access operation. If
instead the alignment specifier had indicated that the access was to be treated as an
unaligned access, then it is clear that the LSU would have to have allocated three
accesses to the access operation, as indeed would have been the case for the access
illustrated in Figure 53C. Accordingly, it can be seen that the alignment specifier
information can be used by the LSU 22 to significantly improve the performance of
accesses in situations where the alignment specifier indicates a certain predetermined
alignment of the start address.
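The worst-case allocation reasoning of Figures 53A to 53D can be expressed as a small calculation. This is an illustrative sketch under the assumption that the LSU simply sizes for the worst start offset permitted by the alignment promise; the function name is invented for the example:

```python
def worst_case_accesses(transfer_bytes: int, align_bytes: int, bus_bytes: int = 8) -> int:
    """Number of bus accesses the LSU must commit to, knowing only the
    promised alignment and not the actual start address.  With an A-byte
    alignment promise the start address can sit at most (bus - A) bytes
    into a bus-width beat, so the allocation covers that worst case."""
    worst_offset = bus_bytes - min(align_bytes, bus_bytes)
    return -(-(worst_offset + transfer_bytes) // bus_bytes)  # ceiling division
```

On the 64-bit bus of Figure 52 this reproduces the examples in the text: a 128-bit block with a 32-bit promise needs three accesses, with a 64-bit promise two; the 96-bit and 80-bit blocks of Figures 53C and 53D each need two, versus three if treated as unaligned.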
It should be noted that the alignment specifier cannot be taken as a guarantee
that the start address (also referred to herein as the effective address) will have that
alignment, but does provide the LSU 22 with an assumption on which to proceed. If
the start address subsequently turns out not to obey the alignment specified by the
alignment specifier, then in one embodiment the associated load or store operation is
arranged to generate an alignment fault. The alignment fault can then be handled
using any one of a number of known techniques.
As mentioned earlier, the alignment information is not only used by the LSU
22, but is also propagated via path 1025 to both the level 1 cache 1010 and the MMU
1005. The manner in which this information may be used by the level 1 cache or the
MMU will now be described with reference to Figures 54A and 54B. As illustrated in
Figures 54A and 54B, an access to a 256-bit data block 1060, 1065 is considered, in
these examples the solid horizontal lines in the diagrams indicating a 128-bit
alignment in memory. In Figure 54A, it is assumed that the data block is 64-bit
aligned, whilst in Figure 54B it is assumed that the data block is 128-bit aligned. In
both instances, since the data bus 1020 is only 64 bits wide, it will be clear that the
LSU 22 has to allocate four accesses to the access operation. From the LSU's
perspective, it does not matter whether the alignment specifier specifies that the start
address is 64-bit aligned or 128-bit aligned.
However, the cache lines within the level 1 cache 1010 may each be capable
of storing in excess of 256 bits of data, and further may be 128-bit aligned. In the
example of Figure 54A, since the data block is not 128-bit aligned, the cache will
need to assume that two cache lines will need to be accessed. However, in the
example of Figure 54B, the level 1 cache 1010 can determine from the alignment
specifier that only a single cache line within the level 1 cache needs to be accessed,
and this can be used to increase the efficiency of the access operation within the level
1 cache 1010.
Similarly, the page tables that need to be accessed by the MMU in order to
retrieve the appropriate descriptors into the TLB 1015 will often store in excess of
256 bits of data, and may often be 128-bit aligned. Accordingly, the MMU 1005 can
use the alignment information provided over path 1025 in order to determine the
number of page tables to be accessed. Whilst in the example of Figure 54A, the
MMU 1005 may need to assume that more than one page table will need to be
accessed, in the example of Figure 54B, the MMU can determine from the alignment
specifier that only a single page table needs to be accessed, and this information can
be used to improve the efficiency of the access control functions performed by the
MMU 1005.
Accordingly, it can be seen that the use of the alignment specifier within the
load or store instructions as described above can be used to enable the hardware to
optimise certain aspects of the access operation, which is especially useful if the
number of access cycles and/or cache accesses has to be committed to before the start
address can be determined. This scheme is useful for load or store instructions
specifying various lengths of data to be accessed, and on processors with differing
data bus sizes between the LSU and the memory system.
There are a number of data processing operations which do not lend
themselves to being performed in a standard SIMD format, where multiple data
elements are placed side-by-side within a register, and then the operation is performed
in parallel on those data elements. Examples of some such operations are illustrated
in Figures 55A to 55C. Figure 55A illustrates an interleave operation, where it is
desired to interleave four data elements A, B, C, D within a first register 1100 with
four data elements E, F, G, H within a second register 1102. In Figure 55A, the
resultant interleaved data elements are shown within destination registers 1104, 1106.
These destination registers may be different registers to the source registers 1100,
1102, or alternatively may be the same set of two registers as the source registers. As
can be seen from Figure 55A, in accordance with this interleave operation, the first
data elements from each source register are placed side-by-side within the destination
registers, followed by the second data elements from both source registers, followed
by the third data elements from both source registers, followed by the fourth data
elements from both source registers.
Figure 55B illustrates the reverse de-interleave operation, where it is required
to de-interleave the eight data elements placed in the two source registers 1108 and
1110. In accordance with this operation, the first, third, fifth and seventh data
elements are placed in one destination register 1112, whilst the second, fourth, sixth
and eighth data elements are placed in a second destination register 1114. As with the
Figure 55A example, it will be appreciated that the destination registers may be
different to the source registers, or alternatively may be the same registers. If in the
examples of Figures 55A and 55B it is assumed that the registers are 64-bit registers,
then in this particular example the data elements being interleaved or de-interleaved
are 16-bit wide data elements. However, it will be appreciated that there is no
requirement for the data elements being interleaved or de-interleaved to be 16 bits
wide, nor for the source and destination registers to be 64-bit registers.
Figure 55C illustrates the function performed by a transpose operation. In
accordance with this example, two data elements A, B from a first source register
1116, and two data elements C, D from a second source register 1118, are to be
transposed, and the result of the transposition is that the second data element from the
first source register 1116 is swapped with the first data element from the second
source register 1118, such that within the first destination register 1120, the data
elements A, C are provided, whilst in a second destination register 1122 the data
elements B, D are provided. Again, the destination registers may be different to the
source registers, but it is often the case that the destination registers are in fact the
same registers as the source registers. In one example, each of the registers 1116,
1118, 1120, 1122 may be considered to be 64-bit registers, in which event the data
elements are 32-bit wide data elements. However, there is no requirement for the data
elements to be 32 bits wide, nor for the registers to be 64-bit registers.
Further, whilst in all of the above examples it has been assumed that the entire
contents of the registers are shown, it is envisaged that any of these three discussed
operations could be performed independently on the data elements within different
portions of the relevant source registers, and hence the figures in that case illustrate
only a portion of the source/destination registers.
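The three rearrangements of Figures 55A to 55C can be sketched in Python on plain lists standing in for registers. The function names and list interface are illustrative assumptions, not the instruction set itself:

```python
def interleave(a, b):
    """Interleave of Figure 55A: corresponding elements of the two source
    registers are paired up; the low half of the combined result fills the
    first destination register and the high half the second."""
    merged = [x for pair in zip(a, b) for x in pair]
    half = len(merged) // 2
    return merged[:half], merged[half:]

def de_interleave(a, b):
    """De-interleave of Figure 55B: odd-positioned (first, third, ...)
    elements go to one destination register, even-positioned elements
    (second, fourth, ...) to the other."""
    merged = a + b
    return merged[0::2], merged[1::2]
```

Note that the transpose of Figure 55C is the two-element case of the interleave: `interleave(["A", "B"], ["C", "D"])` yields `(["A", "C"], ["B", "D"])`, exactly the swap described, which anticipates the lane-size observation made below.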
As mentioned earlier, the standard SIMD approach involves placing multiple
data elements side-by-side within a register, and then performing an operation in
parallel on those data elements. In other words, the parallelisation of the operation is
performed at the data element granularity. Whilst this leads to very efficient
execution of operations where the required data elements can be arranged in such a
manner, for example by spreading the required source data elements across multiple
registers, there are a significant number of operations where it is not practical to
arrange the required source data elements in such a way, and hence in which the
potential speed benefits of a SIMD approach have not previously been able to be
exploited. The above interleave, de-interleave and transpose operations are examples
of such operations which have not previously been able to take advantage of the speed
benefits of a SIMD approach, but it will be appreciated that there are also many other
examples, for example certain types of arithmetic operations. One particular example
of such an arithmetic operation is an arithmetic operation which needs to be applied to
a complex number consisting of real and imaginary parts.
In accordance with one embodiment, this problem is alleviated by providing
the ability for certain data processing instructions to identify not only a data element
size, but also to further identify as a separate entity a lane size, the lane size being a
multiple of the data element size. The parallelisation of the data processing operation
then occurs at the granularity of the lane size rather than the data element size, such
that more than one data element involved in a particular instantiation of the data
processing operation can co-exist within the same source register. Hence, the
processing logic used to perform the data processing operation can define based on
the lane size a number of lanes of parallel processing, and the data processing
operation can then be performed in parallel in each of the lanes, the data processing
operation being applied to selected data elements within each such lane of parallel
processing.
By such an approach, it is possible to perform in a SIMD manner interleave
operations such as those described earlier with reference to Figure 55A. In particular,
Figure 56A illustrates the processing performed when executing a "ZIP" instruction in
accordance with one embodiment. In this particular example, the ZIP instruction is a
32|ZIP.8 instruction. This instruction hence identifies that the data elements are 8
bits wide, and the lanes are 32 bits wide. For the example of Figure 56A, it is
assumed that the ZIP instruction has specified the source registers to be the 64-bit
registers D0 1125 and D1 1130. Each of these registers hence contains eight 8-bit
data elements. Within each lane the interleave operation is applied independently,
and in parallel, resulting in the rearrangement of data elements as shown in the lower
half of Figure 56A. In one embodiment, it is assumed that for the ZIP instruction, the
destination registers are the same as the source registers, and accordingly these
rearranged data elements are once again stored within the registers D0 1125 and D1
1130. As can be seen from Figure 56A, within lane 1, the first four data elements of
each source register have been interleaved, and within lane 2, the second four data
elements of each source register have been interleaved.
It will be readily appreciated that different forms of interleaving could be
performed by changing either the lane size or the data element size. For example, if
the lane size was identified as being 64 bits, i.e. resulting in there being only a single
lane, then it can be seen that the destination register D0 would contain the interleaved
result of the first four data elements of each register, whilst the destination register D1
would contain the interleaved result of the second four data elements of each register.
It will be appreciated that a corresponding UNZIP instruction can be provided in
order to perform the corresponding de-interleave operation, the UNZIP instruction
again being able to specify both a lane size and a data element size.
Typically, a transpose operation is considered to be a quite different operation
to an interleave operation or a de-interleave operation, and hence it would typically be
envisaged that a separate instruction would need to be provided to perform transpose
operations. However, it has been realised that when providing an interleave or a
de-interleave instruction with the ability to separately define a lane size and a data
element size, then the same instruction can in fact be used to perform a transpose
operation when two source registers are specified, and the lane size is set to be twice
the data element size. This is illustrated in Figure 56B where the interleave
instruction ZIP has been set to identify a data element size of 8 bits, and a lane size of
16 bits (i.e. twice the data element size). Assuming the same 64-bit source registers
D0 1125 and D1 1130 are chosen as in the Figure 56A example, this defines four
lanes of parallel processing as shown in Figure 56B. As can then be seen from the
lower half of Figure 56B, the interleaving process actually results within each lane in
the generation of a transposed result, in that the first data element of the second source
register within each lane is swapped with the second data element of the first source
register within each lane.
Hence, in accordance with the above described embodiment, the same ZIP
instruction can be used to perform either an interleave or a transpose operation,
dependent on how the lane size and data element size are defined. It should further be
noted that a transposition can also be performed in exactly the same manner using the
UNZIP instruction, and accordingly a 16|UNZIP.8 instruction will perform exactly
the same transpose operation as a 16|ZIP.8 instruction.
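The lane-parameterised ZIP of Figures 56A and 56B can be modelled directly. This Python sketch is illustrative (the function name and list interface are assumptions); sizes are given in bits to mirror the instruction syntax:

```python
def zip_lanes(a, b, lane_bits, elem_bits):
    """Model of a lane|ZIP.elem instruction: each source register is split
    into lanes of lane_bits // elem_bits elements, and the interleave is
    applied independently within every lane (Figure 56A).  The low half of
    each lane's interleaved result stays in the first register's lane, the
    high half in the second's.  When lane_bits == 2 * elem_bits each lane
    holds two elements and the result is a transpose (Figure 56B)."""
    per_lane = lane_bits // elem_bits
    out_a, out_b = [], []
    for i in range(0, len(a), per_lane):
        la, lb = a[i:i + per_lane], b[i:i + per_lane]
        merged = [x for pair in zip(la, lb) for x in pair]
        out_a += merged[:per_lane]
        out_b += merged[per_lane:]
    return out_a, out_b
```

With eight 8-bit elements per register, `zip_lanes(a, b, 32, 8)` reproduces the two-lane interleave of Figure 56A, `zip_lanes(a, b, 16, 8)` the four-lane transpose of Figure 56B, and `zip_lanes(a, b, 64, 8)` the single-lane full interleave discussed above.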
Figures 57A to 57C illustrate one particular example of an implementation of
such ZIP instructions, in which a four-by-four array of pixels 1135 within an image
are to be transposed about the line 1136 (see Figure 57A). Each pixel will typically
consist of red, green and blue components expressed in RGB format. If for the sake
of argument we assume that the data required to define each pixel is 16 bits in length,
then it can be seen that the data for each horizontal line of four pixels in the array
1135 can be placed in a separate source register A, B, C, D.

Figure 57B illustrates the various transpositions that occur if the following two
instructions are executed:

32|ZIP.16 A, B
32|ZIP.16 C, D

Each ZIP instruction hence defines the lane width to be 32 bits, and the data
element width to be 16 bits, and thus within each lane the first data element in the
second register is swapped with the second data element in the first register, as shown
by the four diagonal arrowed lines illustrated in Figure 57B. Hence, separate
transpositions occur within each of the four two-by-two blocks 1137, 1141, 1143 and
1145.
Figure 57C then illustrates the transposition that occurs as a result of execution
of the following two instructions:

64|ZIP.32 A, C
64|ZIP.32 B, D

In accordance with these instructions, the lane width is set to be 64 bits, i.e.
the entire width of the source registers, and the data element width is chosen to be 32
bits. Execution of the first ZIP instruction thus results in the second 32-bit wide data
element in register A 1147 being swapped with the first 32-bit wide data element
within the register C 1151. Similarly, the second ZIP instruction results in the second
32-bit wide data element in the register B 1149 being swapped with the first 32-bit
data element within the register D 1153. As illustrated by the diagonal arrowed line
in Figure 57C, this hence results in the two-by-two block of pixels in the top left
being swapped with the two-by-two block of pixels in the bottom right. As will be
appreciated by those skilled in the art, this sequence of four ZIP instructions has
hence transposed the entire four-by-four array 1135 of pixels about the diagonal line
1136.

Figure 58 illustrates one particular example of the use of the interleave
instruction. In this example, complex numbers consisting of real and imaginary parts
are considered. It may be the case that a certain computation needs to be performed
on the real parts of a series of complex numbers, whilst a separate computation needs
to be performed on the imaginary parts of those complex numbers. As a result, the
real parts may have been arranged in a particular register D0 1155 whilst the
imaginary parts may have been placed in a separate register D1 1160. At some point,
it may be desired to reunite the real and imaginary parts of each complex number so
that they are adjacent to each other within the registers. As is illustrated in Figure 58,
this can be achieved through the use of a 64|ZIP.16 instruction, which sets the lane
width to be the full width of the source registers, and sets the data element width to be
16 bits, i.e. the width of each of the real and imaginary parts. As shown by the lower
half of Figure 58, the result of the execution of the ZIP instruction is that each of the
real and imaginary parts of each complex number a, b, c, d are reunited within the
register space, the destination register D0 1155 containing the real and imaginary
parts of the complex numbers a and b and the destination register D1 1160 containing
the real and imaginary parts of the complex numbers c and d.
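With the lane set to the full register width, the ZIP described above reduces to a plain interleave split across the two destination registers. A minimal Python sketch (the numeric values and the name `zip_full` are illustrative assumptions):

```python
def zip_full(reals, imags):
    # 64|ZIP.16 with the lane equal to the full register width: a plain
    # interleave of the two source registers, the low half of the
    # interleaved sequence landing in D0 and the high half in D1.
    inter = [x for pair in zip(reals, imags) for x in pair]
    return inter[:len(reals)], inter[len(reals):]

# Real parts of a, b, c, d in D0; imaginary parts in D1 (values illustrative).
d0, d1 = zip_full([1, 2, 3, 4], [10, 20, 30, 40])
# d0 now holds the real/imaginary pairs of a and b,
# d1 the real/imaginary pairs of c and d, as in the lower half of Figure 58.
```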
It is not just data rearranging instructions like interleave and de-interleave
instructions that can take advantage of the ability to specify the lane size
independently of the data element size. For example, Figures 59A and 59B illustrate a
sequence of two instructions that can be used to perform a multiplication of two
complex numbers. In particular, it is desired to multiply a complex number A by a
complex number B, in order to generate a resultant complex number D, as illustrated
by the following equations:

Dre = Are * Bre - Aim * Bim
Dim = Are * Bim + Aim * Bre
Figure 59A shows the operation performed in response to a first multiply
instruction of the following form:

32|MUL.16 Dd, Dn, Dm[0]
The source registers are 64-bit registers, and the multiply instruction specifies
a lane width of 32 bits and a data element size of 16 bits. The multiply instruction is
arranged within each lane to multiply the first data element in that lane within the
source register Dm 1165 with each of the data elements in that lane in the second
source register Dn 1170 (as shown in Figure 59A), with the resultant values being
stored in corresponding locations within the destination register Dd 1175. Within
each lane, the first data element in the destination register is considered to represent
the real part of the partial result of the complex number, and the second data element
is considered to represent the imaginary part of the partial result of the complex
number.
Following execution of the instruction illustrated in Figure 59A, the following
instruction is then executed:

32|MASX.16 Dd, Dn, Dm[1]
As illustrated by Figure 59B, this instruction is a "multiply add subtract with
exchange" instruction. In accordance with this instruction, the second data element
within each lane of the source register Dm is multiplied with each data element within
the corresponding lane of the second source register Dn, in the manner illustrated in
Figure 59B. Then, the result of that multiplication is either added to, or subtracted
from, the values of corresponding data elements already stored within the destination
register Dd 1175, with the result then being placed back within the destination register
Dd 1175. It will be appreciated from a comparison of the operations of Figures 59A
and 59B with the earlier identified equations for generating the real and imaginary
parts of the resultant complex number D that by employing these two instructions in
sequence, the computation can be performed in parallel for two sets of complex
numbers, thereby enabling the speed benefit of a SIMD approach to be realised.
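The two-instruction sequence can be checked with a small model. This Python sketch is an illustration under stated assumptions, not the embodiment itself: registers are lists of integers, and the "exchange" is modelled by cross-assigning the two products within each lane before the subtract/add.

```python
def mul_by_lane_scalar(dn, dm, idx, lane=2):
    # 32|MUL.16 Dd, Dn, Dm[idx]: within each lane, multiply element idx
    # of Dm's lane with every element of Dn's lane.
    out = []
    for i in range(0, len(dn), lane):
        s = dm[i + idx]
        out.extend(x * s for x in dn[i:i + lane])
    return out

def masx(dd, dn, dm, idx, lane=2):
    # 32|MASX.16 Dd, Dn, Dm[idx]: multiply element idx of Dm's lane with
    # each element of Dn's lane, exchange the two products, then subtract
    # into the real (first) element and add into the imaginary (second).
    out = list(dd)
    for i in range(0, len(dn), lane):
        s = dm[i + idx]
        p0, p1 = (x * s for x in dn[i:i + lane])
        out[i] = dd[i] - p1
        out[i + 1] = dd[i + 1] + p0
    return out

# One lane: A = 1 + 2i in Dm, B = 3 + 4i in Dn; (1+2i)(3+4i) = -5 + 10i.
dm, dn = [1, 2], [3, 4]
dd = mul_by_lane_scalar(dn, dm, 0)   # partial result [Are*Bre, Are*Bim]
dd = masx(dd, dn, dm, 1)             # real and imaginary parts of D
```

With a second lane holding another complex pair, the same two calls compute both products in parallel, which is the SIMD benefit noted above.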
From the above examples, it will be appreciated that by providing an
instruction with the ability to specify a lane size in addition to a data element size, the
number of operations that can potentially benefit from a SIMD implementation is
increased, and hence this provides a much improved flexibility with regard to the
implementation of operations in a SIMD manner.
The present technique provides the ability to perform SIMD processing on
vectors where the source and destination data element widths are different. One
particularly useful operation in this environment is an add or subtract then return high
half SIMD operation. Figure 60 shows an example of an add return high half
operation according to the present technique. An instruction decoder within the
SIMD decoder 16 (see Figure 1) decodes the instruction VADH.I16.I32 Dd, Qn, Qm
and performs the addition return high half illustrated in Figure 60 and set out below.
In Figure 60 two source registers located in the SIMD register file 20 (see
Figure 1), Qn and Qm, contain vectors of 32-bit data elements a and b. These are
added together to form a vector of 16-bit data elements Dd, also located in register
file 20, formed from the high half of the sums:

Qn = [a3 a2 a1 a0]
Qm = [b3 b2 b1 b0]

Output:

Dd = [(a3+b3)>>16, (a2+b2)>>16, (a1+b1)>>16, (a0+b0)>>16]
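The add-return-high-half behaviour can be sketched as follows; masking the shifted sum to the destination element width is my assumption about how the narrowed result is represented, and the numeric operands are illustrative.

```python
def vadh(qn, qm, size):
    # VADH: add pairs of 2*size-bit data elements and return the high
    # 'size' bits of each sum as the narrowed destination elements.
    mask = (1 << size) - 1
    return [((a + b) >> size) & mask for a, b in zip(qn, qm)]

# Two 32-bit source elements per register, 16-bit results:
dd = vadh([0x00012000, 0x7FFF0000], [0x00013000, 0x00020000], 16)
# the first sum 0x00025000 narrows to its high half 0x0002
```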
Figure 61 schematically shows a similar operation to that shown in Figure 60
but in this case, the instruction decoded is VRADH.I16.I32 Dd, Qn, Qm and the
operation performed is an add return high with rounding. This is performed in a very
similar way to the operation illustrated in Figure 60 but the high half is rounded. This
is done, in this example, by adding a data value having a one in the most significant
bit position of the lower half of the data value and zeros elsewhere after the addition
and prior to taking the high half.

In this Figure, as in Figure 60, intermediate values are shown with dotted lines
for clarity.
Further instructions (not illustrated) that may be supported are an addition or
subtraction return high with saturation. In this case the addition or subtraction will be
saturated where appropriate prior to the high half being taken.
Table 11 shows examples of some of the instructions that are supported by the
present technique. "size" returns the size of the destination data type in bits and
"round" returns the rounding constant 1<<(size-1).

Mnemonic  Data Type   Operand Format  Description
VADH      .I8.I16     Dd, Qn, Qm      Add returning High Half
          .I16.I32                    Vd[i] := (Vn[i] + Vm[i]) >> size
          .I32.I64
VRADH     .I8.I16     Dd, Qn, Qm      Add returning High Half with Rounding
          .I16.I32                    Vd[i] := (Vn[i] + Vm[i] + round) >> size
          .I32.I64
VSBH      .I8.I16     Dd, Qn, Qm      Subtract returning High Half
          .I16.I32                    Vd[i] := (Vn[i] - Vm[i]) >> size
          .I32.I64
VRSBH     .I8.I16     Dd, Qn, Qm      Subtract returning High Half with Rounding
          .I16.I32                    Vd[i] := (Vn[i] - Vm[i] + round) >> size
          .I32.I64

Table 11
The present technique can be performed on different types of data provided
that taking the high half of the data is a sensible thing to do. It is particularly
appropriate to processing performed on fixed point numbers.
The above technique has many applications and can be used, for example, to
accelerate SIMD FFT implementations. SIMD is particularly useful for performing
FFT (fast Fourier transform) operations, where the same operations need to be
performed on multiple data. Thus, using SIMD processing allows the multiple data to
be processed in parallel. The calculations performed for FFTs often involve
multiplying complex numbers together. This involves the multiplication of data
values and then the addition or subtraction of the products. In SIMD processing these
calculations are performed in parallel to increase processing speed.
A simple example of the sort of sums that need to be performed is given
below:

(a + ic) * (b + id) = e + if

Thus, the real portion e is equal to: a * b - c * d, and
the imaginary portion f is equal to: a * d + c * b.
Figure 62 shows a calculation to determine the real portion e. As can be seen,
the vectors for a, containing 16-bit data elements, are multiplied with the vectors for b
containing the same size data elements, and those for c with d. These products
produce two vectors with 32-bit data elements. To produce e, one of the vectors
needs to be subtracted from the other, but the final result is only needed to the same
accuracy as the original values. Thus, a resulting vector with 16-bit data elements is
required. This operation can be performed in response to the single instruction
VSBH.I16.I32 Dd, Qn, Qm as is shown in the Figure. This instruction, subtract
return high half, is therefore particularly useful in this context. Furthermore, it has
the advantage of allowing the arithmetic operation to be performed on the wider data
width, with the narrowing only occurring after the arithmetic operation (subtraction).
This generally gives a more accurate result than narrowing prior to performing the
subtraction.
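The Figure 62 flow can be checked numerically. This Python sketch uses Q15 fixed-point inputs purely as an illustrative assumption (the document does not fix a scaling): the 16-bit products widen to Q30, the subtraction happens at full 32-bit width, and only then does the high half narrow the result.

```python
def vsbh(qn, qm, size=16):
    # VSBH.I16.I32: subtract 32-bit elements, return the high 16 bits.
    return [((x - y) >> size) & 0xFFFF for x, y in zip(qn, qm)]

# Q15 inputs (assumed scaling): a = b = 0.5, c = d = 0.25,
# so e = a*b - c*d = 0.25 - 0.0625 = 0.1875.
a = b = 16384          # 0.5 in Q15
c = d = 8192           # 0.25 in Q15
ab = [a * b]           # Q30 product vector (one element for brevity)
cd = [c * d]
e = vsbh(ab, cd)       # high halves of the Q30 difference, i.e. Q14
```

Interpreting the narrowed element as Q14 recovers 0.1875 exactly, illustrating why narrowing after the subtraction preserves accuracy.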
ARM have provided their instruction set with an instruction encoding which
allows an immediate to be specified with some instructions. Clearly, the immediate
size should be limited if it is encoded with the instruction.

An immediate value of a size suitable for encoding with an instruction has
limited use in SIMD processing where data elements are processed in parallel. In
order to address this problem, a set of instructions with generated constant is provided
that have a limited size immediate associated therewith, but have the ability to expand
this immediate. Thus, for example, a byte sized immediate can be expanded to
produce a 64-bit constant or immediate. In this way the immediate can be used in
logical operations with a 64-bit source register comprising multiple source data
elements in SIMD processing.
Figure 63 shows an immediate abcdefgh that is encoded within an instruction
along with a control value, which is shown in the left hand column of the table. The
binary immediate can be expanded to fill a 64-bit register, the actual expansion
performed depending on the instruction and the control portion associated with it. In
the example shown, the 8-bit immediate abcdefgh is repeated at different places
within a 64-bit data value, the positions at which the immediate is placed depending
on the control value. Furthermore, zeros and/or ones can be used to fill the empty
spaces where the value is not placed. The choice of either ones and/or zeros is also
determined by the control value. Thus, in this example a wide range of possible
constants for use in SIMD processing can be produced from an instruction having an
8-bit immediate and 4-bit control value associated with it.

In one embodiment (last line of the table), instead of repeating the immediate
at certain places, each bit of the immediate is expanded to produce the new 64-bit
immediate or constant.

As can be seen, in some cases the constant is the same in each lane, while in
others different constants appear in some of the lanes. In some embodiments (not
shown), the possibility of inverting these constants is also provided and this also
increases the number of constants that can be generated.
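A software model of the expansion makes the idea concrete. The mode names below are hypothetical labels, not the actual 4-bit control encodings of Figure 63; they show three representative expansions (byte repetition, placement with zero fill, and per-bit expansion).

```python
def expand_imm(imm8, mode):
    # Expand an 8-bit immediate into a 64-bit constant. The mode strings
    # are illustrative stand-ins for the control value of Figure 63.
    if mode == "repeat_bytes":
        # the immediate placed in every byte of the 64-bit constant
        return int.from_bytes(bytes([imm8]) * 8, "big")
    if mode == "low_byte_of_halfwords":
        # the immediate in the low byte of each 16-bit lane, zeros elsewhere
        return int.from_bytes(bytes([0, imm8]) * 4, "big")
    if mode == "bit_expand":
        # each bit of the immediate widened to a full byte of ones or zeros
        return int.from_bytes(
            bytes(0xFF if (imm8 >> (7 - i)) & 1 else 0x00 for i in range(8)),
            "big")
    raise ValueError(mode)

# expand_imm(0x5A, "repeat_bytes")          -> 0x5A5A5A5A5A5A5A5A
# expand_imm(0x5A, "low_byte_of_halfwords") -> 0x005A005A005A005A
# expand_imm(0b10100000, "bit_expand")      -> 0xFF00FF0000000000
```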
An example of the format of an instruction that can be used for constant
generation as shown in Figure 63 is given below. In this instruction <value> is the
data portion or immediate and <mode> is the control portion which provides an
indication as to how the <value> portion is to be expanded within the generated
constant (shown as different lines in the table of Figure 63).

VMOV Dd, #<value>, <mode>

where <value> is a byte and <mode> is one of the enumerated expansion functions.
These adapted instructions generally have an associated data value that has a
data portion which comprises the immediate <value> and a control portion <mode>.
As is shown in Figure 63, the control portion indicates how the immediate is to be
expanded. This may be done in a variety of ways, but in some embodiments, the
control portion indicates which expansion of the constant is to be performed using
constant generation logic.
Figure 64 schematically shows an example of constant generation logic
operable to generate a constant from a data portion 1210 and a control portion 1200
associated with an instruction according to the present technique. In the example
shown, the control portion 1200 controls the constant generation logic 1220, which
comprises gates 1230 to output either a portion of the data value 1210, or a one or a
zero, to each bit within the constant 1240 to be generated.
Figure 65 shows a data processor (integrated circuit) similar to that shown in
Figure 1, with like reference numerals representing like features. Figure 65 differs
from Figure 1 in that it explicitly shows constant generation logic 1220. Constant
generation logic 1220 can be considered to be adjacent to, or forming part of, the
decode/control portion 14, 16. As can be seen, instructions are sent from the
instruction pipeline 12 to the decode/control logic 14, 16. This produces control
signals which control the operation of the SIMD processing logic 18, the load store
unit 22, and the scalar processing portion 4, 6, 8, 10 of the processor. If an instruction
with constant generation is received at the decode/control portion 14, 16, the constant
generation logic is used to generate a constant for use in SIMD processing. This can
either be sent directly to the SIMD register data store 20 (dotted line 1222), or if the
instruction with constant generation comprises a SIMD data processing part, the
generated constant is sent to the SIMD processing logic (line 1224) where further
manipulations are performed on the generated constant to produce a new data value.
Figures 66A and 66B schematically illustrate the two different paths shown in
Figure 65. Figure 66A shows the case where the instruction generates a constant
which is sent directly to the register store, i.e. dotted line 1222. Figure 66B shows
the case where the instruction with generated constant comprises a data processing
part. In this case data processing operations (OP) are performed on the generated
constant and a further source operand 1250 to produce a final data value 1260 in
response to the instruction; this corresponds to line 1224 of Figure 65.
In addition to the constants shown in Figure 63 and their inversions,
additional data processing operations such as an OR, AND, test, add or subtract can
be performed on the generated constants to generate a much wider range of data
values. This corresponds to Figure 66B and path 1224 in Figure 65. Table 12 gives
an example of bitwise AND and bitwise OR that can be used to generate some
additional data values.
Mnemonic  Data Type  Operand Format        Description
VAND      none       Dd, #<value>, <mode>  Bitwise AND with generated constant
                                           Vd := Vd & <generated constant>
VORR      none       Dd, #<value>, <mode>  Bitwise OR with generated constant
                                           Vd := Vd | <generated constant>

Table 12
The ability to perform further data processing operations on the generated
constants can have a variety of uses. For example, Figure 67 shows how
embodiments of the present technique can be used to generate a bit mask to extract a
certain bit or bits from a number of data elements in a vector. In the example shown
the fourth bit of each data element from a source vector is extracted. Initially the
immediate 8 is expanded by repeating it, and then this is followed by a logical AND
instruction which ANDs the generated constant with a source vector to extract the
desired bit from each data element. These operations are performed in response to
the instruction

VAND Dd, #0b00001000, 0b1100

wherein the value 0b1100 refers to a generated constant comprising an
expanded data portion (see Figure 63).
Although a particular embodiment has been described herein, it will be
appreciated that the invention is not limited thereto and that many modifications and
additions thereto may be made within the scope of the invention. For example, various
combinations of the features of the following dependent claims could be made with the
features of the independent claims without departing from the scope of the present
invention.
We Claim
1. Apparatus for processing data, said apparatus comprising:
a register data store (20) operable to store a plurality of data elements;
and processing logic (6, 8, 10, 12) responsive to a data processing instruction to perform
a data processing operation in parallel upon a selected plurality of data elements accessed as a
register of said register data store (20);
wherein data elements of said selected plurality of data elements share one of a plurality
of different data element sizes, said register has one of a plurality of different register sizes and
said data processing instruction specifies for said data processing operation a data element size
shared by said selected plurality of data elements and a register size of said register; and
wherein said apparatus comprises register accessing logic operable to map said register to
a portion of said register data store (20) dependent upon said register size of said register such
that a data element stored within said portion of said register data store (20) is accessible as a
part of respective different registers of differing register size.
2. Apparatus as claimed in claim 1, wherein said data processing instruction specifies one or
more source registers each with a respective source register size and source data element size
specified by said data processing instruction.
3. Apparatus as claimed in any one of claims 1 and 2, wherein said data processing
instruction specifies one or more destination registers with a destination register size and
destination data element size specified by said data processing instruction.
4. Apparatus as claimed in claims 2 and 3, wherein said destination data element size differs
from at least one of said one or more source data element sizes.
5. Apparatus as claimed in any one of the preceding claims, wherein said register accessing
logic is configured to read from said data store within a single further register a set of data
elements written to said register data store (20) with two different registers.
6. Apparatus as claimed in claim 5, wherein said register accessing logic is configured to write
said set of data elements with two different registers to adjacent portions of said register data
store (20).
7. Apparatus as claimed in claim 6, wherein said single further register has a register size
equal to a sum of the register sizes of said two different registers.
8. Apparatus as claimed in any one of the preceding claims, wherein said register accessing
logic is configured to read from said data store within two different further registers a group of
data elements written together to said register data store (20) from a single register.
9. Apparatus as claimed in claim 8, wherein said two different further registers read from
adjacent portions of said register data store (20).
10. Apparatus as claimed in claim 9, wherein said single register has a register size equal to a
sum of the register sizes of said two different further registers.
11. Apparatus as claimed in any one of claims 1 to 7, wherein a group of data elements
written together to said register data store (20) from a first register of a first register size can be
read from said register data store (20) within a second register of a second register size, said first
register size being different to said second register size.
12. Apparatus as claimed in any one of the preceding claims, wherein said data processing
instruction specifies two source registers with respective register sizes S1 and S2 and a
destination register with a register size D.
13. Apparatus as claimed in claim 12, wherein S1 = S2 = D.
14. Apparatus as claimed in claim 12, wherein 2*S1 = 2*S2 = D.
15. Apparatus as claimed in claim 12, wherein 2*S1 = S2 = D.
16. Apparatus as claimed in claim 12, wherein S1 = S2 = 2*D.
17. Apparatus as claimed in any one of claims 1 to 11, wherein said data processing
instruction specifies a source register with a register size S and a destination register with a
register size D.
18. Apparatus as claimed in claim 17, wherein S=D.
19. Apparatus as claimed in claim 17, wherein 2*S=D.
20. Apparatus as claimed in claim 17, wherein S=2*D.
21. Apparatus as claimed in any one of the preceding claims, wherein said data processing
instruction includes a register specifying field operable to specify a register within said register
data store (20), and said register accessing logic is configured to map said register to a portion of
said register data store (20) such that said register for a given register specifying field
corresponds to a different portion of said register data store (20) in dependence upon said data
element size and said register size.
22. Apparatus as claimed in claim 21, wherein said register accessing logic is configured to
map a plurality of registers corresponding to a range of said register specifying field for a given
data element size and register size to a contiguous portion of said register data store (20) when
accessed with registers using at least one of a different register size and a different data element
size.
23. Apparatus as claimed in any one of the preceding claims, wherein said different registers
of differing register size include at least one register with a different data element size from that
of said data element stored within said portion.
24. Apparatus as claimed in any one of the preceding claims, wherein said data processing
instruction includes a plurality of bits encoding a register number of said register, said plurality
of bits being mappable to a contiguous field of bits which is rotatable by a number of bit positions
dependent upon said register size to form said register number.
25. Apparatus as claimed in claim 24, wherein said register accessing logic is also operable
to access said register data store (20) as a scalar register storing a single data element read from
said register data store (20).
26. Apparatus as claimed in claim 24, wherein said register accessing logic is also operable
to access said register data store (20) as a register storing a plurality of copies of a single data
element read from said register data store (20).
27. Apparatus as claimed in any one of claims 25 and 26, wherein said register accessing
logic is operable to generate a row address and a column address for accessing said register data
store (20), a first part of said contiguous field of bits corresponding to said row address and a
second part of said contiguous field of bits corresponding to said column address.
28. Apparatus as claimed in claim 27, wherein one or more boundaries between said first part
and said second part vary in position in dependence upon said data element size.
29. A method of processing data, said method comprising the steps of:
storing a plurality of data elements within a register data store (20); and
in response to a data processing instruction, processing logic (6, 8, 10, 12) performing a
data processing operation in parallel upon a selected plurality of data elements accessed as a
register of said register data store (20), wherein:
data elements of said selected plurality of data elements share one of a plurality of
different data element sizes, said register has one of a plurality of different register sizes and said
data processing instruction specifies for said data processing operation a data element size shared
by said selected plurality of data elements and a register size of said register, and
wherein said method comprises register accessing logic mapping said register to a portion
of said register data store (20) dependent upon said register size of said register such that a data
element stored within said portion of said register data store (20) is accessible as a part of
respective different registers of differing register size.
30. A method as claimed in claim 29, wherein said data processing instruction specifies one
or more source registers each with a respective source register size and source data element size
specified by said data processing instruction.
31. A method as claimed in any one of claims 29 and 30, wherein said data processing
instruction specifies one or more destination registers with a destination register size and
destination data element size specified by said data processing instruction.
32. A method as claimed in claims 30 and 31, wherein said destination data element size
differs from at least one of said one or more source data element sizes.
33. A method as claimed in any one of claims 29 to 32, said register accessing logic reading
from said register data store (20) within a single further register a set of data elements written to
said register data store (20) with two different registers.
34. A method as claimed in claim 33, wherein said register accessing logic writing said set
of data elements with two different registers to adjacent portions of said register data store (20).
35. A method as claimed in claim 34, wherein said single further register has a register size
equal to a sum of the register sizes of said two different registers.
36. A method as claimed in any one of claims 29 to 35, said register accessing logic reading
from said register data store (20) within two different further registers a group of data elements
written together to said register data store (20) from a single register.
37. A method as claimed in claim 36, wherein said two different further registers read from
adjacent portions of said register data store (20).
38. A method as claimed in claim 37, wherein said single register has a register size equal to
a sum of the register sizes of said two different further registers.
39. A method as claimed in any one of claims 29 to 35, wherein a group of data elements
written together to said register data store (20) from a first register of a first register size can be
read from said register data store (20) within a second register of a second register size, said first
register size being different to said second register size.
40. A method as claimed in any one of claims 29 to 39, wherein said data processing
instruction specifies two source registers with respective register sizes S1 and S2 and a destination
register with a register size D.
41. A method as claimed in claim 40, wherein S1 = S2 = D.
42. A method as claimed in claim 40, wherein 2*S1 = 2*S2 = D.
43. A method as claimed in claim 40, wherein 2*S1 = S2 = D.
44. A method as claimed in claim 40, wherein S1 = S2 = 2*D.
45. A method as claimed in any one of claims 29 to 39, wherein said data processing
instruction specifies a source register with a register size S and a destination register with a
register size D.
46. A method as claimed in claim 45, wherein S=D.
47. A method as claimed in claim 45, wherein 2*S=D.
48. A method as claimed in claim 45, wherein S=2*D.
49. A method as claimed in any one of claims 29 to 44, wherein said data processing
instruction includes a register specifying field operable to specify a register within said register
data store (20), and said register accessing logic maps said register to a portion of said register
data store (20) such that said register for a given register specifying field corresponds to a
different portion of said register data store (20) in dependence upon said data element size and
said register size.
50. A method as claimed in claim 49, said register accessing logic mapping a plurality of
registers corresponding to a range of said register specifying field for a given data element size
and register size to a contiguous portion of said register data store (20) when accessed with
registers using at least one of a different register size and a different data element size.
51. A method as claimed in any one of claims 29 to 50, wherein said register accessing logic
maps said register to a portion of said register data store (20) such that said different registers of
differing register size include at least one register with a different data element size from that of
said data element stored within said portion.
52. A method as claimed in any one of claims 29 to 51, wherein said data processing
instruction includes a plurality of bits encoding a register number of said register, said plurality
of bits being mappable to a contiguous field of bits which is rotatable by a number of bit positions
dependent upon said register size to form said register number.
53. A method as claimed in claim 52, wherein said register accessing logic is also operable to
access said register data store (20) as a scalar register storing a single data element read from
said register data store (20).
54. A method as claimed in claim 52, wherein said register accessing logic is also operable to
access said register data store (20) as a register storing a plurality of copies of a single data
element read from said register data store (20).
55. A method as claimed in any one of claims 53 to 54, wherein said register accessing logic
is operable to generate a row address and a column address for accessing said register data store
(20), a first part of said contiguous field of bits corresponding to said row address and a second
part of said contiguous field of bits corresponding to said column address.
56. A method as claimed in claim 55, wherein one or more boundaries between said first part
and said second part vary in position in dependence upon said data element size.
57. A computer program product comprising a computer program including at least one data
processing instruction operable to control processing logic to perform a method as claimed in
any one of claims 29 to 56.