Title of Invention	"IMPROVED WAVE PIPELINED ARRAY MULTIPLIER"
Abstract	The present invention provides an improved m x n wave pipelined array multiplier comprising an array of combinational logic blocks, each block having two outputs denoted as sum and carry outputs, two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to j+lth combinational logic block in the i+lth row, the carry output of the jth combinational logic block in the ith row is connected to jth combinational logic block of i+lth row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of n-lth row and sum output of second extreme right combinational logic block of n-lth row, the extreme left combinational logic block in the nth row is connected to sum output of extreme left combinational logic block and carry output of second extreme left combinational logic block in the n-lth row, jth (j=2 to n-1) combinational logic block in nth row is connected to sum outputs of j-1th and jth combinational logic blocks in n-lth row and carrys from jth and j+lth combinational logic blocks in the n-lth row, for each of the combinational logic block in the last (n/2)-l rows excepting the extreme right block, the jth block is connected to sum outputs of jth and j+lth combinational logic blocks and carry outputs from j+lth and j+2th combinational logic blocks in the previous row, each of the extreme right combinational logic block of the last (n/2)-l rows of the combinational logic block arrays is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last n/2th rows will provide the final multiplication.

Title of Invention

"IMPROVED WAVE PIPELINED ARRAY MULTIPLIER"

Abstract

The present invention provides an improved m x n wave pipelined array multiplier comprising an array of combinational logic blocks, each block having two outputs denoted as sum and carry outputs, two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to j+lth combinational logic block in the i+lth row, the carry output of the jth combinational logic block in the ith row is connected to jth combinational logic block of i+lth row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of n-lth row and sum output of second extreme right combinational logic block of n-lth row, the extreme left combinational logic block in the nth row is connected to sum output of extreme left combinational logic block and carry output of second extreme left combinational logic block in the n-lth row, jth (j=2 to n-1) combinational logic block in nth row is connected to sum outputs of j-1th and jth combinational logic blocks in n-lth row and carrys from jth and j+lth combinational logic blocks in the n-lth row, for each of the combinational logic block in the last (n/2)-l rows excepting the extreme right block, the jth block is connected to sum outputs of jth and j+lth combinational logic blocks and carry outputs from j+lth and j+2th combinational logic blocks in the previous row, each of the extreme right combinational logic block of the last (n/2)-l rows of the combinational logic block arrays is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last n/2th rows will provide the final multiplication.

Full Text	FIELD PROGRAMMXm4i4TARRAY (FPGA) BASED WAVE PIPELINED' ARRAY MULTIPLIER (WPARAM) This invention relates to Field Programmable Gate Array (FPGA) based wave pipelined array multiplier (WPARAM). BACKGROUND The conventional 4x4 array multiplier is shown in figure 1 of the accompanying drawings. Consider the multiplication of the 4-bit number X3 X2 XI XO with Y3 Y2 Yl YO. In Fig.l, the multiplication is carried out using the conventional "paper and pencil" technique. Different stages of computation in this array multiplier are separated by dotted lines and are indicated as stage 0, stage 1, stage 2, stage 3 & Stage 4. At stage 0, the product of YO with X3 X2 XI XO is computed. In stage 1, the product of Yl with X3 X2 XI XQ is computed and added to the result of the previous stage. Stages 0 to 3 are used for multiplication by the individual bits of Y with X and stage 4 is used for propagating the carry. The combinational logic blocks MO, Ml, M2 in stages 0 to 3, consists of AND gates/half adders/Full-adders. The above array multiplier can be fed with a fresh data only after the input is processed by all the five stages. The limitation of this conventional array multiplier is that the minimum time required for multiplication is equal to the time taken for processing the input in all the five stages resulting in slowing down of the speed. In order to speed up the conventional array multiplier, ripple carry operation done in Stage 4 may be replaced with carry save operation distributed over 4 stages. The multiplication rate can be increased by introducing registers at the output of the combinational logic blocks MO, Ml, M2 & M3 at each stage shown in Fig.l. The resulting multiplier is shown in Fig. 2 and is called a Pipelined Array multiplier. In this multiplier, the multiplication can be done at a rate which depends only on the largest time taken for processing the inputs at one of the stages, the interconnect delay between the combinational logic blocks and the flip-flop (register) and set-up time as well as hold-time of the flip-flop. In the Pipelined Array multiplier, the product PO becomes available after 2 clock cycles, PI after 3 clock cycles and so on. P7 becomes available after 9 clock pulses from the time the corresponding operands are fed at the input register at the input of the pipelined array multiplier. Hence the latency of the multiplier is 9 clock cycles. In order to ensure that all the product terms arrive simultaneously, the product &ts PO to P6 have to be delayed by 7 to 1 clock cycle respectively. This is achieved by using a chain of Flip-flops: PO requires 7 flip-flops (registers), PI requires 6 flip-flops and so on. In general, for an NxN multiplier, 2N combinational logic blocks separated by 2N+1 stages of registers are required. The latency of the multiplier is 2N+1 clock cycles. In the conventional pipelined array multiplier shown in figure 2, the drawback is that when implemented on FPGA with four input LUT, all the four inputs are not effectively utilized in the first stage and in the last N stages and thereby resulting in more area for implementation and latency. n our co-pending application the drawbacks of the conventional nultipliers shown in figures 1 & 2 have been obviated and all the four inputs of LUTs have een effectively utilized. In our co-pending applicationwe have claimed an improved m x n pipelined array multiplier comprising: - an array of combinational logic blocks, each block having two outputs denoted as sum and carry outputs for the sake of convenience, a set of registers connected to each said combinational logic blocks in a row of said array of combinational logic blocks, two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to j+lth combinational logic block in the i+lth row, the carry output of the jth combinational logic block in the ith row is connected to jth combinational logic block of i+lth row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of n-lth row and sum output of second extreme right combinational logic block of n-lth row, the extreme left combinational logic block in the nth row is connected to sum output of extreme left combinational logic block and carry output of second left combinational logic block in the n-lth row, jth (j=2 to n-1) combinational logic block in nth row is connected to sum outputs of j-1th and jth combinational logic blocks in n-lth row and carrys from jth and j+lth combinational logic blocks in the n-lth row, for each of the combinational logic block in the last (n/2)-l rows excepting the extreme right block, the jth block is connected to sum outputs of jth and j+lth combinational logic blocks and carry outputs from j+lth and j+2th combinational logic blocks in the previous row, each of the extreme right combinational logic block of the last (n/2)-l rows of the combinational logic block arrays is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last n/2th rows will provide the final multiplication thereby resulting in reduction in the area and latency. The object of the present invention is to increase the speed of multiplication reducing the area, power consumption and latency. To achieve the said objective, an improved wave pipelined multiplier has been developed. The present invention therefore provides an improved m x n wave pipelined array multiplier comprising: an array of combinational logic blocks, each block having two outputs denoted as sum and carry outputs for the sake of convenience, two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to j+lth combinational logic block in the i+lth row, the carry output of the jth combinational logic block in the ith row is connected to jth combinational logic block of i+lth row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of n-lth row and sum output of second extreme right combinational logic block of n-lth row, the extreme left combinational logic block in the nth row is connected to sum output of extreme left combinational logic block and carry output of second extreme left combinational logic block in the n-lth row, jth (j=2 to n-1) combinational logic block in nth row is connected to sum outputs of j-lth and jth combinational logic blocks in n-lth row and carrys from jth and j+lth combinational logic blocks in the n-lth row, for each of the combinational logic block in the last (n/2)-l rows excepting the extreme right block, the jth block is connected to sum outputs of jth and j+lth combinational logic blocks and carry outputs from j+lth and j+2th combinational logic blocks in the previous row, each of the extreme right combinational logic block of the last (n/2)-l rows of the combinational logic block arrays is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last n/2th rows will provide the final multiplication, and a delay control means is provided between combinational logic blocks in the ith row and i+lth row to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in the ith and i+lth rows, thereby increasing the speed of multiplication, reducing the area, power consumption and latency. In the above improved m x n wave pipelined array multiplier, m x n = 4 x 4. An improved 4x4 wave pipelined array multiplier comprising: an array of combinational logic blocks (each block having two outputs denoted as sum and carry outputs for the sake of convenience), two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining 2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks, in first row combinational logic blocks 1 to 4 are connected to second row such that the sum output of the combinational logic blocks 1 to 3 in the first row is connected to 2 to 4 combinational logic blocks in the second row, in second row combinational logic blocks 1 to 4 are connected to third row such that the sum output of the combinational logic blocks 1 to 3 in the second row is connected to 2 to 4 combinational logic blocks in the third row, in first row combinational logic blocks 2 to 4 are connected to second row such that the carry output of the combinational logic blocks 2 to 4 in the first row is connected to 2 to 4 combinational logic blocks in the second row, in second row combinational logic blocks 2 to 4 are connected to third row such that the carry output of the combinational logic blocks 2 to 4 in the second row is connected to 2 to 4 combinational logic blocks in the third row, in the fourth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of third row and sum output of second extreme right combinational logic block of third row, the third extreme left combinational logic block in fourth row is connected to the sum outputs of second extreme left and third extreme left combinational logic blocks in the third row and carry outputs of third extreme and fourth extreme left combinational logic blocks in the third row the second extreme left combinational logic block in fourth row is connected to the sum outputs of extreme left and second extreme left combinational logic blocks in the third row and carry outputs of second extreme and third extreme left combinational logic blocks in the third row the extreme left combinational logic block in the fourth row is provided inputs from sum output of extreme left and carry output of second extreme left combinational logic blocks in the third row, the left combinational logic block of the last row of the combinational logic block array is provided with sum outputs of extreme left and second extreme left combinational logic blocks and carry outputs of second extreme left and third extreme left combinational logic blocks of the previous row the right combinational logic block of the last row of the combinational logic block array is provided with the sum output of the second left combinational logic block and the carry output of the third left combinational logic block in the previous row, sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last two rows will provide the final multiplication, and a delay control means is provided between combination logic blocks in the rows 1 to 4 and rows 2 to 5 to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in rows 1 to 4 and 2 to 5, thereby increasing the speed of multiplication, reducing the area, power consumption and latency. The delay control means comprising selected interconnect wires between combinational logic blocks such that the sum of the interconnect delay between said logic blocks and the combinational logic blocks delay itself is equal for the selected interconnected wires between logic blocks in the adjacent row. The array of combinational logic blocks includes LUTs. At least one register is included in the input of the multiplier and two registers in the output of the multiplier are provided. The two accumulators are provided in the output of the multiplier for storing the results alternatively. Brief description of the drawings The invention will now be described with reference to the accompanying drawings. Figure 1 shows the conventional 4x4 array multiplier. Figure 2 shows the conventional 4x4 pipelined array multiplier. Figure 3 shows an improved 4x4 pipelined array multiplier described in our co-pending Figure 4 shows an improved wave pipelined array multiplier, according to this invention. Figure 5 shows the conventional 4x4 wave pipelined guild multiplier. Figure 6 shows an improved wave pipelined array multiplier with accumulator, according to this invention. Figure 7 shows a clock circuit. Detailed description Referring to the drawings, figures 1 and 2 have been explained under the heading 'Background'. It is assumed that FPGAs with 4 input Look-Up-Table (LUT) are used for the implementation. One of the objectives of the synthesis technique is to ensure that all the 4 inputs of the LUTs are effectively engaged. When half-adders are implemented, LUT cannot be efficiently utilized. Accordingly, the stages involving half-adders have to be i modified/combined so that LUTs are effectively utilized. Keeping this in view, the stageO and stage 1 of Fig.2 may be modified to compute the partial products due to the two least significant multiplier bits. The last N stages may be reduced to N/2 stages by replacing the half adders with suitable functional blocks and feeding the sum and carry outputs from one stage to another properly. For example, consider the multiplication of the 4-bit number X3 X2 XI XO with Y3 Y2 Yl YO. In stage 0 of the original scheme, product of YO with X3 X2 XI XO is computed. In stage 1, product of Yl with X3 X2 XI XO is computed. These two stages can be modified as follows: The partial product obtained by the least significant two bits of the multiplier (Yl YO) with X3 X2 XI XO is shown in table 1, given below. Since 4-input LUT is available, the partial products corresponding to each column of table 1 can be computed using a single LUT. Hence the two stages of original multiplier of figure 2 can be reduced to a single stage with 4-input LUTs. Table 1 Partial products of multiplying (he multiplicand with two LSBs of multiplier Stages two and three of the original scheme actually require 4-input functional blocks (MO, M2, M2, M2) and hence they can be implemented using the 4-input LUTs efficiently and no modification in the original scheme is required for these two stages. Stages 4 to 7 in the original scheme is used for carry propagation and they use only half-adders for this purpose. Every two stages can be properly combined into one stage in view of the use of 4-input LUTs. Keeping in view the above modifications, an improved 4x4 pipelined array multiplier has been developed and is shown in Fig. 3 of our co-pending application no This is also referred as Optimally Synthesized Pipelined Array Multiplier (OSPAM). It consists of 5 combinational logic blocks (MO, M5, M5, M5, MO - stage 0; MO, M2, M2, M2 - stage 1; MO, M2, M2, M2 - stage 2; M8, M7, M7, M6 - stage 3; M10, M9 - stage 4) separated by 6 stages of registers (0-5) are only required. The original multiplier in figure 2 requires 8 stages of combinational logic blocks and 9 stages of registers. Further the latency "of the multiplier shown in figure 3 is reduced from 9 clock cycles to 6 clock cycles. Hence the latency is reduced by about 30% and the number of registers required are also reduced. In general MxN multiplication can be achieved using (M-l)+(N/2) stages of registers with the latency of M+(N/2) clock cycles. This results in lower latency and requires less area for implementation. Wavepipelined array multiplier (WPARAM): Wavepipelining is a technique for increasing the throughput of a digital system without introducing pipelining registers between adjacent combinational logic blocks. The wavepipelined array multiplier can be obtained from the circuit of Fig. 3. of our co- pending application not,, aS follows: The 4X4 wavepipelined array multiplier of Fig.4 may be obtained as follows:, The latches (registers) in stages 0 - 4 in Fig.3 of our co-pending application no. are removed and the inputs to the various combinational logic blocks denoted as MO, M2, M5, M6, M7, M8, M9 & M10 are made to arrive at the same time at each stage. The appropriate inputs to the various blocks MO to M10 are shown in Fig. 4. For example the sum output of the left most MO block from stageO is fed as one of the inputs to the left most M2 block in the stagel; the carry from the left most M5 block of stageO is fed as another input to the left most M2 block of stagel. These inputs should arrive at stagel with equal delay from stageO. This can be achieved by properly adjusting the routing delays of the inputs from stageO to stagel. In general, wavepipelining of an mxn array multiplier is achieved by carrying out the following modifications on the pipelined array multiplier. 1. The delays of the inputs to each of the combinational logic block are adjusted such that they undergo the same delay, i.e. Sum of the interconnect delay and the LUT delay should be made equal. 2. The pipelining registers are removed and instead the inherent capacitance at the output of one combinational logic block will be adequate to hold the data till the next combinational logic block processes this data and becomes stable. For carrying out the first modification, the interconnect delays between one LUT and the next LUT should be adjusted appropriately. Since Xilinx Foundation Series CAD tool permits the delays to be viewed as well as varied, the wavepipelining technique can be applied to Xilinx FPGAs. However, this technique is applicable for other LUT based FPGAs as well provided there is mechanism to view the interconnect delays and adjust them. After the data is processed by a number of wavepipelined stages, finally the output has to be latched for it to be processed further by the conventional pipelined synchronous systems. For the purpose of latching, the latch in the CLB may be used. The clock input to the XC4000E -1 family CLB requires a minimum of 3ns on-time and 3ns off-time. By pumping-in the data to the wavepipelined multiplier at the rate of 5.4ns and using two latches (registers) to store the output alternately (see figure 6), the minimum on-time and off-time requirements can be satisfied. In this case, each latch stores only one data in a period of 10.8ns. One latch uses the positive edge of the clock and the other latch uses the negative edge. In other words interleaving technique can be used to achieve a multiplication rate of 5.4ns and use latches with clock period of 10.8ns. However the conventional pipelined array multiplier works typically at 10ns due to set-up time and routing complexity. Hence the wavepipelined multiplier is about twice faster than the pipelined multiplier. To ease the routing complexity and to minimize the resources, the wavepipelining delay per stage is retained at 2.7ns rate and a single data is pumped-in twice to the multiplier to effectively feed data at the rate of 5.4ns to the latches. This results in a latency of 2.7ns x number of wavepipelined multiplier stages. However for the conventional pipelined array multiplier the latency is equal to the clock period x number of pipelined multiplier stages. Since the minimum clock period that can be used is about 10ns for the conventional pipelined array multipliers, the latency of the wavepipelined array multiplier is about one third of the latency of the pipelined array multiplier. Comparison of WPARAM with the Wavepipelined Guild Multiplier: Wavepipelined Guild Multiplier shown in Fig. 5 is obtained from the pipelined guild multiplier by removing the pipeline registers and equalizing the delays of all the paths crossing the equitemporal lines. The Guild multiplier requires more area than Accordingly, wavepipelined Guild multiplier also requires more area than the W* Further since, the WP Guild multiplier requires manual routing, increase of area also im, increase in the routing complexity. Further the latency of WP guild multiplier is more thai, that of WP ARAM. Implementation of Area efficient Filters using WPARAM In figure 6, an interleaving scheme for using a single multiplier and two accumulators for finding the sum of products corresponding to two parallel filters is shown. One accumulator finds the sum of products corresponding to odd samples and other accumulator finds the sum of products corresponding to even samples. In addition to this, the third accumulator in figure 6 is used for accumulating the output of the odd filter and even filter. Two parallel filters each with 8-taps and the input samples and the impulse response coefficients represented using 2-bits, 6 bits respectively is shown in figure 6. In XC4010E-1 device, a 2x6 WPARAM can be made to operate at a peak rate of 5.4ns(nanoseconds). The 11-bit accumulator for this filter requires 10ns in the same device. Since the accumulator is slower than the multiplier, a single multiplier may be shared between the two accumulators using an interleaving technique. The multiplier processes alternately the data and coefficients corresponding to odd filter and even filter. The multiplier output corresponding to each filter is latched into two separate registers. A 5.4 ns clock is used to pump-in data into the WPARAM. A 10.8 ns clock is used for latching the data into the registers. Data is latched into one of the registers at the positive edge of the 10.8ns clock and the reverse is true for the other register. A single RAM may be used for storing the input samples. The odd locations contain the odd numbered samples and even locations contain the even numbered samples. Similarly a single RAM may be used for storing the impulse response coefficients. The odd locations contain the coefficients of one filter and the even locations contain the coefficients of the other filter. The basic clock of 5.4ns is used for the multiplier. This clock is divided further using counters to generate the 10.8ns clock and the addresses for the RAM. Results and Conclusions: In order to study the characteristics of the instant multiplier (WPARAM) and compare it with the other multipliers discussed in our co-pending application, 4X4 multipliers are implemented on XC4010E-1 device. Input to the above multipliers may be fed directly through the IO pad, through a register or through a RAM. Feeding the inputs through a RAM becomes convenient when the multiplier is to be used for implementing a filter. Because of the initial latency of the pipelined array multipliers they become efficient only if it is used for performing a number multiplications one after another as is true in the case of computation of sum of products of elements of two vectors. Pipelined array multipliers, OSPAM and Guild multipliers can be tested folly using the simulation tool. Hence these multipliers are tested using simulation and they are found to be satisfactory. However OPARAM, WPARAM and wavepipelined Guild multiplier use manual routing and the interconnect delays play a crucial role in determining their proper operation. Since the interconnect delays are non-deterministic, there may not be one to one correspondence between the delays measured by the CAD tool and the actual delay obtained in a particular device after the design is downloaded to the device. Hence to verify the proper operation of these multipliers, they should be studied not only by using simulation but also by pumping in the test data on to the actual device. For this purpose the test data is applied to both OPARAM and WPARAM through the parallel port of a Personal Computer to the demo board housing the FPGA and the results are read back for verification. The multiplier and the multiplicand are stored in two RAMs RAMO and RAM1. The content of these RAMs are read at the rate of 6ns/5.4ns and fed to the OPARAM/WPARAM respectively. However the wavepipelined Guild multiplier has not been implemented by us. For testing the design on the actual device, the clock signal should also be applied to the multiplier. The clock may be taken from an external source or it may have to be internally generated. One of the requirements for extracting the peak rate from the OPARAM is the ability to operate the pipelining registers at their maximum rate. This in turn requires a clock which operates at this rate. Similarly for the WPARAM and wavepipelined Guild multiplier, a high frequency clock may be required if they are to be operated at the peak rate. The minimum clock period for the WPARAM is equal to the sum of LUT delay and the interconnect delay between the output of the one LUT and the input to the next LUT. The LUT delay of XC4000E family FPGA with speed grade of l(i.e.XC4000E-l) is l.Sns.The interconnect delay can be varied from 0.7ns onwards. An interconnect delay of 1.4ns may be chosen so that a particular LUT output can be fed to a number of LUTs located at different positions in the FPGAs with equal delay. With this a total delay of 2.7 ns is required for each stage of combinational logic block. This enables the data to be pumped-in at a rate of 2.7ns and its multiples. Low cost development and prototype boards may not be able to support signals at high frequency rate even though the FPGA as such can operate internally at rates higher than that which is possible with the board. For example, the demo board for XC4000E device family from Xilinx has a maximum operating speed of 100MHz. Hence pumping in clocks above this frequency using an external source becomes impossible. Instead the clocks have to be internally generated. For this purpose the clock circuit shown in Fig. 7 is used. The LUT delay and the interconnect delay determines the clock period. When circuits of this kind are used, one of the important issue is the stability of the clock generated and extent to which it affects the operation of the pipelined multiplier. To verify this, the stability of the clock and the stability of the output of the pipelined array multiplier need to be checked over a period of at least few hours. For this purpose, the above test was carried out on the XC4010E-1 device and it was found to be working satisfactorily. The clock generated using this technique has been used for OP ARAM and WPARAM. If the board is capable of supporting a high frequency clock, an external source itself can be used for testing the multipliers. The different characteristics of the multipliers such as number of CLBs required, maximum operating frequency, latency are evaluated for all the above multipliers and are tabulated in Table 2. Implementation results of the 7X7 wavepipelined Guild multiplier reported in the literature are also given in table2. For the sake of comparison 7X7 OPARAM is implemented on XC4010E-1 and the results are given in table2. The WPARAM of different sizes are implemented and tested. WPARAM of size 4X4 and 2X6 are found to be satisfactory at a clock rate of 185 MHz. However WPARAM of size 4X6, 6X6 are found to be unsatisfactory at this frequency. The plausible reasons for this could be as follows: As mentioned earlier, for testing the WPARAM, the test data is stored in two RAMs. The delay between the instant when a new address is applied to the RAM and the instant when a stable output appears at the input to the latch after getting processed by the k wavepipelined stages of the multiplier is 2.7(k+l)ns. This data remains stable for a period of 5.4ns as the new address is fed to the RAM only at intervals of 5.4ns. Hence ideally the set-up time available for the latch is 5.4ns. However, the set-up time actually available may be reduced due to the following reasons: (a) Difference in rise and fall time may increase the time required for the output of the multiplier to settle marginally depending upon the input pattern. (b) Due to the fabrication variations, the different interconnects and LUTs may not i have identical characteristics and may differ from the value reported by the CAD tool when the design is actually downloaded to the device. When the actual stable period reduces to less than Ins, it leads to set-up time violations on the latch. For the wavepipelined multiplier to remain in the stable region, it has been experimentally verified that about 5 wavepipelined stages may be reliably used for the XC4010E-1 devices. Since a 4X6 multiplier requires 6 stages for multiplication and 1 stage for RAM, it cannot be reliably implemented directly. It may be noted here that in a wavepipelined multiplier, the interconnect delay variations and LUT delay variations in various stages add up and make the result to appear either too early or too late compared to the active clock edge of the latches at the output of the multiplier. However in the OP ARAM the interconnect delay variation and LUT delay variation in a single stage alone determine the proper operation. Hence multiplier of any size can be built reliably using OP ARAM. It may be noted that the present effort is to operate the multiplier at the rate of 5.4ns (corresponding to operating frequency of 185MHz). When the multiplier is operated at frequencies lower than this, the margin available for latching the data in the conventional latches in the final stage becomes large and hence it can work satisfactorily even for higher order multipliers. There is one to one correspondence between the multiplier size and the rate at which data is pumped in. A 2X4 WPARAM works satisfactorily for a rate of 2.7ns. But at 2.7ns rate 2X6 did not work. When the data rate is increased to 5.4ns, 2X6 WPARAM gave the correct result. When the data rate is increased to 8.1ns, 10.8ns and so on, a still higher order multiplier can be made to work. At a data rate of 10.8ns, even an external clock can be used for the output latch and the RAMs. Multipliers of larger size such as 4X6, 6X6 etc., may be designed directly and applied with clocks with periods of S.lnsec or 10.8 nsec. Alternately a lower order multiplier may be used sequentially to implement the higher order multiplier. For example 4X6 multiplication may be carried out sequentially using a single 2X6 multiplier operating at the rate of 5.4ns and two phases of multiplication. In the first phase least two significant multiplier bits may be multiplied with the 6 bit multiplicand and in the next phase, the most two significant bits may be multiplied with the multiplicand. The output resulting from the two phases have to be added after shifting the output corresponding to the first phase by two bits towards right. The sequential multiplication scheme can be implemented using the interleaved accumulators. The two MSBs of the multiplier may be stored in even locations and the two LSBs of the multiplier may be stored in odd locations of the RAM. The address to the RAMs, RAMO, RAM1 containing the multiplier, multiplicand are varied at a rate of 185 MHz ,92.5 MHz respectively. The 4X6 multiplier designed this fashion is found to be satisfactory. Similarly higher order multipliers such as 6X6 ,8X6 may be implemented using a single 2X6 multiplier and 3,4 accumulators respectively. Compared to the direct implementation scheme, the sequential implementation scheme is area efficient and the routing complexity is also less. In addition to the above two techniques, a higher order WPARAM can be built and operated more reliably using a lower speed grade device such as XC4010E-4. For example, the wavepipelined Guild multiplier built and reported in the literature is of size 7X7 and has been implemented on XC4005E-6. This has larger LUT delay compared to the XC4010E-1 and the interconnect delay becomes a smaller fraction of the LUT delay. In view of this, variations in the interconnect delay does not alter the performance of the wavepipelined multiplier unless the number of wavepipelined stages are increased significantly. 7X7 WPARAM built in the same device would require less area and latency than the wavepipelined guild multiplier due to reduction in the no. of wavepipelined stages As mentioned earlier, the maximum size of multiplier that can be designed using wavepipelining technique is limited. However OP ARAM of larger sizes can be directly implemented and operated with the clock period of 6ns. WPARAM of larger sizes have to be operated with a clock period of 8.1, 10.8 or higher multiples of 2.7ns. For higher order multipliers OP ARAM is to be preferred if higher speeds are required and any data rate lower than 166 MHz can be used. It need not be multiple of a particular frequency as in the case of WPARAM. If power dissipation is the concern one may go in for WPARAM instead of OPARAM. The implementation details of the filters on XC4010E-1 using WPARAM is presented next. For study of filters using the WPARAM, two 8-tap interleaved parallel FIR filters for inputs with 2 bits and impulse response coefficients with 6 bits is implemented on XC4010E-1 FPGA. One 8 tap filter corresponds to odd input samples and the other 8 tap filter corresponds to even input samples. The output of the above two filters may be accumulated to obtain a 16 tap FIR filter. The RAMs introduce a latency of 2.7ns delay for the wavepipelined multiplier. For the conventional pipelined array multiplier, RAM introduces a latency of 1 clock cycle(lOns). The above filters have been found to be working satisfactorily and to perform 16 multiplications, the time required is (5.4x 16) + (2.7x7)ns. The 16 tap filter requires 89 CLBs for implementation. A 16 tap FIR filter may also be implemented using two parallel 8 tap non interleaved filters with 2X6 OSPAM. In this case for each of the parallel filter, one dedicated multiplier is used. The 16 tap filter requires, the computation time of (12 x 16) + (12x 7) ns. The 16 tap filter with OSPAM requires 87 CLBs for implementation. Hence the 16 tap FIR filter with WPARAM is faster by a factor of 2.5 than the filter with OSPAM. Further the wavepipelined multiplier doesn't require any pipelining register and hence the power dissipation due to charging and discharging of the registers will be absent. Next, the implementation results of an 8 tap filter with 4 bit input samples and 6 bits impulse response coefficients are considered: As mentioned earlier, the 4X6 WPARAM is implemented using a sequential scheme with single 2X6 WPARAM and two accumulators. This requires 89 .CLBs and requires (10.8x8)+(10.8x7)ns for computation of the sum of products due to 8 taps. The FIR filter using the conventional pipelined array multiplier may also be implemented using 4x6 OSPAM without interleaving. This has 9 pipeline stages including RAM and requires 68 CLBs for implementation. In this case, the time required for computing the filter output "becomes (13.3 x 8) + (13.3 x 8)ns. Hence the FIR filter using WPARAM is about 1.2 times faster and requires about 30% more area compared to the filter with OSPAM. Ail the FIR filters implemented using WPARAM may also be implemented using OP ARAM. The filters using OP ARAM is slower than that using WPARAM as the maximum clock period that can be used with OP ARAM is about 11% larger than that of WPARAM. The filters using WPARAM results in less power dissipation than those using OP ARAM. Table 2(Table Removed) From table 2 and based on the experiments carried out the following conclusions can be arrived at: 1. The 4X4 WPARAM requires 25-30% less area and latency than the conventional pipelined array multiplier and the Guild multiplier. 2. The 4X4 WPARAM operates faster by about 1.4-1.6 times than the conventional pipelined array multiplier and the Guild multiplier. 3. The 4X4 WPARAM operates faster by about 1.1 times than the OP ARAM and both the multipliers require same area. 4. The 7X7 OPARAM requires about 30% less area than the wavepipelined Guild multiplier. The speeds cannot be compared as OPARAM is implemented in XC4010E-1 which is a faster device than the XC4005E-6 which was used for the implementation of the 7X7 wavepipelined Guild multiplier. 5. WPARAM dispenses with registers and hence the power dissipation is reduced for WPARAM compared to the non wavepipelined multipliers. 6. It is observed from table 2, as the number of wavepipelined stages is increased, the maximum operating frequency is decreased. In other words as the latency of the multiplier increases, the maximum operating frequency is decreases. 4X4 WPARAM requires 5 stages and wavepipelined Guild multiplier requires 7 stages. Hence a wavepipelined Guild multiplier can be operated only at a frequency lower than that of the WPARAM. 7. The Internal clock generation scheme proposed has been found to be stable and it enables the multipliers to be operated at their peak rate immaterial of the speed grade of the PCB on which the FPGA is mounted. 8. The 16 tap Interleaved parallel FIR filter using 2X6 WPARAM requires almost the same area but is faster by a factor of 2.2 times than the filter with two dedicated multipliers. 9. The 8 tap Interleaved FIR filter with 4 bit samples using sequential 2X6 WPARAM requires 30% more area and is faster by 1.2 times than the filter with the OSPAM. 10. For higher order multipliers, OPARAM is faster than the WPARAM and immaterial of the size of the multiplier, OPARAM can be operated at any data rate less than or equal to 166 MHz . We claim: 1. An improved m x n wave pipelined array multiplier comprising: an array of combinational logic blocks (each block having two outputs denoted as sum and carry outputs for the sake of convenience), two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to j+lth combinational logic block in the i+lth row, the carry output of the jth combinational logic block in the ith row is connected to jth combinational logic block of i+lth row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of n-lth row and sum output of second extreme right combinational logic block of n-lth row, the extreme left combinational logic block in the nth row is connected to sum output of extreme left combinational logic block and carry output of second left combinational logic block in the n-lth row, jth (j=2 to n-1) combinational logic block in nth row is connected to sum outputs of j-1th and jth combinational logic blocks in n-lth row and carrys from jth and j+lth combinational logic blocks in the n-lth row, for each of the combinational logic block in the last (n/2)-l rows excepting the extreme right block, the jth block is connected to sum outputs of jth and j+lth combinational logic blocks and carry outputs from j+lth and j+2th combinational logic blocks in the previous row, each of the extreme right combinational logic block of the last (n/2)-l rows of the combinational logic block arrays is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last n/2th rows will provide the final multiplication, a delay control means is provided between combinational logic blocks in the ith row and i+lth row to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in the ith and i+lth rows, thereby increasing the speed of multiplication, reducing the area, power consumption and latency. 2. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein m x n = 4 x 4. 3. An improved m x n wave pipelined array multiplier as claimed in claim 2 wherein delay control means comprising selected interconnect wires between combinational logic blocks such that the sum of the interconnect delay between said logic blocks and the combinational logic blocks delay itself is equal for the selected interconnected wires between logic blocks in the adjacent row. 4. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein the array of combinational logic blocks comprise LUTs. 5. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein the input of the multiplier comprises at least one register and two registers in the output of the multiplier are provided. 6. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein two accumulators are provided in the output of the multiplier for storing the results alternatively. 7. An improved m x n wave pipelined array multiplier substantially as herein described with reference to and as illustrated in the accompanying drawings.

Full Text

FIELD PROGRAMMXm4i4TARRAY (FPGA) BASED WAVE PIPELINED'
ARRAY MULTIPLIER (WPARAM)
This invention relates to Field Programmable Gate Array (FPGA) based wave pipelined array multiplier (WPARAM).
BACKGROUND
The conventional 4x4 array multiplier is shown in figure 1 of the accompanying drawings. Consider the multiplication of the 4-bit number X3 X2 XI XO with Y3 Y2 Yl YO. In Fig.l, the multiplication is carried out using the conventional "paper and pencil" technique. Different stages of computation in this array multiplier are separated by dotted lines and are indicated as stage 0, stage 1, stage 2, stage 3 & Stage 4. At stage 0, the product of YO with X3 X2 XI XO is computed. In stage 1, the product of Yl with X3 X2 XI XQ is computed and added to the result of the previous stage. Stages 0 to 3 are used for multiplication by the individual bits of Y with X and stage 4 is used for propagating the carry. The combinational logic blocks MO, Ml, M2 in stages 0 to 3, consists of AND gates/half adders/Full-adders. The above array multiplier can be fed with a fresh data only after the input is processed by all the five stages. The limitation of this conventional array multiplier is that the minimum time required for multiplication is equal to the time taken for processing the input in all the five stages resulting in slowing down of the speed.
In order to speed up the conventional array multiplier, ripple carry operation done in Stage 4 may be replaced with carry save operation distributed over 4 stages. The multiplication rate can be increased by introducing registers at the output of the combinational logic blocks MO, Ml, M2 & M3 at each stage shown in Fig.l. The resulting multiplier is shown in Fig. 2 and is called a Pipelined Array multiplier. In this multiplier, the multiplication can be done at a rate which depends only on the largest time taken for processing the inputs at one of the stages, the interconnect delay between the combinational logic blocks and the flip-flop (register) and set-up time as well as hold-time of the flip-flop. In the Pipelined Array multiplier, the product PO becomes available after 2 clock cycles, PI after 3 clock cycles and so on. P7 becomes available after 9 clock pulses from the time the corresponding operands are fed at the input register at the input of the pipelined array multiplier. Hence the latency of the multiplier is 9 clock cycles. In order to ensure that all the product terms arrive simultaneously, the product

&ts PO to P6 have to be delayed by 7 to 1 clock cycle respectively. This is achieved by using a chain of Flip-flops: PO requires 7 flip-flops (registers), PI requires 6 flip-flops and so on. In general, for an NxN multiplier, 2N combinational logic blocks separated by 2N+1 stages of registers are required. The latency of the multiplier is 2N+1 clock cycles.
In the conventional pipelined array multiplier shown in figure 2, the drawback is that when implemented on FPGA with four input LUT, all the four inputs are not effectively utilized in the first stage and in the last N stages and thereby resulting in more area for implementation and latency.
n our co-pending application the drawbacks of the conventional
nultipliers shown in figures 1 & 2 have been obviated and all the four inputs of LUTs have een effectively utilized.
In our co-pending applicationwe have claimed an improved m x n
pipelined array multiplier comprising:
- an array of combinational logic blocks, each block having two outputs
denoted as sum and carry outputs for the sake of convenience, a set of registers connected to each said combinational logic blocks in a row of said array of combinational logic blocks,
two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to j+lth combinational logic block in the i+lth row,
the carry output of the jth combinational logic block in the ith row is connected to jth combinational logic block of i+lth row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right

combinational logic block of n-lth row and sum output of second extreme right combinational logic block of n-lth row,
the extreme left combinational logic block in the nth row is connected to sum output of extreme left combinational logic block and carry output of second left combinational logic block in the n-lth row,
jth (j=2 to n-1) combinational logic block in nth row is connected to sum outputs of j-1th and jth combinational logic blocks in n-lth row and carrys from jth and j+lth combinational logic blocks in the n-lth row, for each of the combinational logic block in the last (n/2)-l rows excepting the extreme right block, the jth block is connected to sum outputs of jth and j+lth combinational logic blocks and carry outputs from j+lth and j+2th combinational logic blocks in the previous row,
each of the extreme right combinational logic block of the last (n/2)-l rows of the combinational logic block arrays is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last n/2th rows will provide the final multiplication thereby resulting in reduction in the area and latency.
The object of the present invention is to increase the speed of multiplication reducing the area, power consumption and latency.
To achieve the said objective, an improved wave pipelined multiplier has been developed.
The present invention therefore provides an improved m x n wave pipelined array multiplier
comprising:
an array of combinational logic blocks, each block having two outputs denoted as sum and carry outputs for the sake of convenience, two least significant multiplier bits are connected to the input of the first row of combinational logic blocks such that the least significant bit is connected to said combinational logic blocks except the first extreme left combinational
logic block and the next significant bit is connected to said combinational logic blocks except the extreme right combinational logic block, the remaining n-2 multiplier bits are connected one at a time in the consecutive rows of said combinational logic blocks such that the sum output of the jth combinational logic block in the ith row is connected to j+lth combinational logic block in the i+lth row,
the carry output of the jth combinational logic block in the ith row is connected to jth combinational logic block of i+lth row, in the nth row of the combinational logic blocks, the extreme right combinational logic block is connected to the carry output of the extreme right combinational logic block of n-lth row and sum output of second extreme right combinational logic block of n-lth row,
the extreme left combinational logic block in the nth row is connected to sum output of extreme left combinational logic block and carry output of second extreme left combinational logic block in the n-lth row, jth (j=2 to n-1) combinational logic block in nth row is connected to sum outputs of j-lth and jth combinational logic blocks in n-lth row and carrys from jth and j+lth combinational logic blocks in the n-lth row, for each of the combinational logic block in the last (n/2)-l rows excepting the extreme right block, the jth block is connected to sum outputs of jth and j+lth combinational logic blocks and carry outputs from j+lth and j+2th combinational logic blocks in the previous row,
each of the extreme right combinational logic block of the last (n/2)-l rows of the combinational logic block arrays is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last n/2th rows will provide the final multiplication, and
a delay control means is provided between combinational logic blocks in the ith row and i+lth row to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in the ith and i+lth rows,
thereby increasing the speed of multiplication, reducing the area, power consumption and latency.
In the above improved m x n wave pipelined array multiplier, m x n = 4 x 4.
An improved 4x4 wave pipelined array multiplier comprising:
an array of combinational logic blocks (each block having two outputs
denoted as sum and carry outputs for the sake of convenience),
two least significant multiplier bits are connected to the input of the first row
of combinational logic blocks such that the least significant bit is connected to
said combinational logic blocks except the first extreme left combinational
logic block and the next significant bit is connected to said combinational
logic blocks except the extreme right combinational logic block,
the remaining 2 multiplier bits are connected one at a time in the consecutive
rows of said combinational logic blocks,
in first row combinational logic blocks 1 to 4 are connected to second row
such that the sum output of the combinational logic blocks 1 to 3 in the first
row is connected to 2 to 4 combinational logic blocks in the second row,
in second row combinational logic blocks 1 to 4 are connected to third row
such that the sum output of the combinational logic blocks 1 to 3 in the second
row is connected to 2 to 4 combinational logic blocks in the third row,
in first row combinational logic blocks 2 to 4 are connected to second row
such that the carry output of the combinational logic blocks 2 to 4 in the first
row is connected to 2 to 4 combinational logic blocks in the second row,
in second row combinational logic blocks 2 to 4 are connected to third row
such that the carry output of the combinational logic blocks 2 to 4 in the
second row is connected to 2 to 4 combinational logic blocks in the third row,
in the fourth row of the combinational logic blocks, the extreme right
combinational logic block is connected to the carry output of the extreme right
combinational logic block of third row and sum output of second extreme right
combinational logic block of third row,
the third extreme left combinational logic block in fourth row is connected to
the sum outputs of second extreme left and third extreme left combinational
logic blocks in the third row and carry outputs of third extreme and fourth
extreme left combinational logic blocks in the third row
the second extreme left combinational logic block in fourth row is connected
to the sum outputs of extreme left and second extreme left combinational logic
blocks in the third row and carry outputs of second extreme and third extreme
left combinational logic blocks in the third row
the extreme left combinational logic block in the fourth row is provided inputs
from sum output of extreme left and carry output of second extreme left
combinational logic blocks in the third row,
the left combinational logic block of the last row of the combinational logic
block array is provided with sum outputs of extreme left and second extreme
left combinational logic blocks and carry outputs of second extreme left and
third extreme left combinational logic blocks of the previous row
the right combinational logic block of the last row of the combinational logic
block array is provided with the sum output of the second left combinational
logic block and the carry output of the third left combinational logic block in
the previous row,
sum output of all the extreme right combinational logic blocks of each row
and the second right extreme combinational logic blocks of first row and last
two rows will provide the final multiplication, and
a delay control means is provided between combination logic blocks in the
rows 1 to 4 and rows 2 to 5 to make the sum of the interconnect delay and the
delay in each logic block itself equal for all the logic blocks in rows 1 to 4 and
2 to 5,
thereby increasing the speed of multiplication, reducing the area, power consumption
and latency.
The delay control means comprising selected interconnect wires between combinational logic blocks such that the sum of the interconnect delay between said logic blocks and the combinational logic blocks delay itself is equal for the selected interconnected wires between logic blocks in the adjacent row.
The array of combinational logic blocks includes LUTs.
At least one register is included in the input of the multiplier and two registers in the output of the multiplier are provided.
The two accumulators are provided in the output of the multiplier for storing the results alternatively.
Brief description of the drawings
The invention will now be described with reference to the accompanying drawings.
Figure 1 shows the conventional 4x4 array multiplier.
Figure 2 shows the conventional 4x4 pipelined array multiplier.
Figure 3 shows an improved 4x4 pipelined array multiplier described in our co-pending
Figure 4 shows an improved wave pipelined array multiplier, according to this invention. Figure 5 shows the conventional 4x4 wave pipelined guild multiplier.
Figure 6 shows an improved wave pipelined array multiplier with accumulator, according to this invention.
Figure 7 shows a clock circuit. Detailed description
Referring to the drawings, figures 1 and 2 have been explained under the heading
'Background'.
It is assumed that FPGAs with 4 input Look-Up-Table (LUT) are used for the
implementation. One of the objectives of the synthesis technique is to ensure that all the 4
inputs of the LUTs are effectively engaged. When half-adders are implemented, LUT cannot
be efficiently utilized. Accordingly, the stages involving half-adders have to be
i modified/combined so that LUTs are effectively utilized. Keeping this in view, the stageO and stage 1 of Fig.2 may be modified to compute the partial products due to the two least significant multiplier bits. The last N stages may be reduced to N/2 stages by replacing the half adders with suitable functional blocks and feeding the sum and carry outputs from one stage to another properly. For example, consider the multiplication of the 4-bit number X3 X2 XI XO with Y3 Y2 Yl YO. In stage 0 of the original scheme, product of YO with X3 X2 XI XO is computed. In stage 1, product of Yl with X3 X2 XI XO is computed. These two stages can be modified as follows: The partial product obtained by the least significant two bits of the multiplier (Yl YO) with X3 X2 XI XO is shown in table 1, given below. Since 4-input LUT is available, the partial products corresponding to each column of table 1 can be computed using a single LUT. Hence the two stages of original multiplier of figure 2 can be reduced to a single stage with 4-input LUTs.
Table 1 Partial products of multiplying (he multiplicand with two LSBs of multiplier
Stages two and three of the original scheme actually require 4-input functional blocks (MO, M2, M2, M2) and hence they can be implemented using the 4-input LUTs efficiently and no modification in the original scheme is required for these two stages. Stages 4 to 7 in the original scheme is used for carry propagation and they use only half-adders for this purpose. Every two stages can be properly combined into one stage in view of the use of 4-input LUTs.
Keeping in view the above modifications, an improved 4x4 pipelined array multiplier has
been developed and is shown in Fig. 3 of our co-pending application no
This is also referred as Optimally Synthesized Pipelined Array Multiplier (OSPAM). It consists of 5 combinational logic blocks (MO, M5, M5, M5, MO - stage 0; MO, M2, M2, M2 - stage 1; MO, M2, M2, M2 - stage 2; M8, M7, M7, M6 - stage 3; M10, M9 - stage 4) separated by 6 stages of registers (0-5) are only required. The original multiplier in figure 2 requires 8 stages of combinational logic blocks and 9 stages of registers. Further the latency
"of the multiplier shown in figure 3 is reduced from 9 clock cycles to 6 clock cycles. Hence the latency is reduced by about 30% and the number of registers required are also reduced.
In general MxN multiplication can be achieved using (M-l)+(N/2) stages of registers with the latency of M+(N/2) clock cycles. This results in lower latency and requires less area for implementation.
Wavepipelined array multiplier (WPARAM):
Wavepipelining is a technique for increasing the throughput of a digital system without introducing pipelining registers between adjacent combinational logic blocks.
The wavepipelined array multiplier can be obtained from the circuit of Fig. 3. of our co-
pending application not,, aS follows:
The 4X4 wavepipelined array multiplier of Fig.4 may be obtained as follows:, The latches
(registers) in stages 0 - 4 in Fig.3 of our co-pending application no. are
removed and the inputs to the various combinational logic blocks denoted as MO, M2, M5, M6, M7, M8, M9 & M10 are made to arrive at the same time at each stage. The appropriate inputs to the various blocks MO to M10 are shown in Fig. 4. For example the sum output of the left most MO block from stageO is fed as one of the inputs to the left most M2 block in the stagel; the carry from the left most M5 block of stageO is fed as another input to the left most M2 block of stagel. These inputs should arrive at stagel with equal delay from stageO. This can be achieved by properly adjusting the routing delays of the inputs from stageO to stagel.
In general, wavepipelining of an mxn array multiplier is achieved by carrying out the following modifications on the pipelined array multiplier.
1. The delays of the inputs to each of the combinational logic block are adjusted such that
they undergo the same delay, i.e. Sum of the interconnect delay and the LUT delay should be
made equal.
2. The pipelining registers are removed and instead the inherent capacitance at the output of
one combinational logic block will be adequate to hold the data till the next combinational
logic block processes this data and becomes stable.
For carrying out the first modification, the interconnect delays between one LUT and the next LUT should be adjusted appropriately. Since Xilinx Foundation Series CAD tool permits the delays to be viewed as well as varied, the wavepipelining technique can be applied to Xilinx FPGAs. However, this technique is applicable for other LUT based FPGAs as well provided there is mechanism to view the interconnect delays and adjust them.
After the data is processed by a number of wavepipelined stages, finally the output has to be latched for it to be processed further by the conventional pipelined synchronous systems. For the purpose of latching, the latch in the CLB may be used. The clock input to the XC4000E -1 family CLB requires a minimum of 3ns on-time and 3ns off-time.
By pumping-in the data to the wavepipelined multiplier at the rate of 5.4ns and using two latches (registers) to store the output alternately (see figure 6), the minimum on-time and off-time requirements can be satisfied. In this case, each latch stores only one data in a period of 10.8ns. One latch uses the positive edge of the clock and the other latch uses the negative edge. In other words interleaving technique can be used to achieve a multiplication rate of 5.4ns and use latches with clock period of 10.8ns. However the conventional pipelined array multiplier works typically at 10ns due to set-up time and routing complexity. Hence the wavepipelined multiplier is about twice faster than the pipelined multiplier.
To ease the routing complexity and to minimize the resources, the wavepipelining delay per stage is retained at 2.7ns rate and a single data is pumped-in twice to the multiplier to effectively feed data at the rate of 5.4ns to the latches. This results in a latency of 2.7ns x number of wavepipelined multiplier stages. However for the conventional pipelined array multiplier the latency is equal to the clock period x number of pipelined multiplier stages. Since the minimum clock period that can be used is about 10ns for the conventional pipelined array multipliers, the latency of the wavepipelined array multiplier is about one third of the latency of the pipelined array multiplier.
Comparison of WPARAM with the Wavepipelined Guild Multiplier:
Wavepipelined Guild Multiplier shown in Fig. 5 is obtained from the pipelined guild multiplier by removing the pipeline registers and equalizing the delays of all the paths
crossing the equitemporal lines. The Guild multiplier requires more area than Accordingly, wavepipelined Guild multiplier also requires more area than the W* Further since, the WP Guild multiplier requires manual routing, increase of area also im, increase in the routing complexity. Further the latency of WP guild multiplier is more thai, that of WP ARAM.
Implementation of Area efficient Filters using WPARAM
In figure 6, an interleaving scheme for using a single multiplier and two accumulators for finding the sum of products corresponding to two parallel filters is shown. One accumulator finds the sum of products corresponding to odd samples and other accumulator finds the sum of products corresponding to even samples. In addition to this, the third accumulator in figure 6 is used for accumulating the output of the odd filter and even filter.
Two parallel filters each with 8-taps and the input samples and the impulse response coefficients represented using 2-bits, 6 bits respectively is shown in figure 6. In XC4010E-1 device, a 2x6 WPARAM can be made to operate at a peak rate of 5.4ns(nanoseconds). The 11-bit accumulator for this filter requires 10ns in the same device. Since the accumulator is slower than the multiplier, a single multiplier may be shared between the two accumulators using an interleaving technique. The multiplier processes alternately the data and coefficients corresponding to odd filter and even filter. The multiplier output corresponding to each filter is latched into two separate registers. A 5.4 ns clock is used to pump-in data into the WPARAM. A 10.8 ns clock is used for latching the data into the registers. Data is latched into one of the registers at the positive edge of the 10.8ns clock and the reverse is true for the other register. A single RAM may be used for storing the input samples. The odd locations contain the odd numbered samples and even locations contain the even numbered samples. Similarly a single RAM may be used for storing the impulse response coefficients. The odd locations contain the coefficients of one filter and the even locations contain the coefficients of the other filter.
The basic clock of 5.4ns is used for the multiplier. This clock is divided further using counters to generate the 10.8ns clock and the addresses for the RAM.

Results and Conclusions:
In order to study the characteristics of the instant multiplier (WPARAM) and compare it with
the other multipliers discussed in our co-pending application, 4X4 multipliers are
implemented on XC4010E-1 device.
Input to the above multipliers may be fed directly through the IO pad, through a register or through a RAM. Feeding the inputs through a RAM becomes convenient when the multiplier is to be used for implementing a filter. Because of the initial latency of the pipelined array multipliers they become efficient only if it is used for performing a number multiplications one after another as is true in the case of computation of sum of products of elements of two vectors.
Pipelined array multipliers, OSPAM and Guild multipliers can be tested folly using the simulation tool. Hence these multipliers are tested using simulation and they are found to be satisfactory. However OPARAM, WPARAM and wavepipelined Guild multiplier use manual routing and the interconnect delays play a crucial role in determining their proper operation. Since the interconnect delays are non-deterministic, there may not be one to one correspondence between the delays measured by the CAD tool and the actual delay obtained in a particular device after the design is downloaded to the device. Hence to verify the proper operation of these multipliers, they should be studied not only by using simulation but also by pumping in the test data on to the actual device. For this purpose the test data is applied to both OPARAM and WPARAM through the parallel port of a Personal Computer to the demo board housing the FPGA and the results are read back for verification. The multiplier and the multiplicand are stored in two RAMs RAMO and RAM1. The content of these RAMs are read at the rate of 6ns/5.4ns and fed to the OPARAM/WPARAM respectively. However the wavepipelined Guild multiplier has not been implemented by us.
For testing the design on the actual device, the clock signal should also be applied to the multiplier. The clock may be taken from an external source or it may have to be internally generated. One of the requirements for extracting the peak rate from the OPARAM is the ability to operate the pipelining registers at their maximum rate. This in turn requires a clock which operates at this rate.
Similarly for the WPARAM and wavepipelined Guild multiplier, a high frequency clock may be required if they are to be operated at the peak rate. The minimum clock period for the WPARAM is equal to the sum of LUT delay and the interconnect delay between the output of the one LUT and the input to the next LUT. The LUT delay of XC4000E family FPGA with speed grade of l(i.e.XC4000E-l) is l.Sns.The interconnect delay can be varied from 0.7ns onwards. An interconnect delay of 1.4ns may be chosen so that a particular LUT output can be fed to a number of LUTs located at different positions in the FPGAs with equal delay. With this a total delay of 2.7 ns is required for each stage of combinational logic block. This enables the data to be pumped-in at a rate of 2.7ns and its multiples. Low cost development and prototype boards may not be able to support signals at high frequency rate even though the FPGA as such can operate internally at rates higher than that which is possible with the board. For example, the demo board for XC4000E device family from Xilinx has a maximum operating speed of 100MHz. Hence pumping in clocks above this frequency using an external source becomes impossible. Instead the clocks have to be internally generated. For this purpose the clock circuit shown in Fig. 7 is used. The LUT delay and the interconnect delay determines the clock period. When circuits of this kind are used, one of the important issue is the stability of the clock generated and extent to which it affects the operation of the pipelined multiplier. To verify this, the stability of the clock and the stability of the output of the pipelined array multiplier need to be checked over a period of at least few hours. For this purpose, the above test was carried out on the XC4010E-1 device and it was found to be working satisfactorily. The clock generated using this technique has been used for OP ARAM and WPARAM. If the board is capable of supporting a high frequency clock, an external source itself can be used for testing the multipliers.
The different characteristics of the multipliers such as number of CLBs required, maximum operating frequency, latency are evaluated for all the above multipliers and are tabulated in Table 2. Implementation results of the 7X7 wavepipelined Guild multiplier reported in the literature are also given in table2. For the sake of comparison 7X7 OPARAM is implemented on XC4010E-1 and the results are given in table2.
The WPARAM of different sizes are implemented and tested. WPARAM of size 4X4 and 2X6 are found to be satisfactory at a clock rate of 185 MHz. However WPARAM of size 4X6, 6X6 are found to be unsatisfactory at this frequency. The plausible reasons for this could be as follows:
As mentioned earlier, for testing the WPARAM, the test data is stored in two RAMs. The delay between the instant when a new address is applied to the RAM and the instant when a stable output appears at the input to the latch after getting processed by the k wavepipelined stages of the multiplier is 2.7(k+l)ns. This data remains stable for a period of 5.4ns as the new address is fed to the RAM only at intervals of 5.4ns. Hence ideally the set-up time available for the latch is 5.4ns. However, the set-up time actually available may be reduced due to the following reasons:
(a) Difference in rise and fall time may increase the time required for the output of
the multiplier to settle marginally depending upon the input pattern.
(b) Due to the fabrication variations, the different interconnects and LUTs may not
i
have identical characteristics and may differ from the value reported by the CAD tool when the design is actually downloaded to the device.
When the actual stable period reduces to less than Ins, it leads to set-up time violations on the latch. For the wavepipelined multiplier to remain in the stable region, it has been experimentally verified that about 5 wavepipelined stages may be reliably used for the XC4010E-1 devices. Since a 4X6 multiplier requires 6 stages for multiplication and 1 stage for RAM, it cannot be reliably implemented directly. It may be noted here that in a wavepipelined multiplier, the interconnect delay variations and LUT delay variations in various stages add up and make the result to appear either too early or too late compared to the active clock edge of the latches at the output of the multiplier. However in the OP ARAM the interconnect delay variation and LUT delay variation in a single stage alone determine the proper operation. Hence multiplier of any size can be built reliably using OP ARAM.
It may be noted that the present effort is to operate the multiplier at the rate of 5.4ns (corresponding to operating frequency of 185MHz). When the multiplier is operated at frequencies lower than this, the margin available for latching the data in the conventional latches in the final stage becomes large and hence it can work satisfactorily even for higher order multipliers. There is one to one correspondence between the multiplier size and the rate at which data is pumped in. A 2X4 WPARAM works satisfactorily for a rate of 2.7ns. But at 2.7ns rate 2X6 did not work. When the data rate is increased to 5.4ns, 2X6 WPARAM gave the correct result. When the data rate is increased to 8.1ns, 10.8ns and so on, a still higher order multiplier can be made to work. At a data rate of 10.8ns, even an external clock can be used for the output latch and the RAMs.
Multipliers of larger size such as 4X6, 6X6 etc., may be designed directly and applied with clocks with periods of S.lnsec or 10.8 nsec. Alternately a lower order multiplier may be used sequentially to implement the higher order multiplier. For example 4X6 multiplication may be carried out sequentially using a single 2X6 multiplier operating at the rate of 5.4ns and two phases of multiplication. In the first phase least two significant multiplier bits may be multiplied with the 6 bit multiplicand and in the next phase, the most two significant bits may be multiplied with the multiplicand. The output resulting from the two phases have to be added after shifting the output corresponding to the first phase by two bits towards right. The sequential multiplication scheme can be implemented using the interleaved accumulators. The two MSBs of the multiplier may be stored in even locations and the two LSBs of the multiplier may be stored in odd locations of the RAM. The address to the RAMs, RAMO, RAM1 containing the multiplier, multiplicand are varied at a rate of 185 MHz ,92.5 MHz respectively. The 4X6 multiplier designed this fashion is found to be satisfactory. Similarly higher order multipliers such as 6X6 ,8X6 may be implemented using a single 2X6 multiplier and 3,4 accumulators respectively. Compared to the direct implementation scheme, the sequential implementation scheme is area efficient and the routing complexity is also less.
In addition to the above two techniques, a higher order WPARAM can be built and operated more reliably using a lower speed grade device such as XC4010E-4. For example, the wavepipelined Guild multiplier built and reported in the literature is of size 7X7 and has been implemented on XC4005E-6. This has larger LUT delay compared to the XC4010E-1 and the interconnect delay becomes a smaller fraction of the LUT delay. In view of this, variations in the interconnect delay does not alter the performance of the wavepipelined multiplier unless the number of wavepipelined stages are increased significantly. 7X7 WPARAM built in the same device would require less area and latency than the wavepipelined guild multiplier due to reduction in the no. of wavepipelined stages
As mentioned earlier, the maximum size of multiplier that can be designed using wavepipelining technique is limited. However OP ARAM of larger sizes can be directly implemented and operated with the clock period of 6ns. WPARAM of larger sizes have to be operated with a clock period of 8.1, 10.8 or higher multiples of 2.7ns. For higher order multipliers OP ARAM is to be preferred if higher speeds are required and any data rate lower than 166 MHz can be used. It need not be multiple of a particular frequency as in the case of
WPARAM. If power dissipation is the concern one may go in for WPARAM instead of OPARAM.
The implementation details of the filters on XC4010E-1 using WPARAM is presented
next.
For study of filters using the WPARAM, two 8-tap interleaved parallel FIR filters for inputs with 2 bits and impulse response coefficients with 6 bits is implemented on XC4010E-1 FPGA. One 8 tap filter corresponds to odd input samples and the other 8 tap filter corresponds to even input samples. The output of the above two filters may be accumulated to obtain a 16 tap FIR filter. The RAMs introduce a latency of 2.7ns delay for the wavepipelined multiplier. For the conventional pipelined array multiplier, RAM introduces a latency of 1 clock cycle(lOns). The above filters have been found to be working satisfactorily and to perform 16 multiplications, the time required is (5.4x 16) + (2.7x7)ns. The 16 tap filter requires 89 CLBs for implementation.
A 16 tap FIR filter may also be implemented using two parallel 8 tap non interleaved filters with 2X6 OSPAM. In this case for each of the parallel filter, one dedicated multiplier is used. The 16 tap filter requires, the computation time of (12 x 16) + (12x 7) ns. The 16 tap filter with OSPAM requires 87 CLBs for implementation. Hence the 16 tap FIR filter with WPARAM is faster by a factor of 2.5 than the filter with OSPAM.
Further the wavepipelined multiplier doesn't require any pipelining register and hence the power dissipation due to charging and discharging of the registers will be absent.
Next, the implementation results of an 8 tap filter with 4 bit input samples and 6 bits impulse response coefficients are considered:
As mentioned earlier, the 4X6 WPARAM is implemented using a sequential scheme with single 2X6 WPARAM and two accumulators. This requires 89 .CLBs and requires (10.8x8)+(10.8x7)ns for computation of the sum of products due to 8 taps. The FIR filter using the conventional pipelined array multiplier may also be implemented using 4x6 OSPAM without interleaving. This has 9 pipeline stages including RAM and requires 68 CLBs for implementation. In this case, the time required for computing the filter output
"becomes (13.3 x 8) + (13.3 x 8)ns. Hence the FIR filter using WPARAM is about 1.2 times faster and requires about 30% more area compared to the filter with OSPAM.
Ail the FIR filters implemented using WPARAM may also be implemented using OP ARAM. The filters using OP ARAM is slower than that using WPARAM as the maximum clock period that can be used with OP ARAM is about 11% larger than that of WPARAM. The filters using WPARAM results in less power dissipation than those using OP ARAM.
Table 2(Table Removed)

From table 2 and based on the experiments carried out the following conclusions can be arrived at:
1. The 4X4 WPARAM requires 25-30% less area and latency than the conventional
pipelined array multiplier and the Guild multiplier.
2. The 4X4 WPARAM operates faster by about 1.4-1.6 times than the conventional
pipelined array multiplier and the Guild multiplier.
3. The 4X4 WPARAM operates faster by about 1.1 times than the OP ARAM and both the
multipliers require same area.
4. The 7X7 OPARAM requires about 30% less area than the wavepipelined Guild
multiplier. The speeds cannot be compared as OPARAM is implemented in XC4010E-1
which is a faster device than the XC4005E-6 which was used for the implementation of
the 7X7 wavepipelined Guild multiplier.
5. WPARAM dispenses with registers and hence the power dissipation is reduced for
WPARAM compared to the non wavepipelined multipliers.
6. It is observed from table 2, as the number of wavepipelined stages is increased, the
maximum operating frequency is decreased. In other words as the latency of the
multiplier increases, the maximum operating frequency is decreases. 4X4 WPARAM
requires 5 stages and wavepipelined Guild multiplier requires 7 stages. Hence a
wavepipelined Guild multiplier can be operated only at a frequency lower than that of the
WPARAM.
7. The Internal clock generation scheme proposed has been found to be stable and it enables
the multipliers to be operated at their peak rate immaterial of the speed grade of the PCB
on which the FPGA is mounted.
8. The 16 tap Interleaved parallel FIR filter using 2X6 WPARAM requires almost the same
area but is faster by a factor of 2.2 times than the filter with two dedicated multipliers.
9. The 8 tap Interleaved FIR filter with 4 bit samples using sequential 2X6 WPARAM
requires 30% more area and is faster by 1.2 times than the filter with the OSPAM.
10. For higher order multipliers, OPARAM is faster than the WPARAM and immaterial of
the size of the multiplier, OPARAM can be operated at any data rate less than or equal to
166 MHz .

We claim:
1. An improved m x n wave pipelined array multiplier comprising:
an array of combinational logic blocks (each block having two outputs
denoted as sum and carry outputs for the sake of convenience),
two least significant multiplier bits are connected to the input of the first
row of combinational logic blocks such that the least significant bit is
connected to said combinational logic blocks except the first extreme left
combinational logic block and the next significant bit is connected to said
combinational logic blocks except the extreme right combinational logic
block,
the remaining n-2 multiplier bits are connected one at a time in the
consecutive rows of said combinational logic blocks such that
the sum output of the jth combinational logic block in the ith row is
connected to j+lth combinational logic block in the i+lth row,
the carry output of the jth combinational logic block in the ith row is
connected to jth combinational logic block of i+lth row,
in the nth row of the combinational logic blocks, the extreme right
combinational logic block is connected to the carry output of the extreme
right combinational logic block of n-lth row and sum output of second
extreme right combinational logic block of n-lth row,
the extreme left combinational logic block in the nth row is connected to
sum output of extreme left combinational logic block and carry output of
second left combinational logic block in the n-lth row,
jth (j=2 to n-1) combinational logic block in nth row is connected to sum
outputs of j-1th and jth combinational logic blocks in n-lth row and
carrys from jth and j+lth combinational logic blocks in the n-lth row,
for each of the combinational logic block in the last (n/2)-l rows excepting
the extreme right block, the jth block is connected to sum outputs of jth
and j+lth combinational logic blocks and carry outputs from j+lth and
j+2th combinational logic blocks in the previous row,
each of the extreme right combinational logic block of the last (n/2)-l rows of the combinational logic block arrays is provided with the sum output of the third extreme right combinational logic block and the carry output of the second extreme right combinational logic block in the previous row, and
sum output of all the extreme right combinational logic blocks of each row and the second right extreme combinational logic blocks of first row and last n/2th rows will provide the final multiplication, a delay control means is provided between combinational logic blocks in the ith row and i+lth row to make the sum of the interconnect delay and the delay in each logic block itself equal for all the logic blocks in the ith and i+lth rows,
thereby increasing the speed of multiplication, reducing the area, power
consumption and latency.
2. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein
m x n = 4 x 4.
3. An improved m x n wave pipelined array multiplier as claimed in claim 2 wherein
delay control means comprising selected interconnect wires between
combinational logic blocks such that the sum of the interconnect delay between
said logic blocks and the combinational logic blocks delay itself is equal for the
selected interconnected wires between logic blocks in the adjacent row.
4. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein
the array of combinational logic blocks comprise LUTs.
5. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein
the input of the multiplier comprises at least one register and two registers in the
output of the multiplier are provided.
6. An improved m x n wave pipelined array multiplier as claimed in claim 1 wherein
two accumulators are provided in the output of the multiplier for storing the
results alternatively.
7. An improved m x n wave pipelined array multiplier substantially as herein
described with reference to and as illustrated in the accompanying drawings.

Documents:

885-del-2002-abstract.pdf

885-del-2002-claims.pdf

885-del-2002-correspondence-others.pdf

885-del-2002-correspondence-po.pdf

885-del-2002-description (complete).pdf

885-del-2002-drawings.pdf

885-del-2002-form-1.pdf

885-del-2002-form-19.pdf

885-del-2002-form-2.pdf

885-del-2002-form-3.pdf

885-del-2002-gpa.pdf

« Previous Patent

Next Patent »

Patent Number

220117

Indian Patent Application Number

885/DEL/2002

PG Journal Number

28/2008

Publication Date

11-Jul-2008

Grant Date

15-May-2008

Date of Filing

29-Aug-2002

Name of Patentee

DEPARTMENT OF INFORMATION TECHNOLOGY

Applicant Address

Inventors:

#	Inventor's Name	Inventor's Address
1	GOPALAKRISHNAN LAKSHMINARAYANAN	DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, REGIONAL ENGINEERING COLLEGE, TIRUCHIRAPPALI-620 015, TAMIL NADU, INDIA.
2	BALASUBRAMANIAN VENKATARAMANI	DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, REGIONAL ENGINEERING COLLEGE, TIRUCHIRAPPALI-620 015, TAMIL NADU, INDIA.

PCT International Classification Number

H03K 19/00

PCT International Application Number

N/A

PCT International Filing date

PCT Conventions:

#	PCT Application Number	Date of Convention	Priority Country
1			NA