Title of Invention	A DEVICE FOR SCALABLE INTER-NODAL COMMUNICATION IN A PARALLEL COMPUTING SYSTEM
Abstract	A novel device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device. The novel device enables communication along with processing in a scalable parallel computing system. The device is capable of performing global arithmetic operations such as addition, multiplication and finding the maximum or minimum values in the Central Processing Unit (CPU) on the switch itself, thereby bringing symmetry into the communication process and increasing the overall bandwidth as the number of processors is increased, thus eliminating the bottlenecks (which reduce the parallel performance) in the communication part during the execution of parallel jobs. This results in enhanced overall performance of a parallel computer system in terms of significantly reducing the overall computational time. The gain is particularly large for spectral algorithms, which involve global operations.

Full Text	The present invention relates to a device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device.
The present invention particularly relates to a novel communication device for providing scalable parallel communication between the nodes of a parallel computer system, an essential tool for building scalable parallel computers, and a parallel computer system incorporating the novel communication device of the present invention, which is capable of providing high speed communication along with processing.
Main usage / utilities of the device of the present invention are:
1. Fast communication along with processing for scalable parallel computers.
2. Communication between devices using a dual-port memory.
3. The device of the present invention can be used in place of an ethernet hub / switch for interconnecting general purpose computers.
4. The device of the present invention can be used along with a processor such as Pentium III or equivalent and further improved processors which may be available in the future.
In the hitherto known prior art of communication systems for parallel computers, the emphasis has been on improving the data transfer rate and reducing the latency. Hence the need for high speed switches with large bandwidths and lower latencies.
Reference may be made to:
(i) The CRAY-link for SGI-origin 200/2000; the SGI Origin: A ccNuma Highly Scalable server, James Laudon and Daniel Lenoski, SGI; http://www.sgi.com/origin/images/isca.pdf
(ii) Gigabit Ethernet switches; Technology brief : Introduction to GigaBit Ethernet, Cisco systems; http://www.cisco.com/warp/public/cc/techno/media/lan/gig/tech/gigbt tc.htm
(iii) U.S. Patent number 5,274,631, titled "Computer network switching system",
(iv) U.S. Patent number 5,325,224, titled "Time multiplexed, optically addressed, gigabit optical crossbar switch".
These devices aim to increase the communication bandwidth and reduce the message latencies, which is of importance for general purpose communication. These high speed switches with large bandwidths and lower latencies lead to efficient parallel machines for many problems which have low or moderate coupling. However, when such devices are used as part of a parallel computing system, they exhibit certain drawbacks, such as poor scalability for highly coupled problems involving global operations, for example algorithms using the spectral technique for solving partial differential equations.
Reference may also be made to U.S. Patent no. 6,009,262, titled "Parallel computer system and method of communication between the processors of the parallel computer system", wherein the objective is to limit the amount of data transferred for large scale problems. This approach looks at the communication and computation separately and does not address problems wherein the interdependence between computations on different processors is high.
While the strategy of using high speed switches with large bandwidths and lower latencies leads to efficient parallel machines for many problems which have low or moderate coupling, it fails to provide scalability for highly coupled problems involving global operations, such as algorithms using the spectral technique for solving partial differential equations. This is due to the de-coupling of computation and communication, which leads to an asymmetric pattern of communication.
The main object of the present invention is to provide a device for scalable inter-nodal communication in a parallel computing system, which obviates the drawbacks of the hitherto known prior-art devices and systems as detailed above.
Another object of the present invention is to provide a device which enables fast communication along with processing for scalable parallel computers by coupling both computation and communication, leading to a symmetric pattern of communication.
Yet another object of the present invention is to be able to connect a plurality of such devices together to enhance the overall bandwidth of communication for parallel computers.
Still another object of the present invention is to provide a device which enables general purpose communication between computers which have a Peripheral Component Interconnect (PCI) interface.
Still yet another object of the present invention is to provide communication between devices using a dual-port memory.
One more object of the present invention is to provide a device which is compatible with, and can be used with, a processor such as Pentium III or equivalent or higher and further improved processors which may be available in the future.
A further object of the present invention is to provide a parallel computing system incorporating the device of the present invention.
In the present invention there is provided a device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device. The novel device enables communication along with processing in a scalable parallel computing system. The device is capable of performing global arithmetic operations such as addition, multiplication and finding the maximum or minimum values in the Central Processing Unit (CPU) on the switch itself, thereby bringing symmetry into the communication process and increasing the overall bandwidth as the number of processors is increased, thus eliminating the bottlenecks (which reduce the parallel performance) in the communication part during the execution of parallel jobs. This results in enhanced overall performance of a parallel computer system in terms of significantly reducing the overall computational time. The gain is particularly large for spectral algorithms, which involve global operations.
In the drawings accompanying this specification, figure 1 represents the schematic block diagram of the device of the present invention.
The various components which constitute the device of the present invention and the functions of the individual components in combination with the other components are explained below:
The Central Processing Unit (CPU) (1) controls the functions of the device of the present invention, which works as a switch and a processor to enable communication along with processing.
The Complex Programmable Logic Device (CPLD) (2) functions as the interface for signals between the CPU (1) and the other components such as the Dual Port Memory (DPM) (3), Static Random Access Memory (SRAM) (4), Erasable Programmable Read Only Memory (EPROM) (5) and other standard conventional supporting components.
The Dual Port Memory (DPM) (3) enables shared communication between the Processing Elements (PE) and the device of the present invention.
The Static Random Access Memory (SRAM) (4) is the working memory of the device of the present invention.
The Erasable Programmable Read Only Memory (EPROM) (5) stores the basic monitor program for the device of the present invention.
The other standard conventional supporting components such as Clock, Clock-Driver, Reset Logic, Baud Rate Generator, Serial Communication Controller (SCC), Buffer / Interface and Connector are clearly shown with inter-connections in the block diagram depicted in figure 1 of the drawings.
Accordingly the present invention provides a device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device, which comprises a Central Processing Unit (CPU) (1) capable of controlling the functions of the device; the said CPU (1) being connected in combination to a Dual Port Memory (DPM) (3), Static Random Access Memory (SRAM) (4) and Erasable Programmable Read Only Memory (EPROM) (5) through a Complex Programmable Logic Device (CPLD) (2) capable of interfacing the signals between the said CPU (1) and the DPM (3), SRAM (4) and EPROM (5); the said CPU (1) and CPLD (2) being also provided in combination with conventional peripheral components such as Clock, Clock-Driver, Reset Logic, Baud Rate Generator, Serial Communication Controller (SCC), Buffer / Interface and Connector.
In an embodiment of the present invention, the various individual components which in combination constitute the novel device are selected from conventional state-of-the-art components.
In another embodiment of the present invention, the device is compatible with any processor such as Pentium III or equivalent or any state-of-the-art or higher version.
In yet another embodiment of the present invention, the device is capable of providing scalable communication between the nodes of a parallel computing system.
In still another embodiment of the present invention, the device is capable of performing global arithmetic operations such as addition, multiplication and finding the maximum or minimum values in the Central Processing Unit (1) on the device itself.
The novel device of the present invention is a communication device designed and developed to provide scalable communication between nodes of a parallel computer, and in addition it provides processing capability on the device itself.
The said device functions as a switch to provide scalable communication between the nodes of a parallel computing system and also performs global arithmetic operations such as addition, multiplication and finding the maximum or minimum values in the Central Processing Unit (1) on the switch itself, thereby bringing symmetry into the communication process and increasing the overall bandwidth as the number of processors is increased, thus eliminating the bottlenecks (which reduce the parallel performance) in the communication part during the execution of parallel jobs. The gain is particularly large for spectral algorithms, which involve global operations.
The device of the present invention, when incorporated in a computer system, enables scalable inter-nodal communication along with processing on the device itself.
Accordingly the present invention provides a parallel computing system incorporating the device of the present invention, which comprises a plurality of individual computers having a Peripheral Component Interconnect (PCI) - Dual Port Memory (DPM) card on the PCI slot of each of the said computers, each of the said PCI-DPM cards of the said individual computers being interconnected through one or more of the devices of the present invention to provide a parallel computing system.
In an embodiment of the present invention, the individual computers are single board computers each of which has a PCI-DPM card on the PCI slot.
In another embodiment of the present invention, the individual computers are based on any processor such as Pentium III or equivalent or any state-of-the-art or higher version.
Figures 2-A and 2-B of the drawings accompanying this specification show a schematic layout diagram of an embodiment of a scalable parallel computer system incorporating the novel device of the present invention.
Figure 2-A shows the interconnections between four such interconnected devices of the present invention and the compute nodes of a parallel computer system. Figure 2-A shows the novel device, designated as FloSwitch, connected through individual PCI-DPM cards of each of a plurality of computers to form a parallel computing system. Figure 2-B depicts how each individual computer is connected to the PCI-DPM card. The device of the present invention is connected to PCI-DPM cards which are in turn connected to the PCI slots of the motherboards of the Processing Elements (PE) which constitute the individual computers. The device of the present invention, the PCI-DPM cards and the Processing Elements (PE) form a parallel computing system.
The data transfer between the Processing Elements (PE) of the individual computers is as follows. Processors which need to communicate write/read the data to/from the respective DPMs, and the device of the present invention processes and routes the data to the corresponding DPMs. In the case of a simple send-receive transaction between a pair of processors (PE), the device copies the data from the DPM of the sending PE to that of the receiving PE and updates the relevant flags. For a global sum, the device sums the data present on all the DPMs and deposits the result back on the DPMs.
The scientific explanation for the improved scalability possible using the device of the present invention is as follows: Some of the common strategies for operations like global summation are ring summation and binary tree summation, as shown in figures 3 and 4 respectively of the drawings accompanying this specification. To explain the figures, we take the example of global summation of the four arrays A0, A1, A2 and A3 which are present on the processing elements (PE) PE0, PE1, PE2 and PE3 respectively.
In ring summation, as shown in figure 3, one of the processors (PE), say PE0, sends the array A0 to processor PE1, which adds A1 to it and sends the result (A0+A1) to PE2. PE2 adds A2 to this and sends the result (A0+A1+A2) to PE3. PE3 adds A3 to this to get the final sum S. This is then transferred to the other PEs as shown in the figure. The number of communication steps needed here is proportional to Np, where Np is the number of processors (PE).
In binary tree summation, as shown in figure 4, PE0 sends A0 to PE1, which adds A1 to it to get (A0+A1). At the same time, PE2 sends A2 to PE3, which adds A3 to get (A2+A3). Then PE1 sends (A0+A1) to PE3, which adds it to (A2+A3) to get the final sum S. This is then sent back to PE1. The final step consists of PE1 and PE3 sending S to PE0 and PE2 respectively. This algorithm is optimal when the number of processors is a power of two. The number of communication steps needed here is proportional to log Np.
Of the two above said strategies, it is very clear that the binary tree summation technique is better than the ring summation technique in terms of the number of communication steps required as the number of processors increases. The time (T) taken for communication using the binary tree summation can be written as:
T = log Np * (Tcp + Tsum) + log Np * Tcp
where Np is the number of processors, Tcp is the time taken to transfer the data from the processor to the switch and Tsum is the time required to sum two arrays on the processor.
Figure 5 of the drawings shows the manner in which the global summation is done by the device of the present invention. The arrays A0, A1, A2 and A3 are transferred simultaneously by PE0, PE1, PE2 and PE3 respectively to buffers on the device. The device then performs the summation and places the result S on the buffers accessible to each of the PEs. The PEs then read back the result S. The number of communication steps is only two (2), irrespective of the number of processors.
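The three strategies described above can be compared with a small simulation. The following is only an illustrative sketch in Python, not the firmware of the device: the function names, the sample data, and the assumption that the ring forwards the final sum S one hop at a time are all hypothetical. Each function returns the global sum together with the number of communication steps it counted.

```python
def add_arrays(x, y):
    """Element-wise sum of two equal-length arrays."""
    return [a + b for a, b in zip(x, y)]


def ring_sum(arrays):
    """Ring summation: the partial sum travels PE0 -> PE1 -> ... -> PE(n-1)."""
    total = list(arrays[0])
    steps = 0
    for local in arrays[1:]:
        total = add_arrays(total, local)  # the receiving PE adds its own array
        steps += 1                        # one transfer per hop
    steps += len(arrays) - 1              # assumption: S is forwarded hop by hop
    return total, steps


def tree_sum(arrays):
    """Binary tree summation for a power-of-two number of PEs."""
    n = len(arrays)
    if n == 0 or n & (n - 1):
        raise ValueError("number of PEs must be a power of two")
    partial = [list(a) for a in arrays]
    steps = 0
    stride = 1
    while stride < n:
        # All pairs at one level exchange in parallel, so a level costs one step.
        for src in range(stride - 1, n, 2 * stride):
            dst = src + stride
            partial[dst] = add_arrays(partial[src], partial[dst])
        steps += 1
        stride *= 2
    steps += n.bit_length() - 1           # S sent back down the tree: log2(n) steps
    return partial[n - 1], steps


def device_sum(arrays):
    """Summation on the device: all PEs write, the device adds, all PEs read."""
    buffers = [list(a) for a in arrays]                 # step 1: simultaneous writes
    total = [sum(column) for column in zip(*buffers)]   # addition done on the device
    results = [list(total) for _ in buffers]            # step 2: each PE reads S back
    return results[0], 2


if __name__ == "__main__":
    # Hypothetical data standing in for A0..A3 on PE0..PE3.
    arrays = [[1, 2], [10, 20], [100, 200], [1000, 2000]]
    for name, fn in (("ring", ring_sum), ("tree", tree_sum), ("device", device_sum)):
        total, steps = fn(arrays)
        print(f"{name:6s} sum={total} steps={steps}")
```

Under these assumptions the step count grows linearly with Np for the ring, logarithmically for the tree, and stays at two for the device-style summation, which matches the qualitative comparison in the text.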
The time (T) taken to perform the same operations using the device of the present invention is:
T = 2 * (Tcp / 2) + TSsum
where TSsum is the time taken to sum the arrays on the device of the present invention:
TSsum = TS1 * Np (for sequential addition on the device of the present invention); or
TSsum = TS1 * log Np (for binary tree addition on the device of the present invention);
where TS1 is the time required to add two arrays on the device of the present invention.
It should be noted that the factor multiplying log Np (which increases on increasing the number of processors) is TS1 for the case of the device of the present invention and Tcp for the binary tree summation with hitherto known prior art switches. For modern state-of-the-art processors, since the computation speeds are much greater than the communication speeds, the total communication time taken by the device of the present invention is lower. Also, there is a much slower increase (by an order of magnitude) of the communication time with Np, thus resulting in scalable parallel programs.
The novelty of the device of the present invention resides in removing the asymmetric pattern of communication, thereby significantly reducing the overall computational time and enhancing the overall performance of a parallel computer system. The novelty of the device of the present invention has been realized by the non-obvious inventive step of providing combined parallel communication with processing on the device itself. This has been made possible by the incorporation of a processor on the communication device of the present invention.
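The two timing expressions can be evaluated side by side to see how they scale with Np. The sketch below is a minimal Python illustration; the function names and the unit costs used in the demonstration (Tcp = 1.0, Tsum = 0.1, TS1 = 0.1, in arbitrary time units) are hypothetical values chosen only to show the trend and are not measurements from the specification.

```python
from math import log2


def t_binary_tree(n_p, t_cp, t_sum):
    """Prior-art binary tree: T = log Np * (Tcp + Tsum) + log Np * Tcp."""
    levels = log2(n_p)
    return levels * (t_cp + t_sum) + levels * t_cp


def t_device(n_p, t_cp, t_s1, tree_on_device=False):
    """Device of the invention: T = 2 * (Tcp / 2) + TSsum."""
    # TSsum = TS1 * Np for sequential addition, TS1 * log Np for tree addition.
    ts_sum = t_s1 * (log2(n_p) if tree_on_device else n_p)
    return 2 * (t_cp / 2) + ts_sum


if __name__ == "__main__":
    # Hypothetical unit costs: communication ten times slower than one addition.
    t_cp, t_sum, t_s1 = 1.0, 0.1, 0.1
    for n_p in (2, 8, 32, 128):
        print(
            f"Np={n_p:3d}"
            f"  tree={t_binary_tree(n_p, t_cp, t_sum):6.2f}"
            f"  device(seq)={t_device(n_p, t_cp, t_s1):6.2f}"
            f"  device(tree)={t_device(n_p, t_cp, t_s1, tree_on_device=True):6.2f}"
        )
```

With these assumed costs the communication term of the device stays fixed at Tcp while only the summation term grows with Np. The sequential-addition variant grows linearly in Np, so for large Np the binary tree addition on the device is the variant that preserves the advantage.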
The theoretical estimates of efficiency for a parallel computing system of the prior art (binary tree) using a commercially available ethernet switch and for a parallel computing system using the device of the present invention, for different numbers of processors, together with the details of data transfer, are given in tables 1 and 2 below:
Table-1 (Table 1 Removed)
From the details given in table-2 above, it can be clearly seen that the parallel efficiency decreases drastically using the hitherto known prior art techniques, while with the device of the present invention there is only a moderate decrease on increasing the number of processors. For example, the parallel efficiency of a fast ethernet switch using binary tree summation falls to 18.5% for 128 processors, while with the device of the present invention it is 46.9%. An important point to be noted is that the efficiency of the device of the present invention can be further increased by increasing the frequency of the processor on it, while the other strategies are limited by the processor-switch bandwidth. This effectively illustrates the novelty of the device of the present invention for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device.
The following example is given by way of illustration and therefore should not be construed to limit the scope of the present invention.
Example - I
The schematic block diagram of the interconnections of the device of the present invention is depicted in figure 1 of the drawings. The details and specifications of the individual components used to build this version of the device of the present invention as per the interconnections shown in figure 1 are as follows: An 8-port device was built using the Intel 486 processor working at a clock speed of 32 MHz.
The higher order address lines from the processor were connected to the CPLD XC 95216 HQ 208 for generating the Chip Select signals for the SRAM (IDT 71256SA), EPROM (AM27C512) and DPM modules (IDT 7MP1015). Lower order address bits were routed directly. The control signals were routed through the CPLD and used in the state machine. The CPLD was programmed in the VHDL language using the Xilinx Foundation Series tools. The data bus was buffered using the bidirectional buffer 74F245 and the address bus was latched using the 74F373 8-bit latch. This device of the present invention was used for experimentation.
Figures 2-A and 2-B of the drawings accompanying this specification show a schematic layout diagram of an embodiment of a scalable parallel computer system incorporating the novel device of the present invention. The parallel computing system used for experimentation consisted of a plurality of Pentium based single board computers, each of which had a PCI-DPM card on the PCI slot. The device of the present invention was connected to the PCI-DPM cards, which were already connected to the PCI slots of the motherboards of the Processing Elements (PE) of the individual computers. The device of the present invention, the PCI-DPM cards and the Processing Elements (PEs) together in combination formed a parallel computing system. The complete system was then tested to ensure the proper functioning of the device as well as the overall performance of the entire system. A parallel computing system of the prior art (binary tree) using a standard commercially available ethernet switch for communication was used as a benchmark.
Experimental setup:
(i) 8 Processors : PIII 1 GHz, 256 MB RAM per processor;
(ii) OS : Linux;
(iii) Communication using commercially available 100 Mbps ethernet switch;
(iv) Communication using the device of the present invention.
Test code : GCM code using the spectral algorithm, with T-80 resolution, where the global reduction operation is the bottleneck in communication.
The computed results using both the devices were found to be equivalent (within round-off error range) to the sequential results. This is a necessary check in parallel computing. The total run time (in seconds) for the two cases is given in table-3 below:
Table-3 (Table 3 Removed)
It is observed from table-3 above that the time taken using the device of the present invention is lower than that using Ethernet for 4 and 8 processors. The higher time obtained for two processors is due to the use of an Intel 486 processor on this version of the device of the present invention. This test, however, demonstrates the working of the device of the present invention. It also indicates the scalability achievable by the device of the present invention. The clear inference which can be drawn from the above example and the theoretical analysis given above is that a faster processor on the device of the present invention will result in further improvements.
The main advantages of the present invention are:
1. The device is capable of providing scalable inter-nodal communication in a parallel computing system.
2. Provides scalable parallel computers.
3. Enables general purpose communication between computers which have a Peripheral Component Interconnect (PCI) interface.
4. Provides communication between devices using dual-port memory.
5. Compatible with state-of-the-art processors.
6. Enables fast communication along with processing for scalable parallel computers by coupling both computation and communication, leading to a symmetric pattern of communication.
7. Enhances overall bandwidth of communication for parallel computers, thus enabling fast communication.
8. Cross section bandwidth of the order of 1.6 Giga Bytes per second.
9. Speed-ups which do not decrease on increasing the number of processors for spectral codes.
10. Cost effectiveness of the order of about 45 to 55 %.
We claim: A device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device which comprises a Central Processing Unit (CPU) (1) capable of controlling the functions of the device; the said CPU (1) being connected in combination to a Dual Port Memory (DPM) (3), Static Random Access Memory (SRAM) (4) and Erasable Programmable Read Only Memory (EPROM) (5) through a Complex Programmable Logic Device (CPLD) (2) capable of interfacing the signals between the said CPU (1) and the DPM (3), SRAM (4) and EPROM (5); the said CPU (1) and CPLD (2) being also provided in combination with conventional peripheral components such as Clock, Clock-Driver, Reset Logic, Baud Rate Generator, Serial Communication Controller (SCC), Buffer / Interface and Connector 1. A device as claimed in claim 1, wherein the various individual components which in combination constitute the novel device are selected from conventional state-of-the-art components. 2. A device as claimed in claim 1, wherein the said device is compatible with any processor such as Pentium III or equivalent or any state-of-the-art or higher version. 4. A device as claimed in claim 1, wherein the said device is capable of providing scalable communication between the nodes of a parallel computing system. 5. A parallel computing system incorporating the device as claimed in claim 1, which comprises a plurality of individual computers having a Peripheral Component Interconnect (PCI) - Dual Port Memory (DPM) card on the PCI slot of each of the said computers, each of the said PCI-DPM cards of the said individual computers being interconnected through one or more of the devices of the present invention to provide a parallel computing system. 6. A parallel computing system as claimed in claim 6, wherein the individual computers are based on any processor such as Pentium III or equivalent or any state-of-the-art or higher version. 7. A device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device substantially as herein described with reference to the example and drawings accompanying this specification.

Full Text

The present invention relates to a device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device.
The present invention particularly relates to a novel communication device for providing scalable parallel communication between the nodes of a parallel computer system, an essential tool for building scalable parallel computers, and a parallel computer system incorporating the novel communication device of the present invention, which is capable of providing high speed communication along with processing.
Main usage / utilities of the device of the present invention are :
1. Fast communication along with processing for scalable parallel
computers.
2. Communication between devices using a dual-port memory.
3. The device of the present invention can be used in place of an ethernet
hub / switch for interconnecting general purpose computers.
4. The device of the present invention can be used along with a processor
such as Pentium HI or equivalent and further improved processors which
may be available in the future.
In the hitherto known prior art of communication systems for parallel computers, the emphasis has been on improving the data transfer rate and reducing the latency. Hence, the need for high speed switches with large bandwidths and lower latencies.
Reference may be made to:
(i) The CRAY-link for SGI-origin 200/2000; the SGI Origin:
A ccNuma Highly Scalable server, James Laudon and Daniel
Lenoski, SGI; http://www.sgi.com/origin/images/isca.pdf (ii) Gigabit Ethernet switches; Technology brief : Introduction to
GigaBit Ethernet, Cisco systems;
http://www.cisco.com/warp/public/cc/techno/media/lan/gig/tech/gigbt tc.htm (iii) U.S. Patent number 5,274,631, titled "Computer network switching
system", (iv) U.S. Patent number 5,325,224, titled "Time multiplexed, optically
addressed, gigabit optical crossbar switch".
These devices aim to increase the communication bandwidth and reduce the message latencies which is of importance for general purpose communication.These high speed switches with large bandwidths and lower latencies lead to efficient parallel machines for many problems which have low or moderate coupling However, when such devices are used as part of a
parallel computing system, it leads to certain drawbacks such as scalability
for highly coupled problems involving global operations such as algorithms
using the spectral technique for solving partial differential equations.
Reference may also be made to U.S. Patent no. 6,009,262 , titled:"Parallel computer system and method of communication between the processors of the parallel computer system", wherein the objective is to limit the amount of data transferred for large scale problems. This approach looks at the communication and computation separately and does not address problems
wherein the interdependence between computations on different processors is high.
While the strategy of using high speed switches with large bandwidths and lower latencies leads to efficient parallel machines for many problems which have low or moderate coupling, it fails to provide scalability for highly coupled problems involving global operations such as algorithms using the spectral technique for solving partial differential equations. This is due to the de-coupling of computation and communication leading to an asymmetric pattern of communication.
The main object of the present invention is to provide a device for scalable inter-nodal communication in a parallel computing system, which obviates the drawbacks of the hitherto known prior-art devices and systems as detailed above.
Another object of the present invention is to provide a device which enables fast communication along with processing for scalable parallel computers by coupling both computation and communication leading to a symmetric pattern of communication.
Yet another object of the present invention is to be able to connect a plurality of such devices together to enhance the overall bandwidth of communication for parallel computers.
Still, another object of the present invention is to provide a device which enables general purpose communication between computers which have a Peripheral Component Interconnect (PCI) interface.
Still yet another object of the present invention is to provide communication between devices using a dual-port memory.
One more object of the present invention is to provide a device which is compatible and can be used with a processor such as Pentium III or equivalent or higher and further improved processors which may be available in the future.
A further object of the present invention is to provide a parallel computing system incorporating the device of the present invention.
In the present invention there is provided a device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device. The novel device enables communication along with processing in a scalable parallel computing system. The device is capable of performing global arithmetic operations such as addition, multiplication and finding the maximum or minimum values in the Central Processing Unit (CPU) on the switch itself, thereby bringing symmetry into the communication process and increasing the overall bandwidth as the number of processors are increased, thus eliminating the bottlenecks (which reduce the parallel performance) in the communication part during the execution of parallel jobs. This results in enhanced overall performance of a parallel computer system in terms of significantly reducing the overall computational time. The, gain being particularly more for spectral algorithms which involve global operations.
In the drawings accompanying this specification, figure 1 represents the schematic block diagram of the device of the present invention. The various components which constitute the device of the present invention, and the function of each component in combination with the other components, are explained below:
The Central Processing Unit (CPU) (1) controls the functions of the device of the present invention which works as a switch and a processor to enable communication along with processing.
The Comprehensive Programmable Logic Device (CPLD) (2) functions as the interface for signals from the CPU (1) to and from the other components like Dual Port Memory (DPM) (3), Synchronous Random Access Memory (SRAM) (4), Erasable Programmable Read Only Memory (EPROM) (5) and other standard conventional supporting components.
The Dual Port Memory (DPM) (3) enables shared communication between the Processing Elements (PE) and the device of the present invention.
The Synchronous Random Access Memory (SRAM) (4) is the working memory of the device of the present invention.
The Erasable Programmable Read Only Memory (EPROM) (5) stores the basic monitor program for the device of the present invention.
The other standard conventional supporting components such as Clock, Clock-Driver, Reset Logic, Baud Rate Generator, Serial Communication Controller (SCC), Buffer/Interface and Connector are clearly shown with inter-connections in the block diagram depicted in figure 1 of the drawings.
Accordingly, the present invention provides a device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device, which comprises a Central Processing Unit (CPU) (1) capable of controlling the functions of the device; the said CPU (1) being connected in combination to a Dual Port Memory (DPM) (3), Synchronous Random Access Memory (SRAM) (4) and Erasable Programmable Read Only Memory (EPROM) (5) through a Comprehensive Programmable Logic Device (CPLD) (2) capable of interfacing the signals between the said CPU (1) and the DPM (3), SRAM (4) and EPROM (5); the said CPU (1) and CPLD (2) being also provided in combination with conventional peripheral components such as Clock, Clock-Driver, Reset Logic, Baud Rate Generator, Serial Communication Controller (SCC), Buffer/Interface and Connector.
In an embodiment of the present invention, the various individual components which in combination constitute the novel device are selected from conventional state-of-the-art components.
In another embodiment of the present invention, the device is compatible with any processor such as Pentium III or equivalent or any state-of-the-art or higher version.
In yet another embodiment of the present invention, the device is capable of providing scalable communication between the nodes of a parallel computing system.
In still another embodiment of the present invention, the device is capable of performing global arithmetic operations such as addition, multiplication and finding the maximum or minimum values in the Central Processing Unit (1) on the device itself.
The novel device of the present invention is a communication device designed and developed to provide scalable communication between nodes of a parallel computer, and in addition it provides processing capability on the device itself. The said device functions as a switch to provide scalable communication between the nodes of a parallel computing system and also performs global arithmetic operations such as addition, multiplication and finding the maximum or minimum values in the Central Processing Unit (1) on the switch itself, thereby bringing symmetry into the communication process and increasing the overall bandwidth as the number of processors is increased, thus eliminating the bottlenecks (which reduce the parallel performance) in the communication part during the execution of parallel jobs. The gain is particularly large for spectral algorithms, which involve global operations.
The device of the present invention when incorporated in a computer system, enables scalable inter-nodal communication along with processing on the device itself.
Accordingly, the present invention provides a parallel computing system incorporating the device of the present invention, which comprises a plurality of individual computers having a Peripheral Component Interconnect (PCI) - Dual Port Memory (DPM) card on the PCI slot of each of the said computers, each of the said PCI-DPM cards of the said individual computers being interconnected through one or more of the devices of the present invention to provide a parallel computing system.
In an embodiment of the present invention the individual computers are single board computers each of which has a PCI-DPM card on the PCI slot.
In another embodiment of the present invention the individual computers are based on any processor such as Pentium III or equivalent or any state-of-the-art or higher version.
Figures 2-A and 2-B of the drawings accompanying this specification show a schematic layout diagram of an embodiment of a scalable parallel computer system incorporating the novel device of the present invention.
Figure 2-A shows the interconnections between four such interconnected devices of the present invention and the compute nodes of a parallel computer system. In the said figure 2-A is shown the novel device, designated as FloSwitch, connected through individual PCI-DPM cards of each of a plurality of computers to form a parallel computing system. Figure 2-B depicts how each individual computer is connected to the PCI-DPM card.
The device of the present invention is connected to PCI-DPM cards which are in turn connected to the PCI slots of the motherboards of the Processing Elements (PE) which constitute the individual computers. The device of the present invention, the PCI-DPM cards and the Processing Elements (PE) form a parallel computing system.
The data transfer between the Processing Elements (PE) of the individual computers is as follows. Processors which need to communicate write/read the data to/from their respective DPMs, and the device of the present invention processes and routes the data to the corresponding DPMs.
In case of simple send-receive transaction between a pair of processors (PE), the device copies the data from the DPM of the sending PE to that of the receiving PE and updates the relevant flags. For a global sum, the device sums the data present on all the DPMs and deposits the result back on the DPMs.
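The send-receive transaction described above can be illustrated with a minimal sketch in Python. It is only a software model under stated assumptions: the class DualPortMemory, its fields (outbox, inbox, send_ready, recv_ready, dest) and the functions pe_send, device_route and pe_receive are hypothetical names introduced for illustration, and the specification does not describe the actual flag layout used on the DPMs.

```python
from dataclasses import dataclass, field


@dataclass
class DualPortMemory:
    """Hypothetical software model of the DPM attached to one PE."""
    outbox: list = field(default_factory=list)  # data written by the PE
    inbox: list = field(default_factory=list)   # data deposited by the device
    send_ready: bool = False                    # raised by the PE after writing
    recv_ready: bool = False                    # raised by the device after copying
    dest: int = -1                              # index of the receiving PE


def pe_send(dpms, src, dst, data):
    """PE side: write the data to the sender's own DPM and raise the send flag."""
    dpm = dpms[src]
    dpm.outbox = list(data)
    dpm.dest = dst
    dpm.send_ready = True


def device_route(dpms):
    """Device side: copy pending data from each sending DPM to the receiving DPM
    and update the flags on both sides."""
    for dpm in dpms:
        if dpm.send_ready:
            target = dpms[dpm.dest]
            target.inbox = list(dpm.outbox)
            target.recv_ready = True
            dpm.send_ready = False


def pe_receive(dpms, me):
    """PE side: read the data from its own DPM if the device has deposited any."""
    dpm = dpms[me]
    if not dpm.recv_ready:
        return None
    dpm.recv_ready = False
    return list(dpm.inbox)
```

In this model a PE only ever touches its own DPM, and all copying between DPMs is done by the device, which mirrors the division of work described above.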
The scientific explanation for the improved scalability possible using the device of the present invention is as follows:
Some of the common strategies for operations like global summation are ring summation and binary tree summation, as shown in figures 3 and 4 respectively of the drawings accompanying this specification. To explain the figures, we take the example of global summation of the four arrays A0, A1, A2 and A3 which are present on the processing elements (PE) PE0, PE1, PE2 and PE3 respectively.
In ring summation, as shown in figure 3, one of the processors (PE), say PE0, sends the array A0 to processor PE1, which adds A1 to it and sends the result (A0+A1) to PE2. PE2 adds A2 to this and sends the result (A0+A1+A2) to PE3. PE3 adds A3 to this to get the final sum S. This is then transferred to the other PEs as shown in the figure. The number of communication steps needed here is proportional to Np, where Np is the number of processors (PE).
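The ring summation strategy can be sketched in Python as follows. This is an illustrative simulation only: the function name ring_global_sum is hypothetical, and the way the final sum S is returned to the other PEs (passed on one PE at a time) is an assumption, since the text refers to the figure for that part.

```python
def ring_global_sum(arrays):
    """Simulate ring summation over the per-PE arrays.

    Returns (results, steps), where results[i] is the sum held by PE i at the
    end and steps is the number of sequential communication steps.
    """
    n = len(arrays)
    if n == 0:
        raise ValueError("at least one PE is required")
    steps = 0

    # Accumulation: PE0 -> PE1 -> ... -> PE(n-1), each PE adding its own array.
    running = list(arrays[0])
    for pe in range(1, n):
        steps += 1  # one transfer of the running sum to the next PE
        running = [r + a for r, a in zip(running, arrays[pe])]

    # Distribution (assumed): the final sum S is forwarded to the remaining
    # n - 1 PEs one transfer at a time.
    results = [None] * n
    results[n - 1] = list(running)
    for pe in range(n - 1):
        steps += 1
        results[pe] = list(running)

    return results, steps
```

With this distribution scheme the step count is 2*(Np - 1), which grows linearly with Np as stated above.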
In binary tree summation, as shown in figure 4, PE0 sends A0 to PE1, which adds A1 to it to get (A0+A1). At the same time, PE2 sends A2 to PE3, which adds A3 to get (A2+A3). Then PE1 sends (A0+A1) to PE3, which adds it to (A2+A3) to get the final sum S. This is then sent back to PE1. The final step consists of PE1 and PE3 sending S to PE0 and PE2 respectively. This algorithm is optimal when the number of processors is a power of two. The number of communication steps needed here is proportional to log Np.
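The binary tree pattern described above can be sketched in Python as follows. The function name tree_global_sum is hypothetical, the sketch only handles a power-of-two number of PEs, and transfers that the text describes as happening at the same time are counted as a single communication step.

```python
def tree_global_sum(arrays):
    """Simulate binary tree summation for a power-of-two number of PEs.

    Returns (results, steps), where results[i] is the array held by PE i at
    the end and steps counts communication steps (parallel transfers within
    one level count as one step).
    """
    n = len(arrays)
    if n == 0 or n & (n - 1):
        raise ValueError("number of PEs must be a power of two")
    held = [list(a) for a in arrays]
    steps = 0

    # Reduction phase: at each level the lower PE of a pair sends its partial
    # sum to the upper PE, which adds it to its own partial sum.
    stride = 1
    while stride < n:
        for recv in range(2 * stride - 1, n, 2 * stride):
            send = recv - stride
            held[recv] = [x + y for x, y in zip(held[recv], held[send])]
        steps += 1
        stride *= 2

    # Broadcast phase: the final sum S travels back down the same tree.
    stride = n // 2
    while stride >= 1:
        for send in range(2 * stride - 1, n, 2 * stride):
            held[send - stride] = list(held[send])
        steps += 1
        stride //= 2

    return held, steps
```

For four PEs this reproduces the sequence in the text: PE0 to PE1 and PE2 to PE3, then PE1 to PE3, then S from PE3 back to PE1, and finally S from PE1 and PE3 to PE0 and PE2, for a total of 2*log2(4) = 4 steps.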
Of the two strategies described above, binary tree summation clearly requires fewer communication steps than ring summation as the number of processors increases.
The time (T) taken for communication using the binary tree summation can be written as:
T = log Np * (Tcp + Tsum) + log Np * Tcp
where Np is the number of processors, Tcp is the time taken to transfer the data from the processor to the switch and Tsum is the time required to sum two arrays on the processor.
Figure 5 of the drawings shows the manner in which the global summation is done by the device of the present invention. The arrays A0, A1, A2 and A3 are transferred by PE0, PE1, PE2 and PE3 respectively to buffers on the device simultaneously. The device then performs the summation and places the result S on the buffers accessible to each of the PEs. The PEs then read back the result S. The number of communication steps is only two (2), irrespective of the number of processors.
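The two-step pattern described above can be sketched in Python as follows. This is a software model only: the function name device_global_reduce and its op parameter are hypothetical, and the per-PE buffers on the device are modelled as plain lists.

```python
import operator
from functools import reduce


def device_global_reduce(pe_arrays, op=operator.add):
    """Model a global reduction performed on the device itself.

    Returns (results, steps): results[i] is what PE i reads back, and steps
    is the number of communication steps, which does not depend on the
    number of PEs.
    """
    if not pe_arrays:
        raise ValueError("at least one PE is required")

    # Step 1: every PE writes its array to its own buffer on the device.
    # These writes are independent, so they count as one step.
    buffers = [list(a) for a in pe_arrays]
    steps = 1

    # On-device processing: combine the buffers element by element.
    combined = [reduce(op, column) for column in zip(*buffers)]

    # The device places the result in the buffer accessible to each PE.
    buffers = [list(combined) for _ in buffers]

    # Step 2: every PE reads the result back from its own buffer.
    results = [list(b) for b in buffers]
    steps += 1
    return results, steps
```

Passing a different binary operation, for example operator.mul, max or min, models the other global operations mentioned in this specification without changing the communication pattern.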
The time (T) taken to perform the same operations using the device of the present invention is:
T = 2*(Tcp/2) + TSsum
where TSsum is the time taken to sum the arrays on the device of the present invention.
TSsum = TS1 * Np (for sequential addition on the device of the present invention); or
TSsum = TS1 * log Np (for binary tree addition on the device of the present invention); where TS1 is the time required to add two arrays on the device of the present invention.
It should be noted that the factor multiplying log Np (which increases on increasing the number of processors) is TS1 for the case of the device of the present invention and Tcp for the binary tree summation with hitherto known prior art switches. For modern state-of-the-art processors, since the computation speeds are much greater than the communication speeds, the total communication time taken by the device of the present invention is lower. Also, there is a much slower increase (by an order of magnitude) of the communication time with Np, thus resulting in scalable parallel programs.
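The comparison above can be made concrete with a small cost model in Python. This is a sketch under stated assumptions: the function names t_binary_tree and t_device are hypothetical, the logarithm is taken to base 2, the transfer term of the device formula is read as 2*(Tcp/2), and any numerical values passed in are illustrative rather than measured.

```python
import math


def t_binary_tree(n_p, t_cp, t_sum):
    """T = log Np * (Tcp + Tsum) + log Np * Tcp (binary tree via a plain switch)."""
    levels = math.log2(n_p)  # assumption: log is base 2
    return levels * (t_cp + t_sum) + levels * t_cp


def t_device(n_p, t_cp, t_s1, tree_add=True):
    """T = 2 * (Tcp / 2) + TSsum (summation on the device itself).

    TSsum = TS1 * log Np for binary tree addition on the device,
    or TS1 * Np for sequential addition on the device.
    """
    ts_sum = t_s1 * (math.log2(n_p) if tree_add else n_p)
    return 2 * (t_cp / 2) + ts_sum
```

In this model the transfer time Tcp is multiplied by 2*log Np for the prior art binary tree but appears only once for the device, where the term that grows with Np is scaled by TS1 instead. When TS1 is much smaller than Tcp, the total time for the device therefore grows far more slowly with Np.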
The novelty of the device of the present invention resides in removing the asymmetric pattern of communication, thereby enhancing the overall performance of a parallel computer system by significantly reducing the overall computational time. The novelty of the device of the present invention has been realized by the non-obvious inventive step of providing combined parallel communication with processing on the device itself. This has been made possible by the incorporation of a processor on the communication device of the present invention.
The theoretical estimates of efficiency for a parallel computing system of the prior art (binary tree) using a commercially available Ethernet switch and for a parallel computing system using the device of the present invention, with different numbers of processors, together with the details of data transfer, are given in tables 1 and 2 below:
Table-1
(Table 1 Removed)
From the details given in table-2 above, it can be clearly seen that the parallel efficiency decreases drastically using the hitherto known prior art techniques, while with the device of the present invention there is only a moderate decrease on increasing the number of processors. For example, the parallel efficiency of a fast Ethernet switch using binary tree summation falls to 18.5% for 128 processors, while with the device of the present invention it is 46.9%. An important point to be noted is that the efficiency of the device of the present invention can be further increased by increasing the frequency of the processor on it, while the other strategies are limited by the processor-switch bandwidth. This effectively illustrates the novelty of the device of the present invention for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device.
The following example is given by way of illustration and therefore should not be construed to limit the scope of the present invention.
Example - I
The schematic block diagram of the interconnections of the device of the present invention is depicted in figure 1 of the drawings. The details and specifications of the individual components used to build this version of the device of the present invention as per the interconnections shown in figure-1 are as follows:
An 8-port device was built using the Intel 486 processor working at a clock speed of 32 MHz. The higher order address lines from the processor were connected to the CPLD XC 95216 HQ 208 for generating the Chip Select signals for the SRAM (IDT 71256SA), EPROM (AM27C512) and DPM modules (IDT 7MP1015). Lower order address bits were routed directly.
The control signals were routed through the CPLD and used in the state machine. The CPLD was programmed in the VHDL language using the Xilinx Foundation Series tools. The data bus was buffered using the bidirectional buffer 74F245 and the address bus was latched using the 74F373 8-bit latch. This device of the present invention was used for experimentation.
Figures 2-A and 2-B of the drawings accompanying this specification show a schematic layout diagram of an embodiment of a scalable parallel computer system incorporating the novel device of the present invention. The parallel computing system used for experimentation consisted of a plurality of Pentium based single board computers, each of which had a PCI-DPM card on the PCI slot. The device of the present invention was connected to the PCI-DPM cards, which were already connected to the PCI slots of the motherboards of the Processing Elements (PE) of the individual computers. The device of the present invention, the PCI-DPM cards and the Processing Elements (PEs) together in combination formed a parallel computing system. The complete system was then tested to ensure the proper functioning of the device as well as the overall performance of the entire system. A parallel computing system of the prior art (binary tree) using a standard commercially available Ethernet switch for communication was used as a benchmark.
Experimental setup:
(i) 8 processors: PIII 1 GHz, 256 MB RAM per processor;
(ii) OS: Linux;
(iii) Communication using a commercially available 100 Mbps Ethernet switch;
(iv) Communication using the device of the present invention.
Test code: GCM code using the spectral algorithm, with T-80 resolution, where the global reduction operation is the bottleneck in communication.
The computed results using both the devices were found to be equivalent (within round-off error range) to the sequential results. This is a necessary check in parallel computing.
The total run time (in seconds) for the two cases is given in table-3 below:
Table-3

(Table 3 Removed)
It is observed from table-3 above that the time taken using the device of the present invention is lower than that using Ethernet for 4 and 8 processors. The higher time obtained for two processors is due to the use of an Intel 486 processor on this version of the device of the present invention. This test, however, demonstrates the working of the device of the present invention. It also indicates the scalability achievable by the device of the present invention.
The clear inference which can be drawn from the above example and the theoretical analysis as given above, is that a faster processor on the device of the present invention will result in further improvements.
The main advantages of the present invention are:
1. The device is capable of providing scalable inter-nodal communication in a parallel computing system.
2. Provides scalable parallel computers.
3. Enables general purpose communication between computers which have a Peripheral Component Interconnect (PCI) interface.
4. Provides communication between devices using dual-port memory.
5. Compatible with state-of-the-art processors.
6. Enables fast communication along with processing for scalable parallel computers by coupling both computation and communication, leading to a symmetric pattern of communication.
7. Enhances overall bandwidth of communication for parallel computers, thus enabling fast communication.
8. Cross section bandwidth of the order of 1.6 Giga Bytes per second.
9. Speed-ups which do not decrease on increasing the number of processors for spectral codes.
10. Cost effectiveness of the order of about 45 to 55 %.

We claim:
1. A device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device, which comprises a Central Processing Unit (CPU) (1) capable of controlling the functions of the device; the said CPU (1) being connected in combination to a Dual Port Memory (DPM) (3), Synchronous Random Access Memory (SRAM) (4), Erasable Programmable Read Only Memory (EPROM) (5) through a Comprehensive Programmable Logic Device (CPLD) (2) capable of interfacing the signals between the said CPU (1) and DPM (3), SRAM (4), EPROM (5); the said CPU (1) and CPLD (2) being also provided in combination with conventional peripheral components such as Clock, Clock-Driver, Reset Logic, Baud Rate Generator, Serial Communication Controller (SCC), Buffer/Interface and Connector.
2. A device as claimed in claim 1, wherein the various individual components which in combination constitute the novel device are selected from conventional state-of-the-art components.
3. A device as claimed in claim 1, wherein the said device is compatible with any processor such as Pentium III or equivalent or any state-of-the-art or higher version.
4. A device as claimed in claim 1, wherein the said device is capable of providing scalable communication between the nodes of a parallel computing system.
5. A parallel computing system incorporating the device as claimed in claim 1, which comprises a plurality of individual computers having a Peripheral Component Interconnect (PCI) - Dual Port Memory (DPM) card on the PCI slot of each of the said computers, each of the said PCI-DPM cards of the said individual computers being interconnected through one or more of the devices of the present invention to provide a parallel computing system.
6. A parallel computing system as claimed in claim 5, wherein the individual computers are based on any processor such as Pentium III or equivalent or any state-of-the-art or higher version.
7. A device for scalable inter-nodal communication in a parallel computing system and a parallel computing system incorporating the said device substantially as herein described with reference to the example and drawings accompanying this specification.

Documents:

790-del-2001-abstract.pdf

790-del-2001-claims.pdf

790-del-2001-correspondence-others.pdf

790-del-2001-correspondence-po.pdf

790-del-2001-description (complete).pdf

790-del-2001-drawings.pdf

790-del-2001-form-1.pdf

790-del-2001-form-19.pdf

790-del-2001-form-2.pdf

790-del-2001-form-3.pdf

790-del-2001-form-4.pdf

790-del-2001-form-5.pdf

Patent Number

208824

Indian Patent Application Number

790/DEL/2001

PG Journal Number

13/2009

Publication Date

27-Mar-2009

Grant Date

10-Aug-2007

Date of Filing

24-Jul-2001

Name of Patentee

COUNCIL OF SCIENTIFIC & INDUSTRIAL RESEARCH

Applicant Address

RAFI MARG, NEW DELHI-110001, INDIA.

Inventors:

#	Inventor's Name	Inventor's Address
1	UDAY NARAYAN SINHA	NATIONAL AEROSPACE LABORATORIES, KODIHALLI, POST BAG NO. 1779, BANGALORE-560017, INDIA
2	VARAGAPALLI RUDRANIAMMA SARASAMMA	NATIONAL AEROSPACE LABORATORIES, POST BAG NO. 1779, BANGALORE-560017, INDIA
3	RAJALKSHMY SNARAMA KRISHNAN	NATIONAL AEROSPACE LABORATORIES, POST BAG NO. 1779, BANGALORE-560017, INDIA
4	TIRUPATHUR NANSUNDARAO VENKATESH	NATIONAL AEROSPACE LABORATORIES, POST BAG NO. 1779, BANGALORE-560017, INDIA

PCT International Classification Number

G06F 15/00

PCT International Application Number

N/A

PCT International Filing date

PCT Conventions:

#	PCT Application Number	Date of Convention	Priority Country
1			NA