Protocol processing is the bottleneck in high-speed computer networks. Many network processors have been proposed for switches and routers. Protocol processing in terminals has different characteristics from processing in switches and routers, so a new type of processor is desirable for terminals.
I partition the protocol processing tasks in a terminal into intra-packet tasks and inter-packet tasks. Two processors working in a coarse-grained pipeline are well suited to implementing this partition. The partition cuts across the protocol layers, which have previously been the basis for partitioning protocol processing implementations.
The inter-packet tasks are irregular and can be efficiently executed by a traditional von Neumann-style processor. The intra-packet tasks can be sub-partitioned into regular tasks and irregular tasks. A novel processor architecture for intra-packet tasks has been developed that exploits this sub-partitioning by executing the regular tasks in accelerator units and the irregular tasks in a data-flow core unit.
The core architecture differs from traditional processors because it does not operate on data stored in a memory. Instead it operates in a data-flow fashion directly on the data received on the network interface, so no load and store operations are necessary. As a result, packets are already processed to a large extent by the time the payload is written into memory. This saves data memory bandwidth, program memory size, processing time and power consumption. Most of the packets that should be discarded never have to be stored in memory at all.
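The idea of processing a packet while it arrives, rather than after it has been stored, can be illustrated in software. The following is only a minimal sketch under stated assumptions: the function name `receive_frame`, the 14-byte Ethernet header layout used for the example, and the single destination-address check are illustrative choices and are not taken from the processor described here. It shows that a frame addressed to another station can be rejected before any byte is buffered.

```python
from typing import Iterable, Optional

# Assumed layout for this illustration:
# destination MAC (6 bytes) + source MAC (6 bytes) + EtherType (2 bytes).
ETH_HEADER_LEN = 14


def receive_frame(stream: Iterable[int], my_mac: bytes) -> Optional[bytes]:
    """Consume a frame byte by byte as it arrives (hypothetical helper).

    Returns the payload, or None if the frame is discarded.
    """
    payload = bytearray()
    for offset, byte in enumerate(stream):
        if offset < 6:
            # The destination address is checked while the bytes arrive.
            if byte != my_mac[offset]:
                return None  # discarded: nothing was ever stored
        elif offset < ETH_HEADER_LEN:
            continue  # remaining header bytes pass by without being buffered
        else:
            payload.append(byte)  # only the payload reaches "memory"
    return bytes(payload)


frame = bytes([1, 2, 3, 4, 5, 6]) + bytes(6) + b"\x08\x00" + b"hello"
accepted = receive_frame(frame, bytes([1, 2, 3, 4, 5, 6]))
rejected = receive_frame(frame, bytes([9, 2, 3, 4, 5, 6]))
```

Here `accepted` holds only the payload bytes, while `rejected` is `None` after inspecting a single byte. A real receiver would also handle broadcast and multicast addresses and further header fields; those are left out to keep the sketch short.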
The data-flow processing also creates some problems, since the program flow has to be perfectly synchronized with the incoming data stream, which prevents the use of a pipelined processor architecture. The performance requirements are met by splitting the program into three parts and by using a dedicated program memory storage architecture. A standard cell implementation of the processor indicates support for a data flow of more than 10 Gb/s. The implementation can be used for network terminals as well as for port acceleration in switches and routers.
The total silicon area of the processor, including the program memory, is small (0.4 mm² in 0.18 micron standard cells). This allows for an increased number of ports on a real-time Ethernet switching chip, or for integrating the protocol processing off-loading onto the host processor chip in terminal equipment such as desktop computers.
The accelerator for cyclic redundancy check (CRC) has been implemented with standard cells and manufactured in a 0.35 micron process technology. The chip has a measured performance of more than 5.76 Gb/s.
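As a software reference for what a CRC unit computes, the sketch below updates the standard 32-bit CRC used for the Ethernet frame check sequence one byte at a time, so it can run alongside an incoming byte stream. The function names `crc32_update` and `crc32_stream` are hypothetical, and the bit-serial loop is chosen for clarity only; it says nothing about how the hardware accelerator described here is organized.

```python
CRC32_POLY = 0xEDB88320  # reflected form of the IEEE 802.3 CRC-32 polynomial


def crc32_update(crc: int, byte: int) -> int:
    """Fold one received byte into the running CRC register."""
    crc ^= byte
    for _ in range(8):
        # Shift out one bit; apply the polynomial if the bit was set.
        crc = (crc >> 1) ^ CRC32_POLY if crc & 1 else crc >> 1
    return crc


def crc32_stream(stream) -> int:
    """Compute CRC-32 over an iterable of bytes as they arrive."""
    crc = 0xFFFFFFFF  # standard initial value
    for byte in stream:
        crc = crc32_update(crc, byte)
    return crc ^ 0xFFFFFFFF  # standard final inversion


check = crc32_stream(b"123456789")  # 0xCBF43926, the standard check value
```

Because the register is updated per byte, no part of the frame has to be buffered in order to verify its checksum.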
The most significant contribution of my research is the new data-flow processor architecture, which has been proven in a fully functional demonstrator system. The demonstrator system is based on my novel partition of protocol processing tasks into intra-packet tasks and inter-packet tasks. A dual-processor architecture, implemented in an FPGA, receives, synchronizes and plays back an audio stream that is sent in UDP/IP packets over Fast Ethernet.
The processor architecture is relevant to any processor that operates on a data flow. There are possible improvements to the processor; for example, a detailed analysis of the relationship between data width and flexibility would support trade-offs between internal width and program memory size. Future work also includes investigating in what other areas, besides networks, the processor architecture can be used successfully.
My most important contributions are:
• The partition of protocol processing
• The data-flow processor architecture
Linköping: Uniserv, 2003. 122 p.
2003-05-19, Sal Visionen, Linköpings Universitet, Linköping, 09:15 (Swedish)