Difference between revisions of "Network"

From NaplesPU Documentation
The many-core system relies on an interconnection network in order to exchange the coherence, synchronization and boot messages. An interconnection network is a programmable system that moves data between two or more terminals.
A network-on-chip (NoC) is an interconnection network connecting microarchitectural components. Our NoC of choice is a 2D mesh: a segmented bus in two dimensions, with added complexity to route data across dimensions.
 
  
This page explains how the network-on-chip is designed and built. A packet from a source has to be (1) injected into, and later ejected from, the network, and (2) routed to its destination over specific wires. The first operation is performed by the Network Interface, the second by the Router.
  
In order to have a scalable system, each tile has its own Router and Network Interface.
  
== Network Interface ==
  
The Network Interface is the "glue" that connects all the components inside a tile that need to communicate with other tiles in the NoC. It has several interfaces towards the elements inside the tile and one interface towards the router.
Basically, it converts a packet coming from the tile into flits to be injected into the network, and vice versa. To avoid deadlock, four different virtual networks (VNs) are used: request, forwarded request, response and service.
 
  
The tile-side interface communicates with the directory controller (DC), the cache controller (CC) and the service units (boot manager, barrier core unit, synchronization manager).
The units use the virtual networks as follows:
  
[[File:NI_VN.jpg|400px|NI virtual network]]
The unit is divided into two parts:
* TO router, in which the vn_core2net units buffer packets and convert them into flits;
* FROM router, in which the vn_net2core units buffer flits and convert them back into packets.
 
  
Both parts support multicast by sending a packet k times in unicast, once per destination.
  
There should be four vn_net2core units and four vn_core2net units, one per virtual network, but the response network is connected to both the DC and the CC at the same time.
The solution is therefore to add a second vn_net2core unit and a second vn_core2net unit for the response network, with the same output as the first one. Since the NI has two different output ports on the tile side, an output arbiter is not needed there; the two vn_core2net response units, however, must first compete with each other and then with all the other VNs.
 
  
[[File:NI.png|800px|Network Interface]]
  
Note that packet_body_size is tied to flit_numb, but we prefer to compute them separately (FLIT_NUM = ceil(PACKET_BODY/FLIT_PAYLOAD)).
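The relation between packet body size and flit count can be checked with a small software sketch. It is written in Python rather than RTL, and the function name and the bit widths used in the usage note are illustrative assumptions, not values taken from the NuPlus sources.

```python
def flit_count(packet_body_bits: int, flit_payload_bits: int) -> int:
    """Return ceil(packet_body_bits / flit_payload_bits) using integer math.

    Both arguments are hypothetical widths in bits; the real design derives
    them from its own parameters.
    """
    if flit_payload_bits <= 0:
        raise ValueError("flit payload width must be positive")
    if packet_body_bits < 0:
        raise ValueError("packet body width cannot be negative")
    # Integer ceiling division avoids floating-point rounding issues.
    return (packet_body_bits + flit_payload_bits - 1) // flit_payload_bits
```

For instance, under these assumptions a 512-bit body with a 64-bit flit payload needs 8 flits, while a 513-bit body needs 9.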
  
=== vn_net2core ===
  
This module stores the flits coming from the network and rebuilds the original packet. It also handles the back-pressure information (credit on/off).
A flit consists of a header and a body. The header has two fields: |TYPE|VCID|. VCID is the ID of the virtual channel on which the flit is sent, which in turn depends on the type of message. The TYPE field can be HEAD, BODY, TAIL or HT (head-tail, a single-flit packet), and is used by the control units to handle the different kinds of flit.
 
  
When the control unit sees a TAIL or HT flit, the packet is complete and is stored in the output packet FIFO, which is directly connected to the Cache Controller.
  
For example, suppose the following flit sequence arrives:
          1st Flit in => {FLIT_TYPE_HEAD, FLIT_BODY_SIZE'h20}
 
          2nd Flit in => {FLIT_TYPE_BODY, FLIT_BODY_SIZE'h40}
 
          3rd Flit in => {FLIT_TYPE_BODY, FLIT_BODY_SIZE'h60}
 
          4th Flit in => {FLIT_TYPE_TAIL, FLIT_BODY_SIZE'h10};
 
  
The rebuilt packet passed to the Cache Controller is:
          Packet out => {FLIT_BODY_SIZE'h10, FLIT_BODY_SIZE'h60, FLIT_BODY_SIZE'h40, FLIT_BODY_SIZE'h20}
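The reassembly in the example above can be mimicked with a small behavioral model. This is only a Python sketch, not the RTL: the `FlitType` enum, the function name and the 8-bit flit body width used in the usage note are assumptions made for illustration. The first flit received ends up in the least significant position, matching the concatenation order shown above.

```python
from enum import Enum


class FlitType(Enum):
    HEAD = 0
    BODY = 1
    TAIL = 2
    HT = 3  # head-tail: a packet made of a single flit


def rebuild_packets(flits, flit_body_bits):
    """Rebuild packets from (FlitType, body) pairs.

    Returns a list of integers, one per completed packet. A packet is
    considered complete when a TAIL or HT flit is seen.
    """
    packets = []
    packet = 0
    count = 0
    for flit_type, body in flits:
        if flit_type in (FlitType.HEAD, FlitType.HT):
            # A new packet starts: reset the accumulator.
            packet = 0
            count = 0
        # Place this flit body above the ones already received.
        packet |= body << (count * flit_body_bits)
        count += 1
        if flit_type in (FlitType.TAIL, FlitType.HT):
            packets.append(packet)
    return packets
```

With a hypothetical 8-bit flit body, the sequence HEAD 0x20, BODY 0x40, BODY 0x60, TAIL 0x10 yields the single packet 0x10604020.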
 
  
A FIFO stores the reconstructed packet. When the CC is ready to read it, it asserts the packet_consumed bit.
 
The FIFO threshold is reduced by 2 because of the control unit: if a sequence of consecutive 1-flit packets arrives, the on-off back-pressure almost_full signal rises, as usual, on the clock edge after the threshold is crossed, so the threshold must be lowered by 2 to avoid losing packets. If the packets arriving near the threshold are longer than 1 flit, enqueuing stops with 1 free buffer slot left.
 
  
==== Control unit ====
Flits from the network are not stored in any FIFO. The router_valid signal is directly connected to the control unit that rebuilds the packet.
In the Control Unit all incoming flits are assembled into a packet. It checks the flit header: if the type is TAIL or HT, the control unit stores the assembled packet in the output FIFO towards the Cache Controller.
 
  
[[File:N2C_CU.png|800px|N2C_CU]]
  
=== vn_core2net ===
 
 
This module stores the original packet and converts it into flits for the network. The conversion starts by fetching the packet from an internal queue.

When the requestor has to send a packet, it asserts the packed_valid bit, which is directly connected to the FIFO enqueue_en port. This information is used by the Control Unit to translate the packet into flits for each destination.
 
 
 
==== Control unit ====
 
The Control Unit splits the packet from the Cache Controller into N flits for the next router. It checks the packet_has_data field: if the packet contains no data, the CU generates a single flit (of type HT), otherwise it generates N flits. Multicast is supported through multiple unicast messages.
 
 
 
A priority encoder selects from a mask which destination has to be served. All the fields of the header flit are filled in straight away, except for the flit type.
 
 
 
 assign packet_dest_pending = packet_destinations_valid & ~dest_served;
 rr_arbiter #(
     .NUM_REQUESTERS ( DEST_NUMB )
 )
 rr_arbiter (
     .clk        ( clk                  ),
     .reset      ( reset                ),
     .request    ( packet_dest_pending  ),
     .update_lru ( 1'b0                 ),
     .grant_oh   ( destination_grant_oh )
 );
 
 
 
The unit performs multicast through k unicasts: when a destination is served (a packet is completed), the corresponding bit is set in dest_served and is therefore cleared in packet_dest_pending.
 
 
 
dest_served <= dest_served | destination_grant_oh;
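The multicast-as-unicasts loop can be sketched as a behavioral model. The following Python is an illustration only: the function name is hypothetical, and the arbiter is reduced to a lowest-index-first pick, which is an assumption about the priority order rather than a description of rr_arbiter.

```python
def multicast_unicast_order(destinations_valid: int) -> list:
    """Return the destination indices in the order they would be served.

    destinations_valid is a bitmap with one bit per destination. After each
    unicast, the served bit is recorded so it is masked out of the pending set.
    """
    dest_served = 0
    order = []
    while True:
        # Same masking as packet_dest_pending in the RTL fragment above.
        pending = destinations_valid & ~dest_served
        if pending == 0:
            break
        # Assumed priority: lowest set bit wins (one-hot grant).
        grant_oh = pending & -pending
        order.append(grant_oh.bit_length() - 1)
        dest_served |= grant_oh
    return order
```

Under these assumptions, a bitmap of 0b1011 produces three unicasts, to destinations 0, 1 and 3 in that order.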
 
 
 
[[File:C2N_CU.png|800px|C2N_CU]]
 
 
 
The unit has to know whether multicast is enabled. If it is, the signal packet_destinations_valid is a bitmap of the destinations to reach and real_dest is TILE_COUNT bits wide; otherwise the signal real_dest contains the (x,y) coordinates of the destination.
 
 
 
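 generate
     if ( DEST_OH == "TRUE" ) begin
         // x occupies the low bits of the granted ID, y the bits above it.
         // The y slice is TOT_Y_NODE_W bits wide; the original used TOT_X_NODE_W
         // as the slice width, which is only equivalent when the two widths match.
         assign
             real_dest.x = destination_grant_id[`TOT_X_NODE_W - 1 : 0],
             real_dest.y = destination_grant_id[`TOT_Y_NODE_W + `TOT_X_NODE_W - 1 -: `TOT_Y_NODE_W];
     end else
         assign real_dest = packet_destinations[destination_grant_id];
 endgenerate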
 
 
 
Note: if DEST_OH is false, the core_destination signal contains the ID of the component inside the tile that will receive the packet; otherwise it is meaningless.
 
 
 
assign cu_flit_out_header.core_destination = tile_destination_t'( destination_grant_oh[`DEST_TILE_W -1 : 0] );
 
 
 
== Router ==
 
 
 
The router moves data between two or more terminals, so its interface is standard: input and output flits, input and output write enables, and back-pressure signals.
 
 
 
This is a virtual-channel flow control X-Y look-ahead router for a 2D-mesh topology.
 
 
 
The first choice is to use only input buffering, which takes one pipeline stage. Another widely used technique is look-ahead routing, which computes the route to be taken at the next node. It is also possible to merge virtual channel allocation and switch allocation into a single stage.
 
 
 
To recap, there are four stages, two of which (routing and allocation) work in parallel, for a total of three pipeline stages. To shorten the pipeline further, the crossbar and link traversal stage is not buffered, which reduces the pipeline to two stages and, de facto, merges the last stage into the first one.
 
 
 
[[File:router.jpg|800px|router]]
 
 
 
=== First stage ===
 
 
 
There are five different ports (the four cardinal directions plus the local port), each one with V different queues, where V is the number of virtual channels.
 
 
 
[[File:first_stage.jpg|800px|First stage Router]]
 
 
 
There are two queues: one holding all the flits (FQ) and one holding only head flits (HQ). The two queues have the same length to cover the worst case, in which every packet is a single flit. Every time a valid flit enters this unit, the HQ enqueues it only if its type is `head' or `head-tail'. The FQ holds all the flits, while the HQ "registers" every incoming packet. The HQ dequeue signal is asserted only when the allocator grant is asserted and a tail flit is being output, so the number of elements in the HQ equals the number of packets that have entered this virtual channel.
 
 
 
 header_fifo (
     .enqueue_en ( wr_en_in & flit_in.vc_id == i & ( flit_in.flit_type == HEADER | flit_in.flit_type == HT ) ),
     .value_i    ( flit_in.next_hop_port ),
     .dequeue_en ( ( ip_flit_in_mux[i].flit_type == TAIL | ip_flit_in_mux[i].flit_type == HT ) & sa_grant[i] ),
     ...
     ...
 
 
 
This organization works only if one condition holds: the flits of each packet are stored consecutively and in order in the FQ. To guarantee this, deterministic routing has to be used and every network interface has to send all the flits of a packet without interleaving them with flits of other packets.
 
 
 
=== Second stage ===
 
 
 
The second stage has two units working in parallel: the look-ahead routing unit and the allocator unit. These two units are linked through some intermediate logic.

The allocator unit has to issue a grant for each port. This signal is fed back both to the first stage and to a second-stage multiplexer, where it acts as the selector. The mux receives as input the outputs of all the virtual channels of that port and outputs only one flit, chosen by the selection signal. This flit goes to the look-ahead routing unit, which computes the next-hop destination port.
 
 
 
[[File:second_stage.jpg|800px|Second stage Router]]
 
 
 
==== Allocation  ====
 
The allocation unit grants a flit access to a specific port on a specific virtual channel, handling contention for virtual channels and crossbar ports. Each allocator is a two-stage input-first separable allocator, which requires fewer components than other allocator designs.
 
 
 
The overall unit receives as many allocation requests as there are ports. Each request asks for a destination port grant for each of its own virtual channels, so the total number of request lines is P x V x P. There are two allocation outputs for each port: (1) the winning destination port, which drives the crossbar selection; (2) the winning virtual channel, which is fed back to move the proper flit to the crossbar input.
 
 
 
[[File:allocation.jpg|800px|Allocation]]
 
 
 
The allocation unit has to respect the following rules:
 
* packets can move only in their respective virtual channel;
* a virtual channel can request only one port at a time;
* the physical link can be interleaved by flits belonging to different flows;
* when a packet acquires a virtual channel on an output port, no packet on a different input port can acquire that virtual channel on that output port.
 
 
 
===== Allocator core =====
 
Virtual channel allocation and switch allocation are logically the same, so the common logic is encased in a unit called the allocator core. It is simply a parametrizable number of parallel arbiters whose inputs and outputs are properly scrambled and whose outputs are or-ed to obtain a port-granularity grant.
 
 
 
[[File:allocatore_core.png|400px|allocatore_core]]
 
 
 
The difference from the other stages is that each arbiter is a round-robin arbiter with a grant-hold circuit. This guarantees uninterrupted use of the granted resource, which is required in particular by one of the VC allocation rules.
 
 
 
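 rr_arbiter u_rr_arbiter (
     .request    ( request        ),
     .update_lru ( '{default: '0} ),
     .grant_oh   ( grant_arb      ),
     .*
 );
 // Keep granting the previous winner for as long as it asserts hold_in.
 assign grant_oh = anyhold ? hold : grant_arb;
 assign hold     = last & hold_in;
 assign anyhold  = |hold;
 // last stores the previous grant and is cleared on reset.
 always_ff @( posedge clk, posedge reset )
     if ( reset ) last <= '0;
     else         last <= grant_oh;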
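The grant-hold behaviour can be illustrated with a cycle-by-cycle software model. This is a Python sketch, not the RTL: the class name is hypothetical, and the inner arbiter is simplified to a lowest-index pick instead of the round-robin arbiter used in the design, so only the hold logic is modelled faithfully.

```python
class GrantHoldArbiter:
    """Behavioral model of an arbiter with a grant-hold circuit."""

    def __init__(self):
        # One-hot grant issued in the previous cycle (0 after reset).
        self.last = 0

    def step(self, request: int, hold_in: int) -> int:
        """Advance one cycle and return the one-hot grant."""
        # A previous winner that still asserts hold_in keeps the resource.
        hold = self.last & hold_in
        if hold:
            grant = hold
        else:
            # Assumed stand-in for rr_arbiter: lowest requesting index wins.
            grant = request & -request
        self.last = grant
        return grant
```

In this model, if requester 1 wins and keeps its hold bit asserted, a later request from requester 0 cannot take the resource away until the hold is released.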
 
 
 
===== Virtual channel allocation =====
 
The first stage of the virtual channel allocator is removed, under the hypothesis that each virtual channel can request only one port at a time. Under this condition a first-stage arbitration is useless, so only the second stage is implemented, through an instantiation of allocatore_core.
 
 
 
The use of grant-hold arbiters in the second stage prevents a packet from losing its grant when other requests arrive after it has been granted. The on-off input signal is used to prevent a flit from being sent to a full virtual channel in the next node.
 
 
 
===== Switch allocation =====
 
The switch allocator receives as input the output signals of the VC allocation and all the port requests. For each port, a signal is asserted for each winning virtual channel. These winners now compete for switch allocation. Two arbiter stages are necessary. The first stage has as many round-robin arbiters as there are input ports.

Each round-robin arbiter chooses one VC per port and uses this result to select the request port associated with the winning VC. The winning request port goes to the input of the second-stage arbiter, together with the winning requests of the other ports. The second-stage arbiter is an instantiation of the allocator core and chooses which input port can access the physical links. Its output signal is important for two reasons: (1) it is and-ed with the winning VC of each port and fed back to the round-robin unit; (2) it is registered, and-ed with the winning destination port, and used as the selection signal for the crossbar (for each port).
 
 
 
==== Flit handler ====
 
A mux uses the granted_vc signal to forward one of the input flits to the output register. This flit then goes to the crossbar input port.
 
 
 
 always_comb begin
     flit_in_granted_mod[i] = flit_in_granted[i];
     // Only head flits carry the look-ahead result for the next hop.
     if ( flit_in_granted[i].flit_type == HEADER || flit_in_granted[i].flit_type == HT )
         flit_in_granted_mod[i].next_hop_port = port_t'( lk_next_port[i] );
 end
 
 
 
==== Next hop routing calculation ====
 
The look-ahead routing unit computes the output port to be used at the next node rather than at the current one, because the output port for the current node is already available in the header flit. The algorithm is a version of X-Y deterministic routing. It is deadlock-free because it forbids four of the eight possible turns: once a packet has turned towards the Y direction, it cannot turn again.
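The look-ahead idea can be sketched in software as follows. This is a Python model under stated assumptions, not the RTL: the port names, the coordinate convention (x grows eastwards, y grows southwards) and the function names are all hypothetical.

```python
# Assumed coordinate convention: EAST increases x, SOUTH increases y.
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "SOUTH": (0, 1), "NORTH": (0, -1)}


def xy_port(cur, dst):
    """X-Y routing decision at node cur for a flit headed to dst."""
    if dst[0] > cur[0]:
        return "EAST"
    if dst[0] < cur[0]:
        return "WEST"
    if dst[1] > cur[1]:
        return "SOUTH"
    if dst[1] < cur[1]:
        return "NORTH"
    return "LOCAL"


def look_ahead_port(cur, out_port, dst):
    """Port the flit will take at the NEXT node.

    out_port is the port already chosen for the current node (carried in the
    header flit), so the current router only has to compute the following hop.
    """
    dx, dy = STEP[out_port]
    next_node = (cur[0] + dx, cur[1] + dy)
    return xy_port(next_node, dst)
```

For example, a flit at (0,0) leaving through EAST towards (1,1) will have to leave the next node through SOUTH, and this value is what would be written into the header before the flit is forwarded.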
 

Revision as of 20:45, 16 January 2018

In a many-core system, the interconnection network has the vital goal of allowing various devices to communicate efficiently.

The most common use case is the exchange of coherence and synchronization messages. In particular, the NuPlus coherence system features a private per-core L1 cache and a shared, directory-based L2 cache. This means that the L2 cache is distributed among tiles, and every address has an associated home tile where the state of the cache line is stored. Similarly, the directory must be able to reply to the cores' requests. Another possible use case is host-to-device communication, or the handling of I/O-mapped peripherals.

A tile can contain multiple devices requiring network access (from now on called network users). The network infrastructure must thus provide inter-tile and intra-tile addressing capabilities. The interface offered to its users must be as generic as possible, and independent of the specific network topology and implementation details.

== General architecture ==

Tiles are organized in a 2D grid, a so-called mesh topology.

Every tile has a [[Network router]], which is the component responsible for inter-tile communication, and a [[Network interface]], which offers a transparent interface to network users. The network interface acts here as a bridge. Its interface must adapt to the requirements of multiple network users, converting requests from the user's format to the network format and back. Once a request has been converted into the network format, the router takes charge of handling it.

The basic communication unit supported by the router is the flit. A request is thus broken down into flits by the network interface and sent to the local router. The router has no knowledge of application messages, and just sees them as a stream of flits. As a sequence of flits can be arbitrarily long, the router offers maximum flexibility, allowing requests of unspecified length. The ultimate goal of the router is to ensure that flits are correctly injected, ejected and forwarded (routed) through the mesh.

== Routing protocol ==

The routing protocol is an important choice in network design.

The NuPlus system works under the assumption that no flit can be lost. This means that routers must buffer packets and, when output buffers are full, stall rather than drop them. As routers wait for each other, a circular dependency can potentially be created. Since routers cannot drop packets to free buffer slots and resolve a deadlock, deadlocks must be prevented from happening in the first place.

As packets are routed through the mesh, a flit can enter a router from any cardinal direction and leave it towards any other. Routing flits along a straight line cannot form a circular dependency, so only turns must be analyzed. The simplest solution is to ban some of the possible turns, in a way that rules out circular dependencies.

The routing protocol adopted by NuPlus is called XY Dimension-Ordered Routing, or DOR. It forces packets to be routed first along the X axis, and then along the Y axis. It is one of the simplest routing protocols: it takes its decisions independently of the current network status (a so-called oblivious protocol) and requires little logic to implement, while still offering deadlock avoidance and shortest-path routing.
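The XY rule can be sketched in a few lines of software. This is a Python illustration, not the router implementation: the port names, the coordinate convention (x grows eastwards, y grows southwards) and the function names are assumptions.

```python
# Assumed coordinate convention: EAST increases x, SOUTH increases y.
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "SOUTH": (0, 1), "NORTH": (0, -1)}


def xy_next_port(cur, dst):
    """Output port chosen at node cur for a packet headed to dst."""
    # The X offset is always consumed first...
    if dst[0] > cur[0]:
        return "EAST"
    if dst[0] < cur[0]:
        return "WEST"
    # ...and only then the Y offset.
    if dst[1] > cur[1]:
        return "SOUTH"
    if dst[1] < cur[1]:
        return "NORTH"
    return "LOCAL"  # the packet has reached its destination tile


def xy_route(src, dst):
    """List of output ports taken from src to dst (LOCAL excluded)."""
    hops = []
    cur = src
    while (port := xy_next_port(cur, dst)) != "LOCAL":
        hops.append(port)
        cur = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
    return hops
```

The decision depends only on the current and destination coordinates, which is what makes the protocol oblivious. The number of hops always equals the Manhattan distance, and no X move ever follows a Y move.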

== Virtual channels ==

Virtual channels are an extensively adopted technique in network design. They allow multiple virtual networks to be built on top of a single physical one.

The main problem they try to solve is head-of-line blocking. It happens when a flit that cannot be routed (for instance because the input buffer of the next router is full) reaches the head of the input queue, preventing the following flits, which potentially belong to independent traffic flows, from being served. If those blocked flits belong to different traffic flows, it makes sense to buffer them in different queues.
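Head-of-line blocking can be made concrete with a toy model. The following Python sketch is an illustration only: flits are represented as hypothetical (vc, label) pairs, and a flow is considered blocked when its virtual channel ID appears in the blocked set. It compares a single shared queue with one queue per virtual channel.

```python
from collections import deque


def drain_single_queue(flits, blocked):
    """Serve flits from one shared FIFO until the head cannot advance."""
    queue = deque(flits)
    delivered = []
    while queue and queue[0][0] not in blocked:
        delivered.append(queue.popleft())
    return delivered


def drain_per_vc(flits, blocked, num_vc):
    """Serve flits from one FIFO per virtual channel."""
    queues = [deque() for _ in range(num_vc)]
    for flit in flits:
        queues[flit[0]].append(flit)
    delivered = []
    for vc, queue in enumerate(queues):
        if vc in blocked:
            continue  # only the blocked flow has to wait
        delivered.extend(queue)
    return delivered
```

With flits `[(0, "a0"), (1, "b0"), (1, "b1")]` and flow 0 blocked, the shared queue delivers nothing, because `a0` sits at its head, while the per-channel queues still deliver `b0` and `b1`.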

Virtual channels are called virtual for this reason: there is only a single link between two routers, but the result is like having multiple physical channels, each dedicated to one traffic flow. It is the router's responsibility to ensure that virtual channels are properly time-multiplexed on the same link.

Virtual channels can also be used to prevent deadlocks when the routing protocol itself allows them to occur. To achieve this, virtual channels must be allocated in a fixed order, or an “escape virtual channel” must be provided, whose flits are routed with a deadlock-free protocol. As long as virtual channel allocation is fair, a flit will eventually be served.

In NuPlus the number of virtual channels is given by the constant parameter VC_PER_PORT, currently set to 4, one for each type of network message: Requests, Responses, Forwards, Service Messages. As a given message type is associated with a specific virtual channel, the network interface component must know on which virtual channel it has to inject the flits. The router, as stated before, has no knowledge of application messages. The only guarantee it has to provide is that messages are never routed onto the wrong virtual channel: for this reason, a flit is kept on the same virtual channel in every router along its route.

== Data structures ==

This section reports the main data structures.