Exam (elaborations)

Multi-Layer Test and Diagnosis for Dependable NoCs


Course
Multi-Layer Diagnosis for Dependable NoCs

Institution
Multi-Layer Diagnosis For Dependable NoCs


Uploaded on July 27, 2024
Number of pages 8
Written in 2023/2024
Type Exam (elaborations)
Contains Questions & answers


Multi-Layer Test and Diagnosis for Dependable NoCs

Hans-Joachim Wunderlich
Computer Architecture
University of Stuttgart
Pfaffenwaldring 47, D-70569 Stuttgart
wu@informatik.uni-stuttgart.de

Martin Radetzki
Embedded Systems Engineering
University of Stuttgart
Pfaffenwaldring 5b, D-70569 Stuttgart
radetzki@informatik.uni-stuttgart.de

ABSTRACT
Networks-on-chip are inherently fault tolerant or at least gracefully degradable as both, connectivity and amount of resources, provide some useful redundancy. These properties can only be exploited extensively if test and diagnosis techniques support fault detection and error containment in an optimized way. On the one hand, all faulty components have to be isolated, and on the other hand, remaining fault-free functionalities have to be kept operational. In this contribution, behavioral end-to-end error detection is considered together with functional test methods for switches and gate level diagnosis to locate and to isolate faults in the network in an efficient way with low time overhead.

Categories and Subject Descriptors
B.4.5 [Input/Output and Data Communications]: Reliability, Testing, and Fault-Tolerance – built-in tests, diagnostics.

General Terms
Performance, Design, Reliability

Keywords
Test, diagnosis, fault tolerance, network-on-chip, cross-layer.

1. INTRODUCTION & RELATED WORK
The inherent fault tolerance of networks-on-chips (NoCs) makes them a special candidate to cope with the reliability threats that accompany further CMOS scaling [25]. While the “power wall” limits the frequency increase and enforces performance improvements by exploiting parallelism, the resulting “reliability wall” can only be overcome efficiently by applying test and diagnosis schemes at the various network layers of an NoC. High quality test and diagnosis schemes are technology dependent, and a purely functional approach is not sufficient for reaching the same quality as obtained by structural techniques.
The abstraction levels of fault models are related to some extent to the network layer definition of the ISO/IEC 7498-1:1994 OSI seven layer model.

1.1 Physical Layer
Defects consist of additional, missing or wrong physical material, and they are modeled by faults of a structural gate level circuit model. Standard fault models include stuck-at faults, transition faults, delay faults, crosstalk or various types of bridging faults. They are associated in this paper with the physical network layer, and require the classical structural methods of automated test pattern generation (ATPG) [5] and test application through test access mechanisms (TAM) such as scan chains [32]. NoC-specific adaptations of these methods include the optimization of scan structures according to NoC topology [14], the transport of test patterns to scan chains using flits [22], and standards-compliant test wrappers for NoC [3]. Beyond just identifying faulty circuits, the circuit's test response can be analyzed by structural diagnosis techniques to locate the faulty circuit component (net or logic gate). Diagnosis can be performed offline with automated test equipment or in situ with dedicated built-in self-test (BIST) logic. The diagnosis result can be used offline (to increase production yield) or online (to cope with emerging faults) by repairing or deactivating faulty circuitry. Repair requires redundant circuit elements such as spare wires [21] to be designed in up front, whereas deactivation keeps the circuit alive at the cost of reduced functionality or performance (graceful degradation, e.g. through reduced flit size [31]).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
NOCS '15, September 28 - 30, 2015, Vancouver, BC, Canada.
© 2015 ACM. ISBN 978-1-4503-3396-2/15/09…$15.00
DOI: http://dx.doi.org/10.1145/2786572.2788708

1.2 Data Link Layer
On the data link layer, which establishes connectivity and flow control between adjacent switches, these classical structural test methods are no longer directly applicable, as both pattern generation and pattern application are constrained to a well-formed format for data transmission between two switches. On the one hand, these constraints reduce the reachable fault coverage; on the other hand, overtesting is avoided and tests can be executed more efficiently. An NoC-specific BIST architecture featuring a dedicated test controller and the usage of the NoC data links as TAM has been described by Grecu et al. [13]. Lehtonen et al. show how links can be reconfigured in order to cope with faults. A method for mapping diagnosed faults to switch ports [9] enables graceful degradation by deactivating defective ports and the connected links and is also used in the paper at hand. An alternative to these diagnostic approaches is concurrent error detection and error correction. These techniques rely on the use of error correcting or detecting codes (ECC/EDC). Since respective codecs are required in each switch, cheap single error correcting (SEC) codes such as Hamming codes are employed. In case of EDC, switch-to-switch retransmission can be applied for correcting transient errors, but is not effective against permanent defects.
Studies [4][15] show that the incurred area and power overhead is not justified unless extremely high failure rates are assumed, and suggest the application of such techniques on higher layers instead.

1.3 Network Layer
The network layer establishes functionalities of packet routing and switching in NoC switches (which include routing units). For testing and diagnosis, the circuit-level structural fault model is widely abstracted. For example, Kohler et al. [18] suggest a functional fault model (xbar faults) that captures connection paths in crossbar switches. Abstracting further, functional failure modes like misrouting or data corruption are used to capture the effect of low-level defects on switch functionalities [2]. When circuit-level structural diagnosis is applied to NoC switches, a mapping of diagnosed structural faults to the affected functionalities can be established [7]. Alternatively, functional tests can be applied [1]. Conversely, structural faults can be diagnosed with functional techniques [16], and SAT-based ATPG can be employed to ensure high structural coverage of functional software-based self-test (SBST) [10]. As on other layers, concurrent error detection with error detecting codes (e.g. [18]) can replace or supplement diagnostic techniques. The use of fault-secure synthesis techniques [11] ensures that all faults manifest as detectable errors. In order to achieve better graceful degradation than with a complete switch shutdown, defective parts of a switch can be bypassed by data path reconfiguration [23] or can be omitted by local routing adaptation [18]. Potentially resulting problems related to congestion or deadlocks can be avoided by ahead-looking adaptation of adjacent switches [27].

1.4 Transport Layer
Finally, the transport layer includes the end-to-end data transmissions from the original sender to the designated receiver. The use of error-detecting codes (EDC) such as parity (single error detecting), extended Hamming (double error detecting), or cyclic redundancy check (CRC, capable of detecting error bursts) is common for concurrent error detection on this layer. Alternatively, the use of heartbeat messages has been suggested [12], which replaces the overhead of equipping each packet with an EDC field by the potentially smaller overhead of occasional test packets. The use of forward error correcting codes (FEC) has also been investigated [20], but the cost of decoding advanced codes with error-correcting capacity that goes beyond the single error correction (SEC) of Hamming codes, e.g. Reed-Solomon or BCH, appears prohibitive. To diagnose NoCs on the network layer, Raik et al. [26] suggest a method that uses end-to-end messages injected and ejected at test access points at the boundaries of a mesh network. Zhang et al. [34] describe a software-based localization method that gathers information about the position of nodes that have been deactivated after an unsuccessful BIST run. In contrast, in Section 2 we outline a diagnostic method that locates defective NoC resources (links, switches) on the network layer using regular data packets.

1.5 Cross-Layer Methods
It is advantageous to separate monitoring and coarse fault diagnosis from the more expensive fine-grained fault diagnosis for defect location, at least if we are dealing with low and medium error rates. Detecting faulty switches and links is targeted efficiently at the transport layer, while diagnosis for defect location ultimately needs structural information obtained by lowering the abstraction level in a top-down fashion. This leads to a top-down divide-and-conquer approach across the network layers and will finally point to a defective structure, e.g. wire, port or gate. However, this procedure is preferably described in a bottom-up way, layer by layer, as functionalities and concepts can be reused this way. Hence, this paper is organized as follows: After describing test and monitoring at the transport layer in the next section, test, diagnosis and fault isolation at gate level are discussed in Section 3. Section 4 introduces software-based self-test at the data link layer, and Section 5 presents the concept of functional failure modes at the network layer.

2. TRANSPORT LAYER

2.1 Transport Protocol
If the absence of post-manufacturing defects is a reasonable assumption, as is still the case with current technology, a minimal transport protocol for packetization and re-assembly of end-to-end messages is sufficient. For future technologies, automatic repeat request (ARQ) techniques can be employed for retransmission of erroneous packets. This requires each packet to be equipped with an error-detecting code (EDC). To implement ARQ, a sender keeps a local copy of each sent packet until it is positively acknowledged by the receiver. Should the receiver detect an error by decoding the EDC, it sends a negative acknowledgement. Multiple acknowledgements can be bundled in a single protocol packet so as to reduce the incurred traffic overhead. Upon receiving a negative acknowledgement, the sender re-sends the packet. If the receiver is not capable of reordering packets, subsequent packets must also be retransmitted. Since data packets may be completely lost, the receiver implements a time-out mechanism upon which expected packets that did not arrive are negatively acknowledged. Missing packets can be detected by gaps in the sequence IDs transmitted as part of the packet header. Acknowledgement messages may be lost as well. Therefore, the sender implements another time-out after which a yet unacknowledged packet is automatically re-sent.

2.2 Diagnostic Protocol
Retransmission is able to correct transient faults by temporal redundancy. However, in case of a permanent fault, deterministic routing would lead any retransmitted packet through the same defective component, where it is again corrupted. This situation can be detected with an error counter for failed retransmission attempts. A fault can thereby be classified as permanent, which leads to the need to locate it so as to change routing paths.
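The retransmission scheme and the error counter described above can be illustrated with a small sketch. This is not the authors' implementation: the names `Packet`, `send_with_arq`, and `max_retries`, the use of CRC-32 as the EDC, and the modeling of the network as a Python callable are all assumptions made for illustration. A lost packet is modeled by the channel returning `None`, which stands in for the sender's time-out.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Packet:
    seq: int        # sequence ID carried in the packet header
    payload: bytes
    edc: int        # error-detecting code (CRC-32 is an assumption here)


def make_packet(seq: int, payload: bytes) -> Packet:
    return Packet(seq, payload, zlib.crc32(payload))


def receiver_accepts(pkt: Packet) -> bool:
    """Receiver side: positive acknowledgement only if the EDC matches."""
    return zlib.crc32(pkt.payload) == pkt.edc


# Hypothetical channel model: returns the packet as delivered, or None if lost.
Channel = Callable[[Packet], Optional[Packet]]


def send_with_arq(seq: int, payload: bytes, channel: Channel,
                  max_retries: int = 3) -> Tuple[str, int]:
    """Send one packet; return (status, number of transmissions)."""
    local_copy = make_packet(seq, payload)  # kept until positively acknowledged
    failed = 0                              # error counter for failed attempts
    while failed <= max_retries:
        delivered = channel(local_copy)
        if delivered is not None and receiver_accepts(delivered):
            return "delivered", failed + 1
        failed += 1                         # negative acknowledgement or time-out
    # Every retransmission failed: classify the fault as permanent.
    return "permanent_fault_suspected", failed


def corrupt(pkt: Packet) -> Packet:
    """Flip one payload bit (payload must be non-empty); the EDC is unchanged."""
    damaged = bytes([pkt.payload[0] ^ 0x01]) + pkt.payload[1:]
    return Packet(pkt.seq, damaged, pkt.edc)


def make_transient_channel() -> Channel:
    """Channel that corrupts only the first transmission."""
    state = {"calls": 0}

    def channel(pkt: Packet) -> Optional[Packet]:
        state["calls"] += 1
        return corrupt(pkt) if state["calls"] == 1 else pkt

    return channel


def permanent_channel(pkt: Packet) -> Optional[Packet]:
    """Channel with a defect on the (deterministic) route: always corrupts."""
    return corrupt(pkt)
```

With the transient channel, the second transmission succeeds, so temporal redundancy is enough. With the permanent channel, the counter exceeds `max_retries` and the fault is classified as permanent. Once that happens, the remaining task is to locate the defective resource.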
For this purpose, a scoreboard-based mechanism has been suggested [30] that narrows down the fault location using statistics of faults that occurred on multiple transmission paths: those network resources present in a maximal number of faulty paths are likely fault candidates. To overcome the probabilistic nature of this approach, we have proposed a bisection mechanism [28] that iteratively narrows down the fault location to a single switch, using a single transmission path. Our method assumes that the transport layer has some information on the routing policy that is implemented on the network layer: namely, the path length must be known and the switch in the middle of the path must be identifiable. This is easily implemented for a deterministic routing scheme such as dimension order routing. Also table-based routing information, where routing table entries
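The bisection idea can be sketched as a binary search over the switches of one deterministic path. The sketch below is an illustration under stated assumptions, not the mechanism of [28]: it assumes XY dimension order routing on a mesh, a single permanently faulty switch on the path, and a hypothetical `probe_ok(k)` callback that reports whether a test message from the sender to the k-th switch on the path arrives uncorrupted. The names `xy_path` and `bisect_fault` are likewise made up for this example.

```python
from typing import Callable, List, Tuple

Switch = Tuple[int, int]  # (x, y) coordinates in a 2D mesh


def xy_path(src: Switch, dst: Switch) -> List[Switch]:
    """Dimension order (XY) route: travel along x first, then along y."""
    x, y = src
    path = [(x, y)]
    step = 1 if dst[0] >= x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    step = 1 if dst[1] >= y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


def bisect_fault(path: List[Switch], probe_ok: Callable[[int], bool]) -> Switch:
    """Locate the first switch on `path` whose prefix probe fails.

    Precondition (assumption): the end-to-end transmission over the whole
    path has already failed, so the faulty switch lies within `path`.
    """
    lo, hi = 0, len(path) - 1        # invariant: fault index is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2         # switch in the middle of the remaining part
        if probe_ok(mid):
            lo = mid + 1             # path up to mid is fault-free
        else:
            hi = mid                 # fault is at mid or before it
    return path[lo]


# Usage: a simulated fault at switch (3, 0) on the route (0, 0) -> (3, 2).
route = xy_path((0, 0), (3, 2))
faulty = (3, 0)
located = bisect_fault(route, lambda k: faulty not in route[: k + 1])
```

Each probe halves the candidate range, so a path of n switches needs about log2(n) probes instead of one probe per switch. Faulty links are not modeled separately here; under the stated assumption they would be attributed to an adjacent switch.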
