Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01zs25x849r
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Peh, Li-Shiuan | en_US
dc.contributor.author | Aisopos, Konstantinos | en_US
dc.contributor.other | Electrical Engineering Department | en_US
dc.date.accessioned | 2012-08-01T19:34:32Z | -
dc.date.available | 2012-08-01T19:34:32Z | -
dc.date.issued | 2012 | en_US
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/dsp01zs25x849r | -
dc.description.abstract | Technology scaling has reached miniaturization levels at which multiple processor cores can be integrated onto the same die. During the last four decades, this scaling has been the primary driver behind improving system performance, at the expense of higher temperatures and power densities. However, when scaling down to deep submicron technologies, a new evil rises: unreliable silicon. Concerns about transistor reliability are increasing because the effects of process variation, transistor aging, electrical noise, and high temperatures grow stronger as transistor dimensions shrink. Consequently, industry projects that future chips will be exposed to large numbers of failures and is researching fault-tolerant designs. At the same time, the number of processor cores in a single chip is increasing steadily, and an efficient on-chip communication medium between them is necessary. Packet-switched on-chip networks have been gaining importance in this area, due to their modularity and scalable bandwidth. However, due to extreme transistor scaling, these interconnection networks are expected to experience permanent defects and runtime failures in future technology generations. On top of this, a single failure in the network may cascade across several routers and ultimately interrupt network service. Hence, resilient on-chip networks, which can tolerate both permanent and runtime failures transparently to upper layers, are emerging. In this dissertation, we present a characterization study of network faults and a full-system solution to tackle them. Our characterization is conducted with an accurate circuit-level tool, which we developed to explore the impact of faults on the architecture. Specifically, we present a case study where we pinpoint the common fault types in the network, their probabilities, and their architectural outcome. This way, we diagnose the vulnerable components of the interconnection network that need protection, and identify the fault types that resilient network architectures must address. We then propose a resilient architecture that can tolerate both permanent and transient faults in the interconnection network. To address permanent network faults, which disable communication links and network routers, we suggest a network architecture that can reconfigure at runtime and utilize its surviving network resources to enable continued chip operation. Our solution, named Ariadne, explores the surviving topology upon each permanent failure and discovers resilient routes to connect functional nodes. We also address transient network faults, which result in corrupted or lost coherence messages. We do so by developing a systematic methodology to incorporate resilience into the coherence protocol, so that it resends lost and corrupted messages after a timeout, replaying the corresponding transaction. Overall, this dissertation argues that designing chips that never experience network failures will not be economically feasible in the future, because this would result in enormous performance degradation, as well as financial losses for chip vendors, since a large number of chips would not meet the required specifications during testing. Instead, we propose to continue exploiting transistor scaling to maintain the current rate of performance improvement, but tolerate failures, so that a chip gracefully degrades its performance over time only after actual faults occur. | en_US
dc.language.iso | en | en_US
dc.publisher | Princeton, NJ : Princeton University | en_US
dc.relation.isformatof | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog (http://catalog.princeton.edu). | en_US
dc.subject | architecture | en_US
dc.subject | coherence protocol | en_US
dc.subject | fault tolerant | en_US
dc.subject | network on chip | en_US
dc.subject | reliable | en_US
dc.subject | resilience | en_US
dc.subject.classification | Computer engineering | en_US
dc.subject.classification | Computer science | en_US
dc.subject.classification | Electrical engineering | en_US
dc.title | Fault Tolerant Architectures for On-Chip Networks | en_US
dc.type | Academic dissertations (Ph.D.) | en_US
pu.projectgrantnumber | 690-2143 | en_US
Appears in Collections: Electrical Engineering

Files in This Item:
File | Description | Size | Format
Aisopos_princeton_0181D_10160.pdf | | 3.38 MB | Adobe PDF

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.