Fault Tolerant Architectures for On-Chip Networks

Aisopos, Konstantinos

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01zs25x849r

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Peh, Li-Shiuan	en_US
dc.contributor.author	Aisopos, Konstantinos	en_US
dc.contributor.other	Electrical Engineering Department	en_US
dc.date.accessioned	2012-08-01T19:34:32Z	-
dc.date.available	2012-08-01T19:34:32Z	-
dc.date.issued	2012	en_US
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp01zs25x849r	-
dc.description.abstract	Technology scaling has reached miniaturization levels, where multiple processor cores can be integrated onto the same die. During the last four decades, this scaling has been the primary driver behind improving system performance, at the expense of higher temperatures and power densities. However, when scaling down to deep submicron technologies, a new evil rises: unreliable silicon. The reason behind the increasing concerns for transistor reliability is that the effects of process variation, transistor aging, electrical noise, and high temperatures are becoming stronger when shrinking the transistor dimensions. Consequently, industry projects that future chips will be exposed to large numbers of failures and is researching fault-tolerant designs. At the same time, the number of processor cores in a single chip is increasing steadily, and an efficient on-chip communication medium between them is necessary. Packet-switched on-chip networks have been gaining increased importance in this area, due to their modularity and scalable bandwidth. However, due to extreme transistor scaling, these interconnection networks are expected to experience permanent defects and runtime failures in future technology generations. On top of this, a single failure in the network may cascade across several routers and ultimately cause interruption of network service. Hence, resilient on-chip networks, which can tolerate both permanent and runtime failures transparently to upper layers, are emerging. In this dissertation, we present a characterization study of network faults, and a full-system solution to tackle them. Our characterization is conducted with an accurate circuit-level tool, which we developed to explore the impact of faults in architecture. Specifically, we present a case study where we pinpoint the common fault types in the network, their probabilities, and their architectural outcome. 
This way, we diagnose the vulnerable components of the interconnection network that need protection, and identify the fault types that resilient network architectures must address. We then propose a resilient architecture that can tolerate both permanent and transient faults in the interconnection network. To address permanent network faults, which disable communication links and network routers, we suggest a network architecture that can reconfigure at runtime and utilize its surviving network resources to enable continued chip operation. Our solution, namely Ariadne, explores the surviving topology upon each permanent failure, and discovers resilient routes to connect functional nodes. We also address transient network faults, which result in corrupted or lost coherence messages. We do so by developing a systematic methodology to incorporate resilience into the coherence protocol, so that it resends lost and corrupted messages, to replay the corresponding transaction after a timeout. Overall, this dissertation argues that designing chips that never experience network failures will not be economically feasible in the future, because this would result in enormous performance degradation, as well as financial losses for chip vendors, since a large number of chips would not meet the required specifications during testing. Instead, we propose to continue exploiting transistor scaling to maintain the current rate of performance improvement, but tolerate failures, so that a chip can gracefully degrade its performance over time only after actual faults occur.	en_US
dc.language.iso	en	en_US
dc.publisher	Princeton, NJ : Princeton University	en_US
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu	en_US
dc.subject	architecture	en_US
dc.subject	coherence protocol	en_US
dc.subject	fault tolerant	en_US
dc.subject	network on chip	en_US
dc.subject	reliable	en_US
dc.subject	resilience	en_US
dc.subject.classification	Computer engineering	en_US
dc.subject.classification	Computer science	en_US
dc.subject.classification	Electrical engineering	en_US
dc.title	Fault Tolerant Architectures for On-Chip Networks	en_US
dc.type	Academic dissertations (Ph.D.)	en_US
pu.projectgrantnumber	690-2143	en_US
Appears in Collections:	Electrical Engineering

Files in This Item:

File	Description	Size	Format
Aisopos_princeton_0181D_10160.pdf		3.38 MB	Adobe PDF
