Estimating Time to Repair Failures in a Distributed System
Linköping University, Department of Computer and Information Science.
Linköping University, Department of Computer and Information Science.
2016 (English)
Independent thesis Basic level (degree of Bachelor), 10,5 credits / 16 HE credits
Student thesis
Alternative title
Estimering av reparationstid vid haverier i ett distribuerat system (Swedish)
Abstract [en]

To ensure the quality of important services, high availability is critical. One aspect of availability is system downtime, which can be measured as the time to recover from failures. In this report we investigate current research on repair time and the possibility of estimating this metric from relevant parameters such as hardware and the type of fault. We thoroughly analyze a data set containing 43 000 failure traces from Los Alamos National Laboratory, covering 22 different cluster-organized systems. To enable the analysis we create and use a program that parses the raw data, sorts and categorizes it according to certain criteria, and formats the output for visualization. We analyze the data set with respect to type of fault, memory size, processor quantity, and the times at which repairs were started and completed. We visualize the number of failures and the average repair time as functions of these parameters. For the different faults and times of day we also display the empirical cumulative distribution function to give an overview of the probability of different repair times. The failures are caused by a variety of faults, of which hardware and software faults occur most frequently. These two, along with network faults, have the highest average downtime. Time of failure proves important, since both day of week and hour of day show patterns that can be explained by, for example, work schedules. The hardware characteristics of nodes seem to affect the repair time as well, although how this correlation works is difficult to determine. Based on the extracted data we suggest two simple methods of formulating a mathematical model for estimating downtime, both of which prove insufficient; more research on the subject and on how the parameters affect each other is required.
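
The kind of analysis the abstract describes (average repair time per fault category and an empirical cumulative distribution function of repair times) can be sketched roughly as follows. This is a minimal illustration and not the thesis authors' program: the record layout, the timestamp format, the function names and the sample values are all assumptions made up for the example and are not taken from the Los Alamos data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (fault category, repair start, repair end).
# The values are invented for illustration; they are NOT from the LANL traces.
RECORDS = [
    ("hardware", "2004-01-05 09:00", "2004-01-05 11:00"),
    ("hardware", "2004-01-06 22:00", "2004-01-07 02:00"),
    ("software", "2004-01-07 10:00", "2004-01-07 10:30"),
    ("network", "2004-01-08 14:00", "2004-01-08 20:00"),
]

# Assumed timestamp format for the hypothetical records above.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def repair_minutes(start, end):
    """Return the repair duration in minutes between two timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 60.0


def mean_repair_by_category(records):
    """Group repair durations by fault category and average each group."""
    grouped = defaultdict(list)
    for category, start, end in records:
        grouped[category].append(repair_minutes(start, end))
    return {category: sum(v) / len(v) for category, v in grouped.items()}


def ecdf_at(samples, t):
    """Empirical CDF: fraction of observed repair times that are <= t."""
    if not samples:
        raise ValueError("ecdf_at needs at least one sample")
    return sum(1 for x in samples if x <= t) / len(samples)


if __name__ == "__main__":
    durations = [repair_minutes(s, e) for _, s, e in RECORDS]
    print(mean_repair_by_category(RECORDS))
    print(ecdf_at(durations, 120.0))
```

Evaluating `ecdf_at` over a grid of repair times gives the step curve that an empirical distribution plot would show; grouping by hour of day or day of week instead of fault category would follow the same pattern.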

Place, publisher, year, edition, pages
2016. 27 p.
National Category
Computer and Information Science
Identifiers
URN: urn:nbn:se:liu:diva-131847
ISRN: LIU-IDA/LITH-EX-G--16/072--SE
OAI: oai:DiVA.org:liu-131847
DiVA: diva2:1034002
Supervisors
Examiners
Available from: 2016-10-17. Created: 2016-10-10. Last updated: 2016-10-17. Bibliographically approved.

Open Access in DiVA

fulltext (525 kB), 115 downloads
File information
File name: FULLTEXT01.pdf
File size: 525 kB
Checksum (SHA-512):
0dacea04982a66f20e513eb42f0d5c6fb2883ddf3190d64ccd0483359368b9c70f29ed8dd5d469a49ede3887901f4f9e5114c81b7b783c71a9ebd3f3df9f04a1
Type: fulltext
Mimetype: application/pdf

Total: 115 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 379 hits