Enabling the future of Operational Risk Management: A research on the encoding of categorical data in supervised deep learning models
Linköping University, Department of Mathematics, Mathematical Statistics.
2018 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]

Non-numerical data such as names of persons, cities, brands or artifacts are common in many fields and can provide great value to different types of mathematical models. To use these so-called categorical values in numerical models, including most machine learning methods, the information has to be re-coded into something numerically interpretable. The drawback of the most common representation technique, one-hot encoding, is that the dimension of the numerical representation becomes impractically large when there are many unique values. Since model complexity grows exponentially with the size of the input space, it is of interest to keep the input dense. This study presents an overview of two types of compression methods for categorical features: with and without loss of information. The methods were first reviewed with respect to their fit to a data set containing transactions of financial instruments. To evaluate the effects of the methods on model performance, a deep-learning classification model identifying risk-prone transactions was trained using the different representations. The study concludes that grouping rarely occurring categorical values reduces the input space by 77.9% compared with a one-hot encoding, with a small negative effect on model performance. It is also shown that the time to train a model can be significantly reduced at the price of a small decrease in prediction accuracy, which indicates that the method can be suitable in applications with limited time or computational power.
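
The effect of grouping rare values on the width of a one-hot encoding can be illustrated with a minimal sketch. It is not the thesis's implementation: the helper names `group_rare` and `one_hot`, the `min_count` threshold, the `__OTHER__` label and the toy data are all assumptions made for illustration, and the record does not state which grouping rule was actually used.

```python
from collections import Counter


def group_rare(values, min_count, other="__OTHER__"):
    """Replace every category seen fewer than min_count times with one shared label.

    This is a lossy compression: the grouped categories can no longer be told apart.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def one_hot(values):
    """Return (vocabulary, rows) with one indicator column per distinct category."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    rows = []
    for v in values:
        row = [0] * len(vocab)
        row[index[v]] = 1
        rows.append(row)
    return vocab, rows


# Toy data, invented for illustration only (not from the thesis data set).
counterparties = ["A", "A", "A", "B", "B", "C", "D", "E"]

full_vocab, full_rows = one_hot(counterparties)
grouped_vocab, grouped_rows = one_hot(group_rare(counterparties, min_count=2))

# Number of input columns before and after grouping.
print(len(full_vocab), len(grouped_vocab))  # prints "5 3"
```

In this toy case the three values that occur only once collapse into a single column, so the encoding shrinks from five columns to three. The reduction achieved on real data depends on how skewed the category frequencies are and on the chosen threshold.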

Place, publisher, year, edition, pages
2018. p. 52
Keywords [en]
deep learning, categorical features, classification, one-hot encoding
National Category
Mathematics
Identifiers
URN: urn:nbn:se:liu:diva-163953
ISRN: LiTH-MAT-EX--2020/01--SE
OAI: oai:DiVA.org:liu-163953
DiVA, id: diva2:1401981
External cooperation
Handelsbanken Capital Markets
Subject / course
Mathematics
Examiners
Available from: 2020-03-04
Created: 2020-02-28
Last updated: 2020-03-04
Bibliographically approved

Open Access in DiVA

File information
File name: FULLTEXT01.pdf
File size: 492 kB
Checksum (SHA-512): 75c9c66eaa5ac5a27d8831870c9351f4aa66f7ee586bd4b93d724522590f358015d46656bfd72aeedf9d37b85466a14f4efcdca125fcbff233ed2738c9f9d3f1
Type: fulltext
Mimetype: application/pdf
