Domain similarity metrics for predicting transfer learning performance
Linköping University, Department of Computer and Information Science, Human-Centered systems.
2019 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The lack of training data is a common problem in machine learning. One solution to this problem is to use transfer learning to remove or reduce the requirement of training data. Selecting datasets for transfer learning can be difficult, however. As a possible solution, this study proposes the domain similarity metrics document vector distance (DVD) and term frequency-inverse document frequency (TF-IDF) distance. DVD and TF-IDF could aid in selecting datasets for good transfer learning when there is no data from the target domain. A simple metric, shared vocabulary, is used as a baseline to check whether DVD or TF-IDF can indicate a better choice of fine-tuning dataset. SQuAD is a popular question answering dataset which has proven useful for pre-training models for transfer learning. The results were therefore measured by pre-training a model on the SQuAD dataset and fine-tuning it on a selection of different datasets. The proposed metrics were used to measure the similarity between the datasets to see whether there was a correlation between transfer learning effect and similarity. The results showed a clear relation between a small distance according to the DVD metric and good transfer learning. This could prove useful for a target domain without training data: a model could be trained on a big dataset and fine-tuned on a small dataset that is very similar to the target domain. It was also found that even a small amount of training data from the target domain can be used to fine-tune a model pre-trained on another domain, achieving better performance compared to training only on data from the target domain.
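
The record does not reproduce the exact formulas behind the three metrics, so the following is only a minimal sketch of one plausible reading: TF-IDF distance as a cosine distance between mean TF-IDF vectors, DVD as a cosine distance between mean document embeddings, and shared vocabulary as a word-type overlap ratio. All three readings are assumptions, and the embed function and the Jaccard-style overlap are illustrative stand-ins, not taken from the thesis.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
import numpy as np

def tfidf_distance(corpus_a, corpus_b):
    # One assumed reading of "TF-IDF distance": cosine distance between
    # the mean TF-IDF vectors of two corpora, fitted on a shared vocabulary.
    tfidf = TfidfVectorizer().fit_transform(corpus_a + corpus_b)
    mean_a = np.asarray(tfidf[:len(corpus_a)].mean(axis=0))
    mean_b = np.asarray(tfidf[len(corpus_a):].mean(axis=0))
    return float(cosine_distances(mean_a, mean_b)[0, 0])

def shared_vocabulary(corpus_a, corpus_b):
    # Baseline metric: Jaccard overlap of word types (the exact overlap
    # formula used in the thesis is not given here, so this is assumed).
    vocab_a = {w for doc in corpus_a for w in doc.lower().split()}
    vocab_b = {w for doc in corpus_b for w in doc.lower().split()}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

def document_vector_distance(embed, corpus_a, corpus_b):
    # One assumed reading of DVD: cosine distance between mean document
    # embeddings; `embed` (string -> vector) is a hypothetical stand-in
    # for whatever document-embedding model the thesis actually used.
    mean_a = np.mean([embed(doc) for doc in corpus_a], axis=0)
    mean_b = np.mean([embed(doc) for doc in corpus_b], axis=0)
    return float(cosine_distances([mean_a], [mean_b])[0, 0])

Under this reading, a practitioner would compute each metric between the target-domain corpus and every candidate fine-tuning corpus and prefer the candidate with the smallest DVD, since the abstract reports that a small DVD correlates with good transfer.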

Place, publisher, year, edition, pages
2019, p. 38
Keywords [en]
nlp, natural language processing, machine learning, transfer learning, similarity metrics, similarity, predict, performance
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-153747
ISRN: LITH-EX-A--18/046--SE
OAI: oai:DiVA.org:liu-153747
DiVA, id: diva2:1276490
External cooperation
Consid Linköping AB
Subject / course
Computer science
Available from: 2019-01-15. Created: 2019-01-08. Last updated: 2019-01-15. Bibliographically approved.

Open Access in DiVA

fulltext (1474 kB), 308 downloads
File information
File name: FULLTEXT01.pdf
File size: 1474 kB
Checksum: SHA-512
814ea3af5112fdc39946a4f52e44eff01a483784847a6d1333504fe6099ddf2a55dfd69cda0d4a2587fce3c3c1a3d891cdaed457c5631dced3cc7b886e705d80
Type: fulltext. Mimetype: application/pdf

Search in DiVA

By author/editor
Bäck, Jesper
By organisation
Human-Centered systems
Computer Sciences

Total: 308 downloads
The number of downloads is the sum of all downloads of full texts. It may include e.g. previous versions that are no longer available.

Total: 583 hits