Duplicate Detection and Text Classification on Simplified Technical English
Linköpings universitet, Institutionen för datavetenskap.
2019 (English)
Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Dublettdetektion och textklassificering på Förenklad Teknisk Engelska (Swedish)
Abstract [en]

This thesis investigates the most effective way of performing classification of text labels and clustering of duplicate texts in technical documentation written in Simplified Technical English. Pre-trained language models from transformers (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
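
The duplicate-detection setup named in the abstract (tf-idf vectors clustered with DBSCAN at a low distance threshold) can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation: the tokenizer, the smoothed idf formula, the cosine distance, the example sentences, and the parameter values `eps=0.1` and `min_samples=2` are all assumptions chosen for illustration. In practice a library implementation of tf-idf and DBSCAN would likely be used instead.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Return one L2-normalised sparse tf-idf vector (dict) per text."""
    # Assumed tokenizer: lowercase alphanumeric runs, punctuation dropped.
    docs = [Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    vectors = []
    for d in docs:
        # Assumed weighting: raw term frequency times a smoothed idf.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity for two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples):
    """Plain DBSCAN; returns one cluster id per point, -1 for noise."""
    n = len(vectors)
    # Neighbourhoods include the point itself (distance 0).
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours[i] if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Hypothetical instructions in the style of Simplified Technical English.
texts = [
    "Remove the four bolts from the cover.",
    "Remove the four bolts from the cover",
    "Make sure that the pressure is released.",
    "Make sure that the pressure is released.",
    "Lubricate the bearing with grease.",
]
labels = dbscan(tfidf_vectors(texts), eps=0.1, min_samples=2)
print(labels)
```

With a low `eps`, only texts whose tf-idf vectors are nearly identical end up in the same cluster, and texts without a close match are labelled as noise (`-1`), which is the behaviour one would want from a duplicate detector.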

Place, publisher, year, edition, pages
2019. p. 62
Keywords [en]
NLP, CNL, transformer models, LSTM, BERT, document embeddings, word embeddings, text classification, text clustering, transfer learning, machine learning
National subject category (HSV)
Identifiers
URN: urn:nbn:se:liu:diva-158714
ISRN: LIU-IDA/LITH-EX-A--19/033--SE
OAI: oai:DiVA.org:liu-158714
DiVA, id: diva2:1337383
External cooperation
Etteplan
Subject / course
Computer science
Presentation
2019-06-12, Alan Turing, Linköpings Universitet, Linköping, 10:00 (English)
Supervisor
Examiner
Available from: 2019-08-13 Created: 2019-07-14 Last updated: 2019-08-13 Bibliographically approved

Open Access in DiVA

fulltext (1862 kB), 77 downloads
File information
File: FULLTEXT01.pdf
File size: 1862 kB
Checksum (SHA-512): d8bc13003822669c5f75c4a06e527957cd2bb907748f6c173256cccb0e7e718ff8a0710c6f4e7c13a6d5de8cad975620682840cfebd2ef2cc657ba788c8299e8
Type: fulltext
Mimetype: application/pdf

Total: 153 hits