Cluster Analysis with Meaning: Detecting Texts that Convey the Same Message
Linköping University, Department of Computer and Information Science, Human-Centered systems.
2018 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Alternative title
Klusteranalys med mening: Detektering av texter som uttrycker samma sak (Swedish)
Abstract [en]

Textual duplicates can be hard to detect because they differ in wording while conveying similar meaning. Etteplan, a technical documentation company, has many writers who accidentally rewrite existing instructions that explain procedures. These "duplicates" clutter the database.

This is undesirable because it is duplicated work, and the condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst and to estimate how many duplicates there are.

The corpus is small but written in a controlled natural language called Simplified Technical English. The method combines document embeddings from doc2vec, clustering with HDBSCAN*, and validation with the Density-Based Clustering Validation (DBCV) index to chart the problem. A survey was sent out to determine a threshold at which documents stop being duplicates, and this threshold was then used to calculate a theoretical duplicate count.
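
The threshold step can be illustrated with a minimal sketch. The abstract does not say which similarity measure or counting rule the thesis uses, so the cosine similarity, the function names, the toy vectors, and the threshold value of 0.95 below are all assumptions made for illustration. In practice the vectors would come from a trained doc2vec model.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_duplicate_pairs(embeddings, threshold):
    """Return the document-id pairs whose similarity is at least `threshold`.

    `embeddings` maps a document id to its vector. Both the measure and the
    pairwise counting rule are assumptions, not the thesis's stated method.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical toy vectors standing in for doc2vec document embeddings.
toy_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

# Hypothetical threshold; the thesis derives its own value from a survey.
duplicates = find_duplicate_pairs(toy_embeddings, threshold=0.95)
print(len(duplicates), duplicates)
```

Comparing every pair is quadratic in the number of documents, which is acceptable for a small corpus like the one described here.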

Place, publisher, year, edition, pages
2018. p. 61
Keywords [en]
nlp, text mining, clustering, semantic meaning, text clustering, semantic duplicates, simplified technical english, duplicate detection, dbcv, doc2vec, etteplan
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-153873
ISRN: LIU-IDA/LITH-EX-A--2018/055--SE
OAI: oai:DiVA.org:liu-153873
DiVA, id: diva2:1278929
External cooperation
Etteplan
Subject / course
Computer science
Presentation
2018-12-14, von Neumann, B-huset, Linköpings Universitet, Linköping, 10:00 (English)
Available from: 2019-01-25. Created: 2019-01-15. Last updated: 2019-01-25. Bibliographically approved.

Open Access in DiVA

fulltext (1782 kB), 72 downloads
File information
File name: FULLTEXT01.pdf
File size: 1782 kB
Checksum (SHA-512): c3c62002e9011a7ebf3564cd74792dff37630eb53c9c704ef07db2d495a35b1eb4fad5af9fd91d957be5ab822711a9c4e09724713f484f15fb848910cf698d5f
Type: fulltext
Mimetype: application/pdf

By author/editor
Öhrström, Fredrik