Design and Implementation of a Name Matching Algorithm for Persian Language
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Name matching plays a vital role in many applications. It is used, for example, in information retrieval and deduplication systems to compare names in order to match them or to find names that refer to the same object, person, or company. Because names are subject to unavoidable variations and errors in any system, and because name matching is so important, many algorithms have been developed to handle it. These algorithms account for name variations caused by spelling, pattern, or phonetic modifications. However, most existing methods were developed for English and therefore reflect the characteristics of that language. Until now, no algorithm has been designed and implemented specifically for Persian. The purpose of this thesis is to present a name matching algorithm for Persian. After considering all major algorithms in this area, we selected one of the basic name matching methods and extended it to work particularly well for Persian names. The proposed algorithm, called the Persian Edit Distance Algorithm (PEDA), is built on the characteristics of the Persian language and compares Persian names on three levels: phonetic similarity, character form similarity, and keyboard distance, in order to give more accurate results for Persian names. The algorithm takes Persian names as input and outputs their similarity as a percentage. Three series of experiments were carried out to evaluate the proposed algorithm. The average f-measure is 0.86 for the first series and 0.80 for the second. The first series was also repeated with Levenshtein, which produced 33.9% false negatives on average, whereas PEDA produced 6.4%. The third series shows that PEDA works well for one, two, and three edits, with average true positive values of 99%, 81%, and 69%, respectively.
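The abstract does not give PEDA's actual cost tables, so the following is only a minimal sketch of the general idea it describes: an edit distance whose substitution cost is reduced when two letters are phonetically similar, similar in written form, or adjacent on the keyboard, with the result reported as a percentage. The letter groups, the keyboard pairs, the cost values (0.2, 0.4, 0.6), and all function names are assumptions made for illustration and are not taken from the thesis.

```python
# Illustrative sketch only: groups and costs below are assumptions,
# not the values used by PEDA in the thesis.

# Letters commonly pronounced alike in Persian (assumed grouping).
PHONETIC_GROUPS = [set("زذضظ"), set("سثص"), set("تط"), set("قغ"), set("هح")]

# Letters sharing a base shape and differing mainly in dots or strokes (assumed).
FORM_GROUPS = [set("بپتث"), set("جچحخ"), set("دذ"), set("رزژ"), set("سش"),
               set("صض"), set("طظ"), set("عغ"), set("فق"), set("کگ")]

# A small, hypothetical subset of neighbouring keys on a Persian layout.
KEYBOARD_NEIGHBORS = {frozenset("نم"), frozenset("بل"), frozenset("شس")}


def substitution_cost(a: str, b: str) -> float:
    """Return a reduced cost when two characters are easily confused."""
    if a == b:
        return 0.0
    if any(a in g and b in g for g in PHONETIC_GROUPS):
        return 0.2  # assumed cost for phonetic similarity
    if any(a in g and b in g for g in FORM_GROUPS):
        return 0.4  # assumed cost for character form similarity
    if frozenset((a, b)) in KEYBOARD_NEIGHBORS:
        return 0.6  # assumed cost for keyboard adjacency
    return 1.0      # unrelated characters: full Levenshtein cost


def weighted_edit_distance(s: str, t: str) -> float:
    """Levenshtein-style dynamic programme with weighted substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # deletion
                curr[j - 1] + 1.0,                      # insertion
                prev[j - 1] + substitution_cost(a, b),  # substitution
            ))
        prev = curr
    return prev[-1]


def similarity_percent(s: str, t: str) -> float:
    """Normalise the distance by the longer name and express it as a percentage."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - weighted_edit_distance(s, t) / longest)
```

Under these assumed weights, two spellings that differ only in a phonetically equivalent letter score higher than they would under plain Levenshtein distance, which charges a full unit for every substitution. That is the kind of difference the reported false negative comparison is concerned with.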
Place, publisher, year, edition, pages
2013. 63 p.
Name matching, Persian language, string matching
Identifiers
URN: urn:nbn:se:liu:diva-102210
ISRN: LIU-IDA/LITH-EX-A--13/061--SE
OAI: oai:DiVA.org:liu-102210
DiVA: diva2:675478
Subject / course
Computer and information science at the Institute of Technology
Special Education Programme
Maleki, Jalal
Amirshekari, Nima
Ahrenberg, Lars, Professor