Functionality Classification Filter for Websites
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
The objective of this thesis is to evaluate different models and methods for website classification. The websites are classified based on their functionality, in this case specifically whether they are forums, news sites or blogs. The analysis is motivated by a search engine problem: for an information search, it is of interest to know which of these categories the results come from.
The data consist of two datasets, extracted from the web in January and April 2013. Together these datasets contain approximately 40,000 observations, each observation being the text extracted from a website. Approximately 7,000 word variables were subsequently created from this text, as were variables based on Latent Dirichlet Allocation. One variable (the number of links) was created from the HTML code of the website.
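As a rough illustration of the kind of variables described above, the following sketch derives word-count variables and a link count from one page. It is not the thesis's actual pipeline: the names `LinkAndTextParser` and `extract_features`, the three-word vocabulary and the sample page are all hypothetical, and the Latent Dirichlet Allocation variables are not covered.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class LinkAndTextParser(HTMLParser):
    """Collects visible text and counts <a href=...> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.link_count = 0
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not script or stylesheet content.
        if self._skip_depth == 0:
            self.text_parts.append(data)


def extract_features(html, vocabulary):
    """Return one count per vocabulary word plus the number of links."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts).lower()
    counts = Counter(re.findall(r"[a-z]+", text))
    features = {f"word_{w}": counts[w] for w in vocabulary}
    features["n_links"] = parser.link_count
    return features


# Toy page and vocabulary, invented for this example.
page = (
    "<html><body><h1>Forum</h1>"
    '<p>Reply to this thread. <a href="/t/1">Thread</a> '
    '<a href="/t/2">Reply</a></p>'
    '<script>var x = "thread";</script></body></html>'
)
features = extract_features(page, ["thread", "reply", "news"])
```

In a real setting the vocabulary would be derived from the corpus itself (the abstract mentions roughly 7,000 word variables) rather than fixed by hand.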
These datasets are used both to fit multinomial logistic regression models with Lasso regularization and to build a Naive Bayes classifier. The best classifier for the data studied was obtained with multinomial logistic regression when Lasso was applied to all variables to reduce their number. The accuracy of this model is 99.70 %.
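To make the Naive Bayes side of this comparison concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with additive smoothing. It is an assumption that this variant resembles the one used in the thesis; the function names, the smoothing parameter `alpha` and the six toy documents are invented for illustration. The Lasso-regularized multinomial logistic regression is not shown.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with additive (Laplace) smoothing."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    n_docs = len(labels)
    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n in class_doc_counts.items():
        model["log_prior"][label] = math.log(n / n_docs)
        # Smoothed denominator: class token total plus alpha per vocabulary word.
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + alpha) / total)
            for w in vocabulary
        }
    return model


def predict(model, tokens):
    """Return the label with the highest posterior log score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        ll = model["log_likelihood"][label]
        # Words never seen in training are ignored.
        scores[label] = log_prior + sum(ll[w] for w in tokens if w in ll)
    return max(scores, key=scores.get)


# Toy training data, invented for this example.
documents = [
    ["thread", "reply", "post"],
    ["reply", "quote", "thread"],
    ["breaking", "report", "today"],
    ["report", "minister", "today"],
    ["my", "diary", "today"],
    ["my", "recipe", "diary"],
]
labels = ["forum", "forum", "news", "news", "blog", "blog"]
model = train_naive_bayes(documents, labels)
```

Calling `predict(model, ["thread", "reply"])` on this toy model scores each of the three categories and returns the most probable one.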
When the time dependency of the models is considered, by fitting the model on the first dataset and testing it on the second, the accuracy is only 90.74 %. This indicates that the data are time dependent and that website topics change over time.
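The time-split evaluation described above can be sketched as follows: fit on the earlier dataset only, then measure accuracy on the later one. Everything here is hypothetical. The keyword-voting model is a toy stand-in for the thesis's classifiers, and the `january` and `april` lists are invented data chosen so that part of the later vocabulary is unseen during training.

```python
from collections import Counter, defaultdict


def train_keyword_classifier(observations):
    """Toy stand-in model: each word votes for the label it most often appeared with."""
    word_label_counts = defaultdict(Counter)
    label_counts = Counter()
    for tokens, label in observations:
        label_counts[label] += 1
        for token in set(tokens):
            word_label_counts[token][label] += 1
    default_label = label_counts.most_common(1)[0][0]

    def predict(tokens):
        votes = Counter()
        for token in tokens:
            if token in word_label_counts:
                votes[word_label_counts[token].most_common(1)[0][0]] += 1
        # Fall back to the most frequent training label if no word is known.
        return votes.most_common(1)[0][0] if votes else default_label

    return predict


def accuracy(predict_fn, observations):
    """Share of observations whose predicted label matches the true label."""
    correct = sum(1 for tokens, label in observations if predict_fn(tokens) == label)
    return correct / len(observations)


def evaluate_time_split(train_fn, first_period, second_period):
    """Train on the earlier period only, then score both periods."""
    predict_fn = train_fn(first_period)
    return {
        "first_period": accuracy(predict_fn, first_period),  # in-sample on toy data
        "second_period": accuracy(predict_fn, second_period),
    }


# Invented data: the later period contains words never seen in training.
january = [
    (["thread", "reply"], "forum"),
    (["election", "report"], "news"),
    (["diary", "recipe"], "blog"),
    (["thread", "quote"], "forum"),
]
april = [
    (["thread", "reply"], "forum"),
    (["budget", "report"], "news"),
    (["travel", "photos"], "blog"),
    (["quote", "reply"], "forum"),
]
result = evaluate_time_split(train_keyword_classifier, january, april)
```

A gap between the two scores, as in this toy case, is the pattern the abstract attributes to website content changing over time.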
Place, publisher, year, edition, pages
2013. 58 p.
Website classification, Functionality, Latent Dirichlet Allocation, Multinomial logistic regression
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:liu:diva-93702
ISRN: LIU-IDA/STAT-A--13/004--SE
OAI: oai:DiVA.org:liu-93702
DiVA: diva2:635113
Subject / course
Program in Statistics and Data Analysis
Villani, Mattias, Professor