liu.se: Search for publications in DiVA
  • 1.
    Aitken, Colin
    et al.
    Univ Edinburgh, Scotland.
    Nordgaard, Anders
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences. Swedish Police Author, Natl Forens Ctr, SE-58194 Linkoping, Sweden.
    The Roles of Participants Differing Background Information in the Evaluation of Evidence2018In: Journal of Forensic Sciences, ISSN 0022-1198, E-ISSN 1556-4029, Vol. 63, no 2, p. 648-649Article in journal (Other academic)
  • 2.
    Alsaadi, Sarah
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Wänström, Linda
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Sjögren, Björn
    Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning.
    Bjärehed, Marlene
    Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning.
    Thornberg, Robert
    Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning. Linköping University, Faculty of Educational Sciences.
    Collective moral disengagement and school bullying: An initial validation study of the Swedish scale version2016Conference paper (Refereed)
  • 3.
    Anderskär, Erika
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Thomasson, Frida
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Inkrementell responsanalys av Scandnavian Airlines medlemmar: Vilka kunder ska väljas vid riktad marknadsföring?2017Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    Scandinavian Airlines has a large database of its Eurobonus members. To analyze which customers should be targeted with direct marketing, such as emails, uplift models have been used. With a binary response variable that indicates whether the customer has bought or not, and a binary dummy variable that indicates whether the customer has received the campaign or not, conclusions can be drawn about which customers are persuadable, that is, customers who buy when they receive a campaign and do not buy when they don't. Analyses have been done with one campaign for Sweden and Scandinavia. The methods used are logistic regression with Lasso and logistic regression with Penalized Net Information Value (PNIV). When compared using a confusion matrix, the best method for predicting purchases is Lasso regression. The variable that best describes persuadable customers in logistic regression with PNIV is Flown (customers that have flown with SAS within the last six months). In Lasso regression, the variable that describes a persuadable customer in Sweden is membership level1 (the first level of membership), and in Scandinavia customers that receive campaigns with delivery code 13, which is a form of dispatch, are persuadable.

  • 4.
    Bartoszek, Krzysztof
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    Exact and approximate limit behaviour of the Yule trees cophenetic index2018In: Mathematical Biosciences, ISSN 0025-5564, E-ISSN 1879-3134, Vol. 303, p. 26-45Article in journal (Refereed)
    Abstract [en]

    In this work we study the limit distribution of an appropriately normalized cophenetic index of the pure-birth tree conditioned on n contemporary tips. We show that this normalized phylogenetic balance index is a sub-martingale that converges almost surely and in L^2. We link our work with studies on trees without branch lengths and show that in this case the limit distribution is a contraction-type distribution, similar to the Quicksort limit distribution. In the continuous branch case we suggest approximations to the limit distribution. We propose heuristic methods of simulating from these distributions and it may be observed that these algorithms result in reasonable tails. Therefore, we propose a way based on the quantiles of the derived distributions for hypothesis testing, whether an observed phylogenetic tree is consistent with the pure-birth process. Simulating a sample by the proposed heuristics is rapid, while exact simulation (simulating the tree and then calculating the index) is a time-consuming procedure. We conduct a power study to investigate how well the cophenetic indices detect deviations from the Yule tree and apply the methodology to empirical phylogenies.

    The full text will be freely available from 2019-05-07 17:34
  • 5.
    Bartoszek, Krzysztof
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    The phylogenetic effective sample size and jumps2018In: MATHEMATICA APPLICANDA (MATEMATYKA STOSOWANA), ISSN 1730-2668, Vol. 46, no 1, p. 25-33Article in journal (Refereed)
    Abstract [en]

    The phylogenetic effective sample size is a parameter that has as its goal the quantification of the amount of independent signal in a phylogenetically correlated sample. It was studied for Brownian motion and Ornstein-Uhlenbeck models of trait evolution. Here, we study this composite parameter when the trait is allowed to jump at speciation points of the phylogeny. Our numerical study indicates that there is a non-trivial limit as the effect of jumps grows. The limit depends on the value of the drift parameter of the Ornstein-Uhlenbeck process.

  • 6.
    Bartoszek, Krzysztof
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    Trait evolution with jumps: illusionary normality2017In: Proceedings of the XXIII National Conference on Applications of Mathematics in Biology and Medicine, 2017, p. 23-28Conference paper (Refereed)
    Abstract [en]

    Phylogenetic comparative methods for real-valued traits usually make use of stochastic processes whose trajectories are continuous. This is despite biological intuition that evolution is punctuated rather than gradual. On the other hand, there have been a number of recent proposals of evolutionary models with jump components. However, as we are only beginning to understand the behaviour of branching Ornstein-Uhlenbeck (OU) processes, the asymptotics of branching OU processes with jumps is an even greater unknown. In this work we build upon a previous study concerning OU with jumps evolution on a pure birth tree. We introduce an extinction component and explore, via simulations, its effects on the weak convergence of such a process. Furthermore, we use this work to illustrate the simulation and graphic generation possibilities of the mvSLOUCH package.

  • 7.
    Bartoszek, Krzysztof
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences. Uppsala University, Sweden.
    Glemin, Sylvain
    Uppsala University, Sweden; CNRS University of Montpellier IRD EPHE, France.
    Kaj, Ingemar
    Uppsala University, Sweden.
    Lascoux, Martin
    Uppsala University, Sweden.
    Using the Ornstein-Uhlenbeck process to model the evolution of interacting populations2017In: Journal of Theoretical Biology, ISSN 0022-5193, E-ISSN 1095-8541, Vol. 429, p. 35-45Article in journal (Refereed)
    Abstract [en]

    The Ornstein-Uhlenbeck (OU) process plays a major role in the analysis of the evolution of phenotypic traits along phylogenies. The standard OU process includes random perturbations and stabilizing selection and assumes that species evolve independently. However, evolving species may interact through various ecological processes and also exchange genes, especially in plants. This is particularly true if we want to study phenotypic evolution among diverging populations within species. In this work we present a straightforward statistical approach with analytical solutions that allows for the inclusion of adaptation and migration in a common phylogenetic framework, which can also be useful for studying local adaptation among populations within the same species. We furthermore present a detailed simulation study that clearly indicates the adverse effects of ignoring migration. Similarity between species due to migration could be misinterpreted as very strong convergent evolution without proper correction for these additional dependencies. Finally, we show that our model can be interpreted in terms of ecological interactions between species, providing a general framework for the evolution of traits between "interacting" species or populations.

  • 8.
    Bartoszek, Krzysztof
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences. Uppsala Univ, Sweden.
    Majchrzak, Marta
    Polish Acad Sci, Poland.
    Sakowski, Sebastian
    Univ Lodz, Poland.
    Kubiak-Szeligowska, Anna B.
    Polish Acad Sci, Poland.
    Kaj, Ingemar
    Uppsala Univ, Sweden.
    Parniewski, Pawel
    Polish Acad Sci, Poland.
    Predicting pathogenicity behavior in Escherichia coli population through a state dependent model and TRS profiling2018In: PloS Computational Biology, ISSN 1553-734X, E-ISSN 1553-7358, Vol. 14, no 1, article id e1005931Article in journal (Refereed)
    Abstract [en]

    The Binary State Speciation and Extinction (BiSSE) model is a branching process based model that allows the diversification rates to be controlled by a binary trait. We develop a general approach, based on the BiSSE model, for predicting pathogenicity in bacterial populations from microsatellites profiling data. A comprehensive approach for predicting pathogenicity in E. coli populations is proposed using the state-dependent branching process model combined with microsatellites TRS-PCR profiling. Additionally, we have evaluated the possibility of using the BiSSE model for estimating parameters from genetic data. We analyzed a real dataset (from 251 E. coli strains) and confirmed previous biological observations demonstrating a prevalence of some virulence traits in specific bacterial sub-groups. The method may be used to predict pathogenicity of other bacterial taxa.

  • 9.
    Bergstrand, Frida
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Nguyen, Ngan
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Bakgrundsvariablers påverkan på enkätsvaren i en telefonintervju: En studie om effekt av intervjuarens, respondentens och intervjuns egenskaper2017Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    Norstat recurrently performs a survey that contains questions about how much the respondent is watching different tv-channels, how different media-devices are used, the ownership of different devices and the usage of different tv-channel sites on the internet, social media, internet services, magazine services and streaming services. In this thesis, data from the survey performed during the autumn of 2016 was used. The aim of this thesis is to examine if there is a difference in answers based on different characteristics of the interviewers and respondents. 

    The 15 most important questions from the survey were chosen in this thesis, and to further reduce the number of response variables principal component analysis was used. The new scores that were produced by the analysis were the reduced response variables, which kept the most important information from the questions in the survey. Thereafter multilevel analyses and regression analyses were performed to examine the effects.  

    The results showed that there was an effect of different characteristics in different questions in the survey. The characteristics that showed effect were the age of the interviewer, the length of the employment, the age of the respondent, education, sex and native language. Some of the questions also showed effect based on whether the respondent lived in a metropolitan region or not.

  • 10.
    Bjärehed, Marlene
    et al.
    Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning.
    Sjögren, Björn
    Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning.
    Wänström, Linda
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Thornberg, Robert
    Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning. Linköping University, Faculty of Educational Sciences.
    Bullying and moral disengagement mechanisms2016Conference paper (Refereed)
  • 11.
    Bjärehed, Marlene
    et al.
    Linköping University, Faculty of Educational Sciences. Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning.
    Thornberg, Robert
    Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning. Linköping University, Faculty of Educational Sciences.
    Wänström, Linda
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Gianluca, Gini
    University of Padova.
    Sjögren, Björn
    Linköping University, Faculty of Educational Sciences. Linköping University, Department of Behavioural Sciences and Learning, Education, Teaching and Learning.
    Bullying perpetration and victimization and their associations with warm student–teacher relationship, individual and collective moral disengagement, and collective efficacy in a sample of Swedish fourth grade students: A multi-level analysis2017Conference paper (Refereed)
  • 12.
    Bonneau, Maxime
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Reinforcement Learning for 5G Handover2017Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    The development of the 5G network is in progress, and one part of the process that needs to be optimised is the handover. This operation, consisting of changing the base station (BS) providing data to a user equipment (UE), needs to be efficient enough to be seamless. From the BS point of view, this operation should be as economical as possible, while satisfying the UE needs. In this thesis, the problem of 5G handover has been addressed, and the chosen tool to solve this problem is reinforcement learning. A review of the different methods offered by reinforcement learning led to the restricted field of model-free, off-policy methods, more specifically the Q-Learning algorithm. In its basic form, and used with simulated data, this method provides information on which kind of reward and which kinds of action-space and state-space produce good results. However, despite working on some restricted datasets, this algorithm does not scale well due to lengthy computation times. This means that the trained agent cannot use much data for its learning process, and that neither the state-space nor the action-space can be extended much, restricting the use of the basic Q-Learning algorithm to discrete variables. Since the strength of the signal (RSRP), which is of high interest for matching the UE needs, is a continuous variable, a continuous form of Q-Learning needs to be used. A function approximation method is then investigated, namely artificial neural networks. In addition to the lengthy computation times, the results obtained are not yet convincing. Thus, despite some interesting results obtained from the basic form of the Q-Learning algorithm, the extension to the continuous case has not been successful. Moreover, the computation times make reinforcement learning applicable in our domain only on very powerful computers.

  • 13.
    Brouwers, Jack
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Thellman, Björn
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Klassificering av vinkvalitet2017Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    The data used in this paper is open-source data that was collected in Portugal over a three-year period between 2004 and 2007. It consists of the physicochemical parameters and the quality grades of the wines.

    This study focuses on assessing which variables primarily affect the quality of a wine and how the effects of the variables interact with each other, and on comparing which of the different classification methods works best and has the highest degree of accuracy.

    The data is divided into red and white wine where the response variable is ordered and consists of the grades of quality for the different wines. Due to the distribution in the response variable having too few observations in some of the quality grades, a new response variable was created where several grades were pooled together so that each different grade category would have a good amount of observations.

    The statistical methods used are Bayesian ordered logistic regression as well as two data mining techniques which are neural networks and decision trees.

    The result obtained showed that for the two types of wine it is primarily the alcohol content and the amount of volatile acid that are recurring parameters which have a great influence on predicting the quality of the wines.

    The results also showed that among the three different methods, decision trees were the best at classifying the white wines and neural networks were the best for the red wines.

  • 14.
    Bruzzone, Andrea
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    P-SGLD: Stochastic Gradient Langevin Dynamics with control variates2017Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Year after year, the amount of data that we continuously generate is increasing. When this situation started, the main challenge was to find a way to store the huge quantity of information. Nowadays, with the increasing availability of storage facilities, this problem is solved, but it gives us a new issue to deal with: finding tools that allow us to learn from these large data sets. In this thesis, a framework for Bayesian learning with the ability to scale to large data sets is studied. We present the Stochastic Gradient Langevin Dynamics (SGLD) framework and show that in some cases its approximation of the posterior distribution is quite poor. A reason for this can be that SGLD estimates the gradient of the log-likelihood with a high variability due to naïve sampling. Our approach combines accurate proxies for the gradient of the log-likelihood with SGLD. We show that it produces better results in terms of convergence to the correct posterior distribution than the standard SGLD, since accurate proxies dramatically reduce the variance of the gradient estimator. Moreover, we demonstrate that this approach is more efficient than the standard Markov Chain Monte Carlo (MCMC) method and that it outperforms other variance reduction techniques proposed in the literature, such as the SAGA-LD algorithm. SAGA-LD also uses control variates to improve SGLD, so the comparison with our approach is straightforward. We apply the method to the Logistic Regression model.

  • 15.
    Burdakov, Oleg
    et al.
    Linköping University, Department of Mathematics, Optimization. Linköping University, Faculty of Science & Engineering.
    Sysoev, Oleg
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    A Dual Active-Set Algorithm for Regularized Slope-Constrained Monotonic Regression2017In: Iranian Journal of Operations Research, ISSN 2008-1189, Vol. 8, no 2, p. 40-47Article in journal (Refereed)
    Abstract [en]

    In many problems, it is necessary to take into account monotonic relations. Monotonic (isotonic) Regression (MR) is often involved in solving such problems. The MR solutions are of a step-shaped form with a typical sharp change of values between adjacent steps. This, in some applications, is regarded as a disadvantage. We recently introduced a Smoothed MR (SMR) problem which is obtained from the MR by adding a regularization penalty term. The SMR is aimed at smoothing the aforementioned sharp change. Moreover, its solution has a far less pronounced step-structure, if at all available. The purpose of this paper is to further improve the SMR solution by getting rid of such a structure. This is achieved by introducing a lower bound on the slope in the SMR. We call it the Smoothed Slope-Constrained MR (SSCMR) problem. It is shown here how to reduce it to the SMR, which is a convex quadratic optimization problem. The Smoothed Pool Adjacent Violators (SPAV) algorithm developed in our recent publications for solving the SMR problem is adapted here to solving the SSCMR problem. This algorithm belongs to the class of dual active-set algorithms. Although the complexity of the SPAV algorithm is O(n^2), its running time grows almost linearly with n in our computational experiments. We present numerical results which illustrate the predictive performance quality of our approach. They also show that the SSCMR solution is free of the undesirable features of the MR and SMR solutions.

  • 16.
    Bäcklund, Joakim
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Nils, Johdet
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    A Bayesian approach to predict the number of soccer goals: Modeling with Bayesian Negative Binomial regression2018Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    This thesis focuses on a well-known topic in sports betting: predicting the number of goals in soccer games. The data set used comes from the top English soccer league, the Premier League, and consists of games played in the seasons 2015/16 to 2017/18. This thesis approaches the prediction with the auxiliary support of the odds from the betting exchange Betfair. The purpose is to find a model that can create an accurate goal distribution. The methods used are Bayesian Negative Binomial regression and Bayesian Poisson regression. The results conclude that the Poisson regression is the better model because of the presence of underdispersion. We argue that the methods can be used to compare different sportsbooks' accuracies, and may help create better models.

  • 17.
    Cros, Olivier
    et al.
    Linköping University, Department of Biomedical Engineering, Medical Informatics. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV). Aalborg University Hospital, Denmark.
    Eklund, Anders
    Linköping University, Department of Biomedical Engineering, Medical Informatics. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV). Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    Gaihede, Michael
    Department of Otolaryngology, Head & Neck Surgery, Aalborg University Hospital, Denmark.
    Knutsson, Hans
    Linköping University, Department of Biomedical Engineering, Medical Informatics. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Enhancement of micro-channels within the human mastoid bone based on local structure tensor analysis2016In: Image Processing Theory, Tools and Applications, IEEE, 2016Conference paper (Refereed)
    Abstract [en]

    Numerous micro-channels have recently been discovered in the human temporal bone by x-ray micro-CT-scanning. After a preliminary study suggesting that these micro-channels form a separate blood supply for the mucosa of the mastoid air cells, a structural analysis of the micro-channels using a local structure tensor was carried out. Despite the high resolution of the micro-CT scan, the presence of noise within the air cells, along with missing information in some micro-channels, suggested the need for image enhancement. This paper proposes an adaptive enhancement of the micro-channels based on a local structure analysis while minimizing the impact of noise on the overall data. Comparison with an anisotropic diffusion PDE based scheme was also performed.

  • 18.
    Cros, Olivier
    et al.
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV). Department of Otolaryngology, Head & Neck Surgery, Aalborg University Hospital, Denmark.
    Gaihede, Michael
    Department of Otolaryngology, Head & Neck Surgery, Aalborg University Hospital, Denmark; Department of Clinical Medicine, Aalborg University, Denmark.
    Eklund, Anders
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV). Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    Knutsson, Hans
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Surface and curve skeleton from a structure tensor analysis applied on mastoid air cells in human temporal bones2017In: IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), 2017, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 270-274Conference paper (Refereed)
    Abstract [en]

    The mastoid of the human temporal bone contains numerous air cells connected to each other. In order to gain further knowledge about these air cells, a more compact representation is needed to obtain an estimate of the size distribution of these cells. Existing skeletonization methods often fail to produce a faithful skeleton, mostly due to noise hampering the binary representation of the data. This paper proposes a different approach by extracting geometrical information embedded in the Euclidean distance transform of a volume via a structure tensor analysis based on quadrature filters, from which a secondary structure tensor allows the extraction of a surface skeleton along with a curve skeleton from its eigenvalues. Preliminary results obtained on X-ray micro-CT scans of a human temporal bone are very promising.

  • 19.
    Eamrurksiri, Araya
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Applying Machine Learning to LTE/5G Performance Trend Analysis2017Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    The core idea of this thesis is to reduce the workload of manual inspection when the performance analysis of updated software is required. The Central Processing Unit (CPU) utilization, which is one of the essential factors for evaluating the performance, is analyzed. The purpose of this work is to apply machine learning techniques that are suitable for detecting the state of the CPU utilization and any changes in the test environment that affect the CPU utilization. The detection relies on a Markov switching model to identify structural changes, which are assumed to follow an unobserved Markov chain, in the time series data. The historical behavior of the data can be described by a first-order autoregression. The Markov switching model then becomes a Markov switching autoregressive model. Another approach based on a non-parametric analysis, a distribution-free method that requires fewer assumptions, called the E-divisive method, is proposed. This method uses a hierarchical clustering algorithm to detect multiple change point locations in the time series data. As the data used in this analysis does not contain any ground truth, the methods are evaluated by generating simulated datasets with known states. These simulated datasets are also used for studying and comparing the Markov switching autoregressive model and the E-divisive method. Results show that the former method is preferable because of its better performance in detecting changes. Some information about the state of the CPU utilization is also obtained from fitting the Markov switching model. The E-divisive method is shown to have less power in detecting changes and a higher rate of missed detections. The results from applying the Markov switching autoregressive model to the real data are presented with interpretations and discussions.

  • 20.
    Eklund, Anders
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Repliker. ”Öppen vetenskap behöver inte kosta en enda krona”. 2016. In: Dagens Nyheter, ISSN 1101-2447. Article in journal (Other (popular science, discussion, etc.))
  • 21.
    Eklund, Anders
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Öppen vetenskap behöver inte kosta en krona. 2017. In: Svenska Dagbladet, ISSN 1101-2412. Article in journal (Other (popular science, discussion, etc.))
  • 22.
    Eklund, Anders
    et al.
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Knutsson, Hans
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Reply to Chen et al.: Parametric methods for cluster inference perform worse for two‐sided t‐tests. 2018. In: Human Brain Mapping, ISSN 1065-9471, E-ISSN 1097-0193. Article in journal (Other (popular science, discussion, etc.))
    Abstract [en]

    One‐sided t‐tests are commonly used in the neuroimaging field, but two‐sided tests should be the default unless a researcher has a strong reason for using a one‐sided test. Here we extend our previous work on cluster false positive rates, which used one‐sided tests, to two‐sided tests. Briefly, we found that parametric methods perform worse for two‐sided t‐tests, and that nonparametric methods perform equally well for one‐sided and two‐sided tests.

  • 23.
    Eklund, Anders
    et al.
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Knutsson, Hans
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Nichols, Thomas E
    Big Data Institute, University of Oxford, Oxford, United Kingdom. Department of Statistics, University of Warwick, Coventry, United Kingdom. Wellcome Trust Centre for Integrative Neuroimaging (WIN-FMRIB), University of Oxford, Oxford, United Kingdom.
    Cluster failure revisited: Impact of first level design and physiological noise on cluster false positive rates. 2018. In: Human Brain Mapping, ISSN 1065-9471, E-ISSN 1097-0193. Article in journal (Refereed)
    Abstract [en]

    Methodological research rarely generates broad interest, yet our work on the validity of cluster inference methods for functional magnetic resonance imaging (fMRI) created intense discussion on both the minutiae of our approach and its implications for the discipline. In the present work, we take on various critiques of our work and further explore the limitations of our original work. We address issues about the particular event‐related designs we used, considering multiple event types and randomization of events between subjects. We consider the lack of validity found with one‐sample permutation (sign flipping) tests, investigating a number of approaches to improve the false positive control of this widely used procedure. We found that the combination of a two‐sided test and cleaning the data using ICA FIX resulted in nominal false positive rates for all data sets, meaning that data cleaning is not only important for resting state fMRI, but also for task fMRI. Finally, we discuss the implications of our work on the fMRI literature as a whole, estimating that at least 10% of the fMRI studies have used the most problematic cluster inference method (p = .01 cluster defining threshold), and how individual studies can be interpreted in light of our findings. These additional results underscore our original conclusions, on the importance of data sharing and thorough evaluation of statistical methods on realistic null data.

  • 24.
    Eklund, Anders
    et al.
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Nichols, Thomas
    University of Warwick, England.
    How open science revealed false positives in brain imaging. 2017. In: Significance, ISSN 1740-9705, E-ISSN 1740-9713. Article in journal (Other (popular science, discussion, etc.))
    Abstract [en]

    A team set out to validate software used in fMRI analysis, but ended up invalidating one of neuroscience's most common testing procedures.

  • 25.
    Enoksson, Josefin
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Olausson, Sofia
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Bayesiansk flernivåanalys för att undersöka variationen i elevers trygghet i skolan: En studie baserad på enkäten Om mig. 2017. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
  • 26.
    Fredrik, Schlyter
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Predicting Personal Taxi Destinations Using Artificial Neural Networks. 2018. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Taxi Stockholm is a Swedish taxi company which would like to improve their mobile phone application with a destination prediction feature. This thesis has created an algorithm which predicts a destination to which a taxi customer would like to go. The problem is approached using the KDD process and data mining methods. A dataset consisting of previous taxi rides is cleaned, transformed, and then used to evaluate the performance of three machine learning models: a neural network model paired with K-Means clustering, a random forest model, and a k-nearest neighbour model. The results show that the models that were developed in this thesis could be used as a first step in a destination prediction system. The results also show that personal data increase the accuracy of the neural network model and that there exists a threshold for how much personal information is needed to increase the performance.

  • 27.
    Grek, Viktoria
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Gabrielsson, Molinia
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Analys av nutidens tågindelning: Ett uppdrag framtaget av Trafikverket. 2018. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    The information used in this paper comes from Trafikverket's delivery monitoring system. It consists of information about planned train missions on the Swedish railways for the years 2014 to 2017 during week four (except planned train missions on Roslagsbanan and Saltsjöbanan).

    Trafikanalys, with help from Trafikverket, presents public statistics for short-distance trains, middle-distance trains and long-distance trains on the Trafikanalys website. The three classes of trains have no scientific basis. The purpose of this study is therefore to analyze whether today's classes of trains can be used and which variables are important for the classification. The purpose of this study is also to analyze whether there is a better way to categorize the classes of trains when Trafikanalys publishes public statistics. The statistical methods used in this study are decision trees, neural networks and hierarchical clustering.

    The decision tree achieved 92.51 percent accuracy for the classification of Train type. The most important variables for Train type were Train length, Planned train kilometers and Planned km/h. Neural networks were used to investigate whether this method could provide a result similar to that of the decision tree and thereby strengthen the reliability. Neural networks achieved 88 percent accuracy when classifying Train type. These two results indicate that the larger proportion of train assignments could be classified to the correct Train type. This means that the current classification of Train type works when Trafikanalys presents official statistics.

    For the new train classification, three groups were analyzed when hierarchical clustering was used. These three groups were not the same as the groups short-distance trains, middle-distance trains and long-distance trains. Because the new divisions blend the various passenger trains, this result does not help to find a better subdivision that can be used when Trafikanalys presents official statistics.

  • 28.
    Gu, Xuan
    et al.
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Eklund, Anders
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Knutsson, Hans
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Repeated Tractography of a Single Subject: How High Is the Variance? 2017. In: Modeling, Analysis, and Visualization of Anisotropy / [ed] Thomas Schultz, Evren Özarslan, Ingrid Hotz, Springer, 2017, p. 331-354. Chapter in book (Other academic)
    Abstract [en]

    We have investigated the test-retest reliability of diffusion tractography, using 32 diffusion datasets from a single healthy subject. Preprocessing was carried out using functions in FSL (FMRIB Software Library), and tractography was carried out using FSL and Dipy. The tractography was performed in diffusion space, using two seed masks (corticospinal and cingulum gyrus tracts) created from the JHU White-Matter Tractography atlas. The tractography results were then warped into MNI standard space by a linear transformation. The reproducibility of tract metrics was examined using the standard deviation, the coefficient of variation (CV) and the Dice similarity coefficient (DSC), which all indicated a high reproducibility. Our results show that the multi-fiber model in FSL is able to reveal more connections between brain areas, compared to the single fiber model, and that distortion correction increases the reproducibility.
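    As an illustration of two of the reproducibility metrics named in the abstract, the following is a minimal sketch of the Dice similarity coefficient (DSC) and the coefficient of variation (CV). The toy voxel sets and tract volumes are hypothetical and are not data from the study; the function names are assumptions made for this example.

```python
from statistics import mean, stdev


def dice(a: set, b: set) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical voxel index sets from two repeated tractography runs.
run_1 = {1, 2, 3}
run_2 = {2, 3, 4}
dsc = dice(run_1, run_2)

# Hypothetical tract volumes from three repeated scans.
cv = coefficient_of_variation([1.0, 2.0, 3.0])
```

    A DSC close to 1 and a CV close to 0 both indicate high reproducibility across repeated scans.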

  • 29.
    Gu, Xuan
    et al.
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Sidén, Per
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Wegmann, Bertil
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Eklund, Anders
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Villani, Mattias
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Knutsson, Hans
    Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. Linköping University, Center for Medical Image Science and Visualization (CMIV).
    Bayesian Diffusion Tensor Estimation with Spatial Priors. 2017. In: CAIP 2017: Computer Analysis of Images and Patterns, 2017, Vol. 10424, p. 372-383. Conference paper (Refereed)
    Abstract [en]

    Spatial regularization is a technique that exploits the dependence between nearby regions to locally pool data, with the effect of reducing noise and implicitly smoothing the data. Most of the currently proposed methods are focused on minimizing a cost function, during which the regularization parameter must be tuned in order to find the optimal solution. We propose a fast Markov chain Monte Carlo (MCMC) method for diffusion tensor estimation, for both 2D and 3D priors data. The regularization parameter is estimated jointly with the tensor using MCMC. We compare FA (fractional anisotropy) maps for various b-values using three diffusion tensor estimation methods: least-squares and MCMC with and without spatial priors. Coefficient of variation (CV) is calculated to measure the uncertainty of the FA maps calculated from the MCMC samples, and our results show that the MCMC algorithm with spatial priors provides a denoising effect and reduces the uncertainty of the MCMC samples.

  • 30.
    Hammi, Malik
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Akdeve, Ahmet Hakan
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Poweranalys: bestämmelse av urvalsstorlek genom linjära mixade modeller och ANOVA. 2018. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    In research where experiments on humans and animals are performed, it is important to determine in advance how many observations are needed in a study to detect any effects in groups and to save time and costs. This can be examined by power analysis, which determines a sample size that is large enough to detect effects in a study with a given “power”. Power is the probability of rejecting the null hypothesis when the null hypothesis is false.
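    To make the definition of power concrete, the sketch below computes the power of a two-sided, two-sample z-test with known standard deviation and searches for the smallest group size that reaches a target power. This is a deliberately simplified stand-in and not the thesis's method, which is based on linear mixed models and mixed-design ANOVA in R; the function names, the effect size of 0.5 and the default parameters are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(n_per_group: int, effect: float,
                 sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 (no group difference) when the true
    difference in means is `effect`, for a two-sided two-sample z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standardized shift of the test statistic under the alternative.
    shift = effect / (sigma * sqrt(2 / n_per_group))
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def smallest_n(effect: float, target: float = 0.80,
               sigma: float = 1.0, alpha: float = 0.05,
               n_max: int = 10_000) -> int:
    """Smallest group size whose power reaches `target`."""
    for n in range(2, n_max + 1):
        if z_test_power(n, effect, sigma, alpha) >= target:
            return n
    raise ValueError("target power not reached within n_max")


# Hypothetical standardized effect of 0.5 and a desired power of 80 percent.
n_needed = smallest_n(effect=0.5)
power_at_n = z_test_power(n_needed, effect=0.5)
```

    Power increases with the group size, so the first size that reaches the target is also the minimal one under these assumptions.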

    Mälardalen University and Karolinska Institutet have jointly set up a study (The Climate Friendly and Ecological Food on Microbiota) based on individuals' dietary intake. Each individual has been assigned to a specific diet for 8 weeks, with the purpose of examining whether emissions of carbon dioxide, CO2, differ depending on the specific diet each individual follows. There are two groups, one treatment group and one control group. Individuals assigned to the treatment group are supposed to follow a climatarian diet, while the individuals in the control group follow a conventional diet. Each individual has been followed up during 8 weeks in total, with three different measurement occasions, 4 weeks apart. The different measurements are Baseline assessment, Midline assessment and End assessment.

    In the CLEAR-study there are a total of 18 individuals, with 9 individuals in each group. The number of individuals is not enough to reach statistical significance in a test, and therefore the sample size shall be examined through power analysis. In terms of data, every individual has three different measurement occasions that need to be modeled through mixed-design ANOVA and linear mixed models. These two methods take into account each individual's different measurements. The models which describe the data are applied in the computations of sample sizes and power. All analyses are done in the programming language R, with means and standard deviations from the study and the models as a base.

    Sample sizes and power have been computed for two different linear mixed models and one ANOVA model. The linear mixed models required fewer individuals than ANOVA for a desired power of 80 percent. 24 individuals in total were required by the linear mixed model that had the factors group, time, id and the covariate sex. 42 individuals were required by the ANOVA that includes the variables id, group and time.

  • 31.
    Holm, Rasmus
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Prediction of Inter-Frequency Measurements in a LTE Network with Deep Learning. 2018. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    The telecommunications industry faces difficult challenges as more and more devices communicate over the internet. A telecommunications network is a complex system with many parts and some are candidates for further automation. We have focused on inter-frequency measurements that are used during inter-frequency handovers, among other procedures. A handover is the procedure when for instance a phone changes the base station it communicates with and the inter-frequency measurements are rather expensive to perform.

    More specifically, we have investigated the possibility of using deep learning, an ever expanding field in machine learning, for predicting inter-frequency measurements in a Long Term Evolution (LTE) network. We have focused on the multi-layer perceptron and extended it with (variational) autoencoders or modified it through dropout such that it approximates the predictive distribution of a Gaussian process.

    The telecommunications network consists of many cells and each cell gathers its own data. One of the strengths of deep learning models is that they usually increase their performance as more and more data is used. We have investigated whether we do see an increase in performance if we combine data from multiple cells and the results show that this is not necessarily the case. The performances are comparable between models trained on combined data from multiple cells and models trained on data from individual cells. We can expect the multi-layer perceptron to perform better than a linear regression model.

    The best performing multi-layer perceptron architectures have been rather shallow, 1-2 hidden layers, and the extensions/modifications we have used/done have not shown any significant improvements to warrant their presence.

    For the particular LTE network we have worked with, we would recommend using shallow multi-layer perceptron architectures as far as deep learning models are concerned.

  • 32.
    Hosini, Rebin
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Detection of high-risk shops in e-commerce. 2018. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
  • 33.
    Izquierdo, Milagros
    et al.
    Linköping University, Department of Mathematics, Mathematics and Applied Mathematics. Linköping University, Faculty of Science & Engineering.
    Johansson, Karin
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Meeting of the Catalan, Spanish, Swedish Math Societies (CAT‐SP‐SW‐MATH). 2017. Conference proceedings (editor) (Other academic)
    Abstract [en]

    A joint Meeting of the Catalan, Spanish, Swedish Math Societies (CAT-SP-SW-MATH) will be held in Umeå (Sweden) from 12th to 15th June 2017.

    The meeting is a symposium devoted to mathematics at large.

    The conference is conceived as a meeting point between the different areas of mathematics and its applications.

    The programme will consist of several plenary lectures, covering a wide range of areas of mathematics, and special sessions devoted to a single topic or area of mathematics.

    The venue of the conference will be the Department of Mathematics and Mathematical Statistics of Umeå University.

    Welcome!

    Milagros Izquierdo (Svenska matematikersamfundet)

    Xavier Jarque (Societat Catalana de Matemàtiques)

    Francisco José Marcellán (Real Sociedad Matemática Española)

  • 34.
    Jesperson, Sara
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Johansson, Sara
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Mönster som leder till sjukfrånvaro: Sekvensanalys på longitudinella data. 2017. Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    Absence due to sickness results in a cost to both employers and employees. For an unnamed wholesaler this is a problem at one of their warehouses, where the rate of sick leave is high. The aim of this thesis is to identify interesting patterns over time that lead to sick leave by analyzing data from the company's payroll system and their attendance system.

    The data is longitudinal and to detect the patterns that lead to sick leave, sequence analysis is used. To generate the sequential patterns the algorithm cSPADE is used since it allows time constraints to be specified for the sequences. The relevance of the generated sequences is evaluated with three interest measures: support, confidence and lift.
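    The three interest measures can be sketched for a sequential rule of the form "antecedent, later followed by consequent". The example below is a minimal illustration on a hypothetical toy database of event sequences; it does not implement cSPADE or its time constraints, and the event names and function names are assumptions made for this example.

```python
def contains(sequence, pattern) -> bool:
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def support(database, pattern) -> float:
    """Fraction of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


def rule_measures(database, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent => consequent.

    Assumes the antecedent and the consequent each occur at least once.
    """
    supp = support(database, list(antecedent) + list(consequent))
    conf = supp / support(database, antecedent)
    lift = conf / support(database, consequent)
    return supp, conf, lift


# Hypothetical weekly event sequences, one list per employee.
toy_db = [
    ["overtime", "short_sick", "long_sick"],
    ["overtime", "long_sick"],
    ["short_sick", "overtime"],
    ["overtime", "vacation"],
]
supp, conf, lift = rule_measures(toy_db, ["overtime"], ["long_sick"])
```

    A lift above 1 would suggest that the consequent follows the antecedent more often than its overall frequency alone would predict; a lift of 1 suggests no association.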

    Three separate analyses are performed where different variables are used, depending on whether they change over time or have a constant value, and for these analyses the data is aggregated weekly. The most common events that lead to sick leave for the employees are different durations of employment, gender and birth year. A few days' sick leave during a week, namely between 8 and 40 hours, is more common among the employees than shorter or longer sick leave. It can be noted that the pattern of previous sick leave usually leads to continued sick leave.

    The thesis also highlights the problems that arise in sequence analysis, for example that the constant variables overshadow the non-constant variables in the resulting sequences. This happens when variables that change over time are used in combination with variables that have a constant value, which may occur in longitudinal data.

  • 35.
    Jonsson, Fredrik
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    On the Construction of an Automatic Traffic Sign Recognition System. 2017. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    This thesis proposes an automatic road sign recognition system, including all steps from the initial detection of road signs from a digital image to the final recognition step that determines the class of the sign.

    We develop a Bayesian approach for image segmentation in the detection step using colour information in the HSV (Hue, Saturation and Value) colour space. The image segmentation uses a probability model which is constructed based on manually extracted data on colours of road signs collected from real images. We show how the colour data is fitted using mixtures of multivariate normal distributions, with Gibbs sampling used for parameter estimation. The fitted models are then used to find the (posterior) probability that a pixel colour belongs to a road sign using the Bayesian approach. Following the image segmentation, regions of interest (ROIs) are detected by using the Maximally Stable Extremal Region (MSER) algorithm, followed by classification of the ROIs using a cascade of classifiers.

    Synthetic images are used in training of the classifiers, by applying various random distortions to a set of template images constituting most road signs in Sweden, and we demonstrate that the construction of such synthetic images provides satisfactory recognition rates. We focus on a large set of the signs on the Swedish road network, including almost 200 road signs. We use classification models such as the Support Vector Machine (SVM), and Random Forest (RF), where for features we use Histogram of Oriented Gradients (HOG).

  • 36.
    Klasson Svensson, Emil
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Automatic Identification of Duplicates in Literature in Multiple Languages. 2018. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    As the number of books available online grows, the sizes of these collections grow at the same pace, and they are more commonly in multiple languages. Many of these corpora contain duplicates in the form of various editions or translations of books. The task of finding these duplicates is usually done manually, but the growing sizes make it time consuming and demanding. The thesis set out to find a method in the field of Text Mining and Natural Language Processing that can automate the process of manually identifying these duplicates in a corpus mainly consisting of fiction in multiple languages provided by Storytel.

    The problem was approached using three different methods to compute distance measures between books. The first approach compared the titles of the books using the Levenshtein distance. The second approach extracted entities from each book using Named Entity Recognition, represented them using tf-idf, and used cosine dissimilarity to compute distances. The third approach used a Polylingual Topic Model to estimate the books' distributions of topics and compared them using the Jensen-Shannon distance. In order to estimate the parameters of the Polylingual Topic Model, 8000 books were translated from Swedish to English using Apache Joshua, a statistical machine translation system. For each method, every pair of books written by an author was tested using a hypothesis test where the null hypothesis was that the two books compared are not editions or translations of each other. Since there is no known distribution to assume as the null distribution for each book, a null distribution was estimated using distance measures of books not written by the author. The methods were evaluated on two different sets of manually labeled data made by the author of the thesis: one randomly sampled using one-stage cluster sampling, and one consisting of books from authors that the corpus provider, prior to the thesis, considered more difficult to label using automated techniques.
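    Two of the distance measures named above can be sketched in a few lines. The example below is an illustrative, standard-library implementation of the Levenshtein distance between two titles and the Jensen-Shannon distance (base-2 logarithm) between two topic distributions; it is not the thesis's code, and the example strings and distributions are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (two-row dynamic programme)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jensen_shannon_distance(p, q) -> float:
    """Square root of the Jensen-Shannon divergence (log base 2), so the
    result lies in [0, 1]. `p` and `q` must be same-length distributions."""
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against rounding below 0


title_distance = levenshtein("kitten", "sitting")
same_topics = jensen_shannon_distance([0.5, 0.5], [0.5, 0.5])
disjoint_topics = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])
```

    In a duplicate-detection setting, small distances between two books by the same author would be the candidates worth testing against the estimated null distribution.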

    Of the three methods, Title Matching performed best in terms of accuracy and precision based on the sampled data. The entity matching approach had the lowest accuracy and precision but an almost constant recall at around 50 %. It was concluded that there seems to be a set of duplicates that are clearly distinguished from the estimated null distributions; with a higher significance level, better precision and accuracy could have been achieved with a similar recall for the specific method. For topic matching, the result was worse than for title matching, and when studied, the estimated model was not able to create quality topics because of multiple factors. It was concluded that further research is needed for the topic matching approach. None of the three methods were deemed to be complete solutions for automating the detection of book duplicates.

  • 37.
    Magnusson, Måns
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    Scalable and Efficient Probabilistic Topic Model Inference for Textual Data. 2018. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Probabilistic topic models have proven to be an extremely versatile class of mixed-membership models for discovering the thematic structure of text collections. There are many possible applications, covering a broad range of areas of study: technology, natural science, social science and the humanities.

    In this thesis, a new efficient parallel Markov Chain Monte Carlo inference algorithm is proposed for Bayesian inference in large topic models. The proposed methods scale well with the corpus size and can be used for other probabilistic topic models and other natural language processing applications. The proposed methods are fast, efficient, scalable, and will converge to the true posterior distribution.

    In addition, in this thesis a supervised topic model for high-dimensional text classification is also proposed, with emphasis on interpretable document prediction using the horseshoe shrinkage prior in supervised topic models.

    Finally, we develop a model and inference algorithm that can model agenda and framing of political speeches over time with a priori defined topics. We apply the approach to analyze the evolution of immigration discourse in the Swedish parliament by combining theory from political science and communication science with a probabilistic topic model.

    List of papers
    1. Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models
    2018 (English). In: Journal of Computational And Graphical Statistics, ISSN 1061-8600, E-ISSN 1537-2715, Vol. 27, no 2, p. 449-463. Article in journal (Refereed). Published
    Abstract [en]

    Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler.

    Place, publisher, year, edition, pages
    Taylor & Francis, 2018
    Keywords
    Bayesian inference, Gibbs sampling, Latent Dirichlet Allocation, Massive Data Sets, Parallel Computing, Computational complexity
    National Category
    Probability Theory and Statistics
    Identifiers
    urn:nbn:se:liu:diva-140872 (URN); 10.1080/10618600.2017.1366913 (DOI); 000435688200018 ()
    Funder
    Swedish Foundation for Strategic Research, SSFRIT 15-0097
    Available from: 2017-09-13 Created: 2017-09-13 Last updated: 2018-07-20. Bibliographically approved
    2. Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems using Bayesian Classification
    2016 (English). In: 2016 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY (QRS 2016), IEEE, 2016, p. 425-432. Conference paper, Published paper (Refereed)
    Abstract [en]

    We suggest a Bayesian approach to the problem of reducing bug turnaround time in large software development organizations. Our approach is to use classification to predict where bugs are located in components. This classification is a form of automatic fault localization (AFL) at the component level. The approach only relies on historical bug reports and does not require detailed analysis of source code or detailed test runs. Our approach addresses two problems identified in user studies of AFL tools. The first problem concerns the trust that the user can put in the results of the tool. The second problem concerns understanding how the results were computed. The proposed model quantifies the uncertainty in its predictions and all estimated model parameters. Additionally, the output of the model explains why a result was suggested. We evaluate the approach on more than 50000 bugs.

    Place, publisher, year, edition, pages
    IEEE, 2016
    Keywords
    Machine Learning; Fault Detection; Fault Location; Software Maintenance; Software Debugging; Software Engineering
    National Category
    Computer Sciences
    Identifiers
    urn:nbn:se:liu:diva-132879 (URN); 10.1109/QRS.2016.54 (DOI); 000386751700044 (); 978-1-5090-4127-5 (ISBN)
    Conference
    IEEE International Conference on Software Quality, Reliability and Security (QRS)
    Available from: 2016-12-06 Created: 2016-11-30 Last updated: 2018-05-17
    3. Pulling Out the Stops: Rethinking Stopword Removal for Topic Models
    2017 (English). In: 15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers, Stroudsburg: Association for Computational Linguistics (ACL), 2017, Vol. 2, p. 432-436. Conference paper, Published paper (Other academic)
    Abstract [en]

    It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corpus-specific stopwords after model inference is more transparent and produces similar results to removing those words prior to inference.

    Place, publisher, year, edition, pages
    Stroudsburg: Association for Computational Linguistics (ACL), 2017
    National Category
    Probability Theory and Statistics; General Language Studies and Linguistics; Specific Languages
    Identifiers
    urn:nbn:se:liu:diva-147612 (URN); 9781945626357 (ISBN)
    Conference
    15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers, April 3-7, 2017, Valencia, Spain
    Available from: 2018-04-27 Created: 2018-04-27 Last updated: 2018-04-27. Bibliographically approved
  • 38.
    Magnusson, Måns
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Jonsson, Leif
    Linköping University, Department of Computer and Information Science. Linköping University, Faculty of Science & Engineering.
    Villani, Mattias
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Broman, David
    School of Information and Communication Technology, Royal Institute of Technology KTH, Stockholm, Sweden.
    Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models. 2018. In: Journal of Computational And Graphical Statistics, ISSN 1061-8600, E-ISSN 1537-2715, Vol. 27, no 2, p. 449-463. Article in journal (Refereed)
    Abstract [en]

    Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler.

  • 39.
    Nalenz, Malte
    et al.
    Linköping University, Department of Computer and Information Science. Linköping University, Faculty of Arts and Sciences.
    Villani, Mattias
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    TREE ENSEMBLES WITH RULE STRUCTURED HORSESHOE REGULARIZATION. 2018. In: Annals of Applied Statistics, ISSN 1932-6157, E-ISSN 1941-7330, Vol. 12, no 4, p. 2379-2408. Article in journal (Refereed)
    Abstract [en]

    We propose a new Bayesian model for flexible nonlinear regression and classification using tree ensembles. The model is based on the RuleFit approach in Friedman and Popescu [Ann. Appl. Stat. 2 (2008) 916-954] where rules from decision trees and linear terms are used in an L1-regularized regression. We modify RuleFit by replacing the L1-regularization by a horseshoe prior, which is well known to give aggressive shrinkage of noise predictors while leaving the important signal essentially untouched. This is especially important when a large number of rules are used as predictors, as many of them only contribute noise. Our horseshoe prior has an additional hierarchical layer that applies more shrinkage a priori to rules with a large number of splits, and to rules that are only satisfied by a few observations. The aggressive noise shrinkage of our prior also makes it possible to complement the rules from boosting in RuleFit with an additional set of trees from Random Forest, which brings a desirable diversity to the ensemble. We sample from the posterior distribution using a very efficient and easily implemented Gibbs sampler. The new model is shown to outperform state-of-the-art methods like RuleFit, BART and Random Forest on 16 datasets. The model and its interpretation are demonstrated on the well-known Boston housing data, and on gene expression data for cancer classification. The posterior sampling, prediction and graphical tools for interpreting the model results are implemented in a publicly available R package.

  • 40.
    Neville, Kevin
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Channel attribution modelling using clickstream data from an online store. 2017. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    In marketing, the behaviour of users is analysed in order to discover which channels (for instance TV, social media, etc.) are important for increasing the user's intention to buy a product. The search for better channel attribution models than the common last-click model is of major concern for the marketing industry. In this thesis, a probabilistic model for channel attribution has been developed, and this model is demonstrated to be more data-driven than the conventional last-click model. The modelling includes an attempt to account for the time aspect, which has not been done in previous research. Our model is based on studying different sequence lengths and computing conditional probabilities of conversion using logistic regression models. A clickstream dataset from an online store was analysed using the proposed model. The thesis provides evidence that the last-click model is not optimal for conducting these kinds of analyses.

  • 41.
    Nordgaard, Anders
    et al.
    Linköping University, Faculty of Arts and Sciences. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Grimvall, Anders
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    A resampling technique for estimating the power of non-parametric trend tests. 2016. In: Environmetrics, ISSN 1180-4009, E-ISSN 1099-095X, Vol. 17, p. 257-267. Article in journal (Refereed)
    Abstract [en]

    The power of Mann–Kendall tests and other non-parametric trend tests is normally estimated by performing Monte Carlo simulations in which artificial data are generated according to simple parametric models. Here we introduce a resampling technique for power assessments that can be fully automated and accommodate almost any variation in the collected time series data. A rank regression model is employed to extract error terms representing irregular variation in data that are collected over several seasons and may contain a non-linear trend. Thereafter, an autoregressive moving average (ARMA) bootstrap method is used to generate new time series of error terms for power simulations. A study of water quality data from two Swedish rivers illustrates how our method can provide site- and variable-specific information about the power of the Hirsch and Slack test for monotonic trends. In particular, we show how to clarify the impact of sampling frequency on the power of the trend tests.
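
    As an illustration of the conventional baseline mentioned in the abstract above (Monte Carlo power estimation from a simple parametric model), the following is a minimal sketch and not the authors' resampling method. It assumes a linear trend with independent Gaussian noise, no ties, and a two-sided test at the 5% level using the normal approximation; the function names `mann_kendall_s`, `mann_kendall_rejects` and `simulated_power` are hypothetical.

```python
import math
import random


def mann_kendall_s(x):
    """Mann-Kendall statistic S: number of increasing pairs minus decreasing pairs."""
    s = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = x[j] - x[i]
            s += (d > 0) - (d < 0)  # sign of the pairwise difference
    return s


def mann_kendall_rejects(x, z_crit=1.96):
    """Two-sided Mann-Kendall test via the normal approximation (assumes no ties)."""
    n = len(x)
    s = mann_kendall_s(x)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S under H0 without ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)  # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return abs(z) > z_crit


def simulated_power(n, slope, sigma, n_sim=500, seed=1):
    """Share of simulated series (assumed model: linear trend + white noise) where H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        series = [slope * t + rng.gauss(0.0, sigma) for t in range(n)]
        rejections += mann_kendall_rejects(series)
    return rejections / n_sim
```

    Calling `simulated_power(30, 0.0, 1.0)` estimates the rejection rate when there is no trend, and a non-zero `slope` gives the estimated power under that assumed trend. The abstract's point is that such simple parametric generators may not reflect real seasonal, autocorrelated data, which is what the proposed resampling technique addresses.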

  • 42.
    Nordgaard, Anders
    et al.
    Linköping University, Department of Computer and Information Science, Statistics. Linköping University, Faculty of Arts and Sciences. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Polismyndigheten - Nationellt Forensiskt Centrum.
    Rasmusson, Birgitta
    Polismyndigheten - Nationellt Forensiskt Centrum.
    Professionell värdering av forensiska fynd borgar för rättssäkerhet. 2017. In: Juridisk Tidskrift, ISSN 1100-7761, Vol. 29, no 1, p. 228-232. Article in journal (Other (popular science, discussion, etc.))
  • 43.
    Pena, Jose M
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Representing independence models with elementary triplets. 2017. In: International Journal of Approximate Reasoning, ISSN 0888-613X, E-ISSN 1873-4731, Vol. 88, p. 587-601. Article in journal (Refereed)
    Abstract [en]

    In an independence model, the triplets that represent conditional independences between singletons are called elementary. It is known that the elementary triplets represent the independence model unambiguously under some conditions. In this paper, we show how this representation helps in performing some operations with independence models, such as finding the dominant triplets or a minimal independence map of an independence model, or computing the union or intersection of a pair of independence models, or performing causal reasoning. For the latter, we rephrase in terms of conditional independences some of Pearl's results for computing causal effects. (C) 2016 Elsevier Inc. All rights reserved.

  • 44.
    Pena, Jose M
    et al.
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Bendtsen, Marcus
    Linköping University, Department of Computer and Information Science, Database and information techniques. Linköping University, Faculty of Science & Engineering.
    Causal effect identification in acyclic directed mixed graphs and gated models. 2017. In: International Journal of Approximate Reasoning, ISSN 0888-613X, E-ISSN 1873-4731, Vol. 90, p. 56-75. Article in journal (Refereed)
    Abstract [en]

    We introduce a new family of graphical models that consists of graphs with possibly directed, undirected and bidirected edges but without directed cycles. We show that these models are suitable for representing causal models with additive error terms. We provide a set of sufficient graphical criteria for the identification of arbitrary causal effects when the new models contain directed and undirected edges but no bidirected edge. We also provide a necessary and sufficient graphical criterion for the identification of the causal effect of a single variable on the rest of the variables. Moreover, we develop an exact algorithm for learning the new models from observational and interventional data via answer set programming. Finally, we introduce gated models for causal effect identification, a new family of graphical models that exploits context specific independences to identify additional causal effects. (C) 2017 Elsevier Inc. All rights reserved.

  • 45.
    Quiroz, Matias
    et al.
    Linköping University, Department of Computer and Information Science, Statistics. Linköping University, Faculty of Science & Engineering. Research Division, Sveriges Riksbank, Stockholm, Sweden.
    Tran, Minh-Ngoc
    Discipline of Business Analytics, University of Sydney, Camperdown NSW, Australia.
    Villani, Mattias
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Kohn, Robert
    Australian School of Business, University of New South Wales, Sydney NSW, Australia.
    Speeding up MCMC by Delayed Acceptance and Data Subsampling. 2018. In: Journal of Computational And Graphical Statistics, ISSN 1061-8600, E-ISSN 1537-2715, Vol. 27, no 1, p. 12-22. Article in journal (Refereed)
    Abstract [en]

    The complexity of the Metropolis–Hastings (MH) algorithm arises from the requirement of a likelihood evaluation for the full dataset in each iteration. One solution has been proposed to speed up the algorithm by a delayed acceptance approach where the acceptance decision proceeds in two stages. In the first stage, an estimate of the likelihood based on a random subsample determines if it is likely that the draw will be accepted and, if so, the second stage uses the full data likelihood to decide upon final acceptance. Evaluating the full data likelihood is thus avoided for draws that are unlikely to be accepted. We propose a more precise likelihood estimator that incorporates auxiliary information about the full data likelihood while only operating on a sparse set of the data. We prove that the resulting delayed acceptance MH is more efficient. The caveat of this approach is that the full dataset needs to be evaluated in the second stage. We therefore propose to substitute this evaluation by an estimate and construct a state-dependent approximation thereof to use in the first stage. This results in an algorithm that (i) can use a smaller subsample m by leveraging on recent advances in Pseudo-Marginal MH (PMMH) and (ii) is provably within O(m^-2) of the true posterior.
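
    The two-stage acceptance decision described in the abstract above can be sketched as follows. This is a generic delayed-acceptance Metropolis-Hastings sampler under loudly stated assumptions, not the estimator or algorithm proposed in the paper: the model is a Gaussian mean with unit variance and a flat prior, the proposal is a symmetric random walk, and the cheap first-stage surrogate is a rescaled log-likelihood on one fixed subsample drawn once (a simplification that keeps the surrogate deterministic). The names `log_lik` and `delayed_acceptance_mh` are hypothetical.

```python
import math
import random


def log_lik(theta, data):
    """Gaussian log-likelihood with unit variance, up to an additive constant."""
    return -0.5 * sum((x - theta) ** 2 for x in data)


def delayed_acceptance_mh(data, n_iter=5000, m=100, step=0.15, seed=0):
    """Delayed-acceptance MH for a Gaussian mean (assumed flat prior, random-walk proposal)."""
    rng = random.Random(seed)
    n = len(data)
    sub = rng.sample(data, m)  # assumption: one fixed subsample, drawn once
    scale = n / m              # rescale the subsample log-likelihood to full-data size

    def cheap(theta):
        return scale * log_lik(theta, sub)

    theta = sum(sub) / m       # start at the subsample mean
    cheap_cur = cheap(theta)
    full_cur = log_lik(theta, data)
    draws = []
    full_evals = 0             # number of full-data likelihood evaluations in the loop
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        cheap_prop = cheap(prop)
        # Stage 1: screen the proposal with the cheap surrogate only.
        if math.log(1.0 - rng.random()) < cheap_prop - cheap_cur:
            # Stage 2: full-data ratio, corrected by the inverse surrogate ratio
            # so that the chain still targets the full-data posterior.
            full_prop = log_lik(prop, data)
            full_evals += 1
            log_a2 = (full_prop - full_cur) - (cheap_prop - cheap_cur)
            if math.log(1.0 - rng.random()) < log_a2:
                theta, cheap_cur, full_cur = prop, cheap_prop, full_prop
        draws.append(theta)
    return draws, full_evals
```

    The returned `full_evals` counts how often the second stage was reached, so comparing it with `n_iter` shows how many full-data evaluations the first-stage screening avoided. A crude surrogate like this one can lower the overall acceptance rate, which is in line with the abstract's motivation for a more precise likelihood estimator.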

  • 46.
    Raoufi-Danner, Torrin
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Effects of Missing Values on Neural Network Survival Time Prediction. 2018. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Data sets with missing values are a pervasive problem within medical research. Building lifetime prediction models based solely upon complete-case data can bias the results, so imputation is preferred over listwise deletion. In this thesis, artificial neural networks (ANNs) are used as a prediction model on simulated data with which to compare various imputation approaches. The construction and optimization of ANNs is discussed in detail, and some guidelines are presented for activation functions, number of hidden layers and other tunable parameters. For the simulated data, binary lifetime prediction at five years was examined. The ANNs here performed best with tanh activation, binary cross-entropy loss with softmax output and three hidden layers of between 15 and 25 nodes. The imputation methods examined are random, mean, missing forest, multivariate imputation by chained equations (MICE), pooled MICE with imputed target and pooled MICE with non-imputed target. Random and mean imputation performed poorly compared to the others and were used as a baseline comparison case. The other algorithms all performed well up to 50% missingness. There were no statistical differences between these methods below 30% missingness; however, missing forest had the best performance above this amount. It is therefore the recommendation of this thesis that the missing forest algorithm be used to impute missing data when constructing ANNs to predict breast cancer patient survival at the five-year mark.

  • 47.
    Rodriguez-Deniz, Hector
    et al.
    KTH Royal Inst Technol, Sweden.
    Jenelius, Erik
    KTH Royal Inst Technol, Sweden.
    Villani, Mattias
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
    Urban Network Travel Time Prediction via Online Multi-Output Gaussian Process Regression. 2017. In: 2017 IEEE 20TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC), IEEE, 2017. Conference paper (Refereed)
    Abstract [en]

    The paper explores the potential of Multi-Output Gaussian Processes to tackle network-wide travel time prediction in an urban area. Forecasting in this context is challenging due to the complexity of the traffic network, noisy data and unexpected events. We build on recent methods to develop an online model that can be trained in seconds by relying on prior network dependences through a coregionalized covariance. The accuracy of the proposed model outperforms historical means and other simpler methods on a network of 47 streets in Stockholm, by using probe data from GPS-equipped taxis. Results show how traffic speeds are dependent on the historical correlations, and how prediction accuracy can be improved by relying on prior information while using a very limited amount of current-day observations, which allows for the development of models with low estimation times and high responsiveness.

  • 48.
    Sandberg, Martina
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Credit Risk Evaluation using Machine Learning. 2017. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
  • 49.
    Schofield, Alexandra
    et al.
    Cornell University Ithaca, NY, USA.
    Magnusson, Måns
    Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
    Mimno, David
    Cornell University Ithaca, NY, USA.
    Pulling Out the Stops: Rethinking Stopword Removal for Topic Models. 2017. In: 15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers, Stroudsburg: Association for Computational Linguistics (ACL), 2017, Vol. 2, p. 432-436. Conference paper (Other academic)
    Abstract [en]

    It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corpus-specific stopwords after model inference is more transparent and produces similar results to removing those words prior to inference.

  • 50.
    Shipitsyn, Aleksey
    Linköping University, Faculty of Arts and Sciences. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
    Statistical Learning with Imbalanced Data. 2017. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    In this thesis several sampling methods for Statistical Learning with imbalanced data have been implemented and evaluated with a new metric, imbalanced accuracy. Several modifications and new algorithms have been proposed for intelligent sampling: Border links, Clean Border Undersampling, One-Sided Undersampling Modified, DBSCAN Undersampling, Class Adjusted Jittering, Hierarchical Cluster Based Oversampling, DBSCAN Oversampling, Fitted Distribution Oversampling, Random Linear Combinations Oversampling, Center Repulsion Oversampling.

    A set of requirements on a satisfactory performance metric for imbalanced learning have been formulated and a new metric for evaluating classification performance has been developed accordingly. The new metric is based on a combination of the worst class accuracy and geometric mean.

    In the testing framework nonparametric Friedman's test and post hoc Nemenyi’s test have been used to assess the performance of classifiers, sampling algorithms, combinations of classifiers and sampling algorithms on several data sets. A new approach of detecting algorithms with dominating and dominated performance has been proposed with a new way of visualizing the results in a network.

    From experiments on simulated and several real data sets we conclude that: i) different classifiers are not equally sensitive to sampling algorithms, ii) sampling algorithms have different performance within specific classifiers, iii) oversampling algorithms perform better than undersampling algorithms, iv) Random Oversampling and Random Undersampling outperform many well-known sampling algorithms, v) our proposed algorithms Hierarchical Cluster Based Oversampling, DBSCAN Oversampling with FDO, and Class Adjusted Jittering perform much better than other algorithms, vi) a few good combinations of a classifier and sampling algorithm may boost classification performance, while a few bad combinations may spoil the performance, but the majority of combinations are not significantly different in performance.
