Learning to Analyze Visual Data Streams for Environment Perception
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-0418-9694
2023 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

A mobile robot, instructed by a human operator, acts in an environment with many other objects. For an autonomous robot, however, human instructions should be minimal and limited to high-level commands, such as the ultimate task or destination. To increase the level of autonomy, it has become a foremost objective to mimic human vision using neural networks that take a stream of images as input and learn a specific computer vision task from large amounts of data. In this thesis, we explore several different models for surround sensing, each of which contributes to making a deeper understanding of the environment possible.

As its first contribution, this thesis presents an object tracking method for video sequences, a crucial component in a perception system. The method predicts a fine-grained mask that separates the pixels corresponding to the target from those corresponding to the background. Rather than tracking only location and size, the method tracks the pixels initially assigned to the target, a task known as video object segmentation. For subsequent time steps, the goal is to learn how the target looks using features from a neural network. We named our method A-GAME, after its generative modeling of the deep feature space, which separates target and background appearances.
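
A-GAME itself is a full neural network; the following is only a minimal sketch of the underlying idea, assuming pre-extracted deep features and a single diagonal Gaussian per class. The function names and the simple two-class Bayes rule are illustrative, not the thesis's implementation.

```python
# Minimal sketch (not the actual A-GAME architecture): class-conditional
# Gaussians over deep features, yielding per-pixel posterior probabilities.
import numpy as np

def fit_gaussian(features):
    """Fit a diagonal Gaussian to an (N, C) set of feature vectors."""
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6  # avoid division by zero
    return mean, var

def log_likelihood(features, mean, var):
    """Per-vector log-likelihood under a diagonal Gaussian, shape (N,)."""
    return -0.5 * (np.log(2 * np.pi * var) + (features - mean) ** 2 / var).sum(axis=1)

def posterior_target_prob(frame_feats, target_feats, background_feats):
    """Posterior probability that each feature vector belongs to the target."""
    mu_t, var_t = fit_gaussian(target_feats)
    mu_b, var_b = fit_gaussian(background_feats)
    ll_t = log_likelihood(frame_feats, mu_t, var_t)
    ll_b = log_likelihood(frame_feats, mu_b, var_b)
    # Equal class priors; the two-way softmax over log-likelihoods is a sigmoid.
    return 1.0 / (1.0 + np.exp(ll_b - ll_t))
```

In the actual A-GAME module, both the learning and prediction stages are differentiable, so the posterior map is produced in a single forward pass and the whole segmentation pipeline can be trained end-to-end.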

In the second contribution of this thesis, we detect, track, and segment all objects from a set of predefined object classes. With this information, the robot increases its capability to perceive its surroundings. We experiment with a graph neural network that weighs all new detections against existing tracks. This model outperforms prior works by separating visually and semantically similar objects frame by frame.
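
The thesis's model is a recurrent graph neural network trained end-to-end; the sketch below only illustrates the association step it replaces, using a hand-crafted cosine affinity and a greedy assignment. All names and the threshold are assumptions for illustration.

```python
# Simplified sketch (not the thesis's recurrent GNN): score every
# detection-track pair and greedily assign detections to existing tracks.
import numpy as np

def cosine_affinity(det_embed, track_embed):
    """Pairwise cosine similarity, shape (num_dets, num_tracks)."""
    d = det_embed / np.linalg.norm(det_embed, axis=1, keepdims=True)
    t = track_embed / np.linalg.norm(track_embed, axis=1, keepdims=True)
    return d @ t.T

def assign_detections(det_embed, track_embed, threshold=0.5):
    """Greedy one-to-one assignment; unmatched detections start new tracks."""
    affinity = cosine_affinity(det_embed, track_embed)
    assignment = {}
    used_tracks = set()
    # Process detections in order of their best affinity (most confident first).
    order = np.argsort(-affinity.max(axis=1))
    for i in order:
        j = int(np.argmax(affinity[i]))
        # Simplified: only each detection's single best track is considered.
        if affinity[i, j] > threshold and j not in used_tracks:
            assignment[int(i)] = j
            used_tracks.add(j)
        else:
            assignment[int(i)] = None  # no match: start a new track
    return assignment
```

The thesis's model replaces such a hand-crafted affinity and greedy rule with learned message passing over all detections and a memory of previously seen tracks, carried between frames by a recurrent connection.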

The third contribution investigates one limitation of anchor-based detectors, which classify pre-defined bounding boxes as either negative or positive and can therefore handle only a limited set of object shapes. One idea is to learn an alternative instance representation. We experiment with a neural network that predicts, from each pixel, the distance to the nearest object contour in different directions. The network then computes an approximated signed distance function containing the respective instance information.
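
The thesis trains a network to predict such directional distances; the sketch below instead computes them directly from a known binary mask, purely to illustrate how pooling distances over directions yields an approximate signed distance function. The function names and the choice of four axis-aligned directions are illustrative assumptions.

```python
# Illustrative sketch: in the thesis a network *predicts* these distance maps;
# here they are computed from a known binary mask to show the pooling step.
import numpy as np

def directional_distance(mask, step):
    """Steps from each pixel to the nearest label change along (dy, dx)."""
    h, w = mask.shape
    dist = np.full((h, w), np.inf)
    dy, dx = step
    for y in range(h):
        for x in range(w):
            d, yy, xx = 0, y, x
            while 0 <= yy < h and 0 <= xx < w:
                if mask[yy, xx] != mask[y, x]:
                    dist[y, x] = d
                    break
                yy, xx, d = yy + dy, xx + dx, d + 1
    return dist

def approximate_sdf(mask):
    """Min-pool the four axis-aligned distance maps and sign by the mask."""
    directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    pooled = np.minimum.reduce([directional_distance(mask, d) for d in directions])
    return np.where(mask, pooled, -pooled)  # positive inside, negative outside
```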

Last, this thesis studies a concept within model validation. We observed that overfitting could increase performance on benchmarks. However, this opportunity is of little value for sensing systems in practice, since measurements, such as lengths or angles, are quantities that describe the environment. The fourth contribution of this thesis is an extended validation technique for camera calibration. This technique uses a statistical model for each error difference between an observed value and the corresponding prediction of the projective model. We compute a statistical test over the differences to detect whether the projective model is incorrect.
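
As a rough illustration of this kind of validation, the sketch below assumes zero-mean Gaussian measurement noise with a known standard deviation, so that the sum of squared normalized reprojection residuals can be compared against a chi-square distribution. The exact statistical model and test in the thesis may differ, and all names here are illustrative.

```python
# Minimal sketch (not the thesis's exact procedure): a chi-square test over
# normalized residuals between observations and projective-model predictions.
import numpy as np
from scipy.stats import chi2

def calibration_test(observed, predicted, sigma, alpha=0.01):
    """Reject the projective model if the residuals are improbably large.

    observed, predicted: (N, 2) arrays of image points (measured vs. projected).
    sigma: assumed standard deviation of the measurement noise in pixels.
    """
    residuals = (observed - predicted) / sigma
    statistic = float(np.sum(residuals ** 2))
    dof = residuals.size
    p_value = chi2.sf(statistic, dof)  # probability of a statistic this large
    return p_value < alpha             # True => model likely incorrect
```

A rejected test indicates that the projective model no longer explains the observations within the assumed noise level.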

Abstract [sv]

A mobile robot, instructed by a human operator, acts in an environment with many other objects. For an autonomous robot, human intervention should be minimal and consist only of high-level instructions, such as the ultimate task or destination. Neural networks that take a stream of images as input and learn a specific computer vision task from large amounts of data, in order to mimic the ability that comes naturally to humans, have become crucial in the pursuit of autonomy. In this thesis, we explore different models, each of which contributes to making a deeper understanding of the surroundings possible.

The first contribution of the thesis investigates a method for object tracking, used to keep track of objects; an ability that is a key element of how the surrounding world can be perceived. The method estimates a detailed pixel mask of the object and classifies all other pixels as background. The pixels initially belonging to the object are tracked, so-called video object segmentation, instead of tracking position and size. For subsequent time steps, the goal is to learn the appearance of the object from features computed by a neural network. We named our method A-GAME, based on the generative modeling of deep features, which distinguishes how the object and the background look.

In the second contribution of this thesis, we detect, track, and segment all objects from a set of predefined object classes. With this information, the robot can increase its ability to perceive its surroundings. We experiment with a graph neural network to weigh all newly detected objects and existing object tracks. The method, which processes one image at a time and separates visually and semantically similar objects, outperforms previous works.

The third contribution investigates a limitation of detectors that use anchor-based object candidates. These detectors classify predefined box types for potential objects as either negative or positive and thereby limit, depending on shape, which objects can be detected. One idea is to learn an alternative object representation. We experiment with a neural network that predicts the distance to the nearest object contour in different directions from each pixel. The neural network then computes an approximated distance function, one image at a time, which contains information about the individual objects.

Finally, this thesis studies a concept within validation. We observed that overfitting could increase performance measures on benchmark datasets. However, this opportunity is of little value to us in practice, since measurements such as lengths or angles are quantities used to describe the surroundings. The fourth contribution of this thesis is an extended validation technique for camera calibration. This technique uses a statistical model for each deviation between an observed value and a corresponding prediction of the projective model. A statistical test is computed over the deviations to detect whether such a model is incorrect.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2023, p. 45
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2283
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-192620
DOI: 10.3384/9789180750158
ISBN: 9789180750141 (print)
ISBN: 9789180750158 (electronic)
OAI: oai:DiVA.org:liu-192620
DiVA, id: diva2:1745714
Public defence
2023-04-28, Ada Lovelace, B-building, Campus Valla, Linköping, 10:15 (English)
Note

Funding agencies: Saab Dynamics and the Wallenberg AI, Autonomous Systems, and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Furthermore, the computations were enabled by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre, and by resources provided by the Swedish National Infrastructure for Computing (SNIC) at Alvis, partially funded by the Swedish Research Council through grant agreement no. 2018-05973.

Available from: 2023-03-24. Created: 2023-03-24. Last updated: 2025-02-07. Bibliographically approved.
List of papers
1. A generative appearance model for end-to-end video object segmentation
2019 (English). In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 8945-8954. Conference paper, Published paper (Refereed).
Abstract [en]

One of the fundamental challenges in video object segmentation is to find an effective representation of the target and background appearance. The best performing approaches resort to extensive fine-tuning of a convolutional neural network for this purpose. Besides being prohibitively expensive, this strategy cannot be truly trained end-to-end since the online fine-tuning procedure is not integrated into the offline training of the network. To address these issues, we propose a network architecture that learns a powerful representation of the target and background appearance in a single forward pass. The introduced appearance module learns a probabilistic generative model of target and background feature distributions. Given a new image, it predicts the posterior class probabilities, providing a highly discriminative cue, which is processed in later network modules. Both the learning and prediction stages of our appearance module are fully differentiable, enabling true end-to-end training of the entire segmentation pipeline. Comprehensive experiments demonstrate the effectiveness of the proposed approach on three video object segmentation benchmarks. We close the gap to approaches based on online fine-tuning on DAVIS17, while operating at 15 FPS on a single GPU. Furthermore, our method outperforms all published approaches on the large-scale YouTube-VOS dataset.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Series
Proceedings - IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919, E-ISSN 2575-7075
Keywords
Segmentation; Grouping and Shape; Motion and Tracking
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-161037 (URN)
10.1109/CVPR.2019.00916 (DOI)
000542649302058 ()
9781728132938 (ISBN)
9781728132945 (ISBN)
Conference
IEEE Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15-20 June 2019
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Foundation for Strategic Research; Swedish Research Council
Available from: 2019-10-17. Created: 2019-10-17. Last updated: 2025-02-07. Bibliographically approved.
2. Recurrent Graph Neural Networks for Video Instance Segmentation
2023 (English). In: International Journal of Computer Vision, ISSN 0920-5691, E-ISSN 1573-1405, Vol. 131, p. 471-495. Article in journal (Refereed), Published.
Abstract [en]

Video instance segmentation is one of the core problems in computer vision. Formulating a purely learning-based method, which models the generic track management required to solve the video instance segmentation task, is a highly challenging problem. In this work, we propose a novel learning framework where the entire video instance segmentation problem is modeled jointly. To this end, we design a graph neural network that in each frame jointly processes all detections and a memory of previously seen tracks. Past information is considered and processed via a recurrent connection. We demonstrate the effectiveness of the proposed approach in comprehensive experiments. Our approach operates online at over 25 FPS and obtains 16.3 AP on the challenging OVIS benchmark, setting a new state-of-the-art. We further conduct detailed ablative experiments that validate the different aspects of our approach. Code is available at https://github.com/emibr948/RGNNVIS-PlusPlus.

Place, publisher, year, edition, pages
Springer, 2023
Keywords
Detection; Tracking; Segmentation; Video
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-190196 (URN)
10.1007/s11263-022-01703-8 (DOI)
000885236800001 ()
Note

Funding agencies: Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation; Excellence Center at Linköping-Lund in Information Technology (ELLIIT)

Available from: 2022-11-29. Created: 2022-11-29. Last updated: 2023-11-02. Bibliographically approved.
3. Predicting Signed Distance Functions for Visual Instance Segmentation
2021 (English). In: 33rd Annual Workshop of the Swedish-Artificial-Intelligence-Society (SAIS), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 5-10. Conference paper, Published paper (Refereed).
Abstract [en]

Visual instance segmentation is a challenging problem and becomes even more difficult if the objects of interest vary unconstrained in shape. Some objects are well described by a rectangle; however, this is hardly always the case. Consider for instance long, slender objects such as ropes. Anchor-based approaches classify predefined bounding boxes as either negative or positive and thus provide a limited set of shapes that can be handled. Defining anchor boxes that fit well to all possible shapes leads to an infeasible number of prior boxes. We explore a different approach and propose to train a neural network to compute distance maps along different directions. The network is trained at each pixel to predict the distance to the closest object contour in a given direction. By pooling the distance maps we obtain an approximation to the signed distance function (SDF). The SDF may then be thresholded in order to obtain a foreground-background segmentation. We compare this segmentation to foreground segmentations obtained from the state-of-the-art instance segmentation method YOLACT. On the COCO dataset, our segmentation yields a higher performance in terms of foreground intersection over union (IoU). However, while the distance maps contain information on the individual instances, it is not straightforward to map them to the full instance segmentation. We still believe that this idea is a promising research direction for instance segmentation, as it better captures the different shapes found in the real world.
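
As a small aside, the foreground IoU used in the comparison above can be computed as follows once the SDF has been thresholded. This is a minimal sketch with illustrative names; positive SDF values are taken to mean foreground.

```python
# Small sketch: threshold an (approximate) SDF to get a foreground mask and
# compare it to the ground truth with intersection over union (IoU).
import numpy as np

def foreground_iou(sdf, gt_mask, threshold=0.0):
    """IoU between the thresholded SDF and a ground-truth foreground mask."""
    pred_mask = sdf > threshold  # positive distances = inside an object
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 1.0
```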

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Series
Annual Workshop of the Swedish-Artificial-Intelligence-Society (SAIS)
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-179288 (URN)
10.1109/SAIS53221.2021.9484039 (DOI)
000855522600003 ()
2-s2.0-85111580246 (Scopus ID)
9781665442367 (ISBN)
9781665442374 (ISBN)
Conference
33rd Annual Workshop of the Swedish-Artificial-Intelligence-Society (SAIS), Sweden, 14-15 June, 2021
Note

Funding: Wallenberg AI, Autonomous Systems, and Software Program (WASP) - Knut and Alice Wallenberg Foundation

Available from: 2021-09-16. Created: 2021-09-16. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

fulltext (5920 kB)
File information
File name: FULLTEXT01.pdf. File size: 5920 kB. Checksum: SHA-512
2d63be37488839c253e6deb3efde8714d7d947e4027851bcd101f5f392e52b1d4105a3e1fbaac2c7e96e43dbe5291d0afc7ba7296ea0014401b5a71bd25d9091
Type: fulltext. Mimetype: application/pdf

Other links

Publisher's full text

Authority records

Brissman, Emil
