Dynamic Visual Learning
Johnander, Joakim
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. Zenseact AB, Gothenburg. (Computer Vision Laboratory)
ORCID iD: 0000-0003-2553-3367
2022 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Autonomous robots act in a dynamic world where both the robots and other objects may move. The surround sensing systems of such robots therefore work with dynamic input data and need to estimate both the current state of the environment and its dynamics. One of the key elements in obtaining a high-level understanding of the environment is to track dynamic objects. This enables the system to understand what the objects are doing, to predict where they will be in the future, and, going forward, to better estimate where they are. In this thesis, I focus on input from visual cameras, that is, images. Images have, with the advent of neural networks, become a cornerstone of sensing systems. Image-processing neural networks are optimized to perform a specific computer vision task -- such as recognizing cats and dogs -- on vast datasets of annotated examples. This is usually referred to as offline training, and given a well-designed neural network, enough high-quality data, and a suitable offline training formulation, the neural network is expected to become adept at the specific task.
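
As a rough illustration of what offline training amounts to in practice, the sketch below shows a minimal supervised training loop in PyTorch. The network, dataset, loss, and hyperparameters are generic placeholders, not anything used in this thesis.

```python
# Minimal sketch of offline supervised training (PyTorch).
# The network, dataset, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_offline(model: nn.Module, dataset, epochs: int = 10, lr: float = 1e-3):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # e.g. a classification task such as cats vs. dogs
    model.train()
    for _ in range(epochs):
        for images, labels in loader:   # dataset is assumed to yield (image, label) pairs
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()             # gradients w.r.t. all network parameters
            optimizer.step()            # one optimization step on the annotated batch
    return model
```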

This thesis starts with a study of object tracking. The tracking is based on the visual appearance of the object, achieved via discriminative convolution filters (DCFs). The first contribution of this thesis is to decompose the filter into multiple subfilters. This serves to increase robustness during object deformations or rotations. Moreover, it provides a more fine-grained representation of the object state, as the subfilters are expected to roughly track object parts. In the second contribution, a neural network is trained directly for object tracking. In order to obtain a fine-grained representation of the object state, the state is represented as a segmentation. The main challenge lies in the design of a neural network able to tackle this task. While common neural networks excel at recognizing patterns seen during offline training, they struggle to store novel patterns in order to later recognize them. To overcome this limitation, a novel appearance learning mechanism is proposed. The mechanism extends the state-of-the-art and is shown to generalize remarkably well to novel data. In the third contribution, the method is used together with a novel fusion strategy and failure detection criterion to semi-automatically annotate visual and thermal videos.

Sensing systems must not only track objects, but also detect them. The fourth contribution of this thesis strives to tackle joint detection, tracking, and segmentation of all objects from a predefined set of object classes. The challenge here lies not only in the neural network design, but also in the design of the offline training formulation. The final approach, a recurrent graph neural network, outperforms prior works with runtimes of the same order of magnitude.

Last, this thesis studies dynamic learning of novel visual concepts. It is observed that the learning mechanisms used for object tracking essentially learn the appearance of the tracked object. It is natural to ask whether this appearance learning could be extended beyond individual objects to entire semantic classes, enabling the system to learn new concepts based on just a few training examples. Such an ability is desirable in autonomous systems as it removes the need to manually annotate thousands of examples of each class that must be recognized. Instead, the system is trained to efficiently learn to recognize new classes. In the fifth contribution, we propose a novel learning mechanism based on Gaussian process regression. With this mechanism, our neural network outperforms the state-of-the-art, and the performance gap is especially large when multiple training examples are given.
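
As a rough illustration of the underlying idea, the sketch below uses Gaussian process regression as a few-shot learning mechanism: a handful of labelled support features define the regressor, and query features are classified via the posterior mean. The RBF kernel, noise level, and label encoding are illustrative assumptions, not the formulation proposed in the thesis.

```python
# Sketch of GP regression as a few-shot learning mechanism: support features
# and labels define the GP, query features are predicted by the posterior mean.
# Generic illustration only; kernel and label encoding are assumptions.
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, lengthscale: float = 1.0) -> torch.Tensor:
    # a: (N, D), b: (M, D) -> (N, M) kernel matrix
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior_mean(support_x, support_y, query_x, noise: float = 1e-2):
    # support_x: (N, D) features, support_y: (N, C) labels (e.g. one-hot)
    # query_x:   (M, D) features -> returns (M, C) predictions
    K_ss = rbf_kernel(support_x, support_x)
    K_qs = rbf_kernel(query_x, support_x)
    A = K_ss + noise * torch.eye(support_x.shape[0])
    return K_qs @ torch.linalg.solve(A, support_y)
```

Because the posterior mean is a closed-form, differentiable function of the support labels, a mechanism of this kind can be embedded as a layer inside a neural network and trained end-to-end.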

To summarize, this thesis studies and makes several contributions to learning systems that parse dynamic visuals and that dynamically learn visual appearances or concepts.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2022. p. 59
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2196
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
URN: urn:nbn:se:liu:diva-181604
DOI: 10.3384/9789179291488
ISBN: 9789179291471 (print)
ISBN: 9789179291488 (electronic)
OAI: oai:DiVA.org:liu-181604
DiVA, id: diva2:1616651
Public defence
2022-01-19, Ada Lovelace, B Building, Campus Valla, Linköping, 09:00 (English)
Projects
WASP Industrial PhD student
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2021-12-08 Created: 2021-12-03 Last updated: 2022-03-29
Bibliographically approved
List of papers
1. DCCO: Towards Deformable Continuous Convolution Operators for Visual Tracking
2017 (English). In: Computer Analysis of Images and Patterns: 17th International Conference, CAIP 2017, Ystad, Sweden, August 22-24, 2017, Proceedings, Part I / [ed] Michael Felsberg, Anders Heyden and Norbert Krüger, Springer, 2017, Vol. 10424, p. 55-67. Conference paper, Published paper (Refereed)
Abstract [en]

Discriminative Correlation Filter (DCF) based methods have shown competitive performance on tracking benchmarks in recent years. Generally, DCF-based trackers learn a rigid appearance model of the target. However, this reliance on a single rigid appearance model is insufficient in situations where the target undergoes non-rigid transformations. In this paper, we propose a unified formulation for learning a deformable convolution filter. In our framework, the deformable filter is represented as a linear combination of sub-filters. Both the sub-filter coefficients and their relative locations are inferred jointly in our formulation. Experiments are performed on three challenging tracking benchmarks: OTB-2015, TempleColor and VOT2016. Our approach improves on the baseline method, leading to performance comparable to the state of the art.
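
To illustrate the core idea of a filter built from sub-filters, the sketch below computes a detection response as a weighted sum of shifted sub-filter responses in the spatial domain. This is a simplification for illustration only: the paper works in a continuous, Fourier-domain formulation and jointly infers the coefficients and sub-filter locations, which is not reproduced here.

```python
# Illustrative sketch: a deformable filter response as a weighted sum of
# sub-filter responses, each shifted by its relative location.
import numpy as np
from scipy.ndimage import shift as nd_shift
from scipy.signal import correlate2d

def deformable_response(feature_map, subfilters, coefficients, offsets):
    """feature_map: (H, W); subfilters: list of (h, w) arrays;
    coefficients: list of floats; offsets: list of (dy, dx) relative locations."""
    total = np.zeros_like(feature_map, dtype=float)
    for f, c, (dy, dx) in zip(subfilters, coefficients, offsets):
        resp = correlate2d(feature_map, f, mode="same", boundary="fill")
        total += c * nd_shift(resp, (dy, dx), order=1, mode="constant")
    return total  # the argmax of this map gives the estimated target location
```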

Place, publisher, year, edition, pages
Springer, 2017
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 10424
National Category
Computer Vision and Robotics (Autonomous Systems) Computer Engineering
Identifiers
urn:nbn:se:liu:diva-145373 (URN)
10.1007/978-3-319-64689-3_5 (DOI)
000432085900005 ()
9783319646886 (ISBN)
9783319646893 (ISBN)
Conference
17th International Conference, CAIP 2017, Ystad, Sweden, August 22-24, 2017, Proceedings, Part I
Note

Funding agencies: SSF (SymbiCloud); VR (EMC2) [2016-05543]; SNIC; WASP; Nvidia

Available from: 2018-02-26 Created: 2018-02-26 Last updated: 2023-04-03
Bibliographically approved
2. A generative appearance model for end-to-end video object segmentation
2019 (English). In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 8945-8954. Conference paper, Published paper (Refereed)
Abstract [en]

One of the fundamental challenges in video object segmentation is to find an effective representation of the target and background appearance. The best performing approaches resort to extensive fine-tuning of a convolutional neural network for this purpose. Besides being prohibitively expensive, this strategy cannot be truly trained end-to-end since the online fine-tuning procedure is not integrated into the offline training of the network. To address these issues, we propose a network architecture that learns a powerful representation of the target and background appearance in a single forward pass. The introduced appearance module learns a probabilistic generative model of target and background feature distributions. Given a new image, it predicts the posterior class probabilities, providing a highly discriminative cue, which is processed in later network modules. Both the learning and prediction stages of our appearance module are fully differentiable, enabling true end-to-end training of the entire segmentation pipeline. Comprehensive experiments demonstrate the effectiveness of the proposed approach on three video object segmentation benchmarks. We close the gap to approaches based on online fine-tuning on DAVIS17, while operating at 15 FPS on a single GPU. Furthermore, our method outperforms all published approaches on the large-scale YouTube-VOS dataset.
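
As a simplified illustration of such a generative appearance module, the sketch below fits one Gaussian each to target and background pixel features and predicts per-pixel posterior class probabilities via Bayes' rule. The single-Gaussian, diagonal-covariance model and equal class priors are simplifying assumptions; the paper's module is richer and fully integrated into the end-to-end trained network.

```python
# Simplified sketch of a generative appearance module: Gaussian models of
# target and background features, posterior class probabilities for new pixels.
import torch

def fit_gaussians(features, mask):
    # features: (N, D) pixel embeddings, mask: (N,) in {0, 1} (1 = target)
    stats = []
    for cls in (0, 1):
        x = features[mask == cls]
        mean = x.mean(dim=0)
        var = x.var(dim=0) + 1e-5       # diagonal covariance for simplicity
        stats.append((mean, var))
    return stats

def posterior(features, stats):
    # Returns (N, 2) posterior probabilities via Bayes' rule (equal priors).
    log_liks = []
    for mean, var in stats:
        ll = -0.5 * (((features - mean) ** 2) / var + var.log()).sum(dim=1)
        log_liks.append(ll)
    return torch.softmax(torch.stack(log_liks, dim=1), dim=1)
```

Both fitting and prediction are closed-form and differentiable, which is the property that allows such a module to sit inside a network trained end-to-end.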

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Series
Proceedings - IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919, E-ISSN 2575-7075
Keywords
Segmentation; Grouping and Shape; Motion and Tracking
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-161037 (URN)
10.1109/CVPR.2019.00916 (DOI)
9781728132938 (ISBN)
9781728132945 (ISBN)
Conference
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15-20 June 2019
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Foundation for Strategic Research; Swedish Research Council
Available from: 2019-10-17 Created: 2019-10-17 Last updated: 2023-04-03
Bibliographically approved
3. Semi-automatic Annotation of Objects in Visual-Thermal Video
2019 (English). In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Institute of Electrical and Electronics Engineers (IEEE), 2019. Conference paper, Published paper (Refereed)
Abstract [en]

Deep learning requires large amounts of annotated data. Manual annotation of objects in video is, regardless of annotation type, a tedious and time-consuming process. In particular, for scarcely used image modalities, human annotation is hard to justify. In such cases, semi-automatic annotation provides an acceptable option.

In this work, a recursive, semi-automatic annotation method for video is presented. The proposed method utilizes a state-of-the-art video object segmentation method to propose initial annotations for all frames in a video, based on only a few manual object segmentations. In the case of a multi-modal dataset, the multi-modality is exploited to refine the proposed annotations even further. The final tentative annotations are presented to the user for manual correction.

The method is evaluated on a subset of the RGBT-234 visual-thermal dataset, reducing the workload for a human annotator by approximately 78% compared to full manual annotation. Utilizing the proposed pipeline, sequences are annotated for the VOT-RGBT 2019 challenge.
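
A high-level sketch of the recursive annotation loop described above follows. All helper functions (propagate_segmentation, fuse_modalities, detect_failure, request_manual_correction) are hypothetical placeholders standing in for the paper's video object segmentation propagation, visual-thermal fusion, failure detection, and manual-correction steps.

```python
# Hypothetical sketch of the recursive semi-automatic annotation loop.
# All helper functions are placeholders, not the paper's implementation.
def annotate_video(visual_frames, thermal_frames, manual_annotations, max_rounds=5):
    annotations = dict(manual_annotations)          # frame index -> segmentation
    fused = {}
    for _ in range(max_rounds):
        # Propagate the current (few) annotations to every frame, per modality.
        proposed_v = propagate_segmentation(visual_frames, annotations)
        proposed_t = propagate_segmentation(thermal_frames, annotations)
        fused = fuse_modalities(proposed_v, proposed_t)
        # Ask for manual input only where the proposal is judged unreliable.
        failures = [i for i in fused
                    if i not in annotations and detect_failure(fused[i])]
        if not failures:
            break
        for i in failures:
            annotations[i] = request_manual_correction(visual_frames[i], fused[i])
    for i, seg in fused.items():                    # accept the remaining proposals
        annotations.setdefault(i, seg)
    return annotations
```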

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Series
IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), ISSN 2473-9936, E-ISSN 2473-9944
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-161076 (URN)
10.1109/ICCVW.2019.00277 (DOI)
000554591602039 ()
978-1-7281-5023-9 (ISBN)
978-1-7281-5024-6 (ISBN)
Conference
IEEE International Conference on Computer Vision Workshop (ICCVW)
Funder
Swedish Research Council, 2013-5703; Swedish Foundation for Strategic Research; Wallenberg AI, Autonomous Systems and Software Program (WASP); Vinnova, VS1810-Q
Note

Funding agencies: Swedish Research Council [2013-5703]; project ELLIIT (the Strategic Area for ICT research - Swedish Government); Wallenberg AI, Autonomous Systems and Software Program (WASP); Visual Sweden project n-dimensional Modelling [VS1810-Q]

Available from: 2019-10-21 Created: 2019-10-21 Last updated: 2021-12-03
4. Video Instance Segmentation with Recurrent Graph Neural Networks
2021 (English). In: Pattern Recognition: 43rd DAGM German Conference, DAGM GCPR 2021, Bonn, Germany, September 28 – October 1, 2021, Proceedings / [ed] Bauckhage C., Gall J., Schwing A., Springer, 2021, p. 206-221. Conference paper, Published paper (Refereed)
Abstract [en]

Video instance segmentation is one of the core problems in computer vision. Formulating a purely learning-based method, which models the generic track management required to solve the video instance segmentation task, is a highly challenging problem. In this work, we propose a novel learning framework where the entire video instance segmentation problem is modeled jointly. To this end, we design a graph neural network that in each frame jointly processes all detections and a memory of previously seen tracks. Past information is considered and processed via a recurrent connection. We demonstrate the effectiveness of the proposed approach in comprehensive experiments. Our approach, operating at over 25 FPS, outperforms previous video real-time methods. We further conduct detailed ablative experiments that validate the different aspects of our approach.
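
The sketch below illustrates, schematically, what one per-frame step of such a recurrent graph neural network could look like: detections and remembered tracks form the nodes, messages are exchanged along track-detection edges, an assignment is scored, and the track memory is updated recurrently. The module names, shapes, and update rule are illustrative assumptions, not the paper's architecture.

```python
# Schematic per-frame step of a recurrent GNN over tracks and detections.
# Illustrative only; not the architecture of the paper.
import torch
import torch.nn as nn

class FrameStep(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.score = nn.Linear(dim, 1)            # track-detection affinity
        self.memory = nn.GRUCell(dim, dim)        # recurrent track-memory update

    def forward(self, track_mem: torch.Tensor, det_feat: torch.Tensor):
        # track_mem: (T, dim) remembered tracks, det_feat: (D, dim) detections
        T, D = track_mem.shape[0], det_feat.shape[0]
        # Build all track-detection edges and pass messages over them.
        pairs = torch.cat([track_mem[:, None, :].expand(T, D, -1),
                           det_feat[None, :, :].expand(T, D, -1)], dim=-1)
        edges = self.edge_mlp(pairs)              # (T, D, dim)
        affinity = self.score(edges).squeeze(-1)  # (T, D) assignment logits
        assign = affinity.softmax(dim=1)          # soft track-to-detection assignment
        # Aggregate detection information per track and update the memory.
        agg = assign @ det_feat                   # (T, dim)
        new_mem = self.memory(agg, track_mem)
        return new_mem, affinity
```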

Place, publisher, year, edition, pages
Springer, 2021
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 13024
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-183945 (URN)
10.1007/978-3-030-92659-5_13 (DOI)
978-3-030-92658-8 (ISBN)
978-3-030-92659-5 (ISBN)
Conference
43rd DAGM German Conference, DAGM GCPR 2021, Bonn, Germany, September 28 – October 1, 2021
Available from: 2022-03-28 Created: 2022-03-28 Last updated: 2022-03-29
Bibliographically approved

Open Access in DiVA

fulltext (4093 kB)
File information
File name: FULLTEXT01.pdf. File size: 4093 kB. Checksum: SHA-512
7a9ab8c8b66f0a16bc3ad37710a3de43bf6a238b94d8e4cd8bb685ffb6fb4bb71f59eccc3537cd6162b6b8ac00499d46c81cf5a9759877f573cf747ef76c2a41
Type: fulltext. Mimetype: application/pdf