Learning Convolution Operators for Visual Tracking
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0001-6144-9520
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Visual tracking is one of the fundamental problems in computer vision. Its numerous applications include robotics, autonomous driving, augmented reality and 3D reconstruction. In essence, visual tracking can be described as the problem of estimating the trajectory of a target in a sequence of images. The target can be any image region or object of interest. While humans excel at this task, requiring little effort to perform accurate and robust visual tracking, it has proven difficult to automate. It has therefore remained one of the most active research topics in computer vision.

In its most general form, no prior knowledge about the object of interest or environment is given, except for the initial target location. This general form of tracking is known as generic visual tracking. The unconstrained nature of this problem makes it particularly difficult, yet applicable to a wider range of scenarios. As no prior knowledge is given, the tracker must learn an appearance model of the target on-the-fly. Cast as a machine learning problem, it imposes several major challenges which are addressed in this thesis.

The main purpose of this thesis is the study and advancement of the so-called Discriminative Correlation Filter (DCF) framework, which has been shown to be particularly suitable for the tracking application. By utilizing properties of the Fourier transform, a correlation filter is discriminatively learned by efficiently minimizing a least-squares objective. The resulting filter is then applied to a new image in order to estimate the target location.
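
The learn-then-detect loop described above can be illustrated with a minimal single-channel sketch. It is not the thesis implementation: the function names, the Gaussian label, the regularization weight `lam`, and the use of a random array as a stand-in for image features are all assumptions made for this example. The filter is written as a circular convolution, so the least-squares objective decouples per frequency and has a closed-form solution.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Desired output: a Gaussian peak at index (0, 0), wrapping cyclically."""
    rows = np.arange(shape[0])
    cols = np.arange(shape[1])
    # Cyclic distance of every index to index 0.
    dr = np.minimum(rows, shape[0] - rows)[:, None]
    dc = np.minimum(cols, shape[1] - cols)[None, :]
    return np.exp(-(dr**2 + dc**2) / (2.0 * sigma**2))

def train_filter(x, y, lam=1e-2):
    """Minimize ||f (*) x - y||^2 + lam * ||f||^2, solved per frequency."""
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    return np.conj(x_hat) * y_hat / (np.abs(x_hat) ** 2 + lam)

def detect(f_hat, z):
    """Apply the learned filter to a new patch; return scores and their peak."""
    scores = np.real(np.fft.ifft2(f_hat * np.fft.fft2(z)))
    peak = np.unravel_index(np.argmax(scores), scores.shape)
    return scores, peak

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))          # hypothetical training patch
f_hat = train_filter(x, gaussian_label(x.shape))
z = np.roll(x, shift=(3, 5), axis=(0, 1))  # synthetic "next frame": cyclic shift
scores, peak = detect(f_hat, z)
print(peak)
```

In this synthetic setting the score peak lands at the applied shift of (3, 5), which is how the target displacement is read off. Real trackers use multi-channel features, windowing, and model updates over time, none of which are shown here.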

This thesis contributes to the advancement of the DCF methodology in several aspects. The main contribution regards the learning of the appearance model: First, the problem of updating the appearance model with new training samples is covered. Efficient update rules and numerical solvers are investigated for this task. Second, the periodic assumption induced by the circular convolution in DCF is countered by proposing a spatial regularization component. Third, an adaptive model of the training set is proposed to alleviate the impact of corrupted or mislabeled training samples. Fourth, a continuous-space formulation of the DCF is introduced, enabling the fusion of multiresolution features and sub-pixel accurate predictions. Finally, the problems of computational complexity and overfitting are addressed by investigating dimensionality reduction techniques.

As a second contribution, different feature representations for tracking are investigated. A particular focus is put on the analysis of color features, which had been largely overlooked in prior tracking research. This thesis also studies the use of deep features in DCF-based tracking. While many vision problems have greatly benefited from the advent of deep learning, it has proven difficult to harvest the power of such representations for tracking. In this thesis it is shown that both shallow and deep layers contribute positively. Furthermore, the problem of fusing their complementary properties is investigated.

The final major contribution of this thesis regards the prediction of the target scale. In many applications, it is essential to track the scale, or size, of the target since it is strongly related to the relative distance. A thorough analysis of how to integrate scale estimation into the DCF framework is performed. A one-dimensional scale filter is proposed, enabling efficient and accurate scale estimation.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018, p. 71
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1926
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
URN: urn:nbn:se:liu:diva-147543; DOI: 10.3384/diss.diva-147543; ISBN: 9789176853320 (print); OAI: oai:DiVA.org:liu-147543; DiVA, id: diva2:1201230
Public defence
2018-06-11, Ada Lovelace, B-huset, Campus Valla, Linköping, 13:00 (English)
Available from: 2018-05-03 Created: 2018-04-25 Last updated: 2018-05-24. Bibliographically approved
List of papers
1. Adaptive Color Attributes for Real-Time Visual Tracking
2014 (English). In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2014, IEEE Computer Society, 2014, p. 1090-1097. Conference paper, Published paper (Refereed)
Abstract [en]

Visual tracking is a challenging problem in computer vision. Most state-of-the-art visual trackers either rely on luminance information or use simple color representations for image description. Contrary to visual tracking, for object recognition and detection, sophisticated color features, when combined with luminance, have been shown to provide excellent performance. Due to the complexity of the tracking problem, the desired color feature should be computationally efficient, and possess a certain amount of photometric invariance while maintaining high discriminative power.

This paper investigates the contribution of color in a tracking-by-detection framework. Our results suggest that color attributes provide superior performance for visual tracking. We further propose an adaptive low-dimensional variant of color attributes. Both quantitative and attribute-based evaluations are performed on 41 challenging benchmark color sequences. The proposed approach improves the baseline intensity-based tracker by 24% in median distance precision. Furthermore, we show that our approach outperforms state-of-the-art tracking methods while running at more than 100 frames per second.

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
Series
IEEE Conference on Computer Vision and Pattern Recognition. Proceedings, ISSN 1063-6919
National Category
Computer Engineering
Identifiers
urn:nbn:se:liu:diva-105857 (URN); 10.1109/CVPR.2014.143 (DOI); 2-s2.0-84911362613 (Scopus ID); 978-147995117-8 (ISBN)
Conference
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, USA, June 24-27, 2014
Note

Publication status: Accepted

Available from: 2014-04-10 Created: 2014-04-10 Last updated: 2018-04-25. Bibliographically approved
2. Coloring Channel Representations for Visual Tracking
2015 (English). In: 19th Scandinavian Conference, SCIA 2015, Copenhagen, Denmark, June 15-17, 2015. Proceedings / [ed] Rasmus R. Paulsen, Kim S. Pedersen, Springer, 2015, Vol. 9127, p. 117-129. Conference paper, Published paper (Refereed)
Abstract [en]

Visual object tracking is a classical, but still open research problem in computer vision, with many real-world applications. The problem is challenging due to several factors, such as illumination variation, occlusions, camera motion and appearance changes. Such problems can be alleviated by constructing robust, discriminative and computationally efficient visual features. Recently, biologically-inspired channel representations \cite{felsberg06PAMI} have been shown to provide promising results in many applications ranging from autonomous driving to visual tracking.

This paper investigates the problem of coloring channel representations for visual tracking. We evaluate two strategies, channel concatenation and channel product, to construct channel coded color representations. The proposed channel coded color representations are generic and can be used beyond tracking.

Experiments are performed on 41 challenging benchmark videos. Our experiments clearly suggest that a careful selection of color features, together with an optimal fusion strategy, significantly outperforms the standard luminance-based channel representation. Finally, we show promising results compared to state-of-the-art tracking methods in the literature.

Place, publisher, year, edition, pages
Springer, 2015
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 9127
Keywords
Visual tracking, channel coding, color names
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-121003 (URN); 10.1007/978-3-319-19665-7_10 (DOI); 978-3-319-19664-0 (ISBN); 978-3-319-19665-7 (ISBN)
Conference
Scandinavian Conference on Image Analysis
Available from: 2015-09-02 Created: 2015-09-02 Last updated: 2018-04-25. Bibliographically approved
3. Discriminative Scale Space Tracking
2017 (English). In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539, Vol. 39, no 8, p. 1561-1575. Article in journal (Refereed), Published
Abstract [en]

Accurate scale estimation of a target is a challenging research problem in visual object tracking. Most state-of-the-art methods employ an exhaustive scale search to estimate the target size. The exhaustive search strategy is computationally expensive and struggles when faced with large scale variations. This paper investigates the problem of accurate and robust scale estimation in a tracking-by-detection framework. We propose a novel scale adaptive tracking approach by learning separate discriminative correlation filters for translation and scale estimation. The explicit scale filter is learned online using the target appearance sampled at a set of different scales. Contrary to standard approaches, our method directly learns the appearance change induced by variations in the target scale. Additionally, we investigate strategies to reduce the computational cost of our approach. Extensive experiments are performed on the OTB and the VOT2014 datasets. Compared to the standard exhaustive scale search, our approach achieves a gain of 2.5 percent in average overlap precision on the OTB dataset. Additionally, our method is computationally efficient, operating at a 50 percent higher frame rate compared to the exhaustive scale search. Our method obtains the top rank in performance by outperforming 19 state-of-the-art trackers on OTB and 37 state-of-the-art trackers on VOT2014.
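
A separate filter over the scale dimension can be sketched as a one-dimensional, multi-channel correlation filter: one feature vector is extracted per candidate scale, and the filter is trained so that its response peaks at the current scale. The sketch below is a simplified illustration, not the published method. The number of scales, the scale step of 1.02, the Gaussian label width, the regularization weight, and the random stand-in features are all assumptions of this example.

```python
import numpy as np

def scale_factors(num_scales=17, step=1.02):
    """Relative target sizes step**n, arranged so the middle entry equals 1."""
    exponents = np.arange(num_scales) - (num_scales - 1) // 2
    return step ** exponents

def train_scale_filter(feats, sigma=1.0, lam=1e-2):
    """feats has shape (channels, num_scales): one feature vector per scale."""
    num_scales = feats.shape[1]
    centre = (num_scales - 1) // 2
    # Desired response: a Gaussian peaked at the current (centre) scale.
    g = np.exp(-((np.arange(num_scales) - centre) ** 2) / (2.0 * sigma**2))
    f_hat = np.fft.fft(feats, axis=1)
    g_hat = np.fft.fft(g)
    energy = np.sum(np.abs(f_hat) ** 2, axis=0)  # summed over channels
    return np.conj(f_hat) * g_hat / (energy + lam)

def estimate_scale(h_hat, feats, factors):
    """Return the scale factor at which the filter response is largest."""
    z_hat = np.fft.fft(feats, axis=1)
    scores = np.real(np.fft.ifft(np.sum(h_hat * z_hat, axis=0)))
    return factors[np.argmax(scores)]

rng = np.random.default_rng(1)
factors = scale_factors()
train_feats = rng.standard_normal((4, factors.size))  # hypothetical features
h_hat = train_scale_filter(train_feats)
# Synthetic test: the appearance that matched the centre scale now sits
# two steps higher in the scale pyramid.
test_feats = np.roll(train_feats, 2, axis=1)
estimated = estimate_scale(h_hat, test_feats, factors)
```

Because the search runs over a single dimension, the cost per frame is one short FFT per feature channel instead of a full translation search at every scale, which is the efficiency argument made in the abstract.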

Place, publisher, year, edition, pages
IEEE COMPUTER SOC, 2017
Keywords
Visual tracking; scale estimation; correlation filters
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-139382 (URN); 10.1109/TPAMI.2016.2609928 (DOI); 000404606300006 (); 27654137 (PubMedID)
Note

Funding Agencies: Swedish Foundation for Strategic Research; Swedish Research Council; Strategic Vehicle Research and Innovation (FFI); Wallenberg Autonomous Systems Program; National Supercomputer Centre; Nvidia

Available from: 2017-08-07 Created: 2017-08-07 Last updated: 2018-04-25
4. Learning Spatially Regularized Correlation Filters for Visual Tracking
2015 (English). In: Proceedings of the International Conference on Computer Vision (ICCV), 2015, IEEE Computer Society, 2015, p. 4310-4318. Conference paper, Published paper (Refereed)
Abstract [en]

Robust and accurate visual tracking is one of the most challenging computer vision problems. Due to the inherent lack of training data, a robust approach for constructing a target appearance model is crucial. Recently, discriminatively learned correlation filters (DCF) have been successfully applied to address this problem for tracking. These methods utilize a periodic assumption of the training samples to efficiently learn a classifier on all patches in the target neighborhood. However, the periodic assumption also introduces unwanted boundary effects, which severely degrade the quality of the tracking model.

We propose Spatially Regularized Discriminative Correlation Filters (SRDCF) for tracking. A spatial regularization component is introduced in the learning to penalize correlation filter coefficients depending on their spatial location. Our SRDCF formulation allows the correlation filters to be learned on a significantly larger set of negative training samples, without corrupting the positive samples. We further propose an optimization strategy, based on the iterative Gauss-Seidel method, for efficient online learning of our SRDCF. Experiments are performed on four benchmark datasets: OTB-2013, ALOV++, OTB-2015, and VOT2014. Our approach achieves state-of-the-art results on all four datasets. On OTB-2013 and OTB-2015, we obtain an absolute gain of 8.0% and 8.2% respectively, in mean overlap precision, compared to the best existing trackers.
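
The iterative Gauss-Seidel method named above solves a linear system by sweeping through the unknowns and updating each one using the newest values of the others. The following is a generic textbook sketch on a small dense system, not the SRDCF solver itself; the matrix, right-hand side, and iteration count are made up for illustration. It converges for strictly diagonally dominant or symmetric positive definite matrices.

```python
import numpy as np

def gauss_seidel(A, b, x0=None, num_iters=50):
    """Approximately solve A x = b with Gauss-Seidel sweeps."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b) if x0 is None else np.array(x0, dtype=float)
    for _ in range(num_iters):
        for i in range(len(b)):
            # Contribution of all other unknowns, using their latest values.
            off_diag = A[i] @ x - A[i, i] * x[i]
            x[i] = (b[i] - off_diag) / A[i, i]
    return x

# Small strictly diagonally dominant example system (illustrative values).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = gauss_seidel(A, b)
```

The optional `x0` argument allows a warm start. In an online setting, starting from the previous solution typically means only a few sweeps are needed after each update, which is one reason an iterative solver can be attractive for this kind of problem.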

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
Series
IEEE International Conference on Computer Vision. Proceedings, ISSN 1550-5499
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-121609 (URN); 10.1109/ICCV.2015.490 (DOI); 000380414100482 (); 978-1-4673-8390-5 (ISBN)
Conference
International Conference on Computer Vision (ICCV), Santiago, Chile, December 13-16, 2015
Available from: 2015-09-28 Created: 2015-09-28 Last updated: 2018-04-25
5. Convolutional Features for Correlation Filter Based Visual Tracking
2015 (English). In: Proceedings of the IEEE International Conference on Computer Vision, IEEE conference proceedings, 2015, p. 621-629. Conference paper, Published paper (Refereed)
Abstract [en]

Visual object tracking is a challenging computer vision problem with numerous real-world applications. This paper investigates the impact of convolutional features for the visual tracking problem. We propose to use activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks. These activations have several advantages compared to the standard deep features (fully connected layers). Firstly, they mitigate the need for task-specific fine-tuning. Secondly, they contain structural information crucial for the tracking problem. Lastly, these activations have low dimensionality. We perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, in contrast to image classification, our results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Our results further show that the convolutional features provide improved results compared to standard handcrafted features. Finally, results comparable to state-of-the-art trackers are obtained on all three benchmark datasets.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2015
Series
IEEE International Conference on Computer Vision. Proceedings, ISSN 1550-5499
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-128869 (URN); 10.1109/ICCVW.2015.84 (DOI); 000380434700075 (); 978-146738390-5 (ISBN)
Conference
15th IEEE International Conference on Computer Vision Workshops, ICCVW 2015
Available from: 2016-06-02 Created: 2016-06-02 Last updated: 2018-04-25
6. Adaptive Decontamination of the Training Set: A Unified Formulation for Discriminative Visual Tracking
2016 (English). In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, p. 1430-1438. Conference paper, Published paper (Refereed)
Abstract [en]

Tracking-by-detection methods have demonstrated competitive performance in recent years. In these approaches, the tracking model heavily relies on the quality of the training set. Due to the limited amount of labeled training data, additional samples need to be extracted and labeled by the tracker itself. This often leads to the inclusion of corrupted training samples, due to occlusions, misalignments and other perturbations. Existing tracking-by-detection methods either ignore this problem, or employ a separate component for managing the training set. We propose a novel generic approach for alleviating the problem of corrupted training samples in tracking-by-detection frameworks. Our approach dynamically manages the training set by estimating the quality of the samples. Contrary to existing approaches, we propose a unified formulation by minimizing a single loss over both the target appearance model and the sample quality weights. The joint formulation enables corrupted samples to be down-weighted while increasing the impact of correct ones. Experiments are performed on three benchmarks: OTB-2015 with 100 videos, VOT-2015 with 60 videos, and Temple-Color with 128 videos. On the OTB-2015, our unified formulation significantly improves the baseline, with a gain of 3.8% in mean overlap precision. Finally, our method achieves state-of-the-art results on all three datasets.
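
One half of such a joint formulation, the update of the sample quality weights for a fixed appearance model, can be sketched as follows. The specific objective used here (a linear term in the per-sample losses plus a quadratic penalty on the weights, with weights constrained to be non-negative and to sum to one) is an assumption of this example rather than something stated in the abstract, and the function names and loss values are hypothetical. Under that assumption the update reduces to a Euclidean projection onto the probability simplex.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]                      # sorted in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    support = u - (css - 1.0) / k > 0         # entries that stay positive
    rho = k[support][-1]
    tau = (css[rho - 1] - 1.0) / rho          # common shift
    return np.maximum(v - tau, 0.0)

def update_sample_weights(losses, mu=1.0):
    """Minimize sum(a * losses) + (1 / mu) * sum(a**2) over the simplex.

    Completing the square shows the minimizer is the projection of
    -0.5 * mu * losses onto the simplex.
    """
    losses = np.asarray(losses, dtype=float)
    return project_to_simplex(-0.5 * mu * losses)

# Two well-explained samples and one that fits poorly (e.g. an occlusion).
losses = np.array([0.1, 0.1, 5.0])
weights = update_sample_weights(losses, mu=1.0)
```

With these illustrative values the two low-loss samples share the weight equally and the high-loss sample is weighted down to zero. The other half of the alternation, refitting the appearance model with the current weights, is a weighted least-squares problem and is not shown.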

Place, publisher, year, edition, pages
IEEE, 2016
Series
IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-137882 (URN); 10.1109/CVPR.2016.159 (DOI); 000400012301051 (); 978-1-4673-8851-1 (ISBN)
Conference
29th IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Note

Funding Agencies: SSF (CUAS); VR (EMC2); VR (ELLIIT); Wallenberg Autonomous Systems Program; NSC; Nvidia

Available from: 2017-06-01 Created: 2017-06-01 Last updated: 2018-04-25
7. Deep motion and appearance cues for visual tracking
2018 (English). In: Pattern Recognition Letters, ISSN 0167-8655, E-ISSN 1872-7344. Article in journal (Refereed), Published
Abstract [en]

Generic visual tracking is a challenging computer vision problem, with numerous applications. Most existing approaches rely on appearance information by employing either hand-crafted features or deep RGB features extracted from convolutional neural networks. Despite their success, these approaches struggle in cases of ambiguous appearance information, leading to tracking failure. In such cases, we argue that motion cues provide discriminative and complementary information that can improve tracking performance. Contrary to visual tracking, deep motion features have been successfully applied for action recognition and video classification tasks. Typically, the motion features are learned by training a CNN on optical flow images extracted from large amounts of labeled videos. In this paper, we investigate the impact of deep motion features in a tracking-by-detection framework. We also evaluate the fusion of hand-crafted, deep RGB, and deep motion features and show that they contain complementary information. To the best of our knowledge, we are the first to propose fusing appearance information with deep motion features for visual tracking. Comprehensive experiments clearly demonstrate that our fusion approach with deep motion features outperforms standard methods relying on appearance information alone.

Place, publisher, year, edition, pages
Elsevier, 2018
Keywords
Visual tracking, Deep learning, Optical flow, Discriminative correlation filters
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:liu:diva-148015 (URN); 10.1016/j.patrec.2018.03.009 (DOI); 2-s2.0-85044328745 (Scopus ID)
Available from: 2018-05-24 Created: 2018-05-24 Last updated: 2018-05-31. Bibliographically approved
8. Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking
2016 (English). In: Computer Vision - ECCV 2016, Pt V, SPRINGER INT PUBLISHING AG, 2016, Vol. 9909, p. 472-488. Conference paper, Published paper (Refereed)
Abstract [en]

Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments.
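
The idea of posing the problem in a continuous spatial domain can be illustrated in one dimension: the Fourier coefficients of a sampled score function define a trigonometric interpolant that can be evaluated at any real-valued position, so the peak is no longer restricted to the sampling grid. This is a generic sketch of Fourier-domain interpolation, not the interpolation model of the paper. The function name, the test signal, and the grid resolution are assumptions of the example.

```python
import numpy as np

def interpolate_scores(scores, positions):
    """Evaluate the trigonometric interpolant of `scores` at real positions."""
    n = scores.size
    coeffs = np.fft.fft(scores) / n
    # Integer frequencies in symmetric order: 0, 1, ..., -2, -1.
    freqs = np.fft.fftfreq(n, d=1.0 / n)
    basis = np.exp(2j * np.pi * np.outer(positions, freqs) / n)
    return np.real(basis @ coeffs)

n = 16
true_peak = 2.3                                    # lies between two samples
samples = np.cos(2.0 * np.pi * (np.arange(n) - true_peak) / n)
fine_grid = np.arange(0.0, n, 0.01)                # candidate sub-sample positions
fine_scores = interpolate_scores(samples, fine_grid)
refined_peak = fine_grid[np.argmax(fine_scores)]
coarse_peak = int(np.argmax(samples))
```

Here the coarse argmax over the samples can only return the integer position 2, while the interpolated scores recover the peak near 2.3. Because interpolation happens in the Fourier domain, feature maps sampled at different resolutions can be mapped to the same continuous domain, which is the property the abstract relies on for fusing multi-resolution features.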

Place, publisher, year, edition, pages
SPRINGER INT PUBLISHING AG, 2016
Series
Lecture Notes in Computer Science, ISSN 0302-9743
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-133550 (URN); 10.1007/978-3-319-46454-1_29 (DOI); 000389385400029 (); 978-3-319-46454-1; 978-3-319-46453-4 (ISBN)
Conference
14th European Conference on Computer Vision (ECCV)
Available from: 2016-12-30 Created: 2016-12-29 Last updated: 2018-04-25
9. ECO: Efficient Convolution Operators for Tracking
2017 (English). In: 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), IEEE, 2017, p. 6931-6939. Conference paper, Published paper (Refereed)
Abstract [en]

In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with a massive number of trainable parameters, have introduced the risk of severe over-fitting. In this work, we tackle the key causes behind the problems of computational complexity and over-fitting, with the aim of simultaneously improving both speed and performance. We revisit the core DCF formulation and introduce: (i) a factorized convolution operator, which drastically reduces the number of parameters in the model; (ii) a compact generative model of the training sample distribution, which significantly reduces memory and time complexity while providing better diversity of samples; (iii) a conservative model update strategy with improved robustness and reduced complexity. We perform comprehensive experiments on four benchmarks: VOT2016, UAV123, OTB-2015, and Temple-Color. When using expensive deep features, our tracker provides a 20-fold speedup and achieves a 13.0% relative gain in Expected Average Overlap compared to the top ranked method [12] in the VOT2016 challenge. Moreover, our fast variant, using hand-crafted features, operates at 60 Hz on a single CPU, while obtaining 65.0% AUC on OTB-2015.
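
The parameter saving of a factorized operator can be illustrated with a small one-dimensional sketch. It assumes that the filters for all D feature channels are built as linear combinations of C < D basis filters through a D-by-C matrix `P`; this particular form, the names, and the sizes are assumptions of the example, and the values are random. Because convolution is linear, combining the filters first or projecting the features first gives the same detection scores, but the second form needs far fewer filter coefficients.

```python
import numpy as np

def circ_conv(a, b):
    """Circular convolution of two equal-length 1-D signals via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

rng = np.random.default_rng(0)
D, C, N = 8, 3, 16                           # feature channels, basis filters, length
x = rng.standard_normal((D, N))              # hypothetical feature map
basis_filters = rng.standard_normal((C, N))  # the C filters that are learned
P = rng.standard_normal((D, C))              # combination / projection matrix

# (a) Expand to one filter per feature channel, then convolve and sum.
full_filters = P @ basis_filters             # shape (D, N)
score_full = sum(circ_conv(full_filters[d], x[d]) for d in range(D))

# (b) Project the features down to C channels, then convolve and sum.
projected = P.T @ x                          # shape (C, N)
score_factorized = sum(circ_conv(basis_filters[c], projected[c]) for c in range(C))

params_full = D * N                          # one filter per channel
params_factorized = C * N + D * C            # basis filters plus the matrix P
```

For these sizes the two score vectors agree to numerical precision, while the parameter count drops from 128 to 72. The gap widens as the number of feature channels grows relative to the number of basis filters.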

Place, publisher, year, edition, pages
IEEE, 2017
Series
IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:liu:diva-144284 (URN); 10.1109/CVPR.2017.733 (DOI); 000418371407004 (); 978-1-5386-0457-1 (ISBN)
Conference
30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Note

Funding Agencies: SSF (SymbiCloud); VR (EMC2) [2016-05543]; SNIC; WASP; Visual Sweden; Nvidia

Available from: 2018-01-12 Created: 2018-01-12 Last updated: 2018-04-25

Open Access in DiVA

Learning Convolution Operators for Visual Tracking (10414 kB), 187 downloads
File information
File name: FULLTEXT01.pdf; File size: 10414 kB; Checksum (SHA-512):
74e0dfa738c93fc5a332814c238abfddce42a600757a3e999bcfa97a1009b9b2117f1199fb2fb2f4ad80bab805fa403d4c42564f1e69853d2377ab166eb1b430
Type: fulltext; Mimetype: application/pdf
Cover (2818 kB), 15 downloads
File information
File name: COVER01.pdf; File size: 2818 kB; Checksum (SHA-512):
b88d335a49cb143290a9dc1fe7c4853d2154742a540b818d0837cdf2c6a13b3f57c7751f0207a9427391fced94b234d3435bdbb60d9e98a1727e885534751766
Type: cover; Mimetype: application/pdf

Authority records BETA

Danelljan, Martin
