Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking
Danelljan, Martin
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0001-6144-9520
Robinson, Andreas
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering.
Khan, Fahad Shahbaz
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering.
Felsberg, Michael
Linköping University, Department of Electrical Engineering, Computer Vision. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-6096-3648
2016 (English)
In: Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V / [ed] Bastian Leibe, Jiri Matas, Nicu Sebe and Max Welling, Cham: Springer, 2016, p. 472-488
Conference paper, Published paper (Refereed)
Abstract [en]

Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments.
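
As a rough illustration of what sub-pixel localization in a continuous domain can look like, the sketch below treats a 1-D vector of detection scores as samples of a periodic, band-limited function, interpolates it with its Fourier series, and refines the integer argmax with Newton's method. This is an assumed simplification, not the paper's method: the abstract places the interpolation model inside the learning problem itself, which is not reproduced here, and the names `centered_fourier_coeffs` and `refine_peak` are hypothetical.

```python
import cmath
import math


def centered_fourier_coeffs(samples):
    """Fourier-series coefficients c_k for k = -(N-1)/2 .. (N-1)/2 (odd N assumed)."""
    n = len(samples)
    assert n % 2 == 1, "odd length keeps the frequency range symmetric"
    half = n // 2
    return {
        k: sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(samples)) / n
        for k in range(-half, half + 1)
    }


def refine_peak(samples, iters=10):
    """Start at the best integer sample, then run Newton steps on the interpolant."""
    n = len(samples)
    coeffs = centered_fourier_coeffs(samples)
    t = float(max(range(n), key=samples.__getitem__))
    for _ in range(iters):
        d1 = 0.0  # first derivative of the interpolated score at t
        d2 = 0.0  # second derivative of the interpolated score at t
        for k, c in coeffs.items():
            w = 2j * math.pi * k / n
            e = c * cmath.exp(w * t)
            d1 += (w * e).real
            d2 += (w * w * e).real
        if d2 >= 0.0:
            break  # not locally concave; keep the current estimate
        t -= d1 / d2
    return t


# Synthetic band-limited score function with its true peak between two samples.
n, true_peak = 15, 6.3
scores = [
    math.cos(2 * math.pi * (i - true_peak) / n) + 0.5 * math.cos(4 * math.pi * (i - true_peak) / n)
    for i in range(n)
]
estimate = refine_peak(scores)
```

Here the coarse argmax lands on sample 6, and the Newton refinement moves the estimate toward the true location at 6.3, which no integer grid position can represent.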

Place, publisher, year, edition, pages
Cham: Springer, 2016. p. 472-488
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 9909
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-133550
DOI: 10.1007/978-3-319-46454-1_29
ISI: 000389385400029
ISBN: 9783319464534 (print)
ISBN: 9783319464541 (electronic)
OAI: oai:DiVA.org:liu-133550
DiVA, id: diva2:1060848
Conference
14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, October 11-14, 2016
Available from: 2016-12-30 Created: 2016-12-29 Last updated: 2025-02-07. Bibliographically approved
In thesis
1. Learning Convolution Operators for Visual Tracking
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Visual tracking is one of the fundamental problems in computer vision. Its numerous applications include robotics, autonomous driving, augmented reality and 3D reconstruction. In essence, visual tracking can be described as the problem of estimating the trajectory of a target in a sequence of images. The target can be any image region or object of interest. While humans excel at this task, requiring little effort to perform accurate and robust visual tracking, it has proven difficult to automate. It has therefore remained one of the most active research topics in computer vision.

In its most general form, no prior knowledge about the object of interest or environment is given, except for the initial target location. This general form of tracking is known as generic visual tracking. The unconstrained nature of this problem makes it particularly difficult, yet applicable to a wider range of scenarios. As no prior knowledge is given, the tracker must learn an appearance model of the target on-the-fly. Cast as a machine learning problem, it imposes several major challenges which are addressed in this thesis.

The main purpose of this thesis is the study and advancement of the so-called Discriminative Correlation Filter (DCF) framework, as it has been shown to be particularly suitable for the tracking application. By utilizing properties of the Fourier transform, a correlation filter is discriminatively learned by efficiently minimizing a least-squares objective. The resulting filter is then applied to a new image in order to estimate the target location.
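
The learning step described above can be sketched in one dimension. The example below is a minimal illustration under stated assumptions, not the thesis's implementation: it uses a single training sample, a single feature channel, a naive DFT from the standard library, and a ridge-regularized least-squares objective with circular convolution, whose minimizer has the per-frequency closed form H = conj(X) Y / (|X|^2 + lambda). All function names and parameter values are hypothetical.

```python
import cmath
import math
import random


def dft(signal):
    """Naive O(N^2) discrete Fourier transform; adequate for a short illustration."""
    n = len(signal)
    return [sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(signal))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(), including the 1/N normalization."""
    n = len(spectrum)
    return [sum(v * cmath.exp(2j * math.pi * k * i / n) for k, v in enumerate(spectrum)) / n
            for i in range(n)]


def gaussian_label(n, sigma=1.5):
    """Desired filter response: a circular Gaussian peaked at index 0."""
    return [math.exp(-0.5 * (min(i, n - i) / sigma) ** 2) for i in range(n)]


def train_filter(x, y, lam=1e-2):
    """Minimize ||h (*) x - y||^2 + lam * ||h||^2, solved independently per frequency."""
    X, Y = dft(x), dft(y)
    return [xk.conjugate() * yk / (abs(xk) ** 2 + lam) for xk, yk in zip(X, Y)]


def detect(H, z):
    """Apply the learned filter to a new signal and return the real-valued response."""
    Z = dft(z)
    return [v.real for v in idft([hk * zk for hk, zk in zip(H, Z)])]


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(32)]   # stand-in for a training feature signal
H = train_filter(x, gaussian_label(len(x)))

shift = 5
z = x[-shift:] + x[:-shift]                    # the same signal, circularly shifted by 5
response = detect(H, z)
estimated_shift = max(range(len(response)), key=response.__getitem__)
```

Because every circular shift of `x` is implicitly part of the training problem, the response to the shifted signal peaks at the shift itself, which is how the filter output is read as a target displacement.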

This thesis contributes to the advancement of the DCF methodology in several aspects. The main contribution regards the learning of the appearance model: First, the problem of updating the appearance model with new training samples is covered. Efficient update rules and numerical solvers are investigated for this task. Second, the periodic assumption induced by the circular convolution in DCF is countered by proposing a spatial regularization component. Third, an adaptive model of the training set is proposed to alleviate the impact of corrupted or mislabeled training samples. Fourth, a continuous-space formulation of the DCF is introduced, enabling the fusion of multiresolution features and sub-pixel accurate predictions. Finally, the problems of computational complexity and overfitting are addressed by investigating dimensionality reduction techniques.

As a second contribution, different feature representations for tracking are investigated. A particular focus is put on the analysis of color features, which had been largely overlooked in prior tracking research. This thesis also studies the use of deep features in DCF-based tracking. While many vision problems have greatly benefited from the advent of deep learning, it has proven difficult to harvest the power of such representations for tracking. In this thesis it is shown that both shallow and deep layers contribute positively. Furthermore, the problem of fusing their complementary properties is investigated.

The final major contribution of this thesis regards the prediction of the target scale. In many applications, it is essential to track the scale, or size, of the target since it is strongly related to the relative distance. A thorough analysis of how to integrate scale estimation into the DCF framework is performed. A one-dimensional scale filter is proposed, enabling efficient and accurate scale estimation.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 71
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1926
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-147543
DOI: 10.3384/diss.diva-147543
ISBN: 9789176853320
Public defence
2018-06-11, Ada Lovelace, B-building, Campus Valla, Linköping, 13:00 (English)
Available from: 2018-05-03 Created: 2018-04-25 Last updated: 2025-02-07. Bibliographically approved
2. Discriminative correlation filters in robot vision
2021 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In less than ten years, deep neural networks have evolved into all-encompassing tools in multiple areas of science and engineering, due to their almost unreasonable effectiveness in modeling complex real-world relationships. In computer vision in particular, they have taken tasks such as object recognition, which were previously considered very difficult, and transformed them into everyday practical tools. However, neural networks have to be trained with supercomputers on massive datasets for hours or days, and this limits their ability to adjust to changing conditions.

This thesis explores discriminative correlation filters, originally intended for tracking large objects in video, so-called visual object tracking. Unlike neural networks, these filters are small and can be quickly adapted to changes, with minimal data and computing power. At the same time, they can take advantage of the computing infrastructure developed for neural networks and operate within them.

The main contributions in this thesis demonstrate the versatility and adaptability of correlation filters for various problems, while complementing the capabilities of deep neural networks. In the first problem, it is shown that when adapted to track small regions and points, they outperform the widely used Lucas-Kanade method, both in terms of robustness and precision.

In the second problem, the correlation filters take on a completely new task. Here, they are used to tell different places apart in a 16 by 16 kilometer region of ocean near land. Given only a horizon profile (the coastline silhouette of islands and islets as seen from an ocean vessel), it is demonstrated that discriminative correlation filters can effectively distinguish between locations.

In the third problem, it is shown how correlation filters can be applied to video object segmentation. This is the task of classifying individual pixels as belonging either to a target or the background, given a segmentation mask provided with the first video frame as the only guidance. It is also shown that discriminative correlation filters and deep neural networks complement each other; where the neural network processes the input video in a content-agnostic way, the filters adapt to specific target objects. The joint function is a real-time video object segmentation method.

Finally, the segmentation method is extended beyond binary target/background classification to additionally consider distracting objects. This addresses the fundamental difficulty of coping with objects of similar appearance.

Abstract [sv] (English translation)

In less than ten years, deep neural networks have developed into comprehensive tools in several scientific and technical fields, owing to their almost unreasonable effectiveness at modeling complex real-world relationships. In computer vision in particular, they have taken tasks such as object recognition, previously considered very difficult, and turned them into practical everyday tools. However, neural networks must be trained with supercomputers on massive datasets for hours or days, which limits their ability to adapt to changing conditions.

This thesis investigates discriminative correlation filters, originally intended for tracking large objects in video, so-called visual object tracking. Unlike neural networks, these filters are small and can be quickly adapted to changes, with little data and minimal computing power. At the same time, they can take advantage of the infrastructure developed for neural networks and operate within it.

The main contributions of this thesis show the versatility and adaptability of correlation filters for various problems, while complementing the capabilities of deep neural networks. In the first problem, it is shown that when applied to tracking small regions and points, they outperform the widely used Lucas-Kanade method in terms of both robustness and precision.

In the second problem, the correlation filters are applied to an entirely new task. Here they are used to distinguish between different places in a 16 by 16 kilometer ocean region near land, given only a horizon profile: the coastline silhouette of islands and islets as seen from a vessel.

In the third problem, it is shown how correlation filters can be used for segmenting objects in video. This is the task of classifying individual pixels as belonging either to a target object or to the background, given a segmentation mask provided with the first frame as the only guidance. It is also shown that discriminative correlation filters and deep neural networks complement each other; where the neural network processes the video in a content-agnostic way, the filters adapt to specific target objects. The combined function is a real-time segmentation method.

Finally, the segmentation method is extended beyond binary target/background classification to additionally consider distracting objects. This addresses the fundamental difficulty of handling objects that resemble each other.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2021. p. 53
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2146
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-174939
DOI: 10.3384/diss.diva-174939
ISBN: 9789179296360
Public defence
2021-06-14, Ada Lovelace, B-building, Campus Valla, Linköping, 13:00 (English)
Available from: 2021-05-17 Created: 2021-04-19 Last updated: 2025-02-07. Bibliographically approved

Open Access in DiVA

Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking (1742 kB), 684 downloads
File information
File name: FULLTEXT02.pdf
File size: 1742 kB
Checksum (SHA-512): 52fa945e239edd3b0d0ffa783384423e0818b7bc1bfc2e5c56606029fe677da1235c70782b268eae4670ca34111982674769acffbe2021ec9ee5c5e9a938550a
Type: fulltext
Mimetype: application/pdf

Authority records

Danelljan, Martin; Robinson, Andreas; Khan, Fahad Shahbaz; Felsberg, Michael
