Convolutional Features for Correlation Filter Based Visual Tracking
Danelljan, Martin. Linköping University, Faculty of Science & Engineering. Linköping University, Department of Electrical Engineering, Computer Vision. ORCID iD: 0000-0001-6144-9520
Häger, Gustav. Linköping University, Faculty of Science & Engineering. Linköping University, Department of Electrical Engineering, Computer Vision. ORCID iD: 0000-0001-6199-9362
Khan, Fahad Shahbaz. Linköping University, Faculty of Science & Engineering. Linköping University, Department of Electrical Engineering, Computer Vision.
Felsberg, Michael. Linköping University, Faculty of Science & Engineering. Linköping University, Department of Electrical Engineering, Computer Vision. ORCID iD: 0000-0002-6096-3648
2015 (English). In: 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), IEEE conference proceedings, 2015, p. 621-629. Conference paper, Published paper (Refereed)
Abstract [en]

Visual object tracking is a challenging computer vision problem with numerous real-world applications. This paper investigates the impact of convolutional features on the visual tracking problem. We propose to use activations from the convolutional layers of a CNN in discriminative correlation filter based tracking frameworks. These activations have several advantages compared to the standard deep features (fully connected layers). Firstly, they mitigate the need for task-specific fine-tuning. Secondly, they contain structural information crucial for the tracking problem. Lastly, these activations have low dimensionality. We perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, and in contrast to image classification, our results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Our results further show that the convolutional features provide improved results compared to standard handcrafted features. Finally, results comparable to state-of-the-art trackers are obtained on all three benchmark datasets.
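The first-layer activations discussed above can be viewed as multi-channel feature maps obtained by convolving the image patch with a bank of learned kernels. A minimal NumPy sketch of that view, where random kernels stand in for a pretrained CNN's first-layer filters (the kernel bank, patch size, and kernel size are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def conv_features(patch, kernels):
    """Compute one feature channel per first-layer kernel via FFT-based
    (circular) convolution, preserving the patch's spatial resolution."""
    n, m = patch.shape
    P = np.fft.fft2(patch)
    channels = [np.real(np.fft.ifft2(P * np.fft.fft2(k, s=(n, m))))
                for k in kernels]
    return np.stack(channels)  # shape: (num_kernels, n, m)

rng = np.random.default_rng(0)
patch = rng.standard_normal((64, 64))     # grayscale target patch
kernels = rng.standard_normal((8, 7, 7))  # stand-in convolutional kernels
feats = conv_features(patch, kernels)
print(feats.shape)                        # (8, 64, 64)
```

The key property used by the abstract's argument is visible here: unlike fully connected layers, the output keeps the spatial layout of the patch, so the structural information needed for localization survives.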

Place, publisher, year, edition, pages
IEEE conference proceedings, 2015. p. 621-629
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:liu:diva-128869
DOI: 10.1109/ICCVW.2015.84
ISI: 000380434700075
ISBN: 9781467397117 (electronic)
ISBN: 9781467397100 (electronic)
OAI: oai:DiVA.org:liu-128869
DiVA, id: diva2:933006
Conference
15th IEEE International Conference on Computer Vision Workshops, ICCVW 2015, 7-13 December 2015, Santiago, Chile
Available from: 2016-06-02 Created: 2016-06-02 Last updated: 2025-02-07. Bibliographically approved
In thesis
1. Learning Convolution Operators for Visual Tracking
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Visual tracking is one of the fundamental problems in computer vision. Its numerous applications include robotics, autonomous driving, augmented reality and 3D reconstruction. In essence, visual tracking can be described as the problem of estimating the trajectory of a target in a sequence of images. The target can be any image region or object of interest. While humans excel at this task, requiring little effort to perform accurate and robust visual tracking, it has proven difficult to automate. It has therefore remained one of the most active research topics in computer vision.

In its most general form, no prior knowledge about the object of interest or environment is given, except for the initial target location. This general form of tracking is known as generic visual tracking. The unconstrained nature of this problem makes it particularly difficult, yet applicable to a wider range of scenarios. As no prior knowledge is given, the tracker must learn an appearance model of the target on-the-fly. Cast as a machine learning problem, it imposes several major challenges which are addressed in this thesis.

The main purpose of this thesis is the study and advancement of the so-called Discriminative Correlation Filter (DCF) framework, as it has proven particularly suitable for the tracking application. By utilizing properties of the Fourier transform, a correlation filter is discriminatively learned by efficiently minimizing a least-squares objective. The resulting filter is then applied to a new image in order to estimate the target location.
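The Fourier-domain least-squares solution described above can be sketched in a few lines for a single-channel filter. This is a minimal MOSSE-style formulation, not the thesis's full method; the regularization weight and Gaussian width are illustrative:

```python
import numpy as np

def learn_filter(x, y, lam=0.01):
    """Closed-form ridge regression per Fourier frequency (MOSSE-style):
    the filter maps the training patch x to the desired response y."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)  # filter in Fourier domain

def detect(H, z):
    """Apply the learned filter to a new patch; returns a response map."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))

# Desired response: a Gaussian peaked at the target centre.
n = 64
g = np.exp(-0.5 * ((np.arange(n) - n // 2) ** 2) / 2.0**2)
y = np.outer(g, g)

rng = np.random.default_rng(0)
x = rng.standard_normal((n, n))      # training patch features
H = learn_filter(x, y)
row, col = np.unravel_index(np.argmax(detect(H, x)), (n, n))
print(row, col)                      # peak at the centre: 32 32
```

The target location in a new frame is then read off as the argmax of the response map, which is what makes the FFT-based formulation so efficient: learning and detection both cost a handful of element-wise operations plus transforms.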

This thesis contributes to the advancement of the DCF methodology in several aspects. The main contribution regards the learning of the appearance model: First, the problem of updating the appearance model with new training samples is covered. Efficient update rules and numerical solvers are investigated for this task. Second, the periodic assumption induced by the circular convolution in DCF is countered by proposing a spatial regularization component. Third, an adaptive model of the training set is proposed to alleviate the impact of corrupted or mislabeled training samples. Fourth, a continuous-space formulation of the DCF is introduced, enabling the fusion of multiresolution features and sub-pixel accurate predictions. Finally, the problems of computational complexity and overfitting are addressed by investigating dimensionality reduction techniques.
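The model update mentioned in the first point is commonly implemented by keeping running averages of the numerator and denominator of the Fourier-domain solution rather than re-solving from scratch; a sketch under that assumption (the learning rate `eta` and regularizer are illustrative values, not the thesis's):

```python
import numpy as np

def update_model(A, B, x, y, eta=0.025):
    """Blend a new training sample (patch x, desired response y) into the
    filter's running numerator A and denominator B. The filter itself is
    recovered as H = A / (B + lam) whenever detection is needed."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    A = (1 - eta) * A + eta * np.conj(X) * Y
    B = (1 - eta) * B + eta * np.conj(X) * X
    return A, B

n = 32
y = np.zeros((n, n)); y[n // 2, n // 2] = 1.0   # desired response peak
A = np.zeros((n, n), dtype=complex)
B = np.zeros((n, n), dtype=complex)
rng = np.random.default_rng(0)
for _ in range(5):                               # five incoming frames
    A, B = update_model(A, B, rng.standard_normal((n, n)), y)
H = A / (B + 0.01)
print(H.shape)                                   # (32, 32)
```

Note how this simple rule weights all past samples by exponential decay; the adaptive training-set model described in the third point replaces exactly this fixed weighting with learned per-sample weights.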

As a second contribution, different feature representations for tracking are investigated. A particular focus is put on the analysis of color features, which had been largely overlooked in prior tracking research. This thesis also studies the use of deep features in DCF-based tracking. While many vision problems have greatly benefited from the advent of deep learning, it has proven difficult to harness the power of such representations for tracking. In this thesis it is shown that both shallow and deep layers contribute positively. Furthermore, the problem of fusing their complementary properties is investigated.

The final major contribution of this thesis regards the prediction of the target scale. In many applications, it is essential to track the scale, or size, of the target since it is strongly related to the relative distance. A thorough analysis of how to integrate scale estimation into the DCF framework is performed. A one-dimensional scale filter is proposed, enabling efficient and accurate scale estimation.
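The one-dimensional scale filter can be sketched in the same way as the translation filter, but over the scale dimension: one feature value per level of a scale pyramid, correlated against a 1-D Gaussian peaked at the current (unchanged) scale. Details such as the number of scales and the feature representation are assumptions of this sketch:

```python
import numpy as np

def learn_scale_filter(s, y, lam=0.01):
    """Closed-form 1-D correlation filter over the scale dimension."""
    S, Y = np.fft.fft(s), np.fft.fft(y)
    return np.conj(S) * Y / (np.conj(S) * S + lam)

num_scales = 33
# Desired 1-D response: Gaussian peaked at the middle (unchanged) scale.
y = np.exp(-0.5 * (np.arange(num_scales) - num_scales // 2) ** 2)

rng = np.random.default_rng(0)
s = rng.standard_normal(num_scales)  # one feature value per pyramid level
H = learn_scale_filter(s, y)
response = np.real(np.fft.ifft(H * np.fft.fft(s)))
print(int(np.argmax(response)))      # best scale index: 16
```

Because the filter is one-dimensional, this scale search adds only a single small FFT per frame on top of the translation filter, which is what makes the approach efficient.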

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 71
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1926
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-147543 (URN)
10.3384/diss.diva-147543 (DOI)
9789176853320 (ISBN)
Public defence
2018-06-11, Ada Lovelace, B-huset, Campus Valla, Linköping, 13:00 (English)
Opponent
Supervisors
Available from: 2018-05-03 Created: 2018-04-25 Last updated: 2025-02-07. Bibliographically approved
2. Learning visual perception for autonomous systems
2021 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In the last decade, developments in hardware, sensors and software have made it possible to create increasingly autonomous systems. These systems can be as simple as limited driver-assistance software for lane-following in cars, or limited collision-warning systems for otherwise manually piloted drones. At the other end of the spectrum are fully autonomous cars, boats or helicopters. With increasing abilities to function autonomously, the demands to operate with minimal human supervision in unstructured environments increase accordingly.

Common to most, if not all, autonomous systems is that they require an accurate model of the surrounding world. While a large number of sensors useful for creating such models is currently available, cameras are among the most versatile. From a sensing perspective, cameras have several advantages over other sensors: they require no external infrastructure, are relatively cheap, and can be used to extract information such as the relative positions of other objects and their movements over time, to create accurate maps, and to locate the autonomous system within these maps.

Using cameras to produce a model of the surroundings requires solving a number of technical problems. Often these problems come down to recognizing that an object or region of interest is the same over time or from novel viewpoints. In visual tracking, this type of recognition is required to follow an object of interest through a sequence of images. In geometric problems, it is often necessary to recognize corresponding image regions in order to perform 3D reconstruction or localization.

The first set of contributions in this thesis is related to the improvement of a class of online-learned visual object trackers based on discriminative correlation filters. In visual tracking, estimation of the object's size is important for reliable tracking; the first contribution in this part of the thesis investigates this problem. The performance of discriminative correlation filters is highly dependent on the feature representation used by the filter. The second tracking contribution investigates the performance impact of different features derived from a deep neural network.

A second set of contributions relates to the evaluation of visual object trackers. The first of these is the visual object tracking challenge, a yearly comparison of state-of-the-art visual tracking algorithms. A second contribution is an investigation into possible issues when using bounding-box representations for ground-truth data.

In real-world settings, tracking typically occurs over longer time sequences than is common in benchmark datasets. In such settings, it is common that the model updates of many tracking algorithms cause the tracker to fail silently. For this reason, it is important to have an estimate of the tracker's performance even in cases where no ground-truth annotations exist. The first of the final three contributions investigates this problem in a robotics setting, by fusing information from a pre-trained object detector in a state-estimation framework. An additional contribution describes how to dynamically re-weight the data used for the appearance model of a tracker. A final contribution investigates how to estimate the certainty of detections in a setting where geometric limitations can be imposed on the search region. The proposed solution learns to accurately predict stereo disparities along with accurate assessments of each prediction's certainty.

Abstract [sv]

The ever-accelerating development of computing hardware, sensors and software techniques in recent years has made it possible to create increasingly autonomous systems. These can vary in degree of autonomy from an anti-skid system in an otherwise manually controlled car, to collision-avoidance systems in a manually piloted drone, to a fully autonomous car or other vehicle. With a growing ability to operate independently without human supervision, the range of situations the systems are expected to handle grows accordingly.

Common to many, if not all, autonomous systems is that they need an accurate and up-to-date picture of their surroundings in order to act intelligently. A wide range of sensors that make this possible is available, among which cameras are one of the most versatile. Compared to other types of sensors, cameras have a number of advantages: they are relatively cheap, passive, and can be used without requiring external infrastructure. The visual data that cameras generate can be used to follow external objects, determine the position of the camera itself, or compute distances.

Successfully exploiting the possibilities in this information, however, requires handling a long list of technical problems. Many of these problems come down to recognizing that two image regions from different points in time or different viewpoints depict the same thing.

A typical example of such a problem is the visual tracking problem. In visual tracking, the goal is to determine an object's position and size in every image of a sequence of images. In general, the object's appearance is not known to the algorithm in advance; instead, an appearance model must be built up successively using machine learning.

Similar problems occur in many other areas of computer vision, especially in geometry. Many geometric problems, for example, require finding corresponding points across several images.

The first collection of contributions in this thesis addresses the visual tracking problem. The proposed methods are based on an adaptive appearance model called discriminative correlation filters. The first contribution to such methods extends the framework to estimate an object's size as well as its position. A second contribution investigates how correlation-filter-based methods can be extended to also exploit visual features produced by machine learning.

A second collection of contributions concerns the evaluation of visual tracking methods, partly within the yearly visual object tracking challenge. A second contribution to evaluation methodology in visual tracking aims to avoid pitfalls that easily arise when methods are tuned too closely to the measures used to evaluate them.

A third collection of contributions relates to different ways of handling situations in which the learning process of the tracking methods described above introduces erroneous data into the model. A first contribution does this in a robotic system for following people in an unstructured environment. A second contribution is based on dynamic re-weighting of previously collected data, to down-weight data points that do not represent the tracked object well. A final contribution investigates how a prediction's uncertainty can be estimated alongside the prediction itself.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2021. p. 49
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2138
Keywords
computer vision, visual object tracking, tracking, machine learning, deep learning
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:liu:diva-175177 (URN)
10.3384/diss.diva-175177 (DOI)
9789179296711 (ISBN)
Public defence
2021-06-04, Ada Lovelace, B-Building, Campus Valla, Linköping, 09:15 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2021-05-04 Created: 2021-04-20 Last updated: 2025-02-07. Bibliographically approved

Open Access in DiVA

fulltext (2296 kB, application/pdf)

Other links

Publisher's full text
Authority records

Danelljan, Martin; Häger, Gustav; Khan, Fahad Shahbaz; Felsberg, Michael
