In this work, we address the problem of semi-supervised video object segmentation, where the task is to segment a target object in every frame of a video sequence, given only a ground-truth mask in the first frame. Success requires robustly handling unpredictable changes in target appearance as well as distracting objects in the background. We obtain a robust and efficient representation of the target by integrating a fast, light-weight discriminative target model into a deep segmentation network. Trained during inference, the target model learns to discriminate between the local appearances of target and background image regions. Its coarse predictions are then enhanced into accurate segmentation masks in a subsequent refinement stage. To further improve segmentation performance, we add a new module trained to generate global target attention vectors from the input mask and image feature maps. These attention vectors add semantic information about the target from a previous frame to the refinement stage, complementing the predictions of the target appearance model. Our method is fast and requires no network fine-tuning. We achieve a combined J&F score of 70.6 on the DAVIS 2019 test-challenge data.
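
To make the two main components concrete, the sketch below gives one plausible reading of the pipeline in PyTorch: a single convolutional filter fitted online serves as the discriminative target model, and a mask-weighted pooling step stands in for the attention-vector module. All names, shapes, iteration counts, and the choice of optimizer are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def train_target_model(features, mask, num_iters=50, lr=1e-1):
    """Fit a light-weight discriminative target model at inference time.

    A minimal sketch, assuming the target model is a single 3x3 conv
    filter optimized with plain SGD; the paper's optimizer and model
    capacity may differ. `features` is a (1, C, H, W) backbone feature
    map and `mask` a (1, 1, H, W) binary first-frame label.
    """
    C = features.shape[1]
    w = torch.zeros(1, C, 3, 3, requires_grad=True)  # hypothetical filter
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(num_iters):
        # Coarse target/background scores from the current filter.
        scores = F.conv2d(features, w, padding=1)
        loss = F.binary_cross_entropy_with_logits(scores, mask)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

def segment_frame(features, w):
    # Coarse score map from the target model; in the full method, a
    # refinement network (not shown) upsamples this into an accurate mask.
    return torch.sigmoid(F.conv2d(features, w, padding=1))

def target_attention_vector(features, mask):
    # Hypothetical stand-in for the attention module: a global target
    # descriptor from mask-weighted average pooling over the previous
    # frame's features, to be injected into the refinement stage.
    mask = F.interpolate(mask, size=features.shape[-2:], mode="bilinear",
                         align_corners=False)
    return (features * mask).sum(dim=(2, 3)) / mask.sum().clamp(min=1e-6)
```

Because the target model is trained only on the first-frame mask (and, plausibly, on subsequent high-confidence predictions), no fine-tuning of the segmentation network itself is needed, which is what keeps the method fast at test time.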