We propose to leverage optical flow features for higher generalization power in semi-supervised video object segmentation. Optical flow is usually exploited as additional guidance information in many computer vision tasks. However, its relevance in video object segmentation was mainly in unsupervised settings or using the optical flow to warp or refine the previously predicted masks. Different from the latter, we propose to directly leverage the optical flow features in the target representation. We show that this enriched representation improves the encoder-decoder approach to the segmentation task. A model to extract the combined information from the optical flow and the image is proposed, which is then used as input to the target model and the decoder network. Unlike previous methods, e.g. in tracking where concatenation is used to integrate information from image data and optical flow, a simple yet effective attention mechanism is exploited in our work. Experiments on DAVIS 2017 and YouTube-VOS 2019 show that integrating the information extracted from optical flow into the original image branch results in a strong performance gain, especially in unseen classes which demonstrates its higher generalization power.