Knee osteoarthritis is a significant health challenge for adults worldwide. No pharmacological treatment currently cures the condition, so managing its progression depends primarily on early identification. X-ray imaging is a key modality for predicting the onset of osteoarthritis, but manual interpretation of X-rays is prone to error, largely because radiologists vary in expertise. In this paper, we propose a multimodal model, built on pre-trained vision and language models, for grading knee osteoarthritis severity on the Kellgren-Lawrence (KL) scale. A Vision Transformer (ViT) and Bidirectional Encoder Representations from Transformers (BERT) extract image and text embeddings, respectively; these embeddings help the Transformer encoders produce more distinctive hidden states, which in turn facilitate the learning of the neural network classifier. The multimodal model was trained and tested on the Osteoarthritis Initiative (OAI) dataset, and the results compare favorably with related work. On the test set of X-ray images, the model achieved an overall accuracy of 82.85%, a precision of 84.54%, and a recall of 82.89%.
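
For readers who want to reproduce this kind of evaluation, the following is a minimal sketch of how accuracy, precision, and recall can be computed for five-class KL grading. It assumes macro-averaged precision and recall, which is an assumption because the averaging scheme is not stated above. The function name `classification_metrics` and the constant `KL_GRADES` are hypothetical and introduced only for illustration.

```python
from collections import Counter

# Kellgren-Lawrence grades run from 0 (none) to 4 (severe).
KL_GRADES = [0, 1, 2, 3, 4]


def classification_metrics(y_true, y_pred, labels=KL_GRADES):
    """Return (accuracy, macro precision, macro recall).

    Macro averaging is an assumption of this sketch: each grade
    contributes equally, regardless of how many samples it has.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    true_count = Counter(y_true)   # samples per true grade
    pred_count = Counter(y_pred)   # samples per predicted grade
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(true_pos.values()) / len(y_true)

    # A grade that is never predicted (or never present) scores 0.0
    # instead of raising a division-by-zero error.
    precisions = [
        true_pos[c] / pred_count[c] if pred_count[c] else 0.0 for c in labels
    ]
    recalls = [
        true_pos[c] / true_count[c] if true_count[c] else 0.0 for c in labels
    ]

    macro_precision = sum(precisions) / len(labels)
    macro_recall = sum(recalls) / len(labels)
    return accuracy, macro_precision, macro_recall
```

Calling `classification_metrics(y_true, y_pred)` with two equal-length lists of integer grades returns the three values as fractions in [0, 1]; multiplying by 100 gives percentages comparable in form to those reported above. If the authors used weighted or micro averaging instead, the precision and recall figures would differ.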