This paper looks at the Estonian coordinating conjunction ja ‘and’ in video-recorded Pilates classes, focusing on the instructors’ practical problem of making the students perform proper movement sequences. It shows how grammatical coordination emerges within a multimodal activity in which the instructor’s talk both directs and responds to student performance. As opposed to the frequent juxtaposition of clauses without connectors, explicit coordination with ja is used for the overall structuring of the class as well as the temporal extension of talk to achieve synchronicity of vocal and embodied behavior. In contrast to formal theories that consider grammar as a device for coherent expression of pre-planned propositions, this study argues that grammatical structure emerges as part of practical action across participants and modalities.