Multimodal systems, which process multiple input types such as text, audio, and images, are becoming increasingly prevalent in software systems, enabled by the huge advancements in Machine Learning. This triggers the need to easily define the requirements linked to these new types of user interactions, potentially involving more than one modality at the same time. This remains an open challenge due to the lack of languages and methods adapted to the diverse nature of multimodal interactions, ...