Large language models (LLMs) have demonstrated remarkable proficiency across a range of natural language tasks and an impressive ability to follow open-ended instructions, showing strong generalization. Despite these successes, a notable limitation of LLMs is their inability to perceive non-textual modalities such as audio. In a new paper, SpeechVerse: A Large-scale Generalizable Audio Language Model, a