For years, embedding models built on bidirectional language models have led the field, excelling at retrieval and general-purpose embedding tasks. However, past top-tier methods have relied on fine-tuning Large Language Models (LLMs) with large amounts of proprietary synthetic data generated by GPT-4, which is not accessible to the broader community. In a new paper NV-Embed: Improved Techniques