Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and personalization were quickly adopted by the community. However, training these models in the first place remains very costly. While several recent approaches - including masking, distillation, and architectural modifications - have been proposed to improve training ...