Large Language Model (LLM) inference workloads involve extremely large model files (often tens of gigabytes) that must be loaded quickly and repeatedly across distributed GPU instances. Traditional single-tier storage, whether local disk alone or remote cloud storage alone, cannot meet the throughput and latency demands of serving these models at scale.