For this experimentation I used relatively short prompts, a few paragraphs at most for the data-extraction testing, and no context from previous calls was carried into subsequent calls; each call was stand-alone. Yes, Ollama can be slow to first token with larger contexts. I used Ollama because this was a continuation of a previous article I wrote about it, and it's dead simple to use. I'm currently experimenting with vLLM, which is reportedly better suited to heavy production workloads. Swapping vLLM in for Ollama as the backend of a batch system like this should be very simple (see the sketch below). I might do a write-up comparing vLLM to Ollama soon. Thanks for reading.
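
To give a rough idea of why the swap should be simple: both Ollama and vLLM can expose an OpenAI-compatible endpoint, so a batch extractor can treat the backend as configuration. The sketch below is illustrative only, not the code from the article; the base URLs are the common defaults (Ollama on 11434, vLLM on 8000), and the model names and prompt are placeholders.

```python
# Minimal sketch: switch between Ollama and vLLM by changing the
# OpenAI-compatible base URL and model name. Values are assumptions
# for illustration; adjust to your own deployment.
from openai import OpenAI

BACKENDS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "model": "llama3.1"},
    "vllm": {"base_url": "http://localhost:8000/v1", "model": "meta-llama/Llama-3.1-8B-Instruct"},
}

def extract(texts, backend="ollama"):
    cfg = BACKENDS[backend]
    # Neither server requires a real key by default, but the client expects one.
    client = OpenAI(base_url=cfg["base_url"], api_key="not-needed")
    results = []
    for text in texts:
        # Each call is stand-alone: no history is carried between requests.
        resp = client.chat.completions.create(
            model=cfg["model"],
            messages=[{"role": "user", "content": f"Extract the key fields from:\n{text}"}],
        )
        results.append(resp.choices[0].message.content)
    return results
```

With an abstraction like this, moving the batch job from Ollama to vLLM is a one-line configuration change rather than a rewrite.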