Unless I'm missing something, this uses a simple synchronous for loop:
for text in texts:
    key = (text, model)
    if key not in pickle_cache:
        pickle_cache[key] = openai_client.create_embedding(text, model=model)
    embeddings.append(pickle_cache[key])
operations.save_pickle_cache(pickle_cache, pickle_path)
return embeddings
At the throughput I was seeing of one embedding per second, a million comments would take over a week to process! I had to call the Gemini model with ten comments at a time from eight threads just to reach the paltry 3K RPM rate limit they offer "Tier 1" customers.
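The batching-plus-threads approach above can be sketched with a thread pool: group the texts into fixed-size batches and fan the batches out across workers. This is a minimal sketch under stated assumptions; `embed_batch` is a hypothetical stand-in for a real embedding call that accepts a list of inputs, and the batch size and worker count mirror the numbers mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10   # comments per request, as in the comment above
N_WORKERS = 8     # concurrent threads

def chunked(items, size):
    # Yield consecutive fixed-size slices of the input list.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_batch(batch):
    # Hypothetical stand-in for a real embedding API call that takes a
    # list of texts; here we just return placeholder strings.
    return [f"vec({text})" for text in batch]

def embed_all(texts):
    batches = list(chunked(texts, BATCH_SIZE))
    with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
        # pool.map preserves input order, so results line up with texts
        results = pool.map(embed_batch, batches)
    return [vec for batch in results for vec in batch]
```

With 25 texts this issues three requests (10, 10, 5) across the pool instead of 25 sequential calls, which is what lifts throughput toward the rate limit.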
Based on this experience, for real "enterprise" customers I might implement a generic wrapper for Google's Batch API that could handle continuous streaming from a database, chunking it, uploading, and then, in parallel, checking the status of pending jobs and streaming the results back into a database.
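The chunk-submit-poll-collect shape of such a wrapper can be sketched generically. This is an assumption-laden sketch, not any real Batch API: `submit_job`, `job_done`, and `fetch_results` are hypothetical stubs standing in for upload, status-check, and download calls.

```python
import time

CHUNK_SIZE = 2  # tiny for the demo; real batch jobs would be far larger

def submit_job(chunk):
    # Stand-in for uploading a chunk and starting a batch job.
    return {"input": chunk, "polls": 0}

def job_done(job):
    # Stand-in for a status check; here jobs "finish" after two polls.
    job["polls"] += 1
    return job["polls"] >= 2

def fetch_results(job):
    # Stand-in for downloading a finished job's results.
    return [f"embedding-for:{row}" for row in job["input"]]

def run_batch_pipeline(rows):
    # Chunk the input and submit every chunk up front.
    jobs = [submit_job(rows[i:i + CHUNK_SIZE])
            for i in range(0, len(rows), CHUNK_SIZE)]
    results, pending = [], jobs
    # Poll all pending jobs, collecting results as each one completes.
    while pending:
        still_pending = []
        for job in pending:
            if job_done(job):
                results.extend(fetch_results(job))
            else:
                still_pending.append(job)
        pending = still_pending
        if pending:
            time.sleep(0)  # real code would back off between status checks
    return results
```

In a production version the input rows would stream from a database and `results` would stream back into one, but the control flow stays the same.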
Re-reading your comment :) Yes, my demo has just a simple loop when loading the embeddings.
I was replying more to the latency you mentioned. Because DuckDB runs in-process on the device, you save yourself the extra network round trip when comparing similarities.
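The point about avoiding the network hop is that the comparison itself is cheap local arithmetic. A minimal sketch of cosine similarity, the same computation a function like DuckDB's `list_cosine_similarity` performs in-process over stored embeddings:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Once the embeddings are already on disk, ranking a query against them is pure local compute, so no per-comparison API latency applies.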
Hey, idk if this helps, but I developed something similar to the wrapper you're describing as an open-source Python library.
Just plug any async function into the provided async context manager and you get Batch APIs in two lines of code with any existing framework you currently have: https://github.com/vienneraphael/batchling
Let me know if you have any questions; looking forward to your feedback!