Unless I'm missing something, this uses a simple synchronous for loop:
for text in texts:
    key = (text, model)
    if key not in pickle_cache:
        pickle_cache[key] = openai_client.create_embedding(text, model=model)
    embeddings.append(pickle_cache[key])
operations.save_pickle_cache(pickle_cache, pickle_path)
return embeddings
At the throughput I was seeing of one embedding per second, a million comments would take over a week to process! I had to call the Gemini model with ten comments at a time from eight threads just to reach the paltry 3K RPM rate limit they offer "Tier 1" customers.
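The batching-plus-threads approach above can be sketched with a thread pool: group the texts into fixed-size batches and fan the batches out across workers. This is a minimal sketch under stated assumptions; `embed_batch` is a hypothetical stand-in for a real embedding call that accepts a list of inputs, and the batch size and worker count mirror the numbers mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10   # comments per request, as in the comment above
N_WORKERS = 8     # concurrent threads

def chunked(items, size):
    # Yield consecutive fixed-size slices of the input list.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_batch(batch):
    # Hypothetical stand-in for a real embedding API call that takes a
    # list of texts; here we just return placeholder strings.
    return [f"vec({text})" for text in batch]

def embed_all(texts):
    batches = list(chunked(texts, BATCH_SIZE))
    with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
        # pool.map preserves input order, so results line up with texts
        results = pool.map(embed_batch, batches)
    return [vec for batch in results for vec in batch]
```

With 25 texts this issues three requests (10, 10, 5) across the pool instead of 25 sequential calls, which is what lifts throughput toward the rate limit.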
Based on this experience, for real "enterprise" customers I might implement a generic wrapper for Google's Batch API that could handle continuous streaming from a database, chunking it, uploading, and then, in parallel, checking the status of pending jobs and streaming the results back into a database.
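The chunk-submit-poll-collect shape of such a wrapper can be sketched generically. This is an assumption-laden sketch, not any real Batch API: `submit_job`, `job_done`, and `fetch_results` are hypothetical stubs standing in for upload, status-check, and download calls.

```python
import time

CHUNK_SIZE = 2  # tiny for the demo; real batch jobs would be far larger

def submit_job(chunk):
    # Stand-in for uploading a chunk and starting a batch job.
    return {"input": chunk, "polls": 0}

def job_done(job):
    # Stand-in for a status check; here jobs "finish" after two polls.
    job["polls"] += 1
    return job["polls"] >= 2

def fetch_results(job):
    # Stand-in for downloading a finished job's results.
    return [f"embedding-for:{row}" for row in job["input"]]

def run_batch_pipeline(rows):
    # Chunk the input and submit every chunk up front.
    jobs = [submit_job(rows[i:i + CHUNK_SIZE])
            for i in range(0, len(rows), CHUNK_SIZE)]
    results, pending = [], jobs
    # Poll all pending jobs, collecting results as each one completes.
    while pending:
        still_pending = []
        for job in pending:
            if job_done(job):
                results.extend(fetch_results(job))
            else:
                still_pending.append(job)
        pending = still_pending
        if pending:
            time.sleep(0)  # real code would back off between status checks
    return results
```

In a production version the input rows would stream from a database and `results` would stream back into one, but the control flow stays the same.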
Re-reading your comment :) Yes, my demo has just a simple loop when loading the embeddings.
I was replying more to the latency you mentioned. Because DuckDB runs in-process on the device, you save yourself the extra network round trip when comparing similarities.
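The point about avoiding the network hop is that the comparison itself is cheap local arithmetic. A minimal sketch of cosine similarity, the same computation a function like DuckDB's `list_cosine_similarity` performs in-process over stored embeddings:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Once the embeddings are already on disk, ranking a query against them is pure local compute, so no per-comparison API latency applies.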
Hey, idk if this helps, but I developed something similar to the wrapper you're describing as an open-source Python library.
Just plug any async function into the provided async context manager and you get Batch APIs in two lines of code with any existing framework you currently have: https://github.com/vienneraphael/batchling
Let me know if you have any questions; looking forward to your feedback!