We're currently trying out LibreTranslate for real-time translation. We've built an image with 4 language models pre-installed and deployed the following setup:
2 data centers * 32 pods each (16 CPU / 32 GB RAM per pod), 64 pods total.
We then started a load test with the following setup: 0 to 110 rps in 20 minutes, then 110 rps for 5 hours.
We got the test texts from real data: HTML text up to 15 KB (avg 2-3 KB).
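For reference, the load profile described above can be written as a function of elapsed time. This is only a minimal sketch, assuming the ramp is linear (the ramp shape isn't stated); `target_rps` and the constant names are hypothetical and not taken from any load-testing tool.

```python
RAMP_SECONDS = 20 * 60        # 0 -> 110 rps over 20 minutes
HOLD_SECONDS = 5 * 60 * 60    # then 110 rps for 5 hours
PEAK_RPS = 110.0


def target_rps(elapsed_seconds: float) -> float:
    """Target request rate at a given time since the test started."""
    if elapsed_seconds < 0 or elapsed_seconds > RAMP_SECONDS + HOLD_SECONDS:
        return 0.0  # outside the test window
    if elapsed_seconds < RAMP_SECONDS:
        # Assumption: linear ramp from 0 up to the peak rate.
        return PEAK_RPS * elapsed_seconds / RAMP_SECONDS
    return PEAK_RPS


if __name__ == "__main__":
    for minute in (0, 10, 20, 60, 320):
        print(f"t={minute:>3} min -> {target_rps(minute * 60):.1f} rps")
```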
The results:
- At 110 rps, the 99th-percentile response time was 4-5 seconds.
- The number of in-flight translations went above 200.
- The logs started flooding with warnings like "WARNING:waitress.queue:Task queue depth is 10".
- We detected a possible memory leak: after about 4 hours of load, the pods were OOM-killed and restarted because memory usage went above the provided 32 GB.
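The request rate and in-flight count above can be related through Little's law (in-flight = arrival rate * mean time in system). The following is a rough sketch, assuming steady state and perfectly even load balancing across pods, neither of which was verified in the test; the function names are hypothetical.

```python
PODS = 64            # 2 data centers * 32 pods
RPS = 110.0          # sustained request rate
IN_FLIGHT = 200.0    # observed lower bound ("went above 200")


def mean_time_in_system(in_flight: float, arrival_rate: float) -> float:
    """Little's law: L = lambda * W, so W = L / lambda (in seconds)."""
    return in_flight / arrival_rate


def per_pod(total: float, pods: int) -> float:
    """Share per pod, assuming perfectly even load balancing."""
    return total / pods


if __name__ == "__main__":
    print(f"mean time in system: {mean_time_in_system(IN_FLIGHT, RPS):.2f} s")
    print(f"rps per pod:         {per_pod(RPS, PODS):.2f}")
    print(f"in-flight per pod:   {per_pod(IN_FLIGHT, PODS):.2f}")
```

With these numbers the mean time in system comes out at about 1.8 seconds, at under 2 rps and roughly 3 in-flight requests per pod. Since the in-flight count went above 200, these are lower bounds.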
Has anyone done something similar? Or does anyone have a real-world example of LibreTranslate under load?