I was wondering if anyone has any performance or benchmark information available, not necessarily for specific hardware, but for example: how many one-paragraph requests can a server of a certain size (assuming no GPU) handle in a reasonable amount of time (5-6 seconds or less)?
I guess the kind of information I am looking for is what kind of performance can you expect from systems similar to the following if anyone happens to know?
2x Xeon E5-2660 v3 with 256 GB DDR4?
These servers can be had for around 500-600 USD on eBay, which is why I went with this example. It's a pretty specific ask, but honestly I would be happy to see any kind of data from running on a physical box rather than in a container/pod.
CTranslate2 publishes some benchmarks you can look at (Argos Translate uses int8 quantization). The CTranslate2 benchmarks don't include the time for the Stanza sentence-boundary-detection step in Argos Translate, or the overhead of the LibreTranslate application itself, so LibreTranslate will probably be ~2x slower than the CTranslate2 numbers.
I don't think we've done many benchmarks for LibreTranslate end to end. As a heuristic, I would estimate LibreTranslate does ~3 sentences/second on mid-range CPUs and 15-20 sentences/second on high-end CPUs.
Adding automation for benchmarking could be a good feature. If we have a standard script for benchmarking LibreTranslate instances we could have people submit data on specific hardware to publish.
If this became a standard feature, or if a script could be written for it, I could run it in various environments to give us all an idea, or at least a starting point to reference. To make this work you would probably have to disable caching (if caching is even a thing, which it may not be?).
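As a starting point, here is a minimal sketch of what such a benchmark script could look like, using only the Python standard library. The endpoint URL, language pair, and the 20-request sample size are assumptions for illustration; the `/translate` JSON payload follows the shape LibreTranslate's API uses, but adjust it for your own instance:

```python
# Minimal sequential latency benchmark sketch for a LibreTranslate instance.
# The URL (passed on the command line), language pair, and request count
# below are placeholder assumptions; adjust for your own server.
import json
import statistics
import sys
import time
import urllib.request


def summarize(latencies):
    """Return (mean, median, p95) for a list of per-request latencies."""
    ordered = sorted(latencies)
    idx = min(len(ordered) - 1, int(round(0.95 * len(ordered))) - 1)
    return statistics.mean(ordered), statistics.median(ordered), ordered[idx]


def time_request(url, text):
    """Send one translate request and return the elapsed wall-clock time."""
    payload = json.dumps({"q": text, "source": "en", "target": "es"}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    url = sys.argv[1]  # e.g. http://localhost:5000/translate
    paragraph = "A short one-paragraph sample for benchmarking. " * 5
    times = [time_request(url, paragraph) for _ in range(20)]
    mean, median, p95 = summarize(times)
    print(f"mean={mean:.2f}s median={median:.2f}s p95={p95:.2f}s")
```

Run it as `python bench.py http://localhost:5000/translate`. Reporting median and p95 rather than just the mean helps spot warm-up effects on the first request.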
LibreTranslate can use parallel CPU cores pretty well so you could also test how many requests it can handle at once. Just be careful not to overload other people’s servers.
Yea, knowing how many requests per second a few different hosts can handle would probably be a lot more useful, because if you are only running one request at a time the result should always be about the same, assuming you aren't running on a truly ancient machine. Is there any chance the script could be modified for that? Maybe something like an argument that lets you set how many concurrent requests to send. My Python isn't super strong, otherwise I would modify it myself.
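A concurrent version could be sketched like this, using a thread pool from the standard library. The `--workers` and `--requests` flags, the endpoint URL, and the language pair are all assumptions for illustration, not anything built into LibreTranslate:

```python
# Sketch of a concurrent throughput test for a translate endpoint.
# The --workers / --requests flags and the payload fields are assumptions
# for illustration; point the url argument at your own instance.
import argparse
import json
import sys
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def one_request(url, text):
    """Send a single translate request and wait for the response."""
    payload = json.dumps({"q": text, "source": "en", "target": "es"}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req).read()


def throughput(url, text, total, workers):
    """Fire `total` requests across `workers` threads; return requests/second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(one_request, url, text) for _ in range(total)]
        for f in futures:
            f.result()  # re-raise any request errors
    return total / (time.perf_counter() - start)


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("url")
    parser.add_argument("--requests", type=int, default=50)
    parser.add_argument("--workers", type=int, default=4)
    return parser.parse_args(argv)


if __name__ == "__main__" and len(sys.argv) > 1:
    args = parse_args(sys.argv[1:])
    rps = throughput(args.url, "A short test sentence.", args.requests, args.workers)
    print(f"{rps:.2f} requests/second at concurrency {args.workers}")
```

Stepping `--workers` up (4, 8, 16, ...) until requests/second stops improving would show roughly where a given box saturates, which matches the "find the point where the server starts to lag" goal.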
I intend to run it against my own resources, not public APIs, so finding the point where the server starts to lag terribly or crashes is totally acceptable.