OpenNMT-py v3.4.3 released - blazing fast beam search inference

argosopentech · January 24, 2024, 8:16pm

Hello Community

We are happy to release v3.4.3 with very fast beam search inference.
In essence we more than doubled the inference speed.

Some numbers:
v3.0.3 (Feb 2023)

Batch size	tok/sec	Time (s)	Memory
32	2733	33.8	940M
64	4305	22.2	1.6G
128	6296	15.8	2.7G
256	8002	12.8	4.5G
512	8836	11.8	5.6G
960	8805	11.8	9.9G

v3.3.0 (June 2023)

Batch size	tok/sec	Time (s)	Memory
32	2520	36.0	990M
64	3880	24.0	1.7G
128	5591	17.2	2.9G
256	7232	13.6	4.4G
512	7934	12.6	5.4G
960	7966	12.5	9.5G

v3.4.3 (Nov 2023)

Batch size	tok/sec	Time (s)	Memory
32	5853	16.8	990M
64	10249	10.4	1.1G
128	15025	7.8	2.0G
256	18667	6.6	2.7G
512	20319	6.3	5.9G
960	21027	6.1	8.9G

All these numbers were measured on an RTX 4090 with a vanilla EN-DE base transformer.
The test set is 3003 sentences from WMT14, using a beam_size of 4.
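
To make the "more than doubled" claim easy to check, here is a minimal sketch that computes the per-batch-size speedup from the tok/sec columns above. The numbers are copied from the tables; the dictionary layout and the `speedup` helper are just illustrative names, not part of OpenNMT-py.

```python
# Throughput in tok/sec, copied from the three tables above, keyed by batch size.
tok_per_sec = {
    "v3.0.3": {32: 2733, 64: 4305, 128: 6296, 256: 8002, 512: 8836, 960: 8805},
    "v3.3.0": {32: 2520, 64: 3880, 128: 5591, 256: 7232, 512: 7934, 960: 7966},
    "v3.4.3": {32: 5853, 64: 10249, 128: 15025, 256: 18667, 512: 20319, 960: 21027},
}


def speedup(new, old):
    """Return {batch_size: throughput ratio of `new` over `old`} (illustrative helper)."""
    return {bs: tok_per_sec[new][bs] / tok_per_sec[old][bs] for bs in tok_per_sec[new]}


for baseline in ("v3.0.3", "v3.3.0"):
    ratios = speedup("v3.4.3", baseline)
    line = ", ".join(f"bs {bs}: {r:.2f}x" for bs, r in ratios.items())
    print(f"v3.4.3 vs {baseline} -> {line}")
```

With these figures, every ratio comes out above 2 against both older releases.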
A few comments:

The reported tok/sec is measured inside the translator, so it does not account for:

Python interpreter startup / termination (about 1.5 sec on my system)
Model loading (0.4 sec on my system)

I ran the same test with CT2:
With a batch size of 960 examples, it takes 2.3 sec. To be fair, we need to subtract the Python startup/termination time from the OpenNMT-py figure (so 6.1 sec - 1.5 sec = 4.6 sec).
So OpenNMT-py is still about twice as slow as CT2 at batch size 960, and about 3 times as slow at batch size 32.
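
The batch size 960 comparison can be reproduced with a few lines of arithmetic. This is only a sketch using the timings quoted above; it assumes the 2.3 sec CT2 figure already excludes interpreter startup/termination, and the variable names are made up for illustration.

```python
# Timings in seconds at batch size 960, taken from the post.
opennmt_total_s = 6.1    # OpenNMT-py v3.4.3 wall time
python_overhead_s = 1.5  # interpreter startup / termination
ct2_s = 2.3              # CT2 time (assumed to exclude the Python overhead)

# Remove the interpreter overhead so both timings cover the same work.
opennmt_adjusted_s = opennmt_total_s - python_overhead_s
ratio = opennmt_adjusted_s / ct2_s

print(f"OpenNMT-py adjusted: {opennmt_adjusted_s:.1f} s, {ratio:.1f}x the CT2 time")
```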