In my tests, the GPU also showed little performance gain for single translations.
I think there’s a good use case for the GPU in batch inference via CTranslate2, where it is genuinely faster, but it’s hard to scale correctly: large batch sizes quickly fill up GPU memory. One mitigation is shown in the sketch below.
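
CTranslate2’s `translate_batch` takes a `max_batch_size` argument (and a `batch_type` of `"examples"` or `"tokens"`) that splits a large input list into bounded sub-batches, which keeps GPU memory usage in check even when you submit many sentences at once. Here is a minimal sketch, assuming a converted CTranslate2 model directory and a SentencePiece tokenizer; the paths are placeholders:

```python
import ctranslate2
import sentencepiece as spm

# Placeholder paths: point these at your converted model and tokenizer.
translator = ctranslate2.Translator("ct2_model_dir", device="cuda")
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

sentences = ["Hello world.", "How are you?"]
tokens = [sp.encode(s, out_type=str) for s in sentences]

# batch_type="tokens" bounds the total token count per decoded batch,
# which tracks GPU memory more closely than an example count does, so a
# large input list is processed in memory-safe chunks.
results = translator.translate_batch(
    tokens,
    max_batch_size=1024,
    batch_type="tokens",
)

for result in results:
    print(sp.decode(result.hypotheses[0]))
```

With this approach you can hand the translator the whole workload and let it form sub-batches internally; tuning `max_batch_size` down is the first knob to turn if you still hit out-of-memory errors.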