Using sentences from a copyrighted source

Is it possible to train on sentences from a publicly available dictionary that is under copyright?

In what context? When training Argos Translate models, I generally try to use data that is either not protected by copyright or licensed for commercial use. If you train your own model you can use whatever data you want, though. Realistically, virtually all LLMs train on some copyrighted data. Most data from web scraping is probably covered by copyright, and there’s no good way to contact rights-holders at scale. Copyright for the Internet Era is just kinda broken. Unless your model is giving users copyrighted text verbatim, I think you’re probably fine.


It’s an open legal question, and at least in the U.S. we’ll find out in a few years. But if you want to err on the safe (and ethical) side, don’t do it.

