Using sentences of a copyrighted source

be4zad · March 29, 2024, 7:33am

Is it possible to train on sentences from a publicly available dictionary that is under copyright?

argosopentech · March 29, 2024, 12:03pm

In what context? When training Argos Translate models, I generally try to use data that is free of copyright restrictions and permits commercial use. If you train your own model you can use whatever data you want, though. Realistically, virtually all LLMs train on some copyrighted data. Most data from web scraping is probably covered by copyright, and there’s no good way to contact rights-holders at scale. Copyright for the Internet Era is just kinda broken. Unless your model is giving users copyrighted text verbatim, I think you’re probably fine.

pierotofy · March 29, 2024, 7:11pm

It’s an open legal question, and at least in the U.S. we’ll find out in a few years. But if you want to err on the safe (and ethical) side, don’t do it.