A copyright lawsuit with major implications for AI’s large language models (LLMs) has taken a turn as OpenAI has said the New York Times (NYT) hacked ChatGPT. OpenAI is hoping to get the case dismissed on this basis, after federal judges in smaller cases have dismissed similar claims for similar reasons.
Such a dismissal would not fully settle the case, as this defense does not address the broader issue of AI companies scraping the open internet for material to train their models. But it could end claims, for the purposes of copyright damages, that LLMs provide an equal substitute for the publications they train on.
Fair use defense to be tested by NYT/OpenAI copyright lawsuit
OpenAI’s motion for dismissal essentially argues that ChatGPT is not a viable tool for reproducing articles that one would otherwise need a subscription to access. That does not address the greater question of whether content scraping can be used to train models, but it does address the way in which NYT has chosen to illustrate how the model uses copyrighted material. As part of its copyright lawsuit filing, the paper submitted some 100 examples of the chatbot reproducing considerable chunks of existing articles that appear to be part of its training data.
OpenAI contends that NYT essentially hacked ChatGPT in the creation of these samples. It notes that the paper may have had to use thousands to tens of thousands of prompts to get the chatbot to “hallucinate” part of an existing article and, once it found a match, would repeatedly prompt it for the next sentence in the article. OpenAI is also accusing NYT of hiring a “prompt engineer,” a specialist who knows how to massage LLMs into producing this specific output.
Whether or not a court finds that NYT hacked ChatGPT, the articles were clearly used and stored as a part of the chatbot’s training material. The copyright lawsuit will ultimately hinge on whether that is considered legal under the “fair use doctrine” used to govern rights in the case of derivative works, and will test the claim of OpenAI (and others in the AI space) that any content they can reach on the internet is theirs to rightfully use for AI training.
That broader issue will likely drag on for some time, and stands a good chance of winding up before the Supreme Court. For now, the question before the court is whether NYT has a right to claim damages for losses that would essentially trace back to someone reading an article through an LLM instead of paying to access the paper’s site.
Court to examine if NYT hacked ChatGPT
Some recent federal court decisions in smaller and more individual cases tend to support OpenAI’s claim that an LLM cannot be seen as a substitute for a publication or a threat to its profits. However, some of those decisions also seem to have hinged on a judge openly stating that they don’t really understand the technological aspects of the issue. Courts tend to struggle with hacking issues and charges for this reason, and it will be interesting to see if the judge determines that NYT hacked ChatGPT by simply bombarding it with prompts in search of evidence of its articles being repeated.
OpenAI, like its rivals, has a strong motivation to use every trick in the book to get these cases dismissed. If copyright lawsuits ultimately determine that LLM developers don’t have some natural right to scraped content, they could need to flush their present databases and start over with a much smaller collection of material (or face massive fines per piece of unauthorized content used). OpenAI has already gone on record saying that it cannot run a chatbot without freely scraping the internet for massive quantities of material, and having to pay for access to that much material could undermine the entire project.