
Forum: Author Hangout

AI & copyright infringement

Switch Blayde

Interesting article: "OpenAI whistleblower found dead by apparent suicide" at
https://www.yahoo.com/news/openai-whistleblower-found-dead-apparent-050222303.html

- Suchir Balaji, a former OpenAI researcher, was found dead on Nov. 26 in his apartment, reports say.

- Balaji, 26, had been an OpenAI researcher for four years before leaving the company in August.

- He had accused his employer of violating copyright law with its highly popular ChatGPT model.

From Balaji's essay:

"While generative models rarely produce outputs that are substantially similar to any of their training inputs, the process of training a generative model involves making copies of copyrighted data," Balaji wrote. "If these copies are unauthorized, this could potentially be considered copyright infringement, depending on whether or not the specific use of the model qualifies as 'fair use.' Because fair use is determined on a case-by-case basis, no broad statement can be made about when generative AI qualifies for fair use."

Radagast

@Switch Blayde

Reminds me of the Boeing whistleblower who committed arkancide the day before testifying, or the inventor of FLIR who had a fatal heart attack the day before testifying about the FBI's Waco imagery. OpenAI recently appointed a retired NSA director to its board of directors.

Balthus

@Switch Blayde

Yeah, this is a problem with all generative AI. But like most things in this second gilded age, we're just along for the ride. Another problem that may crop up in the future is the corruption of gen AI when it starts to scrape images that it assumes are real but are actually AI-generated.
By the way, I'm a visual artist by trade and if anyone would like to chat about using AI to illustrate sex stories, hit me up. I have my own preferences in terms of quality and subject matter, so this isn't necessarily a blanket offer, but if you're interested I can send you in the direction of some NSFW examples of my work.

akarge

@Switch Blayde

Looks like the AIs are getting proactive about protecting themselves. ☠️

jk

Radagast

@akarge

Sadly, Asimov's three laws weren't as exciting as Skynet.

JoeBobMack
Updated:

@Switch Blayde

What is "making a copy" that is a violation of copyright law in the internet age?

Every time I read an article, blog post, etc. on my computer, a copy is produced. This is not a violation of copyright law; it is expected. In the case of e-books, for example, my use may be constrained by license agreements and, in some cases, digital rights management software. But the point is, just making a copy is not a copyright violation.

As I understand it, the key issue in copyright challenges to the use of "works" to train AI involves the question of a type of "fair use," specifically, whether the use is "transformative." My understanding is that all of these issues are still unresolved in the various court cases -- there has been no ruling, and certainly not a controlling precedent from the Supreme Court.

However, the claim that "Because fair use is determined on a case-by-case basis, no broad statement can be made about when generative AI qualifies for fair use," seems mostly false as far as the use of materials to train AIs is concerned. The argument on the side of fair use is that the use is transformative -- the process analyzes the material, extracts patterns and principles, then does the same with vast quantities of other material (much of which may not be copyrighted) and the end result is a large language model -- an "AI." I get this is vastly oversimplified, but I think that is the gist. If this argument ultimately prevails, it will, indeed, be a broad statement that approves of the use of copyrighted materials for training purposes of AI.

And, if AI training is allowed, the production of work with AI will not automatically be copyright violation, even if it mimics the style of a particular artist. I suspect those cases would get into the weeds of "similarity" much like some of the cases based on the similarity of music.

Going back to the training question, I suspect that this is going to come down to two things: the "transformative use" question and relative harm.

What is the balance between the harm to the public -- to society -- of hindering or even stopping development of this technology vs. the harm to any individual copyright holder of allowing it to go forward?

Since the use for training involves only one copy, the loss to the holder is minuscule. The loss to society is potentially huge, and the burden on the companies to compensate copyright holders individually would be huge. (And, of course, the companies and large copyright holders are already negotiating and making licensing agreements to address this issue.)

Of course, the horse is already out of the barn, and whether or not this affects the legal analysis, it will affect the decisions and strategies of the participants, especially the plaintiffs. Some will likely choose not to pursue a long and expensive legal case for little to no effect in the real world.

Edit: This is obviously from a United States perspective. While I suspect that what the US does will carry significant weight in the final international framework, it will not necessarily be controlling. See, for example, privacy laws and rules on "subversive" materials, which differ between countries and require compliance schemes from the big international players.

julka

@JoeBobMack

> Since the use for training involves only one copy, the loss to the holder is minuscule. The loss to society is potentially huge, and the burden on the companies to compensate copyright holders individually would be huge.

I don't think your logic on loss follows. Loss would be calculated based on how it's used by the person who violated copyright, not based on how many copies were taken. If somebody copies a book and sells it on Amazon, they took one copy from the author but profited on the thousand copies they sold; did the author lose the one copy, or the thousand potential sales that went to the infringing item?

And saying that the burden to repay copyright holders would be huge is irrelevant, isn't it? You shouldn't, in a just system, be exonerated of a crime just because the magnitude of your crime exceeds your capacity to pay for the crime. Companies are fined in excess of their holdings and go bankrupt, and there is a legal process to identify who has senior claims on the money.

Switch Blayde

@julka

> I don't think your logic on loss follows.

It's not my logic. It's the guy who committed suicide who's giving his beliefs. I guess he was some bigshot in the development of AI and then started warning people of the dangers of AI.

julka
Updated:

@Switch Blayde

> It's not my logic. It's the guy who committed suicide who's giving his beliefs. I guess he was some bigshot in the development of AI and then started warning people of the dangers of AI.

Edit: Wait, you're not even the person I was quoting. Yes, I agree it's not your logic, but I don't believe I implied that it was when I quoted an entirely different person.

JoeBobMack

@julka

> If somebody copies a book and sells it on Amazon, they took one copy from the author but profited on the thousand copies they sold; did the author lose the one copy, or the thousand potential sales that went to the infringing item?

That would be 1001 copies, not one, and the number of copies, combined with the profiting commercially from the infringement, would be elements weighing in favor of a finding of copyright violation and awarding damages.

> And saying that the burden to repay copyright holders would be huge is irrelevant, isn't it? You shouldn't, in a just system, be exonerated of a crime just because the magnitude of your crime exceeds your capacity to pay for the crime.

Yeah, I wasn't clear here. I was thinking more about legislation -- schemes that might be set up which would weigh the loss to the public and the burden on those developing what could be a revolutionary technology for society against a small, remote, and mostly hypothetical loss to an individual copyright holder.

Finally, copyright infringement isn't a crime; it's a civil matter. Including an author's work in an LLM's training materials doesn't "steal" the author's copyright. They still have it. At most, it is the loss of one sale of that work. Which, of course, isn't what authors are concerned about -- they are worried about AI putting them out of work. As are many, many others. We don't know yet if those worries are justified, but I personally lean toward the view that these systems are going to be VERY disruptive.

awnlee jawking

@JoeBobMack

> At most, it is the loss of one sale of that work.

If that were true, then there wouldn't be any point in their stealing that work in the first place. But the number of times that work is used by the AI's output could be zillions.

Contrast that with pop music, where an artist pays a fee to include a sample of another artist's work and, if the new work is wildly successful, further royalties may be payable.

Why should writers have less protection than pop stars?

AJ

Dominions Son
Updated:

@awnlee jawking

> Contrast that with pop music,

What you describe applies to all music, not just pop. And a large part of it is the fact that there is a compulsory licensing scheme for music in the US that is managed by a government-sponsored entity*.

And what's meant by compulsory is that an artist can't refuse to grant a license, or revoke one, because they don't like what their music is going to be used for or who is using it.

For example, Bruce Springsteen's "Born in the U.S.A." is seen by a lot of people as a celebration of patriotism and frequently gets used by conservative politicians at rallies.

This is actually the opposite of what Bruce Springsteen intended when he wrote it, and he has threatened to sue conservative politicians for using it.

However, as long as those politicians have paid into the compulsory licensing scheme, he doesn't have any legitimate basis to sue.

The compulsory licensing is also what keeps parody artists like "Weird Al" Yankovic from getting sued into oblivion.

You might not be so happy if such a scheme got applied to stories.

*I believe that the US is not the only country with compulsory licensing for music.

Grey Wolf

@Dominions Son

> The compulsory licensing is also what keeps parody artists like "Weird Al" Yankovic from getting sued into oblivion.

"Weird Al" is not the best example here. Al has gotten permission from the artist (and label, as necessary) for all of his songs. He has written songs for which he did not receive permission; those have not been released.

'Parody' is legally protected use, but most of Weird Al's songs do not really fall into the formal definition of 'parody' and might be lawsuit fodder. However, Al's bigger goal is to maintain positive relationships with other recording artists.

And, of course, 'legally protected use' doesn't pay lawyers in a lawsuit. You can be legally in the right but still 'sued into oblivion.'

JoeBobMack

@awnlee jawking

> If that were true, then there wouldn't be any point in their stealing that work in the first place.

It's true, and the motivation wasn't to allow the LLM to analyze that one novel, but rather the millions in the datasets. And, of course, licensing agreements, either already made or in the works, may drastically reduce the relevance of these points.

> But the number of times that work is used by the AI's output could be zillions.

As I understand it, the "use" by an LLM of any particular work does not include accessing a copy of the work (which isn't "kept" in the model after training).

I think it is more accurate to say that, after studying the structure of millions of novels, plus huge amounts of other information, current LLMs can produce chunks of text that read very much like "stories" produced by authors. However, per Sturgeon's law, 90% of everything is crap, and the LLMs were trained on lots of crap, so... GIGO.

That said, while my few early experiments with getting LLMs to "write" turned me off to that effort very quickly -- I like my writing better! -- I don't generally find what they produce to be garbage. For example:

I've used them to write macros to take story planning info from an Excel spreadsheet and put it in outline form in a Word document, something I would not have known how to do myself.
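
Just to give the flavor, here's a rough equivalent of that kind of conversion sketched in Python rather than an Office macro. The file name, sheet layout, and column meanings are all invented for illustration; this is a sketch of the idea, not the actual macro the AI wrote for me:

from openpyxl import load_workbook  # reads .xlsx spreadsheets
from docx import Document           # writes .docx files

# Hypothetical layout: column A = chapter, B = scene, C = planning notes.
wb = load_workbook("story_planning.xlsx", data_only=True)
ws = wb.active

doc = Document()
doc.add_heading("Story Outline", level=0)

# Walk the rows (skipping the header) and build an indented outline.
for chapter, scene, notes in ws.iter_rows(min_row=2, max_col=3, values_only=True):
    if chapter:
        doc.add_heading(str(chapter), level=1)
    if scene:
        doc.add_heading(str(scene), level=2)
    if notes:
        doc.add_paragraph(str(notes))

doc.save("story_outline.docx")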

I like writing about my stories and having the AI respond with strengths and weaknesses. So, I might put down my thoughts about events to occur in one section of the novel and then get feedback. I benefit just from getting my ideas down, and knowing I can get the AI to organize my stream-of-consciousness rambling frees me up from letting a focus on organization hinder creativity. Plus, sometimes the feedback from the AI sparks something I hadn't thought of, though generally it's not "That's it!" but "That isn't it, but it makes me think of this other idea I really like."

As Ethan Mollick says, LLMs are weird. What they are good at and what they aren't is unpredictable -- a "jagged frontier." However, research into their use in organizations is already showing that they make professionals better -- at least the bottom 80%, though it also seems that they may be of value to the top 20%, just in different ways. And the only way to discover their benefits is to play with them. In fact, one CEO of an international corporation said it was his non-English-speaking employees who produced the most significant use cases for AI, mostly because they adopted the tools early to help polish their written English.

I remember reading somewhere that the impact of technological developments between 1900 and 1920 was greater than that of the rest of the century. I have a feeling the pace of change from 2020 to 2040 could parallel that of the early 1900s. I hope I live to see it.

julka
Updated:

@JoeBobMack

> a small, remote, and mostly hypothetical loss to an individual copyright holder.

Well, it's not "mostly hypothetical" - a large number of copyrighted works were used for commercial purposes without permission, ranging from published writing to videos. And "small" seems arguable as well; if the data was used in training, then I think there's a pretty reasonable argument that it's contributing to every single query made to the LLM, and now that's some amount of loss that the original artist is suffering. At over a billion queries per day, that's gonna add up at even tiny fractions of a penny. And "individual copyright holder" sounds like a deliberately myopic view of the situation, because while nailing down the exact number of people whose work was used without permission would be challenging, the number is going to be larger than 1, by many orders of magnitude.

Edit: And to be clear, I think the argument about LLMs being a revolutionary technology for society is another one that's worth interrogating a little bit - it's easy to say "wow this tool is incredible" when somebody else is footing the bill for it. As soon as the various AI companies start charging actual rates for their LLMs, I'll be interested to see how many companies decide that maybe the value prop doesn't make as much sense anymore.

Switch Blayde

@julka

> the argument about LLMs being a revolutionary technology for society is another one that's worth interrogating a little bit - it's easy to say "wow this tool is incredible" when somebody else is footing the bill for it.

That was mentioned by the guy in the article:

> He cited a research paper that described the example of Stack Overflow, a coding Q&A website that saw big declines in traffic and user engagement after ChatGPT and AI models such as GPT-4 came out.

> Large language models and chatbots answer user questions directly, so there's less need for people to go to the original sources for answers now.

> In the case of Stack Overflow, chatbots and LLMs are answering coding questions, so fewer people visit Stack Overflow to ask the community for help. This means the coding website generates less new human content.

julka

@Switch Blayde

That's not responsive to my argument, though. It costs a lot of money to keep ChatGPT up and running, both in terms of compute infrastructure and energy costs. Right now all that money is being subsidized by venture capital. If you have to pay a dollar every time you submit a question to chatgpt, are you still going there to get an answer instead of stackoverflow? If your question requires four rounds of prompt engineering after your initial query to get what you're looking for and you get charged for four pieces of garbage before you get an answer you can use, is the tool still cost effective? When you pay real money and the tool gives you what is functionally a confident answer that there are two Rs in "STRAWBERRY", are you still feeling like society is undergoing a revolutionary change, or do you think that you used to be able to get garbage for free on the internet and didn't have to pay for the privilege?

Grey Wolf

@julka

Note that 'fair use' can apply in the case of commercial use, and AI training is almost certainly the most 'transformative' use possible of the original material. If AI training is not 'transformative', it is nearly impossible to imagine a use which would be 'transformative'.

For a fair use analysis, one factor is the percentage of the work used. That is literally impossible to determine for AI training. The closest approximation might be how much of the work could be regenerated from the resulting AI, but that's not really a reasonable test for a number of reasons.

Another (for commercial works) would be the value of the copied work in relation to the value of the new work. In this case, the value of any one source would be infinitesimal (even if one counted the entire 'New York Times' as a source).

Licensing seems good on its face, but as a practical matter, I'm not sure how it works. If AI training has infringed, the set of copyright-holders whose rights have been violated is very close to the set of 'entities which have published on the internet'. The mere mechanics of giving each of them so much as a penny would likely cost a ridiculous amount of money.

julka

@Grey Wolf

I'll grant you that the works were certainly transformed, but in this case it's more like they were transformed from "a collection of individual works" into "an oracle that can reproduce copyrighted texts word-by-word", which is maybe not what the concept of "fair use" is really looking for when it comes to transformation. It's absolutely possible to e.g. ask ChatGPT to give you the first word of Faulkner's "As I Lay Dying", and the second word, and the third word, and with enough prompt engineering you can just get it to spit out chunks of text from the book. Note that "As I Lay Dying" doesn't enter the public domain until next year, so that seems like a bit of a problem.

And yes, I agree that the mechanics of paying licensing fees to everybody who publishes works on the internet would be ridiculously expensive. It probably means that slurping up everything on the internet and using it in a commercial enterprise without permission is a bad idea! Maybe companies shouldn't do it.

Grey Wolf

@julka

> an oracle that can reproduce copyrighted texts word-by-word

This doesn't really play out in reality, though, your 'As I Lay Dying' examples notwithstanding. Sure, it can reproduce some random chunks. So can some readers who borrowed 'As I Lay Dying' from the library, memorized bits of it, and returned it. By itself, that's meaningless. I can reproduce large chunks of the script to 'Monty Python and the Holy Grail', but I am not guilty of infringing its copyright.

Information theory offers some guidance here. It is widely believed that the best achievable lossless compression of arbitrary English text is about 2.3 bits per character. For somewhat constrained sets, known encodings can get down to 1 bit per character (note, however, that this would preclude e.g. computer code). The database sizes of most known LLM models are 1/100th to 1/100,000th of their input size, so it's literally impossible for them to contain the vast majority of the input text. For instance, based on what's publicly reported, ChatGPT's input text was 45TB of compressed data (likely well over 180TB uncompressed) and its runtime database is under 600GB. That's a 300-fold reduction in size, well beyond the level at which the model could even theoretically contain the entirety of the average input work (and that ignores the fact that this isn't mere compression but a transformation of the nature of the information, which further reduces the effective 'compression' bandwidth).
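
To put rough numbers on that, here's a back-of-the-envelope check in Python. Every figure is one of the publicly reported estimates quoted above, so treat this as a sketch, not confirmed data:

GB_PER_TB = 1000

input_raw_gb = 180 * GB_PER_TB  # assumed uncompressed training text (~180TB)
model_gb = 600                  # reported runtime database size (~600GB)

reduction = input_raw_gb / model_gb
print(f"Model is {reduction:.0f}x smaller than its raw input text")  # ~300x

# Best-case lossless compression of arbitrary English text is believed to be
# ~2.3 bits/char versus 8 bits/char for plain ASCII, i.e. about 3.5x.
best_lossless = 8 / 2.3
print(f"Best-case lossless compression: {best_lossless:.1f}x")

# Storing all the input verbatim would mean beating that bound by ~86x.
print(f"Gap beyond the lossless bound: {reduction / best_lossless:.0f}x")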

Note that I'm not saying a court will declare that AI training is fair use. What I'm saying is that, based on the written definition of fair use and the history of cases related to it prior to the rise of AI training, AI training is entirely consistent with fair use, and it was and remains reasonable for those training AIs to believe it is fair use.

julka
Updated:

@Grey Wolf

> I can reproduce large chunks of the script to 'Monty Python and the Holy Grail', but I am not guilty of infringing its copyright.

If you offer a service in which you take money from people and on-demand produce copies of copyrighted works, you are absolutely infringing on copyright.

You're getting hung up on the idea that the text is "memorized", which a) is meaningless to a computer and b) is completely irrelevant to the issue. I can't memorize a book, then type it out and sell my typed copies for money, even though I memorized it; it's not my book. The memorization isn't the issue, the reproduction for money is.

awnlee jawking

@Grey Wolf

Do the characters in your stories pay for the books they use for the purposes of their own education or do they steal them, claiming 'fair use'?

AJ

Grey Wolf
Updated:

@awnlee jawking

They pay for their books, certainly, but that's somewhat of a red herring (in my opinion). The point is access, not content. No one would claim that two people sharing a textbook is a copyright violation (though, the way publication is going right now, I imagine some publishers are already looking for a way to do that). One buys a textbook to have access to it 24/7, not to acquire the legal right to read the content in it.

Two (or more) individuals living in the same house, all of whom have agreed to not mark up the textbook, might well share a copy of it. That is entirely legal and above-board, and always has been.

In the modern era, of course, publishers have realized that they can include a one-use access code with the textbook and gate homework assignments behind it (with an online turn-in scheme that effectively prohibits sharing). That precludes both same-time sharing and the used-book market. I consider that abusive and hope that professors will not select those books (even if the publishers give them incentives to use their book), but it's where we are currently.

Here's a reverse analogy: when you borrow a book from the library, do you force yourself to forget it when you return it? If not, you have 'copied' the book, have you not? You may well have 'copied' it far more effectively than most AIs have 'copied' it.

awnlee jawking

@Grey Wolf

> Here's a reverse analogy: when you borrow a book from the library, do you force yourself to forget it when you return it?

The library paid for the book and the author received royalties for it (unless the work was out of copyright?)

If an AI company paid for a book, they could train their product on it as many times as they wanted without any issues IMO.

AJ

Grey Wolf

@awnlee jawking

To the best of my knowledge, the AI companies used information freely available on the internet without paywalls. Thus, they paid for the content at the same rate anyone else would have (and presumably thus could train their product on it as many times as they wanted).

The issue isn't payment, it's copying. The author receiving royalties does not entitle one to copy the work. If AI training is 'copying', then the AI companies cannot train regardless of payment. If AI training is 'fair use', then the AI companies can train - but not regardless of payment (they cannot steal the work, then train on it).

Big Ed Magusson

@Grey Wolf

Yes... but some of the AI companies also fed published books that weren't free on the internet into their algorithms. That's what publishers are suing over.

awnlee jawking

@Switch Blayde

I believe at least one major player in the AI market has applied to the government for a waiver to have copyright set aside so they can use the texts for AI training.

Sorry, I can't track down the news story. I think it was from this year.

AJ

awnlee jawking

@Switch Blayde

The UK's new socialist government has been receptive to arguments from tech companies that copyright law should be waived for the purpose of training AI. The government's latest idea is that there should be an opt-out facility for those who don't want their work plagiarised.

AJ

Pixy

@awnlee jawking

> The government's latest idea

And they have a few of them....

DBActive
Updated:

@Switch Blayde

Theft is nothing new and that's what copyright infringement is.
People have no respect for other people's intellectual property. If it can be stolen, it will be stolen. It happens on this board all the time when people share links to works the author has deleted. And those same people are outraged when their works are stolen and appear on Amazon.

It may be a crime. https://www.justice.gov/archives/jm/criminal-resource-manual-1847-criminal-copyright-infringement-17-usc-506a-and-18-usc-2319

awnlee jawking

@DBActive

> Theft is nothing new and that's what copyright infringement is.
> People have no respect for other people's intellectual property. If it can be stolen, it will be stolen.

There's a difference between scumbags stealing other people's property and supposedly responsible companies stealing other people's property.

I'm not sure the scumbags don't occupy higher ground, morally.

AJ

jimq2

@Switch Blayde

Ever know someone with an eidetic memory? I knew a fellow at Princeton University who could read a text one day and recite it back a day or a week or a month later. The only time he didn't ace an exam was when he had to give original input.
