
Forum: Author Hangout

AI & copyright infringement

Switch Blayde

Interesting article: "OpenAI whistleblower found dead by apparent suicide" at
https://www.yahoo.com/news/openai-whistleblower-found-dead-apparent-050222303.html

- Suchir Balaji, a former OpenAI researcher, was found dead on Nov. 26 in his apartment, reports say.

- Balaji, 26, had been an OpenAI researcher for four years before leaving the company in August.

- He had accused his employer of violating copyright law with its highly popular ChatGPT model.

From Balaji's essay:

"While generative models rarely produce outputs that are substantially similar to any of their training inputs, the process of training a generative model involves making copies of copyrighted data," Balaji wrote. "If these copies are unauthorized, this could potentially be considered copyright infringement, depending on whether or not the specific use of the model qualifies as 'fair use.' Because fair use is determined on a case-by-case basis, no broad statement can be made about when generative AI qualifies for fair use."

Radagast

@Switch Blayde

Reminds me of the Boeing whistleblower who committed arkancide the day before testifying, or the inventor of FLIR who had a fatal heart attack the day before testifying about the FBI's Waco imagery. OpenAI recently appointed a retired NSA director to its board of directors.

Balthus

@Switch Blayde

Yeah, this is a problem with all generative AI. But like most things in this second gilded age, we're just along for the ride. Another problem that may crop up in the future is the corruption of gen AI when it starts to scrape images that it assumes are real but are actually AI-generated.
By the way, I'm a visual artist by trade and if anyone would like to chat about using AI to illustrate sex stories, hit me up. I have my own preferences in terms of quality and subject matter, so this isn't necessarily a blanket offer, but if you're interested I can send you in the direction of some NSFW examples of my work.

akarge

@Switch Blayde

Looks like the AIs are getting proactive about protecting themselves. ☠️

jk

Radagast

@akarge

Sadly, Asimov's three laws weren't as exciting as Skynet.

JoeBobMack
Updated:

@Switch Blayde

What is "making a copy" that is a violation of copyright law in the internet age?

Every time I read an article, blog post, etc. on my computer, a copy is produced. This is not a violation of copyright law; it is expected. In the case of e-books, for example, my use may be constrained by license agreements and, in some cases, digital rights management software. But the point is, just making a copy is not a copyright violation.

As I understand it, the key issue in copyright challenges to the use of "works" to train AI involves the question of a type of "fair use," specifically, whether the use is "transformative." My understanding is that all of these issues are still unresolved in the various court cases -- there has been no ruling, and certainly not a controlling precedent from the Supreme Court.

However, the claim that "Because fair use is determined on a case-by-case basis, no broad statement can be made about when generative AI qualifies for fair use," seems mostly false as far as the use of materials to train AIs is concerned. The argument on the side of fair use is that the use is transformative -- the process analyzes the material, extracts patterns and principles, then does the same with vast quantities of other material (much of which may not be copyrighted) and the end result is a large language model -- an "AI." I get this is vastly oversimplified, but I think that is the gist. If this argument ultimately prevails, it will, indeed, be a broad statement that approves of the use of copyrighted materials for training purposes of AI.

And, if AI training is allowed, the production of work with AI will not automatically be copyright violation, even if it mimics the style of a particular artist. I suspect those cases would get into the weeds of "similarity" much like some of the cases based on the similarity of music.

Going back to the training question, I suspect that this is going to come down to two things: the "transformative use" question and relative harm.

What is the balance between the harm to the public -- to society -- of hindering or even stopping development of this technology vs. the harm to any individual copyright holder of allowing it to go forward?

Since the use for training involves only one copy, the loss to the holder is minuscule. The loss to society is potentially huge, and the burden on the companies to compensate copyright holders individually would be huge. (And, of course, the companies and large copyright holders are already negotiating and making licensing agreements to address this issue.)

Of course, the horse is already out of the barn, and whether or not this affects the legal analysis, it will affect the decisions and strategies of the participants, especially the plaintiffs. Some will likely choose not to pursue a long and expensive legal case for little to no effect in the real world.

Edit: This is obviously from a United States perspective. While I suspect that what the US does will carry significant weight in the final international framework, it will not necessarily be controlling. See, for example, privacy laws and rules on "subversive" materials, which differ between countries and require compliance schemes from the big international players.

julka

@JoeBobMack

> Since the use for training involves only one copy, the loss to the holder is minuscule. The loss to society is potentially huge, and the burden on the companies to compensate copyright holders individually would be huge.

I don't think your logic on loss follows. Loss would be calculated based on how it's used by the person who violated copyright, not based on how many copies were taken. If somebody copies a book and sells it on Amazon, they took one copy from the author but profited on the thousand copies they sold; did the author lose the one copy, or the thousand potential sales that went to the infringing item?

And saying that the burden to repay copyright holders would be huge is irrelevant, isn't it? You shouldn't, in a just system, be exonerated of a crime just because the magnitude of your crime exceeds your capacity to pay for the crime. Companies are fined in excess of their holdings and go bankrupt, and there is a legal process to identify who has senior claims on the money.

Switch Blayde

@julka

> I don't think your logic on loss follows.

It's not my logic. It's the guy who committed suicide who's giving his beliefs. I guess he was some bigshot in the development of AI and then started warning people of the dangers of AI.

julka
Updated:

@Switch Blayde

> It's not my logic. It's the guy who committed suicide who's giving his beliefs. I guess he was some bigshot in the development of AI and then started warning people of the dangers of AI.

Edit: Wait, you're not even the person I was quoting. Yes, I agree it's not your logic, but I don't believe I implied that it was when I quoted an entirely different person.

JoeBobMack

@julka

> If somebody copies a book and sells it on Amazon, they took one copy from the author but profited on the thousand copies they sold; did the author lose the one copy, or the thousand potential sales that went to the infringing item?

That would be 1001 copies, not one, and the number of copies, combined with the profiting commercially from the infringement, would be elements weighing in favor of a finding of copyright violation and awarding damages.

> And saying that the burden to repay copyright holders would be huge is irrelevant, isn't it? You shouldn't, in a just system, be exonerated of a crime just because the magnitude of your crime exceeds your capacity to pay for the crime.

Yeah, I wasn't clear here. I was thinking more about legislation -- schemes that might be set up which would weigh the loss to the public and the burden on those developing what could be a revolutionary technology for society against a small, remote, and mostly hypothetical loss to an individual copyright holder.

Finally, copyright infringement isn't a crime; it's a civil matter. Including an author's work in an LLM's training materials doesn't "steal" the author's copyright. They still have it. At most, it is the loss of one sale of that work. Which, of course, isn't what authors are concerned about -- they are worried about AI putting them out of work. As are many, many others. We don't know yet if those worries are justified, but I personally lean toward the view that these systems are going to be VERY disruptive.

awnlee jawking

@JoeBobMack

> At most, it is the loss of one sale of that work.

If that were true, then there wouldn't be any point in their stealing that work in the first place. But the number of times that work is used by the AI's output could be zillions.

Contrast that with pop music, where an artist pays a fee to include a sample of another artist's work and, if the new work is wildly successful, further royalties may be payable.

Why should writers have less protection than pop stars?

AJ

Dominions Son
Updated:

@awnlee jawking

> Contrast that with pop music,

What you describe applies to all music, not just pop. And a large part of it is the fact that there is a compulsory licensing scheme for music in the US that is managed by a government-sponsored entity*.

And what's meant by compulsory is that an artist can't refuse to grant a license, or revoke one, because they don't like what their music is going to be used for or who is using it.

For example, Bruce Springsteen's "Born in the U.S.A." is seen by a lot of people as a celebration of patriotism and frequently gets used by conservative politicians at rallies.

This is actually the opposite of what Bruce Springsteen intended when he wrote it, and he has threatened to sue conservative politicians for using it.

However, as long as those politicians have paid into the compulsory licensing scheme, he doesn't have any legitimate basis to sue.

The compulsory licensing is also what keeps parody artists like "Weird Al" Yankovic from getting sued into oblivion.

You might not be so happy if such a scheme got applied to stories.

*I believe that the US is not the only country with compulsory licensing for music.

Grey Wolf

@Dominions Son

> The compulsory licensing is also what keeps parody artists like "Weird Al" Yankovic from getting sued into oblivion.

"Weird Al" is not the best example here. Al has gotten permission from the artist (and label, as necessary) for all of his songs. He has written songs for which he did not receive permission; those have not been released.

'Parody' is legally protected use, but most of Weird Al's songs do not really fall into the formal definition of 'parody' and might be lawsuit fodder. However, Al's bigger goal is to maintain positive relationships with other recording artists.

And, of course, 'legally protected use' doesn't pay lawyers in a lawsuit. You can be legally in the right but still 'sued into oblivion.'

JoeBobMack

@awnlee jawking

> If that were true, then there wouldn't be any point in their stealing that work in the first place.

It's true, and the motivation wasn't to allow the LLM to analyze that one novel, but rather the millions in the datasets. And, of course, licensing agreements, either already made or in the works, may drastically reduce the relevance of these points.

> But the number of times that work is used by the AI's output could be zillions.

As I understand it, the "use" by an LLM of any particular work does not include accessing a copy of the work (which isn't "kept" in the model after training).

I think it is more accurate to say that, after studying the structure of millions of novels, plus huge amounts of other information, current LLMs can produce chunks of text that read very much like "stories" produced by authors. However, per Sturgeon's law, 90% of everything is crap, and the LLMs were trained on lots of crap, so... GIGO.

That said, while my few early experiments with getting LLMs to "write" turned me off to that effort very quickly -- I like my writing better! -- I don't generally find what they produce to be garbage. For example:

I've used them to write macros to take story planning info from an Excel spreadsheet and put it in outline form in a Word document, something I would not have known how to do myself.
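
Just to give the flavor, here's a rough equivalent of that kind of conversion sketched in Python rather than an Office macro. The file name, sheet layout, and column meanings are all invented for illustration; this is a sketch of the idea, not the actual macro the AI wrote for me:

from openpyxl import load_workbook  # reads .xlsx spreadsheets
from docx import Document           # writes .docx files

# Hypothetical layout: column A = chapter, B = scene, C = planning notes.
wb = load_workbook("story_planning.xlsx", data_only=True)
ws = wb.active

doc = Document()
doc.add_heading("Story Outline", level=0)

# Walk the rows (skipping the header) and build an indented outline.
for chapter, scene, notes in ws.iter_rows(min_row=2, max_col=3, values_only=True):
    if chapter:
        doc.add_heading(str(chapter), level=1)
    if scene:
        doc.add_heading(str(scene), level=2)
    if notes:
        doc.add_paragraph(str(notes))

doc.save("story_outline.docx")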

I like writing about my stories and having the AI respond with strengths and weaknesses. So, I might put down my thoughts about events to occur in one section of the novel and then get feedback. I benefit just from getting my ideas down, and knowing I can get the AI to organize my stream-of-consciousness rambling frees me up from letting a focus on organization hinder creativity. Plus, sometimes the feedback from the AI sparks something I hadn't thought of, though generally it's not "That's it!" but "That isn't it, but it makes me think of this other idea I really like."

As Ethan Mollick says, LLMs are weird. What they are good at and what they aren't is unpredictable -- a "jagged frontier." However, research into their use in organizations is already showing that they make professionals better -- at least the bottom 80%, though it also seems that they may be of value to the top 20%, just in different ways. And the only way to discover their benefits is to play with them. In fact, one CEO of an international corporation said it was his non-English-speaking employees who produced the most significant use cases for AI, mostly because they adopted the tools early to help polish their written English.

I remember reading somewhere that the impact of technological developments between 1900 and 1920 was greater than that of the rest of the century. I have a feeling the pace of change from 2020 to 2040 could parallel that of the early 1900s. I hope I live to see it.

julka
Updated:

@JoeBobMack

> a small, remote, and mostly hypothetical loss to an individual copyright holder.

Well, it's not "mostly hypothetical" - a large number of copyrighted works were used for commercial purposes without permission, ranging from published writing to videos. And "small" seems arguable as well; if the data was used in training, then I think there's a pretty reasonable argument that it's contributing to every single query made to the LLM, and now that's some amount of loss that the original artist is suffering. At over a billion queries per day, that's gonna add up at even tiny fractions of a penny. And "individual copyright holder" sounds like a deliberately myopic view of the situation, because while nailing down the exact number of people whose work was used without permission would be challenging, the number is going to be larger than 1, by many orders of magnitude.

Edit: And to be clear, I think the argument about LLMs being a revolutionary technology for society is another one that's worth interrogating a little bit - it's easy to say "wow this tool is incredible" when somebody else is footing the bill for it. As soon as the various AI companies start charging actual rates for their LLMs, I'll be interested to see how many companies decide that maybe the value prop doesn't make as much sense anymore.

Switch Blayde

@julka

> the argument about LLMs being a revolutionary technology for society is another one that's worth interrogating a little bit - it's easy to say "wow this tool is incredible" when somebody else is footing the bill for it.

That was mentioned by the guy in the article:

> He cited a research paper that described the example of Stack Overflow, a coding Q&A website that saw big declines in traffic and user engagement after ChatGPT and AI models such as GPT-4 came out.

> Large language models and chatbots answer user questions directly, so there's less need for people to go to the original sources for answers now.

> In the case of Stack Overflow, chatbots and LLMs are answering coding questions, so fewer people visit Stack Overflow to ask the community for help. This means the coding website generates less new human content.

julka

@Switch Blayde

That's not responsive to my argument, though. It costs a lot of money to keep ChatGPT up and running, both in terms of compute infrastructure and energy costs. Right now all that money is being subsidized by venture capital. If you have to pay a dollar every time you submit a question to chatgpt, are you still going there to get an answer instead of stackoverflow? If your question requires four rounds of prompt engineering after your initial query to get what you're looking for and you get charged for four pieces of garbage before you get an answer you can use, is the tool still cost effective? When you pay real money and the tool gives you what is functionally a confident answer that there are two Rs in "STRAWBERRY", are you still feeling like society is undergoing a revolutionary change, or do you think that you used to be able to get garbage for free on the internet and didn't have to pay for the privilege?

Grey Wolf

@julka

Note that 'fair use' can apply in the case of commercial use, and AI training is almost certainly the most 'transformative' use possible of the original material. If AI training is not 'transformative', it is nearly impossible to imagine a use which would be 'transformative'.

For a fair use analysis, one factor is the percentage of the work used. That is literally impossible to determine for AI training. The closest approximation might be how much of the work could be regenerated from the resulting AI, but that's not really a reasonable test for a number of reasons.

Another (for commercial works) would be the value of the copied work in relation to the value of the new work. In this case, the value of any one source would be infinitesimal (even if one counted the entire 'New York Times' as a source).

Licensing seems good on its face, but as a practical matter, I'm not sure how it works. If AI training has infringed, the set of copyright-holders whose rights have been violated is very close to the set of 'entities which have published on the internet'. The mere mechanics of giving each of them so much as a penny would likely cost a ridiculous amount of money.

julka

@Grey Wolf

I'll grant you that the works were certainly transformed, but in this case it's more like they were transformed from "a collection of individual works" into "an oracle that can reproduce copyrighted texts word-by-word", which is maybe not what the concept of "fair use" is really looking for when it comes to transformation. It's absolutely possible to e.g. ask ChatGPT to give you the first word of Faulkner's "As I Lay Dying", and the second word, and the third word, and with enough prompt engineering you can just get it to spit out chunks of text from the book. Note that "As I Lay Dying" doesn't enter the public domain until next year, so that seems like a bit of a problem.

And yes, I agree that the mechanics of paying licensing fees to everybody who publishes works on the internet would be ridiculously expensive. It probably means that slurping up everything on the internet and using it in a commercial enterprise without permission is a bad idea! Maybe companies shouldn't do it.

Grey Wolf

@julka

> an oracle that can reproduce copyrighted texts word-by-word

This doesn't really play out in reality, though, your 'As I Lay Dying' examples notwithstanding. Sure, it can reproduce some random chunks. So can some readers who borrowed 'As I Lay Dying' from the library, memorized bits of it, and returned it. By itself, that's meaningless. I can reproduce large chunks of the script to 'Monty Python and the Holy Grail', but I am not guilty of infringing its copyright.

Information theory offers some guidance here. It is widely believed that the best achievable lossless compression of arbitrary English text is about 2.3 bits per character. For somewhat constrained sets, known encodings can get down to 1 bit per character (note, however, that this would preclude e.g. computer code). The database sizes of most known LLM models are 1/100th to 1/100,000th of their input size, so it's literally impossible for them to contain the vast majority of the input text. For instance, based on what's publicly reported, ChatGPT's input text was 45TB of compressed data (likely well over 180TB uncompressed) and its runtime database is under 600GB. That's a 300-fold reduction in size, well beyond the level at which the model could even theoretically contain the entirety of the average input work (and that ignores the fact that this isn't mere compression but a transformation of the nature of the information, which further reduces the effective 'compression' bandwidth).
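
To put rough numbers on that, here's a back-of-the-envelope check in Python. Every figure is one of the publicly reported estimates quoted above, so treat this as a sketch, not confirmed data:

GB_PER_TB = 1000

input_raw_gb = 180 * GB_PER_TB  # assumed uncompressed training text (~180TB)
model_gb = 600                  # reported runtime database size (~600GB)

reduction = input_raw_gb / model_gb
print(f"Model is {reduction:.0f}x smaller than its raw input text")  # ~300x

# Best-case lossless compression of arbitrary English text is believed to be
# ~2.3 bits/char versus 8 bits/char for plain ASCII, i.e. about 3.5x.
best_lossless = 8 / 2.3
print(f"Best-case lossless compression: {best_lossless:.1f}x")

# Storing all the input verbatim would mean beating that bound by ~86x.
print(f"Gap beyond the lossless bound: {reduction / best_lossless:.0f}x")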

Note that I'm not saying a court will declare that AI training is fair use. What I'm saying is that, based on the written definition of fair use and the history of cases related to it prior to the rise of AI training, AI training is entirely consistent with fair use, and it was and remains reasonable for those training AIs to believe it is fair use.

julka
Updated:

@Grey Wolf

> I can reproduce large chunks of the script to 'Monty Python and the Holy Grail', but I am not guilty of infringing its copyright.

If you offer a service in which you take money from people and on-demand produce copies of copyrighted works, you are absolutely infringing on copyright.

You're getting hung up on the idea that the text is "memorized", which a) is meaningless to a computer and b) is completely irrelevant to the issue. I can't memorize a book, then type it out and sell my typed copies for money, even though I memorized it; it's not my book. The memorization isn't the issue, the reproduction for money is.

awnlee jawking

@Grey Wolf

Do the characters in your stories pay for the books they use for the purposes of their own education or do they steal them, claiming 'fair use'?

AJ

Grey Wolf
Updated:

@awnlee jawking

They pay for their books, certainly, but that's somewhat of a red herring (in my opinion). The point is access, not content. No one would claim that two people sharing a textbook is a copyright violation (though, the way publication is going right now, I imagine some publishers are already looking for a way to do that). One buys a textbook to have access to it 24/7, not to acquire the legal right to read the content in it.

Two (or more) individuals living in the same house, all of whom have agreed to not mark up the textbook, might well share a copy of it. That is entirely legal and above-board, and always has been.

In the modern era, of course, publishers have realized that they can include a one-use access code with the textbook and gate homework assignments behind it (with an online turn-in scheme that effectively prohibits sharing). That precludes both same-time sharing and the used-book market. I consider that abusive and hope that professors will not select those books (even if the publishers give them incentives to use their book), but it's where we are currently.

Here's a reverse analogy: when you borrow a book from the library, do you force yourself to forget it when you return it? If not, you have 'copied' the book, have you not? You may well have 'copied' it far more effectively than most AIs have 'copied' it.

awnlee jawking

@Grey Wolf

> Here's a reverse analogy: when you borrow a book from the library, do you force yourself to forget it when you return it?

The library paid for the book and the author received royalties for it (unless the work was out of copyright?)

If an AI company paid for a book, they could train their product on it as many times as they wanted without any issues IMO.

AJ

Grey Wolf

@awnlee jawking

To the best of my knowledge, the AI companies used information freely available on the internet without paywalls. Thus, they paid for the content at the same rate anyone else would have (and presumably thus could train their product on it as many times as they wanted).

The issue isn't payment, it's copying. The author receiving royalties does not entitle one to copy the work. If AI training is 'copying', then the AI companies cannot train regardless of payment. If AI training is 'fair use', then the AI companies can train - but not regardless of payment (they cannot steal the work, then train on it).

Big Ed Magusson

@Grey Wolf

Yes... but some of the AI companies also fed published books that weren't free on the internet into their algorithms. That's what publishers are suing over.

awnlee jawking

@Switch Blayde

I believe at least one major player in the AI market has applied to the government for a waiver to have copyright set aside so they can use the texts for AI training.

Sorry, I can't track down the news story. I think it was from this year.

AJ

awnlee jawking

@Switch Blayde

The UK's new socialist government has been receptive to arguments from tech companies that copyright law should be waived for the purpose of training AI. The government's latest idea is that there should be an opt-out facility for those who don't want their work plagiarised.

AJ

Pixy

@awnlee jawking

> The government's latest idea

And they have a few of them....

DBActive
Updated:

@Switch Blayde

Theft is nothing new and that's what copyright infringement is.
People have no respect for other people's intellectual property. If it can be stolen, it will be stolen. It happens on this board all the time when people share links to works the author has deleted. And those same people are outraged when their works are stolen and appear on Amazon.

It may be a crime. https://www.justice.gov/archives/jm/criminal-resource-manual-1847-criminal-copyright-infringement-17-usc-506a-and-18-usc-2319

awnlee jawking

@DBActive

> Theft is nothing new and that's what copyright infringement is.
> People have no respect for other people's intellectual property. If it can be stolen, it will be stolen.

There's a difference between scumbags stealing other people's property and supposedly responsible companies stealing other people's property.

I'm not sure the scumbags don't occupy higher ground, morally.

AJ

jimq2

@Switch Blayde

Ever know someone with an eidetic memory? I knew a fellow at Princeton University who could read a text one day and recite it back a day or a week or a month later. The only time he didn't ace an exam was when he had to give original input.
