Professor Mark A. Lemley on Generative AI and the Law

ODSC - Open Data Science
5 min read · Sep 4, 2023

As new fields emerge within data science and the research remains hard to grasp, it's often best to talk to the experts and pioneers of the field. Recently, we spoke with Mark A. Lemley, William H. Neukom Professor of Law at Stanford Law School and Director of the Stanford Program in Law, Science and Technology. In the interview, we discussed the legal concerns surrounding generative AI, including challenges related to governance, copyright, and data privacy. You can listen to the full Lightning Interview here, and read the transcript of two interesting questions with Mark A. Lemley below.

Q: What are a few new legal problems to come out of generative AI?

Mark A. Lemley: The legal issues fit into three large buckets. One: Is it legal to train my AI using copyrighted works? Two: What do we do about the output of those works? Three: Who might own the output? On the training issue, that's where we're starting to see the most litigation right now. Generative AI is new, but the problem isn't new. We've got a number of cases in which prior technology companies grabbed the entirety of a copyrighted work in order to generate something from it. It's how search engines work, for instance, and there are a bunch of lawsuits about whether a search engine going through and creating a temporary database of something to create a search link is illegal. The courts say it isn't.

Google's book search is an even more interesting example, because Google didn't just pull things down from the internet; it actually scanned physical copies of all the books in the Stanford Library, and the court said that was fair use. In both cases, while you're making a copy of the whole thing, you're making that copy only behind the scenes in your database, you're not sharing it with the world, and you're using it to do something transformative and productive — something you couldn't have done without making that temporary intermediate copy.

Q: Tell us a bit about the memorization problem and the issues there.

The memorization problem arises when AI-generated content closely resembles an existing copyrighted work.

Mark A. Lemley: Lately, we've been talking about training on copyrighted works, and that's being litigated right now, but I think the answer will be yes; the courts will say that it's fair use. It'll be treated like training a search engine. The much harder problem is if the output of the AI looks substantially similar to a particular copyrighted work.

I've done some work on this problem with a number of the computer science folks at Stanford in Percy Liang's group, and one of the things we've noted is that it actually doesn't happen that often. When it does happen, it's usually for one of three reasons. One is a failure of deduplication. It's almost never the case that the AI is actually memorizing a particular work; what it is doing is looking at the several hundred closest works, and if those several hundred closest works are identical copies of the same photograph, it may generate a composite that looks very much like that photograph, even though it's actually drawing from a bunch of different training examples. We just didn't do a very good job of eliminating duplicates, because it turns out that, technically, that's much harder than it sounds.
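
To make that last point concrete, here is a minimal Python sketch of why deduplication is harder than it sounds. Exact-hash deduplication, shown below, catches verbatim copies, but near-duplicates (crops, re-encodes, light edits) hash to different values and slip through; catching those requires fuzzier techniques such as MinHash or embedding similarity. The corpus and function names here are illustrative, not taken from Lemley's paper.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially reformatted
    # copies hash to the same value.
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_dedupe(corpus: list[str]) -> list[str]:
    # Exact-match deduplication: cheap and simple, but it only
    # removes verbatim repeats, not near-duplicates.
    seen, unique = set(), []
    for doc in corpus:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "The cat sat on the mat.",
    "the cat  sat on  the mat.",  # trivial variant: caught by normalize()
    "The cat sat on a mat.",      # near-duplicate: slips through
]
print(exact_dedupe(corpus))  # keeps the first and third entries
```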

The second way it can happen is with a very specific prompt; you can direct ChatGPT, for instance, toward creating a very similar work. So in our paper, we asked ChatGPT to give us a children's story about wizard kids who go to a wizarding school. It doesn't give us Harry Potter or anything much like Harry Potter, but if you give it a story that begins with the first paragraph of the first Harry Potter book, it pretty faithfully spits out the next few chapters with only a few changes.
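
One simple way to score such a memorization probe, sketched below under our own assumptions (the interview doesn't describe the paper's code), is to measure the longest run of text a model's continuation shares verbatim with the original. The snippet assumes the generated continuation is already in hand and uses placeholder strings rather than actual copyrighted text.

```python
from difflib import SequenceMatcher

def longest_verbatim_run(source: str, generated: str) -> str:
    # The longest contiguous substring the generation shares with
    # the source text: a crude but useful proxy for memorization.
    m = SequenceMatcher(None, source, generated).find_longest_match(
        0, len(source), 0, len(generated)
    )
    return source[m.a : m.a + m.size]

# Placeholder strings stand in for the copyrighted original and
# the model's continuation.
source = "Once upon a time, a young wizard boarded a scarlet train."
generated = "Once upon a time, a young wizard boarded a red train."

run = longest_verbatim_run(source, generated)
print(f"{len(run)} chars copied verbatim: {run!r}")
```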

And then the third category, I think — and this is in some sense the hardest one for copyright law to think about — is that the image engines come up with concepts. They abstract away from particular examples and figure out, okay, this is what a cup of coffee looks like, this is what a cat looks like. So if you ask one to generate a cat drinking a cup of coffee, it has those concepts and can generate one. But I think there are some things that are both concepts in the AI's sense and also copyrighted. Think about Baby Yoda or Snoopy. You can get a pretty good Baby Yoda out of Stable Diffusion, and it's not because it's memorized a particular image, but because it's seen enough pictures of Baby Yoda that it basically understands Baby Yoda as a concept.
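
For readers who want to see concept composition in action, here is a minimal sketch using the open-source Hugging Face diffusers library; the checkpoint and prompt are our choices for illustration, not anything specified in the interview.

```python
# A minimal text-to-image sketch with Hugging Face diffusers.
# Assumes a CUDA GPU and the diffusers/torch packages installed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The model has likely never seen this exact scene; it composes the
# learned concepts "cat" and "cup of coffee" on demand.
image = pipe("a cat drinking a cup of coffee").images[0]
image.save("cat_coffee.png")
```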

How to learn more about large language models, generative AI, and AI ethics

If you haven't already gotten started with large language models or generative AI, or you want to further your existing expertise, then ODSC West is the conference for you. This October 30th to November 2nd, you can check out dozens of sessions related to NLP, large language models, and more. Here are a few confirmed sessions, with plenty more to come:

  • Personalizing LLMs with a Feature Store
  • Evaluation Techniques for Large Language Models
  • Understanding the Landscape of Large Models
  • Democratizing Fine-tuning of Open-Source Large Models with Joint Systems Optimization
  • Building LLM-powered Knowledge Workers over Your Data with LlamaIndex
  • General and Efficient Self-supervised Learning with data2vec
  • Towards Explainable and Language-Agnostic LLMs
  • Fine-tuning LLMs on Slack Messages
  • Aligning Open-source LLMs Using Reinforcement Learning from Feedback
  • Generative AI, Autonomous AI Agents, and AGI — How new Advancements in AI will Improve the Products we Build
  • Implementing Gen AI in Practice
  • Beyond Demos and Prototypes: How to Build Production-Ready Applications Using Open-Source LLMs
  • Adopting Language Models Requires Risk Management — This is How
  • Scope of LLMs and GPT Models in the Security Domain
  • Prompt Optimization with GPT-4 and Langchain
  • Building Generative AI Applications: An LLM Case Study
  • Graphs: The Next Frontier of GenAI Explainability
  • Automating Business Processes Using LangChain
  • Stable Diffusion: A New Frontier for Text-to-Image Paradigm
  • Connecting Large Language Models — Common Pitfalls & Challenges
  • Attribution and Moral Rights in Generative AI
  • A background to LLMs and intro to PaLM 2: A smaller, faster and more capable LLM
  • Integrating Language Models for Automating Feature Engineering Ideation
  • The Devil in the Details: How defining an NLP task can undermine or catalyze its successful implementation

Don’t delay getting your ticket! 50% off ends soon! Register here.

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Interested in attending an ODSC event? Learn more about our upcoming events here.

