Summary
In this episode of 2030 Vision: AI and the Future of Law, hosts Bridget McCormack and Jen Leonard explore the latest advancements in artificial intelligence, with a focus on OpenAI's new o1 model, nicknamed "Strawberry." They discuss how the model changes the way AI systems reason, what that means for the legal industry, and how it compares to existing AI tools.
Along the way, they share real-world applications, including a look at o1's performance in fields like mathematics and coding. They also distinguish publicly available AI tools from open-source ones, examine the significance of Strawberry's reasoning capabilities, and consider its potential applications in the legal field. The conversation highlights why legal professionals should understand these advancements, along with the broader implications for other industries.
Key Takeaways
- Generative AI is becoming increasingly relevant in the legal field.
- Strawberry represents a significant advancement in AI reasoning capabilities.
- Publicly available tools differ from open-source tools in accessibility and usage.
- Trial judges showed more interest in generative AI than appellate judges did.
- The new model, Strawberry, outperforms earlier GPT models on complex problem-solving tasks.
- o1’s ability to reason through problems is a game changer for various fields.
- Lawyers should be aware of AI advancements and their implications.
- OpenAI's next language model, reportedly codenamed Orion, is expected to build on Strawberry's technology.
- Understanding AI's capabilities is crucial for legal professionals.
- Experimentation with AI tools is essential for discovering their potential.
Transcript
Jen Leonard: Hi, everybody, and welcome to the newest episode of our podcast, 2030 Vision: AI and the Future of Law. I'm Jen Leonard, founder of Creative Lawyers, and I'm thrilled as always to be joined by my wonderful co-host, Bridget McCormack, president and CEO of the American Arbitration Association. Hi Bridget, how are you?
Bridget McCormack: I'm well, good morning, Jen. Happy to be here. Excited to talk about Strawberry today.
Jen Leonard: Same, and I love the fruit! I'm very excited about the tech, and I'm excited that there's finally a generative AI model with a decent name that we can talk about in shorthand without being overly confusing. So we're going to talk today all about what Strawberry is and why lawyers should care about it. As always, we will start with our Gen AI moments of the week (or the past two weeks, since we last recorded). What is one thing that we've each done with generative AI that we found particularly noteworthy or magical?
And then we will sift through a couple of definitions of terms that might be useful for people to understand as we grapple with the AI era. Then we'll dive into “Strawberry”. So Bridget, I'm going to toss it to you to hear about your Gen AI moment of the week.
Gen AI Moments
Bridget McCormack: I'm happy to talk later about my own attempts to get the most out of Strawberry, which were not very magical—but I think that was a limitation on my part. I presented last week to the Ohio judiciary at their annual All Judges meeting, and I was there to give a presentation about innovation and technology for courts. But the conversation veered quickly into generative AI, and there were so many questions that that became what the presentation was really about. It was generative AI only, which is fine by me—as you know, I love talking about it. And the audience was judges from every court: trial court judges at the district court level and circuit court level, and appellate judges as well. Even the Chief Justice was in the room (I didn't actually meet her, but I'm told she was there).
I would say that appellate judges were pretty skeptical that the technology could be useful to them and their clerks—although I think Adam Unikowsky's work, which we've talked about before, is pretty persuasive that it already can be, along with many other tools and use cases. But the trial judges were the ones who had their hands up really quickly, and they basically wanted me, right away, to give them tools to help with their very busy dockets full of litigants who appear without lawyers—both on the court's end, in better handling those busy, long dockets, and even more so for the litigants themselves. They clearly feel that they're just not able to give those litigants the kind of information and help they might need to understand what's happening and navigate their busy courtrooms.
So I will say there were not very many judges in the room who were experimenting with the tech yet, but the interest level was so high from the trial judges in the courts with the busiest dockets. It was quite notable, and it motivated me to make sure that every group I'm involved with—the ABA task force and the National Center group—is thinking about that.
Jen Leonard: That is really exciting. Were you surprised at the difference between appellate court judges and trial court judges?
Bridget McCormack: I was, because as you know, the tools as we know them are so useful for people whose currency is words—and for appellate judges, even more than trial judges, the currency is words. So I had the opposite expectation of what I saw. I sort of thought the appellate judges would be most excited, and the trial judges would think, "I don't write opinions. Do you know how many cases I'm processing every day? I don't have time to write opinions." But it was really the opposite. So it was super interesting.
Jen Leonard: Yeah, I'm surprised too. I would think that the complexities of the different moving parts in a trial court—including all the evidence, the self-represented litigants, the preservation of the record, all of those things that go into creating the puzzle of a trial—would be a little bit of a disincentive. But I guess the pressure and the excitement around trying to solve these problems that they've been seeing for so long must overcome those complexities.
Bridget McCormack: I think that's right. And it's a reminder that trial court judges oversee trials. Most cases don't go to trial, right? They're resolved in some other way, and many are resolved by people who don't have the help of lawyers. So I don't know. It'd be a fun space to keep an eye on. Maybe we can talk about it a little more in the future. How about you? Did you have a fun Gen AI moment this week?
Jen Leonard: So after yours, this is really the sublime to the mundane. This was just one of those very delightful little things that appeared in my Google Workspace that I wasn't expecting. And I don't actually know whether it's generative AI or traditional AI, but I write a newsletter every month, and I like to pull in different sources that I think are particularly interesting. The way I'd been doing it to date was writing the text, highlighting it, selecting the link option, and dropping the URL in. It's very manual and takes a lot of time.
This month, when I went in to write it, I dropped a URL in and then I highlighted the URL, and Google prompted me—I think it said something like "with Magic AI"—and it asked, "Would you like to replace this with the title of the article?" And it already had the title ready to go. It was one of those moments where I thought, "Yes, that is exactly what I want to do!"
It sped up writing the newsletter by a good five or six minutes, because I didn't have to do all of that manual work in the background. I thought that was really cool. I think it's also a hint of what's going to happen, right? These things are going to be infused into the tools we use in a really user-centric way, developing features that anticipate how we're going to work and then doing it for us.
Bridget McCormack: Yeah, that's amazing. Honestly, I actually have to re-teach myself how to do that every single time I put a hyperlink in anything I'm writing. So that's incredible. I'm eventually going to get the new iPhone, and I'm excited for the ways in which it's just going to infuse into our day-to-day lives. And I do think that's how we get widespread adoption—when it just inserts itself into your life and makes something easier. So that's very cool.
Jen Leonard: I think—and hope—that this era where we're trying to sort through all these different models and figure out their capabilities will not last forever. Over time, more and more will be seamlessly infused into what we do, and there's going to be a lot less friction on the front end. That was just one early example of that.
Another cool thing I saw (I didn't experience it myself, but I saw it on LinkedIn) was a demo by Ethan Mollick. I've seen others as well—I know Josh Kubicki did a similar demo at the LVNX conference last week of Google's NotebookLM tool, which I understand to be a place where you can create repositories for your thoughts, your research, and things you've written. In Ethan's case, I think he uploaded his entire book and a few articles, and the tool created a podcast with two hosts that sound sort of human—like you and me—talking back and forth about the book and some of its takeaways. He said it cited the book accurately. I'm sure there are all sorts of applications for this, but it was both very cool and a little bit creepy how fast the tech is moving. I would encourage people who are skeptical to take a look—maybe just Google "NotebookLM demo" and check it out. Have you seen this?
Bridget McCormack: I have. I saw a number of examples of it last week, and I was walking into my office and Jason Cabrera, who is now serving in our AI enablement role at the AAA, was excitedly telling me about it. You know, we put a lot of information and classes up online for people who are new to ADR processes, and he was so excited about the ways in which we can put that content out in new forms just by loading it into this tool. I haven't tried it yet, but I will soon.
Jen Leonard: Same. I haven't tried it either, but I'm planning to. And I think it runs on Google's Gemini 1.5 technology. So that was very cool.
Definitions: Publicly Available vs. Open Source
Jen Leonard: OK, let's define a couple of terms that we hear frequently in the Gen AI landscape. This week we've picked two terms that often become conflated or are used interchangeably, but they're actually quite different. The terms we picked are "publicly available generative AI tools" and "open source generative AI tools." So, do you want to start us off, Bridget, by explaining what publicly available Gen AI tools are?
Bridget McCormack: Yeah, I will, and then I'll turn it over to you for open source. Publicly available generative AI tools are tools accessible to the general public, either through web interfaces or APIs, without the underlying model's architecture or source code being disclosed. These are the tools you can use—ChatGPT, Google's Gemini, Claude—offered either in a free version (as we've talked about many times) or through a subscription. But the companies keep the source code and model details to themselves (those aspects are proprietary), and they may also place usage restrictions on some of the more advanced features.
Some examples of other publicly available tools are DALL·E and Midjourney for image creation. However, open source means something different. How is open source different from publicly available?
Jen Leonard: Open source is different from publicly available in that the actual underlying code and the model architecture are shared outside of the company that created the technology. The source code becomes freely available—you can view it and even change it. (Now, I would have no idea how to do any of that myself, but if you're a developer, you can dig into the underlying code, make modifications, and find new applications, and then distribute what you've created to others.) I understand, as I'm getting to know the developer community better, that this is how models have historically advanced—through open-source collaboration, where developers build on one another's work. It's very collaborative and community-oriented. You can also run the code on your own local infrastructure, if you have that capability.
You can download it to your local device and play around with it to figure out new applications. We generally don't see people outside of engineers and technologists using these versions of the models in practice; the rest of us use the publicly available tools. Examples on the open-source side include the many models hosted by Hugging Face (a tech company) and frameworks like Google's TensorFlow. And I know Meta, under Mark Zuckerberg's leadership, has taken the approach of open-sourcing its models—it pretty much stands alone among the major tech companies in doing that. There are, of course, controversies around releasing open-source code because of the power of the technology: the companies that keep their code closed say they do so to roll the technology out safely and responsibly (and to keep it out of the hands of bad actors, which is a risk with open source).
But open source is also the way tech communities generally develop software and new innovations. So there's a bit of friction right now in Silicon Valley about whether we should be open-sourcing Gen AI. Is that consistent with how you understand it?
Bridget McCormack: It is. I've tried a couple of times to dig into exactly what the arguments are on both sides of the open source debate. It seems there are some fundamental, philosophical commitments at stake, but there are also commercial, competitive elements to that debate. I haven't fully processed it all. But yeah, that's exactly how I understand it too.
Jen Leonard: And there's a bit of discussion of this in Genius Makers (which we used with our class as a primer on AI history). People like Geoffrey Hinton—often called the "godfather of AI," and deeply concerned about some of its ramifications—debate in that book whether open-sourcing the code is a smart idea. But remember, all these companies are competing, too, so there are commercial reasons why they may not want to open source. If you hear "open source," it means something very different from "publicly available," but the nomenclature is confusing, and understandably people use the terms interchangeably.
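To make the distinction concrete, here is a minimal sketch of what "open source" enables in practice: anyone can download an openly released model's weights and run them on their own machine, with no proprietary API in the loop. (The library and model named here are illustrative examples of openly available options, not tools mentioned in the episode.)

```python
# Minimal sketch: running an openly released model locally.
# Assumes `pip install transformers torch`; GPT-2 is one example of a
# model whose weights are openly downloadable.
from transformers import pipeline

# The first run downloads the weights to your machine; after that,
# generation happens entirely on local hardware.
generator = pipeline("text-generation", model="gpt2")
result = generator("Open-source models let you", max_new_tokens=20)
print(result[0]["generated_text"])
```

By contrast, a publicly available but proprietary tool only exposes an interface: you send a prompt to the company's servers and get text back, while the weights and code stay behind the API.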
Main Topic: The Reasoning Power of OpenAI's o1 Strawberry
Jen Leonard: So with that, we're going to jump into our single topic of the day. We don't usually do a single tech-oriented topic—we try to take a bird’s-eye view of how tech impacts the profession. But we thought there was an important development this week in the technology itself. Bridget, do you want to set us up for our Strawberry topic?
Bridget McCormack: Absolutely. Yeah, so OpenAI released a new model, which is actually—confusingly—called o1, though I prefer "Strawberry," which seems to be what they were calling it during development and what was hinted at by OpenAI executives (including Sam Altman) over the last few months. And it's not just a new model the way each previous model has been released. It is a new model, but it's really different—which I understand is why they restarted the naming and called it "o1." But why don't you give us an overview of what o1 is and does, and maybe what's different about it?
Jen Leonard: Sure, I'll do my best to describe what's different about it. I will say, with respect to the name, my understanding is that it's called "Strawberry" because in the first era of models, if you asked them how many R's are in the word strawberry, you would get an incorrect answer. There are very few places on the internet, in the training data, where the number of R's in "strawberry" is explicitly stated.
So there's no ability to reason through that in those earlier models—you'd just get the fill-in-the-blank architecture that we're familiar with from the GPT models to date. So "Strawberry" became the nickname of this new version because the model itself is focused more on reasoning, rather than simply trying to predict the next word based on the most likely token to follow.
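A quick illustration of the gap: counting letters is trivial as ordinary computation, while a pure next-token predictor has no explicit counting step at all. A one-line check in Python (purely illustrative) settles the question deterministically:

```python
# Counting characters is deterministic computation, not prediction.
word = "strawberry"
print(f"'{word}' has {word.count('r')} r's")  # str.count returns 3
```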
They obviously could market these better and label them in ways we could understand. But as you said, Bridget, o1 has become the name because it's a new class of models. Before now, the GPTs were all trained on vast amounts of internet text, books, writings—predicting the next most likely token or word in a sequence.
This model, on the other hand, really digs into the reasoning process itself. (I know we're using words like "reasoning" and "thinking" and anthropomorphizing the tech, but it's the only way to describe it in an understandable way.) So rather than just focusing on the output and guessing, it tries to supervise the process of reasoning out the answer to a question. It's sort of the difference between grading a fourth grader's math homework only by checking if they got the right answer, versus being able to look inside their brain and watch every single step as they're working through an equation—making adjustments at different points so they can get to the right answer. The difference here is in the scale, scope, and comprehensiveness of what the technology is analyzing and how it's strengthening the reasoning process.
The intended or expected applications for o1 are really highly complex problems in specialized fields. Right now, GPT-4o—which we've been using as the most cutting-edge model from OpenAI—remains, to my mind, the easy solution for most tasks we do in a word-based world. But this new technology is a game changer. It will advance and accelerate all of the tech we've been following for a while. We thought it was important to spend a bit of time helping lawyers understand the distinction and what we think the long-term applications and implications will be. How are you understanding it, in a way that might be clearer than how I just described?
Bridget McCormack: Your analogy of grading the math homework versus seeing the steps the fourth grader is taking is a great one. I have a couple of others. First of all, I should note: this model does count the R's in strawberry correctly, whereas GPT-4 does not—so clearly it's doing something different. I'm going to borrow an explanation from Kevin Roose and Casey Newton on the Hard Fork podcast, because they put it in a way that helped me understand why it's different and important. I'll try to explain it the way it made sense to me: the previous models were all trained on massive amounts of data, and their abilities scaled with how much data they were trained on.
You know, the general rule of scaling was that the more data they were trained on—the more they learned—the better they were at producing outputs. That's why GPT-4 is better than the original ChatGPT: each model was trained on more data and it performs better. But this new model, o1 (or "Strawberry"), is trained not on more data, but rather on the ability to reason—to sort of think through a question or a problem step-by-step.
It can sort of work both forwards and backwards, and take its time thinking through those steps, even showing its work. It doesn't show every single step, but it does show how it's thinking through the answer it's about to give you. What's significant about that is that it's an entirely new way to scale these models.
Which is not to say the first scaling methodology (more data) is done—there's always more data to train on. (We haven't even gotten into synthetic data, which we can discuss eventually.) But now the models can also scale in a second way: by getting better and better at reasoning and carefully thinking through a problem. It's an entirely new axis along which to grow and scale the models, and that helps me understand how significant this is. I admit I haven't had a lot of physics or advanced math needs for Strawberry yet, but I am pretty taken aback by what it's going to mean for how quickly these models will impact so many important fields. I think you've probably followed areas where people are already seeing a difference. I don't think lawyers are yet the folks getting the most out of Strawberry. But what's your sense of where this model is already having an impact, and where it's likely to have the biggest impact?
Jen Leonard: Well, as you noted, Bridget, on the benchmarking tests the model is performing at much more impressive levels than the GPT class so far in mathematics and science. Just a couple of data points: on a qualifying exam for the International Mathematics Olympiad, the new o1 "Strawberry" model scored 83%, compared to GPT-4o's 13%. So, enormous gains.
It reminds me a bit of when we went from GPT-3.5 to GPT-4 and saw big jumps in performance on the LSAT and the UBE bar exam—the results were dramatically better. Similarly, in coding proficiency, o1 reached the 89th percentile in Codeforces competitions. In scientific reasoning, the next update of o1 is reportedly performing similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. I've heard lots of examples—for instance, on the Hard Fork podcast, some PhDs said it took them a few years to figure out problems that o1 can work through in maybe an hour. So the compression of the time in which it can reason through these complex problems is really mind-blowing.
And like you said, in our language-based profession, unless you're in a very, very technical area or you have specialized training, you're probably not going to be able to tell the difference between o1 and GPT-4o. In fact, you might even think o1 is inferior at first, because it relies on slower, System 2-style thinking—it works through every step in the reasoning process, so it takes longer to produce outputs. I also found it not to be as eloquent as Claude in the way it handles language. But it will be here in our profession soon.
We can discuss that more, but I'd really like to focus on this new back-and-forth reasoning process—not just moving in one direction, but iterating and going back to adjust earlier parts of the reasoning. We're following our favorite, Ethan Mollick, who wrote a great Substack post with an example that helps contextualize it. I think you followed his example too, Bridget. Would you be willing to share what he found?
Bridget McCormack: As always, Ethan was able to translate for lay people like me why this is a significant step for these models. He had Strawberry work on a crossword puzzle—and it was a pretty hard crossword puzzle. (I'm not a crossword aficionado, so maybe it wouldn't be as hard for some of you, but for me, it looked pretty hard.) The previous models would often guess the first clue, and if that guess wasn't right, then of course they couldn't complete the puzzle.
With Strawberry, though, it could reconsider. It might fill in an answer, then realize something was off and go back to revise. In other words, it can exhibit that System 2-style slow thinking to solve the puzzle. And that's another way I've been thinking about it—it's like Daniel Kahneman's slow thinking, but done by the model. (Sorry for talking about the models like they're humans, but that is one of the weird things about them.) That example really helped me understand the new capabilities. Do you think I got it about right? Is there more you would add?
Jen Leonard: No, totally. And the reason the outputs are slower, as you said, is because it's deploying System 2 thinking. I think the long-term idea is to blend different models. We (the users) wouldn't necessarily be the ones doing the blending, but the companies would—so that we can have really quick System 1 answers for things that aren't complicated (like an internet search, for example), and then give the AI more time to think through the System 2-type problems that are really thorny.
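One way to picture that blending is as a simple router: quick questions go to a fast model, thorny ones to a slower reasoning model. The sketch below is a hypothetical illustration, not anything OpenAI has described. The client calls follow OpenAI's published Python SDK, but the model names are examples and the complexity heuristic is invented for demonstration.

```python
# Hypothetical sketch: route easy queries to a fast model and hard ones
# to a slower reasoning model. Assumes `pip install openai` and an
# OPENAI_API_KEY in the environment; the heuristic below is a toy.
from openai import OpenAI

client = OpenAI()

HARD_HINTS = ("prove", "step by step", "optimize", "derive", "puzzle")

def looks_hard(question: str) -> bool:
    """Crude stand-in for real complexity detection."""
    q = question.lower()
    return len(q) > 400 or any(hint in q for hint in HARD_HINTS)

def answer(question: str) -> str:
    # System 1: quick, cheap model for routine questions.
    # System 2: slower reasoning model for thorny ones.
    model = "o1-preview" if looks_hard(question) else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(answer("How many r's are in 'strawberry'?"))
```

As Jen notes, the expectation is that the companies, not users, would eventually do this blending behind the scenes; the sketch just makes the routing idea concrete.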
Bridget McCormack: I was trying to come up with questions complicated enough to test Strawberry's new capabilities, but it turns out I'm just not smart enough to stump it! It kept answering me so quickly that I couldn't trip it up. But I will say, we have an engineer at the AAA building things with these generative AI tools, and he showed up the Monday after Strawberry was first released so excited about everything it was already able to help him do quickly.
Obviously, we're going to see some tremendous impacts. I'm not sure we'll see them for lawyers immediately. But why do you think lawyers should care about this new model?
Jen Leonard: I think lawyers should care because—even though we're not going to see Strawberry integrated into legal work this week—lawyers need to know what's coming. Many lawyers are still trying to wrap their arms around the first class of GPT models, and there's a lot of skepticism (though I think it's waning a bit). Lawyers are great critical thinkers—it's one of our skills—but there's a tendency to focus on the current limitations of the models, or to dismiss the technology as "just guessing the next word," instead of recognizing that it might be processing information in a more human-like way. More lawyers should be aware that this technology exists and is already having profound impacts in science and math. And rumor has it that OpenAI will be releasing its next language model, which I believe is codenamed Orion, and that model will be built on Strawberry-like technology.
So I think the implications for the legal field are enormous. I can't even wrap my head around what it will look like once Strawberry-type reasoning is applied to language tasks in our domain. But lawyers should be aware that while we're pointing out the limitations in the generative AI applications we've seen so far, the world around us has already changed—we just haven't felt it yet. What do you think?
Bridget McCormack: I think that's exactly right. And I do think there are some lawyers who will feel its impact sooner rather than later. For example, in-house teams that work closely with their business-side partners or clients will probably have to roll up their sleeves and experiment with it sooner rather than later. They’re not going to wait for Orion to be integrated into the tools they're comfortable with. I suspect legal ops professionals will find use cases for it pretty quickly, too.
Honestly, any law firm, in-house team, or legal organization that’s already thinking about how this technology will change what they do (and how they do it) should assign a couple of people on their team to experiment with it and figure out how it might make a difference for their operations and clients. Just as we've said with every other model: the best way to figure out how it's going to make a difference for you and your practice is to start using it. Ethan Mollick would say you can't just use it for 15 minutes—you have to use it for at least 10 hours to really see the impact. Do you have other thoughts or advice for lawyers thinking about this new model right now?
Jen Leonard: I'm going to go back to your story about the Ohio judiciary and the difference between appellate and trial courts. I think we've touched on this before, but you're exactly right that corporate law departments will feel this sooner than law firms, because those departments will be influenced by the organizations they're part of. Law firms are one step removed from that pressure. So yes, law firms should be thinking proactively about this, because their clients will be dealing with it before they will. Getting their arms around it—at least understanding that it's out there—will be useful.
In terms of what lawyers should be doing right now, my view is that most lawyers probably don't need to do much with Strawberry yet, aside from being aware that it exists. It's more about looking forward and planning strategically, rather than fixating on current model limitations and flaws. Bridget, what do you think lawyers should be doing right now about Strawberry?
Bridget McCormack: I agree. I think lawyers (and legal educators) should start thinking about how, when this type of reasoning training is combined with the models lawyers are already using, it could change things. If you have people on your team with a bit more technical expertise or a science background—like those folks who went to law school after considering a PhD in chemistry or math (I had many of them as students at Michigan and they were amazing)—pull those people in. They might not have been very impressed with these models so far, so get them excited about helping figure out what might work.
Our daughter's doing her PhD in economics (which is basically math these days) and her boyfriend is doing his PhD in math, and they've been pretty unimpressed with generative AI over the last year and a half. They always said, "Yeah, we keep trying it, but it doesn't really help us very much."
That's because the work they do is so complicated and technical. The models that have blown us away can't help them much—yet. And I think this is going to be the game changer, right? This is going to allow their fields to make incredible advances pretty quickly. The Hard Fork guys said (I forget who it was they've been interviewing repeatedly—some AI professor) that GPT-4 was basically a crummy grad student. I don't think they actually said "crummy"—I think that was my word—but essentially, a subpar grad student. And this new model is already a mediocre grad student. And it's the first of its kind! So imagine the next version, right? I don't know... my brain, like yours, probably can't fully comprehend what that's going to mean. But it's clearly significant—a significant change in every industry. And any significant change in an industry will impact the lawyers who help those industries navigate change.
Jen Leonard: Well, we hope you've enjoyed indulging us in a bit of a deep dive on this new iteration of the technology. I'm really, really curious to see—across all professions, as you said, Bridget—the ability to solve problems, to develop cures for diseases, to create breakthrough technologies to fight climate change... I mean, I've been excited to see some of the scientific and mathematical applications (even if I don't understand them). It has succeeded in making me feel much dumber than I usually feel, which is probably a good thing! But just testing it out and realizing I have nothing challenging enough to stump it—while knowing that people smarter than I am (like your daughter, her boyfriend, and others in the scientific community) are already seeing promise—is very, very cool and also a little unsettling.
Bridget McCormack: Yep, I agree. More cool than unsettling—but yes, a bit of both.
Jen Leonard: Well, I think we can leave it there for now, until our next episode. Certainly when Orion comes out, we'll have a lot to talk about regarding Strawberry’s application in language and in law. Until then, we will continue to explore all of the evolving technology, business models, and educational shifts that we hope generative AI will bring to our profession in the years ahead.