I generated pelicans riding bicycles on both thinking level low and thinking lev...

keyle · 2026-05-29T00:11:57 1780013517

It's pretty safe to say that AI will be used on the battlefield making real life and death decisions before it will be able to render a decent pelican on a bike in SVG.

culi · 2026-05-29T02:50:38 1780023038

It already has been and this has been widely written about. AI was used to identify and prioritize targets for the US to bomb in Iran.

Here's an article from 2 months ago for example: https://www.theguardian.com/technology/commentisfree/2026/ma...

It was also implicated in the bombing of a girls' elementary school which left 168 dead. The US did a "triple tap" to kill any first responders.

https://www.theguardian.com/news/2026/mar/26/ai-got-the-blam...

https://www.theguardian.com/technology/2026/apr/01/dont-blam...

dmix · 2026-05-29T03:38:11 1780025891

I read the article and it doesn’t say it was used for targeting or prioritizing?

> Neither Claude nor any other LLMs detects targets, processes radar, fuses sensor data or pairs weapons to targets. LLMs are late additions to Palantir’s ecosystem. In late 2024, years after the core system was operational, Palantir added an LLM layer – this is where Claude sits – that lets analysts search and summarise intelligence reports in plain English

There are a lot of humans in that loop who make those decisions.

saint_yossarian · 2026-05-29T06:56:29 1780037789

Yeah militaries don't use commercial chatbots for that, they have their own machine learning implementations. Look into Project Maven for example.

And while there are still humans in the loop, the impression I get is that this is increasingly becoming meaningless, from the way they talk about optimizing the "kill chain" and letting small teams make hundreds of targeting decisions per hour.

an0malous · 2026-05-29T11:23:28 1780053808

“US Military Using Claude to Select Targets in Iran Strikes”

https://futurism.com/artificial-intelligence/claude-anthropi...

culi · 2026-06-04T00:08:10 1780531690

Palantir's Maven is not an LLM. Maven uses Claude.

culi · 2026-05-29T04:25:15 1780028715

First link says

> AI is ‘identifying and prioritising targets, recommending weaponry and evaluating legal grounds for a strike’.

simonw · 2026-05-29T06:06:34 1780034794

It doesn't specify which "AI" though.

These days that pretty much means "somebody used a computer".

culi · 2026-05-30T01:39:14 1780105154

The first link is a reader letter to a piece they published. The original piece is the second link in my comment. It has more information

https://www.theguardian.com/technology/commentisfree/2026/ma...

> The paradigm shift has already begun. Despite the row, Anthropic’s Claude has reportedly facilitated the massive and intensifying offensive which has already killed an estimated thousand-plus civilians in Iran. This is an era of bombing “quicker than the speed of thought”, experts told the Guardian this week, with AI identifying and prioritising targets, recommending weaponry and evaluating legal grounds for a strike.

See also: https://www.theguardian.com/technology/2026/mar/03/iran-war-...

an0malous · 2026-05-29T11:12:03 1780053123

“US Military Using Claude to Select Targets in Iran Strikes”

https://futurism.com/artificial-intelligence/claude-anthropi...

It cites the WSJ but that article is paywalled so I shared this one

simonw · 2026-05-29T12:55:14 1780059314

This later story suggested it was Palantir's Maven, not Anthropic's Claude: https://www.theguardian.com/news/2026/mar/26/ai-got-the-blam...

culi · 2026-05-30T01:40:20 1780105220

Maven is not an LLM. Maven is software that uses LLMs, most notably Claude.

Kiro · 2026-05-29T09:49:55 1780048195

I think it's beyond decent. I don't understand how people are not more impressed by this. Just a few years ago the only expectation would be garbled nonsense.

notatoad · 2026-05-29T01:41:01 1780018861

the battlefield sounds much easier. worst case scenario you kill somebody, but that's what you're trying to do anyways.

if you kill somebody while trying to render a pelican on a bicycle it's a real problem.

ares623 · 2026-05-29T05:07:43 1780031263

"shift left" on the battlefield. break down those silos. if you have to ask for permission it's already too late. remember the goal. find the bottlenecks in your system and remove them.

pwagland · 2026-05-29T13:52:25 1780062745

In many battlefield scenarios, there is more than one "somebody" on it. The "somebody" that you kill might not be the "somebody" that you intended to kill.

Depending on how the pelicans are created, it is entirely possible to indirectly kill "somebody" due to the externalised costs of global warming etc.

Markstar · 2026-05-29T08:32:34 1780043554

Haha, yeah. I tried to get it to create an SVG with scissors and it was hopelessly overwhelmed. I think at least the SVG design niche will be safe a little while longer.

ares623 · 2026-05-29T03:29:10 1780025350

I think that's a fair tradeoff. There's no way I'm going back to writing code by hand again. No one deserves that.

keyle · 2026-05-29T04:12:10 1780027930

Heh? How long were you writing "code by hand" before?

ares623 · 2026-05-29T05:01:25 1780030885

Years and years. It was horrible. No number of misidentified targets will make me go back.

keyle · 2026-05-29T06:00:46 1780034446

It doesn't sound like you're in the industry you want to be in.

hombre_fatal · 2026-05-29T11:20:58 1780053658

Maybe all along what mattered most to them was making good software that people love, not the day to day part of writing code. Now it’s the industry they’ve always wanted, and less the industry of people who wanted to get paid to write code.

Software engineers who never cared about the higher level product design aspect are finding themselves in the wrong industry. It’s dismal.

GistNoesis · 2026-05-28T18:12:00 1779991920

> the bicycle frame is the correct shape

No, the handlebar is wrong. The handlebar is rotating the frame instead of rotating the front wheel. The handlebar should be mounted on the same line as the front wheel.

Hopefully 4.9 will read my comments :)

loeg · 2026-05-28T18:36:02 1779993362

Could be an extremely high angle stem that just happens to match the downtube angle.

Venkatesh10 · 2026-05-28T22:38:46 1780007926

Maybe the pelican is just riding a road bike/gravel bike

eminence32 · 2026-05-28T20:30:23 1780000223

I bet someone shares this link every time you post about bicycles, but since I didn't see anyone share it yet in this thread, I'll take the opportunity to do so:

https://www.gianlucagimini.it/portfolio-item/velocipedia/

Turns out even humans can be pretty bad at drawing bicycles :)

walthamstow · 2026-05-28T21:31:37 1780003897

On a new model release, you can guarantee two things are in the replies to Simon. One is your link, the other is "surely the models are being trained on this now"

saghm · 2026-05-28T23:44:01 1780011841

Sure, but no one is trying to force art from most people into about every area in the economy where anyone ever pays for something visual. If you asked professional artists to draw a realistic bicycle, I'm guessing few of them would try to just randomly guess what the mechanical parts looked like

skydhash · 2026-05-28T20:55:59 1780001759

But if you need to draw a bicycle, you wouldn’t pick a random person in the street. You would hire an artist and you’d be guaranteed to have at least a believable one if not a perfect rendering.

The lack of guarantees is why using an LLM is akin to gambling. Every new context is essentially picking someone out of the crowd.

jodrellblank · 2026-05-29T13:58:58 1780063138

As an aside, some of the renders have only a single side connection to the wheel and that is a valid bike design, the Cannondale Lefty front fork only has a left leg:

https://duckduckgo.com/?q=cannondale+lefty&iar=images&t=ffab

kvirani · 2026-05-28T22:30:37 1780007437

> The most unintelligible drawing has also the most unintelligible handwriting. It was made by a doctor.

Haha

jonas21 · 2026-05-28T17:20:35 1779988835

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.

usef- · 2026-05-28T22:58:54 1780009134

And yet some people doubt Anthropic's commitment to AI safety

simonw · 2026-05-28T19:46:39 1779997599

Here's pelicans in all of the thinking levels - low, medium, high, xhigh, max

https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

motza · 2026-05-29T00:02:13 1780012933

low: yolo

medium: redesign bike so peli can reach bars

high: redesign bike so peli can rest on frame

xhigh: yolo

max: big peli reach bars

ionwake · 2026-05-28T21:39:39 1780004379

I like the way the max pelican has a stern look on his face

virgildotcodes · 2026-05-29T09:40:48 1780047648

Max seems to me to be notably better than the others.

stratos123 · 2026-05-28T19:53:21 1779998001

Is the output on the max level meant to be missing?

simonw · 2026-05-28T19:55:33 1779998133

I just fixed that (force refresh). It hit my default 8,000 output token limit, it worked when I bumped that up.

For max I used 25 input, 17,167 output which cost me 43 cents! https://www.llm-prices.com/#it=25&ot=17167&ic=5&oc=25&sel=cl...
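A rough sketch of that arithmetic, assuming the `ic=5` and `oc=25` parameters in the link are prices in US dollars per million input and output tokens (an interpretation of the URL, not something stated here). The function name is made up for illustration.

```python
def token_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD, assuming prices are quoted per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Token counts from the comment above; prices are the assumed reading of ic/oc.
cost = token_cost_usd(25, 17_167, 5, 25)
print(f"${cost:.4f}")  # roughly 43 cents, almost all of it output tokens
```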

spmartin823 · 2026-05-28T17:32:08 1779989528

You've peed in the pool Simon, this has to be a part of the internal evals by now! You've got to try something new - maybe a panda in a canoe?

phainopepla2 · 2026-05-28T18:02:42 1779991362

If these were in the internal evals then the output would be much better. The 4.8 pelicans are pretty meh

HDThoreaun · 2026-05-28T18:03:44 1779991424

Click the link

ceroxylon · 2026-05-28T17:33:45 1779989625

I really like that thinking level high gave the pelican a helmet.

Xunjin · 2026-05-28T17:23:10 1779988990

Hey simonw I love your test, do you think using thinking level "max" makes sense for this test? I would love to see the results about it.

simonw · 2026-05-28T18:59:04 1779994744

I don't think the API supports "max" as an option, that might just be a Claude Code harness thing.

UPDATE: My mistake, the API does support max. I added a max one at the bottom of this page (cost 43 cents): https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

Xunjin · 2026-05-29T02:51:38 1780023098

The legend.

yanis_t · 2026-05-28T17:15:00 1779988500

Simon, does your pelican test really capture differences among models, or should you at least try like 10 times or something to average out the random effects?

simonw · 2026-05-28T17:15:45 1779988545

I've been meaning to do a "run 3 times and pick the best" version for quite a while, I should really pull the trigger on that one. Currently it's one-shot only.

notaharvardmba · 2026-05-28T23:19:01 1780010341

You could run 3 times and overlay/average the images to show how consistent they are

xiphias2 · 2026-05-28T17:48:44 1779990524

Best-of-3 would be cheating and ruin the test; middle-of-3 makes more sense

nik736 · 2026-05-28T18:23:15 1779992595

Why would you need the 3rd run if you pick the "one in the middle"?

jmaw · 2026-05-28T19:47:48 1779997668

Middle as in not the best, and not the worst. As opposed to the second generated in sequence.

But not the best/not the worst is somewhat subjective.. so not sure how well that would work.

BrokenCogs · 2026-05-29T02:00:36 1780020036

I think GP meant picking the median pelican
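A minimal sketch of what "median of 3" could look like, assuming each run has already been given a numeric quality score. The scores and run names below are hypothetical, and as noted upthread the scoring itself is subjective.

```python
def pick_median(runs):
    """Return the run with the middle score: not the best, not the worst."""
    ranked = sorted(runs, key=lambda run: run["score"])
    return ranked[len(ranked) // 2]


# Hypothetical scores for three generations of the same prompt.
runs = [
    {"name": "run-1", "score": 6},
    {"name": "run-2", "score": 9},
    {"name": "run-3", "score": 4},
]
print(pick_median(runs)["name"])  # run-1
```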

silisili · 2026-05-28T18:53:57 1779994437

The vast majority (if not all) of these make it impossible to turn, among other fun things. Only out of curiosity, have you tried prompting further with how a bike must operate to see if it does the right thing?

fendy3002 · 2026-05-29T02:57:38 1780023458

tried it myself, not much of a difference

https://gist.github.com/fendy3002/3026a8c4d67d1301666ec40fc0...

looks like the model was already trained well on both bicycles and pelicans

lysecret · 2026-05-29T13:56:42 1780063002

Sadly, I think the correlation between this benchmark and performance is starting to break down. Still, it's a legendary idea that will be remembered and ingrained in the models forever haha

1attice · 2026-05-28T17:18:11 1779988691

That little red hat on hard mode is sending me. 4.8 has whimsy

toastmaster11 · 2026-05-28T18:12:21 1779991941

I find the most miraculous thing about 4.7 to be that the pelican is facing left, wonder why the right facing everything is so ubiquitous in these images.

i000 · 2026-05-28T18:49:17 1779994157

This happened to me in elementary school. We were doing fingerpaintings using plasticine. After all the bikes were hung on the wall, mine was racing the other way... Somehow it really stuck with me.

sunnybeetroot · 2026-05-28T21:27:32 1780003652

What do you think it means?

gboss · 2026-05-28T18:35:38 1779993338

It's facing left but looking right...

toastmaster11 · 2026-05-28T19:02:53 1779994973

Profound political commentary?

whalesalad · 2026-05-28T18:43:17 1779993797

Eventually the frontier model folks are going to pick up on your pelican on a bike test and bake-in flawless results for that particular request.

impalallama · 2026-05-28T21:38:55 1780004335

I actually like the 4.7 the most, interestingly enough. Not like you can "objectively" weigh artistic output like this.

alex_duf · 2026-05-29T09:28:51 1780046931

It's funny that we've reached the level where LLMs draw more correct bikes than any random person

prmoustache · 2026-05-28T22:41:24 1780008084

I don't see how a frame without a headtube can be "the correct shape".

fragmede · 2026-05-28T20:47:11 1780001231

For comparison, what's GPT-5.5 producing today?

simonw · 2026-05-28T21:49:41 1780004981

The reasoning xhigh one is pretty solid: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

fragmede · 2026-05-28T23:08:33 1780009713

Lends credence to my vibe-based assertion that GPT-5.5 > Opus 4.7 (and now 4.8), which is why I've cancelled my Claude plan. Opus 4.8 is them seeing it reflected in their own numbers and having to pull stopgap measures to avoid falling behind while they embargo Mythos.

timsuchanek · 2026-05-28T18:00:58 1779991258

thanks for always providing this very much on time. I'm wondering what the next, harder challenge could be? Maybe some animated svg?

nickvec · 2026-05-28T17:11:32 1779988292

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)

simonw · 2026-05-28T17:19:03 1779988743

Good call, it's cute: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304... - but nothing like GLM-5.1: https://static.simonwillison.net/static/2026/glm-possum-esco...

highwaylights · 2026-05-28T18:07:36 1779991656

Am I allowed to say that pelican's little helmet is adorable? I can't provide a strong computational proof, or even a shred of anecdata...

...but that pelican's little helmet is adorable.

onlyrealcuzzo · 2026-05-28T17:09:59 1779988199

4.7 reigns supreme IMO.