GLM 5.2 beats Claude in our benchmarks

pimeys · 2026-06-28T21:59:10 1782683950

I have taken another look at these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars...

This weekend I programmed a Matrix bot with encryption and a Rust agent with some tools, because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer I have what I need: a multimodal agent written in Rust that has access to my homelab.

Nothing felt off with GLM. It did what I wanted, was fast, had a decent, not very annoying personality and was much cheaper than Opus or GPT.

I used it unquantized through Fireworks, but there are multiple other providers too.

gertlabs · 2026-06-29T00:05:58 1782691558

GLM 5.2 is a great model, but if you only want to use the best model available, it isn't there yet. Every lab releases models that memorize benchmark answers, both intentionally and unintentionally. But we consistently find that models from Chinese labs have a wider gap between public benchmarks and our evaluations, which we designed to be less vulnerable to benchmaxxing.

In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average. Data at https://gertlabs.com/rankings

But when factoring in performance/cost, GLM 5.2 is the frontier model.

neya · 2026-06-29T03:26:35 1782703595

What is the methodology of your benchmark?

On the contrary, I personally think these broader benchmarks are meaningless. I think personalized benchmarks are the way to go. They should answer "How does this model perform for MY use-case?" rather than trying to answer "How does this model perform across all coding environments?"

Case in point: I use Elixir, which is not as popular as Python and is always hit or miss with most SOTA models at the top of these benchmarks. Meanwhile, the ones in the middle of the benchmarks (like GLM) almost always outperform even the SOTA models from Google / Anthropic. However, this is relevant only for my use case, and I wouldn't advocate a model for everyone based on my use case alone.

jfaat · 2026-06-29T03:05:17 1782702317

> but if you only want to use the best model available, it isn't there yet

I'm trying to wrap my head around exactly why so many people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price. And the frontier models get nerfed constantly, so with open weights you can get something slightly less performant but way more stable. It's almost like buying a Ferrari for your daily commute instead of a Toyota or even a Mercedes.

I think there are several factors. Certainly marketing, which is rampant online and which very smart people think they aren't susceptible to, makes us think we need the shiny thing. There's a lot of really odd 'I trust Anthropic/OpenAI more than Deepseek' which tends to ignore, for starters, that you can choose your provider and still save a ton. I also think there's some amount of addiction and brand loyalty, where a Ferrari is one hell of a drive so you turn your nose up at that sensible Toyota. Oh, the other one I see used is like 'oh, only Fable can oneshot updating my embedded systems thing from 1975 to Rust', which is great, but let's recognize how niche that is.

And it ends up just coming across as people are getting SO reliant on the tools so fast. Maybe it's ok to think and like read a few lines of code and work with these agents to convert your thing to rust or center your div. Even if coding is over which in some sense it certainly is, don't turn your mind into the wall-e people yet. I found myself guilty of this so often. It takes way more time and effort to do things via prompt and I wouldn't just open the editor and fix it because that dopamine hit of the magic the abstraction provided was so strong.

So I'm pretty much done using the 'best' (on benchmarks, if money isn't an object, etc etc) models available. After a year on Sonnet/Opus/GPT5x I'm having way better results with open weights models that don't get lobotomized weekly. I'm finding ways to do the crafting part of building software by focusing on honing my harness and workflow. I'm enjoying changing the oil on my Toyota after a year of almost flying off cliffs in my Ferrari and if I can check my ego it's a purely positive thing.

ssk42 · 2026-06-29T03:19:56 1782703196

What is your favorite harness for the open weights?

NamlchakKhandro · 2026-06-29T03:31:11 1782703871

pi-mono

ronsor · 2026-06-29T02:13:38 1782699218

Opus 4.6 is still my preferred model for work, so this is great to hear.

echelon · 2026-06-29T02:53:36 1782701616

I can't wait for open models to take over in all categories.

Sounds like this is the year for coding.

pizzly · 2026-06-29T03:10:35 1782702635

It looks possible open models will. I never expected the reason would be political/legal rather than technical.

bjourne · 2026-06-29T01:41:45 1782697305

Man, there is exactly zero information on your site about how your benchmarks work. Why should one trust your numbers when there is no way to verify them?

gertlabs · 2026-06-29T01:48:52 1782697732

Scroll to the bottom for the methodology (sorry, this should be linkable)

skeptic_ai · 2026-06-29T01:04:34 1782695074

Why is DeepSeek V4 Flash better than Pro in your benchmarks?

gertlabs · 2026-06-29T01:50:14 1782697814

It's 100% due to tool use -- Flash adapts much better to our custom harness with tool names that are not identical to what models were likely trained on. DeepSeek V4 Pro performs much worse in that aspect than almost all other recent releases, for whatever reason.

rockwotj · 2026-06-29T01:23:57 1782696237

I have also found DeepSeek Flash beating Pro in some of my own internal evals for tasklet.ai. It’s really surprising and I don’t understand it.

xbmcuser · 2026-06-29T03:32:34 1782703954

Maybe they distilled Claude for the Flash version and not for the other, hence the better tool use and programming benchmarks.

freakynit · 2026-06-29T01:44:28 1782697468

Same. It's rare, but I have observed it twice to date.

Some blog post I read a few weeks back said that DSV4Flash at xHigh effort beats even the Pro model at xHigh effort.

onoesworkacct · 2026-06-29T01:45:18 1782697518

The rumour is that it's trained on Opus, but who knows

rockwotj · 2026-06-29T01:57:08 1782698228

Oh, of course all DeepSeek and GLM models are. Multiple people have seen GLM self-report that it is Claude, which makes it super obvious.

I think the surprising thing is that I expected Flash to be a pure distillation and strictly worse in quality, but clearly it’s more nuanced than that.

kennywinker · 2026-06-29T02:41:42 1782700902

Claude claims to be deepseek, under some circumstances:

https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...

jchw · 2026-06-29T00:30:30 1782693030

After having used GLM 5.2 and Opus 4.8 for enough time, I'm very unconvinced of the benchmark maxxing claims - if anything, GLM 5.2's rather lackluster performance on benchmarks compared to Opus 4.8 paints the opposite picture when compared to the subjective experience.

When I first used Opus 4.8, I threw several different workloads I had at it - I have Claude doing a lot of misc projects whose primary purpose is pretty much just studying what AI agents can do for my own curiosity and no other reason. Opus 4.8 was one of the first models I ever snuck in there that basically ran out of control. No previous Opus or Sonnet model I had used ever did this. Within hours every agent I had running was writing nonsense tool calls that echoed pretend commands that didn't exist, like 10 in a row, and talking about the "tool channel" being dirty. I switched back to Opus 4.7 and assumed Opus 4.8 was legitimately just broken.

I did come back to Opus 4.8 and found that it was indeed, pretty powerful. But that initial experience has stuck with me on just how narrow of a perspective any given test or benchmark is guaranteed to have. LLMs are too broad, it really doesn't matter what you try to do in your benchmark, you will necessarily get a limited view of what the model is capable of and its shortcomings. This will remain true for at least as long as models are susceptible to massive swings in performance based on randomness and minor differences in prompts and other environmental factors.

I'm not saying benchmarks are useless or that your benchmarks are not possibly closer to the truth either. All evidence at least points to the idea that Chinese models perform very well in coding but often have more mixed results on other tasks. I'm just saying that at this point, benchmarks feel like they have limited connection to my actual real experiences. GLM 5.2 actually scored kinda meh on a lot of benchmarks (compared to closed frontier models) but my actual experience using it does not match this.

And I'm definitely not saying GLM 5.2 is better than the frontier LLMs here, just that the race is close. I still prefer GPT 5.5 right now for code review, I think, and Opus clearly has some advantages depending on the task. It's just no longer a given that Opus 4.8 will perform better than GLM 5.2 on any given task, so to me the calculus behind "using the best model available" is getting complex and you might need to get a feel for what models have what strengths to really figure it out.

I do feel like the "use the best model available" mentality is not going to die any time soon, but if it does die, it will be gradual and start soon for programming. Modern LLMs are still not a full superset of what human programmers can do, but still larger models are definitely starting to hit diminishing returns for tasks at the lower end of complexity, and that is a big deal. It's a weird world where some tasks you can feel kinda confident just throwing Gemma 4 at it and not sweating whether you should use a better model; I've certainly done it for some quick Python scripts or getting an overview of some code I'm unfamiliar with.

Madmallard · 2026-06-29T01:31:32 1782696692

Notice the website URL is the same name as the commenter.

Notice he's using "trust me bro" benchmarks.

Can we just remove all the motivated speech on HN? This is just not trustworthy information at all and obviously is incentivized.

Everyone is grinding and marketing; nobody is actually discussing anything for real.

neya · 2026-06-29T03:22:07 1782703327

I am seeing extremely positive results with Elixir too. Previously I was on DeepSeek (deepseek-v4-pro), and GLM5.2 outperforms it easily. It's been a month since I used any native Claude models (simply because of pricing), but then GLM5.2 is running me $20/day in usage on OpenRouter. I am not sure if I've misconfigured Claude Code or if this is indeed normal usage pricing. But the output more than makes up for it. However, using DeepSeek V4 Pro directly from deepseek.com with their discounted pricing is insanely cost efficient. I topped up $10 a month and a half ago and I have yet to use up all the money in my account. Here's hoping that SOTA models become even cheaper!

Aditya_Garg · 2026-06-28T23:20:46 1782688846

I'm really curious about this. Why pay API pricing? I burn thousands of dollars a month at API rates according to Claude's usage numbers, but only pay for the $100 subscription.

redox99 · 2026-06-29T03:29:49 1782703789

And codex is even more subsidized. It's an absurdly good deal.

horsawlarway · 2026-06-28T23:38:53 1782689933

My increasing frustration with these plans is the harness lock in.

Anthropic won't even let you run "claude -p [prompt]" any more... They bill it at api rates.

So if you're trying to automate the ai (and seriously, that's the point) the subsidized plans are crippled.

cortesoft · 2026-06-28T23:54:18 1782690858

They postponed that change, here is the email they sent out:

> In May, we sent you an email announcing that starting today, the Claude Agent SDK, claude -p, and third-party apps built on the Agent SDK would stop drawing from subscription rate limits and move to a dedicated monthly credit. We're writing to let you know that we’re not making this change today. We’re working to update the plan to better support how users build with Claude subscriptions.

> What this means for you

> Nothing changes for now. Agent SDK, claude -p, and third-party app usage continues to work with your subscription exactly as it did before today, and there's no credit to claim. Your subscription limits are unchanged. When we have an update, we'll share it with advance notice before it takes effect

smcleod · 2026-06-28T23:54:45 1782690885

They canned the move to make -p commands API billable.

sroerick · 2026-06-28T23:52:15 1782690735

I'm using synthetic.new and Neuralwatt with pi and it's good and also cheap.

computerex · 2026-06-28T23:57:02 1782691022

I have had a bad experience with Neuralwatt's GLM 5.2. It seems like they may be using a quantized version of the model.

scottcha · 2026-06-29T01:34:26 1782696866

Hi I'm the CTO of neuralwatt, would love to hear your feedback on what your experience was. Feel free to email me scott@neuralwatt.com. Also for GLM5.2 we run the FP8 quantization at 1M context which is a common deployment target.

throwawayffffas · 2026-06-29T00:02:02 1782691322

Z.ai does not lock you in to any harness.

weird-eye-issue · 2026-06-28T23:39:52 1782689992

I think they rolled that back

SV_BubbleTime · 2026-06-28T23:32:55 1782689575

There is a whole iceberg topic on subsidizing.

So your question is really “if they’re giving free usage, why not take advantage of it?”

I do, so I don’t know the reasons not to, other than to experiment.

shostack · 2026-06-28T22:07:47 1782684467

If you're using Matrix, consider Hermes as a harness if you haven't already. Native gateway support. I've been primarily using mine through Element and it has largely been great.

pimeys · 2026-06-28T22:17:00 1782685020

Oh interesting. I basically chose Matrix because setting anything up with WhatsApp or Signal was kind of painful and Telegram doesn't make it easy to use encryption with bots.

I kind of wanted to see if I could make a Matrix agent from scratch in Rust with GLM, and it was surprisingly easy. Just make something for myself how I want it. Maybe I'll take a look at Hermes later...

Barbing · 2026-06-29T00:33:45 1782693225

Very interesting—Element X solved a lot of the pains of Element (iOS), could be a good solution!

KaoruAoiShiho · 2026-06-28T22:37:24 1782686244

Are you sure fireworks is unquant? It's not listing precision on openrouter like everyone else.

jklmnopqrstuvw · 2026-06-29T01:30:47 1782696647

> A typical session for me with GPT is usually over a hundred dollars.

I don't think a $100 session is "typical". I have used GPT for months, and the $20/month Plus plan is enough for my daily work.

simple10 · 2026-06-29T02:34:54 1782700494

I use an observability tool with claude code [1] that shows me usage including prompt and session cost. Even though I use a max subscription, it's interesting to see what it would cost me if I was using API directly.

My typical session ranges from $100-$400 - higher end when using workflows with lots of subagents. $100/session is expected when using the API without the subsidized subscription pricing. Most larger orgs have to use API pricing AFAIK.

[1] https://github.com/simple10/agents-observe

tjwebbnorfolk · 2026-06-29T02:28:22 1782700102

I have Claude max plan and the vscode claude dashboard plugin has logged about $4k worth of tokens in the past 2 months. I upgraded because I was using my weekly basic plan tokens in like 5 hours.

Likewise, I don't understand how anyone survives on the basic plans. It's funny seeing these two camps not understanding what the other is doing :)

adamtaylor_13 · 2026-06-29T02:13:30 1782699210

It's really interesting what "normal" is for folks. I use the $200/month Anthropic subscription and get within a few percent of my limit every week.

I'd blow through $20/month plan in hours.

dist-epoch · 2026-06-28T22:07:20 1782684440

$20 on API pricing or on subscription?

pimeys · 2026-06-28T22:11:29 1782684689

API, pay per token.

Chrisoaks · 2026-06-29T01:55:15 1782698115

Why are you not using the subscription plan?

HKCM852 · 2026-06-28T22:06:08 1782684368

Which harness did you use?

pimeys · 2026-06-28T22:11:08 1782684668

Opencode and Zed about 40/60.

noncoml · 2026-06-28T22:37:32 1782686252

[flagged]

term333 · 2026-06-28T22:58:39 1782687519

Please take comments like this back to reddit.

sertsa · 2026-06-28T22:43:41 1782686621

It's an editor: https://zed.dev/

HAL3000 · 2026-06-28T22:55:11 1782687311

Just FYI, this question was a quote from Pulp Fiction. The other commenter (mdre) also replied with a quote, the answer to this question in the movie.

dom96 · 2026-06-29T00:03:30 1782691410

Twenty dollars?

How are you comfortable spending that much to write something as simple as a matrix bot?

Are people doing this kind of thing just super rich or am I missing something?

ygjb · 2026-06-29T01:13:21 1782695601

It's pretty simple. There are things that I do because it's fun, like gamedev. I hand code that, and don't use LLM tools because I like learning and building. I do lots of utility stuff coding for my wife's business, most of that is stuff I could do in a few hours. It's worth $20 to not spend a few hours doing it. It's a cost benefit tradeoff. I won't learn much fixing WordPress themes or adding a feature to her web page, or setting up an automation for her, so I don't see the point of doing that.

Same thing for stuff at work. Oh, the tables/schema changed and my queries broke? I could dork around with spark and cypher for an hour, or I can tell claude to update the queries for the new schema. At the rate I am paid, spending on Claude tokens is generally a better use of my resources.

Building a net new solution? Coding tools take a back seat until I get the core logic right, then I let automation handle web page and UI scaffolding.

annzabelle · 2026-06-29T02:57:55 1782701875

A lot of people spend $20 on a hobby for an hour of enjoyment a couple times a week. Not odd at all to do that for a few hours of coding if you find it fun. It could be a day pass at a bouldering gym or a yoga class or amortized running shoes/garmin/electrolytes.

adamtaylor_13 · 2026-06-29T02:14:03 1782699243

Is spending $20 considered "super rich"?

copperx · 2026-06-29T02:17:32 1782699452

$20 is really cheap for the amount of work saved, considering you're in the US.

SwellJoe · 2026-06-28T22:33:37 1782686017

I added GLM 5.2 to my security bug hunting benchmark when it came out, and found it to be a good performer, but not the best open model. The benchmark tests whether models can find bugs Mythos found. The best open models in the initial benchmark were DeepSeek V4 Pro or MiMo 2.5 Pro. But it turned out MiMo got lucky, it's performed worse on almost every test I've done since, while DeepSeek has consistently been among the best performers and its extreme caching performance makes it cheaper than just about anything, including much smaller models.

https://swelljoe.com/post/will-it-mythos/

Also of note, I found giving models access to the open source semgrep as a tool makes some perform worse and none perform better, though it's plausible there's a way to wire it up in a harness that presents useful information to the model without the model having to know how to use it (my theory is that semgrep isn't heavily represented in the training data, so you're asking the model to do two things at once: figure out how to use semgrep and find security bugs, and both tasks suffer for the lack of focus...most small models, and some big models, can't do that well).

Edit: But, also, more testing is ongoing. I suspect GLM 5.2 will also be a consistently strong performer. It seems to excel at most things I've tested on it.

lebovic · 2026-06-29T01:12:20 1782695540

GLM 5.2 and DeepSeek v4 Pro seem to approach security research differently. This benchmark was with GLM 5.1, but the patterns are similar: https://dualuse.dev/posts/deepseek-v4-thinks-different

Overall, I still think GLM 5.2 is the much stronger performer. It's hard to tell the difference between GLM 5.2 and Opus at <120k tokens.

SwellJoe · 2026-06-29T02:01:54 1782698514

I have found that some models consistently find or miss specific bugs, and which bugs are hard doesn't completely line up across all models, so I believe that. I just completely refactored the security bug-finding harness I've been working on (not checked in yet, testing it currently) to strongly encourage "multi-model, multi-pass" scans and make them easy to orchestrate, with de-dupe and weeding of false positives by a strong model, rather than using one model or doing just one pass over each file. Giving a model a second attempt increases its findings by 20-30%, and a third adds another 10-15%.
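
A minimal sketch of the "multi-model, multi-pass" idea described above, not the actual harness (which is not checked in yet). The `Finding` type, `multi_pass_scan`, and `fake_scanner` are hypothetical names; the fake scanners stand in for real model calls, and exact-match de-duplication is a simplifying assumption. The strong-model step for weeding false positives is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    """One reported bug; frozen so it can be used as a dict key for de-dupe."""
    path: str
    line: int
    kind: str


Scanner = Callable[[str], Iterable[Finding]]


def multi_pass_scan(path: str, scanners: dict[str, Scanner], passes: int = 3) -> dict[Finding, set[str]]:
    """Run every scanner `passes` times on one file and merge identical findings.

    Returns a mapping from each unique finding to the names of the scanners
    that reported it at least once.
    """
    merged: dict[Finding, set[str]] = {}
    for name, scan in scanners.items():
        for _ in range(passes):
            for finding in scan(path):
                merged.setdefault(finding, set()).add(name)
    return merged


def fake_scanner(results_per_pass: list[list[Finding]]) -> Scanner:
    """Hypothetical stand-in for a model call: returns canned results, one list per call."""
    calls = iter(results_per_pass)
    return lambda path: next(calls, [])


a = Finding("auth.c", 42, "overflow")
b = Finding("auth.c", 97, "use-after-free")
scanners = {
    "model_a": fake_scanner([[a], [a, b], []]),  # finds b only on the second pass
    "model_b": fake_scanner([[b], [], [b]]),     # never finds a
}
merged = multi_pass_scan("auth.c", scanners)
for finding, reporters in merged.items():
    print(finding, sorted(reporters))
```

Tracking which models reported each finding makes it easy to see which bugs only one model catches, which is the motivation for mixing models in the first place.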

I'm inclined to use DeepSeek V4 Pro the most, because it is consistently extremely strong, it's very fast, it's very cheap and has excellent caching and cheap-as-free cached input tokens (something like 80% of token usage is cached when I'm using it for security scanning). So my probable "pair" of frontline security researchers will be DeepSeek V4 Pro and Gemma 4 31B self-hosted (another shockingly strong contender, competitive with the best models once you let it loop on the same file a few times). But I won't be surprised if GLM 5.2 turns out better than DeepSeek V4 Pro...it costs quite a bit more.
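
To see why an 80% cache share matters so much, here is a rough sketch of the blended input price. The two prices are placeholders chosen for illustration, not real provider pricing; only the 80% figure comes from the comment above.

```python
def effective_input_price(uncached_price: float, cached_price: float, cache_hit_ratio: float) -> float:
    """Blended price per million input tokens for a given share of cached tokens."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return (1.0 - cache_hit_ratio) * uncached_price + cache_hit_ratio * cached_price


# ASSUMPTION: placeholder prices in USD per million input tokens.
UNCACHED = 1.00
CACHED = 0.10

blended = effective_input_price(UNCACHED, CACHED, 0.80)
savings = 1.0 - blended / UNCACHED
print(f"blended: ${blended:.2f}/M input tokens, {savings:.0%} cheaper than uncached")
```

With these placeholder numbers the blended input price lands at roughly a quarter of the uncached price; the real ratio depends entirely on the provider's actual cached-token discount.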

acters · 2026-06-29T02:57:18 1782701838

I believe it is because GLM 5.2 has extra anti-cyber training instilled in it. Similar to Kimi k2.7 code.

Deepseek v4 pro being in preview with less "safety" training makes it stronger for that reason. Thinking will be different and in the end, it will actually try to be useful. Just expect future Chinese LLMs to further push out "safety" guided LLMs. The future is bleak for open weight models. Prepare to have "guidelines" enforced unceremoniously to all.

amhoab · 2026-06-29T03:30:23 1782703823

Aren't you the Webmin guy?

Barbing · 2026-06-29T00:34:39 1782693279

We need a benchmark of independent community sourced benchmarks!

…probably already is one

SwellJoe · 2026-06-29T00:45:59 1782693959

I don't know how you'd judge benchmarks beyond "did it test and measure what it says it tests and measures". And I guess there have been instances where a benchmark failed to do that: the models could cheat in some way, and it just tested the model's ability to find the answer key. In the case of my benchmarks, no model other than the Claude models running in Claude Code ever has network access, and all information from after the bug was discovered has been removed from the repository the model can see.

But, there are benchmarks for so many different kinds of ability, I don't know how to compare them directly against one another. Like, models that do well on terminal and agentic coding benchmarks tend to do well on finding security bugs, but it's not a 1:1 correlation, there are surprises.

mapontosevenths · 2026-06-29T01:56:52 1782698212

It's not super scientific, but I really like to watch Bijan Bowen's videos on Youtube. I think he's pretty fair about the way he compares them, and it's enough for what I'm doing.

SwellJoe · 2026-06-29T02:09:32 1782698972

Actually doing something normal but challenging with a model is generally enough for me. I do a quick (an hour or two) project, and see how it holds up. If I'm feeling like it's harder than it should be, I switch to a comparable model I know is good. e.g. I most recently tested Gemini Flash 3.5 for making a web app. It shit the bed...kinda worked, but was ugly and needed several bugfixes right off the bat. I tried the same app in Opus 4.8, which aced it with barely any extra conversation, it looked great (basic but clean, like it was intentional) without any effort.

I like reading benchmarks, but I take them all with a grain of salt. They're just to tell me if the model is worth even trying for my task. I've heavily used self-hosted Qwen 3.6 and Gemma 4 on a bunch of different tasks, and while the benchmarks consistently say Qwen is the better model, I simply don't find that to be the case for anything I do. I think Qwen is tuned for benchmarks, while Google couldn't give two shits about most of the benchmarks, they're just busy making unusually smart tiny models.

bArray · 2026-06-28T21:24:00 1782681840

Apparently GLM 5.2 is 753B parameters [1], what kind of hardware are people using to run this locally?

[1] https://huggingface.co/zai-org/GLM-5.2

Retro_Dev · 2026-06-29T00:17:59 1782692279

I ran it on my laptop, which is a Lenovo Legion 5i (think 32 GB RAM, 4060 w/ 8 GB VRAM, you get the picture). It was a quantized model (otherwise it would not fit on my NVMe 1TB drive) at 4 bits per weight - UD_Q4_K_XL. It ran at about 12 seconds per token (not tokens per second). A fun project, but not worth it. I used 4096 tokens of context cache, and I ran it with llama.cpp - as it supports memory mapping. Because the whole thing could obviously not fit in RAM, I was curious how much it would need to stream from SSD. The answer? For a simple 4 sentence description of who it was, about 1.5 TiB was streamed from disk.
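
A back-of-the-envelope check of the sizes involved, assuming the 753B parameter count quoted above and a flat number of bits per weight. Real quant formats such as UD_Q4_K_XL mix bit widths, so actual file sizes will differ; `weight_gb` is a hypothetical helper.

```python
PARAMS = 753e9  # parameter count quoted in the thread


def weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate weight storage in decimal GB (ignores KV cache and file metadata)."""
    return params * bits_per_weight / 8 / 1e9


for label, bits in [("16-bit", 16), ("FP8", 8), ("4-bit", 4), ("2.6-bit", 2.6)]:
    print(f"{label:>8}: ~{weight_gb(bits):,.1f} GB")
```

At 16 bits per weight the weights alone are about 1.5 TB, which is why they would not fit on a 1 TB drive, and even at 4 bits they are over ten times the laptop's 32 GB of RAM, hence the streaming from SSD.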

bArray · 2026-06-29T00:34:14 1782693254

Thank you for sharing. 1.5TB of streamed data at 12 seconds per token on a high end consumer laptop is a pretty high requirement - I can only imagine how much that cost to train. I don't know how running this model could be cost effective for anybody.

Retro_Dev · 2026-06-29T01:02:21 1782694941

Indeed - definitely not cost effective to run it on this laptop LOL. It makes me wonder how fast we could run the model if we could fit the weights entirely within CPU cache (assuming a whole ton of CPUs with low latency & high speed IO of course).

kccqzy · 2026-06-28T22:30:27 1782685827

Run quantized versions. https://unsloth.ai/docs/models/glm-5.2

scosman · 2026-06-29T01:36:46 1782697006

short answer: they mostly aren't

A few people are running highly quantized models with limited context windows. It's still impressive, but not the benchmark level intelligence. Very few people could afford a rig for reasonable local performance at a reasonable quant, at full context size.

The antirez example is 2.6bit quant, 32k context, and few tokens per second... on a ~$7000 MacBook M5 (new RAM pricing).

crocowhile · 2026-06-28T21:26:06 1782681966

follow antirez - https://x.com/antirez/status/2071173841175363905?s=20

nozzlegear · 2026-06-28T23:03:49 1782687829

https://xcancel.com/antirez/status/2071173841175363905

JamesSwift · 2026-06-28T21:39:28 1782682768

That's quantized

dakolli · 2026-06-28T21:55:04 1782683704

8 X RTX6000. It will run you around 80-100k to get started with a model at this size with decent tps..

Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.

For $100k you could run this model 24/7 through OpenRouter with 10 concurrent sessions at 50tps for a decade and have money left over for a vacation. There's no point in investing this type of money in local models unless you have a business where you're already paying for many employees' individual token usage.
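
The claim above can be sanity-checked by computing the token volume it implies and the per-token price at which $100k would be exactly used up. This is a sketch under the comment's own numbers; it treats all tokens as output tokens at one flat price, which is a simplification, and it does not use any real provider pricing.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def total_tokens(sessions: int, tps: int, years: int) -> int:
    """Tokens generated by `sessions` streams running nonstop at `tps` tokens/second."""
    return sessions * tps * years * SECONDS_PER_YEAR


def break_even_price_per_million(budget_usd: float, sessions: int, tps: int, years: int) -> float:
    """Price per million tokens at which the budget is exactly spent."""
    return budget_usd / (total_tokens(sessions, tps, years) / 1e6)


tokens = total_tokens(sessions=10, tps=50, years=10)
price = break_even_price_per_million(100_000, sessions=10, tps=50, years=10)
print(f"{tokens:.3e} tokens, break-even at ${price:.2f} per million tokens")
```

So the statement holds only if the blended hosted price stays below that break-even figure for the whole decade; at a higher price, $100k runs out sooner.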

Aurornis · 2026-06-28T22:10:59 1782684659

> 8 X RTX6000. It will run you around 80-100k to get started

8 x RTX6000 GPUs cost $100,000 alone. You then need to build a system that can support those GPUs with enough PCIe lanes through a PCIe switch.

It's going to be $120K to $150K to build or buy a system to run this.

cheschire · 2026-06-29T00:35:15 1782693315

Not to mention the three separate dedicated 15A circuits you would need to have installed in order to run the 3x 2000W power supplies running ideally at no more than 1400W sustained load each. And definitely would need 200A service to the house if you have a family living there with you.

But hey you could save on heating?

InvertedRhodium · 2026-06-29T01:37:22 1782697042

That’s a uniquely US issue - in NZ you can get a 100A single phase at 230V nominal without any issue. 23kw, straight to your door.

A single circuit using 10mm TPS would technically be enough to run what you’re describing. Might be pricey though, I’d probably take the excuse to get 3 phase installed so I could get access to the stock of used 3 phase machinery.
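
The power figures in this sub-thread follow from watts = amps x volts. A small sketch, assuming 120 V nominal for a US 15A branch circuit and the commonly cited 80% derating for continuous loads (neither is stated in the thread); `circuit_watts` is a hypothetical helper.

```python
import math


def circuit_watts(amps: float, volts: float, derate: float = 1.0) -> float:
    """Usable power on a circuit, optionally derated for continuous load."""
    return amps * volts * derate


# ASSUMPTION: US branch circuit at 120 V, 80% continuous-load derating.
US_BRANCH_W = circuit_watts(15, 120, derate=0.8)

# Three power supplies at 1400 W sustained each, as described above.
RIG_W = 3 * 1400
circuits_needed = math.ceil(RIG_W / US_BRANCH_W)

# The NZ example: 100 A single phase at 230 V nominal.
NZ_SERVICE_W = circuit_watts(100, 230)

print(f"US 15A circuit, continuous: {US_BRANCH_W:.0f} W")
print(f"Rig load: {RIG_W} W -> {circuits_needed} dedicated circuits")
print(f"NZ 100A service: {NZ_SERVICE_W / 1000:.0f} kW")
```

Under these assumptions a 1400 W sustained load per supply sits just under one 15A circuit's continuous limit, which is where the three-circuit figure comes from, while the 100A/230V service matches the 23 kW quoted.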

bentinney · 2026-06-29T03:38:53 1782704333

Not so sure about that. 200amp @ 240v is pretty standard for modern houses in the US. My house in Japan was only 40amps, so there are plenty of countries where this would be an issue.

knollimar · 2026-06-28T23:17:59 1782688679

isn't throwing that into a [insert financial vehicle that gives 99.99999% safe returns] going to destroy that when you factor in electricity costs?

Or even just electricity costs vs token cost

CamperBob2 · 2026-06-28T22:20:54 1782685254

You can run the NVFP4 quant with 8x RTX6000 cards at 50-75 tps output, but not (practically speaking) the OEM FP8 version. You will learn more about PCIe than you ever wanted to know.

The real gangstas are running 16x RTX6000s. Too rich for my blood, and the NVFP4 quant doesn't seem to be that much worse.
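
A rough headroom calculation shows why the FP8 version is impractical on 8 cards. It assumes 96 GB of VRAM per card (not stated in the thread) and the 753B parameter count quoted earlier, and it counts weights only; `headroom_gb` is a hypothetical helper.

```python
PARAMS = 753e9          # parameter count quoted earlier in the thread
VRAM_PER_CARD_GB = 96   # ASSUMPTION: 96 GB per card; not stated in the thread


def headroom_gb(cards: int, bits_per_weight: float) -> float:
    """VRAM left after loading weights, in decimal GB (negative means they do not fit)."""
    weights_gb = PARAMS * bits_per_weight / 8 / 1e9
    return cards * VRAM_PER_CARD_GB - weights_gb


for cards, bits in [(8, 8), (8, 4), (16, 8)]:
    print(f"{cards:>2} cards at {bits} bits/weight: {headroom_gb(cards, bits):,.1f} GB free")
```

Under these assumptions, FP8 weights on 8 cards leave only about 15 GB in total for KV cache and activations, while a 4-bit quant on 8 cards or FP8 on 16 cards leaves hundreds of GB.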

Sanzig · 2026-06-28T23:32:59 1782689579

Has anyone done any benchmarks on the NVFP4 quant? I'm seriously considering pitching an 8 x RTX 6000 Pro box at work to run GLM-5.2 in an air-gapped environment.

tiahura · 2026-06-28T23:57:03 1782691023

Good luck. I’m in the legal field, and even there, selling airgapped is tough.

botro · 2026-06-29T03:06:12 1782702372

What are the challenges you've seen in selling air gapped? Is it the high upfront cost? Challenges with hardware maintenance or something else?

AussieWog93 · 2026-06-29T01:24:04 1782696244

>Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.

Not sure if you're being sarcastic, but I can run a quantised version of Gemma or Qwen on my 16GB M1 Macbook Pro that beats GPT-4 from 2023 hands-down.

I wouldn't be surprised if, in another 3 years, you'd be able to run something as powerful as Opus 4.5 or GLM-5.2 on standard consumer hardware - say a 32GB/64GB M7 Pro.

I also wouldn't be surprised if, 3 years after that, cheaper hardware and improved model efficiency mean that there's a much smaller gap between what you can run on a consumer CPU (which, with memory prices coming down, could look like a 256GB M9 or M10 Pro) and a $100k GPU cluster.

marcus_holmes · 2026-06-29T01:53:35 1782698015

This is clearly where the industry is going, imho. Everyone who is playing with LLMs wants a laptop with enough grunt to run a decent model locally.

We've been sat with basically the same PC specs for ~20 years - our current specs are within an order of magnitude of the ones we could buy back in 2010. This is not really constrained by tech, as we could have much, much, larger machines. It's more because there's no mass demand for much, much, larger machines - if it's big enough to run Office apps or VSCode then you're good to go. The exponential growth we saw in the 90's was driven as much by software demand as it was by hardware development.

I can see the next 10 years produce the same kind of push for larger machines that the 90's did. And we should probably expect the same kind of standards churn as our existing technologies for storage, memory, etc, don't scale up enough and new technologies become worth developing because there's demand for them.

InvertedRhodium · 2026-06-28T22:31:59 1782685919

Depends how much you value privacy and running uncensored models.

Personally, I’m waiting for hardware to hit the secondary market before I buy something to run unquantized models like GLM. But I have no doubt that I will, at some point.

Ldorigo · 2026-06-29T00:20:39 1782692439

How do the economics of your statement work out? Clearly inference providers don't have a time to ROI of 10 years on their hardware costs; and that's without even taking ongoing energy costs into account. What's missing here?

ac29 · 2026-06-29T01:40:25 1782697225

The inference providers are running batch sizes much larger than 10
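
To illustrate why batch size matters so much here, a back-of-the-envelope sketch. Every number in it (hardware price, per-stream speed, lifetime, batch size) is an assumption for illustration, and it idealizes by assuming per-stream speed does not drop as the batch grows and by ignoring energy costs.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def cost_per_million_tokens(hardware_usd: float,
                            tokens_per_sec_per_stream: float,
                            concurrent_streams: int,
                            years: float,
                            utilization: float = 1.0) -> float:
    """Hardware cost amortized over every token produced during its lifetime."""
    total_tokens = (tokens_per_sec_per_stream * concurrent_streams
                    * utilization * years * SECONDS_PER_YEAR)
    return hardware_usd / total_tokens * 1e6


if __name__ == "__main__":
    # Assumed: $100k of hardware, 50 tokens/s per stream, 10-year lifetime.
    for streams in (1, 10, 64):
        usd = cost_per_million_tokens(100_000, 50, streams, 10)
        print(f"{streams:>3} concurrent streams: ${usd:.3f} per million tokens")
```

In this idealized model the amortized cost per token falls in direct proportion to the number of concurrent streams, which is the gap between a single home user and a provider.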

8note · 2026-06-28T21:58:33 1782683913

you can however, have fun with it.

oil workers buy 100k trucks they don't do much with. why not 100k on a computer?

afavour · 2026-06-28T22:32:37 1782685957

Because car loans can’t be used to buy computers

ElProlactin · 2026-06-28T23:27:29 1782689249

And there's your idea. If you could find a way to get people to add another $500/month over 80+ months to an auto loan, dealers would eat that up like filet mignon.

Ken_At_EM · 2026-06-28T22:02:31 1782684151

I can't help but ask where this comment came from, you must have some exposure..

CamperBob2 · 2026-06-28T22:21:49 1782685309

It is so easy to spend $100K on a pickup truck these days, it's not even funny.

tiahura · 2026-06-28T23:58:02 1782691082

A Honda minivan is > 50k.

SV_BubbleTime · 2026-06-28T23:36:29 1782689789

Factory F350 Platinum is at least 90k sticker.

jliptzin · 2026-06-28T23:42:14 1782690134

Yeah, as far as hobbies go, I feel like this is on the low end. I know people who collect watches and corvettes, that's way more expensive and functionally you can't really do anything special with them.

theteapot · 2026-06-28T23:57:57 1782691077

The difference is watches and corvettes typically appreciate in value, whereas computer hardware typically drops like a rock.

15155 · 2026-06-29T00:41:11 1782693671

> watches

Some, and the market fluctuates a ton.

> corvettes

Only the oldest, most unique model years: nobody is buying (C4-C5-realistically C6) mid-90s or early 2000s Corvettes for more than what they paid for them, and they never will.

randomNumber7 · 2026-06-29T00:14:22 1782692062

Also LLMs are mainly used for work, and if you can spend 6 digits on watches you're likely financially independent.

parineum · 2026-06-29T00:44:52 1782693892

> The difference is watches and corvettes typically appreciate in value

Both of those things' value drops like a rock as soon as you buy them and, at least for cars, they don't all appreciate. Most don't. Even so, they appreciate at an incredibly slow rate.

I can't speak for watches but I'd be surprised if it wasn't the same situation.

At least the gpus can create value after you buy them before they are worthless.

cdelsolar · 2026-06-29T02:03:23 1782698603

hmm ok let's build a state of the art from 2021 homelab using 2x Epyc Milan chips + DDR4 RAM and lmk how much it costs...

dakolli · 2026-06-28T22:00:32 1782684032

Sure, if you want to light money on fire for entertainment, more power to you. There are probably worse ways to light 100k on fire. If I have an extra 100k lying around it's going to my family though.

krackers · 2026-06-28T22:07:23 1782684443

Would you be better off pooling that money with some hackerspace group and then setting up shared inference infra, so that way you at least get better utilization?

KaoruAoiShiho · 2026-06-28T22:39:02 1782686342

And before you know it, you invented some openrouter provider from first principles...

janalsncm · 2026-06-28T23:17:32 1782688652

Right. For example you will need to figure out how to share it and who maintains it.

aetch · 2026-06-29T00:05:09 1782691509

You can then rent spare capacity out to people on a subscription or token basis ….wait

KetoManx64 · 2026-06-28T22:42:48 1782686568

As an individual I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag. Once they start trimming down the excess and making them field focused they will run just fine on people's individual devices.

JumpCrisscross · 2026-06-28T22:48:00 1782686880

> I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag

Isn’t the performance gap between quantized and full models indicative that even if you aren’t using it directly, the model knowing the colors in the Russian flag does have something to do with the intelligence you demand?

KetoManx64 · 2026-06-28T22:54:54 1782687294

Do quantized models specifically prune out specific knowledge? I think they just compress things down but they're still in there. You'd most likely need to do that when you're doing the initial model training, but I'm no expert.

JumpCrisscross · 2026-06-29T01:20:47 1782696047

> they just compress things down but they're still in there

The compression is almost certainly in part specific knowledge getting fuzzed.

DennisP · 2026-06-29T02:25:03 1782699903

Yeah, but it's everything getting fuzzed, including the parts you care about.
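
A toy sketch of that point, assuming plain symmetric uniform quantization (real schemes use per-group scales and other refinements, so this is an illustration, not how any particular model is quantized). Every weight is rounded onto the same coarse grid; nothing in the procedure knows which weights encode which facts.

```python
import random


def quantize(weights, bits):
    """Symmetric uniform quantization: snap each weight to an evenly spaced grid."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 positive levels for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs else 1.0
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original, approx):
    """Average absolute rounding error across all weights."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for real weights
    for bits in (8, 4, 2):
        approx = quantize(weights, bits)
        changed = sum(1 for w, q in zip(weights, approx) if w != q)
        err = mean_abs_error(weights, approx)
        print(f"{bits}-bit: {changed}/{len(weights)} weights changed, mean |error| = {err:.4f}")
```

The error is spread over essentially all weights and grows as the bit width shrinks; there is no step where specific knowledge is selected for removal.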

JumpCrisscross · 2026-06-29T03:01:39 1782702099

Sure. There is a legitimate question around whether one can selectively excise “useless” knowledge. My guess is you can’t. The act of learning it encodes both the act of learning and the knowledge per se. The former is the power of the LLM. (I personally force mine to double check everything instead of going off memory.)

kibwen · 2026-06-28T23:06:37 1782687997

Quantizing is one thing. But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.

Likewise, LLMs do not violate the laws of information theory, and therefore the only way to encode X amount of information in Y amount of bits where X > Y is by performing what is effectively lossy compression, and as X grows larger relative to Y the compression ratio must change to lose ever more information.

Yes, for the sake of making chatbots that are "conversational" in that they can interpret natural language as input and produce code as output you can easily benefit in incidental and unintuitive ways by training it on more natural language text. But for a given fixed parameter size, it's possible to produce a better model for a specific task by selectively not muddying its training set in the first place with things that are likely irrelevant to the task.

coldtea · 2026-06-29T00:21:11 1782692471

>But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.

It's hardly self-evident, and your counter-example is hardly applicable.

The first 10^50 digits of pi are not the same as having BREADTH of information in the training data, which is the whole point, not just any random "information that is irrelevant to your use case".

not to mention that the first 10^50 digits of pi compress to quite a small formula, so there's not much information there to begin with from a Shannon/Kolmogorov perspective

kibwen · 2026-06-29T01:06:23 1782695183

It is self-evident. Bringing up Kolmogorov complexity is irrelevant, we're talking about rote memorization, but if you can't ignore the given example then replace "digits of pi" with "bits of output from a true random number generator". There's an infinite amount of information that we could shove into a model, and a finite amount of bits with which to store any of that information such that it can be usefully recalled or form useful logical associations.

JumpCrisscross · 2026-06-29T01:22:08 1782696128

> it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability

We don’t understand AI or natural intelligence well enough to make such statements. As for self evidence, cross-domain competence in humans and the rise of generalist models over domain-specific ones (on competence, not cost) seem to pretty directly tank your hypothesis.

tiahura · 2026-06-29T00:10:37 1782691837

Apparently irrelevant data can help because model weights are entangled.

wonnage · 2026-06-28T22:34:06 1782686046

Yeah, the neoclouds and hyperscalers are taking massive losses right now, self hosting is basically signing yourself up to do the same. There are philosophical reasons to do so but it’s a terrible economic decision

rekttrader · 2026-06-28T21:59:25 1782683965

Or you have data covered by HIPAA or GDPR, or PII, or you have to worry about others training on your data.

dakolli · 2026-06-28T22:01:45 1782684105

That too.

dist-epoch · 2026-06-28T22:08:42 1782684522

> 50tps for a decade

assuming demand doesn't keep on increasing. even google has trouble having enough capacity apparently.

himata4113 · 2026-06-28T20:29:10 1782678550

These numbers seem pretty low compared to what I was able to achieve specifically around windows kernel, win32k<->win32u to be exact. It honestly wouldn't surprise me anymore if China started surpassing the models that the US makes public, at least in specific categories such as cyber.

GLM 5.2 is already capable enough to assist in self-training which is similar to what we saw happen with frontier models and they appear to be getting there at a significantly lower cost than openai/anthropic.

acters · 2026-06-29T03:06:50 1782702410

I am finding Chinese models are introducing more guidelines against cyber. Especially Kimi k2.7 code seems to have extra training against cyber security capabilities. Last one, k2.6 was a lot stronger at cyber but obviously the Kimi team improved over time, so this is not the best they can do but no one will be able to get the best anymore.

I expect future Chinese models to introduce even more of this type of bogus "safety" training.

Looks like if you are a white hat, then you will be fighting an uphill battle. Black hats will be fine, they will not care, they can just run a heretic model or specially trained model.

EMIRELADERO · 2026-06-29T00:45:51 1782693951

> These numbers seem pretty low compared to what I was able to achieve specifically around windows kernel, win32k<->win32u to be exact.

Care to give more context to this? Seems interesting

danmaz74 · 2026-06-28T22:46:50 1782686810

It will almost for sure surpass the models which Trump will allow US "allies" (which he just considers client states) to use. This, together with China's growing dominance in PV, rechargeable batteries, EV, could really be the nail in the coffin for the post WWII economic world order.

himata4113 · 2026-06-28T22:48:48 1782686928

Honestly, it's becoming increasingly hard to disagree with such sentiment when China is preparing itself to lead in energy, manufacturing, research, chip production and so on while there's an entire group of people trying to put datacenters in space.

woeirua · 2026-06-28T23:17:52 1782688672

You are delusional if you think China is going to let Europe have access to Mythos level models for free.

chillfox · 2026-06-29T00:43:18 1782693798

Why not?

Mythos level really doesn't seem that scary. And it would be a great way to take away the American labs' international market.

I think it would make strategic sense for them to release more capable models than what American labs are allowed to make available to the world. It would help them grow their global soft-power and be a destabilizing effect on the American economy.

BobbyJo · 2026-06-29T03:06:15 1782702375

It is fairly obvious to me that the open models are a form of "dumping" as far as the economics and the desired outcome from China's perspective. They get to watch as the US pours tons of money and talent into an industry, then prevent that investment from having any return. In 5 years we'll be on equal footing, China will have spent 1/1000th the money, and the only downside will be that they spent 5 years being 6 months behind.

China could not be happier.

The same model is going to apply to the silicon supply chain as well, is my guess. 1/1000th the expenditure in exchange for being a little behind the curve.

I worry it will have a very real chilling effect on research and development, since customers will probably very quickly switch to the thing that costs 1/10th as much, sucking out the ROI.

lukan · 2026-06-29T00:28:25 1782692905

To hurt the US, maybe. I have not tried it, but GLM here seems already pretty capable.

jmye · 2026-06-29T00:31:08 1782693068

What does "free" have to do with anything?

WithinReason · 2026-06-28T21:02:54 1782680574

> [...] beating Claude Code (32%) at roughly $0.17 per vulnerability found

Claude Code is an agent harness, not an LLM.

Claude is a brand (or group of LLMs), not an LLM.

raincole · 2026-06-28T21:36:04 1782682564

Yes, and the article author is fully aware of that. Thank you for pointing out this small mistake though.

mkagenius · 2026-06-28T22:23:32 1782685412

It looks like the author is specifically avoiding the model's name, because the results are really weird.

  Opus 4.8/4.7 scored 28%

  Opus 4.6 scored 37%

So the author thought: let's not get into that, just write Claude.

happycube · 2026-06-28T23:00:08 1782687608

Not weird at all, given the variance in Opus' quality over the last few months.

wild guess - I wouldn't be surprised if Opus 4.6 was run quantized for a while, and 4.7/4.8 have QAT for that nerfed size.

andriy_koval · 2026-06-28T22:30:46 1782685846

many people think opus 4.6 was the best

raincole · 2026-06-29T02:26:34 1782699994

Where is the weird part?

tills13 · 2026-06-28T22:08:33 1782684513

It costs nothing to not be pedantic.

alienbaby · 2026-06-28T22:52:43 1782687163

Possibly, nothing other than accuracy

croemer · 2026-06-29T00:22:48 1782692568

The dollar amount is meaningless without comparison - and no other model has a price tag. Sloppy article.

Onavo · 2026-06-28T21:18:02 1782681482

Claude Code is the only way to get access to the actual amortized cost of running a Claude-scale model. The consumer non-enterprise API is extremely expensive (with increasing marginal costs for the user and fat profit margins for Anthropic). If you want to approximate a state-level attacker's cost where they can have the model on their own hardware, Claude Code is probably the best guess at the amortized cost.

solenoid0937 · 2026-06-28T19:59:17 1782676757

GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.

Not that it would make any sense.

rgbrenner · 2026-06-28T20:37:04 1782679024

If that happens it'll be an absolute disaster. Imagine a scenario where Anthropic and OpenAI prohibit most US companies from using their latest models because of safety.. And meanwhile attackers use equivalent open source models to attack US companies.

Any prohibition on open source models will do nothing to fix the problem.. since attackers will never feel bound to the law. All advanced models must be available for defensive purposes.

andy99 · 2026-06-28T20:51:46 1782679906

Right, but is there any evidence of intelligence behind any of these (government) decisions? It’s just regulatory capture + marketing (plus some people living out an imaginary fantasy that they’re in Neuromancer or something), absolutely no reason to think they won’t try and target open models as part of this.

popalchemist · 2026-06-28T21:06:32 1782680792

There's at least one reason: much harder to make a profit in policing non-american companies and open-source models without huge (or even any) MRR.

If the real motive is profit, then open source models are likely simply not a viable means to that end.

richardlblair · 2026-06-29T00:10:58 1782691858

And someone will start a competing company in a sane environment.

solenoid0937 · 2026-06-28T21:08:19 1782680899

> since attackers will never feel bound to the law.

But that's the whole point.

Fall out of favor with the admin and you lose access to the good American models, aren't allowed to use Chinese ones, and fall prey to the attackers and behind your competitors.

lenerdenator · 2026-06-28T22:53:46 1782687226

It'd be less about "safety" and more "we've spent trillions developing these AI tools only to have the Chinese, once again, copy them and offer them for pennies on the dollar, and no one seems to care about the impact that has on the long-term sustainability of this sector of the American economy as a whole, so we're yanking the models."

jmye · 2026-06-29T00:36:40 1782693400

"I'm going to take this box razor and make some really deep cuts around the middle of my face because my tech sector is too good and that's actually a bad thing because $foreigners."

lenerdenator · 2026-06-29T01:03:18 1782694998

I'm not saying it's necessarily a good thing. I'm also not saying it's about foreigners at this point. It's about seeing a bet through. They've burned a metric crapload of capital on developing AI models and the infrastructure to host them. They want that money back and then some. Remember, the fine shareholders of OpenAI think that 100x returns just aren't reasonable and want that cap lifted. If this kind of thing continues, they'd be lucky to make their money back at all, let alone 100x.

Which would be fine, but as we know, people securitize the crap out of their investments these days, and at least some people probably leveraged themselves on some US AI companies, so now the risk is spreading outside of the sector to the economy in general, which is made worse by the sheer amount of spending on AI.

aussiegreenie · 2026-06-28T21:36:22 1782682582

The Americans may ban the use of the Chinese models in America. But like the Chinese car ban, everyone else will use them.

lenerdenator · 2026-06-28T22:58:54 1782687534

That's not necessarily a good thing for everyone else, mind.

Yes, you get your free model, but the cost of this is not developing your own capability and tying your fate to a country which may or may not have your best interests as a nation in mind.

This is just the deindustrialization that occurred in my home region (the American Midwest) playing out on a global scale in different sectors. It was originally driven by the Japanese, who, to their credit, acted more as partners than competition. Eventually that desire for larger margins went to China, and now you basically can't build anything of consequence without at least some Chinese parts, because there's "no economic case" for it. This means that you have to play Beijing's game if you want access to any sort of modern market.

You see this happening with Volkswagen's restructuring, next you'll see it with non-American, non-Chinese AI.

singpolyma3 · 2026-06-28T23:58:48 1782691128

It's not really the same because we already have the model. If China stopped letting us have it tomorrow it doesn't matter because... we have it already

chillfox · 2026-06-29T00:59:23 1782694763

So... how's that any different from using American stuff for those of us in the rest of the world?

Over the last decade, the US has been way more unreliable than China. There's been a near constant negative impact from the US doing something.

At least with China, we are very good at winning trade wars with them here in Australia.

lenerdenator · 2026-06-29T01:10:17 1782695417

You might feel differently if you were a Filipino or Vietnamese fisherman whose family relied on the income from the stocks of the South China Sea, or a Uighur person living in Western China, or a Ukrainian soldier who has to deal with drones built with Chinese components, or a democracy advocate in Hong Kong, or arguably, a person who had plans for 2020-2021.

Or, on a more local note, an Australian automotive worker who worked for a company that figured out 10 years ago that they wouldn't be able to pay him a decent wage, compete with the then-upcoming Chinese EVs, and remain profitable.

Paradigm2020 · 2026-06-29T01:31:41 1782696701

You might feel different if you're a palestinian who's getting american bombs dropped on him, or an afghani collateral damage or...

There are no good guys in general, and whataboutism and making the scope bigger doesn't help.

The thing is that if the models you are building on are open source, whether hosted on a Chinese / American / whatever service, they at least give you the option to switch providers more easily, vs. a Fable / ChatGPT 5.6 that gets banned for non-Americans etc...

2 years ago america would have had the branding/perception advantage but right now that is well and truly gone...

skissane · 2026-06-28T22:23:35 1782685415

> GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.

I’m sceptical they could find the legal framework to do this even if they wanted to

They have legal authority to (a) prevent export of US goods/services; (b) ban imports of physical goods; (c) ban transactions (including purchasing services or license agreements) with foreign firms

But I’m not aware of any legal authority which lets them ban US firms from running a Chinese-developed open source AI model in the United States, if they are at arms length from the vendor, and aren’t using it for government contracts or regulated applications

Possibly they could order HuggingFace/etc to suspend Chinese accounts. But if someone in the US (or a third country) downloads the model from China then reuploads it to a US server, completely independently of the vendor - where is the legal hook to prohibit that?

bardak · 2026-06-28T22:47:19 1782686839

They could ban payment processors from processing payments to any hosts of GLM 5.2. Despite the open weights, the vast majority of people will be using cloud providers to get access, since it is too heavy to host for 99% of people.

This would be extremely heavy handed and would probably end up accelerating the loss of the virtual US monopoly on payment networks. The rest of the world isn't going to let the US dictate that only they get the frontier models, whether they're US made or otherwise.

skissane · 2026-06-28T22:52:43 1782687163

> They could ban payment processors from processing payments to any hosts of GLM 5.2

Can they actually though? Do they have legal authority to tell a payment processor that it has to block transactions of a legal US company, just because the company is hosting a Chinese-developed open source model? I’m sceptical

And what about companies (e.g. AWS) that let you “bring your own model”?

bardak · 2026-06-28T23:05:40 1782687940

It would be extremely heavy handed but the administration has sanctioned the International Criminal Court judges such that they basically have no access to the West's modern financial system. I think domestic US providers would have to be dealt with in different ways, but someone like Hetzner could easily be cut off from the financial system if the administration doesn't feel that they are adequately blocking the model.

skissane · 2026-06-29T00:18:36 1782692316

> It would be extremely heavy handed but the administration has sanctioned the International Criminal Court judges

That's sanctioning specific individuals for specific acts they performed which the US claims contravene its interests and those of its allies.

I don't agree with the ICC sanctions, but it really can't be compared with the proposal "sanction any company, even US domestic entities, which use a Chinese-developed open source model".

In fact, I think part of what enables the US to sanction them (under US law) is the fact they are neither US citizens nor residents; if they were US citizens living in the United States, I don't think the President would have the legal authority to impose those kinds of sanctions.

They could sanction Hetzner–because it is a German firm based in Germany. I don't see how they could sanction a US firm based in the US whose owners and staff were US citizens.

Also, the 5th Circuit Court of Appeals decision Van Loon v. Treasury (Nov 2024) is relevant: it held that IEEPA (the law used to sanction ICC officials) couldn't be used to sanction the Tornado Cash smart contract system, since open source code wasn't "foreign property" under IEEPA.

phs318u · 2026-06-28T23:33:14 1782689594

Swapping the footgun for a huge long-range boomerang doesn’t mean it’s not going to eventually swing around and whack you in the back of the head.

bardak · 2026-06-28T23:54:42 1782690882

100% agree and don't think it will come to that but I won't completely put it past this administration

addandsubtract · 2026-06-28T23:54:50 1782690890

Label AI as porn and the payment processors will cut their ties automatically.

mrandish · 2026-06-28T22:58:09 1782687489

> I’m sceptical they could find the legal framework to do this even if they wanted to

I agree, my only caveat is that the current administration has shown it's willing to go beyond aggressive regulatory interpretations to questionable and outright implausible interpretations. As we've seen recently, the federal courts and SCOTUS are overturning most of these but that can take a year or more to resolve. The one positive light is they seem to push the hardest on certain culture war issues (immigration, voting, districting, etc). AI doesn't seem like a core hot button issue for the White House and there is a strong pro-AI / business faction.

eunos · 2026-06-28T23:29:37 1782689377

OpenRouter or Huggingface should consider moving to Switzerland

gruez · 2026-06-28T20:00:43 1782676843

>GLM export controls incoming?

US imposing export restrictions on a model from China?

mcintyre1994 · 2026-06-28T20:32:56 1782678776

It’d be restrictions on Americans and American companies, and probably also pressure on America’s allies.

mkagenius · 2026-06-28T22:04:00 1782684240

Token smuggler sounds like a profession coming soon. For distillation and stuff.

addandsubtract · 2026-06-28T23:51:38 1782690698

I mean, there are already places where you can buy tokens at 10% of their original cost.

manquer · 2026-06-28T20:10:06 1782677406

While unlikely, it is not without precedent: there are restrictions on ASML, a Dutch company, selling EUV machines.

throwup238 · 2026-06-28T21:35:28 1782682528

That’s because the Department of Energy originally funded and contributed IP to the EUV Corp joint venture between several semiconductor companies (including ASML and Intel). Their ability to export control EUV was part of that original agreement that the entire technology is built on.

verdverm · 2026-06-28T20:22:15 1782678135

ASML complies as an ally, why would China comply?

The weights are already available and downloaded, is it going to be a crime to have them, run them, make them available? Constitutional rights still exist (I hope)

solenoid0937 · 2026-06-28T20:31:55 1782678715

> is it going to be a crime to have them, run them, make them available?

Now you're getting it! Commerce will call it a munition and those harboring it as harboring illegal/foreign munitions.

No business will take the hit, so they will quickly deplatform the models.

No end user has the GPU capacity to use GLM 5.2 or similar models at full precision so the government will call the problem "mostly solved." But they might choose to "make examples" out of a few people using p2p software to download the weights.

verdverm · 2026-06-28T20:36:24 1782678984

Or we use the models to work on fixing vulns and stop over-blowing the doom scenarios. Gotta save the kids and kill the terrorists though!

I'm for making software better instead of banning it based on what the rich and powerful claim.

I suspect the real fear is that open weight models undermine the financials and token prices they thought were going to pay off their ludicrous spending because they have all raced and raised hardware prices.

hadlock · 2026-06-28T21:40:14 1782682814

> making software better instead of banning it

We're still in the middle of the cambrian explosion.

If Anthropic was capable of developing Opus 4.49-4.5 2H 2025.... then any company with a research team capable of reading all the papers and press releases will be capable of producing Opus 4.8 by the end of 2027, either raw model competency, or in a harness like claude code (or better with both). I guess what I am trying to say is that Opus 4.5 does not represent the edge of agentic capability, merely somewhere in the thick meaty layer of "functional and achievable".

We can draw the line at Sonnet 4.6 in the US but much like encryption export restrictions in the 1980s, the line drawn will be laughably low within a few years and simply unthinkable in a decade.

solenoid0937 · 2026-06-28T20:43:29 1782679409

> making software better instead of banning it

That would be the rational thing to do.

> financials and token prices

I do not think the government thinks this deeply. Market manipulation might be a rational, if unethical reason to ban open source models.

But this admin banned Anthropic models to "own the libs." They will continue to ban what they want for whatever reason they want. I don't think those reasons will be particularly coherent.

verdverm · 2026-06-28T21:15:39 1782681339

Yeah, the current admin is reactionary, they appear to put little thought in, or at least disregard input they dislike. I don't think Ant's ban was about "owning the libs" as much as it was asserting dominance over someone who spoke up counter to the admin's aims and claims. They do listen to money, which is where I see Big Ai paying for executive orders (because the admin forgot what it means to compromise as part of legislating for all americans).

manquer · 2026-06-29T01:05:02 1782695102

That too has precedent; there is a long history of controls on cryptographic algorithms up until the 90s. It wasn't abstract either: older greybeards will remember that browsers like Netscape had two versions, International and U.S., for this reason.

If you classify AI as a weapon, which seems to be the direction that we are all heading towards, then yes, First Amendment rights likely won't apply.

matheusmoreira · 2026-06-28T21:00:50 1782680450

> is it going to be a crime to have them, run them, make them available?

Yeah. Illegal numbers.

fragmede · 2026-06-28T23:06:56 1782688016

DeCSS was short enough to fit on a t-shirt. Americans are larger these days, but not by enough to fit a decent LLM's weights on an XXXXL shirt, even double sided.

Art9681 · 2026-06-28T23:06:11 1782687971

They can easily issue an order for any American company to stop hosting/serving the models. If the model was a threat to national security because of its capabilities then a lot of other countries would follow, including China. No nation will allow some vibe coder with a rogue AI to pose a threat to their systems.

The reason GLM-5.2 hasn't been banned is that despite these cherry picked use cases, GLM-5.2 isn't even close to Opus in all use cases. These vibe benchmarks are run by companies that are not part of the cyber services offered by Anthropic and OpenAI where they can use the models without the safeguards and refusals so their actual cyber capabilities can be utilized.

These guys that wrote the article compared a gimped Opus to GLM-5.2, knew full well it's misleading, and got the clicks regardless. They don't have enough clout to be a part of something like Project Glasswing, GPT Cyber, etc.

fph · 2026-06-28T21:00:19 1782680419

How would that even work for an open-weight model?

bardak · 2026-06-28T23:08:02 1782688082

Go after the hosts, 99% of people won't be able to run this locally even if they wanted to.

djeastm · 2026-06-28T21:35:11 1782682511

I think state-of-the-art AI is going to be defense industry only from now on. We can have our toy drones but not the Predators and Reapers.

Gigachad · 2026-06-28T21:46:11 1782683171

Turns out toy drones are more useful in war than multi million dollar planes anyway.

techpression · 2026-06-28T21:57:09 1782683829

Reaper and Predator are both drones, and there’s really no comparison to toy drones in terms of sheer destruction and capabilities in general. The comparison is actually quite apt imo.

solenoid0937 · 2026-06-29T01:18:59 1782695939

You're right. Toy drones have proven vastly more effective IRL.

The others are a waste of taxpayer money. Extraordinarily low return on investment (kill-on-investment?)

fragmede · 2026-06-28T23:07:51 1782688071

Which ones are the ones Ukraine has used to bomb Moscow?

serf · 2026-06-28T21:41:07 1782682867

the things that empower modern toy drones were export restricted for years beforehand.

mullingitover · 2026-06-28T23:11:13 1782688273

Obvious answer: build all your open source LLMs into firearms, get the SC to grant 2A protections.

dakolli · 2026-06-28T21:57:57 1782683877

Cool then everyone will just change their config to route through a provider overseas for an added 50-100ms latency. Who cares.

solenoid0937 · 2026-06-29T01:20:24 1782696024

Countries and businesses that don't want to be sanctioned by the US government or the US financial system care - so all western countries and corporations.

jackdawed · 2026-06-28T23:57:24 1782691044

I use GLM 5.2 via Neuralwatt and it's gotten so cheap I wouldn't mind cancelling my personal Claude subscription if work gave me one. I've spent 374M tokens this month and it only cost me $18 on energy-based pricing.
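For scale, here is a back-of-the-envelope sketch of the effective rate those figures imply. It assumes the $18 covers all 374M tokens, input and output combined; the function name is just illustrative.

```python
def usd_per_million_tokens(total_usd: float, total_tokens: int) -> float:
    """Effective blended price in USD per one million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_usd / (total_tokens / 1_000_000)


# Figures from the comment above: $18 for 374M tokens in one month.
rate = usd_per_million_tokens(18.0, 374_000_000)
print(f"${rate:.3f} per million tokens")  # prints "$0.048 per million tokens"
```

That works out to roughly five cents per million tokens, blended.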

cmrdporcupine · 2026-06-29T02:16:26 1782699386

How's the reliability and speed?

theptip · 2026-06-29T03:24:51 1782703491

But… what effort level? “Opus 4.8” is a massive capability range. If you just ran it on medium, that is a completely different result than max.

sidcool · 2026-06-29T03:24:27 1782703467

Genuinely curious. Say GLM 5.2 is better than Opus. But how does one go about using it by themselves?

croemer · 2026-06-29T00:21:59 1782692519

They should also at least run Opus through the same Pydantic harness they used for GLM. As is, it's apples vs pears.

Where's the cost per vulnerability for all the models other than GLM?

Also, without code this isn't very trustworthy. Could all be made up as well.

XCSme · 2026-06-29T01:03:16 1782694996

Does a bit worse than Opus 4.8 in my tests[0], but it's 5x cheaper and 3x slower.

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...

XCSme · 2026-06-29T02:07:44 1782698864

Note that being open-weights, "slower" is relative, as it depends on who's serving the model. This can drastically change over time too.