Hacker News
red75prime
64 days ago
on:
SWE-bench Verified no longer measures frontier cod...
Great on small snippets of code, passable on larger pieces of code, great at finding vulnerabilities in large pieces of code, terrible at Zork. All in all, a jagged frontier that defies a simple sarcastic characterization.
girvo
64 days ago
Very kiki, not very bouba, as Aphyr rightly stated.