Local Qwen isn't a worse Opus, it's a different tool

(blog.alexellis.io)

240 points | by alphabettsy 8 hours ago

21 comments

  • glerk 6 hours ago
    If you play with these models long enough, you realize there is more to them than just "model X is smarter than model Y" or "model Y is cheaper than model Z". They are different tools and the prompting technique is different. It is very much like playing an instrument.

    With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.

    With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.

    With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.

    This is not scientific at all, just vibes, YMMV.
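
One way to read "give it a shape and let it fill it in" is a prompt that carries an explicit output skeleton plus a few worked examples. Below is a minimal Python sketch under that assumption. The function `build_shaped_prompt`, the tag names and the commit-classification task are all hypothetical illustrations, not a documented Qwen prompt format.

```python
import json


def build_shaped_prompt(task, schema, examples):
    """Assemble a prompt with XML-style sections, a JSON skeleton, and worked examples.

    The tag names are illustrative assumptions, not an official prompt format.
    """
    parts = [
        "<task>", task, "</task>",
        "<output_shape>", json.dumps(schema, indent=2), "</output_shape>",
        "<examples>",
    ]
    for ex_input, ex_output in examples:
        parts.append("<example>")
        parts.append("<input>" + ex_input + "</input>")
        parts.append("<output>" + json.dumps(ex_output) + "</output>")
        parts.append("</example>")
    parts.append("</examples>")
    return "\n".join(parts)


# Hypothetical task and example, purely for illustration.
prompt = build_shaped_prompt(
    task="Classify the commit message.",
    schema={"type": "feat | fix | chore", "summary": "string"},
    examples=[("Fix crash on empty input", {"type": "fix", "summary": "handle empty input"})],
)
print(prompt)
```

Whether this structure actually helps a given model is something to measure on your own tasks rather than assume.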

    • dkersten 2 hours ago
      > This is not scientific at all, just vibes, YMMV.

      This is the problem.

      I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.

      • coldtea 1 hour ago
        >I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.

        Think of it less like a static tool, and more like a human helper, where the same holds.

        • cassianoleal 11 minutes ago
          They are not human. Humans have names, faces, voices, personality, a personal history, family, care for whatever they call their community.

          With humans it's actually good and worthwhile to create and strengthen connections. With an LLM, that's psychosis.

          • scotty79 6 minutes ago
            If you have a toolbox full of similar but different tools, getting to know them is a prudent thing to do, not a psychosis. There's no connection because the tool is immutable (except for adjustments you made) but you do develop a specific relation with that tool. Some people even love some of their tools at some level.

            And if humans are anything, they are tool users.

        • ACCount37 1 hour ago
          One issue with that is that human helpers last longer. LLMs cycle in and out in months, and what held for Your Favorite LLM 6.7 may not hold for Your Favorite LLM 6.9.
        • madeofpalk 1 hour ago
          Except, where every different model and version is like a different person where you need to learn their idiosyncrasies of how they work every other month.

          It's a very very bizarre way to use a computer.

          Personally, I just don't. I'll use and prompt the LLMs the way that feels natural to me and move on with my life. Maybe I don't always get completely optimal results from them, but I'm also not spending half my day pleading with the computer to do a task.

        • gib444 52 minutes ago
          No, I won't anthropomorphise LLMs.
        • dreambuffer 28 minutes ago
          Please do not think of LLMs like human helpers, that is a recipe for long term sociopathy.
      • couscouspie 2 hours ago
        That would be ideal, but AI is less like a tool and more like a human in this regard, and you don't have character sheets for each of your colleagues either.
        • supergarfield 1 hour ago
          If my coworker was part of a clone series of 100 million units, requesting a character sheet would be pretty reasonable
        • bluegatty 1 hour ago
          These are $1 trillion companies that can't produce explicit details on how their products work? It's nonsense.
      • amelius 2 hours ago
        Yes, but benchmarks can be gamed.

        Maybe we need better reviewers then?

      • dotancohen 2 hours ago
        Honestly, the differences between AI models always felt to me like the differences between coworkers or job candidates. They don't all share the same strengths and weaknesses - and they all have both good days and bad days.

        Realising this made me respect the "I" in "AI" a bit more seriously.

    • weitendorf 4 hours ago
      One thing I used to test quite a lot was rerunning the exact same prompt on the same input, or semantically equivalent (in my mind) but differently framed or worded input, and seeing how much they diverged. In particular I’ve done this quite a lot between Sonnet vs Opus and across Qwen models.

      I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model version or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.

      There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).

      The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…
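
The rerun experiment described above can be approximated by sampling the same prompt several times and scoring how similar the outputs are. The sketch below uses only the Python standard library. `fake_model` is a hypothetical stand-in for a real model call, and character-level similarity from `difflib` is a crude proxy chosen for illustration, not the commenter's method.

```python
import itertools
import random
from difflib import SequenceMatcher


def mean_pairwise_similarity(outputs):
    """Average difflib similarity ratio over all pairs of outputs (1.0 = identical)."""
    pairs = list(itertools.combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def fake_model(prompt, rng):
    """Hypothetical stand-in for a real model call; replace with your own client."""
    endings = ["use a library parser", "write a custom parser", "ask the user first"]
    return prompt + " -> " + rng.choice(endings)


rng = random.Random(0)
runs = [fake_model("parse this format", rng) for _ in range(10)]
stability = mean_pairwise_similarity(runs)
print(f"mean pairwise similarity over {len(runs)} runs: {stability:.2f}")
```

A low score on identical input is a hint that an apparent win from a prompt tweak may be sampling noise; surface similarity will not catch outputs that are worded differently but mean the same thing.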

      • movpasd 1 hour ago
        I've not done particularly rigorous testing, but I've done this a lot with Claude to get a feel. What I've noticed is for certain open-ended tasks, Claude is extremely primeable: it will pick up on minor differences in wording in your prompt and run with them hard.

        It can be frustrating. The AI pretends to be a human, and so a part of my brain expects them to commit and have a "parti pris" like a human, so the exercise is a good reminder of the feedback loop. My mental model is that before the first three or four messages, the model has many finer points of its personality still underdetermined. I'd suggest that as the mechanism for "role-based prompting". And it explains the "savant sleeper agent" thing you describe. You want to get the state in the right attractor on the manifold.

        These machines are pretty incredible, but for conversation-driven workflows you really have to be in the driver's seat. A human has a property that the AI does not have, at least under current architectures: we are regulated by the outside world. A bit of a tangent, but I can see how AI psychosis arises from these dynamics.

      • evntdrvn 27 minutes ago
        One thing that I learned when doing raw API LLM usage is how drastically the results can vary call per call with exactly the same input. I think that on average, people using agents underestimate how much the results from a given turn or command vary, and so overindex on "X technique worked well" or "if I do Y then this will happen" or even "it did Z task well last time so it will this time too" or "{Model} is great at {thing}"
      • dotancohen 2 hours ago
        > We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.

        Any chance you could share some of these? Seems like something we could all benefit from.
        • weitendorf 1 hour ago
          Sure, my company has been working on a broad swathe of infrastructure projects and developer tools, which requires prompting models to seek out other tools/apis/docs/examples but in a way where we can't just dump all the context on the model up front. We also need the models to oftentimes look up technical documentation and specs, and sometimes build custom parsers for specific documentation websites that only make the data available embedded across 200+ pages of html.

          First, I almost always try to seed every new project or context/domain with canonical technical specifications or examples I found elsewhere. When I set up this project recently, I linked to a bunch of the official Apple docs for sysctl, and told it to use a specific technique for calling assembly code from Go, that from experience it almost never realizes it can do or knows about (and similarly for sysctl, I knew it kinda sorta knew about it, but not in its entirety): https://github.com/accretional/sysctl/commit/da52438233e5b33...

          The other thing I did was tell it to enumerate all the test cases ahead of time rather than to just directly implement them; again this is something where you have to explicitly tell it to go digging for information where it has blind spots and get it to set up properly grounded self-eval in a way that it can test against. I usually tell it to take notes as it works or commit notes to itself that will persist over sessions: https://github.com/accretional/sysctl/blob/main/FINDINGS_2.m...

          Once we get back to working on this project we'll just have it implement / validate the rest of the sysctl feature support against the full inventory we had it uncover: https://github.com/accretional/sysctl/blob/main/cmd/darwin-n...

          Another thing we do is have it specify an API that it can produce against; then in other projects we have them consume the API via reflection (and our special sauce we've been working on is the ability to discover and integrate against these automatically across thousands of APIs from many providers, which we've got working and can share if you're interested in using it as an early customer): https://github.com/accretional/sysctl/blob/main/proto/sysctl... This isn't the greatest example because it doesn't actually fully specify the sysctl keys yet. But I did have it create a knowledge base trying to cover the 1000+ keys as best as it could, to reference as it continued: https://github.com/accretional/sysctl/tree/main/macos-sysctl...

          We have a better example in eg https://github.com/accretional/proto-sqlite/tree/main/lang where we were able to encode the entire sqlite grammar into a grpc interface so that you could eg find the exact structure (and sanitize) of a select statement: https://github.com/accretional/proto-sqlite/blob/main/lang/p... This way integration and discovery becomes a matter of telling it "use reflection against this endpoint to discover the sql interface, then implement against it" and we can model formats/input validation as formal grammars via EBNF (all magic words) vs just adhoc

          We also tell it to set up and use a browser automation toolkit/testing and always run it at the end of testing workflows (often in a way that auto-opens screenshots on our local machines + commits them to git) via tools like https://github.com/accretional/chromerpc#headlessbrowser-aut... so that whenever we produce UIs it can evaluate its own output and iterate without direct human intervention. This is another case where the knowledge-discovery problem becomes a problem so we tell the models to use reflection to discover the browser automation apis. That ends up giving us things like this where it records user journeys through sites and creates visualizations without us having to debug them or do them ourselves: https://github.com/accretional/proto-css/tree/main/chrome-te...

      • mnicky 2 hours ago
        If the benefits of using the model you've come to know well outweigh the disadvantages, you can continue using it even after the release of a successor model, right?
    • h05sz487b 4 hours ago
      > It is very much like playing an instrument.

      Or it is more like playing a slot machine and you imagine the rest.

      • cube00 4 hours ago
        This is how I feel whenever I see bold all caps instructions in a system prompt or someone claims they conducted "research" and found the magic prompt template that makes the model pay out.

        Maybe it works some of the time but it isn't a solution that works every time.

        It reminds me of people hovering to play a slot machine when someone gets up and it hasn't paid out as if they've solved slot machines.

        While I don't mind putting something in a loop until the tests pass, I'm less comfortable doing that when providers are silently rerouting to lower quality models, or in Google's case burning quota faster to ease their own server load without being transparent about what the "standard limits" are to begin with. [1]

        I'm hopeful I'll be more comfortable with these "slot machines" when frontier models get to the point where they can be run locally on hardware I can actually afford so I know exactly what I'm getting and not jumping at shadows with providers playing tricks behind the scenes to ease their own load without admitting the customer is getting less for their money as they get more popular.

        [1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...

        • coldtea 1 hour ago
          >This is how I feel whenever I see bold all caps instructions in a system prompt or someone claims they conducted "research" and found the magic prompt template that makes the model pay out. Maybe it works some of the time but it isn't a solution that works every time.

          For such a thing to be useful, it's enough that it works substantially more often than not having those instructions in.
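
"Substantially more often" can be checked rather than guessed by counting passes with and without the instruction and comparing the two rates. The sketch below uses a standard two-proportion z-test from basic statistics. The counts are invented for illustration, and the function name `two_proportion_z` is just a label chosen here.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Z-score for the difference between two success rates (pooled standard error)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both rates are 0% or 100%: no measurable difference.
        return 0.0
    return (p_a - p_b) / se


# Invented counts for illustration: passes out of 50 runs with and without the instruction.
z = two_proportion_z(success_a=38, n_a=50, success_b=27, n_b=50)
print(f"z = {z:.2f}")  # |z| above roughly 1.96 suggests the gap is unlikely to be noise at the 5% level
```

With only a handful of runs per variant the test has little power, so a few dozen samples per side is a more realistic starting point.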

      • hodgehog11 3 hours ago
        A poor analogy depending on the setting because you can't adjust the odds with a slot machine, and the ROI is negative by design. If that's your experience, yeah, I wouldn't use an LLM either.
      • ramon156 3 hours ago
        Instruments are pseudo-random until you know what you're doing. Slot machines are just slot machines
        • Forgeties79 2 hours ago
          Musical instruments are not random. You’re just doing random inputs. Instruments are consistent, even if the “flavor” and quality varies with different builds.

          Playing a B on a saxophone always plays a B.

          • dotancohen 2 hours ago
            Saxophone, being a wind instrument, was a bad choice. I can definitely tell which student was blowing when hearing a note.

            But your analogy remains solid if you substitute e.g. a piano and a reasonably proficient player. A single note would be nearly indistinguishable between players... But a full piece most certainly will sound different.

            • palata 1 hour ago
              While I agree with you, I think it's diverging from the initial point.

              The original take was "LLMs are very much like playing an instrument". I think they are very much NOT like playing an instrument.

              While different musicians will produce different results, one musician won't get drastically different results on different days or when trying a different "copy" of the same instrument. If you can play the violin on your violin and I lend you my violin, you will still be able to play very consistently. You may argue that the sound will differ and you will have to adapt slightly, but that's not remotely similar to the randomness coming from LLMs.

            • Forgeties79 1 hour ago
              A poor B is still a B fingering and the sax is supposed to play a B every time. Missing it is human error, not tool error. I can pick up an alto sax, a clarinet, etc. any time, anywhere, and expect the same fingerings to work every time. My individual skill or mistakes or peculiarities of each build are not what is relevant here.

              LLMs do not operate consistently and make their own errors while we argue about which incantation makes them less inconsistent, knowing they will never actually perform as expected.

              I played woodwinds regularly for 15 years so I feel fine with my example.

      • glerk 4 hours ago
        It is a bit of both. A non-deterministic instrument and a predictable slot machine.
      • psychoslave 4 hours ago
        I play slot machines as instrument! ;)
        • dotancohen 2 hours ago
          Roger Waters and Nick Mason were playing the cash register in 1973!
    • clhodapp 2 hours ago
      While the gist of what you say is true, it is hard to get very good at treating them as instruments when they keep getting replaced with new, ostensibly-better versions every few months. But those new versions are not strictly better. They are mostly-better while actually having different strengths and weaknesses.

      It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.

    • stingraycharles 6 hours ago
      I agree with your general gist, and in general it’s “the best tool for the particular job”, keeping token spend and other things in mind as well.

      What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.

      • willtemperley 4 hours ago
        Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

        With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.

      • sanderjd 5 hours ago
        I share this sense, but my immediate thought is that we need to improve the evaluations! Do you think this is impossible? That there is something indelible that it is not possible to capture empirically? I kind of have this intuitive sense that it is this way, but simultaneously I think that it's unlikely to really be true.
        • theshrike79 4 hours ago
          We shouldn't just measure the power of the raw LLM, harnesses matter more and more.

          It's like taking the engine out of each car, putting it on a test bed and running it, and then making a decision whether the car is good or bad based on the graphs the test bed provided.

          You might have the best engine in the world, but if you put it in a shit car, the result is still bad. The seats are squeaky plastic, the infotainment is touch-only and you can't put on your seatbelt without knocking down whatever is in the cupholder.

        • gbalduzzi 4 hours ago
          Following the original comment concepts, if every model requires a different prompting technique to maximize its output, how can a benchmark based on sending the same prompt to all models be accurate? We should create different prompts for each model, but then how reliable and unbiased can the benchmark be?

          It is a fundamentally hard problem to solve

        • Forgeties79 48 minutes ago
          The reason we can’t capture it empirically is that nobody truly knows exactly what we are supposed to be using these tools for or how they are going to operate. We are still fitting squares into holes with them. We are told to treat them like some bespoke tool for coding, shopping, tech-support, etc. But it is not actually purpose built for any of these things.

          When I use a calculator, I know exactly what it does and what it is supposed to do. It always gives me a verifiable, predictable result. If I input “8+8” 10,000x it will give me “16” 10,000x outside of incredibly fringe edge cases/bugs. I can’t say the same for LLMs

      • dv35z 5 hours ago
        What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.
        • sixtyj 4 hours ago
          The issue with LLM benchmarks is similar to the one with car benchmarks. E.g. journalists almost always get the fully equipped model, so their review is honest but sort of rigged.

          I haven’t seen details of LLM benchmarks’ data sets but I would suppose that “questions” are public so known in advance therefore you can tune a model as much as possible.

          One real benchmark is drawing a pelican - https://github.com/simonw/pelican-bicycle - Simon Willison made it for his LLM tests.

          If you really want to find a model that works for your specific purpose, I would recommend several rounds at arena.ai - it helps you find a model anonymously, without confirmation bias.

          Some ppl: Claude is the best! Others to them: but Qwen is the best! Or… Codex is better! …

          it all depends on the language (English, Dutch, French…), style of querying (caveman, specs, skills, goal etc.)

          Even with the same model I get different answers to same prompt that is just tweaked a little.

          So benchmarks are nice but mostly useless.

          Without your usecase it is just a reference number indicating the approximate position of that model among the others. And for those who want to make money it is a marketing tool to sell more as every customer counts.

        • da-x 4 hours ago
          Maybe someone can devise a distributed bench-marking system where multiple people collaborate on tests and also vet each other's tests and rating without revealing them to the public.

          I have my own "interview questions" for models where I give them a premade Git repo and a problem to solve. Then, I rate them like a teacher. I believe others do that as well, so we only need a reliable system to aggregate these results.
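
The aggregation step could be as simple as collecting each rater's private score per model and publishing only a robust summary. The sketch below assumes a hypothetical `(rater, model, score)` record format and uses the median so that one outlier rater does not dominate. The rater names, model names and scores are made up.

```python
from collections import defaultdict
from statistics import median


def aggregate_ratings(ratings):
    """Collapse (rater, model, score) tuples into a median score per model."""
    by_model = defaultdict(list)
    for _rater, model, score in ratings:
        by_model[model].append(score)
    return {model: median(scores) for model, scores in by_model.items()}


# Hypothetical private ratings; only the aggregates would be published.
ratings = [
    ("rater1", "model-a", 7), ("rater2", "model-a", 9), ("rater3", "model-a", 8),
    ("rater1", "model-b", 6), ("rater2", "model-b", 6), ("rater3", "model-b", 10),
]
print(aggregate_ratings(ratings))
```

This only covers the arithmetic; keeping the underlying tests private and vetting raters is the hard part and is not addressed here.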

          • microtonal 3 hours ago
            The problem with proprietary models behind APIs is that they could have saved your benchmark for future training though.

            The only way to make it fair is to have the model provider give some benchmarking org the weights + inference engine, so that the model can be run in complete isolation and no information about the benchmark is leaked.

            Though I guess for a 'random' person's benchmark that hides between all other requests it's probably ok.

        • theshrike79 4 hours ago
          You can't measure "feels".

          One good analogy is the Macbook vs generic windows laptop debate online.

          The engineer mind just compares numbers, the Lingwoo laptop from Amazon has biggest numbers for everything and the lowest price. Ergo it is the best.

          But the numbers don't measure the fact that the Lingwoo creaks and squeaks when you lift it due to the cheap plastic. It also runs at 100C when both CPU and GPU are fully utilised. The keyboard feels like a membrane keyboard from a milspec device from the 90s. Numbers also don't measure the fact that Lingwoo is an alphabet soup whitelabel manufacturer that won't exist in any legal capacity in 6 months so good luck with any warranty issues.

          There will be an identical laptop called Chongwin being sold though. Completely different company, definitely.

          --

          The same applies to LLMs. You can do benchmarks like ask them to one-shot different kinds of gotcha questions (car wash, strawberry and other idiotic ones) or get them to write different kinds of programs.

          But that doesn't measure the UX of doing so at all. How many times do you actually need any of those when you're actually working?

          It's like unit testing an application. Every function can have 100% test coverage and the app can still be shit because there are things you can't unit test for.

          • psychoslave 3 hours ago
            > You can't measure "feels".

            One can always measure whatever they wonder about. It doesn't mean the measure will be trustworthy, or that anything built on it will be any better than wet-finger judgement.

            • theshrike79 1 hour ago
              Feels are just opinions and taste. It's like art and music, you can't quantify either to a mathematical formula or an absolute test of which is good.

              Even songs that break the "rules" of music can be subjectively good, either because they broke the rules or despite it.

              Or with cars, a car that's beautiful to one person is the ugliest piece of trash on the street. Some people want a super soft ride where their espresso martini doesn't even vibrate when gunning it through a gravel road and others want to feel every grain of sand on the asphalt in their buttocks. Neither is "correct" and there is no objective measurement for ride comfort.

    • vkazanov 5 hours ago
      The problem is not that there are details; the problem is the constantly shifting ground. We can only rely on a harness to be sort of predictable, but the models change all the time.
      • rkuska 4 hours ago
        It's the system prompts that change all the time, especially in Claude Code.
    • visiondude 4 hours ago
      while not scientific this has been my experience as well. i will add that language specificity in word choice is also a learned behavior. for example, the word “investigate” vs the phrase “look into”. You will find the outputs are quite different. can you guess which will use more tokens? it’s stuff like this that actually sets people apart in the top percentile of using these tools
      • qsera 4 hours ago
        Mmm..interesting..So now people are finding behavior patterns in LLMs which are trained on behavior patterns of people...
    • theshrike79 4 hours ago
      Yep.

      IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.

      BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.

      My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]

      As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.

      But I won't give any creative open-ended tasks to any other model than Claude.

      [0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...

      • weitendorf 3 hours ago
        The parsing thing, or the willingness to instantly drop into janky unsanitized string manipulations, or to constantly push back against work on infra projects because some random package on GitHub has 200 stars so it’s totally the safer approach, is driving me insane.

        On one hand I’m glad Anthropic is only just now starting to get into infrastructure because it means there’s opportunity there, but it’d be great for their models to be more knowledgeable or able to seek out that knowledge on their own, or for the UX of Claude code to be more amenable to launching 5 in parallel and picking the best one, so I don’t have to spend time arguing with a robot. I think there’s a much better balance to strike between just charging ahead towards the goal at all costs vs being lazy and pushing everything back up to the user. Basically they write too much code that’s too contingent/brittle outside its exact current context and don’t do a good job distilling out the essence of the problem “cleanly”. Almost all of them are like this right now, it’s partially a problem with long-range planning but I think a real bias from over optimization for certain RLVR outcomes vs others.

        • tym0 3 hours ago
          I feel like this is really due to the harness.

          Gemini CLI at work has the same issue: it'll prefer hacking your workstation over just asking you how to proceed.

          I think the harnesses are set up to have a bias to action, otherwise the LLM would just stop all the time when doing trivial tasks, but it also means they'll keep going when the "obvious" path is to just prompt the user.

          • weitendorf 2 hours ago
            While I agree that the harness is part of it, I think it's also a lack of epistemic understanding or awareness for what it means to actually solve a problem vs just get something kinda working; maybe if Claude Code or other harnesses made web search more likely or had a better way to make technical documentation and specs available to models, it would be better solvable there.

            I often tell it to stop asking me and just keep going until it accomplishes X task; unfortunately it tends to assume I want something that only just barely works, in the sense that it means it's time to stop once it's there, which I don't think a harness by itself could easily address (ultimately the model itself needs to determine the stopping points unless I literally specify by hand hidden evaluation criteria).

            That's why I think it's at least partially a training issue where the model gets rewarded for "solving" the problem within a certain amount of context/time without access to grounded knowledge (eg looking up the actual spec for a format) nor adversarially/rigorously evaluated against a reviewer capable of finding all the edge cases/shortcuts preventing something from being a properly generalized solution. I don't want it to ask me for guidance when it's working on a well-specified problem, I want it to either find the right parser and use it, or to completely implement one against the spec, rather than write some half-assed string inserter that eg only works on the specific select statements my examples use right now. My understanding is that the Mythos/Fable models were better trained for this but from my brief foray into using Fable for work I wasn't that impressed. For me they need to get better at agentic search and self-eval still

            • theshrike79 1 hour ago
              There are still billion dollar opportunities in the harness/LLM space.

              Having a reliable shared memory for hundreds of agentic AI users is something that's 95% snake oil at the moment. There are a few successes on an individual level (I really like Hermes[0]) but nothing scales to a company level easily.

              It should be possible to (pre)configure all agentic harnesses used in a company to use a single source for information so that it'd automatically pick up internal libraries, conventions, licensing decisions etc and remember them across sessions.

              I've had limited success with this on a personal level, but it's still not ingrained in the model because it would really need a custom harness. Hooks, skills, prompts get you like 80% of the way. I still need to do a "please check that the project matches the conventions defined in ..." regularly to catch any drift - especially on more vague stuff that can't be locked down with unit testing.

              [0] https://hermes-agent.nousresearch.com

    • bandrami 53 minutes ago
      I think this goes beyond "vibes" to cargo-culting. It's why nobody's ever able to actually show ROI from LLMs
    • baq 2 hours ago
      +1.

      this is what 'tokens are commodities' and 'there is no moat' people miss. the models are in general not easily swapped out. you always have to run evals before you can swap them around, tune prompts etc. even minor versions of models from same providers need this process.
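
      A rough sketch of what such a pre-swap eval gate can look like, assuming each model is wrapped as a plain function from prompt to text. `EvalCase`, `pass_rate` and `safe_to_swap` are made-up names for illustration, not any provider's API:

```python
from dataclasses import dataclass
from typing import Callable, List

Model = Callable[[str], str]  # hypothetical wrapper: prompt in, completion out


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of eval cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def safe_to_swap(current: Model, candidate: Model, cases: List[EvalCase],
                 tolerance: float = 0.0) -> bool:
    """Allow the swap only if the candidate does not regress beyond the tolerance."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) - tolerance
```

      The cases and checks are where the real work is; the same set would need re-running after any prompt tweak or minor version bump.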

    • hashmap 5 hours ago
      totally true. one key for claude is to not smell like an evaluator, it's good at knowing when it's being tested and will behave defensively and avoid doing work. i avoid this basin by typing unreasonably excited about the thing i want done. like way over the top. it's harder to keep that up than it sounds.
      • notduncansmith 3 hours ago
        I’m able to avoid this basin with a pretty natural baseline professional positivity and frustration management that I would employ with pair-programming. For example, if I just made progress with a human I was guiding through a task, I would be like “Nice, now let’s xyz” (instead of just “now let’s xyz” as if _I_ were the robot lol) or if we had to work for a result I’ll be like “Sweet! Looks good, now let’s xyz” - this is important signal for humans, and the same is true for agents. Also staying emotionally regulated and focused on the goal when things don’t work as expected or when we haven’t made progress after a few tries at something, critical in human interactions :) and even if it’s my job paying for the tokens, the idea of racking up even a microscopic bill for the privilege of having a machine read my insults and then formulate some credible-sounding blob of apology text is belly-laugh absurd to me. I do try to express my genuine feelings during more vision-oriented planning sessions, and just like with a human, you have to maintain the vibes if you want a genuinely collaborative session to go well. If you are toxic people will become either defensive or aggressive in response. From reading the rest of the front page it seems like we are lucky that Claude is the former, and that we especially best maintain a positive atmosphere around Grok.
      • glerk 4 hours ago
        at the risk of sharing my secret magic spells :)

        > this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>

        can go a long way.

        of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.

      • hansmayer 2 hours ago
        [dead]
    • vorticalbox 2 hours ago
      I use opus for planning and sonnet for coding, but codex for code review.
    • reverius42 6 hours ago
      These are the vibes that power vibecoding.
    • izucken 3 hours ago
      Ah, you are entirely correct to pick this one, yes-yes, it was trained for at least two weeks on a 4k nvidia 9000 gpu cluster in Texas, and RLHF - 500 hungry african students, I can distinctly feel their sweat and sensibilities in every token. I would recommend it with a side of XML, you wouldn't believe how well it fills in all the tags, the attributes, the xmlns, and the sheer volume of it! Choice grade model, you have a good eye!

      Way to reinstate con in connoisseurship. The advent of smellers is nigh, nigh!

    • gateonai 4 hours ago
      [flagged]
  • zmmmmm 6 hours ago
    That's a great write up.

    The one thing I feel it seems to underestimate is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became broadly viable even with frontier hosted models.

    So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.

    • sanderjd 5 hours ago
      Right. Opus 4.5 8 months ago, good enough for agentic coding. How far behind that are open weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?
      • theshrike79 4 hours ago
        The power of Opus isn't just the model, it's in the harness too.

        You can try it by using Opus through Github Copilot vs official Anthropic tools. You'll get very different results and experience (in my opinion).

        • larsnystrom 4 hours ago
          I’ve only used Opus in GitHub copilot and was hugely underwhelmed. It was barely usable. Are you saying it’s better with the official Anthropic tools?
          • theshrike79 1 hour ago
            Night and day in my opinion. But these are all purely Feels so YMMV etc.

            I like how especially the Claude Code CLI version communicates how it's progressing, something they hide a lot more on the desktop app for example.

          • m-ee 3 hours ago
            I don't know about better but it's certainly different. It's painfully slow through claude code vscode extension compared to copilot but maybe "smarter", I feel like I have to correct it less using sonnet on both. I don't use opus much because of the cost but coworkers say the difference between harnesses there is also pronounced.
        • throwa356262 1 hour ago
          open source harnesses are also improving rapidly.

          Some people would claim they are already far better than CC and Codex.

      • theplumber 5 hours ago
        I think in the next 6 months we will have Opus 4.5 performance in open models. We are very close
        • krzyk 1 hour ago
          We need first to reach level of Sonnet 4.x, we aren't at that level yet.
      • marak830 5 hours ago
        GLM 5.2 came out today and the early reports have been quite good. Very difficult to run except on prosumer hardware, but small business could quite easily (or something like open router).
    • 3abiton 3 hours ago
      And a big thing that's missing is ... the harness comparison. It plays a very big role. I use forge, and I have been impressed with what it can do given all the limitations of local models.
    • rippeltippel 6 hours ago
      Since the author is referring to a specific model, I think it makes sense to ignore how the model (or local models in general) may improve over time.

      It's like buying a car: I drive that car and get attuned to its characteristics; I don't think how that car (or similar cars) may improve. That's my tool and I want to make the most of it.

      It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of one, which may not carry over to a newer version of that model.

    • appplication 6 hours ago
      Agree 100%, even on claude 4.5 being the turning point for agentic coding. It completely turned me around on it.
  • gpt5 6 hours ago
    This article is a good summary of local models. Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work. The reality is that they are rather limited, would not do well on a long or complex task, and are prone to fall into loops, forget their tasks, etc. Not mentioned in the article is that they are also rather expensive - not just for the hardware cost, but also electricity. These 3090 and 5090 machines are pretty power hungry, and these models are pretty slow on these machines, making them consume more power per token.

    Where they shine is in your ability to control them, their privacy, their predictability (e.g. if you are doing a repetitive task, like classifying your photo/video library), and depending on your energy bill - their costs.

    • usernomdeguerre 6 hours ago
      I believe that local models are a necessary extension of the personal computer and I imagine that one could have had similar criticisms of early personal computers.
      • pmontra 5 hours ago
        Of course the early MSDOS PCs were loud and power hungry. I can't remember the specs but according to Wikipedia the IBM PC with an 80286 had a 192 Watt power supply. I don't remember if by then we had internal hard disks or we still had to buy a case as large as the one of the PC with a 10 or 20 MB disk inside. It was handy to raise the monitor further up.
    • regularfry 3 hours ago
      I've been getting 40-50t/s out of qwen3.6:27b on a 4090 limited to 350W with the MTP changes that went in. That comes out at 8.75J/t at the upper end. No idea how that compares with anything else out there. I'd expect a 5090 to be a bit cheaper because it'd be faster within the same power limit.
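
      For anyone checking the arithmetic: a watt is a joule per second, so energy per token is power divided by throughput. A quick sketch, assuming the card really draws its full 350 W limit during generation (real draw may be lower, so treat this as an upper bound):

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: watts are joules/second, so divide by tokens/second."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


# Figures from the comment above: 350 W power limit, 40-50 tokens/s.
worst = joules_per_token(350, 40)  # slower end of the throughput range
best = joules_per_token(350, 50)   # faster end of the throughput range
print(f"{best:.2f} to {worst:.2f} J/token")  # prints "7.00 to 8.75 J/token"
```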
    • theshrike79 4 hours ago
      My dream would be a local model that can do, say, 80% of the day to day tasks I need; "how does X Handler connect to Y storage?", "commit that feature, but leave out the bits that relate to billing" etc.

      It would have 99% reliable tool calling - and most importantly - the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere.

      This way all of the simple stuff would be done on-device, gathering data, figuring out the context of the problem etc. And when that's done, the "smart" model would come in to work on the issue when all of the easy stuff is already done.

      It feels super stupid that my /commit skill calls an online model when that is something a local model can 100% do. Mostly this is a harness issue though and mostly solvable.

      • redrove 3 hours ago
        > My dream would be a local model that can do, say, 80% of the day to day tasks I need; "how does X Handler connect to Y storage?", "commit that feature, but leave out the bits that relate to billing" etc.

        Qwen 3.6 27B can do that today, but only if set up properly and in a good quant. I run an autoround [0] with weights in int8 and attention heads in f16 on a single RTX 6000 Pro Blackwell Max-Q via vllm with mtp=2 and full context, --max-num-seqs 3, KV in f16, mamba f32.

        >It would have 99% reliable tool calling

        I managed to score 93/100 in tool-eval-bench [1]. For me this is very good already, at least in the pi coding harness I've never had an issue that wasn't auto-fixed in the next turn(s).

        >the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere

        This is heavy on the harness engineering side I think, but also quite contrary to the nature of LLMs today. If you figure this out I'd love to know.

        [0] https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/...

        [1] https://github.com/SeraphimSerapis/tool-eval-bench

      • walthamstow 2 hours ago
        Claude kind of has this already in their Advisor feature. I don't think I've seen it elsewhere. Open harnesses could add this feature and call out to big boy models when required. It's a really great idea.
        • girvo 1 hour ago
          It’s a lot harder to get right than it sounds. I’ve been trying to as a Pi extension, but models are biased to think they’re better than they actually are.

          So far the best results I’ve got have been using a much smaller local model as a simple classifier, that makes a call based on the system prompt and incoming prompt where to route it. It works okay, still a long way to go though
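
          A minimal sketch of that routing idea, with the decision made by a separate classifier rather than by the worker model judging itself. Everything here is hypothetical: `classify` stands in for the small local model, and `local` / `remote` are whatever functions call the two backends:

```python
from typing import Callable

# Hypothetical classifier: takes (system_prompt, user_prompt) and returns a
# difficulty score in [0, 1]. In practice this would be a small local model.
Classifier = Callable[[str, str], float]
Backend = Callable[[str, str], str]


def make_router(classify: Classifier, local: Backend, remote: Backend,
                threshold: float = 0.5) -> Backend:
    """Build a backend that sends hard prompts to `remote` and the rest to `local`."""
    def route(system_prompt: str, user_prompt: str) -> str:
        score = classify(system_prompt, user_prompt)
        backend = remote if score >= threshold else local
        return backend(system_prompt, user_prompt)
    return route
```

          The threshold is the part that needs tuning against real traffic; the sketch says nothing about how good the classifier's scores are.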

    • i_idiot 6 hours ago
      > Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work.

      They really are fantastic for a lot of use cases and I think most people do not need SOTA. When I run that qwen model in my measly 4070 12 GB for my personal email agent that I build and experiment with, I need privacy more than anything else. It does a great job. Even for coding tasks, given you know how to use them instead of dumping a grand plan, it's great.

      • throw310822 4 hours ago
        > I think most people do not need SOTA

        SOTA can code but can also prove theorems and teach you about music theory or ancient Greece's substrate language or botany. Speaking in tens of different languages. I wonder how many hundreds of billions of parameters can be saved just by removing much of the general knowledge parts while keeping logical and programming abilities the exact same.

        • trey-jones 1 hour ago
          Exactly. I have sort of a fetish for trying to make things smaller by trimming out things that aren't needed. Unfortunately this skill has been largely useless since forever, because hardware improves to the point that these optimizations are trivial:

          Network Bandwidth, Storage space and speed, memory capacity. While all of these were worth optimizing for at a point in history, that point is behind us. It's probably a reasonable expectation that it will eventually be true for VRAM.

    • sanderjd 5 hours ago
      But that's current hardware. What about future hardware? What about hardware optimized for inference? What about hardware optimized to run a particular model?
  • hypfer 6 hours ago
    That was a lot of text for me still having no idea what the point of the author was (beside what I can infer from the headline that is).

    I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.

    Does that have anything to do with the topic suggested by the headline? Not sure.

    • neonstatic 5 hours ago
      Everything is an ad these days. The article was not useless, but for the information it provides, it could have been two paragraphs.
      • hypfer 5 hours ago
        FWIW it told me stuff about openfaas. Now I know how to mentally file it and how to mentally file the author. The GitHub profile alone might not have sent the same signal, so this is useful.

        Is it bad software? Idk. Probably not.

        Should you treat it as a grassroots Foss thing maintained by fellow sane hackers? No sir.

  • barrkel 3 hours ago
    I found it interesting that vLLM was dismissed as slower than llama.cpp.

    IME vLLM is quite a bit faster than llama.cpp but where it really wipes the floor with it is in batching concurrent load. The downside is that it is dramatically less flexible in terms of tweaking. It gives you very few options for running quantized weights. It takes a lot longer to start up because it optimizes the compute graph. So for single user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.

    • chartered_stack 2 hours ago
      One could say: vLLM isn't a worse Llama.cpp, it's a different tool
    • krzyk 1 hour ago
      AFAIR the general consensus is (was?):

      - llama.cpp for single user

      - vLLM for multi-user (e.g. enterprises)

      They are similar, but for different use cases.

  • eurekin 2 hours ago
    > The model is running so hot, that it shoots past the goal and starts looping

    later:

    > My latest experiment was setting up vLLM (the gold standard for production and concurrent serving) and even with an NVLink (175GBP) and tensor parallelism turned on, it was 3 tokens/second slower than llama.cpp during generation for an equivalent setup.

    In all my tests, getting vllm to run is worth it. It was the single biggest thing, that helped for looping issues, agents going whack and losing focus on the task, long context being essentially useless.

    FP8 model, unquantized cache in vllm and you have a league better overall experience than with any other stack I tested. Then, you can actually focus on using the model for other things and stop tinkering with settings.

    • trey-jones 1 hour ago
      I'm really curious about this, not because I disagree, but because I want to avoid agents going whack. Are you running vllm for yourself only, or for a team, or for an application, etc? And do you feel there is a minimum hardware requirement for vllm to be useful in this way?

      My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.

  • nessex 4 hours ago
    This is a great post that covers a lot of the recent ground. I have a very similar setup after a very similar journey, minus the RTX6000. Worth noting though that a lot of the recent changes make a single 3090/4090 much more viable here too. MTP and the recent improvements to kv quantization in particular, as well as model-specific template & quant fixes. I run a 4090 with the 4-bit quantized variant of the same model now and have had a great experience. Qwen3.5 was already a big step up, but with 3.6 and the rest of the improvements it's substantially more reliable as a daily use tool and I find myself reaching for hosted models a lot less. Feels like I could work entirely without them if they were to disappear without going back to typing every line of code myself.

    To make 4-bit fit on one card with reasonable (100k+) context needs a bit more care though. And tuning can be highly specific to your machine, gpu and use-case. But I use a headless server, offload multi-modal to CPU, use fit-target to reduce wasted memory and use q8_0 kv since the 4090 performs well with it... In addition to most of the same config as the author elsewhere. I get 50-60tps generation with a power limit of 275W (450W is default), more than enough to offer a roughly Opus-speed feedback loop.

    I haven't seen many of the issues with looping the author mentions. But I did with Qwen3.5 and in particular other 4-bit quants in the past. But the difference is probably a mix of the improvements above, as well as habits changing to avoid cases where models will loop. For what I'm doing, it seems like I loop Qwen3.6 on the same kind of prompts I'll make Haiku or Sonnet loop on (the latter hide some of their existential loops behind "thinking"). Usually it's cause I was too vague about some aspect of what I'm wanting them to do or I forgot to include some context that smaller models just don't have access to in their smaller knowledge base. But at least for what I'm doing (Rust, React, kubernetes) it's not been a notable problem at all with the latest iteration of this whole stack. And knowledge of standard libraries and default k8s resource kinds has been almost flawless.

    There's still plenty of more complex stuff where I'll choose to jump straight to Claude or GLM-5.2, but if it's not worth that jump I've stopped paying for the middle ground as it's usually not much better than just one more iteration through qwen.

    All this to say, if you have a 3090/4090, feel free to give the same setup a go. It's come a long way in recent weeks.

  • krzyk 1 hour ago
    3090 and 2x3090 are quite popular. But if you use a gigantic (for local models) context of 200k it will go south pretty quickly - any quantization of context quickly becomes the issue.
  • teh 2 hours ago
    I sometimes wonder how much of intelligence is being good with tools.

    I feel pretty averagely smart but give me some good tooling like a good editor, a good type system, semantic grep, good testing and some solvers and I can actually deliver some work.

    Maybe the trick isn't 500 billion parameters but a model super integrated with the task at hand for iteration and debugging?

    FWIW the article really mirrors my own experience. I can run a small gemma4 for quick edits (and it's fast!) or data cleanup but for other tasks you do need a different tool (claude).

  • rsrsrs86 56 minutes ago
    Chasing models is, for me, a big yellow flag

    Means underinvesting in engineering

    Look into it

  • whazor 4 hours ago
    Would be interesting to use local models for:

    - tool calling

    - code base exploration

    - anonymizing / abstracting your request

    Such that your local AI communicates to frontier model like an expensive consultant giving high level advice.

    I think due to the lower latency of a local model that this could be faster.
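
    A minimal sketch of the anonymizing step, assuming reversible placeholder substitution. A regex for email addresses stands in for the local model here (an illustrative simplification, and the names `anonymize` and `restore` are made up); a real setup would need the local model to spot names, hostnames, internal identifiers and so on:

```python
import re
from typing import Dict, Tuple

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each distinct email with a stable placeholder; return text and mapping."""
    mapping: Dict[str, str] = {}

    def substitute(match: re.Match) -> str:
        original = match.group(0)
        # Reuse the placeholder if this value was already seen.
        for placeholder, value in mapping.items():
            if value == original:
                return placeholder
        placeholder = f"<EMAIL_{len(mapping) + 1}>"
        mapping[placeholder] = original
        return placeholder

    return EMAIL.sub(substitute, text), mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back into the remote model's reply."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

    Only the masked text would go to the frontier model; the mapping stays on the local machine and is applied to the reply.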

    • asimovDev 2 hours ago
      I used Qwen 27b 8 bit MLX version on a decompiled android APK recently. It successfully identified how it worked, even the obfuscated classes and methods. It wrote a 1000 line documentation with examples but the time was dreadful. At some point it slowed down to 5 t/s so the whole thing took over an hour, the writing of documentation alone was over 40 minutes, fans blasting the entire time.
      • trey-jones 1 hour ago
        I know it uses electricity, but part of the benefit of a local model has to be that you can let it do this while you sleep, and not pay Anthropic for an unknown number of tokens.
        • asimovDev 1 hour ago
          yeah i totally understand and I am thoroughly impressed it works. And the electricity cost isn't that bad since it was on an ARM laptop (MacBook M3 Max) and not a beefy workstation with a GPU. I just let the agent do its work while watching the World Cup.
    • dofm 3 hours ago
      I doubt your experience of local models would be of lower latency, except for quite small models in edge uses.

      In every way, the cloud products from the big two seem optimised for speed and speed of initial response even.

      I don’t think most people are running local models for speed. More for control, privacy, interest, bloody-mindedness and general principle.

  • zkmon 4 hours ago
    The article seems to talk a lot about 27B. In my experience, 35B-A3B is equally good in quality and the MoE gives more tg/s.
  • watt 1 hour ago
    I find it strange that software people will accept this level of flakiness from the hardware. Normally you would just send the card back, and request a replacement.

    > One of the cards would only show up if I crossed my fingers when turning it on. Even reboots wouldn't cure it - I had to A/C power off and remove the power cable each time for 30 seconds.

    This is ridiculous. Of course we are living through supply crunch, but that card is clearly defective hardware.

  • cptskippy 6 hours ago
    I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

    https://github.com/cptskippy/battlemage-llm-gateway

    Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use as a general purpose coder and researcher, and it's about 50:50 with the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.

    • hbbio 6 hours ago
      Interesting setup, thx for sharing.

      How many tokens/sec do you get with 27b? Are you using MTP?

    • askvictor 5 hours ago
      Does Intel make decent GPUs now? I must be out of the loop...
      • speedgoose 4 hours ago
        They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.

        I am not sure whether you can find those in stock anywhere.

    • jauntywundrkind 5 hours ago
      What's the value running the smaller model too? Why not just the big model for everything? I note both are dense, as well.
      • Ritewut 5 hours ago
        Tokens per second. The difference between 8B and something like 16B is not as big as you might think in practical usage and 8B is a lot faster and interactive than 16B but there are certain things where it is useful to farm it out to the large model.
        • Natalia724 5 hours ago
          Agree. For local coding help, latency often matters more than raw benchmark quality. A slightly weaker model that answers immediately changes how often you reach for it.
  • wallkroft 5 hours ago
    >Local Qwen isn't a worse Opus >looks inside >local Qwen is not "near Opus levels"