2 comments

  • besterman23 4 hours ago
    I wonder if multiple attempts at the opossum would produce better results.

    If we didn’t have the previous example, I would interpret this as pretty solid evidence that labs were training on the Pelican “benchmark”.

    I just can’t imagine a model dropping so significantly from one version to the next on such a silly task.

  • ChrisArchitect 34 minutes ago
    Related:

    GLM-5.2 is the new leading open weights model on Artificial Analysis

    https://news.ycombinator.com/item?id=48567759