• icosahedron@ttrpg.network
    1 year ago

    I just tested it. It seems much better than GPT-3.5, but a little worse than GPT-4.

    First, I tried this prompt: “Explain the plot of Cinderella in a sentence where each word has to begin with the next letter in the alphabet from A to Z, without repeating any letters.”

    GPT-4 gets it right on the first try every time. It gave me: “A beautiful Cinderella, determined, endures, finds glass heel, invites joy; king’s love magically nurtures opulence, prince quests, restores slipper, transformation unveils, validates wistful xerophyte yearning zealously."

    Claude 2.0 took four tries to get it right. Its first three attempts were:

    “A beautiful young girl called Cinderella did evil for her fair godmother’s horrid instructions just kicking laundry maidens’ nice outfits purportedly quelling ridiculous step-sisters’ terrible underhanded villainy with x-rays yielding zilch.”

    “Awaited ball causes Downtrodden Ella’s fairy godmother’s help. I’ll just knock loud mean old prune’s quest, righteously sabotaging two ugly vicious women. Xanthippe yells, “Zounds!””

    “Abandoned by cruel destiny, Ella finds godmother’s hidden invitation juxtaposing king’s lavish ball. Magic occurs, placing radiant slipper to unveil victorious wife. Xerxes yearns, “Zounds, she’s found!””

    On the fourth try, it finally gave me: “Awaiting ball, Cinderella dreams every Friday godmother helps. Inside jousting knights leer maidens’ nice outfits. Prince quietly rescues sweetheart, though ugly vengeful wife x-rays yellow zebra.”
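    The A-to-Z constraint is easy to check mechanically. Below is a minimal Python sketch; the function names `first_letters` and `first_mismatch` are made up for illustration, and it assumes a “word” is any run of letters, possibly joined by an apostrophe or hyphen (so “king’s” and “x-rays” count as one word each).

```python
from __future__ import annotations

import re
import string

# Assumption: a word is a run of letters, optionally joined by a straight or
# curly apostrophe or a hyphen; surrounding punctuation is ignored.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")


def first_letters(sentence: str) -> list[str]:
    """Lowercased first letter of every word in the sentence."""
    return [word[0].lower() for word in WORD_RE.findall(sentence)]


def first_mismatch(sentence: str) -> str | None:
    """Describe where the A-to-Z rule breaks, or return None if it holds.

    The sentence must have exactly 26 words starting with a, b, ..., z in
    order, which also rules out repeated letters.
    """
    letters = first_letters(sentence)
    for i, expected in enumerate(string.ascii_lowercase):
        if i >= len(letters):
            return f"ran out of words before '{expected}'"
        if letters[i] != expected:
            return f"word {i + 1} starts with '{letters[i]}', expected '{expected}'"
    if len(letters) > 26:
        return "extra words after 'z'"
    return None


if __name__ == "__main__":
    gpt4 = ("A beautiful Cinderella, determined, endures, finds glass heel, "
            "invites joy; king’s love magically nurtures opulence, prince quests, "
            "restores slipper, transformation unveils, validates wistful "
            "xerophyte yearning zealously.")
    print(first_mismatch(gpt4))  # None
    # Opening words of Claude's first attempt.
    print(first_mismatch("A beautiful young girl called Cinderella"))
    # word 3 starts with 'y', expected 'c'
```

    Run against the quotes above, it accepts GPT-4’s sentence and Claude’s fourth attempt, while Claude’s first three attempts break at “young” (word 3), “old” (word 14), and “ball” (word 13) respectively.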

    After that, I tried this prompt: “Andrew is free from 11 am to 3 pm, Joanne is free from noon to 2 pm and then 3:30 pm to 5 pm. Hannah is available at noon for half an hour, and then 4 pm to 6 pm. What are some options for start times for a 30 minute meeting for Ca…”

    GPT-4 answered: “Given the availability of Andrew, Joanne, and Hannah, we need to find overlapping times for a 30-minute meeting.

    After cross-referencing their schedules, the potential start times that accommodate all three schedules are:

    12:00 pm to 12:30 pm: This slot is within Andrew’s free time (11 am - 3 pm), Joanne’s free time (12 pm - 2 pm), and Hannah’s first free half hour at noon. There is no other slot that fits within all three schedules, due to Joanne’s break from 2 pm to 3:30 pm and Hannah’s availability starting again only at 4 pm, when Andrew is no longer available. Therefore, there is just one option for a 30-minute meeting that works for all three of them.”

    Claude 2.0 responded with: “Okay, based on the availability you provided, here are some options for 30 minute meeting start times that work for Andrew, Hannah and Joanne: 11:30 am - This works because Andrew is free starting at 11am, Hannah is available at noon for half an hour, and Joanne is free from noon to 2pm. 3:30 pm - This works because Andrew is free until 3pm, Hannah is free from 4pm to 6pm, and Joanne is free from 3:30pm to 5pm. 4:30 pm - This also works for the same reasons as 3:30pm. Andrew is free until 3pm, Hannah is free from 4pm to 6pm, and Joanne from 3:30pm to 5pm. 5 pm - This could work as well since Andrew is free until 3pm, Hannah until 6pm, and Joanne until 5pm. It’s pushing the end of Joanne’s availability though. So in summary, the best options are 11:30 am, 3:30 pm or 4:30 pm to ensure there is overlap with all three people’s availability. Let me know if you need any clarification or have additional constraints!”

    I told Claude it was wrong, and it “corrected” itself by telling me there was NO overlapping time at all. After several more attempts, it gave up and asked me for the correct answer.
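    For reference, the scheduling question reduces to intersecting lists of time intervals. Here is a small Python sketch, assuming times are stored as minutes after midnight and free windows are half-open (a window ending at 3 pm does not include 3 pm). The helper names `hm`, `intersect`, and `start_times` are hypothetical, and the 30-minute step between candidate start times is an assumption.

```python
from typing import List, Tuple

# (start, end) in minutes after midnight; end is exclusive.
Interval = Tuple[int, int]


def hm(hour: int, minute: int = 0) -> int:
    """Convert a 24-hour clock time to minutes after midnight."""
    return hour * 60 + minute


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return every non-empty overlap between a window in `a` and one in `b`."""
    out = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                out.append((start, end))
    return sorted(out)


def start_times(free: List[Interval], duration: int, step: int = 30) -> List[str]:
    """Start times, stepping from the beginning of each window, where a meeting fits."""
    times = []
    for start, end in free:
        t = start
        while t + duration <= end:
            times.append(f"{t // 60:02d}:{t % 60:02d}")
            t += step
    return times


andrew = [(hm(11), hm(15))]
joanne = [(hm(12), hm(14)), (hm(15, 30), hm(17))]
hannah = [(hm(12), hm(12, 30)), (hm(16), hm(18))]

common = intersect(intersect(andrew, joanne), hannah)
print(common)                   # [(720, 750)]
print(start_times(common, 30))  # ['12:00']
```

    Intersecting Joanne’s and Hannah’s windows alone also yields 4 to 5 pm, but Andrew is gone by then, which is why only 12:00 pm works and none of Claude’s suggested times (11:30 am, 3:30 pm, 4:30 pm, 5 pm) fit all three people.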

    So although Claude 2.0 is much better than GPT-3.5, it is still worse than GPT-4 for complex reasoning.

    EDIT: I just realized that Claude has a 100k-token context window. That is SIGNIFICANTLY more than GPT-4’s standard 8k context and the 32k context available through the API. That alone makes it much more useful, since it can take in roughly 75,000 words at once. I also haven’t tested Claude’s creative writing, programming, math, or other skills yet, but regardless of how those compare to GPT’s, the sheer number of tokens Claude can process outweighs them.
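    The ~75,000-word figure follows from the common rule of thumb that one token is about three quarters of an English word. A quick Python sketch of that arithmetic, assuming window sizes of exactly 8,192, 32,768, and 100,000 tokens; the 3/4 ratio is only an approximation and varies with the tokenizer and the text.

```python
# Assumption: roughly 0.75 English words per token (a rule of thumb, not exact).
WORDS_PER_TOKEN_NUM, WORDS_PER_TOKEN_DEN = 3, 4

# Assumed context window sizes in tokens.
CONTEXT_WINDOWS = {
    "GPT-4 (8k)": 8_192,
    "GPT-4 API (32k)": 32_768,
    "Claude 2 (100k)": 100_000,
}


def approx_words(tokens: int) -> int:
    """Estimate how many English words fit in `tokens` tokens."""
    return tokens * WORDS_PER_TOKEN_NUM // WORDS_PER_TOKEN_DEN


for name, tokens in CONTEXT_WINDOWS.items():
    print(f"{name}: {tokens:,} tokens ≈ {approx_words(tokens):,} words")
# GPT-4 (8k): 8,192 tokens ≈ 6,144 words
# GPT-4 API (32k): 32,768 tokens ≈ 24,576 words
# Claude 2 (100k): 100,000 tokens ≈ 75,000 words
```

    By this estimate, the 100k window holds roughly three times as much text as GPT-4’s 32k API context.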

    • folshost@lemmy.world
      1 year ago

      That’s interesting. I haven’t tried its reasoning skills. I did play Jeopardy! with it, though, and it was a big improvement over previous attempts. Chatbots are usually very bad at Jeopardy!, telling you almost any answer is correct, but Claude 2 did really well and explained why I was wrong several times. That said, my initial prompt asked it to explain whether each of my answers was right or wrong, so that might have made a difference.

  • simple@lemmy.world
    1 year ago

    Whoa, this is huge. Claude 1 was already more useful and coherent than ChatGPT (3.5, not 4). The main catch was that it wasn’t available to everyone. This could really steal some market share from OpenAI if things go well.

    • Wooly@lemmy.world
      1 year ago

      What market, though? These AI chatbots seem like money sinks, kept going on the chance that they develop into something useful in the distant future.

      • simple@lemmy.world
        1 year ago

        The market of people paying for API access to popular chatbots. Right now OpenAI’s GPT is overwhelmingly the most popular option, and it’s pretty expensive. You see “powered by GPT” features on products all the time now, but hopefully Claude can provide some better competition.

        • Wooly@lemmy.world
          1 year ago

          Fair. I just don’t see any real use for these right now; chatbots seem like a gimmick that can help people cheat in school (not that I give a fuck about that). It’s probably just the online circles we run in, but what sort of things are powered by GPT? Customer support and stuff?

          • simple@lemmy.world
            1 year ago

            There are GPT-4-powered assistants on Duolingo and Khan Academy, plus tools for automatic search engine optimization, code generation, grammar and spell checking, translation, and probably a lot more I’m not aware of.

            There are quite a lot of people depending on GPT right now.

      • Dusty@l.dusty-radio.com
        1 year ago

        You shouldn’t have to, though. The whole “only available if you happen to live in X” thing is such BS for things like this. Sure, if it were a giveaway that had to be shipped, I could understand. But locking a website to certain regions is ridiculous.

        • SlowNPC@kbin.social
          1 year ago

          I suspect it has to do with legal compliance. Only available in US = only needing to comply with US law.

  • vnshng@lemmy.world
    1 year ago

    Just tried it out with some questions about firing ceramics in an electric kiln. It seems to have similar accuracy to ChatGPT, maybe closer to GPT-4.

    It’s not clear from the interface which version you’re using, so this may have been Claude 1; I’m not sure where to check.

      • vnshng@lemmy.world
        1 year ago

        I asked Perplexity the same question. It did somewhat better: it made no errors in temperatures like the others did. It just left those details out at first. When I asked follow-up questions, it answered correctly, but it also gave some unnecessary and unrelated information.

        I didn’t use any of the prompts; I was asking about saggar firing processes and temperatures. The prompts were just related to ceramics.

      • Wholesalechicken@lemmy.world
        1 year ago

        It’s hard to believe something that feels like it’s lying to you all the time. I asked it about a topic I’m involved in and have a website about, and it told me the website was hypothetical. It got it wrong twice: even after it agreed it was wrong, it told me the wrong thing again.

        Is this what they consider hallucinations?

  • kartoos@lemmy.world
    1 year ago

    It says it’s only available in the US, so I used a VPN to sign up. Good to have alternatives!