Are you self-hosting LLMs (AI models) on your headless servers? I’d like to hear about your hardware setup. What server do you have your GPUs in?

When I do a hardware refresh I’d like to ensure my next server can support GPU(s?) for local LLM inferencing. I figured I could put in either a 4090 or x2 3090’s(?) maybe into an R730. But I’ve only barely started to research this. Maybe it isn’t practical.

I don’t know much other hardware lineups besides the Dell R7xx lineup.

I host oobagooba on an R710 as a model server API, and host sillytavern and stable diffusion which use oobagooba as clients. I use an R710 using a CPU, so as you can imagine inferencing is so slow it’s basically unusable. But I wired it up as a proof of concept.

I’m curious what other people who self-host LLMs do. I’m aware of remote options like Mancer or Runpod. I’d like the option for purely local inferencing.

Thanks all

  • tigress667@alien.topB
    link
    fedilink
    English
    arrow-up
    1
    ·
    1 year ago

    One challenge with the 4090 specifically is I don’t believe there are any dual-slot variants out there, even my 4080 is advertised as a triple-slot card (and actually takes four because Zotac did something really, really annoying with the fan mounting)…you could liquid-cool and swap the brackets, but then you have the unenviable task of mounting sufficient radiators and support equipment (pump, res, etc) into a rackmount server. That assumes you’re looking at something 2-3U, since you mentioned an R730; if you’re willing to do a whitebox 4U build it’s a lot more doable.

    Of course if money is no object, ditch plans for the GeForce cards and get the sort of hardware that’s made to live in 2U/3U boxes, i.e. current-gen Tesla (or Quadro, if you want display outputs for whatever reason). If money is an object, get last-gen Teslas. Tossed an old Tesla P100 (Pascal/10-series) into my Proxmox server to replace a 2060S with half the VRAM, for LLMs I didn’t really notice an obvious performance decrease (i.e. still inferences faster than I can read), and in a rack server you won’t even have to mess with custom shrouds for cooling, since the fans in the server are going to provide more than enough directed airflow.