  • ultrasaurus1y

    The improvements in ease of use for locally hosting LLMs over the last few months have been amazing. I was ranting about how easy https://github.com/Mozilla-Ocho/llamafile is just a few hours ago [1]. Now I'm torn as to which one to use :)

    1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/

    • keriati11y

      I think it is even easier right now for companies to self-host an inference server with basic RAG support:

      - get a Mac Mini or Mac Studio

      - just run ollama serve

      - run the ollama web-ui in Docker

      - add some coding assistant model from ollamahub with the web-ui

      - upload your documents in the web-ui

      No code needed: you have your self-hosted LLM with basic RAG giving you answers with your documents in context. For us the deepseek coder 33b model is fast enough on a Mac Studio with 64GB RAM and can give pretty good suggestions based on our internal coding documentation.
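
      For reference, a rough sketch of those steps (the web UI image name, port, and env var are from memory and may have changed, so check the ollama-webui README):

          # start the Ollama server and pull a coding model (models can also be added from the web UI)
          ollama serve &
          ollama pull deepseek-coder:33b

          # run the web UI in Docker, pointed at the host's Ollama API
          docker run -d -p 3000:8080 \
            -e OLLAMA_API_BASE_URL=http://host.docker.internal:11434/api \
            ghcr.io/ollama-webui/ollama-webui:main

      Documents uploaded through the web UI are then available as RAG context for whichever model you select.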

    • vergessenmir1y

      Personally I'd recommend Ollama, because they have a good model (Docker-esque) and their APIs are more widely supported.

      You can mix models in a single Modelfile; it's a feature I've been experimenting with lately.

      Note: you don't have to rely on their model library; you can use your own. Secondly, support for new models comes through their bindings to llama.cpp.
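
      For example, pointing Ollama at a GGUF file of your own is a one-line Modelfile (the file name here is made up for illustration):

          # Modelfile
          FROM ./my-finetune.Q4_K_M.gguf

      followed by:

          ollama create my-finetune -f Modelfile
          ollama run my-finetune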

    • xyc1y

      The pace of progress here is pretty amazing. I loved how easy it is to get llamafile up and running, but I missed feature-complete chat interfaces, so I built one based on it: https://recurse.chat/.

      I still need GPT-4 for some tasks, but in daily usage it's replaced much of my ChatGPT use, especially since I can import all of my ChatGPT chat history. I'm also curious to learn what people want to do with local AI.

      • SOLAR_FIELDS1y

        My primary use case would be to feed large internal codebases into an LLM with a much larger context window than what GPT-4 offers. Curious what the best options here are, in terms of model choice, speed, and ideas for prompt engineering

      • littlestymaar1y

        What's up with the landing page though? Unless I'm not fully awake, there doesn't seem to be a download section or anything.

    • jondwillis1y

      I’ve been using Ollama with Mixtral-7B on my MBP for local development and it has been amazing.

      • gnicholas1y

        I have used it too and am wondering why it starts responding so much faster than other similar-sized models I've tried. It doesn't seem quite as good as some of the others, but it is nice that the responses start almost immediately (on my 2022 MBA with 16 GB RAM).

        Does anyone know why this would be?

        • regularfry1y

          I've had the opposite experience with Mixtral on Ollama, on an Intel Linux box with a 4090. It's weirdly slow. But I suspect there's something up with ollama on this machine anyway; any model I run with it seems to have higher latency than vLLM on the same box.

          • kkzz991y

            You have to specify the number of layers to put on the GPU with ollama. Ollama defaults to far fewer layers than what is actually possible.
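
            If I remember the option name correctly it's `num_gpu` (the number of layers to offload), which you can set in a Modelfile; double-check the parameter docs, since this is from memory:

                # Modelfile
                FROM mixtral
                PARAMETER num_gpu 33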

      • castles1y

        To clarify - did you mean Mixtral (8x)7b, or Mistral 7b?

    • a_wild_dandan1y

      I've always used `llamacpp -m <model> -p <prompt>`. Works great as my daily driver of Mixtral 8x7b + CodeLlama 70b on my MacBook. Do alternatives have any killer features over Llama.cpp? I don't want to miss any cool developments.

      • Casteil1y

        70b is probably going to be a bit slow for most on M-series MBPs (even with enough RAM), but Mixtral 8x7b does really well. Very usable @ 25-30T/s (64GB M1 Max), whereas 70b tends to run more like 3.5-5T/s.

        'llama.cpp-based' generally seems like the norm.

        Ollama is just really easy to set up & get going on MacOS. Integral support like this means one less thing to wire up or worry about when using a local LLM as a drop-in replacement for OpenAI's remote API. Ollama also has a model library[1] you can browse & easily retrieve models from.

        Another project, Ollama-webui[2] is a nice webui/frontend for local LLM models in Ollama - it supports the latest LLaVA for multimodal image/prompt input, too.

        [1] https://ollama.ai/library/mixtral

        [2] https://github.com/ollama-webui/ollama-webui

        • visarga1y

          Yeah, ollama-webui is an excellent front end, and the team was responsive, fixing a bug I reported within a couple of days.

          It's also possible to connect to the OpenAI API and use GPT-4 on a per-token plan. I've since cancelled my ChatGPT subscription. But 90% of my usage is Mistral 7B fine-tunes; I rarely use OpenAI.

          • mark_l_watson1y

            Thanks for that idea. I use Ollama as my main LLM driver, but I still use OpenAI, Anthropic, and Mistral commercial API plans. I access Ollama via its REST API and my own client code, but I will try their UI.
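
            For anyone curious, Ollama's own REST API is a single local endpoint; a minimal call looks like this:

                curl http://localhost:11434/api/generate -d '{
                  "model": "mistral",
                  "prompt": "Why is the sky blue?",
                  "stream": false
                }'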

            re: cancelling ChatGPT subscription: I am tempted to do this also except I suspect that when they release GPT-5 there may be a waiting list, and I don’t want any delays in trying it out.

      • skp19951y

        I have found deepseek coder 33B to be better than codellama 70B (personal opinion tho). I think deepseek's biggest strength is that it understands multi-file context the best.

        • karolist1y

          Same here, I run deepseek coder 33b on my 64GB M1 Max at about 7-8t/s and it blows away every other model I've tried for coding. It feels like magic and cheating at the same time, getting these lengthy and in-depth answers with Activity Monitor showing 0 network IO.

          • hcrisp1y

            I tried running Deepseek 33b using llama.cpp with 16k context and it kept injecting unrelated text. What is your setup so it works for you? Do you have some special CLI flags or prompt format?

        • _ink_1y

          How exactly do you use the LLM with multiple files? Do you copy them entirely into the prompt?

          • skp19951y

            Not really copying whole files; I work in an editor which has keyboard shortcuts for adding the relevant context from files and putting that in the prompts.

            Straight up giving large files tends to degrade performance (so you need to do some reranking on the snippets before sending them over).

      • ultrasaurus1y

        Based on a day's worth of kicking tires, I'd say no -- once you have a mix that supports your workflow the cool developments will probably be in new models.

        I just played around with this tool and it works as advertised, which is cool, but I'm up and running already. (For anyone reading this who, like me, doesn't want to learn all the optimization work: I'd just see which one is faster on your machine.)

      • livrem1y

        With all the models I tried there was quite a bit of fiddling for each one to get the correct command-line flags and a good prompt, or at least copy-paste some command line from HF. Seems like every model needs its own unique prompt to give good results? I guess that is what the wrappers take care of? Other than that llama.cpp is very easy to use. I even run it on my phone in Termux, but only with a tiny model that is more entertaining than useful for anything.

        • te_chris1y

          For the chat models, they're all finetuned slightly differently in their prompt format - see Llama's. So having a conversion between the OpenAI API that everyone's used to now and the slightly inscrutable formats of models like Llama is very helpful - though much like langchain and its hardcoded prompts everywhere, there's probably some subjectivity and you may be rewarded by formatting prompts directly.
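
          As a rough illustration, this is all a "conversion" does for a single-turn Llama 2 style prompt (simplified; real chat templates also handle multi-turn history and per-turn BOS/EOS tokens):

              # Minimal sketch: OpenAI-style messages -> Llama 2 chat prompt
              def to_llama2_prompt(messages):
                  system = ""
                  user_turns = []
                  for m in messages:
                      if m["role"] == "system":
                          system = m["content"]
                      elif m["role"] == "user":
                          user_turns.append(m["content"])
                  sys_block = f"<<SYS>>\n{system}\n<</SYS>>\n\n" if system else ""
                  return f"<s>[INST] {sys_block}{user_turns[-1]} [/INST]"

              print(to_llama2_prompt([
                  {"role": "system", "content": "You are a helpful assistant."},
                  {"role": "user", "content": "Why is the sky blue?"},
              ]))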

      • mirekrusin1y

        Ollama is an extremely convenient wrapper around llama.cpp.

        It separates serving the heavy weights from model definition and usage.

        What that means is that the weights of some model, say Mixtral, are loaded in the server process (and kept in memory for 5 minutes by default), and you interact with them via a Modelfile (inspired by the Dockerfile). All your Modelfiles that inherit FROM mixtral will reuse those weights already loaded in memory, so you can instantly swap between different system prompts etc. - those appear as normal models to use through the CLI or UI.

        The effect is that you get very low latency and a very good interface - both the programming API and the UI.

        PS: it's not only for Macs.

        Open-weight models + llama.cpp (as Ollama) + ollama-webui = a real "open" AI.
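
        To make that concrete, a Modelfile that layers a system prompt on top of the already-loaded Mixtral weights looks roughly like this (the model name "reviewer" is made up):

            # Modelfile
            FROM mixtral
            SYSTEM "You are a terse senior code reviewer."
            PARAMETER temperature 0.2

        Then `ollama create reviewer -f Modelfile` gives you a "new" model that shares the underlying weights with plain mixtral.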

    • myaccountonhn1y

      Curious if anyone has any recommendation for what LLM model to use today if you want a code assistant locally. Mistral?

    • thrdbndndn1y

      From the blog article:

      > A few pip install X’s and you’re off to the races with Llama 2! Well, maybe you are, my dev machine doesn’t have the resources to respond on even the smallest model in less than an hour.

      I never tried to run these LLMs on my own machine -- is it this bad?

      I guess if I only have a moderate GPU, say a 4060TI, there is no chance I can play with it, then?

      • pitched1y

        I would expect that 4060ti to get about 20-25 tokens per second on Mixtral. I can read at roughly 10-15 tokens per second so above that is where I see diminishing returns for a chatbot. Generating whole blog articles might have you sit waiting for a minute or so though.

        • thrdbndndn1y

          Thanks, that sounds much more tolerable than "more than an hour"!

          I also have the 16GB version, which I assume would be a little bit better.

        • cellis1y

          It depends on the context window, but my 3090 gets ~60 t/s on smaller windows.

          • visarga1y

            I get 50-60t/s on Mistral 7B on 2080 Ti

      • jsjohnst1y

        The Apple M1 is very usable with ollama using 7B parameter models and is virtually as “fast” as ChatGPT in responding. Obviously not the same quality, but still useful.

      • Eisenstein1y

        You can load a 7B parameter model quantized at Q4_K_M as GGUF. I don't know ollama, but you can load it in koboldcpp -- use cuBLAS, GPU layers 100, context 2048, and it should all fit into 8GB of VRAM. For quantized models look at TheBloke on huggingface -- Mistral 7B is a good one to try.
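
        Roughly, that koboldcpp invocation looks like the following (flag names are from memory and may differ between versions, so check --help):

            python koboldcpp.py --model mistral-7b-instruct.Q4_K_M.gguf \
              --usecublas --gpulayers 100 --contextsize 2048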

        • dizhn1y

          If I am not mistaken, layer offloading is a llama.cpp feature so a lot of frontends/loaders that use it also have it. I use it with koboldcpp and text-generation-webui.

      • jwr1y

        On an M3 MacBook Pro with 32GB of RAM, I can comfortably run 34B models like phind-codellama:34b-v2-q8_0.

        Unfortunately, having tried this and a bunch of other models, they are all toys compared to GPT-4.

  • mrtimo1y

    I am a business prof. I wanted my students to try out ollama (with web-ui), so I built some directions for doing so on Google Cloud [1]. If you use a spot instance you can run it for 18 cents an hour.

    [1] https://docs.google.com/document/d/1OpZl4P3d0WKH9XtErUZib5_2...

    • ijustlovemath1y

      The way you've set this up, your students could be too late to claim admin and have their instance hijacked. Very insecure. Would highly recommend you make them use an SSH key from git-bash; it's no more technical than anything you already have.

    • dizhn1y

      You can run a lot of things on Google Colab for free as well. KoboldCPP has a nice premade thing on their website that can even load different models.

    • teruakohatu1y

      Very useful thanks

  • swyx1y

    I know a few people privately unhappy that OpenAI API compatibility is becoming a community standard. Apart from some awkwardness around data.choices.text.response and such unnecessary defensive nesting in the schema, I don't really have complaints.

    I wonder what pain points people have around the API becoming a standard, and whether anyone has taken a crack at any alternative standards that people should consider.

    • simonw1y

      I want it to be documented.

      I'm fine with it emerging as a community standard if there's a REALLY robust specification for what the community considers to be "OpenAI API compatible".

      Crucially, that standard needs to stay stable even if OpenAI have released a brand new feature this morning.

      So I want the following:

      - A very solid API specification, including error conditions

      - A test suite that can be used to check that new implementations conform to that specification

      - A name. I want to know what it means when software claims to be "compatible with OpenAI-API-Spec v3" (for example)

      Right now telling me something is "OpenAI API compatible" really isn't enough information. Which bits of that API? Which particular date-in-time was it created to match?

      • londons_explore1y

        It's a JSON API... JSON APIs tend to be more... 'flexible'.

        To consume them, just assume that every field is optional and extra fields might appear at any time.

        • swyx1y

          and disappear at any time... was a leetle bit unsettled by the sudden deprecation of "functions" for "tools" with only minor apparent benefit

          • athyuttamre1y

            Hey swyx — I work at OpenAI on our API. Sorry the change was surprising, we definitely didn't do a great job communicating it.

            To confirm, the `functions` parameter will continue to be supported.

            We renamed `functions` to `tools` to better align with the naming across our products (Assistants, ChatGPT), where we support other tools like `code_interpreter` and `retrieval` in addition to `function`s.

            If you have any other feedback for us, please feel free to email me at [email protected]. Thanks!

            • londons_explore1y

              Might be a good idea to have API versions for this... Then when someone builds a product against "version 1", they can be sure that new features might be added to version 1, but no fields will be removed/renamed without openai releasing version 2.

          • nl1y

            and what does `auto` even mean?

            • athyuttamre1y

              Hey nl — I work at OpenAI on our API. Do you mean `tool_choice="auto"`? If so, it means the model gets to pick which tool to call. The other options are:

              - `tool_choice={type: "function", function: {name: "getWeather"}}`, where the developer can force a specific tool to be called.

              - `tool_choice="none"`, where the developer can force the model to address the user, rather than call a tool.

              If you have any other feedback, please feel free to email me at [email protected]. Thanks!
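
              For anyone following along, an abridged request body using these parameters looks something like this (schema trimmed for brevity):

                  {
                    "model": "gpt-4",
                    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
                    "tools": [{
                      "type": "function",
                      "function": {
                        "name": "getWeather",
                        "description": "Get the current weather for a city",
                        "parameters": {
                          "type": "object",
                          "properties": {"city": {"type": "string"}},
                          "required": ["city"]
                        }
                      }
                    }],
                    "tool_choice": "auto"
                  }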

      • te_chris1y

        Amen! The lack of decent errors from OpenAI is the most annoying. They'll silently return 400 with no explanation. Let's hope that doesn't catch on.

        OpenAI compatible just seems to mean 'you can format your prompt like the `messages` array'.

        • athyuttamre1y

          Hi te_chris — I work at OpenAI and am currently working to improve our error messages. Would you be willing to share more about what errors you find annoying? My email is [email protected] (or feel free to reply here). Thanks!

    • Patrick_Devine1y

      TBH, we debated about this a lot before adding it. It's weird being beholden to someone else's API which can dictate what features we should (or shouldn't) be adding to our own project. If we add something cool/new/different to Ollama will people even be able to use it since there isn't an equivalent thing in the OpenAI API?

      • minimaxir1y

        That's more of a marketing problem than a technical problem. If there is indeed a novel use case with a good demo example that's not present in OpenAI's API, then people will use it. And if it's really novel, OpenAI will copy it into their API and thus the problem is no longer an issue.

        The power of open source!

        • Patrick_Devine1y

          You're right that it's a marketing problem, but it's also a technical problem. If tooling/projects are built around the compat layer it makes it really difficult to consume those features without having to rewrite a lot of stuff. It also places a cognitive burden on developers to know which API to use. That might not sound like a lot, but one of the guiding principles around the project (and a big part of its success) is to keep the user experience as simple as possible.

      • satellite21y

        At some point (probably in the relatively near future), will there be an AI Consortium (AIC) to decide what enters the common API?

    • minimaxir1y

      That's why it's good as an option to minimize friction and reduce lock-in to OpenAI's moat.

    • sheepscreek1y

      I would take an imperfect standard over no standard any day!

      • dimask1y

        There is a difference between a standard and a monopoly, though.

    • tracerbulletx1y

      It's so trivially easy to create your own web server that calls directly into llama.cpp functions via the bindings for your language of choice that it doesn't really matter all that much. If you want more control, you can get it with just a little more work. You don't really need these plug-and-play things.

  • slimsag1y

    Useful! At work we are building a better version of Copilot, and support bringing your own LLM. Recently I've been adding an 'OpenAI compatible' backend, so that if you provide any OpenAI-compatible API endpoint and just tell us which model to treat it as, we can format prompts, set stop sequences, respect max tokens, etc. according to the semantics of that model.

    I've been needing something exactly like this to test against in local dev environments :) Ollama having this will make my life / testing against the myriad of LLMs we need to support way, way easier.

    Seems everyone is centralizing behind OpenAI API compatibility, e.g. there is OpenLLM and a few others which implement the same API as well.

  • hubraumhugo1y

    It feels absolutely amazing to build an AI startup right now:

    - We first struggled with token limits [solved]

    - We had issues with consistent JSON output [solved]

    - We had rate limiting and performance issues for the large 3rd party models [solved]

    - We wanted to reduce costs by hosting our own OSS models for small and medium complex tasks [solved]

    It's like your product becomes automatically cheaper, more reliable, and more scalable with every new major LLM advancement.

    Obviously you still need to build up defensibility and focus on differentiating with everything “non-AI”.

    • topicseed1y

      > We first struggled with token limits [solved]

      How has this been solved, in your opinion? Do you mean the recent versions with much bigger limits that are also heaps more expensive?

      • gitfan861y

        The limits still exist but for certain use cases larger limits have helped

        • martin821y

          Most of the very large token limits are just fake marketing bullshit. If you really try them out, you will immediately realise that the model is not at all able to keep all 100k tokens in its memory. The results tend to be pure luck, so in the end you end up just using 16k tokens anyways, which is already much much better than the initial 4k, but still quite limiting.

  • ilaksh1y

    I think it's a little misleading to say it's compatible with OpenAI because I expect function or tool calling when you say that.

    It's nice that you have the role and content thing but that was always fairly trivial to implement.

    When it gets to agents you do need to execute actions. In the agent hosting system I started, I included a scripting engine, which makes me think that maybe I just need to set up security and permissions for the agent system and let it run code, which is what I started doing.

    So I guess I am not sure I really need the function/tool calling. But if I see a bunch of people actually standardizing on tool calls then maybe I need it in my framework just because it will be expected, even if I have arbitrary script execution.

    • minimaxir1y

      The documentation is upfront about which features are excluded: https://github.com/ollama/ollama/blob/main/docs/openai.md

      Function calling/tool choice is done at the application level and currently there's no standard format; the popular ones are essentially inefficient bespoke system prompts: https://github.com/langchain-ai/langchain/blob/master/libs/l...
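
      As a sketch of what those bespoke prompts boil down to (a generic illustration, not LangChain's actual template):

          You have access to the following tools:

          get_weather(city: string) -> returns the current weather for a city

          When you want to use a tool, respond ONLY with JSON of the form:
          {"tool": "<tool name>", "arguments": {...}}

          Otherwise, respond to the user normally.

      The application then parses the model's output, runs the tool, and feeds the result back as another message - which is why the format is model- and app-specific rather than standardized.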

      • e12e1y

        > Function calling/tool choice is done at the application level and currently there's no standard format,

        Is this true for OpenAI - or just everything else?

    • ianbicking1y

      I was drawn to Gemini Pro because it had function/tool calling... but it works terribly. (I haven't tried Gemini Ultra yet; unclear if it's available by API?)

      Anyway, probably best that they didn't release support that doesn't work.

      • williamstein1y

        Gemini Ultra is not available via API yet, at least according to the Google reps we talked with today. There's a waiting list. I suspect they are figuring out how to charge for API access, among other things. The announcement today only seemed to have pricing for the "$20/month" thing.

    • osigurdson1y

      It makes obvious sense to anyone with experience with OpenAI APIs.

  • ptrhvns1y

    FYI: the Linux installation script for Ollama works in the "standard" style for tooling these days:

        curl https://ollama.ai/install.sh | sh
    
    However, that script asked for root-level privileges via sudo the last time I checked. So, if you want the tool, you may want to download the script and have a look at it, or modify it depending on your needs.
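
    In practice that means something like:

        curl -fsSL https://ollama.ai/install.sh -o ollama-install.sh
        less ollama-install.sh   # review what it does (and where it uses sudo)
        sh ollama-install.sh
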
    • Vinnl1y

      They have manual install instructions [0], and judging by those, what it does is set up a SystemD service that automatically runs on startup. But if you're just looking to play around, I found that downloading [1], making it executable (chmod +x ollama-linux-amd64), and then running it, worked just fine. All without needing root.

      [0] https://github.com/ollama/ollama/blob/main/docs/linux.md#man...

      [1] https://ollama.ai/download/ollama-linux-amd64

    • dizhn1y

      The ollama binary goes into /usr/bin, which it doesn't have to, but it's convenient. I haven't checked what else needs root access.

    • riffic1y

      we have package managers in this day and age, lol.

      • jazzyjackson1y

        do package managers make promises that they only distribute code that's been audited to not pwn you? I'm not sure I see the difference if I decided I'm going to run someone's software whether I install it with sudo apt install vs sudo curl | bash

      • jampekka1y

        Sadly most of them kinda suck, especially for packagers.

  • lolpanda1y

    The compatibility layer can also be built into libraries. For example, LangChain has llm(), which can work with multiple LLM backends. Which do you prefer?

    • avereveard1y

      I'd prefer it in a library, but there are a number of issues with that currently, the largest being that the landscape moves too fast and library wrappers aren't keeping up. The other is: what if the world standardizes on a terrible library like langchain? We'd be stuck with it for a long time, since the maintenance cost of non-uniform backends tends to kill possible runner-ups. So for now the uniform API seems the choice of convenience.

    • Szpadel1y

      But this means you need each library to support each LLM, and I think this is the same issue as with object storage, where basically everyone supports an S3-compatible API.

      It's great to have some standard API even if it isn't perfect, but having a second API that lets you use the full potential (like B2 for Backblaze) is also fine.

      So there isn't one model that fits all, and if your model has different capabilities, then IMO you should provide both options.

      • SOLAR_FIELDS1y

        This is hopefully much better than the s3 situation due to its simplicity. Many offerings that say “s3 compatible api” often mean “we support like 30% of api endpoints”. Granted often the most common stuff is supported and some stuff in the s3 api really only makes sense in AWS, but a good hunk of the s3 api is just hard or annoying to implement and a lot of vendors just don’t bother. Which ends up being rather annoying because you’ll pick some vendor and try to use an s3 client with it only to find out you can’t because of the 10% of the calls your client needs to make that are unsupported.

    • mise_en_place1y

      Before OpenAI released their app I was using langchain in a system that I built. It was a very simple SMS interface to LLMs. I preferred working with langchain's abstractions over directly interfacing with the GPT4 API.

  • patelajay2851y

    We've been working on a project that provides this sort of easy swapping between open source (via HF, VLLM) & commercial models (OpenAI, Google, Anthropic, Together) in Python: https://github.com/datadreamer-dev/DataDreamer

    It's a little bit easier to use if you want to do this without an HTTP API, directly in Python.

  • eclectic291y

    What's the use case of Ollama? Why should I not use llama.cpp directly?

    • TheCoreh1y

      It's like a docker/package manager for the LLMs. You can easily install them, discover new ones, update them via a standardized, simple CLI. It also auto updates effortlessly.

      • dizhn1y

        Yesterday I learned it also deduplicates similar model files.

    • jpdus1y

      I have the same question. I noticed that Ollama got a lot of publicity and seems to be well received, but what exactly is the advantage over using llama.cpp directly (which also has a built-in server with OpenAI compatibility nowadays)?

      • visarga1y

        Ollama swaps models from the local library on the fly, based on the request args, so you can test against a bunch of models quickly.
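
        For example, a quick comparison loop against the same server only needs the model name to change:

            for m in mistral llama2 mixtral; do
              curl -s http://localhost:11434/api/generate \
                -d "{\"model\": \"$m\", \"prompt\": \"Summarize RAG in one sentence.\", \"stream\": false}"
            done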

        • eclectic291y

          Once you've tested to your heart's content, you'll deploy your model in production. So, looks like this is really just a dev use case, not a production use case.

          • silverliver1y

            In production, I'd be more concerned about the possibility of it going off on its own, auto-updating, and causing regressions. FLOSS LLMs are interesting to me because I can precisely control the entire stack.

            If Ollama doesn't have a CLI flag that disables auto-updating and networking altogether, I'm not letting it anywhere near my production environments. Period.

            • eclectic291y

              If you’re serious about production deployments vLLM is the best open source product out there. (I’m not affiliated with it)

  • shay_ker1y

    Is Ollama effectively a dockerized HTTP server that calls llama.cpp directly? With the exception of this newly added OpenAI API ;)

    • okwhateverdude1y

      More like an easy-mode llama.cpp that does a cgo wrapping of the lib (now; before, they built patched llama.cpp runners and did IPC and managed child processes), and it does a few clever things to auto-figure-out layer splits (if you have meager GPU VRAM). The easy mode is that it will auto-load whatever model you'd like per request. They also implement docker-like layers for their representation of a model, allowing you to overlay configuration parameters and tag it. So far, it has been trivial to mix and match different models (or even the same models just with different parameters) for different tasks within the same application.

  • theogravity1y

    Isn't LangChain supposed to provide abstractions so that 3rd parties don't need to conform to OpenAI's API contract?

    I know not everyone uses LangChain, but I thought that was one of the primary use-cases for it.

    • minimaxir1y

      Which just then creates lock-in for LangChain's abstractions.

      • ludwik1y

        Which are pretty awful btw - every project at my job that started with LangChain openly regrets it - the abstractions, instead of making hard things easy, tend to make easy things hard (and hard to debug and maintain).

        • emilsedgh1y

          We use langchain and don't regret it at all. As a matter of fact, it is likely that without lc we would've failed to deliver our product.

          The main reason is langsmith. (But there are other reasons too). Because of langchain we got "free" (as in no development necessary) langsmith integration and now I can debug my llm.

          Before that it was trying to make sense of what's happening inside my app within hundreds and hundreds of lines of text, which was extremely painful and time-consuming.

          Also, lc people are extremely nice and very open/quick to feedback.

          The abstractions are too verbose, and make it difficult, but the value we've been getting from lc as a whole cannot be overstated.

          other benefits:

          * easy integrations with vector stores (we tried several until landing on one but switching was easy)

          * easily adopting features like chat history, that would've taken us ages to determine correctly on our own

          People that complain and say "just call your LLM directly": if your use case is that simple, of course. Using lc for that use case is also almost equally simple.

          But if you have more complex use cases, lc provides some verbose abstractions, but it's very likely that you would've done the same.

        • phantompeace1y

          What are some better options?

          • bdcs1y

            https://www.llamaindex.ai/ is much better IMO, but it's definitely a case of boilerplate-y, well-supported incumbent vs smaller, better, less supported (e.g. Java vs Python in the 00s or something like that). Depends on your team and your needs.

            Also Autogen seems popular and well-ish liked https://microsoft.github.io/autogen/

            LangChain definitely has the most market-/mind- share. For example, GCP has a blog post on supporting it: https://cloud.google.com/blog/products/ai-machine-learning/d...

          • dragonwriter1y

            Have a fairly thin layer that wraps the underlying LLM behind a common API (e.g., Ollama as being discussed here, Oobabooga, etc.) and leaves the application-level stuff for the application rather than a framework like LangChain.

            (Better for certain use cases, that is, I’m not saying LangChain doesn't have uses.)

          • minimaxir1y

            Not using an abstraction at all and avoiding the technical debt it causes.

          • v3ss0n1y

            Haystack is a much better option and way more flexible and scalable.

          • hospitalJail1y

            Don't use langchain, just make the calls?

            It's what I ended up doing.

  • SamPatt1y

    Ollama is great. If you want a GUI, LMStudio and Jan are great too.

    I'm building a React Native app to connect mobile devices to local LLM servers run with these programs.

    https://github.com/sampatt/lookma

  • init01y

    Trying the OpenAI client against Ollama; am I missing something?

        import OpenAI from 'openai'
    
        const openai = new OpenAI({
          baseURL: 'http://localhost:11434/v1',
          apiKey: 'ollama', // required but unused
        })
    
        const chatCompletion = await openai.chat.completions.create({
          model: 'llama2',
          messages: [{ role: 'user', content: 'Why is the sky blue?' }],
        })

        console.log(chatCompletion.choices[0].message.content)
    
    I am getting the below error:

        return new NotFoundError(status, error, message, headers);
                       ^
        NotFoundError: 404 404 page not found
    • xena1y

      Remove the v1

  • Roark661y

    There has been a lot of progress with tools like llama.cpp and ollama, but despite the slightly more difficult setup I prefer Hugging Face Transformers-based stuff (TGI for hosting, an openllm proxy for (not at all) OpenAI compatibility). Why? Because you can bet the latest, newest models are going to be supported in the Hugging Face Transformers library.

    Llama.cpp is not far behind, but I find the well-structured Python code of Transformers easier to modify and extend (with context-free grammars, function calling, etc.) than just waiting for your favourite alternative runtime to support a new model.

  • behnamoh1y

    Ollama seems to be taking a page from the langchain book: develop something that's open source and get it so popular that it attracts VC money.

    I never liked ollama, maybe because ollama builds on llama.cpp (a project I truly respect) but adds so much marketing bs.

    For example, the @ollama account on twitter keeps shitposting on every possible thread to advertise ollama. The other day someone posted something about their Mac setup and @ollama said: "You can run ollama on that Mac."

    I don't like it when 500+ people are working tirelessly on llama.cpp and then guys like langchain, ollama, etc. rip off the benefits.

    • slimsag1y

      Make something better, then. (I'm not being dismissive, I really genuinely mean it - please do)

      I don't know who is behind Ollama and don't really care about them. I can agree with your disgust for VC 'open source' projects. But there's a reason they become popular and get investment: because they are valuable to people, and people use them.

      If Ollama was just a wrapper over llama.cpp, then everyone would just use llama.cpp.

      It's not just marketing, either. Compare the README of llama.cpp to the Ollama homepage, notice the stark contrast of how difficult getting llama.cpp connected to some dumb JS app is compared to Ollama. That's why it becomes valuable.

      The same thing happened with Docker, and we're only now barely getting a viable alternative (Podman Desktop) after Docker as a company imploded, and even then it still suffers from major instability on e.g. modern Macs.

      The sooner open source devs in general learn to make their projects usable by an average developer, the sooner it will be competitive with these VC-funded 'open source' projects.

      • behnamoh1y

        llama.cpp already has an OpenAI-compatible API.

        It takes literally one line to install it (git clone and then make).

        It takes one line to run the server as mentioned on their examples/server README.

            ./server -m <model> <any additional arguments like mmlock>
      • homarp1y

        >notice the stark contrast of how difficult getting llama.cpp connected to some dumb JS app is compared to Ollama.

        Sorry, I'm new to the ollama 'ecosystem'.

        From llama.cpp readme, I ctrl-F-ed "Node.js: withcatai/node-llama-cpp" and from there, I got to https://withcatai.github.io/node-llama-cpp/guide/

        Can you explain how ollama does it 'easier' ?

    • FanaHOVA1y

      ggml is also VC backed, so that has nothing to do with it.

    • udev40961y

      I didn't know ollama was VC funded

  • ben_w1y

    I had trouble installing Ollama last time I tried; I'm going to try again tomorrow.

    I've already got a web UI that "should" work with anything that matches OpenAI's chat API, though I'm sure everyone here knows how reliable air-quotes like that are when a developer says them.

    https://github.com/BenWheatley/YetAnotherChatUI

    • ben_w1y

      Turns out my failure to install last time was due to thinking that the instructions on the python library blog post were complete installation instructions for the whole thing.

      > pip install ollama

      - https://ollama.ai/blog/python-javascript-libraries

      is just the python libraries, not ollama itself, which the libraries need, and without which they will just…

      > httpx.ConnectError: [Errno 61] Connection refused

      Install the main app from the big friendly download button, and this problem fixed itself: https://ollama.ai/download

    • regularfry1y

      If you don't care about the electron app and just want the API, you can `go generate ./... && go build && ./ollama serve` and you're off to the races. No installation needed.

      • ben_w1y

        I made my web interface before I'd even heard of Ollama, and because I wanted a PAYG interface for GPT-4.

        You also don't need to actually install my web UI, as it runs from the github page and the endpoint and API key are both configurable by the user during a chat session.

        Also (a) the ollama command line interface is good enough for what I actually want, (b) my actual problem was not realising I'd only installed the python and not the underlying model.

  • tosh1y

    I wonder why ollama didn't namespace the path (e.g. under "/openai") but in any case this is great for interoperability.

  • 678j53671y

    Ollama is very good and runs better than some of the other tooling I have tried. It also Just Works™. I ran Dolphin Mixtral 7b on a Raspberry Pi 4 off a 32 gig SD card. Barely had room. I asked it for a cornbread recipe, stepped away for a few hours, and it had generated two characters. I was surprised it got that far, if I am being honest.

  • syntaxing1y

    Wow, perfect timing. I personally love it. There are so many projects out there that use OpenAI's API whether you like it or not. I wanted to try this unit test writer notebook that OpenAI has, but with Ollama. It was such a pain in the ass to fix it that I just didn't bother, since it was just for fun. Now it should be a 2-line code change.

  • LightMachine1y

    Gemini Ultra release day, and a minor post on ollama OpenAI compatibility gets more points lol

    • subarctic1y

      Who cares about another closed LLM that's no better than GPT 4? I think there's more exciting potential in open weights LLMs that you can run on your own machine and do whatever you want with.

  • laingc1y

    What's the current state-of-the-art in deploying large, "self-hosted" models to scalable infrastructure? (e.g. AWS or k8s)

    Example use case would be to support a web application with, say, 100k DAU.

    • kkielhofner1y

      Nvidia Triton Inference Server with the TensorRT-LLM backend:

      https://github.com/triton-inference-server/tensorrtllm_backe...

      It’s used by Mistral, AWS, Cloudflare, and countless others.

      vLLM, HF TGI, Rayserve, etc are certainly viable but Triton has many truly unique and very powerful features (not to mention performance).

      100k DAU doesn’t mean much, you’d need to get a better understanding of the application, input tokens, generated output tokens, request rates, peaks, etc not to mention required time to first token, tokens per second, etc.

      Anyway, the point is Triton is just about the only thing out there for use in this general range and up.

      • Palmik1y

        Do you have a source on the Mistral API, etc. being based on TensorRT-LLM? And what are the main distinguishing features?

        What I like about vLLM is the following:

        - It exposes AsyncLLMEngine, which can be easily wrapped in any API you'd like.

        - It has a logit processor API making it simple to integrate custom sampling logic.

        - It has decent support for inference of quantized models.

        • kkielhofner1y

          You can Google all of them + nvidia triton, but here you go...

          Mistral[0]:

          "Acknowledgement We are grateful to NVIDIA for supporting us in integrating TensorRT-LLM and Triton and working alongside us to make a sparse mixture of experts compatible with TRT-LLM."

          Cloudflare[1]: "It will also feature NVIDIA’s full stack inference software —including NVIDIA TensorRT-LLM and NVIDIA Triton Inference server — to further accelerate performance of AI applications, including large language models."

          Amazon[2]: "Amazon uses the Text-To-Text Transfer Transformer (T5) natural language processing (NLP) model for spelling correction. To accelerate text correction, they leverage NVIDIA AI inference software, including NVIDIA Triton™ Inference Server, and NVIDIA® TensorRT™, an SDK for high performance deep learning inference."

          There are many, many more results for AWS (internally and for customers) with plenty of "case studies", and "customer success stories", etc describing deployments. You can also find large enterprises like Siemens, etc using Triton internally and embedded/deployed within products. Triton also runs on the embedded Jetson series of hardware and there are all kinds of large entities doing edge/hybrid inference with this approach.

          You can also add at least Phind, Perplexity, and Databricks to the list. These are just the public ones, look at a high scale production deployment of ML/AI in any use case and there is a very good chance there's Triton in there.

          I encourage you to do your own research because the advantages/differences are too many to list. Triton can do everything you listed and often better (especially quantization) but off the top of my head:

          - Support for the kserve API for model management. Triton can load/reload/unload models dynamically while running, including model versioning and config params to allow clients to specify model version, require specification of version, or default to latest, etc.

          - Built in integration and support for S3 and other object stores for model management that in conjunction with the kserve API means you can hit the Triton API and just tell it to grab model X version Y and it will be running in seconds. Think of what this means when you have thousands of Triton instances throughout core, edge, K8s, etc, etc... Like Cloudflare.

          - Multiple backend support for literally any model: TF, Torch, ONNX, etc with dynamic runtime compilation for TensorRT (with caching and int8 calibration if you want it), OpenVINO, etc acceleration. You can run any LLM (or multiple), Whisper, Stable Diffusion, sentence embeddings, image classification, and literally any model on the same instance (or whatever) because at the fundamental level Triton was designed for multiple backends, multiple models, and multiple versions. It operates on an in/out concept with tensors or arbitrary data, which can be combined with the Python and model ensemble support to do anything...

          - Python backend. Triton can do pre/post-processing in the framework for things like tokenizers and decoders. With ensemble you can arbitrarily chain together inputs/outputs from any number of models/encoders/decoders/custom pre-processing/post-processing/etc. You can also, of course, build your own backends to do anything you need to do that can't be done with included backends or when performance is critical.

          - Extremely fine-grained control for dispatching, memory management, scheduling, etc. For example, the dynamic batcher can be configured with all kinds of latency guarantees (configured in nanoseconds) to balance request latency vs optimal max batch size while taking into account node GPU+CPU availability across any number of GPUs and/or CPU threads on a per-model basis (see the config sketch after this list). It also supports loading of arbitrary models to CPU, which can come in handy for acceleration of models that can run well on unused CPU resources - things like image classification/object detection. With ONNX and OpenVINO it's surprisingly useful. This can be configured on a per-model basis, with a variety of scheduling/thread/etc options.

          - OpenTelemetry (not that special) and Prometheus metrics. Prometheus will drill down to an absurd level of detail with not only request details but also the hardware itself (including temperature, power, etc).

          - Support for Model Navigator[3] and Performance Analyzer[4]. These tools are on a completely different level... They will take any arbitrary model, export it to a package, and allow you to define any number of arbitrary metrics to target a runtime format and model configuration so you can do things like:

          - p95 of time to first token: X

          - While achieving X RPS

          - While keeping power utilization below X

          They will take the exported model and dynamically deploy the package to a triton instance running on your actual inference serving hardware, then generate requests to meet your SLAs to come up with the optimal model configuration. You even get exported metrics and pretty reports for every configuration used/attempted. You can take the same exported package, change the SLA params, and it will automatically re-generate the configuration for you.

          - Performance on a completely different level. TensorRT-LLM especially is extremely new and very early but already at high scale you can start to see > 10k RPS on a single node.

          - gRPC support. Especially when using pre/post processing, ensemble, etc you can configure clients programmatically to use the individual models or the ensemble chain (as one example). This opens up a very wide range of powerful architecture options that simply aren't available anywhere else. gRPC could probably be thought of as AsyncLLMEngine on steroids, it can abstract actual input/output or expose raw in/out so models, tokenizers, decoders, clients, etc can send/receive raw data/numpy/tensors.

          - DALI support[5]. Combined with everything above, you can add DALI in the processing chain to do things like take input image/audio/etc, copy to GPU once, GPU accelerate scaling/conversion/resampling/whatever, pipe through whatever you want (all on GPU), and get output back to the network with a single CPU copy for in/out.
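
          To give a flavor of the per-model configuration behind the dynamic-batching point above, a minimal config.pbtxt looks roughly like this (field names follow Triton's model configuration docs; the values are purely illustrative):

              name: "my_onnx_model"
              backend: "onnxruntime"
              max_batch_size: 8
              input [ { name: "input_ids", data_type: TYPE_INT64, dims: [ -1 ] } ]
              output [ { name: "logits", data_type: TYPE_FP32, dims: [ -1 ] } ]
              instance_group [ { kind: KIND_GPU, count: 1 } ]
              dynamic_batching {
                preferred_batch_size: [ 4, 8 ]
                max_queue_delay_microseconds: 100
              }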

          vLLM and HF TGI are very cool and I use them in certain cases. The fact you can give them a HF model and they just fire up with a single command and offer good performance is very impressive but there are an untold number of reasons these providers use Triton. It's in a class of its own.

          [0] - https://mistral.ai/news/la-plateforme/

          [1] - https://www.cloudflare.com/press-releases/2023/cloudflare-po...

          [2] - https://www.nvidia.com/en-us/case-studies/amazon-accelerates...

          [3] - https://github.com/triton-inference-server/model_navigator

          [4] - https://github.com/triton-inference-server/client/blob/main/...

          [5] - https://github.com/triton-inference-server/dali_backend

      • laingc1y

        Very helpful answer, thank you!

  • arbuge1y

    Genuinely curious to ask HN this: what are you using local models for?

    • codazoda1y

      I got the most use out of it on an airplane with no wifi. It let me keep working on a coding solution without the internet because I could ask it quick questions. Magic.

    • mysteria1y

      I use it for personal entertainment, both writing and roleplaying. I put quite a bit of effort into my own responses and actively edit the output to get decent results out of the larger 30B and 70B models. Trying out different models and wrangling the LLM to write what you want is part of the fun.

    • teruakohatu1y

      Experimenting, as well as a cheaper alternative to cloud/paid models. Local models don't have the encyclopaedic knowledge of huge models such as GPT-3.5/4, but they can perform tasks well.

    • chown1y

      I use it to compare outputs from different models (along with OpenAI, MistralAI) and pick-and-choose-and-compose those outputs. I wrote an app[1] that facilitates this. This also allows me to work offline and not have to worry about sharing clients' data with OpenAI or Mistral AI.

      [1]: https://msty.app

    • RamblingCTO1y

      I built myself a hacky alternative to the chat UI from OpenAI and implemented ollama to test different models locally. Also, OpenAI's chat UI sucks; the API doesn't seem to suck as much. Chat is just useless for coding at this point.

      /e: https://github.com/ChristianSch/theta

    • amelius1y

      I'm hoping someone will write a tool to do project estimations. Like instead of my manager asking me "how long would it take you to implement X,Y,Z ...", he could use the LLM instead.

      It doesn't even need to be very accurate because my own estimations aren't either :)

    • dimask1y

      I used them to extract data from relatively unstructured reports into structured csv format. For privacy/gdpr reasons it was not something I could use an online model for. Saved me from a lot of manual work, and it did not hallucinate stuff as far as I could see.

  • Implicated1y

    Love it! Ollama has been such a wonderful project (at least, for me).

  • lxe1y

    Does ollama support loaders other than llama.cpp? I'm using oobabooga with exllama2 to run exl2 quants on dual NVIDIA GPUs, and nothing else seems to beat its performance.

    • _ink_1y

      I tried that, but failed to get the GPU split working. Do you have a link on how to do that?

      • lxe1y

        Do what exactly? I have no issues with GPU split on oobabooga with either exl2 or gguf.

  • jhoechtl1y

    How does ollama compare to H2o? We dabbled a bit with H2o and it looks very promising

    https://gpt.h2o.ai/

  • Havoc1y

    I don’t quite follow why people use ollama? It sounds like llama.cpp with fewer features and training wheels.

    Is it just ease of use or is there something I’m missing?

    • dizhn1y

      Downloading and activating models is very convenient. This llm stuff is really complicated and every little bit helps at the beginning. I only started two weeks ago and was very frustrated. A tool that just works is good for that kind of thing. Of course at that point you think it's their models and there's something special they are doing to the models etc. Honestly no tool that allows easy downloads goes out of their way to say they are just downloading TheBloke's gguf files and that the same models will run anywhere. (minus ollama's blob format on disk) :)

    • mark_l_watson1y

      I started by using llama.cpp directly, and a few other options. I now just use Ollama because it is simple to download models, keep software and models up to date, and really easy to run a local REST query service. I like spending more time playing with application ideas and less time running infrastructure. Of course, llama.cpp under the hood provides the magic.

    • sp3321y

      The CLI for llama.cpp is very clunky IMO. I put some kind of UI on it when I want to get something done.

    • __loam1y

      It's always ease of use lol. Thinking the best technology wins is a fallacy.

    • boarush1y

      Ollama is just easier to use and serves the model on a local HTTP server. I personally use it for testing stuff with llama-index as well. Pretty useful, to say the least, with zero configuration issues.

    • skp19951y

      Not sure why you are getting downvoted, it's a very valid question. It's kind of down to the ergonomics of running LLMs. Downloading a user-friendly CLI tool with good UX beats having to clone a repo and run makefiles. llama.cpp is the better option if you want to do anything non-trivial when it comes to LLMs.

    • titaniumtown1y

      It's a wrapper around llama.cpp that provides a stable API.

  • osigurdson1y

    Smart. When they do come, will the embedding vectors be OpenAI compatible? I assume this is quite hard to do.

    • minimaxir1y

      Embeddings as an I/O schema are just text in, a list of numbers out. There are very few embedding models which require enough preprocessing to warrant an abstraction. (A soft example is the new nomic-embed-text-v1, which requires adding prefix annotations: https://huggingface.co/nomic-ai/nomic-embed-text-v1 )
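
      For example, with nomic-embed-text-v1 that "preprocessing" is just prepending a task prefix before encoding; a minimal sketch (assuming the sentence-transformers loading path from the model card):

          from sentence_transformers import SentenceTransformer

          # trust_remote_code is required for this model's custom architecture
          model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

          doc_vecs = model.encode(["search_document: Ollama exposes an OpenAI-compatible API."])
          query_vecs = model.encode(["search_query: how do I call a local model from the openai client?"])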

      • osigurdson1y

        Yes of course (syntactically it is just float[] getEmbeddings(text)) but are the numbers close to what OpenAI would produce? I assume no.

        • minimaxir1y

          This submission is only about the I/O schema: the embeddings themselves are dependent on the model, and since OpenAI's models are closed source no one can reproduce them.

          No direct embedding model can be cross-compatible. (Exception: contrastive learning models like CLIP.)

    • dragonwriter1y

      Probably not; embedding vectors aren't compatible across different embedding models, and other tools presenting OAI-compatible APIs don't use OAI-compatible embedding models (e.g., oobabooga lets you configure different embedding models, but none of them produce vectors compatible with the OAI ones).

  • thedangler1y

    Is Ollama a model I can use locally for my own project while keeping my data secure?

    • jasonjmcghee1y

      Ollama is an easy way to run local models on Mac/Linux. See https://ollama.ai - they have a web UI and a terminal/server approach.

    • MOARDONGZPLZ1y

      I would not explicitly count on that. I'm a big fan of Ollama and use it every day, but they do have some dark patterns that make me question a use case where data security is a requirement. So I don't use it where that is something that's important.

      • jasonjmcghee1y

        The Ollama team are a few very down-to-earth, smart people. I really liked the folks I've met. I can't imagine they are doing anything malicious, and I'm sure they would address any issues (log them on GitHub) / entertain PRs to address any legitimate concerns.

      • mbernstein1y

        Examples?

      • slimsag1y

        like what? If you're gonna accuse a project of shady stuff, at least give examples :)

        • MOARDONGZPLZ1y

          The same examples given every time ollama is posted. Off the top of my head the installer silently adds login items with no way to opt out, spawns persistent processes in the background in addition to the application with unclear purposes, no info on install about the install itself, doesn’t let you back out of the installer when it requests admin access. Basically lots of dark patterns in the non-standard installer.

          Reminds me of how Zoom got its start with the “growth hacking” of the installation. Not enough to keep me from using it, but enough for me to keep from using it for anything serious or secure.

          • Patrick_Devine1y

            These are some fair points. There definitely wasn't an intention of "growth hacking", but just trying to get a lot of things done with only a few people in a short period of time. Requiring admin access really sucks though and is something we've wanted to get rid of for a while.

            • ehack1y

              Please, as an old guy with great thanks for Ollama and great admiration for your abilities, I do feel creating an autorun login item should be something you tell the user you're doing, while giving a nutshell explanation and an opt-out.

            • visarga1y

              I am running ollama in the CLI, under screen, and always disable the ollama daemon. The daemon is hard to configure, while with the CLI it's just a matter of adding a few env vars in front.

              • Patrick_Devine1y

                We're planning to make it so you can change the env variables w/ the tray icon. The CLI will always work too though.

          • v3ss0n1y

            Show me the code

            • MOARDONGZPLZ1y

              Install it on macOS. Observe for yourself. This is a repeated problem mentioned in every thread. If you need help on the part about checking to see how many processes are running, let me know and I can assist. The rest are things you will observe, step by step, during the install process.

              • v3ss0n1y

                Send me a Mac first; we don't use Macs here.

                If you can't find foul play in the code, you can't prove it.

                • jacquesm1y

                  That's a ridiculous requirement.

      • v3ss0n1y

        It's an open-source project, so you can find evidence of foul play. Prove it or it's BS.

  • jacooper1y

    Does ollama support ROCm? It's not clear from their GitHub repo whether it does.

  • philprx1y

    How does Ollama compare to LocalGPT ?

  • v01d4lph41y

    This is super neat! Thanks folks!

  • udev40961y

    Awesome!

  • bulbosaur1231y

    Has anyone actually tested it with the GPT-4 API to see how well it performs?

    • minimaxir1y

      That's not what this announcement is: it's an I/O schema for OSS local LLMs.