catching up on news, I saw this huggingface LLM pricing tool: https://lnkd.in/gdrM7eZC I like the selection of models and that input and output token prices are normalized to 1M tokens each (editable). I wish it had optional performance or capability columns for people who don't already know the differences between these models. Still, it's a great resource, and you can cross-reference with https://lnkd.in/g7QNJp_8 or your preferred leaderboard.
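As a rough illustration of how per-1M-token pricing turns into an actual bill, here is a minimal sketch; the prices below are made-up placeholders, not numbers from the tool.

```python
# Rough cost estimate from per-1M-token prices.
# The prices here are hypothetical placeholders; plug in the values
# shown by the pricing tool for the model you care about.
INPUT_PRICE_PER_1M = 5.00    # USD per 1M input tokens (placeholder)
OUTPUT_PRICE_PER_1M = 15.00  # USD per 1M output tokens (placeholder)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request or a whole workload."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M

# Example: 2M input tokens and 500k output tokens in a month
print(f"${estimate_cost(2_000_000, 500_000):.2f}")  # -> $17.50
```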
Alan Blount’s Post
More Relevant Posts
-
Want to estimate and compare how much the model you want to use will cost? Try this: https://lnkd.in/d8UsH3it
Llm Pricing - a Hugging Face Space by philschmid
huggingface.co
-
Ever notice your LLM provider being slower during the day and faster at night and on weekends? That's the result of hidden rate limits and throttling LLM providers apply to optimize their GPU usage. Over the past day we investigated the leading LLM providers to measure the quality (intelligence) vs. speed tradeoff you're making with each one. On providers like GPT-4 we observed a ~40% difference in speed at peak hours (10am Eastern). Read the full blog post here: https://lnkd.in/eqMm8NXv

If your application is latency or cost sensitive, at flyflow (https://flyflow.dev) we're using fine-tuning to create a custom model that matches GPT-4 performance while providing ~5x faster query speeds. Feel free to DM me and we can set your team up with API keys. All the integration takes is changing a single line in your code base to move your LLM usage to our API.
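For readers wondering what a "single line" change typically looks like, here is a minimal sketch assuming an OpenAI-compatible endpoint; the base URL, API key handling, and model name are placeholders, not flyflow's documented API.

```python
# Minimal sketch of pointing existing OpenAI-client code at a different,
# OpenAI-compatible API by changing only the client construction line.
# The base_url and model name below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-llm-provider.com/v1",  # <- the one-line change
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="your-custom-model",
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
print(resp.choices[0].message.content)
```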
-
Your terminal, reimagined:
- IDE-style text input/navigation
- Block-based command grouping
- Ability to save and share commands
- Warp AI can generate commands from normal text
- Customise keybindings and launch configs
- Built-in themes + support for custom ones
Warp: Your terminal, reimagined
warp.dev
-
Do you have multiple users/departments each wishing to deploy their own customized LLM? There's now a path that'll allow you to do so with many fewer resources. Let's say each user wants Llama 3 70B fine-tuned on their own datasets. Using "LoRA" allows you to fine-tune a 70B model on one 8-GPU node. But deployment can also be resource-intensive, as you'll need to store complete model weights for each custom LLM. And that'll add up fast. This blog describes how you can dynamically load many customized models - even hundreds or thousands of them - without storing the full model weights. And you can still send batched requests, which means you don't sacrifice efficiency when serving many users and their custom models at the same time. https://lnkd.in/eVrTJarU
Seamlessly Deploying a Swarm of LoRA Adapters with NVIDIA NIM | NVIDIA Technical Blog
developer.nvidia.com
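To make the idea concrete, here is a minimal sketch of serving one shared base model with multiple dynamically selected LoRA adapters, using vLLM's multi-LoRA support (a different stack than the NVIDIA NIM approach described in the blog post); the model name and adapter paths are placeholders.

```python
# Minimal multi-LoRA serving sketch with vLLM: one shared base model,
# per-request adapter selection, so full fine-tuned weights are never
# duplicated per user. Model name and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One 8-GPU node for the 70B base model, LoRA loading enabled.
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct",
          enable_lora=True, tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=128)

# Each department/user supplies its own small LoRA adapter.
marketing_lora = LoRARequest("marketing", 1, "/adapters/marketing-lora")
legal_lora = LoRARequest("legal", 2, "/adapters/legal-lora")

out1 = llm.generate(["Draft a product tagline."], params, lora_request=marketing_lora)
out2 = llm.generate(["Summarize this NDA clause."], params, lora_request=legal_lora)
print(out1[0].outputs[0].text)
print(out2[0].outputs[0].text)
```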
-
A splendid scaling, denoising, 🗑️-in-🗑️-out analysis for LLMs. It also clarifies why we (non-giant-tech folks) don't train base models. A must-read (we want more of this 😄): https://lnkd.in/extYPSax
FineWeb: decanting the web for the finest text data at scale - a Hugging Face Space by HuggingFaceFW
huggingface.co
-
The rabbit R1 is starting to show what the next phase of LLM/AI applications will be trying to achieve going forward. If we consider the public releases of GPT etc. to be day 0, or the incubation phase of the current tech cycle, then this is phase 1: automated integration between the apps you use daily. Whoever becomes the mass-market leader in this field will accelerate their productivity massively, as they will be able to collect the initial data points and trends to make the tech better, smarter, and more aligned with customer needs and wishes. This is also the beginning of LLM-style operating systems, where a new paradigm of efficiency will preside. https://www.rabbit.tech/
rabbit — home
rabbit.tech
-
vLLM, via PagedAttention, can batch 5x more sequences together, increasing GPU utilization and thus significantly increasing the throughput of your model. Looking to integrate that with your models? We've got you covered. For example, you can now deploy Llama 3 8B with vLLM on Monster API.
Deploy Llama-3-8B with vLLM | no need to write any code | Deploy directly from ChatGPT
https://www.youtube.com/
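For anyone who would rather try vLLM directly before going through Monster API, here is a minimal offline-inference sketch; the model name and prompts are placeholders, and you'll need a GPU with enough memory for Llama 3 8B.

```python
# Minimal vLLM offline inference sketch for Llama 3 8B.
# PagedAttention and continuous batching are handled internally by vLLM;
# the model name below assumes you have access to it on Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "Write a haiku about GPUs.",
]
# Both prompts are batched together in a single call.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```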
-
Having fun testing the limits of https://lnkd.in/dj6EJw4G (it's like https://lnkd.in/dXwtnNQq but with the recent version of Qwen)
Qwen2.5 Coder Artifacts - a Hugging Face Space by Qwen
huggingface.co
-
I want to share this article I wrote about LLM memory requirements; I wrote it because I couldn't find any useful information on the subject. You can try the LLM memory calculator on Hugging Face: https://lnkd.in/eA4zJFkU, and you can read the full details of the computation steps in the article: https://lnkd.in/exJqh8aj
Llm Memory Requirement - a Hugging Face Space by Hennara
huggingface.co
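As a back-of-the-envelope illustration of the kind of estimate such a calculator makes (not necessarily the exact formula the article uses), here is a minimal sketch: weights take parameter count × bytes per parameter, and the KV cache grows with batch size, sequence length, layer count, and head dimensions. The 8B-model numbers are illustrative placeholders.

```python
# Rough LLM inference memory estimate (weights + KV cache), in GiB.
# This is a common rule of thumb, not necessarily the article's exact method;
# the 8B-model numbers below are illustrative placeholders.
def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights, e.g. fp16/bf16 -> 2 bytes per parameter."""
    return n_params * bytes_per_param / 1024**3

def kv_cache_gib(batch: int, seq_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache: 2 (K and V) * batch * seq_len * layers * kv_heads * head_dim."""
    return (2 * batch * seq_len * n_layers * n_kv_heads * head_dim
            * bytes_per_value) / 1024**3

# Example: an 8B-parameter model in fp16 with a 4k-token context, batch size 1,
# 32 layers, 8 KV heads (grouped-query attention), head dim 128.
print(f"weights  ≈ {weights_gib(8e9):.1f} GiB")                     # ≈ 14.9 GiB
print(f"kv cache ≈ {kv_cache_gib(1, 4096, 32, 8, 128):.2f} GiB")    # ≈ 0.50 GiB
```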
-
Ollama 0.2 can now:
* Run different models side-by-side
* Process multiple requests in parallel
This enables a whole new set of RAG, agent and model serving use cases. Ollama will automatically load and unload models dynamically based on how much memory is in the system.

Ollama 0.2 is here! Concurrency is now enabled by default. ollama.com/download

This unlocks 2 major features:

Parallel requests
Ollama can now serve multiple requests at the same time, using only a little bit of additional memory for each request. This enables use cases such as:
- Handling multiple chat sessions at the same time
- Hosting code completion LLMs for your team
- Processing different parts of a document simultaneously
- Running multiple agents at the same time

Run multiple models
Ollama now supports loading different models at the same time. This improves several use cases:
- Retrieval Augmented Generation (RAG): both the embedding and text completion models can be loaded into memory simultaneously
- Agents: multiple versions of an agent can now run simultaneously
- Running large and small models side-by-side

Models are automatically loaded and unloaded based on requests and how much GPU memory is available.
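To show what the concurrency features look like from the client side, here is a minimal sketch that hits a locally running Ollama 0.2+ server over its HTTP API with two different models in parallel, RAG-style (one embedding model, one chat model); the model names are placeholders and are assumed to be already pulled.

```python
# Minimal sketch: two parallel requests to a local Ollama 0.2+ server,
# each using a different model (embedding + text generation), via the HTTP API.
# Assumes Ollama is running on the default port and both models are pulled.
import concurrent.futures
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def generate(prompt: str) -> str:
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "llama3", "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

# Both models stay loaded and the requests are served concurrently.
with concurrent.futures.ThreadPoolExecutor() as pool:
    emb_future = pool.submit(embed, "Ollama 0.2 release notes")
    gen_future = pool.submit(generate, "In one sentence, what is new in Ollama 0.2?")
    print(len(emb_future.result()), "embedding dims")
    print(gen_future.result())
```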