catching up on news, I saw this huggingface LLM pricing tool: https://lnkd.in/gdrM7eZC I like the selection of models and that input and output token prices are normalized to 1M tokens each (editable). I wish it had optional performance or capability columns for people who don't already know the differences between these models. Still, it's a great resource, and you can cross-reference with https://lnkd.in/g7QNJp_8 or your preferred leaderboard.
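As a rough illustration of how per-1M-token pricing turns into an actual bill, here is a minimal sketch; the prices below are made-up placeholders, not numbers from the tool.

```python
# Rough cost estimate from per-1M-token prices.
# The prices here are hypothetical placeholders; plug in the values
# shown by the pricing tool for the model you care about.
INPUT_PRICE_PER_1M = 5.00    # USD per 1M input tokens (placeholder)
OUTPUT_PRICE_PER_1M = 15.00  # USD per 1M output tokens (placeholder)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request or a whole workload."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M

# Example: 2M input tokens and 500k output tokens in a month
print(f"${estimate_cost(2_000_000, 500_000):.2f}")  # -> $17.50
```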
Alan Blount’s Post
More Relevant Posts
-
Want to estimate and compare how much the model you want to use will cost? Try this: https://lnkd.in/d8UsH3it
Llm Pricing - a Hugging Face Space by philschmid
huggingface.co
-
Ever notice your LLM provider being slower during the day and faster at night and on weekends? That's the result of hidden rate limits and throttling LLM providers apply to optimize their GPU usage. Over the past day we investigated the leading LLM providers to measure the quality (intelligence) vs. speed tradeoff you're making with each one. On providers like GPT-4 we observed a ~40% difference in speed at peak hours (10am Eastern). Read the full blog post here: https://lnkd.in/eqMm8NXv

If your application is latency or cost sensitive, at flyflow (https://flyflow.dev) we're using fine-tuning to create a custom model that matches GPT-4 performance while providing ~5x faster query speeds. Feel free to DM me and we can set your team up with API keys. All the integration takes is changing a single line in your code base to move your LLM usage to our API.
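For readers wondering what a "single line" change typically looks like, here is a minimal sketch assuming an OpenAI-compatible endpoint; the base URL, API key handling, and model name are placeholders, not flyflow's documented API.

```python
# Minimal sketch of pointing existing OpenAI-client code at a different,
# OpenAI-compatible API by changing only the client construction line.
# The base_url and model name below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-llm-provider.com/v1",  # <- the one-line change
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="your-custom-model",
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
print(resp.choices[0].message.content)
```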
-
Your terminal, reimagined:
- IDE-style text input/navigation
- Block-based command grouping
- Ability to save and share commands
- Warp AI can generate commands from normal text
- Customise keybindings and launch configs
- Built-in themes + support for custom ones
Warp: Your terminal, reimagined
warp.dev
-
Do you have multiple users/departments each wishing to deploy their own customized LLM? There's now a path that'll allow you to do so with many fewer resources. Let's say each user wants Llama 3 70B fine-tuned on their own datasets. Using "LoRA" allows you to fine-tune a 70B model on one 8-GPU node. But deployment can also be resource-intensive, as you'll need to store complete model weights for each custom LLM. And that'll add up fast. This blog describes how you can dynamically load many customized models - even hundreds or thousands of them - without storing the full model weights. And you can still send batched requests, which means you don't sacrifice efficiency when serving many users and their custom models at the same time. https://lnkd.in/eVrTJarU
Seamlessly Deploying a Swarm of LoRA Adapters with NVIDIA NIM | NVIDIA Technical Blog
developer.nvidia.com
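To make the idea concrete, here is a minimal sketch of serving one shared base model with multiple dynamically selected LoRA adapters, using vLLM's multi-LoRA support (a different stack than the NVIDIA NIM approach described in the blog post); the model name and adapter paths are placeholders.

```python
# Minimal multi-LoRA serving sketch with vLLM: one shared base model,
# per-request adapter selection, so full fine-tuned weights are never
# duplicated per user. Model name and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One 8-GPU node for the 70B base model, LoRA loading enabled.
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct",
          enable_lora=True, tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=128)

# Each department/user supplies its own small LoRA adapter.
marketing_lora = LoRARequest("marketing", 1, "/adapters/marketing-lora")
legal_lora = LoRARequest("legal", 2, "/adapters/legal-lora")

out1 = llm.generate(["Draft a product tagline."], params, lora_request=marketing_lora)
out2 = llm.generate(["Summarize this NDA clause."], params, lora_request=legal_lora)
print(out1[0].outputs[0].text)
print(out2[0].outputs[0].text)
```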
-
A splendid scaling, denoising, 🗑️-in-🗑️-out analysis for LLMs. It also clarifies why we (non-giant-tech folks) don't train base models. A must-read (we want more of this 😄): https://lnkd.in/extYPSax
FineWeb: decanting the web for the finest text data at scale - a Hugging Face Space by HuggingFaceFW
huggingface.co
-
The rabbit R1 is starting to show what the next phase of LLM/AI applications will be trying to achieve going forward. If we consider the public releases of GPT etc. to be day 0, or the incubation phase of the current tech cycle, then this is phase 1: automated integration between the apps you use daily. Whoever becomes the mass-market leader in this field will accelerate their productivity massively, as they will be able to collect the initial data points and trends to make the tech better, smarter, and more aligned with customer needs and wishes. This is also the beginning of LLM-style operating systems, where a new paradigm of efficiency will preside. https://www.rabbit.tech/
rabbit — home
rabbit.tech
-
vLLM, via PagedAttention, can batch 5x more sequences together, increasing GPU utilization and thus significantly increasing the throughput of your model. Looking to integrate that with your models? We've got you covered. For example, you can now deploy Llama 3 8B with vLLM on Monster API.
Deploy Llama-3-8B with vLLM | no need to write any code | Deploy directly from ChatGPT
https://www.youtube.com/
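For anyone who would rather try vLLM directly before going through Monster API, here is a minimal offline-inference sketch; the model name and prompts are placeholders, and you'll need a GPU with enough memory for Llama 3 8B.

```python
# Minimal vLLM offline inference sketch for Llama 3 8B.
# PagedAttention and continuous batching are handled internally by vLLM;
# the model name below assumes you have access to it on Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "Write a haiku about GPUs.",
]
# Both prompts are batched together in a single call.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```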
-
Having fun testing the limits of https://lnkd.in/dj6EJw4G (it's like https://lnkd.in/dXwtnNQq but with the recent version of Qwen)
Qwen2.5 Coder Artifacts - a Hugging Face Space by Qwen
huggingface.co
-
I want to share this article I wrote about LLM memory requirements; I wrote it because I couldn't find any useful information on the subject. You can try the LLM memory calculator on Hugging Face: https://lnkd.in/eA4zJFkU, and you can read the full details of the computation steps in the article: https://lnkd.in/exJqh8aj
Llm Memory Requirement - a Hugging Face Space by Hennara
huggingface.co
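As a back-of-the-envelope illustration of the kind of estimate such a calculator makes (not necessarily the exact formula the article uses), here is a minimal sketch: weights take parameter count × bytes per parameter, and the KV cache grows with batch size, sequence length, layer count, and head dimensions. The 8B-model numbers are illustrative placeholders.

```python
# Rough LLM inference memory estimate (weights + KV cache), in GiB.
# This is a common rule of thumb, not necessarily the article's exact method;
# the 8B-model numbers below are illustrative placeholders.
def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights, e.g. fp16/bf16 -> 2 bytes per parameter."""
    return n_params * bytes_per_param / 1024**3

def kv_cache_gib(batch: int, seq_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache: 2 (K and V) * batch * seq_len * layers * kv_heads * head_dim."""
    return (2 * batch * seq_len * n_layers * n_kv_heads * head_dim
            * bytes_per_value) / 1024**3

# Example: an 8B-parameter model in fp16 with a 4k-token context, batch size 1,
# 32 layers, 8 KV heads (grouped-query attention), head dim 128.
print(f"weights  ≈ {weights_gib(8e9):.1f} GiB")                     # ≈ 14.9 GiB
print(f"kv cache ≈ {kv_cache_gib(1, 4096, 32, 8, 128):.2f} GiB")    # ≈ 0.50 GiB
```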
-
Ollama 0.2 can now:
* Run different models side-by-side
* Process multiple requests in parallel
This enables a whole new set of RAG, agent and model serving use cases. Ollama will automatically load and unload models dynamically based on how much memory is in the system.

Ollama 0.2 is here! Concurrency is now enabled by default. ollama.com/download

This unlocks 2 major features:

Parallel requests
Ollama can now serve multiple requests at the same time, using only a little bit of additional memory for each request. This enables use cases such as:
- Handling multiple chat sessions at the same time
- Hosting code completion LLMs for your team
- Processing different parts of a document simultaneously
- Running multiple agents at the same time

Run multiple models
Ollama now supports loading different models at the same time. This improves several use cases:
- Retrieval Augmented Generation (RAG): both the embedding and text completion models can be loaded into memory simultaneously
- Agents: multiple versions of an agent can now run simultaneously
- Running large and small models side-by-side

Models are automatically loaded and unloaded based on requests and how much GPU memory is available.
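To show what the concurrency features look like from the client side, here is a minimal sketch that hits a locally running Ollama 0.2+ server over its HTTP API with two different models in parallel, RAG-style (one embedding model, one chat model); the model names are placeholders and are assumed to be already pulled.

```python
# Minimal sketch: two parallel requests to a local Ollama 0.2+ server,
# each using a different model (embedding + text generation), via the HTTP API.
# Assumes Ollama is running on the default port and both models are pulled.
import concurrent.futures
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def generate(prompt: str) -> str:
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "llama3", "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

# Both models stay loaded and the requests are served concurrently.
with concurrent.futures.ThreadPoolExecutor() as pool:
    emb_future = pool.submit(embed, "Ollama 0.2 release notes")
    gen_future = pool.submit(generate, "In one sentence, what is new in Ollama 0.2?")
    print(len(emb_future.result()), "embedding dims")
    print(gen_future.result())
```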