<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: juliensalinas</title><link>https://news.ycombinator.com/user?id=juliensalinas</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 25 Apr 2026 14:14:46 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=juliensalinas" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by juliensalinas in "Cursor IDE support hallucinates lockout policy, causes user cancellations"]]></title><description><![CDATA[
<p>Relying on GenAI for support like that without a human in the loop is a huge mistake...</p>
]]></description><pubDate>Fri, 18 Apr 2025 07:02:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=43725720</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=43725720</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43725720</guid></item><item><title><![CDATA[New comment by juliensalinas in "Comparing GenAI Inference Engines: TensorRT-LLM, VLLM, HF TGI, and LMDeploy"]]></title><description><![CDATA[
<p>You can read the full comparison here: <a href="https://nlpcloud.com/genai-inference-engines-tensorrt-llm-vs-vllm-vs-hugging-face-tgi-vs-lmdeploy.html" rel="nofollow">https://nlpcloud.com/genai-inference-engines-tensorrt-llm-vs...</a></p>
]]></description><pubDate>Tue, 08 Apr 2025 11:35:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=43620491</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=43620491</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43620491</guid></item><item><title><![CDATA[Comparing GenAI Inference Engines: TensorRT-LLM, VLLM, HF TGI, and LMDeploy]]></title><description><![CDATA[
<p>Hey everyone, I’ve been diving into the world of generative AI inference engines for quite some time at NLP Cloud, and I wanted to share some insights from a comparison I put together. I looked at four popular options—NVIDIA’s TensorRT-LLM, vLLM, Hugging Face’s Text Generation Inference (TGI), and LMDeploy—and ran some benchmarks to see how they stack up for real-world use cases. Thought this might spark some discussion here since I know a lot of you are working with LLMs or optimizing inference pipelines:<p>TensorRT-LLM<p>------------<p>NVIDIA’s beast for GPU-accelerated inference. Built on TensorRT, it optimizes models with layer fusion, precision tuning (FP16, INT8, even FP8), and custom CUDA kernels.<p>Pros: Blazing fast on NVIDIA GPUs—think sub-50ms latency for single requests on an A100 and ~700 tokens/sec at 100 concurrent users for LLaMA-3 70B Q4 (per BentoML benchmarks). Dynamic batching and tight integration with Triton Inference Server make it a throughput monster.<p>Cons: Setup can be complex if you’re not already in the NVIDIA ecosystem. You need to deal with model compilation, and it’s not super flexible for quick prototyping.<p>vLLM<p>----<p>Open-source champion for high-throughput inference. Uses PagedAttention to manage KV caches in chunks, cutting memory waste and boosting speed.<p>Pros: Easy to spin up (pip install, Python-friendly), and it’s flexible—runs on NVIDIA, AMD, even CPU. Throughput is solid (~600-650 tokens/sec at 100 users for LLaMA-3 70B Q4), and dynamic batching keeps it humming. Latency’s decent at 60-80ms solo.<p>Cons: It’s less optimized for single-request latency, so if you’re building a chatbot with one user at a time, it might not shine as much. Also, it’s still maturing—some edge cases (like exotic model architectures) might not be supported.<p>Hugging Face TGI<p>----------------<p>Hugging Face’s production-ready inference tool. Ties into their model hub (BERT, GPT, etc.) 
and uses Rust for speed, with continuous batching to keep GPUs busy.<p>Pros: Docker setup is quick, and it scales well. Latency’s 50-70ms, throughput matches vLLM (~600-650 tokens/sec at 100 users). Bonus: built-in output filtering for safety. Perfect if you’re already in the HF ecosystem.<p>Cons: Less raw speed than TensorRT-LLM, and memory can bloat with big batches. Feels a bit restrictive outside HF’s world.<p>LMDeploy<p>--------<p>A toolkit from the MMRazor/MMDeploy team, focused on fast, efficient LLM deployment. Features TurboMind (a high-performance engine) and a PyTorch fallback, with persistent batching and blocked KV caching for speed.<p>Pros: Decoding speed is nuts—up to 1.8x more requests/sec than vLLM on an A100. TurboMind pushes 4-bit inference 2.4x faster than FP16, hitting ~700 tokens/sec at 100 users (LLaMA-3 70B Q4). Low latency (40-60ms), easy one-command server setup, and it even handles multi-round chats efficiently by caching history.<p>Cons: TurboMind’s picky—doesn’t support sliding window attention (e.g., Mistral) yet. Non-NVIDIA users get stuck with the slower PyTorch engine. Still, on NVIDIA GPUs, it’s a performance beast.<p>What’s your experience with these tools? Any hidden issues I missed? Or are there other inference engines that should be mentioned? Would love to hear your thoughts!<p>Julien</p>
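<p>To make the KV-cache pressure concrete (the thing PagedAttention and blocked KV caching are designed to relieve), here is a back-of-the-envelope sketch. The architecture numbers (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache) are my assumptions for LLaMA-3 70B, not figures from the benchmarks above:</p>

```python
# KV-cache sizing sketch. Assumed LLaMA-3 70B shape: 80 layers,
# 8 KV heads (grouped-query attention), head dimension 128,
# FP16 cache values (2 bytes each).

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # Each token stores one key and one value vector per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val

per_token = kv_cache_bytes_per_token(80, 8, 128)
print(f"{per_token / 1024:.0f} KiB per token")  # 320 KiB

# 100 concurrent users, each with a 4096-token context:
total = per_token * 4096 * 100
print(f"{total / 1024**3:.0f} GiB of KV cache")  # 125 GiB
```

<p>At roughly 320 KiB per token, 100 concurrent 4k-token contexts need on the order of 125 GiB of cache on top of the weights, which is why paged/blocked KV management matters so much at high concurrency.</p>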
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43620472">https://news.ycombinator.com/item?id=43620472</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 08 Apr 2025 11:32:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=43620472</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=43620472</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43620472</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: What are you working on? (March 2025)"]]></title><description><![CDATA[
<p>Sounds very cool. I'm curious how you manage to monitor LinkedIn though.
The only tool that seems capable of monitoring LinkedIn is <a href="https://kwatch.io" rel="nofollow">https://kwatch.io</a>, so if you manage to achieve that too, it's impressive.</p>
]]></description><pubDate>Mon, 31 Mar 2025 09:56:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=43533108</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=43533108</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43533108</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: Founders, what was the major sourcing channel for your first 100 users?"]]></title><description><![CDATA[
<p>Social listening on HN, Reddit, X...
I used <a href="https://kwatch.io" rel="nofollow">https://kwatch.io</a> and jumped into the relevant conversations to mention my product.</p>
]]></description><pubDate>Thu, 17 Oct 2024 05:15:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=41866623</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=41866623</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41866623</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: What is used instead of mention.com nowadays?"]]></title><description><![CDATA[
<p>I use KWatch.io (<a href="https://kwatch.io" rel="nofollow">https://kwatch.io</a>) for social listening and it works very well for HN monitoring in my case. They also support other platforms (Reddit, LinkedIn, Twitter...). But they don't offer advanced features like dashboards, analytics...</p>
]]></description><pubDate>Thu, 02 May 2024 09:29:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=40234279</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=40234279</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40234279</guid></item><item><title><![CDATA[New comment by juliensalinas in "[dead]"]]></title><description><![CDATA[
<p>Many are trying to install and deploy their own LLaMA 3 model, so here is a tutorial I just made showing how to deploy LLaMA 3 on an AWS EC2 instance: <a href="https://nlpcloud.com/how-to-install-and-deploy-llama-3-into-production.html" rel="nofollow">https://nlpcloud.com/how-to-install-and-deploy-llama-3-into-...</a><p>Deploying LLaMA 3 8B is fairly easy, but LLaMA 3 70B is another beast. Given the amount of VRAM needed, you might want to provision more than one GPU and use a dedicated inference server like vLLM to split the model across several GPUs.<p>LLaMA 3 8B requires around 16GB of disk space and 20GB of VRAM (GPU memory) in FP16. As for LLaMA 3 70B, it requires around 140GB of disk space and 160GB of VRAM in FP16.<p>I hope it is useful, and if you have questions please don't hesitate to ask!<p>Julien</p>
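<p>If it helps, here is the rough sizing heuristic behind numbers like these: weights dominate, plus some headroom for KV cache and activations. The ~15% overhead factor is my assumption; the real figure depends on context length and batch size:</p>

```python
# Rough VRAM sizing heuristic for serving an LLM in FP16.
# Assumption: weights dominate; add ~15% headroom for KV cache and
# activations. These are estimates, not measurements.

def estimate_vram_gb(n_params_billion, bytes_per_param=2, overhead=1.15):
    weights_gb = n_params_billion * bytes_per_param  # 1B params x 2 bytes ~ 2 GB
    return weights_gb * overhead

print(round(estimate_vram_gb(8)))   # ~18 GB: fits on a single 24 GB GPU
print(round(estimate_vram_gb(70)))  # ~161 GB: needs several GPUs (e.g. 2x A100 80GB)
```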
]]></description><pubDate>Tue, 23 Apr 2024 12:44:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=40131317</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=40131317</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40131317</guid></item><item><title><![CDATA[New comment by juliensalinas in "Show HN: Crowdlens – AI-powered social listening"]]></title><description><![CDATA[
<p>How does this solution compare to platforms like <a href="https://kwatch.io" rel="nofollow">https://kwatch.io</a> or Brand24 for Hacker News monitoring?<p>Does it monitor Hacker News in real time?</p>
]]></description><pubDate>Fri, 15 Mar 2024 16:24:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=39717496</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=39717496</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39717496</guid></item><item><title><![CDATA[New comment by juliensalinas in "Who uses Google TPUs for inference in production?"]]></title><description><![CDATA[
<p>We tried hard to move some of our inference workloads to TPUs at NLP Cloud, but finally gave up (at least for the moment), basically for the reasons you mention. We now only perform our fine-tuning on TPUs using JAX (see <a href="https://nlpcloud.com/how-to-fine-tune-llama-openllama-xgen-with-jax-on-tpu-gpu.html" rel="nofollow">https://nlpcloud.com/how-to-fine-tune-llama-openllama-xgen-w...</a>) and we are happy with that setup.<p>It seems to me that Google does not really want to sell TPUs, but only to showcase their AI work and maybe get some early-adopter feedback. It must be quite a challenge for them to build a dynamic community around JAX and TPUs if TPUs remain a vendor-locked product...</p>
]]></description><pubDate>Tue, 12 Mar 2024 11:45:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=39678390</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=39678390</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39678390</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: Best Alternatives to OpenAI ChatGPT?"]]></title><description><![CDATA[
<p>Claude (Anthropic) might be the closest direct alternative to ChatGPT (but it's not available in all countries).
You might also want to try ChatDolphin by NLP Cloud (a company I created 3 years ago as an OpenAI alternative): <a href="https://chat.nlpcloud.com" rel="nofollow noreferrer">https://chat.nlpcloud.com</a>
Open-source is also catching up very quickly. The best models you might want to try today are LLaMA 2 70B, Yi 34B, or Mistral 7B.</p>
]]></description><pubDate>Thu, 23 Nov 2023 08:16:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=38390611</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=38390611</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38390611</guid></item><item><title><![CDATA[New comment by juliensalinas in "Mistral 7B"]]></title><description><![CDATA[
<p>For those who want to try Mistral 7b, here is a video that shows how to do it on an A10 GPU on AWS: <a href="https://www.youtube.com/watch?v=88ByWjM-KGM">https://www.youtube.com/watch?v=88ByWjM-KGM</a></p>
]]></description><pubDate>Fri, 29 Sep 2023 22:00:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=37710622</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=37710622</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37710622</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 API General Availability"]]></title><description><![CDATA[
<p>Thank you.</p>
]]></description><pubDate>Tue, 18 Jul 2023 07:05:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=36768873</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36768873</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36768873</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 API General Availability"]]></title><description><![CDATA[
<p>Thank you for the update!
Do you happen to know if there are quality comparisons somewhere, between llama.cpp and exllama?
Also, in terms of VRAM consumption, are they equivalent?</p>
]]></description><pubDate>Fri, 14 Jul 2023 06:46:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=36720281</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36720281</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36720281</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 API General Availability"]]></title><description><![CDATA[
<p>Oh it seems you're right, I had missed that.<p>As far as I can see, llama.cpp with CUDA is still a bit slower than exllama, but I never had the chance to run the comparison myself, and that may change soon as these projects are evolving very quickly.
I am also not sure whether the output quality is the same with these two implementations.</p>
]]></description><pubDate>Tue, 11 Jul 2023 06:20:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=36677063</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36677063</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36677063</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 details leaked?"]]></title><description><![CDATA[
<p>LLaMA 30B or 60B can be very impressive when correctly prompted.
Deploying the 60B version is a challenge though, and you might need to apply 4-bit quantization with something like <a href="https://github.com/PanQiWei/AutoGPTQ">https://github.com/PanQiWei/AutoGPTQ</a> or <a href="https://github.com/qwopqwop200/GPTQ-for-LLaMa">https://github.com/qwopqwop200/GPTQ-for-LLaMa</a>. You can then improve inference speed with <a href="https://github.com/turboderp/exllama">https://github.com/turboderp/exllama</a>.<p>If you prefer an "instruct" model à la ChatGPT (i.e., one that does not need few-shot examples to produce good results) you can use something like this: <a href="https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored-fp16" rel="nofollow noreferrer">https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...</a>
The interesting thing with these Uncensored models is that they don't constantly answer that they cannot help you (which is what ChatGPT and GPT-4 are doing more and more).</p>
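<p>For a rough idea of why 4-bit quantization is the unlock here, this sketch compares weight sizes only (it ignores quantization metadata such as scales and zero-points, which adds a few percent):</p>

```python
# Weight-size comparison: FP16 vs 4-bit quantized weights.
# Ignores quantization metadata (scales, zero-points), which adds a few percent.

def model_size_gb(n_params_billion, bits_per_weight):
    # 1e9 params at 1 byte per weight is ~1 GB, so scale by bits/8.
    return n_params_billion * bits_per_weight / 8

for params in (30, 60):
    fp16 = model_size_gb(params, 16)
    q4 = model_size_gb(params, 4)
    print(f"{params}B: {fp16:.0f} GB in FP16 vs {q4:.0f} GB in 4-bit")
```

<p>So the 60B model drops from ~120 GB of weights to ~30 GB, which is what makes it fit on a single A100 after GPTQ-style quantization.</p>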
]]></description><pubDate>Tue, 11 Jul 2023 06:17:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=36677039</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36677039</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36677039</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 API General Availability"]]></title><description><![CDATA[
<p>llama.cpp focuses on optimizing inference on a CPU, while exllama is for inference on a GPU.</p>
]]></description><pubDate>Mon, 10 Jul 2023 10:51:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=36664513</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36664513</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36664513</guid></item><item><title><![CDATA[New comment by juliensalinas in "ChatGPT loses users for first time, shaking faith in AI revolution"]]></title><description><![CDATA[
<p>Totally agree. Actually, a couple of months ago Sam Altman even admitted that they had a very hard time doing proper "engineering" (meaning that they had the right team to create a very good LLM, but not the right team to productionize their models and APIs). Many people find the OpenAI API very unstable and do not plan to rely on OpenAI for their production workloads.</p>
]]></description><pubDate>Sun, 09 Jul 2023 17:52:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=36656916</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36656916</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36656916</guid></item><item><title><![CDATA[New comment by juliensalinas in "ChatGPT loses users for first time, shaking faith in AI revolution"]]></title><description><![CDATA[
<p>NLP Cloud (especially the Dolphin and Fine-tuned GPT-NeoX 20B models)</p>
]]></description><pubDate>Sun, 09 Jul 2023 17:37:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=36656698</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36656698</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36656698</guid></item><item><title><![CDATA[Correctly using foundational AI models and instruct AI models]]></title><description><![CDATA[
<p>Correctly using generative AI models can be a challenge because it depends on the type of model you are using...<p>At NLP Cloud we made two tutorials to help you make the most of your model:<p>- Using foundational models (GPT-3, GPT-J, GPT-NeoX, Falcon, Llama, MPT...) with few-shot learning: <a href="https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html" rel="nofollow">https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html</a><p>- Using instruct models (ChatGPT, GPT-3 Instruct, GPT-4, Falcon Instruct, MPT Instruct...) with natural language instructions: <a href="https://nlpcloud.com/effectively-using-chatdolphin-the-chatgpt-alternative-with-simple-instructions.html" rel="nofollow">https://nlpcloud.com/effectively-using-chatdolphin-the-chatgpt-alternative-with-simple-instructions.html</a><p>I hope it will be useful.</p>
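<p>As a quick illustration of the difference between the two styles, here is a side-by-side sketch. The sentiment-classification task and the wording are my own example, not taken from the tutorials:</p>

```python
# Few-shot prompt for a foundational model: demonstrate the pattern,
# then let the model continue it.
few_shot_prompt = """\
Review: This product broke after two days.
Sentiment: negative

Review: Exactly what I needed, works perfectly.
Sentiment: positive

Review: The delivery was late but the item itself is fine.
Sentiment:"""

# Instruction prompt for an instruct model: just state the task directly.
instruct_prompt = (
    "Classify the sentiment of this review as positive or negative: "
    "'The delivery was late but the item itself is fine.'"
)

# Either string is then sent as-is to the corresponding model or API.
print(few_shot_prompt)
print(instruct_prompt)
```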
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=36492212">https://news.ycombinator.com/item?id=36492212</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 27 Jun 2023 13:12:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=36492212</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36492212</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36492212</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: Best UNCENSORED language model comparable to ChatGPT?"]]></title><description><![CDATA[
<p>You might want to try our ChatDolphin model on NLP Cloud, which is very similar to Vicuna and uncensored: <a href="https://nlpcloud.com/home/playground/text-generation" rel="nofollow">https://nlpcloud.com/home/playground/text-generation</a> (select the ChatDolphin model at the top right).<p>I hope it will be useful.</p>
]]></description><pubDate>Fri, 02 Jun 2023 10:40:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=36163451</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36163451</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36163451</guid></item></channel></rss>