<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: juliensalinas</title><link>https://news.ycombinator.com/user?id=juliensalinas</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 25 Apr 2026 14:14:46 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=juliensalinas" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by juliensalinas in "Cursor IDE support hallucinates lockout policy, causes user cancellations"]]></title><description><![CDATA[
<p>Relying on GenAI for support like that without a human in the loop is a huge mistake...</p>
]]></description><pubDate>Fri, 18 Apr 2025 07:02:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=43725720</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=43725720</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43725720</guid></item><item><title><![CDATA[New comment by juliensalinas in "Comparing GenAI Inference Engines: TensorRT-LLM, VLLM, HF TGI, and LMDeploy"]]></title><description><![CDATA[
<p>You can read the full comparison here: <a href="https://nlpcloud.com/genai-inference-engines-tensorrt-llm-vs-vllm-vs-hugging-face-tgi-vs-lmdeploy.html" rel="nofollow">https://nlpcloud.com/genai-inference-engines-tensorrt-llm-vs...</a></p>
]]></description><pubDate>Tue, 08 Apr 2025 11:35:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=43620491</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=43620491</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43620491</guid></item><item><title><![CDATA[Comparing GenAI Inference Engines: TensorRT-LLM, VLLM, HF TGI, and LMDeploy]]></title><description><![CDATA[
<p>Hey everyone, I’ve been diving into the world of generative AI inference engines for quite some time at NLP Cloud, and I wanted to share some insights from a comparison I put together. I looked at four popular options—NVIDIA’s TensorRT-LLM, vLLM, Hugging Face’s Text Generation Inference (TGI), and LMDeploy—and ran some benchmarks to see how they stack up for real-world use cases. Thought this might spark some discussion here since I know a lot of you are working with LLMs or optimizing inference pipelines:<p>TensorRT-LLM<p>------------<p>NVIDIA’s beast for GPU-accelerated inference. Built on TensorRT, it optimizes models with layer fusion, precision tuning (FP16, INT8, even FP8), and custom CUDA kernels.<p>Pros: Blazing fast on NVIDIA GPUs—think sub-50ms latency for single requests on an A100 and ~700 tokens/sec at 100 concurrent users for LLaMA-3 70B Q4 (per BentoML benchmarks). Dynamic batching and tight integration with Triton Inference Server make it a throughput monster.<p>Cons: Setup can be complex if you’re not already in the NVIDIA ecosystem. You need to deal with model compilation, and it’s not super flexible for quick prototyping.<p>vLLM<p>----<p>Open-source champion for high-throughput inference. Uses PagedAttention to manage KV caches in chunks, cutting memory waste and boosting speed.<p>Pros: Easy to spin up (pip install, Python-friendly), and it’s flexible—runs on NVIDIA, AMD, even CPU. Throughput is solid (~600-650 tokens/sec at 100 users for LLaMA-3 70B Q4), and dynamic batching keeps it humming. Latency’s decent at 60-80ms solo.<p>Cons: It’s less optimized for single-request latency, so if you’re building a chatbot with one user at a time, it might not shine as much. Also, it’s still maturing—some edge cases (like exotic model architectures) might not be supported.<p>Hugging Face TGI<p>----------------<p>Hugging Face’s production-ready inference tool. Ties into their model hub (BERT, GPT, etc.) 
and uses Rust for speed, with continuous batching to keep GPUs busy.<p>Pros: Docker setup is quick, and it scales well. Latency’s 50-70ms, throughput matches vLLM (~600-650 tokens/sec at 100 users). Bonus: built-in output filtering for safety. Perfect if you’re already in the HF ecosystem.<p>Cons: Less raw speed than TensorRT-LLM, and memory can bloat with big batches. Feels a bit restrictive outside HF’s world.<p>LMDeploy<p>--------<p>A toolkit from the MMRazor/MMDeploy team, focused on fast, efficient LLM deployment. Features TurboMind (a high-performance engine) and a PyTorch fallback, with persistent batching and blocked KV caching for speed.<p>Pros: Decoding speed is nuts—up to 1.8x more requests/sec than vLLM on an A100. TurboMind pushes 4-bit inference 2.4x faster than FP16, hitting ~700 tokens/sec at 100 users (LLaMA-3 70B Q4). Low latency (40-60ms), easy one-command server setup, and it even handles multi-round chats efficiently by caching history.<p>Cons: TurboMind’s picky—doesn’t support sliding window attention (e.g., Mistral) yet. Non-NVIDIA users get stuck with the slower PyTorch engine. Still, on NVIDIA GPUs, it’s a performance beast.<p>What’s your experience with these tools? Any hidden issues I missed? Or are there other inference engines that should be mentioned? Would love to hear your thoughts!<p>Julien</p>
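<p>To make the KV-cache pressure concrete (the thing PagedAttention and blocked KV caching are designed to relieve), here is a back-of-the-envelope sketch. The architecture numbers (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache) are my assumptions for LLaMA-3 70B, not figures from the benchmarks above:</p>

```python
# KV-cache sizing sketch. Assumed LLaMA-3 70B shape: 80 layers,
# 8 KV heads (grouped-query attention), head dimension 128,
# FP16 cache values (2 bytes each).

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # Each token stores one key and one value vector per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val

per_token = kv_cache_bytes_per_token(80, 8, 128)
print(f"{per_token / 1024:.0f} KiB per token")  # 320 KiB

# 100 concurrent users, each with a 4096-token context:
total = per_token * 4096 * 100
print(f"{total / 1024**3:.0f} GiB of KV cache")  # 125 GiB
```

<p>At roughly 320 KiB per token, 100 concurrent 4k-token contexts need on the order of 125 GiB of cache on top of the weights, which is why paged/blocked KV management matters so much at high concurrency.</p>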
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43620472">https://news.ycombinator.com/item?id=43620472</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 08 Apr 2025 11:32:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=43620472</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=43620472</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43620472</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: What are you working on? (March 2025)"]]></title><description><![CDATA[
<p>Sounds very cool. I'm curious how you manage to monitor LinkedIn though.
The only tool that seems capable of monitoring LinkedIn is <a href="https://kwatch.io" rel="nofollow">https://kwatch.io</a>, so if you manage to achieve that too, it's impressive.</p>
]]></description><pubDate>Mon, 31 Mar 2025 09:56:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=43533108</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=43533108</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43533108</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: Founders, what was the major sourcing channel for your first 100 users?"]]></title><description><![CDATA[
<p>Social listening on HN, Reddit, X...
I used <a href="https://kwatch.io" rel="nofollow">https://kwatch.io</a> and jumped into the relevant conversations to mention my product.</p>
]]></description><pubDate>Thu, 17 Oct 2024 05:15:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=41866623</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=41866623</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41866623</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: What is used instead of mention.com nowadays?"]]></title><description><![CDATA[
<p>I use KWatch.io (<a href="https://kwatch.io" rel="nofollow">https://kwatch.io</a>) for social listening and it works very well for HN monitoring in my case. They also support other platforms (Reddit, LinkedIn, Twitter...). But they don't offer advanced features like dashboards, analytics...</p>
]]></description><pubDate>Thu, 02 May 2024 09:29:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=40234279</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=40234279</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40234279</guid></item><item><title><![CDATA[New comment by juliensalinas in "[dead]"]]></title><description><![CDATA[
<p>Many are trying to install and deploy their own LLaMA 3 model, so here is a tutorial I just made showing how to deploy LLaMA 3 on an AWS EC2 instance: <a href="https://nlpcloud.com/how-to-install-and-deploy-llama-3-into-production.html" rel="nofollow">https://nlpcloud.com/how-to-install-and-deploy-llama-3-into-...</a><p>Deploying LLaMA 3 8B is fairly easy, but LLaMA 3 70B is another beast. Given the amount of VRAM needed, you might want to provision more than one GPU and use a dedicated inference server like vLLM to split the model across several GPUs.<p>LLaMA 3 8B requires around 16GB of disk space and 20GB of VRAM (GPU memory) in FP16. As for LLaMA 3 70B, it requires around 140GB of disk space and 160GB of VRAM in FP16.<p>I hope it is useful, and if you have questions please don't hesitate to ask!<p>Julien</p>
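<p>If it helps, here is the rough sizing heuristic behind numbers like these: weights dominate, plus some headroom for KV cache and activations. The ~15% overhead factor is my assumption; the real figure depends on context length and batch size:</p>

```python
# Rough VRAM sizing heuristic for serving an LLM in FP16.
# Assumption: weights dominate; add ~15% headroom for KV cache and
# activations. These are estimates, not measurements.

def estimate_vram_gb(n_params_billion, bytes_per_param=2, overhead=1.15):
    weights_gb = n_params_billion * bytes_per_param  # 1B params x 2 bytes ~ 2 GB
    return weights_gb * overhead

print(round(estimate_vram_gb(8)))   # ~18 GB: fits on a single 24 GB GPU
print(round(estimate_vram_gb(70)))  # ~161 GB: needs several GPUs (e.g. 2x A100 80GB)
```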
]]></description><pubDate>Tue, 23 Apr 2024 12:44:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=40131317</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=40131317</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40131317</guid></item><item><title><![CDATA[New comment by juliensalinas in "Show HN: Crowdlens – AI-powered social listening"]]></title><description><![CDATA[
<p>How does this solution compare to platforms like <a href="https://kwatch.io" rel="nofollow">https://kwatch.io</a> or Brand24 for Hacker News monitoring?<p>Does it monitor Hacker News in real time?</p>
]]></description><pubDate>Fri, 15 Mar 2024 16:24:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=39717496</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=39717496</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39717496</guid></item><item><title><![CDATA[New comment by juliensalinas in "Who uses Google TPUs for inference in production?"]]></title><description><![CDATA[
<p>We tried hard to move some of our inference workloads to TPUs at NLP Cloud, but finally gave up (at least for the moment), basically for the reasons you mention. We now only perform our fine-tuning on TPUs using JAX (see <a href="https://nlpcloud.com/how-to-fine-tune-llama-openllama-xgen-with-jax-on-tpu-gpu.html" rel="nofollow">https://nlpcloud.com/how-to-fine-tune-llama-openllama-xgen-w...</a>) and we are happy with that setup.<p>It seems to me that Google does not really want to sell TPUs, but only to showcase their AI work and maybe get some early-adopter feedback. It must be quite a challenge for them to build a dynamic community around JAX and TPUs if TPUs remain a vendor-locked product...</p>
]]></description><pubDate>Tue, 12 Mar 2024 11:45:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=39678390</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=39678390</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39678390</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: Best Alternatives to OpenAI ChatGPT?"]]></title><description><![CDATA[
<p>Claude (Anthropic) might be the closest direct alternative to ChatGPT (but it's not available in all countries).
You might also want to try ChatDolphin by NLP Cloud (a company I created 3 years ago as an OpenAI alternative): <a href="https://chat.nlpcloud.com" rel="nofollow noreferrer">https://chat.nlpcloud.com</a>
Open-source is also catching up very quickly. The best models you might want to try today are LLaMA 2 70B, Yi 34B, or Mistral 7B.</p>
]]></description><pubDate>Thu, 23 Nov 2023 08:16:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=38390611</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=38390611</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38390611</guid></item><item><title><![CDATA[New comment by juliensalinas in "Mistral 7B"]]></title><description><![CDATA[
<p>For those who want to try Mistral 7b, here is a video that shows how to do it on an A10 GPU on AWS: <a href="https://www.youtube.com/watch?v=88ByWjM-KGM">https://www.youtube.com/watch?v=88ByWjM-KGM</a></p>
]]></description><pubDate>Fri, 29 Sep 2023 22:00:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=37710622</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=37710622</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=37710622</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 API General Availability"]]></title><description><![CDATA[
<p>Thank you.</p>
]]></description><pubDate>Tue, 18 Jul 2023 07:05:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=36768873</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36768873</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36768873</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 API General Availability"]]></title><description><![CDATA[
<p>Thank you for the update!
Do you happen to know if there are quality comparisons somewhere, between llama.cpp and exllama?
Also, in terms of VRAM consumption, are they equivalent?</p>
]]></description><pubDate>Fri, 14 Jul 2023 06:46:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=36720281</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36720281</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36720281</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 API General Availability"]]></title><description><![CDATA[
<p>Oh it seems you're right, I had missed that.<p>As far as I can see, llama.cpp with CUDA is still a bit slower than exllama, but I never had the chance to run the comparison myself, and that may change soon as these projects are evolving very quickly.
I am also not sure whether the output quality is the same with these two implementations.</p>
]]></description><pubDate>Tue, 11 Jul 2023 06:20:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=36677063</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36677063</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36677063</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 details leaked?"]]></title><description><![CDATA[
<p>LLaMA 30B or 60B can be very impressive when correctly prompted.
Deploying the 60B version is a challenge though, and you might need to apply 4-bit quantization with something like <a href="https://github.com/PanQiWei/AutoGPTQ">https://github.com/PanQiWei/AutoGPTQ</a> or <a href="https://github.com/qwopqwop200/GPTQ-for-LLaMa">https://github.com/qwopqwop200/GPTQ-for-LLaMa</a>. You can then improve inference speed with <a href="https://github.com/turboderp/exllama">https://github.com/turboderp/exllama</a>.<p>If you prefer an "instruct" model à la ChatGPT (i.e., one that does not need few-shot examples to produce good results) you can use something like this: <a href="https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored-fp16" rel="nofollow noreferrer">https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...</a>
The interesting thing with these Uncensored models is that they don't constantly answer that they cannot help you (which is what ChatGPT and GPT-4 are doing more and more).</p>
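<p>For a rough idea of why 4-bit quantization is the unlock here, this sketch compares weight sizes only (it ignores quantization metadata such as scales and zero-points, which adds a few percent):</p>

```python
# Weight-size comparison: FP16 vs 4-bit quantized weights.
# Ignores quantization metadata (scales, zero-points), which adds a few percent.

def model_size_gb(n_params_billion, bits_per_weight):
    # 1e9 params at 1 byte per weight is ~1 GB, so scale by bits/8.
    return n_params_billion * bits_per_weight / 8

for params in (30, 60):
    fp16 = model_size_gb(params, 16)
    q4 = model_size_gb(params, 4)
    print(f"{params}B: {fp16:.0f} GB in FP16 vs {q4:.0f} GB in 4-bit")
```

<p>So the 60B model drops from ~120 GB of weights to ~30 GB, which is what makes it fit on a single A100 after GPTQ-style quantization.</p>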
]]></description><pubDate>Tue, 11 Jul 2023 06:17:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=36677039</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36677039</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36677039</guid></item><item><title><![CDATA[New comment by juliensalinas in "GPT-4 API General Availability"]]></title><description><![CDATA[
<p>llama.cpp focuses on optimizing inference on a CPU, while exllama is for inference on a GPU.</p>
]]></description><pubDate>Mon, 10 Jul 2023 10:51:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=36664513</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36664513</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36664513</guid></item><item><title><![CDATA[New comment by juliensalinas in "ChatGPT loses users for first time, shaking faith in AI revolution"]]></title><description><![CDATA[
<p>Totally agree. Actually, a couple of months ago Sam Altman even admitted that they had a very hard time doing proper "engineering" (meaning that they had the right team to create a very good LLM, but not the right team to productionize their models and APIs). Many people find the OpenAI API very unstable and do not plan to rely on OpenAI for their production workloads.</p>
]]></description><pubDate>Sun, 09 Jul 2023 17:52:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=36656916</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36656916</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36656916</guid></item><item><title><![CDATA[New comment by juliensalinas in "ChatGPT loses users for first time, shaking faith in AI revolution"]]></title><description><![CDATA[
<p>NLP Cloud (especially the Dolphin and Fine-tuned GPT-NeoX 20B models)</p>
]]></description><pubDate>Sun, 09 Jul 2023 17:37:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=36656698</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36656698</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36656698</guid></item><item><title><![CDATA[Correctly using foundational AI models and instruct AI models]]></title><description><![CDATA[
<p>Correctly using generative AI models can be a challenge because it depends on the type of model you are using...<p>At NLP Cloud we made two tutorials to help you make the most of your model:<p>- Using foundational models (GPT-3, GPT-J, GPT-NeoX, Falcon, Llama, MPT...) with few-shot learning: <a href="https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html" rel="nofollow">https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html</a><p>- Using instruct models (ChatGPT, GPT-3 Instruct, GPT-4, Falcon Instruct, MPT Instruct...) with natural language instructions: <a href="https://nlpcloud.com/effectively-using-chatdolphin-the-chatgpt-alternative-with-simple-instructions.html" rel="nofollow">https://nlpcloud.com/effectively-using-chatdolphin-the-chatgpt-alternative-with-simple-instructions.html</a><p>I hope it will be useful.</p>
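<p>As a quick illustration of the difference between the two styles, here is a side-by-side sketch. The sentiment-classification task and the wording are my own example, not taken from the tutorials:</p>

```python
# Few-shot prompt for a foundational model: demonstrate the pattern,
# then let the model continue it.
few_shot_prompt = """\
Review: This product broke after two days.
Sentiment: negative

Review: Exactly what I needed, works perfectly.
Sentiment: positive

Review: The delivery was late but the item itself is fine.
Sentiment:"""

# Instruction prompt for an instruct model: just state the task directly.
instruct_prompt = (
    "Classify the sentiment of this review as positive or negative: "
    "'The delivery was late but the item itself is fine.'"
)

# Either string is then sent as-is to the corresponding model or API.
print(few_shot_prompt)
print(instruct_prompt)
```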
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=36492212">https://news.ycombinator.com/item?id=36492212</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 27 Jun 2023 13:12:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=36492212</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36492212</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36492212</guid></item><item><title><![CDATA[New comment by juliensalinas in "Ask HN: Best UNCENSORED language model comparable to ChatGPT?"]]></title><description><![CDATA[
<p>You might want to try our ChatDolphin model on NLP Cloud, which is very similar to Vicuna and uncensored: <a href="https://nlpcloud.com/home/playground/text-generation" rel="nofollow">https://nlpcloud.com/home/playground/text-generation</a> (select the ChatDolphin model at the top right).<p>I hope it will be useful.</p>
]]></description><pubDate>Fri, 02 Jun 2023 10:40:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=36163451</link><dc:creator>juliensalinas</dc:creator><comments>https://news.ycombinator.com/item?id=36163451</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36163451</guid></item></channel></rss>