<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: jeffreyip</title><link>https://news.ycombinator.com/user?id=jeffreyip</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 26 Jun 2026 01:47:58 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=jeffreyip" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: Vibe code your agents without vibe coding your agent]]></title><description><![CDATA[
<p>Article URL: <a href="https://deepeval.com/docs/vibe-coding">https://deepeval.com/docs/vibe-coding</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48069001">https://news.ycombinator.com/item?id=48069001</a></p>
<p>Points: 6</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 08 May 2026 21:28:34 +0000</pubDate><link>https://deepeval.com/docs/vibe-coding</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=48069001</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48069001</guid></item><item><title><![CDATA[The Complete LLM Evaluation Playbook: How To Run LLM Evals That Matter]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.confident-ai.com/blog/the-ultimate-llm-evaluation-playbook">https://www.confident-ai.com/blog/the-ultimate-llm-evaluation-playbook</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44290511">https://news.ycombinator.com/item?id=44290511</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 16 Jun 2025 15:27:13 +0000</pubDate><link>https://www.confident-ai.com/blog/the-ultimate-llm-evaluation-playbook</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=44290511</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44290511</guid></item><item><title><![CDATA[DeepTeam: Penetration Testing for LLMs]]></title><description><![CDATA[
<p>Hi HN, we’re Jeffrey and Kritin, and we’re building DeepTeam (https://github.com/confident-ai/deepteam), an open-source Python library to scan LLM apps for security vulnerabilities. You can start “penetration testing” by defining a Python callback to your LLM app (e.g. `def model_callback(input: str)`), and DeepTeam will attempt to probe it with prompts designed to elicit unsafe or unintended behavior.<p>Note that the penetration testing process treats your LLM app as a black-box - which means that DeepTeam will not know whether PII leakage has occurred in a certain tool call or incorporated in the training data of your fine-tuned LLM, but rather just detect that it is present. Internally, we call this process “end-to-end” testing.<p>Before DeepTeam, we worked on DeepEval, an open-source framework to unit-test LLMs. Some of you might be thinking, well isn’t this kind of similar to unit-testing?<p>Sort of, but not really. While LLM unit-testing focuses on 1) accurate eval metrics, 2) comprehensive eval datasets, penetration testing focuses on the haphazard simulation of attacks, and the orchestration of it. To users, this was a big and confusing paradigm shift, because it went from “Did this pass?” to “How can this break?”.<p>So we thought to ourselves, why not just release a new package to orchestrate the simulation of adversarial attacks for this new set of users and teams working specifically on AI safety, and borrow DeepEval’s evals and ecosystem in the process?<p>Quickstart here: https://www.trydeepteam.com/docs/getting-started#detect-your-first-llm-vulnerability<p>The first thing we did was offer as many attack methods as possible - simple encoding ones like ROT13, leetspeak, to prompt injections, roleplay, and jailbreaking. We then heard folks weren’t happy because the attacks didn’t persist across tests and hence they “lost” their progress every time they tested, and so we added an option to `reuse_simulated_attacks`.<p>We abstracted everything away to make it as modular as possible - every vulnerability, attack, can be imported in Python as `Bias(type=[“race”])`, `LinearJailbreaking()`, etc. with methods such as `.enhance()` for teams to plug-and-play, build their own test suite, and even to add a few more rounds of attack enhancements to increase the likelihood of breaking your system.<p>Notably, there are a few limitations. Users might run into compliance errors when attempting to simulate attacks (especially for AzureOpenAI), and so we recommend setting `ignore_errors` to `True` in case that happens. You might also run into bottlenecks where DeepTeam does not cover your custom vulnerability type, and so we shipped a `CustomVulnerability` class as a “catch-all” solution (still in beta).<p>You might be aware that some packages already exist that do a similar thing, often known as “vulnerability scanning” or “red teaming”. The difference is that DeepTeam is modular, lightweight, and code friendly. Take Nvidia Garak for example, although comprehensive, has so many CLI rules, environments to set up, it is definitely not the easiest to get started, let alone pick the library apart to build your own penetration testing pipeline. In DeepTeam, define a class, wrap it around your own implementations if necessary, and you’re good to go.<p>We adopted a Apache 2.0 license (for now, and probably in the foreseeable future too), so if you want to get started, `pip install deepteam`, use any LLM for simulation, and you’ll get a full penetration report within 1 minute (assuming you’re running things asynchronously). GitHub: https://github.com/confident-ai/deepteam<p>Excited to share DeepTeam with everyone here – let us know what you think!</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44128270">https://news.ycombinator.com/item?id=44128270</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 29 May 2025 17:35:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=44128270</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=44128270</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44128270</guid></item><item><title><![CDATA[DeepTeam: Open-Source Pennetration Testing for LLMs]]></title><description><![CDATA[
<p>Hi HN, we’re Jeffrey and Kritin, and we’re building DeepTeam (https://github.com/confident-ai/deepteam), an open-source Python library to scan LLM apps for security vulnerabilities. You can start “penetration testing” by defining a Python callback to your LLM app (e.g. `def model_callback(input: str)`), and DeepTeam will attempt to probe it with prompts designed to elicit unsafe or unintended behavior.
Note that the penetration testing process treats your LLM app as a black-box - which means that DeepTeam will not know whether PII leakage has occurred in a certain tool call or incorporated in the training data of your fine-tuned LLM, but rather just detect that it is present. Internally, we call this process “end-to-end” testing.<p>Before DeepTeam, we worked on DeepEval, an open-source framework to unit-test LLMs. Some of you might be thinking, well isn’t this kind of similar to unit-testing?<p>Sort of, but not really. While LLM unit-testing focuses on 1) accurate eval metrics, 2) comprehensive eval datasets, penetration testing focuses on the haphazard simulation of attacks, and the orchestration of it. To users, this was a big and confusing paradigm shift, because it went from “Did this pass?” to “How can this break?”.<p>So we thought to ourselves, why not just release a new package to orchestrate the simulation of adversarial attacks for this new set of users and teams working specifically on AI safety, and borrow DeepEval’s evals and ecosystem in the process?<p>Quickstart here: https://www.trydeepteam.com/docs/getting-started#detect-your-first-llm-vulnerability<p>The first thing we did was offer as many attack methods as possible - simple encoding ones like ROT13, leetspeak, to prompt injections, roleplay, and jailbreaking. We then heard folks weren’t happy because the attacks didn’t persist across tests and hence they “lost” their progress every time they tested, and so we added an option to `reuse_simulated_attacks`.<p>We abstracted everything away to make it as modular as possible - every vulnerability, attack, can be imported in Python as `Bias(type=[“race”])`, `LinearJailbreaking()`, etc. with methods such as `.enhance()` for teams to plug-and-play, build their own test suite, and even to add a few more rounds of attack enhancements to increase the likelihood of breaking your system.<p>Notably, there are a few limitations. Users might run into compliance errors when attempting to simulate attacks (especially for AzureOpenAI), and so we recommend setting `ignore_errors` to `True` in case that happens. You might also run into bottlenecks where DeepTeam does not cover your custom vulnerability type, and so we shipped a `CustomVulnerability` class as a “catch-all” solution (still in beta).<p>You might be aware that some packages already exist that do a similar thing, often known as “vulnerability scanning” or “red teaming”. The difference is that DeepTeam is modular, lightweight, and code friendly. Take Nvidia Garak for example, although comprehensive, has so many CLI rules, environments to set up, it is definitely not the easiest to get started, let alone pick the library apart to build your own penetration testing pipeline. In DeepTeam, define a class, wrap it around your own implementations if necessary, and you’re good to go.<p>We adopted a Apache 2.0 license (for now, and probably in the foreseeable future too), so if you want to get started, `pip install deepteam`, use any LLM for simulation, and you’ll get a full penetration report within 1 minute (assuming you’re running things asynchronously). GitHub: https://github.com/confident-ai/deepteam<p>Excited to share DeepTeam with everyone here – let us know what you think!</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44124610">https://news.ycombinator.com/item?id=44124610</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 29 May 2025 10:21:59 +0000</pubDate><link>https://news.ycombinator.com/item?id=44124610</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=44124610</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44124610</guid></item><item><title><![CDATA[Show HN: DeepTeam – Penetration Testing for LLMs]]></title><description><![CDATA[
<p>Hi HN, we’re Jeffrey and Kritin, and we’re building DeepTeam (<a href="https://trydeepteam.com" rel="nofollow">https://trydeepteam.com</a>), an open-source Python library to scan LLM apps for security vulnerabilities. You can start “penetration testing” by defining a Python callback to your LLM app (e.g. `def model_callback(input: str)`), and DeepTeam will attempt to probe it with prompts designed to elicit unsafe or unintended behavior.<p>Note that the penetration testing process treats your LLM app as a black-box - which means that DeepTeam will not know whether PII leakage has occurred in a certain tool call or incorporated in the training data of your fine-tuned LLM, but rather just detect that it is present. Internally, we call this process “end-to-end” testing.<p>Before DeepTeam, we worked on DeepEval, an open-source framework to unit-test LLMs. Some of you might be thinking, well isn’t this kind of similar to unit-testing?<p>Sort of, but not really. While LLM unit-testing focuses on 1) accurate eval metrics, 2) comprehensive eval datasets, penetration testing focuses on the haphazard simulation of attacks, and the orchestration of it. To users, this was a big and confusing paradigm shift, because it went from  “Did this pass?” to “How can this break?”.<p>So we thought to ourselves, why not just release a new package to orchestrate the simulation of adversarial attacks for this new set of users and teams working specifically on AI safety, and borrow DeepEval’s evals and ecosystem in the process?<p>Quickstart here: <a href="https://www.trydeepteam.com/docs/getting-started#detect-your-first-llm-vulnerability" rel="nofollow">https://www.trydeepteam.com/docs/getting-started#detect-your...</a><p>The first thing we did was offer as many attack methods as possible - simple encoding ones like ROT13, leetspeak, to prompt injections, roleplay, and jailbreaking. We then heard folks weren’t happy because the attacks didn’t persist across tests and hence they “lost” their progress every time they tested, and so we added an option to `reuse_simulated_attacks`.<p>We abstracted everything away to make it as modular as possible - every vulnerability, attack, can be imported in Python as `Bias(type=[“race”])`, `LinearJailbreaking()`, etc. with methods such as `.enhance()` for teams to plug-and-play, build their own test suite, and even to add a few more rounds of attack enhancements to increase the likelihood of breaking your system.<p>Notably, there are a few limitations. Users might run into compliance errors when attempting to simulate attacks (especially for AzureOpenAI), and so we recommend setting `ignore_errors` to `True` in case that happens. You might also run into bottlenecks where DeepTeam does not cover your custom vulnerability type, and so we shipped a `CustomVulnerability` class as a “catch-all” solution (still in beta).<p>You might be aware that some packages already exist that do a similar thing, often known as “vulnerability scanning” or “red teaming”. The difference is that DeepTeam is modular, lightweight, and code friendly. Take Nvidia Garak for example, although comprehensive, has so many CLI rules, environments to set up, it is definitely not the easiest to get started, let alone pick the library apart to build your own penetration testing pipeline. In DeepTeam, define a class, wrap it around your own implementations if necessary, and you’re good to go.<p>We adopted a Apache 2.0 license (for now, and probably in the foreseeable future too), so if you want to get started, `pip install deepteam`, use any LLM for simulation, and you’ll get a full penetration report within 1 minute (assuming you’re running things asynchronously). GitHub: <a href="https://github.com/confident-ai/deepteam">https://github.com/confident-ai/deepteam</a><p>Excited to share DeepTeam with everyone here – let us know what you think!</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44117323">https://news.ycombinator.com/item?id=44117323</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 28 May 2025 15:49:43 +0000</pubDate><link>https://github.com/confident-ai/deepteam</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=44117323</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44117323</guid></item><item><title><![CDATA[YC helped us raise our seed round in 5 days]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.confident-ai.com/blog/how-i-closed-confident-ais-2-2m-seed-round-in-5-days">https://www.confident-ai.com/blog/how-i-closed-confident-ais-2-2m-seed-round-in-5-days</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43426147">https://news.ycombinator.com/item?id=43426147</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 20 Mar 2025 17:26:49 +0000</pubDate><link>https://www.confident-ai.com/blog/how-i-closed-confident-ais-2-2m-seed-round-in-5-days</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43426147</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43426147</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>You sure can! A few lines of code is all it takes, and a few simple rules to follow as shown here: <a href="https://docs.confident-ai.com/guides/guides-building-custom-metrics#building-a-custom-non-llm-eval">https://docs.confident-ai.com/guides/guides-building-custom-...</a><p>If you're using DSPy, you can also include it directly in this custom metric from the link above. It's hard for me to say 100% if there are advantages of doing this within DeepEval, but 8/10 times running evals in our ecosystem brings you more benefits than drawbacks. Let me know if you have trouble setting up!</p>
]]></description><pubDate>Sat, 22 Feb 2025 04:29:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=43136124</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43136124</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43136124</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>Definitely, feel free to join our discord for any questions on it: <a href="https://discord.com/invite/a3K9c8GRGt" rel="nofollow">https://discord.com/invite/a3K9c8GRGt</a></p>
]]></description><pubDate>Fri, 21 Feb 2025 17:43:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=43130512</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43130512</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43130512</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>Do check it out, the early feedback has been great: <a href="https://docs.confident-ai.com/docs/metrics-dag">https://docs.confident-ai.com/docs/metrics-dag</a></p>
]]></description><pubDate>Fri, 21 Feb 2025 17:42:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=43130503</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43130503</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43130503</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>Hey yes would definitely love to, my contact info is in my bio, please drop me an email :)</p>
]]></description><pubDate>Fri, 21 Feb 2025 01:23:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=43122956</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43122956</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43122956</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>Interesting, how are you remixing the order of questions? If we're talking about an academic benchmark like MMLU, the questions are independent of one another. Unless you're generating multiple answers in one go?<p>Do do synthetic data generation for custom application use cases. Such as RAG, summarization, text-sql, etc. We call this module the "synthesizer", and you can customize your data generation pipeline however you want (I think, let me know otherwise!).<p>Docs for synthesizer's here: <a href="https://docs.confident-ai.com/docs/synthesizer-introduction">https://docs.confident-ai.com/docs/synthesizer-introduction</a>, there's a nice "how does it work" section at the bottom explaining it more.</p>
]]></description><pubDate>Thu, 20 Feb 2025 23:22:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=43121814</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43121814</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43121814</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>It's actually langfuse.com! Our quickstart walks you through the whole process: <a href="https://docs.confident-ai.com/confident-ai/confident-ai-introduction">https://docs.confident-ai.com/confident-ai/confident-ai-intr...</a></p>
]]></description><pubDate>Thu, 20 Feb 2025 19:57:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=43119361</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43119361</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43119361</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>That's great! Hope you enjoyed it :)</p>
]]></description><pubDate>Thu, 20 Feb 2025 19:46:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=43119225</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43119225</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43119225</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>Thanks and great question! There's a ton of eval tools out there but there are only a few that actually focuses on evals. The quality of LLM evaluation depends on the quality of dataset and the quality of metrics, and so tools that are more focused on the platform side of things (observability/tracing) tend to fall short on the ability to do accurate and reliable benchmarking. What tends to happen for those tools are users use them for one-off debugging, but when errors only happen 1% of the time, there is no capability for regression testing.<p>Since we own the metrics and the algorithms that we've spent the last year iterating on with our users, we balance between giving engineers the ability to customize our metric algorithms and evaluation techniques, while offering the ability for them to bring it to the cloud for their organization when they're ready.<p>This brings me to the tools that does have their own metrics and evals. Including us, there's only 3 companies out there that does this to a good extent (excuse me for this one), and we're the only one with a self-served platform such that any open-source user can get the benefit of Confident AI as well.<p>That's not all the difference, because if you were to compare DeepEval's metrics on more nuance details (which I think is very important), we provide the most customizable metrics out there. This includes researched-backed SOTA LLM-as-a-judge G-Eval for any criteria, and the recently released DAG metric that is a decision-based that is virtually deterministic despite being LLM-evaluated. This means as user's use cases get more and more specific, they can stick with our metrics and benefit from DeepEval's ecosystem as well (metric caching, cost tracking, parallelization, integrated with Pytest for CI/CD, Confident AI, etc)<p>There's so much more, such as generating synthetic data to get started with testing even if you don't have a prepared test set, red-teaming for safety testing (so not just testing for functionality), but I'm going to stop here for now.</p>
]]></description><pubDate>Thu, 20 Feb 2025 19:45:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=43119210</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43119210</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43119210</guid></item><item><title><![CDATA[New comment by jeffreyip in "Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps"]]></title><description><![CDATA[
<p>I see, although most users come to us for evaluating LLM applications, you're correct that the academic benchmarking of foundational models is also offered in DeepEval, which I'm assuming what you're talking about.<p>We actually designed it to make it easily work off any API. How it works is you just have to create a wrapper around your API and you're good to go. We take care of the async/concurrent handling of such benchmarking so the evaluation speed is really just limited by the rate limit of your LLM API.<p>This link shows what a wrapper looks like: <a href="https://docs.confident-ai.com/guides/guides-using-custom-llms#creating-a-custom-llm">https://docs.confident-ai.com/guides/guides-using-custom-llm...</a><p>And once you have your model wrapper setup, you can use any benchmark we provide.</p>
]]></description><pubDate>Thu, 20 Feb 2025 19:03:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=43118696</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43118696</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43118696</guid></item><item><title><![CDATA[Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps]]></title><description><![CDATA[
<p>Hi HN - we're Jeffrey and Kritin, and we're building Confident AI (<a href="https://confident-ai.com">https://confident-ai.com</a>). This is the cloud platform for DeepEval (<a href="https://github.com/confident-ai/deepeval">https://github.com/confident-ai/deepeval</a>), our open-source package that helps engineers evaluate and unit-test LLM applications. Think Pytest for LLMs.<p>We spent the past year building DeepEval with the goal of providing the best LLM evaluation developer experience, growing it to run over 600K evaluations daily in CI/CD pipelines of enterprises like BCG, AstraZeneca, AXA, and Capgemini. But the fact that DeepEval simply runs, and does nothing with the data afterward, isn’t the best experience. If you want to inspect failing test cases, identify regressions, or even pick the best model/prompt combination, you need more than just DeepEval. That’s why we built a platform around it.<p>Here’s a quick demo video of how everything works: <a href="https://youtu.be/PB3ngq7x4ko" rel="nofollow">https://youtu.be/PB3ngq7x4ko</a><p>Confident AI is great for RAG pipelines, agents, and chatbots. Typical use cases involve allowing companies to switch the underlying LLM, rewrite prompts for newer (and possibly cheaper) models, and keep test sets in sync with the codebase where DeepEval tests are run.<p>Our platform features a "dataset editor," a "regression catcher," and "iteration insights". The datasets editor in Confident AI allows domain experts to edit datasets while keeping them in sync with your codebase for evaluation. We’ll then generate sharable LLM testing/benchmark reports once DeepEval has finished running evaluations on these datasets that are pulled from the cloud. The regression catcher then identifies any regressions in your new implementation, and we use these evaluation results to determine the best iteration based on your metric scores.<p>Our goal is to make benchmarking LLM applications so reliable that picking the best implementation is as simple as reading the metric values off the dashboard. To achieve this, the quality of curated datasets and the accuracy and reliability of metrics must be the highest possible.<p>This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.<p>To address this, we recently released a DAG (Directed Acyclic Graph) metric in DeepEval. It is a decision-tree-based, LLM-as-a-judge metric that provides deterministic results by breaking a test case into finer atomic units. Each edge represents a decision, each node represents an LLM evaluation step, and each leaf node returns a score. It works best in scenarios where success criteria are clearly defined, such as text summarization.<p>The DAG metric is still in its early stages, but our hope is that by moving towards better, code-driven, open-source metrics, Confident AI can deliver deterministic LLM benchmarks that anyone can blindly trust.<p>We hope you’ll give Confident AI a try. Quickstart here: <a href="https://docs.confident-ai.com/confident-ai/confident-ai-introduction">https://docs.confident-ai.com/confident-ai/confident-ai-intr...</a><p>The platform runs on a freemium tier, and we've dropped the need to signup with a work email for the next four days.<p>Looking forward to your thoughts!</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43116633">https://news.ycombinator.com/item?id=43116633</a></p>
<p>Points: 117</p>
<p># Comments: 27</p>
]]></description><pubDate>Thu, 20 Feb 2025 16:23:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43116633</link><dc:creator>jeffreyip</dc:creator><comments>https://news.ycombinator.com/item?id=43116633</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43116633</guid></item></channel></rss>