Hacker News: DarkPlayer

New comment by DarkPlayer in "Abusive AI Web Crawlers: Get Off My Lawn"

DarkPlayer — Wed, 02 Apr 2025 15:24:29 +0000

We observed the same behavior. Each request used a different IP address and a random user agent. In our case, most of the IP addresses belonged to Chinese ISPs. They went to great lengths to avoid being blocked, but at the same time used user agents such as Windows 95/98 or IE 5. Fortunately, the combination of the odd user agents and the fact that they still use HTTP/1.1 makes them somewhat easy to identify. So you can use a captcha on more expensive endpoints to block them.
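
As a rough illustration of the heuristic described above, a request could be flagged when an obsolete user agent arrives over HTTP/1.1. This is only a minimal sketch: the regular expression, the function name and the exact form of the version string are assumptions, not what any particular site runs.

```python
import re

# Hypothetical pattern for user agents that real visitors are unlikely to send today
# (Windows 95/98, Internet Explorer 1 through 6).
OBSOLETE_UA = re.compile(r"Windows (95|98)|MSIE [1-6]\.", re.IGNORECASE)


def looks_like_crawler(user_agent: str, http_version: str) -> bool:
    """Flag a request that combines an obsolete user agent with HTTP/1.1."""
    return http_version == "HTTP/1.1" and bool(OBSOLETE_UA.search(user_agent))
```

A flagged request would then be shown a captcha on the expensive endpoints instead of being blocked outright, since either signal on its own can also come from a legitimate client.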

New comment by DarkPlayer in "Software development topics I've changed my mind on"

DarkPlayer — Wed, 05 Feb 2025 16:42:38 +0000

> What we should have instead is syntax-aware diffs that can ignore meaningless changes like curly braces moving into another line or lines getting wrapped for reasons.

These diffs already exist (at least for some languages) but aren't yet integrated into the standard tools. For example, if you want a command line tool, you can use https://github.com/Wilfred/difftastic or if you are interested in a VS Code extension / GitHub App instead, you can give https://semanticdiff.com a try.

New comment by DarkPlayer in "Software development topics I've changed my mind on"

DarkPlayer — Wed, 05 Feb 2025 14:42:57 +0000

> However at a minimum formatting changes shouldn’t regularly complicate doing a diff.

If the code needs to be reformatted, this should be done in a separate commit. Fortunately, there are now structural/semantic diff tools available for some languages that can help if someone hasn't properly split their formatting and logic changes.

New comment by DarkPlayer in "Types are a basic tool of software design (2018)"

DarkPlayer — Fri, 03 Jan 2025 21:27:04 +0000

Difftastic would not solve the issue described by madeofpalk because it still highlights the added comma. You need a diff tool that can distinguish between optional and required syntax. So far I am not aware of any tool that supports this, except the one I am working on (SemanticDiff).

New comment by DarkPlayer in "Mergiraf: a syntax-aware merge driver for Git"

DarkPlayer — Sat, 09 Nov 2024 20:01:17 +0000

Our parsers simply return the concrete syntax trees in a JSON format. We do not unify all the different syntax constructs into a common AST if that is what you are looking for. The languages and file formats we support are too diverse for that.

The language-specific logic does not end with the parsers though. The core of SemanticDiff also contains language-specific rules that are picked up by the matching and visualization steps. For example, the HTML module might add a rule that the order of attributes within a tag is irrelevant. So it all comes down to writing a generic rule system that makes it easy to add new languages.
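
To make the attribute-order example concrete, here is a minimal sketch of a per-language rule lookup. The `RULES` registry and `attrs_equal` are hypothetical names for illustration and say nothing about how SemanticDiff is actually structured.

```python
from typing import Callable, Dict, List, Tuple

Attr = Tuple[str, str]  # (attribute name, attribute value)

# Hypothetical registry: language name -> normalization applied before comparing.
RULES: Dict[str, Callable[[List[Attr]], List[Attr]]] = {
    "html": sorted,  # attribute order carries no meaning inside an HTML tag
}


def attrs_equal(language: str, old: List[Attr], new: List[Attr]) -> bool:
    """Compare two attribute lists, ignoring order only if the language has such a rule."""
    normalize = RULES.get(language, list)
    return normalize(old) == normalize(new)
```

With this, `attrs_equal("html", [("id", "a"), ("class", "b")], [("class", "b"), ("id", "a")])` reports no change, while the same call for a language without the rule still reports a difference.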

New comment by DarkPlayer in "Mergiraf: a syntax-aware merge driver for Git"

DarkPlayer — Sat, 09 Nov 2024 19:11:11 +0000

> - for diffing, the matching of the leaves is what matters the most, for merging the internal nodes are more important,

The leaves are the ones that end up being highlighted in the diff, but the inner nodes play an important role as well. We try to preserve as much of the code structure as possible when mapping the nodes. A developer is unlikely to change the structure of the code just for fun. A mapping with a larger number of structural changes is therefore more likely to be incorrect.

> - for diffing, it feels more acceptable to restrict the matching to be monotonous on the leaves since it's difficult to visually represent moves if you can detect them. For merging, supporting moves is more interesting as it lets you replay changes on the moved element,

We use a pipeline-based approach, and visualizing the changes is the last step. For some types of changes we don't have a way to visualize them yet (e.g. moves within the same line), so we ignore that part of the mapping. We are still trying to get the mapping right though :)

We upstreamed a few bug fixes for tree-sitter itself. The grammars were a bit more complicated because we were just using them as a starting point. We patched tree-sitter, added our own annotations to the grammars and restructured them to help our matching algorithm achieve better results and improve performance. In the end there was not much to upstream any more.

Using a well tested parsing library, such as Roslyn for C#, and writing some code to integrate it into our existing system aligned more with our goals than tinkering with grammars. Context-sensitive keywords in particular were a constant source of annoyance. The grammar looks correct, but it will fail to parse because of the way the lexer works. You don't want your tool to abort just because someone named their parameter "async".

New comment by DarkPlayer in "Mergiraf: a syntax-aware merge driver for Git"

DarkPlayer — Sat, 09 Nov 2024 16:48:44 +0000

I don't think that different algorithms are better for merging or diffing. In both cases, the first step is to match identical nodes, and the quality of the final result depends heavily on this step. The main problem with GumTree is that it is a greedy algorithm. One incorrectly matched node can completely screw up the rest of the matches. A typical example we encountered was adding a decorator to a function in Python. When other functions with the same decorator followed, the algorithm would often map the newly added decorator to an existing decorator, causing all other decorator mappings to be "off-by-one". GumTree has a tendency to come up with more changes than there actually are.
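
The decorator problem can be reproduced with a toy greedy matcher. This is not GumTree, only a deliberately simple stand-in that pairs each new node with the first unused old node carrying the same label; the node representation (label, owning function) is an assumption made for the example.

```python
def greedy_match(old, new):
    """Pair each new node with the first unused old node that has the same label."""
    used = set()
    mapping = {}
    for j, (label, _owner) in enumerate(new):
        for i, (old_label, _old_owner) in enumerate(old):
            if i not in used and old_label == label:
                used.add(i)
                mapping[j] = i
                break
    return mapping


# Decorator nodes as (label, function they belong to).
# Old code: only b and c are decorated. New code: a gets the same decorator.
old = [("@cache", "b"), ("@cache", "c")]
new = [("@cache", "a"), ("@cache", "b"), ("@cache", "c")]

mapping = greedy_match(old, new)
# Pairs whose decorators belong to different functions, i.e. the "off-by-one" matches.
mismatched = [(new[j][1], old[i][1]) for j, i in mapping.items() if new[j][1] != old[i][1]]
# New nodes left without a partner are the ones reported as added.
unmatched = [new[j][1] for j in range(len(new)) if j not in mapping]
```

Here the decorator added to `a` is matched to the one on `b`, the one on `b` to the one on `c`, and the untouched decorator on `c` ends up being reported as new.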

We want to get the diff quality nailed down before going after merges. We don't have merge functionality in SemanticDiff yet.

The main issue we have with tree-sitter is that the grammars are often written from scratch and not based on the upstream grammar definition. Sometimes they only cover the most likely cases, which can lead to parsing errors or incorrectly parsed code. When you encounter parsing errors, they can be difficult to fix because the upstream grammar is structured completely differently. To give you an example, compare the tree-sitter Go grammar for types [1] with the upstream grammar [2]. They are similar, but the way the rules are structured is somewhat inverted.

We use separate executables for the parsers (this also helps to secure them using seccomp on Linux), and they all use the same JSON schema for their output. This allows us to write the parser executable in the most appropriate language for the target language. Building them all statically and cross-platform for our VS Code extension isn't easy though ;)

[1]: https://github.com/tree-sitter/tree-sitter-go/blob/master/gr...
[2]: https://go.dev/ref/spec#Types

New comment by DarkPlayer in "Mergiraf: a syntax-aware merge driver for Git"

DarkPlayer — Sat, 09 Nov 2024 15:16:54 +0000

Looking at the architecture, they will probably run into some issues. We are doing something similar with SemanticDiff [1] and also started out using tree-sitter grammars for parsing and GumTree for matching. Both choices turned out to be problematic.

Tree-sitter grammars are primarily written to support syntax highlighting and often use a best-effort approach to parsing. This is perfectly fine for syntax highlighting, since the worst that can happen is that a few characters are highlighted incorrectly. However, when diffing or modifying code you really want the code to be parsed according to the upstream grammar, not something that mostly resembles it. We are currently in the process of moving away from tree-sitter and instead using the parsers provided by the languages themselves where possible.

GumTree is good at returning a result quickly, but there are quite a few cases where it always returned bad matches for us, no matter how many follow-up papers with improvements we tried to implement. In the end we switched over to a Dijkstra-based approach that tries to minimize the cost of the mapping, which is more computationally expensive but gives much better results. Difftastic uses a similar approach as well.
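
To give an idea of what minimizing the cost of a mapping with Dijkstra can look like, here is a heavily simplified sketch on flat sequences of nodes instead of trees. The unit costs and the function name are assumptions for illustration; it is not the implementation used by SemanticDiff or difftastic.

```python
import heapq


def min_cost_mapping(old, new):
    """Dijkstra over states (i, j): i old nodes and j new nodes consumed so far."""
    start, goal = (0, 0), (len(old), len(new))
    dist = {start: 0}
    prev = {}
    queue = [(0, start)]
    while queue:
        cost, (i, j) = heapq.heappop(queue)
        if (i, j) == goal:
            break
        if cost > dist[(i, j)]:
            continue  # stale queue entry, a cheaper path was already found
        steps = []
        if i < len(old) and j < len(new) and old[i] == new[j]:
            steps.append((0, (i + 1, j + 1)))  # keep: map old[i] to new[j]
        if i < len(old):
            steps.append((1, (i + 1, j)))  # old[i] was deleted
        if j < len(new):
            steps.append((1, (i, j + 1)))  # new[j] was inserted
        for step_cost, state in steps:
            new_cost = cost + step_cost
            if new_cost < dist.get(state, float("inf")):
                dist[state] = new_cost
                prev[state] = (i, j)
                heapq.heappush(queue, (new_cost, state))
    # Walk back from the goal to recover which nodes were mapped to each other.
    pairs = []
    state = goal
    while state != start:
        before = prev[state]
        if state == (before[0] + 1, before[1] + 1):
            pairs.append(before)  # diagonal step: old[i] mapped to new[j]
        state = before
    return dist[goal], pairs[::-1]
```

For `old = ["def a", "@cache", "def b"]` and `new = ["@cache", "def a", "@cache", "def b"]` the cheapest path has cost 1: the first `@cache` is an insertion and the existing decorator stays mapped to itself. The price is that the search may visit a large part of the state space before it reaches the goal.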

[1]: https://semanticdiff.com/

New comment by DarkPlayer in "How far should a programming language aware diff go?"

DarkPlayer — Mon, 22 Jul 2024 20:00:49 +0000

You can find a comparison of the two tools here: https://semanticdiff.com/blog/semanticdiff-vs-difftastic/

As the author of SemanticDiff, I am obviously a bit biased. But Wilfred, the author of difftastic, found the analysis to be "pretty even-handed" [1], so I think it should be somewhat fair.

[1]: https://x.com/_wilfredh/status/1764424652611318146

New comment by DarkPlayer in "How far should a programming language aware diff go?"

DarkPlayer — Mon, 22 Jul 2024 19:20:22 +0000

The VS Code extension works offline. The diff calculation is performed on the host where the VS Code GUI is running (makes a difference in case of SSH/Docker/WSL).

New comment by DarkPlayer in "How far should a programming language aware diff go?"

DarkPlayer — Sat, 20 Jul 2024 20:21:44 +0000

Hi, author of SemanticDiff here.

I'm sorry you didn't have a good experience testing the tool. If it doesn't work / makes things worse than a standard diff, that's definitely considered a bug. It is probably something specific to your code and not a general issue. It would therefore be great if you could open an issue [1] or support ticket [2], ideally with some sample code, so we can take a look. Thanks in advance!

[1] https://github.com/Sysmagine/SemanticDiff/issues
[2] support@semanticdiff.com

How far should a programming language aware diff go?

DarkPlayer — Wed, 17 Jul 2024 16:36:24 +0000

Article URL: https://semanticdiff.com/blog/language-aware-diff-how-far/

Comments URL: https://news.ycombinator.com/item?id=40987674

Points: 2

# Comments: 0

SemanticDiff 0.9.0: Support for HTML, Vue, Swift and More

DarkPlayer — Tue, 09 Jul 2024 16:23:54 +0000

Article URL: https://semanticdiff.com/blog/semanticdiff-0.9.0/

Comments URL: https://news.ycombinator.com/item?id=40917931

Points: 2

# Comments: 0

New comment by DarkPlayer in "Difftastic, a structural diff tool that understands syntax"

DarkPlayer — Thu, 21 Mar 2024 19:12:32 +0000

> I am curious if there's been any work on _semantic_ diff tools as well (for when eg the syntax changes but the meaning is the same)

We are working on https://semanticdiff.com/ which detects basic semantic changes like converting a literal from decimal to hex or reordering keys within JSON objects. It is not a command line utility but a VS Code extension and GitHub App. You can check out https://semanticdiff.com/blog/semanticdiff-vs-difftastic/ if you want to learn more about how it works and how it differs from difftastic.
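
The two examples can be approximated in a few lines to show what "same meaning, different text" means here. This is a minimal sketch with hypothetical helper names; it compares parsed values only and has nothing to do with how SemanticDiff matches or displays changes.

```python
import json


def same_json(old_text: str, new_text: str) -> bool:
    """True if both documents parse to the same value; object key order is ignored."""
    # Caveat: Python also treats 1, 1.0 and true as equal, so this is too lenient
    # for a real diff tool.
    return json.loads(old_text) == json.loads(new_text)


def same_int_literal(old: str, new: str) -> bool:
    """True if two integer literals (decimal, 0x, 0o or 0b prefixed) have the same value."""
    return int(old, 0) == int(new, 0)
```

`same_json('{"a": 1, "b": 2}', '{"b": 2, "a": 1}')` and `same_int_literal("16", "0x10")` both report no change, while reordering the elements of a JSON array is still treated as a difference.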

New comment by DarkPlayer in "Large pull requests slow down development"

DarkPlayer — Wed, 22 Nov 2023 02:17:53 +0000

There are some tools that can separate actual code changes from reformatting changes. I am working on https://semanticdiff.com, a VS Code extension / GitHub App that can help you with this. There is also difftastic if you prefer a CLI-based solution. It supports more languages but can detect fewer types of reformatting changes.

New comment by DarkPlayer in "Will LLMs Eclipse the Classic Code Diff Algorithms?"

DarkPlayer — Thu, 09 Nov 2023 20:45:14 +0000

I think LLMs will have a hard time replacing traditional diff algorithms. You want your diff to be reliable and predictable. It shouldn't omit relevant changes. This is not really a strength of LLMs.

I think a system that parses the code and uses verifiable rules to hide invariant changes would have a much easier time being adopted. I may be biased since I work on SemanticDiff (think difftastic but for VS Code and GitHub) and have implemented such rules myself. There have been several times where I've thought about implementing a new rule, only to find out that some obscure edge case would not be handled correctly. So I don't see how LLMs could handle these cases correctly in the near future.

New comment by DarkPlayer in "I gave commit rights to someone I didn't know (2016)"

DarkPlayer — Tue, 30 May 2023 18:42:28 +0000

I am working on a GitHub pull request viewer that displays changes using a semantic diff, and therefore has some more advanced whitespace handling behavior than just ignoring leading or trailing whitespace. I tried it with this PR:

https://app.semanticdiff.com/django-money/django-money/pull/...

It doesn't make a huge difference, but it filters out changes like the added line break in "if value: value = str(value)" nicely. I haven't announced the project yet, but maybe someone will find it useful :-)

New comment by DarkPlayer in "Difftastic: A diff that understands syntax"

DarkPlayer — Tue, 29 Mar 2022 18:55:55 +0000

We are working on a code review tool that supports unified diffs with semantic diffing. If that sounds interesting to you, take a look at https://mergeboard.com

Execute Docker Containers as QEMU MicroVMs

DarkPlayer — Wed, 16 Jun 2021 16:05:21 +0000

Article URL: https://mergeboard.com/blog/2-qemu-microvm-docker/

Comments URL: https://news.ycombinator.com/item?id=27530074

Points: 178

# Comments: 63

Use a real Windows 7 partition in Virtualbox / KVM / VMware Player under Linux

DarkPlayer — Wed, 25 Dec 2013 21:38:20 +0000

Article URL: http://fds-team.de/cms/articles/2013-12/use-a-real-windows-7-partition-in-virtualbox-kvm-vmware-player-u.html

Comments URL: https://news.ycombinator.com/item?id=6964346

Points: 63

# Comments: 15