<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: kmike84</title><link>https://news.ycombinator.com/user?id=kmike84</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 25 Apr 2026 11:53:36 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=kmike84" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by kmike84 in "All AI Videos Are Harmful (2025)"]]></title><description><![CDATA[
<p>I'm not sure I want AI to touch me emotionally.<p>It feels insincere and manipulative, especially when I don't know upfront whether the content (music, video, text) is from another human being or from AI.<p>AI will become good enough to write songs better than humans; it's a matter of time. But it feels like someone is trying to hack my mind and exploit my human instincts. It doesn't feel like genuine art the way it has been throughout human history - people expressing themselves, creating and sharing something beautiful with each other.<p>The end result is an automated personalized "enjoy" button, and that is sad.</p>
]]></description><pubDate>Mon, 05 Jan 2026 19:10:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=46503239</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=46503239</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46503239</guid></item><item><title><![CDATA[New comment by kmike84 in "A recent chess controversy"]]></title><description><![CDATA[
<p>> whereas the best engines average 99.something%?<p>To compute accuracy, you compare the moves made during the game with the best moves suggested by the engine. So the engine will evaluate itself at 100%, given that its settings are the same during the game and during evaluation.<p>You get 99.9something% when you evaluate one strong engine using another strong engine (they're mostly aligned, but may disagree on small details), or when the engine configuration during evaluation differs from the configuration used in the game (e.g. the engine is given more time to think).</p>
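<p>As a toy illustration of the point (a hypothetical match-rate function, not the weighted accuracy formula chess sites actually use): an engine compared against its own moves, under identical settings, scores 100% by construction.</p>

```python
def match_rate(played_moves, engine_best_moves):
    """Percentage of played moves that match the engine's top choice."""
    matches = sum(p == b for p, b in zip(played_moves, engine_best_moves))
    return 100.0 * matches / len(played_moves)

# An engine evaluating its own game under the same settings matches itself:
engine_game = ["e4", "e5", "Nf3", "Nc6"]
print(match_rate(engine_game, engine_game))  # 100.0

# A second engine that disagrees on one move out of four scores lower:
other_engine = ["e4", "e5", "Nf3", "Nf6"]
print(match_rate(engine_game, other_engine))  # 75.0
```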
]]></description><pubDate>Fri, 26 Sep 2025 16:35:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=45388381</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=45388381</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45388381</guid></item><item><title><![CDATA[New comment by kmike84 in "A useful productivity measure?"]]></title><description><![CDATA[
<p>I think he/she is reacting mostly to this quote from the article, not to the main article topic:<p>> I have a good answer: my job is to double our value-add capacity over the next three years. Essentially, to double our output without increasing spending.<p>> You know what? With my XP plans and the XP coaches I’ve hired, it’s totally doable. I think I’m being kind of conservative, actually.<p>TBH, this part felt off to me as well.</p>
]]></description><pubDate>Mon, 06 May 2024 14:13:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=40274939</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=40274939</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40274939</guid></item><item><title><![CDATA[New comment by kmike84 in "Parsing URLs in Python"]]></title><description><![CDATA[
<p>The URL parsing in httpx follows RFC 3986, which is not the same as the WHATWG URL Living Standard.<p>RFC 3986 may reject URLs that browsers accept, or handle them in a different way. The WHATWG URL Living Standard tries to put real browser behavior on paper, so it's a much better standard if you need to parse URLs extracted from real-world web pages.</p>
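<p>One concrete divergence, sketched with the stdlib's RFC-3986-style parser: the WHATWG standard says browsers treat a backslash like a slash in "special" URLs, while an RFC-style parser keeps it in the host.</p>

```python
from urllib.parse import urlsplit

# A browser following the WHATWG URL standard reads the URL below as host
# "example.com" with path "/evil.com/". urllib's RFC-3986-style parser has
# no backslash rule, so the whole thing ends up in the authority component:
parts = urlsplit("http://example.com\\evil.com/")
print(parts.netloc)  # prints example.com\evil.com - not what a browser sees
print(parts.path)    # prints /
```

This kind of disagreement is exactly why parsing URLs scraped from real pages with an RFC-only parser can go wrong.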
]]></description><pubDate>Sat, 16 Mar 2024 22:33:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=39730072</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=39730072</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39730072</guid></item><item><title><![CDATA[New comment by kmike84 in "Parsing URLs in Python"]]></title><description><![CDATA[
<p>A great initiative!<p>We need a better URL parser in Scrapy, for similar reasons. Speed and WHATWG standard compliance (i.e. doing the same as web browsers) are the main things.<p>It's possible to get closer to WHATWG behavior by using urllib and some hacks. This is what <a href="https://github.com/scrapy/w3lib">https://github.com/scrapy/w3lib</a> does, which Scrapy currently uses. But it's still not quite compliant.<p>Also, surprisingly, on some crawls URL parsing can take an amount of CPU time similar to HTML parsing.<p>Ada / can_ada look very promising!</p>
]]></description><pubDate>Sat, 16 Mar 2024 22:13:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=39729950</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=39729950</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39729950</guid></item><item><title><![CDATA[New comment by kmike84 in "Ask HN: What are some of the best documentaries you've seen?"]]></title><description><![CDATA[
<p>Exit Through the Gift Shop - an amusing documentary about somebody trying to find Banksy (a street artist), and much more, supposedly directed by Banksy himself.<p>There is some debate about whether it is a documentary or not (the story is almost too good), but the evidence seems to suggest it is real.<p>EDIT: sorry, I missed the "last 4 years" part in the question. This film is older than that.</p>
]]></description><pubDate>Sun, 11 Sep 2022 17:01:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=32801267</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=32801267</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=32801267</guid></item><item><title><![CDATA[New comment by kmike84 in "Things I've learned building a modern TUI framework"]]></title><description><![CDATA[
<p>No.</p>
]]></description><pubDate>Wed, 03 Aug 2022 18:22:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=32335173</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=32335173</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=32335173</guid></item><item><title><![CDATA[New comment by kmike84 in "Things I've learned building a modern TUI framework"]]></title><description><![CDATA[
<p>The advice to use lru_cache is good.<p>But there is an issue when lru_cache is used on methods, like in the example given in the article:<p>1. When lru_cache is used on a method, `self` is used as part of the cache key. That's good: there is a single cache for all instances, and using `self` as part of the key prevents data from being shared between instances (which would be incorrect in most cases).<p>2. But: because `self` is part of the key, a reference to `self` is stored in the cache.<p>3. While a reference to a Python object exists, it can't be deallocated. So an instance can't be deallocated until the cache is deallocated (or its entries are evicted) - if an lru_cache'd method has been called at least once.<p>4. The cache itself is never deallocated (well, at least until the class is destroyed, probably at Python shutdown). So instances are kept in memory until the cache exceeds its size limit and all entries for a given instance are pushed out.<p>I think there is a similar problem in the source code as well, e.g. <a href="https://github.com/Textualize/textual/blob/4d94df81e44b27fff52f0e38f4f109212e9e8c8a/src/textual/widgets/_directory_tree.py#L65" rel="nofollow">https://github.com/Textualize/textual/blob/4d94df81e44b27fff...</a> - a DirectoryTree instance won't be deallocated after its render_tree_label method is called, at least until new cache records push out all the references to this particular instance.<p>This may or may not matter, depending on the situation, but it's good to be aware of the caveat. Unfortunately, lru_cache is not a good fit for methods.</p>
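<p>The retention can be demonstrated with a weakref (a minimal sketch; `Widget` and `render` are made-up names, and the behavior relies on CPython's reference counting):</p>

```python
import functools
import gc
import weakref

class Widget:
    @functools.lru_cache(maxsize=128)
    def render(self):  # `self` becomes part of the cache key
        return "rendered"

w = Widget()
w.render()              # the cache now holds a strong reference to `w`
ref = weakref.ref(w)

del w
gc.collect()
print(ref() is None)    # False: the cache keeps the instance alive

Widget.render.cache_clear()
gc.collect()
print(ref() is None)    # True: the instance is freed once the cache lets go
```

A common workaround is `functools.cached_property` (the value lives on the instance, not in a shared cache), or caching on a plain module-level function that takes hashable arguments instead of `self`.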
]]></description><pubDate>Wed, 03 Aug 2022 18:17:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=32335122</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=32335122</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=32335122</guid></item><item><title><![CDATA[New comment by kmike84 in "Using a "proper" camera as a webcam"]]></title><description><![CDATA[
<p>Not sure about the autofocus advice; I'm pretty happy with manual focus. It requires static camera placement and a fixed distance to the person, but isn't that the case anyway? Are people really walking around the room or moving the camera between calls?<p>Manual focus means there are fewer failure modes - slow autofocus, autofocus trying to refocus, focusing on the wrong thing, etc.<p>It also means the hardware can be cheaper - the camera doesn't need good autofocus (some old DSLR is fine), and you can also use manual lenses.</p>
]]></description><pubDate>Tue, 17 May 2022 18:58:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=31414978</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=31414978</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=31414978</guid></item><item><title><![CDATA[New comment by kmike84 in "Using a "proper" camera as a webcam"]]></title><description><![CDATA[
<p>Hm, I haven't noticed any increased latency when using a DSLR as a webcam.</p>
]]></description><pubDate>Tue, 17 May 2022 18:51:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=31414889</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=31414889</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=31414889</guid></item><item><title><![CDATA[New comment by kmike84 in "Using a "proper" camera as a webcam"]]></title><description><![CDATA[
<p>As I understand it, the drivers (webcam utility? not sure) are built for x86. For some reason they don't work in apps built for M1, so the camera only works if the app which needs video is running in emulation mode.<p>So, if you want to use a Canon DSLR on M1 in a web browser (e.g. Google Meet), get a browser built for x86.<p>I'm using Chromium, which can be downloaded for x86. The issue is that Chromium doesn't have the screen-share feature. So, for screen sharing, I'm using Chrome and joining the call a second time, in "companion mode". That's 2 separate browsers to participate in a call. Maybe there is a way to get Chrome or Firefox for x86, but I was a bit too lazy when setting it up :)</p>
]]></description><pubDate>Tue, 17 May 2022 18:50:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=31414874</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=31414874</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=31414874</guid></item><item><title><![CDATA[New comment by kmike84 in "Using a "proper" camera as a webcam"]]></title><description><![CDATA[
<p>Is it such a big issue? My Canon DSLR turns off every 30 minutes, but only for a couple of seconds, and then turns back on. On the positive side, it's now easy to notice when a 30-minute or 1-hour meeting is running over - it's a nice reminder :)</p>
]]></description><pubDate>Tue, 17 May 2022 18:40:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=31414778</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=31414778</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=31414778</guid></item><item><title><![CDATA[New comment by kmike84 in "We were promised Strong AI, but instead we got metadata analysis"]]></title><description><![CDATA[
<p>That's interesting. We're working on web data extraction at Zyte (formerly Scrapinghub); we have an Automatic Extraction product (<a href="https://docs.zyte.com/automatic-extraction-get-started.html" rel="nofollow">https://docs.zyte.com/automatic-extraction-get-started.html</a>) which combines ML and metadata to get data from websites automatically. Our learnings from building it:<p>1) metadata is helpful - not all of it, but some;
2) ML is obviously needed when metadata is missing, and metadata is missing very often;
3) Even when metadata is present, pure ML-based extraction often beats it in quality, with the right ML models. A combination of ML + metadata fallbacks is even better.<p>Website creators often make mistakes providing metadata: they may misunderstand the schema and the purpose of various fields, have metadata auto-generated incorrectly, etc. It is rarely about deception for the tasks we're working on (though that also happens).<p>So, I don't see Zyte falling back to metadata analysis; ML models are already better than this human-provided metadata - but metadata is helpful as one of the inputs.<p>We're going to publish a product extraction benchmark soon, where, among other things, we compare automatic extraction with metadata-based extraction. In this evaluation we found that ML + metadata is better than metadata alone not only overall (which is expected), but on precision as well.<p>I wonder if the reasons metadata is sometimes preferred are unrelated to quality or to the failure of ML approaches. If Google doesn't get the data right, it is not Google's fault anymore; it is the website's fault.</p>
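<p>For a sense of what "metadata-based extraction" means here, a minimal stdlib-only sketch that pulls JSON-LD blocks from a page (real pipelines use dedicated libraries and handle more formats - microdata, RDFa, OpenGraph - plus the malformed markup this snippet just skips over):</p>

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect JSON-LD metadata blocks, one common source of product data."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            try:
                self.items.append(json.loads(data))
            except ValueError:
                pass  # malformed metadata is common; this is where ML steps in

html = ('<html><head><script type="application/ld+json">'
        '{"@type": "Product", "name": "Widget", "offers": {"price": "9.99"}}'
        '</script></head><body></body></html>')
parser = JSONLDExtractor()
parser.feed(html)
print(parser.items[0]["name"])  # Widget
```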
]]></description><pubDate>Mon, 26 Apr 2021 18:24:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=26946071</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=26946071</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=26946071</guid></item><item><title><![CDATA[New comment by kmike84 in "The Ultimate MacBook+PC Monitor Showdown"]]></title><description><![CDATA[
<p>I'm unsure about the advice to stick with 1440p at 27".<p>I have a non-retina iMac 27 (1440p), an external LG 27" 4K USB-C monitor, and a MacBook Pro 13 with a real "retina" display, and I use them all regularly.<p>To my eyes, scaling works fine with 4K - font rendering is significantly better than on the 1440p iMac.<p>The 13" MacBook Pro screen is even better, and a 5K 27" would be perfect, but that's a different price point. I'm quite happy with the improvement from the 1440p => "4K with scaling" transition, and won't consider buying 1440p in the future.<p>Scaled 4K may not be the best for high-precision design work, but for development tasks / text reading it's an improvement, in my experience.</p>
]]></description><pubDate>Sat, 02 Jan 2021 18:14:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=25614883</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=25614883</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=25614883</guid></item><item><title><![CDATA[New comment by kmike84 in "Pippi and the Moomins"]]></title><description><![CDATA[
<p>My son (4.5yo) became a huge fan of Moomin tales recently. Books, audiobooks, cartoons; he likes them more than superheroes these days. These tales are not just nostalgia material for adults; they are still great children's stories.</p>
]]></description><pubDate>Sat, 17 Oct 2020 09:26:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=24808749</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=24808749</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=24808749</guid></item><item><title><![CDATA[New comment by kmike84 in "Never use a dependency that you could replace with an afternoon of programming"]]></title><description><![CDATA[
<p>I wouldn't consider a high number of open issues a problem on its own. All big, popular projects with a history have a high number of open issues. There are some exceptions, which may be closing issues aggressively, but that is more about the style of managing those issues than about project health.<p>Over time an issue tracker inevitably becomes a collection of hard-to-reproduce bugs, incomplete patches, underspecified feature requests, random tracebacks, etc. Maintainers can choose to close everything which is not immediately actionable, or be comfortable with such issues and let them live in the bug tracker. I personally like the style where an issue is closed only if it is fixed, or if it doesn't contain useful information, or if it is a duplicate.<p>A better indicator is the activity and responsiveness of the maintainers in the issue tracker.</p>
]]></description><pubDate>Tue, 11 Aug 2020 19:54:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=24125185</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=24125185</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=24125185</guid></item><item><title><![CDATA[New comment by kmike84 in "Article extraction benchmark: open-source libraries and commercial services"]]></title><description><![CDATA[
<p>The approach is very different from Dragnet. AutoExtract uses neural networks. CSS and HTML can only get you so far; we actually process screenshots as pixels (like humans do) - it is not just shallow features, as in Dragnet.<p>The speed of the AutoExtract ML part is not a concern (many pages per second on GPU) - the bottleneck is the browser rendering.</p>
]]></description><pubDate>Wed, 24 Jun 2020 21:02:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=23633842</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=23633842</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=23633842</guid></item><item><title><![CDATA[New comment by kmike84 in "Article extraction benchmark: open-source libraries and commercial services"]]></title><description><![CDATA[
<p>Firefox Reader View uses <a href="https://github.com/mozilla/readability" rel="nofollow">https://github.com/mozilla/readability</a>; if I'm not mistaken, it should be an algorithm which is similar to the one implemented in python-readability.</p>
]]></description><pubDate>Wed, 24 Jun 2020 09:48:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=23626035</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=23626035</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=23626035</guid></item><item><title><![CDATA[New comment by kmike84 in "Ask HN: Dear open source devs how do you sustain yourself"]]></title><description><![CDATA[
<p>Not sure about a full-time career, and I also don't know your current life circumstances; the best way may depend a lot on them. This is what worked for me:<p>1. University, and some time after it. Not many obligations. Take a low-effort job to sustain yourself (maybe freelance), and spend the rest of the time contributing to open source. Treat it as a time to learn. The main goal is to become good. You can learn very different things by contributing to OSS packages, as compared to working for some outdated local company. Try to internalize how popular software is organized, how people review code, etc. Find people you respect, and work with them. You don't need a shiny CV or to pass technical interviews to work with great people you can learn from, developing great real-world technology and solving hard problems.<p>2. You need a real job. Try to find one which allows you to spend some time doing open source; treat that as an important criterion for choosing a job, alongside salary, work environment, etc.<p>For me, two types of companies worked as a "real job" which allows OSS contributions.<p>First, some small startups / companies. They often don't mind if you open source a few libraries from the codebase you've created, because usually it is not the code itself which is important for startups; they're trying to find product-market fit. For them, a benefit is that the code becomes better organized (after an idea fails, code can be reused for the next idea), and developers are happier, so it can be a win-win. You won't be working on open source full time, but you'll be able to create something useful and spend a significant amount of time on it.<p>Second, there are companies which are built around open source, or which contribute a lot to open source. Often there is a company behind a popular OSS project (e.g. Elastic for Elasticsearch, or Scrapinghub for Scrapy). Sometimes a company's GitHub has many actively developed OSS projects, which is a good sign. Look for such companies, and apply. There is a higher chance of being able to work on open source if you join such a company. It is not a given that you'll be allocated to work on OSS, but previous experience maintaining Open Source and contributing to it helps. It's good to be proactive here - use the experience gained from unpaid OSS work or small-startup OSS work, and start contributing without being asked.<p>In my experience, working full time, having a family, and making significant Open Source contributions is very hard, unless an employer supports it, or unless the job is not really a full-time job.<p>There are "rockstars" who are able to sustain themselves just by working on their own OSS projects, but I think they are currently outliers, not the norm. It may be possible to do this, but I've personally seen far more opportunities to do sustainable OSS work as part of a day job, as compared to donations or a new business.</p>
]]></description><pubDate>Mon, 18 May 2020 07:45:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=23219489</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=23219489</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=23219489</guid></item><item><title><![CDATA[New comment by kmike84 in "I Want Decentralized Version Control for Structured Data"]]></title><description><![CDATA[
<p>The API is quite simple - you implement a command which receives the 3 versions as file arguments (ancestor, current, other), writes the merge result into the "current" file, and exits with a non-zero status code in case of a merge conflict. Quote from <a href="https://git-scm.com/docs/gitattributes#_defining_a_custom_merge_driver" rel="nofollow">https://git-scm.com/docs/gitattributes#_defining_a_custom_me...</a>:<p>To define a custom merge driver filfre, add a section to your $GIT_DIR/config file (or $HOME/.gitconfig file) like this<p><pre><code>  [merge "filfre"]
    name = feel-free merge driver
    driver = filfre %O %A %B %L %P
    recursive = binary
</code></pre>
The merge.&lt;driver&gt;.name variable gives the driver a human-readable name.<p>The merge.&lt;driver&gt;.driver variable’s value is used to construct a command to run to merge ancestor’s version (%O), current version (%A) and the other branches' version (%B). These three tokens are replaced with the names of temporary files that hold the contents of these versions when the command line is built. Additionally, %L will be replaced with the conflict marker size (see below).<p>The merge driver is expected to leave the result of the merge in the file named with %A by overwriting it, and exit with zero status if it managed to merge them cleanly, or non-zero if there were conflicts.</p>
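<p>A driver can be any executable honoring that %O/%A/%B contract. As a sketch (hypothetical names throughout; only the calling convention comes from the git docs), a toy driver for flat JSON files might look like:</p>

```python
#!/usr/bin/env python3
"""Toy merge driver for flat JSON objects, wired up as `driver = jsonmerge %O %A %B`.

A hypothetical sketch of the contract quoted above, not a production tool
(it ignores nested values and treats JSON nulls as deletions). Git passes
temporary file names for the ancestor (%O), current (%A) and other (%B)
versions; the driver must leave the merged result in the %A file and exit
non-zero on conflict.
"""
import json
import sys

def merge(ancestor, current, other):
    """Three-way merge of flat dicts; returns (merged, conflicting_keys)."""
    merged, conflicts = dict(ancestor), []
    for k in set(ancestor) | set(current) | set(other):
        base, ours, theirs = ancestor.get(k), current.get(k), other.get(k)
        if ours == theirs:
            value = ours            # both sides agree (or both deleted)
        elif ours == base:
            value = theirs          # only the other side changed it
        elif theirs == base:
            value = ours            # only our side changed it
        else:
            conflicts.append(k)     # both sides changed it differently
            value = ours
        if value is None:
            merged.pop(k, None)     # deleted on the side that changed it
        else:
            merged[k] = value
    return merged, conflicts

if __name__ == "__main__" and len(sys.argv) >= 4:
    o_path, a_path, b_path = sys.argv[1:4]
    with open(o_path) as o, open(a_path) as a, open(b_path) as b:
        result, conflicts = merge(json.load(o), json.load(a), json.load(b))
    with open(a_path, "w") as a:    # the result goes into %A, per the contract
        json.dump(result, a, indent=2)
    sys.exit(1 if conflicts else 0)
```

To hook it up, you would register it in git config (`[merge "jsonmerge"]` with `driver = jsonmerge %O %A %B`) and map files to it in .gitattributes, e.g. `*.json merge=jsonmerge`.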
]]></description><pubDate>Mon, 13 Apr 2020 08:45:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=22854149</link><dc:creator>kmike84</dc:creator><comments>https://news.ycombinator.com/item?id=22854149</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=22854149</guid></item></channel></rss>