Hacker News: pierrefar

New comment by pierrefar in "Brave Search launches own image and video search"

pierrefar — Fri, 04 Aug 2023 00:13:24 +0000

I replied in detail elsewhere:

https://news.ycombinator.com/item?id=36993739

New comment by pierrefar in "Brave Search launches own image and video search"

pierrefar — Fri, 04 Aug 2023 00:02:41 +0000

A search engine index is an economic exchange between the search engine and the publisher.

To massively (over)simplify the argument to its essence (and ignore other important points): the publisher goes through the trouble and expense of creating the content. The publisher then allows its content to be copied by a search engine only because being shown in search results gets it traffic back. The traffic it gets in return has value, and the publisher is happy for this arrangement to continue as long as the value of the traffic is more than the cost of producing and serving the content.

Brave offering a "license", for its own financial benefit, to "allow" others to use the content for LLM training gives zero benefit to the original publisher. This is why I use words like "sleazy" to describe Brave's position.

This argument applies to Google and Microsoft. Right now both are failing at citing sources in their generative AI search results. That is terrible and I hope it's fixed soon, as otherwise they're being sleazy scrapers as much as Brave is.

Finally, I wholeheartedly disagree that what Brave is doing is for the "greater good". The fact they charge extra for the "license" to use the content for LLM training shows that.

New comment by pierrefar in "Brave Search launches own image and video search"

pierrefar — Thu, 03 Aug 2023 22:28:21 +0000

There is a difference between a human being able to access content vs a search engine indexing it (and in the case of Brave, "licensing" it on).

I share your concern about Google having this much power, and I'd add that Microsoft Bing is equally bad but gets away with it because they're smaller. Still, the final decision about which search engine indexes a website is purely the publisher's.

New comment by pierrefar in "Brave Search launches own image and video search"

pierrefar — Thu, 03 Aug 2023 22:25:51 +0000

This is explained more in the article I referred to, but briefly: Brave delegates crawling to normal Brave browsers, so it's a huge pool of IP addresses, not a single IP address or range.

Also, these search crawls by the browser do not identify themselves beyond the Brave standard UA header, namely a plain Chrome user-agent string.

New comment by pierrefar in "Brave Search launches own image and video search"

pierrefar — Thu, 03 Aug 2023 22:23:32 +0000

That would be bad, and it is already bad that Google and Microsoft control so many search queries, but the decision about which search engine indexes a website is purely the publisher's.

New comment by pierrefar in "Brave Search launches own image and video search"

pierrefar — Thu, 03 Aug 2023 21:25:29 +0000

The major problem with Brave search is their position about indexing and licensing content against the wishes of the website publisher. Their robot does not identify itself, meaning the publisher cannot use the standard robots.txt to block its crawling if the publisher so wishes. Incidentally, the robots.txt file has been used in court cases litigating whether a search engine is legal or not.

Even worse, they state that Brave search won't index a page only if other search engines are not allowed to index it. It is morally not their right to make that call. A publisher should have full control to discriminate which search engine indexes the website's content. That's the very heart of why the Robots Exclusion Protocol exists, and Brave is brazenly ignoring it.
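To illustrate the robots.txt point, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleBot`, the `example.com` URL, and the Chrome-style user-agent string are all hypothetical stand-ins chosen for illustration; they are not taken from any real crawler's documentation. The sketch only shows the general mechanism: a robots.txt rule targets a crawler by its declared name, so a crawler that presents a generic browser user-agent falls through to the catch-all rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if robots.txt permits this user-agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# A crawler that declares its own name can be blocked by that name.
blocked_named = allowed("ExampleBot")

# Assumption: an illustrative generic Chrome-style user-agent string.
# No rule can single it out without also matching ordinary browsers,
# so it is only covered by the catch-all "*" group.
generic_browser = allowed(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0 Safari/537.36"
)

print(blocked_named, generic_browser)
```

The named crawler is refused while the generic user-agent is allowed, which is why per-crawler control depends on the crawler identifying itself.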

Even worse than that, the Brave search API allows you (for an extra fee) to get the content with a "license" to use the content for AI training. Who allowed them the right to distribute the content that way?

I wrote about all this here:

https://searchengineland.com/crawlers-search-engines-generat...

and more references elsewhere in this thread:

https://news.ycombinator.com/item?id=36989129

Amusingly, while I was writing my article, this got posted to their forums, asking about how to block their crawler:

https://community.brave.com/t/stop-website-being-shown-in-br...

No reply so far.

New comment by pierrefar in "ChatGPT-3: Google Says AI Generated Contents Are Against Webmaster Guidelines"

pierrefar — Sat, 18 Feb 2023 17:27:22 +0000

This article is way out of date and wrong, although published recently (at least according to the timestamp).

Google's official position was published on 8 February here:

https://developers.google.com/search/blog/2023/02/google-sea...

It's a much more nuanced position that can be summarized as "make sure you create good content, however you create it". A focus on quality, not process, is reasonable.

Disclosure: ex-Googler in search.

Welcome to Apple

pierrefar — Tue, 07 Jan 2020 19:18:46 +0000

Article URL: https://members.tortoisemedia.com/2020/01/06/day-1-apple-state-of-the-nation/content.html

Comments URL: https://news.ycombinator.com/item?id=21983567

Points: 4

# Comments: 0

New comment by pierrefar in "A Pirate's Guide to Accuracy, Precision, Recall, and Other Scores"

pierrefar — Mon, 18 Nov 2019 20:25:35 +0000

In simplified terms, did you find everything you could have possibly found? Looking at the formula in the article, it includes the false negatives, that is, items you misclassified as negatives when you should have considered them positives. And because that happened, you didn't find them in the set, that is, you "forgot" them. The opposite of forgetting is... recall.

Another place this idea comes up is a search engine index. If, for a given query, the algo doesn't find documents in the index that it should have (because it falsely classified them as not matching the query), it will have lower recall.
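As a small worked example of the standard recall formula (true positives divided by true positives plus false negatives), here is a sketch in Python. The numbers are made up for illustration: a hypothetical query with 10 matching documents in the index, of which the engine returns 8.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of all relevant items that were actually found."""
    relevant = true_positives + false_negatives
    if relevant == 0:
        raise ValueError("recall is undefined when there are no relevant items")
    return true_positives / relevant


# Hypothetical query: 10 documents should match, 8 were returned,
# so 2 were "forgotten" (false negatives).
example = recall(true_positives=8, false_negatives=2)
print(example)  # 0.8
```

Every missed document adds to the false negatives and so pulls recall down, regardless of how precise the returned results are.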

New comment by pierrefar in "33 Thomas Street"

pierrefar — Sun, 17 Nov 2019 07:57:43 +0000

Looks like the stages of mitosis (cell division):

https://www.nature.com/scitable/topicpage/mitosis-and-cell-d...

The faint lines are the cell walls and the bright spots in the middle would be the DNA. I can believe this is what they're going for with a bit of squinting.

New comment by pierrefar in "Webtest.app – Website Speed Test with and Without Ad Blocker"

pierrefar — Sun, 03 Nov 2019 14:09:13 +0000

Very neat. I built a virtually identical internal tool for Blockmetry. A couple of tips from experience:

1. Add other browser extensions, and you'll see a big difference between their effects. Defaults matter a lot in this space.

2. Compare mobile vs desktop. Getting mobile emulation to be good enough is a bit of work, but worth it IMO.

Based on internal usage, the typical web page will load 35-45% faster with uBlock Origin installed.

My email address is in my profile if you want to compare notes or whatnot.

New comment by pierrefar in "Cookieless cookies (2013)"

pierrefar — Tue, 09 Jul 2019 17:54:16 +0000

No, that's not a solution. It's the tracking that counts, not the cookies. I commented elsewhere in this thread with more details:

https://news.ycombinator.com/item?id=20394661

New comment by pierrefar in "Cookieless cookies (2013)"

pierrefar — Tue, 09 Jul 2019 17:53:17 +0000

Before anyone thinks this (and similar) approaches are a way around the GDPR's cookie consent tracking crackdown: It's not.

The GDPR talks about online identifiers, of which cookies, IP addresses and fingerprints are examples. If you read any regulator's guidance carefully, you'll see they talk about "cookies and similar technologies", with just "cookies" being used alone for brevity.

To rephrase: tracking of any kind is the issue, not cookies. Don't mistake the implementation for the activity.

Disclosure: Founder of a non-tracking web analytics service because of this exact issue.

New comment by pierrefar in "Show HN: Chaoslist – A self-prioritizing todo-list"

pierrefar — Sun, 27 Jan 2019 18:19:33 +0000

Congrats on the launch.

The privacy policy is not well suited to this service. The most important point is that you're based in Germany, judging by the address in the policy, but there isn't a single mention of the GDPR. That and the ePrivacy Directive are what count for you the most. My recommendation is don't use a free policy generator and get proper advice. I appreciate this isn't something commonly seen as a launch blocker, but it's important to sort it out properly.

Find your German state data protection authority, and invariably you'll find they have great guidance.

New comment by pierrefar in "Complete guide to GDPR compliance"

pierrefar — Wed, 09 Jan 2019 06:10:19 +0000

Here is a write-up of the decision from EU’s highest court on this topic: https://www.whitecase.com/publications/alert/court-confirms-...

It’s easy to see why the quote I gave says what it says with this context.

Also, if you’re worried, talk to your lawyer.

New comment by pierrefar in "Complete guide to GDPR compliance"

pierrefar — Tue, 08 Jan 2019 22:05:09 +0000

Yes, and also cookie IDs. Both are called out as examples in recital 30:

“Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags. This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.”

Source: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A...

New comment by pierrefar in "Show HN: Open source JavaScript library to record and replay the web"

pierrefar — Fri, 28 Dec 2018 16:58:03 +0000

To add to the key point about privacy, this research from Princeton is really illuminating and scary: https://freedom-to-tinker.com/2017/11/15/no-boundaries-exfil...

New comment by pierrefar in "Show HN: I made a privacy-first minimalist Google Analytics"

pierrefar — Wed, 19 Sep 2018 15:30:51 +0000

Looks good! I'm the founder of a similar service (Blockmetry). Obviously non-tracking web analytics is the future!

I'm curious why you chose to host the data yourself instead of giving customers the data immediately at the point of collection. That's the path we chose for Blockmetry, as it is genuinely required to be a non-tracking web analytics service and makes it impossible to profile users. Any service that hosts its data would still be open to being untrusted on the "no tracking no profiling" argument.

Thanks, Pierre

PS - YC Startup School founders: ping me via the forums and get an extended-period free trial.

New comment by pierrefar in "Oxford Comma Dispute Is Settled as Maine Drivers Get $5M"

pierrefar — Sat, 10 Feb 2018 08:46:52 +0000

Speaking of commas, you're missing one after the end of the interrupting phrase in your last sentence (should say ", as well as the state of Maine,"). It's a pet peeve bigger than the lack of Oxford commas, and definitely affects readability and may affect meaning.

New comment by pierrefar in "The Google Analytics Setup I Use on Every Site I Build"

pierrefar — Mon, 13 Feb 2017 21:13:56 +0000

I don't have access to the raw log files from the customers, so can't give you a percentage. All I'll say confidently is that my service processes a lot of bot traffic that needs to be filtered out before reporting.

BTW, are you the same Peter Hartree on this Segment thread? https://community.segment.com/t/1889n1/how-common-is-client-... It would appear we've crossed paths before on this topic. Please do email me if you want to talk properly. That Segment thread has my email.