Hacker News: sfletcher

New comment by sfletcher in "GitHub Copilot is not infringing copyright"

sfletcher — Mon, 05 Jul 2021 12:14:00 +0000

The Google Books case cited here allowed Google to show exact snippets (extracts) from the copyrighted books, so it's hard to see how this is any different.

New comment by sfletcher in "We Found Joe Biden’s Venmo. Why That’s a Privacy Nightmare for Everyone"

sfletcher — Sat, 15 May 2021 22:22:17 +0000

This is not nearly as fun as the time Ashley Feinberg found James Comey on Twitter: https://gizmodo.com/this-is-almost-certainly-james-comey-s-t...

New comment by sfletcher in "Machine Learning Is a Marvelously Executed Scam"

sfletcher — Fri, 14 May 2021 02:22:39 +0000

Ya seriously - while I too am skeptical about ML-as-a-service, if Corey genuinely thinks that ML has no business value, I have a rabid pack of Applied Scientists from Amazon Ads & Search that I'd like to set on him.

New comment by sfletcher in "Differentiable programming for gradient-based machine learning"

sfletcher — Thu, 19 Nov 2020 21:56:01 +0000

Is anyone still excited about differentiable programming? At least in NLP it seems like a lot of the energy has shifted to large-scale overparameterized models like BERT, i.e. you don't need arbitrary control flow in your model; all you need is attention.

New comment by sfletcher in "We can do better than DuckDuckGo"

sfletcher — Wed, 18 Nov 2020 02:04:20 +0000

I don't fully understand something about the general tech industry discourse around search and would love to hear if I'm wrong.

Here's my brief and slightly made up history of search engines:

In the beginning of time, search engines took a Boolean query (duck AND pond), found all the documents which contained both words using an inverted index, and returned them in something like descending date order. But for queries which had big result sets, this order wasn't very useful, so search engines began letting users enter more "natural language" queries (duck pond) and sorting documents based on the number of terms that overlap with the query. They came up with a bunch of relevance formulas - TF-IDF, BM25 - that tried to model the query overlap.

But it turns out this is tricky, because user intent is a really hard problem, and so modern-day search engines just declare that relevance is whatever users click on. Specifically, they just model the probability that you're going to click on a link (or something) using a DNN that uses things like the individual term overlap, the number of users that have clicked on this link, the probability it's spam, the PageRank, etc. Some search engines like Google also include personalized features like the number of times you have clicked on this particular domain - because, for instance, as a programmer your query of (Java) might have different intent than your grandmother's.

This score then gets used to sort the results into a ranked list. This is why search engines (DDG included) collect all this data - because it makes the relevance problem tractable at web scale.
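To make the first two steps of that history concrete, here's a minimal sketch, assuming a toy in-memory corpus. The documents and the function names (build_index, boolean_and, rank_by_overlap) are all made up for illustration, and this isn't how any real engine is implemented. It builds an inverted index, answers the Boolean query (duck AND pond) by intersecting posting sets, and then ranks the looser query (duck pond) by how many distinct query terms each document contains.

```python
from collections import defaultdict

# Hypothetical toy corpus; doc ids and text are invented for illustration.
DOCS = {
    1: "the duck swam across the pond",
    2: "a duck is a bird",
    3: "the pond froze over in winter",
    4: "duck duck goose by the pond",
}


def build_index(docs):
    """Map each term to the set of doc ids that contain it (an inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the doc ids that contain every term (duck AND pond)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


def rank_by_overlap(index, query):
    """Sort docs by the number of distinct query terms they contain.

    Ties are broken by doc id so the order is deterministic.
    """
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(sorted(boolean_and(INDEX, ["duck", "pond"])))  # [1, 4]
    print(rank_by_overlap(INDEX, "duck pond"))           # [1, 4, 2, 3]
```

The overlap count is the crudest possible relevance score. TF-IDF and BM25 replace the "+1 per matching term" with weights based on term frequency and rarity, and the click models described above replace the formula with a learned one.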

Maybe it's just my perspective, but I really don't understand why OP would want to build an index - it's hard, boring, expensive, and doesn't violate data privacy - and I don't think people grasp that, at least to some extent, data privacy and relevance are in direct conflict.