Hacker News: blr246

New comment by blr246 in "Potemkin Data Science (2020)"

blr246 — Wed, 26 May 2021 12:33:23 +0000

Hi beforeolives—

I like your breakdown, and I've observed similar things in my experience as an engineering-focused data person! I've had many discussions with my colleagues about how to manage these different blends of roles and skills effectively.

I'm looking for someone for an engineering type of data role right now. Is there a way to get in touch with you about it?

Our product helps companies listen to their customers by unifying natural language feedback across various channels, applying signals using various natural language modeling techniques, then aggregating them to help teams deliver better outcomes using more relevant information.

Hope to hear from you (brandon at frame.ai) :)

edit: forgot to share agreement for your breakdown

New comment by blr246 in "Americans Don’t Know What Urban Collapse Looks Like"

blr246 — Sun, 31 Jan 2021 17:27:43 +0000

We are agreeing that NYC is not at a moment of urban collapse. The processes that drive away the tax base include policies and social and market forces that erode the city's effectiveness as a sustaining economic and social hub.

The 1960s and 1970s crisis had a lot to do with the end of NYC's industrial epoch. Suburban development and globalization eliminated manufacturing and pulled workers and residents out of the city. NYC recovered by bringing high-value services, retail, and tourism back along with arts and culture.

In the time since, NYC has become increasingly a luxury experience, which is indeed part of its strength but also its weakness, since it accelerates decline when people can up and leave without having roots.

New comment by blr246 in "Americans Don’t Know What Urban Collapse Looks Like"

blr246 — Sun, 31 Jan 2021 16:14:38 +0000

>Instead, what kills cities is a long period in which their leaders fail to reckon honestly with ongoing, everyday problems—how workers are treated, whether infrastructure is repaired. Unsustainable, unresponsive governance in the face of long-term challenges may not look like a world-historical problem, but it’s the real threat that cities face.

This feels correct to me.

I lived in New York City for 15 years. Until last year. I've thought about this theme all year. Decades of policy supporting foreign investment and developer speculation gutted the chance for even affluent upper middle class New Yorkers to afford housing and set up a home base, and so many left. The situation has been incomparably more challenging for low income residents.

I agree the urban collapse meme is much easier to spread than a thoughtful discussion about policy and priorities and how to balance the economic strength of a city's major players with the daily priorities of everyday citizens. I hope the New Yorkers who remain shift priorities and initiate a different kind of prosperous era than the one I got to enjoy.

Does SIAC’s Recent Channel Rebalance Leave Tesla Exposed?

blr246 — Fri, 13 Mar 2020 02:53:48 +0000

Article URL: http://maystreet.com/news/does-siacs-recent-channel-rebalance-leave-tesla-exposed/

Comments URL: https://news.ycombinator.com/item?id=22564280

Points: 2

# Comments: 0

New comment by blr246 in "Using the linear distance operator in Postgres 12 to find the closest match"

blr246 — Thu, 07 Nov 2019 17:09:54 +0000

I had the same initial thought based on the title. Unfortunately, the answer is no.

The article discusses a low-dimensional KNN problem. The curse of dimensionality guides intuition that the methods here likely will not apply to extremely high-dimensional problems.

faiss actually comes with a lot of excellent documentation that describes the problems unique to KNN on embedding vectors. In particular, for extremely large datasets, most of the tractable methods are approximations that make use of clustering, quantization, and centroid-difference tricks to make computation efficient.

See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes and related links for more information.
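
To make the clustering idea concrete, here is a minimal pure-Python sketch of an inverted-file style search: vectors are bucketed by their nearest centroid, and a query scans only the closest bucket(s). This is not the faiss API. The function names, the `nprobe` parameter as used here, and the hand-picked centroids are all assumptions for illustration; a real index would train the centroids and add quantization on top.

```python
import math


def nearest(query, points):
    """Index of the point closest to `query` by Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))


def build_inverted_lists(vectors, centroids):
    """Assign each vector id to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        lists[nearest(vec, centroids)].append(vec_id)
    return lists


def search(query, vectors, centroids, lists, nprobe=1):
    """Approximate nearest neighbor: scan only the `nprobe` closest lists."""
    order = sorted(
        range(len(centroids)), key=lambda c: math.dist(query, centroids[c])
    )
    candidates = [v for c in order[:nprobe] for v in lists[c]]
    if not candidates:
        return None
    return min(candidates, key=lambda v: math.dist(query, vectors[v]))


# Toy 2-D data; real embedding vectors have hundreds of dimensions.
vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.9, 0.1)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]  # hand-picked, not trained
lists = build_inverted_lists(vectors, centroids)
print(search((5.0, 5.0), vectors, centroids, lists))
```

The result is approximate because a true nearest neighbor sitting just across a cell boundary is missed unless `nprobe` is raised, which is the accuracy/speed tradeoff these methods expose.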

New comment by blr246 in "Challenges in Implementing a Full-Text-Search Engine"

blr246 — Thu, 19 Sep 2019 21:36:45 +0000

At Frame.ai, we are using both PostgreSQL and faiss (and other tools) in our stack to do several different kinds of inference tasks on semantic representations of text to help companies understand and act on customer chats, emails, and phone call transcripts.

We've frequently had the same dream of adding more native support for nearest-neighbor type queries, since that is the workhorse of so many useful techniques in the modern NLP stack.

Right now, we have lots of dense vectors stored in massive toast tables in PG. It's faster to fetch them rather than recompute them, especially since there are a number of preprocessing steps that limit what we pay attention to.

The discussion here about full text search versus semantic search is interesting. In our experience, both are highly relevant. Sometimes it's most useful for our customers to segment their conversation data by exact text matches, and other times semantic clustering is most effective. I think there's plenty of reason to offer both kinds of capabilities.

New comment by blr246 in "Dark Patterns"

blr246 — Mon, 02 Sep 2019 14:48:59 +0000

It's built into the inbox view, so GMail is extracting the action from the message content and placing a button on the row element. Sorry if that wasn't clear from my initial post.

New comment by blr246 in "Dark Patterns"

blr246 — Mon, 02 Sep 2019 14:27:13 +0000

GMail seems to have opened a vector to amplify dark patterns by placing action buttons on messages. My least favorite is the LinkedIn accept invitation button, which I've clicked now several times by accident because I've spent years using GMail without it taking actions like opening GitHub PRs and accepting LinkedIn invites.

I can't find a way to disable this feature. Does anybody know how?

New comment by blr246 in "Ways to Tweak Slow SQL Queries"

blr246 — Mon, 02 Sep 2019 11:53:55 +0000

For query plans using a sort node, there can be a major difference in performance depending on the row width.

New comment by blr246 in "A guide to Oauth2"

blr246 — Mon, 02 Sep 2019 11:48:12 +0000

It's worth mentioning that it's a bad idea ever to invalidate refresh_token grants during the lifetime of an authorization. I've seen APIs do this immediately upon sending the response to the token endpoint, which makes the system unusable due to the frequency of network transmission errors that would result in having to contact the resource owner to grant access again. Even an expiry after days or years is only likely to result in more support requests to the API maintainer without increasing security enough to justify it.

The reason this bad practice is common is that it is allowed by the spec in https://tools.ietf.org/html/rfc6749#section-6 as an optional action to take on refresh grants. Please, do not do this.
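
A minimal sketch of the behavior I'm advocating, assuming a toy in-memory store (the class and method names are hypothetical, not from any OAuth library): a refresh grant issues a new access token but leaves the refresh token valid, so a client that lost the response to a network error can simply retry.

```python
import secrets


class TokenStore:
    """Toy in-memory token store; all names here are hypothetical."""

    def __init__(self):
        self._refresh_tokens = set()

    def authorize(self) -> str:
        """Called once when the resource owner grants access."""
        token = secrets.token_urlsafe(32)
        self._refresh_tokens.add(token)
        return token

    def refresh(self, refresh_token: str) -> dict:
        """Handle a refresh_token grant without invalidating the token."""
        if refresh_token not in self._refresh_tokens:
            raise PermissionError("invalid_grant")
        # Issue a new short-lived access token, but keep the refresh token
        # valid so a client that never received this response can retry.
        return {
            "access_token": secrets.token_urlsafe(32),
            "token_type": "Bearer",
            "expires_in": 3600,
        }

    def revoke(self, refresh_token: str) -> None:
        """Explicit revocation is what ends the authorization."""
        self._refresh_tokens.discard(refresh_token)
```

Only an explicit `revoke` ends the authorization here; a retried refresh after a dropped response still succeeds.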

New comment by blr246 in "Details of the Cloudflare outage on July 2, 2019"

blr246 — Fri, 12 Jul 2019 17:55:54 +0000

Your response highlights a good idea to mitigate the risk I was trying to highlight in mine.

They want to have a rapid response path (little to no delay using staging envs) to respond to emergencies. The old SOP allowed all releases to use the emergency path. By not using it in the SOP anymore, I'd be concerned that it would break silently from some other refactor or change.

Your notion is to maintain the emergency rollout as a relaxation of the new SOP such that the time in staging is reduced to almost nothing. That sounds like a good idea since it avoids maintaining two processes and having greater risk of breakage. So, same logic but using different thresholds versus two independent processes.

New comment by blr246 in "Details of the Cloudflare outage on July 2, 2019"

blr246 — Fri, 12 Jul 2019 17:10:54 +0000

Appreciate the detail here. It's a great writeup. Wondering what folks think about one of the changes:

  5. Changing the SOP to do staged rollouts of rules in
     the same manner used for other software at Cloudflare
     while retaining the ability to do emergency global
     deployment for active attacks.

One concern I'd have is whether or not I'm exercising the global rollout procedure often enough to be confident it works when it's needed. Of the hundreds of WAF rule changes rolled out every month, how many are global emergencies?

It's a fact of managing process that branches are a liability and the hot path is the thing that will have the highest level of reliability. I wonder if anyone there has concerns about diluting the rapid response path (the one having the highest associated risk) by making this process change.

edit: fix verbatim formatting

New comment by blr246 in "Password expiration is dead, long live passwords"

blr246 — Sun, 02 Jun 2019 20:59:49 +0000

The other part of this story I did not see mentioned is that I suspect that password expiration also makes organizations more vulnerable to social engineering hacks because legitimate users (I have done this) become locked out due to poorly managed password expiration, then have to call in to restore access. The use of insecure identity and authentication mechanisms like student IDs and security questions is a recipe for abuse.

Good riddance to password expiration.

New comment by blr246 in "On SQS"

blr246 — Mon, 27 May 2019 14:33:22 +0000

Kinesis is not necessarily well suited to fan-out. It is very well suited to fan-in (single consumer, multiple producers).

Each shard allows at most 5 GetRecords operations per second. If you want to fan out to many consumers, you will reach those limits quickly and have to implement a significant latency/throughput tradeoff to make it work.

For API limits, see: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_...
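
A back-of-the-envelope sketch of that tradeoff, assuming every consumer polls every shard independently with GetRecords and the consumers split the per-shard limit evenly (the function name is hypothetical):

```python
GET_RECORDS_PER_SHARD_PER_SECOND = 5  # per-shard limit cited above


def min_poll_interval_seconds(num_consumers: int) -> float:
    """Smallest polling interval each consumer can use without the group
    exceeding the per-shard GetRecords limit, assuming an even split."""
    if num_consumers < 1:
        raise ValueError("need at least one consumer")
    return num_consumers / GET_RECORDS_PER_SHARD_PER_SECOND


for n in (1, 5, 20):
    print(n, min_poll_interval_seconds(n))
```

Under these assumptions, 20 consumers can each poll a shard only once every 4 seconds, which is a floor on the latency each of them sees.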

New comment by blr246 in "Ask HN: How to Self-Study Integrated Circuit Design?"

blr246 — Mon, 13 May 2019 11:13:21 +0000

Pick up some data sheets for IC parts that might be useful in a system you'd like to design. These are published by the manufacturer, and they contain a lot of design requirements about how to lay out a PCB properly and some about the theory of operation of the parts. You can piece together a lot of practical information this way.

New comment by blr246 in "Demystifying Database Systems: An Introduction to Transaction Isolation Levels"

blr246 — Sat, 04 May 2019 00:50:17 +0000

You're correct that a table-level lock is a useful tool to ensure serializability, but it's also a drastic one since it blocks other transactions requesting a higher access level. The concurrency control system is trying to provide a generic way to both guarantee consistency and maintain a high throughput on a variety of workloads, and always locking tables would diminish throughput for many common workloads. For example, one where inserts and reads tend to occur simultaneously at high volume.

New comment by blr246 in "Demystifying Database Systems: An Introduction to Transaction Isolation Levels"

blr246 — Fri, 03 May 2019 23:13:45 +0000

To implement the serializable isolation level, the database system must track every single row you access, even the ones you only read or filter out. (The need to track reads, even for rows filtered out of a select statement, is surprising but necessary.)

Consider a common scenario where you SELECT a set of rows and take a SUM over a column. Suppose your query and another query begin reading from the same committed state of the same table. Suppose that the other query uses an UPDATE command on a set of rows in the table, and that the other query commits before yours does. In order to be consistent, the database system must detect the situation where the other query updated a row that affects the filter WHERE you scanned the table, otherwise your sum could be incorrect if the other query's committed state would cause your query to compute a SUM over a different set of rows or over modified values in your SUM. The only way for the database system to guarantee there is no conflict is to keep track of every single row your query accesses, even if it is a row passed over by a WHERE clause!

Serializability is a well studied concept. You can find lots of good resources about algorithms for implementing concurrency control and for detecting whether or not two transactions are serializable. The high-level summary is that it takes a lot of operational bookkeeping to guarantee that two queries have no conflicts, especially when you are using real-world examples having many filters and joins.
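
Here is a toy model of the SUM scenario above. It is only a sketch of the bookkeeping idea, not how any particular database implements it; the table, the `region`/`amount` columns, and the function names are made up for illustration.

```python
def scan_sum(table, predicate):
    """Sum `amount` over rows matching `predicate`.

    Returns the sum, the ids of rows that matched, and the ids of every row
    the scan examined (including rows the predicate rejected).
    """
    total, matched, examined = 0, set(), set()
    for row_id, row in table.items():
        examined.add(row_id)
        if predicate(row):
            matched.add(row_id)
            total += row["amount"]
    return total, matched, examined


def conflicts(read_set, write_set):
    """A concurrent committed write conflicts if it touched a row we read."""
    return bool(read_set & write_set)


# Both transactions start from this committed state.
snapshot = {
    1: {"region": "east", "amount": 10},
    2: {"region": "west", "amount": 20},
    3: {"region": "east", "amount": 30},
}

# Our transaction: SELECT SUM(amount) ... WHERE region = 'east'.
total, matched, examined = scan_sum(snapshot, lambda r: r["region"] == "east")

# The other transaction updates row 2 to region 'east' and commits first.
other_write_set = {2}

print(total)                                 # sum computed on the snapshot
print(conflicts(matched, other_write_set))   # tracking only matched rows
print(conflicts(examined, other_write_set))  # tracking every examined row
```

Tracking only the matched rows misses the conflict, because row 2 was rejected by the filter when we scanned it. Tracking every examined row catches it.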

New comment by blr246 in "Facebook Stored Hundreds of Millions of User Passwords in Plain Text for Years"

blr246 — Thu, 21 Mar 2019 16:36:24 +0000

Agree this is pretty much inexcusable.

Logging request or response payloads without an explicit whitelist should raise flags for any developer. There are very few cases where you can assert that not only in the present but also for all future use cases of a system, the entirety of a payload will not contain sensitive user data.

Only a whitelist will suffice to maintain good security. It's common for developers to attach sensitive data for debugging and other use cases under arbitrary paths.

Systems can improve further by adding patterns and other heuristics to drop values from the whitelist that look like sensitive data.
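
A minimal sketch of both ideas, assuming a flat dict payload. The key names and the pattern are hypothetical examples, and the heuristic is deliberately crude; it supplements the whitelist rather than replacing it.

```python
import re

# Hypothetical whitelist of top-level payload keys that are safe to log.
LOGGABLE_KEYS = {"request_id", "method", "path", "status", "note"}

# Heuristic: values that look like bearer tokens or long opaque secrets are
# dropped even when their key is whitelisted.
SENSITIVE_VALUE = re.compile(r"(?i)^bearer\s|^[A-Za-z0-9+/_=-]{32,}$")


def loggable(payload: dict) -> dict:
    """Return only whitelisted fields whose values do not look sensitive."""
    safe = {}
    for key, value in payload.items():
        if key not in LOGGABLE_KEYS:
            continue  # anything not explicitly whitelisted is never logged
        if isinstance(value, str) and SENSITIVE_VALUE.search(value):
            continue  # whitelisted key, but the value looks like a secret
        safe[key] = value
    return safe


payload = {
    "request_id": "abc",
    "path": "/login",
    "password": "hunter2",
    "note": "Bearer abc.def",
}
print(loggable(payload))
```

The `password` field is dropped because it is not on the whitelist, and `note` is dropped because its value matches the token pattern.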

New comment by blr246 in "Faster hash joiner with vectorized execution"

blr246 — Fri, 01 Feb 2019 12:52:30 +0000

Great write-up. Is the long-term vision to go completely to the vectorised query execution model, or are there cases where a row-oriented plan might be better, such as cases when there are complex computations involving multiple columns of a single row?

New comment by blr246 in "SchemaCrawler: Free database schema discovery and comprehension tool"

blr246 — Mon, 31 Dec 2018 12:36:37 +0000

This is a great tool. We use it to generate an Entity Relationship Diagram from our canonical DDL file checked into our repo.

Here's the basic recipe:

  1. Spin up a fresh Postgres instance on Docker using -P to claim an available ephemeral TCP port
  2. Use `docker inspect` to read the Postgres port
  3. Run DDL script on the fresh instance
  4. Run SchemaCrawler Docker container using --network host option so it can connect to Postgres
     and using -v so it can save a schema image to the host filesystem

This entire process is a `/bin` script checked into our repo, so we can update `/doc/db-schema.png` any time. It takes about 15s total since we have to pause for the Postgres instance to come online.
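
As a sketch of step 2 only, here is one way to pull the mapped port out of `docker inspect` output. It is written in Python for illustration and is not our actual script. The function name and the sample JSON are made up, and it assumes the container exposes Postgres on `5432/tcp`.

```python
import json


def postgres_host_port(inspect_output: str, container_port: str = "5432/tcp") -> int:
    """Extract the host port Docker mapped to `container_port`.

    `inspect_output` is the JSON array printed by `docker inspect <container>`.
    """
    container = json.loads(inspect_output)[0]
    bindings = container["NetworkSettings"]["Ports"][container_port]
    return int(bindings[0]["HostPort"])


# Trimmed, made-up sample of `docker inspect` output for illustration.
sample = """
[{"NetworkSettings": {"Ports": {"5432/tcp": [
    {"HostIp": "0.0.0.0", "HostPort": "32768"}
]}}}]
"""
print(postgres_host_port(sample))
```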