Hacker News: rw

New comment by rw in "CRDTs are the future"

rw — Tue, 29 Sep 2020 00:11:35 +0000

Operational Transformation and Conflict-Free Replicated Datatypes are very different from each other.

As the author explains, OT relies on some ordering of system events, and CRDTs don't. That means CRDT operations need to be commutative (and probably associative), while OT operations don't.

So, OT is less scalable but more powerful, and CRDTs are more scalable but less powerful (in theory).

It's sort of like comparing Paxos/Raft to Bittorrent.

(I am not an expert on OT.)
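To make the commutativity point concrete, here is a minimal sketch assuming a grow-only counter (G-Counter), one of the simplest CRDTs. The replica names and the `merge`/`value` helpers are made up for illustration and are not taken from the article. Merging takes the per-replica maximum, so the order in which states arrive doesn't matter:

```python
from itertools import permutations


def merge(a: dict, b: dict) -> dict:
    """Merge two G-Counter states by taking the per-replica maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(state: dict) -> int:
    """The counter's value is the sum of every replica's local count."""
    return sum(state.values())


# Three hypothetical replica states, each mapping replica id -> local count.
r1 = {"a": 2}
r2 = {"b": 5}
r3 = {"a": 1, "c": 3}

# Fold the states together in every possible delivery order.
results = []
for order in permutations([r1, r2, r3]):
    acc = {}
    for state in order:
        acc = merge(acc, state)
    results.append(acc)

# Every order converges to the same state, with no coordination needed.
assert all(r == results[0] for r in results)
```

An OT system would instead need the replicas to agree on an order of operations (or transform operations against each other) before applying them.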

New comment by rw in "Stegasuras: Neural Linguistic Steganography"

rw — Fri, 06 Sep 2019 09:54:10 +0000

Stegasuras is convincing work and the quality looks excellent.

I wrote a steganographic tool in this same spirit back in 2011, called Plainsight.

Back then, we didn't have deep learning, and the "Imagenet moment for NLP" had yet to arrive.

My Python code, with examples, is here: https://github.com/rw/plainsight

Unlike the OP, my Plainsight algorithm is 100% invertible by construction, and accepts binary input. (I verified the inversion process with "roundtrip fuzzing", a technique I still use today.)

Plainsight uses each bit of the input message to generate tokens. Bits are used to decide how to traverse a Huffman-style n-gram tree, weighted by frequency. This tree of n-grams is the model used in both the encoding and decoding steps. The drawbacks to my method are that the output 1) can be verbose and 2) does not convince a human that it's plausible, except for short messages.
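Here is a rough sketch of the idea, not the actual Plainsight code. It assumes a unigram Huffman tree (the real tool conditions on n-gram context) and a 32-bit length header so that padding can be stripped on the way back; both are simplifications I'm making for illustration. Enciphering walks the tree using message bits and emits a token at each leaf; deciphering looks up each token's Huffman code to recover the bits. The loop at the end is a small roundtrip fuzz test:

```python
import heapq
import random
from collections import Counter
from itertools import count


def build_tree(corpus_tokens):
    """Build a Huffman tree: leaves are tokens, internal nodes are (left, right) tuples."""
    tiebreak = count()  # unique counter so the heap never compares tree nodes
    heap = [(n, next(tiebreak), tok) for tok, n in sorted(Counter(corpus_tokens).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (a, b)))
    return heap[0][2]


def codes(tree, prefix=""):
    """Map each token to its bit string (path from the root)."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    left, right = tree
    table = codes(left, prefix + "0")
    table.update(codes(right, prefix + "1"))
    return table


def encipher(data: bytes, tree) -> list:
    """Turn bytes into tokens by using each bit to pick a branch of the tree."""
    bits = format(len(data), "032b") + "".join(format(b, "08b") for b in data)
    tokens, node, i = [], tree, 0
    while i < len(bits) or node is not tree:
        bit = bits[i] if i < len(bits) else "0"  # zero-pad to finish the last token
        i += 1
        node = node[int(bit)]
        if not isinstance(node, tuple):  # reached a leaf: emit it, restart at the root
            tokens.append(node)
            node = tree
    return tokens


def decipher(tokens, tree) -> bytes:
    """Invert encipher: tokens -> bits -> bytes, dropping any padding."""
    table = codes(tree)
    bits = "".join(table[t] for t in tokens)
    n = int(bits[:32], 2)
    payload = bits[32:32 + 8 * n]
    return bytes(int(payload[i:i + 8], 2) for i in range(0, len(payload), 8))


# Toy corpus standing in for a real model file.
corpus = "the quick brown fox jumps over the lazy dog and the dog sleeps".split()
tree = build_tree(corpus)

# Roundtrip fuzzing: random binary inputs must survive encipher -> decipher.
rng = random.Random(0)
for _ in range(200):
    msg = bytes(rng.randrange(256) for _ in range(rng.randrange(0, 32)))
    assert decipher(encipher(msg, tree), tree) == msg
```

Because frequent tokens sit near the root, they consume fewer message bits each, which is one reason the output gets verbose.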

Stegasuras has orders-of-magnitude better output, and seems to solve the problems I couldn't solve eight years ago. I would venture that their new result has as much to do with advances in language modeling as it does with the particulars of their encoding and decoding algorithms.

I'll also note that I'm glad these researchers were able to use grant money to do this work. As a non-academic, I applied for an AI Grant to support me in upgrading Plainsight to use deep learning, but I was turned away at the time.

Finally, one of the ideas I picked up back then is that spam can be used to contain secret messages. Send enough gibberish to enough people, with your intended recipient included, and you'll look like a spammer--not a spy:

   $ wget https://spamassassin.apache.org/publiccorpus/20030228_spam.tar.bz2
   $ tar -jxvf 20030228_spam.tar.bz2
   $ cat spam/0* > spam-corpus.txt

   $ echo "The Magic Words are Squeamish Ossifrage" | plainsight -m encipher -f spam-corpus.txt > spam_ciphertext
   
   $ cat spam_ciphertext
   (8.11.6/8.11.6) 3 (Normal) Internet can send e-mails until to transfer 26 10 [127.0.0.1]
   also include address from the most logical, mail business for your Car have a many our
   portals ESMTP Thu, 29 1.0 this letter on internet, ", output is ""   
   deciphering: 100% | 543.84  B/s | Time: 0:00:00
   
   The Magic Words are Squeamish Ossifrage

Tutorial: Use FlatBuffers in Rust

rw — Tue, 04 Jun 2019 20:08:36 +0000

Article URL: https://rwinslow.com/posts/use-flatbuffers-in-rust/

Comments URL: https://news.ycombinator.com/item?id=20098740

Points: 2

# Comments: 0

New comment by rw in "TimescaleDB vs. InfluxDB: built differently for time-series data"

rw — Wed, 15 Aug 2018 18:37:43 +0000

The TimescaleDB benchmark code is a fork of code I wrote, as an independent consultant, for InfluxData in 2016 and 2017. The purpose of my project was to rigorously compare InfluxDB and InfluxDB Enterprise to Cassandra, Elasticsearch, MongoDB, and OpenTSDB. It's called influxdb-comparisons and is an actively maintained project on GitHub at [0]. I am no longer affiliated with InfluxData, and these are my own opinions.

I designed and built the influxdb-comparisons benchmark suite to be easy to understand for customers. From a technical perspective, it is simulation-based, verifiable, fast, fair, and extensible. In particular, I created the "use-case approach" so that, no matter how technical our benchmark reports got, customers could say to themselves: "I understand this!". For example, in the devops use-case, we generate data and queries from a realistic simulation of telemetry collected from a server fleet. Doing it this way creates benchmarking stories that appeal to a wide variety of both technical and nontechnical customers.

This user-first design of a benchmarking suite was novel, and was a large factor in the success of the project.

Another aspect of the project is that we tried to do right by the competition. That means that we spoke with experts (sometimes, the creators of the databases themselves) on how to best achieve our goals. In particular, I worked hard to make the Cassandra, Elasticsearch, MongoDB, and OpenTSDB benchmarks show their respective databases in the best light possible. Concretely, each database was configured in a way that is 1) featureful, like InfluxDB, 2) fast at writes, 3) fast at reads, and 4) efficient with disk space.

As an example of my diligence in implementing this benchmark suite for InfluxData, I included a mechanism by which the benchmark query results can be verified for correctness across competing databases, to within floating point tolerances. This is important because, when building adapters for drastically different databases, it is easy to introduce bugs that could give a false advantage to one side or the other (e.g. by accidentally throwing data away, or by executing queries that don't range over the whole dataset).
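To illustrate the kind of check I mean, here is a minimal sketch. It is not the influxdb-comparisons code; the `results_match` helper, the row layout (timestamp, value), and the tolerances are all assumptions chosen for the example. It compares two result sets row by row, requiring the same rows and treating values as equal when they agree to within a floating point tolerance:

```python
import math


def results_match(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two query results given as lists of (timestamp, value) rows."""
    if len(a) != len(b):
        return False  # one side dropped or invented rows
    for (ts_a, val_a), (ts_b, val_b) in zip(a, b):
        if ts_a != ts_b:
            return False  # the queries did not cover the same time buckets
        if not math.isclose(val_a, val_b, rel_tol=rel_tol, abs_tol=abs_tol):
            return False  # values differ by more than rounding error
    return True


# Hypothetical results for the same query from two different databases.
db_a_rows = [(0, 0.1 + 0.2), (60, 1.5)]  # 0.1 + 0.2 is 0.30000000000000004
db_b_rows = [(0, 0.3), (60, 1.5)]
truncated_rows = [(0, 0.3)]  # an adapter bug that threw data away

assert results_match(db_a_rows, db_b_rows)
assert not results_match(db_a_rows, truncated_rows)
```

An exact equality check would flag the first pair as a mismatch, while a missing length check would let the truncated result pass.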

I don't see that TimescaleDB is using the verification functionality I created. I encourage TimescaleDB to run query verification, and write up their benchmarking methods in detail, like I did here: [1].

I think it's great that TimescaleDB is taking these ideas and extending them. At InfluxData, we made the code open-source so that others could build and learn from our work. In that tradition, I hope that the ongoing discussion about how to do excellent benchmarking of time-series databases keeps evolving.

[0] https://github.com/influxdata/influxdb-comparisons (Note that others maintain this project now.)

[1] https://rwinslow.com/rwinslow-benchmark-tech-paper-influxdb-...

New comment by rw in "I'm Scott Aaronson, quantum computing/computational complexity researcher. AMA"

rw — Fri, 29 Jun 2018 18:12:42 +0000

Hi Scott, thank you for writing your blog all these years. Your Busy Beaver essay ignited my passion for computer science, especially in algorithm analysis, logic, undecidability, and probability theory. I used to be someone who only thought in code; thanks to you, I now also think in math.

New comment by rw in "Show HN: Diamond – Full-stack web-framework in D"

rw — Tue, 03 Apr 2018 19:48:51 +0000

Why the hard dependency on MySQL?

New comment by rw in "China using big data to detain people before crime is committed"

rw — Wed, 28 Feb 2018 21:39:13 +0000

"Your scientists were so preoccupied with whether or not they could, that they didn't stop to think if they should."

- Jeff Goldblum as Dr. Ian Malcolm in Jurassic Park

The untold story of systemic gender discrimination at UC Berkeley's IT Dept

rw — Mon, 26 Feb 2018 17:03:42 +0000

Article URL: https://pando.com/2018/02/23/bears-lair-untold-story-systemic-gender-discrimination-inside-uc-berkeleys-it-department/

Comments URL: https://news.ycombinator.com/item?id=16466834

Points: 10

# Comments: 0

We're building a dystopia just to make people click on ads

rw — Mon, 30 Oct 2017 02:56:38 +0000

Article URL: https://www.ted.com/talks/zeynep_tufekci_we_re_building_a_dystopia_just_to_make_people_click_on_ads/transcript

Comments URL: https://news.ycombinator.com/item?id=15582843

Points: 2

# Comments: 1

New comment by rw in "Show HN: Using LDA to suggest GitHub repositories based on what you have starred"

rw — Mon, 02 Oct 2017 09:05:13 +0000

Good idea, the READMEs would be best of all.

New comment by rw in "Show HN: Using LDA to suggest GitHub repositories based on what you have starred"

rw — Sun, 01 Oct 2017 23:16:19 +0000

As I said, your approach is a clever way to use the GitHub API. I think you need to change the title and readme to indicate that this isn't an LDA index of GitHub descriptions. To ML practitioners, that's what you are implying with a title of "Show HN: Using LDA to suggest GitHub repositories based on what you have starred".

New comment by rw in "Show HN: Using LDA to suggest GitHub repositories based on what you have starred"

rw — Sun, 01 Oct 2017 23:05:51 +0000

This only uses LDA on your starred repository descriptions, to find topic terms that describe your starred repositories. These topic terms are then used to query the GitHub search API to find matching repositories. The results are then sorted by star count.

That is a clever way to make use of a search API like GitHub's. The principled way to do this, though, is to run LDA over all descriptions on GitHub, then use that similarity index to find similar repositories. You could run LDA over code, too.

I'll note that there is a cold start problem with this implementation: using LDA on such a small set of short documents will often lead to uninformative topics with words that are too specific. You need a big corpus to capture e.g. synonym relationships.

New comment by rw in "Gophersat: A SAT solver written in Go"

rw — Thu, 28 Sep 2017 19:40:20 +0000

No, polynomial time. For reference, see these Wikipedia pages:

https://en.wikipedia.org/wiki/Polynomial-time_reduction

https://en.wikipedia.org/wiki/Karp%27s_21_NP-complete_proble...

New comment by rw in "Zuckerberg's trust problem"

rw — Thu, 28 Sep 2017 07:52:40 +0000

Why has this article been removed from the top 250 news results? It was #1 for a few minutes, then #5, and now it's gone. We've successfully discussed much more risqué topics here on HN...

Why did the comment by TAForObvReasons calling out this apparent censorship get deleted?

New comment by rw in "Thoughts on OpenAI, reinforcement learning, and killer robots"

rw — Sat, 29 Jul 2017 00:45:36 +0000

No, it's called insufficient feature engineering. Data leakage is when your test data contaminates your training data.
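As a toy illustration of leakage in that sense (the row ids, the splits, and the `overlap` helper are made up for this example), a leaky split is one where some of the same examples land on both sides:

```python
def overlap(train, test):
    """Return the examples that appear in both the training and the test set."""
    return sorted(set(train) & set(test))


rows = list(range(10))  # stand-in for ten labeled examples

# Clean split: the two sets are disjoint.
clean_train, clean_test = rows[:8], rows[8:]

# Leaky split: rows 6 and 7 end up in both sets, so test scores look too good.
leaky_train, leaky_test = rows[:8], rows[6:]

assert overlap(clean_train, clean_test) == []
assert overlap(leaky_train, leaky_test) == [6, 7]
```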

New comment by rw in "The Future of Go Summit – Ke Jie vs. AlphaGo"

rw — Tue, 23 May 2017 03:35:59 +0000

A) How would you characterize the differences and similarities between AlphaGo and the best human players?

B) How has human play style changed since AlphaGo's introduction?

C) What is the answer to the question you most want to be asked?

New comment by rw in "Open sourcing Sonnet – a new library for constructing neural networks"

rw — Fri, 07 Apr 2017 19:02:42 +0000

TensorFlow is a dataflow computation system. Keras is for building neural networks. Each exists at a different level of abstraction.

New comment by rw in "Cracking Minesweeper with Z3 SMT Solver"

rw — Mon, 06 Mar 2017 04:24:34 +0000

How does this contrast with Answer Set Programming (using e.g. clasp)?

New comment by rw in "Introducing Keybase Chat"

rw — Thu, 09 Feb 2017 04:26:45 +0000

How did you find these changes?

New comment by rw in "The Axiom of Choice Is Wrong (2007)"

rw — Sun, 05 Feb 2017 10:16:19 +0000

You could have answered all of your questions with "finitely many", because, after all, we can each only perform a finite number of actions in the world.

In general, the infinite hierarchy of infinite sets "exists" because we can define it.