Hacker News: niviksha

New comment by niviksha in "Human Bottlenecks"

niviksha — Wed, 27 May 2026 20:35:59 +0000

Excellent read - thanks! Always felt there was a ‘self-licking ice cream cone’ at the heart of the present moment. If only AI had better context than the messy reality that is human knowledge…oh wait, we have AI to help with that problem

New comment by niviksha in "Show HN: Mandala – Automatically save, query and version Python computations"

niviksha — Fri, 12 Jul 2024 16:16:05 +0000

This is a very innovative take on infra for ML observability. Shreya Shankar and collaborators at Berkeley came up with https://github.com/loglabs/mltrace, which treads the same ground - perhaps you've looked at it already?

New comment by niviksha in "Cyc: History's Forgotten AI Project"

niviksha — Wed, 17 Apr 2024 21:40:36 +0000

There are a few symbolic logic entailment engines that run atop OWL, the Web Ontology Language, some flavors of which are rough equivalents of Cyc's KBs. The challenge, though, is that the underlying approaches are computationally hard, so nobody really uses them in practice. On top of that, the retrieval language associated with OWL is SPARQL, which also has little traction.

New comment by niviksha in "Generate Synthetic Data in 3 Lines of Code"

niviksha — Wed, 07 Sep 2022 17:42:49 +0000

My use case is generating a very high rate (10k e/s up to 100k e/s) of JSON-NL events from samples of JSON-encoded log data (JSON-NL to be exact). Is this supported in OSS Gretel?

FYI, I'd built a hand-crafted generator using JSONNet templates and Golang, but I really wanted something that could model source data distributions accurately. The use case is large-scale load testing of customer workloads without requiring actual data.
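
To make "model source data distributions" concrete, here is a minimal sketch of the simplest version of that idea: count how often each value appears per field in the sample, then emit new JSON-NL lines by sampling from those counts. This is not Gretel's API or the JSONNet/Golang generator mentioned above; the function names are made up for illustration, and it assumes flat events whose fields are independent of each other (so it ignores cross-field correlations and optional fields).

```python
import json
import random
from collections import Counter, defaultdict


def fit_field_distributions(sample_lines):
    """Count how often each value appears for each top-level field."""
    counts = defaultdict(Counter)
    for line in sample_lines:
        event = json.loads(line)
        for field, value in event.items():
            # Serialize values so unhashable ones (lists, dicts) can be counted.
            counts[field][json.dumps(value, sort_keys=True)] += 1
    return counts


def generate_events(counts, n, seed=0):
    """Yield n JSON-NL lines, sampling every field independently."""
    rng = random.Random(seed)
    fields = {
        field: (list(c.keys()), list(c.values())) for field, c in counts.items()
    }
    for _ in range(n):
        event = {
            field: json.loads(rng.choices(values, weights=weights, k=1)[0])
            for field, (values, weights) in fields.items()
        }
        yield json.dumps(event)


# Hypothetical sample of JSON-NL log lines.
samples = [
    '{"level": "info", "status": 200}',
    '{"level": "error", "status": 500}',
    '{"level": "info", "status": 200}',
]
counts = fit_field_distributions(samples)
synthetic = list(generate_events(counts, 5, seed=42))
```

A single Python process like this will not get anywhere near 100k e/s; the sketch only shows the sampling idea, not the throughput side of the problem.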

New comment by niviksha in "Generate Synthetic Data in 3 Lines of Code"

niviksha — Wed, 07 Sep 2022 17:16:52 +0000

Wow, this is great. I built my own synthetic time series data generator for benchmarking, could have saved myself a bunch of trouble with this.

New comment by niviksha in "Lesser known features of ClickHouse"

niviksha — Tue, 31 May 2022 23:54:47 +0000

Thanks for sharing this. It is a very interesting problem that highlights some of the technical challenges of working with modern event data, which happens to 'prefer' being semi-structured (i.e. JSON is the most natural serialization format while creating events).

It's also something we're working on! Shameless plug - I happen to work at Sneller (sneller.io, open source at https://github.com/SnellerInc/sneller), which might be interesting to you.

A couple of key ideas - first, we bypass the need for any sort of 'semi-structured to relational' ETL/ELT overhead by running vectorized SQL on a (compressed) binary form of the JSON data which preserves its original structure. So we're schema-on-read first and foremost - you don't need to worry about adding new fields in the source JSON as long as your queries know of these new fields.
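
As a toy illustration of what schema-on-read means here (this is not how Sneller is implemented, just a minimal sketch in plain Python with a made-up `select` helper): the query names the fields it wants, and records that predate a field simply yield nothing for it, with no table definition to migrate.

```python
import json


def select(lines, fields, where=None):
    """Evaluate a tiny projection + filter directly over JSON-NL records."""
    for line in lines:
        record = json.loads(line)
        if where is None or where(record):
            # Fields absent from a record come back as None instead of failing.
            yield tuple(record.get(f) for f in fields)


# Hypothetical events; the second one introduces a field the first never had.
events = [
    '{"ts": 1, "status": 200}',
    '{"ts": 2, "status": 500, "region": "us-east-1"}',
]

rows = list(select(events, ["ts", "region"], where=lambda r: r["status"] >= 200))
```

Roughly the equivalent of `SELECT ts, region FROM events WHERE status >= 200`, except nothing had to be declared up front when `region` showed up in the source data.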

Second, we completely separate storage from compute. Unlike CH we don't use local disk as any sort of storage tier, and use cloud object stores as our _primary_ storage tier. So all your data (including the compressed binary version of your source JSON) lives in s3 buckets in your control.

Feel free to check us out and let us know what you think!

1. Github - https://github.com/SnellerInc/sneller

2. Intro blog - https://github.com/SnellerInc/blogs/blob/main/introducing-sn...

New comment by niviksha in "Accelerated SQL for JSON with AVX512 (Golang)"

niviksha — Thu, 19 May 2022 19:15:32 +0000

Sneller head of product here. Arrow is a data exchange format; are you referring to benchmarking against DataFusion or Ballista? Also, on Presto - we did early benchmarks against Amazon's Athena (Presto under the covers) running on parquet, and will rerun these benchmarks shortly. The interesting thing to note vs Presto is that it is clunky to use with raw JSON - see https://prestodb.io/docs/current/functions/json.html. While benchmarking against Athena we actually used AWS Glue (Spark under the hood) to transform JSON into parquet, but that adds both complexity and latency to the overall pipeline, which doesn't show up in just query timings.

New comment by niviksha in "US Army's Land Trains (2020)"

niviksha — Mon, 16 May 2022 20:27:12 +0000

Great read! Minor nit - shouldn't they be called 'road trains'? The other kind are also land trains, no?

New comment by niviksha in "Mechanical Watch"

niviksha — Wed, 04 May 2022 21:59:00 +0000

This blog itself is a work of art, like mechanical watches themselves

New comment by niviksha in "Is everything falling apart?"

niviksha — Fri, 29 Apr 2022 15:12:39 +0000

This is an interesting response to the original article. My TL;DR for this is something like 'Yes, information tech has always resulted in social upheaval, so notwithstanding the scale and speed of the current iteration, this isn't new. Also, it can result in international social cohesion (aka Nazis of the world unite).' The thing is, it feels like the author is telling us not to worry because all of this has happened before (with horrific consequences to those caught in the churn, to borrow my favorite phrase from The Expanse). That is a weak rationalization at best - 'same shit, different day (albeit global scale, speed of light), so don't worry' isn't as reassuring as it should sound.

New comment by niviksha in "Getting Started with MapD, Part 2: Electricity Dataset"

niviksha — Tue, 27 Feb 2018 02:00:53 +0000

i'm not questioning your use of indexes. i'm questioning the basis of the unqualified assertion that 'query times are slower than a SQLite database' - with indexes off, and both running on CPUs, the basis of comparison isn't tilted one way or the other, and then it isn't clear at all that mapd is 'much slower'.

New comment by niviksha in "Getting Started with MapD, Part 2: Electricity Dataset"

niviksha — Sun, 25 Feb 2018 18:16:16 +0000

hi - i took a look at the sqlite db. not to nitpick, but you have 4 indexes there?

the overhead of indexes at tiny data sizes like this is minimal, so no surprise that sqlite or pg will compare favorably at laptop scale (FWIW, i ran this on a 4-core macbook pro with no GPU, i.e. CPU only, and without indexes: mapd runs the query in ~1.5 sec while sqlite takes ~6).
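
for anyone who wants to repeat the "without indexes" comparison, here is a minimal sketch of listing and dropping the explicitly created indexes in a sqlite file using Python's built-in sqlite3 module. the table and index names below are made up for illustration (not the ones in the article's database), and an in-memory database stands in for the real file.

```python
import sqlite3


def list_indexes(conn):
    """Return (index_name, table_name) for every explicitly created index."""
    rows = conn.execute(
        "SELECT name, tbl_name FROM sqlite_master "
        # Auto-created indexes (PRIMARY KEY / UNIQUE) have NULL sql and cannot be dropped.
        "WHERE type = 'index' AND sql IS NOT NULL ORDER BY name"
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the real .db file
conn.execute("CREATE TABLE readings (plant_id INTEGER, ts TEXT, mwh REAL)")  # hypothetical table
conn.execute("CREATE INDEX idx_readings_plant ON readings (plant_id)")  # hypothetical index

indexes_before = list_indexes(conn)

# Drop every explicit index so queries fall back to full table scans.
for name, _table in indexes_before:
    conn.execute(f'DROP INDEX "{name}"')

indexes_after = list_indexes(conn)
```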

however, the bigger point is about how this experience scales. take a look at this public demo - https://www.mapd.com/demos/ships. it's 11 billion geo data points, and you'll see the same low millisecond response that FridgeSeal is talking about - again, no indexing (to verify, go to dev tools in your browser, turn on SQLLogging(true) in the console and look at the query times reported for the round trip). i'm sure you appreciate that the overhead (both creation and maintenance) of indexing scales with data size, plus sqlite/pg don't (AFAIK) offer anything by way of GPU-rendered charting.

New comment by niviksha in "Cray and Microsoft Bring Supercomputing to Azure"

niviksha — Mon, 23 Oct 2017 22:10:04 +0000

TL;DR - It does come down mainly to the network, but in far more interesting ways than is apparent from some of the answers here - and also the nature of the HPC software ecosystem that co-evolved with supercomputing for the last 30+ years. This community has pioneered several key ideas in large scale computing that seem to be at risk in the world of cheap, lease-able compute.

In scientific computing (usually where you see them), the primary workload is simulation/modeling of natural phenomena. The nature of this workload is that the more parallelism that is available, the bigger/more fine-grained a simulation can run, and hence the better it can approximate reality (as defined by the scientific models which are being simulated). Examples of this are fluid dynamics, multi-particle physics, molecular dynamics, etc.

The big push with these types of workloads is to be able to get efficient parallel performance at scale - so it isn't about just the # of cores, PB of disk or TB of DRAM, but whether the software and underlying hardware work well together at scale to exploit the available aggregate compute.

So the network matters, not just raw bandwidth but things like latency of remote memory access and the topology itself - for example, the Cray XCs going to Azure allow for a programming model (PGAS) that allows for large, scalable global memory views where a program can view the total memory of a set of nodes as a single address space. Underneath, the hardware and software work together to bound latency, do adaptive per-packet routing and ensure reliability - all at the level of 10s of thousands of nodes. In a real sense, the network is the (super)computer - the old Sun slogan.

Where else is this useful? Well, look at deep learning - the new hotness in parallel computing these days - they are all realizing that it's amazing to run on GPUs, but once you have large enough problems (which the big guys do), you end up having to figure out how to get a bunch of GPUs to communicate efficiently during data parallel training (that efficient parallelism thing). This happens to map to a relatively simple set of communication patterns (e.g. AllReduce) that is a small subset of the kinds that the HPC community has solved for - so it's interesting that many deep learning engineers are starting to see the value of things like RDMA and frameworks like MPI (Baidu, Uber, MSFT and Amazon for starters).
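
For readers who haven't run into AllReduce: every worker starts with its own vector (e.g. local gradients) and every worker ends with the elementwise sum. Below is a minimal in-process sketch of one common way to do it, the ring variant (reduce-scatter followed by all-gather). It is a simulation in plain Python for illustration only, not MPI's or any framework's actual implementation; real systems do the sends over the network, concurrently.

```python
def ring_allreduce(worker_data):
    """Simulate a sum-AllReduce over a ring of workers, in-process.

    worker_data: list of equal-length lists, one per worker.
    Returns a new list of lists, each holding the elementwise sum.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    data = [list(v) for v in worker_data]  # copy so the input is not mutated
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]

    def chunk(c):
        return range(bounds[c % n], bounds[(c % n) + 1])

    # Phase 1: reduce-scatter. At step s, worker i sends chunk (i - s)
    # to worker i + 1, which adds it into its own copy.
    for s in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step look simultaneous.
        sends = [[data[i][k] for k in chunk(i - s)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, v in zip(chunk(i - s), sends[i]):
                data[dst][k] += v

    # Now worker i holds the fully reduced chunk (i + 1) % n.
    # Phase 2: all-gather. At step s, worker i sends chunk (i + 1 - s)
    # to worker i + 1, which overwrites its copy.
    for s in range(n - 1):
        sends = [[data[i][k] for k in chunk(i + 1 - s)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, v in zip(chunk(i + 1 - s), sends[i]):
                data[dst][k] = v
    return data


# Three hypothetical workers, each with a local "gradient" vector.
grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = ring_allreduce(grads)
```

The appeal of the ring is that each worker only ever talks to its neighbor and moves roughly 2x its own data in total regardless of how many workers there are, which is why latency and topology end up mattering as much as raw bandwidth.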

Interestingly though, the word supercomputing is being co-opted by the very companies that you're positioning as the alternative - the Google TPU Cloud is a specialized incarnation of a typical supercomputing architecture. Sundar Pichai refers to Google as being a 'supercomputer in your pocket'.