Hacker News: proddata

New comment by proddata in "The world of PostgreSQL wire compatibility"

proddata — Thu, 10 Feb 2022 19:10:03 +0000

CrateDB DevRel here :)

> databases providing an abstraction through the Postgres wire protocol

I would not call it an abstraction if one has a full parser, analyzer, planner and execution engine. It is just a common language ;)

New comment by proddata in "How Time Series Databases Work, and Where They Don’t"

proddata — Wed, 20 Oct 2021 06:50:48 +0000

Not being able to keep up with the incoming data. But 100-200Hz I'd consider fine for most

New comment by proddata in "How Time Series Databases Work, and Where They Don’t"

proddata — Mon, 18 Oct 2021 19:09:45 +0000

What do you mean by high-frequency data? 100Hz, 1kHz, 100kHz? For those kinds of use cases many time-series DBs break apart. We have customers storing multiple millions of high-frequency measurements per second in arrays.

I would say Postgres is not too storage efficient in itself for large amounts of data, especially if you need any sort of indexes. Timescale basically mitigates that by automatically creating new tables in the background ("chunks") and keeping individual tables small.
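
To illustrate the idea of keeping individual tables small, here is a minimal sketch of time-based chunking. It is not Timescale's actual implementation: the 7-day `CHUNK_INTERVAL`, the `chunk_start` and `insert` helpers, and the in-memory `chunks` dict standing in for per-chunk tables are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=7)  # hypothetical chunk width
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def chunk_start(ts: datetime) -> datetime:
    """Return the start of the fixed-width time bucket that contains ts."""
    return EPOCH + ((ts - EPOCH) // CHUNK_INTERVAL) * CHUNK_INTERVAL


# Maps chunk start -> rows; each list stands in for one small table.
chunks = defaultdict(list)


def insert(ts: datetime, value: float) -> None:
    """Route a row to the chunk covering its timestamp, creating it on demand."""
    chunks[chunk_start(ts)].append((ts, value))
```

Rows that fall into the same interval land in the same chunk, so each chunk (and any index on it) stays bounded in size no matter how much data arrives overall.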

New comment by proddata in "How Time Series Databases Work, and Where They Don’t"

proddata — Mon, 18 Oct 2021 18:59:45 +0000

Sorry, I mixed up the numbers: it is 2GB memory (0.5GB heap). So 1:500 is correct.

New comment by proddata in "How Time Series Databases Work, and Where They Don’t"

proddata — Mon, 18 Oct 2021 18:56:18 +0000

Most CrateDB clusters run on cloud providers' hardware (Azure, AWS, Alibaba). Using EBS (GP2 or now GP3) is also quite common. Due to the indexing / storage engine, gp disks are typically sufficient and faster disks have little to no advantage.

New comment by proddata in "How Time Series Databases Work, and Where They Don’t"

proddata — Mon, 18 Oct 2021 09:26:33 +0000

- Depends - Just inserting, indexing, storing and simple querying can be done with little memory (i.e. 1:500 memory-disk ratio, 0.5GB RAM per 1TB disk). Typical production clusters with high query load are in the 1:150 range (i.e. 64GB RAM for 10TB disk).

Otherwise typical general purpose hardware (Standard SSDs, 1:4 vCPU:memory ratios, ...)
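
As a quick arithmetic check of these ratios, here is a small sketch. It uses the figure of 2GB memory per 1TB disk given in the follow-up correction, and it assumes decimal units (1TB = 1000GB); the `memory_disk_ratio` helper is hypothetical.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def memory_disk_ratio(memory_gb: float, disk_gb: float) -> str:
    """Return the memory:disk ratio normalised to 1:N, with N rounded."""
    return f"1:{round(disk_gb / memory_gb)}"


# Light ingest and simple querying: 2GB memory per 1TB disk
print(memory_disk_ratio(2, 1 * GB_PER_TB))    # 1:500

# High query load: 64GB RAM for 10TB disk
print(memory_disk_ratio(64, 10 * GB_PER_TB))  # 1:156
```

The second case works out to roughly 1:156, which is what "in the 1:150 range" refers to.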

New comment by proddata in "How Time Series Databases Work, and Where They Don’t"

proddata — Mon, 18 Oct 2021 06:23:50 +0000

> is an OLAP database a common go-to for longer-timescale analytics (as in [1])?

I would not consider Clickhouse or CrateDB "classic" OLAP DBs. I can speak for CrateDB (I work there): it would definitely be able to handle 600GB and query across it in an ad-hoc manner.

We have users ingesting terabytes of events per day and running aggregations across 100 terabytes.

New comment by proddata in "How Time Series Databases Work, and Where They Don’t"

proddata — Mon, 18 Oct 2021 05:08:42 +0000

That article takes various concepts from typical TSDB solutions and seemingly only looks at the bad sides. Time series data takes many different forms, and not every form works for every TSDB solution.

For the 3 caveats at the top, there are already two TS solutions that look promising (QuestDB, TimescaleDB). Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution.

New comment by proddata in "OpenSearch: AWS fork of Elasticsearch and Kibana"

proddata — Tue, 13 Apr 2021 05:23:31 +0000

Sorry, but this is not true at all.

Some of the biggest changes within ES come from Lucene, like the _massive_ reduction in memory footprint, enabling ES use cases that were not even possible before.

New comment by proddata in "ClickHouse as an alternative to Elasticsearch for log storage and analysis"

proddata — Tue, 02 Mar 2021 18:18:32 +0000

If you are looking for an OSS ES replacement, CrateDB might also be worth a look :)

Basically a best of both worlds combination of ES and PostgreSQL, perfect for time-series and log analytics.

New comment by proddata in "Doubling down on permissive licensing and the Elasticsearch lockdown"

proddata — Thu, 28 Jan 2021 14:54:29 +0000

Yes, we have customers using CrateDB as part of their proprietary product.

Also, the SSPL is so vague that we would probably have to release not only CrateDB itself (which we already do), but also everything we use for the services we provide. We also could never make any kind of deals with OEMs, etc.

New comment by proddata in "Doubling down on permissive licensing and the Elasticsearch lockdown"

proddata — Thu, 28 Jan 2021 07:46:30 +0000

The thing is that all the arguments they now bring up for the move were true in 2018 as well ...

New comment by proddata in "Doubling down on permissive licensing and the Elasticsearch lockdown"

proddata — Wed, 27 Jan 2021 20:32:21 +0000

> So, they don't run Linux, don't use glibc? That can't be all that common? (I mean sure, there's the bsds.. But still..).

We do run Linux :)

But there is a difference between building on and building with.

New comment by proddata in "Doubling down on permissive licensing and the Elasticsearch lockdown"

proddata — Wed, 27 Jan 2021 20:02:33 +0000

> This begs the question: isn't "a restrictive OSS licence" not less "fully open source" than a more permissive licence like GPL, MIT or BSD?

We're going to change CrateDB fully to Apache License v2 ;) I would say that counts as a "more permissive" license.

> Is that really only because of some enterprises not liking GPL?

There are various reasons for the change. A big part is definitely also the spirit of many of our contributors. We built CrateDB on open source software and also want to make the software available as open source. It was also planned for quite some time to be more open.

New comment by proddata in "Doubling down on permissive licensing and the Elasticsearch lockdown"

proddata — Wed, 27 Jan 2021 19:03:28 +0000

> If your business model cannot survive when a critical upstream piece of your infrastructure moves to GPL, you probably have a bad business model to begin with.

To be clear, CrateDB started out as OSS and we decided to stay OSS. Elasticsearch used the Apache License and so did CrateDB. All in the spirit of OSS. Elastic are, however, now the ones who decided that their business model isn't viable anymore.

> It sounds like they are making up excuses for not wanting to fully Open Source their code

We do want to make it fully open source! Everything that was under a more restrictive License is going to be offered under Apache License.

New comment by proddata in "CrateDB: Purpose-built to scale modern applications in a machine data world"

proddata — Sat, 28 Nov 2020 09:22:17 +0000

Fair point - I will review this with our marketing and get that fixed

New comment by proddata in "CrateDB: Purpose-built to scale modern applications in a machine data world"

proddata — Sat, 28 Nov 2020 09:20:00 +0000

Many reasons actually ...

- Scalability: CrateDB is built for horizontal scale from the ground up on top of distributed technologies. We have customers using clusters with 80+ nodes in production for many years now.

Timescale just released their multi-node feature in beta and they follow a different concept than we do. While Timescale uses a leader (access node) - follower (data node) model with a single point of failure, CrateDB is built on a shared-nothing architecture. Many features you would want to see in a distributed system are present in CrateDB and still missing in TS:

- cluster-wide replication
- automatic rebalancing
- cluster-wide backup
- shared-nothing architecture / no single point of failure

- Full Text Search: CrateDB is built on Lucene and parts of ES and includes search capabilities you would typically need a separate product for when using PG/TS.

- Distributed Query Engine: Yes, PG/TS are fast if you query "small" amounts of data (e.g. the last day's data). But if you have a distributed system, you might as well also want to run queries on larger data sets.

- Geospatial Queries: Powered by Lucene's BKD trees

---

Disclaimer: I work for Crate.io and I also think Timescale are doing awesome stuff in many ways and give Influx the competition they deserve. I don't see us in direct competition (at least not yet), as the focus of Timescale is clearly more on smaller use cases.