Hacker News: rpedela

New comment by rpedela in "AWS Tools Suck"

rpedela — Fri, 17 Dec 2021 12:38:54 +0000

GCP is pretty pleasant overall. The API, command line, and UI are all pretty good. The vast majority of functionality is supported in all three places too. Like all providers, they have a few services that suck, but overall Google has done a great job at developer experience.

New comment by rpedela in "The Inherent Limitations of GPT-3"

rpedela — Mon, 29 Nov 2021 20:04:48 +0000

There are several use cases where ML can help even if it isn't perfect, or is even just better than random. Here is one example in NLP/search.

Let's say you have a product search engine and you analyzed the logged queries. What you find is a very long tail of queries that are only searched once or twice. In most cases, the queries are either misspellings, synonyms that aren't in the product text, or long queries that describe the product with generic keywords. And the queries either return zero results or junk.

If text classification for the product category is applied to these long tail queries, then the search results will improve and likely yield a boost in sales because users can find what they searched for. Even if the model is only 60% accurate, it will still help because more queries are returning useful results than before. However, you don't want to apply ML with 60% accuracy to your top N queries, because those already return good results and a wrong prediction could ruin them and reduce sales.

Knowing when to use ML is just as important as improving its accuracy.
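To make the head/tail split concrete, here is a rough Python sketch. It assumes you have a list of logged queries and some category classifier; `split_head_and_tail`, `plan_search`, and `toy_classifier` are hypothetical names for illustration, and the toy classifier is just a stand-in for a real trained model.

```python
from collections import Counter


def split_head_and_tail(logged_queries, head_size):
    """Return (head, tail) sets of distinct queries, split by logged frequency."""
    counts = Counter(q.strip().lower() for q in logged_queries)
    head = {q for q, _ in counts.most_common(head_size)}
    tail = set(counts) - head
    return head, tail


def plan_search(query, head, classify_category):
    """Apply the imperfect classifier only to queries outside the head."""
    normalized = query.strip().lower()
    if normalized in head:
        # Top-N queries are left alone so a wrong prediction cannot hurt them.
        return {"query": normalized, "category_filter": None}
    return {"query": normalized, "category_filter": classify_category(normalized)}


def toy_classifier(query):
    # Hypothetical stand-in for a trained text classification model.
    return "footwear" if "shoe" in query or "sneaker" in query else "unknown"


log = ["laptop", "laptop", "laptop", "phone", "phone", "red runing shoe for trails"]
head, tail = split_head_and_tail(log, head_size=2)
head_plan = plan_search("laptop", head, toy_classifier)
tail_plan = plan_search("red runing shoe for trails", head, toy_classifier)
print(head_plan)
print(tail_plan)
```

The point of the design is only that the classifier never touches the head queries; how you use the predicted category (filter, boost, fallback) is a separate choice.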

New comment by rpedela in "ETL Pipelines with Airflow: The Good, the Bad and the Ugly"

rpedela — Sat, 09 Oct 2021 03:31:36 +0000

If you can write a SQL query or a set of SQL queries to do your transformation, then you can use DBT. DBT doesn't do the transformation itself; rather, it helps you manage all the dependencies between your SQL models. Whether you can use SQL depends on your data and database/warehouse functionality. For example, JSON parsing support is pretty good now in many databases and warehouses. If your objects can be represented as JSON, then you could write SQL via DBT to parse the objects into columns and tables.

New comment by rpedela in "What Is the Data Lakehouse Pattern?"

rpedela — Wed, 15 Sep 2021 17:14:28 +0000

DBT has two main innovations. First, everything is a SELECT statement and DBT handles all the DDL for you. You can handle DDL yourself if you have a special case too. Second, the ref/source macros build a DAG of all your models so you don't have to think about build order. There are other innovations but those are the main ones.

You can give it truly pure SQL in both models and scripts, and mix in Jinja if you need it for dynamic models. But I'd recommend at least using ref/source.
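The build-order idea behind ref can be sketched in a few lines of Python. This is not how DBT is implemented internally; it is a minimal illustration that assumes models are plain strings, finds `{{ ref('...') }}` calls with a regex, and sorts them topologically with the standard library's `graphlib` (Python 3.9+). The model names and the `build_order` helper are made up for the example.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} or {{ ref("model_name") }}.
REF_PATTERN = re.compile(r"""\{\{\s*ref\(\s*['"]([^'"]+)['"]\s*\)\s*\}\}""")


def build_order(models):
    """models maps model name -> SQL text; returns names with dependencies first."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical models, deliberately listed out of order.
models = {
    "daily_revenue": "SELECT order_date, SUM(amount) FROM {{ ref('orders_enriched') }} GROUP BY 1",
    "orders_enriched": "SELECT * FROM {{ ref('stg_orders') }} JOIN {{ ref('stg_customers') }} USING (customer_id)",
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
}
order = build_order(models)
print(order)
```

Both staging models come out before `orders_enriched`, which comes out before `daily_revenue`, regardless of the order they were written in.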

New comment by rpedela in "What Is the Data Lakehouse Pattern?"

rpedela — Wed, 15 Sep 2021 05:30:56 +0000

I think using a data warehouse as your data lake or lake house is optimal, even for data that isn't relational. Storage is so cheap now, and decoupled from compute costs at several providers, that I don't even give it a thought. You get a fast, scalable SQL interface which is still nice and useful for non-relational data. Then all, or most, of the transformations needed for analysis can be pure SQL using a tool like DBT. In my experience, it greatly simplifies the entire pipeline.

New comment by rpedela in "Nearest neighbor indexes for similarity search"

rpedela — Thu, 12 Aug 2021 22:04:34 +0000

For my search use case, documents are mostly single topic and less than 10 pages. However, I have found embeddings still work surprisingly well for longer documents with a few topics in them. But yes, multi-topic documents can certainly be an issue. Segmentation by sentence, paragraph, or page can help here. I believe there are ML-based topic segmentation algorithms too, but that certainly starts making it less simple.
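A rough sketch of paragraph-level segmentation, assuming paragraphs are separated by blank lines. `toy_embed` is a hypothetical bag-of-words stand-in so the example runs without a model; in practice you would swap in a real pretrained embedding.

```python
import math
import re
from collections import Counter


def split_paragraphs(text):
    """Split on blank lines and drop empty segments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def toy_embed(text):
    # Hypothetical stand-in for an embedding model: word counts as a sparse vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_segment(query, document):
    """Score each paragraph on its own and return (score, paragraph) for the best one."""
    q = toy_embed(query)
    return max((cosine(q, toy_embed(p)), p) for p in split_paragraphs(document))


doc = "Solar panels convert sunlight into electricity.\n\nSourdough bread needs a long, slow fermentation."
score, paragraph = best_segment("bread fermentation", doc)
print(paragraph)
```

Scoring segments separately means a two-topic document can still match strongly on one topic instead of being averaged into something that matches neither.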

New comment by rpedela in "Nearest neighbor indexes for similarity search"

rpedela — Thu, 12 Aug 2021 12:34:38 +0000

Google has a distributed embedding matching service in preview: https://cloud.google.com/vertex-ai/docs/matching-engine/over...

I guess it depends on what you mean by "simple". The algorithms are complex but there are good tools that implement them. I would imagine smaller companies would use off the shelf tooling, and I would argue that is simpler. Vector embeddings are so unbelievably powerful and often yield better results than classical methods with one of the good tools + pretrained embeddings.

Specifically for search, I use them to completely replace stemming, synonyms, etc in ES. I match the query's embedding against the document embeddings and take the top 1000 or so. Then I ask ES for the BM25 score for that top 1000. I combine the embedding match score with BM25, recency, etc for the final rank. The results are so much better than using stemming, etc, and it's overall simpler because I can use off the shelf tooling and the data pipeline is simpler.
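Here is a minimal Python sketch of that two-stage ranking. Everything concrete in it is an assumption for illustration: the brute-force cosine search stands in for a nearest neighbor index, `bm25_scores` is a plain dict standing in for scores fetched from ES, the 0.7/0.3 weights are arbitrary, and recency is left out.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(query_vec, doc_vecs, bm25_scores, top_k=1000, w_embed=0.7, w_bm25=0.3):
    """Stage 1: top_k docs by embedding similarity. Stage 2: blend with BM25."""
    candidates = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )[:top_k]
    # Normalize BM25 over the candidates so both signals are on a similar scale.
    max_bm25 = max((bm25_scores.get(d, 0.0) for _, d in candidates), default=0.0) or 1.0
    scored = [
        (w_embed * sim + w_bm25 * bm25_scores.get(d, 0.0) / max_bm25, d)
        for sim, d in candidates
    ]
    return [d for _, d in sorted(scored, reverse=True)]


# Toy data: "c" has the best BM25 score but is not an embedding match.
doc_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
bm25_scores = {"a": 1.0, "b": 5.0, "c": 9.0}
ranked = hybrid_rank([1.0, 0.0], doc_vecs, bm25_scores, top_k=2)
print(ranked)
```

With `top_k=2`, "c" never reaches the second stage, and BM25 only reorders the two embedding candidates.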

New comment by rpedela in "Official Elasticsearch Python library no longer works with open-source forks"

rpedela — Mon, 09 Aug 2021 01:20:33 +0000

Yes, I know for sure. Postgres search is essentially an easier to use regex engine. If you have a recall-only use case and/or a small dataset, then that works great. As soon as you need multiple languages, advanced autocomplete, misspelling detection, large documents, large datasets, custom scoring, etc, you need Solr or ES.

New comment by rpedela in "Vertical farms grow veggies on site at restaurants and grocery stores"

rpedela — Thu, 21 Jan 2021 02:57:23 +0000

I disagree based on the trends, especially in the US where few, if any, new coal plants are being built and most existing ones are scheduled to be converted to natural gas or shut down. But let's assume that a significant amount of coal will continue to be burned over the next 20 years. Do you think we should stop innovation in other sectors until we are off coal?

New comment by rpedela in "Vertical farms grow veggies on site at restaurants and grocery stores"

rpedela — Thu, 21 Jan 2021 01:20:58 +0000

Coal is on its way out. We don't have to wait for the perfect energy solution before we can improve agriculture.

New comment by rpedela in "Intentionally Leaking AWS Keys"

rpedela — Tue, 19 Jan 2021 22:07:55 +0000

If you are going to put creds in git, at least use sops.

https://github.com/mozilla/sops

New comment by rpedela in "Cincinnati is home to the largest unused subway system in the world (2016)"

rpedela — Wed, 06 Jan 2021 02:14:38 +0000

I grew up in Cincinnati. You are correct; however, traffic is bad enough now that a subway or light rail would make a big difference. Unfortunately it is unlikely the voters would approve a tax increase to support that.

New comment by rpedela in "FDA statement on following the authorized dosing schedules for Covid-19 vaccines"

rpedela — Tue, 05 Jan 2021 23:56:11 +0000

I personally try to make decisions using probability. I understand where you are coming from but there is one factor missing in your analysis: this situation is literally life and death. That changes the math a bit to something more akin to "better safe than sorry" in my opinion. We have found a guaranteed path out of this mess. There may be other, faster paths that save more lives, but they could also end up killing millions more. I'm all for experimenting but I take issue with making the experiment the policy when lives are on the line.

New comment by rpedela in "FDA statement on following the authorized dosing schedules for Covid-19 vaccines"

rpedela — Tue, 05 Jan 2021 21:11:44 +0000

We have no data for mRNA vaccines except those trials. We don't know if the historical data on traditional vaccines are applicable here. As such, it is also possible that a single dose isn't good enough or a single dose is good enough but only for a couple months. We just don't know and it would be way worse if it turns out we have to re-vaccinate everyone because we were impatient.

New comment by rpedela in "Peer-reviewed papers are getting increasingly boring"

rpedela — Fri, 01 Jan 2021 21:53:03 +0000

Yeah I agree, I would like to see some percentage be lottery based. I think you should still need to write a proposal, but if the proposal isn't selected then it goes in the lottery pool. There are two reasons to still write the proposal: it helps organize the researcher's thoughts and it shows the researcher is serious.

New comment by rpedela in "New U.S. Dietary Guidelines Reject Recommendation to Cut Sugar, Alcohol Intake"

rpedela — Tue, 29 Dec 2020 16:23:01 +0000

Where do you think those calories come from?

New comment by rpedela in "New U.S. Dietary Guidelines Reject Recommendation to Cut Sugar, Alcohol Intake"

rpedela — Tue, 29 Dec 2020 16:18:12 +0000

> no dietary need for sugar

Can I assume you mean simple sugars that are added for taste? If you literally meant what you said, well you will die without carbohydrates (sugar).

New comment by rpedela in "FAA issuing new rules to allow drones to fly over people and at night"

rpedela — Tue, 29 Dec 2020 00:22:21 +0000

I think a better analogy would be living near an airport like I do. Depending on the day and weather, even my small airport is quite loud and annoying. My neighbor's motorcycle is also louder than the planes, except the occasional military jet, but that noise is far less frequent than the planes. I think drone delivery will mean everyone in cities and suburbs will feel like they live near a small airport on a clear day. I too am not looking forward to the flying drone future. I think we should be working on reducing noise as much as possible, not creating more in the name of slightly more convenient consumerism.

New comment by rpedela in "Bye Bye Mongo, Hello Postgres (2018)"

rpedela — Sun, 27 Dec 2020 19:07:34 +0000

I think Citus can be used effectively for that scenario now.

https://www.citusdata.com/

New comment by rpedela in "[dead]"

rpedela — Sun, 27 Dec 2020 02:41:00 +0000

The post makes it sound like it is cheap to get started, but 32 ETH is currently ~$20K. If you already have the necessary ETH, then it is cheap.