<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: perturbation</title><link>https://news.ycombinator.com/user?id=perturbation</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 30 Apr 2026 19:05:55 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=perturbation" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by perturbation in "Francois Chollet is leaving Google"]]></title><description><![CDATA[
<p>I think a lot of these may have improved since your last experience with Keras. It's pretty easy to override the training loop and/or write a custom loss. The link below shows how to override the training / test step altogether; a custom loss is even easier, via a new loss function or class.<p><a href="https://keras.io/examples/keras_recipes/trainer_pattern/" rel="nofollow">https://keras.io/examples/keras_recipes/trainer_pattern/</a><p>> - Keras's training loop assumes you can fit all the data in memory and that the data is fully preprocessed, which in the world of LLMs and big data is infeasible.<p>The TensorFlow backend has the excellent tf.data.Dataset API, which allows for out-of-core data and streaming preprocessing.</p>
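To make the streaming point concrete without pulling in TensorFlow, here is a rough plain-Python analogy of what a tf.data pipeline (roughly `Dataset.from_generator(...).map(...).batch(...)`) does. Everything below is an illustrative sketch of the idea, not the real API:

```python
def record_stream(n):
    # Simulates reading records lazily from disk; nothing is held in memory
    # beyond the current record (this is the out-of-core part).
    for i in range(n):
        yield {"text": f"example {i}"}

def preprocess(record):
    # Per-record transformation applied on the fly, like Dataset.map.
    return len(record["text"])

def batched(stream, batch_size):
    # Groups a lazy stream into fixed-size batches, like Dataset.batch.
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# 10 records, preprocessed and batched without ever materializing the dataset.
pipeline = batched(map(preprocess, record_stream(10)), batch_size=4)
batches = list(pipeline)
```

The real tf.data version adds prefetching and parallel map on top of the same lazy-pipeline shape.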
]]></description><pubDate>Thu, 14 Nov 2024 01:38:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=42132254</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=42132254</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42132254</guid></item><item><title><![CDATA[New comment by perturbation in "PyTorch 2.0"]]></title><description><![CDATA[
<p>The big thing that PyTorch Mobile is lacking compared to TF Lite is on-device accelerator support (GPU/DSP/etc.). There's experimental support for NNAPI (<a href="https://pytorch.org/tutorials/prototype/nnapi_mobilenetv2.html" rel="nofollow">https://pytorch.org/tutorials/prototype/nnapi_mobilenetv2.ht...</a>), but it's a hack.</p>
]]></description><pubDate>Sat, 03 Dec 2022 04:34:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=33840634</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=33840634</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=33840634</guid></item><item><title><![CDATA[New comment by perturbation in "Swift for TensorFlow Shuts Down"]]></title><description><![CDATA[
<p>Let's hope not, I like autodiff (and this project :( ).</p>
]]></description><pubDate>Sat, 13 Feb 2021 05:54:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=26121727</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=26121727</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=26121727</guid></item><item><title><![CDATA[New comment by perturbation in "Apple machine learning in 2020: What’s new?"]]></title><description><![CDATA[
<p>> To return to the point about image augmentations being hard to add: It's so easy to explain what your training code should do "Just distort the hue a bit" and there seem to be operations explicitly for that: <a href="https://www.tensorflow.org/api_docs/python/tf/image/adjust_h.." rel="nofollow">https://www.tensorflow.org/api_docs/python/tf/image/adjust_h...</a>. but when you go to train with them, you'll discover that backpropagation isn't implemented, i.e. they break in training code.<p>Why not do the data augmentation during preprocessing, so that the transformations don't have to be differentiable?  I.e., map the transformation over a tf.data.Dataset (and append the result to the original dataset).</p>
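That preprocessing-time approach can be sketched in plain Python. The `adjust_hue` here is a toy numeric stand-in for tf.image.adjust_hue, and the lists stand in for a tf.data.Dataset that you would map over and concatenate in real code:

```python
def adjust_hue(pixel, delta):
    # Toy stand-in for a hue shift: wrap the channel value around [0, 1).
    return (pixel + delta) % 1.0

def augment(example):
    image, label = example
    # The augmentation runs at preprocessing time, so it never has to be
    # differentiable -- gradients only flow through the model itself.
    return ([adjust_hue(p, 0.1) for p in image], label)

original = [([0.2, 0.5, 0.95], 0), ([0.4, 0.1, 0.3], 1)]
augmented = [augment(ex) for ex in original]
dataset = original + augmented   # append augmented copies to the original
```

With tf.data the same shape is `ds.concatenate(ds.map(augment))`, and the mapped function can use any op, differentiable or not.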
]]></description><pubDate>Tue, 30 Jun 2020 14:32:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=23690403</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=23690403</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=23690403</guid></item><item><title><![CDATA[New comment by perturbation in "Ask HN: Non-cloud voice recognition for home use?"]]></title><description><![CDATA[
<p>If you don't mind getting your hands dirty a bit, I think Nvidia's model [Jasper](<a href="https://arxiv.org/pdf/1904.03288.pdf" rel="nofollow">https://arxiv.org/pdf/1904.03288.pdf</a>) is near SOTA, and they have [pretrained models](<a href="https://ngc.nvidia.com/catalog/models/nvidia:jaspernet10x5dr" rel="nofollow">https://ngc.nvidia.com/catalog/models/nvidia:jaspernet10x5dr</a>) and [tutorials / scripts](<a href="https://nvidia.github.io/NeMo/asr/tutorial.html" rel="nofollow">https://nvidia.github.io/NeMo/asr/tutorial.html</a>) freely available.  The first is in their library "nemo", but they also have it available in [vanilla PyTorch](<a href="https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper" rel="nofollow">https://github.com/NVIDIA/DeepLearningExamples/tree/master/P...</a>).</p>
]]></description><pubDate>Sat, 14 Mar 2020 17:20:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=22576874</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=22576874</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=22576874</guid></item><item><title><![CDATA[New comment by perturbation in "Show HN: Train a language model to talk like you"]]></title><description><![CDATA[
<p>This is cool - might be worth training a simple discriminator model to identify <i>your</i> utterances, and then you can use the plug-and-play language model (PPLM - <a href="https://github.com/huggingface/transformers/blob/master/examples/pplm/run_pplm.py" rel="nofollow">https://github.com/huggingface/transformers/blob/master/exam...</a>) to generate utterances modeling a specific speaker without special tokens.  Could also take less time to fine-tune.</p>
]]></description><pubDate>Tue, 21 Jan 2020 17:38:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=22109182</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=22109182</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=22109182</guid></item><item><title><![CDATA[New comment by perturbation in "Nim vs. Crystal"]]></title><description><![CDATA[
<p>Ah, thank you for explaining!  That makes sense.</p>
]]></description><pubDate>Thu, 26 Dec 2019 18:44:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=21885766</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=21885766</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=21885766</guid></item><item><title><![CDATA[New comment by perturbation in "Nim vs. Crystal"]]></title><description><![CDATA[
<p>Additionally, the Nim code was not compiled with many optimizations turned on!  (I.e., without -d:release).<p><pre><code>    $ nim c -o:base64_test_nim -d:danger --cc:gcc --verbosity:0 base64_test.nim

    $ nim c -o:json_test_nim -d:danger --cc:gcc --verbosity:0 json_test.nim
</code></pre>
IIRC the -d:danger flag is necessary for some optimizations (like disabling bounds checking), but -d:release is necessary for most optimizations to be enabled.<p>Edit: It appears I'm incorrect; -d:danger does imply -d:release in newer Nim versions.</p>
]]></description><pubDate>Thu, 26 Dec 2019 18:21:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=21885611</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=21885611</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=21885611</guid></item><item><title><![CDATA[New comment by perturbation in "Nim vs. Crystal"]]></title><description><![CDATA[
<p>Another thing I noticed: the Nim code was compiled without the -d:release flag.<p>For example, the JSON test was compiled with:<p><pre><code>    $ nim c -o:json_test_nim -d:danger --cc:gcc --verbosity:0 json_test.nim
</code></pre>
I don't think that -d:danger implies -d:release (even though it's needed to do things like disable bounds checking)?</p>
]]></description><pubDate>Thu, 26 Dec 2019 18:17:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=21885588</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=21885588</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=21885588</guid></item><item><title><![CDATA[New comment by perturbation in "Foundations of Data Science [pdf]"]]></title><description><![CDATA[
<p>I'd recommend Elements of Statistical Learning or ISLR instead, if you want to start with a theory-heavy introduction.  Most of what you need for DS, I think, is better learned through projects or on the job.<p>Also, as others have mentioned, some of the most important skills for DS are data munging, data "presentation", and soft skills like managing expectations / relationships / etc.<p>I would not recommend this book if you want to get into DS with the idea that, "I'll read this and then I'll know everything I need to."  It's too dense and academically focused, and it would probably be discouraging to try to read it all without getting your feet wet.</p>
]]></description><pubDate>Mon, 07 Oct 2019 15:50:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=21182299</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=21182299</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=21182299</guid></item><item><title><![CDATA[New comment by perturbation in "Nim 1.0"]]></title><description><![CDATA[
<p>Congrats guys!!! This has been much-anticipated and I'm very excited.  I personally wish that the owned reference stuff (<a href="https://nim-lang.org/araq/ownedrefs.html" rel="nofollow">https://nim-lang.org/araq/ownedrefs.html</a>) <i>had</i> been part of 1.0, but I think that at some point shipping 1.0 >> everything else.<p>I've been following (and evangelizing) Nim for a while, this will make it easier to do so.</p>
]]></description><pubDate>Mon, 23 Sep 2019 20:35:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=21053318</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=21053318</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=21053318</guid></item><item><title><![CDATA[New comment by perturbation in "Deep learning outperformed dermatologists in melanoma image classification task"]]></title><description><![CDATA[
<p>> * They measure the wrong things that reward the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve. This is machine learning and stats 101.<p>A̶F̶A̶I̶K̶,̶ ̶a̶ ̶R̶O̶C̶ ̶c̶u̶r̶v̶e̶ ̶c̶a̶n̶ ̶b̶e̶ ̶m̶i̶s̶l̶e̶a̶d̶i̶n̶g̶ ̶f̶o̶r̶ ̶a̶n̶ ̶i̶m̶b̶a̶l̶a̶n̶c̶e̶d̶ ̶d̶a̶t̶a̶s̶e̶t̶,̶ ̶b̶u̶t̶ ̶t̶h̶e̶ ̶A̶U̶C̶ ̶i̶s̶ ̶s̶t̶i̶l̶l̶ ̶o̶k̶a̶y̶ ̶f̶o̶r̶ ̶s̶e̶l̶e̶c̶t̶i̶n̶g̶ ̶m̶o̶d̶e̶l̶s̶.̶ Edit: This is incorrect; a PR curve + PR AUC should be used for model selection if imbalanced.  I agree it would be really misleading if they (say) just reported accuracy (since the null classifier of always guessing negative would give 80% overall accuracy).  I̶ ̶t̶h̶o̶u̶g̶h̶t̶ ̶t̶h̶a̶t̶ ̶t̶h̶e̶ ̶A̶U̶C̶ ̶f̶o̶r̶ ̶R̶O̶C̶ ̶c̶u̶r̶v̶e̶ ̶s̶h̶o̶u̶l̶d̶ ̶s̶t̶i̶l̶l̶ ̶b̶e̶ ̶a̶ ̶v̶a̶l̶i̶d̶ ̶m̶e̶a̶s̶u̶r̶e̶ ̶s̶i̶n̶c̶e̶ ̶i̶t̶'̶s̶ ̶s̶h̶o̶w̶i̶n̶g̶ ̶h̶o̶w̶ ̶m̶u̶c̶h̶ ̶b̶e̶t̶t̶e̶r̶ ̶t̶h̶e̶ ̶m̶o̶d̶e̶l̶ ̶p̶e̶r̶f̶o̶r̶m̶s̶ ̶t̶h̶a̶n̶ ̶r̶a̶n̶d̶o̶m̶ ̶g̶u̶e̶s̶s̶i̶n̶g̶.̶<p>How do you usually handle imbalanced data?  I've had some success with SMOTE or weighted loss for imbalanced datasets, but I'm embarrassed to say I've been using AUC with ROC curves as the default - if that gives worse model selection than the PR AUC, I'll switch to that instead.</p>
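A tiny worked example (made-up counts) of why precision exposes class imbalance where the ROC's false-positive rate can hide it:

```python
# Imbalanced test set: 100 positives, 10_000 negatives.
tp, fn = 80, 20          # the classifier finds most true positives...
fp = 500                 # ...but also flags 500 negatives.
tn = 10_000 - fp

tpr = tp / (tp + fn)           # recall / sensitivity: 0.80
fpr = fp / (fp + tn)           # x-axis of the ROC curve: 0.05
precision = tp / (tp + fp)     # y-axis of the PR curve: ~0.14

# The FPR looks great (5%), so the ROC point sits near the top-left corner,
# yet fewer than 1 in 7 flagged cases is actually positive -- which is
# exactly what the PR curve makes visible.
```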
]]></description><pubDate>Tue, 30 Apr 2019 18:08:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=19790410</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19790410</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19790410</guid></item><item><title><![CDATA[New comment by perturbation in "How to Ace the Google Interview: Ultimate Guide"]]></title><description><![CDATA[
<p><a href="https://github.com/yuki24/did_you_mean#installation" rel="nofollow">https://github.com/yuki24/did_you_mean#installation</a> :<p><pre><code>    Ruby 2.3 and later ships with this gem and it will automatically be required when a Ruby process starts up. No special setup is required.
</code></pre>
It doesn't call the method for you, but it does make the did-you-mean suggestion automatically if you misspell a method name and it's close enough.</p>
]]></description><pubDate>Wed, 24 Apr 2019 04:50:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=19735667</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19735667</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19735667</guid></item><item><title><![CDATA[New comment by perturbation in "Google launches an end-to-end AI platform"]]></title><description><![CDATA[
<p>AutoML essentially uses heuristics or an optimization algorithm to select a model architecture and then trains the model.  Feature engineering / feature synthesis as well as interpretability remain open challenges.<p>If I'm understanding your questions correctly, the main problems I see with this are:<p>- Using raw data instead of engineered features (less of a problem given feature synthesis libraries like <a href="https://www.featuretools.com/" rel="nofollow">https://www.featuretools.com/</a> and other heuristic methods).  I'd expect Google to do a good job of basics like normalizing raw input features before training.<p>- Using features that it really shouldn't (if you just throw ML at your database for, say, loan applications, then sensitive / personally identifying information can and will be used as features)<p>- Lack of insight / understanding as to what is driving the model.  This can be partially overcome with post-training methods like LIME, Shapley values, etc.<p>I wouldn't expect predictions to come from a set of discrete values - if (say) you're predicting housing values and training a NN, the output should be continuous and based on the input features.</p>
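As a sketch of the post-training idea: for a tiny model you can compute Shapley values exactly by averaging each feature's marginal contribution over every ordering in which features are revealed. The two-feature model and the zero baseline for hidden features below are toy assumptions; libraries like shap approximate this at scale:

```python
from itertools import permutations

def model(features):
    # Toy "model": additive with an interaction term.
    x1 = features.get("x1", 0.0)   # missing features default to a 0 baseline
    x2 = features.get("x2", 0.0)
    return 2.0 * x1 + 1.0 * x2 + 0.5 * x1 * x2

def shapley(predict, instance):
    # Exact Shapley values: average each feature's marginal contribution
    # over all orderings of the features.
    names = list(instance)
    totals = {n: 0.0 for n in names}
    orderings = list(permutations(names))
    for order in orderings:
        revealed = {}
        prev = predict(revealed)
        for name in order:
            revealed[name] = instance[name]
            totals[name] += predict(revealed) - prev
            prev = predict(revealed)
    return {n: totals[n] / len(orderings) for n in names}

phi = shapley(model, {"x1": 1.0, "x2": 2.0})
# The attributions sum to model(instance) - model(baseline), so the
# interaction term's credit gets split between x1 and x2.
```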
]]></description><pubDate>Wed, 10 Apr 2019 19:19:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=19627878</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19627878</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19627878</guid></item><item><title><![CDATA[New comment by perturbation in "A New Runtime for Nim"]]></title><description><![CDATA[
<p>Disclaimer: I really love Nim and have written a fair amount in it, but I'm not a language designer guy.<p>I like that this will make multithreading easier with shared memory, but I worry this will make the language more complex and delay 1.0.  I'll have to read the linked list examples a few times before the owned pointer model sinks in.</p>
]]></description><pubDate>Tue, 26 Mar 2019 15:24:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=19492677</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19492677</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19492677</guid></item><item><title><![CDATA[New comment by perturbation in "D6tflow: Python library for building data science workflows"]]></title><description><![CDATA[
<p>Are you using reticulate (<a href="https://github.com/rstudio/reticulate" rel="nofollow">https://github.com/rstudio/reticulate</a>), or having Python spawn a new worker process for R?</p>
]]></description><pubDate>Sat, 02 Mar 2019 18:53:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=19290526</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19290526</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19290526</guid></item><item><title><![CDATA[New comment by perturbation in "Ask HN: Who wants to be hired? (March 2019)"]]></title><description><![CDATA[
<p>Location: Dallas, TX<p>Remote: Yes<p>Technologies: Keras, R, Python, and Spark are what I use every day.  Jupyter, scikit-learn, spaCy, NLTK, H2O, mlr, caret, prophet, tsfresh, ggplot, and tidyverse packages are libraries that I use commonly.  I'm interested in exploring more with PyTorch but haven't used it much in production.  I'm comfortable with Go, Flask, and Docker (mainly for productionizing models as a microservice) but don't use them much in my current role.<p>Willing to relocate: No<p>Resume / CV: <a href="https://docs.google.com/document/d/1bx9MtbzhzMQDsRiX80Sh4GO5siePi5drxLHjfkrAFZU/edit?usp=sharing" rel="nofollow">https://docs.google.com/document/d/1bx9MtbzhzMQDsRiX80Sh4GO5...</a><p>Email: sloanes dot k at gmail dot com<p>---<p>Looking for a data scientist position.</p>
]]></description><pubDate>Sat, 02 Mar 2019 16:49:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=19289915</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19289915</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19289915</guid></item><item><title><![CDATA[New comment by perturbation in "JSON with Sqlite"]]></title><description><![CDATA[
<p>Dplyr works great with SQL (both SQLite and others).</p>
]]></description><pubDate>Fri, 01 Mar 2019 19:57:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=19284673</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19284673</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19284673</guid></item><item><title><![CDATA[New comment by perturbation in "AutoML toolkit for neural architecture search and hyper-parameter tuning"]]></title><description><![CDATA[
<p>Their example with LightGBM (<a href="https://nni.readthedocs.io/en/latest/gbdt_example.html" rel="nofollow">https://nni.readthedocs.io/en/latest/gbdt_example.html</a>) is very cool - I wanted to put together a custom script with mlflow + catboost + mlrMBO to do something similar, but this puts everything together in one package.<p>I think this does everything MLFlow does and more (besides maybe helping with deployment?)</p>
]]></description><pubDate>Fri, 01 Mar 2019 17:19:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=19282989</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19282989</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19282989</guid></item><item><title><![CDATA[New comment by perturbation in "Leon: An open-source personal assistant"]]></title><description><![CDATA[
<p>Haven't looked at MyCroft before.  It looks like MyCroft exposes less of the nuts-and-bolts of modeling?  I'm not sure where I would plug in a custom entity extraction or intent detection model, but I do see that it lets you add custom 'skills'.</p>
]]></description><pubDate>Fri, 15 Feb 2019 18:17:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=19173272</link><dc:creator>perturbation</dc:creator><comments>https://news.ycombinator.com/item?id=19173272</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=19173272</guid></item></channel></rss>