<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: zdevito</title><link>https://news.ycombinator.com/user?id=zdevito</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 08 Apr 2026 03:45:26 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=zdevito" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by zdevito in "I don't like NumPy"]]></title><description><![CDATA[
<p>I tried to do something similar with 'first-class' dimension objects in PyTorch <a href="https://github.com/pytorch/pytorch/blob/main/functorch/dim/README.md">https://github.com/pytorch/pytorch/blob/main/functorch/dim/R...</a> . 
For instance, multi-head attention looks like:<p><pre><code>    import torch
    from torchdim import dims, softmax

    def multiheadattention(q, k, v, num_attention_heads, dropout_prob, use_positional_embedding):
        batch, query_sequence, key_sequence, heads, features = dims(5)
        heads.size = num_attention_heads

        # bind dimensions, unflattening the heads from the feature dimension
        q = q[batch, query_sequence, [heads, features]]
        k = k[batch, key_sequence, [heads, features]]
        v = v[batch, key_sequence, [heads, features]]

        # einsum-style operators to calculate attention scores
        attention_scores = (q*k).sum(features) * (features.size ** -0.5)

        # use a first-class dim to specify the dimension for softmax
        attention_probs = softmax(attention_scores, dim=key_sequence)

        # dropout works pointwise, following Rule #1
        attention_probs = torch.nn.functional.dropout(attention_probs, p=dropout_prob)

        # another matrix product
        context_layer = (attention_probs*v).sum(key_sequence)

        # flatten heads back into features
        return context_layer.order(batch, query_sequence, [heads, features])
        </code></pre>
However, my impression from trying to grow a wider userbase is that while numpy-style APIs may not be as good as a better-designed array language, they are probably not the bottleneck for getting things done in PyTorch. Other domains might suffer more, and I am very excited to see a better array language catch on.</p>
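For contrast, here is a loose sketch of what the same computation looks like in the numpy style the comment is comparing against, using np.einsum with hand-tracked axis letters (the function name and shapes here are illustrative assumptions, not part of the torchdim API):

```python
import numpy as np

def multiheadattention_np(q, k, v, num_heads):
    # q: (batch, query_seq, heads*feat); k, v: (batch, key_seq, heads*feat)
    b, qs, hf = q.shape
    feat = hf // num_heads
    # unflatten heads out of the feature dimension by reshaping
    q = q.reshape(b, qs, num_heads, feat)
    k = k.reshape(b, k.shape[1], num_heads, feat)
    v = v.reshape(b, v.shape[1], num_heads, feat)
    # scores: dot product over features; axis roles live in the einsum string
    scores = np.einsum('bqhf,bkhf->bqhk', q, k) * feat ** -0.5
    # softmax over the key axis, selected positionally (axis=-1) by hand
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    # weighted sum over keys, then flatten heads back into features
    ctx = np.einsum('bqhk,bkhf->bqhf', probs, v)
    return ctx.reshape(b, qs, hf)
```

The bookkeeping (which axis is which, reshapes in and out) is exactly what the first-class dimension objects above replace with names.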
]]></description><pubDate>Thu, 15 May 2025 20:43:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=43999149</link><dc:creator>zdevito</dc:creator><comments>https://news.ycombinator.com/item?id=43999149</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43999149</guid></item><item><title><![CDATA[New comment by zdevito in "Debugging a Mixed Python and C Language Stack"]]></title><description><![CDATA[
<p>When developing PyTorch, we also run into a lot of mixed Python/C++ language situations. We've recently been experimenting with in-process 'combined' Python/C++/PyTorch 2.0 stack traces to make it easier to understand where code is executing (<a href="https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158" rel="nofollow">https://dev-discuss.pytorch.org/t/fast-combined-c-python-tor...</a>).</p>
]]></description><pubDate>Tue, 25 Apr 2023 23:20:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=35708014</link><dc:creator>zdevito</dc:creator><comments>https://news.ycombinator.com/item?id=35708014</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35708014</guid></item><item><title><![CDATA[New comment by zdevito in "Terra – a low-level counterpart to Lua"]]></title><description><![CDATA[
<p>One of our design goals was to make sure terra could execute independently of Lua. So everything that you describe is possible. For instance our simple hello world program (<a href="https://github.com/zdevito/terra/blob/master/tests/hello.t" rel="nofollow">https://github.com/zdevito/terra/blob/master/tests/hello.t</a>) compiles a standalone executable with the "terralib.saveobj" function. You can also write out object (.o) files that are ABI compatible with C. For instance, gemm.t (<a href="https://github.com/zdevito/terra/blob/master/tests/gemm.t" rel="nofollow">https://github.com/zdevito/terra/blob/master/tests/gemm.t</a>) our matrix-matrix multiply autotuner writes out a .o file my_dgemm.o which we then call from a test harness in a separate C program (<a href="https://github.com/zdevito/terra/blob/master/tests/reference/matmul.cpp" rel="nofollow">https://github.com/zdevito/terra/blob/master/tests/reference...</a>).  Once you have the .o files, you can use Lua to call the system linker to generate a dynamic library.</p>
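Once such a C-ABI dynamic library exists, any foreign-function interface can call into it. As a loose illustration (in Python rather than Lua, and with libm standing in for a Terra-built library, since my_dgemm's exact signature isn't shown here), ctypes loads the library and declares the foreign signature:

```python
import ctypes
import ctypes.util

# Load a C-ABI shared library; libm stands in for one linked from Terra .o files.
path = ctypes.util.find_library('m') or 'libm.so.6'
libm = ctypes.CDLL(path)
# Declare the C signature before calling, as with any FFI
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]
print(libm.cos(0.0))
```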
]]></description><pubDate>Tue, 14 May 2013 08:24:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=5703581</link><dc:creator>zdevito</dc:creator><comments>https://news.ycombinator.com/item?id=5703581</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=5703581</guid></item><item><title><![CDATA[New comment by zdevito in "Terra – a low-level counterpart to Lua"]]></title><description><![CDATA[
<p>Yes! One of the benefits of making sure that Terra code can execute independently of Lua is that you can use multi-threading libraries pretty much out of the box. For instance, we have an example that launches some threads using pthreads (<a href="https://github.com/zdevito/terra/blob/master/tests/pthreads.t" rel="nofollow">https://github.com/zdevito/terra/blob/master/tests/pthreads....</a>).<p>There are still some limitations. You'd still have to manage thread synchronization manually, and I think LuaJIT only allows one thread of Lua execution to run at a time, so if your threads call back into Lua they may serialize on that bottleneck.</p>
]]></description><pubDate>Tue, 14 May 2013 06:15:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=5703229</link><dc:creator>zdevito</dc:creator><comments>https://news.ycombinator.com/item?id=5703229</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=5703229</guid></item><item><title><![CDATA[New comment by zdevito in "Terra – a low-level counterpart to Lua"]]></title><description><![CDATA[
<p>Author here. You're right that we designed Terra primarily to be an environment for generating low-level code.
In particular, we want to be able to easily design and prototype DSLs and auto-tuners for high-performance programming applications.
We explain this use-case in more detail in our upcoming PLDI paper (<a href="http://terralang.org/pldi071-devito.pdf" rel="nofollow">http://terralang.org/pldi071-devito.pdf</a>).<p>Since we are primarily using it for dynamic code generation, I haven't done much benchmarking against LuaJIT directly. Instead, we have compared it to C by implementing a few of the language benchmarks (nbody and fannkuchredux; performance is normally within 5% of C), and compared it against ATLAS, which implements BLAS routines by autotuning x86 assembly. In the case of ATLAS, we're 20% slower, but we are comparing auto-tuned Terra against auto-tuned x86 assembly.<p>Small note: the BF description on the website does go on to implement the '[' and ']' operators below. I just left them out of the initial code so it was easier to grok what was going on. The full implementation is at (<a href="https://github.com/zdevito/terra/blob/master/tests/bf.t" rel="nofollow">https://github.com/zdevito/terra/blob/master/tests/bf.t</a>).</p>
]]></description><pubDate>Tue, 14 May 2013 05:43:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=5703155</link><dc:creator>zdevito</dc:creator><comments>https://news.ycombinator.com/item?id=5703155</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=5703155</guid></item></channel></rss>