srush 15 hours ago

I made these a couple of years ago as a teaching exercise for https://minitorch.github.io/. At the time the resources for doing anything on GPUs were pretty sparse and the NVIDIA docs were quite challenging.

These days there are great resources for going deep on this topic. The CUDA-mode org is particularly great, both their video series and PMPP reading groups.

aleinin 12 hours ago

I recently ported this to Metal for Apple Silicon computers. If you're interested in learning GPU programming on an M series Mac, I think this is a very accessible option. Thanks to Sasha for making this!

https://github.com/abeleinin/Metal-Puzzles

saagarjha 9 hours ago

When working on GPU code there are really two parts to it, I feel. One is “how do I even write code for the GPU,” which this tutorial seems to cover, but there’s a second part, “how do I write good code for the GPU,” which seems like it would need another resource or an expansion of this one.

  • derefr 6 hours ago

    I've always felt like the best interactive educational model for forming a good intuition on how to maximize throughput and minimize worst-case latency in a pipelined parallel dataflow system (e.g. DSPs, FPGAs, GPUs, or even distributed message-passing systems) would be some variant of the game Factorio. Specifically, one with:

    1. instead of buildings, IP cores doing processing steps;

    2. instead of belts, wires — which take up far less than one tile, so many can run together along one tile and many can connect to a single IP core; where each wire can move its contents at arbitrary speed (including "stopped") — but where this will have a power-use cost proportional to the wire's speed;

    3. an overall goal of optimizing for rocket launches per second per power-usage watt. (Which should overall require minimizing the amount of stuff moving around across the whole base, avoiding pipeline stalls; doing as much parallel batching as possible; etc.)

    (Yes, I know Shenzhen I/O exists. It's great for what it does — modelling signals and signal transformations — but it doesn't model individual packets of data as moving along wires with propagation delay, and with the potential for e.g. parallel-line interference given a bad encoding scheme, quantum tunnelling, overclocking or undervolting components, etc. I think a Factorio-variant would actually be much more flexible to implement these aspects.)

czhu12 9 hours ago

I loved the tensor puzzles you made. I spent the morning revisiting and liking all the videos you've made on YouTube. Hope for many more in the future!

  • srush 6 hours ago

    Thanks so much!

throwaway314155 12 hours ago

Either puzzle 4 has a bug in it or I'm losing my mind. (Possible solution below, so don't read if you want to go in fresh.)

    # FILL ME IN (roughly 2 lines)
    if local_i < size and local_j < size:
        out[local_i][local_j] = a[local_i][local_j] + 10

Results in a failed assertion:

     AssertionError: Wrong number of indices

But the test cell beneath it will still pass?

  • imjonse 11 hours ago

    Maybe try out[local_i, local_j]?
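
A minimal sketch of why the two spellings can behave differently. `Table2D` below is a hypothetical stand-in written for illustration, not the puzzle notebook's actual array class; it assumes the simulator expects one index per dimension passed together, which would be consistent with the "Wrong number of indices" assertion when indexing is chained.

```python
# Hypothetical stand-in for a 2D device array (NOT the puzzle library's class).
class Table2D:
    def __init__(self, rows, cols, fill=0):
        self.shape = (rows, cols)
        self.data = [fill] * (rows * cols)  # flat row-major storage

    def _offset(self, index):
        # Require one index per dimension, passed together as a tuple.
        if not isinstance(index, tuple):
            index = (index,)
        assert len(index) == len(self.shape), "Wrong number of indices"
        i, j = index
        return i * self.shape[1] + j

    def __getitem__(self, index):
        return self.data[self._offset(index)]

    def __setitem__(self, index, value):
        self.data[self._offset(index)] = value


size = 2
a = Table2D(size, size, fill=1)
out = Table2D(size, size)

# Simulate a 3x3 block of threads; the guard skips out-of-range threads.
for local_i in range(3):
    for local_j in range(3):
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, local_j] + 10

# Chained indexing passes a single index first, so the assertion fires.
try:
    a[0][0]
    chained_ok = True
except AssertionError:
    chained_ok = False
```

In this sketch `a[0][0]` calls `__getitem__` with a single integer, so the dimension check fails before the second `[0]` is ever evaluated, while `a[0, 0]` passes both indices at once.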

wmil 10 hours ago

So I'm used to working with lists and maps, which doesn't really track well with tackling problems on thousands of cores.

Is the usual strategy to worry less about repeating calculations and just use brute force to tackle the problem?

Is there a good resource to read about how to tackle problems in an extremely parallel way?

  • srush 6 hours ago

    I would recommend first learning Numpy or a similar vectorized library. If you have a good sense of those data structures (array broadcasting) it is a good starting point for what you can do in a GPU world.
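
As a rough illustration of the broadcasting idea mentioned above (the array names and sizes here are made up for the example), the sketch below builds the same 4x4 table twice: once with explicit loops, and once by broadcasting a column against a row so that every output cell is computed independently of the others.

```python
import numpy as np

a = np.arange(4)      # shape (4,)
col = a[:, None]      # shape (4, 1)
row = a[None, :]      # shape (1, 4)

# Loop version: one output element at a time.
table_loop = np.zeros((4, 4), dtype=a.dtype)
for i in range(4):
    for j in range(4):
        table_loop[i, j] = a[i] + a[j]

# Broadcast version: (4, 1) + (1, 4) expands to (4, 4) with no Python loop.
table_bcast = col + row
```

Each cell of the broadcast result depends only on its own row and column index, which loosely resembles assigning one GPU thread per output element.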

az226 7 hours ago

Can I hire you to make Flash Attention a reality for V100?

  • srush 6 hours ago

    Nope! Too hard for me. But it would be great practice for someone who wants to get started in this space. There is a Triton implementation that might be a good starting place.

dejanig 10 hours ago

Wow, it looks really interesting. I will definitely look into it.

867-5309 12 hours ago

seems like an opportune moment to gift a plug for bitcoin puzzles, namely BTC32 / 1000 BTC Challenge[1]

pools are in dire need of cuda developers

[1] https://bitcointalk.org/index.php?topic=1306983.0

  • jamilton 10 hours ago

    Why? Wouldn't existing tools be about as good as they could be?

  • talldayo 11 hours ago

    > pools are in dire need of cuda developers

    Pools have money; if they need CUDA engineers, they are fully capable of hiring them at the industry rate.

    • 867-5309 10 hours ago

      most are community-based, plus, the prize can far exceed such a rate

      • talldayo 10 hours ago

        > the prize can far exceed such a rate

        For all the good it's done them.