Merge branch 'main' of https://github.com/jafioti/luminal
13
README.md
@@ -14,8 +14,8 @@ use luminal::prelude::*;
// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.tensor().set([[1.0], [2.0], [3.0]]);
let b = cx.tensor().set([[1.0, 2.0, 3.0, 4.0]]);
let a = cx.tensor((3, 1)).set([[1.0], [2.0], [3.0]]);
let b = cx.tensor((1, 4)).set([[1.0, 2.0, 3.0, 4.0]]);
// Do math...
let mut c = a.matmul(b).retrieve();
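For reference, the product in the snippet above is a (3, 1) by (1, 4) matmul, so the result has shape (3, 4). The sketch below works it out in plain Rust without luminal. The `matmul` helper is a hypothetical stand-in written for this illustration, not part of the luminal API.

```rust
/// Multiply an (m x k) matrix by a (k x n) matrix, both stored as rows.
/// Returns `None` when the inner dimensions disagree or `b` is ragged.
/// Hypothetical helper for illustration only; not a luminal function.
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Option<Vec<Vec<f32>>> {
    let k = b.len();
    // Every row of `a` must have exactly k entries.
    if a.iter().any(|row| row.len() != k) {
        return None;
    }
    // Every row of `b` must have the same length n.
    let n = b.first().map_or(0, |row| row.len());
    if b.iter().any(|row| row.len() != n) {
        return None;
    }
    let mut out = vec![vec![0.0; n]; a.len()];
    for (i, row) in a.iter().enumerate() {
        for (p, &a_ip) in row.iter().enumerate() {
            for j in 0..n {
                // Accumulate a[i][p] * b[p][j] into out[i][j].
                out[i][j] += a_ip * b[p][j];
            }
        }
    }
    Some(out)
}
```

Calling `matmul` with `[[1.0], [2.0], [3.0]]` and `[[1.0, 2.0, 3.0, 4.0]]` yields `[[1, 2, 3, 4], [2, 4, 6, 8], [3, 6, 9, 12]]`, which is the value `c` should hold once the graph has been executed.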
@@ -80,17 +80,15 @@ Now we can do:
- Devices and Dtypes are handled through compilers (just run the CUDA compiler to convert the graph to use CUDA kernels, then the fp16 compiler to convert to half-precision kernels)
- Networks can be written in generic code, but compiled and run fast on hyper-specific architectures (try writing a PyTorch network that works with both TF32 dtypes and TPUs; get ready for if statement hell...)
### Compile-time Shape Checks
All operations are shape checked at compile time, so no more shape mismatches! Credit for this goes to [dfdx](https://github.com/coreylowman/dfdx).
### View the Graph
Once you've written all your computation code, run `cx.display()` to see the entire computation graph in all its glory. Pretty messy looking! Now run `cx.compile(GenericCompiler::default())` and display the graph again. Much better.
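To give a rough feel for why a compiled graph looks cleaner, here is a toy sketch of a graph-rewriting pass in plain Rust. It is not luminal's implementation: `Node`, `simplify` and `node_count` are hypothetical names invented for this illustration, and the only rewrites shown are constant folding and removal of `+ 0` and `* 1`.

```rust
/// A tiny expression graph. Hypothetical type, not luminal's graph.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Input(&'static str),
    Const(f32),
    Add(Box<Node>, Box<Node>),
    Mul(Box<Node>, Box<Node>),
}

/// Count the nodes in the graph, as a crude measure of how messy it is.
fn node_count(n: &Node) -> usize {
    match n {
        Node::Input(_) | Node::Const(_) => 1,
        Node::Add(l, r) | Node::Mul(l, r) => 1 + node_count(l) + node_count(r),
    }
}

/// One bottom-up rewrite pass: fold constants, drop `+ 0` and `* 1`.
fn simplify(n: Node) -> Node {
    match n {
        Node::Add(l, r) => match (simplify(*l), simplify(*r)) {
            (Node::Const(a), Node::Const(b)) => Node::Const(a + b),
            (Node::Const(z), x) | (x, Node::Const(z)) if z == 0.0 => x,
            (l, r) => Node::Add(Box::new(l), Box::new(r)),
        },
        Node::Mul(l, r) => match (simplify(*l), simplify(*r)) {
            (Node::Const(a), Node::Const(b)) => Node::Const(a * b),
            (Node::Const(o), x) | (x, Node::Const(o)) if o == 1.0 => x,
            (l, r) => Node::Mul(Box::new(l), Box::new(r)),
        },
        // Inputs and constants are already as simple as they get.
        leaf => leaf,
    }
}
```

For example, the nine-node graph for `(x * 1) + (0 + (2 * 3))` rewrites to the three-node graph `x + 6`. A real compiler works on kernels and memory layouts rather than arithmetic identities, but the before-and-after effect on the displayed graph is similar in spirit.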
## Where are we?
- Metal and CUDA are supported for running models on Macs and Nvidia GPUs respectively, in both full and half precision.
- Performance on M-series Macs with LLMs is within 20% of llama.cpp (a *heavily* optimized library).
- Mistral 7B and Llama 8B are implemented in `examples/`. See instructions above for running.
- We have a small library of NN modules in `nn`, including transformers.
- Full training support with graph-based autograd.
- Llama 3, Phi 3, Whisper and Yolo v8 are implemented in `examples/`. See instructions above for running.
- We have a small library of NN modules in `luminal_nn`, including transformers.
- A significant number of high-level ops are implemented in `hl_ops`. We are aiming to match the most used ~80% of the PyTorch API.
- The aim for 0.3 is to achieve SOTA performance on an M1 Pro (50 tok/s), and near SOTA on single Nvidia GPUs (>100 tok/s), as well as support many mainstream models (Whisper, Stable Diffusion, Yolo v9, etc.). See the tracking issue [here](https://github.com/jafioti/luminal/issues/29).
@@ -98,7 +96,6 @@ Some things on the roadmap:
- Optimize CUDA and Metal matmul kernels
- Fine-grained Metal and CUDA IR
- Build benchmarking suite to test against other libs
- Autograd engine
- Distributed data, pipeline and tensor parallel.
- Beat PT 2.0 perf on LLM training
- Write compiler for quantum photonic retro encabulator