Merge branch 'main' of https://github.com/jafioti/luminal

2024-07-09 11:28:55 -04:00
parent fb544b7530 9d37f25c7e
commit 9aba341b95
1 changed files with 5 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -14,8 +14,8 @@ use luminal::prelude::*;

 // Setup graph and tensors
 let mut cx = Graph::new();
-let a = cx.tensor().set([[1.0], [2.0], [3.0]]);
-let b = cx.tensor().set([[1.0, 2.0, 3.0, 4.0]]);
+let a = cx.tensor((3, 1)).set([[1.0], [2.0], [3.0]]);
+let b = cx.tensor((1, 4)).set([[1.0, 2.0, 3.0, 4.0]]);

 // Do math...
 let mut c = a.matmul(b).retrieve();
@@ -80,17 +80,15 @@ Now we can do:
 - Devices and Dtypes are handled through compilers (just run the CUDA compiler to convert the graph to use CUDA kernels, then the fp16 compiler to convert to half-precision kernels)
 - Networks can be written in generic code, but compiled and run fast on hyper-specific architectures (try writing a PyTorch network that works with both TF32 dtypes and TPUs; get ready for if statement hell...)

-### Compile-time Shape Checks
-All operations are shape checked at compile time, so no more shape mismatches! Credit for this goes to [dfdx](https://github.com/coreylowman/dfdx).
-
 ### View the Graph
 Once you've written all your computation code, run `cx.display()` to see the entire computation graph in all its glory. Pretty messy looking! Now run `cx.compile(GenericCompiler::default())` and display the graph again. Much better.

 ## Where are we?
 - Metal and Cuda are supported for running models on Macs and Nvidia GPUs respectively, in both full and half precision.
 - Performance on M-series macs with LLMs is within 20% of llama.cpp (a *heavily* optimized library)
-- Mistral 7B and Llama 8B are implemented in `examples/`. See instructions above for running.
-- We have a small library of NN modules in `nn`, including transformers.
+- Full training support with graph-based autograd.
+- Llama 3, Phi 3, Whisper and Yolo v8 are implemented in `examples/`. See instructions above for running.
+- We have a small library of NN modules in `luminal_nn`, including transformers.
 - A significant amount of high-level ops are implemented in `hl_ops`. We are aiming to match the most used ~80% of the pytorch api.
 - The aim for 0.3 is to achieve SOTA performance on an M1 pro (50 tok/s), and near SOTA on single nvidia gpus (>100 tok/s), as well as support many mainstream models (Whisper, Stable Diffusion, Yolo v9, etc.) See the tracking issue [here](https://github.com/jafioti/luminal/issues/29)

@@ -98,7 +96,6 @@ Some things on the roadmap:
 - Optimize cuda and metal matmul kernels
 - Fine-grained metal and cuda IR
 - Build benchmarking suite to test against other libs
-- Autograd engine
 - Distributed data, pipeline and tensor parallel.
 - Beat PT 2.0 perf on LLM training
 - Write compiler for quantum photonic retro encabulator
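
Note on the first hunk: the tensors are now created with an explicit shape tuple (`(3, 1)` and `(1, 4)`), and the "Compile-time Shape Checks" section is dropped from the README. One plausible reading is that shapes are now carried as runtime values rather than in the type system. The sketch below illustrates that idea in plain Rust only. `Tensor2`, `new`, and `matmul` here are hypothetical stand-ins written for this note; they are not luminal's types or API, and nothing here reflects how luminal actually validates shapes.

```rust
/// Hypothetical 2D tensor with a runtime shape (NOT luminal's tensor type).
#[derive(Debug, Clone, PartialEq)]
struct Tensor2 {
    rows: usize,
    cols: usize,
    data: Vec<f32>, // row-major storage
}

impl Tensor2 {
    /// Build a tensor from a runtime shape, checking the element count.
    fn new(shape: (usize, usize), data: Vec<f32>) -> Result<Self, String> {
        let (rows, cols) = shape;
        if data.len() != rows * cols {
            return Err(format!(
                "shape {:?} needs {} elements, got {}",
                shape,
                rows * cols,
                data.len()
            ));
        }
        Ok(Self { rows, cols, data })
    }

    /// Matrix multiply; the inner dimensions are checked at runtime.
    fn matmul(&self, rhs: &Tensor2) -> Result<Tensor2, String> {
        if self.cols != rhs.rows {
            return Err(format!(
                "shape mismatch: ({}, {}) x ({}, {})",
                self.rows, self.cols, rhs.rows, rhs.cols
            ));
        }
        let mut out = vec![0.0f32; self.rows * rhs.cols];
        for i in 0..self.rows {
            for k in 0..self.cols {
                let a = self.data[i * self.cols + k];
                for j in 0..rhs.cols {
                    out[i * rhs.cols + j] += a * rhs.data[k * rhs.cols + j];
                }
            }
        }
        Ok(Tensor2 { rows: self.rows, cols: rhs.cols, data: out })
    }
}

fn main() {
    // Same values and shapes as the README snippet in the diff.
    let a = Tensor2::new((3, 1), vec![1.0, 2.0, 3.0]).unwrap();
    let b = Tensor2::new((1, 4), vec![1.0, 2.0, 3.0, 4.0]).unwrap();

    let c = a.matmul(&b).unwrap();
    println!("({}, {}) {:?}", c.rows, c.cols, c.data);

    // A (1, 4) x (1, 4) product has mismatched inner dimensions,
    // so it is reported as an error at runtime instead of failing to compile.
    assert!(b.matmul(&b).is_err());
}
```

With runtime shapes, a mismatch like the last line surfaces as an error value when the code runs, which is the trade-off against the compile-time checks the removed README section described.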