Added new doc site

Joe Fioti
2024-04-24 16:44:52 -05:00
parent 1424c40384
commit 5b8e922a70
20 changed files with 298 additions and 42 deletions

View File

@@ -1,19 +0,0 @@
# GraphTensors
We're working with pretty complicated graphs to build our computation on, but we don't want to manually place all the nodes ourselves! So how can we build these static graphs in a nice, familiar way? GraphTensors!
Essentially GraphTensors are pointers to a specific node on the graph, as well as some metadata about the output of that node, such as its shape. We can make a new GraphTensor by doing:
```rust
let mut cx = Graph::new(); // We need a graph to build!
let a: GraphTensor<R1<3>> = cx.tensor(); // Here we create a new node on the graph and get a GraphTensor back, pointing to it.
```
Notice the type of `a`: `GraphTensor<R1<3>>`. So what's that generic all about? It's the shape! We make tensor shapes part of the type, so they're tracked at compile time! In this case, the shape is rank 1, with 3 elements, or in other words, a vector of 3 dimensions. (Side note: `R1<N>` is a typedef of `(Const<N>,)`) It should be impossible to accidentally get a runtime shape mismatch.
Now we can use `a` as you would a tensor in a library like PyTorch, performing linear algebra:
```rust
let b = a.exp().sqrt();
let c = b + a;
```
Looks familiar!
[Let's take a look at how GraphTensors are used to build whole neural networks.](https://github.com/jafioti/luminal/blob/main/docs/03%20Modules.md)

View File

@@ -1 +0,0 @@
Coming Soon

View File

@@ -1 +0,0 @@
Coming Soon

4
docs/README.md Normal file
View File

@@ -0,0 +1,4 @@
```
npm i -g mintlify
mintlify dev
```

48
docs/blog/4-24-2024.mdx Normal file
View File

@@ -0,0 +1,48 @@
---
title: 'Luminal: Efficient ML in Rust through graph compilation'
description: 'A new approach to ML'
---
![](https://raw.githubusercontent.com/jafioti/luminal/main/dag.jpeg)
**Luminal is a deep learning library that uses composable compilers to achieve high performance.**
Current ML libraries tend to be large and complex because they try to map high-level operations directly onto low-level handwritten kernels, and focus on eager execution. Libraries like PyTorch contain hundreds of thousands of lines of code, making it nearly impossible for a single programmer to understand it all, let alone do a large refactor.
But does it need to be so complex? ML models tend to be static dataflow graphs made up of a few simple operators. This allows us to have a dirt simple core only supporting a few primitive operations, and use them to build up complex neural networks. We can then write compilers that modify the graph after we build it, to swap more efficient ops back in depending on which backend we're running on.
Luminal takes this approach to the extreme, supporting only 12 primitive operations (primops):
- **Unary** - Log2, Exp2, Sin, Sqrt, Recip
- **Binary** - Add, Mul, Mod, LessThan
- **Other** - SumReduce, MaxReduce, Contiguous
Every complex operation boils down to these primitive operations, so when you do `a - b` for instance, `add(a, mul(b, -1))` gets written to the graph. Or when you do `a.matmul(b)`, what actually gets put on the graph is `sum_reduce(mul(reshape(a), reshape(b)))`.
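To make the decomposition concrete, here is a minimal sketch of how `a - b` and a dot product could reduce to primops. The `Expr` enum, `sub`, `dot`, `eval`, and `zip_with` are hypothetical names invented for this illustration and are not Luminal's real types; a real matmul would also reshape its inputs before the multiply.

```rust
// Hypothetical mini-IR, not Luminal's real types: just enough primops
// to show how `a - b` and a dot product decompose.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Input(Vec<f32>),
    Const(f32),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    SumReduce(Box<Expr>),
}

/// `a - b` is recorded as `add(a, mul(b, -1))`.
fn sub(a: Expr, b: Expr) -> Expr {
    let neg_b = Expr::Mul(Box::new(b), Box::new(Expr::Const(-1.0)));
    Expr::Add(Box::new(a), Box::new(neg_b))
}

/// A vector dot product is recorded as `sum_reduce(mul(a, b))`.
fn dot(a: Expr, b: Expr) -> Expr {
    Expr::SumReduce(Box::new(Expr::Mul(Box::new(a), Box::new(b))))
}

/// Naive interpreter that evaluates the primop tree elementwise.
fn eval(e: &Expr) -> Vec<f32> {
    match e {
        Expr::Input(v) => v.clone(),
        Expr::Const(c) => vec![*c],
        Expr::Add(a, b) => zip_with(eval(a), eval(b), |x, y| x + y),
        Expr::Mul(a, b) => zip_with(eval(a), eval(b), |x, y| x * y),
        Expr::SumReduce(a) => vec![eval(a).iter().sum()],
    }
}

/// Elementwise binary op; a length-1 operand is broadcast as a scalar.
fn zip_with(a: Vec<f32>, b: Vec<f32>, f: impl Fn(f32, f32) -> f32) -> Vec<f32> {
    let n = a.len().max(b.len());
    (0..n)
        .map(|i| {
            let x = a[if a.len() == 1 { 0 } else { i }];
            let y = b[if b.len() == 1 { 0 } else { i }];
            f(x, y)
        })
        .collect()
}
```

Evaluating `sub` on `[5, 7]` and `[2, 3]` yields `[3, 4]` without a subtract op ever existing in the op set.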
Once the graph is built, iterative compiler passes can modify it to replace primops with more efficient ops, depending on the device it's running on. On Nvidia cards, for instance, efficient CUDA kernels are written on the fly to replace these ops, and specialized cuBLAS kernels are swapped in for supported operations.
This approach leads to a simple library, and performance is only limited by the creativity of the compiler programmer, not the model programmer.
Luminal has a number of other neat features. Check out the repo [here](https://github.com/jafioti/luminal).
## Welcome
There are two ways to build API documentation: [OpenAPI](https://mintlify.com/docs/api-playground/openapi/setup) and [MDX components](https://mintlify.com/docs/api-playground/mdx/configuration). For the starter kit, we are using the following OpenAPI specification.
<Card
title="Plant Store Endpoints"
icon="leaf"
href="https://github.com/mintlify/starter/blob/main/api-reference/openapi.json"
>
View the OpenAPI specification file
</Card>
## Authentication
All API endpoints are authenticated using Bearer tokens and picked up from the specification file.
```json
"security": [
{
"bearerAuth": []
}
]
```

View File

@@ -0,0 +1,4 @@
---
title: 'Create Plant'
openapi: 'POST /plants'
---

View File

@@ -0,0 +1,4 @@
---
title: 'Delete Plant'
openapi: 'DELETE /plants/{id}'
---

View File

@@ -0,0 +1,4 @@
---
title: 'Get Plants'
openapi: 'GET /plants'
---

View File

@@ -1,10 +1,22 @@
# Contributing to luminal
![image](https://raw.githubusercontent.com/jafioti/luminal/main/resources/dag.jpeg)
---
title: Developing Luminal
description: 'Building the future of ML.'
icon: 'hand-wave'
---
Please take a look at the [issues](https://github.com/jafioti/luminal/issues) and [roadmap](https://github.com/users/jafioti/projects/1) to see what's targeted for upcoming releases. Contributions for those features are preferred and will be reviewed and merged very rapidly. Other contributions are welcome, but please note luminal is and always will be a fairly minimal library.
<img
className="block dark:hidden rounded-xl"
src="/images/abstract_light.jpg"
alt="Hero Light"
/>
<img
className="hidden dark:block rounded-xl"
src="/images/abstract.jpg"
alt="Hero Dark"
/>
The core design of luminal is heavily predicated on extensibility. Compilers allow for immense complexity to be removed from the core library and added with third-party compilers. For instance, datatypes and devices are typically first-class primitives. In luminal, they're compilers and the core has no idea about them. This is the general trend we'll stick to: core remains brutally simple, and everything that can be externalized to a compiler will be.
Please take a look at the [issues](https://github.com/jafioti/luminal/issues) and [roadmap](https://github.com/users/jafioti/projects/1) to see what's targeted for upcoming releases. Contributions for those features are preferred and will be reviewed and merged very rapidly. Other contributions are welcome, but please note Luminal is and always will be a fairly minimal library.
We will be adding training support soon, and as you guessed, it will entirely reside in a compiler. Just define the model's graph, run the output through an optimizer, and then run the `AutogradCompiler` before any other compilers. Boom, we've got training, and the core of the library has no idea (aside from some quality-of-life APIs)!
The core design of Luminal is heavily predicated on extensibility. Compilers allow for immense complexity to be removed from the core library and added with third-party compilers. For instance, datatypes and devices are typically first-class primitives. In Luminal, they're compilers and the core has no idea about them. This is the general trend we'll stick to: core remains brutally simple, and everything that can be externalized to a compiler will be.
PRs that remove complexity are always welcome, but note that line count often is a bad proxy for complexity. Ideally the entire luminal core should be a few thousand lines of code, but anything remotely resembling code golf is not allowed.
PRs that remove complexity are always welcome, but note that line count often is a bad proxy for complexity. Ideally the entire Luminal core should be a few thousand lines of code, but anything remotely resembling code golf is not allowed.

View File

@@ -1,17 +1,22 @@
# Compilers
---
title: Compilers
description: 'Core transformations of the computation graph.'
icon: 'microchip'
---
So now we have our graph all set up. We did our forward passes through the model, so now what? Do we run it?
We could! But it wouldn't be very fast. Right now your graph is full of **primops**, which are the simplest set of primitive operations in luminal. One of the key tenets of luminal is a small primop set, which makes it easy to add new backends and write compilers. But another consequence of a small primop set is that even simple operations usually end up creating quite a few primops, and even small neural networks can end up with hundreds or thousands of them, which are slow to run directly. So it's time to compile the graph!
Compilers are structs that implement the `Compiler` trait, which simply specifies a single function:
We use a loose definition of a compiler. Compilers are structs that implement the `Compiler` trait, which simply specifies a single function:
```rust
pub trait Compiler {
type Output = ();
/// Run a compilation pass
fn compile<T: ToIdsMut>(&self, graph: &mut Graph, remap: T);
fn compile<T: ToIdsMut>(&self, graph: &mut Graph, remap: T) -> Self::Output;
}
```
So all a compiler does is take a mutable reference to the graph and something called remap (beyond the scope of this introduction), and do something to the graph. That something is compilation, usually in the form of finding patterns of nodes and replacing them with other nodes. For instance, there's no Subtract operation in the primops, so subtractions are implemented as `add(a, mul(b, -1))`. We can have a compiler that looks for that pattern of nodes and directly replaces it with a `Subtract` operation. We'll look at how to do this in the [Writing Compilers](https://github.com/jafioti/luminal/blob/main/docs/06%20Writing%20Compilers.md) section.
So all a compiler does is take a mutable reference to the graph and something called remap (beyond the scope of this introduction), and do something to the graph. That something is compilation, usually in the form of finding patterns of nodes and replacing them with other nodes. For instance, there's no Subtract operation in the primops, so subtractions are implemented as `add(a, mul(b, -1))`. We can have a compiler that looks for that pattern of nodes and directly replaces it with a `Subtract` operation. We'll look at how to do this in the [Writing Compilers](/developers/compilers) section.
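As a rough illustration of the pattern-replacement idea, the sketch below rewrites `add(a, mul(b, -1))` into a fused subtract on a small expression tree. It is not the real `Compiler` trait or graph type: the `Op` enum and `fuse_subtraction` are hypothetical names, and a real pass would search a graph and not recurse over a tree.

```rust
// Hypothetical op tree used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Input(&'static str),
    Const(f32),
    Add(Box<Op>, Box<Op>),
    Mul(Box<Op>, Box<Op>),
    Sub(Box<Op>, Box<Op>), // fused op, only produced by the pass below
}

/// Rewrites the tree bottom-up, replacing every `add(a, mul(b, -1))`
/// with `sub(a, b)` and leaving everything else unchanged.
fn fuse_subtraction(op: Op) -> Op {
    match op {
        Op::Add(a, b) => {
            let a = fuse_subtraction(*a);
            let b = fuse_subtraction(*b);
            match b {
                // The pattern: right operand is a multiplication by -1.
                Op::Mul(x, y) if *y == Op::Const(-1.0) => Op::Sub(Box::new(a), x),
                other => Op::Add(Box::new(a), Box::new(other)),
            }
        }
        Op::Mul(a, b) => Op::Mul(
            Box::new(fuse_subtraction(*a)),
            Box::new(fuse_subtraction(*b)),
        ),
        Op::Sub(a, b) => Op::Sub(
            Box::new(fuse_subtraction(*a)),
            Box::new(fuse_subtraction(*b)),
        ),
        leaf => leaf,
    }
}
```

The pass only matches `-1` on the right-hand side of the multiplication; a more thorough version would also handle `mul(-1, b)`.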
All you need to know for now is that we can use this compiler on the graph by doing:
```rust
@@ -19,9 +24,7 @@ cx.compile(SubtractionCompiler::default());
```
Now the graph will have the old mul + add pattern removed and Subtract ops placed in. There are plenty of different compilers for different purposes. Some of the popular ones:
- GenericCompiler - A handful of hardware-agnostic optimizations like [CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination) to be run before any hardware-specific compilers.
- CudaCompiler<T> - The full stack of CUDA compilers to convert a graph to a CUDA-specialized graph with T as the datatype (either f32 or f16). Imported from luminal_cuda
- MetalCompiler<T> - Same as CudaCompiler. Imported from luminal_metal
- CudaCompiler\<T\> - The full stack of CUDA compilers to convert a graph to a CUDA-specialized graph with T as the datatype (either f32 or f16). Imported from luminal_cuda.
- MetalCompiler\<T\> - Same as CudaCompiler. Imported from luminal_metal.
Compilers are entirely separate from luminal, so they can be fully implemented by third-party crates. For instance, everything specific to CUDA is contained in luminal_cuda.
[Now let's look into how to load weights from a file.](https://github.com/jafioti/luminal/blob/main/docs/05%20Serialization.md)
Compilers are entirely separate from luminal, so they can be fully implemented by third-party crates. For instance, everything specific to CUDA is contained in luminal_cuda.

View File

@@ -1,5 +1,10 @@
# Luminal Introduction
---
title: GraphTensor API
description: 'The high-level interface for writing ML code, checked at compile time.'
icon: 'webhook'
---
## Familiarizing ourselves
Let's get up to speed with how to use luminal, and how it works internally.
First we'll take a look at what the simplest program will look like:
@@ -35,4 +40,22 @@ Then we set the data for these tensors. But if `GraphTensor` doesn't hold data,
Alright, that was a lot but now we've touched on all the main aspects of running a model in luminal.
[Let's take a look at each piece in more depth.](https://github.com/jafioti/luminal/blob/main/docs/02%20GraphTensor%20API.md)
## GraphTensors
We're working with pretty complicated graphs to build our computation on, but we don't want to manually place all the nodes ourselves! So how can we build these static graphs in a nice, familiar way? GraphTensors!
Essentially GraphTensors are pointers to a specific node on the graph, as well as some metadata about the output of that node, such as its shape. We can make a new GraphTensor by doing:
```rust
let mut cx = Graph::new(); // We need a graph to build!
let a: GraphTensor<R1<3>> = cx.tensor(); // Here we create a new node on the graph and get a GraphTensor back, pointing to it.
```
Notice the type of `a`: `GraphTensor<R1<3>>`. So what's that generic all about? It's the shape! We make tensor shapes part of the type, so they're tracked at compile time! In this case, the shape is rank 1, with 3 elements, or in other words, a vector of 3 dimensions. (Side note: `R1<N>` is a typedef of `(Const<N>,)`) It should be impossible to accidentally get a runtime shape mismatch.
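To see why putting the shape in the type rules out mismatches, here is a small standalone sketch using const generics. `Vector` and `matvec` are hypothetical stand-ins and not the real `GraphTensor` or shape types; they hold data directly, which a `GraphTensor` does not.

```rust
/// Hypothetical shape-typed vector: the length `N` is part of the type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Vector<const N: usize>([f32; N]);

impl<const N: usize> std::ops::Add for Vector<N> {
    type Output = Vector<N>;

    /// Only vectors with the same `N` can be added.
    fn add(self, rhs: Self) -> Self::Output {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
            *o += r;
        }
        Vector(out)
    }
}

/// Multiplies a length-`A` vector by an `A x B` matrix, giving a length-`B` vector.
/// The output shape is derived by the compiler from the input shapes.
fn matvec<const A: usize, const B: usize>(v: Vector<A>, w: [[f32; B]; A]) -> Vector<B> {
    let mut out = [0.0; B];
    for i in 0..A {
        for j in 0..B {
            out[j] += v.0[i] * w[i][j];
        }
    }
    Vector(out)
}
```

With these definitions, adding a `Vector<3>` to a `Vector<2>` is rejected by the type checker, so the mismatch never reaches runtime.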
Now we can use `a` as you would a tensor in a library like PyTorch, performing linear algebra:
```rust
let b = a.exp().sqrt();
let c = b + a;
```
We just placed some ops on the graph! It doesn't look like it because you don't need to think about the graph while writing ML code.
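One way to picture what happens behind those method calls is the sketch below, where each operation appends a node to a shared graph and returns a new handle. Everything here (`Node`, `Graph`, `Handle`, `new_input`) is a hypothetical simplification: shapes are left out, and the graph is shared through `Rc` and `RefCell`, which is an assumption made to keep the example safe and short.

```rust
use std::cell::RefCell;
use std::ops::Add;
use std::rc::Rc;

/// Ops that can sit on the hypothetical graph; operands are node ids.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Input,
    Exp(usize),
    Sqrt(usize),
    Add(usize, usize),
}

/// The graph is just a list of nodes; a node's id is its index.
#[derive(Debug, Default)]
struct Graph {
    nodes: RefCell<Vec<Node>>,
}

/// A handle to one node: a graph reference plus the node's id. It holds no data.
#[derive(Clone)]
struct Handle {
    graph: Rc<Graph>,
    id: usize,
}

impl Handle {
    /// Appends a node to the graph and returns a handle pointing at it.
    fn push(&self, node: Node) -> Handle {
        let mut nodes = self.graph.nodes.borrow_mut();
        nodes.push(node);
        Handle { graph: Rc::clone(&self.graph), id: nodes.len() - 1 }
    }

    fn exp(&self) -> Handle {
        self.push(Node::Exp(self.id))
    }

    fn sqrt(&self) -> Handle {
        self.push(Node::Sqrt(self.id))
    }
}

impl Add for Handle {
    type Output = Handle;

    /// `b + a` records an `Add` node; nothing is computed yet.
    fn add(self, rhs: Handle) -> Handle {
        self.push(Node::Add(self.id, rhs.id))
    }
}

/// Creates an input node on the graph and returns its handle.
fn new_input(graph: &Rc<Graph>) -> Handle {
    let mut nodes = graph.nodes.borrow_mut();
    nodes.push(Node::Input);
    Handle { graph: Rc::clone(graph), id: nodes.len() - 1 }
}
```

Running `a.exp().sqrt()` followed by `b + a` against this sketch leaves four nodes on the graph: the input, `Exp`, `Sqrt`, and `Add`.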
Next we'll see how GraphTensors are used to build whole neural networks.

View File

@@ -0,0 +1,67 @@
---
title: Introduction
description: 'Welcome to a new way to do ML.'
icon: 'hand-wave'
---
<img
className="block dark:hidden rounded-xl"
src="/images/abstract_light.jpg"
alt="Hero Light"
/>
<img
className="hidden dark:block rounded-xl"
src="/images/abstract.jpg"
alt="Hero Dark"
/>
Luminal is a new machine learning framework focused on **speed**, **simplicity** and **composability**. We take a new approach to ML by focusing on static graphs and leaning heavily on compilers.
## Contents
Navigate around the Luminal docs.
<CardGroup cols={2}>
<Card
title="Quickstart"
icon="bolt"
href="/docs/quickstart"
>
Get ML models up and running in a flash.
</Card>
<Card
title="Why Luminal"
icon="lightbulb"
href="/docs/why"
>
Dive into why Luminal was created and the design philosophy behind it.
</Card>
<Card
title="GraphTensor API"
icon="webhook"
href="/docs/graphtensor"
>
High-level interface for building models.
</Card>
<Card
title="Modules"
icon="shapes"
href="/docs/modules"
>
Composable building blocks of complex neural networks.
</Card>
<Card
title="Compilers"
icon="microchip"
href="/docs/compilers"
>
Core transformations of the computation graph.
</Card>
<Card
title="Developers"
icon="code"
href="/docs/developers"
>
Resources for contributors and future development.
</Card>
</CardGroup>

View File

@@ -1,4 +1,9 @@
# NN Modules
---
title: Modules
description: 'Composable building blocks of complex neural networks.'
icon: 'shapes'
---
Like any good DL library, we organize our networks into `Module`s. Here is the module trait:
```rust
/// A module with a forward pass
@@ -26,6 +31,4 @@ impl<const A: usize, const B: usize> Module<GraphTensor<R1<A>>> for Linear<A, B>
```
Here we see a single weight matrix as the internal state, of size AxB. We've written a single forward function for single input vectors of shape (A,) and matmul it by our weight matrix to get an output of shape (B,).
Now all of these ops are recorded on the graph, to be compiled and run later on.
[So how does this compilation work? Let's find out!](https://github.com/jafioti/luminal/blob/main/docs/04%20Compilers.md)
Now all of these ops are recorded on the graph, to be compiled and run later on.

39
docs/docs/quickstart.mdx Normal file
View File

@@ -0,0 +1,39 @@
---
title: 'Quickstart'
description: 'Start running ML models in minutes.'
icon: 'bolt'
---
## Clone the repo
Clone the codebase locally by running the following:
```bash
git clone https://github.com/jafioti/luminal
cd luminal
```
## Hello World
Simple examples demonstrate how a library works without diving in too deep. Run your first Luminal code like so:
```bash
cd ./examples
cargo run --release
```
Great! You've run your first Luminal model!
## Run Llama 3
Run the following to start generating text with Llama 3 8B:
```bash
cd ./examples/llama
# Download the model
bash ./setup/setup.sh
# Run the model
cargo run --release --features metal # macOS (Recommended)
cargo run --release --features cuda # Nvidia
cargo run --release # CPU
```
<Warning>
Luminal currently isn't well optimized for CPU usage, so running large models like Llama 3 on CPU isn't recommended.
</Warning>

66
docs/docs/why.mdx Normal file
View File

@@ -0,0 +1,66 @@
---
title: 'Why Luminal'
description: 'ML is a crowded landscape. What makes Luminal different?'
icon: 'lightbulb'
---
## The ML ecosystem is fragmented
In recent years, ML has seen a flourishing of interest, especially after apps like ChatGPT gained huge traction. With this interest has come many fantastic open source projects and libraries lowering the barrier to entry.
But despite all the effort, it still feels hard to take an existing model and deploy it to a new environment without jumping through hoops.
#### Deployment
ML deployments usually come in one of two flavors: extensions to training libraries, and specialized deployment libraries.
PyTorch and JAX exemplify the current mainstream of training libraries. While there exist great deployment systems for these, typically they involve either trying to ship a standalone Python interpreter, or exporting the model to another library.
ONNX-based runtimes represent the standard in dedicated deployment libraries. Once you get the model into a supported format, like ONNX, deployment to your chosen environment is fairly easy.
#### Devices x Datatypes x Operations
On top of this, frameworks are usually only able to support a handful of devices, since implementing a device involves implementing every operation the framework supports. Throw in datatypes and the amount of code needed grows multiplicatively: every device needs every operation for every datatype.
When faced with all of this, it's no wonder ML developers usually just opt for the cloud, an environment they can have full control over.
## A better way
Luminal was born out of this frustration, and a desire to deploy to user devices with the same peace of mind Rust developers are used to. It turns out most of these problems were already solved in the early days of computing.
Why don't developers today hand-write assembly code? Why does code written on one machine work on all others? Do developers need to think about the differences between x86 and ARM ISAs? Of course not.
Let's learn the same lesson in ML. If you want to know how something is achieved in Luminal, there's a good chance the answer is the same: **compilers**.
## It's compilers all the way down
How simple *could* an ML library get? Surely after you made a linear algebra library you'd need to deal with datatypes, devices, backprop, and all the usual list of ML concerns, right? What if you could throw all those things away and just worry about doing the minimum to support arbitrary neural networks?
It turns out, it can get extremely simple. The core of Luminal is a few thousand lines of code and only 12 operations, which allows anyone to understand the whole thing in an afternoon.
But wouldn't that make your library so limited it's useless? **No!** Not if you can use compilers to add functionality back, in a composable, isolated way.
Let's see what we can do.
#### Devices
Since devices aren't handled by the core library, what if we had a compiler take each op present in the network and swap it out with equivalent operations on other devices, like CUDA GPUs? Or TPUs? Or quantum photonic retro-encabulators?
If you only have 12 ops, it's extremely straightforward. We can also have the compilers insert copy-to-device and copy-from-device ops so our data is moved correctly without us thinking about it.
So compilers get us support for other devices.
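Here is a minimal sketch of what such a device pass might look like on a straight-line program. The `Op` enum and `to_device` function are hypothetical names for this illustration; a real pass would operate on the graph and on real kernels, not on op names.

```rust
// Hypothetical ops for a straight-line program.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Input,              // data starts on the host
    Cpu(&'static str),  // a primop running on the host
    CopyToDevice,
    Gpu(&'static str),  // the device equivalent of a primop
    CopyFromDevice,
}

/// Swaps every host op for its device equivalent and inserts copies
/// wherever data crosses the host/device boundary.
fn to_device(program: &[Op]) -> Vec<Op> {
    let mut out = Vec::new();
    let mut on_device = false;
    for op in program {
        match op {
            Op::Cpu(name) => {
                if !on_device {
                    out.push(Op::CopyToDevice);
                    on_device = true;
                }
                out.push(Op::Gpu(*name));
            }
            other => {
                if on_device {
                    out.push(Op::CopyFromDevice);
                    on_device = false;
                }
                out.push(other.clone());
            }
        }
    }
    // Bring the final result back to the host.
    if on_device {
        out.push(Op::CopyFromDevice);
    }
    out
}
```

Consecutive device ops share a single pair of copies, so data only moves when it actually has to.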
#### Datatypes
We want more than just fp32. If you tilt your head and squint, other datatypes are the same as other devices. It's just another separate set of ops that processes your tensors slightly differently. So we can have a compiler insert the ops that support our desired datatype, and insert ops that convert to and from fp32.
So we get datatypes back as well, through compilers.
#### Training
Whether or not a library will support training is one of the first decisions a developer makes when starting out. So surely, if the core of Luminal doesn't support training, there's no way it'll be added in externally, right?
Nope! Compilers to the rescue again. With a limited op set, we can easily handle all possible cases of operations and derive the local gradients to get a full backward graph, and then connect it to the existing forward graph.
Boom! We now have access to gradients! With a few more convenience functions, we can use those gradients to update the model's weights. Training has arrived!
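The idea of deriving local gradients for a small op set can be sketched in a few lines of reverse-mode differentiation. This is an illustration under assumptions, not the Autograd compiler itself: `Op`, `forward`, and `backward` are hypothetical names, only three ops are covered, and the sketch computes gradient values directly without emitting backward nodes onto a graph.

```rust
/// A tiny op set standing in for the primops; operands are node indices.
#[derive(Debug, Clone, Copy)]
enum Op {
    Input(f32),
    Add(usize, usize),
    Mul(usize, usize),
    Recip(usize),
}

/// Forward pass: nodes are in topological order, so one sweep computes every value.
fn forward(graph: &[Op]) -> Vec<f32> {
    let mut vals: Vec<f32> = Vec::with_capacity(graph.len());
    for op in graph {
        let v = match *op {
            Op::Input(x) => x,
            Op::Add(a, b) => vals[a] + vals[b],
            Op::Mul(a, b) => vals[a] * vals[b],
            Op::Recip(a) => 1.0 / vals[a],
        };
        vals.push(v);
    }
    vals
}

/// Backward pass: walk the nodes in reverse, applying each op's local gradient rule.
/// Returns d(last node) / d(node i) for every node i.
fn backward(graph: &[Op], vals: &[f32]) -> Vec<f32> {
    let mut grads = vec![0.0; graph.len()];
    if let Some(last) = grads.last_mut() {
        *last = 1.0;
    }
    for (i, op) in graph.iter().enumerate().rev() {
        let g = grads[i];
        match *op {
            Op::Input(_) => {}
            Op::Add(a, b) => {
                grads[a] += g;
                grads[b] += g;
            }
            Op::Mul(a, b) => {
                grads[a] += g * vals[b];
                grads[b] += g * vals[a];
            }
            Op::Recip(a) => grads[a] -= g / (vals[a] * vals[a]),
        }
    }
    grads
}
```

For `z = x * y + x` with `x = 3` and `y = 4`, the backward pass gives `dz/dx = 5` and `dz/dy = 3`. Because each op needs only one local rule, covering the whole op set stays a small, enumerable job.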
## In conclusion
By now you should be seeing a trend. Everything we've removed from the core library we can add back in with external compilers. But now all that functionality is external to the core, hackable, and isolated. You can use the Autograd compiler with the CudaFp16 compiler (or any other device / datatype compiler) and be confident it will Just Work™.
In the coming months you can expect to see advanced features like full 3D-parallel training, low-bit quantizations, and RL coming to Luminal, by way of external crates. Which means if you want to add something big, you probably can do it by writing your own compiler!

docs/favicon.png Normal file
Binary file not shown. Size: 2.0 KiB

docs/images/abstract.jpg Normal file
Binary file not shown. Size: 125 KiB

Binary file not shown. Size: 100 KiB

docs/logo/luminal_logo.png Normal file
Binary file not shown. Size: 10 KiB

Binary file not shown. Size: 11 KiB