coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

License: Other
add/sub and mul/div are each implemented slightly differently, and they are hard to read.
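One possible cleanup, purely as a sketch (the helper name and shape here are assumptions, not dfdx's actual API): factor the shared element-wise loop into a single helper parameterized by the forward op and its partial derivatives, so the four ops differ only in the closures they pass.

```rust
/// Hypothetical helper: computes the forward result plus both partial
/// derivatives element-wise. add/sub/mul/div would only differ in closures.
fn binary_map(
    lhs: &[f32],
    rhs: &[f32],
    f: impl Fn(f32, f32) -> f32,
    dfdx: impl Fn(f32, f32) -> f32,
    dfdy: impl Fn(f32, f32) -> f32,
) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
    let out: Vec<f32> = lhs.iter().zip(rhs).map(|(&x, &y)| f(x, y)).collect();
    let dx: Vec<f32> = lhs.iter().zip(rhs).map(|(&x, &y)| dfdx(x, y)).collect();
    let dy: Vec<f32> = lhs.iter().zip(rhs).map(|(&x, &y)| dfdy(x, y)).collect();
    (out, dx, dy)
}

// div would then be:
// binary_map(a, b, |x, y| x / y, |_, y| 1.0 / y, |x, y| -x / (y * y))
```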
For safety & clarity reasons. If you clone a tensor for backprop, more often than not you want that to be a different tensor that is treated separately during backprop. For cases where you do want to keep the id the same, `.duplicate()` should be used. The only place this really occurs is in `kl_div_with_logits_loss`, where `target_probs` is cloned since it's used twice.
Needed for transformers #34
Ideally we'd have `p` be a const parameter. Unfortunately, `f32` cannot be a const generic parameter in stable Rust. Many use cases make p `1 / N`, where N is just an integer. `Dropout1In<N>` would set p to be `1.0 / N as f32` for now.
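A minimal sketch of that workaround, assuming a plain struct (a real layer would also carry whatever state dropout needs):

```rust
/// Dropout with probability fixed at 1 / N via a const generic,
/// sidestepping the lack of stable f32 const parameters.
struct Dropout1In<const N: usize>;

impl<const N: usize> Dropout1In<N> {
    fn p(&self) -> f32 {
        1.0 / N as f32
    }
}

// Dropout1In<5> behaves like dropout with p = 0.2.
```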
Currently this only works for actual probability distributions. Hard cross entropy only has 1 non-zero entry in the inner dimension, so sum across that dimension before taking the mean.
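With a one-hot target the whole inner sum collapses to a single term, which is what makes the sum-then-mean formulation cheap. A hand-rolled illustration on plain slices (not dfdx's API):

```rust
/// Hard cross entropy from logits: with a one-hot target,
/// -sum(t * log_softmax(x)) reduces to logsumexp(x) - x[class].
fn hard_cross_entropy(logits: &[Vec<f32>], classes: &[usize]) -> f32 {
    let mut total = 0.0;
    for (row, &c) in logits.iter().zip(classes) {
        // numerically stable logsumexp
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let logsumexp = max + row.iter().map(|x| (x - max).exp()).sum::<f32>().ln();
        total += logsumexp - row[c];
    }
    total / logits.len() as f32 // mean over the outer dimension
}
```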
Related to #6
E.g. for xavier uniform initialization you need to know the in size & out size.
This will likely require a different trait than Randomize, and I'm still inclined to keep Randomize. It'll also be slightly easier to use, since the user won't have to pass in a distribution.
Options:
model.reset_params(&mut rng);
model.init_params(&mut rng);
model.randomize_params(&mut rng);
This should use `Tensor::randomize()` under the hood.
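A rough sketch of what the first option could look like; the trait name and the Linear layout here are assumptions for illustration, and a real version would go through `Tensor::randomize()` rather than raw arrays:

```rust
use rand::distributions::{Distribution, Uniform};
use rand::Rng;

trait ResetParams {
    fn reset_params<R: Rng>(&mut self, rng: &mut R);
}

struct Linear<const I: usize, const O: usize> {
    weight: [[f32; I]; O],
}

impl<const I: usize, const O: usize> ResetParams for Linear<I, O> {
    fn reset_params<R: Rng>(&mut self, rng: &mut R) {
        // Xavier uniform: the layer knows its own fan-in/fan-out,
        // so the caller no longer passes in a distribution.
        let bound = (6.0 / (I + O) as f32).sqrt();
        let dist = Uniform::new(-bound, bound);
        for row in self.weight.iter_mut() {
            for w in row.iter_mut() {
                *w = dist.sample(rng);
            }
        }
    }
}
```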
This would reduce the last dim to the maximum value in that dimension. It can use `T::Device::reduce_last_dim(..., &mut f32::max)` (see logsumexp for an example using that).
Example:
```rust
let t: Tensor2D<2, 3> = Tensor2D::new([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]);
let r: Tensor1D<2> = max_last_dim(t);
assert_eq!(r.data(), &[3.0, -1.0]);
```
This will also slightly reduce the required movement of tape when using these functions.
Needed for transformers #34
This will be another generic parameter on all tensors. Most existing operations will likely require a float generic bound.
Related to #9, since it involves an additional generic parameter.
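A sketch of the shape this might take, with the new parameter defaulted to `f32` to limit churn (the `Dtype` trait here is an assumption):

```rust
trait Dtype: Copy + Default {}
impl Dtype for f32 {}
impl Dtype for f64 {}

// Defaulting D = f32 keeps existing Tensor2D<M, N> code compiling.
struct Tensor2D<const M: usize, const N: usize, D: Dtype = f32> {
    data: [[D; N]; M],
}
```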
Something that takes a `usize` length of the dataset, and you can:
Each of these would return a `[usize; M]`, where M is a `const M: usize` (rough sketch below).
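A loose sketch of the shape of such an API; the struct and method names are made up for illustration:

```rust
use rand::Rng;

struct Sampler {
    len: usize, // length of the dataset
}

impl Sampler {
    /// Draw M indices (with replacement) from 0..len.
    fn sample<const M: usize, R: Rng>(&self, rng: &mut R) -> [usize; M] {
        let mut inds = [0usize; M];
        for i in inds.iter_mut() {
            *i = rng.gen_range(0..self.len);
        }
        inds
    }
}
```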
This will need:
While this would force functions to allocate space for derivatives inside, it would be cleaner from an API perspective.
E.g.
```rust
fn matmul_ref(a: &..., b: &..., tape: H) {}
```
pytorch's adam page has pseudocode for this: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html?highlight=adam#torch.optim.Adam
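Transcribing that pseudocode into a rough Rust sketch for a single flat parameter buffer (no weight decay or amsgrad; hyperparameter names follow pytorch's docs, the struct itself is an assumption):

```rust
struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    t: i32,
    m: Vec<f32>, // first moment estimates
    v: Vec<f32>, // second moment estimates
}

impl Adam {
    fn update(&mut self, params: &mut [f32], grads: &[f32]) {
        self.t += 1;
        // bias-correction terms 1 - beta^t
        let bias1 = 1.0 - self.beta1.powi(self.t);
        let bias2 = 1.0 - self.beta2.powi(self.t);
        for ((p, &g), (m, v)) in params
            .iter_mut()
            .zip(grads)
            .zip(self.m.iter_mut().zip(self.v.iter_mut()))
        {
            *m = self.beta1 * *m + (1.0 - self.beta1) * g;
            *v = self.beta2 * *v + (1.0 - self.beta2) * g * g;
            let m_hat = *m / bias1;
            let v_hat = *v / bias2;
            *p -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}
```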
Since everywhere it's used, the callee function owns the data for both. This will also allow it to reduce allocations for derivatives.
something like
```rust
fn select<const S: usize>(self, inds: &[usize; S]) -> Self<S, ...>;
```
I imagine the gradients for this would just be 1 if i is in inds, otherwise 0.
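A toy 1D version to pin down those semantics, returning the picked values alongside the mask-style gradient (names and shapes are illustrative only):

```rust
fn select<const N: usize, const S: usize>(
    t: [f32; N],
    inds: [usize; S],
) -> ([f32; S], [f32; N]) {
    let mut out = [0.0; S];
    let mut grad = [0.0; N];
    for (j, &i) in inds.iter().enumerate() {
        out[j] = t[i];
        grad[i] = 1.0; // 1 if i is in inds, otherwise 0
    }
    (out, grad)
}
```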
This function would stop using `alloc_zeroed()` and `Box::from_raw()`: https://github.com/coreylowman/dfdx/blob/main/src/devices/allocate.rs#L11
```rust
use std::alloc::{alloc_zeroed, Layout};

// Current approach: zero-allocate on the heap, then take ownership as a Box,
// avoiding a large stack intermediate.
let layout = Layout::new::<T>();
debug_assert_eq!(layout.size(), T::NUM_BYTES);
unsafe {
    let ptr = alloc_zeroed(layout) as *mut T;
    Box::from_raw(ptr)
}
```
E.g.
It currently:
It'd be nice to reuse one, but because of the order of operations it may be impossible: to compute the lhs derivative you need the rhs & result, and to compute the rhs derivative you need the lhs & result.
This would be a variable sized head, where the input to the module is duplicated and the same input is passed to all sub modules.
Unclear how this would work since we are already using tuples. Perhaps something like:
```rust
impl Module<I> for MultiHead<(A, B)> {}
impl Module<I> for MultiHead<(A, B, C)> {}
impl Module<I> for MultiHead<(A, B, C, D)> {}
// ...
```
?
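A minimal two-module sketch of the tuple-impl idea, assuming a simplified Module trait (dfdx's real trait also threads the gradient tape through, so the details differ):

```rust
trait Module<I> {
    type Output;
    fn forward(&self, input: I) -> Self::Output;
}

struct MultiHead<T>(T);

// Duplicate the input and pass the same value to both sub modules.
impl<I: Clone, A: Module<I>, B: Module<I>> Module<I> for MultiHead<(A, B)> {
    type Output = (A::Output, B::Output);
    fn forward(&self, input: I) -> Self::Output {
        let (a, b) = &self.0;
        (a.forward(input.clone()), b.forward(input))
    }
}
```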
This would accept an array of `T::Reduced::ArrayType`, where the Dtype is usize, and select the items from the last dimension that match up. It would return a `T::Reduced`.
Example:
```rust
let t: Tensor2D<2, 3> = Tensor2D::new([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]);
let r: Tensor1D<2> = gather_last_dim(t, [0, 1]);
assert_eq!(r.data(), &[1.0, -2.0]);
```
Examples where this would remove an allocation:
- `kl_div_with_logits_loss`, where `target_probs` is duplicated
- the calculation where `max_value` is duplicated

One of the arguments can be reused as the storage for the gradient.
Currently using the matrixmultiply crate, but I think performance could be much improved by using an actual BLAS library. Unclear how compiling/including that works, since it has to be compiled per machine.
There's a lot of work to be done here. Very rough list of todos:

Preparation
- Devices
  - `Cuda`: device that wraps `cudarc::CudaDevice` and an rng
  - `Cpu`: `DeviceArc` and `DeviceRng` (`Arc<T>` and `Arc<Cpu>`)
- Tensors
  - add `Device` to all tensor structs
  - take `&Device` as parameter, and remove Rng since that will be accessed through the device
- nn
  - `trait ModuleCreator`
  - `ModuleCreator::zeros(Device)`
  - `ModuleCreator::default(Device)`, which calls zeros & reset params
- Kernels
  - `trait LaunchKernel<K, Args>`
  - `impl LaunchKernel<...> for Cpu`, and `trait <Kernel>CpuImpl` / `impl <Kernel>CpuImpl for <Kernel>`. See cudarc/examples/kernels.rs
  - `kernel!(|a, b, c| { *a = b + c })` (#185)
- Testing
  - `#[cfg(feature = "test-cuda")]` that when specified uses cuda instead of cpu?
  - `build_test_device!()` that uses testing features to create the device

Done:
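To make the Kernels items above concrete, a loose illustration of the `LaunchKernel` shape with a CPU-side element-wise add; everything beyond the trait name from the list is an assumption:

```rust
struct Cpu;
struct AddKernel;

trait LaunchKernel<K, Args> {
    fn launch(&self, args: Args);
}

// Cpu fallback for the add kernel: mirrors kernel!(|a, b, c| { *a = b + c }).
impl<'a> LaunchKernel<AddKernel, (&'a mut [f32], &'a [f32], &'a [f32])> for Cpu {
    fn launch(&self, (a, b, c): (&'a mut [f32], &'a [f32], &'a [f32])) {
        for ((a, b), c) in a.iter_mut().zip(b).zip(c) {
            *a = b + c;
        }
    }
}
```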
pytorch's sgd page has pseudocode for this: https://pytorch.org/docs/stable/generated/torch.optim.SGD.html
Would like to add a small example of using a transformer architecture. This will likely involve new features such as batched matmul and maybe some others.