coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

License: Other
add/sub and mul/div are each implemented slightly differently, and they are hard to read.
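One possible cleanup, purely as a sketch (the helper name and shape here are assumptions, not dfdx's actual API): factor the shared element-wise loop into a single helper parameterized by the forward op and its partial derivatives, so the four ops differ only in the closures they pass.

```rust
/// Hypothetical helper: computes the forward result plus both partial
/// derivatives element-wise. add/sub/mul/div would only differ in closures.
fn binary_map(
    lhs: &[f32],
    rhs: &[f32],
    f: impl Fn(f32, f32) -> f32,
    dfdx: impl Fn(f32, f32) -> f32,
    dfdy: impl Fn(f32, f32) -> f32,
) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
    let out: Vec<f32> = lhs.iter().zip(rhs).map(|(&x, &y)| f(x, y)).collect();
    let dx: Vec<f32> = lhs.iter().zip(rhs).map(|(&x, &y)| dfdx(x, y)).collect();
    let dy: Vec<f32> = lhs.iter().zip(rhs).map(|(&x, &y)| dfdy(x, y)).collect();
    (out, dx, dy)
}

// div would then be:
// binary_map(a, b, |x, y| x / y, |_, y| 1.0 / y, |x, y| -x / (y * y))
```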
For safety & clarity reasons. If you clone a tensor for backprop, more often than not you want that to be a different tensor that is treated separately during backprop. For cases where you do want to keep the id the same, `.duplicate()` should be used. The only place this really occurs is in `kl_div_with_logits_loss`, where `target_probs` is cloned since it's used twice.
Needed for transformers #34
Ideally we'd have `p` be a const parameter. Unfortunately, `f32` cannot be a const generic parameter in stable Rust. Many use cases make p `1 / N`, where N is just an integer. `Dropout1In<N>` would set p to be `1.0 / N as f32` for now.
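A minimal sketch of that workaround, assuming a plain struct (a real layer would also carry whatever state dropout needs):

```rust
/// Dropout with probability fixed at 1 / N via a const generic,
/// sidestepping the lack of stable f32 const parameters.
struct Dropout1In<const N: usize>;

impl<const N: usize> Dropout1In<N> {
    fn p(&self) -> f32 {
        1.0 / N as f32
    }
}

// Dropout1In<5> behaves like dropout with p = 0.2.
```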
Currently this only works for actual probability distributions. Hard cross entropy only has 1 non-zero entry in the inner dimension, so sum across that dimension before taking the mean.
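With a one-hot target the whole inner sum collapses to a single term, which is what makes the sum-then-mean formulation cheap. A hand-rolled illustration on plain slices (not dfdx's API):

```rust
/// Hard cross entropy from logits: with a one-hot target,
/// -sum(t * log_softmax(x)) reduces to logsumexp(x) - x[class].
fn hard_cross_entropy(logits: &[Vec<f32>], classes: &[usize]) -> f32 {
    let mut total = 0.0;
    for (row, &c) in logits.iter().zip(classes) {
        // numerically stable logsumexp
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let logsumexp = max + row.iter().map(|x| (x - max).exp()).sum::<f32>().ln();
        total += logsumexp - row[c];
    }
    total / logits.len() as f32 // mean over the outer dimension
}
```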
Related to #6
E.g. for xavier uniform initialization you need to know the in size & out size.
This will likely require a different trait than Randomize, and I'm still inclined to keep Randomize. It'll also be slightly easier to use, since the user won't have to pass in a distribution.
Options:
model.reset_params(&mut rng);
model.init_params(&mut rng);
model.randomize_params(&mut rng);
This should use `Tensor::randomize()` under the hood.
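A rough sketch of what the first option could look like; the trait name and the Linear layout here are assumptions for illustration, and a real version would go through `Tensor::randomize()` rather than raw arrays:

```rust
use rand::distributions::{Distribution, Uniform};
use rand::Rng;

trait ResetParams {
    fn reset_params<R: Rng>(&mut self, rng: &mut R);
}

struct Linear<const I: usize, const O: usize> {
    weight: [[f32; I]; O],
}

impl<const I: usize, const O: usize> ResetParams for Linear<I, O> {
    fn reset_params<R: Rng>(&mut self, rng: &mut R) {
        // Xavier uniform: the layer knows its own fan-in/fan-out,
        // so the caller no longer passes in a distribution.
        let bound = (6.0 / (I + O) as f32).sqrt();
        let dist = Uniform::new(-bound, bound);
        for row in self.weight.iter_mut() {
            for w in row.iter_mut() {
                *w = dist.sample(rng);
            }
        }
    }
}
```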
This would reduce the last dim to the maximum value in that dimension. It can use `T::Device::reduce_last_dim(..., &mut f32::max)` (see logsumexp for an example using that).
Example:
```rust
let t: Tensor2D<2, 3> = Tensor2D::new([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]);
let r: Tensor1D<2> = max_last_dim(t);
assert_eq!(r.data(), &[3.0, -1.0]);
```
This will also slightly reduce the required movement of tape when using these functions.
Needed for transformers #34
This will be another generic parameter on all tensors. Most existing operations will likely require a float generic bound.
Related to #9, since it involves an additional generic parameter.
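A sketch of the shape this might take, with the new parameter defaulted to `f32` to limit churn (the `Dtype` trait here is an assumption):

```rust
trait Dtype: Copy + Default {}
impl Dtype for f32 {}
impl Dtype for f64 {}

// Defaulting D = f32 keeps existing Tensor2D<M, N> code compiling.
struct Tensor2D<const M: usize, const N: usize, D: Dtype = f32> {
    data: [[D; N]; M],
}
```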
Something that takes a `usize` length of the dataset, and you can:
Each of these would return a `[usize; M]`, where M is a `const M: usize` (rough sketch below).
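A loose sketch of the shape of such an API; the struct and method names are made up for illustration:

```rust
use rand::Rng;

struct Sampler {
    len: usize, // length of the dataset
}

impl Sampler {
    /// Draw M indices (with replacement) from 0..len.
    fn sample<const M: usize, R: Rng>(&self, rng: &mut R) -> [usize; M] {
        let mut inds = [0usize; M];
        for i in inds.iter_mut() {
            *i = rng.gen_range(0..self.len);
        }
        inds
    }
}
```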
This will need:
While this would force functions to allocate space for derivatives inside, it would be cleaner from an API perspective.
E.g.
```rust
fn matmul_ref(a: &..., b: &..., tape: H) {}
```
pytorch's adam page has pseudocode for this: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html?highlight=adam#torch.optim.Adam
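Transcribing that pseudocode into a rough Rust sketch for a single flat parameter buffer (no weight decay or amsgrad; hyperparameter names follow pytorch's docs, the struct itself is an assumption):

```rust
struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    t: i32,
    m: Vec<f32>, // first moment estimates
    v: Vec<f32>, // second moment estimates
}

impl Adam {
    fn update(&mut self, params: &mut [f32], grads: &[f32]) {
        self.t += 1;
        // bias-correction terms 1 - beta^t
        let bias1 = 1.0 - self.beta1.powi(self.t);
        let bias2 = 1.0 - self.beta2.powi(self.t);
        for ((p, &g), (m, v)) in params
            .iter_mut()
            .zip(grads)
            .zip(self.m.iter_mut().zip(self.v.iter_mut()))
        {
            *m = self.beta1 * *m + (1.0 - self.beta1) * g;
            *v = self.beta2 * *v + (1.0 - self.beta2) * g * g;
            let m_hat = *m / bias1;
            let v_hat = *v / bias2;
            *p -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}
```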
Since everywhere it's used, the callee function owns the data for both. This will also allow it to reduce allocations for derivatives.
something like
```rust
fn select<const S: usize>(self, inds: &[usize; S]) -> Self<S, ...>;
```
I imagine the gradients for this would just be 1 if i is in inds, otherwise 0.
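A toy 1D version to pin down those semantics, returning the picked values alongside the mask-style gradient (names and shapes are illustrative only):

```rust
fn select<const N: usize, const S: usize>(
    t: [f32; N],
    inds: [usize; S],
) -> ([f32; S], [f32; N]) {
    let mut out = [0.0; S];
    let mut grad = [0.0; N];
    for (j, &i) in inds.iter().enumerate() {
        out[j] = t[i];
        grad[i] = 1.0; // 1 if i is in inds, otherwise 0
    }
    (out, grad)
}
```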
This function would stop using `alloc_zeroed()` and `Box::from_raw()`: https://github.com/coreylowman/dfdx/blob/main/src/devices/allocate.rs#L11
```rust
use std::alloc::{alloc_zeroed, Layout};

// Current approach: zero-allocate on the heap, then take ownership as a Box,
// avoiding a large stack intermediate.
let layout = Layout::new::<T>();
debug_assert_eq!(layout.size(), T::NUM_BYTES);
unsafe {
    let ptr = alloc_zeroed(layout) as *mut T;
    Box::from_raw(ptr)
}
```
E.g.
It currently:
It'd be nice to reuse one, but because of the order of operations it may be impossible: to compute the lhs derivative you need the rhs & result, and to compute the rhs derivative you need the lhs & result.
This would be a variable sized head, where the input to the module is duplicated and the same input is passed to all sub modules.
Unclear how this would work since we are already using tuples. Perhaps something like:
```rust
impl Module<I> for MultiHead<(A, B)> {}
impl Module<I> for MultiHead<(A, B, C)> {}
impl Module<I> for MultiHead<(A, B, C, D)> {}
// ...
```
?
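A minimal two-module sketch of the tuple-impl idea, assuming a simplified Module trait (dfdx's real trait also threads the gradient tape through, so the details differ):

```rust
trait Module<I> {
    type Output;
    fn forward(&self, input: I) -> Self::Output;
}

struct MultiHead<T>(T);

// Duplicate the input and pass the same value to both sub modules.
impl<I: Clone, A: Module<I>, B: Module<I>> Module<I> for MultiHead<(A, B)> {
    type Output = (A::Output, B::Output);
    fn forward(&self, input: I) -> Self::Output {
        let (a, b) = &self.0;
        (a.forward(input.clone()), b.forward(input))
    }
}
```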
This would accept an array of `T::Reduced::ArrayType`, where the Dtype is usize, and select the items from the last dimension that match up. It would return a `T::Reduced`.
Example:
```rust
let t: Tensor2D<2, 3> = Tensor2D::new([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]);
let r: Tensor1D<2> = gather_last_dim(t, [0, 1]);
assert_eq!(r.data(), &[1.0, -2.0]);
```
Examples where this would remove an allocation:
- `kl_div_with_logits_loss`, where `target_probs` is duplicated
- the calculation where `max_value` is duplicated

One of the arguments can be reused as the storage for the gradient.
Currently using the matrixmultiply crate, but I think performance could be much improved by using an actual BLAS library. Unclear how compiling/including that works, since it has to be compiled per machine.
There's a lot of work to be done here. Very rough list of todos:

Preparation
- Devices
  - `Cuda`: device that wraps `cudarc::CudaDevice` and an rng
  - `Cpu`: `DeviceArc` and `DeviceRng` (`Arc<T>` and `Arc<Cpu>`)
- Tensors
  - add `Device` to all tensor structs
  - take `&Device` as parameter, and remove Rng since that will be accessed through the device
- nn
  - `trait ModuleCreator`
  - `ModuleCreator::zeros(Device)`
  - `ModuleCreator::default(Device)`, which calls zeros & reset params
- Kernels
  - `trait LaunchKernel<K, Args>`
  - `impl LaunchKernel<...> for Cpu`, and `trait <Kernel>CpuImpl` / `impl <Kernel>CpuImpl for <Kernel>`. See cudarc/examples/kernels.rs
  - `kernel!(|a, b, c| { *a = b + c })` (#185)
- Testing
  - `#[cfg(feature = "test-cuda")]` that when specified uses cuda instead of cpu?
  - `build_test_device!()` that uses testing features to create the device

Done:
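To make the Kernels items above concrete, a loose illustration of the `LaunchKernel` shape with a CPU-side element-wise add; everything beyond the trait name from the list is an assumption:

```rust
struct Cpu;
struct AddKernel;

trait LaunchKernel<K, Args> {
    fn launch(&self, args: Args);
}

// Cpu fallback for the add kernel: mirrors kernel!(|a, b, c| { *a = b + c }).
impl<'a> LaunchKernel<AddKernel, (&'a mut [f32], &'a [f32], &'a [f32])> for Cpu {
    fn launch(&self, (a, b, c): (&'a mut [f32], &'a [f32], &'a [f32])) {
        for ((a, b), c) in a.iter_mut().zip(b).zip(c) {
            *a = b + c;
        }
    }
}
```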
pytorch's sgd page has pseudocode for this: https://pytorch.org/docs/stable/generated/torch.optim.SGD.html
Would like to add a small example of using a transformer architecture. This will likely involve new features such as batched matmul and maybe some others.