* [custom op] Generalize shape library logic to work with dtypes
This commit generalizes the shape library logic, so that dtype rules
for ops can also be expressed using the same mechanism. In other
words, each op can now have a shape function and a dtype function
specified in Python that is imported during lowering to calculate the
shapes and dtypes throught a program. For more information about how
to specify a dtype function, see the updated
`docs/adding_a_shape_and_dtype_function.md`.
For those not familiar with how the shape library works, the file
`docs/calculations_lib.md` provides an overview.
Currently `getTensorRank` returns -1 if it was unable to get the rank
of the tensor. However, not every use in the codebase was checking the
return value, and in some cases, the return value was casted to
unsigned leading to some infinte loops when an unranked tensor reached
a decomposition.
This commit changes the return of `getTensorRank` to
`Optional<unsigned>` to make it clear to the user that the function
can fail.
This commit also changes a couple of for loops that iterate a vector
in reverse order that can potentially become infinite loops into
range-based for loops.
This was an experimental attempt at rolling out own op-by-op executor
with `__torch_dispatch__`, but it proved difficult to make it robust.
Op-by-op execution is very easy to implement robustly now with the
PyTorch 2.0 stack, so we don't need eager_mode.
Downstream users were using eager_mode to implement lockstep numerical
accuracy debuggers. We implemented the same functionality with
TorchDynamo in https://github.com/llvm/torch-mlir/pull/1681 so now there
is not much reason to continue maintaining it.
A circular dependency was introduced in e7edcc62fd.
Specifically, the `makeShapeLLVMCompatible` and `makeShapeTorchCompatible` utilities were being called from `lib/Dialect/Torch/IR/TorchTypes.cpp` and `lib/Dialect/Torch/IR/TorchOps.cpp` defined under the `:TorchMLIRTorchDialect` bazel target, leading it to take a dependency on `:TorchMLIRConversionUtils` which already depends on `:TorchMLIRTorchDialect`, hence creating a circular dependency.
This commit resolves the same by moving said utilities from `lib/Conversion/Utils/Utils.cpp` to `lib/Dialect/Torch/Utils/Utils.cpp`. Please LMK if there's a better way to fix this and I will update the code.
This commit also adds the required targets to support building the new conversions from Torch to ML Program dialect that was introduced in f416953600.
Bazel build GHA triggered manually to verify: https://github.com/sjain-stanford/torch-mlir/actions/runs/3645944517
The current implementation of `DecomposeComplexOps` fails if an op
expected to be decomposed does not get decomposed in the first
iteration of the `createTorchSimplificationPipeline` in
`LowerToBackendContractPass`. However, some graphs require multiple
iterations of `createTorchSimplificationPipeline` to fully propagate
all statically knowable information, such as dtypes and shapes, to the
entire graph, sometimes resulting in the need to run
`DecomposeComplexOps` more than once.
This commit changes `DecomposeComplexOps` to use a greedy algorithm
for pattern application and moves the legalization check of ops to the
`LowerToBackendContractPass` to allow for the `DecomposeComplexOps` to
run more than once.
This gives some decent improvements to memory consumption and latency of
testing. I would have expected buffer-deallocation to actually make a
big difference to the final process RSS but it doesn't appear to. Also
running buffer-deallocation later in the pipeline results in
miscompiles. I didn't have the time or interest to dig in deeper, but
something is off.
(numbers below are taken from a single run, but I did do a few runs to make
sure that the variance wasn't that great)
- Linalg-on-Tensors shows memory consumption improvements and some slight speedups.
```
./tools/e2e_test.sh -s -v -c refbackend
fuse=0 dealloc=0
RSS: 3071.33 MB
real 3m58.204s
user 6m22.299s
sys 0m51.235s
fuse=1 dealloc=0
RSS: 2515.89 MB
real 3m34.797s
user 5m56.902s
sys 0m44.933s
fuse=1 dealloc=post-bufferize:
RSS: 2290.25 MB
real 3m42.242s
user 6m0.560s
sys 0m46.335s
```
- TOSA ResNet18 gets significantly faster and uses significantly less memory.
```
time ./tools/e2e_test.sh -s -v -c tosa -f ResNet18
fuse=0 dealloc=0
rss 1328.56 MB
real 0m50.303s
user 0m55.355s
sys 0m12.260s
fuse=1 dealloc=0
rss 859MB
real 0m30.454s
user 0m35.551s
sys 0m11.879s
fuse=1 dealloc=post-bufferize:
rss 851MB
real 0m30.313s
user 0m39.889s
sys 0m11.941s
```
Big thanks to Ramiro for the methodology here for measuring the RSS with
`psutil`:
https://gist.github.com/ramiro050/5b5c2501f7389c008d9029210772c3a8
This more accurately reflects what it is. The previous name was
conflating the use of RefBackend (which `linalg`, `tosa`, and `mhlo`
configs all use) with the use of the linalg backend (e.g. TorchToLinalg).
This conflation was artifically giving the linalg backend a "privileged"
position, which we want to avoid. We still keep it as the default
backend, and it remains the most complete, but at least there's not
artificial boosting.
- Support for non-prefixed accessors has been removed. See:
https://reviews.llvm.org/D136727
- Rename `operands` to `methodOperands` in `prim.CallMethod` since the
name `operands` overlaps with a builtin method name. See:
https://reviews.llvm.org/D136727
- Add passes in refbackend to lower memref.subview. See:
https://reviews.llvm.org/D136377
- Replace `CopyToValueTensorOps` first in `RewriteViewLikeSubgraph` in
maximize-value-semantics.
The current implementation of the `RewriteViewLikeSubgraph` pass in
maximize-value-semantics creates temporarily invalid IR. In
particular, given a forward slice starting from a
`CopyToNonValueTensorOp` and ending in `CopyToValueTensorOp`s, the
pass first replaces all uses of the `CopyToNonValueTensorOp` with
its operand, which results in all the `CopyToValueTensorOp` users
having their operand have type `!torch.vtensor`, which is invalid.
The correct way to do things is to first replace all the
`CopyToValueTensorOp`s with their operand, and then replace all uses
of the `CopyToNonValueTensorOp` with its operand.
This only started failing now because the generated accessor
`getOperand` for the `CopyToValueTensorOp` now returns a
`TypedValue<NonValueTensorType>`, which has an assert checking that
the value returned is of the expected type.
This is a minor variation on our other resnet18 examples swapping in
TorchDynamo.
We replicate the refbackend_torchdynamo_backend out of the e2e test
config to avoid making that appear like a public API.
Also, some minor cleanups to TorchDynamoTestConfig.
This test has been disabled a long time, and since RefBackend is so slow
we don't want to add this unnecessarily. I believe it is covered by
downstream testing such as the Shark Tank.
Thanks to TorchDynamo's great layering and design, this is only about
100 lines of code for a basic lockstep debugger.
This should allow us to deprecate eager_mode, since AFAIK the only
interesting use case that it was really supporting is for downstream users to
write lockstep debuggers.
NOTE: The exact reporting and interface here is subject to change. Please
try it out and provide feedback (or patches :) ).
- make_fx should not drop source locations: https://github.com/pytorch/pytorch/issues/90276
- Report tensors better (huge tensors should be summarized)
- Maybe don't abort, but just warn?
- Allow customizing atol/rtol.
- How best to print the failing node? And include surrounding graph
context?
This commit changes the `InsertRngGlobalsPass` to `TorchConversionToMLProgram`
pass. This commit also adds the `MLProgramBufferize` pass for the
bufferization of ml_program dialect ops to run on refbackend.
Signed-Off By: Vivek Khandelwal<vivek@nod-labs.com>
Summary of changes:
- Change ShapedType::kDynamicSize -> ShapedType::kDynamic
- llvm::NoneType has been deprecated, change convertScalarToDtype to use llvm::None
This commit replaces the LCG algorithm that was being used by the
`TorchToLinalg` lowering of `AtenUniformOp` to generate random numbers
with the `squares64` algorithm, for the LCG algorithm was producing
tensors that were highly correlated with one another.
Squares64 algorithm: https://arxiv.org/abs/2004.06278
Closes https://github.com/llvm/torch-mlir/issues/1608
Summary of changes:
- Replace call to `MemoryEffectOpInterface::hasNoEffect`
with `isMemoryEffectFree`.
- Make fix for the dynamic dims, since
`kDynamicSize` value changed to
`std::numeric_limits<int64_t>::min()` from `-1` in llvm
- `makeShapeLLVMCompatible` and `makeShapeTorchCompatible`
utilities convert shapes in order to remain consistent
with the Torch and MLIR semantics.
- Update tags
llvm: 147fe9de29dc13c14835127b35280c4d95c8e8ba
mhlo: 1944b5fa6062ec4c065d726c9c5d64f1487ee8c5
Signed-Off By: Vivek Khandelwal<vivek@nod-labs.com>
The current implementation sets the `nextSeed` value to `temp & 127`,
which is wrong. The last step of the LCG algorithm for the multiplier
and increment chosen should be `temp % 2^{64} = temp & (1 <<
63)`. However, because we are dealing with i64 values, the modulus
operation happens automatically, so it is not needed.
See Donald Knuth's values for LCG here:
https://en.wikipedia.org/wiki/Linear_congruential_generator
-- This commit fixes a bug in computeReductionType API.
-- The bug pertains to removal of `dim` from the `sizes` array.
Signed-off-by: Abhishek Varma <abhishek@nod-labs.com>
There are a few e2e tests that take several very large tensors as
input, which leads to the e2e test suite leaking too much
memory. Running things locally resulted in a total memory usage of
12.5 GB when running the suite sequentially on the refbackend.
Many of the tests that take large tensors don't actually need
such large tensors to pass, and some that take several large tensors
as input are just doing the same thing multiple times. This commit
reduces the size of some of the tensors and removes repetitive parts
of tests to reduce the memory usage to a total of 3 GB.
Set PyTorch and TorchVision version to nightly release 2022-11-22.
Add failing tests to the xfail set.
Signed-Off By: Vivek Khandelwal<vivek@nod-labs.com>
`np.bool is bool` and will never be returned as a dtype of an
`np.ndarray`, so we don't need to handle it here.
```
>>> a = np.ndarray([1], dtype=bool)
>>> a.dtype.type is np.bool_
True
```
More info here:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
-- This commit adds decompose logic for `aten._softmax` when
`half_to_float` is `True`.
-- An e2e test case will be added once support for half to float conversion for
`aten._softmax` is added upstream.
Signed-off-by: Abhishek Varma <abhishek@nod-labs.com>