Whether or not the PyTorch build is cached should not affect the success
of the Torch-MLIR build, but based on the existing code, a build may
fail if the `TM_PYTORCH_INSTALL_WITHOUT_REBUILD` variable was set but
the build cache doesn't exist.
Although that variable is set by CI upon a cache hit, nuances of
GitHub's caching behavior can create situations where the coupling
between `TM_PYTORCH_INSTALL_WITHOUT_REBUILD` and the cache lookup fails.
Specifically, a branch other than our default branch (`main`) may create
the cache entry, but because GitHub doesn't share this cache entry with
builds running on the `main` branch, the `main` branch build tries to
create its own cache entry. However, since cache identifiers are
unique and because caches are immutable, the caching step running in the
`main` branch appears to create an invalid cache entry (of 233 bytes,
instead of the expected ~60 MB).
Consequently, subsequent builds observe a cache "hit", since caches
created by the `main` branch are shared with all other branches, but
because this cache entry is invalid (since it doesn't actually contain
the ~60 MB PyTorch WHL file), the builds fail.
One workaround would be to let only the `main` branch create caches, but
in doing so, we would also prevent other branches from _reading_ the
cache, making the builds in those branches terribly slow.
So this patch uses a different workaround, which is to check whether the
PyTorch WHL file exists, even if the build observed a cache hit. If the
file doesn't exist, even if it was a purported cache hit, the code
builds PyTorch from source, which is probably intuitive.
A longer term fix will follow, after a discussion with the wider team.
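The check described above can be sketched as follows. This is a minimal illustration in Python, not the actual build_libtorch.sh logic; the function name, its arguments, and the `torch-*.whl` glob pattern are assumptions made for the example.

```python
from pathlib import Path


def should_build_from_source(wheelhouse: Path, install_without_rebuild: bool) -> bool:
    """Return True when PyTorch must be built from source.

    `install_without_rebuild` mirrors TM_PYTORCH_INSTALL_WITHOUT_REBUILD.
    The glob pattern is an assumption about how the WHL file is named.
    """
    if not install_without_rebuild:
        # No cache hit was reported, so a source build is needed anyway.
        return True
    # A reported cache hit is trusted only if a WHL file is really present.
    has_wheel = any(wheelhouse.glob("torch-*.whl"))
    return not has_wheel
```

With this shape, an invalid cache entry that restores no WHL file results in a source build instead of a failed install.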
Without this patch, CI logs contained the line:
-- Linker detection: GNU ld
GNU ld is notoriously slow at linking large binaries, so this patch
swaps GNU ld with the LLVM linker.
Since the linker invocation is driven through the compiler, perhaps the
best way to use the LLVM linker is to tell the compiler which linker
binary to use. This patch adds the `-fuse-ld=lld` flag to all Linux
builds of Torch-MLIR in CI to make it use lld.
* ci: cache PyTorch source builds
This patch reduces the time spent in regular CI builds by caching
PyTorch source builds. Specifically, this patch:
1. Makes CI look up the cache entry for the PyTorch commit hash in
pytorch-version.txt
2. If lookup was successful, CI fetches the previously-generated WHL
file into the build_tools/python/wheelhouse directory
3. CI sets the `TM_PYTORCH_INSTALL_WITHOUT_REBUILD` variable to `true`
4. The build_libtorch.sh script then uses the downloaded WHL file
instead of rebuilding PyTorch
* ci: warm up PyTorch source cache during daily RollPyTorch action
This patch makes the RollPyTorch action write the updated WHL file to
the cache, so that it can be later retrieved by CI that runs for each
PR. We deliberately add the caching step to the end of the action since
the RollPyTorch action never needs to read from the cache, although
executing this step earlier in the process should not cause problems
either.
We originally added these to help bring up more complex models with
heavier dependencies. However, over time it has become clear that these
models usually require more than just heavier dependencies -- they often
require a nontrivial amount of "one-off" code to extract the relevant
parts of the model and compile them. This is not a good fit for a
component in the core Torch-MLIR repo.
However, in the community, nod.ai has developed the ["Shark
Tank"](https://github.com/nod-ai/SHARK/tree/main/tank) which has all the
appropriate code to wrangle these models and organize them. We intend to
lean more heavily on that as a community and improve the symbiosis
there to serve the role that these heavydep tests were meant to play.
* build: disable LTC again so that we can bump PyTorch version
When built using PyTorch's master branch, the LTC code has been failing
to build for a few days. As a result, the PyTorch version referenced by
Torch-MLIR is stalled at the one from October 4th.
In an effort to advance the PyTorch version, this patch disables LTC, and
a subsequent patch will advance the PyTorch version.
* update PyTorch version to 1.14.0.dev20221010
Also disables the `UpSampleNearest2dDynamicFactor_basic` e2e test, since
the (PyTorch) oracle differs from the computed value for both the
refbackend and the eager_mode backends.
Instead of letting the auto-update script either fail because of script
errors or letting it commit bad versions, this patch makes the update
process manual, for now. Once the script stabilizes, I will re-enable
its periodic execution.
Updating the PyTorch version may break the Torch-MLIR build, as it did
recently, since the PyTorch update caused the shape library to change,
but the shape library was not updated in the commit for updating
PyTorch.
This patch introduces a new default-off environment variable to the
build_linux_packages.sh script called `TM_UPDATE_ODS_AND_SHAPE_LIB`
which instructs the script to run the update_torch_ods.sh and
update_shape_lib.sh scripts.
However, running these scripts requires an in-tree build and the tests
that run for an in-tree build of Torch-MLIR are more comprehensive than
those that run for an out-of-tree build, so this patch also swaps out
the out-of-tree build for an in-tree build.
Prior to this patch, the release process (`pip wheel`) retrieved
dependencies from the pyproject.toml file, which specified a version of
PyTorch that defaulted to the most recent nightly release. Instead, we
want the release process to use the same pinned PyTorch version as the
development build of PyTorch.
Since TOML files can't reference the pytorch-requirements.txt file, this
patch puts the dependencies from pyproject.toml into
whl-requirements.txt, which references pytorch-requirements.txt.
`git diff` does not work by default on untracked files. Since the
function `_check_file_not_changed_by` stores the new generated file in
an untracked file, `git diff` was not catching any modifications in
the new generated file. This commit adds the flag `--no-index` to make
`git diff` work with untracked files.
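As an illustration of the flag's effect, here is a small Python sketch of comparing a checked-in file against a freshly generated, untracked one. It is not the body of `_check_file_not_changed_by`; the helper names and arguments are hypothetical. The sketch relies on `--no-index` implying `--exit-code`, so exit status 0 means the files match and 1 means they differ.

```python
import subprocess
from pathlib import Path
from typing import List


def diff_command(checked_in: Path, generated: Path) -> List[str]:
    """Build a `git diff` invocation that compares two paths on disk.

    --no-index lets git compare files even when they are untracked,
    and --quiet suppresses the patch output so only the status matters.
    """
    return ["git", "diff", "--no-index", "--quiet", str(checked_in), str(generated)]


def files_differ(checked_in: Path, generated: Path) -> bool:
    """Run the comparison; requires `git` to be available on PATH."""
    result = subprocess.run(diff_command(checked_in, generated))
    if result.returncode not in (0, 1):
        raise RuntimeError(f"git diff failed with exit code {result.returncode}")
    return result.returncode == 1
```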
This patch fetches the most recent nightly (binary) build of PyTorch,
before pinning it in pytorch-requirements.txt, which is referenced in
the top-level requirements.txt file. This way, end users will continue
to be able to run `pip install -r requirements.txt` without worrying whether
doing so will break their Torch-MLIR build.
This patch also fetches the git commit hash that corresponds to the
nightly release, and this hash is passed to the out-of-tree build so
that it can build PyTorch from source.
If we were to sort the torch versions as numbers (in the usual
descending order), then 1.9 appears before 1.13. To fix this problem,
we use the `--version-sort` flag (along with `--reverse` for specifying
a descending order). We also filter out lines that don't contain
version numbers by only considering lines that start with a digit.
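The same idea can be shown in Python. This sketch is not the script's shell pipeline; the function names are hypothetical, and reducing a version string to its runs of digits is a simplification chosen for the example.

```python
import re
from typing import List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn "1.13" into (1, 13) so each component compares as an integer."""
    return tuple(int(part) for part in re.findall(r"[0-9]+", version))


def newest_first(lines: List[str]) -> List[str]:
    """Keep only lines that start with a digit and sort them newest first."""
    versions = [line.strip() for line in lines if re.match(r"[0-9]", line)]
    return sorted(versions, key=version_key, reverse=True)
```

Comparing tuples of integers places 1.13 ahead of 1.9, whereas comparing the strings or their decimal values places 1.9 first.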
As a matter of slight clarity, this patch renames the variable
`torch_from_src` to `torch_from_bin`, since that variable is initialized
to `TM_USE_PYTORCH_BINARY`.
Co-authored-by: powderluv <powderluv@users.noreply.github.com>
This adds a very long and obnoxious option to disable crashing tests.
The right fix here is to use the right multiprocessing techniques to
ensure that segfaulting tests can be XFAILed like normal tests, but we
currently don't know how to implement "catch a segfault" in Python
(patches or even just ideas welcome).
Motivated by #1361, where we ended up removing two tests from *all*
backends due to a failure in one backend, which is undesirable.
We added both `--ipc=host` and explicit ulimits. This _may_ be causing slowdowns on GHA. With the ulimit setting removed, all the CI tests still pass locally. `--ipc=host` is still required.
The new logic has the following benefits:
1. It does not clobber the working tree state. We expect testing to not
change the work tree.
2. It correctly handles the case where a user has changes to the
generated files, but hasn't checked them in yet (this happens
frequently when adding new ops).
Gets both CI and Release builds integrated in one workflow.
Mount ccache and pip cache as required for fast iterative builds
Current Release docker builds still run with root perms, fix it
in the future to run as the same user.
There may be some corner cases left especially when switching
build types etc.
Docker build TEST plan:
tl;dr:
Build everything: Releases (Python 3.8, 3.9, 3.10) and CIs.
TM_PACKAGES="torch-mlir out-of-tree in-tree"
2.57s user 2.49s system 0% cpu 30:33.11 total
Out of Tree + PyTorch binaries:
Fresh build (purged cache):
TM_PACKAGES="out-of-tree"
0.47s user 0.51s system 0% cpu 5:24.99 total
Incremental with ccache:
TM_PACKAGES="out-of-tree"
0.09s user 0.08s system 0% cpu 34.817 total
Out of Tree + PyTorch from source
Incremental
TM_PACKAGES="out-of-tree" TM_USE_PYTORCH_BINARY=OFF
1.58s user 1.81s system 2% cpu 1:59.61 total
In-Tree + PyTorch binaries:
Fresh build and tests: (purge ccache)
TM_PACKAGES="in-tree"
0.53s user 0.49s system 0% cpu 6:23.35 total
Fresh build, but with prior ccache
TM_PACKAGES="in-tree"
0.45s user 0.66s system 0% cpu 3:57.47 total
Incremental in-tree with all tests and regression tests
TM_PACKAGES="in-tree"
0.16s user 0.09s system 0% cpu 2:18.52 total
In-Tree + PyTorch from source
Fresh build and tests: (purge ccache)
TM_PACKAGES="in-tree" TM_USE_PYTORCH_BINARY=OFF
2.03s user 2.28s system 0% cpu 11:11.86 total
Fresh build, but with prior ccache
TM_PACKAGES="in-tree" TM_USE_PYTORCH_BINARY=OFF
1.58s user 1.88s system 1% cpu 4:53.15 total
Incremental in-tree with all tests and regression tests
TM_PACKAGES="in-tree" TM_USE_PYTORCH_BINARY=OFF
1.09s user 1.10s system 1% cpu 3:29.84 total
Incremental without tests
TM_PACKAGES="in-tree" TM_USE_PYTORCH_BINARY=OFF TM_SKIP_TESTS=ON
1.52s user 1.42s system 3% cpu 1:15.82 total
In-Tree + Out of Tree + PyTorch binaries:
TM_PACKAGES="out-of-tree in-tree"
0.25s user 0.18s system 0% cpu 3:01.91 total
To clear all artifacts:
rm -rf build build_oot llvm-build libtorch docker_venv
externals/pytorch/build
We use it for more than TorchScript testing now. This is a purely
mechanical change to adjust some file paths to remove "torchscript".
The most perceptible change here is that now e2e tests are run with
```
./tools/e2e_test.sh
```
instead of:
```
./tools/torchscript_e2e_test.sh
```
Bumps the shape library:
- Updates the function signature for aten.arange.start_step
- upstream_shape_functions.mean_dim -> upstream_shape_functions.sum_mean_dim
* Replace CHECK_EQ with TORCH_CHECK_EQ
* Check value of TORCH_MLIR_USE_INSTALLED_PYTORCH during LTC build
* Update LTC XFAIL with NewZerosModule ops
* Explicitly blacklist _like ops
* Automatically blacklist new_/_like ops
* Prune away unused Python dependencies from LTC
* Add flag to disable LTC
* Autogen dummy _REFERENCE_LAZY_BACKEND library when LTC is disabled
* Implement compute_shape_var
* Removed Var tests from XFAIL Set
* XFAIL tests using _local_scalar_dense or index.Tensor
* Add StdDim tests to XFAIL set
* Autogen aten::cat
* Changed Example MLIR backend to Reference MLIR backend
* Moved reference_ltc_backend into csrc
* Merged sys_utils.h
* Renamed reference_ltc_backend to reference_lazy_backend
* Addressed review comments
* Update docs with new library name
* Removed _REFERENCE_LAZY_BACKEND from .gitignore
* Added reference_lazy_backend to the TorchMLIRPythonModules dependency list
Fixed typo in `ltc_examples.md`
Missed instance where `ltc_backend` was used instead of `lazy_backend`.
* Update native function definitions
* Add ops to support bert lowering
- Add empty_strided and as_strided
- Restore zeros_like to op blacklist (Without this, tensors will be unintentionally created with a CPU device rather than lazy)
- Check for composite implicit ops and add device data IR
- Also fix codegen for functionalization
* Add autogen to CMakeList
* Remove PyTorch submodule
* Reduced BERT model size
* Print Mark Step status in Torch MLIR LTC debug string
* Apply fixes to work with latest upstream/main
- Pass importOptions into getMlirTypeFromTorchType during NodeImporter::importNode
Without this, the tensor type created may have a mismatched type as ImportOptions may cause vtensor to be used instead of tensor
* Update shape inference functions
- Fixed compute_shape_native_batch_norm when mean and var are uninitialized
Previously, the number of shapes returned would be <3 if either mean or var didn't exist. Instead, we now initialize them with a vector matching the number of channels.
- Implemented compute_shape_mul
- Fixed bug in reshape shape inference error message
* Get MLIR backend more consistent with TS backend
- Remove LazyNativeFunctions::_unsafe_view from autogen
- Blacklist ops to make JIT graph more like output of TS backend
- Print graph when SSA value has mismatch of types and results
- Remove normalize_index from LazyShapeInference
- Fix seeds for LTC example models
* Update and clean up shape inference functions
- Prune shape inference functions
- Add shape inference function for GenerateSlice
- Add shape inference function for GenerateCopy
Co-authored-by: Henry Tu <henry.tu@cerebras.net>
* Save InputOutputAliases to TorchMlirComputation
* Implement GetResultShape for TorchMlirLoweringContext
* Use optional return type for GetResultShape
* Remove support for aten::detach
With this op enabled, tensors were being copied, which resulted in incorrect aliasing.
* Add newline before printing I/O alias mapping
* Changed printout to use "Input param" as label instead of "Input"
* Remove shape inference function for aten::detach
* Moved implementation of SetUpAlias to MlirLoweringContext
As part of this change, TorchMlirComputation has been moved to the end of mlir_lowering_context.h so that it can access some new structs in TorchMlirLoweringContext
* Use updated PyTorch API
* Remove GetResultShape
Complements this upstream PyTorch PR: pytorch/pytorch#75828
This PR adds support for mapping input and output tensors which alias each other. (e.g. maps input weight tensor in parameter to the same tensor in output after a training iteration)
MLIR:
func @graph(%arg0: !torch.vtensor<[1,5],f32>, %arg1: !torch.vtensor<[1],si64>, ..., %arg6: !torch.vtensor<[10,5],f32>, %arg7: !torch.vtensor<[10],f32>, ...) {
...
return %arg0, %arg1, %17, %23, ... : !torch.vtensor<[1,5],f32>, !torch.vtensor<[1],si64>, !torch.vtensor<[10,5],f32>, !torch.vtensor<[10],f32>, ...
}
Input/Output Alias Mapping:
Output: 0 -> Input: 0
Output: 1 -> Input: 1
Output: 2 -> Input: 6
Output: 3 -> Input: 7
The aten::detach op has also been disabled in this PR to fix the issue of tensors not aliasing properly due to copying.