Until recently, we had to either risk feature branches creating PyTorch
build caches (which were unusable by the main branch or other parallel
feature branches because of GitHub's rules around sharing caches among
branches) or we had to limit the PyTorch build caches to only the main
branch, causing CI runs on feature branches to be terribly slow because
they had to rebuild PyTorch each time.
This patch enables the best of both worlds, by using a fork
(github.com/ashay/cache) of the GitHub's cache action, where the fork
adds an option (called `save`) which, when set, uploads a new cache
entry. We thus set this `save` flag only when we're building PyTorch
from source in Torch-MLIR's main branch, whereas all other builds set
this `save` flag to `false`.
The ability to conditionally update the cache has been an oft-requested
feature on the original (github.com/actions/cache) repository and
multiple unmerged PRs exist to allow conditional cache updates, so it is
likely that using the fork is only a temporary solution.
This patch is part of a larger set of improvements to the CI/build
system. In the code, we refer to the version as the string that
contains the release identifier such as 1.14.0.dev20221028, so calling
the file that contains the commit hash as pytorch-version.txt creates
confusion. For the sake of simplicity, this patch renames that file to
be pytorch-hash.txt.
If PyTorch build caches are created on a branch other than the main
branch, then GitHub does not share those caches with the main branch,
making every CI run that runs for each PR slow. This patch resolves the
problem by letting only the main branch create and use PyTorch build
caches.
* ci: cache PyTorch source builds
This patch reduces the time spent in regular CI builds by caching
PyTorch source builds. Specifically, this patch:
1. Makes CI lookup the cache entry for the PyTorch commit hash in
pytorch-version.txt
2. If lookup was successful, CI fetches the previously-generated WHL
file into the build_tools/python/wheelhouse directory
3. CI sets the `TM_PYTORCH_INSTALL_WITHOUT_REBUILD` variable to `true`
4. The build_libtorch.sh script then uses the downloaded WHL file
instead of rebuilding PyTorch
* ci: warm up PyTorch source cache during daily RollPyTorch action
This patch makes the RollPyTorch action write the updated WHL file to
the cache, so that it can be later retrieved by CI that runs for each
PR. We deliberately add the caching step to the end of the action since
the RollPyTorch action never needs to read from the cache, although
executing this step earlier in the process should not cause problems
either.
Instead of letting the auto-update script either fail because of script
errors or letting it commit bad versions, this patch makes the update
process manual, for now. Once the script stabilizes, I will its
re-enable periodic execution.
Updating the PyTorch version may break the Torch-MLIR build, as it did
recently, since the PyTorch update caused the shape library to change,
but the shape library was not updated in the commit for updating
PyTorch.
This patch introduces a new default-off environment variable to the
build_linux_packages.sh script called `TM_UPDATE_ODS_AND_SHAPE_LIB`
which instructs the script to run the update_torch_ods.sh and
update_shape_lib.sh scripts.
However, running these scripts requires an in-tree build and the tests
that run for an in-tree build of Torch-MLIR are more comprehensive than
those that run for an out-of-tree build, so this patch also swaps out
the out-of-tree build for an in-tree build.
A bug in the CI script caused the entire script to fail if the exit code
of the command for comparing with the existing hash returned a non-zero
exit status. The non-zero exit status for this comparison does not
imply failed execution, since it only indicates that the hash has
changed.
* build: push directly from CI to main branch
This avoids the need to create, approve, and merge a separate PR, in
addition to avoiding unnecessary CI runs for the PyTorch version update.
* build: schedule cronjob to run RollPyTorch action
This patch schedules the RollPyTorch action to be run at noon UTC, which
roughly corresponds to 4am Pacific Time. We pick this time since the
commit for PyTorch nightly releases are picked just after midnight
Pacific Time and the nightly release artifacts are produced in about 2
to 3 hours after the commit is picked.
* Update buildRelease.yml
Update Releases right after a Release build.
* Move gh-page update after release builds
This removes the periodic update and updates after a release build.
This patch fetches the most recent nightly (binary) build of PyTorch,
before pinning it in pytorch-requirements.txt, which is referenced in
the top-level requirements.txt file. This way, end users will continue
to be able to run `pip -r requirements.txt` without worrying whether
doing so will break their Torch-MLIR build.
This patch also fetches the git commit hash that corresponds to the
nightly release, and this hash is passed to the out-of-tree build so
that it can build PyTorch from source.
If we were to sort the torch versions as numbers (in the usual
descending order), then 1.9 appears before 1.13. To fix this problem,
we use the `--version-sort` flag (along with `--reverse` for specifying
a descending order). We also filter out lines that don't contain
version numbers by only considering lines that start with a digit.
As a matter of slight clarity, this patch renames the variable
`torch_from_src` to `torch_from_bin`, since that variable is initialized
to `TM_USE_PYTORCH_BINARY`.
Co-authored-by: powderluv <powderluv@users.noreply.github.com>
* Move CIs to use docker builds
Now that #1234 has landed and anyone can run CI / Release builds locally move GHA to use the same flow.
* update names
* Update comments
We use it for more than TorchScript testing now. This is a purely
mechanical change to adjust some file paths to remove "torchscript".
The most perceptible change here is that now e2e tests are run with
```
./tools/e2e_test.sh
instead of:
./tools/torchscript_e2e_test.sh
```
* Disable LTC by default until upstream revert relands
Tracked with the WIP https://github.com/llvm/torch-mlir/pull/1292
* Disable LTC e2e tests temporarily
* Update setup.py
Disable LTC in setup.py temporarily until upstream is fixed.
This fixes a seeding issue with the [previous PR](https://github.com/llvm/torch-mlir/pull/1240) where bazel build's GHA cache is not present to begin with and one of the commands (chown) fails on it. Should get the Bazel build back to green.
This PR adds:
- A minimal docker wrapper to the bazel GHA workflow to make it reproducible locally
- Bazel cache to speed up GHA workflows (down to ~5 minutes from ~40+minutes)
This is a no-op for non-bazel workflows and an incremental improvement.
When we renamed the directory containing submodules from `external` to
`externals`, we accidentally left the original name in the Github
workflow. This patch fixes the problem.
My earlier[ PR](https://github.com/llvm/torch-mlir/pull/1213) had (among other things) decoupled ubuntu and macos builds into separate matrix runs. This is not working well due to limited number of MacOS GHA VMs causing long queue times and backlog. There are two reasons causing this backlog:
1. macos arm64 builds with pytorch source are getting erratically cancelled due to resource / network constraints. This is addressed with this: https://github.com/llvm/torch-mlir/pull/1215
> "macos-arm64 (in-tree, OFF) The hosted runner: GitHub Actions 3 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error."
2. macos runs don't fail-fast when ubuntu runs fail due to being in separate matrix setups. This PR couples them again.
* mac m1 cross compile
Add support for M1 cross compile
* Remove redundant ExecutionEngine
It is registered as part of RegisterEverything
* nuke non-universal zstd
disable LTC
At the moment we don't gate torch-mlir PRs with bazel builds. This means bazel builds don't get run on open PRs, and so there's no good way to validate a fix PR which is meant to fix a broken bazel build. This option allows a bazel build to be manually triggered as needed on open PRs.
* Replace CHECK_EQ with TORCH_CHECK_EQ
* Check value of TORCH_MLIR_USE_INSTALLED_PYTORCH during LTC build
* Update LTC XFAIL with NewZerosModule ops
* Explicitly blacklist _like ops
* Automatically blacklist new_/_like ops
* Prune away unused Python dependencies from LTC
* Add flag to disable LTC
* Autogen dummy _REFERENCE_LAZY_BACKEND library when LTC is disabled
* Implement compute_shape_var
* Removed Var tests from XFAIL Set
* XFAIL tests using _local_scalar_dense or index.Tensor
* Add StdDim tests to XFAIL set
* Autogen aten::cat
* Added e2e LTC Torch MLIR tests
* Fix seed for reproducability
* Check if computation is None before getting debug string
* Updated unit tests, and added numeric tests
* Print name of the model layer that fails numeric validation
* Run LTC e2e test with CI/CD
* Set seed in main function, instead of beginning of execution
* Add comment to specify number of digits of precision
* Fixed typo
* Remove tests for LTC example models
* Added LTC option to torchscript e2e
* Implement compile and run for LTC e2e test
* xfail all tests that use ops that aren't currently supported
* Update buildAndTest.yml
test with fast-fail matrix builds
* Remove redundant and statement
* Downgrade to 20.04
Until upstream PyTorch FBGEMM is fixed to compile with clang+14+ https://github.com/pytorch/pytorch/pull/82396
* Update buildAndTest.yml
run tests on only the binary config.
This enables building Pytorch from source in the CI.
The build should mostly hit the ccache.
Release builds will follow once we have some runtime on the CI.