Since PRs created by the GitHub action bot cannot trigger workflows (and
thus build tests), this patch uses the token for a GitHub app that was
specifically created for the RollPyTorch action.
We previously used a fork of the actions/cache repository for the PyTorch
cache since the actions/cache repo did not support read-only caches.
Now that actions/cache supports separate read and write steps, this
patch switches back to the actions/cache repo.
This patch, by itself, doesn't fix caching on Windows, but once a new
release of ccache is available, caching for Windows builds should start
working again (validated by building ccache from source and using it
with LLVM builds).
Ccache rejects caching when either the `/Zi` or `/ZI` flags are used
during compilation on Windows, since these flags tell the compiler to
embed debug information in a PDB file (separate from the object file
produced by the compiler). In particular, our CI builds add the `/Zi`
flag, making ccache mark these compiler invocations as uncacheable.
But what caused our CI to add debug flags, especially when we specified
`-DCMAKE_BUILD_TYPE=Release`? On Windows, unless we specify the
`--config Release` flag during the CMake build step, CMake assumes a
debug build. So all this while, we had been producing debug builds of
torch-mlir for every PR! No wonder it took so long to build the Windows
binaries.
The reason for having to specify the configuration during the _build_
step (as opposed to the _configure_ step) of CMake on Windows is that
CMake's Visual Studio generators will produce _both_ Release and Debug
profiles during the CMake configure step (thus requiring a build-time
value that tells CMake whether to build in Release or Debug mode).
Luckily, on Linux and macOS, the `--config` flag seems to be simply
ignored, instead of causing build errors.
Strangely, based on cursory tests, it seems like on Windows we need to
specify the Release configuration both as `-DCMAKE_BUILD_TYPE=Release`
and as `--config Release`. Dropping either made my build switch to a
Debug configuration.
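
As a rough illustration of the two places where the configuration is passed, the following Python sketch assembles the configure and build command lines. The helper name `cmake_commands` and the directory arguments are hypothetical and are not taken from our CI scripts; only the `-DCMAKE_BUILD_TYPE` and `--config` flags come from the description above.

```python
from typing import List


def cmake_commands(source_dir: str, build_dir: str,
                   config: str = "Release") -> List[List[str]]:
    """Return the configure and build commands with the config set in both.

    Hypothetical helper for illustration only.
    """
    # Configure step: single-config generators (e.g. Ninja) read this value.
    configure = [
        "cmake", "-S", source_dir, "-B", build_dir,
        f"-DCMAKE_BUILD_TYPE={config}",
    ]
    # Build step: multi-config generators (e.g. Visual Studio) read this value.
    build = ["cmake", "--build", build_dir, "--config", config]
    return [configure, build]


if __name__ == "__main__":
    for command in cmake_commands(".", "build"):
        print(" ".join(command))
```

Passing the value in both steps keeps the same invocation usable on every platform, which matches the observation that `--config` appears to be ignored on Linux and macOS.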
Additionally, there is a bug in ccache v4.8 (although this is addressed
in trunk) that causes ccache to reject caching if the compiler
invocation includes any flag that starts with `/Z`, including `/Zc`,
which is added by LLVM's HandleLLVMOptions.cmake and which isn't related
to debug info or PDB files. The next release of ccache should include
the fix, which is to reject caching only for `/Zi` and `/ZI` flags and
not all flags that start with `/Z`.
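
The difference between the two behaviours can be modelled in a few lines of Python. This is only a sketch of the rule as described above, not ccache's actual implementation; both function names are made up for the illustration.

```python
def blocks_caching_v48(flag: str) -> bool:
    """Model of the over-broad check: any flag starting with /Z is rejected."""
    return flag.startswith("/Z")


def blocks_caching_fixed(flag: str) -> bool:
    """Model of the fixed check: only the PDB-producing flags are rejected."""
    return flag in ("/Zi", "/ZI")


if __name__ == "__main__":
    for flag in ("/Zi", "/ZI", "/Zc:inline", "/O2"):
        print(flag, blocks_caching_v48(flag), blocks_caching_fixed(flag))
```

Under this model, a `/Zc:...` flag makes the invocation uncacheable with the old rule but not with the fixed rule, while `/Zi` and `/ZI` are rejected by both.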
As a side note, debugging this problem was possible because of ccache's
log file, which is enabled by: `ccache --set-config="log_file=log.txt"`.
The GitHub action for creating the PR expects that either the changes
are not committed (in which case it commits them with the specified
commit message) or that the commit exists but that it is also pushed to
remote.
Prior to this patch, we created the commit but did not push it to
remote, causing failures. This patch leaves the changes uncommitted so
that they're committed and pushed to remote as part of the PR creation.
Currently, we run just the Linux in-tree tests when the RollPyTorch
workflow runs, but this is insufficient since WHL files for macOS or
Windows are sometimes not uploaded by PyTorch, causing the RollPyTorch
action to pass but all subsequent torch-mlir CI tests to fail because of
the broken build.
The easiest way to validate the RollPyTorch action on all platforms is
to run the standard set of tests that we run for each submitted PR, so
this patch makes the RollPyTorch action submit a PR instead of
committing the changes to the main branch directly. The PR is assigned
to a handful of folks for review, although this can be changed in the
future.
Despite using sudo to delete the workspace directory, we still
occasionally run into checkout errors. This patch thus drops the
deletion of the workspace prior to checkout. It also restricts the
number of parallel jobs in the submodule fetch step to just one, to try
and resolve the checkout issue ("index.lock: File exists.").
We have recently started seeing errors like:
```
Synchronizing submodule url for 'externals/llvm-project'
Synchronizing submodule url for 'externals/mlir-hlo'
/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
Error: fatal: Unable to create '/home/anush/actions-runner/_work/torch-mlir/torch-mlir/.git/modules/externals/llvm-project/index.lock': File exists.
```
As a workaround, this patch removes the workspace directory before the
checkout step.
The RollPyTorch action needs the `unzip` command to peek into WHL files
for fetching metadata. This patch makes sure that the command is
installed before referencing it.
We want the pip packages required for building torch-mlir to be
included in the dependencies of torch-mlir, but we don't want the pip
packages required for _testing_ torch-mlir to be included among the
dependencies. To be able to specify and install one set of
dependencies and not the other, this patch separates the pip packages
into two files: build-requirements.txt and test-requirements.txt.
This patch also updates references to the requirements.txt file so that
CI builds that run end-to-end tests install test-related pip
dependencies while everything else (including WHL builds) sticks to just
the build-related pip dependencies.
Despite this change, this patch should not affect a torch-mlir
developer's workflow. More precisely, since this patch makes the
top-level requirements.txt file refer to both build-requirements.txt and
test-requirements.txt files, a torch-mlir developer should be able to
continue referring to the requirements.txt file without any impact.
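
To illustrate why the developer workflow is unchanged, here is a small Python sketch that flattens a requirements file that pulls in other files. It assumes the top-level file uses pip's `-r <file>` include syntax, which is an assumption about how the reference is written; the comment handling is deliberately simplistic (it would mangle URLs that contain `#`).

```python
from pathlib import Path
from typing import List


def expand_requirements(path: Path) -> List[str]:
    """Return the requirement lines of `path`, following `-r` includes."""
    packages: List[str] = []
    for raw in path.read_text().splitlines():
        # Drop trailing comments and surrounding whitespace (simplified).
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("-r "):
            # Included files are resolved relative to the including file.
            included = path.parent / line[3:].strip()
            packages.extend(expand_requirements(included))
        else:
            packages.append(line)
    return packages
```

With a requirements.txt that contains `-r build-requirements.txt` and `-r test-requirements.txt`, the function returns the union of both lists, whereas expanding build-requirements.txt alone returns only the build dependencies.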
This patch replaces all MHLO operations with their StableHLO
counterparts and adds a validation pass to ensure that no MHLO operations
remain before translating all StableHLO operations to the MHLO dialect
for further lowering to the Linalg dialect.
This patch also updates all lit tests so that they refer to the
`convert-torch-to-stablehlo` pass and so that they check for StableHLO
operations.
- Use v3 of actions/checkout, since the version we use (v2) uses
Node.js 12, which is deprecated by GitHub.
- Source the PowerShell venv script (instead of the bash script) since
the calling script is a PowerShell script. Without this, the build
doesn't use venv at all.
- Make the build dependencies in whl-requirements.txt (used by
setup.py) match those in requirements.txt. To that end, this patch
creates a build-requirements.txt that is referenced by
requirements.txt and whl-requirements.txt.
Now that the RollPyTorch tracker issue exists, we can automate the job
of notifying folks of failures instead of having to do it manually.
This patch adds a step to the workflow to post such a message.
There appear to be two problems with the caching layer in our CI runs:
(a) the sizes of some of the caches have grown to multiples of the
300 MB limit and (b) caching on Windows seems to provide little to no
benefit.
To help understand the reasons for these problems, this patch adds a
line item to the list of steps run in CI to dump the ccache
configuration and statistics just prior to uploading the cache artifact.
The RollPyTorch action often takes more than 1.5 hours to finish.
During this time, if another PR is merged, then the RollPyTorch action
needs to first pull the merged changes before committing the updates to
the PyTorch commit hash and version files. This patch adds the required
`git pull` statement, without which, the subsequent `git push` statement
fails, causing the RollPyTorch action to fail as well.
* [custom op] Generalize shape library logic to work with dtypes
This commit generalizes the shape library logic, so that dtype rules
for ops can also be expressed using the same mechanism. In other
words, each op can now have a shape function and a dtype function
specified in Python that is imported during lowering to calculate the
shapes and dtypes throughout a program. For more information about how
to specify a dtype function, see the updated
`docs/adding_a_shape_and_dtype_function.md`.
For those not familiar with how the shape library works, the file
`docs/calculations_lib.md` provides an overview.
Until recently, the metadata file in the torchvision package included
the nightly version of the torch package, but since that is no longer
the case, our RollPyTorch workflow is broken.
As a workaround, this patch uses the `pip download` command's ability to
fetch the dependent torch package for the specified version of
torchvision, before peeking into the WHL file for the torch package to
determine the release version and the commit hash.
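
The "peek into the WHL file" step can be sketched in Python as follows, since a WHL file is a zip archive. The sketch assumes that the release version is in the `Version:` field of the wheel's `*.dist-info/METADATA` file and that the commit hash is stored as `git_version` in `torch/version.py`; both locations are assumptions for this illustration, and the function name is hypothetical. The actual workflow uses `unzip` from a shell script rather than this code.

```python
import re
import zipfile
from typing import Optional, Tuple


def peek_torch_wheel(whl_path: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (release version, commit hash) found inside a torch wheel."""
    version: Optional[str] = None
    git_hash: Optional[str] = None
    with zipfile.ZipFile(whl_path) as whl:
        for name in whl.namelist():
            if name.endswith(".dist-info/METADATA"):
                text = whl.read(name).decode("utf-8")
                # Anchor at line start so "Metadata-Version:" is not matched.
                match = re.search(r"^Version:\s*(\S+)", text, re.MULTILINE)
                if match:
                    version = match.group(1)
            elif name == "torch/version.py":
                text = whl.read(name).decode("utf-8")
                match = re.search(
                    r"git_version\s*=\s*['\"]([0-9a-f]+)['\"]", text)
                if match:
                    git_hash = match.group(1)
    return version, git_hash
```

Either element of the result is `None` when the corresponding file or field is missing, so a caller can fail loudly instead of pinning an empty value.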
The upload timestamp of the nightly torchvision package has drifted
beyond the scheduled time of the RollPyTorch action because of the time
change due to daylight saving. As a result, the RollPyTorch action now
picks the torchvision package from a day earlier instead of the most
recent package.
This patch schedules the RollPyTorch action to start one hour later than
before so that it continues to pick the most recent nightly package.
Bazel LIT test support was added in https://github.com/llvm/torch-mlir/pull/1585. This PR enables the tests in CI.
```
INFO: Build completed successfully, 254 total actions
@torch-mlir//test/Conversion:TorchToArith/basic.mlir.test PASSED in 0.3s
@torch-mlir//test/Conversion:TorchToLinalg/basic.mlir.test PASSED in 0.5s
@torch-mlir//test/Conversion:TorchToLinalg/elementwise.mlir.test PASSED in 0.3s
@torch-mlir//test/Conversion:TorchToLinalg/flatten.mlir.test PASSED in 0.3s
@torch-mlir//test/Conversion:TorchToLinalg/pooling.mlir.test PASSED in 0.3s
@torch-mlir//test/Conversion:TorchToLinalg/unsqueeze.mlir.test PASSED in 0.2s
@torch-mlir//test/Conversion:TorchToLinalg/view.mlir.test PASSED in 0.3s
@torch-mlir//test/Conversion:TorchToMhlo/basic.mlir.test PASSED in 0.5s
@torch-mlir//test/Conversion:TorchToMhlo/elementwise.mlir.test PASSED in 0.9s
@torch-mlir//test/Conversion:TorchToMhlo/gather.mlir.test PASSED in 0.3s
@torch-mlir//test/Conversion:TorchToMhlo/linear.mlir.test PASSED in 0.6s
@torch-mlir//test/Conversion:TorchToMhlo/pooling.mlir.test PASSED in 0.3s
@torch-mlir//test/Conversion:TorchToMhlo/reduction.mlir.test PASSED in 0.4s
@torch-mlir//test/Conversion:TorchToMhlo/view_like.mlir.test PASSED in 0.6s
@torch-mlir//test/Conversion:TorchToSCF/basic.mlir.test PASSED in 0.2s
@torch-mlir//test/Conversion:TorchToTosa/basic.mlir.test PASSED in 1.1s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/basic.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/error.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/free-functions.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/initializers.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/methods.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/module-uses-error.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/module-uses.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/multiple-instances-error.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/multiple-instances-multiple-module-args.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/multiple-instances.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/submodules.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:Torch/GlobalizeObjectGraph/visibility.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/adjust-calling-conventions.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/canonicalize.mlir.test PASSED in 0.4s
@torch-mlir//test/Dialect:Torch/decompose-complex-ops-legal.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/decompose-complex-ops.mlir.test PASSED in 0.9s
@torch-mlir//test/Dialect:Torch/drop-shape-calculations.mlir.test PASSED in 0.4s
@torch-mlir//test/Dialect:Torch/erase-module-initializer.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/inline-global-slots-analysis.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:Torch/inline-global-slots-transform.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/invalid.mlir.test PASSED in 0.4s
@torch-mlir//test/Dialect:Torch/lower-to-backend-contract-error.mlir.test PASSED in 17.3s
@torch-mlir//test/Dialect:Torch/maximize-value-semantics.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:Torch/ops.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:Torch/prepare-for-globalize-object-graph.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/promote-types.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:Torch/reduce-op-variants-error.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/reduce-op-variants.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/refine-public-return.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:Torch/refine-types-branch.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:Torch/refine-types-ops.mlir.test PASSED in 0.6s
@torch-mlir//test/Dialect:Torch/refine-types.mlir.test PASSED in 0.4s
@torch-mlir//test/Dialect:Torch/reify-shape-calculations.mlir.test PASSED in 2.9s
@torch-mlir//test/Dialect:Torch/simplify-shape-calculations.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:Torch/torch-function-to-torch-backend-pipeline.mlir.test PASSED in 0.6s
@torch-mlir//test/Dialect:TorchConversion/canonicalize.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:TorchConversion/finalizing-backend-type-conversion.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:TorchConversion/func-backend-type-conversion.mlir.test PASSED in 0.2s
@torch-mlir//test/Dialect:TorchConversion/ops.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:TorchConversion/verify-linalg-on-tensors-backend-contract.mlir.test PASSED in 0.3s
@torch-mlir//test/Dialect:TorchConversion/verify-tosa-backend-contract.mlir.test PASSED in 0.2s
@torch-mlir//test/RefBackend:insert-rng-globals.mlir.test PASSED in 0.2s
INFO: Build completed successfully, 254 total actions
@torch-mlir//test/RefBackend:munge-calling-conventions.mlir.test PASSED in 0.2s
Executed 59 out of 59 tests: 59 tests pass.
```
GHA workflow: https://github.com/sjain-stanford/torch-mlir/actions/runs/3476816449/jobs/5812368489
We currently pin the `torch` package to the latest nightly version, but
since `torchvision` depends on the `torch` package, the pip resolver
then has to run through an extensive list of `torchvision` packages that
can be installed with the pinned `torch` package. This search fails in
the RollPyTorch action, causing pip to settle on an old version of
`torchvision` that does not work with our tests. In reality, we are
only interested in a specific version of the `torchvision` package.
To make the dependency explicit and to prevent test failures because of
incorrect package installations, this patch makes two key changes:
1. `torchvision` is now pinned to the latest nightly release in
pytorch-requirements.txt along with the version of `torch` that is
necessary to install the requested `torchvision` package
2. The RollPyTorch action now looks for the latest `torchvision` package
instead of the latest `torch` package before writing the version
numbers for pinning in pytorch-requirements.txt
This patch makes a few small, but key, changes to enable ccache on
Windows. First, it replaces the hendrikmuhs/ccache-action action with
command line invocations to the ccache binary, since the action has two
bugs, one of which causes CI to refer to different ccache artifacts
before versus after the build on Windows whereas the other bug can
sometimes cause the action to incorrectly infer that the cache is empty.
Second, this patch slightly alters the cache key, so that our old cache
artifacts, which have grown too big, are eventually discarded in favor
of the new, smaller cache artifacts. Along the way, this patch also
keeps the RollPyTorch action's cache artifact separate from the regular build's
cache artifact so as to keep these artifacts small, and also because the
RollPyTorch action is off the critical path for most contributors.
Finally, this patch makes small changes to the CMake file so that on
Windows, the ccache binary is added as a prefix, as recommended on the
[ccache Wiki](https://github.com/ccache/ccache/wiki/MS-Visual-Studio).
* ci: update versions of external actions
Node.js 12 actions are deprecated and will eventually go away, so this
patch bumps the old actions to their latest versions that use Node.js
16.
* ci: replace deprecated action with bash commands
The llvm/actions/install-ninja action uses Node.js 12, which is
deprecated. Since that action is not updated to work with Node.js 16,
this patch replaces that action with equivalent bash commands to install
Ninja.
* ci: use smaller ccache artifacts to reduce evictions
Over time, our ccache sizes have grown quite large (some as large as
1.3 GB), which results in us routinely exceeding GitHub's limits, thus
triggering frequent cache evictions. As a result, cache downloads and
uploads take unnecessarily long, and fewer cache entries are
available.
Based on experiments on a clean cache state, it appears that we need
less than 300 MB of (compressed) ccache artifacts for each build type.
Anything larger than that will accrue changes from the past that aren't
needed.
To alleviate the cache burden, this patch sets the maximum ccache size
to be 300 MB. This change should not affect the success or failure of
our builds. I will monitor the build times to check whether this change
causes any performance degradation.
* ci: use consistent platform identifiers
Prior to this patch, some of our builds ran on `ubuntu-latest`, while
some others ran on `ubuntu-20.04` and others ran on `ubuntu-22.04`, with
similar situations for macOS and Windows. This patch instead sets all
Linux builds to run on `ubuntu-latest`, all macOS builds to run on
`macos-latest`, and all Windows builds to run on `windows-latest`, to
make debugging future CI failures a little easier.
Until recently, we had to either risk feature branches creating PyTorch
build caches (which were unusable by the main branch or other parallel
feature branches because of GitHub's rules around sharing caches among
branches) or we had to limit the PyTorch build caches to only the main
branch, causing CI runs on feature branches to be terribly slow because
they had to rebuild PyTorch each time.
This patch enables the best of both worlds, by using a fork
(github.com/ashay/cache) of GitHub's cache action, where the fork
adds an option (called `save`) which, when set, uploads a new cache
entry. We thus set this `save` flag only when we're building PyTorch
from source in Torch-MLIR's main branch, whereas all other builds set
this `save` flag to `false`.
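
The condition for setting the `save` flag can be written down as a one-line predicate. The Python sketch below is only a model of the rule stated above; the function name is hypothetical, and it assumes the branch is identified by a Git ref of the form `refs/heads/main`.

```python
def should_save_cache(git_ref: str, built_from_source: bool) -> bool:
    """Upload a new cache entry only for source builds on the main branch."""
    return built_from_source and git_ref == "refs/heads/main"


if __name__ == "__main__":
    print(should_save_cache("refs/heads/main", True))       # main, source build
    print(should_save_cache("refs/heads/my-feature", True))  # feature branch
```

Feature branches therefore still read the cache created on the main branch but never upload entries of their own.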
The ability to conditionally update the cache has been an oft-requested
feature on the original (github.com/actions/cache) repository and
multiple unmerged PRs exist to allow conditional cache updates, so it is
likely that using the fork is only a temporary solution.
This patch is part of a larger set of improvements to the CI/build
system. In the code, we refer to the version as the string that
contains the release identifier such as 1.14.0.dev20221028, so calling
the file that contains the commit hash pytorch-version.txt creates
confusion. For the sake of simplicity, this patch renames that file to
pytorch-hash.txt.
If PyTorch build caches are created on a branch other than the main
branch, then GitHub does not share those caches with the main branch,
making the CI run for each PR slow. This patch resolves the
problem by letting only the main branch create and use PyTorch build
caches.
* ci: cache PyTorch source builds
This patch reduces the time spent in regular CI builds by caching
PyTorch source builds. Specifically, this patch:
1. Makes CI look up the cache entry for the PyTorch commit hash in
pytorch-version.txt
2. If the lookup succeeds, CI fetches the previously generated WHL
file into the build_tools/python/wheelhouse directory
3. CI sets the `TM_PYTORCH_INSTALL_WITHOUT_REBUILD` variable to `true`
4. The build_libtorch.sh script then uses the downloaded WHL file
instead of rebuilding PyTorch
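
The steps above can be sketched in Python as follows. This is a simplified model, not the workflow itself: the cache is represented by a plain dictionary keyed by commit hash, the file name `torch-cached.whl` and the function name are hypothetical, and setting the variable to `false` on a cache miss is an assumption.

```python
from pathlib import Path
from typing import Dict


def plan_pytorch_install(hash_file: Path, cache: Dict[str, bytes],
                         wheelhouse: Path) -> Dict[str, str]:
    """Return the environment to export for the PyTorch build step."""
    # Step 1: the cache key is the commit hash stored in the hash file.
    commit = hash_file.read_text().strip()
    wheel = cache.get(commit)
    if wheel is None:
        # Cache miss: the build script has to rebuild PyTorch from source.
        return {"TM_PYTORCH_INSTALL_WITHOUT_REBUILD": "false"}
    # Step 2: place the cached WHL file in the wheelhouse directory.
    wheelhouse.mkdir(parents=True, exist_ok=True)
    (wheelhouse / "torch-cached.whl").write_bytes(wheel)
    # Step 3: tell the build script to reuse the WHL file (step 4).
    return {"TM_PYTORCH_INSTALL_WITHOUT_REBUILD": "true"}
```

Keying the cache on the commit hash means that a PyTorch update naturally produces a miss, so a stale WHL file is never reused for a newer commit.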
* ci: warm up PyTorch source cache during daily RollPyTorch action
This patch makes the RollPyTorch action write the updated WHL file to
the cache, so that it can be later retrieved by CI that runs for each
PR. We deliberately add the caching step to the end of the action since
the RollPyTorch action never needs to read from the cache, although
executing this step earlier in the process should not cause problems
either.
Instead of letting the auto-update script either fail because of script
errors or letting it commit bad versions, this patch makes the update
process manual, for now. Once the script stabilizes, I will re-enable
its periodic execution.
Updating the PyTorch version may break the Torch-MLIR build, as it did
recently, since the PyTorch update caused the shape library to change,
but the shape library was not updated in the commit for updating
PyTorch.
This patch introduces a new default-off environment variable to the
build_linux_packages.sh script called `TM_UPDATE_ODS_AND_SHAPE_LIB`
which instructs the script to run the update_torch_ods.sh and
update_shape_lib.sh scripts.
However, running these scripts requires an in-tree build and the tests
that run for an in-tree build of Torch-MLIR are more comprehensive than
those that run for an out-of-tree build, so this patch also swaps out
the out-of-tree build for an in-tree build.