# yamllint disable rule:line-length
name: Build and Test
on:
  # pull_request:
  #   branches: [main]
  # push:
  #   branches: [main]
  # workflow_dispatch:
# Ensure that only a single job or workflow using the same
# concurrency group will run at a time. This would cancel
# any in-progress jobs in the same github workflow and github
# ref (e.g. refs/heads/main or refs/pull/<pr_number>/merge).
concurrency:
  # A PR number if a pull request and otherwise the commit hash. This cancels
  # queued and in-progress runs for the same PR (presubmit) or commit
  # (postsubmit). The workflow name is prepended to avoid conflicts between
  # different workflows.
  group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
  cancel-in-progress: true
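  # For example (illustrative): on a hypothetical pull request #1234 the group
  # above resolves to "Build and Test-1234", while on a push it resolves to
  # "Build and Test-<commit sha>", so pushing a new commit to the same PR
  # cancels the earlier presubmit run.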
# Provisioned Jobs:
# ubuntu/docker - x86_64 - llvm in-tree - pytorch binary - build+test # most used dev flow and fastest signal
# ubuntu/docker - x86_64 - llvm out-of-tree - pytorch source - build+test # most elaborate build
# macos - arm64 - llvm in-tree - pytorch binary - build only # cross compile, can't test arm64
jobs:
  build-test:
    strategy:
      fail-fast: true
      matrix:
        os-arch: [macos-arm64, windows-x86_64]
        llvm-build: [in-tree, out-of-tree]
        torch-binary: [ON]
        torch-version: [nightly, stable]
        exclude:
          # Exclude llvm out-of-tree and pytorch stable (to save resources)
          - llvm-build: out-of-tree
            torch-version: stable
          # Exclude macos-arm64 and llvm out-of-tree altogether
          - os-arch: macos-arm64
            llvm-build: out-of-tree
          - os-arch: macos-arm64
            torch-version: stable
          - os-arch: windows-x86_64
            llvm-build: out-of-tree
          - os-arch: windows-x86_64
            torch-version: stable
        include:
          # Specify OS versions
          - os-arch: ubuntu-x86_64
            os: a100
          - os-arch: macos-arm64
            os: macos-latest
          - os-arch: windows-x86_64
            os: windows-latest
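          # Net effect (approximate): with the excludes above, the cross
          # product reduces to in-tree + nightly builds on macos-arm64 and
          # windows-x86_64; the include entries attach a concrete runner
          # label (`os`) to each os-arch, and the ubuntu-x86_64 entry adds
          # the Linux configuration that targets the `a100` runner label.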
    runs-on: ${{ matrix.os }}
    steps:
      - name: Prepare workspace
        if: ${{ matrix.os-arch == 'ubuntu-x86_64' }}
        run: |
          # Clear the workspace directory so that we don't run into errors about
          # existing lock files.
          sudo rm -rf $GITHUB_WORKSPACE/*
      - name: Checkout torch-mlir
        uses: actions/checkout@v3
        with:
          submodules: 'true'
          fetch-depth: 0
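      # Note: `submodules: 'true'` also checks out vendored dependencies such
      # as externals/llvm-project, and `fetch-depth: 0` fetches full history
      # rather than a shallow clone.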
      - name: Fetch PyTorch commit hash
        if: ${{ matrix.os-arch != 'windows-x86_64' }}
        run: |
          PT_HASH="$(cat ${GITHUB_WORKSPACE}/pytorch-hash.txt)"
          echo "PT_HASH=${PT_HASH}" >> ${GITHUB_ENV}
      - name: Setup ccache
        uses: ./.github/actions/setup-build
        with:
          cache-suffix: 'build-${{ matrix.llvm-build }}-${{ matrix.torch-version }}'
          torch-version: ${{ matrix.torch-version }}
      - name: Set up Visual Studio shell
        if: ${{ matrix.os-arch == 'windows-x86_64' }}
        uses: egor-tensin/vs-shell@v2
        with:
          arch: x64
      - name: Try to Restore PyTorch Build Cache
        if: ${{ matrix.torch-binary == 'OFF' }}
        id: cache-pytorch
        uses: actions/cache/restore@v3
        with:
          path: ${{ github.workspace }}/build_tools/python_deploy/wheelhouse
          key: ${{ runner.os }}-pytorch-${{ env.PT_HASH }}
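      # The restore step exposes a `cache-hit` output; the Linux build below
      # forwards it as TM_PYTORCH_INSTALL_WITHOUT_REBUILD so a previously
      # cached PyTorch source build (keyed on pytorch-hash.txt via PT_HASH)
      # is reused instead of rebuilt.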
      - name: Build and Test os-arch='ubuntu-x86_64' llvm-build='${{ matrix.llvm-build }}' torch-binary='${{ matrix.torch-binary }}'
        if: ${{ matrix.os-arch == 'ubuntu-x86_64' }}
        run: |
          cd $GITHUB_WORKSPACE
          TORCH_MLIR_SRC_PYTORCH_BRANCH="$(cat pytorch-hash.txt)" \
          TM_PACKAGES="${{ matrix.llvm-build }}" \
          TM_USE_PYTORCH_BINARY="${{ matrix.torch-binary }}" \
          TM_PYTORCH_INSTALL_WITHOUT_REBUILD="${{ steps.cache-pytorch.outputs.cache-hit }}" \
          TM_TORCH_VERSION="${{ matrix.torch-version }}" \
          ./build_tools/python_deploy/build_linux_packages.sh
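      # Roughly the same build can be reproduced locally from a torch-mlir
      # checkout (illustrative; the values mirror one matrix combination):
      #
      #   TM_PACKAGES="in-tree" \
      #   TM_USE_PYTORCH_BINARY="ON" \
      #   TM_TORCH_VERSION="nightly" \
      #   ./build_tools/python_deploy/build_linux_packages.sh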
      - name: Configure os-arch='macos-arm64' llvm-build='in-tree' torch-binary='${{ matrix.torch-binary }}'
        # cross compile, can't test arm64
        if: ${{ matrix.os-arch == 'macos-arm64' && matrix.llvm-build == 'in-tree' }}
        run: |
          # TODO: Reenable LTC after build on macOS-arm64 is fixed (https://github.com/llvm/torch-mlir/issues/1253)
          cmake -GNinja -Bbuild_arm64 \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_C_COMPILER=clang \
            -DCMAKE_CXX_COMPILER=clang++ \
            -DCMAKE_C_COMPILER_LAUNCHER=ccache \
            -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
            -DCMAKE_LINKER=lld \
            -DCMAKE_OSX_ARCHITECTURES=arm64 \
            -DLLVM_ENABLE_ASSERTIONS=ON \
            -DLLVM_ENABLE_PROJECTS=mlir \
            -DLLVM_EXTERNAL_PROJECTS="torch-mlir" \
            -DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$GITHUB_WORKSPACE" \
            -DLLVM_TARGETS_TO_BUILD=AArch64 \
            -DLLVM_USE_HOST_TOOLS=ON \
            -DLLVM_ENABLE_ZSTD=OFF \
            -DMLIR_ENABLE_BINDINGS_PYTHON=ON \
            -DTORCH_MLIR_ENABLE_STABLEHLO=OFF \
            -DTORCH_MLIR_ENABLE_LTC=OFF \
            -DTORCH_MLIR_USE_INSTALLED_PYTORCH="${{ matrix.torch-binary }}" \
            -DMACOSX_DEPLOYMENT_TARGET=12.0 \
            -DPython3_EXECUTABLE="$(which python)" \
            $GITHUB_WORKSPACE/externals/llvm-project/llvm
      - name: Build torch-mlir (cross-compile)
        if: ${{ matrix.os-arch == 'macos-arm64' }}
        run: |
          cmake --build build_arm64
      - name: Build (Windows)
        if: ${{ matrix.os-arch == 'windows-x86_64' }}
        shell: bash
        run: ./build_tools/python_deploy/build_windows_ci.sh
      - name: Save PyTorch Build Cache
        if: ${{ github.ref_name == 'main' && matrix.torch-binary == 'OFF' }}
        uses: actions/cache/save@v3
        with:
          path: ${{ github.workspace }}/build_tools/python_deploy/wheelhouse
          key: ${{ runner.os }}-pytorch-${{ env.PT_HASH }}
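      # Save is narrower than restore: the wheelhouse is only uploaded for
      # source builds (torch-binary == 'OFF') on `main`, so presubmit runs
      # read from this cache but never write to it.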
      - name: Print ccache statistics
        shell: bash
        run: ccache --show-stats