# yamllint disable rule:line-length
name: Build and Test
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  workflow_dispatch:
# Ensure that only a single run using the same concurrency group is active
# at a time. A new run cancels any queued or in-progress run in its group.
concurrency:
  # The group key is the PR number for pull requests and the commit hash
  # otherwise. A new push to a PR therefore cancels that PR's earlier runs
  # (presubmit), while each commit on main gets its own group (postsubmit).
  # The workflow name is prepended to avoid conflicts between different
  # workflows.
  group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
  cancel-in-progress: true
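# Alternative sketch (an assumption for comparison, NOT what this workflow
# uses): keying the group on the git ref, as in the GitHub Actions concurrency
# documentation, would also cancel an in-progress run on main whenever a newer
# commit lands on main, because all pushes to main share refs/heads/main.
#
# concurrency:
#   group: ${{ github.workflow }}-${{ github.ref }}
#   cancel-in-progress: true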
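# Provisioned Jobs (torch-binary is always ON in the matrix below):
# ubuntu/docker - x86_64 - llvm in-tree     - pytorch binary (nightly, stable) - build+test  # most used dev flow and fastest signal
# ubuntu/docker - x86_64 - llvm out-of-tree - pytorch binary (nightly)         - build+test  # most elaborate build
# macos         - arm64  - llvm in-tree     - pytorch binary (nightly)         - build only  # cross compile, can't test arm64
# windows       - x86_64 - llvm in-tree     - pytorch binary (nightly)         - runs build_windows_ci.sh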
jobs:
  build-test:
    strategy:
      fail-fast: true
      matrix:
        os-arch: [ubuntu-x86_64, macos-arm64, windows-x86_64]
        llvm-build: [in-tree, out-of-tree]
        torch-binary: [ON]
        torch-version: [nightly, stable]
        exclude:
          # Exclude llvm out-of-tree and pytorch stable (to save resources)
          - llvm-build: out-of-tree
            torch-version: stable
          # macos-arm64 and windows-x86_64 only run llvm in-tree with pytorch
          # nightly: exclude out-of-tree and stable for both.
          - os-arch: macos-arm64
            llvm-build: out-of-tree
          - os-arch: macos-arm64
            torch-version: stable
          - os-arch: windows-x86_64
            llvm-build: out-of-tree
          - os-arch: windows-x86_64
            torch-version: stable
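          # Worked count (derived from the lists above, not stated in the
          # original file): the full matrix has
          #   3 os-arch x 2 llvm-build x 1 torch-binary x 2 torch-version = 12
          # combinations. The exclude rules remove 7 of them, leaving 5 jobs:
          #   ubuntu-x86_64  / in-tree     / nightly
          #   ubuntu-x86_64  / in-tree     / stable
          #   ubuntu-x86_64  / out-of-tree / nightly
          #   macos-arm64    / in-tree     / nightly
          #   windows-x86_64 / in-tree     / nightly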
        include:
          # Specify OS versions
          - os-arch: ubuntu-x86_64
            os: a100
          - os-arch: macos-arm64
            os: macos-latest
          - os-arch: windows-x86_64
            os: windows-latest
    runs-on: ${{ matrix.os }}
    steps:
      - name: Prepare workspace
        if: ${{ matrix.os-arch == 'ubuntu-x86_64' }}
        run: |
          # Clear the workspace directory so that we don't run into errors about
          # existing lock files. The ":?" guard aborts if the variable is unset.
          sudo rm -rf "${GITHUB_WORKSPACE:?}"/*
      - name: Checkout torch-mlir
        uses: actions/checkout@v3
        with:
          submodules: 'true'
          fetch-depth: 0
      - name: Fetch PyTorch commit hash
        if: ${{ matrix.os-arch != 'windows-x86_64' }}
        run: |
          PT_HASH="$(cat "${GITHUB_WORKSPACE}/pytorch-hash.txt")"
          echo "PT_HASH=${PT_HASH}" >> "${GITHUB_ENV}"
      - name: Setup ccache
        uses: ./.github/actions/setup-build
        with:
          cache-suffix: 'build-${{ matrix.llvm-build }}-${{ matrix.torch-version }}'
          torch-version: ${{ matrix.torch-version }}
      - name: Set up Visual Studio shell
        if: ${{ matrix.os-arch == 'windows-x86_64' }}
        uses: egor-tensin/vs-shell@v2
        with:
          arch: x64
      # torch-binary is fixed to ON in the matrix above, so this step and the
      # "Save PyTorch Build Cache" step are skipped in the current configuration.
      - name: Try to Restore PyTorch Build Cache
        if: ${{ matrix.torch-binary == 'OFF' }}
        id: cache-pytorch
        uses: actions/cache/restore@v3
        with:
          path: ${{ github.workspace }}/build_tools/python_deploy/wheelhouse
          key: ${{ runner.os }}-pytorch-${{ env.PT_HASH }}
      - name: Build and Test os-arch='ubuntu-x86_64' llvm-build='${{ matrix.llvm-build }}' torch-binary='${{ matrix.torch-binary }}'
        if: ${{ matrix.os-arch == 'ubuntu-x86_64' }}
        run: |
          cd "$GITHUB_WORKSPACE"
          TORCH_MLIR_SRC_PYTORCH_BRANCH="$(cat pytorch-hash.txt)" \
          TM_PACKAGES="${{ matrix.llvm-build }}" \
          TM_USE_PYTORCH_BINARY="${{ matrix.torch-binary }}" \
          TM_PYTORCH_INSTALL_WITHOUT_REBUILD="${{ steps.cache-pytorch.outputs.cache-hit }}" \
          TM_TORCH_VERSION="${{ matrix.torch-version }}" \
          ./build_tools/python_deploy/build_linux_packages.sh
      - name: Configure os-arch='macos-arm64' llvm-build='in-tree' torch-binary='${{ matrix.torch-binary }}'
        # cross compile, can't test arm64
        if: ${{ matrix.os-arch == 'macos-arm64' && matrix.llvm-build == 'in-tree' }}
        run: |
          # TODO: Reenable LTC after build on macOS-arm64 is fixed (https://github.com/llvm/torch-mlir/issues/1253)
          cmake -GNinja -Bbuild_arm64 \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_C_COMPILER=clang \
            -DCMAKE_CXX_COMPILER=clang++ \
            -DCMAKE_C_COMPILER_LAUNCHER=ccache \
            -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
            -DCMAKE_LINKER=lld \
            -DCMAKE_OSX_ARCHITECTURES=arm64 \
            -DLLVM_ENABLE_ASSERTIONS=ON \
            -DLLVM_ENABLE_PROJECTS=mlir \
            -DLLVM_EXTERNAL_PROJECTS="torch-mlir" \
            -DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$GITHUB_WORKSPACE" \
            -DLLVM_TARGETS_TO_BUILD=AArch64 \
            -DLLVM_USE_HOST_TOOLS=ON \
            -DLLVM_ENABLE_ZSTD=OFF \
            -DMLIR_ENABLE_BINDINGS_PYTHON=ON \
            -DTORCH_MLIR_ENABLE_STABLEHLO=OFF \
            -DTORCH_MLIR_ENABLE_LTC=OFF \
            -DTORCH_MLIR_USE_INSTALLED_PYTORCH="${{ matrix.torch-binary }}" \
            -DMACOSX_DEPLOYMENT_TARGET=12.0 \
            -DPython3_EXECUTABLE="$(which python)" \
            "$GITHUB_WORKSPACE/externals/llvm-project/llvm"
      - name: Build torch-mlir (cross-compile)
        if: ${{ matrix.os-arch == 'macos-arm64' }}
        run: |
          cmake --build build_arm64
      - name: Build (Windows)
        if: ${{ matrix.os-arch == 'windows-x86_64' }}
        shell: bash
        run: ./build_tools/python_deploy/build_windows_ci.sh
      - name: Save PyTorch Build Cache
        if: ${{ github.ref_name == 'main' && matrix.torch-binary == 'OFF' }}
        uses: actions/cache/save@v3
        with:
          path: ${{ github.workspace }}/build_tools/python_deploy/wheelhouse
          key: ${{ runner.os }}-pytorch-${{ env.PT_HASH }}
      - name: Print ccache statistics
        shell: bash
        run: ccache --show-stats