torch-mlir

Author	SHA1	Message	Date
Ashay Rane	755d0c46da	CI: Spot fixes related to nightly and stable PyTorch builds (#2190 ) * CI: Skip (redundant) libtorch build when using stable PyTorch version When we use PyTorch stable builds, there is no need to build libtorch from source, making the stable-pytorch-with-torch-binary-OFF configuration redundant with stable-pytorch-with-torch-binary-ON. This patch drops the redundant configuration from CI. * CI: Simplify guard conditions for creating and using libtorch cache Whether libtorch is enabled or not is predicated on a host of conditions such as the platform, in-tree versus out-of-tree build, and stable versus nightly PyTorch builds. Instead of repeating these conditions to guard whether to create or use the libtorch cache artifacts (and getting them almost incorrect), this patch predicates the relevant pipeline steps to whether libtorch is enabled, thus making the conditions far simpler.	2023-06-01 22:58:25 -07:00
maxbartel	db3f2e3fde	Add Stable PyTorch CI Pipeline (#2038 ) * feat: split pytorch requirements into stable and nightly * fix: add true to tests to see full output * refactor: add comments to explain true statement * feat: move some tests to experimental mode * refactor: refactor pipeline into more fine grained difference * feat: add version differentiation for some tests * feat: activate more configs * refactor: change implementation to use less requirement files * refactor: remove constraints used for testing * fix: revert some requirement file names * refactor: remove unnecessary ninja install * fix: fix version parsing * refactor: remove dependency on torchvision in main requirements file * refactor: remove index url * style: remove unnecessary line switch * fix: re-add index url	2023-05-30 12:16:24 -07:00
Ashay Rane	19a08d51f3	CI: [nfc] Use actions/cache instead of modified fork (#2124 ) We previously used a fork of the action/cache repository for the PyTorch cache since the actions/cache repo did not support read-only caches. Now that actions/cache supports separate read and write steps, this patch switches back to the actions/cache repo.	2023-05-12 23:25:17 -05:00
Ashay Rane	28bb866260	CI: prepare CI for ccache updates for MSVC/Windows (#2120 ) This patch, by itself, doesn't fix caching on Windows, but once a new release of ccache is available, caching for Windows builds should start working again (validated by building ccache from source and using it with LLVM builds). Ccache rejects caching when either the `/Zi` or `/ZI` flags are used during compilation on Windows, since these flags tell the compiler to embed debug information in a PDB file (separate from the object file produced by the compiler). In particular, our CI builds add the `/Zi` flag, making ccache mark these compiler invocations as uncacheable. But what caused our CI to add debug flags, especially when we specified `-DCMAKE_BUILD_TYPE=Release`? On Windows, unless we specify the `--config Release` flag during the CMake build step, CMake assumes a debug build. So all this while, we had been producing debug builds of torch-mlir for every PR! No doubt it took so long to build the Windows binaries. The reason for having to specify the configuration during the _build_ step (as opposed to the _configure_ step) of CMake on Windows is that CMake's Visual Studio generators will produce _both_ Release and Debug profiles during the CMake configure step (thus requiring a build-time value that tells CMake whether to build in Release or Debug mode). Luckily, on Linux and macOS, the `--config` flag seems to be simply ignored, instead of causing build errors. Strangely, based on cursory tests, it seems like on Windows we need to specify the Release configuration as both `-DCMAKE_BUILD_TYPE=Release` as well as `--config Release`. Dropping either made my build switch to a Debug configuration. 
Additionally, there is a bug in ccache v4.8 (although this is addressed in trunk) that causes ccache to reject caching if the compiler invocation includes any flag that starts with `/Z`, including `/Zc`, which is added by LLVM's HandleLLVMOptions.cmake and which isn't related to debug info or PDB files. The next release of ccache should include the fix, which is to reject caching only for `/Zi` and `/ZI` flags and not all flags that start with `/Z`. As a side note, debugging this problem was possible because of ccache's log file, which is enabled by: `ccache --set-config="log_file=log.txt"`.	2023-05-12 12:45:01 -05:00
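Editorial note on 28bb866260: the flag-matching difference described in that message can be sketched in a few lines of Python. This is only an illustrative model of the behavior as the commit message describes it, not ccache's actual source code. The function names and the sample flag lists (including `/Zc:inline`, `/O2`, and `/Od`) are assumptions chosen for the example.

```python
# Illustrative model only; this is NOT ccache's source code.
# It mirrors the behavior described in the commit message above.

# Flags that tell MSVC to put debug information in a separate PDB file.
DEBUG_INFO_FLAGS = {"/Zi", "/ZI"}


def uncacheable_v4_8(flags):
    """Described v4.8 behavior: any flag starting with /Z blocks caching."""
    return any(flag.startswith("/Z") for flag in flags)


def uncacheable_fixed(flags):
    """Described fix: only /Zi and /ZI block caching."""
    return any(flag in DEBUG_INFO_FLAGS for flag in flags)


if __name__ == "__main__":
    # Hypothetical compiler invocations, for illustration only.
    release_flags = ["/O2", "/Zc:inline"]
    debug_flags = ["/Od", "/Zi"]
    for name, flags in [("release", release_flags), ("debug", debug_flags)]:
        print(name, uncacheable_v4_8(flags), uncacheable_fixed(flags))
```

Under this model, a Release invocation that carries only a `/Zc`-style flag is rejected by the prefix check but accepted by the exact-match check, while an invocation with `/Zi` is rejected by both. That matches the message's point that the build also has to stop adding `/Zi` before caching can work.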
powderluv	0a3ab07c8f	Set fetch-depth 0 for CI builds too (#2034 )	2023-04-14 11:36:41 -07:00
powderluv	0497f0b08d	Revert "CI: drop deletion of workspace and limit submodule fetch concurrency (#1921 )" (#2007 ) This reverts commit `07f5f042c7`.	2023-04-06 10:36:30 -07:00
Ashay Rane	07f5f042c7	CI: drop deletion of workspace and limit submodule fetch concurrency (#1921 ) Despite using sudo to delete the workspace directory, we still occasionally run into checkout errors. This patch thus drops the deletion of the workspace prior to checkout. It also restricts the number of parallel jobs in the submodule fetch step to just one, to try and resolve the checkout issue ("index.lock: File exists.").	2023-04-04 12:58:52 -05:00
Ashay Rane	987d5ab335	CI: use `sudo` to remove Docker-created files (#1905 )	2023-02-27 17:44:50 -06:00
Ashay Rane	ea00371d85	CI: clear workspace directory before checkout (#1900 ) We have recently started seeing errors like: ``` Synchronizing submodule url for 'externals/llvm-project' Synchronizing submodule url for 'externals/mlir-hlo' /usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1 Error: fatal: Unable to create '/home/anush/actions-runner/_work/torch-mlir/torch-mlir/.git/modules/externals/llvm-project/index.lock': File exists. ``` As a workaround, this patch removes the workspace directory before the checkout step.	2023-02-24 14:44:35 -06:00
powderluv	5710871f4f	Update buildAndTest.yml (#1881 ) * Update buildAndTest.yml * Update oneshotSnapshotPackage.yml * Update buildRelease.yml * Update RollPyTorch.yml * Update oneshotSnapshotPackage.yml * Update buildAndTest.yml	2023-02-15 09:17:12 -08:00
Ashay Rane	711646d095	mhlo: migrate conversion to stablehlo (#1840 ) This patch replaces all MHLO operations with their StableHLO counterparts and adds a validation pass to ensure that no MHLO operations remain before translating all Stablehlo operations to the MHLO dialect for further lowering to the Linalg dialect. This patch also updates all lit tests so that they refer to the `convert-torch-to-stablehlo` pass and so that they check for StableHLO operations.	2023-02-02 07:29:47 -06:00
powderluv	cd90c0aaf5	Update buildAndTest.yml (#1723 )	2022-12-15 05:42:01 -08:00
Ashay Rane	64f9a0e978	ci: print ccache statistics and configuration at end of CI run (#1719 ) There appear to be two problems with the caching layer in our CI runs: (a) the sizes of some of the caches have grown to multiples of the 300 MB limit and (b) caching on Windows seems to provide little to no benefit. To help understand the reasons for these problems, this patch adds a line item to the list of steps run in CI to dump the ccache configuration and statistics just prior to uploading the cache artifact.	2022-12-14 09:50:43 -06:00
Ashay Rane	2846776897	ci: enable ccache on Windows (#1548 ) This patch makes a few small, but key, changes to enable ccache on Windows. First, it replaces the hendrikmuhs/ccache-action action with command line invocations to the ccache binary, since the action has two bugs, one of which causes CI to refer to different ccache artifacts before versus after the build on Windows whereas the other bug can sometimes cause the action to incorrectly infer that the cache is empty. Second, this patch slightly alters the cache key, so that our old cache artifacts, which have grown too big, are eventually discarded in favor of the new, smaller cache artifacts. Along the way, this patch also keeps the RollPyTorch's cache artifact separate from the regular build's cache artifact so as to keep these artifacts small, and also because the RollPyTorch action is off the critical path for most contributors. Finally, this patch makes small changes to the CMake file so that on Windows, the ccache binary is added as a prefix, as recommended on the [ccache Wiki](https://github.com/ccache/ccache/wiki/MS-Visual-Studio).	2022-11-03 12:17:22 -05:00
Ashay Rane	f847642495	CI script improvements (#1547 ) * ci: update versions of external actions Node.js 12 actions are deprecated and will eventually go away, so this patch bumps the old actions to their latest versions that use Node.js 16. * ci: replace deprecated action with bash commands The llvm/actions/install-ninja action uses Node.js 12, which is deprecated. Since that action is not updated to work with Node.js 16, this patch replaces that action with equivalent bash commands to install Ninja. * ci: use smaller ccache artifacts to reduce evictions Over time, our ccache sizes have grown quite large (some as large as 1.3 GB), which results in us routinely exceeding GitHub's limits, thus triggering frequent cache evictions. As a result, cache downloads and uploads take unnecessarily long, in addition to fewer cache entries being available. Based on experiments on a clean cache state, it appears that we need less than 300 MB of (compressed) ccache artifacts for each build type. Anything larger than that will accrue changes from the past that aren't needed. To alleviate the cache burden, this patch sets the maximum ccache size to be 300 MB. This change should not affect the success or failure of our builds. I will monitor the build times to check whether this change causes any performance degradation. * ci: use consistent platform identifiers Prior to this patch, some of our builds ran on `ubuntu-latest`, while some others ran on `ubuntu-20.04` and others ran on `ubuntu-22.04`, with similar situations for macOS and windows. This patch instead sets all Linux builds to run on `ubuntu-latest`, all macOS builds to run on `macos-latest`, and all Windows builds to run on `windows-latest`, to make debugging future CI failures a little easier.	2022-11-02 21:37:01 -05:00
Ashay Rane	031d127940	ci: introduce read-only and read-write PyTorch build caches (#1546 ) Until recently, we had to either risk feature branches creating PyTorch build caches (which were unusable by the main branch or other parallel feature branches because of GitHub's rules around sharing caches among branches) or we had to limit the PyTorch build caches to only the main branch, causing CI runs on feature branches to be terribly slow because they had to rebuild PyTorch each time. This patch enables the best of both worlds, by using a fork (github.com/ashay/cache) of the GitHub's cache action, where the fork adds an option (called `save`) which, when set, uploads a new cache entry. We thus set this `save` flag only when we're building PyTorch from source in Torch-MLIR's main branch, whereas all other builds set this `save` flag to `false`. The ability to conditionally update the cache has been an oft-requested feature on the original (github.com/actions/cache) repository and multiple unmerged PRs exist to allow conditional cache updates, so it is likely that using the fork is only a temporary solution.	2022-11-01 23:26:17 -07:00
Ashay Rane	a8970101dc	pytorch: rename pytorch-version.txt to pytorch-hash.txt (#1541 ) This patch is part of a larger set of improvements to the CI/build system. In the code, we refer to the version as the string that contains the release identifier such as 1.14.0.dev20221028, so calling the file that contains the commit hash as pytorch-version.txt creates confusion. For the sake of simplicity, this patch renames that file to be pytorch-hash.txt.	2022-10-31 22:03:05 -05:00
Ashay Rane	2cf1092d4d	ci: restrict PyTorch cache to just the main branch (#1540 ) If PyTorch build caches are created on a branch other than the main branch, then GitHub does not share those caches with the main branch, making every CI run that runs for each PR slow. This patch resolves the problem by letting only the main branch create and use PyTorch build caches.	2022-10-31 15:14:53 -05:00
powderluv	bbde4e163f	Add Windows Builder (#1521 ) Add a powershell script to build windows .whl packages Disable LTC as it doesn't build on Windows. Add GHA hooks Use Python 3.10.8	2022-10-25 16:13:31 -07:00
Ashay Rane	a9942f343a	Cache PyTorch source builds to reduce CI time (#1500 ) * ci: cache PyTorch source builds This patch reduces the time spent in regular CI builds by caching PyTorch source builds. Specifically, this patch: 1. Makes CI lookup the cache entry for the PyTorch commit hash in pytorch-version.txt 2. If lookup was successful, CI fetches the previously-generated WHL file into the build_tools/python/wheelhouse directory 3. CI sets the `TM_PYTORCH_INSTALL_WITHOUT_REBUILD` variable to `true` 4. The build_libtorch.sh script then uses the downloaded WHL file instead of rebuilding PyTorch * ci: warm up PyTorch source cache during daily RollPyTorch action This patch makes the RollPyTorch action write the updated WHL file to the cache, so that it can be later retrieved by CI that runs for each PR. We deliberately add the caching step to the end of the action since the RollPyTorch action never needs to read from the cache, although executing this step earlier in the process should not cause problems either.	2022-10-18 00:42:42 -05:00
Ashay Rane	8a8e779529	Disable auto-update of PyTorch version until CI script stabilizes (#1456 ) Instead of letting the auto-update script either fail because of script errors or letting it commit bad versions, this patch makes the update process manual, for now. Once the script stabilizes, I will re-enable its periodic execution.	2022-10-04 03:02:44 -05:00
powderluv	e6528f701a	Move CIs to use docker builds (#1316 ) * Move CIs to use docker builds Now that #1234 has landed and anyone can run CI / Release builds locally move GHA to use the same flow. * update names * Update comments	2022-09-02 18:35:40 -07:00
Sean Silva	e16b43e20b	Remove "torchscript" association from the e2e framework. We use it for more than TorchScript testing now. This is a purely mechanical change to adjust some file paths to remove "torchscript". The most perceptible change here is that now e2e tests are run with ``` ./tools/e2e_test.sh instead of: ./tools/torchscript_e2e_test.sh ```	2022-08-29 14:10:03 -07:00
Sean Silva	bcccf41d96	Add CI for generated files. This ensures that they are always up to date. This also updates the shape lib to make the new CI actually pass :)	2022-08-29 12:07:16 -07:00
powderluv	c0630da678	Disable LTC by default until upstream revert relands (#1303 ) * Disable LTC by default until upstream revert relands Tracked with the WIP https://github.com/llvm/torch-mlir/pull/1292 * Disable LTC e2e tests temporarily * Update setup.py Disable LTC in setup.py temporarily until upstream is fixed.	2022-08-28 19:11:40 -07:00
Tanyo Kwok	2374098d71	[MHLO] Init end to end unit tests (#1223 )	2022-08-23 16:47:21 +08:00
Henry Tu	ba17a4d6c0	Reenable LTC in out-of-tree build (for real this time) (#1205 ) * Fix OOT LTC CI build failure * Disable LTC during macOS package gen * Add more details about static TorchMLIRJITIRImporter library	2022-08-19 15:25:00 -04:00
Ashay Rane	606f4d2c0e	build: streamline options for enabling LTC and MHLO (#1221 )	2022-08-12 23:49:28 -07:00
Sambhav Jain	34478ab1c7	[Build] Add concurrency groups to address long queue times (#1219 ) We're seeing large CI queue times ([example](https://discord.com/channels/636084430946959380/742573221882364009/1007631811184164944)) especially with MacOS VMs on GHA. Part of the problem is follow-on commits to the same branch which trigger new runs while the previous runs are still in-progress, hogging on the scarce VMs. This PR adds concurrency groups to the GHA workflow which ensures that only a single job or workflow using the same concurrency group will run at a time. This would cancel any in-progress jobs in the same github workflow and github ref (e.g. `refs/heads/main` or `refs/pull/<pr_number>/merge`). As discussed on discord [thread](https://discord.com/channels/636084430946959380/1007787336848912386/1007787338895740928), once this lands we may have to closely monitor the workflows to see this didn't introduce unintended consequences. If so, we could either revert, or decide to selectively cancel particular runs (e.g. macos only which is the main bottleneck right now) instead of entire workflow. This will also require some expectation management. As in, if you see an ❌ on the main branch, it may not necessarily mean things broke, it could mean the run was killed by a more recent run. Making it a bit harder to traceback a failure to a commit in a sequence of commits (requiring to run those builds again). Thanks @powderluv for the proposal and pointer to this! It should help with the scarce VMs on GHA and save on queue time. References: * https://docs.github.com/en/actions/using-jobs/using-concurrency#example-only-cancel-in-progress-jobs-or-runs-for-the-current-workflow * https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-only-cancel-in-progress-jobs-or-runs-for-the-current-workflow	2022-08-12 17:38:48 -07:00
Ashay Rane	1581d6a84c	build: fix typo in path (#1218 ) When we renamed the directory containing submodules from `external` to `externals`, we accidentally left the original name in the Github workflow. This patch fixes the problem.	2022-08-12 15:38:25 -07:00
Sambhav Jain	aed0ec3a2c	Merge matrix runs to fail fast globally (#1216 ) My earlier [PR](https://github.com/llvm/torch-mlir/pull/1213) had (among other things) decoupled ubuntu and macos builds into separate matrix runs. This is not working well due to limited number of MacOS GHA VMs causing long queue times and backlog. There are two reasons causing this backlog: 1. macos arm64 builds with pytorch source are getting erratically cancelled due to resource / network constraints. This is addressed with this: https://github.com/llvm/torch-mlir/pull/1215 > "macos-arm64 (in-tree, OFF) The hosted runner: GitHub Actions 3 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error." 2. macos runs don't fail-fast when ubuntu runs fail due to being in separate matrix setups. This PR couples them again.	2022-08-12 11:30:09 -07:00
Sambhav Jain	b8bd0a46cc	use pytorch binary for macos-arm64 builds (#1215 )	2022-08-12 06:33:57 -07:00
Sambhav Jain	f00ca91db0	Simplify matrix configuration for CI workflows (#1213 ) Addresses https://github.com/llvm/torch-mlir/issues/1207. #### Provisioned jobs: ``` # ubuntu - x86_64 - llvm in-tree - pytorch binary - build+test # most used dev flow and fastest signal # ubuntu - x86_64 - llvm out-of-tree - pytorch source - build+test # most elaborate build # macos - arm64 - llvm in-tree - pytorch source - build only # cross compile, can't test arm64 ``` #### Main changes - [x] Spawn macos builds from a separate matrix (in the same workflow). It made sense to do this as they are fairly different from ubuntu (cross compile, use a different cmake configuration). This simplifies the matrix configuration and exclusions quite a bit, and makes the workflow a bit more tractable and maintenance friendly. - [x] Remove the submodule md5sum step for ccache config. This was [broken](https://github.com/llvm/torch-mlir/runs/7779288734?check_suite_focus=true#step:3:145) for a while now. - [x] Removes unused matrix options - `os`, `targetarch`, `python-version`, `llvmtype`. - [x] Address ZSTD [comment](https://github.com/llvm/torch-mlir/pull/1204#discussion_r942349282) on @powderluv's cross compile [PR](https://github.com/llvm/torch-mlir/pull/1204). #### Further improvements (to be addressed in follow-on): * ubuntu-x86_64 out-of-tree integration tests fail ([error](https://github.com/sjain-stanford/torch-mlir/runs/7781264029?check_suite_focus=true)); only run unit tests for now (tests are excluded in current CI too) #### Passing workflow: https://github.com/sjain-stanford/torch-mlir/actions/runs/2840676309 ![image](https://user-images.githubusercontent.com/19234106/184194535-f3807991-401a-4cb9-b030-0ee8c334eba3.png)	2022-08-11 16:35:15 -07:00
powderluv	2342456356	mac m1 cross compile (#1204 ) * mac m1 cross compile Add support for M1 cross compile * Remove redundant ExecutionEngine It is registered as part of RegisterEverything * nuke non-universal zstd disable LTC	2022-08-10 08:48:39 -07:00
powderluv	9cf0b6e8ff	Disable out-of-tree and PyTorch binary (#1206 )	2022-08-09 18:18:12 -07:00
Sambhav Jain	b696362b7d	Enable OOT builds in CI (#1188 )	2022-08-09 12:13:16 -07:00
Henry Tu	3e97a33c80	Revert "Reenable LTC in out-of-tree build (#1177 )" (#1183 ) This reverts commit `f85ae9c685`.	2022-08-08 18:58:35 -07:00
Henry Tu	f85ae9c685	Reenable LTC in out-of-tree build (#1177 )	2022-08-08 17:35:22 -04:00
Henry Tu	e322f6a878	Update LTC CMake hack documentation (#1155 ) * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt * Update buildAndTest.yml * Update setup.py * Address review comments	2022-08-05 14:12:20 -04:00
powderluv	37a229cffc	Update buildAndTest.yml (#1145 )	2022-08-03 12:50:54 -07:00
powderluv	0d25b6f10e	Fix cache-suffix name bug (#1138 ) This should enable better caching of builds.	2022-08-03 07:53:01 -07:00
Henry Tu	2c3b3606d0	Resolve remaining LTC CI failures (#1110 ) * Replace CHECK_EQ with TORCH_CHECK_EQ * Check value of TORCH_MLIR_USE_INSTALLED_PYTORCH during LTC build * Update LTC XFAIL with NewZerosModule ops * Explicitly blacklist _like ops * Automatically blacklist new_/_like ops * Prune away unused Python dependencies from LTC * Add flag to disable LTC * Autogen dummy _REFERENCE_LAZY_BACKEND library when LTC is disabled * Implement compute_shape_var * Removed Var tests from XFAIL Set * XFAIL tests using _local_scalar_dense or index.Tensor * Add StdDim tests to XFAIL set * Autogen aten::cat	2022-07-30 09:40:02 -04:00
Antonio Kim	de6c135dc3	Fix LTC autogen for CI with nightly PyTorch - Update llvm-project pin to match main	2022-07-30 09:40:02 -04:00
Henry Tu	dfcc26556a	Added e2e LTC tests (#916 ) * Added e2e LTC Torch MLIR tests * Fix seed for reproducibility * Check if computation is None before getting debug string * Updated unit tests, and added numeric tests * Print name of the model layer that fails numeric validation * Run LTC e2e test with CI/CD * Set seed in main function, instead of beginning of execution * Add comment to specify number of digits of precision * Fixed typo * Remove tests for LTC example models * Added LTC option to torchscript e2e * Implement compile and run for LTC e2e test * xfail all tests that use ops that aren't currently supported	2022-07-30 09:40:02 -04:00
Jae Hoon (Antonio) Kim	2f22e2ef40	Add initial LTC backend (#610 ) * Add initial LTC backend skeleton * Disable CI build and move TorchMLIRPyTorch.cmake	2022-07-30 09:40:02 -04:00
powderluv	db4a6991a0	buildAndTest.yml for matrix builds (#1098 ) * Update buildAndTest.yml test with fast-fail matrix builds * Remove redundant and statement * Downgrade to 20.04 Until upstream PyTorch FBGEMM is fixed to compile with clang+14+ https://github.com/pytorch/pytorch/pull/82396 * Update buildAndTest.yml run tests on only the binary config.	2022-07-29 10:52:46 -07:00
powderluv	31fd812acf	Add linux and macOS source builds in CI (#1070 ) This enables building Pytorch from source in the CI. The build should mostly hit the ccache. Release builds will follow once we have some runtime on the CI.	2022-07-21 14:16:03 -07:00
Ziheng Jiang	c61c99e887	[MHLO] Init MHLO integration. (#1083 ) Co-authored-by: Bairen Yi <yibairen.byron@bytedance.com> Co-authored-by: Jiawei Wu <xremold@gmail.com> Co-authored-by: Tianyou Guo <tianyou.gty@alibaba-inc.com> Co-authored-by: Xu Yan <yancey.yx@alibaba-inc.com> Co-authored-by: Ziheng Jiang <ziheng.jiang@bytedance.com>	2022-07-20 16:18:16 -07:00
Maksim Levental	cec5aeedb0	add ci tests (#754 )	2022-05-25 14:59:59 -05:00
powderluv	c1026fa95b	Switch to using the new Release builds (#780 )	2022-04-21 18:46:34 -07:00

101 Commits (45e2188615711a0db70cb7ad0ca92b95a46687e2)