torch-mlir

Commit Graph

Author	SHA1	Message	Date
Xida Ren (Cedar)	ed163f49e8	Delete .github/workflows/buildAndTest.yml to solve CI error messages on every push (#3155) fixes #3144	2024-04-12 12:11:48 -07:00
Ramiro Leal-Cavazos	1a7442e0aa	Add clang-format check to CI (#2816) This PR adds a check to the CI right after checking out the Torch-MLIR repository to make sure that the changes in the PR don't require any `git clang-format` modifications.	2024-01-30 19:59:46 -08:00
Vivek Khandelwal	e18fcebd3a	[CI] Change Roll PyTorch runner (#2828) Signed-Off By: Vivek Khandelwal <vivekkhandelwal1424@gmail.com>	2024-01-30 16:42:18 +05:30
Stella Laurenzo	032f225fa5	[ci] Allow long line in YAML	2024-01-27 19:43:41 -08:00
Stella Laurenzo	6b3ebb237f	[ci] Use a different cache key for torch nightly vs stable.	2024-01-27 19:42:29 -08:00
Stella Laurenzo	4513c3ca87	[ci] Add step to run unit tests. (#2820)	2024-01-27 19:35:48 -08:00
Stella Laurenzo	77c14ab22b	[ci] Upgrade to new runners and disable unsupported jobs. (#2818) Per the RFC and numerous conversations on Discord, this rebuilds the torch-mlir CI and discontinues the infra and coupling to the binary releases (https://discourse.llvm.org/t/rfc-discontinuing-pytorch-1-binary-releases/76371). I iterated on this to get latency back to about what it was with the old (much larger and non-ephemeral) runners: about 4 to 4.5 minutes for an incremental change. Behind-the-scenes changes: * Uses a new runner pool operated by AMD. It is currently set to manual scaling and has two runners (32-core, 64GiB RAM) while we get some traction. We can either fiddle with some auto-scaling or use a schedule to increase capacity during certain high-traffic hours. * Builds are now completely isolated and cannot have run-to-run interference like we were getting before (i.e. lock file/permissions stuff). * The GHA runner is installed directly into a manylinux 2.28 container with upgraded dev tools. This eliminates the need to do sub-invocations of docker on Linux in order to run on the same OS that is used to build wheels. * While not using it now, this setup was cloned from another project that posts the built artifacts to the job and fans out testing. Might be useful here later. * Uses a special git cache that lets us have ephemeral runners and still check out the repo and deps (incl. llvm) in ~13s. * Running in an Azure VM Scale Set. In-repo changes: * Disables (but does not yet delete): * Old buildAndTest.yml jobs * releaseSnapshotPackage.yml * Adds a new `ci.yml` pipeline and scripts the steps in `build_tools/ci` (by decomposing the existing `build_linux_packages.sh` for in-tree builds and modularizing it a bit better). * Test framework changes: * Adds a `TORCH_MLIR_TEST_CONCURRENCY` env var that can be used to bound the multiprocess concurrency. Ended up not using this in the final version but is useful to have as a knob. * Changes the default concurrency to `nproc * 0.8 + 1` vs `nproc * 1.1`. We're running on systems with significantly less virtual memory and I did a bit of fiddling to find a good tradeoff. * Changed multiprocess mode to spawn instead of fork. Otherwise, I was getting instability (as discussed on Discord). * Added MLIR configuration to disable multithreaded contexts globally for the project. Constantly spawning `nproc * nproc` threads (more than that actually) was OOM'ing. * Added a test timeout of 5 minutes. If a multiprocess worker crashes, the framework can get wedged indefinitely (and then will just be reaped after multiple hours). We should fix this, but this at least keeps the CI pool from wedging with stuck jobs. Functional changes needing followup: * No matter what I did, I couldn't get the LTC tests to work, and I'm not 100% sure they were being run in the old setup as the scripts were a bit twisty. I disabled them and left a comment. * Dropped out-of-tree build variants. These were not providing much signal and increased CI needs by 50%. * Dropped MacOS and Windows builds. Now that we are "just a library" and not building releases, there is less pressure to test these commit by commit. Further, since we bump torch-mlir to known good commits on these platforms, it has been a long time since either of these jobs provided much signal (and they take an hour or more to run). We can add them back later post-submit if ever needed.	2024-01-27 18:35:45 -08:00
Stella Laurenzo	4a4d80a6ad	[ci] Add lint job and enable yaml linting of GH files. (#2819)	2024-01-27 15:48:06 -08:00
Vivek Khandelwal	311b6b0286	CI: Fix Roll PyTorch CI failure at determining commit hash (#2796) Signed-Off By: Vivek Khandelwal <vivekkhandelwal1424@gmail.com>	2024-01-24 15:55:12 +05:30
Sambhav Jain	49fdc1a8a6	Add bazel targets for TorchOnnxToTorch conversion passes (#2596) Adapts to the TorchOnnxToTorch changes from https://github.com/llvm/torch-mlir/pull/2585. Also restores bazel builds in post-merge CI that were disabled in `2148c4cd0d`. Bazel workflow: https://github.com/sjain-stanford/torch-mlir/actions/runs/7023912962	2023-11-28 13:06:35 -08:00
Stella Laurenzo	53fc995639	Run CI on all main/postsubmit commits. Prior to this, the concurrency rules for presubmits (which cancel eagerly) were being applied to main. The result was that landing a second patch would cancel the CI on the one prior.	2023-11-22 18:05:18 -08:00
Stella Laurenzo	2148c4cd0d	Temporarily disable bazel build until fixed.	2023-11-22 18:00:39 -08:00
Vivek Khandelwal	b26797c20b	Disable torch-mlir-core for release build (#2586)	2023-11-20 19:36:14 -08:00
Stella Laurenzo	5eae0adff1	Breakup python pytorch deps (#2582) This lifts the core of the jit_ir_importer and ltc out of the pt1 project, making them peers to it. As a side-effect of this layering, now the "MLIR bits" (dialects, etc.) are not commingled with the various parts of the pt1 project, allowing pt1 and ltc to overlay cleanly onto a more fundamental "just MLIR" Python core. Prior to this, the Python namespace was polluted to the point that this could not happen. That "just MLIR" Python core will be introduced in a followup, which will create the space to upstream the FX and ONNX pure Python importers. The primary non-NFC change to the API is: * `torch_mlir.dialects.torch.importer.jit_ir` -> `torch_mlir.jit_ir_importer`. The rest is source code layering so that we can make the pt1 project optional without losing the other features. Progress on #2546.	2023-11-19 12:10:19 -08:00
Stella Laurenzo	6961f0a247	Re-organize project structure to separate PyTorch dependencies from core project. (#2542) This is a first step towards the structure we discussed here: https://gist.github.com/stellaraccident/931b068aaf7fa56f34069426740ebf20 There are two primary goals: 1. Separate the core project (C++ dialects and conversions) from the hard PyTorch dependencies. We move all such things into projects/pt1 as a starting point since they are presently entangled with PT1-era APIs. Additional work can be done to disentangle components from that (specifically LTC is identified as likely ultimately living in a `projects/ltc`). 2. Create space for native PyTorch2 Dynamo-based infra to be upstreamed without needing to co-exist with the original TorchScript path. Very little changes in this path with respect to build layering or options. These can be updated in a followup without commingling directory structure changes. This also takes steps toward a couple of other layering enhancements: * Removes the llvm-external-projects/torch-mlir-dialects sub-project, collapsing it into the main tree. * Audits and fixes up the core C++ build to account for issues found while moving things. This is just an opportunistic pass through but roughly halves the number of build actions for the project from the high 4000s to the low 2000s. It deviates from the discussed plan by having a `projects/` tree instead of `compat/`. As I was thinking about it, this will better accommodate the follow-on code movement. Once things are roughly in place and the CI passing, followups will focus on more in-situ fixes and cleanups.	2023-11-02 19:45:55 -07:00
Vivek Khandelwal	d10a86f51c	Disable LTC for arm release. Also, revert https://github.com/llvm/torch-mlir/pull/2488. Disabling LTC based on the discussion here: https://discord.com/channels/636084430946959380/742573221882364009/1156272667813494824	2023-10-02 22:22:07 +05:30
Vivek Khandelwal	8abfa5b196	Use PyTorch nightly for Arm release build (#2488) The LTC backend has drifted from being able to pass tests on the stable PyTorch version, so pinning to nightly on ARM. Signed-Off By: Vivek Khandelwal <vivek@nod-labs.com>	2023-09-27 09:40:32 -07:00
Ashay Rane	5f772e8cb4	CI: reconcile differences between RollPyTorch and pre-merge checks (#2482)	2023-09-23 07:00:16 -07:00
powderluv	cfd70dfb0d	Update merge-rollpytorch.yml with approved actors (#2446)	2023-09-08 05:00:09 -07:00
Ashay Rane	8f28d933e1	CI: disable LTC e2e tests in stable PyTorch builds (#2414) This way, we can keep CI green without being forced to ignore _all_ errors that arise in stable PyTorch builds	2023-08-23 11:11:17 -05:00
Ramiro Leal-Cavazos	c0d8248ec7	Prevent failed stable CI job from cancelling nightly jobs (#2373) The CI jobs that use stable PyTorch are currently not required to pass in order for a patch to get merged in `main`. This commit makes sure that if a CI job for stable PyTorch fails, it does not cancel the other required jobs.	2023-08-03 14:58:57 -07:00
Sambhav Jain	facce24ae3	[Bazel] Fix broken Bazel build (#2252) Bazel GHA run: https://github.com/sjain-stanford/torch-mlir/actions/runs/5408580473	2023-06-29 08:45:35 -07:00
Ashay Rane	c202cb5263	CI: Checkout repo so that gh knows where to look for the PR (#2223) Without this patch, the gh command (for merging the PR) doesn't know which repo we're referring to.	2023-06-09 21:50:19 -05:00
Ashay Rane	33ac7c3ad1	CI: Use GitHub token when calling gh for merging RollPyTorch PR (#2220)	2023-06-08 15:07:43 -05:00
Ashay Rane	3c1a796f7e	CI: Merge RollPyTorch PR upon successful completion (#2218) This patch removes the mock commands, so that once the Build And Test workflow has successfully completed on the RollPyTorch action, the PR is merged and the branch is deleted.	2023-06-07 14:06:50 -05:00
Ashay Rane	2480cb7a51	CI: Update script to (mock) merge RollPyTorch PRs (#2213) Before enabling the actual merge, this patch dumps to the console the bash commands that it plans to execute.	2023-06-06 12:38:16 -05:00
Ashay Rane	173050ec8a	CI: Fix yaml syntax in merge-rollpytorch.yml (#2201) This patch fixes the indentation in the yaml file.	2023-06-05 09:43:00 -05:00
Ashay Rane	c804dac925	CI: Introduce workflow to auto-merge RollPyTorch updates (#2196) This patch adds a new workflow that runs when an update to the rollpytorch branch by silvasean (in whose name the RollPyTorch action runs) causes the regular CI build to complete without errors. Upon execution, this workflow currently just prints the PR number(s) of the PR created by the RollPyTorch action, but once this is working as expected, we will add the step to merge the PR changes.	2023-06-05 08:48:20 -05:00
Ashay Rane	755d0c46da	CI: Spot fixes related to nightly and stable PyTorch builds (#2190) * CI: Skip (redundant) libtorch build when using stable PyTorch version. When we use PyTorch stable builds, there is no need to build libtorch from source, making the stable-pytorch-with-torch-binary-OFF configuration redundant with stable-pytorch-with-torch-binary-ON. This patch drops the redundant configuration from CI. * CI: Simplify guard conditions for creating and using libtorch cache. Whether libtorch is enabled or not is predicated on a host of conditions such as the platform, in-tree versus out-of-tree build, and stable versus nightly PyTorch builds. Instead of repeating these conditions to guard whether to create or use the libtorch cache artifacts (and nearly getting them wrong), this patch predicates the relevant pipeline steps on whether libtorch is enabled, thus making the conditions far simpler.	2023-06-01 22:58:25 -07:00
maxbartel	db3f2e3fde	Add Stable PyTorch CI Pipeline (#2038) * feat: split pytorch requirements into stable and nightly * fix: add true to tests to see full output * refactor: add comments to explain true statement * feat: move some tests to experimental mode * refactor: refactor pipeline into more fine-grained differences * feat: add version differentiation for some tests * feat: activate more configs * refactor: change implementation to use fewer requirement files * refactor: remove constraints used for testing * fix: revert some requirement file names * refactor: remove unnecessary ninja install * fix: fix version parsing * refactor: remove dependency on torchvision in main requirements file * refactor: remove index url * style: remove unnecessary line switch * fix: re-add index url	2023-05-30 12:16:24 -07:00
powderluv	2f02ae1ebe	Delete another spurious pip (#2173)	2023-05-26 00:02:21 -07:00
powderluv	9b7909b599	Add ARM64 release builds (#2159) Creates a build_linux_arm64 job that builds the release on an arm64 self-hosted runner. Drop Python 3.10 support. Pass TM_TORCH_VERSION to choose the stable PyTorch version (since arm64 doesn't have nightly builds). Borrows the nightly/stable PyTorch switch from the WIP https://github.com/llvm/torch-mlir/pull/2038	2023-05-25 20:39:19 -07:00
powderluv	f5e0287aaa	Remove spurious pip in Release builds (#2172) (left over from a previous commit that was approved and landed in a branch by accident)	2023-05-25 18:59:21 -07:00
Ashay Rane	9f65a8a961	CI: disable caching for release builds (#2168) This patch adds a (default-true) input called `cache-enabled` to the setup-build action, so that when the input is false, ccache is not set up on the host machine. This patch also sets the input to be false for the release builds.	2023-05-25 11:01:46 -05:00
Ashay Rane	558f12f05c	CI: Use GitHub app token for creating PRs (#2137) Since PRs created by the GitHub action bot cannot trigger workflows (and thus build tests), this patch uses the token for a GitHub app that was specifically created for the RollPyTorch action.	2023-05-19 23:18:03 -05:00
Ashay Rane	19a08d51f3	CI: [nfc] Use actions/cache instead of modified fork (#2124) We previously used a fork of the actions/cache repository for the PyTorch cache since the actions/cache repo did not support read-only caches. Now that actions/cache supports separate read and write steps, this patch switches back to the actions/cache repo.	2023-05-12 23:25:17 -05:00
Ashay Rane	28bb866260	CI: prepare CI for ccache updates for MSVC/Windows (#2120) This patch, by itself, doesn't fix caching on Windows, but once a new release of ccache is available, caching for Windows builds should start working again (validated by building ccache from source and using it with LLVM builds). Ccache rejects caching when either the `/Zi` or `/ZI` flags are used during compilation on Windows, since these flags tell the compiler to write debug information to a PDB file (separate from the object file produced by the compiler). In particular, our CI builds add the `/Zi` flag, making ccache mark these compiler invocations as uncacheable. But what caused our CI to add debug flags, especially when we specified `-DCMAKE_BUILD_TYPE=Release`? On Windows, unless we specify the `--config Release` flag during the CMake build step, CMake assumes a debug build. So all this while, we had been producing debug builds of torch-mlir for every PR! No wonder it took so long to build the Windows binaries. The reason for having to specify the configuration during the _build_ step (as opposed to the _configure_ step) of CMake on Windows is that CMake's Visual Studio generators will produce _both_ Release and Debug profiles during the CMake configure step (thus requiring a build-time value that tells CMake whether to build in Release or Debug mode). Luckily, on Linux and macOS, the `--config` flag seems to be simply ignored, instead of causing build errors. Strangely, based on cursory tests, it seems like on Windows we need to specify the Release configuration as both `-DCMAKE_BUILD_TYPE=Release` and `--config Release`. Dropping either made my build switch to a Debug configuration. Additionally, there is a bug in ccache v4.8 (although it is fixed in trunk) that causes ccache to reject caching if the compiler invocation includes any flag that starts with `/Z`, including `/Zc`, which is added by LLVM's HandleLLVMOptions.cmake and which isn't related to debug info or PDB files. The next release of ccache should include the fix, which is to reject caching only for `/Zi` and `/ZI` flags and not all flags that start with `/Z`. As a side note, debugging this problem was possible because of ccache's log file, which is enabled by: `ccache --set-config="log_file=log.txt"`.	2023-05-12 12:45:01 -05:00
Ashay Rane	e161f2511a	CI: let GitHub action create commit (#2114) The GitHub action for creating the PR expects that either the changes are not committed (in which case it commits them with the specified commit message) or that the commit exists but that it is also pushed to remote. Prior to this patch, we created the commit but did not push it to remote, causing failures. This patch leaves the changes uncommitted so that they're committed and pushed to remote as part of the PR creation.	2023-05-11 19:19:32 -05:00
Ashay Rane	377720af87	CI: create PR for RollPyTorch updates (#2106) Currently, we run just the Linux in-tree tests when the RollPyTorch workflow runs, but this is insufficient since WHL files for macOS or Windows are sometimes not uploaded by PyTorch, causing the RollPyTorch action to pass but all subsequent torch-mlir CI tests to fail because of the broken build. The easiest way to validate the RollPyTorch action on all platforms is to run the standard set of tests that we run for each submitted PR, so this patch makes the RollPyTorch action submit a PR instead of committing the changes to the main branch directly. The PR is assigned to a handful of folks for review, although this can be changed in the future.	2023-05-10 09:25:59 -05:00
powderluv	0a3ab07c8f	Set fetch-depth 0 for CI builds too (#2034)	2023-04-14 11:36:41 -07:00
powderluv	6cab740603	Set fetch-depth 0 (#2009) This is to potentially work around the index.lock issue in git when we check out new depth 1 submodules of recently updated mhlo.	2023-04-06 14:29:59 -07:00
powderluv	0497f0b08d	Revert "CI: drop deletion of workspace and limit submodule fetch concurrency (#1921)" (#2007) This reverts commit `07f5f042c7`.	2023-04-06 10:36:30 -07:00
Ashay Rane	07f5f042c7	CI: drop deletion of workspace and limit submodule fetch concurrency (#1921) Despite using sudo to delete the workspace directory, we still occasionally run into checkout errors. This patch thus drops the deletion of the workspace prior to checkout. It also restricts the number of parallel jobs in the submodule fetch step to just one, to try and resolve the checkout issue ("index.lock: File exists.").	2023-04-04 12:58:52 -05:00
powderluv	f83c516b15	Update RollPyTorch.yml to use Pytorch 3.11 (#1999)	2023-04-03 23:53:26 -07:00
Maksim Levental	c718f87c5d	- rename no-jit -> core (#1920) - add windows release	2023-03-07 00:20:06 -06:00
Maksim Levental	ac1f03e6f7	add jit,no-jit release matrix (#1916)	2023-03-05 22:13:33 -08:00
Maksim Levental	415265a64c	Add `torch-mlir-no-jit-importer` build case for mac os wheels (#1902) * add flags to setup.py for out-of-tree build * - fix build_ext bug - add wheels script cases for mac wheels	2023-03-05 12:23:43 -06:00
Ashay Rane	987d5ab335	CI: use `sudo` to remove Docker-created files (#1905)	2023-02-27 17:44:50 -06:00
Ashay Rane	ea00371d85	CI: clear workspace directory before checkout (#1900) We have recently started seeing errors like: ``` Synchronizing submodule url for 'externals/llvm-project' Synchronizing submodule url for 'externals/mlir-hlo' /usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1 Error: fatal: Unable to create '/home/anush/actions-runner/_work/torch-mlir/torch-mlir/.git/modules/externals/llvm-project/index.lock': File exists. ``` As a workaround, this patch removes the workspace directory before the checkout step.	2023-02-24 14:44:35 -06:00
Ashay Rane	268364e061	CI: install `unzip` before using it (#1893) The RollPyTorch action needs the `unzip` command to peek into WHL files for fetching metadata. This patch makes sure that the command is installed before referencing it.	2023-02-19 17:49:08 -06:00
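
The #2818 entry in the table describes several test-framework knobs. It sets a default worker count of `nproc * 0.8 + 1`, down from `nproc * 1.1`, and allows an override through the `TORCH_MLIR_TEST_CONCURRENCY` environment variable. It also adds a 5-minute test timeout and starts workers with spawn rather than fork. The Python sketch below shows one way such a policy could be wired up. The names `default_test_concurrency`, `make_test_pool`, `run_tests`, and `TEST_TIMEOUT_SECONDS` are hypothetical. So are the integer truncation and the per-result timeout. None of this is torch-mlir's actual test framework code.

```python
import multiprocessing
import os

# 5-minute bound from #2818; applying it per awaited result is an assumption.
TEST_TIMEOUT_SECONDS = 5 * 60


def default_test_concurrency(nproc=None, env=None):
    """Hypothetical helper: choose how many test workers to run.

    An explicit TORCH_MLIR_TEST_CONCURRENCY value wins; otherwise use
    nproc * 0.8 + 1, truncated to an int (the truncation is assumed).
    """
    if env is None:
        env = os.environ
    override = env.get("TORCH_MLIR_TEST_CONCURRENCY")
    if override:
        return max(1, int(override))
    if nproc is None:
        nproc = os.cpu_count() or 1
    return max(1, int(nproc * 0.8 + 1))


def make_test_pool(env=None):
    """Hypothetical: build a worker pool that uses the spawn start method.

    Spawn starts each worker in a fresh interpreter instead of forking the
    parent process, which #2818 reports was needed to avoid instability.
    """
    ctx = multiprocessing.get_context("spawn")
    return ctx.Pool(processes=default_test_concurrency(env=env))


def run_tests(pool, test_fn, tests):
    """Hypothetical: run tests in the pool without waiting forever.

    Each get() waits at most TEST_TIMEOUT_SECONDS from the moment it is
    called; a test that does not finish in time is recorded as None.
    """
    pending = [pool.apply_async(test_fn, (test,)) for test in tests]
    results = []
    for result in pending:
        try:
            results.append(result.get(timeout=TEST_TIMEOUT_SECONDS))
        except multiprocessing.TimeoutError:
            results.append(None)
    return results
```

On a 32-core runner, this policy gives `int(32 * 0.8 + 1) = 26` workers instead of the 35 implied by `nproc * 1.1`. That trades some parallelism for lower peak memory. Spawn re-imports the main module in every worker, so a caller would create the pool under an `if __name__ == "__main__":` guard.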


192 Commits (859f5d280f68bf6b5e5c0ac61159630ec9824217)
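
The #2120 entry in the table describes two separate rules. First, ccache treats compiles that use `/Zi` or `/ZI` as uncacheable, but ccache v4.8 wrongly rejected any flag starting with `/Z`, such as `/Zc`. Second, on Windows the Release configuration had to be requested with both `-DCMAKE_BUILD_TYPE=Release` and `--config Release`. The Python sketch below models these rules. The function names are hypothetical, and the predicates are simplified models of the behavior the commit message describes, not ccache's source code.

```python
def ccache_4_8_rejects(args):
    """Hypothetical model of the ccache v4.8 bug described in #2120:
    any MSVC flag beginning with /Z marks the compile as uncacheable."""
    return any(arg.startswith("/Z") for arg in args)


def ccache_fixed_rejects(args):
    """Hypothetical model of the fixed rule: only /Zi and /ZI, which send
    debug info to a separate PDB file, make the compile uncacheable."""
    return any(arg in ("/Zi", "/ZI") for arg in args)


def cmake_release_commands(source_dir, build_dir):
    """Configure and build commands that request Release explicitly.

    CMAKE_BUILD_TYPE covers single-config generators (e.g. Ninja);
    --config Release covers multi-config generators such as Visual Studio,
    which otherwise default to a Debug build. Per #2120, --config appears
    to be ignored on Linux and macOS.
    """
    configure = ["cmake", "-S", source_dir, "-B", build_dir,
                 "-DCMAKE_BUILD_TYPE=Release"]
    build = ["cmake", "--build", build_dir, "--config", "Release"]
    return configure, build


if __name__ == "__main__":
    msvc_flags = ["/O2", "/Zc:inline", "/EHsc"]
    print(ccache_4_8_rejects(msvc_flags))    # True: /Zc matches the /Z prefix
    print(ccache_fixed_rejects(msvc_flags))  # False: no /Zi or /ZI present
    print(cmake_release_commands(".", "build")[1])
```

Passing the Release setting at both steps is redundant on single-config generators but harmless. That matches the commit's observation that Windows needed both and the other platforms tolerated the extra flag.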