* Need to have a dag of shared library deps in order to interop across python extensions (as presented in ODM).
* Introduced add_npcomp_library and friends to mirror the MLIR setup.
* Adds a libNPCOMP.so shared library.
* Redirects tools and extensions to link against libNPCOMP.so (instead of static libs).
* Moves all libraries to lib/, all binaries to bin/ and all python extensions to python/. The invariant is that the rpaths are setup to have a one level directory structure.
* Reworks the _torch_mlir extension to build like the others (still need to come up with a consolidated rule to do this instead of open coded).
* Includes an upstream version bump to pick up needed changes.
Sizes with dynamic linking (stripped, release, asserts enabled):
libNPCOMP.so: 43M (includes much of the underlying LLVM codegen deps)
libMLIR.so: 31M
_npcomp.so: 1.6M (python extension)
_torch_mlir.so: 670K (python extension)
npcomp-capi-ir-test: 6.3K
npcomp-opt: 351K
npcomp-run-mlir: 461K
mnist-playground: 530K
Still more can be done to normalize and optimize but this gets us structurally to the starting point.
I now realize that VerboseCamelCase is not the best choice for dialect
directory/file names and C++ identifiers (take e.g. "Linalg", "Basicpy",
etc. as prior art here; not LinearAlgebra or BasicPython). If I had to
name the convention it seems to be "Shortword" (or of course just
acronym dialects like LLVM, SCF, etc.).
This rename also has the side benefit of differentiating RefBackend
directories, which now refer to the actual backend itself, from
Refback/Refbackrt, which are the dialects which happen to be used by
that backend.
Other than the dialect definitions (which will live in standard Dialect/
subdirectory), the goal here is to keep RefBackend-related code nested
in {include/npcomp,lib,test}/RefBackend.
This is the first in a patch series that is refactoring the
constellation of things variously called or associated with "E2E",
"RefE2E", "npcomprt", and "TCP" into a more cleanly layered result.
Concretely, this first patch fixes the fact that TCP was basically
acting like a dumping ground needed by the reference backend. This
splits it out, which is fairly mechanical, but touches a lot of lines of
code (basically replacing `tcp` with `refback` and `TCP` with
`RefBackend).
Now, the RefBackend dialect is that dumping ground, which
is slighly better, as it starts allowing TCP to become a nice clean
middle layer that is not related per se to the reference backend.
The previous name RefE2E or "reference e2e flow" was super confusing.
Now that we are seeing more clearly where the "backend" distinction
lies, the [RefBackend] commit tag is born :)
I was seeing some miscompiles due to the uninitialized data read here
before. Interestingly, this was masked in some of our previous test
cases, since the uninitialized data "always" was so small that it would
present as a rounding error for the 1.0-10.0 sized values that the
matmul was computing on.
* Adds at::Tensor -> MlirValue tracking.
* Adds conversions for tensor and scalar types to MLIR types.
* Adds npcomp C APIs for constructing custom types.
* Reworks pybind include so as to get Torch pybind helpers (needed to pass at::Tensor type from Python->C++).
* Uses the MLIR-C API since that will save us a lot of grief down the road (i.e. will give PyTorch and libMLIR/libNPCOMP the ability to skew version-wise).
* Quite a few TODOs and not yet populating the function in any way.
* Uses the new dispatcher API.
* Just prints to the console for the moment when an op is captured.
* Executes the op through the existing implementation.
Now that we upstreamed our pass, we can remove it.
The final pass that landed upstream doesn't do the shape.assuming
canonicalization to legalize that op away, so added a
restricted-canonicalizer pass that allowed to run just shape dialect
canonicalizations, which deletes the shape.assuming.
The pass ended up kind of ugly. See the TODO's on it for some potential
cleaner directions.
Date: Fri Sep 18 13:55:52 2020 -0700
- Update to linalg syntax
- New generated builders are better. Custom builder for
tcp.shaped_results is now redundant.
This cleans up the lowering pipeline to easily allow extending to
multiple binary ops. It looks fairly repetitive at multiple levels, but
I don't want to prematurely generalize. I think that in principle we
could derive a large swatch of TCF + TCP from a single linalg-style
specification. Another direction is to use an OpInterface (something
like "buildLinalgGenericBody"). I'm keeping my eye on it.
In a subsequent commit, I'll mechanically add a set of binary ops
modeled off of the std arithmetic ops.
It was previously going through this awkward route that prematurely
created linalg.generic ops, which was an annoying layering problem since
we can't compute a shape transfer function for linalg.generic in the
general case. Now we pass it through the same path as tcp.matmul, with
the shape transfer function being defined for tcp.add.
This also removed the need for TCPToLinalg (now deleted). The equivalent
of that is happening in lower-shaped-results-to-memref. One interesting
outcome of this: we're basically using linalg as a "Buffer TCP". We
might want to look into using named structured ops for more of TCP, but
that would be a big velocity hit since then any change to the ODS /
verification for those ops would be a change to the upstream structured
op ODS generator. After we have more experience defining this manually,
we should re-evaluate rebasing TCP on generated named linalg ops.
I'm pretty happy with how this turned out. It looks pretty much like it
should -- one change at each layer. This particular op bottoms out on
linalg which takes care of the rest.
- Add tcf.matmul
- Add tcp.matmul
- Add TCF->TCP lowering
- Add tcp.matmul shape transfer function (BypassShapes.cpp)
- Add tcp.matmul -> linalg.matmul lowering (LowerShapedResultsToMemref.cpp)
- Add support to LowerShapeConstraints for lowering the new
shape.cstr_require
This matmul op is pretty limited in its capabilities. There is no
batching and no multidimensional contraction. Certainly more design work
will be needed to find the right abstractions that aren't too general
but also help to canonicalize many cases from frontends. This is mainly
to show that adding a new op needn't be very "scary" once we have the
e2e infra in place.
Also,
- this clears out some exploratory cruft from the TCF dialect now that
this is starting to become real.
* Includes pybind11 directly (for some reason using the pytorch helper header for this depends on a source file not in the image).
* Installs nnpack into the image.
* Installs new-clang and LLD and configures environment to use it (otherwise, link time is terrible).
* Fixes a gcc compile error (in the off chance you build with default gcc compiler).
* Tests are failing based on some dialect registration stuff that must not have been factored correctly. Will followup with a fix.
This now gets the overall "RefE2E" compilation stack to a point that I'm
fairly happy with. We simplify it by mostly embracing the "descriptor"
view of the world.
The overall flow is best understood by reading through the
createE2ELoweringPipeline function in lib/E2E/E2E.cpp
That function creates a pass pipeline that lowers from "TCF" (which is
~numpy level of abstraction) down to LLVM IR.
A brief high-level summary of what happens there:
1. TCF to TCP conversion. This involves reifying error handling in the
form of shape constraints. See test/Conversion/TCFToTCP/basic.mlir
2. Lowering shape constraints. This converts shape constraints into
eager error-handling code. See test/E2E/lower-shape-constraints.mlir
This pass will soon go upstream.
Because this lowers to std.assert, some later passes like
LowerToNpcomprtABI and LowerToLLVM are updated to properly plumb this
through e2e.
See test/npcomp-run-mlir/invalid-broadcast.mlir for an execution test
that properly aborts in case of an error.
3. Lowering tensors to memrefs. This is done via a series of passes
rather than an single mega conversion. Unlike the previous code that
mixed in the npcomprt ABI stuff here, it's now a very clean "pure
memref" conversion.
See test/E2E/lower-*-to-memref.mlir and
lib/E2E/TensorToMemref/
Most of the changes are concentrated here.
4. As part of the above, we use the upstream ConvertShapeToStandard for
lowering shapes.
5. We lower linalg to loops and lower loops to CFG using upstream
passes.
6. Rewrite the "ABI" boundaries of the program to npcomprt data
structures (LowerToNpcomprtABI). This mainly affects ABI boundaries and
how global tensor constants are represented. One of the major
improvements in this commit is that now it's a very clean rewrite that
just replaces memrefs on ABI boundaries with !npcomprt.tensor (before
there was a get_extent function that is not needed).
See test/E2E/lower-to-npcomprt-abi.mlir
7. Lower to LLVM with upstream mlir patterns + some patterns for the
npcomprt lowerings.
One aspect here that is still a remnant of a non-descriptor-based tensor
to memref flow is the BypassShapes + LowerShapedResultsToMemref.
BypassShapes wraps the "tensor compute" ops in a tcp.shaped_results
(basically a "tie_shape" kind of op), and then
LowerShapedResultsToMemref uses those annotations to allocate output
buffers while lowering the "tensor compute ops". Note that there are
very few "tensor compute" ops currently supported (tcp.add +
tcp.broadcast_to), so we just hardcode them in both passes.
Realistically, I expect this to go away as we fully embrace the
descriptor-based approach for simplicity, so don't look too deep into
it.
* llvm-project: b5924a8e27536d19dd5c4d302db29fb6163d5faa
* mhlo: 848ca244d20f045b7921da55a98a04d95ef94f0e
* Multiple breakages that need to be fixed.
Fixes:
* Refactor dialect registration
* Remove all kindof methods (Casting functionality has been added upstream and is implicitly
available, see https://llvm.discourse.group/t/removing-kinds-from-attributes-and-types/1547.)
* Update dialect registration to comply with https://reviews.llvm.org/D85495.
* Remove type kinds and update some changed dialect signatures.
* Upgrade ATen dialect to match upstream needs.
* Move dialect registration to tablegen.
* Register the ListType in tablegen.
* Change dialect initialization signature.
* Use TypeSwitch in MlirIr location printer.
* Remove global registry depends from npcomp-opt.
* Change LowerToLLVM to pass an MLIRContext vs an LLVMDialect for type creation.
* Remove dep on MLIREDSCInterface that is removed upstream.
* Thread through the DialectRegistry for opt and python-like tools.
* Modernize pass registration (This was forced because the GEN_PASS_REGISTRATION code now generates inline functions vs literal pass registration statements)
Co-authored-by: Marius Brehler <marius.brehler@iml.fraunhofer.de>
This patch adds a dialect intended to be used as a frontend dialect
to facilitate lowering from "A Tensor Library" in torch/pytorch.
This patch includes several passes that are useful in conjuction with the
dialect:
--aten-layer-name: Generates layer names for each operation, which are not
present in the original pytorch.
--aten-to-std: Lower the ATen dialect into standard dialect function calls.
--return-elimination-pass: convert functions (primarily the toplevel function)
to pass return values by reference. This simplifies pytorch integration.
--aten-op-report: generate a textual report about the model
--liveness-report
Future patches will implement actual integration with the pytorch jit to
intercept and generates MLIR in this dialect, then lower the resulting MLIR
into function calls through aten-layer-name -> aten-to-std ->
return-elimination -> std-to-llvm. The result would then jitted using the LLVM
jit, linked against a runtime library which makes calls back into pytorch to
implement all the layers.
Co-authored-by: Jeff Fifield <jeff.fifield@xilinx.com>
Co-authored-by: Jeff Fifield <jeff.fifield@xilinx.com>
Mostly this is CMake cleanup. Several library dependencies are missing, which
is often revealed with shared library builds. Also, it's generally bad to
link directly against LLVM libraries because it fails when using
LLVM_LINK_LLVM_DYLIB. MLIR will pull in libLLVM.so, and there will be
duplicate linkage with the the explicit libraries. There may need to be more
refactoring here.
* Since the manylinux images do not hard-link against python libs (resolving them at runtime), the module must be built without resolving undefined references.
* For some reason, builds on this platform are stricter, exposing dependency ordering issues.
* The CMake bits to build the extension are still somewhat of a mess. I have better versions both upstream and in IREE and will be reconciling. For now, don't look too closely.
* Primarily, the upstream shape dialect now uses tensor<?xindex> for non-erroring, immediate shape calculations (and will return this for shape_of of a tensor or memref).
* In addition, upstream passes do not yet exist for fully lowering to standard ops, so the passes here need to be extended to handle this new convention.
* This should be seen as an intermediate state, necessary to integrate a new LLVM version and needs more work and cleanup for generality.
* There is a good deal of awkwardness in these conversions. The hope is that additional upstream work will yield better defined conversion paths once out of this intermediate state.
My main interest in this is that tweaking the default of this flag is a
quick way to check for miscompiling canonicalizations / op definitions
not annotated properly (e.g. marked NoSideEffect when in fact it is not
safe to do so).
* Enables e2e test.
* With what I've learned in upstream about test directory layout, I can consolidate most of the separate directories we have for these things. Will do that in a followup.
* Not pleased with the LLVM global initialization depends but serviceable for now.
This required making module descriptors hold a FuncDescriptor* instead
of a pointer to array of FuncDescriptors as it previously did, which is
innocuous (just requires an llvm.bitcast after the llvm.mlir.addressof).
* Rewrites public function signatures to operate on tensors (vs ndarray).
* Most of our backends presume immutable tensors at public function boundaries.
Also, the previous code had a special case for deleting this op when it
had no uses. This is subsumed by the change in this commit since now
shape.const_shape is properly lowered.
With this change, the included test case with multiple serially
dependent ops works!
This specific issue was related to the scalar argument to that
function. We needed to compute a broadcast of a scalar shape (which is a
shape.const_shape) with another shape.
This ~totally reworks the existing "runtime" stuff to be more
principled and usable, such as from Python. It's still not fully
production-quality, mainly in the department of memory management (e.g.
it currently leaks memory; we need to figure out "who frees memrefs" +
the analysis and transformation needed to do that (maybe use upstream
buffer allocation pass?)).
The user API is in include/npcomp/runtime/UserAPI.h, though
include/npcomp/JITRuntime/JITModule.h is a friendlier wrapper.
The stuff under {include,lib}/runtime is totally firewalled from the
compiler and tiny (<6kB, though no attention has gone into optimizing
that size). For example, we don't link in libSupport into the runtime,
instead having our own bare bones replacements for basics like ArrayRef
(the JITRuntime helps with bridging that gap, since it *can* depend on
all common LLVM utilities).
The overall features of npcomprt is that it exposes a module that
with multiple function entry points. Each function has arguments and
results that are tensor-valued, and npcomprt::Tensor is the runtime type
that is used to interact with that (and a npcomprt::Ref<T>
reference-counting wrapper is provided to wrap npcomprt::Tensor in the
common case).
From an implementation perspective, an npcomprt module at the
LLVM/object/binary level exposes a single module descriptor struct that
has pointers to other metadata (currently just a list of function
metadata descriptors). All interactions with the npcomp runtime are
keyed off of that module descriptor, including function lookups and
dispatching. This is done to dodge platform ABI issues and also allow
enough reflection to e.g. verify provided arguments.
Most of the compiler-side work here was in LowerToNpcomprtABI and
LowerToLLVM.
Also,
- Rename npcomp_rt/NpcompRt to npcomprt/Npcomprt; it was getting
annoying to type the underscores/caps.
- misc improvements to bash_helpers.sh
* This elides the very common code the compiler adds for chaining otherwise tensor-related numpy ops together.
* More aggressive canonicalizations would require more advanced analysis.
* Preserving shape across the copy ops makes more thing shaped by default.
* Inference of ndarray types will now preserve the shape when specializing the dtype.
* Correctly infers the unknown dtypes that emit as part of compilation for the simple ufunc case.
* Significant more testing needs to be done on the details now that the pass is minimally functional.
* The actual pass itself is still too hacky/not general, but the underlying analysis is further along.
* Adds an op interface for adding CPA constraints.
* Adds a type conversion hook for handling built-in types (that we can't have adopt our interface).
* Converts tensor<> to object(!Tensor, [e:<type>]) just like NdArray.
* Implement a few numpy ops far enough to do dtype inference for simple sequences.
* This starts to lay down the infra for reasoning about calls
* Adds the importer code to generate IR for function calls of compiler recognized static functions.
* Adds python bindings for invoking flow, HAL, and VM lowering pipelines.
* Adds pythong bindings for translating to VM module flatbuffer.
* Adds a new backend_test/iree directory and configure lit to find the IREE python rt bindings.
* Open code a simple_invoke.py that exercises the whole pipeline (need real APIs for a lot of this).
* Fails when invoking the function because I never implemented argument marshaling for scalars :(
* Plenty of stuff to do tomorrow.
* Conversions to std for numeric binary expressions, numeric to_boolean, and numeric comparisons.
* Added folders to constant ops to comply with requirements of the pass system.
* Extended the frontend with parameter/result annotation processing for primitives (can specify types for function arguments).
* Added (empty) directory/sources for IREEVM conversions. These are only enabled if IREE is enabled.
The secret here is LLVM_ENABLE_WARNINGS=ON.
I also fixed a couple warnings, which gets us to be warning-clean.
I noticed also that npcomp-run-mlir/basic.mlir seems to be failing.
Maybe something since the latest integrate. My next commit (introduce
npcomp mini runtime) will largely rewrite it though, so it'll get fixed
then.
With this commit, we finish conversion to LLVM dialect, and should be
ready for subsequent commits to convert to an LLVM module and let LLVM
codegen to native machine code.
This required a custom "lower to LLVM" pass to support lowering
tcp.abort_if to a runtime call. In the future, this pass will grow to do
type conversions for our own runtime types as we add those.
This more clearly captures its semantics as a structural "observer" of
code that we currently mark as NoSideEffect but eventually lowers to
eager error handling code.
Also, update LowerRankedShapes to erase it, now that the layering here
is clear. That pass reifies the eager error handling code, so the need
for the dummy op to keep things alive isn't needed.
With this change, we are now ready to start lowering to LLVM!
This is the current print-ir-after-all from e2e-lowering-pipeline:
https://reviews.llvm.org/P8221
Specifically, we use unranked memrefs which get passed as a fixed-size
set of arguments/returns. One big caveat about this is that returning
results isn't going to work. See TODO in LowerTensorLoadOp.
This is far from enough runtime-wise, but it starts to demarcate a
plausible layering. Notice for example how this removes the
runtime-dependence from LowerRankedShapes.
Eventually, we want to have an `npcomp_rt` or `npcomp_hal` dialect with
its own set of runtime types that will supercede this.
See comments in LowerTensorLoadOp for more direction about where this is
going to evolve.
The idea was half-baked and after some deep thought felt like a solution
looking for a problem. What we had here (and is removed in this patch)
just wasn't pulling its weight.
I cannot think of anything we would want to do with tcp.island as it is
removed here beyond just sinking and merging them within a basic block,
such that the witness argument is kind of pointless (only matters for
hoisting).
TCP compute ops like tcp.add and tcp.broadcast_to have the strong
invariant of "pure or undefined behavior", which means they are always
safe to sink. The island concept as removed here conferred no benefit.
Also, I'll note that "islands" are a trick you can only play once in a
system (unless they strictly nest). I have some early-stage thoughs on
having an island concept that helps with modeling tensor shapes
robustly which seems promising (the island would serve a similar role as
tie_shape).
This uses an approach inspired by what is done in IREE. See comments on
LowerRankedShapes.cpp for how it works.
The basic gist is that we have an op that creates a !shape.shape from a
set of SSA values representing the extents, and then iteratively replace
any op producing a !shape.shape with instances of that op.
This also adds a small pass to clean up the `dim` ops that linalg
introduces. For now, it only has a trivial pattern that looks for a
`tcp.alloc_memref(%shape)` op to get the shape as we currently have an
invariant that all memrefs are the result of such ops.
But eventually this will need to look through view ops and any other
shape-ish stuff that linalg introduces as it lowers to loops, along with
any slicing ops introduced by buffer allocation.
There's a lot of details to flesh out here, but the basic approach seems
promising (see comments in createE2ELoweringPipeline).
This approach will be put to the test when we try to do our first
fusions since that tickles some of the nasty phase ordering issues
involved here.
But we're not there yet.
This makes sure we stay resonably canonically using the shape machinery.
(In fact, DimOp should probably be in the shape dialect since it hides a
`shape.shape_of` call)
* This is intended to provide low-level modeling for built-in objects.
* It is now possible to trace slice tuples (which are tuples of NoneType|EllipsisType|SlotObjectType<slice, ...>).