torch-mlir/tools/mnist-playground
Stella Laurenzo af4edb63ae Start reworking towards a shared library build.
* Need to have a DAG of shared library deps in order to interop across Python extensions (as presented in ODM).
* Introduced add_npcomp_library and friends to mirror the MLIR setup.
* Adds a libNPCOMP.so shared library.
* Redirects tools and extensions to link against libNPCOMP.so (instead of static libs).
* Moves all libraries to lib/, all binaries to bin/, and all Python extensions to python/. The invariant is that the rpaths are set up to have a one-level directory structure.
* Reworks the _torch_mlir extension to build like the others (still need to come up with a consolidated rule for this instead of open-coding it).
* Includes an upstream version bump to pick up needed changes.

Sizes with dynamic linking (stripped, release, asserts enabled):
  libNPCOMP.so: 43M (includes much of the underlying LLVM codegen deps)
  libMLIR.so: 31M
  _npcomp.so: 1.6M (python extension)
  _torch_mlir.so: 670K (python extension)
  npcomp-capi-ir-test: 6.3K
  npcomp-opt: 351K
  npcomp-run-mlir: 461K
  mnist-playground: 530K

Still more can be done to normalize and optimize, but this gets us structurally to the starting point.
2020-10-09 16:02:58 -07:00
CMakeLists.txt Start reworking towards a shared library build. 2020-10-09 16:02:58 -07:00
README.md [RefBackend] Rename "E2E" to RefBackend. 2020-10-07 10:29:48 -07:00
fc.mlir Add hopefully short-lived mnist-playground utility. 2020-10-05 13:59:06 -07:00
mnist-playground.cpp [RefBackend] Put JITModule in refback namespace. 2020-10-08 09:07:00 -07:00

README.md

mnist-playground

This is intended to be a short-lived "playground" for doing various experiments, guided by a real model use case, for improving the npcomp reference backend.

It's expected that utilities developed here will graduate to a more general utility or that this utility will be obsoleted by Python-driven flows once those come online.

Goals:

  • Obtain a performance-grounded analysis of the TCF/TCP design + reference backend design, and improve the designs.

  • Make forward progress on TCF/TCP + reference backend while the PyTorch frontend is being brought up.

Rough sketch of how we intend to get there:

  1. Link against PyTorch, and write a simple routine to do inference on a simple FC MNIST model (see the sketch after this list).

  2. Write a similar routine in TCF, extending TCF and the reference backend as needed for functional completeness. The PyTorch code serves as a numerical correctness reference.

  3. Run and profile the reference backend and obtain a set of action items for design improvements, both to performance and stability. The PyTorch code serves as a performance baseline.

  4. Implement important action items on a priority basis, and document remaining major design issues that don't make sense to address at this time, along with a justification for why the current design doesn't prevent us from eventually solving them. Iterate between this step and the previous one as needed.

  5. (Stretch) Add support for convolutional MNIST and/or training.
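
To make step 1 concrete, here is a minimal sketch of what the PyTorch-side reference routine could look like, written against the ATen C++ API. The function name, signature, and layer sizes are illustrative assumptions rather than the actual code in mnist-playground.cpp. Note that the reshape and softmax at the edges are exactly the ops called out as missing from TCF in the status below.

    #include <ATen/ATen.h>

    // Hypothetical FC MNIST forward pass. Names and layer sizes are
    // illustrative; the real routine lives in mnist-playground.cpp.
    at::Tensor fcMnistForward(const at::Tensor &image, const at::Tensor &w1,
                              const at::Tensor &b1, const at::Tensor &w2,
                              const at::Tensor &b2) {
      // Flatten [batch, 28, 28] images into a [batch, 784] matrix.
      at::Tensor x = at::reshape(image, {image.size(0), 784});
      x = at::relu(at::linear(x, w1, b1)); // hidden layer
      x = at::linear(x, w2, b2);           // output logits
      return at::softmax(x, /*dim=*/1);    // class probabilities
    }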

Current Status

Step 1. DONE

Step 2. MOSTLY DONE. Still need to improve the op set to make the FC MNIST support more complete; in particular, reshape and softmax still need to be implemented.

Step 3. STARTING. Initial performance on 10x784x100 (10 FC output features, batch 100) is roughly 66x off from PyTorch. No profiling done yet.

Example command line (the .mlir file and -invoke are similar to npcomp-run-mlir):

$ mnist-playground tools/mnist-playground/fc.mlir -invoke fc
PyTorch: numRuns: 16384 nsPerRun: 3.947563e+05
RefBackend: numRuns: 256 nsPerRun: 2.471073e+07
Ratio (RefBackend / PyTorch): 62.5974
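
The numRuns/nsPerRun figures above come from a repeated-timing loop. Below is a hedged sketch of how such numbers could be produced; the doubling strategy, function name, and time threshold are assumptions, not the actual harness in mnist-playground.cpp, though the power-of-two numRuns values in the output are consistent with a loop of this shape.

    #include <chrono>
    #include <cstdio>
    #include <functional>

    // Illustrative benchmark loop; the real harness may differ. Doubles
    // the run count until the total wall time crosses a threshold, then
    // reports nanoseconds per run.
    double benchmarkNsPerRun(const std::function<void()> &fn,
                             double minSeconds = 0.5) {
      using Clock = std::chrono::steady_clock;
      for (int numRuns = 1;; numRuns *= 2) {
        auto start = Clock::now();
        for (int i = 0; i < numRuns; ++i)
          fn();
        std::chrono::duration<double> elapsed = Clock::now() - start;
        if (elapsed.count() >= minSeconds) {
          double nsPerRun = elapsed.count() * 1e9 / numRuns;
          std::printf("numRuns: %d nsPerRun: %e\n", numRuns, nsPerRun);
          return nsPerRun;
        }
      }
    }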

There is currently a fragile dependency between the hardcoded at:: function calls in the .cpp file and the TCF code in the .mlir file. A correctness check is done to make sure they agree. Once we have a PyTorch frontend and/or an ATen round-trip backend online, we can avoid this fragility.
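
The correctness check amounts to comparing the two results within a floating-point tolerance. A minimal sketch, assuming the ATen result and the reference-backend result are already materialized as tensors (the function name and tolerances are illustrative):

    #include <ATen/ATen.h>
    #include <cstdlib>
    #include <iostream>

    // Sketch of the agreement check described above. `atenResult` comes
    // from the hardcoded at:: calls; `refBackendResult` from invoking the
    // compiled fc.mlir through the reference backend's JITModule.
    // Tolerances are illustrative assumptions.
    void checkResultsAgree(const at::Tensor &atenResult,
                           const at::Tensor &refBackendResult) {
      if (!at::allclose(atenResult, refBackendResult, /*rtol=*/1e-5,
                        /*atol=*/1e-8)) {
        std::cerr << "error: RefBackend result disagrees with the ATen "
                     "reference\n";
        std::abort();
      }
    }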