[RefBackend] Add elementwise fusion and buffer deallocation

This gives some decent improvements to memory consumption and latency of
testing. I would have expected buffer-deallocation to make a big
difference to the final process RSS, but it doesn't appear to. Also,
running buffer-deallocation later in the pipeline results in
miscompiles. I didn't have the time or interest to dig in deeper, but
something is off.

(numbers below are taken from a single run, but I did do a few runs to make
sure that the variance wasn't that great)

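- Linalg-on-Tensors shows memory consumption improvements (RSS down about 18% with fusion, about 25% with fusion plus deallocation) and some slight speedups (wall-clock down about 7-10%).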
```
./tools/e2e_test.sh -s -v -c refbackend
fuse=0 dealloc=0
RSS: 3071.33 MB
real    3m58.204s
user    6m22.299s
sys     0m51.235s
fuse=1 dealloc=0
RSS: 2515.89 MB
real    3m34.797s
user    5m56.902s
sys     0m44.933s
fuse=1 dealloc=post-bufferize
RSS: 2290.25 MB
real    3m42.242s
user    6m0.560s
sys     0m46.335s
```

- TOSA ResNet18 gets significantly faster (wall-clock down about 40%) and uses significantly less memory (RSS down about 35%).
```
time ./tools/e2e_test.sh -s -v -c tosa -f ResNet18
fuse=0 dealloc=0
RSS: 1328.56 MB
real    0m50.303s
user    0m55.355s
sys     0m12.260s
fuse=1 dealloc=0
RSS: 859 MB
real    0m30.454s
user    0m35.551s
sys     0m11.879s
fuse=1 dealloc=post-bufferize
RSS: 851 MB
real    0m30.313s
user    0m39.889s
sys     0m11.941s
```
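
The relative changes behind these numbers can be recomputed with a short script. This is only a minimal sketch: the helper names `parse_time` and `percent_reduction` are made up for illustration, and the inputs are copied from the TOSA ResNet18 runs above (fuse=0 dealloc=0 versus fuse=1 dealloc=0).

```python
import re


def parse_time(text: str) -> float:
    """Convert a `time`-style duration such as '3m58.204s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return 100.0 * (before - after) / before


# Values copied from the TOSA ResNet18 measurements in this message.
baseline_rss_mb, fused_rss_mb = 1328.56, 859.0
baseline_real = parse_time("0m50.303s")
fused_real = parse_time("0m30.454s")

print(f"RSS reduction:  {percent_reduction(baseline_rss_mb, fused_rss_mb):.1f}%")
print(f"real reduction: {percent_reduction(baseline_real, fused_real):.1f}%")
```

The same two helpers apply to the Linalg-on-Tensors block by swapping in its RSS and `real` values.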

Big thanks to Ramiro for the methodology here for measuring the RSS with
`psutil`:
https://gist.github.com/ramiro050/5b5c2501f7389c008d9029210772c3a8
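
For readers who cannot open the gist, the general idea of tracking peak RSS with `psutil` might look roughly like the sketch below. It is not Ramiro's script: the class `PeakRss` and the function `psutil_rss_bytes` are hypothetical names, and summing the RSS of child processes is an assumption about how a multi-process test run would be measured.

```python
from typing import Callable


def psutil_rss_bytes() -> int:
    """RSS of this process plus all of its children, in bytes.

    Requires the third-party `psutil` package; imported lazily so the
    rest of the sketch works without it.
    """
    import psutil

    proc = psutil.Process()
    total = proc.memory_info().rss
    for child in proc.children(recursive=True):
        try:
            total += child.memory_info().rss
        except psutil.NoSuchProcess:
            # The child exited between listing and sampling; skip it.
            pass
    return total


class PeakRss:
    """Keeps the largest RSS value seen across repeated samples."""

    def __init__(self, read_rss_bytes: Callable[[], int]):
        self._read = read_rss_bytes
        self.peak_bytes = 0

    def sample(self) -> int:
        """Take one reading and return the peak so far, in bytes."""
        self.peak_bytes = max(self.peak_bytes, self._read())
        return self.peak_bytes

    @property
    def peak_mb(self) -> float:
        return self.peak_bytes / (1024 * 1024)


# Hypothetical usage: call tracker.sample() periodically while tests run,
# then report the peak.
#   tracker = PeakRss(psutil_rss_bytes)
#   tracker.sample()
#   print(f"RSS: {tracker.peak_mb:.2f} MB")
```

Passing the reader in as a callable keeps the peak-tracking logic independent of `psutil`, so a different measurement source can be substituted.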
pull/1697/head
Sean Silva 2022-12-07 13:35:10 +00:00
parent 29c8823464
commit 69171c246a
1 changed files with 9 additions and 0 deletions

@@ -116,6 +116,14 @@ class RefBackendInvoker:
LOWERING_PIPELINE = "builtin.module(" + ",".join([
"func.func(refback-generalize-tensor-pad)",
# Apply some optimizations. It would be great if MLIR had more useful
# optimizations that worked out of the box here.
# Note: When measured, this doesn't seem to actually help that much
# for the linalg-on-tensors backend.
# This is likely because if things are naturally fusable we usually already
# emit things in that form from the high level (e.g. single linalg-generic).
# Other backends are likely to benefit more.
"func.func(linalg-fuse-elementwise-ops)",
# Bufferize.
"func.func(scf-bufferize)",
"func.func(tm-tensor-bufferize)",
@@ -126,6 +134,7 @@ LOWERING_PIPELINE = "builtin.module(" + ",".join([
"refback-mlprogram-bufferize",
"func.func(tensor-bufferize)",
"func.func(finalizing-bufferize)",
"func.func(buffer-deallocation)",
# Munge to make it ExecutionEngine compatible.
# Specifically, we rewrite calling convention boundaries to be in terms
# of unranked memref, and we rewrite the return to actually be a
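
To make the ordering in this diff concrete, here is a simplified sketch of how such a pipeline string is assembled and how the placement of the two new passes could be checked. It lists only the passes visible in the hunks above (the real `LOWERING_PIPELINE` contains more), the closing `")"` is an assumption about the part of the file not shown, and `build_pipeline`, `pass_name`, and `check_order` are hypothetical helpers.

```python
# Subset of passes taken from the diff; the real list is longer.
PASSES = [
    "func.func(refback-generalize-tensor-pad)",
    "func.func(linalg-fuse-elementwise-ops)",  # added by this commit
    "func.func(scf-bufferize)",
    "func.func(tm-tensor-bufferize)",
    "refback-mlprogram-bufferize",
    "func.func(tensor-bufferize)",
    "func.func(finalizing-bufferize)",
    "func.func(buffer-deallocation)",  # added by this commit
]


def build_pipeline(passes):
    """Join pass entries into one textual pipeline string."""
    # Assumption: the original closes the module with a trailing ")".
    return "builtin.module(" + ",".join(passes) + ")"


def pass_name(entry):
    """Strip a `func.func(...)` wrapper, if present."""
    prefix = "func.func("
    if entry.startswith(prefix) and entry.endswith(")"):
        return entry[len(prefix):-1]
    return entry


def check_order(passes):
    """True if fusion precedes bufferization and dealloc follows it."""
    names = [pass_name(p) for p in passes]
    fuse = names.index("linalg-fuse-elementwise-ops")
    first_bufferize = min(
        i for i, n in enumerate(names) if n.endswith("bufferize")
    )
    finalize = names.index("finalizing-bufferize")
    dealloc = names.index("buffer-deallocation")
    return fuse < first_bufferize and finalize < dealloc


print(build_pipeline(PASSES))
print(check_order(PASSES))
```

The check mirrors the placement chosen in the diff: elementwise fusion runs before any bufferization pass, and `buffer-deallocation` runs right after `finalizing-bufferize`.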