Building a Tor binary with optimizations for my CPU

Since my relay recently started using more CPU than before, I decided to build a binary that is allowed to use all instructions available on my CPU. I was especially interested in enabling AVX2.
First of all, I tried to get chutney running on my Windows machine. After several hacks it started, but the results were strange: about 2% CPU use for a single Tor process and 2.81 MBytes/s Single Stream Bandwidth. That is surely wrong; the speed was limited by something. I decided not to dig into it further and to proceed without benchmarks.
I ran ./configure --help and searched for useful options, but the only thing I found was the “influential environment variable” CFLAGS. Ok.
Usually I build tor with ./ && ./configure --disable-unittests --disable-module-dirauth --disable-manpage --disable-html-manual --disable-asciidoc && make, but this time I tried ./ && CFLAGS="-march=haswell" ./configure --disable-unittests --disable-module-dirauth --disable-manpage --disable-html-manual --disable-asciidoc && make.
The results were suspicious: the binary shrank from 14 MB to 5 MB. That is not what I expected from changing the instruction set. The next surprise was increased CPU load. After looking at the disassembly, I found that AVX2 was indeed in use, but the binary had also been built with the wrong curve25519_donna implementation.
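For anyone who wants to repeat the disassembly check, here is a minimal sketch. It is a heuristic, not proof: ymm registers are used by both AVX and AVX2 code, so a non-zero count only shows that 256-bit vector instructions ended up in the binary. The function name count_ymm_lines is made up for this example, and the path src/app/tor is an assumption about where the build puts the binary.

```shell
#!/bin/sh
# Hypothetical helper: count disassembly lines that mention a 256-bit ymm register.
# Reads objdump-style text on stdin and prints the number of matching lines.
count_ymm_lines() {
    # grep -c prints 0 and returns non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c 'ymm[0-9]' || true
}

# Usage (binary path is an assumption, adjust to your build tree):
#   objdump -d src/app/tor | count_ymm_lines
```

Comparing the count between two builds is more telling than the absolute number. To see which target features a given -march value switches on in GCC itself, gcc -march=haswell -Q --help=target prints the list.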
By comparing logs from the different builds I found that -march=haswell for some reason deactivated the -g and -O2 flags. Ok. The next attempt used this line: ./ && CFLAGS="-march=haswell -g -O2" ./configure --disable-unittests --disable-module-dirauth --disable-manpage --disable-html-manual --disable-asciidoc && make. This time the binary came back to its 14 MB size, curve25519_donna was built correctly and AVX2 was enabled. Looks good. CPU usage of the Tor process looks normal.
But I’m not sure whether I should be satisfied with the result. 1. Did it actually become faster? 2. Did I switch the AVX2 optimizations on correctly, or did I break something (as happened with curve25519_donna on the first try)?
Can anyone say what the correct method of enabling AVX2 for Tor is? And does anyone know whether it makes Tor faster or slower?
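A likely explanation for the disappearing flags, assuming Tor's configure behaves like a standard autoconf script here: with GCC, autoconf falls back to "-g -O2" only when CFLAGS is not set at all, so a user-supplied CFLAGS replaces the defaults instead of being appended to them. The sketch below only imitates that rule with shell parameter expansion; effective_cflags is a hypothetical name, not something from Tor or autoconf.

```shell
#!/bin/sh
# Hypothetical helper imitating autoconf's default for GCC:
# use "-g -O2" only when CFLAGS is unset, otherwise keep the user's value untouched.
effective_cflags() {
    # ${VAR-word} expands to word only if VAR is unset (an empty but set VAR is kept).
    printf '%s\n' "${CFLAGS--g -O2}"
}

# Usage:
#   unset CFLAGS;            effective_cflags   # defaults apply
#   CFLAGS="-march=haswell"; effective_cflags   # defaults are gone
```

Under that assumption, passing -g -O2 explicitly together with -march=haswell, as in the second attempt, is the expected way to keep the usual optimization level.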

upd. I made a mistake in the first version of this topic: the binary produced with -g -O2 -march=haswell did not use AVX2. I rebuilt it with -march=haswell -g -O2 and now it has AVX2. GCC is very strange software.