We have two servers: acceptance and production.
One day I noticed that the average page load time was 2 seconds on production, while it was 0.5 seconds on acceptance.
"Hm, the fucking DB guys are fucking around with the fucking DB," I thought.
I enabled log4jdbc and checked the DB query timings. They were the same as on acceptance.
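For reference, enabling log4jdbc roughly amounts to pointing the datasource at its spy driver and turning on the SQL timing logger. This is a sketch for a log4jdbc 1.2-style setup; the database URL is hypothetical and exact logger names depend on your version:

```properties
# Wrap the real JDBC URL with the log4jdbc: prefix and use the spy driver
jdbc.driverClassName=net.sf.log4jdbc.DriverSpy
jdbc.url=jdbc:log4jdbc:mysql://localhost:3306/mydb

# log4j.properties: log each SQL statement with its execution time
log4j.logger.jdbc.sqltiming=INFO
```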
Then I added more logging to the controller and saw that the controller itself was fast. Page rendering was slow.
I added more logging to the web layer. There was no single clear bottleneck in the application; it was damn slow all around. I checked the garbage collector: it was fine.
Then I checked all the logic spread across the application: permissions, logging, i18n. I found and fixed numerous bugs and did a pile of performance optimizations. But in the end production was still two times slower than acceptance.
Then I thought it must be the hardware. I checked memory, CPU, HDD. The servers were identical.
I implemented a small arithmetic progression summing algorithm without any third-party dependencies and ran it with plain java outside of Tomcat. Both servers were equally fast.
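A minimal sketch of that kind of microbenchmark; the class name, iteration counts, and warm-up loop are my own, not the original code:

```java
public class ProgressionBenchmark {

    // Sum of the arithmetic progression 1 + 2 + ... + n, computed
    // iteratively so the JIT actually has a hot loop to compile.
    static long sum(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        // Warm-up so the method gets JIT-compiled before we time it.
        for (int i = 0; i < 10; i++) {
            sum(1_000_000);
        }
        long start = System.nanoTime();
        long result = sum(100_000_000L);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("sum = " + result + " in " + elapsedMs + " ms");
    }
}
```

The point of such a benchmark is exactly that it exercises the JIT compiler: if compilation is broken, the same loop runs interpreted and much slower.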
Running the same algorithm inside Tomcat showed a fourfold difference on production.
Then I decided: it was the JVM.
I checked the JVM parameters and found that acceptance had the following flags, but production did not:
-XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing -XX:CodeCacheFlushingMinimumFreeSpace=20m
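A common way to apply such flags to Tomcat is via `CATALINA_OPTS` in `bin/setenv.sh`; the file name is Tomcat's convention, but the exact path depends on your installation:

```shell
# bin/setenv.sh -- picked up by catalina.sh on startup
CATALINA_OPTS="$CATALINA_OPTS -XX:ReservedCodeCacheSize=256m"
CATALINA_OPTS="$CATALINA_OPTS -XX:+UseCodeCacheFlushing"
CATALINA_OPTS="$CATALINA_OPTS -XX:CodeCacheFlushingMinimumFreeSpace=20m"
export CATALINA_OPTS
```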
Adding those flags to production fixed a problem that had dragged on for half a year.
Trying to do a retrospective, I realize there was no way to figure out that the code cache region was being exhausted.
The JVM is supposed to emit a warning in this case:
Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
But I never saw anything like that in the logs.
UseCodeCacheFlushing. Remember. Always. Especially when you use Groovy in your application.
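If you want to watch for this proactively rather than rely on a log line, the code cache is exposed as a memory pool through the standard management API. A minimal sketch (the pool-name matching is my own assumption: on HotSpot the pool is called "Code Cache", or "CodeHeap ..." segments on Java 9+):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class CodeCacheMonitor {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Match the JIT compiler's pool(s) by name.
            if (pool.getName().contains("Code")) {
                MemoryUsage usage = pool.getUsage();
                System.out.printf("%s: used %d KB of %d KB max%n",
                        pool.getName(),
                        usage.getUsed() / 1024,
                        usage.getMax() / 1024);
            }
        }
    }
}
```

Hooking something like this into your monitoring would let you alert when used size approaches the max, instead of discovering the problem through a half-year performance hunt.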