An external monitoring tool reports that the average response time of the 5 monitored URLs has doubled in the last 30 minutes. The project runs on a single physical server that you do not manage, located somewhere in a datacenter. You connect via SSH, start htop, and see that CPU load is at 95% and memory has long since been exhausted.
From the git history, you know that about a week ago the team migrated the database to a new table structure. A colleague writes in chat that he had to run the migration overnight, because recalculating the columns and indexes took about 5 hours, during which almost the entire database was locked and neither INSERT nor SELECT worked.
So the performance problems are probably caused by poorly designed indexes, badly rewritten SQL queries, or an oversized connection pool. There is no time for a revert: according to Google Analytics there are 7 thousand users on the site, and a 5-hour outage would mean a reputational risk for the client and a loss of tens to hundreds of thousands of crowns (it's hard to estimate, because the project managers make up a lot of the numbers). You realize that testing only functionality in a test environment is not enough, and that you need a load test as well.
Since this is an important e-commerce store belonging to your biggest client, and you expect the situation to get worse, you have 30 seconds to make a decision.
How do you proceed?
Jan Barášek
The author of this article works as a senior developer and software architect in Prague. He designs and maintains large web applications that you know and use. Since 2009 he has gained extensive experience, which he passes on through this website.