I quickly got the feeling that, with the right approach, I was still doing the same scientific work, just on different tasks. The key approach in science goes something like this:
- Develop a hypothesis
- Honestly devise a method that could disprove it
- Conduct experiments
- Present your findings with a poster at a conference
As it turns out, these same approaches fit perfectly into the typical IT workflow. Even presenting with a poster is possible. For instance, a task arrives: we need to increase the resilience of a regional cluster. The cluster consists of load balancers, application nodes, Sphinx nodes (SQL Phrase Index), and caches. We need to increase its capacity, but it's unclear how to do it right. So you sit down and start drawing up a list of hypotheses and gathering data:
- How do we decompose the load into individual bottlenecks?
- Can a single model describe the ideal ratio between two very different resources: the vCPU cores on Sphinx nodes and the RAM volume on caching servers?
- Does it make sense to keep adding cache beyond a certain threshold, or does it saturate?
In short, my favorite kind of job involves complete uncertainty: developing optimizations for things that evolved empirically but lack a precise recipe. There may be no standard approach at all, since each company reinvents the wheel in its own way. Instead of a rack of test tubes with varying concentrations of reagent and stem cells, I now have clusters around the world, and instead of a standard experiment there are load tests involving short-term shutdowns of entire continents. We eventually found the relationship, and I managed to build a model, festooned with correction coefficients for the vCPU performance of different server generations.
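To give a sense of what such a model can look like, here is a minimal sketch in Python. Everything in it is a hypothetical illustration of the approach rather than the actual production model: the generation coefficients, the per-vCPU throughput, and the saturating cache curve are made-up numbers standing in for values you would fit from load tests.

```python
# Hypothetical capacity model: estimate how many requests per second a
# regional cluster can serve from its Sphinx vCPU budget and cache RAM.
# All coefficients below are illustrative, not measured values.

# Correction coefficients for vCPU "strength" on different server generations
# (a vCPU on newer hardware does more useful work per core).
GEN_COEFF = {"gen1": 0.8, "gen2": 1.0, "gen3": 1.35}

# Baseline throughput contributed by one gen2-equivalent Sphinx vCPU.
RPS_PER_VCPU = 120.0

# Cache absorbs repeated queries, but the benefit saturates: past some RAM
# volume the hit rate barely grows.
CACHE_HALF_SATURATION_GB = 256.0  # RAM at which half of the max benefit is reached
MAX_CACHE_SPEEDUP = 2.5           # upper bound on the cache-driven speedup


def effective_vcpus(vcpus_by_gen: dict[str, int]) -> float:
    """Convert a mix of server generations into gen2-equivalent vCPUs."""
    return sum(GEN_COEFF[gen] * count for gen, count in vcpus_by_gen.items())


def cache_speedup(cache_ram_gb: float) -> float:
    """Saturating speedup from cache RAM (simple hyperbolic curve)."""
    fraction = cache_ram_gb / (cache_ram_gb + CACHE_HALF_SATURATION_GB)
    return 1.0 + (MAX_CACHE_SPEEDUP - 1.0) * fraction


def estimated_rps(vcpus_by_gen: dict[str, int], cache_ram_gb: float) -> float:
    """Rough cluster throughput estimate in requests per second."""
    return RPS_PER_VCPU * effective_vcpus(vcpus_by_gen) * cache_speedup(cache_ram_gb)


if __name__ == "__main__":
    # Example: 64 old-generation cores, 128 new ones, 512 GB of cache RAM.
    print(round(estimated_rps({"gen1": 64, "gen3": 128}, 512.0)))
```

The point is not this particular formula but that every coefficient is a hypothesis: each one can be confirmed or rejected with a targeted load test.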
Another aspect that turned out to be extremely relevant, one that is standard in scientific research but not always applied in IT, is that experiments should be both positive and negative.
For example, suppose the task is to test the impact of a new API version on one of the subsystems and confirm that it still works.
The incorrect approach goes like this:
- Update the API code in the tested service's environment.
- Make sure everything works.
- Celebrate a successful deployment.
The correct approach:
- Perform all the steps from the incorrect one.
- Disable or break the API and confirm that the system actually stops functioning.
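Here is a minimal sketch of what that pair of checks can look like as automated tests. It assumes pytest and requests; the URLs and the healthcheck probe are hypothetical placeholders for whatever "the subsystem works" means in your setup.

```python
# Sketch of a paired positive/negative check for an API dependency.
# The endpoints and the healthcheck probe are hypothetical illustrations.

import pytest
import requests

NEW_API = "https://api.internal.example/v2/health"   # new API version under test
BROKEN_API = "https://api.internal.example/broken"   # deliberately disabled endpoint


def subsystem_healthcheck(api_url: str) -> bool:
    """Stand-in for 'the subsystem does its job': here, a simple HTTP probe."""
    response = requests.get(api_url, timeout=5)
    response.raise_for_status()
    return True


def test_positive_subsystem_works_with_new_api():
    # Positive experiment: the subsystem functions against the new API.
    assert subsystem_healthcheck(NEW_API)


def test_negative_subsystem_breaks_without_api():
    # Negative experiment: breaking the API must actually break the subsystem.
    # If this test fails because the call still succeeds, the subsystem is not
    # really using the API, whatever the documentation says.
    with pytest.raises(requests.RequestException):
        subsystem_healthcheck(BROKEN_API)
```

The negative test is the one that catches the surprises, as the next story shows.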
In one such task, I discovered to my surprise that the client's subsystem continued to function even with a broken API. It turned out that the subsystem simply wasn't using it, contrary to what the documentation suggested.