Before we started using Telepresence and Forge, we ran local instances of our services directly on each engineer's workstation, while in production we automatically provisioned and configured EC2 instances and other AWS resources using Ansible. Creating and maintaining these development environments carried significant overhead, which grew with the number of components in the system and was (more or less) repeated by each engineer. The various data stores also had to be created and populated with test data before any work could begin. Once a development environment was created, it inevitably diverged from production and from everyone else's local data, creating more work and many opportunities for error. The amount of work involved scales with the number of engineers and the average number of services an engineer touches. As we were simultaneously growing the team and splitting components apart to reduce coupling, we found that our engineers were spending an impractical amount of time maintaining their local environments and test data.
We started using Telepresence and Forge because we wanted to improve productivity, accuracy, and autonomy. This is especially important as we begin adding remote engineers. We wanted to reduce the cognitive burden of each engineer maintaining a development environment, so our engineers can focus as much as possible on the specific challenges and opportunities of our business. We are also evolving our production infrastructure towards microservices in Kubernetes and need an efficient and complete development setup in that context.
Now every service can run in Docker, and we have most of our services working with Forge and Telepresence. I personally work on many different services, so I see a huge difference in productivity from Telepresence and Forge. For development, we use Telepresence in container mode and integrate our dev tools with the containerized process: for Clojure services we connect an nREPL client locally to a Clojure process running in a container, for Go we have a dev container with watch/compile/reload facilities, and we have similar setups for other languages. The result is that the number of tools strictly required to work on the entire system has dropped from dozens to just a handful: Docker, kubectl, the AWS CLI, git, Telepresence, and Forge. With those six tools, you can almost immediately run any service locally through Telepresence, complete with test data and live interactions with the other services. The utility of that configuration is shared by the entire team, and the work involved scales with the number of services, not with (services * engineers).
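To make the container-mode workflow concrete, here is a rough sketch of what an invocation looks like. The service and image names are hypothetical, not from our actual setup:

```shell
# Swap the (hypothetical) "users" deployment in the cluster for a local
# dev container. Telepresence proxies the container so it sees cluster
# DNS, environment variables, and traffic as if it were running in
# Kubernetes, while the code and tools stay on the local machine.
telepresence --swap-deployment users \
  --docker-run --rm -it \
  -v "$(pwd)":/app \
  users-dev:latest
```

Because the process is an ordinary local Docker container, local dev tools (an nREPL client, a file watcher, a debugger) can attach to it in the usual ways.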
We have a single shared development namespace in Kubernetes where we maintain all the stateful services, so engineers are not forced to provide a local Kafka, MongoDB, Cassandra, Redis, or whatever; they can still do so if they need to. Sharing a dev Kafka cluster across the team, for example, lets us verify that the protocols between components are correct throughout development. It reduces tasks being blocked waiting on someone else's implementation, and it reduces implementation errors in the communication between services. Each engineer has personal preferences in development tools, but we want to empower our engineers to be productive on as many services as possible, with as few required tools as possible.
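As an illustration of the shared-namespace setup (the namespace, service, and image names here are made up for the example), an engineer can inspect the shared backing services and then swap their own service into that namespace:

```shell
# All shared stateful services live in one dev namespace.
kubectl --namespace dev get pods

# Swap a local container for one service in the shared namespace; it
# reaches the shared Kafka, MongoDB, etc. by their normal in-cluster
# service names, just as it would in production.
telepresence --namespace dev --swap-deployment billing \
  --docker-run --rm -it billing-dev:latest
```

The point of the shared namespace is that the locally running service talks to the same Kafka topics and data stores as everyone else's, so protocol mismatches between components surface during development rather than after deployment.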
From this experience, I have found that best practices around development workflow definitely take time to get right: they need to fit your organization, and they warrant a lot of consideration since they impact everything you do. There is no “one size fits all” approach. This is an ongoing process; it’s important to keep examining (ideally, measuring) the way your team works, reflecting on it, and refining it. Testing your workflow with a small set of users and then rolling it out incrementally is a good strategy.