This might handle scripts as you described, but just wait until it needs context from random bits of a tens-of-millions-of-lines monorepo, plus knowledge of custom infrastructure that isn’t documented anywhere. And, oh wait, we can’t actually let this LLM-as-a-service read our code because of X and Y compliance/security/legal/etc. reasons, even if we ran it on-prem.
Don’t worry, the robots aren’t coming for you anytime soon.
I’d be interested to hear more about this game. How long does it take during work time, and what evaluation criteria did you settle on? Is it testing your IDE setup, your knowledge of build tools, features of the tech stack like threading, etc.?