This isn’t a limitation of neural nets. We’ve had stable style transfer for videos since as early as 2018; see https://medium.com/element-ai-research-lab/stabilizing-neura...
Praising stuff like this feels... Not crass, exactly, but like... Unrefined and ersatz in a way. It's like buying a mall katana and displaying it like it's something of cultural value.
This was honestly a great waste of compute power.
To the downvoters, do what you want, but it doesn't change the apparent reality of what we're seeing. AI image generation is an exciting and growing technology but this isn't a remarkable or successful application of it in any measurable dimension. The system lacks temporal and spatial context and that is going to be a hard blocker for this application.