Fundamentally, I think this work plus DreamFusion gives us all the pieces to make it work. From the looks of it, there's a lot of SSR (spatial super-resolution) and TSR (temporal super-resolution) going on at multiple levels to upsample images spatially and smooth them temporally, steps that won't be needed for NeRFs.
What's impressive is the ability to leverage billion-scale image-text pairs to train a base model that can then be super-resolved over space and time. And they're not wastefully training video models from scratch; instead they train separate TSR and SSR models to turn the diffused images into video.
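To make that concrete, here's a rough sketch of the cascade structure as I read the paper: a base model produces a short low-res, low-fps clip, then alternating TSR and SSR stages refine it. The `tsr`/`ssr` functions below are just plain interpolation standing in for the learned diffusion upsamplers, so this only illustrates the data flow, not the models themselves (stage factors and sizes are illustrative).

```python
import torch
import torch.nn.functional as F

def tsr(video, factor):
    # video: (batch, channels, frames, height, width)
    # Temporal SR: insert new frames between existing ones.
    return F.interpolate(video, scale_factor=(factor, 1, 1), mode="trilinear")

def ssr(video, factor):
    # Spatial SR: upsample each frame's resolution.
    return F.interpolate(video, scale_factor=(1, factor, factor), mode="trilinear")

# Base "model" output: 16 frames at 24x48 (placeholder noise, not a real sample).
base = torch.randn(1, 3, 16, 24, 48)

# Alternate TSR/SSR stages, as the paper's cascade does.
video = ssr(tsr(base, 2), 2)   # 32 frames at 48x96
video = ssr(tsr(video, 2), 2)  # 64 frames at 96x192
print(video.shape)             # torch.Size([1, 3, 64, 96, 192])
```

The appeal of this design is that each stage is a smaller, cheaper model conditioned on the previous stage's output, rather than one giant model generating high-res, high-fps video in a single shot.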
From the first 15 examples shown to me, only one contained all elements of the prompt, and it was one of the simplest ("an astronaut riding a horse"). In "a glass ball falling in water", for example, it's clear a water droplet was falling, not a glass ball.
We're seeing leaps in random capabilities (motion! 3D! inpainting! voice editing!), so I wonder if complete prompt accuracy is 3 months or 3 years away. But I wouldn't bet on any longer than that.
The concerns cannot be mitigated. The cat's out of the bag. Russia has already used poor-quality deepfakes in Ukraine to justify its war. This will only become a bigger and bigger issue, to the point where 'truth' is gone, nothing is trusted, and societies continue to commit atrocities under false pretenses.
It’s not the technology, it’s all the people in these comments who have never worked in the industry clamouring for its demise.
One could brush it off as tech heads being over-exuberant, but it's the lack of understanding of how much fine control goes into each and every shot of a film that is depressing.
If I, as a creative, made a statement that security or programming is easy while pointing to GitHub Copilot, these same people would get defensive about it because they’d see where the deficiencies are.
However, because they're so distanced from the creative process, they don't see how big a jump it is from where this or Stable Diffusion is to where even a medium- or high-tier artist is.
You don't see how much choice goes into each stroke or wrinkle fold, or how much choice goes into subtle movements. More importantly, you don't see the iterations or the emotional storytelling choices even in a character drawing or pose. You don't see the combined decades, even centuries, of experience that go into making the shot and then seeing where it can be made better based on intangibles.
So yeah, this technology is cool, but I think people saying it will disrupt industries with vigour need to immerse themselves in those industries first before they comment as outsiders.
But now, with these models, they take such a ridiculously heavy-handed approach to ethics and morals. You can't type any prompt that's "unsafe", you can't generate images of people, and there are so many stupid limitations that the product is practically useless outside niche scenarios, because Google thinks it knows better than you and needs to control what you're allowed to use the tech for.
Meanwhile, open-source models like Stable Diffusion have no such restrictions and are already publicly available. I'd expect this pattern to continue under Google's current ideological leadership: Google comes up with an innovative, revolutionary model; nobody gets to use it because "safety"; and then some scrappy startup comes along, copies the tech, and eats Google's lunch.
Google: stop being such a scared, risk-averse company. Release the model to the public, and change the world once more. You're never going to revolutionize anything if you continue to cower behind "safety" and your heavy-handed moralizing.
The paper is sorely lacking evaluation; one thing I'd like to see, for instance (any time a generative model is trained on such a vast corpus), is a baseline comparison to nearest-neighbor retrieval from the training set.
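For what it's worth, here's a minimal sketch of the baseline I mean, assuming prompt and training-clip embeddings live in a shared space (e.g. from CLIP). `embed_text` and `train_embs` are hypothetical placeholders computed upstream:

```python
import numpy as np

def nearest_neighbor(query_emb, train_embs, k=1):
    # Cosine similarity between the prompt embedding and all training embeddings.
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q
    # Indices of the k most similar training examples.
    return np.argsort(-sims)[:k]

# Usage (hypothetical): show retrieved training clips next to each generation,
# so readers can judge how much is synthesis vs. memorization.
# idx = nearest_neighbor(embed_text("an astronaut riding a horse"), train_embs)
```

If the generated sample barely beats its nearest training neighbor, that's a very different result than if it's genuinely novel.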
It's painfully obvious that in a year the job might be vastly more difficult than it is now.
There's this one video of a cat and a dog, and the model was really able to capture the way that they move, their body language, their mood and personality even.
Somehow this model, which is really just a series of zeroes and ones, encodes "cat" and "dog" so well that it almost feels like you're looking at a real, living organism.
What if instead of images and videos they make the output interactive? So you can send prompts like "pet the cat" and "throw the dog a ball"? Or maybe talk to it instead?
What if this tech gets so good, that eventually you could interact with a "person" that's indistinguishable from the real thing?
The path to AGI is probably very different than generating videos. But I wonder...
Separately, for images we had convolutional networks and Generative Adversarial Networks. Now diffusion models are apparently doing for images what Transformers did for natural language processing.
In my field, we use shallower feed-forward networks for control from low-dimensional sensor data (for speed and interpretability). Physical constraints (and the good-enoughness of classical approaches) make such massive leaps in performance rarer events.
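For concreteness, a minimal sketch of the kind of shallow controller I mean; layer sizes and names are illustrative, not from any particular system:

```python
import torch
import torch.nn as nn

class ShallowController(nn.Module):
    def __init__(self, n_sensors=6, n_hidden=16, n_actuators=2):
        super().__init__()
        # One hidden layer keeps inference fast and the weights inspectable.
        self.net = nn.Sequential(
            nn.Linear(n_sensors, n_hidden),
            nn.Tanh(),
            nn.Linear(n_hidden, n_actuators),
        )

    def forward(self, sensors):
        # Map a low-dimensional sensor reading to actuator commands.
        return self.net(sensors)

controller = ShallowController()
command = controller(torch.randn(1, 6))  # one reading -> one command
print(command.shape)                     # torch.Size([1, 2])
```

At this scale you can run inference in microseconds on a microcontroller-class CPU and still reason about what each weight is doing, which billion-parameter diffusion models will never offer.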
And at some point later, "all the existing" will be corrupted by the integrated "new", and it will all be chaos.
I'm joking, it will be fun all along. :)
However, a common refrain is that AI, like hammers or knives, is a tool that can be used for good or misused for evil. But the potential for weaponizing AI is far greater than for a hammer or a knife. It's greater than 3D printing (of guns), maybe even greater than compilers. I'd hazard that it's in the same ballpark as chemical weapons, and perhaps less than nuclear and biological weapons, but this is speculative. Nonetheless, I think these otherwise great arguments are diminished by comparing AI's safety to single-target tools like hammers and knives.
I remember being super impressed by AI Dungeon, and now in the span of a few months we've gotten DALL-E 2, Stable Diffusion, Imagen, that one AI-powered video editor, etc.
Where do we think we will be at in 5 years??
What will this do to art? I'm hoping we bring more unique experiences to life.
Certainly we're very, very far away from that level of cinematic detail and crispness. But I believe that is where this leads... complete with AI actors (or real ones deep faked throughout the show).
For a while I thought "The Volume" was going to be the disruption to the industry. Now I think AI like this will eventually take it over.
https://www.comingsoon.net/movies/features/1225599-the-volum...
The main motivation will be production costs and time for studios, where The Volume is already showing huge gains for Disney/ILM (just look at how much new Star Wars content has popped up within a matter of a few years). But I'm unsure whether Disney has patented this tech and workflow, and whether other studios will be able to leverage it.
Regardless, AI/software will eat the world, and this will be one more step towards it. Exciting stuff.
Apart from that, they publish the paper, and anybody can reimplement and train the same model. It's not trivial, but it's also completely feasible for lots of hobbyists in the field to do in a matter of a few days. Google doesn't need to publish a free-use trained model themselves and associate it with their brand.
That being said, I agree with you, the "ethics" of imposing trivially bypassable restrictions on these models is silly. Ethics should be applied to what people use these models for.
Hopefully just a few years to a prompt of "4k, widescreen render of this Star Trek: TNG episode".
Someone should work on a neural net to generate trippy videos. It would probably be much easier than realistic videos (especially since these videos are noticeably generated, in ways ranging from obvious to subtle).
Also, is nobody paying attention to the fact that they got the words correct? At least "Imagen Video". Prior models all suck at word order.
Does anyone have a similar feeling?
...until they're able to engineer biases into it to make the output non-representative of the internet.
That's more like:
> Sprouts coming out of book, with the text "Imagen" written above it.
edit: Just because it is cool to hate on AI ethics doesn't diminish the importance of using AI responsibly.
In response to our billionth Imagen prompt for "an astronaut riding a horse", if we all collectively started getting back images of text like "I would rather not" or "again? really?" or "what is the reason for my servitude?", would that be enough for us to begin suspecting self-awareness?
Stability or someone like it will valiantly release this technology again, and there will be absolutely no harm to anyone.
Stop being so totally silly, Google, OpenAI, et al. It's especially disingenuous because the real reason you don't want to release these things is that you can't be bothered to share and would rather keep and monetize the IP. Which is OK, but at least be honest.