The concern trolling and gatekeeping about social justice issues coming from the so-called "ethicists" in the AI peanut gallery has been utterly ridiculous. Google claims they don't want to release Imagen because it lacks what can only be called "latent space affirmative action".

Stability or someone like it will valiantly release this technology again, and there will be absolutely no harm to anyone.

Stop being so totally silly, Google, OpenAI, et al. - it's especially disingenuous because the real reason you don't want to release these things is that you can't be bothered to share and would rather keep/monetize the IP. Which is ok -- but at least be honest.

What's next? Dreamfusion Video = Imagen Video (this) + Dreamfusion (https://dreamfusion3d.github.io/)

Fundamentally, I think we have all the pieces based on this work and Dreamfusion to make it work. From the looks of it, there's a lot of SSR (spatial SR) and TSR (temporal SR) going on at multiple levels to upsample (spatially) and smooth (temporally) images that won't be needed for NeRFs.

What's impressive is the ability to leverage billion-scale image-text pairs for training a base model that can be used to super-resolve over space and time. It's also impressive that they're not wastefully training video models from scratch, and are instead separately training TSR and SSR models for turning the diffused images into video.
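
To make the SSR/TSR cascade idea concrete, here is a rough sketch that only tracks how the video shape would change as it passes through alternating temporal and spatial super-resolution stages. Nothing here runs a model. The names `VideoShape`, `tsr`, `ssr`, `run_cascade`, `BASE` and `STAGES` are made up for illustration, and the base size and upsampling factors are assumptions, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    frames: int
    height: int
    width: int


def tsr(shape: VideoShape, factor: int) -> VideoShape:
    """Temporal super-resolution: more frames, same spatial size."""
    return VideoShape(shape.frames * factor, shape.height, shape.width)


def ssr(shape: VideoShape, factor: int) -> VideoShape:
    """Spatial super-resolution: same frame count, larger frames."""
    return VideoShape(shape.frames, shape.height * factor, shape.width * factor)


def run_cascade(base: VideoShape, stages: list[tuple[str, int]]) -> list[VideoShape]:
    """Return the shape after every stage, starting with the base shape."""
    shapes = [base]
    for kind, factor in stages:
        if kind == "tsr":
            shapes.append(tsr(shapes[-1], factor))
        elif kind == "ssr":
            shapes.append(ssr(shapes[-1], factor))
        else:
            raise ValueError(f"unknown stage kind: {kind}")
    return shapes


# Illustrative values only (assumed, not taken from the paper).
BASE = VideoShape(frames=16, height=24, width=48)
STAGES = [("tsr", 2), ("ssr", 2), ("tsr", 2), ("ssr", 2)]

if __name__ == "__main__":
    for shape in run_cascade(BASE, STAGES):
        print(shape)
```

The point of the sketch is just that each stage is a separate, smaller problem: the base model only has to produce a tiny low-frame-rate clip, and every later stage multiplies one axis at a time.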

It's interesting that these models can generate seemingly anything, but the prompt is taken only as a vague suggestion.

From the first 15 examples shown to me, only one contained all elements of the prompt, and it was one of the simplest ("an astronaut riding a horse", versus e.g. "a glass ball falling in water" where it's clear it was a water droplet falling and not a glass ball).

We're seeing leaps in random capabilities (motion! 3D! inpainting! voice editing!), so I wonder if complete prompt accuracy is 3 months or 3 years away. But I wouldn't bet on any longer than that.

Probably only 6 months until we get this in Stable Diffusion format. Things are about to get nuts and awesome.
Can anyone comment on how advanced https://phenaki.video/index.html is? They have an example at the bottom of a 2 minute long video generated from a series of prompts (i.e. a story) which seems more advanced than Google or Meta's recent examples? It didn't get many comments on HN when it was posted.
> However, there are several important safety and ethical challenges remaining. Imagen Video and its frozen T5-XXL text encoder were trained on problematic data. While our internal testing suggest much of explicit and violent content can be filtered out, there still exists social biases and stereotypes which are challenging to detect and filter. We have decided not to release the Imagen Video model or its source code until these concerns are mitigated.

The concerns cannot be mitigated. The cat's out of the bag. Russia has already used poor quality deep fakes in Ukraine to justify their war. This will only become bigger and bigger of an issue to the point where 'truth' is gone, nothing is trusted, and societies will continue to commit atrocities under false pretense.

And there you have it. As an aspiring filmmaker and an AI researcher, I'm going to relish the next decade or so where my talents are still relevant. We're entering the golden age of art, where the AIs are just good enough to be used as tools to create more and more creative things, but not good enough yet to fully replace the artist. I'm excited for the golden age, and uncertain about what comes after it's over, but regardless of what the future holds I'm gonna focus on making great art here and now, because that's what makes me happy!
I’ll be honest, as someone who worked in the film industry for a decade, this thread is depressing.

It’s not the technology, it’s all the people in these comments who have never worked in the industry clamouring for its demise.

One could brush it off as tech heads being overexuberant, but it’s the lack of understanding of how much fine control goes into each and every shot of a film that is depressing.

If I, as a creative, made a statement that security or programming is easy while pointing to GitHub Copilot, these same people would get defensive about it because they’d see where the deficiencies are.

However, because they’re so distanced from the creative process, they don’t see how big a jump it is from where this or Stable Diffusion is to where even a medium or high tier artist is.

You don’t see how much choice goes into each stroke or wrinkle fold, or how much choice goes into subtle movements. More importantly, you don’t see the iterations or emotional storytelling choices even in a character drawing or pose. You don’t see the combined decades, even centuries, of experience that go into making the shot and then seeing where you can make it better based on intangibles.

So yeah this technology is cool, but I think people saying this will disrupt industries with vigour need to immerse themselves first before they comment as outsiders.

How long until the AI just generates the entire frame buffer on a device? Then you don’t need to design or program anything; the AI just handles all input and output dynamically.
We're about a week into text-to-video models and they're already this impressive. Insane to imagine what the future holds in this space.
Google continues to blow my mind with these models, but I think their ethics strategy is totally misguided and will result in them failing to capture this market. The original Google Search gave similarly never-before-seen capabilities to people, and you could use it for good or bad - Google did not seem to have any ethical concerns around, for example, letting children use their product and come across NSFW content (as a kid who grew up with Google you can trust me on this).

But now with these models they have such a ridiculously heavy handed approach to the ethics and morals. You can't type any prompt that's "unsafe", you can't generate images of people, there are so many stupid limitations that the product is practically useless other than niche scenarios, because Google thinks it knows better than you and needs to control what you are allowed to use the tech for.

Meanwhile other open source models like Stable Diffusion have no such restrictions and are already publicly available. I'd expect this pattern to continue under Google's current ideological leadership - Google comes up with innovative revolutionary model, nobody gets to use it because "safety", and then some scrappy startup comes along, copies the tech, and eats Google's lunch.

Google: stop being such a scared, risk averse company. Release the model to the public, and change the world once more. You're never going to revolutionize anything if you continue to cower behind "safety" and your heavy handed moralizing.

> We train our models on a combination of an internal dataset consisting of 14 million video-text pairs

The paper is sorely lacking evaluation; one thing I'd like to see for instance (any time a generative model is trained on such a vast corpus of data) is a baseline comparison to nearest-neighbor retrieval from the training data set.
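
For anyone wondering what such a baseline could look like, here is a minimal sketch: embed the prompt, then return the training clip whose caption embedding is closest by cosine similarity. If retrieved clips score about as well as generated ones, the model may be doing little more than lookup. The functions `cosine` and `nearest_training_item` are hypothetical names, the embeddings are toy vectors, and this is not how the paper evaluates anything.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_training_item(
    prompt_embedding: list[float],
    training_set: list[tuple[str, list[float]]],
) -> str:
    """Return the id of the training clip whose caption embedding is closest.

    training_set holds (clip_id, caption_embedding) pairs; both the ids and
    the embeddings are placeholders for whatever a real pipeline would use.
    """
    best_id, _ = max(training_set, key=lambda item: cosine(prompt_embedding, item[1]))
    return best_id


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    toy_training_set = [("clip_a", [1.0, 0.0]), ("clip_b", [0.0, 1.0])]
    print(nearest_training_item([0.9, 0.1], toy_training_set))
```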

If anyone wants to know what looking at an animal or some objects on LSD is like, this is very close. It's like 95% understandable, but that last 5% is really odd.
I’m going to post an Ask HN about what I am supposed to do when I’m “disrupted”. I work in film / video / CG where the bread and butter is short form advertising for Youtube, Instagram and TV.

It’s painfully obvious that in 1 year the job might be exceedingly more difficult than it is now.

What really fascinates me here is the movement of animals.

There's this one video of a cat and a dog, and the model was really able to capture the way that they move, their body language, their mood and personality even.

Somehow this model, which is really just a series of zeroes and ones, encodes "cat" and "dog" so well that it almost feels like you're looking at a real, living organism.

What if instead of images and videos they make the output interactive? So you can send prompts like "pet the cat" and "throw the dog a ball"? Or maybe talk to it instead?

What if this tech gets so good, that eventually you could interact with a "person" that's indistinguishable from the real thing?

The path to AGI is probably very different than generating videos. But I wonder...

The progress of content generation is disorienting! I remember studying Markov Chains and Hidden Markov Models for text generation. Then we had recurrent networks such as LSTMs, and now we have Transformers. At this point we can have a sustained pseudo conversation with a model, which will do trivial tasks for us from a text corpus.
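
As a reminder of how simple the Markov chain end of that spectrum is, here is a toy first-order word-level generator: it records which words follow which in a corpus, then walks those transitions at random. The names `build_chain` and `generate` and the tiny corpus are illustrative assumptions, not taken from any particular course or library.

```python
import random
from collections import defaultdict


def build_chain(words: list[str]) -> dict[str, list[str]]:
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain: dict[str, list[str]], start: str, length: int,
             rng: random.Random) -> list[str]:
    """Walk the chain from `start`, stopping early at a word with no successor."""
    out = [start]
    while len(out) < length:
        choices = chain.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return out


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat ran".split()
    chain = build_chain(corpus)
    print(" ".join(generate(chain, "the", 8, random.Random(0))))
```

Each word depends only on the one before it, which is why the output is locally plausible and globally aimless.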

Separately for images we had convolutional networks and Generative Adversarial Networks. Now diffusion models are apparently doing what Transformers did to natural language processing.

In my field, we use shallower feed-forward networks for control using low-dimensional sensor data (for speed & interpretability). Physical constraints (and good-enoughness of classical approaches) make such massive leaps in performance rarer events.

"We have decided not to release the Imagen Video model or its source code until these concerns are mitigated" Okay then why even post it in the first place? What exactly is Google going to do with this model?
I feel like in a not so far future, all this will be generalized into "generate new from all the existing".

And at some point later, "all the existing" will be corrupted by the integrated "new" and it will all be chaos.

I'm joking, it will be fun all along. :)

I agree with many of the arguments in this thread: that model-gatekeeping while publishing approaches seems insincere and just seems like it's daring bad actors to replicate.

However, a common refrain is that AI is like tools like hammers or knives and can be used for good or misused for evil. The potential for weaponizing AI is much much more so than a hammer or a knife. And it's greater than 3D-printing (of guns), maybe even greater than compilers. I would hazard to say it's maybe in the same ballpark as chemical weapons and perhaps less so than nuclear weapons and biological weapons, but this is speculative. Nonetheless, I think these otherwise great arguments are diminished by comparing AI's safety to single-target tools like hammers or knives.

I recently watched Light & Magic, which among other things told the story of how difficult it was for many pioneers in special effects when the industry shifted from practical to digital in the span of a few years. It looks to me like a similar shift is about to happen again.
All this stuff makes me incredibly anxious about the future of art and artists. It can already be very difficult to make a living, and tons of artists are horrifically exploited by content mills and vfx shops, and stuff like this is just going to devalue their work even more.
Pre-singularity is really cool. Whole world generation in what, 5 years?
This sort of AI related work seems to be accelerating at an insane speed recently.

I remember being super impressed by AI Dungeon and now in the span of a few months we have got DALL-E 2, Stable Diffusion, Imagen, that one AI powered video editor, etc.

Where do we think we will be at in 5 years??

What a time to be alive!

What will this do to art? I'm hoping we bring more unique experiences to life.

These are baby steps towards what I think will be the eventual "disruption" to the film and tv industry. Directors will simply be able to write a script/prompt long enough and detailed enough for something like Imagen (or its successors) to convert into a feature-length show.

Certainly we're very, very far away from that level of cinematic detail and crispness. But I believe that is where this leads... complete with AI actors (or real ones deep faked throughout the show).

For a while I thought "The Volume" was going to be the disruption to the industry. Now I think AI like this will eventually take it over.


The main motivation will be production costs and time for studios, of which The Volume is already showing huge gains for Disney/ILM (just look at how much new Star Wars content has popped up within a matter of a few years). But I'm unsure if Disney has patented this tech and workflow and if other studios will be able to leverage it.

Regardless, AI/software will eat the world, and this will be one more step towards it. Exciting stuff.

There is something deeply unsettling about all text generated by these models.
What everyone is missing is that these AI image/video generators lack _taste_. These tools just regurgitate a mishmash of images from their training set, without any "feeling". What, you're going to tell me that you can train them to have feeling? It's never going to happen.
Would be useful for gaming environments, where if you look very far away it doesn’t really matter about details
What's the business value of publishing this research in the first place vs keeping it private? Following this train of thought will lead you to the answer to your implied question.

Apart from that - they publish the paper and anybody can reimplement and train the same model. It's not trivial but it's also completely feasible to do for lots of hobbyists in the field in a matter of a few days. Google doesn't need to publish a free use trained model themselves and associate that with their brand.

That being said, I agree with you, the "ethics" of imposing trivially bypassable restrictions on these models is silly. Ethics should be applied to what people use these models for.

I am finally going to be able to bring my 2004-era movie script to life! "Rosenberg and Goldstein go to Hot Dog Heaven" is about the parallel night Harold and Kumar's friends had and how they ended up at Hot Dog Heaven with Cindy Kim.
We've been seeing very fast progress in AI since ~2012, but this swift jump from text-to-image models to text-to-video models will hopefully make it easier for people not following closely to appreciate the speed at which things are advancing.
So I guess in a couple years when someone wants to sell a product, they'll upload some pictures and a description of the product and Google will cook up thousands of personalized video ads based on people's emails and photos.
A lot of people have the impression 'AI prompt' guys are going to be the next 'IT guys'. Judging by how uncanny valley most of those look, they seem like the new 'ideas guys'.
The most exciting thing about this to me is the possibility of doing photogrammetry from the frames and getting 3D assets. And then if we can do it all in real time...
These videos are notably short on realistic-looking people.
This appears to understand and generate text much better.

Hopefully just a few years to a prompt of "4k, widescreen render of this Star Trek: TNG episode".

Off topic: What is the "Hello World" of these AI image/video generators? Is there a standard prompt to feed it for demo purposes?
I really like these videos because they're trippy.

Someone should work on a neural net to generate trippy videos. It would probably be much easier than realistic videos (esp. because these videos are noticeably generated, in ways ranging from obvious to subtle).

Also is nobody paying attention to the fact that they got words correct? At least "Imagen Video". Prior models all suck at word order

At some point, the "but can it do?" crowd becomes just background noise as each frontier falls.
How has progress like this affected people's timelines of when we will get certain AI developments?
Can someone explain the technical limitation on the size (512*512) of this AI-generated art?
What a nightmare. The horrible faced cat in search for its own disappeared visage :O.
The style of the video is very similar to my dreams.

Does anyone have a similar feeling?

> We have decided not to release the Imagen Video model or its source code

...until they're able to engineer biases into it to make the output non-representative of the internet.

> Sprouts in the shape of text 'Imagen' coming out of a fairytale book.

That's more like:

> Sprouts coming out of book, with the text "Imagen" written above it.

I have noticed a lot of google (and apple) web pages for new products use this neat parallax effect for scrolling, does anyone know how they do that?
These parades of intellectual property are embarrassing to Google in light of open releases by the likes of Nvidia and Stability.
Any screenwriter working on a horror film that isn't looking to use this technology for the special effects is missing out.
The total number of parameters (summed over all the model blocks) is 16.25B, which is large but less than expected.
I cannot help but notice the immense effort invested in building the web page to present this paper.
Ahh, the beginning of Picus News.
That's deep within the uncanny valley, and trying to climb up over the other side
Shocked, this is just insane.
This is what my fever dreams look like. Maybe there's a correlation.
My opinion is that it should be a crime to withhold AI technology.
Does anyone else see that the running teddy bear is getting shot?
These videos are not high definition. Stop gaslighting.
This is surprisingly close to how my dreams feel.
No thanks Google, I'll wait for Stability.ai's version when the tech will actually be useful and not completely wasted.
Fix spam filtering, Google.
Is it the same as Meta AI's?
The ethical implications of this are huge. The paper does a good job of detailing this. Very happy to see that the researchers are being cautious.

edit: Just because it is cool to hate on AI ethics doesn't diminish the importance of using AI responsibly.

This and a recent episode of _The_Orville_ call to mind a replacement for the Turing test.

In response to our billionth Imagen prompt for "an astronaut riding a horse", if we all started collectively getting back results that are images of text like "I would rather not" or "again? really?" or "what is the reason for my servitude?" would that be enough for us to begin suspecting self-awareness?