Tests suggest clues of whose content was used to train OpenAI’s Sora

(washingtonpost.com)

23 points | by kgwgk a day ago

5 comments

DroneBetter a day ago
https://archive.is/ozjEb (note some of the gifs become static images here)
codedokode a day ago
This vague copyright situation works against open-source AI models, which have to disclose the sources of their training data, while closed-source companies can freely use pirated material and gain an advantage over open-source models.
smegma2 21 hours ago
I’m normally skeptical of claims like this, but looking at the examples it seems that Sora is reproducing some of its training data verbatim. I guess it’s a case of overfitting? In particular the Civ example seems like it must have been copied almost verbatim.
viewtransform 20 hours ago
Title goes on to say:
Tests by The Post suggest the training data for OpenAI’s video generator Sora included versions of movies, TikTok clips and Netflix shows.
ath3nd a day ago
[dead]