I feel like we need more awareness of what open source is and how it works. This is NOT open source. This is, at best, source-available, but as there is no way to confirm that this code even runs anywhere ever, it's entirely a bad-faith performance to trick people, deceive regulators, and stain the entire open source movement.
I sincerely hope that the mainstream media does not fall for this and calls it out instead. It's not rocket science. It's really, really simple: this is not good for anyone.
This is open source. You're thinking of trusted execution, audits, licenses with disclosure requirements, or signed affidavits, which are totally different things from open source. Otherwise you could claim that just about anything isn't open source just because you're not sure what is happening on someone else's computer.
ok. This is the open source of _what_? Without tying the code to a real, running system, the release is absolutely meaningless. Here's the open source code for Hacker News:
```
@route("/")
def main():
    return "hello world"
```
What does that give us? We can't run this to host our own Hacker News, as it's clearly not runnable. We can't really learn anything from it, as it doesn't represent any real system. Maybe it's a fun reading exercise, but that's about it.
Open source means that I can take the source and run it to ensure it's trusted. ASCII characters being visible on my screen is just a nice byproduct of that goal.
> This is NOT open source.
So in the end, are we going by the OSI's definition of Open Source, or not? Can we make up our minds, please?
Every time anyone posts here even a slightly modified Open Source license (e.g. an MIT license with an extra restriction that prevents megacorporations from using it but doesn't affect anyone else), people come out of the woodwork with their pitchforks screaming "this is not Open Source!", insisting that the Open Source Definition decides what is Open Source, and that anything which doesn't meet that definition must not be called "Open Source".
And yet here we are with a repository licensed under an actually Open Source license, and suddenly this is the most upvoted comment, and now people don't actually care about the Open Source Definition after all?
Either we go by the OSI's definition, in which case this is open source, regardless of what you think the motivations are for opening up this code, or we go by the "vibes" of whether it feels open source, in which case a modified MIT license which prohibits companies with a trillion+ market cap from using it is also open source.
You’re discussing licenses; their concern is about calling a thing that cannot function without the associated proprietary back-end “open source” for marketing purposes.
If you want to make the argument only about the license, then you should be consistent and say “open source license” every single time. Their point is that companies use releases like this to claim they have “open sourced” their systems simply by releasing some useless code under an open source license.
I think if you simply replace “license” with the word “software” in those same OSI tenets, you’ll suddenly find that this “open source” project doesn’t come close to being the “open source” most people believe in. People don’t expect the definition to stop at the license if you’re going to call something “open source” rather than “has an open source license”. The OSI only provides a definition of “Open Source” with respect to licenses.
So while you may consider a singular definition by an American organization, founded by corporations and designed to clarify and promote the licensing aspect of open source, as the end-all be-all definition of the words “open source”, others argue that there is more to software than just a license, and they hope the media won’t be fooled into reporting that X is offering “open source” access.
> there is no way to confirm that this code even runs anywhere ever
I'm confused what this has to do with "open source" or how it affects public perception.
I agree with you that it's totally possible to lie about what is actually running in production and that sharing some code doesn't mean it's that code, but how is this a new problem?
This is open source. The license is the Apache License, which meets the Open Source Definition:
https://github.com/xai-org/x-algorithm/blob/main/LICENSE
By license, sure, it is. But having a look at https://www.gnu.org/philosophy/free-sw.html.en#four-freedoms I kind of doubt it really is.
Freedom 1 is dubiously fulfilled: I can modify it, sure, but I can't modify it when the program runs on my data for me. Freedom 0 isn't fulfilled: I don't have the necessary input data to run the program myself.
(Of course the free software definition wasn't written for today's world, and the clarification below goes somewhat against my argument for Freedom 0. Feel free to pick this apart.)
Which part of open source says that it is NOT open source if the code is not run anywhere?
The claim is that THIS is the SOURCE that is being opened. That claim cannot be verified. If it's not what's actually running, then this isn't the SOURCE.
If I "Open Source" windows 11 but lie and put some other junk there then I can't CLAIM to have open sourced windows 11 now can I?
That’s not part of the open source definition.
You can claim the open source code isn’t Windows 11, but you can’t complain the code isn’t open source.
</pedantry>
(unless, of course, the code isn't licensed under an OSI-approved license. Parent didn't actually specify which license the hypothetical not-Windows-11 was being "open sourced" under, so we can't actually say for sure whether this hypothetical release is open source or not)
Yes that’s correct. I’m imagining it’s the Apache license like the X code, which is indeed an open source license.
> I feel like we need more awareness of what open source is and how it works. This is NOT open source.
This clearly has the goal of muddying the waters around the DSA transparency requirements. It's an opaque way of trying to mislead users into believing that X is being transparent while not being so at all.
They pretend to be transparent about their algorithms while denying researchers access to their API through exorbitant pricing and severely limited quotas.
You might want to ask a deep-research LLM to collect evidence for and against your claim and read through it. I just did with Gemini, and it convinced me (with evidence) that your claim is not consistent with the facts.
I am sceptical of Musk, but this seems to be a legitimate transparency move.
Access to read 1 million posts through the X API costs $5,000/month. Enterprise access to their API costs $42,000 per month.
Multiple researchers are being told by X that they must pay this fee to get access[1][2][3].
X has recently been fined for not providing this access to researchers, both for organic engagement and for paid advertising. [4]
The pricing of X's API is exorbitant and orders of magnitude higher than that of arguably higher-quality datasets like Reddit's: one million posts through the Reddit API costs $2.40, roughly 2,000 times less than X's $5,000.
The pricing scheme is obviously not value-based and is clearly designed to limit researchers' access to the data. As users here note, studying recommender systems requires studying the inputs and outputs of the system. Platforms are rightly not mandated to publish the inputs, due to privacy concerns. But they are mandated to make the outputs available. And they aren't. "Open sourcing" their algorithm is not a replacement for this; it's an obvious ploy to present themselves as transparent.
[1] https://arxiv.org/abs/2404.07340
[2] https://devcommunity.x.com/t/academic-twitter-access-is-dead...
[3] https://devcommunity.x.com/t/apply-academic-research-access/...
[4] https://ec.europa.eu/commission/presscorner/detail/en/ip_25_...
That is one of the worst clanker-brained replies I have ever read on this platform.
The OP you're replying to made a concrete point (X claims to be transparent but blocks researcher access through unreasonable price gating), and you didn't even attempt to refute it or engage with the substance of the post at all.
If you think that this evidence is so compelling why don't you link to some sources and summarize it in your own words? If you cannot be bothered to do that much, why are you replying in the first place?
Telling someone that they're wrong and they should just chat with an LLM to educate themselves removes any room for discussion, leaving this platform and other comment readers worse off.
Are you being sarcastic?
Nope. I didn't feel like spending the time to read the code, but I did want an LLM to pull out specific pieces for me and compare them to other published info. This is a good way to use LLMs: ask them to organize data for you to consider yourself and come to your own conclusion.
In this case, the info I looked at changed my opinion (downwards) on how cynical this release really was.
Err... for me: that's a shockingly small amount of code. I don't think there's over 5k LOC there.
Another one: there doesn't seem to be a single test file.
Honestly, this looks like a PoC (proof of concept). They've open sourced what was a PoC at one point.
Seems like that was the intent: "We have eliminated every single hand-engineered feature and most heuristics from the system"
It seems like what they've released is entirely useless. Just done for the headlines I guess. All the real information is the components not provided. They may as well have uploaded the CPython source and told us that was the algorithm, which executes a hand-engineered model of heuristics stored in a closed-source .py file.
I think the goal is for AI to take over the heuristics. This is basically code for the AI model.
Not really that surprising: all the logic that used to be in the code is now in the model; the only code left is some glue to connect the outside world to the number crunching, just like llama2.c runs your LLMs with only ~700 lines of C.
They're eating the code. They're eating the algorithms.
I wonder if this'll turn out like the last time they published their algorithm to great fanfare, and then didn't bother to ever update it: https://github.com/twitter/the-algorithm
Though, to be fair, there were hundreds of "rewrite it in Rust" issues opened against that old one - it looks like they listened!
Hasn't this become more of a black box now that it's Grok-based? And we've seen that Grok's responses can be actively tweaked whenever Elon doesn't like them.
I'm sure there's many examples but here's the first Google search result: https://www.theguardian.com/us-news/2025/nov/12/elon-musk-gr...
That’s not an example of what you’re claiming.
What is the difference between this and https://github.com/twitter/the-algorithm ?
Seems like that is the old one, and the one they just released is a new one.
"We have open-sourced our new algorithm, powered by the same transformer architecture as xAI's Grok model."
Old algo. They replaced the X algo a while ago; it uses Grok...
'it uses grok' means what?
> Grok based transformer
Is Grok not an LLM? Or do they have other models under that brand?
> > Grok based transformer
> Is Grok not an LLM?
The transformer is the underlying technology for (most) LLMs (GPT stands for “Generative Pre-trained Transformer”).
I don't know the answer to your second question, but what about "transformer" makes you think "not an LLM"?
I did not expect to see Rust. They seem to have forgotten to commit Cargo.toml though.
Oh, I see, it is not really meant to be built. Some code is omitted.
Surprising no one.
Can someone port this to a bluesky custom feed?
Someone will, and whoever does it will probably use an Agent CLI: Claude Code with Opus 4.5, Codex CLI with GPT‑5.2‑Codex, Gemini CLI with 3-Pro, GitHub Copilot CLI, etc. I’m 100% sure of it, I’d bet everything I have. Heck, even the code change was made by an AI Agent called “CI Agent” <support@x.ai> as seen here: https://github.com/xai-org/x-algorithm/commit/aaa167b3de8a67...
"CI Agent" has nothing to do with AI lol, it just stands for Continous Integration. The word "agent" predates AI.
anything interesting? anything that is a surprise?
By releasing these things, are they giving their competitors an advantage?
Someone explain.
They probably open sourced all the "safe" components everyone in the social media industry knows.
They most likely have some secret sauce that they don't release to the public.
Who? BlueSky...?
Plus they have done this before, and no real competitor has emerged since the last time they did it. So why not do it again?
The same reason many big corps open source their tech: goodwill/recruiting.
xAI likely needs both more than usual nowadays.
X's algo is not that amazing for that to happen. We are not talking about TikTok.
What competitors? Their moat is not tech based. A competitor can't outbuild them to compete.
Nobody is competing in this loss-making business model.
Social media apps do not compete on code quality but on user capture. People go to X because their friends are on X or there is someone on X they want to follow. The sole valuable aspect of any social media company is how many people use it. That's why, when Musk bought Twitter, he discarded the branding and the software engineers, rewrote the backend, and ditched the moderation. The only valuable thing he was interested in buying was the captive users of Twitter and the embedded value in their social relations and generated content.
> ditched the moderation
X has content moderation that relies on a mix of AI and human review, focusing on automated systems and user reports. There’s less emphasis on account suspensions and more on reach restriction, alongside community-led moderation like "Community Notes".
You couldn't pay me to use grok
I bet I could.
We don't want to get arrested for child pornography.
I am on X professionally as a developer relations engineer and I haven’t seen a single instance of this on X.
Meanwhile the people making a fuss about it are the same people that voted against investigating the recent child abuse scandal in the UK.
Nope.
I have character, and I'm a German who knows his history.
ooh, LLM Recsys alert! (we had an LLM Recsys track at ai.engineer last year). official announcement here: https://x.com/XEng/status/2013471689087086804
looks like this is the "for you" feed, once again shared without weights so we only have so much visibility into the actual influence of each trait.
"We have eliminated every single hand-engineered feature and most heuristics from the system. The Grok-based transformer does all the heavy lifting by understanding your engagement history (what you liked, replied to, shared, etc.) and using that to determine what content is relevant to you." aka it's a black box now.
the README is actually pretty nice, would recommend reading it. it doesn't look too different from Elon's original code review tweet/picture https://x.com/elonmusk/status/1593899029531803649?lang=en
sharing additional notes while diving through the source: https://deepwiki.com/xai-org/x-algorithm
and a codemap of the signal generation pipeline: https://deepwiki.com/search/make-a-map-of-all-the-signals_3d...
- Phoenix (out of network) ranker seems to have all the interesting predictive ML work. it estimates P(favorite), P(reply), P(repost), P(quote), P(click), P(video_view), P(share), P(follow_author), P(not_interested), P(block_author), P(mute_author), P(report) independently and then the `WeightedScorer` combines them using configurable weights (a minimal sketch of that combination step follows this list). there's an extra DiversityScore and OONScore to add some adjustments but again we don't know the weights https://deepwiki.com/xai-org/x-algorithm/4.1-phoenix-candida...
- other scores of interest: photo_expand_score, dwell_score, and dwell_time. share via copy, share, and share via dm are all obviously "super like" buttons.
- Two-Tower retrieval uses dot-product similarity between user features/engagement (User Tower) and normalized embeddings for all items (Candidate Tower); the scoring step is sketched after this list. but when you look into the code, and considering that this is probably the most important model for recommendation quality... it's maybe a little disappointing that it's a 2-layer MLP? https://deepwiki.com/search/what-models-are-used-for-user_98...
- Grok-1 JAX transformer (https://github.com/xai-org/x-algorithm/blob/main/phoenix/REA...) uses special attention masking that prevents candidates from attending to each other during inference. Each candidate only attends to the user context (engagement history). This ensures a candidate's score is independent of which other candidates are in the batch, enabling score consistency and caching (a toy mask builder is sketched after this list). nice image here https://github.com/xai-org/x-algorithm/blob/main/phoenix/REA...
- kind of nice usage of Rust traits to create a type safe data pipeline. look at this beautiful flow chart https://deepwiki.com/xai-org/x-algorithm/3-candidate-pipelin... and the "Field Ownership pattern" https://deepwiki.com/xai-org/x-algorithm/3.6-scorer-trait#fi...
- the ten pre-scoring filters are mildly interesting, nothing super surprising here apart from AgeFilter (https://deepwiki.com/xai-org/x-algorithm/4.6.1-agefilter) which I guess means beyond a certain max_age (1 day?) nothing ever shows up on For You. surprising to have a simple flat cutoff vs the alternative of exponential age decay (both contrasted in a sketch after this list).
- videoduration hydrator explicitly prioritizes video duration (https://deepwiki.com/xai-org/x-algorithm/4.5.6-videoduration...) but we don't know in what direction... do you recommend shorter or longer videos? and why a hydrator for what is presumably a pretty static property?
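a minimal sketch of the `WeightedScorer` combination step from the Phoenix bullet, in Rust since that's what the repo uses. the struct fields mirror a subset of the probability heads named above; every weight value here is invented for illustration, since the repo ships the heads but not the production weights:

```
/// A subset of the independent engagement-probability heads for one
/// candidate post (the full set has twelve heads).
struct EngagementProbs {
    favorite: f32,
    reply: f32,
    repost: f32,
    report: f32,
}

/// Hypothetical weights; the actual production values were not released.
struct Weights {
    favorite: f32,
    reply: f32,
    repost: f32,
    report: f32,
}

/// Linear combination of the heads, as the WeightedScorer is described.
fn weighted_score(p: &EngagementProbs, w: &Weights) -> f32 {
    p.favorite * w.favorite
        + p.reply * w.reply
        + p.repost * w.repost
        + p.report * w.report // presumably a negative weight
}

fn main() {
    let p = EngagementProbs { favorite: 0.12, reply: 0.03, repost: 0.02, report: 0.001 };
    let w = Weights { favorite: 1.0, reply: 10.0, repost: 1.0, report: -50.0 };
    println!("score = {:.4}", weighted_score(&p, &w)); // score = 0.3900
}
```

the nice property of this shape is that negative-feedback heads like P(report) or P(block_author) can simply get negative weights, so one linear combination handles both promotion and demotion.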
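and for the Two-Tower bullet, a sketch of the retrieval scoring as stated: L2-normalize the candidate embeddings, rank by dot product against the user embedding, keep the top k. the tower MLPs that produce these embeddings are elided; function names and dimensions are mine, not the repo's:

```
/// L2-normalize an embedding in place.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    for x in v.iter_mut() {
        *x /= norm;
    }
}

/// Dot-product similarity between two embeddings of equal length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Score every candidate against the user embedding, return the top k
/// as (candidate index, similarity) pairs.
fn retrieve_top_k(user: &[f32], candidates: &mut [Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = candidates
        .iter_mut()
        .enumerate()
        .map(|(i, c)| {
            l2_normalize(c);
            (i, dot(user, c))
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

fn main() {
    let user = vec![0.5, 0.5, 0.0];
    let mut candidates = vec![vec![1.0, 0.0, 0.0], vec![1.0, 1.0, 0.0], vec![0.0, 0.0, 1.0]];
    println!("{:?}", retrieve_top_k(&user, &mut candidates, 2)); // [(1, ~0.707), (0, 0.5)]
}
```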
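the candidate-isolation masking from the Grok-1 bullet is easiest to see as a toy mask builder. this assumes a [user-context tokens | candidate tokens] layout with causal attention inside the context, which is my reading of the README, not a transcription of the repo's actual JAX code:

```
/// Build a toy attention mask for a sequence laid out as
/// [ctx_len user-context tokens | n_candidates candidate tokens].
/// mask[q][k] == true means query position q may attend to key position k.
fn candidate_isolation_mask(ctx_len: usize, n_candidates: usize) -> Vec<Vec<bool>> {
    let total = ctx_len + n_candidates;
    let mut mask = vec![vec![false; total]; total];
    for q in 0..total {
        for k in 0..total {
            let k_is_candidate = k >= ctx_len;
            mask[q][k] = if q < ctx_len {
                // context tokens: causal attention within the context only
                !k_is_candidate && k <= q
            } else {
                // candidates: attend to the full context and to themselves,
                // never to other candidates, so each candidate's score is
                // independent of whatever else landed in the batch
                !k_is_candidate || k == q
            };
        }
    }
    mask
}

fn main() {
    // 3 context tokens, 2 candidates: the bottom-right 2x2 block is diagonal.
    for row in candidate_isolation_mask(3, 2) {
        println!("{:?}", row.iter().map(|&b| b as u8).collect::<Vec<u8>>());
    }
}
```

that batch-independence is exactly what makes per-candidate score caching safe.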
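finally, the AgeFilter point: a flat cutoff vs the exponential-decay alternative, side by side. the 24h max_age and the half-life are guesses ("1 day?" above), not values confirmed by the code:

```
/// Flat cutoff, as AgeFilter appears to work: posts older than max_age
/// are dropped entirely before scoring.
const MAX_AGE_HOURS: f32 = 24.0; // assumed value

fn passes_age_filter(age_hours: f32) -> bool {
    age_hours <= MAX_AGE_HOURS
}

/// Exponential-decay alternative: no hard cliff, older posts just score
/// lower and lower until they stop making the cut naturally.
fn decayed_score(base_score: f32, age_hours: f32, half_life_hours: f32) -> f32 {
    base_score * 0.5_f32.powf(age_hours / half_life_hours)
}

fn main() {
    for age in [1.0_f32, 12.0, 25.0, 48.0] {
        println!(
            "age {:>4}h | flat filter keeps it: {:5} | decayed score: {:.3}",
            age,
            passes_age_filter(age),
            decayed_score(1.0, age, 6.0)
        );
    }
}
```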
open questions from me:
1. how large is the production reranker? default param count is here https://deepwiki.com/search/how-many-params-is-the-transfo_c... but that gives no real indication. the latency felt ultra high initially last year and seems to have come down some; what latency budget are we working with?
2. can we make the retrieval better? i don't have a ton of confidence in the User Tower / Candidate Tower system - is this SOTA? (it's probably not - see how youtube does codebook semantic IDs https://www.youtube.com/watch?v=LxQsQ3vZDqo&list=PLcfpQ4tk2k... )
3. no a/b testing / rollout infrastructure?
4. so many hydration subsystems - is this brittle?
Sad that this is the only relevant comment in this thread, thanks for the insights. DeepWiki is very nice for this. Didn't know that e.g. copying the post link via the share button influences the algorithm!
Agreed.
Thanks, we'll put that link in the toptext too.
The only relevant technical discussion in the whole thread got downvoted to the bottom, and the top comment is on the "correct" political side but factually incorrect. This is typical on politically charged topics. Is there anything HN can do to reduce the impact of politically motivated voting? The discouragement of posting politics in the guidelines doesn't seem to be enough anymore.