I Wanna Be <![CDATA[
Sung to the tune of “I Wanna Be Sedated”, with apologies to The Ramones.
Twenty-twenty-twenty four escapes to go, I wanna be <![CDATA[
Nothin’ to markup and no where to quo-o-ote, I wanna be <![CDATA[
Just get me through the parser, put me in a node
Hurry hurry hurry before I go inline
I can’t control my syntax, I can’t control my name
Oh no no no no no
Twenty-twenty-twenty four escapes to go….
Just put me in a stylesheet, get me in a namespace
Hurry hurry hurry before I go inline
I can’t control my syntax, I can’t control my name
Oh no no no no no
Twenty-twenty-twenty four escapes to go, I wanna be <![CDATA[
Nothin’ to markup and no where to quo-o-ote, I wanna be <![CDATA[
Just get me through the parser, put me in a node
Hurry hurry hurry before I go loco
I can’t control my syntax I can’t control my name
Oh no no no no no
Twenty-twenty-twenty escapes to go…
Just get me through the parser…
Ba-ba-bamp-ba ba-ba-ba-bamp-ba I wanna be <![CDATA[
Ba-ba-bamp-ba ba-ba-ba-bamp-ba I wanna be <![CDATA[
Ba-ba-bamp-ba ba-ba-ba-bamp-ba I wanna be <![CDATA[
Ba-ba-bamp-ba ba-ba-ba-bamp-ba I wanna be <![CDATA[
Manual string replacement with a hardcoded list of cases for escaping as suggested by the article isn't good advice for the use case of 'support inserting arbitrary text'.
Do use CDATA nodes, but only work on XML with an actual XML DOM library instead of string manipulation. Browsers have these built-in (DOMParser).
I totally understand the general advice of using an actual XML DOM library for building the DOM. But for my own understanding, I want to ask why the 5 escapes the OP suggests (&, <, >, " and ') aren't good enough. Do you see any way to exploit it if these 5 are escaped? Would someone be kind enough to enlighten me?
They are:
> The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and MUST, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
> In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, " ]]> ". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".
> To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".
https://www.w3.org/TR/xml/#syntax
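For reference, the five escapes quoted above can be sketched in a few lines of JavaScript. This is only a minimal illustration, not the article's code: `escapeXml` is a hypothetical helper name, and the sketch assumes the input contains only characters that XML 1.0 allows at all (most control characters cannot be represented, escaped or not).

```javascript
// Hypothetical helper: escape the five characters named in the spec excerpt.
// The ampersand is replaced first so that the "&" inside the entities
// produced by the later replacements is not escaped a second time.
function escapeXml(text) {
  return String(text)
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&apos;");
}

// Example input containing all five characters plus a "]]>" sequence.
const title = `Tom & Jerry's <b>"best"</b> ]]> moments`;
console.log(`<title>${escapeXml(title)}</title>`);
```

Because `>` is always escaped here, the `]]>` rule from the first quoted paragraph is covered as a side effect.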
i never really liked CDATA but i'm not buying the argument here since you can do the escaping with replaceAll("]]>", "]]]]><![CDATA[>") instead of four replaceAlls. (assuming you are writing your own xml serializer in javascript in 2026 for some reason)
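A small sketch of that single-replacement approach, wrapped into a function. `toCdata` is a made-up name for illustration, and like any string-based serializer it assumes the input has no characters that XML forbids outright.

```javascript
// Hypothetical helper: wrap arbitrary text in a CDATA section.
// A literal "]]>" would end the section early, so it is split across two
// sections: "]]" stays in the first one and ">" starts the next.
function toCdata(text) {
  const body = String(text).replaceAll("]]>", "]]]]><![CDATA[>");
  return "<![CDATA[" + body + "]]>";
}

// HTML that happens to contain the terminator inside some inline code.
const html = "<p>if (a[b[0]]> c) return;</p>";
console.log(`<description>${toCdata(html)}</description>`);
```

Content without `]]>` passes through verbatim, which is the common case; only the terminator triggers the split into two adjacent sections.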
Just out of curiosity, I looked at the HN RSS feed and they still use regular escape for titles (and some other things, except description). It means they use 2 versions of escape instead of 1. So why not just use 1?
Different requirements.
The description contains HTML markup, such as <p></p> for paragraph breaks. CDATA is a nice and clean way to encode them without breaking anything.
The title doesn't contain any markup, and shouldn't. A good old escape function covers both the "doesn't" part and the "shouldn't" part.
What requirements are you talking about? Human readability? IMHO, RSS is for feed readers, not humans. When looking at https://news.ycombinator.com/rss , the RSS isn't that human friendly at all, all line breaks are removed. The point is the simplicity and uniformity, regular escape works well for many cases, not just description.
That assumes that you don't have anything else to escape or sanitize.
I see people stuffing all sorts of HTML tags and nonstandard attributes in an RSS <description>, just because CDATA allows them to do so without breaking the parser. Images, videos, inline SVGs with maybe some scripts inside...
The RSS spec should never have allowed this. Reading a feed would have been much more pleasant (not to mention safer for everyone!) if the contents were required to be in plain text.
I’m not sure I understand why this is a problem. RSS is a spec for publishing a list of available content, or publishing the content directly. Formatting that content was always going to be something people wanted to do, so whether it was rich text, HTML or what became Markdown, it was inevitable that aggregators were always going to have to deal with both publishers wanting their publication to have styles and users wanting their aggregator software to either handle that style or hide it.
At least with a CDATA tag you're being explicitly told “here be dragons”.
CData's "ad-hoc escaping" by closing and reopening a CData section always felt to me as if it could be a compatibility hazard - I think most examples of CData sections have a single section spanning the whole text node - so I wouldn't be surprised if some homegrown parsers didn't handle "edge cases" like multiple sections inside a text node correctly.
But I'd want to see evidence that this is actually the case. The OP seems to argue "don't use CData, because the escape sequence for ]]> looks confusing" - and that's just vibes, not a proper argument.
If it's for "looks" I think CData would actually be the much better choice. ]]> appears extremely rarely in RSS content while <>& are guaranteed to appear if your content is HTML. So in 99.99% of cases, you won't need any escaping at all for CData and can just insert the HTML verbatim, while "regular" escaping will change every single angle bracket of your HTML.
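To make the "multiple sections inside a text node" edge case concrete, here is a toy comparison in JavaScript. Both functions are hypothetical and regex-based purely for illustration; this shows how a homegrown reader could go wrong, not evidence that any real parser does.

```javascript
// Element content as produced by splitting "]]>" across two CDATA sections.
const content = "<![CDATA[a ]]]]><![CDATA[> b]]>";

// Naive approach: assume there is exactly one section and return its body.
function readSingleSection(xml) {
  const match = xml.match(/<!\[CDATA\[([\s\S]*?)\]\]>/);
  return match ? match[1] : xml;
}

// Section-aware approach: collect every section and concatenate the bodies.
// Still a toy: it ignores any ordinary escaped text between sections.
function readAllSections(xml) {
  const sections = [...xml.matchAll(/<!\[CDATA\[([\s\S]*?)\]\]>/g)];
  return sections.map((m) => m[1]).join("");
}

console.log(JSON.stringify(readSingleSection(content))); // truncated text
console.log(JSON.stringify(readAllSections(content)));   // full text
```

The naive reader stops after the first section and loses the rest of the text, while the section-aware one recovers the original string including the `]]>`.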
In my experience, liberal use of CDATA is often the only way to get third-party data-importing software to work correctly.
Whether it's efficient is a far second to whether it successfully imports the data.
Looking at you, WP All Import...
Imo, ease and multi-line HTML readability of CDATA outweighs that one edge case (it cannot directly contain its own terminator).
But are most people going to read raw RSS? Just out of curiosity, I checked HN RSS, it still escapes characters in a regular way (without CDATA) for titles. So, just keep 2 versions of XML escape instead of one?
RSS is truly in its own little universe.
I recently became aware of RSS stylesheets. Apparently there is a specification for that called XSLT which is distinct from CSS in both form and function. However, there are plans by Google/Mozilla to remove XSLT from their browser engines for security/maintainability reasons. Apparently RSS supports javascript though, so it's possible to manipulate the RSS DOM that way. One could imagine a javascript polyfill that interprets XSLT, although I'm not sure if there's some cross-site security issues that would make that impractical.
> RSS is truly in its own little universe.
More like a little island in the XML archipelago.
> RSS stylesheets. Apparently there is a specification for that called XSLT
XSLT is a bit more than just “RSS stylesheets”.
The removal was discussed here: https://news.ycombinator.com/item?id=45873434
No need to imagine a polyfill, they already exist: https://github.com/mfreed7/xslt_polyfill
If you want to style RSS, you can just go straight to JavaScript and avoid all the XSLT mishegas.
I made a site to get people started: https://www.rss.style/
Example RSS feed: https://www.rss.style/changelog.xml
Cross-site is fine by default, though the script is small enough to easily self-host. If you have a content-security-policy, you'll need to allow the host in script-src.
I would not follow this advice. The most trouble I’ve had with RSS was usually from not having it. I also have never used CDATA at a word level - just wrap the full text block in it.
I come from a very different, old-school perspective, because I hand-write my blog posts in HTML and also hand-write my RSS feed in XML.
I've found CDATA invaluable, because I can just copy and paste the content from the HTML file to the XML file. I've never used the CDATA terminator characters in a blog post, so that's a non-problem.
This is mostly about when you write your automated feed generator.
> This is mostly about when you write your automated feed generator.
Yes, that's why I said, "I come from a very different, old-school perspective."
However, I don't find the points persuasive:
1. A special case for the CDATA terminator doesn't seem any worse than special cases for every HTML character that needs to be escaped in XML.
2. I'm not sure who exactly the hypothetical misled people are (straw men?) who would think "the content is raw HTML or somehow safer."
3. I'm not sure how splitting CDATA blocks is "less uniform" than escaped characters or why less uniform output is a downside, especially as you state in another comment, "IMHO, RSS is for feed readers, not humans."
4. I'm not sure how CDATA makes "debugging confusing," and in any case using CDATA blocks inside an article seems like a pretty rare case; like I said, I haven't done that myself.
The worst use of the <BLINK> tag ever was the discussion held in the early days of RSS about escaping HTML in titles, whose attention-grabbing title went something like this: "Hey, what happens when you put a <BLINK> tag in the title???!!!"
The content of that notorious discussion went on and off and on and off for weeks, giving all the netizens of the RSS community blogosphere terrible headaches, with people's entire blogs disappearing and reappearing every second, until it finally reached a flashing point, when Dave Winer humbly conceded that it wasn't the user's fault for being an idiot, and maybe just maybe there was a tiny teeny little design flaw in RSS, and it wasn't actually such a great idea to allow HTML tags in RSS titles.
Is it just me or is the back button broken on this website?
For me it seems fine. Sometimes when I click on the headings I have to press back several times, but this is because they have anchored links.
Hm, on my desktop it's fine too, just on the phone it plain doesn't want to go back…