making bbPress (and WordPress) work better!

more feed fixing including text super-sanitizer

If you like the idea of serving up all your content in a way that makes it super easy for search engine spammers and other nasties to grab it all, and/or lets a human reader never actually visit the pretty website you worked so hard on – well then the default WP feeds are for you.

Unfortunately I’m not in the above group. The default WP feeds, for posts and comments in both RSS and Atom formats, all give up EVERYTHING in ugly CDATA wrappers, which isn’t correct technique. They also ignore any plugins you have active to manipulate your text before it’s seen (e.g. [poll=2]).

It kills me how people spend hours and days making their website validate, which does nothing for human visitors, but don’t think twice about how their content appears via their feeds.

So I wanted the feeds to:
1. actually process any plugins used for the posts (e.g. [poll=2])
2. only give teaser excerpts (50 words or less)
3. not use ugly CDATA wrappers
4. not include the images
5. not include javascript, embeds or active-x

WordPress doesn’t do ANY of the above by default!

First I had to create a way to completely process the posts (#1) while also handling #2-#5.
Below the break you’ll find the function that does that, scrub_text().

Then you have to go into wp-rss2.php and replace the spot where it dumps the description/content with:
<description><?php echo scrub_text(apply_filters('the_content', $post->post_content), 50); ?></description>

Repeat with a similar technique for wp-commentsrss2.php, wp-atom.php and wp-commentsatom.php,
e.g. for wp-commentsrss2.php:
<description><?php echo scrub_text(get_comment_text(), 50); ?></description>

The nice thing about scrub_text is that it converts named HTML entities (such as &eacute;) to numeric character references, which will pass virtually all XML parsers, including IE6’s ancient code.
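To see why that conversion matters: XML itself only predefines five named entities (amp, lt, gt, quot, apos), so something like &nbsp; can make a strict feed parser choke, while the numeric form &#160; is always legal. Here is a minimal standalone sketch of the idea. It is not the technique scrub_text uses (that one does a translation-table lookup); the helper name named_to_numeric_entities is made up for illustration, and it assumes PHP 7.2+ with the mbstring extension for mb_ord().

```php
<?php
// Hypothetical helper (not part of scrub_text): rewrite named HTML 4.01
// entities as numeric character references so the text is safe inside XML.
// Assumes PHP 7.2+ with mbstring, because it relies on mb_ord().
function named_to_numeric_entities($text) {
    return preg_replace_callback('/&([A-Za-z][A-Za-z0-9]*);/', function ($m) {
        // The five entities predefined by XML are left untouched.
        if (in_array($m[1], array('amp', 'lt', 'gt', 'quot', 'apos'), true)) {
            return $m[0];
        }
        // Decode the entity to a real UTF-8 character.
        $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
        if ($char === $m[0]) {
            return $m[0]; // unknown entity name, leave it alone
        }
        // Re-encode the character as a decimal character reference.
        return '&#' . mb_ord($char, 'UTF-8') . ';';
    }, $text);
}

echo named_to_numeric_entities('caf&eacute;&nbsp;&amp; bar'), "\n";
// prints: caf&#233;&#160;&amp; bar
```

Leaving &amp; and friends alone is deliberate here: they are already valid XML, so there is nothing to gain by rewriting them.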

(complex filters borrowed from – thanks!)

function scrub_text($text, $limit = 0) {
    // Patterns run in order, so whole <script>/<style> blocks must go before the generic tag stripper
    $search = array(
        '@<script[^>]*?>.*?</script>@si', // Strip out javascript
        '@<style[^>]*?>.*?</style>@si',   // Strip style tags properly
        '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@',      // Strip multi-line comments including CDATA
    );
    $text = preg_replace($search, '', $text);
    if ($limit) {
        // keep only the first $limit space-separated words
        $blah = explode(' ', $text);
        if (count($blah) > $limit) { $k = $limit; $use_dotdotdot = 1; }
        else { $k = count($blah); $use_dotdotdot = 0; }
        $excerpt = '';
        for ($i = 0; $i < $k; $i++) { $excerpt .= $blah[$i].' '; }
        $excerpt .= ($use_dotdotdot) ? '...' : '';
        $text = $excerpt;
    }
    // convert named entities to numeric ones
    // (ISO-8859-1 keeps each decoded character a single byte, so ord() returns its code point)
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'ISO-8859-1');
    $htmlEntities = array_values($table);
    $entitiesDecoded = array_keys($table);
    $num = count($entitiesDecoded);
    $utf8Entities = array();
    for ($u = 0; $u < $num; $u++) { $utf8Entities[$u] = '&#'.ord($entitiesDecoded[$u]).';'; }

    return str_replace($htmlEntities, $utf8Entities, $text);
}
