mirror of https://github.com/samuelclay/NewsBlur.git
synced 2025-08-20 05:14:44 +00:00
Comment explaining need for feed preprocessing encoding step
parent d6f9ef24f8
commit a91fd46abc
1 changed file with 10 additions and 3 deletions
@@ -71,9 +71,16 @@ from utils.youtube_fetcher import YoutubeFetcher
 
 def preprocess_feed_encoding(raw_xml):
     """
-    Check if the raw XML content contains any misencoded HTML entities that indicate
-    UTF-8 bytes were misinterpreted (e.g., sequences like &acirc;&euro;&trade;
-    which represent a smart apostrophe).
+    Fix for The Verge RSS feed encoding issues (and other feeds with similar problems).
+
+    The Verge and other Vox Media sites often serve RSS feeds with special characters
+    that were incorrectly encoded. This happens when UTF-8 bytes are misinterpreted
+    as Latin-1/Windows-1252 characters and then HTML-encoded, resulting in garbled text
+    like "Apple&acirc;&euro;&trade;s" instead of "Apple's" with a smart apostrophe.
+
+    This function detects these patterns and reverses the process by:
+    1. Unescaping the HTML entities (producing characters like â€™)
+    2. Re-encoding as Latin-1 and decoding as UTF-8 to recover the original characters
 
     Args:
         raw_xml (str): The raw XML content fetched from the feed
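The round trip described in the new docstring can be sketched as follows. This is a minimal illustration, not the function body from the commit (the hunk shows only the docstring): the names `garble`, `repair`, and `MISENCODED_MARKERS` are hypothetical, and the marker list holds a single pattern. The sketch encodes with `cp1252` rather than `latin-1` because `€` and `™` have no Latin-1 code point; the real implementation may handle this differently.

```python
import html

# Hypothetical marker list for this sketch; the real function may check other patterns.
# This one is a smart apostrophe (U+2019) after the double-encoding bug.
MISENCODED_MARKERS = ("&acirc;&euro;&trade;",)

# Only the three characters needed for this example are mapped to entities.
ENTITY_MAP = (
    ("\u00e2", "&acirc;"),  # a with circumflex
    ("\u20ac", "&euro;"),   # euro sign
    ("\u2122", "&trade;"),  # trade mark sign
)


def garble(text):
    """Reproduce the bug: UTF-8 bytes read as Windows-1252, then HTML-encoded."""
    misread = text.encode("utf-8").decode("cp1252")
    for char, entity in ENTITY_MAP:
        misread = misread.replace(char, entity)
    return misread


def repair(raw_xml):
    """Reverse the bug when a known marker is present; otherwise return the input."""
    if not any(marker in raw_xml for marker in MISENCODED_MARKERS):
        return raw_xml
    # Step 1: turn the entities back into the misread characters.
    unescaped = html.unescape(raw_xml)
    try:
        # Step 2: recover the original bytes, then decode them as UTF-8.
        return unescaped.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Leave the feed untouched if the round trip is not clean.
        return raw_xml


broken = garble("Apple\u2019s")
print(broken)          # Apple&acirc;&euro;&trade;s
print(repair(broken))  # Apple’s (with U+2019)
```

Checking for a marker before rewriting keeps correctly encoded feeds untouched, and falling back to the original input on a codec error avoids corrupting a feed that only partly matches the pattern.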