Comment explaining need for feed preprocessing encoding step

Samuel Clay 2025-04-05 14:12:01 -07:00
parent d6f9ef24f8
commit a91fd46abc

@@ -71,9 +71,16 @@ from utils.youtube_fetcher import YoutubeFetcher
 def preprocess_feed_encoding(raw_xml):
     """
-    Check if the raw XML content contains any misencoded HTML entities that indicate
-    UTF-8 bytes were misinterpreted (e.g., sequences like ’
-    which represent a smart apostrophe).
+    Fix for The Verge RSS feed encoding issues (and other feeds with similar problems).
+
+    The Verge and other Vox Media sites often serve RSS feeds with special characters
+    that were incorrectly encoded. This happens when UTF-8 bytes are misinterpreted
+    as Latin-1/Windows-1252 characters and then HTML-encoded, resulting in garbled text
+    like "Apple’s" instead of "Apple's" with a smart apostrophe.
+
+    This function detects these patterns and reverses the process by:
+    1. Unescaping the HTML entities (producing characters like â)
+    2. Re-encoding as Latin-1 and decoding as UTF-8 to recover the original characters

     Args:
         raw_xml (str): The raw XML content fetched from the feed
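
The detect-then-reverse process described in the docstring can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the code from this commit: the helper name `repair_mojibake`, the marker tuple `MOJIBAKE_MARKERS`, and the sample string are all hypothetical. It also encodes with `cp1252` rather than strict Latin-1, on the assumption that the garbled text contains characters such as € and ™, which exist in Windows-1252 but not in ISO-8859-1.

```python
import html

# Hypothetical marker list: the UTF-8 bytes 0xE2 0x80 read as Windows-1252
# give "â€", which shows up HTML-escaped as "&acirc;&euro;".
MOJIBAKE_MARKERS = ("&acirc;&euro;",)


def repair_mojibake(text):
    """Hypothetical helper: undo UTF-8 read as Windows-1252 and then HTML-escaped."""
    # Detection: leave text alone unless it shows a known misencoding pattern.
    if not any(marker in text for marker in MOJIBAKE_MARKERS):
        return text
    # Step 1: turn the entities back into the characters â, € and ™.
    unescaped = html.unescape(text)
    try:
        # Step 2: map those characters back to single bytes, then decode as UTF-8.
        return unescaped.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the text was not misencoded this way.
        return text


garbled = "Apple&acirc;&euro;&trade;s new feed"
print(repair_mojibake(garbled))  # Apple’s new feed
```

The sketch is applied to a short text fragment. Because `html.unescape` also decodes entities such as `&amp;` and `&lt;`, running it over a whole XML document would need additional care that is not shown here.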