Comment explaining need for feed preprocessing encoding step

2025-08-20 05:14:44 +00:00 · 2025-04-05 14:12:01 -07:00 · 2025-04-05 14:12:01 -07:00 · a91fd46abc
commit a91fd46abc
parent d6f9ef24f8
1 changed file with 10 additions and 3 deletions
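
The docstring changed in this commit describes a two-step repair: unescape the HTML entities, then reinterpret the resulting characters as bytes and decode them as UTF-8. The sketch below illustrates that idea on a single text fragment. It is not the body of preprocess_feed_encoding, which this diff does not show. The helper name fix_mojibake is hypothetical. It also assumes Windows-1252 rather than strict Latin-1 for the re-encoding step, because Python's html.unescape follows the HTML5 rules and maps &#128; and &#153; to "€" and "™", which Latin-1 cannot encode.

```python
import html


def fix_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 text that was misread as
    Windows-1252 and then HTML-escaped. Meant for a single text
    fragment, not a whole XML document (unescaping would also turn
    &lt; and &amp; into markup characters)."""
    # Step 1: "&acirc;&#128;&#153;" becomes the three characters "â€™".
    unescaped = html.unescape(text)
    try:
        # Step 2: map each character back to its original byte, then
        # decode those bytes as the UTF-8 they were in the first place.
        return unescaped.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip only works for genuinely misencoded text;
        # anything else is returned unchanged.
        return text


if __name__ == "__main__":
    print(fix_mojibake("Apple&acirc;&#128;&#153;s"))
```

Falling back to the untouched input when the round trip fails keeps correctly encoded text from being damaged, at the cost of leaving partially garbled fragments as they are.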
--- a/utils/feed_fetcher.py
+++ b/utils/feed_fetcher.py
@@ -71,9 +71,16 @@ from utils.youtube_fetcher import YoutubeFetcher
 def preprocess_feed_encoding(raw_xml):
    """
-    Check if the raw XML content contains any misencoded HTML entities that indicate
-    UTF-8 bytes were misinterpreted (e.g., sequences like &acirc;&#128;&#153;
-    which represent a smart apostrophe).
+    Fix for The Verge RSS feed encoding issues (and other feeds with similar problems).
+
+    The Verge and other Vox Media sites often serve RSS feeds with special characters
+    that were incorrectly encoded. This happens when UTF-8 bytes are misinterpreted
+    as Latin-1/Windows-1252 characters and then HTML-encoded, resulting in garbled text
+    like "Apple&acirc;&#128;&#153;s" instead of "Apple's" with a smart apostrophe.
+    This function detects these patterns and reverses the process by:
+    1. Unescaping the HTML entities (producing characters like â€™)
+    2. Re-encoding as Latin-1 and decoding as UTF-8 to recover the original characters
    Args:
        raw_xml (str): The raw XML content fetched from the feed