# NewsBlur/apps/analyzer/tokenizer.py
import re
class Tokenizer:
    """A simple regex-based whitespace tokenizer.

    A document is normalized by collapsing every run of non-letter,
    non-hyphen characters into a single space; each configured phrase
    found in the normalized text is then yielded.  Matching can be
    done lower-cased or in the phrases' existing case.
    """

    # Any run of characters that is NOT a letter or a hyphen acts as a
    # separator between tokens.
    WORD_RE = re.compile(r"[^a-zA-Z-]+")

    def __init__(self, phrases, lower=False):
        # phrases: iterable of phrase strings to look for in documents.
        # lower: when True, phrases are matched case-insensitively.
        self.phrases = phrases
        self.lower = lower

    def tokenize(self, doc):
        """Yield each phrase from ``self.phrases`` found in *doc*.

        *doc* is first normalized: runs of non-letter, non-hyphen
        characters collapse to single spaces, so "Extra, Extra"
        matches the phrase "Extra Extra".
        """
        formatted_doc = " ".join(self.WORD_RE.split(doc))
        if self.lower:
            # Honor the ``lower`` flag (previously stored but unused):
            # compare both sides lower-cased, but yield the phrase in
            # its original case so callers see what they configured.
            formatted_doc = formatted_doc.lower()
            for phrase in self.phrases:
                if phrase.lower() in formatted_doc:
                    yield phrase
        else:
            for phrase in self.phrases:
                if phrase in formatted_doc:
                    yield phrase
if __name__ == "__main__":
    # Smoke test: "Extra, Extra" normalizes to "Extra Extra", which
    # matches the first configured phrase.
    phrases = ["Extra Extra", "Streetlevel", "House of the Day"]
    tokenizer = Tokenizer(phrases)

    doc = "Extra, Extra"
    # tokenize() is a generator; materialize it so the matching work
    # actually runs, and print the result so the demo shows something.
    print(list(tokenizer.tokenize(doc)))