contextpro.normalization module
This module contains functions for text data normalization.
-
contextpro.normalization.batch_convert_numerals_to_numbers(documents: List[str], num_workers: Optional[int] = None) → List[str]
Concurrently replaces numerals with numbers in all sentences.
- Parameters
documents (List[str]) – list of sentences which contain numerals
num_workers (Optional[int]) – number of logical processors to use, by default None (all)
- Returns
list of sentences with numerals converted to numbers
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_convert_numerals_to_numbers
>>> corpus = [
...     "A bunch of five",
...     "A picture is worth a thousand words",
...     "A stitch in time saves nine",
...     "Back to square one",
...     "Behind the eight ball",
...     "Between two stools",
... ]
>>> batch_convert_numerals_to_numbers(corpus, num_workers=2)
[
    'A bunch of 5',
    'A picture is worth a 1000 words',
    'A stitch in time saves 9',
    'Back to square 1',
    'Behind the 8 ball',
    'Between 2 stools',
]
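The batch functions in this module share a pattern: validate that the input is a list of strings, raise ValueError otherwise, then apply a per-sentence function across workers. The sketch below illustrates that pattern with the standard library only. The helper name batch_apply is hypothetical, and a thread pool is used to keep the example short; since num_workers is described in terms of logical processors, the library itself may well rely on processes instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def batch_apply(
    func: Callable[[str], str],
    documents: List[str],
    num_workers: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: validate the input, then map func over it."""
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise ValueError("'documents' must be a list of strings")
    # max_workers=None lets the executor choose its default worker count.
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map returns results in the same order as the input.
        return list(executor.map(func, documents))


print(batch_apply(str.upper, ["back to square one", "between two stools"], num_workers=2))
# ['BACK TO SQUARE ONE', 'BETWEEN TWO STOOLS']
```

Because the results come back in input order, the output list lines up index by index with the input corpus, as in the example above.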
-
contextpro.normalization.batch_lemmatize(tokens: List[List[str]], num_workers: Optional[int] = None, **kwargs: Any) → List[List[str]]
Concurrently lemmatizes tokens in nested token lists using NLTK's WordNetLemmatizer.
- Parameters
tokens (List[List[str]]) – nested token lists containing tokens with various inflectional forms
num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)
- Other Parameters
**kwargs (Any) –
- additional keyword arguments for the lemmatizer below:
nltk.WordNetLemmatizer
- Returns
nested token lists with lemmatized tokens
- Return type
List[List[str]]
- Raises
ValueError – if ‘tokens’ provided are not a list of nested string lists
Examples
>>> from contextpro.normalization import batch_lemmatize
>>> corpus = [
...     ["I", "like", "driving", "a", "car"],
...     ["I", "am", "going", "for", "a", "walk"],
...     ["What", "are", "you", "doing"],
...     ["Where", "are", "you", "coming", "from"]
... ]
>>> batch_lemmatize(corpus, num_workers=2, pos="v")
[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'be', 'go', 'for', 'a', 'walk'],
    ['What', 'be', 'you', 'do'],
    ['Where', 'be', 'you', 'come', 'from']
]
-
contextpro.normalization.batch_remove_non_ascii_characters(documents: List[str]) → List[str]
Removes non-ascii characters from sentences.
- Parameters
documents (List[str]) – list of sentences with non-ascii characters
- Returns
list of sentences with removed non-ascii characters
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_remove_non_ascii_characters
>>> corpus = [
...     "https://sitebulb.com/Folder/øê.html?大学",
...     "Jöreskog bißchen Zürcher",
...     "This is a © but not a ®",
...     "fractions ¼, ½, ¾"
... ]
>>> batch_remove_non_ascii_characters(corpus)
[
    'https://sitebulb.com/Folder/.html?',
    'Jreskog bichen Zrcher',
    'This is a but not a ',
    'fractions , , '
]
-
contextpro.normalization.batch_remove_numbers(documents: List[str]) → List[str]
Removes numbers from all sentences.
- Parameters
documents (List[str]) – list of sentences which contain numbers
- Returns
list of sentences without numbers
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_remove_numbers
>>> corpus = [
...     "He is 12 years old.",
...     "His father has 3 cars",
...     "I have 3 computers",
...     "He earns 1000$ daily"
... ]
>>> batch_remove_numbers(corpus)
[
    'He is years old.',
    'His father has cars',
    'I have computers',
    'He earns $ daily'
]
-
contextpro.normalization.batch_remove_punctuation(documents: List[str]) → List[str]
Removes punctuation characters from all sentences.
- Parameters
documents (List[str]) – list of sentences which contain punctuation characters
- Returns
list of sentences without punctuation characters
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_remove_punctuation
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde!",
...     "Is his name Edward Scissorhands?",
...     "This is Tom-Parker!"
... ]
>>> batch_remove_punctuation(corpus)
[
    'My name is Dr Jekyll',
    'His name is Mr Hyde',
    'Is his name Edward Scissorhands',
    'This is TomParker'
]
-
contextpro.normalization.batch_remove_stopwords(tokens: List[List[str]], custom_stopwords: List[str] = []) → List[List[str]]
Removes stopwords from nested token lists.
- Parameters
tokens (List[List[str]]) – nested token lists with stopwords included
custom_stopwords (List[str], optional) – list of stopwords to remove from sentences, by default []
- Returns
nested token lists without stopwords
- Return type
List[List[str]]
- Raises
ValueError – if ‘tokens’ provided are not a list of nested string lists
Examples
>>> from contextpro.normalization import batch_remove_stopwords
>>> corpus = [
...     ['My', 'name', 'is', 'Dr', 'Jekyll'],
...     ['His', 'name', 'is', 'Mr', 'Hyde'],
...     ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
...     ['And', 'this', 'is', 'Tom', 'Parker']
... ]
>>> batch_remove_stopwords(corpus)
[
    ['My', 'name', 'Dr', 'Jekyll'],
    ['His', 'name', 'Mr', 'Hyde'],
    ['This', 'guy', 'name', 'Edward', 'Scissorhands'],
    ['And', 'Tom', 'Parker']
]
-
contextpro.normalization.batch_remove_whitespace(documents: List[str]) → List[str]
Removes redundant whitespace characters from all sentences.
- Parameters
documents (List[str]) – list of sentences which contain whitespace characters
- Returns
list of sentences without redundant whitespace characters
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_remove_whitespace
>>> corpus = [
...     "He Has Not Been in Touch for over a Month.",
...     "I Will See \r\nYou next Week",
...     "I Am Hungry - Can We\t Eat Now, Please?",
...     "It Is \r\nFreezing Outside!"
... ]
>>> batch_remove_whitespace(corpus)
[
    'He Has Not Been in Touch for over a Month.',
    'I Will See You next Week',
    'I Am Hungry - Can We Eat Now, Please?',
    'It Is Freezing Outside!',
]
-
contextpro.normalization.batch_replace_contractions(documents: List[str], **kwargs: bool) → List[str]
Expands contractions in sentences.
- Parameters
documents (List[str]) – list of sentences with contracted words
- Other Parameters
**kwargs (bool) –
- additional keyword arguments for the method below:
contractions.fix()
- Returns
list of sentences without contractions
- Return type
List[str]
- Raises
ValueError – if ‘documents’ provided are not a list of strings
Examples
>>> from contextpro.normalization import batch_replace_contractions
>>> corpus = [
...     "I don't want to be rude, but you shouldn't do this",
...     "Do you think he'll pass his driving test?",
...     "I'll see you next week",
...     "I'm going for a walk"
... ]
>>> batch_replace_contractions(corpus)
[
    'I do not want to be rude, but you should not do this',
    'Do you think he will pass his driving test?',
    'I will see you next week',
    'I am going for a walk',
]
-
contextpro.normalization.batch_stem(tokens: List[List[str]], stemmer_type: Optional[str] = 'nltk_porter_stemmer', num_workers: Optional[int] = None, **kwargs: Any) → List[List[str]]
Concurrently reduces tokens in nested token lists to their root (base) form.
- Parameters
tokens (List[List[str]]) – list of lists of tokens containing words with various inflectional forms
stemmer_type (Optional[str], optional) –
stemmer type which will be used to stem the tokens, by default “nltk_porter_stemmer”
- Allowed values:
nltk_porter_stemmer
nltk_lancaster_stemmer
nltk_regexp_stemmer
nltk_snowball_stemmer
num_workers (Optional[int], optional) – number of processors to use, by default None (all processors)
- Other Parameters
**kwargs (Any) –
- additional keyword arguments for the stemmers below:
nltk.PorterStemmer
nltk.LancasterStemmer
nltk.RegexpStemmer
nltk.SnowballStemmer
- Returns
list of lists of stemmed tokens
- Return type
List[List[str]]
- Raises
ValueError – if ‘tokens’ provided are not a list of nested string lists
Examples
>>> from contextpro.normalization import batch_stem
>>> corpus = [
...     ["I", "like", "driving", "a", "car"],
...     ["I", "am", "going", "for", "a", "walk"],
...     ["Do", "you", "think", "this", "is", "doable"],
...     ["I", "have", "three", "bikes", "in", "two", "garages"]
... ]
>>> batch_stem(
...     corpus,
...     stemmer_type="nltk_porter_stemmer",
...     num_workers=2
... )
[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'am', 'go', 'for', 'a', 'walk'],
    ['Do', 'you', 'think', 'thi', 'is', 'doabl'],
    ['I', 'have', 'three', 'bike', 'in', 'two', 'garag']
]
-
contextpro.normalization.convert_numerals_to_numbers(sentence: str) → str
Replaces numerals with numbers in a sentence.
- Parameters
sentence (str) – sentence containing numerals
- Returns
sentence with numerals replaced by numbers
- Return type
str
Examples
>>> from contextpro.normalization import convert_numerals_to_numbers
>>> sentence = "A bunch of five"
>>> convert_numerals_to_numbers(sentence)
'A bunch of 5'
-
contextpro.normalization.lemmatize(tokens: List[str], **kwargs: Any) → List[str]
Lemmatizes tokens using NLTK's WordNetLemmatizer.
- Parameters
tokens (List[str]) – list of tokens containing words with various inflectional forms
- Other Parameters
**kwargs (Any) –
- additional keyword arguments for the lemmatizer below:
nltk.WordNetLemmatizer
- Returns
list of lemmatized tokens
- Return type
List[str]
Examples
>>> from contextpro.normalization import lemmatize
>>> tokens = ["I", "like", "driving", "a", "car"]
>>> lemmatize(tokens, pos="v")
['I', 'like', 'drive', 'a', 'car']
-
contextpro.normalization.remove_non_ascii_characters(sentence: str) → str
Removes non-ascii characters from the provided sentence.
- Parameters
sentence (str) – sentence containing non-ascii characters
- Returns
sentence without non-ascii characters
- Return type
str
Examples
>>> from contextpro.normalization import remove_non_ascii_characters
>>> sentence = "https://sitebulb.com/Folder/øê.html?大学"
>>> remove_non_ascii_characters(sentence)
'https://sitebulb.com/Folder/.html?'
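For readers who want to see what dropping non-ascii characters amounts to, here is a minimal standard-library sketch. The function name strip_non_ascii is hypothetical, and the encode/decode round trip is only one common way to get this behaviour; it is not necessarily how the library implements it.

```python
def strip_non_ascii(sentence: str) -> str:
    """Hypothetical stand-in: drop every character outside the ASCII range."""
    # errors="ignore" silently discards characters that cannot be encoded.
    return sentence.encode("ascii", errors="ignore").decode("ascii")


print(repr(strip_non_ascii("fractions ¼, ½, ¾")))
# 'fractions , , '
```

Note that characters are dropped rather than transliterated, so accented letters disappear instead of being replaced by their base letters.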
-
contextpro.normalization.remove_numbers(sentence: str) → str
Removes numbers from the sentence.
- Parameters
sentence (str) – sentence containing number characters
- Returns
sentence without number characters
- Return type
str
Examples
>>> from contextpro.normalization import remove_numbers
>>> sentence = "He is 12 years old."
>>> remove_numbers(sentence)
'He is years old.'
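A regular expression is one simple way to remove number characters. The sketch below uses the hypothetical name strip_digits and makes no claim about the library's actual implementation; in particular, it leaves the spaces on either side of a removed number untouched.

```python
import re


def strip_digits(sentence: str) -> str:
    """Hypothetical stand-in: delete every run of digit characters."""
    # \d+ matches one or more consecutive digits; nothing else is changed.
    return re.sub(r"\d+", "", sentence)


print(repr(strip_digits("He earns 1000$ daily")))
# 'He earns $ daily'
```

With this sketch, a number surrounded by spaces leaves two adjacent spaces behind, so it is often paired with a whitespace clean-up step.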
-
contextpro.normalization.remove_punctuation(sentence: str) → str
Removes punctuation characters from the sentence.
- Parameters
sentence (str) – sentence containing punctuation characters
- Returns
sentence without punctuation characters
- Return type
str
Examples
>>> from contextpro.normalization import remove_punctuation
>>> sentence = "My name is Dr. Jekyll."
>>> remove_punctuation(sentence)
'My name is Dr Jekyll'
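Punctuation removal of this kind can be sketched with str.translate and the standard string.punctuation constant. The name strip_punctuation is hypothetical, and the sketch assumes that "punctuation" means the ASCII punctuation set; the library may use a different definition.

```python
import string

# Translation table that maps every ASCII punctuation character to None.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence: str) -> str:
    """Hypothetical stand-in: delete ASCII punctuation characters."""
    return sentence.translate(_PUNCTUATION_TABLE)


print(strip_punctuation("This is Tom-Parker!"))
# This is TomParker
```

As the hyphenated name shows, characters are deleted rather than replaced by a space, so hyphenated words are joined.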
-
contextpro.normalization.remove_stopwords(tokens: List[str], custom_stopwords: List[str] = []) → List[str]
Removes stopwords from the provided list of tokens.
- Parameters
tokens (List[str]) – list of tokens, including stopwords
custom_stopwords (List[str], optional) – stopwords to remove from the list of tokens, by default []
- Returns
list of tokens without stopwords
- Return type
List[str]
Examples
>>> from contextpro.normalization import remove_stopwords
>>> tokens = ['My', 'name', 'is', 'Dr', 'Jekyll']
>>> remove_stopwords(tokens)
['My', 'name', 'Dr', 'Jekyll']
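The examples in this module keep capitalised tokens such as 'My' and 'And', which suggests that matching is case-sensitive. The sketch below illustrates that behaviour with a tiny, made-up stopword set. The name drop_stopwords is hypothetical, and treating custom_stopwords as an addition to a default list is an assumption, since the documentation does not say whether custom entries extend or replace the defaults.

```python
from typing import Iterable, List

# Tiny illustrative set; a real stopword list is far larger.
DEFAULT_STOPWORDS = frozenset({"is", "s", "this", "my", "his", "and"})


def drop_stopwords(
    tokens: List[str], custom_stopwords: Iterable[str] = ()
) -> List[str]:
    """Hypothetical stand-in: filter out tokens found in the stopword set."""
    # Assumption: custom stopwords are added to the defaults.
    stopwords = DEFAULT_STOPWORDS | set(custom_stopwords)
    # Exact, case-sensitive comparison: 'And' is kept, 'this' is dropped.
    return [token for token in tokens if token not in stopwords]


print(drop_stopwords(["And", "this", "is", "Tom", "Parker"]))
# ['And', 'Tom', 'Parker']
```

Lowercasing tokens before filtering is a common way to catch capitalised stopwords when case-sensitive matching is too lenient.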
-
contextpro.normalization.remove_whitespace(sentence: str) → str
Removes redundant whitespace characters from the sentence.
- Parameters
sentence (str) – sentence containing redundant whitespace characters
- Returns
sentence without redundant whitespace characters
- Return type
str
Examples
>>> from contextpro.normalization import remove_whitespace
>>> sentence = "He Has Not Been in Touch for over a Month."
>>> remove_whitespace(sentence)
'He Has Not Been in Touch for over a Month.'
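The batch example for this operation shows runs of spaces, tabs and line breaks being reduced to single spaces between words. A minimal sketch of that kind of normalization, using the hypothetical name collapse_whitespace and only built-in string methods, could look as follows; the library's own approach may differ.

```python
def collapse_whitespace(sentence: str) -> str:
    """Hypothetical stand-in: squeeze whitespace runs into single spaces."""
    # str.split() with no arguments splits on any whitespace run and
    # discards leading and trailing whitespace.
    return " ".join(sentence.split())


print(collapse_whitespace("I Will See \r\nYou  next Week"))
# I Will See You next Week
```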
-
contextpro.normalization.replace_contractions(sentence: str, **kwargs: bool) → str
Expands contractions in a sentence.
- Parameters
sentence (str) – sentence containing contracted words
- Other Parameters
**kwargs (bool) –
- additional keyword arguments for the method below:
contractions.fix()
- Returns
sentence with contractions expanded
- Return type
str
Examples
>>> from contextpro.normalization import replace_contractions
>>> sentence = "I don't want to be rude, but you shouldn't do this"
>>> replace_contractions(sentence)
'I do not want to be rude, but you should not do this'
-
contextpro.normalization.stem(tokens: List[str], stemmer_type: Optional[str] = 'nltk_porter_stemmer', **kwargs: Any) → List[str]
Reduces tokens to their root (base) form.
- Parameters
tokens (List[str]) – list of tokens containing words with various inflectional forms
stemmer_type (Optional[str], optional) –
stemmer type which will be used to stem the tokens, by default “nltk_porter_stemmer”
- Allowed values:
nltk_porter_stemmer
nltk_lancaster_stemmer
nltk_regexp_stemmer
nltk_snowball_stemmer
- Other Parameters
**kwargs (Any) –
- additional keyword arguments for the stemmers below:
nltk.PorterStemmer
nltk.LancasterStemmer
nltk.RegexpStemmer
nltk.SnowballStemmer
- Returns
list of stemmed tokens
- Return type
List[str]
Examples
>>> from contextpro.normalization import stem
>>> tokens = ["I", "like", "driving", "a", "car"]
>>> stem(tokens, stemmer_type="nltk_porter_stemmer")
['I', 'like', 'drive', 'a', 'car']