contextpro.normalization module

This module contains functions for text data normalization.

contextpro.normalization.batch_convert_numerals_to_numbers(documents: List[str], num_workers: Optional[int] = None) → List[str]

Replaces numerals with numbers in all sentences in a concurrent manner.

Parameters
  • documents (List[str]) – list of sentences which contain numerals

  • num_workers (Optional[int]) – number of logical processors to use, by default None (all)

Returns

list of sentences with numerals converted to numbers

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_convert_numerals_to_numbers
>>> corpus = [
...     "A bunch of five",
...     "A picture is worth a thousand words",
...     "A stitch in time saves nine",
...     "Back to square one",
...     "Behind the eight ball",
...     "Between two stools",
... ]
>>> batch_convert_numerals_to_numbers(corpus, num_workers=2)
[
    'A bunch of 5',
    'A picture is worth a 1000 words',
    'A stitch in time saves 9',
    'Back to square 1',
    'Behind the 8 ball',
    'Between 2 stools',
]
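The batch functions in this module share one contract: validate that the input is a list of strings, apply a per-sentence function to every element, and return the results in input order. The sketch below illustrates that pattern only; it is not contextpro's implementation. The helper name `batch_apply` is hypothetical, and a thread pool is used here as a stand-in for whatever worker pool the library actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def batch_apply(
    func: Callable[[str], str],
    documents: List[str],
    num_workers: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: validate the input, then map func over it."""
    # Reject anything that is not a list of strings, mirroring the
    # documented ValueError contract.
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise ValueError("'documents' must be a list of strings")
    # Assumption: a thread pool keeps this sketch portable. With
    # max_workers=None the executor picks its own default worker count.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(func, documents))


result = batch_apply(
    str.upper, ["a bunch of five", "between two stools"], num_workers=2
)
print(result)  # ['A BUNCH OF FIVE', 'BETWEEN TWO STOOLS']
```

Because `Executor.map` preserves input order, the output list lines up index by index with the input list regardless of which worker finishes first.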
contextpro.normalization.batch_lemmatize(tokens: List[List[str]], num_workers: Optional[int] = None, **kwargs: Any) → List[List[str]]

Lemmatizes tokens in nested token lists concurrently using NLTK’s WordNetLemmatizer.

Parameters
  • tokens (List[List[str]]) – nested token lists containing tokens with various inflectional forms

  • num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Other Parameters

**kwargs (Any) –

additional properties of the below lemmatizer:
  • nltk.WordNetLemmatizer

Returns

nested token lists with lemmatized tokens

Return type

List[List[str]]

Raises

ValueError – if ‘tokens’ provided are not a list of nested string lists

Examples

>>> from contextpro.normalization import batch_lemmatize
>>> corpus = [
...     ["I", "like", "driving", "a", "car"],
...     ["I", "am", "going", "for", "a", "walk"],
...     ["What", "are", "you", "doing"],
...     ["Where", "are", "you", "coming", "from"]
... ]
>>> batch_lemmatize(corpus, num_workers=2, pos="v")
[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'be', 'go', 'for', 'a', 'walk'],
    ['What', 'be', 'you', 'do'],
    ['Where', 'be', 'you', 'come', 'from']
]
contextpro.normalization.batch_remove_non_ascii_characters(documents: List[str]) → List[str]

Removes non-ASCII characters from sentences.

Parameters

documents (List[str]) – list of sentences with non-ASCII characters

Returns

list of sentences with non-ASCII characters removed

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_remove_non_ascii_characters
>>> corpus = [
...     "https://sitebulb.com/Folder/øê.html?大学",
...     "Jöreskog bißchen Zürcher",
...     "This is a © but not a ®",
...     "fractions ¼, ½, ¾"
... ]
>>> batch_remove_non_ascii_characters(corpus)
[
    'https://sitebulb.com/Folder/.html?',
    'Jreskog bichen Zrcher',
    'This is a  but not a ',
    'fractions , , '
]
contextpro.normalization.batch_remove_numbers(documents: List[str]) → List[str]

Removes numbers from all sentences.

Parameters

documents (List[str]) – list of sentences which contain numbers

Returns

list of sentences without numbers

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_remove_numbers
>>> corpus = [
...     "He is 12 years old.",
...     "His father has 3 cars",
...     "I have 3 computers",
...     "He earns 1000$ daily"
... ]
>>> batch_remove_numbers(corpus)
[
    'He is  years old.',
    'His father has  cars',
    'I have  computers',
    'He earns $ daily'
]
contextpro.normalization.batch_remove_punctuation(documents: List[str]) → List[str]

Removes punctuation characters from all sentences.

Parameters

documents (List[str]) – list of sentences which contain punctuation characters

Returns

list of sentences without punctuation characters

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_remove_punctuation
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde!",
...     "Is his name Edward Scissorhands?",
...     "This is Tom-Parker!"
... ]
>>> batch_remove_punctuation(corpus)
[
    'My name is Dr Jekyll',
    'His name is Mr Hyde',
    'Is his name Edward Scissorhands',
    'This is TomParker'
]
contextpro.normalization.batch_remove_stopwords(tokens: List[List[str]], custom_stopwords: List[str] = []) → List[List[str]]

Removes stopwords from nested token lists.

Parameters
  • tokens (List[List[str]]) – nested token lists with stopwords included

  • custom_stopwords (List[str], optional) – list of stopwords to remove from sentences, by default []

Returns

nested token lists without stopwords

Return type

List[List[str]]

Raises

ValueError – if ‘tokens’ provided are not a list of nested string lists

Examples

>>> from contextpro.normalization import batch_remove_stopwords
>>> corpus = [
...     ['My', 'name', 'is', 'Dr', 'Jekyll'],
...     ['His', 'name', 'is', 'Mr', 'Hyde'],
...     ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
...     ['And', 'this', 'is', 'Tom', 'Parker']
... ]
>>> batch_remove_stopwords(corpus)
[
    ['My', 'name', 'Dr', 'Jekyll'],
    ['His', 'name', 'Mr', 'Hyde'],
    ['This', 'guy', 'name', 'Edward', 'Scissorhands'],
    ['And', 'Tom', 'Parker']
]
contextpro.normalization.batch_remove_whitespace(documents: List[str]) → List[str]

Removes redundant whitespace characters from all sentences, collapsing each run of whitespace into a single space.

Parameters

documents (List[str]) – list of sentences which contain redundant whitespace characters

Returns

list of sentences without redundant whitespace characters

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_remove_whitespace
>>> corpus = [
...     "He Has Not Been    in Touch for over a Month.",
...     "I Will See \r\nYou next Week",
...     "I Am Hungry - Can We\t Eat Now, Please?",
...     "It Is \r\nFreezing Outside!"
... ]
>>> batch_remove_whitespace(corpus)
[
    'He Has Not Been in Touch for over a Month.',
    'I Will See You next Week',
    'I Am Hungry - Can We Eat Now, Please?',
    'It Is Freezing Outside!',
]
contextpro.normalization.batch_replace_contractions(documents: List[str], **kwargs: bool) → List[str]

Expands contractions in sentences.

Parameters

documents (List[str]) – list of sentences with contracted words

Other Parameters

**kwargs (bool) –

additional properties of the below method:
  • contractions.fix()

Returns

list of sentences without contractions

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_replace_contractions
>>> corpus = [
...     "I don't want to be rude, but you shouldn't do this",
...     "Do you think he'll pass his driving test?",
...     "I'll see you next week",
...     "I'm going for a walk"
... ]
>>> batch_replace_contractions(corpus)
[
    'I do not want to be rude, but you should not do this',
    'Do you think he will pass his driving test?',
    'I will see you next week',
    'I am going for a walk',
]
contextpro.normalization.batch_stem(tokens: List[List[str]], stemmer_type: Optional[str] = 'nltk_porter_stemmer', num_workers: Optional[int] = None, **kwargs: Any) → List[List[str]]

Stems tokens in lists of tokens to their root (base) form in a concurrent manner.

Parameters
  • tokens (List[List[str]]) – list of lists of tokens containing words with various inflectional forms

  • stemmer_type (Optional[str], optional) –

    stemmer type which will be used to stem the tokens, by default “nltk_porter_stemmer”

    Allowed values:
    • nltk_porter_stemmer

    • nltk_lancaster_stemmer

    • nltk_regexp_stemmer

    • nltk_snowball_stemmer

  • num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Other Parameters

**kwargs (Any) –

additional properties of the below stemmers:
  • nltk.PorterStemmer

  • nltk.LancasterStemmer

  • nltk.RegexpStemmer

  • nltk.SnowballStemmer

Returns

list of lists of stemmed tokens

Return type

List[List[str]]

Raises

ValueError – if ‘tokens’ provided are not a list of nested string lists

Examples

>>> from contextpro.normalization import batch_stem
>>> corpus = [
...     ["I", "like", "driving", "a", "car"],
...     ["I", "am", "going", "for", "a", "walk"],
...     ["Do", "you", "think", "this", "is", "doable"],
...     ["I", "have", "three", "bikes", "in", "two", "garages"]
... ]
>>> batch_stem(
...    corpus,
...    stemmer_type="nltk_porter_stemmer",
...    num_workers=2
... )
[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'am', 'go', 'for', 'a', 'walk'],
    ['Do', 'you', 'think', 'thi', 'is', 'doabl'],
    ['I', 'have', 'three', 'bike', 'in', 'two', 'garag']
]
contextpro.normalization.convert_numerals_to_numbers(sentence: str) → str

Replaces numerals with numbers in a sentence.

Parameters

sentence (str) – sentence with numerals

Returns

sentence with numerals replaced with numbers

Return type

str

Examples

>>> from contextpro.normalization import convert_numerals_to_numbers
>>> sentence = "A bunch of five"
>>> convert_numerals_to_numbers(sentence)
'A bunch of 5'
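As a rough illustration of the idea, a numeral-to-number conversion can be sketched as a lookup table applied through a word-boundary regular expression. This is a minimal sketch, not contextpro's implementation: the function name `numerals_to_digits` and the tiny `NUMERALS` table are hypothetical, and compound numerals such as "twenty one" are deliberately out of scope.

```python
import re

# Hypothetical, deliberately small lookup table used only for this sketch.
NUMERALS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
    "hundred": "100", "thousand": "1000",
}

# \b anchors keep the pattern from matching inside longer words
# (for example the "one" in "stone").
_PATTERN = re.compile(
    r"\b(" + "|".join(NUMERALS) + r")\b", flags=re.IGNORECASE
)


def numerals_to_digits(sentence: str) -> str:
    """Replace each known number word with its digit form."""
    return _PATTERN.sub(lambda m: NUMERALS[m.group(0).lower()], sentence)


print(numerals_to_digits("A bunch of five"))        # A bunch of 5
print(numerals_to_digits("Behind the eight ball"))  # Behind the 8 ball
```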
contextpro.normalization.lemmatize(tokens: List[str], **kwargs: Any) → List[str]

Lemmatizes tokens using NLTK’s WordNetLemmatizer.

Parameters

tokens (List[str]) – list of tokens containing words with various inflectional forms

Other Parameters

**kwargs (Any) –

additional properties of the below lemmatizer:
  • nltk.WordNetLemmatizer

Returns

list of lemmatized tokens

Return type

List[str]

Examples

>>> from contextpro.normalization import lemmatize
>>> tokens = ["I", "like", "driving", "a", "car"]
>>> lemmatize(tokens, pos="v")
['I', 'like', 'drive', 'a', 'car']
contextpro.normalization.remove_non_ascii_characters(sentence: str) → str

Removes non-ASCII characters from the provided sentence.

Parameters

sentence (str) – sentence with non-ASCII characters

Returns

sentence without non-ASCII characters

Return type

str

Examples

>>> from contextpro.normalization import remove_non_ascii_characters
>>> sentence = "https://sitebulb.com/Folder/øê.html?大学"
>>> remove_non_ascii_characters(sentence)
'https://sitebulb.com/Folder/.html?'
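One common way to get this behaviour with the standard library alone is an encode/decode round trip that ignores unencodable characters. The sketch below assumes nothing about contextpro's internals; `strip_non_ascii` is a hypothetical name.

```python
def strip_non_ascii(sentence: str) -> str:
    """Drop every character that has no ASCII representation."""
    # errors="ignore" silently skips characters that cannot be encoded;
    # decoding turns the remaining bytes back into a str.
    return sentence.encode("ascii", errors="ignore").decode("ascii")


cleaned = strip_non_ascii("Jöreskog bißchen Zürcher")
print(cleaned)  # Jreskog bichen Zrcher
```

Note that characters are dropped, not transliterated: "ö" disappears instead of becoming "o".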
contextpro.normalization.remove_numbers(sentence: str) → str

Removes numbers from the sentence.

Parameters

sentence (str) – sentence with number characters

Returns

sentence without number characters

Return type

str

Examples

>>> from contextpro.normalization import remove_numbers
>>> sentence = "He is 12 years old."
>>> remove_numbers(sentence)
'He is  years old.'
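The output above keeps the spaces that surrounded the number, which is what a plain digit-stripping regular expression produces. A minimal sketch of that behaviour, with the hypothetical name `strip_digits` and no claim about how contextpro does it:

```python
import re


def strip_digits(sentence: str) -> str:
    """Delete every run of digits, leaving surrounding text untouched."""
    return re.sub(r"\d+", "", sentence)


stripped = strip_digits("He is 12 years old.")
print(repr(stripped))  # 'He is  years old.'
```

The double space left behind can be cleaned up afterwards by a whitespace normalization step.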
contextpro.normalization.remove_punctuation(sentence: str) → str

Removes punctuation characters from the sentence.

Parameters

sentence (str) – sentence with punctuation characters

Returns

sentence without punctuation characters

Return type

str

Examples

>>> from contextpro.normalization import remove_punctuation
>>> sentence = "My name is Dr. Jekyll."
>>> remove_punctuation(sentence)
'My name is Dr Jekyll'
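A standard-library way to obtain the same result is a translation table that deletes every character in `string.punctuation`. This is a sketch under that assumption; `strip_punctuation` is a hypothetical name, and `string.punctuation` covers ASCII punctuation only, so typographic quotes or dashes would pass through unchanged.

```python
import string

# Translation table that maps every ASCII punctuation character to None.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence: str) -> str:
    """Delete ASCII punctuation characters from the sentence."""
    return sentence.translate(_PUNCT_TABLE)


print(strip_punctuation("My name is Dr. Jekyll."))  # My name is Dr Jekyll
print(strip_punctuation("This is Tom-Parker!"))     # This is TomParker
```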
contextpro.normalization.remove_stopwords(tokens: List[str], custom_stopwords: List[str] = []) → List[str]

Removes stopwords from the provided list of tokens.

Parameters
  • tokens (List[str]) – list of tokens, including stopwords

  • custom_stopwords (List[str], optional) – stopwords which should be removed from the list of tokens, by default []

Returns

list of tokens without stopwords

Return type

List[str]

Examples

>>> from contextpro.normalization import remove_stopwords
>>> tokens = ['My', 'name', 'is', 'Dr', 'Jekyll']
>>> remove_stopwords(tokens)
['My', 'name', 'Dr', 'Jekyll']
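Stopword removal amounts to filtering tokens against a set. The sketch below makes two assumptions that the documentation does not confirm: that custom stopwords extend a built-in list instead of replacing it, and that matching is case-sensitive (which would explain why 'My' survives in the example above). `DEFAULT_STOPWORDS` and `drop_stopwords` are hypothetical names, and the set is a tiny stand-in for a real stopword list.

```python
from typing import Iterable, List, Optional

# Hypothetical miniature stopword set; a real list is far longer.
DEFAULT_STOPWORDS = frozenset({"a", "an", "and", "is", "s", "the", "this"})


def drop_stopwords(
    tokens: List[str],
    custom_stopwords: Optional[Iterable[str]] = None,
) -> List[str]:
    """Keep only tokens that are not in the combined stopword set."""
    # Assumption: custom stopwords are added to the defaults.
    stopwords = DEFAULT_STOPWORDS | set(custom_stopwords or ())
    # The comparison is case-sensitive, so 'And' is kept while 'and'
    # would be dropped.
    return [token for token in tokens if token not in stopwords]


print(drop_stopwords(["My", "name", "is", "Dr", "Jekyll"]))
# ['My', 'name', 'Dr', 'Jekyll']
print(drop_stopwords(["And", "this", "is", "Tom", "Parker"], custom_stopwords=["Tom"]))
# ['And', 'Parker']
```

The sketch uses `None` as the default for `custom_stopwords` so that no mutable list object is shared between calls.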
contextpro.normalization.remove_whitespace(sentence: str) → str

Removes redundant whitespace characters from the sentence, collapsing each run of whitespace into a single space.

Parameters

sentence (str) – sentence with redundant whitespace characters

Returns

sentence without redundant whitespace characters

Return type

str

Examples

>>> from contextpro.normalization import remove_whitespace
>>> sentence = "He Has Not Been    in Touch for over a Month."
>>> remove_whitespace(sentence)
'He Has Not Been in Touch for over a Month.'
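The example output keeps single spaces between words, so the operation is better described as whitespace normalization than wholesale removal. One idiomatic way to get that effect is to split on whitespace and rejoin with single spaces. This is a sketch only; `collapse_whitespace` is a hypothetical name.

```python
def collapse_whitespace(sentence: str) -> str:
    """Collapse every run of whitespace into one space."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, \r, \n) and discards empty pieces.
    return " ".join(sentence.split())


print(collapse_whitespace("I Will See \r\nYou next Week"))
# I Will See You next Week
```

A side effect of this approach is that leading and trailing whitespace is stripped as well.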
contextpro.normalization.replace_contractions(sentence: str, **kwargs: bool) → str

Expands contractions in a sentence.

Parameters

sentence (str) – sentence with contracted words

Other Parameters

**kwargs (bool) –

additional properties of the below method:
  • contractions.fix()

Returns

sentence without contracted words

Return type

str

Examples

>>> from contextpro.normalization import replace_contractions
>>> sentence = "I don't want to be rude, but you shouldn't do this"
>>> replace_contractions(sentence)
'I do not want to be rude, but you should not do this'
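Conceptually, contraction expansion is a dictionary lookup driven by a regular expression. The sketch below shows that idea with a handful of entries; it is not the `contractions` package and not contextpro's code. The `CONTRACTIONS` table and `expand_contractions` name are hypothetical, and only straight apostrophes are handled.

```python
import re

# Hypothetical mini-dictionary; keys are lowercase for lookup.
CONTRACTIONS = {
    "don't": "do not",
    "shouldn't": "should not",
    "i'll": "I will",
    "he'll": "he will",
    "i'm": "I am",
}

# One alternation of all known contractions, matched on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(sentence: str) -> str:
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], sentence)


print(expand_contractions("I don't want to be rude, but you shouldn't do this"))
# I do not want to be rude, but you should not do this
```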
contextpro.normalization.stem(tokens: List[str], stemmer_type: Optional[str] = 'nltk_porter_stemmer', **kwargs: Any) → List[str]

Reduces tokens to their root (base) form.

Parameters
  • tokens (List[str]) – list of tokens containing words with various inflectional forms

  • stemmer_type (Optional[str], optional) –

    stemmer type which will be used to stem the tokens, by default “nltk_porter_stemmer”

    Allowed values:
    • nltk_porter_stemmer

    • nltk_lancaster_stemmer

    • nltk_regexp_stemmer

    • nltk_snowball_stemmer

Other Parameters

**kwargs (Any) –

additional properties of the below stemmers:
  • nltk.PorterStemmer

  • nltk.LancasterStemmer

  • nltk.RegexpStemmer

  • nltk.SnowballStemmer

Returns

list of stemmed tokens

Return type

List[str]

Examples

>>> from contextpro.normalization import stem
>>> tokens = ["I", "like", "driving", "a", "car"]
>>> stem(tokens, stemmer_type="nltk_porter_stemmer")
['I', 'like', 'drive', 'a', 'car']
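Of the listed stemmer types, the regular-expression variant is the simplest to picture: it deletes whatever a user-supplied pattern matches, optionally skipping short words. The standard-library sketch below illustrates that idea only. `regexp_stem` and its parameter names are this sketch's own, and it does not call NLTK or contextpro.

```python
import re
from typing import List


def regexp_stem(tokens: List[str], regexp: str, min_length: int = 0) -> List[str]:
    """Strip pattern matches from tokens at least min_length long."""
    pattern = re.compile(regexp)
    return [
        # Tokens shorter than min_length are passed through untouched.
        pattern.sub("", token) if len(token) >= min_length else token
        for token in tokens
    ]


stems = regexp_stem(
    ["I", "like", "driving", "a", "car"], regexp=r"ing$|s$", min_length=4
)
print(stems)  # ['I', 'like', 'driv', 'a', 'car']
```

Unlike the Porter output shown above, a pure suffix-stripping rule yields 'driv', not 'drive', because it has no step that restores a final "e".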