contextpro package

contextpro.batch_calculate_corpus_statistics(documents: List[str], lowercase: bool = False, remove_stopwords: bool = False, tokenizer_pattern: str = '\\b[^\\d\\W]+\\b', custom_stopwords: List[str] = [], num_workers: Optional[int] = None) → pandas.core.frame.DataFrame

Calculates the below statistics for each document in the corpus in a concurrent manner:

  • Number of characters

  • Number of tokens

  • Number of punctuation characters

  • Number of digits

  • Number of whitespace characters

  • Number of non-ascii characters

  • Sentiment score

  • Subjectivity score

Parameters
  • documents (List[str]) – list of strings

  • lowercase (bool, optional) – convert all characters to lowercase before calculating statistics, by default False

  • remove_stopwords (bool, optional) – remove stopwords before calculating statistics. Uses english stopwords from the NLTK library if ‘custom_stopwords’ list is not provided, by default False

  • tokenizer_pattern (str, optional) – regex pattern used by the underlying NLTK Regexp Tokenizer to tokenize the documents, by default r"\b[^\d\W]+\b"

  • custom_stopwords (List[str], optional) – custom stopwords to use for token filtering, by default []

  • num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Returns

DataFrame with statistics for each document in the provided corpus

Return type

pd.DataFrame

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_corpus_statistics
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_calculate_corpus_statistics(
...     corpus,
...     lowercase=False,
...     remove_stopwords=False,
...     num_workers=2,
... )
    characters  tokens  punctuation_characters  digits  whitespace_characters  \
0          22       5                       2       0                      4
1          20       5                       1       0                      4
2          38       7                       1       0                      5
3          22       5                       0       0                      4

   ascii_characters  sentiment_score  subjectivity_score
0                22              0.0                 0.0
1                20              0.0                 0.0
2                38              0.0                 0.0
3                22              0.0                 0.0
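
The character-level statistics listed above can be approximated with the standard library alone. The sketch below is illustrative only: the helper name character_statistics and its dictionary keys are assumptions, not part of contextpro, and the library may count some character classes differently.

```python
import string


def character_statistics(document: str) -> dict:
    """Count character classes in one document (hypothetical helper)."""
    return {
        "characters": len(document),
        # string.punctuation covers ASCII punctuation only.
        "punctuation_characters": sum(ch in string.punctuation for ch in document),
        "digits": sum(ch.isdigit() for ch in document),
        "whitespace_characters": sum(ch.isspace() for ch in document),
        # Code points above 127 fall outside the ASCII range.
        "non_ascii_characters": sum(ord(ch) > 127 for ch in document),
    }


stats = character_statistics("My name is Dr. Jekyll.")
print(stats)
```

For the first sentence of the example corpus this gives 22 characters, 2 punctuation characters, 0 digits and 4 whitespace characters, in line with row 0 of the table above.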

contextpro.batch_calculate_sentiment_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]

Calculate sentiment scores for sentences in a concurrent manner.

Parameters
  • documents (List[str]) – list of sentences whose sentiment scores are to be calculated

  • num_workers (Optional[int]) – number of logical processors to use, by default None (all)

Returns

list of floats within [-1.0, 1.0] range representing sentiment scores for the sentences where -1.0 means negative and 1.0 positive

Return type

List[float]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_sentiment_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_sentiment_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.5, -0.35, -1.0]
contextpro.batch_calculate_subjectivity_scores(documents: List[str], num_workers: Optional[int] = None) → List[float]

Calculate subjectivity scores for sentences in a concurrent manner.

Parameters
  • documents (List[str]) – list of sentences whose subjectivity scores are to be calculated

  • num_workers (Optional[int]) – number of logical processors to use, by default None (all)

Returns

list of floats within [0.0, 1.0] range representing subjectivity scores for the sentences where 0.0 means very objective and 1.0 very subjective

Return type

List[float]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.statistics import batch_calculate_subjectivity_scores
>>> corpus = [
...     "I don't like you.",
...     "I love the Spiderman movie",
...     "In my opinion this movie was rather boring than exciting",
...     "This is the worst movie I've ever seen"
... ]
>>> batch_calculate_subjectivity_scores(
...     corpus,
...     num_workers=2
... )
[0.0, 0.6, 0.9, 1.0]
contextpro.batch_convert_numerals_to_numbers(documents: List[str], num_workers: Optional[int] = None) → List[str]

Replaces numerals with numbers in all sentences in a concurrent manner.

Parameters
  • documents (List[str]) – list of sentences which contain numerals

  • num_workers (Optional[int]) – number of logical processors to use, by default None (all)

Returns

list of sentences with numerals converted to numbers

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_convert_numerals_to_numbers
>>> corpus = [
...     "A bunch of five",
...     "A picture is worth a thousand words",
...     "A stitch in time saves nine",
...     "Back to square one",
...     "Behind the eight ball",
...     "Between two stools",
... ]
>>> batch_convert_numerals_to_numbers(corpus, num_workers=2)
[
    'A bunch of 5',
    'A picture is worth a 1000 words',
    'A stitch in time saves 9',
    'Back to square 1',
    'Behind the 8 ball',
    'Between 2 stools',
]
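
Several batch functions accept num_workers and process documents concurrently. How contextpro schedules that work is not described here. The sketch below shows one common pattern built on the standard library's concurrent.futures; the helper name batch_apply and the choice of a thread pool are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def batch_apply(
    func: Callable[[str], str],
    documents: List[str],
    num_workers: Optional[int] = None,
) -> List[str]:
    """Apply func to every document using a worker pool (hypothetical helper)."""
    # Mirror the documented validation: reject anything but a list of strings.
    if not isinstance(documents, list) or not all(isinstance(d, str) for d in documents):
        raise ValueError("'documents' must be a list of strings")
    # max_workers=None lets the executor pick a default pool size.
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(func, documents))


print(batch_apply(str.upper, ["back to square one", "between two stools"], num_workers=2))
```

Because the executor's map preserves input order, the result list lines up with the input corpus regardless of which worker finishes first.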
contextpro.batch_get_ngram_counts(tokens: List[List[str]], ngram_size: int = 1) → Dict[str, int]

Calculate ngram counts across the corpus of tokenized documents.

Parameters
  • tokens (List[List[str]]) – list of nested token lists

  • ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)

Returns

mapping from ngram to the number of occurrences in a corpus of tokenized documents

Return type

Dict[str, int]

Raises

ValueError – if ‘tokens’ provided is not a list of nested token lists

Examples

>>> from contextpro.statistics import batch_get_ngram_counts
>>> corpus = [
...     ["my", "name", "is", "dr", "jekyll"],
...     ["his", "name", "is", "mr", "hyde"],
...     ["this", "guy", "name", "is", "edward", "scissorhands"],
...     ["and", "this", "is", "tom", "parker"],
... ]
>>> batch_get_ngram_counts(corpus, ngram_size=2)
{
    "my name": 1, "name is": 3, "is dr": 1, "dr jekyll": 1,
    "his name": 1, "is mr": 1, "mr hyde": 1, "this guy": 1,
    "guy name": 1, "is edward": 1, "edward scissorhands": 1,
    "and this": 1, "this is": 1, "is tom": 1, "tom parker": 1
}
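
The counting shown above can be reproduced with collections.Counter. This is a minimal sketch, not the library's implementation; the function name count_ngrams is hypothetical, and it assumes n-grams are joined with a single space and never span two documents, as the example output suggests.

```python
from collections import Counter
from typing import Dict, List


def count_ngrams(token_lists: List[List[str]], ngram_size: int = 1) -> Dict[str, int]:
    """Count space-joined n-grams across tokenized documents (hypothetical helper)."""
    counts: Counter = Counter()
    for tokens in token_lists:
        # Slide a window of ngram_size tokens over each document separately,
        # so no n-gram crosses a document boundary.
        for start in range(len(tokens) - ngram_size + 1):
            counts[" ".join(tokens[start:start + ngram_size])] += 1
    return dict(counts)


corpus = [
    ["my", "name", "is", "dr", "jekyll"],
    ["his", "name", "is", "mr", "hyde"],
    ["this", "guy", "name", "is", "edward", "scissorhands"],
    ["and", "this", "is", "tom", "parker"],
]
bigram_counts = count_ngrams(corpus, ngram_size=2)
print(bigram_counts["name is"])
```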
contextpro.batch_get_ngrams(tokens: List[List[str]], ngram_size: int = 1) → List[List[str]]

Prepare n-grams from the provided list of token lists.

Parameters
  • tokens (List[List[str]]) – list of token lists, each representing single document

  • ngram_size (int) – size of ngrams to return, by default 1 (unigrams)

Returns

list of nested ngram lists

Return type

List[List[str]]

Raises

ValueError – if ‘tokens’ provided are not a list of nested string lists

Examples

>>> from contextpro.feature_extraction import batch_get_ngrams
>>> tokens = [
...     ["my", "name", "is", "spiderman"],
...     ["she", "lives", "in", "australia"],
... ]
>>> batch_get_ngrams(tokens, ngram_size=2)
[
    ["my name", "name is", "is spiderman"],
    ["she lives", "lives in", "in australia"],
]
contextpro.batch_lemmatize(tokens: List[List[str]], num_workers: Optional[int] = None, **kwargs: Any) → List[List[str]]

Lemmatizes tokens in lists of tokens in a concurrent manner using NLTK WordNetLemmatizer.

Parameters
  • tokens (List[List[str]]) – nested token lists containing tokens with various inflectional forms

  • num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Other Parameters

**kwargs (Any) –

additional properties of the below lemmatizer:
  • nltk.WordNetLemmatizer

Returns

nested token lists with lemmatized tokens

Return type

List[List[str]]

Raises

ValueError – if ‘tokens’ provided are not a list of nested string lists

Examples

>>> from contextpro.normalization import batch_lemmatize
>>> corpus =  [
...     ["I", "like", "driving", "a", "car"],
...     ["I", "am", "going", "for", "a", "walk"],
...     ["What", "are", "you", "doing"],
...     ["Where", "are", "you", "coming", "from"]
... ]
>>> batch_lemmatize(corpus, num_workers=2, pos="v")
[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'be', 'go', 'for', 'a', 'walk'],
    ['What', 'be', 'you', 'do'],
    ['Where', 'be', 'you', 'come', 'from']
]
contextpro.batch_remove_non_ascii_characters(documents: List[str]) → List[str]

Removes non-ascii characters from sentences.

Parameters

documents (List[str]) – list of sentences with non-ascii characters

Returns

list of sentences with removed non-ascii characters

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_remove_non_ascii_characters
>>> corpus = [
...     "https://sitebulb.com/Folder/øê.html?大学",
...     "Jöreskog bißchen Zürcher",
...     "This is a © but not a ®",
...     "fractions ¼, ½, ¾"
... ]
>>> batch_remove_non_ascii_characters(corpus)
[
    'https://sitebulb.com/Folder/.html?',
    'Jreskog bichen Zrcher',
    'This is a  but not a ',
    'fractions , , '
]
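
One standard-library way to get the same effect is to encode to ASCII while ignoring characters that cannot be represented. The helper name strip_non_ascii is hypothetical, and this sketch is not necessarily how contextpro does it.

```python
def strip_non_ascii(sentence: str) -> str:
    """Drop every character outside the ASCII range (hypothetical helper)."""
    # errors="ignore" silently discards characters that ASCII cannot encode.
    return sentence.encode("ascii", errors="ignore").decode("ascii")


cleaned = [
    strip_non_ascii(s)
    for s in ["Jöreskog bißchen Zürcher", "fractions ¼, ½, ¾"]
]
print(cleaned)
```

Note that characters are removed rather than transliterated, so "Zürcher" becomes "Zrcher", not "Zurcher".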
contextpro.batch_remove_numbers(documents: List[str]) → List[str]

Removes numbers from all sentences.

Parameters

documents (List[str]) – list of sentences which contain numbers

Returns

list of sentences without numbers

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_remove_numbers
>>> corpus = [
...     "He is 12 years old.",
...     "His father has 3 cars",
...     "I have 3 computers",
...     "He earns 1000$ daily"
... ]
>>> batch_remove_numbers(corpus)
[
    'He is  years old.',
    'His father has  cars',
    'I have  computers',
    'He earns $ daily'
]
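
The output above keeps the whitespace that surrounded each number, which is what a plain digit-stripping regular expression produces. A minimal sketch, assuming digits are simply deleted (the helper name strip_digits is hypothetical):

```python
import re


def strip_digits(sentence: str) -> str:
    """Delete every run of digits, leaving surrounding text untouched (hypothetical helper)."""
    return re.sub(r"\d+", "", sentence)


print(repr(strip_digits("He is 12 years old.")))
```

The double space left behind in 'He is  years old.' can be cleaned up afterwards with a whitespace-normalization step.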
contextpro.batch_remove_punctuation(documents: List[str]) → List[str]

Removes punctuation characters from all sentences.

Parameters

documents (List[str]) – list of sentences which contain punctuation characters

Returns

list of sentences without punctuation characters

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_remove_punctuation
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde!",
...     "Is his name Edward Scissorhands?",
...     "This is Tom-Parker!"
... ]
>>> batch_remove_punctuation(corpus)
[
    'My name is Dr Jekyll',
    'His name is Mr Hyde',
    'Is his name Edward Scissorhands',
    'This is TomParker'
]
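
A comparable result can be obtained with str.translate and string.punctuation. This sketch covers ASCII punctuation only and uses a hypothetical helper name, strip_punctuation; whether contextpro treats non-ASCII punctuation the same way is not stated here.

```python
import string

# Translation table that maps every ASCII punctuation character to None.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence: str) -> str:
    """Delete ASCII punctuation characters from a sentence (hypothetical helper)."""
    return sentence.translate(_PUNCTUATION_TABLE)


print(strip_punctuation("This is Tom-Parker!"))
```

Deleting the hyphen joins the two words into "TomParker", matching the example above; replace punctuation with a space instead if word boundaries matter.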
contextpro.batch_remove_stopwords(tokens: List[List[str]], custom_stopwords: List[str] = []) → List[List[str]]

Removes stopwords from nested token lists.

Parameters
  • tokens (List[List[str]]) – nested token lists with stopwords included

  • custom_stopwords (List[str], optional) – list of stopwords to remove from sentences, by default []

Returns

nested token lists without stopwords

Return type

List[List[str]]

Raises

ValueError – if ‘tokens’ provided are not a list of nested string lists

Examples

>>> from contextpro.normalization import batch_remove_stopwords
>>> corpus = [
...     ['My', 'name', 'is', 'Dr', 'Jekyll'],
...     ['His', 'name', 'is', 'Mr', 'Hyde'],
...     ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
...     ['And', 'this', 'is', 'Tom', 'Parker']
... ]
>>> batch_remove_stopwords(corpus)
[
    ['My', 'name', 'Dr', 'Jekyll'],
    ['His', 'name', 'Mr', 'Hyde'],
    ['This', 'guy', 'name', 'Edward', 'Scissorhands'],
    ['And', 'Tom', 'Parker']
]
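
The example output keeps 'My', 'This' and 'And' while dropping 'this' and 'is', which is consistent with case-sensitive matching against a lowercase stopword list. The sketch below assumes that behavior; the helper name filter_stopwords and the three-word stopword set are illustrative stand-ins, not the list the library uses.

```python
from typing import Iterable, List

# Tiny illustrative stopword set (an assumption for this sketch only).
EXAMPLE_STOPWORDS = {"is", "this", "s"}


def filter_stopwords(tokens: List[str], stopwords: Iterable[str] = EXAMPLE_STOPWORDS) -> List[str]:
    """Keep tokens that are not in the stopword set (hypothetical helper)."""
    stopword_set = set(stopwords)
    # Matching is case-sensitive, so "This" is kept while "this" is dropped.
    return [token for token in tokens if token not in stopword_set]


print(filter_stopwords(["And", "this", "is", "Tom", "Parker"]))
```

Lowercasing tokens before filtering would remove the capitalized forms as well.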
contextpro.batch_remove_whitespace(documents: List[str]) → List[str]

Removes excess whitespace characters from all sentences.

Parameters

documents (List[str]) – list of sentences which contain whitespace characters

Returns

list of sentences without excess whitespace characters

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_remove_whitespace
>>> corpus = [
...     "He Has Not Been    in Touch for over a Month.",
...     "I Will See \r\nYou next Week",
...     "I Am Hungry - Can We\t Eat Now, Please?",
...     "It Is \r\nFreezing Outside!"
... ]
>>> batch_remove_whitespace(corpus)
[
    'He Has Not Been in Touch for over a Month.',
    'I Will See You next Week',
    'I Am Hungry - Can We Eat Now, Please?',
    'It Is Freezing Outside!',
]
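
As the output shows, single spaces between words survive; only runs of spaces, tabs and line breaks are collapsed. A minimal standard-library sketch of that normalization follows (the helper name collapse_whitespace is hypothetical, and the library may handle leading or trailing whitespace differently):

```python
def collapse_whitespace(sentence: str) -> str:
    """Collapse every run of whitespace into one space (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, \r, \n) and drops leading and trailing whitespace.
    return " ".join(sentence.split())


print(collapse_whitespace("I Am Hungry - Can We\t Eat Now, Please?"))
```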
contextpro.batch_replace_contractions(documents: List[str], **kwargs: bool) → List[str]

Expands contractions in sentences.

Parameters

documents (List[str]) – list of sentences with contracted words

Other Parameters

**kwargs (bool) –

additional properties of the below method:
  • contractions.fix()

Returns

list of sentences without contractions

Return type

List[str]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.normalization import batch_replace_contractions
>>> corpus = [
...     "I don't want to be rude, but you shouldn't do this",
...     "Do you think he'll pass his driving test?",
...     "I'll see you next week",
...     "I'm going for a walk"
... ]
>>> batch_replace_contractions(corpus)
[
    'I do not want to be rude, but you should not do this',
    'Do you think he will pass his driving test?',
    'I will see you next week',
    'I am going for a walk',
]
contextpro.batch_stem(tokens: List[List[str]], stemmer_type: Optional[str] = 'nltk_porter_stemmer', num_workers: Optional[int] = None, **kwargs: Any) → List[List[str]]

Stems tokens in lists of tokens to their root (base) form in a concurrent manner.

Parameters
  • tokens (List[List[str]]) – list of lists of tokens containing words with various inflectional forms

  • stemmer_type (Optional[str], optional) –

    stemmer type which will be used to stem the tokens, by default “nltk_porter_stemmer”

    Allowed values:
    • nltk_porter_stemmer

    • nltk_lancaster_stemmer

    • nltk_regexp_stemmer

    • nltk_snowball_stemmer

  • num_workers (Optional[int], optional) – number of processors to use, by default None (all processors)

Other Parameters

**kwargs (Any) –

additional properties of the below stemmers:
  • nltk.PorterStemmer

  • nltk.LancasterStemmer

  • nltk.RegexpStemmer

  • nltk.SnowballStemmer

Returns

list of lists of stemmed tokens

Return type

List[List[str]]

Raises

ValueError – if ‘tokens’ provided are not a list of nested string lists

Examples

>>> from contextpro.normalization import batch_stem
>>> corpus =  [
...     ["I", "like", "driving", "a", "car"],
...     ["I", "am", "going", "for", "a", "walk"],
...     ["Do", "you", "think", "this", "is", "doable"],
...     ["I", "have", "three", "bikes", "in", "two", "garages"]
... ]
>>> batch_stem(
...    corpus,
...    stemmer_type="nltk_porter_stemmer",
...    num_workers=2
... )
[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'am', 'go', 'for', 'a', 'walk'],
    ['Do', 'you', 'think', 'thi', 'is', 'doabl'],
    ['I', 'have', 'three', 'bike', 'in', 'two', 'garag']
]
contextpro.batch_tokenize_text(documents: List[str], tokenizer_method: Optional[str] = 'nltk_word_tokenizer', num_workers: Optional[int] = None, **kwargs) → List[List[str]]

Tokenizes sentences in a concurrent manner.

Parameters
  • documents (List[str]) – list of sentences to tokenize

  • tokenizer_method (Optional[str]) –

    tokenization method which will be used to tokenize the sentences, by default “nltk_word_tokenizer”.

    Allowed values:
    • nltk_word_tokenizer

    • nltk_regexp_tokenizer

  • num_workers (Optional[int], optional) – number of logical processors to use, by default None (all)

Other Parameters

**kwargs – additional properties of the below methods:

  • nltk.word_tokenize()

  • nltk.regexp_tokenize()

Returns

nested lists containing tokens

Return type

List[List[str]]

Raises

ValueError – if ‘documents’ provided are not a list of strings

Examples

>>> from contextpro.tokenization import batch_tokenize_text
>>> corpus = [
...     "My name is Dr. Jekyll.",
...     "His name is Mr. Hyde",
...     "This guy's name is Edward Scissorhands",
...     "And this is Tom Parker"
... ]
>>> batch_tokenize_text(
...     corpus,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
...     num_workers=2
... )
[['My', 'name', 'is', 'Dr', 'Jekyll'],
 ['His', 'name', 'is', 'Mr', 'Hyde'],
 ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
 ['And', 'this', 'is', 'Tom', 'Parker']]
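
To see what the pattern r"\b[^\d\W]+\b" matches, it can be tried directly with the standard re module. With gaps=False the pattern describes the tokens themselves, so re.findall serves as a rough stand-in here. The helper name regexp_tokenize_sketch is hypothetical and does not call NLTK.

```python
import re
from typing import List

# [^\d\W] matches a word character that is not a digit (letters and underscore).
TOKEN_PATTERN = r"\b[^\d\W]+\b"


def regexp_tokenize_sketch(document: str, pattern: str = TOKEN_PATTERN) -> List[str]:
    """Return every non-overlapping match of pattern as a token (hypothetical helper)."""
    return re.findall(pattern, document)


print(regexp_tokenize_sketch("This guy's name is Edward Scissorhands"))
```

The apostrophe is not a word character, so "guy's" splits into 'guy' and 's', as in the example output above.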
contextpro.calculate_sentiment_score(document: str) → float

Calculate sentiment score for the sentence using TextBlob object.

Parameters

document (str) – sentence whose sentiment score is to be calculated

Returns

float within [-1.0, 1.0] range representing sentiment score for the sentence, where -1.0 means negative and 1.0 positive

Return type

float

Examples

>>> from contextpro.statistics import calculate_sentiment_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_sentiment_score(sentence)
0.5
contextpro.calculate_subjectivity_score(document: str) → float

Calculate subjectivity score for the sentence using TextBlob object.

Parameters

document (str) – sentence whose subjectivity score is to be calculated

Returns

float within [0.0, 1.0] range representing subjectivity score for the sentence, where 0.0 means very objective and 1.0 very subjective

Return type

float

Examples

>>> from contextpro.statistics import calculate_subjectivity_score
>>> sentence = "I love the Spiderman movie"
>>> calculate_subjectivity_score(sentence)
0.6
contextpro.convert_numerals_to_numbers(sentence: str) → str

Replaces numerals with numbers in a sentence.

Parameters

sentence (str) – sentence with numerals

Returns

sentence with numerals replaced with numbers

Return type

str

Examples

>>> from contextpro.normalization import convert_numerals_to_numbers
>>> sentence = "A bunch of five"
>>> convert_numerals_to_numbers(sentence)
'A bunch of 5'
contextpro.get_ngram_counts(tokens: List[str], ngram_size: int = 1) → Dict[str, int]

Calculate ngram counts in a tokenized document.

Parameters
  • tokens (List[str]) – list of tokens

  • ngram_size (int, optional) – size of ngrams to calculate, by default 1 (unigrams)

Returns

mapping from ngram to the number of occurrences in a document

Return type

Dict[str, int]

Raises

ValueError – if ‘tokens’ provided is not a list of strings

Examples

>>> from contextpro.statistics import get_ngram_counts
>>> tokens = ["my", "name", "is", "dr", "jekyll"]
>>> get_ngram_counts(tokens, ngram_size=2)
{'my name': 1, 'name is': 1, 'is dr': 1, 'dr jekyll': 1}
contextpro.get_ngrams(tokens: List[str], ngram_size: int = 1) → List[str]

Prepare n-grams from the provided list of tokens.

Parameters
  • tokens (List[str]) – list of tokens

  • ngram_size (int) – size of ngrams to return, by default 1 (unigrams)

Returns

list of ngrams

Return type

List[str]

Raises

ValueError – if ‘tokens’ provided is not a list of strings

Examples

>>> from contextpro.feature_extraction import get_ngrams
>>> tokens = ["my", "name", "is", "dr", "jekyll"]
>>> get_ngrams(tokens, ngram_size=2)
["my name", "name is", "is dr", "dr jekyll"]
contextpro.lemmatize(tokens: List[str], **kwargs: Any) → List[str]

Lemmatizes tokens using NLTK’s WordNetLemmatizer.

Parameters

tokens (List[str]) – list of tokens containing words with various inflectional forms

Other Parameters

**kwargs (Any) –

additional properties of the below lemmatizer:
  • nltk.WordNetLemmatizer

Returns

list of lemmatized tokens

Return type

List[str]

Examples

>>> from contextpro.normalization import lemmatize
>>> tokens =  ["I", "like", "driving", "a", "car"]
>>> lemmatize(tokens, pos="v")
['I', 'like', 'drive', 'a', 'car']
contextpro.remove_non_ascii_characters(sentence: str) → str

Removes non-ascii characters from the provided sentence.

Parameters

sentence (str) – sentence with non-ascii characters

Returns

sentence without non-ascii characters

Return type

str

Examples

>>> from contextpro.normalization import remove_non_ascii_characters
>>> sentence = "https://sitebulb.com/Folder/øê.html?大学"
>>> remove_non_ascii_characters(sentence)
'https://sitebulb.com/Folder/.html?'
contextpro.remove_numbers(sentence: str) → str

Removes numbers from the sentence.

Parameters

sentence (str) – sentence with number characters

Returns

sentence without number characters

Return type

str

Examples

>>> from contextpro.normalization import remove_numbers
>>> sentence = "He is 12 years old."
>>> remove_numbers(sentence)
'He is  years old.'
contextpro.remove_punctuation(sentence: str) → str

Removes punctuation characters from the sentence.

Parameters

sentence (str) – sentence with punctuation characters

Returns

sentence without punctuation characters

Return type

str

Examples

>>> from contextpro.normalization import remove_punctuation
>>> sentence = "My name is Dr. Jekyll."
>>> remove_punctuation(sentence)
'My name is Dr Jekyll'
contextpro.remove_stopwords(tokens: List[str], custom_stopwords: List[str] = []) → List[str]

Remove stopwords from the provided list of tokens.

Parameters
  • tokens (List[str]) – list of tokens, including stopwords

  • custom_stopwords (List[str], optional) – stopwords which should be removed from the list of tokens, by default []

Returns

list of tokens without stopwords

Return type

List[str]

Examples

>>> from contextpro.normalization import remove_stopwords
>>> tokens = ['My', 'name', 'is', 'Dr', 'Jekyll']
>>> remove_stopwords(tokens)
['My', 'name', 'Dr', 'Jekyll']
contextpro.remove_whitespace(sentence: str) → str

Removes excess whitespace characters from the sentence.

Parameters

sentence (str) – sentence with whitespace characters

Returns

sentence without excess whitespace characters

Return type

str

Examples

>>> from contextpro.normalization import remove_whitespace
>>> sentence = "He Has Not Been    in Touch for over a Month."
>>> remove_whitespace(sentence)
'He Has Not Been in Touch for over a Month.'
contextpro.replace_contractions(sentence: str, **kwargs: bool) → str

Expands contractions in a sentence.

Parameters

sentence (str) – sentence with contracted words

Other Parameters

**kwargs (bool) –

additional properties of the below method:
  • contractions.fix()

Returns

sentence without contracted words

Return type

str

Examples

>>> from contextpro.normalization import replace_contractions
>>> sentence = "I don't want to be rude, but you shouldn't do this"
>>> replace_contractions(sentence)
'I do not want to be rude, but you should not do this'
contextpro.stem(tokens: List[str], stemmer_type: Optional[str] = 'nltk_porter_stemmer', **kwargs: Any) → List[str]

Reduces tokens to their root (base) form.

Parameters
  • tokens (List[str]) – list of tokens containing words with various inflectional forms

  • stemmer_type (Optional[str], optional) –

    stemmer type which will be used to stem the tokens, by default “nltk_porter_stemmer”

    Allowed values:
    • nltk_porter_stemmer

    • nltk_lancaster_stemmer

    • nltk_regexp_stemmer

    • nltk_snowball_stemmer

Other Parameters

**kwargs (Any) –

additional properties of the below stemmers:
  • nltk.PorterStemmer

  • nltk.LancasterStemmer

  • nltk.RegexpStemmer

  • nltk.SnowballStemmer

Returns

list of stemmed tokens

Return type

List[str]

Examples

>>> from contextpro.normalization import stem
>>> tokens =  ["I", "like", "driving", "a", "car"]
>>> stem(tokens, stemmer_type="nltk_porter_stemmer")
['I', 'like', 'drive', 'a', 'car']
contextpro.tokenize_text(document: str, tokenizer_method: Optional[str] = 'nltk_word_tokenizer', **kwargs) → List[str]

Convert sentence into a list of tokens.

Parameters
  • document (str) – sentence to tokenize

  • tokenizer_method (Optional[str]) –

    tokenization method which will be used to tokenize the sentence, by default “nltk_word_tokenizer”.

    Allowed values:
    • nltk_word_tokenizer

    • nltk_regexp_tokenizer

Other Parameters

**kwargs – additional properties of the below methods:

  • nltk.word_tokenize()

  • nltk.regexp_tokenize()

Returns

list of tokens

Return type

List[str]

Examples

>>> from contextpro.tokenization import tokenize_text
>>> sentence = "My name is Dr. Jekyll."
>>> tokenize_text(
...     sentence,
...     tokenizer_method="nltk_regexp_tokenizer",
...     pattern=r"\b[^\d\W]+\b",
...     gaps=False,
... )
['My', 'name', 'is', 'Dr', 'Jekyll']