Python Text Processing - Removing Stopwords



Stopwords are common words that add little meaning to a sentence. They can usually be ignored without changing what the sentence says. Examples are words like the, he, and have. NLTK already collects such words in a corpus named stopwords. We first download it to our Python environment.

import nltk
nltk.download('stopwords')

This downloads the stopwords corpus, which contains word lists for English and a number of other languages.

Verifying the Stopwords

main.py

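from nltk.corpus import stopwords

# Called without a language name, words() returns the stopwords of every
# language in the corpus, so the first 20 entries are Albanian.
print(stopwords.words()[0:20])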

Output

When we run the above program we get the following output −

['tyre', 'rreth', 'le', 'atyre', 'këta', 'megjithëse', 'kemi', 'per', 
'ndonëse', 'dytë', 'pse', 'tha', 'aty', 'ndaj', 'ke', 'këtë', 'duhet', 
'pa', 'perket', 'veç']
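
The list above starts with Albanian words because words() was called without a language name, so it returned the stopwords of every language in one list. A minimal sketch for inspecting one language at a time follows. The helper name first_stopwords and its parameters are hypothetical and not part of NLTK, and the sketch assumes the stopwords corpus has already been downloaded.

```python
def first_stopwords(language="english", count=20):
    """Return the first `count` stopwords for a single language."""
    # Imported inside the function so the helper can be defined even
    # before NLTK and its 'stopwords' corpus are installed.
    from nltk.corpus import stopwords

    # Passing a language name restricts the result to that language's list.
    return stopwords.words(language)[:count]
```

Calling print(first_stopwords()) shows the first 20 English stopwords. The exact words depend on the version of the corpus that is installed.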

The corpus also provides stopword lists for languages other than English. The available languages are listed below.

main.py

from nltk.corpus import stopwords
print(stopwords.fileids())

Output

When we run the above program we get the following output −

['albanian', 'arabic', 'azerbaijani', 'basque', 'belarusian', 'bengali', 'catalan', 
'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek',
 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 
 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish',
 'tajik', 'tamil', 'turkish', 'uzbek']

Example - Removing stopwords

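The example below shows how stopwords are removed from a list of words. The stopword list is lowercase and the membership test is case-sensitive, so the capitalized word There is not removed.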

main.py

from nltk.corpus import stopwords
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']
for word in all_words:
    if word not in en_stops:
        print(word)

Output

When we run the above program we get the following output −

There
tree
near
river
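
The word There stays in the output even though there is an English stopword, because the lookup compares exact strings. One possible way to handle this is to lowercase each word before the lookup. The sketch below is an illustration, not part of NLTK: remove_stopwords is a hypothetical helper, and demo_stops is a small hand-written stand-in so that the block runs without the corpus. In practice, set(stopwords.words('english')) would be passed instead.

```python
def remove_stopwords(words, stops):
    """Return the words whose lowercase form is not in `stops`."""
    # Lowercase only for the lookup; the original spelling is kept in the result.
    return [word for word in words if word.lower() not in stops]


# Hand-written stand-in for set(stopwords.words('english')); illustrative only.
demo_stops = {'there', 'is', 'a', 'the'}

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']
filtered = remove_stopwords(all_words, demo_stops)
print(filtered)  # ['tree', 'near', 'river']
```

Keeping the stopwords in a set, as the original example does, makes each lookup fast even for long word lists.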