Implementing Descriptive Text Analytics using Python

In [6]:
# 1. Selection

# i. Select libraries
import pandas as pd
import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
In [7]:
# ii. Select data source
populationdata = pd.read_csv(r'C:\Users\PC\Desktop\KIbe\sem 2\Unstructured data analytics & apps\jupkibe\data\emailsdata.csv', encoding='latin1')
print('The shape of the dataframe is:\n', populationdata.shape)
The shape of the dataframe is:
 (42, 2)
In [30]:
# iii. Select a data sample with a fixed number of records
sampledata = populationdata.sample(40)
print("Data Sample\n", sampledata)

# iii b. Select records of a certain case, e.g. spam emails.
# Note: filter with square brackets (boolean indexing); calling the DataFrame
# with parentheses raises "TypeError: 'DataFrame' object is not callable".
# The filter is applied in the data cleaning step:
# spam_emails = sampledata[sampledata.email_type == 'spam']

# Output the shape and the contents of the spam sample
# print("Data Shape\n", spam_emails.shape)
# print("Spam Sample\n", spam_emails)
Data Sample
    email_type                                        description
20        ham                 I dont knw pa, i just drink milk..
5        spam  FreeMsg Hey there darling it's been 3 week's n...
27        ham  Same here, but I consider walls and bunkers an...
13       spam  URGENT! You have won a 1 week FREE membership ...
16       spam  XXXMobileMovieClub: To use your credit, click ...
4         ham  Nah I don't think he goes to usf, he lives aro...
37        NaN                                                NaN
17        ham                  Gudnite....tc...practice going on
8        spam  WINNER!! As a valued network customer you have...
31        ham  Its going good...no problem..but still need li...
15       spam  U have a secret admirer who is looking 2 make ...
7         ham  As per your request 'Melle Melle (Oru Minnamin...
30        ham  Well am officially in a philosophical hole, so...
3        spam  U have a Secret Admirer who is looking 2 make ...
26       spam  Congratulations ur awarded either a yrs supply...
41        ham                        Dude we should go sup again
10        ham  I'm gonna be home soon and i don't want to tal...
18        ham                                    I'll be late...
33        ham  Ugh its been a long day. I'm exhausted. Just w...
24       spam  Reminder: You have not downloaded the content ...
39        ham  see, i knew giving you a break a few times wou...
14        ham  I've been searching for the right words to tha...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
29       spam  Hello. We need some posh birds and chaps to us...
21        ham  Maybe?! Say hi to  and find out if  got his ca...
36       spam  Sunshine Quiz Wkly Q! Win a top Sony DVD playe...
34        ham  Talk With Yourself Atleast Once In A Day...!!!...
12       spam  SIX chances to win CASH! From 100 to 20,000 po...
28       spam  PRIVATE! Your 2003 Account Statement for 07808...
9        spam  Had your mobile 11 months or more? U R entitle...
38        NaN                                                NaN
35        ham       Are you in castor? You need to see something
40        ham  I love to give massages. I use lots of baby oi...
1         ham                      Ok lar... Joking wif u oni...
19       spam                                                NaN
25        ham      Dude ive been seeing a lotta corvettes lately
6         ham  Even my brother is not like to speak with me. ...
23       spam  Shop till u Drop, IS IT YOU, either 10K, 5K, †...
22        ham  Omg I want to scream. I weighed myself and I l...
32        ham                    I'll text you when I drop x off
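Because `sample(40)` draws rows at random, the sample printed above changes on every run. A minimal sketch, assuming a small stand-in DataFrame (`demo` is hypothetical and not the emails file), of how passing `random_state` to `DataFrame.sample` makes the draw repeatable:

```python
import pandas as pd

# Hypothetical stand-in for populationdata (not the real emails file)
demo = pd.DataFrame({
    "email_type": ["ham", "spam", "ham", "spam", "ham"],
    "description": ["hi", "WIN now", "ok", "FREE entry", "later"],
})

# The same seed selects the same rows on every run
sample_a = demo.sample(3, random_state=42)
sample_b = demo.sample(3, random_state=42)
print(sample_a.equals(sample_b))
```

The seed value 42 is arbitrary; any fixed integer works. The notebook itself samples without a seed, so its printed output will differ between runs.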
In [35]:
# 2. Data cleaning

# Define spam_emails first; filter with [] (boolean indexing), not ()
spam_emails = sampledata[sampledata.email_type == 'spam']

# Count the records with missing (null) values in description
spam_emails['description'].isnull().sum()

# Drop all records with null values
cleaned_spam_emails = spam_emails.dropna()
print(cleaned_spam_emails)
   email_type                                        description
5        spam  FreeMsg Hey there darling it's been 3 week's n...
13       spam  URGENT! You have won a 1 week FREE membership ...
16       spam  XXXMobileMovieClub: To use your credit, click ...
8        spam  WINNER!! As a valued network customer you have...
15       spam  U have a secret admirer who is looking 2 make ...
3        spam  U have a Secret Admirer who is looking 2 make ...
26       spam  Congratulations ur awarded either a yrs supply...
24       spam  Reminder: You have not downloaded the content ...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
29       spam  Hello. We need some posh birds and chaps to us...
36       spam  Sunshine Quiz Wkly Q! Win a top Sony DVD playe...
12       spam  SIX chances to win CASH! From 100 to 20,000 po...
28       spam  PRIVATE! Your 2003 Account Statement for 07808...
9        spam  Had your mobile 11 months or more? U R entitle...
23       spam  Shop till u Drop, IS IT YOU, either 10K, 5K, †...
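The missing-value count in the cell above is computed but never displayed, because a notebook only echoes the last expression of a cell. A small sketch, assuming a hypothetical three-row frame (`demo_spam`) in place of `spam_emails`, that prints the count and drops only the rows whose `description` is missing:

```python
import pandas as pd

# Hypothetical stand-in for spam_emails
demo_spam = pd.DataFrame({
    "email_type": ["spam", "spam", "spam"],
    "description": ["WINNER!! call now", None, "Free entry"],
})

# Count missing descriptions and print the result explicitly
missing = demo_spam["description"].isnull().sum()
print("Missing descriptions:", missing)

# Drop only the rows where description is null
demo_clean = demo_spam.dropna(subset=["description"])
print(demo_clean.shape)
```

Using `subset` is a design choice for this sketch: a plain `dropna()` removes a row when any column is null, while `subset=["description"]` limits the check to the text column.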
In [42]:
# 3. Data transformation

# List to store the transformed data
transformed_spam = []

# Convert the cleaned sample data to lower case
cleaned_spam_inlowercase = cleaned_spam_emails['description'].str.lower()
print("Spam Description in lowercase\n\n", cleaned_spam_inlowercase)

# Split the text into separate words
split_spam = cleaned_spam_inlowercase.str.split(' ')
# Note: the reference document names this variable all_spam, which raises
# a NameError because all_spam is not defined
print("Separate words of Spam Description\n", split_spam)

# Remove leading and trailing punctuation from each word
for text in split_spam:
    text = [x.strip(string.punctuation) for x in text]
    transformed_spam.append(text)

# Output the transformed data
print(transformed_spam)
Spam Description in lowercase

 5     freemsg hey there darling it's been 3 week's n...
13    urgent! you have won a 1 week free membership ...
16    xxxmobilemovieclub: to use your credit, click ...
8     winner!! as a valued network customer you have...
15    u have a secret admirer who is looking 2 make ...
3     u have a secret admirer who is looking 2 make ...
26    congratulations ur awarded either a yrs supply...
24    reminder: you have not downloaded the content ...
2     free entry in 2 a wkly comp to win fa cup fina...
29    hello. we need some posh birds and chaps to us...
36    sunshine quiz wkly q! win a top sony dvd playe...
12    six chances to win cash! from 100 to 20,000 po...
28    private! your 2003 account statement for 07808...
9     had your mobile 11 months or more? u r entitle...
23    shop till u drop, is it you, either 10k, 5k, †...
Name: description, dtype: object
Separate words of Spam Description
 5     [freemsg, hey, there, darling, it's, been, 3, ...
13    [urgent!, you, have, won, a, 1, week, free, me...
16    [xxxmobilemovieclub:, to, use, your, credit,, ...
8     [winner!!, as, a, valued, network, customer, y...
15    [u, have, a, secret, admirer, who, is, looking...
3     [u, have, a, secret, admirer, who, is, looking...
26    [congratulations, ur, awarded, either, a, yrs,...
24    [reminder:, you, have, not, downloaded, the, c...
2     [free, entry, in, 2, a, wkly, comp, to, win, f...
29    [hello., we, need, some, posh, birds, and, cha...
36    [sunshine, quiz, wkly, q!, win, a, top, sony, ...
12    [six, chances, to, win, cash!, from, 100, to, ...
28    [private!, your, 2003, account, statement, for...
9     [had, your, mobile, 11, months, or, more?, u, ...
23    [shop, till, u, drop,, is, it, you,, either, 1...
Name: description, dtype: object
[['freemsg', 'hey', 'there', 'darling', "it's", 'been', '3', "week's", 'now', 'and', 'no', 'word', 'back', "i'd", 'like', 'some', 'fun', 'you', 'up', 'for', 'it', 'still', 'tb', 'ok', 'xxx', 'std', 'chgs', 'to', 'send', '\x86\x9c1.50', 'to', 'rcv'], ['urgent', 'you', 'have', 'won', 'a', '1', 'week', 'free', 'membership', 'in', 'our', '\x86\x9c100,000', 'prize', 'jackpot', 'txt', 'the', 'word', 'claim', 'to', 'no', '81010', 't&c', 'www.dbuk.net', 'lccltd', 'pobox', '4403ldnw1a7rw18'], ['xxxmobilemovieclub', 'to', 'use', 'your', 'credit', 'click', 'the', 'wap', 'link', 'in', 'the', 'next', 'txt', 'message', 'or', 'click', 'here', 'http://wap', 'xxxmobilemovieclub.com?n=qjkgighjjgcbl'], ['winner', 'as', 'a', 'valued', 'network', 'customer', 'you', 'have', 'been', 'selected', 'to', 'receivea', '\x86\x9c900', 'prize', 'reward', 'to', 'claim', 'call', '09061701461', 'claim', 'code', 'kl341', 'valid', '12', 'hours', 'only'], ['u', 'have', 'a', 'secret', 'admirer', 'who', 'is', 'looking', '2', 'make', 'contact', 'with', 'u-find', 'out', 'who', 'they', 'r*reveal', 'who', 'thinks', 'ur', 'so', 'special-call', 'on', '09058094565'], ['u', 'have', 'a', 'secret', 'admirer', 'who', 'is', 'looking', '2', 'make', 'contact', 'with', 'u-find', 'out', 'who', 'they', 'r*reveal', 'who', 'thinks', 'ur', 'so', 'special-call', 'on', '09065171142-stopsms-08'], ['congratulations', 'ur', 'awarded', 'either', 'a', 'yrs', 'supply', 'of', 'cds', 'from', 'virgin', 'records', 'or', 'a', 'mystery', 'gift', 'guaranteed', 'call', '09061104283', 'ts&cs', 'www.smsco.net', '\x86\x9c1.50pm', 'approx', '3mins'], ['reminder', 'you', 'have', 'not', 'downloaded', 'the', 'content', 'you', 'have', 'already', 'paid', 'for', 'goto', 'http://doit', 'mymoby', 'tv', 'to', 'collect', 'your', 'content'], ['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005', 'text', 'fa', 'to', '87121', 'to', 'receive', 'entry', 'question(std', 'txt', "rate)t&c's", 'apply', 
"08452810075over18's"], ['hello', 'we', 'need', 'some', 'posh', 'birds', 'and', 'chaps', 'to', 'user', 'trial', 'prods', 'for', 'champneys', 'can', 'i', 'put', 'you', 'down', 'i', 'need', 'your', 'address', 'and', 'dob', 'asap', 'ta', 'r'], ['sunshine', 'quiz', 'wkly', 'q', 'win', 'a', 'top', 'sony', 'dvd', 'player', 'if', 'u', 'know', 'which', 'country', 'liverpool', 'played', 'in', 'mid', 'week', 'txt', 'ansr', 'to', '82277', '\x86\x9c1.50', 'sp:tyrone'], ['six', 'chances', 'to', 'win', 'cash', 'from', '100', 'to', '20,000', 'pounds', 'txt', 'csh11', 'and', 'send', 'to', '87575', 'cost', '150p/day', '6days', '16', 'tsandcs', 'apply', 'reply', 'hl', '4', 'info'], ['private', 'your', '2003', 'account', 'statement', 'for', '07808', 'xxxxxx', 'shows', '800', 'un-redeemed', 's', 'i', 'm', 'points', 'call', '08719899217', 'identifier', 'code', '41685', 'expires', '07/11/04'], ['had', 'your', 'mobile', '11', 'months', 'or', 'more', 'u', 'r', 'entitled', 'to', 'update', 'to', 'the', 'latest', 'colour', 'mobiles', 'with', 'camera', 'for', 'free', 'call', 'the', 'mobile', 'update', 'co', 'free', 'on', '08002986030'], ['shop', 'till', 'u', 'drop', 'is', 'it', 'you', 'either', '10k', '5k', '\x86\x9c500', 'cash', 'or', '\x86\x9c100', 'travel', 'voucher', 'call', 'now', '09064011000', 'ntt', 'po', 'box', 'cr01327bt', 'fixedline', 'cost', '150ppm', 'mobile', 'vary']]
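The output above still contains tokens such as `it's`, `u-find` and `r*reveal`, because `str.strip` removes punctuation only from the two ends of a word. A short sketch comparing it with `str.translate`, which deletes punctuation anywhere in the token (the `tokens` list is a hand-picked illustration, not the full data):

```python
import string

# A few tokens of the kind produced by the split step
tokens = ["urgent!", "it's", "u-find", "credit,"]

# str.strip removes punctuation only at the start and end of each token
stripped = [t.strip(string.punctuation) for t in tokens]

# str.translate with a deletion table removes punctuation everywhere
table = str.maketrans("", "", string.punctuation)
removed = [t.translate(table) for t in tokens]

print(stripped)  # ['urgent', "it's", 'u-find', 'credit']
print(removed)   # ['urgent', 'its', 'ufind', 'credit']
```

Which behaviour is preferable depends on the analysis: keeping inner punctuation preserves contractions and URLs, while removing it merges variants such as `it's` and `its`.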
In [58]:
# 4. Data integration

# Build the text corpus
text_corpus = [" ".join(text) for text in transformed_spam]
final_text_corpus = " ".join(text_corpus)
print(final_text_corpus)
freemsg hey there darling it's been 3 week's now and no word back i'd like some fun you up for it still tb ok xxx std chgs to send †œ1.50 to rcv urgent you have won a 1 week free membership in our †œ100,000 prize jackpot txt the word claim to no 81010 t&c www.dbuk.net lccltd pobox 4403ldnw1a7rw18 xxxmobilemovieclub to use your credit click the wap link in the next txt message or click here http://wap xxxmobilemovieclub.com?n=qjkgighjjgcbl winner as a valued network customer you have been selected to receivea †œ900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only u have a secret admirer who is looking 2 make contact with u-find out who they r*reveal who thinks ur so special-call on 09058094565 u have a secret admirer who is looking 2 make contact with u-find out who they r*reveal who thinks ur so special-call on 09065171142-stopsms-08 congratulations ur awarded either a yrs supply of cds from virgin records or a mystery gift guaranteed call 09061104283 ts&cs www.smsco.net †œ1.50pm approx 3mins reminder you have not downloaded the content you have already paid for goto http://doit mymoby tv to collect your content free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's hello we need some posh birds and chaps to user trial prods for champneys can i put you down i need your address and dob asap ta r sunshine quiz wkly q win a top sony dvd player if u know which country liverpool played in mid week txt ansr to 82277 †œ1.50 sp:tyrone six chances to win cash from 100 to 20,000 pounds txt csh11 and send to 87575 cost 150p/day 6days 16 tsandcs apply reply hl 4 info private your 2003 account statement for 07808 xxxxxx shows 800 un-redeemed s i m points call 08719899217 identifier code 41685 expires 07/11/04 had your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for free call the mobile update co free on 
08002986030 shop till u drop is it you either 10k 5k †œ500 cash or †œ100 travel voucher call now 09064011000 ntt po box cr01327bt fixedline cost 150ppm mobile vary
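The `collections` module imported in the first cell is never used. One plausible use, sketched here as an assumption rather than the notebook's own method, is counting word frequencies in the corpus with `collections.Counter`. The excerpt is copied from the corpus above; the `stop_words` set is hypothetical and hand-picked for illustration:

```python
import collections

# Short excerpt of the corpus printed above
corpus_excerpt = (
    "u have a secret admirer who is looking 2 make contact with u-find "
    "out who they r*reveal who thinks ur so special-call"
)

# Hypothetical, hand-picked stop words (not a standard list)
stop_words = {"a", "is", "so", "with", "out", "have", "they"}

# Drop stop words, then count how often each remaining word occurs
words = [w for w in corpus_excerpt.split() if w not in stop_words]
word_counts = collections.Counter(words)

print(word_counts.most_common(3))
```

Applying the same steps to `final_text_corpus` would give a frequency table for the whole spam sample, which is the numeric counterpart of a word cloud.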
In [65]:
# 5. Word cloud from corpus
# Import libraries
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Create a WordCloud object
wordcloud_spam = WordCloud(background_color="white").generate(final_text_corpus)
# Note: this step failed until the libraries above were imported in this cell

# Plot the word cloud
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud_spam, interpolation='bilinear')
plt.axis("off")
plt.show()