# 1. Selection
# i. select libraries
import pandas as pd
import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
# ii. select data source
populationdata = pd.read_csv(r'C:\Users\PC\Desktop\KIbe\sem 2\Unstructured data analytics & apps\jupkibe\data\emailsdata.csv', encoding='latin1')
print('The shape of the dataframe is:', populationdata.shape)
The shape of the dataframe is: (42, 2)
# iii. select data sample
# a. a fixed number of records
sampledata = populationdata.sample(40)
print("Data Sample\n", sampledata)
# b. records matching certain cases, e.g. spam emails
# spam_emails = sampledata(sampledata.email_type == [spam])
# the line above fails because a DataFrame is not callable; the working form is defined in data cleaning (section 2)
# output the shape
# print("Data Shape\n", spam_emails.shape)
# output the spam emails in the sample
# print("Spam Emails\n", spam_emails)
Data Sample
    email_type  description
20  ham         I dont knw pa, i just drink milk..
5   spam        FreeMsg Hey there darling it's been 3 week's n...
27  ham         Same here, but I consider walls and bunkers an...
13  spam        URGENT! You have won a 1 week FREE membership ...
16  spam        XXXMobileMovieClub: To use your credit, click ...
4   ham         Nah I don't think he goes to usf, he lives aro...
37  NaN         NaN
17  ham         Gudnite....tc...practice going on
8   spam        WINNER!! As a valued network customer you have...
31  ham         Its going good...no problem..but still need li...
15  spam        U have a secret admirer who is looking 2 make ...
7   ham         As per your request 'Melle Melle (Oru Minnamin...
30  ham         Well am officially in a philosophical hole, so...
3   spam        U have a Secret Admirer who is looking 2 make ...
26  spam        Congratulations ur awarded either a yrs supply...
41  ham         Dude we should go sup again
10  ham         I'm gonna be home soon and i don't want to tal...
18  ham         I'll be late...
33  ham         Ugh its been a long day. I'm exhausted. Just w...
24  spam        Reminder: You have not downloaded the content ...
39  ham         see, i knew giving you a break a few times wou...
14  ham         I've been searching for the right words to tha...
2   spam        Free entry in 2 a wkly comp to win FA Cup fina...
29  spam        Hello. We need some posh birds and chaps to us...
21  ham         Maybe?! Say hi to and find out if got his ca...
36  spam        Sunshine Quiz Wkly Q! Win a top Sony DVD playe...
34  ham         Talk With Yourself Atleast Once In A Day...!!!...
12  spam        SIX chances to win CASH! From 100 to 20,000 po...
28  spam        PRIVATE! Your 2003 Account Statement for 07808...
9   spam        Had your mobile 11 months or more? U R entitle...
38  NaN         NaN
35  ham         Are you in castor? You need to see something
40  ham         I love to give massages. I use lots of baby oi...
1   ham         Ok lar... Joking wif u oni...
19  spam        NaN
25  ham         Dude ive been seeing a lotta corvettes lately
6   ham         Even my brother is not like to speak with me. ...
23  spam        Shop till u Drop, IS IT YOU, either 10K, 5K, ...
22  ham         Omg I want to scream. I weighed myself and I l...
32  ham         I'll text you when I drop x off
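# Side note on step iii b: a minimal sketch of why calling the DataFrame with () fails
# while indexing it with [] works. The frame `demo` and its rows are made up for
# illustration (an assumption, not the real emailsdata.csv); only the column names
# email_type and description are taken from the sample above.

```python
import pandas as pd

# Hypothetical stand-in for sampledata; the real rows come from emailsdata.csv.
demo = pd.DataFrame({
    "email_type": ["ham", "spam", None, "spam"],
    "description": ["Ok lar...", "WINNER!! ...", None, None],
})

# Calling the DataFrame like a function fails: a DataFrame is not callable.
try:
    demo(demo.email_type == "spam")
    call_error = None
except TypeError as exc:
    call_error = str(exc)

# Indexing with a boolean mask in square brackets keeps the rows where the mask is True.
spam_mask = demo["email_type"] == "spam"
spam_rows = demo[spam_mask]

print("Error from the () form:", call_error)
print("Shape from the [] form:", spam_rows.shape)
```

# The mask compares against the string 'spam'; a bare name such as [spam] would be
# looked up as a variable, which is a separate problem from the () call.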
# 2. Data cleaning
# define spam_emails first; index with [] rather than (), since a DataFrame is not callable
spam_emails = sampledata[sampledata.email_type == 'spam']
# - count missing values
# here we count the records with missing (null) values in description
spam_emails['description'].isnull().sum()
# here we drop all records with null values
cleaned_spam_emails = spam_emails.dropna()
print(cleaned_spam_emails)
    email_type  description
5   spam        FreeMsg Hey there darling it's been 3 week's n...
13  spam        URGENT! You have won a 1 week FREE membership ...
16  spam        XXXMobileMovieClub: To use your credit, click ...
8   spam        WINNER!! As a valued network customer you have...
15  spam        U have a secret admirer who is looking 2 make ...
3   spam        U have a Secret Admirer who is looking 2 make ...
26  spam        Congratulations ur awarded either a yrs supply...
24  spam        Reminder: You have not downloaded the content ...
2   spam        Free entry in 2 a wkly comp to win FA Cup fina...
29  spam        Hello. We need some posh birds and chaps to us...
36  spam        Sunshine Quiz Wkly Q! Win a top Sony DVD playe...
12  spam        SIX chances to win CASH! From 100 to 20,000 po...
28  spam        PRIVATE! Your 2003 Account Statement for 07808...
9   spam        Had your mobile 11 months or more? U R entitle...
23  spam        Shop till u Drop, IS IT YOU, either 10K, 5K, ...
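# Side note on the cleaning step: dropna() with no arguments drops a row if ANY column
# is null. A minimal sketch, on made-up rows (an assumption, not the real data), of
# counting nulls per column and of restricting the drop to the description column
# with the subset parameter.

```python
import pandas as pd

# Hypothetical rows for illustration only.
demo = pd.DataFrame({
    "email_type": ["spam", "spam", "spam", None],
    "description": ["Free entry ...", None, "URGENT! ...", "text with no label"],
})

# Count the missing values in each column.
missing_per_column = demo.isnull().sum()

# Drop rows with a null in any column.
cleaned_any = demo.dropna()

# Drop rows only when description is null; a missing email_type is tolerated here.
cleaned_desc = demo.dropna(subset=["description"])

print(missing_per_column)
print(len(cleaned_any), len(cleaned_desc))
```

# In the notebook every row of spam_emails already has email_type == 'spam', so both
# forms should keep the same rows there; subset only matters when other columns can be null.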
# 3. Data transformation
# - list to store the transformed data
transformed_spam = []
# - convert cleaned sample data to lower case
cleaned_spam_inlowercase = cleaned_spam_emails['description'].str.lower()
print("Spam Description in lowercase\n\n", cleaned_spam_inlowercase)
# - split each text into separate words
split_spam = cleaned_spam_inlowercase.str.split(' ')
print("Separate words of Spam Description\n", split_spam)  # the doc calls this all_spam, which raises an error because all_spam is not defined
# - here we remove punctuation from the start and end of each word
for text in split_spam:
    text = [x.strip(string.punctuation) for x in text]
    transformed_spam.append(text)
# - output transformed data
print(transformed_spam)
Spam Description in lowercase

5     freemsg hey there darling it's been 3 week's n...
13    urgent! you have won a 1 week free membership ...
16    xxxmobilemovieclub: to use your credit, click ...
8     winner!! as a valued network customer you have...
15    u have a secret admirer who is looking 2 make ...
3     u have a secret admirer who is looking 2 make ...
26    congratulations ur awarded either a yrs supply...
24    reminder: you have not downloaded the content ...
2     free entry in 2 a wkly comp to win fa cup fina...
29    hello. we need some posh birds and chaps to us...
36    sunshine quiz wkly q! win a top sony dvd playe...
12    six chances to win cash! from 100 to 20,000 po...
28    private! your 2003 account statement for 07808...
9     had your mobile 11 months or more? u r entitle...
23    shop till u drop, is it you, either 10k, 5k, ...
Name: description, dtype: object
Separate words of Spam Description
5     [freemsg, hey, there, darling, it's, been, 3, ...
13    [urgent!, you, have, won, a, 1, week, free, me...
16    [xxxmobilemovieclub:, to, use, your, credit,, ...
8     [winner!!, as, a, valued, network, customer, y...
15    [u, have, a, secret, admirer, who, is, looking...
3     [u, have, a, secret, admirer, who, is, looking...
26    [congratulations, ur, awarded, either, a, yrs,...
24    [reminder:, you, have, not, downloaded, the, c...
2     [free, entry, in, 2, a, wkly, comp, to, win, f...
29    [hello., we, need, some, posh, birds, and, cha...
36    [sunshine, quiz, wkly, q!, win, a, top, sony, ...
12    [six, chances, to, win, cash!, from, 100, to, ...
28    [private!, your, 2003, account, statement, for...
9     [had, your, mobile, 11, months, or, more?, u, ...
23    [shop, till, u, drop,, is, it, you,, either, 1...
Name: description, dtype: object
[['freemsg', 'hey', 'there', 'darling', "it's", 'been', '3', "week's", 'now', 'and', 'no', 'word', 'back', "i'd", 'like', 'some', 'fun', 'you', 'up', 'for', 'it', 'still', 'tb', 'ok', 'xxx', 'std', 'chgs', 'to', 'send', '\x86\x9c1.50', 'to', 'rcv'],
 ['urgent', 'you', 'have', 'won', 'a', '1', 'week', 'free', 'membership', 'in', 'our', '\x86\x9c100,000', 'prize', 'jackpot', 'txt', 'the', 'word', 'claim', 'to', 'no', '81010', 't&c', 'www.dbuk.net', 'lccltd', 'pobox', '4403ldnw1a7rw18'],
 ['xxxmobilemovieclub', 'to', 'use', 'your', 'credit', 'click', 'the', 'wap', 'link', 'in', 'the', 'next', 'txt', 'message', 'or', 'click', 'here', 'http://wap', 'xxxmobilemovieclub.com?n=qjkgighjjgcbl'],
 ['winner', 'as', 'a', 'valued', 'network', 'customer', 'you', 'have', 'been', 'selected', 'to', 'receivea', '\x86\x9c900', 'prize', 'reward', 'to', 'claim', 'call', '09061701461', 'claim', 'code', 'kl341', 'valid', '12', 'hours', 'only'],
 ['u', 'have', 'a', 'secret', 'admirer', 'who', 'is', 'looking', '2', 'make', 'contact', 'with', 'u-find', 'out', 'who', 'they', 'r*reveal', 'who', 'thinks', 'ur', 'so', 'special-call', 'on', '09058094565'],
 ['u', 'have', 'a', 'secret', 'admirer', 'who', 'is', 'looking', '2', 'make', 'contact', 'with', 'u-find', 'out', 'who', 'they', 'r*reveal', 'who', 'thinks', 'ur', 'so', 'special-call', 'on', '09065171142-stopsms-08'],
 ['congratulations', 'ur', 'awarded', 'either', 'a', 'yrs', 'supply', 'of', 'cds', 'from', 'virgin', 'records', 'or', 'a', 'mystery', 'gift', 'guaranteed', 'call', '09061104283', 'ts&cs', 'www.smsco.net', '\x86\x9c1.50pm', 'approx', '3mins'],
 ['reminder', 'you', 'have', 'not', 'downloaded', 'the', 'content', 'you', 'have', 'already', 'paid', 'for', 'goto', 'http://doit', 'mymoby', 'tv', 'to', 'collect', 'your', 'content'],
 ['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005', 'text', 'fa', 'to', '87121', 'to', 'receive', 'entry', 'question(std', 'txt', "rate)t&c's", 'apply', "08452810075over18's"],
 ['hello', 'we', 'need', 'some', 'posh', 'birds', 'and', 'chaps', 'to', 'user', 'trial', 'prods', 'for', 'champneys', 'can', 'i', 'put', 'you', 'down', 'i', 'need', 'your', 'address', 'and', 'dob', 'asap', 'ta', 'r'],
 ['sunshine', 'quiz', 'wkly', 'q', 'win', 'a', 'top', 'sony', 'dvd', 'player', 'if', 'u', 'know', 'which', 'country', 'liverpool', 'played', 'in', 'mid', 'week', 'txt', 'ansr', 'to', '82277', '\x86\x9c1.50', 'sp:tyrone'],
 ['six', 'chances', 'to', 'win', 'cash', 'from', '100', 'to', '20,000', 'pounds', 'txt', 'csh11', 'and', 'send', 'to', '87575', 'cost', '150p/day', '6days', '16', 'tsandcs', 'apply', 'reply', 'hl', '4', 'info'],
 ['private', 'your', '2003', 'account', 'statement', 'for', '07808', 'xxxxxx', 'shows', '800', 'un-redeemed', 's', 'i', 'm', 'points', 'call', '08719899217', 'identifier', 'code', '41685', 'expires', '07/11/04'],
 ['had', 'your', 'mobile', '11', 'months', 'or', 'more', 'u', 'r', 'entitled', 'to', 'update', 'to', 'the', 'latest', 'colour', 'mobiles', 'with', 'camera', 'for', 'free', 'call', 'the', 'mobile', 'update', 'co', 'free', 'on', '08002986030'],
 ['shop', 'till', 'u', 'drop', 'is', 'it', 'you', 'either', '10k', '5k', '\x86\x9c500', 'cash', 'or', '\x86\x9c100', 'travel', 'voucher', 'call', 'now', '09064011000', 'ntt', 'po', 'box', 'cr01327bt', 'fixedline', 'cost', '150ppm', 'mobile', 'vary']]
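# Side note on the punctuation step: str.strip(string.punctuation) only trims the two
# ends of a word, which is why tokens such as 'u-find' and 'r*reveal' survive in the
# output above. A minimal sketch of two alternatives using str.translate; the short
# token list is copied from the output above, and the names remove_table and
# space_table are made up for illustration.

```python
import string

tokens = ["urgent!", "u-find", "r*reveal", "question(std", "credit,"]

# What the notebook does: trim punctuation from the start and end of each token.
stripped = [t.strip(string.punctuation) for t in tokens]

# Alternative 1: delete every punctuation character, wherever it sits.
remove_table = str.maketrans("", "", string.punctuation)
removed = [t.translate(remove_table) for t in tokens]

# Alternative 2: turn punctuation into spaces, then re-split, so joined words separate.
space_table = str.maketrans({ch: " " for ch in string.punctuation})
respaced = " ".join(tokens).translate(space_table).split()

print(stripped)
print(removed)
print(respaced)
```

# Which variant fits depends on the goal: deleting keeps 'ufind' as one token, while
# re-splitting yields 'u' and 'find'. Both would also break up URLs such as www.dbuk.net.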
# 4. Data integration
# - text corpus
text_corpus = [" ".join(text) for text in transformed_spam]
final_text_corpus = " ".join(text_corpus)
print(final_text_corpus)
freemsg hey there darling it's been 3 week's now and no word back i'd like some fun you up for it still tb ok xxx std chgs to send 1.50 to rcv urgent you have won a 1 week free membership in our 100,000 prize jackpot txt the word claim to no 81010 t&c www.dbuk.net lccltd pobox 4403ldnw1a7rw18 xxxmobilemovieclub to use your credit click the wap link in the next txt message or click here http://wap xxxmobilemovieclub.com?n=qjkgighjjgcbl winner as a valued network customer you have been selected to receivea 900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only u have a secret admirer who is looking 2 make contact with u-find out who they r*reveal who thinks ur so special-call on 09058094565 u have a secret admirer who is looking 2 make contact with u-find out who they r*reveal who thinks ur so special-call on 09065171142-stopsms-08 congratulations ur awarded either a yrs supply of cds from virgin records or a mystery gift guaranteed call 09061104283 ts&cs www.smsco.net 1.50pm approx 3mins reminder you have not downloaded the content you have already paid for goto http://doit mymoby tv to collect your content free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's hello we need some posh birds and chaps to user trial prods for champneys can i put you down i need your address and dob asap ta r sunshine quiz wkly q win a top sony dvd player if u know which country liverpool played in mid week txt ansr to 82277 1.50 sp:tyrone six chances to win cash from 100 to 20,000 pounds txt csh11 and send to 87575 cost 150p/day 6days 16 tsandcs apply reply hl 4 info private your 2003 account statement for 07808 xxxxxx shows 800 un-redeemed s i m points call 08719899217 identifier code 41685 expires 07/11/04 had your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for free call the mobile update co free on 08002986030 
shop till u drop is it you either 10k 5k 500 cash or 100 travel voucher call now 09064011000 ntt po box cr01327bt fixedline cost 150ppm mobile vary
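# Side note on the corpus: the collections module is imported in section 1 but not used.
# A minimal sketch, assuming a short made-up corpus string in place of
# final_text_corpus, of counting word frequencies with collections.Counter to see
# which words are likely to dominate the word cloud.

```python
import collections

# Hypothetical corpus for illustration; in the notebook this would be final_text_corpus.
demo_corpus = "free entry to win free prize call now to claim free prize"

# Split on whitespace and count how often each word occurs.
word_counts = collections.Counter(demo_corpus.split())

# The most frequent words, as (word, count) pairs.
top_three = word_counts.most_common(3)

print(top_three)
```

# The resulting counts behave like a dict of word to frequency, which is the kind of
# input WordCloud.generate_from_frequencies accepts if the cloud should be built from
# counts instead of raw text.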
# 5. Wordcloud from corpus
# import libraries
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# here we create a WordCloud object
wordcloud_spam = WordCloud(background_color="white").generate(final_text_corpus)
# I had some problems here at first, but it worked once I included the libraries above
# - plotting wordcloud model
plt.figure(figsize = (20,20))
plt.imshow(wordcloud_spam, interpolation = 'bilinear')
plt.axis("off")
plt.show()