Matplotlib Exploration – Bar Charts with The PirateBay

Background

In an effort to learn a bit more about data science, I have been working through the book Data Science from Scratch by Joel Grus. The book is interesting in that it attempts to teach the concepts of data science by exposing the reader to the nuts and bolts of the techniques, rather than using high-level tools like scikit-learn. I think this is great for helping me avoid “black-boxing” data science. For the most part, each chapter of the book covers a different data science technique. I decided a good approach to getting the most out of this book would be to read it chapter by chapter and, after each chapter, come up with my own little project to apply the technique. The first non-introductory chapter in the book was Chapter 3: Visualizing Data. In this chapter the author introduces the plotting package Matplotlib and gives a few examples of its use.

The Project

I have always thought that The PirateBay was an interesting phenomenon, and since it has been around for a long time I figured it was likely a good opportunity for analysis. So with that thought in mind, I set out to see if I could gather data on The PirateBay for a simple plotting exercise with Matplotlib.

The code for this mini project can be found on my Github at:

https://github.com/brettvanderwerff/Matplotlib-Exploration-Bar-Charts-with-The-PirateBay

[Image: The PirateBay logo]

The Approach

Getting the Data

Originally I thought I would be scraping thepiratebay.org in order to get something to plot, but before I invested time in that, a quick Google search led me to realize that multiple groups already host organized data sets obtained from The PirateBay. I think the lesson here is to do a quick search to see if someone else has already collected the data you want before you set out to collect it yourself. The data set I settled on came from

https://archive.org/details/pirate-bay-torrent-dumps-2004-2016

I liked this data set because it covered a lot of years (2004-2016) and was easily downloadable through a torrent. I have no idea who the author is, but it was uploaded to archive.org by a user named chirsbarry. I downloaded the data and unzipped it. The data came in 13 separate csv files, one for each year. I opened up the csv representing data from the year 2016 (torrent_dump_2016.csv) just to get an idea of its structure.

WARNING: The contents of this data set are going to be offensive to some people.

The first 52 lines of the csv file were metadata that described the layout of each torrent entry and gave key-value pairs for the torrent category codes.

[Image: torrent_dump_2016.csv metadata]

The next 35,000 or so lines each represented a different torrent uploaded to The PirateBay in 2016. Each of these lines is formatted as follows:

DATE_ADDED;HASH(B64);NAME;SIZE(BYTES);CATEGORY

Even though the file is a csv (comma-separated values), the values in this csv are separated by semi-colons.

[Image: Peek at torrent_dump_2016.csv data]

Reading the data into Python

So now I had csv files containing a bit of data on every PirateBay torrent upload for each year between 2004 and 2016. I only used the torrent_dump_2016.csv file as an example here. Before working with this csv file in Python, I moved the metadata that I needed (just the category keys) to a Python file named category_keys.py. This was to avoid issues reading the csv file into Python. With that out of the way I started reading the data into Python with the get_torrent_list function of app.py.

[Image: get_torrent_list function]

Although I had read and written text files before this, I had never worked with csv files in Python. Sentdex’s YouTube video got me up to speed. The main thing I had to change for my situation was setting the delimiter in the reader function of the csv module to ‘;’ instead of the default ‘,’. This function converted my csv file data into a Python list (torrent_list) where each list element represented a different line in the csv. Each element of torrent_list was further divided into a nested list, with elements representing the values separated by semicolons in the original csv.
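
The real function is in the screenshot above; a minimal sketch of the same idea (the file name and variable names are mine, and I assume the metadata header has already been stripped out of the csv) might look like this:

import csv

def get_torrent_list(file_name='torrent_dump_2016.csv'):
    '''Read the semicolon-delimited csv and return a list of rows, where each
    row is a nested list like [date_added, hash, name, size, category].'''
    torrent_list = []
    with open(file_name, encoding='utf-8', errors='ignore') as csv_file:
        reader = csv.reader(csv_file, delimiter=';')  # values are split on ';', not the default ','
        for row in reader:
            if row:  # skip any blank lines
                torrent_list.append(row)
    return torrent_list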

Processing the Data for Plotting

Now that I had the csv file represented by a list of nested lists so that each value in the csv would be easily indexed, I could begin processing the data for plotting.

Just to mess around with Matplotlib for the first time, I decided to simply plot how many torrents in my 2016 data set belonged to each category, as described in the category_keys dictionary of the category_keys.py file. I did this without too much hassle by using two functions in the app.py file: one to generate a list containing only the categories from the torrent_list (the get_categories function) and another to count how many instances of each category occur in the list returned by get_categories (the count_categories function).

[Image: The get_categories and count_categories functions]

Probably the most noteworthy thing in these two functions is the use of the Counter class from the collections module (a subclass of dict). It allowed me to take my categories_list, which was just a list of all the categories found in my torrent_list, and count the number of occurrences of each category. This category count was then returned as a dictionary (categories_count) where the keys represented the categories and the values represented how many times each category was observed in the entire data set.
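
A rough sketch of what those two functions do (assuming the category code is the last field of each row and that the category_keys dictionary maps codes to readable names; the real code is in the screenshot above):

from collections import Counter

from category_keys import category_keys  # assumed to map category codes to readable names

def get_categories(torrent_list):
    '''Pull the category out of each torrent entry, translating the code to a name if possible.'''
    return [category_keys.get(row[4], row[4]) for row in torrent_list]

def count_categories(categories_list):
    '''Return a dict mapping each category to the number of times it appears.'''
    return dict(Counter(categories_list))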

Now that I had my categories_count dictionary, I wanted to split it into two separate lists, one for the categories (x) and one for the category counts (y). These two lists would eventually become the x and y entries for my bar chart in Matplotlib. This splitting was achieved with the get_x_y function of app.py.

[Image: The get_x_y function]

Originally I tried plotting the category counts for every category in my file in a bar chart and it was far too crowded, so I needed a mechanism to allow the user to select only a few of the most abundant categories they were interested in. I addressed this by creating the top_n argument for the get_x_y function, which allows the user to extract the top N most abundant torrent categories from the categories_count dictionary via the following:

sorted(dictionary, key=lambda x: dictionary.get(x), reverse=True)[:top_n]

It is important to note that the sorted function gave me a representation of what my dictionary would look like sorted, but did not actually sort the dictionary. I only needed this to identify which keys, and by extension which values, I wanted to pack into my x and y lists. This was also my first time using lambda. Basically this whole sorted expression says to sort the categories_count dictionary by the value of each key (looked up by the lambda expression), in order from greatest to smallest. The sorted function returns an ordered list of the dictionary keys, which is then sliced according to the value of top_n. The sliced list becomes my x list for plotting, and the y list comes from the for loop right after that. Both lists are returned by the get_x_y function.
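
Putting that together, a sketch of get_x_y might look something like this (a list comprehension stands in for the for loop mentioned above):

def get_x_y(categories_count, top_n=10):
    '''Split the category counts into an x list (category names) and a matching
    y list (counts), keeping only the top_n most abundant categories.'''
    # sorted() doesn't modify the dict; it just returns the keys ordered by their values
    x = sorted(categories_count, key=lambda key: categories_count.get(key), reverse=True)[:top_n]
    y = [categories_count[category] for category in x]
    return x, y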

Plotting the Data

Now that I had an x list that represented the torrent categories and a y list with corresponding counts of those categories, it was finally time to start plotting.

[Image: The bar_chart function]

To do this I made the bar_chart function in app.py. This function takes the x and y lists returned by the get_x_y function as arguments, along with title, xlabel, and ylabel.

plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)

The title, xlabel, and ylabel functions of pyplot are simply for labeling the chart with a title and labels for the x and y axes.

plt.xticks(rotation=45)

The xticks function sets the rotation of category labels at the x ticks (the hash mark below each category on the x axis). I set this to 45 degrees because some of the category titles were long and would overlap without slight rotation.

plt.bar(x=x, height=y)

The bar function makes the bar chart. I passed my x list of categories here as the x coordinates of the bars and my y list of category counts as the heights of the bars.

plt.tight_layout()

The tight_layout function was something I needed to call in order to squeeze the bar chart into the figure that is rendered to the user. Without calling this function, many of my labels were cut off by the figure edges.

plt.show()

The show function needed to be called in order to display the bar chart to the user.
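
Putting those pyplot calls together, bar_chart looks roughly like this (the screenshot above shows the real thing):

import matplotlib.pyplot as plt

def bar_chart(x, y, title, xlabel, ylabel):
    '''Draw a labeled bar chart from the x (categories) and y (counts) lists.'''
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)  # tilt the long category labels so they don't overlap
    plt.bar(x=x, height=y)
    plt.tight_layout()       # keep the labels from being cut off at the figure edges
    plt.show()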

All I needed now was an if __name__ == ‘__main__’ block to generate a plot for the top 10 torrented categories of 2016:

[Image: the if __name__ == ‘__main__’ block]
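
Pieced together from the functions above, that block is roughly the following (the title and axis label strings are my guesses, not necessarily what app.py uses):

if __name__ == '__main__':
    torrent_list = get_torrent_list('torrent_dump_2016.csv')
    categories_count = count_categories(get_categories(torrent_list))
    x, y = get_x_y(categories_count, top_n=10)
    bar_chart(x, y,
              title='Top 10 PirateBay torrent categories of 2016',
              xlabel='Category',
              ylabel='Number of torrents')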

Which gave me the following figure:

[Image: app.py output]

Final Thoughts and Future Directions

Overall this was a neat little mini-project that I learned a few things from. There were a few tricks, like using the Counter dictionary subclass and the sorted function, that were new to me, but I think the most valuable things I learned here were how to read csv files with the csv module and of course how to make a basic bar chart using Matplotlib. Here I only visualized one small aspect of a subset of my entire PirateBay data set, so there is a lot of potential for further work. For example, I could visualize changes in the number of torrents for a specific category over several years, or I could start looking at the time data and plot the average number of torrent uploads per bin of time in a day. It would also be great to try out the other plotting mechanisms in Matplotlib like line charts, scatter plots, and histograms. Also, I mostly dealt with list and dictionary data structures here, but in the future I could explore working with my data in other ways, such as with Pandas dataframes.

Thanks for reading, I hope you enjoyed 🙂

First Adventures in Machine Learning with Naive Bayes

Disclaimer: I am new to programming and this is my first pass at machine learning. This is just a documentation of my learning process and is not really intended for use as a guide. Feel free to contact me with any issues or errors you notice.

Background

I always had machine learning in mind throughout my first few months of learning programming in Python. I decided a few weeks ago that I would start seriously looking at the best way to approach the topic as a learner. My favored approach to Python was to do guided learning for a minimal amount of time to get a foundation and then jump into a project as soon as possible, so I took the same approach to machine learning.

Based on the awesome blog post Machine learning in a week, I decided to dip my toes in by looking at the Udacity Intro to Machine Learning course. I did not spend a ton of time doing this course, but it did help me understand some of the basic terminology used in machine learning, like features and labels. Once they introduced naive Bayes, I logged out and looked for a project where I could apply naive Bayes on my own.

[Image: The Reverend Thomas Bayes]

The Project

So I wanted to use naive Bayes in a (somewhat) self-guided project. From googling around, I quickly learned that naive Bayes is used heavily in text classification. I poked around on Google Scholar for applications of text classification that looked interesting and found a paper titled Social Media Writing Style Fingerprint out of Texas A&M by Yadav et al. from late 2017. The basic idea of the paper was that you can discriminate between a comment made by a redditor of interest and a comment made by a random redditor using machine learning. I did not look at their methods, as I wanted to try the problem on my own using a naive Bayes classifier as a learning experience.

The overall problem: Use naive Bayes to classify Reddit comments as belonging to a redditor of interest or some random redditor.

The Approach

Getting 1000 comments from the redditor of interest

So I wanted to use naive Bayes to determine whether a Reddit comment came from a specific redditor or not. The first thing I needed was text data in the form of a “corpus”, in my case a collection of Reddit comments from a redditor of interest. I picked u/FriesWithThat as my redditor of interest for no specific reason other than that he/she had a long history as a redditor, giving me plenty of comments to work with. To get comment data for FriesWithThat, I first looked at the Reddit API wrapper PRAW, but PRAW seemed to limit the number of comments you could retrieve for a redditor to 1000. I wanted the option to get more than 1000 comments if need be, so I looked for other options. It turns out that there is a data scientist on Reddit by the name of u/Stuck_In_the_Matrix who has been saving all the comments ever made on Reddit to a database. He makes these comments available via an API that returns JSON responses. I wrote a script that calls his API, gets 1000 comments from FriesWithThat, and writes those comments to disk as a txt file titled ‘FriesWithThat_comments.txt’. Although here I only obtained 1000 comments, the script allows me to gather the entire comment history of FriesWithThat or any other redditor if desired. The code for this can be found in my GitHub repository for this project at:

https://github.com/brettvanderwerff/Naive-Bayes-Reddit-Comment-Analyzer

under the script get_redditor_of_interest_comments.py.
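
The real script is in the repo; as a rough sketch of the idea, a paginated Pushshift call might look like this (the endpoint and parameters are from my memory of the public Pushshift API, and its limits have changed over time, so treat them as assumptions):

import requests

def get_comments(author, n_comments=1000):
    '''Page backwards through an author's comment history via Pushshift
    until n_comments comments have been collected.'''
    url = 'https://api.pushshift.io/reddit/search/comment/'
    comments = []
    before = None  # epoch timestamp of the oldest comment seen so far
    while len(comments) < n_comments:
        params = {'author': author, 'size': 500, 'sort': 'desc'}
        if before is not None:
            params['before'] = before
        batch = requests.get(url, params=params).json()['data']
        if not batch:
            break  # ran out of comment history
        comments.extend(item['body'] for item in batch)
        before = batch[-1]['created_utc']
    return comments[:n_comments]

if __name__ == '__main__':
    with open('FriesWithThat_comments.txt', 'w', encoding='utf-8') as out_file:
        out_file.write('\n'.join(get_comments('FriesWithThat')))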

Getting 1000 random redditor comments for comparison

Knowing that I needed a corpus of random redditor comments to compare to FriesWithThat’s comments, I set out to collect a list of at least 1000 random redditors to draw random comments from. The strategy was to obtain the single newest comment from each redditor in that list until I had 1000 random comments to match the 1000 comments I already had for FriesWithThat. I collected these random redditors by recording the authors of the newest posts made to r/all. I thought that using r/all as the source for my random redditor list would avoid the bias I would have gotten by focusing on any one individual subreddit. PRAW actually worked very well here. I had no problem getting my list of random redditors; as it turned out, new posts by new authors were made to r/all almost as fast as I could grab them with my get_random_redditor_list.py script, which at the end writes the list of random redditors to disk as ‘random_redditor_list.txt’.
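
A sketch of what get_random_redditor_list.py does with PRAW (the credentials are placeholders and the details of the real script may differ):

import praw

def get_random_redditor_list(n_redditors=1000):
    '''Record the authors of new r/all submissions until enough names are collected.'''
    reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',        # placeholder credentials
                         client_secret='YOUR_CLIENT_SECRET',
                         user_agent='random-redditor-sampler')
    names = set()
    for submission in reddit.subreddit('all').stream.submissions():
        if submission.author is not None:  # some posts have deleted authors
            names.add(submission.author.name)
        if len(names) >= n_redditors:
            break
    return sorted(names)

if __name__ == '__main__':
    with open('random_redditor_list.txt', 'w', encoding='utf-8') as out_file:
        out_file.write('\n'.join(get_random_redditor_list()))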

Now that I had a list of over 1000 random redditors, I wanted to retrieve the single newest comment each of those redditors made. I couldn’t find a good way to do this in PRAW, so I went back to calling Stuck_In_the_Matrix’s Pushshift Reddit API, as shown in my get_random_redditor_comments.py script. This script opens ‘random_redditor_list.txt’, reads a redditor name off the first line, and gets the most recent comment made by that redditor. This cycle continues, reading a new line on each pass, until 1000 random comments are obtained. The 1000 random comments are then saved to disk as ‘random_redditor_comments.txt’.
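
A sketch of that loop, again assuming the Pushshift endpoint described earlier rather than the script’s exact code:

import requests

def newest_comment(author):
    '''Return the text of the author's single most recent comment, or None.'''
    url = 'https://api.pushshift.io/reddit/search/comment/'
    data = requests.get(url, params={'author': author, 'size': 1, 'sort': 'desc'}).json()['data']
    return data[0]['body'] if data else None

if __name__ == '__main__':
    comments = []
    with open('random_redditor_list.txt', encoding='utf-8') as list_file:
        for line in list_file:
            if len(comments) >= 1000:
                break
            comment = newest_comment(line.strip())
            if comment is not None:
                comments.append(comment)
    with open('random_redditor_comments.txt', 'w', encoding='utf-8') as out_file:
        out_file.write('\n'.join(comments))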

Bringing the data together

At this point, I was mostly finished building the pieces I needed to get data from FriesWithThat and random redditors. Also, by now it was obvious that getting data to analyze was no small part of this endeavor and was likely taking more time than the analysis itself would. The last thing I needed to do was put it all together in a script that runs all the data-gathering scripts I built in sequence. I attempted to do this in a tidy way with the collect_data.py script, which invokes get_redditor_of_interest_comments.py, get_random_redditor_list.py, and get_random_redditor_comments.py.
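
One simple way to chain the three scripts (the real collect_data.py may import them as modules instead of shelling out):

import subprocess
import sys

# Run the three data-gathering scripts in order; stop if any of them fails.
for script in ('get_redditor_of_interest_comments.py',
               'get_random_redditor_list.py',
               'get_random_redditor_comments.py'):
    subprocess.run([sys.executable, script], check=True)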

[Image: Snippet from get_redditor_of_interest_comments.py]

The analysis

Most of the approach up to this point used techniques I was already pretty familiar with. Most of the remaining techniques are specific to scikit-learn, a machine learning package for Python, and were new territory for me, so I am adding more detail in this section of the blog for that reason.

Now that I had a corpus of comments saved to disk for FriesWithThat (FriesWithThat_comments.txt) and a corpus saved for random redditor comments (random_redditor_comments.txt) by running collect_data.py, I wanted to read them back into Python in an organized way and curate them for machine learning with a naive Bayes classifier. For this I used organize_corpus.py. This script does four main things (sketched loosely in the code after the list):

  1. The open_corpus function reads the corpus txt file into Python as one large string.
  2. The tokenize_corpus function takes the string as an argument and processes it with the nltk function sent_tokenize, which splits the large string at sentence boundaries and returns a list of sentences. It also randomizes the order of sentences in the returned list to avoid bias when the list is later divided into training and testing lists.
  3. split_test_train is a function I made to split each list returned from tokenize_corpus into a training list and a testing list, which are then returned. From looking around, it seemed common for people to set aside 2/3 of their text data for training and 1/3 for testing, so that is what I did here. This function also generates and returns a list of corresponding labels. I chose the convention of labeling FriesWithThat’s data as 0 and the random redditors’ data as 1. I know there are built-in functions in scikit-learn to do this entire process, but as a novice I could not get them to work with my data.
  4. The combine_data function takes the training lists outputted by the split_test_train function and returns them as a combined list. It also returns a combined list of their corresponding labels. It processes the testing lists and corresponding labels in the same way and returns those as well.
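
A loose sketch of those four pieces (the real signatures in organize_corpus.py differ a bit; nltk’s sentence tokenizer also needs the ‘punkt’ models downloaded first):

import random
from nltk.tokenize import sent_tokenize  # requires: nltk.download('punkt')

def open_corpus(file_name):
    '''Read a saved comment corpus back in as one large string.'''
    with open(file_name, encoding='utf-8') as corpus_file:
        return corpus_file.read()

def tokenize_corpus(corpus):
    '''Split the corpus into sentences and shuffle them to avoid ordering bias.'''
    sentences = sent_tokenize(corpus)
    random.shuffle(sentences)
    return sentences

def split_test_train(sentences, label):
    '''Set aside 2/3 of the sentences for training and 1/3 for testing,
    along with matching label lists (0 = redditor of interest, 1 = random).'''
    split_point = (2 * len(sentences)) // 3
    train, test = sentences[:split_point], sentences[split_point:]
    return train, [label] * len(train), test, [label] * len(test)

def combine_data(list_a, labels_a, list_b, labels_b):
    '''Merge two corpora (and their label lists) into single combined lists;
    called once for the training halves and once for the testing halves.'''
    return list_a + list_b, labels_a + labels_b
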
[Image: The complete naive_bayes.py program]

With all the data organized it was time for the meat of the analysis, which takes place in the naive_bayes.py script. The very first thing this script does is invoke all the functions in the organize_corpus.py module to process the previously saved ‘FriesWithThat_comments.txt’ and ‘random_redditor_comments.txt’ files, returning sentence-divided training and testing lists for each corpus along with their associated label lists.

TfidfVectorizer is a class from scikit-learn that I used to convert my training and testing lists of sentences into numeric representations. The technical term for this process is vectorization, and it is a required step because the naive Bayes algorithm cannot handle text without this numeric conversion. TfidfVectorizer does this using tf-idf, or ‘term frequency-inverse document frequency’. The method first counts the number of times a word appears in one of my sentences and divides it by the total number of words in the sentence. For example, if the word ‘jog’ appears twice in a sentence that is ten words long, then the ‘term frequency’ for ‘jog’ in that sentence would be .2. This .2 is then multiplied by the ‘inverse document frequency’ of ‘jog’, which is the log of (the number of sentences in the corpus / the number of sentences that contain ‘jog’). To continue our example, if ‘jog’ appears in 100 of the 10,000 sentences in my corpus, then the tf-idf calculation would look roughly like this (scikit-learn’s actual implementation adds smoothing and normalization, but the idea is the same):

tf-idf = .2 * log(10,000/100) = .2 * 2 = .4   (using a base-10 log)

I liked this approach because it normalizes the vectorization to take into account differences in total word count between corpora. So I created the vector object like so:

vector = TfidfVectorizer()

The next step was to fit my training data, so vector would “learn” the vocabulary in my corpus and return a matrix of the vocabulary to me. This can be done conveniently in one step using the fit_transform method:

document_term_matrix_train = vector.fit_transform(combine_train)

fit_transform takes the combined training list as an argument and returns a document term matrix, which is a matrix with n_samples rows and n_features columns. In my case n_samples refers to the number of sentences in my corpus and n_features refers to the number of unique words in the corpus. Andrew Hintermeier has a great blog post that explains this data structure. Next I similarly transformed my testing data to a document term matrix. For this step I used the transform method instead of fit_transform.

document_term_matrix_test = vector.transform(combine_test)

I used the transform method on the testing data instead of fit_transform as I did with the training data, because I didn’t want to overwrite the vocabulary the vectorizer had already learned from my training data.

Now that I had fitted vector to the training data and transformed both the training data and the testing data to document term matrices, it was time to set up the classifier. I decided to use a multinomial naive Bayes classifier; just from reading around, it appeared to be a gold-standard algorithm for text classification problems like mine. This blog post by ‘AI Technology & Industry Review’ helped me develop some understanding of how multinomial naive Bayes works by hand. Next I created the classifier object with the multinomial naive Bayes class from scikit-learn.

classifier = MultinomialNB()

I also needed to fit the classifier to my training data and the corresponding labels using the fit method of the multinomial naive Bayes class. This prepared the classifier to predict whether a comment came from FriesWithThat or a random redditor:

classifier.fit(document_term_matrix_train, combine_train_label)

I was on the home stretch now, and the next step is really where the magic happened. Calling the predict method on the classifier with the testing data as an argument returns a list of what the classifier thinks the labels for the testing data should be, based on the fitting I did in the previous step with the training data and labels.

predictor = classifier.predict(document_term_matrix_test)

The final step was to call the accuracy_score function, which determines how often the predicted labels returned by the predictor agree with the test labels I generated earlier (the correct labels for the test data). The program printed the accuracy as its output upon successful execution.
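
Condensed, the whole naive_bayes.py flow is essentially the following (the tiny inline lists are stand-in data so the sketch runs on its own; the real script builds these lists from the two comment files via organize_corpus.py):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Stand-in data (0 = redditor of interest, 1 = random redditor).
combine_train = ['I think the article misses the point entirely.',
                 'lol what a great game',
                 'That subreddit has really gone downhill.',
                 'upvote for the cat picture']
combine_train_label = [0, 1, 0, 1]
combine_test = ['The point of the article was pretty clear.', 'lol nice cat']
combine_test_label = [0, 1]

vector = TfidfVectorizer()
document_term_matrix_train = vector.fit_transform(combine_train)  # learn the vocabulary and vectorize
document_term_matrix_test = vector.transform(combine_test)        # reuse the training vocabulary

classifier = MultinomialNB()
classifier.fit(document_term_matrix_train, combine_train_label)

predictor = classifier.predict(document_term_matrix_test)
print(accuracy_score(combine_test_label, predictor))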

[Image: The output of naive_bayes.py]

I ran this with several different collections of comments and typically got accuracies in the high 60s to mid 70s (percent) using 1000 comments. The accuracy is not great, but as a novice using little to no optimization, I was happy to get anything above 50%. Overall it was a fun project, and I learned that text classification with naive Bayes can be fun and approachable for a machine learning beginner.

Future Directions

This project is far from done. I would really like to optimize it and get higher accuracy. I did not filter or clean the comment data in any way; filtering out different words or types of words could improve the outcome. Cleaning up spelling and punctuation, along with removing strange things like links, could change the results too (Reddit comments are messy). In my example, I treated each sentence as a sample, but I could explore dividing the corpus in other ways, such as treating each comment as a sample, to see if that makes a difference. There are also other text classification algorithms, such as support vector machines, which may be better suited to my problem. I also limited my project to taking 1000 comments from FriesWithThat and the random redditors; I could try taking many more comments and seeing if that improves my accuracy score.

In my opinion, all these are good things to follow up on, but I think ultimately the best next step would be to write a program to do the naive Bayes calculations “by hand” rather than using scikit-learn’s classes. I think this would better solidify my understanding of what is happening under the hood during classification.

Thanks for reading, I hope you enjoyed!