Background
In an effort to learn a bit more about data science, I have been working through the book Data Science from Scratch by Joel Grus. The book is interesting in that it attempts to teach the concepts of data science by exposing the reader to the nuts and bolts of techniques, rather than using high-level tools like scikit-learn. I think this is great for helping me avoid “black-boxing” data science. For the most part, each chapter of the book covers a different technique in data science. I decided a good approach to getting the most out of this book would be to read it chapter by chapter and, after reading each chapter, try to come up with my own little project to apply that chapter's technique. The first non-introductory chapter in the book was Chapter 3: Visualizing Data. In this chapter the author introduces the plotting package Matplotlib and gives a few examples of its use.
The Project
I have always thought that The Piratebay was an interesting phenomenon and since it has been around for a long time I thought it was likely a good opportunity for analysis. So with that thought in mind I set out to see if I could gather data on The Piratebay for a simple plotting exercise with Matplotlib.
The code for this mini project can be found on my Github at:
https://github.com/brettvanderwerff/Matplotlib-Exploration-Bar-Charts-with-The-PirateBay
![the-pirate-bay](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/02/the-pirate-bay.png?w=264&h=300)
The Approach
Getting the Data
Originally I thought I would be scraping thepiratebay.org in order to get something to plot, but before I invested time in that, a quick Google search led me to realize that multiple groups already host organized data sets obtained from The Piratebay. I think the lesson here is just to do a quick search to see if someone else has collected the data you want before you set out to collect it yourself. The data set I settled on came from:
https://archive.org/details/pirate-bay-torrent-dumps-2004-2016
I liked this data set because it covered a lot of years (2004-2016) and was easily downloadable through a torrent. I have no idea who the author is, but it was uploaded to archive.org by a user named chirsbarry. I downloaded the data and unzipped it. The data came in 13 separate csv files, one csv file for each year. I opened up the csv representing data from the year 2016 (torrent_dump_2016.csv) just to get an idea of its structure.
WARNING: The contents of this data set are going to be offensive to some people.
The first 52 lines of the csv file were metadata that described the layout of each torrent entry and gave key, value pairs for torrent category codes.
![meta](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/02/meta.png?w=1100)
The next 35,000 or so lines each represented a different torrent uploaded to The Piratebay in 2016. Each of these lines is formatted as follows:
DATE_ADDED;HASH(B64);NAME;SIZE(BYTES);CATEGORY
Even though the file is a csv (comma-separated values), the values in this csv are separated by semi-colons.
![torrent_list](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/02/torrent_list.png?w=1100)
Reading the data into Python
So now I had csv files containing a bit of data on every PirateBay torrent upload for each year between 2004 and 2016. I only used the torrent_dump_2016.csv file as an example here. Before working with this csv file in Python, I moved the metadata that I needed (just the category keys) to a Python file named category_keys.py. This was to avoid issues reading the csv file into Python. With that out of the way I started reading the data into Python with the get_torrent_list function of app.py.
![FffNCfB - Imgur](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/02/fffncfb-imgur.png?w=1100)
Although I had read and written text files before this, I had never worked with csv files in Python. Sentdex’s Youtube video got me up to speed. The main thing I had to change for my situation was to set the delimiter in the reader function of the csv module to ';' instead of the default ','. This function converted my csv file data into a Python list (torrent_list) where each list element represented a different line in the csv. Each element of the torrent_list was itself a nested list whose elements were the values separated by semi-colons in the original csv.
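Since the function itself only appears as a screenshot, here is a minimal sketch of what reading a semi-colon delimited file with the csv module can look like. The function name read_semicolon_csv and the two sample rows are made up for illustration, so this is not necessarily identical to get_torrent_list in the repository.

```python
import csv
import io


def read_semicolon_csv(file_obj):
    """Return a list of rows, where each row is a list of the values on one line."""
    reader = csv.reader(file_obj, delimiter=';')
    return [row for row in reader]


# Made-up sample rows in the DATE_ADDED;HASH(B64);NAME;SIZE(BYTES);CATEGORY layout.
sample = io.StringIO(
    "2016-Jan-01 00:00:01;aGFzaDE=;Some Name;1024;101\n"
    "2016-Jan-01 00:00:02;aGFzaDI=;Another Name;2048;201\n"
)

torrent_list = read_semicolon_csv(sample)
print(torrent_list[0][4])  # category of the first row: 101
```

With a real file, the same function could be handed the object returned by open(path, newline='') instead of the in-memory sample.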
Processing the Data for Plotting
Now that I had the csv file represented by a list of nested lists so that each value in the csv would be easily indexed, I could begin processing the data for plotting.
Just to mess around with Matplotlib for the first time, I decided to simply plot how many torrents in my 2016 data set belonged to each category as described in the category_keys dictionary of the category_keys.py file. I did this without too much hassle by using two functions in the app.py file: one to generate a list containing only the categories from the torrent_list (the get_categories function) and another to count how many instances of each category occur in the list returned by the get_categories function (the count_categories function).
![LsICVp6 - Imgur](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/02/lsicvp6-imgur.png?w=1100)
Probably the most noteworthy thing between these two functions is the use of Counter, a dictionary subclass from the collections module. Counter allowed me to take my categories_list, which was just a list of all the categories found in my torrent_list, and count the number of occurrences of each category. This category count was then returned as a dictionary (categories_count) where the keys represented the categories and the values represented how many times each category was observed in the entire data set.
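To make the counting step concrete, here is a small sketch of Counter on its own. The category labels below are made up and only stand in for the real categories_list.

```python
from collections import Counter

# Made-up category labels standing in for the real categories_list.
categories_list = ["Music", "Movies", "Music", "TV shows", "Music", "Movies"]

# Counter tallies how many times each distinct element appears.
categories_count = Counter(categories_list)

print(categories_count["Music"])        # 3
print(categories_count.most_common(2))  # [('Music', 3), ('Movies', 2)]
```

Because Counter is a dictionary subclass, categories_count can be used anywhere a regular dictionary of counts would be.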
Now that I had my categories_count dictionary, I wanted to split the dictionary into two separate lists, one for the categories (x) and one for the category counts (y). These two lists would eventually represent the x and y entries for my bar chart with Matplotlib. This splitting was achieved with the get_x_y function of app.py.
![66txJCh - Imgur](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/03/66txjch-imgur.png?w=1100)
Originally I tried plotting the category counts for every category in my file in a bar chart and it was far too crowded, so I needed a mechanism to allow the user to select only a few of the most abundant categories that they were interested in. I addressed this by creating the top_n argument for the get_x_y function, which allows the user to extract the top N most abundant torrent categories in the categories_count dictionary via the following:
sorted(dictionary, key=lambda x: dictionary.get(x), reverse=True)[:top_n]
It is important to note that the sorted function gave me a sorted list of my dictionary's keys, but did not actually sort the dictionary itself. I only needed to do this to identify which keys, and by extension which values, I wanted to pack into my x and y lists. This was also my first time using lambda. Basically this whole sorted expression says to sort the keys of the categories_count dictionary according to their values (looked up by the lambda expression), in order from greatest to smallest. The sorted function returns an ordered list of the dictionary keys that is then sliced according to the value of top_n. The sliced list will be my x list for plotting and the y list is gotten from the for loop that comes right after that. Both these lists are returned by the get_x_y function.
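Here is a minimal sketch of that top N idea with made-up counts. The function name top_n_items is hypothetical and is not the get_x_y function from the repository. It passes dictionary.get directly as the key, which does the same job as the lambda shown above.

```python
def top_n_items(dictionary, top_n):
    """Return (keys, values) for the top_n largest values in dictionary."""
    # sorted() iterates over the keys and returns a new list; the dictionary is untouched.
    x = sorted(dictionary, key=dictionary.get, reverse=True)[:top_n]
    # Look up the count that belongs to each selected key, in the same order.
    y = [dictionary[key] for key in x]
    return x, y


# Made-up counts for illustration only.
counts = {"Music": 3, "Movies": 2, "TV shows": 1, "Games": 5}

x, y = top_n_items(counts, top_n=2)
print(x)             # ['Games', 'Music']
print(y)             # [5, 3]
print(list(counts))  # ['Music', 'Movies', 'TV shows', 'Games'] (original order kept)
```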
Plotting the Data
Now that I had an x list that represented the torrent categories and a y list with corresponding counts of those categories, it was finally time to start plotting.
![yj2VFEH - Imgur](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/03/yj2vfeh-imgur.png?w=1100)
To do this I made the bar_chart function in app.py. This function takes the x and y lists returned by the get_x_y function, title, xlabel, and ylabel as arguments.
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
The title, xlabel, and ylabel functions of pyplot are simply for labeling the chart with a title and labels for the x and y axes.
plt.xticks(rotation=45)
The xticks function sets the rotation of category labels at the x ticks (the hash mark below each category on the x axis). I set this to 45 degrees because some of the category titles were long and would overlap without slight rotation.
plt.bar(x=x, height=y)
The bar function makes the bar chart. I passed my x list of categories here as an argument for x coordinates of the bars and my y list of category counts as an argument for the height of the bars.
plt.tight_layout()
The tight_layout function was something I needed to call in order to squeeze the bar chart into the figure that is rendered to the user. Without calling this function many of my labels were cut off by the figure edges.
plt.show()
The show function needed to be called in order to display the bar chart to the user.
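Putting the individual calls together, a function along these lines would produce the chart. This is only a sketch: the name bar_chart_sketch is hypothetical, it returns the figure rather than calling plt.show() itself, and it sets the tick rotation after drawing the bars, so it may differ from the bar_chart function in the screenshot.

```python
import matplotlib.pyplot as plt


def bar_chart_sketch(x, y, title, xlabel, ylabel):
    """Draw a labeled bar chart and return the figure without displaying it."""
    fig = plt.figure()
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    # x holds the category names, y holds the matching counts.
    plt.bar(x=x, height=y)
    # Rotate the category labels so long names do not overlap.
    plt.xticks(rotation=45)
    # Shrink the plot area so the labels fit inside the figure.
    plt.tight_layout()
    return fig


# Example usage with made-up counts:
# bar_chart_sketch(["Music", "Movies"], [3, 2], "Torrents by category", "Category", "Count")
# plt.show()
```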
All I needed now was an if __name__ == '__main__' statement to generate a plot for the top 10 torrented categories of 2016:
![EBHpYwF - Imgur](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/03/ebhpywf-imgur.png?w=1100)
Which gave me the following figure:
![Figure_1-1](https://adventureswithpie.wordpress.com/wp-content/uploads/2018/03/figure_1-11.png?w=1100)
Final Thoughts and Future Directions
Overall this was a neat little mini-project that I learned a few things from. There were a few tricks like using the Counter dictionary subclass and the sorted function that were new to me, but I think the most valuable things I learned here were how to read csv files with the csv module and of course how to make a basic bar chart using Matplotlib. Here I only visualized one small aspect of a subset of my entire PirateBay data set, so there is a lot of potential for further work. For example, I could look at visualizing changes in the number of torrents for a specific category over several years, or I could start looking at the time data and plot the average number of torrent uploads that occur per bin of time in a day. It would also be great to try out the other plotting mechanisms in Matplotlib like line charts, scatter plots, and histograms. Also, I mostly dealt with list and dictionary data structures here, but in the future I could explore working with my data in other ways, such as by using pandas DataFrames.
Thanks for reading, I hope you enjoyed it 🙂