This is the third part of a series about a small project I am working on. In part 1, I created a web scraper to get the data I needed. In part 2, I added support for saving the collected data to a MongoDB database. In this part, I will look at how to clean up the collected data and add new features (columns) to make it more suitable for analysis.
My primary motivation here is to learn new technologies as I go, so my baby steps may not be state of the art in this area. Tips, tricks, and corrections are all welcome.
For this project I am using Python, and I love it more each day. I will use some cool Python libraries such as pandas, and some cool tools such as Jupyter notebooks.
To start, make sure that you have Jupyter Notebook installed on your machine, and then launch it from the git repo folder.
pip install jupyter
cd /path/to/stackjobs
jupyter notebook
The last command starts the Jupyter (formerly IPython) Notebook server and opens it in a new browser tab. Click on the Enhancing and Extending data with Pandas notebook to see and run the code that this article will describe. The enhance_data_with_pandas.py file contains the same code, so it can also be run without the notebook.
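To give a rough idea of what "cleaning up and adding columns" means in pandas, here is a minimal sketch. It is not the code from the notebook: the records and the field names (title, company, salary_min, salary_max) are made up for illustration, and the real scraped data may look quite different.

```python
import pandas as pd

# Hypothetical records, shaped like documents a scraper might store in MongoDB.
# The field names and values are assumptions, not the project's real schema.
records = [
    {"title": "  Python Developer ", "company": "Acme", "salary_min": 50000, "salary_max": 70000},
    {"title": "Data Engineer", "company": "Globex", "salary_min": 60000, "salary_max": 80000},
    {"title": "Data Engineer", "company": "Globex", "salary_min": 60000, "salary_max": 80000},
]

df = pd.DataFrame(records)

# Clean up: trim stray whitespace, then drop exact duplicate rows.
df["title"] = df["title"].str.strip()
df = df.drop_duplicates().reset_index(drop=True)

# New feature: midpoint of the salary range.
df["salary_mid"] = (df["salary_min"] + df["salary_max"]) / 2

# New feature: flag titles that mention Python (case-insensitive).
df["mentions_python"] = df["title"].str.contains("python", case=False)

print(df)
```

Each new column is derived from existing ones in a single vectorized expression, which is the main reason pandas is pleasant for this kind of work.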