This is the third part of a series of articles about a little project I am working on. In part 1, I created a web scraper to collect the data I needed. In part 2, I added support for saving the collected data to a MongoDB database. In this part, I will look into how to clean up the collected data and add new features (columns) to make it more suitable for analysis.
My primary motivation here is to learn new technologies as I go, so my baby steps may not be the state of the art in this particular area, and all tips, tricks, and corrections are welcome.
For this project I am using Python, and each day I love it more and more. I will be using some cool Python libraries, such as pandas, along with some cool tools, such as Jupyter notebooks.
To start, make sure that you have Jupyter Notebook installed on your machine, and then launch it from the git repo folder.
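Launching the notebook server is a single command, run from the repo folder:

```shell
# From the cloned git repo folder, start the Jupyter notebook server;
# a browser tab should open automatically
jupyter notebook
```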
With this command, we started the IPython (Jupyter) notebook server, and a new browser window will open. Click on the Enhancing and Extending data with Pandas notebook to see and run the code that this article will describe. The enhance_data_with_pandas.py file contains the same code, so it can also be run without a Jupyter notebook.
Exploring data with pandas
First of all, we need to import some libraries and change how the data is printed out.
The changes to the display options, e.g. display.max_columns and display.max_rows, are needed so that I can see the whole data set when it is printed in the notebook, not just a few lines. Feel free to comment them out…
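A minimal sketch of this setup step could look like the following. The small stand-in DataFrame and its column names are assumptions for illustration; in the real project the frame is loaded from the MongoDB collection built in part 2.

```python
import pandas as pd

# Widen pandas' display limits so the whole frame prints in the notebook
# (comment these out if the full output is too noisy)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Stand-in for the scraped jobs data -- in the real project this is
# loaded from MongoDB; the column names here mirror the article but
# are otherwise an assumption
jobs = pd.DataFrame({
    'title':    ['Python Developer', 'Data Engineer'],
    'location': ['Berlin, Deutschland', 'Austin, TX'],
    'salary':   ['€55k - 75k | Provides Equity', None],
})
```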
Next, there are three calls to see what kind of data we loaded into our jobs data frame: we print out the columns, the types of the columns, and some generic numerical statistics describing the loaded data set. In the last one, the most important figure is the count, which tells us how many rows of data we loaded.
We can also print the first and last few lines of the data to have a look at what we have, using the head() and tail() method calls.
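The inspection calls above can be sketched as follows; the toy DataFrame is an assumption standing in for the loaded jobs data.

```python
import pandas as pd

# Toy stand-in for the jobs frame (contents are illustrative only)
jobs = pd.DataFrame({
    'title':  ['Python Developer', 'Data Engineer', 'DevOps Engineer'],
    'salary': ['€55k - 75k', None, '$90k - 120k'],
})

print(jobs.columns)     # the column labels
print(jobs.dtypes)      # the type pandas inferred for each column
print(jobs.describe())  # summary stats; 'count' = rows with a value
print(jobs.head())      # first few rows
print(jobs.tail())      # last few rows
```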
Working with Salary
In our jobs data, we have the salary column, which may or may not contain the salary offered by the company for a particular position. Missing values in that column can cause issues during further processing, so let’s replace the NA values with empty strings.
Next, we will extract whether the given position offers equity or not and store it in a new column called equity.
Then we will move the salary data to a new Series object and remove the Provides Equity marker to make it easier to extract the other values. The reason for this is that I would like to leave the original columns as they were and make the needed modifications in new columns or in temporary variables.
Once we have removed Provides Equity from the salary information, each value is either an empty string or a string containing the currency and the low and high figures of the salary range.
This information will be extracted using regular expressions and captured in separate columns. For the rows with empty values, it is best to provide defaults: an empty string for the currency and zero for the low and high figures.
Then we map the new information back to the jobs data frame.
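The salary steps above could be sketched like this. The sample strings, the “€55k - 75k | Provides Equity” format, and the regular expression are assumptions about how the scraped values look.

```python
import pandas as pd

# Toy stand-in for the jobs frame; the salary format is an assumption
jobs = pd.DataFrame({
    'title':  ['Dev A', 'Dev B', 'Dev C'],
    'salary': ['€55k - 75k | Provides Equity', None, '$90k - 120k'],
})

# 1. Missing salaries break string operations, so replace NA with ''
jobs['salary'] = jobs['salary'].fillna('')

# 2. New boolean column: does the posting offer equity?
jobs['equity'] = jobs['salary'].str.contains('Provides Equity')

# 3. Work on a copy so the original column stays untouched,
#    stripping the equity marker first
salary = jobs['salary'].str.replace(r'\s*\|?\s*Provides Equity', '', regex=True)

# 4. Pull out the currency symbol and the low/high figures with a regex
#    (the pattern assumes the '€55k - 75k' shape shown above)
parts = salary.str.extract(r'([^\d\s]+)?(\d+)k\s*-\s*(\d+)k')
parts.columns = ['currency', 'salary_low', 'salary_high']

# 5. Defaults for rows without salary info: '' and 0
parts['currency'] = parts['currency'].fillna('')
parts['salary_low'] = parts['salary_low'].fillna(0).astype(int)
parts['salary_high'] = parts['salary_high'].fillna(0).astype(int)

# 6. Map the new columns back onto the jobs frame
jobs[['currency', 'salary_low', 'salary_high']] = parts
```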
Fixing up Location
We also need better location information so we can do analysis by country and city. For this, we need to extract the country, state, and city out of the location column. But first, let’s remove the NA values from the location column. Then we use a lambda to split the comma-separated location value into individual fields. Finally, we rename the columns to city, location_1, and location_2; the neutral names are needed because the content of location differs between countries.
Depending on the country, the second member of the split value is sometimes a country (e.g. “Berlin, Deutschland”) and sometimes a state (e.g. “Austin, TX”), so we need to handle these cases differently.
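A sketch of the splitting step, with illustrative sample locations:

```python
import pandas as pd

# Sample locations (illustrative only)
jobs = pd.DataFrame({
    'location': ['Berlin, Deutschland', 'Austin, TX', None],
})

# Rows without a location are of no use for geo analysis
jobs = jobs.dropna(subset=['location'])

# Split the comma-separated value into one column per part
loc = jobs['location'].apply(lambda value: pd.Series(value.split(', ')))

# First part is always the city; the meaning of the later parts
# differs by country, hence the neutral names
loc.columns = ['city', 'location_1', 'location_2'][:loc.shape[1]]
if 'location_2' not in loc.columns:
    loc['location_2'] = None
```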
Fixing US locations
So it seems that US locations are special: they come in the form city, state, but we need them in the form city, state, country, so let’s fix this first. If location_1 contains a US state, we put US in location_2.
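One way to do this is with a membership test against the set of state abbreviations; the abbreviated state set and the sample frame below are assumptions for illustration.

```python
import pandas as pd

# Two-letter US state codes (abbreviated here; the full 50-state
# list would be used in practice)
US_STATES = {'TX', 'CA', 'NY', 'WA'}

# Sample split-location frame (illustrative only)
loc = pd.DataFrame({
    'city':       ['Berlin', 'Austin'],
    'location_1': ['Deutschland', 'TX'],
    'location_2': [None, None],
})

# US rows come as 'city, state'; normalize them toward
# 'city, state, country' by filling in the country
mask = loc['location_1'].isin(US_STATES)
loc.loc[mask, 'location_2'] = 'US'
```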
Filling the state and country columns
We are now ready to put the city, state, and country information back into our original data frame. Here is the logic: if location_2 is null in a row, then location_1 contains the country of that location; if location_2 is not empty, then location_2 is the country and location_1 contains the state.
Now we can see which countries are posting the most jobs. It seems that the US, Deutschland, Germany, and the UK are the top countries. But wait: aren’t Germany and Deutschland the same country? Let’s fix this, along with some other countries listed under their native names.
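A sketch of the normalization, mapping native-language country names to one canonical English name so the per-country counts don’t split; the exact mapping entries and sample values are assumptions.

```python
import pandas as pd

# Sample country column (illustrative only)
jobs = pd.DataFrame({
    'country': ['US', 'Deutschland', 'Germany', 'UK', 'Österreich'],
})

# Native name -> canonical English name (entries are assumptions)
native_names = {
    'Deutschland': 'Germany',
    'Österreich':  'Austria',
    'España':      'Spain',
    'Schweiz':     'Switzerland',
    'Nederland':   'Netherlands',
}
jobs['country'] = jobs['country'].replace(native_names)

# Which countries post the most jobs?
print(jobs['country'].value_counts())
```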
Now we have a pretty neat table with data we can run analytics on. In the next part, I will try to ask questions of the data and create some graphs to find out interesting things, such as:
- Which countries have the most jobs?
- Which cities have the most jobs?
- Which technologies are the most popular?
- Which technologies pay the most?