Let’s continue with our project. To summarise the first part: we wrote a scraper in Python, using the Scrapy framework, that was capable of getting data from the Stack Overflow job pages, but nothing more than that.
In this part, we will save the results to MongoDB, make sure that we do not have duplicates and learn how to export the data we gathered.
The source code for this project can be found in the stackjobs repo on GitHub.
I need to store the data somewhere, and I chose MongoDB for no particular reason :). Let’s install it via the Homebrew package manager using the following commands:
If you would rather download the installation package or use some other installation method, please visit the page Install MongoDB Community Edition on OS X and choose the method that suits you.
Then create the directory structure for MongoDB in a suitable location and start the server.
You can confirm that the MongoDB service is up and running by visiting http://localhost:28017.
How to configure mongo to run as a service?
Because I installed MongoDB using brew, I can also start, stop and list it as a service on macOS with the brew services start mongodb, brew services stop mongodb and brew services list commands.
My only problem with this approach is that I do not know how to enable the --rest interface in this case. However, when I navigate to http://localhost:27017 I get the standard MongoDB warning, which means that my database engine is running, and that is good enough for now.
In order to communicate with MongoDB from Python, you will need to install the pymongo module using pip (pip install pymongo).
Saving the items in db
To be able to save the items gathered you will need to add some configuration to your project and also add some code to your pipelines.py.
Setting up a Pipeline
First, let’s see how the pipeline works. We will define a new class called MongoDBPipeline. This class has a constructor and a process_item method that will be called whenever a new item is created.
We will use process_item to check the validity of the item first and then check it against a list called ids_seen. This list contains the identifiers of all the items that we have already processed. This way we ensure that there will be no duplicated items in our database.
If the item is valid and it is not in our list of previous items, then we can add it to our mongo database.
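A minimal sketch of how such a process_item method might look. The job_id field name, the simplified constructor and the fallback DropItem class are assumptions made for illustration, and the sketch keeps the identifiers in a set rather than a list so that lookups stay fast.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be tried without Scrapy installed.
    class DropItem(Exception):
        pass


class MongoDBPipeline:
    # Hypothetical name of the field that uniquely identifies a job posting.
    id_field = "job_id"

    def __init__(self, collection, ids_seen=()):
        # Simplified constructor so the sketch runs on its own: `collection`
        # is any object with an insert_one() method, such as a pymongo collection.
        self.collection = collection
        self.ids_seen = set(ids_seen)

    def process_item(self, item, spider):
        data = dict(item)
        # 1. Validity check: drop the item if any field is empty.
        for field, value in data.items():
            if not value:
                raise DropItem("Missing {0}!".format(field))
        # 2. Duplicate check against the identifiers processed so far.
        item_id = data.get(self.id_field)
        if item_id is None:
            raise DropItem("Missing {0}!".format(self.id_field))
        if item_id in self.ids_seen:
            raise DropItem("Duplicate item found: {0}".format(item_id))
        # 3. A new, valid item: remember its identifier and store it.
        self.ids_seen.add(item_id)
        self.collection.insert_one(data)
        return item
```

Raising DropItem tells Scrapy to discard the item, so invalid or duplicated items never reach the database.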
Now, we need to check out the constructor for this class. Here we initialise the connection and fill the ids_seen list with the identifiers that we already have in the database.
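One possible shape for such a constructor, as a sketch rather than the project's exact code. The MONGODB_* setting names, the default database and collection names, the job_id field and the optional client parameter are all assumptions made for illustration. The process_item method is left out to keep the focus on initialisation.

```python
class MongoDBPipeline:
    # Hypothetical name of the field that uniquely identifies a job posting.
    id_field = "job_id"

    def __init__(self, server="localhost", port=27017,
                 db_name="stackjobs", collection_name="jobs", client=None):
        if client is None:
            # Imported here so that a ready-made client can be passed in tests.
            import pymongo
            client = pymongo.MongoClient(server, port)
        self.collection = client[db_name][collection_name]
        # Seed ids_seen with the identifiers already stored, fetching only
        # the identifier field of each document.
        self.ids_seen = set()
        for doc in self.collection.find({}, {self.id_field: 1}):
            if self.id_field in doc:
                self.ids_seen.add(doc[self.id_field])

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, when present, to build the pipeline.
        # The MONGODB_* setting names are assumptions of this sketch.
        settings = crawler.settings
        return cls(
            server=settings.get("MONGODB_SERVER", "localhost"),
            port=settings.getint("MONGODB_PORT", 27017),
            db_name=settings.get("MONGODB_DB", "stackjobs"),
            collection_name=settings.get("MONGODB_COLLECTION", "jobs"),
        )
```

Loading the identifiers once at start-up means that items saved by an earlier run are also recognised as duplicates, at the cost of one query over the whole collection.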
Configuring the MongoDB connection
We need to do a little more work in order to be able to connect to the database and to register the newly added MongoDBPipeline, so that it is actually called for the new items. In settings.py, we need to either add or modify the following lines, so they look like this:
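A sketch of what these settings could look like. The module path stackjobs.pipelines, the priority value and the MONGODB_* names and values are assumptions made for illustration, so adjust them to match your own project.

```python
# Register the pipeline and give it a priority.
ITEM_PIPELINES = {"stackjobs.pipelines.MongoDBPipeline": 300}

# Connection details read by the pipeline (hypothetical names and values).
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackjobs"
MONGODB_COLLECTION = "jobs"
```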
With the first line, we register our pipeline class with Scrapy and assign it a priority. With this mechanism, we can specify the order in which the pipelines are executed if we have more than one pipeline (lower values run first).
The four lines after that specify how to connect to the database: the server name and port, the database name and the collection name (the MongoDB counterpart of a table).
With these changes we can now run our scraper, and all scraped items will be stored in our mongo database.
Exporting data from mongo
What to do next? Maybe we want to export the data to a file, so we can analyse and clean it up further. For this, I used the mongoexport command with the following parameters to create a CSV file:
It is also possible to export the data in JSON format, if needed. For that, you can use the following command:
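Both exports can also be driven from Python. The sketch below only assembles a mongoexport call and hands it to subprocess. The database, collection, field and file names are assumptions made for illustration, and the --type option assumes mongoexport 3.0 or later.

```python
import subprocess


def build_mongoexport_command(db, collection, out_path, fmt="json", fields=None):
    """Build the argument list for a mongoexport call."""
    cmd = ["mongoexport", "--db", db, "--collection", collection,
           "--out", out_path]
    if fmt == "csv":
        # CSV output needs an explicit list of the fields to export.
        if not fields:
            raise ValueError("CSV export needs a list of fields")
        cmd += ["--type=csv", "--fields", ",".join(fields)]
    elif fmt != "json":
        raise ValueError("Unsupported format: {0}".format(fmt))
    return cmd


if __name__ == "__main__":
    # Hypothetical database, collection and field names.
    subprocess.run(
        build_mongoexport_command("stackjobs", "jobs", "jobs.csv",
                                  fmt="csv", fields=["title", "company"]),
        check=True,
    )
    subprocess.run(
        build_mongoexport_command("stackjobs", "jobs", "jobs.json"),
        check=True,
    )
```

JSON is the default output format of mongoexport, which is why the JSON call needs no extra options.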
Scheduling our job
The last thing we need to do is to schedule our scraper to run on my Mac. This can be achieved using crontab. I will schedule my scraper to run every 6 hours, as I do not think that running it more often would make sense.
First, you will need to open your crontab file in the nano editor, for example with env EDITOR=nano crontab -e.
Now enter the following to run our scraper every 6 hours. Use Ctrl-O to save and Ctrl-X to exit the editor.
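To make the crontab format concrete, here is a small sketch that builds such an entry. The project path and the spider name are hypothetical placeholders, so replace them with your own.

```python
def cron_line(every_hours, command):
    """Return a crontab entry that runs `command` every `every_hours` hours."""
    if not 1 <= every_hours <= 23:
        raise ValueError("every_hours must be between 1 and 23")
    # Fields: minute, hour, day of month, month, day of week, command.
    # "0 */6" means minute 0 of every 6th hour: 00:00, 06:00, 12:00, 18:00.
    return "0 */{0} * * * {1}".format(every_hours, command)


# Hypothetical project path and spider name.
print(cron_line(6, "cd /path/to/stackjobs && scrapy crawl jobs"))
```

Cron runs commands with a very limited environment, so it may be necessary to use the full path to the scrapy executable in the command.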
To list the existing crontab jobs, we can use the crontab -l command.
In this part, we installed MongoDB on our machine and added logic and configuration to the Scrapy project, so that all the items found are saved to the database. We also added a way to export the data from the mongo database into CSV or JSON format. Finally, we added the scraper to crontab and scheduled it to run every 6 hours.
In the next part, I will spend a little time on the exported data and use pandas to transform it into a more easily consumable format.