Setting up Scrapy
- Python: 3.6
- Create a virtual environment
- Activate the virtual environment
- install Scrapy using command
pip install scrapy
This is all we need to up and running with scrapy.
Exploring Scrapy
In the terminal type scrapy
in order to get the available commands associated with scrapy
Here is the list of available commands
- bench - Run quick benchmark test
- fetch - Fetch a URL using the Scrapy downloader
- genspider - Generate new spider using pre-defined templates
- runspider - Run a self-contained spider (without creating a project)
- settings - Get settings values
- shell - Interactive scraping console
- startproject - Create new project
- version - Print Scrapy version
- view - Open URL in browser, as seen by Scrapy
Creating a Project
As per the above section, we will use startproject
command to create a new project.
scrapy startproject worldometer
startproject
- commandworldometer
- Name of the project
Creating a Spider
Naviagate to the newly create project directory. Use the following command to create a spider:
scrapy genspider countries www.worldometers.info/world-population/population-by-country
genspider
- command to generate spidercountries
- Name of the spider. Must be uniquewww.worldometers.info/world-population/population-by-country
- URL for the spider
This command will generate a folder named spider
Introduction to scrapy shell
- Install ipython using command
pip install ipython
- After installation use command
scrapy shell
to start scrapy shell in the terminal/command prompt.
Exploring various shell Commands
shelp()
- It gives available scrapy objects which can be used in a shellfetch("https://www.worldometers.info/world-population/population-by-country/")
- Crawl on the given url. Other option is to use request object.
req = scrapy.Request(
url="https://www.worldometers.info/world-population/population-by-country/"
)
fetch(r)
-
response.body
- This command will display the response from the website -
view(response)
- This will open view in a web browser.
Important thing to note here is that the spider see web content without javascript
Introduction to Object Identification and Data Capturing
As mentioned earlier, spider does not see javascript on the page. So in that case, we can disable the javascript while inspecting the web page. Steps to disable javascript are provided on this post.
- Xpath expressions for the elements can be built using this cheat sheet.
Following are the steps to obtain the content of an element.
title = response.xpath("//h2/text()")
title.get()
Above 2 commands will give the content of the xpath //h2/text()
title = response.css("h2::text")
title.get()
Above 2 commands will give the content of the css selector h2::text
response.xpath("//td/a/text()").getall()
Above command will give the content of all the elements as a list
Changing the script and running the spider
In the spider created earlier using scrapy command, we wiil modify and run that file.
Make changes in the countries.py file.
class CountriesSpider(scrapy.Spider):
name = 'countries'
allowed_domains = ['www.worldometers.info/world-population/population-by-country']
start_urls = ['https://www.worldometers.info/world-population/population-by-country/']
def parse(self, response):
countries = response.xpath("//td/a/text()").getall()
yield {
'countries': countries
}
The methods that we tried in the shell are used in the parse method and the result is returned using yield having key as ‘countries.
Now from the shell, run command scrapy crawl countries
. Make sure you are in the same location as scrapy.cfg
file.
The script will run and provide the output in the shell.
Share this post
Twitter
Reddit
LinkedIn
Pinterest
Email