Scrapy
29 Sep 2017
Scrapy Learned
Architecture
- engine
- spider
- scheduler
- downloader
- item pipelines
Terminology
- crawl: follow links from page to page
- parser: parses a page's HTML into structured data
- spider: defines how crawling and parsing work for a site
- scraper: extracts data from within a web page
A crawler gets web pages: given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore), it downloads whatever is linked to from the starting point(s).
A scraper takes pages that have been downloaded (or, more generally, data that is formatted for display) and attempts to extract data from them, so that it can, for example, be stored in a database and manipulated as desired.
parse(self, response)
returns (or yields) any of:
- a dict of extracted data
- an Item
- a Request whose callback is a parse method
Selectors
grammar: response.css(selector)
or response.xpath(selector)
- text content:
a::text
- direct child:
div.pre_post > a
- class:
div.pre_post
- id:
div#id
functions:
extract_first()
Selection results are always a list, even when only one element matches. extract_first() returns the first element of that list, or None if nothing matched.
Scrapy Commands
- scrapyd-deploy
- scrapy.cfg
Global commands
- startproject
- genspider
- settings
- runspider
- shell
- fetch
- view
- version
Project commands
- crawl
- check
- list
- edit
- parse
- bench
project structure
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
Scrapy Shell
TODOS:
response.follow
- yield result or page
Spider
Selector
The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too.
.xpath('//title/text()')
.css('title::text')
.re('a*')
In [20]: type(response.selector)
Out[20]: scrapy.selector.unified.Selector
In [21]: type(response.selector.xpath('//title/text()'))
Out[21]: scrapy.selector.unified.SelectorList
In [22]: type(response.selector.xpath('//title/text()')[0])
Out[22]: scrapy.selector.unified.Selector
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
response.css('img').xpath('@src').extract()
response.css('img::attr(src)').extract()
extract()
extract_first()
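Relative XPath
- .// is relative: it searches only under the current selector
- // is absolute: it searches the whole document, even when called on a nested selector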
Item
ItemLoader
Scrapy Shell
Item pipeline
Feed Exports
Serialization formats (item exporters)
- JSON
- JSON lines
- CSV
- XML
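JSON and JSON lines hold the same data in different layouts. The sketch below uses only the standard json module, not Scrapy's own exporters, and the two items are made up.

```python
import json

# Hypothetical scraped items.
items = [
    {"title": "Post 1", "url": "http://example.com/1"},
    {"title": "Post 2", "url": "http://example.com/2"},
]

# JSON: the whole feed is one array, so it must be read in one piece.
as_json = json.dumps(items)

# JSON lines: one object per line, so the feed can be read line by line.
as_jsonlines = "\n".join(json.dumps(item) for item in items)

from_json = json.loads(as_json)
from_jsonlines = [json.loads(line) for line in as_jsonlines.splitlines()]
```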
Requests and Responses
Link Extractors
Downloader Middleware
In settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
hook
Engine -> mw1 -> mw2 -> mw3 -> downloader
engine
↓
mw1.process_request()
↓
mw2.process_request()
↓
mw3.process_request()
↓
↓
downloader
↓
↓
mw3.process_response()
↓
mw2.process_response()
↓
mw1.process_response()
↓
engine
process_request(request, spider)
process_response(request, response, spider)
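The call order in the diagram can be imitated in plain Python. This is only a simulation of the ordering, not Scrapy's real implementation; LoggingMiddleware and simulate_download are hypothetical names, and the request and response are plain strings.

```python
class LoggingMiddleware:
    """Hypothetical middleware that records when each hook runs."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request, spider):
        self.log.append(f"{self.name}.process_request")
        return None  # None means: pass the request on to the next step

    def process_response(self, request, response, spider):
        self.log.append(f"{self.name}.process_response")
        return response


def simulate_download(middlewares, request, spider=None):
    # Requests travel from the engine towards the downloader...
    for mw in middlewares:
        mw.process_request(request, spider)
    response = f"response for {request}"  # stand-in for the downloader
    # ...and responses travel back in the opposite order.
    for mw in reversed(middlewares):
        response = mw.process_response(request, response, spider)
    return response


log = []
chain = [LoggingMiddleware(name, log) for name in ("mw1", "mw2", "mw3")]
result = simulate_download(chain, "http://example.com/")
```

In a real project the position in the chain comes from the number given in DOWNLOADER_MIDDLEWARES: lower numbers sit closer to the engine.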
## Sitemap
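A sitemap index like the one fetched in this section lists gzipped child sitemaps, and each child lists page URLs. A minimal sketch of reading both with the standard library, assuming the XML has already been downloaded; sitemap_locs is a hypothetical helper, and the inline data is a shortened copy of the output in this section.

```python
import gzip
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_bytes):
    """Return every <loc> value from a sitemap index or a urlset."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS)]


index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
</sitemapindex>"""

urlset_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.jollychic.com/t/long-white-dresses-t5.html</loc>
<priority>0.5</priority>
</url>
</urlset>"""

# Child sitemaps arrive gzipped; compress here to stand in for the download.
child_gz = gzip.compress(urlset_xml)

child_sitemaps = sitemap_locs(index_xml)
page_urls = sitemap_locs(gzip.decompress(child_gz))
```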
```sh
$ curl http://www.jollychic.com/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="gss.xsl"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_images_1.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_images_9.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_product_1.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_product_2.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
</sitemapindex>
$ curl http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz | gzip -d | less
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0">
<url>
<priority>0.5</priority>
</url>
<url>
<loc>http://www.jollychic.com/t/long-white-dresses-t5.html</loc>
<priority>0.5</priority>
</url>
</urlset>
for index, link in enumerate(links):