Scrapy Learned

Architecture

engine
spider
scheduler
downloader
item pipelines

Terminology

crawl: follow the links from page to page
parser: turns a page's HTML into structured data
spider: defines how pages are crawled and how they are parsed
scraper: extracts data from within a web page

A crawler gets web pages -- i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s).

A scraper takes pages that have been downloaded (or, more generally, any data formatted for display) and attempts to extract data from them so that it can, for example, be stored in a database and manipulated as desired.
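
The split between the two roles can be sketched with the standard library alone. This is only an illustration under assumptions: the page URL, the HTML, and the class names `LinkCollector` and `TitleScraper` are made up here, and nothing is downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page; a real crawler would download this over HTTP.
PAGE_URL = "http://example.com/index.html"
PAGE_HTML = """
<html><head><title>Example shop</title></head>
<body>
  <a href="/dresses.html">Dresses</a>
  <a href="http://example.com/shoes.html">Shoes</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawler side: collect the URLs a page links to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


class TitleScraper(HTMLParser):
    """Scraper side: pull one structured field (the title) out of the page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


collector = LinkCollector(PAGE_URL)
collector.feed(PAGE_HTML)
scraper = TitleScraper()
scraper.feed(PAGE_HTML)

links_to_visit = collector.links  # what a crawler would fetch next
record = {"url": PAGE_URL, "title": scraper.title}  # what a scraper would store
```

In Scrapy both jobs usually live in one spider: the links become new requests and the record becomes an item.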

`parse(self, response)`

parse() can return (or yield) any of:

a dict of extracted data
an Item
a Request whose callback is parse() or another method

Selectors

syntax: response.css(selector) or response.xpath(selector)

text node: a::text
direct child: div.pre_post > a
class: div.pre_post
id: div#id

functions:

extract_first(): selection results are always a list, even when only one element matches. extract_first() returns the first result as a string, or None if nothing matched.

Scrapy Commands

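scrapyd-deploy: deploys the project to a Scrapyd server
scrapy.cfg: project configuration file, including the deploy targets that scrapyd-deploy reads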

Global commands

startproject
genspider
settings
runspider
shell
fetch
view
version

Project commands

crawl
check
list
edit
parse
bench

Project structure

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

Scrapy Shell

TODOS:

response.follow
yield result or page

Spider

Selector

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too.

.xpath('//title/text()')
.css('title::text')
.re('a*')

In [20]: type(response.selector)
Out[20]: scrapy.selector.unified.Selector

In [21]: type(response.selector.xpath('//title/text()'))
Out[21]: scrapy.selector.unified.SelectorList

In [22]: type(response.selector.xpath('//title/text()')[0])
Out[22]: scrapy.selector.unified.Selector

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

response.css('img').xpath('@src').extract()
response.css('img::attr(src)').extract()

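extract(): returns a list with the string value of every match
extract_first(): returns the string value of the first match, or None if nothing matched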

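Relative XPath

.// relative: searches only inside the current selector
// absolute: searches the whole document, even when called on a nested selector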

Item

ItemLoader

Scrapy Shell

Item pipeline

Feed Exports

Serialization formats (Item exporters)

JSON
JSON lines
CSV
XML
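
To see how the first three formats differ, the same items can be serialized with the standard library. This is only a sketch of the output shapes, not Scrapy's own exporters, and the item fields are invented.

```python
import csv
import io
import json

# Hypothetical scraped items.
items = [
    {"title": "Long white dress", "price": "19.99"},
    {"title": "Red scarf", "price": "4.50"},
]

# JSON: one array holding every item.
as_json = json.dumps(items)

# JSON lines: one JSON object per line, so the output can be read item by item.
as_jsonlines = "\n".join(json.dumps(item) for item in items)

# CSV: a header row followed by one row per item.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "price"], lineterminator="\n")
writer.writeheader()
writer.writerows(items)
as_csv = buffer.getvalue()
```

With Scrapy itself the format is normally chosen when running the spider, for example `scrapy crawl myspider -o items.json`, where `myspider` is a placeholder name.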

Requests and Responses

Link Extractors

Downloader Middleware

In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}

Hook order

engine -> mw1 -> mw2 -> mw3 -> downloader

       engine
          ↓
mw1.process_request()
          ↓
mw2.process_request()
          ↓
mw3.process_request()
          ↓
          ↓
      downloader
          ↓
          ↓
mw3.process_response()
          ↓
mw2.process_response()
          ↓
mw1.process_response()
          ↓
       engine
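
The order in the diagram can be reproduced with a small simulation. This is a simplification, not Scrapy's engine: the middleware names, the order numbers and `fake_download` are placeholders, and only the call sequence is recorded.

```python
# Same shape as DOWNLOADER_MIDDLEWARES: name -> order number (hypothetical values).
MIDDLEWARES = {"mw2": 200, "mw3": 300, "mw1": 100}

calls = []


def fake_download(request):
    """Stand-in for the downloader; no network access happens here."""
    calls.append("downloader")
    return {"url": request["url"], "status": 200}


def run_chain(request):
    # Lower numbers sit closer to the engine, so they see the request first...
    ordered = sorted(MIDDLEWARES, key=MIDDLEWARES.get)
    for name in ordered:
        calls.append(f"{name}.process_request")
    response = fake_download(request)
    # ...and see the response last, because responses travel back in reverse order.
    for name in reversed(ordered):
        calls.append(f"{name}.process_response")
    return response


response = run_chain({"url": "http://example.com/"})
```

After the call, `calls` lists the hooks in the same top-to-bottom order as the diagram.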

process_request(request, spider)

process_response(request, response, spider)
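
A minimal sketch of a middleware implementing both hooks. A downloader middleware is a plain class, so the example runs without Scrapy; the `X-Example` header, the `seen_statuses` attribute and the stand-in objects are assumptions made up for this illustration.

```python
from types import SimpleNamespace


class CustomDownloaderMiddleware:
    """Sketch of a downloader middleware; it needs no Scrapy base class."""

    def process_request(self, request, spider):
        # Add a header only if the request does not already carry one.
        request.headers.setdefault("X-Example", "demo")  # hypothetical header
        # Returning None tells Scrapy to keep processing this request.
        return None

    def process_response(self, request, response, spider):
        # Record the status, then pass the response on unchanged.
        spider.seen_statuses.append(response.status)  # hypothetical attribute
        return response


# Offline check with stand-in objects instead of real Scrapy classes.
spider = SimpleNamespace(name="demo", seen_statuses=[])
request = SimpleNamespace(url="http://example.com/", headers={})
response = SimpleNamespace(status=200)

mw = CustomDownloaderMiddleware()
request_result = mw.process_request(request, spider)
response_result = mw.process_response(request, response, spider)
```

The return values matter: `None` from `process_request()` lets the request continue down the chain, and `process_response()` has to hand back a response (or a new request) for the next middleware.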

Sitemap

curl http://www.jollychic.com/sitemap.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="gss.xsl"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_images_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_images_9.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_product_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_product_2.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
</sitemapindex>
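
A sitemap index like the one above can be read with the standard library. This is a sketch, assuming the XML has already been downloaded; the function name `sitemap_locations` is made up, and the sample is a shortened copy of the index.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <sitemapindex> root element.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened copy of the index shown above.
SITEMAP_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_product_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
</sitemapindex>"""


def sitemap_locations(xml_bytes):
    """Return (loc, lastmod) pairs for every <sitemap> entry in an index."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", namespaces=NS)
        lastmod = sitemap.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, lastmod))
    return entries


entries = sitemap_locations(SITEMAP_INDEX)
```

Each `loc` points to a further gzipped sitemap that has to be fetched and parsed in turn.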

curl http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz | gzip -d | less

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0">
    <url>
        <priority>0.5</priority>
    </url>
    <url>
        <loc>http://www.jollychic.com/t/long-white-dresses-t5.html</loc>
        <priority>0.5</priority>
    </url>
</urlset>
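
The `curl ... | gzip -d` step can also be done in Python. A sketch with the standard library, assuming the `.xml.gz` bytes are already in memory; here they are produced by compressing a copy of the XML above, and the function name is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_gzipped_sitemap(gz_bytes):
    """Decompress a .xml.gz sitemap and return its <loc> URLs."""
    xml_bytes = gzip.decompress(gz_bytes)
    root = ET.fromstring(xml_bytes)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        if loc:  # skip entries that carry no <loc>
            urls.append(loc.strip())
    return urls


# Offline stand-in for the downloaded file: compress a copy of the XML above.
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <priority>0.5</priority>
    </url>
    <url>
        <loc>http://www.jollychic.com/t/long-white-dresses-t5.html</loc>
        <priority>0.5</priority>
    </url>
</urlset>"""
urls = urls_from_gzipped_sitemap(gzip.compress(sample_xml))
```

The resulting list of URLs is what a spider would then schedule as requests.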

for index, link in enumerate(links):