Scrapy

Scrapy Learned

Architecture

Terminology

A crawler fetches web pages: given a starting address (or a set of starting addresses) and some conditions (e.g., how many links deep to go, which file types to ignore), it downloads whatever is linked to from the starting point(s).

A scraper takes pages that have already been downloaded (or, more generally, any data formatted for display) and attempts to extract data from them, so that it can, for example, be stored in a database and manipulated as desired.
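
To make the distinction concrete, here is a minimal sketch using only the Python standard library (not Scrapy itself). The class and function names (`LinkCollector`, `TitleScraper`, `find_links`, `scrape_title`) are made up for illustration. The first half does the crawler's job of discovering what to download next; the second half does the scraper's job of pulling one value out of a page that is already in hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Crawler side: collect the URLs a page links to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


class TitleScraper(HTMLParser):
    """Scraper side: extract one piece of data (the <title> text)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_links(html, base_url):
    """Return the absolute URLs linked from an already-downloaded page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def scrape_title(html):
    """Return the page title from an already-downloaded page."""
    scraper = TitleScraper()
    scraper.feed(html)
    return scraper.title
```

In Scrapy both roles usually live in the same spider callback: it yields extracted data and new requests side by side.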

parse(self, response)

return

Selectors

Syntax: response.css(selector) or response.xpath(selector)

functions:

Scrapy Commands

Global commands

Project commands

project structure

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

Scrapy Shell

TODOS:

Spider

Selector

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too.


In [20]: type(response.selector)
Out[20]: scrapy.selector.unified.Selector

In [21]: type(response.selector.xpath('//title/text()'))
Out[21]: scrapy.selector.unified.SelectorList

In [22]: type(response.selector.xpath('//title/text()')[0])
Out[22]: scrapy.selector.unified.Selector

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
>>> response.css('img').xpath('@src').extract()
>>> response.css('img::attr(src)').extract()

Relative Xpath

Item

ItemLoader

Scrapy Shell

Item pipeline

Feed Exports

Serialization formats Item exporters

Requests and Responses

Downloader Middleware

In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}

hook

Engine -> mw1 -> mw2 -> mw3 -> downloader

       engine
          ↓
mw1.process_request()
          ↓
mw2.process_request()
          ↓
mw3.process_request()
          ↓
          ↓
      downloader
          ↓
          ↓
mw3.process_response()
          ↓
mw2.process_response()
          ↓
mw1.process_response()
          ↓
       engine

process_request(request, spider)

process_response(request, response, spider)
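
The ordering in the diagram can be reproduced with a plain-Python simulation. This is only a sketch: `TracingMiddleware` and `run_chain` are hypothetical names, `run_chain` is a simplified stand-in for the engine/downloader round trip, and it ignores the cases where a hook short-circuits the chain by returning a Response or Request. The order numbers play the role of the values in `DOWNLOADER_MIDDLEWARES`: `process_request` runs in increasing order, `process_response` in decreasing order.

```python
class TracingMiddleware:
    """Hypothetical middleware that records when each hook runs."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request, spider):
        self.log.append(f"{self.name}.process_request")
        return None  # None: keep passing the request down the chain

    def process_response(self, request, response, spider):
        self.log.append(f"{self.name}.process_response")
        return response  # hand the response on toward the engine


def run_chain(middlewares, request, download, spider=None):
    """Simplified round trip: engine -> middlewares -> downloader -> back.

    `middlewares` maps middleware objects to their order numbers.
    """
    ordered = sorted(middlewares, key=middlewares.get)
    for mw in ordered:                      # increasing order on the way in
        mw.process_request(request, spider)
    response = download(request)
    for mw in reversed(ordered):            # decreasing order on the way out
        response = mw.process_response(request, response, spider)
    return response


log = []
mw1, mw2, mw3 = (TracingMiddleware(n, log) for n in ("mw1", "mw2", "mw3"))
response = run_chain(
    {mw1: 100, mw2: 200, mw3: 300},
    "GET /",
    lambda req: f"response to {req}",  # fake downloader
)
print("\n".join(log))
```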

## Sitemap
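
The listing in this section shows a sitemap index whose `<loc>` entries point at gzipped child sitemaps, which in turn list page URLs. A minimal standard-library sketch of reading both levels follows; the helper names `sitemap_locs` and `gzipped_sitemap_locs` are made up here, and fetching the bytes over HTTP is left out.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by both <sitemapindex> and <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_bytes):
    """Return every <loc> value from a sitemap index or a urlset."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def gzipped_sitemap_locs(gz_bytes):
    """Same as sitemap_locs, for a .xml.gz child sitemap."""
    return sitemap_locs(gzip.decompress(gz_bytes))
```

Entries without a `<loc>` child are simply skipped.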

```sh
curl http://www.jollychic.com/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
	<?xml-stylesheet type="text/xsl" href="gss.xsl"?>
	<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<sitemap>
		<loc>http://www.jollychic.com/sitemap/sitemap_images_1.xml.gz</loc>
		<lastmod>2017-05-08</lastmod>
	</sitemap>
	<sitemap>
		<loc>http://www.jollychic.com/sitemap/sitemap_images_9.xml.gz</loc>
		<lastmod>2017-05-08</lastmod>
	</sitemap>
	<sitemap>
		<loc>http://www.jollychic.com/sitemap/sitemap_product_1.xml.gz</loc>
		<lastmod>2017-05-08</lastmod>
	</sitemap>
	<sitemap>
		<loc>http://www.jollychic.com/sitemap/sitemap_product_2.xml.gz</loc>
		<lastmod>2017-05-08</lastmod>
	</sitemap>
	<sitemap>
		<loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
		<lastmod>2017-05-08</lastmod>
	</sitemap>
</sitemapindex>                  
curl http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz |gzip -d  |less
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0">
                <url>
                        <priority>0.5</priority>
                </url>
                <url>
                        <loc>http://www.jollychic.com/t/long-white-dresses-t5.html</loc>
                        <priority>0.5</priority>
                </url>
</urlset>
 for index, link in enumerate(links):