Scrapy_cffi: Async-first, modular web scraping utilities

funnyStrange 6 hours ago

scrapy_cffi keeps a Scrapy-style crawler architecture and supports async execution for both HTTP and WebSocket requests. CLI support is minimal; the Python API is the recommended way to run spiders.

The real highlights are its utility extensions:

• JSON Extractor – handles standard, embedded, and malformed JSON

• Media Downloader – segmented downloads for videos and large files

• Async Database Managers – Redis, MySQL, MongoDB with automatic retry and reconnection

• Multi-process RPC – register functions, classes, and objects for rapid prototyping, with no message queue (MQ) or Redis required
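To make the "embedded JSON" case concrete, here is a minimal sketch of the general technique using only the Python standard library. It is not scrapy_cffi's API: the function name `extract_json_objects` is hypothetical, and this version simply skips malformed fragments rather than repairing them.

```python
import json


def extract_json_objects(text):
    """Yield every JSON object or array embedded in arbitrary text.

    Hypothetical helper for illustration; not part of scrapy_cffi.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Find the next character that could open an object or an array.
        candidates = [i for i in (text.find("{", pos), text.find("[", pos)) if i != -1]
        if not candidates:
            return
        start = min(candidates)
        try:
            # raw_decode parses one JSON value starting at `start` and
            # returns it together with the index where parsing stopped.
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON from here; move on one character and retry.
            pos = start + 1
            continue
        yield value
        pos = end


page = '<script>var data = {"id": 7, "tags": ["a", "b"]};</script> junk [1, 2'
print(list(extract_json_objects(page)))
```

The truncated `[1, 2` at the end is ignored here; a real extractor that claims to handle malformed JSON would need a repair step on top of this.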

Each utility can be used on its own or combined with the others into a full async crawler, so you can adopt only the pieces you need and extend from there.
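As a rough idea of what "automatic retry" around an async database call can look like, here is a generic sketch built on plain asyncio. It is an assumption about the pattern, not the library's implementation: `with_retry`, its parameters, and the `flaky_query` stand-in are all hypothetical names.

```python
import asyncio


async def with_retry(operation, attempts=3, base_delay=0.1, retry_on=(ConnectionError,)):
    """Await operation(), retrying with exponential backoff on failure.

    Hypothetical helper for illustration; not part of scrapy_cffi.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # Back off: base_delay, then 2x, 4x, ... before the next try.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def main():
    calls = {"n": 0}

    async def flaky_query():
        # Stand-in for a database call that fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("connection lost")
        return "row"

    result = await with_retry(flaky_query, attempts=3, base_delay=0.01)
    print(result, "after", calls["n"], "calls")


if __name__ == "__main__":
    asyncio.run(main())
```

A real manager would also re-establish the connection between attempts; that step is left out because it depends on the specific Redis, MySQL, or MongoDB client in use.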