a simple & tiny scrapy clustering solution, considered a drop-in replacement for scrapyd

Overview

scrapyr

a very simple scrapy orchestrator engine that could be distributed among multiple machines to build a scrapy cluster, under-the-hood it uses redis as a task broker, it may be changed in the future to support pluggable brokers, but for now it does the job.

Features

  • uses simple configuration language for humans called hcl.
  • multiple types of queues/workers (lifo, fifo, weight).
  • you can define multiple workers with different type of queues.
  • abbility to override the content of the settings.py of the scrapy project from the same configuration file.
  • a status endpoint helps you to understand what is going on.
  • a enqueue endpoint lets you push a job into the specified queue, as well the abbility to execute the job instantly and returns the extracted items.

API Examples

  • Getting the status of the cluster
curl --request GET \
  --url http://localhost:1993/status \
  --header 'content-type: application/json'
  • Push a task into the queue utilizing the worker worker1 which is pre-defined in the scrapyr.hcl
# worker -> the worker name (predefined in scrapyr.hcl)
# spider -> the scrapy spider to be executed
# max_execution_time -> the max duration the scrapy process should take
# args -> a key value strings will be translated to `-a key=value ...` for each key-value pair.
# weight -> the weight of the task itself (in case of weight based workers defined in the scrapyr.hcl)
curl --request POST \
  --url http://localhost:1993/enqueue \
  --header 'content-type: application/json' \
  --data '{
	"worker": "worker1",
	"spider": "spider_name",
	"max_execution_time": "20s",
	"args": {
            "scrapy_arg_name": "scrapy_arg_value"
      },
	"weight": 10
}'

Configurations

here is an example of the scraply.hcl

# the webserver listening address
listen_addr = ":1993"

# redis connection string
# it uses url-style connection string
# example: redis://username:[email protected]:port/database_number
redis_dsn = "redis://127.0.0.1:6378/1"

scrapy {
    project_dir = "${HOME}/playground/tstscrapy"

    python_bin = "/usr/bin/python3"

    items_dir = "${PWD}/data"
}

worker worker1 {
    // which method you want the worker to use
    // lifo: last in, first out
    // fifo: first in, first out
    // weight: max weight, first out
    use = "weight"

    // max processes to be executed in the same time for this workers
    max_procs = 5
}


# sometimes you may need to control the `ProjectNAme/ProjectName/settings.py` file from here
# so we did this special key which mounts the contents of it into `settings.py` file.
settings_py = <<PYTHON
# Scrapy settings for tstscrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tstscrapy'

SPIDER_MODULES = ['tstscrapy.spiders']
NEWSPIDER_MODULE = 'tstscrapy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tstscrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'tstscrapy.middlewares.TstscrapySpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'tstscrapy.middlewares.TstscrapyDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'tstscrapy.pipelines.TstscrapyPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

DOWNLOAD_TIMEOUT = 10
PYTHON

Install

you can download the latest binary build from the releases page or by using docker directly.

Contributing

  • Fork the repo
  • Create a feature branch
  • Push your changes
  • Create a pull request

License

Apache License v2.0

Author

You might also like...
Drop-in replacement for Go's flag package, implementing POSIX/GNU-style --flags.

Description pflag is a drop-in replacement for Go's flag package, implementing POSIX/GNU-style --flags. pflag is compatible with the GNU extensions to

Highly concurrent drop-in replacement for bufio.Writer

concurrent-writer Highly concurrent drop-in replacement for bufio.Writer. concurrent.Writer implements highly concurrent buffering for an io.Writer ob

Drop-in replacement for the standard library errors package and github.com/pkg/errors

Emperror: Errors Drop-in replacement for the standard library errors package and github.com/pkg/errors. This is a single, lightweight library merging

A drop-in replacement for Go errors, with some added sugar! Unwrap user-friendly messages, HTTP status code, easy wrapping with multiple error types.
A drop-in replacement for Go errors, with some added sugar! Unwrap user-friendly messages, HTTP status code, easy wrapping with multiple error types.

Errors Errors package is a drop-in replacement of the built-in Go errors package with no external dependencies. It lets you create errors of 11 differ

Zero downtime restarts for go servers (Drop in replacement for http.ListenAndServe)

endless Zero downtime restarts for golang HTTP and HTTPS servers. (for golang 1.3+) Inspiration & Credits Well... it's what you want right - no need t

A high-performance 100% compatible drop-in replacement of
A high-performance 100% compatible drop-in replacement of "encoding/json"

A high-performance 100% compatible drop-in replacement of "encoding/json" You can also use thrift like JSON using thrift-iterator Benchmark Source cod

🔥 ~6x faster, stricter, configurable, extensible, and beautiful drop-in replacement for golint
🔥 ~6x faster, stricter, configurable, extensible, and beautiful drop-in replacement for golint

revive Fast, configurable, extensible, flexible, and beautiful linter for Go. Drop-in replacement of golint. Revive provides a framework for developme

A high-performance 100% compatible drop-in replacement of
A high-performance 100% compatible drop-in replacement of "encoding/json"

A high-performance 100% compatible drop-in replacement of "encoding/json" You can also use thrift like JSON using thrift-iterator Benchmark Source cod

drpc is a lightweight, drop-in replacement for gRPC
drpc is a lightweight, drop-in replacement for gRPC

drpc is a lightweight, drop-in replacement for gRPC

Drop-in replacement for Go net/http when running in AWS Lambda & API Gateway
Drop-in replacement for Go net/http when running in AWS Lambda & API Gateway

Package gateway provides a drop-in replacement for net/http's ListenAndServe for use in AWS Lambda & API Gateway, simply swap it out for gateway.Liste

Drop-in replacement for Go's stringer tool with support for bitflag sets.

stringer This program is a drop-in replacement for Go's commonly used stringer tool. In addition to generating String() string implementations for ind

twtr (pronounced "tweeter") is a drop-in replacement of the original twtxt client

twtr twtr (pronounced "tweeter") is a drop-in replacement of the original twtxt client, with some extensions and bonus features, so you can make the s

A drop-in replacement to any Writer type, which also calculates a hash using the provided hash type.

writehasher A drop-in replacement to any Writer type, which also calculates a hash using the provided hash type. Example package main import ( "fmt"

Goslmailer (GoSlurmMailer) is a drop-in replacement MailProg for slurm

GoSlurmMailer - drop in replacement for default slurm MailProg. Delivers slurm job messages to various destinations.

golang bigcache with clustering as a library.

clusteredBigCache This is a library based on bigcache with some modifications to support clustering and individual item expiration Bigcache is an exce

Vitess is a database clustering system for horizontal scaling of MySQL.

Vitess Vitess is a database clustering system for horizontal scaling of MySQL through generalized sharding. By encapsulating shard-routing logic, Vite

Scalable game server framework with clustering support and client libraries for iOS, Android, Unity and others through the C SDK.

pitaya Pitaya is an simple, fast and lightweight game server framework with clustering support and client libraries for iOS, Android, Unity and others

k-modes and k-prototypes clustering algorithms implementation in Go

go-cluster GO implementation of clustering algorithms: k-modes and k-prototypes. K-modes algorithm is very similar to well-known clustering algorithm

Vitess is a database clustering system for horizontal scaling of MySQL.

Vitess Vitess is a database clustering system for horizontal scaling of MySQL through generalized sharding. By encapsulating shard-routing logic, Vite

Comments
  • Bump github.com/labstack/echo/v4 from 4.1.17 to 4.9.0

    Bump github.com/labstack/echo/v4 from 4.1.17 to 4.9.0

    Bumps github.com/labstack/echo/v4 from 4.1.17 to 4.9.0.

    Release notes

    Sourced from github.com/labstack/echo/v4's releases.

    v4.9.0

    Security

    • Fix open redirect vulnerability in handlers serving static directories (e.Static, e.StaticFs, echo.StaticDirectoryHandler) #2260

    Enhancements

    • Allow configuring ErrorHandler in CSRF middleware #2257
    • Replace HTTP method constants in tests with stdlib constants #2247

    v4.8.0

    Most notable things

    You can now add any arbitrary HTTP method type as a route #2237

    e.Add("COPY", "/*", func(c echo.Context) error 
      return c.String(http.StatusOK, "OK COPY")
    })
    

    You can add custom 404 handler for specific paths #2217

    e.RouteNotFound("/*", func(c echo.Context) error { return c.NoContent(http.StatusNotFound) })
    

    g := e.Group("/images") g.RouteNotFound("/*", func(c echo.Context) error { return c.NoContent(http.StatusNotFound) })

    Enhancements

    • Add new value binding methods (UnixTimeMilli,TextUnmarshaler,JSONUnmarshaler) to Valuebinder #2127
    • Refactor: body_limit middleware unit test #2145
    • Refactor: Timeout mw: rework how test waits for timeout. #2187
    • BasicAuth middleware returns 500 InternalServerError on invalid base64 strings but should return 400 #2191
    • Refactor: duplicated findStaticChild process at findChildWithLabel #2176
    • Allow different param names in different methods with same path scheme #2209
    • Add support for registering handlers for different 404 routes #2217
    • Middlewares should use errors.As() instead of type assertion on HTTPError #2227
    • Allow arbitrary HTTP method types to be added as routes #2237

    v4.7.2

    Fixes

    • Fix nil pointer exception when calling Start again after address binding error #2131
    • Fix CSRF middleware not being able to extract token from multipart/form-data form #2136
    • Fix Timeout middleware write race #2126

    Enhancements

    ... (truncated)

    Changelog

    Sourced from github.com/labstack/echo/v4's changelog.

    v4.9.0 - 2022-09-04

    Security

    • Fix open redirect vulnerability in handlers serving static directories (e.Static, e.StaticFs, echo.StaticDirectoryHandler) #2260

    Enhancements

    • Allow configuring ErrorHandler in CSRF middleware #2257
    • Replace HTTP method constants in tests with stdlib constants #2247

    v4.8.0 - 2022-08-10

    Most notable things

    You can now add any arbitrary HTTP method type as a route #2237

    e.Add("COPY", "/*", func(c echo.Context) error 
      return c.String(http.StatusOK, "OK COPY")
    })
    

    You can add custom 404 handler for specific paths #2217

    e.RouteNotFound("/*", func(c echo.Context) error { return c.NoContent(http.StatusNotFound) })
    

    g := e.Group("/images") g.RouteNotFound("/*", func(c echo.Context) error { return c.NoContent(http.StatusNotFound) })

    Enhancements

    • Add new value binding methods (UnixTimeMilli,TextUnmarshaler,JSONUnmarshaler) to Valuebinder #2127
    • Refactor: body_limit middleware unit test #2145
    • Refactor: Timeout mw: rework how test waits for timeout. #2187
    • BasicAuth middleware returns 500 InternalServerError on invalid base64 strings but should return 400 #2191
    • Refactor: duplicated findStaticChild process at findChildWithLabel #2176
    • Allow different param names in different methods with same path scheme #2209
    • Add support for registering handlers for different 404 routes #2217
    • Middlewares should use errors.As() instead of type assertion on HTTPError #2227
    • Allow arbitrary HTTP method types to be added as routes #2237

    v4.7.2 - 2022-03-16

    Fixes

    • Fix nil pointer exception when calling Start again after address binding error #2131
    • Fix CSRF middleware not being able to extract token from multipart/form-data form #2136
    • Fix Timeout middleware write race #2126

    ... (truncated)

    Commits
    • 16d3b65 Changelog for 4.9.0
    • 0ac4d74 Fix #2259 open redirect vulnerability in echo.StaticDirectoryHandler (used by...
    • d77e8c0 Added ErrorHandler and ErrorHandlerWithContext in CSRF middleware (#2257)
    • 534bbb8 replace POST constance with stdlib constance
    • fb57d96 replace GET constance with stdlib constance
    • d48197d Changelog for 4.8.0
    • cba12a5 Allow arbitrary HTTP method types to be added as routes
    • a327884 add:README.md-Third-party middlewares-github.com/go-woo/protoc-gen-echo
    • 61422dd Update CI-flow (Go 1.19 +deps)
    • a9879ff Middlewares should use errors.As() instead of type assertion on HTTPError
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Redis global connection variable

    Redis global connection variable

    In my point of view, this app shouldn't rely on a global Redis connection instance instead of it should inject this connection into each component to have a clear dependence bounding. This simplifies the reasoning of types and you could replace his type by an interface.

    Example:

    newJobByID(id, redisConn)
    

    Bad Example:

    This dep of redisConn is hidden

    newJobByID(id)
    

    If this proposal is good and you want you accepted it then I could find some spare time to refactor it. How do you test your changes by a test project?

    opened by donutloop 3
Releases(v2.0.0)
Owner
Mohammed Al Ashaal
a software engineer 🤓
Mohammed Al Ashaal
k-means clustering algorithm implementation written in Go

kmeans k-means clustering algorithm implementation written in Go What It Does k-means clustering partitions a multi-dimensional data set into k cluste

Christian Muehlhaeuser 414 Dec 6, 2022
PaddleDTX is a solution that focused on distributed machine learning technology based on decentralized storage.

中文 | English PaddleDTX PaddleDTX is a solution that focused on distributed machine learning technology based on decentralized storage. It solves the d

null 82 Dec 14, 2022
Fast, simple sklearn-like feature processing for Go

go-featureprocessing Fast, simple sklearn-like feature processing for Go Does not cross cgo boundary No memory allocation No reflection Convenient ser

Nikolay Dubina 89 Dec 2, 2022
A simple OCR API server, seriously easy to be deployed by Docker, on Heroku as well

ocrserver Simple OCR server, as a small working sample for gosseract. Try now here https://ocr-example.herokuapp.com/, and deploy your own now. Deploy

Hiromu OCHIAI 541 Dec 28, 2022
Simple gc using integer vectors to simulate

gcint Simple gc using integer vectors to simulate Iterate primarily over what should be the shorter vector (readers) removing unused references in fro

Matt Rutkowski 0 Nov 24, 2021
A simple utility, written in Go, for interacting with Salesforce.

Salesforce CLI A simple utility, written in Go, for interacting with Salesforce. Currently only specific functionality is implemented, and the output

Darren Parkinson 0 Dec 14, 2021
A simple yet customisable program written in go to make hackerman-like terminal effects.

stuntman a simple program written in go to make you look like a hackerman Demo stuntman -binar -width 90 -color cyan stuntman -text -width 90 -vertgap

Solaris 10 Aug 4, 2022
Bcfm-study-case - A simple http server using the Echo library in Go language

Task 1 Hakkında Burada Go dilinde Echo kütüphanesini kullanarak basit bir http s

Caner Gülay 0 Feb 2, 2022
A scrapy implementation in Go

go-scrapy A scrapy implementation in Go. (Work in progres) Overview go-scrapy is a very useful and productive web crawlign framework, used to crawl we

soft-pro-dev 2 Nov 17, 2021
Drop-in replacement for Go's flag package, implementing POSIX/GNU-style --flags.

Description pflag is a drop-in replacement for Go's flag package, implementing POSIX/GNU-style --flags. pflag is compatible with the GNU extensions to

Steve Francia 2k Dec 30, 2022