Collyzar - A distributed redis-based framework for colly.

Overview

Collyzar provides simple configuration and a small set of tools for implementing distributed crawling and scraping.

Features

  • Simple configuration and a clean API
  • Distributed crawling/scraping
  • Built-in global bloom filter
  • Built-in spider cache
  • Support for Redis commands
  • Multi-machine load balancing
  • Pause or stop all crawling machines
  • Pass additional information to the crawler, read it inside the crawler, and store it in the database
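
The list above mentions a global bloom filter, a structure that crawlers commonly use to skip URLs they have already seen. The following is a minimal in-memory sketch of that idea, not Collyzar's implementation: the `BloomFilter` type, its methods, and the sizes are all hypothetical names chosen for illustration, and a "global" filter would presumably live in Redis so that every machine shares it.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomFilter is a hypothetical illustration, not a Collyzar type.
type BloomFilter struct {
	bits []bool
	k    uint32 // number of bit positions probed per item
}

// NewBloomFilter creates a filter with size bits (size must be > 0) and k probes.
func NewBloomFilter(size int, k uint32) *BloomFilter {
	return &BloomFilter{bits: make([]bool, size), k: k}
}

// positions derives k bit positions from two FNV hashes (double hashing).
func (b *BloomFilter) positions(s string) []uint32 {
	h1 := fnv.New32a()
	h1.Write([]byte(s))
	h2 := fnv.New32()
	h2.Write([]byte(s))
	a, c := h1.Sum32(), h2.Sum32()
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (a + i*c) % uint32(len(b.bits))
	}
	return out
}

// Add marks a URL as seen.
func (b *BloomFilter) Add(url string) {
	for _, p := range b.positions(url) {
		b.bits[p] = true
	}
}

// MayContain returns false only when the URL was definitely never added.
// A true result can occasionally be a false positive.
func (b *BloomFilter) MayContain(url string) bool {
	for _, p := range b.positions(url) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	bf := NewBloomFilter(1024, 3)
	bf.Add("https://www.amazon.com")
	fmt.Println(bf.MayContain("https://www.amazon.com")) // true: added URLs are always reported
}
```

The trade-off is that a bloom filter never misses a URL it has seen, but it may occasionally report an unseen URL as seen, so a crawler using one can skip a small fraction of new pages in exchange for a fixed memory footprint.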

Installation

Add collyzar to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/Zartenc/collyzar/v2 latest
)

Example Usage

See the examples folder for more detailed examples.

Crawler cluster machine

SpiderName must be unique.

Once started, the crawler keeps monitoring the Redis crawler queue and crawls whatever arrives until it receives a pause or stop signal.

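package main

import (
	"fmt"

	"github.com/Zartenc/collyzar/v2"
)

func main() {
	// SpiderName must be unique.
	cs := &collyzar.CollyzarSettings{
		SpiderName: "zarten",
		Domain:     "www.amazon.com",
		RedisIp:    "127.0.0.1",
	}
	collyzar.Run(myResponse, cs, nil)
}

// myResponse prints the status code of each crawled response.
func myResponse(response *collyzar.ZarResponse) {
	fmt.Println(response.StatusCode)
}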
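
To picture how several crawler machines can share one queue, here is a minimal single-process sketch. It does not use Collyzar or Redis: a Go channel stands in for the Redis queue, goroutines stand in for machines, and closing the channel stands in for the stop signal. The function `crawlAll` and the placeholder URLs are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll starts `workers` goroutines that all read from one shared queue.
// Each URL is taken by exactly one worker, which is how a shared queue
// spreads load. It returns a map from URL to the id of the worker that took it.
func crawlAll(urls []string, workers int) map[string]int {
	queue := make(chan string)
	handled := make(map[string]int)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for id := 1; id <= workers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// The loop ends when the queue is closed, which plays
			// the role of the stop signal in this sketch.
			for u := range queue {
				mu.Lock()
				handled[u] = id
				mu.Unlock()
			}
		}(id)
	}

	for _, u := range urls {
		queue <- u // the control side pushes work
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	handled := crawlAll([]string{"url-1", "url-2", "url-3"}, 2)
	fmt.Println(len(handled)) // 3: every URL was handled once
}
```

Which worker receives which URL is not deterministic; only the total is. That matches the general behavior of a pull-based queue, where idle consumers take the next item.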

Control machine

Push a URL to the Redis queue:

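package main

import (
	"fmt"

	"github.com/Zartenc/collyzar/v2"
)

func main() {
	// Arguments: Redis IP, Redis port, (presumably) Redis password, spider name.
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	url := "https://www.amazon.com"
	pushInfo := collyzar.PushInfo{Url: url}

	if err := ts.PushToQueue(pushInfo); err != nil {
		fmt.Println(err)
	}
}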

Tools

Collyzar provides tools for stopping and pausing crawlers.

Stop all crawlers

package main

import (
	"fmt"

	"github.com/Zartenc/collyzar/v2"
)

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	if err := ts.StopSpiders(); err != nil {
		fmt.Println(err)
	}
}

Pause all crawlers

After a pause, every crawler process stays alive but idle.
You can then use the WakeupSpiders method to wake the crawlers up again.

package main

import (
	"fmt"

	"github.com/Zartenc/collyzar/v2"
)

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	if err := ts.PauseSpiders(); err != nil {
		fmt.Println(err)
	}
}
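
The difference between pausing, waking up, and stopping can be modeled as a small state machine. The sketch below is a hypothetical single-process model and does not call Collyzar or Redis; the `spider` type, the `signal` constants, and the `tick` method are invented for illustration only.

```go
package main

import "fmt"

type signal int

const (
	running signal = iota
	paused
	stopped
)

// spider is a hypothetical model of one crawler process.
type spider struct {
	state   signal
	queue   []string
	crawled []string
}

// tick runs one iteration of the crawler loop and reports whether
// the loop should keep going.
func (s *spider) tick() bool {
	switch s.state {
	case stopped:
		return false // the process exits
	case paused:
		return true // stay alive but idle; the queue is left untouched
	}
	if len(s.queue) > 0 {
		s.crawled = append(s.crawled, s.queue[0])
		s.queue = s.queue[1:]
	}
	return true
}

func main() {
	s := &spider{queue: []string{"url-1", "url-2"}}
	s.tick()          // crawls url-1
	s.state = paused  // the effect described for PauseSpiders
	s.tick()          // idle, nothing is crawled
	s.state = running // the effect described for WakeupSpiders
	s.tick()          // crawls url-2
	s.state = stopped // the effect described for StopSpiders
	fmt.Println(s.crawled, s.tick()) // [url-1 url-2] false
}
```

In this model a paused crawler keeps its place in the queue and resumes where it left off, while a stopped crawler ends its loop and would need to be started again.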

Bugs

Bugs or suggestions? Visit the issue tracker.

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow").

Releases: v2.1.5
Owner: Zarten