[Crawler framework (Golang)] An awesome concurrent crawler (spider) framework written in Go. The crawler is flexible and modular: you can extend it into a customized crawler easily, or use only the default crawl components.

Overview

go_spider

A crawler for vertical communities, implemented in Go.

Latest stable Release: Version 1.2 (Sep 23, 2014).

  • go_spider discussion group, QQ group number: 337344607

Features

  • Concurrent
  • Suited to vertical communities
  • Flexible, modular
  • Native Go implementation
  • Easily extended into a customized crawler

Requirements

  • Go 1.2 or higher

Documentation

Chinese documentation && FAQ.

Installation

go get github.com/hu17889/go_spider
go get github.com/PuerkitoBio/goquery
go get github.com/bitly/go-simplejson
go get golang.org/x/net/html/charset

This project is based on simplejson and goquery.

If you are in China, you can download the packages from http://gopm.io/.

Use example

Here is an example that crawls GitHub content. Run it to try out the crawl process.

  • go install github.com/hu17889/go_spider/example/github_repo_page_processor
  • ./bin/github_repo_page_processor

More examples here: examples.

Make your spider

    // Spider input:
    //  PageProcesser: parses each downloaded page;
    //  Task name: used by the Pipeline when recording results.
    spider.NewSpider(NewMyPageProcesser(), "TaskName").
        AddUrl("https://github.com/hu17889?tab=repositories", "html"). // Start URL; "html" is the response type ("html" or "json")
        AddPipeline(pipeline.NewPipelineConsole()).                    // Print results on screen
        SetThreadnum(3).                                               // Crawl requests with three goroutines
        Run()
  • Use default modules

    • Downloader: HttpDownloader
    • Scheduler: QueueScheduler
    • Pipeline: PipelineConsole, PipelineFile

  • Use your modules

Just copy a default module and modify it.

If you write a Downloader module, register it with Spider.SetDownloader(your_downloader).

If you write a Pipeline module, register it with Spider.AddPipeline(your_pipeline).

If you write a Scheduler module, register it with Spider.SetScheduler(your_scheduler).

Extensions

The extensions folder contains modules and other tools shared by contributors. You are welcome to push your own code, provided it is free of bugs.

Modules

Spider

Summary: Crawler initialization, concurrency management, default modules, module management, and configuration settings.

Functions:

  • Crawler startup functions: Get, GetAll, Run
  • Add request: AddUrl, AddUrls, AddRequest, AddRequests
  • Set main modules: AddPipeline (several pipeline modules may be added), SetScheduler, SetDownloader
  • Set config: SetExitWhenComplete, SetThreadnum (number of concurrent goroutines), SetSleepTime (sleep time after each crawl)
  • Monitor: OpenFileLog, OpenFileLogDefault (enable file logging through the mlog package), CloseFileLog, OpenStrace (print tracing info on screen via stderr), CloseStrace
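
The effect of SetThreadnum and SetSleepTime can be pictured as a fixed pool of goroutines draining a shared queue of URLs, each pausing after a crawl. The following is a minimal standard-library sketch of that idea, not go_spider's actual implementation; `crawlAll`, `fetch`, and their parameters are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// crawlAll is a hypothetical helper (not part of go_spider). It runs fetch
// over urls with threadNum goroutines and sleeps sleepTime after each crawl.
// threadNum must be at least 1.
func crawlAll(urls []string, threadNum int, sleepTime time.Duration, fetch func(string) string) map[string]string {
	jobs := make(chan string)
	results := make(map[string]string)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < threadNum; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body := fetch(u)
				mu.Lock() // the result map is shared between workers
				results[u] = body
				mu.Unlock()
				time.Sleep(sleepTime)
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // lets the workers exit once the queue is drained
	wg.Wait()
	return results
}

func main() {
	urls := []string{"a", "b", "c", "d"}
	// The fetch function is a stand-in for a real download.
	res := crawlAll(urls, 3, time.Millisecond, func(u string) string {
		return "body of " + u
	})

	keys := make([]string, 0, len(res))
	for k := range res {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "=>", res[k])
	}
}
```

A larger pool raises throughput until the target site or the network becomes the bottleneck; the per-crawl sleep is a simple way to stay polite toward the crawled site.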

Downloader

Summary: The Spider takes a Request from the Scheduler that contains the URL to crawl. The Downloader then downloads the result (html, json, jsonp, or text) of the Request. The result is saved in a Page for parsing in the PageProcesser. HTML parsing is based on the goquery package, and JSON parsing is based on the simplejson package. JSONP is converted to JSON. The text type represents plain-text content that is not passed through a parser.
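
To make the JSONP-to-JSON step concrete, here is a small sketch that strips a `callback(...)` wrapper so the payload can be handed to a JSON parser. It is an assumption about what such a conversion might look like, not the code go_spider uses; `jsonpToJSON` is a hypothetical name, and the standard `encoding/json` package stands in for simplejson.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// jsonpToJSON is a hypothetical helper. It returns the payload between the
// first "(" and the last ")" of a JSONP response such as `cb({...});`.
// Input that already looks like JSON is returned trimmed but unchanged.
func jsonpToJSON(s string) string {
	s = strings.TrimSpace(s)
	if s == "" || s[0] == '{' || s[0] == '[' {
		return s
	}
	start := strings.Index(s, "(")
	end := strings.LastIndex(s, ")")
	if start == -1 || end == -1 || end < start {
		return s
	}
	return strings.TrimSpace(s[start+1 : end])
}

func main() {
	raw := `callback({"name": "go_spider", "stars": 3});`

	var v map[string]interface{}
	if err := json.Unmarshal([]byte(jsonpToJSON(raw)), &v); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(v["name"])
}
```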

Functions:

  • Download: downloads the content of the crawl target. The result contains the data body, header, cookies, and request info.
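
As an illustration of what a download step collects, the sketch below fetches a page with `net/http` and keeps the body, header, and cookies. The `result` type and the `download` and `newTestServer` functions are hypothetical and do not mirror HttpDownloader's real signatures; a local `httptest` server replaces a real site so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// result is a hypothetical container for the pieces the text lists:
// data body, header, and cookies.
type result struct {
	Body    string
	Header  http.Header
	Cookies []*http.Cookie
}

// download fetches url and collects body, header, and cookies.
func download(client *http.Client, url string) (*result, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return &result{Body: string(body), Header: resp.Header, Cookies: resp.Cookies()}, nil
}

// newTestServer starts a local server that stands in for a crawled site.
func newTestServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc"})
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		fmt.Fprint(w, "<html><body>hello</body></html>")
	}))
}

func main() {
	srv := newTestServer()
	defer srv.Close()

	res, err := download(srv.Client(), srv.URL)
	if err != nil {
		fmt.Println("download failed:", err)
		return
	}
	fmt.Println(res.Body)
	fmt.Println(res.Header.Get("Content-Type"))
	for _, c := range res.Cookies {
		fmt.Println(c.Name, "=", c.Value)
	}
}
```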

PageProcesser

Summary: The PageProcesser module only parses results. It extracts results (key-value pairs) and the URLs to crawl in the next step. The key-value pairs are saved in PageItems and the URLs are pushed to the Scheduler.

Functions:

  • Process: parses the crawled target.
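
The two outputs of a processing step, key-value pairs and follow-up URLs, can be sketched as below. This is a simplified stand-in that uses regular expressions from the standard library instead of goquery, and the `parsed` type and `process` function are hypothetical; a real PageProcesser would work on a Page and call methods such as AddField and AddTargetRequests.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	titleRe = regexp.MustCompile(`(?is)<title>(.*?)</title>`)
	hrefRe  = regexp.MustCompile(`(?i)href="([^"]+)"`)
)

// parsed is a hypothetical holder for what a processing step produces.
type parsed struct {
	Fields  map[string]string // key-value results for the pipeline
	Targets []string          // URLs to crawl in the next step
}

// process extracts the page title as a field and every href as a target.
func process(body string) parsed {
	out := parsed{Fields: map[string]string{}}
	if m := titleRe.FindStringSubmatch(body); m != nil {
		out.Fields["title"] = strings.TrimSpace(m[1])
	}
	for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
		out.Targets = append(out.Targets, m[1])
	}
	return out
}

func main() {
	body := `<html><head><title> Repositories </title></head>
<body><a href="/hu17889/go_spider">go_spider</a></body></html>`

	p := process(body)
	fmt.Println("title:", p.Fields["title"])
	fmt.Println("targets:", p.Targets)
}
```

Regular expressions are used here only to keep the sketch dependency-free; for real HTML, a proper parser such as goquery is the more robust choice.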

Page

Summary: Holds the information for a request.

Functions:

  • Get result: GetJson, GetHtmlParser, GetBodyStr (plain text)
  • Get information about the target: GetRequest, GetCookies, GetHeader
  • Get status of the crawl process: IsSucc (whether the download succeeded), Errormsg (error info from the Downloader)
  • Set config: SetSkip, GetSkip (if skip is true, the result is not output in the Pipeline), AddTargetRequest, AddTargetRequests (save URLs to crawl in the next stage), AddTargetRequestWithParams, AddTargetRequestsWithParams, AddField (save a key-value pair after parsing)

Scheduler

Summary: The Scheduler module is a Request queue. URLs parsed in the PageProcesser are pushed onto the queue.

Functions:

  • Push
  • Poll
  • Count
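
A queue offering these three operations might look like the sketch below: a mutex-guarded FIFO built on `container/list`. It shows the general shape a custom Scheduler could take and is not a copy of QueueScheduler; the `request` and `queueScheduler` types are hypothetical, and the real interface may use different parameter and return types.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// request is a hypothetical stand-in for the framework's Request type.
type request struct {
	URL string
}

// queueScheduler is a FIFO queue that is safe for concurrent use.
type queueScheduler struct {
	mu    sync.Mutex
	queue *list.List
}

func newQueueScheduler() *queueScheduler {
	return &queueScheduler{queue: list.New()}
}

// Push appends a request to the back of the queue.
func (s *queueScheduler) Push(r *request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queue.PushBack(r)
}

// Poll removes and returns the oldest request, or nil if the queue is empty.
func (s *queueScheduler) Poll() *request {
	s.mu.Lock()
	defer s.mu.Unlock()
	front := s.queue.Front()
	if front == nil {
		return nil
	}
	s.queue.Remove(front)
	return front.Value.(*request)
}

// Count reports how many requests are waiting.
func (s *queueScheduler) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.queue.Len()
}

func main() {
	s := newQueueScheduler()
	s.Push(&request{URL: "https://github.com/hu17889?tab=repositories"})
	s.Push(&request{URL: "https://github.com/hu17889/go_spider"})

	fmt.Println("waiting:", s.Count())
	for r := s.Poll(); r != nil; r = s.Poll() {
		fmt.Println("crawl:", r.URL)
	}
}
```

The lock matters because several crawl goroutines poll the queue while the processing step pushes new URLs onto it.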

Pipeline

Summary: The Pipeline module outputs the result and saves it wherever you want. The default modules are PipelineConsole (output to stdout) and PipelineFile (output to file).

Functions:

  • Process
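
One way to see why console and file output can share a design is to write the pipeline against `io.Writer`. The sketch below does that with hypothetical names (`pipeline`, `writerPipeline`, `render`); the real Process method in go_spider takes the framework's own item and task types rather than a plain map.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)

// pipeline is a hypothetical, simplified interface for output modules.
type pipeline interface {
	Process(items map[string]string)
}

// writerPipeline writes each key-value pair to any io.Writer,
// such as os.Stdout or an opened file.
type writerPipeline struct {
	w io.Writer
}

// Process prints the items in sorted key order so output is stable.
func (p writerPipeline) Process(items map[string]string) {
	keys := make([]string, 0, len(items))
	for k := range items {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(p.w, "%s:\t%s\n", k, items[k])
	}
}

// render runs a writerPipeline against an in-memory buffer.
func render(items map[string]string) string {
	var sb strings.Builder
	writerPipeline{w: &sb}.Process(items)
	return sb.String()
}

func main() {
	items := map[string]string{"author": "hu17889", "project": "go_spider"}

	// Several pipelines can receive the same result.
	pipelines := []pipeline{writerPipeline{w: os.Stdout}}
	for _, p := range pipelines {
		p.Process(items)
	}
}
```

Passing a file opened with `os.Create` instead of `os.Stdout` turns the same type into a file pipeline.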

Request

Summary: The Request module holds the configuration for an HTTP request, such as the URL, header, and cookies.

Functions:

  • Process

License

go_spider is licensed under the Mozilla Public License Version 2.0

Mozilla summarizes the license scope as follows:

MPL: The copyleft applies to any files containing MPLed code.

That means:

  • You can use the unchanged source code both privately and commercially
  • You needn't publish the source code of your library as long as the files licensed under the MPL 2.0 are unchanged
  • You must publish the source code of any changed files licensed under the MPL 2.0 under a) the MPL 2.0 itself or b) a compatible license (e.g. GPL 3.0 or Apache License 2.0)

Please read the MPL 2.0 FAQ if you have further questions regarding the license.

You can read the full terms here: LICENSE.

Releases(V1.2)
  • V1.2(Jan 13, 2015)

    • Add header and cookies
    • Add url tag name
    • Add example "github_login_profile_page_processor" for using header and cookies
    • Add functions AddRequest, AddRequests, GetByRequest, GetAllByRequest in spider and AddTargetRequestWithParams, AddTargetRequestsWithParams in Page
    • Expose module Request
  • V1.1.1(Dec 6, 2014)

  • V1.1(Nov 30, 2014)

    • Add new response types "jsonp" and "text"
    • Convert the response data to utf-8 when its charset is not utf-8
    • Add Header and Cookies to Page
    • Fix some implicit bugs
  • v1.0(Sep 23, 2014)

Owner
胡聪