Elegant Scraper and Crawler Framework for Golang

Overview

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.


Features

  • Clean API
  • Fast (>1k request/sec on a single core)
  • Manages request delays and maximum concurrency per domain (see the configuration sketch after this list)
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions
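
A minimal configuration sketch showing how a few of these features are typically enabled; the domain glob, delay, and cache directory below are illustrative placeholders, not values taken from this page:

package main

import (
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Async collector with on-disk response caching.
	c := colly.NewCollector(
		colly.Async(true),
		colly.CacheDir("./colly_cache"), // placeholder cache directory
	)

	// Limit concurrency and add a random delay per domain.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 2 * time.Second,
	})

	c.Visit("http://go-colly.org/")
	c.Wait() // required when Async(true) is used
}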

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}

See the examples folder for more detailed examples.

Installation

Add colly to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/gocolly/colly/v2 latest
)
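
Alternatively, running go get from the module root adds the dependency and records a concrete version in place of the latest placeholder:

go get github.com/gocolly/colly/v2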

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

If you are using Colly in a project please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

License

Colly is licensed under the Apache License 2.0.


Comments
  • Using cookies

    Using cookies

    
    package main

    import (
    	"fmt"
    	"github.com/gocolly/colly"
    )
    
    func main() {
    
    	c := colly.NewCollector()
    
    	c.OnRequest(func(r *colly.Request) {
    		r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 OPR/96.0.4640.0")
    		r.Headers.Set("cookie", "enter_cookies")
    	})
    
    	c.OnHTML(".sr-only", func(element *colly.HTMLElement) {
    		fmt.Println(element.Text)
    	})
    
    	c.Visit("https://github.com/settings/profile")
    
    }
    

    What I am trying to do: use my cookies to visit the GitHub settings/profile page, extract the ".sr-only" text, and print it. For some reason, nothing prints.

    I used the sample from a similar issue: https://github.com/gocolly/colly/issues/599

    opened by redoceD 0
  • Option not to pass Request Context to the Next Request

    Option not to pass Request Context to the Next Request

    I'm using the Request Context to store information about the parsed body in various c.OnHTML callbacks.

    What happens is that if I use e.Request.Visit() to follow up on hrefs, the request context is also passed along, which I wanted to avoid. So instead of using e.Request.Visit() I used c.Visit() directly, which made sure I got a new context for each request.

    However, I would like to use the MaxDepth option as well, and that only works if I use e.Request.Visit().

    It would work for me to use e.Request.Visit() but get a new context for each request. This is currently not possible. Is that correct?

    If yes, it would be great to have this as a configuration option that determines whether the request context is passed along or not.

    For now, I have made the change manually for local purposes:

    index 6beef834..524bb77c 100644
    --- a/vendor/github.com/gocolly/colly/v2/request.go
    +++ b/vendor/github.com/gocolly/colly/v2/request.go
    @@ -117,7 +117,7 @@ func (r *Request) AbsoluteURL(u string) string {
     // request and preserves the Context of the previous request.
     // Visit also calls the previously provided callbacks
     func (r *Request) Visit(URL string) error {
    -	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
    +	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, nil, nil, true)
     }
     
     // HasVisited checks if the provided URL has been visited
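
    The workaround described above (calling c.Visit directly so each request starts with a fresh context) can be sketched roughly as follows; the selector and start URL are placeholders:

    package main

    import (
    	"github.com/gocolly/colly/v2"
    )

    func main() {
    	c := colly.NewCollector(colly.MaxDepth(2))

    	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    		// e.Request.Visit would copy e.Request.Ctx into the next request.
    		// Calling the collector directly creates a fresh context instead,
    		// but MaxDepth no longer limits these follow-up visits.
    		c.Visit(e.Request.AbsoluteURL(e.Attr("href")))
    	})

    	c.Visit("http://go-colly.org/") // placeholder start URL
    }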
    
    opened by sundarv85 0
  • TLS Error on Robots.txt is not handled in OnError

    TLS Error on Robots.txt is not handled in OnError

    I'm running a test project on localhost:8000 and when I access it over https, it fails (which is expected)

    Get "https://localhost:8000/": tls: first record does not look like a TLS handshake

    The above is correctly caught in OnError. However, when I set ignoreRobots to false, it tries to fetch robots.txt first, and the failure below

    Get "https://localhost:8000/robots.txt": tls: first record does not look like a TLS handshake

    is not propagated to OnError, as it does not really originate from the request I started; colly tries to fetch robots.txt first, and that fails. Could this also be propagated to OnError, or be caught with a known error code from Colly such as

    ErrRobotsTxtBlocked = errors.New("URL blocked by robots.txt")
    ErrRobotsTxtFetchFailed = errors.New("Unable to fetch robots.txt") // New Error Code
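
    Until such an error value exists, one possible workaround (a sketch that sidesteps rather than fixes the propagation gap, and only makes sense where robots.txt handling is not needed) is to disable robots.txt on the collector so the failing fetch never happens:

    package main

    import (
    	"fmt"

    	"github.com/gocolly/colly/v2"
    )

    func main() {
    	// Skip fetching robots.txt entirely, so its TLS failure cannot occur.
    	c := colly.NewCollector(colly.IgnoreRobotsTxt())

    	c.OnError(func(r *colly.Response, err error) {
    		fmt.Println("request failed:", err)
    	})

    	c.Visit("https://localhost:8000/")
    }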
    
    opened by sundarv85 0
  • queue AddRequest will get stuck if the queue.Run() loop ends

    queue AddRequest will get stuck if the queue.Run() loop ends

    https://github.com/gocolly/colly/blob/947eeead97b39d46ce2c89b06164c78b39d25759/queue/queue.go#L113

    It gets stuck on q.wake <- struct{}{} because q.wake is no longer being read by queue.Run().
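
    A rough reproduction of the reported behavior, assuming the in-memory queue storage (URLs are placeholders):

    package main

    import (
    	"github.com/gocolly/colly/v2"
    	"github.com/gocolly/colly/v2/queue"
    )

    func main() {
    	c := colly.NewCollector()

    	q, _ := queue.New(2, &queue.InMemoryQueueStorage{MaxSize: 10000})
    	q.AddURL("http://go-colly.org/")
    	q.Run(c) // the consumer loop exits once the queue is drained

    	// According to the report, this call can block on q.wake because
    	// no Run loop is reading from the channel anymore.
    	q.AddURL("http://go-colly.org/faq")
    }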

    opened by lrobot 0
Releases (latest: v2.1.0)
  • v2.1.0(Jun 8, 2020)

    • HTTP tracing support
    • New callback: OnResponseHeader
    • Queue fixes
    • New collector option: Collector.CheckHead
    • Proxy fixes
    • Fixed POST revisit checking
    • Updated dependencies
  • v2.0.0(Nov 28, 2019)

    • Breaking change: Change Collector.RedirectHandler member to Collector.SetRedirectHandler function
    • Go module support
    • Collector.HasVisited method added to check whether a URL has already been visited (see the sketch after this list)
    • Collector.SetClient method introduced
    • HTMLElement.ChildTexts method added
    • New user agents
    • Multiple bugfixes
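
    A brief sketch of two of the additions above, Collector.HasVisited and HTMLElement.ChildTexts; the selector and URL are placeholders:

    package main

    import (
    	"fmt"

    	"github.com/gocolly/colly/v2"
    )

    func main() {
    	c := colly.NewCollector()

    	c.OnHTML("ul", func(e *colly.HTMLElement) {
    		// ChildTexts collects the text of every matching child element.
    		fmt.Println(e.ChildTexts("li"))
    	})

    	c.Visit("http://go-colly.org/")

    	// HasVisited reports whether the collector has already requested a URL.
    	if visited, err := c.HasVisited("http://go-colly.org/"); err == nil {
    		fmt.Println("visited:", visited)
    	}
    }
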
  • v1.2.0(Feb 13, 2019)

    • Compatibility with the latest htmlquery package
    • New request shortcut for HEAD requests
    • Check URL availability before visiting
    • Fix proxy URL value
    • Request counter fix
    • Minor fixes in examples
  • v1.1.0(Aug 28, 2018)

    • Appengine integration takes context.Context instead of http.Request (API change)
    • Added "Accept" http header by default to every request
    • Support slices of pointers and structs in unmarshal
    • Fixed a race condition in queues
    • ForEachWithBreak method added to HTMLElement
    • Added a local file example
    • Support gzip decompression of response bodies
    • Don't share waitgroup when cloning a collector
    • Fixed instagram example
  • v1.0.0(May 14, 2018)

    We are happy to announce that the first major release of Colly is here. Our goal was to create a scraping framework to speed up development and let its users concentrate on collecting relevant data. There is no need to reinvent the wheel when writing a new collector. Scrapers built on top of Colly support different storage backends, dynamic configuration and running requests in parallel out of the box. It is also possible to run your scrapers in a distributed manner.

    Facts about the development

    It started in September 2017 and has not stopped since. Colly has attracted numerous developers who helped by providing valuable feedback and contributing new features. Let's look at the numbers: in the last seven months, 30 contributors have created 338 commits. Users have opened 78 issues, 74 of which were resolved within a few days. Contributors have opened 59 pull requests, and all but one of them were either merged or closed. We would like to thank all of our supporters who contributed code, wrote blog posts about Colly, or helped the development in other ways. We would not be here without you.

    You might ask why it is being released now. Our experience with various production deployments shows that Colly provides a stable and robust platform for developing and running scrapers both locally and in multi-server configurations. The feature set is complete and ready to support even complex use cases. What are those features?

    • Rate limiting. When scraping, controlling the number of requests sent to the scraped site can be crucial. We would not want to disrupt the service by overloading it with too many requests; that is bad for the operators of the site and also for us, because the data we would like to collect becomes inaccessible. Thus, the number of requests must be limited. The collector provided by Colly can be configured to send only a limited number of requests in parallel.

    • Request caching. To relieve the load on external services and decrease the number of outgoing requests, response caching is supported.

    • Configurable via environment variables. To avoid rebuilding your scraper during fine-tuning, Colly can read configuration options from environment variables, so you can modify its settings without a Go development environment.

    • Proxies/proxy switchers. If the address of your scraper has to be hidden, proxies can make requests instead of the machine running the scraping job. Furthermore, to scale Colly without running multiple scraper instances, proxy switchers can be used: collectors support proxy switchers that distribute requests among multiple servers. Scraping the collected sites is still done on the machine running the scrapers, but the network traffic is moved to different hosts (see the combined sketch after this list).

    • Storage backend and storage interface. During scraping, various data needs to be stored and sometimes shared. To access these objects, Colly provides a storage interface. You can create your own storage and use it in your scraper by implementing the required interface. By default, Colly saves everything in memory; additional backend implementations are available for Redis and SQLite3.

    • Request queue. Scraping pages asynchronously and in parallel is a must-have feature. Colly maintains a request queue where URLs found during scraping are collected; worker threads of your collector take these URLs and create requests.

    • Goodies. The package named extensions provides multiple helpers for collectors. These are common functions implemented in advance, so you don't have to bloat your scraper code with general implementations. An example extension is RandomUserAgent, which generates a random User-Agent for every request. You can find the full list of goodies at https://godoc.org/github.com/gocolly/colly/extensions

    • Debuggers. Debugging can be painful, and Colly tries to ease the pain by providing debuggers to inspect your scraper. You can write debug messages to the console by using LogDebugger; if you prefer web interfaces, Colly also comes with a web debugger, which you can use by initializing a WebDebugger. See how debuggers can be used at https://godoc.org/github.com/gocolly/colly/debug
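
    A combined sketch of the proxy switcher, extensions, and debugger features described above; the proxy addresses are placeholders and the import paths assume the current v2 module layout rather than the v1 paths linked above:

    package main

    import (
    	"log"

    	"github.com/gocolly/colly/v2"
    	"github.com/gocolly/colly/v2/debug"
    	"github.com/gocolly/colly/v2/extensions"
    	"github.com/gocolly/colly/v2/proxy"
    )

    func main() {
    	// Log every collector event to the console.
    	c := colly.NewCollector(colly.Debugger(&debug.LogDebugger{}))

    	// Rotate outgoing requests between two proxies (placeholder addresses).
    	rp, err := proxy.RoundRobinProxySwitcher(
    		"socks5://127.0.0.1:1337",
    		"socks5://127.0.0.1:1338",
    	)
    	if err != nil {
    		log.Fatal(err)
    	}
    	c.SetProxyFunc(rp)

    	// Extension: set a random User-Agent header on every request.
    	extensions.RandomUserAgent(c)

    	c.Visit("http://go-colly.org/")
    }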

    We, the team behind Colly, believe that it has become a stable and mature scraping framework capable of supporting complex use cases. We are hoping for an even more productive future. Last but not least, thank you for your support and contributions.
