Elegant Scraper and Crawler Framework for Golang

Overview

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.


Features

  • Clean API
  • Fast (>1k requests/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}

See the examples folder for more detailed examples.
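The example above runs synchronously. As a minimal sketch of the async/parallel mode listed under Features (assuming the v2 module path; the rate-limit values are illustrative), the same crawl can run in parallel:

package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Async collector: Visit returns immediately; requests run in parallel.
	c := colly.NewCollector(colly.Async(true))

	// Be polite: at most two parallel requests per domain, spaced apart.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       1 * time.Second,
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")

	// Wait blocks until every queued request has finished.
	c.Wait()
}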

Installation

Add colly to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/gocolly/colly/v2 v2.1.0
)

Alternatively, fetch the latest version with:

go get github.com/gocolly/colly/v2

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode.

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

If you are using Colly in a project please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

License

Colly is licensed under the Apache License 2.0.

Issues
  • OnHTML can't find the element

    OnHTML can't find the element "div#gs_res_ccl_mid" on a Google Scholar page, although it found it yesterday

    scraper-gocolly1.txt

    A program of mine, which had worked for over a week, suddenly stopped working in the sense that one HTML element, div#gs_res_ccl_mid, can't be found. I tried the "div#gs_bdy_ccl" element, which can be found. I added a println statement to check whether an element is found. There is no pop-up or any other problem causing this, since the other tag worked. The element on the page "https://scholar.google.de/scholar?hl=de&as_sdt=0%2C5&q=%22organizational+routines%22+wurm&btnG=" also did not change when I checked. Is there a change in the package which could cause this? Could you help me please?
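    A plausible first step (not from the original thread) is to check what the server is actually returning, since Google Scholar often swaps in a consent or CAPTCHA page for automated clients. A hedged sketch, assuming the v2 import path, that dumps the response for inspection:

    package main

    import (
    	"fmt"
    	"os"

    	"github.com/gocolly/colly/v2"
    )

    func main() {
    	c := colly.NewCollector()

    	c.OnResponse(func(r *colly.Response) {
    		// Print the status and save the raw HTML so you can check
    		// whether div#gs_res_ccl_mid exists in what was actually served.
    		fmt.Println("status:", r.StatusCode, "bytes:", len(r.Body))
    		os.WriteFile("scholar_dump.html", r.Body, 0644)
    	})

    	c.Visit("https://scholar.google.de/scholar?hl=de&as_sdt=0%2C5&q=%22organizational+routines%22+wurm&btnG=")
    }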

    opened by INikson 0
  • Add fuzzer

    Add fuzzer

    This PR adds ClusterfuzzLite and a fuzzer for the HTML unmarshalling.

    ClusterfuzzLite will run fuzzers in the CI when PRs are made. It can be extended beyond this PR with code coverage and batch fuzzing: https://google.github.io/clusterfuzzlite/running-clusterfuzzlite/github-actions/

    Signed-off-by: AdamKorcz [email protected]

    opened by AdamKorcz 0
  • How to scrape web-page data from a predefined HTML source inside the main program?

    How to scrape web-page data from a predefined HTML source inside the main program?

    I'm trying to scrape https://dnsdumpster.com, and to scrape the exact values I need to pass a few headers with the request, but colly doesn't seem to support custom headers yet; the only way I can see is the Visit() function, where I need to pass a URL. Is there any way to scrape a page from within the main source file?

    For instance

    package main

    import "github.com/gocolly/colly"

    func main() {
    	page := "<title>hello</title>"
    	c := colly.NewCollector()
    	_ = page // the predefined HTML the question is about
    	_ = c    // collector is unused until Visit() is called
    }
    

    Here, instead of using the c.Visit() function (after downloading the HTML file locally), how can I get the title text from the page variable?

    The only way I found is to download the whole page locally, as mentioned above, and scrape it via file://[path], but in my opinion that is a bad idea: the program may not have the write permission required to write the HTML file to disk. Alternatively, I could upload the downloaded HTML file somewhere else and then request that URL, but that would slow everything down, downloading and uploading again and again. Is there a way to resolve this?
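    For what it's worth, colly can set custom headers per request via the OnRequest callback, and an in-memory HTML string can be parsed without any network round trip by using goquery, the parser colly builds on. A hedged sketch of both, matching the v1 import path of the snippet above (the Referer header is purely illustrative):

    package main

    import (
    	"fmt"
    	"strings"

    	"github.com/PuerkitoBio/goquery"
    	"github.com/gocolly/colly"
    )

    func main() {
    	// Custom headers: set them on every outgoing request.
    	c := colly.NewCollector()
    	c.OnRequest(func(r *colly.Request) {
    		r.Headers.Set("Referer", "https://dnsdumpster.com") // illustrative header
    	})
    	_ = c // call c.Visit(...) as usual afterwards

    	// Parse a predefined HTML string directly; no download or temp file.
    	page := "<title>hello</title>"
    	doc, err := goquery.NewDocumentFromReader(strings.NewReader(page))
    	if err != nil {
    		panic(err)
    	}
    	fmt.Println(doc.Find("title").Text()) // prints "hello"
    }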

    opened by kiwimoe 2
  • Colly instagram crawler sent panic

    Colly instagram crawler sent panic

    My first three or four tests were OK and there were no problems, until the next test had problems.

    The entry_data field, which is the most important field in the Instagram response, now contains:

        "entry_data": {
            "LoginAndSignupPage": [
                {
                    "captcha": { "enabled": false, "key": "" },
                    "gdpr_required": false,
                    "tos_version": "row",
                    "username_hint": ""
                }
            ]
        },

    But the previous requests were like this:

        "entry_data": {
            "ProfilePage": [
                {
                    -------
                    -------
                }
            ]
        }

    panic: runtime error: index out of range [0] with length 0

    goroutine 1 [running]:
    main.main.func2(0xc0001d8d20)
    	/home/khoujani/Desktop/eSport98/backend/eSport98-back/app/main2.go:129 +0x5e5
    github.com/gocolly/colly.(*Collector).handleOnHTML.func1(0xc0000ae020?, 0xc0005519b0)
    	/home/khoujani/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:963 +0x78
    github.com/PuerkitoBio/goquery.(*Selection).Each(0xc000551980, 0xc000143c48)
    	/home/khoujani/go/pkg/mod/github.com/!puerkito!bio/[email protected]/iteration.go:10 +0x46
    github.com/gocolly/colly.(*Collector).handleOnHTML(0xc0002729c0, 0xc0002840c0)
    	/home/khoujani/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:953 +0x24f
    github.com/gocolly/colly.(*Collector).fetch(0xc0002729c0, {0xa?, 0x3?}, {0x92eaac, 0x3}, 0x1, {0x0?, 0x0}, 0xc0000cfe20?, 0xc0002514d0, ...)
    	/home/khoujani/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:623 +0x68d
    github.com/gocolly/colly.(*Collector).scrape(0xc0002729c0, {0xc0000c4b00, 0x20}, {0x92eaac, 0x3}, 0x1, {0x0, 0x0}, 0x0, 0x0, ...)
    	/home/khoujani/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:535 +0x645
    github.com/gocolly/colly.(*Collector).Visit(0xc0002729c0?, {0xc0000c4b00?, 0x4?})
    	/home/khoujani/go/pkg/mod/github.com/gocolly/[email protected]/colly.go:412 +0xa6
    main.main()
    	/home/khoujani/Desktop/eSport98/backend/eSport98-back/app/main2.go:192 +0x285

    Process finished with the exit code 2
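    The panic is an unguarded index into the decoded entry_data slice, which is empty when Instagram serves LoginAndSignupPage instead of ProfilePage. A hedged, self-contained sketch of the guard (profilePage and firstProfile are hypothetical names standing in for the user's decoding code):

    package main

    import "log"

    // profilePage is a hypothetical stand-in for a decoded
    // entry_data.ProfilePage entry.
    type profilePage struct{ ID string }

    func firstProfile(pages []profilePage) (profilePage, bool) {
    	// Guard the index: the LoginAndSignupPage response leaves this
    	// slice empty, so pages[0] would panic exactly as in the trace.
    	if len(pages) == 0 {
    		return profilePage{}, false
    	}
    	return pages[0], true
    }

    func main() {
    	if _, ok := firstProfile(nil); !ok {
    		log.Println("entry_data has no ProfilePage; probably redirected to login")
    	}
    }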

    opened by hosseinkhojany 0
  • How do I get a site's TLS certificate?

    How do I get a site's TLS certificate?

    I am trying to get the TLS certificate that a site is presenting during the TLS handshake. I looked through the documentation and the response object but did not find what I was looking for.

    According to the docs, I can customize some HTTP options by replacing the default HTTP round-tripper. I tried setting custom GetCertificate and GetClientCertificate functions, assuming that these functions would be used during the TLS handshake, but the print statements are never called.

        // Instantiate default collector
        c := colly.NewCollector(
            // Visit only the domain pkg.go.dev
            colly.AllowedDomains("pkg.go.dev"),
        )
    
        c.WithTransport(&http.Transport{
            TLSClientConfig: &tls.Config{
                GetCertificate: func(ch *tls.ClientHelloInfo) (*tls.Certificate, error) {
                    fmt.Println("~~~GETCERT CALLED~~")
                    return nil, nil
                },
                GetClientCertificate: func(cri *tls.CertificateRequestInfo) (*tls.Certificate, error) {
                    fmt.Println("~~~GETCLIENTCERT CALLED~~")
                    return nil, nil
                },
            },
        })
    

    How would I get the TLS certificate using Colly?

    Versions:

        $ go list -m github.com/gocolly/colly/v2
        github.com/gocolly/colly/v2 v2.1.0

        $ go version
        go version go1.18.3 darwin/arm64
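    For reference: GetCertificate is only consulted on the server side of a handshake, and GetClientCertificate only fires when the server requests a client certificate, which is why neither print statement ever runs. On the client, the peer's chain is available from the connection state; a hedged sketch using tls.Config.VerifyConnection (Go 1.15+):

    package main

    import (
    	"crypto/tls"
    	"fmt"
    	"net/http"

    	"github.com/gocolly/colly/v2"
    )

    func main() {
    	c := colly.NewCollector(colly.AllowedDomains("pkg.go.dev"))

    	c.WithTransport(&http.Transport{
    		TLSClientConfig: &tls.Config{
    			// VerifyConnection runs during the handshake, after the
    			// default verification; the state carries the peer chain.
    			VerifyConnection: func(cs tls.ConnectionState) error {
    				for _, cert := range cs.PeerCertificates {
    					fmt.Println("subject:", cert.Subject, "issuer:", cert.Issuer)
    				}
    				return nil // keep the default verification result
    			},
    		},
    	})

    	c.Visit("https://pkg.go.dev/")
    }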

    opened by joshuaherrera 0
Releases (latest: v2.1.0)
  • v2.1.0 (Jun 8, 2020)

    • HTTP tracing support
    • New callback: OnResponseHeader
    • Queue fixes
    • New collector option: Collector.CheckHead
    • Proxy fixes
    • Fixed POST revisit checking
    • Updated dependencies
  • v2.0.0 (Nov 28, 2019)

    • Breaking change: Change Collector.RedirectHandler member to Collector.SetRedirectHandler function
    • Go module support
    • Collector.HasVisited method added to check whether a URL has been visited
    • Collector.SetClient method introduced
    • HTMLElement.ChildTexts method added
    • New user agents
    • Multiple bugfixes
  • v1.2.0 (Feb 13, 2019)

    • Compatibility with the latest htmlquery package
    • New request shortcut for HEAD requests
    • Check URL availability before visiting
    • Fix proxy URL value
    • Request counter fix
    • Minor fixes in examples
  • v1.1.0 (Aug 28, 2018)

    • Appengine integration takes context.Context instead of http.Request (API change)
    • Added "Accept" http header by default to every request
    • Support slices of pointers and structs in unmarshal
    • Fixed a race condition in queues
    • ForEachWithBreak method added to HTMLElement
    • Added a local file example
    • Support gzip decompression of response bodies
    • Don't share waitgroup when cloning a collector
    • Fixed instagram example
  • v1.0.0 (May 14, 2018)

    We are happy to announce that the first major release of Colly is here. Our goal was to create a scraping framework to speed up development and let its users concentrate on collecting relevant data. There is no need to reinvent the wheel when writing a new collector. Scrapers built on top of Colly support different storage backends, dynamic configuration and running requests in parallel out of the box. It is also possible to run your scrapers in a distributed manner.

    Facts about the development

    It started in September 2017 and has not stopped since. Colly has attracted numerous developers who helped by providing valuable feedback and contributing new features. Let's see the numbers. In the last seven months, 30 contributors have created 338 commits. Users have opened 78 issues; 74 of those were resolved within a few days. Contributors have opened 59 pull requests, and all but one have been merged or closed. We would like to thank all of our supporters who contributed code, wrote blog posts about Colly, or helped development in some other way. We would not be here without you.

    You might ask why it is being released now. Our experience with various production deployments shows that Colly provides a stable and robust platform for developing and running scrapers, both locally and in multi-server configurations. The feature set is complete and ready to support even complex use cases. What are those features?

    • Rate limiting — During scraping, controlling the number of requests sent to the scraped site can be crucial. We would not want to disrupt the service by overloading it with too many requests: that is bad for the operators of the site and also for us, because the data we would like to collect becomes inaccessible. The collector provided by Colly can therefore be configured to send only a limited number of requests in parallel (a combined sketch follows this list).

    • Request caching — To relieve the load on external services and decrease the number of outgoing requests, response caching is supported.

    • Configurable via environment variables — To eliminate rebuilding your scraper during fine-tuning, Colly can read configuration options from environment variables, so you can modify its settings without a Go development environment.

    • Proxies/proxy switchers — If the address of the scraper has to be hidden, proxies can make the requests instead of the machine running the scraping job. Furthermore, to scale Colly without running multiple scraper instances, proxy switchers can distribute requests among multiple servers. Scraping collected sites is still done on the machine running the scrapers, but the network traffic is moved to different hosts.

    • Storage backend and storage interface — During scraping, various data needs to be stored and sometimes shared. To access these objects Colly provides a storage interface; you can create your own storage and use it in your scraper by implementing the required interface. By default Colly keeps everything in memory. Additional backend implementations are available for Redis and SQLite3.

    • Request queue — Scraping pages asynchronously in parallel is a must-have feature. Colly maintains a request queue where URLs found during scraping are collected; worker threads of your collector take these URLs and create requests.

    • Goodies — The package named extensions provides multiple helpers for collectors: common functions implemented in advance, so you don't have to bloat your scraper code with general implementations. An example extension is RandomUserAgent, which generates a random User-Agent for every request. You can find the full list of goodies at https://godoc.org/github.com/gocolly/colly/extensions

    • Debuggers — Debugging can be painful. Colly tries to ease the pain by providing debuggers to inspect your scraper. You can simply write debug messages to the console with LogDebugger; if you prefer web interfaces, Colly also comes with a web debugger, which you can use by initializing a WebDebugger. See how debuggers can be used at https://godoc.org/github.com/gocolly/colly/debug
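    As a minimal sketch tying several of these features together (using the v1 import paths this release describes; the URL and limit values are illustrative):

    package main

    import (
    	"time"

    	"github.com/gocolly/colly"
    	"github.com/gocolly/colly/debug"
    	"github.com/gocolly/colly/extensions"
    )

    func main() {
    	// Response caching and a console debugger, both built in.
    	c := colly.NewCollector(
    		colly.CacheDir("./colly_cache"),
    		colly.Debugger(&debug.LogDebugger{}),
    	)

    	// Rate limiting: at most 2 parallel requests per domain, 1s apart.
    	c.Limit(&colly.LimitRule{
    		DomainGlob:  "*",
    		Parallelism: 2,
    		Delay:       1 * time.Second,
    	})

    	// Goodies: randomize the User-Agent header on every request.
    	extensions.RandomUserAgent(c)

    	c.Visit("http://go-colly.org/")
    }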

    We, the team behind Colly, believe it has become a stable and mature scraping framework capable of supporting complex use cases, and we are hoping for an even more productive future. Last but not least, thank you for your support and contributions.
