Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.

Overview

Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.

Features

  • JS Rendering
  • 5,000+ Requests/Sec
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8

See scraper Options for all custom settings.
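The global and per-domain concurrency limits are, conceptually, counting semaphores keyed by host. As a rough self-contained illustration of the per-domain idea (this is not Geziyor's actual implementation), buffered channels work well as semaphores:

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// domainLimiter caps in-flight requests per host using buffered
// channels as counting semaphores.
type domainLimiter struct {
	mu    sync.Mutex
	limit int
	slots map[string]chan struct{}
}

func newDomainLimiter(limit int) *domainLimiter {
	return &domainLimiter{limit: limit, slots: map[string]chan struct{}{}}
}

// acquire blocks while `limit` requests are already in flight for the
// URL's host, and returns a release callback.
func (d *domainLimiter) acquire(rawURL string) func() {
	host := ""
	if u, err := url.Parse(rawURL); err == nil {
		host = u.Host
	}
	d.mu.Lock()
	sem, ok := d.slots[host]
	if !ok {
		sem = make(chan struct{}, d.limit)
		d.slots[host] = sem
	}
	d.mu.Unlock()
	sem <- struct{}{} // blocks when the domain is saturated
	return func() { <-sem }
}

func main() {
	lim := newDomainLimiter(2)
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			release := lim.acquire("http://quotes.toscrape.com/")
			defer release()
			fmt.Println("fetching page", i) // at most 2 run concurrently per host
		}(i)
	}
	wg.Wait()
}
```

A global limit is the same construction with a single shared semaphore instead of one per host.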

Status

We highly recommend using Geziyor with Go modules.

Usage

This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.

package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start()
}

func quotesParse(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}
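The pagination step relies on r.JoinURL(href) to turn the relative next-page href into an absolute URL. This is standard relative-reference resolution, which the stdlib net/url package implements directly; the helper below (joinURL is our name, not part of Geziyor's API) sketches the same operation:

```go
package main

import (
	"fmt"
	"net/url"
)

// joinURL resolves a possibly relative href against the page's base URL.
func joinURL(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := joinURL("http://quotes.toscrape.com/page/1/", "/page/2/")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs) // http://quotes.toscrape.com/page/2/
}
```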

See tests for more usage examples.

Documentation

Installation

go get -u github.com/geziyor/geziyor

If you want to make JS rendered requests, make sure you have Chrome installed.

NOTE: macOS limits the maximum number of open file descriptors. If you want to make concurrent requests over 256, you need to increase limits. Read this for more.

Making Normal Requests

Initial requests start with the StartURLs []string field in Options. Geziyor makes concurrent requests to those URLs. After a response is read, ParseFunc func(g *Geziyor, r *Response) is called.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

If you want to create the first requests manually, set StartRequestsFunc. StartURLs won't be used if you create requests manually.
You can make requests using Geziyor methods:

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.Get("https://httpbin.org/anything", g.Opt.ParseFunc)
        g.Head("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Making JS Rendered Requests

JS rendered requests can be made using the GetRendered method. By default, Geziyor starts the locally installed Chrome browser. Set the BrowserEndpoint option to use a different Chrome instance, such as "ws://localhost:3000".

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
    //BrowserEndpoint: "ws://localhost:3000",
}).Start()

Extracting Data

We can extract HTML elements using response.HTMLDoc. HTMLDoc is Goquery's Document.

HTMLDoc is available on the Response if the response is HTML and can be parsed with Go's built-in HTML parser. If the response isn't HTML, response.HTMLDoc will be nil.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            log.Println(s.Find("span.text").Text(), s.Find("small.author").Text())
        })
    },
}).Start()

Exporting Data

You can export data automatically using exporters. Just send data to the Geziyor.Exports channel. See the available exporters.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            g.Exports <- map[string]interface{}{
                "text":   s.Find("span.text").Text(),
                "author": s.Find("small.author").Text(),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start()

Benchmark

8,748 requests per second on a MacBook Pro 15" (2016)

See tests for this benchmark function:

>> go test -run none -bench Requests -benchtime 10s
goos: darwin
goarch: amd64
pkg: github.com/geziyor/geziyor
BenchmarkRequests-8   	  200000	    108710 ns/op
PASS
ok  	github.com/geziyor/geziyor	22.861s
Comments
  • google-chrome: executable file not found in $PATH

    Issue:

    I get an error when I start my service on the server. Local on my machine everything works so far.

    request getting rendered: exec: "google-chrome": executable file not found in $PATH

    Code

    main.go

    // ...
    	crawler := geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			g.GetRendered("https://www.google.com/", g.Opt.ParseFunc)
    		},
    		Exporters: []export.Exporter{&export.JSON{}},
    	})
    	
    	crawler.Start()
    // ...
    

    Dockerfile

    # -- Stage 1 -- #
    FROM golang:1.16-alpine as builder
    WORKDIR /app
    
    COPY . .
    RUN go build -mod=readonly -o bin/service
    
    # -- Stage 2 -- #
    FROM alpine
    
    # Install any required dependencies.
    RUN apk --no-cache add ca-certificates
    
    WORKDIR /root/
    
    COPY --from=builder /app/bin/service /usr/local/bin/
    
    CMD ["service"]
    

    Question

    I assume I need additional dependencies on my server for geziyor to run smoothly? For example Headless Chrome?

    opened by dock-mausbach 12
  • Cookie cutters and Declarative scrapping

    Many web sites can be scraped using standard CSS selection without defining fancy Go code to do that. For this, I still like goscrape's "structured scraper" approach. Ref:

    https://github.com/andrew-d/goscrape#goscrape

    And here is how its scraping is defined declaratively:

    https://github.com/andrew-d/goscrape/blob/d89ba4ccc7f78429613f2a71bc7703c8faf9e8c9/_examples/scrape_hn.go#L15-L26

    	config := &scrape.ScrapeConfig{
    		DividePage: scrape.DividePageBySelector("tr:nth-child(3) tr:nth-child(3n-2):not([style='height:10px'])"),
    
    		Pieces: []scrape.Piece{
    			{Name: "title", Selector: "td.title > a", Extractor: extract.Text{}},
    			{Name: "link", Selector: "td.title > a", Extractor: extract.Attr{Attr: "href"}},
    			{Name: "rank", Selector: "td.title[align='right']",
    				Extractor: extract.Regex{Regex: regexp.MustCompile(`(\d+)`)}},
    		},
    
    		Paginator: paginate.BySelector("a[rel='nofollow']:last-child", "href"),
    	}
    

    Hope geziyor can do declarative scraping using predefined cookie cutters like the above as well.

    opened by suntong 8
  • context deadline exceeded

    I'm trying to scrape 3242 webpages but I'm getting response: Get "https://www.typeform.com/templates/t/course-evaluation-form-template/": context deadline exceeded (Client.Timeout exceeded while awaiting headers) for a lot of URLs

    Any advice?

    opened by TheUltimateCookie 7
  • Queue performance enhancements + delay middleware fix

    Enhances the queue logic to improve memory management and handle deadlock situations

    • Fixes delay middleware to always factor in delay if randomised delay is added (combined, not instead of)
    • Moves request middleware to run before the core g.do func - this allows middleware to cancel requests without triggering the semaphore locks (and corresponding rate limits!), and also avoids queuing items that would only be cancelled, saving memory.
    • Avoids deadlocks when the queue exceeds the max queue size by discarding any new records and printing a message to the log
    opened by williamjulianvicary 5
  • Are there any plan to add supports for a POST request?

    Hi there, I was using the project for a personal crawler, after navigating the source code I've realized that the only way to send a POST request might be implementing a StartRequestsFunc (let me know if I'm wrong lol) which manipulates the http client directly, e.g.,

    func postToUrl(url string, body io.Reader) {
    	geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			req, _ := client.NewRequest("POST", url, body)
    			g.Do(req, nil)
    		},
    	}).Start()
    }
    

    I haven't tried this approach yet but I'd like to know if that's the proper way to send requests other than a GET? Or is there any plan to add other implementations or an official example about a POST request?

    opened by Walker088 5
  • out of control RAM usage

    I've got a script that clicks every link and then clicks every link, and it quickly gets out of hand in terms of memory usage (40+GB) before crashing. Any suggestions as to where it's getting out of control? Storing millions of requests shouldn't take that much RAM in my mind.

    opened by ThomasMeetKai 4
  • Proxy Management Not supported

    Geziyor does not provide any interface for integrating proxies. It does provide request middlewares, but the object that can be manipulated in the middleware has no proxy-related configuration. It would be great if that could be supported as well.

    opened by prashant-ee 4
  • How to get response error other than HTTP errors

    Hi,

    How can I get response errors other than HTTP errors (StatusCode), like a timeout, address not found, or the website not being reachable? For example:

    	geziyor.NewGeziyor(&geziyor.Options{
    		StartURLs: []string{"http://www.1b4f.com/"},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			fmt.Println(string(r.Body))
    		},
    	}).Start()
    

    Log output :

    2019/12/10 15:00:21 Scraping Started
    2019/12/10 15:00:21 Retrying: http://www.1b4f.com/
    2019/12/10 15:00:21 Retrying: http://www.1b4f.com/
    2019/12/10 15:00:21 Response error: Get http://www.1b4f.com/: dial tcp: lookup www.1b4f.com: no such host
    2019/12/10 15:00:21 Scraping Finished

    I want to store the site URL & error in the database ("http://www.1b4f.com/", "dial tcp: lookup www.1b4f.com: no such host")

    opened by LeMoussel 4
  • Recursive Exports / Native return channels

    I found it quite common to have recursive / nested scraping.

    ├── a
    │   ├── itemA
    │   └── foldA
    │       └── itemB
    └── b
        ├── itemC
        └── foldA
            └── itemA
    

    Total result being something like:

    {
      "a": [
        {
          "title": "itemA",
          "author": "Foo Bar",
          "contents": "asdjnasknd"
        },
        {
          "title": "foldA",
          "children": [
            {
              "title": "itemB",
              "author": "Foo Baz",
              "contents": "afgdgasknd"
            }
          ]
        }
      ],
      "b": [
        {
          "title": "itemC",
          "author": "Foo Bar",
          "contents": "odjfoij"
        },
        {
          "title": "foldA",
          "children": [
            {
              "title": "itemA",
              "author": "Foo Baz",
              "contents": "alsd"
            }
          ]
        }
      ]
    }
    

    Problem is, as soon as you pass something to g.Do(), you have no way of hearing back from the function.

    opened by jtagcat 3
  • runtime error: invalid memory address or nil pointer dereference

    I just ran the basic example and got this error

    code:

    package main
    
    import (
    	"fmt"
    
    	"github.com/geziyor/geziyor"
    	"github.com/geziyor/geziyor/client"
    )
    
    func main() {
    	geziyor.NewGeziyor(&geziyor.Options{
    		StartRequestsFunc: func(g *geziyor.Geziyor) {
    			g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    		},
    		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    			fmt.Println(r.HTMLDoc.Find("title").Text())
    		},
    		//BrowserEndpoint: "ws://localhost:3000",
    	}).Start()
    }
    

    error:

    Scraping Started
    Crawled: (200) <GET https://httpbin.org/anything>
    runtime error: invalid memory address or nil pointer dereference goroutine 40 [running]:
    runtime/debug.Stack()
            C:/Program Files/Go/src/runtime/debug/stack.go:24 +0x65
    github.com/geziyor/geziyor.(*Geziyor).recoverMe(0xc00016cdc0)
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:307 +0x45
    panic({0x111dc60, 0x17f7d60})
            C:/Program Files/Go/src/runtime/panic.go:838 +0x207
    main.main.func2(0xc00014a1c8?, 0xc000409d10?)
            C:/Users/Marshall/Desktop/gezi/main.go:16 +0x18
    github.com/geziyor/geziyor.(*Geziyor).do(0xc00016cdc0, 0xc0001524b0, 0x12350c8)
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:262 +0x235
    created by github.com/geziyor/geziyor.(*Geziyor).Do
            C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:228 +0xd2
    
    Scraping Finished
    

    Any advice?

    opened by TheUltimateCookie 3
  • Add a generic in-memory counter and expose Metrics

    Added a Generic metrics counter (in-memory) and also exposed the Metrics variable so that it can be used outside of external metrics counters.

    (This is more of a suggestion, I'm otherwise counting manually but it doesn't make sense when it's built in!)

    opened by williamjulianvicary 3
  • Scrape URLs then get to there.

    I'm looking for,

    • Parse URLs
    • Visit each parsed URL
    • Parse data from the visited pages.

    For example,

    • Get book URLs from Goodreads
    • Visit those URLs
    • Get the books' data from the visited pages.

    This is possible with colly; I wonder if it's possible with geziyor.

    opened by mizzunet 0
  • Is scraping shadow DOM an option?

    Hi, I'm trying to scrape YouTube charts, unsuccessfully, because they use Polymer / shadow DOM. Could I do that with Geziyor? I'm using colly, and they don't have support for that.

    opened by jtlimo 6
  • Memory leak (gocoroutines)

    At the moment, each request that is queued up is first pushed into a goroutine, which is then blocked until space appears in the queue - however, if you're crawling very large websites with lots of links, this causes goroutines to build up rapidly, at ~8kb a pop (one per link in the "queue").

    I attempted to fix this on my side with a semaphore which blocks requests getting added to the queue, however as the middlewares can cancel the request, I'm not exposed to any kind of response to handle that semaphore, and I run into deadlocks.

    In short, I feel this needs a queue in the core vs. spinning up tens of thousands of goroutines; if you try crawling, say, Wikipedia with 100 connections, you'll see memory usage accelerate incredibly quickly (into GBs in seconds).

    What I was essentially doing on my side was adding a semaphore (very similar to your existing semaphore) but having this block at the point of URL addition/response consumption, before the goroutine is created - which would solve the issue, if it wasn't for my deadlocks(!)

    Let me know if I can add any more context! :-)

    opened by williamjulianvicary 10
  • Closing square bracket in JSON export

    Hello, in the file 'json.go' responsible for JSON export, you open the file with the 'O_APPEND' flag, so the closing square bracket (instead of the last comma) is never written to the file, and this error is not handled.

    opened by Mramorov 1
  • Unable to stop geziyor and close browser after scraping finishes

    I'm currently using geziyor with chromedp/headless-shell. I'm connecting via a remote URL.

    limit := 1000
    urls := []string{....}
    geziyor.NewGeziyor(&geziyor.Options{
    			StartRequestsFunc: func(g *geziyor.Geziyor) {
    				for i := 0; i < limit; i++ {
    					g.GetRendered(url[i], g.Opt.ParseFunc)
    				}
    			},
    			ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
    				if r.StatusCode == 404 {
    					fmt.Println("Page not found")
    					return
    				}
    				// Scraping some stuff...
    			}, RobotsTxtDisabled: true,
    			BrowserEndpoint: result["webSocketDebuggerUrl"].(string),
    }).Start()
    

    Issue: I'm unable to stop the geziyor program after scraping finishes. I'm also not able to close the browser (don't know if you guys handle it in the background) once scraping finishes, so I get these errors in chromedp/headless-shell: [0813/133906.020892:WARNING:resource_bundle.cc(1048)] locale resources are not loaded [0813/135451.951241:ERROR:broker_posix.cc(46)] Received unexpected number of handles (headless-shell error screenshot)

    While in geziyor, I get these error messages: request getting rendered: could not dial "ws://127.0.0.1:9222/devtools/browser/d224d555-fbd7-491a-86c0-332edb8f2975": context deadline exceeded (geziyor error screenshot)

    Any advice on the issue?

    opened by TheUltimateCookie 13
A declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library

goq Example import ( "log" "net/http" "astuart.co/goq" ) // Structured representation for github file name table type example struct { Title str

Andrew Stuart 218 Sep 27, 2022
Stylesheet-based markdown rendering for your CLI apps 💇🏻‍♀️

Glamour Write handsome command-line tools with Glamour. glamour lets you render markdown documents & templates on ANSI compatible terminals. You can c

Charm 1.3k Sep 26, 2022
A UTF-8 and internationalisation testing utility for text rendering.

ɱéťàł "English, but metal" Metal is a tool that converts English text into a legible, Zalgo-like character swap for the purposes of testing localisati

Harley 0 Jan 14, 2022
Produces a set of tags from given source. Source can be either an HTML page, Markdown document or a plain text. Supports English, Russian, Chinese, Hindi, Spanish, Arabic, Japanese, German, Hebrew, French and Korean languages.

Tagify Gets STDIN, file or HTTP address as an input and returns a list of most popular words ordered by popularity as an output. More info about what

ZoomIO 23 Aug 22, 2022
Auto-gen fuzzing wrappers from normal code. Automatically find buggy call sequences, including data races & deadlocks. Supports rich signature types.

fzgen fzgen auto-generates fuzzing wrappers for Go 1.18, optionally finds problematic API call sequences, can automatically wire outputs to inputs acr

thepudds 74 Sep 22, 2022
bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

bluemonday bluemonday is a HTML sanitizer implemented in Go. It is fast and highly configurable. bluemonday takes untrusted user generated content as

Microcosm 2.4k Sep 27, 2022
A fast string sorting algorithm (MSD radix sort)

Your basic radix sort A fast string sorting algorithm This is an optimized sorting algorithm equivalent to sort.Strings in the Go standard library. Fo

Algorithms to Go 178 Aug 17, 2022
Super Fast Regex in Go

Rubex : Super Fast Regexp for Go by Zhigang Chen ([email protected] or [email protected]) ONLY USE go1 BRANCH A simple regular expression libr

Moovweb 218 Sep 9, 2022
Small and fast FTS (full text search)

Microfts A small full text indexing and search tool focusing on speed and space. Initial tests seem to indicate that the database takes about twice as

Bill Burdick 27 Jul 30, 2022
Fast and secure steganography CLI for hiding text/files in images.

indie CLI This complete README is hidden in the target.png file below without the original readme.png this could have also been a lie as none could ev

BoB 4 Mar 20, 2022
A fast, easy-of-use and dependency free custom mapping from .csv data into Golang structs

csvparser This package provides a fast and easy-of-use custom mapping from .csv data into Golang structs. Index Pre-requisites Installation Examples C

João Duarte 20 May 10, 2022
Elegant Scraper and Crawler Framework for Golang

Colly Lightning Fast and Elegant Scraping Framework for Gophers Colly provides a clean interface to write any kind of crawler/scraper/spider. With Col

Colly 17.8k Oct 1, 2022
[Crawler/Scraper for Golang]🕷A lightweight distributed friendly Golang crawler framework.一个轻量的分布式友好的 Golang 爬虫框架。

Goribot: a lightweight, distributed-friendly Golang crawler framework. Full documentation | Document !! Warning !! Goribot has been migrated to Gospider (github.com/zhshch2002/gospider), which fixes some scheduling issues and splits the networking part into a separate repository. This repository will continue

null 208 Aug 26, 2022
Go minifiers for web formats

Minify Online demo if you need to minify files now. Command line tool that minifies concurrently and watches file changes. Releases of CLI for various

Taco de Wolff 3.1k Sep 20, 2022
🌭 The hotdog web browser and browser engine 🌭

This is the hotdog web browser project. It's a web browser with its own layout and rendering engine, parsers, and UI toolkit! It's made from scratch e

Danilo Fragoso 1k Sep 27, 2022
yview is a lightweight, minimalist and idiomatic template library based on golang html/template for building Go web application.

wview wview is a lightweight, minimalist and idiomatic template library based on golang html/template for building Go web application. Contents Instal

null 0 Dec 5, 2021
Build and deploy resilient web applications.

Archived Due to the security concerns surrounding XML, this package is now archived. Go server overview : Template engine. Built in request tracer. we

Cheikh Seck 13 Dec 15, 2020
Antch, a fast, powerful and extensible web crawling & scraping framework for Go

Antch Antch, inspired by Scrapy. If you're familiar with scrapy, you can quickly get started. Antch is a fast, powerful and extensible web crawling &

null 238 Sep 27, 2022
🎨 Terminal color rendering library, support 8/16 colors, 256 colors, RGB color rendering output, support Print/Sprintf methods, compatible with Windows.

🎨 Terminal color rendering library, support 8/16 colors, 256 colors, RGB color rendering output, support Print/Sprintf methods, compatible with Windows. (Go CLI console color rendering library: supports 16 colors, 256 colors, and RGB output, with Print/Sprintf-like usage, and supports color rendering in Windows environments.)

Gookit 1.2k Sep 26, 2022