:paw_prints: Creeper - The Next Generation Crawler Framework (Go)


Creeper is a next-generation crawler that fetches web pages according to a Creeper script. Because it is cross-platform and embeddable, you can use it in a news app, a subscription program, and similar projects.

Warning: This project is still at an early stage of development. Please do not use it in a production environment.

Get Started


$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	c := creeper.Open("./hacker_news.crs")
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
	})
}

Build and run. The console will print something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec


Town

A town is a lambda-like expression that stores a mutable or immutable string. Most of the time it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

To use a town, call it as you would call a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not passed in this call. That is because it is a special parameter.

In a town definition, an expression such as name="something" means that the parameter name has the default value "something".
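
The following is a minimal sketch of how placeholders and default values could be resolved. It is not Creeper's implementation: the function expandTown and its map-based signature are hypothetical, and query values are inserted as-is without URL encoding.

```go
package main

import (
	"fmt"
	"strings"
)

// expandTown fills every {name} placeholder in a town template.
// Values in args override values in defaults. The function name and
// signature are hypothetical and not part of Creeper's API.
func expandTown(template string, defaults, args map[string]string) string {
	merged := make(map[string]string, len(defaults)+len(args))
	for name, value := range defaults {
		merged[name] = value
	}
	for name, value := range args {
		merged[name] = value
	}
	// Build old/new pairs for a single-pass replacement.
	pairs := make([]string, 0, 2*len(merged))
	for name, value := range merged {
		pairs = append(pairs, "{"+name+"}", value)
	}
	return strings.NewReplacer(pairs...).Replace(template)
}

func main() {
	template := "https://news.ycombinator.com/news?p={@page}&ext={ext}"
	defaults := map[string]string{"@page": "1"}
	args := map[string]string{"ext": "Hello"}
	fmt.Println(expandTown(template, defaults, args))
}
```

Here @page falls back to its default because the caller only supplies ext, which mirrors the call shown above.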

Incidentally, @page is a parameter that increases automatically when the current page has no more content.
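
The paging behavior can be pictured with the sketch below. It rests on assumptions: crawlPages and fetch are hypothetical names, fetch stands in for "download the page and run the selectors", and stopping at the first empty page is an assumed termination rule rather than documented Creeper behavior.

```go
package main

import "fmt"

// crawlPages calls fetch with increasing page numbers, starting at start,
// and stops at the first page that yields no items.
// Both names are hypothetical; this is not Creeper's API.
func crawlPages(start int, fetch func(page int) []string) []string {
	var all []string
	for page := start; ; page++ {
		items := fetch(page)
		if len(items) == 0 {
			return all
		}
		all = append(all, items...)
	}
}

func main() {
	// Fake site with three pages of two items each.
	fetch := func(page int) []string {
		if page > 3 {
			return nil
		}
		return []string{fmt.Sprintf("p%d-a", page), fmt.Sprintf("p%d-b", page)}
	}
	fmt.Println(crawlPages(1, fetch))
}
```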


Node

Nodes form a tree that describes the structure of the data you are going to crawl.

news[]: page -> $("tr.athing")
	title: $(".title a.storylink").text
	site: $(".title span.sitestr").text
	link: $(".title a.storylink").href

As in YAML, the hierarchy of nodes is determined by indentation.
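
To illustrate how indentation can define a hierarchy, here is a small sketch that turns indented lines into a tree. It is not Creeper's parser: the node type, indentOf, and parseNodes are hypothetical, and each leading tab or space is simply counted as one unit of indentation.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical in-memory form of one script line.
type node struct {
	name     string
	children []*node
}

// indentOf counts the leading tabs and spaces of a line.
func indentOf(line string) int {
	return len(line) - len(strings.TrimLeft(line, " \t"))
}

// parseNodes builds a tree from indented "name: ..." lines. A line becomes
// a child of the nearest preceding line with smaller indentation.
func parseNodes(script string) []*node {
	type frame struct {
		indent int
		n      *node
	}
	root := &node{}
	stack := []frame{{indent: -1, n: root}}
	for _, line := range strings.Split(script, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		indent := indentOf(line)
		// Pop back to the nearest ancestor that is less indented.
		for stack[len(stack)-1].indent >= indent {
			stack = stack[:len(stack)-1]
		}
		name := strings.TrimSpace(strings.SplitN(line, ":", 2)[0])
		n := &node{name: name}
		parent := stack[len(stack)-1].n
		parent.children = append(parent.children, n)
		stack = append(stack, frame{indent: indent, n: n})
	}
	return root.children
}

func main() {
	script := `news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href`
	for _, top := range parseNodes(script) {
		fmt.Println(top.name)
		for _, child := range top.children {
			fmt.Println("  " + child.name)
		}
	}
}
```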

Node Name

Every node has a name. title is a field name and holds a single string value. news[] is an array name and denotes a parent structure that holds multiple child records.


Page

Page indicates where the field data is fetched from. It can be a town expression or a field reference.

Field reference is an advanced use of nodes; you can find the details in ./eh.crs.

If a node has both a page and a fun, the page goes on the left of -> and the fun goes on the right, that is: page -> fun
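
As a rough illustration of the page -> fun layout, the sketch below splits one node line into its name, page, and fun parts. The helper splitNode is hypothetical, and it assumes that a line without -> carries only a fun and that -> never appears inside a selector string; Creeper's real parser may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// splitNode separates a "name: page -> fun" line into name, page, and fun.
// When the line has no "->", page is returned as "" and the whole
// right-hand side is treated as the fun (an assumption of this sketch).
func splitNode(line string) (string, string, string) {
	name, rest, _ := strings.Cut(strings.TrimSpace(line), ":")
	left, right, found := strings.Cut(rest, "->")
	if !found {
		return strings.TrimSpace(name), "", strings.TrimSpace(left)
	}
	return strings.TrimSpace(name), strings.TrimSpace(left), strings.TrimSpace(right)
}

func main() {
	for _, line := range []string{
		`news[]: page -> $("tr.athing")`,
		`    title: $(".title a.storylink").text`,
	} {
		name, page, fun := splitNode(line)
		fmt.Printf("name=%q page=%q fun=%q\n", name, page, fun)
	}
}
```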


Fun

A fun describes how the fetched data is processed.

All supported funs are listed below:

| Name | Parameters | Description |
|------|------------|-------------|
| $ | (selector: string) | Relative CSS selector (select from parent node) |
| $root | (selector: string) | Absolute CSS selector (select from body) |
| html | | inner HTML |
| text | | inner text |
| outerHTML | | outer HTML |
| attr | (attr: string) | attribute value |
| style | | style attribute value |
| href | | href attribute value |
| src | | src attribute value |
| class | | class attribute value |
| id | | id attribute value |
| calc | (prec: int) | calculate arithmetic expression |
| match | (regexp: string) | match first sub-string via regular expression |
| expand | (regexp: string, target: string) | expand matched strings to target string |
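
For a feel of what match and expand do, here is a sketch built on Go's standard regexp package. The helpers matchFirst and expandFirst are hypothetical, and the exact semantics of Creeper's own funs (for example, whether match returns the whole match or a capture group) are not specified in the list above, so treat this as an approximation.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFirst returns the first substring of s matched by pattern,
// or "" when nothing matches. Hypothetical helper, not Creeper's API.
func matchFirst(s, pattern string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// expandFirst rewrites the first match of pattern in s using target,
// where $1, $2, ... refer to capture groups. It returns "" when
// nothing matches. Hypothetical helper, not Creeper's API.
func expandFirst(s, pattern, target string) string {
	re := regexp.MustCompile(pattern)
	loc := re.FindStringSubmatchIndex(s)
	if loc == nil {
		return ""
	}
	return string(re.ExpandString(nil, target, s, loc))
}

func main() {
	fmt.Println(matchFirst("Posted 42 minutes ago", `\d+`))
	fmt.Println(expandFirst("2017-02-17", `(\d+)-(\d+)-(\d+)`, "$3/$2/$1"))
}
```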



impl.moe · Github @wspl

    note without css selector

    gallery(@page=0) = "https://e-hentai.org/g/1034547/27cc8cb432/?p={@page}"
    tags[]: gallery -> $("div#taglist table tr td:eq(1) div a")
        name: .html

    The name will always return the first html selected by css("div#taglist table tr td:eq(1) div a").

    opened by LegendKan 3
    Problem since new commits ?


    I just copy paste your example code (hacker_news). Yesterday, it worked. Today with the new sources, it doesn't work anymore :(

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x45cdb6]
    goroutine 1 [running]:
    panic(0x65db60, 0xc42000c190)
            /opt/go/src/runtime/panic.go:500 +0x1a1
    github.com/wspl/creeper.(*Node).Value(0x0, 0x0, 0xc42001acd0, 0x0, 0xc4200340b8)
            /home/ubuntu/workspace/src/github.com/wspl/creeper/node.go:118 +0x26
    github.com/wspl/creeper.(*Creeper).Each(0xc42001acd0, 0x6cc428)
            /home/ubuntu/workspace/src/github.com/wspl/creeper/creeper.go:74 +0x8b
            /home/ubuntu/workspace/main.go:12 +0x73
    exit status 2

    (A simple println alone works :) )

    opened by Dolu89 1
    Feature: Next Page Node - functional node for directing next page

    page = "http://example.com/info?page=1"
    demo[]: page -> $(".example")
        text: $(".title").html
        @next: $("a.next").href

    I am thinking about another way to move between pages: simulating a user clicking the next-page link. We can add a @next node to indicate the next page link. The page director would switch to the next page automatically when the current page has no more content.

    New grammar features - Functional node: For assisting crawling. Node names start with @. It is less readable than private nodes.

    opened by wspl 0
    go get

    go get github.com/wspl/creeper


    fatal error: unexpected signal during runtime execution
    [signal 0xb code=0x1 addr=0x1880e6d3a41e pc=0xf0eb]

    runtime stack:
    runtime.throw(0x4971c0, 0x2a)
            /usr/local/go/src/runtime/panic.go:547 +0x90
    runtime.sigpanic()
            /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a
    runtime.unlock(0x982540)
            /usr/local/go/src/runtime/lock_sema.go:107 +0x14b
    runtime.(*mheap).alloc_m(0x982540, 0x1, 0x10000000010, 0xeed928)
            /usr/local/go/src/runtime/mheap.go:492 +0x314
    runtime.(*mheap).alloc.func1()
            /usr/local/go/src/runtime/mheap.go:502 +0x41
    runtime.systemstack(0xc82047fe58)
            /usr/local/go/src/runtime/asm_amd64.s:307 +0xab
    runtime.(*mheap).alloc(0x982540, 0x1, 0x10000000010, 0xed8f)
            /usr/local/go/src/runtime/mheap.go:503 +0x63
    runtime.(*mcentral).grow(0x983f10, 0x0)
            /usr/local/go/src/runtime/mcentral.go:209 +0x93
    runtime.(*mcentral).cacheSpan(0x983f10, 0xeed928)
            /usr/local/go/src/runtime/mcentral.go:89 +0x47d
    runtime.(*mcache).refill(0xaf4000, 0x10, 0xeed928)
            /usr/local/go/src/runtime/mcache.go:119 +0xcc
    runtime.mallocgc.func2()
            /usr/local/go/src/runtime/malloc.go:642 +0x2b
    runtime.systemstack(0xc820025500)
            /usr/local/go/src/runtime/asm_amd64.s:291 +0x79
    runtime.mstart()
            /usr/local/go/src/runtime/proc.go:1051

    goroutine 1 [running]:
    runtime.systemstack_switch()
            /usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc821c79140 sp=0xc821c79138
    runtime.mallocgc(0xf0, 0x438dc0, 0x0, 0x438dc0)
            /usr/local/go/src/runtime/malloc.go:643 +0x869 fp=0xc821c79218 sp=0xc821c79140

    opened by xwinie 0
    how to parse Json structures

    We probably use both HTML parser and JSON parser for crawling complex pages, I found that pattern files support HTML parser only, how could I use this framework to parse JSON structures or extend functionalities by myself? Thanks.

    opened by top 0
    Concurrency?


    Is it possible to scrape pages concurrently with creeper?

    opened by aidiss 0
    support Automatic cookie and session handling

    some site need auth login, can creeper support it ? 😈

    opened by cute-angelia 0
    @next node function implementation

    There are some obstacles to implementing the functional part of @next. They have stumped me for a long time:

    • InitSelector's loop call
    • Wait until the page cycle ends and blocks the total cycle when there is no next page
    help wanted 
    opened by wspl 0
