🐾 Creeper - The Next Generation Crawler Framework (Go)

Overview


About

Creeper is a next-generation crawler that fetches web pages driven by creeper scripts. As a cross-platform, embeddable crawler, you can use it for your news app, subscription program, etc.

Warning: This project is still in an early stage of development; please do not use it in a production environment.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	c := creeper.Open("./hacker_news.crs")
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}

Build and run. The console will print something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec

Town

Town is a lambda-like expression for storing an (im)mutable string. Most of the time, it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

When you need a town, use it as if you were calling a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not passed here. That is because it is a special parameter.

In a town definition line, an expression like name="something" means that the parameter name has the default value "something".

Incidentally, @page is a parameter that is automatically incremented when the current page has no more content.
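
Putting these together, here is a sketch that reuses the town defined in this section (the ext value is purely illustrative): ext is fixed per call, while @page starts at its default and advances on its own:

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

news[]: page(ext="Hello World!") -> $("tr.athing")
    title: $(".title a.storylink").text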

Node

Nodes form a tree structure that represents the data you are going to crawl.

news[]: page -> $("tr.athing")
	title: $(".title a.storylink").text
	site: $(".title span.sitestr").text
	link: $(".title a.storylink").href

Like YAML, nodes express hierarchy through indentation.

Node Name

Every node has a name. title is a field name, representing a plain string value. news[] is an array name, representing a parent structure that holds multiple sub-items.

Page

Page indicates where to fetch the field data from. It can be a town expression or a field reference.

Field references are an advanced use of nodes; you can find the details in ./eh.crs.

If a node has both a page and a fun, the page goes on the left of -> and the fun on the right: page -> fun.
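
For instance, the array node from the Hello World script combines both: page is a town expression, and $("tr.athing") is the fun applied to each fetched page:

news[]: page -> $("tr.athing")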

Fun

Fun represents a data-processing step.

These are all the supported funs:

Name      | Parameters                       | Description
--------- | -------------------------------- | -----------------------------------------------------
$         | (selector: string)               | Relative CSS selector (selects from the parent node)
$root     | (selector: string)               | Absolute CSS selector (selects from body)
html      |                                  | Inner HTML
text      |                                  | Inner text
outerHTML |                                  | Outer HTML
attr      | (attr: string)                   | Attribute value
style     |                                  | style attribute value
href      |                                  | href attribute value
src       |                                  | src attribute value
class     |                                  | class attribute value
id        |                                  | id attribute value
calc      | (prec: int)                      | Calculates an arithmetic expression
match     | (regexp: string)                 | Matches the first sub-string via a regular expression
expand    | (regexp: string, target: string) | Expands matched strings into a target string
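
As a rough sketch of how funs can combine (the selectors and the regular expression are illustrative, and chaining match after href is an assumption based on the table above, not a documented example):

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    pageTitle: $root("title").text
    id: $(".title a.storylink").href.match("[0-9]+")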

Author

Plutonist

impl.moe · GitHub @wspl

Issues
  • note without css selector

    gallery(@page=0) = "https://e-hentai.org/g/1034547/27cc8cb432/?p={@page}"
    tags[]: gallery -> $("div#taglist table tr td:eq(1) div a")
        name: .html

    name will always return the first inner HTML selected by $("div#taglist table tr td:eq(1) div a").

    bug · opened by LegendKan
  • Problem since new commits?

    Hi,

    I just copy-pasted your example code (hacker_news). Yesterday it worked; today, with the new sources, it doesn't work anymore :(

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x45cdb6]
    
    goroutine 1 [running]:
    panic(0x65db60, 0xc42000c190)
            /opt/go/src/runtime/panic.go:500 +0x1a1
    github.com/wspl/creeper.(*Node).Value(0x0, 0x0, 0xc42001acd0, 0x0, 0xc4200340b8)
            /home/ubuntu/workspace/src/github.com/wspl/creeper/node.go:118 +0x26
    github.com/wspl/creeper.(*Creeper).Each(0xc42001acd0, 0x6cc428)
            /home/ubuntu/workspace/src/github.com/wspl/creeper/creeper.go:74 +0x8b
    main.main()
            /home/ubuntu/workspace/main.go:12 +0x73
    exit status 2
    

    (A simple println alone works :) )

    bug · opened by Dolu89
  • Feature: Next Page Node - functional node for directing next page

    page = "http://example.com/info?page=1"
    demo[]: page -> $(".example")
        text: $(".title").html
        @next: $("a.next").href

    I am thinking about another method of page-number direction: simulating the user clicking through to the next page. We could add a @next node to indicate the next-page link; the page director would then switch to the next page automatically when the current page has no more content.

    New grammar feature - functional nodes: nodes that assist crawling, whose names start with @. They are less readable than private nodes.

    feature · opened by wspl
  • go get

    go get github.com/wspl/creeper

    github.com/PuerkitoBio/goquery

    fatal error: unexpected signal during runtime execution
    [signal 0xb code=0x1 addr=0x1880e6d3a41e pc=0xf0eb]

    runtime stack:
    runtime.throw(0x4971c0, 0x2a)
            /usr/local/go/src/runtime/panic.go:547 +0x90
    runtime.sigpanic()
            /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a
    runtime.unlock(0x982540)
            /usr/local/go/src/runtime/lock_sema.go:107 +0x14b
    runtime.(*mheap).alloc_m(0x982540, 0x1, 0x10000000010, 0xeed928)
            /usr/local/go/src/runtime/mheap.go:492 +0x314
    runtime.(*mheap).alloc.func1()
            /usr/local/go/src/runtime/mheap.go:502 +0x41
    runtime.systemstack(0xc82047fe58)
            /usr/local/go/src/runtime/asm_amd64.s:307 +0xab
    runtime.(*mheap).alloc(0x982540, 0x1, 0x10000000010, 0xed8f)
            /usr/local/go/src/runtime/mheap.go:503 +0x63
    runtime.(*mcentral).grow(0x983f10, 0x0)
            /usr/local/go/src/runtime/mcentral.go:209 +0x93
    runtime.(*mcentral).cacheSpan(0x983f10, 0xeed928)
            /usr/local/go/src/runtime/mcentral.go:89 +0x47d
    runtime.(*mcache).refill(0xaf4000, 0x10, 0xeed928)
            /usr/local/go/src/runtime/mcache.go:119 +0xcc
    runtime.mallocgc.func2()
            /usr/local/go/src/runtime/malloc.go:642 +0x2b
    runtime.systemstack(0xc820025500)
            /usr/local/go/src/runtime/asm_amd64.s:291 +0x79
    runtime.mstart()
            /usr/local/go/src/runtime/proc.go:1051

    goroutine 1 [running]:
    runtime.systemstack_switch()
            /usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc821c79140 sp=0xc821c79138
    runtime.mallocgc(0xf0, 0x438dc0, 0x0, 0x438dc0)
            /usr/local/go/src/runtime/malloc.go:643 +0x869 fp=0xc821c79218 sp=0xc821c79140

    opened by xwinie
  • how to parse Json structures

    We probably need both an HTML parser and a JSON parser when crawling complex pages. I found that pattern files support HTML parsing only; how could I use this framework to parse JSON structures, or extend its functionality myself? Thanks.

    opened by top
  • Concurrency?

    Is it possible to scrape pages concurrently with creeper?

    opened by aidiss
  • support Automatic cookie and session handling

    Some sites require an authenticated login; can creeper support that? 😈

    opened by cute-angelia
  • @next node function implementation

    There are some hindrances in implementing the functional part of @next, and they have stumped me for a long time:

    • InitSelector's loop call
    • Waiting until the page cycle ends blocks the overall cycle when there is no next page

    help wanted · opened by wspl