Distributed web crawler management platform for spider management, regardless of language and framework.

Overview

Crawlab

中文 | English

Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer

Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, and PHP, as well as various web crawler frameworks including Scrapy, Puppeteer, and Selenium.

Demo | Documentation

Installation

Three methods:

  1. Docker (Recommended)
  2. Direct Deploy (Check Internal Kernel)
  3. Kubernetes (Multi-Node Deployment)

Pre-requisite (Docker)

  • Docker 18.03+
  • Redis 5.x+
  • MongoDB 3.6+
  • Docker Compose 1.24+ (optional but recommended)

Pre-requisite (Direct Deploy)

  • Go 1.12+
  • Node 8.12+
  • Redis 5.x+
  • MongoDB 3.6+

Quick Start

Please open a terminal and execute the commands below. Make sure you have installed docker-compose in advance.

git clone https://github.com/crawlab-team/crawlab
cd crawlab
docker-compose up -d

Next, you can look into the docker-compose.yml (with detailed config params) and the Documentation (Chinese) for further information.

Run

Docker

Please use docker-compose for one-click startup. By doing so, you don't even have to configure the MongoDB and Redis databases. Create a file named docker-compose.yml and enter the content below.

version: '3.3'
services:
  master: 
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:    
      - "8080:8080"
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"

Then execute the command below, and the Crawlab Master Node, MongoDB, and Redis will start up. Open a browser, navigate to http://localhost:8080, and you will see the UI.

docker-compose up

For Docker deployment details, please refer to the relevant documentation.

Screenshot

Login

Home Page

Node List

Node Network

Spider List

Spider Overview

Spider Analytics

Spider File Edit

Task Log

Task Results

Cron Job

Language Installation

Dependency Installation

Notifications

Architecture

The architecture of Crawlab consists of the Master Node, multiple Worker Nodes, and the Redis and MongoDB databases, which are mainly used for node communication and data storage.

The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it begins to execute the crawling task and stores the results in MongoDB. The architecture is much more concise compared with versions before v0.3.0: the unnecessary Flower module, which offered node monitoring services, has been removed, and node monitoring is now handled by Redis.

Master Node

The Master Node is the core of the Crawlab architecture. It is the central control system of Crawlab.

The Master Node offers the following services:

  1. Crawling Task Coordination;
  2. Worker Node Management and Communication;
  3. Spider Deployment;
  4. Frontend and API Services;
  5. Task Execution (one can regard the Master Node as a Worker Node)

The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. At the same time, the Master Node synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.

Worker Node

The main functionality of the Worker Nodes is to execute crawling tasks, store results and logs, and communicate with the Master Node through Redis PubSub. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.

MongoDB

MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium through which the Master Node stores spider files and synchronizes them to the Worker Nodes.
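
To make the GridFS idea concrete, here is a minimal, hypothetical pymongo/gridfs sketch (not Crawlab's internal code); the connection string, the database name crawlab_test, and the file name my_spider.zip are placeholder assumptions.

# Illustrative GridFS sketch with pymongo; adjust connection details to your environment.
import gridfs
from pymongo import MongoClient

db = MongoClient('mongodb://localhost:27017')['crawlab_test']  # assumed database name
fs = gridfs.GridFS(db)

# Master side: store a packaged spider archive in GridFS.
with open('my_spider.zip', 'rb') as f:
    file_id = fs.put(f.read(), filename='my_spider.zip')

# Worker side: fetch the archive back (by id or filename) to deploy it locally.
archive_bytes = fs.get(file_id).read()
print(len(archive_bytes), 'bytes retrieved')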

Redis

Redis is a very popular key-value database. It provides node communication services in Crawlab. For example, nodes execute HSET to write their info into a hash named nodes in Redis, and the Master Node identifies online nodes according to that hash.
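
As an illustration of this registration pattern (a minimal sketch, not Crawlab's actual code), the snippet below uses the Python redis client; the node key, the info fields, and the connection address are assumptions.

# Hypothetical node registration sketch with redis-py.
import json
import socket
import redis

r = redis.Redis(host='localhost', port=6379)  # assumed Redis address

# A node writes its info into the 'nodes' hash under its own key...
node_key = socket.gethostname()  # assumption: hostname used as the node key
r.hset('nodes', node_key, json.dumps({'ip': '192.168.0.10', 'status': 'online'}))

# ...and the Master Node scans the hash to identify online nodes.
for key, raw in r.hgetall('nodes').items():
    print(key.decode(), json.loads(raw))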

Frontend

The frontend is a SPA based on Vue-Element-Admin. It reuses many Element-UI components to support the corresponding views.

Integration with Other Frameworks

Crawlab SDK provides some helper methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️ Note: make sure you have already installed crawlab-sdk using pip.

Scrapy

In settings.py in your Scrapy project, find the variable named ITEM_PIPELINES (a dict) and add the content below.

ITEM_PIPELINES = {
    'crawlab.pipelines.CrawlabMongoPipeline': 888,
}

Then, start the Scrapy spider. After it's done, you should be able to see scraped results in Task Detail -> Result.

General Python Spider

Please add the content below to your spider files to save results.

# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)

Then, start the spider. After it's done, you should be able to see scraped results in Task Detail -> Result.

Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process in the form of an environment variable named CRAWLAB_TASK_ID; this is how the scraped data can be related to a task. In addition, another environment variable, CRAWLAB_COLLECTION, is passed by Crawlab as the name of the collection in which to store the results data.
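
For example, a spider written in any language can read these environment variables and attach the Task ID to its results. The Python sketch below is only an illustration under assumptions: the MongoDB address, the database name crawlab_test, and the task_id field name are placeholders, not guaranteed by Crawlab.

# Hypothetical example: relate scraped data to the current task via environment variables.
import os
from pymongo import MongoClient

task_id = os.environ.get('CRAWLAB_TASK_ID')                   # task id injected by Crawlab
col_name = os.environ.get('CRAWLAB_COLLECTION', 'results')    # results collection name

db = MongoClient('mongodb://localhost:27017')['crawlab_test']  # assumed address and database name

# Attach the task id so the record can be related back to the task.
result = {'name': 'crawlab', 'task_id': task_id}
db[col_name].insert_one(result)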

Comparison with Other Frameworks

There are existing spider management frameworks. So why use Crawlab?

The reason is that most of the existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.

Crawlab is easy to use and general enough to run spiders in any language and any framework. It also has a beautiful frontend interface that makes it much easier for users to manage spiders.

  • Crawlab (Golang + Vue)
    Pros: Not limited to Scrapy; available for all programming languages and frameworks. Beautiful UI. Naturally supports distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, etc.
    Cons: Does not yet support spider versioning.
  • ScrapydWeb (Python Flask + Vue)
    Pros: Beautiful UI, built-in Scrapy log parser, stats and graphs for task execution; supports node management, cron jobs, mail notification and mobile. A full-featured spider management platform.
    Cons: Does not support spiders other than Scrapy. Limited performance because of the Python Flask backend.
  • Gerapy (Python Django + Vue)
    Pros: Built by web crawler guru Germey Cui. Simple installation and deployment. Beautiful UI. Supports node management, code editing, configurable crawl rules, etc.
    Cons: Again, does not support spiders other than Scrapy. A lot of bugs based on user feedback in v1.0; improvements expected in v2.0.
  • SpiderKeeper (Python Flask)
    Pros: An open-source Scrapyhub. Concise and simple UI. Supports cron jobs.
    Cons: Perhaps too simplified; does not support pagination, node management, or spiders other than Scrapy.

Contributors

Community & Sponsorship

If you feel Crawlab could benefit your daily work or your company, please add the author's WeChat account with the note "Crawlab" to join the discussion group. Or scan the Alipay QR code below to give us a reward to help upgrade our teamwork software or buy us a coffee.

Issues
  • Running spiders ?!

    I can't run spiders either outside the Docker master container or inside it. Inside, I get this error after typing crawlab upload spider: Not authorized, Error logging in.

    Outside, I can only upload a zip file, and it's not connected to my script and doesn't return any data.

    Any help?!

    question 
    opened by Nesma-m7md 14
  • admin login fails via VS Code port forwarding

    Bug description: I deployed the machine on an intranet server and opened it locally via the VS Code port forwarding feature. For example, when both the username and password entered are admin, the login does not work. VS Code port configuration (screenshot)

    Login error (screenshot)

    Ruling out other factors: the remote host's port 6800 serves the scrapyd admin UI, and VS Code can forward the remote host's port 6800 to a local port. (screenshot)

    Expected result: logging in as admin should work.

    bug 
    opened by kevinzhangcode 11
  • gRPC Client Cannot Connect to the master node!

    Bug

    When I tried building Crawlab with Docker, the worker node could not connect to the master node.

    YML File

    version: '3.3'
    services:
      master: 
        image: crawlabteam/crawlab:latest
        container_name: crawlab_example_master
        environment:
          CRAWLAB_NODE_MASTER: "Y"
          CRAWLAB_MONGO_HOST: "mongo"
          CRAWLAB_GRPC_SERVER_ADDRESS: "0.0.0.0:9666"
          CRAWLAB_SERVER_HOST: "0.0.0.0"
          CRAWLAB_GRPC_AUTHKEY: "youcanneverguess"
        volumes:
          - "./.crawlab/master:/root/.crawlab"
        ports:    
          - "8080:8080"
          - "9666:9666"
          - "8000:8000"
        depends_on:
          - mongo
    
      worker01: 
        image: crawlabteam/crawlab:latest
        container_name: crawlab_example_worker01
        environment:
          CRAWLAB_NODE_MASTER: "N"
          CRAWLAB_GRPC_ADDRESS: "MY_Public_IP_Address:9666"
          CRAWLAB_GRPC_AUTHKEY: "youcanneverguess"
          CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
        volumes:
          - "./.crawlab/worker01:/root/.crawlab"
        depends_on:
          - master
    
      mongo:
        image: mongo:latest
        container_name: crawlab_example_mongo
        restart: always
    


    bug question 
    opened by edmund-zhao 11
  • Seaweedfs / Gocolly integration

    Hi all,

    Hope you are all well!

    I was just wondering whether it is possible to integrate these two awesome tools into Crawlab; it would be awesome for storing millions of static objects and for scraping with Golang. A friend and I already did that with https://github.com/lucmichalski/peaks-tires, but we lack horizontal scaling and a crawling management interface. That's why, and how, we found Crawlab.

    • https://github.com/gocolly/colly Elegant Scraper and Crawler Framework for Golang

    • https://github.com/chrislusf/seaweedfs SeaweedFS is a simple and highly scalable distributed file system, to store and serve billions of files fast! SeaweedFS implements an object store with O(1) disk seek, transparent cloud integration, and an optional Filer supporting POSIX, S3 API, AES256 encryption, Rack-Aware Erasure Coding for warm storage, FUSE mount, Hadoop compatible, WebDAV.

    Thanks for your insights and feedback on the topic.

    Cheers, X

    enhancement 
    opened by x0rzkov 10
  • The latest master branch code cannot start normally via Docker

    Bug description: the latest master branch code cannot start normally via Docker.

    Steps to reproduce:

    1. Pull the latest code
    2. docker-compose up
    3. Error: 2021/08/10 17:44:51 error grpc client connect error: grpc error: client failed to start. reattempt in 51.7 seconds
    4. Check out 517ae21e13a57e0d9c074b162793aee689f99c0d
    5. docker-compose up
    6. Runs normally

    Expected result: Docker should work normally.

    bug v0.6 
    opened by zires 9
  • The toscrapy_books spider was run on all nodes and the task ran twice

    To Reproduce: run the toscrapy_books spider with all nodes selected; the task runs twice. (screenshot)

    Expected behavior: the spider should crawl only once.

    good first issue 
    opened by EkkoG 9
  • Installed with docker-compose, the master crashes about once a week and needs a manual restart

    This is the log file output:

    crawlab-master | 2019/11/10 06:00:00 error handle task error:open /var/logs/crawlab/5daef3fd05363c0015606068/20191110060000.log: no such file or directory
    crawlab-master | 2019/11/10 06:00:00 error [Worker 3] open /var/logs/crawlab/5daef3fd05363c0015606068/20191110060000.log: no such file or directory
    crawlab-master | fatal error: concurrent map writes
    crawlab-master | fatal error: concurrent map writes
    crawlab-master | 2019/11/11 12:03:39 error open /var/logs/crawlab/5daef3fd05363c0015606068/20191111021501.log: no such file or directory
    crawlab-master | 2019/11/11 12:03:39 error open /var/logs/crawlab/5daef3fd05363c0015606068/20191111021501.log: no such file or directory

    bug 
    opened by yjiu1990 9
  • Scrapy directory structure issue

    After uploading a spider, if my spider does not strictly follow the Scrapy project structure, the corresponding files are not recognized in Spider -> Spider Detail -> Scrapy Settings, and the console also reports errors.

    For example, Scrapy keeps settings, pipelines, and middlewares in the same folder, whereas my spider splits pipelines and middlewares into separate folders.

    bug wontfix v0.6 
    opened by TalentedBastard 7
  • "TypeError: res is undefined" when i tried to sign in

    I did everything according to the instructions from GitHub. When I try to login, I get an error "TypeError: res is undefined". Please, help me to resolve that problem Screenshot from 2021-12-10 22-41-58 Screenshot from 2021-12-10 22-41-41 Screenshot from 2021-12-10 22-44-24

    bug v0.6 
    opened by Kohtie 7
  • Redis and MongoDB start normally, but Crawlab fails to start

    First, this is my startup command; 10.25.3.15 is my host machine's IP, not a container IP.

    docker run -d --name crawlab \
            -e CRAWLAB_REDIS_ADDRESS=10.25.3.15:6379 \
            -e CRAWLAB_MONGO_HOST=10.25.3.15:27017 \
            -e CRAWLAB_SERVER_MASTER=Y \
            -e CRAWLAB_API_ADDRESS=http://10.25.3.15:8000 \
            -p 8080:8080 \
            -p 8000:8000 \
            -v /var/logs/crawlab:/var/logs/crawlab \
            tikazyq/crawlab:0.3.0
    

    Second, Redis and MongoDB are running normally and can both be connected to with a client.

    # docker ps -a
    CONTAINER ID        IMAGE                              COMMAND                  CREATED              STATUS                          PORTS                      NAMES
    cbc9afa4dd23        tikazyq/crawlab:0.3.0              "/bin/sh /app/docker…"   About a minute ago   Exited (2) About a minute ago                              crawlab
    3e699d7894f4        redis:latest                       "docker-entrypoint.s…"   7 hours ago          Up 7 hours                      0.0.0.0:6379->6379/tcp     musing_euler
    1f13f0deb7b9        mongo                              "docker-entrypoint.s…"   26 hours ago         Up 26 hours                     0.0.0.0:27017->27017/tcp   mongo
    

    This is the error reported by Crawlab:

    2019-12-20T09:16:04.328836179Z sed: -e expression #1, char 24: unknown option to `s'
    2019-12-20T09:16:04.342651601Z  * Starting nginx nginx
    2019-12-20T09:16:04.381728628Z    ...done.
    2019-12-20T09:16:04.400495525Z [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
    2019-12-20T09:16:04.400536039Z
    2019-12-20T09:16:04.400540159Z [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
    2019-12-20T09:16:04.400544463Z  - using env:	export GIN_MODE=release
    2019-12-20T09:16:04.400548025Z  - using code:	gin.SetMode(gin.ReleaseMode)
    2019-12-20T09:16:04.400551579Z
    2019-12-20T09:16:04.401112841Z 2019/12/20 09:16:04  info 初始化配置成功
    2019-12-20T09:16:04.401136592Z 2019/12/20 09:16:04  info 初始化日志设置成功
    2019-12-20T09:16:14.906247470Z goroutine 1 [running]:
    2019-12-20T09:16:14.906288803Z runtime/debug.Stack(0x0, 0x0, 0x0)
    2019-12-20T09:16:14.906297173Z 	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
    2019-12-20T09:16:14.906304320Z runtime/debug.PrintStack()
    2019-12-20T09:16:14.906310283Z 	/usr/local/go/src/runtime/debug/stack.go:16 +0x22
    2019-12-20T09:16:14.906315800Z main.main()
    2019-12-20T09:16:14.906321270Z 	/go/src/app/main.go:33 +0x142
    2019-12-20T09:16:14.908342672Z panic: no reachable servers
    2019-12-20T09:16:14.908374372Z
    2019-12-20T09:16:14.908385165Z goroutine 1 [running]:
    2019-12-20T09:16:14.908393479Z main.main()
    2019-12-20T09:16:14.908401642Z 	/go/src/app/main.go:34 +0x1cde
    

    Crawlab fails to start up. If I add --restart always, the frontend can be displayed, but clicking login just spins forever.

    good first issue 
    opened by simon824 7
  • Has the HTML in docker-compose actually been tested?

    chunk-vendors.740bfa25.js:sourcemap:54 OPTIONS http://localhost:8000/nodes net::ERR_CONNECTION_REFUSED
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:54
    e.exports @ chunk-vendors.740bfa25.js:sourcemap:54
    e.exports @ chunk-vendors.740bfa25.js:sourcemap:22
    Promise.then (async)
    o.request @ chunk-vendors.740bfa25.js:sourcemap:8
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:8
    (anonymous) @ request.js:12
    x @ chunk-vendors.740bfa25.js:sourcemap:42
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    e.<computed> @ chunk-vendors.740bfa25.js:sourcemap:42
    c @ chunk-vendors.740bfa25.js:sourcemap:22
    o @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    t @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ request.js:6
    f @ request.js:54
    getNodeList @ node.js:37
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:20
    p.dispatch @ chunk-vendors.740bfa25.js:sourcemap:20
    dispatch @ chunk-vendors.740bfa25.js:sourcemap:20
    mounted @ DialogView.vue:155
    nt @ chunk-vendors.740bfa25.js:sourcemap:14
    Fn @ chunk-vendors.740bfa25.js:sourcemap:14
    insert @ chunk-vendors.740bfa25.js:sourcemap:14
    V @ chunk-vendors.740bfa25.js:sourcemap:14
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:14
    Tn.e._update @ chunk-vendors.740bfa25.js:sourcemap:14
    i @ chunk-vendors.740bfa25.js:sourcemap:14
    ti.get @ chunk-vendors.740bfa25.js:sourcemap:14
    ti @ chunk-vendors.740bfa25.js:sourcemap:14
    On @ chunk-vendors.740bfa25.js:sourcemap:14
    Ci.$mount @ chunk-vendors.740bfa25.js:sourcemap:14
    e._init @ chunk-vendors.740bfa25.js:sourcemap:14
    Ci @ chunk-vendors.740bfa25.js:sourcemap:14
    56d7 @ main.js:70
    c @ bootstrap:88
    0 @ bootstrap:262
    c @ bootstrap:88
    n @ bootstrap:45
    (anonymous) @ bootstrap:262
    (anonymous) @ bootstrap:262
    request.js:23 Uncaught (in promise) TypeError: Cannot read property 'status' of undefined
        at request.js:23
        at x (chunk-vendors.740bfa25.js:sourcemap:42)
        at Generator._invoke (chunk-vendors.740bfa25.js:sourcemap:42)
        at Generator.e.<computed> [as throw] (chunk-vendors.740bfa25.js:sourcemap:42)
        at c (chunk-vendors.740bfa25.js:sourcemap:22)
        at s (chunk-vendors.740bfa25.js:sourcemap:22)
    (anonymous) @ request.js:23
    x @ chunk-vendors.740bfa25.js:sourcemap:42
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    e.<computed> @ chunk-vendors.740bfa25.js:sourcemap:42
    c @ chunk-vendors.740bfa25.js:sourcemap:22
    s @ chunk-vendors.740bfa25.js:sourcemap:22
    Promise.then (async)
    getNodeList @ node.js:38
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:20
    p.dispatch @ chunk-vendors.740bfa25.js:sourcemap:20
    dispatch @ chunk-vendors.740bfa25.js:sourcemap:20
    mounted @ DialogView.vue:155
    nt @ chunk-vendors.740bfa25.js:sourcemap:14
    Fn @ chunk-vendors.740bfa25.js:sourcemap:14
    insert @ chunk-vendors.740bfa25.js:sourcemap:14
    V @ chunk-vendors.740bfa25.js:sourcemap:14
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:14
    Tn.e._update @ chunk-vendors.740bfa25.js:sourcemap:14
    i @ chunk-vendors.740bfa25.js:sourcemap:14
    ti.get @ chunk-vendors.740bfa25.js:sourcemap:14
    ti @ chunk-vendors.740bfa25.js:sourcemap:14
    On @ chunk-vendors.740bfa25.js:sourcemap:14
    Ci.$mount @ chunk-vendors.740bfa25.js:sourcemap:14
    e._init @ chunk-vendors.740bfa25.js:sourcemap:14
    Ci @ chunk-vendors.740bfa25.js:sourcemap:14
    56d7 @ main.js:70
    c @ bootstrap:88
    0 @ bootstrap:262
    c @ bootstrap:88
    n @ bootstrap:45
    (anonymous) @ bootstrap:262
    (anonymous) @ bootstrap:262
    2chunk-vendors.740bfa25.js:sourcemap:54 OPTIONS http://localhost:8000/login net::ERR_CONNECTION_REFUSED
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:54
    e.exports @ chunk-vendors.740bfa25.js:sourcemap:54
    e.exports @ chunk-vendors.740bfa25.js:sourcemap:22
    Promise.then (async)
    o.request @ chunk-vendors.740bfa25.js:sourcemap:8
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:8
    (anonymous) @ request.js:12
    x @ chunk-vendors.740bfa25.js:sourcemap:42
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    e.<computed> @ chunk-vendors.740bfa25.js:sourcemap:42
    c @ chunk-vendors.740bfa25.js:sourcemap:22
    o @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    t @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ request.js:6
    p @ request.js:58
    (anonymous) @ user.js:72
    t @ chunk-vendors.740bfa25.js:sourcemap:22
    login @ user.js:71
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:20
    p.dispatch @ chunk-vendors.740bfa25.js:sourcemap:20
    dispatch @ chunk-vendors.740bfa25.js:sourcemap:20
    (anonymous) @ index.vue:134
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    z @ chunk-vendors.740bfa25.js:sourcemap:42
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    u @ chunk-vendors.740bfa25.js:sourcemap:42
    c @ chunk-vendors.740bfa25.js:sourcemap:42
    u @ chunk-vendors.740bfa25.js:sourcemap:42
    e @ index.vue:92
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    c @ chunk-vendors.740bfa25.js:sourcemap:42
    d @ chunk-vendors.740bfa25.js:sourcemap:42
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    m @ chunk-vendors.740bfa25.js:sourcemap:42
    validate @ chunk-vendors.740bfa25.js:sourcemap:42
    validate @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    validate @ chunk-vendors.740bfa25.js:sourcemap:22
    handleLogin @ index.vue:131
    click @ index.vue?6294:1
    nt @ chunk-vendors.740bfa25.js:sourcemap:14
    n @ chunk-vendors.740bfa25.js:sourcemap:14
    Jr.c._wrapper @ chunk-vendors.740bfa25.js:sourcemap:14
    chunk-vendors.740bfa25.js:sourcemap:54 OPTIONS http://localhost:8000/users net::ERR_CONNECTION_REFUSED
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:54
    e.exports @ chunk-vendors.740bfa25.js:sourcemap:54
    e.exports @ chunk-vendors.740bfa25.js:sourcemap:22
    Promise.then (async)
    o.request @ chunk-vendors.740bfa25.js:sourcemap:8
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:8
    (anonymous) @ request.js:12
    x @ chunk-vendors.740bfa25.js:sourcemap:42
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    e.<computed> @ chunk-vendors.740bfa25.js:sourcemap:42
    c @ chunk-vendors.740bfa25.js:sourcemap:22
    o @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    t @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ request.js:6
    h @ request.js:62
    (anonymous) @ user.js:97
    t @ chunk-vendors.740bfa25.js:sourcemap:22
    register @ user.js:96
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:20
    p.dispatch @ chunk-vendors.740bfa25.js:sourcemap:20
    dispatch @ chunk-vendors.740bfa25.js:sourcemap:20
    (anonymous) @ index.vue:149
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    z @ chunk-vendors.740bfa25.js:sourcemap:42
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    u @ chunk-vendors.740bfa25.js:sourcemap:42
    c @ chunk-vendors.740bfa25.js:sourcemap:42
    u @ chunk-vendors.740bfa25.js:sourcemap:42
    i @ index.vue:100
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    c @ chunk-vendors.740bfa25.js:sourcemap:42
    d @ chunk-vendors.740bfa25.js:sourcemap:42
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:42
    m @ chunk-vendors.740bfa25.js:sourcemap:42
    validate @ chunk-vendors.740bfa25.js:sourcemap:42
    validate @ chunk-vendors.740bfa25.js:sourcemap:22
    (anonymous) @ chunk-vendors.740bfa25.js:sourcemap:22
    validate @ chunk-vendors.740bfa25.js:sourcemap:22
    handleSignup @ index.vue:146
    click @ index.vue?6294:1
    nt @ chunk-vendors.740bfa25.js:sourcemap:14
    n @ chunk-vendors.740bfa25.js:sourcemap:14
    Jr.c._wrapper @ chunk-vendors.740bfa25.js:sourcemap:14
    user.js:102 Uncaught (in promise) TypeError: Cannot read property 'data' of undefined
        at user.js:102
    

    http://localhost:8000/nodes

    good first issue 
    opened by 36k-wild-monkey 7
  • Do I need to configure Git every time I create a new spider?

    When there are many spider scripts (python3 test/xxx.py, python3 test/xxx1.py, python3 test/xxx1.py), I have to create three spiders that differ only in directory but belong to the same project. Do I have to configure the Git setup three times and pull down three copies of the code? Could this be optimized so that an already configured Git repo can be selected, and after choosing the project, files under the same directory can be run? Thanks for answering.

    enhancement question 
    opened by Lu-dashuai 3
  • Python dependency installation: the "pyodbc" package never installs successfully

    Problem description

    Hello, when I install the Python pyodbc package through the web UI, it never succeeds. Some nodes show an error, while others just keep spinning without indicating success or failure. What could be causing this?

    Error Log

    ` running build running build_ext building 'pyodbc' extension creating build creating build/temp.linux-x86_64-3.8 creating build/temp.linux-x86_64-3.8/src x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.32 -I/usr/include/python3.8 -c src/buffer.cpp -o build/temp.linux-x86_64-3.8/src/buffer.o -Wno-write-strings In file included from src/buffer.cpp:12: src/pyodbc.h:56:10: fatal error: sql.h: No such file or directory 56 | #include Collecting pyodbc Using cached pyodbc-4.0.32.tar.gz (280 kB) Building wheels for collected packages: pyodbc Building wheel for pyodbc (setup.py): started Building wheel for pyodbc (setup.py): finished with status 'error' ERROR: Command errored out with exit status 1: command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-6g3x7c1w/pyodbc/setup.py'"'"'; file='"'"'/tmp/pip-install-6g3x7c1w/pyodbc/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-6fw9k9ou cwd: /tmp/pip-install-6g3x7c1w/pyodbc/ Complete output (14 lines): running bdist_wheel | ^~~~~~~ compilation terminated. error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    ERROR: Failed building wheel for pyodbc Running setup.py clean for pyodbc Failed to build pyodbc Installing collected packages: pyodbc Running setup.py install for pyodbc: started Running setup.py install for pyodbc: finished with status 'error' ERROR: Command errored out with exit status 1: command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-6g3x7c1w/pyodbc/setup.py'"'"'; file='"'"'/tmp/pip-install-6g3x7c1w/pyodbc/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-75gk25qf/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/pyodbc cwd: /tmp/pip-install-6g3x7c1w/pyodbc/ Complete output (14 lines): running install running build running build_ext building 'pyodbc' extension creating build creating build/temp.linux-x86_64-3.8 creating build/temp.linux-x86_64-3.8/src x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.32 -I/usr/include/python3.8 -c src/buffer.cpp -o build/temp.linux-x86_64-3.8/src/buffer.o -Wno-write-strings In file included from src/buffer.cpp:12: src/pyodbc.h:56:10: fatal error: sql.h: No such file or directory 56 | #include | ^~~~~~~ compilation terminated. error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

    ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-6g3x7c1w/pyodbc/setup.py'"'"'; file='"'"'/tmp/pip-install-6g3x7c1w/pyodbc/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-75gk25qf/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/pyodbc Check the logs for full command output. `

    bug v0.6 plugin 
    opened by Yuehua-Liu 8
  • Error when running save_item with the Scrapy pipeline integration

    Bug description: for example, when integrating Scrapy's save_item into the pipeline with v0.6, running it reports an error.

    Steps to reproduce:

    1. In a new Scrapy project, add 'crawlab.scrapy.pipelines.CrawlabPipeline': 888 to ITEM_PIPELINES in settings
    2. Run the project (which was previously OK); it reports the following error: 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2022-06-19 19:29:29 [twisted] CRITICAL: Unhandled error in Deferred: 2022-06-19 19:29:29 [twisted] CRITICAL: Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks result = current_context.run(gen.send, result) File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 87, in crawl self.engine = self._create_engine() File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 101, in _create_engine return ExecutionEngine(self, lambda _: self.stop()) File "/usr/local/lib/python3.7/site-packages/scrapy/core/engine.py", line 70, in init self.scraper = Scraper(crawler) File "/usr/local/lib/python3.7/site-packages/scrapy/core/scraper.py", line 71, in init self.itemproc = itemproc_cls.from_crawler(crawler) File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 53, in from_crawler return cls.from_settings(crawler.settings, crawler) File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings mwcls = load_object(clspath) File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 61, in load_object mod = import_module(module) File "/usr/lib64/python3.7/importlib/init.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1006, in _gcd_import File "", line 983, in _find_and_load File "", line 953, in _find_and_load_unlocked File "", line 219, in _call_with_frames_removed File "", line 1006, in _gcd_import File "", line 983, in _find_and_load File "", line 965, in _find_and_load_unlocked ModuleNotFoundError: No module named 'crawlab.scrapy'

    Previously, the error was that crawlab could not be found. After installing it separately with pip install crawlab, the error above appeared, and it cannot be installed with pip. [[email protected] 62adaf0b2fc909326db774e7]# pip install crawlab.scrapy ERROR: Could not find a version that satisfies the requirement crawlab.scrapy (from versions: none) ERROR: No matching distribution found for crawlab.scrapy

    Expected result: data should be saved to MongoDB normally and work without errors.

    Note: the CPU is ARM64. In the corresponding software environment, the Scrapy version is not the default of v0.6, because the newer version previously threw a Reactor error, so it was downgraded to 2.5.1.

    bug scrapy v0.6 sdk 
    opened by jjzhoujun 1
  • Cannot download scrape results

    Describe the bug: after the Scrapy task finished, I cannot download the results. The button is disabled, and the tooltip says Export (currently unavailable).

    To Reproduce: just run Crawlab from the basic docker-compose.yml, enter Spiders, choose the Scrapy quotes spider, and open the Data tab at the bottom of the page; the button is disabled.

    Expected behavior: I should be able to download the fetched results in at least CSV format.

    enhancement v0.6 
    opened by gregorth 2
  • MongoDB network issue results in Crawlab master node shutdown

    Describe the bug: a MongoDB network issue results in the Crawlab master node shutting down.

    Master container logs are as below.

    crawlab_example_master | server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo:27017, Type: Unknown, Last error: connection(mongo:27017[-7]) unable to write wire message to network: write tcp 192.168.80.3:47772->192.168.80.2:27017: i/o timeout }, ] }
    crawlab_example_master | /go/pkg/mod/github.com/crawlab-team/[email protected]/trace.go:13 github.com/crawlab-team/go-trace.TraceError()
    crawlab_example_master | /go/pkg/mod/github.com/crawlab-team/[email protected]/models/service/binder_list.go:82 github.com/crawlab-team/crawlab-core/models/service.(*ListBinder).Process()
    crawlab_example_master | /go/pkg/mod/github.com/crawlab-team/[email protected]/models/service/binder_list.go:44 github.com/crawlab-team/crawlab-core/models/service.(*ListBinder).Bind()
    crawlab_example_master | /go/pkg/mod/github.com/crawlab-team/[email protected]/models/service/base_service.go:67 github.com/crawlab-team/crawlab-core/models/service.(*BaseService).GetList()
    crawlab_example_master | /go/pkg/mod/github.com/crawlab-team/[email protected]/models/service/schedule_service.go:34 github.com/crawlab-team/crawlab-core/models/service.(*Service).GetScheduleList()
    crawlab_example_master | /go/pkg/mod/github.com/crawlab-team/[email protected]/schedule/service.go:175 github.com/crawlab-team/crawlab-core/schedule.(*Service).fetch()
    crawlab_example_master | /go/pkg/mod/github.com/crawlab-team/[email protected]/schedule/service.go:134 github.com/crawlab-team/crawlab-core/schedule.(*Service).update()
    crawlab_example_master | /go/pkg/mod/github.com/crawlab-team/[email protected]/schedule/service.go:122 github.com/crawlab-team/crawlab-core/schedule.(*Service).Update()
    crawlab_example_master | /usr/local/go/src/runtime/asm_amd64.s:1371 runtime.goexit()
    crawlab_example_master | panic: runtime error: invalid memory address or nil pointer dereference
    crawlab_example_master | [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xf1bf72]
    crawlab_example_master |
    crawlab_example_master | goroutine 140 [running]:
    crawlab_example_master | github.com/crawlab-team/crawlab-core/models/service.(*Service).GetScheduleList(0xc00084cbe0, 0xc0747bf620, 0x0, 0x7, 0xc074840f28, 0x3fd0000000000000, 0x0, 0x2d353862342d3131)
    crawlab_example_master | 	/go/pkg/mod/github.com/crawlab-team/[email protected]/models/service/schedule_service.go:35 +0x72
    crawlab_example_master | github.com/crawlab-team/crawlab-core/schedule.(*Service).fetch(0xc00010e000, 0x49794d6a5a444d34, 0x49695a69356d4973)
    crawlab_example_master | 	/go/pkg/mod/github.com/crawlab-team/[email protected]/schedule/service.go:175 +0xad
    crawlab_example_master | github.com/crawlab-team/crawlab-core/schedule.(*Service).update(0xc00010e000)
    crawlab_example_master | 	/go/pkg/mod/github.com/crawlab-team/[email protected]/schedule/service.go:134 +0x45
    crawlab_example_master | github.com/crawlab-team/crawlab-core/schedule.(*Service).Update(0xc00010e000)
    crawlab_example_master | 	/go/pkg/mod/github.com/crawlab-team/[email protected]/schedule/service.go:122 +0x28
    crawlab_example_master | created by github.com/crawlab-team/crawlab-core/schedule.(*Service).Start
    crawlab_example_master | 	/go/pkg/mod/github.com/crawlab-team/[email protected]/schedule/service.go:78 +0x51
    

    To Reproduce Steps to reproduce the behavior:

    1. cut MongoDB network to Crawlab master node

    Expected behavior: Master should not shut down even if MongoDB is not accessible.

    bug v0.6 
    opened by tikazyq 0
Releases (v0.6.0)
  • v0.6.0(May 23, 2022)

    Change Log (v0.6.0)

    Overview

    As a major release, v0.6.0 consists of a number of large changes to enhance the performance, scalability, robustness and usability of Crawlab. This version is theoretically more robust than older versions, mainly in task execution, file synchronization and node management, yet we still recommend users to run thorough tests with various samples.

    Enhancements

    Backend

    • File Synchronization. Migrated file sync from MongoDB GridFS to SeaweedFS for better stability and robustness.
    • Node Communication. Migrated node communication from Redis-based RPC to gRPC. Worker nodes indirectly interact with MongoDB by making gRPC calls to the master node.
    • Task Queue. Migrated task queue from Redis list to MongoDB collection to allow more flexibility (e.g. priority queue).
    • Logging. Migrated logging storage system to SeaweedFS to resolve performance issue in MongoDB.
    • SDK Integration. Migrated results data ingestion from native SDK to task handler side.
    • Task Related. Abstracted task related logics into Task Scheduler, Task Handler and Task Runners to increase decoupling and improve scalability and maintainability.
    • Componentization. Introduced a DI (dependency injection) framework and componentized modules, services and sub-systems.
    • Plugin Framework. Crawlab Plugin Framework (CPF) has been released. See more info [here](https://docs.crawlab.cn/en/guide/plugin/).
    • Git Integration. Git integration is implemented as a built-in feature.
    • Scrapy Integration. Scrapy integration is implemented as a plugin [spider-assistant](https://docs.crawlab.cn/en/guide/plugin/plugin-spider-assistant).
    • Dependency Integration. Dependency integration is implemented as a plugin [dependency](https://docs.crawlab.cn/en/guide/plugin/plugin-dependency).
    • Notifications. Notifications feature is implemented as a plugin [notification](https://docs.crawlab.cn/en/guide/plugin/plugin-notification).

    Frontend

    • Vue 3. Migrated to latest version of frontend framework Vue 3 to support more advanced features such as composition API and TypeScript.
    • UI Framework. Built with the Vue 3-based UI framework Element-Plus, moving on from Vue-Element-Admin, with more flexibility and functionality.
    • Advanced File Editor. Support more advanced file editor features including drag-and-drop copying/moving files, renaming, deleting, file editing, code highlight, nav tabs, etc.
    • Customizable Table. Support more advanced built-in operations such as columns adjustment, batch operation, searching, filtering, sorting, etc.
    • Nav Tabs. Support multiple nav tabs for viewing different pages.
    • Batch Creation. Support batch creating objects including spiders, projects, schedules, etc.
    • Detail Navigation. Sidebar navigation in detail pages.
    • Enhanced Dashboard. More stats charts in home page dashboard.

    Miscellaneous

    • Documentation Site. Upgraded [documentation site](https://docs.crawlab.cn/en).
    • Official Plugins. Allow users to install [official plugins](https://docs.crawlab.cn/en/guide/plugin/) on Crawlab web UI.
    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-beta.20211224(Dec 24, 2021)

    Change Log (v0.6.0-beta.20211224)

    Overview

    This is the third beta release for the next major version v0.6.0. With more features and optimizations coming in, the official release of v0.6.0 is approaching.

    Enhancement

    • [x] Internationalization. Support Chinese.
    • [x] CLI Upload Spider. #1020
    • [x] Official Plugins. Allow users to install official plugins on Crawlab web UI.
    • [x] More Documentation. Added documentation for plugins and CLI.

    Bug Fixes

    TODOs

    • [ ] Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
    • [ ] Crontab Editor. Frontend component that visualizes crontab editing.
    • [ ] Results Deduplication.
    • [ ] Environment Variables.
    • [ ] Frontend Utility Enhancement. Advanced features such as saved table customization.
    • [ ] Log Auto Cleanup.
    • [ ] More Documentation.
    • [ ] E2E Tests.
    • [ ] Frontend Output File Size Optimization.

    What Next

    The next version could be the official release of v0.6.0, but this is not determined yet. There will be more tests running against the current beta version to ensure robustness and production-ready deployment.

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-beta.20211120(Nov 20, 2021)

    Change Log (v0.6.0-beta.20211120)

    Overview

    This is the second beta release for the next major version v0.6.0 after the first beta release. With more features and optimization coming in, the release of official version v0.6.0 is approaching soon.

    Enhancement

    Backend

    • [x] Plugin Framework. Crawlab Plugin Framework (CPF) has been released. See more info here.
    • [x] Git Integration. Git integration is implemented as a built-in feature.
    • [x] Scrapy Integration. Scrapy integration is implemented as a plugin spider-assistant.
    • [x] Dependency Integration. Dependency integration is implemented as a plugin dependency.
    • [x] Notifications. Notifications feature is implemented as a plugin notification.
    • [x] Documentation Site. Set up documentation site.

    Frontend

    • Bug Fixing.

    TODOs

    • [ ] Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
    • [ ] Crontab Editor. Frontend component that visualizes crontab editing.
    • [ ] Results Deduplication.
    • [ ] Environment Variables.
    • [ ] Internationalization. Support Chinese.
    • [ ] Frontend Utility Enhancement. Advanced features such as saved table customization.
    • [ ] Log Auto Cleanup.
    • [ ] More Documentation.

    What Next

    The next version could be the official release of v0.6.0, but this is not determined yet. There will be more tests running against the current beta version to ensure robustness and production-ready deployment.

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-beta.20210803(Aug 3, 2021)

    Change Log (v0.6.0-beta.20210803)

    Overview

    This is the beta release for the next major version v0.6.0. It is recommended NOT to use it in production, as it is not fully tested and thus not stable enough. Furthermore, more features, including those not ready in the beta release (e.g. Git, Scrapy, Notification), are planned to be integrated into the live version in the form of plugins.

    Enhancement

    As a major release, v0.6 (including beta versions) consists of a number of large changes to enhance the performance, scalability, robustness and usability of Crawlab. This beta version is theoretically more robust than older versions, mainly in task execution, file synchronization and node management, yet we still recommend users to run thorough tests with various samples.

    Backend

    • File Synchronization. Migrated file sync from MongoDB GridFS to SeaweedFS for better stability and robustness.
    • Node Communication. Migrated node communication from Redis-based RPC to gRPC. Worker nodes indirectly interact with MongoDB by making gRPC calls to the master node.
    • Task Queue. Migrated task queue from Redis list to MongoDB collection to allow more flexibility (e.g. priority queue).
    • Logging. Migrated logging storage system to SeaweedFS to resolve performance issue in MongoDB.
    • SDK Integration. Migrated results data ingestion from native SDK to task handler side.
    • Task Related. Abstracted task related logics into Task Scheduler, Task Handler and Task Runners to increase decoupling and improve scalability and maintainability.
    • Componentization. Introduced a DI (dependency injection) framework and componentized modules, services and sub-systems.

    Frontend

    • Vue 3. Migrated to latest version of frontend framework Vue 3 to support more advanced features such as composition API and TypeScript.
    • UI Framework. Built with the Vue 3-based UI framework Element-Plus, moving on from Vue-Element-Admin, with more flexibility and functionality.
    • Advanced File Editor. Support more advanced file editor features including drag-and-drop copying/moving files, renaming, deleting, file editing, code highlight, nav tabs, etc.
    • Customizable Table. Support more advanced built-in operations such as columns adjustment, batch operation, searching, filtering, sorting, etc.
    • Nav Tabs. Support multiple nav tabs for viewing different pages.
    • Batch Creation. Support batch creating objects including spiders, projects, schedules, etc.
    • Detail Navigation. Sidebar navigation in detail pages.
    • Enhanced Dashboard. More stats charts in home page dashboard.

    TODOs

    As you may be aware, this is a beta release, so some existing useful features such as Git and Scrapy integration may not be available. However, we are trying to include them in the official v0.6.0 release, as some of their core functionalities are already in the code base, and we will add them to the stable version only once they are fully tested.

    • [ ] Plugin Framework. Advanced features will exist in the form of plugins, or pluggable modules.
    • [ ] Git Integration. To be included as a plugin.
    • [ ] Scrapy Integration. To be included as a plugin.
    • [ ] Notifications. To be included as a plugin.
    • [ ] Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
    • [ ] Crontab Editor. Frontend component that visualizes crontab editing.
    • [ ] Results Deduplication.
    • [ ] Environment Variables.
    • [ ] Internationalization. Support Chinese.
    • [ ] Frontend Utility Enhancement. Advanced features such as saved table customization.
    • [ ] Log Auto Cleanup.
    • [ ] Documentation.

    What Next

    This beta release is only a preview and a testing ground for the core functionalities in Crawlab v0.6. Therefore, we invite you to download it and run more tests. The official release is expected to be ready after major issues from the beta version are sorted out and the Plugin Framework and other key features are developed and fully tested. With that borne in mind, a second beta version before the main release is also possible.

    Source code(tar.gz)
    Source code(zip)
  • v0.5.1(Jul 31, 2020)

    Features / Enhancement

    • Added error message details.
    • Added Golang programming language support.
    • Added web driver installation scripts for Chrome Driver and Firefox.
    • Support system tasks. A "system task" is similar to a normal spider task; it allows users to view logs of general tasks such as installing languages.
    • Changed methods of installing languages from RPC to system tasks.

    Bug Fixes

    • Fixed first download repo 500 error in Spider Market page. #808
    • Fixed some translation issues.
    • Fixed 500 error in task detail page. #810
    • Fixed password reset issue. #811
    • Fixed unable to download CSV issue. #812
    • Fixed unable to install node.js issue. #813
    • Fixed disabled status for batch adding schedules. #814
    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Jul 19, 2020)

    Features / Enhancement

    • Spider Market. Allow users to download open-source spiders into Crawlab.
    • Batch actions. Allow users to interact with Crawlab in batch fashions, e.g. batch run tasks, batch delete spiders, etc.
    • Migrate MongoDB driver to MongoDriver.
    • Refactor and optimize node-related logics.
    • Change default task.workers to 16.
    • Change default nginx client_max_body_size to 200m.
    • Support writing logs to ElasticSearch.
    • Display error details in Scrapy page.
    • Removed Challenge page.
    • Moved Feedback and Disclaimer pages to navbar.

    Bug Fixes

    • Fixed log not expiring issue because of failure to create TTL index.
    • Set default log expire duration to 1 day.
    • task_id index not created.
    • docker-compose.yml fix.
    • Fixed 404 page.
    • Fixed unable to create worker node before master node issue.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.10(Apr 23, 2020)

    Features / Enhancement

    • Enhanced Log Management. Centralized log storage in MongoDB, reduced the dependency on PubSub, and allowed log error detection.
    • API Token. Allow users to generate API tokens and use them to integrate into their own systems.
    • Web Hook. Trigger a Web Hook HTTP request to a pre-defined URL when a task starts or finishes.
    • Auto Install Dependencies. Allow installing dependencies automatically from requirements.txt or package.json.
    • Auto Results Collection. Set the results collection to results_<spider_name> if it is not set.
    • Optimized Project List. Do not display the "No Project" item in the project list.
    • Upgrade Node.js. Upgraded the Node.js version from v8.12 to v10.19.
    • Add Run Button in Schedule Page. Allow users to manually run tasks on the Schedule Page.

    Bug Fixes

    • Cannot register. #670
    • Spider schedule tab cron expression shows second. #678
    • Missing daily stats in spider. #684
    • Results count not update in time. #689
    Source code(tar.gz)
    Source code(zip)
  • v0.4.9(Mar 31, 2020)

    Features / Enhancement

    • Challenges. Users can achieve different challenges based on their actions.
    • More Advanced Access Control. More granular access control, e.g. normal users can only view/manage their own spiders/projects and admin users can view/manage all spiders/projects.
    • Feedback. Allow users to send feedbacks and ratings to Crawlab team.
    • Better Home Page Metrics. Optimized metrics display on home page.
    • Configurable Spiders Converted to Customized Spiders. Allow users to convert their configurable spiders into customized spiders which are also Scrapy spiders.
    • View Tasks Triggered by Schedule. Allow users to view tasks triggered by a schedule. #648
    • Support Results De-Duplication. Allow users to configure de-duplication of results. #579
    • Support Task Restart. Allow users to re-run historical tasks.

    Bug Fixes

    • CLI unable to use on Windows. #580
    • Re-upload error. #643 #640
    • Upload missing folders. #646
    • Unable to add schedules in Spider Page.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.8(Mar 11, 2020)

    Features / Enhancement

    • Support Installations of More Programming Languages. Now users can install or pre-install more programming languages including Java, .Net Core and PHP.
    • Installation UI Optimization. Users can better view and manage installations on Node List page.
    • More Git Support. Allow users to view Git Commits record, and allow checkout to corresponding commit.
    • Support Hostname Node Registration Type. Users can set hostname as the node key as the unique identifier.
    • RPC Support. Added RPC support to better manage node communication.
    • Run On Master Switch. Users can determine whether to run tasks on master. If not, all tasks will be run only on worker nodes.
    • Disabled Tutorial by Default.
    • Added Related Documentation Sidebar.
    • Loading Page Optimization.

    Bug Fixes

    • Duplicated Nodes. #391
    • Duplicated Spider Upload. #603
    • Failure in dependency installation results in unusable dependency installation functionality. #609
    • Create Tasks for Offline Nodes. #622
    Source code(tar.gz)
    Source code(zip)
  • v0.4.7(Feb 24, 2020)

    Features / Enhancement

    • Better Support for Scrapy. Spiders identification, settings.py configuration, log level selection, spider selection. #435
    • Git Sync. Allow users to sync git projects to Crawlab.
    • Long Task Support. Users can add long-task spiders which are supposed to run without finishing. #425
    • Spider List Optimization. Tasks count by status, tasks detail popup, legend. #425
    • Upgrade Check. Check the latest version and notify users to upgrade.
    • Spiders Batch Operation. Allow users to run/stop spider tasks and delete spiders in batches.
    • Copy Spiders. Allow users to copy an existing spider to create a new one.
    • Wechat Group QR Code.

    Bug Fixes

    • Schedule Spider Selection Issue. Fields not responding to spider change.
    • Cron Jobs Conflict. Possible bug when two spiders' cron jobs are set to the same time. #515 #565
    • Task Log Issue. Different tasks write to the same log file if triggered at the same time. #577
    • Task List Filter Options Incomplete.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.6(Feb 13, 2020)

    Features / Enhancement

    • SDK for Node.js. Users can apply SDK in their Node.js spiders.
    • Log Management Optimization. Log search, error highlight, auto-scrolling.
    • Task Execution Process Optimization. Allow users to be redirected to task detail page after triggering a task.
    • Task Display Optimization. Added "Param" in the Latest Tasks table in the spider detail page. #295
    • Spider List Optimization. Added "Update Time" and "Create Time" in spider list page.
    • Page Loading Placeholder.

    Bug Fixes

    • Lost Focus in Schedule Configuration. #519
    • Unable to Upload Spider using CLI. #524
    Source code(tar.gz)
    Source code(zip)
  • v0.4.5(Feb 3, 2020)

    Features / Enhancement

    • Interactive Tutorial. Guide users through the main functionalities of Crawlab.
    • Global Environment Variables. Allow users to set global environment variables, which will be passed into all spider programs. #177
    • Project. Allow users to link spiders to projects. #316
    • Demo Spiders. Added demo spiders when Crawlab is initialized. #379
    • User Admin Optimization. Restrict privileges of admin users. #456
    • Setting Page Optimization.
    • Task Results Optimization.

    Bug Fixes

    • Unable to find spider file error. #485
    • Click delete button results in redirect. #480
    • Unable to create files in an empty spider. #479
    • Download results error. #465
    • crawlab-sdk CLI error. #458
    • Page refresh issue. #441
    • Results not support JSON. #202
    • Getting all spiders after deleting a spider.
    • i18n warning.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.4(Jan 17, 2020)

    Features / Enhancement

    • Email Notification. Allow users to send email notifications.
    • DingTalk Robot Notification. Allow users to send DingTalk Robot notifications.
    • Wechat Robot Notification. Allow users to send Wechat Robot notifications.
    • API Address Optimization. Added relative URL path in frontend so that users don't have to specify CRAWLAB_API_ADDRESS explicitly.
    • SDK Compatibility. Allow users to integrate Scrapy or general spiders with the Crawlab SDK.
    • Enhanced File Management. Added a tree-like file sidebar to allow users to edit files much more easily.
    • Advanced Schedule Cron. Allow users to edit schedule cron with visualized cron editor.

    Bug Fixes

    • nil returned error.
    • Error when using HTTPS.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.3(Jan 7, 2020)

    Features / Enhancement

    • Dependency Installation. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) on the platform web interface.
    • Pre-install Programming Languages in Docker. Allow Docker users to set CRAWLAB_SERVER_LANG_NODE as Y to pre-install Node.js environments.
    • Add Schedule List in Spider Detail Page. Allow users to view / add / edit schedule cron jobs in the spider detail page. #360
    • Align Cron Expression with Linux. Changed the cron expression from 6 elements to 5 elements, aligned with Linux.
    • Enable/Disable Schedule Cron. Allow users to enable/disable the schedule jobs. #297
    • Better Task Management. Allow users to batch delete tasks. #341
    • Better Spider Management. Allow users to sort and filter spiders in the spider list page.
    • Added Chinese CHANGELOG.
    • Added Github Star Button at Nav Bar.

    Bug Fixes

    • Schedule Cron Task Issue. #423
    • Upload Spider Zip File Issue. #403 #407
    • Exit due to Network Failure. #340
    • Cron Jobs not Running Correctly
    • Schedule List Columns Mis-positioned
    • Clicking Refresh Button Redirected to 404 Page
    Source code(tar.gz)
    Source code(zip)
  • v0.4.2(Dec 28, 2019)

    Features / Enhancement

    • Disclaimer. Added page for Disclaimer.
    • Call API to fetch version. #371
    • Configure to allow user registration. #346
    • Allow adding new users.
    • More Advanced File Management. Allow users to add / edit / rename / delete files. #286
    • Optimized Spider Creation Process. Allow users to create an empty customized spider before uploading the zip file.
    • Better Task Management. Allow users to filter tasks by certain criteria. #341

    Bug Fixes

    • Duplicated nodes. #391
    • "mongodb no reachable" error. #373
  • v0.4.1(Dec 15, 2019)

    Features / Enhancement

    • Spiderfile Optimization. Stages changed from dictionary to array (see the sketch after this list). #358
    • Baidu Tongji Update.
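
    A hypothetical before/after sketch of this change is shown below; the field names are illustrative assumptions, not taken from the actual Spiderfile schema.

      # Before #358 – stages as a dictionary keyed by stage name (hypothetical fields):
      stages:
        list_page:
          start_url: https://example.com

      # After #358 – stages as an array of objects (hypothetical fields):
      stages:
        - name: list_page
          start_url: https://example.com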

    Bug Fixes

    • Unable to display schedule tasks. #353
    • Duplicate node registration. #334
  • v0.4.0(Dec 6, 2019)

  • v0.3.5(Oct 28, 2019)

    Features / Enhancement

    • Graceful Shutdown. detail
    • Node Info Optimization. detail
    • Append System Environment Variables to Tasks. detail
    • Auto Refresh Task Log. detail
    • Enable HTTPS Deployment. detail

    Bug Fixes

    • Unable to fetch spider list info in schedule jobs. detail
    • Unable to fetch node info from worker nodes. detail
    • Unable to select node when trying to run spider tasks. detail
    • Unable to fetch result count when result volume is large. #260
    • Node issue in schedule tasks. #244
  • v0.3.4(Oct 8, 2019)

    1. Fixed an issue where non-customized spiders could not be viewed in the frontend. 2. Fixed an issue where killing the main process did not kill its child processes. 3. Fixed an incorrect task status when a spider exited abnormally. 4. Fixed an incorrect task status after a process was killed.

  • v0.3.3(Oct 7, 2019)

  • v0.3.2(Sep 30, 2019)

    1. Refactored the spider synchronization process to sync spiders directly from GridFS. 2. Fixed an issue where spider logs could not be fetched. 3. Fixed an issue where spiders could not be synchronized. 4. Fixed an issue where spiders could not be deleted. 5. Fixed an issue where tasks could not be stopped properly. 6. Optimized the spider list search.

  • v0.3.1(Aug 25, 2019)

    Features / Enhancement

    • Docker Image Optimization. Split the Docker image further into master, worker and frontend images based on Alpine.
    • Unit Tests. Covered part of the backend code with unit tests.
    • Frontend Optimization. Optimized the login page, button sizes and upload UI hints.
    • More Flexible Node Registration. Allow users to pass a variable as the key for node registration instead of the default MAC address.

    Bug Fixes

    • Uploading Large Spider Files Error. Memory crash issue when uploading large spider files. #150
    • Unable to Sync Spiders. Fixed by increasing the write permission level when synchronizing spider files. #114
    • Spider Page Issue. Fixed by removing the "Site" field. #112
    • Node Display Issue. Nodes did not display correctly when running Docker containers on multiple machines. #99
  • v0.3.0(Jul 31, 2019)

    Features / Enhancement

    • Golang Backend: Refactored the backend from Python to Golang for much better stability and performance.
    • Node Network Graph: Visualization of the node topology.
    • Node System Info: View system info including OS, CPUs and executables.
    • Node Monitoring Enhancement: Nodes are monitored and registered through Redis (see the sketch after this list).
    • File Management: Edit spider files online, with code highlighting.
    • Login/Register/User Management: Require users to log in to use Crawlab, allow user registration and user management, with some role-based authorization.
    • Automatic Spider Deployment: Spiders are deployed/synchronized to all online nodes automatically.
    • Smaller Docker Image: Slimmed the Docker image from 1.3G to ~700M by applying a multi-stage build.
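
    The Go sketch below shows one common way Redis-based node monitoring can work: each node refreshes a heartbeat key with a TTL, and a node whose key has expired is treated as offline. It is a conceptual illustration under assumed key names, not Crawlab's actual implementation.

      // Conceptual heartbeat sketch (not Crawlab's actual code): a worker
      // periodically refreshes a Redis key with a TTL; if the worker stops,
      // the key expires and the master can mark the node offline.
      package main

      import (
          "context"
          "log"
          "time"

          "github.com/go-redis/redis/v8"
      )

      func heartbeat(ctx context.Context, rdb *redis.Client, nodeKey string) {
          ticker := time.NewTicker(5 * time.Second)
          defer ticker.Stop()
          for {
              // "nodes:" prefix is an assumed key naming scheme for illustration.
              if err := rdb.Set(ctx, "nodes:"+nodeKey, time.Now().Unix(), 15*time.Second).Err(); err != nil {
                  log.Println("heartbeat failed:", err)
              }
              select {
              case <-ctx.Done():
                  return
              case <-ticker.C:
              }
          }
      }

      func main() {
          rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"})
          heartbeat(context.Background(), rdb, "worker-1")
      }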

    Bug Fixes

    • Node Status. Node status did not change even after the node actually went offline. #87
    • Spider Deployment Error. Fixed through Automatic Spider Deployment. #83
    • Node not showing. Nodes were not shown as online. #81
    • Cron Job not working. Fixed through the new Golang backend. #64
    • Flower Error. Fixed through the new Golang backend. #57
  • v0.2.4(Jul 22, 2019)

    Features / Enhancement

    • Documentation: Better and much more detailed documentation.
    • Better Crontab: Build crontab expressions through a crontab UI.
    • Better Performance: Switched from the native Flask engine to gunicorn. #78

    Bug Fixes

    • Deleting Spider. Deleting a spider now removes not only the database record but also the related folder, tasks and schedules. #69
    • MongoDB Auth. Allow users to specify authenticationDatabase when connecting to MongoDB (see the note after this list). #68
    • Windows Compatibility. Added eventlet to requirements.txt. #59
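
    For reference, authenticationDatabase corresponds to the standard authSource option of a MongoDB connection string; the credentials and database names below are placeholders, shown only to illustrate the option.

      mongodb://crawlab_user:secret@mongo:27017/crawlab?authSource=admin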
  • v0.2.3(Jun 12, 2019)

  • v0.2.2(May 30, 2019)

  • v0.2.1(May 27, 2019)

  • v0.2(May 10, 2019)

  • v0.1.1(Apr 23, 2019)
