Distributed web crawler management platform that supports spiders written in any language and framework.

Overview

Crawlab

中文 | English

Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer

Golang-based distributed web crawler management platform that supports multiple languages, including Python, NodeJS, Go, Java, and PHP, as well as multiple web crawler frameworks, including Scrapy, Puppeteer, and Selenium.

Demo | Documentation

Installation

Three methods:

  1. Docker (Recommended)
  2. Direct Deploy (to study the internals)
  3. Kubernetes (Multi-Node Deployment)

Pre-requisite (Docker)

  • Docker 18.03+
  • Redis 5.x+
  • MongoDB 3.6+
  • Docker Compose 1.24+ (optional but recommended)

Pre-requisite (Direct Deploy)

  • Go 1.12+
  • Node 8.12+
  • Redis 5.x+
  • MongoDB 3.6+

Quick Start

Open a command prompt and execute the commands below. Make sure you have installed docker-compose in advance.

git clone https://github.com/crawlab-team/crawlab
cd crawlab
docker-compose up -d

Next, you can look into docker-compose.yml (which contains detailed configuration parameters) and the Documentation (Chinese) for further information.

Run

Docker

Use docker-compose to start Crawlab with a single command. This way, you don't even have to configure the MongoDB and Redis databases separately. Create a file named docker-compose.yml and paste the content below.

version: '3.3'
services:
  master: 
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:    
      - "8080:8080"
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"

Then execute the command below, and the Crawlab Master Node, MongoDB, and Redis will start up. Open a browser, navigate to http://localhost:8080, and you will see the UI.

docker-compose up

For Docker deployment details, please refer to the relevant documentation.

Screenshot

Login

Home Page

Node List

Node Network

Spider List

Spider Overview

Spider Analytics

Spider File Edit

Task Log

Task Results

Cron Job

Language Installation

Dependency Installation

Notifications

Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and the Redis and MongoDB databases, which mainly handle node communication and data storage.

The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it executes the crawling task and stores the results in MongoDB. The architecture is much more concise than in versions before v0.3.0: the unnecessary Flower module, which provided node monitoring services, has been removed, and node monitoring is now handled by Redis.

Master Node

The Master Node is the core of the Crawlab architecture and the central control system of Crawlab.

The Master Node provides the following services:

  1. Crawling Task Coordination;
  2. Worker Node Management and Communication;
  3. Spider Deployment;
  4. Frontend and API Services;
  5. Task Execution (one can regard the Master Node as a Worker Node)

The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. Meanwhile, the Master Node synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.

Worker Node

The main functionality of the Worker Nodes is to execute crawling tasks, store results and logs, and communicate with the Master Node through Redis PubSub. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes for execution.
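For illustration, here is a minimal Python sketch of PubSub-style messaging with Redis; the channel name and message fields are hypothetical examples and do not reflect Crawlab's actual protocol.

# Minimal sketch of Redis PubSub messaging between nodes.
# The channel name and payload fields are hypothetical.
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

# A worker subscribes to a channel to receive task assignments.
pubsub = r.pubsub(ignore_subscribe_messages=True)
pubsub.subscribe("nodes:worker-1")
time.sleep(0.1)  # give the subscription a moment to register

# The master publishes a task message to that worker's channel.
r.publish("nodes:worker-1", json.dumps({"task_id": "abc123", "spider": "demo"}))

# The worker polls for the message and would then execute the task.
message = pubsub.get_message(timeout=2.0)
if message and message["type"] == "message":
    task = json.loads(message["data"])
    print("received task", task["task_id"])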

MongoDB

MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium through which the Master Node stores spider files and synchronizes them to the Worker Nodes.
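For illustration, a minimal pymongo/GridFS sketch of storing a packaged spider file and reading it back; the database and file names are hypothetical, and this is not Crawlab's internal code.

# Minimal GridFS sketch with pymongo: store a spider archive and read it back.
# Database and file names are hypothetical, not Crawlab internals.
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["crawlab_demo"]
fs = gridfs.GridFS(db)

# Master side: store a packaged spider (e.g. a zip archive) in GridFS.
with open("my_spider.zip", "rb") as f:
    file_id = fs.put(f, filename="my_spider.zip")

# Worker side: fetch the latest version of the file and write it to disk.
stored = fs.get_last_version(filename="my_spider.zip")
with open("downloaded_spider.zip", "wb") as f:
    f.write(stored.read())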

Redis

Redis is a very popular key-value database. It provides node communication services in Crawlab. For example, nodes execute HSET to store their info in a Redis hash named nodes, and the Master Node identifies online nodes according to that hash.
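A minimal Python sketch of that registration mechanism (only the hash name nodes comes from the description above; the node info fields are hypothetical):

# Sketch of node registration via a Redis hash named "nodes".
# The info fields below are hypothetical examples.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# A node registers itself by writing its info into the "nodes" hash.
node_info = {"key": "worker-1", "ip": "10.0.0.5", "status": "online"}
r.hset("nodes", "worker-1", json.dumps(node_info))

# The Master Node lists registered nodes by reading the whole hash.
for key, value in r.hgetall("nodes").items():
    print(key.decode(), json.loads(value))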

Frontend

The frontend is a SPA based on Vue-Element-Admin. It reuses many Element-UI components to support the corresponding views.

Integration with Other Frameworks

Crawlab SDK provides some helper methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️ Note: make sure you have already installed crawlab-sdk using pip.

Scrapy

In settings.py in your Scrapy project, find the variable named ITEM_PIPELINES (a dict) and add the content below.

ITEM_PIPELINES = {
    'crawlab.pipelines.CrawlabMongoPipeline': 888,
}

Then, start the Scrapy spider. After it's done, you should be able to see the scraped results in Task Detail -> Result.
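For reference, here is a minimal example Scrapy spider whose yielded dict items would flow through the pipeline above and show up as task results; the spider name and target site are arbitrary examples.

# Example Scrapy spider; items yielded in parse() pass through ITEM_PIPELINES,
# so the pipeline configured above saves them as task results.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }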

General Python Spider

Please add the content below to your spider files to save results.

# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)

Then, start the spider. After it's done, you should be able to see the scraped results in Task Detail -> Result.
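If your spider produces multiple records, you can call save_item once per record; each record must be a dict. A short usage sketch (the scraped data below is a placeholder):

# Save multiple scraped records; each record must be a dict.
from crawlab import save_item

scraped = [
    {"name": "item-1", "url": "https://example.com/1"},
    {"name": "item-2", "url": "https://example.com/2"},
]

for record in scraped:
    save_item(record)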

Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process as an environment variable named CRAWLAB_TASK_ID, so that the scraped data can be related to a task. In addition, Crawlab passes another environment variable, CRAWLAB_COLLECTION, as the name of the collection in which to store the results.
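For example, a spider written in plain Python could read these variables and write its results accordingly. The sketch below is illustrative only; the MongoDB connection settings and the task_id field name are assumptions rather than Crawlab's documented behavior.

# Sketch: relate results to the current task via environment variables.
# The MongoDB connection settings and the "task_id" field are assumptions.
import os
from pymongo import MongoClient

task_id = os.environ.get("CRAWLAB_TASK_ID")
collection_name = os.environ.get("CRAWLAB_COLLECTION", "results")

client = MongoClient("mongodb://localhost:27017")
collection = client["crawlab_demo"][collection_name]

# Tag each scraped record with the task id so it can be traced back to its task.
collection.insert_one({"name": "crawlab", "task_id": task_id})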

Comparison with Other Frameworks

There are existing spider management frameworks. So why use Crawlab?

The reason is that most of the existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is surely a great web crawling framework, but it cannot do everything.

Crawlab is easy to use and general enough to adapt to spiders in any language and any framework. It also has a beautiful frontend interface that lets users manage spiders much more easily.

Crawlab (Golang + Vue)
  • Pros: Not limited to Scrapy; available for all programming languages and frameworks. Beautiful UI. Naturally supports distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, etc.
  • Cons: Does not yet support spider versioning.

ScrapydWeb (Python Flask + Vue)
  • Pros: Beautiful UI, built-in Scrapy log parser, stats and graphs for task execution, node management, cron jobs, mail notifications, mobile support. A full-featured spider management platform.
  • Cons: Does not support spiders other than Scrapy. Limited performance because of the Python Flask backend.

Gerapy (Python Django + Vue)
  • Pros: Built by web crawler guru Germey Cui. Simple installation and deployment. Beautiful UI. Supports node management, code editing, configurable crawl rules, etc.
  • Cons: Again, does not support spiders other than Scrapy. Many bugs reported by users in v1.0; improvements are expected in v2.0.

SpiderKeeper (Python Flask)
  • Pros: Open-source Scrapyhub. Concise and simple UI. Supports cron jobs.
  • Cons: Perhaps too simplified; no pagination, no node management, and no support for spiders other than Scrapy.

Contributors

Community & Sponsorship

If you feel Crawlab could benefit your daily work or your company, please add the author's WeChat account (note "Crawlab") to join the discussion group, or scan the Alipay QR code below to give us a reward to upgrade our teamwork software or buy us a coffee.

Comments
  • Running spiders ?!

    I can't run a spider either outside the Docker master container or inside it. Inside, I get this error after typing crawlab upload spider: "Not authorized. Error logging in."

    Outside, I can only upload a zip file; it isn't connected to my script and doesn't return any data.

    Any help?!

    question 
    opened by Nesma-m7md 14
  • admin login fails through the VS Code port-forwarding feature

    Bug description: I deployed Crawlab on an intranet server and used the VS Code port-forwarding feature to open it on my local machine. When I enter admin as both the username and password, the login does not work. (screenshot of the VS Code port configuration)

    (screenshot of the login error)

    Ruling out other factors: the remote host runs the scrapyd admin UI on port 6800, and VS Code is able to map the remote host's port 6800 to a local port. (screenshot)

    Expected result: logging in as admin works.

    bug 
    opened by kevinzhangcode 11
  • gRPC Client Cannot Connect to the master node!

    Bug

    When I tried building Crawlab with Docker, the worker node could not connect to the master node.

    YML File

    version: '3.3'
    services:
      master: 
        image: crawlabteam/crawlab:latest
        container_name: crawlab_example_master
        environment:
          CRAWLAB_NODE_MASTER: "Y"
          CRAWLAB_MONGO_HOST: "mongo"
          CRAWLAB_GRPC_SERVER_ADDRESS: "0.0.0.0:9666"
          CRAWLAB_SERVER_HOST: "0.0.0.0"
          CRAWLAB_GRPC_AUTHKEY: "youcanneverguess"
        volumes:
          - "./.crawlab/master:/root/.crawlab"
        ports:    
          - "8080:8080"
          - "9666:9666"
          - "8000:8000"
        depends_on:
          - mongo
    
      worker01: 
        image: crawlabteam/crawlab:latest
        container_name: crawlab_example_worker01
        environment:
          CRAWLAB_NODE_MASTER: "N"
          CRAWLAB_GRPC_ADDRESS: "MY_Public_IP_Address:9666"
          CRAWLAB_GRPC_AUTHKEY: "youcanneverguess"
          CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
        volumes:
          - "./.crawlab/worker01:/root/.crawlab"
        depends_on:
          - master
    
      mongo:
        image: mongo:latest
        container_name: crawlab_example_mongo
        restart: always
    


    bug question 
    opened by edmund-zhao 11
  • Help | All create operations get no response; not sure what is misconfigured

    I can get into the admin platform, but none of the create operations complete. For example, creating a new project returns messages like the following:

    crawlab_master   | [GIN] 2022/07/24 - 22:47:03 | 400 |    2.186658ms |       127.0.0.1 | PUT      "/projects"
    ......
    crawlab_master   | node error: not exists
    ......
    crawlab_master   | mongo: no documents in result
    ......
    

    The docker-compose.yml configuration is as follows:

    # master node
    version: '3.3'
    services:
      mongo:
        image: mongo
        container_name: mongo
        restart: always
        environment:
          MONGO_INITDB_ROOT_USERNAME: root  # mongo username
          MONGO_INITDB_ROOT_PASSWORD: 123456  # mongo password
        volumes:
          - "/opt/crawlab/mongo/data/db:/data/db"  # persist mongo data
        ports:
          - "27017:27017"  # expose the mongo port to the host


      mongo-express:
        image: mongo-express
        container_name: mongo-express
        restart: always
        depends_on: # declare the dependency, used here instead of the links field
          - mongo
        ports: # externally mapped port; 27016 was used here, reachable from the container host at http://localhost:27016, or ip:port from outside
          # - "27016:8081"
          - "8081:8081"
        environment:
          ME_CONFIG_MONGODB_SERVER: mongo # the service name is the name of the mongo container
          ME_CONFIG_MONGODB_PORT: 27017
          ME_CONFIG_BASICAUTH_USERNAME: admin # username for the login page
          ME_CONFIG_BASICAUTH_PASSWORD: 123456 # password for the login page
          ME_CONFIG_MONGODB_ADMINUSERNAME: root # username for mongo authentication
          ME_CONFIG_MONGODB_ADMINPASSWORD: 123456 # password for mongo authentication


      master:
        image: crawlabteam/crawlab
        container_name: crawlab_master
        restart: always
        environment:
          CRAWLAB_NODE_MASTER: Y  # Y: master node
          CRAWLAB_MONGO_HOST: mongo  # mongo host address; within the Docker Compose network, reference the service name directly
          CRAWLAB_MONGO_PORT: 27017  # mongo port
          CRAWLAB_MONGO_DB: crawlab  # mongo database
          CRAWLAB_MONGO_USERNAME: root  # mongo username
          CRAWLAB_MONGO_PASSWORD: '123456'  # mongo password
          CRAWLAB_MONGO_AUTHSOURCE: admin  # mongo auth source
        volumes:
          - "/opt/crawlab/master:/data"  # persist crawlab data
        ports:
          - "8080:8080"  # expose the api port
          - "9666:9666"  # expose the grpc port
        depends_on:
          - mongo
    
    bug v0.6 
    opened by Indulgeinu 10
  • /tasks/{id}/error-log still returns empty even when error logs exist

    Bug description: no error message is shown, only an "empty result" error.

    Partial log output

    2021-01-30 23:25:03 [scrapy.core.engine] INFO: Spider opened
    2021-01-30 23:25:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2021-01-30 23:25:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2021-01-30 23:25:05 [user_basic] ERROR [githubspider.user_basic] : Could not resolve to a node with the global id of ''
    2021-01-30 23:25:05 [scrapy.core.engine] INFO: Closing spider (finished)
    2021-01-30 23:25:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1004,
    'downloader/request_count': 2,
    'downloader/request_method_count/GET': 1,
    'downloader/request_method_count/POST': 1,
    'downloader/response_bytes': 1878,
    'downloader/response_count': 2,
    'downloader/response_status_count/200': 2,
    'elapsed_time_seconds': 1.450998,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2021, 1, 30, 15, 25, 5, 286471),
    'log_count/ERROR': 1,
    
    GET http://localhost:8080/api/tasks/7f081549-28c4-4a7e-80e6-cdb7cf399534/error-log
    {"status":"ok","message":"success","data":null,"error":""}
    

    Screenshot: (attached image)

    bug 
    opened by YanhuiJessica 10
  • Seaweedfs / Gocolly integration

    Hi all,

    Hope you are all well!

    I was just wondering whether it would be possible to integrate these two awesome tools into Crawlab; it would be great for storing millions of static objects and for scraping with Golang. A friend and I already did that in https://github.com/lucmichalski/peaks-tires, but we lack horizontal scaling and a crawl management interface. That's why, and how, we found Crawlab.

    • https://github.com/gocolly/colly Elegant Scraper and Crawler Framework for Golang

    • https://github.com/chrislusf/seaweedfs SeaweedFS is a simple and highly scalable distributed file system, to store and serve billions of files fast! SeaweedFS implements an object store with O(1) disk seek, transparent cloud integration, and an optional Filer supporting POSIX, S3 API, AES256 encryption, Rack-Aware Erasure Coding for warm storage, FUSE mount, Hadoop compatible, WebDAV.

    Thanks for your insights and feedback on the topic.

    Cheers, X

    enhancement 
    opened by ghost 10
  • The latest master code cannot start normally via Docker

    Bug description: the latest code on master cannot start normally via Docker.

    Steps to reproduce

    1. Pull the latest code
    2. docker-compose up
    3. Error: 2021/08/10 17:44:51 error grpc client connect error: grpc error: client failed to start. reattempt in 51.7 seconds
    4. Check out 517ae21e13a57e0d9c074b162793aee689f99c0d
    5. docker-compose up
    6. Runs normally

    Expected result: Docker works normally.

    (screenshot attached)

    bug v0.6 
    opened by zires 9
  • toscrapy_books spider run on all nodes: the task ran twice

    Describe the bug A clear and concise description of what the bug is.

    To Reproduce Steps to reproduce the behavior:

    The toscrapy_books spider was run with all nodes selected, and the task ran twice. (screenshot attached)

    Expected behavior Crawl only once.

    good first issue 
    opened by EkkoG 9
  • Installed with docker-compose; the master crashes roughly every week and needs a manual restart

    Here is the log file output:

    crawlab-master | 2019/11/10 06:00:00 error handle task error:open /var/logs/crawlab/5daef3fd05363c0015606068/20191110060000.log: no such file or directory
    crawlab-master | 2019/11/10 06:00:00 error [Worker 3] open /var/logs/crawlab/5daef3fd05363c0015606068/20191110060000.log: no such file or directory
    crawlab-master | fatal error: concurrent map writes
    crawlab-master | fatal error: concurrent map writes
    crawlab-master | 2019/11/11 12:03:39 error open /var/logs/crawlab/5daef3fd05363c0015606068/20191111021501.log: no such file or directory
    crawlab-master | 2019/11/11 12:03:39 error open /var/logs/crawlab/5daef3fd05363c0015606068/20191111021501.log: no such file or directory

    bug 
    opened by yjiu1990 9
  • Scrapy directory structure issue

    After uploading a spider, if it does not strictly follow the Scrapy project structure, the corresponding files are not recognized under Spider - Spider Detail - Scrapy Settings, and the console reports errors as well.

    For example, Scrapy's settings, pipelines and middlewares normally sit in the same folder, whereas my spider splits the pipelines and middlewares into separate folders.

    bug wontfix v0.6 
    opened by TalentedBastard 7
  • "TypeError: res is undefined" when i tried to sign in

    I did everything according to the instructions from GitHub. When I try to log in, I get the error "TypeError: res is undefined". Please help me resolve this problem. (screenshots attached)

    bug v0.6 
    opened by Kohtie 7
  • v0.6.0-1: task execution errors, back to normal after restart, cause unknown

    Describe the bug: after Crawlab 0.6 had been running for two days (memory usage around 95% at that point), all tasks suddenly failed with "exit status 2", and the task logs contained no records.

    The pod log shows: dial tcp 127.0.0.1:8888: connect: connection refused. Not sure whether this is the cause.

    To Reproduce Steps to reproduce the behavior:

    1. Set up 100+ scheduled tasks
    2. After running for N days,
    3. All tasks fail with "exit status 2"

    Expected behavior: tasks execute normally.

    Screenshots

    Master log file

    crawlab.log

    bug v0.6 performance 
    opened by ma-pony 0
  • v0.6: when re-running a task, the execution command is the one defined by the spider, not the one the previous task ran with

    Describe the bug: Community edition v0.6.0-1. Not sure whether this is a bug or simply an unimplemented feature.

    When re-running a task, the task's execution command is the one defined by the spider, not the command the previous task was executed with.

    To Reproduce Steps to reproduce the behavior:

    1. Create task 1 with a custom execution command and run it
    2. Click re-run, which creates task 2
    3. Task 2's execution command is the one defined by the spider, not task 1's command

    Expected behavior: the re-run task's command is taken from the previous task.

    Screenshots

    bug v0.6 task 
    opened by ma-pony 0
Releases(v0.6.0-1)
  • v0.6.0-1(Oct 27, 2022)

    What's Changed

    • Bump eventsource from 1.1.0 to 1.1.1 in /frontend by @dependabot in https://github.com/crawlab-team/crawlab/pull/1115
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1117
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1119
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1122
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1124
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1127
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1145
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1148
    • fix(mem): fixed memory leak issue resulted by log collector by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1157
    • build(golang/dockerfile): rm unused tar by @ma-pony in https://github.com/crawlab-team/crawlab/pull/1165
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1169
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1170
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1185
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1191
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1205
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1208
    • fix(git): unable to pull code from remote by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1211
    • Develop by @tikazyq in https://github.com/crawlab-team/crawlab/pull/1216

    New Contributors

    • @ma-pony made their first contribution in https://github.com/crawlab-team/crawlab/pull/1165

    Full Changelog: https://github.com/crawlab-team/crawlab/compare/v0.6.0...v0.6.0-1

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(May 23, 2022)

    Change Log (v0.6.0)

    Overview

    As a major release, v0.6.0 consists of a number of large changes that enhance the performance, scalability, robustness and usability of Crawlab. This version is theoretically more robust than older versions, mainly in task execution, file synchronization and node management, yet we still recommend users to run thorough tests with various samples.

    Enhancements

    Backend

    • File Synchronization. Migrated file sync from MongoDB GridFS to SeaweedFS for better stability and robustness.
    • Node Communication. Migrated node communication from Redis-based RPC to gRPC. Worker nodes indirectly interact with MongoDB by making gRPC calls to the master node.
    • Task Queue. Migrated task queue from Redis list to MongoDB collection to allow more flexibility (e.g. priority queue).
    • Logging. Migrated logging storage system to SeaweedFS to resolve performance issue in MongoDB.
    • SDK Integration. Migrated results data ingestion from native SDK to task handler side.
    • Task Related. Abstracted task related logics into Task Scheduler, Task Handler and Task Runners to increase decoupling and improve scalability and maintainability.
    • Componentization. Introduced a DI (dependency injection) framework and componentized modules, services and sub-systems.
    • Plugin Framework. Crawlab Plugin Framework (CPF) has been released. See more info [here](https://docs.crawlab.cn/en/guide/plugin/).
    • Git Integration. Git integration is implemented as a built-in feature.
    • Scrapy Integration. Scrapy integration is implemented as a plugin [spider-assistant](https://docs.crawlab.cn/en/guide/plugin/plugin-spider-assistant).
    • Dependency Integration. Dependency integration is implemented as a plugin [dependency](https://docs.crawlab.cn/en/guide/plugin/plugin-dependency).
    • Notifications. Notifications feature is implemented as a plugin [notification](https://docs.crawlab.cn/en/guide/plugin/plugin-notification).

    Frontend

    • Vue 3. Migrated to latest version of frontend framework Vue 3 to support more advanced features such as composition API and TypeScript.
    • UI Framework. Built with the Vue 3-based UI framework Element-Plus, migrated from Vue-Element-Admin, for more flexibility and functionality.
    • Advanced File Editor. Support more advanced file editor features including drag-and-drop copying/moving files, renaming, deleting, file editing, code highlight, nav tabs, etc.
    • Customizable Table. Support more advanced built-in operations such as columns adjustment, batch operation, searching, filtering, sorting, etc.
    • Nav Tabs. Support multiple nav tabs for viewing different pages.
    • Batch Creation. Support batch creating objects including spiders, projects, schedules, etc.
    • Detail Navigation. Sidebar navigation in detail pages.
    • Enhanced Dashboard. More stats charts in home page dashboard.

    Miscellaneous

    • Documentation Site. Upgraded [documentation site](https://docs.crawlab.cn/en).
    • Official Plugins. Allow users to install [official plugins](https://docs.crawlab.cn/en/guide/plugin/) on Crawlab web UI.
    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-beta.20211224(Dec 24, 2021)

    Change Log (v0.6.0-beta.20211224)

    Overview

    This is the third beta release for the next major version v0.6.0. With more features and optimization coming in, the release of official version v0.6.0 is approaching soon.

    Enhancement

    • [x] Internationalization. Support Chinese.
    • [x] CLI Upload Spider. #1020
    • [x] Official Plugins. Allow users to install official plugins on Crawlab web UI.
    • [x] More Documentation. Added documentation for plugins and CLI.

    Bug Fixes

    TODOs

    • [ ] Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
    • [ ] Crontab Editor. Frontend component that visualizes crontab editing.
    • [ ] Results Deduplication.
    • [ ] Environment Variables.
    • [ ] Frontend Utility Enhancement. Advanced features such as saved table customization.
    • [ ] Log Auto Cleanup.
    • [ ] More Documentation.
    • [ ] E2E Tests.
    • [ ] Frontend Output File Size Optimization.

    What Next

    The next version could be the official release of v0.6.0, but this is not yet determined. There will be more tests running against the current beta version to ensure robustness and production-ready deployment.

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-beta.20211120(Nov 20, 2021)

    Change Log (v0.6.0-beta.20211120)

    Overview

    This is the second beta release for the next major version v0.6.0 after the first beta release. With more features and optimization coming in, the release of official version v0.6.0 is approaching soon.

    Enhancement

    Backend

    • [x] Plugin Framework. Crawlab Plugin Framework (CPF) has been released. See more info here.
    • [x] Git Integration. Git integration is implemented as a built-in feature.
    • [x] Scrapy Integration. Scrapy integration is implemented as a plugin spider-assistant.
    • [x] Dependency Integration. Dependency integration is implemented as a plugin dependency.
    • [x] Notifications. Notifications feature is implemented as a plugin notification.
    • [x] Documentation Site. Set up documentation site.

    Frontend

    • Bug Fixing.

    TODOs

    • [ ] Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
    • [ ] Crontab Editor. Frontend component that visualizes crontab editing.
    • [ ] Results Deduplication.
    • [ ] Environment Variables.
    • [ ] Internationalization. Support Chinese.
    • [ ] Frontend Utility Enhancement. Advanced features such as saved table customization.
    • [ ] Log Auto Cleanup.
    • [ ] More Documentation.

    What Next

    The next version could be the official release of v0.6.0, but this is not yet determined. There will be more tests running against the current beta version to ensure robustness and production-ready deployment.

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-beta.20210803(Aug 3, 2021)

    Change Log (v0.6.0-beta.20210803)

    Overview

    This is the beta release for the next major version v0.6.0. It is recommended NOT to use it in production, as it is not fully tested and thus not stable enough. Furthermore, more features, including those not ready in this beta release (e.g. Git, Scrapy, Notifications), are planned to be integrated into the live version in the form of plugins.

    Enhancement

    As a major release, v0.6 (including its beta versions) consists of a number of large changes that enhance the performance, scalability, robustness and usability of Crawlab. This beta version is theoretically more robust than older versions, mainly in task execution, file synchronization and node management, yet we still recommend users to run thorough tests with various samples.

    Backend

    • File Synchronization. Migrated file sync from MongoDB GridFS to SeaweedFS for better stability and robustness.
    • Node Communication. Migrated node communication from Redis-based RPC to gRPC. Worker nodes indirectly interact with MongoDB by making gRPC calls to the master node.
    • Task Queue. Migrated task queue from Redis list to MongoDB collection to allow more flexibility (e.g. priority queue).
    • Logging. Migrated logging storage system to SeaweedFS to resolve performance issue in MongoDB.
    • SDK Integration. Migrated results data ingestion from native SDK to task handler side.
    • Task Related. Abstracted task related logics into Task Scheduler, Task Handler and Task Runners to increase decoupling and improve scalability and maintainability.
    • Componentization. Introduced a DI (dependency injection) framework and componentized modules, services and sub-systems.

    Frontend

    • Vue 3. Migrated to latest version of frontend framework Vue 3 to support more advanced features such as composition API and TypeScript.
    • UI Framework. Built with the Vue 3-based UI framework Element-Plus, migrated from Vue-Element-Admin, for more flexibility and functionality.
    • Advanced File Editor. Support more advanced file editor features including drag-and-drop copying/moving files, renaming, deleting, file editing, code highlight, nav tabs, etc.
    • Customizable Table. Support more advanced built-in operations such as columns adjustment, batch operation, searching, filtering, sorting, etc.
    • Nav Tabs. Support multiple nav tabs for viewing different pages.
    • Batch Creation. Support batch creating objects including spiders, projects, schedules, etc.
    • Detail Navigation. Sidebar navigation in detail pages.
    • Enhanced Dashboard. More stats charts in home page dashboard.

    TODOs

    As you may be aware, this is a beta release, so some existing useful features such as Git and Scrapy integration may not be available. However, we are trying to include them in the official v0.6.0 release, as some of their core functionalities are already in the code base, and we will add them to the stable version only once they are fully tested.

    • [ ] Plugin Framework. Advanced features will exist in the form of plugins, or pluggable modules.
    • [ ] Git Integration. To be included as a plugin.
    • [ ] Scrapy Integration. To be included as a plugin.
    • [ ] Notifications. To be included as a plugin.
    • [ ] Associated Tasks. There will be main tasks and their sub-tasks if task mode is "all nodes" or "selected nodes".
    • [ ] Crontab Editor. Frontend component that visualizes crontab editing.
    • [ ] Results Deduplication.
    • [ ] Environment Variables.
    • [ ] Internationalization. Support Chinese.
    • [ ] Frontend Utility Enhancement. Advanced features such as saved table customization.
    • [ ] Log Auto Cleanup.
    • [ ] Documentation.

    What Next

    This beta release is only a preview and a testing ground for the core functionalities in Crawlab v0.6, so we invite you to download it and run more tests. The official release is expected to be ready after major issues from the beta version are sorted out and the Plugin Framework and other key features are developed and fully tested. With that in mind, a second beta version before the main release is also possible.

    Source code(tar.gz)
    Source code(zip)
  • v0.5.1(Jul 31, 2020)

    Features / Enhancement

    • Added error message details.
    • Added Golang programming language support.
    • Added web driver installation scripts for Chrome Driver and Firefox.
    • Support system tasks. A "system task" is similar to a normal spider task; it allows users to view the logs of general tasks such as installing languages.
    • Changed methods of installing languages from RPC to system tasks.

    Bug Fixes

    • Fixed first download repo 500 error in Spider Market page. #808
    • Fixed some translation issues.
    • Fixed 500 error in task detail page. #810
    • Fixed password reset issue. #811
    • Fixed unable to download CSV issue. #812
    • Fixed unable to install node.js issue. #813
    • Fixed disabled status for batch adding schedules. #814
    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Jul 19, 2020)

    Features / Enhancement

    • Spider Market. Allow users to download open-source spiders into Crawlab.
    • Batch actions. Allow users to interact with Crawlab in batch fashion, e.g. batch running tasks, batch deleting spiders, etc.
    • Migrate MongoDB driver to MongoDriver.
    • Refactor and optimize node-related logic.
    • Change default task.workers to 16.
    • Change default nginx client_max_body_size to 200m.
    • Support writing logs to ElasticSearch.
    • Display error details in Scrapy page.
    • Removed Challenge page.
    • Moved the Feedback and Disclaimer pages to the navbar.

    Bug Fixes

    • Fixed log not expiring issue because of failure to create TTL index.
    • Set default log expire duration to 1 day.
    • task_id index not created.
    • docker-compose.yml fix.
    • Fixed 404 page.
    • Fixed unable to create worker node before master node issue.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.10(Apr 23, 2020)

    Features / Enhancement

    • Enhanced Log Management. Centralized log storage in MongoDB, reduced the dependency on PubSub, and enabled log error detection.
    • API Token. Allow users to generate API tokens and use them to integrate into their own systems.
    • Web Hook. Trigger a Web Hook http request to pre-defined URL when a task starts or finishes.
    • Auto Install Dependencies. Allow installing dependencies automatically from requirements.txt or package.json.
    • Auto Results Collection. Set results collection to results_<spider_name> if it is not set.
    • Optimized Project List. Not display "No Project" item in the project list.
    • Upgrade Node.js. Upgrade Node.js version from v8.12 to v10.19.
    • Add Run Button in Schedule Page. Allow users to manually run task in Schedule Page.

    Bug Fixes

    • Cannot register. #670
    • Spider schedule tab cron expression shows second. #678
    • Missing daily stats in spider. #684
    • Results count not update in time. #689
    Source code(tar.gz)
    Source code(zip)
  • v0.4.9(Mar 31, 2020)

    Features / Enhancement

    • Challenges. Users can achieve different challenges based on their actions.
    • More Advanced Access Control. More granular access control, e.g. normal users can only view/manage their own spiders/projects and admin users can view/manage all spiders/projects.
    • Feedback. Allow users to send feedbacks and ratings to Crawlab team.
    • Better Home Page Metrics. Optimized metrics display on home page.
    • Configurable Spiders Converted to Customized Spiders. Allow users to convert their configurable spiders into customized spiders which are also Scrapy spiders.
    • View Tasks Triggered by Schedule. Allow users to view tasks triggered by a schedule. #648
    • Support Results De-Duplication. Allow users to configure de-duplication of results. #579
    • Support Task Restart. Allow users to re-run historical tasks.

    Bug Fixes

    • CLI unable to use on Windows. #580
    • Re-upload error. #643 #640
    • Upload missing folders. #646
    • Unable to add schedules in Spider Page.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.8(Mar 11, 2020)

    Features / Enhancement

    • Support Installations of More Programming Languages. Now users can install or pre-install more programming languages including Java, .Net Core and PHP.
    • Installation UI Optimization. Users can better view and manage installations on Node List page.
    • More Git Support. Allow users to view Git Commits record, and allow checkout to corresponding commit.
    • Support Hostname Node Registration Type. Users can set hostname as the node key as the unique identifier.
    • RPC Support. Added RPC support to better manage node communication.
    • Run On Master Switch. Users can determine whether to run tasks on master. If not, all tasks will be run only on worker nodes.
    • Disabled Tutorial by Default.
    • Added Related Documentation Sidebar.
    • Loading Page Optimization.

    Bug Fixes

    • Duplicated Nodes. #391
    • Duplicated Spider Upload. #603
    • Failure in dependencies installation results in unusable dependency installation functionalities.. #609
    • Create Tasks for Offline Nodes. #622
    Source code(tar.gz)
    Source code(zip)
  • v0.4.7(Feb 24, 2020)

    Features / Enhancement

    • Better Support for Scrapy. Spiders identification, settings.py configuration, log level selection, spider selection. #435
    • Git Sync. Allow users to sync git projects to Crawlab.
    • Long Task Support. Users can add long-task spiders, which are supposed to run without finishing. #425
    • Spider List Optimization. Tasks count by status, tasks detail popup, legend. #425
    • Upgrade Check. Check the latest version and notify users to upgrade.
    • Spiders Batch Operation. Allow users to run/stop spider tasks and delete spiders in batches.
    • Copy Spiders. Allow users to copy an existing spider to create a new one.
    • Wechat Group QR Code.

    Bug Fixes

    • Schedule Spider Selection Issue. Fields not responding to spider change.
    • Cron Jobs Conflict. Possible bug when two spiders set to the same time of their cron jobs. #515 #565
    • Task Log Issue. Different tasks write to the same log file if triggered at the same time. #577
    • Task List Filter Options Incomplete.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.6(Feb 13, 2020)

    Features / Enhancement

    • SDK for Node.js. Users can apply SDK in their Node.js spiders.
    • Log Management Optimization. Log search, error highlight, auto-scrolling.
    • Task Execution Process Optimization. Allow users to be redirected to task detail page after triggering a task.
    • Task Display Optimization. Added "Param" in the Latest Tasks table in the spider detail page. #295
    • Spider List Optimization. Added "Update Time" and "Create Time" in spider list page.
    • Page Loading Placeholder.

    Bug Fixes

    • Lost Focus in Schedule Configuration. #519
    • Unable to Upload Spider using CLI. #524
    Source code(tar.gz)
    Source code(zip)
  • v0.4.5(Feb 3, 2020)

    Features / Enhancement

    • Interactive Tutorial. Guide users through the main functionalities of Crawlab.
    • Global Environment Variables. Allow users to set global environment variables, which will be passed into all spider programs. #177
    • Project. Allow users to link spiders to projects. #316
    • Demo Spiders. Added demo spiders when Crawlab is initialized. #379
    • User Admin Optimization. Restrict privileges of admin users. #456
    • Setting Page Optimization.
    • Task Results Optimization.

    Bug Fixes

    • Unable to find spider file error. #485
    • Click delete button results in redirect. #480
    • Unable to create files in an empty spider. #479
    • Download results error. #465
    • crawlab-sdk CLI error. #458
    • Page refresh issue. #441
    • Results not support JSON. #202
    • Getting all spider after deleting a spider.
    • i18n warning.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.4(Jan 17, 2020)

    Features / Enhancement

    • Email Notification. Allow users to send email notifications.
    • DingTalk Robot Notification. Allow users to send DingTalk Robot notifications.
    • Wechat Robot Notification. Allow users to send Wechat Robot notifications.
    • API Address Optimization. Added relative URL path in frontend so that users don't have to specify CRAWLAB_API_ADDRESS explicitly.
    • SDK Compatibility. Allow users to integrate Scrapy or general spiders with Crawlab SDK.
    • Enhanced File Management. Added a tree-like file sidebar to allow users to edit files much more easily.
    • Advanced Schedule Cron. Allow users to edit schedule cron with visualized cron editor.

    Bug Fixes

    • nil returned error.
    • Error when using HTTPS.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.3(Jan 7, 2020)

    Features / Enhancement

    • Dependency Installation. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) on the platform web interface.
    • Pre-install Programming Languages in Docker. Allow Docker users to set CRAWLAB_SERVER_LANG_NODE as Y to pre-install Node.js environments.
    • Add Schedule List in Spider Detail Page. Allow users to view / add / edit schedule cron jobs in the spider detail page. #360
    • Align Cron Expression with Linux. Change the expression of 6 elements to 5 elements as aligned in Linux.
    • Enable/Disable Schedule Cron. Allow users to enable/disable the schedule jobs. #297
    • Better Task Management. Allow users to batch delete tasks. #341
    • Better Spider Management. Allow users to sort and filter spiders in the spider list page.
    • Added Chinese CHANGELOG.
    • Added Github Star Button at Nav Bar.

    Bug Fixes

    • Schedule Cron Task Issue. #423
    • Upload Spider Zip File Issue. #403 #407
    • Exit due to Network Failure. #340
    • Cron Jobs not Running Correctly
    • Schedule List Columns Mis-positioned
    • Clicking Refresh Button Redirected to 404 Page
    Source code(tar.gz)
    Source code(zip)
  • v0.4.2(Dec 28, 2019)

    Features / Enhancement

    • Disclaimer. Added page for Disclaimer.
    • Call API to fetch version. #371
    • Configure to allow user registration. #346
    • Allow adding new users.
    • More Advanced File Management. Allow users to add / edit / rename / delete files. #286
    • Optimized Spider Creation Process. Allow users to create an empty customized spider before uploading the zip file.
    • Better Task Management. Allow users to filter tasks by selecting certain criteria. #341

    Bug Fixes

    • Duplicated nodes. #391
    • "mongodb no reachable" error. #373
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Dec 15, 2019)

    Features / Enhancement

    • Spiderfile Optimization. Stages changed from dictionary to array. #358
    • Baidu Tongji Update.

    Bug Fixes

    • Unable to display schedule tasks. #353
    • Duplicate node registration. #334
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Dec 6, 2019)

  • v0.3.5(Oct 28, 2019)

    Features / Enhancement

    • Graceful Shutdown. detail
    • Node Info Optimization. detail
    • Append System Environment Variables to Tasks. detail
    • Auto Refresh Task Log. detail
    • Enable HTTPS Deployment. detail

    Bug Fixes

    • Unable to fetch spider list info in schedule jobs. detail
    • Unable to fetch node info from worker nodes. detail
    • Unable to select node when trying to run spider tasks. detail
    • Unable to fetch result count when result volume is large. #260
    • Node issue in schedule tasks. #244
    Source code(tar.gz)
    Source code(zip)
  • v0.3.4(Oct 8, 2019)

    1. Fixed an issue where non-customized spiders could not be viewed in the frontend. 2. Fixed an issue where killing the main process did not kill its child processes. 3. Fixed the wrong status when a spider exited abnormally. 4. Fixed the wrong status after killing a process.

    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Oct 7, 2019)

  • v0.3.2(Sep 30, 2019)

    1. Refactored the spider synchronization process to sync spiders directly from GridFS. 2. Fixed an issue where spider logs could not be retrieved properly. 3. Fixed an issue where spiders could not be synchronized properly. 4. Fixed an issue where spiders could not be deleted properly. 5. Fixed an issue where task status could not be stopped properly. 6. Optimized spider list search.

    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Aug 25, 2019)

    Features / Enhancement

    • Docker Image Optimization. Split docker further into master, worker, frontend with alpine image.
    • Unit Tests. Covered part of the backend code with unit tests.
    • Frontend Optimization. Login page, button size, and upload UI hint optimizations.
    • More Flexible Node Registration. Allow users to pass a variable as key for node registration instead of MAC by default.

    Bug Fixes

    • Uploading Large Spider Files Error. Memory crash issue when uploading large spider files. #150
    • Unable to Sync Spiders. Fixes through increasing level of write permission when synchronizing spider files. #114
    • Spider Page Issue. Fixes through removing the field "Site". #112
    • Node Display Issue. Nodes do not display correctly when running docker containers on multiple machines. #99
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Jul 31, 2019)

    Features / Enhancement

    • Golang Backend: Refactored the backend from Python to Golang for much better stability and performance.
    • Node Network Graph: Visualization of the node topology.
    • Node System Info: Available to see system info including OS, CPUs and executables.
    • Node Monitoring Enhancement: Nodes are monitored and registered through Redis.
    • File Management: Available to edit spider files online, including code highlight.
    • Login/Register/User Management: Require users to log in to use Crawlab, allow user registration and user management, with some role-based authorization.
    • Automatic Spider Deployment: Spiders are deployed/synchronized to all online nodes automatically.
    • Smaller Docker Image: Slimmed Docker image and reduced Docker image size from 1.3G to ~700M by applying Multi-Stage Build.

    Bug Fixes

    • Node Status. Node status does not change even though it goes offline actually. #87
    • Spider Deployment Error. Fixed through Automatic Spider Deployment #83
    • Node not showing. Node not able to show online #81
    • Cron Job not working. Fixed through new Golang backend #64
    • Flower Error. Fixed through new Golang backend #57
    Source code(tar.gz)
    Source code(zip)
  • v0.2.4(Jul 22, 2019)

    Features / Enhancement

    • Documentation: Better and much more detailed documentation.
    • Better Crontab: Make crontab expression through crontab UI.
    • Better Performance: Switched from native flask engine to gunicorn. #78

    Bugs Fixes

    • Deleting Spider. Deleting a spider now removes not only the record in the db but also the related folder, tasks and schedules. #69
    • MongoDB Auth. Allow user to specify authenticationDatabase to connect to mongodb. #68
    • Windows Compatibility. Added eventlet to requirements.txt. #59
    Source code(tar.gz)
    Source code(zip)
  • v0.2.3(Jun 12, 2019)

  • v0.2.2(May 30, 2019)

  • v0.2.1(May 27, 2019)

  • v0.2(May 10, 2019)

  • v0.1.1(Apr 23, 2019)

skweez spiders web pages and extracts words for wordlist generation.

skweez skweez (pronounced like "squeeze") spiders web pages and extracts words for wordlist generation. It is basically an attempt to make a more oper

Michael Eder 46 Nov 23, 2022
Pholcus is a distributed high-concurrency crawler software written in pure golang

Pholcus Pholcus (Ghost Spider) is a pure-Go, high-concurrency crawler that supports distributed operation, intended only for programming learning and research. It supports three run modes (standalone, server, client) and three interfaces (Web, GUI, command line); its rules are simple and flexible, it supports concurrent batch tasks, and it offers rich output formats (mysql/mongodb/kafka/csv/excel, etc.)

henrylee2cn 7.2k Nov 28, 2022
Apollo 💎 A Unix-style personal search engine and web crawler for your digital footprint.

Apollo ?? A Unix-style personal search engine and web crawler for your digital footprint Demo apollodemo.mp4 Contents Background Thesis Design Archite

Amir Bolous 1.3k Nov 23, 2022
Fast golang web crawler for gathering URLs and JavaScript file locations.

Fast golang web crawler for gathering URLs and JavaScript file locations. This is basically a simple implementation of the awesome Gocolly library.

Mansz 1 Sep 24, 2022
ant (alpha) is a web crawler for Go.

The package includes functions that can scan data from the page into your structs or slice of structs, this allows you to reduce the noise and complexity in your source-code.

Amir Abushareb 264 Nov 9, 2022
Fast, highly configurable, cloud native dark web crawler.

Bathyscaphe dark web crawler Bathyscaphe is a Go written, fast, highly configurable, cloud-native dark web crawler. How to start the crawler To start

Darkspot 80 Nov 16, 2022
Just a web crawler

gh-dependents gh command extension to see dependents of your repository. See The GitHub Blog: GitHub CLI 2.0 includes extensions! Install gh extension

Hiromu OCHIAI 14 Sep 27, 2022
A recursive, mirroring web crawler that retrieves child links.

A recursive, mirroring web crawler that retrieves child links.

Tony Afula 0 Jan 29, 2022
Elegant Scraper and Crawler Framework for Golang

Colly Lightning Fast and Elegant Scraping Framework for Gophers Colly provides a clean interface to write any kind of crawler/scraper/spider. With Col

Colly 18.3k Nov 28, 2022
:paw_prints: Creeper - The Next Generation Crawler Framework (Go)

About Creeper is a next-generation crawler which fetches web page by creeper script. As a cross-platform embedded crawler, you can use it for your new

Plutonist 770 Nov 9, 2022
Go IMDb Crawler

Go IMDb Crawler Hit the ⭐ button to show some ❤️ ?? INSPIRATION ?? Want to know which celebrities have a common birthday with yours? ?? Want to get th

Niloy Sikdar 11 Aug 1, 2022
High-performance crawler framework based on fasthttp

predator / Predator: a high-performance crawler framework built on fasthttp. Usage: below is an example covering essentially all currently completed features; refer to the comments for usage details.

null 15 May 2, 2022
A crawler/scraper based on golang + colly, configurable via JSON

A crawler/scraper based on golang + colly, configurable via JSON

Go Tripod 15 Aug 21, 2022
crawlergo is a browser crawler that uses chrome headless mode for URL collection.

A powerful browser crawler for web vulnerability scanners

9ian1i 2.2k Nov 28, 2022
A crawler/scraper based on golang + colly, configurable via JSON

Super-Simple Scraper This a very thin layer on top of Colly which allows configuration from a JSON file. The output is JSONL which is ready to be impo

Go Tripod 15 Aug 21, 2022
New World Auction House Crawler In Golang

New-World-Auction-House-Crawler Goal of this library is to have a process which grabs New World auction house data in the background while playing the

null 3 Sep 7, 2022
Simple content crawler for joyreactor.cc

Reactor Crawler Simple CLI content crawler for Joyreactor. He'll find all media content on the page you've provided and save it. If there will be any

null 29 May 5, 2022
A PCPartPicker crawler for Golang.

gopartpicker A scraper for pcpartpicker.com for Go. It is implemented using Colly. Features Extract data from part list URLs Search for parts Extract

QuaKe 1 Nov 9, 2021
Multiplexer: HTTP-Server & URL Crawler

Multiplexer: HTTP-Server & URL Crawler The application is an HTTP server with a single handler. The handler receives as input a POST request with a list of ur

Alexey Khan 1 Nov 3, 2021