Distributed web crawler admin platform for spider management, regardless of language and framework.

Overview

Crawlab


Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer

Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java and PHP, as well as various web crawler frameworks including Scrapy, Puppeteer and Selenium.

Demo | Documentation

Installation

Three methods:

  1. Docker (Recommended)
  2. Direct Deploy (Check Internal Kernel)
  3. Kubernetes (Multi-Node Deployment)

Pre-requisite (Docker)

  • Docker 18.03+
  • Redis 5.x+
  • MongoDB 3.6+
  • Docker Compose 1.24+ (optional but recommended)

Pre-requisite (Direct Deploy)

  • Go 1.12+
  • Node 8.12+
  • Redis 5.x+
  • MongoDB 3.6+

Quick Start

Open a terminal and execute the commands below. Make sure you have installed docker-compose in advance.

git clone https://github.com/crawlab-team/crawlab
cd crawlab
docker-compose up -d

Next, you can look into the docker-compose.yml (with detailed config params) and the Documentation (Chinese) for further information.

Run

Docker

Please use docker-compose to start everything up with a single command. With this approach, you don't even have to configure the MongoDB and Redis databases yourself. Create a file named docker-compose.yml and enter the content below.

version: '3.3'
services:
  master: 
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:    
      - "8080:8080"
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"

Then execute the command below, and the Crawlab Master Node, MongoDB and Redis will start up. Open a browser, navigate to http://localhost:8080, and you will see the UI.

docker-compose up

For Docker deployment details, please refer to the relevant documentation.

Screenshot

Login

Home Page

Node List

Node Network

Spider List

Spider Overview

Spider Analytics

Spider File Edit

Task Log

Task Results

Cron Job

Language Installation

Dependency Installation

Notifications

Architecture

The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and Redis and MongoDB databases, which are mainly used for node communication and data storage.

The frontend app sends requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it executes the crawling task and stores the results in MongoDB. Compared with versions before v0.3.0, the architecture is much more concise: the Flower module, which previously provided node monitoring, has been removed, and node monitoring is now handled by Redis.

Master Node

The Master Node is the core of the Crawlab architecture. It is the center control system of Crawlab.

The Master Node provides the following services:

  1. Crawling Task Coordination;
  2. Worker Node Management and Communication;
  3. Spider Deployment;
  4. Frontend and API Services;
  5. Task Execution (one can regard the Master Node as a Worker Node)

The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. Meanwhile, the Master Node synchronizes (deploys) spiders to Worker Nodes via Redis and MongoDB GridFS.

Worker Node

The main responsibilities of a Worker Node are to execute crawling tasks, store results and logs, and communicate with the Master Node through Redis PubSub. By increasing the number of Worker Nodes, Crawlab scales horizontally, and different crawling tasks can be assigned to different nodes for execution.

MongoDB

MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The MongoDB GridFS file system is the medium through which the Master Node stores spider files and synchronizes them to the Worker Nodes.

Redis

Redis is a very popular key-value database. It provides node communication services in Crawlab. For example, nodes execute HSET to write their info into a hash named nodes in Redis, and the Master Node identifies online nodes according to that hash.
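For illustration only, below is a minimal Python sketch of what such a registration could look like (using the redis package). Only the hash name nodes comes from the description above; the connection settings, node key and field layout are assumptions.

import json
import redis

# connect to the Redis instance used by Crawlab (host/port are assumptions)
r = redis.Redis(host='localhost', port=6379)

# a node writes its info into the "nodes" hash, keyed by its node identifier
node_key = 'worker-01'  # hypothetical node key
node_info = {'key': node_key, 'ip': '192.168.0.10', 'mac': '00:11:22:33:44:55'}
r.hset('nodes', node_key, json.dumps(node_info))

# the Master Node can then enumerate registered nodes from the same hash
for key, info in r.hgetall('nodes').items():
    print(key.decode(), json.loads(info))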

Frontend

The frontend is an SPA based on Vue-Element-Admin. It reuses many Element-UI components to build its views.

Integration with Other Frameworks

Crawlab SDK provides some helper methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️ Note: make sure you have already installed crawlab-sdk using pip.
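If it is not installed yet, installing the package with pip should be enough:

pip install crawlab-sdk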

Scrapy

In settings.py in your Scrapy project, find the variable named ITEM_PIPELINES (a dict variable) and add the content below.

ITEM_PIPELINES = {
    'crawlab.pipelines.CrawlabMongoPipeline': 888,
}

Then, start the Scrapy spider. After it's done, you should be able to see scraped results in Task Detail -> Result
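For reference, a Scrapy spider is typically started with a command like the one below, which can also serve as the spider's execute command in Crawlab (the spider name is a placeholder):

scrapy crawl example_spider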

General Python Spider

Please add the content below to your spider files to save results.

# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)

Then, start the spider. After it's done, you should be able to see scraped results in Task Detail -> Result

Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process as an environment variable named CRAWLAB_TASK_ID. By doing so, the data can be related to a task. In addition, another environment variable, CRAWLAB_COLLECTION, is passed by Crawlab as the name of the collection in which to store results data.
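As a rough sketch of how a script might use these variables without the SDK, the Python example below reads them and writes a result to MongoDB. The connection settings, database name and task_id field are assumptions for illustration, not guaranteed Crawlab conventions.

import os
from pymongo import MongoClient

# environment variables passed by Crawlab to the task process
task_id = os.environ.get('CRAWLAB_TASK_ID')
collection_name = os.environ.get('CRAWLAB_COLLECTION', 'results')

# MongoDB connection details are assumptions; adjust them to your own deployment
client = MongoClient(os.environ.get('CRAWLAB_MONGO_HOST', 'localhost'), 27017)
db = client['crawlab_test']  # assumed database name

# tag each result with the task ID so the data can be related to the task
result = {'name': 'crawlab', 'task_id': task_id}
db[collection_name].insert_one(result)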

Comparison with Other Frameworks

There are existing spider management frameworks. So why use Crawlab?

The reason is that most of the existing platforms depend on Scrapyd, which limits the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.

Crawlab is easy to use and general enough to adapt to spiders in any language and any framework. It also has a beautiful frontend interface that lets users manage spiders much more easily.

  • Crawlab (Golang + Vue)
    Pros: Not limited to Scrapy, available for all programming languages and frameworks. Beautiful UI interface. Naturally supports distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, etc.
    Cons: Does not yet support spider versioning.
  • ScrapydWeb (Python Flask + Vue)
    Pros: Beautiful UI interface, built-in Scrapy log parser, stats and graphs for task execution; supports node management, cron jobs, mail notifications and mobile. A full-featured spider management platform.
    Cons: Does not support spiders other than Scrapy. Limited performance because of the Python Flask backend.
  • Gerapy (Python Django + Vue)
    Pros: Built by web crawler guru Germey Cui. Simple installation and deployment. Beautiful UI interface. Supports node management, code editing, configurable crawl rules, etc.
    Cons: Again, does not support spiders other than Scrapy. Many bugs reported by users in v1.0; improvement expected in v2.0.
  • SpiderKeeper (Python Flask)
    Pros: Open-source Scrapyhub. Concise and simple UI interface. Supports cron jobs.
    Cons: Perhaps too simplified; does not support pagination, node management, or spiders other than Scrapy.

Contributors

Community & Sponsorship

If you feel Crawlab could benefit your daily work or your company, please add the author's WeChat account with the note "Crawlab" to join the discussion group. Or scan the Alipay QR code below to give us a reward to upgrade our teamwork software or buy us a coffee.

Releases (v0.5.1)
  • v0.5.1(Jul 31, 2020)

    Features / Enhancement

    • Added error message details.
    • Added Golang programming language support.
    • Added web driver installation scripts for Chrome Driver and Firefox.
    • Support system tasks. A "system task" is similar to a normal spider task; it allows users to view logs of general tasks such as installing languages.
    • Changed methods of installing languages from RPC to system tasks.

    Bug Fixes

    • Fixed first download repo 500 error in Spider Market page. #808
    • Fixed some translation issues.
    • Fixed 500 error in task detail page. #810
    • Fixed password reset issue. #811
    • Fixed unable to download CSV issue. #812
    • Fixed unable to install node.js issue. #813
    • Fixed disabled status for batch adding schedules. #814
  • v0.5.0(Jul 19, 2020)

    Features / Enhancement

    • Spider Market. Allow users to download open-source spiders into Crawlab.
    • Batch actions. Allow users to interact with Crawlab in batch fashion, e.g. batch run tasks, batch delete spiders, etc.
    • Migrate MongoDB driver to MongoDriver.
    • Refactor and optimize node-related logic.
    • Change default task.workers to 16.
    • Change default nginx client_max_body_size to 200m.
    • Support writing logs to ElasticSearch.
    • Display error details in Scrapy page.
    • Removed Challenge page.
    • Moved Feedback and Disclaimer pages to the navbar.

    Bug Fixes

    • Fixed log not expiring issue because of failure to create TTL index.
    • Set default log expire duration to 1 day.
    • task_id index not created.
    • docker-compose.yml fix.
    • Fixed 404 page.
    • Fixed unable to create worker node before master node issue.
  • v0.4.10(Apr 23, 2020)

    Features / Enhancement

    • Enhanced Log Management. Centralized log storage in MongoDB, reduced the dependency on PubSub, and allowed log error detection.
    • API Token. Allow users to generate API tokens and use them to integrate into their own systems.
    • Web Hook. Trigger a Web Hook http request to pre-defined URL when a task starts or finishes.
    • Auto Install Dependencies. Allow installing dependencies automatically from requirements.txt or package.json.
    • Auto Results Collection. Set results collection to results_<spider_name> if it is not set.
    • Optimized Project List. Not display "No Project" item in the project list.
    • Upgrade Node.js. Upgrade Node.js version from v8.12 to v10.19.
    • Add Run Button in Schedule Page. Allow users to manually run task in Schedule Page.

    Bug Fixes

    • Cannot register. #670
    • Spider schedule tab cron expression shows second. #678
    • Missing daily stats in spider. #684
    • Results count not update in time. #689
  • v0.4.9(Mar 30, 2020)

    Features / Enhancement

    • Challenges. Users can achieve different challenges based on their actions.
    • More Advanced Access Control. More granular access control, e.g. normal users can only view/manage their own spiders/projects and admin users can view/manage all spiders/projects.
    • Feedback. Allow users to send feedback and ratings to the Crawlab team.
    • Better Home Page Metrics. Optimized metrics display on home page.
    • Configurable Spiders Converted to Customized Spiders. Allow users to convert their configurable spiders into customized spiders which are also Scrapy spiders.
    • View Tasks Triggered by Schedule. Allow users to view tasks triggered by a schedule. #648
    • Support Results De-Duplication. Allow users to configure de-duplication of results. #579
    • Support Task Restart. Allow users to re-run historical tasks.

    Bug Fixes

    • CLI unable to use on Windows. #580
    • Re-upload error. #643 #640
    • Upload missing folders. #646
    • Unable to add schedules in Spider Page.
  • v0.4.8(Mar 11, 2020)

    Features / Enhancement

    • Support Installations of More Programming Languages. Now users can install or pre-install more programming languages including Java, .Net Core and PHP.
    • Installation UI Optimization. Users can better view and manage installations on Node List page.
    • More Git Support. Allow users to view Git Commits record, and allow checkout to corresponding commit.
    • Support Hostname Node Registration Type. Users can set the hostname as the node key, i.e. its unique identifier.
    • RPC Support. Added RPC support to better manage node communication.
    • Run On Master Switch. Users can determine whether to run tasks on master. If not, all tasks will be run only on worker nodes.
    • Disabled Tutorial by Default.
    • Added Related Documentation Sidebar.
    • Loading Page Optimization.

    Bug Fixes

    • Duplicated Nodes. #391
    • Duplicated Spider Upload. #603
    • Failure in dependency installation made the dependency installation functionality unusable. #609
    • Create Tasks for Offline Nodes. #622
  • v0.4.7(Feb 24, 2020)

    Features / Enhancement

    • Better Support for Scrapy. Spider identification, settings.py configuration, log level selection, spider selection. #435
    • Git Sync. Allow users to sync git projects to Crawlab.
    • Long Task Support. Users can add long-task spiders, which are supposed to run without finishing. #425
    • Spider List Optimization. Tasks count by status, tasks detail popup, legend. #425
    • Upgrade Check. Check the latest version and notify users to upgrade.
    • Spiders Batch Operation. Allow users to run/stop spider tasks and delete spiders in batches.
    • Copy Spiders. Allow users to copy an existing spider to create a new one.
    • Wechat Group QR Code.

    Bug Fixes

    • Schedule Spider Selection Issue. Fields not responding to spider change.
    • Cron Jobs Conflict. Possible bug when two spiders' cron jobs were set to the same time. #515 #565
    • Task Log Issue. Different tasks write to the same log file if triggered at the same time. #577
    • Task List Filter Options Incomplete.
  • v0.4.6(Feb 13, 2020)

    Features / Enhancement

    • SDK for Node.js. Users can apply SDK in their Node.js spiders.
    • Log Management Optimization. Log search, error highlight, auto-scrolling.
    • Task Execution Process Optimization. Allow users to be redirected to task detail page after triggering a task.
    • Task Display Optimization. Added "Param" in the Latest Tasks table in the spider detail page. #295
    • Spider List Optimization. Added "Update Time" and "Create Time" in spider list page.
    • Page Loading Placeholder.

    Bug Fixes

    • Lost Focus in Schedule Configuration. #519
    • Unable to Upload Spider using CLI. #524
  • v0.4.5(Feb 3, 2020)

    Features / Enhancement

    • Interactive Tutorial. Guide users through the main functionalities of Crawlab.
    • Global Environment Variables. Allow users to set global environment variables, which will be passed into all spider programs. #177
    • Project. Allow users to link spiders to projects. #316
    • Demo Spiders. Added demo spiders when Crawlab is initialized. #379
    • User Admin Optimization. Restrict privileges of admin users. #456
    • Setting Page Optimization.
    • Task Results Optimization.

    Bug Fixes

    • Unable to find spider file error. #485
    • Click delete button results in redirect. #480
    • Unable to create files in an empty spider. #479
    • Download results error. #465
    • crawlab-sdk CLI error. #458
    • Page refresh issue. #441
    • Results not support JSON. #202
    • Getting all spiders after deleting a spider.
    • i18n warning.
  • v0.4.4(Jan 17, 2020)

    Features / Enhancement

    • Email Notification. Allow users to send email notifications.
    • DingTalk Robot Notification. Allow users to send DingTalk Robot notifications.
    • Wechat Robot Notification. Allow users to send Wechat Robot notifications.
    • API Address Optimization. Added relative URL path in frontend so that users don't have to specify CRAWLAB_API_ADDRESS explicitly.
    • SDK Compatibility. Allow users to integrate Scrapy or general spiders with the Crawlab SDK.
    • Enhanced File Management. Added a tree-like file sidebar to allow users to edit files much more easily.
    • Advanced Schedule Cron. Allow users to edit schedule cron with visualized cron editor.

    Bug Fixes

    • nil returned error.
    • Error when using HTTPS.
  • v0.4.3(Jan 7, 2020)

    Features / Enhancement

    • Dependency Installation. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) on the platform web interface.
    • Pre-install Programming Languages in Docker. Allow Docker users to set CRAWLAB_SERVER_LANG_NODE as Y to pre-install Node.js environments.
    • Add Schedule List in Spider Detail Page. Allow users to view / add / edit schedule cron jobs in the spider detail page. #360
    • Align Cron Expression with Linux. Changed the cron expression from 6 fields to 5 fields, as in Linux.
    • Enable/Disable Schedule Cron. Allow users to enable/disable the schedule jobs. #297
    • Better Task Management. Allow users to batch delete tasks. #341
    • Better Spider Management. Allow users to sort and filter spiders in the spider list page.
    • Added Chinese CHANGELOG.
    • Added Github Star Button at Nav Bar.

    Bug Fixes

    • Schedule Cron Task Issue. #423
    • Upload Spider Zip File Issue. #403 #407
    • Exit due to Network Failure. #340
    • Cron Jobs not Running Correctly
    • Schedule List Columns Mis-positioned
    • Clicking Refresh Button Redirected to 404 Page
  • v0.4.2(Dec 27, 2019)

    Features / Enhancement

    • Disclaimer. Added page for Disclaimer.
    • Call API to fetch version. #371
    • Configure to allow user registration. #346
    • Allow adding new users.
    • More Advanced File Management. Allow users to add / edit / rename / delete files. #286
    • Optimized Spider Creation Process. Allow users to create an empty customized spider before uploading the zip file.
    • Better Task Management. Allow users to filter tasks by certain criteria. #341

    Bug Fixes

    • Duplicated nodes. #391
    • "mongodb no reachable" error. #373
  • v0.4.1(Dec 14, 2019)

    Features / Enhancement

    • Spiderfile Optimization. Stages changed from dictionary to array. #358
    • Baidu Tongji Update.

    Bug Fixes

    • Unable to display schedule tasks. #353
    • Duplicate node registration. #334
  • v0.4.0(Dec 6, 2019)

  • v0.3.5(Oct 28, 2019)

    Features / Enhancement

    • Graceful Shutdown. detail
    • Node Info Optimization. detail
    • Append System Environment Variables to Tasks. detail
    • Auto Refresh Task Log. detail
    • Enable HTTPS Deployment. detail

    Bug Fixes

    • Unable to fetch spider list info in schedule jobs. detail
    • Unable to fetch node info from worker nodes. detail
    • Unable to select node when trying to run spider tasks. detail
    • Unable to fetch result count when result volume is large. #260
    • Node issue in schedule tasks. #244
  • v0.3.4(Oct 8, 2019)

    1. Fixed an issue where non-customized spiders could not be viewed in the frontend.
    2. Fixed an issue where killing the main process did not kill its child processes.
    3. Fixed an incorrect task status when a spider exited abnormally.
    4. Fixed an incorrect task status after killing a process.

  • v0.3.3(Oct 7, 2019)

  • v0.3.2(Sep 30, 2019)

    1. Refactored the spider synchronization process to sync spiders directly from GridFS.
    2. Fixed an issue where spider logs could not be fetched properly.
    3. Fixed an issue where spiders could not be synchronized properly.
    4. Fixed an issue where spiders could not be deleted properly.
    5. Fixed an issue where tasks could not be stopped properly.
    6. Optimized spider list search.

  • v0.3.1(Aug 24, 2019)

    Features / Enhancement

    • Docker Image Optimization. Split the Docker image further into master, worker and frontend images based on Alpine.
    • Unit Tests. Covered part of the backend code with unit tests.
    • Frontend Optimization. Login page, button size, hints of upload UI optimization.
    • More Flexible Node Registration. Allow users to pass a variable as key for node registration instead of MAC by default.

    Bug Fixes

    • Uploading Large Spider Files Error. Memory crash issue when uploading large spider files. #150
    • Unable to Sync Spiders. Fixes through increasing level of write permission when synchronizing spider files. #114
    • Spider Page Issue. Fixes through removing the field "Site". #112
    • Node Display Issue. Nodes do not display correctly when running docker containers on multiple machines. #99
  • v0.3.0(Jul 31, 2019)

    Features / Enhancement

    • Golang Backend: Refactored the backend from Python to Golang, resulting in much better stability and performance.
    • Node Network Graph: Visualization of node topology.
    • Node System Info: Available to see system info including OS, CPUs and executables.
    • Node Monitoring Enhancement: Nodes are monitored and registered through Redis.
    • File Management: Available to edit spider files online, including code highlight.
    • Login/Register/User Management: Require users to log in to use Crawlab; allow user registration and user management, with some role-based authorization.
    • Automatic Spider Deployment: Spiders are deployed/synchronized to all online nodes automatically.
    • Smaller Docker Image: Slimmed Docker image and reduced Docker image size from 1.3G to ~700M by applying Multi-Stage Build.

    Bug Fixes

    • Node Status. Node status did not change even when the node actually went offline. #87
    • Spider Deployment Error. Fixed through Automatic Spider Deployment #83
    • Node not showing. Node not able to show online #81
    • Cron Job not working. Fixed through new Golang backend #64
    • Flower Error. Fixed through new Golang backend #57
  • v0.2.4(Jul 22, 2019)

    Features / Enhancement

    • Documentation: Better and much more detailed documentation.
    • Better Crontab: Make crontab expression through crontab UI.
    • Better Performance: Switched from native flask engine to gunicorn. #78

    Bug Fixes

    • Deleting Spider. Deleting a spider now not only removes the record in the db but also removes the related folder, tasks and schedules. #69
    • MongoDB Auth. Allow users to specify authenticationDatabase when connecting to MongoDB. #68
    • Windows Compatibility. Added eventlet to requirements.txt. #59
  • v0.2.3(Jun 12, 2019)

  • v0.2.2(May 30, 2019)

  • v0.2.1(May 27, 2019)

  • v0.2(May 10, 2019)

  • v0.1.1(Apr 23, 2019)
