Fast website link checker in Go

Overview

Muffet

Muffet is a website link checker which scrapes and inspects all pages in a website recursively.

Features

  • Massive speed
  • Colored outputs
  • Support for different tags (a, img, link, script, etc.)

Installation

GO111MODULE=on go get -u github.com/raviqqe/muffet/v2

Homebrew

brew install muffet

Usage

muffet https://shady.bakery.hotland

For more information, see muffet --help.

Docker

docker run raviqqe/muffet https://shady.bakery.hotland

GitHub Action

Currently, we do not provide any official one. Feel free to create an issue if you want!

License

MIT

Issues
  • Support URLs containing whitespace

    Improve support for checking URLs containing whitespace by unescaping the URL and then removing tabs and CR/LF characters. This allows URLs such as the one below to be checked correctly:

    Really Long Title

    Browsers such as Chrome and Firefox remove the embedded whitespace, resulting in the URL '/path/to/page/i-am-a-really-long-title-that-got-wrapped'. This commit tries to duplicate this behavior.

    opened by nwidger 15
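
The normalization described in this issue can be sketched roughly as follows. This is a minimal illustration using Go's standard net/url and strings packages; normalizeHref is a hypothetical name and not necessarily how Muffet implements it.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// whitespaceStripper removes tabs, carriage returns, and line feeds.
var whitespaceStripper = strings.NewReplacer("\t", "", "\r", "", "\n", "")

// normalizeHref is a hypothetical helper: it unescapes percent-encoded
// characters first, then drops tabs and CR/LF characters.
func normalizeHref(raw string) (string, error) {
	unescaped, err := url.PathUnescape(raw)
	if err != nil {
		return "", err
	}
	return whitespaceStripper.Replace(unescaped), nil
}

func main() {
	// An href that was wrapped across lines in the HTML source.
	href := "/path/to/page/i-am-a-really-long-title-\n\tthat-got-wrapped"
	normalized, err := normalizeHref(href)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(normalized)
}
```

Unescaping before stripping means that percent-encoded whitespace such as %0A or %09 is removed as well.
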
  • 429 errors when using `--max-connections`

    We have a bunch of GitHub issue links on a site, and even with --max-connections=10 --buffer-size=8192 --color=always --rate-limit=2 we're running into a lot of 429 errors. Any suggestion on how to avoid this?

    bug 
    opened by PatrickHeneise 14
  • invalid status code 999

    When I use Muffet on my LinkedIn profile, which certainly works, I get "invalid status code 999".

    muffet https://www.linkedin.com/in/chrisbenson

    Thanks, Chris

    opened by chrisbenson 11
  • Doesn't support links with hashes

    Testing this tool with my website gives a lot of great information (thanks!) but runs into a few issues, one of them being a link like:

    https://en.wikipedia.org/wiki/Bell_Labs#1970s
    

    It reports this link to be a 400, I would guess because it's sending the full URL to the server instead of requesting the portion

    https://en.wikipedia.org/wiki/Bell_Labs
    

    that a web browser or other client would request, because the #1970s portion is not part of the server request path.

    opened by tmcw 10
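
The behavior this issue asks for can be sketched as below: drop the fragment before building the request URL. It assumes Go's standard net/url package; stripFragment is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripFragment is a hypothetical helper that returns the URL a client
// would actually request, i.e. the input without its #fragment part.
func stripFragment(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	requestURL, err := stripFragment("https://en.wikipedia.org/wiki/Bell_Labs#1970s")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(requestURL)
}
```

The fragment can still be kept separately if the checker later wants to verify that an element with that ID exists in the fetched page.
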
  • Identical URLs requested multiple times

    When I run muffet against a local site I see in the logs that some pages are being requested many times in a single run. This seems unnecessary and puts extra load on the server being tested.

    Here is a simple example. Create "test.html" with this content:

    <html><body>
    <a href="/foo.html">foo</a>
    <a href="/test2.html">test2</a>
    </body></html>
    

    and "test2.html" with this:

    <html><body>
    <a href="/foo.html">foo</a>
    </body></html>
    

    Then serve this content with python3 -m http.server.

    And run muffet http://localhost:8000/test.html.

    The python http.server output I get is this:

    Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
    127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test.html HTTP/1.1" 200 -
    127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test2.html HTTP/1.1" 200 -
    127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
    127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
    127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
    127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
    

    This shows that "/foo.html" was requested multiple times.

    Strangely, small changes to those html files cause different results. If I add a link to test.html, muffet requests foo.html only once in the run.

    duplicate 
    opened by fredcy 8
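
One common way to avoid repeated requests is a mutex-guarded set of URLs that have already been scheduled. The sketch below only illustrates that idea; visitedSet and markOnce are hypothetical names, and it makes no claim about how Muffet tracks URLs internally.

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet records URLs that have already been scheduled for a request.
// It is a hypothetical type for illustration only.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]struct{})}
}

// markOnce reports true only for the first caller that presents a URL,
// so concurrent workers that find the same link request it a single time.
func (v *visitedSet) markOnce(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, ok := v.seen[url]; ok {
		return false
	}
	v.seen[url] = struct{}{}
	return true
}

func main() {
	visited := newVisitedSet()
	// Links in the order they might be found on test.html and test2.html.
	links := []string{"/foo.html", "/test2.html", "/foo.html"}
	for _, link := range links {
		if visited.markOnce(link) {
			fmt.Println("GET", link)
		} else {
			fmt.Println("skip", link)
		}
	}
}
```

Checking and inserting under the same lock is what prevents two workers from both deciding that a URL is new.
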
  • Spaces removed from links (all URLs)

    Related to #44; however, I've now come across a site (published on Read the Docs) that has many spaces in its filenames. Though the IDs are fine, the spaces get removed from the path, which breaks the links and results in many false 404s.

    Perhaps, instead of removing spaces, use only the net/url Parse function? Or remove spaces only from URL.Fragment?

    https://github.com/raviqqe/muffet/blob/4998c9b377f664bcad596990afeac7679cb306c8/scraper.go#L48-L54
    https://github.com/raviqqe/muffet/blob/4998c9b377f664bcad596990afeac7679cb306c8/scraper.go#L82-L90
    https://play.golang.org/p/42kUw1Rg23m

    opened by StephenBrown2 6
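
As a rough illustration of the suggestion above, Go's net/url Parse accepts spaces in a path and percent-encodes them when the URL is printed, so the filename survives intact. The helper name and the example URL below are made up for this sketch.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeSpaces is a hypothetical helper that relies on url.Parse alone:
// spaces in the path are kept and emitted as %20 instead of being removed.
func encodeSpaces(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	// example.com and the file name are placeholders for illustration.
	encoded, err := encodeSpaces("https://example.com/docs/My File.html#section-1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(encoded)
}
```

Note that url.Parse rejects ASCII control characters such as tabs and newlines, so those would still need separate handling.
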
  • Add an -i flag for ignoring the fragment part of URLs

    Also a few other very minor changes.

    All tests pass.

    go fmt, go vet, go lint and megacheck have no complaints.

    Example usage:

    % ./muffet https://arodseth.com
    https://arodseth.com
            ERROR   https://stackoverflow.com/questions/166506/finding-local-ip-addresses-using-pythons-stdlib/1267524#1267524 (ID #1267524 not found)
            ERROR   https://stackoverflow.com/questions/2951028/is-it-possible-to-include-inline-assembly-in-google-go-code/6535590#6535590 (ID #6535590 not found)
            ERROR   https://unix.stackexchange.com/questions/82598/how-do-i-write-a-retry-logic-in-script-to-keep-retrying-to-run-it-upto-5-times/82610#82610 (ID #82610 not found)
    

    (returns error code 1)

    % ./muffet -i https://arodseth.com
    

    (returns error code 0)

    opened by xyproto 6
  • Memory leak on Windows 10

    Not sure if it's platform-specific or not, but when running a scan of https://msdn.microsoft.com for fun, as mentioned in #27, I just noticed muffet.exe using ~7 GB of memory, and it kept going...

    enhancement good first issue 
    opened by CJHarmath 6
  • Links are scanned multiple times

    Pretty cool project, saw it on Hacker News!

    Just for fun, try running it against https://msdn.microsoft.com (they are notorious for having broken links all over the place).

    Example from the result showing duplicates (there are many):

    OK https://www.visualstudio.com
    OK https://www.visualstudio.com/
    OK https://www.visualstudio.com/
    OK https://www.visualstudio.com/

    opened by CJHarmath 6
  • x509: certificate signed by unknown authority error

    I wanted to check my site (https://lazyd.org) with muffet. The site has a "COMODO CA Limited" certificate.

    But it exits with x509: certificate signed by unknown authority error.

    If you know what causes it, please let me know.

    Also, adding an -insecure flag would be great, like git -c http.sslVerify=false.

    enhancement good first issue question 
    opened by kybin 6
  • Error: "no free connections available to host"

    Many requests fail with "no free connections available to host", even if I specify -c1 on the command line. The error seems to come directly from fasthttp, but it looks like muffet is trying to create new connections before the previous ones have been closed completely.

    bug 
    opened by planbnet 5
  • Add support for outputting to JSON all crawled URLs (including 200 ones)

    In addition to #38, where one wants to save the failed URLs as JSON, listing also all resources (i.e. those that return 30x or 200) could be useful.

    For example one could use muffet to crawl a site in order to extract a list of all dependent resources (CSS, JS, images, etc.) and other linked-to pages.

    Then one could use these URLs for other analytical purposes, or even to warm up a cache after a redeploy.

    With the current format, the links JSON list could be expanded to include all encountered URLs, replacing error with status, so that errors and successful crawls are easy to tell apart.

    enhancement 
    opened by cipriancraciun 0
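
A possible shape for such output, sketched with Go's encoding/json. The type names and field names (url, links, status) are assumptions made for this illustration and are not Muffet's documented JSON schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// linkResult and pageResult are hypothetical types for illustration.
type linkResult struct {
	URL    string `json:"url"`
	Status int    `json:"status"`
}

type pageResult struct {
	URL   string       `json:"url"`
	Links []linkResult `json:"links"`
}

// encodePages serializes every crawled link, successful or not.
func encodePages(pages []pageResult) (string, error) {
	b, err := json.Marshal(pages)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	pages := []pageResult{{
		URL: "https://example.com/",
		Links: []linkResult{
			{URL: "https://example.com/style.css", Status: 200},
			{URL: "https://example.com/missing", Status: 404},
		},
	}}
	out, err := encodePages(pages)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

A numeric status lets consumers filter for successes, redirects, or failures without parsing error strings.
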
  • Add support for blacklisting `http://` (or other schemes)

    Given that today HTTPS is almost mandatory, it would be useful to have an option to report the presence of http:// URLs.

    As an extension to this, perhaps add a way to warn the user if other schemes are used, like for example ftp://, slack://, etc.

    Perhaps the simplest way to achieve this is to have either:

    • --allow-scheme http --allow-scheme https --allow-scheme ftp --allow-scheme mailto where any scheme not listed is warned;
    • --deny-scheme gopher --deny-scheme gemini where any scheme listed is warned;
    • (obviously, combining both --allow-scheme and --deny-scheme makes no sense).
    enhancement 
    opened by cipriancraciun 2
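
The allow-list variant proposed above could be checked along these lines. The --allow-scheme flag is only a proposal in this issue, and schemeAllowed is a hypothetical helper built on Go's standard net/url package.

```go
package main

import (
	"fmt"
	"net/url"
)

// schemeAllowed is a hypothetical helper that reports whether the scheme
// of raw appears in the allow list.
func schemeAllowed(raw string, allowed map[string]bool) (bool, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return false, err
	}
	return allowed[u.Scheme], nil
}

func main() {
	// Equivalent to the proposed: --allow-scheme https --allow-scheme mailto
	allowed := map[string]bool{"https": true, "mailto": true}
	links := []string{
		"https://example.com/",
		"http://example.com/",
		"mailto:someone@example.com",
	}
	for _, link := range links {
		ok, err := schemeAllowed(link, allowed)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if !ok {
			fmt.Println("warning: scheme not allowed:", link)
		}
	}
}
```

A deny list would use the same lookup with the result inverted.
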
  • Add support for multiple URLs

    Assuming #38 is solved (i.e. muffet doesn't fetch the same URL twice), it would be useful to allow muffet to take multiple URLs.

    For example, say one has both a www and a blog site that happen to share some resources. If one would be able to list both sites in the same muffet invocation, the shared URLs would be checked only once.

    A different use-case would be in conjunction with --one-page-only (i.e. turning recursion off) and listing all known URLs on the command line.


    Complementing multiple-URL support, a separate option to read these URLs from a file would allow even more flexibility.

    For example, one could take the sitemap.xml, process that to extract the URLs that search engines would actually crawl, put these URLs in a file, one per line, and instruct muffet to execute only on those URLs.

    For example muffet --one-page-only --urls ./sitemap.txt would try all links listed in sitemap.txt without recursing.

    Meanwhile, muffet --urls ./sitemap.txt would try all links listed and recurse from each one, but without leaving that URL's domain.

    enhancement 
    opened by cipriancraciun 2
  • muffet generates 403 on pixabay.com

    my site links to https://pixabay.com/ and if i check it with muffet it leads to a 403. tried to set a custom header but i still get a 403

    --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"
    

    i don't want to scrape pixabay but i would like to check all external links on my site. currently my workaround is to exclude the site.

    same problem for pexels.com.

    looks like it is not really a problem of muffet as wget also produces a 403 but maybe muffet can do something about it

    question 
    opened by c33s 2
  • Support for proxy environment variables would be good

    like wget or curl the support of proxy environment variables would be a good thing

    • HTTP_PROXY
    • HTTPS_PROXY
    • NO_PROXY

    good article about proxy env variables: https://about.gitlab.com/blog/2021/01/27/we-need-to-talk-no-proxy/

    enhancement 
    opened by c33s 0
Releases(v2.6.0)
Owner
Yota Toyama