Fast website link checker in Go

Overview

Muffet

Muffet is a website link checker that recursively scrapes and inspects all pages of a website.

Features

  • Massive speed
  • Colored outputs
  • Support for various tags (a, img, link, script, etc.)

Installation

go install github.com/raviqqe/muffet/v2@latest

Homebrew

brew install muffet

Usage

muffet https://shady.bakery.hotland

For more information, see muffet --help.

Docker

docker run raviqqe/muffet https://shady.bakery.hotland

GitHub Action

Currently, we do not provide an official GitHub Action. Feel free to create an issue if you want one!

License

MIT

Comments
  • Support URLs containing whitespace

    Support URLs containing whitespace

    Improve support for checking URLs containing whitespace by unescaping the URL and then removing tabs and CR/LF characters. This allows URLs such as the one below to be checked correctly:

    Really Long Title

    Browsers such as Chrome and Firefox remove the embedded whitespace, resulting in the URL '/path/to/page/i-am-a-really-long-title-that-got-wrapped'. This commit tries to replicate that behavior.
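    The normalization described above can be sketched with the standard library as follows. This is an illustrative sketch, not muffet's actual code; the function name is made up.

    ```go
    package main

    import (
    	"fmt"
    	"net/url"
    	"strings"
    )

    // normalizeURL mimics browser behavior for wrapped links: it
    // percent-decodes the raw string and then drops embedded tabs and
    // CR/LF characters, which browsers strip from URLs.
    func normalizeURL(raw string) (string, error) {
    	s, err := url.PathUnescape(raw)
    	if err != nil {
    		return "", err
    	}
    	for _, c := range []string{"\t", "\r", "\n"} {
    		s = strings.ReplaceAll(s, c, "")
    	}
    	return s, nil
    }

    func main() {
    	s, _ := normalizeURL("/path/to/page/i-am-a-really-%0A\tlong-title")
    	fmt.Println(s) // /path/to/page/i-am-a-really-long-title
    }
    ```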

    opened by nwidger 15
  • 429 errors when using `--max-connections`

    429 errors when using `--max-connections`

    We have a bunch of GitHub issue links on a site, and even with --max-connections=10 --buffer-size=8192 --color=always --rate-limit=2 we're running into a lot of 429 errors. Any suggestions on how to avoid this?
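    The usual client-side mitigation is throttling requests below the server's limit. A minimal Go sketch of a ticker-based rate limiter, illustrative only and not muffet's implementation:

    ```go
    package main

    import (
    	"fmt"
    	"time"
    )

    // fetchAll issues at most rps requests per second by waiting on a
    // ticker before each call. The fetch callback is a stand-in for
    // the real HTTP request.
    func fetchAll(urls []string, rps int, fetch func(string)) {
    	tick := time.NewTicker(time.Second / time.Duration(rps))
    	defer tick.Stop()
    	for _, u := range urls {
    		<-tick.C // wait for the next token
    		fetch(u)
    	}
    }

    func main() {
    	start := time.Now()
    	fetchAll([]string{"/a", "/b", "/c", "/d"}, 20, func(u string) {})
    	// Four requests at 20 req/s take at least ~200 ms in total.
    	fmt.Println(time.Since(start) >= 150*time.Millisecond)
    }
    ```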

    bug 
    opened by PatrickHeneise 14
  • invalid status code 999

    invalid status code 999

    When I use Muffet on my LinkedIn profile, which certainly works, I get "invalid status code 999".

    muffet https://www.linkedin.com/in/chrisbenson

    Thanks, Chris

    opened by chrisbenson 11
  • Doesn't support links with hashes

    Doesn't support links with hashes

    Testing this tool with my website gives a lot of great information (thanks!) but runs into a few issues, of them being a link like:

    https://en.wikipedia.org/wiki/Bell_Labs#1970s
    

    It reports this link as a 400, I would guess because it is sending the full URL to the server instead of requesting only the portion

    https://en.wikipedia.org/wiki/Bell_Labs
    

    that a web browser or other client would request, because the #1970s portion is not part of the server request path.
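    The fix described here, stripping the fragment before the request while keeping it for anchor checking, can be sketched with net/url. Illustrative code, not muffet's actual implementation:

    ```go
    package main

    import (
    	"fmt"
    	"net/url"
    )

    // requestTarget strips the fragment before an HTTP request, since
    // everything after '#' is resolved client-side and must not be sent
    // to the server. The fragment is returned separately so the checker
    // can still verify that the id exists in the fetched page.
    func requestTarget(raw string) (target, fragment string, err error) {
    	u, err := url.Parse(raw)
    	if err != nil {
    		return "", "", err
    	}
    	fragment = u.Fragment
    	u.Fragment = ""
    	return u.String(), fragment, nil
    }

    func main() {
    	t, f, _ := requestTarget("https://en.wikipedia.org/wiki/Bell_Labs#1970s")
    	fmt.Println(t, f) // https://en.wikipedia.org/wiki/Bell_Labs 1970s
    }
    ```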

    opened by tmcw 10
  • Identical URLs requested multiple times

    Identical URLs requested multiple times

    When I run muffet against a local site, I see in the logs that some pages are requested many times in a single run. This seems unnecessary and puts extra load on the server being tested.

    Here is a simple example. Create "test.html" with this content:

    <html><body>
    <a href="/foo.html">foo</a>
    <a href="/test2.html">test2</a>
    </body></html>
    

    and "test2.html" with this:

    <html><body>
    <a href="/foo.html">foo</a>
    </body></html>
    

    Then serve this content with python3 -m http.server.

    And run muffet http://localhost:8000/test.html.

    The python http.server output I get is this:

    Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
    127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test.html HTTP/1.1" 200 -
    127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test2.html HTTP/1.1" 200 -
    127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
    127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
    127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
    127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
    

    This shows that "/foo.html" was requested multiple times.

    Strangely, small changes to those HTML files produce different results. If I add a link to test.html, muffet requests foo.html only once in the run.
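    Deduplicating requests is typically done with a concurrency-safe visited set that each crawler goroutine consults before fetching. A minimal Go sketch, illustrative and not muffet's implementation:

    ```go
    package main

    import (
    	"fmt"
    	"sync"
    )

    // visited is a concurrency-safe set ensuring each URL is fetched at
    // most once, even when many goroutines discover the same link.
    type visited struct {
    	mu   sync.Mutex
    	seen map[string]bool
    }

    // claim reports whether the caller is the first to see the URL;
    // only the first claimant should fetch it.
    func (v *visited) claim(u string) bool {
    	v.mu.Lock()
    	defer v.mu.Unlock()
    	if v.seen[u] {
    		return false
    	}
    	v.seen[u] = true
    	return true
    }

    func main() {
    	v := &visited{seen: map[string]bool{}}
    	fmt.Println(v.claim("/foo.html"), v.claim("/foo.html")) // true false
    }
    ```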

    duplicate 
    opened by fredcy 8
  • Spaces removed from links (all URLs)

    Spaces removed from links (all URLs)

    Related to #44, however, now I've come across a site (published on Read the Docs) which has many spaces in its filenames. Though the IDs are fine, the spaces get removed from the path, which breaks the links and results in many false 404s.

    Perhaps instead of removing spaces, use only the net/url Parse function? Or remove spaces only from the URL.Fragment? https://github.com/raviqqe/muffet/blob/4998c9b377f664bcad596990afeac7679cb306c8/scraper.go#L48-L54 https://github.com/raviqqe/muffet/blob/4998c9b377f664bcad596990afeac7679cb306c8/scraper.go#L82-L90 https://play.golang.org/p/42kUw1Rg23m
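    The net/url-based suggestion can be sketched as follows: parsing first separates path and fragment, so whitespace removal can be confined to the fragment while spaces in file names survive. Illustrative code, not a patch:

    ```go
    package main

    import (
    	"fmt"
    	"net/url"
    	"strings"
    )

    // cleanFragmentOnly parses the URL first, then strips whitespace
    // from the fragment alone, leaving spaces in the path intact
    // (they are percent-encoded on output instead of being dropped).
    func cleanFragmentOnly(raw string) (string, error) {
    	u, err := url.Parse(raw)
    	if err != nil {
    		return "", err
    	}
    	u.Fragment = strings.Join(strings.Fields(u.Fragment), "")
    	return u.String(), nil
    }

    func main() {
    	s, _ := cleanFragmentOnly("https://example.com/My File.html#some id")
    	fmt.Println(s) // https://example.com/My%20File.html#someid
    }
    ```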

    opened by StephenBrown2 6
  • Add an -i flag for ignoring the fragment part of URLs

    Add an -i flag for ignoring the fragment part of URLs

    Also a few other very minor changes.

    All tests pass.

    go fmt, go vet, golint, and megacheck have no complaints.

    Example usage:

    % ./muffet https://arodseth.com
    https://arodseth.com
            ERROR   https://stackoverflow.com/questions/166506/finding-local-ip-addresses-using-pythons-stdlib/1267524#1267524 (ID #1267524 not found)
            ERROR   https://stackoverflow.com/questions/2951028/is-it-possible-to-include-inline-assembly-in-google-go-code/6535590#6535590 (ID #6535590 not found)
            ERROR   https://unix.stackexchange.com/questions/82598/how-do-i-write-a-retry-logic-in-script-to-keep-retrying-to-run-it-upto-5-times/82610#82610 (ID #82610 not found)
    

    (returns error code 1)

    % ./muffet -i https://arodseth.com
    

    (returns error code 0)

    opened by xyproto 6
  • Memory leak on Windows 10

    Memory leak on Windows 10

    Not sure if it's platform-specific or not, but when we were running a scan of https://msdn.microsoft.com for fun, as mentioned in #27, I noticed muffet.exe using ~7 GB of memory and climbing...

    enhancement good first issue 
    opened by CJHarmath 6
  • Links are scanned multiple times

    Links are scanned multiple times

    Pretty cool project, saw it on hackers news!

    Just for fun, try running it against https://msdn.microsoft.com (they are notorious for having broken links all over the place).

    Example from the results showing duplicates (there are many):

    OK https://www.visualstudio.com
    OK https://www.visualstudio.com/
    OK https://www.visualstudio.com/
    OK https://www.visualstudio.com/

    opened by CJHarmath 6
  • x509: certificate signed by unknown authority error

    x509: certificate signed by unknown authority error

    I wanted to check my site (https://lazyd.org) with muffet. The site has a "COMODO CA Limited" certificate.

    But it exits with an x509: certificate signed by unknown authority error.

    If you know what causes it, please let me know.

    Also, maybe adding an --insecure flag would be great, like git -c http.sslVerify=false.
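    With the standard library, such a flag would amount to setting InsecureSkipVerify on the TLS config. This is a stdlib sketch only; muffet is built on fasthttp, so the real change would look different. Note that skipping verification disables protection against man-in-the-middle attacks and should be opt-in.

    ```go
    package main

    import (
    	"crypto/tls"
    	"fmt"
    	"net/http"
    )

    // newClient returns an HTTP client that optionally skips TLS
    // certificate verification, the behavior a hypothetical
    // --insecure flag would enable.
    func newClient(insecure bool) *http.Client {
    	return &http.Client{
    		Transport: &http.Transport{
    			TLSClientConfig: &tls.Config{InsecureSkipVerify: insecure},
    		},
    	}
    }

    func main() {
    	c := newClient(true)
    	t := c.Transport.(*http.Transport)
    	fmt.Println(t.TLSClientConfig.InsecureSkipVerify) // true
    }
    ```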

    enhancement good first issue question 
    opened by kybin 6
  • Error: "no free connections available to host"

    Error: "no free connections available to host"

    Many requests fail with "no free connections available to host", even if I specify -c1 on the command line. The error seems to come directly from fasthttp, but it looks like muffet is trying to create new connections before the previous ones have been closed completely.

    bug 
    opened by planbnet 5
  • failed to fetch root page: id #/ not found

    failed to fetch root page: id #/ not found

    I tried to run Muffet on a URL that contains a #.

    muffet "https://example.com/LcPI-kL7gFOvTj1-kjogmGRTxvKxFW3H#/"

    The error message I get is: failed to fetch root page: id #/ not found

    opened by Vad1mo 0
  • Anchors / fragments not rendered by JavaScript not found

    Anchors / fragments not rendered by JavaScript not found

    Hi :)

    I know there have been a couple of issues where anchors manipulated by JavaScript are not found (and I know why that is and that nothing can be done about it), but I've recently stumbled upon a website that has an anchor which is not manipulated by JavaScript, and it's still reported as "not found".

    Steps to reproduce:

    $ ~/bin/muffet --version
    2.6.1
    
    $ ~/bin/muffet --one-page-only https://deploy-preview-1034--kuma.netlify.app/docs/1.8.x/explore/gateway-api/
    https://deploy-preview-1034--kuma.netlify.app/docs/1.8.x/explore/gateway-api/
            error when reading response headers: small read buffer. Increase ReadBufferSize. Buffer size=4096, contents: "HTTP/1.1 200 OK\r\nDate: Mon, 17 Oct 2022 12:37:02 GMT\r\nPerf: 7626143928\r\nExpiry: Tue, 31 Mar 1981 05:00:00 GMT\r\nPragma: no-cache\r\nServer: tsa_o\r\nSet-Cookie: guest_id=v1%3A166601022209834356; Max-Age=34"..."rt?a=O5RXE%3D%3D%3D&ro=false\r\nStrict-Transport-Security: max-age=631138519\r\nCross-Origin-Opener-Policy: same-origin-allow-popups\r\nCross-Origin-Embedder-Policy: unsafe-none\r\nX-Response-Time: 135\r\nx-con"        https://twitter.com/KumaMesh
            id #install-experimental-channel not found      https://gateway-api.sigs.k8s.io/guides/getting-started/#install-experimental-channel
    

    Not sure if the Twitter one is relevant; it does not show up in our CI.

    Link to CI logs.

    And proof that in Chrome with JavaScript disabled, the anchor is there.

    Please let me know if this is something that could be fixed.

    opened by slonka 0
  • Add support for outputting to JSON all crawled URLs (including 200 ones)

    Add support for outputting to JSON all crawled URLs (including 200 ones)

    In addition to #38, where one wants to save the failed URLs as JSON, listing also all resources (i.e. those that return 30x or 200) could be useful.

    For example one could use muffet to crawl a site in order to extract a list of all dependent resources (CSS, JS, images, etc.) and other linked-to pages.

    Then one could use these URLs for other analytical purposes, or even to warmup a cache after a redeploy.

    With the current format, the links JSON list could be expanded to include all encountered URLs, replacing error with status to easily differentiate an error from a successful crawl.
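    A sketch of what such a record could look like; the field names below are illustrative, not muffet's actual JSON schema:

    ```go
    package main

    import (
    	"encoding/json"
    	"fmt"
    )

    // result records every crawled URL with its status code, so both
    // successes (200, 30x) and failures appear in the output.
    type result struct {
    	URL    string `json:"url"`
    	Status int    `json:"status"`
    }

    func main() {
    	out, _ := json.Marshal([]result{
    		{URL: "https://example.com/", Status: 200},
    		{URL: "https://example.com/old", Status: 301},
    		{URL: "https://example.com/gone", Status: 404},
    	})
    	fmt.Println(string(out))
    }
    ```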

    enhancement 
    opened by cipriancraciun 0
  • Add support for blacklisting `http://` (or other schemes)

    Add support for blacklisting `http://` (or other schemes)

    Given that today HTTPS is almost mandatory, it would be useful to have an option to report the presence of http:// URLs.

    As an extension to this, perhaps add a way to warn the user if other schemes are used, like for example ftp://, slack://, etc.

    Perhaps the simplest way to achieve this is to have either:

    • --allow-scheme http --allow-scheme https --allow-scheme ftp --allow-scheme mailto, where any scheme not listed is warned about;
    • --deny-scheme gopher --deny-scheme gemini, where any scheme listed is warned about;
    • (obviously, combining both --allow-scheme and --deny-scheme makes no sense;)
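    The allow-list variant can be sketched in a few lines of Go; the flag names and the function below are hypothetical:

    ```go
    package main

    import (
    	"fmt"
    	"net/url"
    )

    // allowedScheme reports whether a link's scheme is in the allow-list,
    // the check a hypothetical --allow-scheme flag would perform; any
    // scheme not listed would be warned about.
    func allowedScheme(raw string, allowed map[string]bool) bool {
    	u, err := url.Parse(raw)
    	if err != nil {
    		return false
    	}
    	return allowed[u.Scheme]
    }

    func main() {
    	allow := map[string]bool{"https": true, "mailto": true}
    	fmt.Println(allowedScheme("http://example.com", allow))  // false
    	fmt.Println(allowedScheme("https://example.com", allow)) // true
    }
    ```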
    enhancement 
    opened by cipriancraciun 2
  • Add support for multiple URLs

    Add support for multiple URLs

    Assuming #38 is solved (i.e. muffet doesn't fetch the same URL twice), it would be useful to allow muffet to take multiple URLs.

    For example, say one has both a www and a blog site that happen to share some resources. If one would be able to list both sites in the same muffet invocation, the shared URLs would be checked only once.

    A different use-case would be in conjunction with --one-page-only (i.e. turning recursion off) and listing all known URLs on the command line.


    Complementary to the multiple URLs, a separate option to read these URLs from a file would allow even more flexibility.

    For example, one could take the sitemap.xml, process that to extract the URLs that search engines would actually crawl, put these URLs in a file, one per line, and instruct muffet to execute only on those URLs.

    For example muffet --one-page-only --urls ./sitemap.txt would try all links listed in sitemap.txt without recursing.

    Meanwhile, muffet --urls ./sitemap.txt would try all links listed, but recurse for each link without crossing outside that link's domain.
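    Loading a URL list from a file is straightforward with bufio; a sketch of what a hypothetical --urls loader could do (the function name is made up):

    ```go
    package main

    import (
    	"bufio"
    	"fmt"
    	"io"
    	"strings"
    )

    // readURLs reads one URL per line, skipping blank lines, the way a
    // hypothetical --urls flag could consume a sitemap-derived list.
    // Taking an io.Reader lets a file or stdin work equally well.
    func readURLs(r io.Reader) []string {
    	var urls []string
    	s := bufio.NewScanner(r)
    	for s.Scan() {
    		if line := strings.TrimSpace(s.Text()); line != "" {
    			urls = append(urls, line)
    		}
    	}
    	return urls
    }

    func main() {
    	list := readURLs(strings.NewReader("https://a.example/\n\nhttps://b.example/\n"))
    	fmt.Println(len(list), list[0]) // 2 https://a.example/
    }
    ```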

    enhancement 
    opened by cipriancraciun 2
  • muffet generates 403 on pixabay.com

    muffet generates 403 on pixabay.com

    My site links to https://pixabay.com/ and if I check it with muffet it leads to a 403. I tried to set a custom header, but I still get a 403:

    --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"
    

    I don't want to scrape pixabay, but I would like to check all external links on my site. Currently my workaround is to exclude the site.

    The same problem occurs for pexels.com.

    It looks like this is not really a problem with muffet, as wget also produces a 403, but maybe muffet can do something about it.
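    For reference, setting a browser-like User-Agent with the standard library looks like the sketch below. Note that sites behind bot-protection CDNs often fingerprint more than the User-Agent header (TLS handshake, header order, cookies), so a 403 may persist even with this set.

    ```go
    package main

    import (
    	"fmt"
    	"net/http"
    )

    // browserRequest builds a GET request with a browser-like
    // User-Agent, the same value the --header flag above sets.
    func browserRequest(rawurl string) (*http.Request, error) {
    	req, err := http.NewRequest(http.MethodGet, rawurl, nil)
    	if err != nil {
    		return nil, err
    	}
    	req.Header.Set("User-Agent",
    		"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0")
    	req.Header.Set("Accept", "text/html,application/xhtml+xml")
    	return req, nil
    }

    func main() {
    	req, _ := browserRequest("https://pixabay.com/")
    	fmt.Println(req.Header.Get("User-Agent")[:11]) // Mozilla/5.0
    }
    ```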

    question 
    opened by c33s 2
Releases

v2.6.2

Owner

Yota Toyama