Fast, dependency-free, small Go package to infer the binary file type based on the magic numbers signature

Overview

filetype

Small, dependency-free Go package that infers the file and MIME type by checking the magic number signature.

For SVG file type checking, see go-is-svg package. Python port: filetype.py.

Features

  • Supports a wide range of file types
  • Provides file extension and proper MIME type
  • File discovery by extension or MIME type
  • File discovery by class (image, video, audio...)
  • Provides a bunch of helpers and file matching shortcuts
  • Pluggable: add custom new types and matchers
  • Simple and semantic API
  • Fast, even when processing large files
  • Only the first 262 bytes (the maximum file header size) are required, so you can pass just a slice
  • Dependency free (just Go code, no C compilation needed)
  • Cross-platform file recognition

Installation

go get github.com/h2non/filetype

API

See Godoc reference.

Subpackages

  • github.com/h2non/filetype/types
  • github.com/h2non/filetype/matchers

Examples

Simple file type checking

package main

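import (
  "fmt"
  "os"

  "github.com/h2non/filetype"
)

func main() {
  // Read the file and handle the error instead of discarding it
  buf, err := os.ReadFile("sample.jpg")
  if err != nil {
    fmt.Println("Cannot read file:", err)
    return
  }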

  kind, _ := filetype.Match(buf)
  if kind == filetype.Unknown {
    fmt.Println("Unknown file type")
    return
  }

  fmt.Printf("File type: %s. MIME: %s\n", kind.Extension, kind.MIME.Value)
}

Check type class

package main

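import (
  "fmt"
  "os"

  "github.com/h2non/filetype"
)

func main() {
  // Read the file and handle the error instead of discarding it
  buf, err := os.ReadFile("sample.jpg")
  if err != nil {
    fmt.Println("Cannot read file:", err)
    return
  }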

  if filetype.IsImage(buf) {
    fmt.Println("File is an image")
  } else {
    fmt.Println("Not an image")
  }
}

Supported type

package main

import (
  "fmt"

  "github.com/h2non/filetype"
)

func main() {
  // Check if file is supported by extension
  if filetype.IsSupported("jpg") {
    fmt.Println("Extension supported")
  } else {
    fmt.Println("Extension not supported")
  }

  // Check if file is supported by MIME type
  if filetype.IsMIMESupported("image/jpeg") {
    fmt.Println("MIME type supported")
  } else {
    fmt.Println("MIME type not supported")
  }
}

File header

package main

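import (
  "fmt"
  "os"

  "github.com/h2non/filetype"
)

func main() {
  // Open a file descriptor
  file, err := os.Open("movie.mp4")
  if err != nil {
    fmt.Println("Cannot open file:", err)
    return
  }
  defer file.Close()

  // We only have to pass the file header = first 262 bytes
  head := make([]byte, 262)
  n, _ := file.Read(head)
  head = head[:n] // keep only the bytes actually read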

  if filetype.IsImage(head) {
    fmt.Println("File is an image")
  } else {
    fmt.Println("Not an image")
  }
}

Add additional file type matchers

package main

import (
  "fmt"

  "github.com/h2non/filetype"
)

var fooType = filetype.NewType("foo", "foo/foo")

func fooMatcher(buf []byte) bool {
  return len(buf) > 1 && buf[0] == 0x01 && buf[1] == 0x02
}

func main() {
  // Register the new matcher and its type
  filetype.AddMatcher(fooType, fooMatcher)

  // Check if the new type is supported by extension
  if filetype.IsSupported("foo") {
    fmt.Println("New supported type: foo")
  }

  // Check if the new type is supported by MIME
  if filetype.IsMIMESupported("foo/foo") {
    fmt.Println("New supported MIME type: foo/foo")
  }

  // Try to match the file
  fooFile := []byte{0x01, 0x02}
  kind, _ := filetype.Match(fooFile)
  if kind == filetype.Unknown {
    fmt.Println("Unknown file type")
  } else {
    fmt.Printf("File type matched: %s\n", kind.Extension)
  }
}

Supported types

Image

  • jpg - image/jpeg
  • png - image/png
  • gif - image/gif
  • webp - image/webp
  • cr2 - image/x-canon-cr2
  • tif - image/tiff
  • bmp - image/bmp
  • heif - image/heif
  • jxr - image/vnd.ms-photo
  • psd - image/vnd.adobe.photoshop
  • ico - image/vnd.microsoft.icon
  • dwg - image/vnd.dwg

Video

  • mp4 - video/mp4
  • m4v - video/x-m4v
  • mkv - video/x-matroska
  • webm - video/webm
  • mov - video/quicktime
  • avi - video/x-msvideo
  • wmv - video/x-ms-wmv
  • mpg - video/mpeg
  • flv - video/x-flv
  • 3gp - video/3gpp

Audio

  • mid - audio/midi
  • mp3 - audio/mpeg
  • m4a - audio/m4a
  • ogg - audio/ogg
  • flac - audio/x-flac
  • wav - audio/x-wav
  • amr - audio/amr
  • aac - audio/aac

Archive

  • epub - application/epub+zip
  • zip - application/zip
  • tar - application/x-tar
  • rar - application/vnd.rar
  • gz - application/gzip
  • bz2 - application/x-bzip2
  • 7z - application/x-7z-compressed
  • xz - application/x-xz
  • zstd - application/zstd
  • pdf - application/pdf
  • exe - application/vnd.microsoft.portable-executable
  • swf - application/x-shockwave-flash
  • rtf - application/rtf
  • iso - application/x-iso9660-image
  • eot - application/octet-stream
  • ps - application/postscript
  • sqlite - application/vnd.sqlite3
  • nes - application/x-nintendo-nes-rom
  • crx - application/x-google-chrome-extension
  • cab - application/vnd.ms-cab-compressed
  • deb - application/vnd.debian.binary-package
  • ar - application/x-unix-archive
  • Z - application/x-compress
  • lz - application/x-lzip
  • rpm - application/x-rpm
  • elf - application/x-executable
  • dcm - application/dicom

Documents

  • doc - application/msword
  • docx - application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • xls - application/vnd.ms-excel
  • xlsx - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  • ppt - application/vnd.ms-powerpoint
  • pptx - application/vnd.openxmlformats-officedocument.presentationml.presentation

Font

  • woff - application/font-woff
  • woff2 - application/font-woff
  • ttf - application/font-sfnt
  • otf - application/font-sfnt

Application

  • wasm - application/wasm
  • dex - application/vnd.android.dex
  • dey - application/vnd.android.dey

Benchmarks

Measured using real files.

Environment: OS X, x64, Intel Core i7 2.7 GHz

BenchmarkMatchTar-8    1000000        1083 ns/op
BenchmarkMatchZip-8    1000000        1162 ns/op
BenchmarkMatchJpeg-8   1000000        1280 ns/op
BenchmarkMatchGif-8    1000000        1315 ns/op
BenchmarkMatchPng-8    1000000        1121 ns/op

License

MIT - Tomas Aparicio

Issues
  • support ms ooxml

    The code comes from the msooxml magic rule of the file utility.

    Matchers is a map, and map iteration order is random, so the zip rule may be matched before the msooxml rules. I simply added an array that keeps the same order in which the matchers are registered.

    Also, the msooxml rules share the same check code, but I don't know how to merge them into one type, because the type and the checker seem to be bound together when calling register.

    opened by kumakichi 7
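
The ordering concern in this report can be illustrated with a small standalone registry. This is only a sketch using the standard library: `registry`, `register`, `match`, `isZip`, and `isDocx` are hypothetical names, not the package's API, and the docx check is deliberately simplified. The slice records registration order so that a specific rule can be tried before the generic ZIP rule.

```go
package main

import (
	"bytes"
	"fmt"
)

// matcher reports whether buf looks like a given type.
type matcher func(buf []byte) bool

// registry is a hypothetical stand-in for a matcher table.
// The map gives lookup by name; the slice remembers registration order.
type registry struct {
	byName map[string]matcher
	order  []string
}

func newRegistry() *registry {
	return &registry{byName: make(map[string]matcher)}
}

// register adds a matcher; matchers registered first are tried first.
func (r *registry) register(name string, m matcher) {
	if _, exists := r.byName[name]; !exists {
		r.order = append(r.order, name)
	}
	r.byName[name] = m
}

// match walks the matchers in registration order, which is deterministic,
// unlike ranging over the map directly.
func (r *registry) match(buf []byte) string {
	for _, name := range r.order {
		if r.byName[name](buf) {
			return name
		}
	}
	return "unknown"
}

var zipMagic = []byte{'P', 'K', 0x03, 0x04}

func isZip(buf []byte) bool { return bytes.HasPrefix(buf, zipMagic) }

// isDocx is simplified for illustration: a ZIP whose bytes mention "word/".
func isDocx(buf []byte) bool {
	return isZip(buf) && bytes.Contains(buf, []byte("word/"))
}

func main() {
	r := newRegistry()
	r.register("docx", isDocx) // specific rule first
	r.register("zip", isZip)   // generic fallback second

	doc := []byte("PK\x03\x04....word/document.xml")
	fmt.Println(r.match(doc))      // docx
	fmt.Println(r.match(zipMagic)) // zip
}
```

Registering the generic rule first would make every docx buffer match as zip, which is the behavior the report wants to avoid.
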
  • shorter way of comparing byte slices

    To check byte slices for equality, it's simpler to call bytes.Equal() (or even bytes.HasPrefix() to get rid of the explicit length check).

    It can be applied in other places, this just does it in one place.

    On a side note, in Go slices have a length built in, so passing in the length as an argument is not necessary. len(buf) should return the same thing.

    opened by kjk 6
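
A minimal comparison of the two styles mentioned in this report, using the PNG signature bytes as the example. `isPNGManual` and `isPNGPrefix` are hypothetical names written for this sketch; they are not the package's matchers.

```go
package main

import (
	"bytes"
	"fmt"
)

// First four bytes of the PNG signature.
var pngMagic = []byte{0x89, 'P', 'N', 'G'}

// isPNGManual is the explicit style: a length check plus byte-by-byte comparison.
func isPNGManual(buf []byte) bool {
	return len(buf) > 3 &&
		buf[0] == 0x89 && buf[1] == 0x50 &&
		buf[2] == 0x4E && buf[3] == 0x47
}

// isPNGPrefix relies on bytes.HasPrefix, which already returns false
// when buf is shorter than the prefix.
func isPNGPrefix(buf []byte) bool {
	return bytes.HasPrefix(buf, pngMagic)
}

func main() {
	samples := [][]byte{
		{0x89, 'P', 'N', 'G', 0x0D, 0x0A}, // full prefix present
		{0x89, 'P'},                       // too short
		nil,                               // empty input
	}
	for _, s := range samples {
		fmt.Println(isPNGManual(s), isPNGPrefix(s))
	}
}
```

Both functions agree on every sample; the second form has no index arithmetic to get wrong.
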
  • switching filetype to use Ragel

    Thanks for the library.

    The benchmark was of particular interest to me. When matching the contents of a file, there are more efficient ways to detect binary patterns.

    I did a proof of concept using Ragel. It is an external dependency, but it generates the final golang code as an efficient state machine.

    At the time of writing this issue, I was able to support your benchmarks for images, zip, and tar. The documents that have XML were skipped at the moment because I cannot discern their patterns as easily as the others.

    The benchmarks were run with the same fixtures.

    These are the results. A test was used to validate that the correct file types were being returned, too.

    goos: darwin
    goarch: amd64
    BenchmarkMatchTar-4    	50000000	       183 ns/op
    BenchmarkMatchZip-4    	1000000000	         6.23 ns/op
    BenchmarkMatchJpeg-4   	2000000000	         4.98 ns/op
    BenchmarkMatchGif-4    	2000000000	         4.47 ns/op
    BenchmarkMatchPng-4    	1000000000	         6.80 ns/op
    

    This happened on a 1.7 GHz Intel Core i7 Macbook Air 2014.

    I'd like to contribute the work back. It seems that we can get this to be really fast.

    Ragel machines can be language agnostic, so the same machine could be used for C-Python.

    opened by jtarchie 5
  • MP4 file that is not H.264 isn't detected

    I have an MP4 file that contains a "mpeg-4" video (MPEG-4 Visual in MediaInfo), but it is detected as "Unknown" by the library. Shouldn't it check only the MP4 header, not the contained codec per se?

    bug enhancement 
    opened by RangelReale 5
  • panic: runtime error: slice bounds out of range

    While running go-fuzz on one of our services, I discovered an input that raised the following runtime error:

    panic: runtime error: slice bounds out of range
    
    goroutine 1 [running]:
    github.com/h2non/filetype/matchers/isobmff.GetFtyp(0x7f3b1f727000, 0x1a, 0x1a, 0x489801, 0x4cf652, 0x4cf652, 0x4, 0x240bd42694a81301, 0xc000049c70, 0x40c7ff)
    	/home/<name>/gocode/src/github.com/h2non/filetype/matchers/isobmff/isobmff.go:27 +0x353
    github.com/h2non/filetype/matchers.Heif(0x7f3b1f727000, 0x1a, 0x1a, 0x4a2070)
    	/home/<name>/gocode/src/github.com/h2non/filetype/matchers/image.go:119 +0xb8
    github.com/h2non/filetype/matchers.NewMatcher.func1(0x7f3b1f727000, 0x1a, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    	/home/<name>/gocode/src/github.com/h2non/filetype/matchers/matchers.go:26 +0x81
    gopkg.in/h2non/filetype%2ev1.Match(0x7f3b1f727000, 0x1a, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    	/home/<name>/gocode/src/gopkg.in/h2non/filetype.v1/match.go:29 +0x20a
    gopkg.in/h2non/filetype%2ev1.Get(...)
    	/home/<name>/gocode/src/gopkg.in/h2non/filetype.v1/match.go:40
    github.com/h2non/filetype.Fuzz(0x7f3b1f727000, 0x1a, 0x1a, 0x4)
    	/home/<name>/gocode/src/github.com/h2non/filetype/fuzz.go:9 +0x7a
    go-fuzz-dep.Main(0xc000049f80, 0x1, 0x1)
    	/tmp/go-fuzz-build324713724/goroot/src/go-fuzz-dep/main.go:36 +0x1b6
    main.main()
    	/tmp/go-fuzz-build324713724/gopath/src/github.com/h2non/filetype/go.fuzz.main/main.go:15 +0x52
    exit status 2
    

    83f44c13f8e6579e1f5e3ec0d047160288363c99.zip

    opened by heggiz 4
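
The trace in this report points at slicing inside the ISO BMFF "ftyp" reader. The sketch below shows one defensive way to read such a box; it is not the package's actual fix. `parseFtyp` is a hypothetical function, and the key assumption is that the box size field comes from untrusted input and must be clamped to the bytes actually available.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parseFtyp is a hypothetical, bounds-checked reader for an ISO BMFF "ftyp" box.
// Layout: 4-byte big-endian box size, "ftyp", 4-byte major brand,
// 4-byte minor version, then zero or more 4-byte compatible brands.
func parseFtyp(buf []byte) (major string, compatible []string, ok bool) {
	if len(buf) < 16 || string(buf[4:8]) != "ftyp" {
		return "", nil, false
	}
	// The declared size is untrusted: never read past the bytes we hold.
	end := len(buf)
	if size := binary.BigEndian.Uint32(buf[0:4]); uint64(size) < uint64(end) {
		end = int(size)
	}
	major = string(buf[8:12])
	for i := 16; i+4 <= end; i += 4 {
		compatible = append(compatible, string(buf[i:i+4]))
	}
	return major, compatible, true
}

func main() {
	// The box claims 32 bytes, but only 22 are present (truncated input).
	truncated := []byte{
		0, 0, 0, 0x20, 'f', 't', 'y', 'p',
		'h', 'e', 'i', 'c', 0, 0, 0, 0,
		'm', 'i', 'f', '1', 'h', 'e',
	}
	major, brands, ok := parseFtyp(truncated)
	fmt.Println(major, brands, ok) // heic [mif1] true
}
```

The incomplete trailing brand is ignored instead of triggering an out-of-range slice.
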
  • Fix MP4 matcher

    With information from http://www.file-recovery.com/mp4-signature-format.htm.

    I tested on many MP4 files I have here, all were detected.

    I redid the implementation in a way that makes it easier to add more 4-byte codes, as it seems there are many of them.

    Let me know if this approach is OK with you, or if you prefer it done the way it was before.

    opened by RangelReale 4
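
The table-driven idea described in this PR can be sketched as follows. The brand set is illustrative and incomplete, and `mp4Brands` and `isMP4` are hypothetical names rather than the PR's code. The check looks only at the container header, not at the codec inside.

```go
package main

import "fmt"

// mp4Brands is an illustrative, incomplete set of ftyp major brands.
// Supporting another 4-byte code is a one-line change.
var mp4Brands = map[string]bool{
	"isom": true,
	"iso2": true,
	"mp41": true,
	"mp42": true,
	"avc1": true,
}

// isMP4 checks the container header only: bytes 4..7 must be "ftyp"
// and bytes 8..11 must be a known major brand.
func isMP4(buf []byte) bool {
	if len(buf) < 12 || string(buf[4:8]) != "ftyp" {
		return false
	}
	return mp4Brands[string(buf[8:12])]
}

func main() {
	mp4 := []byte("\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00")
	mov := []byte("\x00\x00\x00\x14ftypqt  \x00\x00\x00\x00")
	fmt.Println(isMP4(mp4), isMP4(mov)) // true false
}
```
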
  • Enhance Zstd support

    Zstandard compressed data is made of one or more frames. There are two frame formats defined by Zstandard: Zstandard frames and Skippable frames.

    See more details from https://tools.ietf.org/id/draft-kucherawy-dispatch-zstd-00.html#rfc.section.2

    The structure of a single Zstandard frame is as follows, the magic number of Zstandard frame is 0xFD2FB528

      +--------------------+------------+
      |    Magic_Number    | 4 bytes    |
      +--------------------+------------+
      |    Frame_Header    | 2-14 bytes |
      +--------------------+------------+
      |     Data_Block     | n bytes    |
      +--------------------+------------+
      | [More Data Blocks] |            |
      +--------------------+------------+
      | [Content Checksum] | 0-4 bytes  |
      +--------------------+------------+
    

    Skippable Frames

      +--------------+------------+-----------+
      | Magic_Number | Frame_Size | User_Data |
      +--------------+------------+-----------+
      |    4 bytes   |   4 bytes  |  n bytes  |
      +--------------+------------+-----------+
    
    Magic_Number: 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
    Frame_Size: This is the size `n` of the following UserData, 4 bytes, little-endian format, unsigned 32-bits.
    

    This library can't handle a zstd file that contains a skippable frame; this PR fixes that. For example, when a Skippable frame with magic number 0x184D2A50 precedes the Zstandard frame's magic number 0xFD2FB528, we should parse the Skippable frame, skip its user data, and then check for the magic number 0xFD2FB528.

    By the way, I can't find an elegant way to write another test for zstd, so I just wrote a test under the for loop.

    opened by bkda 3
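
The skip-then-check procedure described in this PR can be sketched as below. This is not the PR's code: `isZstd` and the constant names are hypothetical, and the sketch assumes the whole skippable frame is present in the buffer. Both magic numbers are stored little-endian, as the frame description above states for Frame_Size.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	zstdMagic          = 0xFD2FB528 // Zstandard frame
	skippableMagic     = 0x184D2A50 // 0x184D2A50 .. 0x184D2A5F
	skippableMagicMask = 0xFFFFFFF0
)

// isZstd reports whether buf starts with a Zstandard frame, optionally
// preceded by one or more Skippable frames.
func isZstd(buf []byte) bool {
	for len(buf) >= 4 {
		magic := binary.LittleEndian.Uint32(buf[:4])
		if magic == zstdMagic {
			return true
		}
		if magic&skippableMagicMask != skippableMagic || len(buf) < 8 {
			return false
		}
		// Frame_Size is the length of User_Data (little-endian uint32).
		size := uint64(binary.LittleEndian.Uint32(buf[4:8]))
		if size > uint64(len(buf)-8) {
			return false // user data extends past the bytes we were given
		}
		buf = buf[8+int(size):] // skip magic, size field, and user data
	}
	return false
}

func main() {
	plain := []byte{0x28, 0xB5, 0x2F, 0xFD, 0x00}
	skippable := []byte{
		0x50, 0x2A, 0x4D, 0x18, // Skippable frame magic
		0x03, 0x00, 0x00, 0x00, // Frame_Size = 3
		'a', 'b', 'c', // User_Data
		0x28, 0xB5, 0x2F, 0xFD, // Zstandard frame magic
	}
	fmt.Println(isZstd(plain), isZstd(skippable), isZstd([]byte("PK\x03\x04")))
	// true true false
}
```

A buffer that holds only skippable frames is reported as not matched here; whether that is the right call is a design choice this sketch does not settle.
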
  • sample.dex file triggering antivirus engines :/

    I just had an awkward situation trying to go get a tool that used this module from my work laptop: the corporate cybersecurity solution (Fortinet FortiClient Antivirus) tripped on sample.dex, telling me it thinks it's some kind of Android trojan.

    VirusTotal also reports positives from several other AV engines: https://www.virustotal.com/gui/file/8995adc809fd239ecd2806c6957ee98db6eb06b64dac55089644014d87e6f956/detection

    That said, I don't believe you meant harm or are trying to sneak in trojans to the world though. This looks like an unfortunate case of a suspicious file that made it into the unit tests suite; that is all.

    I saw it was added by a commit from @mikusjelly but where did they get the file from? In any case, do you think it could be possible to swap it for another .dex that is not flagged as highly suspicious? -- If you upload the new .dex to virustotal.com for a scan and if it comes out totally clean then it's good for the repo.

    What do you think?

    ps: I emailed Fortinet to report it as a possible false positive and they came back to me with:

    The sample contains suspicious codes that are related to the SMS service, purchase interface, payment, bill, China Mobile, China Unicom, and China Telecommunications Corporation. The class names and function names are all simply obfuscated, and it also involved the "android.provider.Telephony.SMS_RECEIVED" and "android.provider.Telephony.SMS_DELIVER" as part of the suspicious behaviors.

    opened by darkvertex 3
  • Office filetypes?

    I really like the no-deps, no-cgo approach, just like https://godoc.org/net/http#DetectContentType

    Would still be useful to have more filetypes, though. Have you thought about office filetypes?

    • xls, xlsx
    • doc, docx
    • ppt, pptx
    • odt
    • ods
    • odp
    enhancement 
    opened by ptman 3
  • Travis-ci: added support for ppc64le

    Signed-off-by: Devendranath Thadi

    Added power support for the travis.yml file with ppc64le. This is part of the Ubuntu distribution for ppc64le. This helps us simplify testing later when distributions are re-building and re-releasing.

    opened by dthadi3 2
  • Replace some fixtures with provably free content (#46)

    It seems, as @aviau noted in #46, that at least the sample.tif file in the fixtures directory is non-free. The file contained a message right in the image: "This file is distributed with Techsoft PixEdit as a sample file, and is used with permission from the document owner." (That permission would NOT, in most countries' copyright law, implicitly extend to anyone OTHER than Techsoft.)

    Some of the other files in the directory appeared similarly suspect, or at least there was nothing to indicate that they ARE free content. And since there's an absolute wealth of free content out there, for at least certain formats, it makes no sense to use anything other than free content. So, this PR gets the ball rolling by replacing three "low-hanging fruit", including and especially the Techsoft TIF image.

    • sample.gif is a CC-BY-SA licensed image via Wikimedia Commons
    • sample.tif is a public-domain Hubble Space Telescope image (thanks NASA!)
    • sample.webm is a CC-BY licensed video via Wikimedia Commons

    This change does not come without some tradeoffs in terms of file size.

    • sample.gif grows from 3.3 KB to 390 KB, but it's a far better test file for it
    • sample.webm grows from ~ 230 KB to ~ 330 KB, fairly minor
    • sample.tif grows from 209 KB to an obscene 5 MB, and I do apologize for that, but one of the claims is that the tests are run on "real files", and it is hard to find SMALL "real" TIFF images. There is a strong bias for prioritizing quality and resolution over compactness when storing data in TIFF form. (I had to search through quite a few NASA galleries to find a file that small; others were tens or many hundreds of megabytes, some over a full gig!)

    I personally think the increased sizes are a reasonable tradeoff, and 5 MB in today's terms is really not that big a deal. But if it's unacceptably large, I can keep looking for a smaller replacement for at least the TIF file.

    Last but not least, a new file fixtures/sources.txt provides provenance details for all three files, including the applicable free content license details and any required attributions.

    Partly addresses: #46

    opened by ferdnyc 2
  • tar file not being recognized

    :wave: filetype happy user here! Today someone opened an issue in my bin project (https://github.com/marcosnils/bin/issues/140) which led me here.

    Filetype is unable to detect the tar archive inside this gzipped file: https://github.com/sass/dart-sass/releases/download/1.52.3/dart-sass-1.52.3-linux-x64.tar.gz. However, tar -xf works, and running file <dart-sass-1.52.3-linux-x64.tar> correctly detects the file type.

    It seems the file's MIME headers are not set properly. Still, it's interesting that file detects it as a tar archive even when the extension is removed.

    file -i pepe 
    pepe: application/x-tar; charset=binary
    

    Any pointers here?

    opened by marcosnils 1
  • m4a mime type

    filetype version: 1.1.3

    I'm uploading an audio file with the .m4a extension, and it's detected as video:

    filetype.IsVideo(buf) == true
    filetype.IsAudio(buf) == false
    
    types.Type={{video 3gpp video/3gpp} 3gp})
    

    I believe it should be something like audio/m4a

    opened by dubrovine 0
  • Full file needed for Documents

    The README specifically states:

    Only first 262 bytes representing the max file header is required, so you can just pass a slice

    I've tried this out and it works fine for all files except MS Office docs such as docx, xlsx, etc. These files have a kind of application/zip if given only the first 262 bytes, but if you give them the full file, either with MatchFile or MatchReader they are detected correctly.

    In fact, each file type seems to have a different minimum buffer length for filetype to report accurately. docx only seems to require a minimum of 1750 bytes, while .xlsm requires a minimum of 1855 bytes. For each of these files, a buffer shorter than this is inaccurately reported as application/zip. For my application, this is very important.

    For now I'll have to do the work of determining the minimum buffer size for MSO files to report accurately, but if you know this already, please update the docs, or at least have a caveat around the 262 number.

    opened by onetwopunch 3
  • Wrong category of PDF & RTF, may be PS?

    Why are Portable Document Format (PDF) and RTF listed as archive formats? They are documents. Also, PostScript (a programming language) looks like it should be listed under Application rather than Archive, similar to wasm.

    opened by tim-caper 0
  • add dll support

    Both exe and dll are PE files. []byte{0x4D, 0x5A} is PE's magic number

    Differentiation:

    > IMAGE_FILE_DLL
    > 0x2000
    > The image is a DLL file. While it is an executable file, it cannot be run directly.
    

    Thanks.

    Source: https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-image_file_header?redirectedfrom=MSDN https://docs.microsoft.com/en-us/previous-versions/ms809762(v=msdn.10)?redirectedfrom=MSDN

    opened by ZhangMengRou 1
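
The differentiation quoted in this request can be sketched as follows. `isDLL` and `fakePE` are hypothetical helpers, not part of the package. The sketch assumes the standard PE layout: the 4-byte offset of the "PE\0\0" signature is stored at 0x3C, and the Characteristics field is the last 2 bytes of the 20-byte COFF file header that follows the signature.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// imageFileDLL is the IMAGE_FILE_DLL bit in the COFF Characteristics field.
const imageFileDLL = 0x2000

// isDLL reports whether buf is a PE image with the IMAGE_FILE_DLL flag set.
func isDLL(buf []byte) bool {
	if len(buf) < 0x40 || buf[0] != 'M' || buf[1] != 'Z' {
		return false
	}
	// e_lfanew at offset 0x3C holds the file offset of the PE signature.
	pe := uint64(binary.LittleEndian.Uint32(buf[0x3C:0x40]))
	if pe+24 > uint64(len(buf)) {
		return false // signature and COFF header are not in the buffer
	}
	off := int(pe)
	if string(buf[off:off+4]) != "PE\x00\x00" {
		return false
	}
	// Signature (4 bytes) + 18 bytes into the COFF header = Characteristics.
	flags := binary.LittleEndian.Uint16(buf[off+22 : off+24])
	return flags&imageFileDLL != 0
}

// fakePE builds a minimal buffer with just the fields isDLL inspects.
func fakePE(characteristics uint16) []byte {
	buf := make([]byte, 0x80+24)
	copy(buf, "MZ")
	binary.LittleEndian.PutUint32(buf[0x3C:], 0x80)
	copy(buf[0x80:], "PE\x00\x00")
	binary.LittleEndian.PutUint16(buf[0x80+22:], characteristics)
	return buf
}

func main() {
	fmt.Println(isDLL(fakePE(0x2022)), isDLL(fakePE(0x0022))) // true false
}
```

Because the PE signature offset varies between files, this check may need more bytes than a 262-byte header provides.
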