Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

Overview

docconv

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Note for returning users: the Go import path for this package changed to code.sajari.com/docconv.

Installation

If you haven't set up Go before, you first need to install Go.

To fetch and build the code:

$ go get code.sajari.com/docconv/...

This will also build the command line tool docd into $GOPATH/bin. Make sure that $GOPATH/bin is in your PATH environment variable.

Dependencies

tidy, wv, poppler-utils, unrtf, https://github.com/JalfResi/justext

Example installation of the dependencies (Debian/Ubuntu shown; package names may differ on other systems):

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext

Optional dependencies

To add image support to the docconv library you first need to install and build gosseract.

Now you can add -tags ocr to any go command when building/fetching/testing docconv to include support for processing images:

$ go get -tags ocr code.sajari.com/docconv/...

The build may fail on macOS, which you can fix by installing tesseract via brew:

$ brew install tesseract

docd tool

The docd tool runs as either:

  1. a service on port 8888 (by default)

    Documents can be sent as a multipart POST request and the plain text (body) and meta information are then returned as a JSON object.

  2. a service exposed from within a Docker container

    This also runs as a service, but from within a Docker container. Official images are published at https://hub.docker.com/r/sajari/docd.

    Optionally you can build it yourself:

    cd docd
    docker build -t docd .
    
  3. via the command line.

    Documents can be sent as an argument, e.g.

    $ docd -input document.pdf
    

Optional flags

  • addr - the bind address for the HTTP server, default is ":8888"
  • log-level
    • 0: errors & critical info
    • 1: includes 0 and logs each request as well
    • 2: includes 1 and logs the response payloads
  • readability-length-low - sets the readability length low if the ?readability=1 parameter is set
  • readability-length-high - sets the readability length high if the ?readability=1 parameter is set
  • readability-stopwords-low - sets the readability stopwords low if the ?readability=1 parameter is set
  • readability-stopwords-high - sets the readability stopwords high if the ?readability=1 parameter is set
  • readability-max-link-density - sets the readability max link density if the ?readability=1 parameter is set
  • readability-max-heading-distance - sets the readability max heading distance if the ?readability=1 parameter is set
  • readability-use-classes - comma separated list of readability classes to use if the ?readability=1 parameter is set
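
As a rough illustration of how options like these can be declared with Go's standard flag package, here is a minimal sketch. It is not docd's actual source: the config type and parseFlags function are hypothetical, only addr and log-level are covered, and the default of 0 for log-level is an assumption (the ":8888" default for addr is the one stated above).

```go
package main

import (
	"flag"
	"fmt"
	"log"
)

// config holds a subset of the options listed above (hypothetical type).
type config struct {
	addr     string
	logLevel uint
}

// parseFlags parses args (without the program name) into a config.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("docd-sketch", flag.ContinueOnError)
	fs.StringVar(&c.addr, "addr", ":8888", "bind address for the HTTP server")
	// The default of 0 for log-level is an assumption.
	fs.UintVar(&c.logLevel, "log-level", 0, "0: errors, 1: also requests, 2: also response payloads")
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	if c.logLevel > 2 {
		return config{}, fmt.Errorf("log-level must be 0, 1 or 2, got %d", c.logLevel)
	}
	return c, nil
}

func main() {
	c, err := parseFlags([]string{"-addr", ":8000", "-log-level", "1"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("addr=%s log-level=%d\n", c.addr, c.logLevel)
}
```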

How to start the service

$ # This will only log errors and critical info
$ docd -log-level 0

$ # This will run on port 8000 and log each request
$ docd -addr :8000 -log-level 1

Example usage (code)

Some basic code is shown below, but normally you would accept the file by HTTP or open it from the file system.

This should be enough to get you started though.

Use case 1: run locally

Note: this assumes you have the dependencies installed.

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Use case 2: request over the network

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Alternatively, via curl (the @ prefix makes curl upload the file's contents rather than the literal file name):

curl -s -F input=@your-file.pdf http://localhost:8888/convert
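
The same multipart upload can be sketched in Go with only the standard library. This is an illustration rather than the docconv client: postDocument, convertResponse and newStubServer are hypothetical names, the JSON field names "body" and "meta" are assumed from the description of the service above, and a stub server stands in for docd so the sketch runs on its own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
)

// convertResponse mirrors the parts of the JSON reply described above.
// The field names "body" and "meta" are assumptions.
type convertResponse struct {
	Body string            `json:"body"`
	Meta map[string]string `json:"meta"`
}

// postDocument uploads content as the multipart form field "input"
// and decodes the JSON reply.
func postDocument(endpoint, filename string, content []byte) (*convertResponse, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, err := w.CreateFormFile("input", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	// Close writes the trailing multipart boundary.
	if err := w.Close(); err != nil {
		return nil, err
	}

	resp, err := http.Post(endpoint, w.FormDataContentType(), &buf)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var out convertResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

// newStubServer stands in for docd: it echoes the uploaded bytes as "body".
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		f, _, err := r.FormFile("input")
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		defer f.Close()
		data, err := io.ReadAll(f)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(convertResponse{Body: string(data), Meta: map[string]string{}})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	res, err := postDocument(srv.URL+"/convert", "your-file.txt", []byte("hello"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res.Body)
}
```

To talk to a running docd instead of the stub, pass http://localhost:8888/convert as the endpoint.
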
Issues
  • Compatibility with Windows

    docconv seems to cause trouble when running on a Windows computer. In doc.go, for instance, there is a hardcoded path to a temp dir that includes a forward slash (e.g. lines 17 and 60). This is definitely a problem on Windows...

    opened by TomDeneire 16
  • issues with deploying to gcloud

    I tried to deploy it to App Engine, and the deployment fails.

    
    ---------------------------------------------------------------------------------------------------------------- REMOTE BUILD OUTPUT -----------------------------------------------------------------------------------------------------------------
    starting build "b30eafdb-e2e2-4a5d-b334-6be6535a8773"
    
    FETCHSOURCE
    Fetching storage object: gs://staging.xxx.appspot.com/us.gcr.io/xxx/appengine/docd.1:latest#1571314487109426
    Copying gs://staging.xxx.appspot.com/us.gcr.io/xxx/appengine/docd.1:latest#1571314487109426...
    / [1 files][  6.4 MiB/  6.4 MiB]
    Operation completed over 1 objects/6.4 MiB.
    BUILD
    Already have image (with digest): gcr.io/cloud-builders/docker
    Sending build context to Docker daemon  13.49MB
    Step 1/9 : FROM alpine
    latest: Pulling from library/alpine
    Digest: sha256:acd3ca9941a85e8ed16515bfc5328e4e2f8c128caa72959a58a127b7801ee01f
    Status: Downloaded newer image for alpine:latest
     ---> 961769676411
    Step 2/9 : MAINTAINER Hamish Ogilvy
     ---> Running in 1bf83502be90
    Removing intermediate container 1bf83502be90
     ---> 1d609d316173
    Step 3/9 : ENV CC=/usr/bin/gcc
     ---> Running in ebc84ab28e30
    Removing intermediate container ebc84ab28e30
     ---> 857461fa7c94
    Step 4/9 : ENV CXX=/usr/bin/g++
     ---> Running in 7022ffdf3aa6
    Removing intermediate container 7022ffdf3aa6
     ---> e6a37cfd4e07
    Step 5/9 : COPY dependencies/* /
    COPY failed: no source files were specified
    ERROR
    ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    ERROR: (gcloud.app.deploy) Cloud build failed. Check logs at https://console.cloud.google.com/gcr/builds/xxx?project=xxx Failure status: UNKNOWN: Error Response: [2] Build failed; check build logs for details
    rm: appengine/dependencies: No such file or directory
    
    opened by slav123 9
  • Fix temporary file creation

    The following line throws an error message for me: f, err := ioutil.TempFile(os.TempDir(), "/docconv")

    error message: pattern contains path separator

    full error message thrown by docconv.Convert(): error converting data: error creating local file: error creating temporary file: pattern contains path separator

    This can be easily reproduced with go playground, just add and remove the slash to see the difference: https://play.golang.org/p/VaCO9evqKzM

    I suppose it could be fixed by just removing the slash, unless I'm missing the reason it needed to be there in the first place.

    opened by remlse 5
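
The behaviour reported in this issue can be reproduced with a small standard-library sketch. It uses os.CreateTemp, the current replacement for ioutil.TempFile; tryTemp is a hypothetical helper, and the exact wording of the wrapped error may differ between Go versions.

```go
package main

import (
	"fmt"
	"os"
)

// tryTemp creates and then removes a temporary file using the given
// name pattern, returning any error from either step.
func tryTemp(pattern string) error {
	f, err := os.CreateTemp(os.TempDir(), pattern)
	if err != nil {
		return err
	}
	f.Close()
	return os.Remove(f.Name())
}

func main() {
	// A pattern containing a path separator is rejected.
	fmt.Println("with slash:   ", tryTemp("/docconv"))
	// The same pattern without the slash works.
	fmt.Println("without slash:", tryTemp("docconv"))
}
```
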
  • docconv appears to be out of sync with latest github.com/otiai10/gosseract

    With the current docconv at github/sajari/docconv, which imports github.com/otiai10/gosseract/v1/gosseract, the build in a macOS terminal results in:

    go get -tags ocr code.sajari.com/docconv/...
    package github.com/otiai10/gosseract/v1/gosseract: cannot find package "github.com/otiai10/gosseract/v1/gosseract" in any of:
        /usr/local/go/src/github.com/otiai10/gosseract/v1/gosseract (from $GOROOT)
        /Users//go/src/github.com/otiai10/gosseract/v1/gosseract (from $GOPATH)

    Changing the import to reference otiai10's current release (i.e. "github.com/otiai10/gosseract/v1/gosseract") in image_ocr.go results in undefined references as follows:

    go get -tags ocr code.sajari.com/docconv/...
    code.sajari.com/docconv
    /Users//go/src/code.sajari.com/docconv/image_ocr.go:35:11: undefined: gosseract.Must
    /Users//go/src/code.sajari.com/docconv/image_ocr.go:35:26: undefined: gosseract.Params

    opened by PaulDuncanson 5
  • Use as internal go library

    Is there a way to use this as an internal/embedded library in a Go program? I see that I can most likely achieve this with ODT and several other types since it returns a string.

    However, with PDF it returns a BodyResult that is not exported and looks like it is interpreted directly by the command line? Am I missing something?

    opened by deranjer 4
  • error converting data: exec: "pdftotext": executable file not found in $PATH

    I'm trying to run the simple example from this repo with my own PDF file, and I get this error: error converting data: exec: "pdftotext": executable file not found in $PATH.

    Platform: macOS. My PDF file is in go/src/project and in go/bin

    My Go project file path: /User/admin/go/src/project

    .bash_profile:

    export GOPATH=$HOME/go
    export GOBIN=$GOPATH/bin
    export PATH=$PATH:/usr/local/go/bin
    

    Code:

    package main
    
    import (
    	"fmt"
    	"log"
    
    	"code.sajari.com/docconv"
    )
    
    func main() {
    	res, err := docconv.ConvertPath("gsl-mit-edu-0to1.pdf")
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(res)
    }
    

    What could be the problem?

    opened by nickbullll 4
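
Going by the error text alone, one likely cause is that the pdftotext binary (shipped with poppler-utils, listed under Dependencies) is not installed or not on PATH. A quick way to check for the external tools before converting is sketched below. missingTools is a hypothetical helper; pdftotext and wvText are the binary names that appear in error messages on this page, while unrtf and tidy are assumed to be the binary names of the packages listed under Dependencies.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names that cannot be found on PATH.
func missingTools(names []string) []string {
	var missing []string
	for _, name := range names {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Binary names are partly assumptions, see the note above.
	tools := []string{"pdftotext", "wvText", "unrtf", "tidy"}
	for _, name := range missingTools(tools) {
		fmt.Printf("%s: not found in PATH\n", name)
	}
}
```
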
  • Convert Images in PDFs

    This is based on the work in https://github.com/sajari/docconv/pull/19. Many thanks to @marioidival for getting that started.

    The objective is to enable this tool to perform character recognition on images within PDFs in addition to its current pdftotext capabilities.

    When the project is built with the ocr tag, ConvertPDF will detect images within the document and invoke ConvertImage on each of them.

    Note that our gosseract dependency just released a v2 with a breaking change. In order to preserve the current integration, I've updated the import statement to use gosseract/v1/gosseract as recommended in their current README.

    opened by onemartini 4
  • go-charset/charset is no longer hosted on Google Code

    new hosting: https://github.com/rogpeppe/go-charset

    $ go get github.com/sajari/docconv
    warning: code.google.com is shutting down; import path code.google.com/p/go-charset/charset will stop working
    
    opened by avelino 4
  • Possible license inconsistencies

    Hello,

    We were considering using your library as part of our application and discovered one potential license inconsistency:

    Your library is licensed under MIT and has poppler-utils in the dependencies. However, poppler is licensed under GPL 2.

    License information: https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/README.md https://pkgs.alpinelinux.org/package/edge/main/x86/poppler-utils

    In our understanding, this could oblige your library to be licensed under GPL 2. I'm not a license expert and I might be mistaken here. But I hope you find this observation helpful; you might have already considered it, and there may be reasons why it's still fine to use MIT. It'd be great if you could clarify this and explain to us the legal way to use your library and poppler in our app under MIT.

    Thank you in advance!

    opened by hwo411 3
  • docd: refactor Dockerfile and publish to DockerHub

    • Use multi-stage build to compile docd
    • Bring debian variant up to date
    • Deprecate the alpine variant
    • Add GitHub action to publish an official image to DockerHub
    • Use published image as the AppEngine custom runtime

    TODO:

    • [x] add DockerHub credentials for publishing
    • [ ] release v1.1.1
    opened by jsok 3
  • enable html output when readability is set to true

    If html.go > HTMLReadabilityOptionsValues.ReadabilityUseClasses is left as an empty string as initialized, nothing will be included in the output. "good" is probably the very minimum that should be included in the output.

    opened by remlse 3
  • use as the default "good" and "neargood" for html when ReadabilityUseClasses is empty

    Those are the defaults for docd, but they don't apply to library usage, which causes the problem seen in issue #78.

    This PR can have a drawback if you intentionally pass an empty list for readabilityUseClasses, but that makes no sense, because the resulting extraction would be an empty body.

    Fixes #78

    opened by jespino 0
  • Fix mime type of .tif files

    .tif files should map to the image/tiff mime type.

    List of official mime types: https://www.iana.org/assignments/media-types/media-types.xhtml

    From MDN web docs: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types

    opened by remlse 0
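
A minimal sketch of an extension-to-MIME lookup that maps both .tif and .tiff to image/tiff follows. The table and the imageMimeType function are illustrative only and are not docconv's actual mapping; the fallback type is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// imageTypes is a small illustrative table, not docconv's actual mapping.
var imageTypes = map[string]string{
	".png":  "image/png",
	".jpg":  "image/jpeg",
	".jpeg": "image/jpeg",
	".tif":  "image/tiff",
	".tiff": "image/tiff",
}

// imageMimeType returns the MIME type for a file name based on its
// extension, ignoring case, with a generic fallback for unknown types.
func imageMimeType(filename string) string {
	if t, ok := imageTypes[strings.ToLower(filepath.Ext(filename))]; ok {
		return t
	}
	return "application/octet-stream"
}

func main() {
	fmt.Println(imageMimeType("scan.tif"))
	fmt.Println(imageMimeType("scan.TIFF"))
}
```
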
  • system deadlock

    Lines 102 and 103 in doc.go cause a deadlock, mainly because the goroutine started above fails to send valid data on the channel:

    	body := <-bc
    	meta := <-mc
    

    err:

    ConvertDoc: could not read doc: mscfb: bad signature; 43016997712
    wvText: exit status 255
    
    opened by cnHuaShao 0
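
One common way to avoid this kind of blocked receive is to have every goroutine send exactly one result, error included, on a buffered channel. The sketch below shows the pattern in isolation; it is not the code from doc.go, and result, extract, convert, okSource and badSource are all hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// result carries either a value or the error that prevented producing it.
type result struct {
	text string
	err  error
}

// extract runs fn in a goroutine and always delivers exactly one result,
// so the receiver cannot block forever when fn fails.
func extract(fn func() (string, error)) <-chan result {
	ch := make(chan result, 1) // buffered: the send never blocks
	go func() {
		text, err := fn()
		ch <- result{text: text, err: err}
	}()
	return ch
}

// convert is a hypothetical stand-in for a converter that reads the body
// and the metadata concurrently.
func convert(body, meta func() (string, error)) (string, string, error) {
	bc, mc := extract(body), extract(meta)
	b, m := <-bc, <-mc
	if b.err != nil {
		return "", "", fmt.Errorf("body: %w", b.err)
	}
	if m.err != nil {
		return "", "", fmt.Errorf("meta: %w", m.err)
	}
	return b.text, m.text, nil
}

// okSource and badSource simulate a working and a failing extractor.
func okSource() (string, error)  { return "hello", nil }
func badSource() (string, error) { return "", errors.New("bad signature") }

func main() {
	_, _, err := convert(okSource, badSource)
	fmt.Println(err) // the failure is reported instead of blocking
}
```
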
  • fix: case insensitive processing of docx and pptx internals

    Hi,

    Sometimes the case of the names in [Content_Types].xml and of the zipped file names does not match. This may not be an issue on Windows, but it is an issue when such a file is looked up in a map keyed by file name. (In particular, there was a directory docProps, but the XML Overrides had docprops.)

    I've changed the processing to be case insensitive.

    Regards

    opened by pavelbazika 0
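
The idea of a case-insensitive lookup can be sketched as a map keyed by lower-cased archive paths. This is an illustration, not the code from the pull request: fileIndex, newFileIndex and lookup are hypothetical names, and the example file names are typical OOXML paths chosen for the demo.

```go
package main

import (
	"fmt"
	"strings"
)

// fileIndex maps lower-cased archive paths to their original names.
type fileIndex map[string]string

// newFileIndex builds the index from the names found in the ZIP archive.
func newFileIndex(names []string) fileIndex {
	idx := make(fileIndex, len(names))
	for _, n := range names {
		idx[strings.ToLower(n)] = n
	}
	return idx
}

// lookup finds the archive entry for name regardless of case. A leading
// slash, as used in part names inside [Content_Types].xml, is ignored.
func (idx fileIndex) lookup(name string) (string, bool) {
	orig, ok := idx[strings.ToLower(strings.TrimPrefix(name, "/"))]
	return orig, ok
}

func main() {
	idx := newFileIndex([]string{"[Content_Types].xml", "docProps/core.xml", "word/document.xml"})
	name, ok := idx.lookup("/docprops/core.xml")
	fmt.Println(name, ok)
}
```
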
  • xml: support also files not encoded in utf-8

    Hi,

    When the XML is not encoded in UTF-8, the decoder requires a charset reader.

    Credits: https://stackoverflow.com/questions/6002619/unmarshal-an-iso-8859-1-xml-input-in-go/32224438#32224438

    opened by pavelbazika 0
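
For illustration, here is how a charset reader plugs into encoding/xml using only the standard library. This is a sketch, not the change from the pull request: charsetReader, note and decodeNote are hypothetical names, only ISO-8859-1 is handled, and the whole input is read into memory for simplicity, which a streaming decoder from a charset package would avoid.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"strings"
)

// charsetReader converts the declared encoding to UTF-8. Only ISO-8859-1
// is supported in this sketch.
func charsetReader(label string, input io.Reader) (io.Reader, error) {
	switch strings.ToLower(label) {
	case "utf-8":
		return input, nil
	case "iso-8859-1", "latin1":
		raw, err := io.ReadAll(input)
		if err != nil {
			return nil, err
		}
		runes := make([]rune, len(raw))
		for i, b := range raw {
			runes[i] = rune(b) // Latin-1 bytes map 1:1 to code points
		}
		return strings.NewReader(string(runes)), nil
	}
	return nil, fmt.Errorf("unsupported charset %q", label)
}

// note captures the character data of the root element.
type note struct {
	Text string `xml:",chardata"`
}

// decodeNote decodes a one-element document using charsetReader.
func decodeNote(data []byte) (string, error) {
	d := xml.NewDecoder(bytes.NewReader(data))
	d.CharsetReader = charsetReader
	var n note
	if err := d.Decode(&n); err != nil {
		return "", err
	}
	return n.Text, nil
}

func main() {
	// \xe9 is "é" in ISO-8859-1.
	data := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><note>caf\xe9</note>")
	text, err := decodeNote(data)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)
}
```
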
Releases (v1.2.1)
  • v1.2.1 (Jul 18, 2022)

    Vulnerabilities fixed:

    • deprecated protobuf dependency (#107)
    • msoleps stack overflow bug (dependency updated) (#108)
    • remote code execution vulnerability in the PDF OCR converter (#110)
    • unbounded memory consumption when reading files from ZIP archive (#111)

    Other:

    • improvements to client error messages (#113)
  • v1.2.0 (Aug 20, 2021)

    Bug fixes and improvements:

    • docd: refactor Dockerfile and publish to DockerHub (#101)
    • Updated dependency for poppler and removed bash arg check (#100)
    • doc: improve metadata parsing so that titles can be reliably extracted (#99)
    • add test for TestConvertHTML (#93)
    • actions: stop building for Go 1.13
    • rtf: don't ignore lines less than 5 characters long (#91)
    • pptx_test: check returned error before deferring f.Close()
    • add note about ignored error check
    • docd: remove unused convertPath function
    • remove path separator from ioutil.TempFile prefix
    • add Sourcegraph badge to README
    • support PowerPoint files in Convert functions
    • ocr: update gosseract to v2
    • pptx: add support for MS PowerPoint files (#71)
    • avoid double copy on tidy
    • Update tidy pkg to use NewLocalFile func
    • support windows temp directories
    • add go mod support
    • pdf: add extra time layout for pdfs
    • try to avoid loading file data into slices
    • Fixes
    • Get docx contents reading [Content_Types].xml to get correct file names
Owner
Search.io
Enabling every organization to create smart search experiences