
Account Invoice Import Simple PDF

Maturity: Beta. License: AGPL-3. Repository: OCA/edi.

This module is an extension of the module account_invoice_import: it adds support for simple PDF invoices, i.e. PDF invoices that don’t have an embedded XML file. This module has been developed to solve the drawbacks of the OCA module account_invoice_import_invoice2data; its advantages are the following:

  • Possibility to add support for a new vendor without developer skills: the accountant can do it!
  • Adding support for a new vendor is faster.
  • More tolerance on vendor invoice layout changes.
  • Easier to install.

With this module, you can import all the invoices that you were able to import with the module account_invoice_import_invoice2data. In fact, this module uses the same design when importing a PDF vendor bill:

  1. raw text extraction of the PDF file,
  2. identify the partner using the VAT number (if the VAT number is present in the raw text extraction) or some keywords,
  3. use regular expressions (regex) to extract the data needed to create the vendor bill in Odoo (single line configuration).

The main difference with the OCA module account_invoice_import_invoice2data is that the regular expressions are auto-generated from the configuration made by the user in Odoo. No need to be a regex expert! But you can still write regex to extract some fields for some very specific needs.
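To illustrate the idea of generating a regex from user configuration, here is a minimal sketch that turns a date format chosen by the user into a regex and applies it to the raw text. This is not the module's actual generator: the names `TOKEN_PATTERNS`, `date_format_to_regex` and `extract_first_date` are hypothetical, and only three format tokens are handled.

```python
import re
from datetime import datetime

# Hypothetical mapping from date-format tokens to regex fragments.
# The module's real generator is not shown in this document.
TOKEN_PATTERNS = {
    "%d": r"\d{1,2}",
    "%m": r"\d{1,2}",
    "%Y": r"\d{4}",
}


def date_format_to_regex(date_format):
    """Build a compiled regex matching dates written in date_format."""
    pattern = ""
    i = 0
    while i < len(date_format):
        token = date_format[i:i + 2]
        if token in TOKEN_PATTERNS:
            pattern += TOKEN_PATTERNS[token]
            i += 2
        else:
            # Separators such as "/" or "." are matched literally.
            pattern += re.escape(date_format[i])
            i += 1
    return re.compile(pattern)


def extract_first_date(raw_text, date_format):
    """Return the first date found in raw_text, or None if there is none."""
    match = date_format_to_regex(date_format).search(raw_text)
    if match is None:
        return None
    return datetime.strptime(match.group(0), date_format).date()
```

With this sketch, the user only provides a format such as `%d/%m/%Y`; the regex is derived from it and never has to be written by hand.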

The module can extract the following fields:

  • Total Amount with taxes
  • Total Untaxed Amount
  • Total Tax Amount
  • Invoice Date
  • Due Date
  • Start Date
  • End Date
  • Invoice Number
  • Description (for that field, you have to write a regex)

In this list, only 3 fields are required:

  • Invoice Date
  • 2 out of the 3 Amount fields (the 3rd can be deduced from the 2 others: Total Amount = Total Untaxed + Total Tax)
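The relation between the three amounts can be sketched as follows. This is an illustrative helper, not code taken from the module; the function name `complete_amounts` and its keyword arguments are assumptions.

```python
from decimal import Decimal


def complete_amounts(total=None, untaxed=None, tax=None):
    """Fill in the missing amount using: total = untaxed + tax."""
    known = sum(value is not None for value in (total, untaxed, tax))
    if known < 2:
        raise ValueError("At least 2 of the 3 amounts are required")
    if total is None:
        total = untaxed + tax
    elif untaxed is None:
        untaxed = total - tax
    elif tax is None:
        tax = total - untaxed
    return {"total": total, "untaxed": untaxed, "tax": tax}


# Example: the invoice text only gives the total and the untaxed amount.
amounts = complete_amounts(total=Decimal("120.00"), untaxed=Decimal("100.00"))
```

Using `Decimal` rather than `float` avoids rounding surprises on monetary values.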

To take advantage of the fields Start Date and End Date, you need the OCA module account_invoice_start_end_dates from the account-closing project.

To know the full story behind the development of this module, read Akretion’s blog post.


Installation

The most important technical component of this module is the tool that converts the PDF to text. Converting PDF to text is not an easy job. As outlined in this blog post, different tools can give quite different results. The best results are usually achieved with tools based on a PDF viewer, which excludes pure-Python tools. But pure-Python tools are easier to install than tools based on a PDF viewer. It is important to understand that, if you change the PDF to text tool, you will almost certainly get a slightly different text output, which may force you to update the field extraction rules; this can be time-consuming if you have already configured many vendors.

The module supports 4 different extraction methods:

  1. PyMuPDF which is a Python binding for MuPDF, a lightweight PDF toolkit/viewer/renderer published under the AGPL licence by the company Artifex Software.
  2. pdftotext python library, which is a python binding for the pdftotext tool.
  3. pdftotext command line tool, which is based on poppler, a PDF rendering library used by xpdf and Evince (the PDF reader of Gnome).
  4. pypdf, which is one of the most common PDF libs for Python. pypdf is a pure-Python solution, so it’s very easy to install on all OSes.

PyMuPDF and pdftotext both give a very good text output. So far, I can’t say which one is best. pypdf often gives lower-quality text output, but its advantage is that it is a pure-Python library, so you will always be able to install it whatever your technical environment is.

You can choose one extraction method and only install the tools/libs for that method.

Install PyMuPDF

Install it via pip:

pip3 install --upgrade pymupdf

Beware that PyMuPDF is not a pure-Python library: it uses MuPDF, which is written in C. If a Python wheel for your OS, CPU architecture and Python version is available on PyPI (check the list of PyMuPDF wheels on PyPI), it will install smoothly. Otherwise, the installation via pip will require MuPDF and all its development libs to compile the binding.

Install pdftotext python lib

To install pdftotext python lib, run:

sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev

and then install the lib via pip:

pip3 install --upgrade pdftotext

On OSes other than Debian/Ubuntu, follow the instructions on the project page.

Install pdftotext command line

To install pdftotext command line, run:

sudo apt install poppler-utils

Install pypdf

To install the pypdf python lib, run:

pip3 install --upgrade pypdf

Other requirements

This module also requires the following Python libraries:

  • regex which is backward-compatible with the re module of the Python standard library, but has additional functionalities.
  • dateparser which is a powerful date parsing library.

The dateparser lib itself depends on regex, so you can install both Python libraries via pip with the following command:

pip3 install --upgrade dateparser

The dateparser lib is not compatible with all regex lib versions. As of February 2024, the version requirement declared by dateparser for regex is !=2019.02.19, !=2021.8.27. So the latest version of dateparser is currently compatible with the latest version of regex. To know the version of regex installed in your environment, run:

pip3 show regex
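The same check can be done from Python. The sketch below compares a regex version with the two versions excluded by dateparser as of February 2024 (see above). The helper names are hypothetical, and the code assumes that regex version strings contain only dot-separated numbers.

```python
from importlib import metadata

# Versions excluded by dateparser as of February 2024 (from the text above).
EXCLUDED_REGEX_VERSIONS = ("2019.02.19", "2021.8.27")


def _as_tuple(version):
    """Turn '2019.02.19' into (2019, 2, 19) so that '02' and '2' compare equal."""
    return tuple(int(part) for part in version.split("."))


def regex_version_is_compatible(version):
    """Return False if version is one of the excluded regex releases."""
    excluded = {_as_tuple(v) for v in EXCLUDED_REGEX_VERSIONS}
    return _as_tuple(version) not in excluded


def installed_regex_is_compatible():
    """Check the installed regex lib; return None if it is not installed."""
    try:
        return regex_version_is_compatible(metadata.version("regex"))
    except metadata.PackageNotFoundError:
        return None
```

If dateparser changes its requirement in a later release, the tuple of excluded versions has to be updated accordingly.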

Configuration

By default, for the PDF to text conversion, the module tries the different methods in the order given in the Installation section. It first tries PyMuPDF; if that fails (for example because the lib is not properly installed), it tries the pdftotext python lib; if that one also fails, it tries the pdftotext command line; and, if that also fails, it finally tries pypdf. If none of the 4 methods work, Odoo will display an error message.
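The fallback order described above can be sketched as a simple loop. This is an illustration of the pattern only, not the module's code: `extract_text_with_fallback` is a hypothetical name, and each extractor is assumed to be a callable that takes the PDF content as bytes and returns text, raising an exception when its tool is missing or fails.

```python
def extract_text_with_fallback(pdf_bytes, extractors):
    """Try each (name, extractor) pair in order; return the first success.

    Raises RuntimeError when every extractor fails.
    """
    errors = {}
    for name, extractor in extractors:
        try:
            return name, extractor(pdf_bytes)
        except Exception as exc:  # a missing lib or a broken tool both count as failure
            errors[name] = exc
    details = ", ".join(f"{name}: {exc}" for name, exc in errors.items())
    raise RuntimeError(f"No PDF to text method worked ({details})")
```

The list passed as `extractors` would follow the order PyMuPDF, pdftotext python lib, pdftotext command line, pypdf.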

If you want to force Odoo to use a specific text extraction method, go to the menu Configuration > Technical > Parameters > System Parameters and create a new System Parameter:

  • Key: invoice_import_simple_pdf.pdf2txt
  • Value: select the proper value for the method you want to use:

    1. pymupdf
    2. pdftotext.lib
    3. pdftotext.cmd
    4. pypdf

In this configuration, Odoo will only use the selected text extraction method and, if it fails, it will display an error message.
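The same system parameter can also be set from an Odoo shell instead of the menu. The sketch below is a hypothetical helper: it assumes that `env` is the Odoo environment available in the shell and uses the standard `set_param` method of the `ir.config_parameter` model. The key and the allowed values are the ones listed above.

```python
PDF2TXT_KEY = "invoice_import_simple_pdf.pdf2txt"
ALLOWED_METHODS = ("pymupdf", "pdftotext.lib", "pdftotext.cmd", "pypdf")


def force_pdf2txt_method(env, method):
    """Store the text extraction method in the system parameters."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"Unknown method {method!r}, expected one of {ALLOWED_METHODS}")
    env["ir.config_parameter"].sudo().set_param(PDF2TXT_KEY, method)
```

In an Odoo shell, this would be called as `force_pdf2txt_method(env, "pymupdf")`, followed by `env.cr.commit()` to save the change.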

You will find a full demonstration about how to configure each Vendor and import the PDF invoices in this screencast.

Bug Tracker

Bugs are tracked on GitHub Issues. In case of trouble, please check there whether your issue has already been reported. If you spotted it first, help us fix it by providing detailed feedback.

Do not contact contributors directly about support or help with technical issues.

Credits

Authors

  • Akretion

Contributors

Maintainers

This module is maintained by the OCA.

Odoo Community Association

OCA, or the Odoo Community Association, is a nonprofit organization whose mission is to support the collaborative development of Odoo features and promote its widespread use.

Current maintainer:

alexis-via

This module is part of the OCA/edi project on GitHub.

You are welcome to contribute. To learn how, please visit https://odoo-community.org/page/Contribute.