Parsing text from PDF using c#

While working on a project, I came across a requirement to extract some information from a PDF document. The tricky part was that only particular pieces were needed and not the whole text.

Technology-wise I was using .NET as a platform and C# as a programming language. I was thinking of a way to accomplish the business requirement so I decided to use SautinSoft’s PDFFocus library to help me with transferring the PDF data to a text stream.

SautinSoft.pdffocus

There is a NuGet package you can use:

Install-Package sautinsoft.pdffocus -Version 7.0.3.29

The requirement

The task I needed to do was to process a batch of invoices, to extract parts of them and to store the result in a database tables.
The invoices were containing tables with articles.

Here is an example invoice:

The fields of interest are marked in yellow. They are as follows:

  • Invoice number
  • Invoice date
  • A list of articles, containing the following information for every article
    • Article Code
    • Count

Our aim is to filter and select only those values. Everything else should be skipped. I don’t care about the receiver or sender details, nor the article names or prices, so I have to be selective enough.
Also keep in mind that the articles table can be much longer in the different documents.

The implementation

I came up with the following approach:

1. Load all the files that need to be processed

var files = Directory.GetFiles(folderName);

2. Create a processor which will be doing the job and pass every file to it.

. . .
foreach (var file in files)
{
      var invoiceData = processor.ProcessFile(file, supplier);
      . . .
}

*You can find the implementation of ProcessFile method below (under point 4).

3. The basic conversion from PDF to XML is accomplished with sautinsoft.pdffocus help as follows:

public XElement PdfToXml(string filePath)
{
    XElement xml = null;
    var pdf = File.ReadAllBytes(filePath);

    // Convert PDF file to XML file.
    var f = new SautinSoft.PdfFocus();

    // Convert all data (textual and tabular) to XML.
    f.XmlOptions.ConvertNonTabularDataToSpreadsheet = true;

    f.OpenPdf(pdf);

    if (f.PageCount > 0)
    {
        var xmlStr = f.ToXml();

        xml = XElement.Parse(xmlStr);
    }

    return xml;
}

4. Now is time to extract only the needed information from the output XML. Let’s do it by examine the XML string closely and apply XPath queries to select the required parts.

Here is an example of the produced output XML:

<document>
  <page>
    <table rows="9" cols="5">
      <row>
        <cell colspan="3">ФАКТУРА</cell>
      </row>
      <row>
        <cell colspan="5" />
      </row>
      <row>
        <cell colspan="5">No. 0000483773 / 31.05.2018</cell>
      </row>
      <row>
        <cell colspan="5" />
      </row>
      <row>
        <cell>Банкова сметка: 12345676</cell>
        <cell />
        <cell>IBAN: XX12ASDF1234567AG23456</cell>
        <cell />
        <cell>BIC: ASDFGH</cell>
      </row>
      <row>
        <cell colspan="5" />
      </row>
    </table>
    <table rows="1" cols="2">
      <row>
        <cell>Получател (From) Име: (Name) Receiver Name Адрес: (Address) Address of the receiver here Ид.No: (Identificator) 123456 ДДС No: (VAT Num) 929292 МОЛ: (Mol) BG828K92</cell>
        <cell>Доставчик (Supplier) Име: (Name) Issuer Name Адрес:(Address) Address of the issuer here Ид.No:(Identificator) 9876565 ДДС No:(VAT Num) 782747 МОЛ:(Mol) BG77779</cell>
      </row>
    </table>
    <table rows="6" cols="11">
      <row>
        <cell>No</cell>
        <cell>Код (Code)</cell>
        <cell>Наименование (Name)</cell>
        <cell>Мяр.</cell>
        <cell>Кол.(Q)</cell>
        <cell colspan="2">Ед. цена</cell>
        <cell>Отст.</cell>
        <cell colspan="2">Ед.цена</cell>
        <cell>Стойност</cell>
      </row>
      <row>
        <cell>1</cell>
        <cell>662 101 493</cell>
        <cell>Article name 662</cell>
        <cell>Бр</cell>
        <cell>2</cell>
        <cell colspan="2">380.0000</cell>
        <cell>28.00%</cell>
        <cell colspan="2">273.60</cell>
        <cell>547.20</cell>
      </row>
      <row>
        <cell>2</cell>
        <cell>K12345678</cell>
        <cell>New Article K12345678 name</cell>
        <cell>Бр</cell>
        <cell>1</cell>
        <cell colspan="2">1150.0000</cell>
        <cell>28.00%</cell>
        <cell colspan="2">828.00</cell>
        <cell>828.00</cell>
      </row>
      <row>
        <cell rowspan="3" colspan="6">Валута: BGN Основание: Словом: Хиляда шестстотин и петдесет .24</cell>
        <cell colspan="3">Данъчна основа</cell>
        <cell colspan="2">1357.20</cell>
      </row>
      <row>
        <cell colspan="6" />
        <cell colspan="3">20.00 % ДДС</cell>
        <cell colspan="2">275.04</cell>
      </row>
      <row>
        <cell colspan="6" />
        <cell colspan="3">Сума за плащане</cell>
        <cell colspan="2">1650.24</cell>
      </row>
    </table>
    <table rows="3" cols="2">
      <row>
        <cell>Забележка: ВЗЕТО</cell>
        <cell>Възоснова на:</cell>
      </row>
      <row>
        <cell>Място на издаване: София</cell>
        <cell>Дата на дан. събитие: 31.05.2018</cell>
      </row>
      <row>
        <cell>Плащане: Банков превод</cell>
        <cell>Срок на плащане: 30.06.2018</cell>
      </row>
    </table>    
</document>

Now we can get the needed parts by XPath, similar to the following:

public ExtractedInvoiceData ProcessFile(string filePath, string supplierName)
{
    var xml = baseDataProcessor.PdfToXml(filePath);

    if (xml != null)
    {
        var invouceNumberAndDate =
            xml.XPathSelectElements("/page/table[1]/row[starts-with(cell, 'No.')]").ToList()[0].Value.Split(' ');
        var partsList = xml.XPathSelectElements("/page/table[3]/row[count(cell) = 9]/cell[2] | /page/table[3]/row[count(cell) = 9]/cell[5]").ToList();

        var parts = baseDataProcessor.ConstructParts(partsList.Skip(2).ToList());
        var invNumber = invouceNumberAndDate[1];
        var invDate = invouceNumberAndDate[3];
        
        return new ExtractedInvoiceData
        {
            InvoiceDate = DateTime.Parse(invDate),
            InvoiceNumber = invNumber,
            InvoiceParts = parts,
            SupplierName = supplierName
        };
    }

    return null;
}

The invoice number and date are taken from the values from the first table and a row which is with tag cell and starts with the string ‘No.’. This will return a collection and we only need the first element.
As for the partsList – get it from the third table attribute where there is a row tag with nine cell children, then for this collection of nine get the second and the fifth one. This gets all the parts in the table.
After that there is a simple logic to split this list of parts:

public List<ExtractedPart> ConstructParts(List<XElement> parts)
{
    var extractedPartsList = new List<ExtractedPart>();

    for (var i = 0; i < parts.Count; i++)
    {
        int count;
        int.TryParse(Regex.Match(parts[i + 1].Value, @"\d+").Value, out count);

        extractedPartsList.Add(
            new ExtractedPart
            {
                Code = parts[i].Value,
                Count = count
            });

        i++;
    }

    return extractedPartsList;
}

And that’s it. You now have the needed data and can pass it to the repository which will store it in the database.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s