I want to extract the contents of text fields from PDF files and bring them into my WinForms project. While searching I found references to iTextSharp, but then saw that it has been replaced by iText7, and everything I read refers only to using it from C#. My WinForms project is VB. Any pointers on the best option for getting that data into my project would be much appreciated.

ADyson
  • Have you tried installing the respective NuGet package in your VB project? – GSerg Aug 26 '21 at 15:47
  • A .NET library is not language dependent. If it is a .NET library you can use it from either C# or VB.NET. Perhaps you only found sample code in C# because it is more popular and the author did not want to go through the effort of providing sample code in every supported language. – Igor Aug 26 '21 at 15:48
  • There are no "C# libraries", there are ".NET libraries", which work in any language that compiles to the .NET Framework (or .NET Core). Even though samples and documentation may be in C#, it'll work the very same way in VB.NET (or in F#, or IronPython, or any other .NET language). – Alejandro Aug 26 '21 at 15:51

2 Answers

If it's available for .NET then it's available for all .NET languages. That's one of the useful things about the way the .NET Framework works - it doesn't matter which programming language a particular library or project was written in. Once it's compiled to a .NET assembly DLL it can be used from any other .NET language regardless.

If, for example, iTextSharp or iText7 is available as a NuGet package (which I believe they are) then it's especially simple - you can just install the package into your VB.NET project and use it.

You may often find that usage examples are written in one specific language, but that doesn't mean you can't make exactly the same class instantiations, method calls etc. using a different language. If you struggle with translating code samples from one language to another, there are free automatic code converters available online (especially between C# and VB.NET) which will usually do 90-100% of the conversion work for you.
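As a rough illustration of how direct the translation usually is, here is a minimal sketch that assumes the itext7 NuGet package is installed. The module name TranslationSketch and the function CountPages are made up for this example; each C# line is shown as a comment above its VB.NET equivalent.

```vb
Imports iText.Kernel.Pdf

Module TranslationSketch
    ' C#:  int CountPages(string filename)
    ' (CountPages is a hypothetical helper, not part of iText)
    Function CountPages(filename As String) As Integer
        ' C#:  using (var doc = new PdfDocument(new PdfReader(filename)))
        Using doc As New PdfDocument(New PdfReader(filename))
            ' C#:  return doc.GetNumberOfPages();
            Return doc.GetNumberOfPages()
        End Using
    End Function
End Module
```

The class names, constructor arguments and method calls are identical in both languages; only the surrounding syntax (Using/End Using, As clauses, no semicolons) changes.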

ADyson

To extract text from a PDF file using itext7, try the following:

Pre-requisite: Download/install NuGet package itext7

Add the following Imports statements:

Imports iText.Kernel.Pdf
Imports iText.Kernel.Pdf.Canvas.Parser.Listener
Imports iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor

GetTextFromPdf:

Public Function GetTextFromPdf(filename As String) As String
    Dim sb As System.Text.StringBuilder = New System.Text.StringBuilder()

    Using doc As PdfDocument = New PdfDocument(New PdfReader(filename))
        'Dim strategy As LocationTextExtractionStrategy = New LocationTextExtractionStrategy()

        For i As Integer = 1 To doc.GetNumberOfPages() Step 1
            Dim page = doc.GetPage(i)
            'Dim text = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy)
            Dim text = GetTextFromPage(page)
            sb.AppendLine(text)
        Next
    End Using

    Return sb.ToString()
End Function

The code for GetTextFromPdf is adapted from here.

Update:

The code below shows how to read the field names and field values from an AcroForm in a PDF document:

Add the following Imports statements:

Imports iText.Forms
Imports iText.Kernel.Pdf

GetTextFromPdfFields:

Public Function GetTextFromPdfFields(filename As String) As String
    Dim sb As System.Text.StringBuilder = New System.Text.StringBuilder()

    'open the PDF document for reading
    Using doc As PdfDocument = New PdfDocument(New PdfReader(filename))

        'get AcroForm from document
        Dim form As PdfAcroForm = PdfAcroForm.GetAcroForm(doc, True)

        'get form fields
        Dim fieldDict As IDictionary(Of String, Fields.PdfFormField) = form.GetFormFields()

        'loop through form fields
        For Each kvp As KeyValuePair(Of String, Fields.PdfFormField) In fieldDict
            'kvp.Value is the field itself, so it does not need to be looked up again by name
            Dim field As Fields.PdfFormField = kvp.Value
            Dim type As PdfName = field.GetFormType()
            Dim fieldName As PdfString = field.GetFieldName()
            Dim fieldValue As String = field.GetValueAsString()

            If fieldName IsNot Nothing Then
                'append data to instance of StringBuilder
                sb.AppendLine("Type: " & type.ToString() & " FieldName: " & fieldName.ToString() & " Value: " & fieldValue)
            End If
        Next
    End Using

    Return sb.ToString()
End Function

Note: The code for GetTextFromPdfFields is adapted from here.

Tu deschizi eu inchid
  • Hi @user9938, I had gone through the process of getting iText7 loaded through Nuget but hadn't found any code that worked. Your code is a great start for me thanks but I now get the text off the pdf when what I need is the text in the fields in the pdf. Are you able to help with that please? – Hugh Self Taught Aug 27 '21 at 14:30
  • Thanks a stack my new hero. Last question on the fields. Is there an easy way to determine the line breaks when the field is multiline? – Hugh Self Taught Aug 27 '21 at 17:04
  • I found the following which seems to work okay. Haven't tested extensively with multiple scenarios yet but for now it works txtAbouMe.Text = Replace(AboutMe, Chr(13), vbCrLf) – Hugh Self Taught Aug 28 '21 at 08:48