I want to extract the contents of text fields from PDF files and bring them into my WinForms project. While searching I found references to iTextSharp, but then saw that it has been replaced by iText7, and everything I read refers only to its use in C#. My WinForms project is in VB. Any pointers on the best way to get that data into my project would be much appreciated.
Have you tried installing the respective nuget package in your VB project? – GSerg Aug 26 '21 at 15:47
A .net library is not language dependent. If it is a .net library you can use either c# or vb.net. Perhaps you only found sample code in c# because it is more popular and the author did not want to go through the effort on providing sample code in every supported language. – Igor Aug 26 '21 at 15:48
There are no "C# libraries", there are ".NET libraries", which work on any language that compiles to the .NET Framework (or .NET Core). Even though samples and documentation may be on C#, it'll work the very same way in VB.NET (or on F#, or IronPython, or any other .NET language). – Alejandro Aug 26 '21 at 15:51
2 Answers
If it's available for .NET then it's available for all .NET languages. That's one of the useful things about the way the .NET Framework works - it doesn't matter which programming language a particular library or project was written in. Once it's compiled to a .NET assembly DLL it can be used from any other .NET language regardless.
If, for example, iTextSharp or iText7 is available as a NuGet package (which I believe they are), then it's especially simple: you can just install the package into your VB.NET project and use it.
You may often find that usage examples are written in one specific language, but that doesn't mean you can't make exactly the same class instantiations, method calls etc. in a different language. If you struggle with translating code samples from one language to another, there are free automatic code converters available online (especially between C# and VB.NET, in either direction) which will usually do 90-100% of the conversion work for you.
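As a rough illustration of how mechanical that translation usually is, here is a small sketch that uses only the .NET base class library. The C# lines in the comment are an invented sample, not taken from any iText documentation, and the names TranslationSketch and JoinUpper are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Text

Module TranslationSketch
    ' Hypothetical C# sample, as it might appear in a library's documentation:
    '     var sb = new StringBuilder();
    '     foreach (var name in names) { sb.AppendLine(name.ToUpper()); }
    '     return sb.ToString();

    ' The same class instantiation and method calls, written in VB.NET.
    Public Function JoinUpper(names As IEnumerable(Of String)) As String
        Dim sb As New StringBuilder()
        For Each name As String In names
            sb.AppendLine(name.ToUpper())
        Next
        Return sb.ToString()
    End Function
End Module
```

The types and members are identical in both languages; only the surrounding syntax (declarations, loops, block terminators) changes.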
To extract text from a PDF file using itext7, try the following:
Pre-requisite: Download/install NuGet package itext7
Add the following Imports statements:
Imports iText.Kernel.Pdf
Imports iText.Kernel.Pdf.Canvas.Parser.Listener
Imports iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor
GetTextFromPdf:
Public Function GetTextFromPdf(filename As String) As String
    Dim sb As System.Text.StringBuilder = New System.Text.StringBuilder()
    Using doc As PdfDocument = New PdfDocument(New PdfReader(filename))
        'Dim strategy As LocationTextExtractionStrategy = New LocationTextExtractionStrategy()
        For i As Integer = 1 To doc.GetNumberOfPages() Step 1
            Dim page = doc.GetPage(i)
            'Dim text = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy)
            Dim text = GetTextFromPage(page)
            sb.AppendLine(text)
        Next
    End Using
    Return sb.ToString()
End Function
The code for GetTextFromPdf is adapted from here.
Update:
The code below shows how to read the field names and field values from an AcroForm in a PDF document:
Add the following Imports statements:
Imports iText.Forms
Imports iText.Kernel.Pdf
GetTextFromPdfFields:
Public Function GetTextFromPdfFields(filename As String) As String
    Dim sb As System.Text.StringBuilder = New System.Text.StringBuilder()
    'create new instance
    Using doc As PdfDocument = New PdfDocument(New PdfReader(filename))
        'get AcroForm from document
        Dim form As PdfAcroForm = PdfAcroForm.GetAcroForm(doc, True)
        'get form fields
        Dim fieldDict As IDictionary(Of String, Fields.PdfFormField) = form.GetFormFields()
        'loop through form fields
        For Each kvp As KeyValuePair(Of String, Fields.PdfFormField) In fieldDict
            Dim type As PdfName = form.GetField(kvp.Key).GetFormType()
            Dim fieldName As PdfString = form.GetField(kvp.Key).GetFieldName()
            Dim fieldValue As String = form.GetField(kvp.Key).GetValueAsString()
            If fieldName IsNot Nothing Then
                'append data to instance of StringBuilder
                sb.AppendLine("Type: " & type.ToString() & " FieldName: " & fieldName.ToString() & " Value: " & fieldValue)
            End If
        Next
    End Using
    Return sb.ToString()
End Function
Note: The code for GetTextFromPdfFields is adapted from here.
Hi @user9938, I had gone through the process of getting iText7 loaded through NuGet but hadn't found any code that worked. Your code is a great start for me, thanks, but I now get the text off the PDF when what I need is the text in the fields in the PDF. Are you able to help with that please? – Hugh Self Taught Aug 27 '21 at 14:30
Thanks a stack, my new hero. Last question on the fields: is there an easy way to determine the line breaks when the field is multiline? – Hugh Self Taught Aug 27 '21 at 17:04
I found the following, which seems to work okay. I haven't tested it extensively with multiple scenarios yet, but for now it works: txtAbouMe.Text = Replace(AboutMe, Chr(13), vbCrLf) – Hugh Self Taught Aug 28 '21 at 08:48
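One caveat with replacing Chr(13) directly: if a field value already contains a CR+LF pair, each pair turns into CR+LF+LF. Which separators a given PDF actually stores is not something this thread establishes, so the following is only a minimal sketch that assumes any of CR, LF or CR+LF may appear. The names LineBreakSketch and NormalizeLineBreaks are hypothetical helpers, not part of iText.

```vbnet
Imports Microsoft.VisualBasic

Module LineBreakSketch
    ' Converts CR, LF and CR+LF line breaks to CR+LF so that a multiline
    ' WinForms TextBox displays them. Assumption: the field value may use
    ' any of the three conventions.
    Public Function NormalizeLineBreaks(value As String) As String
        If value Is Nothing Then Return String.Empty
        ' Collapse every variant to a single LF first...
        Dim unified As String = value.Replace(vbCrLf, vbLf).Replace(vbCr, vbLf)
        ' ...then expand each LF to CR+LF.
        Return unified.Replace(vbLf, vbCrLf)
    End Function
End Module
```

With that helper, the assignment from the comment above could be written as txtAbouMe.Text = NormalizeLineBreaks(AboutMe).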