Data Science Asked by Mr. Scientist on September 5, 2021
I have a pdf file(link below).
I have to extract keywords from it and also need have there frequency in pdf file. E.g. 'Java':42, 'inheritance':3
.
I have to store the keyword with their weights in an excel sheet.
Can you please tell me some code in python to do it.
https://drive.google.com/file/d/1gZCnlhwVMBIE0SugUUxDIgQrfVz-cDQR/view
This is an interesting question! Let's convert all your PDF files to Word Documents, so you can loop through each one, and read all the contents of each, and then report everything in MS Excel.
Sub OpenAndReadWordDoc()
Rows("2:1000000").Select
Range(Selection, Selection.End(xlDown)).Select
Selection.ClearContents
Range("A1").Select
' assumes that the previous procedure has been executed
Dim oWordApp As Word.Application
Dim oWordDoc As Word.Document
Dim blnStart As Boolean
Dim r As Long
Dim sFolder As String
Dim strFilePattern As String
Dim strFileName As String
Dim sFileName As String
Dim ws As Worksheet
Dim c As Long
Dim n As Long
Dim iCount As Long
Dim strSearch As String
'~~> Establish an Word application object
On Error Resume Next
Set oWordApp = GetObject(, "Word.Application")
If Err Then
Set oWordApp = CreateObject("Word.Application")
' We started Word for this macro
blnStart = True
End If
On Error GoTo ErrHandler
Set ws = ActiveSheet
r = 1 ' startrow for the copied text from the Word document
' Last column
n = ws.Range("A1").End(xlToRight).Column
sFolder = "C:UsersExcelDesktoptest"
'~~> This is the extension you want to go in for
strFilePattern = "*.doc*"
'~~> Loop through the folder to get the word files
strFileName = Dir(sFolder & strFilePattern)
Do Until strFileName = ""
sFileName = sFolder & strFileName
'~~> Open the word doc
Set oWordDoc = oWordApp.Documents.Open(sFileName)
' Increase row number
r = r + 1
' Enter file name in column A
ws.Cells(r, 1).Value = sFileName
ActiveCell.Offset(1, 0).Select
ActiveSheet.Hyperlinks.Add Anchor:=Sheets("Sheet1").Range("A" & r), Address:=sFileName, _
SubAddress:="A" & r, TextToDisplay:=sFileName
' Loop through the columns
For c = 2 To n
If oWordDoc.Content.Find.Execute(FindText:=Trim(ws.Cells(1, c).Value), _
MatchWholeWord:=True, MatchCase:=False) Then
strSearch = ws.Cells(1, c).Value
iCount = 0
With ActiveDocument.Content.Find
.Text = strSearch
.Format = False
.Wrap = wdFindStop
Do While .Execute
iCount = iCount + 1
Loop
End With
ws.Cells(r, c).Value = iCount
End If
Next c
oWordDoc.Close SaveChanges:=False
'~~> Find next file
strFileName = Dir
Loop
ExitHandler:
On Error Resume Next
' close the Word application
Set oWordDoc = Nothing
If blnStart Then
' We started Word, so we close it
oWordApp.Quit
End If
Set oWordApp = Nothing
Exit Sub
ErrHandler:
MsgBox Err.Description, vbExclamation
Resume ExitHandler
End Sub
Function GetDirectory(path)
GetDirectory = Left(path, InStrRev(path, ""))
End Function
Test the code above and feed back. My setup looks like the image below.
Answered by ASH on September 5, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP