Extract data from non-HTML documents
You can use the Crawler to extract data from documents such as .pdf and .doc files.
To do this, the Algolia Crawler uses Apache Tika.
Tika extracts a document’s content and transforms it into a basic HTML file.
Limitations
Because it’s difficult to translate non-HTML documents into HTML, there are limitations to what can be done:
- A PDF can break if it’s exported with an unknown font.
- The produced HTML has little semantic value, which makes good relevance hard to achieve.
- Document indexing is slower than classic HTML indexing.
- Language detection isn’t available.
Enable document extraction
To enable document extraction, add the fileTypesToMatch setting to at least one of your crawler’s actions.
The available fileTypesToMatch values are:
- html for web pages. This is the default when no fileTypesToMatch parameter is present.
- pdf for PDF documents
- doc, xls, and ppt for Microsoft Office documents
- odt, ods, and odp for OpenDocument files
- email for email documents
When you use this setting and the crawler encounters a matching document, the $ parameter of your recordExtractor is assigned the document's transformed HTML. The file’s type is stored in the fileType parameter of your recordExtractor.
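({
  [...]
  actions: [
    {
      indexName: 'crawler-example',
      pathsToMatch: ['https://www.example.com/**'],
      // Only PDF and Word documents are handled by this action.
      fileTypesToMatch: ['pdf', 'doc'],
      recordExtractor: ({ url, $, fileType }) => {
        // $ holds the HTML that Tika produced from the document.
        console.log($.html(), fileType);
      }
    },
  ]
});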
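The example above only logs the converted HTML. As a rough sketch of what a record-producing extractor could look like, the function below turns any converted document into a single record. It assumes that $ behaves like a Cheerio instance (as the $.html() call suggests), that url can be converted to a string, and that returning an empty array skips the document. The function name and the record attributes (objectID, title, content) are illustrative choices, not names required by the Crawler.

```javascript
// Hypothetical helper: builds one record per converted document.
// Assumption: `$` is Cheerio-like and supports $(selector).text().
function extractDocumentRecord({ url, $, fileType }) {
  // Tika often leaves <title> empty, so fall back to the URL.
  const title = $('title').text().trim() || String(url);

  // Collapse the whitespace left over from the conversion.
  const content = $('body').text().replace(/\s+/g, ' ').trim();

  // Assumption: an empty array means "no record for this document".
  if (!content) return [];

  return [{ objectID: String(url), title, content, fileType }];
}
```

Under these assumptions, you could reference the function from an action with recordExtractor: extractDocumentRecord.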
Sample crawler configuration
This is an example of a configuration file that implements document extraction.
Supported file types
PDF document
- Associated extension: .pdf
- fileTypesToMatch: pdf
For example, for this .pdf file, Tika exposes the following HTML, which your crawler then passes to your recordExtractor.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:PDFVersion" content="1.3"/>
<meta name="pdf:docinfo:title" content="test-docx-file.pages"/>
<meta name="xmp:CreatorTool" content="Pages"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2018-07-17T13:35:40Z"/>
<meta name="Last-Modified" content="2018-07-17T13:35:40Z"/>
<meta name="dcterms:modified" content="2018-07-17T13:35:40Z"/>
<meta name="dc:format" content="application/pdf; version=1.3"/>
<meta name="Last-Save-Date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:docinfo:creator_tool" content="Pages"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:docinfo:modified" content="2018-07-17T13:35:40Z"/>
<meta name="meta:save-date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="test-docx-file.pages"/>
<meta name="modified" content="2018-07-17T13:35:40Z"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="meta:creation-date" content="2018-07-17T13:35:40Z"/>
<meta name="created" content="Tue Jul 17 13:35:40 UTC 2018"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2018-07-17T13:35:40Z"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="producer" content="Mac OS X 10.13.5 Quartz PDFContext"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="Mac OS X 10.13.5 Quartz PDFContext"/>
<meta name="pdf:docinfo:created" content="2018-07-17T13:35:40Z"/>
<title>test-docx-file.pages</title>
</head>
<body>
<div class="page">
<p/>
<p>Test PDF file content</p>
<p/>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
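Since each PDF page ends up in its own div with the class page, one option is to create one record per page. The sketch below does that and reads the title from the dc:title meta tag, falling back to the title element because the metadata isn't guaranteed. It assumes $ is Cheerio-like; the function name, the objectID scheme, and the attribute names are hypothetical.

```javascript
// Hypothetical helper: one record per PDF page.
// Assumption: `$` is Cheerio-like (each, attr, text).
function extractPdfPages({ url, $ }) {
  // Metadata may be missing, so fall back to the <title> element.
  const title =
    $('meta[name="dc:title"]').attr('content') || $('title').text().trim();

  const records = [];
  $('div.page').each((index, page) => {
    const content = $(page).text().replace(/\s+/g, ' ').trim();
    // Skip pages without extractable text (for example, scanned images).
    if (content) {
      records.push({
        objectID: `${url}#page-${index + 1}`,
        title,
        page: index + 1,
        content,
      });
    }
  });
  return records;
}
```

Splitting by page keeps records small, which tends to help when the source documents are long.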
Word document
- Associated extensions: .doc, .docx
- fileTypesToMatch: doc
For example, for this .doc file, Tika exposes the following HTML, which your crawler then passes to its recordExtractor.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/msword"/>
<title>
</title>
</head>
<body>
<div class="header"/>
<p class="body">Test DOC file content</p>
<div class="footer"/>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
OpenDocument text
- Associated extension: .odt
- fileTypesToMatch: odt
Excel spreadsheet
- Associated extensions: .xls, .xlsx
- fileTypesToMatch: xls
For example, for this .xls file, Tika exposes the following HTML, which your crawler then passes to its recordExtractor.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/vnd.ms-excel"/>
<title>
</title>
</head>
<body>
<div class="page">
<h1>Feuille 1</h1>
<table>
<tbody>
<tr>
<td>Test XLS file content</td>
</tr>
</tbody>
</table>
<div class="outside">&C&"Helvetica,Regular"&12&K000000&P</div>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
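Because spreadsheet content arrives as a table inside each div with the class page, a possible approach is to create one record per non-empty row. The sketch below assumes $ is Cheerio-like and that the h1 element holds the sheet name, as in the sample output. The function name and record attributes are hypothetical. Reading only the tr elements also leaves out the div with the class outside, which holds header and footer codes rather than cell data.

```javascript
// Hypothetical helper: one record per non-empty spreadsheet row.
// Assumption: `$` is Cheerio-like (find, first, each, map, get, text).
function extractSheetRows({ url, $ }) {
  const records = [];
  $('div.page').each((sheetIndex, sheet) => {
    // In the sample output, the sheet name is the first <h1> of the page.
    const sheetName = $(sheet).find('h1').first().text().trim();

    $(sheet).find('tr').each((rowIndex, row) => {
      const cells = $(row)
        .find('td')
        .map((i, cell) => $(cell).text().trim())
        .get();

      // Ignore rows where every cell is empty.
      if (cells.some((cell) => cell !== '')) {
        records.push({
          objectID: `${url}#sheet-${sheetIndex + 1}-row-${rowIndex + 1}`,
          sheet: sheetName,
          cells,
        });
      }
    });
  });
  return records;
}
```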
OpenDocument spreadsheet
- Associated extension: .ods
- fileTypesToMatch: ods
PowerPoint document
- Associated extensions: .ppt, .pptx
- fileTypesToMatch: ppt
For example, for this .ppt file, Tika exposes the following HTML, which your crawler then passes to its recordExtractor.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/vnd.ms-powerpoint"/>
<title>
</title>
</head>
<body>
<div class="slideShow">
<div class="slide">
<div class="slide-master-content"/>
<div class="slide-content">
<p>Test PPT file content</p>
<p/>
</div>
</div>
<div class="ocr"/>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
OpenDocument presentation
- Associated extension: .odp
- fileTypesToMatch: odp
Email documents
- Associated extension: .msg
- fileTypesToMatch: email
The email file type includes all documents related to email.
The Crawler supports the Outlook Mail Message (.msg) format.
For example, Tika converts this email into the following HTML:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2017-06-01T15:24:31Z" />
<meta name="Message:To-Email" content="to@domain.com" />
<meta name="dc:description" content="this is a mail to test msg file" />
<meta name="subject" content="this is a mail to test msg file" />
<meta name="dc:creator" content="from@domain.com" />
<meta name="Message:From-Email" content="from@domain.com" />
<meta name="dcterms:created" content="2017-06-01T15:24:31Z" />
<meta name="Message-To" content="to@domain.com" />
<meta name="dcterms:modified" content="2017-06-01T15:24:31Z" />
<meta name="Last-Modified" content="2017-06-01T15:24:31Z" />
<meta name="Message-Recipient-Address" content="to@domain.com" />
<meta name="Message:Raw-Header:X-Unsent" content="1" />
<meta name="Message:Raw-Header:Subject" content="this is a mail to test msg file" />
<meta name="meta:mapi-message-class" content="NOTE" />
<meta name="Message:To-Display-Name" content="to@domain.com" />
<meta name="Last-Save-Date" content="2017-06-01T15:24:31Z" />
<meta name="Message:Raw-Header:MIME-Version" content="1.0" />
<meta name="meta:save-date" content="2017-06-01T15:24:31Z" />
<meta name="dc:title" content="this is a mail to test msg file" />
<meta name="Message:Raw-Header:Message-ID" content="<c58b1b52f61f4789ba40339c6e993440>" />
<meta name="modified" content="2017-06-01T15:24:31Z" />
<meta name="Content-Type" content="application/vnd.ms-outlook" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser" />
<meta name="creator" content="from@domain.com" />
<meta name="Message:Raw-Header:From" content="from@domain.com" />
<meta name="meta:author" content="from@domain.com" />
<meta name="meta:creation-date" content="2017-06-01T15:24:31Z" />
<meta name="meta:mapi-from-representing-email" content="from@domain.com" />
<meta name="Creation-Date" content="2017-06-01T15:24:31Z" />
<meta name="Message-Cc" content="" />
<meta name="Message-Bcc" content="" />
<meta name="meta:mapi-from-representing-name" content="from@domain.com" />
<meta name="Message:Raw-Header:To" content="to@domain.com" />
<meta name="Message:From-Name" content="from@domain.com" />
<meta name="Author" content="from@domain.com" />
<meta name="Message-From" content="from@domain.com" />
<meta name="Message:To-Name" content="" />
<title>this is a mail to test msg file</title>
</head>
<body>
<h1>this is a mail to test msg file</h1>
<dl>
<dt>From</dt>
<dd>from@domain.com</dd>
<dt>To</dt>
<dd>to@domain.com</dd>
<dt>Recipients</dt>
<dd>to@domain.com</dd>
</dl>
<div class="message-body">
<p>This message was sent using a msg file </p>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
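For emails, most of the useful fields are in the meta tags, and the message text is in the div with the class message-body. The sketch below builds one record from those elements. It assumes $ is Cheerio-like and defaults every field to an empty string, since the metadata isn't guaranteed to be present. The function name and record attributes are hypothetical.

```javascript
// Hypothetical helper: one record per converted email.
// Assumption: `$` is Cheerio-like (attr, text).
function extractEmailRecord({ url, $ }) {
  // Read a <meta> tag by name, defaulting to '' when it is missing.
  const meta = (name) => $(`meta[name="${name}"]`).attr('content') || '';

  const content = $('div.message-body').text().replace(/\s+/g, ' ').trim();

  return [
    {
      objectID: String(url),
      // Fall back to <title> when the subject metadata is absent.
      subject: meta('subject') || $('title').text().trim(),
      from: meta('Message:From-Email'),
      to: meta('Message:To-Email'),
      sentAt: meta('date'),
      content,
    },
  ];
}
```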