# Crawler data extraction overview
The Crawler processes pages as follows:

1. The page is fetched.
2. Links and records are extracted from the page.
3. Extracted records are sent to Algolia.
4. Extracted links are added to your crawler's URL database.

This process repeats until all the required pages have been crawled.
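Conceptually, the loop looks like this (illustrative pseudocode only: `fetchPage`, `extractRecordsAndLinks`, `pushToAlgolia`, and the `urlDatabase` object are hypothetical placeholders, not Crawler APIs):

```js
// Hypothetical sketch of the crawl loop described above
async function crawl(urlDatabase) {
  while (urlDatabase.hasPendingUrls()) {
    const url = urlDatabase.next();
    const page = await fetchPage(url); // 1. fetch the page
    const { records, links } = extractRecordsAndLinks(page); // 2. extract
    await pushToAlgolia(records); // 3. send records to Algolia
    links.forEach((link) => urlDatabase.add(link)); // 4. enqueue new links
  }
}
```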
## The crawler URL database
When a crawl starts, your crawler seeds its URL database with every URL listed in your configuration's URL parameters, such as `startUrls` and `sitemaps`.
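For example, a configuration might seed the database like this (a minimal sketch; the URLs and credentials are placeholders):

```js
new Crawler({
  appId: "YOUR_APP_ID", // placeholder credentials
  apiKey: "YOUR_API_KEY",
  // URLs added to the URL database when the crawl starts
  startUrls: ["https://www.example.com/"],
  sitemaps: ["https://www.example.com/sitemap.xml"],
  // ...actions and other parameters
});
```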
For each of these pages, your crawler fetches the page and looks for links in any of the following formats:

- `head > link[rel=alternate]`
- `a[href]`
- `iframe[src]`
- `area[href]`
- `head > link[rel=canonical]`
- Redirect targets (when the HTTP status code is `301` or `302`)
You can specify that some links should be ignored.
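For example, a configuration can exclude links by URL pattern (a minimal sketch using the `exclusionPatterns` parameter; the patterns shown are placeholders):

```js
new Crawler({
  // ...other configuration
  startUrls: ["https://www.example.com/"],
  // Links matching these glob patterns are ignored (placeholder values)
  exclusionPatterns: [
    "https://www.example.com/private/**",
    "**/*.zip",
  ],
});
```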
## The record extractor

The `recordExtractor` parameter takes a site's metadata and HTML and returns an array of JSON objects.
For example:
```js
recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      description: $("meta[name=description]").attr("content"),
      type: $('meta[property="og:type"]').attr("content"),
    },
  ];
};
```
### `recordExtractor` properties

This function receives an object with several properties:

- `$`: a Cheerio instance for accessing the site's content
- `url`: a Location object containing the URL of the page being crawled
- `fileType`: the file type of the page (such as `html` or `pdf`)
- `contentLength`: the length of the page's content
- `dataSources`: any external data you want to combine with your extracted data
- `helpers`: a collection of functions to help you extract content and generate records
`url`, `fileType`, and `contentLength` provide useful metadata about the page you're crawling. However, to extract content from your pages, you must use the Cheerio instance (`$`).
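For example, an extractor might skip non-HTML pages and pull content with Cheerio selectors (a minimal sketch; the `article` selector is an assumption about your markup):

```js
recordExtractor: ({ url, $, fileType }) => {
  // Only extract from HTML pages; skip PDFs and other file types
  if (fileType !== "html") {
    return [];
  }
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      // Assumes your pages wrap their main content in an <article> element
      content: $("article").text().trim(),
    },
  ];
};
```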
### `recordExtractor` return structure

The JSON objects returned by your `recordExtractor` are converted directly into records in your Algolia index. They can contain values of any type, as long as they're compatible with an Algolia record.
However:

- Each record must be less than 500 KB.
- You can return a maximum of 200 records per crawled URL.
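For example, to stay within these limits on long pages, an extractor can return one smaller record per section instead of a single large record (a minimal sketch; the `h2`-based splitting is an illustrative assumption about your page structure):

```js
recordExtractor: ({ url, $ }) => {
  const records = [];
  // Create one record per <h2> section (assumes this page structure)
  $("h2").each((index, element) => {
    records.push({
      objectID: `${url.href}#${index}`,
      url: url.href,
      section: $(element).text(),
      // Collect the content between this heading and the next one
      content: $(element).nextUntil("h2").text().trim(),
    });
  });
  // Respect the limit of 200 records per crawled URL
  return records.slice(0, 200);
};
```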
## Extract from JavaScript-based sites

You can use your crawler on JavaScript-based sites. To do this, set `renderJavaScript` to `true` in your crawler's configuration.

Since setting `renderJavaScript` to `true` slows down crawling, you may want to use it for only a subset of your site.
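For example (a minimal sketch; whether `renderJavaScript` also accepts a list of URL patterns to limit rendering to matching pages may depend on your Crawler version, so treat the commented array form as an assumption):

```js
new Crawler({
  // ...other configuration
  // Render JavaScript on every crawled page (slows down the crawl)
  renderJavaScript: true,
  // Assumption: some versions accept URL patterns instead of a boolean,
  // limiting rendering to a subset of pages, for example:
  // renderJavaScript: ["https://www.example.com/app/**"],
});
```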