# Crawler data extraction overview
The Crawler processes pages as follows:

1. The page is fetched.
2. Links and records are extracted from the page.
3. Extracted records are sent to Algolia.
4. Extracted links are added to your crawler's URL database.

This process repeats until all the required pages have been crawled.
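Conceptually, the loop looks like this (illustrative pseudocode only: `fetchPage`, `extractRecordsAndLinks`, `pushToAlgolia`, and the `urlDatabase` object are hypothetical placeholders, not Crawler APIs):

```js
// Hypothetical sketch of the crawl loop described above
async function crawl(urlDatabase) {
  while (urlDatabase.hasPendingUrls()) {
    const url = urlDatabase.next();
    const page = await fetchPage(url); // 1. fetch the page
    const { records, links } = extractRecordsAndLinks(page); // 2. extract
    await pushToAlgolia(records); // 3. send records to Algolia
    links.forEach((link) => urlDatabase.add(link)); // 4. enqueue new links
  }
}
```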
## The crawler URL database
When a crawl starts, your crawler seeds its URL database with every URL listed in your configuration's URL parameters, such as `startUrls` and `sitemaps`.
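For example, a configuration might seed the database like this (a minimal sketch; the URLs and credentials are placeholders):

```js
new Crawler({
  appId: "YOUR_APP_ID", // placeholder credentials
  apiKey: "YOUR_API_KEY",
  // URLs added to the URL database when the crawl starts
  startUrls: ["https://www.example.com/"],
  sitemaps: ["https://www.example.com/sitemap.xml"],
  // ...actions and other parameters
});
```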
For each of these pages, your crawler fetches the page and looks for links in any of the following formats:

- `head > link[rel=alternate]`
- `a[href]`
- `iframe[src]`
- `area[href]`
- `head > link[rel=canonical]`
- Redirect targets (when the HTTP status code is `301` or `302`)
You can specify that some links should be ignored.
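For example, a configuration can exclude links by URL pattern (a minimal sketch using the `exclusionPatterns` parameter; the patterns shown are placeholders):

```js
new Crawler({
  // ...other configuration
  startUrls: ["https://www.example.com/"],
  // Links matching these glob patterns are ignored (placeholder values)
  exclusionPatterns: [
    "https://www.example.com/private/**",
    "**/*.zip",
  ],
});
```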
## The record extractor

The `recordExtractor` parameter takes a site's metadata and HTML and returns an array of JSON objects.
For example:
```js
recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      description: $("meta[name=description]").attr("content"),
      type: $('meta[property="og:type"]').attr("content"),
    },
  ];
};
```
### `recordExtractor` properties

This function receives an object with several properties:

- `$`: a Cheerio instance for accessing the site's content
- `url`: a Location object containing the URL of the page being crawled
- `fileType`: the file type of the page (such as `html` or `pdf`)
- `contentLength`: the length of the page's content
- `dataSources`: any external data you want to combine with your extracted data
- `helpers`: a collection of functions to help you extract content and generate records
`url`, `fileType`, and `contentLength` provide useful metadata about the page you're crawling. However, to extract content from your pages, you must use the Cheerio instance (`$`).
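For example, an extractor might skip non-HTML pages and pull content with Cheerio selectors (a minimal sketch; the `article` selector is an assumption about your markup):

```js
recordExtractor: ({ url, $, fileType }) => {
  // Only extract from HTML pages; skip PDFs and other file types
  if (fileType !== "html") {
    return [];
  }
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      // Assumes your pages wrap their main content in an <article> element
      content: $("article").text().trim(),
    },
  ];
};
```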
### `recordExtractor` return structure

The JSON objects returned by your `recordExtractor` are converted directly into records in your Algolia index. They can contain values of any type, as long as they're compatible with an Algolia record.
However:

- Each record must be less than 500 KB.
- You can return a maximum of 200 records per crawled URL.
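For example, to stay within these limits on long pages, an extractor can return one smaller record per section instead of a single large record (a minimal sketch; the `h2`-based splitting is an illustrative assumption about your page structure):

```js
recordExtractor: ({ url, $ }) => {
  const records = [];
  // Create one record per <h2> section (assumes this page structure)
  $("h2").each((index, element) => {
    records.push({
      objectID: `${url.href}#${index}`,
      url: url.href,
      section: $(element).text(),
      // Collect the content between this heading and the next one
      content: $(element).nextUntil("h2").text().trim(),
    });
  });
  // Respect the limit of 200 records per crawled URL
  return records.slice(0, 200);
};
```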
## Extract from JavaScript-based sites

You can use your crawler on JavaScript-based sites. To do this, set `renderJavaScript` to `true` in your crawler's configuration.

Since setting `renderJavaScript` to `true` slows down crawling, you may want to use it for only a subset of your site.
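For example (a minimal sketch; whether `renderJavaScript` also accepts a list of URL patterns to limit rendering to matching pages may depend on your Crawler version, so treat the commented array form as an assumption):

```js
new Crawler({
  // ...other configuration
  // Render JavaScript on every crawled page (slows down the crawl)
  renderJavaScript: true,
  // Assumption: some versions accept URL patterns instead of a boolean,
  // limiting rendering to a subset of pages, for example:
  // renderJavaScript: ["https://www.example.com/app/**"],
});
```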