Introduction

PureHTML is an HTML parsing specification for extracting useful data from HTML documents as JSON.
You can configure the PureHTML parser with YAML; JSON support is also on the way.

Getting Started

Install @purescraps/purehtml with the package manager of your choice:

npm i -S @purescraps/purehtml

Currently, the parsing backend is implemented only in TypeScript. Implementations in other languages are on our roadmap.

Usage:

import { ConfigFactory, extract } from '@purescraps/purehtml';

const input = `<div>
  <p>foo</p>
  <p>bar</p>
  <p>baz</p>
</div>`;
const config = ConfigFactory.fromYAML('selector: p');
const result = extract(input, config, 'https://example.com');

Basics

Let's consider this HTML:

<div>
  <span class="foo">Hello, PureHTML!</span>
</div>

Let's select the contents of the element matching the .foo selector. Here is our configuration:

selector: .foo

The parser output will be exactly the string "Hello, PureHTML!".

Extracting numbers

That was easy! What if we want to extract a number?

<div>
  <p id="price">12.99</p>
</div>
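
A configuration along the following lines would select the price element. As in the previous example, the result is the element's text, so it comes back as the string "12.99" rather than a number.

```yaml
# Quote the selector: an unquoted '#' would start a YAML comment.
selector: '#price'
```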

Type Casts (number transformer)

As you might have noticed, the output is a string, but we want a number. To cast the output to a number, we add a transform: number line to our configuration.

<div>
  <p id="price">12.99</p>
</div>
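
Putting the selector and the cast together, the configuration could look like this. With it, the output should be the number 12.99 instead of the string "12.99".

```yaml
selector: '#price'   # quoted so that '#' is not read as a YAML comment
transform: number    # cast the extracted text to a number
```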

Trimming Text (trim transformer)

HTML is often not formatted the way we expect, so selectors frequently return values padded with whitespace.

<div>
  <p id="product-title">
  
    Awesome Product
  
  </p>
</div>

The spaces and newlines around the product title are not useful to us, so we trim them:

<div>
  <p id="product-title">
  
    Awesome Product
  
  </p>
</div>
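
A minimal sketch of a configuration that applies the trim transformer to this input. The expected result is "Awesome Product" without the surrounding spaces and newlines.

```yaml
selector: '#product-title'
transform: trim   # strip leading and trailing whitespace/newlines
```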

Combining Transformers

What if we want to apply several transformers to our output? If we want to trim and cast to a number, we just add transform: [trim, number] to our configuration. The selector's output will first be trimmed, then cast to a number.

<div>
  <span id="product-price">
  
    12.99
  
  </span>
</div>
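
For the HTML above, the full configuration could be written as follows. The transformers run in list order, so the padded text is trimmed before the cast, and the result should be the number 12.99.

```yaml
selector: '#product-price'
transform: [trim, number]   # applied left to right: trim first, then cast
```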

We will explore the other transformers in the Transformers section.

Constant config

This configuration accepts only a constant property and returns its value. It is most useful as the default case in a union config, similar to the default branch of a switch/case statement in JavaScript and other languages.
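
As an illustration, a constant config consists of nothing but the constant property. The value used here is arbitrary and chosen only for the example.

```yaml
# Always yields the given value, regardless of the input HTML.
constant: unknown
```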

Arrays

Extracting an array can be done by setting { type: array, items: <configuration> } in our config.

<div>
  <span>a</span>
  <span>b</span>
  <span>c</span>
</div>
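
A sketch of an array configuration for this input. The items configuration is applied to each matched element; using only a trim transformer there, without a nested selector, is an assumption made for the example. The result should be the array ["a", "b", "c"].

```yaml
selector: span
type: array
items:
  transform: trim   # assumed: items config is applied to each matched <span>
```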

Applying transformers to the items:

Let's say we want to extract attributes of the matched items. We can use the attr(...name: string[]) transformer.

<div>
  <a href="https://example.com/foo">a</a>
  <a href="https://example.com/bar">b</a>
  <a href="https://example.com/baz">c</a>
</div>
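
For the links above, the configuration might look like the sketch below. The call form attr(href) is inferred from the attr(...name: string[]) signature and should be read as an assumption; the intended result is an array of the three href values.

```yaml
selector: a
type: array
items:
  transform: attr(href)   # assumed call syntax: attribute name in parentheses
```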

Objects

Extracting an object can be done by setting
{ type: object, properties: { property: <configuration> } }
in our config. Example:

<div>
  <span class="firstname">John</span>
  <span class="lastname">Doe</span>
  <span class="age">42</span>
</div>
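
One way to write the object configuration for this input is sketched below. The property names firstName, lastName and age are illustrative choices, not names required by the specification.

```yaml
selector: div
type: object
properties:
  firstName:            # hypothetical property name
    selector: .firstname
  lastName:             # hypothetical property name
    selector: .lastname
  age:                  # hypothetical property name
    selector: .age
    transform: number   # "42" becomes the number 42
```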

Applying transformers to the properties:

Let's say we want to extract an attribute of the matched element, such as data-course-id in the HTML below. We can use the attr(...name: string[]) transformer.

<div id="course-details" data-course-id="9999">
  introduction and table-of-contents of the course...
  <h1>
    Web Scraping Fundamentals
  </h1>
</div>
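
A possible configuration for this input is sketched below. It assumes that a property without its own selector operates on the element matched by the parent selector, and that attr accepts the attribute name in parentheses; the property names courseId and title are hypothetical.

```yaml
selector: '#course-details'
type: object
properties:
  courseId:   # hypothetical name; assumed to read from the parent match
    transform: [attr(data-course-id), number]
  title:      # hypothetical name
    selector: h1
    transform: trim
```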

Transformers

Selectors give us the innerText of the matched HTMLElement as a string. You may want to trim the result or cast it to a number, or you may need to extract an attribute of the matched element. These cases and more are handled by transformers.

attr transformer

Extracts a single attribute's value. It can also extract all, or a subset, of the element's attributes together with their values.

Examples:
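
The two configurations below sketch the single-attribute and subset forms. The call syntax is inferred from the attr(...name: string[]) signature, so treat it as an assumption rather than the definitive form.

```yaml
# Single attribute: should return the value of href.
selector: a
transform: attr(href)
---
# Subset of attributes: should return the id and class attributes with their values.
selector: a
transform: attr(id, class)
```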

exists transformer

Returns true if the given selector returned any elements.

html transformer

Returns innerHTML of the matched element.

length transformer

Returns the length of a string or array value.

number transformer

Casts the output to a number.

resolve transformer

Resolves the given value as a URL against the URL passed to the extract() call.
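
As a sketch, resolve could be chained after attr to turn a relative link into an absolute one. The input element and the chaining are assumptions made for illustration.

```yaml
# Hypothetical input: <a href="/foo">foo</a>
# With 'https://example.com' passed to extract(), the expected
# result would be "https://example.com/foo".
selector: a
transform: [attr(href), resolve]
```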

trim transformer

Trims all whitespace and newlines from the start and end of the string.

urlQueryParam transformer

Similar to the attr transformer, but this transformer works on the URL query parameters.

Examples:
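
The configuration below sketches one plausible use. The call syntax urlQueryParam(page) is assumed by analogy with attr, and the input link is hypothetical.

```yaml
# Hypothetical input: <a href="https://example.com/list?page=2">next</a>
# Intended result: the value of the "page" query parameter.
selector: a
transform: [attr(href), urlQueryParam(page)]
```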

Union config

This configuration accepts multiple configurations and returns the result of the first one that matches. You may pass a constant config as the last element so that it is used as the default value.
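
A rough sketch of how such a configuration might be laid out. The union key and the selectors are assumptions made for illustration; only the idea of an ordered list of configurations ending in a constant default comes from the description above.

```yaml
# Assumed layout: candidate configs are tried in order.
union:
  - selector: .sale-price        # hypothetical selector, tried first
    transform: [trim, number]
  - selector: .price             # hypothetical selector, tried second
    transform: [trim, number]
  - constant: null               # default when nothing above matches
```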