Introduction
PureHTML is an HTML parsing specification for extracting useful data from HTML documents as JSON.
You can configure the PureHTML parser with YAML; JSON support is also on the way.
Getting Started
Install the @purescraps/purehtml package with the package manager of your choice:
Currently, the parsing backend is only implemented in TypeScript. Implementations in other languages are on our roadmap.
Usage:
Basics
Let's consider this HTML:
Let's select the contents of the .foo selector. Here is our configuration:
The parser output will be exactly:
Extracting numbers
That was easy! What if we want to extract a number?
Type Casts (number transformer)
As you might have noticed, our output is a string. We would want a number instead. We will adjust our configuration to type-cast our output to a number by adding the transform: number line to it.
Trimming Text (trim transformer)
Usually, the HTML is not formatted the way we expect, and selectors often give us values padded with whitespace.
The spaces and new lines around the product title are not useful to us, so we trim them:
Combining Transformers
What if we want to apply several transformers to our output? If we want to trim and cast to number, we just add transform: [trim, number] to our configuration. The selector's output will first be trimmed, then cast to a number.
We will explore the other transformers in the Transformers section.
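The ordering described above (trim first, then number) can be illustrated with a minimal, self-contained sketch. The names Transformer, transformerTable and applyTransformers are hypothetical and are not part of PureHTML; the sketch only mimics how a list of transformers might be applied left to right.

```typescript
// Hypothetical stand-ins for the trim and number transformers.
// This is NOT PureHTML's implementation, only an illustration of ordering.
type Value = string | number;
type Transformer = (value: Value) => Value;

const transformerTable: Record<string, Transformer> = {
  trim: (value) => (typeof value === "string" ? value.trim() : value),
  number: (value) => Number(value),
};

// Applies the named transformers from left to right,
// like transform: [trim, number] in a configuration.
function applyTransformers(input: Value, names: string[]): Value {
  let current: Value = input;
  for (const transformerName of names) {
    const transformer = transformerTable[transformerName];
    if (!transformer) {
      throw new Error(`Unknown transformer: ${transformerName}`);
    }
    current = transformer(current);
  }
  return current;
}

const trimmedPrice = applyTransformers("\n   42.5  \n", ["trim", "number"]);
console.log(trimmedPrice); // 42.5
```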
Constant config
This configuration accepts only the constant property and returns its value. It is especially useful as the last entry of a union config, where it acts as the default case, similar to the default branch of a switch/case statement in JavaScript and other languages.
Arrays
Extracting an array can be done by setting { type: array, items: <configuration> } in our config.
Applying transformers to the items:
Let's say we want to extract attributes of the matched items. We can use the attr(...name: string[]) transformer.
Objects
Extracting an object can be done by setting { type: object, properties: { property: <configuration> } } in our config. Example:
Applying transformers to the items:
Let's say we want to extract attributes of the matched items. We can use the attr(...name: string[]) transformer.
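To see how the array and object shapes nest, here is a rough TypeScript model of the configuration forms mentioned so far. The Config type is a hypothetical sketch written for this illustration, not a type exported by the library. Combining selector with type on the same level and the attr(href) call syntax are assumptions, as are the selectors themselves.

```typescript
// Hypothetical type model of the configuration shapes described in this document.
// It is an illustration only and may not match the library's real types.
type Transform = string | string[];

type Config =
  | { constant: unknown }
  | { selector?: string; transform?: Transform }
  | { selector?: string; type: "array"; items: Config }
  | { selector?: string; type: "object"; properties: Record<string, Config> };

// An array of objects: every .product element becomes { title, price, link }.
// The selectors and the attr(href) syntax are assumptions made for this sketch.
const productsConfig: Config = {
  selector: ".product",
  type: "array",
  items: {
    type: "object",
    properties: {
      title: { selector: ".title", transform: "trim" },
      price: { selector: ".price", transform: ["trim", "number"] },
      link: { selector: "a", transform: "attr(href)" },
    },
  },
};

console.log(JSON.stringify(productsConfig, null, 2));
```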
Transformers
Selectors give us string values taken from the innerText of the matched HTMLElement. You may want to trim the result or cast the output to a number. You may also need to extract an attribute of the matched element. These cases and more can be handled with transformers.
attr transformer
It can extract an individual attribute's value. It can also extract all or a subset of the element's attributes with their values.
Examples:
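The behaviour described above can be modelled with a small standalone sketch. MatchedElement and the attr function below are hypothetical stand-ins that operate on a plain object instead of real HTML. The mapping of argument counts to results (one name gives a single value, no names gives all attributes, several names give a subset) is an assumption based on the description, not confirmed library behaviour.

```typescript
// Hypothetical in-memory stand-in for a matched element.
interface MatchedElement {
  attributes: Record<string, string>;
}

// Sketch of attr(...names), under the assumptions stated above:
//   one name      -> that attribute's value
//   no names      -> all attributes
//   several names -> only the requested subset
function attr(
  element: MatchedElement,
  ...names: string[]
): string | undefined | Record<string, string> {
  if (names.length === 1) {
    return element.attributes[names[0]];
  }
  const result: Record<string, string> = {};
  for (const key of Object.keys(element.attributes)) {
    if (names.length === 0 || names.indexOf(key) !== -1) {
      result[key] = element.attributes[key];
    }
  }
  return result;
}

const cartLink: MatchedElement = {
  attributes: { href: "/cart", title: "Cart", rel: "nofollow" },
};

const singleAttr = attr(cartLink, "href"); // "/cart"
const subsetAttrs = attr(cartLink, "href", "title"); // { href: "/cart", title: "Cart" }
const allAttrs = attr(cartLink); // all three attributes
console.log(singleAttr, subsetAttrs, allAttrs);
```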
exists transformer
Returns true if the given selector returned any elements.
html transformer
Returns the innerHTML of the matched element.
length transformer
Returns the length of a string or array value.
number transformer
Casts the output to a number.
resolve transformer
Resolves the extracted URL values against the URL passed to the extract() call.
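Resolving a relative URL against a base URL can be sketched with the standard URL constructor. The helper name resolveUrl and the example URLs are hypothetical. Whether the library's resolve transformer works exactly this way is an assumption.

```typescript
// Minimal sketch: resolve a possibly relative URL against a base URL
// using the standard WHATWG URL constructor.
function resolveUrl(value: string, baseUrl: string): string {
  return new URL(value, baseUrl).href;
}

// Hypothetical page URL, standing in for the URL given to extract().
const pageUrl = "https://example.com/catalog/index.html";

const rootRelative = resolveUrl("/products/42?ref=home", pageUrl);
const pathRelative = resolveUrl("images/a.png", pageUrl);
const alreadyAbsolute = resolveUrl("https://cdn.example.org/b.png", pageUrl);

console.log(rootRelative); // https://example.com/products/42?ref=home
console.log(pathRelative); // https://example.com/catalog/images/a.png
console.log(alreadyAbsolute); // https://cdn.example.org/b.png
```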
trim transformer
Trims all consecutive whitespace and newlines from the start and end of the string.
urlQueryParam transformer
Similar to the attr transformer, but this transformer works on the URL query parameters.
Examples:
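As a rough illustration of reading query parameters from a URL, the sketch below uses the standard URL and URLSearchParams APIs. The helper name readQueryParams is hypothetical. It assumes the same argument convention as the attr sketch (one name gives a single value, no names gives all parameters, several names give a subset), which may differ from the library's behaviour.

```typescript
// Sketch of query-parameter extraction with the standard URL API.
// The argument convention is an assumption, not confirmed library behaviour.
function readQueryParams(
  url: string,
  ...names: string[]
): string | null | Record<string, string> {
  const params = new URL(url).searchParams;
  if (names.length === 1) {
    return params.get(names[0]); // null when the parameter is missing
  }
  const result: Record<string, string> = {};
  params.forEach((value, key) => {
    if (names.length === 0 || names.indexOf(key) !== -1) {
      result[key] = value;
    }
  });
  return result;
}

// Hypothetical URL used only for this example.
const searchUrl = "https://example.com/search?q=shoes&page=2&sort=price";

const singleParam = readQueryParams(searchUrl, "q"); // "shoes"
const subsetParams = readQueryParams(searchUrl, "q", "page"); // { q: "shoes", page: "2" }
const allParams = readQueryParams(searchUrl); // all three parameters
console.log(singleParam, subsetParams, allParams);
```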
Union config
This configuration accepts multiple configurations and returns the first matched config's result. You may pass a constant config as the last element so it will be used as the default value.
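The first-match-wins behaviour with a constant fallback can be sketched without any HTML at all. In the model below, SimpleConfig and evaluateUnion are hypothetical, and a lookup into a plain record stands in for a selector-based config. Treating "matched" as "produced a value" is an assumption.

```typescript
// Hypothetical simplified configs: "lookup" stands in for a selector-based config.
type SimpleConfig = { lookup: string } | { constant: string };

// Returns the result of the first config that matches.
// A constant config always matches, so it works as the default case.
function evaluateUnion(
  configs: SimpleConfig[],
  data: Record<string, string | undefined>
): string | undefined {
  for (const config of configs) {
    if ("constant" in config) {
      return config.constant;
    }
    const value = data[config.lookup];
    if (value !== undefined) {
      return value;
    }
  }
  return undefined;
}

const priceConfigs: SimpleConfig[] = [
  { lookup: "salePrice" },
  { lookup: "regularPrice" },
  { constant: "unknown" }, // default, like the last entry of a union config
];

const matchedPrice = evaluateUnion(priceConfigs, { regularPrice: "19.99" });
const fallbackPrice = evaluateUnion(priceConfigs, {});
console.log(matchedPrice); // 19.99
console.log(fallbackPrice); // unknown
```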