Thinking in Dimensions: A Unified Approach to Filter Grammars

written

Learning filter grammars (such as CSS selectors, regular expressions or SQL) can be slow and error prone. Cheatsheets help with picking up a new grammar, or revising an old one you may be returning to after a period of absence, but because each has different concepts, syntax and operators, it’s often difficult to see the similarities. They are so different that it’s easy to lose track of the fact that they are all different tools to perform the same job.

This article explains a unified approach to filter grammars that provides a framework to classify, select and apply them appropriately; it also aids to reduce the barrier to identifying and learning new ones when your application calls for it.

General terminology: Dimensions and Nodes

To understand a filter grammar, you must first understand the dimensions or properties of the data structure it is designed for. The term dimension has its limitations (I use it because I find it helpful to think of these properties visually in dimensional space, but you may prefer instead think of them as simply attributes, or members of a tuple).

Filter grammars tend to operate on and return results of the same type, which can be thought of as the nodes of the data structure. It doesn’t matter what logical structure the data takes (whether that’s a HTML document tree, or a flat list of file lines), it’s useful to think of the data being filtering as a series of independent nodes, or points in space you arrive at by matching on each dimension’s value.

Semantic dimensions

Identifying what semantic dimensions a data structure has is a process of considering its logic structure and meaning, and examining what specialised operators the filter grammar provides to reflect that meaning.

We’ll take HTML documents as an convenient example as it is used with common filter grammars like CSS and XPath.

HTML documents have a number of semantic dimensions:

  • An element’s hierarchy (it’s relationship to its ancestors and descants)
  • An element’s siblings (its grouping with other elements that share a parent element) and often more importantly, it’s position within those siblings.
  • An element’s type (the tag type it’s given in the document)
  • An element’s attributes
  • An element’s text content
  • An element’s states (focused, hovered, disabled, etc)

A filter grammar won’t always provide the ability to filter on all of these dimensions. For example, in CSS you can apply filters on an elements' hierarchy, position, type and attribute, but not really its text content (for that you need XPath or possible jQuery). And XPath does not provide the ability to filter on an element’s state, but CSS does.

Filter grammars are comprised of smaller ones

Semantic dimensions are typically independent and so filter grammars can be thought of as a collection of smaller, simpler grammars, applied to specific dimensions. Although there are cases in some filter grammars where two or more properties are combined when matched, the dimensions tend to be kept separate. Which of dimension to filter, is usually indicated with some sort of switch. For example, in CSS, the period indicator (.) proceeds matches that should be applied to an element’s class attribute, and a hash (#) is used to match on an element’s id.

Data types

Once you have identified the semantic dimensions of a data structure, you need to determine the data type the filter grammar treats it as:

  • Nominal: Named categories (e.g. enums, booleans). These values have no magnitude (no one value is greater than any other), no intervals (they’re discrete categories with no values in between) and no absolute values (each value has meaning only in it being distinct from the other possible values and not some global measure).
  • Ordinal: Discrete values (e.g. integers, hierarchy, position) that have an explicit ordering amongst values. They have magnitude (values can be ordered from smallest to largest), but no “in between” values.
  • Scalar: Continuous values (e.g. floats). They are like ordinals, but have an infinite precision (up to the limitations of the underlying datatype).

From these, there are some special cases:

  • Sets and fields: Sets or collections where the order of elements does not matter. This is usually a way of encoding nominal data. (e.g. comma or space separated lists of words)
  • Permutations: Sets or collections, where the elements are in a specific order (e.g. arrays, characters in a string, hierarchies, file paths and list items). The order itself is ordinal data (i.e. the first position - index 0 - is always lower or before the second position - index 1 - and so on).

A dimension’s data type determines which operators are possible (and their semantics).

Nominal data

The simplest dimensional representation to visualise is the number line. Similarly, the simplest operator is the presence one - which determines whether there is a value at all. After that, we consider the equality operator by placing values on the number line, and evaluating if they occupy the same position. We can then consider their complements - the not present and not equal operators. It is at this point we have effectively explored the full set of basic operators available for nominal data.

Ordinal data

Ordinal data (which adds magnitude) allows us to consider values relative to a reference points, often termed the anchor. We can compare two values relative to one another by deeming one the anchor and evaluating if a value is closer or further away to one end of the number line. This is how we get the less than and greater than operators (which we can overlap with the equality operators to get less than or equal to and greater than or equal to). These operators are complements of one another (paying special attention to where you want the equality to fall: e.g. the complement of less than is not greater than but greater than or equal to.)

If you place the anchor at the first or last possible value of a dimension (often termed a global anchor) in order to find its closest neighbor, you achieve the minimum and maximum operators.

If you introduce an additional anchor (which is the same as applying the less than and greater than operators one after another), it’s possible to identify if a value is between two others. You can also place limits on the less than and greater than operators, to get the within operator (requiring a value to be within a certain number of places of another).

Scalar data

Scalar values - which not only have magnitude, but are continuous and have equal intervals as well - allow aggregate functions such as averages and percentiles which may yield results between the original values (more on this in a minute).

Permutation data

Stepping away from the number line, and instead considering permutations, we get a different set of semantics for the same operators: Anchors are not based on values but position (e.g. at the start of the permutation, or at the end).

With the number line, the position of the values were determined by those values themselves. However, permutations have both the position of the value, and the value itself. For example, in the string “Monday”, ‘n’ has position 2 (0 indexed) and value ‘n’. Some operators set or anchor the position and attempt to match the value, like begins with or ends with. These are the global anchors. However, some operators require relative offsets or ranges for the position value, like the proceeded by operator; these are the relative anchors. Global anchors match on the position and then value and relative anchors match on value and then position, so neither one dimension is inherently more important than the other.

Summary of semantic dimension operators


Nominal
Sets and fields
Ordinal
Scalar
Permutations
Magnitude
None
Yes
Yes
Yes. (Ordinal value of positions)
Discrete
Yes
Yes
No. (Continuous.)
Yes (Each position has discrete index)
Absolute values
No
Yes.
Yes.
Yes. (First, last)
Examples
Boolean, Categories
Comma or space separated list of words
Integers
Floats
Strings, Hierarchy, Lists
Non anchored match
Any/presence
Presence (not undefined)
Exact match
Equals
Field is equal
Equals
Partial match
Not applicable
Set contains element
Not applicable
Contains (smaller permutation)
Top/Left anchored
Absolute match
Not applicable - no position or magnitude, so anchors have no meaning

Lowest (equal to lowest value in the set)
First, Begins with (Matches elements in first positions)
Immediate relative match
Immediately after (value in set immediately after anchor)
Immediately followed by (matches element immediately before anchor)
Any relative match
More than (one of the values after the anchor)
Immediately followed by (matches any element before the anchor)
Scoped relative match
Within (one of the values less than the anchor, within a certain number of places)
Separated by (set or number of possible values)
Bottom/Right anchored
Absolute match
Not applicable - no position or magnitude, so anchors have no meaning

Highest
Last, Ends with
Immediate relative match
Immediately before (value in set immediately before anchor)
Immediately proceeded by
Any relative match
Less than (one of the values before the anchor)
Proceeded by
Scoped relative match
Within (one of the values more than the anchor, within a certain number of places)
Separated by (set or number of possible values)

Other dimension types

We’ve discussed semantic dimensions thus far, but there are at least two other types.

The first is revealed by the use of projections. This can also be thought of as projecting values into another dimension (using some function or transform), which has its own applicable operators. Aggregate functions are an example of a projection; they provide operators that do not depend on the values themselves, but consider the value in context with the rest of those in the set. For example, the most common operator, creates count data for each value (grouping those that are the same) and then applies a maximum operator. Projections often production dimensions that are a different data type than their input values.

Another dimension type is revealed by examining the underlying data type being used to represent a semantic dimension. For example, nominal data or ordinal data values may be represented as strings, which are themselves permutations of characters and the filter grammar may provide access to these operators.

For example, consider the set of the days of the week: ‘Monday’, ‘Tuesday’, ‘Wednesday’, ‘Thursday’, ‘Friday’, ‘Saturday’, ‘Sunday’. To match on these as ordinal values, we could query the “Days before Wednesday”, but we could also query them on the positions of the characters in each string by asking for “The values starting with ’T'”.

Semantic dimension
Inherit in the semantics of the data structure and filter grammar
Data type dimensions
Inherit in the type of data used to represent the values of the semantic dimensions
Aggregate functions and transforms
Achieved by applying functions to values and projecting them into a different dimension, which has it own operators

The relationship between dimensions and operator complexity

There appears to be roughly a continuum of filter grammars and data structures. At one end are the data structures with many semantic dimensions, with filter grammars that have relatively few projection or data dimension operators. One such example is CSS - the data structure (HTML documents) have many semantic dimensions and CSS selectors are largely defined in terms of these semantic dimensions, with restricted data and projection dimension operators. You cannot for example, find all elements with tags that start with a sequence of characters, or find the most common element.

In contrast, XPath includes a lot less of the semantic dimensions CSS is designed around, and instead treats a HTML document as a generic tree structure. Consequently, it has richer set of data type dimension operators, providing a lot more power (at the cost of increased expression complexity and reduced performance).

At the other end of the continuum are regular expressions, which operate on flat strings (usually single lines in a file) which have few semantic dimensions. Instead there is only the data type’s dimensions, which the filter grammar provides very complex operators for (well beyond those we’ve discussed so far). If applied to a HTML document, it interprets it only as a line semantic dimension (a permutation of characters which have the two data dimensions of position and character value), requiring a complex sequence of expressions in order to match the same nodes that would be trivial in CSS. However, it’s possible to match character sequences that are outside of valid HTML, or which are normalized away such as whitespace.

Utilities like sed and awk (which use regular expressions to match lines) add a few more semantic dimensions by providing the ability to match on line numbers and ranges. awk also separates lines into fields and provides projection dimensions by offering the ability to generate mathematical aggregate calculations like count, average etc.

Combining expressions

Putting dimensions aside now, the other component to filter grammars are how expressions are combined.

There are 4 main logic operators:

  • AND: Both expressions must be true
  • OR: One or both of the expressions must be true
  • NOT: The expression must be false, for its complement to return true
  • Exclusive OR (XOR): One or the other of the expressions must be true, but not both

Depending on the grammar, these operators may be available outside of expressions, to operate on them as a whole expression, or within expressions to operate on specific dimensions, or both.

For example, in CSS to find divs with the class foo and bar, you apply the operator within the expression:

1
div.foo.bar

But to find divs with a class foo or bar, you must combine separate expressions:

1
div.foo, div.bar

XPath on the other hand, provides the AND and OR operators both inside and outside an expression:

The following are equivalent AND expressions:

1
2
//div[@class="foo"][@class="bar"]
//div[@class="foo" and @class="bar"]

And the following are equivalent OR expressions:

1
2
//div[@class="foo"] | //div[@class="bar"]
//div[@class="foo" or @class="bar"]

Filtering vs Selection

For some filter grammars, filtering and selection are usually the same operation (the nodes that match the filter are the ones that are selected), with a few operators that are the exception. This is true of CSS: the elements returned are those matched by the filtering operators, with the exception of the pseudo-selectors (those starting with a colon) - these continue to filter without changing the selection target to a child or sibling.

1
ul > li:first-child

There is, however, no way to select an element based on its descendants without changing the selection focus to those descendants. The same is not true of XPath expressions, which always selecting the nodes that match the filter, but also allow filtering nodes based on their ancestors or children without changing the selection target.

1
2
3
4
5
# Selects descendant
//ul/li[1]

# Select parent based on descendant
//ul[li[position()=1]]

Some grammars require explicitly specifying what you want to select from the nodes that match the filter, like SQL’s SELECT operator.

1
2
-- Select username from users table
SELECT username FROM  users;

Regular expressions are perhaps the most complicated, with their capturing groups and non-capturing groups like the look ahead and look behind operators.

1
2
# Select "foo" followed by "bar" (but don't include "bar" in the output)
/(foo)(?=bar)*/

awk also provides a hold space that allows for having the target of the selection be different from the target for filtering.

A in-depth discussion of each is outside the scope of this article, but it is worth noting many grammars provide a slightly different solution to the same problem.

Overview of common filtering grammars


Operate on
Semantic Dimensions
Type
Selection
CSS Selectors
HTML or XML documents


Element hierarchy
Permutation
Pseudo-selectors further filter nodes without changing selection target
Element position
Permutation
Element type
Nominal
Attribute values
id and class: Space-separated Fields
Permutation (String)
Element’s state (hovered, focused, etc)
Nominal
XPath
Same as CSS selectors
- Element’s state
Use filtering on descendants or ancestors to filter without changing selection target
+ Element’s text content
Permutation (String)
jQuery Selectors
Same as CSS selector
Can use DOM traversal (outside of selection expression)
+ Some internal state (animation)
Nominal
+ Some extra operators for those dimensions

Globbing
File paths

Directory hierarchy
Permutation
Can use parent operator to select siblings or ancestors of match
Directory and file names
Permutation (String)
Regular Expressions
Strings
String content
Permutation (String)
Use capturing groups, look aheads and look behinds
grep
File contents

File lines
Permutation (String)
Use arguments to return files or lines
sed
Same as grep 
Use arguments to decide what’s printed
+ Line position
Permutation
awk
Same as sed
Use print to decide output
+ Fields
Fields
SQL Filtering & Selection
Relational databases
Tables

Nominal
SELECT operator
Fields
Depends on data type

Examples


Comments