DataModel Operators

Operators are pure functions that transform the data in a DataModel. Some operators are relational algebra compliant; the rest are provided for ease of use and to complete the workflow. Operators help you achieve the following:

  • Selection: Filtering rows/tuples of a DataModel based on a predicate
  • Projection: Selecting only specific fields (columns) from a DataModel
  • Grouping: Grouping rows/tuples based on an aggregation function on specific fields (columns)
  • Sorting: Arranging one or more fields (columns) in a DataModel in ascending or descending order based on the values
  • Binning: Transforming a measure into a set of discrete values, each value being an interval
  • Calculated Variables: Create a new field (column) from existing fields (columns)
  • Composing operations: Chain all previous operations or compose them to get a single DataModel

How to use Operators?

An operator is a function which operates on data and returns a new instance of DataModel with transformed data. This immutable behaviour helps you develop a large complex system of visualizations and interactivity.

You can either chain a set of operators on a source DataModel, or compose several operators into a new operator and pass it an instance of DataModel to transform.

In general, chaining operators looks like this:

resultantDef = obj
    .op1((...params) => /* predicate definition here */)
    .op2((...params) => /* predicate definition here */)
    .op3((...params) => /* predicate definition here */)

In general, composing operators looks like this:

composedOp = compose(
    op1((...params) => /* predicate definition here */),
    op2((...params) => /* predicate definition here */),
    op3((...params) => /* predicate definition here */)
); /* creates a new operator which is a combination of all other  operators */

resultantDef = composedOp(obj);

Head here to check out all the operators.

As you might have guessed, chaining creates intermediate DataModel instances and holds them in memory. compose internally performs the chained operations on the DataModel and returns a single DataModel instance, thereby eliminating the need to create multiple intermediate DataModels.
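To make the idea of composition concrete, here is a minimal plain-JavaScript sketch. It works on arrays of plain objects rather than DataModel instances, and `composeSketch`, `keepUSA`, and `namesOnly` are hypothetical names introduced only for illustration. It shows the left-to-right order in which composed operators are applied and that the input is left untouched; it does not reproduce how DataModel avoids intermediate instances.

```javascript
// Plain-JavaScript sketch; `composeSketch`, `keepUSA` and `namesOnly` are
// hypothetical stand-ins, not DataModel's own compose or operators.
const composeSketch = (...ops) => input => ops.reduce((acc, op) => op(acc), input);

// Each "operator" takes rows and returns a new array, leaving its input untouched.
const keepUSA = rows => rows.filter(row => row.Origin === 'USA');
const namesOnly = rows => rows.map(row => ({ Name: row.Name }));

// The composed operator applies keepUSA first, then namesOnly.
const usaNames = composeSketch(keepUSA, namesOnly);

const sampleRows = [
  { Name: 'car-a', Origin: 'USA' },
  { Name: 'car-b', Origin: 'Japan' }
];
const composedResult = usaNames(sampleRows);
console.log(composedResult); // [ { Name: 'car-a' } ]
```

The composed function can be reused on any other array of rows, which mirrors how a composed operator can be applied to different DataModel instances.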

Data

We will use the cars.csv data for illustration purposes.

Terminology

Throughout the document, we use the terms variables, fields, and columns interchangeably.

Variables come from the raw data and are translated to fields in DataModel. All of these are just columns of a tabular data format.

Commonly used relational algebra operators

Let's explore DataModel's operators. We describe a use case before using each operator. This way you will see how powerful operators are and develop an intuitive sense of where to use them.

select

What we want to do: We want to see only the cars made in the USA.

To achieve this, we need to keep only the rows whose value in the Origin field is USA.

We use the select operator to filter the rows. Selection is a row filtering operation based on a predicate. A predicate is a simple function which returns true or false. The predicate is called for every tuple (row), with the fields of the current tuple passed as the argument. By default, if the predicate returns true, the tuple is included in the resultant DataModel.

Head here to know more about this operator.

 // Data and schema is retrieved from https://www.charts.com/static/cars.json,
 // https://www.charts.com/static/cars-schema.json. DataModel is extracted from
 // muze namespace and assigned to DataModel variable.
  
const select = DataModel.Operators.select; /* Getting the select operator */

const dm = new DataModel(data, schema);
/* Selecting only the rows whose Origin is 'USA' */
const selectFn = select(fields => fields.Origin.value === 'USA');
const outputDM = selectFn(dm); /* Applying the selection operator */
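The predicate semantics described above can be sketched without DataModel. The snippet below is only an illustration under stated assumptions: `selectSketch` is a hypothetical helper, and rows are modelled as plain objects whose fields expose a `value` property, mirroring the `fields.Origin.value` access in the example.

```javascript
// Hypothetical helper: `selectSketch` is not part of DataModel.
// A row is kept only when the predicate returns true (inclusion behaviour).
const selectSketch = predicate => rows =>
  rows.filter(fields => predicate(fields) === true);

// Rows are modelled as objects whose fields expose a `value` property.
const carRows = [
  { Name: { value: 'car-a' }, Origin: { value: 'USA' } },
  { Name: { value: 'car-b' }, Origin: { value: 'Europe' } },
  { Name: { value: 'car-c' }, Origin: { value: 'USA' } }
];

const usaOnly = selectSketch(fields => fields.Origin.value === 'USA')(carRows);
console.log(usaOnly.map(fields => fields.Name.value)); // [ 'car-a', 'car-c' ]
```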
 

project

What we want to do: We want to know which car is from which country.

To achieve the above outcome, we need to only include Name and Origin (country) fields in our derived DataModel. We use project to include specific fields, based on a given list of fields.

Head here to know more about this operator.

// Data and schema is retrieved from https://www.charts.com/static/cars.json,
// https://www.charts.com/static/cars-schema.json. DataModel is extracted from
// muze namespace and assigned to DataModel variable.

const project = DataModel.Operators.project; /* Getting the project operator */
const dm = new DataModel(data, schema); /* Creating new DataModel */

/* Keeping only the name and origin fields in the DataModel */
const projectFn = project(['Name', 'Origin']);

const outputDM = projectFn(dm); /* Getting the new projected DataModel */

By default, all filtering operators operate in Inclusion mode, which means that if the predicate returns true, the row is included. Exclusion mode does the opposite (e.g., exclusion in this case would mean selecting all columns apart from Name and Origin). Learn more about the modes here.
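The difference between the two modes can be sketched for projection in plain JavaScript. This is an illustration only: `projectSketch` and its `'include'` / `'exclude'` mode strings are hypothetical placeholders, not DataModel's actual constants or signature.

```javascript
// Hypothetical helper; the `mode` strings are placeholders, not DataModel constants.
const projectSketch = (fieldNames, mode = 'include') => rows =>
  rows.map(row => {
    const kept = {};
    for (const key of Object.keys(row)) {
      const listed = fieldNames.includes(key);
      // Inclusion keeps the listed fields; exclusion keeps everything else.
      if (mode === 'include' ? listed : !listed) {
        kept[key] = row[key];
      }
    }
    return kept;
  });

const projRows = [{ Name: 'car-a', Origin: 'USA', Horsepower: 130 }];

const included = projectSketch(['Name', 'Origin'])(projRows);
const excluded = projectSketch(['Name', 'Origin'], 'exclude')(projRows);
console.log(included); // [ { Name: 'car-a', Origin: 'USA' } ]
console.log(excluded); // [ { Horsepower: 130 } ]
```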

groupBy

What we want to do: We want to see the maximum horsepower for each car maker.

To achieve the above result, we have to group the cars by maker. Essentially we are getting rid of all the dimensions but Maker and aggregating the values to get one value of Horsepower per Maker.

groupBy helps you achieve that, as shown below:

// Data and schema is retrieved from https://www.charts.com/static/cars.json,
// https://www.charts.com/static/cars-schema.json. DataModel is extracted from
// muze namespace and assigned to DataModel variable.

const groupBy = DataModel.Operators.groupBy;
const dm = new DataModel(data, schema);

const groupByFn = groupBy(['Maker'], {
  Horsepower: 'max'
});

const outputDM = groupByFn(dm);

To know more about this operator, you can head here.
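To see what grouping with a max aggregation computes, here is a small plain-JavaScript sketch on an array of objects. `groupMaxSketch` and the sample rows are hypothetical and exist only for illustration; the sketch handles a single dimension and a single measure, which is narrower than what groupBy accepts.

```javascript
// Hypothetical helper, not DataModel's groupBy: one dimension, one measure, max only.
const groupMaxSketch = (rows, dimension, measure) => {
  const maxByGroup = new Map();
  for (const row of rows) {
    const key = row[dimension];
    const current = maxByGroup.get(key);
    if (current === undefined || row[measure] > current) {
      maxByGroup.set(key, row[measure]);
    }
  }
  // One output row per distinct value of the dimension.
  return Array.from(maxByGroup, ([key, max]) => ({ [dimension]: key, [measure]: max }));
};

const makerRows = [
  { Maker: 'maker-a', Horsepower: 130 },
  { Maker: 'maker-b', Horsepower: 95 },
  { Maker: 'maker-a', Horsepower: 165 }
];
const maxHorsepower = groupMaxSketch(makerRows, 'Maker', 'Horsepower');
console.log(maxHorsepower);
// [ { Maker: 'maker-a', Horsepower: 165 }, { Maker: 'maker-b', Horsepower: 95 } ]
```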

sort

What we want to do: We want to sort cars first by their origin alphabetically and then by weight.

You can apply multi-level sorting using this operator. For more advanced use cases, you might want to order one variable with respect to another, for example sorting countries by the average mileage of their cars. To learn about all the sorting formats, head here.

// Data and schema is retrieved from https://www.charts.com/static/cars.json,
// https://www.charts.com/static/cars-schema.json. DataModel is extracted from
// muze namespace and assigned to DataModel variable.

const sort = DataModel.Operators.sort;
const dm = new DataModel(data, schema);

const sortFn = sort([ /* Note: sort does not support compose */
  ['Origin'],
  ['Weight_in_lbs', 'desc']
]);

const outputDM = sortFn(dm);

First, it sorts the DataModel alphabetically by Origin, then sorts the rows within each Origin by Weight_in_lbs in descending order, and returns the sorted DataModel.
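Multi-level sorting can be sketched in plain JavaScript as a comparator that walks the criteria in order and falls through to the next criterion on ties. `sortSketch` is a hypothetical helper; it reuses the `[fieldName, order]` shape from the example and assumes the order defaults to ascending when omitted.

```javascript
// Hypothetical helper, not DataModel's sort. Criteria use the same
// [fieldName, order] shape as the example; order is assumed to default to 'asc'.
const sortSketch = criteria => rows =>
  [...rows].sort((a, b) => {
    for (const [field, order = 'asc'] of criteria) {
      if (a[field] === b[field]) continue; // tie: fall through to the next criterion
      const result = a[field] < b[field] ? -1 : 1;
      return order === 'desc' ? -result : result;
    }
    return 0;
  });

const sortRows = [
  { Name: 'car-a', Origin: 'USA', Weight_in_lbs: 3000 },
  { Name: 'car-b', Origin: 'Japan', Weight_in_lbs: 2100 },
  { Name: 'car-c', Origin: 'USA', Weight_in_lbs: 4200 },
  { Name: 'car-d', Origin: 'Japan', Weight_in_lbs: 2500 }
];

const sortedRows = sortSketch([['Origin'], ['Weight_in_lbs', 'desc']])(sortRows);
console.log(sortedRows.map(row => row.Name)); // [ 'car-d', 'car-b', 'car-c', 'car-a' ]
```

The copy made with the spread operator keeps the input array unchanged, in line with the immutable behaviour of operators.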


Sorting does not support compose

Sorting is a computationally expensive operation. To perform sorting along with other operations, chain them together instead of composing them.

Other relational algebra compliant operators

There are many other operators, such as innerJoin, naturalJoin, leftOuterJoin, rightOuterJoin, fullOuterJoin, union, and difference, which you can use in a similar way. Explore all of these in DataModel's API.

Combining multiple operators

What we want to do: We want just the names of cars which were released in 1970.

To do this, you will need to execute multiple operators on the same data.

The first step is to apply the select operation to keep only the cars from 1970. In the resultant DataModel, the Year field contains the same value, 1970, in every row. In the second step, we project the Name field.

We compose a new operator, carsFrom1970, from the operators we already know.

// Data and schema is retrieved from https://www.charts.com/static/cars.json,
// https://www.charts.com/static/cars-schema.json. DataModel is extracted from
// muze namespace and assigned to DataModel variable.
  
const dm = new DataModel(data, schema); /* Creating new DataModel */
const compose = DataModel.Operators.compose; /* Getting the compose operator */
const select = DataModel.Operators.select;  /* Getting the select operator */
const project = DataModel.Operators.project; /* Getting the project operator */

// Apply a selection function to keep only the cars from 1970, then project the Name field
const carsFrom1970 = compose(
  select(fields => +fields.Year.value === +new Date(1970, 0, 1)),
  project(['Name'])
);

const outputDM = carsFrom1970(dm);

Chaining operators

An alternate way to solve this is by chaining the operators.

// Data and schema is retrieved from https://www.charts.com/static/cars.json,
// https://www.charts.com/static/cars-schema.json. DataModel is extracted from
// muze namespace and assigned to DataModel variable.
 
const dm = new DataModel(data, schema); /* Creating new DataModel */

const outputDM = dm
	.select(fields => +fields.Year.value === +new Date(1970, 0, 1))
	.project(['Name']);


Chaining vs Composing

Although chaining operators looks very concise, it creates intermediate instances of DataModel which we might never use. That's why we prefer composing over chaining.

Additional operators provided by DataModel

There are a few operators which are not defined by relational algebra concepts but are useful for completing data transformations intended for data visualization.

calculateVariable

What we want to do: We want to know how powerful a car is when you press the gas!

For a car, horsepower alone does not tell you how powerful it feels, although it looks like it should. We have to take the weight into account, since heavier cars tend to have higher horsepower and vice versa.

Hence, to get the power to weight ratio of a car, we have to calculate a new variable from the existing ones:

power_to_weight = horsepower / weight_in_lbs

// Data and schema is retrieved from https://www.charts.com/static/cars.json,
// https://www.charts.com/static/cars-schema.json. DataModel is extracted from
// muze namespace and assigned to DataModel variable.
  
const calculateVariable = DataModel.Operators.calculateVariable;
const dm = new DataModel(data, schema);

const calculationFn = calculateVariable({ /* schema of the new variable */
  name: 'power_to_weight', 
  type: 'measure' 
}, ['Horsepower', 'Weight_in_lbs', /* dependent variable */ (hp, weight) => { 
  return hp / weight;
}]);

const outputDM = calculationFn(dm);

Breaking the above code down:

  • We've created a new field by declaring its name and field type:

    calculateVariable({
      name: 'power_to_weight',  /* the new variable being created */
      type: 'measure' /* type of the new variable */
    }
    
  • The next argument specifies the existing fields from which the new field is derived, along with the derivation logic. We provide an array whose leading elements are the names of existing fields and whose last element is a function that returns the value of the new variable. The function is called for each row of the subject DataModel instance, with the dependent fields' values passed as parameters.

    ['Horsepower', 'Weight_in_lbs',  /* value is calculated based on these two variables */
    () => {}]  /* function is called for each row, receiving the values of the dependent variables */
    

    This function receives the value of each of the listed fields, which can be combined to get the calculated variable:

    ['Horsepower', 'Weight_in_lbs', (hp, weight) => {
          return hp / weight; /* Logic for value of the calculated variable */
     }];
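The way the dependency array is interpreted can be sketched in plain JavaScript. This is not DataModel's implementation: `calculateSketch` is a hypothetical helper that works on arrays of plain objects, and it ignores the `type` entry of the configuration, which only matters to DataModel itself.

```javascript
// Hypothetical helper, not DataModel's calculateVariable.
const calculateSketch = (config, dependency) => rows => {
  const fieldNames = dependency.slice(0, -1);          // every element except the last
  const resolver = dependency[dependency.length - 1];  // the last element is the function
  return rows.map(row => ({
    ...row,
    // The resolver is called once per row with the dependent fields' values.
    [config.name]: resolver(...fieldNames.map(name => row[name]))
  }));
};

const powerRows = [
  { Name: 'car-a', Horsepower: 150, Weight_in_lbs: 3000 },
  { Name: 'car-b', Horsepower: 100, Weight_in_lbs: 2500 }
];

const withRatio = calculateSketch(
  { name: 'power_to_weight', type: 'measure' },
  ['Horsepower', 'Weight_in_lbs', (hp, weight) => hp / weight]
)(powerRows);
console.log(withRatio.map(row => row.power_to_weight)); // [ 0.05, 0.04 ]
```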
    

calculateVariable is not composable

calculateVariable should not be called in a loop or towards the bottom of an operation chain. If possible, perform it towards the top of the chain. calculateVariable is intentionally kept out of compose to enforce these two constraints.

In the above example, we created a new field, power_to_weight, which is a measure. If we need to create a dimension as a calculated variable, we can do it through a look-up table, as shown in the example below. Here, based on the power_to_weight ratio, we intend to assign a text value (dimension) which is one of: Low, Decent, or High.

We will name this new variable (field) Performance, which is derived from the earlier calculated variable power_to_weight and a look-up table.

// Data and schema is retrieved from https://www.charts.com/static/cars.json,
// https://www.charts.com/static/cars-schema.json. DataModel is extracted from
// muze namespace and assigned to DataModel variable.

const calculateVariable = DataModel.Operators.calculateVariable;
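/* Illustrative look-up table; the calculation below hard-codes its own thresholds instead of reading from it */
const performanceCategory = { range: [0.3, 0.6, 1], cat: ['Low', 'Decent', 'High'] };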
const dm = new DataModel(data, schema);

const calculationFn = calculateVariable({
  name: 'power_to_weight', 
  type: 'measure' 
}, ['Horsepower', 'Weight_in_lbs', (hp, weight) =>{ 
  return hp / weight;
}]);

const newFieldDM = calculationFn(dm);

const fn = calculateVariable({
  name: 'Performance',
  type: 'dimension'
}, ['power_to_weight', (powerIndex) => {
  if (powerIndex > 0.042) {
    return 'High';
  } else if (powerIndex > 0.035) {
    return 'Decent';
  } else {
    return 'Low';
  }
}]);

const outputDM = fn(newFieldDM);
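The text above mentions a look-up table. One way such a table could drive the categorization is sketched below in plain JavaScript. `performanceLookup` and `categorize` are hypothetical names, and the thresholds are illustrative values chosen to mirror the cut-offs in the example, not values prescribed by DataModel.

```javascript
// Hypothetical look-up table and helper; the thresholds are illustrative only.
const performanceLookup = {
  range: [0.035, 0.042, Infinity], // inclusive upper bound of each bucket
  cat: ['Low', 'Decent', 'High']   // label of the bucket at the same index
};

// Returns the label of the first bucket whose upper bound is not exceeded.
// A value that matches no bucket (for example NaN) yields undefined.
const categorize = (value, table) => {
  const index = table.range.findIndex(upperBound => value <= upperBound);
  return table.cat[index];
};

console.log(categorize(0.03, performanceLookup)); // Low
console.log(categorize(0.04, performanceLookup)); // Decent
console.log(categorize(0.05, performanceLookup)); // High
```

A function like `categorize` could then be called from the resolver passed to calculateVariable, so that the thresholds and labels live in one place.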

Wrapping up

We have discussed the most commonly used operators in the sections above. DataModel supports many more operators. Check out the API for DataModel and Relation to understand all the operators.