Introduction to DataModel

What is DataModel?

DataModel is a minimalistic, in-browser representation of tabular data, which is fed to Muze for visualization. It supports relational algebra operators which enable you to run selection, projection, group, bin, join (and many more) operations on the data.

Muze treats instances of DataModel as its first-class representation of data and relies heavily on DataModel operators. You can configure custom visualizations and interactions by combining these operators. See the examples to learn what DataModel makes possible.

DataModel can also be used independently of Muze if you need an in-browser tabular data store for analysis, visualization, or general use. Because DataModel's operations are consistent, it can be combined with other JavaScript libraries to cover a wide range of use cases.

Why is a DataModel even necessary for visualization?

The first thing you need to create a visualization is data. Everything you see in a chart (title, subtitle, scales, annotations, plots, and interactions) is driven by data derived by applying transformations to a single source of truth: your source data.

Many charting libraries on the market accept data only in pre-defined formats dictated by the chart type you're rendering. In addition, they offer no built-in methods to transform data either before rendering or after an interaction on the chart, which reduces the chart to a dumb rendering engine. This means more work for you as a developer: you have to write chart-specific transformations or data operations yourself, leading to development overhead, applications that don't scale, and even multiple sources of truth.

DataModel addresses this by giving you one consistent data format, a well-established mental model, and a powerful set of transformation functions.

For all these data manipulations, DataModel draws inspiration from SQL's data manipulation philosophy. SQL first appeared in 1974 and remains relevant for its simplicity, wide adoption, and power of data manipulation (query, transform, etc.). SQL paradigms follow the basic semantics of relational algebra, which breaks high-level operations into atomic sub-operations and provides a grammar for composing multiple operations by nesting or chaining them.

DataModel implements a subset of those high-level and atomic operators, with a few constraints (since it runs in the browser with limited resources).
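
The idea of composing atomic operations can be sketched in plain JavaScript, without DataModel. The helpers `select`, `project`, and `compose` below are illustrative assumptions and not DataModel's API; chained together they roughly mirror the SQL query `SELECT Name, Horsepower FROM cars WHERE Horsepower > 140`.

```javascript
// Illustrative only: plain-JavaScript stand-ins for relational operators.
// These helpers are hypothetical and are not part of DataModel's API.
const cars = [
  { Name: 'chevrolet chevelle malibu', Horsepower: 130, Origin: 'USA' },
  { Name: 'buick skylark 320', Horsepower: 165, Origin: 'USA' },
  { Name: 'plymouth satellite', Horsepower: 150, Origin: 'USA' }
];

// Selection: keep the rows that satisfy a predicate.
const select = predicate => rows => rows.filter(predicate);

// Projection: keep only the named columns.
const project = columns => rows =>
  rows.map(row => Object.fromEntries(columns.map(c => [c, row[c]])));

// Composition: apply the operators from left to right.
const compose = (...ops) => rows => ops.reduce((acc, op) => op(acc), rows);

const powerfulCars = compose(
  select(row => row.Horsepower > 140),
  project(['Name', 'Horsepower'])
)(cars);

console.log(powerfulCars);
```

Each helper takes rows and returns new rows, so the operators can be nested or chained in any order without touching the source array.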

However, you don't need to have any prior understanding of relational algebra or SQL to use DataModel. We will be explaining the required concepts whenever needed.

Features of DataModel

  • Supports relational algebra operators to:

    • Filter rows
    • Filter columns
    • Group / aggregate data
    • Join data
    • Get union of data
    • Get intersection of data
  • Compose operations, with multiple levels of nesting possible

  • Provide additional data operators:

    • Create new calculated variable (column) from existing ones
    • Sort one or more columns
    • Bin data
  • Immutable - DataModel is immutable. Every operator creates a new instance of DataModel. Chaining of multiple such operators essentially creates a Directed Acyclic Graph (DAG). You can pick up any node from the graph and pass it to Muze to visualize the data.

  • Propagation - Since applying multiple operators creates a DAG, DataModel can propagate any data event from one node to another with proper semantics. Changes are automatically propagated from the root (source) DataModel to all the derived (manipulated) DataModels. In a nutshell, this is how Muze's renderer automatically connects charts.
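
The immutability and propagation ideas above can be pictured with a small sketch. The `TableNode` class and its methods are hypothetical names chosen for illustration; this is not how DataModel is implemented, only one way to model operators that return new nodes and events that flow from a source node to its descendants.

```javascript
// Hypothetical sketch: each operator returns a new node linked to its parent.
// This is not DataModel's implementation, only an illustration of the idea.
class TableNode {
  constructor(rows, parent = null, transform = null) {
    this.rows = Object.freeze(rows.slice()); // rows are never mutated after creation
    this.parent = parent;
    this.transform = transform; // how this node's rows derive from the parent's rows
    this.children = [];
    if (parent) parent.children.push(this);
  }

  // Deriving a node leaves this node's rows untouched and adds an edge to the graph.
  derive(transform) {
    return new TableNode(transform(this.rows), this, transform);
  }

  select(predicate) {
    return this.derive(rows => rows.filter(predicate));
  }

  // Propagation: push a set of rows down to every descendant,
  // re-applying each node's transform along the way.
  propagate(rows, listener) {
    for (const child of this.children) {
      const derived = child.transform(rows);
      listener(child, derived);
      child.propagate(derived, listener);
    }
  }
}

const root = new TableNode([
  { Name: 'chevrolet chevelle malibu', Horsepower: 130 },
  { Name: 'buick skylark 320', Horsepower: 165 },
  { Name: 'plymouth satellite', Horsepower: 150 }
]);
const strong = root.select(row => row.Horsepower > 140);
const strongest = strong.select(row => row.Horsepower > 160);

// Propagate a one-row selection from the root through the graph.
const seen = [];
root.propagate([root.rows[1]], (node, rows) => seen.push(rows.length));
console.log(root.rows.length, strong.rows.length, strongest.rows.length, seen);
```

Any node in this graph (`root`, `strong`, or `strongest`) holds a complete table of its own, which is the property that lets a node be handed to a renderer independently.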

Creating an instance of DataModel from the most common data formats

DataModel can accept input tabular data in 3 of the most popular formats:

  • Delimiter separated format (DSV)
  • Flat JSON
  • 2D Array

Here is the sample data which we will be representing in 3 different formats.

Name                      | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin
chevrolet chevelle malibu | 18               | 8         | 307          | 130        | 3504          | 12           | 1970 | USA
buick skylark 320         | 15               | 8         | 350          | 165        | 3693          | 11.5         | 1970 | USA
plymouth satellite        | 18               | 8         | 318          | 150        | 3436          | 11           | 1970 | USA

This table compares different attributes of a list of cars. It has 9 columns and, for the sake of simplicity, just 3 rows.

Now, let's see how this table is converted into each format to act as input for DataModel.

DataModel can ingest data in any of the three formats.

  • Delimiter separated format (DSV): If you want to use DSV format (CSV, TSV, etc.) to provide data to DataModel, the above table would result in this format:

    Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
    chevrolet chevelle malibu,18,8,307,130,3504,12,1970,USA
    buick skylark 320,15,8,350,165,3693,11.5,1970,USA
    plymouth satellite,18,8,318,150,3436,11,1970,USA
    
  • Flat JSON: If you want to use flat JSON as a data format, the above table would be represented as

    [
        {
            "Name":"chevrolet chevelle malibu",
            "Miles_per_Gallon":18,
            "Cylinders":8,
            "Displacement":307,
            "Horsepower":130,
            "Weight_in_lbs":3504,
            "Acceleration":12,
            "Year":"1970-01-01",
            "Origin":"USA"
        },
        {
            "Name":"buick skylark 320",
            "Miles_per_Gallon":15,
            "Cylinders":8,
            "Displacement":350,
            "Horsepower":165,
            "Weight_in_lbs":3693,
            "Acceleration":11.5,
            "Year":"1970-01-01",
            "Origin":"USA"
        },
        {
            "Name":"plymouth satellite",
            "Miles_per_Gallon":18,
            "Cylinders":8,
            "Displacement":318,
            "Horsepower":150,
            "Weight_in_lbs":3436,
            "Acceleration":11,
            "Year":"1970-01-01",
            "Origin":"USA"
        }
    ]
    
  • 2D Array: If you want to use 2D JavaScript array as a data format, the above table would be represented as

    [
       ["Name", "Miles_per_Gallon", "Cylinders", "Displacement", "Horsepower", "Weight_in_lbs", "Acceleration", "Year", "Origin"],
       [ "chevrolet chevelle malibu", 18, 8, 307, 130, 3504, 12,"1970-01-01", "USA" ],
       [ "buick skylark 320", 15, 8, 350, 165, 3693, 11.5,"1970-01-01", "USA" ],
       [ "plymouth satellite", 18, 8, 318, 150, 3436, 11,"1970-01-01", "USA" ]
    ];
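
To make the relationship between the three formats concrete, here is a minimal sketch that converts a DSV string into a 2D array and then into flat JSON. The helpers `dsvToArray2D` and `array2DToFlatJSON` are hypothetical and not part of DataModel; the sketch assumes a simple DSV with no quoted fields or embedded delimiters, and it leaves every value as a string.

```javascript
// Hypothetical helpers (not part of DataModel) that convert between formats.
// Assumes a simple DSV with no quoted fields or embedded delimiters.
function dsvToArray2D(text, delimiter = ',') {
  return text
    .trim()
    .split('\n')
    .map(line => line.trim().split(delimiter));
}

// Turn a 2D array (header row first) into flat JSON records.
function array2DToFlatJSON([header, ...rows]) {
  return rows.map(row =>
    Object.fromEntries(header.map((name, i) => [name, row[i]]))
  );
}

const dsv = `Name,Miles_per_Gallon,Cylinders,Year,Origin
chevrolet chevelle malibu,18,8,1970,USA
buick skylark 320,15,8,1970,USA`;

const table = dsvToArray2D(dsv);          // 2D array, header row first
const records = array2DToFlatJSON(table); // one object per data row
console.log(records[0]);
```

The three formats carry the same information; they differ only in where the column names live (first line, object keys, or first inner array).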
    

Providing the schema of the data to DataModel

So far, we've converted the table into one of the 3 data formats that DataModel accepts. Next, we will learn how a schema tells DataModel about the characteristics of the columns.

The following schema properties are the ones to know first. Each is described briefly here and explained in detail in the sections that follow.

  • name: Name of the variable for which schema is defined

  • type: Type of the variable: Measure or Dimension. Default is Dimension.

  • subtype: Subtype of the variable within its type (for example, categorical or temporal for a dimension).

  • defAggFn: Default aggregation function for a measure. Default is sum.

A field (variable in data, column in table) in DataModel is categorized as either a dimension or a measure.

Dimension

Dimensions are qualitative variables that help categorize a data point.

Measure

Measures are quantitative variables that quantify a group of dimensional values. Mathematical functions are applied to measures.

The type of a variable is passed to DataModel through the type property of the schema. DataModel's schema is an array of simple key-value objects. Each variable in the data (column in the table) needs an entry in the schema.

Missing variable in schema

If a variable is present in data but omitted from schema, then the variable is not included in DataModel.
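
The defaults and the omission rule described above can be sketched as follows. The functions `normalizeSchema` and `applySchema` are hypothetical; DataModel's internals may work differently, and the sketch mirrors only the documented behavior (type defaults to dimension, defAggFn defaults to sum, variables without a schema entry are left out).

```javascript
// Hypothetical sketch of how a schema could shape incoming records.
// DataModel's real internals may differ; only the documented defaults are mirrored.
function normalizeSchema(schema) {
  return schema.map(field => {
    const type = field.type || 'dimension'; // default type
    const normalized = { ...field, type };
    if (type === 'measure' && !field.defAggFn) {
      normalized.defAggFn = 'sum'; // default aggregation for measures
    }
    return normalized;
  });
}

// Keep only the variables that have a schema entry.
function applySchema(records, schema) {
  const names = schema.map(field => field.name);
  return records.map(record =>
    Object.fromEntries(names.map(name => [name, record[name]]))
  );
}

const schema = normalizeSchema([
  { name: 'Name' },
  { name: 'Horsepower', type: 'measure' }
]);
const kept = applySchema(
  [{ Name: 'buick skylark 320', Horsepower: 165, Origin: 'USA' }],
  schema
);
console.log(schema, kept);
```

Here Origin is present in the data but absent from the schema, so it does not appear in the resulting record.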

Dimensions can be further categorized as categorical or temporal through the subtype property. Dates and times are treated as temporal dimensions, which behave differently from categorical dimensions. The default subtype for a dimension is categorical.

Example

The Year column contains the release year of each car. All date-time columns are treated as dimensions, on which you can run operations such as grouping by date part or filtering by date range.

For a temporal field, DataModel automatically recognizes milliseconds and JavaScript Date objects. For other representations, the schema's format property supports many date-time formats. Read more about custom datetime parsing here.
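
As a rough picture of what a format token does, the sketch below parses a value into milliseconds. The function `parseDate` is hypothetical and is not DataModel's parser. It handles `%Y` (the token used in this guide) plus `%m` and `%d`, which are assumed here by analogy with strftime, and it uses UTC to keep results deterministic, which is also an assumption.

```javascript
// Hypothetical mini-parser: supports only %Y, %m and %d, and is not DataModel's parser.
function parseDate(value, format) {
  if (value instanceof Date) return value.getTime(); // Date objects pass through
  if (typeof value === 'number') return value;       // already milliseconds

  const tokenPatterns = { '%Y': '(\\d{4})', '%m': '(\\d{2})', '%d': '(\\d{2})' };
  const order = [];
  const pattern = format
    .replace(/[.*+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/%[Ymd]/g, token => {
      order.push(token); // remember which capture group holds which part
      return tokenPatterns[token];
    });

  const match = new RegExp('^' + pattern + '$').exec(String(value));
  if (!match) return null; // value does not follow the format

  const parts = { '%Y': 1970, '%m': 1, '%d': 1 }; // defaults for missing parts
  order.forEach((token, i) => {
    parts[token] = Number(match[i + 1]);
  });
  return Date.UTC(parts['%Y'], parts['%m'] - 1, parts['%d']);
}

console.log(parseDate('1970', '%Y'));
console.log(parseDate('1970-01-02', '%Y-%m-%d'));
```

Parts that the format does not mention fall back to the start of the period, so a bare year resolves to the first instant of that year.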

Measures need to have an aggregation function (mathematical functions e.g., sum, avg, min, max etc.) defined on them. This function is called whenever grouping (aggregation) is required. If you don't specify this, the default aggregation function used is sum. This function is passed to schema as a value of defAggFn property. Read here to know more about the aggregation functions provided by DataModel. You can also define your own aggregation functions by using DataModel's reducer registration API.

Example

  • You can compute the average of Miles_per_Gallon across a category of cars, or across cars within a range of Horsepower. Transformations like this help you get more insight from your data.

  • You can find the most powerful car by doing a max on Horsepower across all cars.
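
The two examples above can be worked through in plain JavaScript. The `reducers` table and the `groupBy` function below are hypothetical stand-ins written for this illustration; they are not DataModel's API, though the reducer names match the aggregation functions mentioned in this section.

```javascript
// Hypothetical groupBy sketch in plain JavaScript; not DataModel's API.
const reducers = {
  sum: values => values.reduce((a, b) => a + b, 0),
  avg: values => values.reduce((a, b) => a + b, 0) / values.length,
  max: values => Math.max(...values),
  min: values => Math.min(...values)
};

// Group records by one dimension and aggregate one measure per group.
function groupBy(records, dimension, measure, aggFn = 'sum') {
  const groups = new Map();
  for (const record of records) {
    const key = record[dimension];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(record[measure]);
  }
  return [...groups].map(([key, values]) => ({
    [dimension]: key,
    [measure]: reducers[aggFn](values)
  }));
}

const cars = [
  { Name: 'chevrolet chevelle malibu', Miles_per_Gallon: 18, Horsepower: 130, Origin: 'USA' },
  { Name: 'buick skylark 320', Miles_per_Gallon: 15, Horsepower: 165, Origin: 'USA' },
  { Name: 'plymouth satellite', Miles_per_Gallon: 18, Horsepower: 150, Origin: 'USA' }
];

const avgMileage = groupBy(cars, 'Origin', 'Miles_per_Gallon', 'avg');
const maxPower = groupBy(cars, 'Origin', 'Horsepower', 'max');
console.log(avgMileage, maxPower);
```

With the three sample cars, the average Miles_per_Gallon for USA is 17 and the maximum Horsepower is 165.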

This is what a schema definition looks like:

// Rows start at column 0 so that no leading whitespace ends up in the Name values
const data =
`Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
chevrolet chevelle malibu,18,8,307,130,3504,12,1970,USA
buick skylark 320,15,8,350,165,3693,11.5,1970,USA
plymouth satellite,18,8,318,150,3436,11,1970,USA`;

const schema = [
  {
    name: 'Name' // Name of the variable in data; type defaults to dimension
  },
  {
    name: 'Miles_per_Gallon', // Name of the variable in data
    type: 'measure'
    // No defAggFn given, so the default aggregation function (sum) is used
  },
  {
    name: 'Cylinders', // Must match the column name in data exactly
    type: 'dimension'
  },
  {
    name: 'Displacement',
    type: 'measure',
    defAggFn: 'max' // Default aggregation function is max for Displacement
  },
  {
    name: 'Horsepower', // Must match the column name in data exactly
    type: 'measure',
    defAggFn: 'max'
  },
  {
    name: 'Weight_in_lbs',
    type: 'measure',
    defAggFn: 'avg'
  },
  {
    name: 'Acceleration',
    type: 'measure',
    defAggFn: 'avg'
  },
  {
    name: 'Year',
    type: 'dimension', // Date time is a dimension
    subtype: 'temporal', // Tells DataModel that this is a datetime variable
    format: '%Y' // Token used to parse each value of the variable as a year
  },
  { name: 'Origin' }
];

// Ingest both the data and the schema in DataModel
const DataModel = muze.DataModel;
const dm = new DataModel(data, schema);

DataModel's schema supports more properties to deeply identify a variable. Read more about it here.

To recap on DataModel:

  • You bring your data to the browser in one of the 3 formats (DSV, flat JSON, or 2D array) and feed it to DataModel
  • You provide a schema of your data to DataModel, which maps:
    • Column in table → Field (Dimension or Measure) in DataModel
    • Date or number formats, if it's a date or numeric column
    • Default aggregation function for any numeric column

If you're using a relational database to fetch data, this mapping may help you:

  • Column in table → Field (Dimension/Measure) in DataModel
  • Row in table → Tuple / row in DataModel
  • Table definition (column types) → Schema in DataModel
  • Data operations on table through SQL → Relational Operators in DataModel

Wrapping up

Now that we know how a DataModel is created, let's see how the power of DataModel unfolds when we apply operators to instances of DataModel.