Dataset Extraction (Pinset)

The Pinset language offers specific syntax constructs to extract table-like datasets from models. The main objective of Pinset is to facilitate the analysis of model data via conventional data mining and machine learning techniques, which require tabular input. In addition, tables can be useful as an extra viewpoint when creating model visualisations.

Model example

As a running example, we use a course model that contains the enrolled students along with their grades. All models and Pinset scripts shown in this documentation can be found in an example project in the Epsilon repository.

All Pinset scripts query the following metamodel:

classDiagram
class Course {
    name: String
}
class Student {
    ID: String
    name: String
    isRemote: Boolean
}
class ContactDetails {
    email: String
    phone: String
}
class EvaluationItem {
    name: String
    percentage: int
}
class Grade {
    points: int
}
Course *--> Student: students *
Course *--> EvaluationItem: items *
Student *--> ContactDetails: contact
Student *--> Grade: grades *
Grade --> EvaluationItem: item

As for the data shown as a result of the Pinset scripts, we use the following Flexmi model, which conforms to the metamodel above:

<?nsuri grades?>
<course name="Model-Driven Engineering">
    <item name="Lab 1" perc="15"/>
    <item name="Lab 2" perc="15"/>
    <item name="Partial Test" perc="20"/>
    <item name="Final Exam" perc="50"/>

    <student id="S1" name="Alice">
        <contact email="alice@university.com" phone="+44 101"/>
        <grade item="Lab 1" points="60"/>
        <grade item="Lab 2" points="90"/>
        <grade item="Partial Test" points="80"/>
        <grade item="Final Exam" points="85"/>
    </student>
    <student id="S2" name="Bob" remote="true">
        <contact email="bob@university.com" phone="+44 654"/>
        <grade item="Lab 1" points="60"/>
        <grade item="Final Exam" points="100"/>
    </student>
    <student id="S3" name="Charlie" remote="true">
        <contact email="charlie@university.com" phone="+44 333"/>
        <grade item="Lab 1" points="50"/>
        <grade item="Lab 2" points="35"/>
        <grade item="Partial Test" points="20"/>
    </student>
    <student id="S4" name="Dana">
        <contact email="dana@university.com"/>
        <grade item="Lab 1" points="100"/>
        <grade item="Lab 2" points="90"/>
        <grade item="Partial Test" points="70"/>
        <grade item="Final Exam" points="95"/>
    </student>
</course>

Overview

This first Pinset example defines a dataset from student data, containing some basic information such as name and student ID, contact details, the number of completed evaluation items, and the final grade for the course:

dataset studentsSummary over s : Student {
    column id: s.ID
    column name: s.name
    column phone: s.contact.phone

    column items_completed: s.grades.size
    column final_grade : s.getFinalGrade()

    column course_outcome {
        if (final_grade < 50) {
            return "fail";
        }
        else if (final_grade < 70) {
            return "good";
        }
        else if (final_grade < 90) {
            return "notable";
        }
        else {
            return "excellent";
        }
    }
}

@cached
operation Student getFinalGrade() {
    return self.grades
            .collect(g | g.points * g.item.percentage)
            .sum() / 100;
}

From that Pinset script, the following dataset is generated:

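| id | name | phone | items_completed | final_grade | course_outcome |
|----|------|-------|-----------------|-------------|----------------|
| S1 | Alice | +44 101 | 4 | 81 | notable |
| S2 | Bob | +44 654 | 2 | 59 | good |
| S3 | Charlie | +44 333 | 3 | 16 | fail |
| S4 | Dana | | 4 | 90 | excellent |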

As the above example shows, Pinset offers a rule-based syntax to declare datasets. These rules are specified as a set of column generators that capture data from instances of a type included in an input model. That type is defined as a parameter, after the over keyword. In the example, the chosen type is Student, which by default means that each Student instance of the input model will be used to populate a row of the output dataset.

Pinset offers different column generators. This first example uses the column generator, which is composed of the name of the column header and an EOL expression that calculates the cell value for the row element.

Other common EOL constructs are also available in Pinset scripts. For instance, an EOL block can be used for those column calculations that might be better organised in an imperative set of statements, such as the course_outcome column that shows the final course result in a textual format as used in the Spanish education system. In addition, external operations can be invoked in the column expressions, such as the getFinalGrade() operation used in the example.

As a last comment on the column generator, values of previously calculated columns of an element can be used in subsequent definitions. For instance, the course_outcome column uses the value of the final_grade column.
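The final_grade and course_outcome columns of the overview dataset can be checked by hand. The following Python sketch is only an emulation of the arithmetic, not part of Pinset: it assumes that the division in getFinalGrade() truncates to an integer, which is consistent with the 16 shown for Charlie (the exact weighted sum is 16.75). The dictionaries and function names are illustrative.

```python
# Evaluation item weights (percentages) from the example course model.
WEIGHTS = {"Lab 1": 15, "Lab 2": 15, "Partial Test": 20, "Final Exam": 50}

# Points obtained by each student, keyed by evaluation item name.
GRADES = {
    "S1": {"Lab 1": 60, "Lab 2": 90, "Partial Test": 80, "Final Exam": 85},
    "S2": {"Lab 1": 60, "Final Exam": 100},
    "S3": {"Lab 1": 50, "Lab 2": 35, "Partial Test": 20},
    "S4": {"Lab 1": 100, "Lab 2": 90, "Partial Test": 70, "Final Exam": 95},
}


def final_grade(grades):
    """Weighted sum of points; the integer division is an assumption."""
    return sum(points * WEIGHTS[item] for item, points in grades.items()) // 100


def course_outcome(grade):
    """Same thresholds as the course_outcome column block."""
    if grade < 50:
        return "fail"
    elif grade < 70:
        return "good"
    elif grade < 90:
        return "notable"
    return "excellent"


for student_id, grades in GRADES.items():
    grade = final_grade(grades)
    # Columns: id, items_completed, final_grade, course_outcome
    print(student_id, len(grades), grade, course_outcome(grade))
```

Missing grades simply contribute nothing to the sum, which is why Bob reaches 59 with only two of the four items.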

After this overview, the next sections describe additional column generators, as well as other features offered by Pinset to make dataset extraction easier to specify.

Properties accessors

As a way to facilitate the definition of columns that simply hold element properties, Pinset offers some column generators to access these properties:

dataset studentsContact over s : Student {
    properties [ID as StudentId, name]
    reference contact[email, phone]
}

The previous dataset rule results in:

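| StudentId | name | contact_email | contact_phone |
|-----------|------|---------------|---------------|
| S1 | Alice | alice@university.com | +44 101 |
| S2 | Bob | bob@university.com | +44 654 |
| S3 | Charlie | charlie@university.com | +44 333 |
| S4 | Dana | dana@university.com | |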

More precisely, Pinset offers two property accessors: the properties generator creates columns for attributes of the selected type (e.g. ID and name in the example), while the reference generator retrieves attributes through single-valued references of the type (i.e. those with an upper bound of 1), such as contact.

When using the properties accessor, the attribute name is used as the column name, while the reference accessor combines the name of the reference with the name of the attribute (e.g. contact_phone). This default behaviour can be altered with the as keyword, as in ID as StudentId.

These accessors are also null-safe: if an attribute or the traversed reference is null, Pinset automatically inserts a blank value in the cell.
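The naming and null-safety rules of the accessors can be sketched in a few lines of Python. This is an illustrative emulation under the assumption that elements are plain dictionaries; the helper names are hypothetical and do not correspond to any Pinset API.

```python
def blank_if_none(value):
    """Null safety: a missing value becomes a blank cell."""
    return "" if value is None else value


def properties(element, specs):
    """Emulates `properties [...]`; specs are (attribute, alias-or-None) pairs."""
    return {alias or attr: blank_if_none(element.get(attr)) for attr, alias in specs}


def reference(element, ref_name, attrs):
    """Emulates `reference ref[...]`; columns are named <reference>_<attribute>."""
    target = element.get(ref_name) or {}  # a null reference yields blank cells
    return {f"{ref_name}_{attr}": blank_if_none(target.get(attr)) for attr in attrs}


# Dana has an email but no phone in the example model.
dana = {"ID": "S4", "name": "Dana", "contact": {"email": "dana@university.com"}}

row = {
    **properties(dana, [("ID", "StudentId"), ("name", None)]),
    **reference(dana, "contact", ["email", "phone"]),
}
print(row)
```

The resulting row has an empty contact_phone cell, matching the last row of the studentsContact dataset.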

Row filtering

By default, all elements of the selected type are processed into rows. As this might sometimes not be desired, Pinset offers some ways to filter out rows from the resulting dataset:

dataset remoteStudents over s : Student {
    guard: s.isRemote

    properties[ID, name]
}

dataset finalExamAssistants over s : Student
    from : Student.all.select(s |
            s.grades.exists(g | g.item.name == "Final Exam")) {

    properties[ID, name]
}

These dataset rules show the two ways that can be used to perform filtering in Pinset:

  • The remoteStudents dataset uses a guard to limit the processed students to the remote ones (based on their isRemote boolean attribute). Any element that does not satisfy the guard is excluded from the dataset generation step.

  • The finalExamAssistants dataset uses a from expression to include only those students that took the final exam of the course. A from expression must return a collection of elements of the selected type, which is then used for the dataset generation. This expression can therefore be used for row filtering, and for other purposes such as performance improvements (e.g. calculating a collection once and reusing it for multiple dataset generations).

If necessary, both filtering mechanisms can be used simultaneously. For instance, if we combine the guard and from expressions shown above, we would obtain a dataset with the remote students that took the final exam of the course.
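To see what the combination yields on the example model, the two filters can be emulated in Python. This sketch assumes that the from collection is computed first and the guard is then checked per element; the order is an assumption, although the result is the same either way. The data structure and the generate helper are illustrative.

```python
# Students from the example model, reduced to the fields used by the filters.
STUDENTS = [
    {"ID": "S1", "name": "Alice", "isRemote": False,
     "items": {"Lab 1", "Lab 2", "Partial Test", "Final Exam"}},
    {"ID": "S2", "name": "Bob", "isRemote": True,
     "items": {"Lab 1", "Final Exam"}},
    {"ID": "S3", "name": "Charlie", "isRemote": True,
     "items": {"Lab 1", "Lab 2", "Partial Test"}},
    {"ID": "S4", "name": "Dana", "isRemote": False,
     "items": {"Lab 1", "Lab 2", "Partial Test", "Final Exam"}},
]


def generate(source, guard, columns):
    """Build one row per element of `source` that satisfies `guard`."""
    return [columns(element) for element in source if guard(element)]


# `from`: the candidate collection, here students with a Final Exam grade.
took_final_exam = [s for s in STUDENTS if "Final Exam" in s["items"]]

# `guard`: a per-element condition, here the isRemote attribute.
rows = generate(
    took_final_exam,
    guard=lambda s: s["isRemote"],
    columns=lambda s: {"ID": s["ID"], "name": s["name"]},
)
print(rows)  # [{'ID': 'S2', 'name': 'Bob'}]
```

Only Bob remains: Charlie is remote but has no Final Exam grade, while Alice and Dana took the exam but are not remote.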

Multiple columns: grid

In some cases, we might want to generate a set of columns that are calculated using the same expression, just by changing the parameter(s) of such expression. In the course example, this happens when generating a table including the detailed grades of the students for all the evaluated items of the course, such as the following:

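| ID | name | Lab_1 | Lab_2 | Partial_Test | Final_Exam | final_grade |
|----|------|-------|-------|--------------|------------|-------------|
| S1 | Alice | 60 | 90 | 80 | 85 | 81 |
| S2 | Bob | 60 | | | 100 | 59 |
| S3 | Charlie | 50 | 35 | 20 | | 16 |
| S4 | Dana | 100 | 90 | 70 | 95 | 90 |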

Defining this table with the column generator would quickly become verbose and tedious, as we would need one expression for each evaluated item of the course. That strategy would also tie the Pinset script to the specific course, as the script would include the names of the evaluation items represented as columns. Any new item added to future editions of the course, or any new course we might want to support, would require updating the Pinset script or creating a new one.

To prevent this, Pinset offers the grid generator, which allows the batch-definition of similar columns. A grid has three components:

  • keys: determine the elements to use as seeds or parameters of each column.
  • header: used to create the name or header of the column, based on the value of each individual key.
  • body: used to calculate the value of each cell of the column. Generally, both the row element and the grid key intervene here.

This generator is used in the following dataset rule, which generates the grades table depicted above:

dataset studentGrades over s : Student {
    properties[ID, name]

    grid {
        keys: EvaluationItem.all
        header: key.name
        body: s.grades.selectOne(g | g.item == key)?.points
    }

    column final_grade : s.getFinalGrade()
}

In that grid generator, the course evaluation items are used as keys, which means that the header and body expressions are evaluated once per item to generate a new column. The header of each column uses the item name, and the body is calculated by looking for the grade of the student for that evaluation item. The body uses the ?. null-safe navigation operator in case the student does not have a grade for a certain item.
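The interplay of keys, header and body can be sketched in Python as a function that produces one column per key. This is an emulation under stated assumptions, not Pinset's implementation: elements are dictionaries, and spaces in headers are replaced by underscores only because the table above shows headers such as Lab_1 for the item named "Lab 1".

```python
EVALUATION_ITEMS = ["Lab 1", "Lab 2", "Partial Test", "Final Exam"]


def grid(row_element, keys, header, body):
    """One column per key: header(key) names it, body(row, key) fills the cell."""
    return {header(key): body(row_element, key) for key in keys}


def student_grades_row(student):
    row = {"ID": student["ID"], "name": student["name"]}
    row.update(
        grid(
            student,
            keys=EVALUATION_ITEMS,
            # Underscores mirror the headers of the table above (assumption).
            header=lambda key: key.replace(" ", "_"),
            # dict.get returns None for a missing grade, like `?.` in the rule.
            body=lambda s, key: s["grades"].get(key),
        )
    )
    return row


bob = {"ID": "S2", "name": "Bob", "grades": {"Lab 1": 60, "Final Exam": 100}}
print(student_grades_row(bob))
```

Because the keys are read from the data rather than written into the rule, adding a fifth evaluation item would add a fifth column without changing the code.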

Typeless dataset rules

The from expression presented above to filter rows during the generation can also be used to define datasets where the row elements are not instances coming from an input model. This can be useful to perform data aggregations, or to generate synthetic tables starting from a custom collection of values.

The following dataset rule generates a basic table using a sequence of numbers as row elements and different column generators:

dataset numbers over n from : 1.to(5) {
    column number : n
    column squared : n * n

    grid {
        keys: 2.to(5)
        header: "times_" + key
        body: n * key
    }
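}

The resulting dataset is:

| number | squared | times_2 | times_3 | times_4 | times_5 |
|--------|---------|---------|---------|---------|---------|
| 1 | 1 | 2 | 3 | 4 | 5 |
| 2 | 4 | 4 | 6 | 8 | 10 |
| 3 | 9 | 6 | 9 | 12 | 15 |
| 4 | 16 | 8 | 12 | 16 | 20 |
| 5 | 25 | 10 | 15 | 20 | 25 |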
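For comparison, the same table can be built in plain Python from a range of integers. This is only an equivalent sketch of the numbers rule; it relies on 1.to(5) and 2.to(5) including both ends, as the generated rows and headers show.

```python
rows = []
for n in range(1, 6):  # row elements 1..5, as in 1.to(5)
    row = {"number": n, "squared": n * n}
    for key in range(2, 6):  # grid keys 2..5, as in 2.to(5)
        row[f"times_{key}"] = n * key
    rows.append(row)

for row in rows:
    print(row)
```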

Nested column generators

When an intermediate value has to be used in several column calculations, Pinset offers a nested, composite column generator. This generator is defined by a from expression that calculates a value, followed by a block containing column generators that can use that value:

dataset gradesDetails over g : Grade {
    properties[points]
    reference item[name]
    from student : g.eContainer {
        column id : student.ID
        column final_grade : student.getFinalGrade()
        column grade_lowerthan_final : g.points < final_grade
    }
}

The rule above generates a dataset with one row per grade in the course. The rule includes a from expression, which uses the containment reference to obtain the student that received the grade. That student is then used to obtain the student ID and final grade, plus an extra column that determines whether a grade contributed negatively to the final grade of the student, by checking whether it has fewer points than the final grade.

The names of the nested column generators are prefixed with the name given to the object calculated by the from expression:

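| points | item_name | student_id | student_final_grade | student_grade_lowerthan_final |
|--------|-----------|------------|---------------------|-------------------------------|
| 60 | Lab 1 | S1 | 81 | true |
| 90 | Lab 2 | S1 | 81 | false |
| 80 | Partial Test | S1 | 81 | true |
| 85 | Final Exam | S1 | 81 | false |
| 60 | Lab 1 | S2 | 59 | false |
| 100 | Final Exam | S2 | 59 | false |
| 50 | Lab 1 | S3 | 16 | false |
| 35 | Lab 2 | S3 | 16 | false |
| 20 | Partial Test | S3 | 16 | false |
| 100 | Lab 1 | S4 | 90 | false |
| 90 | Lab 2 | S4 | 90 | false |
| 70 | Partial Test | S4 | 90 | true |
| 95 | Final Exam | S4 | 90 | false |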
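The prefixing and the reuse of earlier nested columns can be emulated with a small Python helper. This is a sketch under the assumption that nested columns are evaluated in declaration order; the nested and grade_row functions are hypothetical names, and the student's final grade is taken from the overview table rather than recomputed.

```python
def nested(prefix, value, generators):
    """Evaluate (name, function) pairs in order and prefix the resulting names.

    Each function receives the intermediate value and the nested columns
    computed so far, so later columns can reuse earlier ones.
    """
    computed = {}
    for name, function in generators:
        computed[name] = function(value, computed)
    return {f"{prefix}_{name}": cell for name, cell in computed.items()}


def grade_row(grade, student):
    row = {"points": grade["points"], "item_name": grade["item"]}
    row.update(
        nested(
            "student",
            student,
            [
                ("id", lambda s, cols: s["ID"]),
                ("final_grade", lambda s, cols: s["final_grade"]),
                ("grade_lowerthan_final",
                 lambda s, cols: grade["points"] < cols["final_grade"]),
            ],
        )
    )
    return row


alice = {"ID": "S1", "final_grade": 81}  # final grade from the overview table
print(grade_row({"points": 80, "item": "Partial Test"}, alice))
```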

Column post-processing

Pinset offers some column post-processing operations that are frequently used to prepare a dataset for an analysis. These operations are invoked by annotating the column generators.

dataset studentGradesPostProcessed over s : Student {
    properties[ID]

    @fillNulls 0
    grid {
        keys: EvaluationItem.all
        header: key.name
        body: s.grades.selectOne(g | g.item == key)?.points
    }

    column final_grade : s.getFinalGrade()
    @normalize 100
    column final_grade_normalized : final_grade
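}

The resulting dataset is:

| ID | Lab_1 | Lab_2 | Partial_Test | Final_Exam | final_grade | final_grade_normalized |
|----|-------|-------|--------------|------------|-------------|------------------------|
| S1 | 60 | 90 | 80 | 85 | 81 | 0.81 |
| S2 | 60 | 0 | 0 | 100 | 59 | 0.59 |
| S3 | 50 | 35 | 20 | 0 | 16 | 0.16 |
| S4 | 100 | 90 | 70 | 95 | 90 | 0.9 |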

Fill nulls

It is possible to @fillNulls with a custom value, or with a special value that is sometimes used for this purpose, such as the mean or the mode of the column values.

By annotating the grid in the detailed grades example, we can fill with zeros those cells where a student did not take an evaluation item.

Normalisation

We can @normalize data columns to the [0,1] interval (useful when applying distance-based algorithms to numeric columns with different scales). A value can be provided to the annotation to act as the maximum for the normalisation (100 in the rule above, so a final grade of 81 becomes 0.81). If no value is given, the maximum value encountered in the column is used instead.

The dataset rule above contains a column with the normalised final grade of the course.
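Both post-processing operations are simple column transformations, which the following Python sketch emulates. It is not Pinset's implementation: the "mean" and "mode" string selectors are hypothetical stand-ins for however Pinset expresses those options, computing them over the non-null values only is an assumption, and normalisation is modelled as a division by the given or observed maximum, which matches the values in the table above.

```python
from statistics import mean, mode


def fill_nulls(values, fill):
    """Replace None with a constant, or with the mean/mode of the other values."""
    present = [v for v in values if v is not None]
    if fill == "mean":      # hypothetical selector
        fill = mean(present)
    elif fill == "mode":    # hypothetical selector
        fill = mode(present)
    return [fill if v is None else v for v in values]


def normalize(values, maximum=None):
    """Divide by `maximum`, defaulting to the largest value in the column."""
    if maximum is None:
        maximum = max(values)
    return [v / maximum for v in values]


lab_2 = [90, None, 35, 90]        # Lab_2 column before post-processing
final_grades = [81, 59, 16, 90]   # final_grade column

print(fill_nulls(lab_2, 0))            # [90, 0, 35, 90]
print(normalize(final_grades, 100))    # [0.81, 0.59, 0.16, 0.9]
```

Calling normalize(final_grades) without a maximum would divide by 90 instead, so Dana's grade would map to 1.0 rather than 0.9.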

Coming soon

An integration of Pinset with Picto to ease the creation of advanced table visualisations inside the Eclipse IDE is on the way.