Tutorial - Part #2 - Models

Study case: Iris Pipeline

We will carry out a simple exercise, using our recently initialized pipeline to develop a pipeline for statistic calculations of the famous Fisher Iris Dataset.

The plan is to obtain information for each class of the Iris species ( Setosa, Virginica, and Versicolor) calculated separately, seizing the multi-processing of 3 cores at a time.

Finally we will set-up some alerts, just to let us know if any expected results are obtained.

We will define some commands as well, to check the pipeline general status.

Downloading the Data

First of all we need to download the csv file, with the raw data to feed the pipeline. We can get it from https://github.com/toros-astro/corral/raw/master/datasets/iris.csv and copy it inside the my_pipeline directory.

If we take a glance at our files at this point, it should look like:

in_corral.py
my_pipeline/
├── __init__.py
├── iris.csv
├── settings.py
├── pipeline.py
├── models.py
├── load.py
├── steps.py
├── alerts.py
└── commands.py

Basic Configuration

First thing to do is to edit settings.py.

A thing we need to be able to do, is finding paths dynamically, so we import the os module. The import should look like

import logging
import os

The CONNECTION variable specifies the RFC-1738 format (used by SQLAlchemy) for database connection. Default should look something like this:

CONNECTION = "sqlite:///my_pipeline-dev.db"

With this instruction, a file pipeline-dev.db will be created in the same directory where in_corral.py is located, containing the SQLite database that we just defined.

See also

For more information regarding other databases, you can search the SQLAlchemy documentation at: http://docs.sqlalchemy.org/en/latest/core/engines.html

At the end of the file we will add the following lines

PATH = os.path.abspath(os.path.dirname(__file__))
IRIS_PATH = os.path.join(PATH, "iris.csv")

First line stores in the variable PATH the directory where settings.py is located. The second line just creates a path to the file iris.csv that we downloaded before.

The Models

Now our pipeline needs to know the looks of our data stored in the database.

In my_pipeline/models.py file, we delete the Example class. Then we modify the file to look just like this:

class Name(db.Model):

    __tablename__ = 'Name'

    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(50), unique=True)


class Observation(db.Model):

    __tablename__ = 'Observation'

    id = db.Column(db.Integer, primary_key=True)

    name_id = db.Column(
        db.Integer, db.ForeignKey('Name.id'), nullable=False)
    name = db.relationship("Name", backref=db.backref("observations"))

    sepal_length = db.Column(db.Float, nullable=False)
    sepal_width = db.Column(db.Float, nullable=False)
    petal_length = db.Column(db.Float, nullable=False)
    petal_width = db.Column(db.Float, nullable=False)

As we can see, the Name and Observation classes inherit from db.Model, and by doing so, we let Corral know that these are tables in our database.

The Name model will be in charge of storing every different name on our dataset. Let’s remember that the dataset has three different types of Iris flowers: setosa, versicolor and virginica, which will translate to three different instances of this model. In this same class we have only three attributes. The first one, __tablename__, will determine the name of the table that will be created on the database to make our data persistent (Name in our case). id is a column on the Name table for the primary key, with an integer type. Finally, the column name will hold the name of the species itself, with a maximum length of 50 characters, and this name cannot repeat across the column.

On the other hand, the model Observation has, besides the attributes __tablename__ and id, references to the model Name (the attributes name_id and name). This implies that each instance of this table must have a name and 4 other columns with floating point numbers to hold the other 4 columns of the dataset.

Note

The models are models of the SQLAlchemy ORM in every sense; and db.Model is a declarative_base

To learn more about SQLAlchemy ORM please refer to their documentation on http://docs.sqlalchemy.org/en/rel_1_1/orm/tutorial.html

Note

When we execute the line from corral import db, we have available inside the db namespace, the namespaces for sqlalchemy, sqlalchemy.orm and sqlalchemy_utils.

Learn more about sqlalchemy_utils on: http://sqlalchemy-utils.readthedocs.org

To create the database, we need to execute the command:

$ python in_corral.py createdb

After a confirmation question, the output should look like this:

Do you want to create the database [Yes/no]? yes
[my_pipeline-INFO @ 2016-01-08 01:44:01,027] SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
[my_pipeline-INFO @ 2016-01-08 01:44:01,028] ()
[my_pipeline-INFO @ 2016-01-08 01:44:01,029] SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
[my_pipeline-INFO @ 2016-01-08 01:44:01,029] ()
[my_pipeline-INFO @ 2016-01-08 01:44:01,031] PRAGMA table_info("Observation")
[my_pipeline-INFO @ 2016-01-08 01:44:01,031] ()
[my_pipeline-INFO @ 2016-01-08 01:44:01,060] PRAGMA table_info("Name")
[my_pipeline-INFO @ 2016-01-08 01:44:01,060] ()
[my_pipeline-INFO @ 2016-01-08 01:44:01,061]
CREATE TABLE "Name" (
    id INTEGER NOT NULL,
    name VARCHAR(50),
    PRIMARY KEY (id),
    UNIQUE (name)
)

[my_pipeline-INFO @ 2016-01-08 01:44:01,201] ()
[my_pipeline-INFO @ 2016-01-08 01:44:01,333] COMMIT
[my_pipeline-INFO @ 2016-01-08 01:44:01,334]
CREATE TABLE "Observation" (
    id INTEGER NOT NULL,
    name_id INTEGER NOT NULL,
    sepal_length FLOAT NOT NULL,
    sepal_width FLOAT NOT NULL,
    petal_length FLOAT NOT NULL,
    petal_width FLOAT NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY(name_id) REFERENCES "Name" (id)
)

[my_pipeline-INFO @ 2016-01-08 01:44:01,334] ()
[my_pipeline-INFO @ 2016-01-08 01:44:01,467] COMMIT

We can read in the output, the SQL instructions used to create the tables to make our models persistent, plus some extra tables used as support by corral, like __corral_alerted__

We can explore our recently created empty database, with the command python in_corral.py dbshell

$ python in_corral.py dbshell
Connected to: Engine(sqlite:///my_pipeline-dev.db)
Type 'exit;' or '<CTRL> + <D>' for exit the shell

SQL> select * from sqlite_master where type = 'table' and name != '__corral_alerted__';
+-------+-------------+-------------+----------+-----------------------------------------------------+
| type  |    name     |  tbl_name   | rootpage |                         sql                         |
+=======+=============+=============+==========+=====================================================+
| table | Name        | Name        | 2        | CREATE TABLE "Name" (                               |
|       |             |             |          |         id INTEGER NOT NULL,                        |
|       |             |             |          |         name VARCHAR(50),                           |
|       |             |             |          |         PRIMARY KEY (id),                           |
|       |             |             |          |         UNIQUE (name)                               |
|       |             |             |          | )                                                   |
| table | Observation | Observation | 5        | CREATE TABLE "Observation" (                        |
|       |             |             |          |         id INTEGER NOT NULL,                        |
|       |             |             |          |         name_id INTEGER NOT NULL,                   |
|       |             |             |          |         sepal_length FLOAT NOT NULL,                |
|       |             |             |          |         sepal_width FLOAT NOT NULL,                 |
|       |             |             |          |         petal_length FLOAT NOT NULL,                |
|       |             |             |          |         petal_width FLOAT NOT NULL,                 |
|       |             |             |          |         PRIMARY KEY (id),                           |
|       |             |             |          |         FOREIGN KEY(name_id) REFERENCES "Name" (id) |
|       |             |             |          | )                                                   |
+-------+-------------+-------------+----------+-----------------------------------------------------+
SQL>