CORA - Categorizing academic publications using getML

In this notebook, we compare getML against extant approaches in the relational learning literature on the CORA data set, which is often used for benchmarking. We demonstrate that getML outperforms the state of the art in the relational learning literature on this data set. Beyond the benchmarking aspect, this notebook showcases getML's excellent capabilities in dealing with categorical data.

Summary:

  • Prediction type: Classification model
  • Domain: Academia
  • Prediction target: The category of a paper
  • Population size: 2708

Author: Dr. Patrick Urbanke

Background

CORA is a well-known benchmarking dataset in the academic literature on relational learning. The dataset contains 2708 scientific publications on machine learning. The papers are divided into seven categories. The challenge is to predict the category of a paper based on the papers it cites, the papers it is cited by, and the keywords contained in the paper.

It has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).

A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows for easy data and feature exploration. You can launch the getML monitor here.

Where is this running?

Your getML live session is running inside a docker container on mybinder.org, a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import getml

getml.engine.set_project('cora')
Connected to project 'cora'

1. Loading data

1.1 Download from source

We begin by downloading the data from its source:

In [2]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="CORA",
    port=3306,
    user="guest",
    password="relational"
)

conn
Out[2]:
Connection(conn_id='default',
           dbname='CORA',
           dialect='mysql',
           host='relational.fit.cvut.cz',
           port=3306)
In [3]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
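
Should the MariaDB server be unreachable, the same three tables could also be loaded from local CSV exports. A minimal sketch, assuming hypothetical file names paper.csv, cites.csv and content.csv:

# Hedged alternative: load from local CSV exports instead of the database.
# The file names are hypothetical; adjust them to your local copies.
paper = getml.data.DataFrame.from_csv("paper.csv", name="paper")
cites = getml.data.DataFrame.from_csv("cites.csv", name="cites")
content = getml.data.DataFrame.from_csv("content.csv", name="content")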
In [4]:
paper = load_if_needed("paper")
cites = load_if_needed("cites")
content = load_if_needed("content")
In [5]:
paper
Out[5]:
Name paper_id class_label
Role unused_float unused_string
0 35  Genetic_Algorithms
1 40  Genetic_Algorithms
2 114  Reinforcement_Learning
3 117  Reinforcement_Learning
4 128  Reinforcement_Learning
...  ...
2703 1154500  Case_Based
2704 1154520  Neural_Networks
2705 1154524  Rule_Learning
2706 1154525  Rule_Learning
2707 1155073  Rule_Learning

2708 rows x 2 columns
memory usage: 0.09 MB
name: paper
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/paper/

In [6]:
cites
Out[6]:
Name cited_paper_id citing_paper_id
Role unused_float unused_float
0 35  887 
1 35  1033 
2 35  1688 
3 35  1956 
4 35  8865 
...  ... 
5424 853116  19621 
5425 853116  853155 
5426 853118  1140289 
5427 853155  853118 
5428 954315  1155073 

5429 rows x 2 columns
memory usage: 0.09 MB
name: cites
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/cites/

In [7]:
content
Out[7]:
Name paper_id word_cited_id
Role unused_float unused_string
0 35  word100
1 35  word1152
2 35  word1175
3 35  word1228
4 35  word1248
...  ...
49211 1155073  word75
49212 1155073  word759
49213 1155073  word789
49214 1155073  word815
49215 1155073  word979

49216 rows x 2 columns
memory usage: 1.20 MB
name: content
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/content/

1.2 Prepare data for getML

getML requires that we define roles for each of the columns.

In [8]:
paper.set_role("paper_id", getml.data.roles.join_key)
paper.set_role("class_label", getml.data.roles.categorical)
paper
Out[8]:
Name paper_id class_label
Role join_key categorical
0 35  Genetic_Algorithms
1 40  Genetic_Algorithms
2 114  Reinforcement_Learning
3 117  Reinforcement_Learning
4 128  Reinforcement_Learning
...  ...
2703 1154500  Case_Based
2704 1154520  Neural_Networks
2705 1154524  Rule_Learning
2706 1154525  Rule_Learning
2707 1155073  Rule_Learning

2708 rows x 2 columns
memory usage: 0.02 MB
name: paper
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/paper/

In [9]:
cites.set_role(["cited_paper_id", "citing_paper_id"], getml.data.roles.join_key)
cites
Out[9]:
Name cited_paper_id citing_paper_id
Role join_key join_key
0 35  887 
1 35  1033 
2 35  1688 
3 35  1956 
4 35  8865 
...  ... 
5424 853116  19621 
5425 853116  853155 
5426 853118  1140289 
5427 853155  853118 
5428 954315  1155073 

5429 rows x 2 columns
memory usage: 0.04 MB
name: cites
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/cites/

Finally, we assign roles to the columns of the content table:

In [10]:
content.set_role("paper_id", getml.data.roles.join_key)
content.set_role("word_cited_id", getml.data.roles.categorical)
content
Out[10]:
Name paper_id word_cited_id
Role join_key categorical
0 35  word100
1 35  word1152
2 35  word1175
3 35  word1228
4 35  word1248
...  ...
49211 1155073  word75
49212 1155073  word759
49213 1155073  word789
49214 1155073  word815
49215 1155073  word979

49216 rows x 2 columns
memory usage: 0.39 MB
name: content
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/content/

Next, we separate the data set into a training and a testing set:

In [11]:
random = paper.random()

is_training = (random < 0.7)
is_test = (~is_training)

data_train = paper.where("data_train", is_training)
data_test = paper.where("data_test", is_test)
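
As a quick sanity check, we can verify that the split is roughly 70/30 (a minimal sketch, assuming nrows() returns the number of rows, as elsewhere in the getML API):

# Hedged sanity check: the random split should be roughly 70/30.
n_train = data_train.nrows()
n_test = data_test.nrows()
print(n_train, n_test, n_train / (n_train + n_test))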
In [12]:
data_train
Out[12]:
Name paper_id class_label
Role join_key categorical
0 35  Genetic_Algorithms
1 114  Reinforcement_Learning
2 130  Reinforcement_Learning
3 288  Reinforcement_Learning
4 424  Rule_Learning
...  ...
1882 1154276  Neural_Networks
1883 1154459  Genetic_Algorithms
1884 1154500  Case_Based
1885 1154524  Rule_Learning
1886 1155073  Rule_Learning

1887 rows x 2 columns
memory usage: 0.02 MB
name: data_train
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/data_train/

The goal is to predict which of seven class labels applies to a paper. We generate a separate binary target column for each of those labels:

In [13]:
class_label = np.unique(paper.class_label.to_numpy())
class_label
Out[13]:
array(['Case_Based', 'Genetic_Algorithms', 'Neural_Networks',
       'Probabilistic_Methods', 'Reinforcement_Learning', 'Rule_Learning',
       'Theory'], dtype='<U22')
In [14]:
def make_target_columns(data_frame):
    for label in class_label:
        name = "class_label=" + label
        data_frame[name] = (data_frame.class_label == label)
        data_frame.set_role(name, getml.data.roles.target)
    data_frame.set_role("class_label", getml.data.roles.unused_string)
In [15]:
make_target_columns(data_train)
make_target_columns(data_test)
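
The seven target columns form a one-hot encoding of class_label, so every row should contain exactly one positive target. A minimal sketch to verify this, reusing the to_numpy() method we call elsewhere in this notebook:

# Hedged check: every paper belongs to exactly one of the seven classes.
targets = np.column_stack([
    data_train["class_label=" + label].to_numpy()
    for label in class_label
])
assert (targets.sum(axis=1) == 1).all()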
In [16]:
data_train
Out[16]:
Name paper_id class_label=Case_Based class_label=Genetic_Algorithms class_label=Neural_Networks class_label=Probabilistic_Methods class_label=Reinforcement_Learning class_label=Rule_Learning class_label=Theory class_label
Role join_key target target target target target target target unused_string
0 35  0  1  0  0  0  0  0  Genetic_Algorithms
1 114  0  0  0  0  1  0  0  Reinforcement_Learning
2 130  0  0  0  0  1  0  0  Reinforcement_Learning
3 288  0  0  0  0  1  0  0  Reinforcement_Learning
4 424  0  0  0  0  0  1  0  Rule_Learning
...  ...  ...  ...  ...  ...  ...  ...  ...
1882 1154276  0  0  1  0  0  0  0  Neural_Networks
1883 1154459  0  1  0  0  0  0  0  Genetic_Algorithms
1884 1154500  1  0  0  0  0  0  0  Case_Based
1885 1154524  0  0  0  0  0  1  0  Rule_Learning
1886 1155073  0  0  0  0  0  1  0  Rule_Learning

1887 rows x 9 columns
memory usage: 0.16 MB
name: data_train
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/data_train/

In [17]:
data_test
Out[17]:
Name paper_id class_label=Case_Based class_label=Genetic_Algorithms class_label=Neural_Networks class_label=Probabilistic_Methods class_label=Reinforcement_Learning class_label=Rule_Learning class_label=Theory class_label
Role join_key target target target target target target target unused_string
0 40  0  1  0  0  0  0  0  Genetic_Algorithms
1 117  0  0  0  0  1  0  0  Reinforcement_Learning
2 128  0  0  0  0  1  0  0  Reinforcement_Learning
3 164  0  0  0  0  0  0  1  Theory
4 463  1  0  0  0  0  0  0  Case_Based
...  ...  ...  ...  ...  ...  ...  ...  ...
816 1154173  0  0  0  1  0  0  0  Probabilistic_Methods
817 1154229  0  0  0  1  0  0  0  Probabilistic_Methods
818 1154233  0  0  0  1  0  0  0  Probabilistic_Methods
819 1154520  0  0  1  0  0  0  0  Neural_Networks
820 1154525  0  0  0  0  0  1  0  Rule_Learning

821 rows x 9 columns
memory usage: 0.07 MB
name: data_test
type: getml.data.DataFrame
url: http://localhost:1709/#/getdataframe/cora/data_test/

2. Predictive modeling

We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.

2.1 Define relational model

To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.

That is because the class label can be predicted using three different pieces of information:

  • The keywords used by the paper
  • The keywords used by the papers it cites and by the papers that cite it
  • The class labels of the papers it cites and of the papers that cite it
In [18]:
population_ph = getml.data.Placeholder('population')
cites_ph = getml.data.Placeholder('cites')
cites2_ph = getml.data.Placeholder('cites')
content_ph = getml.data.Placeholder('content')
paper_ph = getml.data.Placeholder('paper')

population_ph.join(
    cites_ph,
    join_key='paper_id',
    other_join_key='cited_paper_id'
)

cites_ph.join(
    content_ph,
    join_key='citing_paper_id',
    other_join_key='paper_id',
)

cites_ph.join(
    paper_ph,
    join_key='citing_paper_id',
    other_join_key='paper_id',
    relationship=getml.data.relationship.many_to_one
)

population_ph.join(
    cites2_ph,
    join_key='paper_id',
    other_join_key='citing_paper_id'
)

cites2_ph.join(
    content_ph,
    join_key='cited_paper_id',
    other_join_key='paper_id'
)

cites2_ph.join(
    paper_ph,
    join_key='cited_paper_id',
    other_join_key='paper_id',
    relationship=getml.data.relationship.many_to_one
)

population_ph.join(
    content_ph,
    join_key='paper_id'
)

population_ph
Out[18]:
[Data model diagram: population is joined to content (paper_id = paper_id) and to two copies of cites, once on cited_paper_id and once on citing_paper_id; each copy of cites is in turn joined to content and, many-to-one, to paper.]

2.2 getML pipeline

Set up the feature learners & predictor

We set up two pipelines: one based on the FastProp feature learner (combined with a Mapping preprocessor) and one based on Relboost. Because of the large number of keywords, we regularize the Relboost model a bit by requiring a minimum support for the keywords (min_num_samples).

In [19]:
mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastPropModel(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1
)

relboost = getml.feature_learning.RelboostModel(
    num_features=10,
    num_subfeatures=10,
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    seed=4367,
    num_threads=1,
    min_num_samples=30
)

predictor = getml.predictors.XGBoostClassifier()
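
The predictor can be tuned further if needed. A hedged sketch, assuming the constructor accepts the usual XGBoost hyperparameters (everything beyond the default call above is our assumption):

# Hypothetical alternative: a more strongly regularized predictor.
# Parameter names are assumed to mirror the XGBoost library.
predictor_tuned = getml.predictors.XGBoostClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    reg_lambda=1.0,
)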

Build the pipeline

In [20]:
pipe1 = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    population=population_ph,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    predictors=[predictor]
)

pipe1
Out[20]:
Pipeline(preprocessors=['Mapping'],
         feature_learners=['FastPropModel'],
         feature_selectors=[],
         include_categorical=False,
         peripheral=['cites', 'content', 'paper'],
         population='population',
         predictors=['XGBoostClassifier'],
         tags=['fast_prop'],
         share_selected_features=0.5)
In [21]:
pipe2 = getml.pipeline.Pipeline(
    tags=['relboost'],
    population=population_ph,
    feature_learners=[relboost],
    predictors=[predictor]
)

pipe2
Out[21]:
Pipeline(preprocessors=[],
         feature_learners=['RelboostModel'],
         feature_selectors=[],
         include_categorical=False,
         peripheral=['cites', 'content', 'paper'],
         population='population',
         predictors=['XGBoostClassifier'],
         tags=['relboost'],
         share_selected_features=0.5)

2.3 Model training

In [22]:
peripheral = {
    "cites": cites, 
    "content": content, 
    "paper": paper
}
In [23]:
pipe1.check(data_train, peripheral)
INFO [MIGHT TAKE LONG]: The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
Checking data model...

The training output may seem a bit intimidating. That is because a separate model needs to be trained for each class label; this is due to the nature of the generated features.

In [24]:
pipe1.fit(data_train, peripheral)
Checking data model...
INFO [MIGHT TAKE LONG]: The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
Preprocessing...
[========================================] 100%

FastProp: Trying 3780 features...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building features...
[========================================] 100%

XGBoost: Training as predictor...
[========================================] 100%

... (the same step repeats for each of the seven targets)

Trained pipeline.
Time taken: 0h:0m:22.222449

Out[24]:
Pipeline(preprocessors=['Mapping'],
         feature_learners=['FastPropModel'],
         feature_selectors=[],
         include_categorical=False,
         peripheral=['cites', 'content', 'paper'],
         population='population',
         predictors=['XGBoostClassifier'],
         tags=['fast_prop'],
         share_selected_features=0.5)

url: http://localhost:1709/#/getpipeline/cora/Io0nkR/0/
In [25]:
pipe2.check(data_train, peripheral)
INFO [MIGHT TAKE LONG]: The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
Checking data model...
In [26]:
pipe2.fit(data_train, peripheral)
INFO [MIGHT TAKE LONG]: The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
Checking data model...

Relboost: Training subfeatures...
[========================================] 100%

Relboost: Building subfeatures...
[========================================] 100%

Relboost: Training features...
[========================================] 100%

... (the subfeature and feature steps repeat for each of the seven targets)

XGBoost: Training as predictor...
[========================================] 100%

... (repeated for each of the seven targets)

Trained pipeline.
Time taken: 0h:2m:51.400568

Out[26]:
Pipeline(preprocessors=[],
         feature_learners=['RelboostModel'],
         feature_selectors=[],
         include_categorical=False,
         peripheral=['cites', 'content', 'paper'],
         population='population',
         predictors=['XGBoostClassifier'],
         tags=['relboost'],
         share_selected_features=0.5)

url: http://localhost:1709/#/getpipeline/cora/1MfuIf/0/

2.4 Model evaluation

In [27]:
pipe1.score(data_test, peripheral)
Preprocessing...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building features...
[========================================] 100%

Out[27]:
date time set used target accuracy auc cross entropy
0 2021-05-18 00:02:06 data_train class_label=Case_Based 0.9978802331743508 0.9999 0.02323
1 2021-05-18 00:02:06 data_train class_label=Genetic_Algorithms 1 1. 0.004915
2 2021-05-18 00:02:06 data_train class_label=Neural_Networks 0.9846316905140434 0.9983 0.065852
3 2021-05-18 00:02:06 data_train class_label=Probabilistic_Methods 0.9957604663487016 0.9998 0.02765
4 2021-05-18 00:02:06 data_train class_label=Reinforcement_Learning 0.9994700582935877 1. 0.009078
5 2021-05-18 00:02:06 data_train class_label=Rule_Learning 1 1. 0.007662
6 2021-05-18 00:02:06 data_train class_label=Theory 0.9925808161102279 0.9992 0.047382
7 2021-05-18 00:04:59 data_test class_label=Case_Based 0.9707673568818515 0.9861 0.086891
8 2021-05-18 00:04:59 data_test class_label=Genetic_Algorithms 0.9853836784409257 0.998 0.048981
9 2021-05-18 00:04:59 data_test class_label=Neural_Networks 0.951278928136419 0.9787 0.163552
10 2021-05-18 00:04:59 data_test class_label=Probabilistic_Methods 0.9732034104750305 0.9872 0.083174
11 2021-05-18 00:04:59 data_test class_label=Reinforcement_Learning 0.9805115712545676 0.9736 0.074599
12 2021-05-18 00:04:59 data_test class_label=Rule_Learning 0.9841656516443362 0.9937 0.052146
13 2021-05-18 00:04:59 data_test class_label=Theory 0.9573690621193667 0.977 0.128597
In [28]:
pipe2.score(data_test, peripheral)
Relboost: Building subfeatures...
[========================================] 100%

Relboost: Building features...
[========================================] 100%

... (repeated for each of the seven targets)

Out[28]:
date time set used target accuracy auc cross entropy
0 2021-05-18 00:04:57 data_train class_label=Case_Based 1 1. 0.008368
1 2021-05-18 00:04:57 data_train class_label=Genetic_Algorithms 1 1. 0.004185
2 2021-05-18 00:04:57 data_train class_label=Neural_Networks 0.9925808161102279 0.9995 0.03748
3 2021-05-18 00:04:57 data_train class_label=Probabilistic_Methods 0.9978802331743508 1. 0.014195
4 2021-05-18 00:04:57 data_train class_label=Reinforcement_Learning 1 1. 0.004341
5 2021-05-18 00:04:57 data_train class_label=Rule_Learning 1 1. 0.00305
6 2021-05-18 00:04:57 data_train class_label=Theory 0.9973502914679385 0.9998 0.023495
7 2021-05-18 00:05:03 data_test class_label=Case_Based 0.9756394640682094 0.9801 0.103835
8 2021-05-18 00:05:03 data_test class_label=Genetic_Algorithms 0.9914738124238733 0.9992 0.033939
9 2021-05-18 00:05:03 data_test class_label=Neural_Networks 0.9390986601705238 0.9802 0.182987
10 2021-05-18 00:05:03 data_test class_label=Probabilistic_Methods 0.9732034104750305 0.9874 0.090399
11 2021-05-18 00:05:03 data_test class_label=Reinforcement_Learning 0.9817295980511571 0.9779 0.077956
12 2021-05-18 00:05:03 data_test class_label=Rule_Learning 0.9817295980511571 0.9918 0.06703
13 2021-05-18 00:05:03 data_test class_label=Theory 0.9500609013398295 0.966 0.164147

To make the comparison easier, we look at the test results only.

In [29]:
pipe1.scores.filter(lambda score: score.set_used == "data_test")
Out[29]:
date time set used target accuracy auc cross entropy
0 2021-05-18 00:04:59 data_test class_label=Case_Based 0.9708 0.9861 0.08689
1 2021-05-18 00:04:59 data_test class_label=Genetic_Algorithms 0.9854 0.998 0.04898
2 2021-05-18 00:04:59 data_test class_label=Neural_Networks 0.9513 0.9787 0.16355
3 2021-05-18 00:04:59 data_test class_label=Probabilistic_Methods 0.9732 0.9872 0.08317
4 2021-05-18 00:04:59 data_test class_label=Reinforcement_Learning 0.9805 0.9736 0.0746
5 2021-05-18 00:04:59 data_test class_label=Rule_Learning 0.9842 0.9937 0.05215
6 2021-05-18 00:04:59 data_test class_label=Theory 0.9574 0.977 0.1286
In [30]:
pipe2.scores.filter(lambda score: score.set_used == "data_test")
Out[30]:
date time set used target accuracy auc cross entropy
0 2021-05-18 00:05:03 data_test class_label=Case_Based 0.9756 0.9801 0.10383
1 2021-05-18 00:05:03 data_test class_label=Genetic_Algorithms 0.9915 0.9992 0.03394
2 2021-05-18 00:05:03 data_test class_label=Neural_Networks 0.9391 0.9802 0.18299
3 2021-05-18 00:05:03 data_test class_label=Probabilistic_Methods 0.9732 0.9874 0.0904
4 2021-05-18 00:05:03 data_test class_label=Reinforcement_Learning 0.9817 0.9779 0.07796
5 2021-05-18 00:05:03 data_test class_label=Rule_Learning 0.9817 0.9918 0.06703
6 2021-05-18 00:05:03 data_test class_label=Theory 0.9501 0.966 0.16415

We take the average of the AUC values, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora).

In [31]:
print(np.mean(pipe1.auc))
print(np.mean(pipe2.auc))
0.9849037843150564
0.9832101980839714

The accuracy for multiple targets can be calculated using one of two methods. The first method is simply to average the accuracy values of the individual targets; this is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora).

In [32]:
print(np.mean(pipe1.accuracy))
print(np.mean(pipe2.accuracy))
0.9718113798503566
0.9704193492256828

However, the benchmarking papers actually use a different approach:

  • They first generate probabilities for each of the labels:
In [33]:
probabilities1 = pipe1.predict(data_test, peripheral)
probabilities2 = pipe2.predict(data_test, peripheral)
Preprocessing...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building subfeatures...
[========================================] 100%

FastProp: Building features...
[========================================] 100%


Relboost: Building subfeatures...
[========================================] 100%

Relboost: Building features...
[========================================] 100%

... (repeated for each of the seven targets)

  • They then find the class label with the highest probability:
In [34]:
ix_max = np.argmax(probabilities1, axis=1)
predicted_labels1 = np.asarray([class_label[ix] for ix in ix_max])

ix_max = np.argmax(probabilities2, axis=1)
predicted_labels2 = np.asarray([class_label[ix] for ix in ix_max])
  • They then compare that value to the actual class label:
In [35]:
actual_labels = data_test.class_label.to_numpy()

print("Share of accurately predicted class labels (pipe1):")
print((actual_labels == predicted_labels1).sum() / len(actual_labels))
print()
print("Share of accurately predicted class labels (pipe2):")
print((actual_labels == predicted_labels2).sum() / len(actual_labels))
print()
Share of accurately predicted class labels (pipe1):
0.8989037758830695

Share of accurately predicted class labels (pipe2):
0.8989037758830695

Since this is the method the benchmark papers use, this is the accuracy score we will report as well.
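
If scikit-learn happens to be installed, the same figure can be cross-checked and broken down into a confusion matrix (a hedged sketch; scikit-learn is not otherwise used in this notebook):

# Hedged cross-check using scikit-learn (assumed to be available).
from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(actual_labels, predicted_labels1))
print(confusion_matrix(actual_labels, predicted_labels1, labels=class_label))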

2.5 Studying features

Feature correlations

We want to analyze how the features are correlated with the target variables.

In [36]:
TARGET_NUM = 0
In [37]:
names, correlations = pipe2.features.correlations(target_num=TARGET_NUM)

plt.subplots(figsize=(20, 10))

plt.bar(names, correlations, color='#6829c2')

plt.title('Feature correlations with class label ' + class_label[TARGET_NUM])
plt.xlabel('Features')
plt.ylabel('Correlations')
plt.xticks(rotation='vertical')
plt.show()
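
To complement the plot, we can also print the features with the strongest correlations (a minimal sketch using the names and correlations arrays returned above):

# Print the five features most strongly correlated with the target.
top5 = sorted(zip(names, correlations), key=lambda nc: abs(nc[1]), reverse=True)[:5]
for name, correlation in top5:
    print(f"{name}: {correlation:.3f}")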

Feature importances

Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.

In [38]:
names, importances = pipe2.features.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Feature importances for class label ' + class_label[TARGET_NUM])
plt.xlabel('Features')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
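
Since the importances are normalized, they should sum to approximately one; a quick hedged check:

# Hedged check: normalized importances should sum to roughly 1.0 (i.e. 100%).
print(np.sum(importances))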

Column importances

Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.

In [39]:
names, importances = pipe2.columns.importances(target_num=TARGET_NUM)

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Column importances for class label ' + class_label[TARGET_NUM])
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()

Transpiling the learned features

We can also transpile the learned features to SQLite3 code. This is useful for productionization. Please also refer to the sqlite3 module in the getML Python API.

In [40]:
pipe1.features.to_sql().save("cora", split=True)
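
With split=True, each feature is written to its own file inside the target folder. A hedged sketch to inspect the result, assuming the call above created a local folder named cora:

# Hedged: list the generated SQL files (os was imported at the top of the notebook).
for fname in sorted(os.listdir("cora")):
    print(fname)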

2.6 Benchmarks

State-of-the-art approaches on this data set perform as follows:

Approach  Study                               Accuracy  AUC
RelF      Dinh et al. (2012)                  85.7%     --
LBP       Dinh et al. (2012)                  85.0%     --
EPRN      Preisach and Schmidt-Thieme (2006)  84.0%     --
PRN       Preisach and Schmidt-Thieme (2006)  81.0%     --
ACORA     Perlich and Provost (2006)          --        97.0%

As we can see, the performance of both getML pipelines, as used in this notebook, compares favorably to these benchmarks.

Approach  Accuracy  AUC
FastProp  89.9%     98.5%
Relboost  89.9%     98.3%

3. Conclusion

In this notebook, we have demonstrated that getML outperforms state-of-the-art relational learning algorithms on the CORA data set.

References

Dinh, Quang-Thang, Christel Vrain, and Matthieu Exbrayat. "A Link-Based Method for Propositionalization." ILP (Late Breaking Papers). 2012.

Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).

Perlich, Claudia, and Foster Provost. "Distribution-based aggregation for relational learning with identifier attributes." Machine Learning 62.1-2 (2006): 65-105.

Preisach, Christine, and Lars Schmidt-Thieme. "Relational ensemble classification." Sixth International Conference on Data Mining (ICDM'06). IEEE, 2006.

Next Steps

This tutorial benchmarked getML against academic state-of-the-art algorithms from the relational learning literature and demonstrated getML's capabilities in dealing with categorical data.

If you are interested in further real-world applications of getML, head back to the notebook overview and choose one of the remaining examples.

You can find additional material in our documentation if you want to learn more about getML.

Get in contact

If you have any questions, schedule a call with Alex, the co-founder of getML, or write us an email. Would you prefer a private demo of getML? Just contact us to make an appointment.