Title: A tale of 80,251 cities: Feature Engineering for Semantic Graphs
I define Feature Engineering as "the process of representing a
problem domain so as to make it amenable to learning techniques. This
process involves the initial discovery of features and their step-wise
improvement based on domain knowledge and the observed performance of
a given ML algorithm over specific training data." While Deep Learning
may have reduced the need for it, feature engineering is still widely
used in industry and remains your best bet for getting better
performance from small datasets. It is a slow, time-consuming process
that is as much an art as it is a technology.
Since 2016, I have been writing a textbook on the subject, which is
poised to be published in 2019.
In this talk, we will cover the proposed Feature Engineering cycle and
dive into one of the case studies presented in the book: predicting
the population of a city, given the information graph derived from the
infobox on its Wikipedia page (as distributed by the DBpedia
project).
Some of the topics covered include:
* Representing variable-length data as fixed-length feature vectors
* Exploratory Data Analysis
* Error analysis using feature ablation and wrapper methods
* Mutual information feature utility metrics
* Discretization of floating-point variables
* Target rate encoding of categorical features
* Feature stability
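As a preview of one topic from the list above, here is a minimal
sketch (not taken from the talk materials; the data is invented for
illustration) of scoring features by their mutual information with a
numeric target, using the scikit-learn installation the talk assumes:

```python
# Hypothetical example: rank two candidate features by mutual information
# with a numeric target, as a simple feature-utility metric.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
n = 1000
informative = rng.normal(size=n)  # feature the target actually depends on
noise = rng.normal(size=n)        # pure noise feature
X = np.column_stack([informative, noise])
y = 3.0 * informative + rng.normal(scale=0.1, size=n)

scores = mutual_info_regression(X, y, random_state=0)
# The informative feature should score far above the noise feature.
print(scores)
```

In the case study, the same idea applies to features derived from the
DBpedia graph: features whose mutual information with (log) population
is near zero are candidates for removal.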
If you want to follow along, bring a charged laptop with a working
Jupyter notebook installation including scikit-learn, matplotlib and
graphviz.
About the speaker:
Pablo Duboue (http://duboue.net/personal.html) has been living in Vancouver since 2017, where he runs a
small AI consulting company (Textualization Software Ltd). Prior to
that, he was in Montreal for eight years and in New York for eleven
years. He has a PhD in Computer Science from Columbia University in
New York and was part of the IBM Watson team that beat the Jeopardy!
champions in 2011.