Feature Engineering for Machine Learning by Alice Zheng

Feature Engineering for Machine Learning by Alice Zheng pdf

Feature engineering is a crucial step in the machine-learning pipeline, yet this topic is rarely examined on its own. With this practical book, you’ll learn techniques for extracting and transforming features—the numeric representations of raw data—into formats for machine-learning models. Each chapter guides you through a single data problem, such as how to represent text or image data. Together, these examples illustrate the main principles of feature engineering.

Rather than simply teach these principles, authors Alice Zheng and Amanda Casari focus on practical application with exercises throughout the book. The closing chapter brings everything together by tackling a real-world, structured dataset with several feature-engineering techniques. Python packages including numpy, Pandas, Scikit-learn, and Matplotlib are used in code examples.

You’ll examine:

  • Feature engineering for numeric data: filtering, binning, scaling, log transforms, and power transforms
  • Natural text techniques: bag-of-words, n-grams, and phrase detection
  • Frequency-based filtering and feature scaling for eliminating uninformative features
  • Encoding techniques of categorical variables, including feature hashing and bin-counting
  • Model-based feature engineering with principal component analysis
  • The concept of model stacking, using k-means as a featurization technique
  • Image feature extraction with manual and deep-learning techniques

From the Preface

Machine learning fits mathematical models to data in order to derive insights or make predictions. These models take features as input. A feature is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline. Feature engineering is the act of extracting features from raw data, and transforming them into formats that is suitable for the machine learning model. It is a crucial step in the machine learning pipeline, because the right features can ease the difficulty of modeling, and therefore enable the pipeline to output results of higher quality.

Practitioners agree that the vast majority of time in building a machine learning pipeline is spent on feature engineering and data cleaning. Yet, despite its importance, the topic is rarely discussed on its own. Perhaps it’s because the right features can only be defined in the context of both the model and the data. Since data and models are so diverse, it’s difficult to generalize the practice of feature engineering across projects.

Nevertheless, feature engineering is not just an ad hoc practice. There are deeper principles at work, and they are best illustrated in situ. Each chapter of this book addresses one data problem: how to represent text data or image data, how to reduce dimensionality of auto-generated features, when and how to normalize, etc. Think of this as a collection of inter-connected short stories, as opposed to a single long novel. Each chapter provides a vignette into in the vast array of existing feature engineering techniques. Together, they illustrate the overarching principles.

Mastering a subject is not just about knowing the definitions and being able to derive the formulas. It is not enough to know how the mechanism works and what it can do. It must also involve understanding why it is designed that way, how it relates to other techniques, and what are the pros and cons of each approach.

Mastery is about knowing precisely how something is done, having an intuition for the underlying principles, and integrating it into the knowledge web of what we already know. One does not become a master of something by simply reading a book, though a good book can open new doors. It has to involve practice—putting the ideas to use, which is an iterative process. With every iteration, we know the ideas better and become increasingly more adept and creative at applying them. The goal of this book is to facilitate the application of its ideas.

This book tries to teach the intuition first, and the mathematics second. Instead of only discussing how something is done, we try to teach the why. Our goal is to provide the intuition behind the ideas, so that the reader may understand how and when to apply them. There are tons of descriptions and pictures for folks who learn in different ways. Mathematical formulas are presented in order to make the intuitions precise, and also to bridge this book with other existing offerings of knowledge.

About the Author

Alice is a technical leader in the field of Machine Learning. Her experience spans algorithm and platform development and applications. Currently, she is a Senior Manager in Amazon’s Ad Platform. Previous roles include Director of Data Science at GraphLab/Dato/Turi, machine learning researcher at Microsoft Research, Redmond, and postdoctoral fellow at Carnegie Mellon University. She received a Ph.D. in Electrical Engineering and Computer science, and B.A. degrees in Computer Science in Mathematics, all from U.C. Berkeley.