UMAP for Dimensionality Reduction: Visualization of Features to Understand Your Data Better

Jul 8, 2022

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction method. This post shows how to use the Python package umap-learn in practice and how to visualize the results.

First, install UMAP with pip:

pip install "umap-learn[plot]"

The quotes keep shells such as zsh from interpreting the square brackets. If you do not need the built-in plotting utilities, pip install umap-learn is enough.

Import the necessary libraries:

import pandas as pd
import umap
import umap.plot
from sklearn.datasets import fetch_california_housing
from sklearn import preprocessing
import plotly.express as px
from pickle import dump

We will use the California Housing dataset from sklearn.

UMAP itself does not need labels, but discrete class labels make the plotted embedding much easier to read. The target in this dataset is a continuous house value, so we use pandas.qcut to divide it into 10 quantile bins.

The most expensive group is labeled 9 and the cheapest is labeled 0. The features are then standardized with StandardScaler from sklearn, which subtracts each feature's mean and divides by its standard deviation.
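To make the binning concrete, here is a minimal sketch of pandas.qcut on a handful of made-up values. The `prices` series is hypothetical and not taken from the California Housing data. With labels=False, qcut returns the integer index of each value's quantile bin, so the smallest value gets 0 and the largest gets 9.

```python
import pandas as pd

# Hypothetical toy prices in arbitrary order (not from the real dataset).
prices = pd.Series([2.5, 0.5, 5.0, 1.0, 4.0, 1.5, 3.5, 2.0, 4.5, 3.0])

# Split into 10 equal-frequency bins; labels=False returns integer bin
# indices (0..9) instead of interval labels.
deciles = pd.qcut(prices, 10, labels=False)

print(deciles.tolist())
```

Because the bins are based on rank rather than on fixed price ranges, each bin holds roughly the same number of samples, which keeps the classes balanced in the plot.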

housing = fetch_california_housing()
target = pd.DataFrame(housing.target, columns=['target'])
target['target'] = pd.qcut(target['target'], 10, labels=False)

# normalize the data
scaler = preprocessing.StandardScaler().fit(housing.data)
X_scaled =…
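As a side illustration of what standardization does, the sketch below applies it by hand with NumPy to a small made-up matrix. The array `X` and the name `X_std` are assumptions for this example only; it is not a reconstruction of the truncated line above. It mirrors the default behavior of StandardScaler: per-feature mean removal followed by division by the population standard deviation.

```python
import numpy as np

# Hypothetical toy matrix: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Standardize each column: subtract its mean, divide by its standard
# deviation (NumPy's default ddof=0, i.e. the population std).
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std

print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature
```

After this step both columns live on the same scale, so no single feature dominates the distance computations that UMAP relies on.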


Written by Zex
