Streamlit
Create interactive data apps with pure Python in minutes
Alternative To
- Gradio
- Dash
- Flask
- Shiny
Difficulty Level
Suitable for users with basic technical knowledge. Easy to set up and use.
Overview
Streamlit is an open-source Python framework that allows data scientists and engineers to create interactive web applications directly from Python scripts. With its simple, declarative syntax, Streamlit enables developers to transform data analysis scripts into fully functional web applications without requiring any front-end experience.
The framework follows a “script runs from top to bottom” philosophy, making it intuitive for Python users and allowing rapid iteration. As you modify your code, Streamlit automatically updates the web application, creating a seamless development experience. This approach has made Streamlit especially popular among data professionals who want to share insights, build dashboards, and create interactive tools without getting bogged down in web development details.
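To see what this looks like in practice, here is a minimal sketch of a complete app (save it as, say, app.py; the filename is arbitrary):

import streamlit as st

st.title("Hello, Streamlit!")
name = st.text_input("What's your name?", value="world")
st.write(f"Hello, {name}!")

Each time the user interacts with a widget, Streamlit reruns the script from top to bottom and redraws the page with the new values.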
System Requirements
- Python: 3.8 or higher
- CPU: 2+ cores (4+ recommended for data-intensive applications)
- RAM: 4GB+ (8GB+ recommended)
- GPU: Not required (useful for ML models)
- Storage: 1GB+ for base installation
- Operating System: Windows, macOS, or Linux
Installation Guide
Prerequisites
- Python 3.8 or higher
- Pip package manager
Basic Installation
Install Streamlit using pip:
pip install streamlit
To verify the installation and see a demo app:
streamlit hello
This will open a browser window with Streamlit’s demo application.
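To run a script of your own, point the CLI at the file:

streamlit run your_app.py

By default the app is served locally at http://localhost:8501, and Streamlit watches the source file so it can rerun the app when you save changes.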
Installation in a Virtual Environment
For a more isolated environment:
# Create a virtual environment
python -m venv streamlit-env
# Activate on Windows
streamlit-env\Scripts\activate
# Activate on macOS/Linux
source streamlit-env/bin/activate
# Install Streamlit
pip install streamlit
Installation with Conda
If you’re using Anaconda or Miniconda:
# Create a new conda environment
conda create -n streamlit-env python=3.10
# Activate the environment
conda activate streamlit-env
# Install Streamlit
pip install streamlit
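For reproducible environments, you can list dependencies in a requirements.txt and install them in one step. The set below covers the examples in this guide (the version pin is illustrative):

streamlit>=1.30
pandas
numpy
matplotlib
seaborn
scikit-learn
plotly

pip install -r requirements.txt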
Practical Exercise: Building a Data Explorer
Let’s create a simple data explorer application that allows users to upload CSV files and perform basic exploratory data analysis:
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set page configuration
st.set_page_config(
    page_title="Data Explorer",
    page_icon="📊",
    layout="wide"
)

# Add a title and description
st.title("📊 Data Explorer")
st.markdown("Upload your CSV file to explore and visualize your data.")

# File uploader
uploaded_file = st.file_uploader("Choose a CSV file", type=["csv"])

# If a file is uploaded
if uploaded_file is not None:
    try:
        # Load the data
        df = pd.read_csv(uploaded_file)

        # Show success message
        st.success(f"Successfully loaded data with {df.shape[0]} rows and {df.shape[1]} columns.")

        # Basic data information
        st.header("Data Overview")

        # Display tabs for different data views
        tab1, tab2, tab3 = st.tabs(["Data Preview", "Data Statistics", "Data Types"])

        with tab1:
            # Display data preview
            st.subheader("Data Preview")
            st.dataframe(df.head(10))

        with tab2:
            # Display descriptive statistics
            st.subheader("Descriptive Statistics")
            st.dataframe(df.describe())

        with tab3:
            # Display data types and null counts per column
            st.subheader("Data Types")
            dtypes_df = pd.DataFrame({
                'Column': df.columns,
                'Data Type': df.dtypes.astype(str),
                'Non-Null Count': df.count().values,
                'Null Count': df.isna().sum().values,
                'Null Percentage': (df.isna().sum() / len(df) * 100).round(2).astype(str) + '%'
            })
            st.dataframe(dtypes_df)

        # Visualization section
        st.header("Data Visualization")

        # Sidebar for visualization options
        st.sidebar.header("Visualization Options")

        # Select columns for visualization
        numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
        categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

        if len(numeric_cols) > 0:
            st.sidebar.subheader("Numeric Column Analysis")
            selected_num_col = st.sidebar.selectbox("Select a numeric column:", numeric_cols)

            # Distribution plot
            st.subheader(f"Distribution of {selected_num_col}")
            fig, ax = plt.subplots(figsize=(10, 6))
            sns.histplot(df[selected_num_col].dropna(), kde=True, ax=ax)
            st.pyplot(fig)

            # Box plot
            st.subheader(f"Box Plot of {selected_num_col}")
            fig, ax = plt.subplots(figsize=(10, 6))
            sns.boxplot(x=df[selected_num_col].dropna(), ax=ax)
            st.pyplot(fig)

        if len(numeric_cols) >= 2:
            st.sidebar.subheader("Correlation Analysis")
            x_col = st.sidebar.selectbox("Select X column:", numeric_cols, key="x_col")
            y_col = st.sidebar.selectbox("Select Y column:", [c for c in numeric_cols if c != x_col], key="y_col")

            # Scatter plot
            st.subheader(f"Scatter Plot: {x_col} vs {y_col}")
            fig, ax = plt.subplots(figsize=(10, 6))
            sns.scatterplot(x=df[x_col], y=df[y_col], ax=ax)
            st.pyplot(fig)

        if len(categorical_cols) > 0 and len(numeric_cols) > 0:
            st.sidebar.subheader("Category Analysis")
            cat_col = st.sidebar.selectbox("Select category column:", categorical_cols)
            num_col = st.sidebar.selectbox("Select numeric column:", numeric_cols, key="num_for_cat")

            # Only plot if the column has 10 or fewer categories
            unique_cats = df[cat_col].nunique()
            if unique_cats <= 10:
                # Bar plot
                st.subheader(f"Bar Plot: Average {num_col} by {cat_col}")
                fig, ax = plt.subplots(figsize=(12, 6))
                df.groupby(cat_col)[num_col].mean().sort_values().plot(kind='bar', ax=ax)
                st.pyplot(fig)

                # Grouped box plot
                st.subheader(f"Box Plot: {num_col} by {cat_col}")
                fig, ax = plt.subplots(figsize=(12, 6))
                sns.boxplot(x=cat_col, y=num_col, data=df, ax=ax)
                ax.tick_params(axis='x', rotation=45)
                st.pyplot(fig)
    except Exception as e:
        st.error(f"Error: {e}")
else:
    # Show example datasets
    st.info("No file uploaded. You can use one of the example datasets below:")
    example_datasets = {
        "Iris Dataset": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv",
        "Titanic Dataset": "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
        "Boston Housing": "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
    }
    selected_example = st.selectbox("Select an example dataset:", list(example_datasets.keys()))
    if st.button("Load Example Dataset"):
        with st.spinner("Loading example dataset..."):
            df = pd.read_csv(example_datasets[selected_example])
        st.success(f"Successfully loaded {selected_example} with {df.shape[0]} rows and {df.shape[1]} columns.")
        st.dataframe(df.head(10))
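Save the script as, for example, data_explorer.py and launch it with:

streamlit run data_explorer.py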
Advanced Example: Building a Machine Learning App
Here’s a more advanced example that creates an interactive machine learning app for classification tasks:
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

# Set page configuration
st.set_page_config(
    page_title="ML Classification App",
    page_icon="🤖",
    layout="wide"
)

# Page title and description
st.title("🤖 Interactive Machine Learning App")
st.markdown("""
This app allows you to train a machine learning model on your data and visualize the results.
Upload your CSV file, select the target variable and features, and let the app do the rest!
""")

# Cache data loading to avoid re-reading the file on every rerun
@st.cache_data
def load_data(file):
    return pd.read_csv(file)

# Sidebar for ML settings
st.sidebar.header("Model Settings")

# Upload data
uploaded_file = st.sidebar.file_uploader("Upload CSV file", type=["csv"])

# Main section
if uploaded_file is not None:
    # Load data
    df = load_data(uploaded_file)

    # Show data overview
    st.subheader("Data Preview")
    st.dataframe(df.head())

    # Data preprocessing
    st.subheader("Data Preprocessing")

    # Get list of columns
    columns = df.columns.tolist()

    # Select target variable
    target_column = st.selectbox("Select target variable (categorical):", columns)

    # Warn if the target has many classes
    if df[target_column].nunique() > 10:
        st.warning(f"Warning: the selected target has {df[target_column].nunique()} unique values. Classification works best with fewer classes.")

    # Encode the target variable if necessary
    if df[target_column].dtype == 'object':
        le = LabelEncoder()
        df[f"{target_column}_encoded"] = le.fit_transform(df[target_column])
        st.info(f"Target variable '{target_column}' has been encoded for modeling.")
        target_classes = dict(zip(le.transform(le.classes_), le.classes_))
        st.write("Encoding mapping:", target_classes)
        target_column_for_model = f"{target_column}_encoded"
    else:
        target_column_for_model = target_column

    # Select features
    feature_columns = [col for col in columns if col != target_column]
    selected_features = st.multiselect("Select features for training:",
                                       feature_columns,
                                       default=feature_columns[:min(5, len(feature_columns))])

    # Only continue if features are selected
    if len(selected_features) > 0:
        # Feature preprocessing
        numeric_features = df[selected_features].select_dtypes(include=['int64', 'float64']).columns.tolist()
        categorical_features = df[selected_features].select_dtypes(include=['object']).columns.tolist()

        # Keep only the selected features, then handle categorical ones
        df_processed = df[selected_features].copy()
        if len(categorical_features) > 0:
            st.subheader("Categorical Feature Encoding")
            for cat_feat in categorical_features:
                # One-hot encode and replace the original column
                df_encoded = pd.get_dummies(df_processed[cat_feat], prefix=cat_feat)
                df_processed = pd.concat([df_processed.drop(cat_feat, axis=1), df_encoded], axis=1)
            st.success(f"Encoded {len(categorical_features)} categorical features using one-hot encoding.")

        # Final feature set
        X_columns = df_processed.columns.tolist()

        # Display correlation matrix for numeric features
        if len(numeric_features) > 1:
            st.subheader("Feature Correlation")
            corr = df[numeric_features].corr()
            fig, ax = plt.subplots(figsize=(10, 8))
            sns.heatmap(corr, annot=True, cmap='coolwarm', ax=ax)
            st.pyplot(fig)

        # Model training settings
        st.sidebar.subheader("Training Settings")
        test_size = st.sidebar.slider("Test set size", 0.1, 0.5, 0.2, 0.05)
        random_state = st.sidebar.slider("Random state", 0, 100, 42)
        n_estimators = st.sidebar.slider("Number of trees", 10, 500, 100, 10)

        # Split data
        X = df_processed[X_columns]
        y = df[target_column_for_model]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

        # Model training
        train_button = st.button("Train Model")
        if train_button:
            with st.spinner("Training model..."):
                # Train model
                model = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
                model.fit(X_train, y_train)

                # Make predictions
                y_pred = model.predict(X_test)

            # Evaluate model
            st.subheader("Model Performance")

            # Accuracy
            accuracy = accuracy_score(y_test, y_pred)
            st.metric("Accuracy", f"{accuracy:.4f}")

            # Classification report
            report = classification_report(y_test, y_pred, output_dict=True)
            st.dataframe(pd.DataFrame(report).transpose())

            # Confusion matrix
            st.subheader("Confusion Matrix")
            cm = confusion_matrix(y_test, y_pred)

            # Use the original class names if the target was encoded
            if df[target_column].dtype == 'object':
                class_names = le.classes_
            else:
                class_names = [str(c) for c in np.unique(y)]

            # Plot confusion matrix
            fig, ax = plt.subplots(figsize=(10, 8))
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                        xticklabels=class_names, yticklabels=class_names, ax=ax)
            ax.set_ylabel('Actual')
            ax.set_xlabel('Predicted')
            st.pyplot(fig)

            # Feature importance
            st.subheader("Feature Importance")
            feature_importance = pd.DataFrame({
                'Feature': X_columns,
                'Importance': model.feature_importances_
            }).sort_values('Importance', ascending=False)
            fig = px.bar(feature_importance, x='Importance', y='Feature', orientation='h',
                         title='Feature Importance')
            st.plotly_chart(fig)

            # Allow downloading the trained model as a pickle file
            model_pickle = pickle.dumps(model)
            st.download_button(
                label="Download trained model",
                data=model_pickle,
                file_name="random_forest_model.pkl",
                mime="application/octet-stream"
            )
    else:
        st.warning("Please select at least one feature to train the model.")
else:
    st.info("Please upload a CSV file to get started.")

    # Show example usage
    st.subheader("Example Usage")
    st.markdown("""
    1. Upload a CSV file with your data
    2. Select the target variable (what you want to predict)
    3. Select the features to use for prediction
    4. Adjust model parameters in the sidebar
    5. Click 'Train Model' to see the results
    """)

    # Sample datasets
    st.subheader("Sample Datasets")
    st.markdown("""
    - [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris) (Classification)
    - [Wine Quality Dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality) (Classification)
    - [Breast Cancer Wisconsin Dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)) (Classification)
    """)
Key Features
Streamlit provides numerous features that make it ideal for data applications:
- Pure Python Development: Build complete web applications without HTML, CSS, or JavaScript
- Live Reloading: Changes automatically reflect in the app when you save your script
- Rich Widget Library: Extensive collection of UI components for user input and interaction
- Data Visualization Support: Native integration with popular plotting libraries
- Caching Mechanism: Performance optimization for data-heavy applications
- Layout Options: Columns, tabs, expandable sections, and sidebar for UI organization
- File Uploads and Downloads: Easy handling of file operations
- Session State: Persistent state management across reruns (see the sketch after this list)
- Multi-page Applications: Support for building applications with multiple pages
- Component Ecosystem: Extensible with custom components from the community
- Theme Customization: Configurable appearance and branding
- Authentication: User authentication capabilities for secure applications
- Cloud Deployment: Free hosting for public apps via Streamlit Community Cloud
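As a quick illustration of session state, here is a minimal counter app: each button click triggers a rerun of the script, but values stored in st.session_state survive the rerun.

import streamlit as st

# Initialize persistent state on the first run only
if "count" not in st.session_state:
    st.session_state.count = 0

# Clicking the button causes a rerun; the stored counter persists
if st.button("Increment"):
    st.session_state.count += 1

st.write(f"Button pressed {st.session_state.count} times")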
Resources
Official Resources
- [Streamlit Documentation](https://docs.streamlit.io)
- [GitHub Repository](https://github.com/streamlit/streamlit)
- [Streamlit Community Cloud](https://streamlit.io/cloud) - Free hosting platform
- [Streamlit Gallery](https://streamlit.io/gallery) - Example applications
- [Streamlit Blog](https://blog.streamlit.io) - Tutorials and updates
Community Resources
- [Streamlit Components](https://streamlit.io/components) - Community extensions
- [Streamlit Forum](https://discuss.streamlit.io) - Community discussions
- Streamlit Cheat Sheet - Quick reference
- Streamlit YouTube Channel - Video tutorials
- [Awesome Streamlit](https://github.com/MarcSkovMadsen/awesome-streamlit) - Curated resources
Suggested Projects
You might also be interested in these similar projects:
Self-host Supervision, a Python library with reusable computer vision tools for easy annotation, detection, tracking, and dataset management