A linear regression on liquidity risk and a random-forest classifier on investor segment — both with auto-EDA, split-before-fit, persisted artifacts, and explicit RMSE / R² evaluation. The choice not to use deep learning here is the design. The data is tabular, the features are interpretable, and a regulator may need to read this code in a year.
Notes from Surendra Singh — production ML systems with explicit precision targets.
```text
# auto-detect structure, IQR-clean, profile
ingest_csv(csv_path, target_column="") -> str

# model 1 — regression: continuous score in [0, 1]
liquidity_predictor(csv_path, target_column="liquidity_risk")
predict_liquidity(client_features: dict) -> str   # LOW / MODERATE / HIGH bucket at presentation time

# model 2 — classification: discrete segment
investor_classifier(csv_path, target_column="segment")
classify_investor(profile: dict) -> str
```
Both train tools save a joblib artifact under finance_output/models/. Both predict tools load the artifact and score one row. The artifact is the contract — fitted preprocessing and the model travel together, so feature scaling, one-hot encoding, and the model itself are never out of sync.
ingest_csv runs before any model: detects column types, drops constant columns, removes IQR outliers on numeric features, and writes summary statistics plus a charts directory. It is not glamorous; it is the step that decides whether the model that follows is honest.
```python
def _detect_structure(df: pd.DataFrame) -> dict:
    return {
        "numeric": [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])],
        "categorical": [c for c in df.columns if pd.api.types.is_object_dtype(df[c])],
        "constant": [c for c in df.columns if df[c].nunique(dropna=True) <= 1],
        "high_null": [c for c in df.columns if df[c].isna().mean() > 0.5],
    }
```
Outlier removal happens before the train/test split, on numeric columns only. This is a defensible asymmetry: outlier removal is part of dataset hygiene, not part of the model's learned distribution. Remove outliers post-split and your test set looks cleaner than production data — your evaluation lies. This is the kind of bug that ships a model.
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])),
    ("reg", LinearRegression()),
])

pipe.fit(X_train, y_train)  # fit ONLY on train
preds = pipe.predict(X_test)
rmse = root_mean_squared_error(y_test, preds)
r2 = r2_score(y_test, preds)

joblib.dump(pipe, "finance_output/models/liquidity_pipeline.joblib")
```
Why regression, not classification: the target liquidity_risk ∈ [0, 1] is a continuous probability. Bucketing it into LOW / MODERATE / HIGH for the UI is fine; bucketing during training discards information. RMSE and R² report on the regression. Confusion matrices would lie about a problem that isn't classification.
investor_classifier trains a RandomForestClassifier on a CSV of investor profile attributes, predicting one of conservative / moderate / aggressive. Same preprocessing pattern as the regression: ColumnTransformer with StandardScaler + OneHotEncoder, fitted inside the pipeline so the artifact is self-contained.
Why random forest: with mixed-type tabular features (credit score, debt ratio, region, age, income, risk tolerance, product preference) and modest sample sizes, RF gives interpretable feature importance, handles non-linearities without feature engineering, and resists overfitting through bagging and per-split feature subsampling. It is the default for a reason.
```python
pipe = Pipeline([
    ("prep", prep),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
pipe.fit(X_train, y_train)
acc = (pipe.predict(X_test) == y_test).mean()
```
Three reasons, each load-bearing:

- Regulators can read LinearRegression coefficients. They cannot read a 20M-parameter MLP. Pretending otherwise is how AI projects die in audit.
- "The fanciest model that fits the data is rarely the right one. The simplest model that explains the data, ships."
- The train tool's job ends at the .joblib file. Productionizing it is a separate concern with its own SLAs.