A linear regression on liquidity risk and a random-forest classifier on investor segment — both with auto-EDA, split-before-fit, persisted artifacts, and explicit RMSE / R² evaluation. The choice not to use deep learning here is the design. The data is tabular, the features are interpretable, and a regulator may need to read this code in a year.
Notes from Surendra Singh — production ML systems with explicit precision targets.
```text
# auto-detect structure, IQR-clean, profile
ingest_csv(csv_path, target_column="") -> str

# model 1 — regression: continuous score in [0, 1]
liquidity_predictor(csv_path, target_column="liquidity_risk")
predict_liquidity(client_features: dict) -> str   # LOW / MODERATE / HIGH bucket at presentation time

# model 2 — classification: discrete segment
investor_classifier(csv_path, target_column="segment")
classify_investor(profile: dict) -> str
```
Both train tools save a joblib artifact under finance_output/models/. Both predict tools load the artifact and score one row. The artifact is the contract — fitted preprocessing and the model travel together, so feature scaling, one-hot encoding, and the model itself are never out of sync.
ingest_csv runs before any model: detects column types, drops constant columns, removes IQR outliers on numeric features, and writes summary statistics plus a charts directory. It is not glamorous; it is the step that decides whether the model that follows is honest.
```python
def _detect_structure(df: pd.DataFrame) -> dict:
    return {
        "numeric": [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])],
        "categorical": [c for c in df.columns if pd.api.types.is_object_dtype(df[c])],
        "constant": [c for c in df.columns if df[c].nunique(dropna=True) <= 1],
        "high_null": [c for c in df.columns if df[c].isna().mean() > 0.5],
    }
```
Outlier removal happens before the train/test split, on numeric columns only. This is a defensible asymmetry: outlier removal is part of dataset hygiene, not part of the model's learned distribution. Remove outliers post-split and your test set looks cleaner than production data — your evaluation lies. This is the kind of bug that ships a model.
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])),
    ("reg", LinearRegression()),
])

pipe.fit(X_train, y_train)  # fit ONLY on train
preds = pipe.predict(X_test)
rmse = root_mean_squared_error(y_test, preds)
r2 = r2_score(y_test, preds)

joblib.dump(pipe, "finance_output/models/liquidity_pipeline.joblib")
```
Why regression, not classification: the target liquidity_risk ∈ [0, 1] is a continuous probability. Bucketing it into LOW / MODERATE / HIGH for the UI is fine; bucketing during training discards information. RMSE and R² report on the regression. Confusion matrices would lie about a problem that isn't classification.
investor_classifier trains a RandomForestClassifier on a CSV of investor profile attributes, predicting one of conservative / moderate / aggressive. Same preprocessing pattern as the regression: ColumnTransformer with StandardScaler + OneHotEncoder, fitted inside the pipeline so the artifact is self-contained.
Why random forest: with mixed-type tabular features (credit score, debt ratio, region, age, income, risk tolerance, product preference) and modest sample sizes, RF gives interpretable feature importance, handles non-linearities without feature engineering, and resists overfitting through bagging and per-split feature subsampling. It is the default for a reason.
```python
pipe = Pipeline([
    ("prep", prep),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
pipe.fit(X_train, y_train)
acc = (pipe.predict(X_test) == y_test).mean()
```
Three reasons, each load-bearing:

- Regulators can read LinearRegression coefficients. They cannot read a 20M-parameter MLP. Pretending otherwise is how AI projects die in audit.
- "The fanciest model that fits the data is rarely the right one. The simplest model that explains the data, ships."
- The train tool's job ends at the .joblib file. Productionizing it is a separate concern with its own SLAs.