While Bitcoin is renowned for its security, not all digital currencies share the same reliability. Despite glamorous success stories, many tokens ultimately prove to be scams—commonly known as "rug pulls." This occurs when token creators abruptly disappear with investors' funds, leaving them with nothing.
Solana, a Layer 1 blockchain, hosts a vast ecosystem of tokens that can be created quickly and at near-zero cost. While this fosters innovation, it also opens doors for bad actors to launch tokens solely to execute rug pulls once liquidity flows in.
In this guide, we’ll explore how to build Egeria, a machine learning API designed to predict token risks on Solana. You’ll also learn to create a simplified version yourself!
1. Environment Setup
Before diving in, let’s prepare our workspace. We’ll use Google Colab for convenience. Install the required dependencies:
```python
# Install essential modules (Colab shell commands)
!pip install -U pandas scikit-learn numpy matplotlib
!pip install xgboost==2.0.3 joblib==1.3.2
```
Key Tools:
- Pandas: Data manipulation.
- NumPy: Numerical computations.
- scikit-learn: ML algorithms and preprocessing.
- XGBoost: Optimized gradient boosting.
- Joblib: Model serialization.
2. Data Collection
A labeled dataset is critical for supervised learning. We’ll use the Vybe API to gather token data, ensuring a balanced mix of high-risk and safe tokens.
👉 Explore Vybe API documentation
Tip: An imbalanced dataset biases the model toward the majority class; with too few risky examples, the model will tend to under-flag risky tokens.
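Before training, it's worth checking the class balance directly. A minimal sketch, assuming a DataFrame with the `Risk` label column described above (the rows here are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset: 'Risk' is the label column from the Vybe data
df = pd.DataFrame({
    "liquidity": [1200.0, 30.5, 8800.0, 4.2, 950.0, 2.1],
    "Risk": ["Good", "Danger", "Good", "Danger", "Good", "Warning"],
})

# Count classes after collapsing labels to the binary target used later
counts = df["Risk"].map({"Danger": 1, "Warning": 1, "Good": 0}).value_counts()
print(counts.to_dict())  # balanced here: three risky, three safe
```

If one class dominates, consider resampling or class weights before fitting.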
3. Data Preprocessing
Clean and structure the raw data by removing columns with no predictive value (`address`, `lastTradeUnixTime`, `mc`):
```python
from sklearn.model_selection import train_test_split

def preprocess_data(df):
    # Drop identifier and timestamp columns that carry no predictive signal
    df = df.drop(['address', 'lastTradeUnixTime', 'mc'], axis=1)
    X = df.drop('Risk', axis=1)
    # Collapse the three risk tiers into a binary target
    y = df['Risk'].map({'Danger': 1, 'Warning': 1, 'Good': 0}).astype(int)
    return train_test_split(X, y, test_size=0.4, random_state=42)
```
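The label mapping above treats both `Danger` and `Warning` as risky (1) and `Good` as safe (0). A quick check of that step in isolation:

```python
import pandas as pd

# Both 'Danger' and 'Warning' map to the risky class (1); 'Good' is safe (0)
risk = pd.Series(["Danger", "Warning", "Good", "Good"])
y = risk.map({"Danger": 1, "Warning": 1, "Good": 0}).astype(int)
print(y.tolist())  # [1, 1, 0, 0]
```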
Handling Feature Types
- Numeric features (e.g., `liquidity`, `Volatility`): standardized after mean imputation.
- Categorical features (e.g., `symbol`, `name`): one-hot encoded.
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Mean-impute then standardize numerics; one-hot encode categoricals,
# ignoring categories unseen at fit time
numeric_transformer = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'
)
```
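To see what the transformer produces, here is a self-contained sketch on a tiny made-up frame (one numeric column with a missing value, one categorical column):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Illustrative frame: a numeric column with a gap, plus a categorical column
X = pd.DataFrame({
    "liquidity": [10.0, None, 30.0],
    "symbol": ["SOL", "BONK", "SOL"],
})

pre = ColumnTransformer(transformers=[
    ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), ["liquidity"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["symbol"]),
])

out = pre.fit_transform(X)
print(out.shape)  # 3 rows; 1 scaled numeric column + 2 one-hot columns
```

The missing `liquidity` value is imputed with the column mean (20.0) before scaling, so no rows are lost.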
4. Model Training
We’ll use XGBoost for its robustness against overfitting and ability to handle complex datasets:
```python
import xgboost as xgb
from sklearn.pipeline import Pipeline

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', xgb.XGBClassifier(
        n_estimators=100,   # number of boosting rounds
        learning_rate=0.1,  # shrinkage applied to each round
        max_depth=3,        # shallow trees limit overfitting
        random_state=42
    ))
])
model.fit(X_train, y_train)
```
Why XGBoost?
- Regularization prevents overfitting.
- Boosting adds trees sequentially, each one correcting the errors of the previous rounds.
5. Model Evaluation
Assess performance via a confusion matrix:
```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```
Metrics:
- Accuracy: Overall correctness.
- Precision/Recall: Balance between false positives/negatives.
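Precision and recall follow directly from the confusion matrix cells. A worked example with hand-picked labels (1 = risky, 0 = safe), where a false negative would be a missed rug pull:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 1 = risky, 0 = safe; hand-picked labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# sklearn lays the matrix out as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 3 1 1 3
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 0.75
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 0.75
```

For rug-pull detection, recall on the risky class usually matters most: a missed scam is costlier than a false alarm.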
6. Saving the Model
Persist the model and preprocessor for future use:
```python
import joblib

joblib.dump(model, "predictModel.pkl")
joblib.dump(preprocessor, "mainPreprocessor.pkl")
```
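To confirm the round trip works, here is a self-contained sketch that persists a trivial model to a temporary file and reloads it (the toy data and path are made up):

```python
import os
import tempfile
import joblib
from sklearn.linear_model import LogisticRegression

# Train a trivial model, persist it, and reload it from disk
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, path)
restored = joblib.load(path)

# The reloaded model makes identical predictions
print(list(restored.predict(X)) == list(clf.predict(X)))  # True
```

Note that pickled models should only be loaded from sources you trust, and with the same library versions used to save them.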
FAQ
Q1: What’s a rug pull?
A: A scam where token creators drain liquidity and abandon the project.
Q2: Why use Solana for this model?
A: Solana’s low-cost token creation attracts both innovation and scams, making risk prediction vital.
Q3: Can I use logistic regression instead?
A: Logistic regression is simpler but less effective for complex patterns in token data.
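The classic illustration of this limitation is an XOR-style pattern, where the risky class depends on an interaction between two signals rather than either one alone. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# XOR-style pattern: risky (1) only when exactly one signal fires.
# No single linear boundary separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25)
y = np.array([0, 1, 1, 0] * 25)

linear = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(linear.score(X, y))  # near 0.5: a linear model cannot fit XOR
print(tree.score(X, y))    # 1.0: axis-aligned splits capture the interaction
```

Tree ensembles like XGBoost capture such feature interactions automatically, which is why they tend to outperform linear baselines on tabular token data.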
👉 Learn more about DeFi risk tools
Next Steps: Integrate this model into a FastAPI service for real-time risk scoring!