How to Tune a Model Using Feature Contribution and Simple Analytics
Tuning a model is a core part of a data scientist’s work, and feature selection is an integral piece of the tuning process. In many cases, however, the model itself is a “black box,” which makes it hard to understand how each feature actually performs.
We can add, remove, or change features, and each feature typically affects the model’s performance in both good and bad ways. We would like to know whether a feature helps or hurts overall, and how its performance changes across different test sets. To accomplish this, we suggest a simple process that consists of two main steps:
- Calculate and keep the feature contribution data for different experiments and test sets
- Analyze the data using a query engine or other analytics tools
In this post, you will train a model, calculate the feature contributions for a test set, and analyze the results using several methods. Based on that analysis, you will decide how to update your features and see how the model improves.
Step 1: Learn about the data
A web application firewall is used to detect and block malicious traffic:
SQL injection is one of the most common web hacking techniques. It is the placement of malicious code in SQL statements via web page input. For example, an attacker can submit the input 105 OR 1=1 in the “User Id” field:
The application may construct the following statement, and return the entire “users” table:
SELECT * FROM users WHERE user_id = 105 OR 1=1
A successful attack may result in the unauthorized viewing of user lists, the deletion of entire tables and, in certain cases, the attacker gaining administrative rights to a database.
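To make the vulnerability concrete, here is a short, hypothetical sketch (not taken from any real application) of how naive string concatenation turns the malicious “User Id” input into the statement above:

# Hypothetical vulnerable query construction, for illustration only
user_id = "105 OR 1=1"  # attacker-controlled "User Id" input
query = "SELECT * FROM users WHERE user_id = " + user_id
print(query)  # SELECT * FROM users WHERE user_id = 105 OR 1=1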
Here are some examples from the data set we will use:
Next, we will build a model for detecting SQL Injection attacks.
Step 2: Train a model for detection of SQL Injection attacks
First, we read the data and split it into a train set and a test set:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sqliv2-updated.csv.gz")
train, test = train_test_split(df, random_state=0)
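Before extracting features, it can help to glance at the raw rows and the class balance. This quick check is our addition; the column names below match the ones used later in this post:

# Peek at a few rows and the label distribution
print(df.head())
print(df["label"].value_counts())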
The feature extraction step is defined using a lambda function per feature. Each lambda gets a query string and returns a value:
import re
import string
from typing import Callable, Dict

features: Dict[str, Callable[[str], object]] = {
    "length": lambda x: len(x),
    "single_quotes": lambda x: x.count("'"),
    "punctuation": lambda x: count_chars(x, string.punctuation),
    "single_line_comments": lambda x: x.count("--"),
    "keywords": lambda x: len(re.findall("select|delete|insert|union", x)),
    "whitespaces": lambda x: count_chars(x, string.whitespace),
    "operators": lambda x: count_chars(x, "+-*/%&|^=><"),
    "special_chars": lambda x: len(x) - count_chars(x, string.ascii_lowercase),
}

def process_feature_vectors(input_df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({f: [features[f](qs) for qs in input_df["query_string"]] for f in features})

train_input = process_feature_vectors(train)
test_input = process_feature_vectors(test)
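The count_chars helper is used by several of the lambdas above but is not shown in the snippet. A minimal sketch, assuming it simply counts how many characters of the query belong to a given character set:

def count_chars(text: str, chars: str) -> int:
    # Count the characters of `text` that appear in the set `chars`
    char_set = set(chars)
    return sum(1 for c in text if c in char_set)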
Here are some examples of the resulting feature vectors:
We train a binary classification model:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=5, random_state=0)
model.fit(train_input, train["label"])
predictions = model.predict(test_input)
We calculate metrics and print them:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

labels = test["label"]
print(f"precision: {round(precision_score(labels, predictions), 3)}")
print(f"accuracy: {round(accuracy_score(labels, predictions), 3)}")
print_conf_mat(confusion_matrix(labels, predictions))
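The print_conf_mat helper is not shown here; a minimal sketch, assuming it simply prints the 2x2 confusion matrix with readable row and column names:

def print_conf_mat(conf_mat) -> None:
    # scikit-learn orders rows by the true label and columns by the predicted label
    print(pd.DataFrame(conf_mat,
                       index=["actual benign", "actual attack"],
                       columns=["predicted benign", "predicted attack"]))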
Here is the output of our metrics:
We also print the feature importances, which rank the features by the effect they have on the model’s predictions:
model.feature_importances_
Here is an output example:
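Since feature_importances_ is a bare array, it is easier to read when paired with the feature names. A small sketch (our addition, reusing the features dict defined earlier):

# Rank features by importance, highest first
importances = pd.Series(model.feature_importances_, index=list(features))
print(importances.sort_values(ascending=False))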
Now we have an initial working model and we know the feature importance. It is time to calculate the feature performance and tune our model.
Step 3: Feature Contribution Data
The feature contribution data consists of an array of floating point numbers per prediction. Each feature has a number between -1 and 1 representing its contribution to the prediction:
We did the calculation using the SHAP library:
import shap

# TreeExplainer computes per-feature contribution values for tree-based models;
# index [1] selects the contributions toward class 1 (an attack).
explainer = shap.TreeExplainer(model)
contrib = explainer.shap_values(test_input)[1]
contrib_df = pd.DataFrame(contrib, columns=features)
SHAP can also plot a waterfall chart for a single prediction:
# i is the index of the single prediction we want to explain
shap.waterfall_plot(shap.Explanation(contrib[i], explainer.expected_value[0], test_input.iloc[i], features))
This chart explains how each feature contributed to a prediction:
To perform the analytics, we remove very small contribution values so that the averages are calculated from meaningful values. In our case, about 10% of the values were removed:
contrib_df[abs(contrib_df) <= 0.005] = None
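A quick way to confirm roughly how many values were dropped (our addition; the masked cells become NaN after the assignment above):

# Overall fraction of contribution values that were masked out
print(f"removed: {contrib_df.isna().mean().mean():.1%}")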
We group the data by classification and print the result:
contrib_df["classification"] = get_classification(predictions, labels)
cls_mean_df = contrib_df.groupby("classification").mean()
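The get_classification helper is not shown above; a possible implementation (an assumption on our part) labels each prediction with its confusion-matrix bucket, assuming label 1 marks an attack:

import numpy as np

def get_classification(predictions, labels) -> np.ndarray:
    # Map (predicted, actual) pairs to TP / FP / TN / FN
    buckets = {(1, 1): "TP", (1, 0): "FP", (0, 0): "TN", (0, 1): "FN"}
    return np.array([buckets[(p, a)] for p, a in zip(predictions, np.asarray(labels))])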
From the result we can find, for example, the feature which contributes the most to false positives and try to update it. Here is what it looks like:
If we invert the contribution values for classes in which positive contribution is considered bad, we can draw a heat map for feature performance per classification:
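One way to draw such a heat map, sketched under the assumption that seaborn is available and that "FP" and "TN" are the classes in which pushing toward the attack class is considered bad:

import seaborn as sns

signed_df = cls_mean_df.copy()
# Assumption: for these classes a positive (attack-pushing) contribution hurts
bad_when_positive = ["FP", "TN"]
signed_df.loc[signed_df.index.isin(bad_when_positive)] *= -1

# Higher values now consistently mean "the feature helped" for every class
sns.heatmap(signed_df, cmap="RdYlGn", center=0, annot=True, fmt=".3f")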
We can also use the same values to score our features:
cls_mean_df.mean()
This is the flow we used:
Here is the output example:
Takeaway
We believe that knowing more about feature contributions will help you learn about both your model and your data. That will lead to better results, and it will make them easier and faster to achieve. You can use analytics tools to calculate your feature metrics and use them during both development and production.