Too high performances on a classification problem

Question

I have a .json file as dataset of the type:

and I am working on a classification problem in which I have to predict 4 classes, which are rhe semantic. I have worked through the problem, and after splitting the dtataset into training and test sets, I get an accuracy of $1$ .I have an unbalanced dataset, so I have oversampled it:

my code is the following:
dataFrame = pd.read_json('dataset.json',lines = True)
df = dataFrame[["lista_asm", "semantic"]].copy()

from sklearn.feature_extraction.text import TfidfVectorizer 
tfidf_vectorizer=TfidfVectorizer()
df_x = df['lista_asm']

X_all = tfidf_vectorizer.fit_transform(df_x)
y_all = df['semantic']

#oversampling
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
 X_ros, y_ros = ros.fit_sample(X_all, y_all)

print(X_ros.shape[0] - X_all.shape[0], 'new random picked points')

#splitting

X_train, X_test, y_train, y_test = train_test_split(X_ros, y_ros, 
    test_size=0.2, random_state=15)

from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

#fitting the model
clf = svm.SVC(kernel='linear', C=10).fit(X_train,y_train)

y_pred = clf.predict(X_test)

#checking accuracy
acc = clf.score(X_test, y_test)    
print("Accuracy %.3f" %acc)          #from here I get accuracy 1

I have not written the imports in order to avoid making the code too long here, but if needed I can add them.
So I am not testing on the training set, but the results are too good so for sure there mmust be something wrong.I don't understand what is wrong.
Is this something that happen for any specific reason?
I have tried to change some stuff to make the code worse, in order to see if the accuracy would go down, but it remains to one.
[EDIT]I have tried to use your suggestions, and I am already sorry since for sure I am doing something wrong.
Here is what I have done:
X_all = df['lista_asm']
y_all = df['semantic']

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, 
    test_size=0.2, random_state=15)

tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), lowercase = True, 
max_features = 20000)
tf_transformer = tf.fit(X_train)
xtrain = tf.fit_transform(X_train)

from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = svm.SVC(kernel='linear', C=10).fit(xtrain,y_train)

tf2=TfidfVectorizer(analyzer='word', ngram_range=(1,2), lowercase = True, 
max_features = 20000)
tf_transformer = tf2.fit(X_test)
X_test = tf2.fit_transform(X_test)

y_pred = clf.predict(X_test)

acc = clf.score(X_test, y_test)    
print("Accuracy %.3f" %acc)       #gives accuracy 0.277

and if I plot the other performance metrics:

with the following message:

UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))

[EDIT]I have also tried to use the same approch but with a .jsonl file, and in this case it works fine giving a reasonable accuracy. Maybe it is wrong to do so, but could the problem be here, so that I am not handling correctly a .json file?
Can somebody please help me?

Shahriyar Mammadli · Accepted Answer

It is normal to get 100% accuracy in this case. I mean it is inevitable to get 100% accuracy. If you train your algorithm even with 10% of your data you will again get 100% accuracy.
The abnormal thing in this code is using a vectorizer. Vectorizer is an algorithm that is used to extracting words from the documents as mentioned in the source:

Tf-idf stands for term frequency-inverse document frequency and is often used in information retrieval and text mining.

If you print what it extracts from your lista_asm variable, you will see almost every unique word in your feature's samples.
print(tfidf_vectorizer.get_feature_names())

Output (just 6 of them fitted the image, in total 23125 of them available):

As a result, it creates 23125 new features for you where all these features are unique words in your samples.
When you model your data with these 'features', even a simple model can learn the task easily. As you have only 4 target values which are ['string' 'math' 'encryption' 'sort'], SVM easily learns how to predict them your very spare feature matrix. In other words, if it sees a specific word or words it easily predicts your result. To more clarify it lets assume a simple training set as following:

So, when it will see a string which includes apple orange it will easily label it as sort.
Check this example if you want to understand how exactly Tf-idf Vectorizer works.

Thus, getting that 100% accuracy is normal. Since I do not know what exactly those features stand for, I cannot comment on them, however, consider that you converted the samples of your lista_asm (which is a list of strings) to a single string and then extracted the keywords from it.
If you would ask if it is a good solution or not. I think it is a weird solution. If your data a good representation of future data, then it will work correctly. I would say it is a good representation of your future data because when you even train your model with 10% percent of your data and test with 90%, it perfectly (with 100% accuracy) predicts the target.
Additional note:
Check the following result of the number of unique values for each category (only 1000 samples used). That is, the words that appear only in one specific category, not in others.

and following code (It is a bit dirty and inefficient code since I wrote it quickly. Sorry.)
import pandas as pd
import re

# Read the file into dataframe
dataFrame = pd.read_json('dataset.json', lines = True)

dataFrame = dataFrame[:1000]
dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x: re.findall("'([^']*)'", x))
df = dataFrame[['opcodes', 'semantic']]

print(df['semantic'].unique())

categories = df['semantic'].unique()

for cat in categories:
    catList = []
    refList = []
    catDf = df[df.semantic == cat]
    restDf = df[df.semantic != cat]
    for index, row in catDf.iterrows():
        catList.append(row['opcodes'])
    for index, row in restDf.iterrows():
        refList.append(row['opcodes'])
    print(len([i for i in catList if i not in refList]))

What the code does and the image above shows is to find words that are in one specific category and not in another. E.g.  the words that appear in the sort category but not in others. string, math, encryption, sort have 222, 305, 197, and 276 (in only 1000 samples, as you increase the sample size these values blow up) unique words (that only belongs to its category) respectively. By using those unique words (not exactly the same but similar) your model differentiates the categories easily.
Update 2:
If you take a subset of these features (or words), your accuracy drops significantly, because your unique words are not there anymore. e.g. first 100 features:
X_ros = X_ros[:,0:100]

Paste this code before train_test_split.

or first 1000 features
X_ros = X_ros[:,0:1000]

or let's use 20% of your features.
X_ros = X_ros[:,0:4625]

As you can see, 20 % of your features are not well-predictor of your 4 categories.
But do not confuse it with max_features which

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

This chooses the top best features not a random subset of features.
tfidf_vectorizer=TfidfVectorizer(max_features=10)

Even 10 of those are able to identify the class as good as the first 20% of your features. Their combination is able to give you the correct class.

Peter · Answer

One thing you could check is wether using the vectorizer on the whole text could be an issue. The test set should be "new" unseen data and should not necessarily be vectorized together with the training data. You can "refit" a vectorizer on new text (the test set) using the "old" vocabulary (from train set). However, I'm not sure if this really is an issue in your case. Essentially you would need to use the "old" vocabulary (from the train set) to prepare your (unseen) test data.
Minimal example with TfidfVectorizer():
from sklearn.linear_model import LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

# Corpus (X train) and target (y train) from a pandas df
corpus = df[['text']].values[:,0].astype('U').tolist()
ytrain = pd.Series(df['yvar'])

# Prepare tfidf vectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), lowercase = True, max_features = 20000)
tf_transformer = tf.fit(corpus)
xtrain = tf.fit_transform(corpus)
# Save transformer
pickle.dump(tf_transformer, open(mypath + "tfidf.pkl", "wb"))

# Do classification
clf = LogisticRegressionCV(n_jobs=2, penalty='l2', solver='liblinear', cv=10, scoring = 'accuracy', random_state=0)
clf.fit(xtrain, ytrain)

# Save classifier
with open(mypath + 'clf.pkl', 'wb') as f:
    pickle.dump(clf, f)

##### New text (test data)
# Load classifier
with open(modelpath + "clf.pkl", 'rb') as f:
    clf = pickle.load(f)

# Load transformer (I use the "old" vocabulary from tf1 here to generate tf1_new)
tf1 = pickle.load(open(modelpath + "tfidf.pkl", 'rb'))
tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1,2), lowercase = True, max_features = 20000, vocabulary = tf1.vocabulary_)

# Predict classifier on tf1_new...
xtest = tf1_new.fit_transform([mynewtext])
res = clf.predict_proba(xtest)

Also find the code here: https://github.com/Bixi81/Python-ml

Michael Long · Answer

As noted by Peter, you should split your data into test and train
sets before fitting the TfidfVectorizer to the training set (not to
all the data).

Pandas read_json() is not reading your data from file the way you
want. You are expecting df["lista_asm"] to be a Series of list
objects (each containing strings). This Series is instead a  Series
of strings (formatted to appear like a list of strings). You must
parse the strings into a list, if that is what you want. But I
believe that the TfidfVectorizer().fit method wants strings, not a
list of strings. (This issue was apparent before your edit)

Similar to the first issue, you should avoid using the
RandomOverSampler() on the test data. If you want to use it, apply it
only to the training data. I suspect that it may be creating
duplicates or near-duplicate records, some of which are later split
into the test set. Try not oversampling at all and evaluate the
performance.

Too high performances on a classification problem

3 Answers

Add your own answers!

Ask a Question