Same model/same data different results in Keras/TF

Question

To start: I used the identical dataset for all models, just in different shapes.
I began with binary text classification, following the tutorial at https://keras.io/examples/nlp/text_classification_from_scratch/. The results on my dataset are quite promising, at ~80% accuracy. For this run the dataset was a set of .txt files generated from the CSV.

Next I wanted to add structured data to increase the accuracy, following this tutorial (https://www.tensorflow.org/tutorials/load_data/csv#mixed_data_types). But the results were quite bad, at ~50%. I then adapted the layers from the first tutorial, but the results are still bad. Here the dataset was read directly from the CSV.

Now I have removed all structured data from the CSV and left only the text in it, but the results are still not similar to those of the first model. Why? Again, the dataset was read directly from the CSV.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import numpy as np

titanic = pd.read_csv("Dataset.csv")
titanic.head()

titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('survived')

inputs = {}

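# Create one symbolic input per column: string columns become tf.string,
# everything else is treated as float32.
for name, column in titanic_features.items():
  dtype = column.dtype
  if dtype == object:
    dtype = tf.string
  else:
    dtype = tf.float32

  inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)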

numeric_inputs = {name:input for name,input in inputs.items()
                  if input.dtype==tf.float32}

x = layers.Concatenate(name='ConcatNumeric')(list(numeric_inputs.values()))

norm = preprocessing.Normalization(name='PrepNormalization')

norm.adapt(np.array(titanic[numeric_inputs.keys()]))
all_numeric_inputs = norm(x)

# Not adding numeric values to dataset
preprocessed_inputs = []

# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Instantiate the text vectorization layer. We are using this layer to
# normalize, split, and map strings to integers, so we set our 'output_mode'
# to 'int'.
# Note that no custom standardization or split function is passed here, so
# the layer's defaults are used.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

for name, input in inputs.items():
  if input.dtype == tf.float32:
    continue
  
  x = vectorize_layer(input)
  x = layers.Embedding(max_features + 1, embedding_dim)(x)
  x = layers.Dropout(0.5)(x)
  x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
  x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
  x = layers.GlobalMaxPooling1D(name='GlobalMaxPooling')(x)
  x = layers.Dense(128, activation="relu")(x)
  x = layers.Dropout(0.5)(x)
  x = layers.Dense(1, activation="sigmoid", name="predictionasds")(x)
  preprocessed_inputs.append(x)

titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs)

titanic_preprocessing.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

titanic_features_dict = {name: np.array(value) 
                         for name, value in titanic_features.items()}

titanic_preprocessing.fit(x=titanic_features_dict, y=titanic_labels, epochs=3)

tf.keras.utils.plot_model(model=titanic_preprocessing, rankdir="LR", dpi=72, show_shapes=True)
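
One difference from the first tutorial that may be worth checking: there, `vectorize_layer.adapt(...)` is called on the training text before the layer is used, while no such call appears in the listing above. The sketch below is a toy, pure-Python stand-in (the `ToyVectorizer` class is hypothetical and is not the Keras implementation) that illustrates what an integer lookup does when it has never seen any text: every word lands in the same out-of-vocabulary bucket, so different inputs become indistinguishable.

```python
# Toy stand-in for an integer-output text vectorizer (hypothetical helper,
# NOT the Keras implementation). Index 0 is padding, index 1 is the
# out-of-vocabulary (OOV) bucket, and learned words start at index 2.
from collections import Counter


class ToyVectorizer:
    def __init__(self, max_tokens, sequence_length):
        self.max_tokens = max_tokens
        self.sequence_length = sequence_length
        self.index = {}  # word -> integer id; stays empty until adapt() runs

    def adapt(self, texts):
        # Count lowercase, whitespace-split words across all texts.
        counts = Counter(w for t in texts for w in t.lower().split())
        # Two ids are reserved (padding and OOV), so keep max_tokens - 2 words.
        most_common = counts.most_common(self.max_tokens - 2)
        self.index = {w: i + 2 for i, (w, _) in enumerate(most_common)}

    def __call__(self, text):
        # Unknown words map to the OOV id 1.
        ids = [self.index.get(w, 1) for w in text.lower().split()]
        ids = ids[: self.sequence_length]
        # Pad with 0 up to the fixed sequence length.
        return ids + [0] * (self.sequence_length - len(ids))


texts = ["good movie", "bad movie"]
vec = ToyVectorizer(max_tokens=10, sequence_length=4)

before = [vec(t) for t in texts]  # lookup used without a vocabulary
vec.adapt(texts)
after = [vec(t) for t in texts]   # lookup used after learning a vocabulary

print("before adapt:", before)
print("after adapt: ", after)
```

If something similar is happening with the real layer, the model would receive the same token id for every word, which could plausibly explain accuracy near chance. Calling `vectorize_layer.adapt(...)` on the text column before fitting would be the corresponding thing to try; this is a guess based on the listing, not a confirmed diagnosis.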
