TransWikia.com

Same model/same data different results in Keras/TF

Data Science Asked by user116997 on August 17, 2021

To start: I used the identical dataset for all models, just in different forms.

I started with binary text classification, following the tutorial at https://keras.io/examples/nlp/text_classification_from_scratch/. The results on my dataset are quite promising, at ~80% accuracy. The dataset consisted of .txt files generated from the CSV.

Next, I wanted to add structured data to increase the accuracy, following this tutorial: https://www.tensorflow.org/tutorials/load_data/csv#mixed_data_types. But the results were quite bad, at ~50%. I then adapted the layers from the first tutorial, but the results are still bad. This time the dataset was read directly from the CSV.

Now I have removed all structured data from the CSV and left only the text in it, but the results are still not similar to those of the first model. Why? The dataset was again read directly from the CSV.

import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import numpy as np

titanic = pd.read_csv("Dataset.csv")
titanic.head()

titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('survived')

inputs = {}

for name, column in titanic_features.items():
  dtype = column.dtype
  if dtype == object:
    dtype = tf.string
  else:
    dtype = tf.float32

  inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)
  

numeric_inputs = {name:input for name,input in inputs.items()
                  if input.dtype==tf.float32}

x = layers.Concatenate(name='ConcatNumeric')(list(numeric_inputs.values()))

norm = preprocessing.Normalization(name='PrepNormalization')


norm.adapt(np.array(titanic[numeric_inputs.keys()]))
all_numeric_inputs = norm(x)


# Not adding numeric values to dataset
preprocessed_inputs = []


# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Instantiate the text vectorization layer. It normalizes, splits, and maps
# strings to integers, so 'output_mode' is set to 'int'.
# Note that this uses the default standardization and the default split
# function; no custom standardization is passed in here.
# An explicit maximum sequence length is also set, since the CNNs later in the
# model won't support ragged sequences.
vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

for name, input in inputs.items():
  if input.dtype == tf.float32:
    continue
  
  x = vectorize_layer(input)
  x = layers.Embedding(max_features + 1, embedding_dim)(x)
  x = layers.Dropout(0.5)(x)
  x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
  x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
  x = layers.GlobalMaxPooling1D(name='GlobalMaxPooling')(x)
  x = layers.Dense(128, activation="relu")(x)
  x = layers.Dropout(0.5)(x)
  x = layers.Dense(1, activation="sigmoid", name="predictionasds")(x)
  preprocessed_inputs.append(x)


titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs)

titanic_preprocessing.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

titanic_features_dict = {name: np.array(value) 
                         for name, value in titanic_features.items()}

titanic_preprocessing.fit(x=titanic_features_dict, y=titanic_labels, epochs=3)

tf.keras.utils.plot_model(model=titanic_preprocessing, rankdir="LR", dpi=72, show_shapes=True)
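
One difference from the first tutorial that may matter: the tutorial calls vectorize_layer.adapt(...) on the training text before building the model, while the code above never does. If the layer's vocabulary is never fitted, it is plausible that every word ends up mapped to the same out-of-vocabulary id, which would leave the network nothing to learn from and could keep accuracy near 50%. The sketch below is a toy pure-Python model of that effect, not the Keras implementation; build_vocab, vectorize, and the example sentences are hypothetical names and data introduced only for illustration.

```python
# Toy model of an integer vocabulary lookup. This is NOT Keras code; it only
# mimics the convention that index 0 is padding and index 1 is the
# out-of-vocabulary (OOV) token.
from collections import Counter

PAD, OOV = 0, 1


def build_vocab(texts, max_tokens):
    """Rank tokens by frequency, roughly what an adapt() step produces."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    # Reserve two slots for padding and OOV.
    ranked = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(ranked)}


def vectorize(text, vocab, length):
    """Map tokens to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(tok, OOV) for tok in text.lower().split()][:length]
    return ids + [PAD] * (length - len(ids))


# Hypothetical example sentences, not taken from the question's dataset.
texts = ["great movie", "terrible movie", "great acting"]

empty_vocab = {}                                  # lookup never fitted
fitted_vocab = build_vocab(texts, max_tokens=10)  # lookup fitted on the text

unfitted = [vectorize(t, empty_vocab, 4) for t in texts]
fitted = [vectorize(t, fitted_vocab, 4) for t in texts]

print(unfitted)  # every word collapses to the OOV id
print(fitted)    # different words get different ids
```

If this is what is happening here, the thing to try would be calling vectorize_layer.adapt on the text column's values before fit, and then checking whether the accuracy moves toward the ~80% of the first model.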
