Ask Ghassem - Recent questions tagged classification

Bankruptcy prediction and credit card

Sun, 10 Apr 2022 05:50:14 +0000

Hello everyone, newbie data scientist here.
I'm working on a project to predict the bankruptcy probability (probability of default) of companies and to assign them a credit rating/score based on that.
For example, a probability below 50% is good and above is bad (just as an example).
I have a dataset that contains financial ratios and a class indicating whether the company went bankrupt or not (0 or 1).
I'm planning to use these models:
Logistic regression, linear discriminant analysis, decision trees, random forest, ANN, AdaBoost, SVM.

The question is (and I know it is a dumb question):
Do these models return a probability that I can transform into labels? I saw that in a thesis and I'm not sure about it.

Otherwise, any guidance or tips will be appreciated.
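
For context on the question above: in scikit-learn, LogisticRegression, LinearDiscriminantAnalysis, DecisionTreeClassifier, RandomForestClassifier, AdaBoostClassifier and MLPClassifier expose a `predict_proba` method, while SVC does so only when it is constructed with `probability=True`. The sketch below shows the step that comes afterwards, turning a default probability into a 0/1 label and a rating band. It is plain Python; the function names, the band cutoffs and the sample probabilities are all made up for illustration and are not a real rating scale.

```python
from bisect import bisect_right

# Hypothetical rating bands: each edge is the upper bound of a band.
# These cutoffs are illustrative only, not a real rating scale.
BAND_EDGES = [0.10, 0.25, 0.50, 0.75]
BAND_NAMES = ["A", "B", "C", "D", "E"]


def to_label(prob_default, threshold=0.5):
    """Return 1 (bankrupt) if the probability reaches the threshold, else 0."""
    return 1 if prob_default >= threshold else 0


def to_rating(prob_default):
    """Map a default probability in [0, 1] to a rating band name."""
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # bisect_right counts how many band edges are <= the probability.
    return BAND_NAMES[bisect_right(BAND_EDGES, prob_default)]


# Made-up probabilities standing in for the positive-class column
# of a model's predict_proba output.
probs = [0.03, 0.18, 0.49, 0.50, 0.92]
labels = [to_label(p) for p in probs]
ratings = [to_rating(p) for p in probs]
print(labels)
print(ratings)
```

The 0.5 threshold mirrors the example in the post; in practice it would probably be tuned, since bankrupt companies are usually a small minority of the data.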

How to perform a classification or regression using k-NN?

Thu, 27 Jun 2019 02:54:42 +0000

Suppose you are given the following dataset, where x and y are the two features and the color (Red or Blue) is the target variable.

a) A new data point $x=1$ and $y=1$ is given. Using Euclidean distance in 3-NN, what would you predict as the color for this data point?

Dataset
x	y	Color
-1	1	Red
0	1	Blue
0	2	Red
1	-1	Red
1	0	Blue
1	2	Blue
2	2	Red
2	3	Blue

b) Now assume we have the following dataset and the target value is the price. A new data point $x=1$ and $y=1$ is given. Using Euclidean distance in 3-NN, what would be the estimated price?

Dataset
x	y	Price
-1	1	$100
0	1	$50
0	2	$20
1	-1	$40
1	0	$30
1	2	$40
2	2	$70
2	3	$30
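
One way to check both parts by hand is to sort the points by Euclidean distance to $(1, 1)$, keep the three closest, then take a majority vote for the color and the mean for the price. The sketch below does exactly that with the two tables above; the function names are my own, and it ignores distance ties at the cutoff, which do not occur for this query point.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

# (features, target) pairs copied from the two tables above.
color_data = [((-1, 1), "Red"), ((0, 1), "Blue"), ((0, 2), "Red"),
              ((1, -1), "Red"), ((1, 0), "Blue"), ((1, 2), "Blue"),
              ((2, 2), "Red"), ((2, 3), "Blue")]
price_data = [((-1, 1), 100), ((0, 1), 50), ((0, 2), 20),
              ((1, -1), 40), ((1, 0), 30), ((1, 2), 40),
              ((2, 2), 70), ((2, 3), 30)]


def k_nearest(points, query, k=3):
    """Return the k (features, target) pairs closest to the query."""
    return sorted(points, key=lambda p: dist(p[0], query))[:k]


def knn_classify(points, query, k=3):
    """Majority vote over the targets of the k nearest neighbours."""
    votes = Counter(target for _, target in k_nearest(points, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(points, query, k=3):
    """Unweighted mean of the targets of the k nearest neighbours."""
    neighbours = k_nearest(points, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)


query = (1, 1)
predicted_color = knn_classify(color_data, query)  # "Blue"
predicted_price = knn_regress(price_data, query)   # 40.0
print(predicted_color, predicted_price)
```

The three nearest neighbours are $(0,1)$, $(1,0)$ and $(1,2)$, each at distance 1. All three are Blue, and their prices average to $(50 + 30 + 40)/3 = 40$.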

Passing variable length sentences to Tensorflow LSTM

Mon, 11 Feb 2019 05:06:27 +0000

I have a TensorFlow LSTM model for predicting sentiment. I built the model with a maximum sequence length of 150 (the maximum number of words). To make predictions, I have written the code below:

import numpy as np

batchSize = 32
maxSeqLength = 150

# cleanSentences() and wordsList are assumed to be defined elsewhere.
def getSentenceMatrix(sentence):
    sentenceMatrix = np.zeros([batchSize, maxSeqLength], dtype='int32')
    cleanedSentence = cleanSentences(sentence)
    cleanedSentence = ' '.join(cleanedSentence.split()[:150])
    split = cleanedSentence.split()
    for indexCounter, word in enumerate(split):
        try:
            sentenceMatrix[0, indexCounter] = wordsList.index(word)
        except ValueError:
            sentenceMatrix[0, indexCounter] = 399999  # Index for unknown words
    return sentenceMatrix

input_text = "example data"
inputMatrix = getSentenceMatrix(input_text)

In the code, I'm truncating my input text to 150 words and ignoring the remaining data. Because of this, my predictions are wrong.

cleanedSentence = ' '.join(cleanedSentence.split()[:150])

I know that if the input is shorter than the sequence length, we can pad it with zeros. What should we do if it is longer? Can you suggest the best way to handle this? Thanks in advance.
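
One option that is sometimes used for inputs longer than the model's sequence length is to split the text into windows of at most 150 words, score each window separately, and combine the scores, for example by averaging. The sketch below only covers the splitting and averaging steps in plain Python; `split_into_windows`, `average_scores` and the `stride` parameter are hypothetical names, and the per-window scores are made up. Whether this beats truncation depends on the data, so treat it as something to try rather than as the best way.

```python
def split_into_windows(tokens, max_len=150, stride=None):
    """Split a token list into windows of at most max_len tokens.

    stride is the step between window starts; a stride smaller than
    max_len makes consecutive windows overlap.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    stride = max_len if stride is None else stride
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in 1..max_len so no token is skipped")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return windows


def average_scores(scores):
    """Combine per-window scores into one document-level score."""
    return sum(scores) / len(scores)


tokens = ["w%d" % i for i in range(400)]   # stand-in for a 400-word text
windows = split_into_windows(tokens, max_len=150)
print([len(w) for w in windows])

# Made-up per-window sentiment scores, one per window.
document_score = average_scores([0.8, 0.5, 0.2])
print(document_score)
```

Each window can then be padded and indexed the same way as a short sentence. The last window is usually shorter than 150 words, so it gets zero-padded like any other short input.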

Using Tensorflow.DNNClassifier, getting Error: assertion failed: [Labels must >= 0]

Wed, 24 Oct 2018 03:12:33 +0000

Hi All,

I am writing a simple program using TensorFlow and DNNClassifier. The training data is 9 pixels with four spectral bands each, i.e. 4*9=36 features. Each data point is mapped to a class (from 1 to 7).

The last value is the class label.

A single data point looks like this:

67,75,77,62,67,79,81,62,75,87,89,71,66,79,88,63,66,79,84,63,66,79,80,59,67,84,86,68,71,84,86,64,67,81,82,64,7

But I got the error below:

InvalidArgumentError (see above for traceback): assertion failed: [Labels must >= 0] [Condition x >= 0 did not hold element-wise:] [x (dnn/head/labels:0) = ] [[3][3][3]...]

I am sure there is no datapoint which has a label less than 0. Would you please advise?

import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit

print('** DNN Classification *******************************************************')

landsatData = pd.read_csv("./resources/landsat/lantsat.1.csv")

landsatData.describe()

X_landSatAllFeatures = landsatData.iloc[:, np.arange(36)].copy()

y_midPixelAsTarget = landsatData.iloc[:, 36].copy()

# Train/test split (stratified + shuffled) based on the row index
allFeaturesIndexes = X_landSatAllFeatures.index
targetData = y_midPixelAsTarget
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

for train_index, test_index in sss.split(allFeaturesIndexes, targetData):
    train_ind, test_ind = allFeaturesIndexes[train_index], allFeaturesIndexes[test_index]

Test_Matrix = X_landSatAllFeatures.loc[test_ind]
Test_Target_Matrix = y_midPixelAsTarget.loc[test_ind]
Train_Matrix = X_landSatAllFeatures.loc[train_ind]
Train_Target_Matrix = y_midPixelAsTarget.loc[train_ind]

scaler = StandardScaler().fit(Train_Matrix)
Train_Matrix, Test_Matrix = scaler.transform(Train_Matrix), scaler.transform(Test_Matrix)

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

X_train = Train_Matrix
y_train = Train_Target_Matrix
X_test = Test_Matrix
y_test = Test_Target_Matrix

xx, yy = Train_Matrix.shape
#training phase
feature_cols = [tf.feature_column.numeric_column("X", shape=[36])]
dnn_clf = tf.estimator.DNNClassifier(hidden_units=[300,100], n_classes=8, feature_columns=feature_cols)
# dnn_clf = tf.estimator.DNNClassifier(hidden_units=[300,100], n_classes=10)


input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"X": X_train}, y=y_train, num_epochs=40, batch_size=64, shuffle=True)
dnn_clf.train(input_fn=input_fn)

#testing phase
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"X": X_test}, y=y_test, shuffle=False)
eval_results = dnn_clf.evaluate(input_fn=test_input_fn)
print("The prediction result is : {0:.2f}%".format(100*eval_results['accuracy']))
y_pred_iter = dnn_clf.predict(input_fn=test_input_fn)
y_pred = list(y_pred_iter)
y_pred[0]


print('**********************************************************************************')
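
The following is only a sanity check that could be run on the label column before training; it does not claim to identify the cause of the error above. DNNClassifier expects integer class labels in the range 0 to n_classes - 1 unless a label vocabulary is supplied, so the sketch reports the smallest and largest label, whether every label is a whole number, and whether all labels fall in that range. It is plain Python, the function names are hypothetical, and the sample labels are made up (the post only says the classes run from 1 to 7).

```python
def summarize_labels(labels, n_classes):
    """Report basic facts about a label column for an n_classes classifier."""
    distinct = sorted(set(labels))
    return {
        "min": distinct[0],
        "max": distinct[-1],
        "distinct": distinct,
        # Whole numbers only: 3.0 passes, 3.5 does not.
        "all_integers": all(float(v).is_integer() for v in distinct),
        # Every label must satisfy 0 <= label < n_classes.
        "in_range": all(0 <= v < n_classes for v in distinct),
    }


def remap_to_zero_based(labels):
    """Map the distinct labels, in sorted order, onto 0, 1, 2, ..."""
    mapping = {v: i for i, v in enumerate(sorted(set(labels)))}
    return [mapping[v] for v in labels], mapping


sample_labels = [3, 3, 3, 7, 1, 4]          # made-up values
report = summarize_labels(sample_labels, n_classes=8)
remapped, mapping = remap_to_zero_based(sample_labels)
print(report)
print(remapped, mapping)
```

If the report comes back clean, the problem probably lies elsewhere, for example in how the labels reach the input function.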

What are Training set, Validation set, Test set, and Gold set in supervised and unsupervised machine learning?

Mon, 08 Oct 2018 11:48:29 +0000

What are the most important machine learning algorithms?

Mon, 08 Oct 2018 11:43:59 +0000

Is Naive Bayes a good classifier?

Thu, 04 Oct 2018 00:23:00 +0000

Here is an example of training a model using the Naïve Bayes classifier on the Glass dataset (from UCI). The objective is to predict the type of glass based on the 9 parameters. The metrics used to understand the classification result are the confusion matrix and the classification report.

The program is available here

A few observations/questions:

By varying the ‘random_state’ value inside the function train_test_split, we can observe different accuracy values. Is this behavior correct?
The StratifiedShuffle method of train_test_split also produces different results on every run. Is there a bug in the Naïve Bayes classifier implementation?
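
As background for the two questions above: `random_state` seeds the pseudo-random shuffle that decides which rows land in the training and test sets, so a different seed gives a different split and, usually, a slightly different accuracy. The sketch below imitates that behavior with the standard library only. `split_indices` is a hypothetical helper, not scikit-learn's implementation, and the sample count of 100 is arbitrary.

```python
import random


def split_indices(n_samples, test_fraction=0.3, seed=None):
    """Shuffle row indices with a seeded generator and split them.

    Returns (train_indices, test_indices). With seed=None the shuffle
    is not reproducible from run to run.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(100, seed=42)
train_b, test_b = split_indices(100, seed=42)   # same seed, same split
train_c, test_c = split_indices(100, seed=7)    # different seed

print(test_a == test_b)   # the same seed reproduces the split
print(test_a == test_c)   # a different seed changes the test rows
```

Because the model is then evaluated on different test rows, some variation in accuracy across seeds is expected and does not by itself point to a bug in the classifier.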

Explain Cross-validation and why should we use it in Machine Learning?

Fri, 28 Sep 2018 15:44:58 +0000

What is Bayes’ Theorem? How is it useful in machine learning? Where should we use it?

Thu, 27 Sep 2018 05:27:55 +0000