Ask Ghassem - Recent questions tagged linear-regression

How to calculate the residual errors, MSE, MAE, and RMSE?

Fri, 27 Jan 2023 04:09:28 +0000

Given the following sample dataset with 5 samples and 2 features:

Sample	Feature 1	Feature 2	Actual Value	Predicted Value
1	2	3	4	6
2	3	4	5	6
3	4	5	6	7
4	5	6	7	8
5	6	7	8	9

Calculate the residual errors, mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE) using a sample model.
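
A minimal sketch of the calculation, using the Actual Value and Predicted Value columns from the table above. It assumes the residual is defined as actual minus predicted (some texts use the opposite sign), and it ignores the two feature columns because the predictions are already given.

```python
import math

# Actual and predicted values copied from the table above
actual = [4, 5, 6, 7, 8]
predicted = [6, 6, 7, 8, 9]

# Assumed convention: residual = actual - predicted
residuals = [a - p for a, p in zip(actual, predicted)]

n = len(residuals)
mse = sum(r ** 2 for r in residuals) / n   # mean squared error
mae = sum(abs(r) for r in residuals) / n   # mean absolute error
rmse = math.sqrt(mse)                      # root mean squared error

print("Residuals:", residuals)
print("MSE:", mse)
print("MAE:", mae)
print("RMSE:", round(rmse, 4))
```

With this convention the residuals are -2, -1, -1, -1, -1, which gives MSE = 1.6, MAE = 1.2, and RMSE of roughly 1.265.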

Bankruptcy prediction and credit card

Sun, 10 Apr 2022 05:50:14 +0000

Hello everyone, newbie data scientist here.
I'm working on a project to predict the bankruptcy probability (probability of default) of companies and to assign them a credit rating/score based on that.
For example, a probability below 50% is good and above is bad (just for the example).
I have a dataset that contains financial ratios and a class indicating whether the company went bankrupt or not (0 and 1).
I'm planning to use these models:
Logistic regression, linear discriminant analysis, decision trees, random forest, ANN, AdaBoost, SVM.

The question is, and I know it is a dumb question:
Do those models return a probability that I can transform into labels? I saw that in a thesis and I'm not sure about it.

Otherwise, any guidance or tips would be appreciated.
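
One way to picture the probability-to-label step: logistic regression gets its probability by passing a linear score through the logistic (sigmoid) function, and a cutoff then turns that probability into a label. The sketch below assumes the 50% cutoff from the example in the question and treats exactly 50% as "bad"; the scores and the helper names `sigmoid` and `to_label` are hypothetical and not taken from any library.

```python
import math

def sigmoid(score):
    """Map a real-valued model score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def to_label(probability, threshold=0.5):
    """Assumed rule: at or above the threshold is 'bad', below is 'good'."""
    return "bad" if probability >= threshold else "good"

# Hypothetical scores, for illustration only
for score in (-2.0, 0.0, 1.5):
    p = sigmoid(score)
    print(f"score={score:+.1f}  probability={p:.3f}  label={to_label(p)}")
```

The threshold does not have to be 0.5; it is a design choice that trades off false alarms against missed defaults.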

How to calculate residual errors for linear regression and interpret regression metrics?

Tue, 18 Feb 2020 18:30:51 +0000

Assuming we have a linear regression equation and some data points (a sample), how can we calculate the residual error for each data point, and the total cost based on metrics such as MAE, MSE, RMSE, MAPE, or MPE, given their formulas?
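
For reference, the usual definitions are sketched below. They assume $n$ data points with actual values $y_i$ and predictions $\hat{y}_i = ax_i + b$ from the regression equation, and a residual defined as actual minus predicted; some texts flip that sign, which changes the sign of MPE only. MAPE and MPE also assume that no $y_i$ is zero.

$$e_i = y_i - \hat{y}_i$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |e_i| \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^2 \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n} \left|\frac{e_i}{y_i}\right| \qquad \mathrm{MPE} = \frac{100}{n}\sum_{i=1}^{n} \frac{e_i}{y_i}$$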

How to create a function around cross_val_score for a linear regression problem?

Wed, 03 Jul 2019 13:37:35 +0000

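import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score

# Renamed so the helper does not shadow sklearn's cross_val_score
def display_cv_scores(estimator, X, y, scoring, cv):
    scores = cross_val_score(estimator, X, y, scoring=scoring, cv=cv)
    scores_rmse = np.sqrt(-scores)  # scores are negative MSE, so negate before the root
    print("Scores:", scores_rmse)
    print("Mean:", scores_rmse.mean())
    print("Standard deviation:", scores_rmse.std())

This is the function I created, which I call as follows:

display_cv_scores(SGDRegressor(), X_train, y_train, scoring='neg_mean_squared_error', cv=5)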

I am getting the error below:

ValueError                                Traceback (most recent call last)
<ipython-input-181-275d240df219> in <module>()
----> 1 plot_validation_curve(lin_reg_SGD,X_train,y_train,'alpha', [0.001,0.01],scoring='neg_mean_squared_error',cv=5)
3 frames

/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    203     if len(uniques) > 1:
    204         raise ValueError("Found input variables with inconsistent numbers of"
--> 205                          " samples: %r" % [int(l) for l in lengths])
    206 
    207 

ValueError: Found input variables with inconsistent numbers of samples: [13903, 13903, 22]

How do I plot the linear classifier calculated with LIBLINEAR using sklearn?

Thu, 16 May 2019 08:13:06 +0000

Make a scatter plot where the x-axis is the height of the citizens and the y-axis is the weight of the citizens. The color of the points needs to be different for males and females. In the same figure, plot the linear classifier calculated with LIBLINEAR using sklearn.

How can I convert an LSTM model to a linear regression model?

Mon, 29 Apr 2019 11:50:07 +0000

Here is my LSTM prediction model, and I want to convert it to linear regression.

...
model.fit(x_train, y_train, epochs=10, batch_size=16)

trainPredict = model.predict(x_train)
testPredict = model.predict(x_test)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([y_train])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([y_test])

I tried:

y = trainPredict
x = range(0,len(y))
XGBModel = XGBRegressor()
XGBModel.fit(x,y, verbose=False)

And the result is:

Check failed: preds.Size() == info.labels_.Size() (1 vs. 56969) labels are not correctly providedpreds.size=1, label.size=56969'

I don't know why this error occurs. How can I solve this problem?

How to calculate univariate linear regression?

Thu, 11 Apr 2019 16:46:47 +0000

For the following dataset, calculate the regression equation $\hat{y} = ax+b$

dataset
x	y
1	42
3	50
10	75
16	100
26	150
36	200
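
A minimal sketch of one way to get $a$ and $b$, assuming ordinary least squares is the intended method: the slope is $a = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and the intercept is $b = \bar{y} - a\bar{x}$. The variable names are illustrative.

```python
# (x, y) pairs copied from the dataset above
xs = [1, 3, 10, 16, 26, 36]
ys = [42, 50, 75, 100, 150, 200]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Sums of cross-deviations and squared deviations
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)

a = s_xy / s_xx          # slope
b = y_mean - a * x_mean  # intercept

print(f"y_hat = {a:.3f} * x + {b:.3f}")
```

For this dataset the fit comes out at roughly $\hat{y} = 4.509x + 33.696$.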

How to update weights using the gradient descent algorithm?

Thu, 28 Mar 2019 17:17:39 +0000

For the neural network below, imagine we are going to use the backpropagation algorithm to update the weights. The bias (b) in this problem is always 0 (ignore the bias when you solve the problem), the dataset has only one record with $x=2$ and a target value of $y=5$, as shown in the following table, and the activation function is defined as $f(z) = z$.

feature (x)	Target (y)
2	5

1) Define the cost function, $J(w)$, based on the error in backpropagation algorithm: $J(w) = E = \frac{1}{2}(predicted - target)^2$, and draw it

2) Initialize the weight by $w=3$, and calculate the error

3) Calculate the updated weights using the gradient descent algorithm after three updates for each of the following values of the learning rate ($\alpha$):

$\alpha$ = 1
$\alpha$ = 0.1
$\alpha$ = 0.5

Hint: $w_{new} = w_{old} - \alpha \frac{\partial E}{\partial w}$

https://i.imgur.com/uohFS6l.png
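
The three updates can be sketched as follows, assuming the network reduces to a single weight so that the prediction is $f(wx) = wx$. Under that assumption $J(w) = \frac{1}{2}(wx - y)^2$ and $\frac{\partial E}{\partial w} = (wx - y)x$. The function names are illustrative.

```python
x, y = 2.0, 5.0   # the single record from the table
w_init = 3.0      # initial weight from step 2

def cost(w):
    """J(w) = 1/2 * (predicted - target)^2 with predicted = w * x."""
    return 0.5 * (w * x - y) ** 2

def gradient(w):
    """dE/dw = (w * x - y) * x."""
    return (w * x - y) * x

def run_updates(alpha, steps=3):
    """Apply w_new = w_old - alpha * dE/dw and return every weight visited."""
    w = w_init
    history = [w]
    for _ in range(steps):
        w = w - alpha * gradient(w)
        history.append(w)
    return history

print("Initial error:", cost(w_init))
for alpha in (1, 0.1, 0.5):
    print("alpha =", alpha, [round(w, 4) for w in run_updates(alpha)])
```

The initial error is 0.5. With $\alpha = 1$ the weight goes 3, 1, 7, -11 and diverges; with $\alpha = 0.1$ it goes 3, 2.8, 2.68, 2.608 and approaches the minimum at $w = 2.5$; with $\alpha = 0.5$ it oscillates 3, 2, 3, 2.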

Looking for guidance on whether I have the necessary data to answer a Regression question

Sun, 24 Mar 2019 23:40:09 +0000

Hi everyone.

I'm currently working on my final project for a Data Science degree and after a month of literature review, exploratory analysis and model testing, I'm not sure if the questions I set out to answer are suitable for the data I have.

This is a very broad question I'm asking here, as it's more guidance than anything else, so if this is not the place to ask, I would appreciate it if you could redirect me to the right place.

You can find the data sets and code on my GitHub here. The code is messy but working; I only picked up programming last year.

The data

Indoor Air Quality data recorded hourly by 4 sensors (Kitchen, Bedroom, Living Room, Bathroom) for 7 days per house, for a total of 3 houses. For 6 of those days, each sensor was in a different room, and on the last day all sensors were placed together so we could see how far apart their signals were and account for that. So here I have 9 continuous variables: Temperature, Relative Humidity, CO, CO2, TVOC, PM2.5, NO2, Ozone and Air Pressure.

I then got 3 manually-filled questionnaires on Occupant Activity, one for each house, covering activities such as "Door open/closed", "Window open/closed", "Heating On/off", "Frying", "Boiling", "Hoovering", "Mopping", etc. Now, these logs were missing a lot of data.

These questionnaires were a mess and a lot of the missing values had to be imputed. This data is reported in binary format, such as "Did Activity X occur at hour Y?" - Yes (1)/No (0).

With this project I've chosen to predict a sensor data variable (in this case CO2), based on activities.

Models

Just to get a feel for the data, I ran Linear Regression, Decision Tree and Random Forest models on individual rooms of each house, once with only Occupant Activity predictors and once with both Occupant Activity and the other sensor variables as predictors, and the results are just atrocious in every case. Cross-validation shows the models' performance to be all over the place, and checking features for statistical significance gives me different significant features in every room of every house; it's like I'm playing feature roulette. The problem with some features such as Mopping, Frying, Boiling and Hoovering is that there are a lot of "0"s compared to "1"s due to the nature of the feature, so one or two "1"s in the wrong place are enough to give a misleading correlation.

As you can tell from this, I'm still a Data Scientist in training, having only built a few models in the past and being rather new to programming (1 year of experience).

What I'm looking for

I suppose that more than anything, I'm asking for guidance on whether pursuing this as a Regression problem is feasible or not.

I'm very short on time, but if this won't work, I can look into alternatives. For instance, air pollutants have safety thresholds. I could create a class feature indicating whether the value is over the threshold or not and turn it into a classification problem, or even a clustering one to identify the room based on activities and air pollutants.

Bottom line is that I have a 12,500-word paper to deliver in a month, and I've been at this for a month already with nothing to show for it, so I'm hoping someone with more experience under their belt could see if I'm chasing a dead end. Any help in the form of guidance would be so very much appreciated; I've run out of ideas here.

Thanks,

Tom

What is residual in the context of linear regression?

Tue, 30 Oct 2018 11:25:42 +0000

Please explain Linear Regression with an example?

Fri, 12 Oct 2018 02:14:00 +0000