
pandas - Adding New Columns and Rows to DataFrame

by Jyubaeng2 2023. 7. 30.

Adding New Columns with pandas

Since the Boston Housing Prices dataset does not contain a meaningful categorical variable, we can create one from an existing numerical feature. Let's create a new column named "Age_Category" based on the "AGE" feature. In this dataset, "AGE" is the percentage of owner-occupied units built before 1940, so the labels describe the housing stock of an area rather than a person. We'll group the data into three categories: "Young", "Middle-aged", and "Old". The bin boundaries are arbitrary and chosen only for demonstration purposes.

 

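import pandas as pd

# boston_df is assumed to be loaded already (see the linked loc/iloc post for the data import).
# Feature engineering: Create a new column "Age_Category" based on "AGE" feature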
age_bins = [0, 30, 60, float('inf')]
age_labels = ['Young', 'Middle-aged', 'Old']

boston_df['Age_Category'] = pd.cut(boston_df['AGE'], bins=age_bins, labels=age_labels)

print(boston_df[['AGE', 'Age_Category']].head(10))

In this example, we used the pd.cut() function from pandas to create the new "Age_Category" column from the "AGE" feature. The bin edges [0, 30, 60, float('inf')] define three intervals, (0, 30], (30, 60], and (60, inf), which are labeled 'Young', 'Middle-aged', and 'Old', respectively. pd.cut() assigns each value in the "AGE" column to the interval it falls in. By default the intervals are closed on the right, so a value of exactly 30 is labeled 'Young', and a value of exactly 0 would fall outside every bin and become NaN unless include_lowest=True is passed.

This new "Age_Category" column can now be used as a categorical variable for further analysis, visualization, or modeling purposes. You can similarly create other meaningful columns for feature engineering based on the existing features in the dataset.
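
To make the bin boundaries concrete, here is a minimal sketch on a small made-up frame. The `demo` name and its values are assumptions for illustration, not rows from the Boston data. It compares the default right-closed bins with include_lowest=True.

```python
import pandas as pd

# Hypothetical stand-in for boston_df['AGE']; the values are made up.
demo = pd.DataFrame({"AGE": [0.0, 15.2, 30.0, 45.8, 60.0, 98.1]})

age_bins = [0, 30, 60, float("inf")]
age_labels = ["Young", "Middle-aged", "Old"]

# Default: intervals are (0, 30], (30, 60], (60, inf), so 0.0 matches no bin.
demo["default"] = pd.cut(demo["AGE"], bins=age_bins, labels=age_labels)

# include_lowest=True also closes the first interval on the left: [0, 30].
demo["with_lowest"] = pd.cut(
    demo["AGE"], bins=age_bins, labels=age_labels, include_lowest=True
)

print(demo)
# Count rows per category, keeping the label order instead of sorting by count.
print(demo["with_lowest"].value_counts(sort=False))
```

The result is a categorical column whose categories keep the order of the labels, which is convenient for grouped summaries and plots.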

 

Next, let's create a new numerical column named "Price_per_Room" that divides the housing price by the number of rooms. This gives a rough price-per-room measure for each row of the dataset.

 

# Feature engineering: Create a new column "Price_per_Room" for average housing price per room
boston_df['Price_per_Room'] = boston_df['PRICE'] / boston_df['RM']

print(boston_df[['RM', 'PRICE', 'Price_per_Room']].head(10))

In this example, we calculated "Price_per_Room" by dividing the "PRICE" column by the "RM" column for each row. In the Boston data, "PRICE" is the median home value of an area in thousands of dollars and "RM" is the average number of rooms per dwelling, so the result is an approximate price per room in thousands of dollars. The division is vectorized: pandas applies it element-wise across the two columns without an explicit loop.

The "Price_per_Room" column can now be used as a continuous numerical feature for further analysis, visualization, or modeling. Because it normalizes price by size, it helps compare areas whose homes differ in the number of rooms.
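
As an alternative to assigning the column in place, the same ratio can be computed with DataFrame.assign(), which returns a new DataFrame and leaves the original untouched. This is a minimal sketch on illustrative values. The `demo` and `with_ratio` names are assumptions made for this example.

```python
import pandas as pd

# Illustrative values standing in for a few rows of boston_df.
demo = pd.DataFrame({"RM": [6.575, 6.421, 7.185], "PRICE": [24.0, 21.6, 34.7]})

# assign() builds the new column from a callable and returns a copy;
# `demo` itself does not gain the column.
with_ratio = demo.assign(Price_per_Room=lambda d: d["PRICE"] / d["RM"])

print(with_ratio)
print("Price_per_Room" in demo.columns)  # False: the original is unchanged
```

This style is handy when several derived columns are chained together, since each step returns a DataFrame that the next step can build on.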

 

Adding New Rows with pandas

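Suppose the new row contains the following values. The dictionary covers only the original columns, so derived columns such as "Age_Category" and "Price_per_Room" will be missing (NaN) for a row added with pd.concat() or .append() until they are recomputed: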

new_row_data = {
    'CRIM': 0.04258,
    'ZN': 0.0,
    'INDUS': 5.19,
    'CHAS': 0.0,
    'NOX': 0.515,
    'RM': 6.5,
    'AGE': 65.0,
    'DIS': 5.615,
    'RAD': 5.0,
    'TAX': 224.0,
    'PTRATIO': 20.2,
    'B': 370.73,
    'LSTAT': 13.34,
    'PRICE': 28.7
}

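There are several ways to add a row to a DataFrame: DataFrame.append(), pd.concat(), and .loc[]. Note that .append() was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat() is the replacement to use on current versions. .iloc[] cannot add rows, because it only addresses positions that already exist.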

 

Using .append()

# Using .append() with a one-row DataFrame (works only on pandas < 2.0; removed in 2.0):
new_index = boston_df.index.max() + 1
new_df = pd.DataFrame(new_row_data, index=[new_index])
boston_df = boston_df.append(new_df)
boston_df.loc[boston_df.index.max(),:]

Using pd.concat()

# Using pd.concat() with a one-row DataFrame:
new_index = boston_df.index.max() + 1
new_df = pd.DataFrame(new_row_data, index=[new_index])
boston_df = pd.concat([boston_df, new_df])
boston_df.loc[boston_df.index.max(),:]
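
When the new row lacks some of the DataFrame's columns, pd.concat() aligns on column names and fills the gaps with NaN. The sketch below shows this on a small made-up frame and then recomputes the derived column. The names `demo`, `new_row`, `combined`, and `missing_before` are assumptions for illustration.

```python
import pandas as pd

# Made-up frame that already has a derived column.
demo = pd.DataFrame({"RM": [6.5, 7.1], "PRICE": [24.0, 34.7]})
demo["Price_per_Room"] = demo["PRICE"] / demo["RM"]

# The new row has no 'Price_per_Room' key.
new_row = {"RM": 6.0, "PRICE": 28.7}

# Wrapping the dict in a list yields a one-row DataFrame.
new_df = pd.DataFrame([new_row])

# ignore_index=True renumbers the result 0..n-1, so no manual index is needed.
combined = pd.concat([demo, new_df], ignore_index=True)

# The column missing from the new row is NaN there.
missing_before = combined["Price_per_Room"].isna().tolist()
print(missing_before)

# Recompute the derived column so the new row is filled in as well.
combined["Price_per_Room"] = combined["PRICE"] / combined["RM"]
print(combined)
```

Using ignore_index=True avoids computing index.max() + 1 by hand, at the cost of discarding any meaningful index labels.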

Using .loc[]

# Using .loc[] and a new index label with a dictionary:
boston_df.loc[boston_df.index.max() + 1] = new_row_data
boston_df.loc[boston_df.index.max(),:]

Using .append() with a dictionary

# Using .append() with a dictionary (works only on pandas < 2.0; ignore_index=True renumbers the index):
boston_df = boston_df.append(new_row_data, ignore_index=True)
boston_df.loc[boston_df.index.max(),:]
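
The difference between label-based and position-based assignment can be checked directly. In this minimal sketch, .loc[] with a label that does not exist yet adds a row, while .iloc[] with a position past the end raises an IndexError. The `demo` frame and the `iloc_error` variable are assumptions for illustration, and the frame is assumed to have the default integer index.

```python
import pandas as pd

# Small made-up frame with the default integer index 0..n-1.
demo = pd.DataFrame({"RM": [6.5, 7.1], "PRICE": [24.0, 34.7]})

# .loc[] with a new label appends a row ("setting with enlargement").
demo.loc[len(demo)] = [6.0, 28.7]

# .iloc[] only addresses existing positions, so this cannot add a row.
try:
    demo.iloc[len(demo)] = [5.9, 20.0]
    iloc_error = None
except IndexError as exc:
    iloc_error = str(exc)

print(demo)
print(iloc_error)
```

Adding rows one at a time with .loc[] is fine for a handful of rows. For many rows, collecting them in a list and calling pd.concat() once is usually faster.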

 

 

Related posts:

Subsetting Rows with Categorical Variables
https://ai-fin-tech.tistory.com/entry/Subsetting-Rows-with-Categorical-Variables

Complete Usage of loc and iloc with pandas
https://ai-fin-tech.tistory.com/entry/Complete-Usage-of-loc-and-iloc-with-pandas

 
