Property Cost (Analysis and Cleaning)

Aadil Aftab Shaik
3 min read · Dec 24, 2020

This is a data cleaning challenge set by a friend: analysing and cleaning property cost data from Dubai. I used the Pandas, NumPy, Seaborn, and Scikit-learn libraries, along with regular expressions, to clean and analyse the data.

I imported the CSV file of Dubai property costs into a variable named ‘data’ and started exploring it.

data.head()
data.dtypes
data.describe()

As you can see, the data is messy. First, I cleaned the ‘cost’ and ‘Area’ features with the help of the regular expressions library.

data['cost'].unique()
data['Area'].unique()

I wrote a loop that goes over the ‘cost’ feature and extracts the numeric value from each string. Then, I built a series from these numbers and used it to replace the ‘cost’ feature.
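The loop over ‘cost’ could be sketched roughly as below. The sample strings are made up for illustration (the real values in the scraped CSV may be formatted differently), and the pattern assumes the first run of digits and commas in each string is the price.

```python
import re

import pandas as pd

# Hypothetical sample rows; the actual 'cost' strings may look different.
data = pd.DataFrame({"cost": ["AED 1,200,000", "850,000 AED", "AED 75,000 yearly"]})

numbers = []
for raw in data["cost"]:
    # Take the first run of digits (with optional thousands separators).
    match = re.search(r"\d[\d,]*", str(raw))
    numbers.append(int(match.group().replace(",", "")))

# Replace the string column with the extracted numbers.
data["cost"] = pd.Series(numbers, index=data.index)
print(data["cost"].tolist())
```

This sketch would fail on a string that contains no digits at all, so it only suits a column where every row carries a price.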

I did the same with the ‘Area’ feature, but I wrapped the extraction in try-except so that NaN values would not raise an error.
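A minimal sketch of the try-except variant for ‘Area’, again on made-up sample values. It relies on the fact that a missing value is a float NaN, so passing it to `re.search` raises a `TypeError`; keeping such rows as NaN is an assumption about how they were handled at this stage.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample rows, including one missing value.
data = pd.DataFrame({"Area": ["1,250 sqft", np.nan, "780 sqft"]})

areas = []
for raw in data["Area"]:
    try:
        # re.search raises TypeError when raw is NaN (a float, not a string).
        match = re.search(r"\d[\d,]*", raw)
        # match is None (AttributeError on .group) when no digits are found.
        areas.append(float(match.group().replace(",", "")))
    except (TypeError, AttributeError):
        areas.append(np.nan)  # keep the gap so it can be filled later

data["Area"] = pd.Series(areas, index=data.index)
print(data["Area"].tolist())
```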

Next comes the cleaning of the ‘no.of bed’ and ‘Bathrooms’ features. Here, I simply replaced ‘studio’ with 0 and ‘7+’ with 8.

data['no.of bed'].unique()
data['Bathrooms'].unique()

Then, I converted both columns to a numeric type.

data.dtypes
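The replacement of ‘studio’ and ‘7+’ followed by the numeric conversion might look like the sketch below. The sample values are invented, and the exact spelling of the labels in the real data (for example ‘Studio’ with a capital letter) is an assumption.

```python
import pandas as pd

# Hypothetical sample rows using the column names from this post.
data = pd.DataFrame({
    "no.of bed": ["studio", "2", "7+", "3"],
    "Bathrooms": ["1", "7+", "2", "3"],
})

for column in ["no.of bed", "Bathrooms"]:
    # Map the two non-numeric labels to numbers stored as strings.
    cleaned = data[column].replace({"studio": "0", "7+": "8"})
    # Convert to a numeric dtype; anything unparseable becomes NaN.
    data[column] = pd.to_numeric(cleaned, errors="coerce")

print(data.dtypes)
```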

The data has two categorical features, ‘type’ and ‘Location’. I replaced each category with a rank number, assigned in ascending order of cost.

data['type'].unique()
data['Location'].unique()
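One way to turn categories into cost-based ranks is sketched below for ‘type’. It assumes the ranking is based on the median cost of each category, which the post does not specify, and the sample rows are made up. The same steps would apply to ‘Location’.

```python
import pandas as pd

# Hypothetical sample rows.
data = pd.DataFrame({
    "type": ["Apartment", "Villa", "Apartment", "Townhouse", "Villa"],
    "cost": [50000, 200000, 70000, 120000, 260000],
})

# Median cost per category, sorted from cheapest to most expensive.
median_cost = data.groupby("type")["cost"].median().sort_values()

# Rank 1 is the cheapest category (assumed ranking rule).
ranking = {category: rank for rank, category in enumerate(median_cost.index, start=1)}

data["type"] = data["type"].map(ranking)
print(ranking)
```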

As you can see, there are many NaN values in the data. I fixed that by replacing them with the median.

data.isna().sum()
data.isna().sum()
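The median fill, checked before and after with `isna().sum()`, could be sketched as follows. It assumes each NaN is replaced by the median of its own column, and the sample values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows with gaps in both columns.
data = pd.DataFrame({
    "cost": [100.0, np.nan, 300.0, 500.0],
    "Area": [np.nan, 800.0, 1200.0, np.nan],
})

missing_before = data.isna().sum()

# Fill each column's NaN values with that column's median.
data = data.fillna(data.median(numeric_only=True))

missing_after = data.isna().sum()
print(missing_before, missing_after, sep="\n")
```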

As you can see, there is no strong relationship between the features, because the data set is small and random.

sns.pairplot(data, diag_kind='kde')

CONCLUSION

I did all this to practice cleaning data; the scraped data was too small and random to build a model from.

Project GitHub: Property Cost
