import pandas as pd
data = pd.read_csv("WeatherHistory.csv")
Data Cleaning and Exploring
This is a major part of data analysis: it ensures that the results are accurate.
First, count the non-null values in each attribute and check each attribute's data type before going further.
data.count() # number of non-null values in each attribute
data.dtypes # data type of each attribute
Now check for null values in each column (the output below has the shape that data.isnull().sum() returns):
Formatted Date 0
Summary 0
Precip Type 517
Temperature (C) 0
Apparent Temperature (C) 0
Humidity 0
Wind Speed (km/h) 0
Wind Bearing (degrees) 0
Visibility (km) 0
Loud Cover 0
Pressure (millibars) 0
Daily Summary 0
dtype: int64
Here we can see that there are 517 null values in Precip Type.
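A minimal sketch of the null check, assuming a small made-up frame in place of the weather data (the name toy and its values are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the weather data (values are made up).
toy = pd.DataFrame({
    "Summary": ["Clear", "Overcast", "Foggy", "Clear"],
    "Precip Type": ["rain", None, "snow", None],
    "Humidity": [0.89, 0.86, 0.93, 0.70],
})

# Number of null values in each column.
null_counts = toy.isnull().sum()

# The rows that are missing a precipitation type.
rows_with_nulls = toy[toy["Precip Type"].isnull()]

print(null_counts)
```

Looking at the affected rows before deciding whether to drop or fill them is usually a reasonable next step.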
Now count the unique values in each column with this command:
data.nunique()
For summary statistics of the dataset such as min, max, mean, and std (for example with data.describe()), we get the following:
| | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) |
|---|---|---|---|---|---|---|---|---|
| count | 96453.000000 | 96453.000000 | 96453.000000 | 96453.000000 | 96453.000000 | 96453.000000 | 96453.0 | 96453.000000 |
| mean | 11.932678 | 10.855029 | 0.734899 | 10.810640 | 187.509232 | 10.347325 | 0.0 | 1003.235956 |
| std | 9.551546 | 10.696847 | 0.195473 | 6.913571 | 107.383428 | 4.192123 | 0.0 | 116.969906 |
| min | -21.822222 | -27.716667 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 25% | 4.688889 | 2.311111 | 0.600000 | 5.828200 | 116.000000 | 8.339800 | 0.0 | 1011.900000 |
| 50% | 12.000000 | 12.000000 | 0.780000 | 9.965900 | 180.000000 | 10.046400 | 0.0 | 1016.450000 |
| 75% | 18.838889 | 18.838889 | 0.890000 | 14.135800 | 290.000000 | 14.812000 | 0.0 | 1021.090000 |
| max | 39.905556 | 39.344444 | 1.000000 | 63.852600 | 359.000000 | 16.100000 | 0.0 | 1046.380000 |
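As a rough sketch of how a summary table like the one above can be produced, here is describe() on a small made-up frame (toy and its numbers are assumptions for illustration):

```python
import pandas as pd

# Made-up numeric data, only to show the shape of the result.
toy = pd.DataFrame({
    "Temperature (C)": [9.0, 11.0, 13.0, 15.0],
    "Humidity": [0.80, 0.70, 0.60, 0.50],
})

# One row per statistic (count, mean, std, min, quartiles, max),
# one column per numeric attribute.
summary = toy.describe()

# Number of distinct values in each column.
unique_counts = toy.nunique()

print(summary)
```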
Date-time breaking down
This splits the Formatted Date column into a date-time part and a time-zone offset so that the dataset is easier to work with:
data[["Date-Time","TZ"]]=data["Formatted Date"].str.split("+",expand=True)
df=data.drop(columns="Formatted Date")
Now reorder the columns:
columns_order=["Date-Time","TZ","Summary","Precip Type","Temperature (C)","Apparent Temperature (C)",
"Humidity","Wind Speed (km/h)","Wind Bearing (degrees)","Visibility (km)","Loud Cover",
"Pressure (millibars)", "Daily Summary"]
df1=df.reindex(columns=columns_order)
df2=df1.drop(columns="TZ")
Now convert the Date-Time column to the standard pandas datetime type:
df2["Date-Time"]=pd.to_datetime(df2["Date-Time"])
Now add Year, Month, and day attributes to the table so that the data can be analyzed by period:
df2["Year"]=pd.DatetimeIndex(df2["Date-Time"]).year
df2["Month"]=df2["Date-Time"].dt.month_name()
df2["day"]=df2["Date-Time"].dt.day
df2.head()
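The steps above can be tried end to end on two made-up rows. This is only a sketch: the sample strings assume the "date time +offset" layout that the split on "+" implies, and toy is a hypothetical name.

```python
import pandas as pd

# Two made-up rows in the assumed "date time +offset" layout.
toy = pd.DataFrame({
    "Formatted Date": ["2006-04-01 00:00:00.000 +0200",
                       "2006-12-31 23:00:00.000 +0100"],
})

# Split the local date-time from the offset at the "+" sign.
toy[["Date-Time", "TZ"]] = toy["Formatted Date"].str.split("+", expand=True)

# Parse the local date-time (the offset is ignored here, as in the text).
toy["Date-Time"] = pd.to_datetime(toy["Date-Time"].str.strip())

# Derive the calendar parts used for grouping later.
toy["Year"] = toy["Date-Time"].dt.year
toy["Month"] = toy["Date-Time"].dt.month_name()
toy["day"] = toy["Date-Time"].dt.day

print(toy[["Date-Time", "Year", "Month", "day"]])
```

Note that dropping the offset keeps local wall-clock time, so rows recorded under different offsets are not on a common time base.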
Data Analysis
-> Wind Speed Analysis
Here we analyze wind speed over the years 2006 to 2016.
df2["Wind Speed (km/h)"].describe()
This shows summary statistics such as mean, min, and max.
The average wind speed for each year is:
| Year | Wind Speed (km/h) |
|---|---|
| 2006 | 10.189852 |
| 2007 | 10.825392 |
| 2008 | 11.303897 |
| 2009 | 11.505948 |
| 2010 | 11.015628 |
| 2011 | 9.898262 |
| 2012 | 11.264545 |
| 2013 | 10.969389 |
| 2014 | 10.502473 |
| 2015 | 10.735247 |
| 2016 | 10.703441 |
Here we can see that the average wind speed was highest in 2009 (11.51 km/h) and lowest in 2011 (9.90 km/h).
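Reading the extremes off a table by eye is error-prone, so it can help to let pandas find them. A minimal sketch on made-up values (toy, windiest_year, and calmest_year are illustrative names):

```python
import pandas as pd

# Made-up hourly readings for three years.
toy = pd.DataFrame({
    "Year": [2006, 2006, 2007, 2007, 2008, 2008],
    "Wind Speed (km/h)": [10.0, 12.0, 8.0, 9.0, 14.0, 15.0],
})

# Average wind speed per year.
yearly_avg = toy.groupby("Year")["Wind Speed (km/h)"].mean()

# Index labels (years) of the largest and smallest averages.
windiest_year = yearly_avg.idxmax()
calmest_year = yearly_avg.idxmin()

print(yearly_avg)
print(windiest_year, calmest_year)
```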
Graphical representation of the yearly average wind speed:
[Figure: Average wind speed over the years]
Also we can see monthly average wind speed as follows:
| Month | Wind Speed (km/h) |
|---|---|
| January | 11.512816 |
| February | 12.185543 |
| March | 13.405461 |
| April | 11.893094 |
| May | 10.959337 |
| June | 9.626471 |
| July | 9.639907 |
| August | 8.933431 |
| September | 9.621813 |
| October | 10.000153 |
| November | 10.944266 |
| December | 11.098682 |
Its graphical representation:
[Figure: Monthly average wind speed over the years]
Here we can see that the average wind speed was lowest in August and highest in March.
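Grouping by month name sorts the result alphabetically by default, so getting a calendar-ordered table takes one extra step. One possible way, sketched on made-up data (toy and month_order are illustrative names):

```python
import pandas as pd

# Made-up readings for three months, deliberately out of order.
toy = pd.DataFrame({
    "Month": ["March", "January", "March", "August", "January", "August"],
    "Wind Speed (km/h)": [14.0, 11.0, 13.0, 9.0, 12.0, 8.0],
})

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Average per month, then reorder the rows into calendar order.
# Months absent from the toy data become NaN and are dropped.
monthly_avg = (
    toy.groupby("Month")["Wind Speed (km/h)"]
       .mean()
       .reindex(month_order)
       .dropna()
)

print(monthly_avg)
```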
-> Humidity
The same method is used here. First get the summary statistics of the humidity data, then the average humidity for each year from 2006 to 2016:
| Year | Humidity |
|---|---|
| 2006 | 0.767341 |
| 2007 | 0.689652 |
| 2008 | 0.701237 |
| 2009 | 0.707247 |
| 2010 | 0.796858 |
| 2011 | 0.736017 |
| 2012 | 0.689500 |
| 2013 | 0.754209 |
| 2014 | 0.748578 |
| 2015 | 0.732355 |
| 2016 | 0.760874 |
Graphical representation of this yearly data:
[Figure: Average humidity over the years]
Here we can see that humidity was maximum in 2010 and minimum in 2012.
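A plot of this kind can be drawn with matplotlib. The sketch below uses made-up yearly averages, and the figure size and marker style are arbitrary choices rather than the settings behind the original figure. The Agg backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to see a window with plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly averages, only to have something to draw.
yearly_humidity = pd.Series([0.77, 0.69, 0.70],
                            index=[2006, 2007, 2008], name="Humidity")

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(yearly_humidity.index, yearly_humidity.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Average humidity")
ax.set_title("Average Humidity over the years")
```

In a notebook, set_title returns a Text object, and when it is the last expression in a cell its repr (of the form Text(0.5, 1.0, '...')) is echoed under the cell.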
Monthly average humidity:
| Month | Humidity |
|---|---|
| January | 0.850723 |
| February | 0.813400 |
| March | 0.702966 |
| April | 0.641133 |
| May | 0.691325 |
| June | 0.686470 |
| July | 0.639657 |
| August | 0.635542 |
| September | 0.688790 |
| October | 0.774554 |
| November | 0.827828 |
| December | 0.870390 |
[Figure: Monthly average humidity over the years]
Here we can see that average humidity was lowest in August and highest in December.
-> Weather Condition Analysis
Here we analyze weather conditions such as cloudy and overcast.
First count how often each condition occurs:
df2["Summary"].value_counts()
Most frequent weather report for each year:
| Year | most frequent weather |
|---|---|
| 2006 | Partly Cloudy |
| 2007 | Partly Cloudy |
| 2008 | Partly Cloudy |
| 2009 | Partly Cloudy |
| 2010 | Partly Cloudy |
| 2011 | Partly Cloudy |
| 2012 | Partly Cloudy |
| 2013 | Partly Cloudy |
| 2014 | Mostly Cloudy |
| 2015 | Partly Cloudy |
| 2016 | Mostly Cloudy |
Here we can see that Partly Cloudy was the most frequent condition in every year from 2006 to 2016 except 2014 and 2016, when Mostly Cloudy was most frequent.
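One way to build a "most frequent value per group" table is to take the top entry of value_counts() within each group. This is a sketch on made-up rows, not necessarily how the table above was produced:

```python
import pandas as pd

# Made-up observations for two years.
toy = pd.DataFrame({
    "Year": [2013, 2013, 2013, 2014, 2014, 2014],
    "Summary": ["Partly Cloudy", "Partly Cloudy", "Overcast",
                "Mostly Cloudy", "Clear", "Mostly Cloudy"],
})

# value_counts() sorts by frequency, so its first index label
# is the most common value in the group.
most_frequent = (
    toy.groupby("Year")["Summary"]
       .agg(lambda s: s.value_counts().index[0])
       .rename("most frequent weather")
)

print(most_frequent)
```

When two conditions tie within a group, which one comes first is not something to rely on, so ties deserve a separate check.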
Monthly data analysis
| Month | top |
|---|---|
| January | Overcast |
| February | Overcast |
| March | Mostly Cloudy |
| April | Partly Cloudy |
| May | Partly Cloudy |
| June | Partly Cloudy |
| July | Partly Cloudy |
| August | Partly Cloudy |
| September | Partly Cloudy |
| October | Mostly Cloudy |
| November | Mostly Cloudy |
| December | Mostly Cloudy |
-> Visibility Analysis
The monthly average visibility is:
| Month | Visibility (km) |
|---|---|
| January | 7.830584 |
| February | 8.731368 |
| March | 10.910450 |
| April | 11.784224 |
| May | 11.892754 |
| June | 11.990266 |
| July | 12.187820 |
| August | 12.455549 |
| September | 11.602874 |
| October | 9.741691 |
| November | 8.191229 |
| December | 6.773288 |
Here we can see that average visibility was highest in August and lowest in December.
Graphical representation of average monthly visibility:
[Figure: Monthly visibility over the years]
-> Precipitation Analysis
Here we look at the most common precipitation type in each month:
| Month | Precip Type |
|---|---|
| January | rain |
| February | rain |
| March | rain |
| April | rain |
| May | rain |
| June | rain |
| July | rain |
| August | rain |
| September | rain |
| October | rain |
| November | rain |
| December | rain |
Here we can see that rain was the most common precipitation type in every month of the year.
-> Temperature Analysis
Temperature trend:
Average temperature for each year from 2006 to 2016
| Year | Temperature (C) |
|---|---|
| 2006 | 11.215365 |
| 2007 | 12.135239 |
| 2008 | 12.161876 |
| 2009 | 12.267910 |
| 2010 | 11.202061 |
| 2011 | 11.524453 |
| 2012 | 11.986726 |
| 2013 | 11.940719 |
| 2014 | 12.529737 |
| 2015 | 12.311370 |
| 2016 | 11.985292 |
Here we can see that the average temperature was highest in 2014 and lowest in 2010.
Graphical representation of average temperature from 2006 to 2016:
[Figure: Annual average temperature]
Monthly average temperature analysis:
| Month | Temperature (C) |
|---|---|
| January | 0.813890 |
| February | 2.159699 |
| March | 6.906599 |
| April | 12.756417 |
| May | 16.873692 |
| June | 20.715617 |
| July | 22.963943 |
| August | 22.345031 |
| September | 17.516790 |
| October | 11.342247 |
| November | 6.589907 |
| December | 1.633742 |
Graphical representation:
[Figure: Monthly average temperature]
Here we can see that January is the coldest month and July is the hottest.
-> Pressure Analysis
Average pressure for each year from 2006 to 2016 is computed as follows:
avg_pressure=pd.DataFrame(df2.groupby("Year")["Pressure (millibars)"].mean())
| Year | Pressure (millibars) |
|---|---|
| 2006 | 992.543529 |
| 2007 | 1001.640226 |
| 2008 | 1007.734504 |
| 2009 | 1002.608735 |
| 2010 | 1004.811891 |
| 2011 | 1014.184075 |
| 2012 | 999.341481 |
| 2013 | 1004.950764 |
| 2014 | 987.394676 |
| 2015 | 1005.179401 |
| 2016 | 1015.162161 |
Here we can see that average pressure was highest in 2016 and lowest in 2014.
Pictorial representation of pressure over the years:
[Figure: Average pressure over the years]
The monthly average pressure was highest in November and lowest in December:
| Month | Pressure (millibars) |
|---|---|
| January | 1006.125792 |
| February | 1003.929313 |
| March | 1001.551536 |
| April | 1009.996332 |
| May | 1003.499530 |
| June | 1001.883742 |
| July | 1008.566431 |
| August | 1001.716944 |
| September | 1000.565347 |
| October | 1003.243458 |
| November | 1012.297027 |
| December | 985.901753 |
[Figure: Monthly average pressure over the years]
-> Correlation
Here we look at the correlation between the numeric attributes.
df3=df2.drop(columns=["Year","day","Loud Cover"])
| | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Pressure (millibars) |
|---|---|---|---|---|---|---|---|
| Temperature (C) | 1.000000 | 0.992629 | -0.632255 | 0.008957 | 0.029988 | 0.392847 | -0.005447 |
| Apparent Temperature (C) | 0.992629 | 1.000000 | -0.602571 | -0.056650 | 0.029031 | 0.381718 | -0.000219 |
| Humidity | -0.632255 | -0.602571 | 1.000000 | -0.224951 | 0.000735 | -0.369173 | 0.005454 |
| Wind Speed (km/h) | 0.008957 | -0.056650 | -0.224951 | 1.000000 | 0.103822 | 0.100749 | -0.049263 |
| Wind Bearing (degrees) | 0.029988 | 0.029031 | 0.000735 | 0.103822 | 1.000000 | 0.047594 | -0.011651 |
| Visibility (km) | 0.392847 | 0.381718 | -0.369173 | 0.100749 | 0.047594 | 1.000000 | 0.059818 |
| Pressure (millibars) | -0.005447 | -0.000219 | 0.005454 | -0.049263 | -0.011651 | 0.059818 | 1.000000 |
Graphical representation of the correlation:
[Figure: Correlations heat map]
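A correlation matrix like the one above can be sketched as follows. The data is made up, and selecting the numeric columns first is one way to keep text columns such as Summary out of the calculation (recent pandas versions do not silently skip them):

```python
import pandas as pd

toy = pd.DataFrame({
    "Temperature (C)": [1.0, 2.0, 3.0, 4.0],
    "Apparent Temperature (C)": [0.5, 1.5, 2.5, 3.5],    # temperature minus a constant
    "Humidity": [0.9, 0.8, 0.7, 0.6],                    # falls as temperature rises
    "Summary": ["Clear", "Foggy", "Clear", "Overcast"],  # non-numeric, excluded below
})

# Pairwise Pearson correlation of the numeric columns only.
corr = toy.select_dtypes(include="number").corr()

print(corr.round(3))
```

In this toy data the two temperature columns are perfectly correlated and humidity is perfectly anti-correlated with them, which mirrors the direction (not the size) of the relationships in the table above.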
-> Apparent Temperature vs Humidity Analysis
Converting the date and time to UTC:
0 2006-03-31 22:00:00+00:00
1 2006-03-31 23:00:00+00:00
2 2006-04-01 00:00:00+00:00
3 2006-04-01 01:00:00+00:00
4 2006-04-01 02:00:00+00:00
...
96448 2016-09-09 17:00:00+00:00
96449 2016-09-09 18:00:00+00:00
96450 2016-09-09 19:00:00+00:00
96451 2016-09-09 20:00:00+00:00
96452 2016-09-09 21:00:00+00:00
Name: Formatted Date, Length: 96453, dtype: datetime64[ns, UTC]
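A conversion like this can be sketched with pd.to_datetime and utc=True. The two sample strings and the explicit format string are assumptions about the layout of Formatted Date, not values copied from the file:

```python
import pandas as pd

# Made-up timestamps with different local offsets.
raw = pd.Series(["2006-04-01 00:00:00.000 +0200",
                 "2006-01-01 00:00:00.000 +0100"], name="Formatted Date")

# Parse the offset (%z) and convert everything to UTC.
utc_times = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f %z", utc=True)

print(utc_times)
```

Midnight local time on 1 January at +0100 falls on 31 December of the previous year in UTC, which would explain a monthly table that starts one month before the first local date.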
Setting the UTC date and time as the index:
| Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006-03-31 22:00:00+00:00 | Partly Cloudy | rain | 9.472222 | 7.388889 | 0.89 | 14.1197 | 251.0 | 15.8263 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 2006-03-31 23:00:00+00:00 | Partly Cloudy | rain | 9.355556 | 7.227778 | 0.86 | 14.2646 | 259.0 | 15.8263 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2006-04-01 00:00:00+00:00 | Mostly Cloudy | rain | 9.377778 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 2006-04-01 01:00:00+00:00 | Partly Cloudy | rain | 8.288889 | 5.944444 | 0.83 | 14.1036 | 269.0 | 15.8263 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 2006-04-01 02:00:00+00:00 | Mostly Cloudy | rain | 8.755556 | 6.977778 | 0.83 | 11.0446 | 259.0 | 15.8263 | 0.0 | 1016.51 | Partly cloudy throughout the day. |
After resampling to the monthly mean of apparent temperature and humidity:
| Formatted Date | Apparent Temperature (C) | Humidity |
|---|---|---|
| 2005-12-01 00:00:00+00:00 | -4.050000 | 0.890000 |
| 2006-01-01 00:00:00+00:00 | -4.173708 | 0.834610 |
| 2006-02-01 00:00:00+00:00 | -2.990716 | 0.843467 |
| 2006-03-01 00:00:00+00:00 | 1.969780 | 0.778737 |
| 2006-04-01 00:00:00+00:00 | 12.098827 | 0.728625 |
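The monthly resampling can be sketched on a few made-up rows. The "MS" rule (month start) is an assumption chosen because the index above is labelled with the first day of each month; toy and its values are illustrative:

```python
import pandas as pd

# Made-up readings, two in January and two in February, indexed by UTC time.
idx = pd.to_datetime(
    ["2006-01-10 00:00", "2006-01-20 00:00", "2006-02-05 00:00", "2006-02-15 00:00"],
    utc=True,
)
toy = pd.DataFrame(
    {"Apparent Temperature (C)": [-4.0, -2.0, 1.0, 3.0],
     "Humidity": [0.90, 0.80, 0.85, 0.75]},
    index=idx,
)
toy.index.name = "Formatted Date"

# One row per calendar month, labelled with the month's first day.
df_monthly_mean = toy[["Apparent Temperature (C)", "Humidity"]].resample("MS").mean()

print(df_monthly_mean)
```

Resampling needs a datetime index, which is why the date column is converted and set as the index first.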
Plot of the variation in apparent temperature and humidity with time:
[Figure: Variation in Apparent Temperature and Humidity with time; x-axis: Formatted Date]
Here we can see that humidity did not change as much as apparent temperature over the period.
Relation between apparent temperature and humidity:
To retrieve the data for a particular month from every year, for example January (month 1), use this command:
df1 = df_monthly_mean[df_monthly_mean.index.month==1]
print(df1)
df1.dtypes
Plotting temperature and humidity for one month of each year, here April (index.month==4 in the filter above):
[Figure: x-axis label 'Month of April']
Here we see that April humidity stayed nearly the same across the years, while April temperature was lowest in 2010 and highest in 2007.
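The month filter and the year comparison can be combined in a few lines. This sketch uses a made-up monthly frame under the same name as in the text (df_monthly_mean); the values and the names april, warmest_april_year, and coolest_april_year are illustrative only:

```python
import pandas as pd

# Made-up monthly means: one March row and three April rows.
idx = pd.to_datetime(["2006-03-01", "2006-04-01", "2007-04-01", "2008-04-01"], utc=True)
df_monthly_mean = pd.DataFrame(
    {"Apparent Temperature (C)": [2.0, 12.0, 14.0, 11.0],
     "Humidity": [0.78, 0.73, 0.70, 0.72]},
    index=idx,
)

# Keep only the April row of every year.
april = df_monthly_mean[df_monthly_mean.index.month == 4]

# idxmax/idxmin return the timestamp of the extreme row; .year extracts the year.
warmest_april_year = april["Apparent Temperature (C)"].idxmax().year
coolest_april_year = april["Apparent Temperature (C)"].idxmin().year

print(april)
print(warmest_april_year, coolest_april_year)
```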