Performing Analysis on Meteorological Data

In this post we analyze meteorological data and look at how the weather has behaved over past years: wind speed, temperature, pressure, humidity, weather conditions, and so on.
We use the weatherHistory dataset, which is in CSV format and is available on Kaggle (https://www.kaggle.com/muthuj7/weather-dataset). It contains hourly readings of temperature and other variables from 2006 to 2016, and it is described as corresponding to Finland, a country in Northern Europe.
To analyze this dataset we first import the necessary libraries (numpy, pandas, matplotlib and seaborn) as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

After that, download the dataset from Kaggle and load it into the notebook as follows:
data = pd.read_csv("WeatherHistory.csv")

Data Cleaning and Exploring


This is a major part of data analysis, because the accuracy of every later result depends on it.
First, count the values in each attribute and check the datatypes before going further.
data.count() # number of non-null values in each attribute
data.dtypes  # datatype of each attribute

Now check for null values in the dataset as follows:
data.isna().sum()
Formatted Date                0
Summary                       0
Precip Type                 517
Temperature (C)               0
Apparent Temperature (C)      0
Humidity                      0
Wind Speed (km/h)             0
Wind Bearing (degrees)        0
Visibility (km)               0
Loud Cover                    0
Pressure (millibars)          0
Daily Summary                 0
dtype: int64
Here we can see that there are 517 null values in Precip Type and none in the other attributes.
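The post does not say what is done with those 517 missing labels, so the following is only a sketch of two common options, run on a small made-up table. The `sample` frame and the "unknown" label are assumptions for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the weather table; the values are illustrative only.
sample = pd.DataFrame({
    "Precip Type": ["rain", np.nan, "snow", "rain", np.nan],
    "Temperature (C)": [9.5, 12.0, -3.2, 7.1, 15.4],
})

# Count the missing labels, as data.isna().sum() does for the full dataset.
n_missing = sample["Precip Type"].isna().sum()

# Option 1: keep every row and mark the gap explicitly.
filled = sample.fillna({"Precip Type": "unknown"})

# Option 2: drop the rows that have no precipitation label.
dropped = sample.dropna(subset=["Precip Type"])

print(n_missing, len(filled), len(dropped))
```

Filling keeps all hourly readings available for the temperature and humidity analysis, while dropping is simpler if only the precipitation column matters.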

Next, count the unique values in each attribute with this command:
data.nunique()

For summary statistics of the dataset such as min, max, mean and std, we do the following:
data.describe()
       Temperature (C)  Apparent Temperature (C)      Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Loud Cover  Pressure (millibars)
count     96453.000000              96453.000000  96453.000000       96453.000000            96453.000000     96453.000000     96453.0          96453.000000
mean         11.932678                 10.855029      0.734899          10.810640              187.509232        10.347325         0.0           1003.235956
std           9.551546                 10.696847      0.195473           6.913571              107.383428         4.192123         0.0            116.969906
min         -21.822222                -27.716667      0.000000           0.000000                0.000000         0.000000         0.0              0.000000
25%           4.688889                  2.311111      0.600000           5.828200              116.000000         8.339800         0.0           1011.900000
50%          12.000000                 12.000000      0.780000           9.965900              180.000000        10.046400         0.0           1016.450000
75%          18.838889                 18.838889      0.890000          14.135800              290.000000        14.812000         0.0           1021.090000
max          39.905556                 39.344444      1.000000          63.852600              359.000000        16.100000         0.0           1046.380000
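One detail in this description is worth a second look: the minimum pressure is 0.0 millibars and the standard deviation is about 117, which is not physically plausible for surface pressure. A likely explanation (an assumption, the dataset description is not quoted here) is that 0 stands in for missing readings. The sketch below, on made-up numbers, shows how such zeros pull a yearly mean down and how to exclude them.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; one zero stands in for a missing measurement.
sample = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015, 2015],
    "Pressure (millibars)": [1015.1, 0.0, 1016.4, 1012.3, 1013.9],
})

# Assumption: 0 millibars is a placeholder, so treat it as missing.
pressure = sample["Pressure (millibars)"].replace(0, np.nan)

raw_mean = sample.groupby("Year")["Pressure (millibars)"].mean()
clean_mean = pressure.groupby(sample["Year"]).mean()  # NaN is skipped by mean()

print(raw_mean)
print(clean_mean)
```

If the assumption holds for the real data, the same replacement applied before the pressure analysis would give averages that are not distorted by the placeholder zeros.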

Breaking down the date-time
The Formatted Date attribute is split into a date-time part and a time-zone part so that the dataset is easier to work with:
data[["Date-Time","TZ"]]=data["Formatted Date"].str.split("+",expand=True)
df=data.drop(columns="Formatted Date")

Now re-index and reorder the attributes:
columns_order=["Date-Time","TZ","Summary","Precip Type","Temperature (C)","Apparent Temperature (C)",
                "Humidity","Wind Speed (km/h)","Wind Bearing (degrees)","Visibility (km)","Loud Cover",
                "Pressure (millibars)", "Daily Summary"]
df1=df.reindex(columns=columns_order)
df2=df1.drop(columns="TZ")

Now convert the Date-Time attribute to the standard datetime type as follows:
df2["Date-Time"]=pd.to_datetime(df2["Date-Time"])

Now we add Year, Month and day attributes to the table so that every later analysis can group by them:
df2["Year"]=pd.DatetimeIndex(df2["Date-Time"]).year
df2["Month"]=df2["Date-Time"].dt.month_name()
df2["day"]=df2["Date-Time"].dt.day
df2.head()

Data Analysis

-> Wind Speed Analysis

Here we analyze wind speed over the years from 2006 to 2016.

df2["Wind Speed (km/h)"].describe()
This shows its summary statistics such as mean, min and max.

The average wind speed for each year can be seen as follows:
avg_wind_Speed=pd.DataFrame(df2.groupby("Year")["Wind Speed (km/h)"].mean())
avg_wind_Speed
      Wind Speed (km/h)
Year
2006          10.189852
2007          10.825392
2008          11.303897
2009          11.505948
2010          11.015628
2011           9.898262
2012          11.264545
2013          10.969389
2014          10.502473
2015          10.735247
2016          10.703441

Here we can see that the average wind speed was highest in 2009 (about 11.51 km/h) and lowest in 2011 (about 9.90 km/h).
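Reading the extremes off a table by eye is error-prone, so they can also be looked up programmatically. This is a small sketch that rebuilds the yearly averages printed above as a Series and uses `idxmax` and `idxmin`; the lowercase variable name is mine, not the notebook's.

```python
import pandas as pd

# Yearly averages copied from the table above.
avg_wind_speed = pd.Series(
    [10.189852, 10.825392, 11.303897, 11.505948, 11.015628, 9.898262,
     11.264545, 10.969389, 10.502473, 10.735247, 10.703441],
    index=pd.Index(range(2006, 2017), name="Year"),
    name="Wind Speed (km/h)",
)

# idxmax / idxmin return the index label (the year) of the extreme value.
windiest_year = avg_wind_speed.idxmax()
calmest_year = avg_wind_speed.idxmin()

print(windiest_year, avg_wind_speed.max())
print(calmest_year, avg_wind_speed.min())
```

The same two calls work on any of the yearly or monthly averages computed in the later sections.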
Graphical representation of the yearly average wind speed:
fig,ax=plt.subplots(figsize=(10,8))
sns.lineplot(x=avg_wind_Speed.index,y=avg_wind_Speed["Wind Speed (km/h)"])
plt.title("Average wind speed over the years")
[Figure: line plot of average wind speed by year]

The monthly average wind speed can be computed in the same way:
month_avg_wind_Speed=pd.DataFrame(df2.groupby("Month")["Wind Speed (km/h)"].mean())
order=["January","February","March","April","May","June","July","August","September",
            "October","November","December"]
monthly_wind_speed=month_avg_wind_Speed.reindex(index=order)
monthly_wind_speed
           Wind Speed (km/h)
Month
January            11.512816
February           12.185543
March              13.405461
April              11.893094
May                10.959337
June                9.626471
July                9.639907
August              8.933431
September           9.621813
October            10.000153
November           10.944266
December           11.098682

and its graphical representation-
fig,ax=plt.subplots(figsize=(16,4))
sns.lineplot(x=monthly_wind_speed.index,y=monthly_wind_speed["Wind Speed (km/h)"])
plt.title("Monthly Average wind speed over the years")
[Figure: line plot of average wind speed by month]

Here we can see that the average wind speed was lowest in August and highest in March.
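The group-by-month and reindex pattern used here comes back in every section that follows, so it can be wrapped in a small helper. This is only a sketch: `monthly_mean` and `MONTH_ORDER` are hypothetical names that do not appear in the original notebook, and the sample values are made up.

```python
import pandas as pd

# Calendar order, matching the English names produced by dt.month_name().
MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

def monthly_mean(frame, column, month_col="Month"):
    """Mean of `column` for each month, with rows in calendar order."""
    return frame.groupby(month_col)[column].mean().reindex(MONTH_ORDER)

# Tiny illustrative table; months without data come back as NaN.
sample = pd.DataFrame({
    "Month": ["March", "January", "March", "August"],
    "Wind Speed (km/h)": [14.0, 11.0, 12.0, 9.0],
})
result = monthly_mean(sample, "Wind Speed (km/h)")
print(result.dropna())
```

With the real table, a call such as monthly_mean(df2, "Humidity") would replace the repeated groupby, order list and reindex lines, returning a Series rather than a DataFrame.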

-> Humidity

The same method is used for humidity: first look at the description of the humidity data, then compute the average humidity for each year over the 10 years:
avg_humidity=pd.DataFrame(df2.groupby("Year")["Humidity"].mean())
avg_humidity
      Humidity
Year
2006  0.767341
2007  0.689652
2008  0.701237
2009  0.707247
2010  0.796858
2011  0.736017
2012  0.689500
2013  0.754209
2014  0.748578
2015  0.732355
2016  0.760874
Graphical representation of this yearly data-

fig,ax=plt.subplots(figsize=(10,8))
sns.lineplot(x=avg_humidity.index,y=avg_humidity["Humidity"])
plt.title("Average Humidity over the years")
[Figure: line plot of average humidity by year]

Here we can see that the average humidity was highest in 2010 and lowest in 2012.
Monthly average humidity:
month_avg_humidity=pd.DataFrame(df2.groupby("Month")["Humidity"].mean())
order=["January","February","March","April","May","June","July","August","September",
            "October","November","December"]
monthly_humidity=month_avg_humidity.reindex(index=order)
monthly_humidity
           Humidity
Month
January    0.850723
February   0.813400
March      0.702966
April      0.641133
May        0.691325
June       0.686470
July       0.639657
August     0.635542
September  0.688790
October    0.774554
November   0.827828
December   0.870390

fig,ax=plt.subplots(figsize=(16,4))
sns.lineplot(x=monthly_humidity.index,y=monthly_humidity["Humidity"])
plt.title("Monthly Average Humidity over the years")
[Figure: line plot of average humidity by month]
Here we can see that the average humidity was lowest in August and highest in December.

-> Weather Condition Analysis
Here we analyze the weather condition, such as cloudy or overcast.
First, count how often each condition occurs:

df2["Summary"].value_counts()

Most frequent weather report
weather_condition=pd.DataFrame(df2.groupby("Year")["Summary"].describe(include="O").top)

weather_condition.rename(columns={"top":"most frequent weather"})
      most frequent weather
Year
2006          Partly Cloudy
2007          Partly Cloudy
2008          Partly Cloudy
2009          Partly Cloudy
2010          Partly Cloudy
2011          Partly Cloudy
2012          Partly Cloudy
2013          Partly Cloudy
2014          Mostly Cloudy
2015          Partly Cloudy
2016          Mostly Cloudy

Here we can see that Partly Cloudy was the most frequent condition in every year from 2006 to 2016 except 2014 and 2016, when Mostly Cloudy was the most frequent.
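The `describe(...).top` route works, but the most frequent value per group can also be taken directly from `value_counts`. The following is a sketch of that alternative on a made-up table; it is not the method used in the notebook.

```python
import pandas as pd

# Hypothetical hourly summaries for two years.
sample = pd.DataFrame({
    "Year": [2013, 2013, 2013, 2014, 2014, 2014],
    "Summary": ["Partly Cloudy", "Partly Cloudy", "Overcast",
                "Mostly Cloudy", "Clear", "Mostly Cloudy"],
})

# For each year, count the summaries and keep the label with the highest count.
most_frequent = (
    sample.groupby("Year")["Summary"]
    .agg(lambda s: s.value_counts().idxmax())
    .rename("most frequent weather")
)
print(most_frequent)
```

This returns a named Series in one step, so no separate column rename is needed afterwards.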

Monthly data analysis
monthly_weather_condition=pd.DataFrame(df2.groupby("Month")["Summary"].describe(include="O").top)
order=["January","February","March","April","May","June","July","August","September",
            "October","November","December"]
# rename returns a new DataFrame, so the result has to be assigned
monthly_weather_condition=monthly_weather_condition.rename(columns={"top":"most frequent weather"})
monthly=monthly_weather_condition.reindex(index=order)
monthly
           most frequent weather
Month
January                 Overcast
February                Overcast
March              Mostly Cloudy
April              Partly Cloudy
May                Partly Cloudy
June               Partly Cloudy
July               Partly Cloudy
August             Partly Cloudy
September          Partly Cloudy
October            Mostly Cloudy
November           Mostly Cloudy
December           Mostly Cloudy

-> Visibility Analysis
The monthly average visibility can be computed as follows:
month_avg_visibility=pd.DataFrame(df2.groupby("Month")["Visibility (km)"].mean())
order=["January","February","March","April","May","June","July","August","September",
            "October","November","December"]
monthly_visibility=month_avg_visibility.reindex(index=order)
monthly_visibility
           Visibility (km)
Month
January           7.830584
February          8.731368
March            10.910450
April            11.784224
May              11.892754
June             11.990266
July             12.187820
August           12.455549
September        11.602874
October           9.741691
November          8.191229
December          6.773288

Here we can see that the average visibility was highest in August and lowest in December.

Graphical representation of average Monthly visibility-
fig,ax=plt.subplots(figsize=(16,4))
sns.lineplot(x=monthly_visibility.index,y=monthly_visibility["Visibility (km)"])
plt.title("Monthly visibility over the years")
[Figure: line plot of average visibility by month]

-> Precipitation Analysis
Here we look at the most frequent precipitation type in each month:
percip=pd.DataFrame(df2.groupby("Month")["Precip Type"].describe(include="O").top)
order=["January","February","March","April","May","June","July","August","September",
            "October","November","December"]
m_p=percip.rename(columns={"top":"Precip Type"})
monthly_percip=m_p.reindex(index=order)
monthly_percip
           Precip Type
Month
January           rain
February          rain
March             rain
April             rain
May               rain
June              rain
July              rain
August            rain
September         rain
October           rain
November          rain
December          rain
Here we can see that rain was the most frequent precipitation type in every month of the year.

-> Temperature Analysis
Temperature distribution:
fig,ax=plt.subplots(figsize=(14,5))
plt.hist(df2["Temperature (C)"],bins=10,rwidth=0.9)
plt.xlabel("Temperature (C)")
plt.ylabel("Frequency")
[Figure: histogram of temperature]

Average temperature for each year from 2006 to 2016:
year_avg_temp=pd.DataFrame(df2.groupby("Year")["Temperature (C)"].mean())
year_avg_temp
      Temperature (C)
Year
2006        11.215365
2007        12.135239
2008        12.161876
2009        12.267910
2010        11.202061
2011        11.524453
2012        11.986726
2013        11.940719
2014        12.529737
2015        12.311370
2016        11.985292

Here we can see that the average temperature was highest in 2014 and lowest in 2010.
Graphical representation of average temperature from 2006 to 2016-
fig,ax=plt.subplots(figsize=(10,8))
sns.lineplot(x=year_avg_temp.index,y=year_avg_temp["Temperature (C)"])
plt.title("Annual average temperature")
[Figure: line plot of annual average temperature]

Monthly average temperature analysis-
month_temp=pd.DataFrame(df2.groupby("Month")["Temperature (C)"].mean())
order=["January","February","March","April","May","June","July","August","September",
            "October","November","December"]

monthly_avg_temp=month_temp.reindex(index=order)
monthly_avg_temp
           Temperature (C)
Month
January           0.813890
February          2.159699
March             6.906599
April            12.756417
May              16.873692
June             20.715617
July             22.963943
August           22.345031
September        17.516790
October          11.342247
November          6.589907
December          1.633742

Graphical Representation
fig,ax=plt.subplots(figsize=(14,4))
sns.lineplot(x=monthly_avg_temp.index,y=monthly_avg_temp["Temperature (C)"])
plt.title("monthly average temperature")
[Figure: line plot of monthly average temperature]

Here we can see that January is the coldest month and July is the hottest.

-> Pressure Analysis
The average pressure for each year from 2006 to 2016 is computed as follows:
avg_pressure=pd.DataFrame(df2.groupby("Year")["Pressure (millibars)"].mean())
avg_pressure
      Pressure (millibars)
Year
2006            992.543529
2007           1001.640226
2008           1007.734504
2009           1002.608735
2010           1004.811891
2011           1014.184075
2012            999.341481
2013           1004.950764
2014            987.394676
2015           1005.179401
2016           1015.162161
Here we can see that the average pressure was highest in 2016 and lowest in 2014.
Pictorial representation of pressure over the 10 years
fig,ax=plt.subplots(figsize=(10,8))
sns.lineplot(x=avg_pressure.index,y=avg_pressure["Pressure (millibars)"])
plt.title("Average Pressure over the years")
[Figure: line plot of average pressure by year]

The monthly average pressure is highest in November and lowest in December:
month_avg_pressure=pd.DataFrame(df2.groupby("Month")["Pressure (millibars)"].mean())
order=["January","February","March","April","May","June","July","August","September",
            "October","November","December"]
monthly_pressure=month_avg_pressure.reindex(index=order)
monthly_pressure
           Pressure (millibars)
Month
January             1006.125792
February            1003.929313
March               1001.551536
April               1009.996332
May                 1003.499530
June                1001.883742
July                1008.566431
August              1001.716944
September           1000.565347
October             1003.243458
November            1012.297027
December             985.901753

fig,ax=plt.subplots(figsize=(16,4))
sns.lineplot(x=monthly_pressure.index,y=monthly_pressure["Pressure (millibars)"])
plt.title("Monthly Average Pressure over the years")
[Figure: line plot of average pressure by month]

-> Correlation
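Here we look at the correlation between the numeric attributes.
df3=df2.drop(columns=["Year","day","Loud Cover"])
# numeric_only=True (pandas 1.5+) skips text and datetime columns such as Summary and Date-Time
df3_corr=df3.corr(numeric_only=True)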
df3_corr
                          Temperature (C)  Apparent Temperature (C)   Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Pressure (millibars)
Temperature (C)                  1.000000                  0.992629  -0.632255           0.008957                0.029988         0.392847             -0.005447
Apparent Temperature (C)         0.992629                  1.000000  -0.602571          -0.056650                0.029031         0.381718             -0.000219
Humidity                        -0.632255                 -0.602571   1.000000          -0.224951                0.000735        -0.369173              0.005454
Wind Speed (km/h)                0.008957                 -0.056650  -0.224951           1.000000                0.103822         0.100749             -0.049263
Wind Bearing (degrees)           0.029988                  0.029031   0.000735           0.103822                1.000000         0.047594             -0.011651
Visibility (km)                  0.392847                  0.381718  -0.369173           0.100749                0.047594         1.000000              0.059818
Pressure (millibars)            -0.005447                 -0.000219   0.005454          -0.049263               -0.011651         0.059818              1.000000

Graphical representation of correlation-
fig,ax=plt.subplots(figsize=(10,8))
sns.heatmap(df3_corr,annot=True,cmap='magma_r',linewidths=0.2)
plt.title("correlations heat map")
[Figure: heat map of the correlation matrix]

-> Apparent Temperature vs Humidity Analysis
First, convert the Formatted Date attribute to UTC date-time:
data['Formatted Date'] = pd.to_datetime(data['Formatted Date'], utc=True)
data['Formatted Date']
0       2006-03-31 22:00:00+00:00
1       2006-03-31 23:00:00+00:00
2       2006-04-01 00:00:00+00:00
3       2006-04-01 01:00:00+00:00
4       2006-04-01 02:00:00+00:00
                   ...           
96448   2016-09-09 17:00:00+00:00
96449   2016-09-09 18:00:00+00:00
96450   2016-09-09 19:00:00+00:00
96451   2016-09-09 20:00:00+00:00
96452   2016-09-09 21:00:00+00:00
Name: Formatted Date, Length: 96453, dtype: datetime64[ns, UTC]

Setting the UTC date-time as the index:
data = data.set_index('Formatted Date')
data.head(5)
                                 Summary  Precip Type  Temperature (C)  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Loud Cover  Pressure (millibars)                      Daily Summary
Formatted Date
2006-03-31 22:00:00+00:00  Partly Cloudy         rain         9.472222                  7.388889      0.89            14.1197                   251.0          15.8263         0.0               1015.13  Partly cloudy throughout the day.
2006-03-31 23:00:00+00:00  Partly Cloudy         rain         9.355556                  7.227778      0.86            14.2646                   259.0          15.8263         0.0               1015.63  Partly cloudy throughout the day.
2006-04-01 00:00:00+00:00  Mostly Cloudy         rain         9.377778                  9.377778      0.89             3.9284                   204.0          14.9569         0.0               1015.94  Partly cloudy throughout the day.
2006-04-01 01:00:00+00:00  Partly Cloudy         rain         8.288889                  5.944444      0.83            14.1036                   269.0          15.8263         0.0               1016.41  Partly cloudy throughout the day.
2006-04-01 02:00:00+00:00  Mostly Cloudy         rain         8.755556                  6.977778      0.83            11.0446                   259.0          15.8263         0.0               1016.51  Partly cloudy throughout the day.

Next, resample the data to monthly means of apparent temperature and humidity:
data_columns = ['Apparent Temperature (C)', 'Humidity']
df_monthly_mean = data[data_columns].resample('MS').mean()
df_monthly_mean.head()
                           Apparent Temperature (C)  Humidity
Formatted Date
2005-12-01 00:00:00+00:00                 -4.050000  0.890000
2006-01-01 00:00:00+00:00                 -4.173708  0.834610
2006-02-01 00:00:00+00:00                 -2.990716  0.843467
2006-03-01 00:00:00+00:00                  1.969780  0.778737
2006-04-01 00:00:00+00:00                 12.098827  0.728625
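The first row is labelled December 2005 although the data starts in 2006. This is probably a side effect of the UTC conversion: a reading taken at local midnight on 1 January 2006 falls on 31 December 2005 in UTC. The sketch below illustrates the effect with one timestamp in the dataset's raw format; that this exact timestamp is the earliest record is an assumption.

```python
import pandas as pd

# Assumed example: local midnight on 1 January 2006 at UTC+01:00.
local_stamp = "2006-01-01 00:00:00.000 +0100"
utc_stamp = pd.to_datetime(local_stamp, utc=True)
print(utc_stamp)  # one hour earlier, so the date moves back into 2005

# resample("MS") labels each bin with the first day of its month.
series = pd.Series([-4.05], index=pd.DatetimeIndex([utc_stamp]))
monthly = series.resample("MS").mean()
print(monthly)
```

A December 2005 bin built from a single reading like this is not a real monthly mean, so it is reasonable to leave it out when comparing months.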

Variation plot in Apparent Temperature and Humidity with time

plt.figure(figsize=(14,6))
plt.title("Variation in Apparent Temperature and Humidity with time")
sns.lineplot(data=df_monthly_mean)
[Figure: line plot of monthly mean apparent temperature and humidity over time]
Here we can see that humidity did not change much over the 10 years, whereas the apparent temperature did.

Relation between Apparent temperature and Humidity-
sns.set_style("darkgrid")
sns.regplot(data=df_monthly_mean, x="Apparent Temperature (C)", y="Humidity", color="g")
plt.title("Relation between Apparent Temperature (C) and Humidity")
plt.show()
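The regression plot shows the direction of the relationship; a correlation coefficient puts a number on it. This is a minimal sketch using `Series.corr` (Pearson by default) on a few made-up monthly means, so the printed value says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical monthly means, for illustration only.
sample = pd.DataFrame({
    "Apparent Temperature (C)": [-4.2, 2.0, 12.1, 20.5, 23.0],
    "Humidity": [0.83, 0.78, 0.73, 0.65, 0.62],
})

# Pearson correlation between the two columns; negative means that
# humidity tends to fall as the apparent temperature rises.
r = sample["Apparent Temperature (C)"].corr(sample["Humidity"])
print(round(r, 3))
```

On the real data the same call would be df_monthly_mean["Apparent Temperature (C)"].corr(df_monthly_mean["Humidity"]).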
To retrieve the data for one particular month from every year, for example January, filter on the month number with this command:
df1 = df_monthly_mean[df_monthly_mean.index.month==1]
print(df1)

df1.dtypes

Plotting the January apparent temperature and humidity for each year:
import matplotlib.dates as mdates

fig, ax = plt.subplots(figsize=(15,5))
# the slice starts after January 2006, so the plot covers January 2007 to January 2016
ax.plot(df1.loc['2006-04-01':'2016-04-01', 'Apparent Temperature (C)'], marker='o', linestyle='-',label='Apparent Temperature (C)')
ax.plot(df1.loc['2006-04-01':'2016-04-01', 'Humidity'], marker='o', linestyle='-',label='Humidity')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.legend(loc = 'center right')
ax.set_xlabel('Month of January')
[Figure: January apparent temperature and humidity for each year]

Here we see that humidity stayed almost the same over the last 10 years, while the apparent temperature for this month varied from year to year, with its minimum in 2010 and its maximum in 2007.


