This is an exploratory data analysis of the COVID-19 data published by Johns Hopkins University. The objective is to understand the total confirmed cases and the infection rate across countries. Because populations vary widely, the maximum infection rate (taken here as the largest single-day increase in confirmed cases) shows the spread better than the total confirmed cases alone.
Data Collection and Preprocessing
Data Source:
The data is available on Kaggle. It is a daily-updated version of the COVID-19 Data Repository maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU).
The data covers:
- confirmed cases and deaths on a country level
- confirmed cases and deaths by US county
- some metadata that’s available in the raw JHU data
Data Preprocessing:
The latitude and longitude columns are not needed for this analysis, so they are dropped.
# Latitude and longitude are not useful features here, so drop them.
covid_df.drop(columns=['Lat', 'Long'], inplace=True)
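The later steps use a `covid_df_aggregated` frame whose construction is not shown. A minimal sketch of one way to build it follows. It assumes the raw table has a `Country/Region` column and one column of cumulative counts per date, as in the raw JHU time-series files. The Kaggle version may use different column names, and the toy values below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the raw table; values are illustrative, not real data.
covid_df = pd.DataFrame(
    {
        "Province/State": ["Hubei", "Beijing", None],
        "Country/Region": ["China", "China", "India"],
        "1/22/20": [444, 14, 0],
        "1/23/20": [444, 22, 0],
        "1/24/20": [549, 36, 1],
    }
)

# Every column other than the two label columns is treated as a date column.
label_columns = ("Province/State", "Country/Region")
date_columns = [c for c in covid_df.columns if c not in label_columns]

# Sum the province-level rows so that each country has a single row.
covid_df_aggregated = covid_df.groupby("Country/Region")[date_columns].sum()

print(covid_df_aggregated)
```

With one row per country and one column per date, `covid_df_aggregated.loc['India']` returns that country's time series, which is the form the plots below rely on.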
Exploratory Data Analysis
Let’s visualize the top 10 countries in confirmed cases.
# Sort countries by the latest cumulative count and keep the top 10.
latest = covid_df_aggregated.columns[-1]
most_affected = covid_df_aggregated.sort_values(by=latest, ascending=False).head(10)
# Bar plot of the top 10 countries by confirmed cases
ax = sns.barplot(x=most_affected[latest], y=most_affected.index.values)
ax.set_title("Total confirmed cases so far")
Let’s plot the confirmed cases in India.
covid_df_aggregated.loc['India'].plot()
Let’s compare a few countries.
# Cumulative confirmed cases for four countries on one set of axes
covid_df_aggregated.loc['India'].plot()
covid_df_aggregated.loc['China'].plot()
covid_df_aggregated.loc['US'].plot()
covid_df_aggregated.loc['Italy'].plot()
plt.legend()
Let’s look at the infection rate.
# Find the maximum infection rate (largest single-day increase) for every country.
countries = list(covid_df_aggregated.index)
max_infection_rates = []
for c in countries:
    max_infection_rates.append(covid_df_aggregated.loc[c].diff().max())
max_infection_rates
# Add the result to the data as a new column.
covid_df_aggregated["max_infection_rates"] = max_infection_rates
# Keep only the column we need.
covid_data = pd.DataFrame(covid_df_aggregated["max_infection_rates"])
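The per-country loop above can also be written without explicit iteration. The sketch below assumes the same layout (rows are countries, columns are dates, values are cumulative counts) and uses a toy frame with made-up numbers rather than the real data.

```python
import pandas as pd

# Toy cumulative counts: rows are countries, columns are dates (illustrative values).
cumulative = pd.DataFrame(
    {"d1": [0, 10], "d2": [5, 30], "d3": [20, 35]},
    index=["A", "B"],
)

# diff(axis=1) turns cumulative totals into day-over-day increases;
# max(axis=1) then picks the largest daily increase for each country.
max_infection_rates = cumulative.diff(axis=1).max(axis=1)

print(max_infection_rates)
```

Either version should run before the `max_infection_rates` column is added to the frame. Otherwise the new column would be treated as one more date when differencing.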
Let’s visualize the infection rate.
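One possible way to draw this is a horizontal bar chart of the countries with the highest maximum infection rate. This sketch is not the original plot: it uses pandas' built-in Matplotlib plotting rather than Seaborn, and a hypothetical stand-in for `covid_data` with made-up values. The output file name is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for covid_data; values are illustrative only.
covid_data = pd.DataFrame(
    {"max_infection_rates": [120.0, 950.0, 40.0, 600.0]},
    index=["A", "B", "C", "D"],
)

# Sort ascending and keep the last 10 rows, so the largest bar is drawn at the top.
top = covid_data["max_infection_rates"].sort_values().tail(10)

ax = top.plot.barh()
ax.set_xlabel("Largest single-day increase in confirmed cases")
ax.set_title("Maximum infection rate by country")
plt.tight_layout()
plt.savefig("max_infection_rates.png")
```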
Libraries
- Pandas
- NumPy
- Seaborn
- Matplotlib