Cardamom is one of the most ancient and valuable spice crops of India. cardamom is a native of evergreen forests of South India, growing wild in the western Ghats between 2500 and 5000 ft.
I grew up in the hill stations of Kerala, the biggest producer of Cardamom in India. This was my humble attempt to help my family business with my data science skills😁. This project aimed to create visualizations to understand whether there are any patterns and trends in the price of cardamom.
Data Collection and Preprocessing
Data Source
Cardamom auction data was publicly available on the official Spices Board website.
Preprocessing
Changed the column names.
#Column names looked super messy.Let's change them
df1.columns = ['Date', 'Auctioneer', 'Lots', 'Arrived_Kg', 'Sold_Kg', 'MaxPrice', 'AvgPrice']
PythonSorted the data and transformed the date format.
#I want the data set to be in ascending order of dates.
#Let's clean up date column so that entries will look uniform.
df1["Date"] = pd.to_datetime(df1["Date"])
df2 = df1.sort_values(by="Date")
PythonLet’s see the distribution of different features along different months, different days of the week, etc.
Split the date column into day, month and year columns.
Dates = list(df2.Date)
Month = []
Day =[]
Year =[]
for date in Dates :
Month.append(date.month)
Year.append(date.year)
Day.append(date.weekday())
#Adding new columns
df2["Month"] = Month
df2["Year"] = Year
df2["Day"] = Day
#Let's split this dataset into some small datasets to draw some insights
#After consulting with a domain expert I got that the first entry is afternoon and second entry is morning in the sorted dataset.
#Splitting forenoon and afternoon
morning_df = df2.iloc[1::2].reset_index()
#Splitting forenoon and afternoon
afternoon_df = df2.iloc[0::2].reset_index()
#afternoon_df.head
morning_df.head()
PythonExploratory Data Analysis (EDA)
Plotted the average price for morning and afternoon slots separately to see if there was a pattern based on the time of the day.
#Trying to find if morning and afternoon matters a lot
plt.figure(figsize=(40, 7))
plt.subplot(1,2,1)
plt.plot(morning_df.AvgPrice.values, color='green', label='Morning')
plt.plot(afternoon_df.AvgPrice.values, color='red', label='Afternoon')
plt.title('Average price')
plt.xlabel('Days')
plt.ylabel('price')
plt.legend(loc='best')
PythonThe plot looks a bit congested, doesn’t it? Let’s recreate the same plot for a smaller time frame.
plt.figure(figsize=(40, 7))
plt.subplot(1,2,1)
plt.plot(morning_df[1000:1100].AvgPrice.values, color='blue', label='Morning')
plt.plot(afternoon_df[1000:1100].AvgPrice.values, color='orange', label='Afternoon')
plt.title('Average price')
plt.xlabel('Days')
plt.ylabel('price')
plt.legend(loc='best')
PythonThe graphs don’t seem to vary a lot between morning and afternoon slots.
Plotted the all-year average price over months (separate for each year). Took the mean of morning and evening prices to get the average price for the day.
#Splitting year wise
#First I'm gonna merge records of each day by taking average of prices,arrived,sold and no.of lots.
aggregated_by_date_df = df2.groupby("Date").mean().reset_index()
df_2020=aggregated_by_date_df[aggregated_by_date_df.Year==2020.0].reset_index()
df_2019=aggregated_by_date_df[aggregated_by_date_df.Year==2019.0].reset_index()
df_2018=aggregated_by_date_df[aggregated_by_date_df.Year==2018.0].reset_index()
df_2017=aggregated_by_date_df[aggregated_by_date_df.Year==2017.0].reset_index()
df_2016=aggregated_by_date_df[aggregated_by_date_df.Year==2016.0].reset_index()
#aggregated_by_date_df.head()
Pythonax = sns.barplot(y=aggregated_by_date_df['AvgPrice'], x=aggregated_by_date_df['Month'])
ax.set_title("All years-Average price over Months")
PythonPlotted all-year average price over the days of the week to see if there is any trend through out the week.
ax = sns.barplot(y=aggregated_by_date_df['AvgPrice'], x=aggregated_by_date_df['Day'])
ax.set_title("All year-Average price over weekdays")
PythonLet’s look at just 2020.
#2020
ax = sns.barplot(y=df_2020['AvgPrice'], x=df_2020['Day'])
ax.set_title("2020-Average over weekdays")
#,order=["Mon", "Tue","Wed","Thu","Fri","Sat"]
PythonWhat about different months?
ax = sns.barplot(y=df_2020['AvgPrice'], x=df_2020['Month'])
ax.set_title("2020-Average")
Python2018
Plotted average price for different auctioneers.
#Visualizing different Auctioneers
plt.figure(figsize=(40, 7))
plt.subplot(1,2,1)
plt.plot(df2[df2.Auctioneer == 'South Indian Green Cardamom Company Ltd, Kochi'].AvgPrice.values, color='green', label='South Indian')
plt.plot(df2[df2.Auctioneer == 'Mas Enterprises, Vandanmettu'].AvgPrice.values, color='red', label='Mass')
plt.plot(df2[df2.Auctioneer == 'Header Systems (India) Limited, Nedumkandam'].AvgPrice.values, color='blue', label='Header')
plt.plot(df2[df2.Auctioneer == "Cardamom Planters' Association, Santhanpara"].AvgPrice.values, color='yellow', label='Cardamom planters')
plt.title('Average price')
plt.xlabel('Days')
plt.ylabel('price')
plt.legend(loc='best')
PythonLooks like all top auctioneers act the same.
#Visualizing different Auctioneers over time
plt.figure(figsize=(40, 7));
plt.subplot(1,2,1);
plt.plot(df2[df2.Auctioneer == 'South Indian Green Cardamom Company Ltd, Kochi'].MaxPrice.values, color='green', label='Maximum price')
plt.plot(df2[df2.Auctioneer == 'South Indian Green Cardamom Company Ltd, Kochi'].AvgPrice.values, color='red', label='Average price')
plt.title('Average price')
plt.xlabel('Days')
plt.ylabel('price')
plt.legend(loc='best')
Pythonaggregated_by_auctioneer1=aggregated_by_auctioneer.sort_values(by='Arrived_Kg',ascending = False)
ax = sns.barplot(y=aggregated_by_auctioneer1['Auctioneer'], x=aggregated_by_auctioneer1['Arrived_Kg'])
ax.set_title("Arrived kgs per auctioneer")
Pythonaggregated_by_auctioneer2=aggregated_by_auctioneer.sort_values(by='AvgPrice',ascending = False)
ax = sns.barplot(y=aggregated_by_auctioneer2['Auctioneer'], x=aggregated_by_auctioneer2['AvgPrice'])
ax.set_title("Average price for each auctioneer")
PythonGenerated a scatter plot to see if there is any correlation between the number of lots that come in on a day and the average price.
plt.figure(figsize=(16, 10))
sns.scatterplot(aggregated_by_date_df["Lots"],aggregated_by_date_df["AvgPrice"], hue =aggregated_by_date_df["Year"], palette="deep")
Pythonplt.figure(figsize=(16, 10))
sns.scatterplot(aggregated_by_date_df["Lots"],aggregated_by_date_df["MaxPrice"], hue =aggregated_by_date_df["Year"], palette="deep")
PythonGenerated a swarm plot of price variation over 12 months of the year.
Generated a heat map to see if there is any correlation between features
cor = aggregated_by_date_df.corr()
plt.figure(figsize=(10,10))
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
PythonResults and Evaluation
The only pattern I could find is that cardamom prices are increasing as the years go by. I couldn’t find any patterns between auction slots, auction days or months with maximum or average prices.
Future Work
Predictive modeling to predict future prices.
Resources
The full code is available on GitHub (If you have trouble viewing the notebook, copy the link and paste it into nbviewer)