In this project we asked the questions: Are stock returns released around the 10k date correlated to the sentiment of the text within the 10k?. To answer this question, I downloaded 500 of the most recent 10k folders from the sec website. The code is able to calculate 3 measurements of returns: returns on the 10k release date, rolling returns for two days out form the release date, and rolling returns for ten days out from the release date. I then calculated sentiment scores of four variables by calculating the positive and negative sentiment of the 10k based on two lists of positive and negative words (BHR and LM). I then compared those variables later in the report below:
Sample:
Return variables:
pattern = r"\d{10}-\d{2}-\d{6}"
accessionnumber = re.findall(pattern, file_string)
accession_number_set = set(accessionnumber) # this made them uniuque numbers
accession_number_set # this is the set with only the unique accession numbers
accession_list = list(accession_number_set) #turned into a list
CIK = [str(int(num.split('-')[0])) for num in accession_list] #
ciks_accnums = pd.DataFrame({'acc_number_unique': accession_list, 'cik': CIK})
ciks_accnums
I was then able to merge this to the sp500 dataframe so the accession number is on the original dataset
SP500Merge_With_Rets = SP500Merge.merge(crsp.rename(columns={'ticker':'Symbol', 'date':'filing_date'}),
on=['Symbol', 'filing_date'],
how='inner',
validate='m:m')
Sentiment Variables:
In my code, I included only the 4 sentiment vairiables from the ML and BHR word datasets (due to sleep and sanity). To do this I:
BHR_negative = pd.read_csv('inputs/ML_negative_unigram.txt',
names=['word'])['word'].to_list()
RegexLM_pos = ['('+"|".join(LM_positive)+')']
RegexLM_pos_sentvar = ( len(re.findall(NEAR_regex(RegexLM_pos, partial=False, max_words_between=5), cleaned))
/ len(cleaned)
)
SP500Merge_With_Rets.loc[index, 'RegexLM_pos_sentvar'] = RegexLM_pos_sentvar
Caveat
Contextual Sentiment
While I was unable to correctly figure out how to define these variables and loop it all in with my code, I think that topics such as: Automation, Machine Learning, Medicine/Vaccine, Regulatory Policy, Changes in Supply/Demand, and Real Estate could all be cool topics to explore
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as pdr
import seaborn as sns
# these three are used to open the CCM dataset:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
analysis = pd.read_csv('output/analysis_sample.csv')
columnret = ['ret_t_t2', 'ret_t3_t10']
columnsent = ['RegexLM_neg_sentvar', 'RegexBHR_pos_sentvar', 'RegexBHR_neg_sentvar']
df_subset = analysis[columnret + columnsent]
correlations = df_subset.corr()
correlationstable = correlations.stack().reset_index()
correlationstable.columns = ['Variable 1', 'Variable 2', 'Correlation']
returns_corr = correlations_table[(correlations_table['Variable 1'].isin(returns_cols)) &
(correlations_table['Variable 2'].isin(sentiment_cols))]
returns_corr
Variable 1 | Variable 2 | Correlation | |
---|---|---|---|
2 | ret_t_t2 | RegexLM_neg_sentvar | 0.009049 |
3 | ret_t_t2 | RegexBHR_pos_sentvar | 0.092972 |
4 | ret_t_t2 | RegexBHR_neg_sentvar | 0.132312 |
7 | ret_t3_t10 | RegexLM_neg_sentvar | -0.129333 |
8 | ret_t3_t10 | RegexBHR_pos_sentvar | -0.049074 |
9 | ret_t3_t10 | RegexBHR_neg_sentvar | 0.059136 |
Correlation Scatterplot:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
returns_cols = ['ret_t_t2', 'ret_t3_t10']
sentiment_cols = ['RegexLM_neg_sentvar', 'RegexBHR_pos_sentvar', 'RegexBHR_neg_sentvar']
# Create a subset of the DataFrame with the relevant columns
df_subset = analysis[returns_cols + sentiment_cols]
# Create a grid of scatterplots for each sentiment variable against each return variable
sns.set(style='ticks')
sns.pairplot(df_subset, x_vars=returns_cols, y_vars=sentiment_cols, height=3, aspect=1.2, kind='scatter')
plt.show()
In the graphs above, we can see scatterplots of the correlation between returns and the sentiments from the 3 working word lists. Something interesting from these findings is that the graphs in the 3-10 day range reflect more of a correlated shape. For example, the LM negative 10 day graph shows a slightly negative correlation (which is what might be expected from how we framed out question in the first place). Also, the 2 day returns seem to have a tighter shape indicating a correlation close to zero. Most of these returns reflect a value close to zero which is not quite what we were looking for as far as results. Some explanations for this might include: error in the returns calculations or just the finding that sentiment does not reflect 2 and 10 day stock returns.
Some promising similar results are found within the literature of this topic: In “When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks” (The LM dataset), the researches found in their tests that when dividing firms into quintiles based on proportion of words, the 10ks produce “no discernable pattern”. An interesting point they made was that there may have been bias due to high freequency words used both positive and negative, but how this may not widely affect the overall findings. This research was alot more robust and includes variables such as different sized companys, and each companies different financial metrics in the inputs.
Looking to “The colour of finance words” (The BHR dataset) produced results based on both the ‘bag of words’ method and machine learning methods. They contest that machine learning methods outperform the bag of words as they can highlight which words might have more meaning, and words that are important but might be missed by humans. This can be extrapolated to the idea that specific words might have more weight than others to potentially correlate to, and predict future stock returns.
In conclusion, my results match those of Loughran and Mcdonald that when using a basket of words dictionary, there is no discernable correlation between positive/negative sentiments in 10k filings and the firms subsequent stock returns.