2 min read · from Data Science

How to use NLP to compare text from two different corpora?

I am not well versed in NLP, so hopefully someone can help me out here. I am looking at safety incidents for my organization. I want to compare the text of incident reports with the text of observations to investigate whether our observations are deterring incidents.

I have a dataset of the incidents and a dataset of the observations. Both datasets have a free-text field that contains the description of the incident or observation. There is no reliable link between observations and incidents (for example, nothing that lets me say these observations were monitoring activity X on contract Y, and an incident also occurred during activity X on contract Y).

My feeling is that the observations are just busy work; they don’t actually observe the activities that need safety improvement. The correlation between the number of observations and the number of incidents is weak, but I want to make a stronger case. I want to investigate this by using NLP to describe the incidents, then describe the observations, and see if there is a difference in content. I can at the very least produce word counts and compare the top terms, but I don’t think that gets me where I need to be on its own.
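One possible step beyond raw top-term lists is to score each term by how much more often it appears in one corpus than in the other. The sketch below computes a smoothed log-ratio of relative frequencies using only the standard library. The sample texts, the `tokenize` helper, and the `smoothing` parameter are illustrative assumptions, not anything from the actual datasets.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Very simple tokenizer (an assumption): lowercase alphabetic runs.
    return re.findall(r"[a-z']+", text.lower())


def log_ratio_scores(incident_texts, observation_texts, smoothing=1.0):
    """Score each term by log(p_incidents / p_observations).

    Positive scores mean the term is relatively more common in incidents,
    negative scores mean it is relatively more common in observations.
    Additive smoothing keeps terms that are absent from one corpus finite.
    """
    inc = Counter(t for doc in incident_texts for t in tokenize(doc))
    obs = Counter(t for doc in observation_texts for t in tokenize(doc))
    vocab = set(inc) | set(obs)
    n_inc = sum(inc.values()) + smoothing * len(vocab)
    n_obs = sum(obs.values()) + smoothing * len(vocab)
    scores = {}
    for term in vocab:
        p_inc = (inc[term] + smoothing) / n_inc
        p_obs = (obs[term] + smoothing) / n_obs
        scores[term] = math.log(p_inc / p_obs)
    return scores


# Hypothetical toy data standing in for the two free-text fields.
incidents = ["worker fell from ladder", "ladder slipped on wet floor"]
observations = ["worker wearing hard hat", "hard hat inspection completed"]

scores = log_ratio_scores(incidents, observations)
ranked = sorted(scores, key=scores.get, reverse=True)
print("most incident-specific:", ranked[:3])
print("most observation-specific:", ranked[-3:])
```

Terms at the two ends of the ranking are the ones that most distinguish incidents from observations. On real data, stop-word removal and a minimum count threshold would probably be needed before the ranking is meaningful.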

I have used some topic modeling (Latent Dirichlet Allocation) to get an idea of the topics in each, but I’m hitting a wall trying to compare the topics from the incidents to the topics from the observations.
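One hedged way to compare topics across the two models is to treat each topic as a probability distribution over words and measure the Jensen-Shannon divergence between every incident topic and every observation topic. The sketch below assumes each topic has already been exported as a word-to-probability dictionary that sums to 1 (for example, by normalizing a topic's word weights). The topic names and numbers are invented for illustration.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word->probability dicts.

    Returns 0.0 for identical distributions and 1.0 for distributions
    that share no words.
    """
    total = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)
        if pw > 0:
            total += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            total += 0.5 * qw * math.log2(qw / mw)
    return total


def best_matches(incident_topics, observation_topics):
    """For each incident topic, find the closest observation topic."""
    matches = {}
    for name, topic in incident_topics.items():
        closest = min(
            observation_topics,
            key=lambda o: js_divergence(topic, observation_topics[o]),
        )
        matches[name] = (closest, js_divergence(topic, observation_topics[closest]))
    return matches


# Hypothetical topics; real ones would come from the fitted LDA models.
incident_topics = {
    "falls": {"ladder": 0.5, "fall": 0.3, "height": 0.2},
    "vehicles": {"truck": 0.6, "reverse": 0.4},
}
observation_topics = {
    "ppe": {"helmet": 0.5, "gloves": 0.5},
    "ladders": {"ladder": 0.6, "inspection": 0.4},
}

matches = best_matches(incident_topics, observation_topics)
for name, (closest, distance) in matches.items():
    print(f"{name}: closest observation topic = {closest}, JSD = {distance:.3f}")
```

An incident topic whose divergence stays close to 1 against every observation topic would be a theme the observations do not appear to cover. An alternative design, not shown here, is to fit a single topic model on both corpora combined and compare how much of each topic comes from incidents versus observations, which avoids matching topics across separately fitted models.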

Does anyone have ideas?

submitted by /u/iwannabeunknown3