Construction site accident analysis using text mining and natural language processing techniques

In this paper, the author describes an approach to protect workers at construction sites from accidents by analysing past accident data with machine learning algorithms, text mining techniques such as TF-IDF (Term Frequency-Inverse Document Frequency), and natural language processing steps such as removing special symbols, removing stop words, and stemming.

Past accident data contains details of accidents. By building machine learning models on this data we can identify the causes of accidents, and by giving a model the description of new work as test data we can predict the likely cause of an accident and take steps to avoid it. These techniques can also help in extracting dangerous objects such as misused tools, nearby sharp objects, damaged equipment, etc.

To provide safety to workers, the author covers the points below in this paper.

1) Various text mining and NLP techniques are explored with respect to construction site accident analysis. Using these techniques we remove stop words, punctuation and special symbols, and apply stemming to clean the past accident data. After cleaning, we convert all text to numeric vectors using the TF-IDF technique. A TF-IDF vector holds a frequency-based weight for each word, and we use these vectors to train the machine learning model. Whenever we give new test data, it is also converted to a TF-IDF vector and applied to the trained model, which searches for similar data and returns it as the prediction. The example below describes how to convert text to a vector.

Suppose I have 3 sentences

Sentence 1: An apple a day keep doctor away

Sentence 2: apple good for health

Sentence 3: shipment of gold damage in fire

First we remove stop words such as ‘an, a, for, of, in’ from the sentences, then take the remaining words as the columns of the vector. After forming the columns, we put the count of each word as the value in that column. See the vector columns below.

            apple  day  keep  doctor  away  good  health  shipment  gold  damage  fire
Sentence 1  1      1    1     1       1     0     0       0         0     0       0
Sentence 2  1      0    0     0       0     1     1       0         0     0       0
Sentence 3  0      0    0     0       0     0     0       1         1     1       1

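So all 3 sentences are converted to vectors just by putting the count of each word as the vector values: if a sentence contains the column word we put its count, and if the sentence does not contain the word we put 0 in that column. (These counts are the term-frequency part; full TF-IDF additionally scales each count by the word's inverse document frequency, so words that appear in many sentences get a lower weight.) Now to check similarity we can multiply one row with another, column by column, and add up the products (the dot product). If the result is greater than 0 the two sentences share at least one word, otherwise they do not.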

In the above matrix, if we multiply the sentence 1 row with the sentence 2 row we get 1, which is greater than 0, so there is similarity: both sentences contain 1 common word, ‘apple’. Similarly, if we multiply the sentence 1 row with the sentence 3 row we get 0, which means there is no similarity between sentences 1 and 3, and we can see that they have no common words.
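The three-sentence example can be reproduced with a short Python sketch. It is only an illustration of the count-vector step and the dot-product check, not the paper's implementation: the stop word list is limited to the five words needed for this example, stemming is left out, and the function names are made up for this post.

```python
import string

# Stop words for this toy example only; a real pipeline would use a fuller list.
STOP_WORDS = {"an", "a", "for", "of", "in"}

def clean(sentence):
    """Lowercase the text, strip punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = sentence.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]

def build_vocabulary(token_lists):
    """Collect distinct words in order of first appearance (the vector columns)."""
    vocab = []
    for tokens in token_lists:
        for word in tokens:
            if word not in vocab:
                vocab.append(word)
    return vocab

def count_vector(tokens, vocab):
    """One row of the matrix: how often each column word occurs in the sentence."""
    return [tokens.count(word) for word in vocab]

def dot(u, v):
    """Multiply two rows column by column and add up the products."""
    return sum(a * b for a, b in zip(u, v))

sentences = [
    "An apple a day keep doctor away",
    "apple good for health",
    "shipment of gold damage in fire",
]
tokens = [clean(s) for s in sentences]
vocab = build_vocabulary(tokens)
vectors = [count_vector(t, vocab) for t in tokens]

print(vocab)
for row in vectors:
    print(row)
print(dot(vectors[0], vectors[1]))  # 1 (shared word: apple)
print(dot(vectors[0], vectors[2]))  # 0 (no shared word)
```

The printed rows match the matrix above, and the two dot products give 1 and 0 as described.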

2) An ensemble algorithm, which has not been well studied in this field, is proposed to classify the causes of accidents, and the SQP algorithm is utilized to search for the optimal weights of the ensemble model. In this technique we use ensemble algorithms such as random forest and a voting classifier with SQP (Sequential Quadratic Programming) to classify causes of accidents. Using SQP we can assign a weight to each classifier in the ensemble, which helps the ensemble predict the correct cause of an accident.

3) A rule based chunker is developed for dangerous object extraction. Neither the SQP optimization algorithm nor a rule based chunker has been found in the state of the art for this field. Rule based chunking means tagging each word of a sentence with its Part Of Speech (POS) and applying rules over the tags to find dangerous objects. When we apply POS tagging to a sentence, dangerous objects fall under the NOUN tag, so by extracting noun phrases from sentences we can identify the dangerous objects which cause accidents.
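To make the chunking idea in point 3 concrete, here is a minimal Python sketch. It is not the chunker from the paper: the POS tags come from a tiny hand-written dictionary (an assumption made so the example runs without any NLP library; a real system would use a trained POS tagger), and the single rule used, zero or more adjectives followed by one or more nouns, is just one plausible noun-phrase rule.

```python
# Toy part-of-speech lexicon, written by hand for this example only.
TOY_LEXICON = {
    "cutting": "VERB", "down": "ADP", "a": "DET",
    "large": "ADJ", "horizontal": "ADJ", "pipe": "NOUN", "block": "NOUN",
}

def tag(sentence):
    """Return (word, tag) pairs; words missing from the lexicon get the tag 'X'."""
    return [(w, TOY_LEXICON.get(w, "X")) for w in sentence.lower().split()]

def noun_phrases(tagged):
    """Apply one chunk rule: zero or more ADJ followed by one or more NOUN."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        while j < n and tagged[j][1] == "ADJ":   # optional adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":  # required nouns
            k += 1
        if k > j:                                # at least one noun matched
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:                                    # no noun here, move on
            i = max(j, i + 1)
    return phrases

tagged = tag("cutting down a large horizontal pipe block")
print(noun_phrases(tagged))  # ['large horizontal pipe block']
```

For the work description used here, the rule picks out "large horizontal pipe block" as the candidate dangerous object.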

To implement this project the author uses the OSHA dataset, and the effectiveness of the proposed approaches is verified by the experiment results. The OSHA dataset contains past accident data, and using this dataset we analyse the performance of various machine learning algorithms: SVM, Decision Tree, Naïve Bayes, Logistic Regression, KNN, Ensemble Random Forest, and the proposed Voting Classifier, which is built on the 5 base classifiers (SVM, Naïve Bayes, Decision Tree, KNN and Logistic Regression). The voting classifier takes the predictions of all 5 classifiers and combines them by weighted vote, so classifiers that give better accuracy have more influence on the prediction for future data.
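The weighted voting idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: the class probabilities below are invented numbers standing in for the outputs of three hypothetical base classifiers, and the weights are fixed by hand, whereas in the proposed approach they would be found by SQP (SciPy's `scipy.optimize.minimize` with `method="SLSQP"` is one SQP-type solver that could be used for that search).

```python
def weighted_vote(probabilities, weights):
    """Combine per-classifier class probabilities into one prediction.

    probabilities: list of dicts (label -> probability), one per classifier
    weights: list of non-negative weights, one per classifier
    """
    total = sum(weights)
    combined = {}
    for probs, w in zip(probabilities, weights):
        for label, p in probs.items():
            # Each classifier contributes in proportion to its normalised weight.
            combined[label] = combined.get(label, 0.0) + (w / total) * p
    best = max(combined, key=combined.get)
    return best, combined

# Invented outputs of three hypothetical classifiers for one accident report.
outputs = [
    {"caught in or between": 0.60, "electrocution": 0.40},
    {"caught in or between": 0.30, "electrocution": 0.70},
    {"caught in or between": 0.45, "electrocution": 0.55},
]

equal_label, _ = weighted_vote(outputs, [1, 1, 1])
tuned_label, _ = weighted_vote(outputs, [0.7, 0.2, 0.1])
print(equal_label)  # electrocution
print(tuned_label)  # caught in or between
```

With equal weights the second and third classifiers outvote the first. When the first classifier is trusted more (weight 0.7), its opinion decides the result, which is the effect the weights are meant to have.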

The dataset is saved as ‘dataset/OSHA.csv’, and you can open it to see the details. Below are some descriptions of new work. They carry no information about what accident the work can cause, but the machine learning model can predict and display the likely accident cause.

cutting down a large horizontal pipe block

installing roof decking on a flat roof by carrying and placing decking material

portable storage tank and a running powered industrial truck Caught in or between

electrical transformer box distribution line electrocuted electric ladder work onto a 13800 volt power line work electrical parts

The sentences above give some work details, and the accidents that can happen while doing such work can be predicted with machine learning algorithms.
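As a rough illustration of how such a prediction could work, the Python sketch below converts text to TF-IDF vectors and returns the cause of the most similar past record. Everything specific in it is an assumption: the three training records are invented for this post and are not taken from the OSHA dataset, the IDF formula used is log(N / df) + 1 (libraries use several variants), and picking the nearest record by cosine similarity stands in for the trained classifiers discussed above.

```python
import math
from collections import Counter

# Invented toy records (NOT from the OSHA dataset), for illustration only.
TOY_RECORDS = [
    ("worker fell from flat roof while placing decking material", "fall"),
    ("employee touched power line with ladder and was electrocuted", "electrocution"),
    ("worker caught between storage tank and industrial truck", "caught in or between"),
]

def tokenize(text):
    return text.lower().split()

def fit_idf(token_lists):
    """Inverse document frequency per word: log(N / df) + 1 (one common variant)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {w: math.log(n_docs / df) + 1.0 for w, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; words never seen in training are ignored."""
    counts = Counter(tokens)
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

train_tokens = [tokenize(text) for text, _ in TOY_RECORDS]
idf = fit_idf(train_tokens)
train_vectors = [
    (tfidf(tokens, idf), label)
    for tokens, (_, label) in zip(train_tokens, TOY_RECORDS)
]

def predict_cause(description):
    """Return the label of the most similar toy record."""
    query = tfidf(tokenize(description), idf)
    _, label = max(train_vectors, key=lambda pair: cosine(query, pair[0]))
    return label

print(predict_cause(
    "installing roof decking on a flat roof by carrying and placing decking material"
))  # fall
```

On these toy records the roof decking description is matched to the "fall" record because they share the words roof, flat, placing, decking and material. Real predictions depend on the actual dataset and trained models.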

Author : ss
