Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005), there are three perspectives of text mining: information extraction, data mining, and knowledge discovery in databases (KDD). Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
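The three stages described above (structuring the input, deriving patterns, and inspecting the output) can be sketched in a few lines. This is a minimal illustration only, not a reference implementation: the toy corpus, the stop-word list, the `terms` table layout, and the function names `structure`, `build_database`, and `frequent_terms` are all assumptions made for the example, and "terms shared by several documents" stands in for whatever pattern a real system would look for.

```python
import re
import sqlite3

# Hypothetical toy corpus and stop-word list, chosen only for illustration.
DOCUMENTS = {
    1: "The battery life of this phone is great.",
    2: "Great screen, but the battery drains fast.",
    3: "The screen cracked after a week.",
}
STOP_WORDS = {"the", "of", "this", "is", "but", "a", "after"}


def structure(text):
    """Parse raw text into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_database(documents):
    """Insert one (doc_id, term) row per token into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE terms (doc_id INTEGER, term TEXT)")
    for doc_id, text in documents.items():
        conn.executemany(
            "INSERT INTO terms VALUES (?, ?)",
            [(doc_id, term) for term in structure(text)],
        )
    return conn


def frequent_terms(conn, min_docs=2):
    """Return (term, document count) for terms found in at least `min_docs` documents."""
    rows = conn.execute(
        "SELECT term, COUNT(DISTINCT doc_id) AS n FROM terms "
        "GROUP BY term HAVING COUNT(DISTINCT doc_id) >= ? "
        "ORDER BY n DESC, term",
        (min_docs,),
    )
    return rows.fetchall()


conn = build_database(DOCUMENTS)
# The final stage, evaluation and interpretation, is left to the reader:
# here the output is simply printed for inspection.
print(frequent_terms(conn))
```

In this toy corpus the query surfaces the terms that recur across reviews; deciding whether such terms are relevant, novel, or interesting is the interpretation step that the code does not attempt.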
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via the application of natural language processing (NLP), different types of algorithms and analytical methods. An important phase of this process is the interpretation of the gathered information.
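As a small illustration of lexical analysis, the sketch below computes a word frequency distribution with Python's standard `collections.Counter`. The tokenizer (lowercase runs of ASCII letters), the sample sentence, and the function name `word_frequencies` are assumptions made for this example; real systems typically use a more careful tokenizer.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word token occurs in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical sample input, used only to demonstrate the function.
sample = "to be or not to be that is the question"
freq = word_frequencies(sample)
total = sum(freq.values())

# Print the most frequent words with their absolute and relative frequency.
for word, count in freq.most_common(3):
    print(f"{word}: {count} ({count / total:.0%})")
```

This is the sense in which text is "turned into data": once the text is reduced to counts, ordinary statistical and data mining methods can be applied to it.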
A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted. The document is the basic element when starting with text mining. Here, we define a document as a unit of textual data, which normally exists in many types of collections.
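To make the "populate a search index" case concrete, the following sketch treats each document as a unit of textual data and builds a simple inverted index over a small collection. It is one possible design under stated assumptions, not a description of any particular system: the `Document` class, the identifiers, the sample texts, and the functions `build_index` and `search` are all hypothetical, and the query semantics (every query term must appear) is a choice made for the example.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A unit of textual data with an identifier (hypothetical structure)."""
    doc_id: str
    text: str


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc in documents:
        for term in set(tokenize(doc.text)):
            index[term].add(doc.doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical collection mixing several kinds of written resources.
collection = [
    Document("email-1", "Meeting moved to Friday"),
    Document("review-7", "Friday delivery was late"),
    Document("article-3", "Delivery drones are coming"),
]
index = build_index(collection)
print(search(index, "friday delivery"))
```

The other branch mentioned above, modeling the document set for predictive classification, would start from the same tokenized documents but feed them to a classifier instead of an index.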