
Software Products

Introduction

The TXM platform combines powerful techniques for the analysis of large bodies of texts into a modular and open-source framework (Heiden, 2010; Heiden et al., 2010; Pincemin et al., 2010). It grew out of the Textométrie research project[1], which launched a new generation of textometric research in synergy with existing corpus and statistical technologies (Unicode, XML, TEI, NLP, CQP and R).

The TXM platform currently allows users to build and analyse any kind of digitized textual corpus, with or without prior annotation. It is distributed as a local software application for Windows, Linux or Mac (based on RCP technology) or as an online portal web application hosted on a server (based on GWT technology). Currently:
- it can build any subcorpus based on any metadata (date, author, genre, etc.) at any structural level (text, section, etc.) of a corpus;
- it builds KWIC concordances from the results of word pattern queries;
- it produces a progression graphic of any word pattern along a corpus;
- it builds an HTML edition, or 'text view', of all the texts of a corpus;
- it builds frequency lists of word properties and of the results of word pattern queries (with the CQP search engine);
- it builds various contingency tables based on word and text properties;
- it computes the specificity score of each word in a subcorpus (to build lists of the most specific words of each part);
- it computes the factorial analysis of any subcorpus and produces the corresponding factorial plane graphics (based on the FactoMineR statistical R package);
- it computes the ascending hierarchical clustering of any subcorpus and produces the corresponding hierarchy tree;
- it computes the co-occurrence score of each word with a word pattern within a window of any size;
- it can import from various textual sources to build corpora. Nine different import modules are available[2]: raw text combined with flat metadata (CSV), raw XML/w + metadata, XML-TEI BFM[3], XML-TXM[4], Transcriber + metadata[5], Hyperbase, Alceste and Cordial, plus beta prototypes (TMX, Factiva...);
- it manages NLP tools that process the input files during import. Three different tokenizers are currently available (one of which is TEI compatible), and TreeTagger and TnT plugins are available for POS tagging. The resulting annotations are then available inside the platform as word properties usable with the search engine;
- it exports all its results as CSV for lists and tables, or as SVG for graphics;
- it can be driven by Groovy or R scripts.
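To illustrate the KWIC (KeyWord In Context) concordances mentioned above, here is a minimal sketch of the general mechanism in Python. The function name, tokenization and window logic are illustrative assumptions, not TXM's actual implementation (in TXM, matches come from CQP word pattern queries rather than a plain string comparison):

```python
# Minimal KWIC sketch: find occurrences of a keyword in a tokenized text
# and collect each one with a fixed-size left/right context.
# Names and logic are illustrative, not TXM's API.

def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) triples."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "the cat sat on the mat while the dog slept".split()
for left, kw, right in kwic(tokens, "the"):
    print(f"{left:>20} | {kw} | {right}")
```

In a real concordancer the keyword column is aligned, as in the print format above, so that all matches can be scanned vertically at a glance.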
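The specificity score mentioned above is classically based on a hypergeometric model: a word is specific to a part if its observed frequency there is improbably high (or low) given its overall corpus frequency. The sketch below is a stdlib-only illustration of that model, with a signed log10 score as is conventional in textometry; it is not TXM's code, and the function names are assumptions:

```python
# Hypergeometric specificity sketch (illustrative, not TXM's implementation).
# T: corpus size in tokens, t: part size, F: corpus frequency of the word,
# f: frequency of the word in the part.
from math import comb, log10

def hypergeom_pmf(f, F, t, T):
    """Probability of drawing exactly f occurrences of the word in the part."""
    return comb(F, f) * comb(T - F, t - f) / comb(T, t)

def specificity(f, F, t, T):
    """Signed specificity score on a log10 scale: positive when the word is
    over-represented in the part, negative when it is under-represented."""
    expected = F * t / T
    if f >= expected:
        # over-represented: score is -log10 of the right tail P(X >= f)
        p = sum(hypergeom_pmf(k, F, t, T) for k in range(f, min(F, t) + 1))
        return -log10(p)
    else:
        # under-represented: score is log10 of the left tail P(X <= f)
        p = sum(hypergeom_pmf(k, F, t, T) for k in range(0, f + 1))
        return log10(p)

# A word occurring 50 times in a 1000-token corpus, seen 12 times in a
# 100-token part (expected: 5), gets a positive score.
print(specificity(12, 50, 100, 1000))
```

Ranking the words of a part by this score yields the "most specific words of each part" lists that the platform produces.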

The TXM platform is currently used in research projects in various fields of the humanities, such as history, literature, geography, linguistics, sociology and political science. Scientific publications in textometry have been presented at the "Journées internationales d'Analyse statistique des Données Textuelles" (JADT) international conference (http://jadt.org; see also Heiden and Pincemin, 2008).[6]

See also

- TXM on TEI wiki (Text Encoding Initiative)

Notes:

  1. Funded by French ANR grant ANR-06-CORP-029, 2007-2010 - see Home of this web site.
  2. See their respective descriptions in the online TXM reference manual.
  3. As defined by the TEI compatible XML text encoding guidelines of the BFM project
  4. An XML-TEI compatible NLP oriented TXM internal pivot format
  5. As defined by the Transcriber software: http://trans.sourceforge.net
  6. Further readings about the Textométrie project and the TXM platform: Publications of the project; Reference manual; Users wiki; Developers wiki