Rosette for E-Discovery
Expand Your e-Discovery Scope Beyond English
At its core, e-Discovery is about analyzing huge collections of unstructured content – documents, email, call logs, transcripts, contracts – to uncover information about people, places, and organizations.
In the age of globalization, this content may be written in different languages, using multiple scripts and character sets. The challenge is therefore how to efficiently search this multilingual text, extract entities with high accuracy and precision, and ensure that all the necessary information is revealed.
Basis Technology’s Rosette® suite of text analytics components provides a robust and scalable solution to this multilingual e-Discovery challenge. Through the combination of language identification, morphological analysis, entity extraction, and automatic name translation, Basis Technology can reveal the key information necessary to establish connections and build relationships.
Basis Technology helps the legal community meet its multilingual discovery challenges head-on with Rosette®, a linguistics platform proven in hundreds of commercial and government environments.
The Rosette software components are configured as building blocks, and work seamlessly within discovery workflows and information retrieval applications, covering the major European, Asian, and Middle Eastern languages. For legal professionals, Rosette provides the ability to examine multilingual text with unparalleled accuracy and efficiency.
Step 1: Language Identifier
Identify the language(s) in a document
The Rosette Language Identifier (RLI) identifies the language(s) and character encoding systems present in a document so that its textual content can be filtered and processed. Extracted text is converted to Unicode so that discovery and information retrieval applications can access a single data representation regardless of language. Using a module called the Language Boundary Locator, mixed-language documents are segmented into regions so that language-specific processing can be performed on each region.
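The two ideas in this step, normalizing text to Unicode and splitting a mixed-language document into regions, can be illustrated with a minimal Python sketch. This is not the Rosette API: the function names are hypothetical, the source encoding is assumed to be already known, and Unicode script names are used as a crude stand-in for true language boundary detection.

```python
import unicodedata


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known legacy encoding into a Unicode string.

    Assumption: the encoding has already been identified upstream.
    """
    return raw.decode(encoding)


def script_of(ch: str) -> str:
    """Return a coarse script label taken from the character's Unicode name."""
    if not ch.isalpha():
        return "COMMON"  # spaces, digits, punctuation belong to any region
    name = unicodedata.name(ch, "")
    # The first word of the name is typically the script: LATIN, CYRILLIC, ARABIC, CJK.
    return name.split(" ")[0] if name else "UNKNOWN"


def segment_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into (script, region) pairs wherever the script changes."""
    regions = []
    current = None
    buf = []
    for ch in text:
        script = script_of(ch)
        # A letter in a different script closes the current region.
        if script != "COMMON" and current is not None and script != current:
            regions.append((current, "".join(buf).strip()))
            buf = []
        if script != "COMMON":
            current = script
        buf.append(ch)
    if buf:
        regions.append((current or "COMMON", "".join(buf).strip()))
    return regions


if __name__ == "__main__":
    raw = "Договор подписан.".encode("koi8_r")  # simulate legacy-encoded input
    text = "Contract signed. " + to_unicode(raw, "koi8_r")
    for script, region in segment_by_script(text):
        print(script, "->", region)
```

Script-based splitting cannot tell apart languages that share a script (English and French, for example), which is why a statistical language identifier is needed in practice.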
Step 2: Base Linguistics
Apply linguistic intelligence to identify word forms, parts of speech, and sentence structure
Rosette Base Linguistics (RBL) examines documents and performs a complete morphological analysis so that text can be accurately filtered, analyzed, and searched.
RBL identifies parts of speech, sentence boundaries, word breaks, tokens, lemmas, and other linguistic components in European, Asian, and Middle Eastern languages.
Step 3: Entity Extraction
Extract the items of interest (including those you didn’t know about)
The Rosette Entity Extractor (REX) sifts through unstructured text and identifies people, places, dates, and other items that establish the true meaning of a document for further analysis.
REX locates generic terms as well as custom entities such as specific names, phone numbers, and email addresses. Statistical modeling helps determine whether an entity resides within a document, rather than simply referring to a list of possibilities and risking an overlooked variation. The result is entity extraction technology that lets you find what you know, and also what you didn’t know.
Step 4: Name Translation
Automatically translate non-English names into English to enable rapid triage of multilingual content
Rosette Name Translator (RNT) uses a combination of user-supplied name dictionaries, linguistic algorithms, and statistical modeling to provide highly accurate, standardized English translations of names that originate from several non-Latin writing systems, including Chinese, Russian, and Arabic.
By combining REX and RNT, key names can be extracted and translated to help investigators rapidly identify relevant documents that need to be flagged for translation and further study.
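The triage workflow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Rosette's implementation: it uses only a user-supplied name dictionary (exact substring matches, no linguistic algorithms or statistical modeling), and the dictionary contents, document texts, and function names are all hypothetical.

```python
# Hypothetical user-supplied dictionary: native-script name -> standardized English form.
NAME_DICTIONARY = {
    "Владимир Петров": "Vladimir Petrov",
    "王伟": "Wang Wei",
}


def translate_names_found(text: str, name_dictionary: dict[str, str]) -> list[str]:
    """Return the English forms of every dictionary name that appears in the text."""
    return sorted(
        english for native, english in name_dictionary.items() if native in text
    )


def flag_for_translation(
    documents: dict[str, str],
    name_dictionary: dict[str, str],
    names_of_interest: set[str],
) -> list[tuple[str, list[str]]]:
    """List documents that mention a name of interest, with the matching names."""
    flagged = []
    for doc_id, text in documents.items():
        hits = set(translate_names_found(text, name_dictionary)) & names_of_interest
        if hits:
            flagged.append((doc_id, sorted(hits)))
    return flagged


if __name__ == "__main__":
    documents = {
        "doc1": "Вчера Владимир Петров подписал договор.",
        "doc2": "会议由王伟主持。",
        "doc3": "Nothing relevant here.",
    }
    print(flag_for_translation(documents, NAME_DICTIONARY, {"Vladimir Petrov"}))
```

An exact-match lookup like this misses inflected or variant spellings of a name, which is the gap that linguistic algorithms and statistical modeling are meant to close.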