Lexical Analysis

Methods for identifying patterns and relationships between words and phrases, including lexical profiling, lexicon building, and semantic annotation. The core techniques are listed below; short code sketches illustrating several of them follow the list.

Tokenization: Breaking a text into smaller units (tokens), typically at whitespace or punctuation boundaries.
Stemming: Reducing words to their root or stem form (e.g. "driving", "drive", and "driver" all reduce to "drive"), which can improve the consistency of textual analysis.
Lemmatization: Similar to stemming, but maps each word to its dictionary base form (lemma) rather than clipping suffixes heuristically.
Part-of-speech (POS) tagging: Assigning a grammatical category (e.g. noun, verb, adjective) to each word in a text, which allows for more granular analysis of linguistic patterns.
Named entity recognition (NER): Identifying and classifying named entities (e.g. people, organizations, locations, dates) in a text, which is useful for information extraction tasks.
Word frequency analysis: Counting the occurrences of words in a text or corpus, which can provide insights into patterns of language use.
Collocation analysis: Identifying words that tend to co-occur in a text or corpus, which can reveal patterns of usage and relationships between words.
Concordancing: Generating a concordance (a list of all occurrences of a word or phrase, each shown in its surrounding context) for a text or corpus, which allows for detailed analysis of how the word or phrase is used.
Dependency parsing: Identifying the syntactic structure of a sentence by analyzing the grammatical relationships between its words.
Sentiment analysis: Measuring the positive or negative attitudes or emotions expressed in a text.
Corpus construction and annotation: Compiling and annotating a corpus (a collection of texts) for use in linguistic analysis.
Corpus linguistics software: Tools designed specifically for corpus analysis, such as AntConc, TreeTagger, and WordSmith.
Corpus linguistics methodology: The general principles and guidelines for conducting linguistic analysis on large corpora, including sampling techniques, research question design, and statistical analysis methods.
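
A minimal sketch of tokenization, stemming, and lemmatization using NLTK; the sample sentence is invented, and the 'punkt' and 'wordnet' resources are assumed to have been downloaded via nltk.download().

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The drivers were driving while it rained."
tokens = nltk.word_tokenize(text)  # splits on whitespace and punctuation rules

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for tok in tokens:
    # Stemming clips suffixes heuristically; lemmatization looks up a
    # dictionary base form. Treating every token as a verb (pos="v") is a
    # simplification; a real pipeline would pass each token's actual POS tag.
    print(tok, stemmer.stem(tok), lemmatizer.lemmatize(tok, pos="v"))
```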
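
A minimal sketch of POS tagging and named entity recognition with spaCy; it assumes the small English model is installed (python -m spacy download en_core_web_sm), and the sample sentence is invented.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in London in June.")

for token in doc:
    print(token.text, token.pos_)  # coarse part-of-speech tag per token

for ent in doc.ents:
    print(ent.text, ent.label_)    # e.g. Apple/ORG, London/GPE, June/DATE
```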
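
A minimal sketch of word frequency and collocation analysis with NLTK, run over a toy token list standing in for a corpus; a real study would load a full corpus instead.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = ("strong tea and strong coffee but powerful computers "
         "and strong arguments about strong tea").split()

freq = nltk.FreqDist(words)
print(freq.most_common(3))  # the most frequent word types

# Rank adjacent word pairs by pointwise mutual information (PMI).
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(bigram_measures.pmi, 3))
```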
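
A minimal concordancing sketch using nltk.Text on the same kind of toy token list; real work would concordance a full corpus.

```python
from nltk.text import Text

tokens = ("the cat sat on the mat and the cat saw the dog "
          "while the dog watched the cat").split()

# Print each occurrence of "cat" centred in its context window (a KWIC view).
Text(tokens).concordance("cat", width=40)
```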
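
A minimal dependency-parsing sketch, again assuming spaCy's en_core_web_sm model; the sentence is invented.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse.")

for token in doc:
    # Each token points to its syntactic head via a labelled relation.
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")
```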
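
A minimal sentiment-analysis sketch using NLTK's VADER analyzer; it assumes nltk.download('vader_lexicon') has been run, and the sentences are invented.

```python
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for sentence in ["I love this corpus.", "The annotation scheme is a mess."]:
    scores = sia.polarity_scores(sentence)
    # 'compound' ranges from -1 (most negative) to +1 (most positive).
    print(sentence, scores["compound"])
```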
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc."
"In case of a programming language, the categories include identifiers, operators, grouping symbols and data types."
"Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding."
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc."
"In case of a programming language, the categories include identifiers, operators, grouping symbols and data types."
"Those categories include nouns, verbs, adjectives, punctuations etc."
"The categories include identifiers, operators, grouping symbols and data types."
"Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding."
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"Those categories include nouns, verbs, adjectives, punctuations etc."
"The categories include identifiers, operators, grouping symbols and data types."
"In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc."
"Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding."
"The categories include identifiers, operators, grouping symbols and data types."
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"The categories include identifiers, operators, grouping symbols and data types."
"In case of a programming language, the categories include identifiers, operators, grouping symbols and data types."