"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
Methods for identifying patterns and relationships between words and phrases, including lexical profiling, lexicon building, and semantic annotation.
Tokenization: The process of breaking up a text into smaller units (tokens), typically based on whitespace or punctuation marks.
Stemming: The process of reducing words to their root or base form, so that inflected variants of a word are treated as a single item in textual analysis.
Part-of-speech tagging: The process of assigning a part of speech (e.g. noun, verb, adjective) to each word in a text, which allows for more granular analysis of linguistic patterns.
Named entity recognition: The process of identifying and classifying named entities (e.g. people, organizations, locations) in a text, which can be useful for information extraction tasks.
Word frequency analysis: The process of counting the occurrence of words in a text or corpus, which can provide insights into patterns of language use.
Collocation analysis: The process of identifying words that tend to co-occur with each other in a text or corpus, which can reveal patterns of usage and potential relationships between words.
Concordancing: The process of generating a concordance (i.e. a list of all the occurrences) of a particular word or phrase in a text or corpus, which allows for detailed analysis of how the word or phrase is used.
Corpus construction and annotation: The process of compiling and categorizing a corpus (i.e. a collection of texts) for use in linguistic analysis.
Corpus linguistics software: Tools and software designed specifically for corpus linguistics analysis, such as AntConc, TreeTagger, and WordSmith.
Corpus linguistics methodology: The general principles and guidelines for conducting linguistic analysis on large-scale corpora, such as sampling techniques, research questions, and statistical analysis methods.
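Three of the techniques listed above (word frequency analysis, collocation analysis, and concordancing) can be illustrated with a minimal sketch using only the Python standard library. The function names, the sample sentence, and the choice to approximate collocations by counting adjacent word pairs are assumptions made for illustration; dedicated tools such as AntConc or WordSmith use more refined statistics.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(tokens):
    """Word frequency analysis: count how often each token occurs."""
    return Counter(tokens)

def bigram_counts(tokens):
    """Count adjacent word pairs, a crude stand-in for collocation analysis."""
    return Counter(zip(tokens, tokens[1:]))

def concordance(tokens, keyword, width=2):
    """Keyword-in-context lines: up to `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text, chosen only to keep the counts easy to follow.
text = "The cat sat on the mat. The cat saw the dog."
tokens = tokenize(text)
freqs = word_frequencies(tokens)
pairs = bigram_counts(tokens)
kwic = concordance(tokens, "cat")

print(freqs.most_common(2))   # most frequent words
print(pairs.most_common(1))   # most frequent adjacent pair
print("\n".join(kwic))        # each occurrence of "cat" in context
```

Raw adjacent-pair counts favour frequent function words, so a real collocation study would normally weight the counts with an association measure instead.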
Tokenization: Separating words or smaller linguistic units (tokens) from a text.
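Stemming: Reducing words to their root or stem form. E.g., "driving" and "drives" both reduce to a common stem such as "drive" or "driv"; whether "driver" is conflated with them depends on the stemming algorithm.
Lemmatization: Similar to stemming, but uses vocabulary and morphological information to return the dictionary form (lemma) of a word, e.g. "better" becomes "good".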
Part-of-speech (POS) tagging: Identifying the grammatical category of each word in a text (e.g., noun, verb, adjective, etc.).
Named Entity Recognition (NER): Identifying entities in a text, including names, dates, organizations, etc.
Collocation analysis: Identifying words that are likely to appear together in a language, based on patterns of co-occurrence.
Concordance analysis: Examining each occurrence of a word or phrase together with its surrounding context, which shows how often it occurs and which words it collocates with.
Dependency parsing: Identifying the syntactic structure of a sentence by analyzing the relationships between words in a sentence.
Frequency analysis: Counting how often a word or term occurs within a given corpus.
Sentiment analysis: Measuring the positive or negative attitudes or emotions expressed in a text.
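The difference between stemming and lemmatization in the list above can be made concrete with a toy sketch. Everything here is an assumption for illustration: the suffix list, the minimum stem length, and the tiny lemma table are invented, and the stemmer is a naive suffix stripper rather than an established algorithm such as Porter's.

```python
# Hypothetical suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

# Hypothetical lemma table; a real lemmatizer uses a full dictionary
# plus morphological rules instead of a handful of entries.
LEMMAS = {"better": "good", "drove": "drive", "driving": "drive", "mice": "mouse"}

def toy_lemma(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    return LEMMAS.get(word, word)

stems = [toy_stem(w) for w in ["driving", "drive", "driver", "drives"]]
print(stems)                                     # all four share one stem
print(toy_stem("better"), toy_lemma("better"))   # stem vs. dictionary form
```

The stem need not be a real word, whereas the lemma is always a dictionary form; that is the practical distinction between the two entries.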
"In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc."
"In case of a programming language, the categories include identifiers, operators, grouping symbols and data types."
"Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding."
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc."
"In case of a programming language, the categories include identifiers, operators, grouping symbols and data types."
"Those categories include nouns, verbs, adjectives, punctuations etc."
"The categories include identifiers, operators, grouping symbols and data types."
"Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding."
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"Those categories include nouns, verbs, adjectives, punctuations etc."
"The categories include identifiers, operators, grouping symbols and data types."
"In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc."
"Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding."
"The categories include identifiers, operators, grouping symbols and data types."
"Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a 'lexer' program."
"The categories include identifiers, operators, grouping symbols and data types."
"In case of a programming language, the categories include identifiers, operators, grouping symbols and data types."