Slide 1: Introduction to AI-Generated Text Detection
Detecting AI-generated text has become increasingly important in the digital age. This presentation will cover various Python-based techniques to identify machine-generated content, from simple statistical methods to more advanced machine learning approaches.
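import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data. Download it once with
# nltk.download('punkt') (the resource name can differ between NLTK versions).

def tokenize_text(text):
    """Split text into word and punctuation tokens."""
    return word_tokenize(text)

sample_text = "This is a sample text that could be human or AI-generated."
tokens = tokenize_text(sample_text)
print(f"Tokenized text: {tokens}")

Slide 2: Basic Statistical Analysis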
One of the simplest methods to detect AI-generated text is through statistical analysis of word frequencies and sentence structures. AI models often have distinct patterns in their output that differ from human writing.
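Word frequencies cover only half of that description. As a rough sketch of the sentence-structure side, the snippet below computes sentence-length statistics and a type-token ratio using only the standard library. The function name `basic_text_stats` and the regex-based sentence split are illustrative assumptions, not part of the original deck; a real tokenizer handles abbreviations and quotes far better.

```python
import re
import statistics

def basic_text_stats(text):
    """Return simple word- and sentence-level statistics for a text."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace
    # (an assumption made for this sketch).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    sentence_lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return {
        "num_sentences": len(sentences),
        "mean_sentence_length": statistics.mean(sentence_lengths),
        # Population standard deviation of sentence lengths
        "sentence_length_stdev": statistics.pstdev(sentence_lengths),
        # Distinct words divided by total words
        "type_token_ratio": len(set(words)) / len(words),
    }

example = "The cat sat. The cat sat on the mat again! Why?"
stats = basic_text_stats(example)
print(stats)
```

Very uniform sentence lengths or an unusually low type-token ratio are only weak hints, so these numbers are best treated as inputs to a later comparison rather than as a verdict.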
from collections import Counter

def analyze_word_frequency(tokens):
    return Counter(tokens)

word_freq = analyze_word_frequency(tokens)
print(f"Word frequencies: {word_freq}")

Slide 3: Perplexity Calculation
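Perplexity measures how well a probability model predicts a sample. Lower perplexity means the model finds the text more predictable, which is weak evidence that the text came from that model or a similar one. Note that calculate_perplexity on this slide is a simplified stand-in: it estimates the n-gram distribution from the text itself rather than from a separate reference corpus.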
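To make the idea of "the model used for calculation" concrete, here is a minimal sketch that scores a test text against a model estimated from a separate reference text. It uses an add-one (Laplace) smoothed unigram model, which is a deliberately simple substitute for a real language model; the function name `unigram_perplexity` and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def unigram_perplexity(reference_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed unigram model."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    counts = Counter(reference_tokens)
    # Include test words in the vocabulary so unseen words get some probability
    vocab = set(reference_tokens) | set(test_tokens)
    total = len(reference_tokens) + len(vocab)
    log_prob = 0.0
    for token in test_tokens:
        p = (counts[token] + 1) / total  # smoothed probability
        log_prob += math.log2(p)
    # Perplexity is 2 raised to the average negative log2 probability
    return 2 ** (-log_prob / len(test_tokens))

reference = "the cat sat on the mat".split()
familiar = unigram_perplexity(reference, "the cat sat".split())
unfamiliar = unigram_perplexity(reference, "a dog ran".split())
print(f"familiar: {familiar:.2f}, unfamiliar: {unfamiliar:.2f}")
```

The text that reuses the reference vocabulary scores lower than the text made of unseen words, which is the behaviour the detection heuristic relies on.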
import math
import nltk
from nltk import ngrams

def calculate_perplexity(text, n=2):
    """Perplexity of the text's own empirical n-gram distribution."""
    n_grams = list(ngrams(text.split(), n))
    N = len(n_grams)
    if N == 0:
        return float("nan")  # text is shorter than n words
    freq_dist = nltk.FreqDist(n_grams)
    # Entropy must be computed from probabilities (count / N), not raw counts
    entropy = -sum((count / N) * math.log2(count / N) for count in freq_dist.values())
    perplexity = 2 ** entropy
    return perplexity

text = "This is a sample text for perplexity calculation."
perplexity = calculate_perplexity(text)
print(f"Perplexity: {perplexity}")

Slide 4: Burstiness Analysis
Burstiness refers to the phenomenon where certain words or phrases appear in clusters rather than being evenly distributed throughout the text. Human-written text often exhibits more burstiness than AI-generated content.
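A small worked example may help fix the intuition. The sketch below scores a word by the coefficient of variation (standard deviation divided by mean) of the gaps between its occurrences, using only the standard library. The helper name `gap_burstiness` and the two position lists are made up for illustration.

```python
import statistics

def gap_burstiness(positions):
    """Coefficient of variation (std / mean) of gaps between occurrences."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        raise ValueError("need at least two positions")
    # pstdev is the population standard deviation
    return statistics.pstdev(gaps) / statistics.mean(gaps)

# A word that recurs every 4 tokens: perfectly even spacing
even = gap_burstiness([0, 4, 8, 12])
# A word used three times in a row, then once much later: clustered
clustered = gap_burstiness([0, 1, 2, 10])
print(f"even: {even:.2f}, clustered: {clustered:.2f}")
```

Evenly spaced repeats score 0, while the clustered word scores close to 1, so higher values indicate burstier usage.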
import numpy as np

def calculate_burstiness(tokens):
    # Record every position at which each word occurs
    word_positions = {}
    for i, word in enumerate(tokens):
        if word not in word_positions:
            word_positions[word] = []
        word_positions[word].append(i)
    # Score each repeated word by the variation in the gaps between its uses
    burstiness_scores = {}
    for word, positions in word_positions.items():
        if len(positions) > 1:
            gaps = np.diff(positions)
            burstiness = np.std(gaps) / np.mean(gaps)
            burstiness_scores[word] = burstiness
    return burstiness_scores

# Words that occur only once get no score, so short texts may return {}
burstiness = calculate_burstiness(tokens)
print(f"Burstiness scores: {burstiness}")

Slide 5: Sentiment Analysis
AI-generated text may have different sentiment patterns compared to human-written text. Analyzing sentiment can provide insights into the nature of the content.
from nltk.sentiment import SentimentIntensityAnalyzer

# Requires the VADER lexicon: nltk.download('vader_lexicon')

def analyze_sentiment(text):
    sia = SentimentIntensityAnalyzer()
    # Returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
    return sia.polarity_scores(text)

sentiment = analyze_sentiment(sample_text)
print(f"Sentiment analysis: {sentiment}")

Slide 6: Named Entity Recognition
AI models might struggle with consistently and accurately using named entities. Analyzing the presence and usage of named entities can help identify AI-generated text.
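from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

# Requires the NLTK POS tagger, NE chunker and 'words' data packages
# (exact resource names depend on the NLTK version).

def extract_named_entities(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    named_entities = ne_chunk(pos_tags)  # nltk.Tree with entity subtrees
    return named_entities

ner_result = extract_named_entities(sample_text)
print(f"Named entities: {ner_result}")

Slide 7: Text Coherence Analysis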
Human-written text generally maintains better coherence and flow between sentences and paragraphs. Analyzing text coherence can help identify AI-generated content.
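One simple way to put a number on sentence-to-sentence flow is to compare the words of neighbouring sentences. The sketch below swaps in plain bag-of-words cosine similarity, which needs no trained embeddings; it is a cruder stand-in for an embedding-based score, and the names `cosine` and `lexical_coherence` are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def lexical_coherence(tokenized_sentences):
    """Mean cosine similarity between consecutive sentences, or None."""
    vectors = [Counter(w.lower() for w in s) for s in tokenized_sentences]
    pairs = list(zip(vectors, vectors[1:]))
    if not pairs:
        return None  # fewer than two sentences: nothing to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

on_topic = lexical_coherence([["the", "cat", "sat"], ["the", "cat", "slept"]])
off_topic = lexical_coherence([["the", "cat", "sat"], ["stocks", "fell", "sharply"]])
print(on_topic, off_topic)
```

Sentences that share vocabulary score higher than unrelated ones. Word overlap misses synonyms and paraphrase, which is the gap that embedding-based approaches try to close.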
from gensim.models import Word2Vec
import numpy as np

def analyze_coherence(sentences):
    # Train a simple Word2Vec model on the tokenized sentences
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    # Calculate sentence embeddings as the mean of their word vectors
    sentence_embeddings = []
    for sentence in sentences:
        words = [word for word in sentence if word in model.wv]
        if words:
            sentence_embedding = np.mean([model.wv[word] for word in words], axis=0)
            sentence_embeddings.append(sentence_embedding)
    # Calculate cosine similarity between consecutive sentences
    coherence_scores = []
    for i in range(len(sentence_embeddings) - 1):
        similarity = np.dot(sentence_embeddings[i], sentence_embeddings[i+1]) / (
            np.linalg.norm(sentence_embeddings[i]) * np.linalg.norm(sentence_embeddings[i+1]))
        coherence_scores.append(similarity)
    # With fewer than two sentences there are no pairs to compare
    if not coherence_scores:
        return float("nan")
    return np.mean(coherence_scores)

# sample_text is a single sentence, so this prints nan; use a longer text in practice
sentences = [word_tokenize(sent) for sent in nltk.sent_tokenize(sample_text)]
coherence = analyze_coherence(sentences)
print(f"Text coherence score: {coherence}")

Slide 8: Language Model Perplexity
Using a pre-trained language model to calculate perplexity can be more effective than simple n-gram models. Lower perplexity suggests that the text is more likely to be generated by an AI model similar to the one used for evaluation.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

def calculate_gpt2_perplexity(text):
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    loss = outputs.loss
    perplexity = torch.exp(loss)
    return perplexity.item()

gpt2_perplexity = calculate_gpt2_perplexity(sample_text)
print(f"GPT-2 Perplexity: {gpt2_perplexity}")

Slide 9: Readability Metrics
AI-generated text might have different readability characteristics compared to human-written text. Analyzing readability can provide insights into the nature of the content.
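For reference, two of the standard readability scores are simple functions of three counts: words, sentences, and syllables. The sketch below applies the published Flesch Reading Ease and Flesch-Kincaid Grade formulas to counts supplied by the caller. The function names and the example counts are assumptions for illustration; a library will do its own counting, so its results for real text can differ slightly.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts: 100 words, 5 sentences, 150 syllables
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(f"Reading ease: {ease:.3f}, grade level: {grade:.2f}")
```

Both scores depend only on average sentence length and average syllables per word, so they say nothing about meaning; they are one more stylistic signal to compare across texts.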
import textstat

def analyze_readability(text):
    flesch_reading_ease = textstat.flesch_reading_ease(text)
    flesch_kincaid_grade = textstat.flesch_kincaid_grade(text)
    gunning_fog = textstat.gunning_fog(text)
    return {
        "Flesch Reading Ease": flesch_reading_ease,
        "Flesch-Kincaid Grade": flesch_kincaid_grade,
        "Gunning Fog Index": gunning_fog
    }

readability_scores = analyze_readability(sample_text)
print(f"Readability scores: {readability_scores}")

Slide 10: Stylometric Analysis
Stylometry involves analyzing writing style features such as sentence length, vocabulary richness, and punctuation usage. These features can help distinguish between human and AI-generated text.
import re

def analyze_style(text):
    sentences = nltk.sent_tokenize(text)
    words = word_tokenize(text)
    avg_sentence_length = len(words) / len(sentences)
    vocabulary_richness = len(set(words)) / len(words)
    punctuation_frequency = len(re.findall(r'[^\w\s]', text)) / len(words)
    return {
        "Average Sentence Length": avg_sentence_length,
        "Vocabulary Richness": vocabulary_richness,
        "Punctuation Frequency": punctuation_frequency
    }

style_features = analyze_style(sample_text)
print(f"Stylometric features: {style_features}")

Slide 11: Machine Learning Classifier
Combining various features extracted from the text, we can train a machine learning model to classify text as human-written or AI-generated.
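Before any model can be trained, each text has to become a fixed-length row of numbers. The sketch below shows one way to do that with three hand-crafted features and the standard library only. The function `extract_features`, the feature list, and the regex-based counting are illustrative assumptions rather than a recommended feature set.

```python
import re

FEATURE_NAMES = ["avg_sentence_length", "type_token_ratio", "punctuation_rate"]

def extract_features(text):
    """Turn a text into a fixed-length numeric feature vector."""
    words = re.findall(r"\w+", text.lower())
    # Naive sentence split on runs of ., ! or ? (an assumption of this sketch)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return [0.0, 0.0, 0.0]
    punctuation = re.findall(r"[^\w\s]", text)
    return [
        len(words) / len(sentences),   # average sentence length in words
        len(set(words)) / len(words),  # vocabulary richness
        len(punctuation) / len(words), # punctuation marks per word
    ]

texts = ["Hello world. Hello again!", "One long sentence without much variety in it"]
X = [extract_features(t) for t in texts]
for row in X:
    print(dict(zip(FEATURE_NAMES, row)))
```

A list of rows like `X` can be passed to a scikit-learn estimator's `fit` method together with the labels, either alone or stacked alongside other feature sets.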
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Assume we have a dataset of texts and labels
texts = ["Sample text 1", "Sample text 2", "Sample text 3"]
labels = [0, 1, 0] # 0 for human-written, 1 for AI-generated
# Feature extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
print(classification_report(y_test, y_pred))

Slide 12: Real-life Example: News Article Verification
In this example, we'll analyze a news article to determine if it's likely to be AI-generated or human-written.
import requests
from bs4 import BeautifulSoup

def fetch_article(url):
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = soup.find_all('p')
    return ' '.join([p.text for p in paragraphs])

url = "https://example.com/news-article"  # placeholder URL
article_text = fetch_article(url)
# Apply various detection techniques
perplexity = calculate_perplexity(article_text)
sentiment = analyze_sentiment(article_text)
readability = analyze_readability(article_text)
style = analyze_style(article_text)
print(f"Perplexity: {perplexity}")
print(f"Sentiment: {sentiment}")
print(f"Readability: {readability}")
print(f"Style: {style}")
# Based on these results, make a judgment about whether the article is likely AI-generated

Slide 13: Real-life Example: Social Media Post Analysis
In this example, we'll analyze a collection of social media posts to identify potential AI-generated content.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Load social media posts (assuming we have a CSV file with 'text' and 'is_ai_generated' columns)
df = pd.read_csv('social_media_posts.csv')
# Split the data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['is_ai_generated'], test_size=0.2, random_state=42)
# Feature extraction
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_vectorized, y_train)
# Make predictions
y_pred = clf.predict(X_test_vectorized)
# Evaluate the model
print(classification_report(y_test, y_pred))
# Analyze feature importance
feature_importance = pd.DataFrame({
    'feature': vectorizer.get_feature_names_out(),
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print("Top 10 most important features:")
print(feature_importance.head(10))

Slide 14: Limitations and Challenges
While these techniques can be effective, it's important to note their limitations:
- AI models are constantly evolving, making detection more challenging.
- Some methods may produce false positives or negatives.
- Hybrid content (partly human-written, partly AI-generated) can be difficult to classify.
- The context and purpose of the text should be considered alongside technical analysis.
To address these challenges, it's crucial to:
- Regularly update detection models and techniques.
- Combine multiple methods for more reliable results.
- Consider the broader context of the content being analyzed.
- Stay informed about the latest developments in AI text generation and detection.
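As one possible reading of "combine multiple methods", the sketch below merges several detector scores into a single weighted average and leaves a deliberate "inconclusive" band in the middle. Everything here is hypothetical: the detector names, the weights, the thresholds, and the assumption that each detector's output has already been rescaled to a 0 to 1 "likely AI" score.

```python
def combine_scores(scores, weights):
    """Weighted average of per-detector scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight

def verdict(combined, low=0.35, high=0.65):
    """Map a combined score to a cautious label (thresholds are arbitrary)."""
    if combined >= high:
        return "likely AI-generated"
    if combined <= low:
        return "likely human-written"
    return "inconclusive"

# Hypothetical detector outputs and weights
scores = {"perplexity": 0.8, "burstiness": 0.6, "classifier": 0.9}
weights = {"perplexity": 1.0, "burstiness": 0.5, "classifier": 2.0}
combined = combine_scores(scores, weights)
print(f"{combined:.3f} -> {verdict(combined)}")
```

Keeping an explicit inconclusive outcome reflects the false-positive risk noted above: a borderline score is better reported as uncertain than forced into a label.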
Slide 15: Additional Resources
For further exploration of AI-generated text detection techniques, consider the following papers, all available on arXiv:
- "Automatic Detection of Machine Generated Text: A Critical Survey" by Jawahar et al. (2020) ArXiv: https://arxiv.org/abs/2011.01314
- "Defending Against Neural Fake News" by Zellers et al. (2019) ArXiv: https://arxiv.org/abs/1905.12616
- "Detecting Machine-generated Text using Machine Learning: A Systematic Review" by Fagni et al. (2021) ArXiv: https://arxiv.org/abs/2103.04540
These resources provide in-depth discussions of various detection methods and their effectiveness against different types of AI-generated text.