Python Essentials – 15 Techniques for Text Cleaning and Preprocessing


Text cleaning and preprocessing are essential steps in natural language processing (NLP). Whether you are working with social-media data, news articles, or user comments, the text needs to be cleaned and preprocessed first so that downstream analysis or modeling can proceed smoothly. This article walks through 15 Python text cleaning and preprocessing techniques, with code examples to help you understand and apply each one.

1. Removing Whitespace Characters

Whitespace characters include spaces, tabs, newlines, and so on. They usually do not change the meaning of the text, but they do add noise to the data. The `strip()` and `replace()` methods make them easy to remove.


```python
text = "  Hello, World! \n"
clean_text = text.strip()  # remove leading and trailing whitespace
print(clean_text)  # Output: Hello, World!

text_with_tabs = "Hello\tWorld!"
clean_text = text_with_tabs.replace("\t", " ")  # replace tabs with spaces
print(clean_text)  # Output: Hello World!
```
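Note that `strip()` only trims the ends of a string. When internal runs of spaces, tabs, or newlines also need collapsing, a small regular expression handles both in one pass. This is a minimal sketch; the sample string is chosen purely for illustration:

```python
import re

messy = "  Hello,\tWorld!\n\nThis  is   spaced  out.  "
# Collapse every run of whitespace (spaces, tabs, newlines) to a single
# space, then trim the leading and trailing ends.
normalized = re.sub(r'\s+', ' ', messy).strip()
print(normalized)  # Output: Hello, World! This is spaced out.
```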

2. Converting to Lowercase

Converting all text to lowercase avoids inconsistencies caused by differences in capitalization.


```python
text = "Hello, World!"
lower_text = text.lower()
print(lower_text)  # Output: hello, world!
```
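For text that may contain non-English characters, `str.casefold()` performs a more aggressive, Unicode-aware lowering than `lower()` and is the safer choice when strings are compared caselessly. The German example below is illustrative:

```python
text = "Straße"         # German for "street"
print(text.lower())     # straße  (the ß is left unchanged)
print(text.casefold())  # strasse (casefold expands ß to "ss")
```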

3. Removing Punctuation

Punctuation usually has little effect on the meaning of text, although in some cases (such as sentiment analysis) it can matter. The `punctuation` constant in the `string` module makes punctuation easy to strip.


```python
import string

text = "Hello, World!"
clean_text = text.translate(str.maketrans("", "", string.punctuation))
print(clean_text)  # Output: Hello World
```

4. Tokenization

Tokenization is the process of splitting text into words or phrases. The `word_tokenize` function from the `nltk` library does exactly this.


```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
text = "Hello, World! This is a test."
tokens = word_tokenize(text)
print(tokens)  # Output: ['Hello', ',', 'World', '!', 'This', 'is', 'a', 'test', '.']
```

5. Removing Stop Words

Stop words are words that appear frequently in text but contribute little to its meaning, such as "the" and "is". The `stopwords` module in the `nltk` library can filter them out.


```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokens = ['Hello', 'World', 'This', 'is', 'a', 'test']
# NLTK's stop-word list is lowercase, so compare lowercased tokens
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)  # Output: ['Hello', 'World', 'test']
```

6. Stemming

Stemming reduces a word to its base form. The `PorterStemmer` from the `nltk` library implements this.


```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['running', 'jumps', 'easily']
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)  # Output: ['run', 'jump', 'easili']
```

7. Lemmatization

Lemmatization reduces a word to its dictionary form. The `WordNetLemmatizer` from the `nltk` library implements this.


```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ['running', 'jumps', 'easily']
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)  # Output: ['running', 'jump', 'easily']
```

8. Removing Digits

Digits usually have little effect on the meaning of text. A regular expression removes them easily.


```python
import re

text = "Hello, World! 123"
clean_text = re.sub(r'\d+', '', text)
print(clean_text)  # Output: "Hello, World! " (a trailing space remains)
```

9. Removing Special Characters

Special characters such as `@`, `#`, and `$` usually have little effect on the meaning of text. A regular expression removes them easily.


```python
import re

text = "Hello, @World! #Python $123"
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)  # Output: Hello World Python 123
```

10. Removing HTML Tags

Text scraped from web pages may contain HTML tags. The `BeautifulSoup` library strips them easily.


```python
from bs4 import BeautifulSoup

html_text = "<html><body><h1>Hello, World!</h1></body></html>"
soup = BeautifulSoup(html_text, 'html.parser')
clean_text = soup.get_text()
print(clean_text)  # Output: Hello, World!
```

11. Removing URLs

URLs usually have little effect on the meaning of text. A regular expression removes them easily.


```python
import re

text = "Check out this link: https://example.com"
clean_text = re.sub(r'http\S+|www\.\S+', '', text)
print(clean_text)  # Output: "Check out this link: "
```

12. Removing Duplicate Words

Duplicate words can add complexity to the text. A set removes duplicates easily, although it does not preserve the original word order.


```python
tokens = ['Hello', 'World', 'Hello', 'Python', 'Python']
unique_tokens = list(set(tokens))
print(unique_tokens)  # e.g. ['Hello', 'Python', 'World'] (set order is not guaranteed)
```
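When the original token order matters, `dict.fromkeys` gives a deterministic, order-preserving deduplication, since dict keys retain insertion order in Python 3.7+:

```python
tokens = ['Hello', 'World', 'Hello', 'Python', 'Python']
# Each token becomes a dict key on first occurrence; later duplicates
# are ignored, so the first-seen order is preserved.
unique_tokens = list(dict.fromkeys(tokens))
print(unique_tokens)  # Output: ['Hello', 'World', 'Python']
```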

13. Removing Short Words

Short words usually carry little meaning. You can set a length threshold and drop any word shorter than it.


```python
tokens = ['Hello', 'World', 'a', 'is', 'Python']
min_length = 3
filtered_tokens = [token for token in tokens if len(token) >= min_length]
print(filtered_tokens)  # Output: ['Hello', 'World', 'Python']
```

14. Removing Rare Words

Rare words can add complexity to the text. You can set a frequency threshold and drop any word that occurs fewer times than it.


```python
from collections import Counter

tokens = ['Hello', 'World', 'Hello', 'Python', 'Python', 'test', 'test', 'test']
word_counts = Counter(tokens)
min_frequency = 2
filtered_tokens = [token for token in tokens if word_counts[token] >= min_frequency]
print(filtered_tokens)  # Output: ['Hello', 'Hello', 'Python', 'Python', 'test', 'test', 'test']
```

15. Complex Cleaning with Regular Expressions

Regular expressions are a powerful tool for complex cleaning tasks, such as replacing strings that match a specific pattern.


```python
import re

text = "Hello, World! 123-456-7890"
clean_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'PHONE', text)
print(clean_text)  # Output: Hello, World! PHONE
```
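When the same set of substitutions runs over many documents, precompiling the patterns once and applying them in order keeps the cleanup fast and readable. This is a minimal sketch; the `RULES` list and `clean` helper are illustrative names, not part of any library:

```python
import re

# Compile each pattern once; the (pattern, replacement) pairs below
# are illustrative choices, applied in order.
RULES = [
    (re.compile(r'https?://\S+'), ''),            # drop URLs
    (re.compile(r'\d{3}-\d{3}-\d{4}'), 'PHONE'),  # mask US-style phone numbers
    (re.compile(r'\s+'), ' '),                    # collapse whitespace runs
]

def clean(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

print(clean("Call 123-456-7890 or visit https://example.com  now"))
# Output: Call PHONE or visit now
```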

Hands-On Example: Cleaning Social-Media Comments

Suppose you have a dataset of social-media comments that needs to be cleaned and preprocessed. The example below combines the techniques covered above.


```python
import re
import nltk
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

# Download the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample data
data = {
    'comment': [
        "Check out this link: https://example.com",
        "Hello, @World! #Python $123",
        "<html><body><h1>Hello, World!</h1></body></html>",
        "Running jumps easily 123-456-7890"
    ]
}

df = pd.DataFrame(data)

def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()

    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Remove digits
    text = re.sub(r'\d+', '', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Remove short words
    tokens = [token for token in tokens if len(token) >= 3]

    # Remove rare words
    word_counts = Counter(tokens)
    min_frequency = 2
    tokens = [token for token in tokens if word_counts[token] >= min_frequency]

    return ' '.join(tokens)

# Apply the cleaning function
df['cleaned_comment'] = df['comment'].apply(clean_text)
print(df)
```

Summary

This article covered 15 Python text cleaning and preprocessing techniques: removing whitespace, converting to lowercase, removing punctuation, tokenization, removing stop words, stemming, lemmatization, removing digits, removing special characters, removing HTML tags, removing URLs, removing duplicate words, removing short words, removing rare words, and complex cleaning with regular expressions. Code examples showed how to apply each technique, and a final hands-on example combined them to clean and preprocess social-media comments.
