How to remove stopwords with jieba

Posted: 2018-03-05 18:42

Removing stopwords after jieba word segmentation

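Sina Visitor System: code for removing stopwords during processing; I have a problem and would like the experts' advice
[Question points: 40; thread closed by zhangsiyututu]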
Python Study (2): Segmenting with jieba and removing stopwords
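import io
import sys
sys.path.append("../")
import jieba
import jieba.posseg as pseg

# Load the stopword list (UTF-8, one word per line) into a set for fast lookup.
with io.open('stop_words.txt', encoding='utf-8') as f:
    stop = set(line.strip() for line in f)

# Load the custom dictionary before segmenting.
jieba.load_userdict("userdict.txt")

with io.open('example.txt', encoding='utf-8') as f:
    s = f.read()

# Segment with part-of-speech tags; each item carries a word and a flag.
segs = pseg.cut(s)

final = ''
for seg, flag in segs:
    if seg not in stop:
        # Skip numerals ('m') and non-word tokens such as punctuation ('x').
        if flag != 'm' and flag != 'x':
            final += ' ' + seg
print(final)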
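The same filtering can be factored into small reusable functions. The following is only a minimal sketch, assuming Python 3 and UTF-8 input files. The helper names `load_stopwords`, `filter_tokens` and `segment_file` are hypothetical and not part of jieba. The only jieba call relied on is `jieba.posseg.cut`, whose items expose `word` and `flag`. It is imported inside `segment_file` so the filtering logic can be tried without jieba installed.

```python
def load_stopwords(lines):
    """Build a set of stopwords from an iterable of text lines."""
    # A set gives fast membership tests; blank lines are skipped.
    return {line.strip() for line in lines if line.strip()}


def filter_tokens(pairs, stopwords, drop_flags=("m", "x")):
    """Keep words that are not stopwords and whose POS flag is not dropped.

    `pairs` is any iterable of (word, flag) tuples.
    """
    return [word for word, flag in pairs
            if word not in stopwords and flag not in drop_flags]


def segment_file(text_path, stopword_path):
    """Segment a UTF-8 text file with jieba and return the filtered words."""
    # Imported here so the helpers above work without jieba installed.
    import jieba.posseg as pseg

    with open(stopword_path, encoding="utf-8") as f:
        stopwords = load_stopwords(f)
    with open(text_path, encoding="utf-8") as f:
        text = f.read()
    pairs = ((p.word, p.flag) for p in pseg.cut(text))
    return filter_tokens(pairs, stopwords)


# Hand-made (word, flag) pairs standing in for jieba output.
demo_pairs = [("我", "r"), ("的", "uj"), ("三", "m"),
              ("本", "q"), ("书", "n"), ("，", "x")]
demo_stop = load_stopwords(["的\n", "\n", " 我 \n"])
demo_result = filter_tokens(demo_pairs, demo_stop)
print(" ".join(demo_result))
```

With jieba installed, `segment_file('example.txt', 'stop_words.txt')` would return the filtered word list for the files used in the post above.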
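How can I use Python to remove stopwords from multiple txt files in one folder?
The part that removes stopwords in the for loop goes wrong: only some of the words in stopwords are removed, and a stopword that occurs several times is removed only once. Could someone point out the mistake? Posting the corrected code would be even better!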
# encoding=utf-8
import os
import sys
import codecs
import shutil
import jieba
import jieba.analyse

# Load a custom dictionary
# jieba.load_userdict("dict_baidu.txt")

# Read file and cut
def read_file_cut():
    # create path
    stopwords = {}.fromkeys([line.strip() for line in open('stopword.txt')])
    path = "Lon\\"
    respath = "Lon_Result\\"
    if os.path.isdir(respath):
        # if the respath directory already exists,
        shutil.rmtree(respath, True)
        # remove it recursively
    os.makedirs(respath)
    # recreate the respath directory
    num = 1  # assumed starting index
    while num <= 20:
        name = "%d" % num
        fileName = path + str(name) + ".txt"
        resName = respath + str(name) + ".txt"
        source = open(fileName, 'r')
        if os.path.exists(resName):
            os.remove(resName)
        result = codecs.open(resName, 'w', 'utf-8')
        line = source.readline()
        line = line.rstrip('\n')
        while line != "":
            line = unicode(line, "utf-8")
            seglist = jieba.cut(line, cut_all=False)
            output = ''
            for seg in seglist:
                seg = seg.encode('utf-8')
                if seg not in stopwords:
                    output += seg
            output = ' '.join(list(seglist))  # join with spaces
            print output
            result.write(output + '\r\n')
            line = source.readline()
        print 'End file: ' + str(num)
        source.close()
        result.close()
        num = num + 1
    print 'End All'

# Run function
if __name__ == '__main__':
    read_file_cut()
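Judging only from the posted code, two things look like probable causes. First, `jieba.cut` returns a generator. Once the `for` loop has consumed it, `list(seglist)` has nothing left, and the `' '.join(...)` line then overwrites the filtered `output` in any case. Second, the stopword file is read as raw bytes, so a leading BOM or a different file encoding can keep some entries from ever matching. The sketch below is a Python 3 rewrite under those assumptions, not a confirmed fix. `read_stopwords`, `clean_line`, `clean_folder` and the `tokenize` parameter are hypothetical names. The tokenizer is passed in, so the filtering can be checked without jieba installed.

```python
from pathlib import Path


def read_stopwords(path):
    """Read one stopword per line; utf-8-sig strips a leading BOM if present."""
    with open(path, encoding="utf-8-sig") as f:
        return {line.strip() for line in f if line.strip()}


def clean_line(line, stopwords, tokenize):
    """Tokenize one line and join the non-stopword tokens with spaces."""
    # Materialise the tokens once: a generator can only be iterated one time.
    tokens = list(tokenize(line))
    # Drop whitespace-only tokens and every occurrence of each stopword.
    kept = [t for t in tokens if t.strip() and t not in stopwords]
    return " ".join(kept)


def clean_folder(src_dir, dst_dir, stopwords, tokenize):
    """Filter every .txt file in src_dir into a same-named file in dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.txt")):
        with open(src, encoding="utf-8") as fin, \
             open(dst / src.name, "w", encoding="utf-8") as fout:
            for line in fin:
                cleaned = clean_line(line.rstrip("\n"), stopwords, tokenize)
                fout.write(cleaned + "\n")


# Demonstration with a whitespace tokenizer standing in for jieba.
demo_stopwords = {"的", "了"}
demo_line = clean_line("我 的 书 丢 了 的", demo_stopwords, str.split)
print(demo_line)
```

With jieba installed, the call for the folders in the question would look like `clean_folder("Lon", "Lon_Result", read_stopwords("stopword.txt"), jieba.lcut)`. `jieba.lcut` returns a list rather than a generator.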