Comprehensive Exercise: Word Frequency Statistics
Published: 2019-06-25


Comprehensive exercise

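Word frequency statistics: preprocessing

Download the lyrics of an English song or an English article.

Replace all separators such as , . ? ! ’ : with spaces.

Convert all uppercase letters to lowercase.

Generate the word list.

Generate the word frequency counts.

Sort the counts.

Exclude grammatical words: pronouns, articles, and conjunctions.

Output the top 20 words by frequency.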
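
The preprocessing steps above can be sketched on a single short sentence. This is a minimal illustration rather than the exercise's own code: the sample sentence and the name `SEPARATORS` are assumptions made here, and `str.translate` is used in place of repeated `replace` calls.

```python
# Hypothetical sample sentence, used only for illustration.
sample = "Hello, world! Hello again: it is a fine day."

# Separator characters to turn into spaces (an assumed set for this sketch).
SEPARATORS = ",.?!:"

# Build a table that maps every separator to a space, then apply it in one pass.
table = str.maketrans(SEPARATORS, " " * len(SEPARATORS))
cleaned = sample.translate(table)

# Lowercase, then split on whitespace to get the word list.
word_list = cleaned.lower().split()
print(word_list)
# ['hello', 'world', 'hello', 'again', 'it', 'is', 'a', 'fine', 'day']
```

Replacing separators with a space rather than an empty string matters: it keeps words that were joined only by punctuation from being fused into one token.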

text = '''I became what I am today at the age of twelve, on a frigid overcast day in the winter of 1975. I remember the precise moment, crouching behind a crumbling mud wall, peeking into the alley near the frozen creek. That was a long time ago, but it's wrong what they say about the past, I've learned, about how you can bury it. Because the past claws its way out. Looking back now, I realize I have been peeking into that deserted alley for the last twenty-six years.One day last summer, my friend Rahim Khan called from Pakistan. He asked me to come see him. Standing in the kitchen with the receiver to my ear, I knew it wasn't just Rahim Khan on the line. It was my past of unatoned sins. After I hung up, I went for a walk along Spreckels Lake on the northern edge of Golden Gate Park. The early-afternoon sun sparkled on the water where dozens of miniature boats sailed, propelled by a crisp breeze. Then I glanced up and saw a pair of kites, red with long blue tails, soaring in the sky. They danced high above the trees on the west end of the park, over the windmills, floating side by side like a pair of eyes looking down on San Francisco, the city I now call Home. And suddenly Hassan's voice whispered in my head: _For you, a thousand times over_. Hassan the harelipped kite runner.I sat on a park bench near a willow tree. I thought about something Rahim Khan said just before he hung up, almost as an after thought. _There is a way to be good again_. I looked up at those twin kites. I thought about Hassan. Thought about Baba. Ali. Kabul. I thought of the life I had lived until the winter of 1975 came and changed everything. 
And made me what I am today.'''

# Replace punctuation with spaces so that adjacent words are not fused together
symbol = '''!?,.:@#$%*_+-'''
for ch in symbol:
    text = text.replace(ch, ' ')

# Convert to lowercase and split on whitespace to get the word list
textlist = text.lower().split()

# Use a dictionary to record each word and its number of occurrences.
# Counting tokens one by one avoids matching substrings inside longer words.
dic = {}
for word in textlist:
    dic[word] = dic.get(word, 0) + 1

# Grammatical words (articles, pronouns, conjunctions, etc.) to exclude
words = '''a an the in on to at and of is was are were i he she you your they us their our it or for be too do no that s so as but it's'''.split()
for word in words:
    if word in dic:  # discard words that carry little meaning, such as articles
        dic.pop(word)

# Sort by count, highest first
new_dic = sorted(dic.items(), key=lambda x: x[1], reverse=True)

# Print the 20 most frequent words
for item in new_dic[:20]:
    print(item)
print(dic)
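
As an alternative sketch of the counting, filtering, and sorting steps, the standard library's `collections.Counter` can do the tallying and ranking. The sample sentence, the small `STOP_WORDS` set, and the helper name `top_words` are assumptions made for this illustration; in the exercise, the word list would come from the preprocessed text and the stop-word list would be longer.

```python
from collections import Counter

# Hypothetical stop-word set; the exercise uses a longer list.
STOP_WORDS = {"a", "the", "i", "of", "and"}

def top_words(word_list, n=20):
    """Return the n most frequent words that are not stop words."""
    counts = Counter(w for w in word_list if w not in STOP_WORDS)
    # most_common returns (word, count) pairs, highest count first.
    return counts.most_common(n)

# Hypothetical, already lowercased and punctuation-free input.
sample_words = "the kites and the park and the kites i thought of the park and the kites".split()
print(top_words(sample_words, n=2))
# [('kites', 3), ('park', 2)]
```

Filtering stop words before counting, instead of popping them from the dictionary afterwards, gives the same result and keeps the ranking step to a single call.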

  

 

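Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for the word frequency analysis by reading it from that file.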

 

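# Read the whole file; state the encoding explicitly because the file was saved as UTF-8
with open('text.txt', 'r', encoding='utf-8') as fo:
    text = fo.read()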
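
Reading from a file and handing the result to the preprocessing step can be sketched as follows. The helper name `read_text` and the temporary file are assumptions made so the example runs on its own; in the exercise the file is `text.txt`. Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding.

```python
import os
import tempfile

def read_text(path):
    """Read a whole UTF-8 encoded text file into a string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# Write a small temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("There is a way to be good again.")
    file_text = read_text(path)

# The string can now go through the same lowercase-and-split steps as before.
print(file_text.lower().split())
```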

 

Reposted from: https://www.cnblogs.com/2015110114z/p/8649854.html
