jupyters_and_slides's People

Contributors

e-abu, mingxiao-li, ustccheng02, xguojing, xtzd, zhaoxingrong

jupyters_and_slides's Issues

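[Performance] Slow saving of data to txt after reading it from the database

I have a performance question about reading data from a database and saving it:
Question A:

  • With the same code, when the save_txt code is placed at the end of get_news_from_sql, saving the text is painfully slow, as if the data were being read one row at a time
  • When the code is split into separate functions, it becomes tens of thousands of times faster and the save finishes almost instantly

Question B:

  • How do I track down and debug this kind of problem?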
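
For Question B, one common first step is to time each stage separately so the slow one stands out. The sketch below uses only the standard library; `timed` and `timings` are hypothetical names, and the two `with` blocks are stand-ins for the real fetch and save steps, not the code from this issue.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in for fetching rows from the database (assumed shape: 1-tuples).
with timed("fetch"):
    rows = [("line %d" % i,) for i in range(1000)]

# Stand-in for cleaning and writing the rows.
with timed("save"):
    lines = [row[0] + "\n" for row in rows]

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

If one stage dominates, the standard-library `cProfile` module can then show which calls inside it take the time.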

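The main difference between the fast code and the slow code:
[image]

Below is the fast version, with the code split into separate functions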

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''=================================================
@IDE    :PyCharm
@Author :LuckyHuibo
@Date   :2019/8/20 20:03
@Desc   :Connect to the database and read data; split into separate functions, the fast version
=================================================='''
import pymysql
import re
import pysnooper


def clean(s):
    """
    Clean the data
    :param s: text
    :return: the cleaned text
    """
    re_compile = re.compile(r'�|《|》|\/|)|(|【|】|\\n|\\r|\\t|\\u3000|;|\*')
    string = re_compile.sub('', str(s))
    return string


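# Fetch the news corpus from the database
@pysnooper.snoop()
def get_news_from_sql(host, user, password, database, port):
    print('Connecting to the database...')
    # Without charset, the data read back is garbled.
    # Keyword arguments are used because recent PyMySQL releases no longer accept positional ones.
    db = pymysql.connect(host=host, user=user, password=password,
                         database=database, port=port, charset='utf8')
    print(db)
    print('Connected...')

    cursor = db.cursor()
    sql = """SELECT content from news_chinese"""
    try:
        cursor.execute(sql)
    except Exception as e:
        # If an exception occurs, roll back
        print("An exception occurred", e)
        db.rollback()
        return

    news = cursor.fetchall()
    print(len(news))
    cursor.close()
    db.close()

    return news

    # With the same code, when the save_txt code is placed at the end of get_news_from_sql, saving the text is painfully slow, as if the data were being read one row at a time
    # When the code is split into separate functions, it becomes tens of thousands of times faster and the save finishes almost instantly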

def save_txt(news):
    try:
        with open('../data/news-sentences-xut2.txt', 'w', encoding='utf-8') as f:
            for content in news:
                data = content[0]
                text = clean(data)
                f.write(text + '\n')
    except Exception as w:
        print('Failed to save the data to the text file', w)


if __name__ == "__main__":
    host = "rm-8vbwj6507z6465505ro.mysql.zhangbei.rds.aliyuncs.com"
    user = "root"
    password = "AI@2019@ai"
    database = "stu_db"
    port = 3306
    try:
        contents = get_news_from_sql(host, user, password, database, port)
        save_txt(contents)
    except Exception as e:
        # If an exception occurs, report the exception instance rather than the Exception class
        print("ERROR", e)
        # db.rollback()


Below is the slow version, with everything written in one function

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''=================================================
@IDE    :PyCharm
@Author :LuckyHuibo
@Date   :2019/8/20 20:03
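@Desc   :Connect to the database and read data

[Question] I have a performance question about reading data from a database and saving it:
# With the same code, when the save_txt code is placed at the end of get_news_from_sql, saving the text is painfully slow, as if the data were being read one row at a time
# When the code is split into separate functions, it becomes tens of thousands of times faster and the save finishes almost instantly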
=================================================='''
import pymysql
import re
import pysnooper


def clean(s):
    """
    Clean the data
    :param s: text
    :return: the cleaned text
    """
    re_compile = re.compile(r'�|《|》|\/|)|(|【|】|\\n|\\r|\\t|\\u3000|;|\*')
    string = re_compile.sub('', str(s))
    return string


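# Fetch the news corpus from the database
@pysnooper.snoop()
def get_news_from_sql(host, user, password, database, port):
    print('Connecting to the database...')
    # Without charset, the data read back is garbled.
    # Keyword arguments are used because recent PyMySQL releases no longer accept positional ones.
    db = pymysql.connect(host=host, user=user, password=password,
                         database=database, port=port, charset='utf8')
    print(db)
    print('Connected...')

    cursor = db.cursor()
    sql = """SELECT content from news_chinese"""
    try:
        cursor.execute(sql)
    except Exception as e:
        # If an exception occurs, roll back
        print("An exception occurred", e)
        db.rollback()
        return

    news = cursor.fetchall()
    print(len(news))
    cursor.close()
    db.close()

    # return news

    # With the same code, when the save_txt code is placed at the end of get_news_from_sql, saving the text is painfully slow, as if the data were being read one row at a time
    # When the code is split into separate functions, it becomes tens of thousands of times faster and the save finishes almost instantly

    # def save_txt(news):
    try:
        with open('../data/news-sentences-xut.txt', 'w', encoding='utf-8') as f:
            for content in news:
                data = content[0]
                text = clean(data)
                f.write(text + '\n')
    except Exception as w:
        print('Failed to save the data to the text file', w)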


if __name__ == "__main__":
    host = "rm-8vbwj6507z6465505ro.mysql.zhangbei.rds.aliyuncs.com"
    user = "root"
    password = "AI@2019@ai"
    database = "stu_db"
    port = 3306
    try:
        contents = get_news_from_sql(host, user, password, database, port)
        # save_txt(contents)
    except Exception as e:
        # If an exception occurs, report the exception instance rather than the Exception class
        print("ERROR", e)
        # db.rollback()
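
A likely explanation, offered as an assumption rather than a confirmed diagnosis: in the slow version the write loop runs inside the function decorated with `@pysnooper.snoop()`. PySnooper logs every executed line of the decorated function, along with changes to its local variables, so each loop iteration pays the tracing cost. In the fast version the loop lives in `save_txt`, which is not decorated. The sketch below does not use PySnooper; `count_lines` is a hypothetical stand-in built on `sys.settrace` that only counts line events, to show how the number of traced lines grows when the loop sits inside the traced function.

```python
import sys
from functools import wraps


def count_lines(func):
    """Hypothetical stand-in for a line tracer: counts 'line' events in func only."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.line_events = 0
        target_code = func.__code__

        def local_tracer(frame, event, arg):
            if event == "line":
                wrapper.line_events += 1
            return local_tracer

        def global_tracer(frame, event, arg):
            # Trace only the decorated function's own frame, not the functions it calls.
            if frame.f_code is target_code:
                return local_tracer
            return None

        previous = sys.gettrace()
        sys.settrace(global_tracer)
        try:
            return func(*args, **kwargs)
        finally:
            sys.settrace(previous)

    wrapper.line_events = 0
    return wrapper


def write_rows(rows, sink):
    # Not decorated: the tracer never sees this loop.
    for row in rows:
        sink.append(row[0] + "\n")


@count_lines
def fetch_and_write(rows, sink):
    # Loop inside the traced function: every iteration fires line events.
    for row in rows:
        sink.append(row[0] + "\n")
    return rows


@count_lines
def fetch_only(rows):
    # Only this body is traced; the loop runs elsewhere.
    return rows


rows = [("text",)] * 1000
combined_sink, split_sink = [], []
fetch_and_write(rows, combined_sink)
write_rows(fetch_only(rows), split_sink)
print(fetch_and_write.line_events, fetch_only.line_events)
```

Both variants produce the same output, but the combined one triggers the tracer thousands of times. Removing the decorator, or keeping the loop in an undecorated function, avoids that cost.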


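[Word2Vec] The trained model has no "说": KeyError: "word '说' not in vocabulary"

The model trained on the database data cannot find words related to "说"; it raises KeyError: "word '说' not in vocabulary"

min_count is already set to 1

path_news_txt (the file holding the data read from the news_chinese table)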


from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim import models
# Read the file paths from the config: path_news_txt (data read from the news_chinese table) and path_news_model (where the model is saved)
from config.file_path import path_news_txt, path_news_model

if __name__ == "__main__":
    # Train on the news read from the database
    news_vec = Word2Vec(LineSentence(path_news_txt), size=100, min_count=1, workers=8)
    # Save the training result as a model
    news_vec.save(path_news_model)

    # Load news_model and test it
    model = models.Word2Vec.load(path_news_model)
    # Look up the words in the model that are related to "说"
    said = model.wv.most_similar('说')

    # Running this raises an error saying the trained model has no "说",
    # even though the database contains the character 【说】 and min_count=1
    #
    # File "C:\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 464, in word_vec
    # raise KeyError("word '%s' not in vocabulary" % word)
    # KeyError: "word '说' not in vocabulary"
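
A possible cause, stated as an assumption because the contents of path_news_txt are not shown: `LineSentence` expects one sentence per line with the words already separated by whitespace. Chinese text read straight from the database has no spaces between words, so each whole line would become a single vocabulary entry and "说" would never appear on its own. The sketch below illustrates this with plain `str.split()`; the segmented line is hand-written for illustration, and in practice a word segmenter (jieba is a common choice) would produce it before training.

```python
def whitespace_tokens(line):
    """Split a line into tokens on whitespace, as a whitespace-based reader would."""
    return line.split()


raw_line = "他说今天下雨"            # unsegmented: no spaces between words
segmented_line = "他 说 今天 下雨"    # same sentence, hand-segmented for illustration

raw_vocab = set(whitespace_tokens(raw_line))
segmented_vocab = set(whitespace_tokens(segmented_line))

print("说" in raw_vocab)        # the whole sentence is the only token
print("说" in segmented_vocab)  # "说" is its own token after segmentation
```

If this is the cause, writing space-separated words to the text file before calling `Word2Vec` should make "说" appear in the vocabulary.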
