[Original] Weighting every word of every sentence in a corpus
Includes preprocessing, then weights each word with TF-IDF.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# created by fhqplzj on 2017/05/15 10:48 AM
import itertools
import re
import jieba
from six.moves import xrange
from sklearn.feature_ex
2017-08-10 10:45:45
1287
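The preview above is cut off, so the following is only a minimal sketch of per-sentence TF-IDF weighting, not the post's actual code. It assumes sentences are already tokenized (the post uses jieba; plain `split()` stands in here), and it uses the simple `tf * log(N / df)` variant rather than scikit-learn's smoothed formula. The name `tfidf_weights` is made up for illustration.

```python
import math
from collections import Counter


def tfidf_weights(sentences):
    """Return one dict per tokenized sentence, mapping word -> TF-IDF weight.

    Hypothetical helper: tf is the relative frequency inside the sentence,
    idf is log(N / df) with no smoothing.
    """
    n = len(sentences)
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for sent in sentences for word in set(sent))
    weighted = []
    for sent in sentences:
        counts = Counter(sent)
        total = len(sent)
        weighted.append({
            word: (count / total) * math.log(n / df[word])
            for word, count in counts.items()
        })
    return weighted


# Whitespace tokenization is an assumption; the post segments with jieba.
corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
weights = tfidf_weights(corpus)
```

A word that occurs in every sentence ("the") gets weight zero, while a word unique to one sentence ("sat") gets the largest weight in that sentence.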
[Original] Several ways to compute sentence similarity
Jaccard similarity, cosine similarity, and TF-IDF-weighted cosine similarity.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# created by fhqplzj on 2017/07/22 12:23 AM
import numpy as np
from scipy.linalg import norm
from sklearn.feature_extraction.tex
2017-08-09 19:16:29
7659
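The three measures named in the post above can be sketched without any third-party library. This is an illustrative version under stated assumptions, not the post's numpy/scikit-learn code: inputs are token lists, and the optional `weights` dict (a hypothetical parameter) stands in for per-term IDF values.

```python
import math
from collections import Counter


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine_similarity(tokens_a, tokens_b, weights=None):
    """Cosine of the two term-count vectors.

    `weights` optionally scales each term (for example by its IDF);
    terms missing from `weights` keep a factor of 1.0.
    """
    weights = weights or {}
    vec_a = {t: c * weights.get(t, 1.0) for t, c in Counter(tokens_a).items()}
    vec_b = {t: c * weights.get(t, 1.0) for t, c in Counter(tokens_b).items()}
    dot = sum(v * vec_b.get(t, 0.0) for t, v in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = "i like apples".split()
b = "i like oranges".split()
jac = jaccard_similarity(a, b)
cos = cosine_similarity(a, b)
# Zero weight for the shared words: only the differing words remain.
cos_weighted = cosine_similarity(a, b, weights={"i": 0.0, "like": 0.0})
```

Down-weighting common words is what makes the TF-IDF variant stricter than plain cosine: two sentences that share only frequent words stop looking similar.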
[Original] A simple application of MiniBatchKMeans
MiniBatchKMeans is much faster than KMeans and still gives good results. Applied to text clustering:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import logging
import os
import re
from collections import defaultdict
2017-08-07 11:06:15
2546
[Original] Building a graph with Spark GraphX
import org.apache.spark.graphx.{Edge, Graph}
import utility.Helpers
import scala.collection.mutable
/**
 * Created by fhqplzj on 2017/7/20.
 */
object SemanticNormalization {
  def word_count(s: S
2017-07-21 11:54:50
1562
[Original] A coroutine-based asynchronous crawler
A small asynchronous crawler example built on the Tornado framework:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# created by fhqplzj on 2017/07/19 5:48 PM
import logging
import time
from datetime import timedelta
from urlparse import urljoin, ur
2017-07-19 19:44:49
675
[Original] Longest common subsequence in JavaScript
New to JavaScript, so here is the LCS problem as practice:
/**
 * Created by fhqplzj on 2017/7/15.
 */
function lcs(s1, s2) {
    var m = s1.length, n = s2.length;
    var dp = new Array(m + 1);
    for (var i = 0; i < dp.length; i++) {
2017-07-15 11:16:29
1015
[Original] A multilayer neural network with Xavier initialization and dropout
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# created by fhqplzj on 2017/07/07 3:22 PM
import random
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.contrib.layers import xav
2017-07-08 15:25:37
1017
[Original] A number-guessing game in Python
Wrote a number-guessing game in Python out of boredom:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# created by fhqplzj on 2017/06/23 12:18 PM
import random
import string
import time
def game(target):
    possibles = string.letters +
2017-06-27 19:55:58
623
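[Original] The seq2seq model
I have a machine translation task coming up, which calls for a seq2seq model. A seq2seq model consists of an encoder and a decoder, both of which are LSTM-based RNNs. During decoding, the output of the previous cell is fed in as the input of the next cell.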
2017-06-20 16:13:00
659
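The data flow described in the seq2seq post above can be sketched in plain Python. This is only a toy under stated assumptions: `encoder_step` and `decoder_step` are hypothetical stand-ins for LSTM cells (simple arithmetic, no learned weights), and `START`, `END`, and `translate` are names introduced here for illustration. What it does show is the loop structure: the encoder folds the source into a state, and each decoder step receives the previous step's output.

```python
START, END = 0, -1  # Hypothetical start/end markers for the toy vocabulary.


def encoder_step(state, token):
    """Stand-in for an LSTM encoder cell: fold one source token into the state."""
    return state + token


def decoder_step(state, prev_token):
    """Stand-in for an LSTM decoder cell.

    Returns the (unchanged) state and the next token, computed from the
    previous output. Emits END once the count passes the encoded state.
    """
    next_token = prev_token + 1
    return state, (END if next_token > state else next_token)


def translate(source_tokens, max_len=20):
    # Encoder: consume the whole source sequence into a single state.
    state = 0
    for token in source_tokens:
        state = encoder_step(state, token)

    # Decoder: feed each output back in as the next input.
    output = []
    prev = START
    for _ in range(max_len):
        state, prev = decoder_step(state, prev)
        if prev == END:
            break
        output.append(prev)
    return output


result = translate([1, 2])
```

The `max_len` cap matters in a real decoder too, since nothing guarantees that a trained model ever emits the end token.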
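[Original] Classification experiments with unidirectional and bidirectional RNNs on MNIST
Using an RNN for image classification is an unusual idea. I do not fully understand it, but it is impressive; see the relevant papers for details. The RNN and BiRNN experiments:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# created by fhqplzj on 2017/06/19 10:28 PM
from __future__ import print_function
import tensorflow as tf
fr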
2017-06-20 10:20:46
1288
[Original] A Python implementation of simhash
import hashlib
def hash_str(s):
    md5 = hashlib.md5()
    md5.update(s)
    res = int(md5.hexdigest()[:16], base=16)
    return bin(res)[2:].zfill(64)
def simhash(words, weights):
    words = ma
2017-03-23 23:04:33
1629
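Since the preview above stops partway through `simhash`, here is a hedged sketch of how a 64-bit simhash is commonly computed. It is not a reconstruction of the post's function: it is written for Python 3, works on integers rather than bit strings, and the names `hash64` and `hamming_distance` are introduced here. Like the post, it takes the first 16 hex digits of an MD5 digest as the 64-bit token hash.

```python
import hashlib


def hash64(token):
    """64-bit integer hash: the first 16 hex digits of the MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest[:16], 16)


def simhash(tokens, weights):
    """Weighted simhash fingerprint of a token list, as a 64-bit integer."""
    totals = [0] * 64
    for token, weight in zip(tokens, weights):
        h = hash64(token)
        for bit in range(64):
            # Each token votes +weight where its hash bit is 1, -weight where 0.
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


fp1 = simhash(["spark", "graph", "cluster"], [3, 1, 1])
fp2 = simhash(["spark", "graph", "index"], [3, 1, 1])
distance = hamming_distance(fp1, fp2)
```

Two documents are then treated as near-duplicates when the Hamming distance between their fingerprints falls under a threshold chosen for the data set.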
[Original] Automatic summary extraction in Python with TextRank
# encoding=utf-8
import jieba
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
def cut_sentence(sentence):
    """
    Split the text into sentences
    :param sentence:
2017-03-10 22:29:27
7774
2
[Original] Splitting Chinese text into sentences in Python
# -*-coding=UTF-8-*-
def cut_sentences(sentence):
    if not isinstance(sentence, unicode):
        sentence = unicode(sentence)
    puns = frozenset(u'。!?')
    tmp = []
    for ch in sentence:
2017-03-10 18:46:32
5907
1
[Original] The TextRank algorithm
# -*-coding=UTF-8-*-
import networkx
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
document = """To Sherlock Hol
2017-03-10 15:56:43
1264
[Original] Bulk indexing with Python
# encoding=utf-8
import elasticsearch.helpers
from elasticsearch import Elasticsearch
path = '/home/fhqplzj/data/orion/news.json'
es = Elasticsearch('localhost:9200')
my_index, my_type = 'test_index'
2017-03-04 17:17:50
1332
[Original] A simhash implementation
import com.clearspring.analytics.hash.MurmurHash
/**
 * Created by fhqplzj on 17-3-1 at 6:07 PM.
 */
object Sim {
  def simHash(features: Array[String], weights: Array[Int]): Long = {
    val hist =
2017-03-01 18:19:42
1159
[Original] The longest common subsequence algorithm
/**
 * Created by fhqplzj on 17-2-28 at 9:56 AM.
 */
object LongestCommonSubsequence {
  def LCS(s1: String, s2: String): (Int, String) = {
    val m = s1.length
    val n = s2.length
    val dp = Ar
2017-02-28 10:17:28
530
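The Scala preview above ends just as the table is being allocated, so here is the standard dynamic-programming approach sketched in Python. It mirrors the `(Int, String)` return shape of the post's `LCS` signature, but the body is the textbook algorithm and not necessarily what the post does. When several subsequences share the maximum length, this version returns just one of them.

```python
def lcs(s1, s2):
    """Return (length, one longest common subsequence) of two strings."""
    m, n = len(s1), len(s2)
    # dp[i][j] = LCS length of s1[:i] and s2[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from the bottom-right corner to recover one subsequence.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            chars.append(s1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return dp[m][n], "".join(reversed(chars))


length, subsequence = lcs("ABCBDAB", "BDCABA")
```

Both time and memory are O(m * n); if only the length is needed, two rows of the table are enough.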
[Original] argsort in NumPy
import numpy as np
a = np.random.rand(10)
print(a)
index_array = np.argsort(a)
print(a[index_array])
2017-02-26 11:01:07
524
[Original] Connecting to Ubuntu over SSH automatically from Python
import getpass
import socket
import termios
import sys
import traceback
import tty
import paramiko
from paramiko.py3compat import u
def posix_shell(chan):
    import select
    oldtty = termios.t
2017-02-25 18:13:12
690
[Original] Heapsort in Python
import heapq
def heapsort(items):
    heap = []
    for item in items:
        heapq.heappush(heap, item)
    return [heapq.heappop(heap) for _ in range(len(heap))]
nums = [1, 3, 5, 7, 9, 2, 4, 6,
2017-02-25 13:13:29
695
[Original] Building an inverted index in Python with jieba word segmentation
# encoding=utf-8
import json
import jieba
from sys import argv
from collections import defaultdict
path = argv[1]
objs = map(lambda s: json.loads(s), open(path).readlines())
res = defaultdict(list)
f
2017-02-25 10:36:54
4316
[Original] A simple Elasticsearch example
Examples of bool and dis_max queries. Disjunction means "or", while conjunction means "and".
# encoding=utf-8
from pprint import pprint
from elasticsearch import Elasticsearch
es = Elasticsearch('localhost:9200')
_in
2017-02-23 18:21:56
822
[Original] An Elasticsearch search example
# encoding=utf-8
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
es = Elasticsearch(hosts='10.52.66.31')
def select(_index, _type, offset, limit, **kwarg
2017-02-23 16:08:49
570
[Original] Pretty-printing JSON
package java2.format;
import com.google.common.base.Strings;
import com.google.common.collect.Lists;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils;
import java.
2017-02-23 15:39:10
1022
[Original] A Java implementation of the top-k problem
import com.google.common.base.Splitter;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.primitives.Ints;
import org.apache.commons.io
2017-02-18 22:01:29
1124
[Original] Using Guava's Stopwatch
import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;
/**
 * Created by fhqplzj on 17-2-16 at 9:57 PM.
 */
public class My4 {
    public static void main(String[] args) {
2017-02-16 22:13:14
5169
[Original] A Python implementation of the Ncut algorithm
A Python implementation of the normalized cut algorithm, for two-dimensional data points only:
# encoding=utf-8
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg as LA
from sklearn.cluster import KMeans
from sklearn.datasets import
2017-01-27 06:33:53
4943
[Original] Tsinghua University: counting prime factors
import java.util.Scanner;
/**
 * Created by fhqplzj on 17-1-26 at 5:16 PM.
 */
public class My10 {
    private static int getNum(int n) {
        int result = 0;
        while (true) {
            int
2017-01-26 17:58:27
469
[Original] Tsinghua University: counting divisors
import java.util.Scanner;
/**
 * Created by fhqplzj on 17-1-25 at 11:06 PM.
 */
public class My2 {
    private static int getNum(int n) {
        int result = 0;
        int i = 1;
        for (; i *
2017-01-25 23:15:06
377
[Original] Tsinghua University: sorting grades
import java.util.ArrayList;
import java.util.Scanner;
/**
 * Created by fhqplzj on 17-1-25 at 10:19 PM.
 */
class Student {
    private int index;
    private String name;
    private int score;
2017-01-25 23:02:41
743
[Original] A snippet of code a day: check-in
package tmp
import java.lang.management.ManagementFactory
/**
 * Created by fhqplzj on 17-1-20 at 9:56 AM.
 */
object pro4 {
  val memory: Int = inferDefaultMemory()
  def inferDefaultCores(): Int
2017-01-20 10:07:35
1241
[Original] Text preprocessing: removing documents with a term frequency of 1
package clustering.garbage
import java.io.PrintWriter
import org.apache.spark.SparkContext
/**
 * Created by fhqplzj on 17-1-12 at 8:40 PM.
 */
object Lines {
  def main(args: Array[String]): Unit
2017-01-17 15:47:59
769
[Original] The spark-defaults.conf configuration file
spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.execu
2017-01-16 22:14:35
9986
[Original] Computing TF with csr_matrix
from scipy.sparse import csr_matrix
def tf(docs):
    """
    As an example of how to construct a CSR matrix incrementally, the following snippet builds a term-document matrix from texts:
    :t
2017-01-11 21:58:30
1290
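[Original] Spectral clustering in Python: power iteration
Following on from the previous post, this implements the power iteration clustering algorithm:
# encoding=utf-8
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as LA
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kern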
2017-01-10 20:17:04
1180
[Original] Spectral clustering in Python: the NJW algorithm
The code is commented:
# encoding=utf-8
import matplotlib.pyplot as plt
import numpy as np
from numpy import linalg as LA
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metr
2017-01-10 16:05:10
8174
3
[Original] The Soundex algorithm
def soundex(name, len=4):
    """
    :type name:str
    :rtype:str
    :param name:
    :param len:
    :return:
    """
    soundex_digits = '01230120022455012623010202'
    sndx = ''
    fc = ''
2017-01-08 20:41:43
2212
[Original] Replacing words one by one in Python according to the word pairs in a dictionary
import re
import six
class Xlator(dict):
    def _make_regex(self):
        return re.compile("|".join(map(re.escape, six.iterkeys(self))))
    def __call__(self, match):
        return self[match.
2017-01-08 12:58:28
2035
[Original] Setting default parameters and the property cycle in matplotlib
from cycler import cycler
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
x = np.linspace(0, 2 * np.pi)
offsets = np.linspace(0, 2 * np.pi, num=4, endpoint=False)
yy = np.t
2017-01-08 11:30:42
2911
[Original] Drawing bar charts in Python
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
logistic_regression = [
    [184, 111, 76],
    [116, 80, 62],
    [15, 6, 3]
]
k_means = [
    [274, 157, 106],
    [197, 1
2017-01-07 16:05:04
5633