海涛anywn - CSDN Blog

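Original: Common feature selection methods in machine learning

Feature selection serves two main purposes: (1) reducing the number of features (dimensionality reduction), which improves the model's generalization and reduces overfitting; (2) improving the understanding of the features and their values. 1. Removing features with low variance. This is probably the simplest feature selection method: suppose a feature takes only the values 0 and 1, and in 95% of all input samples its value is 1; then the feature can be considered of little use. If 100% of the values are 1, then this feature...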

2019-05-27 15:48:34 2196
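The low-variance filter described in the entry above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the code from the original post; the function name `keep_high_variance`, the sample data, and the 90% cut-off are assumptions made for the example.

```python
from statistics import pvariance


def keep_high_variance(rows, threshold):
    """Return the indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose rows into columns
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


# Hypothetical data: 20 samples, 2 binary features.
# Feature 0 equals 1 in 95% of the samples; feature 1 is evenly split.
rows = [(1, i % 2) for i in range(19)] + [(0, 1)]

# A 0/1 feature that equals 1 with frequency p has variance p * (1 - p).
# Assumed cut-off: drop features that take one value in 90% or more of samples.
threshold = 0.9 * (1 - 0.9)

kept = keep_high_variance(rows, threshold)
```

Here feature 0 has variance 0.95 * 0.05 = 0.0475, which falls below the assumed threshold, so only feature 1 is kept.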

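Original: Steps for installing xgboost on Linux

xgboost is a lightweight GBDT implementation. I ran into quite a few pitfalls while installing it, so I am writing them down here. 1. Install anaconda: xgboost has some dependencies, so they need to be installed before xgboost itself. 2. Download xgboost. 3. Compile and install: cd /home/xgboost-master ; make ; cd wrapper ; python ../python-package/setup.py install...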

2019-05-27 15:47:51 1472

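Original: Installing a Python development environment for TensorFlow on Linux

Installing TensorFlow is fairly simple, much like installing any other Python dependency. I installed it with anaconda and pip, using the two together. 1. Install anaconda: first go to https://www.continuum.io/downloads and download anaconda. The current releases come in Python 2.7 and Python 3.5 versions. Download the anaconda build that matches your Python version and operating system; it is actually an sh script file,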

2017-04-17 23:24:30 2276

Original: Calling common classification models in Python sklearn

A simple way to call the 'NB', 'KNN', 'LR', 'RF', 'DT', 'SVM', 'SVMCV', and 'GBDT' models. # coding=gbk import time from sklearn import metrics import pickle as pickle import pandas as pd # Multinomial Naive Bayes Cla

2017-03-27 15:19:30 16040 2

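Original: How the SimHash algorithm works

When I first joined the company, a project used simhash, but I had not studied it in detail; later I was asked how it works and was completely stumped. Below is what I found by reading up on it, together with my own understanding. It lives up to being a Google product: in keeping with Google's style, it is simple and practical. First, a figure found online. To explain the figure: here a feature can be a word obtained by tokenizing a document, that is, a word in the document is treated as a feature. The weight is the weight of that word, which here can be the number of times the word appears in the sentence. The hash algorithm here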

2016-08-29 19:42:26 18075 2
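The feature, weight, and hash ingredients named in the entry above fit together as in the following sketch of the standard SimHash procedure. It is not taken from the original post, whose excerpt is cut off before the details: the use of MD5 as the per-word hash, the 64-bit width, and the function names are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint from a list of word tokens."""
    weights = Counter(tokens)  # weight of a feature = its count in the text
    totals = [0] * bits
    for token, weight in weights.items():
        # Assumed hash: the first bits/8 bytes of the word's MD5 digest.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # a positive sum gives a 1 bit in the fingerprint
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc = ["simhash", "is", "simple", "and", "practical", "simhash"]
fp = simhash(doc)
```

Two documents are then compared by the Hamming distance between their fingerprints; the distance threshold that counts as a near-duplicate depends on the application.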

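Repost: Word2vec sentence vector models PV-DM and PV-DBOW

Reference: LE, Quoc V.; MIKOLOV, Tomas. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014. This paper proposes a method for building feature vectors for sentences using the principles of Word2vec. Reading it requires prior knowledge of Word2vec; here I recommend the blog post "Word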

2016-08-24 11:38:27 12370

Original: Differences and connections among bootstrap, boosting, and bagging

Source: http://blog.sina.com.cn/s/blog_4a0824490102vb2c.html While reading about boosting algorithms over the past couple of days, I came across a good article covering bootstrap, jackknife, bagging, boosting, random fore

2016-07-28 20:01:21 5420

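Original: Comparing algorithm efficiency

Problem: arrays A and B contain the same elements, but A is sorted and B is unsorted. Given the median of the arrays, consider the following two programs; compare their efficiency and explain the cause. int g; int main() { g = 0; for(int i = 0 ; i < n ; i++) { if( A[i] > mid ) g++; } for(int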

2016-07-28 10:54:16 714

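Original: Finding the minimum in a rotated sorted array

I will skip the O(n) algorithm; this problem is really testing the O(log n) algorithm. A sorted array naturally suggests binary search, and this problem is solved by adjusting binary search slightly. The array is rotated only once, so the original ascending array is split into two parts, each of which is sorted in ascending order (only the ascending case needs to be considered here). Suppose the rotated array is split at the x-th element into two parts, A[0..x] and A[x+1..n]. Then A[0..x] is sorted in ascending order, and A[x+1..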

2016-07-27 09:37:15 2019
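The adjusted binary search described in the entry above can be sketched as follows. The excerpt is cut off before the author's own solution, so this is one common variant rather than the original code; it assumes a non-empty array of distinct values and compares the middle element against the right end.

```python
def find_min(a):
    """Return the minimum of a rotated ascending array of distinct values."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] > a[hi]:
            # mid lies in the left (larger) part, so the minimum is to its right.
            lo = mid + 1
        else:
            # mid lies in the right (smaller) part, so the minimum is at mid or to its left.
            hi = mid
    return a[lo]


smallest = find_min([3, 4, 5, 1, 2])
```

Each iteration halves the search range, which gives the O(log n) running time the entry refers to.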

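Original: The EM algorithm, an optimization algorithm for machine learning

Introduction to the EM algorithm. EM is really the general name for a family of algorithms. It consists of two steps, the E-step and the M-step. EM is widely applicable: essentially any machine learning model whose parameters need to be optimized iteratively can use EM for the optimization. Idea and procedure of the EM algorithm. E-step: E stands for Expectation. The E-step is the process of obtaining the expectation, that is, computing the result of feeding each observation into the current model. This process is called the expectation computation, or the E-step. M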

2016-07-13 09:05:53 7585

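Original: Merging two sorted linked lists

Given two monotonically increasing linked lists, output the list obtained by merging them; the merged list must of course be monotonically non-decreasing. /* public class ListNode { int val; ListNode next = null; ListNode(int val) { this.val = val; } } */ public class Solution {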

2016-07-10 16:54:12 344

Original: Fibonacci sequence

Everyone knows the Fibonacci sequence. Given an integer n, output the n-th term of the Fibonacci sequence. public class Solution { public int Fibonacci(int n) { if(n<=1){ return n; } int[] record = new int[n+1];

2016-07-10 16:16:33 305

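Original: Minimum number in a rotated array

Moving some number of elements from the beginning of an array to its end is called rotating the array. Given a rotation of an array sorted in ascending order, output the smallest element of the rotated array. For example, the array {3,4,5,1,2} is a rotation of {1,2,3,4,5}, and its minimum is 1. NOTE: all given elements are greater than 0; if the array size is 0, return 0. import java.util.ArrayList; public class Solution { publi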

2016-07-10 16:08:03 327

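Original: Reconstructing a binary tree

Given the preorder and inorder traversal results of a binary tree, reconstruct the tree. Assume that neither traversal contains duplicate numbers. For example, given the preorder sequence {1,2,4,7,3,5,6,8} and the inorder sequence {4,7,2,1,5,3,8,6}, reconstruct the binary tree and return it. /** * Definition for binary tree * public class TreeNode { * int val; *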

2016-07-10 15:47:25 439

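Original: Searching in a 2D array

In a two-dimensional array, every row is sorted in ascending order from left to right and every column is sorted in ascending order from top to bottom. Write a function that takes such a 2D array and an integer and determines whether the array contains that integer. public boolean Find(int [][] array,int target) { int len = array.length-1; int i = 0;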

2016-07-10 13:47:16 319

Original: 5. Longest Palindromic Substring

Given a string S, find the longest palindromic substring in S. You may assume that the maximum length of S is 1000, and there exists one unique longest palindromic substring. class Solution {p

2016-07-10 10:40:37 723

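Original: Explanation of Stanford-parser dependency relations

ROOT: the sentence of the text to be processed; IP: simple clause; NP: noun phrase; VP: verb phrase; PU: sentence delimiter, usually punctuation such as a period, question mark, or exclamation mark; LCP: localizer phrase; PP: prepositional phrase; CP: phrase formed with '的' that expresses a modifying relation; DNP: phrase formed with '的' that expresses a possessive relation; ADVP: adverb phrase; ADJP: adjective phrase; DP: determiner phrase; QP: quantifier phrase; NN: common noun; NR: proper noun; NT: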

2016-07-02 21:19:28 31374 3

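Original: Computing the weight of each word in an article: information entropy and code implementation

The information entropy of each word can be used as the word's weight. The information entropy formula is H(W) = -Σ p·log(p), where W is the word and p is the relative frequency of each distinct word that appears to the left (or right) of W. For example, suppose an article contains A W C twice and B W D once. Then the left entropy of W is -(2/3)·log(2/3) - (1/3)·log(1/3): 2/3 means that A appeared in 2 of the 3 occurrences, and B appeared only once, hence 1/3. The right entropy of W is computed the same way. If the occurrences are A W C and B W C, the right entropy of W is 0, because it is -1·log(1). For all words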

2016-06-29 16:15:32 9155 4
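The left and right entropy from the entry above can be computed with a short function. This is a sketch rather than the post's own implementation: the function name `neighbor_entropy` is made up for the example, and the natural logarithm is an assumption, since the excerpt does not state the base.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Entropy of the words observed on one side of a target word."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    # Sum of -p * log(p) over the relative frequency p of each distinct neighbor.
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# "A W C" twice and "B W D" once, as in the entry's example.
left = neighbor_entropy(["A", "A", "B"])
right = neighbor_entropy(["C", "C", "D"])

# "A W C" and "B W C": the right neighbor is always C, so the entropy is zero.
single = neighbor_entropy(["C", "C"])
```

A word that appears next to many different neighbors gets a higher entropy, which is what makes the value usable as a weight.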

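Original: Identifying and extracting an article's topic sentence based on title classification

Topic sentence extraction based on title classification. The method can be described as follows: given a news report, compute the similarity between its title and the news topic word set, and decide whether the title is indicative. For an indicative title, extract the sentence in the report most similar to it as the topic sentence; otherwise, combine multiple features to score the importance of each sentence in the report and take the highest-scoring sentence as the topic sentence. Algorithm: 1. Build the topic word set for the news. (1) For crawled content that has tags or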

2016-06-24 17:53:46 9880 1

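Original: Principles and implementation of word segmentation with CRF++

Principles and implementation of word segmentation with CRF. The CRF model currently gives the best segmentation results in industry, and CRF++ is a fairly mature CRF implementation; below is the process of doing word segmentation with CRF++. 1. Preprocess the training corpus with 4-tag labeling: B marks the beginning of a word, E the end of a word, M the middle of a word, and S a single-character word. Then use Python to convert the words in the training corpus into the CRF input format. For example, the sentence: 海內外關注的一九九七年七月一日終於來到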

2016-06-22 20:58:54 8096
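The 4-tag preprocessing step in the entry above can be sketched like this. It is an illustration, not the script from the post: the function names are hypothetical, the segmentation of the sample words is assumed for the example, and the output uses one character and one tag per line separated by a tab, which is the column layout CRF++ training files use.

```python
def tag_word(word):
    """Label each character of a non-empty word with B, M, E, or S."""
    if len(word) == 1:
        return [(word, "S")]  # single-character word
    middle = [(ch, "M") for ch in word[1:-1]]
    return [(word[0], "B")] + middle + [(word[-1], "E")]


def to_crf_lines(words):
    """Turn a segmented sentence into 'character<TAB>tag' lines."""
    lines = []
    for word in words:
        for ch, tag in tag_word(word):
            lines.append(f"{ch}\t{tag}")
    return lines


# Assumed segmentation of the first few characters of the sample sentence.
words = ["海內外", "關注", "的"]
lines = to_crf_lines(words)
```

In a full training file, each sentence's lines would be followed by an empty line so that CRF++ can tell sentences apart.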

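Original: Spark performance tuning

There are many ways to tune Spark performance; here are some that I have used. 1. Balancing the number of RDD partitions against the number of executors. To process the data fully in parallel and make full use of every executor, the number of RDD partitions and the number of executors need to be in a suitable ratio. If there are many partitions but few executors, some partitions have to wait for an idle executor, so not all of the data can be processed in parallel at once. If there are few partitions while the executo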

2016-06-21 18:33:33 6372

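Original: How the Stanford and NLTK English phrase extraction tools work, with notes on the source code

I. The Stanford phrase extraction tool implements four methods for extracting phrase collocations. (1) Frequency-based method. This method finds contiguous phrase collocations of length 2 or 3, so it only processes bigram and trigram corpora. The candidate phrase set is first filtered with predefined part-of-speech sequences, discarding phrase combinations that do not match them. The predefined part-of-speech combinations are: NN_NN, JJ_NN, VB_NN, NN_NN_NN, JJ_NN_NN, NN_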

2016-06-12 12:07:55 12092
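The part-of-speech pre-filter from the entry above can be illustrated with plain Python. This is a sketch of the filtering idea only, not the Stanford or NLTK source code: it covers just the patterns that are fully legible in the excerpt (the list there is cut off), and the hand-tagged sample tokens and the function name are assumptions.

```python
# Patterns fully visible in the excerpt; the original list continues beyond these.
ALLOWED_PATTERNS = {
    ("NN", "NN"),
    ("JJ", "NN"),
    ("VB", "NN"),
    ("NN", "NN", "NN"),
    ("JJ", "NN", "NN"),
}


def candidate_phrases(tagged, allowed=ALLOWED_PATTERNS):
    """Return contiguous bigrams and trigrams whose tag sequence is allowed."""
    phrases = []
    for n in (2, 3):
        for i in range(len(tagged) - n + 1):
            window = tagged[i : i + n]
            if tuple(tag for _, tag in window) in allowed:
                phrases.append(" ".join(word for word, _ in window))
    return phrases


# Hypothetical hand-tagged tokens using simplified tags.
tagged = [("big", "JJ"), ("data", "NN"), ("platform", "NN"), ("runs", "VB"), ("jobs", "NN")]
phrases = candidate_phrases(tagged)
```

The surviving candidates would then be counted across the corpus, which is the frequency step the entry describes.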

Original: 328. Odd Even Linked List

Given a singly linked list, group all odd nodes together followed by the even nodes. Please note here we are talking about the node number and not the value in the nodes. You should try to do it in

2016-06-05 18:37:28 434

Original: 326. Power of Three

Given an integer, write a function to determine if it is a power of three. public class Solution { public boolean isPowerOfThree(int n) { double res = Math.log(n)/Math.log(3); ret

2016-06-05 18:36:50 276

You are playing the following Nim Game with your friend: There is a heap of stones on the table, each time one of you take turns to remove 1 to 3 stones. The one who removes the last stone will be the

2016-06-05 18:36:06 288

Original: 258. Add Digits

Given a non-negative integer num, repeatedly add all its digits until the result has only one digit. For example: Given num = 38, the process is like: 3 + 8 = 11, 1 + 1 = 2. Since 2 has on

2016-06-05 18:35:18 295

Original: 242. Valid Anagram

Given two strings s and t, write a function to determine if t is an anagram of s. For example, s = "anagram", t = "nagaram", return true. s = "rat", t = "car", return false. public class Solutio

2016-06-05 18:34:18 271

Original: 237. Delete Node in a Linked List

Write a function to delete a node (except the tail) in a singly linked list, given only access to that node. Suppose the linked list is 1 -> 2 -> 3 -> 4 and you are given the third node with value

2016-06-05 18:33:33 269

Original: 231. Power of Two

Given an integer, write a function to determine if it is a power of two. public class Solution { public boolean isPowerOfTwo(int n) { return n > 0 && (n & (n - 1)) == 0; } }

2016-06-05 18:32:23 295

Original: 226. Invert Binary Tree

Invert a binary tree.

     4
   /   \
  2     7
 / \   / \
1   3 6   9

to

     4
   /   \
  7     2
 / \   / \
9   6 3   1

/** * Definition for a binary tree node. * public class TreeNode { *

2016-06-05 18:31:17 255

Original: 217. Contains Duplicate

Given an array of integers, find if the array contains any duplicates. Your function should return true if any value appears at least twice in the array, and it should return false if every element

2016-06-05 18:30:13 271

Original: 203. Remove Linked List Elements

Remove all elements from a linked list of integers that have value val. Example. Given: 1 --> 2 --> 6 --> 3 --> 4 --> 5 --> 6, val = 6. Return: 1 --> 2 --> 3 --> 4 --> 5. /** * Definition for sing

2016-06-05 18:29:08 282

Original: 202. Happy Number

Write an algorithm to determine if a number is "happy". A happy number is a number defined by the following process: Starting with any positive integer, replace the number by the sum of the squares

2016-06-05 18:28:20 293

Original: 110. Balanced Binary Tree

Given a binary tree, determine if it is height-balanced. For this problem, a height-balanced binary tree is defined as a binary tree in which the depth of the two subtrees of every node never diffe

2016-06-05 18:27:07 426

Original: 104. Maximum Depth of Binary Tree

Given a binary tree, find its maximum depth. The maximum depth is the number of nodes along the longest path from the root node down to the farthest leaf node. /** * Definition for a binary tree

2016-06-05 18:26:15 309

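Original: Using a Spark-based CRF model, with source code analysis

Implementing CRF on Spark: process and source code analysis. Crf-spark is implemented on top of Spark's LBFGS algorithm. Because LBFGS is implemented in Spark's mllib library, calling it when training a CRF on the Spark platform makes the iterations faster and shortens training time. Source code: https://github.com/lihait/CRF-Spark The source is written in Scala; after downloading it, use the sbt tool to package it into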

2016-06-03 21:21:57 3750

Original: 70. Climbing Stairs

You are climbing a staircase. It takes n steps to reach the top. Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top? public class Solution { pu

2016-06-01 18:42:40 327

Original: 67. Add Binary

Given two binary strings, return their sum (also a binary string). For example, a = "11", b = "1". Return "100". public class Solution { public String addBinary(String a, String b) {

2016-06-01 18:41:49 323

Original: 66. Plus One

Given a non-negative number represented as an array of digits, plus one to the number. The digits are stored such that the most significant digit is at the head of the list. import java.math.Big

2016-06-01 18:41:01 290

Original: 38. Count and Say

The count-and-say sequence is the sequence of integers beginning as follows: 1, 11, 21, 1211, 111221, ... 1 is read off as "one 1" or 11. 11 is read off as "two 1s" or 21. 21 is read off as

2016-06-01 18:39:34 254
