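Original: Switching IPs on a schedule with ADSL commands
When crawling data with scapy, the IP often gets blocked; on some VPS hosts, reconnecting ADSL is a way to switch to a new IP. On Windows, the following commands connect or disconnect ADSL: rasdial ADSL user_name password (connect); rasdial ADSL /d (disconnect). For convenience, the following Python script controls this on a timer: #coding:utf-8 import os import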
2017-06-29 14:39:38
1604
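Original: Table border problem in Outlook HTML email
At work I need to send an email report automatically on a schedule, with the body written in HTML and CSS. To reduce the amount of code, I defined the following in an external style: td {border:1px solid;} Mail written this way displays correctly in Foxmail, but in Outlook it shows no borders. I then added a border attribute to every td and found that each cell drew its own border, so both borders showed between every pair of adjacent cells, which looked very ugly. Writing it as follows displays correctly: table has a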
2017-06-16 20:35:03
13823
1
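Original: Backing up data from large MySQL tables
Requirement: a database has some tables that receive millions of rows per day, and the data for a given time range must be saved locally. At first I used fetchall(), which brought the server down outright. The mysqldump command locks the table, which blocks writes. I later found that Python's MySQLdb provides fetchmany(), which controls how many rows are fetched per call. The following code reads from the database according to a where condition without putting much pressure on the server. # coding=utf-8 # crea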
2016-01-27 18:05:02
2947
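As a rough illustration of the fetchmany() idea in the excerpt above, here is a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for MySQLdb (an assumption made so the example runs anywhere; both expose cursor.fetchmany() under the Python DB-API). The table name log, its columns, and the helper iter_rows are hypothetical and not taken from the original post.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows from an already-executed cursor, batch_size rows per fetch."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:  # an empty list means the result set is exhausted
            break
        for row in rows:
            yield row


# Hypothetical demo table standing in for the large MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [(i, "2016-01-27" if i % 2 else "2016-01-26") for i in range(10)],
)

# Read only the rows matching the WHERE condition, three at a time.
cur = conn.cursor()
cur.execute("SELECT id, day FROM log WHERE day = ? ORDER BY id", ("2016-01-27",))
exported = list(iter_rows(cur, batch_size=3))
print(len(exported), "rows exported")
```

In a real export the rows would be written to a local file inside the loop instead of being collected into a list, so that memory use stays bounded by the batch size.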
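Original: Downloading files from kaggle.com with wget
Datasets on kaggle.com are sometimes fairly large, no cloud-drive download is provided, and download speeds from within China are very slow; downloads also require authentication, so the Xunlei download tool cannot be used either. A post on the Kaggle forum describes a wget approach [1]: log in to kaggle.com, note the cookie in the browser, save it to cookies.txt, and run the following command: wget -x --load-cookies cookies.txt -P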
2015-11-02 16:12:20
8930
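Original: Exploring the "national dishonest debtors" data with Spark SQL
The "National Courts List of Dishonest Persons Subject to Enforcement", at http://shixin.court.gov.cn/, is publicly searchable and is used to penalize dishonest debtors. It holds more than 1 million records, which arguably counts as big data. The ID card numbers have been masked, so the full numbers are not directly visible. I pledge not to use this data for illegal or improper purposes; it serves only as a data source for my personal study of data processing and analysis, and is not directed at any individual or organization. The data fields are: name of the person/entity subject to enforcement, gender, age, ID card number/organization code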
2015-09-04 15:06:34
2740
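Original: Exploring the movie-lens data with Spark and Zeppelin
The MovieLens 100k dataset contains 100,000 records relating users and movies. First download and unpack the data: wget http://files.grouplens.org/datasets/movielens/ml-100k.zip; unzip ml-100k.zip; cd ml-100k # user file (ID, age, gender, occupation, zip code) zhf@ubuntu:~/Downloads/ml-100k$ head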
2015-08-30 20:31:08
4412
2
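Translation: Introduction to Apache Zeppelin
Zeppelin is an Apache incubator project, a multi-purpose notebook (similar to the IPython notebook: you can write code and notes directly in the browser and share them). It covers what you need: data ingestion, data discovery, data analytics, and data visualization and collaboration. It supports multiple languages; the defaults are Scala (backed by the Spark shell), SparkSQL, Markdown and Shell. You can even add support for your own language. How to write a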
2015-04-01 12:13:53
28669
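Original: SQL injection
Through a successful SQL injection, it may be possible to obtain all the information in the target database! First, find a target URL to test for the vulnerability. Search Google for: inurl:news.php?id=2 Open any result and append SQL to the URL: if an error is reported, the site is injectable; if no error is reported, there is no injectable vulnerability or none was found. For example, take the URL http://www.calidus.ro/en/news.php?id=2 and change the link as follows, then visit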
2015-03-24 18:30:07
1864
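Original: A simple product-information crawler - crawling Yixun
I had collected a lot of Yixun product IDs and wanted to crawl the product information for them. A quick analysis showed that Yixun puts all of this information directly in the HTML page, so parsing a single page is enough. The result, for each ID, is the product URL, title, Yixun price, promotional price, and category. Here is the Python code: #!/usr/bin/env python #coding:utf-8 '''Created on March 11, 2015 @author: z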
2015-03-12 15:37:48
1574
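Original: A look at 10 million username/password records
A security researcher released a file containing 10 million username and password records. The original post: Today I Am Releasing Ten Million Passwords. After downloading it, a quick check shows it really has exactly 10 million records: $ wc -l 10-million-combos.txt 10000000 10-million-combos.txt There are two columns, username and password: $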
2015-03-07 15:54:45
3180
Original: Computing PV and UV with Spark
Log field format: id,ip,url,ref,cookie,time_stamp. Put the log file on HDFS (only 1000 lines were taken): hadoop fs -put 1000_log hdfs://localhost:9000/user/root/input Compute PV: scala> val textFile = sc.textFile("hdfs://localhost:9000/user/ro
2015-01-28 14:06:06
10855
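To make the PV/UV idea in the excerpt above concrete, here is a small plain-Python sketch (not Spark, and not the post's own code). It assumes the log is comma-separated in the field order given above, that no field contains a comma, that PV is counted per url, and that UV is the number of distinct cookie values per url. Those counting rules, the function pv_uv, and the sample lines are all assumptions for illustration.

```python
from collections import Counter

# Field order taken from the log format in the excerpt.
FIELDS = ["id", "ip", "url", "ref", "cookie", "time_stamp"]


def pv_uv(lines):
    """Return (pv, uv): page views and distinct cookies, both keyed by url."""
    pv = Counter()
    visitors = {}
    for line in lines:
        parts = line.rstrip("\n").split(",")
        if len(parts) != len(FIELDS):
            continue  # skip malformed lines
        record = dict(zip(FIELDS, parts))
        pv[record["url"]] += 1
        visitors.setdefault(record["url"], set()).add(record["cookie"])
    uv = {url: len(cookies) for url, cookies in visitors.items()}
    return dict(pv), uv


# Made-up sample lines, for illustration only.
sample = [
    "1,10.0.0.1,/index,-,c1,1422424800",
    "2,10.0.0.2,/index,-,c2,1422424801",
    "3,10.0.0.1,/index,-,c1,1422424802",
    "4,10.0.0.3,/about,/index,c3,1422424803",
]
pv, uv = pv_uv(sample)
print(pv, uv)
```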
Original: Kaggle competition - Sentiment Analysis on Movie Reviews
Classify the sentiment of sentences from the Rotten Tomatoes dataset. Competition link: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews I like the IPython notebook more and more. All of the work below can be done on a single page; Firefox support, compared with Chrome,
2015-01-18 13:49:48
7400
Original: Kaggle competition - Digit Recognizer
Classify handwritten digits using the famous MNIST data. This competition is the first in a series of tutorial competitions designed to introduce people to Machine Learning. The goal in this comp
2015-01-16 12:24:47
5791
1
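Original: User-Agent analysis and a brief look at its value
The User-Agent (user agent) string is sent to the server as part of the HTTP headers when a user visits a site. It identifies the user's current environment, such as the browser and its version and the operating system. In Chrome, press F12 while visiting a site to view it. For example, the User-Agent of the Chrome I am using: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like G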
2014-12-19 20:14:34
24915
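Original: A brief look at utm_source and utm_medium in URLs
At work I need to analyze some links and compile statistics from them. Take a link like this: http://lightapplication.xxxx.com/?utm_source=ucweb&utm_medium=cpt&utm_term=zhilian&utm_content=textlink&utm_campaign=nov This link carries some parameters. I had long been curious about what they mean, and now I need to use this information. For site owners,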
2014-12-17 17:37:36
517105
6
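For the link analysis described in the excerpt above, one possible way to pull the utm_* parameters out of a URL is sketched below with the standard-library urllib.parse module. The helper name utm_params is hypothetical; the sample link is the one quoted in the excerpt.

```python
from urllib.parse import parse_qs, urlsplit


def utm_params(url):
    """Return the utm_* query parameters of url as a plain dict."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; keep the first value.
    return {key: values[0] for key, values in query.items()
            if key.startswith("utm_")}


link = ("http://lightapplication.xxxx.com/?utm_source=ucweb&utm_medium=cpt"
        "&utm_term=zhilian&utm_content=textlink&utm_campaign=nov")
params = utm_params(link)
for name, value in sorted(params.items()):
    print(name, "=", value)
```

Grouping many such dicts by utm_source or utm_campaign is then an ordinary counting problem.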
Original: Kaggle competition - Titanic: Machine Learning from Disaster
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 22
2014-11-25 19:47:00
5547
Original: Kaggle competition - Predicting a Biological Response
Predict a biological response of molecules from their chemical properties. The objective of the competition is to help us build as good a model as possible so that we can, as op
2014-11-24 17:24:00
5023
Original: LeetCode - Subsets
Given a set of distinct integers, S, return all possible subsets. Note: Elements in a subset must be in non-descending order. The solution set must not contain duplicate subsets. For exa
2014-11-23 13:24:16
2194
Original: LeetCode - Simplify Path
Given an absolute path for a file (Unix-style), simplify it. For example, path = "/home/", => "/home"; path = "/a/./b/../../c/", => "/c". Corner Cases: Did
2014-11-22 18:13:22
2048
Original: Minimum Path Sum
Given a m x n grid filled with non-negative numbers, find a path from top left to bottom right which minimizes the sum of all numbers along its path. Note: You can only move either down or right at
2014-11-22 13:42:30
1938
Original: LeetCode - Maximum Product Subarray
Find the contiguous subarray within an array (containing at least one number) which has the largest product. For example, given the array [2,3,-2,4], the contiguous subarray [2,3] has the largest
2014-11-21 18:17:03
2157
Original: LeetCode - Maximum Product Subarray
Find the contiguous subarray within an array (containing at least one number) which has the largest product. For example, given the array [2,3,-2,4], the contiguous subarray [2,3] has the largest
2014-11-21 18:13:18
1672
Original: LeetCode - Sqrt(x)
Implement int sqrt(int x). Compute and return the square root of x. Problem link: https://oj.leetcode.com/problems/sqrtx/ Solved with binary search. public int sqrt(int x) { if(x == 0 || x== 1) return x; in
2014-11-21 14:28:18
2600
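The binary-search approach named in the Sqrt(x) excerpt above can be sketched as follows. This is a Python illustration under my own assumptions, not the post's Java solution (which is cut off in the excerpt); the function name int_sqrt and the search bounds are hypothetical choices.

```python
def int_sqrt(x):
    """Return the floor of the square root of a non-negative integer x."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x < 2:  # 0 and 1 are their own square roots
        return x
    lo, hi = 1, x // 2  # for x >= 2 the root never exceeds x // 2
    while lo <= hi:
        mid = (lo + hi) // 2
        square = mid * mid
        if square == x:
            return mid
        if square < x:
            lo = mid + 1  # root lies to the right of mid
        else:
            hi = mid - 1  # root lies to the left of mid
    return hi  # largest value whose square is below x


print(int_sqrt(8), int_sqrt(16))
```

Python integers do not overflow, so mid * mid is safe here; a Java version would typically compare mid with x / mid or use long arithmetic instead.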
Original: LeetCode - Unique Paths II
Follow up for "Unique Paths": Now consider if some obstacles are added to the grids. How many unique paths would there be? An obstacle and empty space is marked as 1 and 0 respectively in the
2014-11-21 10:31:23
2020
Original: LeetCode - Unique Paths
A robot is located at the top-left corner of a m x n grid (marked 'Start' in the diagram below). The robot can only move either down or right at any point in time. The robot is trying to reach the
2014-11-20 13:48:18
2021
Original: LeetCode - Rotate List
Given a list, rotate the list to the right by k places, where k is non-negative. For example: Given 1->2->3->4->5->NULL and k = 2, return 4->5->1->2->3->NULL. Problem link: https://oj.leetcode.com/proble
2014-11-20 12:54:45
2063
Original: LeetCode - Permutation Sequence
The set [1,2,3,…,n] contains a total of n! unique permutations. By listing and labeling all of the permutations in order, we get the following sequence (ie, for n = 3): "123" "132" "213" "231" "3
2014-11-19 17:06:49
2248
Original: LeetCode - Spiral Matrix II
Given an integer n, generate a square matrix filled with elements from 1 to n^2 in spiral order. For example, given n = 3, you should return the following matrix: [ [ 1, 2, 3 ], [ 8, 9, 4 ], [
2014-11-19 14:43:45
2095
Original: LeetCode - Jump Game
Given an array of non-negative integers, you are initially positioned at the first index of the array. Each element in the array represents your maximum jump length at that position. Determine i
2014-11-19 12:31:47
2209
Original: LeetCode - Spiral Matrix
Given a matrix of m x n elements (m rows, n columns), return all elements of the matrix in spiral order. For example, given the following matrix: [ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] ]
2014-11-19 11:11:32
2083
Original: LeetCode - Pow(x, n)
Implement pow(x, n). Problem link: https://oj.leetcode.com/problems/powx-n/ public double pow(double x, int n) { if(n== 0) return 1; if(n == 1) return x; if(n % 2 ==0) return pow(x*x,n/2);
2014-11-18 22:14:44
2323
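The recursion visible in the Pow(x, n) excerpt above squares the base and halves the exponent. A possible Python sketch of that idea follows. Only the base cases and the even branch appear in the excerpt; the handling of odd and negative exponents here is my own assumption, and the name fast_pow is hypothetical.

```python
def fast_pow(x, n):
    """Compute x ** n for an integer n using O(log |n|) multiplications."""
    if n < 0:
        # Assumed handling of negative exponents; raises ZeroDivisionError for x == 0.
        return 1.0 / fast_pow(x, -n)
    if n == 0:
        return 1.0
    if n == 1:
        return x
    half = fast_pow(x * x, n // 2)  # (x^2)^(n // 2)
    # For odd n one extra factor of x is needed.
    return half if n % 2 == 0 else half * x


print(fast_pow(2.0, 10), fast_pow(2.0, -2))
```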
Original: LeetCode - Anagrams
Given an array of strings, return all groups of strings that are anagrams. Note: All inputs will be in lower-case. Problem link: https://oj.leetcode.com/problems/anagrams/ The English word for the letter-rearrangement word game is anagram; the word comes from
2014-11-18 21:44:33
2207
Original: LeetCode - Multiply Strings
Given two numbers represented as strings, return multiplication of the numbers as a string. Note: The numbers can be arbitrarily large and are non-negative. Problem link: https://oj.leetcode.com/problems
2014-11-18 17:06:03
2024
Original: LeetCode - Combination Sum II
Given a collection of candidate numbers (C) and a target number (T), find all unique combinations in C where the candidate numbers sums to T. Each number in C may only be used once in the combina
2014-11-18 15:13:12
1908
Original: LeetCode - Combination Sum
Given a set of candidate numbers (C) and a target number (T), find all unique combinations in C where the candidate numbers sums to T. The same repeated number may be chosen from C unlimited numb
2014-11-14 23:54:07
1730
Original: LeetCode - Min Stack
Design a stack that supports push, pop, top, and retrieving the minimum element in constant time. push(x) -- Push element x onto stack. pop() -- Removes the element on top of the stack. top() -- Get
2014-11-13 19:47:17
2076
Original: LeetCode - Valid Number
Validate if a given string is numeric. Some examples: "0" => true; " 0.1 " => true; "abc" => false; "1 a" => false; "2e10" => true. Note: It is intended for the problem statement to be ambiguo
2014-11-12 19:07:45
1959
Original: LeetCode - Count and Say
The count-and-say sequence is the sequence of integers beginning as follows: 1, 11, 21, 1211, 111221, ... 1 is read off as "one 1" or 11. 11 is read off as "two 1s" or 21. 21 is read off as
2014-11-12 18:25:13
1914
Original: LeetCode - Valid Sudoku
Determine if a Sudoku is valid, according to: Sudoku Puzzles - The Rules. The Sudoku board could be partially filled, where empty cells are filled with the character '.'. A partially fille
2014-11-12 16:33:49
2471
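Original: Using a Hive user-defined function - parsing the useragent
I wanted to analyze operating system, browser, and version usage from log data, but Hive's functions cannot parse the useragent directly, so a UDF can be written to do the parsing. The useragent describes the user's current operating system and browser version, in a form such as: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 S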
2014-10-30 16:56:34
6922
2
Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data
2015-02-09
Natural language processing corpus
2014-06-20
Scheduled scraper for job postings on the Shuimu Tsinghua community, deployed on Sina Cloud
2014-06-02