Key Points from "Introduction to Data Science"

Week 1  Introduction

----------------------------------------------------------

Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.


Three types of tasks in a data science project:

         Preparing to run a model (80% of the work)

         Running the model

         Communicating the results (the other 80% of the work)


Science is about asking questions

         Traditionally: query the world

         eScience: download the world


Ways to do science

         Empirical (for thousands of years)

         Theoretical (in the last few hundred years, reinforcing empirical methods)

         Computational (in the last 50 years or so, simulating phenomena that cannot be observed directly, and theoretical models that are too complex to solve analytically)

         eScience (= Data Science) (in the last ten years or so, exploring massive data)


What's Big Data

       Big data is any data that is expensive to manage and hard to extract value from


Big Data: Three Challenges

       Volume (the size of the data)

       Velocity (the latency of data processing relative to the growing demand for interactivity)

       Variety (the diversity of sources, formats, quality, structures)


Week  2   Relational Databases, Relational Algebra

-----------------------------------------------------------------------------------

What is a Data Model

            three components:

                      1. structures

                      2. constraints

                      3. operations


Database:

          Physical Data Independence (programs see just tables and an algebra, independent of physical storage)

                      select, project, cross-product, join

                       SQL is a declarative language, about "what, not how"

                      Algebraic Optimization

           Logical Data Independence

                        view: a query with a name
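The four operators and the idea of a view can be sketched with Python's built-in sqlite3 module; the student/enrolled tables and their contents are made-up examples, just to make select, project, join, and "a query with a name" concrete:

```python
import sqlite3

# In-memory toy database; the student/enrolled schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student(id INTEGER, name TEXT);
    CREATE TABLE enrolled(student_id INTEGER, course TEXT);
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO enrolled VALUES (1, 'CS101'), (2, 'CS102');
""")

# select: keep only the rows that satisfy a condition
rows = conn.execute("SELECT * FROM student WHERE id = 1").fetchall()

# project: keep only some columns
names = conn.execute("SELECT name FROM student ORDER BY name").fetchall()

# join: combine two tables on a shared attribute
joined = conn.execute("""
    SELECT s.name, e.course
    FROM student s JOIN enrolled e ON s.id = e.student_id
    ORDER BY s.name
""").fetchall()

# a view is "a query with a name": afterwards it can be queried like a table
conn.execute("""
    CREATE VIEW roster AS
    SELECT s.name, e.course
    FROM student s JOIN enrolled e ON s.id = e.student_id
""")
roster = conn.execute("SELECT * FROM roster ORDER BY name").fetchall()
```

Because SQL is declarative, the same queries could run unchanged even if the engine switched storage layouts or added an index — that is physical data independence in action.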


A database can exploit indexes, and it is guaranteed to complete an operation no matter how large the data is.


Week 3    MapReduce

------------------------------------------------------------------------------------------

Scalable 

          Operationally:

                     Scale up: works even if data doesn't fit in main memory

                     Scale out: can make use of 1000s of cheap computers

           Algorithmically:

                     the complexity should be polynomial, parallelizable polynomial, or n log(n)


Parallel Architectures

           Shared nothing

           Shared disk

           Shared memory


Two notions of parallel query processing

           distributed query

                       rewrite the query as a union of subqueries; the results are combined at the end (the final combining step is a bottleneck)

           parallel query (Teradata, parallel database)

                       each operator is implemented with a parallel algorithm (in the MapReduce fashion)
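The MapReduce fashion itself can be sketched in plain Python, with the classic word-count job standing in for a parallel operator; the two documents are made up, and the shuffle step here is what Hadoop does between the map and reduce phases:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Mapper: emit a (word, 1) pair for every word in one document.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reducer: sum the counts for one word.
    return (word, sum(counts))

docs = {1: "big data big ideas", 2: "data science"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
# result == {'big': 2, 'data': 2, 'ideas': 1, 'science': 1}
```

Mappers run independently on partitions of the input and reducers run independently per key, which is why the model scales out to thousands of cheap machines.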


Pig (Yahoo)

          Relational Algebra over Hadoop

Hive (Facebook)

          SQL over Hadoop

Both are declarative query languages that support schemas and algebraic optimization.


Hadoop vs. RDBMS

            loading data: Hadoop is faster (Hadoop just needs to partition the data; databases need extra effort)

            execution: an RDBMS is faster (because of indexes)


Week 4     NoSQL

----------------------------------------------------------------------


NoSQL is mainly used to build very large, scalable web applications.

Social Network application (when does a friend's status update become visible?)

            database: you see all of the updates or none of them

                       two-phase commit

                                prepare to be ready: usually write to a log

                                 commit: if all subordinates are ready

                        if only one coordinator is used: single point of failure

                        distributed protocol for committing: Paxos   

            MongoDB

                        eventual consistency through vector clocks
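Vector clocks can be sketched in a few lines of Python; the node names "A" and "B" and the dict representation are illustrative, but the increment/merge/compare rules are the standard ones used to detect conflicting replica versions:

```python
def vc_increment(clock, node):
    # A vector clock maps node id -> logical time; bump this node's entry.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_merge(a, b):
    # When reconciling replicas, take the element-wise maximum.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_happened_before(a, b):
    # a happened before b iff a <= b element-wise and a != b;
    # if neither ordering holds, the two versions are concurrent.
    keys = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in keys) and a != b

# Two replicas accept writes independently, then reconcile.
c1 = vc_increment({}, "A")                       # {'A': 1}
c2 = vc_increment({}, "B")                       # {'B': 1}
concurrent = (not vc_happened_before(c1, c2)
              and not vc_happened_before(c2, c1))  # conflict detected
merged = vc_merge(c1, c2)                        # {'A': 1, 'B': 1}
```

A store that is only eventually consistent uses this comparison to tell "one write supersedes the other" apart from "two writes conflict and must be resolved".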


CAP Theorem

        under a network partition, a system must sacrifice Consistency or Availability (you can only guarantee two of the three)


NoSQL features

          lookup, read, write 1 or few records over many servers  (high scale)

          able to replicate and partition data  (high scale)

          no sql  (no sql)

          weaker concurrency model than ACID (Atomicity, Consistency, Isolation, Durability) transactions (no transactions)

          dynamically add new attributes to records (no schema)


Category for data models

           document = nested values, extensible records (XML, JSON)

            extensible record (HBase / BigTable)

            key-value object (memcache)


Consistent hashing (Memcached: no persistence, no replication, no fault tolerance)

           map server IDs and key values into the same space


Schema-on-read instead of schema-on-write (Pig)
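The distinction can be shown in a few lines; the log lines and the (date, user, action) schema below are invented for illustration. Nothing is enforced when the raw lines are stored; the structure is imposed only at read time:

```python
# Raw lines stored as-is: no schema enforced at write time.
raw = [
    "2024-01-01,alice,login",
    "2024-01-01,bob,purchase",
]

def parse(line):
    # Schema-on-read: the (date, user, action) structure is imposed
    # only when the data is read, not when it was written.
    date, user, action = line.split(",")
    return {"date": date, "user": user, "action": action}

records = [parse(line) for line in raw]
logins = [r["user"] for r in records if r["action"] == "login"]
# logins == ['alice']
```

A relational database does the opposite: the schema is checked when data is loaded (schema-on-write), which is exactly the extra loading effort noted in the Hadoop vs. RDBMS comparison above.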


When the data is too big, you cannot bring the data to the computation; you have to bring the computation to the data.


Three Special Joins:

           Replicated Join

           Skewed Join

           Merge Join
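Of these, the replicated join is the easiest to sketch: when one table is small enough to fit in memory, it is broadcast to every mapper and the join happens locally, with no shuffle. The tables below are made up; each mapper would run this loop over its own partition of the big table:

```python
# Replicated (broadcast, map-side) join: the small table fits in memory,
# so every mapper gets a full copy and joins its partition locally.
small = {"CS101": "Databases", "CS102": "Statistics"}  # broadcast table

def mapper(record, broadcast):
    # Join one record of the big table against the in-memory small table.
    student, course_id = record
    return (student, broadcast.get(course_id))

big = [("Ada", "CS101"), ("Bob", "CS102"), ("Eve", "CS101")]
joined = [mapper(r, small) for r in big]
# joined == [('Ada', 'Databases'), ('Bob', 'Statistics'), ('Eve', 'Databases')]
```

The skewed and merge joins address the other two cases: keys whose groups are too big for one reducer, and inputs that are already sorted and partitioned the same way.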


NoSQL Features:

            No Schema

            No Language

            No Transactions


          

           


                      

                               

