Key Points from "Introduction to Data Science"

Week 1  Introduction

----------------------------------------------------------

Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.


Three types of tasks in a data science project:

         Preparing to run a model (80% of the work)

         Running the model

         Communicating the results (the other 80% of the work)


Science is about asking questions

         Traditionally: query the world

         eScience: download the world


Ways to do science

         Empirical (for thousands of years)

         Theoretical (in the last few hundred years, reinforcing empirical methods)

         Computational (in the last 50 years or so, simulating phenomena that cannot be observed directly, and theoretical models that are too complex to solve analytically)

         eScience (= Data Science) (in the last ten years or so, exploring massive data)


What's Big Data

       Big data is any data that is expensive to manage and hard to extract value from


Big Data: Three Challenges

       Volume (the size of the data)

       Velocity (the latency of data processing relative to the growing demand for interactivity)

       Variety (the diversity of sources, formats, quality, structures)


Week  2   Relational Databases, Relational Algebra

-----------------------------------------------------------------------------------

What is a Data Model

            three components:

                      1. structures

                      2. constraints

                      3. operations


Database:

          Physical Data Independence (programs see just tables and an algebra, independent of physical storage)

                      select, project, cross-product, join

                       SQL is a declarative language, about "what, not how"

                      Algebraic Optimization

           Logical Data Independence

                        view: a query with a name
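The four operators and the idea of a view can be sketched with Python's built-in sqlite3 module; the student/enrolled tables and their contents are made-up examples, just to make select, project, join, and "a query with a name" concrete:

```python
import sqlite3

# In-memory toy database; the student/enrolled schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student(id INTEGER, name TEXT);
    CREATE TABLE enrolled(student_id INTEGER, course TEXT);
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO enrolled VALUES (1, 'CS101'), (2, 'CS102');
""")

# select: keep only the rows that satisfy a condition
rows = conn.execute("SELECT * FROM student WHERE id = 1").fetchall()

# project: keep only some columns
names = conn.execute("SELECT name FROM student ORDER BY name").fetchall()

# join: combine two tables on a shared attribute
joined = conn.execute("""
    SELECT s.name, e.course
    FROM student s JOIN enrolled e ON s.id = e.student_id
    ORDER BY s.name
""").fetchall()

# a view is "a query with a name": afterwards it can be queried like a table
conn.execute("""
    CREATE VIEW roster AS
    SELECT s.name, e.course
    FROM student s JOIN enrolled e ON s.id = e.student_id
""")
roster = conn.execute("SELECT * FROM roster ORDER BY name").fetchall()
```

Because SQL is declarative, the same queries could run unchanged even if the engine switched storage layouts or added an index — that is physical data independence in action.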


A database can exploit indexes, and it is guaranteed to complete an operation no matter how large the data is.


Week 3    MapReduce

------------------------------------------------------------------------------------------

Scalable 

          Operationally:

                     Scale up: works even if data doesn't fit in main memory

                     Scale out: can make use of 1000s of cheap computers

           Algorithmically:

                     the complexity should be polynomial, parallelizable polynomial, or n log(n)


Parallel Architectures

           Shared nothing

           Shared disk

           Shared memory


Two notions of parallel query processing

           distributed query

                       rewrite the query as a union of subqueries; the results are combined at the end (the final combining step is a bottleneck)

           parallel query (Teradata, parallel database)

                       each operator is implemented with a parallel algorithm (in the MapReduce fashion)
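The MapReduce fashion itself can be sketched in plain Python, with the classic word-count job standing in for a parallel operator; the two documents are made up, and the shuffle step here is what Hadoop does between the map and reduce phases:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Mapper: emit a (word, 1) pair for every word in one document.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reducer: sum the counts for one word.
    return (word, sum(counts))

docs = {1: "big data big ideas", 2: "data science"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
# result == {'big': 2, 'data': 2, 'ideas': 1, 'science': 1}
```

Mappers run independently on partitions of the input and reducers run independently per key, which is why the model scales out to thousands of cheap machines.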


Pig (Yahoo)

          Relational Algebra over Hadoop

Hive (Facebook)

          SQL over Hadoop

Both are declarative query languages that support schemas and algebraic optimization.


Hadoop vs. RDBMS

            loading data: Hadoop is faster (Hadoop just needs to partition the data; databases need extra effort)

            execution: an RDBMS is faster (because of indexes)


Week 4     NoSQL

----------------------------------------------------------------------


NoSQL is mainly used to build very large, scalable web applications.

Social Network application (when does a friend's status update become visible?)

            database: you see all of the updates or none of them

                       two-phase commit

                                prepare to be ready: usually write to a log

                                 commit: if all subordinates are ready

                        if only one coordinator is used: single point of failure

                        distributed protocol for committing: Paxos   

            MongoDB

                        eventual consistency through vector clocks
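Vector clocks can be sketched in a few lines of Python; the node names "A" and "B" and the dict representation are illustrative, but the increment/merge/compare rules are the standard ones used to detect conflicting replica versions:

```python
def vc_increment(clock, node):
    # A vector clock maps node id -> logical time; bump this node's entry.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_merge(a, b):
    # When reconciling replicas, take the element-wise maximum.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_happened_before(a, b):
    # a happened before b iff a <= b element-wise and a != b;
    # if neither ordering holds, the two versions are concurrent.
    keys = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in keys) and a != b

# Two replicas accept writes independently, then reconcile.
c1 = vc_increment({}, "A")                       # {'A': 1}
c2 = vc_increment({}, "B")                       # {'B': 1}
concurrent = (not vc_happened_before(c1, c2)
              and not vc_happened_before(c2, c1))  # conflict detected
merged = vc_merge(c1, c2)                        # {'A': 1, 'B': 1}
```

A store that is only eventually consistent uses this comparison to tell "one write supersedes the other" apart from "two writes conflict and must be resolved".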


CAP Theorem

        under a network partition, a system must sacrifice Consistency or Availability (you can only guarantee two of the three)


NoSQL features

          lookup, read, write 1 or few records over many servers  (high scale)

          able to replicate and partition data  (high scale)

          no sql  (no sql)

          weaker concurrency model than ACID (Atomicity, Consistency, Isolation, Durability) transactions (no transactions)

          dynamically add new attributes to records (no schema)


Category for data models

           document = nested values, extensible records (XML, JSON)

            extensible record (HBase / BigTable)

            key-value object (memcache)


Consistent hashing (Memcached: no persistence, no replication, no fault tolerance)

           map server IDs and key values into the same space


Schema-on-read instead of schema-on-write (Pig)
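The distinction can be shown in a few lines; the log lines and the (date, user, action) schema below are invented for illustration. Nothing is enforced when the raw lines are stored; the structure is imposed only at read time:

```python
# Raw lines stored as-is: no schema enforced at write time.
raw = [
    "2024-01-01,alice,login",
    "2024-01-01,bob,purchase",
]

def parse(line):
    # Schema-on-read: the (date, user, action) structure is imposed
    # only when the data is read, not when it was written.
    date, user, action = line.split(",")
    return {"date": date, "user": user, "action": action}

records = [parse(line) for line in raw]
logins = [r["user"] for r in records if r["action"] == "login"]
# logins == ['alice']
```

A relational database does the opposite: the schema is checked when data is loaded (schema-on-write), which is exactly the extra loading effort noted in the Hadoop vs. RDBMS comparison above.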


When the data is too big, you cannot bring the data to the computation; you have to bring the computation to the data.


Three Special Joins:

           Replicated Join

           Skewed Join

           Merge Join
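Of these, the replicated join is the easiest to sketch: when one table is small enough to fit in memory, it is broadcast to every mapper and the join happens locally, with no shuffle. The tables below are made up; each mapper would run this loop over its own partition of the big table:

```python
# Replicated (broadcast, map-side) join: the small table fits in memory,
# so every mapper gets a full copy and joins its partition locally.
small = {"CS101": "Databases", "CS102": "Statistics"}  # broadcast table

def mapper(record, broadcast):
    # Join one record of the big table against the in-memory small table.
    student, course_id = record
    return (student, broadcast.get(course_id))

big = [("Ada", "CS101"), ("Bob", "CS102"), ("Eve", "CS101")]
joined = [mapper(r, small) for r in big]
# joined == [('Ada', 'Databases'), ('Bob', 'Statistics'), ('Eve', 'Databases')]
```

The skewed and merge joins address the other two cases: keys whose groups are too big for one reducer, and inputs that are already sorted and partitioned the same way.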


NoSQL Features:

            No Schema

            No Language

            No Transactions


          

           


                      

                               

