can you explain this in chinese as much as you can: Hello everyone. Today I'm going
to give you all an example of data,
so remember we're talking about data understanding.
When you have a dataset, what do we do?
You want to be able to describe the dataset
and analyze the dataset
so you understand what data you're working with.
I'm going to use one example,
so this is about air quality sensing.
We worked on a project.
This is basically just for mobile air quality sensing.
The motivation is that there are
actually air quality monitoring stations in most areas,
but those are limited.
For example, you usually get a dozen in an urban area.
It's very limited, so if you consider us individually,
we go to different places,
we go to different rooms.
I care more about my neighboring area,
so what air quality I'm
subject to, or what air pollution I'm subject to.
In this particular project,
what we did was that we developed a sensing device,
and that would then be coupled with
your mobile phone so that we
can collect various air pollutant information as it gets used,
send it to the mobile phone,
and then send it to the cloud server
so you can do further analysis.
So from the data side,
we say this is the air quality data,
air quality sensing.
That's the dataset, it's air quality sensing data.
But what are the things I can describe?
If, say, I'm
telling you, like, I have this dataset,
what would be the immediate questions you want to ask?
The starting point, of course,
and this is
not in any particular order, of course,
is the size.
But there are different ways to ask about it.
Size is about, okay,
how many sensors per device,
also how many users,
and also how long?
So that would be the duration of the study.
Also you can ask about the geographic area.
To answer this particular question,
so for example, like we have,
I think on every device you will have sensors for CO,
CO2, NOx,
and then ozone,
and also we have temperature,
humidity, and those others.
You want to be able to specify
the specific types of sensors,
and then users, since those devices are mobile devices,
they're being carried by users.
In our study, we had
20-30 different users,
so they're carrying those devices.
We did that over about half a year,
so we actually have data coming from those users,
from those sensing devices, for that time duration.
We also did some event-specific
monitoring: if there
are certain events happening in certain areas,
we actually did a few more narrowly
focused data collections there.
For the area, of course, you'll ask,
where are you talking about? Say, I'm in Boulder.
We actually did that mostly in Boulder, but we also
had a couple of students take the devices.
They were riding the bus to
Denver and also to the airport,
and also to a shopping mall.
That's basically just understanding
what data is being collected,
and you can also ask a little bit more
about how the data were collected.
That's generally the size or the scale.
Now of course, when you get to the specifics,
what we'll say about your data understanding
is, what's the norm,
what's the potential distribution, the dispersion,
and also any extreme values,
so this really gets to some of the statistical part.
For the statistics, you
can basically look at each of the individual readings.
You can say that if I take all the CO2 readings,
I can then have some way of calculating the average,
the mean value, the standard deviation.
I can plot the overall distribution
and I can actually even show you
what the normal ranges are, whether they look okay
and are generally below the threshold,
based on the Environmental Protection Agency,
or whether you actually see some extremely high values.
That's with CO2.
We are concerned if we are above the safety threshold.
You're also looking at higher values that are extreme,
so those can of course be easily plotted.
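
As a rough illustration of the statistics just described, here is a minimal sketch using only Python's standard library. The readings, the variable names, and the `THRESHOLD_PPM` value are all made up for this example; the threshold in particular is an assumption and not an official limit.

```python
import statistics

# Hypothetical CO2 readings in ppm; the values are invented for illustration.
co2_ppm = [412, 430, 455, 448, 520, 610, 495, 1850, 470, 440]

# Assumed threshold for illustration only; not an official limit.
THRESHOLD_PPM = 1000

mean = statistics.mean(co2_ppm)        # the norm (central tendency)
median = statistics.median(co2_ppm)    # less sensitive to extreme values
stdev = statistics.stdev(co2_ppm)      # dispersion (sample standard deviation)
above = [x for x in co2_ppm if x > THRESHOLD_PPM]  # extreme readings

print(f"mean={mean:.1f} median={median} stdev={stdev:.1f}")
print(f"min={min(co2_ppm)} max={max(co2_ppm)}")
print(f"readings above threshold: {above}")
```

Comparing the mean with the median is a quick way to see how much a single extreme reading pulls the average.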
We'll talk about a few visualization capabilities
that would allow you to plot.
A simple box plot
is actually very helpful in this case,
where you can quickly show the distribution.
Also, it's actually very helpful when you want to compare.
For example, I have
different users or different locations,
or different time periods.
That actually would allow you to
break apart your data and then be able
to compare the statistics across different dimensions.
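
One way to sketch this kind of comparison is to split the readings by user and compute, for each group, the five numbers a box plot draws. The records and user names below are hypothetical, and the quantile method is just one common choice.

```python
import statistics
from collections import defaultdict

# Hypothetical (user, CO2 ppm) records; invented for illustration.
records = [
    ("user_a", 410), ("user_a", 450), ("user_a", 480),
    ("user_a", 520), ("user_a", 700),
    ("user_b", 600), ("user_b", 640), ("user_b", 900),
    ("user_b", 1200), ("user_b", 1500),
]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot shows."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Break the data apart along one dimension (here, the user).
by_user = defaultdict(list)
for user, ppm in records:
    by_user[user].append(ppm)

summaries = {user: five_number_summary(vals) for user, vals in by_user.items()}
for user, summary in summaries.items():
    print(user, summary)
```

The same grouping could be done by location or by time period instead of by user; only the key in `records` changes.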
That's what kind of data I have and what kind of
statistics I can leverage.
Then we cover visualization.
I'll put it in there. Think about
how we can visualize your data,
so it can allow you to have some good insights.
For some further analysis,
you can then plot it, visualization as well.
You can do individual distributions, or you can do
some spatial-temporal plotting or some comparative study.
That's actually all very useful.
I also want to touch a little bit on
the similarity part,
or distance, because that's important.
As we have said in the class,
when you have different objects,
those objects are defined by their attributes.
Then you want to have some way of
identifying similarity or dissimilarity.
In this case, where I have
the air quality sensor readings,
one natural question when
you're talking about similarity is,
do they look similar, or are there significant differences?
Since I have multiple sensors, well,
remember we were talking about
the different attribute types, and then the different,
like whether you have
binary values or whether you have
other information you can leverage.
If you look at my sensors,
so if I say my CO, CO2,
NOx, and then like ozone,
and then you may have PM2.5.
As you can see, these are all numerical values,
so you have this time series of those numerical values.
But on top of that, you could have,
say, maybe different locations.
Then in the same way I can also talk
about temperature or humidity.
I will say actually we also add not just the location,
there's also more like the activity.
This is actually related to your life.
For example, if I'm riding a bus,
like biking, riding the bus,
or I'm out running outdoors,
or I'm just sleeping.
As we can see, there are different types of information,
so when we're talking about similarity,
for the numerical values you have
some way of defining the closeness.
Or, since in essence we care more
about when they're above a certain threshold,
you can quantify this as not just the actual value;
you can just maybe say 10% or higher above the threshold,
20% or higher, or how often,
so you can actually convert
that into some kind of frequency.
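
The idea of turning raw readings into threshold-based frequencies could be sketched like this. The function name `exceedance_profile`, the readings, and the threshold of 1000 are all assumptions for illustration.

```python
def exceedance_profile(readings, threshold):
    """Fraction of readings above the threshold, and at least 10% / 20% above it."""
    n = len(readings)
    return {
        "above": sum(r > threshold for r in readings) / n,
        "above_10pct": sum(r >= 1.10 * threshold for r in readings) / n,
        "above_20pct": sum(r >= 1.20 * threshold for r in readings) / n,
    }

# Hypothetical readings and threshold; invented for illustration.
readings = [800, 950, 1050, 1150, 1300, 900, 700, 1250]
profile = exceedance_profile(readings, threshold=1000)
print(profile)
```

Two users or two time periods can then be compared by their profiles rather than by raw values, which reflects how often and how far each one goes over the threshold.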
Then if you have the different activities,
then you can actually group by that.
Remember we were talking about, if it's binary,
the binary encoding;
if it's a categorical value, you can
convert the data into some binary values. Then, okay,
you have to match exactly,
because I only care about the same type of
activity, because riding the bus is
public and would be very different from sleeping.
That's all the information
you need to consider as you're
trying to decide on
your similarity function or distance calculations.
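
For the activity attribute, a minimal sketch of the binary encoding and exact matching might look like the following. The category list and function names are hypothetical.

```python
# Hypothetical activity categories; invented for illustration.
ACTIVITIES = ["bus", "biking", "running", "sleeping"]

def one_hot(label, categories=ACTIVITIES):
    """Encode a categorical label as one binary value per category."""
    return [1 if label == c else 0 for c in categories]

def activity_dissimilarity(a, b):
    """0 if the two activities match exactly, 1 otherwise."""
    return 0 if a == b else 1

print(one_hot("bus"))                               # binary encoding of "bus"
print(activity_dissimilarity("bus", "bus"))         # same activity
print(activity_dissimilarity("bus", "sleeping"))    # different activities
```

With exact matching there is no notion of one activity being "closer" to another; two records are either comparable on activity or they are not.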
Another important thing we talked about is,
remember, the weighted calculation.
The weighted distance calculation,
that is actually a very important
angle because you want to make
sure that as you're putting all this together,
you recognize those attributes are not equal.
You want to be able to tailor
your similarity calculation or the distance calculation by
identifying, say, that air pollutant data have
a more severe negative impact
when they are above a certain threshold.
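
A weighted distance can be sketched as below, here using a weighted Euclidean distance as one possible choice. The attribute order, the normalized values, and the weights are all assumptions; in practice the weights would come from what matters in your application.

```python
import math

def weighted_euclidean(x, y, weights):
    """Euclidean distance where each attribute's squared difference is scaled by its weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

# Hypothetical records: [CO, ozone, temperature], each already scaled to 0..1.
record_1 = [0.20, 0.90, 0.50]
record_2 = [0.25, 0.40, 0.70]

# Assumed weights: pollutants count more than temperature.
weights = [3.0, 3.0, 1.0]

print(weighted_euclidean(record_1, record_2, weights))
print(weighted_euclidean(record_1, record_2, [1.0, 1.0, 1.0]))  # unweighted, for comparison
```

Scaling the attributes to a common range first keeps an attribute with large raw values from dominating before the weights are even applied.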
As you can see,
even though there are standard
distance calculation or
similarity calculation functions,
you want to be able to understand your problem setting,
understand the semantics of your data.
Then you have a specific calculation, a
similarity function that is meaningful for
your dataset and for your application scenario.
Of course this is just one example,
but hopefully that gives you an idea about,
if you are being presented with a particular dataset,
what kind of questions you ask, or what are
the specific things you can
do in terms of analyzing the data,
understanding the potential semantics,
the challenges or issues,
or the knowledge you can learn on top of it.
In particular, the similarity part we talk about
because it's hard to say,
just use this function and you'll be fine.
Many times you need to do the reasoning about,
okay, what do I have?
What [inaudible] do I have and what
do I care about for my application?
Then I have my calculations
that are more applicable
for my intended usage scenarios.
That's all. Thank you.