What two types of sensors do the researchers differentiate?
on body sensors
off body sensors
What is usually done to improve indoor occupant sensing?
fuse on and off body sensing
as multimodal signals can provide complementary information for the same target
and thus achieve robust information inference
In what cases is it necessary to associate signals from different modalities?
multiple occupants
large IoT systems that capture sensor data from more than one person at once -> requires association
For what use cases is cross-modal association of sensor data useful?
user signal segment annotation
use on body device for associating data with user (no manual effort as device already associated)
enhancing multimodal learning efficiency
What are some challenges with cross-modal signal segment association?
indirect sensing leads to a lack of directly comparable information
indirect -> e.g. structural vibration
raw data often not really interpretable and thus not easily compared for shared context
complementary leads to disassociation
multiple modalities -> use complementary signals to gain more information overall and achieve more efficient modeling
because they are complementary -> they capture different types of information and thus share little information
mobility variance leads to spatiotemporal variation
e.g. people moving around house -> association of on body sensor with off body sensor may vary (which one to associate) over time due to movement
What is the goal of the researchers?
determine whether two signal segments (from different sensors, (on-body with n off-body)) over the same period of time are associated (capture the same person)
What metric do the researchers introduce to compute whether two signal segments are associated?
association probability
What is the intuition behind the association probability?
as long as sensors capture same physical activity -> there will be implicit shared context between two signal segments
assume that for structural vibration signals that are segmented as one activity (e.g. 5 seconds) -> there will be only a single wearable sensor associated
What type of NN do the researchers use?
temporal convolutional neural network
What are the contributions of the researchers?
CMA -> cross modal sensing signals segment level association scheme for multimodal IoT systems
AD-TCN -> framework for learning segment-level cross-modal representations and using the learned model parameters to calculate the amount of shared context between modalities
What areas does the introduced related work cover?
cross modal IoT device identification
IoT for occupant identification
What related work did they present in cross modal iot device identification? What are downsides?
shared 3D motion of human body parts captured by camera and IMU sensors for IoT device identification
activity start/end time for fingerprinting co-located device pairing
=> have explicit context that is straightforward to see (e.g. rule based…)
=> TCN to discover limited association information that is a result of indirect sensing
use walking pattern / gait
voice
human body reflection
refraction, diffraction, absorption of radio signals
=> require labeled data to train
=> argument: difficult and impractical to assume availability of labeled data for each deployment…
=> use our approach to automatically label the data…
What is CMA written out?
cross modal signal segment association scheme
What modules make up CMA?
signal alignment and event detection / segmentation
training of AD-TCN and output of association score layer
calculation of pairwise association probability using CMA
What is AD-TCN written out?
association discovery temporal convolutional neural network
Why do we need to align signals?
model requires input of same dimensionality
sensors have different sampling rate
synchronize timestamps
How is the sampling rate adjusted?
select lowest sampling rate of all available sensors
resample all other sensor inputs
by calculating the least common multiple of the sampling rates
then upsample the higher sample rate signal to the least common multiple
apply low pass
downsample to the lowest sampling rate
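The resampling steps can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the windowed-sinc filter, the tap count, integer sampling rates in Hz, and the names `lowpass_fir` and `resample_via_lcm` are all assumptions made for this example.

```python
import math


def lowpass_fir(cutoff, num_taps=101):
    """Windowed-sinc low-pass FIR (Hamming window, an assumed design).

    cutoff is a fraction of the sampling rate (0 < cutoff < 0.5).
    """
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        x = i - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def resample_via_lcm(signal, fs_in, fs_out, num_taps=101):
    """Resample from fs_in to fs_out (integer Hz) via their least common multiple."""
    lcm = fs_in * fs_out // math.gcd(fs_in, fs_out)
    up, down = lcm // fs_in, lcm // fs_out

    # 1) Upsample to the LCM rate by zero-stuffing (scaled by `up` to keep amplitude).
    upsampled = [0.0] * (len(signal) * up)
    for i, value in enumerate(signal):
        upsampled[i * up] = value * up

    # 2) Low-pass below the Nyquist frequency of the slower of the two rates.
    taps = lowpass_fir(0.5 * min(fs_in, fs_out) / lcm, num_taps)
    half = num_taps // 2

    # 3) Downsample: evaluate the filter only at every `down`-th sample.
    out = []
    for n in range(0, len(upsampled), down):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = n + half - k
            if 0 <= j < len(upsampled):
                acc += tap * upsampled[j]
        out.append(acc)
    return out


# One second of a constant signal at 100 Hz, resampled to 40 Hz.
resampled = resample_via_lcm([1.0] * 100, 100, 40)
```

Values near the start and end of the output are attenuated because the filter runs past the signal boundaries; a real pipeline would pad or trim there.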
How is event detection performed?
sliding window over vibration signal and calculate energy of each window
model out noise (gaussian)
apply threshold to energy
if above -> consider as event
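A rough sketch of this energy-threshold detection follows. It assumes a noise-only reference recording is available, models the noise window energies as Gaussian, and sets the threshold at mean plus `k` standard deviations. The value `k=3`, the non-overlapping windows, and the function names are illustrative assumptions, not details taken from the paper.

```python
import statistics


def window_energies(signal, win, hop):
    """Energy (sum of squares) of each sliding window."""
    return [
        sum(x * x for x in signal[i:i + win])
        for i in range(0, len(signal) - win + 1, hop)
    ]


def detect_events(signal, noise, win=4, hop=4, k=3.0):
    """Flag windows whose energy exceeds a Gaussian noise threshold."""
    noise_energy = window_energies(noise, win, hop)
    mu = statistics.mean(noise_energy)
    sigma = statistics.pstdev(noise_energy)
    threshold = mu + k * sigma  # assumed rule: mean + k * std of noise energy
    return [energy > threshold for energy in window_energies(signal, win, hop)]


# Hypothetical data: quiet reference noise and a signal with one strong burst.
noise = [0.1, -0.1, 0.1, -0.1, 0.2, -0.2, 0.2, -0.2] * 3
signal = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0, 0.2, -0.2, 0.2, -0.2]
flags = detect_events(signal, noise)  # only the middle window is an event
```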
After detecting event windows in the vibration sensor data, how do the researchers proceed to segment the sensor data?
activity segmentation
interval-based lumping method
look at consecutive event windows
and lump consecutive events that are separated by less than a delta into one activity
What boundaries are held for event segmentation?
lower and upper
if shorter than lower -> discard
if longer than upper -> divide and, if necessary, discard the shorter part
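One possible reading of the lumping and boundary rules, as a small sketch. Event windows are represented only by their start times, segment length is measured from first to last event, and over-long segments are cut into pieces of length `upper`. These are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def lump_events(event_times, delta):
    """Group sorted event times whose gap to the previous event is below delta."""
    segments = []
    for t in event_times:
        if segments and t - segments[-1][1] < delta:
            segments[-1][1] = t  # extend the current activity
        else:
            segments.append([t, t])  # start a new activity
    return [(start, end) for start, end in segments]


def enforce_bounds(segments, lower, upper):
    """Split segments longer than upper and drop pieces shorter than lower."""
    result = []
    for start, end in segments:
        while end - start > upper:
            result.append((start, start + upper))
            start += upper
        if end - start >= lower:
            result.append((start, end))  # remainder is long enough to keep
    return result


# Hypothetical event times in seconds.
events = [0, 1, 2, 10, 11, 12, 13, 14, 15, 16, 17, 30]
activities = enforce_bounds(lump_events(events, delta=2), lower=2, upper=5)
# -> [(0, 2), (10, 15), (15, 17)]; the isolated event at 30 is discarded
```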
What sensor is in the wearable device?
IMU (inertial measurement unit)
How do the researchers associate aligned signals?
causal discovery between time series
method to infer causal relationship between pairs of multimodal sensing signals
train an individual network for each vibration sensor with all wearable sensors to estimate the association relationship (-> train for associating one vibration signal with one of n wearables -> each vibration signal has one wearable, but a wearable can have more…)
How is the AD-TCN network structured?
input: one vibration TS and n wearable TS (time aligned)
first an association score layer (layer with a weight for each input), initialized with the same value
updated during training with gradient descent
from association score layer, compute association score from output using softmax
branch out for actual training with TCN blocks that are used to predict the infrastructure signal (i.e. aim to predict the input as output, including the wearable signals) and use this to train the network
When predicting, what input data is used for predicting an output (w.r.t. time)
vibration is shifted left by 1 compared to the prediction and wearable info
-> use vibration t-1 as baseline info and wearable t as prediction basis to predict vibration t…
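The time shift can be made concrete with a small sketch that builds (input, target) pairs from aligned series. It treats each wearable as a single channel for brevity, and `build_training_pairs` is a hypothetical helper name, not something from the paper.

```python
def build_training_pairs(vibration, wearables):
    """Pair vibration[t-1] and every wearable[t] with the target vibration[t]."""
    inputs, targets = [], []
    for t in range(1, len(vibration)):
        row = [vibration[t - 1]] + [w[t] for w in wearables]
        inputs.append(row)
        targets.append(vibration[t])
    return inputs, targets


# Hypothetical aligned series: one vibration channel and two wearables.
vibration = [1, 2, 3, 4]
wearables = [[10, 20, 30, 40], [5, 6, 7, 8]]
inputs, targets = build_training_pairs(vibration, wearables)
# inputs  -> [[1, 20, 6], [2, 30, 7], [3, 40, 8]]
# targets -> [2, 3, 4]
```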
Explain all components of the network
input: time-shifted vibration plus all wearables
go through weight layer with softmax
outputs go into individual TCN residual blocks (no detail)
pointwise convolution layer to integrate all outputs of blocks before
this output predicts the vibration input + 1
use loss function to compare to actual input to train network
Can one directly use the association scores coming out of the softmax in the AD-TCN network?
no not comparable between different networks
output is actually attention value and cannot represent association relationship directly
=> common representation of association relationship needed
How do the researchers make the output of the network comparable?
use association score to calculate divergence between vibration sensor and wearable sensors
and then use softmax to convert divergence to the association probability
What is the idea of divergence and how is it calculated? How is it made comparable?
idea: low association divergence -> low contribution to prediction
calculate by sqrt of euclidean norm of the output over the channels of each individual wearable sensor
then calculate softmax over all these wearable sensor specific values
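Read literally, the calculation above could look like the following sketch. It takes the per-channel association scores of each wearable as input and follows the wording of these notes (square root of the Euclidean norm, then softmax). The exact formula in the paper may differ, and all names here are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def association_probabilities(channel_scores):
    """channel_scores[i] holds the association scores of wearable i's channels."""
    per_wearable = []
    for scores in channel_scores:
        euclidean_norm = math.sqrt(sum(s * s for s in scores))
        per_wearable.append(math.sqrt(euclidean_norm))  # "sqrt of Euclidean norm"
    return softmax(per_wearable)  # comparable across networks, sums to 1


# Hypothetical scores for three wearables with three channels each.
probs = association_probabilities([
    [0.5, 0.4, 0.3],
    [0.1, 0.1, 0.1],
    [0.05, 0.0, 0.05],
])
```

Because the softmax normalizes over all wearables attached to one vibration sensor, the resulting values can be compared across the individually trained networks.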
Do the wearable sensors have more than one timeseries?
yes multiple channels
e.g. accelerometer
gyroscope
What aspects do the researchers evaluate?
association performance and system characterization on public dataset and own uncontrolled dataset
use case study for real application demonstration
What is the public dataset they used?
floor vibration sensors
IMU (6-axis) sensors
two buildings with 6 humans and nine types of in-home activities of daily living
What activity data is represented in the public dataset?
keyboard typing
using mouse
handwriting
cutting food
stir fry
wiping countertop
sweeping floor
vacuuming floor
open / close drawer
for each scenario four vibration sensors
10 times for 15 seconds
one building one human
How is the uncontrolled dataset structured?
same types of sensors
11 people
3 per house for data collection
three vibration sensors on surface of furniture
occupants go about their activities simultaneously for around one hour
use camera to verify later on (=> ground truth for evaluation)
What evaluation metrics do they consider?
ROC and AUC for activity recognition thresholding
CMA and baseline methods
F1 score and accuracy for performance at the selected (best) threshold
intuitive overall evaluation on public and uncontrolled dataset
What does the ROC curve show?
true positive (right association of wearable and vibration)
false positive (wrongly associated)
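To make the ROC/AUC evaluation concrete, here is a minimal sketch that sweeps a threshold over association probabilities and integrates the curve with the trapezoidal rule. The scores and labels are made-up toy values, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: label 1 = pair truly associated, 0 = not associated.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
area = auc(roc_points(scores, labels))  # perfectly separated scores give 1.0
```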
What baseline methods do they use to compare their method to?
regular signal similarity…
cosine similarity
surface similarity
max cross correlation
between IMU and vibration
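Two of these baselines can be sketched as follows for equal-length, single-channel segments. The normalization of the cross-correlation is an assumption made for this example, and surface similarity is left out because these notes do not define it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length signals."""
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def max_cross_correlation(a, b):
    """Largest normalized cross-correlation over all lags."""
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        return 0.0
    best = -1.0
    for lag in range(-(len(a) - 1), len(b)):
        total = sum(
            a[i] * b[i + lag] for i in range(len(a)) if 0 <= i + lag < len(b)
        )
        best = max(best, total / norm)
    return best


# Hypothetical segments: b is a copy of a delayed by two samples.
a = [0, 0, 1, 2, 1, 0, 0, 0]
b = [0, 0, 0, 0, 1, 2, 1, 0]
cos = cosine_similarity(a, b)       # low, because the peaks do not line up
xcorr = max_cross_correlation(a, b)  # high, because the best lag aligns them
```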
What is the use case study they conducted?
two use cases implemented on public dataset (due to presence of activity and identity labels)
occupant identification
vibration data alone is hard to label, thus a temporary setup with IMUs is used to learn the identity association; an SVM then works without the IMU (wearable)
multimodal human activity recognition
associate signals and then input both to activity recognition
What do they compare their model to in the use case experiments?
ground truth -> input only correct associations to the downstream task
use the associations their model produces
randomly associate inputs
By how much were they able to increase overall performance?
up to 31 percent better F1 than baseline
How did they evaluate the impact of activity category on association levels?
separate evaluation of direct (e.g. cut with knife -> cut creates vibration) and indirect (vacuum floor -> motor creates vibration)
show ROC
How did the model perform w.r.t. different movement types? (direct, indirect etc.)
was robust
How did they further aim to evaluate real world scenarios?
choose three pairs
0, 1, or 2 unassociated (=> a vibration signal without a matching IMU)
=> the more unassociated, the lower the performance
explanation: multiple IMUs equally not associated -> not efficient for distinguishing
How did they further evaluate the scalability of CMA?
increase number of wearable devices to more than 3
up to six
slight average decrease in AUC
but still good
How did they evaluate the uncontrolled dataset?
same as controlled but only for general performance (ROC, AUC, F1, Accuracy)
slightly better than in controlled dataset
How was the performance in the use case? (occupant identification, human activity recognition)?
occupant: improvement over random association but worse than ideal association
same in activity recognition