Archive of Computer Science Technical Reports

Archive of Computer Science Publications

The following is a list of our faculty's publications ordered by date.

Moving Target Defense for Avionic Systems.
Heydari, Vahid
Conference
National Cyber Summit (NCS). IEEE, 2018.

Exploring Bias in the US Electoral College System via Big-Data Simulation.
Breitzman, Anthony F.
Workshop
BigData 2018: 4304-4312

Using Cartograms to Visualize Population Normalized Big-Data Sets.
Breitzman, Anthony F.
Workshop
BigData 2018: 3575-3580

Visual Analytics for Real-Time Flight Behavior Threat Assessment.
Bo Sun, Eric Zielonka, Aleksandr Fritz, Matthew Schofield, Brennan Ringel, Brendan Armstrong, Shen-Shyang Ho, Anthony F. Breitzman, Jason Snouffer, Jean Kirschner, Kimberly Davis
Workshop
BigData 2018: 3607-3612.

Dynamic Demand Prediction and Allocation in Cloud Service Brokerage.
C. Qiu and H. Shen
Journal Article
IEEE Transactions on Cloud Computing, 2019 (to appear).

N-ary Decomposition for Multi-class Classification
S-S Ho, J. T. Zhou, I. W. Tsang, and K-R Müller
Machine Learning Journal
2019-01-08

ParkLoc: Light-weight Graph-based Vehicular Localization in Parking Garages.
S.-S. Ho, J. Cherian, and J. Luo
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), vol. 2, issue 3
2018-09

Moving target defense for securing SCADA communications.
Heydari, Vahid
IEEE Access 6 (2018): 33329-33343.

Using Smart Glasses for Facial Recognition.
G. Mayorga, X. Do, and V. Heydari
American Journal of Undergraduate Research (AJUR)
2019

Robust energy-based least squares twin support vector machines
Mohammad Tanveer, Mohammad Asif Khan, Shen-Shyang Ho
Journal Article
2016-07-01
DOI: 10.1007/s10489-015-0751-1
Abstract
Twin support vector machine (TSVM), least squares TSVM (LSTSVM) and energy-based LSTSVM (ELS-TSVM) satisfy only the empirical risk minimization principle. Moreover, the matrices in their formulations are always positive semi-definite. To overcome these problems, we propose in this paper a robust energy-based least squares twin support vector machine algorithm, called RELS-TSVM for short. Unlike TSVM, LSTSVM and ELS-TSVM, our RELS-TSVM maximizes the margin with a positive definite matrix formulation and implements the structural risk minimization principle, which embodies the marrow of statistical learning theory. Furthermore, RELS-TSVM utilizes energy parameters to reduce the effect of noise and outliers. Experimental results on several synthetic and real-world benchmark datasets show that RELS-TSVM not only yields better classification performance but also has a lower training time compared to ELS-TSVM, LSPTSVM, LSTSVM, TBSVM and TSVM.
Major milestones in the twin prime conjecture
Anthony Breitzman
Journal Article
2016-06-30
Abstract
In April 2013 Yitang Zhang announced a proof that there are infinitely many pairs of prime numbers that have a difference of at most 70 million (see Zhang (2014)). Others have since narrowed the gap from 70 million to just 246. Reducing the gap to 2 would prove the twin prime conjecture. The significance is a path to solving an ancient problem that looked hopeless just a few years ago. The popular press has discussed Zhang's result but at a very high level. Zhang (2014) and subsequent papers appearing in number theory journals present only the most recent details and are not really accessible to the nonspecialist. This paper attempts to bridge the gap and present a history of major milestones leading to the current state of the twin prime conjecture written at the level of working mathematicians.
ParkGauge: Gauging the Occupancy of Parking Garages with Crowdsensed Parking Characteristics
Jim Cherian, Jun Luo, Hongliang Guo, Shen-Shyang Ho, Richard Wisbrun
IEEE 17th International Conference on Mobile Data Management
2016-06-13
Abstract
Finding available parking spaces in dense urban areas is a globally recognized issue in urban mobility. Whereas prior studies have focused on outdoor/street parking due to a common belief that parking garages are capable of delivering real-time occupancy information, we specifically target (indoor) parking garages, as this belief is far from true. This problem is very challenging because the infrastructure supports (e.g., GPS and Wi-Fi) assumed by existing proposals are not available in parking garages, so counting how many vehicles are using a parking garage by crowdsensing can be extremely difficult. To this end, we present ParkGauge, a method to gauge the occupancy of parking garages, along with a reference system prototype for performance evaluation. It infers parking occupancy from crowdsensed parking characteristics instead of counting the parked vehicles. ParkGauge adopts low-power sensors (e.g., accelerometer and barometer) in the driver's smartphone to determine the driving states (e.g., turning and braking). A sequence of such states further allows the inference of driving contexts (e.g., driving, queuing and parked) that in turn yield temporal parking characteristics of a parking garage, including time-to-park and time-in-cruising/queuing. Mining such mobile data opportunistically collected from a crowd of drivers arriving at various garages yields a good measure of their occupancies, and hence useful recommendations can be generated (in real time) to inform drivers coming toward these venues. Through extensive experiments, we demonstrate that our method fully explores these parking characteristics to efficiently infer occupancies of parking garages with high accuracy.
Manifold Learning for Multivariate Variable-Length Sequences With an Application to Similarity Search
Shen-Shyang Ho, Peng Dai, Frank Rudzicz
IEEE Transactions on Neural Networks and Learning Systems
2016-06-01
Abstract
Multivariate variable-length sequence data are becoming ubiquitous with the technological advancement in mobile devices and sensor networks. Such data are difficult to compare, visualize, and analyze due to the nonmetric nature of data sequence similarity measures. In this paper, we propose a general manifold learning framework for arbitrary-length multivariate data sequences driven by similarity/distance (parameter) learning in both the original data sequence space and the learned manifold. Our proposed algorithm transforms the data sequences in a nonmetric data sequence space into feature vectors in a manifold that preserves the data sequence space structure. In particular, the feature vectors in the manifold representing similar data sequences remain close to one another and far from the feature points corresponding to dissimilar data sequences. To achieve this objective, we assume a semisupervised setting where we have knowledge about whether some of the data sequences are similar or dissimilar, called the instance-level constraints. Using this information, one learns the similarity measure for the data sequence space and the distance measures for the manifold. Moreover, we describe an approach to handle the similarity search problem given user-defined instance-level constraints in the learned manifold using a consensus voting scheme. Experimental results on both synthetic data and real tropical cyclone sequence data are presented to demonstrate the feasibility of our manifold learning framework and the robustness of performing similarity search in the learned manifold.
The Distributed Esteemed Endorser Review: A Novel Approach to Participant Assessment in MOOCs
Jennifer S. Kay, Tyler J. Nolan, Thomas M. Grello
L@S '16
2016-04-26
DOI: 10.1145/2876034.2893396
Abstract
One of the most challenging aspects of developing a Massive Open Online Course (MOOC) is designing an accurate method to effectively assess participant knowledge and skills. The Distributed Esteemed Endorser Review (DEER) approach has been developed as an alternative for those MOOCs where traditional approaches to assessment are not appropriate. In DEER, course projects are certified in-person by an "Esteemed Endorser", an individual who is typically senior in rank to the student, but is not necessarily an expert in the course content. Not only does DEER provide a means to certify that course goals have been met, it also provides MOOC participants with the opportunity to share information about what they have learned with others at the local level.
Exploiting sparsity for image-based object surface anomaly detection
Woon Huei Chai, Shen-Shyang Ho, Chi-Keong Goh
41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
2016-03-21
Abstract
The anomaly detection task plays an important role in quality control in many industrial or manufacturing processes. However, in many such processes, anomaly detection is done visually by human experts who have in-depth knowledge and vast experience on a product in order to perform well in the detection task. In this paper, we present an approach that (i) identifies anomalies in an image based on the sparse residuals (or errors) obtained during image reconstruction using sparse representation and (ii) learns the threshold to classify an image pixel based on its residual value. The intuitions for our proposed sparse approximation driven approach are, namely: (i) anomalies are infrequent and (ii) anomalies are unwanted portions of an image reconstruction. Empirical results on a real-world image dataset for an industrial surface defect detection task are used to demonstrate the feasibility of our proposed approach.
Accuracy of class prediction using similarity functions in PAM
Umashanger Thayasivam, Vasil Hnatyshin, Isaac B. Muck
2016 IEEE International Conference on Industrial Technology (ICIT)
2016-03-14
DOI: 10.1109/ICIT.2016.7474815
Abstract
Clustering has been proven to be an effective technique for finding data instances with similar characteristics. Clustering algorithms are based on the notion of distance between data points, often computed using the Euclidean metric. That is why clustering algorithms are mostly applicable to data sets comprising numerical values. However, real-life data often consist of features which are categorical in nature. For example, to identify abnormal behavior or a cyberattack in a network, we usually examine packet headers which contain categorical values such as source and destination IP addresses, source and destination port numbers, upper layer protocols, etc. The Euclidean metric is not applicable to such data sets because it cannot compute the distance between categorical variables. To address this problem, similarity functions have been designed to determine the relationship between given categorical values. Similarity defines how closely related the objects are to one another. Often similarity can be thought of as the opposite of distance: similar objects have a high value, while dissimilar objects have a low or zero value. In this paper we explored the accuracy of various similarity functions using the Partitioning Around Medoids (PAM) clustering algorithm. We tested similarity functions on several data sets to determine their ability to correctly predict the class labels. We also examined the applicability of various similarity functions to different types of data sets.
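Illustrative sketch (not taken from the paper): the abstract above does not say which similarity functions were tested, so this example assumes the simple overlap measure, i.e. the fraction of attributes on which two records agree, and converts it to a dissimilarity as 1 - similarity. The clustering loop is a basic alternating k-medoids variant rather than the full PAM swap procedure, and the function names and toy packet-header records are hypothetical.

```python
def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree (0.0 to 1.0)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def k_medoids(records, k, max_iter=100):
    """Alternating k-medoids using dissimilarity = 1 - overlap similarity.

    Returns (medoids, clusters): the medoid indices and a dict mapping each
    medoid index to the list of record indices assigned to it.
    """
    def dist(i, j):
        return 1.0 - overlap_similarity(records[i], records[j])

    medoids = list(range(k))  # naive initialisation: the first k records
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every record to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(len(records)):
            nearest = min(medoids, key=lambda m: dist(i, m))
            clusters[nearest].append(i)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total dissimilarity to the other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep the old medoid if its cluster is empty
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(dist(c, j) for j in members))
            )
        if new_medoids == medoids:  # converged: medoids did not change
            break
        medoids = new_medoids
    return medoids, clusters


# Hypothetical packet-header records: (protocol, destination port, source IP).
records = [
    ("tcp", "80", "10.0.0.1"),
    ("tcp", "80", "10.0.0.2"),
    ("tcp", "443", "10.0.0.1"),
    ("udp", "53", "10.0.0.9"),
    ("udp", "53", "10.0.0.8"),
    ("udp", "123", "10.0.0.9"),
]
medoids, clusters = k_medoids(records, k=2)
print(medoids, clusters)
```

On this toy input the TCP and UDP records end up in separate clusters. Other similarity functions for categorical data could be substituted for overlap_similarity without changing the loop.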
Transfer Learning for Cross-Language Text Categorization through Active Correspondences Construction
Joey Tianyi Zhou, Sinno Jialin Pan, Ivor W. Tsang, Shen-Shyang Ho
AAAI 2016
2016-03-02
Abstract
Most existing heterogeneous transfer learning (HTL) methods for cross-language text classification rely on sufficient cross-domain instance correspondences to learn a mapping across heterogeneous feature spaces, and assume that such correspondences are given in advance. However, in practice, correspondences between domains are usually unknown. In this case, extensive manual effort is required to establish accurate correspondences across multilingual documents based on their content and meta-information. In this paper, we present a general framework to integrate active learning to construct correspondences between heterogeneous domains for HTL, namely HTL through active correspondences construction (HTLA). Based on this framework, we develop a new HTL method. On top of the new HTL method, we further propose a strategy to actively construct correspondences between domains. Extensive experiments are conducted on various multilingual text classification tasks to verify the effectiveness of HTLA.
An efficient regularized K-nearest neighbor based weighted twin support vector regression
M. Tanveer, K. Shubham, M. Aldhaifallah, Shen-Shyang Ho
Knowledge-Based Systems Volume 94
2016-02-15
Abstract
In general, pattern classification and regression tasks do not take into consideration the variation in the importance of the training samples. For twin support vector regression (TSVR), this implies that all the training samples play the same role in the bound functions. However, the number of close neighboring samples near each training sample has an effect on the bound functions. In this paper, we formulate a regularized version of the KNN-based weighted twin support vector regression (KNNWTSVR) called RKNNWTSVR which is both efficient and effective. By introducing the regularization term and using the 2-norm of slack variables instead of the 1-norm, our RKNNWTSVR only needs to solve a simple system of linear equations with low computational cost, and at the same time, it improves the generalization performance. In particular, we compare four implementations of RKNNWTSVR with existing approaches. Experimental results on several synthetic and benchmark datasets indicate that, compared to SVR, WSVR, TSVR and KNNWTSVR, our proposed RKNNWTSVR has better generalization ability and requires less computational time.
Multi-Label Regularized Generative Model for Semi-Supervised Collective Classification in Large-Scale Networks
Qingyao Wu, Jian Chen, Shen-Shyang Ho, Xutao Li, Huaqing Min, Chao Han
Big Data Research 2(4)
2015-12
DOI: 10.1016/j.bdr.2015.04.002
Abstract
The problem of collective classification (CC) for large-scale network data has received considerable attention in the last decade. Enabling CC usually increases accuracy when given a fully-labeled network with a large amount of labeled data. However, such labels can be difficult to obtain and learning a CC model with only a few such labels in large-scale sparsely labeled networks can lead to poor performance. In this paper, we show that leveraging the unlabeled portion of the data through semi-supervised collective classification (SSCC) is essential to achieving high performance. First, we describe a novel data-generating algorithm, called generative model with network regularization (GMNR), to exploit both labeled and unlabeled data in large-scale sparsely labeled networks. In GMNR, a network regularizer is constructed to encode the network structure information, and we apply the network regularizer to smooth the probability density functions of the generative model. Second, we extend our proposed GMNR algorithm to handle network data consisting of multi-label instances. This approach, called the multi-label regularized generative model (MRGM), includes an additional label regularizer to encode the label correlation, and we show how these smoothing regularizers can be incorporated into the objective function of the model to improve the performance of CC in the multi-label setting. We then develop an optimization scheme to solve the objective function based on the EM algorithm. Empirical results on several real-world network data classification tasks show that our proposed methods are better than the compared collective classification algorithms, especially when labeled data is scarce.
SciDB-based Framework for Efficient Satellite Data Storage and Query based on Dynamic Atmospheric Event Trajectory
Luboš Krčál, Shen-Shyang Ho
BigSpatial'15 Proceedings of the 4th International ACM SIGSPATIAL Workshop on Analytics for Big Geospatial Data
2015-11-03
Abstract
Current research in climate informatics focuses mainly on the development of novel (machine learning, data mining, or statistical) techniques to analyze climate data (e.g. model, in-situ, or satellite) or to make predictions based on these climate data. One important component missing from this analysis workflow is data management that allows efficient and flexible data retrieval, (ease of) reproducibility, and (ease of) technique reuse on user-defined data subsets or other data.

In this paper, we describe our preliminary investigation on the utilization of the distributed array-based database management system, SciDB, to support data-driven climate science research. We focus on modeling and generating indices that allow effective execution of various spatiotemporal queries on satellite data. Moreover, we demonstrate fast and accurate data retrieval based on user-specified trajectories from the SciDB database containing tropical cyclone trajectories and the complete ten-year QuikSCAT ocean surface wind fields satellite data.

Our preliminary work indicates the feasibility of the array-based technology for multiple satellite data storage, query, and analysis. Towards this end, a successful deployment of SciDB-based data storage can facilitate the use of data from multiple satellites for climate and weather research.
Stochastic location optimization in a dynamic environment
Hongliang Guo, Yubo Dong, Shen-Shyang Ho
SIGSPATIAL/GIS 2015
2015-11-03
Abstract
Existing location optimization solutions only consider the positioning of resources/seeds at the best location in a target area by minimizing a certain metric over distance. In reality, what really matters is time. In this paper, the location optimization problem is formulated as the expected response time minimization problem rather than a distance minimization problem. Moreover, we propose an algorithm that takes into consideration various stochastic factors which affect the location optimization problem, such as non-uniform probability distribution of the demands, road congestion level, and vehicles' maximum speed. Our proposed algorithm shows promising performance when the disparity among vehicle capabilities (e.g., maximum speed) is large and the environment constraints (e.g., traffic jams) are taken into consideration.
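Illustrative sketch (not the authors' algorithm): the toy example below only shows why minimizing expected response time can pick a different site than minimizing expected distance. It assumes two candidate sites, demand points with non-uniform probabilities, and a hypothetical per-demand "flow" factor in (0, 1] that scales the vehicle's maximum speed to model congestion. All names and numbers are made up for illustration.

```python
import math


def distance(site, demand):
    """Straight-line distance between a candidate site and a demand point."""
    return math.hypot(site[0] - demand["xy"][0], site[1] - demand["xy"][1])


def make_travel_time(max_speed):
    """Build a cost function: distance divided by the congestion-limited speed."""
    def travel_time(site, demand):
        return distance(site, demand) / (max_speed * demand["flow"])
    return travel_time


def expected_cost(site, demands, cost):
    """Probability-weighted average of a cost (distance or time) over demands."""
    return sum(d["prob"] * cost(site, d) for d in demands)


sites = [(0.0, 0.0), (10.0, 0.0)]
demands = [
    # Likely demand near the first site, reached over free-flowing roads.
    {"xy": (1.0, 0.0), "prob": 0.6, "flow": 1.0},
    # Less likely demand near the second site, in a congested area.
    {"xy": (9.0, 0.0), "prob": 0.4, "flow": 0.25},
]

best_by_distance = min(sites, key=lambda s: expected_cost(s, demands, distance))
best_by_time = min(
    sites, key=lambda s: expected_cost(s, demands, make_travel_time(50.0))
)
print(best_by_distance, best_by_time)
```

Here the distance criterion favors the first site, while the time criterion favors the second, because slow travel into the congested area dominates the expected response time.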
Physical Activity in a Theory of Computing Class
Nancy Tinkham
Proceedings of the 20th ACMS Conference
2015-05
Abstract
Physical activity breaks, sometimes called brain breaks, are beginning to gain attention among K-12 teachers as a way to keep their students alert and engaged in the classroom. In the Fall 2014 semester, faced with the task of teaching an introductory course in Theory of Computing in a once-a-week, 2.5-hour format, I decided to try incorporating physical activity into my own classroom. Time is precious in the college classroom, so any physical activities have to be directly related to the course material. I will describe some physically active exercises that I used in the classroom to teach students about regular expressions, finite automata, and other theoretical concepts. During the semester, I found that these exercises helped students to have fun and to stay connected to the material, even at the end of this long, late-night class. I also found that the exam averages and the overall course average were higher in Fall 2014 than they had been during the previous four years of teaching this night class. This invites further experimentation with the technique in future semesters.
Inventor team size as a predictor of the future citation impact of patents
Anthony Breitzman, Patrick Thomas
Scientometrics Volume 103
2015-03-01
DOI: 10.1007/s11192-015-1550-5
Abstract
Forward citations are widely recognized as a useful measure of the impact of patents upon subsequent technological developments. However, an inherent characteristic of forward citations is that they take time to accumulate. This makes them valuable for retrospective impact evaluations, but less helpful for prospective forecasting exercises. To overcome this, it would be desirable to have indicators that forecast future citations at the time a patent is issued. In this paper, we outline one such indicator, based on the size of the inventor teams associated with patents. We demonstrate that, on average, patents with eight or more co-inventors are cited significantly more frequently in their first 5 years than peer patents with fewer inventors. This result holds true across technologies, assignee type, citation source (examiner versus applicant), and after self-citations are accounted for. We hypothesize that inventor team size may be a reflection of the amount of resources committed by an organization to a given innovation, with more researchers attached to innovations regarded as having particular promise or value.
ML-TREE: A Tree-Structure-Based Approach to Multilabel Learning
Qingyao Wu, Yunming Ye, Haijun Zhang, Tommy W. S. Chow, Shen-Shyang Ho
IEEE Trans. Neural Netw. Learning Syst. 26(3)
2015-03
DOI: 10.1109/TNNLS.2014.2315296
Abstract
Multilabel learning aims to predict labels of unseen instances by learning from training samples that are associated with a set of known labels. In this paper, we propose to use a hierarchical tree model for multilabel learning, and to develop the ML-Tree algorithm for finding the tree structure. ML-Tree considers a tree as a hierarchy of data and constructs the tree using the induction of one-against-all SVM classifiers at each node to recursively partition the data into child nodes. For each node, we define a predictive label vector to represent the predictive label transmission in the tree model for multilabel prediction and automatic discovery of the label relationships. If two labels co-occur frequently as predictive labels at leaf nodes, these labels are supposed to be relevant. The amount of predictive label co-occurrence provides an estimation of the label relationships. We examine the ML-Tree method on 11 real data sets of different domains and compare it with six well-established multilabel learning algorithms. The performances of these approaches are evaluated by 16 commonly used measures. We also conduct Friedman and Nemenyi tests to assess the statistical significance of the differences in performance. Experimental results demonstrate the effectiveness of our method.
Sequential behavior prediction based on hybrid similarity and cross-user activity transfer
Peng Dai, Shen-Shyang Ho, Frank Rudzicz
Knowledge Based Systems, Volume 77
2015-03
DOI: 10.1016/j.knosys.2014.12.026
Abstract
The proliferation of smart phones has opened up new kinds of data to model human behavior and predict future activity but this prediction can be tempered by the relative sparsity of data. In this paper, we integrate a time-dependent instance transfer mechanism, driven by a hybrid similarity measure, into learning and predicting human behavior. In particular, transfer component analysis (TCA) is utilized for domain adaptation from different data types to overcome data sparsity. The hybrid user similarity measure is developed based on three different characteristics: eigen-behavior, longest common behavior (LCB), and daily common behavior (DCB). Extensive comparisons are made against state-of-the-art time series prediction algorithms using the Nokia Mobile Data Challenge (MDC) dataset and the MIT Reality Mining dataset. We compare the prediction performance given (i) no additional data, (ii) only data from identical behavior from other users, and (iii) data from any type of behavior from other users. Experimental results show that our proposed algorithm significantly improves the performance of behavior prediction.
The Emerging Clusters Model: A tool for identifying emerging technologies across multiple patent systems
Anthony Breitzman, Patrick Thomas
Research Policy Volume 44
2015-02
DOI: 10.1016/j.respol.2014.06.006
Abstract
Emerging technologies are of great interest to a wide range of stakeholders, but identifying such technologies is often problematic, especially given the overwhelming amount of information available to analysts and researchers on many subjects. This paper describes the Emerging Clusters Model, which uses advanced patent citation techniques to locate emerging technologies in close to real time, rather than retrospectively. The model covers multiple patent systems, and is designed to be extensible to additional systems. This paper also describes the first large scale test of the Emerging Clusters Model. This test reveals that patents in emerging clusters consistently have a significantly higher impact on subsequent technological developments than patents outside these clusters. Given that these emerging clusters are defined as soon as a given time period ends, without the aid of any forward-looking information, this suggests that the Emerging Clusters Model may be a useful tool for identifying interesting new technologies as they emerge.
Chapter 13: Machine learning algorithms for metabolomics applications
Vasil Hnatyshin
Metabolomic Data Processing and Analysis
2015-01-06
Abstract
Needs info.
Semi-supervised multi-label collective classification ensemble for functional genomics
Qingyao Wu, Yunming Ye, Shen-Shyang Ho, Shuigeng Zhou
Thirteenth International Conference on Bioinformatics (InCoB2014): Computational Biology
2014-12-08
DOI: 10.1186/1471-2164-15-S9-S17
Abstract
With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, the process of manually predicting functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting functional properties of proteins require either identifying a reliable set of labeled proteins with attribute features similar to those of unannotated proteins, or learning from a fully-labeled protein interaction network with a large amount of labeled data. However, acquiring such labels can be very difficult in practice, especially for multi-label protein function prediction problems. Learning with only a few labeled data can lead to poor performance, as only limited supervision knowledge can be obtained from similar proteins or from connections between them. To effectively annotate proteins even when labeled data are scarce, it is important to take advantage of all data sources that are available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data.
Traffic incident validation and correlation using text alerts and images
Wye Huong Yan, Justin Ong, Shen-Shyang Ho, Jim Cherian
SIGSPATIAL/GIS 2014
2014-11-06
Abstract
One of the major challenges during the process of extracting information from multiple spatio-temporal data sources of diverse data types is the matching and fusion of extracted knowledge (e.g., interesting nearby events detected from text, estimated density or flow from a set of geo-coded images). In this demonstration, we present PETRINA ("PErsonalized TRaffic INformation Analytics"), a system that provides traffic-related incident monitoring, mapping, and analytics services. In particular, we showcase two main functionalities: (1) text traffic alert validation based on traffic condition information derived from traffic camera images and (2) traffic incident correlation based on spatio-temporal proximity of different incident types (e.g., accidents and heavy traffic). Despite the fact that the images are sparse (available every three minutes), the regularity makes it possible to validate whether a text traffic alert is outdated or not, and to more accurately estimate the time elapsed and total incident time. Multiple traffic incidents can be grouped together as a single event based on the traffic incident correlation to reduce information redundancy. Such enhanced real-time traffic information enables PETRINA to offer services such as dynamic routing with traffic incident advice, spatiotemporal traffic incident visual analytics, and congestion analysis.
FORESTEXTER: An efficient random forest algorithm for imbalanced text categorization
Qingyao Wu, Yunming Ye, Haijun Zhang, Michael K. Ng, Shen-Shyang Ho
Knowledge Based Systems, Volume 67
2014-09
DOI: 10.1016/j.knosys.2014.06.004
Abstract
In this paper, we propose a new random forest (RF) based ensemble method, ForesTexter, to solve imbalanced text categorization problems. RF has shown great success in many real-world applications. However, the problem of learning from text data with class imbalance is a relatively new challenge that needs to be addressed. An RF algorithm tends to use a simple random sampling of features in building its decision trees. As a result, it selects many subspaces that contain few, if any, informative features for the minority class. Furthermore, the Gini measure for data splitting is considered to be skew sensitive and biased towards the majority class. Due to the inherent complex characteristics of imbalanced text datasets, learning RF from such data requires new approaches to overcome challenges related to feature subspace selection and cut-point choice while performing node splitting. To this end, we propose a new tree induction method that selects splits, covering both feature subspace selection and the splitting criterion, for RF on imbalanced text data. The key idea is to stratify features into two groups and to generate effective term weighting for the features. One group contains positive features for the minority class and the other one contains the negative features for the majority class. Then, for feature subspace selection, we effectively select features from each group based on the term weights. The advantage of our approach is that each subspace contains adequate informative features for both minority and majority classes. One difference between our proposed tree induction method and the classical RF method is that our method uses a Support Vector Machine (SVM) classifier to split the training data into smaller and more balanced subsets at each tree node, and then successively retrains the SVM classifiers on the data partitions to refine the model while moving down the tree.
In this way, we force the classifiers to learn from refined feature subspaces and data subsets to fit the imbalanced data better. Hence, the tree model becomes more robust for text categorization tasks with imbalanced datasets. Experimental results on various benchmark imbalanced text datasets (Reuters-21578, Ohsumed, and imbalanced 20 newsgroup) consistently demonstrate the effectiveness of our proposed ForesTexter method. The performance of our proposed approach is competitive against the standard random forest and different variants of SVM algorithms.
A Smartphone User Activity Prediction Framework Utilizing Partial Repetitive and Landmark Behaviors
Peng Dai, Shen-Shyang Ho
2014 IEEE 15th International Conference on Mobile Data Management (MDM), Vol. 1
2014-07-14
DOI: 10.1109/MDM.2014.31
Abstract
In this paper, we propose a general smartphone user activity prediction framework utilizing the general concept of partial repetitive behavior (instead of the stronger periodicity condition) for similarity scoring and the landmark behaviors (representative behaviors to identify groups of similar behavior vectors). Prediction of the next-day(s) behavior is based on a weighted sum of the most similar behavior vectors related to the landmark behavior of the next-day(s) behavior. These behavior vectors are selected based on the likely partial repetition of the next-day behavior and similarity in the eigen behavior feature space. Our proposed prediction algorithm allows one to categorically quantify the frequency of a target behavior, such as no behavior, normal behavior, and high frequency behavior, or other more refined categorization based on user preference. Extensive experiments are carried out using the Nokia Mobile Data Challenge (MDC) dataset to demonstrate the feasibility of our proposed approach and its generality using arbitrary call activity, voice call activity, short message activity, media consumption, and apps usage data types.
Robust prediction in nearly periodic time series using motifs
Woon Huei Chai, Hongliang Guo, Shen-Shyang Ho
International Joint Conference on Neural Networks (IJCNN)
2014-07-06
Abstract
In this paper, we consider the prediction task for a process with a nearly periodic property, i.e., patterns occur with some regularity but no exact periodicity. We propose an inference approach based on a probabilistic Markov framework utilizing motif-driven transition probabilities for sequential prediction. In particular, a Markov-based weighting framework that fully utilizes the information from recent historical data and sequential pattern regularities is developed for nearly periodic time series prediction. Preliminary experimental results show that our prediction approach is competitive against the moving average and multi-layer perceptron neural network approaches on synthetic data. Moreover, our proposed method is shown to be empirically robust on time series with missing data and noise. We also demonstrate the usefulness of our proposed approach on a real-world vehicle parking lot availability prediction task.
A Generative Model with Network Regularization for Semi-Supervised Collective Classification
Ruichao Shi, Qingyao Wu, Yunming Ye, Shen-Shyang Ho
SIAM International Conference on Data Mining (SDM)
2014-04-24
DOI: 10.1137/1.9781611973440.8
Abstract
In recent years much effort has been devoted to Collective Classification (CC) techniques for predicting labels of linked instances. Given a large number of labeled data, conventional CC algorithms make use of local labeled neighbours to increase accuracy. However, in many real-world applications, labeled data are limited and very expensive to obtain. In this situation, most of the data have no connection to labeled data, and supervision knowledge cannot be obtained from the local connections. Recently, Semi-Supervised Collective Classification (SSCC) has been examined to leverage unlabeled data for enhancing the classification performance of CC. In this paper we propose a probabilistic generative model with network regularization (GMNR) for SSCC. Our main idea is to compute label probability distributions for unlabeled instances by maximizing both the log-likelihood in the generative model and the label smoothness on the network topology of data. The proposed generative model is based on the Probabilistic Latent Semantic Analysis (PLSA) method using attribute features of all instances. A network regularizer is employed to smooth the label probability distributions on the network topology of data. Finally, we develop an effective EM algorithm to compute the label probability distributions for label prediction. Experimental results on three real sparsely-labeled network datasets show that the proposed model GMNR outperforms state-of-the-art CC algorithms and other SSCC algorithms.
Teach algorithm design and intractability with a project-based curriculum centered on a single intractable problem
Andrea F. Lobo, Ganesh R. Baliga
SIGCSE '14 Proceedings of the 45th ACM technical symposium on Computer science education
2014-03-05
DOI: 10.1145/2538862.2539031
Abstract
This workshop presents an award-winning, NSF-funded, project-based curriculum for algorithm design that includes algorithmic strategies for intractable problems. This curriculum is a sequence of laboratory projects comprising increasingly sophisticated solvers for a single intractable problem, designed to integrate into existing, one-term, undergraduate courses that teach algorithm design and/or intractability without sacrificing traditional course content. The presenters have used the curriculum in the Design and Analysis of Algorithms course at their institution to help students tackle and appreciate intractability. This workshop presents versions of the curriculum centered on TSP, SAT and Sudoku. Attendees will receive adoption materials and access to an adopters' forum. NSF is funding the development, evaluation, dissemination and adoption of the curriculum. Potential adopters are encouraged to apply for funding to attend this workshop and SIGCSE 2014 at http://www.rowan.edu/~lobo/AlgosCurriculum. This material is based upon work supported by the National Science Foundation under Grant No. 1140753. Laptop optional
CS professional development MOOCs
Erin Mindell, Karen Brennan, Gwendolyn Britton, Jennifer S. Kay, Jennifer Rosato
SIGCSE '14 Proceedings of the 45th ACM technical symposium on Computer science education
2014-03-05
DOI: 10.1145/2538862.2538872
Abstract
CS4HS (Computer Science for High School) is an initiative sponsored by Google to promote Computer Science and Computational Thinking in high school and middle school curricula. In the past, workshops were offered in a face-to-face format; however, this left many K-12 computer science teachers unable to attend a workshop in their geographical region. During the 2013 round of funding, Google funded the creation of 4 workshops to be delivered in an online format, open to teachers across the United States and beyond. The panelists will share their experiences with development and deployment of large scale workshops that aim to fill the gap in professional development for K-12 computer science teachers.
Sneaking in through the back door: introducing k-12 teachers to robot programming
Jennifer S. Kay, Janet G. Moss, Shelly Engelman, Tom McKlin
SIGCSE '14 Proceedings of the 45th ACM technical symposium on Computer science education
2014-03-05
DOI: 10.1145/2538862.2538972
Abstract
Few question the need to offer excellent programs in computer science at the Bachelors and Graduate Levels. But computer science is not just for computer scientists! An understanding of key computer science concepts is essential to comprehending the underpinnings of what drives much of the culture and environment that students will encounter upon graduation. Unfortunately, in the United States most state, regional, and national K-12 standards do not include computer science among the core competencies required of all students. However, careful study reveals many opportunities to satisfy mandatory non-computer-science standards while simultaneously teaching important concepts in computer science. This paper begins with an overview of these standards and suggests that educational robotics could be incorporated into K-12 curricula to satisfy these standards.
But even if robots truly are a magic panacea, most K-12 teachers have never used them. The remainder of this paper discusses a pair of 3-day workshops we offered in the summers of 2011 and 2012 which were designed to introduce K-12 teachers with no prior programming experience to LEGO robot programming. We discuss the content of the workshops, how teachers' skills and attitudes changed as a result of these workshops, and how teachers used the material they learned in their schools.
The challenges of using a MOOC to introduce "absolute beginners" to programming on specialized hardware
Jennifer S. Kay, Tom McKlin
L@S '14 Proceedings of the first ACM conference on Learning @ scale conference
2014-03-04
DOI: 10.1145/2556325.2567886
Abstract
Educational Robotics for Absolute Beginners is a MOOC designed to introduce K-12 teachers with no prior computer science or robotics experience to the basics of LEGO NXT Robot programming. The course was developed following several successful in-person workshops on the same topic. This paper introduces some of the issues that arose as we transitioned the material to a MOOC, describes some of the unique challenges we faced by incorporating specialized hardware into a MOOC, and presents some preliminary data evaluating the success of our approach.
Collective prediction of protein functions from protein-protein interaction networks
Qingyao Wu, Yunming Ye, Michael K. Ng, Shen-Shyang Ho, Ruichao Shi
BMC Bioinformatics, Volume 15 - Supplements
2014-01-24
DOI: 10.1186/1471-2105-15-S2-S9
Abstract
Automated assignment of functions to unknown proteins is one of the most important tasks in computational biology. The development of experimental methods for genome scale analysis of molecular interaction networks offers new ways to infer protein function from protein-protein interaction (PPI) network data. Existing techniques for collective classification (CC) usually increase accuracy for network data, wherein instances are interlinked with each other, using a large amount of labeled data for training. However, labeled data are time-consuming and expensive to obtain. On the other hand, one can easily obtain a large amount of unlabeled data. Thus, more sophisticated methods are needed to exploit the unlabeled data to increase prediction accuracy for protein function prediction.
A Generalized Morphological Skeleton Transform Using both Internal and External Skeleton Points
Jianning Xu
Pattern Recognition 47(8)
2014
Abstract
The morphological skeleton transform (MST) is a leading morphological shape representation scheme. In the MST, a given shape is represented as the union of all the maximal disks contained in the shape. The concept of external skeleton points and external maximal disks has been used for shape description and characterization purposes. In this paper, we develop a generalized morphological skeleton transform that combines the concepts of internal and external maximal disks into a unified framework. In this framework, a shape is described in terms of disk components that need to be added as well as disk components that need to be removed. The procedures and formulae describing the extraction of the disk components and the reconstruction of the original shape from these components are developed. The correctness of the procedures and formulae is established. This new framework seems to provide a more powerful and more natural way of modeling the approximation and reconstruction of binary shapes using primitive shape components.
Cluster Tree based Multi-Label Classification for Protein Function Prediction
Qingyao Wu, Yunming Ye, Xiaofeng Zhang, Shen-Shyang Ho
BIBM 2013
2013-12-18
Abstract
Automatically assigning functions for unknown proteins is a key task in computational biology. Proteins in nature have multiple classes according to the functions they perform. Many efforts have been made to cast the protein function prediction into a multi-label learning problem. This paper proposes a novel Cluster Tree based Multi-label Learning algorithm (CTML) for protein function prediction. The main idea is to compute a set of predictive labels associated at each node for multi-label prediction by using the k-means clustering techniques and the predictive functions via the learning data at the nodes. With the propagation of the predictive labels from the root node to the leaf node, the correlations between labels can be preserved. Experimental results on benchmark data (genbase and yeast datasets) show that the proposed CTML algorithm is effective in predicting protein functions. Moreover, the classification performance of the CTML algorithm is competitive against the other baseline multi-label learning algorithms.
Robotics in computer science education
Jennifer S. Kay, Tom Lauwers
Computer Science Education Volume 23, Issue 4, 2013
2013-11-14
DOI: 10.1080/08993408.2013.856614
Abstract
Robots would not be robots without computer science. Leave out computer science and what remains are fancy mechanisms and remotely controlled machines. It is no surprise then that robots appear in numerous computer science courses. The papers in this special issue, like the title of the issue itself, represent two distinct topics. The first two papers investigate the use of robots in teaching computer science concepts to a general audience. The final three papers study how to best teach concepts in robotics to upper-level computer science students. Taken together, these papers show the use of robotics in education with 11-year-old school children all the way up to students pursuing graduate studies in computer science.
Using high-powered long-range zigbee devices for communication during amateur car racing events
James Wakemen, Matthew Hodson, Philip Shafer, Vasil Hnatyshin
2013 3rd International Conference on Wireless Communications, Vehicular Technology, Information Theory and Aerospace and Electronic Systems, VITAE 2013 - Co-located with Global Wireless Summit 2013
2013-06-27
DOI: 10.1109/VITAE.2013.6617066
Abstract
In the world of amateur motorsports the racers are always looking to improve their skills. This has led to the development of various monitoring and data logging systems, often coupled with 'action sports' cameras which capture the car performance in enough detail to allow the driver to study the race after completion. In this paper we investigate the possibility of using data logging systems together with IEEE 802.15.4 high power devices for race car - pit crew communication during the race. This will allow the pit crew to monitor the car and get the driver to stop before a catastrophic failure occurs. We used the OPNET IT Guru ver. 17.0 network software package to conduct our simulation study. Specifically, we focus on such aspects of the 802.15.4 protocol as communication range, reliable data delivery, achievable throughput, and end-to-end delay experienced by the application during the car race. © 2013 IEEE.
Pedagogical Enhancements to the DeSymbol Logic Translator
Darren Provine, Nancy Tinkham
Proceedings of the 19th ACMS Conference
2013-05
Abstract
DeSymbol is a program that translates first-order predicate logic expressions into English. It is intended to be a practice tool for students who are learning logic for the first time or who are trying to refresh their memories if they need to use symbolic logic for an upper-level course. Students start with an English sentence and translate it by hand into symbolic logic notation; then they can check their work by using DeSymbol to translate their notation back into English. If the English sentence produced by DeSymbol differs significantly from the original English sentence, this helps the student to see what error was made in the logic expression.
The latest version of DeSymbol adds support for prepositions, so that the student can now test expressions such as on(a, b) and ∀ x ∀ y (on(x, y) → under(y, x)). It also now supports a wider variety of idiomatic translations, including improved translations of common student mistakes. For example, the student who begins with the English sentence "All cats are mammals" and writes the expression ∀ x(cat(x) ∧ mammal(x)) will see DeSymbol re-translate the expression as "Everything is a cat and a mammal", which helps the student to see why the expression is incorrect.
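As a worked illustration of the mistake described above (an editorial note, not part of the original abstract): the conjunction form asserts that every object in the domain is both a cat and a mammal, whereas the intended statement is conditional. A standard formalization of the two readings, given only as a sketch, is:

```latex
% Student's expression: claims every object is both a cat and a mammal
\forall x\,\bigl(\mathrm{cat}(x) \land \mathrm{mammal}(x)\bigr)

% Standard formalization of "All cats are mammals"
\forall x\,\bigl(\mathrm{cat}(x) \rightarrow \mathrm{mammal}(x)\bigr)
```

The first formula is false in any domain that contains a non-cat, which is why the re-translated English sentence exposes the error.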
Patent trends among small and large innovative firms during the 2007-2009 recession
Anthony Breitzman
US Small Business Administration
2013-05
Abstract
This report describes the key findings from an ambitious project designed to measure the effects of the recent economic downturn on highly innovative US small and large firms. For this project, we leveraged and updated a detailed database of 1,279 small and large technology firms built for the SBA-Green project (Breitzman and Thomas, 2010). The firms in this database are referred to as highly innovative firms because to enter the database they must have been granted at least 15 US patents in the period 2005-09. As such, they are a special subset of US firms that produce significant numbers of patents. In total, these firms own more than one million patents.

For the SBA-Green project, we built the database of innovative firms and then removed any that appeared to be out of business through December 31, 2009. We also identified the number of employees as of that date, and tagged as small businesses any with 500 or fewer employees. In addition to patent information, the database contains information on revenues, and industry classification where available. There are significant advantages to leveraging the database from the prior study. First, it contains all US firms with extensive patent activity over the period 2005-09, and is not restricted to small or large firms. Second, the patent activity for these firms during these years covers both the period leading to the recession, and the period after the recession had begun.

As noted above, we used the selection criteria of 15 patents from 2005-09 in order to leverage the extensive database from the previous project. For this project, we enhanced the database to include US patents granted from January 1, 2005 through June 30, 2011, as well as all published US applications from January 1, 2005 through July 31, 2011.
Predicting Mobile Call Behavior via Subspace Methods
Peng Dai, Wanqing Yang, Shen-Shyang Ho
SBP 2013
2013-04-02
Abstract
We investigate behavioral prediction approaches based on subspace methods such as principal component analysis (PCA) and independent component analysis (ICA). Moreover, we propose a personalized sequential prediction approach to predict next day behavior based on features extracted from past behavioral data using subspace methods. The proposed approach is applied to the individual call (voice calls and short messages) behavior prediction task. Experimental results on the Nokia Mobile Data Challenge (MDC) dataset are used to show the feasibility of our proposed prediction approach. Furthermore, we investigate whether prediction accuracy can be improved (i) when a specific call type (voice call or short message), instead of the general call behavior prediction, is considered in the prediction task, and (ii) when workday and weekend scenarios are considered separately.
Preserving Privacy for Interesting Location Pattern Mining from Trajectory Data
Shen-Shyang Ho, Shuhua Ruan
Trans. Data Privacy 6(1)
2013-04-01
Abstract
One main concern for individuals participating in the data collection of personal location history records (i.e., trajectories) is the disclosure of their location and related information when a user queries for statistical or pattern mining results such as frequent locations derived from these records. In this paper, we investigate how one can achieve the privacy goal that the inclusion of an individual's location history in a statistical database with interesting location mining capability does not substantially increase the risk to that individual's privacy. In particular, we propose an (ε, δ)-differentially private interesting geographic location pattern mining approach motivated by the sample-aggregate framework. The approach uses spatial decomposition to limit the number of stay points within a localized spatial partition, followed by density-based clustering. The (ε, δ)-differential privacy mechanism is based on translation and scaling insensitive Laplace noise distribution modulated by database instance dependent smoothed local sensitivity. Unlike the database independent ε-differential privacy mechanism, the output perturbation from an (ε, δ)-differential privacy mechanism depends on a lower (local) sensitivity, resulting in better query output accuracy and hence more useful output at a higher privacy level, i.e., smaller ε. We demonstrate our (ε, δ)-differentially private interesting geographic location discovery approach using the region quadtree spatial decomposition followed by the DBSCAN clustering. Experimental results on the real-world GeoLife dataset are used to show the feasibility of the proposed (ε, δ)-differentially private interesting location mining approach.
An Effective Vortex Detection Approach for Velocity Vector Field
Shen-Shyang Ho
ICPR 2012
2012-11-15
Abstract
Detection of vortices, which are rotating flow features, is an important task to identify, analyze, and understand flow dynamics in a fluid. For example, it can be used to accurately tag nonrigid salient rotation features from large amounts of wind vectors captured by orbiting satellites for hurricane research. In this paper, we describe in detail a general vortex detection algorithm motivated by the Hough transform and flow vector tree structures. The vortex detection algorithm allows one to find the exact vortex center efficiently if it is in the vector field. A special case of the algorithm has been successfully applied to cyclone annotation and tracking using QuikSCAT satellite wind measurements.
Mining multivariate spatiotemporal patterns from heterogeneous mobility data
Shen-Shyang Ho
SIGSPATIAL/GIS 2012
2012-11-06
Abstract
Mobility data mining in the form of trajectory data mining has been extensively investigated in recent years. Predictive modeling and pattern discovery approaches have been proposed to predict movements and locations, and to extract useful trajectory and location patterns. Nowadays, mobility data consist of not only trajectory data. Mobility data from smart phones include measurements such as call duration/time, call type, digital media consumption, calendar information, apps usage, social interactions, and mobile browsing. These heterogeneous multivariate data allow one to discover interesting and more complex behavioral patterns and rules in terms of space and time.

In this paper, we investigate spatiotemporal rule mining on heterogeneous multivariate mobility data. We propose a systematic approach consisting of three main steps: data fusion, frequent temporal multivariate-location extraction, and rule generation. In particular, we explore the task of extracting multivariate spatiotemporal patterns corresponding to the "where", "when", and "who" queries (and their combinations) related to phone call variables collected from smart phone users. Experimental results on the data from the Nokia Mobile Data Challenge are used to show the feasibility and usefulness of our proposed approach.
Mining future spatiotemporal events and their sentiment from online news articles for location-aware recommendation system
Shen-Shyang Ho, Mike Lieberman, Pu Wang, Hanan Samet
MobiGIS 2012
2012-11-06
Abstract
The future-related information mining task for online web resources such as news articles and blogs has been getting more attention due to its potential usefulness in supporting individual's decision making in a world where massive new data are generated daily. Instead of building a data-driven model to predict the future, one extracts future events from these massive data with high probability that they occur at a future time and a specific geographic location. Such spatiotemporal future events can be utilized by a recommender system on a location-aware device to provide localized future event suggestions.

In this paper, we describe a systematic approach for mining future spatiotemporal events from the web, in particular from news articles. In our application context, a valid event is defined both spatially and temporally. The mining procedure consists of two main steps: recognition and matching. For the recognition step, we identify and resolve toponyms (geographic locations) and future temporal patterns. In the matching step, we perform spatiotemporal disambiguation, de-duplication, and pairing. To provide more useful future event guidance, we attach to each event a sentiment linguistic variable: positive, negative, or neutral, so that one may use this extracted event information for recommendation purposes in the form of "avoid Event A" or "avoid geographic location L at time T" or "attend Event B" based on the event sentiment. The identified future event consists of its geographic location, temporal pattern, sentiment variable, news title, key phrase, and news article URL. Experimental results on 3652 news articles from 21 online news sources collected over a 2-week period in the Greater Washington area are used to illustrate some of the critical steps in our mining procedure.
Using robots to teach programming to K-12 teachers
Jennifer S. Kay, Janet G. Moss
2012 Frontiers in Education Conference Proceedings
2012-10-03
DOI: 10.1109/fie.2012.6462375
Abstract
We present the results of a pilot study in which twenty K-12 teachers were introduced to LEGO NXT-G robot programming through a three-day summer workshop. Our aim was to give teachers the confidence and skills to start after-school robotics programs with their students. We present details on the workshop, including the approach we used to recruit teachers and an overview of the three-day course. We discuss the data gathered from the teachers following the workshop and also give our own recommendations for others who may wish to run a similar program. Participants ranged from elementary school general classroom teachers to high school math, science, and even computer science teachers. Prior to attending our workshop, 89% of the teachers had little or no programming experience and generally were not very confident in their own ability to be able to learn how to program a robot. After completing the workshop, their confidence increased dramatically and they had a strong expectation that they would use the material with their students. A follow-up survey nine months later indicated that hundreds of students and many colleagues were impacted in the first year alone.
A Generalized Morphological Skeleton Transform Using both Internal and External Skeleton Points
Jianning Xu
Proceedings of 2012 International Conference on Image Processing, Computer Vision, and Pattern Recognition
2012-07
Abstract
The morphological skeleton transform (MST) is a leading morphological shape representation scheme. In the MST, a given shape is represented as a union of all maximal disks contained in the shape. The concepts of external skeleton points and external maximal disks were introduced recently to derive so-called external shape components for shape matching purposes. In this paper, we develop a generalized morphological skeleton transform that combines the concepts of internal and external maximal disks into a unified framework. In this framework, a shape is described in terms of disk components that need to be added as well as disk components that need to be removed. This framework provides a more natural way of modeling the approximation and reconstruction of binary shapes.
Preserving privacy for moving objects data mining
Shen-Shyang Ho
ISI 2012
2012-06-11
Abstract
The prevalence of mobile devices with geopositioning capability has resulted in rapid growth in the amount of moving object trajectories. These data have been collected and analyzed for both commercial (e.g., recommendation system) and security (e.g., surveillance and monitoring system) purposes. One needs to ensure the privacy of these raw trajectory data and the derived knowledge by not disclosing or releasing them to an adversary. In this paper, we propose a practical implementation of a (ε; δ)-differentially private mechanism for moving objects data mining; in particular, we apply it to the frequent location pattern mining algorithm. Experimental results on the real-world GeoLife dataset are used to compare the performance of the (ε; δ)-differential privacy mechanism with the standard ε-differential privacy mechanism.
Analysis of Patent Referencing to IEEE Papers, Conferences, and Standards 1997-2011
Anthony Breitzman
Report prepared for IEEE
2012-04-02
Abstract
In previous studies, it was found that patents reference papers from IEEE journals much more often than papers from other journal publishers. In this report, we update the previous results, and study US patents issued from January 1997 through December 2011. Although this report is an update of previous results, we have made our best efforts to make this report self-contained. The aim of this report, as in previous reports, is to analyze references from patents to journal articles, conferences and standards documents, in order to assess IEEE's impact upon technological developments.
This report covers 12 subcategories of Technology where IEEE members and readers are active. Many, but not all, are related to Information Technology.
Calico: a multi-programming-language, multi-context framework designed for computer science education
Douglas Blank, Jennifer S. Kay, James Marshall, Keith O'Hara, Mark Russo
SIGCSE '12 Proceedings of the 43rd ACM technical symposium on Computer Science Education
2012-02-29
DOI: 10.1145/2157136.2157158
Abstract
The Calico project is a multi-language, multi-context programming framework and learning environment for computing education. This environment is designed to support several interoperable programming languages (including Python, Scheme, and a visual programming language), a variety of pedagogical contexts (including scientific visualization, robotics, and art), and an assortment of physical devices (including different educational robotics platforms and a variety of physical sensors). In addition, the environment is designed to support collaboration and modern, interactive learning. In this paper we describe the Calico project, its design and goals, our prototype system, and its current use.
Apple Has the Most Powerful Patent Portfolio in Consumer Electronics
Anthony Breitzman, Patrick Thomas
Journal Article
2011-11-21
Abstract
New Patent Power Scorecards show U.S. companies like Apple, Microsoft, Google, and Yahoo inventing their way to global dominance.
Differential privacy for location pattern mining
Shen-Shyang Ho, Shuhua Ruan
SPRINGL 2011
2011-11-01
Abstract
One main concern for individuals participating in the data collection of personal location history records is the disclosure of their location and related information when a user queries for statistical or pattern mining results derived from these records. In this paper, we investigate how to achieve the privacy goal that the inclusion of one's location history in a statistical database with location pattern mining capabilities does not substantially increase one's privacy risk. In particular, we propose a differentially private pattern mining algorithm for interesting geographic location discovery using a region quadtree spatial decomposition to preprocess the location points followed by applying a density-based clustering algorithm. A differentially private region quadtree is used for both de-noising the spatial domain and identifying the likely geographic regions containing the interesting locations. Then, a differential privacy mechanism is applied to the algorithm outputs, namely the interesting regions and their corresponding stay point counts. The quadtree spatial decomposition enables one to obtain a localized reduced sensitivity to achieve the differential privacy goal and accurate outputs. Experimental results on synthetic datasets are used to show the feasibility of the proposed privacy preserving location pattern mining algorithm.
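To make the idea of perturbing stay point counts concrete, the following is a minimal sketch of the textbook Laplace mechanism for ε-differential privacy, not the authors' algorithm. The function names, the region labels, and the default sensitivity of 1 (which assumes each individual changes any count by at most one) are all illustrative assumptions; the quadtree decomposition and clustering steps are not shown.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two
    independent exponential samples with rate 1/scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Return noisy copies of per-region counts (hypothetical helper).

    counts      -- dict mapping a region label to its stay point count
    epsilon     -- privacy parameter; smaller means more noise
    sensitivity -- assumed maximum change in the counts caused by one
                   individual's data (an assumption, not from the paper)
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon  # Laplace scale b = sensitivity / epsilon
    return {region: c + laplace_noise(scale, rng)
            for region, c in counts.items()}


if __name__ == "__main__":
    # Illustrative regions and counts only.
    print(private_counts({"region_a": 42, "region_b": 7}, epsilon=0.5, seed=0))
```

A lower sensitivity gives a smaller noise scale at the same ε, which is the reason a localized, reduced sensitivity improves output accuracy.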
Public school students left behind: Contrasting the trends in public and private school computer science advanced placement participation
Kevin Freisen, Tim Sanders, Jennifer S. Kay
2011 Frontiers in Education Conference (FIE)
2011-10-15
DOI: 10.1109/fie.2011.6143080
Abstract
Across the United States, interest in computer science as a major is down, as is the number of Bachelor's degrees in computer science. While there are obvious factors like the dot com bust that may explain much of our communal enrollment crash over the last few years, anecdotal reports also suggest that the No Child Left Behind Act of 2001 (NCLB), and specifically the fact that computer science is not an area that students are tested on, may be a factor in the decreased presence of computer science at the high school level. But how can we empirically separate the effect of the dot com bust from that of NCLB given the proximity in time of the two events? This paper presents a first attempt to do so: recognizing the fact that private schools are exempt from NCLB, it seems appropriate to compare public school students with their private school counterparts. We present some initial results of our investigation focusing on our home state of New Jersey. This paper discusses these results and further directions of study.
Work in progress - Programming in a confined space - A case study in porting modern robot software to an antique platform
Stacey L. Montresor, Jennifer S. Kay, Michel Tokic, Jonathan M. Summerton
2011 Frontiers in Education Conference (FIE)
2011-10-14
DOI: 10.1109/fie.2011.6143099
Abstract
In a typical introductory AI class, the topic of reinforcement learning may be allocated only a few hours of class time. One engaging example of reinforcement learning uses a crawling robot that learns to use its two-degree-of-freedom arm to drag itself forward. Unfortunately, the required hardware is prohibitively expensive for many departments for what is typically a once-a-semester demonstration. So we decided to port the algorithm to a platform that many departments may already have on hand: the LEGO Mindstorms RCX 2.0. Initially the task seemed relatively straightforward: build a robot base out of LEGO parts and implement the algorithm in the Not Quite C language. However, the challenges of designing a robot arm without servos and attempting to trim code down to a size that would fit on the RCX have proven to be as educational to the undergraduates working on the project as we hope the final product will be to students in AI classes. This paper describes the challenges we have faced and the solutions we have implemented, as well as the work that remains to be completed.
A comparative study of location aided routing protocols for MANET
Vasil Hnatyshin, Malik Ahmed, Remo Cocco, Daniel Urbano
IFIP Wireless Days Volume 1, Issue 1, 2011, Article number 6098169
2011-10-12
DOI: 10.1109/WD.2011.6098169
Abstract
Location-aided routing (LAR) is a mechanism which attempts to reduce the control message overhead of the Ad hoc On-Demand Distance Vector (AODV) routing protocol by flooding only the portion of the network that is likely to contain the route to the destination. LAR takes advantage of Global Positioning System (GPS) coordinates to identify a possible location of the destination node. Based on this information, LAR defines a portion of the network which will be subject to the limited flooding, thus reducing the total number of control packets traveling through the network during the route discovery process. GeoAODV is a variation of the AODV protocol which, like LAR, also employs GPS coordinates to limit the search area used during the route discovery process. However, unlike LAR, GeoAODV does not make the assumption that every node in the network knows the traveling speed and location of the corresponding destination node. Instead, GeoAODV tries to dynamically learn and distribute location information among the nodes in the network. This paper examines and compares through simulation the performance of AODV, LAR, and GeoAODV protocols under different environmental settings.
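The limited flooding idea in the abstract above can be sketched in Python as follows. The sketch assumes the rectangular request zone commonly associated with LAR: the smallest axis-aligned rectangle containing the source and a circle around the destination's last known position, whose radius is the destination's speed multiplied by the time elapsed since that position was recorded. The function names and the 2-D coordinate representation are assumptions for illustration and are not taken from the paper.

```python
def request_zone(source, dest_last_known, dest_speed, elapsed):
    """Return (x_min, y_min, x_max, y_max) of an assumed rectangular request zone.

    source          -- (x, y) position of the node starting route discovery
    dest_last_known -- (x, y) last known position of the destination
    dest_speed      -- destination's travel speed (distance units per time unit)
    elapsed         -- time since dest_last_known was recorded
    """
    radius = dest_speed * elapsed          # how far the destination may have moved
    xs, ys = source
    xd, yd = dest_last_known
    return (min(xs, xd - radius), min(ys, yd - radius),
            max(xs, xd + radius), max(ys, yd + radius))


def should_forward(node, zone):
    """A node rebroadcasts a route request only if it lies inside the zone."""
    x, y = node
    x_min, y_min, x_max, y_max = zone
    return x_min <= x <= x_max and y_min <= y <= y_max
```

Nodes outside the rectangle drop the route request, which is where the reduction in control traffic comes from. GeoAODV, as described in the abstract, differs in that nodes do not start out knowing the destination's speed and location.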
Analysis of Small Business Innovation in Green Technologies
Anthony Breitzman, Patrick Thomas
Office of Advocacy, United States Small Business Administration, Contract
2011-10
Abstract
Previous Advocacy-funded studies of small business patenting activity established the existence of a cohort of independent, for-profit innovative small firms with 15 or more patents over a five-year period. The studies also showed that innovative small firms had a higher percentage of emerging technology patents in their portfolios than their larger counterparts. A recent focus on "green" jobs, businesses, and technology led to this study of a subset of these innovative patent holders. This project was designed to highlight differences in the patent activity of small and large firms in green technologies and industries.
Shape matching using both internal and external morphological shape components
Jianning Xu
Proceedings of 2011 International Conference on Image Processing, Computer Vision, and Pattern Recognition
2011-07
Abstract
A number of morphological shape representation algorithms have been proposed over the years. However, not many shape matching algorithms have been developed based on these representation algorithms. In this paper, we present a structural shape matching algorithm that uses both internal and external shape components selected from the maximal disks determined by traditional and generalized morphological skeleton transforms. The algorithm uses relaxation labeling to maintain structural consistency and to derive matching scores. The experiments show that the matching algorithm produces good matching results and performs better than an earlier algorithm that uses internal components only.
A Moving Objects Database Infrastructure for Hurricane Research: Data Integration and Complex Object Management
Markus Schneider, Shen-Shyang Ho, Malvika Agrawal, Tao Chen, Hechen Liu, Ganesh Viswanathan
Earth Science Technology Forum
2011-06-26
Abstract
Current web-based weather event and satellite data portals provide large amounts of data over a historical timeline. However, users of these portals often get access to data only through limited, pre-defined queries based on a strict set of criteria and event trajectories. Desirable capabilities, such as spatial-temporal analysis, efficient satellite data retrieval, and ad-hoc queries on trajectory data, are not available in these information systems and data archives. In this paper, we describe our current work on and progress in the development of a sophisticated moving objects database infrastructure designed primarily to allow ad-hoc querying of dynamic atmospheric events (e.g., hurricanes and storm systems) and the efficient retrieval of satellite (e.g., QuikSCAT, TRMM) measurements. In particular, we describe our progress in the integration of tropical cyclone event data and satellite measurements from different sources into a single moving objects database system for scientific users to perform ad-hoc queries and sophisticated spatiotemporal analysis. Moreover, we describe how a user can remotely connect her personal analysis software to the database system to perform flexible querying on tropical cyclone best track data and retrieve the associated satellite measurements. Finally, we show how complex objects like hurricane trajectories and massive satellite sensor trajectories with measurements can be effectively stored and handled in a database context using our novel iBLOB (Intelligent Binary Large Objects) concept and data structure.
Intelligent Evidence-Based Management for Data Collection and Decision-Making Using Algorithmic Randomness and Active Learning
Harry Wechsler, Shen-Shyang Ho
Intelligent Information Management 3(4)
2011-05-10
Abstract
We describe here a comprehensive framework for intelligent information management (IIM) of data collection and decision-making actions for reliable and robust event processing and recognition. This is driven by algorithmic information theory (AIT), in general, and algorithmic randomness and Kolmogorov complexity (KC), in particular. The processing and recognition tasks addressed include data discrimination and multi-layer open set data categorization, change detection, data aggregation, clustering and data segmentation, data selection and link analysis, data cleaning and data revision, and prediction and identification of critical states. The unifying theme throughout the paper is that of "compression entails comprehension", which is realized using the interrelated concepts of randomness vs. regularity and Kolmogorov complexity. The constructive and all encompassing active learning (AL) methodology, which mediates and supports the above theme, is context-driven and takes advantage of statistical learning, in general, and semi-supervised learning and transduction, in particular. Active learning employs explore and exploit actions characteristic of closed-loop control for evidence accumulation in order to revise its prediction models and to reduce uncertainty. The set-based similarity scores, driven by algorithmic randomness and Kolmogorov complexity, employ strangeness/typicality and p-values. We propose the application of the IIM framework to critical states prediction for complex physical systems; in particular, the prediction of cyclone genesis and intensification.
Contextualized approaches to introductory computer science: the key to making computer science relevant or simply bait and switch?
Jennifer Kay
SIGCSE '11 Proceedings of the 42nd ACM technical symposium on Computer science education
2011-03-09
DOI: 10.1145/1953163.195321
Abstract
America's youth perceive Computer Science to be difficult, tedious, boring, irrelevant and asocial. Unfortunately, many traditional introductory Computer Science classes and textbooks do little to improve that image. In contrast, contextualized approaches to teaching introductory Computer Science are very attractive. Instead of writing a leap year program, students can learn about conditional statements by programming a robot to follow a light, or by creating an animation to tell a story, or even by modifying a picture of the college president so that she is wearing a neon orange jacket instead of a navy blue one. The arguments in favor of contextualized approaches to attract non-Computer-Science-majors to our classes are very persuasive. But what about students who then choose to major or minor in Computer Science? Of course we want to offer them interesting and engaging first courses in Computer Science, and indeed this may help with our efforts to attract more students to our programs. But what happens in subsequent semesters? The purpose of this paper is to initiate a general discussion on the use of any sort of "cool" new approach into both undergraduate and K-12 Computer Science education. These approaches successfully attract students to study subjects that we ourselves are deeply engaged in. But we need to discuss as a community what happens to students who do choose to major or minor in Computer Science when our individual classes conclude and the rest of their studies commence.
The Clean Tech 50: IEEE Spectrum and 1790 Analytics Rank the 50 Top Clean Tech Patent Portfolios
Anthony Breitzman, Patrick Thomas
Journal Article
2010-10
Abstract
1790 Analytics, cofounded by Anthony Breitzman and Patrick Thomas, presents "the first Clean Tech 50, a scorecard containing the 50 organizations with the world's most powerful clean technology patent portfolios of 2009."
Tropical cyclone event sequence similarity search via dimensionality reduction and metric learning
Shen-Shyang Ho, Wenqing Tang, W. Timothy Liu
KDD '10 Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
2010-07-25
Abstract
The Earth Observing System Data and Information System (EOSDIS) is a comprehensive data and information system which archives, manages, and distributes Earth science data from the EOS spacecraft. One capability missing from the EOSDIS is the retrieval of satellite sensor data based on the output of similarity queries over weather events (such as tropical cyclones).

In this paper, we propose a framework to solve the similarity search problem given user-defined instance-level constraints for tropical cyclone events, represented by arbitrary length multidimensional spatio-temporal data sequences. A critical component for such a problem is the similarity/metric function to compare the data sequences. We describe a novel Longest Common Subsequence (LCSS) parameter learning approach driven by nonlinear dimensionality reduction and distance metric learning. Intuitively, arbitrary length multidimensional data sequences are projected into a fixed dimensional manifold for LCSS parameter learning. Similarity search is achieved through consensus among the (similar) instance-level constraints based on ranking orders computed using the LCSS-based similarity measure.

Experimental results using a combination of synthetic and real tropical cyclone event data sequences are presented to demonstrate the feasibility of our parameter learning approach and its robustness to variability in the instance constraints. We then use a similarity query example on real tropical cyclone event data sequences from 2000 to 2008 to discuss (i) a problem of scientific interest, and (ii) challenges and issues related to the weather event similarity search problem.
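For readers unfamiliar with the Longest Common Subsequence (LCSS) measure mentioned in the abstract above, the following Python sketch shows one common dynamic-programming formulation for multidimensional sequences of arbitrary length. The matching tolerance eps and the time-warping window delta are the kind of parameters the paper proposes to learn; here they are simply passed in by hand. The function names and the normalization by the shorter sequence length are assumptions for illustration, not the authors' implementation.

```python
def lcss_length(a, b, eps, delta):
    """Length of the longest common subsequence of two point sequences.

    a, b  -- sequences of equal-dimension tuples (lengths may differ)
    eps   -- two points match if every coordinate differs by at most eps
    delta -- two points may match only if their indices differ by at most delta
    """
    n, m = len(a), len(b)
    # table[i][j] holds the LCSS length of the prefixes a[:i] and b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close = all(abs(x - y) <= eps for x, y in zip(a[i - 1], b[j - 1]))
            if close and abs(i - j) <= delta:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcss_similarity(a, b, eps, delta):
    """Normalize the LCSS length to [0, 1] by the shorter sequence length."""
    if not a or not b:
        return 0.0
    return lcss_length(a, b, eps, delta) / min(len(a), len(b))
```

Because the measure skips unmatched points rather than penalizing them, it tolerates sequences of different lengths and occasional outliers, which suits event tracks of varying duration.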
Practical methodology for modeling wireless routing protocols using OPNET modeler
Vasil Hnatyshin, Hristo Asenov, John Robinson
Proceedings of the IASTED International Conference on Modelling and Simulation
2010-07-15
Abstract
OPNET Modeler is one of the most popular commercial products for simulating and modeling computer networks and related technologies. While creating a new simulation study using standard models is a fairly straightforward task, developing new models or modifying existing ones can become a challenging and often frustrating undertaking. This paper provides an overview of OPNET Modeler's software architecture for modeling wireless networks and MANET routing protocols. In particular, this paper concentrates on the modeling and simulation portion of the research project that studies improvement of the Ad-Hoc On-Demand Distance Vector (AODV) routing protocol through the use of GPS coordinates. Using AODV modifications as an example, this paper introduces a practical methodology for changing existing simulation models of MANET routing protocols and seamlessly integrating them within OPNET Modeler. In addition, this paper introduces the GeoAODV protocol, which reduces the route discovery overhead through the use of GPS coordinates.
A Framework for Moving Sensor Data Query and Retrieval of Dynamic Atmospheric Events
Shen-Shyang Ho, Wenqing Tang, W. Timothy Liu, Markus Schneider
SSDBM 2010
2010-07-02
Abstract
One challenge in Earth science research is the accurate and efficient ad-hoc query and retrieval of Earth science satellite sensor data based on user-defined criteria to study and analyze atmospheric events such as tropical cyclones. The problem can be formulated as a spatio-temporal join query to identify the spatio-temporal location where moving sensor objects and dynamic atmospheric event objects intersect, either precisely or within a user-defined proximity. In this paper, we describe an efficient query and retrieval framework to handle the problem of identifying the spatio-temporal intersecting positions for satellite sensor data retrieval. We demonstrate the effectiveness of our proposed framework using sensor measurements from QuikSCAT (wind field measurement) and TRMM (precipitation vertical profile measurements) satellites, and the trajectories of the tropical cyclones occurring in the North Atlantic Ocean in 2009.
Moving Objects Database Technology for Ad-Hoc Querying and Satellite Data Retrieval of Dynamic Atmospheric Events
Markus Schneider, Shen-Shyang Ho, Tao Chen, Arif Khan, Ganesh Viswanathan, Wenqing Tang, W. Timothy Liu
NASA Earth Science Technology Forum
2010-06-22
Abstract
Existing state-of-the-art and web-based weather event information portals, data archives, and forecast services provide excellent subsetting and visualizations of weather events and satellite sensor measurements. However, users only obtain limited, simple, and hard-coded query, retrieval, and analysis capabilities from these sources. One non-existent but desirable capability is the accurate and efficient ad-hoc querying of the trajectories of dynamic atmospheric events (e.g., tropical cyclones, hurricanes) as well as the efficient retrieval of Earth science satellite sensor data for these events based on ad-hoc, user-defined criteria and the event trajectories. In this paper, we describe our current work and progress in the development of a sophisticated framework based on moving objects database technology for ad-hoc querying and retrieval of atmospheric events and their associated satellite measurements. Such a capability is extremely important to scientists who process sensor data of atmospheric events for statistical analysis and scientific investigation.
A Martingale Framework for Detecting Changes in Data Streams by Testing Exchangeability
Shen-Shyang Ho, Harry Wechsler
IEEE Transactions on Pattern Analysis and Machine Intelligence Volume 32 Issue 12
2010-03-18
DOI: 10.1109/TPAMI.2010.48
Abstract
In a data streaming setting, data points are observed sequentially. The data generating model may change as the data are streaming. In this paper, we propose detecting this change in data streams by testing the exchangeability property of the observed data. Our martingale approach is an efficient, nonparametric, one-pass algorithm that is effective on the classification, cluster, and regression data generating models. Experimental results show the feasibility and effectiveness of the martingale methodology in detecting changes in the data generating model for time-varying data streams. Moreover, we also show that: (1) An adaptive support vector machine (SVM) utilizing the martingale methodology compares favorably against an adaptive SVM utilizing a sliding window, and (2) a multiple martingale video-shot change detector compares favorably against standard shot-change detection algorithms.
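The one-pass test described in the abstract above can be sketched in Python using a randomized power martingale, a standard construction for testing exchangeability. Everything specific in this sketch is an assumption made for illustration: the strangeness measure (absolute deviation from the mean of the points seen so far, for one-dimensional data), the exponent epsilon = 0.92, the alarm threshold of 20, and the function names. For clarity it recomputes strangeness values at every step rather than updating them incrementally.

```python
import random


def p_value(strangeness, theta):
    """Randomized p-value of the newest point (last entry of strangeness)."""
    current = strangeness[-1]
    greater = sum(1 for s in strangeness if s > current)
    equal = sum(1 for s in strangeness if s == current)
    return (greater + theta * equal) / len(strangeness)


def martingale_update(value, p, epsilon):
    """Multiply the martingale by the power betting function epsilon * p**(epsilon - 1)."""
    return value * epsilon * p ** (epsilon - 1.0)


def detect_change(stream, epsilon=0.92, threshold=20.0, seed=0):
    """Return the index at which the martingale first exceeds threshold, else None."""
    rng = random.Random(seed)
    seen = []
    value = 1.0
    for t, x in enumerate(stream):
        seen.append(x)
        mean = sum(seen) / len(seen)
        strangeness = [abs(v - mean) for v in seen]
        # Clamp p away from zero so the power below is always defined.
        p = max(p_value(strangeness, rng.random()), 1e-12)
        value = martingale_update(value, p, epsilon)
        if value > threshold:
            return t
    return None
```

While the data remain exchangeable the p-values are uniformly distributed and the martingale tends to stay small; a run of unusually strange points produces small p-values, each of which multiplies the martingale by a factor greater than one. By Doob's maximal inequality, a threshold of 20 bounds the false alarm probability by 1/20 under exchangeability.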
Robots in the classroom ... and the dorm room
Jennifer Kay
Journal of Computing Sciences in Colleges Volume 25 Issue 3
2010-01-01
Abstract
The purpose of this paper is twofold. First, to argue that despite some disappointing results in using robots in the computer science classroom in the past, we should not yet conclude that robots do not belong there. Second, to present the results of a small pilot study comparing two approaches to teaching an introductory computer science class -- one traditional and one which used robots. While the study is not sufficiently controlled to be considered proof of success, initial results are compelling and support the need for further investigation.
Patent Power Scorecards: Japan Ascendant
Patrick Thomas, Anthony Breitzman
Journal Article
2010
Abstract
The surprise story of this edition of the IEEE Spectrum Patent Power Scorecards is the reemergence of Japan as a global leader in innovation. Based on data from 2009, out of the 323 leading organizations in the scorecards, 65 (20 percent) are Japanese. This percentage is markedly higher than in the 2007 scorecards, in which 45 out of 319 companies (14 percent) were Japanese.