Machine learning. A classifier is trained to segment images by providing it with ground-truth data: pixels that have been manually assigned to the different classes of objects [23]. This approach typically applies various transformations to the image to capture different patterns in the local pixel neighborhood, which serve as per-pixel features. Segmentation is ultimately achieved by applying the trained model to new images to classify their pixels accordingly.
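As a concrete illustration, the sketch below trains a pixel classifier of this kind, using a small bank of filter responses as per-pixel features. The filter choices, scales, and random-forest model are illustrative assumptions, not a prescribed recipe.

```python
# Minimal sketch of pixel-classification segmentation.
# Assumes scikit-image and scikit-learn; `image` is a 2-D grayscale array
# and `labels` holds sparse manual annotations (0 = unlabeled, 1..K = class).
import numpy as np
from skimage import filters
from sklearn.ensemble import RandomForestClassifier

def pixel_features(image):
    """Stack per-pixel filter responses describing the local neighborhood."""
    feats = [image]
    for sigma in (1, 2, 4):                       # multiple spatial scales
        feats.append(filters.gaussian(image, sigma))
        feats.append(filters.sobel(filters.gaussian(image, sigma)))
    return np.stack(feats, axis=-1).reshape(-1, len(feats))

def train_pixel_classifier(image, labels):
    X = pixel_features(image)
    y = labels.ravel()
    mask = y > 0                                  # use only annotated pixels
    return RandomForestClassifier(n_estimators=100).fit(X[mask], y[mask])

def segment(clf, new_image):
    """Apply the trained model to a new image to classify every pixel."""
    return clf.predict(pixel_features(new_image)).reshape(new_image.shape)
```

In practice, tools such as ilastik wrap this workflow in an interactive interface, so annotations can be refined while predictions are previewed.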
Using full images results in a loss of single-cell resolution but offers several advantages: skipping the segmentation step eliminates the sometimes tedious manual tuning of segmentation and feature-extraction algorithms, saves computation time, avoids segmentation errors, and may better capture visual patterns that arise from multiple cells. Using single-cell images explicitly captures heterogeneity and may offer improved accuracy with less training data.
In multi-label tasks, sufficient and class-balanced labels are usually hard to obtain, which makes it challenging to train a good classifier. In this paper, we consider the problem of learning from imbalanced and incomplete supervision, where only a small subset of labeled data is available and the label distribution is highly imbalanced. This setting is important and appears commonly in real applications. For instance, in the ride-sharing liability judgment task, liability disputes arise for a variety of reasons; manually annotating these reasons is expensive, and their distribution is often severely imbalanced. In this paper, we present a systematic framework, Limi, consisting of three sub-steps: Label Separating, Correlation Mining, and Label Completion. Specifically, we propose an effective two-classifier strategy that tackles head and tail labels separately, so as to alleviate the performance degradation on tail labels while maintaining high performance on head labels. Then, a novel label correlation network with flexible aggregators is adopted to explore label relation knowledge. Moreover, the Limi framework completes labels on unlabeled instances in a semi-supervised fashion. The framework is general, flexible, and effective. Extensive experiments on diverse applications, including the ride-sharing liability judgment task from Didi and various benchmark tasks, demonstrate that our solution clearly outperforms many competitive methods.
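The abstract does not spell out the separation rule, so the following is only a minimal sketch of the general two-classifier idea: split labels into head and tail by frequency, then fit a separate model to each group. The frequency threshold, models, and reweighting below are assumptions, not Limi's actual design.

```python
# Hedged sketch of a two-classifier head/tail split for multi-label data.
# Y is an (n_samples, n_labels) binary indicator matrix; assumes every
# selected label has both positive and negative examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def split_head_tail(Y, quantile=0.8):
    """Labels with frequency above the quantile are 'head', the rest 'tail'."""
    freq = Y.sum(axis=0)
    cut = np.quantile(freq, quantile)
    return np.where(freq >= cut)[0], np.where(freq < cut)[0]

def fit_two_classifiers(X, Y):
    head, tail = split_head_tail(Y)
    # Head labels: plenty of positives, a plain one-vs-rest model suffices.
    clf_head = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf_head.fit(X, Y[:, head])
    # Tail labels: reweight positives so rare labels are not drowned out.
    clf_tail = OneVsRestClassifier(
        LogisticRegression(max_iter=1000, class_weight="balanced"))
    clf_tail.fit(X, Y[:, tail])
    return (clf_head, head), (clf_tail, tail)
```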
Graph embedding based on random walks supports effective solutions for many graph-related downstream tasks. However, the abundance of embedding literature has made it increasingly difficult to compare existing methods and to identify opportunities to advance the state of the art. Meanwhile, existing work has left unanswered several fundamental questions, such as how embeddings capture different structural scales and how they should be applied for effective link prediction. This paper addresses these challenges with an analytical framework for random-walk based graph embedding that consists of three components: a random-walk process, a similarity function, and an embedding algorithm. Our framework not only categorizes many existing approaches but also naturally motivates new ones. With it, we illustrate novel ways to incorporate embeddings at multiple scales to improve downstream task performance. We also show that embeddings based on autocovariance similarity, when paired with dot-product ranking for link prediction, outperform state-of-the-art methods based on Pointwise Mutual Information similarity by up to 100%.
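To make the three components concrete, here is a hedged sketch of one instantiation: a simple random walk as the process, autocovariance as the similarity function, and a truncated factorization as the embedding algorithm. The formula R = diag(pi) P^tau - pi pi^T is the standard autocovariance of a stationary random walk; details such as the Markov time tau and the dimensionality are illustrative.

```python
# Hedged sketch: simple random walk (process) + autocovariance (similarity)
# + truncated SVD (embedding algorithm). Assumes A is the symmetric
# adjacency matrix of a connected, undirected graph.
import numpy as np

def autocovariance_embedding(A, tau=3, dim=16):
    deg = A.sum(axis=1)
    P = A / deg[:, None]                 # random-walk transition matrix
    pi = deg / deg.sum()                 # stationary distribution
    # Autocovariance similarity at Markov time tau.
    R = np.diag(pi) @ np.linalg.matrix_power(P, tau) - np.outer(pi, pi)
    # Embed by factorizing the (symmetrized) similarity matrix.
    U, S, _ = np.linalg.svd((R + R.T) / 2)
    return U[:, :dim] * np.sqrt(S[:dim])

# For link prediction, the paper's finding suggests ranking candidate
# edges (u, v) by the dot product emb[u] @ emb[v].
```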
Gradient Boosting Machines (GBM) are hugely popular for solving tabular data problems. However, practitioners are not only interested in point predictions, but also in probabilistic predictions in order to quantify the uncertainty of the predictions. Creating such probabilistic predictions is difficult with existing GBM-based solutions: they either require training multiple models or they become too computationally expensive to be useful for large-scale settings. We propose Probabilistic Gradient Boosting Machines (PGBM), a method to create probabilistic predictions with a single ensemble of decision trees in a computationally efficient manner. PGBM approximates the leaf weights in a decision tree as a random variable, and approximates the mean and variance of each sample in a dataset via stochastic tree ensemble update equations. These learned moments allow us to subsequently sample from a specified distribution after training. We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods: (i) PGBM enables probabilistic estimates without compromising on point performance in a single model, (ii) PGBM learns probabilistic estimates via a single model only (and without requiring multi-parameter boosting), and thereby offers a speedup of up to several orders of magnitude over existing state-of-the-art methods on large datasets, and (iii) PGBM achieves accurate probabilistic estimates in tasks with complex differentiable loss functions, such as hierarchical time series problems, where we observed up to 10% improvement in point forecasting performance and up to 300% improvement in probabilistic forecasting performance.
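The moment-propagation idea can be sketched as follows. Note that `trees`, `leaf_index`, `leaf_mean`, and `leaf_var` are hypothetical names for illustration, not the actual PGBM API, and treating trees as independent is our simplification.

```python
# Hedged sketch of PGBM's core idea: treat each leaf weight as a random
# variable with a learned mean and variance, accumulate per-sample moments
# across the ensemble, then sample from a chosen output distribution.
import numpy as np

def predict_distribution(trees, x, learning_rate=0.1, n_samples=1000,
                         rng=np.random.default_rng(0)):
    mean, var = 0.0, 0.0
    for tree in trees:
        leaf = tree.leaf_index(x)        # leaf this sample falls into
        mean += learning_rate * tree.leaf_mean[leaf]
        # Assuming independence across trees, variances add
        # (scaled by the squared learning rate).
        var += learning_rate**2 * tree.leaf_var[leaf]
    # Sample from a specified distribution with the learned moments;
    # a normal is one choice among several.
    return rng.normal(mean, np.sqrt(var), size=n_samples)
```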
Medical ontologies are widely used to describe and organize medical terminologies and to support many critical applications on healthcare databases. These ontologies are often manually curated (e.g., UMLS, SNOMED CT, and MeSH) by medical experts. Medical databases, on the other hand, are often created by database administrators, using different terminology and structures. The discrepancies between medical ontologies and databases compromise interoperability between them. Data-to-ontology matching is the process of finding semantic correspondences between tables in databases and standard ontologies. Existing solutions such as ontology matching have mostly focused on engineering features from terminological, structural, and semantic model information extracted from the ontologies. However, this is often labor intensive and the accuracy varies greatly across different ontologies. Worse yet, the ontology capturing a medical database is often not given in practice. In this paper, we propose MEDTO, a novel end-to-end framework that consists of three innovative techniques: (1) a lightweight yet effective method that bootstraps a semantically rich ontology from a given medical database, (2) a hyperbolic graph convolution layer that encodes hierarchical concepts in the hyperbolic space, and (3) a heterogeneous graph layer that encodes both local and global context information of a concept. Experiments on two real-world medical datasets matching against SNOMED CT show significant improvements compared to the state-of-the-art methods. MEDTO also consistently achieves competitive results on a benchmark from the Ontology Alignment Evaluation Initiative.
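To give intuition for component (2), the sketch below shows a generic Poincare-ball construction of the kind hyperbolic graph convolutions build on: distances grow rapidly toward the boundary (matching the exponential growth of hierarchies), and aggregation is done in the tangent space at the origin. This is a textbook-style sketch, not MEDTO's actual layer.

```python
# Generic Poincare-ball operations (curvature -1); illustrative only.
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance grows fast near the boundary, giving hierarchies room to branch."""
    sq = np.sum((u - v) ** 2)
    du, dv = 1 - np.sum(u**2), 1 - np.sum(v**2)
    return np.arccosh(1 + 2 * sq / (du * dv + eps))

def expmap0(x, eps=1e-9):
    n = np.linalg.norm(x) + eps
    return np.tanh(n) * x / n            # tangent space -> ball

def logmap0(y, eps=1e-9):
    n = np.linalg.norm(y) + eps
    return np.arctanh(np.clip(n, 0, 1 - 1e-7)) * y / n   # ball -> tangent space

def hyperbolic_conv(H, A, W):
    """One layer: logmap, neighborhood-average plus linear transform, expmap."""
    T = np.array([logmap0(h) for h in H])          # points to tangent space
    Anorm = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    return np.array([expmap0(t) for t in (Anorm @ T @ W)])
```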
Online experimentation, also known as A/B testing, is the gold standard for measuring product impact and making business decisions in the tech industry. The validity and utility of experiments, however, hinge on unbiasedness and sufficient power. In two-sided online marketplaces, both requirements are called into question. Bernoulli randomized experiments are biased because treatment units interfere with control units through market competition, violating the "stable unit treatment value assumption" (SUTVA). The experimental power on at least one side of the market is often insufficient because of disparate sample sizes on the two sides. Despite the importance of online marketplaces to the online economy and the crucial role experimentation plays in product development, an effective and practical solution to the bias and low-power problems in marketplace experimentation has been lacking. In this paper we address this shortcoming by proposing the budget-split design, which is unbiased in any marketplace where buyers have a finite or infinite budget. We show that it is more powerful than all other unbiased designs in the literature. We then provide a generalizable system architecture for deploying this design to online marketplaces. Finally, we confirm the effectiveness of our proposal with empirical results from experiments run in two real-world online marketplaces, demonstrating over a 15x gain in experimental power and the removal of market-competition-induced bias, which can be as large as 230% of the treatment effect size.
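The mechanism can be illustrated with a toy simulation: each buyer's budget is split into two isolated halves, one governed entirely by treatment and one by control, so the two arms never compete for the same dollars. All numbers and the demand model below are made-up assumptions for illustration.

```python
# Toy illustration of the budget-split idea, not the paper's actual system.
import numpy as np

rng = np.random.default_rng(7)

def run_arm(budget_half, demand_scale):
    """Each buyer's spend is random demand capped by its isolated half-budget."""
    demand = rng.exponential(demand_scale, size=budget_half.shape)
    return np.minimum(demand, budget_half).sum()

budgets = rng.uniform(50.0, 150.0, size=10_000)
control = run_arm(budgets / 2, demand_scale=40.0)
treated = run_arm(budgets / 2, demand_scale=44.0)   # e.g., +10% demand
print(f"estimated lift: {treated / control - 1:.2%}")
```

Because each arm sees only its own half of every budget, extra demand induced by treatment cannot drain dollars away from the control arm, which is exactly the interference that biases a Bernoulli design.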
East Africa is experiencing the worst locust infestation in over 25 years, which has severely threatened the food security of millions of people across the region. The primary strategy adopted by human experts at the United Nations Food and Agricultural Organization (UN-FAO) to tackle locust outbreaks involves manually surveying at-risk geographical areas, followed by allocating and spraying pesticides in affected regions. In order to augment and assist human experts at the UN-FAO in this task, we utilize crowdsourced reports of locust observations collected by PlantVillage (the world's leading knowledge delivery system for East African farmers) and develop PLAN, a Machine Learning (ML) algorithm for forecasting future migration patterns of locusts at high spatial and temporal resolution across East Africa. PLAN's novel spatio-temporal deep learning architecture enables representing PlantVillage's crowdsourced locust observation data using novel image-based feature representations, and its design is informed by several unique insights about this problem domain. Experimental results show that PLAN achieves superior predictive performance over several baseline models; it achieves an AUC score of 0.9 when used with a data augmentation method. PLAN represents a first step in using deep learning to assist and augment human expertise at PlantVillage (and the UN-FAO) in locust prediction, and its real-world usability is currently being evaluated by domain experts (including a potential idea to use the heatmaps created by PLAN in a Kenyan TV show). The source code is available at -tabar/PLAN.
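One plausible reading of the "image-based feature representation" is rasterizing georeferenced reports into per-time-step grids; the sketch below does that under assumed bounds and resolution, and is not PLAN's actual preprocessing code.

```python
# Hedged sketch: bin crowdsourced point observations for one time window
# into a 2-D grid, yielding one image "channel" a spatio-temporal model
# can consume. Grid bounds and resolution are illustrative assumptions.
import numpy as np

def rasterize(lats, lons, bounds=(-12.0, 23.0, 21.0, 52.0), size=64):
    """bounds = (lat_min, lat_max, lon_min, lon_max), roughly East Africa."""
    lat_min, lat_max, lon_min, lon_max = bounds
    grid, _, _ = np.histogram2d(
        lats, lons, bins=size,
        range=[[lat_min, lat_max], [lon_min, lon_max]])
    return np.log1p(grid)        # compress heavy-tailed report counts

# Stacking one raster per time step gives a (T, size, size) tensor that a
# convolutional-recurrent forecaster can ingest.
```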