arxiv's People

Contributors

coffeekumazaki

arxiv's Issues

Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization. (arXiv:2010.12126v1 [cs.CV])

https://ift.tt/2FY767c

Matching information across image and text modalities is a fundamental challenge for many applications that involve both vision and natural language processing. The objective is to find efficient similarity metrics for comparing visual and textual information. Existing approaches mainly match the local visual objects and the sentence words in a shared space with attention mechanisms. The matching performance is still limited because the similarity computation is based on simple comparisons of the matching features, ignoring the characteristics of their distribution in the data. In this paper, we address this limitation with an efficient learning objective that considers the discriminative feature distributions between the visual objects and sentence words. Specifically, we propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning framework, beyond the paradigm metric learning objective, to construct a set of discriminative data domains within each image-text pair. Our approach can generally improve the learning efficiency and the performance of existing metric learning frameworks by regulating the distribution of the hidden space between the matching pairs. The experimental results show that this new approach significantly improves the overall performance of several popular cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K benchmarks.



via cs.CV updates on arXiv.org http://arxiv.org/

Investigating Cultural Aspects in the Fundamental Diagram using Convolutional Neural Networks and Simulation. (arXiv:2010.11995v1 [cs.OH])

https://ift.tt/35v5ta0

This paper presents a study of group behavior in a controlled experiment, focused on differences in an important attribute that varies across cultures (personal space) in two countries: Brazil and Germany. To compare Germany and Brazil coherently, with the same population and the same task, we performed the pedestrian Fundamental Diagram (FD) experiment in Brazil as it was performed in Germany. We use CNNs to detect and track people in video sequences. With this data, we use Voronoi diagrams to find the neighbor relations among people and then compute the walking distances to determine the personal spaces. Based on the personal space analyses, we found that people's behavior is more similar in high-density populations and varies more in low and medium densities. So, we focused our study on cultural differences between the two countries in low and medium densities. Results indicate that personal space analyses can be a relevant feature for understanding cultural aspects in video sequences. In addition to the cultural differences, we also investigate the personality model in crowds, using OCEAN. We also propose a way to simulate the FD experiment for other countries using the OCEAN psychological traits model as input. The simulated countries were consistent with the literature.



via cs.CV updates on arXiv.org http://arxiv.org/

Tensor Reordering for CNN Compression. (arXiv:2010.12110v1 [cs.LG])

https://ift.tt/35EC2T0

We show how parameter redundancy in Convolutional Neural Network (CNN) filters can be effectively reduced by pruning in the spectral domain. Specifically, the representation extracted via the Discrete Cosine Transform (DCT) is more conducive to pruning than the original space. By relying on a combination of weight tensor reshaping and reordering, we achieve high levels of layer compression with only minor accuracy loss. Our approach is applied to compress pretrained CNNs, and we show that minor additional fine-tuning allows our method to recover the original model performance after a significant parameter reduction. We validate our approach on the ResNet-50 and MobileNet-V2 architectures for the ImageNet classification task.
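
The idea of pruning in the DCT domain rather than in the original weight space can be illustrated with a minimal sketch. This is not the paper's method: the reshaping and reordering steps are omitted, the 1-D toy "filter" values are made up, and the function names (`dct`, `idct`, `prune_in_dct_domain`) are hypothetical helpers written only for this illustration.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def prune_in_dct_domain(weights, keep):
    """Keep the `keep` largest-magnitude DCT coefficients and zero the rest."""
    coeffs = dct(weights)
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:keep])
    pruned = [c if k in kept else 0.0 for k, c in enumerate(coeffs)]
    return pruned, idct(pruned)

# Made-up, smoothly varying toy "filter" (an assumption for illustration).
weights = [0.9, 0.8, 0.75, 0.6, 0.5, 0.45, 0.3, 0.2]
pruned, approx = prune_in_dct_domain(weights, keep=3)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
print("kept coefficients:", sum(1 for c in pruned if c != 0.0))
print("max reconstruction error:", max_err)
```

Because the transform is orthonormal, the squared reconstruction error equals the energy of the discarded coefficients, which is why magnitude-based selection is a natural pruning criterion in this sketch.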



via cs.CV updates on arXiv.org http://arxiv.org/

Contrastive Learning with Adversarial Examples. (arXiv:2010.12050v1 [cs.CV])

https://ift.tt/31Yoqkv

Contrastive learning (CL) is a popular technique for self-supervised learning (SSL) of visual representations. It uses pairs of augmentations of unlabeled training examples to define a classification task for pretext learning of a deep embedding. Despite extensive work on augmentation procedures, prior work does not address the selection of challenging negative pairs, as images within a sampled batch are treated independently. This paper addresses the problem by introducing a new family of adversarial examples for contrastive learning and using these examples to define a new adversarial training algorithm for SSL, denoted as CLAE. When compared to standard CL, the use of adversarial examples creates more challenging positive pairs and adversarial training produces harder negative pairs by accounting for all images in a batch during the optimization. CLAE is compatible with many CL methods in the literature. Experiments show that it improves the performance of several existing CL baselines on multiple datasets.
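
To make the "classification task" behind contrastive learning concrete, here is a minimal sketch of an InfoNCE-style loss, a common formulation that the abstract does not confirm as the one used by CLAE. The embeddings, the temperature value and the function names are assumptions chosen for illustration. The sketch only shows why negatives that lie close to the anchor make the task harder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.5):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Made-up 2-D embeddings (assumptions for illustration only).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negatives = [[-1.0, 0.0], [0.0, -1.0]]   # far from the anchor
hard_negatives = [[0.8, 0.3], [0.7, -0.2]]    # close to the anchor

loss_easy = info_nce(anchor, positive, easy_negatives)
loss_hard = info_nce(anchor, positive, hard_negatives)
print("loss with easy negatives:", loss_easy)
print("loss with hard negatives:", loss_hard)
```

The loss is larger with the hard negatives, which is the sense in which harder pairs provide a stronger training signal.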



via cs.CV updates on arXiv.org http://arxiv.org/

Improving the generalization of network based relative pose regression: dimension reduction as a regularizer. (arXiv:2010.12796v1 [cs.CV])

https://ift.tt/31Lgmn4

Visual localization occupies an important position in many areas such as Augmented Reality, robotics and 3D reconstruction. The state-of-the-art visual localization methods perform pose estimation using a geometry-based solver within the RANSAC framework. However, these methods require accurate pixel-level matching at high image resolution, which is hard to satisfy under significant changes in appearance, dynamics or viewing perspective. End-to-end learning-based regression networks circumvent the requirement for precise pixel-level correspondences, but demonstrate poor cross-scene generalization. In this paper, we explicitly add a learnable matching layer within the network to isolate the pose regression solver from the absolute image feature values, and apply dimension regularization on both the correlation feature channel and the image scale to further improve generalization and robustness to large viewpoint change. We implement this dimension regularization strategy within a two-layer pyramid-based framework to regress the localization results from coarse to fine. In addition, the depth information is fused for absolute translational scale recovery. Through experiments on real-world RGBD datasets, we validate the effectiveness of our design in terms of improving both generalization performance and robustness to viewpoint change, and also show the potential of regression-based visual localization networks in challenging situations that are difficult for geometry-based visual localization methods.



via cs.CV updates on arXiv.org http://arxiv.org/

Learn Robust Features via Orthogonal Multi-Path. (arXiv:2010.12190v1 [cs.CV])

https://ift.tt/31G6UkJ

It is now widely known that, through adversarial attacks, clean images with invisible perturbations can fool deep neural networks. To defend against adversarial attacks, we design a block containing multiple paths to learn robust features, and the parameters of these paths are required to be orthogonal to each other. The so-called Orthogonal Multi-Path (OMP) block can be placed in any layer of a neural network. Via forward learning and backward correction, one OMP block makes the neural network learn features that are appropriate for all the paths and hence are expected to be robust. With careful design and thorough experiments on, e.g., the positions at which the orthogonality constraint is imposed and the trade-off between variety and accuracy, the robustness of the neural networks is significantly improved. For example, under a white-box PGD attack with $l_\infty$ bound ${8}/{255}$ (this is a fierce attack that can make the accuracy of many vanilla neural networks drop to nearly $10\%$ on CIFAR10), VGG16 with the proposed OMP block could keep over $50\%$ accuracy. For black-box attacks, neural networks equipped with an OMP block have accuracy over $80\%$. The performance under both white-box and black-box attacks is much better than the existing state-of-the-art adversarial defenders.
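
For readers unfamiliar with the attack named in this abstract, the following is a minimal sketch of projected gradient descent (PGD) under an $l_\infty$ bound of 8/255. It attacks a toy linear classifier with a logistic loss, not VGG16 or the OMP block. The weights, the input, the step size `ALPHA` and the iteration count `STEPS` are assumptions chosen for illustration; only the bound 8/255 comes from the abstract.

```python
import math

EPS = 8 / 255     # l_inf bound quoted in the abstract
ALPHA = 2 / 255   # step size (assumed)
STEPS = 10        # number of iterations (assumed)

def loss_and_grad(x, y, w, b):
    """Logistic loss of a linear classifier and its gradient w.r.t. the input x."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    margin = y * s
    loss = math.log1p(math.exp(-margin))
    coeff = -y / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. s
    return loss, [coeff * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_attack(x0, y, w, b, eps=EPS, alpha=ALPHA, steps=STEPS):
    """Ascend the loss with signed steps, projecting back into the eps-ball."""
    x = list(x0)
    for _ in range(steps):
        _, grad = loss_and_grad(x, y, w, b)
        x = [xi + alpha * sign(g) for xi, g in zip(x, grad)]
        # Project onto the l_inf ball around x0, then onto valid pixel range.
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x

# Toy classifier and input (made-up values); label y is +1 or -1.
w, b = [2.0, -3.0, 1.0], 0.1
x_clean, y = [0.5, 0.2, 0.7], 1

x_adv = pgd_attack(x_clean, y, w, b)
clean_loss, _ = loss_and_grad(x_clean, y, w, b)
adv_loss, _ = loss_and_grad(x_adv, y, w, b)
print("clean loss:", clean_loss, "adversarial loss:", adv_loss)
```

The perturbation never exceeds the bound in any coordinate, yet the loss increases, which is the property that makes such attacks hard to notice and hard to defend against.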



via cs.CV updates on arXiv.org http://arxiv.org/

Automated triage of COVID-19 from various lung abnormalities using chest CT features. (arXiv:2010.12967v1 [eess.IV])

https://ift.tt/3jvSqtG

The outbreak of COVID-19 has led to a global effort to decelerate the pandemic spread. For this purpose, chest computed-tomography (CT) based screening and diagnosis of COVID-19 suspected patients is utilized, either as a support or a replacement for the reverse transcription-polymerase chain reaction (RT-PCR) test. In this paper, we propose a fully automated AI-based system that takes chest CT scans as input and triages COVID-19 cases. More specifically, we produce multiple descriptive features, including lung and infection statistics, texture, shape and location, to train a machine learning based classifier that distinguishes between COVID-19 and other lung abnormalities (including community-acquired pneumonia). We evaluated our system on a dataset of 2191 CT cases and demonstrated a robust solution with 90.8% sensitivity at 85.4% specificity with 94.0% ROC-AUC. In addition, we present an elaborated feature analysis and ablation study to explore the importance of each feature.



via cs.CV updates on arXiv.org http://arxiv.org/

GPS-Denied Navigation Using SAR Images and Neural Networks. (arXiv:2010.12108v1 [cs.CV])

https://ift.tt/31ExWJm

Unmanned aerial vehicles (UAV) often rely on GPS for navigation. GPS signals, however, are very low in power and easily jammed or otherwise disrupted. This paper presents a method for determining the navigation errors present at the beginning of a GPS-denied period utilizing data from a synthetic aperture radar (SAR) system. This is accomplished by comparing an online-generated SAR image with a reference image obtained a priori. The distortions relative to the reference image are learned and exploited with a convolutional neural network to recover the initial navigational errors, which can be used to recover the true flight trajectory throughout the synthetic aperture. The proposed neural network approach is able to learn to predict the initial errors on both simulated and real SAR image data.



via cs.CV updates on arXiv.org http://arxiv.org/

Deep Image Prior for Sparse-sampling Photoacoustic Microscopy. (arXiv:2010.12041v1 [eess.IV])

https://ift.tt/37TAC9Z

Photoacoustic microscopy (PAM) is an emerging method for imaging both structural and functional information without the need for exogenous contrast agents. However, state-of-the-art PAM faces a tradeoff between imaging speed and spatial sampling density within the same field-of-view (FOV). Limited by the pulsed laser's repetition rate, the imaging speed is inversely proportional to the total number of effective pixels. To cover the same FOV in a shorter amount of time with the same PAM hardware, there is currently no other option than to decrease spatial sampling density (i.e., sparse sampling). Deep learning methods have recently been used to improve sparsely sampled PAM images; however, these methods often require time-consuming pre-training and a large training dataset that has fully sampled, co-registered ground truth. In this paper, we propose using a method known as "deep image prior" to improve the image quality of sparsely sampled PAM images. The network does not need prior learning or fully sampled ground truth, making its implementation more flexible and much quicker. Our results show promising improvement in PA vasculature images with as few as 2% of the effective pixels. Our deep image prior approach produces results that outperform interpolation methods and can be readily translated to other high-speed, sparse-sampling imaging modalities.



via cs.CV updates on arXiv.org http://arxiv.org/

Vision-based Robotic Grasping From Object Localization, Object Pose Estimation to Grasp Estimation for Parallel Grippers: A Review. (arXiv:1905.06658v4 [cs.RO] UPDATED)

https://ift.tt/37WFlaX

This paper presents a comprehensive survey on vision-based robotic grasping. We identify three key tasks in vision-based robotic grasping: object localization, object pose estimation and grasp estimation. In detail, the object localization task contains object localization without classification, object detection and object instance segmentation. This task provides the regions of the target object in the input data. The object pose estimation task mainly refers to estimating the 6D object pose and includes correspondence-based methods, template-based methods and voting-based methods, which enable the generation of grasp poses for known objects. The grasp estimation task includes 2D planar grasp methods and 6DoF grasp methods, where the former is constrained to grasp from one direction. These three tasks can accomplish robotic grasping in different combinations. Many object pose estimation methods do not need object localization, and they conduct object localization and object pose estimation jointly. Many grasp estimation methods need neither object localization nor object pose estimation, and they conduct grasp estimation in an end-to-end manner. Both traditional methods and the latest deep learning-based methods based on RGB-D image inputs are reviewed in detail in this survey. Related datasets and comparisons between state-of-the-art methods are summarized as well. In addition, challenges in vision-based robotic grasping and future directions for addressing these challenges are also pointed out.



via cs.CV updates on arXiv.org http://arxiv.org/

Classification of Spot-welded Joints in Laser Thermography Data using Convolutional Neural Networks. (arXiv:2010.12976v1 [cs.CV])

https://ift.tt/37IImvb

Spot welding is a crucial process step in various industries. However, classification of spot welding quality is still a tedious process due to the complexity and sensitivity of the test material, which push conventional approaches to their limits. In this paper, we propose an approach for quality inspection of spot welds using images from laser thermography data. We propose data preparation approaches based on the underlying physics of spot-welded joints heated with pulsed laser thermography, by analyzing the intensity over time, and derive dedicated data filters to generate training datasets. Subsequently, we utilize convolutional neural networks to classify weld quality and compare the performance of different models against each other. We achieve competitive results in terms of classifying the different welding quality classes compared to traditional approaches, reaching an accuracy of more than 95 percent. Finally, we explore the effect of different augmentation methods.



via cs.CV updates on arXiv.org http://arxiv.org/

Towards falsifiable interpretability research. (arXiv:2010.12016v1 [cs.CY])

https://ift.tt/35w7pip

Methods for understanding the decisions of and mechanisms underlying deep neural networks (DNNs) typically rely on building intuition by emphasizing sensory or semantic features of individual examples. For instance, methods aim to visualize the components of an input which are "important" to a network's decision, or to measure the semantic properties of single neurons. Here, we argue that interpretability research suffers from an over-reliance on intuition-based approaches that risk (and in some cases have caused) illusory progress and misleading conclusions. We identify a set of limitations that we argue impede meaningful progress in interpretability research, and examine two popular classes of interpretability methods (saliency and single-neuron-based approaches) that serve as case studies for how over-reliance on intuition and lack of falsifiability can undermine interpretability research. To address these impediments, we propose a strategy in the form of a framework for strongly falsifiable interpretability research. We encourage researchers to use their intuitions as a starting point to develop and test clear, falsifiable hypotheses, and hope that our framework yields robust, evidence-based interpretability methods that generate meaningful advances in our understanding of DNNs.



via cs.CV updates on arXiv.org http://arxiv.org/

Delving into the Cyclic Mechanism in Semi-supervised Video Object Segmentation. (arXiv:2010.12176v1 [cs.CV])

https://ift.tt/2TvPmmF

In this paper, we address several inadequacies of current video object segmentation pipelines. Firstly, a cyclic mechanism is incorporated into the standard semi-supervised process to produce more robust representations. By relying on the accurate reference mask in the starting frame, we show that the error propagation problem can be mitigated. Next, we introduce a simple gradient correction module, which extends the offline pipeline to an online method while maintaining the efficiency of the former. Finally, we develop the cycle effective receptive field (cycle-ERF), based on gradient correction, to provide a new perspective for analyzing object-specific regions of interest. We conduct comprehensive experiments on the challenging DAVIS17 and Youtube-VOS benchmarks, demonstrating that the cyclic mechanism is beneficial to segmentation quality.



via cs.CV updates on arXiv.org http://arxiv.org/

Comprehensive Attention Self-Distillation for Weakly-Supervised Object Detection. (arXiv:2010.12023v1 [cs.CV])

https://ift.tt/3kxvfk3

Weakly Supervised Object Detection (WSOD) has emerged as an effective tool to train object detectors using only the image-level category labels. However, without object-level labels, WSOD detectors are prone to detect bounding boxes on salient objects, clustered objects and discriminative object parts. Moreover, the image-level category labels do not enforce consistent object detection across different transformations of the same images. To address the above issues, we propose a Comprehensive Attention Self-Distillation (CASD) training approach for WSOD. To balance feature learning among all object instances, CASD computes the comprehensive attention aggregated from multiple transformations and feature layers of the same images. To enforce consistent spatial supervision on objects, CASD conducts self-distillation on the WSOD networks, such that the comprehensive attention is approximated simultaneously by multiple transformations and feature layers of the same images. CASD produces new state-of-the-art WSOD results on standard benchmarks such as PASCAL VOC 2007/2012 and MS-COCO.



via cs.CV updates on arXiv.org http://arxiv.org/

A generalized deep learning model for multi-disease Chest X-Ray diagnostics. (arXiv:2010.12065v1 [q-bio.QM])

https://ift.tt/31Yopgr

We investigate the generalizability of a deep convolutional neural network (CNN) on the task of disease classification from chest x-rays collected over multiple sites. We systematically train the model using datasets from three independent sites with different patient populations: National Institutes of Health (NIH), Stanford University Medical Centre (CheXpert), and Shifa International Hospital (SIH). We formulate a sequential training approach and demonstrate that the model produces generalized prediction performance using held-out test sets from the three sites. Our model generalizes better when trained on multiple datasets, with the CheXpert-Shifa-NET model performing significantly better (p-values < 0.05) than the models trained on individual datasets for 3 out of the 4 distinct disease classes. The code for training the model will be made available as open source at https://ift.tt/3kAf4Ti at the time of publication.



via cs.CV updates on arXiv.org http://arxiv.org/

Classifying Eye-Tracking Data Using Saliency Maps. (arXiv:2010.12913v1 [cs.CV])

https://ift.tt/31NHxNT

A plethora of research in the literature shows how human eye fixation patterns vary depending on different factors, including genetics, age, social functioning, cognitive functioning, and so on. Analysis of these variations in visual attention has already elicited two potential research avenues: 1) determining the physiological or psychological state of the subject and 2) predicting the tasks associated with the act of viewing from the recorded eye-fixation data. To this end, this paper proposes a novel visual saliency based feature extraction method for automatic and quantitative classification of eye-tracking data, which is applicable to both research directions. Instead of directly extracting features from the fixation data, this method employs several well-known computational models of visual attention to predict eye fixation locations as saliency maps. Comparing the saliency amplitudes, similarity and dissimilarity of saliency maps with the corresponding eye fixation maps gives an extra dimension of information, which is effectively utilized to generate discriminative features to classify the eye-tracking data. Extensive experimentation using the Saliency4ASD, Age Prediction, and Visual Perceptual Task datasets shows that our saliency-based feature can achieve superior performance, outperforming the previous state-of-the-art methods by a considerable margin. Moreover, unlike the existing application-specific solutions, our method demonstrates performance improvement across three distinct problems from the real-life domain: Autism Spectrum Disorder screening, toddler age prediction, and human visual perceptual task classification, providing a general paradigm that utilizes the extra information inherent in saliency maps for a more accurate classification.



via cs.CV updates on arXiv.org http://arxiv.org/

AdaCrowd: Unlabeled Scene Adaptation for Crowd Counting. (arXiv:2010.12141v1 [cs.CV])

https://ift.tt/3mmciRN

We address the problem of image-based crowd counting. In particular, we propose a new problem called unlabeled scene adaptive crowd counting. Given a new target scene, we would like to have a crowd counting model specifically adapted to this particular scene based on the target data that capture some information about the new scene. In this paper, we propose to use one or more unlabeled images from the target scene to perform the adaptation. In comparison with the existing problem setups (e.g. fully supervised), our proposed problem setup is closer to the real-world applications of crowd counting systems. We introduce a novel AdaCrowd framework to solve this problem. Our framework consists of a crowd counting network and a guiding network. The guiding network predicts some parameters in the crowd counting network based on the unlabeled images from a particular scene. This allows our model to adapt to different target scenes. The experimental results on several challenging benchmark datasets demonstrate the effectiveness of our proposed approach compared with other alternative methods.



via cs.CV updates on arXiv.org http://arxiv.org/

RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering. (arXiv:2010.12917v1 [cs.CV])

https://ift.tt/3muJuXz

Text-based visual question answering (VQA) requires reading and understanding text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image into the VQA model without considering the contextual information of OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains text and scene objects. Then, it understands the question, OCRed text and objects in the context of the scene, and further mines the relationships among them. Finally, it answers the related text for the given question through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind RUArt's effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine the stable relationships between the text and objects.



via cs.CV updates on arXiv.org http://arxiv.org/

REDE: End-to-end Object 6D Pose Robust Estimation Using Differentiable Outliers Elimination. (arXiv:2010.12807v1 [cs.CV])

https://ift.tt/34tUfD6

Object 6D pose estimation is a fundamental task in many applications. Conventional methods solve the task by detecting and matching the keypoints, then estimating the pose. Recent efforts bringing deep learning into the problem mainly overcome the vulnerability of conventional methods to environmental variation due to the hand-crafted feature design. However, these methods cannot achieve end-to-end learning and good interpretability at the same time. In this paper, we propose REDE, a novel end-to-end object pose estimator using RGB-D data, which utilizes a network for keypoint regression and a differentiable geometric pose estimator for pose error back-propagation. In addition, to achieve better robustness when outlier keypoint predictions occur, we further propose a differentiable outlier elimination method that regresses the candidate result and the confidence simultaneously. Via confidence-weighted aggregation of multiple candidates, we can reduce the effect of the outliers on the final estimation. Finally, following the conventional method, we apply a learnable refinement process to further improve the estimation. The experimental results on three benchmark datasets show that REDE slightly outperforms the state-of-the-art approaches and is more robust to object occlusion.
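
The confidence-weighted aggregation step can be illustrated with a minimal sketch. This is an assumption-laden toy, not REDE's implementation: it aggregates only 3-D translation candidates by a normalized weighted average (rotations cannot be averaged this naively), and the candidate values, confidences and the function name `aggregate` are made up for illustration.

```python
def aggregate(candidates, confidences):
    """Confidence-weighted average of candidate vectors (weights are normalized)."""
    total = sum(confidences)
    dim = len(candidates[0])
    return [
        sum(c * cand[d] for c, cand in zip(confidences, candidates)) / total
        for d in range(dim)
    ]

# Three consistent translation candidates and one outlier (made-up values).
candidates = [
    [0.10, 0.20, 0.30],
    [0.11, 0.19, 0.31],
    [0.09, 0.21, 0.29],
    [5.00, 5.00, 5.00],   # outlier
]
# Hypothetical predicted confidences: the outlier receives a low score.
confidences = [0.90, 0.80, 0.85, 0.01]

weighted = aggregate(candidates, confidences)
plain_mean = aggregate(candidates, [1.0] * len(candidates))
print("confidence-weighted:", weighted)
print("unweighted mean:", plain_mean)
```

Because the weighting is a smooth function of the confidences, gradients can flow through it, which is what makes this kind of outlier handling compatible with end-to-end training.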



via cs.CV updates on arXiv.org http://arxiv.org/

Unsupervised deep learning for grading of age-related macular degeneration using retinal fundus images. (arXiv:2010.11993v1 [cs.CV])

https://ift.tt/3mfY7h8

Many diseases are classified based on human-defined rubrics that are prone to bias. Supervised neural networks can automate the grading of retinal fundus images, but require labor-intensive annotations and are restricted to the specific trained task. Here, we employed an unsupervised network with Non-Parametric Instance Discrimination (NPID) to grade age-related macular degeneration (AMD) severity using fundus photographs from the Age-Related Eye Disease Study (AREDS). Our unsupervised algorithm demonstrated versatility across different AMD classification schemes without retraining, and achieved unbalanced accuracies comparable to supervised networks and human ophthalmologists in classifying advanced or referable AMD, or on the 4-step AMD severity scale. Exploring the network's behavior revealed disease-related fundus features that drove predictions and unveiled the susceptibility of more granular human-defined AMD severity schemes to misclassification by both ophthalmologists and neural networks. Importantly, unsupervised learning enabled unbiased, data-driven discovery of AMD features such as geographic atrophy, as well as other ocular phenotypes of the choroid, vitreous, and lens, such as visually-impairing cataracts, that were not pre-defined by human labels.



via cs.CV updates on arXiv.org http://arxiv.org/

Deep Convolutional Neural Networks Model-based Brain Tumor Detection in Brain MRI Images. (arXiv:2010.11978v1 [eess.IV])

https://ift.tt/3osDe4c

Diagnosing brain tumors with the aid of Magnetic Resonance Imaging (MRI) has gained enormous prominence over the years, primarily in the field of medical science. Detection and/or partitioning of brain tumors solely with the aid of MR imaging is achieved at the cost of immense time and effort and demands a lot of expertise from engaged personnel. This substantiates the necessity of building an autonomous model for brain tumor diagnosis. Our work involves implementing a deep convolutional neural network (DCNN) for diagnosing brain tumors from MR images. The dataset used in this paper consists of 253 brain MR images, of which 155 are reported to have tumors. Our model can single out the MR images with tumors with an overall accuracy of 96%. The model outperformed the existing conventional methods for the diagnosis of brain tumors on the test dataset (Precision = 0.93, Sensitivity = 1.00, and F1-score = 0.97). Moreover, the proposed model's average precision-recall score is 0.93, Cohen's Kappa 0.91, and AUC 0.95. Therefore, the proposed model can help clinical experts verify whether the patient has a brain tumor and, consequently, accelerate the treatment procedure.
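
The metrics quoted in this abstract (precision, sensitivity, F1-score, Cohen's kappa) can all be derived from a binary confusion matrix. The sketch below shows the standard definitions. The confusion counts are hypothetical and are not the paper's test-set counts, which the abstract does not give, so the printed values are not expected to match the reported figures.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)            # also called recall
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / n
    # Cohen's kappa: agreement beyond what is expected by chance.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
        "accuracy": accuracy,
        "kappa": kappa,
    }

# Hypothetical counts for illustration only (not taken from the paper).
m = binary_metrics(tp=40, fp=3, fn=0, tn=17)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Since F1 is the harmonic mean of precision and sensitivity, a perfect sensitivity still leaves F1 below 1 whenever there are false positives.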



Automating Abnormality Detection in Musculoskeletal Radiographs through Deep Learning. (arXiv:2010.12030v1 [eess.IV])

https://ift.tt/37B8VT4

This paper introduces MuRAD (Musculoskeletal Radiograph Abnormality Detection tool), a tool that can help radiologists automate the detection of abnormalities in musculoskeletal radiographs (bone X-rays). MuRAD utilizes a Convolutional Neural Network (CNN) that can accurately predict whether a bone X-ray is abnormal, and leverages Class Activation Map (CAM) to localize the abnormality in the image. MuRAD achieves an F1 score of 0.822 and a Cohen's kappa of 0.699, which is comparable to the performance of expert radiologists.
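
The two agreement metrics quoted above can both be computed from a binary confusion matrix. The following is a minimal sketch for illustration only, not MuRAD's evaluation code; the function names and the example counts are hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(tp, fp, fn, tn):
    """Agreement between predictions and labels, corrected for chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of the marginal rates, summed over both classes.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 40 true positives, 10 false positives,
# 10 false negatives, 40 true negatives.
example_f1 = f1_score(40, 10, 10)
example_kappa = cohens_kappa(40, 10, 10, 40)
```

Kappa is lower than raw accuracy whenever part of the agreement could be explained by chance, which is why it is often reported alongside F1 for comparisons with human readers.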



Towards Fair Knowledge Transfer for Imbalanced Domain Adaptation. (arXiv:2010.12184v1 [cs.CV])

https://ift.tt/34t1c7x

Domain adaptation (DA) has become a promising technique for addressing insufficient or missing annotations by exploiting external source knowledge. Existing DA algorithms mainly focus on practical knowledge transfer through domain alignment. Unfortunately, they ignore the fairness issue when the auxiliary source is extremely imbalanced across different categories, which results in severely under-represented knowledge adaptation of the minority source set. To this end, we propose a Towards Fair Knowledge Transfer (TFKT) framework to handle the fairness challenge in imbalanced cross-domain learning. Specifically, a novel cross-domain mixup generation is exploited to augment the minority source set with target information to enhance fairness. Moreover, dual distinct classifiers and cross-domain prototype alignment are developed to seek a more robust classifier boundary and mitigate the domain shift. These three strategies are formulated into a unified framework to address the fairness issue and domain shift challenge. Extensive experiments on two popular benchmarks verify the effectiveness of our proposed model in comparison with existing state-of-the-art DA models; in particular, our model significantly improves overall accuracy by over 20% on the two benchmarks.
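
The abstract does not spell out how the cross-domain mixup is formed. As a rough sketch, assuming it follows the standard mixup recipe (a convex combination of two samples), a minority-class source feature could be blended with a target feature as below. The function names, the Beta-distributed coefficient, and the choice to blend feature vectors are all assumptions, not details from the paper.

```python
import random


def cross_domain_mixup(source_x, target_x, lam):
    """Blend a source sample with a target sample: lam * source + (1 - lam) * target."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(source_x) != len(target_x):
        raise ValueError("samples must have the same dimensionality")
    return [lam * s + (1.0 - lam) * t for s, t in zip(source_x, target_x)]


def sample_lambda(alpha=0.2, rng=random):
    """Draw a mixing coefficient from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)


# Hypothetical 2-D features: a minority-class source sample and a target sample.
mixed = cross_domain_mixup([1.0, 2.0], [3.0, 4.0], lam=0.75)
```

A coefficient close to 1 keeps the sample near the labeled source point, so the source label remains a plausible label for the mixed sample.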



Inferring Point Clouds from Single Monocular Images by Depth Intermediation. (arXiv:1812.01402v3 [cs.CV] UPDATED)

https://ift.tt/35A85mV

In this paper, we propose a pipeline to generate a 3D point cloud of an object from a single-view RGB image. Most previous works predict the 3D point coordinates from single RGB images directly. We decompose this problem into depth estimation from single images and point cloud completion from partial point clouds.

Our method sequentially predicts the depth maps from images and then infers the complete 3D object point clouds based on the predicted partial point clouds. We explicitly impose the camera model geometrical constraint in our pipeline and enforce the alignment of the generated point clouds and estimated depth maps.
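
The step from a predicted depth map to a partial point cloud is typically a back-projection through the camera model. The sketch below assumes a simple pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`; these parameter names and the function itself are illustrative assumptions, not the authors' implementation.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) into 3D camera-space points.

    Pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with non-positive depth are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map with two valid pixels.
partial_cloud = depth_to_points([[0.0, 2.0], [4.0, 0.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Because every point is tied to a pixel by this mapping, a completed point cloud can be re-projected and compared against the depth map, which is one way an alignment constraint of the kind described above could be checked.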

Experimental results for the single image 3D object reconstruction task show that the proposed method outperforms existing state-of-the-art methods. Both the qualitative and quantitative results demonstrate the generality and suitability of our method.



Video Understanding based on Human Action and Group Activity Recognition. (arXiv:2010.12968v1 [cs.CV])

https://ift.tt/2TrYKrB

A lot of previous work, such as video captioning, has shown promising performance in producing general video understanding. However, it is still challenging to generate a fine-grained description of human actions and their interactions using state-of-the-art video captioning techniques. The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc. In this study, we will propose and improve the video understanding method based on the Group Activity Recognition model by learning Actor Relation Graph (ARG). We will enhance the functionality and the performance of the ARG-based model to perform a better video understanding by applying approaches such as increasing human object detection accuracy with YOLO, increasing process speed by reducing the input image size, and applying ResNet in the CNN layer. We will also introduce a visualization model that will visualize each input video frame with predicted bounding boxes on each human object and predicted "video captioning" to describe each individual's action and their collective activity.



Rethinking the competition between detection and ReID in Multi-Object Tracking. (arXiv:2010.12138v1 [cs.CV])

https://ift.tt/3otDi3F

Due to balanced accuracy and speed, joint learning detection and ReID-based one-shot models have drawn great attention in multi-object tracking (MOT). However, the differences between the above two tasks in the one-shot tracking paradigm are unconsciously overlooked, leading to performance inferior to that of the two-stage methods. In this paper, we dissect the reasoning process of the aforementioned two tasks. Our analysis reveals that the competition between them inevitably hurts the learning of task-dependent representations, which further impedes the tracking performance. To remedy this issue, we propose a novel cross-correlation network that can effectively impel the separate branches to learn task-dependent representations. Furthermore, we introduce a scale-aware attention network that learns discriminative embeddings to improve the ReID capability. We integrate the delicately designed networks into a one-shot online MOT system, dubbed CSTrack. Without bells and whistles, our model achieves new state-of-the-art performances on MOT16 and MOT17. We will release our code to facilitate further work.



Using Deep Image Priors to Generate Counterfactual Explanations. (arXiv:2010.12046v1 [cs.LG])

https://ift.tt/2TqYity

Through the use of carefully tailored convolutional neural network architectures, a deep image prior (DIP) can be used to obtain pre-images from latent representation encodings. Though DIP inversion has been known to be superior to conventional regularized inversion strategies such as total variation, such an over-parameterized generator is able to effectively reconstruct even images that are not in the original data distribution. This limitation makes it challenging to utilize such priors for tasks such as counterfactual reasoning, wherein the goal is to generate small, interpretable changes to an image that systematically leads to changes in the model prediction. To this end, we propose a novel regularization strategy based on an auxiliary loss estimator jointly trained with the predictor, which efficiently guides the prior to recover natural pre-images. Our empirical studies with a real-world ISIC skin lesion detection problem clearly evidence the effectiveness of the proposed approach in synthesizing meaningful counterfactuals. In comparison, we find that the standard DIP inversion often proposes visually imperceptible perturbations to irrelevant parts of the image, thus providing no additional insights into the model behavior.



Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation. (arXiv:2010.12136v1 [cs.CV])

https://ift.tt/31YolNJ

We propose a novel lightweight generative adversarial network for efficient image manipulation using natural language descriptions. To achieve this, a new word-level discriminator is proposed, which provides the generator with fine-grained training feedback at word-level, to facilitate training a lightweight generator that has a small number of parameters, but can still correctly focus on specific visual attributes of an image, and then edit them without affecting other contents that are not described in the text. Furthermore, thanks to the explicit training signal related to each word, the discriminator can also be simplified to have a lightweight structure. Compared with the state of the art, our method has a much smaller number of parameters, but still achieves a competitive manipulation performance. Extensive experimental results demonstrate that our method can better disentangle different visual attributes, then correctly map them to corresponding semantic words, and thus achieve a more accurate image modification using natural language descriptions.



Few-shot Image Recognition with Manifolds. (arXiv:2010.12084v1 [cs.CV])

https://ift.tt/3oqaLvQ

In this paper, we extend the traditional few-shot learning (FSL) problem to the situation when the source-domain data is not accessible but only high-level information in the form of class prototypes is available. This limited information setup for the FSL problem deserves much attention due to its implication of privacy-preserving inaccessibility to the source-domain data but it has rarely been addressed before. Because of limited training data, we propose a non-parametric approach to this FSL problem by assuming that all the class prototypes are structurally arranged on a manifold. Accordingly, we estimate the novel-class prototype locations by projecting the few-shot samples onto the average of the subspaces on which the surrounding classes lie. During classification, we again exploit the structural arrangement of the categories by inducing a Markov chain on the graph constructed with the class prototypes. This manifold distance obtained using the Markov chain is expected to produce better results compared to a traditional nearest-neighbor-based Euclidean distance. To evaluate our proposed framework, we have tested it on two image datasets - the large-scale ImageNet and the small-scale but fine-grained CUB-200. We have also studied parameter sensitivity to better understand our framework.



Discriminative feature generation for classification of imbalanced data. (arXiv:2010.12888v1 [cs.CV])

https://ift.tt/3dXXK89

The data imbalance problem is a frequent bottleneck in the classification performance of neural networks. In this paper, we propose a novel supervised discriminative feature generation (DFG) method for a minority class dataset. DFG is based on the modified structure of a generative adversarial network consisting of four independent networks: generator, discriminator, feature extractor, and classifier. To augment the selected discriminative features of the minority class data by adopting an attention mechanism, the generator for the class-imbalanced target task is trained, and the feature extractor and classifier are regularized using the pre-trained features from a large source data. The experimental results show that the DFG generator enhances the augmentation of the label-preserved and diverse features, and the classification results are significantly improved on the target task. The feature generation model can contribute greatly to the development of data augmentation methods through discriminative feature generation and supervised attention methods.



CellCycleGAN: Spatiotemporal Microscopy Image Synthesis of Cell Populations using Statistical Shape Models and Conditional GANs. (arXiv:2010.12011v1 [eess.IV])

https://ift.tt/37FWk0J

Automatic analysis of spatio-temporal microscopy images is inevitable for state-of-the-art research in the life sciences. Recent developments in deep learning provide powerful tools for automatic analyses of such image data, but heavily depend on the amount and quality of provided training data to perform well. To this end, we developed a new method for realistic generation of synthetic 2D+t microscopy image data of fluorescently labeled cellular nuclei. The method combines spatiotemporal statistical shape models of different cell cycle stages with a conditional GAN to generate time series of cell populations and provides instance-level control of cell cycle stage and the fluorescence intensity of generated cells. We show the effect of the GAN conditioning and create a set of synthetic images that can be readily used for training and benchmarking of cell segmentation and tracking approaches.



Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions. (arXiv:2010.12852v1 [cs.CV])

https://ift.tt/34xjfcK

Visual Question Answering is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case of open-ended VQA), or via classification over a set of multiple-choice-type answers. In this work, we present a completely generative formulation where a multi-word answer is generated for a visual query. To take this a step forward, we introduce a new task: ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer and a rationale that seeks to justify the generated answer. We propose an end-to-end architecture to solve this task and describe how to evaluate it. We show that our model generates strong answers and rationales through qualitative and quantitative evaluation, as well as through a human Turing Test.



AutoPruning for Deep Neural Network with Dynamic Channel Masking. (arXiv:2010.12021v1 [cs.CV])

https://ift.tt/3jtac0N

Modern deep neural network models are large and computationally intensive. One typical solution to this issue is model pruning. However, most current pruning algorithms depend on hand-crafted rules or domain expertise. To overcome this problem, we propose a learning-based auto-pruning algorithm for deep neural networks, which is inspired by recent automatic machine learning (AutoML). A two-objective problem that aims for the weights and the best channels for each layer is first formulated. An alternative optimization approach is then proposed to derive the optimal channel numbers and weights simultaneously. In the process of pruning, we utilize a searchable hyperparameter, remaining ratio, to denote the number of channels in each convolution layer, and then a dynamic masking process is proposed to describe the corresponding channel evolution. To control the trade-off between the accuracy of a model and the pruning ratio of floating point operations, a novel loss function is further introduced. Preliminary experimental results on benchmark datasets demonstrate that our scheme achieves competitive results for neural network pruning.
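
To make the role of the remaining ratio concrete, the sketch below turns a per-layer ratio into a binary channel mask. It is only an illustration: ranking channels by an importance score, and the names `channel_mask` and `importance`, are assumptions and not the paper's dynamic masking procedure.

```python
def channel_mask(importance, remaining_ratio):
    """Return a 0/1 mask that keeps the highest-scoring channels of one layer.

    importance: one (hypothetical) importance score per channel.
    remaining_ratio: fraction of channels to keep, in (0, 1].
    """
    if not 0.0 < remaining_ratio <= 1.0:
        raise ValueError("remaining_ratio must lie in (0, 1]")
    keep = max(1, round(remaining_ratio * len(importance)))
    # Channel indices sorted from most to least important.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(importance))]


# Hypothetical layer with four channels, keeping half of them.
mask = channel_mask([0.1, 0.9, 0.5, 0.3], remaining_ratio=0.5)
```

Multiplying a layer's output channels by such a mask simulates pruning without changing the tensor shapes, so the ratio can be adjusted during a search.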



Keep your Eyes on the Lane: Attention-guided Lane Detection. (arXiv:2010.12035v1 [cs.CV])

https://ift.tt/3kxeTYA

Modern lane detection methods have achieved remarkable performances in complex real-world scenarios, but many have issues maintaining real-time efficiency, which is important for autonomous vehicles. In this work, we propose LaneATT: an anchor-based deep lane detection model, which, akin to other generic deep object detectors, uses the anchors for the feature pooling step. Since lanes follow a regular pattern and are highly correlated, we hypothesize that in some cases global information may be crucial to infer their positions, especially in conditions such as occlusion, missing lane markers, and others. Thus, we propose a novel anchor-based attention mechanism that aggregates global information. The model was evaluated extensively on two of the most widely used datasets in the literature. The results show that our method outperforms the current state-of-the-art methods showing both a higher efficacy and efficiency. Moreover, we perform an ablation study and discuss efficiency trade-off options that are useful in practice. To reproduce our findings, source code and pretrained models are available at https://ift.tt/3jsMmlG



Deep Denoising For Scientific Discovery: A Case Study In Electron Microscopy. (arXiv:2010.12970v1 [cs.CV])

https://ift.tt/2Tu6RUs

Denoising is a fundamental challenge in scientific imaging. Deep convolutional neural networks (CNNs) provide the current state of the art in denoising natural images, where they produce impressive results. However, their potential has barely been explored in the context of scientific imaging. Denoising CNNs are typically trained on real natural images artificially corrupted with simulated noise. In contrast, in scientific applications, noiseless ground-truth images are usually not available. To address this issue, we propose a simulation-based denoising (SBD) framework, in which CNNs are trained on simulated images. We test the framework on data obtained from transmission electron microscopy (TEM), an imaging technique with widespread applications in material science, biology, and medicine. SBD outperforms existing techniques by a wide margin on a simulated benchmark dataset, as well as on real data. Apart from the denoised images, SBD generates likelihood maps to visualize the agreement between the structure of the denoised image and the observed data. Our results reveal shortcomings of state-of-the-art denoising architectures, such as their small field-of-view: substantially increasing the field-of-view of the CNNs allows them to exploit non-local periodic patterns in the data, which is crucial at high noise levels. In addition, we analyze the generalization capability of SBD, demonstrating that the trained networks are robust to variations of imaging parameters and of the underlying signal structure. Finally, we release the first publicly available benchmark dataset of TEM images, containing 18,000 examples.



Language-Conditioned Imitation Learning for Robot Manipulation Tasks. (arXiv:2010.12083v1 [cs.RO])

https://ift.tt/35w7ClH

Imitation learning is a popular approach for teaching motor skills to robots. However, most approaches focus on extracting policy parameters from execution traces alone (i.e., motion trajectories and perceptual data). No adequate communication channel exists between the human expert and the robot to describe critical aspects of the task, such as the properties of the target object or the intended shape of the motion. Motivated by insights into the human teaching process, we introduce a method for incorporating unstructured natural language into imitation learning. At training time, the expert can provide demonstrations along with verbal descriptions in order to describe the underlying intent (e.g., "go to the large green bowl"). The training process then interrelates these two modalities to encode the correlations between language, perception, and motion. The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity. We demonstrate in a set of simulation experiments how our approach can learn language-conditioned manipulation policies for a seven-degree-of-freedom robot arm and compare the results to a variety of alternative methods.



MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences. (arXiv:2010.11985v1 [cs.CL])

https://ift.tt/3mnfNHE

Human communication is multimodal in nature; it is through multiple modalities, i.e., language, voice, and facial expressions, that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. Learning from this data is a fundamentally challenging research problem. In this paper, we propose Multimodal Temporal Graph Attention Networks (MTGAT). MTGAT is an interpretable graph-based neural model that provides a suitable framework for analyzing this type of multimodal sequential data. We first introduce a procedure to convert unaligned multimodal sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions between different modalities through time. Then, a novel graph operation, called Multimodal Temporal Graph Attention, along with a dynamic pruning and read-out technique is designed to efficiently process this multimodal temporal graph. By learning to focus only on the important interactions within the graph, our MTGAT is able to achieve state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks including IEMOCAP and CMU-MOSI, while utilizing significantly fewer computations.



Noise2Same: Optimizing A Self-Supervised Bound for Image Denoising. (arXiv:2010.11971v1 [cs.CV])

https://ift.tt/34veDnD

Self-supervised frameworks that learn denoising models with merely individual noisy images have shown strong capability and promising performance in various image denoising tasks. Existing self-supervised denoising frameworks are mostly built upon the same theoretical foundation, where the denoising models are required to be J-invariant. However, our analyses indicate that the current theory and the J-invariance may lead to denoising models with reduced performance. In this work, we introduce Noise2Same, a novel self-supervised denoising framework. In Noise2Same, a new self-supervised loss is proposed by deriving a self-supervised upper bound of the typical supervised loss. In particular, Noise2Same requires neither J-invariance nor extra information about the noise model and can be used in a wider range of denoising applications. We analyze our proposed Noise2Same both theoretically and experimentally. The experimental results show that our Noise2Same remarkably outperforms previous self-supervised denoising methods in terms of denoising performance and training efficiency. Our code is available at https://ift.tt/2J2xnCt.



Non-local Meets Global: An Iterative Paradigm for Hyperspectral Image Restoration. (arXiv:2010.12921v1 [eess.IV])

https://ift.tt/2TuRGdT

Non-local low-rank tensor approximation has been developed as a state-of-the-art method for hyperspectral image (HSI) restoration, which includes the tasks of denoising, compressed HSI reconstruction and inpainting. Unfortunately, while its restoration performance benefits from more spectral bands, its runtime also substantially increases. In this paper, we claim that the HSI lies in a global spectral low-rank subspace, and the spectral subspaces of each full band patch group should lie in this global low-rank subspace. This motivates us to propose a unified paradigm combining the spatial and spectral properties for HSI restoration. The proposed paradigm enjoys performance superiority from the non-local spatial denoising and light computation complexity from the low-rank orthogonal basis exploration. An efficient alternating minimization algorithm with rank adaptation is developed. It is done by first solving a fidelity term-related problem for the update of a latent input image, and then learning a low-dimensional orthogonal basis and the related reduced image from the latent input image. Subsequently, non-local low-rank denoising is developed to refine the reduced image and orthogonal basis iteratively. Finally, the experiments on HSI denoising, compressed reconstruction, and inpainting tasks, with both simulated and real datasets, demonstrate its superiority with respect to state-of-the-art HSI restoration methods.



Simple Neighborhood Representative Pre-processing Boosts Outlier Detectors. (arXiv:2010.12061v1 [cs.LG])

https://ift.tt/31CucYL

Outlier detectors rely heavily on the data distribution. All outlier detectors become ineffective, for example, when the data has collective outliers or a large portion of outliers. To better handle this issue, we propose a pre-processing technique called neighborhood representative. The neighborhood representative first selects a subset of representative objects from the data, then employs outlier detectors to score the representatives. The non-representative data objects share the same score as the nearby representative object. The proposed technique is essentially an add-on to most existing outlier detectors, as it can improve accuracy by 16% on average (from 0.64 AUC to 0.74 AUC), evaluated on six datasets with nine state-of-the-art outlier detectors. In datasets with fewer outliers, the proposed technique can still improve most of the tested outlier detectors.
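
The score-sharing step can be sketched as a nearest-representative lookup. How representatives are selected is not described in the abstract, so the sketch below takes them as given; the function name, the Euclidean distance, and the toy data are assumptions.

```python
import math


def propagate_scores(points, representatives, rep_scores):
    """Give each point the outlier score of its nearest representative.

    points, representatives: sequences of equal-length coordinate tuples.
    rep_scores: one outlier score per representative, from any detector.
    """
    scores = []
    for p in points:
        nearest = min(
            range(len(representatives)),
            key=lambda i: math.dist(p, representatives[i]),
        )
        scores.append(rep_scores[nearest])
    return scores


# Hypothetical 1-D data: two representatives that have already been scored.
shared = propagate_scores([(1.0,), (9.0,), (4.0,)], [(0.0,), (10.0,)], [0.1, 0.9])
```

Since only the representatives are passed to the detector, a dense clump of collective outliers collapses to a few scored objects instead of dominating the detector's view of the data.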



LagNetViP: A Lagrangian Neural Network for Video Prediction. (arXiv:2010.12932v1 [cs.LG])

https://ift.tt/3muJ7Mn

The dominant paradigms for video prediction rely on opaque transition models where neither the equations of motion nor the underlying physical quantities of the system are easily inferred. The equations of motion, as defined by Newton's second law, describe the time evolution of a physical system state and can therefore be applied toward the determination of future system states. In this paper, we introduce a video prediction model where the equations of motion are explicitly constructed from learned representations of the underlying physical quantities. To achieve this, we simultaneously learn a low-dimensional state representation and system Lagrangian. The kinetic and potential energy terms of the Lagrangian are distinctly modelled and the low-dimensional equations of motion are explicitly constructed using the Euler-Lagrange equations. We demonstrate the efficacy of this approach for video prediction on image sequences rendered in modified OpenAI gym Pendulum-v0 and Acrobot environments.
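
For reference, the Euler-Lagrange construction mentioned above can be written out for a simple pendulum. This is the textbook derivation, not the learned model; the symbols m, l, g, and the angle θ measured from the downward vertical are assumptions made for this example.

```latex
% Lagrangian: kinetic minus potential energy in generalized coordinates q
\mathcal{L}(q, \dot{q}) = T(q, \dot{q}) - V(q)

% Euler-Lagrange equation
\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{q}} - \frac{\partial \mathcal{L}}{\partial q} = 0

% Simple pendulum: T = \tfrac{1}{2} m l^2 \dot{\theta}^2, \quad V = -m g l \cos\theta
\mathcal{L} = \tfrac{1}{2} m l^2 \dot{\theta}^2 + m g l \cos\theta

% Substituting into the Euler-Lagrange equation
m l^2 \ddot{\theta} + m g l \sin\theta = 0
\quad\Longrightarrow\quad
\ddot{\theta} = -\frac{g}{l}\sin\theta
```

Once T and V are known as functions of the state, the equations of motion follow mechanically, which is why learning the two energy terms is enough to obtain explicit dynamics.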



Zoom on the Keystrokes: Exploiting Video Calls for Keystroke Inference Attacks. (arXiv:2010.12078v1 [cs.CR])

https://ift.tt/3opQ3fJ

Due to recent world events, video calls have become the new norm for both personal and professional remote communication. However, if a participant in a video call is not careful, he/she can reveal his/her private information to others in the call. In this paper, we design and evaluate an attack framework to infer one type of such private information from the video stream of a call -- keystrokes, i.e., text typed during the call. We evaluate our video-based keystroke inference framework using different experimental settings and parameters, including different webcams, video resolutions, keyboards, clothing, and backgrounds. Our relatively high keystroke inference accuracies under commonly occurring and realistic settings highlight the need for awareness and countermeasures against such attacks. Consequently, we also propose and evaluate effective mitigation techniques that can automatically protect users when they type during a video call.



The Perception-Distortion Tradeoff. (arXiv:1711.06077v4 [cs.CV] UPDATED)

https://ift.tt/2zNnoKC

Image restoration algorithms are typically evaluated by some distortion measure (e.g. PSNR, SSIM, IFC, VIF) or by human opinion scores that quantify perceived perceptual quality. In this paper, we prove mathematically that distortion and perceptual quality are at odds with each other. Specifically, we study the optimal probability for correctly discriminating the outputs of an image restoration algorithm from real images. We show that as the mean distortion decreases, this probability must increase (indicating worse perceptual quality). As opposed to the common belief, this result holds true for any distortion measure, and is not only a problem of the PSNR or SSIM criteria. We also show that generative-adversarial-nets (GANs) provide a principled way to approach the perception-distortion bound. This constitutes theoretical support to their observed success in low-level vision tasks. Based on our analysis, we propose a new methodology for evaluating image restoration methods, and use it to perform an extensive comparison between recent super-resolution algorithms.
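
As a concrete example of a distortion measure, PSNR is a logarithmic function of the mean squared error between a restored image and its reference. The sketch below works on flat lists of pixel values; the function name and the default 8-bit peak value of 255 are assumptions for illustration.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no distortion at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel images that differ by one gray level at every pixel.
example_psnr = psnr([10, 20, 30, 40], [11, 21, 31, 41])
```

PSNR compares an output with one specific reference, whereas perceptual quality in the sense used above concerns how distinguishable the outputs are from real images in general, so the two can move in opposite directions.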



Advancing Non-Contact Vital Sign Measurement using Synthetic Avatars. (arXiv:2010.12949v1 [cs.CV])

https://ift.tt/3mpkCAj

Non-contact physiological measurement has the potential to provide low-cost, non-invasive health monitoring. However, machine vision approaches are often limited by the availability and diversity of annotated video datasets resulting in poor generalization to complex real-life conditions. To address these challenges, this work proposes the use of synthetic avatars that display facial blood flow changes and allow for systematic generation of samples under a wide variety of conditions. Our results show that training on both simulated and real video data can lead to performance gains under challenging conditions. We show state-of-the-art performance on three large benchmark datasets and improved robustness to skin type and motion.



Persian Handwritten Digit, Character, and Words Recognition by Using Deep Learning Methods. (arXiv:2010.12880v1 [cs.CV])

https://ift.tt/35Gb1hK

Digit, character, and word recognition for a particular script plays a key role in the field of pattern recognition. Optical Character Recognition (OCR) systems are now widely used in a variety of commercial applications. In recent years, there has been intensive research on optical character, digit, and word recognition. However, only a limited number of works address numeral, character, and word recognition for Persian script. In this paper, we use deep neural networks, investigating different versions of DenseNet and Xception, and compare our results with state-of-the-art methods for recognizing Persian characters, numbers, and words. Two holistic Persian handwritten datasets, HODA and Sadri, are used. To compare our proposed deep neural network with previously published studies, we consider the best state-of-the-art results. We use accuracy as our evaluation criterion. On the HODA dataset, we achieve 99.72% and 89.99% for digit and character recognition, respectively. On the Sadri dataset, we obtain accuracy rates of 99.72%, 98.32%, and 98.82% for digits, characters, and words, respectively.



via cs.CV updates on arXiv.org http://arxiv.org/

Characterizing Datasets for Social Visual Question Answering, and the New TinySocial Dataset. (arXiv:2010.11997v1 [cs.HC])

https://ift.tt/37Kq3FV

Modern social intelligence includes the ability to watch videos and answer questions about social and theory-of-mind-related content, e.g., for a scene in Harry Potter, "Is the father really upset about the boys flying the car?" Social visual question answering (social VQA) is emerging as a valuable methodology for studying social reasoning in both humans (e.g., children with autism) and AI agents. However, this problem space spans enormous variations in both videos and questions. We discuss methods for creating and characterizing social VQA datasets, including 1) crowdsourcing versus in-house authoring, including sample comparisons of two new datasets that we created (TinySocial-Crowd and TinySocial-InHouse) and the previously existing Social-IQ dataset; 2) a new rubric for characterizing the difficulty and content of a given video; and 3) a new rubric for characterizing question types. We close by describing how having well-characterized social VQA datasets will enhance the explainability of AI agents and can also inform assessments and educational interventions for people.



via cs.CV updates on arXiv.org http://arxiv.org/

The Analysis of Facial Feature Deformation using Optical Flow Algorithm. (arXiv:2010.12199v1 [cs.CV])

https://ift.tt/3kv0EUn

Facial features deform according to the intended facial expression. Specific facial features are associated with specific facial expressions; e.g., a happy expression involves deformation of the mouth. This paper presents a study of facial feature deformation for each facial expression, using an optical flow algorithm with the face segmented into three regions of interest. The deformation of facial features shows the relation between facial features and facial expressions. Based on the experiments, the deformations of the eyes and mouth are significant in all expressions except happy. For the happy expression, the cheeks and mouth are the significant regions. This work also suggests that the intensity of different facial features varies in how they contribute to recognizing different facial expression intensities. The maximum magnitude across all expressions is shown by the mouth for the surprise expression, at 9 x 10^-4, while the minimum magnitude is shown by the mouth for the angry expression, at 0.4 x 10^-4.
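
The per-region comparison described in the abstract can be illustrated by averaging optical flow magnitudes inside each region of interest. This is only a sketch under stated assumptions, not the paper's method: the flow field is a toy grid of made-up `(u, v)` vectors, and the function `mean_flow_magnitude`, the region names, and the box coordinates are all hypothetical.

```python
import math

def mean_flow_magnitude(flow, region):
    """Average magnitude of (u, v) flow vectors inside a region.

    region = (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = region
    total, count = 0.0, 0
    for row in flow[top:bottom]:
        for u, v in row[left:right]:
            total += math.hypot(u, v)
            count += 1
    return total / count if count else 0.0

# Toy 4x4 flow field; all values are invented for illustration.
flow = [[(0.0, 0.0)] * 4 for _ in range(4)]
flow[3][1] = (3.0, 4.0)
flow[3][2] = (3.0, 4.0)

# Hypothetical regions of interest, given as row/column boxes.
regions = {
    "eyes": (0, 0, 1, 4),
    "cheeks": (1, 0, 3, 4),
    "mouth": (3, 0, 4, 4),
}

magnitudes = {name: mean_flow_magnitude(flow, box) for name, box in regions.items()}
most_deformed = max(magnitudes, key=magnitudes.get)
print(magnitudes, most_deformed)
```

In practice the flow field would come from an optical flow estimator applied to a neutral frame and an expression frame; which estimator the authors used is not stated in the abstract.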



via cs.CV updates on arXiv.org http://arxiv.org/

End-to-End Jet Classification of Quarks and Gluons with the CMS Open Data. (arXiv:1902.08276v2 [hep-ex] UPDATED)

https://ift.tt/2ICXmjk

We describe the construction of end-to-end jet image classifiers based on simulated low-level detector data to discriminate quark- vs. gluon-initiated jets with high-fidelity simulated CMS Open Data. We highlight the importance of precise spatial information and demonstrate competitive performance to existing state-of-the-art jet classifiers. We further generalize the end-to-end approach to event-level classification of quark vs. gluon di-jet QCD events. We compare the fully end-to-end approach to using hand-engineered features and demonstrate that the end-to-end algorithm is robust against the effects of underlying event and pile-up.



via cs.CV updates on arXiv.org http://arxiv.org/

End-to-End Physics Event Classification with CMS Open Data: Applying Image-Based Deep Learning to Detector Data for the Direct Classification of Collision Events at the LHC. (arXiv:1807.11916v3 [hep-ex] UPDATED)

https://ift.tt/2Arhv7M

This paper describes the construction of novel end-to-end image-based classifiers that directly leverage low-level simulated detector data to discriminate signal and background processes in pp collision events at the Large Hadron Collider at CERN. To better understand what end-to-end classifiers are capable of learning from the data and to address a number of associated challenges, we distinguish the decay of the standard model Higgs boson into two photons from its leading background sources using high-fidelity simulated CMS Open Data. We demonstrate the ability of end-to-end classifiers to learn from the angular distribution of the photons recorded as electromagnetic showers, their intrinsic shapes, and the energy of their constituent hits, even when the underlying particles are not fully resolved, delivering a clear advantage in such cases over purely kinematics-based classifiers.



via cs.CV updates on arXiv.org http://arxiv.org/

Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions. (arXiv:2010.12831v1 [cs.CL])

https://ift.tt/2HFYJxT

Pre-trained contextual vision-and-language (V&L) models have brought impressive performance improvements on various benchmarks. However, the paired text-image data required for pre-training are hard to collect and scale up. We investigate whether a strong V&L representation model can be learned without text-image pairs. We propose Weakly-supervised VisualBERT with the key idea of conducting "mask-and-predict" pre-training on language-only and image-only corpora. Additionally, we introduce the object tags detected by an object recognition model as anchor points to bridge the two modalities. Evaluation on four V&L benchmarks shows that Weakly-supervised VisualBERT achieves performance similar to that of a model pre-trained with paired data. Moreover, pre-training on more image-only data further improves a model that already has access to aligned data, suggesting the possibility of utilizing billions of raw images to enhance V&L models.



via cs.CV updates on arXiv.org http://arxiv.org/
