After the system is deployed, it is used to decide on new data, including evasive data generated by the attacker. For any given malware type, PDF or otherwise, one may start with some examples made available from the past. These are either available publicly or provided privately by security teams. Additionally, security teams can create malware directly to test their own defenses. In recent years, machine learning has been used to automatically create malware samples.

In this chapter, we describe multiple methods used to create malware samples and devise a strategy of our own. Our focus remains on PDFs. Automatic creation of malicious samples can be approached in two ways: (i) supervised and (ii) unsupervised.

In the supervised approach, the focus is on creating samples that are able to evade a detector, typically a machine learning classifier itself. The unsupervised approach instead attempts to create samples that are distant in feature space but still malicious. In the next subsection, we present the supervised methods. Many features in PDF malware can be manipulated to change the presented features of a file without modifying the underlying functionality.

In this case, two types of synthesize functions are used. The first type is initial sample synthesis, which can be done by loading a folder of files and generating labels. In almost all cases, these methods rely on feedback from the classifier they are trying to evade in order to create new variants; hence we categorize them as supervised methods. The Mimicus framework presents a method to manipulate PDF classification using a mimicry attack, both by modifying mutable features and through gradient descent over attributes of the model [6, 19].

The EvadeML framework presents a black-box genetic programming approach to evade a classifier when the classification score is known [39]. The EvadeHC method evades machine learning classifiers without knowledge of the model or the classification score [14]. Other methods operate on the feature space and generate evasive features that could confuse classifiers; however, it is often unclear how to convert evasive features back into a malicious file [17].

Many other methods have been presented to deceive machine learning classifiers by exploiting the fact that the stationarity assumption does not hold in an adversarial environment [13, 27, 14, 25, 16, 35, 13, 9, 8, 18, 7]. These attacks often focus on complex classifiers such as deep learning systems, which can be overfit to rely on features that are correlated with malware rather than those necessary for malware. In [38], Wang et al. showed that complex classifiers can be evaded in the presence of even one unnecessary feature.

The authors of EvadeML have made their software open source. EvadeML uses a genetic programming method to produce tree-structured variants of malicious seeds in order to evade static classifiers such as Hidost and PDFRate. These variants are then tested against the Cuckoo sandbox to ensure that malicious activity is maintained, and then scored using static classification scores [39].

This indicates that it was successfully able to confuse the PDFRate classifier. Call this set S_mutants.
Step 2: Check which among the mutants are malicious using the oracle function o.
Step 4: Select the mutants that have classification scores greater than the cutoff and add them to the set S_evade. These files represent the ones that are able to evade the PDFRate classifier; a sketch of this selection step is shown below.
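The selection step can be summarized in a few lines of Python. This is a minimal sketch rather than the EvadeML implementation: `classify`, `oracle`, and `cutoff` are hypothetical stand-ins for the static scoring function, the oracle function o, and the evasion cutoff, and the direction of the score comparison depends on whether the score measures maliciousness or benignness.

```python
def select_evasive(s_mutants, classify, oracle, cutoff):
    """Return the mutants that remain malicious yet evade the static classifier.

    classify, oracle, and cutoff are hypothetical stand-ins for the components
    described in the text.
    """
    s_evade = []
    for mutant in s_mutants:
        if not oracle(mutant):       # Step 2: discard mutants that lost their malicious behaviour
            continue
        score = classify(mutant)     # static classification score (e.g. PDFRate)
        if score > cutoff:           # Step 4: keep mutants whose score crosses the cutoff
            s_evade.append(mutant)
    return s_evade
```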

Max-Diff algorithm. We propose the Max-Diff algorithm as an alternative way to generate malicious files. The Max-Diff algorithm is similar to the EvadeML algorithm in that it uses a pool of malicious and benign variants, scores the malicious variants, mutates the best-scoring variants, adds them to the pool of malicious files, and continues.

However, unlike the EvadeML algorithm, it does not seek to find files that receive a classification score less than the cutoff for a single classifier. Instead, it selects files that receive different classification scores from different classifiers in the system. Collect these scores in P1.
Step 4: Upload the set S_mutants to the VirusTotal website and generate VirusTotal classification scores for each mutant.
These files receive different scores from PDFRate and VirusTotal and could therefore confuse a classification system.
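To make the selection criterion concrete, the following is a hedged sketch of the Max-Diff selection step, assuming each classifier returns a score normalized to the same range; `score_pdfrate` and `score_virustotal` are hypothetical callables standing in for the two scoring back ends.

```python
def select_max_diff(s_mutants, score_pdfrate, score_virustotal, top_k=10):
    """Rank mutants by the disagreement between two classifiers (hypothetical scorers)."""
    scored = []
    for mutant in s_mutants:
        p1 = score_pdfrate(mutant)      # cheap static classifier score
        p3 = score_virustotal(mutant)   # slower, more accurate classifier score
        scored.append((abs(p1 - p3), mutant))
    # Keep the mutants with the largest disagreement; these are the files most
    # likely to confuse a combined classification system.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mutant for _, mutant in scored[:top_k]]
```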

The mutation function requires a pool of malicious files S_m and a pool of benign files S_b. The malicious files are mutated using components from the pool of benign files, as sketched below.
Step 2: Mutate each malicious PDF by randomly selecting one of the following methods:
— Insert a randomly selected sub-tree from the benign file under a randomly selected sub-tree of the malicious file.
Step 3: Write the tree representation of each mutated malicious file to a PDF file.
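A minimal sketch of this mutation step is shown below, assuming a tree representation of each PDF. The helpers `load_tree`, `random_subtree`, `insert_subtree`, and `write_pdf` are hypothetical placeholders for the actual tree-manipulation code.

```python
import random

def mutate_pool(s_m, s_b, out_dir):
    """Mutate each malicious PDF in s_m using components drawn from the benign pool s_b.

    load_tree, random_subtree, insert_subtree, and write_pdf are hypothetical
    helpers standing in for the real tree-manipulation routines.
    """
    mutated = []
    for path in s_m:
        tree = load_tree(path)                      # parse the malicious PDF into a tree
        donor = load_tree(random.choice(s_b))       # randomly selected benign file
        subtree = random_subtree(donor)             # Step 2: pick a sub-tree from the donor...
        insert_subtree(tree, subtree)               # ...and insert it into the malicious tree
        mutated.append(write_pdf(tree, out_dir))    # Step 3: write the mutated tree back to a PDF
    return mutated
```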

As shown in Figure , we observe that evasive files generated using the Max-Diff algorithm are especially effective at evading the VirusTotal classifier and achieve the same scores as benign files.

Comparing these results, we see that the more time-consuming classifier, VirusTotal, does achieve higher accuracy against evasive variants than the PDFRate classifier. However, even VirusTotal is not foolproof, which motivates the use of human analysts.

The KDE plot was generated with a Gaussian kernel of width 0. We divide the classifiers into two types: primary and secondary. All secondary models are machine learning models, whereas not all primary models are. These models enable us to develop an incremental decision system, described in Chapter 6, that in turn allows us to trade off between accuracy and resources used. For simplicity, we describe samples as an array of file paths and labels as an array of 1 for malicious and 0 for benign.
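As a tiny illustration (with hypothetical file paths), the two arrays might look like this:

```python
samples = ["data/malicious/0001.pdf", "data/benign/0001.pdf", "data/benign/0002.pdf"]  # hypothetical paths
labels = [1, 0, 0]  # 1 = malicious, 0 = benign
```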

A random forest machine learning model is then trained on the set of features and labels. Secondary classifiers operate on probabilistic scores as inputs. Secondary classifier C4 uses inputs p1, p2 to produce the probabilistic score p4. Secondary classifier C5 uses inputs p1, p2, p3 to produce the probabilistic score p5.
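A hedged sketch of how the primary and secondary models could be wired together is shown below. The feature extractor `extract_features`, the score arrays `p1`, `p2`, `p3` produced by the primary classifiers, and the choice of logistic regression for the secondary models are assumptions for illustration; the text only specifies that C1 is a random forest and that the secondary classifiers are machine learning models operating on probabilistic scores.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Primary classifier C1: a random forest trained on static features extracted
# from each file (extract_features is a hypothetical helper).
c1 = RandomForestClassifier(n_estimators=100)
c1.fit([extract_features(path) for path in samples], labels)

# Secondary classifiers take only probabilistic scores as inputs; logistic
# regression is a placeholder choice here.
c4 = LogisticRegression().fit(list(zip(p1, p2)), labels)        # C4: (p1, p2) -> p4
c5 = LogisticRegression().fit(list(zip(p1, p2, p3)), labels)    # C5: (p1, p2, p3) -> p5

p4 = c4.predict_proba(list(zip(p1, p2)))[:, 1]
p5 = c5.predict_proba(list(zip(p1, p2, p3)))[:, 1]
```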

When the Cuckoo model is used, it sends files to a running Cuckoo server, which accepts the files and returns scores p2 indicating whether known behavioural signatures of malicious files were detected.
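A hedged sketch of this interaction is shown below, assuming the standard Cuckoo REST API (the endpoints, port, and report layout vary between Cuckoo versions, and the URL is a placeholder).

```python
import time
import requests

CUCKOO_URL = "http://localhost:8090"  # placeholder address of the running Cuckoo API server

def cuckoo_score(path, poll_interval=30):
    """Submit a file for dynamic analysis and return its behavioural score (p2)."""
    with open(path, "rb") as f:
        resp = requests.post(f"{CUCKOO_URL}/tasks/create/file", files={"file": f})
    task_id = resp.json()["task_id"]

    # Poll until the analysis report is available (the report endpoint returns
    # 404 while the task is still running).
    while True:
        report = requests.get(f"{CUCKOO_URL}/tasks/report/{task_id}")
        if report.status_code == 200:
            break
        time.sleep(poll_interval)

    # The report aggregates matched behavioural signatures into a score; the
    # exact field name depends on the Cuckoo version.
    return report.json().get("info", {}).get("score", 0.0)
```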

Two secondary classifiers are developed in our active defender system. To determine in real time whether or not a new input file s is malicious, we apply a hierarchical decision system that makes use of multiple classifiers. In this system, we use three primary classifiers. Classifier C1, PDFRate, is the cheapest of all in terms of the computational time required to make a decision, but it is also the least accurate and can be evaded easily.

The Cuckoo classifier, C2, requires dynamic analysis of the file and thus needs more time than C1. VirusTotal, C3, requires us to use its API, and it takes about 2 minutes on average to receive the scores back. Among the three, VirusTotal is the most accurate. For these reasons, in developing a decision system, we considered the following goals:
— Increase throughput: we would like to make decisions for PDFs as fast as possible.
Using an input score p_i, the bi-level decision returns a result if it is certain of the classification; a sketch is given below.
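A minimal sketch of the bi-level decision, under the assumption that each classifier has a lower and an upper threshold: scores at or below the lower threshold are declared benign, scores at or above the upper threshold are declared malicious, and anything in between is deferred to the next, more expensive classifier.

```python
def bi_level_decision(p_i, lower, upper):
    """Return 0 (benign), 1 (malicious), or None when the score is uncertain."""
    if p_i <= lower:
        return 0      # confident benign: stop here and save time
    if p_i >= upper:
        return 1      # confident malicious
    return None       # uncertain: escalate to the next classifier in the hierarchy
```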

Given a fully specified decision system with classifiers C , we provide two methods of characterizing system accuracy, g1 and g2. Since the active defender system utilizes a set of thresholds to determine the decision for an input sample, as shown in Algorithm 1, tune optimizes these thresholds based on their effect on a cost function.

Tune comprises two main steps. First, the tune algorithm uses an enumeration function e to generate an initial set of thresholds T and scores them with the cost function c. Enumerating a large threshold set is important in systems with complex cost functions such as g2. If too few initial thresholds are enumerated, the optimization can converge to a local rather than global minimum of the cost function.

Second, tune uses a maximum of n iterations of Bayesian hyper-parameter tuning to propose an additional candidate threshold, evaluate it using c, add it to the threshold set T, and find the thresholds that minimize the cost function c.
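The following is a hedged sketch of the tune procedure. scikit-optimize is used here only as one possible Bayesian optimizer (the text does not name a library), and the enumeration function `e`, cost function `c`, and per-parameter bounds are stand-ins for the components described above.

```python
from skopt import gp_minimize  # one possible Bayesian optimization library

def tune(c, e, bounds, n_iterations=50):
    """Optimize the decision-system thresholds against the cost function c."""
    # Step 1: enumerate an initial set of thresholds T and score each with c.
    initial_thresholds = e()
    initial_costs = [c(t) for t in initial_thresholds]

    # Step 2: Bayesian optimization proposes further candidate thresholds,
    # evaluates them with c, and keeps track of the minimizer.
    result = gp_minimize(
        c,
        dimensions=bounds,       # (low, high) search range for each tunable parameter
        n_calls=n_iterations,
        x0=initial_thresholds,   # prior evaluations seed the surrogate model
        y0=initial_costs,
    )
    return result.x, result.fun  # thresholds minimizing the cost, and that minimum cost
```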

Bayesian optimization allows for the optimization of a black-box cost function over a set of tunable parameters. In our system, the tunable parameters for the decision system are the lower threshold in each set and the difference between the lower and upper thresholds, which is fixed to 0 for the last threshold set. This is done in two steps. We propose a simple enumeration function e1.

For example, if previous PDFRate classification scores p1 were observed between 0. More complex enumeration functions can be developed to capture a more expressive range of thresholds. A lot of work has been done to show how motivated attackers can build evasive variants. It is natural to ask how the defense mechanisms would adapt as new variants are produced. Known as active learning, this adaptation can happen over time by simply adding training examples whose labels are verified.

In [37], Veeramachaneni and Arnaldo study the use of active learning in a human-in-the-loop detection system. Using multiple outlier detection systems to send suspicious data to analysts, the system is able to improve the machine learning model over time. Building on the idea of sending data that is classified with some uncertainty by a faster, more cost-effective model to a more expensive but accurate analyst, we expand this method to use a variety of possible ways to generate more training data.

Our new training examples come from the following sources:
— Higher-accuracy classifiers: we can use predictions from VirusTotal as a possible source of truth and incorporate them as training examples. This is an expensive mechanism but still doable.
Figure : Diagram of the adapt system. The system can be adapted using synthetic data or unlabeled data.

In the case of unlabeled data, the system generates labels and final probabilities using the predictions from the previously learned models and the decision system. The adapt algorithm uses the steps shown in Figure . In the active defender system, the PDFRate classifier C1 is the only one of the primary classifiers that can be retrained to utilize additional data.
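A hedged sketch of this adapt step is shown below. The decision-system callable `decide`, the feature extractor `extract_features`, and the retrainable classifier `c1` are assumptions standing in for the components described in the text.

```python
def adapt(c1, decide, labeled_paths, labeled_y, unlabeled_paths):
    """Fold pseudo-labeled data back into the training set of the retrainable classifier C1."""
    # Generate labels for the unlabeled files from the existing decision system.
    pseudo_labels = [decide(path) for path in unlabeled_paths]

    # Retrain C1 (the PDFRate-style classifier) on the enlarged training set.
    all_paths = list(labeled_paths) + list(unlabeled_paths)
    all_labels = list(labeled_y) + pseudo_labels
    features = [extract_features(path) for path in all_paths]  # extract_features is hypothetical
    c1.fit(features, all_labels)
    return c1
```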

In the experimental design, we first split the data into two data sets, as shown in Figure . D1 corresponds to data used to train the system, and D2 is data received by the system after it is deployed. Training data: D1 is the training data available to the system before it is deployed. In our experimental setup, D1 consists of the 10, Contagio malware files and 10, benign PDFs randomly selected from the 44, benign files discussed in Section 2.

This training data consists of malicious files collected by security analysts and a corpus of collected benign PDF files. As shown in Figure , this data is split into subsets q1 through q5 and is sent to the decision system across 5 stages, or time periods. For both D1 and D2, the order of the files is randomized across trials, so splitting gives different subsets q . D1 is used to initialize the decision system.
Figure : Updating the decision system.
In setting up the decision system, we set the following tuning parameters, as described in Chapter 6 and in Chapter 7.


