ML4T Exam 1 Prep - OMSCS 7646 Machine Learning for Trading Exam 1 Prep Notes


Open - opening stock price of the day
High - highest price of the day
Low - lowest price of the day
Close - closing price of the day
Volume - how many shares traded that day altogether
Adjusted close - a historically-adjusted value of the stock that takes into account corporate actions (such as stock splits) and distributions (such as dividends issued)

Pandas DataFrame constructor: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
pandas.DataFrame.head(n=5) - first n rows
pandas.DataFrame.tail(n=5) - last n rows
Ex. df.head(5), df.tail(7)
Indexing: df[10:21] (row slice), df['Close'] (column), df['Close'].max(), df['Adj Close'].plot()
Iterating over symbols: for symbol in ['AAPL', 'IBM']: ...

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

SPY - ETF tracking the S&P 500, traded on the NYSE; US stocks had 252 trading days in 2014
S&P 500 - stock market index based on 500 large American companies listed on the NYSE or NASDAQ, a weighted mean of their stock prices
Reading and joining price data:
start_date = ..., end_date = ...
dates = pd.date_range(start_date, end_date)
df1 = pd.DataFrame(index=dates)
dfSPY = pd.read_csv("data/SPY.csv", index_col='Date', parse_dates=True, usecols=['Date', 'Adj Close'], na_values=['nan'])
df1 = df1.join(dfSPY, how='inner') - keeps only dates present in both frames
print(df1)
df1.dropna() - drops rows with NaN values (returns a new frame)
df.ix['2010-01-10':'2010-07-10'] - slice by a range of row dates (.ix is deprecated in modern pandas; use .loc)
Normalizing prices: df = df / df.ix[0], or equivalently df = df / df.ix[0, :]
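A minimal runnable sketch of the workflow above, assuming a file data/SPY.csv with Date and Adj Close columns (the filename and date range are placeholders); it uses .iloc in place of the deprecated .ix:

import pandas as pd

# Empty frame indexed by the trading dates we care about
dates = pd.date_range('2010-01-01', '2010-12-31')
df1 = pd.DataFrame(index=dates)

# Read SPY, keeping only Date and Adj Close, with dates as the index
dfSPY = pd.read_csv('data/SPY.csv', index_col='Date', parse_dates=True,
                    usecols=['Date', 'Adj Close'], na_values=['nan'])

# Inner join: keep only dates present in both frames, then drop any NaNs
df1 = df1.join(dfSPY, how='inner').dropna()

# Normalize so every series starts at 1.0
df1 = df1 / df1.iloc[0, :]
print(df1.head())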

Ndarray: nd = df.values - extracts the underlying numpy array from a DataFrame

Bollinger Bands: a way of quantifying how far a stock price has deviated from some norm - bands drawn two rolling standard deviations above and below the rolling mean:
upper_band = rolling_mean + 2 * rolling_std
lower_band = rolling_mean - 2 * rolling_std
Rolling mean: s.rolling(2).mean() (see https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html)
Daily returns: day-to-day change in stock price, daily_ret[t] = (price[t] / price[t-1]) - 1
Cumulative returns: cum_ret[t] = (price[t] / price[0]) - 1
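A sketch of these statistics in pandas, assuming prices is a Series of adjusted close prices (the 20-day default window is a common convention, not from the notes):

import pandas as pd

def bollinger_bands(prices, window=20):
    # Rolling statistics over the chosen window
    rm = prices.rolling(window).mean()
    rstd = prices.rolling(window).std()
    return rm, rm + 2 * rstd, rm - 2 * rstd

def daily_returns(prices):
    # (price[t] / price[t-1]) - 1; the first row has no prior day, so drop it
    return ((prices / prices.shift(1)) - 1).iloc[1:]

def cumulative_returns(prices):
    # (price[t] / price[0]) - 1
    return (prices / prices.iloc[0]) - 1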

Pristine data: in reality, data is an amalgamation of sources (ex. prices can vary between NYSE and NASDAQ), and not all stocks trade every day. An ETF or Exchange-Traded Fund is a basket of equities allocated in such a way that the overall portfolio tracks the performance of a stock exchange index. ETFs can be bought and sold on the market like shares. For example, SPY tracks the S&P 500 index (Standard & Poor's selection of 500 large publicly-traded companies). Thinly traded stocks - don't have high market capitalization; when there are gaps because the stock wasn't traded at that time, fill forward, and fill backward at the beginning of the series.

pandas.DataFrame.fillna(value, method, axis, inplace, limit, downcast) - method='ffill' (or 'pad') fills forward, 'bfill' (or 'backfill') fills backward
Histogram - bins of data with counts on the y-axis
Daily returns usually map to a Gaussian distribution
Positive kurtosis - fat tails on either end of the Gaussian, i.e. more occurrences in the tails than a normal distribution would predict
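A small fill-order sketch with made-up prices, using the ffill()/bfill() shorthands for fillna's forward and backward modes:

import numpy as np
import pandas as pd

prices = pd.Series([np.nan, np.nan, 10.0, np.nan, 11.0, np.nan],
                   index=pd.date_range('2010-01-04', periods=6))

prices = prices.ffill()   # fill forward first, so gaps never peek into the future
prices = prices.bfill()   # then fill backward to cover leading NaNs
print(prices)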

Plot histogram: daily_returns.hist(bins=20)

Add markers to the histogram: vertical lines at the mean and at +/- one standard deviation, and compute kurtosis
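A sketch of those additions, using placeholder random returns in place of real data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

daily_returns = pd.Series(np.random.normal(0, 0.01, 252))   # placeholder data

mean = daily_returns.mean()
std = daily_returns.std()

daily_returns.hist(bins=20)
plt.axvline(mean, color='w', linestyle='dashed', linewidth=2)
plt.axvline(std, color='r', linestyle='dashed', linewidth=2)
plt.axvline(-std, color='r', linestyle='dashed', linewidth=2)
print(daily_returns.kurtosis())   # positive => fatter tails than a Gaussian
plt.show()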

Beta - slope of a stock's relationship to the S&P, i.e. of the line fitted to a scatter plot of the stock's daily returns against the market's
Alpha - where that line intercepts the vertical axis; positive alpha = stock performing better than the S&P, negative = worse
Slope does NOT equal correlation in a scatter plot - correlation is a measure of how tightly the points fit the linear relationship
Making scatter plots of data: (see the sketch below)
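A sketch of the scatter-plot-plus-fitted-line recipe; the 'SPY' and 'XOM' columns and the random returns are placeholders:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.normal(0, 0.01, (252, 2)), columns=['SPY', 'XOM'])

df.plot(kind='scatter', x='SPY', y='XOM')
beta, alpha = np.polyfit(df['SPY'], df['XOM'], 1)   # slope = beta, intercept = alpha
plt.plot(df['SPY'], beta * df['SPY'] + alpha, '-', color='r')
plt.show()

# Correlation measures how tightly the points hug the line, not its slope
print(df.corr(method='pearson'))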

Computing statistics on portfolios: see code from Project 2!
Risk = standard deviation of returns, i.e. volatility
Sharpe Ratio - risk-adjusted return; all else being equal, lower risk is better and higher return is better. It also considers the risk-free rate of return: the interest earned on money put in a risk-free asset like a bank account or a short-term treasury (roughly 0% after mid-2015).
Rp = portfolio return
Rf = risk-free rate of return
sigma_p = std dev of portfolio return
SR = E[Rp - Rf] / sigma_p = mean(Rp - Rf) / std(Rp - Rf) = mean(daily_ret - daily_rf) / std(daily_ret - daily_rf)
SR_annualized = sqrt(# samples per year) * SR, i.e. sqrt(252) * SR for daily sampling
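The formula above as a small function; daily_rets can be a pandas Series or numpy array of daily returns, and the constant daily risk-free rate is an assumption:

import numpy as np

def sharpe_ratio(daily_rets, daily_rf=0.0, samples_per_year=252):
    # SR = mean(daily_ret - daily_rf) / std(daily_ret - daily_rf), annualized by sqrt(252)
    excess = daily_rets - daily_rf
    return np.sqrt(samples_per_year) * excess.mean() / excess.std()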

Optimizers: algorithms which can find minimum values of functions, build parameterized models based on data, and refine allocations to stocks in portfolios.
How to use an optimizer:
1) Provide a function to minimize (ex. minimize f(x) for f(x) = x^2 + 5)
2) Provide an initial guess
3) Call the optimizer (ex. gradient descent)
Scipy minimizer: (see the sketch below)
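A minimal scipy.optimize example for the f(x) = x^2 + 5 case above; SLSQP is one solver choice, and the initial guess of 2.0 is arbitrary:

import scipy.optimize as spo

def f(x):
    # x arrives as a length-1 ndarray from the optimizer
    return x[0] ** 2 + 5

result = spo.minimize(f, [2.0], method='SLSQP', options={'disp': True})
print('minimum at x = {}, f(x) = {}'.format(result.x, result.fun))   # x ~ 0, f ~ 5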

Flat areas are hard for a minimizer; local minima cause problems, and multiple global minima also cause problems; discontinuities and piecewise functions are hard to solve.
Convex problems - a real-valued function f(x) defined on an interval is called convex if the line segment between any two points on the graph of the function lies above the graph. Ex: choose two points and draw the line between them -> convex if the line stays above the graph, non-convex if any part falls below.

Building a parameterized model: find the parameters (ex. the coefficients of a line y = C0*x + C1) that minimize an error metric between the model and the data. Example error metrics: sum of absolute errors, sum of squared errors.
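A sketch of fitting a line with an optimizer and a squared-error metric; the test coefficients (4, 2) and noise level are made up for illustration:

import numpy as np
import scipy.optimize as spo

def error(C, data):
    # Sum of squared vertical distances from each point to the line y = C[0]*x + C[1]
    return np.sum((data[:, 1] - (C[0] * data[:, 0] + C[1])) ** 2)

def fit_line(data):
    guess = np.array([0.0, np.mean(data[:, 1])])   # start with a flat line at the mean y
    result = spo.minimize(error, guess, args=(data,), method='SLSQP')
    return result.x

x = np.linspace(0, 10, 21)
y = 4.0 * x + 2.0 + np.random.normal(0.0, 3.0, x.shape)   # noisy points around y = 4x + 2
print(fit_line(np.array([x, y]).T))   # roughly [4, 2]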

Cumulative return is the most trivial measure to optimize - simply investing all your money in the stock with the maximum return (and none in the others) would be the optimal portfolio, in that case. Hence it is the easiest to solve for, but probably not the best for risk mitigation.
ML problem framing: X = inputs (features), Y = outputs (predictions)

Supervised (provide example X, Y) regression (numerical prediction) learning (train with data), as opposed to classification learning (classifying an item into one of several classes).
● Linear regression - finds parameters for a model (parametric)

● K Nearest Neighbors (KNN) - instance-based: rather than munging the data down to a few parameters and throwing it away, keep the historic data (X and Y pairs) and use it directly to make new predictions
● Decision trees - store a tree structure; when new query data comes in, it bounces down the tree according to the factors at each node (i.e. is this factor value greater or less than the split value at the node), and when it reaches a leaf, that regression value is returned
● Decision forests - lots of decision trees taken together, used to generate an overall result
Ex. Stocks: use historical feature data to predict a future price. To train a model we also use the future price: pair older X data with newer future Y data.
Predictive factors - measurable quantities about a company which can be predictive of its stock price (ex. Bollinger Bands, P/E ratio, momentum, price change, etc.)
Select X1, X2, X3, then Y, then define the time period and stock universe, and then train the model.
Backtesting: how accurate are the models?
● Roll back time to test the system
● Only allow training data up to a certain point
● On the basis of that we can pick the test data
Problems with regression:
● Noisy and uncertain forecasts - confidence has to be accumulated over many trading opportunities
● Challenging to estimate confidence
● Holding time, allocation - not clear how long to hold a position that has arisen from a forecast, or how much to allocate to it
These problems can be addressed with reinforcement learning (learning a policy).
KNN (equal weight for each of the k neighbors) vs. kernel regression (a distance-based weighting factor for each neighbor) - both are non-parametric, instance-based methods.
KNN solutions: the main problem is that the beginning and end of the prediction will just be flat horizontal segments, since there is no data beyond either end to help with trends. (A minimal KNN sketch follows below.)
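A minimal 1-D KNN regression sketch with made-up training pairs (the course's learners use a richer API; this only shows the nearest-neighbor averaging idea):

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    # Instance-based: keep all training pairs; answer a query by averaging
    # the Y values of the k training points whose X is nearest the query
    nearest = np.argsort(np.abs(train_x - query))[:k]
    return train_y[nearest].mean()

train_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
train_y = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
print(knn_predict(train_x, train_y, query=3.5))   # mean of the 3 nearest neighbors' Y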

As K increases, overfitting becomes less of an issue (K=1 gives a prediction that perfectly fits the training data, K=3 one that somewhat fits it, K=N a flat line since the prediction averages over the entire dataset). For a polynomial model of degree d, overfitting becomes more likely as d increases.
Metric 1: RMSE = sqrt(sum((Y_test - Y_pred)^2) / N); in-sample = measured against the training data, out-of-sample = against the testing data
Cross validation - splitting the data into different chunks and running multiple train/test trials
Roll-forward cross validation - the training data always comes chronologically before the testing data, so the model never trains on the future
Metric 2: Correlation - np.corrcoef(); +1 means strongly positively correlated, -1 strongly negatively correlated, near 0 means little correlation
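Both metrics as small numpy functions (argument names are placeholders):

import numpy as np

def rmse(y_test, y_pred):
    # sqrt(sum((Y_test - Y_pred)^2) / N)
    return np.sqrt(np.mean((y_test - y_pred) ** 2))

def correlation(y_test, y_pred):
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
    return np.corrcoef(y_test, y_pred)[0, 1]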

As RMS error increases, the correlation between predicted and actual values generally decreases. Overfitting effects: as model complexity grows, in-sample error will continue to decrease but out-of-sample error will start to increase.

Evaluation of learning algorithms (linear regression vs. KNN): compare on space to store the model, compute time to train, compute time to query, and ease of adding new data.

Ensemble learners:
● Lower error
● Less overfitting
● Taste great (each individual learner has its own bias; combining them reduces the overall bias)
Query each model by itself and then combine the answers (take the mean!).
Bootstrap aggregating (bagging): train each learner on a random sample of the data drawn with replacement. A single 1NN model trained on all the data is more likely to overfit than an ensemble of ten 1NN learners each trained on a different random 60% of the data, since kNN with k=1 exactly matches its training data -> the ensemble provides some generalization and typically less out-of-sample error. (A bagging sketch follows below.)
Boosting: Ada(Adaptive)Boost - focus on the areas where the learners so far are least successful:
1. Train a model on a random subset of the data, then test it on the full training data; discover that some points are not well predicted (significant error).
2. Re-train with a new random subset whose sampling is weighted by the error from the previous round, so the poorly predicted points are more likely to be included for the new learner instance.
3. Test the whole system on in-sample data and measure again; keep iterating until happy with the error.
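A bagging sketch; make_learner and the train()/query() methods are a hypothetical learner interface, not from the notes:

import numpy as np

def train_bag(make_learner, train_x, train_y, bags=10, frac=0.6):
    # Each learner trains on a different random sample, drawn with replacement
    learners = []
    n = int(frac * train_x.shape[0])
    for _ in range(bags):
        idx = np.random.choice(train_x.shape[0], size=n, replace=True)
        learner = make_learner()
        learner.train(train_x[idx], train_y[idx])
        learners.append(learner)
    return learners

def bag_query(learners, query):
    # Query each model by itself, then combine the answers by taking the mean
    return np.mean([learner.query(query) for learner in learners])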

As the number of boosting iterations m increases, AdaBoost tries harder and harder to assign the difficult examples to subsequent learners, which may result in overfitting.
Decision tree example: animal identification
Factors: X1: skin texture (i.e. fur/no fur), X2: spots (yes/no), X3: size (...

