PREVENTING TAX EVASION IN COMMERCIAL TAX DEPARTMENTS THROUGH STATISTICAL MODELS

Tax administrations around the world lose billions of dollars every year to noncompliance, evasion, fraud and non-collection, across both direct and indirect taxes. With access to vast quantities of data from a range of sources (e.g. financial institutions, utilities, bank transactions, social media), both structured and unstructured (text, video, PDFs, etc.), tax authorities can increasingly use business rules, quantitative statistical models and advanced analytics to conduct audits and uncover trends and discrepancies, through techniques such as rule-based monitoring, predictive modelling and outlier detection. This paper showcases how tax authorities should move into a predictive mode, rather than a post-audit reactive mode, for zeroing in on probable risky dealers in the indirect tax domain for the highest impact of revenue recovery and collection, through a statistical, scientific and information-driven model for decision making.

… at times interventions being impossible (Reserve Bank of India, 2011). There is thus an immense need for a predictive model, or scorecard, that estimates each dealer's likelihood of risk. A transaction-time flag has the potential to predict probable tax evasion, and thus to prevent evasion rather than merely detect fraud after the audit.

Objectives of Analysis:
- To develop a predictive risk model estimating each dealer's likelihood of risk; the risk score generated can be used to select dealers for audit scrutiny.
- To distinguish the significant variables from the insignificant ones, thus prioritizing the variables to be looked into.
- To predict probable tax evasion with the help of a transaction-time flag, and thus to prevent tax evasion rather than detect fraud post audit.

Literature Review:
Kumar, Nagar and Samanta (2007) incorporated an econometric model to assess the effectiveness of direct tax administration. Their study focused on the collection of personal income tax and corporation tax at the pre-assessment and post-assessment stages. They concluded that perceived inequity of the tax system, complexity of tax laws, lack of fairness in the penalty system and weak taxpayer education programmes were the factors behind poor voluntary compliance. The study further pointed out the need for an effective information system and database to achieve effective tax administration.

The study of Singh and Sharma (2007) was based on primary data; they examined the importance and perception of tax professionals in the Indian income tax system. They identified seven significant factors that determine the effectiveness of the Indian tax system: reduction in tax evasion, extension of relief to taxpayers, incentives for dependents and honest taxpayers, broadening of the tax base, e-filing of returns, adequacy of deductions and the impact of the exempt tax system.
Datar (2010) conducted research on the direct tax code. His paper "Why the Code must be shelved" expressed his views on the importance of the direct tax code. According to him, people would have to waste a lot of time understanding the new provisions of income tax law, and the CBDT would have to issue numerous circulars and frame several rules all over again. He also argued that the proposed Code would improve neither efficiency nor tax collection, owing to deep-rooted corruption.

Hypothesis:

Hypothesis 1
H0: A statistical model used to predict the probability of riskiness of a dealer is as good as a baseline model (no model) and provides no benefit compared to random choice.
H1: A statistical model used to predict the probability of riskiness of a dealer is better than the baseline model (no model) and provides significant benefit compared to random choice.

Hypothesis 2
H0: All the factors including demographics, tax ratios and transactional variables are equally important in discriminating between a probable evader and non-evader of tax.
H1: Some of the factors are more significant than others in discriminating between a probable evader and a non-evader.
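Hypothesis 1 amounts to a one-sided test of classification accuracy against the 50% chance baseline. A minimal sketch, using a normal approximation to the binomial: the 68% accuracy is the figure reported later in the paper, while the 6,000-dealer validation set (30% of the 20,000 sample) is an assumption for illustration.

```python
import math

def baseline_test(accuracy: float, n: int, p0: float = 0.5) -> float:
    """One-sided p-value for H0: true accuracy equals p0 (no better than chance),
    via a normal approximation to the binomial distribution."""
    z = (accuracy - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal p-value

# 68% correct on an assumed 6,000-dealer validation set vs the 50% baseline
p_value = baseline_test(0.68, 6_000)
```

A p-value this far below any conventional threshold would lead to rejecting H0 in favour of H1.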

Data Collection & Variable Characteristics:
The sources of the data include the dealer master, registration data, returns data and audit risk output. The sample of 20,000 comprises 10,000 confirmed risky dealers and 10,000 confirmed non-risky dealers, as confirmed by the audit and vigilance team on historical data via audit reports. These 20,000 observations are randomly chosen from a dealer database of approximately 2.5 lakh dealers. The sampling technique is thus stratified random sampling, maintaining an equal proportion of evaders and non-evaders.
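The stratified draw described above can be sketched in a few lines; `dealers` below is a hypothetical list of `(dealer_id, risk_flag)` pairs standing in for the full dealer base.

```python
import random

# Stratified random sampling: equal numbers of audit-confirmed risky
# (flag 1) and non-risky (flag 0) dealers drawn from the full base.
def draw_balanced_sample(dealers, n_per_class=10_000, seed=42):
    rng = random.Random(seed)
    risky = [d for d in dealers if d[1] == 1]
    clean = [d for d in dealers if d[1] == 0]
    sample = rng.sample(risky, n_per_class) + rng.sample(clean, n_per_class)
    rng.shuffle(sample)  # remove any ordering between the two strata
    return sample
```

Drawing within each stratum separately is what guarantees the exact 50:50 evader/non-evader proportion, rather than leaving it to chance.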

Dependent Variable:
A response variable was created as an indicator of risk, taking the values 1 and 0: 1 indicates the dealer is risky, 0 that the dealer is not.

Independent Variables:
The following independent parameters were considered for the analysis. The entire sample is divided into training and validation sets in a 70:30 ratio. The division is random, giving every observation an equal chance of being picked and removing bias. The logistic regression model is created on the training data and model validation is performed on the validation data. The full fit model and the forward and backward stepwise regressions are the three candidate models, compared against each other in terms of misclassification.
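The 70:30 random split can be sketched as follows; `records` is a hypothetical list of the sampled dealer observations.

```python
import random

# Random 70:30 split of the sampled dealers into training and
# validation sets; every observation has an equal chance of landing
# in either set, which removes selection bias.
def train_validation_split(records, train_frac=0.70, seed=42):
    shuffled = records[:]  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

On the 20,000-dealer sample this yields 14,000 training and 6,000 validation observations.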

Variable Selection:
Data was imported into SPSS for analysis. The logistic regression modelling technique was used, and significant variables were selected after running multiple model iterations. Variables were selected on the basis of the chi-square test statistic, and the best model was expressed as a mathematical equation combining all significant variables. The candidate selection methods used are forward, backward and stepwise; the model with the least misclassification is chosen for scoring.
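The forward stepwise step can be sketched generically. Here `score_fn` is a stand-in for whatever selection criterion is used (the paper relies on SPSS and the chi-square statistic); the procedure greedily adds the variable with the largest score gain until no candidate improves the model.

```python
def forward_stepwise(variables, score_fn, min_gain=1e-6):
    """Greedy forward selection: repeatedly add the variable giving the
    largest improvement in score_fn, stopping when no variable helps."""
    selected = []
    best = score_fn(selected)
    remaining = list(variables)
    while remaining:
        gains = {v: score_fn(selected + [v]) - best for v in remaining}
        top = max(gains, key=gains.get)
        if gains[top] <= min_gain:
            break  # no remaining variable improves the model
        selected.append(top)
        remaining.remove(top)
        best += gains[top]
    return selected
```

Backward elimination is the mirror image (start from the full model and drop the least useful variable each round); stepwise selection alternates the two moves.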

Analysis & Discussions:

Variable Selection Summary:
The two variables identified as significant are age of account and the output/input (O/I) tax ratio. Age of account contributes significantly to the model and has a positive impact on the risk variable, i.e. the chances of a dealer being risky increase with the number of years in the system. The output tax to input tax credit ratio was grouped into three categories (O/I ratio equal to zero, less than 1, and greater than 1). The category where the O/I ratio is less than 1 has a positive impact on the risk variable, i.e. a dealer with an O/I ratio below 1 has a higher chance of being at risk than a dealer with an O/I ratio above 1. That is, dealers who claim more input tax credits over a period of time have the highest propensity to evade tax. Dealers with tax paid greater than 1 crore also have a higher chance of being risky.
This propels us to reject the null hypothesis of hypothesis 2 and accept the alternate hypothesis that some factors are more significant than others in discriminating between a probable evader and a non-evader.

Model coefficients:
The model coefficients give the quantum of impact of the significant variables on the target, and the sign of each coefficient shows the direction of that impact (positive or negative). The category O/I ratio less than 1 has the highest coefficient at 2.24, followed by O/I ratio greater than 1 (2.05), Revenue/Tax Paid (1.39) and Age of Account (0.245). All the coefficients are positive, showing that a one-unit increase in each variable contributes to the riskiness of the dealer.

Odds Ratio:
The odds ratio is a relative way of gauging the impact of the independent variables on the target variable. The odds ratio estimates indicate by what factor the odds of tax evasion increase for each unit change in the associated input. Thus a dealer who has been in the system for more than one year has odds of tax evasion 1.27 times those of a dealer in the system for less than a year. A one-unit change in the O/I ratio for the category less than 1 increases the odds by 9.44 times, while a one-unit change in the O/I ratio for the category greater than 1 increases the odds by 7.8 times. The odds ratio for Revenue/Tax Paid is 4.02. (Source: secondary data.)
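The reported odds ratios follow directly from the logistic coefficients as e^β; a short check using the paper's own coefficients reproduces them to rounding.

```python
import math

# Odds ratio = e^beta for each logistic regression coefficient.
# Coefficients are those reported in the paper; the e^beta values
# match the paper's odds ratios up to rounding of the coefficients.
coefficients = {
    "O/I ratio < 1": 2.24,      # reported odds ratio 9.44
    "O/I ratio > 1": 2.05,      # reported odds ratio 7.8
    "Revenue/Tax paid": 1.39,   # reported odds ratio 4.02
    "Age of account": 0.245,    # reported odds ratio 1.27
}
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
```

The small discrepancies (e.g. e^2.24 ≈ 9.39 vs the reported 9.44) come from the coefficients being rounded to two or three decimals.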

Summary from the output of Logistic Regression:
The classification table shows that the model correctly predicts a probable risky dealer as risky and a probable non-risky dealer as non-risky 68% of the time. This is significantly better than the baseline model (classification = 50%). This propels us to reject the null hypothesis of hypothesis 1 and accept the alternate hypothesis that the statistical model used to predict the probability of riskiness of a dealer is better than the baseline model (no model) and provides significant benefit compared to random choice.

Conclusion & Recommendation:
This paper has showcased how tax authorities can move into a predictive mode, rather than a post-audit reactive mode, for zeroing in on probable risky dealers in the indirect tax domain for the highest impact of revenue recovery and collection, through a statistical, scientific and information-driven model for decision making. A statistical model used to predict the probability of riskiness of a dealer is better than the baseline model (no model) and provides significant benefit compared to random choice. A lift of 18 percentage points in correct classification (68% for the model vs 50% for random choice) is significant in terms of intervening at transaction time, or pre-audit, for the highest impact of tax revenue collection and recovery. Also, the variables are not all equally significant in discriminating between a tax evader and a non-evader, and prioritization should be incorporated for the significant variables. The recommendation is to move into the predictive mode rather than the post-audit reactive mode for zeroing in on probable risky dealers, and a statistical, scientific and information-driven model for decision making is the right direction towards it.
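A classification table of the kind summarized above can be read off in a few lines; the cell counts below are hypothetical, chosen only so that a balanced 6,000-dealer validation set yields the reported 68% accuracy.

```python
# Reading a 2x2 classification (confusion) table: overall accuracy and
# lift over the 50% no-model baseline.  Counts are illustrative.
true_positive, false_negative = 2_040, 960   # actual risky dealers
true_negative, false_positive = 2_040, 960   # actual non-risky dealers

total = true_positive + false_negative + true_negative + false_positive
accuracy = (true_positive + true_negative) / total   # fraction correctly classified
lift_over_baseline = accuracy - 0.50                 # gain over random choice
```

With these counts, accuracy is 4,080 / 6,000 = 0.68 and the lift over the baseline is 0.18, i.e. the 18-percentage-point figure cited in the conclusion.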