Tuesday, October 13, 2015

SQL Select Statement Binding Order

Hi,

Recently I wrote an article about how SQL Server processes a SELECT statement during execution, whether it is run from an application or from Management Studio (SSMS).

http://www.msbimaru.blogspot.in/2015/04/sql-logical-query-processing-phases.html

Here I want to update that post and add a few more clauses to the SELECT statement execution order. Below is the order in which SQL Server binds a SELECT statement (an annotated example follows the list):

1. FROM
2. ON
3. JOIN
4. WHERE
5. GROUP BY
6. WITH CUBE or WITH ROLLUP (deprecated)
7. HAVING
8. SELECT
9. DISTINCT
10. ORDER BY
11. TOP
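
To make the binding order concrete, here is a small illustrative query (dbo.Sales and dbo.Customers are made-up tables, not from the original post) with each clause annotated with its logical processing step:

SELECT DISTINCT                            -- 9. DISTINCT (redundant with GROUP BY, shown only for the order)
       TOP (10)                            -- 11. TOP
       c.Country,                          -- 8. SELECT
       SUM(s.Amount) AS TotalAmount
FROM   dbo.Sales AS s                      -- 1. FROM
       INNER JOIN dbo.Customers AS c       -- 3. JOIN
               ON s.CustomerID = c.CustomerID   -- 2. ON
WHERE  s.OrderDate >= '2015-01-01'         -- 4. WHERE
GROUP BY c.Country                         -- 5. GROUP BY
HAVING SUM(s.Amount) > 1000                -- 7. HAVING
ORDER BY TotalAmount DESC;                 -- 10. ORDER BY

Because SELECT (step 8) is bound before ORDER BY (step 10), the alias TotalAmount can be referenced in ORDER BY but not in WHERE or GROUP BY.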

Hope this helps you understand the SELECT statement binding order.

Thursday, September 3, 2015

What is an Index? Difference between Clustered and Non-Clustered Index

Index:

An index can be created on a table to find data more quickly and efficiently.
Users cannot see the indexes; they are just used to speed up searches/queries.
Note: Updating a table with indexes takes more time than updating a table without them (because the indexes also need to be updated), so you should only create indexes on columns (and tables) that will be searched frequently.

OR

An index is a database object that can be created on one or more columns (16-column combination maximum). When the index is created, it reads the column(s) and builds a relevant data structure to minimize the number of data comparisons. An index improves the performance of data retrieval but adds some overhead to data modification, so the decision depends on how much data retrieval is performed on the table versus how many DML (INSERT, UPDATE, DELETE) operations run against it.

SQL CREATE INDEX Syntax:
Creates an index on a table. Duplicate values are allowed: 
CREATE INDEX index_name ON table_name (column_name) 

SQL CREATE UNIQUE INDEX Syntax:
Creates a unique index on a table. Duplicate values are not allowed:
CREATE UNIQUE INDEX index_name ON table_name (column_name)
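
As a concrete example (dbo.Persons is a hypothetical table used only to illustrate the syntax above):

-- Hypothetical table for the examples below.
CREATE TABLE dbo.Persons
(
    PersonID INT          NOT NULL,
    LastName VARCHAR(100) NOT NULL,
    Email    VARCHAR(255) NOT NULL
);

-- Plain (non-unique) index: duplicate LastName values are allowed.
CREATE INDEX IX_Persons_LastName ON dbo.Persons (LastName);

-- Unique index: duplicate Email values are rejected.
CREATE UNIQUE INDEX UX_Persons_Email ON dbo.Persons (Email);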

Clustered Index vs Non-Clustered Index:

1. A clustered index is a special type of index that reorders the way records in the table are physically stored; therefore a table can have only one clustered index. The leaf nodes of a clustered index contain the actual data pages.
A non-clustered index is a type of index in which the logical order of the index does not match the physical order of the rows on disk. The leaf nodes of a non-clustered index do not contain the data pages; instead, they contain index rows with a pointer (row locator) to the data.
2. A clustered index is automatically created when a primary key is defined on a table (NOTE: a non-clustered index can also be created on a primary key column). Non-clustered indexes need to be created manually.
3. A table can have only 1 clustered index, but many non-clustered indexes: up to 249 in SQL Server 2005 and up to 999 in SQL Server 2008 and later.
4. Both kinds of index can be defined on a single column or on a combination of columns (composite index).
5. The clustered index points directly to the data in the table; a non-clustered index points to a row locator that leads to the data.

A short example of both index types follows.
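
A minimal sketch (dbo.Orders is a made-up table) showing how each index type typically comes about:

-- The PRIMARY KEY constraint creates the clustered index automatically.
CREATE TABLE dbo.Orders
(
    OrderID    INT      NOT NULL
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,   -- clustered index on OrderID
    CustomerID INT      NOT NULL,
    OrderDate  DATETIME NOT NULL
);

-- A non-clustered index has to be created manually; its leaf level stores
-- the index key plus a row locator back to the clustered index key (OrderID).
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
    ON dbo.Orders (CustomerID, OrderDate);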


Recursive CTE for Month information


DECLARE @startDate datetime
DECLARE @endDate datetime

SET @startDate = '2015-01-01 00:00:00:000';
SET @endDate   = '2015-09-01 00:00:00:000';

-- Anchor member: the first day of @startDate's month and of the following month.
-- Recursive member: keep adding one month until @endDate's month is reached.
WITH monthlyRange(startMonth, startNextMonth) AS
(
    SELECT dateadd(m, datediff(m, 0, @startDate), 0),
           dateadd(m, datediff(m, 0, @startDate) + 1, 0)
    UNION ALL
    SELECT dateadd(m, 1, startMonth), dateadd(m, 1, startNextMonth)
    FROM   monthlyRange
    WHERE  startNextMonth <= dateadd(m, datediff(m, 0, @endDate), 0)
)
SELECT
    Year(monthlyRange.startMonth)  AS Year,
    Month(monthlyRange.startMonth) AS Month,
    LEFT(DATENAME(MONTH, monthlyRange.startMonth), 3) + '-'
        + CONVERT(CHAR(4), DATEPART(YY, monthlyRange.startMonth)) AS 'Month name',
    LTRIM(RTRIM(CONVERT(CHAR(4), DATEPART(YY, monthlyRange.startMonth)))) + '-'
        + LEFT(DATENAME(MONTH, monthlyRange.startMonth), 3) AS 'MonthYear',
    LTRIM(RTRIM(CONVERT(CHAR(4), DATEPART(YY, monthlyRange.startMonth)))) + '-'
        + CASE WHEN MONTH(monthlyRange.startMonth) < 10
               THEN '0' + CONVERT(VARCHAR(2), MONTH(monthlyRange.startMonth))
               ELSE CONVERT(VARCHAR(2), MONTH(monthlyRange.startMonth))
          END AS YearMonthValue
FROM monthlyRange

Recursive CTE examples for Date

Recursive CTEs for retrieving date information, such as the last 5 months' starting dates and the last 5 weeks' starting dates:

-- Last 5 month starting dates
Declare @todate datetime, @fromdate datetime
Select
    @fromdate = DATEADD(MM, -5, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)),
    @todate   = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0);

;WITH DateSequence as
(
    Select @fromdate as Weekname
    union all
    Select dateadd(MM, 1, Weekname)
    from DateSequence
    where Weekname >= @fromdate AND Weekname < @todate
)
Select * from DateSequence

-- Last 5 week starting dates
--Declare @todate datetime, @fromdate datetime
Select
    @fromdate = DATEADD(wk, -4, DATEADD(wk, DATEDIFF(wk, 7, GETDATE()), 5)),
    @todate   = DATEADD(wk, DATEDIFF(wk, 7, GETDATE()), 5);

;WITH DateSequence as
(
    Select @fromdate as Weekname
    union all
    Select dateadd(WK, 1, Weekname)
    from DateSequence
    where Weekname >= @fromdate AND Weekname < @todate
)
Select * from DateSequence


    Monday, August 31, 2015

    Primary Key Vs Unique Key in SQL Server


    Temporary Tables in SQL Server

Temporary Tables in SQL Server (a short example follows the list):

1. Are real materialized tables that exist in tempdb
2. Can be indexed
3. Can have constraints
4. Persist only for the scope of the current connection/session of the user, and are dropped automatically when the user disconnects from the instance
5. Can be referenced by queries or sub-procedures
6. Have dedicated statistics generated by the SQL engine
7. Are tables created on disk (in tempdb)
8. Use temp tables for longer or larger queries
9. Names start with # (for example, #TempTable)
10. Records can be inserted into or deleted from the temporary table just like a regular table
11. Temporary tables are stored inside the Temporary Tables folder of tempdb
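
A quick illustrative sketch (the table and column names are made up) showing a temp table with a constraint and an index, scoped to the current session:

-- Session-scoped temporary table (note the # prefix); created in tempdb.
CREATE TABLE #OrderStaging
(
    OrderID   INT           NOT NULL PRIMARY KEY,   -- constraints are allowed
    Amount    DECIMAL(10,2) NOT NULL,
    OrderDate DATETIME      NOT NULL
);

-- Temp tables can be indexed like regular tables.
CREATE NONCLUSTERED INDEX IX_OrderStaging_OrderDate ON #OrderStaging (OrderDate);

-- Insert and read just like a permanent table.
INSERT INTO #OrderStaging (OrderID, Amount, OrderDate)
VALUES (1, 150.00, '2015-08-01');

SELECT * FROM #OrderStaging;

-- Dropped automatically when the session ends, or explicitly:
DROP TABLE #OrderStaging;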


Difference Between SQL Server and the SSIS Tool


    Saturday, August 29, 2015

    Follow steps to create Stored Procedure(SP) in SQL Server

    Hi All,

These are a few baseline checks for an initial review of your SP (a small skeleton applying several of them follows the list):
• Check whether SELECT INTO can be used rather than INSERT INTO
• Check whether PARTITION BY can be used instead of GROUP BY
• Check the use of NOLOCK when joining tables; also check whether READ UNCOMMITTED is really required.
• Check that collation is set to DATABASE_DEFAULT on the join conditions of local tables. Specify the collation (of the remote table) when joining with remote tables.
• Check that an INDEX is implemented properly on the fields used in joins.
• Check the use of an INDEX on any temporary table used in the SP.
• Check whether an INNER JOIN can be used instead of a subquery
• Check whether a TABLE variable can be used instead of a temp table where the data fetched is less than 10 rows.
• Check for proper INDENTATION.
• Check that temp tables are dropped at the end of the SP.
• Microsoft SQL Server offers a SET option called NOCOUNT. It is turned off by default, so each operation returns information about the number of rows affected. However, most applications don't need this information. If you turn on the NOCOUNT option, stored procedures won't return row-count information, and therefore you'll save the network overhead involved in communicating that information to the client.
• Use the schema name with the object name (SELECT * FROM dbo.MyTable). The schema name should be used with the stored procedure name and with all objects referenced inside the stored procedure. This helps in directly finding the compiled plan instead of searching the objects in other possible schemas before finally deciding to use a cached plan.
• Do not use the prefix "sp_" in the stored procedure name: if a stored procedure name begins with "sp_", SQL Server first searches in the master database and then in the current session database. Searching in the master database causes extra overhead, and can even give a wrong result if another stored procedure with the same name is found in the master database.
• Use IF EXISTS (SELECT 1) instead of IF EXISTS (SELECT *) to check the existence of a record in another table
• Use the sp_executesql stored procedure instead of the EXECUTE statement.
• Try to avoid using SQL Server cursors whenever possible: cursors use a lot of resources and decrease performance. If we need to process records one by one in a loop, we should use a WHILE loop instead.
• Keep the transaction as short as possible: the length of a transaction affects blocking and deadlocking. Exclusive locks are not released until the end of the transaction; a lengthy transaction means locks are held longer, and locks held longer turn into blocking. In some cases, blocking also converts into deadlocks. So, for faster execution and less blocking, the transaction should be kept as short as possible.
• Use TRY...CATCH for error handling
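
As a rough sketch only (the object and parameter names are hypothetical, not from the original post), here is a procedure skeleton that applies several of the checklist items: a schema-qualified name without the "sp_" prefix, SET NOCOUNT ON, an indexed temp table that is dropped at the end, and TRY...CATCH error handling:

CREATE PROCEDURE dbo.usp_GetCustomerOrders      -- no "sp_" prefix, schema-qualified
    @CustomerID INT
AS
BEGIN
    SET NOCOUNT ON;                             -- suppress row-count messages

    BEGIN TRY
        -- Materialize a small working set with SELECT INTO.
        SELECT o.OrderID, o.OrderDate, o.Amount
        INTO   #CustomerOrders
        FROM   dbo.Orders AS o                  -- schema-qualified object
        WHERE  o.CustomerID = @CustomerID;

        -- Index the temp table before joining on it.
        CREATE CLUSTERED INDEX IX_CustomerOrders_OrderID
            ON #CustomerOrders (OrderID);

        SELECT co.OrderID, co.OrderDate, co.Amount, d.ProductID
        FROM   #CustomerOrders AS co
               INNER JOIN dbo.OrderDetails AS d
                       ON d.OrderID = co.OrderID;

        DROP TABLE #CustomerOrders;             -- drop the temp table at the end of the SP
    END TRY
    BEGIN CATCH
        IF OBJECT_ID('tempdb..#CustomerOrders') IS NOT NULL
            DROP TABLE #CustomerOrders;
        THROW;   -- re-raise to the caller (THROW needs SQL Server 2012+; use RAISERROR on older versions)
    END CATCH
END;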

What is a recursive CTE, and how many levels can we have?

    Returning hierarchical data is a common use of recursive queries
    WITH Managers AS
    (
    --initialization
    SELECT EmployeeID, LastName, ReportsTo
    FROM Employees
    WHERE ReportsTo IS NULL
    UNION ALL
    --recursive execution
    SELECT e.employeeID,e.LastName, e.ReportsTo
    FROM Employees e INNER JOIN Managers m
    ON e.ReportsTo = m.employeeID
    )
    SELECT * FROM Managers
A recursive CTE contains an initialization (anchor) query and a recursive execution query.
The initialization query returns the base result and is the highest level in the hierarchy.
The recursive execution query is then joined to the initialization query using the UNION ALL keyword.
As for how many levels: by default SQL Server stops after 100 recursion levels; the limit can be changed with the OPTION (MAXRECURSION n) query hint, where n can be up to 32767, and 0 means no limit.
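
A minimal, self-contained sketch of the recursion limit (the numbers CTE is just an illustration, not from the original post):

-- Counts from 1 to 500; without the hint this would fail once it
-- exceeds the default limit of 100 recursion levels.
WITH Numbers AS
(
    SELECT 1 AS n          -- initialization (anchor) query
    UNION ALL
    SELECT n + 1           -- recursive execution query
    FROM   Numbers
    WHERE  n < 500
)
SELECT * FROM Numbers
OPTION (MAXRECURSION 1000);   -- raise the limit (0 = unlimited, maximum 32767)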

    Sunday, April 12, 2015

    SQL - Logical Query Processing Phases

    The main statement used to retrieve data in T-SQL is the SELECT statement. Following are
    the main query clauses specified in the order that you are supposed to type them (known as
    “keyed-in order”):
    1. SELECT
    2. FROM
    3. WHERE
    4. GROUP BY
    5. HAVING
    6. ORDER BY
    But as mentioned, the logical query processing order, which is the conceptual interpretation
    order, is different. It starts with the FROM clause. Here is the logical query processing
    order of the six main query clauses:
    1. FROM
    2. WHERE
    3. GROUP BY
    4. HAVING
    5. SELECT
    6. ORDER BY
1. What is the difference between the WHERE and HAVING clauses?
    The WHERE clause is evaluated before rows are grouped, and therefore is evaluated
    per row. The HAVING clause is evaluated after rows are grouped, and therefore
    is evaluated per group.

    2. Why are you not allowed to refer to a column alias defined by the SELECT
    clause in the WHERE clause?
Because the WHERE clause is logically evaluated in a phase earlier than the one
that evaluates the SELECT clause (see the example after question 3).

    3. Why are you not allowed to refer to a column alias defined by the SELECT
    clause in the same SELECT clause?
    Because all expressions that appear in the same logical query processing phase
    are evaluated conceptually at the same point in time.
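
To illustrate question 2, here is a small hypothetical query (dbo.Orders is a made-up table): the alias cannot be used in WHERE because it is not bound yet, but it can be used in ORDER BY, which is processed after SELECT:

SELECT OrderID,
       Amount * 1.2 AS AmountWithTax
FROM   dbo.Orders
-- WHERE AmountWithTax > 100     -- fails: the alias is not yet bound in the WHERE phase
WHERE  Amount * 1.2 > 100         -- repeat the expression instead
ORDER BY AmountWithTax;           -- allowed: ORDER BY is processed after SELECT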

    Wednesday, April 8, 2015

    Truths about Script Component - SSIS

    Planning a Custom Component:


    After you have determined that the business problem cannot be solved by using any of the
    standard, built-in SSIS data flow components, and after you have determined that due to
    complexity, dependency, or reusability requirements (or restrictions), a Script component
    may also not be appropriate, you can use the following guidelines to plan the design of the
    custom component:

    1. Role 
    Is the component going to be used as a data source, a data destination, or to
    transform data?
    Typically, a custom source would be needed if none of the existing sources support the
    specific connection manager that you are using, or if an appropriate connection manager
    is not available. For instance, if the source data is extracted from an incompatible
    source or is stored in an incompatible format, you could develop a custom data source.
    Similarly, a custom destination could be designed if such an incompatible data store is
    used as the data flow’s destination.
    In most cases, when custom development is needed, it is to design a custom transformation—
    to support a particular data management operation that is not supported by
    any of the standard, built-in transformations.

    2. Usage 
    Is the source or transformation component going to use multiple outputs? Is
    the transformation or destination component going to use multiple inputs?
    A source or a transformation component can send data to multiple outputs, and a
    destination or a transformation component can receive data from multiple inputs. For
    instance, a transformation component used in merging data from multiple sources
    would have to support multiple inputs. A source component accessing a composite
    data set could be programmed to produce multiple, normalized row sets, instead of a
    single de-normalized one.

3. Access to external data
Is the component going to use additional data sources, or will it consume only data
in the data flow buffer?
    If the component will perform lookup operations or will need to access data that is
    not available in the current data flow, it will require access to external data sources. To
    access data stored in variables or parameters, the component will also need access to
    those variables and parameters.

    4. Behavior 
    Is the component going to be a blocking, a partially blocking, or a nonblocking
    component? Is it going to use synchronous or asynchronous outputs?
If the component passes rows on to its output without having to retain them, it is a
non-blocking transformation and does not hold up the data flow. If it has to retain some
of the rows, such as to calculate running totals, it is partially blocking; if it must receive
all input rows before producing any output, such as to sort them, it is blocking.
    If the transformation produces a single output row for each input row, where the result
    of the transformation can be written to one or more columns of the source row, a synchronous
    output can be used. However, if the transformation could produce one or more
    output rows for each input row, or even not produce a row at all, an asynchronous
    output would have to be used.
    New rows cannot be added to a synchronous output and cannot be removed from it.

    5. Configuration
    How will the component be configured?
    To improve the reusability of a custom component, specific settings used to control its
    operation should be exposed, allowing the developer to set them at design time, or
    even expose them to the environment. For instance, a transformation performing data
    extraction or data validation using Regular Expressions should allow the developer to
    set the expressions at design time, or even allow them to be determined automatically
    at run time.

    MDX Error handling

    Hi All,

Usually when we write MDX queries we run into issues such as division errors, especially in calculated measures, and we need to handle those scenarios in a different way. Let's go through some of the cases I have faced, along with some common ones.

    1. "#Num" error while browsing calculated measure in Excel Services:

While calculating a percentage (%), we often get division errors. In that case we have to add a check on the denominator value (for example, test whether the denominator is zero or empty and return NULL instead of dividing), so that we can eliminate the division errors.

    Example:

    2. Check for division by zero and null to avoid -1.#INF in cells?

    Below screenshot on AdventureWorks cube.

3. Check for the top level in a dimension to avoid division by a non-existing parent (-1.#INF)?

4. How to reference dimension members/measures in MDX statements when sometimes they might not exist?


    Monday, September 1, 2014

    SQL: Heap Table

    Hi All,

Last week, I attended a session on Index Architecture and asked about Heap tables and Indexed tables.

What is a Heap Table?

In simple terms, it can be defined as a table created without a clustered index; a table that has no clustered index is stored as a heap.

Then what are the uses of a Heap Table?

A heap table helps to improve the performance of INSERT statements while loading data into a table in a data warehouse, a data mart, or any database.

Why does it improve performance?

Because, while data is being loaded into a heap table, it does not need to be inserted in any particular order.
In other words, data is not stored in any particular order, so specific data cannot be retrieved quickly unless there are also non-clustered indexes. Since there is no clustered index, no additional time is needed to maintain the index, and no additional space is needed to store a clustered index tree.

Then what about an Indexed Table or Clustered Table?

Data is stored in order based on the clustered index key. Data can be retrieved quickly based on the clustered index key if the query uses the indexed columns. Data pages are linked for faster sequential access.

What are the main disadvantages of an Indexed Table or Clustered Table?

Additional time is needed to maintain the clustered index during INSERTs, UPDATEs and DELETEs.
While a DML operation is happening, SQL Server first locates the relevant index key and only then performs the insert, update or delete; this searching of index keys takes additional time. Additional space is also needed to store the clustered index tree.

Whereas a heap table does not occupy that extra space, as there is no clustered index on the table.
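
A small sketch (the table names are made up) showing a heap versus a clustered table, and how to confirm the storage type from sys.indexes:

-- Heap: no clustered index defined.
CREATE TABLE dbo.StagingSales
(
    SaleID INT   NOT NULL,
    Amount MONEY NOT NULL
);

-- Clustered table: the primary key creates a clustered index.
CREATE TABLE dbo.Sales
(
    SaleID INT   NOT NULL PRIMARY KEY CLUSTERED,
    Amount MONEY NOT NULL
);

-- type_desc shows HEAP for the first table and CLUSTERED for the second.
SELECT OBJECT_NAME(object_id) AS table_name, type_desc
FROM   sys.indexes
WHERE  object_id IN (OBJECT_ID('dbo.StagingSales'), OBJECT_ID('dbo.Sales'))
  AND  index_id IN (0, 1);   -- 0 = heap, 1 = clustered index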

There is a more detailed explanation on MSDN:

    http://msdn.microsoft.com/en-IN/library/hh213609.aspx

    Other links:

    http://www.brentozar.com/archive/2012/05/video-heaps/


Please check the above links for more reference. Thanks for reading the post, and let me know if you have any concerns. Thanks!!!

    Friday, August 8, 2014

    SSAS - Named Set Totals or Aggregations not displaying issue in Excel

    Hi All,
One of my clients wrote to me today saying, "Named sets are not showing aggregation data while browsing the data". This stumped me :(
I tried to browse the named sets in Excel by accessing the cube. The totals don't show there.


    NOTE:
When you browse a named set as a filter while browsing the cube in Management Studio or Visual Studio, you can see aggregations for the named set.

    I was trying to find a solution for this:

    Actual Named Set - [Top 20 Clients Alphabetical]:

    Filter( 
    order([Client].[Client].Members,[Measures].[ACCOUNT RANK],BASC), 

    [Measures].[ACCOUNT RANK] > 0 AND [Measures].[ACCOUNT RANK] < 21)

I checked one of the source sites for this issue:
    http://prologika.com/CS/blogs/blog/archive/2009/01/27/named-set-subtotals-in-excel.aspx

I modified the set as per the suggestion made in the above link. It worked for one of the named sets.

    Modified Named Set - [Top 20 Clients Alphabetical]:

    Filter( 
    order([Client].[Client].AllMembers,[Measures].[ACCOUNT RANK],BASC), 

    [Measures].[ACCOUNT RANK] > 0 AND [Measures].[ACCOUNT RANK] < 21)

I didn't modify it as [Client].[Client].[All], because that doesn't help in my case, so I modified it as
[Client].[Client].AllMembers only.

Then I was happy :) and I made the same change to the other named sets I have.
After I deployed and tried to browse, suddenly some of the named sets were not showing the aggregation total :(

Then I started to think about the VisualTotals function. It turns out the VisualTotals function helps to display the aggregations.

    Modified Named Set - [Top 25 Clients Alphabetical]:

    visualtotals({[Client].[Client].[All],
    Filter( 
    order([Client].[Client].AllMembers,[Measures].[ACCOUNT RANK],BASC), 
    [Measures].[ACCOUNT RANK] > 0 AND [Measures].[ACCOUNT RANK] < 21)

    })

After that, I checked and it was showing aggregation totals for the above named set. I did the same for the rest. Then one more issue popped up for other named sets: the above fix was not working :(

    Modified Named Set - [Top 100 Clients Alphabetical]:

    visualtotals({[Client].[Client].[All],
    FILTER(Order([Client].[Client].[All].Children,
    [Measures].[ACCOUNT RANK],BASC),
    ([Measures].[ACCOUNT RANK] > 0 
    AND [Measures].[ACCOUNT RANK] < 101))

    })

I used the Children function, and then it worked :)

I learnt a lot about named sets in the cube today because of the issues I faced. Hope this helps someone, and please write your comments if you have any. Thanks for reading the article.



    Wednesday, August 6, 2014

    SSRS - Changing Row Text Color Based On Condition

    Hi,

I had to implement row text color based on a column condition. Even though this is simple to do, I thought I would share it. Hope it helps!

    Scenario:
I have a simple report which displays person information from the Adventure Works database. In that person data, wherever the person's title is "Ms.", the complete row should be displayed in a different text color, because it helps to easily identify the female persons.

    Implementation:
I just take the top 20 rows from the table and a few of the columns, which looks like the image below:

Now, wherever the Title is "Ms.", the row should be in a different font color. To do this, check out the image below.

Write the expression as  =IIF(Fields!Title.Value = "Ms.","White","Black")
Then you can see the report as below.



    Friday, May 23, 2014

    SQL - Varchar vs Varchar(MAX)


    SQL - Varchar vs NVarchar


    SQL - CTE vs TEMP tables

Temp Tables in SQL Server (a short side-by-side example follows the two lists below):

1. Are real materialized tables that exist in tempdb
2. Can be indexed
3. Can have constraints
4. Persist for the life of the current connection
5. Can be referenced by queries or sub-procedures
6. Have dedicated statistics generated by the SQL engine
7. Are tables created on disk
8. Use temp tables for longer or larger queries
9. Table names look like #TableName
CTE - Common Table Expressions:
1. A CTE is an expression that can be thought of as a temporary result set which is defined within the execution of a single SQL statement. A CTE is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query.
2. Using CTEs improves readability and makes maintenance of complex queries easier
3. The query can be divided into separate, simple, logical building blocks which can then be used to build more complex CTEs until the final result set is generated
4. CTEs can be defined in functions, stored procedures, triggers and views
5. After a CTE is defined, it can be used like a table or a view, and you can SELECT, INSERT, UPDATE or DELETE through it
6. No additional statistics are stored by the SQL engine
7. No indexes
8. Cannot have constraints
9. Are essentially disposable views
10. Persist only until the next query is run
11. Can be recursive
12. CTEs and table variables can be used for small data sets
13. Can be substituted for a view
14. Can reference itself multiple times
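
A minimal side-by-side sketch (dbo.Orders is a hypothetical table) of the same aggregation done once with a CTE and once with a temp table:

-- CTE version: exists only for the duration of this single statement.
WITH CustomerTotals AS
(
    SELECT CustomerID, SUM(Amount) AS TotalAmount
    FROM   dbo.Orders
    GROUP BY CustomerID
)
SELECT * FROM CustomerTotals WHERE TotalAmount > 1000;

-- Temp table version: materialized in tempdb, can be indexed,
-- and survives for further statements in the same session.
SELECT CustomerID, SUM(Amount) AS TotalAmount
INTO   #CustomerTotals
FROM   dbo.Orders
GROUP BY CustomerID;

CREATE CLUSTERED INDEX IX_CustomerTotals ON #CustomerTotals (CustomerID);

SELECT * FROM #CustomerTotals WHERE TotalAmount > 1000;

DROP TABLE #CustomerTotals;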

    SSIS - Tuning SSIS Data Flow

    Dear All,

I have found a couple of points that help to improve performance at the data flow task level. I would like to describe them below:

    Data Flow Performance:

I consider tuning at the ETL (Extract, Transform and Load) sections.

    Extraction Level:-

Network Tuning:
1. Change the network packet size in the connection manager
•       Higher values typically yield faster throughput
•       By default it is 0 (the server default); you can raise it up to the maximum value of 32767

2. Try to experiment with shared memory vs TCP/IP
3. Enable JUMBO frames on the network. To do this, please consult your network specialists
4. If you have OLE DB Command statements, then try to create another connection manager with a low packet size (do not go with the 32767 maximum value)

SQL Queries (a sample source query applying these points follows):
1. Consider using the NOLOCK hint on tables in your source query.
•     Removes locking overhead
•     Improves the speed of large table scans
•     Has risky side effects
•     Understand them before using hints in the source query
2. Write the SELECT query with only the columns you need
3. Do the required conversions in the source query; don't go with a Data Conversion T/F later.
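
A rough illustrative source query (the table and column names are made up) that applies points 1-3: NOLOCK, an explicit column list, and conversions pushed into the query:

-- Used as the OLE DB Source "SQL command" instead of selecting the whole table.
SELECT  s.SaleID,
        s.CustomerID,
        CAST(s.Amount AS DECIMAL(12, 2)) AS Amount,   -- convert here, not in a Data Conversion T/F later
        CONVERT(DATE, s.SaleDateTime)    AS SaleDate
FROM    dbo.FactSales AS s WITH (NOLOCK)              -- understand the dirty-read risk before using
WHERE   s.SaleDateTime >= '2015-01-01';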

LookUp:-
1. Change the SELECT statement to use only the columns you need, so that it optimizes memory usage
2. Consider adding the NOLOCK hint
3. In SSIS 2008, use a shared lookup cache: create a Cache Transform and a cache connection manager
4. Use a SQL query for the reference table
5. Use a WHERE condition in the SQL query of the LookUp T/F
6. When your input row count is large, go for a partial cache or full cache lookup to improve performance
7. A no-cache lookup is a row-based operation
8. A partial cache builds the cache while the LookUp T/F is executing, and also comes with a high transaction impact. A partial cache approach is a viable solution if you have a large number of input rows
9. However, the best option is to use the "Full Cache" mode of the LookUp and filter the reference table
10. Apply an 80/20 rule and load one LookUp T/F with the 20 percent of most common matching records, which will allow 80 percent of matches in the full cache

    Transform Level:-

Different transformations:

1. Row based (synchronous)
•    Logically works row by row
•    Examples: Data Conversion, Derived Column
•    Buffers are reused
2. Partially blocking (asynchronous)
•     Works with groups of rows
•     Examples: Merge, Merge Join, Union All, LookUp
•     Data is copied to new buffers
3. Blocking (asynchronous)
•     Needs all input rows before producing any output rows
•     Examples: Aggregate, Sort, Pivot & Unpivot
•     Data is copied to new buffers
Data Types:
1. Make data types as narrow as possible so you allocate less memory for your transformations
2. Do not perform excessive casting of data types
•   It will degrade performance
•   Cast source types at the database using CAST/CONVERT functions wherever possible
Push up or down wherever possible (a MERGE sketch follows this list):
1. Data flow transformations can often be optimized
• Sort: push sorting into the source queries when possible, instead of using the Sort transformation, for example for cross-database joins
• Use the IsSorted and SortKeyPosition options to mark OLE DB source data as already sorted instead of going with the Sort transformation
• Go with the MERGE SQL statement instead of the SCD and LookUp T/Fs for handling slowly changing dimension data
• Use GROUP BY in the SQL source query instead of the Aggregation T/F
• Use an INSERT INTO statement instead of a data flow task when everything runs on a single instance
• Evaluate a DELTA load vs a RELOAD of the data
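
As a rough sketch under assumed table names (not from the original post), a MERGE statement that handles a simple type-1 dimension update in the database instead of using the SCD and LookUp transformations:

-- Type-1 style upsert: update changed rows, insert new ones.
MERGE dbo.DimCustomer AS tgt
USING dbo.StagingCustomer AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName
               OR tgt.City         <> src.City) THEN
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.City         = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, CustomerName, City)
    VALUES (src.CustomerKey, src.CustomerName, src.City);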
Loading Level:-

1. Use the SQL Server destination
•  only when the package and SQL Server are on the same machine
•  error handling is weaker than with the OLE DB Destination
2. Set the commit size to 0 for the fastest load
3. Drop some indexes based on the load growth %
4. Load data into partitioned tables
5. Truncate the table instead of using a DELETE statement

Other options to consider:

1. BLOB Temp Storage Path: storage location for Binary Large Objects
2. Buffer Temp Storage Path: change the Buffer Temp Storage Path and BLOB Temp Storage Path to drive locations that have been optimized, and ensure they are not using the C:\ system drive
3. Default Buffer Max Rows - 10,000 means no single buffer can hold more than 10,000 rows
4. Default Buffer Size - 10,485,760 specifies the number of bytes that a single buffer cannot exceed.
If a single row is 1,200 bytes, then max rows (10,000) times row width (1,200) gives
10,000 * 1,200 = 12,000,000 bytes, which is greater than the default buffer size, so the engine will reduce the number of rows per buffer
5. Engine Threads - 20
6. Run in Optimized Mode - True - ignores unused source columns, destination columns and T/Fs

I hope all the above points help you while you are trying to improve performance at the data flow task level.