Please create an account to participate in the Slashdot moderation system

 



Forgot your password?
typodupeerror
×
Programming Technology

How Would You Improve SQL? 271

theodp asks: "It was the best of languages, it was the worst of languages. SQL's handy, but it can also drive you nuts. For example, if you want all 100 columns from a table, 'SELECT *' works quite nicely. However, if you want all but 1 of the 100 columns, be prepared to spell out 99 column names. Wouldn't it not make sense to provide a Google-like shorthand notation like 'SELECT * -ColumnName' (or DROP=syntax like SAS)? So how would you improve SQL?"
This discussion has been archived. No new comments can be posted.

How Would You Improve SQL?

Comments Filter:
  • by joto ( 134244 ) on Thursday November 03, 2005 @06:39PM (#13945653)
    For example, if you want all 100 columns from a table, 'SELECT *' works quite nicely. However, if you want all but 1 of the 100 columns, be prepared to spell out 99 column names.

    If this is your main problem with SQL, then you have other problems as well. Who in their right mind needs a table with 100 columns? If you have 100 columns, you seriously need to normalize your database.

    Ok, I might not be a database buff. Actually, my experience with SQL is purely academical (although I've worked with object-oriented databases). But if I were to improve SQL, my attempts would be in the direction of making it into a more pure mapping of a relational database, not in adding yet more syntactic sugar.

  • ugh... (Score:5, Insightful)

    by Anonymous Coward on Thursday November 03, 2005 @06:39PM (#13945657)
    Did you know there have been people working on a general algebra for data management for, what 40 years now? Did you know, this is basically a SOLVED PROBLEM? Ever heard of "D"? Or Tutorial D? The Third Manifesto?

    You know, I totally understand why Fabian Pascal [dbdebunk.com] is always pissed off.

    Here's something for you to chew on:

    Why do programmers write this:

    A + 3

    when they want to add 3 to A? Why do we not write some lovely crap like:

    OPERATE ON A WITH 3 USING ADDITION

    why do we write:

    (A + 3) * 2

    and not:

    OPERATE ON (OPERATE ON A WITH 3 USING ADDITION) WITH 2 USING MULTIPLICATION

    Why do we do that?? Because algebraic notation is 1) declarative .. it tells the computer what you want, it doesn't tell it how to do it and in what order, and 2) algebraic notation is *completely general*. You can nest arbitrarily with parentheses, and you can clearly see what's a variable and what's a value and what's an operator. Easy to create, understand, and *optimize*.

    Do you agree with me that the verbose syntax clouds your thinking? Keeps you from seeing the underlying operations? Makes it difficult to apply the basic algebraic skills you learned in high school? Makes it difficult for the compiler writer to do the same?

    Now I ask you, why do we write:

    SELECT * FROM Order

    and not

    Order

    Why do we write:

    SELECT * FROM Order JOIN OrderItem WHERE Order.order_id = OrderItem.order_id

    and not

    Order JOIN OrderItem

    And here you are, dwelling on some little detail about projecting columns.. this is an easy one: use an "ALL BUT" operator for example:

    RelvarWith100Attributes ALL BUT (Attribute100)

    Once you see that relational algebra is just values, variables, and operators nested in any arbitrary way, just like arithmetic, you have opened the door a little more to understanding the fundamental theory of data management and how backwards and primitive "modern" data management is.

    And let's not even get into all the crap that SQL gives us like duplicate rows, NULLs, brain-dead table-oriented storage, lack of 100% updateable views, lack of arbitrary constraints, (often) lack of composite types (why the hell do we splat objects into MULTIPLE COLUMNS?? They should be stored in ONE column). SQL also confuses logical and physical layers (keys vs. indexes), and has basically kept the database industry in the dark ages for decades now.

    So the answer to your question is pretty simple: I would ditch SQL and use something that looks like relational algebra, which has been understood and documented for a probably longer than you've been *alive*. No offense.
  • by hawkbug ( 94280 ) <psxNO@SPAMfimble.com> on Thursday November 03, 2005 @07:19PM (#13946020) Homepage
    "You DON'T need 100 columns, ever. If you have that many columns you should be breaking the table up into several tables with 1:1 joins. Seriously. There will always be some fields that aren't needed. Make the rows smaller by putting commonly used fields in one table, rarely used rows in the other."

    I completely disagree. Breaking up the questionaire into seperate tables does nothing. I want every question every time, it's all one HTML page with checkboxes and links to the questions. If they were seperate tables, I'd just have a lot of joins every time the page loads, which isn't going to help performance. In my example, there are no rarely used rows - all rows are used equally.

    "And your example of a questionaire (1 row per answer, one column per question) is not even close to normalised. What happens if there is a new question? you have to alter the schema. what happens if some questions are not answered? you'd have to have nulls, and wasted space."

    No, I never said one row per question - I said one row per person - big difference. Also, the default values for the bit columns would be 0, since the questions are not answered. When they get marked as answered, they are changed to 1. No wasted space and every column is relevant in my example. EVERY time I want EVERY column, no exceptions. Yes, if you added a question you'd add a column. And give it a default value of 0, and then run this simple query "Update Questionaire SET NewQuestion = 0" and then you have no nulls, no wasted space.
  • by Anonymous Coward on Thursday November 03, 2005 @07:24PM (#13946081)
    In what way isn't it relational?

    1. SQL syntax doesn't look like relational algebra (see my long rant above). This clouds thinking and hides the simplicity of the underlying model.

    2. Relations are sets. SQL allows duplicate rows, so its tables aren't sets, and therefore aren't relations. (This property alone is enough to make it "not relational" by the way).

    3. Relation *attributes* (the column names) are also sets. SQL allows columns with the SAME NAME in a query result!!

    4. SQL has no "table equality" operator. You'd think the first operator you'd implement for a data type, especially a fundamental data type, would be equality! Imagine a programming language with no integer equality for instance.

    5. Relations require each attribute to be drawn from a single type or domain. SQL allows NULLs, which are values not drawn from the column's type. And SQL gives you very little to help you work without NULLs. To add insult to injury, the default for columns is NULLable.

    6. (related) Relational algebra requires boolean logic. SQL uses three-valued logic because of NULLs. And it uses it *inconsistently*.

    7. The relational model does not specify a type system, it just requires one. Yet SQL specifies it's own particular type system (integers, chars, etc). What if you want to store XML or audio in one of your columns?

    8. The relational model specifies nothing about physical implementation. Yet, almost every SQL product stores the columns of tables "together" in such a way that makes joins needlessly expensive.

    9. SQL distinguishes between "base tables" and "views". The relational model requires them to be indistinguishable to the end user. Specifically, most SQL implementations don't let you update views! Pretty unbelievable. Imagine a programming language that didn't let you pass arguments to any function for instance.

    10. SQL lets you do meaningless things like multiply the primary key values of two tables or add a weight to a height. This is related to the type system issues.

    11. SQL confuses KEYs (logical) with INDEXes (physical implementation).

    12. (This one gets me all the time) SQL has an EXISTS operator (is this statement true for at least one value of this result?) but not a FOREACH operator (is this statement true for all values in this result?) .. I think this is related to lack of table equality.

    13. SQL implementations don't have ANY brains whatsoever. They don't know that book_id from column A and book_id from column B are equal in a join, and that you don't need both of them in the query result. They don't "look inside" your CHECKs and foreign keys to deduce information about your database and use that information to optimize queries.

    I'm sure if you picked up a basic theory book you'd find plenty of other nitpicks for the syntax, the semantics, and the basic underlying model of SQL.

    And these aren't just "theoretical" problems, I run into them every day because I know there's something "more" out there. Here's a simple query you should try to do in one line of SQL: "give me a list of all customers who bought every product in product line X". Someone who knows relational theory just thinks up the solution (you just need to create a list P of all products in product line X, and pull out the list of orders where P is a subset of the order items, then join with the list of customers). Someone who only knows SQL will immediately run for the application layer, where you can't just *declare* your problem and have the app solve it, you literally have to write loops and procedural code to solve the problem.

    If you are interested in learning more, get Date's O'Reilly book "Database in Depth". It's very short, roughly 200 pages, and tells you all you need to know about data management theory.
  • by rho ( 6063 ) on Thursday November 03, 2005 @07:33PM (#13946142) Journal
    If they were seperate tables, I'd just have a lot of joins every time the page loads, which isn't going to help performance.

    That is not a SQL or database issue, it's an issue with your scripting language and Web server. This would be cached.

    You certainly can have keep the table at 100 rows. But at that point, you might ask yourself why you're using a database at all. A flatfile will probably have less overhead, even with file locking issues. Especially considering the simplistic questionaire you're using as an example--a long line of 0s and 1s would do you.

  • Infuriating (Score:4, Insightful)

    by TTK Ciar ( 698795 ) on Thursday November 03, 2005 @07:38PM (#13946176) Homepage Journal

    Out of all the annoying issues, I've pulled out the most hair over unique id's, and INSERT vs UPDATE.

    Most SQL implementations give you some way of assigning unique id's to newly INSERT'ed rows. It would be nice if there were a standardized way, but that's a side issue. Once rows have unique id's, you can identify rows to be updated by id. This is very fast and simple.

    Except .. in order to find out what id the DBMS has assigned a row, I usually have to follow my INSERT with a SELECT, to read the id column. Slow and annoying. Sometimes the DBMS I am working with takes a few seconds to perform the INSERT, and ten minutes to perform the SELECT. That takes it beyond an optimization issue, and into a workability issue.

    Also, if I do not yet know if a data record has been INSERT'ed, and I need to either UPDATE the existing record or INSERT a new one (say, with just a new timestamp), then I need to either attempt an UPDATE and then fall back on INSERT if the UPDATE fails (ew!) or attempt a SELECT and either INSERT or UPDATE depending on whether it returned any rows (ew!).

    If SQL came up with a standardized way to associate unique id's with newly INSERT'ed rows, it would be very, very nice if the id column(s) assigned were returned to the client in the same packet as the message confirming that the INSERT succeeded. Nearly zero additional overhead, neat, fast, and easy.

    To solve the UPDATE/INSERT issue, I'm less sure. Say, for instance, that I have a daemon which periodically scans the filesystems in a cluster of machines, and it wants to UPDATE the "exists" column of a given row identified by a ( hostname, mountpoint, path, filename) tuple with the current time, if that row already exists, or INSERT a whole new row for that file if it does not exist. Perhaps there could be a "WRITE" command which is just like INSERT but overwrites a row if it already exists? That seems like the wrong solution, too. In the meantime, I play with caches of hashes to unique id's and lose more hair.

    -- TTK

  • Re:ugh... (Score:5, Insightful)

    by rho ( 6063 ) on Thursday November 03, 2005 @08:01PM (#13946365) Journal
    SELECT * FROM Order JOIN OrderItem WHERE Order.order_id = OrderItem.order_id and not Order JOIN OrderItem

    Why do you assume Order and OrderItem will be joined on order_id? They don't have to, you know. So you have a "Order JOIN OrderItem ON order_id" format. Except, why do you assume both columns will have the same name? So you now have a "Order JOIN OrderItem ON order_id AND order_id_submitted" format.

    And before long, you've got an equally baroque language for describing a query as SQL is now. SQL's point is to have something reasonably human-readable. It's amazingly flexible, and easy for beginners to pick up with simple queries. Your New and Improved query language had better be leaps and bounds ahead, not just simpler to type.

    (Especially since probably 90% of queries are written only once, and stored for future use in a script or as a stored procedure. Building a query can be arduous for complex data, but it doesn't have to be typed every time you use it.)

  • by RingDev ( 879105 ) on Thursday November 03, 2005 @08:23PM (#13946509) Homepage Journal
    Lets look at something a little more realist:

    SELECT
      Lease.LeaseNum,
      Lease.LesseeNum,
      Invoice.InvoiceNum,
      Invoice.AmountBilled
    FROM
      Lease INNER JOIN
      Invoice ON
        Lease.LeaseNum = Invoice.InvoiceNum
    WHERE
      Lease.LeaseNum = "1234"
    ORDER BY
      LeaseNum, InvoiceNum

    Okay, that's pretty big to get some basic lease and invoice info. Now how you you write that?

    Lease.LeaseNum,
    Lease.LesseeNum,
    Invoic e.InvoiceNum,
    Invoice.AmountBilled
    Lease JOIN
    Invoice ON
      Lease.LeaseNum AND Invoice.LeaseNum
    Lease.LeaseNum = "1234"
    Lease.LeaseNum
    Invoice.InvoiceNum

    ??? All that's been accomplished is the removal of key words. I'm not seeing any benefit, and I'm seeing the pitfall of it being hard as hell to read.

    -Rick
  • by MobyDisk ( 75490 ) on Thursday November 03, 2005 @08:43PM (#13946643) Homepage
    I have yet to encounter a DBMS that didn't have an efficient, straightforward way to get the ID after an insert, but YMMV. However, the INSERT/UPDATE issue is a fundamental syntactical problem and it really should be fixed. INSERT and UPDATE do almost the same thing, yet have completely different syntax.

    INSERT INTO someTable (fld1,fld2) VALUES ("foo","bar")
    UPDATE someTable SET fld1=foo, fld2=bar

    It is REALLY annoying when you have to write some code that generates a SQL statement because you must code for two completely different syntaxes. Someone replied about the Oracle MERGE which seems like a nice way to go.

    Fortunately though, there are lots of good frameworks around SQL that make it so that writing SQL is becoming a thing of the past. I would like to see SQL treated like HTTP - nobody writes HTTP. It's a protocol. Let it be. Just use the tools. I guess it will never go away though...
  • by HawkingMattress ( 588824 ) on Thursday November 03, 2005 @09:31PM (#13946926)
    The schema you're talking about is wrong, wrong, wrong. You've made a schema based on how you want to display the information, not based on what this information is carrying.
    How can you make a request to find which user responded to more than 3 questions, for example ? you'd have to test each column individually, and then you'd have to modify the request if you add or remove some columns. Furthermore, you have to change the schema itself each time you add or remove a question.
    And finally, you don't use the relationnal part of the database, at all. Your schema supposes that there is a question table, with a questionid. And in your anwsers table, you have columns named after the ids, right ? Now how can the database check if you delete a question that there are no answers pointing to it ? it can't, because you're using the database as if it was a flat file. So the database can't check data integrity for you, because your schema isn't normalized at all. as Johnno74 says, you need at least a question table, a response table, and an user/question/response association table. If you don't do that, the database has zero advantages over a flat file.
  • by Johnno74 ( 252399 ) on Thursday November 03, 2005 @10:21PM (#13947182)
    Yeah, I know what you mean, I work in the real world too. A lot of the time we have to work with some awful database structures, and we just have to grin and bear it.

    Doesn't make it right tho... If you are given a clean sheet to design a system as the grandparent post suggested (questionare) and you put the results in one table with a row for each question, then you deserve to work with fucked-up systems for the rest of your life.
  • by spagetti_code ( 773137 ) on Friday November 04, 2005 @06:33AM (#13948827)
    No doubt it has defects, but SQL has a strong theoretical underpinning in set theory. This has made it a very durable language and one that scales to sizes probably unimagined by Dr Codd when he outlined its roots in 1970 in his article [acm.org] "A relational model of data for large shared data banks".

    Computings needs for well structured access and manipulation of large data sets has been well served by SQL.

    A clear replacement has yet to emerge. There are pretenders to the throne, of which Tutorial D is certainly technically nice, XQuery is a mess and ODBMSs (and their query tools) really haven't caught on.

    Its just that SQL passes a simple test - its good enough for the job and relatively ubiquitous. And standards do exist [wiscorp.com] (that every major vendor breaks. sigh.).
  • Standardization (Score:3, Insightful)

    by JediTrainer ( 314273 ) on Friday November 04, 2005 @10:52AM (#13949818)
    All I want is for every database to have the same functions; a standardized way to do the basics.

    Every vendor seems to have their own ways to define (or arbitrarily break standard) date functions (add/compare/convert/get current timestamp), string manipulation (like uppercasing, substrings and concatenation), getting the generated id from an inserted row (identity/serial/auto_increment), limiting the number of rows returned (TOP or FIRST?), getting a subset of rows (ie a standard way to get rows 100-150 that works with most DBs) or even getting a list of tables or viewing the schema.

    Trying to make an app portable across DBs is next to impossible, and that's not even counting stored procedures or different behaviours for the same syntax (like NULL handling across the various functions). This is very irritating and should have been fixed long ago. Instead, we get this crap which makes the differences we see across different web browsers look like child's play.
  • by brlewis ( 214632 ) on Friday November 04, 2005 @01:51PM (#13951560) Homepage
    No wasted space and every column is relevant in my example. EVERY time I want EVERY column, no exceptions.
    Then your example is contrived. In the real world somebody would ask, "How many people checked item 100?" or "How many people who checked item 33 also checked item 99?"
  • Re:What's the 1%? (Score:3, Insightful)

    by TopSpin ( 753 ) * on Friday November 04, 2005 @05:24PM (#13953355) Journal
    I've used exactly the non-aggregate expressions in the SELECT clause.

    This is non-optional. Something is either an aggregate expression or it isn't, so why are we expected to explain this in the statement? Probably because aggregates aren't 'first class' in terms of SQL. Aggregates are functions.

    Can you give an example where it would be something else? I too would like GROUP BY DEFAULT or somesuch.

    This is my idea:

    SELECT a, b, c
    AGGREGATE x, y, z
    FROM foo

    GROUP BY vanishes. Non-aggregates expressions are valid only in SELECT while aggregate expressions are valid only in AGGREGATE. Less ambiguous and no redundant expressions. Easier to use also; comment out the AGGREGATE clause and you've got a basic statement. Quickly remove group tuple elements with comments.

    Other ideas:

    Binary aggregate functions: OR(), for example. Clever techniques using these can eliminate lumps of code elsewhere. Example, given the values 1, 1, 5, 1, aggregate OR() yields 5, aggregate AND() yields 1. Very useful for analysis of 'detail' rows that have individual states.

    Reorder the basic clauses: FROM should be first.

    Eliminate HAVING: This is redundant with subselects.

  • Re:No poetry (Score:3, Insightful)

    by Tablizer ( 95088 ) on Saturday November 05, 2005 @02:23PM (#13958361) Journal
    A Cartesian Join is how you generate every combination of poetry so that you can patent everything.
  • by ghakko ( 261165 ) on Sunday November 06, 2005 @05:28AM (#13961855)
    Here's a semi-realistic example: suppose I have an intrusion detection system, and I'm logging packets into a table. I now want a tally of inbound traffic by source address and protocol, but want to ignore loopback and any source addresses from which I get fewer than 10 packets.
    SELECT INET_NTOA(src), prot, SUM(len) FROM packets WHERE src INET_ATON('127.0.0.1') GROUP BY src, prot HAVING COUNT(*) >= 10;
    When rewriting this query as Prolog:
    ?- bagof(tally(Src1, Prot1, Len1), (setof((Src, Prot), packet(Src, _, Prot, _, _, _, _, _, _, _, _, _, _), Srcs), length(Matches, N), N >= 10, member((Src1, Prot1), Matches), \+ inet_ntoa(Src1, '127,0.0.1'), bagof(Len, packet(Src1, _, Prot1, Len, _, _, _, _, _, _, _, _, _), Lens), sumlist(Lens, Len1)), Tallies).
    Notice that in Prolog:
    1. There's no natural way to express aggregates. To do what GROUP BY does, one has to nest all-solutions predicates like bagof/3 and setof/3.
    2. There's no syntax for named fields.
    3. The query is no longer concise and not easy to understand.

    This just covers SELECT queries. Transactional INSERT and UPDATE queries would be more complicated, because of the way backtracking works in Prolog.

    In short, I think Prolog is too general a language to be useful for queries on relational data. One really needs purpose-designed syntax to accommondate common queries.

"Look! There! Evil!.. pure and simple, total evil from the Eighth Dimension!" -- Buckaroo Banzai

Working...