
How About an Intelligent Open Source Filter?

GlitchZ28 asks: "It seems to me that the problem with Internet filters is their blanket approach: searching for words in pages and URLs, and of course the ole blacklist. It's the same as WWII blanket bombing: drop 500 bombs at one target and chances are you'll get it (along with a lot of things that weren't targets). Has anyone considered starting a little project to create a simple, very easily modified open source Internet filtering program? It would allow library officials to decide what tactics fit their needs, such as a blacklist of the obvious porn sites. I really wouldn't mind a filtering system IN public libraries if it could be scrutinized BY the public and then changed."

If we must censor content on the Internet, I would feel better about it (but not much) knowing that the censorship was done by the people rather than some bureaucrat in Congress.

Comment by michael: Many anti-censorship folks have been pushing this line for a long time: that the blacklists used in public institutions should, at minimum, be open for public inspection. This would, no doubt, help cure some of the more egregious errors. But the poster above is making an error in his reasoning. Computers do the searching because there is no other way to do it - you simply cannot categorize hundreds of millions of pages by hand, period, end of sentence. And an algorithmic approach can never fully characterize the range of human expression present on the Web - even assuming, for the moment, that you could get people to agree on what should or should not be censored, there is no way to write rules that will pick out those pages with 100% accuracy, or anything close to it. Doing so would require the development of true artificial intelligence, which isn't even on the horizon. Calling something "open source" doesn't magically make it capable of a breakthrough in artificial intelligence. When you add in the fact that with three people in a room you have four different opinions on what should and should not be censored, it should be clear that slapping an open source label on something is not going to produce an easy solution.

  • Junkbuster is a very good web ad filter. Can someone enlighten me as to why it can't be modified in some way to be used for porn? Maybe have a public blockfile somewhere that people can add to as they find porn sites. Then parents/whoever can just download it periodically, maybe run it through diff to see if the changes are OK, and then install the new version. Voila: instant porn filter, with the ability to customize the blockfile to suit each person's needs.

    Sure, there's the problem that users can change the proxy settings in Netscape/IE, but if I'm not mistaken, that can be locked down.
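
    Something like this update cycle, roughly (a Python sketch; the URL and file paths are made up for illustration, and it assumes a one-pattern-per-line list rather than Junkbuster's actual blockfile format):

    import difflib
    import shutil
    import urllib.request

    BLOCKFILE_URL = "http://example.org/public-blockfile.txt"  # hypothetical
    CURRENT = "/etc/junkbuster/blockfile"
    CANDIDATE = "/etc/junkbuster/blockfile.new"

    def fetch_candidate():
        """Download the community-maintained blockfile."""
        with urllib.request.urlopen(BLOCKFILE_URL) as resp, \
             open(CANDIDATE, "wb") as out:
            shutil.copyfileobj(resp, out)

    def review_changes():
        """Show a unified diff so a parent can vet additions before installing."""
        with open(CURRENT) as a, open(CANDIDATE) as b:
            diff = difflib.unified_diff(a.readlines(), b.readlines(),
                                        fromfile="current", tofile="candidate")
            print("".join(diff))

    def install():
        """Accept the candidate list, ideally only after reviewing the diff."""
        shutil.move(CANDIDATE, CURRENT)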

    Comments?
  • How about a personal preference-based system, like the (cough) Amazon book recommendations, Firefly, or that music recommendation thing that used to be at MIT? You'd have to do a bit of browsing to get it to work, but not only would it help you find pages, but with a bit of logic, it would allow you to determine "prohibited" pages for your kids based on what you didn't like...

    It could use the distributed databases another poster suggested, allowing for a truly public recommendation. Also, if you wanted to look for naughty things, you could use it to find precisely the ones you want.

  • ...I've always appreciated thoughts about how to use (or abuse) "the system" to the detriment of those that would force "the system" upon us.
  • Why do we call these programs "filters"? Because they "clean" the internet.

    A filter is simply a tool to selectively find or remove something. A word entered into a search engine acts as a filter to find the pages that have it, or if used with a minus sign, the pages that do not have it.

    Honestly, no adult should be forced to use a filter, but I would use one IF it only kept the smut away from me, like the bait-and-switch sites. One time I accidentally followed a link posted by a sub-troll on Slashdot and, well, I didn't really want to see that. It was disgusting; IMO it wasn't even sexual (though the image centered on that end of the body), I'm not sure it's anatomically possible, and it certainly wasn't a clinical image.

    The problem is that most filters seem to categorize based on words, but they don't look for combinations of words to determine the context a word is used in. "Breast cancer" is an often-used example: just having "breast" gets a page filtered. It could be chicken breast for all the filter knows. It could instead search the page to see whether it uses more clinical terms (semen) or more slang ones (cum). Many anime-related sites are blocked because they contain the word 'nudity', and even if you ask to have them unblocked, the answer is just that the page has nudity - even though there are no nekkid pictures - as if the person never even looked at the page. I honestly think a good Perl script could do much better than what's been done so far.
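
    For instance, a rough context-scoring sketch (in Python rather than Perl; the word lists, weights and threshold are all invented for illustration):

    import re

    FLAGGED = {"breast", "nudity"}
    CLINICAL = {"cancer", "mammogram", "oncology", "anatomy", "diagnosis"}
    EXPLICIT = {"xxx", "hardcore", "cum"}

    def score_page(text, window=5):
        """A flagged word alone isn't enough; nearby words shift the score."""
        words = re.findall(r"[a-z]+", text.lower())
        score = 0
        for i, w in enumerate(words):
            if w not in FLAGGED:
                continue
            nearby = words[max(0, i - window):i + window + 1]
            score -= sum(1 for n in nearby if n in CLINICAL)  # clinical: allow
            score += sum(2 for n in nearby if n in EXPLICIT)  # explicit: block
        return score

    def should_block(text, threshold=2):
        return score_page(text) >= threshold

    print(should_block("early detection of breast cancer by mammogram"))  # False
    print(should_block("xxx hardcore breast pics"))                       # True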
  • "Also, did you suspect the individual was a troll prior to clicking on the link (sometimes I click on trolls' links, just to see where they take me - but I am never surprised by where they go)?"

    I forgot to check the post rating; it was -1 and I didn't notice. There was nothing about the link's URL that made me think it led to something I'd rather not see.

    You are right, no adult should have a filter forced on them; no government should make that choice for me. You have my agreement.
  • I know of at least one GPL site-filter tool: GUILT [squidly.org], the GnU Internet Lust Terminator. It is being developed specifically in response to Australia's draconian filtering laws. Currently the filter packages that meet the requirements of the law (all proprietary commercial stuff) are bad in three ways: 1) they are proprietary, not open source; 2) they block far more than the law requires; 3) they don't let you see the list of blocked URLs anyway. GUILT attempts to solve this.

    It's still in development, and *needs* developers and testers; please help [squidly.org]!
  • I'm about as libertarian as they come. I find filters noxious for both technical and moral reasons, and see no possible excuse for their use with adults in libraries*. But for schoolchildren there are issues that need to be raised.
    1) There are sites that simply have no educational value. What possible reason could a child have for looking at Danni's hotbox?
    2) They can prevent bait-and-switch sites, or even accidental typing. To wit, whitehouse.com and hotboy.com - a teacher cannot monitor everybody in a lab at the same time, and I'd hate to have to explain to a 4th grader why there's a site like hotboy- and he/she would be old enough to be curious.
    A teenager looking up items at a school where I worked was terribly traumatized when a site advertising support for sexual-abuse issues turned out to be really selling porn involving rape fantasies.
    3) There is no way you're going to be able to convince a school board not to have filtering. Period. If the issue comes up, you will have to have it, regardless of cost. If anybody has examples to the contrary, I'd love to hear of them.

    Bearing all these things in mind, and bearing in mind that having such a plugin for Apache or Squid could further the use of open source software, a project similar to dmoz needs to be started. This one could err on the side of not filtering stuff, too. This issue is too important to leave to the vagaries of corporatism.

    *Although the comment - I think I saw it on www.librarian.net [librarian.net] - to the effect of a librarian saying that she drew the line on free speech when a patron next to her was viewing bestiality websites, does make a point about the unfortunate (but necessary) price of free speech.
  • It would be a great win if the free software community came up with something like this. If not, I fear that government-imposed rating, or an ugly proprietary scheme, is inevitable. Further, the system must be integrated into the browser if you want enough participation to achieve reasonable coverage. So, clearly, Mozilla integration is a great opportunity.

    I would highly encourage anyone interested to begin thinking about how to make this part of Mozilla.
  • I don't really mind seeing porn from time to time, and I don't have any kids. What I would like, though, is a good distributed web moderation system.

    This is already done to some degree with some search engines, especially specialized ones for books or movies. But I would love to see something built into my browser that would allow me (securely and privately, of course) to set my preferences:
    (intelligent > 2 | funny > 4) & (commercial < 1)
    so that links to other stuff are colored differently. Of course, obtaining moderator status would be hard, so it would probably have to be a Firefly kind of thing where your views are matched with others who share them (securely and privately, of course). Consider it the poor man's intelligent agent.
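
    A toy evaluator for preference expressions like that one (the category names and scores below are invented; a real system would pull moderation scores from a shared database):

    def matches(prefs, scores):
        """prefs is a list of AND-ed clauses; each clause is a list of
        OR-ed (category, op, threshold) tests."""
        ops = {">": lambda a, b: a > b, "<": lambda a, b: a < b}
        return all(any(ops[op](scores.get(cat, 0), t) for cat, op, t in clause)
                   for clause in prefs)

    # Mirrors (intelligent > 2 | funny > 4) & (commercial < 1)
    prefs = [[("intelligent", ">", 2), ("funny", ">", 4)],
             [("commercial", "<", 1)]]

    link_scores = {"intelligent": 3, "funny": 1, "commercial": 0}
    print(matches(prefs, link_scores))  # True: color this link as worth reading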
  • "You can't categorize millions of pages by hand, period."

    Huh? What happened to the Open Ideal of "many hands, many eyes, many minds"? How many millions of lines of code have been "categorized" by how many programmers? How many hours of thought are behind each one?

    How many CD IDs and associated track lists have been categorized by users of the free (and non-free) CD databases?

    What experiences and thought processes make you so sure that it can't be done, period?

  • When Junkbuster achieves popularity, it will be subverted.
    Besides, Junkbuster works because many ads include obvious clues in their URLs, like *ad*, *banner*, *promo*, and are served from centralized sites like DoubleClick and other media brokers.

    Porn and racist URLs can be much more diverse, and they usually point to pages on the same site. It would be difficult to list every site that hosts censurable pages.
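
    A minimal illustration of why the pattern approach works for ads but not for the rest (the patterns below are loosely Junkbuster-flavored inventions, not its real blockfile syntax):

    import re

    AD_PATTERNS = [r"/ad[sv]?/", r"banner", r"promo", r"doubleclick\.net"]

    def looks_like_ad(url):
        """Ad URLs tend to self-identify; cheap pattern lists catch them."""
        return any(re.search(p, url, re.IGNORECASE) for p in AD_PATTERNS)

    print(looks_like_ad("http://ad.doubleclick.net/banner/123"))   # True
    print(looks_like_ad("http://example.com/gallery/img042.jpg"))  # False:
    # nothing in the second URL says what the image actually shows.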
  • That seems similar to PICS (not that I know so much about PICS).

    But you would end up blocking entire sites that use dynamic links, like Slashdot.
    Slashdot can contain censurable comments anywhere, and those comments can live at a huge number of URLs, because parameters are embedded in the URL.

    As Tim Berners-Lee says somewhere on the W3C site [w3.org], URLs should not be stuffed with representation details; those should be negotiated.
  • Why do we call these programs "filters"? Because they "clean" the internet. However, that makes one grand assumption - that the internet contains an abundance of dirtiness that, for some reason, should not be viewed by human eyes.

    The fallacy in this reasoning is the idea that information can be dirty or clean, impure or proper, amoral or moral. I submit that all information is neither - that it only becomes so in the mind of the viewer.

    One thing I have noticed in my 27 years of life (which I realize is by no means a long life view, but it is what I have to work with) is that those people who are the most vocal about being and living "clean" tend to be the same ones who are secretly "dirty" behind the back of society. Those people who oppose or don't care about the issue are labeled "dirty" by these same people, because these "unclean" people hold up a mirror to the way they really are. They feel that if they could rid the world of "dirtiness", they themselves might become clean, and could thus dispense with their secret.

    This idea is a perversion of logic. Those people who do this generally have failed to be honest with themselves and others. They usually don't realize that by being truthful (though it may be painful) they can rid themselves of the issue. The rest of the "dirty" world has already realized this.

    We cannot pretend to protect children and adults from things which, even if severely impeded by all technical means (short of brain modification - which I am sure is coming), would still arise in their thoughts anyway (show me a child above the age of six who has never thought about sex - we are kidding ourselves if we believe that children don't). I find it amusing how many people rail and rise against the whole issue of porn without ever considering that the porn will always be there, because it is supply and demand. Do these people really think they can stop mankind from viewing sex, when the thought of sex and its pleasure is hard-wired into our minds to make us procreate? If sex weren't important, why would it be pleasurable?

    I don't think we need filters. I think we need more rationality and intelligence.
  • by cr0sh ( 43134 )
    But in the minds of those who want to impose filters onto the rest of the population, filters form a way for them to delude themselves into thinking that a filter will make themselves "clean", by "removing" the "mirror" that stares back at them, reminding them of their own secrets.

    I sympathize with your experience of "bait-and-switch" sites - I have run into similar sites, but I wouldn't want to impose a safeguard against them on myself, as they only "surprise" me rarely (it would be akin to keeping a small boat in the middle of a desert, just in case it floods). You mentioned you got surprised by a troll. Without knowing the link to the site, did you check where the link was pointing before clicking on it? Also, did you suspect the individual was a troll prior to clicking on the link (sometimes I click on trolls' links, just to see where they take me - but I am never surprised by where they go)?

    The only time I could see that you would want a filter for yourself would be in a work environment - one which is so draconian they monitor every move you make (and log all internet traffic). If that was a problem, I would wonder why you would continue to work there. No job is worth that kind of paranoia...
  • There is one answer which should never be accepted when a citizen goes to a government agency with a question, and that is "You don't need to know that." When the question is "What sites am I not allowed to see from the library workstations, and why?", the only answer that is acceptable in a free society is a list of site URLs and criteria. Call it a card anti-catalog - a list of things you can't find in the library - but it's all the same. When the library installs filters on the computers, it has made that your business.

    When a library buys a filter from a company which keeps its filter list secret, this principle of public accountability is violated. This looks to me like it could be grounds for a citizen lawsuit against a library, demanding that the filter list be published or the filtering be removed. This could serve both ways; people who don't want legitimate information restricted could hammer on the filter companies for their abuses, and people who do want porn, violence, hate speech or whatnot blocked could make certain that there aren't any critical omissions in the lists either. But I think the most likely outcome is that the filter companies would stop selling to libraries, along with a small flurry of lawsuits by people and organizations like Peacefire against companies like Mattel, for libelling them in their filter classifications (Peacefire, pornographic and violent? HAH!). To keep their filter lists "secret" (as if they'll be safe from the amateur cryptographers and hackers!) they'll have to stop selling to public organizations. Voila, the problem is stamped out at the source.

  • by bluGill ( 862 ) on Tuesday April 04, 2000 @08:37AM (#1152148)

    There is GroupLens, which applies something like this to Usenet. Check out their work.

    Unfortunately, the only work I know of going on with this is by a few professors I had in college. You can look at their web pages: Joseph Konstan [umn.edu], John Riedl [umn.edu]. The latter site has a lot more information (i.e., a link to something).

    It is Usenet-only, but I think this is a way to start. Then we just need a way to rate pages that everyone works on and can partially agree on. If 100 people call a web site porn and 10 call it interesting, I'd not want my children to see it. If 5 call it porn and 100 call it interesting, it might be interesting, but not viewable (by my children, whom I want to protect more than most /. readers would) without a parent to decide if the child needs to know about breast cancer in that level of detail (to give an example of where useful information and porn can cross).

    Then there is violence. I wouldn't want to view a "violent" web site myself if the site were movies and pictures of one person murdering another. However, if the website were hunting videos, I personally consider that normal content and would let anyone of any age see it. A collaborative filter allows me (with time) to build up a personal database cross-referenced with others. Then it can say "100 other parents who tend to think like you have called this [murder] site bad" vs. "Many people call this [hunting] site violent, but those who tend to agree with you recommend it." If the entire world rates every site they visit out of habit, this could be useful.

    Note that above I intentionally stuck to single issues. The hunting site with naked hunters is unacceptable (to me), even though the hunting content may be good. Filtering software must be complex enough to handle all these issues, and yet be simple enough that people actually use it.
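
    A rough sketch of the "parents who tend to think like you" idea - user-to-user collaborative filtering in the GroupLens spirit, with all names and ratings invented (-1 = block, +1 = fine):

    from math import sqrt

    ratings = {  # user -> {site: rating}
        "me":      {"hunting-videos": 1, "murder-pics": -1},
        "parent2": {"hunting-videos": 1, "murder-pics": -1, "new-site": -1},
        "parent3": {"hunting-videos": -1, "new-site": 1},
    }

    def similarity(a, b):
        """Cosine similarity over the sites both users rated."""
        common = set(ratings[a]) & set(ratings[b])
        if not common:
            return 0.0
        dot = sum(ratings[a][s] * ratings[b][s] for s in common)
        na = sqrt(sum(ratings[a][s] ** 2 for s in common))
        nb = sqrt(sum(ratings[b][s] ** 2 for s in common))
        return dot / (na * nb)

    def predict(user, site):
        """Weight others' ratings of site by how much they agree with user."""
        votes = [(similarity(user, u), r[site])
                 for u, r in ratings.items() if u != user and site in r]
        total = sum(abs(w) for w, _ in votes)
        return sum(w * v for w, v in votes) / total if total else 0.0

    print(predict("me", "new-site"))  # negative: people who think like me dislike it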

  • by jd ( 1658 ) <imipak@ y a hoo.com> on Wednesday April 05, 2000 @06:31AM (#1152149) Homepage Journal
    Have a filter which searches a networked database, according to either default or user-entered search criteria. Each database filters by URL. In the case of FTP, Gopher, WAIS or Telnet, this would involve having the filter translate these into URL format.

    The idea here is that you can control whose database you search, and exactly how you search. E.g., if you want to eliminate all sites that mention Libertarianism and have a 10 in their IP address, you should be able to do so.

    This would let the user pick and choose whose database best met their screening requirements, AND control what sort of screening was being done.

    My suggestion for the database format would be a simple one:

    Main Record:

    • Field 1: The URL of the site
    • Field 2: Keyword list for site
    • Field 3: Sex content (1-100)
    • Field 4: Violence content (1-100)
    • Field 5: Graphics content (1-100)
    • Field 6: Commercial content (1-100)
    • Field 7: Religious content (1-100)
    • Field 8: Political content (1-100)
    • Field 9: Educational content (1-100)
    • Field 10: Medical content (1-100)
    • Field 11: Military content (1-100)

    The owners of the database would be responsible for how values are assigned to each site. Some may use automated searches, others may elect to screen pages themselves, and yet others may ask contributors to visit the sites and rate them.

    The idea of having so many different scores is that you can include sites that would otherwise be excluded. E.g., medical sites dealing with AIDS issues are going to mention sex. On the other hand, the information is also clearly both medical and educational, and so would have a rating in both of those categories too.

    So, to search for AIDS stuff, you'd probably want a filter that you could instruct to exclude any sites with a sex score > (medical + educational).
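
    A minimal sketch of that record format and query (the ratings and URLs below are invented for illustration):

    from dataclasses import dataclass, field

    @dataclass
    class SiteRecord:
        url: str                                       # Field 1
        keywords: list = field(default_factory=list)   # Field 2
        # Fields 3-11: each rated 1-100 by the database owner
        sex: int = 0
        violence: int = 0
        graphics: int = 0
        commercial: int = 0
        religious: int = 0
        political: int = 0
        educational: int = 0
        medical: int = 0
        military: int = 0

    def passes_aids_filter(rec):
        """Exclude sites whose sex score exceeds medical + educational."""
        return rec.sex <= rec.medical + rec.educational

    clinic = SiteRecord("http://example.org/aids-info", ["AIDS", "HIV"],
                        sex=30, medical=80, educational=70)
    smut = SiteRecord("http://example.org/xxx", ["xxx"], sex=95, medical=5)

    print(passes_aids_filter(clinic))  # True: medical + educational outweighs sex
    print(passes_aids_filter(smut))    # False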

    This doesn't deal with bogus ratings produced by people cloning an innocuous title page and then using a redirect to send you to the "real" site. However, you can filter those out by simply adding the following set of records:

    Index record:

    • Field 1: URL of source page
    • Field 2: URL of destination page

    Fields 1 and 2 form the combined key for the record.

    The database then computes the values not just of that page, but of all pages to a user-specified depth (not exceeding some sensible value, or you'd end up with a DoS attack). The values used would then be some weighted average of the values obtained. This would remove the risk of fraudulent title pages, or other forms of deliberate deception.
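
    A sketch of that depth-weighted scoring, with an invented link graph and scores - the innocuous front page inherits the rating of the page it redirects to:

    links = {  # source URL -> destination URLs (the index records above)
        "front": ["real-site"],
        "real-site": [],
    }
    sex_score = {"front": 1, "real-site": 95}

    def effective_score(url, max_depth=2, decay=0.5):
        """Weighted average over everything reachable within max_depth."""
        total, weight_sum, w = 0.0, 0.0, 1.0
        frontier, seen = {url}, set()
        for _ in range(max_depth + 1):
            frontier -= seen
            if not frontier:
                break
            for u in frontier:
                total += w * sex_score.get(u, 0)
                weight_sum += w
            seen |= frontier
            frontier = {d for u in frontier for d in links.get(u, [])}
            w *= decay
        return total / weight_sum if weight_sum else 0.0

    print(effective_score("front"))  # 33.0, far above the front page's own 1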

    Because this system is so completely under the control of the user, it is not, in any sense of the word, blocking free speech or censoring anyone. The user has specifically selected the database and the criteria for exclusion. Anything blocked, then, is blocked in full knowledge and by the deliberate hand of the user, who has every right to decide what they will and won't experience.

    Also, by having networked databases, anyone can set a database up. If someone sets up a system that others feel is unfair, biased or unreasonably censorious, they're free to set up their own database. If the users agree with their opinion, they'll switch. If they don't, well, too bad. They're entitled to their opinions, too, even if that means disagreeing.

    This system could be extended, with a very simple front-end, to be a search engine as well as a filter. It works both ways. And, because of its design, you will get far fewer bogus hits. Instead of a few thousand hits, most of which are irrelevant sites and/or scams, you'll get just what you want, because you've screened out everything else.

"If it ain't broke, don't fix it." - Bert Lantz

Working...