OptEngine: Because Internet. YouCrawl. You Crawl. uCrawl.
OptEngine.org is an opt-in search engine and open source database/index: you have to submit URLs to the search engine for it to index them. Why is this important? Most of a search engine's potency comes from how often, how fast, and how deep it can crawl the web for content. The reason small engines can't compete with Google is that they are not as good at crawling. A ranking algorithm is easy to design, and it is easy for anyone to compete with Google's ranking algorithm, but not with their crawling capability, which likely takes billions of dollars of servers. If humans crawl the web for free, fewer ads will be needed to support the engine and it will be easier for small engines to compete against the big ones. More importantly, humans get to decide which pages are important and worth the time to index, which limits the index to the highest quality pages rather than every page that exists. This database will also be open source and always available for download, so any search engine can use the OptEngine database/index for free as a basis for, or a part of, its own index.
OptEngine may include an option to have your page archived at archive.org or a similar service.
OptEngine will likely not respect robots.txt, since you are opting in by telling us the URL to index. While it is true that someone who is not the owner of the URL can tell us to index it, an owner who really doesn't want their page to be seen can keep the URL private. We are not the ones crawling the web, you are. Why not just respect robots.txt anyway? With the rise of social media and premade sites like Blogger and Wix, these platforms can include a robots.txt that their customers either don't know about, have a hard time changing, or cannot change. This is why we feel it is important not to respect robots.txt, so everyone can get their content to the world regardless of the digital ghetto they publish on. Below we outline our own opt-in and opt-out options that can be used instead (see the #Opty and #OptOuty phrases further down). As for nofollow links, we do not use links to rank pages, so nofollow means nothing to OptEngine.
OptEngine will start as English only and will hopefully branch out to other languages over time.
OptEngine will require you to burn any amount of Monero (XMR) to get your page into the index. Why is this? We don't want bots spamming the database, either with the same URL many times or with worthless pages, in order to bog down the index. Cryptocurrency is a proof of work method: holding it proves that work was done to earn it. Burning means you send the Monero to a provably unusable address so no one can use it. Isn't this wasting Monero? Yes and no. Yes, because the Monero can never be spent; no, because taking it out of circulation increases the scarcity of Monero, so everyone's Monero is worth more. Why Monero and not Bitcoin? First, because the transaction fees for Monero are lower (typically well under 1 cent), and second, because Monero has a tail emission, so we will never fully deplete Monero from circulation no matter how popular OptEngine gets. Every URL you want to index will require a separate burn transaction. Depending on the current transaction fees on the Monero network, you could get your link posted for well under 1 cent. A new burn address will be made roughly every 10 XMR that accrues in the address (about 20,000 pages at today's Monero value) so that these burn addresses don't become a lucrative target for future quantum computer hackers. Or we can use another method for proof of burn like this one. If Monero transaction fees rise above 1/1000th of the average monthly income in the country where that income is lowest (currently $0.041, since the DR Congo average is $41 a month), then we will consider adding other cryptocurrency options, provided they have a tail emission.
In addition to burning Monero dust (dust means an insignificant amount of Monero), completing a CAPTCHA or equivalent will also be required, to help prove that human work was done and not only machine work. You can submit up to 10 URLs at once, so you don't have to complete a captcha for every URL; that would be burdensome when trying to get your whole site indexed in a reasonable time frame. Just note that if you submit 10 URLs at a time, you will need proof of 10 separate burn transactions.
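The two numeric rules above (the fee ceiling and the burn address rotation) are simple arithmetic, so here is a rough sketch of them. All constants are the figures quoted in this post, not live data, and the function names are made up for illustration.

```python
# Figures quoted in the post; none of these are live values.
LOWEST_AVG_MONTHLY_INCOME_USD = 41.0  # DR Congo average monthly income
BURN_ADDRESS_CAP_XMR = 10.0           # rotate the burn address at roughly 10 XMR
PAGES_PER_ADDRESS = 20_000            # the post's estimate of pages per address


def fee_ceiling_usd(monthly_income_usd: float = LOWEST_AVG_MONTHLY_INCOME_USD) -> float:
    """1/1000th of the lowest average monthly income."""
    return monthly_income_usd / 1000


def should_consider_other_coins(current_fee_usd: float) -> bool:
    """True once the Monero transaction fee rises above the ceiling."""
    return current_fee_usd > fee_ceiling_usd()


def needs_new_burn_address(accrued_xmr: float) -> bool:
    """True once an address has accrued about 10 XMR."""
    return accrued_xmr >= BURN_ADDRESS_CAP_XMR


def implied_xmr_per_page() -> float:
    """Average XMR per indexed page implied by the post's two estimates."""
    return BURN_ADDRESS_CAP_XMR / PAGES_PER_ADDRESS
```

With the quoted figures the ceiling works out to $0.041 per transaction, and the 20,000 page estimate implies an average of 0.0005 XMR per page.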
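As a minimal sketch of the batch rule just described (one captcha, at most 10 URLs, one separate burn proof per URL), something like the following could work. The `Submission` type and the `burn_proof` string are hypothetical stand-ins; the sketch does not check the burn on the Monero network at all.

```python
from dataclasses import dataclass

MAX_URLS_PER_CAPTCHA = 10  # from the post: up to 10 URLs per captcha


@dataclass(frozen=True)
class Submission:
    url: str
    burn_proof: str  # hypothetical: whatever key proves the burn transaction


def validate_batch(submissions: list[Submission], captcha_passed: bool) -> list[str]:
    """Return a list of problems; an empty list means the batch is acceptable."""
    problems = []
    if not captcha_passed:
        problems.append("captcha not completed")
    if not submissions:
        problems.append("no URLs submitted")
    if len(submissions) > MAX_URLS_PER_CAPTCHA:
        problems.append(f"more than {MAX_URLS_PER_CAPTCHA} URLs in one batch")
    proofs = [s.burn_proof for s in submissions]
    if any(not p for p in proofs):
        problems.append("missing burn proof")
    # Each URL needs its own burn transaction, so proofs may not repeat.
    if len(set(proofs)) != len(proofs):
        problems.append("burn proofs must be separate transactions")
    return problems
```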
OptEngine search will rank results based on a few factors. These factors are listed in order of importance, with factor 1 weighted more heavily than factor 2 and so on.
Numbers 1-4 are the index fields, and they will be scraped from the site automatically when the URL is submitted for indexing along with proof of the Monero burn and the captcha. However, at launch the scraping algorithm will likely not be complete, so the user will have to input fields 1-4 manually. This shouldn't be hard, as they can just copy and paste from their page. It will be on the honor system that the info you provide is what is in the page itself. Very few, I think, would want to take the time to mislead people about the content of their page, but I'm sure it will happen as a prank from time to time, and you can downvote that page. When the scraping algorithm is complete and included in OptEngine, you will have the option to either scrape the needed data automatically or input it manually, giving you maximum control over how your page data is stored. What prevents people from uploading the same content under different URLs, or using other pages' data to fill in the index fields? The Monero burn and the captcha make this sort of trolling labor intensive, and downvoting is another way to discourage the practice. In our mind, the benefit of using Human Intelligence (HI) to tune the field input so the searcher can better find what they are looking for is worth the risk of misuse.
1: Categories. Categories are the 3 longest words (or first 3 words) in your <title> tag. Each word is independent; there is no exact phrase matching here.
2: Title. Title is the 20 longest words in your <title> tag. Each word is independent; there is no exact phrase matching here.
3: Summary. Summary is the first 1,000 characters of the content in your <body> tag. Exact phrase matches can be taken from here.
4: Text. Text is the content in your <body> tag between character 1,000 and character 30,000 (so up to 29,000 characters). Exact phrase matches can be taken from here.
At launch, those 4 fields will be all you can search, using typical boolean queries. The following will not be included at launch but will be added later as possible.
5: Sorting Algorithm. If results rank the same in the above relevance ranking (or even if they don't, just to provide more customization of the results), the user can apply the following criteria to further filter or rank the results.
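To make the four fields concrete, here is a rough sketch of how they could be cut out of a page once the title and body text have already been extracted. It uses the "longest words" reading for Categories and Title; lowercasing, dropping repeated words, and breaking length ties by title order are assumptions of this sketch, not decided behavior.

```python
import re


def longest_words(text: str, n: int) -> list[str]:
    """Return up to n distinct words from text, longest first."""
    words = re.findall(r"\w+", text.lower())
    unique = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    # sorted() is stable, so words of equal length keep their title order.
    return sorted(unique, key=len, reverse=True)[:n]


def build_index_fields(title: str, body_text: str) -> dict:
    """Split a page into the four searchable fields described above."""
    return {
        "categories": longest_words(title, 3),  # field 1
        "title": longest_words(title, 20),      # field 2
        "summary": body_text[:1000],            # field 3: first 1,000 characters
        "text": body_text[1000:30000],          # field 4: up to 29,000 characters
    }
```

For a title like "How to make free teeth powder at home", the categories come out as "powder", "teeth", and "make".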
Category
Length
Date
Popularity (hit count)
Upvote # (every IP can vote up or down once)
Upvote %
Burn Amount (amount of Monero burned). To gain in rankings in this category you can burn more Monero than is required to index your URL.
OptEngine Rank - Our best guess at ordering. It will combine Category//Length//Date//Popularity//Upvote # & %//Burn Amount, each weighted according to our best theory, and those percentages will be open source.
Design your own Rank - Tune all the above factors into what you think are the optimum ratios. The result can look something like #AABCDD6742E3, a code you can copy and paste into future searches without having to sign in or anything. There will be no IP address saving (except for voting on post rankings), no search saves, no sign-ins, and no other info gathering whatsoever.
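The format of a rank code like #AABCDD6742E3 is not decided yet. Purely as an illustration, the sketch below assumes one hex digit (0 to 15) per factor, in the order the factors are listed above, and ignores any extra digits. It also assumes each factor has already been scaled to a number between 0 and 1 for every page. Both the encoding and the scaling are hypothetical.

```python
# Hypothetical factor order, following the list above.
FACTORS = [
    "category", "length", "date", "popularity",
    "upvote_count", "upvote_percent", "burn_amount",
]


def decode_rank_code(code: str) -> dict[str, float]:
    """Turn a code such as '#AABCDD6742E3' into weights that sum to 1."""
    digits = code.lstrip("#")
    if len(digits) < len(FACTORS):
        raise ValueError("rank code is too short")
    raw = [int(d, 16) for d in digits[:len(FACTORS)]]  # one hex digit per factor
    total = sum(raw)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {name: value / total for name, value in zip(FACTORS, raw)}


def score(factor_values: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of a page's factor values; missing factors count as 0."""
    return sum(weights[name] * factor_values.get(name, 0.0) for name in FACTORS)
```

Because the code carries all the weights, it can be pasted into a later search and decoded again with nothing stored on the server.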
There will also at some point be a special button next to a search result that says "links". What this does is show a list of webpages that have linked to the particular result you are considering.
So here is what OptEngine will be at launch: a page with a search box and a "search it" button, and a URL input box with an "Opt it" button. "Opt it" is short for "index my URL using OptEngine". So OptEngine will be a simple search site and URL indexing site. When you enter a URL and click "Opt it", you will be presented with a page that has 5 boxes. The first box will ask for your Monero transaction key, or another key that proves you burned Monero; the second box will ask for 3 keywords, aka categories, to describe your page; the third box will ask for 20 keywords, aka the title of your page; the fourth box will ask for 1,000 characters max, the summary of your page; and the fifth box will ask for 30,000 characters max, the content of your page. Complete the captcha and your URL, along with the data you provided, will be added to the index.
The entire index will be publicly available, open source, and always available for full download, like commoncrawl.org. We hope other search engines use the OptEngine index as a part of their search algorithms.
What happens if a webpage is submitted for indexing multiple times? This is a tough question. We want the most recent indexing to replace the previous one. We are not an archival service, which is why we want to partner with archive services to save people's pages permanently on those sites. One thing we may do once our scraping algorithm has launched is require that any subsequent update to the index data go through the scraper, so someone can't troll other users' page indexes. Another option is to require an "opt-in" phrase on the page, such as #Opty (this is currently the preferred option). As long as a page has this somewhere in it, the page index can be updated manually. If #Opty is not present, the page index can be updated only by our scraper, if we are asked to do that, although the page can still be indexed manually the first time without #Opty. If you do not want the scraper or a manual change to update your URL's index, you can put the phrase #OptOuty somewhere in your text. However, to prevent abuse by digital platforms, an #Opty on the page overrides an #OptOuty, nullifying it and allowing both manual and scraped indexing. Another option is to let users provide an email address to be notified of any change to the pages they want to watch. We don't like this option, though, as it would make our database a target for law enforcement. Even upvoting, which would need to store IP addresses, is something we really don't want to do if it is not 100% necessary. We are not sure what will be required at this time.
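The #Opty and #OptOuty rules for updating a page that is already in the index can be sketched as a small decision function. This only covers the preferred option as described above; treating the phrases as case-sensitive whole words is an assumption of the sketch, and first-time indexing is left out.

```python
import re

OPT_IN = re.compile(r"#Opty\b")      # assumed case-sensitive, whole word
OPT_OUT = re.compile(r"#OptOuty\b")


def allowed_update_methods(page_text: str) -> set[str]:
    """Which kinds of update may change the index entry of an already indexed page."""
    if OPT_IN.search(page_text):
        # #Opty overrides #OptOuty: manual and scraped updates are both allowed.
        return {"manual", "scraper"}
    if OPT_OUT.search(page_text):
        # Opted out with no #Opty on the page: nothing may update the entry.
        return set()
    # Neither phrase present: only the scraper may update the entry.
    return {"scraper"}
```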
How will OptEngine store all this data and make it easily accessible? We will begin on Amazon AWS S3 storage. It allows 1 TB of data storage and access for only $40 a month (the same cost as an Amazon seller account). With 1 TB of data we estimate we could host up to 1 billion page indexes (assuming most page indexes will be around 200 words or 1,000 characters, about 1 kB). Google currently indexes 18 billion pages, so we believe 1 billion may be the most we could hope for, since only the highest quality sites are likely to be indexed by us because it takes work for someone to get a page indexed. This means we estimate our maximum hosting cost at only around $40 per month. Talk about a good deal! In the future we want this data replicated across multiple hosting platforms, including our own, and in multiple languages, so our costs might rise by up to 100 fold. At that point, though, we would be so widely adopted and used that we would be shocked if we couldn't raise $4,000 a month to cover the cost of basically hosting the entire useful internet!
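The storage estimate is back-of-the-envelope arithmetic, sketched below. The $40 per TB per month figure and the 1 kB per page figure are the estimates from this post, not quoted AWS prices, and one byte per character is assumed.

```python
BYTES_PER_TB = 10**12
BYTES_PER_PAGE_INDEX = 1_000     # post's estimate: about 1,000 characters, ~1 kB
MONTHLY_COST_PER_TB_USD = 40.0   # post's figure, not a quoted AWS price


def pages_per_tb(bytes_per_page: int = BYTES_PER_PAGE_INDEX) -> int:
    """How many page indexes of the given size fit in 1 TB."""
    return BYTES_PER_TB // bytes_per_page


def monthly_cost_usd(page_count: int, replication_factor: int = 1) -> float:
    """Hosting cost for page_count average-size entries, copied replication_factor times."""
    tb_needed = page_count * BYTES_PER_PAGE_INDEX * replication_factor / BYTES_PER_TB
    return tb_needed * MONTHLY_COST_PER_TB_USD
```

Here `pages_per_tb()` gives the 1 billion figure, and `monthly_cost_usd(10**9, 100)` gives the $4,000 figure. The estimate is sensitive to entry size: a page that fills the whole 30,000 character limit is about 30 times larger, and `pages_per_tb(30_000)` comes to roughly 33 million pages per TB.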
How will OptEngine.org make money to support itself? It will accept donations and may also sell merchandise. It may also sell ad spots at the bottom of the page as simple links. These links may be ranked by the sponsor's monthly donation amount, and the maximum allowed donation for sponsorship could be set somewhere around $100 per month, just so the competition stays under control. If several sponsors contribute the same amount, priority in ranking will be given to those that have been supporting the longest. OptEngine will never boost search ranks, show ads within search results, or show banner ads, popups, or even sidebars. Sponsor spots will only be text links in 12pt font with a maximum of 20 characters, always at the bottom of the page, and it will never be necessary to scroll down that far to use the Engine to its fullest.
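The sponsor ordering rule can be sketched as a single sort: higher monthly donation first (capped at the maximum), and for equal amounts, the longest-running sponsor first. The `Sponsor` type is hypothetical, and dropping links longer than 20 characters (rather than, say, rejecting them at signup) is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import date

MAX_MONTHLY_DONATION_USD = 100.0  # "somewhere around $100 per month"
MAX_LINK_CHARS = 20               # sponsor links are at most 20 characters


@dataclass(frozen=True)
class Sponsor:
    link_text: str
    monthly_donation_usd: float
    supporting_since: date


def order_sponsors(sponsors: list[Sponsor]) -> list[Sponsor]:
    """Order sponsor links for the bottom of the page."""
    eligible = [s for s in sponsors if len(s.link_text) <= MAX_LINK_CHARS]
    return sorted(
        eligible,
        key=lambda s: (
            -min(s.monthly_donation_usd, MAX_MONTHLY_DONATION_USD),  # bigger donation first
            s.supporting_since,                                      # earlier supporter first
        ),
    )
```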