[03:09] *** bwn_ has joined #urlteam
[03:21] *** bwn has quit IRC (Ping timeout: 1221 seconds)
[03:30] *** bwn_ has quit IRC (Read error: Operation timed out)
[04:58] *** JesseW has joined #urlteam
[04:58] *** svchfoo3 sets mode: +o JesseW
[05:05] *** marvinw has quit IRC (Read error: Operation timed out)
[05:37] *** marvinw has joined #urlteam
[05:51] *** WinterFox has joined #urlteam
[06:00] Is there any way to search the urls collected so far?
[06:00] WinterFox: yep -- download them and search them locally. :-)
[06:00] If you're referring to the ADrive search Start requested -- I'm going to get on that soon.
[06:01] But it would be great to have more people with a full corpus, as I'm currently wrestling with migre.me
[06:01] I did get all that infocon stuff up but it looks like some of it was already on archive.org
[06:01] yep. it happens.
[06:01] JesseW: about how big is the total corpus right now?
[06:02] Also I don't seem to have permissions to change the data types to audio
[06:02] WinterFox: send a note to info@archive.org -- it'll get done eventually...
[06:03] phuzion: compressed ... 164 GB, apparently.
[06:04] Huh
[06:04] (minus the last few days, which I haven't downloaded yet)
[06:04] * phuzion debates throwing together a tool to search the archives...
[06:04] My server seems to be collecting urls pretty fast. 1.6m scans in about a week
[06:04] 75 GB from the pre-daily-dumps, and the other ~90GB from the daily dumps since Nov 2014.
[06:05] phuzion: yes please
[06:05] No guarantees on performance
[06:05] I hacked together something to search them locally, but it takes literally a couple of hours.
[06:06] Hmmmm.
[06:06] Sounds like something that could be done quickly if it was in sql
[06:06] WinterFox: we're talking about 3+ billion rows
[06:06] sql would certainly help -- but remember, this is 164 GB *compressed*
[06:06] and plain text URLs compress well
[06:07] Not all the data is relevant though. You can narrow it down a lot by only looking at the short urls from the url shortener you are using
[06:07] *** bwn_ has joined #urlteam
[06:08] mostly we search it as a corpus of URLs, so the shortener they came from isn't relevant.
[06:08] (well, mostly, so far, those have been *my* searches, at least)
[06:08] If the data was sorted I think you could use a binary search algorithm too
[06:08] probably, yeah
[06:09] Right, something like "SELECT * FROM urlteam WHERE desturl LIKE '%foobar%';" or something
[06:09] God, that would be so freaking slow.
[06:09] on 3.6B rows, that would probably take 5-10 hours.
[06:10] Binary search would speed it up loads
[06:10] *** bwn_ is now known as bwn
[06:11] * phuzion wonders how well this would perform on a db.m4.10xlarge or something
[06:13] I might play with this at work tomorrow
[06:13] In the meantime, I'm gonna get to sleep.
[06:13] If I did the math right it would take 39 comparisons at worst to find a url in 3.6B rows
[06:14] phuzion: enjoy your sleep
[06:14] thanks. Night.
[07:33] NOTE: I reduced the size of migre.me items down to 5 URLs each, to simplify things, since items break if any one is Unavailable (because migre.me handles Unavailable by returning the same HTTP status code, but not providing any Location header... :-( )
[08:37] *** JesseW has quit IRC (Leaving.)
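The binary-search idea raised at [06:08] and [06:13] can be sketched roughly as follows. This is only an illustration under stated assumptions: `sorted_urls` is a small in-memory, lexicographically sorted list standing in for the real corpus (which would not fit in memory), and the function names are hypothetical. Binary search helps with exact or prefix lookups on the sorted key; a substring pattern like `'%foobar%'` would still need a full scan. By this count, 3.6B sorted rows need about 32 probes in the worst case, somewhat fewer than the 39 estimated in the chat.

```python
import bisect


def worst_case_probes(n_rows: int) -> int:
    """Worst-case number of halving steps for binary search over n_rows items.

    For n >= 1 this equals floor(log2(n)) + 1, which is n.bit_length().
    """
    return n_rows.bit_length() if n_rows > 0 else 0


def find_exact(sorted_urls, target):
    """Return the index of target in sorted_urls, or -1 if it is absent."""
    i = bisect.bisect_left(sorted_urls, target)
    if i < len(sorted_urls) and sorted_urls[i] == target:
        return i
    return -1


def find_prefix(sorted_urls, prefix):
    """Return every URL starting with prefix.

    Binary search locates the first candidate; the matches are contiguous
    in sorted order, so a short linear walk collects the rest.
    """
    lo = bisect.bisect_left(sorted_urls, prefix)
    hi = lo
    while hi < len(sorted_urls) and sorted_urls[hi].startswith(prefix):
        hi += 1
    return sorted_urls[lo:hi]


if __name__ == "__main__":
    # Hypothetical sample data, sorted before searching.
    urls = sorted([
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/x",
        "https://archive.org/details/foo",
    ])
    print(worst_case_probes(3_600_000_000))  # 32
    print(find_exact(urls, "http://example.org/x"))
    print(find_prefix(urls, "http://example.com/"))
```

For the corpus sizes discussed here, the same lookup would have to run against sorted files on disk or an indexed database column rather than a Python list; the probe count stays the same.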
[08:43] *** bwn_ has joined #urlteam
[08:46] *** WinterFox has quit IRC (Remote host closed the connection)
[08:47] *** WinterFox has joined #urlteam
[08:53] *** bwn has quit IRC (Read error: Operation timed out)
[09:43] *** ahrain has joined #urlteam
[11:22] *** bwn_ has quit IRC (Read error: Operation timed out)
[12:12] *** bwn_ has joined #urlteam
[12:49] *** WinterFox has quit IRC (Read error: Operation timed out)
[12:53] *** WinterFox has joined #urlteam
[13:22] *** WinterFox has quit IRC (Remote host closed the connection)
[14:13] *** Fusl has quit IRC (Max SendQ exceeded)
[14:14] *** Fusl has joined #urlteam
[15:00] *** VADemon has joined #urlteam
[15:11] *** Start has quit IRC (Quit: Disconnected.)
[15:44] *** swebb has left ["Textual IRC Client: www.textualapp.com"]
[16:35] *** swebb has joined #urlteam
[16:35] *** svchfoo3 sets mode: +o swebb
[16:46] *** Start has joined #urlteam
[16:50] *** Start_ has joined #urlteam
[16:50] *** Start has quit IRC (Read error: Connection reset by peer)
[16:56] *** Start_ has quit IRC (Read error: Operation timed out)
[16:58] *** Start has joined #urlteam
[16:59] *** Start has quit IRC (Read error: Connection reset by peer)
[16:59] *** Start_ has joined #urlteam
[17:00] *** SimpBrain has quit IRC (Ping timeout: 369 seconds)
[17:07] *** Start_ has quit IRC (Quit: Disconnected.)
[17:18] *** JesseW has joined #urlteam
[17:18] *** svchfoo3 sets mode: +o JesseW
[17:31] A total of 23,225 aDrive URLs found in the old dump. (lots of duplicates)
[17:49] *** JesseW has quit IRC (Leaving.)
[18:17] *** aaaaaaaaa has joined #urlteam
[18:17] *** swebb sets mode: +o aaaaaaaaa
[18:27] *** SimpBrain has joined #urlteam
[18:34] *** Start has joined #urlteam
[18:39] *** Start_ has joined #urlteam
[18:39] *** Start has quit IRC (Read error: Connection reset by peer)
[18:44] *** Start_ has quit IRC (Read error: Operation timed out)
[19:01] *** JW_work has quit IRC (Read error: Operation timed out)
[19:05] *** JW_work has joined #urlteam
[19:26] *** JW_work has quit IRC (Leaving.)
[19:29] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:34] *** dashcloud has joined #urlteam
[19:34] *** svchfoo3 sets mode: +o dashcloud
[19:44] *** JW_work has joined #urlteam
[19:47] *** JW_work1 has joined #urlteam
[19:47] *** JW_work has quit IRC (Read error: Connection reset by peer)
[20:04] *** JW_work1 has quit IRC (Quit: Leaving.)
[20:06] *** JW_work has joined #urlteam
[20:50] *** JW_work has quit IRC (Read error: Operation timed out)
[20:52] *** JW_work has joined #urlteam
[21:14] *** Atluxity has joined #urlteam
[21:30] *** SilSte has quit IRC (Ping timeout: 310 seconds)
[21:30] *** SilSte has joined #urlteam
[21:33] *** Barry has quit IRC (Ping timeout: 310 seconds)
[21:35] *** Barry has joined #urlteam
[21:51] *** WinterFox has joined #urlteam
[22:03] *** bwn_ has quit IRC (Read error: Operation timed out)
[22:34] *** bwn has joined #urlteam
[23:13] *** Start has joined #urlteam
[23:50] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[23:51] *** aaaaaaaaa has joined #urlteam
[23:51] *** swebb sets mode: +o aaaaaaaaa