#urlteam 2015-11-13,Fri


Time Nickname Message
03:09 🔗 bwn_ has joined #urlteam
03:21 🔗 bwn has quit IRC (Ping timeout: 1221 seconds)
03:30 🔗 bwn_ has quit IRC (Read error: Operation timed out)
04:58 🔗 JesseW has joined #urlteam
04:58 🔗 svchfoo3 sets mode: +o JesseW
05:05 🔗 marvinw has quit IRC (Read error: Operation timed out)
05:37 🔗 marvinw has joined #urlteam
05:51 🔗 WinterFox has joined #urlteam
06:00 🔗 WinterFox Is there any way to search the urls collected so far?
06:00 🔗 JesseW WinterFox: yep -- download them and search them locally. :-)
06:00 🔗 JesseW If you're referring to the ADrive search Start requested -- I'm going to get on that soon.
06:01 🔗 JesseW But it would be great to have more people with a full corpus, as I'm currently wrestling with migre.me
06:01 🔗 WinterFox I did get all that infocon stuff up but it looks like some of it was already on archive.org
06:01 🔗 JesseW yep. it happens.
06:01 🔗 phuzion JesseW: about how big is the total corpus right now?
06:02 🔗 WinterFox Also I don't seem to have permissions to change the data types to audio
06:02 🔗 JesseW WinterFox: send a note to info@archive.org -- it'll get done eventually...
06:03 🔗 JesseW phuzion: compressed ... 164 GB, apparently.
06:04 🔗 phuzion Huh
06:04 🔗 JesseW (minus the last few days, which I haven't downloaded yet)
06:04 🔗 * phuzion debates throwing together a tool to search the archives...
06:04 🔗 WinterFox My server seems to be collecting urls pretty fast. 1.6m scans in about a week
06:04 🔗 JesseW 75 GB from the pre-daily-dumps, and the other ~90GB from the daily dumps since Nov 2014.
06:05 🔗 JesseW phuzion: yes please
06:05 🔗 phuzion No guarantees on performance
06:05 🔗 JesseW I hacked together something to search them locally, but it takes literally a couple of hours.
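A local search of this kind could be sketched roughly as follows. This is not JesseW's tool, only a minimal illustration: it assumes the dumps are gzip-compressed text files with one record per line, and the function name `search_dumps`, the directory layout, and the `*.gz` suffix are all hypothetical. Every file is decompressed and scanned in full, which is why a run over the whole corpus takes hours.

```python
import gzip
from pathlib import Path
from typing import Iterator


def search_dumps(dump_dir: str, needle: str) -> Iterator[str]:
    """Yield every line containing `needle` from all *.gz files under dump_dir.

    Assumption: dumps are gzip-compressed, one record per line.
    """
    for path in sorted(Path(dump_dir).rglob("*.gz")):
        # Decompress as a stream so memory use stays flat regardless of file size.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if needle in line:
                    yield line.rstrip("\n")
```

Usage would look like `for hit in search_dumps("dumps/", "adrive.com"): print(hit)`, where `dumps/` is a placeholder path.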
06:06 🔗 phuzion Hmmmm.
06:06 🔗 WinterFox Sounds like something that could be done quickly if it was in sql
06:06 🔗 phuzion WinterFox: we're talking about 3+ billion rows
06:06 🔗 JesseW sql would certainly help -- but remember, this is 164 GB *compressed*
06:06 🔗 JesseW and plain text URLs compress well
06:07 🔗 WinterFox Not all the data is relevant though. You can narrow it down a lot by only looking at the short urls from the url shortener you are using
06:07 🔗 bwn_ has joined #urlteam
06:08 🔗 JesseW mostly we search it as a corpus of URLs, so the shortener they came from isn't relevant.
06:08 🔗 JesseW (well, mostly, so far, those have been *my* searches, at least)
06:08 🔗 WinterFox If the data was sorted I think you could use a binary search algorithm too
06:08 🔗 JesseW probably, yeah
06:09 🔗 phuzion Right, something like "SELECT * FROM urlteam WHERE desturl LIKE '%foobar%';" or something
06:09 🔗 phuzion God, that would be so freaking slow.
06:09 🔗 phuzion on 3.6B rows, that would probably take 5-10 hours.
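The reason a query like this is slow can be shown on a toy scale. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name `urlteam` and column `desturl` come from the query quoted above, while the `shortcode` column, the index name, and the sample rows are made up for illustration. A pattern with a leading `%` cannot be answered from an ordinary B-tree index on `desturl`, so the database has to examine every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only `urlteam` and `desturl` appear in the chat.
conn.execute("CREATE TABLE urlteam (shortcode TEXT, desturl TEXT)")
conn.execute("CREATE INDEX idx_desturl ON urlteam (desturl)")
conn.executemany(
    "INSERT INTO urlteam VALUES (?, ?)",
    [
        ("a1", "http://example.com/foobar"),
        ("b2", "http://example.org/other"),
    ],
)

# Substring match: the leading wildcard rules out an index range lookup.
rows = conn.execute(
    "SELECT shortcode, desturl FROM urlteam WHERE desturl LIKE ?",
    ("%foobar%",),
).fetchall()

# Ask SQLite how it plans to run the query; the fourth column of each
# plan row is a human-readable description of the step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM urlteam WHERE desturl LIKE '%foobar%'"
).fetchall()

print(rows)
print(plan)
```

On billions of rows that full scan is where the hours would go, regardless of the index.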
06:10 🔗 WinterFox Binary search would speed it up loads
06:10 🔗 bwn_ is now known as bwn
06:11 🔗 * phuzion wonders how well this would perform on a db.m4.10xlarge or something
06:13 🔗 phuzion I might play with this at work tomorrow
06:13 🔗 phuzion In the meantime, I'm gonna get to sleep.
06:13 🔗 WinterFox If I did the math right it would take 39 comparisons at worst to find a url in 3.6B rows
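The arithmetic can be checked with a short sketch. Binary search over n sorted items needs at most floor(log2(n)) + 1 probes, which for 3.6 billion rows works out to 32, somewhat fewer than the 39 estimated in the chat. The function names below are hypothetical. Note also that binary search only helps exact (or prefix) lookups on the sorted key, so it would not speed up a substring search like the `LIKE '%foobar%'` query discussed earlier.

```python
import bisect
from typing import Sequence


def worst_case_probes(n: int) -> int:
    """Worst-case probes for binary search over n sorted items.

    For n >= 1 this equals floor(log2(n)) + 1, which is n.bit_length().
    """
    return n.bit_length()


def contains(sorted_urls: Sequence[str], target: str) -> bool:
    """Exact-match lookup in a sorted sequence using binary search."""
    i = bisect.bisect_left(sorted_urls, target)
    return i < len(sorted_urls) and sorted_urls[i] == target


print(worst_case_probes(3_600_000_000))  # 32
```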
06:14 🔗 JesseW phuzion: enjoy your sleep
06:14 🔗 phuzion thanks. Night.
07:33 🔗 JesseW NOTE: I reduced the size of migre.me items down to 5 URLs each, to simplify things, since items break if any one is Unavailable (because migre.me handles Unavailable by returning the same HTTP status code, but not providing any Location header... :-( )
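The failure mode described in the note could be handled along these lines. This is only a sketch, not the project's actual scraper code: the function name `resolve_redirect` and the set of status codes are assumptions, since the log does not say which status migre.me returns. The idea is to treat a redirect status with no `Location` header as "unavailable" instead of as an error.

```python
from typing import Mapping, Optional

# Assumption: the shortener answers with one of the standard redirect codes.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def resolve_redirect(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect target, or None if the short URL is unavailable.

    A redirect status without a Location header is treated as unavailable.
    Any non-redirect status raises ValueError.
    """
    if status not in REDIRECT_STATUSES:
        raise ValueError(f"unexpected status {status}")
    # Header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "location" and value:
            return value
    return None
```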
08:37 🔗 JesseW has quit IRC (Leaving.)
08:43 🔗 bwn_ has joined #urlteam
08:46 🔗 WinterFox has quit IRC (Remote host closed the connection)
08:47 🔗 WinterFox has joined #urlteam
08:53 🔗 bwn has quit IRC (Read error: Operation timed out)
09:43 🔗 ahrain has joined #urlteam
11:22 🔗 bwn_ has quit IRC (Read error: Operation timed out)
12:12 🔗 bwn_ has joined #urlteam
12:49 🔗 WinterFox has quit IRC (Read error: Operation timed out)
12:53 🔗 WinterFox has joined #urlteam
13:22 🔗 WinterFox has quit IRC (Remote host closed the connection)
14:13 🔗 Fusl has quit IRC (Max SendQ exceeded)
14:14 🔗 Fusl has joined #urlteam
15:00 🔗 VADemon has joined #urlteam
15:11 🔗 Start has quit IRC (Quit: Disconnected.)
15:44 🔗 swebb has left ["Textual IRC Client: www.textualapp.com"]
16:35 🔗 swebb has joined #urlteam
16:35 🔗 svchfoo3 sets mode: +o swebb
16:46 🔗 Start has joined #urlteam
16:50 🔗 Start_ has joined #urlteam
16:50 🔗 Start has quit IRC (Read error: Connection reset by peer)
16:56 🔗 Start_ has quit IRC (Read error: Operation timed out)
16:58 🔗 Start has joined #urlteam
16:59 🔗 Start has quit IRC (Read error: Connection reset by peer)
16:59 🔗 Start_ has joined #urlteam
17:00 🔗 SimpBrain has quit IRC (Ping timeout: 369 seconds)
17:07 🔗 Start_ has quit IRC (Quit: Disconnected.)
17:18 🔗 JesseW has joined #urlteam
17:18 🔗 svchfoo3 sets mode: +o JesseW
17:31 🔗 JesseW A total of 23,225 aDrive URLs found in the old dump. (lots of duplicates)
17:49 🔗 JesseW has quit IRC (Leaving.)
18:17 🔗 aaaaaaaaa has joined #urlteam
18:17 🔗 swebb sets mode: +o aaaaaaaaa
18:27 🔗 SimpBrain has joined #urlteam
18:34 🔗 Start has joined #urlteam
18:39 🔗 Start_ has joined #urlteam
18:39 🔗 Start has quit IRC (Read error: Connection reset by peer)
18:44 🔗 Start_ has quit IRC (Read error: Operation timed out)
19:01 🔗 JW_work has quit IRC (Read error: Operation timed out)
19:05 🔗 JW_work has joined #urlteam
19:26 🔗 JW_work has quit IRC (Leaving.)
19:29 🔗 dashcloud has quit IRC (Read error: Operation timed out)
19:34 🔗 dashcloud has joined #urlteam
19:34 🔗 svchfoo3 sets mode: +o dashcloud
19:44 🔗 JW_work has joined #urlteam
19:47 🔗 JW_work1 has joined #urlteam
19:47 🔗 JW_work has quit IRC (Read error: Connection reset by peer)
20:04 🔗 JW_work1 has quit IRC (Quit: Leaving.)
20:06 🔗 JW_work has joined #urlteam
20:50 🔗 JW_work has quit IRC (Read error: Operation timed out)
20:52 🔗 JW_work has joined #urlteam
21:14 🔗 Atluxity has joined #urlteam
21:30 🔗 SilSte has quit IRC (Ping timeout: 310 seconds)
21:30 🔗 SilSte has joined #urlteam
21:33 🔗 Barry has quit IRC (Ping timeout: 310 seconds)
21:35 🔗 Barry has joined #urlteam
21:51 🔗 WinterFox has joined #urlteam
22:03 🔗 bwn_ has quit IRC (Read error: Operation timed out)
22:34 🔗 bwn has joined #urlteam
23:13 🔗 Start has joined #urlteam
23:50 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
23:51 🔗 aaaaaaaaa has joined #urlteam
23:51 🔗 swebb sets mode: +o aaaaaaaaa
