Time |
Nickname |
Message |
03:09
🔗
|
|
bwn_ has joined #urlteam |
03:21
🔗
|
|
bwn has quit IRC (Ping timeout: 1221 seconds) |
03:30
🔗
|
|
bwn_ has quit IRC (Read error: Operation timed out) |
04:58
🔗
|
|
JesseW has joined #urlteam |
04:58
🔗
|
|
svchfoo3 sets mode: +o JesseW |
05:05
🔗
|
|
marvinw has quit IRC (Read error: Operation timed out) |
05:37
🔗
|
|
marvinw has joined #urlteam |
05:51
🔗
|
|
WinterFox has joined #urlteam |
06:00
🔗
|
WinterFox |
Is there any way to search the urls collected so far? |
06:00
🔗
|
JesseW |
WinterFox: yep -- download them and search them locally. :-) |
06:00
🔗
|
JesseW |
If you're referring to the ADrive search Start requested -- I'm going to get on that soon. |
06:01
🔗
|
JesseW |
But it would be great to have more people with a full corpus, as I'm currently wrestling with migre.me |
06:01
🔗
|
WinterFox |
I did get all that infocon stuff up but it looks like some of it was already on archive.org |
06:01
🔗
|
JesseW |
yep. it happens. |
06:01
🔗
|
phuzion |
JesseW: about how big is the total corpus right now? |
06:02
🔗
|
WinterFox |
Also I don't seem to have permissions to change the data types to audio |
06:02
🔗
|
JesseW |
WinterFox: send a note to info@archive.org -- it'll get done eventually... |
06:03
🔗
|
JesseW |
phuzion: compressed ... 164 GB, apparently. |
06:04
🔗
|
phuzion |
Huh |
06:04
🔗
|
JesseW |
(minus the last few days, which I haven't downloaded yet) |
06:04
🔗
|
* |
phuzion debates throwing together a tool to search the archives... |
06:04
🔗
|
WinterFox |
My server seems to be collecting urls pretty fast. 1.6m scans in about a week |
06:04
🔗
|
JesseW |
75 GB from the pre-daily-dumps, and the other ~90GB from the daily dumps since Nov 2014. |
06:05
🔗
|
JesseW |
phuzion: yes please |
06:05
🔗
|
phuzion |
No guarantees on performance |
06:05
🔗
|
JesseW |
I hacked together something to search them locally, but it takes literally a couple of hours. |
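A local search of this kind can be sketched as a streaming scan over the compressed files. This is not JesseW's actual tool. The gzip format, the one-record-per-line layout, and the name `search_dumps` are all assumptions for illustration.

```python
import gzip
from typing import Iterable, Iterator


def search_dumps(paths: Iterable[str], needle: str) -> Iterator[str]:
    """Yield every line that contains `needle` from gzip-compressed text files.

    Assumption: each dump is a gzip file with one record per line.
    """
    for path in paths:
        # Decompress on the fly so the full corpus never has to fit on disk
        # uncompressed; errors="replace" keeps odd bytes from aborting the scan.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if needle in line:
                    yield line.rstrip("\n")
```

Every byte still has to be decompressed and inspected, so the run time grows with the size of the corpus, which fits the "couple of hours" figure above.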
06:06
🔗
|
phuzion |
Hmmmm. |
06:06
🔗
|
WinterFox |
Sounds like something that could be done quickly if it was in sql |
06:06
🔗
|
phuzion |
WinterFox: we're talking about 3+ billion rows |
06:06
🔗
|
JesseW |
sql would certainly help -- but remember, this is 164 GB *compressed* |
06:06
🔗
|
JesseW |
and plain text URLs compress well |
06:07
🔗
|
WinterFox |
Not all the data is relevant though. You can narrow it down a lot by only looking at the short urls from the url shortener you are using |
06:07
🔗
|
|
bwn_ has joined #urlteam |
06:08
🔗
|
JesseW |
mostly we search it as a corpus of URLs, so the shortener they came from isn't relevant. |
06:08
🔗
|
JesseW |
(well, mostly, so far, those have been *my* searches, at least) |
06:08
🔗
|
WinterFox |
If the data was sorted I think you could use a binary search algorithm too |
06:08
🔗
|
JesseW |
probably, yeah |
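A minimal sketch of the binary search idea, assuming the data is held as two parallel lists sorted by shortcode (the names `keys`, `values`, and `lookup` are hypothetical). Note that this only helps for exact lookups on the sorted key; a substring search over destination URLs would still need a full scan.

```python
import bisect
from typing import List, Optional


def lookup(keys: List[str], values: List[str], shortcode: str) -> Optional[str]:
    """Return the destination URL for `shortcode`, or None if it is absent.

    Assumption: `keys` is sorted ascending and values[i] belongs to keys[i].
    """
    # bisect_left returns the first position where shortcode could be inserted
    # while keeping the list sorted, using O(log n) comparisons.
    index = bisect.bisect_left(keys, shortcode)
    if index < len(keys) and keys[index] == shortcode:
        return values[index]
    return None
```

For example, `lookup(["a1", "b2", "c3"], ["http://example.com/1", "http://example.com/2", "http://example.com/3"], "b2")` returns the second URL.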
06:09
🔗
|
phuzion |
Right, something like "SELECT * FROM urlteam WHERE desturl LIKE '%foobar%';" or something |
06:09
🔗
|
phuzion |
God, that would be so freaking slow. |
06:09
🔗
|
phuzion |
on 3.6B rows, that would probably take 5-10 hours. |
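The query above can be tried on a small scale with SQLite. This is a sketch only: the two-column schema (`shorturl`, `desturl`) and the helper names are assumptions, since the log names just the `urlteam` table and the `desturl` column. A pattern with a leading `%` cannot use an ordinary B-tree index, so the database has to check every row, which is why it would be slow on billions of rows.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_db(rows: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical (shorturl, desturl) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE urlteam (shorturl TEXT PRIMARY KEY, desturl TEXT)")
    conn.executemany("INSERT INTO urlteam VALUES (?, ?)", rows)
    conn.commit()
    return conn


def find_substring(conn: sqlite3.Connection, needle: str) -> List[Tuple[str, str]]:
    """Return rows whose destination URL contains `needle`.

    The needle is bound as a parameter; any % or _ inside it still acts
    as a LIKE wildcard in this simple sketch.
    """
    query = (
        "SELECT shorturl, desturl FROM urlteam "
        "WHERE desturl LIKE ? ORDER BY shorturl"
    )
    return conn.execute(query, (f"%{needle}%",)).fetchall()
```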
06:10
🔗
|
WinterFox |
Binary search would speed it up loads |
06:10
🔗
|
|
bwn_ is now known as bwn |
06:11
🔗
|
* |
phuzion wonders how well this would perform on a db.m4.10xlarge or something |
06:13
🔗
|
phuzion |
I might play with this at work tomorrow |
06:13
🔗
|
phuzion |
In the meantime, I'm gonna get to sleep. |
06:13
🔗
|
WinterFox |
If I did the math right it would take about 32 comparisons at worst to find a url in 3.6B rows |
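The arithmetic can be checked directly. A standard binary search over n sorted items needs at most floor(log2(n)) + 1 comparisons, which equals the bit length of n. The function name here is illustrative.

```python
def worst_case_comparisons(n: int) -> int:
    """Worst-case number of probes for binary search over n sorted items.

    floor(log2(n)) + 1 is exactly the bit length of n for n >= 1.
    """
    return n.bit_length() if n > 0 else 0


# 2**31 <= 3.6 billion < 2**32, so the bit length is 32.
print(worst_case_comparisons(3_600_000_000))  # 32
```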
06:14
🔗
|
JesseW |
phuzion: enjoy your sleep |
06:14
🔗
|
phuzion |
thanks. Night. |
07:33
🔗
|
JesseW |
NOTE: I reduced the size of migre.me items down to 5 URLs each, to simplify things, since items break if any one is Unavailable (because migre.me handles Unavailable by returning the same HTTP status code, but not providing any Location header... :-( ) |
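The behaviour described in the note can be sketched as a small classifier: a redirect status with a Location header is a real result, and the same status without one is treated as Unavailable. The log does not say which status code migre.me returns, so the set of redirect codes below is an assumption, as are the function and label names.

```python
from typing import Mapping, Optional, Tuple

# Assumption: any of the usual HTTP redirect codes may be returned.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def classify_response(
    status: int, headers: Mapping[str, str]
) -> Tuple[str, Optional[str]]:
    """Classify a shortener response as redirect, unavailable, or other."""
    # Header names are case-insensitive, so compare them in lower case.
    location = next(
        (value for name, value in headers.items() if name.lower() == "location"),
        None,
    )
    if status in REDIRECT_STATUSES:
        if location:
            return ("redirect", location)
        # Same status code, but no target: the Unavailable case from the note.
        return ("unavailable", None)
    return ("other", None)
```

Checking for the missing header per URL, rather than failing the whole item, is one way to avoid a single Unavailable entry breaking a batch.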
08:37
🔗
|
|
JesseW has quit IRC (Leaving.) |
08:43
🔗
|
|
bwn_ has joined #urlteam |
08:46
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
08:47
🔗
|
|
WinterFox has joined #urlteam |
08:53
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
09:43
🔗
|
|
ahrain has joined #urlteam |
11:22
🔗
|
|
bwn_ has quit IRC (Read error: Operation timed out) |
12:12
🔗
|
|
bwn_ has joined #urlteam |
12:49
🔗
|
|
WinterFox has quit IRC (Read error: Operation timed out) |
12:53
🔗
|
|
WinterFox has joined #urlteam |
13:22
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
14:13
🔗
|
|
Fusl has quit IRC (Max SendQ exceeded) |
14:14
🔗
|
|
Fusl has joined #urlteam |
15:00
🔗
|
|
VADemon has joined #urlteam |
15:11
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
15:44
🔗
|
|
swebb has left ["Textual IRC Client: www.textualapp.com"] |
16:35
🔗
|
|
swebb has joined #urlteam |
16:35
🔗
|
|
svchfoo3 sets mode: +o swebb |
16:46
🔗
|
|
Start has joined #urlteam |
16:50
🔗
|
|
Start_ has joined #urlteam |
16:50
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
16:56
🔗
|
|
Start_ has quit IRC (Read error: Operation timed out) |
16:58
🔗
|
|
Start has joined #urlteam |
16:59
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
16:59
🔗
|
|
Start_ has joined #urlteam |
17:00
🔗
|
|
SimpBrain has quit IRC (Ping timeout: 369 seconds) |
17:07
🔗
|
|
Start_ has quit IRC (Quit: Disconnected.) |
17:18
🔗
|
|
JesseW has joined #urlteam |
17:18
🔗
|
|
svchfoo3 sets mode: +o JesseW |
17:31
🔗
|
JesseW |
A total of 23,225 aDrive URLs found in the old dump. (lots of duplicates) |
17:49
🔗
|
|
JesseW has quit IRC (Leaving.) |
18:17
🔗
|
|
aaaaaaaaa has joined #urlteam |
18:17
🔗
|
|
swebb sets mode: +o aaaaaaaaa |
18:27
🔗
|
|
SimpBrain has joined #urlteam |
18:34
🔗
|
|
Start has joined #urlteam |
18:39
🔗
|
|
Start_ has joined #urlteam |
18:39
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
18:44
🔗
|
|
Start_ has quit IRC (Read error: Operation timed out) |
19:01
🔗
|
|
JW_work has quit IRC (Read error: Operation timed out) |
19:05
🔗
|
|
JW_work has joined #urlteam |
19:26
🔗
|
|
JW_work has quit IRC (Leaving.) |
19:29
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
19:34
🔗
|
|
dashcloud has joined #urlteam |
19:34
🔗
|
|
svchfoo3 sets mode: +o dashcloud |
19:44
🔗
|
|
JW_work has joined #urlteam |
19:47
🔗
|
|
JW_work1 has joined #urlteam |
19:47
🔗
|
|
JW_work has quit IRC (Read error: Connection reset by peer) |
20:04
🔗
|
|
JW_work1 has quit IRC (Quit: Leaving.) |
20:06
🔗
|
|
JW_work has joined #urlteam |
20:50
🔗
|
|
JW_work has quit IRC (Read error: Operation timed out) |
20:52
🔗
|
|
JW_work has joined #urlteam |
21:14
🔗
|
|
Atluxity has joined #urlteam |
21:30
🔗
|
|
SilSte has quit IRC (Ping timeout: 310 seconds) |
21:30
🔗
|
|
SilSte has joined #urlteam |
21:33
🔗
|
|
Barry has quit IRC (Ping timeout: 310 seconds) |
21:35
🔗
|
|
Barry has joined #urlteam |
21:51
🔗
|
|
WinterFox has joined #urlteam |
22:03
🔗
|
|
bwn_ has quit IRC (Read error: Operation timed out) |
22:34
🔗
|
|
bwn has joined #urlteam |
23:13
🔗
|
|
Start has joined #urlteam |
23:50
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
23:51
🔗
|
|
aaaaaaaaa has joined #urlteam |
23:51
🔗
|
|
swebb sets mode: +o aaaaaaaaa |