| Time |
Nickname |
Message |
|
03:09
|
|
bwn_ has joined #urlteam |
|
03:21
|
|
bwn has quit IRC (Ping timeout: 1221 seconds) |
|
03:30
|
|
bwn_ has quit IRC (Read error: Operation timed out) |
|
04:58
|
|
JesseW has joined #urlteam |
|
04:58
|
|
svchfoo3 sets mode: +o JesseW |
|
05:05
|
|
marvinw has quit IRC (Read error: Operation timed out) |
|
05:37
|
|
marvinw has joined #urlteam |
|
05:51
|
|
WinterFox has joined #urlteam |
|
06:00
|
WinterFox |
Is there any way to search the urls collected so far? |
|
06:00
|
JesseW |
WinterFox: yep -- download them and search them locally. :-) |
|
06:00
|
JesseW |
If you're referring to the ADrive search Start requested -- I'm going to get on that soon. |
|
06:01
|
JesseW |
But it would be great to have more people with a full corpus, as I'm currently wrestling with migre.me |
|
06:01
|
WinterFox |
I did get all that infocon stuff up but it looks like some of it was already on archive.org |
|
06:01
|
JesseW |
yep. it happens. |
|
06:01
|
phuzion |
JesseW: about how big is the total corpus right now? |
|
06:02
|
WinterFox |
Also I don't seem to have permissions to change the data types to audio |
|
06:02
|
JesseW |
WinterFox: send a note to info@archive.org -- it'll get done eventually... |
|
06:03
|
JesseW |
phuzion: compressed ... 164 GB, apparently. |
|
06:04
|
phuzion |
Huh |
|
06:04
|
JesseW |
(minus the last few days, which I haven't downloaded yet) |
|
06:04
|
* |
phuzion debates throwing together a tool to search the archives... |
|
06:04
|
WinterFox |
My server seems to be collecting urls pretty fast. 1.6m scans in about a week |
|
06:04
|
JesseW |
75 GB from the pre-daily-dumps, and the other ~90GB from the daily dumps since Nov 2014. |
|
06:05
|
JesseW |
phuzion: yes please |
|
06:05
|
phuzion |
No guarantees on performance |
|
06:05
|
JesseW |
I hacked together something to search them locally, but it takes literally a couple of hours. |
|
06:06
|
phuzion |
Hmmmm. |
|
06:06
|
WinterFox |
Sounds like something that could be done quickly if it was in sql |
|
06:06
|
phuzion |
WinterFox: we're talking about 3+ billion rows |
|
06:06
|
JesseW |
sql would certainly help -- but remember, this is 164 GB *compressed* |
|
06:06
|
JesseW |
and plain text URLs compress well |
|
06:07
|
WinterFox |
Not all the data is relevant though. You can narrow it down a lot by only looking at the short urls from the url shortener you are using |
|
06:07
|
|
bwn_ has joined #urlteam |
|
06:08
|
JesseW |
mostly we search it as a corpus of URLs, so the shortener they came from isn't relevant. |
|
06:08
|
JesseW |
(well, mostly, so far, those have been *my* searches, at least) |
|
06:08
|
WinterFox |
If the data was sorted I think you could use a binary search algorithm too |
|
06:08
|
JesseW |
probably, yeah |
|
06:09
|
phuzion |
Right, something like "SELECT * FROM urlteam WHERE desturl LIKE '%foobar%';" or something |
|
06:09
|
phuzion |
God, that would be so freaking slow. |
|
06:09
|
phuzion |
on 3.6B rows, that would probably take 5-10 hours. |
|
06:10
|
WinterFox |
Binary search would speed it up loads |
|
06:10
|
|
bwn_ is now known as bwn |
|
06:11
|
* |
phuzion wonders how well this would perform on a db.m4.10xlarge or something |
|
06:13
|
phuzion |
I might play with this at work tomorrow |
|
06:13
|
phuzion |
In the meantime, I'm gonna get to sleep. |
|
06:13
|
WinterFox |
If I did the math right it would take 39 comparisons at worst to find a url in 3.6B rows |
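
A quick check of the binary search estimate above, as a minimal sketch rather than anything the channel actually ran. It assumes the URLs fit in a sorted Python list (the real corpus would need a sorted on-disk file or an index), and the names `worst_case_comparisons` and `contains` are made up for illustration. For 3.6 billion sorted rows the worst case works out to 32 comparisons, a little under the 39 quoted. Binary search only helps exact lookups on the sorted key; a substring match such as `LIKE '%foobar%'` still has to scan every row.

```python
from bisect import bisect_left
from typing import Sequence


def worst_case_comparisons(n_rows: int) -> int:
    """Worst-case number of comparisons for a binary search over n_rows sorted keys."""
    # Each comparison halves the remaining range, so the bound is
    # floor(log2(n_rows)) + 1, which is exactly the bit length of n_rows.
    return n_rows.bit_length()


def contains(sorted_urls: Sequence[str], target: str) -> bool:
    """Exact-match lookup in an already sorted sequence of URLs."""
    i = bisect_left(sorted_urls, target)
    return i < len(sorted_urls) and sorted_urls[i] == target


# Hypothetical sample data, sorted before searching.
urls = sorted([
    "http://example.com/a",
    "http://example.org/b",
    "http://example.net/c",
])

print(worst_case_comparisons(3_600_000_000))
print(contains(urls, "http://example.org/b"))
```
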
|
06:14
|
JesseW |
phuzion: enjoy your sleep |
|
06:14
|
phuzion |
thanks. Night. |
|
07:33
|
JesseW |
NOTE: I reduced the size of migre.me items down to 5 URLs each, to simplify things, since items break if any one is Unavailable (because migre.me handles Unavailable by returning the same HTTP status code, but not providing any Location header... :-( ) |
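
The migre.me quirk described in the note above could be detected along these lines. This is only a sketch under assumptions: the log does not say which status code migre.me returns, so the set of redirect codes below is a guess, and `extract_target` is a hypothetical helper, not part of the actual tracker code. It takes an already fetched status and header mapping, so no network access is involved.

```python
from typing import Mapping, Optional

# Assumed set of redirect status codes; the log does not name the one migre.me uses.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def extract_target(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect target, or None if the response carries no usable Location."""
    if status not in REDIRECT_STATUSES:
        return None
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location", "").strip()
    # A redirect status with a missing or empty Location is treated as Unavailable.
    return location or None


print(extract_target(302, {"Location": "http://example.com/page"}))
print(extract_target(302, {"Content-Type": "text/html"}))
```

Treating "redirect status but no Location" as its own outcome lets a single Unavailable short URL be recorded as such instead of failing the whole item.
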
|
08:37
|
|
JesseW has quit IRC (Leaving.) |
|
08:43
|
|
bwn_ has joined #urlteam |
|
08:46
|
|
WinterFox has quit IRC (Remote host closed the connection) |
|
08:47
|
|
WinterFox has joined #urlteam |
|
08:53
|
|
bwn has quit IRC (Read error: Operation timed out) |
|
09:43
|
|
ahrain has joined #urlteam |
|
11:22
|
|
bwn_ has quit IRC (Read error: Operation timed out) |
|
12:12
|
|
bwn_ has joined #urlteam |
|
12:49
|
|
WinterFox has quit IRC (Read error: Operation timed out) |
|
12:53
|
|
WinterFox has joined #urlteam |
|
13:22
|
|
WinterFox has quit IRC (Remote host closed the connection) |
|
14:13
|
|
Fusl has quit IRC (Max SendQ exceeded) |
|
14:14
|
|
Fusl has joined #urlteam |
|
15:00
|
|
VADemon has joined #urlteam |
|
15:11
|
|
Start has quit IRC (Quit: Disconnected.) |
|
15:44
|
|
swebb has left ["Textual IRC Client: www.textualapp.com"] |
|
16:35
|
|
swebb has joined #urlteam |
|
16:35
|
|
svchfoo3 sets mode: +o swebb |
|
16:46
|
|
Start has joined #urlteam |
|
16:50
|
|
Start_ has joined #urlteam |
|
16:50
|
|
Start has quit IRC (Read error: Connection reset by peer) |
|
16:56
|
|
Start_ has quit IRC (Read error: Operation timed out) |
|
16:58
|
|
Start has joined #urlteam |
|
16:59
|
|
Start has quit IRC (Read error: Connection reset by peer) |
|
16:59
|
|
Start_ has joined #urlteam |
|
17:00
|
|
SimpBrain has quit IRC (Ping timeout: 369 seconds) |
|
17:07
|
|
Start_ has quit IRC (Quit: Disconnected.) |
|
17:18
|
|
JesseW has joined #urlteam |
|
17:18
|
|
svchfoo3 sets mode: +o JesseW |
|
17:31
|
JesseW |
A total of 23,225 aDrive URLs found in the old dump. (lots of duplicates) |
|
17:49
|
|
JesseW has quit IRC (Leaving.) |
|
18:17
|
|
aaaaaaaaa has joined #urlteam |
|
18:17
|
|
swebb sets mode: +o aaaaaaaaa |
|
18:27
|
|
SimpBrain has joined #urlteam |
|
18:34
|
|
Start has joined #urlteam |
|
18:39
|
|
Start_ has joined #urlteam |
|
18:39
|
|
Start has quit IRC (Read error: Connection reset by peer) |
|
18:44
|
|
Start_ has quit IRC (Read error: Operation timed out) |
|
19:01
|
|
JW_work has quit IRC (Read error: Operation timed out) |
|
19:05
|
|
JW_work has joined #urlteam |
|
19:26
|
|
JW_work has quit IRC (Leaving.) |
|
19:29
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
19:34
|
|
dashcloud has joined #urlteam |
|
19:34
|
|
svchfoo3 sets mode: +o dashcloud |
|
19:44
|
|
JW_work has joined #urlteam |
|
19:47
|
|
JW_work1 has joined #urlteam |
|
19:47
|
|
JW_work has quit IRC (Read error: Connection reset by peer) |
|
20:04
|
|
JW_work1 has quit IRC (Quit: Leaving.) |
|
20:06
|
|
JW_work has joined #urlteam |
|
20:50
|
|
JW_work has quit IRC (Read error: Operation timed out) |
|
20:52
|
|
JW_work has joined #urlteam |
|
21:14
|
|
Atluxity has joined #urlteam |
|
21:30
|
|
SilSte has quit IRC (Ping timeout: 310 seconds) |
|
21:30
|
|
SilSte has joined #urlteam |
|
21:33
|
|
Barry has quit IRC (Ping timeout: 310 seconds) |
|
21:35
|
|
Barry has joined #urlteam |
|
21:51
|
|
WinterFox has joined #urlteam |
|
22:03
|
|
bwn_ has quit IRC (Read error: Operation timed out) |
|
22:34
|
|
bwn has joined #urlteam |
|
23:13
|
|
Start has joined #urlteam |
|
23:50
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
|
23:51
|
|
aaaaaaaaa has joined #urlteam |
|
23:51
|
|
swebb sets mode: +o aaaaaaaaa |