[00:03] *** Coderjoe_ has joined #urlteam
[00:05] *** Coderjoe has quit IRC (Read error: Operation timed out)
[01:01] *** JesseW has joined #urlteam
[01:02] *** Start has joined #urlteam
[01:29] could someone please scrape the urlteam results for home.comcast.net, comcastbiz.net and comcastbiz.com?
[01:53] I can do so, unless someone who has already downloaded them wants to do it...
[02:17] OK, I'm working on downloading the URLteam results (starting with the 88 GB generation 1 torrent)
[02:29] *** aaaaaaaa_ has joined #urlteam
[02:29] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[02:29] *** swebb sets mode: +o aaaaaaaa_
[02:34] *** aaaaaaaa_ is now known as aaaaaaaaa
[02:41] Interesting -- the 379 incremental items come to a total of 86.5 GB, as compared with the last torrent, which is 88 GB.
[03:26] Well, I have a few from home.comcast.net, such as
[03:26] SfwqG0|http://home.comcast.net/~s.namkung/
[03:28] The incremental ones appear to be .zip files *containing* .xz files, which contain the actual data...
[04:05] queuing up all 379 incremental dumps via bt...
[04:13] with xargs, I bet!
[04:14] yep. :-)
[04:36] *** aaaaaaaa_ has joined #urlteam
[04:36] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[04:36] *** swebb sets mode: +o aaaaaaaa_
[04:37] *** aaaaaaaa_ is now known as aaaaaaaaa
[04:38] Here's the pipeline I'm using to search for the comcast URLs:
[04:38] (cd /mnt/bigdisk/transmission_files/downloads/urlteam_2015-09-17-19-00-08/ ; for foo in *.zip; do echo $foo; unzip -p $foo '*.txt.xz' | xz -d | fgrep $'home.comcast.com\ncomcastbiz.net\ncomcastbiz.com' | tee -a /mnt/bigdisk/comcast_urls.txt; done)
[04:38] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[05:04] First hit (and it's not actually a hit)...
[05:04] JlRq4r|https://www.actonsoftware.com/acton/beacon/xreport.jsp?c=Visit Details&u=https://www.actonsoftware.com/acton/beacon/beaconCompaniesDrillDown.jsp%3Fa%3D248%26aa%3D7968%26k%3D729502%26ips%3D%255B71.194.171.69%255D%26start%3D1335121634700%26email%3DGregory@gdwgroup.comcastbiz.net
[05:07] Another, actual hit (although probably spam):
[05:08] YodXKi|http://lifo.comcastbiz.net/25/asbestos-plaster-walls
[05:08] No, the domain is actually a thing (and presumably worth archiving): http://lifo.comcastbiz.net/
[05:08] "Life Organizers / Tax & Planning Consultants"
[05:35] Well, here's something certainly worth saving: http://thediscoverycenter.comcastbiz.net/about/ -- website of a 6 acre privately-owned park in Fresno, CA devoted to teaching kids about science since 1956.
[07:35] *** JesseW has quit IRC (Read error: Operation timed out)
[12:31] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:34] *** dashcloud has joined #urlteam
[12:35] *** svchfoo1 sets mode: +o dashcloud
[14:15] *** Start has quit IRC (Quit: Disconnected.)
[14:16] *** Start has joined #urlteam
[14:20] *** Start has quit IRC (Client Quit)
[15:10] *** JesseW has joined #urlteam
[15:20] Start: Here are my results so far; I've run into HW trouble, so I thought it better to provide what I have: http://pastebin.ca/3165040
[15:32] OK, I've got analysis working again -- about 2000 zip files downloaded, still to be processed.
(and various more that I haven't downloaded yet)
[15:39] *** Start has joined #urlteam
[16:09] *** Start_ has joined #urlteam
[16:09] *** Start has quit IRC (Read error: Connection reset by peer)
[16:13] *** Start_ is now known as Start
[16:14] *** JesseW has quit IRC (Read error: Operation timed out)
[16:32] *** JesseW has joined #urlteam
[16:34] Maybe corrupted file: urlteam_2015-03-22-01-09-14/bitly_6.2015-03-22-01-09-14.zip
[17:01] Updated results (110, so far): http://pastebin.ca/3165137
[17:13] *** JesseW has quit IRC (Read error: Operation timed out)
[17:38] *** Start has quit IRC (Quit: Disconnected.)
[18:24] *** VADemon has joined #urlteam
[18:59] *** slang has joined #urlteam
[19:08] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:10] *** dashcloud has joined #urlteam
[19:10] *** svchfoo1 sets mode: +o dashcloud
[19:30] *** _0x2A has quit IRC (Quit: ZNC - 1.6.0 - http://znc.in)
[21:40] *** slang has quit IRC (Quit: Page closed)
[23:40] *** Start has joined #urlteam
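
Editorial note on the 04:38 pipeline: the search it performs could be restated roughly as the Python sketch below. This is not what was run in the channel. The names `find_matches` and `NEEDLES` are hypothetical, and the sketch assumes each dump is a .zip holding `*.txt.xz` members, as described at 03:28. Two things to be aware of: the logged command searches for `home.comcast.com`, while the 01:29 request names `home.comcast.net`, so the sketch uses the requested domain; and, like `fgrep`, it matches the substring anywhere in the line, so it would also return lines such as the 05:04 one where the domain only appears inside a query string.

```python
import lzma
import zipfile
from pathlib import Path

# Substrings to search for (assumption: "home.comcast.net" follows the 01:29
# request; the command logged at 04:38 used "home.comcast.com" instead).
NEEDLES = (b"home.comcast.net", b"comcastbiz.net", b"comcastbiz.com")


def find_matches(zip_dir, needles=NEEDLES):
    """Yield (zip_name, line) for each line that contains any needle."""
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as archive:
                for member in archive.namelist():
                    # Only the xz-compressed text members hold shortcode|URL lines.
                    if not member.endswith(".txt.xz"):
                        continue
                    with archive.open(member) as raw, lzma.open(raw) as lines:
                        for line in lines:
                            if any(needle in line for needle in needles):
                                text = line.rstrip(b"\r\n").decode("utf-8", "replace")
                                yield zip_path.name, text
        except (zipfile.BadZipFile, lzma.LZMAError, EOFError) as err:
            # Report and skip archives that fail with one of these errors
            # (compare the possibly corrupted file mentioned at 16:34).
            print(f"skipping {zip_path.name}: {err}")
```

Calling `find_matches` on a directory of downloaded dumps and writing each yielded line to a file would play the part of the `tee -a` step in the logged command.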