[05:09] http://archive.org/details/webshots-freeze-frame
[05:19] parking some more cars in the IA driveway eh
[05:57] SketchCow: I'm uploading 132 ISOs of Linux Format
[05:58] using your naming for the ISOs so they're standard and easy to find
[14:38] Jason, can you pull my new version of the front page plz, http://archiveteam.org/index.php?title=Djsmiley2k/main_page
[14:39] And possibly tweet to followers about Warrior? - ( I tweeted: Oi followers - webshots are going to delete all member photos - we're backing them up - HELP http://archiveteam.org/index.php?title=ArchiveTeam_Warrior & Download, VM - PLZ RT ) I'm sure you can come up with something more witty.
[17:13] can anyone tell me how the archiveteam warrior works? I'm not sure if mine's doing anything productive...
[17:14] Zym_: It has a web interface, try http://localhost:8001/
[17:18] got that. I'm still not sure what happens under the surface because all I get is a GUI but no output
[18:43] http://bt.custhelp.com/app/answers/detail/a_id/39105/~/the-free-bt-web-hosting-service-is-closing
[18:43] IMMEDIATELY.
[18:47] uhoh
[18:47] To access our FTP site using a web browser, type the following into your browser:
[18:47] ftp://username:password@ftp.btinternet.com
[18:47] Once you're logged in you'll see an FTP listing of directories.
[18:47] OREALLY?
[18:47] FTP listing of directories
[18:47] You may have to select the "pub" directory to view your files. Once you have found your files, right click on the ones you want to save, then choose "Save Target as".
[18:57] How do we usually archive such small webhosting providers? wget-warc? heritrix?
[19:06] root:*:0:1:Operator:/:
[19:06] bin:*:2:2::/:/bin/csh
[19:06] daemon:*:1:1::/:
[19:06] ftp:*:99:99:ftp user:/var/ftp-anon:nosuchshell
[19:06] sys:*:3:3::/bin:
[19:06] nobody:*:65534:65534::/:
[19:06] lulz?
[19:06] that's the ftp server?
[19:06] yes.
[19:06] lolwut?
[19:07] they have hidden directory listings but not actually blocked them
[19:07] should make backup easier :p
[19:07] it's a chroot though, I think.
[19:10] https://www.google.co.uk/#q=site:btinternet.com&hl=en&safe=off&prmd=imvns&ei=cnV0ULvMKcvK0AW1_oDwDg&start=10&sa=N&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&fp=70cfab5615ed83a6&biw=888&bih=625
[19:10] plenty of sites :S
[19:11] indeed
[19:11] how can we get a list of them all?
[19:13] http://www.btinternet.com/~memoriesofmax/ <-
[19:13] that's what we do it for...
[19:17] Really?
[19:17] I do it for http://www.btinternet.com/~carol.stirling2/New_Folder/whips_and_crops.htm
[19:17] kinky
[19:18] http://www.btinternet.com/~shawater/warning.html
[19:18] this looks like a tiny geocities
[19:18] It IS a tiny Geocities.
[19:19] SketchCow: that's "Does it for me" ;)
[19:19] DO we have anything to rip google results yet?
[19:19] We could put it on the warrior.
[19:19] SmileyG: https://gist.github.com/2788197d2db2779cb7b0
[19:19] ipv6? :S
[19:20] Is faster.
[19:20] not if I don't already have IPv6 setup :S
[19:20] No. Do you want to be blocked by Google?
[19:20] For how long :S
[19:21] Well, the trick is: if you use ipv6 you can query at a much higher rate.
[19:21] D:
[19:21] I can do it and hand out results...
[19:21] Looks like there's only 10 pages of google results before they turn into random pages on the sites...
[19:21] pestilenz.org has got a native connection and not really a bandwidth limit as our hoster does not have a billing plan for v6 in place
[19:22] C-Keen: You do need an ipv6-over-ipv4 tunnel from tunnelbroker.net.
[19:22] alard: aawwww
[19:22] Because you need a /48 subnet.
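A minimal sketch of the /48 trick being described here: pick a fresh source address in a different /64 for every query so per-address rate limits are less likely to trigger. The prefix 2001:db8:1234::/48 is a placeholder for a tunnelbroker.net allocation, and it is assumed the whole /48 is routed to the querying host and that the generated addresses can actually be bound (e.g. they have been added to the tunnel interface). This is not taken from any ArchiveTeam script.

```python
# Sketch of the "/48 trick": a fresh source address in a different /64 per
# request. 2001:db8:1234::/48 is a placeholder prefix (assumption).
import ipaddress
import random
import http.client

PREFIX = ipaddress.IPv6Network("2001:db8:1234::/48")

def random_source_address():
    # A /48 leaves 80 free bits: the top 16 select a /64, the rest a host.
    return str(PREFIX.network_address + random.getrandbits(80))

def fetch(host, path):
    # Assumes the picked address is actually bindable on this machine.
    conn = http.client.HTTPConnection(
        host, 80, timeout=30,
        source_address=(random_source_address(), 0))
    conn.request("GET", path, headers={"User-Agent": "Mozilla/5.0"})
    resp = conn.getresponse()
    return resp.status, resp.read()
```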
[19:23] it's only a /64
[19:23] LEGAL torrenting is legal
[19:23] torrenting is not
[19:23] torrenting IS legal
[19:23] prepare to laugh
[19:24] lols
[19:24] Where is this? I want to point and laugh
[19:24] car analogy
[19:24] "Driving is legal"
[19:24] "Legal driving is legal" "Driving is not"
[19:24] :D
[19:24] SmileyG: #webtech on freenode
[19:28] Not worth wasting your time over joepie91
[19:31] hm, using the google search api limits you to 100 search results per day. how dumb is that
[19:31] has anyone tried and actually *asked* btinternet for a list of all the sites?
[19:34] so I'm setting up an ipv6 tunnel for future usage :<
[19:35] seriously
[19:35] these people are sad
[19:35] joepie91: haters, don't waste time on them
[19:35] no, not haters
[19:35] just straight out idiots
[19:38] * SketchCow is desperately trying to upload everything off FOS while this insane webshots thing is going down.
[19:38] Just in case you're wondering what has my attention.
[19:39] http://www.btinternet.com/new/content/custhome/listings/A.shtml
[19:39] BINGO
[19:39] This isn't all sites
[19:39] but a large chunk :D
[19:40] NB Sites are not listed automatically however the listing submission process is currently unavailable while the system is being redesigned. Apologies for the inconvenience. << not sure what happened there.
[19:41] I bet they actually add all sites manually
[19:41] although _a lot_ of sites are dead
[19:41] the first two entries on that site are broken
[19:42] maybe it is already out of date
[19:42] http://www.accordions.btinternet.co.uk/ yeaaaah
[19:42] it seems to always redirect me to that yahoo login page
[19:42] yeah the last one works :S
[19:43] Why is it any time something about deleting data comes up, it's Yahoo that is the culprit?
[19:43] yahoo is starting to be annoying, do these guys actually keep anything they buy?
[19:43] lol
[19:43] "streamline the business"
[19:43] Remember, they want to go 482mph.
[19:44] maybe redbull can sponsor the balloon for them too ....
[19:45] SmileyG: seems like that index is all the way broken
[19:45] C-Keen: the last link worked :(
[19:45] hence why I was celebrating, only to realise the rest didn't. :(
[19:45] Is there any index that doesn't ban you?
[19:45] duckduckgo for example?
[19:46] the duckduckgo folks might even contribute the stuff in a sane way
[19:46] prepacked list?
[19:47] for example!
[19:47] * SmileyG asks some guys who love ddg.
[19:49] god
[19:49] SmileyG: these guys are complete retards, seriously... I give them a techdirt link after they ask me for a legal source saying that copyright infringement =/= theft
[19:49] and they are so busy bitching about me and stroking their own ego that they don't even notice that the legal document itself is embedded on the page
[19:50] joepie91: ¬_¬ I deal with people like this all day. Not worth the hassle
[19:50] and keep claiming that I should 'show a legal document' and that 'techdirt is not a legal source'
[19:50] then again I get involved too
[19:50] SmileyG: I am asking on their "official IRC channel"
[19:50] some guy was telling me earlier how " " == ""
[19:50] i.e. space == null
[19:51] 21:51 < yano> duckduckgo doesn't provide more than the 0-click and the first result via it's API
[19:51] 21:51 < crazedpsyc> we don't and can't allow that due to licensing on the results
[19:52] SmileyG: ^
[19:53] ghay.
[19:54] they like to see themselves as rebels "ooo we block tracking"
[19:54] but then if you want to shove the results up theirs they all go "I am only thirteen!"
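A rough sketch of harvesting member-site URLs from the A.shtml-Z.shtml listing pages found above. The listings URL pattern comes straight from the log; the href-matching is only an assumption about the page markup, and, as noted, many entries are broken or outdated, so failures are simply skipped.

```python
# Sketch: walk the A-Z listing pages and collect member-site URLs in either
# of the two forms. Markup assumption: sites appear as plain href links.
import re
import string
import urllib.request

LISTING = "http://www.btinternet.com/new/content/custhome/listings/{}.shtml"
SITE_RE = re.compile(
    r'href="(http://www\.[\w.-]+\.btinternet\.co\.uk[^"]*'
    r'|http://www\.btinternet\.com/~[^"]+)"', re.IGNORECASE)

def member_sites():
    seen = set()
    for letter in string.ascii_uppercase:
        try:
            page = urllib.request.urlopen(LISTING.format(letter), timeout=30).read()
        except Exception:
            continue  # some listing pages are simply broken, as noted above
        for site in SITE_RE.findall(page.decode("latin-1", "replace")):
            if site not in seen:
                seen.add(site)
                yield site
```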
[19:54] or something
[19:58] lol;
[20:04] what about simple wget?
[20:05] and a picture of rob? http://www.asallen.btinternet.co.uk/Files/Rob.JPG
[20:05] Wget, no way.
[20:05] We'll be doing WARCs.
[20:05] We want these in wayback.archive.org
[20:06] joepie, what the fuck are you arguing about, and why is it not in -bs?
[20:06] Because it really should be there.
[20:06] Damn, Save Rob
[20:06] Because 10-1 Rob ain't making it another year
[20:08] SketchCow: not really an argument, more a quick remark :P
[20:08] but yeah, I'll take it to bs
[20:08] Something to be aware of: www.btinternet.com/~$u == www.$u.btinternet.co.uk
[20:08] Oh hey it's big rob.
[20:09] alard: urgh I should have said that, it seemed obvious to me ¬_¬
[20:09] Well, it means: 1. any searches (google etc.) for usernames have to be done for both types of urls, and 2. we need to decide which type to download.
[20:10] it gets more complicated:
[20:10] http://www.sam.gamble.btinternet.co.uk/languages/english/english.inc.php
[20:10] Both versions, perhaps.
[20:10] note the sam.gamble
[20:10] yes
[20:10] http://www.btinternet.com/~sam.gamble/
[20:10] My attitude is download both
[20:10] so btinternet.com/~sam.gamble?
[20:11] All of them.
[20:11] We'll keep them separate in the dump
[20:11] Do some compares
[20:12] et presto!
[20:12] SketchCow, alard, SmileyG, http://www.btinternet.com/new/content/custhome/
[20:12] directory :)
[20:12] oh wait
[20:12] ah yes
[20:12] .com == .co.uk
[20:13] so that should have a listing
[20:13] [20:39:41] < SmileyG> http://www.btinternet.com/new/content/custhome/listings/A.shtml
[20:13] joepie91: those indexes are outta date it seems :(
[20:13] :(
[20:14] This is a nice username: http://www.c.h.f.btinternet.co.uk/
[20:14] mmm
[20:15] SmileyG, what are you basing it on that it's outdated?
[20:15] lots of them redirect to yahoo login
[20:15] the first 10 or so links didn't work for me :<
[20:17] mmm
[20:17] well a lot redirect to yahoo login and a lot just 404
[20:17] or rather 500
[20:18] I'd like this to be a warrior project, and I want us to fucking DEMOLISH this site
[20:18] This is a chance to get a geocities "right"
[20:19] mmmm
[20:19] anyone up for running a google crawler?
[20:19] I'm doing a little bit of google crawling now, I'll stop later. 2200 usernames so far.
[20:19] My ipv6 trick no longer works, so it's going slow.
[20:19] alard: and you're not getting banned?
[20:19] ipv6 trick..?
[20:20] also, in case it's useful, http://git.cryto.net/cgit/crytobooks/tree/crawler/crawler.py has some code that should work again with a few fixes
[20:20] to crawl google (as well as calibres)
[20:20] does pretty well in the not-getting-banned department :P
[20:20] It used to be that with a /48 ipv6 subnet you could switch ip addresses between /64 subnets.
[20:20] ahh
[20:20] So as long as you picked a different /64 subnet each time, you could keep on searching.
[20:21] hmm... alard: what if you spread the task between warriors?
[20:21] the googling
[20:21] It would be nice if we could find this newsgroup: btinternet.homepages.announce
[20:22] probably on their nntp server
[20:22] 1. I don't have a googling warrior task, no infra to handle the results. 2. You'd have a lot of unused time for the warrior.
[20:22] also, alard, another option although it may be a bit strange, is downloading database dumps from hacks etc, extracting btinternet e-mail addresses, and checking if a site exists for the username used for the email
[20:22] hmm
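Since the two URL forms above are equivalent and the decision is to download both, a trivial helper, sketched here and not part of the project code, can expand a username such as sam.gamble into both start URLs:

```python
def start_urls(username):
    """Return both equivalent entry points for one member site."""
    return [
        "http://www.btinternet.com/~%s/" % username,
        "http://www.%s.btinternet.co.uk/" % username,
    ]

# start_urls("sam.gamble")
# -> ['http://www.btinternet.com/~sam.gamble/',
#     'http://www.sam.gamble.btinternet.co.uk/']
```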
[20:23] So to make this a warrior project, we'd need to: 1. figure out a good wget command that downloads everything for a user 2. make a warrior project (with rsync upload) 3. have a list of users
[20:23] wget -m with it limited to the ~user won't work?
[20:24] Yes, something like wget -m.
[20:25] list of users is the main thing
[20:26] and page-requisites, and something to prevent infinite recursion, and adjust-extension, and ... ?
[20:27] SketchCow: Can you make a btinternet thing on fos?
[20:27] (We'll need it at some point.)
[20:32] I'm sure there's lots of users whose index pages don't link to everything on the account
[20:32] DFJustin: in those cases we are screwed afaik
[20:33] just grabbing a list of sub-urls from google would probably improve it substantially
[20:33] unless google has randomly crawled them due to linking at some point
[20:33] I'll generate a list of all shorturls that link to btinternet if that helps
[20:34] there's also http://wayback.archive.org/web/*/http://www.btinternet.com/~shawater/*
[20:35] well, yeah... when crawling google for usernames, why not save all the URLs along the way?
[20:35] :P
[20:36] well you're gonna know the username, but yeah
[20:36] add the pages to the list for that username...
[20:36] thing is... google seems to pick up URLs from a lot of places
[20:37] not just crawling
[20:37] I've more than once seen URLs show up in the search engine index simply because they were pasted in google docs, sent via gmail, opened in a browser that uses google safe browsing, ...
[20:37] even if we can't find it, maybe google has
[20:38] google safe browsing ought not submit urls, they use a bloom filter to vastly reduce queries
[20:39] in most cases the url doesn't leave your machine
[20:45] ... I have an idea, one moment
[20:46] http://ip.robtex.com/213.123.20.90.html#shared
[20:46] well, that's a start
[20:48] yeah, looks like all free hosted sites are on 213.123.20.90
[20:48] Oh, good idea
[20:48] literally all of them
[20:48] the IPs around it host the FTP server, real domains, and internal servers
[20:48] lol @ them being blacklisted
[20:49] this has a few as well: http://support.clean-mx.de/clean-mx/viruses.php?ip=213.123.20.90&sort=email%20asc
[21:02] what's going on
[21:03] http://bt.custhelp.com/app/answers/detail/a_id/39105/~/the-free-bt-web-hosting-service-is-closing
[21:04] oh
[21:08] https://github.com/ArchiveTeam/btinternet-grab
[21:08] More specifically, what about these wget options? https://github.com/ArchiveTeam/btinternet-grab/blob/master/pipeline.py#L65-86
[21:12] alard: Seems fine, but wouldn't that fetch everything twice, using both URLs?
[21:12] Yes.
[21:12] Guess that's better if it goes into the wayback machine.
[21:12] Does wget-warc support dedup?
[21:13] A little bit, but not for items with different urls.
[21:14] How many users do you already have?
[21:14] 2250
[21:15] :/
[21:15] one or two people should be able to bang this out...
[21:15] We're going to find more, right?
[21:16] I hope so :D
[21:16] I hope so, I am searching both urlteam data and the domain name system for usernames
[21:20] sry guys I'm utterly out of it atm, this cold is really hammering me
[21:35] alard: http://helo.nodes.soultcer.com/btinternet-dns.txt (usernames for btinternet)
[21:38] Thanks. Processed items: 1297
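The real options live in the pipeline.py linked above; purely as a hedged approximation of what is being discussed here (both URL forms, page requisites, adjust-extension, a recursion cap, WARC output), a per-user argument list might look roughly like the following. The flag choices and user-agent string are guesses, not a copy of the repository.

```python
def wget_args(username, warc_base):
    # warc_base is the WARC name without the .warc.gz suffix wget appends,
    # e.g. "btinternet-sam.gamble" (example name, not from the project).
    start_urls = [
        "http://www.btinternet.com/~%s/" % username,
        "http://www.%s.btinternet.co.uk/" % username,
    ]
    return [
        "wget",
        "--recursive",
        "--level=20",              # something to prevent infinite recursion
        "--no-parent",
        "--page-requisites",
        "--adjust-extension",
        "--timeout=60",
        "--tries=3",
        "-e", "robots=off",
        "-U", "ArchiveTeam btinternet grab (sketch)",
        "--warc-file", warc_base,
        "--warc-header", "operator: Archive Team",
    ] + start_urls
```

Because wget does not span hosts during recursion by default, each of the two start URLs stays on its own hostname, which matches the "fetch everything twice, using both URLs" trade-off discussed above.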
[21:38] http://tracker.archiveteam.org/btinternet/
[21:45] alard: http://helo.nodes.soultcer.com/btinternet-urlteam1.txt (from URLteam this time), I'll have a second list where I scan for the co.uk version of the URL, but I have to go to bed now so I can only send that last one to you tomorrow
[21:48] soultcer: Added. Processed items: 960, added to main queue: 621
[21:48] (And going to bed is a sensible idea. I might copy that.)
[21:49] Well extracting the URLs would take a couple hours. Not going to bed right now, but before processing would be done
[21:50] I can give you an account on the tracker so you can add them yourself, if you'd like.
[21:50] (Not sure how many more usernames you'll be trying to find.)
[21:51] Only the ones from the second run over the URLteam data
[21:51] I have no other data sources available
[21:51] I am actually quite surprised I even found any hostnames via DNS at all
[22:15] underscor: I added another: http://archive.org/details/cdrom-linuxformatmagazine-154
[22:16] before you say it's too early to upload: http://linuxformat.com/archives?issue=154
[23:27] alard: You have alardland to make new subworlds
[23:35] I wonder how big soundcloud is
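Circling back to the username hunting above: pulling usernames out of URL dumps like the URLteam lists could be done roughly as follows. The two patterns only cover the URL forms mentioned in this channel, and the one-URL-or-hostname-per-line input format is an assumption about what those dumps contain.

```python
# Sketch: read URLs or hostnames on stdin and print the btinternet usernames
# they contain, covering both the /~user and the user.btinternet.co.uk form.
import re
import sys

USER_RES = [
    re.compile(r'btinternet\.com/~([\w.-]+)', re.IGNORECASE),
    re.compile(r'(?:www\.)?([\w.-]+?)\.btinternet\.co\.uk', re.IGNORECASE),
]

def usernames(lines):
    seen = set()
    for line in lines:
        for pattern in USER_RES:
            for name in pattern.findall(line):
                name = name.lower().strip(".")
                if name and name != "www" and name not in seen:
                    seen.add(name)
                    yield name

if __name__ == "__main__":
    # e.g.  python extract_usernames.py < btinternet-urlteam1.txt
    for name in usernames(sys.stdin):
        print(name)
```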