[00:03] *** rolfb has joined #archiveteam
[00:03] SketchCow: around? :-)
[00:06] *** rolfb has quit IRC (Linkinus - http://linkinus.com)
[00:37] *** schbirid2 has quit IRC (Quit: Leaving)
[00:39] *** primus104 has quit IRC (Leaving.)
[01:14] *** mistym has quit IRC (Remote host closed the connection)
[01:23] *** ehea617 has quit IRC (Quit: Page closed)
[01:40] *** WubTheCap has joined #archiveteam
[01:40] 8chan is going to delete a lot of stuff soon, if that matters. https://twitter.com/infinitechan/status/574345499832029184
[02:30] there was a paranoia grab in january: http://archive.fart.website/archivebot/viewer/job/1xnyf
[02:30] almost 300GB of data
[02:32] run it in archivebot again for a bit to grab recent things maybe?
[02:33] It's the redeem board page, let me find it
[02:33] 03:36:33 @WubTheCaptain | copypaste: I don't like you going destroying boards like that. reddit doesn't delete inactive boards. At least make a dump and upload it to archive.org
[02:33] 03:37:01 @WubTheCaptain | Also, https://twitter.com/infinitechan/status/574345499832029184 was vague for me, all three conditions required or one condition only?
[02:33] 03:42:53 +copypaste | all 3
[02:35] I can't seem to find it now
[02:35] Here it is.
https://8ch.net/claim.html
[02:35] A list of the boards, at least
[02:39] i'll throw the subdirectories into archivebot in a few hours if no one does it by then
[02:40] *** okeuday has joined #archiveteam
[02:40] 67-68 hours remaining
[02:49] *** Froggypwn has quit IRC (Ping timeout: 512 seconds)
[02:50] *** Froggypwn has joined #archiveteam
[02:55] *** nwf has quit IRC (WeeChat 1.0.1)
[03:26] *** nwf has joined #archiveteam
[03:28] *** Ymgve has quit IRC ()
[04:01] *** khaoohs_ has joined #archiveteam
[04:04] *** oli has quit IRC (Read error: Operation timed out)
[04:04] *** T31m_ has joined #archiveteam
[04:04] *** khaoohs has quit IRC (Read error: Connection reset by peer)
[04:05] *** oli has joined #archiveteam
[04:09] *** T31M has quit IRC (Read error: Operation timed out)
[04:15] *** techapj has joined #archiveteam
[04:16] Hello Archive team!
[04:17] I need help with installation of wget-warc-lua on Ubuntu 14.04 server
[04:17] I am an intern at Discourse and I have been assigned a task to archive a fairly large vBulletin forum. I found your excellent vBulletin archive script: https://github.com/ArchiveTeam/wget-lua-forum-scripts/blob/master/vbulletin.lua
[04:18] That script is exactly what I need for archiving a vBulletin forum, but it needs wget-warc-lua installed/compiled on the system. I tried compiling it via: https://github.com/ArchiveTeam/tabblo-grab/blob/master/get-wget-warc-lua.sh
[04:19] try using a recent build script from here for example: https://github.com/ArchiveTeam/testflight-grab
[04:19] *** test_ has joined #archiveteam
[04:20] the one from 2012 is too old
[04:21] @chfoo thanks for the help, trying to compile from that right now
[04:23] the build is failing
[04:23] checking lua.h usability... no checking lua.h presence... no checking for lua.h... no checking lua5.1/lua.h usability... no checking lua5.1/lua.h presence... no checking for lua5.1/lua.h... no configure: error: lua not found wget-lua not successfully built.
[04:24] I installed lua on Ubuntu 14.04
[04:24] ~/testflight-grab# lua -v Lua 5.2.3 Copyright (C) 1994-2013 Lua.org, PUC-Rio root@wget-gearbox:~/testflight-grab#
[04:25] you need libgnutls-dev lua5.1 liblua5.1-0 liblua5.1-0-dev bzip2 zlib1g-dev
[04:26] ok, just installed these dependencies, trying the build again
[04:31] now getting error:
[04:31] POD document had syntax errors at /usr/bin/pod2man line 71. make[2]: *** [wget.1] Error 255
[04:33] the bit near the bottom of the readme should fix that
[04:34] wget-lua should already be built
[04:34] oh sorry, should have read the README :)
[04:35] thanks a lot for your help @chfoo! you are awesome ;)
[04:37] *** Kenshin has quit IRC (Ping timeout: 246 seconds)
[04:38] no problem
[04:41] @chfoo i cannot find the wget file in /get-wget-lua.tmp/src
[04:41] i can see wget.h but not wget
[04:41] *** Kenshin has joined #archiveteam
[04:45] techapj: i guess you have a different error. could you see if you can make this change at https://github.com/ArchiveTeam/wget-lua/commit/5d7348c0d047331539ac38e64fdb53bb5e52aae4 and avoid trying to build the doc
[04:46] techapj: welcome to archiveteam! (btw, the typical way to mention people on irc is like i just did here, not with an @-sign.)
[04:46] glad to hear someone else is tackling The Forums Problem :)
[04:47] thanks xmc, this is my first time chatting on irc :)
[04:47] xmc: a vBulletin forum is a nightmare to archive
[04:47] i guessed as much ... using @ to talk to people is a pretty strong indicator
[04:47] yes. yes it is.
[04:48] i am slowly working on a semi-secret project to archive all the forums.
[04:48] the vbulletin.lua script is the only hope i have of archiving a fairly large forum
[04:49] care to say what forum it is?
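[Editor's note] The build sequence discussed above can be sketched as a command list. The package names are exactly those given in the channel; the repo URL and the `get-wget-lua.sh` script name follow the testflight-grab layout and are assumptions, not verified here.

```python
# Hedged sketch of the wget-lua build steps from the discussion above.
# The dependency list is as stated in the channel; the build-script name
# is an assumption based on the ArchiveTeam grab-repo convention.
import subprocess

BUILD_DEPS = [
    "libgnutls-dev", "lua5.1", "liblua5.1-0",
    "liblua5.1-0-dev", "bzip2", "zlib1g-dev",
]

def build_steps(repo="https://github.com/ArchiveTeam/testflight-grab"):
    """Return the shell commands to fetch a grab repo and build wget-lua."""
    return [
        ["apt-get", "install", "-y"] + BUILD_DEPS,  # needs root
        ["git", "clone", repo],
        ["bash", "get-wget-lua.sh"],  # assumed script name, run inside the clone
    ]

def run(steps, dry_run=True):
    """Print each step; execute them only when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)

run(build_steps())
```

A dry run just prints the three commands so they can be reviewed before executing on a real Ubuntu 14.04 box.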
[04:49] oldforums.gearboxsoftware.com
[04:49] they are now our client, moved to Discourse: http://forums.gearboxsoftware.com/
[04:50] ah, that's fairly large then
[04:50] i have a 1/4-working vbulletin crawler; abandoned dev on that a few years ago because it became unmaintainable
[04:51] i was trying to make it resilient to variation across themes, which as it turns out is an AI-hard problem
[04:51] is the vbulletin.lua script reliable for the gearbox vB archive?
[04:52] to be honest, i have doubts
[04:52] beats me. it will get all the things, if it doesn't explode in the process
[04:52] i have spent more than 4 days struggling with pure wget, gave up at last :(
[04:52] i see that they use normal URLs
[04:52] yes, pure wget on a forum will explode quickly
[04:53] are you using wget or wpull?
[04:53] wget 1.16.2
[04:53] ah, wget.
[04:54] so if wget doesn't work for you, you may get better results with wpull.
[04:54] iirc wpull is more suited to be a crawler than wget, for a few reasons
[04:54] such as it keeps its queue of urls in (iirc) a sqlite file, rather than ram
[04:54] ah, https://github.com/chfoo/wpull
[04:55] made by chfoo :)
[04:55] :)
[04:56] so would you recommend that instead of using vbulletin.lua?
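[Editor's note] A minimal sketch of the wpull invocation the points above suggest: WARC output plus an on-disk sqlite URL queue rather than an in-RAM one. The flag names follow wpull's wget-style CLI but should be checked against the installed version; the job name is a placeholder.

```python
# Hedged sketch: build a wpull command line reflecting the discussion above
# (sqlite-backed queue, WARC output). Flag names are wpull's wget-style
# options as best recalled here; verify with `wpull --help` before use.
def wpull_cmd(url, name):
    """Return a wpull argv for a polite recursive WARC crawl of `url`."""
    return [
        "wpull", url,
        "--recursive",
        "--warc-file", name,          # write a WARC instead of plain files
        "--database", name + ".db",   # keep the URL queue in sqlite, not RAM
        "--no-robots",
        "--wait", "1", "--random-wait",
    ]

print(" ".join(wpull_cmd("http://oldforums.gearboxsoftware.com/", "gearbox")))
```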
[04:57] vbulletin.lua has treated me well in the past
[04:57] the main problem with the gearbox forum is:
[04:57] Threads: 348,575, Posts: 4,879,823, Members: 519,365
[04:57] if the job turns out to be too big for wget, though, you might want to look into getting vbulletin.lua to run with wpull instead of wget
[04:57] yes
[04:57] it is a not-small forum
[04:58] i tried doing: `wget --mirror --adjust-extension --no-clobber --convert-links --random-wait --no-parent --page-requisites robots=off -U mozilla http://oldforums.gearboxsoftware.com/`
[04:58] but it downloaded over 20GB of data and ran in an infinite loop
[04:59] using a crawler directly on a blog site, without a script to tell it what to ignore, will result in an unsatisfactory output
[04:59] please commit that sentence to memory
[04:59] er
[04:59] s/blog site/forum/
[04:59] sorry, it's 21:00 on saturday night and i just had my first coffee
[05:00] to crawl a forum satisfactorily, you need one of two things:
[05:00] 1)
[05:00] xmc: no problem, thanks for your help and recommendations
[05:00] 1) a large set of regular expressions to ignore urls
[05:00] i really appreciate it :)
[05:00] 2) a script to drive the crawler
[05:00] sure thing
[05:00] i really appreciate it :)
[05:01] my pleasure
[05:01] and still learning how to chat on irc ;)
[05:01] oh, there's this too https://github.com/ludios/grab-site if you want to ignore urls on the fly
[05:02] chfoo: thanks! will look into it
[05:07] chfoo: I am a wget beginner and have never archived a site/forum/blog before. What would you recommend for archiving a large vB forum? Until now i have figured out options like vbulletin.lua and wpull, but what would you recommend?
[05:11] techapj: i guess you could take a look at running heritrix (i never used it before) or perhaps try setting up archiveteam's archivebot (given that you change the user-agent and other hardcoded settings)
[05:16] chfoo: ok, will look into it.
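[Editor's note] Option 1) above, a set of ignore regexes, can be sketched as a URL filter. The patterns below are illustrative guesses at classic vBulletin crawl traps (session ids, printable views, re-sorted listings) and are NOT the actual vbulletin.lua rules; the forum host is a placeholder.

```python
# Minimal sketch of option 1): filter discovered URLs through ignore
# regexes. Patterns are illustrative vBulletin crawl traps, not the
# real vbulletin.lua rules.
import re

IGNORE_PATTERNS = [
    re.compile(r"[?&]s=[0-9a-f]{32}"),                    # session id in query string
    re.compile(r"printthread\.php"),                      # printable thread view
    re.compile(r"newreply\.php|sendmessage\.php|login\.php"),
    re.compile(r"[?&]sort=|[?&]order=|[?&]daysprune="),   # re-sorted listings
]

def should_fetch(url):
    """True if the URL matches none of the ignore patterns."""
    return not any(p.search(url) for p in IGNORE_PATTERNS)

urls = [
    "http://forum.example/showthread.php?t=12345",
    "http://forum.example/printthread.php?t=12345",
    "http://forum.example/forumdisplay.php?f=7&sort=voteavg",
]
print([u for u in urls if should_fetch(u)])
# -> only the showthread.php URL survives
```

This is the kind of list that grows to hundreds of entries on a real forum, which is why the channel's other suggestion, a script that drives the crawler, scales better.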
thanks for your help :)
[05:34] since you work with the company hosting the forum, can't you just ask for a database dump?
[05:36] fenn: we have a database dump, but we want to convert the vB forum to a read-only static HTML archive
[05:37] we have already imported the old data into the new Discourse forum
[05:37] but gearbox wants to host a read-only static HTML copy of the old forum
[05:52] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:54] *** dashcloud has joined #archiveteam
[05:57] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:05] *** dashcloud has joined #archiveteam
[06:11] i'd like to move the warrior tracker machine (shilling) to a different hosting provider before the end of march
[06:12] headsup, chfoo Smiley yipdw GLaDOS arkiver underscor
[06:15] cool np
[06:15] FWIW, DigitalOcean seems to do ok
[06:15] oh, i have a provider in mind
[06:15] ah
[06:16] i'm a part owner in http://vpssd.com/ so it'll be $40 a month of actual savings for me
[06:16] is there an ArchiveTeam discount
[06:17] lol
[06:17] it's a money-losing business venture already, why would we give out discounts
[06:18] how is it losing money?
[06:18] DigitalOcean is ready to kick you out on the first abuse notices, even if they're not valid.
[06:19] Also, >clown >no OpenBSD ISOs
[06:19] Ctrl-S: we don't have enough paying customers to pay all of the colo bills.
[06:19] basically.
[06:20] I'm getting a Portlane colo next week.
[06:21] xmc: SVColo is quite an unusual data center choice.
[06:21] For what the webserver is hosted on, at least
[06:21] -> #archiveteam-bs
[06:26] anyway, people who have shit on that box ( chfoo Smiley yipdw GLaDOS arkiver underscor ), ping me if i need to do something other than rsync it over
[06:27] xmc: it should be fine, just let me or chfoo know before you shut it down so we can get redis to properly shut down and write
[06:28] def. i think i'll do it next weekend, unless we're in the middle of a firedrill.
[06:28] ok
[06:28] turn off all the services, point the dns to the new box, rsync everything over, turn on services on the new box ... ?
[06:29] provided it's an rsync from / I don't see a reason why it wouldn't work
[06:29] * xmc nods
[06:29] sounds good
[06:30] /home/tinytown has about 40G
[06:30] oof
[06:30] :P
[06:31] the box is on the previous version of debian; how much of a little hell would it be to go from deb 6 to deb 7 on the new box?
[06:31] and not a full / rsync
[06:32] Depends on your partitioning and the RNG
[06:32] ?
[06:33] i'm also a fan of occasionally redeploying machines from scratch, to head off bitrot
[06:33] we could just see what happens
[06:33] unless you're planning to turn off the old tracker host immediately
[06:33] that's the spirit!
[06:34] it's probably also the only sane approach :P
[06:34] everything else gets lost in HN-style what-if fappage
[06:34] nah I can keep it up until the end of the month
[06:35] http://xrtc.net/f/pixen/we-stop-bit-rot.png
[06:35] i should eventually do the same for archivebot's control host
[06:36] it's like on Ubuntu pwned.04
[06:36] LTS
[06:36] hahaha
[06:36] long term suckage
[06:40] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[06:48] *** edward_ has joined #archiveteam
[07:22] *** signius has quit IRC (Ping timeout: 306 seconds)
[07:34] *** signius has joined #archiveteam
[07:43] *** X-Scale has joined #archiveteam
[07:43] *** primus104 has joined #archiveteam
[07:49] *** mistym has joined #archiveteam
[08:38] *** nertzy has joined #archiveteam
[08:39] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[08:40] *** nertzy2 has quit IRC (Read error: Operation timed out)
[08:40] *** codinghor has joined #archiveteam
[08:40] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[08:42] this is a very unfortunate substring length for my handle
[08:43] *** dashcloud has joined #archiveteam
[08:44] yahoosucks
[08:45] that worked thanks
[08:47] *** nertzy has quit IRC (Read error: Operation timed out)
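[Editor's note] The migration plan agreed above (stop services so redis writes its dump, rsync the data over, repoint DNS, start services on the new box) can be sketched as an ordered command list. Host names and the service list are placeholders, not the real tracker setup; the rsync is framed as a pull run on the new box, since rsync cannot copy remote-to-remote directly.

```python
# Hedged sketch of the tracker migration steps, to be run on the NEW box.
# OLD/NEW hostnames and SERVICES are placeholders for illustration only.
OLD = "old-tracker.example"
SERVICES = ["tracker", "redis-server"]  # placeholder service names

def migration_steps():
    """Ordered argv list: stop old services, pull data, start new services."""
    steps = []
    for svc in SERVICES:
        # stopping redis cleanly makes it write its dump before the copy
        steps.append(["ssh", OLD, "service", svc, "stop"])
    steps.append(["rsync", "-aH", "--numeric-ids",
                  OLD + ":/home/tinytown/", "/home/tinytown/"])
    # (repoint DNS to the new box here, then bring services up locally)
    for svc in SERVICES:
        steps.append(["service", svc, "start"])
    return steps

for cmd in migration_steps():
    print(" ".join(cmd))
```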
[08:49] *** edward_ has quit IRC (Ping timeout: 512 seconds)
[09:05] *** lag has joined #archiveteam
[09:19] codinghor: yeah. you might want to trim the other side: inghorror
[09:20] though i'm fond of dinghorro
[09:32] *** mistym has quit IRC (Remote host closed the connection)
[09:53] *** primus104 has quit IRC (Leaving.)
[10:08] in the future, people will be able to use handles of MORE than 10 characters! Madness.
[10:13] sheeeit, we're still stuck at 9 over here
[10:25] cdnghrrr, and you get one spare
[10:25] xmc: kill all my stuff
[10:26] nuke the account and everything in it?
[10:26] yup
[10:26] ok
[10:26] what
[10:26] there are no files in your ~ anyway
[10:26] nod
[10:26] we don't nuke things over here
[10:27] not sure i even used that one?
[10:27] ok fine i will comment out the line in /etc/passwd and carefully move the ~ to somewhere in /var/backup
[10:27] excellent
[10:27] lol k
[10:28] "it's empty" is also data
[10:28] just kidding
[10:28] deluser is fine
[10:28] this conversation records that it is empty
[10:29] oh hey there are hidden history files
[10:29] how the hell did you view redtube on this box
[10:29] lmao
[10:29] and why is it in your redis_cli history
[10:30] actually, switch the "how" and "why" in those two sentences
[10:31] [silence]
[10:45] *** nertzy has joined #archiveteam
[11:06] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[11:32] *** schbirid has joined #archiveteam
[11:42] *** primus104 has joined #archiveteam
[11:58] *** Sellyme_ has quit IRC (Ping timeout: 265 seconds)
[11:58] *** Sellyme has joined #archiveteam
[12:03] *** Ymgve has joined #archiveteam
[12:26] *** primus104 has quit IRC (Read error: Connection reset by peer)
[12:31] *** primus104 has joined #archiveteam
[12:41] *** codinghor has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[12:44] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[12:45] *** dashcloud has joined #archiveteam
[12:47] *** RuairiCOL has joined #archiveteam
[12:48] Hello all - a site I'm a fan of is at risk because its file download server has gone offline - I have a copy of the entire folder structure up until November 2014 and I'm uploading it to my own server for the moment, but I'm wondering if archiveteam has somewhere I can store it in addition, just in case
[12:48] The site is hybridized.org and contains a load of DJ sets
[12:49] I've got around 190GB to upload
[12:49] SketchCow: *boop* I'm @rc55 from Twitter and the demoscene, I also run a small demoparty called Sundown :)
[12:50] Internet Archive will want it
[12:50] definitely IA material imo
[12:53] cool - the upload tool is a bit of a blunt instrument, is there any way I can ftp up the stuff considering the size? There are also cue files that have useful metadata
[12:56] errr
[12:56] i wasn't viewing redtube :D
[12:56] we may have looked at archiving it D:
[12:58] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[12:59] *** dashcloud has joined #archiveteam
[13:02] why not just zip it all?
[13:02] site.tar.gz or site.zip
[13:11] I'll work on that - I'll get this first upload done then upload it to archive.org next week (192gb at 512k p/sec...)
[13:17] ask the others before you bother doing anything
[13:17] i'm not a good source of advice
[13:27] RuairiCOL: archive.org does have a way of uploading as a torrent, which sounds like it's easier for you http://archiveteam.org/index.php?title=Internet_Archive#Uploading_to_archive.org
[13:53] don't zip things like media files
[13:54] you can upload with curl, s3cmd or some python/perl tools
[13:55] and the iauploader script
[14:34] Is there something other than the Wayback Machine I can check for removed/expired pastebins?
[14:37] that sounds like a good project for urlteam
[14:37] pastebins?
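[Editor's note] One of the curl/s3cmd routes mentioned above is archive.org's S3-like API. A hedged sketch of building such a curl command follows; the item identifier, filename, and title are placeholders, and a real upload needs IA S3 keys (from the archive.org account's S3 settings) in place of ACCESS/SECRET.

```python
# Hedged sketch: build a curl command that PUTs one file into an
# archive.org item via the S3-like endpoint. Identifier, path, and
# metadata here are placeholders for illustration.
def ia_upload_cmd(identifier, path, access="ACCESS", secret="SECRET", title=None):
    """Return a curl argv for uploading `path` into item `identifier`."""
    cmd = [
        "curl", "--location",
        "--header", "authorization: LOW %s:%s" % (access, secret),
        "--header", "x-archive-auto-make-bucket:1",  # create the item if needed
    ]
    if title:
        cmd += ["--header", "x-archive-meta-title:" + title]
    cmd += ["--upload-file", path,
            "https://s3.us.archive.org/%s/%s" % (identifier, path)]
    return cmd

print(" ".join(ia_upload_cmd("hybridized-org-mirror", "sets-2014-11.tar",
                             title="hybridized.org DJ sets")))
```

For a 190GB collection this would be run per file (or per tar volume), which is also roughly what the iauploader script and the `ia` command-line tool automate.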
[14:37] i've got code for pastebin stuff
[14:37] sort of
[14:38] recursive download stuff
[14:38] well, I'm looking for one specific paste
[14:38] there's also an AT project for new pastes
[14:38] why?
[14:38] from 2013
[14:38] well, the paste "has been removed" and the author doesn't have it any more
[15:07] *** Emcy has quit IRC (Read error: Connection reset by peer)
[15:07] *** Emcy has joined #archiveteam
[15:18] *** ohhdemgir has quit IRC (Read error: Operation timed out)
[15:20] *** ohhdemgir has joined #archiveteam
[15:22] *** Emcy has quit IRC (Read error: Connection reset by peer)
[15:53] *** Emcy has joined #archiveteam
[16:00] http://blog.postach.io/post/brand-new-version-launched-billing-changes
[16:23] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[16:24] *** dashcloud has joined #archiveteam
[16:39] *** Emcy has quit IRC (Ping timeout: 512 seconds)
[16:48] *** mietek has joined #archiveteam
[16:51] Does anyone happen to have an archived copy of ftp://ftp.cpsc.ucalgary.ca/pub/projects/charity/ ?
[16:51] The web site is still online, but all the source/code links are dead: http://pll.cpsc.ucalgary.ca/charity1/www/home.html
[16:54] mietek: https://synrc.com/publications/cat/Functional%20Languages/Charity/ ?
[16:55] ats: wow, cool.
[16:56] ats as in the ATS language?
[16:57] no, ats as in my name ;-)
[16:57] Thanks. Was that just a good Google search?
[16:58] yup; I searched for charity-src.tar.gz...
[16:58] Thanks.
[17:55] *** mistym has joined #archiveteam
[17:57] *** dashcloud has quit IRC (Ping timeout: 370 seconds)
[17:57] *** dashcloud has joined #archiveteam
[18:02] *** Jonimus has quit IRC (Ping timeout: 260 seconds)
[18:07] *** rolfb has joined #archiveteam
[18:11] Unfortunately the example programs seem not to have been mirrored anywhere..
[18:12] (http://pll.cpsc.ucalgary.ca/charity1/www/examples.html)
[18:14] mietek: have you tried emailing them?
[18:14] Not yet. Will do.
[18:15] First collecting what I can before I start bothering people who apparently gave up on the thing 15 years ago.
[18:15] Sorry if this isn’t the right channel for these questions.
[18:17] *** rolfb has quit IRC (Leaving...)
[18:20] I was hoping there’s some secret FTP analog of archive.org which someone here might know.
[18:22] mietek: unfortunately there isn't. people have started archiving FTPs, but it's a bit late
[18:44] *** the_fox is now known as TheOtherF
[18:45] *** TheOtherF is now known as OtherFox
[18:57] *** lag2 has joined #archiveteam
[19:01] *** lag has quit IRC (Ping timeout: 512 seconds)
[19:12] *** robink has quit IRC (Quit: No Ping reply in 180 seconds.)
[19:13] is there any good way to scrape yahoo automatically?
[19:13] *** robink has joined #archiveteam
[19:13] i normally save search engine pages manually and extract urls with regex, but in the case of google business sitebuilder there's over 100 result pages per domain
[19:19] Start: yes, use the bing API
[19:35] bling api
[19:36] Sanqui: https://archive.org/details/pastebinpastes goes back to 2013-10-30
[20:18] *** yan has joined #archiveteam
[20:50] *** bzc6p has joined #archiveteam
[20:51] Start: http://archiveteam.org/index.php?title=Site_exploration
[20:57] *** mistym has quit IRC (Remote host closed the connection)
[21:13] *** bzc6p has left
[21:32] *** Emcy has joined #archiveteam
[21:37] *** schbirid has quit IRC (Quit: Leaving)
[21:49] *** Ravenloft has joined #archiveteam
[22:31] where are the nokia memories archives? i can't find them anywhere
[22:35] *** BlueMaxim has joined #archiveteam
[22:53] *** serapeum has joined #archiveteam
[22:56] *** Emcy_ has joined #archiveteam
[22:56] *** Emcy has quit IRC (Ping timeout: 306 seconds)
[22:58] *** lag2 has quit IRC (Read error: Operation timed out)
[23:52] *** signius has quit IRC (Ping timeout: 306 seconds)