[00:12] *** JetBalsa has quit IRC (Read error: Operation timed out)
[00:13] *** JetBalsa has joined #archiveteam
[01:26] *** balrog has quit IRC (Bye)
[01:27] *** philpem has quit IRC (Read error: Operation timed out)
[01:37] *** balrog has joined #archiveteam
[01:37] *** swebb sets mode: +o balrog
[01:49] balrog DFJustin yipdw ersi chfoo can one of you test my rsync setup? i'm behind a router.
[02:01] *** Microguru has joined #archiveteam
[02:05] http://www.tsumino.com/Forum/viewtopic.php?f=6&t=2615 the fate of this porn comic site looks grim.
[02:05] I
[02:06] I've already started to work out how to archive the contents, some of which may not be mirrored elsewhere (the site's already the heir to the collection of another site that went under earlier this year)
[02:07] comics are numbered in ascending order, starting at 1. assuming that the front page accurately represents the latest contents, there are ~20,500 comics hosted
[02:09] download links require an account (name/password/non-verified email for reg.) and follow the pattern of http://www.tsumino.com/Book/Download/#### and gallery links follow the pattern of http://www.tsumino.com/Book/Info/####
[02:10] the aforementioned download links include only the images, no tags. I suspect that download link + html of the gallery page should suffice to preserve almost all of the information
[02:11] the bad news is that reCAPTCHA use is rampant. doing anything requires using the "I'm not a robot" checkbox
[02:13] there's also a forum http://www.tsumino.com/Forum/ but it doesn't look like there's much there
[02:15] https://web.archive.org/web/20160224224112/http://www.tsumino.com/ this is the latest IA snapshot, from march 7
[02:16] nothing's browsable via the IA link, as I would have guessed from the CAPTCHAs
[03:08] *** bwn has quit IRC (Ping timeout: 492 seconds)
[03:17] *** ppsym has joined #archiveteam
[03:18] *** altlabel has quit IRC (Ping timeout: 258 seconds)
[03:21] *** PurpleSym has quit IRC (Ping timeout: 506 seconds)
[03:23] *** ppsym is now known as PurpleSym
[03:26] *** Random_Ma has joined #archiveteam
[03:27] What Forsooth, Prithee Tell Me The Secret Word
[03:28] yahoosucks
[03:28] what is your quest
[03:30] Random_Ma: see above for the secret word
[03:31] *** Random_Ma has quit IRC (Quit: Page closed)
[03:32] http://www.tsumino.com/Forum/viewtopic.php?f=6&t=2615 the fate of this porn comic site looks grim.
[03:32] I've already started to work out how to archive the contents, some of which may not be mirrored elsewhere (the site's already the heir to the collection of another site that went under earlier this year)
[03:32] comics are numbered in ascending order, starting at 1. assuming that the front page accurately represents the latest contents, there are ~20,500 comics hosted
[03:32] download links require an account (name/password/non-verified email for reg.) and follow the pattern of http://www.tsumino.com/Book/Download/#### and gallery links follow the pattern of http://www.tsumino.com/Book/Info/####
[03:32] Microguru: you mentioned this above
[03:32] the aforementioned download links include only the images, no tags. I suspect that download link + html of the gallery page should suffice to preserve almost all of the information
[03:33] We heard you the first time.
[03:33] the bad news is that reCAPTCHA use is rampant. doing anything requires using the "I'm not a robot" checkbox
[03:33] there's also a forum http://www.tsumino.com/Forum/ but it doesn't look like there's much there
[03:33] https://web.archive.org/web/20160224224112/http://www.tsumino.com/ this is the latest IA snapshot, from march 7
[03:33] nothing's browsable via the IA link, as I would have guessed from the CAPTCHAs
[03:33] Microguru: do not post the same thing every hour
[03:33] but thank you for bringing it to our attention
[03:34] The channel is publicly archived; once you say something once, it's heard.
[03:34] (also lots of the people in here keep private archives)
[03:35] I didn't know if anyone saw it. as I understand it, IRC has no way of relaying messages to people who weren't in the chat when they were sent. the only solution I know of is to re-post every now and then. that, as I saw someone join.
[03:35] Understandable mistake.
[03:35] woop woop woop off-topic siren
[03:35] let's move on
[03:36] cool. in that case I don't need to keep re-posting. I'll only need to update when I find out something new.
[03:36] (or take it to #archiveteam-bs where longer conversations go)
[03:37] Microguru: yeah, go ahead and keep working on this. post your findings in here, people will def help out as able
[03:39] I'm afraid it would have to be a manual process, with hundreds of IPs and burner accounts. I asked the owner on the forum how long they think they can keep it going, and what their exit strategy was. That was recent, so they likely haven't seen it yet
[03:40] hm yeah. they seem to give 0-byte files if you don't come in with their cookie
[03:41] if you keep answering reCAPTCHAs correctly, will websites typically do IP blocks?
[03:42] I could spin up another copy of my VPN server to make a sacrificial IP address to test, time permitting.
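The 0-byte responses mentioned above are easy to miss in a long unattended run. A minimal sketch of one way to catch them, assuming cookies exported from a logged-in browser session into a Netscape-format file: the names FETCH, COOKIES, fetch_with_cookie and list_empty_downloads are made up for illustration, and whether the site accepts a reused cookie at all has not been confirmed here. `--load-cookies` and `-O` are standard wget options.

```shell
#!/bin/sh
# Sketch only: FETCH, COOKIES, fetch_with_cookie and list_empty_downloads are
# hypothetical names, not anything used in the channel.
FETCH=${FETCH:-wget}
COOKIES=${COOKIES:-cookies.txt}   # assumed Netscape-format cookie export

# fetch_with_cookie ID OUTFILE: request one download with the saved cookies
# and report failure when the result is missing or empty.
fetch_with_cookie() {
    $FETCH --load-cookies "$COOKIES" -O "$2" "http://www.tsumino.com/Book/Download/$1"
    [ -s "$2" ]
}

# list_empty_downloads DIR: print every zero-byte file under DIR (default: .)
list_empty_downloads() {
    find "${1:-.}" -type f -size 0
}
```

Running `list_empty_downloads .` after a batch gives the list of files that would need to be fetched again with a fresh cookie.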
[03:43] well maybe you could answer it right once and reuse the cookie a bunch
[03:44] on the good side, a good chunk of the content is re-uploads of stuff from other sites, meaning it can be skipped if IDed properly. on the bad side, I haven't figured out how to tell if that's the case or not for any particular comic
[03:46] *** PotcFdk has quit IRC (Remote host closed the connection)
[03:49] *** PotcFdk has joined #archiveteam
[04:00] *** gibigian1 has quit IRC (Ping timeout: 260 seconds)
[04:01] *** PurpleSym has quit IRC (*)
[04:01] *** ppsym has joined #archiveteam
[04:01] *** ppsym is now known as PurpleSym
[04:08] *** gibigiana has joined #archiveteam
[04:11] *** bwn has joined #archiveteam
[04:11] *** jmad980_ has quit IRC (Read error: Operation timed out)
[04:12] wget-ing the gallery pages stores important information like the tags and the other metadata, and would also give us a better look at what's on the site. I wrote the following bash script to archive them all, and I'd like someone who knows what they're doing to check it for errors before I let it run overnight.
[04:12] for i in {1..21580}; do mkdir ./$i && cd ./$i && wget http://www.tsumino.com/Book/Info/$i/ && cd ../ && sleep 100; done #ending number larger than largest known comic number to avoid undershooting. may produce 404s. sleep is to avoid IP bans and other negative effects.
[04:14] *** jmad980 has joined #archiveteam
[04:14] wget-ing with no additional arguments produces a file named "index.html", hence the folder madness. there may be a more elegant way to do it.
[04:16] you may wish to do instead
[04:16] pushd && wget ; popd
[04:16] the second being ; is important, it will get back to the original directory even if wget fails
[04:16] *** jmad980 has quit IRC (Read error: Operation timed out)
[04:16] you should use something that records warc
[04:16] wget has an option for it, or you can use wpull
[04:19] good thought. I don't do much bash scripting, nor have I done any warc work so I have some googling to do. early results from running the script seem effective. can ext4 filesystems handle this many folders in one directory?
[04:27] it'll top out at a bit over 60,000
[04:28] *** jmad980 has joined #archiveteam
[04:33] should be enough for now.
[04:35] since I have wget version 1.15 installed, do I need any other software installed? the AT wiki is a little out of date and was written before some of the WARC stuff was integrated with the codebase.
[04:39] Has this been mentioned here yet: http://gmc.yoyogames.com/index.php?showtopic=694391 ?
[04:39] Apparently they host 70,000 GameMaker games
[04:39] And they close in April
[04:40] Ah, looks like we grabbed it back in 2014: https://archive.org/details/archiveteam_gamemaker
[04:41] yep, it's been mentioned recently. IDK if enough effort has been put into it, though. But check the archives.
[04:41] *** JesseW has quit IRC (Quit: Leaving.)
[04:41] here's my updated tsumino gallery archive script v2.0:
[04:42] for i in {1..23580}; do mkdir ./$i && pushd ./$i && wget http://www.tsumino.com/Book/Info/$i/ --warc-file=warc && popd && sleep 50; done #ending number larger than largest known comic number to avoid undershooting. may produce 404s. sleep is to avoid IP bans and other negative effects.
[04:45] *** jmad980 has quit IRC (Read error: Operation timed out)
[04:47] now I'm also getting the warc information. I'm going to let that run overnight if no one sees any issues with it
[04:48] *** jmad980 has joined #archiveteam
[04:52] *** jmad980 has quit IRC (Read error: Operation timed out)
[04:53] *** jmad980 has joined #archiveteam
[04:55] Microguru: you want ;popd not &&popd
[04:56] *** jmad980 has quit IRC (Read error: Operation timed out)
[04:57] just fixed it. I'll be back around 8-9 PM UTC tomorrow reporting on the results. once it's finished, where should I upload a zip of it to for further work?
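For reference, the directory-restoring fix discussed above can also be had without pushd/popd by running wget in a subshell: the cd is confined to the subshell, so the loop stays in the top-level directory whether or not the fetch succeeds. This is a swapped-in alternative rather than the script that was actually run, and FETCH, fetch_gallery and fetch_range are hypothetical names chosen for the sketch. `--warc-file` is the wget option mentioned in the channel.

```shell
#!/bin/sh
# Sketch only. FETCH, fetch_gallery and fetch_range are hypothetical names;
# FETCH defaults to wget and can be set to "echo" for a dry run.
FETCH=${FETCH:-wget}

# fetch_gallery ID: fetch one gallery page into ./ID/ and record a WARC there.
# The subshell keeps the cd local, so the caller's directory never changes,
# even when the fetch fails.
fetch_gallery() {
    mkdir -p "./$1" || return 1
    ( cd "./$1" && $FETCH "http://www.tsumino.com/Book/Info/$1/" --warc-file=warc )
}

# fetch_range FIRST LAST [DELAY]: walk the IDs, pausing DELAY seconds
# (default 50, as in the v2.0 script) after every request, failed or not.
fetch_range() {
    i=$1
    while [ "$i" -le "$2" ]; do
        fetch_gallery "$i" || echo "fetch failed for $i" >&2
        sleep "${3:-50}"
        i=$((i + 1))
    done
}
```

`FETCH=echo fetch_range 1 3 0` prints the three wget invocations without touching the network; `fetch_range 1 23580` would be the full run. Splitting the work between volunteers is then a matter of agreeing on FIRST and LAST.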
[05:00] you'll want to make a .tar instead of a .zip, run it through megawarc, and upload to archive.org
[05:00] set the media type on the item to 'web' but that can be changed any time after you upload it
[05:09] I'll set the media type later, since I'm already 13 deep and have other things I need to do today. since I have a 50 sec wait for anti-ip ban reasons, it will take a while to finish. if anyone else wants to meet me halfway, tell us in the channel where your start and stop points are in the for loop
[05:09] sounds good
[05:10] given how the comic downloads have the title of the comic as their filename, matching download zip and gallery page shouldn't be too hard. someone could write a script that does it automatically once we get them all.
[05:12] using comic #1 as an example, "Shigoite Ageru / シゴいてアゲル" should be easy to match with "[TSUMINO.COM] Shigoite Ageru シゴいてアゲル.zip"
[05:32] *** JesseW has joined #archiveteam
[05:48] I noticed that the debian BT tracker has removed some of the older versions -- before I delete my copies of them, is it worth tossing them up on IA? I don't *think* they are already there, but debian is pretty well archived, so IDK...
[05:52] It looks like they are available from http://cdimage.debian.org/cdimage/archive/
[05:56] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[06:02] *** Sk1d has joined #archiveteam
[06:04] *** JesseW has quit IRC (Quit: Leaving.)
[06:13] *** WinterFox has joined #archiveteam
[06:17] HCross i could pull the files from you?
[06:18] my rsync daemon is being uncooperative, but a pull still works
[06:21] *** Honno has joined #archiveteam
[06:23] so it's saved...
[06:23] restore other users after reinstalling linux
[06:23] sudo adduser --home /home/$username $username
[06:23] sudo chown -R $username:$username /home/$username
[06:23] Repeat for each user.
[06:32] *** bwn has quit IRC (Ping timeout: 492 seconds)
[06:51] *** metalcamp has joined #archiveteam
[07:05] *** jmad980 has joined #archiveteam
[07:12] *** jmad980 has quit IRC (Read error: Operation timed out)
[07:31] *** bwn has joined #archiveteam
[07:33] *** jmad980 has joined #archiveteam
[07:48] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[07:52] *** Honno has quit IRC (Ping timeout: 492 seconds)
[08:04] *** bsmith093 has quit IRC (Ping timeout: 370 seconds)
[08:19] *** bsmith093 has joined #archiveteam
[08:40] *** DFJustin has quit IRC (Read error: Connection reset by peer)
[08:40] *** DFJustin has joined #archiveteam
[08:40] *** swebb sets mode: +o DFJustin
[08:53] *** superkuh has joined #archiveteam
[09:06] *** lytv has quit IRC (Ping timeout: 244 seconds)
[09:07] *** JetBalsa has quit IRC (Read error: Operation timed out)
[09:07] *** lytv has joined #archiveteam
[09:07] *** JetBalsa has joined #archiveteam
[09:09] *** atomotic has joined #archiveteam
[09:11] *** schbirid has joined #archiveteam
[09:32] *** RichardG has joined #archiveteam
[09:56] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:18] *** brayden has quit IRC (Quit: Leaving)
[10:19] *** brayden has joined #archiveteam
[10:19] *** swebb sets mode: +o brayden
[10:26] *** WinterFox has quit IRC (Remote host closed the connection)
[11:37] *** dan- has quit IRC (Quit: Nyan nyan)
[11:51] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[12:30] *** HCross2 has quit IRC ()
[12:34] *** dan- has joined #archiveteam
[12:46] *** atomotic has joined #archiveteam
[12:53] *** metalcamp has joined #archiveteam
[13:15] *** HCross2 has joined #archiveteam
[14:02] *** pgoetz has quit IRC (Remote host closed the connection)
[14:19] *** Start has quit IRC (Quit: Disconnected.)
[14:25] *** scyther has joined #archiveteam
[15:34] *** Start has joined #archiveteam
[15:47] *** scyther has quit IRC (Quit: Leaving)
[15:55] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[15:57] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[15:59] *** RichardG has joined #archiveteam
[16:07] *** Start has quit IRC (Quit: Disconnected.)
[16:13] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[16:21] *** pfallenop has quit IRC (Ping timeout: 260 seconds)
[16:28] *** JesseW has joined #archiveteam
[16:40] *** metalcamp has joined #archiveteam
[16:49] *** JesseW has quit IRC (Quit: Leaving.)
[16:54] *** atomotic has joined #archiveteam
[16:58] *** Honno has joined #archiveteam
[17:17] *** pfallenop has joined #archiveteam
[17:34] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[17:38] *** JW_work has quit IRC (Quit: Leaving.)
[17:39] *** JW_work has joined #archiveteam
[17:43] *** Start has joined #archiveteam
[17:43] *** pfallenop has quit IRC (Remote host closed the connection)
[17:44] *** pfallenop has joined #archiveteam
[17:45] *** metalcamp has quit IRC (Ping timeout: 250 seconds)
[17:48] *** JW_work has quit IRC (Quit: Leaving.)
[17:50] *** JW_work has joined #archiveteam
[17:52] *** signius_ has quit IRC (Read error: Operation timed out)
[17:55] *** JW_work has quit IRC (Client Quit)
[17:56] *** JW_work has joined #archiveteam
[18:02] *** Start has quit IRC (Quit: Disconnected.)
[18:06] *** signius_ has joined #archiveteam
[18:08] *** JW_work has quit IRC (Quit: Leaving.)
[18:10] *** JW_work has joined #archiveteam
[18:10] *** metalcamp has joined #archiveteam
[18:18] *** Start has joined #archiveteam
[18:46] Hey chfoo, got them downloaded, I organized them so there would be root/id-name-of-game/file, ie /229177-ahri-story/AhriStory.exe, that ok?
[18:46] Just uploading them to my google drive, will take a bit because of my horrible upload speeds heh
[18:48] Ugh btw, reCaptcha makes captchas progressively harder if it detects you're solving a lot of captchas, was doing this at school so couldn't use proxies or whatever, what a chore reading the ambiguous black and white text
[19:11] Honno: yes, that's fine
[19:11] didn't know that recaptcha would do that
[19:11] chfoo, https://drive.google.com/file/d/0BwGNKTXvt_HfaWt5V3NxeENxUnM/view?usp=sharing
[19:12] Yeah well that's what I thought anyway, because half the games were simple 3-5 character long numbers
[19:12] Then it sometimes went street names
[19:12] And mostly was the black and white rubbish
[19:14] Yep, that's reCaptcha for ya.
[19:14] i've had it give me math
[19:14] so i typed LaTeX markup at it
[19:14] woop woop woop off-topic siren
[19:16] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[19:32] *** K4k__ has quit IRC (Quit: WeeChat 1.4)
[19:32] *** K4k has joined #archiveteam
[19:39] *** ebins3 has joined #archiveteam
[19:39] checking back from yesterday, at the current rate, I'm archiving around 1000 gallery pages / 12 hours, which means I'll need about 2 weeks to do all ~20,000 pages. once that's on IA we can plan our attack.
[19:40] *** bwn has quit IRC (Ping timeout: 246 seconds)
[19:43] *** tomwsmf-a has joined #archiveteam
[19:44] *** Start has quit IRC (Quit: Disconnected.)
[19:59] isn't that the whole project?
[20:14] *** bwn has joined #archiveteam
[20:23] *** Tomcat_ has joined #archiveteam
[20:38] *** philpem has joined #archiveteam
[20:38] Hey, how long was the Geocities torrent actually being used for?
[20:38] Now it's just dead
[20:39] it was up for years
[20:39] Ah, that's good to know
[20:40] I'm interested because I was intending to make a torrent which would be similar in size for a massive collection of games, prob won't reach the same level of notoriety
[20:40] But I think I can get some folk pretty keen to download and seed
[20:40] cool, go for it
[20:41] What would be a good way to get the ball rolling, like I should download some server to seed everything at the start right?
[20:41] Any cost effective ways for a torrent operation? Services I should know of?
[20:41] take this to #archiveteam-bs
[20:41] Ah yeah sorry
[20:45] *** ebins3 has quit IRC (Ping timeout: 268 seconds)
[20:46] *** Tomcat_ has quit IRC (Remote host closed the connection)
[20:58] *** elsif_ has joined #archiveteam
[20:59] (previously newbie)
[20:59] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[20:59] yahoosucks
[20:59] lol
[21:00] thank you!
[21:00] sure
[21:10] *** RichardG has quit IRC (Ping timeout: 260 seconds)
[21:14] *** elsif_ has quit IRC (Quit: Page closed)
[21:16] *** schbirid has quit IRC (Quit: Leaving)
[21:26] *** RichardG has joined #archiveteam
[21:36] *** xXx_ndidd has joined #archiveteam
[21:43] *** ndiddy has quit IRC (Read error: Operation timed out)
[21:48] *** xXx_ndidd is now known as ndiddy
[21:53] nope. that's just where I figured it would be best to start. the actual comics themselves are hidden behind CAPTCHAs, making them not wget-able. what the gallery pages will give us is 1. exactly how many comics there are 2. the ability to try to determine which are hosted elsewhere and thus low priority and 3. the metadata associated with them, like author, tags, etc.
[21:53] o
[21:55] http://www.tsumino.com/Book/Info/18652/1/back-alley-rendezvous-fakku this one, for example, is mirrored at fakku as the title indicates, so no need to save this one
[21:56] do we have experience dealing with CAPTCHAs? it's either we get around reCAPTCHAs or we do it by hand
[21:57] yeah, you can't load a single picture without solving a reCAPTCHA, unless I'm missing something
[22:00] the operator of the site responded. http://www.tsumino.com/Forum/viewtopic.php?f=6&t=2615&start=10#p11013 "Let's not worry and assume that something bad will happen." I know better than to take the chance, better to get started now while the server's still up.
[22:04] "Full-sized images are already available for every recent upload (roughly since the start of the month) if you download the images. They're only resized for Online Viewing, you get them full-sized when downloading. It's not possible to magically make the old entries that were ripped from Pururin full-sized when they never were to begin with." just found this in the forum. good to note.
[22:09] *** RichardG has quit IRC (Read error: Connection reset by peer)
[22:12] *** RichardG has joined #archiveteam
[22:25] *** RichardG has quit IRC (Ping timeout: 244 seconds)
[22:31] *** Honno has quit IRC (Ping timeout: 492 seconds)
[22:33] *** RichardG has joined #archiveteam
[22:59] *** maseck_ has quit IRC (Ping timeout: 244 seconds)
[23:01] *** maseck has joined #archiveteam
[23:20] Hello hello.
[23:25] Hi. we're just talking about hoarding for a good cause aka archiving, as always
[23:38] *** Start has joined #archiveteam
[23:56] SketchCow: we now support 51 Russian news sites in newsgrabber
[23:56] We should have pretty much all of Russia covered
[23:58] *** RedType_ has left