[00:04] *** Stilett0 is now known as Stiletto [00:29] i'm the untalented schlub from reddit, doing the grab of fanfiction.net, can someone help me tweak this script to be recursive. it's the thing that scans a directory for stories and gloms them into a csv file for the metadata db. http://paste.ubuntu.com/25028481/ [00:30] Somebody2 made that for me a while ago, casue their awesome, and i can;t code my way out of a paper bag. [00:44] *** bmcginty has quit IRC (Ping timeout: 250 seconds) [00:46] *** icedice2 has joined #archiveteam-bs [00:46] *** bmcginty has joined #archiveteam-bs [00:47] *** MRX3 has joined #archiveteam-bs [00:49] *** icedice has quit IRC (Ping timeout: 245 seconds) [00:50] *** icedice2 has quit IRC (Ping timeout: 260 seconds) [00:57] heh [00:58] when I was 16 I wrote a fanfiction index scraper so I could write my own search functions... [01:19] eccfill: seriously, it's really useful, but it only grabs one folder at a time and doesn't walk down the tree. any suggestions? [01:22] *** BlueMaxim has joined #archiveteam-bs [01:34] actually, nvm, i'm only grabbing 100k stories at a time, ( to save on disk space) so i just flattened the tree and ran the script on that. [01:34] eccfill: btw here, you like fanfic https://archive.org/details/fanfictiondotnet_repack [01:34] there's ALL OF IT [01:40] * jrwr adds it to his "Private" Collection [01:44] i'm also saving all of ao3 and fictionpress. FP is REALLY small. [01:45] there's only 3 million links to check (so far) and currently, about 60% are dead. the vast majority of the first million are gone. the oldest story is fictionpress.com/s/290 [01:46] o.o where did they go [01:46] first *million*? 
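The request above, making the CSV-building script descend into subfolders, can be sketched with `os.walk`. This is a minimal illustration, not the script from the paste link: the `.txt` extension, the two CSV columns, and the function names `collect_story_rows` and `write_csv` are all assumptions, since the original script's metadata fields are not shown in the log.

```python
import csv
import os


def collect_story_rows(root, extension=".txt"):
    """Walk root recursively; return (relative_path, size_bytes) per story file.

    The extension and the two columns are hypothetical placeholders for
    whatever metadata the real script extracts.
    """
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so subfolders are visited in a stable order
        for name in sorted(filenames):
            if name.endswith(extension):
                full_path = os.path.join(dirpath, name)
                rows.append((os.path.relpath(full_path, root),
                             os.path.getsize(full_path)))
    return rows


def write_csv(rows, out_path):
    """Write a header line followed by one CSV line per collected row."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["relative_path", "size_bytes"])
        writer.writerows(rows)
```

Because `os.walk` visits every subfolder on its own, the tree would not need to be flattened first; `write_csv(collect_story_rows("stories"), "stories.csv")` would cover a hypothetical `stories` folder at any depth.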
[01:46] fp and ffnet both count up sequentially for stories [01:47] the oldest fanfic on ffnet is fanfiction.net/s/4 [01:49] Frogging: there's a shocking amount of fanfic [01:49] of, uh, varying quality [01:49] lol [01:49] lol [01:49] indeed [01:50] * jrwr makes a random fanfic button as a single propose site [01:50] purpose* [01:50] but yeah, not a bad idea [01:50] lol [01:50] 12GB of Harry Potter Fanfic [01:50] W T F [01:50] jrwr: though you may want to make it have two buttons [01:50] 1) SFW, 2) possibly NSFW [01:52] >implying any fanfic is SFW [01:52] you poor soul [01:53] lol [01:53] "Oh, well -- I was at Hogwarts meself but I -- er -- got expelled, ter tell yeh the truth. In me third year. They snapped me wang in half an' everything [01:53] http://www.bash.org/?111338 [01:55] bsmith093: what do you mean by walking down the tree? [01:56] BlueMaxim: SFW fanfic exists :P [01:56] jrwr: I have an 8GB literotica archive that I call "tissuebox" [01:56] not making any claims about the ratio though... [01:57] bsmith093: how do you handle formatting if your stories are text files? [01:57] it doesn't go into subdirectories. anyway i solved that by just moving all the files into one big folder and running the csv maker on that. [01:58] markdown. *italic* _bold_ [01:59] guess that probably mostly works [01:59] ffnet is very limited in the formatting that it takes, anyway. [02:00] yeah, but there could be edge cases if there's _/* in the text that might need escaping. not too important [02:01] my current problem, because i'm an idiot, i flattened the folder structure *in place* so now i have thousands of empty folders to purge before i zip this. [02:01] ouch [02:01] and a directory with millions of files? [02:01] only ~70k [02:01] i have multiple packs of fanfic [02:04] got it [02:05] did you start this before they purged a lot of nsfw stuff? [02:06] maybe? i wasn't aware of the mass purge until well into it. 
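For the leftover empty folders mentioned above, one possible cleanup is a bottom-up walk that removes every directory that has become empty. This is a sketch under assumptions: the function name `remove_empty_dirs` is made up for illustration, and it deliberately leaves the top-level folder in place.

```python
import os


def remove_empty_dirs(root):
    """Delete every empty directory below root, deepest first.

    Returns the list of removed paths. The root itself is never removed.
    """
    removed = []
    # topdown=False yields children before parents, so a parent that only
    # contained empty folders is itself empty by the time it is checked.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        if not os.listdir(dirpath):  # re-check on disk, not the cached listing
            os.rmdir(dirpath)        # os.rmdir only succeeds on empty directories
            removed.append(dirpath)
    return removed
```

`os.rmdir` refuses to delete a non-empty directory, so a folder that still holds a story file is left untouched even if the emptiness check were wrong.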
[02:07] *** marvinw is now known as ivan [02:07] $240 He8 https://www.bhphotovideo.com/c/product/1303685-REG/hgst_0s04012_8tb_3_5_sata_internal.html [02:08] the thing I like most about personal archives of these sorts of sites is that you can make *much* better search interfaces [02:08] sqlite's FTS5 engine is pretty good [02:18] *** dashcloud has joined #archiveteam-bs [02:19] *** dashcloud has quit IRC (Remote host closed the connection) [02:24] *** dashcloud has joined #archiveteam-bs [02:25] ivan: jesus that's cheap! [02:39] eh, it's not *that* cheap.. [02:41] cheaper than usual though for sure [03:29] *** BlueMaxim has quit IRC (Read error: Operation timed out) [03:32] *** MRX3 has quit IRC (Quit: Leaving) [03:34] *** BlueMaxim has joined #archiveteam-bs [03:42] i'm starting to upload Strategy Magazine [03:42] https://archive.org/details/Strategy_Magazine-2006-04 [03:49] *** BlueMaxim has quit IRC (Read error: Operation timed out) [03:51] *** BlueMaxim has joined #archiveteam-bs [04:12] *** Sk1d has quit IRC (Ping timeout: 250 seconds) [04:19] *** Sk1d has joined #archiveteam-bs [04:34] *** BubuAnabe has quit IRC (Ping timeout: 270 seconds) [04:47] !a http://sbgi.net/ --useragent firefox --phantomjs [04:47] oops [05:43] *** jrwr has quit IRC (Read error: Operation timed out) [05:45] *** jrwr has joined #archiveteam-bs [05:45] *** luckcolor has quit IRC (Quit: No Ping reply in 180 seconds.) 
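On the earlier remark about SQLite's FTS5 engine for searching a personal archive, a minimal sketch of a full-text index might look like the following. It assumes the Python `sqlite3` module is linked against an SQLite build with FTS5 enabled (common, but not guaranteed); the table name `stories`, its two columns, and the helper names are hypothetical.

```python
import sqlite3


def build_index(stories):
    """Create an in-memory FTS5 table and load (title, body) pairs into it."""
    conn = sqlite3.connect(":memory:")
    # Raises sqlite3.OperationalError if this SQLite build lacks FTS5.
    conn.execute("CREATE VIRTUAL TABLE stories USING fts5(title, body)")
    conn.executemany("INSERT INTO stories (title, body) VALUES (?, ?)", stories)
    conn.commit()
    return conn


def search(conn, query):
    """Return titles of stories matching an FTS5 query, best match first."""
    cursor = conn.execute(
        "SELECT title FROM stories WHERE stories MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in cursor]
```

Matching against the table name searches every indexed column, and FTS5's default tokenizer is case-insensitive for ASCII, so a query for `owl` would also find `Owl`.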
[05:47] *** luckcolor has joined #archiveteam-bs [05:57] *** fie has quit IRC (Leaving) [06:03] *** mhazinsk has quit IRC (Read error: Operation timed out) [06:04] *** mhazinsk has joined #archiveteam-bs [06:06] *** ZexaronS has joined #archiveteam-bs [06:25] *** Jonison has joined #archiveteam-bs [07:47] *** Mayonaise has quit IRC (Read error: Operation timed out) [07:49] *** Mayonaise has joined #archiveteam-bs [07:50] *** SHODAN_UI has joined #archiveteam-bs [08:40] *** brayden_ has joined #archiveteam-bs [08:40] *** swebb sets mode: +o brayden_ [08:40] *** brayden has quit IRC (Read error: Connection reset by peer) [08:40] *** brayden_ is now known as brayden [08:44] *** brayden_ has joined #archiveteam-bs [08:44] *** swebb sets mode: +o brayden_ [08:44] *** brayden has quit IRC (Read error: Connection reset by peer) [08:44] *** brayden_ is now known as brayden [08:59] *** SHODAN_UI has quit IRC (Remote host closed the connection) [09:34] *** tfgbd_znc has quit IRC (Ping timeout: 600 seconds) [09:48] Crap, Tilt shut down their API. It's now redirecting to the homepage. :-( [09:51] *** brayden_ has joined #archiveteam-bs [09:51] *** swebb sets mode: +o brayden_ [09:51] Same for all campaign and user pages. Well, I guess the party's over. [09:54] Looks like it started happening around 01:36 UTC. 
[10:02] *** j08nY has joined #archiveteam-bs [10:02] *** brayden has quit IRC (Read error: Operation timed out) [10:02] *** brayden_ is now known as brayden [10:16] *** BlueMaxim has quit IRC (Read error: Operation timed out) [10:56] *** Jonison has quit IRC (Read error: Connection reset by peer) [10:59] *** Jonison has joined #archiveteam-bs [11:19] *** kristian_ has joined #archiveteam-bs [11:58] *** kristian_ has quit IRC (Quit: Leaving) [12:13] *** SHODAN_UI has joined #archiveteam-bs [14:10] *** Jonison has quit IRC (Read error: Connection reset by peer) [14:49] seesaw @ "subprocess.py", line 1457, in _execute_child [14:49] FileNotFoundError: [Errno 2] No such file or directory: 'youtube-dl' [14:50] that should probably be looking from /usr/local/bin/ instead of the current working directory [14:51] or ~/.local/bin/ for pip --user installs [14:55] Shouldn't it just search through the path? [15:10] *** SHODAN_UI has quit IRC (Remote host closed the connection) [15:32] yes, the wpull argument for that is --youtube-dl-exe PATH [15:34] Yeah, you can use that to specify the path directly. But the value is passed to subprocess.Popen and defaults to youtube-dl, so without using that option, it would search the PATH environment variable. [15:35] Or should, as far as I know. [15:36] Haven't tested it. [15:47] *** godane has quit IRC (Read error: Operation timed out) [15:59] https://archive.org/search.php?query=pre-production%20cart%20EMI [16:30] *** godane has joined #archiveteam-bs [16:42] *** Stiletto has quit IRC (Read error: Operation timed out) [16:43] *** Stilett0 has joined #archiveteam-bs [16:47] *** bitBaron has joined #archiveteam-bs [16:59] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [17:02] *** Famicoman has joined #archiveteam-bs [17:05] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 
ZZZzzz…) [17:06] *** bitBaron has joined #archiveteam-bs [17:08] *** bitBaron has quit IRC (Read error: Connection reset by peer) [17:32] *** SHODAN_UI has joined #archiveteam-bs [17:46] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [17:53] *** Famicoman has joined #archiveteam-bs [17:54] *** Aranje has joined #archiveteam-bs [18:01] *** bitBaron has joined #archiveteam-bs [18:31] bsmith093: Thank you for saving A03! Although they seem to be doing okay, solid open source codebase and non-profit funding. [18:32] Signed, just another #stucky fangirl :-) [18:33] Asparagir: np, thanks! I just figured someone should save them, there are millions of stories on those sites. [18:33] Yup yup yup. [18:35] Someday, if they ever open up their Solr search as an API (it's been on their development roadmap for ages), I have a dream to mine the database and create a "fic recommendation" system. Like, if you like fic A from fandom B, you might also like fic C from fandom D. [18:35] They have made really good use of tags, and there's also the tracking of who left kudos for what, so you could build a real recommendation system. [18:36] People like you who like THIS would also like THAT. [18:37] Asparagir: goodreads does a surprisingly good job of that, mostly by NOT having the same 3 genres as everyone [18:37] They're not open though AFAIK [18:37] The more granularity, the better. [18:38] Like, maybe I want one saved search cluster profile for "slow burn, long build up, interpersonal angst" stories, but another saved search profile for "cheesy silly plot-what-plot" stories [18:40] And you could monetize an app/site like this, like $5/month, with some of the funds earmarked as going as donations to AO3. So everybody wins: people find the stories they didn't even know they wanted, system gets more use, system also gets donations. [18:40] I already know Solr but I would have to learn Hadoop, probably. [18:41] How do I tag something 'archive team' in the web interface? 
[18:41] Is it actually a tag and not a collection? [18:41] there's a user on fimfiction who archives all of those stories as well, turns out there's only about 200k [18:41] Asparagir: google chinchillax fimfic archive [18:42] Asparagir: every fimfic story ever posted combined, can still fit on one dvd. [18:43] Ah, I see -- they release torrents of all their stories: https://www.fimfiction.net/user/116950/Fimfarchive [18:45] kisspunch: Do you mean on the actual web interface for the Internet Archive? [18:47] Yes [18:47] The uploader interface [18:48] I'm awake, I swear. Yes, on archive.org [18:56] *** xarph has joined #archiveteam-bs [18:59] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) [19:16] *** Stilett0 has quit IRC (Read error: Operation timed out) [19:16] *** Stilett0 has joined #archiveteam-bs [19:18] *** pie_ has joined #archiveteam-bs [19:18] anyone archiving pixiv? [19:19] I think it was just the chats that were going away, but no [19:19] fwiw folks you've been incredibly welcoming in archiveteam-bs, but i was really put off by visiting #archiveteam in the past, to the point of not wanting to talk to archive team again [19:19] would have been better just to redirect me here i think [19:22] kisspunch: If I upload WARC's to the INternet Archive based on my own crawls, I then e-mail Jason Scott (a.k.a. SketchCow here on IRC) at jscott@archive.org and ask him to "move such-and-such (paste URL too) to ArchiveTeam bucket". He usually does it within a few days. I haven't asked lately, though. I just usually use ArchiveBot to get stuff up, which of course automatically puts stuff into the ArchiveTeam bucket. [19:23] Asparagir: Yeah, I usually am writing my own software to e.g. download all of an entire site and put it in some nice format. It's more along the lines of the public datasets and less generic WARC crawls [19:24] Also, some parts of ArchiveTeam are less-than-friendly to newcomers. 
It's almost like a defense mechanism, because a lot of site owners don't like what we're doing, saving their sites. [19:24] And that spills over, unfortunately, sometimes, even to people who want to help archive. [19:25] Asparagir: i happened to tell a bunch of fanfic writers that i was saving all their stories, and you'd think i said i ate puppies. [19:25] oh gosh, yes, that [19:25] People weird. [19:25] that's how fic always is [19:26] there are WARC crawls of ffnet ,too i just thought it would be easier for the average person to grab a bunch of stories in an easliy readable format. [19:26] Not just fic. Even commercial sites hosting tons of user-generated content that close. [19:26] *shrug* I'm not sure if I'm complaining exactly, just sharing my story. If there was a second group of people interested in archiving I wouldn't have ended up here. And again, archiveteam-bs is notably friendlier [19:26] It's a little weird [19:26] It is. [19:26] -bs is where off-topic stuff goes for archiveteam [19:26] Or another angle on this -- I wish ArchiveBot had a direct Twitter interface, not just an IRC interface. We would get so many more people willing to help submit URL's to save. And a greater variety of sites to save, corners of the Internet we don't usually crawl. [19:26] they really try to keep the main channel clear of chatter. [19:27] same people, but we try to have as little talking in the main channel as we can [19:27] Asparagir: how do I get permission to give AchriveBot urls on IRC? [19:27] Hang out in the channel and ask someone to op you. [19:27] kisspunch: you join #archivebot and ask for voice [19:27] Asparagir: if you know sql, i also provide a metadata db of the stories for easy searching. [19:27] yeah, that [19:27] Oooooh. [19:28] https://archive.org/details/fanfictiondotnet_repack [19:28] thats almost 350gb of stories, id 1-10 million [19:28] Nice! thanks. [19:28] i'm grabbing the rest in smaller chunks. 
[19:29] https://archive.org/details/Fanfictiondotnet1011dump 10-11 million [19:31] I recommend https://github.com/mispy/twitter_ebooks for archiving twitter btw [19:31] https://archive.org/details/Fanfictiondotnet111million the next 100k [19:32] It's very clean and does incremental update, which is important given the 3200 tweet limit [19:38] *** kristian_ has joined #archiveteam-bs [20:00] *** pie_ has quit IRC (Read error: Connection reset by peer) [20:00] *** pie_ has joined #archiveteam-bs [20:10] *** fie has joined #archiveteam-bs [20:19] *** kristian_ has quit IRC (Quit: Leaving) [20:32] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [20:34] *** Famicoman has joined #archiveteam-bs [20:34] Asparagir: Thanks for the information regarding how to upload files to IA. I'll have to do that soon. [21:12] JAA: No problem. I've been meaning to do some new ones soon... [21:12] *** kristian_ has joined #archiveteam-bs [21:12] Possibly helpful: https://gist.github.com/Asparagirl/6202872 and https://gist.github.com/Asparagirl/6206247 [21:13] I wrote those a while ago, but I think they're still accurate [21:13] *** SHODAN_UI has quit IRC (Remote host closed the connection) [21:22] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [21:31] *** Famicoman has joined #archiveteam-bs [21:37] Thanks, that second one looks useful to me. [21:38] Just one comment though: no need to `export` variables (= make them envinroment variables) if they aren't used by the subprocess, here curl, anyway. [21:42] *** sep332 has joined #archiveteam-bs [21:50] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [21:53] *** fie has quit IRC (Ping timeout: 246 seconds) [21:58] *** Famicoman has joined #archiveteam-bs [22:06] *** fie has joined #archiveteam-bs [22:11] JAA: there's also a python pip package, if nobody's suggested it yet. 
internetarchive, command is ia [22:12] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [22:14] Yeah, I've seen that and actually suggested it to other people before without ever having used it. :-D [22:19] *** Famicoman has joined #archiveteam-bs [22:22] *** zino has joined #archiveteam-bs [22:31] *** Stilett0 has quit IRC (Ping timeout: 260 seconds) [22:59] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [23:01] *** Famicoman has joined #archiveteam-bs [23:03] *** Stilett0 has joined #archiveteam-bs [23:04] *** kristian_ has quit IRC (Quit: Leaving) [23:06] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [23:12] *** Famicoman has joined #archiveteam-bs [23:39] *** BubuAnabe has joined #archiveteam-bs [23:40] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [23:44] *** odemg has quit IRC (Quit: Leaving) [23:49] *** Famicoman has joined #archiveteam-bs