#archiveteam-bs 2017-07-06,Thu


Time Nickname Message
00:04 🔗 Stilett0 is now known as Stiletto
00:29 🔗 bsmith093 i'm the untalented schlub from reddit, doing the grab of fanfiction.net, can someone help me tweak this script to be recursive. it's the thing that scans a directory for stories and gloms them into a csv file for the metadata db. http://paste.ubuntu.com/25028481/
00:30 🔗 bsmith093 Somebody2 made that for me a while ago, cause they're awesome, and i can't code my way out of a paper bag.
00:44 🔗 bmcginty has quit IRC (Ping timeout: 250 seconds)
00:46 🔗 icedice2 has joined #archiveteam-bs
00:46 🔗 bmcginty has joined #archiveteam-bs
00:47 🔗 MRX3 has joined #archiveteam-bs
00:49 🔗 icedice has quit IRC (Ping timeout: 245 seconds)
00:50 🔗 icedice2 has quit IRC (Ping timeout: 260 seconds)
00:57 🔗 eccfill heh
00:58 🔗 eccfill when I was 16 I wrote a fanfiction index scraper so I could write my own search functions...
01:19 🔗 bsmith093 eccfill: seriously, it's really useful, but it only grabs one folder at a time and doesn't walk down the tree. any suggestions?
01:22 🔗 BlueMaxim has joined #archiveteam-bs
01:34 🔗 bsmith093 actually, nvm, i'm only grabbing 100k stories at a time (to save on disk space), so i just flattened the tree and ran the script on that.
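For anyone who would rather not flatten the tree, a recursive scan can be sketched with `os.walk`. This is only a minimal sketch: the pasted script is not reproduced in the log, so the `.txt` extension, the function names, and the two CSV columns here are assumptions for illustration, not the real metadata fields.

```python
import csv
import os


def collect_story_rows(root):
    """Walk root recursively; return (relative_path, size_bytes) per .txt file."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so subdirectories are visited in a stable order
        for name in sorted(filenames):
            if not name.endswith(".txt"):  # assumed story extension
                continue
            full = os.path.join(dirpath, name)
            rows.append((os.path.relpath(full, root), os.path.getsize(full)))
    return rows


def write_csv(rows, out_path):
    """Write the collected rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "size_bytes"])  # hypothetical columns
        writer.writerows(rows)
```

Calling `write_csv(collect_story_rows("stories"), "stories.csv")` would cover every subdirectory of a (hypothetical) `stories` folder without moving any files.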
01:34 🔗 bsmith093 eccfill: btw here, you like fanfic https://archive.org/details/fanfictiondotnet_repack
01:34 🔗 bsmith093 there's ALL OF IT
01:40 🔗 * jrwr adds it to his "Private" Collection
01:44 🔗 bsmith093 i'm also saving all of ao3 and fictionpress. FP is REALLY small.
01:45 🔗 bsmith093 there's only 3 million links to check (so far) and currently, about 60% are dead. the vast majority of the first million are gone. the oldest story is fictionpress.com/s/290
01:46 🔗 Frogging o.o where did they go
01:46 🔗 Frogging first *million*?
01:46 🔗 bsmith093 fp and ffnet both count up sequentially for stories
01:47 🔗 bsmith093 the oldest fanfic on ffnet is fanfiction.net/s/4
01:49 🔗 joepie91 Frogging: there's a shocking amount of fanfic
01:49 🔗 joepie91 of, uh, varying quality
01:49 🔗 joepie91 lol
01:49 🔗 Frogging lol
01:49 🔗 Frogging indeed
01:50 🔗 * jrwr makes a random fanfic button as a single propose site
01:50 🔗 joepie91 purpose*
01:50 🔗 joepie91 but yeah, not a bad idea
01:50 🔗 joepie91 lol
01:50 🔗 jrwr 12GB of Harry Potter Fanfic
01:50 🔗 jrwr W T F
01:50 🔗 joepie91 jrwr: though you may want to make it have two buttons
01:50 🔗 joepie91 1) SFW, 2) possibly NSFW
01:52 🔗 BlueMaxim >implying any fanfic is SFW
01:52 🔗 BlueMaxim you poor soul
01:53 🔗 jrwr lol
01:53 🔗 jrwr "Oh, well -- I was at Hogwarts meself but I -- er -- got expelled, ter tell yeh the truth. In me third year. They snapped me wang in half an' everything
01:53 🔗 jrwr http://www.bash.org/?111338
01:55 🔗 eccfill bsmith093: what do you mean by walking down the tree?
01:56 🔗 joepie91 BlueMaxim: SFW fanfic exists :P
01:56 🔗 eccfill jrwr: I have an 8GB literotica archive that I call "tissuebox"
01:56 🔗 joepie91 not making any claims about the ratio though...
01:57 🔗 eccfill bsmith093: how do you handle formatting if your stories are text files?
01:57 🔗 bsmith093 it doesn't go into subdirectories. anyway i solved that by just moving all the files into one big folder and running the csv maker on that.
01:58 🔗 bsmith093 markdown. *italic* _bold_
01:59 🔗 eccfill guess that probably mostly works
01:59 🔗 bsmith093 ffnet is very limited in the formatting that it takes, anyway.
02:00 🔗 eccfill yeah, but there could be edge cases if there's _/* in the text that might need escaping. not too important
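The edge case eccfill mentions can be handled by backslash-escaping literal `_` and `*` in the story text before any emphasis markers are added. A minimal sketch, assuming backslash escapes are acceptable to whatever reads the files later; the function name is made up for illustration.

```python
import re

# Characters that would otherwise be read as emphasis markers, plus the
# backslash itself so that existing backslashes stay unambiguous.
_MD_SPECIALS = re.compile(r"([\\*_])")


def escape_markdown_emphasis(text):
    """Prefix every literal backslash, '*' and '_' in text with a backslash."""
    return _MD_SPECIALS.sub(r"\\\1", text)
```

This has to run on the raw text first; running it after the `*italic*` and `_bold_` markers are in place would escape the markers too.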
02:01 🔗 bsmith093 my current problem, because i'm an idiot, i flattened the folder structure *in place* so now i have thousands of empty folders to purge before i zip this.
02:01 🔗 eccfill ouch
02:01 🔗 eccfill and a directory with millions of files?
02:01 🔗 bsmith093 only ~70k
02:01 🔗 bsmith093 i have multiple packs of fanfic
02:04 🔗 eccfill got it
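Purging the leftover empty folders can be done with a bottom-up walk, so that a parent which only contained empty children is itself removed. A minimal sketch; the function name is hypothetical, and it only ever calls `os.rmdir`, which refuses to delete a directory that still has anything in it.

```python
import os


def remove_empty_dirs(root):
    """Delete every empty directory below root (not root itself); return their paths."""
    removed = []
    # topdown=False yields children before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        # Re-check on disk: children may have been removed a moment ago.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```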
02:05 🔗 eccfill did you start this before they purged a lot of nsfw stuff?
02:06 🔗 bsmith093 maybe? i wasn't aware of the mass purge until well into it.
02:07 🔗 marvinw is now known as ivan
02:07 🔗 ivan $240 He8 https://www.bhphotovideo.com/c/product/1303685-REG/hgst_0s04012_8tb_3_5_sata_internal.html
02:08 🔗 eccfill the thing I like most about personal archives of these sorts of sites is that you can make *much* better search interfaces
02:08 🔗 eccfill sqlite's FTS5 engine is pretty good
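As a rough illustration of the FTS5 idea, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the underlying SQLite library was compiled with FTS5 (common, but not guaranteed), and the table name, column names, and helper functions are invented for the example.

```python
import sqlite3


def build_index(stories):
    """Create an in-memory FTS5 table from (title, body) pairs."""
    conn = sqlite3.connect(":memory:")
    # Raises sqlite3.OperationalError if this SQLite build lacks FTS5.
    conn.execute("CREATE VIRTUAL TABLE stories USING fts5(title, body)")
    conn.executemany("INSERT INTO stories (title, body) VALUES (?, ?)", stories)
    return conn


def search(conn, query):
    """Return titles of matching stories, best match first."""
    cur = conn.execute(
        "SELECT title FROM stories WHERE stories MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in cur]
```

For a real archive the connection would point at a file on disk rather than `:memory:`, with the story text loaded once and queried many times.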
02:18 🔗 dashcloud has joined #archiveteam-bs
02:19 🔗 dashcloud has quit IRC (Remote host closed the connection)
02:24 🔗 dashcloud has joined #archiveteam-bs
02:25 🔗 bsmith093 ivan: jesus that's cheap!
02:39 🔗 Frogging eh, it's not *that* cheap..
02:41 🔗 Frogging cheaper than usual though for sure
03:29 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
03:32 🔗 MRX3 has quit IRC (Quit: Leaving)
03:34 🔗 BlueMaxim has joined #archiveteam-bs
03:42 🔗 godane i'm starting to upload Strategy Magazine
03:42 🔗 godane https://archive.org/details/Strategy_Magazine-2006-04
03:49 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
03:51 🔗 BlueMaxim has joined #archiveteam-bs
04:12 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:19 🔗 Sk1d has joined #archiveteam-bs
04:34 🔗 BubuAnabe has quit IRC (Ping timeout: 270 seconds)
04:47 🔗 hook54321 !a http://sbgi.net/ --useragent firefox --phantomjs
04:47 🔗 hook54321 oops
05:43 🔗 jrwr has quit IRC (Read error: Operation timed out)
05:45 🔗 jrwr has joined #archiveteam-bs
05:45 🔗 luckcolor has quit IRC (Quit: No Ping reply in 180 seconds.)
05:47 🔗 luckcolor has joined #archiveteam-bs
05:57 🔗 fie has quit IRC (Leaving)
06:03 🔗 mhazinsk has quit IRC (Read error: Operation timed out)
06:04 🔗 mhazinsk has joined #archiveteam-bs
06:06 🔗 ZexaronS has joined #archiveteam-bs
06:25 🔗 Jonison has joined #archiveteam-bs
07:47 🔗 Mayonaise has quit IRC (Read error: Operation timed out)
07:49 🔗 Mayonaise has joined #archiveteam-bs
07:50 🔗 SHODAN_UI has joined #archiveteam-bs
08:40 🔗 brayden_ has joined #archiveteam-bs
08:40 🔗 swebb sets mode: +o brayden_
08:40 🔗 brayden has quit IRC (Read error: Connection reset by peer)
08:40 🔗 brayden_ is now known as brayden
08:44 🔗 brayden_ has joined #archiveteam-bs
08:44 🔗 swebb sets mode: +o brayden_
08:44 🔗 brayden has quit IRC (Read error: Connection reset by peer)
08:44 🔗 brayden_ is now known as brayden
08:59 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
09:34 🔗 tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
09:48 🔗 JAA Crap, Tilt shut down their API. It's now redirecting to the homepage. :-(
09:51 🔗 brayden_ has joined #archiveteam-bs
09:51 🔗 swebb sets mode: +o brayden_
09:51 🔗 JAA Same for all campaign and user pages. Well, I guess the party's over.
09:54 🔗 JAA Looks like it started happening around 01:36 UTC.
10:02 🔗 j08nY has joined #archiveteam-bs
10:02 🔗 brayden has quit IRC (Read error: Operation timed out)
10:02 🔗 brayden_ is now known as brayden
10:16 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
10:56 🔗 Jonison has quit IRC (Read error: Connection reset by peer)
10:59 🔗 Jonison has joined #archiveteam-bs
11:19 🔗 kristian_ has joined #archiveteam-bs
11:58 🔗 kristian_ has quit IRC (Quit: Leaving)
12:13 🔗 SHODAN_UI has joined #archiveteam-bs
14:10 🔗 Jonison has quit IRC (Read error: Connection reset by peer)
14:49 🔗 t2t2 seesaw @ "subprocess.py", line 1457, in _execute_child
14:49 🔗 t2t2 FileNotFoundError: [Errno 2] No such file or directory: 'youtube-dl'
14:50 🔗 t2t2 that should probably be looking from /usr/local/bin/ instead of the current working directory
14:51 🔗 t2t2 or ~/.local/bin/ for pip --user installs
14:55 🔗 JAA Shouldn't it just search through the path?
15:10 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
15:32 🔗 t2t2 yes, the wpull argument for that is --youtube-dl-exe PATH
15:34 🔗 JAA Yeah, you can use that to specify the path directly. But the value is passed to subprocess.Popen and defaults to youtube-dl, so without using that option, it would search the PATH environment variable.
15:35 🔗 JAA Or should, as far as I know.
15:36 🔗 JAA Haven't tested it.
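The lookup JAA describes can be checked from Python: a bare program name given to `subprocess.Popen` is resolved against the `PATH` environment variable, and `shutil.which` performs a similar search without running anything. A minimal sketch, assuming a POSIX system; `find_executable` and its `extra_dirs` parameter are hypothetical helpers, not part of wpull or seesaw.

```python
import os
import shutil


def find_executable(name, extra_dirs=()):
    """Return the full path of name found on PATH or in extra_dirs, else None."""
    search_path = os.environ.get("PATH", os.defpath)
    for d in extra_dirs:
        # expanduser lets callers pass directories such as "~/.local/bin"
        search_path += os.pathsep + os.path.expanduser(d)
    return shutil.which(name, path=search_path)
```

If `find_executable("youtube-dl")` returns `None` but `find_executable("youtube-dl", ["~/.local/bin"])` returns a path, the program is installed but its directory is not on the `PATH` of the process, which would be consistent with the `FileNotFoundError` above.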
15:47 🔗 godane has quit IRC (Read error: Operation timed out)
15:59 🔗 SketchCow https://archive.org/search.php?query=pre-production%20cart%20EMI
16:30 🔗 godane has joined #archiveteam-bs
16:42 🔗 Stiletto has quit IRC (Read error: Operation timed out)
16:43 🔗 Stilett0 has joined #archiveteam-bs
16:47 🔗 bitBaron has joined #archiveteam-bs
16:59 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
17:02 🔗 Famicoman has joined #archiveteam-bs
17:05 🔗 bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
17:06 🔗 bitBaron has joined #archiveteam-bs
17:08 🔗 bitBaron has quit IRC (Read error: Connection reset by peer)
17:32 🔗 SHODAN_UI has joined #archiveteam-bs
17:46 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
17:53 🔗 Famicoman has joined #archiveteam-bs
17:54 🔗 Aranje has joined #archiveteam-bs
18:01 🔗 bitBaron has joined #archiveteam-bs
18:31 🔗 Asparagir bsmith093: Thank you for saving AO3! Although they seem to be doing okay, solid open source codebase and non-profit funding.
18:32 🔗 Asparagir Signed, just another #stucky fangirl :-)
18:33 🔗 bsmith093 Asparagir: np, thanks! I just figured someone should save them, there are millions of stories on those sites.
18:33 🔗 Asparagir Yup yup yup.
18:35 🔗 Asparagir Someday, if they ever open up their Solr search as an API (it's been on their development roadmap for ages), I have a dream to mine the database and create a "fic recommendation" system. Like, if you like fic A from fandom B, you might also like fic C from fandom D.
18:35 🔗 Asparagir They have made really good use of tags, and there's also the tracking of who left kudos for what, so you could build a real recommendation system.
18:36 🔗 Asparagir People like you who like THIS would also like THAT.
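The "people who like THIS would also like THAT" idea can be sketched as a simple co-occurrence count over kudos. This is only one basic way to do it, and nothing here reflects an actual AO3 API or data format: the `(user, fic)` pair layout and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def recommend(kudos, liked_fic, top_n=3):
    """Rank other fics by how many users gave kudos to both them and liked_fic.

    kudos is an iterable of (user, fic) pairs.
    """
    fics_by_user = defaultdict(set)
    for user, fic in kudos:
        fics_by_user[user].add(fic)

    counts = Counter()
    for fics in fics_by_user.values():
        if liked_fic in fics:
            counts.update(fics - {liked_fic})

    # Highest co-occurrence first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [fic for fic, _ in ranked[:top_n]]
```

A finer-grained version could weight the counts by shared tags, which is where the granularity Asparagir describes would come in.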
18:37 🔗 kisspunch Asparagir: goodreads does a surprisingly good job of that, mostly by NOT having the same 3 genres as everyone
18:37 🔗 kisspunch They're not open though AFAIK
18:37 🔗 Asparagir The more granularity, the better.
18:38 🔗 Asparagir Like, maybe I want one saved search cluster profile for "slow burn, long build up, interpersonal angst" stories, but another saved search profile for "cheesy silly plot-what-plot" stories
18:40 🔗 Asparagir And you could monetize an app/site like this, like $5/month, with some of the funds earmarked as going as donations to AO3. So everybody wins: people find the stories they didn't even know they wanted, system gets more use, system also gets donations.
18:40 🔗 Asparagir I already know Solr but I would have to learn Hadoop, probably.
18:41 🔗 kisspunch How do I tag something 'archive team' in the web interface?
18:41 🔗 kisspunch Is it actually a tag and not a collection?
18:41 🔗 bsmith093 there's a user on fimfiction who archives all of those stories as well, turns out there's only about 200k
18:41 🔗 bsmith093 Asparagir: google chinchillax fimfic archive
18:42 🔗 bsmith093 Asparagir: every fimfic story ever posted combined, can still fit on one dvd.
18:43 🔗 Asparagir Ah, I see -- they release torrents of all their stories: https://www.fimfiction.net/user/116950/Fimfarchive
18:45 🔗 Asparagir kisspunch: Do you mean on the actual web interface for the Internet Archive?
18:47 🔗 kisspunch Yes
18:47 🔗 kisspunch The uploader interface
18:48 🔗 kisspunch I'm awake, I swear. Yes, on archive.org
18:56 🔗 xarph has joined #archiveteam-bs
18:59 🔗 bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
19:16 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
19:16 🔗 Stilett0 has joined #archiveteam-bs
19:18 🔗 pie_ has joined #archiveteam-bs
19:18 🔗 pie_ anyone archiving pixiv?
19:19 🔗 Frogging I think it was just the chats that were going away, but no
19:19 🔗 kisspunch fwiw folks you've been incredibly welcoming in archiveteam-bs, but i was really put off by visiting #archiveteam in the past, to the point of not wanting to talk to archive team again
19:19 🔗 kisspunch would have been better just to redirect me here i think
19:22 🔗 Asparagir kisspunch: If I upload WARCs to the Internet Archive based on my own crawls, I then e-mail Jason Scott (a.k.a. SketchCow here on IRC) at jscott@archive.org and ask him to "move such-and-such (paste URL too) to ArchiveTeam bucket". He usually does it within a few days. I haven't asked lately, though. I just usually use ArchiveBot to get stuff up, which of course automatically puts stuff into the ArchiveTeam bucket.
19:23 🔗 kisspunch Asparagir: Yeah, I usually am writing my own software to e.g. download all of an entire site and put it in some nice format. It's more along the lines of the public datasets and less generic WARC crawls
19:24 🔗 Asparagir Also, some parts of ArchiveTeam are less-than-friendly to newcomers. It's almost like a defense mechanism, because a lot of site owners don't like what we're doing, saving their sites.
19:24 🔗 Asparagir And that spills over, unfortunately, sometimes, even to people who want to help archive.
19:25 🔗 bsmith093 Asparagir: i happened to tell a bunch of fanfic writers that i was saving all their stories, and you'd think i said i ate puppies.
19:25 🔗 xmc oh gosh, yes, that
19:25 🔗 Asparagir People weird.
19:25 🔗 xmc that's how fic always is
19:26 🔗 bsmith093 there are WARC crawls of ffnet, too. i just thought it would be easier for the average person to grab a bunch of stories in an easily readable format.
19:26 🔗 Asparagir Not just fic. Even commercial sites hosting tons of user-generated content that close.
19:26 🔗 kisspunch *shrug* I'm not sure if I'm complaining exactly, just sharing my story. If there was a second group of people interested in archiving I wouldn't have ended up here. And again, archiveteam-bs is notably friendlier
19:26 🔗 kisspunch It's a little weird
19:26 🔗 Asparagir It is.
19:26 🔗 bsmith093 -bs is where off-topic stuff goes for archiveteam
19:26 🔗 Asparagir Or another angle on this -- I wish ArchiveBot had a direct Twitter interface, not just an IRC interface. We would get so many more people willing to help submit URLs to save. And a greater variety of sites to save, corners of the Internet we don't usually crawl.
19:26 🔗 bsmith093 they really try to keep the main channel clear of chatter.
19:27 🔗 xmc same people, but we try to have as little talking in the main channel as we can
19:27 🔗 kisspunch Asparagir: how do I get permission to give ArchiveBot urls on IRC?
19:27 🔗 Asparagir Hang out in the channel and ask someone to op you.
19:27 🔗 xmc kisspunch: you join #archivebot and ask for voice
19:27 🔗 bsmith093 Asparagir: if you know sql, i also provide a metadata db of the stories for easy searching.
19:27 🔗 xmc yeah, that
19:27 🔗 Asparagir Oooooh.
19:28 🔗 bsmith093 https://archive.org/details/fanfictiondotnet_repack
19:28 🔗 bsmith093 thats almost 350gb of stories, id 1-10 million
19:28 🔗 Asparagir Nice! thanks.
19:28 🔗 bsmith093 i'm grabbing the rest in smaller chunks.
19:29 🔗 bsmith093 https://archive.org/details/Fanfictiondotnet1011dump 10-11 million
19:31 🔗 kisspunch I recommend https://github.com/mispy/twitter_ebooks for archiving twitter btw
19:31 🔗 bsmith093 https://archive.org/details/Fanfictiondotnet111million the next 100k
19:32 🔗 kisspunch It's very clean and does incremental update, which is important given the 3200 tweet limit
19:38 🔗 kristian_ has joined #archiveteam-bs
20:00 🔗 pie_ has quit IRC (Read error: Connection reset by peer)
20:00 🔗 pie_ has joined #archiveteam-bs
20:10 🔗 fie has joined #archiveteam-bs
20:19 🔗 kristian_ has quit IRC (Quit: Leaving)
20:32 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
20:34 🔗 Famicoman has joined #archiveteam-bs
20:34 🔗 JAA Asparagir: Thanks for the information regarding how to upload files to IA. I'll have to do that soon.
21:12 🔗 Asparagir JAA: No problem. I've been meaning to do some new ones soon...
21:12 🔗 kristian_ has joined #archiveteam-bs
21:12 🔗 Asparagir Possibly helpful: https://gist.github.com/Asparagirl/6202872 and https://gist.github.com/Asparagirl/6206247
21:13 🔗 Asparagir I wrote those a while ago, but I think they're still accurate
21:13 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
21:22 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
21:31 🔗 Famicoman has joined #archiveteam-bs
21:37 🔗 JAA Thanks, that second one looks useful to me.
21:38 🔗 JAA Just one comment though: no need to `export` variables (= make them environment variables) if they aren't used by the subprocess, here curl, anyway.
21:42 🔗 sep332 has joined #archiveteam-bs
21:50 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
21:53 🔗 fie has quit IRC (Ping timeout: 246 seconds)
21:58 🔗 Famicoman has joined #archiveteam-bs
22:06 🔗 fie has joined #archiveteam-bs
22:11 🔗 bsmith093 JAA: there's also a python pip package, if nobody's suggested it yet. internetarchive, command is ia
22:12 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
22:14 🔗 JAA Yeah, I've seen that and actually suggested it to other people before without ever having used it. :-D
22:19 🔗 Famicoman has joined #archiveteam-bs
22:22 🔗 zino has joined #archiveteam-bs
22:31 🔗 Stilett0 has quit IRC (Ping timeout: 260 seconds)
22:59 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
23:01 🔗 Famicoman has joined #archiveteam-bs
23:03 🔗 Stilett0 has joined #archiveteam-bs
23:04 🔗 kristian_ has quit IRC (Quit: Leaving)
23:06 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
23:12 🔗 Famicoman has joined #archiveteam-bs
23:39 🔗 BubuAnabe has joined #archiveteam-bs
23:40 🔗 Famicoman has quit IRC (Ping timeout: 260 seconds)
23:44 🔗 odemg has quit IRC (Quit: Leaving)
23:49 🔗 Famicoman has joined #archiveteam-bs
