#archiveteam-bs 2017-07-06,Thu


Who What When
***Stilett0 is now known as Stiletto [00:04]
...... (idle for 25mn)
bsmith093i'm the untalented schlub from reddit, doing the grab of fanfiction.net, can someone help me tweak this script to be recursive. it's the thing that scans a directory for stories and gloms them into a csv file for the metadata db. http://paste.ubuntu.com/25028481/
Somebody2 made that for me a while ago, 'cause they're awesome, and i can't code my way out of a paper bag.
[00:29]
***bmcginty has quit IRC (Ping timeout: 250 seconds)
icedice2 has joined #archiveteam-bs
bmcginty has joined #archiveteam-bs
MRX3 has joined #archiveteam-bs
icedice has quit IRC (Ping timeout: 245 seconds)
icedice2 has quit IRC (Ping timeout: 260 seconds)
[00:44]
eccfillheh
when I was 16 I wrote a fanfiction index scraper so I could write my own search functions...
[00:57]
..... (idle for 21mn)
bsmith093eccfill: seriously, it's really useful, but it only grabs one folder at a time and doesn't walk down the tree. any suggestions? [01:19]
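One way to make a directory scan "walk down the tree" is `os.walk`, which visits every subdirectory. The pasted script is not reproduced here, so this is only a minimal sketch: the function name `collect_stories`, the `.txt` extension, and the two CSV columns are placeholders, not the real metadata format.

```python
import csv
import os


def collect_stories(root, csv_path, extension=".txt"):
    """Walk `root` recursively and write one CSV row per story file found.

    Hypothetical helper: the columns (relative path, size) stand in for
    whatever metadata the real script extracts from each story.
    """
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith(extension):
                full_path = os.path.join(dirpath, name)
                rows.append(
                    (os.path.relpath(full_path, root), os.path.getsize(full_path))
                )
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "size_bytes"])
        writer.writerows(rows)
    return len(rows)
```

Calling `collect_stories("stories", "index.csv")` would then index every matching file at any depth, so the tree would not need flattening first.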
***BlueMaxim has joined #archiveteam-bs [01:22]
bsmith093actually, nvm, i'm only grabbing 100k stories at a time, ( to save on disk space) so i just flattened the tree and ran the script on that.
eccfill: btw here, you like fanfic https://archive.org/details/fanfictiondotnet_repack
there's ALL OF IT
[01:34]
jrwrjrwr adds it to his "Private" Collection [01:40]
bsmith093i'm also saving all of ao3 and fictionpress. FP is REALLY small.
there's only 3 million links to check (so far) and currently, about 60% are dead. the vast majority of the first million are gone. the oldest story is fictionpress.com/s/290
[01:44]
Froggingo.o where did they go
first *million*?
[01:46]
bsmith093fp and ffnet both count up sequentially for stories
the oldest fanfic on ffnet is fanfiction.net/s/4
[01:46]
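Because story IDs count up sequentially, the list of links to check can be generated rather than discovered. A minimal sketch, assuming the `https://` scheme and the `/s/<id>` path shape seen in the two example links above; the function names and the status-code dictionary are hypothetical, and no requests are made here.

```python
def story_urls(site, first_id, last_id):
    """Yield a candidate story URL for every ID in [first_id, last_id]."""
    for story_id in range(first_id, last_id + 1):
        # Assumed URL shape, modelled on fanfiction.net/s/4
        yield f"https://{site}/s/{story_id}"


def dead_fraction(status_by_id):
    """Return the fraction of checked IDs whose HTTP status was not 200."""
    if not status_by_id:
        return 0.0
    dead = sum(1 for status in status_by_id.values() if status != 200)
    return dead / len(status_by_id)
```

For example, `list(story_urls("fanfiction.net", 4, 6))` gives the three URLs for IDs 4, 5 and 6, and `dead_fraction` turns a table of check results into the kind of "about 60% are dead" figure quoted above.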
joepie91Frogging: there's a shocking amount of fanfic
of, uh, varying quality
lol
[01:49]
Frogginglol
indeed
[01:49]
jrwrjrwr makes a random fanfic button as a single propose site [01:50]
joepie91purpose*
but yeah, not a bad idea
lol
[01:50]
jrwr12GB of Harry Potter Fanfic
W T F
[01:50]
joepie91jrwr: though you may want to make it have two buttons
1) SFW, 2) possibly NSFW
[01:50]
BlueMaxim>implying any fanfic is SFW
you poor soul
[01:52]
jrwrlol
"Oh, well -- I was at Hogwarts meself but I -- er -- got expelled, ter tell yeh the truth. In me third year. They snapped me wang in half an' everything
http://www.bash.org/?111338
[01:53]
eccfillbsmith093: what do you mean by walking down the tree? [01:55]
joepie91BlueMaxim: SFW fanfic exists :P [01:56]
eccfilljrwr: I have an 8GB literotica archive that I call "tissuebox" [01:56]
joepie91not making any claims about the ratio though... [01:56]
eccfillbsmith093: how do you handle formatting if your stories are text files? [01:57]
bsmith093it doesn't go into subdirectories. anyway i solved that by just moving all the files into one big folder and running the csv maker on that.
markdown. *italic* _bold_
[01:57]
eccfillguess that probably mostly works [01:59]
bsmith093ffnet is very limited in the formatting that it takes, anyway. [01:59]
eccfillyeah, but there could be edge cases if there's _/* in the text that might need escaping. not too important [02:00]
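The edge case eccfill describes could be handled by backslash-escaping literal `_` and `*` before the markup is added. A minimal sketch, assuming whatever reads the text files honors backslash escapes (as CommonMark does); `escape_markdown` is a hypothetical helper, not part of the actual grab scripts.

```python
import re

# Characters the *italic* / _bold_ convention would treat as markup,
# plus the backslash itself so escapes stay unambiguous.
_MARKDOWN_SPECIALS = re.compile(r"([\\*_])")


def escape_markdown(text):
    """Prefix each literal backslash, asterisk, or underscore with a backslash."""
    return _MARKDOWN_SPECIALS.sub(r"\\\1", text)
```

This would be applied to the raw story text only; the emphasis markers themselves are added afterwards so they are not escaped.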
bsmith093my current problem, because i'm an idiot, i flattened the folder structure *in place* so now i have thousands of empty folders to purge before i zip this. [02:01]
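Purging the leftover empty folders can be done bottom-up, so that a folder which only contained empty folders is removed as well. A minimal sketch; `remove_empty_dirs` is a hypothetical name, and it deliberately keeps the top-level folder.

```python
import os


def remove_empty_dirs(root):
    """Delete empty directories under `root`, deepest first; return the count removed."""
    removed = 0
    # topdown=False visits children before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue  # keep the top-level folder itself
        # Re-check on disk: child folders may have been removed a moment ago.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed += 1
    return removed
```

`os.rmdir` refuses to delete a non-empty directory, so a folder that still holds story files is left alone.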
eccfillouch
and a directory with millions of files?
[02:01]
bsmith093only ~70k
i have multiple packs of fanfic
[02:01]
eccfillgot it
did you start this before they purged a lot of nsfw stuff?
[02:04]
bsmith093maybe? i wasn't aware of the mass purge until well into it. [02:06]
***marvinw is now known as ivan [02:07]
ivan$240 He8 https://www.bhphotovideo.com/c/product/1303685-REG/hgst_0s04012_8tb_3_5_sata_internal.html [02:07]
eccfillthe thing I like most about personal archives of these sorts of sites is that you can make *much* better search interfaces
sqlite's FTS5 engine is pretty good
[02:08]
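A personal full-text index of this kind can be sketched with Python's built-in `sqlite3` module, assuming the local SQLite build has FTS5 compiled in. The table name `stories` and its two columns are placeholders, not the schema of any database mentioned here.

```python
import sqlite3


def build_index(stories):
    """Create an in-memory FTS5 table from (title, body) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE stories USING fts5(title, body)")
    conn.executemany("INSERT INTO stories (title, body) VALUES (?, ?)", stories)
    return conn


def search(conn, query, limit=10):
    """Return titles of matching stories, best match first."""
    cursor = conn.execute(
        "SELECT title FROM stories WHERE stories MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cursor]
```

The `MATCH` query syntax also accepts phrases and boolean operators, which is where the "better search interface" would come from.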
***dashcloud has joined #archiveteam-bs
dashcloud has quit IRC (Remote host closed the connection)
[02:18]
dashcloud has joined #archiveteam-bs [02:24]
bsmith093ivan: jesus that's cheap! [02:25]
Froggingeh, it's not *that* cheap..
cheaper than usual though for sure
[02:39]
.......... (idle for 48mn)
***BlueMaxim has quit IRC (Read error: Operation timed out)
MRX3 has quit IRC (Quit: Leaving)
BlueMaxim has joined #archiveteam-bs
[03:29]
godanei'm starting to upload Strategy Magazine
https://archive.org/details/Strategy_Magazine-2006-04
[03:42]
***BlueMaxim has quit IRC (Read error: Operation timed out)
BlueMaxim has joined #archiveteam-bs
[03:49]
..... (idle for 21mn)
Sk1d has quit IRC (Ping timeout: 250 seconds) [04:12]
Sk1d has joined #archiveteam-bs [04:19]
.... (idle for 15mn)
BubuAnabe has quit IRC (Ping timeout: 270 seconds) [04:34]
hook54321!a http://sbgi.net/ --useragent firefox --phantomjs
oops
[04:47]
............ (idle for 56mn)
***jrwr has quit IRC (Read error: Operation timed out)
jrwr has joined #archiveteam-bs
luckcolor has quit IRC (Quit: No Ping reply in 180 seconds.)
luckcolor has joined #archiveteam-bs
[05:43]
fie has quit IRC (Leaving) [05:57]
mhazinsk has quit IRC (Read error: Operation timed out)
mhazinsk has joined #archiveteam-bs
ZexaronS has joined #archiveteam-bs
[06:03]
.... (idle for 19mn)
Jonison has joined #archiveteam-bs [06:25]
................. (idle for 1h22mn)
Mayonaise has quit IRC (Read error: Operation timed out)
Mayonaise has joined #archiveteam-bs
SHODAN_UI has joined #archiveteam-bs
[07:47]
........... (idle for 50mn)
brayden_ has joined #archiveteam-bs
swebb sets mode: +o brayden_
brayden has quit IRC (Read error: Connection reset by peer)
brayden_ is now known as brayden
brayden_ has joined #archiveteam-bs
swebb sets mode: +o brayden_
brayden has quit IRC (Read error: Connection reset by peer)
brayden_ is now known as brayden
[08:40]
.... (idle for 15mn)
SHODAN_UI has quit IRC (Remote host closed the connection) [08:59]
........ (idle for 35mn)
tfgbd_znc has quit IRC (Ping timeout: 600 seconds) [09:34]
JAACrap, Tilt shut down their API. It's now redirecting to the homepage. :-( [09:48]
***brayden_ has joined #archiveteam-bs
swebb sets mode: +o brayden_
[09:51]
JAASame for all campaign and user pages. Well, I guess the party's over.
Looks like it started happening around 01:36 UTC.
[09:51]
***j08nY has joined #archiveteam-bs
brayden has quit IRC (Read error: Operation timed out)
brayden_ is now known as brayden
[10:02]
BlueMaxim has quit IRC (Read error: Operation timed out) [10:16]
......... (idle for 40mn)
Jonison has quit IRC (Read error: Connection reset by peer)
Jonison has joined #archiveteam-bs
[10:56]
..... (idle for 20mn)
kristian_ has joined #archiveteam-bs [11:19]
........ (idle for 39mn)
kristian_ has quit IRC (Quit: Leaving) [11:58]
.... (idle for 15mn)
SHODAN_UI has joined #archiveteam-bs [12:13]
........................ (idle for 1h57mn)
Jonison has quit IRC (Read error: Connection reset by peer) [14:10]
........ (idle for 39mn)
t2t2seesaw @ "subprocess.py", line 1457, in _execute_child
FileNotFoundError: [Errno 2] No such file or directory: 'youtube-dl'
that should probably be looking from /usr/local/bin/ instead of the current working directory
or ~/.local/bin/ for pip --user installs
[14:49]
JAAShouldn't it just search through the path? [14:55]
.... (idle for 15mn)
***SHODAN_UI has quit IRC (Remote host closed the connection) [15:10]
..... (idle for 22mn)
t2t2yes, the wpull argument for that is --youtube-dl-exe PATH [15:32]
JAAYeah, you can use that to specify the path directly. But the value is passed to subprocess.Popen and defaults to youtube-dl, so without using that option, it would search the PATH environment variable.
Or should, as far as I know.
Haven't tested it.
[15:34]
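To see where a bare name such as `youtube-dl` would resolve, `shutil.which` performs a PATH lookup explicitly. A minimal sketch, not wpull's actual code: `resolve_executable` is a hypothetical helper that tries PATH first and then falls back to extra directories such as the two t2t2 mentions.

```python
import os
import shutil


def resolve_executable(name, extra_dirs=()):
    """Return the full path to `name`, or None if it cannot be found.

    Looks on PATH first, then in each directory of `extra_dirs`
    (for example "/usr/local/bin" or "~/.local/bin").
    """
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(os.path.expanduser(directory), name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

If this returns `None` for `youtube-dl`, the `FileNotFoundError` above is expected, and the resolved path from a fallback directory could be handed to `--youtube-dl-exe`.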
***godane has quit IRC (Read error: Operation timed out) [15:47]
SketchCowhttps://archive.org/search.php?query=pre-production%20cart%20EMI [15:59]
....... (idle for 31mn)
***godane has joined #archiveteam-bs [16:30]
Stiletto has quit IRC (Read error: Operation timed out)
Stilett0 has joined #archiveteam-bs
bitBaron has joined #archiveteam-bs
[16:42]
Famicoman has quit IRC (Ping timeout: 260 seconds)
Famicoman has joined #archiveteam-bs
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
bitBaron has joined #archiveteam-bs
bitBaron has quit IRC (Read error: Connection reset by peer)
[16:59]
..... (idle for 24mn)
SHODAN_UI has joined #archiveteam-bs [17:32]
Famicoman has quit IRC (Ping timeout: 260 seconds) [17:46]
Famicoman has joined #archiveteam-bs
Aranje has joined #archiveteam-bs
[17:53]
bitBaron has joined #archiveteam-bs [18:01]
....... (idle for 30mn)
Asparagirbsmith093: Thank you for saving AO3! Although they seem to be doing okay, solid open source codebase and non-profit funding.
Signed, just another #stucky fangirl :-)
[18:31]
bsmith093Asparagir: np, thanks! I just figured someone should save them, there are millions of stories on those sites. [18:33]
AsparagirYup yup yup.
Someday, if they ever open up their Solr search as an API (it's been on their development roadmap for ages), I have a dream to mine the database and create a "fic recommendation" system. Like, if you like fic A from fandom B, you might also like fic C from fandom D.
They have made really good use of tags, and there's also the tracking of who left kudos for what, so you could build a real recommendation system.
People like you who like THIS would also like THAT.
[18:33]
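The "people who like THIS would also like THAT" idea can be sketched as simple co-occurrence counting over kudos. This is a toy illustration under stated assumptions: the input shape (a mapping from user to the set of fics they left kudos on) and the name `recommend` are hypothetical, and no such API or data export exists as far as this log says.

```python
from collections import Counter


def recommend(kudos_by_user, liked_fic, top_n=3):
    """Rank other fics by how many fans of `liked_fic` also left kudos on them."""
    counts = Counter()
    for fics in kudos_by_user.values():
        if liked_fic in fics:
            counts.update(fic for fic in fics if fic != liked_fic)
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [fic for fic, _ in ranked[:top_n]]
```

Tags could be folded in the same way by counting shared tags instead of shared kudos; a real system would also need to discount fics that are popular with everyone.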
kisspunchAsparagir: goodreads does a surprisingly good job of that, mostly by NOT having the same 3 genres as everyone
They're not open though AFAIK
[18:37]
AsparagirThe more granularity, the better.
Like, maybe I want one saved search cluster profile for "slow burn, long build up, interpersonal angst" stories, but another saved search profile for "cheesy silly plot-what-plot" stories
And you could monetize an app/site like this, like $5/month, with some of the funds earmarked as going as donations to AO3. So everybody wins: people find the stories they didn't even know they wanted, system gets more use, system also gets donations.
I already know Solr but I would have to learn Hadoop, probably.
[18:37]
kisspunchHow do I tag something 'archive team' in the web interface?
Is it actually a tag and not a collection?
[18:41]
bsmith093there's a user on fimfiction who archives all of those stories as well, turns out there's only about 200k
Asparagir: google chinchillax fimfic archive
Asparagir: every fimfic story ever posted combined, can still fit on one dvd.
[18:41]
AsparagirAh, I see -- they release torrents of all their stories: https://www.fimfiction.net/user/116950/Fimfarchive
kisspunch: Do you mean on the actual web interface for the Internet Archive?
[18:43]
kisspunchYes
The uploader interface
I'm awake, I swear. Yes, on archive.org
[18:47]
***xarph has joined #archiveteam-bs
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[18:56]
.... (idle for 17mn)
Stilett0 has quit IRC (Read error: Operation timed out)
Stilett0 has joined #archiveteam-bs
pie_ has joined #archiveteam-bs
[19:16]
pie_anyone archiving pixiv? [19:18]
FroggingI think it was just the chats that were going away, but no [19:19]
kisspunchfwiw folks you've been incredibly welcoming in archiveteam-bs, but i was really put off by visiting #archiveteam in the past, to the point of not wanting to talk to archive team again
would have been better just to redirect me here i think
[19:19]
Asparagirkisspunch: If I upload WARCs to the Internet Archive based on my own crawls, I then e-mail Jason Scott (a.k.a. SketchCow here on IRC) at jscott@archive.org and ask him to "move such-and-such (paste URL too) to ArchiveTeam bucket". He usually does it within a few days. I haven't asked lately, though. I just usually use ArchiveBot to get stuff up, which of course automatically puts stuff into the ArchiveTeam bucket. [19:22]
kisspunchAsparagir: Yeah, I usually am writing my own software to e.g. download all of an entire site and put it in some nice format. It's more along the lines of the public datasets and less generic WARC crawls [19:23]
AsparagirAlso, some parts of ArchiveTeam are less-than-friendly to newcomers. It's almost like a defense mechanism, because a lot of site owners don't like what we're doing, saving their sites.
And that spills over, unfortunately, sometimes, even to people who want to help archive.
[19:24]
bsmith093Asparagir: i happened to tell a bunch of fanfic writers that i was saving all their stories, and you'd think i said i ate puppies. [19:25]
xmcoh gosh, yes, that [19:25]
AsparagirPeople weird. [19:25]
xmcthat's how fic always is [19:25]
bsmith093there are WARC crawls of ffnet, too; i just thought it would be easier for the average person to grab a bunch of stories in an easily readable format. [19:26]
AsparagirNot just fic. Even commercial sites hosting tons of user-generated content that close. [19:26]
kisspunch*shrug* I'm not sure if I'm complaining exactly, just sharing my story. If there was a second group of people interested in archiving I wouldn't have ended up here. And again, archiveteam-bs is notably friendlier
It's a little weird
[19:26]
AsparagirIt is. [19:26]
bsmith093-bs is where off-topic stuff goes for archiveteam [19:26]
AsparagirOr another angle on this -- I wish ArchiveBot had a direct Twitter interface, not just an IRC interface. We would get so many more people willing to help submit URL's to save. And a greater variety of sites to save, corners of the Internet we don't usually crawl. [19:26]
bsmith093they really try to keep the main channel clear of chatter. [19:26]
xmcsame people, but we try to have as little talking in the main channel as we can [19:27]
kisspunchAsparagir: how do I get permission to give ArchiveBot urls on IRC? [19:27]
AsparagirHang out in the channel and ask someone to op you. [19:27]
xmckisspunch: you join #archivebot and ask for voice [19:27]
bsmith093Asparagir: if you know sql, i also provide a metadata db of the stories for easy searching. [19:27]
xmcyeah, that [19:27]
AsparagirOooooh. [19:27]
bsmith093https://archive.org/details/fanfictiondotnet_repack
thats almost 350gb of stories, id 1-10 million
[19:28]
AsparagirNice! thanks. [19:28]
bsmith093i'm grabbing the rest in smaller chunks.
https://archive.org/details/Fanfictiondotnet1011dump 10-11 million
[19:28]
kisspunchI recommend https://github.com/mispy/twitter_ebooks for archiving twitter btw [19:31]
bsmith093https://archive.org/details/Fanfictiondotnet111million the next 100k [19:31]
kisspunchIt's very clean and does incremental update, which is important given the 3200 tweet limit [19:32]
***kristian_ has joined #archiveteam-bs [19:38]
..... (idle for 22mn)
pie_ has quit IRC (Read error: Connection reset by peer)
pie_ has joined #archiveteam-bs
[20:00]
fie has joined #archiveteam-bs [20:10]
kristian_ has quit IRC (Quit: Leaving) [20:19]
Famicoman has quit IRC (Ping timeout: 260 seconds)
Famicoman has joined #archiveteam-bs
[20:32]
JAAAsparagir: Thanks for the information regarding how to upload files to IA. I'll have to do that soon. [20:34]
........ (idle for 38mn)
AsparagirJAA: No problem. I've been meaning to do some new ones soon... [21:12]
***kristian_ has joined #archiveteam-bs [21:12]
AsparagirPossibly helpful: https://gist.github.com/Asparagirl/6202872 and https://gist.github.com/Asparagirl/6206247
I wrote those a while ago, but I think they're still accurate
[21:12]
***SHODAN_UI has quit IRC (Remote host closed the connection) [21:13]
Famicoman has quit IRC (Ping timeout: 260 seconds) [21:22]
Famicoman has joined #archiveteam-bs [21:31]
JAAThanks, that second one looks useful to me.
Just one comment though: no need to `export` variables (= make them environment variables) if they aren't used by the subprocess, here curl, anyway.
[21:37]
***sep332 has joined #archiveteam-bs [21:42]
Famicoman has quit IRC (Ping timeout: 260 seconds)
fie has quit IRC (Ping timeout: 246 seconds)
[21:50]
Famicoman has joined #archiveteam-bs [21:58]
fie has joined #archiveteam-bs [22:06]
bsmith093JAA: there's also a python pip package, if nobody's suggested it yet. internetarchive, command is ia [22:11]
***Famicoman has quit IRC (Ping timeout: 260 seconds) [22:12]
JAAYeah, I've seen that and actually suggested it to other people before without ever having used it. :-D [22:14]
***Famicoman has joined #archiveteam-bs
zino has joined #archiveteam-bs
[22:19]
Stilett0 has quit IRC (Ping timeout: 260 seconds) [22:31]
...... (idle for 28mn)
Famicoman has quit IRC (Ping timeout: 260 seconds)
Famicoman has joined #archiveteam-bs
Stilett0 has joined #archiveteam-bs
kristian_ has quit IRC (Quit: Leaving)
Famicoman has quit IRC (Ping timeout: 260 seconds)
[22:59]
Famicoman has joined #archiveteam-bs [23:12]
...... (idle for 27mn)
BubuAnabe has joined #archiveteam-bs
Famicoman has quit IRC (Ping timeout: 260 seconds)
odemg has quit IRC (Quit: Leaving)
[23:39]
Famicoman has joined #archiveteam-bs [23:49]
