#archiveteam-bs 2019-03-25,Mon


Time Nickname Message
00:01 🔗 JAA icedice: So uh, the scrape of the first 1000 pages for site:spit.mixtape.moe returned 196 URLs... :-|
00:02 🔗 icedice :/
00:02 🔗 Flashfire its a start
00:02 🔗 icedice Yeah
00:02 🔗 Flashfire Could we run it through URLteam?
00:02 🔗 Flashfire Then use the results for a scrape?
00:02 🔗 JAA Flashfire: No, codes are too long. See the math yesterday.
00:02 🔗 Flashfire oh ok
00:03 🔗 JAA Oh you mean scrape Bing through URLTeam? Hell no. :-P
00:03 🔗 JAA It really just shows how much Bing sucks.
00:04 🔗 JAA That said: out of the 55k mixtape.moe URLs Fusl gave me, only 308 are on spit.
00:05 🔗 JAA So it seems like there really are few pastes on there.
00:14 🔗 benjinsmi has quit IRC (Leaving)
00:18 🔗 godane so this guy is posting on my Nightline reviews: https://archive.org/details/@peterfranz_0&tab=reviews
00:21 🔗 godane like one comment said: Take those goddamn motherfucking Nightline episodes from 6/2/2004 to 12/30/2005 and fucking post them onto www.archive.org right this fucking minute.
00:21 🔗 Flashfire Blimey he doesnt seem happy
00:22 🔗 godane first review here was not that bad: https://archive.org/details/ABC-Nightline-2004-06-01
00:23 🔗 godane so like after 5 days he lost it or something
00:26 🔗 icedice JAA: Any plans on crawling https://archive.nyafuu.org/_/search/text/mixtape.moe/ https://rbt.asia/_/search/text/mixtape.moe/ https://archive.4plebs.org/_/search/text/mixtape.moe/ and https://warosu.org/ ?
00:27 🔗 astrid godane: wow ..
00:35 🔗 bithippo has joined #archiveteam-bs
00:38 🔗 icedice Also: https://yuki.la/search.html?boards=all&startdate=2008-02-02&enddate=2019-03-25&sort=descending&postext=mixtape.moe#page=1
00:45 🔗 godane astrid: i just find it weird to go from being nice to dropping f bombs at the person you want to continue uploading stuff like it
00:45 🔗 astrid right??
00:50 🔗 benjins has joined #archiveteam-bs
00:50 🔗 JAA icedice: Scraping the first three now, though there'll probably be a lot of overlap with archived.moe and desuarchive.org. I don't see any search on Warosu.
00:50 🔗 SketchCow I've deleted all his reviews.
00:51 🔗 JAA icedice: archive.4plebs.org has a very strict rate limit on the search, so skipping that.
00:51 🔗 icedice https://thebarchive.com/_/search/text/%22mixtape.moe%22/
00:52 🔗 icedice ^ Found another one that isn't on the wiki
00:53 🔗 icedice thebarchive.com + randomarchive.com + yuki.la are not on the list of 4chan archivers
00:53 🔗 icedice Warosu has a search feature, but it's per board
00:54 🔗 icedice Yeah, I think it will be a lot of the same links
00:54 🔗 JAA rbt.asia's search is really slow.
00:54 🔗 icedice Some might have started archiving a board earlier than another one and some might have had outages, and so on
00:55 🔗 icedice So that's one reason to double check using multiple archivers
00:55 🔗 JAA 541 additional links from archive.nyafuu.org.
00:55 🔗 icedice Not bad
00:59 🔗 JAA This is getting really messy. I won't be adding anything further for now.
01:00 🔗 JAA The test job seems to have completed successfully, by the way.
01:01 🔗 JAA Bing site:my.mixtape.moe returned 3 results from 100 pages....... Sigh.
01:14 🔗 icedice Ok
01:15 🔗 icedice I'll try to get the Mixtape.moe links from those forums manually next week
01:15 🔗 JAA Current URL count: 82951
01:15 🔗 icedice Not sure I can do much else to help
01:18 🔗 JAA 82950*, there was a BOM in a duplicated URL.
01:18 🔗 icedice https://searx.me/?preferences=eJx1VcGO2zgM_Zr6Ysyi3R725MOiRbEFFpiime5VoCXa4VoWXVFOxv36Uhk7VjrpIQ6skO89PopMRJl9EsPBBDybBG3zCbxg5ZhMRGF_wtgw6OsfHPvKQ-hn6LHB8PDtUHm24LGpHAm0Hp2Z_NxTkOY_Gh88DWiOnAZc5M2fHx4nDBlS-YynMFxJ28hnwZjJc9zn0FGghEZsZO-3zL-tRRHz8fGzgpyjBlQ0qhIzRX5eVtUwJ7Y8Th6TqhLoUBCiPTZvq3TEERsWC7HCcKv2gL4zystxhEQcsoynCHZQVd--_quEI6sPevrP09OXwyZA3w8X-FyJhYQ9x8UIerRptwSDcqA0A9kBtALTkcfM0EXEWrhLZ4hYO4qalgEMJf31xJCMEbYEvh7REeghBTDmRA45A0R0jl4HzUEm
01:18 🔗 icedice D3JUoGxQjuyZe4-1Hi81TFMh4uPs1FzTY8AI2Wxr7UM6FSw99VoISCqjNmrt36UUF5ncjhoWgILezXbIn55LjKshO1Uhc5yFrDGXL_1pgeDw-Z6E_aRHHBKNKKuFLyo2BEd9mwoysL3w_Dt2PhGW0t49F_UIoptQL8cVfOT_lbwUs8Ffk1hvccSJC_N3nFc93b17kb2loA5Ldnon6iAyXztxhDZCfqwe5OtJJ6xJ7onb5LeUWu0PpjVrgSPf9KqLOrEEthCWeFg4sRx5gLBDrXL2VEzLyEEnHsvTl95cy-4ijOCpjbgqGJdRhy4udYoQxOtwuXsFbLbkVNDoHCG3tX4_Q7i5JvttrF-KudvZ26u8vf1S9FX_2qXNhF_C9n7_CDCWWuRMIlb3X6nBQoi1pDjbNMcsTyxhsHhByA217LDOj9WsAmWHfvf-_V_PN3OcPL
01:18 🔗 icedice RbRoKYprw-S2NcXzu8bN-8BO_O6qXC6wLddls1Yjqya748Hp6qdRPq_DQrQHVZuw-SFv2z8NxT1n76CXM1cVc=&q="mixtape.moe"+site:steamcommunity.com
01:18 🔗 JAA What a URL.
01:21 🔗 icedice Yeah
01:21 🔗 icedice https://tiny.cc/searxforarchivation
01:22 🔗 icedice ^ I selected every supported meta search engine into the configuration and deselected all other sites (like image search and so on, dictionary search, and so on)
01:23 🔗 icedice Is that usable for crawling?
01:25 🔗 JAA I've tried to scrape searx.me before, and that didn't work well. Rate limits and other issues I don't remember right now.
01:25 🔗 JAA It'd be better to run our own Searx instance instead.
01:28 🔗 icedice Yeah
01:28 🔗 icedice There is a list of searx instances here btw: https://github.com/asciimoo/searx/wiki/Searx-instances
01:33 🔗 icedice Is the site design and formatting of searx scrapeable though if we got searx instance with decent limits?
01:33 🔗 JAA Don't even need to scrape since it has JSON output as well.
01:34 🔗 icedice Ah, nice
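[Editor's note: a minimal sketch of pulling results from a searx instance's JSON output, as discussed above. `SEARX_URL` is a hypothetical self-hosted instance; many public instances disable `format=json`. Only the Python standard library is used.]

```python
# Fetch one page of searx results via the JSON API instead of scraping HTML.
# SEARX_URL is a placeholder for a self-hosted instance (assumption).
import json
import urllib.parse
import urllib.request

SEARX_URL = "http://localhost:8888/search"  # hypothetical self-hosted instance

def build_search_url(base, query, page=1):
    """Build the query URL for one page of searx results in JSON format."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "pageno": page})
    return f"{base}?{params}"

def searx_results(query, page=1):
    """Fetch one page of results and return the list of result URLs."""
    with urllib.request.urlopen(build_search_url(SEARX_URL, query, page)) as resp:
        data = json.load(resp)
    return [r["url"] for r in data.get("results", [])]

# Example: searx_results('"mixtape.moe" site:steamcommunity.com')
```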
01:48 🔗 bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
01:49 🔗 BlueMax has quit IRC (Quit: Leaving)
01:54 🔗 hook54321 VoynichCr, JAA: How should we add sites to the list that have all of their subdomains excluded from the Wayback Machine?
02:04 🔗 Flashfire To which list hook54321
02:11 🔗 hook54321 https://www.archiveteam.org/index.php?title=List_of_websites_excluded_from_the_Wayback_Machine
02:13 🔗 Flashfire Thats actually not a bot updated one
02:14 🔗 balrog has quit IRC (Quit: Bye)
02:14 🔗 hook54321 i know
02:14 🔗 Flashfire I created that one initially hook54321 I just add each of the subdomains individually myself but its personal preference
02:16 🔗 Flashfire Who is Amerepheasant
02:16 🔗 balrog has joined #archiveteam-bs
02:39 🔗 ndiddy has quit IRC (Ping timeout: 268 seconds)
02:46 🔗 BlueMax has joined #archiveteam-bs
02:48 🔗 wyatt8740 has joined #archiveteam-bs
03:02 🔗 Panasonic has joined #archiveteam-bs
03:28 🔗 godane dashcloud: so ffmpeg got the audio issue again
03:29 🔗 godane SketchCow: i may need to reboot so don't touch whats in the Godane VHS Capture folder
03:30 🔗 godane anyways there are 2 theories
03:30 🔗 godane one is don't upload when capturing to same hd drive
03:31 🔗 godane 2nd theory is that the zram swap is too full
03:32 🔗 godane has quit IRC (Quit: Leaving.)
03:40 🔗 godane has joined #archiveteam-bs
03:43 🔗 Lord_Nigh ftp://89.179.20.136/oldftp/TOOLS/SOFTWARE/Lang/ASEMBLER/ i suspect there's some more stuff from the lost motorola/freescale bbs lurking in there
03:43 🔗 Lord_Nigh its sad that no complete mirror of that ftp site was ever made before it went defunct, or even a directory listing that i'm aware of
03:44 🔗 Lord_Nigh given the bits of it spread all over old ftp sites on the net i'm guessing more than half of it could be reconstructed
03:44 🔗 Flashfire Lord_Nigh many of them are on the FTP list. I have kind of abandoned it with no work going into it on the coding end and me having other duties to fulfill
03:45 🔗 Flashfire But there is a fairly large list on the FTP/List now
03:45 🔗 Lord_Nigh so the ftp project is officially dead, or just on hiatus?
03:45 🔗 Flashfire https://www.archiveteam.org/index.php?title=FTP/List
03:45 🔗 Flashfire Hiatus I hope
03:45 🔗 Lord_Nigh when i find ftp sites which have a working http gateway sometimes i feed them to archivebot
03:45 🔗 Lord_Nigh this one does not
03:46 🔗 Flashfire I have done a lot of work adding to the FTP/List but as far as I know the code is outdated
03:46 🔗 Flashfire Archivebot supports FTP and makes them into WARCS but I dont believe FTP is incorporated into WAYBACK
03:46 🔗 Flashfire I dont have the coding knowledge or the time to gain that coding knowledge to check if the code for grabbing FTP is currently up to date
03:47 🔗 Flashfire #effteepee is our channel for that project if you want to join me there for the discussion but its pretty dead looking
03:48 🔗 Flashfire Lord_Nigh join that channel to discuss further
04:28 🔗 qw3rty112 has joined #archiveteam-bs
04:34 🔗 qw3rty111 has quit IRC (Read error: Operation timed out)
04:35 🔗 Mateon1 has quit IRC (Ping timeout: 255 seconds)
04:36 🔗 Mateon1 has joined #archiveteam-bs
04:42 🔗 odemgi has joined #archiveteam-bs
04:44 🔗 odemgi_ has quit IRC (Ping timeout: 252 seconds)
04:50 🔗 odemg has quit IRC (Ping timeout: 615 seconds)
04:57 🔗 odemg has joined #archiveteam-bs
05:05 🔗 Despatche has quit IRC (Read error: Operation timed out)
05:48 🔗 dhyan_nat has joined #archiveteam-bs
05:48 🔗 Despatche has joined #archiveteam-bs
06:23 🔗 marked Who's yipdw ? questions about an tracker branch by them
06:43 🔗 wp494 has quit IRC (Ping timeout: 506 seconds)
06:43 🔗 wp494 has joined #archiveteam-bs
06:45 🔗 killsushi has quit IRC (Quit: Leaving)
06:58 🔗 Exairnous has quit IRC (Read error: Operation timed out)
06:58 🔗 Exairnous has joined #archiveteam-bs
07:14 🔗 hook54321 marked: they're not active right now. someone else might be able to answer them though.
08:25 🔗 marked K, thanks. There's a branch called "Teach tracker how to value some upload locations more enhancement" any knowledge why this code was abandoned? https://github.com/ArchiveTeam/universal-tracker/pull/20
08:47 🔗 dhyan_nat has quit IRC (Read error: Connection reset by peer)
08:47 🔗 dhyan_nat has joined #archiveteam-bs
09:41 🔗 MrRadar_ has joined #archiveteam-bs
09:44 🔗 MrRadar has quit IRC (Read error: Operation timed out)
10:20 🔗 Despatche has quit IRC (Quit: Read error: Connection reset by deer)
10:41 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
10:57 🔗 godane dashcloud: the audio sync issue is back
10:58 🔗 godane a part me think it was my fail cause i was messing with firefox debug mode during capturing
11:21 🔗 Panasonic has quit IRC (Remote host closed the connection)
12:33 🔗 JAA hook54321: I think for readability it might be a good idea to list all subdomains on one line. Start with the whole domain, then list everything else afterwards. My bot preserves the lines as they are and sorts them by the first URL it finds.
12:33 🔗 JAA Otherwise the subdomains are scattered throughout the whole list and it gets messy.
12:34 🔗 JAA Depends a bit on the context though of course. If it's a free web hoster type thing with usernames as subdomains, it might make more sense to have individual entries.
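[Editor's note: a sketch of the sorting behaviour JAA describes: keep each wiki list line intact and order lines by the first URL found on them. The regex and key function are illustrative, not the bot's actual code.]

```python
# Sort exclusion-list lines by the host of the first URL on each line,
# so all entries for a domain end up adjacent (illustrative sketch).
import re

URL_RE = re.compile(r"https?://([^/\s]+)")

def first_url_key(line):
    """Sort key: host of the first URL on the line, else the line itself."""
    m = URL_RE.search(line)
    return m.group(1).lower() if m else line.lower()

def sort_exclusion_list(lines):
    """Return the list lines sorted by their first URL."""
    return sorted(lines, key=first_url_key)
```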
13:22 🔗 Jopik hello anyone knows by chance who wrote this static json index search thingy? https://archive.org/download/webshots-freeze-frame-index/index.html is there any code that generates the index files? I want to use this for the blogspot google+ comment database to find the exact file for a domain
13:23 🔗 JAA Would be very useful for Tumblr as well since there have been various people asking how to find a particular blog in the archives.
13:24 🔗 Jopik I tried to look for any source or docs on github and the wider web and couldn't find anything, so it seems it's original work done by someone on archiveteam.
13:40 🔗 Dj-Wawa has joined #archiveteam-bs
14:48 🔗 MrRadar_ is now known as MrRadar
14:48 🔗 MrRadar2 sets mode: +o MrRadar
14:50 🔗 marked MrRadar : is there somewhere I can read about what you found the rsync error to be?
14:50 🔗 MrRadar Take a look at the pull request and commit messages
14:51 🔗 MrRadar https://github.com/ArchiveTeam/seesaw-kit/pull/113
14:52 🔗 MrRadar Basically I just looked at the exceptions, figured out what was causing them and also what effect they had on the control flow
14:54 🔗 MrRadar tl;dr: Some system calls can randomly fail due to being interrupted by signals, which can get generated by child processes
14:54 🔗 MrRadar That was causing an exception that would cause the item to fail
14:54 🔗 MrRadar Except the wrong task was being notified of the failure
14:54 🔗 MrRadar Resulting in the item not actually failing
14:57 🔗 marked Ok, thanks. I understand that better than the code changes. Man, I wish I realized this was from 2013 instead of just started on google+ before I started down that rabbit hole.
14:58 🔗 MrRadar So now it caches those exceptions and retries the operation, notifies the correct task of failures, and doesn't fail external process tasks for internal Pyton tasks
14:58 🔗 MrRadar *internal python errors
14:58 🔗 marked If there wrong thread was notified, is there a chance an item would get confirmed but the rsync had failed?
14:59 🔗 MrRadar It wasn't obvious what was going on at first but I am surprised nobody else took the time to dig into it
15:00 🔗 MrRadar2 I actually did observe that behavior: the item would "fail" due to this spurious exception but the upload would continue and it would even get confirmed with the tracker, however the upload would then be retried over and over
15:03 🔗 marked could that also explain duplicate confirms to the tracker?
15:03 🔗 MrRadar2 Possibly
15:06 🔗 marked --concurrent 1 would presumably never have a misrouted fail?
15:06 🔗 MrRadar2 It still could (EINTR exceptions can be caused by any signal), though it would be much less likely (since you are only dealing with 1 child process at a time instead of up to 20)
15:10 🔗 marked that matches my experimental data, that I've seen it at level 1 before, but was hard to get it to happen again
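[Editor's note: a sketch of the EINTR retry pattern MrRadar describes above, assuming the fix simply retries interrupted system calls and re-raises everything else. On Python >= 3.5 (PEP 475) the interpreter retries most interrupted calls automatically; the seesaw kit of that era ran on Python 2, where a wrapper like this was needed.]

```python
# Retry a system call that can spuriously fail with EINTR when a signal
# (e.g. from a child process) interrupts it; re-raise any other error.
import errno

def retry_on_eintr(call, *args, **kwargs):
    """Call `call`, retrying as long as it fails with errno EINTR."""
    while True:
        try:
            return call(*args, **kwargs)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise

# Example: retry_on_eintr(os.read, fd, 4096)
```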
15:11 🔗 VerifiedJ has quit IRC (Read error: Connection reset by peer)
15:12 🔗 VerifiedJ has joined #archiveteam-bs
15:30 🔗 ndiddy has joined #archiveteam-bs
15:33 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
15:36 🔗 dhyan_nat has joined #archiveteam-bs
15:45 🔗 wp494 has quit IRC (Ping timeout: 615 seconds)
15:45 🔗 wp494 has joined #archiveteam-bs
16:10 🔗 Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
16:34 🔗 Stilett0- has joined #archiveteam-bs
16:37 🔗 Stiletto has quit IRC (Read error: Operation timed out)
18:01 🔗 gogondwan has quit IRC (Ping timeout: 260 seconds)
18:22 🔗 Asparagir has joined #archiveteam-bs
19:00 🔗 VerifiedJ has quit IRC (Read error: Connection reset by peer)
19:01 🔗 VerifiedJ has joined #archiveteam-bs
19:18 🔗 Exairnous has quit IRC (Ping timeout: 252 seconds)
19:26 🔗 Exairnous has joined #archiveteam-bs
19:50 🔗 Asparagir has quit IRC (Asparagir)
19:54 🔗 VerifiedJ has quit IRC (Read error: Connection reset by peer)
19:54 🔗 VerifiedJ has joined #archiveteam-bs
20:09 🔗 Exairnous has quit IRC (Read error: Connection reset by peer)
20:14 🔗 Exairnous has joined #archiveteam-bs
20:15 🔗 overflowe has quit IRC (Remote host closed the connection)
20:20 🔗 lindalap has quit IRC (Quit: lindalap)
20:21 🔗 Exairnous has quit IRC (Ping timeout: 268 seconds)
20:27 🔗 overflowe has joined #archiveteam-bs
20:41 🔗 hook54321 JAA: k, I'll do that when adding subdomains.
20:41 🔗 Despatche has joined #archiveteam-bs
20:42 🔗 hook54321 Also, we should try to find a way to take a large domain list and just go through them automatically in bulk to get a list of ones that are excluded
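[Editor's note: one hypothetical way to bulk-check a domain list, as hook54321 suggests, using the public Wayback availability API at archive.org/wayback/available. Caveat: an empty answer means "no snapshots", which covers both excluded and never-crawled domains, so hits would still need manual review.]

```python
# Flag domains that return no snapshots from the Wayback availability API.
import json
import urllib.parse
import urllib.request

AVAIL_API = "https://archive.org/wayback/available"

def parse_availability(data):
    """True if the API response reports at least one snapshot."""
    return bool(data.get("archived_snapshots"))

def has_snapshots(domain):
    """Query the availability API for a single domain."""
    url = f"{AVAIL_API}?{urllib.parse.urlencode({'url': domain})}"
    with urllib.request.urlopen(url) as resp:
        return parse_availability(json.load(resp))

# Example: candidates = [d for d in domains if not has_snapshots(d)]
```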
20:49 🔗 alex___ has joined #archiveteam-bs
21:59 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
22:06 🔗 Stilett0- is now known as Stiletto
22:20 🔗 BlueMax has joined #archiveteam-bs
22:24 🔗 schbirid has quit IRC (Remote host closed the connection)
22:50 🔗 PhrackD- has joined #archiveteam-bs
22:50 🔗 PhrackD has quit IRC (Read error: Operation timed out)
22:50 🔗 PhrackD- is now known as PhrackD
23:02 🔗 alex___ has quit IRC (Quit: take care ye all. Have fun!)
